# Data Science Championship - South Zone

> "Predicting the house rent of the given house details"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [predict, machine, hack, the, house, rent, price, data, science, championship, south, zone, learning]
- hide: false

In [1]:
# Installing the modules

!pip install wurlitzer
!pip3 install category_encoders
!pip install tensorflow_decision_forests

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wurlitzer
  Downloading wurlitzer-3.0.2-py3-none-any.whl (7.3 kB)
Installing collected packages: wurlitzer
Successfully installed wurlitzer-3.0.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.4.1-py2.py3-none-any.whl (80 kB)
     |████████████████████████████████| 80 kB 5.7 MB/s 
Installing collected packages: category-encoders
Successfully installed category-encoders-2.4.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow_decision_forests
  Downloading tensorflow_decision_forests-0.2.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.8 MB)
     |████████████████████████████████| 15.8 MB 10.9 MB/s 
Collecting tensorflow~=2.9.1
  Downloading tensorflow-2.9.1-cp37-cp37m-manylin

In [2]:
# Required modules

import shutil
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import category_encoders as ce
import tensorflow_decision_forests as tfdf

from google.colab import drive
from matplotlib import pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split



In [3]:
# Config

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 12)
pd.set_option('display.max_columns', None)

In [4]:
# Mounting the drive

drive.mount('./mydrive')

Mounted at ./mydrive


In [5]:
# Moving files to workspace

shutil.copy('/content/mydrive/MyDrive/Machine Hack/Data Science Students Championship/train.csv', './train.csv')
shutil.copy('/content/mydrive/MyDrive/Machine Hack/Data Science Students Championship/test.csv', './test.csv')
shutil.copy('/content/mydrive/MyDrive/Machine Hack/Data Science Students Championship/submission.csv', './submission.csv')

'./submission.csv'

In [6]:
# Load the data

train = pd.read_csv('train.csv')
train.head()

Unnamed: 0,Property_ID,room,layout_type,property_type,locality,price,area,furnish_type,bathroom,city,parking_spaces,floor,pet_friendly,power_backup,washing_machine,air_conditioner,geyser/solar,security_deposit,CCTV/security,lift,neighbourhood
0,42208,3,BHK,Independent House,Palavakkam,33624,1312,Furnished,2,Chennai,1,1,1,0,0,1,0,302616,0,0,300
1,90879,1,BHK,Apartment,Manikonda,9655,1474,Unfurnished,2,Hyderabad,0,17,0,1,0,0,1,19310,0,1,1600
2,99943,3,BHK,Apartment,Jodhpur Park,23699,1837,Semi-Furnished,2,Kolkata,0,10,1,1,1,1,0,118495,0,1,3100
3,113926,1,BHK,Apartment,Chembur,6306,606,Unfurnished,1,Mumbai,0,18,0,0,0,0,0,37836,0,1,300
4,185438,1,BHK,Studio Apartment,Kalewadi Pandhapur Road,12008,498,Semi-Furnished,3,Pune,0,14,0,0,1,1,0,72048,0,1,0


In [7]:
# Inspect the data

train.info()
train.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134683 entries, 0 to 134682
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Property_ID       134683 non-null  int64 
 1   room              134683 non-null  int64 
 2   layout_type       134683 non-null  object
 3   property_type     134683 non-null  object
 4   locality          134683 non-null  object
 5   price             134683 non-null  int64 
 6   area              134683 non-null  int64 
 7   furnish_type      134683 non-null  object
 8   bathroom          134683 non-null  int64 
 9   city              134683 non-null  object
 10  parking_spaces    134683 non-null  int64 
 11  floor             134683 non-null  int64 
 12  pet_friendly      134683 non-null  int64 
 13  power_backup      134683 non-null  int64 
 14  washing_machine   134683 non-null  int64 
 15  air_conditioner   134683 non-null  int64 
 16  geyser/solar      134683 non-null  int

Unnamed: 0,Property_ID,room,price,area,bathroom,parking_spaces,floor,pet_friendly,power_backup,washing_machine,air_conditioner,geyser/solar,security_deposit,CCTV/security,lift,neighbourhood
count,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0,134683.0
mean,96036.100777,2.029677,36690.033894,1480.38849,2.040488,0.534388,9.163087,0.527602,0.337051,0.472561,0.692626,0.440137,220248.0,0.561838,0.595851,2033.024212
std,55565.228125,0.937308,62620.364025,1412.464718,0.867065,0.498818,5.957549,0.499239,0.472704,0.499248,0.461407,0.496405,420450.3,0.496163,0.490728,1159.635981
min,2.0,1.0,1583.0,81.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3180.0,0.0,0.0,0.0
25%,47940.0,1.0,12035.5,759.0,1.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,55802.5,0.0,0.0,1100.0
50%,95950.0,2.0,20856.0,1114.0,2.0,1.0,9.0,1.0,0.0,0.0,1.0,0.0,114264.0,1.0,1.0,2000.0
75%,144194.5,3.0,36014.0,1580.0,2.0,1.0,15.0,1.0,1.0,1.0,1.0,1.0,220704.5,1.0,1.0,3000.0
max,192405.0,5.0,799325.0,13942.0,5.0,1.0,19.0,1.0,1.0,1.0,1.0,1.0,7940780.0,1.0,1.0,4000.0


In [8]:
# Load the test data

test = pd.read_csv('./test.csv')
test.head()

Unnamed: 0,Property_ID,room,layout_type,property_type,locality,area,furnish_type,bathroom,city,parking_spaces,floor,pet_friendly,power_backup,washing_machine,air_conditioner,geyser/solar,security_deposit,CCTV/security,lift,neighbourhood,price
0,114342,2,BHK,Independent Floor,Palava,1347,Semi-Furnished,1,Mumbai,0,2,0,1,1,1,0,72624,1,0,900,
1,88819,1,BHK,Independent House,Somajiguda,634,Semi-Furnished,3,Hyderabad,1,4,0,0,1,1,0,19656,0,0,2500,
2,85623,1,BHK,Apartment,Toli Chowki,524,Unfurnished,1,Hyderabad,1,3,1,1,0,0,0,7500,0,0,3200,
3,130856,3,BHK,Apartment,Thane West,1837,Unfurnished,5,Mumbai,1,9,1,0,0,0,1,137646,1,1,1200,
4,40089,2,BHK,Apartment,Krishnarajapura,1208,Semi-Furnished,2,Bangalore,1,17,0,1,1,1,0,110898,0,1,1000,


In [9]:
# Checking for the missing values

if train.isna().any().any():
    print(train.isna().any())
else:
    print("No Missing Values")

No Missing Values


## Feature Engineering

In [10]:
# Converting the categorical variables

train['layout_type'] = np.where(train['layout_type'] == 'BHK', 1, 0)
test['layout_type'] = np.where(test['layout_type'] == 'BHK', 1, 0)

for col in train.columns:
    if train[col].dtype == 'object':
        encoder = ce.cat_boost.CatBoostEncoder()

        train[col] = encoder.fit_transform(train[col], train['price'])
        test[col] = encoder.transform(test[col])
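
The loop above replaces each text column with a number derived from `price`. As a rough illustration of the idea behind CatBoost-style encoding, the sketch below encodes every training row using only the targets of earlier rows in the same category, smoothed towards the global mean. It is a simplified stand-in written with the standard library, not the `category_encoders` implementation; the helper names, the smoothing weight `a` and the toy rents are assumptions made for this example.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row from the targets of earlier rows in the same category."""
    prior = sum(targets) / len(targets)  # global target mean, used for smoothing
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # Only rows seen so far contribute, so a row never sees its own target
        encoded.append((sums[cat] + prior * a) / (counts[cat] + a))
        sums[cat] += y
        counts[cat] += 1
    return encoded, dict(sums), dict(counts), prior


def encode_new_rows(categories, sums, counts, prior, a=1.0):
    """Encode unlabeled rows with the full per-category statistics."""
    return [(sums.get(c, 0.0) + prior * a) / (counts.get(c, 0) + a) for c in categories]


# Toy data (made up for illustration)
cities = ["Chennai", "Mumbai", "Chennai", "Mumbai", "Chennai"]
rents = [30000.0, 10000.0, 20000.0, 14000.0, 26000.0]

train_enc, sums, counts, prior = ordered_target_encode(cities, rents)
test_enc = encode_new_rows(["Chennai", "Pune"], sums, counts, prior)
print(train_enc)
print(test_enc)
```

A category that never appears in the training data (here `"Pune"`) falls back to the global mean, which is why the test set can be transformed without labels.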

In [11]:
# Separating out features and labels

X = train.drop(['Property_ID', 'price'], axis=1)
y = train['price']
X_test = test.drop(['Property_ID', 'price'], axis=1)

In [None]:
# Scaling the train and test values

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

scaler_label = MinMaxScaler(feature_range=(0, 1))
scaler_label.fit(y.values.reshape(-1, 1))
y_scaled = scaler_label.transform(y.values.reshape(-1, 1))
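
Because the label is scaled to the range 0 to 1, any error measured during training is in scaled units and has to be mapped back before it can be read as rent. The sketch below reproduces the min-max arithmetic by hand, using the minimum, median and maximum `price` from the `describe()` output above. The helper names are hypothetical, and this is only meant to show the formula, not to replace `MinMaxScaler`.

```python
def fit_min_max(values):
    """Return the minimum and maximum used for scaling."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(values, lo, hi):
    """Invert the scaling back to the original units."""
    return [v * (hi - lo) + lo for v in values]


# min, median and max of `price` taken from train.describe()
rents = [1583.0, 20856.0, 799325.0]

lo, hi = fit_min_max(rents)
scaled = scale(rents, lo, hi)
restored = unscale(scaled, lo, hi)

# An error of 0.01 in scaled units corresponds to this many rent units
error_in_rent = 0.01 * (hi - lo)
print(scaled, restored, error_in_rent)
```

With a rent range this wide, even a small scaled error translates into several thousand rent units, which is worth keeping in mind when reading the training loss.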

In [None]:
# Train and test split

X_train, X_valid, y_train, y_valid = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=88)

## Model Building

### Approach - 1

In [None]:
# Model Definition

input_len = len(train.columns) - 2

model = tf.keras.models.Sequential([
        tf.keras.layers.Input(shape=(input_len,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation='linear'),
])

model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_28 (Dense)            (None, 64)                1280      
                                                                 
 dropout_21 (Dropout)        (None, 64)                0         
                                                                 
 dense_29 (Dense)            (None, 32)                2080      
                                                                 
 dropout_22 (Dropout)        (None, 32)                0         
                                                                 
 dense_30 (Dense)            (None, 8)                 264       
                                                                 
 dropout_23 (Dropout)        (None, 8)                 0         
                                                                 
 dense_31 (Dense)            (None, 1)                
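
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic, assuming 19 input features (the 21 columns of `train` minus `Property_ID` and `price`). `dense_params` is a hypothetical helper written for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_units = [64, 32, 8, 1]  # Dense layer sizes from the model definition
n_inputs = 19                 # assumed number of input features

param_counts = []
for units in layer_units:
    param_counts.append(dense_params(n_inputs, units))
    n_inputs = units  # the next layer takes this layer's output as input

total_params = sum(param_counts)
print(param_counts, total_params)
```

Dropout layers add no parameters, so they are left out of the calculation.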

In [None]:
# Custom loss

def rmse(y_true, y_pred):
    return tf.math.sqrt(tf.keras.losses.mean_squared_error(y_true, y_pred))
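
One thing worth checking about this loss: `tf.keras.losses.mean_squared_error` averages over the last axis only, so for a single-output model it returns one squared error per sample. Taking the square root per sample and then averaging over the batch gives the mean absolute error rather than a batch-level RMSE. The plain-Python sketch below illustrates the difference with made-up numbers.

```python
import math

# Made-up scaled targets and predictions
y_true = [0.10, 0.20, 0.40, 0.80]
y_pred = [0.10, 0.10, 0.40, 0.40]

errors = [p - t for p, t in zip(y_pred, y_true)]

# Square root applied per sample, then averaged (what the loss above reduces to)
per_sample_sqrt = [math.sqrt(e ** 2) for e in errors]
mean_of_sqrt = sum(per_sample_sqrt) / len(errors)

# Mean absolute error, for comparison
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE over the whole batch: average the squares first, then take the root
batch_rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(mean_of_sqrt, mae, batch_rmse)
```

If a batch-level RMSE is what is wanted, one possible variant is `tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))`, which penalises large errors more strongly than the per-sample form.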

In [None]:
# Compiling the model

loss = rmse
optim = tf.keras.optimizers.Adam(learning_rate=0.0001)
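# ModelCheckpoint does not take `custom_objects`; the callback only has an
# effect if it is passed to model.fit via callbacks=[checkpoint_callback]
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint('custom_model_checkpoint.hdf5', save_best_only=True)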

model.compile(optimizer=optim, loss=loss, metrics=[tf.keras.losses.mean_squared_error])

In [None]:
# Fitting the model

epochs = 20
batch_size = 64

model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=epochs, batch_size=batch_size, shuffle=True)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f782ab2f3d0>

In [None]:
# Scoring the partitions

print(f"RMSE of Train is {mean_squared_error(scaler_label.inverse_transform(model.predict(X_train)), scaler_label.inverse_transform(y_train), squared=False)}")
print(f"RMSE of Valid is {mean_squared_error(scaler_label.inverse_transform(model.predict(X_valid)), scaler_label.inverse_transform(y_valid), squared=False)}")

RMSE of Train is 36337.83644900412
RMSE of Valid is 36745.74150641437


### Approach - 2

* Use TensorFlow Decision Forests to predict the house rent.

In [12]:
# Convert the dataset into a TensorFlow dataset.

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train, label="price", task=tfdf.keras.Task.REGRESSION)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test, label="price", task=tfdf.keras.Task.REGRESSION)



In [13]:
# Train a Random Forest model.

model = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
model.fit(train_ds)

# Summary of the model structure.
model.summary()

# Evaluate the model (test.csv has no price labels, so this is not a real score).
model.evaluate(test_ds)

Use /tmp/tmp03c19zqp as temporary training directory
Reading training dataset...
Training dataset read in 0:00:14.151155. Found 134683 examples.
Training model...
Model trained in 0:03:41.615186
Compiling model...
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code


Model compiled.
Model: "random_forest_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "RANDOM_FOREST"
Task: REGRESSION
Label: "__LABEL"

Input Features (20):
	CCTV/security
	Property_ID
	air_conditioner
	area
	bathroom
	city
	floor
	furnish_type
	geyser/solar
	layout_type
	lift
	locality
	neighbourhood
	parking_spaces
	pet_friendly
	power_backup
	property_type
	room
	security_deposit
	washing_machine

No weights

Variable Importance: MEAN_MIN_DEPTH:
    1.          "__LABEL" 13.945312 ################
    2.             "lift" 13.812071 ###############
    3.     "pet_friendly"

0.0
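
The `0.0` printed above should not be read as a score: `price` is empty in `test.csv`, so there are no labels to compare the predictions against. Holding out part of the training rows is one way to get a usable estimate. The sketch below only builds the index split, using the standard library. `holdout_indices` is a hypothetical helper, and the 20% fraction and seed 88 simply mirror the earlier `train_test_split` call.

```python
import random


def holdout_indices(n_rows, valid_fraction=0.2, seed=88):
    """Shuffle row positions and split them into train and validation lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_valid = int(n_rows * valid_fraction)
    return indices[n_valid:], indices[:n_valid]


# 134683 is the number of rows reported by train.info()
train_idx, valid_idx = holdout_indices(134683)
print(len(train_idx), len(valid_idx))
```

With pandas, `train.iloc[train_idx]` and `train.iloc[valid_idx]` could each be passed to `tfdf.keras.pd_dataframe_to_tf_dataset`, and the predictions on the validation part scored with `mean_squared_error` as in Approach 1.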

In [14]:
# Generate the submission file

submission = pd.DataFrame(index=range(X_test.shape[0]))
submission['price'] = model.predict(test_ds).flatten()  # predict returns shape (n, 1)
submission.to_csv('submission.csv', index=False)



With Approach 2, using TensorFlow Decision Forests, I got an RMSE of 27126.