Today, we are building an ML model to predict house rent, which depends on factors such as the number of rooms, the area of the house, and its location. Before we start, let's give it a cool name.

# RentForecast Pro


We will be using the scikit-learn library because it provides predefined functions for building and evaluating ML models.

In [2]:
import sklearn
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("C:\\Users\\Admin\\Desktop\\House_Rent_Dataset.csv") #Reading the dataset and storing it in a dataframe
df.head() #Gives the first five records of the dataframe

Unnamed: 0,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,18-05-2022,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,13-05-2022,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,16-05-2022,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,04-07-2022,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,09-05-2022,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner


In [4]:
df.describe() # Summary statistics (count, mean, std, min, quartiles, max) for each numeric column

Unnamed: 0,BHK,Rent,Size,Bathroom
count,4746.0,4746.0,4746.0,4746.0
mean,2.08386,34993.45,967.490729,1.965866
std,0.832256,78106.41,634.202328,0.884532
min,1.0,1200.0,10.0,1.0
25%,2.0,10000.0,550.0,1.0
50%,2.0,16000.0,850.0,2.0
75%,3.0,33000.0,1200.0,2.0
max,6.0,3500000.0,8000.0,10.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   int64 
 2   Rent               4746 non-null   int64 
 3   Size               4746 non-null   int64 
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64 
 11  Point of Contact   4746 non-null   object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB


Our dataset has two kinds of features (inputs): numeric features such as Size and BHK, and categorical features such as City. Categorical features hold text labels rather than numbers, so they cannot be passed to a scikit-learn model directly, even though they can influence the target variable greatly. For example, City stores no numbers, yet the rent varies significantly depending on the city (metro, A class, B class, and so on).


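To solve this problem we use label encoding, which assigns an integer to each category in a column. scikit-learn provides the LabelEncoder class for this. It numbers the categories in sorted order, which is why Kolkata is encoded as 4 in the output below. Note that these integers imply an ordering that the categories do not really have, and a linear model will treat that ordering as meaningful.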
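As a small illustration of how label encoding behaves, the sketch below encodes a made-up list of city names (the `cities` list is an assumption for illustration, not rows from the dataset). It shows that LabelEncoder numbers the categories in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not taken from House_Rent_Dataset.csv
cities = ["Mumbai", "Delhi", "Kolkata", "Delhi"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)         # {'Delhi': 0, 'Kolkata': 1, 'Mumbai': 2}
print(codes.tolist())  # [2, 0, 1, 0]
```

Each call to fit_transform refits the encoder, so reusing one LabelEncoder object across several columns keeps only the mapping of the last column. Keeping one encoder per column would preserve every mapping for later decoding.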

In [6]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
print(label_encoder)
df["City_Encoded"] = label_encoder.fit_transform(df["City"])#Encode to numerical format
df["Floor_Encoded"] = label_encoder.fit_transform(df["Floor"])
df["Area_Type_Encoded"] = label_encoder.fit_transform(df["Area Type"])
df["Area_Locality_Encoded"] = label_encoder.fit_transform(df["Area Locality"])
df["Furnishing_Status_Encoded"] = label_encoder.fit_transform(df["Furnishing Status"])
df["Tenant_Preferred_Encoded"] = label_encoder.fit_transform(df["Tenant Preferred"])
df["Point_of_Contact_Encoded"] = label_encoder.fit_transform(df["Point of Contact"])
print(df["City_Encoded"])
print(df["Floor_Encoded"])
print(df["Area_Type_Encoded"])
print(df["Furnishing_Status_Encoded"])
print(df["Tenant_Preferred_Encoded"])
print(df["Point_of_Contact_Encoded"])

LabelEncoder()
0       4
1       4
2       4
3       4
4       4
       ..
4741    3
4742    3
4743    3
4744    3
4745    3
Name: City_Encoded, Length: 4746, dtype: int32
0       455
1        14
2        14
3        10
4        10
       ... 
4741    271
4742     16
4743    271
4744    226
4745    313
Name: Floor_Encoded, Length: 4746, dtype: int32
0       2
1       2
2       2
3       2
4       1
       ..
4741    1
4742    2
4743    1
4744    1
4745    1
Name: Area_Type_Encoded, Length: 4746, dtype: int32
0       2
1       1
2       1
3       2
4       2
       ..
4741    1
4742    1
4743    1
4744    1
4745    2
Name: Furnishing_Status_Encoded, Length: 4746, dtype: int32
0       1
1       1
2       1
3       1
4       0
       ..
4741    1
4742    1
4743    1
4744    2
4745    0
Name: Tenant_Preferred_Encoded, Length: 4746, dtype: int32
0       2
1       2
2       2
3       2
4       2
       ..
4741    2
4742    2
4743    0
4744    0
4745    2
Name: Point_of_Contact_Encoded, Lengt

In [7]:
X = df[["BHK","Size","Floor_Encoded","Area_Type_Encoded","Area_Locality_Encoded","City_Encoded","Furnishing_Status_Encoded","Tenant_Preferred_Encoded","Bathroom","Point_of_Contact_Encoded"]]#Features
Y = df["Rent"]#Target
print(X)
print(Y)

      BHK  Size  Floor_Encoded  Area_Type_Encoded  Area_Locality_Encoded  \
0       2  1100            455                  2                    221   
1       2   800             14                  2                   1527   
2       2  1000             14                  2                   1760   
3       2   800             10                  2                    526   
4       2   850             10                  1                   1890   
...   ...   ...            ...                ...                    ...   
4741    2  1000            271                  1                    219   
4742    3  2000             16                  2                   1214   
4743    3  1750            271                  1                    724   
4744    3  1500            226                  1                    590   
4745    2  1000            313                  1                   1915   

      City_Encoded  Furnishing_Status_Encoded  Tenant_Preferred_Encoded  \
0           

We split our data into two parts. The training set contains the rows on which the ML model is fitted. The test set is held back and used to measure how accurate the trained model's predictions are on data it has not seen.

In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=0)
# test_size=0.2 puts 20% of the rows in the test set; random_state seeds the shuffle so the split is reproducible.
print(x_train)

      BHK  Size  Floor_Encoded  Area_Type_Encoded  Area_Locality_Encoded  \
4681    2   700            275                  2                   2008   
630     2   650             28                  1                   1408   
1742    3  1200            196                  2                   2041   
3077    1   560            198                  2                   1003   
2996    2   600             14                  2                   1209   
...   ...   ...            ...                ...                    ...   
1033    1   280             53                  1                    360   
3264    3  1650              1                  1                   1152   
1653    2  1100            200                  2                   1680   
2607    2  1800            455                  2                    961   
2732    2  1000            270                  2                    443   

      City_Encoded  Furnishing_Status_Encoded  Tenant_Preferred_Encoded  \
4681        

In [9]:
print(x_test)

      BHK  Size  Floor_Encoded  Area_Type_Encoded  Area_Locality_Encoded  \
1332    1   410            357                  1                    454   
4345    2  1300            271                  1                   1028   
4495    2   900            198                  2                    222   
2473    1   400            458                  1                    395   
3883    3  1798            383                  2                   1028   
...   ...   ...            ...                ...                    ...   
1692    3  1200            312                  1                    323   
3707    2  1424            192                  2                   2201   
2944    2  1000             16                  1                    138   
1722    2   700             14                  2                   1417   
4290    2  1050            313                  1                    706   

      City_Encoded  Furnishing_Status_Encoded  Tenant_Preferred_Encoded  \
1332        

In [10]:
print(y_train)

4681    20000
630     38000
1742    16000
3077     7500
2996    12000
        ...  
1033    20000
3264    20000
1653    22000
2607    30000
2732    20000
Name: Rent, Length: 3796, dtype: int64


In [11]:
print(y_test)

1332    50000
4345    15000
4495    12000
2473    13000
3883    48000
        ...  
1692    22000
3707    18000
2944    14000
1722    10000
4290    10000
Name: Rent, Length: 950, dtype: int64
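The lengths printed above (3796 training rows and 950 test rows) follow from the 4746 rows in the dataset and test_size=0.2. A minimal sketch that reproduces those sizes on a stand-in array of row numbers (`row_ids` is a hypothetical placeholder for the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 4746 rows of the dataset (hypothetical row numbers only)
row_ids = np.arange(4746)

train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=0)

# The test size is rounded up: ceil(0.2 * 4746) = 950; the rest go to training
print(len(train_ids), len(test_ids))  # 3796 950

# No row appears in both sets
overlap = set(train_ids.tolist()) & set(test_ids.tolist())
print(len(overlap))  # 0
```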


We are going to use Linear Regression to train our ML model. Linear Regression fits a hyperplane to the data.

A hyperplane is an (n-1)-dimensional flat surface in an n-dimensional space. For example, a line in a 2D plane, or a 2D plane in a 3D space.

Since we have 10 features plus the target, our model is a 10D hyperplane in an 11D space. We cannot visualise a 10D hyperplane, but we can still write its equation.

In [12]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
print(regressor)
regressor.fit(x_train,y_train) # Finds the coefficients and intercept that minimise the squared error on the training data

LinearRegression()


Our equation for the ML model will be w1x1 + w2x2 + ... + w10x10 + b = y. Here x1 to x10 are the feature values of a record (number of rooms, area, etc.), w1 to w10 are the coefficients, b is the intercept of the hyperplane, and y is the target variable. The values of w and b are calculated by the fit function, while x is supplied when making a prediction.

In [13]:
print(regressor.intercept_)#value of b
print(regressor.coef_)#We will get 10 coefficients for 10 variables

-6698.564143894131
[ 5.64485349e+02  2.23138637e+01 -1.09015788e+00 -3.46530697e+03
 -1.69157528e+00  6.10799939e+03 -6.09756504e+03  7.68415269e+02
  1.70143275e+04 -1.20951520e+04]
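To connect the intercept and coefficients to the equation, the sketch below fits LinearRegression on synthetic data and checks that predict() returns the same numbers as computing w1x1 + ... + wnxn + b by hand. The arrays `X_toy`, `y_toy` and the weights `true_w` are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 2*x1 - 1*x2 + 0.5*x3 + 4, with no noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_w + 4.0

model = LinearRegression().fit(X_toy, y_toy)

# Apply the equation manually: one dot product per row, plus the intercept
manual = X_toy @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_toy)))  # True
```

On noiseless data like this the fitted coef_ and intercept_ recover the weights used to generate y; on real data they are only the best linear fit.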


In [14]:
y_test_predict = regressor.predict(x_test)
df = pd.DataFrame({"Actual Rent according to Test Data ":y_test,"Predicted Rent according to Test Data":y_test_predict})
print(df)

      Actual Rent according to Test Data   \
1332                                50000   
4345                                15000   
4495                                12000   
2473                                13000   
3883                                48000   
...                                   ...   
1692                                22000   
3707                                18000   
2944                                14000   
1722                                10000   
4290                                10000   

      Predicted Rent according to Test Data  
1332                           40617.311501  
4345                           64962.253618  
4495                           29824.088279  
2473                           21291.456411  
3883                           90065.965934  
...                                     ...  
1692                           15062.934112  
3707                           25959.467541  
2944                           53175.172103  

In [15]:
y_train_predict = regressor.predict(x_train)
df = pd.DataFrame({"Actual Values of training data":y_train,"Predicted Values using trained data":y_train_predict})
print(df)

      Actual Values of training data  Predicted Values using trained data
4681                           20000                         16158.654890
630                            38000                         69162.331070
1742                           16000                         15683.939441
3077                            7500                        -14976.122346
2996                           12000                         15542.499675
...                              ...                                  ...
1033                           20000                         38206.925291
3264                           20000                         53260.787902
1653                           22000                         13494.365766
2607                           30000                         65690.210333
2732                           20000                          9249.233527

[3796 rows x 2 columns]


Mean Absolute Error (MAE) is the sum of the absolute differences between the actual and predicted values, divided by the number of observations (the number of rows being evaluated).

If each difference is squared before the summation and division, the result is the Mean Squared Error (MSE).

Taking the square root of the Mean Squared Error gives the Root Mean Squared Error (RMSE), which is in the same units as the rent.
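The three definitions can be worked through by hand on a few numbers. The rents below are made up for illustration (they are not from the dataset), and the manual results are compared with the scikit-learn functions used in the next cell.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted rents for four houses
y_true = np.array([10000, 20000, 15000, 30000])
y_pred = np.array([12000, 18000, 15000, 40000])

errors = y_true - y_pred           # [-2000, 2000, 0, -10000]

mae = np.mean(np.abs(errors))      # (2000 + 2000 + 0 + 10000) / 4 = 3500
mse = np.mean(errors ** 2)         # (4e6 + 4e6 + 0 + 1e8) / 4 = 27,000,000
rmse = np.sqrt(mse)                # about 5196.15

print(mae, mse, rmse)
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred)))  # True
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_pred)))   # True
```

Note how the single 10000 error contributes most of the MSE. This is why RMSE reacts much more strongly than MAE to a few very bad predictions.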

In [16]:
from sklearn import metrics
print("Calculating error(Training Dataset)-: \n")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, y_train_predict))  
print('Mean Squared Error:', metrics.mean_squared_error(y_train, y_train_predict))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, y_train_predict)))

Calculating error(Training Dataset)-: 

Mean Absolute Error: 21913.77220895963
Mean Squared Error: 1689026384.8579981
Root Mean Squared Error: 41097.7661784433


In [17]:
print("Calculating error(Test Dataset)-: \n")
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_test_predict))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_test_predict))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_test_predict)))

Calculating error(Test Dataset)-: 

Mean Absolute Error: 28243.16373760451
Mean Squared Error: 15405710834.061102
Root Mean Squared Error: 124119.74393327236


To produce more accurate results, ensemble methods are used. Ensemble methods create multiple models and combine their predictions to improve on any single model.

One such ensemble method is Random Forest Regression. A forest of decision trees is created, each trained on a random sample of the training data, and each tree makes its own prediction. The final prediction is the average of the trees' predictions.
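The averaging step can be checked directly. The sketch below fits a small forest on synthetic data from make_regression (the data and the settings n_estimators=25, max_depth=5 are assumptions for illustration) and compares the forest's prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical, not the rent dataset)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=0)
forest.fit(X_toy, y_toy)

# One row of predictions per fitted tree in the forest
per_tree = np.stack([tree.predict(X_toy) for tree in forest.estimators_])

# The forest's output matches the plain mean over its trees
print(np.allclose(per_tree.mean(axis=0), forest.predict(X_toy)))  # True
```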

In [18]:
from sklearn.ensemble import RandomForestRegressor

In [19]:
rf=RandomForestRegressor(n_estimators=100, max_depth=20)
rf.fit(x_train,y_train)

In [20]:
y_test_predict=rf.predict(x_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_predict})
print(df)

      Actual     Predicted
1332   50000  26309.110000
4345   15000  32894.990000
4495   12000  13200.559698
2473   13000  12083.000000
3883   48000  40985.000000
...      ...           ...
1692   22000  20855.418628
3707   18000  21028.250000
2944   14000  24346.000000
1722   10000  11952.335761
4290   10000  15005.548277

[950 rows x 2 columns]


In [21]:
y_train_predict = rf.predict(x_train)
df = pd.DataFrame({"Actual Values of Training Data ":y_train,"Predicted Values of using Trained Data":y_train_predict})
print(df)

      Actual Values of Training Data   Predicted Values of using Trained Data
4681                            20000                            15731.777067
630                             38000                            38330.000000
1742                            16000                            19477.500000
3077                             7500                             7590.084513
2996                            12000                            12100.480454
...                               ...                                     ...
1033                            20000                            20385.000000
3264                            20000                            23535.000000
1653                            22000                            19712.184606
2607                            30000                            32990.000000
2732                            20000                            18621.333333

[3796 rows x 2 columns]


In [22]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_train, y_train_predict))  
print('Mean Squared Error:', metrics.mean_squared_error(y_train, y_train_predict))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_train, y_train_predict)))

Mean Absolute Error: 4369.659109727479
Mean Squared Error: 143390010.5896704
Root Mean Squared Error: 11974.556801388117


In [23]:
from sklearn import metrics
print("Mean Absolute Error: ",metrics.mean_absolute_error(y_test,y_test_predict))
# MSE and RMSE need mean_squared_error; calling mean_absolute_error for all three lines prints the MAE three times.
print("Mean Squared Error: ",metrics.mean_squared_error(y_test,y_test_predict))
print("Root Mean Squared Error: ",np.sqrt(metrics.mean_squared_error(y_test,y_test_predict)))

Mean Absolute Error:  15314.85557318947


Another ensemble method is Gradient Boosting. Unlike Random Forest Regression, which trains its decision trees independently and averages their predictions, Gradient Boosting builds trees sequentially, with each new tree fitted to reduce the errors left by the trees before it.
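The sequential behaviour can be observed with staged_predict, which yields the ensemble's prediction after each added tree. A minimal sketch on synthetic data (the data and n_estimators=50 are assumptions for illustration) shows the training error falling as trees are added.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (hypothetical, not the rent dataset)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

booster = GradientBoostingRegressor(n_estimators=50, random_state=0)
booster.fit(X_toy, y_toy)

# Training MSE of the ensemble after 1 tree, 2 trees, ..., 50 trees
stage_mse = [mean_squared_error(y_toy, pred) for pred in booster.staged_predict(X_toy)]

print(len(stage_mse))                # 50
print(stage_mse[-1] < stage_mse[0])  # True: later stages fit the training data better
```

A falling training error does not guarantee a falling test error, so the same curve computed on held-out data is the more useful check.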

In [24]:
from sklearn import ensemble

In [25]:
reg = ensemble.GradientBoostingRegressor()
reg.fit(x_train,y_train)

In [26]:
y_test_predict = reg.predict(x_test)
df = pd.DataFrame({"Actual":y_test,"Predicted":y_test_predict})
print(df)

      Actual     Predicted
1332   50000  34850.437452
4345   15000  25580.108567
4495   12000  12065.814446
2473   13000  13436.005672
3883   48000  48556.071268
...      ...           ...
1692   22000  21363.514423
3707   18000  14979.689548
2944   14000  27985.432975
1722   10000  13688.202691
4290   10000  14216.953651

[950 rows x 2 columns]


In [27]:
from sklearn import metrics
print("Mean Absolute Error: ",metrics.mean_absolute_error(y_test,y_test_predict))
print("Mean Squared Error: ",metrics.mean_squared_error(y_test,y_test_predict))
print("Root Mean Squared Error: ",np.sqrt(metrics.mean_squared_error(y_test,y_test_predict)))

Mean Absolute Error:  16406.476508160107
Mean Squared Error:  13803735384.079302
Root Mean Squared Error:  117489.29901943964


In [28]:
from sklearn import metrics
# Recompute the training predictions with the gradient boosting model;
# otherwise y_train_predict still holds the random forest predictions from In [21].
y_train_predict = reg.predict(x_train)
print("Mean Absolute Error: ",metrics.mean_absolute_error(y_train,y_train_predict))
print("Mean Squared Error: ",metrics.mean_squared_error(y_train,y_train_predict))
print("Root Mean Squared Error: ",np.sqrt(metrics.mean_squared_error(y_train,y_train_predict)))


And that was RentForecast Pro! Here are some general takeaways from building the model:

1. The scikit-learn library aids the development of ML models with predefined functions such as train_test_split(), fit(), and predict().

2. Categorical features cannot be used directly in the development of a model. However, such columns can be encoded (assigned numerical values) using the LabelEncoder class of scikit-learn.

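3. On the test set, the ensemble methods reduced the Mean Absolute Error by more than 40%: about 28,243 for Linear Regression versus about 15,315 for Random Forest Regression and about 16,406 for Gradient Boosting. The Root Mean Squared Error improved far less for Gradient Boosting (about 117,489 versus about 124,120), which suggests that a few very large errors dominate it.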