## Importing the libraries:

In [1]:
# Data Analytics and Visualization:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Machine Learning Regressors:
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,mean_squared_error
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Others:
from sklearn import metrics
from math import sqrt

## Loading the dataset:

In [2]:
df = pd.read_csv('Bengaluru_House_Data.csv')
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


## Describing the available data:

In [3]:
df.shape

(13320, 9)

The dataset contains 13320 rows (observations) and 9 features.

In [4]:
df.columns

Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

The dataset contains the features listed above.

In [5]:
df.isnull().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [6]:
for feature in df.columns:
    if df[feature].isnull().sum() > 1:
        print("{} Feature has {}% Missing values ".format(feature,round(df[feature].isnull().mean()*100,2)))

size Feature has 0.12% Missing values 
society Feature has 41.31% Missing values 
bath Feature has 0.55% Missing values 
balcony Feature has 4.57% Missing values 
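
The same percentages can be computed without an explicit loop. The sketch below uses a small made-up frame (`toy` and its values are invented for illustration, not taken from the Bengaluru data). It filters with `> 0`, so a column with a single missing value would also be listed, unlike the `> 1` count check in the loop above.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; not the Bengaluru data
toy = pd.DataFrame({
    'size': ['2 BHK', None, '3 BHK', '1 RK'],
    'bath': [2.0, 1.0, np.nan, np.nan],
    'price': [39.07, 120.0, 62.0, 51.0],
})

# isnull().mean() is the fraction of missing values in each column
missing_pct = (toy.isnull().mean() * 100).round(2)

# Keep only the columns that actually contain missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```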


In [7]:
df.describe()

Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


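The above table summarizes the numeric features in the dataset. Note the maximum of 40 bathrooms and the maximum price of 3600 against a median price of 72, which suggest that the data contains outliers.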

## Data Preprocessing:

##### Dropping unwanted features:

Since the society feature has 41.31% missing values, we drop it.

In [8]:
df = df.drop(['society'], axis=1)
df.head()

Unnamed: 0,area_type,availability,location,size,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,1200,2.0,1.0,51.0


##### Dropping certain rows:

Four features have no missing values and the remaining features have fewer than 5% missing values, so we can drop the rows that contain them and still have a large dataset.

In [9]:
df = df.dropna()
df.shape

(12710, 8)

We still have 12710 rows with 8 features.

### Encoding the ordinal features:

First of all, we will convert 1 RK into 0 Bedrooms, as there are no bedrooms in 1 RK houses.

In [10]:
df.replace(to_replace = '1 RK', value = '0',inplace = True)
df.head()

Unnamed: 0,area_type,availability,location,size,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,1200,2.0,1.0,51.0


We need to convert the size feature into a numeric feature "bhk" that stores the number of bedrooms in the apartment.

In [11]:
df['bhk'] = df['size'].apply(lambda x : int(x.split()[0]))
df.head()

Unnamed: 0,area_type,availability,location,size,total_sqft,bath,balcony,price,bhk
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07,2
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0,4
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,1440,2.0,3.0,62.0,3
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0,3
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,1200,2.0,1.0,51.0,2
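
The two steps above (mapping 1 RK to 0, then taking the leading number) can also be written with vectorized string methods. This is a minimal sketch on a made-up Series; the name `size_demo` is hypothetical.

```python
import pandas as pd

# Made-up sample of 'size' values for illustration
size_demo = pd.Series(['2 BHK', '4 Bedroom', '1 RK', '3 BHK'])

# A 1 RK unit has no bedroom, so map it to '0' first
size_demo = size_demo.replace('1 RK', '0')

# Take the first whitespace-separated token and convert it to an integer
bhk_demo = size_demo.str.split().str[0].astype(int)
print(bhk_demo.tolist())  # [2, 4, 0, 3]
```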


Now that the number of bedrooms is stored in bhk, we can remove the size feature.

In [12]:
df = df.drop(['size'],axis = 1)
df.head()

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,bhk
0,Super built-up Area,19-Dec,Electronic City Phase II,1056,2.0,1.0,39.07,2
1,Plot Area,Ready To Move,Chikka Tirupathi,2600,5.0,3.0,120.0,4
2,Built-up Area,Ready To Move,Uttarahalli,1440,2.0,3.0,62.0,3
3,Super built-up Area,Ready To Move,Lingadheeranahalli,1521,3.0,1.0,95.0,3
4,Super built-up Area,Ready To Move,Kothanur,1200,2.0,1.0,51.0,2


### Changing some variables:

Let's find out how many unique values of 'Location' feature are there.

In [13]:
len(df.location.unique())

1265

There are 1265 unique values, which is far more than we need. We group every location that appears 10 times or fewer into a single 'Others' category.

In [14]:
location_stats = df['location'].value_counts() 
below_10_dp = location_stats[location_stats <= 10]
df['location'] = df['location'].apply(lambda x : 'Others' if x in below_10_dp else x)
df.head()

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,bhk
0,Super built-up Area,19-Dec,Electronic City Phase II,1056,2.0,1.0,39.07,2
1,Plot Area,Ready To Move,Chikka Tirupathi,2600,5.0,3.0,120.0,4
2,Built-up Area,Ready To Move,Uttarahalli,1440,2.0,3.0,62.0,3
3,Super built-up Area,Ready To Move,Lingadheeranahalli,1521,3.0,1.0,95.0,3
4,Super built-up Area,Ready To Move,Kothanur,1200,2.0,1.0,51.0,2


In [15]:
len(df.location.unique())

238

We have reduced the number of unique values in the location feature to 238.
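
The grouping of rare locations can be sketched on a small made-up Series. The labels and the threshold of 1 are chosen only to keep the example short; the notebook itself uses 10.

```python
import pandas as pd

# Made-up location labels for illustration
loc_demo = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])

# Count how often each label occurs
counts = loc_demo.value_counts()

# Labels at or below the threshold are considered rare
rare = counts[counts <= 1].index

# Keep frequent labels, replace rare ones with 'Others'
grouped = loc_demo.where(~loc_demo.isin(rare), 'Others')
print(grouped.tolist())  # ['A', 'A', 'A', 'B', 'B', 'Others']
```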

### Pre-Processing the total_sqft column:

The total_sqft column needs preprocessing because its entries come in varied formats: plain numbers, ranges, and areas given in other units.

In [16]:
import re

# Conversion factors from each unit to square feet
SQFT_PER_UNIT = {'Sq. Meter': 10.7639, 'Sq. Yards': 9.0, 'Perch': 272.25,
                 'Acres': 43560.0, 'Cents': 435.61545, 'Guntha': 1089.0,
                 'Grounds': 2400.0}

def preprocess_total_sqft(my_list):
    # More than one element means the entry was a range split on '-': use its midpoint
    if len(my_list) != 1:
        return (float(my_list[0]) + float(my_list[1]))/2.0

    try:
        # Plain numeric entry
        return float(my_list[0])
    except ValueError:
        # Number followed by a unit name: separate the two parts
        split_list = re.split(r'(\d*.*\d)', my_list[0])[1:]
        area = float(split_list[0])
        type_of_area = split_list[1]
        # An unknown unit raises KeyError instead of failing silently
        return area * SQFT_PER_UNIT[type_of_area]

In [17]:
df['total_sqft'] = df.total_sqft.str.split('-').apply(preprocess_total_sqft)

In [18]:
df.head()

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,bhk
0,Super built-up Area,19-Dec,Electronic City Phase II,1056.0,2.0,1.0,39.07,2
1,Plot Area,Ready To Move,Chikka Tirupathi,2600.0,5.0,3.0,120.0,4
2,Built-up Area,Ready To Move,Uttarahalli,1440.0,2.0,3.0,62.0,3
3,Super built-up Area,Ready To Move,Lingadheeranahalli,1521.0,3.0,1.0,95.0,3
4,Super built-up Area,Ready To Move,Kothanur,1200.0,2.0,1.0,51.0,2
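
To make the three input shapes concrete, here is a self-contained sketch of the same idea that works on the raw string instead of a pre-split list. The function name `to_sqft`, the stricter regular expression and the sample strings are assumptions made for illustration, and only two units are included.

```python
import re

# Subset of the conversion factors, enough for the illustration
SQFT_PER_UNIT_DEMO = {'Sq. Meter': 10.7639, 'Sq. Yards': 9.0}

def to_sqft(text):
    """Convert one raw total_sqft-style string to square feet."""
    text = text.strip()
    if '-' in text:
        # Range: return the midpoint of the two bounds
        low, high = text.split('-')
        return (float(low) + float(high)) / 2.0
    # Number immediately followed by a unit name made of non-digit characters
    match = re.fullmatch(r'(\d+(?:\.\d+)?)\s*(\D+)', text)
    if match:
        return float(match.group(1)) * SQFT_PER_UNIT_DEMO[match.group(2).strip()]
    # Otherwise treat the entry as a plain number
    return float(text)

# Made-up sample strings, one per input shape
print(to_sqft('1200'))          # 1200.0
print(to_sqft('1000 - 1400'))   # 1200.0
print(to_sqft('100Sq. Yards'))  # 900.0
```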


### Converting Categorical data into Numeric data:

##### 1. Area Type:

In [19]:
replace_area_type = {'Super built-up  Area': 0, 'Built-up  Area': 1, 'Plot  Area': 2, 'Carpet  Area': 3}
df['area_type'] = df.area_type.map(replace_area_type)

In [20]:
df.head()

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,bhk
0,0,19-Dec,Electronic City Phase II,1056.0,2.0,1.0,39.07,2
1,2,Ready To Move,Chikka Tirupathi,2600.0,5.0,3.0,120.0,4
2,1,Ready To Move,Uttarahalli,1440.0,2.0,3.0,62.0,3
3,0,Ready To Move,Lingadheeranahalli,1521.0,3.0,1.0,95.0,3
4,0,Ready To Move,Kothanur,1200.0,2.0,1.0,51.0,2


##### 2. Availability:

In [21]:
def replace_availabilty(my_string):
    if my_string == 'Ready To Move':
        return 0
    elif my_string == 'Immediate Possession':
        return 1
    else:
        return 2

In [22]:
df['availability'] = df.availability.apply(replace_availabilty)

In [23]:
df.head()

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,bhk
0,0,2,Electronic City Phase II,1056.0,2.0,1.0,39.07,2
1,2,0,Chikka Tirupathi,2600.0,5.0,3.0,120.0,4
2,1,0,Uttarahalli,1440.0,2.0,3.0,62.0,3
3,0,0,Lingadheeranahalli,1521.0,3.0,1.0,95.0,3
4,0,0,Kothanur,1200.0,2.0,1.0,51.0,2


##### 3. Location:

In [24]:
le = LabelEncoder()
# Series.append was removed in pandas 2.0, and fitting on the column itself is sufficient
le.fit(df['location'])
df['location']=le.transform(df['location'])

In [25]:
df.head()

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,bhk
0,0,2,78,1056.0,2.0,1.0,39.07,2
1,2,0,60,2600.0,5.0,3.0,120.0,4
2,1,0,223,1440.0,2.0,3.0,62.0,3
3,0,0,156,1521.0,3.0,1.0,95.0,3
4,0,0,148,1200.0,2.0,1.0,51.0,2
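
What LabelEncoder does can be seen on a short made-up Series: it sorts the distinct labels and assigns each one an integer code. The names `loc_names` and `le_demo` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few location names for illustration
loc_names = pd.Series(['Kothanur', 'Others', 'Uttarahalli', 'Kothanur'])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(loc_names)

print(list(codes))             # [0, 1, 2, 0]
print(list(le_demo.classes_))  # ['Kothanur', 'Others', 'Uttarahalli']

# inverse_transform maps a code back to its label
print(le_demo.inverse_transform([2])[0])  # Uttarahalli
```

The codes follow alphabetical order only, so their numeric size carries no meaning about the locations; a linear model will nevertheless treat them as ordinary numbers.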


### Splitting the dataset into the Training set and Test set:

In [26]:
X = df.drop('price',axis=1)
y = df['price']

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

## Modeling:

This is a supervised learning problem, specifically regression, because we train each model on houses whose prices are known. We will compare three regressors:
1. Linear Regression
2. Decision Tree Regression
3. Random Forest Regression

### 1. Linear Regression:

In [28]:
lr = LinearRegression()
lr.fit(X_train,y_train)
lpred=lr.predict(X_test)

In [29]:
lpred

array([ 20.02713277,  65.28848284,  74.88041473, ...,  34.35528704,
       198.41689344, 132.88493033])

Above are predictions provided by our Linear Regression Model.

In [30]:
lrrmse=np.sqrt(np.mean((y_test-lpred)**2))
lrrmse

115.59042298006038

Above is the 'Root Mean Square Error' in our Linear Regression model.
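
The RMSE is the square root of the mean of the squared differences between actual and predicted prices. As a small worked check on made-up numbers (the arrays below are invented), the manual formula agrees with the square root of scikit-learn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values for illustration
y_true = np.array([50.0, 100.0, 150.0])
y_hat = np.array([40.0, 110.0, 150.0])

# Errors are 10, -10 and 0, so the mean squared error is 200 / 3
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```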

### 2. Decision Tree Regression:

In [31]:
dt=DecisionTreeRegressor()
dt.fit(X_train,y_train)
dtpred=dt.predict(X_test)

In [32]:
dtpred

array([ 30. ,  70. , 103.2, ...,  18. , 215. , 341. ])

Above are predictions provided by our Decision Tree Regression Model.

In [33]:
dtrmse=np.sqrt(np.mean((y_test-dtpred)**2))
dtrmse

111.09946453435323

Above is the 'Root Mean Square Error' in our Decision Tree Regression model.

### 3. Random Forest Regression:

In [34]:
rf=RandomForestRegressor(n_estimators = 100)
rf.fit(X_train,y_train)
rfpred=rf.predict(X_test)

In [35]:
rfpred

array([ 27.8994    ,  65.97      , 103.23976755, ...,  27.46336905,
       181.13      , 279.015     ])

Above are predictions provided by our Random Forest Regression Model.

In [36]:
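rfrmse=np.sqrt(np.mean((y_test-rfpred)**2))
rfrmse

The cell above computes the 'Root Mean Square Error' of our Random Forest Regression model from rfpred. Because the forest is created without a random_state, the exact value varies slightly from run to run.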

## Hyper Parameter Tuning:

#### 1. K-Fold Cross Validation:

In [39]:
from sklearn.model_selection import cross_val_score
lr_scores = cross_val_score(lr, X_train, y_train, cv = 5)
lr_scores.mean()

0.20772686771765025

In [40]:
from sklearn.model_selection import cross_val_score
dt_scores = cross_val_score(dt, X_train, y_train, cv = 5)
dt_scores.mean()

0.24946999923118937

In [42]:
from sklearn.model_selection import cross_val_score
rf_scores = cross_val_score(rf, X_train, y_train, cv = 5)
rf_scores.mean()

0.5385759867898912

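For regressors, cross_val_score reports the R-squared score by default, so the mean scores above are about 0.21 for Linear Regression, 0.25 for Decision Tree Regression and 0.54 for Random Forest Regression. From these observations, we can see that Random Forest Regressor is the best suited of the three models for our dataset.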
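
If a score in the same units as the RMSE values computed earlier is preferred, cross_val_score accepts a scoring argument. The sketch below runs on synthetic data from make_regression, not on the housing data, and the names `X_demo`, `y_demo` and `model` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to keep the example self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
model = LinearRegression()

# Default scoring for a regressor is R-squared (higher is better)
r2_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# scikit-learn returns the negated RMSE so that higher is still better;
# flip the sign to get ordinary RMSE values
rmse_scores = -cross_val_score(model, X_demo, y_demo, cv=5,
                               scoring='neg_root_mean_squared_error')

print(r2_scores.mean(), rmse_scores.mean())
```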

#### 2. Grid Search CV:

In [49]:
from sklearn.model_selection import GridSearchCV
clf=GridSearchCV(rf,
                {'n_estimators' : [20,50,100,200,500,1000]                  
                },cv=5, return_train_score=False)
clf.fit(X_train,y_train)
clf.cv_results_

{'mean_fit_time': array([ 0.7301909 ,  1.82548323,  3.6659523 ,  7.27010179, 18.28822041,
        36.62032948]),
 'std_fit_time': array([0.01691053, 0.02326036, 0.0390528 , 0.04677703, 0.07384806,
        0.16615379]),
 'mean_score_time': array([0.02016292, 0.05316844, 0.09936857, 0.19486704, 0.48312078,
        0.95641522]),
 'std_score_time': array([0.00260133, 0.00571837, 0.00655317, 0.00457654, 0.00369471,
        0.0091481 ]),
 'param_n_estimators': masked_array(data=[20, 50, 100, 200, 500, 1000],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_estimators': 20},
  {'n_estimators': 50},
  {'n_estimators': 100},
  {'n_estimators': 200},
  {'n_estimators': 500},
  {'n_estimators': 1000}],
 'split0_test_score': array([0.66767766, 0.58814767, 0.61562728, 0.63395393, 0.63786946,
        0.63854872]),
 'split1_test_score': array([0.45880563, 0.48037894, 0.48321915, 0.49091325, 0.47273895,
        0.4880454

In [50]:
newdf = pd.DataFrame(clf.cv_results_)
newdf

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.730191,0.016911,0.020163,0.002601,20,{'n_estimators': 20},0.667678,0.458806,0.573798,0.329503,0.633105,0.532584,0.123883,6
1,1.825483,0.02326,0.053168,0.005718,50,{'n_estimators': 50},0.588148,0.480379,0.59281,0.4132,0.635719,0.54205,0.082313,3
2,3.665952,0.039053,0.099369,0.006553,100,{'n_estimators': 100},0.615627,0.483219,0.593609,0.384965,0.622625,0.540011,0.092386,5
3,7.270102,0.046777,0.194867,0.004577,200,{'n_estimators': 200},0.633954,0.490913,0.596812,0.38187,0.606563,0.542027,0.093676,4
4,18.28822,0.073848,0.483121,0.003695,500,{'n_estimators': 500},0.637869,0.472739,0.598205,0.397117,0.621268,0.545442,0.094191,2
5,36.620329,0.166154,0.956415,0.009148,1000,{'n_estimators': 1000},0.638549,0.488045,0.595039,0.400102,0.613215,0.546993,0.089583,1


In [51]:
clf.best_params_

{'n_estimators': 1000}

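The mean R-squared score is highest with 1000 estimators (0.547), so we will use a Random Forest model with 1000 estimators. Note that the gain over 20 estimators (0.533) is small, while the mean fit time rises from about 0.7 seconds to about 36.6 seconds.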
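
The full cv_results_ dictionary is hard to read. A compact view with only the columns needed to weigh score against fit time might look like the sketch below, which uses synthetic data and a deliberately tiny grid so it runs quickly. All names in it (`search`, `summary`, `X_demo`, `y_demo`) are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to keep the example self-contained
X_demo, y_demo = make_regression(n_samples=120, n_features=4,
                                 noise=5.0, random_state=0)

# Small grid and a fixed seed so the example is fast and repeatable
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      {'n_estimators': [5, 20]}, cv=3)
search.fit(X_demo, y_demo)

# Keep only the columns that matter for the comparison
summary = pd.DataFrame(search.cv_results_)[
    ['param_n_estimators', 'mean_test_score', 'mean_fit_time', 'rank_test_score']
]
print(summary)
print(search.best_params_, search.best_score_)
```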

### Final Model:

In [52]:
Regressor = RandomForestRegressor(n_estimators = 1000)
Regressor.fit(X_train,y_train)
y_pred = Regressor.predict(X_test)

In [53]:
y_pred

array([ 27.5499    ,  64.38084   , 103.15676621, ...,  26.97817246,
       183.49      , 267.66033333])

### Prediction Example:

We can now try predicting the price of an arbitrary house.

We will predict the price of a house with following features:
1. Area Type: Super built-up Area (0)
2. Availability: Ready to move (0)
3. Location: Kothanur (148)
4. Area: 1791 sq. ft
5. Bath: 3
6. Balcony: 2
7. Number of Bedrooms: 3

In [59]:
Regressor.predict([[0,0,148,1791,3,2,3]])

array([105.25033667])

The predicted price of the above flat is Rs. 105.25 Lakh (1.05 Cr).
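
A positional list like the one above is easy to get wrong, because the values must follow the exact column order of X. One way to guard against that is to name each value and reorder to the training columns. The sketch below is only an illustration: `demo_model` is a stand-in trained on the five encoded rows shown by df.head(), so its prediction is not meaningful, and `feature_cols`, `house` and `query` are hypothetical names.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Column order of X in this notebook (df without 'price')
feature_cols = ['area_type', 'availability', 'location', 'total_sqft',
                'bath', 'balcony', 'bhk']

# Tiny stand-in training set: the five encoded rows shown by df.head()
X_small = pd.DataFrame(
    [[0, 2, 78, 1056.0, 2.0, 1.0, 2],
     [2, 0, 60, 2600.0, 5.0, 3.0, 4],
     [1, 0, 223, 1440.0, 2.0, 3.0, 3],
     [0, 0, 156, 1521.0, 3.0, 1.0, 3],
     [0, 0, 148, 1200.0, 2.0, 1.0, 2]],
    columns=feature_cols)
y_small = [39.07, 120.0, 62.0, 95.0, 51.0]

demo_model = RandomForestRegressor(n_estimators=10, random_state=0)
demo_model.fit(X_small, y_small)

# Name every value, in any order, then reorder to the training columns
house = {'bhk': 3, 'bath': 3, 'balcony': 2, 'total_sqft': 1791,
         'location': 148, 'availability': 0, 'area_type': 0}
query = pd.DataFrame([house])[feature_cols]

pred = demo_model.predict(query)
print(pred)
```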

###### Thus, we can predict the price of houses in various locations of Bengaluru.