<a href="https://colab.research.google.com/github/satishgunjal/Real-Estate-Price-Prediction-Project/blob/master/House_Price_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Real Estate Price Prediction Project

## Introduction
This project builds a machine learning model that predicts home prices in Bangalore, India, using a dataset from Kaggle.com.
We will also create a single-page website that serves as the front end for requesting predictions from the model.

The following data science concepts are used in this project:
* Data loading and cleaning
* Outlier detection and removal
* Feature engineering
* Dimensionality reduction
* GridSearchCV for hyperparameter tuning
* K-fold cross-validation

Technology and tools used in this project
* Python
* Numpy and Pandas for data cleaning
* Matplotlib for data visualization
* Sklearn for model building
* Google Colaboratory Notebook
* Python Flask for the HTTP server
* HTML/CSS/JavaScript for the UI

## Steps
1. First, we build a model with sklearn's linear regression using the Bangalore home prices dataset from kaggle.com.
2. Second, we write a Python Flask server that uses the saved model to serve HTTP requests.
3. Third, we build a website in HTML, CSS and JavaScript that lets the user enter the home's square foot area, number of bedrooms, etc., and calls the Flask server to retrieve the predicted price.

## Dataset Reference
* [Bengaluru House price data](https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data)
* I have also uploaded the CSV file to this repository: [Bengaluru_House_Data.csv](Bengaluru_House_Data.csv)

# Step#1: Import Libraries


In [1]:
import io

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

matplotlib.rcParams['figure.figsize'] = (20, 10)  # width, height in inches

# google.colab exists only inside Google Colab; guard the import so the
# notebook also runs locally (files stays None outside Colab)
try:
    from google.colab import files
except ImportError:
    files = None

# Step#2: Load the data
* Load the data into a dataframe

In [None]:
# for local notebook
#df1 = pd.read_csv('Bengaluru_House_Data.csv')
#df1.head()

In [None]:
# For Google Colab
uploaded = files.upload()

In [None]:
df1 = pd.read_csv(io.StringIO(uploaded['Bengaluru_House_Data.csv'].decode('utf-8')))
df1.head()

# Step#3: Understand the data
* Finalize the columns to work with and drop the rest

In [None]:
# Get the no of rows and columns
df1.shape

In [None]:
#Get all the column names
df1.columns

In [None]:
# Let's check the unique values of the 'area_type' column
df1.area_type.unique()

In [None]:
# Let's get the count of training examples for each area type
df1.area_type.value_counts()

## Dropping the columns
* Arguably all the columns are relevant for price prediction, but for the sake of this project I am going to drop a few of them

In [None]:
# Note: every time we change the dataset we store the result in a new dataframe
df2 = df1.drop(['area_type', 'availability', 'society', 'balcony'],axis='columns')

print('Rows and columns are = ', df2.shape)
df2.head()

# Step#4: Data Cleaning
* Check for NA values
* Verify the unique values of each column
* Make sure values are plausible (e.g. a 23 BHK home with only 2000 sqft seems wrong)

## Handling null values

In [None]:
# Get the sum of all na values from dataset
df2.isna().sum()

Since the number of null values is very small compared to the total number of training examples (13320), we can safely drop those examples.

In [None]:
df3 = df2.dropna()
df3.isnull().sum()

In [None]:
# Since all our training examples containing null values are dropped, let's check the shape of the dataset again
df3.shape

## Feature Engineering
* The 'size' column contains the size of the house in terms of BHK (Bedroom, Hall, Kitchen)
* To simplify it we can create a new column named 'bhk' that holds only the numeric BHK count

In [None]:
df3['size'].unique()

In [None]:
df4 = df3.copy()

# Using a lambda function we can get the numeric BHK value
df4['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
df4.bhk.unique()

From the above data we can see that there are homes with up to 43 BHK in Bangalore... must belong to a politician :)

In [None]:
# Get the training examples with more than 20 BHK
df4[df4.bhk > 20]

Note that the 43 BHK home above has an area of only 2400 sqft. We will remove this data error later. First let's clean the 'total_sqft' column.

Now let's check the unique values in the 'total_sqft' column.

In [None]:
df4.total_sqft.unique()

Note that a few records above hold a range of areas, like '1133 - 1384'.
Let's write a function to identify such values.

In [None]:
def is_float(x):
  # True if x can be converted to a float, False otherwise
  try:
    float(x)
  except (TypeError, ValueError):
    return False

  return True

In [None]:
# Test the function
print('is this (123) float value = %s' % (is_float(123)))
print('is this (1133 - 1384) float value = %s' % (is_float('1133 - 1384')))

In [None]:
# Let's apply this function to the 'total_sqft' column

# Show training examples where the 'total_sqft' value is not a float
df4[~df4['total_sqft'].apply(is_float)].head(10)

* Since most of these values are ranges of sqft, we can write a function to get the average value of a range.
* There are a few values like '34.46Sq. Meter' and '4125Perch'. We could also try to convert those values into sqft, but for now I am going to ignore them
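
The unit-suffixed values mentioned above could in principle be converted instead of dropped. The following is a minimal sketch of that idea and is not part of this notebook's pipeline: the helper name `convert_with_unit`, the list of handled units, and the conversion factors are assumptions added for illustration, and the input is assumed to be a string.

```python
import re

# Conversion factors to square feet. The set of units and the factors are
# assumptions for this sketch; the notebook itself ignores such values.
SQFT_PER_UNIT = {
    "sq. meter": 10.7639,
    "sq. yards": 9.0,
    "acres": 43560.0,
    "perch": 272.25,
}

def convert_with_unit(value):
    """Return the area in square feet, or None if no known unit is found."""
    # A number followed by a unit made of letters, dots and spaces
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z. ]+?)\s*", value)
    if match is None:
        return None
    number, unit = match.groups()
    factor = SQFT_PER_UNIT.get(unit.strip().lower())
    if factor is None:
        return None
    return float(number) * factor

print(convert_with_unit('34.46Sq. Meter'))
print(convert_with_unit('4125Perch'))
print(convert_with_unit('1200'))  # no unit suffix, so this helper returns None
```

Plain numbers and ranges would still go through the range-averaging function; this helper would only be a fallback for the values that function cannot parse.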

In [None]:
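def convert_range_to_sqft(x):
  # Return the midpoint of a range like '1133 - 1384', the plain number
  # for values like '12345', and None for anything that cannot be parsed
  try:
    tokens = x.split('-')

    if len(tokens) == 2:
      return (float(tokens[0]) + float(tokens[1]))/2
    else:
      return float(x)
  except (AttributeError, ValueError):
    return None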

In [None]:
#Lets test the convert_range_to_sqft()
print('Return value for i/p 12345 = %s' % (convert_range_to_sqft('12345')))
print('Return value for i/p 1133 - 1384 = %s' % (convert_range_to_sqft('1133 - 1384')))
print('Return value for i/p 34.46Sq. Meter = %s' % (convert_range_to_sqft('34.46Sq. Meter')))

In [None]:
# Lets apply this function for total_sqft column
df5 = df4.copy()

df5.total_sqft = df4.total_sqft.apply(convert_range_to_sqft)
df5.head()

In [None]:
# Our conversion function returns None for values like '34.46Sq. Meter', so let's check for null values in the column
df5.total_sqft.isnull().sum()

In [None]:
# Let's drop the rows with a null total_sqft
df6 = df5.dropna()
df6.total_sqft.isnull().sum()

# Alternatively, we can select the non-null rows using the filter below
#df6 = df5[df5.total_sqft.notnull()]

In [None]:
# Let's cross-check the values of 'total_sqft' (indexing with [30] selects the row with index label 30)
print('total_sqft value for row label 30 in df4 = %s' % (df4.total_sqft[30]))
print('total_sqft value for row label 30 in df6 = %s' % (df6.total_sqft[30]))

## Feature Engineering
* The 'price' column contains the price of the house in lakh (1 lakh = 100000)
* Price per square foot is an important parameter in house prices.
* So we can create a new column named 'price_per_sqft' that holds the price per sqft. Formula: (price * 100000)/total_sqft
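
As a quick worked instance of the formula above, here is a small sketch with hypothetical numbers (the function name `price_per_sqft` and the input values are illustrative, not taken from the dataset):

```python
def price_per_sqft(price_lakh, total_sqft):
    """Convert a price in lakh to rupees and divide by the area in sqft."""
    return price_lakh * 100000 / total_sqft

# A hypothetical home priced at 50 lakh with 1000 sqft
print(price_per_sqft(50, 1000))  # 5000.0
```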

In [None]:
df7 = df6.copy()

df7['price_per_sqft'] = (df6['price'] * 100000)/df6['total_sqft']
df7.head()

In [None]:
df7_stats = df7['price_per_sqft'].describe()
df7_stats

## Dimensionality Reduction
* Dimensionality reduction is simply the process of reducing the dimension (or number of random variables) of your feature set
* In our dataset 'location' is a categorical variable with 1287 categories.
* Before using One Hot Encoding to create dummy variables we must reduce the number of categories so that we get fewer dummy variables.
* Our criterion for reducing 'location' is to use the category 'other' for any location having 10 or fewer data points.
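
To make the grouping rule concrete before applying it to the dataframe, here is a minimal sketch in plain Python on toy data. The function name `group_rare_locations` and the toy counts are assumptions for illustration; the threshold mirrors the `location_stats <= 10` filter used in the notebook's code, so a location with exactly 10 rows is also grouped.

```python
from collections import Counter

def group_rare_locations(locations, max_count=10, other_label="other"):
    """Replace every location occurring max_count times or fewer with other_label."""
    counts = Counter(locations)
    return [loc if counts[loc] > max_count else other_label for loc in locations]

# Toy data: one frequent location and two rare ones
toy = ["Rajaji Nagar"] * 12 + ["Hebbal"] * 3 + ["Indira Nagar"] * 10
grouped = group_rare_locations(toy)

# 'Rajaji Nagar' keeps its 12 rows; the other 13 rows become 'other'
print(Counter(grouped))
```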


In [None]:
#Trim the location values
df7.location = df7.location.apply(lambda x: x.strip())
df7.head()

In [None]:
#Lets get the count of each location
location_stats = df7.location.value_counts(ascending=False)
location_stats

In [None]:
# Total number of unique location categories
len(location_stats)

We are going to assign the category 'other' to every location with 10 or fewer data points

In [None]:
# Get the number of locations with more than 10 data points and with 10 or fewer
print('Total no of locations where data points are more than 10 = %s' % (len(location_stats[location_stats > 10])))
print('Total no of locations where data points are 10 or fewer = %s' % (len(location_stats[location_stats <= 10])))

Any location having 10 or fewer data points should be tagged as the "other" location. This greatly reduces the number of categories, so when we later do one hot encoding we end up with fewer dummy columns.

In [None]:
location_stats_less_than_10 = location_stats[location_stats <= 10]
location_stats_less_than_10

In [None]:
# Using a lambda function, assign 'other' to every location that appears in 'location_stats_less_than_10'
df8 = df7.copy()

df8.location = df7.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x )
len(df8.location.unique())

Since the 1047 locations with 10 or fewer data points are converted to the single category 'other',
the total number of unique location categories is 240 + 1 = 241

In [None]:
df8.head(10)

## Outlier Removal
* An outlier is an observation that is unlike the other observations. It is rare, or distinct, or does not fit in some way.
* Outliers are the data points that represent the extreme variation of a dataset
* Outliers can be valid data points, but since our model is a generalization of the data, outliers can affect the performance of the model. We are going to remove the outliers, but please note that removing outliers is not always good practice.
* To remove the outliers we can use domain knowledge and standard deviation

### Standard Deviation
* Standard deviation is a measure of spread, i.e. how much the data varies from the average
* A low standard deviation tells us that the data is closely clustered around the mean (or average), while a high standard deviation indicates that the data is dispersed over a wider range of values.
* It is used when the distribution of data is approximately normal, resembling a bell curve.
* One standard deviation (1 Sigma) around the mean covers about 68% of normally distributed data, i.e. the data between (mean - std deviation) and (mean + std deviation)
* Here we are going to use 1 Sigma as our threshold, and any data outside 1 Sigma will be considered an outlier
* [How to Use Statistics to Identify Outliers in Data](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/)
* [Reference](https://youtu.be/MRqtXL2WX2M)
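
The 1 Sigma rule described above can be sketched on a plain list of numbers. This is an illustrative example with hypothetical price_per_sqft values, not the notebook's implementation; the function name `within_one_sigma` is made up here. It uses the population standard deviation (`statistics.pstdev`), which corresponds to the default behaviour of `np.std`.

```python
from statistics import mean, pstdev

def within_one_sigma(values):
    """Keep values v with (mean - std) < v <= (mean + std)."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation, like np.std's default
    return [v for v in values if (m - s) < v <= (m + s)]

# Hypothetical price_per_sqft values for one location, with one extreme listing
prices = [4000, 4200, 4500, 4700, 5000, 12000]
kept = within_one_sigma(prices)
print(kept)  # the 12000 listing falls outside mean + std and is dropped
```

Note that a single extreme value inflates both the mean and the standard deviation, which is one reason a fixed 1 Sigma cut is a fairly aggressive choice.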

### Using domain knowledge for outlier removal
* Normally the square footage per bedroom is at least 300 (i.e. a 2 BHK apartment is at least 600 sqft)
* If, for example, a 400 sqft apartment has 2 BHK, that seems suspicious and it can be removed as an outlier.
* We will remove such outliers by setting our minimum threshold per BHK to 300 sqft

In [None]:
# Let's look at the data where square feet per bedroom is less than 300
df8[(df8.total_sqft / df8.bhk) < 300]

Note that above we have 744 training examples where square feet per bedroom is less than 300. These are outliers, so we can remove them

In [None]:
#Lets check current dataset shape before removing outliers
df8.shape

In [None]:
df9 = df8[~((df8.total_sqft / df8.bhk) < 300)]
df9.shape

### Outlier Removal - Using Standard Deviation and Mean
* One standard deviation (1 Sigma) around the mean covers about 68% of normally distributed data, i.e. the data between (mean - std deviation) and (mean + std deviation)
* Here any data point outside the 1 Sigma range is an outlier for us

In [None]:
# Get basic stats of column 'price_per_sqft'
df9.price_per_sqft.describe()

Note: It's important to understand that the price of every house is location specific. We are going to remove outliers using 'price_per_sqft' separately for each location

In [None]:
# Data visualization for 'price_per_sqft' for location 'Rajaji Nagar'
# Note: the data here is roughly normally distributed, so outlier removal using the mean and standard deviation works well
plt.hist(df9[df9.location == "Rajaji Nagar"].price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

In [None]:
#Lets check current dataset shape before removing outliers
df9.shape

In [None]:
# Function to remove outliers using pps(price per sqft)
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        mean = np.mean(subdf.price_per_sqft)
        std = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(mean-std)) & (subdf.price_per_sqft<=(mean + std))] # 1 Sigma value i.e 68% of data
        df_out = pd.concat([df_out,reduced_df],ignore_index=True) # Storing data in 'df_out' dataframe
    return df_out

df10 = remove_pps_outliers(df9)
df10.shape

In [None]:
# Data visualization for 'price_per_sqft' for location 'Rajaji Nagar' after outlier removal
plt.hist(df10[df10.location == "Rajaji Nagar"].price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

### Using domain knowledge for outlier removal
* If the location and square foot area are the same, then the price of a 3 BHK should be more than that of a 2 BHK
* There are other factors that also affect the price, but for this exercise we treat such values as outliers and remove them

In [None]:
# Let's check how the 2 BHK and 3 BHK property prices look for a given location
def plot_scatter_chart(df,location):
    bhk2 = df[(df.location==location) & (df.bhk==2)]
    bhk3 = df[(df.location==location) & (df.bhk==3)]
    matplotlib.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()
    
plot_scatter_chart(df10,"Rajaji Nagar")

In [None]:
plot_scatter_chart(df10,"Hebbal")

We should also remove properties where, for the same location, the price of (for example) a 3 bedroom apartment is less than that of a 2 bedroom apartment (with the same square ft area). For a given location, we will build a dictionary named 'bhk_stats' holding the statistics of 'price_per_sqft' for each BHK value, as shown below

```
{
    '1' : {
        'mean': 4000,
        'std': 2000,
        'count': 34
    },
    '2' : {
        'mean': 4300,
        'std': 2300,
        'count': 22
    },
}
```
Now we can remove those 2 BHK apartments whose price_per_sqft is less than the mean price_per_sqft of 1 BHK apartments
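
Before applying this rule to the dataframe, it may help to see it on a handful of toy rows. The sketch below is a plain-Python illustration of the same idea, not the notebook's function: the name `drop_bhk_outliers`, the row format (a list of dicts), and the toy values are assumptions. Like the notebook's code, it only compares against the smaller-BHK group when that group has more than 5 listings.

```python
from statistics import mean

def drop_bhk_outliers(rows, min_count=5):
    """rows: list of dicts with 'location', 'bhk' and 'price_per_sqft' keys."""
    # Collect price_per_sqft values per (location, bhk) group
    groups = {}
    for row in rows:
        key = (row["location"], row["bhk"])
        groups.setdefault(key, []).append(row["price_per_sqft"])

    kept = []
    for row in rows:
        smaller = groups.get((row["location"], row["bhk"] - 1))
        # Drop the row if it is cheaper per sqft than the mean of the
        # (bhk - 1) group, provided that group is large enough to trust
        if smaller and len(smaller) > min_count and row["price_per_sqft"] < mean(smaller):
            continue
        kept.append(row)
    return kept

# Toy data: six 1 BHK listings at 4000 per sqft and two 2 BHK listings
rows = [{"location": "Hebbal", "bhk": 1, "price_per_sqft": 4000} for _ in range(6)]
rows.append({"location": "Hebbal", "bhk": 2, "price_per_sqft": 3500})
rows.append({"location": "Hebbal", "bhk": 2, "price_per_sqft": 4500})

kept = drop_bhk_outliers(rows)
print(len(rows), "->", len(kept))  # only the 3500 per sqft 2 BHK listing is dropped
```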

In [None]:
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')

df11 = remove_bhk_outliers(df10)
df11.shape

In [None]:
#Plot same scatter chart again to visualize price_per_sqft for 2 BHK and 3 BHK properties
plot_scatter_chart(df11,"Rajaji Nagar")

In [None]:
plot_scatter_chart(df11,"Hebbal")

Now you can compare the scatter plots for the locations Hebbal and Rajaji Nagar before and after outlier removal

In [None]:
#Now lets plot the histogram and visualize the price_per_sqft data after outlier removal

matplotlib.rcParams["figure.figsize"] = (20,10)
plt.hist(df11.price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

### Using domain knowledge for outlier removal
* Generally a home has at most (number of BHK) + 2 bathrooms.
* Using this rule of thumb we can identify the outliers and remove them

In [None]:
#Get unique bath from dataset
df11.bath.unique()

In [None]:
#Get the training examples where no of bath are more than (no of BHK +2)
df11[df11.bath > df11.bhk + 2]

We can remove the above outliers from the dataset

In [None]:
#Lets check current dataset shape before removing outliers
df11.shape

In [None]:
# Keep only homes with at most (no of BHK + 2) bathrooms
df12 = df11[df11.bath <= (df11.bhk + 2)]
df12.shape

This concludes our data cleaning, so let's drop the unnecessary columns
* Since we have the 'bhk' feature, let's drop 'size'
* We created 'price_per_sqft' only for outlier detection and removal, so we can drop it as well.

In [None]:
df13 = df12.drop(['size', 'price_per_sqft'], axis='columns')
df13.head()

## One Hot Encoding
Since 'location' is a categorical feature, let's use One Hot Encoding to create a separate column for each location category and assign it the binary value 1 or 0
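
To show what the encoding produces, here is a minimal pure-Python sketch on four toy rows. It is only an illustration of the idea (the notebook uses `pd.get_dummies`); the function name `one_hot` is made up. Dropping the 'other' column, as the notebook does, means a row in the 'other' category is represented by all zeros.

```python
def one_hot(locations, drop="other"):
    """Return (column_names, rows) with one 0/1 column per category except `drop`."""
    columns = sorted(set(locations) - {drop})
    rows = [[1 if loc == col else 0 for col in columns] for loc in locations]
    return columns, rows

columns, rows = one_hot(["Hebbal", "other", "Rajaji Nagar", "Hebbal"])
print(columns)
for row in rows:
    print(row)
```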

In [None]:
dummies = pd.get_dummies(df13.location)
dummies.head()

In [None]:
# To avoid the dummy variable trap, let's delete one of the dummy variable columns
dummies = dummies.drop(['other'],axis='columns')
dummies.head()

In [None]:
#Now lets add dummies dataframe to original dataframe
df14 = pd.concat([df13,dummies],axis='columns')
df14.head()

In [None]:
#Lets delete the location feature
df15 = df14.drop(['location'],axis='columns')
df15.head()

# Step#5: Build Machine Learning Model


In [None]:
#Final shape of our dataset is
df15.shape

Now let's create X (independent variables/features) and y (dependent variable/target)

In [None]:
X = df15.drop(['price'],axis='columns')
X.head()

In [None]:
y = df15.price
y.head()

### Split the dataset into training and test datasets

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=10)
print('X_train shape = ',X_train.shape)
print('X_test shape = ',X_test.shape)
print('y_train shape = ',y_train.shape)
print('y_test shape = ',y_test.shape)

## Linear Regression
* Let's test the score with a LinearRegression model

In [None]:
from sklearn.linear_model import LinearRegression

lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
lr_clf.score(X_test, y_test)
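
For a regressor, the value returned by `score` is the coefficient of determination R². As a small sketch of what that number measures, the following computes R² by hand on hypothetical prices; the function name `r2_score_manual` and the values are illustrative only.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted prices (in lakh)
y_true = [50, 60, 70, 80]
y_pred = [52, 58, 71, 79]
print(r2_score_manual(y_true, y_pred))  # residual SS = 10, total SS = 500
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean.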

### Use K Fold cross validation to measure the score (R²) of our LinearRegression model
* Using Sklearn's cross_val_score function
* Note: for a regressor, cross_val_score defaults to KFold (StratifiedKFold is the default only for classifiers); below we pass a ShuffleSplit explicitly
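
The cross validation in this notebook is run with `ShuffleSplit`. Conceptually, each split is an independent random train/test partition, so test sets from different splits may overlap (unlike K-fold, where every sample is tested exactly once). The sketch below illustrates that idea in plain Python; it is not scikit-learn's actual algorithm, and the function name `shuffle_split_indices` is an assumption.

```python
import random

def shuffle_split_indices(n_samples, n_splits=5, test_size=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs, each from an independent shuffle."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set, the rest the train set
        yield indices[n_test:], indices[:n_test]

for train_idx, test_idx in shuffle_split_indices(10):
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```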

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

# ShuffleSplit draws a fresh random train/test split for each of the 5 iterations
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cross_val_score(LinearRegression(), X, y, cv = cv)

## GridSearchCV
* From the above scores it is clear that with LinearRegression we get a maximum score of up to 87%
* Let's use GridSearchCV to test other regression algorithms
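
A grid search tries every combination of the listed parameter values and cross-validates each one. As a rough illustration of how many candidates that produces, the sketch below expands the Lasso grid used in this notebook with `itertools.product`; the helper name `expand_grid` is made up, and the fit count ignores the final refit on the best parameters.

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(param_grid)
    value_lists = [param_grid[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

lasso_grid = {"alpha": [1, 2], "selection": ["random", "cyclic"]}
candidates = expand_grid(lasso_grid)
for params in candidates:
    print(params)

# With 5 cross-validation splits, each candidate is fitted 5 times
n_fits = len(candidates) * 5
print("cross-validation fits:", n_fits)
```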

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearchcv(X,y):
  # Candidate models and the hyperparameter grid to search for each
  algos = {
      'linear_regression': {
          'model': LinearRegression(),
          'params': {
              # 'normalize' was removed from LinearRegression in scikit-learn 1.2
              'fit_intercept': [True, False]
          }
      },
      'lasso': {
          'model': Lasso(),
          'params': {
              'alpha': [1, 2],
              'selection': ['random', 'cyclic']
          }
      },
      'decision_tree': {
          'model': DecisionTreeRegressor(),
          'params': {
              # 'mse' was renamed to 'squared_error' (old name removed in scikit-learn 1.2)
              'criterion': ['squared_error', 'friedman_mse'],
              'splitter': ['best', 'random']
          }
      }
  }

  scores= []
  cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

  for algo_name, config in algos.items():
      gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
      gs.fit(X,y)
      scores.append({
          'model': algo_name,
          'best_score': gs.best_score_,
          'best_params': gs.best_params_
      })

  return pd.DataFrame(scores,columns=['model','best_score','best_params'])

find_best_model_using_gridsearchcv(X,y)

**Based on above results we can say that LinearRegression gives the best score. Hence we will use that**.

# Step#6: Testing the model
* Since all our locations are now columns in the form of dummy variables, every dummy variable should be 0 except the one (the dummy column for our location) we are predicting for
* `np.where(X.columns==location)[0]` returns the positions of the columns matching our location; its first element is the index of our location's dummy column
* We then assign the value 1 at this index and keep all other dummy variable columns at 0

In [None]:
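def predict_price(location, sqft, bath, bhk):
    # Positions of the dummy column for this location (empty if the
    # location has no column, e.g. it was grouped into 'other')
    matches = np.where(X.columns==location)[0]

    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if len(matches) > 0:
        x[matches[0]] = 1

    return lr_clf.predict([x])[0]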

In [None]:
predict_price('1st Phase JP Nagar',1000, 2, 2)

In [None]:
predict_price('1st Phase JP Nagar',1000, 3, 3)

In [None]:
predict_price('Indira Nagar',1000, 2, 2)

In [None]:
predict_price('Indira Nagar',1000, 3, 3)

# Step#7: Export the model to a pickle file

In [None]:
import pickle

with open('Real_Estate_Price_Prediction_Project.pickle','wb') as f:
    pickle.dump(lr_clf,f)

In [None]:
# Since we are using Google Colab, the pickle file will be saved in the current directory of the Google cloud machine
import os

os.listdir('.')

In [None]:
#Lets download it
from google.colab import files

files.download('Real_Estate_Price_Prediction_Project.pickle')

# Step#8: Export any other important info
* Since we are using One Hot Encoding for the location column, we need the final list of all the columns in our feature set

In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open("columns.json","w") as f:
    f.write(json.dumps(columns))

In [None]:
os.listdir('.')

In [None]:
files.download('columns.json')
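
A consumer of `columns.json` (for example the server side) has to rebuild the same feature layout that the model was trained on. The following is a minimal sketch of that step under stated assumptions: the function name `build_feature_vector`, the temporary file, and the sample column list are hypothetical, and the first three columns are assumed to be total_sqft, bath and bhk. Because the column names are stored in lower case, the location must be lower-cased before the lookup.

```python
import json
import os
import tempfile

def build_feature_vector(data_columns, location, sqft, bath, bhk):
    """Return a list aligned with data_columns; unknown locations stay all-zero."""
    x = [0.0] * len(data_columns)
    # Assumes the first three columns are total_sqft, bath and bhk
    x[0], x[1], x[2] = sqft, bath, bhk
    try:
        # Search only the dummy columns, which start at position 3
        x[data_columns.index(location.lower(), 3)] = 1.0
    except ValueError:
        pass  # no dummy column for this location, e.g. it was grouped into 'other'
    return x

# Round trip through a temporary columns.json with hypothetical contents
columns = {"data_columns": ["total_sqft", "bath", "bhk", "hebbal", "indira nagar"]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "columns.json")
    with open(path, "w") as f:
        json.dump(columns, f)
    with open(path) as f:
        data_columns = json.load(f)["data_columns"]

vector = build_feature_vector(data_columns, "Indira Nagar", 1000, 2, 2)
print(vector)
```

The resulting list could then be passed to the unpickled model's `predict` method, wrapped in an outer list so that it forms a single-row input.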