<a href="https://colab.research.google.com/github/SoumapriyoM/Bengaluru-House-price-prediction/blob/main/bengalure_houseprice_prediction_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bengaluru House Price Prediction Dataset

# 1. Introduction

1. The task is to predict the price of houses in Bengaluru
2. It is a supervised learning problem (regression)
3. The dataset is taken from Kaggle: https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data
4. The goal is to predict the price of a house based on its location, number of bedrooms, and number of bathrooms
5. The dependent variable is 'price'; the rest are independent variables. 'location' is a categorical variable

# 2. Exploring Data

2.1 Import the necessary modules and read the data

In [None]:
import pandas as pd
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split
import warnings
import numpy as np
warnings.filterwarnings('ignore')
from sklearn import metrics

In [None]:
df = pd.read_csv('/kaggle/input/bengaluru-house-price-data/Bengaluru_House_Data.csv')

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.head()

We are keeping the model simple, so we will drop a few columns. Assuming availability, society, balcony, and area_type are not very important features, we drop them.

In [None]:
df1 = df.drop(['availability', 'society', 'balcony', 'area_type'], axis='columns')
df1

# 3. Data Cleaning

Checking NA values

In [None]:
df1.isna().sum()

Removing NA values
- We have 13320 rows and very few NA values (73), so we can safely remove them. Alternatively, we could fill the missing values with the column mean.

In [None]:
df1 = df1.dropna()
df1.isna().sum()
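
The alternative mentioned above, filling missing values rather than dropping rows, could be sketched as follows. The toy frame and its values are made up for illustration and are not taken from the dataset; a mean only makes sense for numeric columns such as `bath`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df1; the values are made up.
toy = pd.DataFrame({
    "bath": [2.0, np.nan, 3.0, 1.0],
    "price": [50.0, 75.0, 120.0, 30.0],
})

# Replace missing bathroom counts with the column mean instead of dropping the row.
filled = toy.copy()
filled["bath"] = filled["bath"].fillna(filled["bath"].mean())

print(filled)
```

Text columns such as `location` or `size` would need a different strategy, so dropping the few affected rows keeps this notebook simpler.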

In [None]:
df1['size'].unique()

In some rows the size is given as BHK and in others as Bedroom. To handle this, we create a new column bhk that holds only the numeric part of the value.

In [None]:
df1['bhk'] = df1['size'].apply(lambda x: int(x.split(' ')[0]))
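
As a small illustration of the parsing step, the same split-and-convert logic applied to a few sample strings (chosen here for illustration, in the two formats described above) gives:

```python
# Sample strings in the two formats seen in the 'size' column.
samples = ["2 BHK", "4 Bedroom", "3 BHK"]

# The leading token is the room count, whatever label follows it.
bhk_values = [int(s.split(" ")[0]) for s in samples]
print(bhk_values)
```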

In [None]:
df1.head()

In [None]:
df1['bhk'].unique()

In [None]:
df1[df1.bhk > 20]

It has 43 bedrooms with 2400 total_sqft. This looks like an error, and we will deal with it later.

In [None]:
df1.total_sqft.unique()

In total_sqft there are some values like '1133-1384' that we need to correct. First we check whether each value can be parsed as a float, and then we convert it appropriately.

In [None]:
def is_float(x):
    # True if x can be parsed as a float, e.g. '2100' but not '1133-1384'
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [None]:
df1[~df1['total_sqft'].apply(is_float)].head(10)

For range values, we take the midpoint. Other anomalous values like 34.46Sq. Meter are simply ignored (converted to None).

In [None]:
def convert_range_to_num(x):
    # Ranges such as '1133-1384' become their midpoint; unparseable values become None
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        return None
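
A quick check of the three cases (plain number, range, unparseable unit string) is sketched below. The function is restated so the snippet runs on its own, and the sample strings are the ones mentioned in the text plus a plain number.

```python
def convert_range_to_num(x):
    """Return the midpoint of 'a-b', the float value of 'a', or None if unparseable."""
    tokens = x.split("-")
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        return None

# Apply the conversion to one example of each kind of value.
results = {s: convert_range_to_num(s) for s in ["2100", "1133-1384", "34.46Sq. Meter"]}
print(results)
```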

In [None]:
df2 = df1.copy()
df2['total_sqft'] = df1['total_sqft'].apply(convert_range_to_num)
df2.head()

In [None]:
df2['total_sqft'].unique()

We will add a new feature, 'price_per_sqft', which will be essential for outlier detection

In [None]:
df3 = df2.copy()
df3['price_per_sqft'] = df3['price']*100000/df3['total_sqft']

The price is in lakhs of INR (1 lakh = 100,000 rupees), which is why we multiply it by 100000.
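
As a worked example of the unit conversion, with a made-up listing of 50 lakh for 1000 sq ft:

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees

price_lakh = 50.0    # made-up listing price, in lakhs
total_sqft = 1000.0  # made-up area, in square feet

# Convert the price to rupees, then divide by the area.
price_per_sqft = price_lakh * LAKH / total_sqft
print(price_per_sqft)  # 5000.0
```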

In [None]:
df3.head()

- We see that the new column has been added.
- Next, we look at the 'location' column. It is a categorical variable, and having too many distinct locations will cause issues.


In [None]:
len(df3.location.unique())

- We have 1304 unique locations, which is huge. Text categories are usually encoded, but with this many values encoding would create too many columns. This is known as the 'curse of dimensionality', or the high-dimensionality problem.

- To deal with it, we can fold rare locations into an 'Other' category. Locations with only a few data points (10 or fewer, as used below) are converted to 'Other'.
- Next, we see how many data points are available for each location

In [None]:
df3['location'] = df3['location'].apply(lambda x: x.strip()) #Cleaning - Removing leading space from location

In [None]:
loc = df3.groupby('location')['location'].agg('count').sort_values(ascending = False)

In [None]:
loc

# 4. Feature Engineering & Dimensionality Reduction Techniques

In [None]:
len(loc[loc<=10]) #Number of locations with 10 or fewer data points

In [None]:
loc_less_than_10 =  loc[loc<=10]

We can convert these 1052 locations to 'Other'

In [None]:
len(df3.location.unique()) #unique values - 1293 out of which 1052 are to be converted to 'Other' category

In [None]:
df3.location = df3['location'].apply(lambda x: 'Other' if x in loc_less_than_10 else x) #converting to 'Other'

In [None]:
len(df3.location.unique())

We now have 242 unique locations (1293 - 1052 + 1), the extra one being the new 'Other' category
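
The same grouping idea can be sketched on a toy list in plain Python. The labels and the cutoff of 2 are made up for illustration (the notebook itself uses 10), and `Counter` stands in for the pandas `groupby` count.

```python
from collections import Counter

# Made-up location labels; the real column has far more values.
locations = ["Whitefield"] * 4 + ["Hebbal"] * 3 + ["Lane A", "Lane B", "Lane B"]
threshold = 2  # illustrative cutoff; the notebook uses 10

# Count rows per location, then relabel every location at or below the cutoff.
counts = Counter(locations)
grouped = [loc if counts[loc] > threshold else "Other" for loc in locations]

print(sorted(set(grouped)))
```

Here 4 unique labels shrink to 4 - 2 + 1 = 3, mirroring the arithmetic above.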

# 5. Outlier Detection & Removal

For outlier detection and removal, we can either use our domain knowledge of real estate or use standard deviation methods.

In [None]:
df3.shape

First, using domain knowledge: a bedroom typically needs about 300 sq ft, so listings with less than 300 sq ft per bedroom are suspicious

In [None]:
df3[df3.total_sqft/df3.bhk <300]

In [None]:
df4 = df3[~(df3.total_sqft/df3.bhk <300)]

In [None]:
df4.location

In [None]:
df4.shape

Here, we see that we have removed the outliers with the domain knowledge.
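
The per-bedroom rule can be illustrated on a few made-up listings. This is only a sketch of the filter logic, not a replacement for the DataFrame version above.

```python
MIN_SQFT_PER_BEDROOM = 300  # the rule of thumb used in the text

# (total_sqft, bhk) pairs; made-up listings.
listings = [(1200.0, 2), (1020.0, 6), (600.0, 8), (900.0, 3)]

# Keep a listing only if each bedroom gets at least 300 sq ft.
kept = [(sqft, bhk) for sqft, bhk in listings if sqft / bhk >= MIN_SQFT_PER_BEDROOM]
print(kept)
```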

Next, we can see 'price_per_sqft' feature. We will see the highest and the lowest values.

In [None]:
df4.price_per_sqft.describe() #describe function gives you some basic statistics on that particular column

We see that the lowest price per sqft is ~267 Rs, whereas the costliest is ~176470 Rs. We will remove such extreme cases per location, keeping only values within one standard deviation of that location's mean.

In [None]:
#Function to remove price_per_sqft outlier
def remove_pps_outlier(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st))& (subdf.price_per_sqft<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index = True)
    return df_out
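
To see what the one-standard-deviation band does within a single location, here is a plain-Python sketch on made-up values. `pstdev` is the population standard deviation, which matches the `ddof=0` default of `np.std`.

```python
from statistics import mean, pstdev

# Made-up price_per_sqft values for a single location.
pps = [4000.0, 4200.0, 4100.0, 3900.0, 12000.0]

m = mean(pps)
st = pstdev(pps)  # population standard deviation, matching np.std's default

# Keep only values inside (mean - std, mean + std].
kept = [v for v in pps if (m - st) < v <= (m + st)]
print(kept)
```

The single extreme value inflates both the mean and the standard deviation, yet it still falls outside the band and is dropped.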

In [None]:
df5 = remove_pps_outlier(df4)

In [None]:
df5.shape

Now we want to see how the price varies with the number of bedrooms for the same square footage. In some cases a property with fewer bedrooms costs more than a 3 BHK of the same area. We will visualize such data in a scatter plot.

In [None]:
#Function to visualize prices between 2 & 3BHK
def plot_scatter_chart(df,location):
    bhk2 = df[(df.location==location) & (df.bhk==2)]
    bhk3 = df[(df.location==location) & (df.bhk==3)]
    plt.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()

In [None]:
plot_scatter_chart(df5,"Rajaji Nagar")

Observation: When the total square feet area is around 1700, the price of some 2 BHK properties is higher than that of 3 BHK properties. We want to remove such outliers.

In [None]:
plot_scatter_chart(df5,"Hebbal")

In [None]:
#Function to remove such outliers
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')
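
The rule implemented above is: within one location, drop an N-bedroom listing if its price per sqft is below the mean price per sqft of (N-1)-bedroom listings, provided there are more than 5 of the latter. A plain-Python sketch on made-up numbers for one location:

```python
from statistics import mean

# Made-up price_per_sqft values for one location, keyed by bedroom count.
pps_by_bhk = {
    2: [4800.0, 5000.0, 5200.0, 4900.0, 5100.0, 5000.0],
    3: [4500.0, 5200.0, 6000.0],
}
MIN_COUNT = 5  # only trust the smaller-flat mean when it has more than 5 listings

filtered = {}
for bhk, values in pps_by_bhk.items():
    smaller = pps_by_bhk.get(bhk - 1)
    if smaller and len(smaller) > MIN_COUNT:
        # Drop listings priced below the mean of the next-smaller flat type.
        cutoff = mean(smaller)
        values = [v for v in values if v >= cutoff]
    filtered[bhk] = values

print(filtered[3])
```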

In [None]:
df6 = remove_bhk_outliers(df5)
df6.shape

In [None]:
plot_scatter_chart(df6,"Rajaji Nagar")

In [None]:
plot_scatter_chart(df6,"Hebbal")

We see that some of the outliers have been removed. Next, we plot a histogram to see the data distribution.

In [None]:
plt.rcParams["figure.figsize"] = (20,10)
plt.hist(df6.price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

The histogram looks roughly bell-shaped, so the data appears approximately normally distributed.

Now we explore the number-of-bathrooms feature. A 2-bedroom home usually has 2 or 3 bathrooms; more than that is highly unlikely.

In [None]:
df6.bath.unique()

In [None]:
df6[df6.bath>10]

In [None]:
df6[df6.bath>df6.bhk+2] #Properties with more than bhk + 2 bathrooms

In [None]:
#The rows above are outliers and we remove them. Note the strict '<': rows with exactly bhk + 2 bathrooms are dropped as well
df7 = df6[df6.bath<df6.bhk+2]

In [None]:
other_locations = df7[df7['location'] == 'Other']

# Print the other locations
print(other_locations)

In [None]:
df7.shape

- We will drop 'size' column, as we have already created 'bhk' column.
- We will drop price_per_sqft column, as it is no longer required and we had to use it only for outlier detection

As ML algorithms cannot interpret text data, we will use one-hot encoding to convert 'location'

In [None]:
dum = pd.get_dummies(df7.location)
dum

In [None]:
#To avoid the dummy variable trap, we drop one dummy column ('Other'), leaving one fewer column than categories
df8 = pd.concat([df7, dum.drop('Other', axis = 1)], axis = 1)
df8.head(10)
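
What the encoding with a dropped baseline produces can be shown by hand. This is a plain-Python illustration on made-up rows, not the `pd.get_dummies` call itself: the dropped category 'Other' ends up represented by an all-zero row.

```python
# Made-up rows; 'Other' is the category we leave out of the encoding.
rows = ["Hebbal", "Other", "Rajaji Nagar", "Hebbal"]

# One indicator column per location except the dropped baseline.
columns = sorted(set(rows) - {"Other"})
encoded = [[1 if loc == col else 0 for col in columns] for loc in rows]

print(columns)
for loc, vec in zip(rows, encoded):
    print(loc, vec)
```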

In [None]:
df8.dtypes

In [None]:
df9 = df8.drop(['price_per_sqft','location', 'size'],axis="columns")

# 6. Applying ML Algorithms

In [None]:
X = df9.drop('price', axis = "columns")
X.head()

In [None]:
y = df9.price

In [None]:
#using test train split method
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=10)

In [None]:
#Creating Linear Regression model
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)
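
For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. A minimal hand computation on made-up numbers, with `r2_score_manual` being a helper name introduced here for illustration:

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up prices (in lakhs) and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

r2 = r2_score_manual(y_true, y_pred)
print(round(r2, 4))
```

A value of 1.0 means perfect predictions, and 0.0 means no better than always predicting the mean.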

Using ShuffleSplit cross-validation (5 random 80/20 splits) to measure the R² score of the LinearRegression model

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

cross_val_score(LinearRegression(), X, y, cv=cv)

**Across the 5 iterations we get a score above 80% every time. This is pretty good, but we want to test a few other regression algorithms to see if we can get an even better score. We will use GridSearchCV for this purpose.**

# Find best model using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearchcv(X,y):
    algos = {
        'linear_regression' : {
            'model': LinearRegression(),
            'params': {
                'fit_intercept': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion' : ['squared_error','friedman_mse'],
                'splitter': ['best','random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

find_best_model_using_gridsearchcv(X,y)

In [None]:
X.columns

In [None]:
np.where(X.columns=='2nd Phase Judicial Layout')[0][0]

Based on the above results, LinearRegression gives the best score, so we will use it.

# Testing model for few properties

In [None]:
def predict_price(location,sqft,bath,bhk):
    # Locations folded into 'Other' have no dummy column; all location flags then stay 0
    matches = np.where(X.columns==location)[0]

    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if len(matches) > 0:
        x[matches[0]] = 1

    return lr_clf.predict([x])[0]
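
The layout of the input row can be sketched without the model. `build_feature_vector` and the shortened column list are hypothetical, introduced only to show how the three numeric slots and the location flag are filled.

```python
def build_feature_vector(columns, location, sqft, bath, bhk):
    """Return a row matching `columns`: three numeric slots, then location flags."""
    x = [0.0] * len(columns)
    x[0], x[1], x[2] = sqft, bath, bhk
    if location in columns:
        x[columns.index(location)] = 1.0
    return x

# Hypothetical, shortened column list in the same order as X.columns.
columns = ["total_sqft", "bath", "bhk", "1st Phase JP Nagar", "Indira Nagar"]

known = build_feature_vector(columns, "Indira Nagar", 1000, 2, 2)
unknown = build_feature_vector(columns, "Some Unlisted Area", 1000, 2, 2)
print(known)
print(unknown)
```

A location without its own column leaves every flag at zero, which is how the dropped 'Other' category is represented.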

In [None]:
predict_price('1st Phase JP Nagar', 1000, 2, 2)

In [None]:
predict_price('Indira Nagar', 1000, 2, 2)

Exporting the tested model to a pickle file

In [None]:
import pickle
with open('banglore_home_prices_model.pickle','wb') as f:
    pickle.dump(lr_clf,f)

Export location and column information to a file that will be useful later on in our prediction application

In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open("columns.json","w") as f:
    f.write(json.dumps(columns))
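
How a prediction application might read the two files back is sketched below. To keep the snippet runnable on its own, it writes stand-in files to a temporary directory first. The dict used in place of the fitted model, the file names, and the short column list are all placeholders.

```python
import json
import os
import pickle
import tempfile

# Stand-ins for the real artifacts: a plain dict replaces the fitted model here.
stand_in_model = {"kind": "placeholder"}
stand_in_columns = {"data_columns": ["total_sqft", "bath", "bhk", "indira nagar"]}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pickle")
    columns_path = os.path.join(tmp, "columns.json")

    # Write both files the same way the notebook does.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(columns_path, "w") as f:
        f.write(json.dumps(stand_in_columns))

    # Read them back, as a prediction service would at startup.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(columns_path) as f:
        loaded_columns = json.load(f)["data_columns"]

# Column names were lower-cased on export, so lower-case the lookup key too.
loc_index = loaded_columns.index("Indira Nagar".lower())
print(loc_index)
```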

## References:
https://github.com/codebasics