# Predict the Bangalore House Price (Regression Problem)

#### Problem Definition:
The goal is to predict Bangalore house prices from the location, number of bedrooms (BHK), number of bathrooms, and total area in square feet.

#### Data: [From Kaggle](https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data?resource=download)

In [1]:
# importing basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df=pd.read_csv("Data/Bengaluru_House_Data.csv")
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [3]:
df.describe()

Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [5]:
df.shape

(13320, 9)

## Data Preprocessing

In [6]:
# Dropping 'area_type','availability','society' and 'balcony' features 
df.drop(['area_type','availability','society','balcony'],axis=1,inplace=True)

In [7]:
df.head()

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Kothanur,2 BHK,1200,2.0,51.0


### Dealing with missing values

In [8]:
df.isna().sum()

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [9]:
df.location.value_counts()

Whitefield                        540
Sarjapur  Road                    399
Electronic City                   302
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Abshot Layout                       1
Name: location, Length: 1305, dtype: int64

In [10]:
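# Only one location is missing; fill it with "Electronic City", one of the most frequent locations.
df.location.fillna("Electronic City", inplace=True)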

In [11]:
df.isna().sum()

location       0
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [12]:
df['size'].value_counts()

2 BHK         5199
3 BHK         4310
4 Bedroom      826
4 BHK          591
3 Bedroom      547
1 BHK          538
2 Bedroom      329
5 Bedroom      297
6 Bedroom      191
1 Bedroom      105
8 Bedroom       84
7 Bedroom       83
5 BHK           59
9 Bedroom       46
6 BHK           30
7 BHK           17
1 RK            13
10 Bedroom      12
9 BHK            8
8 BHK            5
11 BHK           2
11 Bedroom       2
10 BHK           2
14 BHK           1
13 BHK           1
12 Bedroom       1
27 BHK           1
43 Bedroom       1
16 BHK           1
19 BHK           1
18 Bedroom       1
Name: size, dtype: int64

In [13]:
df["size"]

0            2 BHK
1        4 Bedroom
2            3 BHK
3            3 BHK
4            2 BHK
           ...    
13315    5 Bedroom
13316        4 BHK
13317        2 BHK
13318        4 BHK
13319        1 BHK
Name: size, Length: 13320, dtype: object

In [14]:
df["size"].fillna("2 BHK",inplace=True)
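The fill value "2 BHK" is the most frequent `size` in the counts above. A minimal sketch of deriving that value rather than hard-coding it, using a small hypothetical Series (`sizes` is illustrative, not the notebook's data):

```python
import pandas as pd

# Hypothetical sample standing in for df["size"]; None marks a missing entry.
sizes = pd.Series(["2 BHK", "3 BHK", "2 BHK", None, "4 Bedroom"])

# mode() returns the most frequent value(s); take the first one.
most_common = sizes.mode()[0]

# Assign the result back instead of relying on inplace=True.
sizes_filled = sizes.fillna(most_common)
print(most_common, sizes_filled.isna().sum())
```

Assigning the result back avoids depending on `inplace=True` behaving the same way on a column selection across pandas versions.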

In [15]:
df.isna().sum()

location       0
size           0
total_sqft     0
bath          73
price          0
dtype: int64

In [16]:
df["Bhk"]=df["size"].str.split().str.get(0).astype(int)
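The cell above takes the first whitespace-separated token of `size` as the bedroom count. A small sketch on hypothetical values, with a regular-expression variant shown as an alternative (the names `bhk_split` and `bhk_regex` are illustrative):

```python
import pandas as pd

# Hypothetical sample of the "size" column.
size = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "10 BHK"])

# Same approach as the cell above: split on whitespace, take the first token.
bhk_split = size.str.split().str.get(0).astype(int)

# Alternative: pull out the leading digits with a regular expression.
bhk_regex = size.str.extract(r"^(\d+)", expand=False).astype(int)

print(bhk_split.tolist())
print(bhk_regex.tolist())
```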

In [17]:
Data=df.copy()

In [18]:
Data.drop("size",axis=1,inplace=True)

In [19]:
# There are outliers in the Bhk feature, so we keep only entries with fewer than 16 Bhk.
df['Bhk'].sort_values(ascending=False)

4684     43
1718     27
3379     19
11559    18
3609     16
         ..
11498     1
3744      1
3741      1
11509     1
13319     1
Name: Bhk, Length: 13320, dtype: int32

In [20]:
Data=Data[Data['Bhk']<16]

In [21]:
Data.head()

Unnamed: 0,location,total_sqft,bath,price,Bhk
0,Electronic City Phase II,1056,2.0,39.07,2
1,Chikka Tirupathi,2600,5.0,120.0,4
2,Uttarahalli,1440,2.0,62.0,3
3,Lingadheeranahalli,1521,3.0,95.0,3
4,Kothanur,1200,2.0,51.0,2


In [22]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13315 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13315 non-null  object 
 1   total_sqft  13315 non-null  object 
 2   bath        13242 non-null  float64
 3   price       13315 non-null  float64
 4   Bhk         13315 non-null  int32  
dtypes: float64(2), int32(1), object(2)
memory usage: 572.1+ KB


In [23]:
Data["bath"].median()

2.0

In [24]:
Data['bath'].fillna(Data["bath"].median(),inplace=True)

In [25]:
Data.isna().sum()

location      0
total_sqft    0
bath          0
price         0
Bhk           0
dtype: int64

In [26]:
# Some 'total_sqft' values are ranges (e.g. '1133 - 1384'), so we create a function to convert them to a single number.
Data['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

In [27]:
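def range_remover(x):
    # Convert a total_sqft string to float; a 'low - high' range becomes its midpoint.
    data=x.split("-")
    try:
        if len(data)==2:
            return (float(data[0])+float(data[1]))/2
        return float(x)
    except ValueError:
        # Anything that is neither a number nor a numeric range is marked as missing.
        return None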

In [28]:
Data['total_sqft']=Data['total_sqft'].apply(range_remover)
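Values the parser cannot handle come back as missing, and those rows later fall out of any comparison-based filter. A minimal sketch of that behaviour on hypothetical strings (`to_sqft` mirrors the idea of `range_remover`; the sample values are made up):

```python
import pandas as pd

# Hypothetical values in the shapes seen in total_sqft: number, range, unparsable text.
raw = pd.Series(["1056", "1133 - 1384", "not a number"])

def to_sqft(value):
    parts = value.split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except ValueError:
        return None

# astype(float) makes sure None is stored as NaN in a float column.
sqft = raw.apply(to_sqft).astype(float)
print(sqft.tolist())
print(sqft.isna().sum())   # number of values that could not be parsed

# NaN fails every comparison, so unparsable rows drop out of a filter like this.
kept = sqft[sqft >= 300]
```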

In [29]:
Data['total_sqft'].unique()

array([1056. , 2600. , 1440. , ..., 1258.5,  774. , 4689. ])

In [30]:
Data.head()

Unnamed: 0,location,total_sqft,bath,price,Bhk
0,Electronic City Phase II,1056.0,2.0,39.07,2
1,Chikka Tirupathi,2600.0,5.0,120.0,4
2,Uttarahalli,1440.0,2.0,62.0,3
3,Lingadheeranahalli,1521.0,3.0,95.0,3
4,Kothanur,1200.0,2.0,51.0,2


In [31]:
Data['price'].describe()

count    13315.000000
mean       112.447927
std        148.834436
min          8.000000
25%         50.000000
50%         72.000000
75%        120.000000
max       3600.000000
Name: price, dtype: float64

#### Adding a price-per-square-foot column to help with outlier removal

In [32]:
Data["price_per_sqft"]=(Data["price"]/Data["total_sqft"])*100000
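The factor of 100000 suggests that `price` is quoted in lakhs (1 lakh = 100,000 rupees); the notebook does not state the unit, so treat that as an assumption. A quick arithmetic check using the first row shown earlier (price 39.07, total_sqft 1056):

```python
# Values from the first row of the table: price 39.07, total_sqft 1056.
price_lakh = 39.07
total_sqft = 1056.0

# Assumption: price is in lakhs of rupees, so multiplying by 100000 gives rupees.
price_per_sqft = price_lakh / total_sqft * 100000
print(round(price_per_sqft, 2))
```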

In [33]:
Data.head()

Unnamed: 0,location,total_sqft,bath,price,Bhk,price_per_sqft
0,Electronic City Phase II,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,1200.0,2.0,51.0,2,4250.0


In [34]:
Data.describe()

Unnamed: 0,total_sqft,bath,price,Bhk,price_per_sqft
count,13269.0,13315.0,13315.0,13315.0,13269.0
mean,1558.435808,2.681036,112.447927,2.794593,7904.675
std,1235.172466,1.264791,148.834436,1.208639,106449.4
min,1.0,1.0,8.0,1.0,267.8298
25%,1100.0,2.0,50.0,2.0,4266.667
50%,1276.0,2.0,72.0,3.0,5433.83
75%,1680.0,3.0,120.0,3.0,7310.871
max,52272.0,15.0,3600.0,14.0,12000000.0


In [35]:
# The location feature has many unique values (1302), so we will try to reduce that number as much as possible.
Data["location"].value_counts()

Whitefield                        540
Sarjapur  Road                    399
Electronic City                   303
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
Vasantapura main road               1
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
Abshot Layout                       1
Name: location, Length: 1302, dtype: int64

In [36]:
# Some location names have leading or trailing whitespace, so we remove it with strip().
Data["location"]=Data["location"].apply(lambda x: x.strip())
location_counts=Data["location"].value_counts()

In [37]:
# Leading and the trailing white spaces are removed.
location_counts

Whitefield           541
Sarjapur  Road       399
Electronic City      305
Kanakpura Road       273
Thanisandra          237
                    ... 
Marathalli bridge      1
Papareddipalya         1
K R C kothanur         1
1Channasandra          1
Abshot Layout          1
Name: location, Length: 1291, dtype: int64

In [38]:
# Find the locations that appear 10 times or fewer.
location_10_counts=location_counts[location_counts<=10]
location_10_counts

Basapura                 10
1st Block Koramangala    10
Sadashiva Nagar          10
Sector 1 HSR Layout      10
Naganathapura            10
                         ..
Marathalli bridge         1
Papareddipalya            1
K R C kothanur            1
1Channasandra             1
Abshot Layout             1
Name: location, Length: 1050, dtype: int64

In [39]:
Data["location"]=Data["location"].apply(lambda x: "other" if x in location_10_counts else x)
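The same grouping of rare categories can be written without `apply`. A sketch on a hypothetical location column, using a threshold of 2 so the toy data stays small (the notebook uses 10):

```python
import pandas as pd

# Hypothetical location column; the real data has many more rows.
loc = pd.Series(["Whitefield"] * 3 + ["Kothanur"] * 2 + ["Abshot Layout"])

counts = loc.value_counts()
rare = counts[counts <= 2].index   # threshold 2 here; the notebook uses 10

# Keep frequent locations, replace the rest with "other".
grouped = loc.where(~loc.isin(rare), "other")
print(grouped.value_counts().to_dict())
```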

In [40]:
# The number of unique locations is reduced from 1302 (1291 after stripping whitespace) to 242.
Data["location"].value_counts()

other                 2881
Whitefield             541
Sarjapur  Road         399
Electronic City        305
Kanakpura Road         273
                      ... 
Nehru Nagar             11
Banjara Layout          11
LB Shastri Nagar        11
Pattandur Agrahara      11
Narayanapura            11
Name: location, Length: 242, dtype: int64

### Outlier detection and removal

In [41]:
# "total_sqft" clearly contains outliers (min = 1 sqft, max = 52272 sqft).
Data.describe()

Unnamed: 0,total_sqft,bath,price,Bhk,price_per_sqft
count,13269.0,13315.0,13315.0,13315.0,13269.0
mean,1558.435808,2.681036,112.447927,2.794593,7904.675
std,1235.172466,1.264791,148.834436,1.208639,106449.4
min,1.0,1.0,8.0,1.0,267.8298
25%,1100.0,2.0,50.0,2.0,4266.667
50%,1276.0,2.0,72.0,3.0,5433.83
75%,1680.0,3.0,120.0,3.0,7310.871
max,52272.0,15.0,3600.0,14.0,12000000.0


In [42]:
(Data["total_sqft"]/Data["Bhk"]).describe()

count    13269.000000
mean       575.204981
std        388.197821
min          0.250000
25%        473.333333
50%        552.500000
75%        625.000000
max      26136.000000
dtype: float64

In [43]:
# Keep only homes with at least 300 sqft per bedroom; smaller values are treated as outliers.
Data=Data[(Data["total_sqft"]/Data["Bhk"])>=300]
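A toy illustration of the area-per-bedroom filter, with hypothetical rows (the 300 sqft threshold is the notebook's choice):

```python
import pandas as pd

# Hypothetical homes: the second one has only 125 sqft per bedroom.
homes = pd.DataFrame({"total_sqft": [1200.0, 500.0, 900.0], "Bhk": [2, 4, 3]})

sqft_per_bhk = homes["total_sqft"] / homes["Bhk"]   # 600.0, 125.0, 300.0
filtered = homes[sqft_per_bhk >= 300]
print(filtered)
```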

In [44]:
Data.describe()

Unnamed: 0,total_sqft,bath,price,Bhk,price_per_sqft
count,12529.0,12529.0,12529.0,12529.0,12529.0
mean,1593.893665,2.558464,111.347393,2.649773,6304.043527
std,1259.083927,1.071272,152.032899,0.969408,4162.397896
min,300.0,1.0,8.44,1.0,267.829813
25%,1116.0,2.0,49.0,2.0,4210.526316
50%,1300.0,2.0,70.0,3.0,5294.117647
75%,1700.0,3.0,115.0,3.0,6916.666667
max,52272.0,14.0,3600.0,13.0,176470.588235


In [45]:
# The maximum price_per_sqft (about 176470) is far above the 75th percentile, so outliers remain.
Data["price_per_sqft"].describe()

count     12529.000000
mean       6304.043527
std        4162.397896
min         267.829813
25%        4210.526316
50%        5294.117647
75%        6916.666667
max      176470.588235
Name: price_per_sqft, dtype: float64

In [46]:
def sqft_outlier_removal(df):
    # For each location, keep rows whose price_per_sqft lies within
    # one standard deviation of that location's mean.
    op_df=pd.DataFrame()
    for location,subdf in df.groupby("location"):
        m=np.mean(subdf.price_per_sqft)
        sd=np.std(subdf.price_per_sqft)
        gen_df=subdf[(subdf.price_per_sqft>(m-sd)) & (subdf.price_per_sqft<=(m+sd))]
        op_df=pd.concat([op_df,gen_df], ignore_index=True)
    return op_df
Data=sqft_outlier_removal(Data)
Data.describe()

Unnamed: 0,total_sqft,bath,price,Bhk,price_per_sqft
count,10300.0,10300.0,10300.0,10300.0,10300.0
mean,1507.616185,2.470388,91.241837,2.573592,5659.078319
std,876.752879,0.970382,86.228577,0.887891,2265.884204
min,300.0,1.0,10.0,1.0,1250.0
25%,1110.0,2.0,49.0,2.0,4244.864208
50%,1286.0,2.0,67.0,2.0,5175.519668
75%,1650.0,3.0,100.0,3.0,6428.571429
max,30400.0,13.0,2200.0,13.0,24509.803922
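The per-location filter can also be expressed with `groupby(...).transform`, which avoids building the result with repeated `concat` calls. This is a sketch on hypothetical data, not the notebook's function; unlike `sqft_outlier_removal` it keeps the original index rather than resetting it.

```python
import pandas as pd

# Hypothetical data: one clear outlier (20000) in location "A".
data = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "price_per_sqft": [5000.0, 5200.0, 4800.0, 20000.0,
                       7000.0, 7100.0, 7100.0, 7200.0],
})

grp = data.groupby("location")["price_per_sqft"]
mean = grp.transform("mean")
std = grp.transform(lambda s: s.std(ddof=0))   # population std, as np.std computes

# Same bounds as sqft_outlier_removal: (mean - std, mean + std].
mask = (data["price_per_sqft"] > mean - std) & (data["price_per_sqft"] <= mean + std)
trimmed = data[mask]
print(trimmed)
```

Note how tight a one-standard-deviation band is: in location "B" the values 7000 and 7200 are dropped as well, which matches the large drop in row count seen above.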


In [47]:
def Bhk_outlier_removal(df):
    # Within each location, drop n-Bhk homes whose price_per_sqft is below the mean
    # price_per_sqft of (n-1)-Bhk homes, provided there are more than 5 of the latter.
    exclude_indices=np.array([])
    for location,location_df in df.groupby("location"):
        bhk_stats={}
        for bhk,bhk_df in location_df.groupby("Bhk"):
            bhk_stats[bhk]={
                'mean':np.mean(bhk_df.price_per_sqft),
                'std':np.std(bhk_df.price_per_sqft),
                'count':bhk_df.shape[0]
            }
        for bhk,bhk_df in location_df.groupby('Bhk'):
            # dict.get() returns None instead of raising KeyError when there is no (bhk-1) entry.
            stats=bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices=np.append(exclude_indices,bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')
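To make the rule concrete, here is a sketch of the same idea using a merge instead of nested loops, run on hypothetical data. `drop_underpriced_bhk` and its `min_count` parameter are illustrative names, not part of the notebook.

```python
import pandas as pd

def drop_underpriced_bhk(df, min_count=5):
    """Drop n-Bhk rows priced below the mean of (n-1)-Bhk rows in the same location."""
    stats = (df.groupby(["location", "Bhk"])["price_per_sqft"]
               .agg(["mean", "count"])
               .reset_index()
               .rename(columns={"mean": "prev_mean", "count": "prev_count"}))
    stats["Bhk"] += 1  # shift so that (n-1)-Bhk stats line up with n-Bhk rows
    merged = df.merge(stats, on=["location", "Bhk"], how="left")
    # Rows without a smaller-Bhk group get NaN stats, which compare as False.
    drop = (merged["prev_count"] > min_count) & (merged["price_per_sqft"] < merged["prev_mean"])
    return df[~drop.to_numpy()]

# Hypothetical data: six 2-Bhk homes at 5000 per sqft, two 3-Bhk homes.
toy = pd.DataFrame({
    "location": ["A"] * 8,
    "Bhk": [2] * 6 + [3] * 2,
    "price_per_sqft": [5000.0] * 6 + [4000.0, 6000.0],
})
result = drop_underpriced_bhk(toy)
print(result["price_per_sqft"].tolist())
```

The 3-Bhk home at 4000 per sqft is removed because it is cheaper per square foot than the average 2-Bhk home in the same location; the one at 6000 stays.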

In [48]:
Data=Bhk_outlier_removal(Data)

In [49]:
Data.shape

(7360, 6)

In [50]:
Data

Unnamed: 0,location,total_sqft,bath,price,Bhk,price_per_sqft
0,1st Block Jayanagar,2850.0,4.0,428.0,4,15017.543860
1,1st Block Jayanagar,1630.0,3.0,194.0,3,11901.840491
2,1st Block Jayanagar,1875.0,2.0,235.0,3,12533.333333
3,1st Block Jayanagar,1200.0,2.0,130.0,3,10833.333333
4,1st Block Jayanagar,1235.0,2.0,148.0,2,11983.805668
...,...,...,...,...,...,...
10291,other,1200.0,2.0,70.0,2,5833.333333
10292,other,1800.0,1.0,200.0,1,11111.111111
10295,other,1353.0,2.0,110.0,2,8130.081301
10296,other,812.0,1.0,26.0,1,3201.970443


In [51]:
# price_per_sqft was only needed for outlier handling, so we can drop it now.
Data.drop("price_per_sqft",axis=1,inplace=True)

In [52]:
# Saving clean data to csv.
Data.to_csv("Data/Clean_data.csv",index=False)

### Model Building

In [53]:
# Splitting the data into independent and dependent features.
x=Data.drop("price",axis=1) # features variables

In [54]:
y=Data["price"] # target variable

In [65]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In [101]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=16)

In [102]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape

((5888, 4), (5888,), (1472, 4), (1472,))

##### Linear Regression

In [103]:
column_transformer=make_column_transformer((OneHotEncoder(sparse=False),['location']), remainder='passthrough')

In [108]:
scaler=StandardScaler()
linear_rg=LinearRegression(normalize=True)

In [109]:
pipe=make_pipeline(column_transformer,scaler,linear_rg)

In [110]:
pipe.fit(x_train,y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(sparse=False),
                                                  ['location'])])),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression(normalize=True))])

In [111]:
y_pred_lr=pipe.predict(x_test)

In [112]:
r2_score(y_test,y_pred_lr)

0.8381860339652341
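For reference, the R2 score is the coefficient of determination, 1 - SS_res / SS_tot. A plain-Python sketch of that formula on hypothetical numbers (the helper `r2` is illustrative; the notebook uses `sklearn.metrics.r2_score`):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total sum of squares
    return 1 - ss_res / ss_tot

# Hypothetical prices and predictions.
actual = [50.0, 70.0, 120.0, 60.0]
predicted = [55.0, 65.0, 110.0, 70.0]
score = r2(actual, predicted)
print(round(score, 4))
```

A score of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean price.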

#### Lasso

In [114]:
lasso=Lasso()

In [115]:
pipe_lasso=make_pipeline(column_transformer,scaler,lasso)

In [118]:
pipe_lasso.fit(x_train,y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(sparse=False),
                                                  ['location'])])),
                ('standardscaler', StandardScaler()), ('lasso', Lasso())])

In [119]:
y_predict_lasso=pipe_lasso.predict(x_test)

In [120]:
r2_score(y_test,y_predict_lasso)

0.8263029869374969

#### Ridge

In [121]:
ridge=Ridge()

In [122]:
pipe_ridg=make_pipeline(column_transformer,scaler,ridge)

In [123]:
pipe_ridg.fit(x_train,y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(sparse=False),
                                                  ['location'])])),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

In [125]:
y_pred_ridge=pipe_ridg.predict(x_test)

In [126]:
r2_score(y_test,y_pred_ridge)

0.8383227066936583

In [130]:
print(f'''"LinearRegression":{r2_score(y_test,y_pred_lr)}
"Lasso":{r2_score(y_test,y_predict_lasso)}
"Ridge":{r2_score(y_test,y_pred_ridge)}''')

"LinearRegression":0.8381860339652341
"Lasso":0.8263029869374969
"Ridge":0.8383227066936583


#### Since the Ridge regression model gives the highest R2 score (0.8383), we will save it with pickle.

In [131]:
import pickle
# Use a context manager so the file is closed after writing.
with open('Ridge_Model.pkl','wb') as f:
    pickle.dump(pipe_ridg,f)