# 🏠 House Price Prediction using Machine Learning

## 📌 Problem Statement
The goal of this project is to predict house prices based on features such as location, total square feet, number of bathrooms, balconies, and BHK configuration.

## 📊 Dataset
Bengaluru House Price Dataset containing housing information including:
- Area type
- Location
- Total square feet
- Bathrooms
- Balconies
- Price

## ⚙️ Workflow
1. Data Cleaning
2. Feature Engineering
3. Outlier Removal
4. One-Hot Encoding
5. Model Training (Linear Regression)
6. Model Evaluation
7. Price Prediction Function

## 📈 Model Performance
- Algorithm: Linear Regression
- R² Score: 0.849
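
The R² score is the coefficient of determination, which is what scikit-learn's `LinearRegression.score` returns for a regressor. As a minimal sketch of how that number is computed, using made-up values rather than the project's data (the helper name `r2_score_manual` is hypothetical):

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (not taken from the dataset)
actual = [50.0, 72.0, 120.0, 95.0]
predicted = [55.0, 70.0, 110.0, 100.0]

r2 = r2_score_manual(actual, predicted)
print(round(r2, 3))
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean price.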

## 👨‍💻 Author
Ujjawal Shrivastava


# House Price Prediction using Machine Learning

## Objective
The goal of this project is to predict house prices based on area, number of bedrooms, bathrooms, and location using Machine Learning.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


In [2]:
df = pd.read_csv('../data/bengaluru_house_prices.csv')
df.head()


Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [3]:
df.shape


(13320, 9)

In [4]:
df.columns



Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [6]:
df.describe()


Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


In [7]:
df.isnull().sum()


area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [8]:
df.columns



Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

In [9]:
df = df.drop(['society', 'availability'], axis=1)


In [10]:
df.head()


Unnamed: 0,area_type,location,size,total_sqft,bath,balcony,price
0,Super built-up Area,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Plot Area,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Built-up Area,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Super built-up Area,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Super built-up Area,Kothanur,2 BHK,1200,2.0,1.0,51.0


In [11]:
df['bhk'] = df['size'].apply(lambda x: int(x.split(' ')[0]))


AttributeError: 'float' object has no attribute 'split'

In [12]:
df['size'].isnull().sum()


np.int64(16)

In [13]:
df = df.dropna(subset=['size'])


In [14]:
df['size'].isnull().sum()


np.int64(0)

In [15]:
df['bhk'] = df['size'].apply(lambda x: int(x.split(' ')[0]))
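
The earlier `AttributeError` occurs because missing `size` values are stored as float `NaN`, which has no `.split` method. As an alternative sketch that tolerates missing values instead of requiring them to be dropped first, the `.str` accessor can be used. The `toy` frame below is hypothetical; its values mirror the rows shown above plus one missing entry:

```python
import pandas as pd

toy = pd.DataFrame({"size": ["2 BHK", "4 Bedroom", None, "3 BHK"]})

# .str.extract returns NaN for missing rows instead of raising
bhk = toy["size"].str.extract(r"^(\d+)", expand=False)

# Nullable integer dtype keeps the missing entry as <NA>
toy["bhk"] = pd.to_numeric(bhk).astype("Int64")
print(toy)
```

Rows with a missing `bhk` would still have to be dropped or filled before training, so this changes where the missing values are handled, not whether they are.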


In [16]:
df.head()

Unnamed: 0,area_type,location,size,total_sqft,bath,balcony,price,bhk
0,Super built-up Area,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07,2
1,Plot Area,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0,4
2,Built-up Area,Uttarahalli,3 BHK,1440,2.0,3.0,62.0,3
3,Super built-up Area,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0,3
4,Super built-up Area,Kothanur,2 BHK,1200,2.0,1.0,51.0,2


In [17]:
df = df.drop('size', axis=1)


In [24]:
df.head()

Unnamed: 0,area_type,location,total_sqft,bath,balcony,price,bhk
0,Super built-up Area,Electronic City Phase II,1056,2.0,1.0,39.07,2
1,Plot Area,Chikka Tirupathi,2600,5.0,3.0,120.0,4
2,Built-up Area,Uttarahalli,1440,2.0,3.0,62.0,3
3,Super built-up Area,Lingadheeranahalli,1521,3.0,1.0,95.0,3
4,Super built-up Area,Kothanur,1200,2.0,1.0,51.0,2


In [25]:
df['total_sqft'].unique()


array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

In [26]:
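def is_float(x):
    # True if x parses as a single number; False for ranges or values with units
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True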


In [27]:
df[~df['total_sqft'].apply(is_float)].head(10)


Unnamed: 0,area_type,location,total_sqft,bath,balcony,price,bhk
30,Super built-up Area,Yelahanka,2100 - 2850,4.0,0.0,186.0,4
56,Built-up Area,Devanahalli,3010 - 3410,,,192.0,4
81,Built-up Area,Hennur Road,2957 - 3450,,,224.5,4
122,Super built-up Area,Hebbal,3067 - 8156,4.0,0.0,477.0,4
137,Super built-up Area,8th Phase JP Nagar,1042 - 1105,2.0,0.0,54.005,2
165,Super built-up Area,Sarjapur,1145 - 1340,2.0,0.0,43.49,2
188,Super built-up Area,KR Puram,1015 - 1540,2.0,0.0,56.8,2
224,Super built-up Area,Devanahalli,1520 - 1740,,,74.82,3
410,Super built-up Area,Kengeri,34.46Sq. Meter,1.0,0.0,18.5,1
549,Super built-up Area,Hennur Road,1195 - 1440,2.0,0.0,63.77,2


In [28]:
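def convert_sqft(x):
    # Ranges such as '2100 - 2850' become their midpoint;
    # values that cannot be parsed (e.g. '34.46Sq. Meter') become None
    tokens = x.split('-')

    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        return None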
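
`convert_sqft` returns `None` for values that carry a unit, such as `34.46Sq. Meter`, so those rows are dropped in the next step. As a hedged sketch of one possible extension (not part of the original notebook), square-metre values could be converted instead. The names `convert_sqft_with_units` and `SQM_TO_SQFT` are hypothetical, and only the `Sq. Meter` unit is handled:

```python
import re

SQM_TO_SQFT = 10.7639  # 1 square metre is about 10.7639 square feet


def convert_sqft_with_units(x):
    """Hypothetical variant of convert_sqft that also accepts 'Sq. Meter' values."""
    text = str(x).strip()

    # Range such as '2100 - 2850': use the midpoint
    parts = text.split("-")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None

    # Value such as '34.46Sq. Meter': convert to square feet
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*Sq\.\s*Meter", text)
    if match:
        return float(match.group(1)) * SQM_TO_SQFT

    # Plain number, or None for any other unit
    try:
        return float(text)
    except ValueError:
        return None


print(convert_sqft_with_units("34.46Sq. Meter"))
```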


In [29]:
df['total_sqft'] = df['total_sqft'].apply(convert_sqft)


In [30]:
df = df.dropna(subset=['total_sqft'])


In [31]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 13258 entries, 0 to 13319
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   area_type   13258 non-null  object 
 1   location    13257 non-null  object 
 2   total_sqft  13258 non-null  float64
 3   bath        13201 non-null  float64
 4   balcony     12669 non-null  float64
 5   price       13258 non-null  float64
 6   bhk         13258 non-null  int64  
dtypes: float64(4), int64(1), object(2)
memory usage: 828.6+ KB


In [33]:
df.loc[:, 'price_per_sqft'] = (df['price'] * 100000) / df['total_sqft']


In [34]:
df[['total_sqft','price','price_per_sqft']].head()


Unnamed: 0,total_sqft,price,price_per_sqft
0,1056.0,39.07,3699.810606
1,2600.0,120.0,4615.384615
2,1440.0,62.0,4305.555556
3,1521.0,95.0,6245.890861
4,1200.0,51.0,4250.0
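
The factor of 100000 suggests that `price` is recorded in lakhs of rupees (1 lakh = 100,000), which would make `price_per_sqft` a value in rupees per square foot. That reading is an inference from the formula, not something the notebook states. Checking row 0 by hand:

```python
price_lakh = 39.07   # price of row 0, assumed to be in lakhs
total_sqft = 1056.0  # area of row 0

# Convert lakhs to rupees, then divide by the area
price_per_sqft = price_lakh * 100_000 / total_sqft
print(round(price_per_sqft, 6))  # 3699.810606
```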


In [35]:
df['price_per_sqft'].describe()


count    1.325800e+04
mean     7.912634e+03
std      1.064936e+05
min      2.678298e+02
25%      4.271229e+03
50%      5.438331e+03
75%      7.313266e+03
max      1.200000e+07
Name: price_per_sqft, dtype: float64

In [36]:
df.sort_values('price_per_sqft', ascending=False).head()


Unnamed: 0,area_type,location,total_sqft,bath,balcony,price,bhk,price_per_sqft
4086,Plot Area,Sarjapur Road,1.0,4.0,,120.0,4,12000000.0
4972,Built-up Area,Srirampuram,5.0,7.0,3.0,115.0,7,2300000.0
349,Plot Area,Suragajakkanahalli,11.0,3.0,2.0,74.0,3,672727.3
1122,Built-up Area,Grihalakshmi Layout,24.0,2.0,2.0,150.0,5,625000.0
11558,Plot Area,Whitefield,60.0,4.0,2.0,218.0,4,363333.3


In [37]:
df = df[df['price_per_sqft'] < 30000]


In [38]:
df['price_per_sqft'].describe()


count    13209.000000
mean      6577.456263
std       3842.051914
min        267.829813
25%       4265.734266
50%       5421.184320
75%       7274.490786
max      29629.629630
Name: price_per_sqft, dtype: float64

In [39]:
df['location'].nunique()


1294

In [40]:
df['location'] = df['location'].apply(lambda x: x.strip())


AttributeError: 'float' object has no attribute 'strip'

In [41]:
df['location'].isnull().sum()


np.int64(1)

In [42]:
df = df.dropna(subset=['location'])


In [43]:
df['location'] = df['location'].apply(lambda x: x.strip())


In [44]:
location_stats = df['location'].value_counts()
location_stats.head(10)


location
Whitefield               536
Sarjapur  Road           396
Electronic City          303
Kanakpura Road           271
Thanisandra              236
Yelahanka                212
Uttarahalli              185
Hebbal                   177
Marathahalli             175
Raja Rajeshwari Nagar    171
Name: count, dtype: int64

In [45]:
len(location_stats[location_stats <= 10])


1044

In [46]:
df['location'] = df['location'].apply(
    lambda x: 'other' if location_stats[x] <= 10 else x
)
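
The same grouping of rare locations can be written without a per-row lambda. A minimal sketch on hypothetical toy data, with a threshold of 2 standing in for the notebook's 10:

```python
import pandas as pd

toy = pd.DataFrame({"location": ["A"] * 3 + ["B"] * 2 + ["C"]})
min_count = 2  # hypothetical threshold for the toy data; the notebook uses 10

# Per-row count of how often that row's location occurs
counts = toy["location"].map(toy["location"].value_counts())

# Keep locations that occur more than min_count times, label the rest 'other'
toy["location"] = toy["location"].where(counts > min_count, "other")
print(toy["location"].tolist())
```

Here `A` occurs three times and is kept, while `B` and `C` fall at or below the threshold and become `other`.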


In [47]:
df['location'].nunique()


240

In [48]:
location_stats = df['location'].value_counts()
location_stats.tail(10)


location
Pattandur Agrahara           11
Narayanapura                 11
Nehru Nagar                  11
Banjara Layout               11
Kodigehalli                  11
Tindlu                       11
2nd Phase Judicial Layout    11
Marsur                       11
Doddaballapur                11
HAL 2nd Stage                11
Name: count, dtype: int64

In [50]:
len(location_stats[location_stats <= 10])



0

In [51]:
df = df[df['total_sqft'] / df['bhk'] >= 300]


In [52]:
df.shape


(12479, 8)

In [53]:
def remove_pps_outliers(df):
    # Within each location, keep rows whose price_per_sqft lies within
    # one standard deviation of that location's mean
    kept = []

    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)

        reduced_df = subdf[
            (subdf.price_per_sqft > (m - st)) &
            (subdf.price_per_sqft <= (m + st))
        ]
        kept.append(reduced_df)

    # Concatenate once at the end instead of once per location
    return pd.concat(kept, ignore_index=True)
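
The per-location filter can also be expressed with `groupby(...).transform`, which avoids the explicit loop. This is a sketch on hypothetical toy data, not the notebook's code; it assumes `np.std` on a Series uses the population standard deviation (`ddof=0`), and it keeps the original row order rather than ordering rows by location:

```python
import pandas as pd


def remove_pps_outliers_vectorized(df):
    """Hypothetical vectorized counterpart of remove_pps_outliers."""
    grouped = df.groupby("location")["price_per_sqft"]
    mean = grouped.transform("mean")
    # Population standard deviation per location, broadcast back to each row
    std = grouped.transform(lambda s: s.std(ddof=0))
    keep = (df["price_per_sqft"] > mean - std) & (df["price_per_sqft"] <= mean + std)
    return df[keep].reset_index(drop=True)


toy = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B"],
    "price_per_sqft": [4000.0, 5000.0, 6000.0, 20000.0, 3000.0, 3100.0, 3200.0],
})

result = remove_pps_outliers_vectorized(toy)
print(result)
```

In the toy data, the 20000 entry in location `A` is removed, and in location `B` only the middle value survives because the band of one standard deviation is narrow.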


In [54]:
df = remove_pps_outliers(df)


In [55]:
df.shape


(9958, 8)

## Model Building

In [56]:
df = df.drop('price_per_sqft', axis=1)


In [57]:
df.head()


Unnamed: 0,area_type,location,total_sqft,bath,balcony,price,bhk
0,Super built-up Area,1st Block Jayanagar,2850.0,4.0,1.0,428.0,4
1,Super built-up Area,1st Block Jayanagar,1630.0,3.0,2.0,194.0,3
2,Super built-up Area,1st Block Jayanagar,1875.0,2.0,3.0,235.0,3
3,Built-up Area,1st Block Jayanagar,1200.0,2.0,0.0,130.0,3
4,Super built-up Area,1st Block Jayanagar,1235.0,2.0,2.0,148.0,2


In [58]:
dummies = pd.get_dummies(df['location'])
dummies.head()


Unnamed: 0,1st Block Jayanagar,1st Phase JP Nagar,2nd Phase Judicial Layout,2nd Stage Nagarbhavi,5th Block Hbr Layout,5th Phase JP Nagar,6th Phase JP Nagar,7th Phase JP Nagar,8th Phase JP Nagar,9th Phase JP Nagar,...,Vishveshwarya Layout,Vishwapriya Layout,Vittasandra,Whitefield,Yelachenahalli,Yelahanka,Yelahanka New Town,Yelenahalli,Yeshwanthpur,other
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [59]:
df = pd.concat([df, dummies], axis=1)


In [60]:
df = df.drop('location', axis=1)
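
The cells above encode `location` by hand in three steps. `pd.get_dummies` can also encode several categorical columns in one call and drop the originals automatically. A minimal sketch on hypothetical toy data; encoding `area_type` in the same call is an assumption about what the model needs, since every remaining text column has to be numeric before fitting:

```python
import pandas as pd

toy = pd.DataFrame({
    "area_type": ["Plot  Area", "Built-up  Area", "Plot  Area"],
    "location": ["Whitefield", "other", "Hebbal"],
    "total_sqft": [1200.0, 900.0, 1500.0],
})

# Encode both categorical columns at once; the source columns are removed.
# An empty prefix and separator keep the bare category names as column names.
encoded = pd.get_dummies(
    toy,
    columns=["location", "area_type"],
    prefix="",
    prefix_sep="",
    dtype=int,
)
print(encoded.columns.tolist())
```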


## Prepare X and y (Model Input & Output)

In [62]:
X = df.drop('price', axis=1)
y = df['price']


In [63]:
X.shape
y.shape


(9958,)

In [65]:
from sklearn.model_selection import train_test_split


In [66]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)


In [67]:
X_train.shape
X_test.shape


(1992, 245)

In [68]:
from sklearn.linear_model import LinearRegression


In [69]:
lr_clf = LinearRegression()


In [70]:
lr_clf.fit(X_train, y_train)


ValueError: could not convert string to float: 'Super built-up  Area'

In [71]:
area_dummies = pd.get_dummies(df['area_type'])
area_dummies.head()


Unnamed: 0,Built-up Area,Carpet Area,Plot Area,Super built-up Area
0,False,False,False,True
1,False,False,False,True
2,False,False,False,True
3,True,False,False,False
4,False,False,False,True


In [72]:
df = pd.concat([df, area_dummies], axis=1)


In [73]:
df = df.drop('area_type', axis=1)


## Recreate X and y After Encoding area_type

In [78]:
X = df.drop('price', axis=1)
y = df['price']


In [79]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)


In [80]:
lr_clf.fit(X_train, y_train)


ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [81]:
X.isnull().sum()


total_sqft                0
bath                     51
balcony                 339
bhk                       0
1st Block Jayanagar       0
                       ... 
other                     0
Built-up  Area            0
Carpet  Area              0
Plot  Area                0
Super built-up  Area      0
Length: 248, dtype: int64

In [82]:
df = df.dropna()
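
`dropna()` discards every row with a missing `bath` or `balcony` value (51 and 339 missing values respectively, per the counts above). One alternative, shown here only as a sketch on hypothetical toy data, is to fill the gaps with the column median. Whether this improves the model has not been tested in this notebook:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bath": [2.0, np.nan, 3.0, 2.0],
    "balcony": [1.0, 2.0, np.nan, 3.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})

# Fill each numeric gap with that column's median instead of dropping the row
for col in ["bath", "balcony"]:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

In a real pipeline the medians would normally be computed on the training split only, so that no information from the test set leaks into training.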


In [83]:
df.isnull().sum()


total_sqft              0
bath                    0
balcony                 0
price                   0
bhk                     0
                       ..
other                   0
Built-up  Area          0
Carpet  Area            0
Plot  Area              0
Super built-up  Area    0
Length: 249, dtype: int64

In [84]:
X = df.drop('price', axis=1)
y = df['price']


In [85]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)


In [86]:
lr_clf.fit(X_train, y_train)


In [87]:
lr_clf.score(X_test, y_test)


0.8491672284515551

In [88]:
lr_clf.predict(X_test[0:1])


array([81.06278349])

In [89]:
y_test.iloc[0]


np.float64(76.18)

## Reusable Prediction Function

In [90]:
X.columns


Index(['total_sqft', 'bath', 'balcony', 'bhk', '1st Block Jayanagar',
       '1st Phase JP Nagar', '2nd Phase Judicial Layout',
       '2nd Stage Nagarbhavi', '5th Block Hbr Layout', '5th Phase JP Nagar',
       ...
       'Yelachenahalli', 'Yelahanka', 'Yelahanka New Town', 'Yelenahalli',
       'Yeshwanthpur', 'other', 'Built-up  Area', 'Carpet  Area', 'Plot  Area',
       'Super built-up  Area'],
      dtype='object', length=248)

In [91]:
columns = X.columns


In [92]:
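def predict_price(location, sqft, bath, balcony, bhk, area_type):

    # Start from an all-zero feature vector in the training column order
    x = np.zeros(len(columns))

    # Numeric features
    x[columns.get_loc('total_sqft')] = sqft
    x[columns.get_loc('bath')] = bath
    x[columns.get_loc('balcony')] = balcony
    x[columns.get_loc('bhk')] = bhk

    # One-hot location; locations not seen in training fall back to 'other'
    loc_col = location if location in columns else 'other'
    x[columns.get_loc(loc_col)] = 1

    # One-hot area type; the name must match the column exactly,
    # including the double space (e.g. 'Super built-up  Area')
    if area_type in columns:
        x[columns.get_loc(area_type)] = 1

    # Wrap in a DataFrame so the feature names match those seen during fit
    return lr_clf.predict(pd.DataFrame([x], columns=columns))[0]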


In [93]:
predict_price(
    'Whitefield',
    1200,
    2,
    1,
    2,
    'Super built-up  Area'
)




np.float64(65.54302261989521)
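
To make the input encoding concrete, the sketch below builds the feature vector for a shortened, hypothetical column list (the real `columns` has 248 entries). The helper `build_features` is illustrative and not part of the notebook. It also shows why the area type must be spelled with the double space that appears in `X.columns`:

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for X.columns
toy_columns = pd.Index([
    "total_sqft", "bath", "balcony", "bhk",
    "Hebbal", "Whitefield", "other",
    "Plot  Area", "Super built-up  Area",
])


def build_features(location, sqft, bath, balcony, bhk, area_type):
    x = np.zeros(len(toy_columns))
    x[:4] = [sqft, bath, balcony, bhk]          # numeric features come first
    if location in toy_columns:
        x[toy_columns.get_loc(location)] = 1    # one-hot location
    if area_type in toy_columns:
        x[toy_columns.get_loc(area_type)] = 1   # one-hot area type
    return x


x = build_features("Whitefield", 1200, 2, 1, 2, "Super built-up  Area")
print(x)

# With a single space the name matches no column, so no area-type flag is set
x_single = build_features("Whitefield", 1200, 2, 1, 2, "Super built-up Area")
print(x_single[-2:])
```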

In [94]:
import pickle


In [95]:
with open('../models/house_price_model.pickle', 'wb') as f:
    pickle.dump(lr_clf, f)


In [96]:
with open('../models/columns.pickle', 'wb') as f:
    pickle.dump(columns, f)


In [97]:
with open('../models/house_price_model.pickle', 'rb') as f:
    model = pickle.load(f)

model.predict(X_test[0:1])


array([81.06278349])
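
Pickling `columns` stores a pandas `Index`, so loading it later requires a compatible pandas installation, and pickle files should only be loaded from trusted sources. As a possible alternative for the column list (a sketch, not what the notebook does), plain JSON is portable across tools. The file name `columns.json`, the key `data_columns`, and the shortened column list are hypothetical:

```python
import json
import os
import tempfile

# Hypothetical, shortened stand-in for list(X.columns)
columns = ["total_sqft", "bath", "balcony", "bhk", "Whitefield", "other"]

# Write to a temporary directory so the sketch does not touch ../models
path = os.path.join(tempfile.mkdtemp(), "columns.json")

with open(path, "w") as f:
    json.dump({"data_columns": list(columns)}, f)

with open(path) as f:
    loaded = json.load(f)["data_columns"]

print(loaded == columns)
```

The model itself still needs pickle or a similar serializer; only the column names move to JSON in this sketch.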