# Project Goal:
* Develop a machine learning model to predict house prices in Bengaluru based on features like location, size, etc.

# Steps:
* Data Collection.
* Data Exploration.
* Handle Categorical Data (impute and encode).
* Feature Engineering (create new features, drop unnecessary ones).
* Handle Skewness in numerical data (log transformation, Box-Cox, or Yeo-Johnson transformation).
* Scale Numerical Data.
* Data Splitting (train/test split).
* Model Selection and Training.
* Model Evaluation.
* Model Tuning and Optimization (if needed).
* Model Deployment (if required).
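
The skewness step in the list above can be sketched as follows. This is a minimal illustration on a synthetic right-skewed array standing in for a column such as price; the data is invented, and the choice of `scipy.stats.skew` to measure the effect is my assumption, not something used elsewhere in this notebook.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample (log-normal); a stand-in, not the real data.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=4.0, sigma=0.8, size=2000).reshape(-1, 1)

# Option 1: log transform (needs non-negative values).
log_values = np.log1p(values)

# Option 2: Box-Cox (needs strictly positive values).
boxcox_values = PowerTransformer(method="box-cox").fit_transform(values)

# Option 3: Yeo-Johnson (also accepts zero and negative values).
yeojohnson_values = PowerTransformer(method="yeo-johnson").fit_transform(values)

# Compare skewness before and after each transformation.
skewness = {
    "raw": float(skew(values.ravel())),
    "log1p": float(skew(log_values.ravel())),
    "box-cox": float(skew(boxcox_values.ravel())),
    "yeo-johnson": float(skew(yeojohnson_values.ravel())),
}
print(skewness)
```

A skewness close to zero after the transformation suggests the column is closer to symmetric; which option fits best depends on whether the column contains zeros or negative values.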

# Phase 1: Problem Understanding

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("default", category=DeprecationWarning)


In [None]:
# Download the data
import kagglehub

# Download latest version
path = kagglehub.dataset_download("amitabhajoy/bengaluru-house-price-data")

print("Path to dataset files:", path)




Path to dataset files: /root/.cache/kagglehub/datasets/amitabhajoy/bengaluru-house-price-data/versions/2


In [None]:
# Load the data from the CSV file
dataset_path = "/root/.cache/kagglehub/datasets/amitabhajoy/bengaluru-house-price-data/versions/2/Bengaluru_House_Data.csv"

# Load the dataset
data = pd.read_csv(dataset_path)
data.head()


Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0




In [None]:
# display the columns
print("columns :",data.columns)


columns : Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')


In [None]:
data.shape

(13320, 9)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [None]:
# Statistical exploration
data.describe(include = "all")



Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
count,13320,13320,13319,13304,7818,13320.0,13247.0,12711.0,13320.0
unique,4,81,1305,31,2688,2117.0,,,
top,Super built-up Area,Ready To Move,Whitefield,2 BHK,GrrvaGr,1200.0,,,
freq,8790,10581,540,5199,80,843.0,,,
mean,,,,,,,2.69261,1.584376,112.565627
std,,,,,,,1.341458,0.817263,148.971674
min,,,,,,,1.0,0.0,8.0
25%,,,,,,,2.0,1.0,50.0
50%,,,,,,,2.0,2.0,72.0
75%,,,,,,,3.0,2.0,120.0


### location:

* Unique Values: 1305
High cardinality: there are many distinct locations. "Whitefield" appears most frequently (540 occurrences).
Action: I will reduce dimensionality by grouping locations into broader categories.
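
One possible way to group locations is to fold rarely seen ones into a single bucket. The sketch below uses a toy series; the helper name `group_rare_categories`, the `min_count` threshold and the `"other"` label are hypothetical choices for illustration.

```python
import pandas as pd

def group_rare_categories(series, min_count=10, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    cleaned = series.str.strip()          # "Kothanur " and "Kothanur" become one value
    counts = cleaned.value_counts()
    rare = counts[counts < min_count].index
    return cleaned.where(~cleaned.isin(rare), other_label)

# Toy data, not the real column.
toy = pd.Series(["Whitefield"] * 5 + ["Kothanur "] * 3 + ["Uttarahalli"])
grouped = group_rare_categories(toy, min_count=3)
print(grouped.value_counts())
```

On the real column the threshold would need tuning: a higher `min_count` gives fewer categories but hides more location detail.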

### society:

* Unique Values: 2688
This column holds society names (for example "Coomee" or "Theanmp"). High cardinality makes it hard to encode directly.
Action: 5,502 of the 13,320 rows have no society value and the rest are spread over 2,688 names, so I will drop this column.
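
A drop decision like this can be backed with two numbers per column: the share of missing values and the count of distinct values. A minimal sketch on a toy frame (the helper `profile_columns` is a hypothetical name):

```python
import numpy as np
import pandas as pd

def profile_columns(frame):
    """Summarise missing share and distinct-value count per column."""
    return pd.DataFrame({
        "missing_share": frame.isna().mean(),
        "n_unique": frame.nunique(),
    })

# Toy data, not the real dataset.
toy = pd.DataFrame({
    "society": ["Coomee", np.nan, "Theanmp", np.nan, "Soiewre"],
    "location": ["Kothanur", "Kothanur", "Uttarahalli", "Kothanur", "Uttarahalli"],
})
profile = profile_columns(toy)
print(profile)
```

Applied to the real data, the society column would show a missing share of about 41% (5,502 of 13,320 rows).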

In [None]:
# missing values
data.isnull().sum()



Unnamed: 0,0
area_type,0
availability,0
location,1
size,16
society,5502
total_sqft,0
bath,73
balcony,609
price,0


# Visualize distributions

#  Handle Missing Data

In [None]:
data["location"] = data["location"].fillna(data["location"].mode()[0])
data["size"] = data["size"].fillna(data["size"].mode()[0])
data.drop("society", axis = 1, inplace = True)
data["bath"] = data["bath"].fillna(data["bath"].mean())
data["balcony"] = data["balcony"].fillna(data["balcony"].mean())

data.isnull().sum()


Unnamed: 0,0
area_type,0
availability,0
location,0
size,0
total_sqft,0
bath,0
balcony,0
price,0
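
Filling `bath` and `balcony` with the mean produces fractional counts (the mean of `bath` is about 2.69 in the summary table). One possible alternative is median imputation, sketched here on a toy frame with invented values using scikit-learn's `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, not the real dataset.
toy = pd.DataFrame({
    "bath": [2.0, np.nan, 3.0, 2.0, 5.0, 2.0],
    "balcony": [1.0, 2.0, np.nan, 2.0, 0.0, 1.0],
})

# The median is less sensitive to a few very large values than the mean.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

In a modelling workflow the imputer would ideally be fitted on the training rows only and then applied to the test rows.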


# Handle Categorical Data

In [None]:
data.dtypes



Unnamed: 0,0
area_type,object
availability,object
location,object
size,object
total_sqft,object
bath,float64
balcony,float64
price,float64


In [None]:
data.columns[0:5]



Index(['area_type', 'availability', 'location', 'size', 'total_sqft'], dtype='object')

In [None]:
data = pd.get_dummies(data, columns=['area_type', 'size', 'availability'], drop_first=True)
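
One-hot encoding `size` and `availability` creates many sparse columns (the frame ends up with 118 columns). A possible lighter alternative, sketched on a toy frame, is to extract the leading number from `size` and reduce `availability` to a single flag. The column names `rooms` and `ready_to_move` are hypothetical.

```python
import pandas as pd

# Toy data in the same style as the real columns.
toy = pd.DataFrame({
    "size": ["2 BHK", "4 Bedroom", "1 RK", None],
    "availability": ["19-Dec", "Ready To Move", "Immediate Possession", "Ready To Move"],
})

# Leading digits of "2 BHK" / "4 Bedroom" / "1 RK"; missing sizes stay NaN.
toy["rooms"] = toy["size"].str.extract(r"^(\d+)", expand=False).astype(float)

# True only for the exact label "Ready To Move".
toy["ready_to_move"] = toy["availability"].eq("Ready To Move")
print(toy)
```

This loses the distinction between labels such as "BHK" and "RK", so it is a trade-off rather than a strict improvement.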



In [None]:
!pip install category_encoders

import category_encoders as ce
encoder = ce.TargetEncoder(cols=['location','total_sqft'])
data[['location','total_sqft']] = encoder.fit_transform(data[['location','total_sqft']], data['price'])
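
Target-encoding `total_sqft` treats an area as a category label, and fitting the encoder on all rows before the split lets test-row prices influence the features. A possible alternative for `total_sqft` is to parse it into a number. The sketch below assumes the non-numeric entries are ranges written like "1133 - 1384" and that anything else unparseable should become NaN; both the helper `parse_sqft` and the example strings are illustrative.

```python
import numpy as np
import pandas as pd

def parse_sqft(value):
    """Convert a total_sqft entry to float; average 'low - high' ranges."""
    text = str(value).strip()
    parts = text.split(" - ")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(text)
    except ValueError:
        # Entries with units or other text are left as missing.
        return np.nan

# Toy entries; the real column may contain other formats.
toy = pd.Series(["1056", "1133 - 1384", "34.46Sq. Meter", "2600"])
parsed = toy.apply(parse_sqft)
print(parsed)
```

Rows that come back as NaN would then need the same imputation treatment as the other numeric columns.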





In [None]:
data.dtypes



Unnamed: 0,0
location,float64
total_sqft,float64
bath,float64
balcony,float64
price,float64
...,...
availability_22-Mar,bool
availability_22-May,bool
availability_22-Nov,bool
availability_Immediate Possession,bool


In [None]:
data.head()



Unnamed: 0,location,total_sqft,bath,balcony,price,area_type_Carpet Area,area_type_Plot Area,area_type_Super built-up Area,size_1 Bedroom,size_1 RK,...,availability_21-Oct,availability_21-Sep,availability_22-Dec,availability_22-Jan,availability_22-Jun,availability_22-Mar,availability_22-May,availability_22-Nov,availability_Immediate Possession,availability_Ready To Move
0,48.317545,101.463677,2.0,1.0,39.07,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,113.608351,182.847191,5.0,3.0,120.0,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,61.25253,108.945405,2.0,3.0,62.0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,114.16409,108.027498,3.0,1.0,95.0,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,True
4,95.79884,110.794923,2.0,1.0,51.0,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,True


In [None]:
data.columns



Index(['location', 'total_sqft', 'bath', 'balcony', 'price',
       'area_type_Carpet  Area', 'area_type_Plot  Area',
       'area_type_Super built-up  Area', 'size_1 Bedroom', 'size_1 RK',
       ...
       'availability_21-Oct', 'availability_21-Sep', 'availability_22-Dec',
       'availability_22-Jan', 'availability_22-Jun', 'availability_22-Mar',
       'availability_22-May', 'availability_22-Nov',
       'availability_Immediate Possession', 'availability_Ready To Move'],
      dtype='object', length=118)

# Feature Engineering

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
scaled_data = pd.DataFrame(scaled, columns=data.columns)

scaled_data.head()



Unnamed: 0,location,total_sqft,bath,balcony,price,area_type_Carpet Area,area_type_Plot Area,area_type_Super built-up Area,size_1 Bedroom,size_1 RK,...,availability_21-Oct,availability_21-Sep,availability_22-Dec,availability_22-Jan,availability_22-Jun,availability_22-Mar,availability_22-May,availability_22-Nov,availability_Immediate Possession,availability_Ready To Move
0,-1.33265,-0.103785,-0.517751,-0.731997,-0.493372,-0.081083,-0.423418,0.717885,-0.089138,-0.031256,...,-0.021229,-0.021229,-0.028749,-0.021229,-0.037795,-0.015009,-0.02741,-0.012254,-0.034679,-1.965474
1,0.159908,1.611368,1.724859,1.773231,0.049906,-0.081083,2.361732,-1.392981,-0.089138,-0.031256,...,-0.021229,-0.021229,-0.028749,-0.021229,-0.037795,-0.015009,-0.02741,-0.012254,-0.034679,0.508783
2,-1.036954,0.053892,-0.517751,1.773231,-0.339444,-0.081083,-0.423418,-1.392981,-0.089138,-0.031256,...,-0.021229,-0.021229,-0.028749,-0.021229,-0.037795,-0.015009,-0.02741,-0.012254,-0.034679,0.508783
3,0.172612,0.034547,0.229786,-0.731997,-0.117917,-0.081083,-0.423418,0.717885,-0.089138,-0.031256,...,-0.021229,-0.021229,-0.028749,-0.021229,-0.037795,-0.015009,-0.02741,-0.012254,-0.034679,0.508783
4,-0.247221,0.09287,-0.517751,-0.731997,-0.413286,-0.081083,-0.423418,0.717885,-0.089138,-0.031256,...,-0.021229,-0.021229,-0.028749,-0.021229,-0.037795,-0.015009,-0.02741,-0.012254,-0.034679,0.508783
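
The scaler above is fitted on every row and every column, including `price` and the test rows. A common alternative is to split first, fit the scaler on the training features only, and leave the target in its original units. A minimal sketch on synthetic data (the columns and coefficients are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real frame.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "total_sqft": rng.uniform(500, 3000, size=200),
    "bath": rng.integers(1, 5, size=200).astype(float),
})
toy["price"] = 0.05 * toy["total_sqft"] + 10 * toy["bath"] + rng.normal(0, 5, size=200)

X = toy.drop(columns=["price"])
y = toy["price"]  # target stays in its original units

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse those statistics
print(X_train_scaled.shape, X_test_scaled.shape)
```

With the target left unscaled, error metrics such as RMSE come out directly in price units.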


# Split the Data

In [250]:

X = scaled_data.drop(columns=["price"])
y = scaled_data["price"]

print(X.shape)



(13320, 117)




In [254]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)



X_train shape: (10656, 117)
X_test shape: (2664, 117)
y_train shape: (10656,)
y_test shape: (2664,)


In [255]:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor


In [256]:
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'DecisionTree': DecisionTreeRegressor(),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'SVR': SVR(kernel='rbf'),
    'KNeighbors': KNeighborsRegressor(n_neighbors=3)
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    predictions[name] = y_pred


predictions = pd.DataFrame(predictions)
predictions['Actual'] = y_test.values


predictions


Unnamed: 0,LinearRegression,Ridge,Lasso,DecisionTree,RandomForest,GradientBoosting,SVR,KNeighbors,Actual
0,-0.893170,-0.891568,-0.473420,0.049906,-0.177720,-0.412720,-0.195360,-0.097778,-0.320648
1,-0.010985,-0.010714,0.054061,-0.131343,-0.109112,-0.187761,-0.054863,-0.122415,0.083471
2,-0.195219,-0.195482,-0.157899,-0.392812,-0.394571,-0.205962,-0.338355,-0.337206,-0.352870
3,0.142302,0.139000,0.212945,0.452683,0.300299,0.351456,0.285362,-0.084352,-0.017223
4,0.281066,0.281870,0.209390,0.922588,0.511595,0.203040,0.148498,-0.301404,0.654071
...,...,...,...,...,...,...,...,...,...
2659,-1.081786,-1.077842,-0.840149,-0.661665,-0.647635,-0.579114,-0.622188,-0.574397,-0.661665
2660,0.639001,0.640443,0.584407,1.832191,0.166779,0.507588,0.370748,1.168729,-0.456920
2661,0.252846,0.252993,0.292425,-0.151482,-0.022258,0.195403,0.045890,-0.131343,-0.319305
2662,-0.070809,-0.070991,-0.038840,-0.500554,-0.379068,-0.256031,-0.293682,-0.314517,-0.406573


In [258]:
from sklearn.metrics import mean_squared_error, r2_score

metrics = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5
    r2 = r2_score(y_test, y_pred)
    metrics.append({
        'Model': name,
        'MSE': mse,
        'RMSE': rmse,
        'R² Score': r2
    })

metrics = pd.DataFrame(metrics)

metrics


Unnamed: 0,Model,MSE,RMSE,R² Score
0,LinearRegression,3.369639e+24,1835658000000.0,-3.512138e+24
1,Ridge,0.4642105,0.68133,0.5161585
2,Lasso,0.488199,0.6987124,0.4911555
3,DecisionTree,0.3118472,0.5584328,0.6749651
4,RandomForest,0.2243847,0.4736926,0.7661263
5,GradientBoosting,0.214932,0.4636076,0.7759788
6,SVR,0.4603352,0.6784801,0.5201976
7,KNeighbors,0.4308995,0.6564293,0.5508782
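
Because `price` was standardized together with the features, the MSE and RMSE in this table are in standard-deviation units, not price units. A rough conversion multiplies the scaled RMSE by the standard deviation of `price`. The sketch below uses the value from the `describe()` output (148.971674, computed with `ddof=1`, whereas `StandardScaler` uses `ddof=0`; with 13,320 rows the difference is negligible). The helper name is hypothetical.

```python
def rmse_in_price_units(rmse_scaled, price_std):
    """Undo the target standardization for an RMSE value."""
    return rmse_scaled * price_std

price_std = 148.971674       # std of price from the summary statistics
gb_rmse_scaled = 0.4636076   # GradientBoosting RMSE from the metrics table

gb_rmse = rmse_in_price_units(gb_rmse_scaled, price_std)
print(round(gb_rmse, 2))
```

For GradientBoosting this comes to roughly 69 in the units of the `price` column. The R² score is unaffected by this rescaling.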


In [260]:
# Calculate and display the R² Score as percentage
metrics_percent = metrics.copy()
metrics_percent['R² Score (%)'] = metrics_percent['R² Score'] * 100

# Display the results with percentage R² score
print(metrics_percent[['Model', 'R² Score (%)']])


              Model  R² Score (%)
0  LinearRegression -3.512138e+26
1             Ridge  5.161585e+01
2             Lasso  4.911555e+01
3      DecisionTree  6.749651e+01
4      RandomForest  7.661263e+01
5  GradientBoosting  7.759788e+01
6               SVR  5.201976e+01
7        KNeighbors  5.508782e+01
