# 🏠 Bangalore House Price Prediction

This project builds a regression model that predicts house prices in Bangalore from features such as total area, location, and the number of bathrooms and bedrooms.
The workflow covers data cleaning, feature engineering, and a linear regression baseline.

## 📚 Importing Required Libraries


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## 📥 Load and Explore the Dataset


In [11]:
df = pd.read_csv("Bengaluru_House_Data.csv")
df.head()

|   | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |


## 🧹 Data Cleaning
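We drop the columns that are not used as features (`area_type`, `society`, `balcony`, `availability`), then drop every remaining row that has a missing value.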


In [14]:
df.drop(['area_type', 'society', 'balcony', 'availability'], axis=1, inplace=True)
df.dropna(inplace=True)
df.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64
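
Because `dropna` discards whole rows, it can be worth checking how much data it removes before committing to it. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the CSV) to compare missing-value counts and row counts before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV (illustration only)
toy = pd.DataFrame({
    "location": ["Kothanur", None, "Uttarahalli", "Kothanur"],
    "size": ["2 BHK", "3 BHK", None, "2 BHK"],
    "total_sqft": ["1200", "1440", "1521", "1056"],
    "bath": [2.0, 2.0, 3.0, np.nan],
    "price": [51.0, 62.0, 95.0, 39.07],
})

missing_before = toy.isnull().sum()  # missing values per column
cleaned = toy.dropna()               # returns a new frame; toy is unchanged
rows_dropped = len(toy) - len(cleaned)

print(missing_before)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

If the share of dropped rows turns out to be large on the real data, imputing a column such as `bath` may be preferable to dropping.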

## 🔧 Feature Engineering
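Convert `total_sqft` to a numeric value (ranges written as `low - high` are replaced by their midpoint), extract the bedroom count `bhk` from `size`, and group every location with 10 or fewer listings under `other`.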


In [17]:
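# Convert total_sqft to a float; a "low - high" range becomes its midpoint
def convert_sqft(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        pass
    tokens = str(x).split('-')
    if len(tokens) == 2:
        try:
            return (float(tokens[0]) + float(tokens[1])) / 2
        except ValueError:
            return None
    # Anything else cannot be parsed as a number or a range
    return None

df['total_sqft'] = df['total_sqft'].apply(convert_sqft)
df.dropna(inplace=True)  # drop rows whose total_sqft could not be parsed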

# Add bhk column
df['bhk'] = df['size'].apply(lambda x: int(x.split(' ')[0]))

# Reduce unique locations
df['location'] = df['location'].apply(lambda x: x.strip())
location_stats = df['location'].value_counts()
df['location'] = df['location'].apply(lambda x: 'other' if location_stats[x] <= 10 else x)
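
To make the three transformations concrete, here is a minimal sketch on made-up values. The helper name `sqft_to_float`, the sample strings, and the toy `threshold` of 2 are all assumptions for illustration (the notebook itself uses a threshold of 10).

```python
import pandas as pd

def sqft_to_float(value):
    """Parse a single number or a 'low - high' range; return None otherwise."""
    try:
        return float(value)
    except (TypeError, ValueError):
        pass
    parts = str(value).split("-")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None
    return None

# 1) Area: a plain number, a range, and an unparseable value (hypothetical samples)
samples = pd.Series(["1056", "2100 - 2850", "34.46Sq. Meter"])
parsed = samples.apply(sqft_to_float)

# 2) Bedrooms: the leading integer of the size string
sizes = pd.Series(["2 BHK", "4 Bedroom"])
bhk = sizes.apply(lambda s: int(s.split(" ")[0]))

# 3) Locations: strip whitespace, then bucket rare names under "other"
locs = pd.Series([" Kothanur", "Kothanur ", "Kothanur", "Uttarahalli"]).str.strip()
counts = locs.value_counts()
threshold = 2  # toy value; the notebook uses 10
bucketed = locs.where(locs.map(counts) > threshold, "other")

print(parsed.tolist(), bhk.tolist(), bucketed.tolist())
```

The range is averaged to 2475.0, the unparseable value becomes missing and would be removed by the following `dropna`, and the single `Uttarahalli` row is relabelled `other`.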

## ⚠️ Outlier Removal
Remove implausible listings: any row with less than 300 sq ft per bedroom is dropped.

In [20]:
df = df[df['total_sqft'] / df['bhk'] >= 300]
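
The filter is easier to audit when the ratio is stored as its own column. A small sketch with hypothetical rows (the `toy` frame and the `sqft_per_bhk` column name are assumptions, not part of the notebook):

```python
import pandas as pd

# Hypothetical listings: two plausible, two with very little area per bedroom
toy = pd.DataFrame({
    "total_sqft": [1200.0, 600.0, 1020.0, 500.0],
    "bhk": [2, 3, 6, 1],
})

toy["sqft_per_bhk"] = toy["total_sqft"] / toy["bhk"]  # 600, 200, 170, 500
keep = toy["sqft_per_bhk"] >= 300                      # same rule as the notebook
filtered = toy[keep]

print(f"Kept {len(filtered)} of {len(toy)} rows")
print(toy[~keep])  # the rows that would be removed
```

Printing the rejected rows is a quick way to confirm that the 300 sq ft cut-off removes only listings that look like data-entry errors.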

## 🧪 Train-Test Split
One-hot encode `location`, then hold out 20% of the rows as a test set.


In [23]:
dummies = pd.get_dummies(df['location'], drop_first=True)
df_model = pd.concat([df.drop(['location', 'size'], axis=1), dummies], axis=1)

X = df_model.drop('price', axis=1)
y = df_model['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
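
As a rough illustration of what `drop_first=True` does, the sketch below encodes a made-up `location` column. The frame and its values are hypothetical. `get_dummies` orders the categories and omits the first one, which then acts as the baseline represented by all-zero dummy columns.

```python
import pandas as pd

# Hypothetical locations (illustration only)
toy = pd.DataFrame({
    "location": ["Kothanur", "Uttarahalli", "other", "Kothanur"],
    "total_sqft": [1200.0, 1440.0, 900.0, 1056.0],
})

dummies = pd.get_dummies(toy["location"], drop_first=True)
encoded = pd.concat([toy.drop("location", axis=1), dummies], axis=1)

print(dummies.columns.tolist())  # the first category in sort order is missing
print(encoded.shape)
```

Here `Kothanur` sorts first and is dropped, so both `Kothanur` rows have no dummy set. In the real data the baseline is whichever location name sorts first, not necessarily `other`.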

## 📊 Model Training and Evaluation
Fit a linear regression on the training set, then report the R² score and mean squared error (MSE) on the test set.


In [36]:
model = LinearRegression()
model.fit(X_train, y_train)

In [38]:
y_pred = model.predict(X_test)
print("R² Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))

R² Score: 0.5303123010113816
MSE: 11479.5025915459
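
MSE is in squared price units, so its square root (RMSE) is usually easier to interpret. The sketch below computes both metrics by hand on made-up arrays and then converts the MSE reported above into an RMSE. The arrays `y_true` and `y_hat` are hypothetical; only `reported_mse` comes from the notebook output.

```python
import math
import numpy as np

# Hypothetical targets and predictions (not from the notebook)
y_true = np.array([50.0, 60.0, 100.0, 120.0])
y_hat = np.array([55.0, 58.0, 90.0, 130.0])

# MSE: mean of squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE for the MSE printed by the notebook, in the same units as `price`
reported_mse = 11479.5025915459
reported_rmse = math.sqrt(reported_mse)

print(mse, r2, reported_rmse)
```

The reported MSE corresponds to an RMSE of roughly 107 in the units of the `price` column. Assuming `price` is recorded in lakh rupees, as this dataset is commonly described, a typical prediction error is on the order of 107 lakh, which is large next to the prices in the sample rows.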


## 💾 Save the Model


In [40]:
import joblib
joblib.dump(model, "bangalore_house_price_model.pkl")

['bangalore_house_price_model.pkl']
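
A saved model can only be reused if new inputs have the same columns, in the same order, as the training matrix `X`. The helper below is one possible way to build such a row. `make_input_row` and the example `feature_columns` list are hypothetical; in practice the column list would come from `X.columns` and would need to be saved alongside the model.

```python
import pandas as pd

def make_input_row(feature_columns, location, total_sqft, bath, bhk):
    """Build a one-row frame whose columns match the training matrix."""
    row = pd.DataFrame(0.0, index=[0], columns=list(feature_columns))
    row.loc[0, ["total_sqft", "bath", "bhk"]] = [total_sqft, bath, bhk]
    # Set the matching one-hot column; a location without a column
    # leaves every dummy at zero, which is the dropped baseline category
    if location in row.columns:
        row.loc[0, location] = 1.0
    return row

# Hypothetical column list standing in for list(X.columns)
feature_columns = ["total_sqft", "bath", "bhk", "Uttarahalli", "other"]
row = make_input_row(feature_columns, "Uttarahalli", 1440, 2, 3)
print(row)
```

With the real column list, prediction would then look like `joblib.load("bangalore_house_price_model.pkl").predict(row)`. Note that a location name the model never saw is silently treated as the baseline here, so mapping unknown names to `other` first may be the safer choice.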

## ✅ Conclusion

- Cleaned and processed real estate data for Bangalore.
- Trained a Linear Regression model.
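- Achieved an R² of about 0.53 and an MSE of about 11,480 on the held-out test set using only four inputs (location, area, bathrooms, bedrooms), which leaves substantial room for improvement.
- Next steps: try other models such as Random Forest or XGBoost, and build a web app.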
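
As a starting point for the model comparison mentioned above, the sketch below scores a linear model and a random forest with 5-fold cross-validation. It runs on synthetic data (`X_demo` and `y_demo` are made up and deliberately non-linear), so the scores say nothing about the Bangalore dataset. On the real data, `X` and `y` from the split step would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, non-linear stand-in for the real features (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(300, 3000, size=(200, 1))
y_demo = 0.00002 * X_demo[:, 0] ** 2 + rng.normal(0, 5, size=200)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Mean R² across 5 folds for each candidate model
scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}
print(scores)
```

Cross-validation gives a steadier estimate than a single 80/20 split, which matters when comparing models whose scores are close.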
