<a href="https://colab.research.google.com/github/HemaGarima/Real_state_Price_Prediction/blob/main/Real_State_Price_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Regression Project: Predicting Home Prices in Bangalore

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)

## Data Load: Load Bangalore home prices into a DataFrame

In [None]:
df1 = pd.read_csv("bengaluru_house_prices.csv")
df1.head()

In [None]:
df1.shape

In [None]:
df1.columns

In [None]:
df1['area_type'].unique()

In [None]:
df1['area_type'].nunique()

In [None]:
df1['area_type'].value_counts()

- Drop features that are not required to build our model

In [None]:
df2 = df1.drop(['area_type' , 'society' , 'balcony' , 'availability'],axis = 'columns')
df2.shape

In [None]:
df2.head()

## Data Cleaning: Handling NA values

In [None]:
df2.isnull().sum()

In [None]:
df2.shape

In [None]:
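# .copy() gives df3 its own data, so adding the bhk column later
# does not trigger pandas' SettingWithCopyWarning
df3 = df2.dropna().copy()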

In [None]:
df3.isnull().sum()

In [None]:
df3.shape

In [None]:
df3['size'].nunique()

In [None]:
df3['size'].unique()

In [None]:
df3['bhk'] = df3['size'].apply(lambda x : int(x.split(' ')[0]))

In [None]:
df3.head()

In [None]:
df3['bhk'].unique()

In [None]:
df3[df3.bhk > 20]

In [None]:
df3.total_sqft.unique()

In [None]:
df3['total_sqft'].nunique()

In [None]:
df3['total_sqft'].unique()

In [None]:
def is_float(x):
  # True if x can be parsed as a float, False otherwise
  try:
    float(x)
  except (TypeError, ValueError):
    return False
  return True

In [None]:
df3[~df3['total_sqft'].apply(is_float)]

In [None]:
df3[~df3['total_sqft'].apply(is_float)].head(10)

- The rows above show that total_sqft can be a range (e.g. 2100 - 2850). In such cases we can take the average of the minimum and maximum values of the range. Other values, such as 34.46Sq. Meter, carry a unit and could be converted to square feet with a unit conversion; here they are mapped to None and dropped.

In [None]:
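def convert_sqft_to_num(x):
  # A range such as '2100 - 2850' is replaced by its midpoint
  tokens = x.split('-')
  try:
    if len(tokens) == 2:
      return (float(tokens[0]) + float(tokens[1]))/2
    return float(x)
  except ValueError:
    # Anything that is not a number or a numeric range (e.g. '34.46Sq. Meter')
    return None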

In [None]:
convert_sqft_to_num('2166')

In [None]:
convert_sqft_to_num('2100 - 2850')

In [None]:
convert_sqft_to_num('34.46Sq. Meter')
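
The call above returns None because the value carries a unit. If one wanted to keep such rows instead of dropping them, a unit conversion could look roughly like the sketch below. The helper name `convert_with_units` and the table `SQFT_PER_UNIT` are hypothetical, and apart from "Sq. Meter" the unit spellings are assumptions about how the dataset writes them.

```python
import re

# Conversion factors to square feet, keyed by lower-cased unit label.
# "sq. meter" matches the spelling seen above; "sq. yards" is an assumed spelling.
SQFT_PER_UNIT = {
    "sq. meter": 10.7639,  # 1 square metre is about 10.7639 square feet
    "sq. yards": 9.0,      # 1 square yard is exactly 9 square feet
}

def convert_with_units(x):
    """Hypothetical helper: area in square feet, or None if x has no known unit."""
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*([A-Za-z. ]+)", str(x).strip())
    if match is None:
        return None  # plain numbers and ranges are left to convert_sqft_to_num
    value, unit = match.groups()
    factor = SQFT_PER_UNIT.get(unit.strip().lower())
    if factor is None:
        return None  # unit not in the table
    return float(value) * factor

print(convert_with_units('34.46Sq. Meter'))
```

Values with a unit missing from the table still come back as None, so they would be dropped by the same `notnull()` filter the notebook applies.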

In [None]:
df4 = df3.copy()
df4.total_sqft = df4.total_sqft.apply(convert_sqft_to_num)
df4 = df4[df4.total_sqft.notnull()]
df4.head()

In [None]:
df4.loc[30]

In [74]:
2100+2850

4950

In [75]:
4950/2

2475.0

In [76]:
df4.shape

(13200, 6)

## Feature Engineering

- Add a new feature called price per square foot

In [77]:
df5 = df4.copy()
df5['price_per_sqft'] = df5['price']*100000 / df5['total_sqft']
df5.head()

| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
|---|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056.0 | 2.0 | 39.07 | 2 | 3699.810606 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600.0 | 5.0 | 120.0 | 4 | 4615.384615 |
| 2 | Uttarahalli | 3 BHK | 1440.0 | 2.0 | 62.0 | 3 | 4305.555556 |
| 3 | Lingadheeranahalli | 3 BHK | 1521.0 | 3.0 | 95.0 | 3 | 6245.890861 |
| 4 | Kothanur | 2 BHK | 1200.0 | 2.0 | 51.0 | 2 | 4250.0 |
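
As a sanity check on the formula, the first row can be recomputed by hand. This sketch assumes `price` is recorded in lakh rupees (1 lakh = 100,000 rupees), which is what the factor of 100000 suggests; `RUPEES_PER_LAKH` and `price_per_sqft` are hypothetical names introduced only for illustration.

```python
RUPEES_PER_LAKH = 100_000  # assumption: price column is in lakh rupees

def price_per_sqft(price_lakh, total_sqft):
    """Hypothetical helper applying the notebook's formula to a single row."""
    return price_lakh * RUPEES_PER_LAKH / total_sqft

# First row of df5.head(): price 39.07, total_sqft 1056.0
first_row = price_per_sqft(39.07, 1056.0)
print(round(first_row, 2))  # 3699.81
```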


In [78]:
df5_stats = df5['price_per_sqft'].describe()
df5_stats

| | price_per_sqft |
|---|---|
| count | 13200.0 |
| mean | 7920.759 |
| std | 106727.2 |
| min | 267.8298 |
| 25% | 4267.701 |
| 50% | 5438.331 |
| 75% | 7317.073 |
| max | 12000000.0 |


In [79]:
df5.to_csv("bhp.csv",index = False)

- Examine location, which is a categorical variable. It has a large number of distinct values, so we need to apply a dimensionality reduction technique to reduce the number of locations

In [80]:
df5.location = df5.location.apply(lambda x : x.strip())

In [82]:
location_stats = df5['location'].value_counts(ascending = False)
location_stats

| location | count |
|---|---|
| Whitefield | 533 |
| Sarjapur Road | 392 |
| Electronic City | 304 |
| Kanakpura Road | 264 |
| Thanisandra | 235 |
| ... | ... |
| Rajanna Layout | 1 |
| Subramanyanagar | 1 |
| Lakshmipura Vidyaanyapura | 1 |
| Malur Hosur Road | 1 |


In [83]:
location_stats.values.sum()

13200

In [84]:
df5.shape

(13200, 7)

In [85]:
len(location_stats[location_stats > 10])

240

In [86]:
len(location_stats)

1287

In [87]:
len(location_stats[location_stats <= 10])

1047

## Dimensionality Reduction

- Any location with 10 or fewer data points is tagged as the "other" location. This reduces the number of categories from 1287 to 241. Later on, when we do one-hot encoding, this gives us far fewer dummy columns
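
The grouping idea and its effect on one-hot encoding can be sketched on a small made-up Series. The labels and the cutoff of 3 below are purely illustrative (the notebook uses 10 on the real data), and `Series.where` with `isin` is used here as one possible vectorized way to do the replacement, not necessarily the approach taken in the cells that follow.

```python
import pandas as pd

# Made-up location labels, purely for illustration
locations = pd.Series(["A"] * 5 + ["B"] * 4 + ["C", "D", "E"])
threshold = 3  # illustrative cutoff; the notebook uses 10

# Count rows per label and collect the labels at or below the cutoff
counts = locations.value_counts()
rare = counts[counts <= threshold].index

# Keep frequent labels as they are, replace rare ones with "other"
grouped = locations.where(~locations.isin(rare), "other")

print(pd.get_dummies(locations).shape[1])  # dummy columns before grouping: 5
print(pd.get_dummies(grouped).shape[1])    # dummy columns after grouping: 3
```

Three rare labels collapse into a single "other" column, which is the same effect the notebook relies on at a much larger scale.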


In [89]:
location_stats_less_than_10 = location_stats[location_stats <= 10]
location_stats_less_than_10

| location | count |
|---|---|
| BTM 1st Stage | 10 |
| Gunjur Palya | 10 |
| Nagappa Reddy Layout | 10 |
| Sector 1 HSR Layout | 10 |
| Thyagaraja Nagar | 10 |
| ... | ... |
| Rajanna Layout | 1 |
| Subramanyanagar | 1 |
| Lakshmipura Vidyaanyapura | 1 |
| Malur Hosur Road | 1 |


In [90]:
len(df5.location.unique())

1287

In [91]:
df5.location.nunique()

1287

In [92]:
# Membership is tested against the index of the Series, which holds the location names
df5.location = df5.location.apply(lambda x : 'other' if x in location_stats_less_than_10.index else x)

In [93]:
len(df5.location.unique())

241

In [94]:
df5.head(10)

| | location | size | total_sqft | bath | price | bhk | price_per_sqft |
|---|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056.0 | 2.0 | 39.07 | 2 | 3699.810606 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600.0 | 5.0 | 120.0 | 4 | 4615.384615 |
| 2 | Uttarahalli | 3 BHK | 1440.0 | 2.0 | 62.0 | 3 | 4305.555556 |
| 3 | Lingadheeranahalli | 3 BHK | 1521.0 | 3.0 | 95.0 | 3 | 6245.890861 |
| 4 | Kothanur | 2 BHK | 1200.0 | 2.0 | 51.0 | 2 | 4250.0 |
| 5 | Whitefield | 2 BHK | 1170.0 | 2.0 | 38.0 | 2 | 3247.863248 |
| 6 | Old Airport Road | 4 BHK | 2732.0 | 4.0 | 204.0 | 4 | 7467.057101 |
| 7 | Rajaji Nagar | 4 BHK | 3300.0 | 4.0 | 600.0 | 4 | 18181.818182 |
| 8 | Marathahalli | 3 BHK | 1310.0 | 3.0 | 63.25 | 3 | 4828.244275 |
| 9 | other | 6 Bedroom | 1020.0 | 6.0 | 370.0 | 6 | 36274.509804 |
