# Data Science Regression Project: Predicting Home Prices in Bangalore


The dataset was downloaded from Kaggle: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

In [3]:
from matplotlib import pyplot as plt
import pandas as pd 
import numpy as np
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)

In [10]:
housedf=pd.read_csv('Bengaluru_House_Data.csv')
housedf.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price(lakhs)
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [18]:
housedf.shape

(13320, 9)

In [11]:
housedf['area_type'].unique()

array(['Super built-up  Area', 'Plot  Area', 'Built-up  Area',
       'Carpet  Area'], dtype=object)

<b>We keep only the features that matter for the analysis. The columns area_type, availability, and society are not needed, so we drop them.</b>

In [12]:
housedf1=housedf.drop(columns=['area_type', 'availability','society'])
housedf1.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs)
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Kothanur,2 BHK,1200,2.0,1.0,51.0


## Data Cleaning 

<b>The first step in data cleaning is to check every column for null/NA values and deal with them.</b>

In [16]:
housedf1.isnull().sum()

location          1
size             16
total_sqft        0
bath             73
balcony         609
price(lakhs)      0
dtype: int64

<b>The data set is fairly large (13,320 rows), so for columns with few missing values (say, fewer than 100) we drop the affected rows. The 'balcony' column, however, has 609 nulls, so we fill those with the mean value instead, rounded down to a whole number because balcony counts are integers.</b>
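The choice between dropping and filling depends on how much of each column is missing (in the real data, 609 of 13,320 balcony values, roughly 4.6%). The following is a minimal sketch of that check on a made-up toy frame. The `toy` values and the names `missing_share`, `balcony_fill`, and `cleaned` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for housedf1
toy = pd.DataFrame({
    "bath": [2.0, np.nan, 3.0, 2.0],
    "balcony": [1.0, np.nan, np.nan, 2.0],
})

# Fraction of missing values per column (isnull().sum() divided by the row count)
missing_share = toy.isnull().mean()

# Fill 'balcony' with the floor of its mean, then drop rows that still have NA
balcony_fill = np.floor(toy["balcony"].mean())
cleaned = toy.assign(balcony=toy["balcony"].fillna(balcony_fill)).dropna()

print(missing_share)
print(cleaned)
```

Using `assign` and a plain `fillna` returns new objects rather than modifying the frame in place, which makes it easier to compare the data before and after.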
   

In [22]:
housedf1['balcony'].unique()

array([ 1.,  3., nan,  2.,  0.])

In [23]:
housedf1['balcony'].value_counts()

2.0    5113
1.0    4897
3.0    1672
0.0    1029
Name: balcony, dtype: int64

In [30]:
mean_balcony=np.floor(housedf1['balcony'].mean())

In [31]:
housedf2=housedf1.copy()
housedf2.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs)
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Kothanur,2 BHK,1200,2.0,1.0,51.0


In [38]:
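# Assign the result back: fillna(inplace=True) on a selected column may not update housedf2 in newer pandas
housedf2['balcony'] = housedf2['balcony'].fillna(mean_balcony)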
housedf2['balcony'].unique()

array([1., 3., 2., 0.])

In [39]:
housedf2.dropna(inplace=True)
housedf2

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs)
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.00
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.00
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.00
4,Kothanur,2 BHK,1200,2.0,1.0,51.00
...,...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453,4.0,0.0,231.00
13316,Richards Town,4 BHK,3600,5.0,1.0,400.00
13317,Raja Rajeshwari Nagar,2 BHK,1141,2.0,1.0,60.00
13318,Padmanabhanagar,4 BHK,4689,4.0,1.0,488.00


## Feature Engineering

<b>Next we examine the values in each column (their range and type) and modify existing features or add new ones.</b>

<b> Feature 1: Size </b>

In [45]:
housedf2['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

<b>Two different labels are used for the same thing: for example, '2 BHK' and '2 Bedroom' mean the same. We therefore extract the leading number into a new 'bhk' column to make the values homogeneous.</b>

In [46]:
housedf2['bhk']=housedf2['size'].apply(lambda x:int(x.split(' ')[0]))
housedf2['bhk'].unique()
                                       

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18], dtype=int64)
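The `lambda` above works because rows with a missing `size` were already dropped. As a hedged alternative, the leading number can also be pulled out with a vectorized regular expression that tolerates missing values. The sample `sizes` series and the names `digits` and `bhk` below are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample of 'size' strings, including a missing entry
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", None])

# Capture the leading run of digits; rows without a match stay missing
digits = sizes.str.extract(r"^(\d+)", expand=False)

# Convert to numbers, then to the nullable integer dtype so missing stays <NA>
bhk = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(bhk)
```

This variant does not raise on missing or malformed entries, which may be preferable if the cleaning order changes.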

<b>Feature 2: total_sqft</b>

In [50]:
type(housedf2['total_sqft'][0])

str

In [52]:
housedf2['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

<b>The column's data type is string, and a string can hold any value: digits, letters, or special characters. We try to convert every value to float; entries that are non-numeric or ranges raise an exception, which lets us separate them out.</b>

In [53]:
def is_float(x):
    # Return True if x can be parsed as a float, False otherwise
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [57]:
housedf2[~housedf2['total_sqft'].apply(is_float)].head(12)

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs),bhk
30,Yelahanka,4 BHK,2100 - 2850,4.0,0.0,186.0,4
122,Hebbal,4 BHK,3067 - 8156,4.0,0.0,477.0,4
137,8th Phase JP Nagar,2 BHK,1042 - 1105,2.0,0.0,54.005,2
165,Sarjapur,2 BHK,1145 - 1340,2.0,0.0,43.49,2
188,KR Puram,2 BHK,1015 - 1540,2.0,0.0,56.8,2
410,Kengeri,1 BHK,34.46Sq. Meter,1.0,0.0,18.5,1
549,Hennur Road,2 BHK,1195 - 1440,2.0,0.0,63.77,2
648,Arekere,9 Bedroom,4125Perch,9.0,1.0,265.0,9
661,Yelahanka,2 BHK,1120 - 1145,2.0,0.0,48.13,2
672,Bettahalsoor,4 Bedroom,3090 - 5002,4.0,0.0,445.0,4


<b>The output above shows that total_sqft can be a range (e.g. 2100 - 2850). In such cases we take the average of the minimum and maximum, so 2100 - 2850 becomes 2475. Other cases, such as 34.46Sq. Meter, could be converted to square feet with a unit conversion, but I drop such corner cases to keep things simple.</b>
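For readers who would rather keep the square-metre rows than drop them, here is a minimal sketch of the unit conversion mentioned above. The helper name `sqm_to_sqft` and the exact pattern it matches are assumptions based on the single example `34.46Sq. Meter`; the notebook itself does not do this.

```python
import re

# 1 square metre is approximately 10.7639 square feet
SQM_TO_SQFT = 10.7639

def sqm_to_sqft(text):
    """Convert strings like '34.46Sq. Meter' to square feet; return None otherwise."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*Sq\.\s*Meter\s*", text)
    if match is None:
        return None
    return float(match.group(1)) * SQM_TO_SQFT

print(sqm_to_sqft("34.46Sq. Meter"))  # roughly 370.9 square feet
print(sqm_to_sqft("4125Perch"))       # None: a different unit, not handled here
```

Other units seen in the data, such as Perch, would each need their own conversion factor, which is one reason dropping these few rows is a reasonable simplification.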

In [58]:
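def convert_sqft_to_num(x):
    # Average the two ends of a range such as '2100 - 2850'; otherwise parse as float
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Values with units (e.g. '34.46Sq. Meter') cannot be parsed
        return None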

In [None]:
ho