<h1 style="color:green" align='center'> Data Science Regression Project: Predicting Home Prices in Bangalore </h1>

In [14]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)

import warnings
warnings.filterwarnings('ignore')

<h2 style='color:blue'>1. Data Load: Load Bangalore home prices into a dataframe:</h2>

In [4]:
df = pd.read_csv("./../datasets/bengaluru_house_prices.csv")
print(df.shape)
df.head(3)

(13320, 9)


Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0


#### **Check for null values in the dataframe for each column.**

In [5]:
df.isnull().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [6]:
df.columns

Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

#### **Drop the features (columns) that are not needed to build the model and assign the result to a new dataframe, df1.**

In [7]:
df1 = df.drop(['area_type', 'availability', 'society', 'balcony'], axis='columns')
df1.head(3)

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0


<h2 style='color:blue'>2. Data Cleaning and Handling: </h2>

#### **Check for null values again after dropping those features.**

In [8]:
df1.isnull().sum()

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [9]:
df1.shape

(13320, 5)

#### **Drop every row that contains a null value. The dataset is large, so losing these rows should not affect the outcome much.**

In [10]:
df2 = df1.dropna()
df2.shape

(13246, 5)
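
Before dropping rows, it can help to quantify how much data is lost. In this notebook the shapes give 13320 - 13246 = 74 rows, roughly 0.56% of the data. The sketch below shows the same check on a small made-up frame (`toy` and its values are hypothetical, not taken from the Bengaluru data).

```python
import pandas as pd

# Hypothetical toy frame standing in for df1; the values are made up.
toy = pd.DataFrame({
    "location": ["A", None, "C", "D"],
    "size": ["2 BHK", "3 BHK", None, "1 BHK"],
    "bath": [2.0, 3.0, 1.0, None],
})

rows_before = len(toy)
toy_clean = toy.dropna()      # drops any row with at least one missing value
rows_after = len(toy_clean)

lost = rows_before - rows_after
print(f"dropped {lost} of {rows_before} rows ({lost / rows_before:.1%})")
```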

#### **Check the new dataframe df2 for null values to make sure none remain.**

In [11]:
df2.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64

<h2 style='color: blue'>3. Feature Engineering(1):</h2>

In [17]:
df2.head(3)

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0


**Check the unique values of the 'size' feature.**

In [18]:
df2['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

#### **Add a new feature called 'bhk' (bedroom, hall, kitchen). Split each 'size' value on the space and keep the first token, which is the number we want.**

**"2 BHK".split(" ") splits the string '2 BHK' into ['2', 'BHK']. We want only the first element, so "2 BHK".split(" ")[0] returns '2', which int() then converts to the number 2.**


In [43]:
df2['bhk'] = df2['size'].apply(lambda x: int(x.split(' ')[0]))
df2.head(3)

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0,4
2,Uttarahalli,3 BHK,1440,2.0,62.0,3
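
An alternative to `split` is to pull the leading digits out with a regular expression. This is a minimal sketch, assuming every `size` value starts with an integer, as the unique values listed above suggest; the `sizes` Series is a hypothetical sample and not the full column.

```python
import pandas as pd

# Sample values copied from the unique 'size' values shown above.
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "43 Bedroom"])

# Capture the leading digits of each string and convert them to integers.
bhk = sizes.str.extract(r"^(\d+)", expand=False).astype(int)
print(bhk.tolist())
```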


**Drop size feature from the dataframe.**

In [44]:
df3 = df2.drop('size', axis=1)
df3.head(3)

Unnamed: 0,location,total_sqft,bath,price,bhk
0,Electronic City Phase II,1056,2.0,39.07,2
1,Chikka Tirupathi,2600,5.0,120.0,4
2,Uttarahalli,1440,2.0,62.0,3


In [45]:
df3.shape

(13246, 5)

**Now check the values of bhk to confirm it's only numerical as we expect.**

In [46]:
df3['bhk'].unique()

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18], dtype=int64)

**Explore the total_sqft feature: use a user-defined function to check whether each value can be converted to a float.**

In [47]:
def is_float(data):
    # Return True if the value can be parsed as a float, False otherwise.
    try:
        float(data)
        return True
    except (TypeError, ValueError):
        return False

**Count the rows where the total_sqft value can be parsed as a float.**

In [48]:
df3[df3['total_sqft'].apply(is_float)].shape

(13056, 5)

**Explore the rows where the total_sqft value is not a float, and decide how to deal with them.**

In [49]:
df3[~df3['total_sqft'].apply(is_float)].head(10)

Unnamed: 0,location,total_sqft,bath,price,bhk
30,Yelahanka,2100 - 2850,4.0,186.0,4
122,Hebbal,3067 - 8156,4.0,477.0,4
137,8th Phase JP Nagar,1042 - 1105,2.0,54.005,2
165,Sarjapur,1145 - 1340,2.0,43.49,2
188,KR Puram,1015 - 1540,2.0,56.8,2
410,Kengeri,34.46Sq. Meter,1.0,18.5,1
549,Hennur Road,1195 - 1440,2.0,63.77,2
648,Arekere,4125Perch,9.0,265.0,9
661,Yelahanka,1120 - 1145,2.0,48.13,2
672,Bettahalsoor,3090 - 5002,4.0,445.0,4
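
To make the three kinds of values concrete, the sketch below runs a float check on a few strings copied from the rows above. It defines its own copy of the helper so that it runs on its own; catching only `TypeError` and `ValueError` is a choice made for this example.

```python
def is_float(data):
    """Return True if data can be parsed as a float."""
    try:
        float(data)
        return True
    except (TypeError, ValueError):
        return False

# Sample values taken from the total_sqft column shown above.
samples = ["1056", "2100 - 2850", "34.46Sq. Meter", "4125Perch"]
results = {s: is_float(s) for s in samples}
print(results)
```

Only the plain number passes; ranges and values with units both fail the check and need separate handling.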


**Convert range values such as '2100 - 2850' to a single number by taking the mean of the two endpoints, and return None for any other value that cannot be parsed, using a user-defined function. For example, '2000-2100'.split('-') returns ['2000', '2100'].**

In [63]:
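def range_to_mean(data):
    # A range such as '2100 - 2850' splits into two parts; return their mean.
    ret_val = data.split('-')
    try:
        if len(ret_val) == 2:
            return (float(ret_val[0]) + float(ret_val[1])) / 2
        return float(data)
    except ValueError:
        # Values with units, such as '34.46Sq. Meter', cannot be parsed.
        return None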

In [73]:
df4 = df3.copy()

In [101]:
df4.shape

(13246, 5)

In [74]:
df4['total_sqft'] = df4['total_sqft'].apply(range_to_mean)
df4['total_sqft'].values

array([1056., 2600., 1440., ..., 1141., 4689.,  550.])
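
The sketch below shows how a range-to-mean helper treats each kind of value seen earlier. It redefines the helper so that the example is self-contained; wrapping both conversions in a single `try` and catching only `ValueError` are choices made here, and the sample strings are copied from the rows displayed above.

```python
def range_to_mean(data):
    """Return the mean of an 'a - b' range, the float value, or None."""
    parts = data.split('-')
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(data)
    except ValueError:
        return None

samples = ["1056", "2100 - 2850", "34.46Sq. Meter", "4125Perch"]
converted = [range_to_mean(s) for s in samples]
print(converted)
```

The range becomes its midpoint, the plain number is parsed directly, and the values with units become `None`, which pandas stores as NaN in the resulting column.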

**Some values in df4['total_sqft'] are now NaN (the ones that could not be parsed). Remove those rows.**

In [107]:
df4 = df4[df4['total_sqft'].notnull()]
df4.shape

(13200, 5)

In [108]:
df4.loc[30, ['total_sqft']]

total_sqft    2475.0
Name: 30, dtype: object

In [109]:
df4.loc[30, 'total_sqft']

2475.0

<h2 style='color: blue'> 3. Feature Engineering(2):</h2>

**Add a new feature called 'price_per_sqft' to analyze the price of a home per square foot. The factor 100000 converts price to rupees, assuming price is recorded in lakhs (1 lakh = 100,000).**

In [110]:
df5 = df4.copy()

df5['price_per_sqft'] = (df5['price']*100000)/df5['total_sqft']
df5.head(10)

Unnamed: 0,location,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,1200.0,2.0,51.0,2,4250.0
5,Whitefield,1170.0,2.0,38.0,2,3247.863248
6,Old Airport Road,2732.0,4.0,204.0,4,7467.057101
7,Rajaji Nagar,3300.0,4.0,600.0,4,18181.818182
8,Marathahalli,1310.0,3.0,63.25,3,4828.244275
9,Gandhi Bazar,1020.0,6.0,370.0,6,36274.509804
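
As a check on the formula, the first row can be recomputed by hand. The variable names are hypothetical, and reading `price` as lakhs of rupees is an assumption based on the factor 100000 used in the cell above.

```python
# Values copied from row 0 of the table above.
price_lakh = 39.07     # assumed to be in lakhs of rupees
total_sqft = 1056.0

price_per_sqft = price_lakh * 100000 / total_sqft
print(round(price_per_sqft, 6))
```

The result agrees with the 3699.810606 shown for row 0.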


**Summarize price_per_sqft with descriptive statistics. A maximum far above the 75th percentile points to outliers.**

In [111]:
df5_stats = df5['price_per_sqft'].describe()
df5_stats

count    1.320000e+04
mean     7.920759e+03
std      1.067272e+05
min      2.678298e+02
25%      4.267701e+03
50%      5.438331e+03
75%      7.317073e+03
max      1.200000e+07
Name: price_per_sqft, dtype: float64

**Strip any leading or trailing whitespace from the values of the location feature.**

In [112]:
df6 = df5.copy()
df6['location'] = df5['location'].apply(lambda x: x.strip())
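
Stray whitespace matters because two names that differ only by a space are counted as different locations. The sketch below illustrates this on made-up values (the `locations` Series is hypothetical) and uses the vectorised `.str.strip()`, which does the same job as the `apply` with a lambda.

```python
import pandas as pd

# Hypothetical location values with stray whitespace, for illustration only.
locations = pd.Series([" Whitefield", "Whitefield ", "Whitefield"])

print(locations.nunique())          # distinct strings before stripping
stripped = locations.str.strip()    # vectorised equivalent of apply(lambda x: x.strip())
print(stripped.nunique())           # distinct strings after stripping
```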

**Examine the location feature, which is a categorical variable. If the number of distinct location names is very large, we will need a dimensionality reduction technique to bring it down.**

In [113]:
location_stats = df6['location'].value_counts(ascending=False)
location_stats

location
Whitefield                   533
Sarjapur  Road               392
Electronic City              304
Kanakpura Road               264
Thanisandra                  235
                            ... 
Rajanna Layout                 1
Subramanyanagar                1
Lakshmipura Vidyaanyapura      1
Malur Hosur Road               1
Abshot Layout                  1
Name: count, Length: 1287, dtype: int64
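
One common way to reduce the number of categories is to put rarely seen locations under a single label. The sketch below illustrates that idea on made-up data; the `toy` frame, the `min_count` threshold and the label `'other'` are assumptions chosen for the example, and this is not necessarily the approach the notebook takes later.

```python
import pandas as pd

# Hypothetical toy data; the threshold of 2 is arbitrary.
toy = pd.DataFrame(
    {"location": ["Whitefield"] * 4 + ["Hebbal"] * 3 + ["Arekere", "Kengeri"]}
)
min_count = 2

counts = toy["location"].value_counts()
rare = counts[counts < min_count].index    # locations seen fewer than min_count times

# Keep frequent locations as they are and relabel the rare ones as 'other'.
toy["location"] = toy["location"].where(~toy["location"].isin(rare), "other")
print(toy["location"].value_counts())
```

After relabelling, the two single-row locations share one category, so a later one-hot encoding would produce three columns instead of four.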

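**Sum the counts as a sanity check: the total should equal the number of rows in df6 (13200). The number of unique locations is the length of location_stats, 1287.**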

In [114]:
location_stats.values.sum()

13200

In [115]:
df5[df5['location']=='Whitefield']

Unnamed: 0,location,total_sqft,bath,price,bhk,price_per_sqft
5,Whitefield,1170.0,2.0,38.00,2,3247.863248
10,Whitefield,1800.0,2.0,70.00,3,3888.888889
11,Whitefield,2785.0,5.0,295.00,4,10592.459605
27,Whitefield,1610.0,3.0,81.00,3,5031.055901
47,Whitefield,1459.0,2.0,94.82,2,6498.971899
...,...,...,...,...,...,...
13235,Whitefield,1730.0,3.0,125.00,3,7225.433526
13257,Whitefield,1453.0,2.0,58.00,3,3991.741225
13258,Whitefield,877.0,1.0,59.00,1,6727.480046
13299,Whitefield,2856.0,5.0,154.50,4,5409.663866
