<u><b><h1 align="center" style = "color:Red; background-color:yellow;" > Data Science Regression Project: Predicting Home Prices in Banglore </h1></b></u>

<u><b><h3 style = "color:Blue"> Loading Data</h3></b></u>

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)

In [None]:
df1 = pd.read_csv("Bengaluru_House_Data.csv")
print(df1.shape)
df1.head()

In [None]:
df1["area_type"].unique()

In [None]:
df1['area_type'].value_counts()

<b> Dropping features that are not required to build our model </b>

In [None]:
df2 = df1.drop(["area_type", "availability", "society", "balcony"], axis = "columns")
df2.head()

<u><b><h3 style = "color:Blue;">Data Cleaning</h3></b></u>

In [None]:

df2.isnull().sum()

In [None]:
df3 = df2.dropna()  # Since the NaN values in each column were very few as compared to total rows in the dataset ; hence we can easily drop those rows.
df3.shape

In [None]:
df3["size"].unique()

Now it's evident that the BHK and Bedroom are one and all the same thing ; but since the string value is different hence they are treated different ; for that purpose we can make a column named "Size_in_BHK".
For that make a function that retrieve the Integer value from the size column.

<b> Add new feature(integer) for bhk (Bedrooms Hall Kitchen) </b>

In [None]:
df3['Size_in_BHK'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
df3.head()

In [None]:
df3.drop(["size"], axis = "columns", inplace=True)

In [None]:
df3["total_sqft"].unique()

Check that the values are not one having single number  ; few datas of <b>total_sqft</b> have ranges too ; like 1133-1384. 
We need to remove these. 
<br>For this we can check if the total_sqft value is float or not.

In [None]:
def is_float(x):
    try:
        float(x)   # If float value exists then it'll return ; as the range can't be converted to float value.
    except:
        return False
    return True

In [None]:
df3[df3["total_sqft"].apply(is_float)]     # Checking for valid numerical value of "total_sqft"

In [None]:
df3[~df3["total_sqft"].apply(is_float)].head(15) # Negation used here ; will throw dataset where the "total_sqft" value is range type

Above shows that total_sqft can be a range (e.g. 2100-2850). For such case we can just take average of min and max value in the range. There are other cases such as 34.46Sq. Meter which one can convert to square ft using unit conversion.We will just drop such corner cases to keep things simple.

In [None]:
# Function to convert ranges to number (We will take average of upper limit and lower limit)

def range_sqft_to_num (x):
    tokens = x.split("-")
    if(len(tokens)==2):
        return (float(tokens[0])+float(tokens[1]))/2
    try:
        return float(x)
    except:
        return None

In [None]:
df4 = df3.copy()
df4.total_sqft = df4["total_sqft"].apply(range_sqft_to_num)

# Now we will drop NaN values of total_sqft column which we had converted to NaN by changing the extra values like 34.46 sq. meter etc. to NaN 

df4.dropna(subset = ["total_sqft"], inplace=True)   # we could also use this command :  df4 = df4[df4.total_sqft.notnull()]
print(df4.shape)
df4.head(15)

We have reduced the no. of rows to 13200 which were 13320 in the original dataset. 

<u><b><h3 style = "color:Blue;">Feature Engineering</h3></b></u>

In [None]:
# Adding new feature "Price_per_sqft" 
df5 = df4.copy()
df5["Price_per_sqft"] = df5["price"]*100000/df5["total_sqft"]
df5.head()

In [None]:
# Calculating the above dataframe's data analysis 
df5_stats = df5['Price_per_sqft'].describe()
df5_stats

In [None]:
len(df5.location.unique())

<b> Examine locations which is a categorical variable. We need to apply dimensionality reduction technique here to reduce number of locations.</b>

The no. of unique locations are huge which are 1298 ; and hence encoding won't be suitable here.


In [None]:
df5.location.apply(lambda x : x.strip())  # This method is used  trim leading and trailing spaces
location_count = df5.location.value_counts(ascending=False)
location_count

<u><b><h3 style = "color:Blue;">Dimensionality Reduction</h3></b></u>

In [None]:
print(len(location_count[location_count > 10]))
print(len(location_count[location_count <= 10]))

It's clearly evident that only few locations, precisely 240 have location count greater than 10.
While 1058 locations are such that which have occurence less than or equal to 10 in the dataset.


<b>Any location having less than 10 data points should be tagged as "other" location. This way number of categories can be reduced by huge amount. Later on when we do one hot encoding, it will help us with having fewer dummy columns.</b>

In [None]:
location_count_less_than_10 = location_count[location_count <= 10]

In [None]:
df5.location = df5.location.apply(lambda x : "others" if x in location_count_less_than_10 else x)
df5.head()

In [None]:
len(df5.location.unique())   # Only 241 unique values left now ; hence one hot encoding will be suitable here.

<u><b><h3 style = "color:Blue;">Outlier Removal Using Business Logic</h3></b></u>

<b>normally square ft per bedroom is 300 (i.e. 2 bhk apartment is minimum 600 sqft. If you have for example 400 sqft apartment with 2 bhk than that seems suspicious and can be removed as an outlier. We will remove such outliers by keeping our minimum thresold per bhk to be 300 sqft</b>

In [None]:
df5[(df5["total_sqft"]/df5["Size_in_BHK"])<300]

In [212]:
df6 = df5[~((df5["total_sqft"]/df5["Size_in_BHK"])<300)]    # storing only relevant data into df6
df6.shape

(12456, 6)

<b><h5>Outlier Removal Using Standard Deviation and Mean</h5></b>

In [None]:
df6["Price_per_sqft"].describe()

Now we can check that the datasset contains few values of Price Per Sqft. to be as low as INR 267 which is quite impossible ; same with the price to be INR 176470, 
now this might be possible for some specific posh or luxury area but this is not relevant for our basic generic model.<br>
Hence we need to ignore these values and keep only generic dataset which have fair price per sqft.

Since the data should be normally distributed. For this we can use the statistical method which says the range such data should be in range[mean-std , mean+std]

In [214]:
# Let's create a function for Outlier Removal

def remove_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        mean= np.mean(subdf["Price_per_sqft"])
        std = np.std(subdf["Price_per_sqft"])
        reduced_df = subdf[(subdf.Price_per_sqft >= mean-std) & (subdf.Price_per_sqft <= mean+std)]
        df_out = pd.concat([df_out, reduced_df],ignore_index=True)
    return df_out

In [215]:
df7 = remove_outliers(df6)
df7.shape

(10245, 6)