Consider the Bangalore House Price data. Perform the following
operations.
a) Find and replace null values in the data using an appropriate
technique.
b) Transform the ‘Size’ column to numerical values. For Example:
2 BHK to be converted as 2
c) Transform the ‘total_sqft’ column to contain numerical values
on the same scale. If a range is given, take the average of the
range.
d) Calculate and add one more column as ‘Price_Per_Sqft’
e) Remove the outliers from Price_Per_Sqft and BHK Size column
if any.
f) Apply the Linear Regression model to the data and display the
training and testing performance measures as Mean Squared Error
and Accuracy

In [212]:
import numpy as np
import pandas as pd

In [213]:
dataframe = pd.read_csv('/content/Banglore Housing Prices.csv')
dataframe.head()

,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Kothanur,2 BHK,1200,2.0,51.0


**a) Find and replace null values in the data using an appropriate technique.**

In [214]:
dataframe.isnull().sum()

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

Replace missing values in location and size with the placeholder string 'None'.

In [215]:
dataframe['location'] = dataframe['location'].replace(np.nan, 'None')
dataframe['size'] = dataframe['size'].replace(np.nan, 'None')

Fill missing values in bath with 0.

In [216]:
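# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas.
dataframe['bath'] = dataframe['bath'].fillna(0)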

In [217]:
dataframe.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64
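Filling bath with 0 records those homes as having no bathroom, and 'None' becomes an extra category in location and size. An alternative worth considering is the median for numeric columns and the most frequent value for categorical ones. The sketch below uses a small made-up frame (`toy` is a hypothetical name, and its values are not taken from the housing file) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the housing data (values are made up).
toy = pd.DataFrame({
    'location': ['Whitefield', np.nan, 'Whitefield', 'Sarjapur'],
    'size': ['2 BHK', '3 BHK', np.nan, '2 BHK'],
    'bath': [2.0, np.nan, 3.0, 2.0],
})

# Numeric column: the median is robust to extreme values.
toy['bath'] = toy['bath'].fillna(toy['bath'].median())

# Categorical columns: fall back to the most frequent value (the mode).
for col in ['location', 'size']:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
```

Whether the median or mode is appropriate depends on how many values are missing and why; with only 1 and 16 missing entries in location and size, dropping those rows is also a reasonable choice.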

**b) Transform the ‘Size’ column to numerical values. For Example: 2 BHK to be converted as 2**

In [218]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
dataframe['size'] = label_encoder.fit_transform(dataframe['size'])
dataframe.head()

,location,size,total_sqft,bath,price
0,Electronic City Phase II,13,1056,2.0,39.07
1,Chikka Tirupathi,19,2600,5.0,120.0
2,Uttarahalli,16,1440,2.0,62.0
3,Lingadheeranahalli,16,1521,3.0,95.0
4,Kothanur,13,1200,2.0,51.0
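Note that label encoding assigns an arbitrary category code to each distinct string, which is why '2 BHK' appears as 13 in the output above rather than 2. If the goal is the bedroom count itself, one possible approach is to extract the leading integer from each string. The sketch below runs on a few illustrative strings (`sizes` and `bhk` are hypothetical names), not on the housing file.

```python
import pandas as pd

# Illustrative size strings, including a missing value.
sizes = pd.Series(['2 BHK', '4 Bedroom', '3 BHK', '1 RK', None])

# Pull out the leading integer; rows without one (including missing values) become NaN.
bhk = sizes.str.extract(r'^(\d+)', expand=False).astype(float)
```

Applied to the real column, this would have to run before missing values are replaced by 'None', or the resulting NaN rows would need to be handled afterwards.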


**c) Transform the ‘total_sqft’ column to contain numerical values on the same scale. If a range is given, take the average of the range.**

In [219]:
dataframe.groupby('total_sqft').count()

total_sqft,location,size,bath,price
1,1,1,1,1
1.25Acres,1,1,1,1
1.26Acres,1,1,1,1
1000,172,172,172,172
1000 - 1285,1,1,1,1
...,...,...,...,...
995,10,10,10,10
996,4,4,4,4
997,2,2,2,2
998,1,1,1,1


**d) Calculate and add one more column as ‘Price_Per_Sqft’**

In [220]:
dataframe.dtypes

location       object
size            int64
total_sqft     object
bath          float64
price         float64
dtype: object

In [221]:
# pd.to_numeric(dataframe['total_sqft'])
# ValueError: Unable to parse string "2100 - 2850" at position 30
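One way to find which values fail to parse, without raising an error, is `pd.to_numeric` with `errors='coerce'`. A minimal sketch on a few sample strings (the series below is illustrative, not read from the file):

```python
import pandas as pd

sample = pd.Series(['1056', '2100 - 2850', '34.46Sq. Meter', '1200'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(sample, errors='coerce')

# The rows that failed to parse are the ones needing special handling.
needs_cleaning = sample[parsed.isna()]
```

Inspecting `needs_cleaning` shows the formats (ranges, values with units) that a custom conversion function has to cover.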

In [222]:
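def new_totalsqft(value):
  # A range such as "2100 - 2850" is replaced by the average of its two ends.
  parts = value.split('-')
  try:
    if len(parts) == 2:
      return (float(parts[0]) + float(parts[1])) / 2
    return float(value)
  except ValueError:
    # Values carrying units (e.g. "34.46Sq. Meter") cannot be parsed.
    return None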


In [223]:
dataframe['new_totalsqft'] = dataframe['total_sqft'].apply(new_totalsqft)

In [224]:
dataframe.isnull().sum()

location          0
size              0
total_sqft        0
bath              0
price             0
new_totalsqft    46
dtype: int64

In [225]:
bool_sqft = dataframe['new_totalsqft'].isnull()
dataframe[bool_sqft]

,location,size,total_sqft,bath,price,new_totalsqft
410,Kengeri,0,34.46Sq. Meter,1.0,18.5,
648,Arekere,30,4125Perch,9.0,265.0,
775,Basavanagara,0,1000Sq. Meter,2.0,93.0,
872,Singapura Village,13,1100Sq. Yards,2.0,45.0,
1019,Marathi Layout,1,5.31Acres,1.0,110.0,
1086,Narasapura,14,30Acres,2.0,29.5,
1400,Chamrajpet,29,716Sq. Meter,9.0,296.0,
1712,Singena Agrahara,17,1500Sq. Meter,3.0,95.0,
1743,Hosa Road,16,142.61Sq. Meter,3.0,115.0,
1821,Sarjapur,17,1574Sq. Yards,3.0,76.0,
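Dropping these rows is the simplest option. If the unit-bearing values should instead be brought onto the same scale, one possible approach is to convert them to square feet. The sketch below is an illustration under stated assumptions: `to_sqft` and `SQFT_PER_UNIT` are hypothetical names, and only the four unit labels visible in the rows above are handled.

```python
import re

# Square feet per unit. The factors are standard conversions; the set of
# unit labels is an assumption based on the rows shown above.
SQFT_PER_UNIT = {
    'Sq. Meter': 10.7639,
    'Sq. Yards': 9.0,
    'Acres': 43560.0,
    'Perch': 272.25,
}

def to_sqft(value):
    """Return the area in square feet, or None if the format is unrecognised."""
    text = str(value).strip()
    # Range such as "2100 - 2850": take the average of the two ends.
    parts = text.split('-')
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None
    # Number followed by an optional unit label, e.g. "34.46Sq. Meter".
    match = re.fullmatch(r'(\d+(?:\.\d+)?)\s*(.*)', text)
    if match is None:
        return None
    number, unit = float(match.group(1)), match.group(2)
    if unit == '':
        return number
    factor = SQFT_PER_UNIT.get(unit)
    return number * factor if factor is not None else None
```

Unit labels not listed in the table still return None, so any remaining unparseable rows can be dropped as before.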


In [226]:
dataframe = dataframe.dropna()
dataframe.isnull().sum()

location         0
size             0
total_sqft       0
bath             0
price            0
new_totalsqft    0
dtype: int64

In [227]:
# price is in lakhs (1 lakh = 100,000 rupees), so this gives rupees per square foot
dataframe['price_per_sqft'] = (dataframe['price'] * 100000) / dataframe['new_totalsqft']
dataframe

,location,size,total_sqft,bath,price,new_totalsqft,price_per_sqft
0,Electronic City Phase II,13,1056,2.0,39.07,1056.0,3699.810606
1,Chikka Tirupathi,19,2600,5.0,120.00,2600.0,4615.384615
2,Uttarahalli,16,1440,2.0,62.00,1440.0,4305.555556
3,Lingadheeranahalli,16,1521,3.0,95.00,1521.0,6245.890861
4,Kothanur,13,1200,2.0,51.00,1200.0,4250.000000
...,...,...,...,...,...,...,...
13315,Whitefield,22,3453,4.0,231.00,3453.0,6689.834926
13316,Richards Town,18,3600,5.0,400.00,3600.0,11111.111111
13317,Raja Rajeshwari Nagar,13,1141,2.0,60.00,1141.0,5258.545136
13318,Padmanabhanagar,18,4689,4.0,488.00,4689.0,10407.336319


**e) Remove the outliers from Price_Per_Sqft and BHK Size column if any.**

In [228]:
dataframe.describe()

,size,bath,price,new_totalsqft,price_per_sqft
count,13274.0,13274.0,13274.0,13274.0,13274.0
mean,14.814675,2.67636,112.453654,1559.626694,7907.501
std,4.474545,1.349933,149.070368,1238.405258,106429.6
min,0.0,0.0,8.0,1.0,267.8298
25%,13.0,2.0,50.0,1100.0,4266.865
50%,16.0,2.0,72.0,1276.0,5434.306
75%,16.0,3.0,120.0,1680.0,7311.746
max,31.0,40.0,3600.0,52272.0,12000000.0


In [229]:
dataframe.shape

(13274, 7)

In [230]:
Q1 = dataframe['price_per_sqft'].quantile(.25)
Q3 = dataframe['price_per_sqft'].quantile(.75)

IQR = Q3 - Q1

upper = Q3 + (IQR*1.5)
lower = Q1 - (IQR*1.5)

dataframe[(dataframe['price_per_sqft'] > upper) | (dataframe['price_per_sqft'] < lower)].shape

(1268, 7)

There are 1268 rows whose price_per_sqft lies more than 1.5 × IQR beyond the quartiles; these are treated as outliers.
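Since the same quartile-based rule is applied to more than one column, it can be wrapped in a small helper. The sketch below is one possible way to do that; `iqr_bounds` is a hypothetical name and the sample values are made up.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 200])
lower, upper = iqr_bounds(values)

# between() is inclusive on both ends, so points exactly on a fence are kept.
kept = values[values.between(lower, upper)]
```

Using an inclusive filter keeps the retained rows the exact complement of the rows flagged with strict `>` and `<` comparisons.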

**Removing Outliers**

In [231]:
dataframe = dataframe[(dataframe['price_per_sqft'] < upper) & (dataframe['price_per_sqft'] > lower)]
dataframe

,location,size,total_sqft,bath,price,new_totalsqft,price_per_sqft
0,Electronic City Phase II,13,1056,2.0,39.07,1056.0,3699.810606
1,Chikka Tirupathi,19,2600,5.0,120.00,2600.0,4615.384615
2,Uttarahalli,16,1440,2.0,62.00,1440.0,4305.555556
3,Lingadheeranahalli,16,1521,3.0,95.00,1521.0,6245.890861
4,Kothanur,13,1200,2.0,51.00,1200.0,4250.000000
...,...,...,...,...,...,...,...
13315,Whitefield,22,3453,4.0,231.00,3453.0,6689.834926
13316,Richards Town,18,3600,5.0,400.00,3600.0,11111.111111
13317,Raja Rajeshwari Nagar,13,1141,2.0,60.00,1141.0,5258.545136
13318,Padmanabhanagar,18,4689,4.0,488.00,4689.0,10407.336319


**Removing Outliers from BHK Size**

In [232]:
Q1 = dataframe['size'].quantile(.25)
Q3 = dataframe['size'].quantile(.75)

IQR = Q3 - Q1
lower = Q1 - (IQR * 1.5)
upper = Q3 + (IQR * 1.5)

# scanning for outliers
dataframe[(dataframe['size'] < lower) | (dataframe['size'] > upper)].shape

(1068, 7)

In [233]:
dataframe = dataframe[(dataframe['size'] > lower) & (dataframe['size'] < upper)]
dataframe

,location,size,total_sqft,bath,price,new_totalsqft,price_per_sqft
0,Electronic City Phase II,13,1056,2.0,39.07,1056.0,3699.810606
1,Chikka Tirupathi,19,2600,5.0,120.00,2600.0,4615.384615
2,Uttarahalli,16,1440,2.0,62.00,1440.0,4305.555556
3,Lingadheeranahalli,16,1521,3.0,95.00,1521.0,6245.890861
4,Kothanur,13,1200,2.0,51.00,1200.0,4250.000000
...,...,...,...,...,...,...,...
13313,Uttarahalli,16,1345,2.0,57.00,1345.0,4237.918216
13314,Green Glen Layout,16,1715,3.0,112.00,1715.0,6530.612245
13316,Richards Town,18,3600,5.0,400.00,3600.0,11111.111111
13317,Raja Rajeshwari Nagar,13,1141,2.0,60.00,1141.0,5258.545136


**f) Apply the Linear Regression model to the data and display the training and testing performance measures as Mean Squared Error and Accuracy**

In [234]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Single feature (area in square feet) and target (price).
df_linear = dataframe[['price', 'new_totalsqft']]
X = df_linear.drop('price', axis='columns')
Y = df_linear['price']

In [235]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 10)

In [236]:
linear_regression = linear_model.LinearRegression()
linear_regression.fit(X_train, Y_train)

LinearRegression()

In [237]:
linear_regression.coef_

array([0.04048073])

In [238]:
linear_regression.intercept_

26.51142876049711

In [239]:
# score() returns R^2 on the test set, reported here as the accuracy measure
linear_regression.score(X_test, Y_test)

0.5205021354117046

In [240]:
# Mean squared error on the test set
y_test_pred = linear_regression.predict(X_test)
print(mean_squared_error(Y_test, y_test_pred))

1831.0732985828263


In [241]:
# Mean squared error on the training set
y_train_pred = linear_regression.predict(X_train)
print(mean_squared_error(Y_train, y_train_pred))

2854.1099977133763
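The task asks for both measures on both splits, while the cells above report R² for the test set only. A minimal sketch of collecting MSE and R² for the training and test sets in one loop is given below. It uses synthetic data as a stand-in (the generated `X` and `y` are assumptions, not the housing data), so the printed numbers will not match the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (new_totalsqft, price); not the housing data.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 0.05 * X[:, 0] + 20 + rng.normal(0, 10, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
model = LinearRegression().fit(X_train, y_train)

results = {}
for name, X_part, y_part in [('train', X_train, y_train), ('test', X_test, y_test)]:
    pred = model.predict(X_part)
    results[name] = {
        'mse': mean_squared_error(y_part, pred),
        # r2_score is the same quantity that model.score() returns.
        'r2': r2_score(y_part, pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.2f}, R^2={results[name]['r2']:.3f}")
```

For a regression model, R² plays the role of the accuracy figure: it is the fraction of variance in the target explained by the model, not a share of correct predictions.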
