# Housing Price Model

## Importing Libraries and Data

In [88]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

In [89]:
data = pd.read_csv(r"D:\Utkarsh Mathur\Career\Data Science\Datasets\melbourne housing snapshot\melb_data.csv")

## Data Pre-processing and Preparation

In [90]:
data.isnull().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

In [91]:
data.Car = data.Car.fillna(data.Car.median())
data.isnull().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                 0
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64

Initially the data has 20 features and a price column. We remove some of the unnecessary and overlapping features to get a cleaner dataset.<br>
<br>
The reasons for removing each feature are:<br>
1) **Date** :- The data covers a very short span of time, so the variation in prices over time should be negligible.<br>
2) **BuildingArea** :- A large share of the values is missing (6450 of 13580 rows).<br>
3) **YearBuilt** :- A large share of the values is missing (5375 of 13580 rows).<br>
4) **CouncilArea** :- Within a single state, the council area is assumed not to make much of a difference to house prices.<br>
5) **Address** :- The address is a string that is different for every house, so it does not make sense to use it for predicting price.<br>
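The missing-data argument for **BuildingArea** and **YearBuilt** can be checked by looking at the share of missing values per column instead of the raw counts. The following is a minimal sketch on a made-up toy frame; the names `toy`, `missing_share`, and `threshold`, and the cut-off of 0.3, are assumptions for illustration and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [np.nan, 150.0, np.nan, np.nan],
    "YearBuilt": [1900.0, np.nan, np.nan, 2005.0],
    "Car": [1.0, np.nan, 2.0, 0.0],
})

# Share of missing entries per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns with more than 30% missing values.
threshold = 0.3
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(to_drop)
```

On the real data the same call would be `data.isnull().mean()`. A column such as `Car`, with only a few gaps, stays below the cut-off and is a candidate for imputation rather than removal.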

In [92]:
data1 = data.drop(['Address','BuildingArea', 'YearBuilt', 'CouncilArea', 'Date'], axis=1)
data1.head()

| | Suburb | Rooms | Type | Price | Method | SellerG | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 2 | h | 1480000.0 | S | Biggin | 2.5 | 3067.0 | 2.0 | 1.0 | 1.0 | 202.0 | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 2 | h | 1035000.0 | S | Biggin | 2.5 | 3067.0 | 2.0 | 1.0 | 0.0 | 156.0 | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 3 | h | 1465000.0 | SP | Biggin | 2.5 | 3067.0 | 3.0 | 2.0 | 0.0 | 134.0 | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 3 | h | 850000.0 | PI | Biggin | 2.5 | 3067.0 | 3.0 | 2.0 | 1.0 | 94.0 | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 4 | h | 1600000.0 | VB | Nelson | 2.5 | 3067.0 | 3.0 | 1.0 | 2.0 | 120.0 | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


In [93]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()

In [94]:
data1['Suburb'] = lb.fit_transform(data1['Suburb']) 
data1['Type'] = lb.fit_transform(data1['Type'])
data1['Method'] = lb.fit_transform(data1['Method'])
data1['SellerG'] = lb.fit_transform(data1['SellerG'])
data1['Regionname'] = lb.fit_transform(data1['Regionname'])
data1['Postcode'] = lb.fit_transform(data1['Postcode'])
data1.head()

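| | Suburb | Rooms | Type | Price | Method | SellerG | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 1480000.0 | 1 | 23 | 2.5 | 53 | 2.0 | 1.0 | 1.0 | 202.0 | -37.7996 | 144.9984 | 2 | 4019.0 |
| 1 | 0 | 2 | 0 | 1035000.0 | 1 | 23 | 2.5 | 53 | 2.0 | 1.0 | 0.0 | 156.0 | -37.8079 | 144.9934 | 2 | 4019.0 |
| 2 | 0 | 3 | 0 | 1465000.0 | 3 | 23 | 2.5 | 53 | 3.0 | 2.0 | 0.0 | 134.0 | -37.8093 | 144.9944 | 2 | 4019.0 |
| 3 | 0 | 3 | 0 | 850000.0 | 0 | 23 | 2.5 | 53 | 3.0 | 2.0 | 1.0 | 94.0 | -37.7969 | 144.9969 | 2 | 4019.0 |
| 4 | 0 | 4 | 0 | 1600000.0 | 4 | 155 | 2.5 | 53 | 3.0 | 1.0 | 2.0 | 120.0 | -37.8072 | 144.9941 | 2 | 4019.0 |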


## Test Train Splits

In [95]:
# Select the target by name rather than by position, so it survives column reordering
y = data1['Price'].values

In [96]:
X = data1.drop(['Price'],axis=1).values

In [97]:
X.shape

(13580, 15)

In [98]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)

In [99]:
from sklearn.model_selection import train_test_split

In [100]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
x_train.shape, x_test.shape

((10864, 15), (2716, 15))

## Model Building

To build the model, I'm using polynomial multiple linear regression with a degree of 3.
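A degree-3 expansion of 15 inputs produces a large number of terms, and the scaling and expansion steps are easiest to keep consistent between training and test data when they are chained together. The sketch below shows one way to do that with a scikit-learn pipeline on synthetic stand-in data. The names `X_demo`, `y_demo`, and `model` and the data itself are assumptions for illustration; this is not the notebook's own code, and no score from it should be compared with the result reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 2000 rows, 15 features (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 15))
y_demo = X_demo[:, 0] ** 3 + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Inside the pipeline, the scaler and the polynomial expansion are fitted
# on the training rows only and then reused unchanged at prediction time.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=3),
    LinearRegression(),
)
model.fit(X_tr, y_tr)

# Number of columns after the degree-3 expansion of 15 inputs (bias column included).
n_terms = model.named_steps["polynomialfeatures"].n_output_features_
print(n_terms)

pred_demo = model.predict(X_te)
print(pred_demo.shape)
```

With 15 inputs and degree 3 the expansion has 816 columns, which is worth keeping in mind when judging how well the fitted model will generalise.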

In [101]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
# The `normalize` argument was removed in scikit-learn 1.2; the default behaves the same here
lr = LinearRegression()

In [102]:
lr.fit(poly.fit_transform(x_train),y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [103]:
# Reuse the already-fitted transformer on the test set instead of refitting it
pred = lr.predict(poly.transform(x_test))

In [104]:
from sklearn.metrics import r2_score
r2_score(y_test,pred)

0.7092645239933695