# Problem Statement

In this project, we use a dataset that contains fields such as Address, Rooms, Type, Price, and Seller to predict the price of a given house. Before making any prediction, pre-process the data first, since it may contain irregularities and noise. In addition, try various techniques to get the best accuracy from your predictions.

# Part-1: Data Exploration and Pre-processing

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.metrics import r2_score,mean_squared_error as mse,mean_absolute_error as mae

### 1 - Importing the Dataset

In [2]:
df = pd.read_csv(r"C:\Users\Vyas\1_Assignment\ML FT Projects\Linear Regression\P2_House Price Prediction Project\Python_Linear_Regres.csv")
df.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 03/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 03/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 04/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 04/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 04/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |


### 2 - Print the Column Names

In [3]:
df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

### 3 - Describing the Data

In [4]:
df.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34857.0 | 27247.0 | 34856.0 | 34856.0 | 26640.0 | 26631.0 | 26129.0 | 23047.0 | 13742.0 | 15551.0 | 26881.0 | 26881.0 | 34854.0 |
| mean | 3.031012 | 1050173.0 | 11.184929 | 3116.062859 | 3.084647 | 1.624798 | 1.728845 | 593.598993 | 160.2564 | 1965.289885 | -37.810634 | 145.001851 | 7572.888306 |
| std | 0.969933 | 641467.1 | 6.788892 | 109.023903 | 0.98069 | 0.724212 | 1.010771 | 3398.841946 | 401.26706 | 37.328178 | 0.090279 | 0.120169 | 4428.090313 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.19043 | 144.42379 | 83.0 |
| 25% | 2.0 | 635000.0 | 6.4 | 3051.0 | 2.0 | 1.0 | 1.0 | 224.0 | 102.0 | 1940.0 | -37.86295 | 144.9335 | 4385.0 |
| 50% | 3.0 | 870000.0 | 10.3 | 3103.0 | 3.0 | 2.0 | 2.0 | 521.0 | 136.0 | 1970.0 | -37.8076 | 145.0078 | 6763.0 |
| 75% | 4.0 | 1295000.0 | 14.0 | 3156.0 | 4.0 | 2.0 | 2.0 | 670.0 | 188.0 | 2000.0 | -37.7541 | 145.0719 | 10412.0 |
| max | 16.0 | 11200000.0 | 48.1 | 3978.0 | 30.0 | 12.0 | 26.0 | 433014.0 | 44515.0 | 2106.0 | -37.3902 | 145.52635 | 21650.0 |


### 4 - Dropping the non-relevant columns

In [5]:
df.drop(['Address','Date','Postcode','YearBuilt','Lattitude'], axis=1,inplace=True)
df.head()

| | Suburb | Rooms | Type | Price | Method | SellerG | Distance | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | CouncilArea | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 2 | h | NaN | SS | Jellis | 2.5 | 2.0 | 1.0 | 1.0 | 126.0 | NaN | Yarra City Council | 144.9958 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 2 | h | 1480000.0 | S | Biggin | 2.5 | 2.0 | 1.0 | 1.0 | 202.0 | NaN | Yarra City Council | 144.9984 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 2 | h | 1035000.0 | S | Biggin | 2.5 | 2.0 | 1.0 | 0.0 | 156.0 | 79.0 | Yarra City Council | 144.9934 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 3 | u | NaN | VB | Rounds | 2.5 | 3.0 | 2.0 | 1.0 | 0.0 | NaN | Yarra City Council | 145.0116 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 3 | h | 1465000.0 | SP | Biggin | 2.5 | 3.0 | 2.0 | 0.0 | 134.0 | 150.0 | Yarra City Council | 144.9944 | Northern Metropolitan | 4019.0 |


### 5 - Counting Null values in each column

In [6]:
df.isnull().sum()

Suburb               0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Distance             1
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
CouncilArea          3
Longtitude        7976
Regionname           3
Propertycount        3
dtype: int64

### 6 - Fill Null values with 0 in a few columns

In [7]:
x = ['Propertycount','Distance','Bedroom2','Bathroom','Car']
for i in x:
    df[i] = df[i].fillna(0)
df.isnull().sum()

Suburb               0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Distance             0
Bedroom2             0
Bathroom             0
Car                  0
Landsize         11810
BuildingArea     21115
CouncilArea          3
Longtitude        7976
Regionname           3
Propertycount        0
dtype: int64
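The loop in cell 7 fills one column at a time. As a minimal sketch, the same selective fill can be written as a single `fillna` call that takes a column-to-value mapping. The `toy` frame and its values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Bathroom": [1.0, np.nan, 2.0],
    "Car": [np.nan, 1.0, 0.0],
    "Landsize": [126.0, np.nan, 156.0],
})

# Fill only the listed columns with 0; columns not in the mapping keep their NaNs.
zero_fill_cols = ["Bathroom", "Car"]
toy_filled = toy.fillna({col: 0 for col in zero_fill_cols})

print(toy_filled.isnull().sum())
```

Columns left out of the mapping, such as `Landsize` here, are untouched, which mirrors how the notebook leaves `Landsize` and `BuildingArea` for the next step.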

### 7 - Fill Null values of two columns with mean values

In [8]:
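# Assign the filled column back instead of calling fillna(inplace=True) on a
# column selection, which newer pandas versions warn about.
landsize_mean = df['Landsize'].mean()
df['Landsize'] = df['Landsize'].fillna(landsize_mean)
building_area_mean = df['BuildingArea'].mean()
df['BuildingArea'] = df['BuildingArea'].fillna(building_area_mean)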

In [9]:
df.dropna(inplace=True)
df.isnull().sum()

Suburb           0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Distance         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
CouncilArea      0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
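To make the mean imputation from step 7 concrete, here is a small sketch on a made-up column. `Series.mean()` skips missing values by default, so the fill value is the average of the observed entries only. The `toy` frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# One missing value between two observed ones (illustrative data).
toy = pd.DataFrame({"Landsize": [100.0, np.nan, 300.0]})

# mean() ignores NaN, so this is (100 + 300) / 2.
landsize_mean = toy["Landsize"].mean()

# Replace the missing entry with that mean.
toy["Landsize"] = toy["Landsize"].fillna(landsize_mean)

print(toy)
```

This is also why rows with a missing `BuildingArea` later show the value 160.2564, which is the column mean reported by `df.describe()`.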

### 8 - Unique values in the Method column

In [10]:
df['Method'].unique()

array(['S', 'SP', 'PI', 'VB', 'SA'], dtype=object)

### 9 - Creating dummy variables for the Categorical Data using encoding

In [11]:
df=pd.get_dummies(df,drop_first=True)

In [12]:
print("No of Rows: ", df.shape[0])
print("No. of Columns: ", df.shape[1])

No of Rows:  20993
No. of Columns:  715
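The jump to 715 columns comes from one-hot encoding every text column. The sketch below shows what `pd.get_dummies(..., drop_first=True)` does on a made-up three-row frame: each categorical column becomes one indicator column per category, minus the first category in sorted order.

```python
import pandas as pd

# Illustrative frame: one numeric column and one categorical column.
toy = pd.DataFrame({
    "Rooms": [2, 3, 3],
    "Type": ["h", "u", "t"],
})

# Categories are sorted (h, t, u); drop_first=True drops the "h" indicator.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())  # ['Rooms', 'Type_t', 'Type_u']
```

Dropping one level per column avoids perfectly collinear indicators, which matters for an unregularized linear regression. High-cardinality columns such as `Suburb` and `SellerG` are what drive the column count up.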


# Part-2: Working with Model

### 1 - Create the target data and feature data where target data is price

In [13]:
df.head()

| | Rooms | Price | Distance | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | Longtitude | Propertycount | ... | CouncilArea_Wyndham City Council | CouncilArea_Yarra City Council | CouncilArea_Yarra Ranges Shire Council | Regionname_Eastern Victoria | Regionname_Northern Metropolitan | Regionname_Northern Victoria | Regionname_South-Eastern Metropolitan | Regionname_Southern Metropolitan | Regionname_Western Metropolitan | Regionname_Western Victoria |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1480000.0 | 2.5 | 2.0 | 1.0 | 1.0 | 202.0 | 160.2564 | 144.9984 | 4019.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1035000.0 | 2.5 | 2.0 | 1.0 | 0.0 | 156.0 | 79.0 | 144.9934 | 4019.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 1465000.0 | 2.5 | 3.0 | 2.0 | 0.0 | 134.0 | 150.0 | 144.9944 | 4019.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 3 | 850000.0 | 2.5 | 3.0 | 2.0 | 1.0 | 94.0 | 160.2564 | 144.9969 | 4019.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 4 | 1600000.0 | 2.5 | 3.0 | 1.0 | 2.0 | 120.0 | 142.0 | 144.9941 | 4019.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


In [14]:
X=df.drop('Price',axis=1)
y=df['Price']

### 2 - Create a linear regression model for Target and feature data

In [15]:
X_train,X_test,y_train,y_test = tts(X,y,test_size=0.2,random_state=6)

In [16]:
X_test

| | Rooms | Distance | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | Longtitude | Propertycount | Suburb_Aberfeldie | ... | CouncilArea_Wyndham City Council | CouncilArea_Yarra City Council | CouncilArea_Yarra Ranges Shire Council | Regionname_Eastern Victoria | Regionname_Northern Metropolitan | Regionname_Northern Victoria | Regionname_South-Eastern Metropolitan | Regionname_Southern Metropolitan | Regionname_Western Metropolitan | Regionname_Western Victoria |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8072 | 3 | 4.5 | 3.0 | 2.0 | 1.0 | 8216.000000 | 130.0000 | 144.98760 | 7717.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 15620 | 3 | 8.2 | 3.0 | 2.0 | 1.0 | 161.000000 | 149.0000 | 144.89368 | 1308.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 17065 | 4 | 11.7 | 4.0 | 2.0 | 2.0 | 535.000000 | 160.2564 | 144.85545 | 5629.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 17168 | 1 | 10.1 | 1.0 | 1.0 | 1.0 | 0.000000 | 76.0000 | 145.06865 | 4442.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 16658 | 4 | 10.2 | 4.0 | 1.0 | 1.0 | 676.000000 | 160.2564 | 145.08696 | 3052.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16509 | 1 | 12.0 | 1.0 | 1.0 | 1.0 | 113.000000 | 160.2564 | 145.02643 | 21650.0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 22408 | 3 | 19.6 | 3.0 | 2.0 | 2.0 | 593.598993 | 125.0000 | 145.04721 | 10926.0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9719 | 4 | 9.7 | 4.0 | 3.0 | 2.0 | 650.000000 | 348.0000 | 144.92080 | 3284.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 15067 | 3 | 7.8 | 3.0 | 2.0 | 2.0 | 334.000000 | 128.0000 | 145.04964 | 5549.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [17]:
model = LinearRegression()

In [18]:
model.fit(X_train,y_train)

LinearRegression()

In [19]:
pred_price = model.predict(X_test)
pred_price

array([1955594.8216362 ,  933920.77701104, 1118555.65522811, ...,
       1648361.36310378, 1393408.87598664,  461466.50706738])

In [20]:
model.score(X_train,y_train)

0.6963359082956294

In [21]:
model.score(X_test,y_test)

0.6880204136383825

### 3 - Checking if model is overfitting or underfitting

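The train score (R² ≈ 0.696) and the test score (R² ≈ 0.688) are very close, so the model is not overfitting. An R² near 0.69 still leaves a fair amount of variance unexplained, so mild underfitting cannot be ruled out from these two numbers alone.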
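The check above compares two R² values by eye. A minimal sketch of the same comparison as a reusable helper follows. The function name `r2_gap` and the synthetic data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=6
)
demo_model = LinearRegression().fit(X_tr, y_tr)


def r2_gap(model, X_tr, y_tr, X_te, y_te):
    """Return (train R², test R², train minus test) for a fitted regressor."""
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    return train_r2, test_r2, train_r2 - test_r2


train_r2, test_r2, gap = r2_gap(demo_model, X_tr, y_tr, X_te, y_te)
print(f"train={train_r2:.3f} test={test_r2:.3f} gap={gap:.3f}")
```

A large positive gap points to overfitting. Low scores on both sets point to underfitting. What counts as "large" or "low" is a judgment call that depends on the problem.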

### 4 - If the model is overfitting then apply ridge and lasso regression algorithms

There is no overfitting issue, but we can still apply ridge and lasso regression to see whether they change the score significantly.

#### (A) Lasso Regression

In [22]:
lasso_reg = Lasso(alpha = 50, max_iter = 100)
lasso_reg.fit(X_train,y_train)

  model = cd_fast.enet_coordinate_descent(


Lasso(alpha=50, max_iter=100)

In [23]:
lasso_reg.score(X_train,y_train)

0.6917178037580247

In [24]:
lasso_reg.score(X_test,y_test)

0.6900048265689334

#### (B) Ridge Regression

In [25]:
ridge_reg = Ridge(alpha = 500, max_iter = 100)
ridge_reg.fit(X_train,y_train)

Ridge(alpha=500, max_iter=100)

In [26]:
ridge_reg.score(X_train,y_train)

0.6280721210058698

In [27]:
ridge_reg.score(X_test,y_test)

0.6263640844849787

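Lasso (test R² ≈ 0.690) is on par with plain linear regression (test R² ≈ 0.688), and Ridge with alpha = 500 scores lower (test R² ≈ 0.626). Neither improves accuracy meaningfully, so there is not much need for Lasso or Ridge here.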
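The notebook tries a single `alpha` per model. As a hedged sketch of how the comparison could be widened, the loop below scores Ridge and Lasso over a few `alpha` values on synthetic data. The data, the `alpha` grid, and the larger `max_iter` are all assumptions chosen for illustration, not tuned settings for the housing dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data where two of the five features carry no signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=6
)

# Test R² for each (model, alpha) pair.
results = {}
for name, Estimator in [("ridge", Ridge), ("lasso", Lasso)]:
    for alpha in [0.01, 1.0, 100.0]:
        est = Estimator(alpha=alpha, max_iter=10_000).fit(X_tr, y_tr)
        results[(name, alpha)] = est.score(X_te, y_te)

for key, score in results.items():
    print(key, round(score, 3))
```

On data like this, very large `alpha` values shrink the coefficients so much that the test score drops, which is consistent with Ridge at `alpha = 500` scoring below the unregularized model in the notebook.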

### 5 - Extract slope and intercept value from the model

In [28]:
print("Slope of the model is: ", model.coef_)
print("Intercept of the model is: ", model.intercept_)

Slope of the model is:  [ 2.11779735e+05 -5.21331559e+04 -3.37116014e+04  1.36035204e+05
  4.42671277e+04  2.59014949e+00  3.29648213e+01 -1.32893654e+06
  1.38942504e+00  2.96509717e+05 -9.03808496e+04 -8.29804845e+04
  2.32556538e+05  1.50839187e+05  3.29876744e+05  1.63269445e+05
 -1.20917717e+05 -1.54699411e+05 -3.69217025e+04  4.84366846e+04
 -8.22063076e+04 -3.71481749e+04 -4.52670699e+03  1.48631238e+05
 -2.22751073e+05 -2.21528841e+05 -2.74070525e+05 -8.96840920e+04
 -9.54725754e+04  2.19862115e+05  3.39289240e+04  1.32824093e+05
  2.47875146e+04  1.93948913e+05  4.10536898e+04  1.77935656e+05
 -3.91102556e+05  7.24338252e+04  2.23943107e+04  1.13239363e+05
  2.20610642e+05  1.83113873e+05 -1.42037866e+05 -5.95131718e+04
  3.37886910e+05  2.03026537e+05 -9.02218744e-10  1.47019722e+05
 -1.55599561e+05 -1.03648520e+05  3.24338482e+05 -1.71152431e+05
 -1.11667755e+05 -1.69690981e+05  1.57785820e+05  2.14381417e+04
  4.61616794e+04 -2.90817331e+04  1.92725565e-07 -3.15095353e+05
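A raw coefficient array with hundreds of entries is hard to read. One possible way to make it interpretable is to pair each coefficient with its feature name in a `pandas.Series`. The sketch below does this on a small made-up frame. The names `X_demo`, `demo_model`, and `coef_table` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative features and a target that is an exact linear function of them.
X_demo = pd.DataFrame({
    "Rooms": [1, 2, 3, 4, 5],
    "Distance": [10.0, 8.0, 9.0, 3.0, 2.0],
})
y_demo = 100.0 + 50.0 * X_demo["Rooms"] - 5.0 * X_demo["Distance"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with its column name, largest magnitude first.
coef_table = pd.Series(demo_model.coef_, index=X_demo.columns)
coef_table = coef_table.sort_values(key=np.abs, ascending=False)

print(coef_table)
print("intercept:", demo_model.intercept_)
```

Applied to the notebook's model, the same pattern would be `pd.Series(model.coef_, index=X.columns)`, since `X` is the DataFrame the model was trained on.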
 

### 6 - Display Mean Squared Error

In [29]:
m_s_e = mse(y_test,pred_price)
m_s_e

129782617625.28763

### 7 - Display Mean Absolute Error

In [30]:
m_a_e = mae(y_test,pred_price)
m_a_e

231156.88695036597

### 8 - Root Mean Squared Error

In [31]:
rmse = np.sqrt(m_s_e)
rmse

360253.54630494287

### 9 - R2 Score

In [32]:
r2 = r2_score(y_test,pred_price)
r2

0.6880204136383825
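The four metrics in steps 6 to 9 are closely related: RMSE is the square root of MSE, and R² compares the squared residuals to the variance of the target. The value in step 9 matches `model.score(X_test, y_test)` from cell 21 because `score` reports R² for regressors. The sketch below recomputes each metric by hand on made-up numbers and checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat

mse_manual = np.mean(residuals ** 2)        # mean of squared errors
mae_manual = np.mean(np.abs(residuals))     # mean of absolute errors
rmse_manual = np.sqrt(mse_manual)           # same units as the target
ss_res = np.sum(residuals ** 2)             # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# The hand-computed values agree with the library functions.
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because RMSE and MAE are in the target's units, the notebook's RMSE of about 360,000 and MAE of about 231,000 can be read directly as price errors.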