<a href="https://colab.research.google.com/github/DLPY/Regression-Session-2/blob/master/SydneyHousePrices_MultipleRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>**House Price Prediction Based on Postal Code, Number of Bedrooms and Bathrooms, Car Parking and Property Type**</center>

Details on the data: https://www.kaggle.com/mihirhalai/sydney-house-prices

## 1. Download source data from GitHub

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/SydneyHousePrices.csv

## 2. Import necessary packages for performing EDA and multiple regression

In [None]:
# Imports
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
%matplotlib inline
pd.set_option('display.max_colwidth', None)

## 3. Read data into pandas dataframe to perform data analysis, cleaning and transformation

In [None]:
df = pd.read_csv('SydneyHousePrices.csv')

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
round(df.describe(),2)

**The max values in the summary above show that the dataset contains outliers, which need to be removed.**

## 4. Choosing predictor and target variables for multiple regression

**Target and predictor variables**

*   `sellPrice` - Target variable
*   `bed`, `bath`, `car`, `propType`, `postalCode` - Predictor variables



In [None]:
# Take an explicit copy so later column assignments do not raise SettingWithCopyWarning
df_new = df[['postalCode', 'bed', 'bath', 'car', 'propType', 'sellPrice']].copy()

In [None]:
df_new.propType.unique()

In [None]:
df_new.head(5)

Encode the categorical variable: convert the text labels in `propType` into integer codes

In [None]:
df_new['propType'] = df_new['propType'].astype('category').cat.codes
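
Integer category codes put the property types on an arbitrary numeric scale (codes follow the sorted order of the labels), and a linear model treats that scale as meaningful. One alternative, sketched below on a made-up toy frame, is one-hot encoding with `pd.get_dummies`. The frame `toy` and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the property type labels are illustrative only.
toy = pd.DataFrame({
    'bed': [2, 3, 4],
    'propType': ['house', 'unit', 'house'],
})

# cat.codes assigns integers by sorted category order: house -> 0, unit -> 1
codes = toy['propType'].astype('category').cat.codes

# One-hot encoding creates an indicator column per category instead;
# drop_first=True drops one level so the columns are not redundant with an intercept.
one_hot = pd.get_dummies(toy, columns=['propType'], drop_first=True)

print(codes.tolist())
print(one_hot.columns.tolist())
```

Whether one-hot encoding improves this particular model is not something the notebook tests; the sketch only shows the mechanics.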

In [None]:
df_new.propType.unique()

In [None]:
df_new.head(5)

In [None]:
df_new.dtypes

In [None]:
df_new.count()

In [None]:
# replace() returns a new dataframe, so assign the result back
df_new = df_new.fillna(0)

## 5. Remove outliers in the data

In [None]:
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
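
To see what the 1.5 × IQR fences do, here is a small worked example on hypothetical prices (the numbers are invented for illustration and are not from the dataset). It repeats the same quantile and fence arithmetic as `remove_outlier` step by step.

```python
import pandas as pd

# Hypothetical prices (in thousands); 5000 is an obvious outlier.
toy = pd.DataFrame({'sellPrice': [500, 600, 700, 800, 900, 5000]})

q1 = toy['sellPrice'].quantile(0.25)   # first quartile
q3 = toy['sellPrice'].quantile(0.75)   # third quartile
iqr = q3 - q1                          # interquartile range
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Keep only rows strictly inside the fences
kept = toy[(toy['sellPrice'] > fence_low) & (toy['sellPrice'] < fence_high)]

print('fences:', fence_low, fence_high)
print('kept:', kept['sellPrice'].tolist())
```

Rows with a missing value in the chosen column are also dropped by this filter, because comparisons against `NaN` evaluate to `False`.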

In [None]:
df_no_outlier_bed=remove_outlier(df_new, 'bed').reset_index(drop=True)
df_no_outlier_bath=remove_outlier(df_no_outlier_bed, 'bath').reset_index(drop=True)
df_no_outlier_car=remove_outlier(df_no_outlier_bath, 'car').reset_index(drop=True)
df_no_outlier=remove_outlier(df_no_outlier_car, 'sellPrice').reset_index(drop=True)

In [None]:
df_no_outlier

In [None]:
corr = df_no_outlier.corr()

In [None]:
corr

## 6. Correlation heatmap with the mask and correct aspect ratio

In [None]:
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; use the built-in bool
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(10, 250, as_cmap=True)
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr, mask=mask, cmap=cmap,
            square=True, linewidths=.2, cbar_kws={'shrink': .5}, ax=ax, annot=True)

## 7. Split Target and Predictor Variables to different dataframes

In [None]:
X = df_no_outlier.iloc[:,:-1]
Y = df_no_outlier.iloc[:,5]

In [None]:
X.head(5)

In [None]:
Y.head(5)

Convert the dataframes to NumPy arrays to feed into the model

In [None]:
X = X.values
Y = Y.values

In [None]:
print('Number of records and predictor variables:', X.shape)
print('Number of records and target variable:', Y.shape)

## 8. Split the dataset into training and test sets using train_test_split: 90% train and 10% test

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size=0.1, random_state=23)
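
As a quick illustration of what `test_size` and `random_state` control, the sketch below splits a small synthetic array twice with the same seed. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the house price data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 hypothetical records with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# test_size=0.1 holds out 10% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=23)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.1, random_state=23)

print(X_tr.shape, X_te.shape)
print('same split on rerun:', np.array_equal(X_te, X_te2))
```

Fixing `random_state` makes the split, and therefore the reported metrics, reproducible across runs.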

In [None]:
print ('Training Data:',X_train.shape, Y_train.shape)
print ('Testing Data:',X_test.shape, Y_test.shape)

## 9. Train, Test and Predict using regression model

Create an instance of the `LinearRegression` class called `regressor`

In [None]:
regressor = LinearRegression()

Fit the linear regression model to the training set by calling the `fit` method with the training predictors and training target as arguments

In [None]:
regressor.fit(X_train,Y_train)
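
To make concrete what `fit` estimates, the sketch below fits a model to synthetic, noise-free data generated from known coefficients. The data and the names `X_demo`, `y_demo` and `demo` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(50, 2))

# Noise-free target built from known values: y = 100 + 30*x1 + 20*x2
y_demo = 100 + 30 * X_demo[:, 0] + 20 * X_demo[:, 1]

demo = LinearRegression().fit(X_demo, y_demo)

# fit() estimates one coefficient per predictor plus an intercept
print('intercept:', demo.intercept_)
print('coefficients:', demo.coef_)
```

On real data the recovered values are estimates, and each coefficient is read as the expected change in the target for a one-unit change in that predictor with the others held fixed.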

Regression Coefficients

In [None]:
print('Coefficients: ', regressor.coef_)

In [None]:
columns=['postalCode', 'bed', 'bath', 'car', 'propType' ]
Coefficient = pd.DataFrame(regressor.coef_).T
Coefficient.columns = columns
Coefficient

Predict the test set results

In [None]:
Y_pred= regressor.predict(X_test)

## 10. Evaluation metrics - How to calculate R-squared and RMSE

In [None]:
coefficient_of_determination = r2_score(Y_test, Y_pred)
print('R-squared:', coefficient_of_determination)

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print('Root Mean Squared Error: {}'.format(rmse))
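
Both metrics can also be computed by hand, which shows what the library calls measure: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The sketch below uses four made-up values (`y_true`, `y_hat`) and compares the manual results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print('R-squared:', r2_manual, r2_score(y_true, y_hat))
print('RMSE:', rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

RMSE is expressed in the same units as the target, so on the house price data it reads as a typical prediction error in currency terms.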

In [None]:
# Display actual prices, predictions and their difference in a table
res = pd.DataFrame({'Price': Y_test, 'Prediction': Y_pred})
res['Prediction'] = res['Prediction'].round(0)
res['Difference'] = res['Price'] - res['Prediction']
res.head(5)

## 11. Regression on Full data using OLS model

In [None]:
# statsmodels does not add an intercept automatically, so add a constant column
Regression = sm.OLS(endog=Y, exog=sm.add_constant(X)).fit()

In [None]:
print(Regression.summary())

### To read more on skewness, kurtosis, autocorrelation and multicollinearity

---
https://www.sciencedirect.com/topics/neuroscience/kurtosis 

---
https://www.investopedia.com/terms/d/durbin-watson-statistic.asp 

---
https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity