<a href="https://colab.research.google.com/github/DLPY/Regression-Session-2/blob/master/SydneyHousePrices_Ridge_Lasso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **House Price Prediction based on Postal Code, Number of Bathrooms, Car Parking and Property Type**

Dataset details: https://www.kaggle.com/mihirhalai/sydney-house-prices

## 1. Download source data from Github

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/SydneyHousePrices.csv

## 2. Import necessary packages for performing EDA, Lasso and Ridge regression

In [None]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import seaborn as sns
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

%matplotlib inline

pd.set_option('display.max_colwidth', None)

## 3. Read data into pandas dataframe to perform data analysis, cleaning and transformation

In [None]:
df = pd.read_csv('SydneyHousePrices.csv')

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
round(df.describe(),2)

**The maximum values in the summary above indicate that the dataset contains outliers, which need to be removed.**

## 4. Choosing predictors and target variables for performing Regression

**Target and predictor variables**

*   `sellPrice` - Target Variable
*   `postalCode`, `bath`, `car`, `propType` - Predictor Variables



In [None]:
df_new = df[(df['Date']>= '2017-01-01') & (df['Date']< '2020-01-01')].reset_index(drop=True)
# Take an explicit copy so that later column assignments do not act on a view of df
df_new = df_new[['postalCode', 'bath', 'car', 'propType', 'sellPrice']].copy()
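
The date filter above compares strings, which only orders correctly when every `Date` value is in ISO `YYYY-MM-DD` form. A minimal sketch of a more defensive version, assuming that format and using a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative rows only; the real data comes from SydneyHousePrices.csv
toy = pd.DataFrame({
    'Date': ['2016-12-31', '2017-01-01', '2019-12-31', '2020-01-01'],
    'sellPrice': [1, 2, 3, 4],
})

# Parse once so comparisons are made on timestamps rather than on text
toy['Date'] = pd.to_datetime(toy['Date'])

# Half-open interval: 2017-01-01 inclusive up to, but excluding, 2020-01-01
in_range = toy[(toy['Date'] >= '2017-01-01') & (toy['Date'] < '2020-01-01')]
print(in_range)
```

Parsing first also makes malformed dates fail loudly instead of being silently mis-sorted.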

In [None]:
df_new.propType.unique()

In [None]:
df_new.head(5)

Encode the categorical variable `propType` as integer codes so that the models can use it

In [None]:
df_new['propType'] = df_new['propType'].astype('category').cat.codes
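
Once the column is overwritten, the link between each code and its label is lost, so it can help to record the mapping first. The sketch below uses a toy series (`prop` and its labels are assumptions for illustration) to show how `cat.codes` numbers the categories, and shows `pd.get_dummies` as one possible alternative, since integer codes imply an ordering that a linear model will treat as meaningful:

```python
import pandas as pd

# Toy stand-in for df_new['propType']; the labels are illustrative only
prop = pd.Series(['house', 'townhouse', 'house', 'villa'], name='propType')

cat = prop.astype('category')
# Categories are sorted, and each code is the position of the label in that list
code_map = dict(enumerate(cat.cat.categories))
print(code_map)
print(cat.cat.codes.tolist())

# Alternative: one indicator column per category, so no ordering is implied
one_hot = pd.get_dummies(prop, prefix='propType')
print(one_hot.columns.tolist())
```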

In [None]:
df_new.propType.unique()

In [None]:
df_new.head(5)

In [None]:
df_new.dtypes

In [None]:
df_new.count()

In [None]:
df_new.isnull().sum()

In [None]:
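# Fill missing 'car' values with the median for the same postcode.
# Assign the result back: calling fillna(inplace=True) on a selected column may not update df_new.
df_new['car'] = df_new['car'].fillna(df_new.groupby(['postalCode'])['car'].transform('median'))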
#df_new['bed'].fillna(df_new.groupby(['postalCode', 'propType'])['bed'].transform('median'), inplace=True)

In [None]:
df_new.isnull().sum()
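
To see what the group-wise median fill does, here is a small sketch on made-up data (the frame `toy` and its values are illustrative). `transform('median')` returns a series aligned with the original rows, so it can be passed straight to `fillna`:

```python
import numpy as np
import pandas as pd

# Illustrative data: two postcodes, each with one missing 'car' value
toy = pd.DataFrame({
    'postalCode': [2000, 2000, 2000, 2010, 2010],
    'car': [1.0, 3.0, np.nan, 2.0, np.nan],
})

# Median of the non-missing values in each postcode, repeated for every row of that postcode
group_median = toy.groupby('postalCode')['car'].transform('median')

# Only the missing entries are replaced; existing values are left alone
toy['car'] = toy['car'].fillna(group_median)
print(toy)
```

A postcode whose `car` values are all missing has no median, so those rows would stay missing and need a separate fallback.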

## 5. Remove outliers in the data

In [None]:
def remove_outlier(df_in, col_name):
    """Return the rows of df_in whose col_name value lies strictly inside the 1.5*IQR fences."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
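
The fence calculation can be checked by hand on a short series. The values below are made up for illustration; the steps mirror the function above:

```python
import pandas as pd

# Illustrative prices with one obvious outlier
prices = pd.Series([400, 450, 500, 550, 600, 5000], name='sellPrice')

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values strictly between the two fences
kept = prices[(prices > fence_low) & (prices < fence_high)]
print(fence_low, fence_high)
print(kept.tolist())
```

Because the comparison is strict, a column whose quartiles coincide (IQR of zero) would lose every row, which is worth checking for low-cardinality columns such as `bath` or `car`.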

In [None]:
#df_no_outlier_bed=remove_outlier(df_new, 'bed').reset_index(drop=True)
df_no_outlier_bath=remove_outlier(df_new, 'bath').reset_index(drop=True)
df_no_outlier_car=remove_outlier(df_no_outlier_bath, 'car').reset_index(drop=True)
df_no_outlier=remove_outlier(df_no_outlier_car, 'sellPrice').reset_index(drop=True)

In [None]:
df_no_outlier

In [None]:
corr = df_no_outlier.corr()

In [None]:
corr

## 6. Correlation heatmap with the mask and correct aspect ratio

In [None]:
# Use the built-in bool: the np.bool alias has been removed from recent NumPy releases
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(10, 250, as_cmap=True)
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr, mask=mask, cmap=cmap,
            square=True, linewidths=.2, cbar_kws={'shrink': .5}, ax=ax, annot=True)
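
The mask hides the upper triangle and the diagonal, which only repeat information in a symmetric correlation matrix. A minimal sketch on a made-up 3x3 matrix (`corr_toy` is illustrative) shows which cells remain visible:

```python
import numpy as np

# Illustrative symmetric correlation matrix
corr_toy = np.array([
    [1.0, 0.3, 0.5],
    [0.3, 1.0, 0.2],
    [0.5, 0.2, 1.0],
])

# True marks a cell that the heatmap will hide
mask_toy = np.zeros_like(corr_toy, dtype=bool)
mask_toy[np.triu_indices_from(mask_toy)] = True
print(mask_toy)

# The cells that stay visible are the strictly lower triangle
visible = corr_toy[~mask_toy]
print(visible)
```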

## 7. Split Target and Predictor Variables to different dataframes

In [None]:
X = df_no_outlier.iloc[:,:-1]
y = df_no_outlier.iloc[:,4]

In [None]:
X.head(5)

In [None]:
y.head(5)

Convert the DataFrames to NumPy arrays to feed into the model

In [None]:
X = X.values
y = y.values

In [None]:
print('Number of records and predictor variables:', X.shape)
print('Number of records and target variable:', y.shape)

## 8. Split dataset into training and test sets using train_test_split: 90% - train and 10% - test

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.1, random_state=23)

In [None]:
print('Training Data:', X_train.shape, y_train.shape)
print('Testing Data:', X_test.shape, y_test.shape)

## 9. Train, Test and Predict using ridge and lasso regression model

Create a `Ridge` regression object called `ridgeregressor` with alpha 0.01

In [None]:
ridgeregressor = Ridge(alpha=0.01)

Fit the ridge regression model to the training set. The `fit` method takes the training predictors and the training target as its arguments.

In [None]:
ridgeregressor.fit(X_train, y_train)
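
Ridge regression minimises the residual sum of squares plus `alpha` times the squared L2 norm of the coefficients, leaving the intercept unpenalised. As a rough check of what `fit` computes, the sketch below compares scikit-learn's result with the closed-form solution on synthetic data (`X_toy`, `y_toy` and the chosen `alpha` are assumptions for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)
alpha = 1.0

model = Ridge(alpha=alpha).fit(X_toy, y_toy)

# Centre the data so the intercept drops out of the penalised problem
Xc = X_toy - X_toy.mean(axis=0)
yc = y_toy - y_toy.mean()

# Closed form: w = (Xc^T Xc + alpha * I)^-1 Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X_toy.shape[1]), Xc.T @ yc)
# The intercept is recovered from the column means
b = y_toy.mean() - X_toy.mean(axis=0) @ w

print(np.allclose(model.coef_, w), np.isclose(model.intercept_, b))
```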

Regression Coefficients

In [None]:
print('Coefficients: ', ridgeregressor.coef_)

In [None]:
column_names = ['postalCode', 'bath', 'car', 'propType' ]
coefficient_df = pd.DataFrame(ridgeregressor.coef_).T
coefficient_df.columns = column_names
coefficient_df

Predicting the Test set results

In [None]:
y_pred= ridgeregressor.predict(X_test)

## 10. Evaluation metrics - How to Calculate R-Square and RMSE

In [None]:
coefficient_of_determination = r2_score(y_test, y_pred)
print('R-squared:', coefficient_of_determination)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: {}'.format(rmse))
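
Both metrics follow directly from the residuals: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. A small sketch with made-up numbers (`y_true` and `y_hat` are illustrative) computes them by hand and compares with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

RMSE is in the same units as the target, so here it reads as a typical error in dollars, whereas R-squared is unitless.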

In [None]:
# Display actual prices, predictions and their difference in a table
res = pd.DataFrame({'Price': y_test, 'Prediction': y_pred})
res['Prediction'] = res['Prediction'].round(0)
res['Difference'] = res['Price'] - res['Prediction']
res.head(5)

Create a `Ridge` regression object called `ridgeregressor100` with alpha 100

In [None]:
ridgeregressor100 = Ridge(alpha=100)
ridgeregressor100.fit(X_train, y_train)
print('Coefficients: ', ridgeregressor100.coef_)

In [None]:
column_names = ['postalCode', 'bath', 'car', 'propType' ]
coefficient_df = pd.DataFrame(ridgeregressor100.coef_).T
coefficient_df.columns = column_names
coefficient_df

In [None]:
y_pred= ridgeregressor100.predict(X_test)

coefficient_of_determination = r2_score(y_test, y_pred)
print('R-squared:', coefficient_of_determination)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: {}'.format(rmse))

In [None]:
# Display actual prices, predictions and their difference in a table
res = pd.DataFrame({'Price': y_test, 'Prediction': y_pred})
res['Prediction'] = res['Prediction'].round(0)
res['Difference'] = res['Price'] - res['Prediction']
res.head(5)
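
The ridge penalty acts on coefficient size, so its effect depends on the scale of each feature, and the models above are fitted on unscaled columns. One possible way to make a comparison across `alpha` values easier to read is to standardise the features first. The sketch below does this on synthetic data (the two features, their scales and the `alpha` grid are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two synthetic features on very different scales (think postcode vs. bathroom count)
X_toy = np.column_stack([rng.normal(2100, 100, 200), rng.integers(1, 4, 200)])
y_toy = 500 * X_toy[:, 0] + 100_000 * X_toy[:, 1] + rng.normal(0, 10_000, 200)

norms = {}
for alpha in [0.01, 100, 10_000]:
    # Standardise, then fit ridge, so the penalty treats both features alike
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X_toy, y_toy)
    coefs = pipe[-1].coef_
    norms[alpha] = np.linalg.norm(coefs)
    print(alpha, coefs.round(0))
```

The coefficient norm shrinks as `alpha` grows, but ridge does not set coefficients exactly to zero.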

Create a `Lasso` regression object called `lassoregressor` with alpha 0.01

In [None]:
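# max_iter must be an integer; a float such as 10e6 is rejected by recent scikit-learn versions
lassoregressor = Lasso(alpha=0.01, max_iter=10_000_000)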
lassoregressor.fit(X_train, y_train)
print('Coefficients: ', lassoregressor.coef_)

In [None]:
column_names = ['postalCode', 'bath', 'car', 'propType' ]
coefficient_df = pd.DataFrame(lassoregressor.coef_).T
coefficient_df.columns = column_names
coefficient_df

In [None]:
coeff_used = np.sum(lassoregressor.coef_!=0)
coeff_used
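
Counting non-zero coefficients matters for lasso because its L1 penalty can drive coefficients to exactly zero, which acts as feature selection. The sketch below illustrates this on synthetic data where only two of five features carry signal (the data and the `alpha` grid are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 5))
# Only the first two features drive the target; the other three are noise
y_toy = 4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

used = {}
for alpha in [0.001, 0.5, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_toy, y_toy)
    # Number of features the model actually keeps at this penalty strength
    used[alpha] = int(np.sum(model.coef_ != 0))
    print(alpha, model.coef_.round(3), used[alpha])
```

With a large enough `alpha` every coefficient is zero and the model predicts the mean of the training target.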

In [None]:
y_pred= lassoregressor.predict(X_test)

coefficient_of_determination = r2_score(y_test, y_pred)
print('R-squared:', coefficient_of_determination)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: {}'.format(rmse))

Create a `Lasso` regression object called `lassoregressor100` with alpha 100

In [None]:
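# max_iter must be an integer; a float such as 10e6 is rejected by recent scikit-learn versions
lassoregressor100 = Lasso(alpha=100, max_iter=10_000_000)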
lassoregressor100.fit(X_train, y_train)
print('Coefficients: ', lassoregressor100.coef_)

In [None]:
column_names = ['postalCode', 'bath', 'car', 'propType' ]
coefficient_df = pd.DataFrame(lassoregressor100.coef_).T
coefficient_df.columns = column_names
coefficient_df

In [None]:
y_pred= lassoregressor100.predict(X_test)

coefficient_of_determination = r2_score(y_test, y_pred)
print('R-squared:', coefficient_of_determination)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: {}'.format(rmse))

## 11. Regression on full data using OLS model

In [None]:
Regression = sm.OLS(endog = y, exog = X).fit()

In [None]:
print(Regression.summary())

### Further reading on lasso and ridge regression

---
https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b