<a href="https://colab.research.google.com/github/DLPY/Regression-Session-2/blob/master/Regression_Session_2_Lasso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# House Price Prediction based on Postal Code, Number of Bathrooms, Car Parking and Property Type

Details on the data: https://www.kaggle.com/mihirhalai/sydney-house-prices

# **1.Import necessary packages for performing EDA and Multiple Regression**

In [None]:
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 
import seaborn as sns
from sklearn.linear_model import (Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV)
from sklearn.metrics import (r2_score, mean_squared_error)
from sklearn.model_selection import (RepeatedKFold, train_test_split)
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, StandardScaler)

%matplotlib inline

pd.set_option('display.max_colwidth', None)

## i) Read data from csv file into Pandas dataframe

In [None]:
!wget https://raw.githubusercontent.com/DLPY/Regression-Session-2/master/Data/Processed_SydneyHousePrices.csv
df = pd.read_csv('Processed_SydneyHousePrices.csv')

## ii) Isolate Target and Predictor Variables to Different Dataframes

In [None]:
X = df[['postalCode', 'bed', 'bath', 'car', 'propType', 'diffDate', 'Year', 'Month', 'Day', 'Quarter', 'medSellPrice']]
y = df[['sellPrice']]

# Save this list of column values for later
columns_list = list(X.columns.values)

In [None]:
X.head(5)

In [None]:
y.head(5)

# **2.Standardise Features**

In [None]:
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [None]:
# The scaled values are now stored as an array.
X_std[: 5]
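
As a rough illustration of what the scaling step computes, the sketch below standardises a small made-up array by hand with NumPy: each column has its mean subtracted and is then divided by its population standard deviation (ddof=0), which is the convention StandardScaler follows. The array `toy` and its values are assumptions for illustration only and are not taken from the Sydney data.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 rows, 2 columns (for example bed and bath).
toy = np.array([[2.0, 1.0],
                [3.0, 2.0],
                [4.0, 2.0],
                [5.0, 3.0]])

# Column-wise mean and population standard deviation (ddof=0).
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)

# z-score: subtract the mean, then divide by the standard deviation.
toy_std = (toy - col_mean) / col_std

print(toy_std.mean(axis=0))  # approximately 0 for each column
print(toy_std.std(axis=0))   # approximately 1 for each column
```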

In [None]:
# X_std is already a NumPy array, so convert y from a DataFrame to an array as well.
y = y.values

# **3.Split dataset into training and test sets using train_test_split**

90% - train

10% - test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.1, random_state=23)

In [None]:
print('Training Data:', X_train.shape, y_train.shape)
print('Testing Data:', X_test.shape, y_test.shape)
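
The cells above scale the full data set before splitting it, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here with NumPy on made-up data, is to split first and compute the scaling statistics from the training rows only. This is not what the notebook does; the names `toy_X`, `train_idx` and `test_idx` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(23)

# Hypothetical data: 20 rows, 3 features.
toy_X = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Shuffle the row indices, then hold out 10% of the rows for testing.
indices = rng.permutation(len(toy_X))
n_test = len(toy_X) // 10
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Compute the scaling statistics from the training rows only.
train_mean = toy_X[train_idx].mean(axis=0)
train_std = toy_X[train_idx].std(axis=0)

# Apply the same training statistics to both subsets.
toy_train = (toy_X[train_idx] - train_mean) / train_std
toy_test = (toy_X[test_idx] - train_mean) / train_std

print(toy_train.shape, toy_test.shape)
```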

# **4.Train, Test and Predict using Lasso regression**

## i) Lasso Regression Model

<b>Note:</b> There were no modifications to the training and testing data sets so the same train/test data will be used again for this model.

Create a LassoCV estimator with a list of candidate alpha values and 5-fold cross-validation.

In [None]:
%%capture
# Suppress warnings while running this cell by placing %%capture on the very first line of the cell.

# Candidate alpha values, spaced on a log10 scale (the duplicated 0.1 entry has been dropped).
lasso_cv_regr = LassoCV(alphas=(1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0), cv=5, random_state=0)

# Fit the Lasso regression with cross-validation
lasso_cv_regr_model = lasso_cv_regr.fit(X_train, y_train.ravel())
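
The candidate alphas could also be generated rather than typed out. A small sketch, assuming the same range of 1e-3 to 100 with one value per power of ten; the name `alpha_grid` is hypothetical.

```python
import numpy as np

# Six values from 10**-3 to 10**2, one per power of ten.
alpha_grid = np.logspace(-3, 2, num=6)

print(alpha_grid)
```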

In [None]:
# View best alpha value
alpha_ = lasso_cv_regr_model.alpha_
print('The optimal alpha value is: {}'.format(alpha_))
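
For reference, scikit-learn documents the Lasso objective as the halved mean squared residual plus alpha times the L1 norm of the coefficients, so a larger alpha penalises large coefficients more heavily. The function below is a sketch that evaluates that objective with NumPy, ignoring the intercept for brevity; `lasso_objective` and the toy inputs are hypothetical names and values.

```python
import numpy as np

def lasso_objective(X, y, w, alpha):
    """Evaluate (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n_samples = X.shape[0]
    residual = y - X @ w
    squared_error = (residual @ residual) / (2 * n_samples)
    l1_penalty = alpha * np.abs(w).sum()
    return squared_error + l1_penalty

# Toy inputs chosen so the arithmetic is easy to follow by hand.
toy_X = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
toy_y = np.array([1.0, 2.0])
toy_w = np.array([1.0, 1.0])

# Residual is [0, 1]: error term 1 / 4 = 0.25, penalty 0.5 * 2 = 1.0.
print(lasso_objective(toy_X, toy_y, toy_w, alpha=0.5))  # 1.25
```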

## ii) How does penalty parameter impact features in Lasso Regression?

In [None]:
# Create a function that runs through a list of alpha values and outputs a table to review the penalty impact.

def lasso(alphas):
    '''Determine what impact lambda/alpha may have on feature selection.'''
    # Create an empty data frame (named coef_table to avoid shadowing the global df)
    coef_table = pd.DataFrame()

    # Create a column of feature names
    coef_table['Feature Name'] = columns_list

    # For each alpha value in the list of alpha values,
    for alpha in alphas:
        # Create a lasso regression with that alpha value,
        lasso_model = Lasso(alpha=alpha)

        # Fit the lasso regression on training data (ravel gives a 1-D target)
        lasso_model.fit(X_train, y_train.ravel())

        # Create a column name for that alpha value
        column_name = 'Alpha = %f' % alpha

        # Create a column of coefficient values
        coef_table[column_name] = lasso_model.coef_

    # Return the dataframe
    return coef_table

In [None]:
# As alpha increases, the coefficients shrink towards zero. Coefficients that reach exactly zero are of particular interest.
alpha_list = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]

print('Alpha Impact On Features')
lasso(alpha_list)

In the table above, notice that Day has a zero coefficient at Alpha=100, and the optimal alpha value found by LassoCV is currently 100.

At the chosen alpha, Day therefore contributes nothing to the model, so we remove 'Day' from the data set.
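
To see why a larger alpha can push a coefficient to exactly zero, the sketch below implements Lasso by cyclic coordinate descent with soft-thresholding on synthetic data. It is a teaching sketch under stated assumptions (no intercept, a fixed number of sweeps, made-up data), not scikit-learn's implementation; `soft_threshold` and `lasso_cd` are hypothetical helper names.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 inside [-threshold, threshold]."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_scale = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 4))
# Only the first two features influence the synthetic target.
toy_y = toy_X @ np.array([3.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

for alpha in [0.001, 0.1, 1.0, 10.0]:
    print(alpha, np.round(lasso_cd(toy_X, toy_y, alpha), 3))
```

In this sketch the two uninformative features are thresholded to zero well before the informative ones, and once alpha exceeds the largest absolute correlation between a feature and the target, every coefficient is zero.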

In [None]:
# Remove 'Day' feature from the model, and prepare the data as before.

X = df[['postalCode', 'bed', 'bath', 'car', 'propType', 'diffDate', 'Year', 'Month', 'Quarter', 'medSellPrice']]
y = df[['sellPrice']]

# Save this list of column values for later
columns_list = list(X.columns.values)

# Scale the values of X
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

y = y.values

In [None]:
# Split data into test/train, same parameter values as before
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.1, random_state=23)

Use LassoCV again to identify the optimal Alpha value.

In [None]:
%%capture

lasso_cv_regr = LassoCV(alphas=(1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0), cv=5, random_state=0)

# Fit the Lasso regression with cross-validation
lasso_cv_regr_model = lasso_cv_regr.fit(X_train, y_train.ravel())

In [None]:
# View best alpha value
alpha_ = lasso_cv_regr_model.alpha_
print('The optimal alpha value is: {}'.format(alpha_))

In [None]:
# Review the features coefficients again.
lasso(alpha_list)

In the table above, none of the coefficients are zero, even though the optimal alpha value found by LassoCV is now 10.

This means that we may proceed with modelling without modifying the features any further.

# **5.Create the Lasso model with alpha value discovered during cross-validation**

In [None]:
lasso_regr = Lasso(alpha=alpha_)

In [None]:
# Fit the Lasso regression
lasso_regr_model = lasso_regr.fit(X_train, y_train.ravel())

In [None]:
# Absolute coefficient values, shown as a single row with one column per feature
lasso_coef = pd.DataFrame(abs(lasso_regr_model.coef_)).T
lasso_coef.columns = columns_list
lasso_coef
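
Because the features were standardised, the coefficient magnitudes are on a roughly comparable scale, so sorting by absolute value gives a rough ranking of influence. The sketch below uses made-up coefficient values; `feature_names` and `coefs` are hypothetical and do not reflect the fitted model.

```python
import numpy as np

# Hypothetical feature names and coefficients (not the notebook's results).
feature_names = np.array(['bed', 'bath', 'car', 'Month'])
coefs = np.array([120.0, -310.0, 45.0, -5.0])

# Indices that sort the absolute coefficients from largest to smallest.
order = np.argsort(-np.abs(coefs))

for name, value in zip(feature_names[order], coefs[order]):
    print(name, value)
```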

In [None]:
lasso_train_score = lasso_regr_model.score(X_train, y_train)
print('Training data r-squared score: {}'.format(lasso_train_score))

In [None]:
print('Lasso Test Data results:')
y_pred = lasso_regr_model.predict(X_test)

coef_of_determination_lasso = r2_score(y_test, y_pred)
print('R-squared: {}'.format(coef_of_determination_lasso))

rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: {}'.format(rmse_lasso))
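
The two metrics above can be written out directly, which makes their meaning easier to check by hand: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. A minimal sketch with NumPy on made-up values; `r_squared` and `rmse` are hypothetical helper names, not scikit-learn functions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, the coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared residual."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.0, 5.0, 8.0, 9.0])

# SS_res = 1 + 0 + 1 + 0 = 2 and SS_tot = 9 + 1 + 1 + 9 = 20.
print(r_squared(toy_true, toy_pred))  # 0.9
print(rmse(toy_true, toy_pred))       # about 0.7071
```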

In [None]:
# Display actual prices, predictions and their difference in a table
res = pd.DataFrame({'Price': y_test.ravel(), 'Prediction': y_pred})
res['Prediction'] = res['Prediction'].round(0)
res['Difference'] = res['Prediction'] - res['Price']
res.head(5)

In [None]:
# Get the median difference between predicted prices and actual prices.
lasso_med_diff = res['Difference'].median()

print('The median difference between predicted and actual prices using Lasso: {}'.format(lasso_med_diff))
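
A signed median can sit near zero when over-predictions and under-predictions cancel out, so it may help to look at the median absolute difference alongside it. A small sketch on made-up prices; the arrays and the name `median_abs_diff` are hypothetical.

```python
import numpy as np

toy_price = np.array([100.0, 200.0, 300.0, 400.0])
toy_prediction = np.array([110.0, 190.0, 330.0, 370.0])

# Signed differences: [10, -10, 30, -30]
toy_difference = toy_prediction - toy_price

median_diff = np.median(toy_difference)
median_abs_diff = np.median(np.abs(toy_difference))

print(median_diff)      # 0.0
print(median_abs_diff)  # 20.0
```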