<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2 - Ames Housing Data and Kaggle Challenge

# Part 3 Modelling

## Contents: 
Part 1: Data Import and Cleaning

Part 2: Exploratory Data Analysis and Statistics

Part 3: Modelling

- [Linear Regression](#linear-regression)
- [Rerun Linear Regression](#rerun-lr)
- [Feature Engineering with Linear Regression](#fe-with-linear-regression)
- [Hyperparameters with Ridge](#hyperparameters-with-ridge)
- [Hyperparameters with Lasso](#hyperparameters-with-lasso)
- [Export for Kaggle Submission](#kaggle-submission)
- [Conclusion and Recommendations](#conclusion-and-recommendations)

## Importing and checks

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score,train_test_split, GridSearchCV
from sklearn import metrics 
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
train = pd.read_csv('./datasets/train_selected.csv')

In [3]:
# double check shape and head of csv
print(train.shape)
train.head()

(2051, 58)


Unnamed: 0,saleprice,fireplaces,overall_qual,gr_liv_area,garage_area,1st_flr_sf,year_built,full_bath,mas_vnr_area,ms_zoning_C (all),...,fireplace_qu_Fa,fireplace_qu_Gd,fireplace_qu_Po,fireplace_qu_TA,fireplace_qu_none,bsmt_qual_Fa,bsmt_qual_Gd,bsmt_qual_Po,bsmt_qual_TA,bsmt_qual_none
0,130500,0,6,1479,475.0,725,1976,2,289.0,0,...,0,0,0,0,1,0,0,0,1,0
1,220000,1,7,2122,559.0,913,1996,2,132.0,0,...,0,0,0,1,0,0,1,0,0,0
2,109000,0,5,1057,246.0,1057,1953,1,0.0,0,...,0,0,0,0,1,0,0,0,1,0
3,174000,0,5,1444,400.0,744,2006,2,0.0,0,...,0,0,0,0,1,0,1,0,0,0
4,138500,0,6,1445,484.0,831,1900,2,0.0,0,...,0,0,0,0,1,1,0,0,0,0


In [4]:
# double check for any null value
train.isnull().sum()

saleprice               0
fireplaces              0
overall_qual            0
gr_liv_area             0
garage_area             0
1st_flr_sf              0
year_built              0
full_bath               0
mas_vnr_area            0
ms_zoning_C (all)       0
ms_zoning_FV            0
ms_zoning_I (all)       0
ms_zoning_RH            0
ms_zoning_RL            0
ms_zoning_RM            0
neighborhood_Blueste    0
neighborhood_BrDale     0
neighborhood_BrkSide    0
neighborhood_ClearCr    0
neighborhood_CollgCr    0
neighborhood_Crawfor    0
neighborhood_Edwards    0
neighborhood_Gilbert    0
neighborhood_Greens     0
neighborhood_GrnHill    0
neighborhood_IDOTRR     0
neighborhood_Landmrk    0
neighborhood_MeadowV    0
neighborhood_Mitchel    0
neighborhood_NAmes      0
neighborhood_NPkVill    0
neighborhood_NWAmes     0
neighborhood_NoRidge    0
neighborhood_NridgHt    0
neighborhood_OldTown    0
neighborhood_SWISU      0
neighborhood_Sawyer     0
neighborhood_SawyerW    0
neighborhood

In [5]:
train.columns

Index(['saleprice', 'fireplaces', 'overall_qual', 'gr_liv_area', 'garage_area',
       '1st_flr_sf', 'year_built', 'full_bath', 'mas_vnr_area',
       'ms_zoning_C (all)', 'ms_zoning_FV', 'ms_zoning_I (all)',
       'ms_zoning_RH', 'ms_zoning_RL', 'ms_zoning_RM', 'neighborhood_Blueste',
       'neighborhood_BrDale', 'neighborhood_BrkSide', 'neighborhood_ClearCr',
       'neighborhood_CollgCr', 'neighborhood_Crawfor', 'neighborhood_Edwards',
       'neighborhood_Gilbert', 'neighborhood_Greens', 'neighborhood_GrnHill',
       'neighborhood_IDOTRR', 'neighborhood_Landmrk', 'neighborhood_MeadowV',
       'neighborhood_Mitchel', 'neighborhood_NAmes', 'neighborhood_NPkVill',
       'neighborhood_NWAmes', 'neighborhood_NoRidge', 'neighborhood_NridgHt',
       'neighborhood_OldTown', 'neighborhood_SWISU', 'neighborhood_Sawyer',
       'neighborhood_SawyerW', 'neighborhood_Somerst', 'neighborhood_StoneBr',
       'neighborhood_Timber', 'neighborhood_Veenker', 'exter_qual_Fa',
       'exter_qual

In [6]:
list_x_cat = ['ms_zoning_C (all)',
       'ms_zoning_FV', 'ms_zoning_I (all)', 'ms_zoning_RH', 'ms_zoning_RL',
       'ms_zoning_RM', 'neighborhood_Blueste', 'neighborhood_BrDale',
       'neighborhood_BrkSide', 'neighborhood_ClearCr', 'neighborhood_CollgCr',
       'neighborhood_Crawfor', 'neighborhood_Edwards', 'neighborhood_Gilbert',
       'neighborhood_Greens', 'neighborhood_GrnHill', 'neighborhood_IDOTRR',
       'neighborhood_Landmrk', 'neighborhood_MeadowV', 'neighborhood_Mitchel',
       'neighborhood_NAmes', 'neighborhood_NPkVill', 'neighborhood_NWAmes',
       'neighborhood_NoRidge', 'neighborhood_NridgHt', 'neighborhood_OldTown',
       'neighborhood_SWISU', 'neighborhood_Sawyer', 'neighborhood_SawyerW',
       'neighborhood_Somerst', 'neighborhood_StoneBr', 'neighborhood_Timber',
       'neighborhood_Veenker', 'exter_qual_Fa', 'exter_qual_Gd',
       'exter_qual_TA', 'kitchen_qual_Fa', 'kitchen_qual_Gd',
       'kitchen_qual_TA', 'fireplace_qu_Fa', 'fireplace_qu_Gd',
       'fireplace_qu_Po', 'fireplace_qu_TA', 'fireplace_qu_none', 'bsmt_qual_Fa', 'bsmt_qual_Gd', 'bsmt_qual_Po', 'bsmt_qual_TA',
       'bsmt_qual_none']

In [7]:
list_x_num = ['overall_qual', 'fireplaces', 'gr_liv_area', 'garage_area', '1st_flr_sf',
       'year_built', 'full_bath', 'mas_vnr_area']

<a id="linear-regression"></a>
## Linear Regression

In [8]:
# X in dataframe and y in series 
X = train.drop(columns='saleprice')
y = train['saleprice']

In [9]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=55)

In [10]:
# instantiate standard scaler 
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train) 
X_test_sc = sc.transform(X_test)

In [11]:
# instantiate linear regression
ols = LinearRegression() # instantiate
ols.fit(X_train_sc, y_train)

In [12]:
# cross validate score
cross_val_score(ols, X_train_sc, y_train, cv=6)

array([ 8.89105533e-01,  7.58310445e-01,  8.83652410e-01, -8.81568334e+20,
       -1.67311424e+21,  8.53076037e-01])

In [13]:
# R2 score on train 
ols.score(X_train_sc, y_train)

0.8690742610120241

In [14]:
# R2 score on test
ols.score(X_test_sc, y_test)

0.8237443983043173

In [15]:
# y pred
y_pred = ols.predict(X_test_sc)

In [16]:
# RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

32375.315836464768

Next, take a look at the coefficients (betas) of the OLS model, or look at the test set to see which observations are predicted poorly.
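One way to carry out both checks is sketched below on synthetic data. The column names (`feat_a`, `feat_b`, `feat_c`) and the data are assumptions made for illustration only; they are not the Ames features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"])
y_demo = 3.0 * X_demo["feat_a"] - 2.0 * X_demo["feat_b"] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name and sort by absolute size.
coefs = pd.Series(model.coef_, index=X_demo.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# Residual = actual minus predicted; the largest absolute residuals
# point to the rows the model predicts worst.
residuals = y_demo - model.predict(X_demo)
worst = residuals.abs().sort_values(ascending=False).head(5)

print(coefs)
print(worst)
```

The index of `worst` gives the row labels of the poorest predictions, which can then be looked up in the original dataframe.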

In [17]:
ols.coef_

array([ 3.77179876e+03,  1.55374235e+04,  1.78830998e+04,  7.56526656e+03,
        6.89535396e+03,  7.80365170e+03,  8.99618851e+02,  4.11010786e+03,
       -1.86553954e+03, -1.15578723e+01,  4.41904227e+02,  1.03741094e+02,
        1.08020387e+03, -4.51263656e+03,  1.21550697e+01, -4.90720790e+02,
        3.87643400e+03,  3.76828766e+03,  4.58880325e+03,  6.30717303e+03,
        3.01602094e+02,  2.28778987e+03, -3.00133252e-11,  4.46589727e+03,
        3.12821009e+03,  1.09139364e-11,  1.02666329e+03,  1.64707350e+03,
        3.01722058e+03, -9.16584083e+02,  8.44106307e+02,  7.46514718e+03,
        7.66120290e+03,  3.72207041e+03,  9.76607024e+02,  2.37058717e+03,
        9.78564181e+02,  3.70049401e+03,  8.71002967e+03,  2.53052876e+03,
        1.96511315e+03, -3.78820183e+03, -1.09742497e+04, -1.31233877e+04,
       -6.35577002e+03, -1.47527007e+04, -1.87152761e+04, -3.49904268e+03,
       -8.30811782e+03, -2.76616160e+03, -7.74100961e+03, -1.07355209e+04,
       -5.26481743e+03, -

### Findings

The test R-squared (0.824) is lower than the train R-squared (0.869), which suggests that the model is overfitting rather than underfitting. The cross-validation scores are also irregular: two of the six folds return very large negative values (on the order of -1e20 and -1e21). A possible explanation is that the standard scaler was applied to all X train features, including the one-hot encoded features created with `pd.get_dummies`, and that this affected the R-squared scores. 

<a id="rerun-lr"></a>
## Rerun Linear Regression

Rerun the linear regression with `StandardScaler` applied only to the numeric features, leaving the `pd.get_dummies` features unscaled.
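As an alternative to splitting and concatenating the dataframes by hand, the same idea can be sketched with scikit-learn's `ColumnTransformer` inside a `Pipeline`. This is not the approach used in this notebook; the data and the column names (`area`, `quality`, `zone_a`) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric columns and one dummy column.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "area": rng.normal(1500, 400, n),
    "quality": rng.integers(1, 11, n),
    "zone_a": rng.integers(0, 2, n),
})
price = (100 * demo["area"] + 15000 * demo["quality"]
         + 20000 * demo["zone_a"] + rng.normal(0, 10000, n))

num_cols = ["area", "quality"]

# Scale only the numeric columns; pass the dummy column through unchanged.
preprocess = ColumnTransformer(
    [("scale_numeric", StandardScaler(), num_cols)],
    remainder="passthrough",
)
pipe = Pipeline([("prep", preprocess), ("ols", LinearRegression())])

# The scaler is refit on the training part of each fold.
scores = cross_val_score(pipe, demo, price, cv=6)
print(scores.mean())

# Check that the dummy column (last in the output) is left untouched.
transformed = preprocess.fit_transform(demo)
print(np.array_equal(transformed[:, 2], demo["zone_a"].to_numpy()))
```

A side benefit of the pipeline is that the scaler is fitted inside each cross-validation fold, so the validation fold does not influence the scaling.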

In [20]:
X = train.drop(columns=['saleprice'])
y = train['saleprice']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=55, train_size=0.8)

In [22]:
X.columns

Index(['fireplaces', 'overall_qual', 'gr_liv_area', 'garage_area',
       '1st_flr_sf', 'year_built', 'full_bath', 'mas_vnr_area',
       'ms_zoning_C (all)', 'ms_zoning_FV', 'ms_zoning_I (all)',
       'ms_zoning_RH', 'ms_zoning_RL', 'ms_zoning_RM', 'neighborhood_Blueste',
       'neighborhood_BrDale', 'neighborhood_BrkSide', 'neighborhood_ClearCr',
       'neighborhood_CollgCr', 'neighborhood_Crawfor', 'neighborhood_Edwards',
       'neighborhood_Gilbert', 'neighborhood_Greens', 'neighborhood_GrnHill',
       'neighborhood_IDOTRR', 'neighborhood_Landmrk', 'neighborhood_MeadowV',
       'neighborhood_Mitchel', 'neighborhood_NAmes', 'neighborhood_NPkVill',
       'neighborhood_NWAmes', 'neighborhood_NoRidge', 'neighborhood_NridgHt',
       'neighborhood_OldTown', 'neighborhood_SWISU', 'neighborhood_Sawyer',
       'neighborhood_SawyerW', 'neighborhood_Somerst', 'neighborhood_StoneBr',
       'neighborhood_Timber', 'neighborhood_Veenker', 'exter_qual_Fa',
       'exter_qual_Gd', 'exter_

In [23]:
# split X train to numeric and categoric dataframe
X_train_num = X_train[list_x_num]

X_train_cat = X_train[list_x_cat]

In [24]:
X_train_cat.head()

Unnamed: 0,ms_zoning_C (all),ms_zoning_FV,ms_zoning_I (all),ms_zoning_RH,ms_zoning_RL,ms_zoning_RM,neighborhood_Blueste,neighborhood_BrDale,neighborhood_BrkSide,neighborhood_ClearCr,...,fireplace_qu_Fa,fireplace_qu_Gd,fireplace_qu_Po,fireplace_qu_TA,fireplace_qu_none,bsmt_qual_Fa,bsmt_qual_Gd,bsmt_qual_Po,bsmt_qual_TA,bsmt_qual_none
84,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
858,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1574,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
105,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0


In [25]:
# split X test to numeric and categoric dataframe
X_test_num = X_test[list_x_num]

X_test_cat = X_test[list_x_cat]

In [26]:
# instantiate standardscaler
# sc only on numeric X train and X test
# concat the dataframe with pd.dummies after sc

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train_num) 
X_train_sc = pd.DataFrame(X_train_sc, columns=list_x_num)
X_train_cat.reset_index(drop=True, inplace=True)
X_train_sc = pd.concat([X_train_sc, X_train_cat], axis=1)


In [27]:
X_test_sc = sc.transform(X_test_num)
X_test_sc = pd.DataFrame(X_test_sc, columns=list_x_num)
X_test_cat.reset_index(drop=True, inplace=True)
X_test_sc = pd.concat([X_test_sc, X_test_cat], axis=1)

In [28]:
# instantiate linear regression
lr = LinearRegression()
lr.fit(X_train_sc, y_train)

In [29]:
cross_val_score(lr, X_train_sc, y_train, cv=6).mean()

0.8442731043424564

In [30]:
# R2 score on train
lr.score(X_train_sc, y_train)

0.8571942886583913

In [31]:
# R2 score on test
lr.score(X_test_sc, y_test)

0.8655855367909225

In [32]:
# y pred
y_pred = lr.predict(X_test_sc)

In [33]:
# RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

27725.043867208104

In [34]:
lr.coef_

array([ 1.57048594e+04,  5.18845226e+03,  1.61982227e+04,  7.68019937e+03,
        5.21776367e+03,  6.83961709e+03,  2.39919120e+03,  3.06549978e+03,
       -2.09966132e+04, -6.12810943e+03,  1.25207445e+04,  1.11992934e+03,
       -9.76505798e+02, -1.45859276e+04, -4.94286431e+03, -8.81253600e+03,
        1.48218830e+04,  3.32912393e+04,  1.56446371e+04,  3.09612406e+04,
       -3.45496408e+03,  8.05175456e+03,  1.04309049e+03,  1.24318427e+05,
        1.23250879e+04, -2.00088834e-11,  4.57527385e+03,  8.74820095e+03,
        7.73129697e+03, -1.39957296e+04,  4.34869667e+03,  5.44197215e+04,
        3.82311908e+04,  1.00350693e+04,  3.63339514e+03,  8.95037543e+03,
        6.72771355e+03,  1.73075322e+04,  7.04877360e+04,  1.87673551e+04,
        2.12159181e+04, -3.45767168e+04, -2.46197768e+04, -2.78899474e+04,
       -4.42152867e+04, -3.08614312e+04, -3.78192671e+04, -2.23275256e+04,
       -1.96505966e+04, -2.37411495e+04, -2.06584396e+04, -2.17266047e+04,
       -3.08528972e+04, -

### Findings

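The test R-squared (0.866) is now higher than both the train R-squared (0.857) and the mean cross-validation R-squared (0.844). The irregularities in the cross-validation scores were resolved once the standard scaler was applied only to the numeric features and not to the `pd.get_dummies` features. RMSE also improved, from 32,375 to 27,725. Note that this run uses `train_size=0.8` whereas the first run used the default split (75% train), so the two sets of scores are not computed on identical test sets. 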

<a id="fe-with-linear-regression"></a>
## Feature Engineering with Linear Regression

In [35]:
train_num = train[list_x_num]

In [36]:
# instantiate 
poly = PolynomialFeatures(include_bias=False)

In [37]:
# polynomialfeatures only to X numeric features
X_poly = poly.fit_transform(train_num)

In [38]:
X_poly = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(list_x_num))

In [39]:
X_poly.shape

(2051, 44)

In [40]:
train_cat = train[list_x_cat]

In [41]:
# concat X numeric features with X categoric features
X_poly = pd.concat([X_poly, train_cat], axis=1)

In [42]:
# check shape and head
print(X_poly.shape)
X_poly.head()

(2051, 93)


Unnamed: 0,overall_qual,fireplaces,gr_liv_area,garage_area,1st_flr_sf,year_built,full_bath,mas_vnr_area,overall_qual^2,overall_qual fireplaces,...,fireplace_qu_Fa,fireplace_qu_Gd,fireplace_qu_Po,fireplace_qu_TA,fireplace_qu_none,bsmt_qual_Fa,bsmt_qual_Gd,bsmt_qual_Po,bsmt_qual_TA,bsmt_qual_none
0,6.0,0.0,1479.0,475.0,725.0,1976.0,2.0,289.0,36.0,0.0,...,0,0,0,0,1,0,0,0,1,0
1,7.0,1.0,2122.0,559.0,913.0,1996.0,2.0,132.0,49.0,7.0,...,0,0,0,1,0,0,1,0,0,0
2,5.0,0.0,1057.0,246.0,1057.0,1953.0,1.0,0.0,25.0,0.0,...,0,0,0,0,1,0,0,0,1,0
3,5.0,0.0,1444.0,400.0,744.0,2006.0,2.0,0.0,25.0,0.0,...,0,0,0,0,1,0,1,0,0,0
4,6.0,0.0,1445.0,484.0,831.0,1900.0,2.0,0.0,36.0,0.0,...,0,0,0,0,1,1,0,0,0,0


In [43]:
# set X dataframe and y series 
X = X_poly
y = train['saleprice']

In [44]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.8)

In [45]:
X_train_num = X_train.drop(columns=list_x_cat)

X_train_cat = X_train[list_x_cat]

In [46]:
X_test_num = X_test.drop(columns=list_x_cat)

X_test_cat = X_test[list_x_cat]

In [47]:
# instantiate sc
# sc only applied to numeric features of X train and X test and not pd.dummies 
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train_num) 
X_train_sc = pd.DataFrame(X_train_sc, columns=X_train_num.columns)
X_train_cat.reset_index(drop=True, inplace=True)
X_train_sc = pd.concat([X_train_sc, X_train_cat], axis=1)

In [48]:
X_test_sc = sc.transform(X_test_num)
X_test_sc = pd.DataFrame(X_test_sc, columns=X_test_num.columns)
X_test_cat.reset_index(drop=True, inplace=True)
X_test_sc = pd.concat([X_test_sc, X_test_cat], axis=1)

In [49]:
# instantiate lr
lr = LinearRegression()
lr.fit(X_train_sc, y_train)

In [50]:
cross_val_score(lr, X_train_sc, y_train, cv=6).mean()

0.8543834016565482

In [51]:
# R2 train score
lr.score(X_train_sc, y_train)

0.9054375903961018

In [52]:
# R2 test score
lr.score(X_test_sc, y_test)

0.8969925019965425

In [53]:
y_pred = lr.predict(X_test_sc)

In [54]:
# RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

24739.051020034127

### Findings
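Applying polynomial features to the numeric X features (excluding the one-hot encoded features) improved the test R-squared from 0.866 to 0.897, the train R-squared from 0.857 to 0.905, and the RMSE from 27,725 to 24,739. This split uses `random_state=42` instead of 55, so part of the difference may come from the different split. 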
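The 44 polynomial columns reported above follow from the default degree of 2: for 8 numeric features there are 8 linear terms, 8 squared terms and C(8, 2) = 28 pairwise interaction terms. The short check below uses a placeholder array rather than the housing data.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 8  # same number of numeric features as list_x_num

# Placeholder input; only the number of columns matters here.
X_demo = np.arange(3 * n_features, dtype=float).reshape(3, n_features)

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
X_out = poly_demo.fit_transform(X_demo)

# linear terms + squared terms + pairwise interactions
expected = n_features + n_features + comb(n_features, 2)
print(X_out.shape[1], expected)
```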

<a id="hyperparameters-with-ridge"></a>
## Hyperparameters with Ridge

In [55]:
r_alphas = np.logspace(0, 10, 10) 
r_alphas

array([1.00000000e+00, 1.29154967e+01, 1.66810054e+02, 2.15443469e+03,
       2.78255940e+04, 3.59381366e+05, 4.64158883e+06, 5.99484250e+07,
       7.74263683e+08, 1.00000000e+10])

In [56]:
# set up ridge parameters
ridge_params = {
    'alpha': r_alphas,
}

In [57]:
# gridsearchcv for ridge
ridge_gridsearch = GridSearchCV(Ridge(), # estimator: What is the model we want to fit?
                              ridge_params, # param_grid: What is the dictionary of hyperparameters?
                              cv=5, # What number of folds in CV will we use?
                              verbose=1, # Display limited output post grid searching
                              n_jobs=-1 # Use all CPU cores on your computer to speed up the fit
                             )

In [58]:
# fit X_train_sc and y_train from last iteration after poly and sc were done
ridge_gridsearch.fit(X_train_sc, y_train);

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [59]:
ridge_gridsearch.best_score_ # this is the metric performance on validation set (part of training)

0.8627759030708637

In [60]:
# Evaluate the best fit model on the test data.
# Best model is automatically chosen
ridge_gridsearch.score(X_test_sc, y_test) 

0.8975263143381346

In [61]:
ridge_gridsearch.best_params_ # Print out the set of hyperparameters that achieved the best score.

{'alpha': 1.0}

In [62]:
# converting above to dataframe format and sorting best model first, viewing top 5
# rank_test_score captures ranking based on scoring on validation set. rank#1 is the model with the best metric!
pd.DataFrame(ridge_gridsearch.cv_results_).sort_values('rank_test_score').head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.119879,0.040039,0.006584,0.00272,1.0,{'alpha': 1.0},0.900568,0.864094,0.865179,0.89275,0.791288,0.862776,0.038587,1
1,0.019946,0.007801,0.004389,0.000799,12.915497,{'alpha': 12.91549665014884},0.895478,0.85024,0.867993,0.903009,0.783654,0.860075,0.042653,2
2,0.009775,0.001163,0.003789,0.000397,166.810054,{'alpha': 166.81005372000593},0.870076,0.812486,0.852166,0.850258,0.722508,0.821499,0.05293,3
3,0.008178,0.001466,0.003193,0.0004,2154.43469,{'alpha': 2154.4346900318847},0.829691,0.770669,0.815414,0.755512,0.641071,0.762471,0.066596,4
4,0.008976,0.001093,0.003191,0.001163,27825.594022,{'alpha': 27825.59402207126},0.59761,0.566442,0.58775,0.563261,0.546983,0.572409,0.01809,5


In [64]:
y_pred_ridge = ridge_gridsearch.predict(X_test_sc)

In [65]:
np.sqrt(metrics.mean_squared_error(y_test, y_pred_ridge))

24674.86557356423

In [66]:
print(ridge_gridsearch.score(X_train_sc, y_train))
print(ridge_gridsearch.score(X_test_sc, y_test))

0.9036079874871452
0.8975263143381346


### Findings

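Ridge with a grid search over alpha gives a slight improvement over linear regression: RMSE falls from 24,739 to 24,675 and the test R-squared rises from 0.8970 to 0.8975. The best alpha (1.0) is the smallest value in the grid, so smaller alphas may be worth searching. 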
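When the best alpha sits at the edge of the grid, a common follow-up is to widen the grid and check again. The sketch below does this on synthetic data from `make_regression`; the grid bounds (1e-3 to 1e3) are an arbitrary choice for illustration, not values tested in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the housing set.
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Grid extended below 1 so that the optimum is less likely to fall on the boundary.
alphas = np.logspace(-3, 3, 13)
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
# If the best alpha is still the first or last grid value, widen the grid further.
on_edge = best_alpha in (alphas[0], alphas[-1])
print(best_alpha, on_edge)
```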

<a id="hyperparameters-with-lasso"></a>
## Hyperparameters with Lasso

In [67]:
l_alphas = np.logspace(-3, 10, 50)
l_alphas

array([1.00000000e-03, 1.84206997e-03, 3.39322177e-03, 6.25055193e-03,
       1.15139540e-02, 2.12095089e-02, 3.90693994e-02, 7.19685673e-02,
       1.32571137e-01, 2.44205309e-01, 4.49843267e-01, 8.28642773e-01,
       1.52641797e+00, 2.81176870e+00, 5.17947468e+00, 9.54095476e+00,
       1.75751062e+01, 3.23745754e+01, 5.96362332e+01, 1.09854114e+02,
       2.02358965e+02, 3.72759372e+02, 6.86648845e+02, 1.26485522e+03,
       2.32995181e+03, 4.29193426e+03, 7.90604321e+03, 1.45634848e+04,
       2.68269580e+04, 4.94171336e+04, 9.10298178e+04, 1.67683294e+05,
       3.08884360e+05, 5.68986603e+05, 1.04811313e+06, 1.93069773e+06,
       3.55648031e+06, 6.55128557e+06, 1.20679264e+07, 2.22299648e+07,
       4.09491506e+07, 7.54312006e+07, 1.38949549e+08, 2.55954792e+08,
       4.71486636e+08, 8.68511374e+08, 1.59985872e+09, 2.94705170e+09,
       5.42867544e+09, 1.00000000e+10])

In [68]:
lasso_params = {
    'alpha': l_alphas,
    'max_iter' : [50000]
}

In [69]:
lasso_gridsearch = GridSearchCV(Lasso(), # estimator: What is the model we want to fit?
                              lasso_params, # param_grid: What is the dictionary of hyperparameters?
                              cv=5, # What number of folds in CV will we use?
                              verbose=1, # Display limited output post grid searching
                              n_jobs=-1 # Use all CPU cores on your computer to speed up the fit
                             )

In [70]:
lasso_gridsearch.fit(X_train_sc, y_train);

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [71]:
lasso_gridsearch.best_score_ # this is the metric performance on validation set (part of training)

0.8643550153073752

In [72]:
# Evaluate the best fit model on the test data.
# Best model is automatically chosen
lasso_gridsearch.score(X_test_sc, y_test) 

0.8985220311326859

In [73]:
lasso_gridsearch.best_params_ # Print out the set of hyperparameters that achieved the best score.

{'alpha': 32.37457542817646, 'max_iter': 50000}

In [74]:
# converting above to dataframe format and sorting best model first, viewing top 5
# rank_test_score captures ranking based on scoring on validation set. rank#1 is the model with the best metric!
pd.DataFrame(lasso_gridsearch.cv_results_).sort_values('rank_test_score').head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_max_iter,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
17,1.559234,0.705949,0.005385,0.00049,32.374575,50000,"{'alpha': 32.37457542817646, 'max_iter': 50000}",0.901005,0.864792,0.869404,0.894844,0.79173,0.864355,0.038918,1
16,1.897927,0.429302,0.005387,0.000489,17.575106,50000,"{'alpha': 17.57510624854793, 'max_iter': 50000}",0.900716,0.865396,0.86707,0.89113,0.79276,0.863414,0.037865,2
18,0.717881,0.117461,0.005387,0.000489,59.636233,50000,"{'alpha': 59.636233165946486, 'max_iter': 50000}",0.899205,0.861851,0.871464,0.89585,0.787297,0.863133,0.040482,3
15,4.030826,1.71927,0.007379,0.004788,9.540955,50000,"{'alpha': 9.540954763499943, 'max_iter': 50000}",0.900379,0.865423,0.864722,0.888127,0.791966,0.862124,0.037628,4
14,9.69668,4.122619,0.005586,0.000798,5.179475,50000,"{'alpha': 5.1794746792312125, 'max_iter': 50000}",0.900009,0.865418,0.862921,0.885784,0.785843,0.859995,0.039503,5


In [81]:
# Instantiate a separate Lasso to inspect its coefficients.
# alpha=32 approximates the best alpha found by the grid search (32.37)
lasso_model = Lasso(alpha=32, max_iter=50000)

# Fit.
lasso_model.fit(X_train_sc, y_train)

y_pred = lasso_model.predict(X_test_sc)

# Evaluate model using R2.
print(lasso_model.score(X_train_sc, y_train))
print(lasso_model.score(X_test_sc, y_test))

# RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

#lasso coefficient
lasso_model.coef_

0.9030218389182908
0.8985111682384391


array([-7.68294040e+03,  1.76036424e+04,  0.00000000e+00, -1.88107779e+03,
        0.00000000e+00,  0.00000000e+00, -2.21443804e+04, -4.47718254e+03,
       -1.57432531e+04,  7.91668747e+03,  6.77923937e+04,  1.12872156e+04,
        2.91217216e+04, -0.00000000e+00, -1.97584630e+04, -0.00000000e+00,
       -1.64452343e+04, -0.00000000e+00, -1.93481828e+03,  2.25089084e+04,
        0.00000000e+00, -9.04988243e+03,  1.20034723e+01, -1.40273444e+04,
        5.11000756e+02, -7.39411481e+04,  5.01716057e+03,  3.86636416e+04,
       -9.98875795e+03,  1.20980432e+03,  4.36393681e+03, -0.00000000e+00,
       -5.19857543e+03,  8.98374813e+03, -0.00000000e+00,  2.62319991e+03,
        2.94603023e+04, -4.53032162e+03,  9.51387220e+03, -0.00000000e+00,
       -0.00000000e+00, -0.00000000e+00,  1.12788237e+04,  8.62335881e+01,
       -1.91014593e+04,  1.06045449e+04,  0.00000000e+00, -0.00000000e+00,
        3.30244374e+02, -7.26289186e+03, -0.00000000e+00, -5.22380589e+03,
        4.95145663e+03,  

In [76]:
y_pred_lasso = lasso_gridsearch.predict(X_test_sc)

In [77]:
np.sqrt(metrics.mean_squared_error(y_test, y_pred_lasso))

24554.692511028035

In [78]:
print(lasso_gridsearch.score(X_train_sc, y_train))
print(lasso_gridsearch.score(X_test_sc, y_test))

0.9029973224458736
0.8985220311326859


### Findings

Lasso with a grid search over alpha gives a slight further improvement over both linear regression and ridge, with an RMSE of 24,555 and a test R-squared of 0.8985. This is the best score of the three models, so the lasso model is used for the Kaggle submission. 

<a id="kaggle-submission"></a>
## Export for Kaggle Submission

Prepare the test set (`test_selected.csv`) with the same steps that were applied to the train set.

In [79]:
test = pd.read_csv('./datasets/test_selected.csv')

In [80]:
# extract dataframe with only numeric features
test_num = test[list_x_num]

In [81]:
# feature engineering as done to train set
poly = PolynomialFeatures(include_bias=False)

In [82]:
X_poly_test = poly.fit_transform(test_num)

In [83]:
X_poly_test = pd.DataFrame(X_poly_test, columns=poly.get_feature_names_out(list_x_num))

In [84]:
X_poly_test.shape

(878, 44)

In [85]:
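# scaling only numeric features
# note: this fits a new scaler on the test set itself; reusing the scaler
# fitted on the training features (transform only) would keep both sets on the same scale
sc = StandardScaler()
X_poly_test_sc = sc.fit_transform(X_poly_test) 
X_poly_test_sc = pd.DataFrame(X_poly_test_sc, columns=X_poly_test.columns)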
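The usual convention is to fit the scaler on the training features only and then call `transform` on any new data, so that both sets share the same mean and standard deviation. The sketch below illustrates the difference on synthetic arrays; the names `train_demo` and `holdout_demo` are hypothetical and the numbers are unrelated to the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: the holdout set is shifted upwards relative to the training set.
rng = np.random.default_rng(2)
train_demo = rng.normal(loc=100.0, scale=20.0, size=(500, 2))
holdout_demo = rng.normal(loc=110.0, scale=20.0, size=(200, 2))

# Fit on the training data only, then reuse the fitted scaler.
scaler = StandardScaler().fit(train_demo)
train_scaled = scaler.transform(train_demo)
holdout_scaled = scaler.transform(holdout_demo)

# Refitting on the holdout set forces its mean to zero and hides the shift.
holdout_refit = StandardScaler().fit_transform(holdout_demo)

print(holdout_scaled.mean(axis=0))  # shift relative to the training data is preserved
print(holdout_refit.mean(axis=0))   # shift removed by refitting
```

A model trained on features scaled with the training statistics expects new data on that same scale, which is why the first variant is generally preferred.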


In [86]:
# add dummy columns that exist in the train set but are missing from the test set, filled with 0
test['ms_zoning_C (all)'] = 0
test['neighborhood_GrnHill'] = 0
test['neighborhood_Landmrk'] = 0
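Adding the missing columns one by one works when there are only a few of them. A more general option, sketched below with hypothetical column names, is `DataFrame.reindex` with `fill_value=0`, which adds every missing dummy column and puts the columns in the training order in one step.

```python
import pandas as pd

# Hypothetical dummy columns seen during training.
train_cols = ["zone_A", "zone_B", "zone_C"]

# Hypothetical test data in which category "C" never occurs.
test_demo = pd.DataFrame({"zone": ["A", "B", "A"]})
test_dummies = pd.get_dummies(test_demo["zone"], prefix="zone", dtype=int)

# Align to the training columns; columns missing from the test set are filled with 0.
aligned = test_dummies.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

Note that `reindex` also drops any column that is not in `train_cols`, so categories that appear only in the test set are discarded.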

In [87]:
test_cat = test[list_x_cat]

In [88]:
# combining numeric features back with pd.dummies features to ensure that pd.dummies features were not scaled
test_cat.reset_index(drop=True, inplace=True)
X_poly_test = pd.concat([X_poly_test_sc, test_cat], axis=1)

In [89]:
print(X_poly_test.shape)
X_poly_test.head()

(878, 93)


Unnamed: 0,overall_qual,fireplaces,gr_liv_area,garage_area,1st_flr_sf,year_built,full_bath,mas_vnr_area,overall_qual^2,overall_qual fireplaces,...,fireplace_qu_Fa,fireplace_qu_Gd,fireplace_qu_Po,fireplace_qu_TA,fireplace_qu_none,bsmt_qual_Fa,bsmt_qual_Gd,bsmt_qual_Po,bsmt_qual_TA,bsmt_qual_none
0,-0.036625,-0.924179,0.851644,-0.142805,-0.634014,-1.991272,0.823523,-0.567521,-0.14335,-0.882206,...,0,0,0,0,1,1,0,0,0,0
1,-0.767467,-0.924179,0.928691,0.515669,2.189607,0.214229,0.823523,-0.567521,-0.780185,-0.882206,...,0,0,0,0,1,0,1,0,0,0
2,0.694217,0.570166,-0.001807,-0.208652,-1.284593,1.168849,0.823523,-0.567521,0.609272,0.619328,...,0,1,0,0,0,0,1,0,0,0
3,-0.767467,-0.924179,-1.044913,0.04533,-0.474035,-1.563339,-0.966213,-0.567521,-0.780185,-0.882206,...,0,0,0,0,1,0,0,0,1,0
4,-0.036625,2.064511,-0.203316,0.205245,0.661812,-0.246622,-0.966213,0.753476,-0.14335,1.691852,...,0,1,0,0,0,0,1,0,0,0


In [90]:
# double check number of columns matches train set as above
print(X_test_sc.shape)
X_test_sc.head()

(411, 93)


Unnamed: 0,overall_qual,fireplaces,gr_liv_area,garage_area,1st_flr_sf,year_built,full_bath,mas_vnr_area,overall_qual^2,overall_qual fireplaces,...,fireplace_qu_Fa,fireplace_qu_Gd,fireplace_qu_Po,fireplace_qu_TA,fireplace_qu_none,bsmt_qual_Fa,bsmt_qual_Gd,bsmt_qual_Po,bsmt_qual_TA,bsmt_qual_none
0,-0.783656,-0.91472,-1.078817,0.468839,-0.520542,0.137667,-1.036098,-0.56279,-0.802603,-0.881939,...,0,0,0,0,1,0,0,0,1,0
1,0.620163,2.213833,1.300712,0.062506,0.127663,-0.193379,0.760097,1.201479,0.531285,2.258952,...,0,1,0,0,0,0,0,0,1,0
2,1.322072,0.649556,1.164231,0.834083,-0.535387,1.16391,0.760097,0.053848,1.364965,0.912856,...,0,1,0,0,0,0,1,0,0,0
3,-0.081747,-0.91472,0.291934,-1.174756,-0.800112,-1.881715,-1.036098,-0.56279,-0.191238,-0.881939,...,0,0,0,0,1,0,1,0,0,0
4,0.620163,-0.91472,0.469954,0.477971,-0.775371,1.130806,0.760097,-0.56279,0.531285,-0.881939,...,0,0,0,0,1,0,1,0,0,0


In [91]:
y_pred_lasso_test = lasso_gridsearch.predict(X_poly_test)
print(type(y_pred_lasso_test))
print(len(y_pred_lasso_test))

<class 'numpy.ndarray'>
878


In [92]:
y_pred_lasso_test = pd.DataFrame(y_pred_lasso_test, columns=['SalePrice'])
print(y_pred_lasso_test.shape)
y_pred_lasso_test.head()

(878, 1)


Unnamed: 0,SalePrice
0,136342.830622
1,152808.650083
2,180807.495924
3,116031.865572
4,192193.301747


In [93]:
ID = test[['id']]

In [94]:
# dataframe in format required for kaggle submission

submission_lasso = pd.concat([ID, y_pred_lasso_test], axis=1)
submission_lasso.rename(columns={'id' : 'Id'}, inplace=True)
submission_lasso.head()

Unnamed: 0,Id,SalePrice
0,2658,136342.830622
1,2718,152808.650083
2,2414,180807.495924
3,1989,116031.865572
4,625,192193.301747


In [95]:
# export for kaggle submission
submission_lasso.to_csv('./datasets/submission_lasso.csv', index=False)

<a id="conclusion-and-recommendations"></a>
## Conclusion and Recommendations

#### Feature engineering, Model Evaluation and Selection

`pd.get_dummies` was applied to the categorical variables selected during the EDA. Categorical variables in which a single value accounted for more than 80% of observations were dropped beforehand. 

After the initial linear regression run, polynomial features were generated from the selected numeric variables. This improved the test R-squared and the root mean squared error (RMSE) when the linear regression model was rerun. 

The same training data were then used to fit ridge and lasso models, with a grid search to find the best alpha for each. 

A summary of the findings is as follows: 

|Model|Mean CV R-squared|Test R-squared|Test RMSE|Best alpha|
|-----|---------------|--------------|----|----------|
|Linear Regression|0.8544|0.8970|24739|n/a|
|Ridge|0.8628|0.8975|24675|1.00|
|Lasso|0.8644|0.8985|24555|32.37|

The best model is lasso, with a test R-squared of 0.8985 and an RMSE of 24555. This model was then used to predict sale prices for the provided test dataset for the Kaggle submission, which returned a score of 25195 on the Kaggle website. 

#### Other considerations

Fitting a separate lasso estimator with the best alpha showed that some coefficients were reduced to zero. The selected features could therefore be streamlined further in future work. 
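A minimal sketch of such a check is given below. It reads the coefficients from `best_estimator_`, which `GridSearchCV` refits on the full training data by default, so a separately fitted estimator is not strictly required. The data come from `make_regression` and the feature names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 15 features, of which only 5 carry signal.
X_arr, y_demo = make_regression(
    n_samples=200, n_features=15, n_informative=5, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(15)])

search = GridSearchCV(Lasso(max_iter=50000), {"alpha": np.logspace(-1, 2, 10)}, cv=5)
search.fit(X_demo, y_demo)

# Coefficients of the refit best model, labelled by feature name.
coefs = pd.Series(search.best_estimator_.coef_, index=X_demo.columns)
dropped = coefs[coefs == 0].index.tolist()  # candidates for removal
kept = coefs[coefs != 0].index.tolist()

print(len(kept), "kept;", len(dropped), "dropped")
```

The `dropped` list is a starting point for feature reduction; whether removing those features helps should still be confirmed with cross-validation.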

#### Conclusion and Recommendations

The data science team built a suitable model that streamlines house sale price prediction and reduces the number of features that realtors have to keep an eye on (from 82 to 14). The model should be effective as a first-cut estimate of a house's sale price. It is also advisable to use the latest available data for house sale price prediction. 

There are also external factors which are not captured in this data set and should be considered, such as: 
- interest rates affecting housing prices
- government policies (e.g. the Singapore government increasing the additional buyer's stamp duty on second and subsequent properties, [*source*](https://www.propertyguru.com.sg/property-guides/additional-buyers-stamp-duty-guide-13034))