# Preprocessing and Modelling

In this notebook, I preprocess the features of the cleaned dataset to make them suitable for regression, e.g. by one-hot encoding the categorical variables.

Subsequently, I use feature selection libraries to select the more salient features for the model.

Contents:
- [Preprocessing](#Preprocessing)
- [Exploratory Modelling](#Exploratory-Modelling)
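- [Model Tuning](#Model-Tuning)
- [Model Evaluation on Kaggle Test Set](#Model-Evaluation-on-Kaggle-Test-Set)
- [References](#References)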

In [2]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

#for modelling
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error

## Preprocessing
There are two types of categorical variables: nominal and ordinal. Ordinal variables have a natural order among their categories, for instance quality and condition (rated Excellent, Good, Average or Poor). Nominal variables, on the other hand, are purely categorical with no particular order. Both types need to be encoded as numbers before they can be used to fit the model. There are several ways to encode a categorical variable (Roy, 2019).

For ordinal features, I map each rating to an integer.
For nominal features, I use pandas' get_dummies function to create a dummy variable for each category.
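
As a minimal sketch of the two encodings, the snippet below applies them to a toy frame. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    'kitchen_qual': ['Ex', 'TA', 'Gd', 'Fa'],   # ordinal: has a natural order
    'central_air': ['Y', 'N', 'Y', 'Y'],        # nominal: no order
})

# ordinal: map each level to an integer that preserves the order
toy['kitchen_qual'] = toy['kitchen_qual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

# nominal: one indicator column per category
toy = pd.get_dummies(toy, columns=['central_air'])

print(toy)
```

The ordinal column stays a single numeric column, while the nominal column is replaced by one indicator column per category.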

In [3]:
#read clean dataset
train = pd.read_csv("../data/train_clean.csv")
test = pd.read_csv("../data/test_clean.csv")

In [4]:
print(train.shape)
print(test.shape)

(2049, 24)
(879, 23)


In [5]:
#get dummies for nominal features
nom_features = ['ms_zoning','garage_type','neighborhood','condition_2','central_air','sale_type','roof_matl']
train = pd.get_dummies(train, columns = nom_features)
test = pd.get_dummies(test, columns = nom_features)
#check
train.head()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD,roof_matl_CompShg,roof_matl_Membran,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl
0,6,2005,289.0,4,3,Ex,SBrkr,Gd,6,Typ,...,0,0,0,0,1,1,0,0,0,0
1,7,1997,132.0,4,4,Ex,SBrkr,Gd,8,Typ,...,0,0,0,0,1,1,0,0,0,0
2,5,2007,0.0,3,3,TA,SBrkr,Gd,5,Typ,...,0,0,0,0,1,1,0,0,0,0
3,5,2007,0.0,3,4,Gd,SBrkr,TA,7,Typ,...,0,0,0,0,1,1,0,0,0,0
4,6,1993,0.0,3,2,TA,SBrkr,TA,6,Typ,...,0,0,0,0,1,1,0,0,0,0


In [6]:
#check
test.head()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,sale_type_New,sale_type_Oth,sale_type_VWD,sale_type_WD,roof_matl_CompShg,roof_matl_Metal,roof_matl_Roll,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl
0,6,1950,0.0,3,2,Gd,FuseP,Fa,9,Typ,...,0,0,0,1,1,0,0,0,0,0
1,5,1977,0.0,3,4,TA,SBrkr,TA,10,Typ,...,0,0,0,1,1,0,0,0,0,0
2,7,2006,0.0,4,4,Ex,SBrkr,Gd,7,Typ,...,1,0,0,0,1,0,0,0,0,0
3,5,2006,0.0,4,3,TA,SBrkr,TA,5,Typ,...,0,0,0,1,1,0,0,0,0,0
4,6,1963,247.0,3,4,Gd,SBrkr,TA,6,Typ,...,0,0,0,1,1,0,0,0,0,0


In [7]:
#assign integer ratings to ordinal features (same mappings for train and test)
ordinal_maps = {
    'heating_qc': {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1},
    'electrical': {'SBrkr':5,'FuseA':4,'FuseF':3,'FuseP':2,'Mix':1},
    #note: there is no 'Po' key here, so any 'Po' value would become NaN
    'kitchen_qual': {'Ex':5,'Gd':4,'TA':3,'Fa':2},
    #note: 'Mod' and 'Min2' share the rating 5
    'functional': {'Typ':7,'Min1':6,'Min2':5,'Mod':5,'Maj1':4,'Maj2':3,'Sev':2,'Sal':1},
    'garage_finish': {'Fin':3,'RFn':2,'Unf':1,'None':0},
    'paved_drive': {'Y':3,'P':2,'N':1},
}
for df in (train, test):
    for col, mapping in ordinal_maps.items():
        df[col] = df[col].map(mapping)
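
`Series.map` returns NaN for any value that is missing from the dictionary, so a check on the raw column before mapping can catch levels that were overlooked. The sketch below uses a toy Series, and the helper name `unmapped_levels` is hypothetical.

```python
import pandas as pd

def unmapped_levels(series, mapping):
    """Return the distinct non-null values in `series` that `mapping` does not cover."""
    return sorted(set(series.dropna().unique()) - set(mapping))

# toy column, made up for illustration
kitchen = pd.Series(['Ex', 'Gd', 'Po', 'TA'])
mapping = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2}

print(unmapped_levels(kitchen, mapping))   # levels that would become NaN
print(kitchen.map(mapping).isna().sum())   # number of NaN values after mapping
```

Running such a check on both train and test for every ordinal column would show whether a mapping needs an extra key.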

In [8]:
train.describe()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD,roof_matl_CompShg,roof_matl_Membran,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl
count,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,...,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0,2049.0
mean,6.108346,1984.166423,97.427038,3.404588,3.569058,4.157638,4.88531,3.515861,6.42899,6.868228,...,0.003416,0.00244,0.077111,0.001952,0.869204,0.987799,0.000488,0.007321,0.001952,0.00244
std,1.42178,21.032785,171.826526,0.586134,0.696138,0.964224,0.402657,0.664287,1.544572,0.555697,...,0.058363,0.04935,0.266832,0.044151,0.337259,0.109809,0.022092,0.085268,0.044151,0.04935
min,1.0,1950.0,0.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,1964.0,0.0,3.0,3.0,3.0,5.0,3.0,5.0,7.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
50%,6.0,1993.0,0.0,3.0,4.0,5.0,5.0,3.0,6.0,7.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
75%,7.0,2004.0,157.0,4.0,4.0,5.0,5.0,4.0,7.0,7.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
max,10.0,2010.0,1600.0,5.0,5.0,5.0,5.0,5.0,14.0,7.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [9]:
test.describe()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,sale_type_New,sale_type_Oth,sale_type_VWD,sale_type_WD,roof_matl_CompShg,roof_matl_Metal,roof_matl_Roll,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl
count,879.0,879.0,879.0,879.0,879.0,879.0,879.0,878.0,879.0,879.0,...,879.0,879.0,879.0,879.0,879.0,879.0,879.0,879.0,879.0,879.0
mean,6.054608,1984.444824,106.182025,3.381115,3.538111,4.128555,4.90785,3.5,6.459613,6.863481,...,0.089875,0.003413,0.001138,0.858931,0.98066,0.001138,0.001138,0.009101,0.005688,0.002275
std,1.374756,20.454546,188.128642,0.562016,0.698578,0.944034,0.353208,0.652598,1.603071,0.520226,...,0.286165,0.058354,0.033729,0.348291,0.137796,0.033729,0.033729,0.095019,0.075249,0.047673
min,2.0,1950.0,0.0,2.0,1.0,2.0,2.0,2.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,1967.0,0.0,3.0,3.0,3.0,5.0,3.0,5.0,7.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,6.0,1992.0,0.0,3.0,3.0,4.0,5.0,3.0,6.0,7.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,7.0,2003.0,170.5,4.0,4.0,5.0,5.0,4.0,7.0,7.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
max,10.0,2010.0,1378.0,5.0,5.0,5.0,5.0,5.0,12.0,7.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Check that the columns in the train set are the same as in the test set.

In [10]:
#find out what columns are missing in the test set (excluding the target, saleprice)
missing_col_from_test = [col for col in train.columns if col not in test.columns and col != 'saleprice']
missing_col_from_test

['ms_zoning_A (agr)',
 'neighborhood_GrnHill',
 'neighborhood_Landmrk',
 'condition_2_Artery',
 'condition_2_RRAe',
 'condition_2_RRAn',
 'condition_2_RRNn',
 'roof_matl_Membran']

In [11]:
#add these in and impute with 0
for col in missing_col_from_test:
    test[col] = 0

In [12]:
#check
print(train.shape)
print(test.shape)

(2049, 83)
(879, 85)


In [13]:
#there are some columns in test that are not in train. test should have one less column than train. 
test.head()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,roof_matl_WdShake,roof_matl_WdShngl,ms_zoning_A (agr),neighborhood_GrnHill,neighborhood_Landmrk,condition_2_Artery,condition_2_RRAe,condition_2_RRAn,condition_2_RRNn,roof_matl_Membran
0,6,1950,0.0,3,2,4,2,2.0,9,7,...,0,0,0,0,0,0,0,0,0,0
1,5,1977,0.0,3,4,3,5,3.0,10,7,...,0,0,0,0,0,0,0,0,0,0
2,7,2006,0.0,4,4,5,5,4.0,7,7,...,0,0,0,0,0,0,0,0,0,0
3,5,2006,0.0,4,3,3,5,3.0,5,7,...,0,0,0,0,0,0,0,0,0,0
4,6,1963,247.0,3,4,4,5,3.0,6,7,...,0,0,0,0,0,0,0,0,0,0


In [14]:
train.head()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD,roof_matl_CompShg,roof_matl_Membran,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl
0,6,2005,289.0,4,3,5,5,4,6,7,...,0,0,0,0,1,1,0,0,0,0
1,7,1997,132.0,4,4,5,5,4,8,7,...,0,0,0,0,1,1,0,0,0,0
2,5,2007,0.0,3,3,3,5,4,5,7,...,0,0,0,0,1,1,0,0,0,0
3,5,2007,0.0,3,4,4,5,3,7,7,...,0,0,0,0,1,1,0,0,0,0
4,6,1993,0.0,3,2,3,5,3,6,7,...,0,0,0,0,1,1,0,0,0,0


In [15]:
#find out what is missing in train set
missing_col_from_train = [col for col in test.columns if col not in train.columns]
missing_col_from_train

['sale_type_VWD', 'roof_matl_Metal', 'roof_matl_Roll']

In [16]:
#drop these columns from test set 
test.drop(missing_col_from_train, axis = 1, inplace = True)

In [17]:
[col for col in test.columns if col not in train.columns]

[]

In [18]:
[col for col in train.columns if col not in test.columns]

['saleprice']

In [19]:
train.shape

(2049, 83)

In [20]:
test.shape

(879, 82)

In [21]:
[x for x in train if x not in test]

['saleprice']

In [22]:
#make sure it's the same order
test = test.reindex(train.columns, axis = 1)

In [23]:
test['saleprice']

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
874   NaN
875   NaN
876   NaN
877   NaN
878   NaN
Name: saleprice, Length: 879, dtype: float64

In [24]:
test.drop(columns=['saleprice'],inplace = True)
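
The alignment steps above (add the missing dummy columns as 0, drop the extra ones, match the column order) can likely be condensed into a single `reindex` call. This is a sketch on toy frames with made-up columns, not a drop-in replacement for the cells above.

```python
import pandas as pd

# toy frames, made up for illustration
train_toy = pd.DataFrame({'area': [100, 120], 'saleprice': [1.0, 2.0],
                          'roof_A': [1, 0], 'roof_B': [0, 1]})
test_toy = pd.DataFrame({'area': [90, 110], 'roof_A': [1, 0], 'roof_C': [0, 1]})

# feature columns are defined by the train set, minus the target
feature_cols = [col for col in train_toy.columns if col != 'saleprice']

# reindex keeps train's columns in train's order; columns absent from test
# are created and filled with 0, and columns only in test are dropped
test_aligned = test_toy.reindex(columns=feature_cols, fill_value=0)

print(test_aligned)
```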

In [25]:
print(train.shape)
print(test.shape)

(2049, 83)
(879, 82)


In [26]:
train.head()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD,roof_matl_CompShg,roof_matl_Membran,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl
0,6,2005,289.0,4,3,5,5,4,6,7,...,0,0,0,0,1,1,0,0,0,0
1,7,1997,132.0,4,4,5,5,4,8,7,...,0,0,0,0,1,1,0,0,0,0
2,5,2007,0.0,3,3,3,5,4,5,7,...,0,0,0,0,1,1,0,0,0,0
3,5,2007,0.0,3,4,4,5,3,7,7,...,0,0,0,0,1,1,0,0,0,0
4,6,1993,0.0,3,2,3,5,3,6,7,...,0,0,0,0,1,1,0,0,0,0


In [27]:
test.head()

Unnamed: 0,overall_qual,year_remod/add,mas_vnr_area,exter_qual,bsmt_qual,heating_qc,electrical,kitchen_qual,totrms_abvgrd,functional,...,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD,roof_matl_CompShg,roof_matl_Membran,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl
0,6,1950,0.0,3,2,4,2,2.0,9,7,...,0,0,0,0,1,1,0,0,0,0
1,5,1977,0.0,3,4,3,5,3.0,10,7,...,0,0,0,0,1,1,0,0,0,0
2,7,2006,0.0,4,4,5,5,4.0,7,7,...,0,0,1,0,0,1,0,0,0,0
3,5,2006,0.0,4,3,3,5,3.0,5,7,...,0,0,0,0,1,1,0,0,0,0
4,6,1963,247.0,3,4,4,5,3.0,6,7,...,0,0,0,0,1,1,0,0,0,0


In [28]:
print(train.shape)
print(test.shape)

(2049, 83)
(879, 82)


In [29]:
test.columns

Index(['overall_qual', 'year_remod/add', 'mas_vnr_area', 'exter_qual',
       'bsmt_qual', 'heating_qc', 'electrical', 'kitchen_qual',
       'totrms_abvgrd', 'functional', 'garage_yr_blt', 'garage_finish',
       'garage_area', 'paved_drive', 'total_hse_area', 'total_bathrms',
       'ms_zoning_A (agr)', 'ms_zoning_C (all)', 'ms_zoning_FV',
       'ms_zoning_I (all)', 'ms_zoning_RH', 'ms_zoning_RL', 'ms_zoning_RM',
       'garage_type_2Types', 'garage_type_Attchd', 'garage_type_Basment',
       'garage_type_BuiltIn', 'garage_type_CarPort', 'garage_type_Detchd',
       'garage_type_None', 'neighborhood_Blmngtn', 'neighborhood_Blueste',
       'neighborhood_BrDale', 'neighborhood_BrkSide', 'neighborhood_ClearCr',
       'neighborhood_CollgCr', 'neighborhood_Crawfor', 'neighborhood_Edwards',
       'neighborhood_Gilbert', 'neighborhood_Greens', 'neighborhood_GrnHill',
       'neighborhood_IDOTRR', 'neighborhood_Landmrk', 'neighborhood_MeadowV',
       'neighborhood_Mitchel', 'neighborhoo

In [30]:
train.columns

Index(['overall_qual', 'year_remod/add', 'mas_vnr_area', 'exter_qual',
       'bsmt_qual', 'heating_qc', 'electrical', 'kitchen_qual',
       'totrms_abvgrd', 'functional', 'garage_yr_blt', 'garage_finish',
       'garage_area', 'paved_drive', 'saleprice', 'total_hse_area',
       'total_bathrms', 'ms_zoning_A (agr)', 'ms_zoning_C (all)',
       'ms_zoning_FV', 'ms_zoning_I (all)', 'ms_zoning_RH', 'ms_zoning_RL',
       'ms_zoning_RM', 'garage_type_2Types', 'garage_type_Attchd',
       'garage_type_Basment', 'garage_type_BuiltIn', 'garage_type_CarPort',
       'garage_type_Detchd', 'garage_type_None', 'neighborhood_Blmngtn',
       'neighborhood_Blueste', 'neighborhood_BrDale', 'neighborhood_BrkSide',
       'neighborhood_ClearCr', 'neighborhood_CollgCr', 'neighborhood_Crawfor',
       'neighborhood_Edwards', 'neighborhood_Gilbert', 'neighborhood_Greens',
       'neighborhood_GrnHill', 'neighborhood_IDOTRR', 'neighborhood_Landmrk',
       'neighborhood_MeadowV', 'neighborhood_Mitchel',

We set the preprocessed test data aside, and now we model the train data. 

## Exploratory Modelling

In [31]:
# create features and target variable into X and y
features = [col for col in train.columns if col != 'saleprice']
X = train[features]
y = train['saleprice']

In [32]:
# do the train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)

In [33]:
# scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

In [34]:
len(X_train[0])

82

### Fitting and Cross Validating Models
I will fit Linear Regression, Ridge, Lasso and ElasticNet models, and cross-validate each with 10 folds, a commonly used value of k (Brownlee, 2020).
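
As a sketch of how both metrics could be collected in a single cross-validation pass, the snippet below uses `cross_validate` with two scorers. It also places the scaler inside a pipeline so that scaling is learned within each training fold. The data is synthetic (from `make_regression`) and only stands in for the housing data, so the printed numbers say nothing about this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# scaler + model in one estimator, so scaling is learned within each training fold
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.linspace(0.1, 10, 100)))

# one pass over 10 folds, two metrics
cv_results = cross_validate(model, X_demo, y_demo, cv=10,
                            scoring=('neg_root_mean_squared_error', 'r2'))

rmse = -cv_results['test_neg_root_mean_squared_error'].mean()
r2 = cv_results['test_r2'].mean()
print("RMSE: ", rmse)
print("R2 score: ", r2)
```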

In [35]:
#instantiate models
lr = LinearRegression()
ridge = RidgeCV(alphas = np.linspace(0.1, 10, 100))
lasso = LassoCV(n_alphas = 200)
enet = ElasticNetCV(l1_ratio = np.linspace(0.001, 1, 50), n_alphas = 100)

**Linear Regression Model**

In [36]:
#fit
lr_model = lr.fit(X_train,y_train)
#cross validation using 10 folds - lr
lr_negmse_score = cross_val_score(lr_model, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_error')
lr_mse_score = -lr_negmse_score
lr_rmse_score = np.sqrt(lr_mse_score)
print("RMSE: ", lr_rmse_score.mean()) 
lr_r2_score = cross_val_score(lr_model, X_train, y_train, cv = 10)
print("R2 score: ", lr_r2_score.mean())

RMSE:  3.2650570864332044e+16
R2 score:  -1.451517899341233e+24


In [37]:
# find features with highest coefficients from lr
lr_features = pd.DataFrame(X.columns, columns=['feature'])
lr_features['coef'] = lr_model.coef_
lr_features['abs_coef'] = np.abs(lr_model.coef_)
lr_features.sort_values(by='abs_coef', ascending=False).head(10)

Unnamed: 0,feature,coef,abs_coef
19,ms_zoning_I (all),-7.419247e+17,7.419247e+17
21,ms_zoning_RL,-4.570407e+17,4.570407e+17
22,ms_zoning_RM,-3.981464e+17,3.981464e+17
67,central_air_Y,3.514664e+17,3.514664e+17
66,central_air_N,3.514664e+17,3.514664e+17
18,ms_zoning_FV,-2.35729e+17,2.35729e+17
45,neighborhood_NAmes,-2.232729e+17,2.232729e+17
50,neighborhood_OldTown,-1.7514e+17,1.7514e+17
35,neighborhood_CollgCr,-1.720171e+17,1.720171e+17
37,neighborhood_Edwards,-1.600532e+17,1.600532e+17


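The cross-validated RMSE (about 3.3e16) and R2 (about -1.5e24) show that plain linear regression is numerically unstable here, so coefficients on the order of 1e17 cannot be read as feature importance. This is consistent with perfectly collinear dummy columns: for example, central_air_Y and central_air_N are both present and receive identical coefficients.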
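
One plausible cause of such extreme coefficients is that every level of each nominal feature was kept as a dummy column, so the dummies of one feature always sum to 1 and are linearly dependent on the intercept. The sketch below illustrates this on toy data (values made up) by comparing the rank of the design matrix with and without `drop_first=True`. The helper `design_rank` is hypothetical.

```python
import numpy as np
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({'central_air': ['Y', 'N', 'Y', 'Y', 'N'],
                    'area': [100.0, 80.0, 120.0, 95.0, 70.0]})

def design_rank(frame):
    """Return (rank, number of columns) of the design matrix: intercept plus the frame's columns."""
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy(dtype=float)])
    return np.linalg.matrix_rank(design), design.shape[1]

all_levels = pd.get_dummies(toy, columns=['central_air'])
dropped = pd.get_dummies(toy, columns=['central_air'], drop_first=True)

print(design_rank(all_levels))  # rank is below the column count: collinear
print(design_rank(dropped))     # rank equals the column count
```

Regularised models such as Ridge and Lasso remain stable under this kind of collinearity, which matches the much more reasonable scores in the sections that follow.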

**Ridge Model**

In [38]:
#fit
ridge.fit(X_train, y_train)
#cross validation using 10 folds - ridge
ridge_negmse_scores = cross_val_score(ridge, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_error')
ridge_mse_score = -ridge_negmse_scores
ridge_rmse_scores = np.sqrt(ridge_mse_score)
print("RMSE: ",ridge_rmse_scores.mean())
ridge_r2_score = cross_val_score(ridge, X_train, y_train, cv = 10)
print("R2 score: ", ridge_r2_score.mean())

RMSE:  28476.680556930085
R2 score:  0.8665850510284221


In [39]:
# find features with highest coefficients from ridge
ridge_features = pd.DataFrame(X.columns, columns=['feature'])
ridge_features['coef'] = ridge.coef_
ridge_features['abs_coef'] = np.abs(ridge.coef_)
ridge_features.sort_values(by='abs_coef', ascending=False).head(10)

Unnamed: 0,feature,coef,abs_coef
14,total_hse_area,26757.755114,26757.755114
0,overall_qual,10894.333121,10894.333121
12,garage_area,9733.852852,9733.852852
7,kitchen_qual,7447.705679,7447.705679
4,bsmt_qual,7275.778392,7275.778392
15,total_bathrms,6765.734752,6765.734752
55,neighborhood_StoneBr,5306.809535,5306.809535
2,mas_vnr_area,5237.947366,5237.947366
3,exter_qual,5165.034554,5165.034554
49,neighborhood_NridgHt,4723.067447,4723.067447


For the ridge model, the top 3 features that affect sale price most are the total house area, overall quality and garage area.

**Lasso Model**

In [118]:
#fit
lasso.fit(X_train, y_train)
#cross validation using 10 folds - lasso
lasso_negmse_scores = cross_val_score(lasso, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_error')
lasso_mse_score = -lasso_negmse_scores
lasso_rmse_scores = np.sqrt(lasso_mse_score)
print("RMSE: ",lasso_rmse_scores.mean())
lasso_r2_score = cross_val_score(lasso, X_train, y_train, cv = 10)
print("R2 score: ", lasso_r2_score.mean())

RMSE:  28378.79190907519
R2 score:  0.8675708988167792


In [119]:
# find features with highest coefficients from lasso
lasso_features = pd.DataFrame(X.columns, columns=['feature'])
lasso_features['coef'] = lasso.coef_
lasso_features['abs_coef'] = np.abs(lasso.coef_)
lasso_features.sort_values(by='abs_coef', ascending=False).head(10)

Unnamed: 0,feature,coef,abs_coef
14,total_hse_area,27900.689299,27900.689299
0,overall_qual,10866.357319,10866.357319
12,garage_area,8254.687021,8254.687021
7,kitchen_qual,7669.219521,7669.219521
4,bsmt_qual,6800.684095,6800.684095
15,total_bathrms,6593.635253,6593.635253
49,neighborhood_NridgHt,5667.816842,5667.816842
55,neighborhood_StoneBr,5659.551537,5659.551537
74,sale_type_New,5376.025542,5376.025542
2,mas_vnr_area,5310.213391,5310.213391


For the lasso model, the top 3 features that affect sale price most are the total house area, overall quality and garage area.

**Enet Model**

In [120]:
#fit
enet.fit(X_train, y_train)
#cross validation using 10 folds - enet
enet_scores = cross_val_score(enet, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_error')
enet_mse_score = -enet_scores
enet_rmse_scores = np.sqrt(enet_mse_score)
print("RMSE: ",enet_rmse_scores.mean())
enet_r2_score = cross_val_score(enet, X_train, y_train, cv = 10)
print("R2 score: ", enet_r2_score.mean())

RMSE:  28379.16338334804
R2 score:  0.8675675206485731


In [121]:
# find features with highest coefficients from enet
enet_features = pd.DataFrame(X.columns, columns=['feature'])
enet_features['coef'] = enet.coef_
enet_features['abs_coef'] = np.abs(enet.coef_)
enet_features.sort_values(by='abs_coef', ascending=False).head(10)

Unnamed: 0,feature,coef,abs_coef
14,total_hse_area,27905.296037,27905.296037
0,overall_qual,10866.663033,10866.663033
12,garage_area,8241.888955,8241.888955
7,kitchen_qual,7670.004112,7670.004112
4,bsmt_qual,6798.119563,6798.119563
15,total_bathrms,6590.970546,6590.970546
49,neighborhood_NridgHt,5670.353465,5670.353465
55,neighborhood_StoneBr,5659.534304,5659.534304
74,sale_type_New,5372.74936,5372.74936
2,mas_vnr_area,5311.748822,5311.748822


I look at the R2 score to determine which model best explains the variability in the data. The ridge model has an R2 score of 0.8666, Lasso 0.8676 and ElasticNet 0.8676. ElasticNet performs very close to Lasso, with Lasso marginally ahead.

RMSE can be used to compare models as well (Wu, 2020). On this metric Lasso also has the lowest error of the three (28,378.8 against 28,379.2 for ElasticNet and 28,476.7 for ridge), so I move forward with this model.

## Model Tuning

In this section I tune the Lasso model by searching for the optimal value of its hyperparameter, alpha.

In [122]:
optimal_lasso = LassoCV(n_alphas = 500, cv = 5)
optimal_lasso.fit(X, y)
print(optimal_lasso.alpha_)

50200.29542713745


In [123]:
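#re-instantiate LassoCV; note that n_alphas sets the number of candidate alphas to search, it does not fix alpha at 50200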
lasso = LassoCV(n_alphas = 50200)
#run model and cross validate with 10 folds - lasso
lasso_negmse_scores = cross_val_score(lasso, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_error')
lasso_mse_score = -lasso_negmse_scores
lasso_rmse_scores = np.sqrt(lasso_mse_score)
print("RMSE: ",lasso_rmse_scores.mean())
lasso_r2_score = cross_val_score(lasso, X_train, y_train, cv = 10)
print("R2 score: ", lasso_r2_score.mean())

RMSE:  28377.510123650933
R2 score:  0.8675812351712373


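The R2 score improves only marginally (0.86757 to 0.86758). Note that `n_alphas = 50200` sets the number of candidate alphas that LassoCV searches rather than fixing alpha at the value found above, so this run is effectively the earlier LassoCV with a finer alpha grid. The alpha of about 50200 was also found on the unscaled `X`, so it would not carry over directly to the scaled `X_train`.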
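
To refit with one fixed alpha, one option is the plain `Lasso` estimator, with `alpha` taken from a `LassoCV` that was fitted on the same scaled data. The sketch below shows the pattern on synthetic data from `make_regression`, which only stands in for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X_demo)

# search: LassoCV tries a grid of candidate alphas and stores the winner in alpha_
search = LassoCV(cv=5).fit(X_scaled, y_demo)

# refit: Lasso takes the penalty strength directly through alpha
tuned = Lasso(alpha=search.alpha_).fit(X_scaled, y_demo)

print(search.alpha_)
print(tuned.alpha == search.alpha_)
```

Because alpha depends on the scale of the features, the search and the final fit should see data that was scaled in the same way.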

In [124]:
#cross validate on the hold-out split (note: this refits the model on folds of X_test)
lasso_negmse_scores = cross_val_score(lasso, X_test, y_test, cv = 10, scoring = 'neg_mean_squared_error')
lasso_mse_score = -lasso_negmse_scores
lasso_rmse_scores = np.sqrt(lasso_mse_score)
print("RMSE: ",lasso_rmse_scores.mean())
lasso_r2_score = cross_val_score(lasso, X_test, y_test, cv = 10)
print("R2 score: ", lasso_r2_score.mean())

RMSE:  28609.14197790131
R2 score:  0.8628617477641136


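R2 drops slightly, from 0.8676 to 0.8629, and RMSE rises from about 28,378 to 28,609. A lower score on unseen data can be a sign that the model is overfitting the train set (Brownlee, 2018). Note, however, that the cell above cross-validates a fresh model within the hold-out split, so these numbers do not measure how a model fitted on `X_train` generalises.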
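
A hold-out estimate is usually obtained by fitting once on the train split and scoring the fitted model on the untouched test split. The sketch below shows that pattern on synthetic data. The variable names (`X_tr`, `X_te` and so on) are hypothetical and the numbers it prints are unrelated to the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption: not the housing dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# fit the scaler and the model on the train split only
scaler = StandardScaler().fit(X_tr)
model = LassoCV(cv=5).fit(scaler.transform(X_tr), y_tr)

# score the fitted model on the hold-out split, without refitting
y_te_pred = model.predict(scaler.transform(X_te))
holdout_rmse = np.sqrt(mean_squared_error(y_te, y_te_pred))
holdout_r2 = r2_score(y_te, y_te_pred)
print("RMSE: ", holdout_rmse)
print("R2 score: ", holdout_r2)
```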

## Model Evaluation on Kaggle Test Set

Apply the model to the cleaned Kaggle test set.

In [125]:
train.shape[1], test.shape[1]

(83, 82)

In [126]:
test.shape

(879, 82)

In [127]:
test.columns

Index(['overall_qual', 'year_remod/add', 'mas_vnr_area', 'exter_qual',
       'bsmt_qual', 'heating_qc', 'electrical', 'kitchen_qual',
       'totrms_abvgrd', 'functional', 'garage_yr_blt', 'garage_finish',
       'garage_area', 'paved_drive', 'total_hse_area', 'total_bathrms',
       'ms_zoning_A (agr)', 'ms_zoning_C (all)', 'ms_zoning_FV',
       'ms_zoning_I (all)', 'ms_zoning_RH', 'ms_zoning_RL', 'ms_zoning_RM',
       'garage_type_2Types', 'garage_type_Attchd', 'garage_type_Basment',
       'garage_type_BuiltIn', 'garage_type_CarPort', 'garage_type_Detchd',
       'garage_type_None', 'neighborhood_Blmngtn', 'neighborhood_Blueste',
       'neighborhood_BrDale', 'neighborhood_BrkSide', 'neighborhood_ClearCr',
       'neighborhood_CollgCr', 'neighborhood_Crawfor', 'neighborhood_Edwards',
       'neighborhood_Gilbert', 'neighborhood_Greens', 'neighborhood_GrnHill',
       'neighborhood_IDOTRR', 'neighborhood_Landmrk', 'neighborhood_MeadowV',
       'neighborhood_Mitchel', 'neighborhoo

In [128]:
#scaling 
kaggle_test_ss = ss.transform(test)

In [129]:
#apply model on test set
y_pred = lasso.predict(kaggle_test_ss)
y_pred.mean()

NotFittedError: This LassoCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
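
The error is consistent with how `cross_val_score` works: it fits clones of the estimator and leaves the object passed in unfitted, and `lasso` was re-created in the tuning section without a later call to `fit`. The sketch below reproduces the behaviour on synthetic data and shows that an explicit `fit` is needed before `predict`.

```python
from sklearn.datasets import make_regression
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

# synthetic stand-in data (assumption: not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

model = LassoCV(cv=5)
cross_val_score(model, X_demo, y_demo, cv=5)   # fits clones of `model`, not `model` itself

try:
    model.predict(X_demo)
    fitted_before = True
except NotFittedError:
    fitted_before = False

model.fit(X_demo, y_demo)        # explicit fit is needed before predict
preds = model.predict(X_demo)

print(fitted_before, preds.shape)
```

In this notebook the equivalent step would presumably be fitting `lasso` on the scaled `X_train` and `y_train` before predicting. Any NaN values left in the scaled test features would also need handling first.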

In [None]:
#check shape
y_pred.shape

In [None]:
#submission df
kaggle_submission = pd.DataFrame(y_pred, columns = ['saleprice'])
kaggle_submission.head()

In [None]:
#get id data from raw test set
raw_test = pd.read_csv("../data/test.csv")
#assign id on kaggle submission dataset
kaggle_submission['id'] = raw_test['Id']
#reorder the dataframe
kaggle_submission = kaggle_submission[['id', 'saleprice']]
#check
kaggle_submission.head()

In [None]:
kaggle_submission.shape

In [None]:
kaggle_submission.to_csv('zawanah_submission_1.csv', index = False)

## References

"All about Categorical Variable Encoding" (Roy, 2019)
https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

"3 Best metrics to evaluate Regression Model?" (Wu, 2020)
https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b

"How to Configure k-Fold Cross-Validation" (Brownlee, 2020)
https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/

"The Model Performance Mismatch Problem (and what to do about it)" (Brownlee, 2018)
https://machinelearningmastery.com/the-model-performance-mismatch-problem/