# Overfitting and Regularization - House Prices Model

## By Jean-Philippe Pitteloud

### Requirements

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Data Gathering

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_df = pd.read_sql_query('select * from houseprices',con = engine)

engine.dispose()


house_df.head()

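| | id | mssubclass | mszoning | lotfrontage | lotarea | street | alley | lotshape | landcontour | utilities | ... | poolarea | poolqc | fence | miscfeature | miscval | mosold | yrsold | saletype | salecondition | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |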


### Modeling and Evaluation

Our first and simplest model is a linear regression estimated by ordinary least squares (OLS). The target variable is 'saleprice'. The independent variables are 'overallqual', 'grlivarea', 'garagearea', 'firstflrsf', 'lotarea', 'fireplaces', and the categorical variables 'exterqual', 'kitchenqual', and 'mszoning'. After creating dummy variables for all three categorical variables, the model was estimated; the results and statistics are presented below. During optimization of the model, two of the dummy variables derived from 'mszoning' were removed because the p-values of their t-tests exceeded the accepted threshold.

In [3]:
house_df = pd.concat([house_df,pd.get_dummies(house_df['exterqual'], prefix = 'exterqual_dummy', drop_first=True)], axis = 1)

house_df = pd.concat([house_df,pd.get_dummies(house_df['kitchenqual'], prefix = 'kitchenqual_dummy', drop_first=True)], axis = 1)

house_df = pd.concat([house_df,pd.get_dummies(house_df['mszoning'], prefix = 'mszoning_dummy', drop_first=True)], axis = 1)
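
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy column. The data frame `toy` and its values are made up for illustration and are not taken from the house prices table; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy column standing in for a quality rating such as 'exterqual'
# (illustrative values, not rows from the real dataset).
toy = pd.DataFrame({"exterqual": ["Gd", "TA", "Ex", "Fa", "TA"]})

# drop_first=True omits the first category in sorted order ('Ex' here).
# That category becomes the baseline absorbed by the intercept, which
# avoids perfectly collinear dummy columns.
dummies = pd.get_dummies(toy["exterqual"], prefix="exterqual_dummy", drop_first=True)

print(list(dummies.columns))
print(dummies)
```

A row whose original value is the dropped category ('Ex') has all dummy columns equal to zero, so each remaining coefficient is read as a difference relative to that baseline.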

In [4]:
X = house_df[['overallqual', 'grlivarea', 'garagearea', 'firstflrsf', 'lotarea', 'fireplaces', 'exterqual_dummy_Fa', 'exterqual_dummy_Gd', 'exterqual_dummy_TA', 'kitchenqual_dummy_Fa', 'kitchenqual_dummy_Gd', 'kitchenqual_dummy_TA', 'mszoning_dummy_FV', 'mszoning_dummy_RL']]

Y = house_df['saleprice']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

alphas = [np.power(10.0, p) for p in np.arange(-10, 50, 1)]

#### - Linear Regression Model using OLS

In [5]:
lrm = LinearRegression()

lrm.fit(X_train, y_train)

# We are making predictions here to evaluate the performance of our model using the "train" and "test" subsets
y_preds_train = lrm.predict(X_train)

y_preds_test = lrm.predict(X_test)

print("R-squared of the model using training subset is: {}".format(lrm.score(X_train, y_train)))
print("\n-----Test set statistics-----\n")
print("R-squared of the model using testing subset is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error (MAE) of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error (MSE) of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error (RMSE) of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error (MAPE) of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model using training subset is: 0.7989494701062688

-----Test set statistics-----

R-squared of the model using testing subset is: 0.8048998229343759
Mean absolute error (MAE) of the prediction is: 23465.116162242302
Mean squared error (MSE) of the prediction is: 1309844854.593427
Root mean squared error (RMSE) of the prediction is: 36191.778826045935
Mean absolute percentage error (MAPE) of the prediction is: 14.017521953957651
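
The same five statistics are printed for every model in this notebook. As a sketch of what each one measures, the hypothetical helper `regression_report` below computes them with plain NumPy; it is not part of the original notebook, and the three-value example is made up.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R-squared, MAE, MSE, RMSE, and MAPE (in percent) as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mse_value = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(errors)),
        "mse": mse_value,
        "rmse": np.sqrt(mse_value),
        # MAPE is undefined when y_true contains zeros; sale prices are positive.
        "mape": np.mean(np.abs(errors / y_true)) * 100,
    }

# Illustrative values only.
report = regression_report([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
for name, value in report.items():
    print("{}: {:.4f}".format(name, value))
```

In the notebook itself, a call such as `regression_report(y_test, y_preds_test)` would return the same quantities as the print statements above, assuming the predictions come from the same fitted model.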


#### - Linear Regression Model using Ridge Regression

In [6]:
ridge_cv = RidgeCV(alphas = alphas, cv = 5)

ridge_cv.fit(X_train, y_train)

y_preds_train = ridge_cv.predict(X_train)

y_preds_test = ridge_cv.predict(X_test)

print("Best alpha value is: {}\n".format(ridge_cv.alpha_))
print("R-squared of the model using the training subset is: {}\n".format(ridge_cv.score(X_train, y_train)))
print("\n-----Test set statistics-----\n")
print("R-squared of the model using the testing subset is: {}".format(ridge_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 1.0

R-squared of the model using the training subset is: 0.7988677802372635


-----Test set statistics-----

R-squared of the model using the testing subset is: 0.8034633706478683
Mean absolute error of the prediction is: 23526.814467591732
Mean squared error of the prediction is: 1319488770.1687472
Root mean squared error of the prediction is: 36324.767998828946
Mean absolute percentage error of the prediction is: 14.05011579113144


#### - Linear Regression Model using Lasso Regression

In [7]:
lasso_cv = LassoCV(alphas = alphas, cv = 5)

lasso_cv.fit(X_train, y_train)

y_preds_train = lasso_cv.predict(X_train)

y_preds_test = lasso_cv.predict(X_test)

print("Best alpha value is: {}\n".format(lasso_cv.alpha_))
print("R-squared of the model using the training subset is: {}\n".format(lasso_cv.score(X_train, y_train)))
print("\n-----Test set statistics-----\n")
print("R-squared of the model using the testing subset is: {}".format(lasso_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 1.0

R-squared of the model using the training subset is: 0.7989493776676728


-----Test set statistics-----

R-squared of the model using the testing subset is: 0.8048565085500624
Mean absolute error of the prediction is: 23466.721727096585
Mean squared error of the prediction is: 1310135654.5520747
Root mean squared error of the prediction is: 36195.796089491865
Mean absolute percentage error of the prediction is: 14.01850595644012


#### - Linear Regression Model using ElasticNet Regression

In [8]:
elasticnet_cv = ElasticNetCV(alphas = alphas, cv = 5)

elasticnet_cv.fit(X_train, y_train)

y_preds_train = elasticnet_cv.predict(X_train)

y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}\n".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}\n".format(elasticnet_cv.score(X_train, y_train)))
print("\n-----Test set statistics-----\n")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 0.001

R-squared of the model in training set is: 0.7989193823119797


-----Test set statistics-----

R-squared of the model in test set is: 0.804043678368228
Mean absolute error of the prediction is: 23500.78164112521
Mean squared error of the prediction is: 1315592755.8594503
Root mean squared error of the prediction is: 36271.10083605749
Mean absolute percentage error of the prediction is: 14.036896956425327


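A comparison of the test-set statistics for the OLS, Ridge, Lasso, and ElasticNet regression techniques suggests that the linear regression model estimated by OLS yields the most accurate predictions of the target, although only marginally. Its test R-squared (0.8049) and RMSE (about 36,192) are nearly identical to those of Lasso (0.8049; 36,196), ElasticNet (0.8040; 36,271), and Ridge (0.8035; 36,325). For this feature set, regularization does not appear to improve on the unregularized model.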
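
A side-by-side table makes this kind of comparison easier to read than four separate printouts. The sketch below fits the same four estimators on synthetic data from `make_regression`, since the database is not assumed to be reachable; `X_demo`, `y_demo`, `demo_alphas`, and the narrower alpha grid are assumptions for illustration, and the resulting numbers say nothing about the house prices data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices features and target.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, n_informative=5, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Assumed, deliberately small grid of regularization strengths.
demo_alphas = np.logspace(-3, 3, 13)

models = {
    "OLS": LinearRegression(),
    "Ridge": RidgeCV(alphas=demo_alphas, cv=5),
    "Lasso": LassoCV(alphas=demo_alphas, cv=5),
    "ElasticNet": ElasticNetCV(alphas=demo_alphas, cv=5),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": model.score(X_te, y_te),
        # LinearRegression has no alpha_, so it is reported as NaN.
        "alpha": getattr(model, "alpha_", np.nan),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(4))
```

When the train and test R-squared values are close for the unregularized model, as they are in this notebook, there is little overfitting for the penalty terms to correct, which is consistent with all four models scoring almost the same.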