# Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Load the houseprices data from Thinkful's database.
- Reimplement your model from the previous checkpoint.
- Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Which model is the best? Why?

This is not a graded checkpoint, but you should discuss your solution with your mentor. After you've submitted your work, take a moment to compare your solution to [this example solution](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/7.solution_overfitting_and_regularization.ipynb).

### Import Statements

In [34]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from scipy.stats import bartlett
from scipy.stats import levene
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from statsmodels.tsa.stattools import acf
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sqlalchemy import create_engine
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings('ignore')

### Loading the House Prices Dataframe

In [35]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_prices_df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

### EDA on the House Prices Dataframe

Please note that I'm reusing code from earlier assignments in this module.

In [36]:
# Dropping columns with fewer than 800 non-null data points (out of 1,460 rows), along with the id column.

fist_revision_house_prices_df = house_prices_df.drop(columns=['alley', 'fence', 'fireplacequ', 'poolqc', 'miscfeature', 'id'])

In [37]:
# Dropping columns that can't be transformed with one-hot encoding.

second_revision_house_prices_df = fist_revision_house_prices_df.drop(columns=['mszoning', 'lotshape', 'landcontour', 
                                    'lotconfig', 'landslope', 'neighborhood', 'condition1', 'condition2', 'bldgtype', 
                                    'housestyle', 'roofstyle', 'roofmatl', 'exterior1st', 'exterior2nd', 'masvnrtype', 
                                    'exterqual', 'extercond', 'foundation', 'bsmtqual', 'bsmtexposure', 'bsmtfintype1', 
                                    'bsmtfintype2', 'heating', 'heatingqc', 'electrical', 'kitchenqual', 'functional', 
                                    'garagetype', 'garagefinish', 'garagequal', 'garagecond', 'paveddrive', 'saletype', 
                                    'salecondition'])

In [38]:
# Transforming categorical variables using one-hot encoding.

second_revision_house_prices_df['street'] = pd.get_dummies(second_revision_house_prices_df['street'], drop_first=True)

second_revision_house_prices_df['utilities'] = pd.get_dummies(second_revision_house_prices_df['utilities'], drop_first=True)

second_revision_house_prices_df['bsmtcond'] = pd.get_dummies(second_revision_house_prices_df['bsmtcond'], drop_first=True)

second_revision_house_prices_df['centralair'] = pd.get_dummies(second_revision_house_prices_df['centralair'], drop_first=True)
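
As a minimal sketch of the encoding step on made-up data (the `toy_df` frame and its values are hypothetical, not taken from the house prices table): `pd.get_dummies` with `drop_first=True` returns one indicator column per category beyond the first. A column with more than two categories therefore becomes several columns, which are usually joined back with `pd.concat` rather than assigned to a single column.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy_df = pd.DataFrame({
    'centralair': ['Y', 'N', 'Y', 'Y'],
    'bsmtcond': ['TA', 'Gd', 'Fa', 'TA'],
})

# A two-category column yields a single indicator column.
centralair_dummies = pd.get_dummies(toy_df['centralair'], prefix='centralair', drop_first=True)

# A three-category column yields two indicator columns.
bsmtcond_dummies = pd.get_dummies(toy_df['bsmtcond'], prefix='bsmtcond', drop_first=True)

# Join the indicators back and drop the original text columns.
encoded_df = pd.concat(
    [toy_df.drop(columns=['centralair', 'bsmtcond']), centralair_dummies, bsmtcond_dummies],
    axis=1)

print(list(encoded_df.columns))
```

Passing a `prefix` keeps the indicator names traceable to the column they came from.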

In [39]:
# Filling in missing values with means.

third_revision_house_prices_df = second_revision_house_prices_df.copy()

replace_nans_with_means_list = ['lotfrontage', 'masvnrarea', 'garageyrblt']

for col in replace_nans_with_means_list:
    # Fill each column's missing values with that column's own mean.
    third_revision_house_prices_df[col] = third_revision_house_prices_df[col].fillna(
        third_revision_house_prices_df[col].mean())

In [40]:
# Log transforming the data to normalize it.

fourth_revision_house_prices_df = third_revision_house_prices_df.copy()

log_transform_list = ['mssubclass', 'lotfrontage', 'lotarea', 'street', 'utilities', 'overallqual', 'overallcond', 'yearbuilt',
                 'yearremodadd', 'masvnrarea', 'bsmtcond', 'bsmtfinsf1', 'bsmtfinsf2', 'bsmtunfsf', 'totalbsmtsf', 'centralair',
                 'firstflrsf', 'secondflrsf', 'lowqualfinsf', 'grlivarea', 'bsmtfullbath', 'bsmthalfbath', 'fullbath',
                 'halfbath', 'bedroomabvgr', 'totrmsabvgrd', 'fireplaces', 'garageyrblt', 'garagecars', 'garagearea',
                 'wooddecksf', 'openporchsf', 'enclosedporch', 'threessnporch', 'screenporch', 'poolarea', 'miscval',
                 'mosold', 'yrsold', 'saleprice']

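# Caution: np.log returns a new object and the result is never assigned back, so this loop
# leaves the dataframe unchanged (see the untransformed values in the output below).
for col in log_transform_list:
    np.log(fourth_revision_house_prices_df[log_transform_list])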
    
fourth_revision_house_prices_df.head()

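| | mssubclass | lotfrontage | lotarea | street | utilities | overallqual | overallcond | yearbuilt | yearremodadd | masvnrarea | ... | wooddecksf | openporchsf | enclosedporch | threessnporch | screenporch | poolarea | miscval | mosold | yrsold | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 65.0 | 8450 | 1 | 0 | 7 | 5 | 2003 | 2003 | 196.0 | ... | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 208500 |
| 1 | 20 | 80.0 | 9600 | 1 | 0 | 6 | 8 | 1976 | 1976 | 0.0 | ... | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 181500 |
| 2 | 60 | 68.0 | 11250 | 1 | 0 | 7 | 5 | 2001 | 2002 | 162.0 | ... | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 223500 |
| 3 | 70 | 60.0 | 9550 | 1 | 0 | 7 | 5 | 1915 | 1970 | 0.0 | ... | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 140000 |
| 4 | 60 | 84.0 | 14260 | 1 | 0 | 8 | 5 | 2000 | 2000 | 350.0 | ... | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 250000 |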
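
For a log transform to take effect, the result has to be assigned back to the dataframe. The sketch below shows one way to do that on made-up data; `toy_df` and `cols_to_transform` are hypothetical names, and `np.log1p` is an assumed substitute for `np.log` that stays finite for columns containing zeros (such as pool or porch areas).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; zeros are included on purpose.
toy_df = pd.DataFrame({
    'grlivarea': [1710, 1262, 1786],
    'poolarea': [0, 0, 512],
})

cols_to_transform = ['grlivarea', 'poolarea']

# Work on a copy and assign the transformed values back.
# np.log1p computes log(1 + x), which is defined at zero.
logged_df = toy_df.copy()
logged_df[cols_to_transform] = np.log1p(logged_df[cols_to_transform])

print(logged_df)
```

If the target is transformed this way, predictions need to be mapped back with `np.expm1` before errors are reported in the original units.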


### Reimplementing the Model from the Previous Checkpoint

_Splitting the Data into Train and Test Sets_

In [41]:
# Y is the target variable.
Y = fourth_revision_house_prices_df['saleprice']

# X is the feature set.
X = fourth_revision_house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf']]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print('- The number of observations in the training set is: {}'.format(X_train.shape[0]))
print('- The number of observations in the test set is: {}'.format(X_test.shape[0]))

- The number of observations in the training set is: 1168
- The number of observations in the test set is: 292


### Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Which model is the best? Why?

_OLS Regression_

In [42]:
# Fit an OLS model using sklearn.
lrm = LinearRegression()
lrm.fit(X_train, y_train)

# Make predictions here.
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print('- R-squared of the model in training set is: {}'.format(lrm.score(X_train, y_train)))
print('--- Test Set Statistics ---')
print('- R-squared of the model in test set is: {}'.format(lrm.score(X_test, y_test)))
print('- Mean absolute error of the prediction is: {}'.format(mean_absolute_error(y_test, y_preds_test)))
print('- Mean squared error of the prediction is: {}'.format(mse(y_test, y_preds_test)))
print('- Root mean squared error of the prediction is: {}'.format(rmse(y_test, y_preds_test)))
print('- Mean absolute percentage error of the prediction is: {}'.format(np.mean(np.abs((y_test - y_preds_test)
        / y_test)) * 100))

- R-squared of the model in training set is: 0.7589588843157826
--- Test Set Statistics ---
- R-squared of the model in test set is: 0.7656379370075236
- Mean absolute error of the prediction is: 25964.89937383187
- Mean squared error of the prediction is: 1573437538.2926502
- Root mean squared error of the prediction is: 39666.579614237606
- Mean absolute percentage error of the prediction is: 16.097178444700113
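
OLS has no hyperparameter to tune, but k-fold cross-validation still gives a steadier estimate of out-of-sample fit than a single split. The sketch below assumes synthetic data from `make_regression` as a stand-in for the five house price features; `X_demo`, `y_demo`, and the choice of five folds are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the five-feature training set.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=465)

# Five folds, shuffled with a fixed seed so the split is reproducible.
kfold = KFold(n_splits=5, shuffle=True, random_state=465)

# For regressors, cross_val_score reports R-squared by default.
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold)

print('R-squared per fold: {}'.format(fold_scores))
print('Mean R-squared: {:.3f}'.format(fold_scores.mean()))
```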


_Lasso Regression_

In [46]:
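# With alpha this large, the L1 penalty drives every coefficient to zero,
# leaving an intercept-only model (hence the training R-squared of 0.0 below).
lasso_regr = Lasso(alpha=10**20.5)
lasso_regr.fit(X_train, y_train)

# Caution: these predictions come from the OLS model (lrm), not from lasso_regr,
# so the error metrics printed below repeat the OLS values.
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)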

print('- R-squared of the model in training set is: {}'.format(lasso_regr.score(X_train, y_train)))
print('--- Test Set Statistics ---')
print('- R-squared of the model in test set is: {}'.format(lasso_regr.score(X_test, y_test)))
print('- Mean absolute error of the prediction is: {}'.format(mean_absolute_error(y_test, y_preds_test)))
print('- Mean squared error of the prediction is: {}'.format(mse(y_test, y_preds_test)))
print('- Root mean squared error of the prediction is: {}'.format(rmse(y_test, y_preds_test)))
print('- Mean absolute percentage error of the prediction is: {}'.format(np.mean(np.abs((y_test - y_preds_test)
        / y_test)) * 100))

- R-squared of the model in training set is: 0.0
--- Test Set Statistics ---
- R-squared of the model in test set is: -0.0016183407463286061
- Mean absolute error of the prediction is: 25964.89937383187
- Mean squared error of the prediction is: 1573437538.2926502
- Root mean squared error of the prediction is: 39666.579614237606
- Mean absolute percentage error of the prediction is: 16.097178444700113
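
The assignment asks for the lasso penalty to be chosen by k-fold cross-validation. One way to do that is scikit-learn's `LassoCV`, sketched here on synthetic data. The data, the alpha grid, and the names `X_tr`, `X_te`, and `lasso_cv` are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=465)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

# Candidate penalties on a log scale (an assumed grid).
alphas = np.logspace(-3, 3, 13)

# LassoCV fits every alpha on 5 folds and keeps the one with the lowest average error.
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_tr, y_tr)

# Predict with the lasso model itself so the metrics describe this fit.
lasso_preds_test = lasso_cv.predict(X_te)

print('Best alpha: {}'.format(lasso_cv.alpha_))
print('R-squared in training set: {}'.format(lasso_cv.score(X_tr, y_tr)))
print('R-squared in test set: {}'.format(lasso_cv.score(X_te, y_te)))
```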


_Ridge Regression_

In [47]:
# Fit a ridge regression model. Alpha is the regularization parameter (usually called lambda). As alpha gets larger,
# parameter shrinkage grows more pronounced.

ridge_regr = Ridge(alpha=10**37)
ridge_regr.fit(X_train, y_train)

# We're making predictions here.
y_preds_train = ridge_regr.predict(X_train)
y_preds_test = ridge_regr.predict(X_test)

print('- R-squared of the model in training set is: {}'.format(ridge_regr.score(X_train, y_train)))
print('--- Test Set Statistics ---')
print('- R-squared of the model in test set is: {}'.format(ridge_regr.score(X_test, y_test)))
print('- Mean absolute error of the prediction is: {}'.format(mean_absolute_error(y_test, y_preds_test)))
print('- Mean squared error of the prediction is: {}'.format(mse(y_test, y_preds_test)))
print('- Root mean squared error of the prediction is: {}'.format(rmse(y_test, y_preds_test)))
print('- Mean absolute percentage error of the prediction is: {}'.format(np.mean(np.abs((y_test - y_preds_test)
        / y_test)) * 100))

- R-squared of the model in training set is: 0.0
--- Test Set Statistics ---
- R-squared of the model in test set is: -0.0016183407463286061
- Mean absolute error of the prediction is: 58023.64411709514
- Mean squared error of the prediction is: 6724569139.943377
- Root mean squared error of the prediction is: 82003.47029207592
- Mean absolute percentage error of the prediction is: 36.65964810134902
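
A cross-validated choice of the ridge penalty could look like the following sketch, which uses `RidgeCV` on synthetic data. The alpha grid and all variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=465)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

# Candidate penalties on a log scale (an assumed grid).
alphas = np.logspace(-3, 3, 13)

# With cv=5, RidgeCV scores each alpha by 5-fold cross-validation.
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_tr, y_tr)

ridge_preds_test = ridge_cv.predict(X_te)

print('Best alpha: {}'.format(ridge_cv.alpha_))
print('R-squared in test set: {}'.format(ridge_cv.score(X_te, y_te)))
```

If the selected alpha sits at either end of the grid, widening the grid in that direction is usually worth a second run.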


_ElasticNet Regression_

In [48]:
elastic_regr = ElasticNet(alpha=10**21, l1_ratio=0.5)
elastic_regr.fit(X_train, y_train)

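# Caution: these predictions come from ridge_regr, not from elastic_regr,
# so the error metrics printed below describe the ridge fit.
y_preds_train = ridge_regr.predict(X_train)
y_preds_test = ridge_regr.predict(X_test)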

print('- R-squared of the model in training set is: {}'.format(elastic_regr.score(X_train, y_train)))
print('--- Test Set Statistics ---')
print('- R-squared of the model in test set is: {}'.format(elastic_regr.score(X_test, y_test)))
print('- Mean absolute error of the prediction is: {}'.format(mean_absolute_error(y_test, y_preds_test)))
print('- Mean squared error of the prediction is: {}'.format(mse(y_test, y_preds_test)))
print('- Root mean squared error of the prediction is: {}'.format(rmse(y_test, y_preds_test)))
print('- Mean absolute percentage error of the prediction is: {}'.format(np.mean(np.abs((y_test - y_preds_test)
        / y_test)) * 100))

- R-squared of the model in training set is: 0.0
--- Test Set Statistics ---
- R-squared of the model in test set is: -0.0016183407463286061
- Mean absolute error of the prediction is: 58023.64411709514
- Mean squared error of the prediction is: 6724569139.943377
- Root mean squared error of the prediction is: 82003.47029207592
- Mean absolute percentage error of the prediction is: 36.65964810134902
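
ElasticNet has two hyperparameters, `alpha` and `l1_ratio`, and `ElasticNetCV` can search both with k-fold cross-validation. This is a sketch on synthetic data; the grids for `alpha` and `l1_ratio` and the raised `max_iter` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=465)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

# Assumed search grids for the two hyperparameters.
alphas = np.logspace(-3, 3, 13)
l1_ratios = [0.1, 0.5, 0.9]

# Every (l1_ratio, alpha) pair is scored on 5 folds.
elastic_cv = ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5, max_iter=10000)
elastic_cv.fit(X_tr, y_tr)

# Predict with the elastic net model itself.
elastic_preds_test = elastic_cv.predict(X_te)

print('Best alpha: {}'.format(elastic_cv.alpha_))
print('Best l1_ratio: {}'.format(elastic_cv.l1_ratio_))
print('R-squared in test set: {}'.format(elastic_cv.score(X_te, y_te)))
```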


_Analysing the Regression Models_

In [53]:
# Load the summary spreadsheet of the metrics reported above (read_excel takes no delimiter argument).
regression_analysis_df = pd.read_excel('Regression Analysis for 19.7.xlsx')

regression_analysis_df.head(7)

| | Unnamed: 0 | OLS | Lasso | Ridge | ElasticNet |
|---|---|---|---|---|---|
| 0 | R-squared of model in training set | 0.759 | 0.0 | 0.0 | 0.0 |
| 1 | R-squared of model in test set | 0.766 | -0.002 | -0.002 | -0.002 |
| 2 | Mean absolute error of pred. | 25964.899 | 25964.899 | 58023.644 | 58023.644 |
| 3 | Mean squared error of pred. | 1573437538.293 | 1573437538.293 | 6724569139.943 | 6724569139.943 |
| 4 | Root mean squared error of pred. | 39666.58 | 39666.58 | 82003.47 | 82003.47 |
| 5 | Mean absolute percentage error of pred. | 16.097 | 16.097 | 36.66 | 36.66 |
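
A comparison table like this one can also be assembled directly from the fitted models, which avoids copying numbers into a spreadsheet by hand. The sketch below assumes synthetic data and cross-validated estimators; the `models` dictionary, the metric labels, and the alpha grid are hypothetical choices, not the notebook's own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=465)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

alphas = np.logspace(-3, 3, 13)  # assumed penalty grid

models = {
    'OLS': LinearRegression(),
    'Lasso': LassoCV(alphas=alphas, cv=5),
    'Ridge': RidgeCV(alphas=alphas, cv=5),
    'ElasticNet': ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5),
}

rows = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Each model is evaluated on its own predictions.
    preds = model.predict(X_te)
    test_mse = mean_squared_error(y_te, preds)
    rows[name] = {
        'R-squared (train)': model.score(X_tr, y_tr),
        'R-squared (test)': model.score(X_te, y_te),
        'MAE (test)': mean_absolute_error(y_te, preds),
        'MSE (test)': test_mse,
        'RMSE (test)': np.sqrt(test_mse),
    }

# Outer keys become columns (one per model); inner keys become the row labels.
comparison_df = pd.DataFrame(rows)
print(comparison_df)
```

Looping over one dictionary keeps each model paired with its own predictions, so metrics from one model cannot be reported under another model's name.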
