# Overfitting and regularization

## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Scikit-learn has RidgeCV, LassoCV, and ElasticNetCV that you can utilize to do this. Which model is the best? Why?

#### Given these instructions I will:
    * import needed libraries
    * import data
    * include "EDA" from previous sections
    * data preprcoessing (changing non-numeric values to numeric values)
    * select vectors for "model data" and target
    * look for overfitting
    * run each model with the same "model data" and modifying "alpha" (lambda) as needed for that model
    * summurize results

In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.metrics import mean_absolute_error
from sklearn import preprocessing
from scipy.stats.mstats import winsorize
from sqlalchemy import create_engine

In [43]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
homes_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


In [44]:
# Preparing data for modeling about house prices 

# ojects holding columns
non_numeric_columns = homes_df.select_dtypes(['object']).columns
numeric_columns = homes_df.select_dtypes(['int64', 'float64']).columns

# dropping Missing data
homes_df = homes_df.drop(['poolqc', 'miscfeature', 'alley', 'fence', 'fireplacequ', 'lotfrontage'], axis=1)
homes_df = homes_df.dropna(axis=0)
numeric_columns = numeric_columns.drop(['id'])

FILL_LIST = []
for cols in homes_df[:]:
    if cols in numeric_columns:
        FILL_LIST.append(cols)

In [45]:
# Preprocessing
homes_win = homes_df.copy()
for col in FILL_LIST:
    homes_win[col] = winsorize(homes_win[col], (.05, .14))

In [46]:
def cat_converter(df):
    for cols in df:
        if cols in non_numeric_columns:
            # Create a label (category) encoder object
            le = preprocessing.LabelEncoder()
            # Create a label (category) encoder object
            le.fit(df[cols])
            # Apply the fitted encoder to the pandas column
            df[cols] = le.transform(df[cols]) 
    return df
cat_converter(homes_win).head()

Unnamed: 0,id,mssubclass,mszoning,lotarea,street,lotshape,landcontour,utilities,lotconfig,landslope,...,enclosedporch,threessnporch,screenporch,poolarea,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,3,8450,1,3,3,0,4,0,...,0,0,0,0,0,2,2008,8,4,208500
1,2,20,3,9600,1,3,3,0,2,0,...,0,0,0,0,0,5,2007,8,4,181500
2,3,60,3,11250,1,0,3,0,4,0,...,0,0,0,0,0,9,2008,8,4,223500
3,4,70,3,9550,1,0,3,0,0,0,...,0,0,0,0,0,2,2006,8,0,140000
4,5,60,3,13518,1,0,3,0,2,0,...,0,0,0,0,0,10,2008,8,4,250000


In [47]:
# selecting data and target
homes_mod3 = homes_win[['lotarea', 'masvnrarea', 'bsmtfinsf1', 'totalbsmtsf',
                        'grlivarea', 'garagearea', 'wooddecksf', 'openporchsf',
                        'saleprice']]

X = homes_mod3.iloc[:, :-1]
Y = homes_mod3['saleprice']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

# CV

In [48]:
from sklearn.model_selection import cross_val_score
lrm = LinearRegression()
scores = cross_val_score(lrm, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.73 (+/- 0.08)


Based on the low varience in our accuarcy, I conclude that we are not dealing with overfitting. 

# OLS

In [49]:
err_chart = pd.DataFrame(index=['Best alpha value', 'R-squared of training', 'R-squared of test',
                                'Mean absolute error', 'Mean squared error', 'Root mean squared error',
                                'Mean absolute percentage error'])

In [54]:
# creating linear obj and fitting 
lrm = LinearRegression()
lrm.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

err_chart['OLS'] = list(['N/A', lrm.score(X_train, y_train), lrm.score(X_test, y_test),
                          mean_absolute_error(y_test, y_preds_test),
                          mse(y_test, y_preds_test), rmse(y_test, y_preds_test),
                          np.mean(np.abs((y_test - y_preds_test) / y_test) * 100)])

# Lasso

In [55]:
# creating lasso obj specifying "alpha" (lambda) and folds
lasso_cv = LassoCV(alphas=alphas, cv=5)

# fitting data to target using lasso
lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

err_chart['Lasso'] = list([lasso_cv.alpha_, lasso_cv.score(X_train, y_train), lasso_cv.score(X_test, y_test),
                          mean_absolute_error(y_test, y_preds_test),
                          mse(y_test, y_preds_test), rmse(y_test, y_preds_test),
                          np.mean(np.abs((y_test - y_preds_test) / y_test) * 100)])



# Ridge 

In [57]:
# creating Ridge obj specifying "alpha" (lambda) and folds
ridge_cv = RidgeCV(alphas=alphas, cv=5)

# fitting data to target using lasso
ridge_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

err_chart['Ridge'] = list([ridge_cv.alpha_, ridge_cv.score(X_train, y_train), ridge_cv.score(X_test, y_test),
                          mean_absolute_error(y_test, y_preds_test),
                          mse(y_test, y_preds_test), rmse(y_test, y_preds_test),
                          np.mean(np.abs((y_test - y_preds_test) / y_test) * 100)])

# ElasticNet

In [58]:
# creating ElasticNet obj specifying "alpha" (lambda) and folds
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)

elasticnet_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

err_chart['ElasticNet'] = list([elasticnet_cv.alpha_, elasticnet_cv.score(X_train, y_train), elasticnet_cv.score(X_test, y_test),
                          mean_absolute_error(y_test, y_preds_test),
                          mse(y_test, y_preds_test), rmse(y_test, y_preds_test),
                          np.mean(np.abs((y_test - y_preds_test) / y_test) * 100)])



# Comparing Results

In [59]:
err_chart

Unnamed: 0,OLS,Lasso,Ridge,ElasticNet
Best alpha value,,100.0,10000.0,100.0
R-squared of training,0.74433,0.7443299,0.7443294,0.7443163
R-squared of test,0.715853,0.7158524,0.715831,0.7157269
Mean absolute error,19520.6,19520.74,19520.74,19528.26
Mean squared error,750601000.0,750602000.0,750602000.0,750933600.0
Root mean squared error,27397.1,27397.12,27397.12,27403.17
Mean absolute percentage error,11.9538,11.95389,11.95389,11.95932


### The on the surface it appears the OLS model performed best, but as we dive into our prediction metrics it becomes clear that the ElasticNet performs marginally better than the OLS model at making predictions. 

Even though there is a smaller generalization gap using OLS, the ElasticNet model creates smaller errors. It at least marginally outperforms OLS on every error metric above. This is why I would select this model if I was choosing between these four. 