# Capstone Project Part 7: Optimizing Models

**Author:** Kate Meredith

**Date:** September-November 2022

**Notebook #**: 7 of

## Background

**Source:** Data was collected from [CoffeeReview.com](https://www.coffeereview.com/). See prior notebooks for details on scraping, cleaning, compilation and text transformation. 

**Goal:** Optimize models on the full dataset (both the transformed text and the original numerical values).

## References

- Documentation on [Ridge Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
- Documentation on [Lasso Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- Documentation on [ElasticNet Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)
- [Differences](https://towardsdatascience.com/whats-the-difference-between-linear-regression-lasso-ridge-and-elasticnet-8f997c60cf29) between linear, ridge, lasso and elastic net regression

## Table of Contents

* [1. Importing Libraries](#header1)
* [2. Importing the Data and EDA](#header2)
* [3. Scaling the Data](#header3)
* [4. Principal Component Analysis](#header4)
* [5. Optimizing the Models](#header5)
    * [5.1 Linear Regression](#subheader51)
    * [5.2 XGBoost Regressor](#subheader52)
* [6. Comparing All Models](#header6)
* [7. Further Model Interpretation](#header7)

## 1. Importing Libraries <a class="anchor" id="header1"></a>

In [132]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import operator

from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from tempfile import mkdtemp
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.neural_network import MLPRegressor
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

## 2. Importing the Data and EDA <a class="anchor" id="header2"></a>

In the last notebook, the text was transformed using a few different methods. Given time limitations, this notebook moves forward with the transformation method that worked best: TF-IDF vectorization. The data has already been split into remain (training), validation, and test sets, which are imported here.

Importing the remain (training) data and exploring some initial info about it:

In [13]:
Xremain_df_tfidf = pd.read_csv('tfidf_Xremain_combo_df.csv')
Xremain_df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4194 entries, 0 to 4193
Columns: 632 entries, coffee_name to zesty
dtypes: float64(621), int64(9), object(2)
memory usage: 20.2+ MB


In [14]:
Xremain_df_tfidf.shape

(4194, 632)

The remain data set has 4,194 rows and 632 columns.

In [15]:
Xremain_df_tfidf.head()

Unnamed: 0,coffee_name,roaster_name,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,...,white,wild,willem,wine,winy,wisteria,wood,woody,zest,zesty
0,Celebration Caffe,Allegro Coffee,11,1999,54,71,8,8,7,8,...,0.0,0.314372,0.0,0.24877,0.0,0.0,0.0,0.0,0.0,0.0
1,Kenya,Starbucks Coffee,6,1999,33,35,7,8,7,8,...,0.0,0.0,0.0,0.403825,0.0,0.0,0.0,0.0,0.0,0.0
2,Colombia Tolima Cup of Excellence Presidential...,Terroir Coffee,2,2007,56,68,8,7,8,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Gesha Village 1931, Lot 86",Mudhouse Coffee Roasters,9,2017,54,78,10,9,9,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Kenya AA,Willoughby's Coffee & Tea,3,1997,46,49,8,8,8,8,...,0.0,0.0,0.0,0.0,0.451704,0.0,0.0,0.0,0.0,0.0


Most of the columns come from the vectorized text.

In [16]:
#importing the validation data and checking out the info
Xval_df_tfidf = pd.read_csv('tfidf_Xval_combo_df.csv')
Xval_df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1049 entries, 0 to 1048
Columns: 632 entries, coffee_name to zesty
dtypes: float64(618), int64(12), object(2)
memory usage: 5.1+ MB


In [17]:
Xval_df_tfidf.shape

(1049, 632)

The validation data set has 1,049 rows and 632 columns.

In [18]:
Xval_df_tfidf.head()

Unnamed: 0,coffee_name,roaster_name,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,...,white,wild,willem,wine,winy,wisteria,wood,woody,zest,zesty
0,Hojas de Otono,Caribou Coffee,8,2011,43,49,8,7,8,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.186518,0.0,0.0,0.0
1,Colombia Supremo Pitalito Estate,The Roasterie,12,2004,61,68,8,8,8,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Papua New Guinea Sero Bebes,Temple Coffee and Tea,3,2016,57,80,9,8,8,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.147089,0.0
3,El Gaucho Espresso,Manzanita Roasting Company,3,2016,60,72,9,8,8,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Jamaican Blue Mtn Dark Roast,Endless World Coffee,6,1999,35,42,7,6,6,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
#importing the test data and checking out the info
Xtest_df_tfidf = pd.read_csv('tfidf_Xtest_combo_df.csv')
Xtest_df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1311 entries, 0 to 1310
Columns: 632 entries, coffee_name to zesty
dtypes: float64(620), int64(10), object(2)
memory usage: 6.3+ MB


In [20]:
Xtest_df_tfidf.shape

(1311, 632)

The test data set has 1,311 rows and 632 columns.

In [21]:
Xtest_df_tfidf.head()

Unnamed: 0,coffee_name,roaster_name,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,...,white,wild,willem,wine,winy,wisteria,wood,woody,zest,zesty
0,East Timor Maubesse Fair-Trade Organic,PT's Coffee Roasting Co.,4,2007,51,68,8,7,8,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ambrosia Espresso,Caffe Fresco,9,2005,37,47,8,7,7,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,SFCC House Blend Espresso Capsule,Gourmesso,7,2016,58,58,7,8,8,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116112,0.0
3,Sumatra Dark,Van Houtte Cafe,1,2006,46,57,8,8,7,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,French Roast,Tully's Coffee,1,2012,25,28,6,7,8,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.223662,0.0,0.0,0.0
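
Because the scaler and the models are later fit on the remain data and then applied to the validation and test data, the three frames need the same columns in the same order. A minimal sketch of such a check follows; the tiny stand-in frames and the `same_columns` helper are hypothetical and not part of the original notebook:

```python
import pandas as pd

def same_columns(reference, *others):
    """Return True when every frame has the reference frame's columns, in order."""
    return all(list(df.columns) == list(reference.columns) for df in others)

# Tiny stand-ins for the remain, validation and test frames (illustrative values only)
remain = pd.DataFrame({"aroma": [8, 7], "wine": [0.25, 0.40]})
val = pd.DataFrame({"aroma": [8], "wine": [0.0]})
test = pd.DataFrame({"wine": [0.0], "aroma": [8]})  # same columns, different order

print(same_columns(remain, val))        # True
print(same_columns(remain, val, test))  # False, because the order differs
```

With the notebook's own frames, the call would presumably look like `same_columns(Xremain_df_tfidf, Xval_df_tfidf, Xtest_df_tfidf)`.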


Importing the y values:

In [22]:
y_remain_df = pd.read_csv('y_remain_df.csv')
y_remain_df.head()

Unnamed: 0,overall_score
0,89
1,88
2,93
3,95
4,93


In [23]:
#formatting y to work for modeling later
#note lengths match corresponding X dataframe
y_remain = y_remain_df['overall_score']
y_remain

0       89
1       88
2       93
3       95
4       93
        ..
4189    92
4190    94
4191    92
4192    92
4193    94
Name: overall_score, Length: 4194, dtype: int64

In [24]:
y_val_df = pd.read_csv('y_val_df.csv')
y_val_df.head()

Unnamed: 0,overall_score
0,92
1,90
2,92
3,91
4,84


In [25]:
#formatting y to work for modeling later
#note lengths match corresponding X dataframe
y_val = y_val_df['overall_score']
y_val

0       92
1       90
2       92
3       91
4       84
        ..
1044    92
1045    89
1046    92
1047    92
1048    87
Name: overall_score, Length: 1049, dtype: int64

In [26]:
y_test_df = pd.read_csv('y_test_df.csv')
y_test_df.head()

Unnamed: 0,overall_score
0,92
1,87
2,87
3,90
4,85


In [27]:
#formatting y to work for modeling later
#note lengths match corresponding X dataframe
y_test = y_test_df['overall_score']
y_test

0       92
1       87
2       87
3       90
4       85
        ..
1306    95
1307    90
1308    91
1309    94
1310    93
Name: overall_score, Length: 1311, dtype: int64

In [28]:
#verifying no null values introduced during vectorizing process
Xremain_df_tfidf.isnull().sum().sum()

0

In [29]:
Xval_df_tfidf.isnull().sum().sum()

0

In [30]:
Xtest_df_tfidf.isnull().sum().sum()

0

In [31]:
y_remain.isnull().sum().sum()

0

In [32]:
y_val.isnull().sum().sum()

0

In [33]:
y_test.isnull().sum().sum()

0

Dropping the remaining text columns `coffee_name` and `roaster_name`:

In [34]:
Xremain_df_tfidf.drop(['coffee_name', 'roaster_name'], axis=1, inplace=True)

In [35]:
Xremain_df_tfidf.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,white,wild,willem,wine,winy,wisteria,wood,woody,zest,zesty
0,11,1999,54,71,8,8,7,8,8,40.015416,...,0.0,0.314372,0.0,0.24877,0.0,0.0,0.0,0.0,0.0,0.0
1,6,1999,33,35,7,8,7,8,8,47.603832,...,0.0,0.0,0.0,0.403825,0.0,0.0,0.0,0.0,0.0,0.0
2,2,2007,56,68,8,7,8,7,7,42.485093,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9,2017,54,78,10,9,9,9,8,38.029306,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3,1997,46,49,8,8,8,8,8,41.279541,...,0.0,0.0,0.0,0.0,0.451704,0.0,0.0,0.0,0.0,0.0


In [36]:
Xval_df_tfidf.drop(['coffee_name', 'roaster_name'], axis=1, inplace=True)

In [37]:
Xval_df_tfidf.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,white,wild,willem,wine,winy,wisteria,wood,woody,zest,zesty
0,8,2011,43,49,8,7,8,9,10,44.9773,...,0.0,0.0,0.0,0.0,0.0,0.0,0.186518,0.0,0.0,0.0
1,12,2004,61,68,8,8,8,8,8,39.100105,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,2016,57,80,9,8,8,9,8,38.581061,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.147089,0.0
3,3,2016,60,72,9,8,8,8,8,32.71742,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,6,1999,35,42,7,6,6,7,7,34.053691,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
Xtest_df_tfidf.drop(['coffee_name', 'roaster_name'], axis=1, inplace=True)

In [39]:
Xtest_df_tfidf.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,white,wild,willem,wine,winy,wisteria,wood,woody,zest,zesty
0,4,2007,51,68,8,7,8,7,7,39.049011,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,9,2005,37,47,8,7,7,7,7,41.325913,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,2016,58,58,7,8,8,8,7,52.517037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116112,0.0
3,1,2006,46,57,8,8,7,8,8,45.503182,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,2012,25,28,6,7,8,7,7,47.603832,...,0.0,0.0,0.0,0.0,0.0,0.0,0.223662,0.0,0.0,0.0


## 3. Scaling the Data <a class="anchor" id="header3"></a>

While scaling isn't required for all the model types (like linear regression), we'll go ahead and scale now to simplify the process.

In [40]:
#fitting min max scaler on remain data
mm_scaler = MinMaxScaler()
mm_scaler.fit(Xremain_df_tfidf)

#transforming all the X data
X_mm_scaled_remain = mm_scaler.transform(Xremain_df_tfidf)
X_mm_scaled_val = mm_scaler.transform(Xval_df_tfidf)
X_mm_scaled_test = mm_scaler.transform(Xtest_df_tfidf)
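
One consequence of fitting the scaler on the remain data only is that validation or test values outside the training range land outside [0, 1] after scaling. The sketch below illustrates this with made-up single-feature data; the arrays are assumptions for illustration, not values from the coffee dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative single-feature data; the values are made up for this sketch
train = np.array([[20.0], [40.0], [60.0]])
unseen = np.array([[10.0], [50.0], [70.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=20 and max=60 from the training data only

train_scaled = scaler.transform(train)
unseen_scaled = scaler.transform(unseen)

print(train_scaled.ravel())   # values fall inside [0, 1]
print(unseen_scaled.ravel())  # values outside the training range fall outside [0, 1]
```

This is expected behavior rather than an error: the scaler applies the training minimum and maximum to every split, so no information from the validation or test data is used.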

In [41]:
X_mm_scaled_remain

array([[0.90909091, 0.08      , 0.56756757, ..., 0.        , 0.        ,
        0.        ],
       [0.45454545, 0.08      , 0.28378378, ..., 0.        , 0.        ,
        0.        ],
       [0.09090909, 0.4       , 0.59459459, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.09090909, 0.76      , 0.31081081, ..., 0.        , 0.        ,
        0.        ],
       [0.72727273, 0.92      , 0.54054054, ..., 0.        , 0.41449663,
        0.        ],
       [0.81818182, 0.92      , 0.58108108, ..., 0.        , 0.4008192 ,
        0.        ]])

In [42]:
X_mm_scaled_val

array([[0.63636364, 0.56      , 0.41891892, ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.28      , 0.66216216, ..., 0.        , 0.        ,
        0.        ],
       [0.18181818, 0.76      , 0.60810811, ..., 0.        , 0.35079339,
        0.        ],
       ...,
       [0.81818182, 0.68      , 0.71621622, ..., 0.        , 0.        ,
        0.        ],
       [0.36363636, 0.76      , 0.62162162, ..., 0.        , 0.        ,
        0.        ],
       [0.27272727, 0.36      , 0.62162162, ..., 0.        , 0.        ,
        0.        ]])

In [43]:
X_mm_scaled_test

array([[0.27272727, 0.4       , 0.52702703, ..., 0.        , 0.        ,
        0.        ],
       [0.72727273, 0.32      , 0.33783784, ..., 0.        , 0.        ,
        0.        ],
       [0.54545455, 0.76      , 0.62162162, ..., 0.        , 0.27691744,
        0.        ],
       ...,
       [0.81818182, 0.72      , 0.66216216, ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.76      , 0.41891892, ..., 0.        , 0.34329853,
        0.43415295],
       [0.72727273, 0.96      , 0.59459459, ..., 0.        , 0.37247568,
        0.        ]])

## 4. Principal Component Analysis <a class="anchor" id="header4"></a>

Given the number of dimensions in this dataset, reducing the dimensionality may be beneficial. To do so, Principal Component Analysis (PCA) will be applied. However, PCA costs some interpretability. Therefore, the models will be compared using both the PCA data and the non-simplified data. If performance is similar, the non-simplified data can be used in order to retain interpretability.

In [44]:
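#instantiate and fit the PCA model
my_PCA = PCA(n_components = 0.9) #retaining 90% of the variance
#note: the PCA here is fit on the validation split; fitting on the remain (training) split is the more usual choice
my_PCA.fit(X_mm_scaled_val)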

# transform data 
X_remain_PCA = my_PCA.transform(X_mm_scaled_remain)
X_val_PCA = my_PCA.transform(X_mm_scaled_val)
X_test_PCA = my_PCA.transform(X_mm_scaled_test)

In [45]:
print(f"Variance captured by PC1: {my_PCA.explained_variance_[0]: 0.3f}")
print(f"Variance captured by PC2: {my_PCA.explained_variance_[1]: 0.3f}")

print(f"Proportion of variance captured by PC1: {my_PCA.explained_variance_ratio_[0]: 0.3f}")
print(f"Proportion of variance captured by PC2: {my_PCA.explained_variance_ratio_[1]: 0.3f}")

Variance captured by PC1:  0.252
Variance captured by PC2:  0.167
Proportion of variance captured by PC1:  0.049
Proportion of variance captured by PC2:  0.033


In [46]:
print(f'Original: {Xremain_df_tfidf.shape}')
print(f'PCA Transformed: {X_remain_PCA.shape}')

Original: (4194, 630)
PCA Transformed: (4194, 280)


PCA is able to capture 90% of the variance while cutting the number of features from 630 to 280.
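
When `n_components` is a float between 0 and 1, scikit-learn's `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. PCA is commonly fit on the training split only and then applied to the other splits. The sketch below shows both points on synthetic data; the data, the split sizes, and the variable names are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 features driven by 5 latent factors plus small noise
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 20))
X_train, X_val = X[:200], X[200:]

pca = PCA(n_components=0.9)  # keep enough components to explain at least 90% of the variance
pca.fit(X_train)             # fit on the training split only
X_train_pca = pca.transform(X_train)
X_val_pca = pca.transform(X_val)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative[-1])
print(X_train_pca.shape, X_val_pca.shape)
```

The fitted attribute `n_components_` reports how many components were kept, which is the number that appears as the second dimension of the transformed arrays.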

## 5. Optimizing the Models <a class="anchor" id="header5"></a>

Each model below is fit on both the 'full' dataset and the PCA dataset. In all cases, the data is scaled. At the end, R2 results are compared across all models.

### 5.1 Linear Regression <a class="anchor" id="subheader51"></a>

**"Vanilla" Linear Regression** 

The first model is a basic, "vanilla" linear regression, the same model used in previous notebooks to check how the data transformations were performing.

For reference:
- The best validation data R2 from running Linear Regression on the text data alone was about 0.761.
- The best validation data R2 from running Linear Regression on the numeric data alone (meaning no vectorized text) was about 0.898.

**Linear Regression: Min Max, Full Dataset**
 
Below runs the model on the full, scaled dataset.

In [53]:
# 1. Instantiate the model
lr_model_full = LinearRegression()

# 2. Fit the model
lr_model_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_full training data is: {lr_model_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_full_val_r2 = lr_model_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_full validation data is: {lr_model_full_val_r2}')

The R2 score for lr_model_full training data is: 0.9408817380239718
The R2 score for lr_model_full validation data is: 0.9025842705674784
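
For scikit-learn regressors, `.score()` returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. A minimal sketch of that calculation by hand follows; the `r2_by_hand` helper and the numbers are made up for illustration and are not taken from the models above:

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up scores and predictions, for illustration only
y_true = [89, 88, 93, 95, 93]
y_pred = [90, 88, 92, 94, 93]

print(r2_by_hand(y_true, y_pred))
```

A model that always predicts the mean of `y_true` scores 0 under this definition, and a perfect model scores 1.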


**Linear Regression: Min Max, PCA Dataset**
 
Below runs the model on the simplified, scaled dataset.

In [57]:
# 1. Instantiate the model
lr_model_pca = LinearRegression()

# 2. Fit the model
lr_model_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_pca training data is: {lr_model_pca.score(X_remain_PCA, y_remain)}')

lr_model_pca_val_r2 = lr_model_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_pca validation data is: {lr_model_pca_val_r2}')

The R2 score for lr_model_pca training data is: 0.9201855216374213
The R2 score for lr_model_pca validation data is: 0.8959728426048407


Interestingly, the full dataset does slightly better than the PCA version (validation R2 of about 0.903 versus 0.896). This may be because PCA traded some information for fewer dimensions (i.e., fewer columns).

**Ridge Regression: Min Max, Full Dataset**

The pipeline below will look for the optimal Ridge Regression model using the full dataset:

In [58]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", Ridge()),
    ]
)

In [59]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [6.2], 
    "model__solver": ['sag'],
}

In [60]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)
#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 6.2, 'model__solver': 'sag'}
Best score: 0.9120997478973498


The cell below checks how these parameters do on the remain (training) and validation data:

In [92]:
#instantiate and fit best model
lr_model_ridge_full = Ridge(alpha=6.2, solver='sag')
lr_model_ridge_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_ridge_full training data is: {lr_model_ridge_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_ridge_full_val_r2 = lr_model_ridge_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_ridge_full validation data is: {lr_model_ridge_full_val_r2}')

The R2 score for lr_model_ridge_full training data is: 0.9359693918542785
The R2 score for lr_model_ridge_full validation data is: 0.9098745867302618


**Ridge Regression: Min Max, PCA Dataset**

The pipeline below will look for the optimal Ridge Regression model using the PCA dataset:

In [110]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", Ridge()),
    ]
)

In [111]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [3.5], 
    "model__solver": ['sag'],
}

In [112]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)
#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 3.5, 'model__solver': 'sag'}
Best score: 0.9068516513539085


Ridge regression on the PCA data does slightly less well in cross-validation than on the full dataset (about 0.907 versus 0.912). The best parameters are shown above; the cell below checks how they do on the remain and validation data.

In [113]:
#instantiate and fit best model
lr_model_ridge_pca = Ridge(alpha=3.5, solver='sag')
lr_model_ridge_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_ridge_pca training data is: {lr_model_ridge_pca.score(X_remain_PCA, y_remain)}')

lr_model_ridge_pca_val_r2 = lr_model_ridge_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_ridge_pca validation data is: {lr_model_ridge_pca_val_r2}')

The R2 score for lr_model_ridge_pca training data is: 0.9194928820732233
The R2 score for lr_model_ridge_pca validation data is: 0.9011979701874846


**Lasso Regression: Min Max, Full Dataset**

The pipeline below will look for the optimal Lasso Regression model using the full dataset:

In [96]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", Lasso()),
    ]
)

In [97]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.002], 
    "model__positive": [False],
    "model__warm_start": [True]
}

In [98]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.002, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9131574911589692


The parameters of the optimal Lasso Regression model are shown above. The cell below checks how the model does on the training and validation data using these optimized parameters:

In [99]:
#instantiate and fit best model
lr_model_lasso_full = Lasso(alpha=0.002, positive=False, warm_start=True)
lr_model_lasso_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_lasso_full training data is: {lr_model_lasso_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_lasso_full_val_r2 = lr_model_lasso_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_lasso_full validation data is: {lr_model_lasso_full_val_r2}')

The R2 score for lr_model_lasso_full training data is: 0.9270599542538364
The R2 score for lr_model_lasso_full validation data is: 0.907597542099871
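
One property of Lasso that matters for interpretability is that its L1 penalty can set coefficients exactly to zero, effectively dropping features. The sketch below counts zeroed coefficients on synthetic data; the data and the `alpha` value are assumptions chosen for illustration, not the notebook's settings:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Synthetic data: only the first 3 of 30 features influence the target
X = rng.normal(size=(300, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

lasso = Lasso(alpha=0.1)  # alpha chosen for illustration, not the notebook's 0.002
lasso.fit(X, y)

# Count coefficients the L1 penalty shrank all the way to zero
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {lasso.coef_.size} coefficients are exactly zero")
```

The same count on the fitted model above would presumably be `np.sum(lr_model_lasso_full.coef_ == 0)`, which would show how many of the 630 features the model retained.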


**Lasso Regression: Min Max, PCA Dataset**

The pipeline below will look for the optimal Lasso Regression model using the PCA dataset:

In [100]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", Lasso()),
    ]
)

In [105]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.002], 
    "model__positive": [False],
    "model__warm_start": [True]
}

In [106]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.002, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9045196074779615


In [108]:
#instantiate and fit best model
lr_model_lasso_pca = Lasso(alpha=0.002, positive=False, warm_start=True)
lr_model_lasso_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_lasso_pca training data is: {lr_model_lasso_pca.score(X_remain_PCA, y_remain)}')

lr_model_lasso_pca_val_r2 = lr_model_lasso_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_lasso_pca validation data is: {lr_model_lasso_pca_val_r2}')

The R2 score for lr_model_lasso_pca training data is: 0.9142017685228763
The R2 score for lr_model_lasso_pca validation data is: 0.9033447450950415


**ElasticNet Regression: Min Max, Full Dataset**

The pipeline below will look for the optimal ElasticNet Regression model on the full dataset:

In [114]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", ElasticNet()),
    ]
)

In [115]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.002], 
    "model__l1_ratio": [0.4],
    "model__warm_start": [True],
    "model__positive": [False]
}

In [116]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.002, 'model__l1_ratio': 0.4, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9136891720109446


The cell below checks how the optimized model does on the training and validation data:

In [128]:
#instantiate and fit best model
lr_model_elastic_full = ElasticNet(alpha=0.002, l1_ratio=0.4, positive=False, warm_start=True)
lr_model_elastic_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_elastic_full training data is: {lr_model_elastic_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_elastic_full_val_r2 = lr_model_elastic_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_elastic_full validation data is: {lr_model_elastic_full_val_r2}')

The R2 score for lr_model_elastic_full training data is: 0.9319226114642285
The R2 score for lr_model_elastic_full validation data is: 0.9101990891299294
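
For reference, scikit-learn's `ElasticNet` minimizes the objective below, where $n$ is the number of samples, $\alpha$ is `alpha`, and $\rho$ is `l1_ratio` (the symbol names are chosen here for readability). The comments work out the effective penalty weights for the values selected above:

```latex
\min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2

% With \alpha = 0.002 and \rho = 0.4:
% L1 weight: \alpha\rho = 0.002 \times 0.4 = 0.0008
% L2 weight: \alpha(1-\rho)/2 = 0.002 \times 0.6 / 2 = 0.0006
```

Setting $\rho = 1$ removes the L2 term and recovers the Lasso penalty, while values closer to 0 move the penalty toward a Ridge-style L2 term.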


**ElasticNet Regression: Min Max, PCA Dataset**

The pipeline below will look for the optimal ElasticNet Regression model on the PCA dataset:

In [None]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", ElasticNet()),
    ]
)

In [126]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.001], 
    "model__l1_ratio": [0.2],
    "model__warm_start": [True],
    "model__positive": [False]
}

In [127]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.001, 'model__l1_ratio': 0.2, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9069016414322609


In [129]:
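#instantiate and fit best model
#note: alpha=0.002 and l1_ratio=0.4 match the full-dataset model; the PCA grid search above selected alpha=0.001 and l1_ratio=0.2
lr_model_elastic_pca = ElasticNet(alpha=0.002, l1_ratio=0.4, positive=False, warm_start=True)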
lr_model_elastic_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_elastic_pca training data is: {lr_model_elastic_pca.score(X_remain_PCA, y_remain)}')

lr_model_elastic_pca_val_r2 = lr_model_elastic_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_elastic_pca validation data is: {lr_model_elastic_pca_val_r2}')

The R2 score for lr_model_elastic_pca training data is: 0.9165220287609357
The R2 score for lr_model_elastic_pca validation data is: 0.9041516740152954


The cells below compare the R2 values calculated on the validation data to see how these different regression models perform:

In [151]:
#Comparing the models
R2_dictionary = {'Linear Full R2': lr_model_full_val_r2, 'Linear PCA R2' :lr_model_pca_val_r2, 'Ridge Full R2': lr_model_ridge_full_val_r2, 'Ridge PCA R2': lr_model_ridge_pca_val_r2, 'Lasso Full R2': lr_model_lasso_full_val_r2, 'Lasso PCA R2': lr_model_lasso_pca_val_r2, 'Elastic Full R2': lr_model_elastic_full_val_r2, 'Elastic PCA R2': lr_model_elastic_pca_val_r2} 

In [152]:
#sorting scores
R2_values_sorted = dict(sorted(R2_dictionary.items(), key = operator.itemgetter(1), reverse=True))

In [153]:
R2_values_sorted

{'Elastic Full R2': 0.9101990891299294,
 'Ridge Full R2': 0.9098745867302618,
 'Lasso Full R2': 0.907597542099871,
 'Elastic PCA R2': 0.9041516740152954,
 'Lasso PCA R2': 0.9033447450950415,
 'Linear Full R2': 0.9025842705674784,
 'Ridge PCA R2': 0.9011979701874846,
 'Linear PCA R2': 0.8959728426048407}

**Comparing model performance**

Of the regression models, ElasticNet on the full dataset performed best on the validation data. In general, the full dataset gave better results than PCA, which is convenient because the full dataset preserves interpretability.

All things considered, the models perform similarly: validation R2 ranges from about 0.896 to 0.910.

The next section looks at another model type before digging further into the models and drawing conclusions about which performs best.

### 5.2 XGBoost Regressor <a class="anchor" id="subheader52"></a>

For reference, the best validation R2 from running the baseline XGBoost Regressor model on numeric data alone (no vectorized text) was about 0.918. XGBoost was not run on the text-only data.

**XGBoost Regressor: Min Max, Full Dataset**

In [137]:
# 1. Instantiate the model
XGBR_model = XGBRegressor()

# 2. Fit the model
XGBR_model.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for XGBR_model training data is: {XGBR_model.score(X_mm_scaled_remain, y_remain)}')

XGBR_model_val_r2 = XGBR_model.score(X_mm_scaled_val, y_val)
print(f'The R2 score for XGBR_model validation data is: {XGBR_model_val_r2}')

The R2 score for XGBR_model training data is: 0.9926375397608137
The R2 score for XGBR_model validation data is: 0.9022804820873469


Optimizing the XGBoost model using the full dataset:

In [145]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", XGBRegressor()),
    ]
)

In [146]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__booster": ['dart'], 
    "model__eta": [0.1],
    "model__gamma": [1],
    "model__max_depth": [11]
}

In [147]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__booster': 'dart', 'model__eta': 0.1, 'model__gamma': 1, 'model__max_depth': 11}
Best score: 0.9196504587392071


In [163]:
#instantiate and fit best model
XGBR_model_full = XGBRegressor(booster='dart', eta=0.1, gamma=1, max_depth=11)
XGBR_model_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for XGBR_model_full training data is: {XGBR_model_full.score(X_mm_scaled_remain, y_remain)}')

XGBR_model_full_val_r2 = XGBR_model_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for XGBR_model_full validation data is: {XGBR_model_full_val_r2}')

The R2 score for XGBR_model_full training data is: 0.9921717669177883
The R2 score for XGBR_model_full validation data is: 0.9157344141540313


The optimized XGBoost model does slightly better than the linear models on the validation data (R2 of about 0.916), though it is quite overfit: its training R2 is about 0.992.
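
One simple way to put a number on overfitting is the gap between training and validation R2. The sketch below computes that gap from scores copied from the outputs above and rounded to four decimals; the dictionary layout and labels are just one convenient arrangement, not part of the original notebook:

```python
# R2 scores (training, validation) copied from the outputs above, rounded to four decimals
scores = {
    "Linear Full": (0.9409, 0.9026),
    "ElasticNet Full": (0.9319, 0.9102),
    "XGBoost Full (tuned)": (0.9922, 0.9157),
}

# Gap between training and validation R2; a larger gap suggests more overfitting
gaps = {name: round(train - val, 4) for name, (train, val) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: train - validation R2 = {gap:.4f}")
```

By this measure the tuned XGBoost model has the largest gap of the three, while ElasticNet has the smallest.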

Fitting the XGBoost Regressor model on the PCA data:

In [148]:
#creating our pipeline
#a single-step pipeline, so grid-search parameters are addressed as model__<parameter>
linreg = Pipeline(
    [
        ("model", XGBRegressor()),
    ]
)

In [158]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__booster": ['dart'], 
    "model__eta": [0.1],
    "model__gamma": [1],
    "model__max_depth": [5]
}

In [159]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__booster': 'dart', 'model__eta': 0.1, 'model__gamma': 1, 'model__max_depth': 5}
Best score: 0.8519782429252418


In [161]:
#instantiate and fit best model
XGBR_model_pca = XGBRegressor(booster='dart', eta=0.1, gamma=1, max_depth=5)
XGBR_model_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for XGBR_model_pca training data is: {XGBR_model_pca.score(X_remain_PCA, y_remain)}')

XGBR_model_pca_val_r2 = XGBR_model_pca.score(X_val_PCA, y_val)
print(f'The R2 score for XGBR_model_pca validation data is: {XGBR_model_pca_val_r2}')

The R2 score for XGBR_model_pca training data is: 0.9702665775242029
The R2 score for XGBR_model_pca validation data is: 0.8466916761279962


## 6. Comparing All Models <a class="anchor" id="header6"></a>



In [164]:
#Comparing the models
R2_dictionary = {'Linear Full R2': lr_model_full_val_r2, 'Linear PCA R2' :lr_model_pca_val_r2, 'Ridge Full R2': lr_model_ridge_full_val_r2, 'Ridge PCA R2': lr_model_ridge_pca_val_r2, 'Lasso Full R2': lr_model_lasso_full_val_r2, 'Lasso PCA R2': lr_model_lasso_pca_val_r2, 'Elastic Full R2': lr_model_elastic_full_val_r2, 'Elastic PCA R2': lr_model_elastic_pca_val_r2, 'XGBR Full R2': XGBR_model_full_val_r2, 'XGBR PCA R2': XGBR_model_pca_val_r2} 

Below compares all the models run, listing the one with the highest R2 first:

In [165]:
#sorting scores
R2_values_sorted = dict(sorted(R2_dictionary.items(), key = operator.itemgetter(1), reverse=True))
R2_values_sorted

{'XGBR Full R2': 0.9157344141540313,
 'Elastic Full R2': 0.9101990891299294,
 'Ridge Full R2': 0.9098745867302618,
 'Lasso Full R2': 0.907597542099871,
 'Elastic PCA R2': 0.9041516740152954,
 'Lasso PCA R2': 0.9033447450950415,
 'Linear Full R2': 0.9025842705674784,
 'Ridge PCA R2': 0.9011979701874846,
 'Linear PCA R2': 0.8959728426048407,
 'XGBR PCA R2': 0.8466916761279962}
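
The sorted dictionary can also be viewed as a small table that shows how far each model trails the best one. The sketch below does this with pandas, using the validation R2 values copied from the output above; the column names `val_r2` and `gap_to_best` are illustrative choices:

```python
import pandas as pd

# Validation R2 values copied from the sorted output above
val_r2 = pd.Series({
    "XGBR Full": 0.9157344141540313,
    "Elastic Full": 0.9101990891299294,
    "Ridge Full": 0.9098745867302618,
    "Lasso Full": 0.907597542099871,
    "Elastic PCA": 0.9041516740152954,
    "Lasso PCA": 0.9033447450950415,
    "Linear Full": 0.9025842705674784,
    "Ridge PCA": 0.9011979701874846,
    "Linear PCA": 0.8959728426048407,
    "XGBR PCA": 0.8466916761279962,
})

ranked = val_r2.sort_values(ascending=False)

# Rounded scores next to each model's distance from the top score
summary = pd.DataFrame({
    "val_r2": ranked.round(4),
    "gap_to_best": (ranked.iloc[0] - ranked).round(4),
})
print(summary)
```

Laid out this way, it is easier to see that the top few models are separated by well under 0.01 in R2, while XGBoost on the PCA data is a clear outlier.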

XGBoost Regressor on the full dataset performs best, followed by ElasticNet on the full dataset. Overall, the PCA versions perform worse. XGBoost Regressor in particular takes a performance hit when using the PCA data.

However, because the XGBoost and ElasticNet models are so close in performance, the notebook will proceed by digging further into the ElasticNet model.

## 7. Further Model Interpretation <a class="anchor" id="header7"></a>