# Capstone Project Part 7: Optimizing Models

**Author:** Kate Meredith 

**Date:** September-November 2022

**Notebook #**: 7 of

## Background

**Source:** Data was collected from [CoffeeReview.com](https://www.coffeereview.com/). See prior notebooks for details on scraping, cleaning, compilation and text transformation. 

**Goal:** Optimize models on full data set (both text transformed and original numerical values).

## References

- Documentation on [Ridge Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
- Documentation on [Lasso Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- Documentation on [ElasticNet Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)
- [Differences](https://towardsdatascience.com/whats-the-difference-between-linear-regression-lasso-ridge-and-elasticnet-8f997c60cf29) between linear, ridge, lasso and elastic net regression
- How to [add headers back after scaling data](https://stackoverflow.com/questions/29586323/how-to-retain-column-headers-of-data-frame-after-pre-processing-in-scikit-learn)
- Getting model [results as a dataframe](https://stackoverflow.com/questions/51734180/converting-statsmodels-summary-object-to-pandas-dataframe)
- Using first row as [column headers](https://stackoverflow.com/questions/31328861/python-pandas-replacing-header-with-top-row)
- Filtering sorted array to [get top N values](https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array)

## Table of Contents

* [1. Importing Libraries](#header1)
* [2. Importing the Data and EDA](#header2)
* [3. Scaling the Data](#header3)
* [4. Principal Component Analysis](#header4)
* [5. Fitting the Models](#header5)
    * [5.1 Linear Regression](#subheader51)
    * [5.2 XGBoost Regressor](#subheader52)
* [6. Comparing all Models](#header6)
* [7. Further Model Interpretation](#header7)
    * [7.1 Using Stats Models for Additional Insights](#subheader71)
    * [7.2 SKLearn Elastic Net Model Interpretation](#subheader72)
* [8. Conclusion](#header8)

## 1. Importing Libraries  <a class="anchor" id="header1"></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import operator
import statsmodels.api as sm

from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from tempfile import mkdtemp
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.neural_network import MLPRegressor
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV



## 2. Importing the Data and EDA <a class="anchor" id="header2"></a>

In the last notebook, the text was transformed using several different methods. Given time limitations, this notebook moves forward with the transformation method that worked best: TF-IDF vectorization. The data has already been split into remain (training), validation, and test sets, which are imported here.

Importing the remain (training) data and exploring some initial info about it:

In [2]:
Xremain_df_tfidf = pd.read_csv('tfidf_Xremain_combo_df.csv')
Xremain_df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4194 entries, 0 to 4193
Columns: 632 entries, coffee_name to zesty
dtypes: float64(621), int64(9), object(2)
memory usage: 20.2+ MB


In [3]:
Xremain_df_tfidf.shape

(4194, 632)

The remain data set has 4,194 rows and 632 columns.

In [4]:
Xremain_df_tfidf.head()

Unnamed: 0,coffee_name,roaster_name,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,Honeymoon Espresso,Choosy Gourmet,3,2020,44,52,9,8,8,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Medium-Dark Roast,Kona Love Coffee Co.,2,2022,46,64,8,8,9,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.215396,0.0,0.0,0.0
2,Kathakwa Kenya,Caribou Coffee,8,2014,52,66,9,8,9,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Costa Rica Geisha Honey COE 2018 2nd Place,Dragonfly Coffee Roasters,11,2018,50,76,9,9,9,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Taitung Guanshan Lot 12 (Single-Serve Capsule),Sancoffee,2,2015,86,86,8,7,8,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Most of the columns come from the vectorized text.

In [5]:
#importing the validation data and checking out the info
Xval_df_tfidf = pd.read_csv('tfidf_Xval_combo_df.csv')
Xval_df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1049 entries, 0 to 1048
Columns: 632 entries, coffee_name to zesty
dtypes: float64(619), int64(11), object(2)
memory usage: 5.1+ MB


In [6]:
Xval_df_tfidf.shape

(1049, 632)

The validation data set has 1,049 rows and 632 columns.

In [7]:
Xval_df_tfidf.head()

Unnamed: 0,coffee_name,roaster_name,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,Ardent Ethiopia Natural,JBC Coffee Roasters,11,2020,57,77,10,9,9,10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Mocha Java, Harrar Style",Susan's Coffee and Tea,1,1998,43,49,5,4,5,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Sulawesi Toraja Peaberry,P.T. Mega Agmist Indonesia,6,2012,48,73,8,9,9,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Worka Ethiopia,JBC Coffee Roasters,8,2016,58,78,9,9,9,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ethiopia Oromia,Caffe Luxxe,8,2017,49,72,9,8,9,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.19191,0.0,0.0,0.0


In [8]:
#importing the test data and checking out the info
Xtest_df_tfidf = pd.read_csv('tfidf_Xtest_combo_df.csv')
Xtest_df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1311 entries, 0 to 1310
Columns: 632 entries, coffee_name to zesty
dtypes: float64(620), int64(10), object(2)
memory usage: 6.3+ MB


In [9]:
Xtest_df_tfidf.shape

(1311, 632)

The test dataframe has 1,311 rows and 632 columns.

In [10]:
Xtest_df_tfidf.head()

Unnamed: 0,coffee_name,roaster_name,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,PBS Blend K-Cup,Green Mountain Coffee,4,2007,50,50,7,7,7,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Washed Yirgacheffe,Bird Rock Coffee Roasters,7,2014,54,76,9,9,8,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Rwanda Coopac Cooperative,Wonderstate Coffee,6,2009,53,68,8,8,8,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.204984,0.0,0.0,0.0
3,India Badnekhan Estate,Devon Plantations,8,2002,47,45,9,8,7,8,...,0.0,0.0,0.0,0.275726,0.0,0.0,0.0,0.0,0.0,0.0
4,India Monsooned Malabar,Mayorga Coffee Roasters,7,2004,35,47,7,7,8,7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Importing the y values:

In [11]:
y_remain_df = pd.read_csv('y_remain_df.csv')
y_remain_df.head()

Unnamed: 0,overall_score
0,92
1,91
2,93
3,94
4,88


In [12]:
#formatting y to work for modeling later
#note lengths match corresponding X dataframe
y_remain = y_remain_df['overall_score']
y_remain

0       92
1       91
2       93
3       94
4       88
        ..
4189    88
4190    94
4191    92
4192    93
4193    92
Name: overall_score, Length: 4194, dtype: int64

In [13]:
y_val_df = pd.read_csv('y_val_df.csv')
y_val_df.head()

Unnamed: 0,overall_score
0,97
1,79
2,93
3,95
4,93


In [14]:
#formatting y to work for modeling later
#note lengths match corresponding X dataframe
y_val = y_val_df['overall_score']
y_val

0       97
1       79
2       93
3       95
4       93
        ..
1044    94
1045    94
1046    87
1047    94
1048    90
Name: overall_score, Length: 1049, dtype: int64

In [15]:
y_test_df = pd.read_csv('y_test_df.csv')
y_test_df.head()

Unnamed: 0,overall_score
0,85
1,94
2,90
3,92
4,83


In [16]:
#formatting y to work for modeling later
#note lengths match corresponding X dataframe
y_test = y_test_df['overall_score']
y_test

0       85
1       94
2       90
3       92
4       83
        ..
1306    95
1307    91
1308    93
1309    89
1310    93
Name: overall_score, Length: 1311, dtype: int64

In [17]:
#verifying no null values introduced during vectorizing process
Xremain_df_tfidf.isnull().sum().sum()

0

In [18]:
Xval_df_tfidf.isnull().sum().sum()

0

In [19]:
Xtest_df_tfidf.isnull().sum().sum()

0

In [20]:
y_remain.isnull().sum().sum()

0

In [21]:
y_val.isnull().sum().sum()

0

In [22]:
y_test.isnull().sum().sum()

0

Dropping the remaining text columns `coffee_name` and `roaster_name`:

In [23]:
Xremain_df_tfidf.drop(['coffee_name', 'roaster_name'], axis=1, inplace=True)

In [24]:
Xremain_df_tfidf.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,3,2020,44,52,9,8,8,9,8,22.620335,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,2022,46,64,8,8,9,9,7,19.623815,...,0.0,0.0,0.0,0.0,0.0,0.0,0.215396,0.0,0.0,0.0
2,8,2014,52,66,9,8,9,9,8,44.9773,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,11,2018,50,76,9,9,9,9,8,40.015416,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,2015,86,86,8,7,8,8,7,25.045275,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
Xval_df_tfidf.drop(['coffee_name', 'roaster_name'], axis=1, inplace=True)

In [26]:
Xval_df_tfidf.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,11,2020,57,77,10,9,9,10,9,43.074761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1998,43,49,5,4,5,5,5,41.083064,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,6,2012,48,73,8,9,9,9,8,23.128402,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,8,2016,58,78,9,9,9,9,9,43.074761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8,2017,49,72,9,8,9,9,8,34.01947,...,0.0,0.0,0.0,0.0,0.0,0.0,0.19191,0.0,0.0,0.0


In [27]:
Xtest_df_tfidf.drop(['coffee_name', 'roaster_name'], axis=1, inplace=True)

In [28]:
Xtest_df_tfidf.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,4,2007,50,50,7,7,7,7,7,44.337125,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,2014,54,76,9,9,8,9,9,32.840162,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,6,2009,53,68,8,8,8,8,8,43.556917,...,0.0,0.0,0.0,0.0,0.0,0.0,0.204984,0.0,0.0,0.0
3,8,2002,47,45,9,8,7,8,8,12.976794,...,0.0,0.0,0.0,0.275726,0.0,0.0,0.0,0.0,0.0,0.0
4,7,2004,35,47,7,7,8,7,7,39.081798,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
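
Because the scaler and models that follow rely on all three splits having identical columns in the same order, a quick alignment check can be useful at this point. The sketch below uses small toy dataframes with made-up values, and `same_columns` is a hypothetical helper, not something defined in this notebook.

```python
import pandas as pd

# Toy stand-ins for the three feature splits (hypothetical values)
X_remain = pd.DataFrame({"month": [3, 2], "year": [2020, 2022], "wood": [0.0, 0.2]})
X_val = pd.DataFrame({"month": [11], "year": [2020], "wood": [0.0]})
X_test = pd.DataFrame({"month": [4], "year": [2007], "wood": [0.0]})


def same_columns(reference, *others):
    """Return True if every frame has the same columns, in the same order, as reference."""
    return all(list(reference.columns) == list(df.columns) for df in others)


columns_match = same_columns(X_remain, X_val, X_test)
print(columns_match)
```

The same call could be applied to `Xremain_df_tfidf`, `Xval_df_tfidf` and `Xtest_df_tfidf` before scaling.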


## 3. Scaling the Data <a class="anchor" id="header3"></a>

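Scaling isn't required for every model type (plain linear regression, for example, gives the same fit without it), but the regularized models below are sensitive to feature scale, so all features are scaled once here to simplify the process.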

In [29]:
#fitting min max scaler on remain data
mm_scaler = MinMaxScaler()
mm_scaler.fit(Xremain_df_tfidf)

#transforming all the X data
X_mm_scaled_remain = mm_scaler.transform(Xremain_df_tfidf)
X_mm_scaled_val = mm_scaler.transform(Xval_df_tfidf)
X_mm_scaled_test = mm_scaler.transform(Xtest_df_tfidf)
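
As a rough illustration of what min-max scaling does, the sketch below scales a single hypothetical `year` column. The 1997 and 2022 endpoints are assumptions chosen for the example, not values read from the dataset. It also shows that validation values outside the training range can land outside [0, 1], because the scaler only learns its minimum and maximum from the data it was fit on.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training values for a single 'year' column;
# the 1997 and 2022 endpoints are assumed, not taken from the dataset.
train_year = np.array([[1997.0], [2014.0], [2020.0], [2022.0]])
val_year = np.array([[1998.0], [2023.0]])  # 2023 lies outside the training range

scaler = MinMaxScaler()
scaler.fit(train_year)  # learns min and max from the training data only

scaled_train = scaler.transform(train_year)
scaled_val = scaler.transform(val_year)

# Manual version of the same formula: (x - min) / (max - min)
manual_2020 = (2020.0 - 1997.0) / (2022.0 - 1997.0)
print(scaled_train.ravel(), scaled_val.ravel(), manual_2020)
```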

In [30]:
#preview scaled data
X_mm_scaled_remain

array([[0.18181818, 0.92      , 0.44      , ..., 0.        , 0.        ,
        0.        ],
       [0.09090909, 1.        , 0.46666667, ..., 0.        , 0.        ,
        0.        ],
       [0.63636364, 0.68      , 0.54666667, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [1.        , 0.92      , 0.44      , ..., 0.        , 0.44691187,
        0.        ],
       [0.        , 0.84      , 0.65333333, ..., 0.        , 0.        ,
        0.        ],
       [0.54545455, 0.76      , 0.6       , ..., 0.        , 0.        ,
        0.        ]])

In [31]:
#add headers back to our scaled data for interpretability later 
X_mm_scaled_remain = pd.DataFrame(X_mm_scaled_remain, columns = Xremain_df_tfidf.columns)

#verify headers added back
X_mm_scaled_remain.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,0.181818,0.92,0.44,0.45977,0.875,0.777778,0.666667,0.888889,0.75,0.588732,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.090909,1.0,0.466667,0.597701,0.75,0.777778,0.833333,0.888889,0.625,0.559541,...,0.0,0.0,0.0,0.0,0.0,0.0,0.377216,0.0,0.0,0.0
2,0.636364,0.68,0.546667,0.62069,0.875,0.777778,0.833333,0.888889,0.75,0.806526,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.909091,0.84,0.52,0.735632,0.875,0.888889,0.833333,0.888889,0.75,0.758189,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.090909,0.72,1.0,0.850575,0.75,0.666667,0.666667,0.777778,0.625,0.612355,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
X_mm_scaled_val

array([[0.90909091, 0.92      , 0.61333333, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.04      , 0.42666667, ..., 0.        , 0.        ,
        0.        ],
       [0.45454545, 0.6       , 0.49333333, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.72727273, 0.32      , 0.34666667, ..., 0.        , 0.        ,
        0.        ],
       [0.63636364, 0.44      , 0.49333333, ..., 0.        , 0.        ,
        0.        ],
       [0.54545455, 0.6       , 0.50666667, ..., 0.        , 0.        ,
        0.        ]])

In [33]:
#add headers back to our scaled data for interpretability later 
X_mm_scaled_val = pd.DataFrame(X_mm_scaled_val, columns = Xval_df_tfidf.columns)
X_mm_scaled_val.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,0.909091,0.92,0.613333,0.747126,1.0,0.888889,0.833333,1.0,0.875,0.787992,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.04,0.426667,0.425287,0.375,0.333333,0.166667,0.444444,0.375,0.768589,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.454545,0.6,0.493333,0.701149,0.75,0.888889,0.833333,0.888889,0.75,0.593681,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.636364,0.76,0.626667,0.758621,0.875,0.888889,0.833333,0.888889,0.875,0.787992,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.636364,0.8,0.506667,0.689655,0.875,0.777778,0.833333,0.888889,0.75,0.699778,...,0.0,0.0,0.0,0.0,0.0,0.0,0.336086,0.0,0.0,0.0


In [34]:
X_mm_scaled_test

array([[0.27272727, 0.4       , 0.52      , ..., 0.        , 0.        ,
        0.        ],
       [0.54545455, 0.68      , 0.57333333, ..., 0.        , 0.        ,
        0.        ],
       [0.45454545, 0.48      , 0.56      , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.54545455, 0.84      , 0.57333333, ..., 0.        , 0.        ,
        0.        ],
       [0.36363636, 0.52      , 0.37333333, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.8       , 0.62666667, ..., 0.        , 0.        ,
        0.        ]])

In [35]:
#add headers back to our scaled data for interpretability later 
X_mm_scaled_test = pd.DataFrame(X_mm_scaled_test, columns = Xtest_df_tfidf.columns)
X_mm_scaled_test.head()

Unnamed: 0,month,year,bean_agtron,ground_agtron,aroma,acidity,body,flavor,aftertaste,roaster_lat,...,wild,willem,wine,winey,winy,wisteria,wood,woody,zest,zesty
0,0.272727,0.4,0.52,0.436782,0.625,0.666667,0.5,0.666667,0.625,0.800289,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.545455,0.68,0.573333,0.735632,0.875,0.888889,0.666667,0.888889,0.875,0.68829,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.454545,0.48,0.56,0.643678,0.75,0.777778,0.666667,0.777778,0.75,0.792689,...,0.0,0.0,0.0,0.0,0.0,0.0,0.358982,0.0,0.0,0.0
3,0.636364,0.2,0.48,0.37931,0.875,0.777778,0.5,0.777778,0.75,0.494788,...,0.0,0.0,0.0,0.554661,0.0,0.0,0.0,0.0,0.0,0.0
4,0.545455,0.28,0.32,0.402299,0.625,0.666667,0.666667,0.666667,0.625,0.749094,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 4. Principal Component Analysis <a class="anchor" id="header4"></a>

Given the number of dimensions in this dataset, reducing them may be beneficial. To do so, Principal Component Analysis (PCA) will be applied. However, we do lose interpretability with PCA. Therefore, the models will be compared using both the PCA-transformed and the non-simplified data. If performance is similar, the non-simplified data can be used in order to retain interpretability.

In [36]:
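#instantiate and fit the PCA model
my_PCA = PCA(n_components = 0.9) #retaining 90% of the variance
#note: the PCA is fit on the validation data here, and the outputs below reflect that;
#fitting on the remain (training) data would be the usual way to avoid leakage
my_PCA.fit(X_mm_scaled_val)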

# transform data 
X_remain_PCA = my_PCA.transform(X_mm_scaled_remain)
X_val_PCA = my_PCA.transform(X_mm_scaled_val)
X_test_PCA = my_PCA.transform(X_mm_scaled_test)
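
For comparison, here is a minimal sketch of the more common pattern, in which the PCA is fit on the training rows only and then applied to the other splits. It runs on synthetic random data, so the number of components it keeps says nothing about the coffee dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins for scaled training and validation matrices
X_train = rng.random((200, 30))
X_val = rng.random((50, 30))

pca = PCA(n_components=0.9)  # keep enough components for 90% of the variance
pca.fit(X_train)             # fit on training rows only

X_train_pca = pca.transform(X_train)
X_val_pca = pca.transform(X_val)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative[-1])
```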

In [37]:
print(f"Variance captured by PC1: {my_PCA.explained_variance_[0]: 0.3f}")
print(f"Variance captured by PC2: {my_PCA.explained_variance_[1]: 0.3f}")

print(f"Proportion of variance captured by PC1: {my_PCA.explained_variance_ratio_[0]: 0.3f}")
print(f"Proportion of variance captured by PC2: {my_PCA.explained_variance_ratio_[1]: 0.3f}")

Variance captured by PC1:  0.253
Variance captured by PC2:  0.152
Proportion of variance captured by PC1:  0.049
Proportion of variance captured by PC2:  0.030


In [38]:
print(f'Original: {Xremain_df_tfidf.shape}')
print(f'PCA Transformed: {X_remain_PCA.shape}')

Original: (4194, 630)
PCA Transformed: (4194, 277)


PCA is able to capture 90% of the variance while decreasing the number of features significantly to 277 (down from 630).

## 5. Fitting the Models <a class="anchor" id="header5"></a>

This section fits each of the models on both the 'full' dataset and the PCA dataset. The data is scaled in every case. At the end, R2 results will be compared across all models.

### 5.1 Linear Regression <a class="anchor" id="subheader51"></a>

**"Vanilla" Linear Regression** 

This first uses a basic, "vanilla" linear regression model. This is what we used in previous notebooks to check in on how our data transformations were performing.

For reference:
- The best validation data R2 from running Linear Regression on the text data alone was about 0.761.
- The best validation data R2 from running Linear Regression on the numeric data alone (meaning no vectorized text) was about 0.898.
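
Since every comparison in this section relies on R2, it may help to recall how it is computed: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up scores and predictions, and checks the manual calculation against `r2_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true scores and predictions, for illustration only
y_true = np.array([92.0, 91.0, 93.0, 94.0, 88.0])
y_pred = np.array([91.5, 91.0, 93.5, 93.0, 89.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```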

**Linear Regression: Min Max, Full Dataset**
 
The cell below runs the model on the full, scaled dataset.

In [184]:
# 1. Instantiate the model
lr_model_full = LinearRegression()

# 2. Fit the model
lr_model_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_full training data is: {lr_model_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_full_val_r2 = lr_model_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_full validation data is: {lr_model_full_val_r2}')

The R2 score for lr_model_full training data is: 0.9408817380239718
The R2 score for lr_model_full validation data is: 0.9025842705674784


**Linear Regression: Min Max, PCA Dataset**
 
The cell below runs the model on the simplified, scaled dataset.

In [185]:
# 1. Instantiate the model
lr_model_pca = LinearRegression()

# 2. Fit the model
lr_model_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_pca training data is: {lr_model_pca.score(X_remain_PCA, y_remain)}')

lr_model_pca_val_r2 = lr_model_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_pca validation data is: {lr_model_pca_val_r2}')

The R2 score for lr_model_pca training data is: 0.9201855216374213
The R2 score for lr_model_pca validation data is: 0.8959728426048407


Interestingly, the full dataset does slightly better than the PCA version. This may be because PCA trades some information for fewer dimensions (i.e. fewer columns). 

**Ridge Regression: Min Max, Full Dataset**

The pipeline below will look for the optimal Ridge Regression model using the full dataset:

In [58]:
#creating our pipeline
#a single-step pipeline wrapping the model so GridSearchCV can tune it
linreg = Pipeline(
    [
        ("model", Ridge()),
    ]
)

In [59]:
#ran the grid search over a series of changes to these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [6.2], 
    "model__solver": ['sag'],
}

In [60]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)
#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 6.2, 'model__solver': 'sag'}
Best score: 0.9120997478973498
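
The narrowing-down process described in the comments can be sketched with a coarser grid. The example below is an illustration on synthetic data, and the alpha values are hypothetical rather than the ones actually searched. Note that `best_score_` is the mean cross-validated score across folds (R2 for a regressor, with 5 folds by default), which is why it differs from the training and validation R2 scores reported elsewhere.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the scaled features
X = rng.random((120, 10))
y = X @ rng.random(10) + rng.normal(scale=0.1, size=120)

pipe = Pipeline([("model", Ridge())])

# Hypothetical coarse grid; the values are illustrative, not the ones actually searched
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean R2 across the 5 validation folds
```

A second, finer grid centred on the best coarse value would then mirror the narrowing step.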


The cell below checks how these parameters do with our remain (training) and validation data:

In [92]:
#instantiate and fit best model
lr_model_ridge_full = Ridge(alpha=6.2, solver='sag')
lr_model_ridge_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_ridge_full training data is: {lr_model_ridge_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_ridge_full_val_r2 = lr_model_ridge_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_ridge_full validation data is: {lr_model_ridge_full_val_r2}')

The R2 score for lr_model_ridge_full training data is: 0.9359693918542785
The R2 score for lr_model_ridge_full validation data is: 0.9098745867302618


**Ridge Regression: Min Max, PCA Dataset**

The pipeline below will look for the optimal Ridge Regression model using the PCA dataset:

In [110]:
#creating our pipeline
#a single-step pipeline wrapping the model so GridSearchCV can tune it
linreg = Pipeline(
    [
        ("model", Ridge()),
    ]
)

In [111]:
#ran the grid search over a series of changes to these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [3.5], 
    "model__solver": ['sag'],
}

In [112]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)
#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 3.5, 'model__solver': 'sag'}
Best score: 0.9068516513539085


Ridge regression on the PCA data scores slightly lower than on the full dataset. The best parameters for it are shown above. The cell below checks how they do with the remain and validation data.

In [113]:
#instantiate and fit best model
lr_model_ridge_pca = Ridge(alpha=3.5, solver='sag')
lr_model_ridge_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_ridge_pca training data is: {lr_model_ridge_pca.score(X_remain_PCA, y_remain)}')

lr_model_ridge_pca_val_r2 = lr_model_ridge_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_ridge_pca validation data is: {lr_model_ridge_pca_val_r2}')

The R2 score for lr_model_ridge_pca training data is: 0.9194928820732233
The R2 score for lr_model_ridge_pca validation data is: 0.9011979701874846


**Lasso Regression: Min Max, Full Dataset**

The pipeline below will look for the optimal Lasso Regression model using the full dataset:

In [96]:
#creating our pipeline
#a single-step pipeline wrapping the model so GridSearchCV can tune it
linreg = Pipeline(
    [
        ("model", Lasso()),
    ]
)

In [97]:
#ran the grid search over a series of changes to these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.002], 
    "model__positive": [False],
    "model__warm_start": [True]
}

In [98]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.002, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9131574911589692


The optimal Lasso Regression parameters are shown above. The cell below checks how the model does with the training and validation data using these optimized parameters:

In [99]:
#instantiate and fit best model
lr_model_lasso_full = Lasso(alpha=0.002, positive=False, warm_start=True)
lr_model_lasso_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_lasso_full training data is: {lr_model_lasso_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_lasso_full_val_r2 = lr_model_lasso_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_lasso_full validation data is: {lr_model_lasso_full_val_r2}')

The R2 score for lr_model_lasso_full training data is: 0.9270599542538364
The R2 score for lr_model_lasso_full validation data is: 0.907597542099871
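
One property that sets Lasso apart from Ridge is that its L1 penalty can shrink some coefficients exactly to zero, which acts as a form of feature selection. The sketch below illustrates this on synthetic data where only the first three features matter. The alpha value is chosen for the illustration only and is unrelated to the tuned value above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 20 features influence the target
X = rng.random((200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.05)  # alpha chosen for illustration only
lasso.fit(X, y)

# Count how many coefficients survived the L1 penalty
n_nonzero = int(np.sum(lasso.coef_ != 0))
print(f"{n_nonzero} of {lasso.coef_.size} coefficients are non-zero")
```

Applying the same count to `lr_model_lasso_full.coef_` would show how many of the 630 features the fitted model actually uses.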


**Lasso Regression: Min Max, PCA Dataset**

The pipeline below will look for the optimal Lasso Regression model using the PCA dataset:

In [100]:
#creating our pipeline
#a single-step pipeline wrapping the model so GridSearchCV can tune it
linreg = Pipeline(
    [
        ("model", Lasso()),
    ]
)

In [105]:
#ran the grid search over a series of changes to these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.002], 
    "model__positive": [False],
    "model__warm_start": [True]
}

In [106]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.002, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9045196074779615


In [108]:
#instantiate and fit best model
lr_model_lasso_pca = Lasso(alpha=0.002, positive=False, warm_start=True)
lr_model_lasso_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_lasso_pca training data is: {lr_model_lasso_pca.score(X_remain_PCA, y_remain)}')

lr_model_lasso_pca_val_r2 = lr_model_lasso_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_lasso_pca validation data is: {lr_model_lasso_pca_val_r2}')

The R2 score for lr_model_lasso_pca training data is: 0.9142017685228763
The R2 score for lr_model_lasso_pca validation data is: 0.9033447450950415


**ElasticNet Regression: Min Max, Full Dataset**

The pipeline below will look for the optimal ElasticNet Regression model on the full dataset:

In [114]:
#creating our pipeline
#a single-step pipeline wrapping the model so GridSearchCV can tune it
linreg = Pipeline(
    [
        ("model", ElasticNet()),
    ]
)

In [115]:
#ran the grid search over a series of changes to these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.002], 
    "model__l1_ratio": [0.4],
    "model__warm_start": [True],
    "model__positive": [False]
}

In [116]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.002, 'model__l1_ratio': 0.4, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9136891720109446


The cell below checks how the optimized model does with the training and validation data:

In [128]:
#instantiate and fit best model
lr_model_elastic_full = ElasticNet(alpha=0.002, l1_ratio=0.4, positive=False, warm_start=True)
lr_model_elastic_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_elastic_full training data is: {lr_model_elastic_full.score(X_mm_scaled_remain, y_remain)}')

lr_model_elastic_full_val_r2 = lr_model_elastic_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for lr_model_elastic_full validation data is: {lr_model_elastic_full_val_r2}')

The R2 score for lr_model_elastic_full training data is: 0.9319226114642285
The R2 score for lr_model_elastic_full validation data is: 0.9101990891299294
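
For context on what `alpha` and `l1_ratio` control, the objective minimized by scikit-learn's `ElasticNet`, as given in its documentation, can be written as follows, where $n$ is the number of samples, $\alpha$ is `alpha` and $\rho$ is `l1_ratio` (the symbols are notation chosen here):

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With the values used above, $\alpha = 0.002$ and $\rho = 0.4$, the L1 term has weight $\alpha\rho = 0.0008$ and the squared L2 term has weight $\alpha(1-\rho)/2 = 0.0006$, so the model blends a Lasso-like and a Ridge-like penalty of similar size.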


**ElasticNet Regression: Min Max, PCA Dataset**

The pipeline below will look for the optimal ElasticNet Regression model on the PCA dataset:

In [None]:
#creating our pipeline
#a single-step pipeline wrapping the model so GridSearchCV can tune it
linreg = Pipeline(
    [
        ("model", ElasticNet()),
    ]
)

In [126]:
#ran the grid search over a series of changes to these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__alpha": [0.001], 
    "model__l1_ratio": [0.2],
    "model__warm_start": [True],
    "model__positive": [False]
}

In [127]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__alpha': 0.001, 'model__l1_ratio': 0.2, 'model__positive': False, 'model__warm_start': True}
Best score: 0.9069016414322609


In [129]:
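#instantiate and fit best model
#note: alpha=0.002 and l1_ratio=0.4 are the full-dataset values, not the PCA grid search
#result above (alpha=0.001, l1_ratio=0.2); the scores below reflect the values as written
lr_model_elastic_pca = ElasticNet(alpha=0.002, l1_ratio=0.4, positive=False, warm_start=True)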
lr_model_elastic_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for lr_model_elastic_pca training data is: {lr_model_elastic_pca.score(X_remain_PCA, y_remain)}')

lr_model_elastic_pca_val_r2 = lr_model_elastic_pca.score(X_val_PCA, y_val)
print(f'The R2 score for lr_model_elastic_pca validation data is: {lr_model_elastic_pca_val_r2}')

The R2 score for lr_model_elastic_pca training data is: 0.9165220287609357
The R2 score for lr_model_elastic_pca validation data is: 0.9041516740152954


The cells below compare R2 values calculated on the validation data to see how these different regression models perform:

In [151]:
#Comparing the models
R2_dictionary = {'Linear Full R2': lr_model_full_val_r2, 'Linear PCA R2' :lr_model_pca_val_r2, 'Ridge Full R2': lr_model_ridge_full_val_r2, 'Ridge PCA R2': lr_model_ridge_pca_val_r2, 'Lasso Full R2': lr_model_lasso_full_val_r2, 'Lasso PCA R2': lr_model_lasso_pca_val_r2, 'Elastic Full R2': lr_model_elastic_full_val_r2, 'Elastic PCA R2': lr_model_elastic_pca_val_r2} 

In [152]:
#sorting scores
R2_values_sorted = dict(sorted(R2_dictionary.items(), key = operator.itemgetter(1), reverse=True))

In [153]:
R2_values_sorted

{'Elastic Full R2': 0.9101990891299294,
 'Ridge Full R2': 0.9098745867302618,
 'Lasso Full R2': 0.907597542099871,
 'Elastic PCA R2': 0.9041516740152954,
 'Lasso PCA R2': 0.9033447450950415,
 'Linear Full R2': 0.9025842705674784,
 'Ridge PCA R2': 0.9011979701874846,
 'Linear PCA R2': 0.8959728426048407}
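
As an alternative to the sorted dictionary, the same ranking can be held in a pandas Series, which also makes it easy to quantify how close the models are. This is a small sketch using the validation R2 values copied from the output above; the variable names are illustrative.

```python
import pandas as pd

# Validation R2 values copied from the sorted dictionary above
r2_scores = {
    "Elastic Full R2": 0.9101990891299294,
    "Ridge Full R2": 0.9098745867302618,
    "Lasso Full R2": 0.907597542099871,
    "Elastic PCA R2": 0.9041516740152954,
    "Lasso PCA R2": 0.9033447450950415,
    "Linear Full R2": 0.9025842705674784,
    "Ridge PCA R2": 0.9011979701874846,
    "Linear PCA R2": 0.8959728426048407,
}

ranking = pd.Series(r2_scores, name="validation_r2").sort_values(ascending=False)

# Difference between the best and worst validation R2
spread = ranking.iloc[0] - ranking.iloc[-1]

print(ranking.round(4))
print(f"Best minus worst: {spread:.4f}")
```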

**Comparing model performance**

Looking at the different regression models, ElasticNet using the full dataset performed best on the validation data. In general, the full dataset gave better results than PCA, which is helpful because the full dataset preserves interpretability. 

All things considered, the models perform similarly: validation R2 ranges from about 0.896 (Linear PCA) to about 0.910 (Elastic Full). 

The next section looks at another model type before digging further into the models and drawing conclusions about which performs best.

### 5.2 XGBoost Regressor <a class="anchor" id="subheader52"></a>

For reference, the best validation R2 from running the baseline XGBoost Regressor model on numeric data alone (no vectorized text) was about 0.918. XGBoost was not run on the text-only data.

**XG Boost Regressor: Min Max, Full Dataset**

In [137]:
# 1. Instantiate the model
XGBR_model = XGBRegressor()

# 2. Fit the model
XGBR_model.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for XGBR_model training data is: {XGBR_model.score(X_mm_scaled_remain, y_remain)}')

XGBR_model_val_r2 = XGBR_model.score(X_mm_scaled_val, y_val)
print(f'The R2 score for XGBR_model validation data is: {XGBR_model_val_r2}')

The R2 score for XGBR_model training data is: 0.9926375397608137
The R2 score for XGBR_model validation data is: 0.9022804820873469


Optimizing the XGBoost model using the full dataset:

In [145]:
#creating our pipeline
#the pipeline has a single step, the XGBoost regressor (the name linreg is carried over from the linear regression section)
linreg = Pipeline(
    [
        ("model", XGBRegressor()),
    ]
)

In [146]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__booster": ['dart'], 
    "model__eta": [0.1],
    "model__gamma": [1],
    "model__max_depth": [11]
}

In [147]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_mm_scaled_remain, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__booster': 'dart', 'model__eta': 0.1, 'model__gamma': 1, 'model__max_depth': 11}
Best score: 0.9196504587392071
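
The narrowing process described in the comments above can get expensive. `GridSearchCV` fits every combination of parameter values once per cross-validation fold (5 folds by default), and `best_score_` is the mean cross-validated score of the best combination. The sketch below only counts the fits implied by a wider first-pass grid. The candidate values are hypothetical and are not the ones actually tried in this notebook.

```python
from itertools import product

# Hypothetical first-pass grid (illustrative values, not the actual search history)
wide_grid = {
    "model__booster": ["gbtree", "dart"],
    "model__eta": [0.01, 0.1, 0.3],
    "model__gamma": [0, 1, 5],
    "model__max_depth": [3, 7, 11],
}

n_folds = 5  # GridSearchCV's default number of cross-validation folds

# Every combination of parameter values is one candidate model
candidates = list(product(*wide_grid.values()))

# Each candidate is fit once per fold (the final refit on all data is not counted)
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

Starting wide with a few values per parameter and then narrowing around the best region keeps this count manageable, especially with a slower booster such as `dart`.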


In [163]:
#instantiate and fit best model
XGBR_model_full = XGBRegressor(booster='dart', eta=0.1, gamma=1, max_depth=11)
XGBR_model_full.fit(X_mm_scaled_remain, y_remain)

# 3. Scoring the models
print(f'The R2 score for XGBR_model_full training data is: {XGBR_model_full.score(X_mm_scaled_remain, y_remain)}')

XGBR_model_full_val_r2 = XGBR_model_full.score(X_mm_scaled_val, y_val)
print(f'The R2 score for XGBR_model_full validation data is: {XGBR_model_full_val_r2}')

The R2 score for XGBR_model_full training data is: 0.9921717669177883
The R2 score for XGBR_model_full validation data is: 0.9157344141540313


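The optimized XGBoost model does slightly better on the validation data than the linear regression models (an R2 of about 0.916 versus 0.910 for ElasticNet), though its much higher training R2 (about 0.992) shows that it is quite overfit.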
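
One rough way to put a number on the overfitting is the gap between training and validation R2. The sketch below uses the scores printed in the outputs above; the dictionary layout and labels are only for illustration.

```python
# (training R2, validation R2) copied from the printed outputs above
scores = {
    "XGBR full (default)": (0.9926375397608137, 0.9022804820873469),
    "XGBR full (tuned)": (0.9921717669177883, 0.9157344141540313),
    "ElasticNet PCA": (0.9165220287609357, 0.9041516740152954),
}

# A larger gap means the model fits the training data much better than unseen data
gaps = {name: train - val for name, (train, val) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: train - validation R2 gap = {gap:.4f}")
```

Tuning narrows the XGBoost gap somewhat, but it stays several times larger than the gap for the ElasticNet PCA model.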

Fitting the XGBoost Regressor model on the PCA data:

In [148]:
#creating our pipeline
#the pipeline has a single step, the XGBoost regressor (the name linreg is carried over from the linear regression section)
linreg = Pipeline(
    [
        ("model", XGBRegressor()),
    ]
)

In [158]:
#ran the grid search using a series of changes on these parameters
#started with more extreme options for each parameter and then narrowed down to these

parameters = {
    "model__booster": ['dart'], 
    "model__eta": [0.1],
    "model__gamma": [1],
    "model__max_depth": [5]
}

In [159]:
#running the search to find the best combination of these parameters
grid_search = GridSearchCV(linreg, parameters)

grid_search.fit(X_remain_PCA, y_remain)

print("Best parameters: ", grid_search.best_params_)

#checking the R2
print("Best score:", grid_search.best_score_)

Best parameters:  {'model__booster': 'dart', 'model__eta': 0.1, 'model__gamma': 1, 'model__max_depth': 5}
Best score: 0.8519782429252418


In [161]:
#instantiate and fit best model
XGBR_model_pca = XGBRegressor(booster='dart', eta=0.1, gamma=1, max_depth=5)
XGBR_model_pca.fit(X_remain_PCA, y_remain)

# 3. Scoring the models
print(f'The R2 score for XGBR_model_pca training data is: {XGBR_model_pca.score(X_remain_PCA, y_remain)}')

XGBR_model_pca_val_r2 = XGBR_model_pca.score(X_val_PCA, y_val)
print(f'The R2 score for XGBR_model_pca validation data is: {XGBR_model_pca_val_r2}')

The R2 score for XGBR_model_pca training data is: 0.9702665775242029
The R2 score for XGBR_model_pca validation data is: 0.8466916761279962


## 6. Comparing All Models <a class="anchor" id="header6"></a>



In [164]:
#Comparing the models
R2_dictionary = {
    'Linear Full R2': lr_model_full_val_r2,
    'Linear PCA R2': lr_model_pca_val_r2,
    'Ridge Full R2': lr_model_ridge_full_val_r2,
    'Ridge PCA R2': lr_model_ridge_pca_val_r2,
    'Lasso Full R2': lr_model_lasso_full_val_r2,
    'Lasso PCA R2': lr_model_lasso_pca_val_r2,
    'Elastic Full R2': lr_model_elastic_full_val_r2,
    'Elastic PCA R2': lr_model_elastic_pca_val_r2,
    'XGBR Full R2': XGBR_model_full_val_r2,
    'XGBR PCA R2': XGBR_model_pca_val_r2,
}

The cell below compares all the models run, listing the one with the highest R2 first:

In [165]:
#sorting scores
R2_values_sorted = dict(sorted(R2_dictionary.items(), key = operator.itemgetter(1), reverse=True))
R2_values_sorted

{'XGBR Full R2': 0.9157344141540313,
 'Elastic Full R2': 0.9101990891299294,
 'Ridge Full R2': 0.9098745867302618,
 'Lasso Full R2': 0.907597542099871,
 'Elastic PCA R2': 0.9041516740152954,
 'Lasso PCA R2': 0.9033447450950415,
 'Linear Full R2': 0.9025842705674784,
 'Ridge PCA R2': 0.9011979701874846,
 'Linear PCA R2': 0.8959728426048407,
 'XGBR PCA R2': 0.8466916761279962}

XGBR on the full dataset performs best, followed by ElasticNet on the full dataset. Overall, the PCA versions perform worse, and the XGBoost Regressor in particular takes a performance hit when using the PCA data. With the exception of the XGBR PCA version, model performance is pretty similar across the board.
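
To make the full-versus-PCA comparison concrete, the sketch below computes how much validation R2 each model family loses when moving from the full dataset to the PCA data. The values are copied from the sorted output above; the nested dictionary is just one convenient way to lay them out.

```python
# Validation R2 values copied from the sorted output above
val_r2 = {
    "Linear": {"full": 0.9025842705674784, "pca": 0.8959728426048407},
    "Ridge": {"full": 0.9098745867302618, "pca": 0.9011979701874846},
    "Lasso": {"full": 0.907597542099871, "pca": 0.9033447450950415},
    "Elastic": {"full": 0.9101990891299294, "pca": 0.9041516740152954},
    "XGBR": {"full": 0.9157344141540313, "pca": 0.8466916761279962},
}

# Positive values mean the full dataset did better than the PCA data
pca_drop = {model: r2["full"] - r2["pca"] for model, r2 in val_r2.items()}

for model, drop in sorted(pca_drop.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: full - PCA = {drop:.4f}")
```

The linear models each lose less than 0.01 in R2, while XGBR loses roughly 0.07.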

At this stage, we'll pick one model to move forward with. We could use the XGBoost model on the full dataset, as it is the top performer. However, it is much harder to interpret, even though a tool like SHAP could help with this. Given how close ElasticNet and the other linear regression models are in validation R2, we'll stick with the linear regression models.

## 7. Further Model Interpretation <a class="anchor" id="header7"></a>

Our model has a pretty strong R2, but which features are important? To answer this question, we'll look at the linear regression coefficients and p-values below.

### 7.1 Using Stats Models for Additional Insights <a class="anchor" id="subheader71"></a>

Unfortunately, SKLearn doesn't offer an easy way to get the p-values for the model. Instead, we'll use statsmodels to check the p-values. If the p-values from the statsmodels fit are significant, we can feel more confident in our linear regression models.

With statsmodels, we do have to add a constant to X before proceeding.

In [41]:
#adding the constant before modeling
X_mm_remain_constant = sm.add_constant(X_mm_scaled_remain)
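
The constant is needed because `sm.OLS` does not fit an intercept on its own; `sm.add_constant` adds a column of ones so that one can be estimated. The sketch below shows, in plain Python with a single made-up feature, what happens to a least-squares fit when the intercept is left out. The data are invented for illustration and have nothing to do with the coffee dataset.

```python
# Made-up data for illustration: y = 65 + 2x exactly
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [65.0 + 2.0 * xi for xi in x]
n = len(x)

# Least squares WITHOUT a constant: the line is forced through the origin
slope_no_const = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Least squares WITH a constant (intercept), using the closed-form solution
x_mean = sum(x) / n
y_mean = sum(y) / n
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean

print(f"no constant:   slope = {slope_no_const:.2f}")
print(f"with constant: slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Without the constant, the slope has to absorb the baseline level of `y` and is badly inflated. With it, the fit recovers the true slope and intercept.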

Running the model with statsmodels:

In [42]:
# 1. Instantiate Model
final_model = sm.OLS(y_remain, X_mm_remain_constant)

# 2. Fit Model (this returns a separate object with the parameters)
final_model_results = final_model.fit()

# Looking at the summary
final_model_results.summary()

Dep. Variable:      overall_score       R-squared:            0.943
Model:              OLS                 Adj. R-squared:       0.933
Method:             Least Squares       F-statistic:          93.19
Date:               Thu, 03 Nov 2022    Prob (F-statistic):   0.0
Time:               16:24:46            Log-Likelihood:       -5852.1
No. Observations:   4194                AIC:                  12970.0
Df Residuals:       3563                BIC:                  16970.0
Df Model:           630
Covariance Type:    nonrobust

                   coef    std err          t    P>|t|     [0.025     0.975]
const           65.4234      0.407    160.702    0.000     64.625     66.222
month            0.0990      0.060      1.654    0.098     -0.018      0.216
year            -0.1549      0.263     -0.590    0.555     -0.670      0.360
bean_agtron     -1.2737      0.361     -3.525    0.000     -1.982     -0.565
ground_agtron    1.6821      0.298      5.639    0.000      1.097      2.267
aroma            8.7087      0.279     31.232    0.000      8.162      9.255
acidity          5.0138      0.351     14.286    0.000      4.326      5.702
body             4.3515      0.220     19.745    0.000      3.919      4.784
flavor           5.8711      0.418     14.057    0.000      5.052      6.690

Omnibus:          701.284    Durbin-Watson:       1.977
Prob(Omnibus):    0.0        Jarque-Bera (JB):    14313.121
Skew:             -0.062     Prob(JB):            0.0
Kurtosis:         12.049     Cond. No.            290.0


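This basic version has a pretty strong R2 value as is (0.943). Note that this R2 is calculated on the same data the model was fit on, so it is not directly comparable to the validation scores above.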
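
With 630 features, the adjusted R2 is the more informative number in the summary, since it penalizes R2 for the number of features fitted. As a quick check, the sketch below recomputes it from the values reported in the summary, using the standard adjusted R2 formula. The variable names are just for illustration.

```python
# Values read from the OLS summary above
r_squared = 0.943   # R-squared (already rounded in the summary)
n_obs = 4194        # No. Observations
n_features = 630    # Df Model (does not include the constant)

# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)

print(f"Adjusted R2 = {adj_r_squared:.3f}")
```

The denominator `n_obs - n_features - 1` is 3563, which matches the Df Residuals line in the summary.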

Given the number of variables, it will be useful to sort the coefficients and p-values. This will highlight which coefficients are strongest and which have the lowest p-values, indicating that there is likely a real relationship and that it's very unlikely the finding is just due to chance.

In [58]:
#identifying the pvalues and coefficients
pvalues = final_model_results.pvalues
coeff = final_model_results.params

In [61]:
#creating a dataframe using the pvalues and coeff for easier sorting
results_df = pd.DataFrame({'pvals': pvalues, 'coeff': coeff})

Looking at coefficients first:

In [46]:
#looking at the 50 strongest negative coefficients
results_df['coeff'].sort_values()[:50]

hard          -3.636910
shadow        -2.924256
flat          -2.605377
nice          -2.540766
single        -2.207284
cups          -2.189681
taste         -2.154472
sharp         -1.843138
tested        -1.831253
pod           -1.680764
bitter        -1.474950
capsule       -1.466924
pruny         -1.450876
nuts          -1.396667
sample        -1.382201
musty         -1.374905
faint         -1.279075
bean_agtron   -1.273737
woody         -1.246075
dimension     -1.214421
lean          -1.179287
leanish       -1.101886
carbon        -1.101617
grace         -1.052883
substantial   -1.004988
water         -0.996420
solid         -0.988622
called        -0.919792
interesting   -0.912091
sharpness     -0.886726
nose          -0.859408
ferment       -0.853366
mustiness     -0.824091
key           -0.743318
fallen        -0.723292
burned        -0.670038
origin_lat    -0.656468
bay           -0.642872
bitterish     -0.642027
attractive    -0.614003
ken           -0.605461
astringent    -0

In [47]:
#narrowing down to top 50 positive coefficients
results_df['coeff'].sort_values(ascending=False)[:50]

const            65.423442
aroma             8.708658
aftertaste        6.138584
flavor            5.871105
acidity           5.013760
body              4.351545
kenya             2.291016
malty             1.942738
ted               1.753575
byron             1.744289
ground_agtron     1.682064
flower            1.582224
wine              1.578629
understated       1.513380
blooms            1.481906
best              1.425150
miguel            1.416512
superb            1.362618
serve             1.307893
leaves            1.291068
cold              1.289755
keurig            1.259523
acidy             1.227601
brown             1.204013
berry             1.173601
winy              1.159030
leaf              1.152578
flowering         1.122301
ready             1.085979
deepening         1.064350
fragrant          1.056600
especially        1.048805
ethan             1.044863
way               1.033733
dimensioned       1.032348
shimmer           1.027525
cherryish         1.012664
b

Looking at p-values:

In [48]:
#sorting to see which have lowest pvalues
results_df['pvals'].sort_values()[:50]

const             0.000000e+00
aroma            1.728116e-189
body              1.931333e-82
aftertaste        3.574319e-72
acidity           4.577116e-45
flavor            1.009842e-43
hard              4.115328e-18
wine              5.098205e-11
berry             2.485614e-10
flat              2.052284e-09
sharp             1.086212e-08
taste             1.541735e-08
shadow            1.560907e-08
ground_agtron     1.846499e-08
acidy             6.293852e-08
orange            7.384449e-08
kenya             3.612057e-07
floral            4.145186e-07
malty             4.905591e-07
understated       1.711198e-06
cocoa             2.653913e-06
flower            4.252826e-06
chocolate         5.162893e-06
nice              6.460172e-06
long              6.576506e-06
clean             8.907306e-06
honey             9.221251e-06
cups              9.346409e-06
faint             1.061656e-05
superb            1.117158e-05
flowers           1.373589e-05
dimension         1.442235e-05
richly  

In [49]:
#sorting to see which have highest pvalues
results_df['pvals'].sort_values(ascending=False)[:50]

underlying         0.999941
salty              0.998292
throughline        0.997745
nib                0.993386
brewing            0.992947
particularly       0.987922
zesty              0.985821
tight              0.982015
shot               0.978526
particular         0.977564
simplifies         0.973948
device             0.969061
brewed             0.967686
pecan              0.966520
lead               0.965155
toasted            0.963430
turning            0.957313
aromatic           0.955735
syrup              0.950809
little             0.948572
centers            0.939440
relatively         0.935932
carrying           0.929579
zest               0.928656
round              0.921084
size               0.920972
evaluated          0.917444
liked              0.914752
think              0.911055
rye                0.908868
fading             0.900722
softening          0.899193
freshly            0.896761
mid                0.895762
silky              0.893880
original           0

In [62]:
#creating filtered dataframe that only includes coefficients that have significant p-values
sig_results_df = results_df.loc[results_df['pvals'] <= 0.05]
sig_results_df

                        pvals      coeff
const            0.000000e+00  65.423442
bean_agtron      4.289270e-04  -1.273737
ground_agtron    1.846499e-08   1.682064
aroma           1.728116e-189   8.708658
acidity          4.577116e-45   5.013760
...                       ...        ...
viscous          4.296123e-02   0.333725
way              8.051757e-03   1.033733
wine             5.098205e-11   1.578629
winy             5.434969e-04   1.159030
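
When filtering this many features on p <= 0.05, it helps to know roughly how many could pass by chance alone. The arithmetic below is a rough sketch that assumes independent tests, which is a simplification for word features. The Bonferroni cutoff is shown only as one conservative reference point and is not a method used elsewhere in this notebook.

```python
alpha = 0.05
n_features = 630  # Df Model from the OLS summary

# If no feature had any real effect, about this many would still pass p <= alpha
expected_chance_findings = alpha * n_features

# Bonferroni: divide alpha by the number of tests for a much stricter cutoff
bonferroni_alpha = alpha / n_features

# A few p-values copied from the filtered output above
pvals = {
    "viscous": 4.296123e-02,
    "way": 8.051757e-03,
    "wine": 5.098205e-11,
    "winy": 5.434969e-04,
}
survivors = [name for name, p in pvals.items() if p <= bonferroni_alpha]

print(f"Expected chance findings: {expected_chance_findings:.1f}")
print(f"Bonferroni cutoff: {bonferroni_alpha:.2e}")
print(f"Still significant after Bonferroni: {survivors}")
```

Features with p-values just under 0.05 are therefore best treated as weaker evidence than the ones with extremely small p-values.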


In [65]:
#isolating significant, negative coefficients
neg_sig_results_df = results_df.loc[(results_df['pvals'] <= 0.05)&(results_df['coeff']<0)]
neg_sig_results_df

                     pvals      coeff
bean_agtron    0.000428927  -1.273737
origin_lat    7.220066e-05  -0.656468
astringent     0.004897715  -0.595191
attractive      0.04293996  -0.614003
bitter        8.373882e-05   -1.47495
called          0.02463395  -0.919792
carbon          0.01562305  -1.101617
cups          9.346409e-06  -2.189681
dimension     1.442235e-05  -1.214421
dominated        0.0182331  -0.516362


In [73]:
#sorting to see which variables with significant pvalues have strongest neg coeff
neg_visual = neg_sig_results_df['coeff'].sort_values()
neg_visual

hard          -3.636910
shadow        -2.924256
flat          -2.605377
nice          -2.540766
single        -2.207284
cups          -2.189681
taste         -2.154472
sharp         -1.843138
tested        -1.831253
pod           -1.680764
bitter        -1.474950
pruny         -1.450876
nuts          -1.396667
sample        -1.382201
musty         -1.374905
faint         -1.279075
bean_agtron   -1.273737
woody         -1.246075
dimension     -1.214421
lean          -1.179287
leanish       -1.101886
carbon        -1.101617
grace         -1.052883
substantial   -1.004988
solid         -0.988622
called        -0.919792
interesting   -0.912091
sharpness     -0.886726
nose          -0.859408
ferment       -0.853366
key           -0.743318
origin_lat    -0.656468
attractive    -0.614003
astringent    -0.595191
quite         -0.577004
tones         -0.555838
dominated     -0.516362
turns         -0.458443
simple        -0.417165
short         -0.357663
like          -0.334987
drying        -0

In [68]:
#isolating significant, positive coefficients
pos_sig_results_df = results_df.loc[(results_df['pvals'] <= 0.05)&(results_df['coeff']>=0)]
pos_sig_results_df

                        pvals      coeff
const            0.000000e+00  65.423442
ground_agtron    1.846499e-08   1.682064
aroma           1.728116e-189   8.708658
acidity          4.577116e-45   5.013760
body             1.931333e-82   4.351545
...                       ...        ...
vibrant          2.591547e-02   0.468497
viscous          4.296123e-02   0.333725
way              8.051757e-03   1.033733
wine             5.098205e-11   1.578629


In [72]:
#sorting to see which variables with significant pvalues have strongest positive coeff
pos_visual = pos_sig_results_df['coeff'].sort_values(ascending=False)[:50]
pos_visual

const            65.423442
aroma             8.708658
aftertaste        6.138584
flavor            5.871105
acidity           5.013760
body              4.351545
kenya             2.291016
malty             1.942738
ted               1.753575
byron             1.744289
ground_agtron     1.682064
flower            1.582224
wine              1.578629
understated       1.513380
blooms            1.481906
best              1.425150
miguel            1.416512
superb            1.362618
leaves            1.291068
acidy             1.227601
berry             1.173601
winy              1.159030
leaf              1.152578
deepening         1.064350
fragrant          1.056600
especially        1.048805
way               1.033733
dimensioned       1.032348
shimmer           1.027525
cherryish         1.012664
deepens           1.005686
rum               1.001725
herb              0.977119
surprising        0.974738
classic           0.970284
lyric             0.968210
explicit          0.958327
n

In [74]:
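#save for easy reference to create visualizations
#the index is kept so the feature names are written next to their coefficients
neg_visual.to_csv('neg_visual_df.csv')

In [75]:
#save for easy reference to create visualizations
#the index is kept so the feature names are written next to their coefficients
pos_visual.to_csv('pos_visual_df.csv')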
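
Because `neg_visual` and `pos_visual` are Series, the feature names live in the index rather than in a column, so the `index` argument of `to_csv` decides whether they reach the file. A minimal sketch, using three coefficients from the output above and assuming a reasonably recent pandas (where calling `to_csv` without a path returns the CSV text):

```python
import pandas as pd

# Three coefficients copied from the output above; the index holds the feature names
neg_example = pd.Series(
    {"hard": -3.636910, "shadow": -2.924256, "flat": -2.605377}, name="coeff"
)

# With no path given, to_csv returns the CSV content as a string
without_index = neg_example.to_csv(index=False)  # coefficient values only
with_index = neg_example.to_csv()                # feature names and values

print(without_index)
print(with_index)
```

Only the version that keeps the index can be used later to label bars in a chart by feature name.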

In [None]:
#TODO: narrate what these results mean
#TODO: pull the coefficients from the ElasticNet model to compare, and maybe export the ElasticNet coefficients instead
#the statsmodels fit is just a sanity check
#TODO: look at the other value Arad suggested

### 7.2 SKLearn Elastic Net Model Interpretation <a class="anchor" id="subheader72"></a>

## 8. Conclusion <a class="anchor" id="header8"></a>