# Capstone 3 - Modeling


## Hypothesis:  Does inclusion of the consumer confidence index (cci) improve model prediction scores for sales?  How much does a logarithmic transformation of several non-normally distributed features improve predictions?


Procedure:

Part I.

Build and apply a column transformer to the train set:

Nominal Categories: 'IsHoliday', 'Dept', 'Store'

Ordinal Categories: 'Week', 'Type', 'isocalendar'

Standard Scaler for Numerical Features: 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'Size', 'cci_value'


Part II.  Obtain base model performance metrics without cci values.

1. Fit the following models without cci data:  Ordinary Least Squares (OLS), ElasticNet, Random Forest Regressor,
XGBoost, HistGradientBoost.
2. Cross validate to obtain coefficient of determination (R2), mean squared error (MSE), and mean absolute error (MAE) for each model.

Part III.  Using models from Part II, obtain performance metrics with cci values.
1.  Build and test models from Part II with cci data.
2.  Record coefficient of determination (R2), mean squared error (MSE), and mean absolute error (MAE) for each model.
3.  Cross validate to obtain R2, MSE, and MAE scores.
4.  Compare R2, MSE, and MAE scores between Parts II and III.

Part IV.  Apply a logarithmic function to the following features: 'MarkDown1', 'MarkDown2', 'MarkDown3',
'MarkDown4', 'MarkDown5'
1.  Create new features by applying a logarithmic function to the following features: 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'.
2.  Drop the original "MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", and "MarkDown5" features from the training data.
3.  Fit HistGradientBoost and Random Forest models to the training data.
4.  Cross validate to obtain R2, MSE, and MAE scores.
5.  Compare R2, MSE, and MAE scores between Parts II and IV.
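
The feature step in Part IV could look like the sketch below. It assumes np.log1p (log of 1 + x) as the logarithmic function, because the MarkDown columns contain zeros in the training data previewed later in this notebook. The helper name add_log_markdowns, the '_log' suffix, and the clipping of negative values to zero are illustrative choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd

markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']

def add_log_markdowns(df, cols=markdown_cols):
    """Return a copy of df where each MarkDown column is replaced by its log1p version."""
    out = df.copy()
    for col in cols:
        # clip at 0 so log1p is always defined (assumption: negatives mean "no markdown")
        out[col + '_log'] = np.log1p(out[col].clip(lower=0))
    # drop the original columns, as described in step 2
    return out.drop(columns=cols)

# toy frame standing in for df_train
toy = pd.DataFrame({c: [0.0, 6766.44, 50.82] for c in markdown_cols})
toy_log = add_log_markdowns(toy)
print(toy_log.columns.tolist())
```

Zeros map to zero under log1p, so rows without markdowns stay at the same baseline after the transformation.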

Results:

See Capstone 3 Modeling - Results Plots.ipynb


Conclusion:

See Capstone 3 Modeling - Results Plots.ipynb




# Import Modules

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
# sklearn libraries

from sklearn.compose import make_column_transformer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, RandomizedSearchCV

from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

from sklearn.model_selection import cross_validate


from sklearn.decomposition import PCA
from sklearn.decomposition import NMF

from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet


from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
# needed only on scikit-learn < 1.0, where HistGradientBoostingRegressor is still experimental
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor


#Model Metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

import xgboost as xgb


In [3]:
!ls

Capstone 3 Data Wrangling Ed Gatdula.ipynb
Capstone 3 EDA Ed Gatdula.ipynb
Capstone 3 Modeling - Ensemble Regression Models.ipynb
Capstone 3 Modeling - Linear Regression Models Ed Gatdula.ipynb
Capstone 3 Modeling - Regression Model Compilations.ipynb
Capstone 3 Modeling - TimeSeriesSplit Version.ipynb
Feature Union Worksheet.ipynb
Kaggle Submissions
Log Transform MarkDown Features.ipynb
Misc Worksheet
capstone 3 project data
capstone_3
capstone_3_test_data
capstone_3_train_data
capstone_3_wrnglng_results
kaggle_submission5_randomforest_randomsearchcv
report_images


# Import Data


## Test set: 

In [4]:
# test set for kaggle prediction

df_test = pd.read_csv('./capstone_3_test_data')
print('df_test shape: {}\n'.format(df_test.shape))
df_test.info()
df_test.head()

df_test shape: (115064, 18)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115064 entries, 0 to 115063
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Date          115064 non-null  object 
 1   Store         115064 non-null  int64  
 2   Dept          115064 non-null  int64  
 3   IsHoliday     115064 non-null  int64  
 4   Temperature   115064 non-null  float64
 5   Fuel_Price    115064 non-null  float64
 6   MarkDown1     115064 non-null  float64
 7   MarkDown2     115064 non-null  float64
 8   MarkDown3     115064 non-null  float64
 9   MarkDown4     115064 non-null  float64
 10  MarkDown5     115064 non-null  float64
 11  CPI           115064 non-null  float64
 12  Unemployment  115064 non-null  float64
 13  isocalendar   115064 non-null  object 
 14  Week          115064 non-null  int64  
 15  Type          115064 non-null  object 
 16  Size          115064 non-null  int64  
 17  cci_value     11506

Unnamed: 0,Date,Store,Dept,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,isocalendar,Week,Type,Size,cci_value
0,2012-11-02,1,1,0,55.32,3.386,6766.44,5147.7,50.82,3639.9,2737.42,223.462779,6.573,"(2012, 44)",44,A,151315,99.00362
1,2012-11-02,1,2,0,55.32,3.386,6766.44,5147.7,50.82,3639.9,2737.42,223.462779,6.573,"(2012, 44)",44,A,151315,99.00362
2,2012-11-02,1,3,0,55.32,3.386,6766.44,5147.7,50.82,3639.9,2737.42,223.462779,6.573,"(2012, 44)",44,A,151315,99.00362
3,2012-11-02,1,4,0,55.32,3.386,6766.44,5147.7,50.82,3639.9,2737.42,223.462779,6.573,"(2012, 44)",44,A,151315,99.00362
4,2012-11-02,1,5,0,55.32,3.386,6766.44,5147.7,50.82,3639.9,2737.42,223.462779,6.573,"(2012, 44)",44,A,151315,99.00362


## Train Set: 

Using OrdinalEncoder, OneHotEncoder, StandardScaler to:

1. prepare data for model use
2. prepare the ColumnTransformer for use in a pipeline

In [8]:
# training data
df_train = pd.read_csv('./capstone_3_train_data.csv')
print('df_train shape: {}\n'.format(df_train.shape))
df_train.info()
df_train.head()

df_train shape: (421570, 19)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 19 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Date          421570 non-null  object 
 1   Store         421570 non-null  int64  
 2   Dept          421570 non-null  int64  
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  int64  
 5   Temperature   421570 non-null  float64
 6   Fuel_Price    421570 non-null  float64
 7   MarkDown1     421570 non-null  float64
 8   MarkDown2     421570 non-null  float64
 9   MarkDown3     421570 non-null  float64
 10  MarkDown4     421570 non-null  float64
 11  MarkDown5     421570 non-null  float64
 12  CPI           421570 non-null  float64
 13  Unemployment  421570 non-null  float64
 14  isocalendar   421570 non-null  object 
 15  Week          421570 non-null  int64  
 16  Type          421570 non-null  object 
 17  Size          4215

Unnamed: 0,Date,Store,Dept,Weekly_Sales,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,isocalendar,Week,Type,Size,cci_value
0,2010-02-05,1,1,24924.5,0,42.31,2.572,0.0,0.0,0.0,0.0,0.0,211.096358,8.106,"(2010, 5)",5,A,151315,98.22324
1,2010-02-05,1,2,50605.27,0,42.31,2.572,0.0,0.0,0.0,0.0,0.0,211.096358,8.106,"(2010, 5)",5,A,151315,98.22324
2,2010-02-05,1,3,13740.12,0,42.31,2.572,0.0,0.0,0.0,0.0,0.0,211.096358,8.106,"(2010, 5)",5,A,151315,98.22324
3,2010-02-05,1,4,39954.04,0,42.31,2.572,0.0,0.0,0.0,0.0,0.0,211.096358,8.106,"(2010, 5)",5,A,151315,98.22324
4,2010-02-05,1,5,32229.38,0,42.31,2.572,0.0,0.0,0.0,0.0,0.0,211.096358,8.106,"(2010, 5)",5,A,151315,98.22324


In [12]:
train_features = df_train.columns.to_list()
print("Number of train features: {}".format(len(train_features)))
print(train_features)

Number of train features: 19
['Date', 'Store', 'Dept', 'Weekly_Sales', 'IsHoliday', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'isocalendar', 'Week', 'Type', 'Size', 'cci_value']


## Column Transformer Setup


Two column transformer objects are built: column_transform_no_cci and column_transform_cci.

###  column_transform_no_cci

IMPORTANT:  'cci_value' is dropped from this column transformer


In [7]:
df_train.shape

(421570, 19)

In [31]:
# column_transform_no_cci
# 'Date', 'Store', 'Dept', 'Weekly_Sales', 'IsHoliday', 'Temperature', 'Fuel_Price', 'MarkDown1',
# 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'isocalendar', 'Week',
# 'Type', 'Size', 'cci_value'

# 3 Nominal Categories: 'IsHoliday', 'Dept','Store'

ohe=OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(df_train[["IsHoliday",'Dept','Store']])
ohe.categories_

# 3 Ordinal Categories: 'Week', 'Type', 'isocalendar'
oe = OrdinalEncoder()
oe.fit_transform(df_train[['Week','Type', 'isocalendar']])
oe.categories_

# StandardScaler for 10 numerical features ('cci_value' excluded)
scaler = StandardScaler()
scaler.fit_transform(df_train[['Temperature', 'Fuel_Price', 'MarkDown1',
                                  'MarkDown2', 'MarkDown3', 'MarkDown4',
                                  'MarkDown5', 'CPI', 'Unemployment', 'Size',
                              #    'cci_value'
                              ]])

# Instantiate make_column_transformer using standard scaler, onehotencoder, ordinalencoder

column_transform_no_cci = make_column_transformer((scaler,['Temperature', 'Fuel_Price','MarkDown1',
                                                              'MarkDown2', 'MarkDown3', 'MarkDown4',
                                                              'MarkDown5', 'CPI', 'Unemployment', 'Size',
                                                          #    'cci_value'
                                                          ]),
                                           (ohe,['IsHoliday','Dept','Store']), 
                                           (oe,['Week','Type','isocalendar']),sparse_threshold=0)

# fit the column transformer (fit only; transform is called later)
column_transform_no_cci.fit(df_train.drop(columns = ['Date','Weekly_Sales','cci_value']))


ColumnTransformer(sparse_threshold=0,
                  transformers=[('standardscaler', StandardScaler(),
                                 ['Temperature', 'Fuel_Price', 'MarkDown1',
                                  'MarkDown2', 'MarkDown3', 'MarkDown4',
                                  'MarkDown5', 'CPI', 'Unemployment', 'Size']),
                                ('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 ['IsHoliday', 'Dept', 'Store']),
                                ('ordinalencoder', OrdinalEncoder(),
                                 ['Week', 'Type', 'isocalendar'])])

###  column_transform_cci

IMPORTANT:  Retains Consumer Confidence Index (cci) values

In [26]:
# column_transform_cci
# 'Date', 'Store', 'Dept', 'Weekly_Sales', 'IsHoliday', 'Temperature', 'Fuel_Price', 'MarkDown1',
# 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'isocalendar', 'Week',
# 'Type', 'Size', 'cci_value'

# 3 Nominal Categories: 'IsHoliday', 'Dept','Store'

ohe=OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(df_train[["IsHoliday",'Dept','Store']])
ohe.categories_

# 3 Ordinal Categories: 'Week', 'Type', 'isocalendar'
oe = OrdinalEncoder()
oe.fit_transform(df_train[['Week','Type', 'isocalendar']])
oe.categories_

# StandardScaler for 11 numerical features (includes 'cci_value')
scaler = StandardScaler()
scaler.fit_transform(df_train[['Temperature', 'Fuel_Price', 'MarkDown1',
                                  'MarkDown2', 'MarkDown3', 'MarkDown4',
                                  'MarkDown5', 'CPI', 'Unemployment', 'Size',
                                  'cci_value'
                              ]])

# Instantiate make_column_transformer using standard scaler, onehotencoder, ordinalencoder

column_transform_cci = make_column_transformer((scaler,['Temperature', 'Fuel_Price','MarkDown1',
                                                              'MarkDown2', 'MarkDown3', 'MarkDown4',
                                                              'MarkDown5', 'CPI', 'Unemployment', 'Size',
                                                              'cci_value'
                                                          ]),
                                           (ohe,['IsHoliday','Dept','Store']), 
                                           (oe,['Week','Type','isocalendar']),sparse_threshold=0)

# fit the column transformer (fit only; transform is called later)
column_transform_cci.fit(df_train.drop(columns = ['Date','Weekly_Sales']))



ColumnTransformer(sparse_threshold=0,
                  transformers=[('standardscaler', StandardScaler(),
                                 ['Temperature', 'Fuel_Price', 'MarkDown1',
                                  'MarkDown2', 'MarkDown3', 'MarkDown4',
                                  'MarkDown5', 'CPI', 'Unemployment', 'Size',
                                  'cci_value']),
                                ('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 ['IsHoliday', 'Dept', 'Store']),
                                ('ordinalencoder', OrdinalEncoder(),
                                 ['Week', 'Type', 'isocalendar'])])
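
The two setup cells above differ only in whether 'cci_value' is in the scaled column list. One way to avoid the duplication is a small factory function, sketched below. The name build_column_transformer and the module-level column lists are hypothetical; the encoders and the sparse_threshold setting mirror the cells above.

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

NUMERIC = ['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3',
           'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'Size']
NOMINAL = ['IsHoliday', 'Dept', 'Store']
ORDINAL = ['Week', 'Type', 'isocalendar']

def build_column_transformer(include_cci):
    """Return an unfitted ColumnTransformer, with or without 'cci_value' in the scaled columns."""
    numeric = NUMERIC + (['cci_value'] if include_cci else [])
    return make_column_transformer(
        (StandardScaler(), numeric),
        (OneHotEncoder(handle_unknown='ignore'), NOMINAL),
        (OrdinalEncoder(), ORDINAL),
        sparse_threshold=0)  # always return a dense array, as in the cells above

ct_no_cci = build_column_transformer(include_cci=False)
ct_cci = build_column_transformer(include_cci=True)
```

Each call creates fresh encoder and scaler instances, so the two transformers do not share fitted state.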

# Linear Regression 

note:  all cross validation scores are recorded in a dataframe named df_summary

## Linear Regression (no cci values) - TimeSeriesSplit Cross-Validation Using cross_validate

In [15]:
df_train.columns

Index(['Date', 'Store', 'Dept', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'isocalendar', 'Week', 'Type',
       'Size', 'cci_value'],
      dtype='object')

In [20]:
# cross validate base linear regression model, no cci_value

Y = df_train.Weekly_Sales
X = df_train.drop(columns = ['Date','Weekly_Sales', 'cci_value'])

# fit column transformer using entire train set

transform_no_cci = column_transform_no_cci.fit(X)

cv = TimeSeriesSplit(n_splits=5)

# cross validate on the already-transformed features, one score per time series split
lr_no_cci_scores = cross_validate(LinearRegression(), column_transform_no_cci.transform(X), Y, cv=cv,
                        scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'))

print("MSE scores: {}".format(lr_no_cci_scores['test_neg_mean_squared_error']))
print("R2 scores: {}".format(lr_no_cci_scores['test_r2']))
print("MAE scores: {}".format(lr_no_cci_scores['test_neg_mean_absolute_error']))



MSE scores: [-5.87771188e+27 -2.55857059e+26 -2.32052518e+22 -2.60963165e+26
 -9.58236393e+25]
R2 scores: [-8.08840292e+18 -5.39585175e+17 -4.85127160e+13 -7.70030055e+17
 -2.58427474e+17]
MAE scores: [-6.31688390e+13 -1.35981239e+13 -1.19603516e+11 -9.69012159e+12
 -8.79034065e+12]
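
The cell above fits the column transformer on the full training frame before cross validation, so the scaler statistics and encoder categories are learned from rows that later serve as test folds. One alternative, sketched here on a small synthetic frame rather than the real data, is to place the transformer and the model in a single pipeline so that cross_validate refits both on each training fold. The synthetic columns and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# synthetic stand-in for df_train
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    'Temperature': rng.normal(60, 10, n),
    'Dept': rng.integers(1, 4, n),
})
toy['Weekly_Sales'] = 1000 * toy['Dept'] + 5 * toy['Temperature'] + rng.normal(0, 10, n)

ct = make_column_transformer(
    (StandardScaler(), ['Temperature']),
    (OneHotEncoder(handle_unknown='ignore'), ['Dept']),
    sparse_threshold=0)

# the pipeline is cloned and refit inside every split, transformer included
model = make_pipeline(ct, LinearRegression())
cv_scores = cross_validate(model, toy[['Temperature', 'Dept']], toy['Weekly_Sales'],
                           cv=TimeSeriesSplit(n_splits=5),
                           scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'))
print(cv_scores['test_r2'])
```

With this layout the raw dataframe is passed to cross_validate, and no separate transform call is needed.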


In [21]:
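# write a function that returns average of cross validation scores

def scores(name):
    # average the absolute value of each metric across folds
    # note: np.abs discards the sign, so a negative R2 is reported as positive
    keys = ['test_r2', 'test_neg_mean_squared_error',
            'test_neg_mean_absolute_error']
    # build a new dict on every call instead of mutating a global named `list`
    return {item: np.round(np.mean(np.abs(name[item])), 2) for item in keys}


scores(lr_no_cci_scores)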

{'test_r2': 1.9312988279932575e+18,
 'test_neg_mean_squared_error': 1.2980757890467114e+27,
 'test_neg_mean_absolute_error': 19073405730295.04}
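
Because scores takes absolute values before averaging, a negative R2 is reported as positive: the base model's fold R2 values are all negative, yet the summary shows 1.93e+18. A sign-preserving variant is sketched below. The name mean_scores is hypothetical; it negates only scikit-learn's neg_ error metrics and leaves R2 untouched.

```python
import numpy as np

def mean_scores(cv_scores):
    """Average cross_validate test scores, flipping only the negated error metrics."""
    out = {}
    for key, values in cv_scores.items():
        if not key.startswith('test_'):
            continue  # skip fit_time / score_time entries
        mean = float(np.mean(values))
        if key.startswith('test_neg_'):
            out[key.replace('test_neg_', '')] = -mean  # report errors as positive numbers
        else:
            out[key.replace('test_', '')] = mean       # keep the sign of R2
    return out

# fold scores copied from the linear regression output above
example = {
    'test_r2': np.array([-8.08840292e+18, -5.39585175e+17, -4.85127160e+13,
                         -7.70030055e+17, -2.58427474e+17]),
    'test_neg_mean_absolute_error': np.array([-6.31688390e+13, -1.35981239e+13,
                                              -1.19603516e+11, -9.69012159e+12,
                                              -8.79034065e+12]),
}
print(mean_scores(example))
```

With this variant the summary table would show a large negative R2 for the base model, which makes the failed fit visible at a glance.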

In [22]:
# create dataframe to store model performance metrics
df_summary = pd.DataFrame(columns = ['name'])

In [23]:
# function to write model name and results to a dict.  the dict is used to create a df_summary entry

def write(description, cv_scores):
    dict1 = {'model description':description}
    dict2 = scores(cv_scores)
    dict2.update(dict1)
    return dict2

In [24]:
df_summary = df_summary.append(write('lr base no cci', lr_no_cci_scores), ignore_index=True)
df_summary

Unnamed: 0,name,model description,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_r2
0,,lr base no cci,19073410000000.0,1.298076e+27,1.931299e+18
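
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append cells in this notebook need an older pandas. A version-independent pattern is to collect one dict per model in a list and build the summary frame once, as sketched below. The names summary_rows and df_demo are hypothetical, and the two R2 values are copied from the summary table in this notebook.

```python
import pandas as pd

summary_rows = []  # one dict per model

# each entry has the same shape as the dict returned by write()
summary_rows.append({'model description': 'lr base no cci', 'test_r2': 1.931299e+18})
summary_rows.append({'model description': 'lr cci', 'test_r2': 8.225199e+16})

# build the frame once, after all models have been scored
df_demo = pd.DataFrame(summary_rows)
print(df_demo)
```

This also avoids the empty 'name' column that appears in df_summary, since no placeholder frame is created up front.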


## Linear Regression (cci values) - TimeSeriesSplit Cross-Validation Using cross_validate

In [27]:
# cross validate base linear regression model with cci_value

cv = TimeSeriesSplit(n_splits=5)

Y = df_train.Weekly_Sales
X = df_train.drop(columns = ['Date','Weekly_Sales'])

# fit column transformer using entire train set

transform_cci = column_transform_cci.fit(X)

# cross validate on the already-transformed features, one score per time series split

lr_cci_scores = cross_validate(LinearRegression(), column_transform_cci.transform(X), Y, cv=cv,
                        scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'))

print("MSE scores: {}".format(lr_cci_scores['test_neg_mean_squared_error']))
print("R2 scores: {}".format(lr_cci_scores['test_r2']))
print("MAE scores: {}".format(lr_cci_scores['test_neg_mean_absolute_error']))


MSE scores: [-3.39916478e+25 -1.14079090e+26 -1.47648565e+23 -4.14002926e+25
 -5.29881750e+23]
R2 scores: [-4.67763901e+16 -2.40585060e+17 -3.08672923e+14 -1.22160802e+17
 -1.42904197e+15]
MAE scores: [-5.40063577e+12 -9.09096992e+12 -3.37511199e+11 -5.32306622e+12
 -6.62666847e+11]
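
TimeSeriesSplit splits by row position, not by date, so each fold is a forward-in-time split only if the rows are in chronological order. The head() preview of df_train shows a single date, so it does not establish how the full file is ordered. The sketch below uses a toy frame, with an assumed store-then-date layout, to show sorting by Date before splitting.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# toy frame ordered by store, then date (assumed layout, for illustration only)
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2, 2],
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-19'] * 2),
})

# stable sort keeps the store order within each date
toy_sorted = toy.sort_values('Date', kind='stable').reset_index(drop=True)

for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(toy_sorted):
    train_end = toy_sorted.loc[train_idx, 'Date'].max()
    test_start = toy_sorted.loc[test_idx, 'Date'].min()
    # after sorting, no test row predates the end of the training window
    print(train_end.date(), '<=', test_start.date())
```

Without the sort, the first folds of this toy frame would train on one store and test on another rather than on later weeks.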


In [28]:
# appending metrics to df_summary
scores(lr_cci_scores)

score_dict = write('lr cci', lr_cci_scores)

df_summary = df_summary.append(score_dict, ignore_index=True)


In [29]:
df_summary

Unnamed: 0,name,model description,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_r2
0,,lr base no cci,19073410000000.0,1.298076e+27,1.931299e+18
1,,lr cci,4162970000000.0,3.802971e+25,8.225199e+16


## Linear Regression (no cci_value) - GridSearchCV, PCA, TimeSeriesSplit 

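Results:  With PCA (1 to 7 components), the linear regression scores are stable (mean MAE of about 15,243) rather than numerically blown up as in the two models above, but the fit is still poor (reported mean R2 of about 0.02).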

In [33]:
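# define pipeline steps: PCA reducer followed by linear regression
# note: the normalize argument was removed from LinearRegression in scikit-learn 1.2

steps = [('reducer', PCA()),
        ('linear', LinearRegression(normalize=True))]

# instantiate pipeline object
pipe = Pipeline(steps)

# define gridsearch parameters dict

param_grid = {'reducer__n_components':[1,2,3,4,5,6,7]}

# separate target and features (no train/test split is made here)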
Y = df_train['Weekly_Sales']
X = df_train.drop(columns = ['Date','cci_value','Weekly_Sales'])
Z = column_transform_no_cci.fit(X)

# instantiate GridSearchCV object using pipeline, parameters dict
cv = TimeSeriesSplit(n_splits=5)
lr_grid_search = GridSearchCV(pipe, param_grid ,cv=cv, return_train_score = True)

# grid search fit
lr_grid_search.fit(Z.transform(X),Y)


GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('reducer', PCA()),
                                       ('linear',
                                        LinearRegression(normalize=True))]),
             param_grid={'reducer__n_components': [1, 2, 3, 4, 5, 6, 7]},
             return_train_score=True)

In [34]:
# cross validate gridsearchcv model
# fit column transformer using entire train set
Y = df_train['Weekly_Sales']
X = df_train.drop(columns = ['Date','cci_value','Weekly_Sales'])
Z = column_transform_no_cci.fit(df_train.drop(columns = ['Date','cci_value','Weekly_Sales']))

cv = TimeSeriesSplit(n_splits=5)
lr_gridsearch_scores = cross_validate(lr_grid_search,
                                      Z.transform(X), Y, cv=cv,
                        scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'))
print(lr_gridsearch_scores['test_neg_mean_squared_error'])
print(lr_gridsearch_scores['test_r2'])
print(lr_gridsearch_scores['test_neg_mean_absolute_error'])

# appending metrics to df_summary
scores(lr_gridsearch_scores)

score_dict=write('lr gridsearch', lr_gridsearch_scores)

df_summary=df_summary.append(score_dict, ignore_index=True)
df_summary

[-6.99638840e+08 -4.58622596e+08 -4.87113874e+08 -3.44185271e+08
 -3.64175858e+08]
[ 0.03721704  0.03279607 -0.01835642 -0.01559545  0.01785146]
[-18091.56783053 -14953.6799979  -16395.66055545 -13802.56299493
 -12969.80484779]


Unnamed: 0,name,model description,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_r2
0,,lr base no cci,19073410000000.0,1.298076e+27,1.931299e+18
1,,lr cci,4162970000000.0,3.802971e+25,8.225199e+16
2,,lr gridsearch,15242.66,470747300.0,0.02


# ElasticNet - GridSearchCV

## ElasticNet (no cci_value)  -  TimeSeriesSplit Cross Validation

In [35]:

# separate target and features
# cross validate base ElasticNet model, no cci_value

cv = TimeSeriesSplit(n_splits=5)

Y = df_train['Weekly_Sales']
X = df_train.drop(columns = ['Date','cci_value','Weekly_Sales'])
Z = column_transform_no_cci.fit(df_train.drop(columns = ['Date','cci_value','Weekly_Sales']))

elastic_no_cci_scores = cross_validate(ElasticNet(), Z.transform(X), Y, cv=cv,
                        scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'))

print("MSE scores: {}".format(elastic_no_cci_scores['test_neg_mean_squared_error']))
print("R2 scores: {}".format(elastic_no_cci_scores['test_r2']))
print("MAE scores: {}".format(elastic_no_cci_scores['test_neg_mean_absolute_error']))


MSE scores: [-6.79612232e+08 -4.39867174e+08 -4.58792047e+08 -3.22303939e+08
 -3.45738933e+08]
R2 scores: [0.06477594 0.07234998 0.04085297 0.0489703  0.06757414]
MAE scores: [-17510.12188566 -14594.33634497 -15757.47190423 -13029.54128439
 -13057.43660696]


In [36]:
# appending metrics to df_summary
scores(elastic_no_cci_scores)

score_dict=write('elasticnet_base', elastic_no_cci_scores)

df_summary=df_summary.append(score_dict, ignore_index=True)
df_summary

Unnamed: 0,name,model description,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_r2
0,,lr base no cci,19073410000000.0,1.298076e+27,1.931299e+18
1,,lr cci,4162970000000.0,3.802971e+25,8.225199e+16
2,,lr gridsearch,15242.66,470747300.0,0.02
3,,elasticnet_base,14789.78,449262900.0,0.06


## ElasticNet (cci_value)  -  TimeSeriesSplit Cross Validation

In [37]:
# separate target and features
# cross validate base ElasticNet model with cci_value

cv = TimeSeriesSplit(n_splits=5)

Y = df_train['Weekly_Sales']
X = df_train.drop(columns = ['Date','Weekly_Sales'])
Z = column_transform_cci.fit(X)

elastic_cci_scores = cross_validate(ElasticNet(), Z.transform(X), Y, cv=cv,
                        scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'))

print("MSE scores: {}".format(elastic_cci_scores['test_neg_mean_squared_error']))
print("R2 scores: {}".format(elastic_cci_scores['test_r2']))
print("MAE scores: {}".format(elastic_cci_scores['test_neg_mean_absolute_error']))

# appending metrics to df_summary
scores(elastic_cci_scores)

score_dict=write('elasticnet_cci', elastic_cci_scores)

df_summary=df_summary.append(score_dict, ignore_index=True)
df_summary

MSE scores: [-6.79524663e+08 -4.39791444e+08 -4.58813557e+08 -3.22290848e+08
 -3.45691994e+08]
R2 scores: [0.06489645 0.07250969 0.040808   0.04900893 0.06770073]
MAE scores: [-17512.47755491 -14590.81143557 -15758.00551711 -13028.20248363
 -13058.13549329]


Unnamed: 0,name,model description,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_r2
0,,lr base no cci,19073410000000.0,1.298076e+27,1.931299e+18
1,,lr cci,4162970000000.0,3.802971e+25,8.225199e+16
2,,lr gridsearch,15242.66,470747300.0,0.02
3,,elasticnet_base,14789.78,449262900.0,0.06
4,,elasticnet_cci,14789.53,449222500.0,0.06
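
The Part II versus Part III comparison for ElasticNet can be quantified directly from the two summary rows above. The sketch below hard-codes those reported values and computes the relative change when cci_value is added. The helper name pct_change is hypothetical.

```python
def pct_change(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0

# values copied from the elasticnet_base and elasticnet_cci rows of df_summary
base = {'mae': 14789.78, 'mse': 449262900.0}
cci = {'mae': 14789.53, 'mse': 449222500.0}

for metric in base:
    print(metric, round(pct_change(base[metric], cci[metric]), 4), '%')
```

Both changes are below 0.01 percent in magnitude, which matches the near-identical rows in the table.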


## ElasticNet(no cci_value) GridSearchCV

In [38]:
# create steps

steps = [#('transform', column_transform),
         ('linear',ElasticNet())]

#instantiate pipeline object
pipe = Pipeline(steps)

# separate target and features
# the column transformer is fit on the full training frame before cross validation
Y = df_train['Weekly_Sales']
X = df_train.drop(columns = ['Date','cci_value','Weekly_Sales'])
Z = column_transform_no_cci.fit(df_train.drop(columns = ['Date','cci_value','Weekly_Sales']))


# GridSearchCV parameters
param_grid = {'linear__alpha':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]}

# scorer names for the three metrics (defined for reference; not passed to GridSearchCV below)
scoring = {'R2': 'r2', 'MAE': 'neg_mean_absolute_error', 'MSE': 'neg_mean_squared_error'}

cv = TimeSeriesSplit(n_splits=5)
# instantiate GridSearchCV object using pipeline, parameters dict
elastic_grid_search = GridSearchCV(pipe, param_grid, cv=cv, return_train_score = True)
elastic_grid_search.fit(Z.transform(X), Y)



GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('linear', ElasticNet())]),
             param_grid={'linear__alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,
                                           0.8, 0.9, 1.0]},
             return_train_score=True)

In [39]:
print("Best parameters: {}".format(elastic_grid_search.best_params_))

Best parameters: {'linear__alpha': 0.1}
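
The scoring dictionary defined in the grid search cell is never passed to GridSearchCV, so the search above ranks alpha values by the estimator's default score (R2). If all three metrics are wanted from the search itself, scikit-learn accepts a dict of scorer names together with a refit argument naming the metric used to pick the best model. The sketch below runs on synthetic data from make_regression, which is an assumption for illustration and not the notebook's data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline

# synthetic regression problem standing in for the transformed training set
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([('linear', ElasticNet())])
param_grid = {'linear__alpha': [0.1, 0.5, 1.0]}
scoring = {'R2': 'r2',
           'MAE': 'neg_mean_absolute_error',
           'MSE': 'neg_mean_squared_error'}

# refit='R2' tells the search which metric selects best_params_
search = GridSearchCV(pipe, param_grid, cv=TimeSeriesSplit(n_splits=5),
                      scoring=scoring, refit='R2', return_train_score=True)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.cv_results_['mean_test_MAE'])
```

With multi-metric scoring, cv_results_ holds one mean_test_<name> entry per key in the scoring dict.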


In [40]:
# cross validate gridsearchcv model (nested cross validation)
# separate target and features
# the column transformer is fit on the full training frame before cross validation
Y = df_train['Weekly_Sales']
X = df_train.drop(columns = ['Date','cci_value','Weekly_Sales'])
Z = column_transform_no_cci.fit(df_train.drop(columns = ['Date','cci_value','Weekly_Sales']))

cv = TimeSeriesSplit(n_splits=5)
elastic_scores = cross_validate(elastic_grid_search, Z.transform(X), Y, cv=cv,
                        scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'), n_jobs=-1)
print(elastic_scores['test_neg_mean_squared_error'])
print(elastic_scores['test_r2'])
print(elastic_scores['test_neg_mean_absolute_error'])

[-5.50384932e+08 -3.31721904e+08 -3.64080390e+08 -2.42577721e+08
 -2.60902714e+08]
[0.24260747 0.30042102 0.23885641 0.2842203  0.29636956]
[-15532.07167359 -12113.45382472 -14056.32962458 -10797.74866351
 -10928.57994587]


In [41]:
# appending metrics to df_summary
scores(elastic_scores)

score_dict=write('elasticnet_gridsearch', elastic_scores)

df_summary=df_summary.append(score_dict, ignore_index=True)
df_summary

Unnamed: 0,name,model description,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_r2
0,,lr base no cci,19073410000000.0,1.298076e+27,1.931299e+18
1,,lr cci,4162970000000.0,3.802971e+25,8.225199e+16
2,,lr gridsearch,15242.66,470747300.0,0.02
3,,elasticnet_base,14789.78,449262900.0,0.06
4,,elasticnet_cci,14789.53,449222500.0,0.06
5,,elasticnet_gridsearch,12685.64,349933500.0,0.27


In [33]:
df_summary

Unnamed: 0,name,model description,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_r2
0,,lr base no cci,469952000000000.0,1.237278e+30,1.708826e+21
1,,lr cci,374348000000000.0,8.155701999999999e+29,1.124642e+21
2,,lr gridsearch,15203.59,470440800.0,0.02
3,,elasticnet_base,14798.65,449297200.0,0.06
4,,elasticnet_cci,14798.97,449276600.0,0.06
5,,elasticnet_gridsearch,12685.48,349886500.0,0.27


In [None]:
#write df_summary to df_summary_linear file 
# df_summary_linear to be used for results discussion in Capstone 3 Modeling - Results Plots.ipynb

df_summary.to_csv('df_summary_linear',index=False)