# Activity 1.01
## Adding regularization to the model

In this activity we will use the same logistic regression model from the scikit-learn package. This time, however, we will add regularization to the model and search for the optimum regularization parameter, a process often called hyperparameter tuning.

After training the models we will test the predictions and compare the model evaluation metrics to those produced by the baseline model and the model without regularization.

First, let's load the feature data from the first exercise and the target data from the second activity. The feature data from the second activity can also be used.

In [1]:
import pandas as pd
feats = pd.read_csv('../data/OSI_feats_e3.csv')
target = pd.read_csv('../data/OSI_target_e2.csv')

We will again create test and training datasets. We will train the model using the training dataset. This time, however, we will use part of the training dataset for validation in order to choose the most appropriate hyperparameter.

We will again use test_size = 0.2, which means that 20% of the data will be reserved for testing. The size of our validation set is determined by the number of validation folds. With 10-fold cross-validation, 10% of the training dataset is held out for validation in each fold. Each fold uses a different 10% of the training dataset, and the average error across all folds is used to compare models with different hyperparameters.

In [2]:
from sklearn.model_selection import train_test_split
test_size = 0.2
random_state = 13
X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size=test_size, random_state=random_state)

Let's make sure our dimensions are correct.

In [3]:
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of X_test: {X_test.shape}')
print(f'Shape of y_test: {y_test.shape}')

Shape of X_train: (9864, 68)
Shape of y_train: (9864, 1)
Shape of X_test: (2466, 68)
Shape of y_test: (2466, 1)
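
As a rough check on the 10-fold arithmetic described earlier, the sketch below computes the validation-fold sizes for the 9864 training rows printed above. The helper `kfold_sizes` is a hypothetical name used only for illustration, and it assumes a plain, unstratified split. For classifiers scikit-learn uses stratified folds when `cv` is an integer, so the exact sizes in the fitted model may differ slightly.

```python
def kfold_sizes(n_samples, n_splits):
    """Return validation-fold sizes for a plain (unstratified) k-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds take one additional row so every row is used once.
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_train = 9864  # rows in X_train, taken from the shapes printed above
fold_sizes = kfold_sizes(n_train, n_splits=10)

print(fold_sizes)
print(f'Validation share in the first fold: {fold_sizes[0] / n_train:.1%}')
print(f'Rows left for fitting in the first fold: {n_train - fold_sizes[0]}')
```

Each fold holds out roughly a tenth of the training rows, and the remaining rows are used to fit the model for that fold.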


We fit our model first by instantiating it, then by fitting it to the training data.
We also add a penalty, specified as 'l1' or 'l2'. Our goal is to find the penalty type and penalty value that give us the best results.

To get a reminder of how to use certain functions, we can always use the help function to look at the details.

In [4]:
from sklearn.linear_model import LogisticRegressionCV
help(LogisticRegressionCV)

Help on class LogisticRegressionCV in module sklearn.linear_model._logistic:

class LogisticRegressionCV(LogisticRegression, sklearn.base.BaseEstimator, sklearn.linear_model._base.LinearClassifierMixin)
 |  LogisticRegressionCV(Cs=10, fit_intercept=True, cv=None, dual=False, penalty='l2', scoring=None, solver='lbfgs', tol=0.0001, max_iter=100, class_weight=None, n_jobs=None, verbose=0, refit=True, intercept_scaling=1.0, multi_class='auto', random_state=None, l1_ratios=None)
 |  
 |  Logistic Regression CV (aka logit, MaxEnt) classifier.
 |  
 |  See glossary entry for :term:`cross-validation estimator`.
 |  
 |  This class implements logistic regression using liblinear, newton-cg, sag
 |  of lbfgs optimizer. The newton-cg, sag and lbfgs solvers support only L2
 |  regularization with primal formulation. The liblinear solver supports both
 |  L1 and L2 regularization, with a dual formulation only for the L2 penalty.
 |  Elastic-Net penalty is only supported by the saga solver.
 |  
 |  

In [5]:
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
Cs = np.logspace(-2, 6, 9)
model_l1 = LogisticRegressionCV(Cs=Cs, penalty='l1', cv=10, solver='liblinear', random_state=42, max_iter=10000)
model_l2 = LogisticRegressionCV(Cs=Cs, penalty='l2', cv=10, random_state=42, max_iter=10000)

model_l1.fit(X_train, y_train['Revenue'])
model_l2.fit(X_train, y_train['Revenue'])

LogisticRegressionCV(Cs=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05,
       1.e+06]),
                     class_weight=None, cv=10, dual=False, fit_intercept=True,
                     intercept_scaling=1.0, l1_ratios=None, max_iter=10000,
                     multi_class='auto', n_jobs=None, penalty='l2',
                     random_state=42, refit=True, scoring=None, solver='lbfgs',
                     tol=0.0001, verbose=0)

To test the model performance we will predict the outcome on the test features (X_test), and compare those outcomes to real values (y_test) for each of the models.

With the 'LogisticRegressionCV' model, once the model is fit, the hyperparameter that resulted in the lowest error is automatically chosen for any subsequent predictions.

We can also look at what the hyperparameters are.

In [6]:
print(f'Best hyperparameter for l1 regularization model: {model_l1.C_[0]}')
print(f'Best hyperparameter for l2 regularization model: {model_l2.C_[0]}')

Best hyperparameter for l1 regularization model: 1000000.0
Best hyperparameter for l2 regularization model: 1.0
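
To help read these values: scikit-learn documents `C` as the inverse of regularization strength, so smaller values mean stronger regularization. As a sketch, with labels coded as y_i in {-1, +1}, weights w, and intercept b (notation chosen here for illustration), the penalized objectives can be written in a form equivalent to the one given in the scikit-learn user guide:

```latex
\min_{w,\,b}\; \|w\|_1 + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(x_i^{\top} w + b)\right)\right)
\qquad \text{(l1 penalty)}

\min_{w,\,b}\; \tfrac{1}{2}\, w^{\top} w + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(x_i^{\top} w + b)\right)\right)
\qquad \text{(l2 penalty)}
```

Because `C` multiplies the data-fit term, a value such as 1000000.0 makes the penalty almost negligible, whereas a value of 1.0 gives the penalty considerably more weight.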


In [7]:
y_pred_l1 = model_l1.predict(X_test)
y_pred_l2 = model_l2.predict(X_test)

We can again compare the accuracy against the true values. Remember that our baseline model (predicting 'no' or False for all) gave us an accuracy of 70.27972%, so we want to try and beat that.

In [8]:
from sklearn import metrics
accuracy_l1 = metrics.accuracy_score(y_pred=y_pred_l1, y_true=y_test)
accuracy_l2 = metrics.accuracy_score(y_pred=y_pred_l2, y_true=y_test)
print(f'Accuracy of the model with l1 regularization is {accuracy_l1*100:.4f}%')
print(f'Accuracy of the model with l2 regularization is {accuracy_l2*100:.4f}%')

Accuracy of the model with l1 regularization is 89.2133%
Accuracy of the model with l2 regularization is 89.2944%


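They both performed better than the baseline model, and the model with l1 regularization performed about as well as the model without regularization. The latter is consistent with the selected value of C = 1000000.0 for that model, since in scikit-learn a larger C means weaker regularization.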

It's worth noting that due to the small amount of data it may be difficult to get an accurate model.
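
For reference, the baseline comparison can be sketched as follows: a majority-class baseline always predicts the most frequent label, so its accuracy equals that label's share of the data. The function name and the toy labels below are hypothetical and are not the activity's data.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(y_true)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(y_true)

# Hypothetical labels for illustration only (not the activity's data).
toy_labels = [False] * 8 + [True] * 2
majority_class, baseline = majority_baseline_accuracy(toy_labels)
print(f'Always predicting {majority_class} scores {baseline:.1%}')
```

A trained model is only useful on accuracy if it scores above this figure, which is why accuracy alone can be misleading on imbalanced targets.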

### Other evaluation metrics

Let's test again with the other evaluation metrics.

In [9]:
precision_l1, recall_l1, fscore_l1, _ = metrics.precision_recall_fscore_support(y_pred=y_pred_l1, y_true=y_test, average='binary')
precision_l2, recall_l2, fscore_l2, _ = metrics.precision_recall_fscore_support(y_pred=y_pred_l2, y_true=y_test, average='binary')
print(f'l1\nPrecision: {precision_l1:.4f}\nRecall: {recall_l1:.4f}\nfscore: {fscore_l1:.4f}\n\n')
print(f'l2\nPrecision: {precision_l2:.4f}\nRecall: {recall_l2:.4f}\nfscore: {fscore_l2:.4f}')

l1
Precision: 0.7300
Recall: 0.4078
fscore: 0.5233


l2
Precision: 0.7350
Recall: 0.4106
fscore: 0.5269


Overall, the model with l2 regularization performs slightly better.
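
As a quick consistency check, the F-score reported above (with the default beta of 1) is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the output. The helper name is hypothetical, and because the inputs are rounded the results should match the reported F-scores only to about three decimal places.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) values copied from the output above.
reported = {'l1': (0.7300, 0.4078), 'l2': (0.7350, 0.4106)}
f1_scores = {name: f1_from_precision_recall(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_scores.items():
    print(f'{name}: F1 = {f1:.4f}')
```

The low recall in both models pulls the F-score well below the precision, which the accuracy figure alone does not reveal.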

### Feature importances
   
Examining the feature importances can show us how the regularization affected the values of the coefficients.

In [10]:
coef_list = [f'{feature}: {coef}' for coef, feature in sorted(zip(model_l1.coef_[0], X_train.columns.values.tolist()))]
for item in coef_list:
    print(item)

ExitRates: -15.883224778812533
TrafficType_15: -14.795568293882813
TrafficType_19: -14.572417554129022
Browser_11: -14.194513734001333
TrafficType_18: -13.580838983781868
TrafficType_12: -10.887010740988496
Browser_3: -1.591759861518971
OperatingSystems_6: -1.3061883214341672
Browser_13: -1.1580420838412984
TrafficType_13: -1.1370525347807676
TrafficType_14: -1.0975546272442658
Browser_6: -1.0010333108800196
BounceRates: -0.9141131252024657
Browser_7: -0.8502023615800486
TrafficType_3: -0.811958144680347
TrafficType_6: -0.610088799226249
OperatingSystems_3: -0.608450446552214
TrafficType_1: -0.6050916722027272
OperatingSystems_1: -0.5365902656959869
OperatingSystems_4: -0.5111451125487375
TrafficType_4: -0.5062856279182754
Browser_2: -0.5059966973246434
TrafficType_2: -0.46430066907789846
TrafficType_9: -0.4466955982654509
Browser_4: -0.4214541023779841
Browser_5: -0.41616336572267953
TrafficType_5: -0.415441717179765
TrafficType_7: -0.3955699524561608
Browser_1: -0.3880867021005534
Br

l1 regularization tends to send coefficients all the way down to zero, and is useful for reducing the total number of features in a training dataset. Here we can see some columns are very close to zero. 
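
One way to make this concrete is to separate coefficients that are effectively zero from those that are retained. The sketch below uses hypothetical feature names and values, and the helper `split_by_magnitude` and its tolerance are assumptions for illustration. With the fitted model, the pairs could instead be built from `X_train.columns` and `model_l1.coef_[0]`.

```python
def split_by_magnitude(named_coefs, tol=1e-6):
    """Separate coefficients into effectively-zero and retained groups."""
    dropped = [name for name, coef in named_coefs if abs(coef) <= tol]
    kept = [name for name, coef in named_coefs if abs(coef) > tol]
    return dropped, kept

# Hypothetical (feature, coefficient) pairs, for illustration only.
toy_coefs = [
    ('feature_a', -1.59),
    ('feature_b', 0.0),
    ('feature_c', 3e-9),
    ('feature_d', 0.42),
]
dropped, kept = split_by_magnitude(toy_coefs)
print(f'Effectively zero: {dropped}')
print(f'Retained: {kept}')
```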

Let's take a look at the model coefficients for the model with l2 regularization.

In [11]:
coef_list = [f'{feature}: {coef}' for coef, feature in sorted(zip(model_l2.coef_[0], X_train.columns.values.tolist()))]
for item in coef_list:
    print(item)

TrafficType_13: -0.34009141528704256
Month_May: -0.2996216354738292
TrafficType_3: -0.26886567044331466
Month_Dec: -0.2600371384272894
Month_Mar: -0.234528066357572
VisitorType_Returning_Visitor: -0.2092729117037135
ExitRates: -0.20125071612863224
OperatingSystems_3: -0.17171727353008637
BounceRates: -0.15860752704242273
SpecialDay: -0.15806189654702138
TrafficType_1: -0.14304912321990237
Browser_6: -0.10194914517637654
Region_4: -0.09904631488404457
Region_9: -0.09873197637696395
Browser_3: -0.08914198358022751
Month_June: -0.042681926326746056
TrafficType_15: -0.04024689666293222
Browser_7: -0.03327911496921953
TrafficType_6: -0.031102643479272572
Region_1: -0.027052218462517107
OperatingSystems_6: -0.02574568309220721
TrafficType_19: -0.02203436126130139
Browser_13: -0.021797739222369092
Browser_2: -0.017512124173199008
Region_7: -0.010600896457621618
TrafficType_14: -0.009414755781185648
TrafficType_18: -0.008565351627323418
Browser_11: -0.007825344595778813
Region_3: -0.0072270657

Here we can see that none of the coefficients go all the way down to zero. Exact zeros are rare when applying l2 regularization because the penalty grows with the square of each coefficient: coefficients are penalized very little when they are small and much more heavily when they are large.
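
The different behavior of the two penalties can be illustrated on a one-dimensional toy problem. This is a sketch only and is not how the scikit-learn solvers work internally. For an unpenalized estimate `z` and a penalty weight `lam` (both hypothetical names), minimizing `0.5*(w - z)**2 + lam*abs(w)` gives soft-thresholding, while minimizing `0.5*(w - z)**2 + 0.5*lam*w**2` gives proportional shrinkage.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if abs(z) <= lam:
        return 0.0  # small estimates are set exactly to zero
    return z - lam if z > 0 else z + lam

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)

lam = 0.5
for z in [-2.0, -0.3, 0.1, 1.0]:
    print(f'z={z:+.1f}  l1 -> {l1_shrink(z, lam):+.3f}  l2 -> {l2_shrink(z, lam):+.3f}')
```

In this toy setting the l1 rule zeroes any estimate whose magnitude is at most `lam`, while the l2 rule scales every estimate by the same factor and so never reaches zero for a nonzero input.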

In this activity we have seen how to create models that include regularization. While the regularization added little to model performance in this dataset, regularization is an important technique for preventing your models from overfitting to the training dataset.

In the following lesson we will apply many of the same techniques learned in this lesson, namely creating test and training datasets, performing cross-validation, and using model evaluation metrics to score our models. This time, however, we will apply them to deep learning models using the Keras library.