In [1]:
import numpy as np
from numpy import random
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import SGDClassifier
from sklearn import neighbors, datasets
from sklearn.model_selection import GridSearchCV

### Regularization

#### Overfitting

When a model has high variance, it is said to *overfit* the data.  Overfitting is an issue because the model will not *generalize* well to new data.

#### Parameter Shrinkage

Shrinking the parameters $\theta = (\beta_0, \beta_1,...,\beta_n)^T$ of a linear model close (or equal) to $0$ can significantly reduce model variance and decrease overfitting.

*Regularization* is a technique that can accomplish *parameter shrinkage*.  In general, the loss function minimized by SGD with a regularization term has the form

$$L_{GEN} = \frac{1}{N}\sum_{i=1}^NL(y_i, \hat{y}_i)+\alpha R(\theta) $$ 

where $\alpha \geq 0$ is a *hyperparameter* and $R(\theta) \geq 0$ is the regularization function.  Note that $R$ is a function of the parameters $\theta$ and that there are different parameter estimates $\hat{\beta}_i$ for each value of $\alpha$. 

The two most common regularization functions are:

+ Ridge: $R(\theta) = \sum_{i=0}^{n}\beta_i^2$.
    + This is also called the $l_2$ penalty, since $R(\theta) = ||\theta||_2^2$ is the square of the $l_2$ (or Euclidean) norm.

+ Lasso: $R(\theta) = \sum_{i=0}^{n}|\beta_i|$.
    + This is also called the $l_1$ penalty, since $R(\theta) = ||\theta||_1$ is the $l_1$ (or Manhattan) norm.
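As a quick numerical illustration of the two penalties, the sketch below evaluates both on a small parameter vector. The values in `theta` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical parameter vector, chosen only for illustration.
theta = np.array([0.5, -2.0, 0.0, 3.0])

# Ridge penalty: sum of squared parameters (squared l2 norm).
ridge_penalty = np.sum(theta ** 2)

# Lasso penalty: sum of absolute values (l1 norm).
lasso_penalty = np.sum(np.abs(theta))

print(ridge_penalty)  # 13.25
print(lasso_penalty)  # 5.5
```

Note that the ridge penalty punishes the large entry $3.0$ much more heavily than the lasso penalty does, since it squares each parameter.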

So, how does the addition of the term $\alpha R(\theta)$ shrink the parameters $\hat{\beta}_i$ close to $0$?

We are trying to find values of $\theta = (\beta_0, \beta_1,...,\beta_n)^T$ that make $L_{GEN}$ as small as possible.  However, it is possible to find a value of $\theta$ that makes the average loss $\frac{1}{N}\sum_{i=1}^NL(y_i, \hat{y}_i)$ small while some of $\beta_0, \beta_1,...,\beta_n$ are large in magnitude.  The *shrinkage penalty* $\alpha R(\theta)$ penalizes this situation by making 

$$L_{GEN} = \frac{1}{N}\sum_{i=1}^NL(y_i, \hat{y}_i)+\alpha R(\theta) $$

bigger.  Thus, the values $\beta_0, \beta_1,...,\beta_n$ that make

$$L_{GEN} = \frac{1}{N}\sum_{i=1}^NL(y_i, \hat{y}_i)+\alpha R(\theta) $$

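as small as possible must keep both $\frac{1}{N}\sum_{i=1}^NL(y_i, \hat{y}_i)$ and $\alpha R(\theta)$ small.  The larger $\alpha$ is, the more weight the penalty carries, and the closer the parameter estimates are pushed toward $0$.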
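The shrinkage effect can be checked numerically. The sketch below is an illustration under stated assumptions, not the SGD procedure used in this notebook: it takes $L$ to be the squared error and $R$ to be the ridge penalty, because in that case the minimizer of $L_{GEN}$ has a closed form. The synthetic data, `true_theta`, and the helper `ridge_solution` are all hypothetical, and the intercept is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only).
N, d = 50, 3
X = rng.normal(size=(N, d))
true_theta = np.array([4.0, -3.0, 2.0])
y = X @ true_theta + rng.normal(scale=0.5, size=N)

def ridge_solution(X, y, alpha):
    """Minimizer of (1/N) * ||y - X theta||^2 + alpha * ||theta||_2^2."""
    N, d = X.shape
    # Setting the gradient to zero gives (X^T X + N*alpha*I) theta = X^T y.
    return np.linalg.solve(X.T @ X + N * alpha * np.eye(d), X.T @ y)

alphas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge_solution(X, y, a)) for a in alphas]

for a, nrm in zip(alphas, norms):
    print(f"alpha = {a:>5}: ||theta|| = {nrm:.3f}")
```

The printed norms decrease as $\alpha$ grows, which is the parameter shrinkage described above.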

##### Example 1

We train a multiclass logistic regression classifier on the Iris dataset.

In [2]:
iris = datasets.load_iris()
data = pd.DataFrame(iris.data, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
data['Label'] = iris.target

In [3]:
data.head()

| | sepal_length | sepal_width | petal_length | petal_width | Label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


##### Feature Scaling

In [4]:
cols_to_scale = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [5]:
scaler=StandardScaler()
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])

##### Modelling

In [6]:
X = data.iloc[:, :-1].to_numpy()
y = data.iloc[:, -1:].to_numpy().ravel()

In [7]:
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.25, random_state=250)

We experiment with different values of $\alpha$.  The cell below shows one choice: the $l_1$ penalty with $\alpha = 0.01$.

In [45]:
model = SGDClassifier(loss = 'log_loss', penalty = 'l1', alpha = 0.01)
model.fit(X_train,y_train)

In [46]:
y_pred = model.predict(X_test)
(y_pred == y_test).sum()/len(y_test)

0.868421052631579

$\Box$

#### Hyperparameter Selection

As we saw in Example 1, different choices for the penalty and alpha give different model performance.

We could guess and check to find semi-optimal values for our hyperparameters $R(\theta)$ and $\alpha$, but a better way is to use *cross validation*.  k-fold cross validation is discussed [here](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).  

There are many cross validation techniques that can be used to select optimum hyperparameters, but in this class we will only look at *grid-search cross validation*.  The documentation is given [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV).

##### Example 2

We use scikit-learn's GridSearchCV class to select the best values to use for the penalty and alpha hyperparameters.

First, we use a dictionary to specify what values to use for which hyperparameters.

In [34]:
parameters = {'penalty':('l1', 'l2'), 'alpha':[0.0001, 0.001, 0.01, 0.1, 1.0, 2.0, 3.0, 4.0, 5.0]}

Next, we specify our model.

In [47]:
model = SGDClassifier(loss = 'log_loss')

In [48]:
clf = GridSearchCV(model, parameters)

In [49]:
clf.fit(X_train,y_train)

The *cv_results_* attribute gives many interesting results from the cross validation process as a dictionary.

We can pass this info to a Pandas DataFrame to display the results.

In [38]:
pd.DataFrame(clf.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_alpha | param_penalty | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005984 | 0.004554 | 0.001583 | 0.001104 | 0.0001 | l1 | {'alpha': 0.0001, 'penalty': 'l1'} | 0.956522 | 0.956522 | 0.909091 | 0.909091 | 0.818182 | 0.909881 | 0.050519 | 2 |
| 1 | 0.004178 | 0.002431 | 0.000903 | 0.000801 | 0.0001 | l2 | {'alpha': 0.0001, 'penalty': 'l2'} | 0.956522 | 0.956522 | 0.909091 | 0.909091 | 0.818182 | 0.909881 | 0.050519 | 2 |
| 2 | 0.007878 | 0.00711 | 0.0 | 0.0 | 0.001 | l1 | {'alpha': 0.001, 'penalty': 'l1'} | 0.956522 | 0.956522 | 0.909091 | 0.909091 | 0.818182 | 0.909881 | 0.050519 | 2 |
| 3 | 0.003169 | 0.006337 | 0.0 | 0.0 | 0.001 | l2 | {'alpha': 0.001, 'penalty': 'l2'} | 0.956522 | 0.956522 | 0.909091 | 0.909091 | 0.818182 | 0.909881 | 0.050519 | 2 |
| 4 | 0.006272 | 0.007681 | 0.0 | 0.0 | 0.01 | l1 | {'alpha': 0.01, 'penalty': 'l1'} | 1.0 | 0.956522 | 0.909091 | 0.909091 | 0.863636 | 0.927668 | 0.046593 | 1 |
| 5 | 0.003124 | 0.006249 | 0.0 | 0.0 | 0.01 | l2 | {'alpha': 0.01, 'penalty': 'l2'} | 0.956522 | 0.913043 | 0.909091 | 0.863636 | 0.818182 | 0.892095 | 0.047226 | 7 |
| 6 | 0.004297 | 0.006361 | 0.003141 | 0.006282 | 0.1 | l1 | {'alpha': 0.1, 'penalty': 'l1'} | 0.956522 | 0.956522 | 0.863636 | 0.863636 | 0.863636 | 0.900791 | 0.045504 | 6 |
| 7 | 0.0001 | 0.000199 | 0.003305 | 0.00661 | 0.1 | l2 | {'alpha': 0.1, 'penalty': 'l2'} | 0.956522 | 0.913043 | 0.909091 | 0.818182 | 0.818182 | 0.883004 | 0.055483 | 8 |
| 8 | 0.003168 | 0.006336 | 0.0 | 0.0 | 1.0 | l1 | {'alpha': 1.0, 'penalty': 'l1'} | 0.304348 | 0.304348 | 0.318182 | 0.363636 | 0.363636 | 0.33083 | 0.027258 | 15 |
| 9 | 0.003165 | 0.006331 | 0.0 | 0.0 | 1.0 | l2 | {'alpha': 1.0, 'penalty': 'l2'} | 0.956522 | 0.956522 | 0.863636 | 0.772727 | 0.818182 | 0.873518 | 0.073618 | 9 |


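The *best_params_* attribute gives the hyperparameter combination with the highest mean cross-validation score.  Because SGDClassifier shuffles the training data and no random_state is set here, results vary from run to run, so the values below (from a later run) need not match the table above.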

In [50]:
clf.best_params_

{'alpha': 0.001, 'penalty': 'l2'}

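The *best_score_* attribute gives the mean cross-validation score of the best hyperparameter combination, which here is the accuracy averaged over the 5 default folds of the training set.  This should not be confused with the model's accuracy on the held-out test set.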

In [51]:
clf.best_score_

0.9549407114624507

We now compute the accuracy on the test set.

In [52]:
y_pred = clf.predict(X_test)

In [53]:
(y_pred == y_test).sum()/len(y_test)

0.9210526315789473

$\Box$

##### Example 3

Put the best hyperparameters found in Example 2 into the classifier from Example 1.

Run Example 2 again without constant learning rate.