## Regularized classification in scikit-learn

- Wine dataset from the UCI Machine Learning Repository: [data](http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data), [data dictionary](http://archive.ics.uci.edu/ml/datasets/Wine)
- **Goal:** Predict the origin of wine using chemical analysis

### Load and prepare the wine dataset

In [24]:
# read in the dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
wine = pd.read_csv(url, header=None)
wine.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [25]:
# examine the response variable


2    71
1    59
3    48
dtype: int64

In [26]:
# define X and y


In [27]:
# split into training and testing sets
from sklearn.cross_validation import train_test_split


### Logistic regression (unregularized)

In [1]:
# build a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)


In [29]:
# examine the coefficients


[[ -4.15397186e+00   4.90075538e+00   1.53790183e+01  -2.85235942e+00
    1.10505593e-01  -5.30164453e+00   1.05110893e+01   3.06125855e+00
   -1.18150745e+01  -2.43836649e+00  -1.80124471e+00   7.09007536e+00
    6.93945792e-02]
 [  5.43369566e+00  -5.23569200e+00  -1.67764364e+01   1.64044307e+00
    5.76913231e-03   2.82443590e+00   4.90939605e+00   2.45082648e+00
    6.08259423e+00  -7.85245282e+00   3.48485003e+00  -7.70842670e+00
   -4.90936762e-02]
 [ -9.70207612e-01   2.08269720e+00   9.41526165e-01   2.38023989e-01
   -2.49596817e-03  -9.80981243e-01  -6.54889317e+00  -4.83302817e-01
   -2.65888456e+00   2.57458669e+00  -1.30417032e+00  -2.34300175e+00
    9.48521802e-03]]


In [2]:
# generate predicted probabilities


In [3]:
# calculate log loss


### Logistic regression (regularized)

- [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) documentation
- **C:** must be positive, decrease for more regularization
- **penalty:** l1 (lasso) or l2 (ridge)

In [32]:
# standardize X_train and X_test
from sklearn.preprocessing import StandardScaler


In [4]:
# try C=0.1 with L1 penalty and do a Fit and Predict
logreg = LogisticRegression(C=0.1, penalty='l1')


In [34]:
# generate predicted probabilities and calculate log loss


0.362248219747


In [5]:
# try C=0.1 with L2 penalty and do a Fit and Predict
logreg = LogisticRegression(C=0.1, penalty='l2')


In [36]:
# generate predicted probabilities and calculate log loss


0.244588324539


- [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html): chain steps together
- [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html): search a grid of parameters

In [37]:
# pipeline of StandardScaler and LogisticRegression
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())

In [38]:
# grid search for best combination of C and penalty
from sklearn.grid_search import GridSearchCV
C_range = 10.**np.arange(-2, 3)
penalty_options = ['l1', 'l2']
param_grid = dict(logisticregression__C=C_range, logisticregression__penalty=penalty_options)
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='log_loss')
grid.fit(X, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'logisticregression__penalty': ['l1', 'l2'], 'logisticregression__C': array([  1.00000e-02,   1.00000e-01,   1.00000e+00,   1.00000e+01,
         1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring='log_loss', verbose=0)

In [6]:
# print all log loss scores using grid_scores_


In [40]:
# examine the best model
print grid.best_score_
print grid.best_params_

-0.0583689728556
{'logisticregression__penalty': 'l1', 'logisticregression__C': 10.0}


## Comparing regularized linear models with unregularized linear models

**Advantages of regularized linear models:**

- Better performance
- L1 regularization performs automatic feature selection
- Useful for high-dimensional problems (p > n)

**Disadvantages of regularized linear models:**

- Tuning is required
- Feature scaling is recommended
- Less interpretable (due to feature scaling)