# Intro Model Selection


## Scikit-learn pipelines
- Scikit-learn pipelines are a convenient and powerful concept, and one of the things that sets scikit-learn apart from other machine learning libraries.
- Pipelines let us chain a series of preprocessing steps together with fitting an estimator.
- Pipelines automatically take care of pitfalls such as information leakage during feature scaling: the scaling parameters are estimated from the training set only and then reused to scale new data (which we discussed earlier in the context of z-score standardization).
- Below is a visualization of how pipelines work.

<img src="https://github.com/rasbt/stat451-machine-learning-fs20/raw/ee813e1c30a5610a2e6475a77c67c1174a63b75c/L05/code/images/sklearn-pipeline.png" width="400">

Below is an example pipeline that combines the feature scaling step, PCA and the kNN classifier.

In [None]:
import numpy as np

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
iris = datasets.load_iris()

X = iris.data
y = iris.target

In [None]:
X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=1, stratify=y)

In [None]:
pipe = Pipeline([
        ('z-score', StandardScaler()),
        ('reduce_dim', PCA(n_components=3)),
        ('classify', KNeighborsClassifier(n_neighbors=1))])

In [None]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('z-score', StandardScaler()),
                ('reduce_dim', PCA(n_components=3)),
                ('classify', KNeighborsClassifier(n_neighbors=1))])

In [None]:
from sklearn.metrics import accuracy_score

y_train_pred = pipe.predict(X_train)
accuracy_score(y_train, y_train_pred)

1.0

In [None]:
from sklearn.metrics import accuracy_score

y_test_pred = pipe.predict(X_test)
accuracy_score(y_test, y_test_pred)

0.9333333333333333

As you can see above, the Pipeline itself follows the scikit-learn estimator API: it is trained with `fit` and makes predictions with `predict`, just like a single estimator.
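To make explicit what the pipeline does internally, here is a minimal sketch of the manual equivalent, assuming the same iris split and the same three steps as above. The variable names (`scaler`, `pca`, `knn`, `manual_pred`) are illustrative only. Each transformer is fit on the training set, and the test set is only ever transformed with the parameters learned from the training data.

```python
import numpy as np

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1, stratify=y)

# Manual version: fit every transformer on the training data only
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train_std)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)

# New data is only transformed, reusing the training-set parameters
X_test_pca = pca.transform(scaler.transform(X_test))
manual_pred = knn.predict(X_test_pca)

# Pipeline version: the same steps wrapped in one estimator
pipe = Pipeline([
    ('z-score', StandardScaler()),
    ('reduce_dim', PCA(n_components=3)),
    ('classify', KNeighborsClassifier(n_neighbors=1))])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both approaches should yield identical test set predictions
print(np.array_equal(manual_pred, pipe_pred))
```

The manual version is easy to get wrong, for example by accidentally calling `fit_transform` on the test set. The pipeline removes that risk by construction.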

## Scikit-learn grid-search

- In machine learning practice, we often need to experiment with a machine learning algorithm's hyperparameters to find a good setting.
- The process of tuning hyperparameters and comparing and selecting the resulting models is also called "model selection" (in contrast to "algorithm selection").
- We will cover topics such as "model selection" and "algorithm selection" in more detail later in this course.
- For now, we are introducing the simplest way of performing model selection: using the "holdout method."
- In the holdout method, we split a dataset into 3 subsets: a training, a validation, and a test dataset.
- To avoid biasing the estimate of the generalization performance, we only want to use the test dataset once, which is why we use the validation dataset for hyperparameter tuning (model selection).
- Here, the performance on the validation dataset also serves as an estimate of the generalization performance, but it is more optimistically biased than the final estimate on the test data because the validation dataset is reused repeatedly during model selection (think of "multiple hypothesis testing").

<img src="https://github.com/rasbt/stat451-machine-learning-fs20/raw/ee813e1c30a5610a2e6475a77c67c1174a63b75c/L05/code/images/holdout-tuning.png" width="400">
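The holdout procedure described above can be sketched by hand as follows. This is a minimal illustration, not the course's reference implementation: the 60/20/20 split proportions, the candidate values for `n_neighbors`, and the choice to refit the selected model on the combined training and validation data are all assumptions made for this example.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

# Step 1: set aside the test set (20%), then split the rest into
# training (60% of all data) and validation (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp)


def make_model(k):
    # Hypothetical helper: a scaling + kNN pipeline for a given k
    return Pipeline([
        ('z-score', StandardScaler()),
        ('classify', KNeighborsClassifier(n_neighbors=k))])


# Step 2: model selection, comparing settings on the validation set
valid_scores = {}
for k in [1, 3, 5, 7]:
    model = make_model(k).fit(X_train, y_train)
    valid_scores[k] = accuracy_score(y_valid, model.predict(X_valid))

best_k = max(valid_scores, key=valid_scores.get)

# Step 3: refit the selected setting on training + validation data
# and use the test set exactly once for the final estimate
final_model = make_model(best_k).fit(X_temp, y_temp)
test_acc = accuracy_score(y_test, final_model.predict(X_test))

print(valid_scores)
print(best_k, test_acc)
```

The test set is touched only in the last step, so its accuracy is not affected by the repeated comparisons made on the validation set.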

In [None]:
param_grid = {
    'reduce_dim__n_components': [1, 2, 3, 4],
    'classify__n_neighbors': [2, 3, 4, 5]
}

grid = GridSearchCV(pipe, cv=2, n_jobs=1, param_grid=param_grid, scoring='accuracy')

In [None]:
grid.fit(X_train, y_train)

GridSearchCV(cv=2, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('z-score',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('reduce_dim',
                                        PCA(copy=True, iterated_power='auto',
                                            n_components=3, random_state=None,
                                            svd_solver='auto', tol=0.0,
                                            whiten=False)),
                                       ('classify',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                             

In [None]:
print(grid.cv_results_)

{'mean_fit_time': array([0.00215685, 0.00161123, 0.00155306, 0.00156486, 0.00198913,
       0.00156486, 0.00145936, 0.00168526, 0.00168669, 0.00142121,
       0.00150418, 0.00141299, 0.00137055, 0.00140047, 0.0014323 ,
       0.00141287]), 'std_fit_time': array([5.41329384e-04, 6.43730164e-05, 2.38418579e-06, 9.26256180e-05,
       4.93526459e-04, 1.56044960e-04, 5.36441803e-05, 2.14219093e-04,
       2.21371651e-04, 2.02655792e-05, 2.62260437e-05, 2.41994858e-05,
       2.70605087e-05, 1.47819519e-05, 5.37633896e-05, 4.17232513e-05]), 'mean_score_time': array([0.00443637, 0.0036118 , 0.00266862, 0.00273967, 0.00266659,
       0.0026015 , 0.00256932, 0.0033294 , 0.00295794, 0.00275695,
       0.00273621, 0.00272202, 0.00253725, 0.00274074, 0.00270486,
       0.00268829]), 'std_score_time': array([7.32779503e-04, 9.33170319e-04, 4.33921814e-05, 1.34229660e-04,
       3.27825546e-05, 1.37090683e-05, 2.02655792e-06, 4.45008278e-04,
       3.64661217e-04, 1.69634819e-04, 1.84774399e-05, 9.

In [None]:
grid.cv_results_['mean_test_score']

array([0.875     , 0.91666667, 0.91666667, 0.90833333, 0.90833333,
       0.93333333, 0.95      , 0.94166667, 0.9       , 0.925     ,
       0.94166667, 0.94166667, 0.90833333, 0.925     , 0.95      ,
       0.93333333])
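The raw `mean_test_score` array is hard to read without knowing which hyperparameter setting each entry belongs to. The sketch below, which assumes the same pipeline and grid as above, pairs each score with its entry in `cv_results_['params']`. The variable names `results` and `mean_scores` are illustrative.

```python
import numpy as np

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1, stratify=y)

pipe = Pipeline([
    ('z-score', StandardScaler()),
    ('reduce_dim', PCA(n_components=3)),
    ('classify', KNeighborsClassifier(n_neighbors=1))])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    'reduce_dim__n_components': [1, 2, 3, 4],
    'classify__n_neighbors': [2, 3, 4, 5]
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring='accuracy')
grid.fit(X_train, y_train)

results = grid.cv_results_
mean_scores = results['mean_test_score']

# One line per hyperparameter combination: score, spread, and setting
for params, mean, std in zip(results['params'],
                             mean_scores,
                             results['std_test_score']):
    print(f"{mean:.3f} +/- {std:.3f}  {params}")

# best_index_ points at the row that produced best_score_
print(results['params'][grid.best_index_])
```

The grid has 4 x 4 = 16 combinations, which is why the score array has 16 entries.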

In [None]:
print(grid.best_score_)
print(grid.best_params_)

0.95
{'classify__n_neighbors': 3, 'reduce_dim__n_components': 3}


In [None]:
clf = grid.best_estimator_

In [None]:
y_test_pred = clf.predict(X_test)
accuracy_score(y_test, y_test_pred)

0.9666666666666667
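Note that the `GridSearchCV` call above uses `cv=2`, which performs 2-fold cross-validation on the training set rather than evaluating on a single fixed validation split. If a holdout-style search as described earlier is preferred, one option is scikit-learn's `PredefinedSplit`, which marks each row as either always-training (`-1`) or belonging to a given validation fold (`0`). The following is a sketch under that assumption; the 60/20/20 proportions and the names `X_tv`, `y_tv`, and `holdout` are illustrative.

```python
import numpy as np

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import (GridSearchCV, PredefinedSplit,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp)

# Stack training and validation rows and label them for PredefinedSplit:
# -1 = always in the training portion, 0 = validation fold number 0
X_tv = np.vstack([X_train, X_valid])
y_tv = np.concatenate([y_train, y_valid])
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_valid), dtype=int)])
holdout = PredefinedSplit(test_fold)

pipe = Pipeline([
    ('z-score', StandardScaler()),
    ('reduce_dim', PCA(n_components=3)),
    ('classify', KNeighborsClassifier(n_neighbors=1))])

param_grid = {
    'reduce_dim__n_components': [1, 2, 3, 4],
    'classify__n_neighbors': [2, 3, 4, 5]
}

# Every candidate is fit on the training rows and scored once
# on the fixed validation rows
grid = GridSearchCV(pipe, param_grid=param_grid, cv=holdout,
                    scoring='accuracy')
grid.fit(X_tv, y_tv)

print(grid.best_params_, grid.best_score_)

# With the default refit=True, the best setting is refit on all of X_tv
test_acc = grid.score(X_test, y_test)
print(test_acc)
```

Here `grid.best_score_` is the accuracy on the single validation split, and the test set is still used only once at the end.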