# Model selection: choosing estimators and their parameters
  
Reference: https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html 

  
    
As seen, every estimator exposes a ```score``` method that can judge the quality of the fit (or the prediction) on new data. **Bigger is better.**



In [None]:
from sklearn import datasets, svm 
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])

To get a better measure of prediction accuracy (which we can use as a proxy for the goodness of fit of the model), we can successively split the data into *folds* that we use for training and testing:

In [5]:
import numpy as np
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)  


[0.9348914858096828, 0.9565943238731218, 0.9398998330550918]


This is called **KFold** cross-validation.
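The per-fold scores are usually summarized by their mean and spread. A minimal sketch, reusing the three scores printed above (the variable names are only illustrative):

```python
import numpy as np

# The three fold scores printed by the manual loop above
fold_scores = np.array([0.9348914858096828,
                        0.9565943238731218,
                        0.9398998330550918])

# Mean gives the overall estimate; standard deviation shows
# how much the score varies from one fold to another
mean_score = fold_scores.mean()
std_score = fold_scores.std()
print("accuracy: %.3f +/- %.3f" % (mean_score, std_score))
```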

### Cross-validation generators  


Scikit-learn has a collection of classes which can be used to generate lists of train/test indices for popular cross-validation strategies.  

  
They expose a ```split``` method which accepts the input dataset to be split and yields the train/test set indices for each iteration of the chosen CV strategy.  
  
The following example shows how the ```split``` method is used:

In [6]:
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
      print('Train: %s | test: %s' % (train_indices, test_indices))

Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]


The CV can be performed easily:

In [10]:
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
  for train, test in k_fold.split(X_digits)] 

[0.9638888888888889,
 0.9222222222222223,
 0.9637883008356546,
 0.9637883008356546,
 0.9303621169916435]

The CV score can be calculated directly using the ```cross_val_score``` helper. Given an estimator, the CV object and the input dataset, ```cross_val_score``` splits the data repeatedly into a training and a testing set, trains the estimator using the training set and computes the scores based on the testing set for each iteration of cross-validation.  
  
By default the ```score``` method is used to compute the individual scores.   

Refer to the [metrics module](https://scikit-learn.org/stable/modules/metrics.html#metrics) to learn more about the available scoring methods.

In [11]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)

array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])

```n_jobs=-1``` means that the computation will be dispatched on all the CPUs of the computer.  
  
The ```scoring``` argument can also be provided to specify a different scoring method.

In [12]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold,
                 scoring='precision_macro')

array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
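To make the default behaviour concrete: for a classifier such as ```SVC```, the ```score``` method is mean accuracy, so leaving ```scoring``` unset should give the same numbers as ```scoring='accuracy'```. A small sketch to check this, rebuilding the objects used above so that it runs on its own (no ```n_jobs``` is set, to keep it simple):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=5)

# No scoring argument: falls back to svc.score (accuracy for classifiers)
default_scores = cross_val_score(svc, X_digits, y_digits, cv=k_fold)

# Explicitly asking for accuracy
accuracy_scores = cross_val_score(svc, X_digits, y_digits, cv=k_fold,
                                  scoring='accuracy')

same = np.allclose(default_scores, accuracy_scores)
print(same)
```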

### Cross-validation generators  
  
**KFold (n_splits, shuffle, random_state)** - Splits the data into K folds, trains on K-1 and then tests on the left-out fold.  
**StratifiedKFold (n_splits, shuffle, random_state)** - Same as K-Fold but preserves the class distribution within each fold.  
**GroupKFold (n_splits)** - Ensures that the same group is not in both testing and training sets.  
**ShuffleSplit (n_splits, test_size, train_size, random_state)** - Generates train/test indices based on random permutation.  
**StratifiedShuffleSplit** - Same as shuffle split but preserves the class distribution within each iteration.  
**GroupShuffleSplit** - Ensures that the same group is not in both testing and training sets.  
**LeaveOneGroupOut()** - Takes a group array to group observations.  
**LeavePGroupsOut (n_groups)** - Leave P groups out.  
**LeaveOneOut()** - Leave one observation out.  
**LeavePOut (p)** - Leave P observations out.  
**PredefinedSplit** - Generates train/test indices based on predefined splits.  
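The difference between some of these generators is easiest to see on toy data. The sketch below uses a made-up dataset with sorted labels and hypothetical group ids (neither comes from the tutorial) to compare ```KFold```, ```StratifiedKFold``` and ```GroupKFold```:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold

# Toy data: 12 samples, labels sorted by class, 4 hypothetical groups
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 6 + [1] * 6)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# Class counts in each test fold: plain KFold ignores the labels
kfold_counts = [np.bincount(y[test], minlength=2).tolist()
                for _, test in KFold(n_splits=3).split(X)]

# StratifiedKFold keeps the class proportions in every test fold
stratified_counts = [np.bincount(y[test], minlength=2).tolist()
                     for _, test in StratifiedKFold(n_splits=3).split(X, y)]

print("KFold test class counts:          ", kfold_counts)
print("StratifiedKFold test class counts:", stratified_counts)

# GroupKFold never puts the same group on both sides of a split
group_overlaps = []
for train, test in GroupKFold(n_splits=2).split(X, y, groups=groups):
    overlap = set(groups[train]) & set(groups[test])
    group_overlaps.append(overlap)
    print("train groups:", np.unique(groups[train]),
          "| test groups:", np.unique(groups[test]))
```

With sorted labels, the unshuffled ```KFold``` produces test folds that contain a single class, which is the situation stratification is meant to avoid.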

### Grid-search and cross-validated estimators  
  
#### Grid-search  
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during construction and exposes an estimator API:  


In [13]:
from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
                    n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])    

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'C': array([1.00000e-06, 3.59381e-06, 1.29155e-05, 4.64159e-05, 1.66810e-04,
       5.99484e-04, 2.15443e-03, 7.74264e-03, 2.78256e-02, 1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [14]:
clf.best_score_ 

0.925

In [15]:
clf.best_estimator_.C 

0.007742636826811277

In [16]:
# Prediction performance on test set is not as good as on train set
clf.score(X_digits[1000:], y_digits[1000:])

0.9435382685069009

In the scikit-learn version used here, ```GridSearchCV``` uses 3-fold cross-validation by default. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold. The default changed to 5-fold cross-validation in version 0.22.
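Because the default number of folds depends on the installed version, it can help to pass ```cv``` explicitly. The sketch below does that with a ```StratifiedKFold``` (the choice of 5 splits is an assumption made for illustration) and then reads the per-candidate scores from ```cv_results_```:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')
Cs = np.logspace(-6, -1, 10)

# Fix the CV strategy so results do not depend on the library default
cv = StratifiedKFold(n_splits=5)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), cv=cv)
clf.fit(X_digits[:1000], y_digits[:1000])

# One entry per candidate value of C
mean_scores = clf.cv_results_['mean_test_score']
for params, mean_score in zip(clf.cv_results_['params'], mean_scores):
    print("C=%.2e  mean CV accuracy=%.3f" % (params['C'], mean_score))

# best_score_ is the highest mean CV score among the candidates
print(clf.best_params_, clf.best_score_)
```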

#### Nested cross-validation

In [17]:
cross_val_score(clf, X_digits, y_digits)

array([0.93853821, 0.96327212, 0.94463087])

Two nested cross-validation loops are performed: one by the ```GridSearchCV``` estimator to set ```C``` and the other one by ```cross_val_score``` to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.  
  
**Warning:** You cannot nest objects with parallel computing (```n_jobs``` different from 1).  
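The two loops can be made explicit by giving each its own CV object. This is only a sketch: the shuffled ```KFold``` splitters, the random seeds and the shorter grid of 5 values for ```C``` are assumptions chosen to keep the run short, and ```n_jobs``` is left at its default.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Inner loop: used by GridSearchCV to pick C on each outer training set
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: used by cross_val_score to evaluate the tuned estimator
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

clf = GridSearchCV(estimator=svm.SVC(kernel='linear'),
                   param_grid={'C': np.logspace(-6, -1, 5)},
                   cv=inner_cv)

# Each outer fold refits the whole grid search on its training part
nested_scores = cross_val_score(clf, X_digits, y_digits, cv=outer_cv)
print(nested_scores, nested_scores.mean())
```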
  
  
#### Cross-validated estimators 
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, scikit-learn exposes cross-validated estimators (see "Cross-validation: evaluating estimator performance" in the scikit-learn documentation) that set their parameter automatically by cross-validation:

In [18]:
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV(cv=3)
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)

LassoCV(alphas=None, copy_X=True, cv=3, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)

In [19]:
lasso.alpha_

0.012291895087486173
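To see where ```alpha_``` comes from, the fitted ```LassoCV``` also stores the grid of candidate values in ```alphas_``` and the per-fold mean squared errors in ```mse_path_```. As far as I understand the implementation, the selected value is the one with the lowest error averaged over folds; the sketch below checks that assumption rather than relying on it. The exact value of ```alpha_``` may differ from the output above depending on the scikit-learn version.

```python
import numpy as np
from sklearn import datasets, linear_model

X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
lasso = linear_model.LassoCV(cv=3).fit(X_diabetes, y_diabetes)

# mse_path_ has one row per candidate alpha and one column per fold
mean_mse = lasso.mse_path_.mean(axis=1)

# Candidate alpha with the lowest MSE averaged over the folds
best_index = int(np.argmin(mean_mse))
best_alpha = lasso.alphas_[best_index]

print("alpha with lowest mean CV error:", best_alpha)
print("alpha_ selected by LassoCV:     ", lasso.alpha_)
```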