# Model selection: choosing estimators and their parameters

[Link to tutorial page](https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html)

## Score, and cross-validated scores

As we have seen, every estimator exposes a `score` method that judges the quality of the fit (or the prediction) on new data. **Bigger is better**.

In [4]:
from sklearn import datasets, svm
import numpy as np

In [5]:
X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])

0.98
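
For a classifier such as `SVC`, `score` returns the mean accuracy on the data it is given. As a minimal sketch, the same number can be obtained by comparing the predictions with the true labels. The names `X_train`, `X_test`, `y_train`, `y_test` and `manual_accuracy` are illustrative and do not appear in the tutorial.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Hold out the last 100 samples for testing, as in the cell above
X_train, y_train = X_digits[:-100], y_digits[:-100]
X_test, y_test = X_digits[-100:], y_digits[-100:]

svc = svm.SVC(C=1, kernel='linear').fit(X_train, y_train)

# Fraction of test samples whose predicted label matches the true label
manual_accuracy = np.mean(svc.predict(X_test) == y_test)

print(svc.score(X_test, y_test))
print(accuracy_score(y_test, svc.predict(X_test)))
print(manual_accuracy)
```

All three lines should print the same value, since they are three ways of computing the accuracy on the held-out samples.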

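## KFold cross-validation - manually

A single train/test split gives only one estimate of prediction accuracy. To get a more reliable measure, split the data into folds and use each fold once for testing while the remaining folds are used for training.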

In [6]:
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()

In [7]:
X_folds

[array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.],
        ...,
        [ 0.,  0.,  5., ..., 16., 11.,  2.],
        [ 0.,  0.,  6., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 12.,  0.,  0.]]),
 array([[ 0.,  0.,  1., ..., 16., 16.,  8.],
        [ 0.,  0., 10., ..., 16., 16.,  9.],
        [ 0.,  0.,  6., ..., 16., 15.,  3.],
        ...,
        [ 0.,  1., 13., ...,  0.,  0.,  0.],
        [ 0.,  1.,  7., ..., 12.,  2.,  0.],
        [ 0.,  0., 13., ...,  0.,  0.,  0.]]),
 array([[ 0.,  0.,  0., ...,  9.,  0.,  0.],
        [ 0.,  0.,  7., ...,  8.,  0.,  0.],
        [ 0.,  0., 12., ...,  0.,  0.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  2., ..., 12.,  0.,  0.],
        [ 0.,  0., 10., ..., 12.,  1.,  0.]])]

In [8]:
for k in range(3):
    # Copy the list of folds so that 'pop' does not modify X_folds
    X_train = list(X_folds)
    # pop(k) removes fold k from the copy and returns it: this is the test fold
    X_test = X_train.pop(k)
    # Join the remaining folds along axis 0 to form the training set
    X_train = np.concatenate(X_train)
    # Same procedure for the labels: copy the list of folds, pop the test fold, join the rest
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)

[0.9348914858096828, 0.9565943238731218, 0.9398998330550918]
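
The loop above can be wrapped in a small helper so that the number of folds becomes a parameter. This is only a sketch: `manual_kfold_scores` is a hypothetical name, not a scikit-learn function, and it assumes the samples can be split in their stored order without shuffling.

```python
import numpy as np
from sklearn import datasets, svm


def manual_kfold_scores(estimator, X, y, n_folds=3):
    """Fit on all folds but one and score on the held-out fold, for each fold."""
    X_folds = np.array_split(X, n_folds)
    y_folds = np.array_split(y, n_folds)
    scores = []
    for k in range(n_folds):
        # Training data: every fold except fold k
        X_train = np.concatenate(X_folds[:k] + X_folds[k + 1:])
        y_train = np.concatenate(y_folds[:k] + y_folds[k + 1:])
        # Fit on the training folds, then score on fold k only
        estimator.fit(X_train, y_train)
        scores.append(estimator.score(X_folds[k], y_folds[k]))
    return scores


X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
scores = manual_kfold_scores(svc, X_digits, y_digits, n_folds=3)

# Summarize the per-fold scores with their mean and spread
print("mean accuracy: %.3f (std %.3f)" % (np.mean(scores), np.std(scores)))
```

Reporting the mean together with the standard deviation gives a rough idea of how much the score depends on which fold is held out.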


## Cross-validation generators

### sklearn KFold 

In [12]:
from sklearn.model_selection import KFold, cross_val_score

In [13]:
# KFold.split yields (train_indices, test_indices) pairs of integer index arrays
X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))

Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]


In [14]:
# Perform cross-validation by hand: fit on each training split, score on the matching test split
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
 for train, test in k_fold.split(X_digits)]

[0.9638888888888889,
 0.9222222222222223,
 0.9637883008356546,
 0.9637883008356546,
 0.9303621169916435]

In [15]:
# cross_val_score computes the same per-fold scores directly; n_jobs=-1 uses all CPU cores
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)

array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])
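
By default, `cross_val_score` uses the estimator's own `score` method. As a hedged sketch of two common follow-ups, the snippet below summarizes the fold scores by their mean and passes the `scoring` argument to request a different metric. `'precision_macro'` is just one of scikit-learn's predefined scorer names, chosen here for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=5)

# Default: each fold is scored with svc.score, i.e. mean accuracy
accuracy = cross_val_score(svc, X_digits, y_digits, cv=k_fold)
print("accuracy per fold:", accuracy)
print("mean accuracy: %.3f" % accuracy.mean())

# Alternative metric selected by name through the `scoring` parameter
precision = cross_val_score(svc, X_digits, y_digits, cv=k_fold,
                            scoring='precision_macro')
print("macro precision per fold:", precision)
```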