# Model scoring
See http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
The Support Vector Machine model is trained on all but the last 100 samples and scored on those last 100 samples. The score is the mean accuracy measured over this held-out test set.

In [1]:
from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100])
svc.score(X_digits[-100:], y_digits[-100:])

0.98
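
The statement that the score is the mean accuracy can be checked by hand. The following is a minimal sketch that reuses the same split as above and compares `svc.score` with the fraction of correct predictions; the names `manual_accuracy` and `reported_score` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets, svm

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target

svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100])

# Fraction of the 100 held-out samples that are predicted correctly
predictions = svc.predict(X_digits[-100:])
manual_accuracy = np.mean(predictions == y_digits[-100:])

# For a classifier, score() reports the same quantity
reported_score = svc.score(X_digits[-100:], y_digits[-100:])
print(manual_accuracy, reported_score)
```

Both numbers should agree, since `score` for a classifier is the accuracy of `predict` on the data it is given.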

# k-fold cross-validation
Here we carry out a 3-fold cross-validation on the digits dataset

In [2]:
import numpy as np
num_folds = 3
X_folds = np.array_split(X_digits, num_folds)
y_folds = np.array_split(y_digits, num_folds)
scores = list()
for k in range(num_folds):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test  = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test  = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)

[0.9348914858096828, 0.9565943238731218, 0.9398998330550918]


# Cross-validation generators
Here we use the KFold class to generate index sets for k-fold cross-validation

In [3]:
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "b", "c", "c", "c"]
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))

Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
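
As a sanity check on the index sets printed above, the sketch below verifies two properties of an unshuffled `KFold`: every sample index appears in exactly one test fold, and the train and test indices of a split never overlap. The variable names `all_test` and `overlaps` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "a", "b", "c", "c", "c"]
k_fold = KFold(n_splits=3)

# Collect the test indices of every split and sort them
test_folds = [test for _, test in k_fold.split(X)]
all_test = np.sort(np.concatenate(test_folds))
print(all_test)  # [0 1 2 3 4 5]

# Count the indices shared by train and test within each split
overlaps = [np.intersect1d(train, test).size
            for train, test in k_fold.split(X)]
print(overlaps)  # [0, 0, 0]
```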


Use the generated index sets to carry out the cross-validation

In [4]:
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
     for train, test in k_fold.split(X_digits)]

[0.9348914858096828, 0.9565943238731218, 0.9398998330550918]

Use the cross_val_score helper to compute the cross-validated scores directly

In [5]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)

array([0.93489149, 0.95659432, 0.93989983])

Use the macro-averaged precision as the score, rather than the mean accuracy

In [6]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')

array([0.93969761, 0.95911415, 0.94041254])
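
When more than one metric is of interest, `cross_validate` can evaluate several scorers in a single pass instead of calling `cross_val_score` once per metric. The sketch below is one possible way to do this with the same estimator and folds; the choice of summarising each metric by its mean and standard deviation is an assumption made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_validate

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=3)

# Evaluate accuracy and macro-averaged precision on the same folds
results = cross_validate(svc, X_digits, y_digits, cv=k_fold,
                         scoring=['accuracy', 'precision_macro'])

# Each metric is stored under the key 'test_<scorer name>'
for metric in ('test_accuracy', 'test_precision_macro'):
    fold_scores = results[metric]
    print('%s: mean %.3f, std %.3f'
          % (metric, fold_scores.mean(), fold_scores.std()))
```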

Carry out a grid search to optimize the C parameter of a Support Vector Machine classifier

In [7]:
from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='linear', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': array([1.00000000e-06, 3.59381366e-06, 1.29154967e-05, 4.64158883e-05,
       1.66810054e-04, 5.99484250e-04, 2.15443469e-03, 7.74263683e-03,
       2.78255940e-02, 1.00000000e-01])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [8]:
clf.best_score_

0.95

In [9]:
clf.best_estimator_.C

0.0021544346900318843

In [10]:
# Accuracy on the held-out samples is slightly lower than the
# cross-validated best_score_ obtained on the training subset
clf.score(X_digits[1000:], y_digits[1000:])

0.946047678795483
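
To see how the best value of C was selected, the fitted search object can be inspected through its `cv_results_` attribute. The following sketch repeats the grid search (without `n_jobs`, for simplicity) and lists the mean cross-validated accuracy for each candidate; the loop and the printed layout are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')
Cs = np.logspace(-6, -1, 10)

clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs))
clf.fit(X_digits[:1000], y_digits[:1000])

# One entry per candidate C: its parameters and mean score over the folds
mean_scores = clf.cv_results_['mean_test_score']
for params, mean_score in zip(clf.cv_results_['params'], mean_scores):
    print('C=%.2e  mean CV accuracy=%.3f' % (params['C'], mean_score))

# best_score_ is the mean score of the best-ranked candidate
print(clf.best_params_, mean_scores[clf.best_index_])
```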

Carry out a nested cross validation. In the inner loop, GridSearchCV optimizes the value of the C parameter. In the outer loop, cross_val_score measures the prediction performance of the optimized estimators obtained by GridSearchCV.

In [11]:
cross_val_score(clf, X_digits, y_digits)

array([0.94722222, 0.91666667, 0.96657382, 0.97493036, 0.93593315])
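
The two loops of the nested cross-validation can be made explicit by passing separate splitters to `GridSearchCV` and `cross_val_score`. The sketch below is one way to write this; the shuffled `KFold` splitters, their `random_state` values, and the reduced grid of 6 values of C are assumptions chosen for illustration and differ from the default settings used in the cell above.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Inner loop: selects C. Outer loop: estimates generalization accuracy.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(svm.SVC(kernel='linear'),
                      param_grid={'C': np.logspace(-6, -1, 6)},
                      cv=inner_cv)

# Each outer fold refits the whole grid search on its own training part
nested_scores = cross_val_score(search, X_digits, y_digits, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

Because C is chosen using only the training part of each outer fold, the outer scores are not biased by the parameter selection.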

Set the value of the alpha parameter of a Lasso model automatically by cross validation

In [12]:
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)

LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
        max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
        positive=False, precompute='auto', random_state=None,
        selection='cyclic', tol=0.0001, verbose=False)

In [13]:
# The estimator automatically chose its alpha:
lasso.alpha_

0.003753767152692203
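
The selected alpha can be traced back to the cross-validation errors stored on the fitted model. The sketch below assumes an explicit `cv=5` (an illustrative choice) and recovers `alpha_` by hand as the candidate with the lowest mean squared error averaged over the folds; `best_by_hand` is an illustrative name.

```python
import numpy as np
from sklearn import datasets, linear_model

X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
lasso = linear_model.LassoCV(cv=5)
lasso.fit(X_diabetes, y_diabetes)

# mse_path_ has one row per candidate alpha and one column per fold
mean_mse = lasso.mse_path_.mean(axis=1)

# The candidate with the lowest average error should match alpha_
best_by_hand = lasso.alphas_[np.argmin(mean_mse)]
print(best_by_hand, lasso.alpha_)
```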