# Cross-validation

We must reserve some data for testing; otherwise, a model that merely memorized its training data would appear to score perfectly.

Here's an example using scikit-learn and the iris dataset.

In [1]:
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)

Note the shape of these two arrays.

In [2]:
X.shape, y.shape

((150, 4), (150,))

The train_test_split function of the scikit-learn module model_selection randomly splits the data into training and testing portions.  Here, we enter a test_size parameter of 0.4 to tell the function to reserve 40% of the data for testing.

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.4,
                                                    random_state=0)

As expected, the training arrays are 60% as long as the iris dataset, while the testing ones are 40%.

In [4]:
X_train.shape, y_train.shape

((90, 4), (90,))

In [5]:
X_test.shape, y_test.shape

((60, 4), (60,))

Now, we train a support vector machine on our training data.

In [6]:
from sklearn import svm
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

Finally, we test the machine against our reserved testing data.

In [7]:
clf.score(X_test, y_test)

0.9666666666666667

Even though some data was reserved for testing, the model still learned well enough to score over 96%.  Tweaking its settings, also called *hyperparameters*, could raise its score even higher, but tuning them against this one test set would risk overfitting the hyperparameters to it.  Evaluating the model across many different train/test splits instead avoids that risk.  This technique is called *cross-validation*.

A simple cross-validation algorithm is called *k-fold*, named after its approach of sub-dividing or *folding* the training data into $k$ smaller portions called *folds*.  The algorithm iterates through the folds, holding each one out in turn, training the model on the remaining $k - 1$ folds, and testing it on the held-out fold.  It then reports the average performance of the model across these $k$ runs.
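The loop behind k-fold can be sketched by hand with the `KFold` splitter from the scikit-learn model_selection module. This is only an illustrative sketch: the names `fold_scores` and `mean_score` are assumptions, and shuffling is switched on here because the iris rows are ordered by class. Note that `cross_val_score` uses stratified folds for classifiers when `cv` is an integer, so its scores will not match this plain k-fold loop exactly.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)

# Shuffle before splitting because the iris rows are ordered by class.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k - 1 folds, then score on the single held-out fold.
    model = svm.SVC(kernel='linear', C=1).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# The cross-validated estimate is the average over the k held-out folds.
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```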

With computational power ever-cheapening but labeled data staying expensive, k-fold remains a straightforwardly useful cross-validation approach.  Let's try an example with another support vector machine.

In [8]:
clf = svm.SVC(kernel='linear', C=1, random_state=42)

We can cross-validate it with just one call to the cross_val_score function from the scikit-learn model_selection module.  Here, we've chosen 5 folds.

In [9]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)

In [10]:
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

The function returns a NumPy array with one score per fold, so we can summarize it with its mean and standard deviation.

In [11]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.98 accuracy with a standard deviation of 0.02


The scoring metric can be customized through the scoring parameter.  Here, we switch from the default accuracy to the macro-averaged F1 score, which is also available in the scikit-learn metrics module.

In [12]:
from sklearn import metrics
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

In [13]:
scores

array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])

These new scores nearly match our old ones because the iris dataset is well-balanced across its target classes.
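For a single split, both metrics can also be computed directly with functions from the metrics module. The sketch below reuses the 60/40 split from earlier; the variable names `y_pred`, `accuracy`, and `f1_macro` are illustrative assumptions.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

model = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy: the fraction of test samples labeled correctly.
accuracy = metrics.accuracy_score(y_test, y_pred)

# Macro F1: the unweighted mean of the per-class F1 scores.
f1_macro = metrics.f1_score(y_test, y_pred, average='macro')

print(accuracy, f1_macro)
```

Because macro averaging weights every class equally, the two numbers diverge mainly when some classes are much rarer than others.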

We can even substitute a different cross-validator by passing a cross-validation iterator to cross_val_score through its cv parameter, in this case ShuffleSplit from the scikit-learn model_selection module.

In [14]:
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

In [15]:
cross_val_score(clf, X, y, cv=cv)

array([0.97777778, 0.97777778, 1.        , 0.95555556, 1.        ])
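To see what ShuffleSplit hands to cross_val_score, we can iterate over the splits ourselves. This is a small sketch; `split_sizes` is an illustrative name. Each split is an independent random draw, so unlike k-fold, the test sets of different splits may overlap.

```python
from sklearn import datasets
from sklearn.model_selection import ShuffleSplit

X, y = datasets.load_iris(return_X_y=True)
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

# split() yields one (train indices, test indices) pair per split.
split_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in cv.split(X)]
print(split_sizes)  # five (105, 45) pairs: 70% train, 30% test of 150 rows
```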

## Hyper-parameter tuning

Hyper-parameters are parameters that estimators do not learn from the data; instead, they are set before fitting and optimized by searching.  Scikit-learn provides two generic searches, grid and randomized, with halving versions that can be much faster.  While such searches may seem to have daunting numbers of dimensions and computational demands, many hyper-parameters can usually be left at their defaults, and some models allow specialized, efficient searches.

### Grid Search

Grid search generates candidates from a specified grid of parameter values.

In [16]:
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
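To see exactly which candidates such a grid produces, it can be expanded with `ParameterGrid` from the scikit-learn model_selection module. This is a minimal sketch; the name `candidates` is only illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Each dictionary is expanded independently: 4 linear candidates
# plus 4 * 2 = 8 RBF candidates.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 12
for params in candidates:
    print(params)
```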

Let's reload the iris dataset and fire up a support vector classifier for it.

In [17]:
iris = datasets.load_iris()
svc = svm.SVC()

Next, we import the GridSearchCV class, which implements grid search cross-validation, from the scikit-learn model_selection module.

In [18]:
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(svc, param_grid)

Finally, we fit the grid search to the data, which cross-validates our support vector classifier on every candidate in the grid.

In [19]:
clf.fit(iris.data, iris.target)

GridSearchCV(estimator=SVC(),
             param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
                         {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
                          'kernel': ['rbf']}])

The fitted search records quite a few results in its cv_results_ attribute!

In [20]:
sorted(clf.cv_results_.keys())

['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_C',
 'param_gamma',
 'param_kernel',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']
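In practice, the most useful results are usually the winning combination and its score. The sketch below repeats the grid search and reads them from the `best_params_`, `best_score_`, and `best_index_` attributes; the names `search`, `results`, and `order` are illustrative assumptions.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
search = GridSearchCV(svm.SVC(), param_grid).fit(iris.data, iris.target)

# Best hyper-parameter combination and its mean cross-validated score.
print(search.best_params_)
print(search.best_score_)

# The three best-ranked candidates with their mean test scores.
results = search.cv_results_
order = results['rank_test_score'].argsort()
for i in order[:3]:
    print(results['params'][i], results['mean_test_score'][i])
```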

### Randomized Search

While grid search is the most popular search method, other searches have favorable properties too.  Randomized search, which generates candidates from statistical distributions, can have a budget independent of the number of hyper-parameters and their possible values, and adding hyper-parameters irrelevant to model performance does not influence random search efficiency.

Consider this example of a logistic regression on the iris dataset.

In [21]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)

First, we prepare a uniform distribution over the interval from 0 to 4 for the regularization strength C and list the L2 and L1 regularization penalties as the options to search.

In [23]:
from scipy.stats import uniform
distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])
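To preview the kind of candidates a randomized search draws from such a dictionary, we can sample from it with `ParameterSampler` from the scikit-learn model_selection module. This is a sketch under assumed settings: the budget of 5 candidates and the name `samples` are chosen only for illustration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])

# n_iter fixes the number of sampled candidates, however many
# hyper-parameters the dictionary contains.
samples = list(ParameterSampler(distributions, n_iter=5, random_state=0))
for params in samples:
    # C is drawn from [0, 4]; penalty is picked from the list.
    print(params)
```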

Next, we create a randomized search object from the model selection module of scikit-learn.

In [24]:
from sklearn.model_selection import RandomizedSearchCV
clf = RandomizedSearchCV(logistic, distributions, random_state=0)

Finally, we search.

In [25]:
search = clf.fit(iris.data, iris.target)

In [27]:
search.best_params_

{'C': 2.195254015709299, 'penalty': 'l1'}

### Halving Searches

scikit-learn includes experimental halving versions of grid and random searches.  Both versions start from a pool of candidates, each of which is evaluated with a small amount of resources (by default, training samples).  Only the best-scoring candidates survive each round (half of them when the elimination factor is 2), and the survivors are re-evaluated with proportionally more resources.  The process repeats until a single candidate remains or the resources run out.
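A halving grid search might look like the sketch below. Because the feature is experimental, it must be enabled with an explicit import first. The simplified grid and `factor=2` are assumptions made for illustration; `factor=2` keeps half of the candidates per round, whereas scikit-learn's default of 3 keeps a third.

```python
# The experimental import must come before HalvingGridSearchCV.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)

# A simplified grid of 4 * 2 = 8 candidates.
param_grid = {'C': [1, 10, 100, 1000], 'kernel': ['linear', 'rbf']}

search = HalvingGridSearchCV(svm.SVC(), param_grid, factor=2,
                             random_state=0).fit(X, y)

# Candidates shrink while per-candidate samples grow each round.
print(search.n_candidates_)
print(search.n_resources_)
print(search.best_params_)
```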