# Grid Search

In machine learning, two tasks are commonly performed together in data pipelines: *cross validation* and *(hyper)parameter tuning*. Cross validation is the process of training a learner on one set of data and testing it on a different set. (Hyper)parameter tuning is the process of selecting the values of a model's parameters that maximize the accuracy of the model.
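As a minimal sketch of the cross-validation idea, the snippet below splits ten toy observations into five folds with scikit-learn's `KFold`. The toy array `X_toy` and the variable names are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten toy observations (hypothetical data, for illustration only)
X_toy = np.arange(10).reshape(-1, 1)

# each fold holds out a different fifth of the rows for testing
kfold = KFold(n_splits=5)
splits = list(kfold.split(X_toy))

for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: train on {train_idx}, test on {test_idx}")
```

Every observation is used for testing exactly once, and never in the same fold in which it is used for training.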

In [1]:
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn import datasets, svm
import matplotlib.pyplot as plt

### Using `sklearn`'s Digit Dataset

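Each row of the feature data holds the 64 pixel intensities of an 8x8 image, flattened into a vector. The target data is a vector containing the handwritten digit (0 to 9) shown in each image.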

In [2]:
# load the digit data
digits = datasets.load_digits()

# view the features of the first observation
digits.data[0:1]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

In [3]:
# view the target of the first observation
digits.target[0:1]

array([0])
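To see how a feature row relates to its image, the following sketch reshapes the first observation back into an 8x8 grid and compares it with `digits.images`, which stores the same data in unflattened form. The variable name `first_image` is an assumption made for this example.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# each row of digits.data is a flattened 8x8 image
first_image = digits.data[0].reshape(8, 8)

print(digits.data.shape, digits.target.shape)
print(first_image.shape)
print(np.array_equal(first_image, digits.images[0]))
```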

Split the data into two datasets: the first 1,000 observations and the remaining 797. We will use the first dataset for now.

In [9]:
# create dataset 1
data1_features = digits.data[:1000]
data1_target = digits.target[:1000]

# create dataset 2
data2_features = digits.data[1000:]
data2_target = digits.target[1000:]
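Because the split is a simple slice rather than a random sample, it can be worth checking the size and class balance of each part. The sketch below does this with `np.bincount`; the names `counts1` and `counts2` are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
data1_target = digits.target[:1000]
data2_target = digits.target[1000:]

# number of observations in each dataset
print(len(data1_target), len(data2_target))

# number of examples of each digit (0 to 9) in each dataset
counts1 = np.bincount(data1_target, minlength=10)
counts2 = np.bincount(data2_target, minlength=10)
print(counts1)
print(counts2)
```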

### Create Parameter Candidates

Before looking for which combination of parameter values produces the most accurate model, we must specify the different candidate values we want to try.

In the code below, the candidate parameter values are:
- 4 values for `C` (1, 10, 100, 1000),
- 2 values for `gamma` (0.001, 0.0001), used only with the `rbf` kernel,
- and 2 values for `kernel` (`linear`, `rbf`).

The grid search will try every combination within each dictionary (4 for the linear kernel and 8 for the rbf kernel, 12 in total) and select the set of parameters that produces the most accurate model.

In [10]:
parameter_candidates = [
    {'C' : [1, 10, 100, 1000], 'kernel' : ['linear']},
    {'C' : [1, 10, 100, 1000], 'gamma' : [0.001, 0.0001], 'kernel' : ['rbf']},
]
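To make the size of the search concrete, the candidate combinations can be enumerated with scikit-learn's `ParameterGrid`, which expands a list of dictionaries the same way `GridSearchCV` does. This is only an inspection sketch; the variable name `grid` is an assumption.

```python
from sklearn.model_selection import ParameterGrid

parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# expand each dictionary into its individual parameter combinations
grid = list(ParameterGrid(parameter_candidates))

print(len(grid))  # 4 linear + 8 rbf = 12
for params in grid:
    print(params)
```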

### Conduct Grid Search to Find Parameters Producing Highest Score

Now we are ready to conduct the grid search using scikit-learn's `GridSearchCV`, which stands for **grid search cross validation**. By default, `GridSearchCV` uses 5-fold cross validation (3-fold in scikit-learn versions before 0.22): `StratifiedKFold` when the estimator is a classifier and the target is binary or multiclass, and `KFold` otherwise.

In [12]:
# create a classifier object with the classifier and parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# train the classifier on data1's feature and target data
clf.fit(data1_features, data1_target)

GridSearchCV(estimator=SVC(), n_jobs=-1,
             param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
                         {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
                          'kernel': ['rbf']}])
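The fold strategy can also be passed explicitly, and the score of every candidate is available afterwards in `cv_results_`. The sketch below assumes a reduced grid (`small_grid`, chosen here only to keep the example quick) and the illustrative names `cv` and `search`; it is not the search run above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

digits = datasets.load_digits()
X, y = digits.data[:1000], digits.target[:1000]

# reduced grid (an assumption for this example, to keep it fast)
small_grid = [
    {'C': [1, 10], 'kernel': ['linear']},
    {'C': [1, 10], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# make the cross-validation strategy explicit instead of relying on the default
cv = StratifiedKFold(n_splits=5)
search = GridSearchCV(estimator=svm.SVC(), param_grid=small_grid, cv=cv)
search.fit(X, y)

# mean and standard deviation of the fold scores for each candidate
results = search.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(f"{mean:.3f} (+/- {std:.3f}) for {params}")
```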

In [13]:
# view the mean cross-validated accuracy of the best parameter combination
print('Best score for data1:', clf.best_score_)

Best score for data1: 0.966


In [14]:
# view the best parameters for the model found using grid search
print('Best C:', clf.best_estimator_.C)
print('Best kernel:', clf.best_estimator_.kernel)
print('Best gamma:', clf.best_estimator_.gamma)

Best C: 10
Best kernel: rbf
Best gamma: 0.001
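The winning combination is also exposed as a plain dictionary through the `best_params_` attribute, which avoids reading each value off the estimator one at a time. A minimal sketch, assuming the same data and candidate grid as above (the name `search` is illustrative):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X, y = digits.data[:1000], digits.target[:1000]

parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

search = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates)
search.fit(X, y)

# dictionary holding only the parameters that were searched over
best_params = search.best_params_
print(best_params)
```

Note that `gamma` appears in the dictionary only when the best kernel is `rbf`, because the linear candidates do not include it.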


### Sanity Check Using the Second Dataset

We will use the second dataset to check that the model really uses those parameters. First, we apply the classifier we just trained to the second dataset. Then we train a new support vector classifier from scratch using the parameters found by the grid search. If the two scores match, the grid search object is scoring with the best parameters.

In [15]:
# apply the classifier trained using data1 to data2, and view the accuracy score
clf.score(data2_features, data2_target)

0.9698870765370138

In [16]:
# train a new classifier using the best parameters found by the grid search
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(data1_features, data1_target).score(data2_features, data2_target)

0.9698870765370138
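The same check can be written without typing the parameter values by hand. With the default `refit=True`, `GridSearchCV` refits the best combination on all of the training data and uses that model for `score`, so a fresh `SVC` built from `best_params_` should agree with it. A sketch under those assumptions, with the illustrative names `search`, `search_score` and `refit_score`:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
data1_features, data1_target = digits.data[:1000], digits.target[:1000]
data2_features, data2_target = digits.data[1000:], digits.target[1000:]

parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

search = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates)
search.fit(data1_features, data1_target)

# score of the grid search object on the held-out second dataset
search_score = search.score(data2_features, data2_target)

# score of a new classifier built directly from the best parameters
new_clf = svm.SVC(**search.best_params_).fit(data1_features, data1_target)
refit_score = new_clf.score(data2_features, data2_target)

print(search_score, refit_score, np.isclose(search_score, refit_score))
```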