# MNIST handwritten digits classification with parameter grid search for SVM

In this notebook, we'll use [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and a validation set to find optimal values for our SVM model's hyperparameters.

First, the needed imports. 

In [1]:
%matplotlib inline

import numpy as np
from sklearn import svm, datasets, __version__
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Suppress annoying warnings...
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# distutils was removed in Python 3.12, so compare versions with packaging instead
from packaging.version import Version
assert Version(__version__) >= Version("0.20"), "Version >= 0.20 of sklearn is required."

Then we load the MNIST data. The first time this cell runs, it downloads the data, which can take a while.

In [2]:
mnist = datasets.fetch_openml('mnist_784')

train_len = 60000
X = mnist['data']
y = mnist['target']

X_train, y_train = X[:train_len], y[:train_len]
X_test, y_test = X[train_len:], y[train_len:]

print('MNIST data loaded: train:',len(X_train),'test:',len(X_test))
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)

MNIST data loaded: train: 60000 test: 10000
X_train: (60000, 784)
y_train: (60000,)
X_test (10000, 784)
y_test (10000,)


## Linear SVM

Let's start with a linear SVM trained on a subset of the training data (the first 10,000 images).  `C` is the penalty parameter that we need to specify.  As a first guess, let's try `C=1.0`.

In [3]:
%%time

clf_lsvm = svm.LinearSVC(C=1.0)

print(clf_lsvm.fit(X_train[:10000,:], y_train[:10000]))

pred_lsvm = clf_lsvm.predict(X_test)
print('Predicted', len(pred_lsvm), 'digits with accuracy:', accuracy_score(y_test, pred_lsvm))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Predicted 10000 digits with accuracy: 0.8569
CPU times: user 6.34 s, sys: 183 ms, total: 6.52 s
Wall time: 6.26 s


Next, let's try grid search, i.e., we try several different values for the parameter `C`.  Remember that it's important *not* to use the test set for evaluating hyperparameters.  Instead, we set aside the last 1000 images of the 10,000-image training subset as a validation set.


In [4]:
# The values for C that we will try out
param_grid = {'C': [1, 10, 100, 1000]}

# Use first 9000 as training and last 1000 as validation set
valid_split = PredefinedSplit(9000*[-1] + 1000*[0])

clf_lsvm_grid = GridSearchCV(clf_lsvm, param_grid, cv=valid_split, verbose=2)
print(clf_lsvm_grid.fit(X_train[:10000,:], y_train[:10000]))

Fitting 1 folds for each of 4 candidates, totalling 4 fits
[CV] C=1 .............................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .............................................. C=1, total=   5.5s
[CV] C=10 ............................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.6s remaining:    0.0s


[CV] ............................................. C=10, total=   5.5s
[CV] C=100 ...........................................................
[CV] ............................................ C=100, total=   5.4s
[CV] C=1000 ..........................................................
[CV] ........................................... C=1000, total=   5.6s


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   22.0s finished


GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
       error_score='raise-deprecating',
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [1, 10, 100, 1000]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=2)
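
To make the role of `PredefinedSplit` more concrete, here is a minimal sketch on a toy index list (8 samples instead of 10,000; the sizes are chosen only for illustration). An entry of `-1` means "always in the training set", while `0` assigns the sample to validation fold 0, so there is exactly one train/validation split.

```python
from sklearn.model_selection import PredefinedSplit

# Toy setup: 8 samples, the first 6 always train (-1), the last 2 form fold 0.
test_fold = 6 * [-1] + 2 * [0]
split = PredefinedSplit(test_fold)

print("number of folds:", split.get_n_splits())

# split() yields one (train indices, validation indices) pair per fold.
for train_idx, valid_idx in split.split():
    print("train indices:     ", train_idx)
    print("validation indices:", valid_idx)
```

This is why the log above reports "1 folds": every candidate value of `C` is fitted once on the first 9000 samples and scored once on the last 1000.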


We can now see which value of `C` was selected.

In [None]:
print(clf_lsvm_grid.best_params_)

best_C = clf_lsvm_grid.best_params_['C']

{'C': 1000}
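
`best_params_` reports only the winner. The validation score of every candidate is stored in `cv_results_`, which helps judge whether the selected value is clearly better or only marginally so. The sketch below is a stand-in, not the notebook's run: it assumes the small 8x8 digits dataset bundled with scikit-learn (`load_digits`) so that it finishes in seconds, and the names `X_demo`, `demo_split` and `demo_grid` are made up for the example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Small stand-in dataset (8x8 digits bundled with scikit-learn), not MNIST.
X_demo, y_demo = datasets.load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:600], y_demo[:600]

# First 500 samples for training, last 100 for validation.
demo_split = PredefinedSplit(500 * [-1] + 100 * [0])
demo_grid = GridSearchCV(svm.LinearSVC(), {'C': [0.01, 1, 100]}, cv=demo_split)
demo_grid.fit(X_demo, y_demo)

# One validation score per candidate, in the order the grid was expanded.
scores = demo_grid.cv_results_['mean_test_score']
for params, score in zip(demo_grid.cv_results_['params'], scores):
    print(params, 'validation accuracy: %.3f' % score)

print('selected:', demo_grid.best_params_)
```

The same loop should work on `clf_lsvm_grid.cv_results_` in this notebook.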


Let's try predicting with our new model, using the selected hyperparameters.

In [None]:
clf_lsvm2 = svm.LinearSVC(C=best_C)

print(clf_lsvm2.fit(X_train[:10000,:], y_train[:10000]))

pred_lsvm2 = clf_lsvm2.predict(X_test)
print('Predicted', len(pred_lsvm2), 'digits with accuracy:', accuracy_score(y_test, pred_lsvm2))

LinearSVC(C=1000, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Predicted 10000 digits with accuracy: 0.8556
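
As the `refit=True` entry in the `GridSearchCV` output earlier indicates, the search object refits the best candidate on all the data passed to `fit` (training and validation parts together), so it can also be used for prediction directly. A minimal sketch of that shortcut, again assuming the bundled `load_digits` data as a fast stand-in for MNIST, with illustrative variable names:

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Stand-in data: 600 samples for the search, 300 held out for a final check.
X_demo, y_demo = datasets.load_digits(return_X_y=True)
X_fit, y_fit = X_demo[:600], y_demo[:600]
X_held, y_held = X_demo[600:900], y_demo[600:900]

demo_split = PredefinedSplit(500 * [-1] + 100 * [0])
demo_grid = GridSearchCV(svm.LinearSVC(), {'C': [0.01, 1, 100]},
                         cv=demo_split, refit=True)
demo_grid.fit(X_fit, y_fit)

# predict() delegates to best_estimator_, which was refit on all 600 samples.
pred = demo_grid.predict(X_held)
print('refit model uses C =', demo_grid.best_estimator_.C)
print('held-out accuracy:', accuracy_score(y_held, pred))
```

Creating a fresh `LinearSVC(C=best_C)`, as done above, is equivalent in spirit and keeps the retraining step explicit.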


## Kernel SVM

A kernel SVM typically has two hyperparameters that need to be set.  For example, with a Gaussian (or RBF) kernel we have `gamma` (Greek $\gamma$) in addition to `C`.

In [None]:
%%time

clf_ksvm = svm.SVC(decision_function_shape='ovr', kernel='rbf', C=1.0, gamma=1e-6)
print(clf_ksvm.fit(X_train[:10000,:], y_train[:10000]))

pred_ksvm = clf_ksvm.predict(X_test)
print('Predicted', len(pred_ksvm), 'digits with accuracy:', accuracy_score(y_test, pred_ksvm))

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-06, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Predicted 10000 digits with accuracy: 0.956
CPU times: user 2min 43s, sys: 982 ms, total: 2min 44s
Wall time: 2min 43s
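
A value as small as `gamma=1e-6` may look odd at first. The RBF kernel is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, and with 784 unscaled pixel intensities the squared distance between two images is large, so `gamma` has to be tiny for the kernel not to collapse to zero. The sketch below illustrates this on two synthetic vectors, assuming pixel values in the 0 to 255 range; random vectors are only a rough stand-in for real digits, so the exact numbers will differ on MNIST.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
# Two synthetic "images" with 784 intensities in 0..255 (stand-ins for MNIST rows).
a = rng.randint(0, 256, size=(1, 784)).astype(float)
b = rng.randint(0, 256, size=(1, 784)).astype(float)

sq_dist = np.sum((a - b) ** 2)
print('squared distance: %.0f' % sq_dist)

# Compare the formula with scikit-learn's rbf_kernel for several gamma values.
for gamma in [1e-8, 1e-6, 1e-3]:
    manual = np.exp(-gamma * sq_dist)
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print('gamma=%g  K=%.6g  (rbf_kernel: %.6g)' % (gamma, manual, library))
```

With too large a `gamma`, every pair of distinct images gets a kernel value of essentially zero, which is one reason the grid below searches values between `1e-8` and `1e-6`.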


Now we can try grid search again, this time with two parameters.  We use an even smaller subset of the training set, as it would otherwise be too slow.

In [None]:
param_grid = {'C': [1, 10, 100],
              'gamma': [1e-8, 5e-8, 1e-7, 5e-7, 1e-6]}

train_items = 3000
valid_items = 500
tot_items = train_items + valid_items

# Use first 3000 as training and last 500 as validation set
valid_split = PredefinedSplit(train_items*[-1] + valid_items*[0])

clf_ksvm_grid = GridSearchCV(clf_ksvm, param_grid, cv=valid_split, verbose=2)
print(clf_ksvm_grid.fit(X_train[:tot_items,:], y_train[:tot_items]))


Fitting 1 folds for each of 15 candidates, totalling 15 fits
[CV] C=1, gamma=1e-08 ................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ................................. C=1, gamma=1e-08, total=   6.9s
[CV] C=1, gamma=5e-08 ................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.4s remaining:    0.0s


[CV] ................................. C=1, gamma=5e-08, total=   4.0s
[CV] C=1, gamma=1e-07 ................................................
[CV] ................................. C=1, gamma=1e-07, total=   3.6s
[CV] C=1, gamma=5e-07 ................................................
[CV] ................................. C=1, gamma=5e-07, total=   7.1s
[CV] C=1, gamma=1e-06 ................................................
[CV] ................................. C=1, gamma=1e-06, total=  12.4s
[CV] C=10, gamma=1e-08 ...............................................
[CV] ................................ C=10, gamma=1e-08, total=   3.1s
[CV] C=10, gamma=5e-08 ...............................................
[CV] ................................ C=10, gamma=5e-08, total=   2.6s
[CV] C=10, gamma=1e-07 ...............................................
[CV] ................................ C=10, gamma=1e-07, total=   2.9s
[CV] C=10, gamma=5e-07 ...............................................
[CV] .
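
The "15 candidates" in the log come from taking every combination of the 3 values of `C` and the 5 values of `gamma`. A small sketch using scikit-learn's `ParameterGrid` shows how the dictionary expands; the variable name `candidates` is only illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [1, 10, 100],
              'gamma': [1e-8, 5e-8, 1e-7, 5e-7, 1e-6]}

# ParameterGrid enumerates the Cartesian product of all listed values.
candidates = list(ParameterGrid(param_grid))
print('number of candidates:', len(candidates))

# The first few candidates, in the order the grid search visits them.
for params in candidates[:3]:
    print(params)
```

Because the number of fits grows multiplicatively with each added parameter, keeping the data subset small matters more here than in the one-parameter search.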

Again, let's see what parameters were selected.

In [None]:
print(clf_ksvm_grid.best_params_)

best_C = clf_ksvm_grid.best_params_['C']
best_gamma = clf_ksvm_grid.best_params_['gamma']

As we did the grid search on a small subset of the training set, it probably makes sense to retrain the model with the selected parameters using a bigger part of the training data.

In [None]:
clf_ksvm2 = svm.SVC(decision_function_shape='ovr', kernel='rbf', C=best_C, gamma=best_gamma)
print(clf_ksvm2.fit(X_train[:10000,:], y_train[:10000]))

pred_ksvm2 = clf_ksvm2.predict(X_test)
print('Predicted', len(pred_ksvm2), 'digits with accuracy:', accuracy_score(y_test, pred_ksvm2))