# Chapter 5. Model evaluation and enhancement.
# Part 2. Grid search.
Grid search tries every combination of the candidate parameter values and keeps the combination that gives the best score.

The candidate values of all params together form a grid.
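
To make the idea of a grid concrete, the sketch below enumerates every combination for two params. It uses 'sklearn.model_selection.ParameterGrid' and the same candidate values as the SVC examples in this part; the names 'grid' and 'n_combinations' are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for each param (same values as the SVC grid in this part)
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# ParameterGrid enumerates every combination of the listed values
grid = list(ParameterGrid(param_grid))
n_combinations = len(grid)  # 6 values of C times 6 values of gamma

print('Number of combinations: {}'.format(n_combinations))
print('One combination: {}'.format(grid[0]))
```

Each element of 'grid' is a dictionary that can be passed to a model as 'SVC(**params)'.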

## - Elementary grid search
Implemented with a simple pair of nested for loops:

In [1]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_parameters = {'C':C, 'gamma':gamma}

print('Best accuracy: {}'.format(best_score))
print('Best params: {}'.format(best_parameters))

Best accuracy: 0.9736842105263158
Best params: {'C': 100, 'gamma': 0.001}


^ This accuracy isn't a genuine estimate of performance on new data: the test set was used to select the params, so the score only describes how well the chosen params fit that particular test set.

## - Correct grid search

To get a genuine accuracy estimate, the dataset should be split into 3 parts: train set (to fit the model), validation set (to select params), test set (to evaluate the selected params).

Once the best params are found, and before scoring on the test set, it's better to refit the model on the train and validation data together so that it learns from as much data as possible.

The algorithm:

In [2]:
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print('Train set shape: {}, validation set shape: {}, test set shape: {}'.format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

best_score = 0

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_valid, y_valid)
        if score>best_score:
            best_score = score
            best_parameters = {'C':C, 'gamma':gamma}
            
svm = SVC(**best_parameters).fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print('Best found validation accuracy: {}'.format(best_score))
print('Best params set: {}'.format(best_parameters))
print('Test accuracy: {}'.format(test_score))

Train set shape: 84, validation set shape: 28, test set shape: 38
Best found validation accuracy: 0.9642857142857143
Best params set: {'C': 10, 'gamma': 0.001}
Test accuracy: 0.9210526315789473


^ The accuracy is somewhat lower, which may indicate that the estimate is now more genuine: the test set played no part in selecting the params.

## - Grid search with cross-validation
Combining cross-validation with grid search makes the choice of params less dependent on a single train/validation split: each combination is scored by its mean accuracy across the folds.

In [3]:
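from sklearn.model_selection import cross_val_score
import numpy as np

# Reset the running best so the score from the previous cell doesn't leak in
best_score = 0

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # cross_val_score fits its own copies of the model, so no prior fit is needed
        svm = SVC(gamma=gamma, C=C)
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
        score = np.mean(scores)
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

# Refit on the combined train and validation data with the best params
svm = SVC(**best_parameters).fit(X_trainval, y_trainval)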

'GridSearchCV' - runs the cross-validation grid search and wraps it in a model-like object:

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# 'GridSearchCV' needs a dictionary of the grid params to consider
param_grid = {'C':[0.001,0.01,0.1,1,10,100], 'gamma':[0.001,0.01,0.1,1,10,100]}

# 'SVC()' - model, 'param_grid' - params grid, 'cv=5' - 5-fold stratified cross-validation
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
# 'GridSearchCV.fit' runs cross-validation for each set of params.
# The best set is then found and a model is refit with it on the data
# passed to 'fit', so 'predict' and 'score' can be used afterwards
grid_search.fit(X_train, y_train)

print('Test accuracy: {}'.format(grid_search.score(X_test, y_test)))

Test accuracy: 0.9736842105263158


^ This accuracy is a valid estimate: the test set wasn't used to select params or to train the model, because 'GridSearchCV' runs its cross-validation only on the data passed to 'fit'.

'cv' can also take 'sklearn.model_selection.ShuffleSplit()' or 'sklearn.model_selection.StratifiedShuffleSplit()' as a split generator.
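
As an illustration of passing a split generator, the sketch below hands a 'StratifiedShuffleSplit' object to 'cv'. The values of 'n_splits', 'test_size' and 'random_state' and the name 'splitter' are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# 10 random stratified splits, each holding out 20% of the training data
# (illustrative settings)
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# The splitter replaces the integer fold count
grid_search = GridSearchCV(SVC(), param_grid, cv=splitter)
grid_search.fit(X_train, y_train)

print('Best params set: {}'.format(grid_search.best_params_))
print('Best mean accuracy over the splits: {}'.format(grid_search.best_score_))
```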

'best_params_' - The best params found

'best_score_' - The best accuracy found (the mean over the cross-validation folds for those params)

'best_estimator_' - The best model itself, refit with the best params

In [5]:
print('Best param values: {}'.format(grid_search.best_params_))
print('Best cross-validation accuracy: {}'.format(grid_search.best_score_))
print('Best model: {}'.format(grid_search.best_estimator_))

Best param values: {'C': 10, 'gamma': 0.1}
Best cross-validation accuracy: 0.9731225296442687
Best model: SVC(C=10, gamma=0.1)


To use the found model, 'GridSearchCV' provides the 'GridSearchCV.predict' and 'GridSearchCV.score' methods.
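
A minimal sketch of those two methods, assuming the same iris split and grid as in the cells above. The names 'predictions' and 'manual_accuracy' are introduced only for the example; the manual accuracy is computed to show that 'score' reports the accuracy of the refit best model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

# 'predict' delegates to the model refit with the best params
predictions = grid_search.predict(X_test)

# Share of correct predictions, computed by hand for comparison with 'score'
manual_accuracy = np.mean(predictions == y_test)

print('First predictions: {}'.format(predictions[:5]))
print('Accuracy via score: {}'.format(grid_search.score(X_test, y_test)))
print('Accuracy by hand: {}'.format(manual_accuracy))
```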

## - Grid Search Analysis

Grid search analysis data is stored in 'GridSearchCV.cv_results_':

In [6]:
import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)
display(results.head())

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_gamma,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002848,0.000581,0.000899,0.000182,0.001,0.001,"{'C': 0.001, 'gamma': 0.001}",0.347826,0.347826,0.363636,0.363636,0.409091,0.366403,0.022485,22
1,0.002346,0.00025,0.000719,5.1e-05,0.001,0.01,"{'C': 0.001, 'gamma': 0.01}",0.347826,0.347826,0.363636,0.363636,0.409091,0.366403,0.022485,22
2,0.002416,0.000247,0.00068,2e-05,0.001,0.1,"{'C': 0.001, 'gamma': 0.1}",0.347826,0.347826,0.363636,0.363636,0.409091,0.366403,0.022485,22
3,0.002778,0.000454,0.001114,0.000332,0.001,1.0,"{'C': 0.001, 'gamma': 1}",0.347826,0.347826,0.363636,0.363636,0.409091,0.366403,0.022485,22
4,0.002591,0.000417,0.000885,0.000146,0.001,10.0,"{'C': 0.001, 'gamma': 10}",0.347826,0.347826,0.363636,0.363636,0.409091,0.366403,0.022485,22
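
For a two-param grid, one way to read these results is to reshape the mean test scores into a gamma-by-C table. The sketch below assumes the same iris split and grid as above; the names 'scores' and 'score_table' are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

# One row per candidate: its params plus its mean cross-validation accuracy
scores = pd.DataFrame(grid_search.cv_results_['params'])
scores['mean_test_score'] = grid_search.cv_results_['mean_test_score']

# Rows are gamma values, columns are C values
score_table = scores.pivot(index='gamma', columns='C', values='mean_test_score')
print(score_table.round(3))
```

A table like this makes it easier to see whether the best score sits at the edge of the grid, in which case the searched ranges may need to be extended.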


## - Economical grid search
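It's not always a good idea to search the whole grid, since some params only apply for certain values of other params (for example, 'gamma' is used by the 'rbf' kernel but not by the 'linear' one).

For that case it's better to pass a list of dictionaries, one for each set of params to consider together: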

In [14]:
param_grid = [{'kernel': ['rbf'], 'C': [0.001,0.01,0.1,1,10,100], 'gamma': [0.001,0.01,0.1,1,10,100]},
              {'kernel': ['linear'], 'C': [0.001,0.01,0.1,1,10,100]}]

grid_search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

print('Best params set: {}'.format(grid_search.best_params_))
print('Best cross-validation accuracy: {}'.format(grid_search.best_score_))

Best params set: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
Best cross-validation accuracy: 0.9731225296442687


In [15]:
results = pd.DataFrame(grid_search.cv_results_)
display(results.T)

,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
mean_fit_time,0.003231,0.003102,0.003064,0.003367,0.002802,0.002392,0.002473,0.00285,0.002272,0.002465,...,0.001098,0.001417,0.00245,0.002576,0.001807,0.001624,0.0011,0.000966,0.000959,0.001056
std_fit_time,0.000355,0.000349,0.000608,0.000805,0.000344,0.000355,0.000215,0.000273,0.000254,0.000348,...,0.000464,0.00011,0.000141,0.000106,0.000192,0.000068,0.000036,0.000089,0.000233,0.000281
mean_score_time,0.001142,0.001068,0.001233,0.001021,0.00082,0.000811,0.000748,0.000813,0.000744,0.00074,...,0.000478,0.000479,0.000723,0.000784,0.000574,0.00066,0.000462,0.000403,0.000449,0.00041
std_score_time,0.000177,0.0003,0.000415,0.000256,0.000122,0.000113,0.000051,0.00012,0.000089,0.000058,...,0.000134,0.000028,0.000065,0.000072,0.00005,0.000207,0.000029,0.000026,0.000178,0.000105
param_C,0.001,0.001,0.001,0.001,0.001,0.001,0.01,0.01,0.01,0.01,...,100,100,100,100,0.001,0.01,0.1,1,10,100
param_gamma,0.001,0.01,0.1,1,10,100,0.001,0.01,0.1,1,...,0.1,1,10,100,,,,,,
param_kernel,rbf,rbf,rbf,rbf,rbf,rbf,rbf,rbf,rbf,rbf,...,rbf,rbf,rbf,rbf,linear,linear,linear,linear,linear,linear
params,"{'C': 0.001, 'gamma': 0.001, 'kernel': 'rbf'}","{'C': 0.001, 'gamma': 0.01, 'kernel': 'rbf'}","{'C': 0.001, 'gamma': 0.1, 'kernel': 'rbf'}","{'C': 0.001, 'gamma': 1, 'kernel': 'rbf'}","{'C': 0.001, 'gamma': 10, 'kernel': 'rbf'}","{'C': 0.001, 'gamma': 100, 'kernel': 'rbf'}","{'C': 0.01, 'gamma': 0.001, 'kernel': 'rbf'}","{'C': 0.01, 'gamma': 0.01, 'kernel': 'rbf'}","{'C': 0.01, 'gamma': 0.1, 'kernel': 'rbf'}","{'C': 0.01, 'gamma': 1, 'kernel': 'rbf'}",...,"{'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}","{'C': 100, 'gamma': 1, 'kernel': 'rbf'}","{'C': 100, 'gamma': 10, 'kernel': 'rbf'}","{'C': 100, 'gamma': 100, 'kernel': 'rbf'}","{'C': 0.001, 'kernel': 'linear'}","{'C': 0.01, 'kernel': 'linear'}","{'C': 0.1, 'kernel': 'linear'}","{'C': 1, 'kernel': 'linear'}","{'C': 10, 'kernel': 'linear'}","{'C': 100, 'kernel': 'linear'}"
split0_test_score,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,...,1.0,0.956522,0.869565,0.521739,0.347826,0.869565,1.0,1.0,1.0,0.956522
split1_test_score,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,0.347826,...,0.956522,0.956522,0.913043,0.521739,0.347826,0.869565,0.913043,0.956522,1.0,0.956522


^ As intended, the linear-kernel candidates were searched without the 'gamma' param.
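
The saving can be checked by counting the candidates. The sketch below uses 'sklearn.model_selection.ParameterGrid' on the same list of dictionaries; the names 'candidates' and 'linear_candidates' are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [{'kernel': ['rbf'], 'C': [0.001, 0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
              {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}]

# Each dictionary is expanded on its own, then the results are concatenated
candidates = list(ParameterGrid(param_grid))
linear_candidates = [p for p in candidates if p['kernel'] == 'linear']

print('Total candidates: {}'.format(len(candidates)))            # 36 rbf + 6 linear
print('Linear candidates: {}'.format(len(linear_candidates)))
print('Example linear candidate: {}'.format(linear_candidates[0]))
```

A single dictionary with both kernels and all 'gamma' values would instead give 2 x 6 x 6 = 72 candidates, 30 of them redundant linear ones.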

## - Nested cross-validation
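Instead of a single train/test split, the data is split by an outer cross-validation, and a cross-validation grid search is run on the training part of each outer split, so every outer fold is used as a test set in turn.

WARNING: Processing is highly demanding. Here 42 param sets x 5 inner folds x 5 outer folds give 1050 model fits, plus one refit per outer fold.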

In [16]:
scores = cross_val_score(GridSearchCV( SVC(), param_grid, cv=5),
                         iris.data, iris.target, cv=5 )
print('Cross-validation accuracies: {}'.format(scores))
print('Mean cross-validation accuracy: {}'.format(scores.mean()))

Cross-validation accuracies: [0.96666667 1.         0.9        0.96666667 1.        ]
Mean cross-validation accuracy: 0.9666666666666668


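^ The result should be read as 'an SVC model tuned by this grid search achieves about 97% mean cross-validation accuracy'. Nested cross-validation doesn't return a single best set of params; it only evaluates the search procedure as a whole.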
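
To show what happens inside, the sketch below writes the outer loop by hand. It is only an approximation of what 'cross_val_score' does with a 'GridSearchCV' estimator, assuming 'StratifiedKFold(n_splits=5)' for the outer split; the names 'outer_cv', 'inner_search' and 'outer_scores' are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

param_grid = [{'kernel': ['rbf'], 'C': [0.001, 0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
              {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}]

outer_cv = StratifiedKFold(n_splits=5)  # assumed outer split
outer_scores = []

for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: grid search with its own 5-fold cross-validation,
    # run only on the outer training part
    inner_search = GridSearchCV(SVC(), param_grid, cv=5)
    inner_search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the refit best model on the held-out outer fold
    outer_scores.append(inner_search.score(X[test_idx], y[test_idx]))
    print('Params chosen in this outer fold: {}'.format(inner_search.best_params_))

outer_scores = np.array(outer_scores)
print('Outer fold accuracies: {}'.format(outer_scores))
print('Mean accuracy: {}'.format(outer_scores.mean()))
```

The chosen params may differ between outer folds, which is why nested cross-validation gives a score for the procedure and not one final set of params.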

## - Parallel processing
'GridSearchCV' and 'cross_val_score' both have an 'n_jobs' param that sets the number of CPU cores to use.

'n_jobs=-1' uses all available cores.
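
A short sketch of both calls with 'n_jobs=-1', assuming the same iris data and grid as in the earlier cells. The fixed 'C=10, gamma=0.1' in the second call is only an example value taken from the best params found above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# Grid search with the candidate fits spread across all available cores
grid_search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print('Best params set: {}'.format(grid_search.best_params_))

# Plain cross-validation with the folds evaluated in parallel
scores = cross_val_score(SVC(C=10, gamma=0.1), iris.data, iris.target, cv=5, n_jobs=-1)
print('Cross-validation accuracies: {}'.format(scores))
```

On a dataset as small as iris the speed-up is unlikely to be noticeable; the param matters more for larger grids and datasets.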