# Chapter 5. Model evaluation and enhancement.
# Part 2. Grid search.
Grid search tries every combination of candidate param values and keeps the one that gives the best score.

The candidate values of all params together form a grid.
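
To see what such a grid looks like, here is a small sketch using scikit-learn's 'ParameterGrid' with the candidate values of 'C' and 'gamma' tried throughout this part. The variable names 'param_grid' and 'grid' are chosen here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for each param (same values as in the searches in this part)
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
}

# ParameterGrid yields every combination of the values as a dict
grid = list(ParameterGrid(param_grid))

print('Number of combinations: {}'.format(len(grid)))
print('One combination: {}'.format(grid[0]))
```

6 values of 'C' times 6 values of 'gamma' give 36 combinations, so every search below fits at least 36 models.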

## - Elementary grid search
Implemented as a simple pair of nested for loops:

In [5]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_parameters = {'C':C, 'gamma':gamma}

print('Best accuracy: {}'.format(best_score))
print('Best params: {}'.format(best_parameters))

Best accuracy: 0.9736842105263158
Best params: {'C': 100, 'gamma': 0.001}


^ This accuracy isn't a genuine estimate: the test set was used to choose the params, so the score only shows how well the chosen params fit that particular test set.

## - Correct grid search

To get a genuine accuracy estimate, the dataset should be split into 3 parts: train set (to fit models), validation set (to choose params), test set (to get the genuine accuracy).

Once the best params are found, and before evaluating on the test set, it's better to fit the model again on the train and validation data together, so it learns from as much data as possible.

The algorithm:

In [7]:
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print('Train set shape: {}, validation set shape: {}, test set shape: {}'.format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

best_score = 0

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_valid, y_valid)
        if score>best_score:
            best_score = score
            best_parameters = {'C':C, 'gamma':gamma}
            
svm = SVC(**best_parameters).fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print('Best found validation accuracy: {}'.format(best_score))
print('Best params set: {}'.format(best_parameters))
print('Test accuracy: {}'.format(test_score))

Train set shape: 84, validation set shape: 28, test set shape: 38
Best found validation accuracy: 0.9642857142857143
Best params set: {'C': 10, 'gamma': 0.001}
Test accuracy: 0.9210526315789473


^ The test accuracy is somewhat lower than before, which suggests that it is now a more genuine estimate.

## - Grid search with cross-validation
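Cross-validation can be combined with grid search: each set of params is scored by its mean cross-validation accuracy instead of the accuracy on a single validation split, so the choice depends less on how the data happened to be split.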

In [12]:
from sklearn.model_selection import cross_val_score
import numpy as np

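best_score = 0  # reset, otherwise the value left over from the previous cell is reused

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # no fit here: cross_val_score fits its own copies of the model on each fold
        svm = SVC(gamma=gamma, C=C)
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
        # mean accuracy over the 5 folds
        score = np.mean(scores)
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

# refit on the combined train and validation data using the best params
svm = SVC(**best_parameters).fit(X_trainval, y_trainval)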

However, scikit-learn provides 'GridSearchCV', which does the same thing as an estimator with a handy interface:

In [13]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

#splitting dataset
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

#'GridSearchCV' needs a dictionary mapping param names to the values to try
param_grid = {'C':[0.001,0.01,0.1,1,10,100], 'gamma':[0.001,0.01,0.1,1,10,100]}

#'SVC()' - model, 'param_grid' - param grid, 'cv=5' - 5-fold stratified cross-validation
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
#'GridSearchCV.fit' runs cross-validation for each set of params...
#...As a result the best set of params is found and...
#...a model is refitted with it on the whole train set. After that 'predict' and 'score' can be used
grid_search.fit(X_train, y_train)

print('Test accuracy: {}'.format(grid_search.score(X_test, y_test)))

Test accuracy: 0.9736842105263158


^ The params were chosen by cross-validation on the train set only, so this accuracy is a genuine estimate (the test set wasn't used to train the model or to choose params).

'best_params_' - Best params found

'best_score_' - Best accuracy found (the mean over the cross-validation folds for those params)

'best_estimator_' - The best model, refitted on the whole train set

In [15]:
print('Best param values: {}'.format(grid_search.best_params_))
print('Best cross-validation accuracy: {}'.format(grid_search.best_score_))
print('Best model: {}'.format(grid_search.best_estimator_))

Best param values: {'C': 10, 'gamma': 0.1}
Best cross-validation accuracy: 0.9731225296442687
Best model: SVC(C=10, gamma=0.1)
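
To check that 'best_score_' is the mean over the folds, the per-fold scores can be read from 'cv_results_'. This is a minimal sketch that repeats the same search; the names 'results', 'best' and 'fold_scores' are chosen here for illustration, and it assumes a recent scikit-learn version where the fold mean is unweighted.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 'cv_results_' keeps one array per fold: 'split0_test_score', 'split1_test_score', ...
results = grid_search.cv_results_
# 'best_index_' is the position of the best set of params in those arrays
best = grid_search.best_index_
fold_scores = [results['split{}_test_score'.format(i)][best] for i in range(5)]

print('Fold accuracies for best params: {}'.format(fold_scores))
print('Mean of folds: {}'.format(np.mean(fold_scores)))
print('best_score_: {}'.format(grid_search.best_score_))
```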


To use the found model, 'GridSearchCV' has the 'GridSearchCV.predict' and 'GridSearchCV.score' methods, which work with the best model.
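
A short sketch of these two methods, repeating the same search so that the cell runs on its own. The names 'predictions', 'accuracy' and 'same' are chosen here for illustration; the comparison with 'best_estimator_' assumes the default 'refit=True'.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 'predict' returns class labels from the refitted best model
predictions = grid_search.predict(X_test)
# 'score' returns accuracy on the given data (default scoring for a classifier)
accuracy = grid_search.score(X_test, y_test)
# with refit=True the predictions match those of 'best_estimator_'
same = bool((predictions == grid_search.best_estimator_.predict(X_test)).all())

print('First 10 predictions: {}'.format(predictions[:10]))
print('Test accuracy: {}'.format(accuracy))
print('Same as best_estimator_: {}'.format(same))
```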