# Lesson 23 - Grid Search

### The following topis are discussed in this notebook:
* Using `GridSearchCV` for hyperparameter selection.

### Additional Resources
* Hands-On Machine Learning, pages 72 - 84
*  https://www.mygreatlearning.com/blog/gridsearchcv/
*  https://sklearn.org/modules/generated/sklearn.model_selection.GridSearchCV.html



In [47]:
import warnings
warnings.filterwarnings('ignore')



import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

## Generate Data

In [48]:
np.random.seed(1)
X, y = make_classification(n_samples=400, n_features=6, n_informative=6, n_redundant=0, n_classes=7, class_sep=3)

np.set_printoptions(suppress=True, precision=2)
print('Distribution of Features:')
print('Min: ', np.min(X, axis=0))
print('Max: ', np.max(X, axis=0))
print('Mean:', np.mean(X, axis=0))
print('SDev:', np.std(X, axis=0))
np.set_printoptions(suppress=True, precision=4)

Distribution of Features:
Min:  [-6.84 -7.25 -7.16 -6.73 -5.93 -6.76]
Max:  [7.33 6.66 6.61 8.56 5.95 8.44]
Mean: [ 0.51 -0.41  0.82  0.2   0.81 -0.89]
SDev: [3.38 3.18 3.24 3.35 3.16 3.23]


## Grid Search with Logistic Regression

In [49]:
from sklearn.model_selection import GridSearchCV

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

**class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)**

Parameters:

estimator:  estimator object.   This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.

param_grid: dict or list of dictionaries.  Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

scoring: str, callable, list, tuple or dict, default=None.   Strategy to evaluate the performance of the cross-validated model on the test set.

If scoring represents a single score, one can use:
•a single string (see The scoring parameter: defining model evaluation rules);
•a callable (see Defining your scoring strategy from metric functions) that returns a single value.
If scoring reprents multiple scores, one can use:
•a list or tuple of unique strings;
•a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
•a dictionary with metric names as keys and callables a values.
See Specifying multiple metrics for evaluation for an example.

n_jobs: int, default=None.   Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.


cv:  int, cross-validation generator or an iterable, default=None.   Determines the cross-validation splitting strategy. Possible inputs for cv are:
•None, to use the default 5-fold cross validation,
•integer, to specify the number of folds in a (Stratified)KFold,
•CV splitter,
•An iterable yielding (train, test) splits as arrays of indices.





Instead of fiddling with the hyperparameters manually (i.e., using the FOR loops) until we find the optimal combination of hyperparameters we can use Scikit_learn's  *GridSearchCV* to find them.  Here's how we do it with LogReg. 

In [50]:
param_grid = [
    {'C': 10**np.linspace(-3,3,10)}
]

log_reg = LogisticRegression(solver='lbfgs', multi_class='ovr')
gscv_01 = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', refit=True)
gscv_01.fit(X, y)
res_01 = gscv_01.cv_results_

#### Exploring Grid Search Results

In [51]:
print(res_01.keys())

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_C', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])


In [52]:
print(res_01['mean_test_score'])
print(res_01['mean_fit_time'])

[0.655  0.715  0.74   0.7425 0.735  0.7275 0.7275 0.7275 0.7275 0.7275]
[0.0118 0.0115 0.013  0.0142 0.0148 0.0166 0.0165 0.017  0.017  0.0169]


In [53]:
for i in range(0,10):
    print(res_01['mean_test_score'][i], res_01['params'][i])



0.655 {'C': 0.001}
0.7150000000000001 {'C': 0.004641588833612777}
0.74 {'C': 0.021544346900318832}
0.7424999999999999 {'C': 0.1}
0.7350000000000001 {'C': 0.46415888336127775}
0.7274999999999999 {'C': 2.154434690031882}
0.7274999999999999 {'C': 10.0}
0.7274999999999999 {'C': 46.41588833612773}
0.7274999999999999 {'C': 215.44346900318823}
0.7274999999999999 {'C': 1000.0}


In [54]:
print(gscv_01.best_score_)
print(gscv_01.best_params_)

0.7424999999999999
{'C': 0.1}


 #### Obtaining Best Model

In [55]:
lin_reg = gscv_01.best_estimator_
print('Training Score:', lin_reg.score(X, y))

Training Score: 0.76


## Grid Search with K-Nearest Neighbors

In [56]:
param_grid = [
    {'n_neighbors': range(1,20), 'p': [1,2]}
]

knn = KNeighborsClassifier()

gscv_02 = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy',  refit=True)

gscv_02.fit(X, y)

res_02 = gscv_02.cv_results_

In [57]:
print(gscv_02.best_score_)
print(gscv_02.best_params_)

0.9625
{'n_neighbors': 3, 'p': 2}


In [58]:
knn = gscv_02.best_estimator_
print('Training Score:', knn.score(X, y))

Training Score: 0.98


## Grid Search with SVC

In [59]:
param_grid = [
    {'kernel':['poly'], 'degree': [1,2,3], 'C':10**np.linspace(-3,3,10), 'gamma':['auto']},
    {'kernel':['rbf'], 'C':10**np.linspace(-3,3,10), 'gamma':10**np.linspace(-3,3,10)}
]

svm = SVC()

gscv_03 = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy',  refit=True)

gscv_03.fit(X, y)

res_03 = gscv_03.cv_results_

In [60]:
print(gscv_03.best_score_)
print(gscv_03.best_params_)

0.975
{'C': 2.154434690031882, 'gamma': 0.1, 'kernel': 'rbf'}


In [61]:
svm = gscv_03.best_estimator_
print('Training Score:', svm.score(X, y))

Training Score: 0.99


## Grid Search with Random Forests

In [62]:
param_grid = [
    {'n_estimators':np.arange(100,500,100), 'max_depth':range(2,6), 'bootstrap':['True','False']}
]

forest = RandomForestClassifier()

gscv_04 = GridSearchCV(forest, param_grid, cv=5, scoring='accuracy',  refit=True)

gscv_04.fit(X, y)

res_04 = gscv_04.cv_results_

In [63]:
print(gscv_04.best_score_)
print(gscv_04.best_params_)

0.9525
{'bootstrap': 'True', 'max_depth': 5, 'n_estimators': 200}


In [64]:
forest = gscv_04.best_estimator_
print('Training Score:', forest.score(X, y))

Training Score: 0.9825


##  let's try GridSearch on another data set...

In [65]:
df = pd.read_csv('data/titanic.txt', sep='\t')
df.head(5)
Xnum = df.iloc[:, [4]].values
Xcat = df.iloc[:, [1, 3]].values.astype('str')
y = df.iloc[:, 0].values

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)
encoder.fit(Xcat)
Xenc = encoder.transform(Xcat)
print(Xenc.shape)
X = np.hstack([Xnum, Xenc])

(887, 5)


In [66]:
#using LogReg
param_grid = [
    {'C': 10**np.linspace(-3,3,10)}
]

log_reg = LogisticRegression(solver='lbfgs', multi_class='ovr')
gscv_01 = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', refit=True)
gscv_01.fit(X, y)
res_01 = gscv_01.cv_results_
print(res_01['mean_test_score'])
print(res_01['mean_fit_time'])

log_reg = gscv_01.best_estimator_
print(gscv_01.best_params_)
print('Training Score:', log_reg.score(X, y))

[0.6144 0.7159 0.7925 0.7925 0.7903 0.7892 0.7892 0.7892 0.7892 0.7892]
[0.0102 0.0091 0.0158 0.0144 0.0146 0.0168 0.0131 0.009  0.0102 0.0113]
{'C': 0.021544346900318832}
Training Score: 0.7970687711386697


## now try the K_Nearest Neighbors...   On the same TITANIC data set...

In [67]:
param_grid = [
    {'n_neighbors': range(1,20), 'p': [1,2]}
]

knn = KNeighborsClassifier()

gscv_02 = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy',  refit=True)

gscv_02.fit(X, y)

res_02 = gscv_02.cv_results_

print(gscv_02.best_score_)
print(gscv_02.best_params_)
knn = gscv_02.best_estimator_
print('Training Score:', knn.score(X, y))

0.8061067733130198
{'n_neighbors': 5, 'p': 1}
Training Score: 0.8511837655016911


In [68]:
param_grid = [
    {'kernel':['poly'], 'degree': [1], 'C':10**np.linspace(-3,3,10), 'gamma':['ovr']},
    {'kernel':['rbf'], 'C':10**np.linspace(-3,3,10), 'gamma':10**np.linspace(-3,3,10)}
]

svm = SVC()

gscv_03 = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy',  refit=True)

gscv_03.fit(X, y)

res_03 = gscv_03.cv_results

print(gscv_03.best_score_)
print(gscv_03.best_params_)
svm = gscv_03.best_estimator_
print('Training Score:', svm.score(X, y))

KeyboardInterrupt: 