# Grid Search Practice

Grid search is a way of finding better hyperparameters: the settings that define a model's configuration and are not changed while the model learns.
It is also known as exhaustive grid search, because it takes a list of candidate values for each parameter, builds a search space from their Cartesian product, and then evaluates every combination in that space.
In essence, it is a brute-force algorithm.
Let's get familiar with its usage through practice.

First, we prepare some data.

In [1]:
from time import time
import os, itertools
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
X = dataset.iloc[:, :-1]
y = dataset.survived



## Parameter Grid

As a model selection approach, grid search lets you look for the best choice of model parameters by specifying the candidate values for each parameter.

For example, the SVC model has an error penalty parameter `C`.
We can specify the parameter grid as a dictionary that maps the name of each parameter to a list of the values to try:

```python
param_grid = {'C': [1e3, 1e4] }
```

The dictionary allows us to supply the variations for different parameters:

```python
param_grid = {'C': [1e3, 1e4],
              'gamma': [1e-4, 1e-3], }
```

Conceptually, the parameter grid dictionary represents the Cartesian product of the parameter values, which contains every model configuration that will be evaluated.  
In other words, the above `param_grid` sets up the following 4 models to evaluate:

```python
SVC(C=1e3, gamma=1e-4, ...)
SVC(C=1e3, gamma=1e-3, ...)
SVC(C=1e4, gamma=1e-4, ...)
SVC(C=1e4, gamma=1e-3, ...)
```
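The expansion above can be reproduced by hand. This is a minimal sketch using only `itertools.product` from the standard library; the names `names` and `combinations` are made up for illustration, and this is not how scikit-learn implements the expansion internally.

```python
from itertools import product

param_grid = {'C': [1e3, 1e4],
              'gamma': [1e-4, 1e-3]}

# Take the Cartesian product of the value lists, then pair each tuple of
# values back up with the parameter names.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

for combo in combinations:
    print(combo)
```

Each printed dictionary corresponds to one `SVC(...)` configuration in the list above.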

But sometimes this kind of Cartesian product generates more models than desired.
In that case, we can provide multiple dictionaries as a list; each dictionary is expanded into its own grid, and all of the resulting grids are evaluated.

Consider the difference between
```python
param_grid = [{'C': [1e3, 3e3],
              'gamma': [1e-4, 1e-3], },
             {'C': [5e2, 1e3],
              'gamma': [5e-5, 1e-4], }]
# => 2*2 + 2*2 = 8 models
```

and

```python
param_grid = [{'C': [1e3, 3e3, 5e2],
              'gamma': [1e-4, 1e-3, 5e-5], }]
# => 3*3 = 9 models
```
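One way to check these counts without fitting anything is scikit-learn's `ParameterGrid` helper, which expands a grid specification the same way `GridSearchCV` accepts it. The variable names below are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

two_small_grids = [{'C': [1e3, 3e3], 'gamma': [1e-4, 1e-3]},
                   {'C': [5e2, 1e3], 'gamma': [5e-5, 1e-4]}]
one_large_grid = [{'C': [1e3, 3e3, 5e2], 'gamma': [1e-4, 1e-3, 5e-5]}]

# len() reports how many parameter combinations each specification expands to.
n_small = len(ParameterGrid(two_small_grids))
n_large = len(ParameterGrid(one_large_grid))
print(n_small, n_large)

# Count how often one particular combination occurs in the two-grid version.
n_repeats = list(ParameterGrid(two_small_grids)).count({'C': 1e3, 'gamma': 1e-4})
print(n_repeats)
```

Note that the combination `C=1e3, gamma=1e-4` appears in both dictionaries of the first specification, so it is listed twice; overlapping grids are not deduplicated.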

## Cross Validation

Grid search not only sets up the various model configurations, it also runs cross validation on each of them to provide more objective evaluation metrics. The `cv` parameter specifies the number of cross validation folds used for evaluation.
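To see what a single `cv=5` evaluation looks like for one model configuration, here is a small sketch using `cross_val_score`. The data is synthetic (generated with `make_classification`), which is an assumption made so the example runs on its own; it is not the Titanic data used in this notebook, and the `SVC` settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 splits the data into 5 folds; each fold is held out once for scoring
# while the model is trained on the other 4.
scores = cross_val_score(SVC(C=1.0, gamma='scale'), X_demo, y_demo, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # the kind of averaged score grid search compares across candidates
```

Grid search repeats this procedure for every parameter combination in the grid.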

## Practice

With the parameter grid and cross validation combined, grid search requires a lot of computation.
However, we can use multiple processors to speed up the task through the `n_jobs` parameter. Be aware that your Jupyter server learning environment may enforce resource limits to ensure fairness, i.e. you may not be able to use as many CPU cores as you can see, so we use a relatively low number for that parameter.
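A rough count of the work involved, assuming for illustration 8 parameter combinations and 5 folds, and assuming the default `refit=True` behaviour in which the best configuration is fitted once more on all of the data:

```python
import os

n_candidates = 8   # illustrative: e.g. two 2x2 grids
n_folds = 5        # illustrative: cv=5

cv_fits = n_candidates * n_folds   # one fit per candidate per fold
total_fits = cv_fits + 1           # plus the final refit of the best candidate

print(cv_fits, total_fits)

# Number of cores the operating system reports; a shared server may allow
# you to use fewer than this.
print(os.cpu_count())
```

With `n_jobs=2`, these fits are spread over two worker processes rather than run one after another.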

Now let's practice using GridSearchCV. Create a GridSearchCV object **named `clf`** that meets these requirements:

1. Create at least 8 models with variations in `C`, `gamma`, or other parameters of your choice.
2. Use 5-fold cross validation.
3. Use 2 parallel jobs.

In [4]:
# Add your code for the above task here:   (Question #P001)
# ----------------------------------------
param_grid = [{'C': [1e3, 2e3],
              'gamma': [1e-4, 1e-3], },
             {'C': [5e2, 1e3],
              'gamma': [5e-5, 3e-4], }]
clf = GridSearchCV(SVC(kernel = 'rbf', class_weight = 'balanced'),
                   param_grid, cv = 5, n_jobs = 2)


Fit the model to the loaded data `X` and `y`.

In [5]:
# Add your code for the above task here:   (Question #P002)
# ----------------------------------------
clf.fit(X, y)


GridSearchCV(cv=5, estimator=SVC(class_weight='balanced'), n_jobs=2,
             param_grid=[{'C': [1000.0, 2000.0], 'gamma': [0.0001, 0.001]},
                         {'C': [500.0, 1000.0], 'gamma': [5e-05, 0.0003]}])

In [6]:
clf.best_estimator_

SVC(C=2000.0, class_weight='balanced', gamma=0.0001)

GridSearchCV creates a wrapper around the classifier. Once it has been trained with cross validation, the returned object can itself be used as a classifier that represents the best classifier found within the given hyperparameter space.
For example, `clf` above has many familiar methods such as `.predict()` and `.score()`.

In addition, `clf.best_estimator_` gives you access to the best model chosen; `clf.best_score_` stores the mean cross-validated score of that model (accuracy by default for a classifier); and `clf.cv_results_` provides the details of the cross validation.
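The sketch below illustrates these attributes on a small stand-alone search. The synthetic data, the grid values, and the name `demo_search` are all assumptions chosen so the example runs without the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = GridSearchCV(SVC(kernel='rbf'),
                           {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]},
                           cv=5)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)                # parameters of the chosen model
print(demo_search.best_score_)                 # its mean cross-validated score
print(len(demo_search.cv_results_['params']))  # number of candidates evaluated

# Predicting through the search object delegates to the refitted best model.
same = (demo_search.predict(X_demo[:5])
        == demo_search.best_estimator_.predict(X_demo[:5])).all()
print(same)
```

Because the search object refits the best configuration on all of the data by default, calling `.predict()` on it and on `best_estimator_` gives the same result.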


In [7]:
clf.cv_results_

{'mean_fit_time': array([0.08175302, 0.16797867, 0.13424835, 0.32009592, 0.05432243,
        0.07824359, 0.07279572, 0.14306188]),
 'std_fit_time': array([0.00942083, 0.04143124, 0.01801916, 0.05124984, 0.01172544,
        0.01925602, 0.011789  , 0.05972734]),
 'mean_score_time': array([0.00596814, 0.0059    , 0.00592299, 0.00588446, 0.00648656,
        0.00595622, 0.00626373, 0.00593266]),
 'std_score_time': array([1.35640803e-04, 2.33837879e-04, 7.43120382e-05, 1.45119606e-04,
        1.40339308e-04, 7.63602862e-05, 1.65065776e-04, 1.21409764e-04]),
 'param_C': masked_array(data=[1000.0, 1000.0, 2000.0, 2000.0, 500.0, 500.0, 1000.0,
                    1000.0],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_gamma': masked_array(data=[0.0001, 0.001, 0.0001, 0.001, 5e-05, 0.0003, 5e-05,
                    0.0003],
              mask=[False, False, False, False, False, False, False, False],
      

In [11]:
clf_df = pd.DataFrame(clf.cv_results_)
print(clf_df)

   mean_fit_time  std_fit_time  mean_score_time  std_score_time param_C  \
0       0.081753      0.009421         0.005968        0.000136  1000.0   
1       0.167979      0.041431         0.005900        0.000234  1000.0   
2       0.134248      0.018019         0.005923        0.000074  2000.0   
3       0.320096      0.051250         0.005884        0.000145  2000.0   
4       0.054322      0.011725         0.006487        0.000140   500.0   
5       0.078244      0.019256         0.005956        0.000076   500.0   
6       0.072796      0.011789         0.006264        0.000165  1000.0   
7       0.143062      0.059727         0.005933        0.000121  1000.0   

  param_gamma                          params  split0_test_score  \
0      0.0001  {'C': 1000.0, 'gamma': 0.0001}           0.797753   
1       0.001   {'C': 1000.0, 'gamma': 0.001}           0.747191   
2      0.0001  {'C': 2000.0, 'gamma': 0.0001}           0.808989   
3       0.001   {'C': 2000.0, 'gamma': 0.001}       

Now, from `clf.cv_results_`, can you find where the value of `clf.best_score_` came from?

Copy and paste the key/value pair from `clf.cv_results_` that shows the source of `clf.best_score_` below:

Which model was the fastest to train? What were its parameters?

Run a prediction on the first 5 data samples from `X`.

In [9]:
# Enter your answer below   (Question #P005)
# ----------------------------------------
clf.predict(X[:5])


array([0, 0, 0, 1, 0])