# Hyperparameter Tuning and Model Selection using GridSearchCV

### Resources:
- [sklearn GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)
- [sklearn RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
- [Additional walkthrough on Medium found here](https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf)
- Credit for this notebook goes to Dhaval Patel from Codebasics. [Video found here](https://www.youtube.com/watch?v=HdlDYng8g9s)

### Note:
This notebook does NOT show the best or only way to work through this process of Hyperparameter Tuning. This is meant to be a very basic and illustrative example. You should still refer to the sklearn documentation.

## A worked example using the sklearn iris dataset

1. Load in the sklearn iris dataset

In [None]:
from sklearn import svm, datasets
iris = datasets.load_iris()

In [None]:
import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])

2. Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

3. Use a support vector classifier (SVC) to fit the training data and score it on the test set (don't worry too much about the details of the model)

In [None]:
model = svm.SVC(kernel='rbf', C=30, gamma='auto')  # arbitrarily chosen hyperparameters
model.fit(X_train, y_train)
model.score(X_test, y_test)



4. The score changes depending on how the data is split into training and test sets, so we use k-fold cross-validation to get a more stable estimate
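
To see how much the score can move with the split, here is a minimal sketch that refits the same SVC on five different random splits. The `random_state` seeds and the names `split_scores`, `X_tr`, `X_te`, `y_tr`, `y_te` are illustrative assumptions, not part of the original notebook.

```python
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Refit the same model on five different random splits (the seeds are arbitrary)
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    model = svm.SVC(kernel='rbf', C=30, gamma='auto')
    model.fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

# Each entry is the test accuracy for one split
print(split_scores)
```

Any spread in the printed scores comes only from which rows landed in the test set, which is the motivation for averaging over folds.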

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(svm.SVC(kernel='linear', C=10, gamma='auto'), iris.data, iris.target, cv=5)



In [None]:
# try with different hyperparameters
# and compare the average scores, but it's very manual
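
As a rough sketch of what the manual approach looks like, the same `cross_val_score` call can be repeated with different hyperparameters and the fold scores averaged. The particular settings and the variable names below are assumptions chosen for illustration.

```python
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# One call per hyperparameter setting; each returns an array of 5 fold scores
linear_scores = cross_val_score(
    svm.SVC(kernel='linear', C=10, gamma='auto'), iris.data, iris.target, cv=5
)
rbf_scores = cross_val_score(
    svm.SVC(kernel='rbf', C=10, gamma='auto'), iris.data, iris.target, cv=5
)
rbf_c20_scores = cross_val_score(
    svm.SVC(kernel='rbf', C=20, gamma='auto'), iris.data, iris.target, cv=5
)

# Compare the settings by their mean score across folds
print('linear, C=10:', linear_scores.mean())
print('rbf,    C=10:', rbf_scores.mean())
print('rbf,    C=20:', rbf_c20_scores.mean())
```

Every new setting needs another copy of the call, which is why this does not scale.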




You can also try a number of for loops but think of how long that would take...

Pseudocode:
```python
# set kernel_lst, c_lst, gamma_lst to the lists of parameter values you want to try
scores = []
for k in kernel_lst:
    for c in c_lst:
        for g in gamma_lst:
            # one 5-fold cross-validation per (kernel, C, gamma) combination
            cv_scores = cross_val_score(svm.SVC(kernel=k, C=c, gamma=g), iris.data, iris.target, cv=5)
            scores.append(cv_scores)
```


## Instead, Use GridSearchCV

1. Set up the GridSearchCV parameters:
- The first argument is the model, with any parameters we want to hold fixed
- The second argument is the parameter grid: a dictionary mapping each parameter name to the list of values to try

In [None]:
from sklearn.model_selection import GridSearchCV

clf=GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1, 10, 20],  # values of C to try
    'kernel': ['rbf', 'linear']
}, cv=5, return_train_score=False)
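
To make the size of the search concrete, the sketch below uses sklearn's `ParameterGrid` helper to list the combinations that a grid like the one above expands to. The names `param_grid`, `combinations`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}

# ParameterGrid expands the dictionary into every combination of values
combinations = list(ParameterGrid(param_grid))
for params in combinations:
    print(params)

# With cv=5, each combination is fit once per fold
# (GridSearchCV also refits the best combination once on all the data by default)
n_fits = len(combinations) * 5
print(len(combinations), 'combinations,', n_fits, 'cross-validation fits')
```

Three values of `C` times two kernels gives 6 combinations, so 30 cross-validation fits with 5 folds.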


2. Fit the search on the data and get the results, then show them in a dataframe

In [None]:
clf.fit(iris.data, iris.target)
clf.cv_results_

# cast the results to a pandas dataframe
df = pd.DataFrame(clf.cv_results_)

In [None]:
# We can check out the parameters and mean test scores as a pandas dataframe

df[['param_C', 'param_kernel', 'mean_test_score']]

In [None]:
# check the directory of the classifier to see the attributes and methods
dir(clf)

3. Check the best score and best parameters using the attributes of the fitted search object.

These should match the best result we see in the dataframe.

In [None]:
clf.best_score_

In [None]:

clf.best_params_
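
The search above is fit on the full dataset, so `best_score_` is a cross-validated estimate rather than a score on unseen data. One possible variation, sketched here under the assumption that you want a held-out check, is to run the search on a training split and score `best_estimator_` on the test split. The seed and the names `search` and `best_model` are illustrative.

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0  # arbitrary seed
)

search = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is the winning model
# refit on all of X_train
best_model = search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(search.best_params_)
print(test_score)
```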


## GridSearch Computation Can be Very Costly

1. Trying every combination gets expensive as the grid grows. To deal with this, you can use RandomizedSearchCV, which samples a random subset of the combinations, and you (the user) pass in how many combinations you want to try

In [None]:
from sklearn.model_selection import RandomizedSearchCV

rs = RandomizedSearchCV(
    svm.SVC(gamma='auto'),
    {
        'C': [1, 10, 20],  # candidate values of C to sample from
        'kernel': ['rbf', 'linear']
    },
    cv=5,
    return_train_score=False,
    n_iter=2  # number of parameter combinations to sample
)

2. Fit the RandomizedSearch and get the cross validated results

In [None]:
rs.fit(iris.data, iris.target)

#pass the results to a dataframe
df2 = pd.DataFrame(rs.cv_results_)[['param_C', 'param_kernel', 'mean_test_score']]
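
RandomizedSearchCV also accepts distributions instead of lists, which lets it sample values that a fixed grid would never contain. The following is a sketch only: the range for `C`, the seed, and the name `rs_dist` are assumptions, and it relies on `scipy.stats.loguniform` being available in your SciPy version.

```python
from scipy.stats import loguniform
from sklearn import svm, datasets
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

rs_dist = RandomizedSearchCV(
    svm.SVC(gamma='auto'),
    {
        'C': loguniform(0.1, 100),    # assumed range, sampled on a log scale
        'kernel': ['rbf', 'linear'],  # lists are sampled uniformly
    },
    n_iter=5,         # number of combinations to sample
    cv=5,
    random_state=0,   # arbitrary seed so the sampled combinations are repeatable
)
rs_dist.fit(iris.data, iris.target)

print(rs_dist.best_params_)
print(rs_dist.best_score_)
```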

## Using GridSearch to Select the Best Model

In [None]:
from sklearn import svm  # already imported above, but let's pretend we're starting from scratch
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

1. Define the parameter grid 

Think of this as a trial-and-error process.



In [None]:

model_params = {
    'svm': {
        'model' : svm.SVC(gamma='auto'), 
        'params': {
            'C': [1, 10, 20], 
            'kernel': ['rbf', 'linear']
        }
    }, 
    'random_forest': {
        'model' : RandomForestClassifier(), 
        'params': {
            'n_estimators': [1, 5, 10]
        }
    },
    'logistic_regression': {
        'model' : LogisticRegression(solver='liblinear'),  # multi_class left at its default
        'params': {
            'C': [1, 5, 10], 
        }
    }
}

2. Get the best model

In [None]:
scores = []

for model_name, mp in model_params.items():
    clf=GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(iris.data, iris.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

3. Show the model with the best parameters and the best score to choose the best model

In [None]:
df = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
df
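
To pick a winner programmatically rather than by eye, the comparison table can be sorted by score. This is a self-contained sketch using two of the models above; the names `candidates`, `rows`, and `ranked` are illustrative, and the `random_state` on the random forest is an assumption added so repeated runs agree.

```python
import pandas as pd
from sklearn import svm, datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Each candidate pairs a model with the grid to search for it
candidates = {
    'svm': (svm.SVC(gamma='auto'), {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}),
    'random_forest': (RandomForestClassifier(random_state=0), {'n_estimators': [1, 5, 10]}),
}

rows = []
for name, (model, params) in candidates.items():
    search = GridSearchCV(model, params, cv=5)
    search.fit(iris.data, iris.target)
    rows.append({
        'model': name,
        'best_score': search.best_score_,
        'best_params': search.best_params_,
    })

# Highest cross-validated score first
ranked = pd.DataFrame(rows).sort_values('best_score', ascending=False).reset_index(drop=True)
print(ranked)
print('Top model:', ranked.loc[0, 'model'])
```

When scores are close, a difference this small may not be meaningful on a dataset of only 150 rows, so treat the ranking as a starting point.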