# Hyperparameter Tuning with sklearn

In this tutorial, we walk through an example workflow that uses hyperparameter tuning with [scikit-learn](https://scikit-learn.org/stable/) to pick the best model for a machine learning problem.

**Requirements:** Add the `scikit-learn` package from the package picker at the top right. The notebook also imports `pandas`, `numpy`, and `altair`.

## Load Iris Dataset

We will use scikit-learn's built-in [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) to predict the flower species from four features: the length and width of the sepal and of the petal.

In [33]:
# Load iris flower dataset
from sklearn import svm, datasets
iris = datasets.load_iris()

In [104]:
import pandas as pd
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower |
|---|---|---|---|---|---|
| 47 | 4.6 | 3.2 | 1.4 | 0.2 | setosa |
| 48 | 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 49 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| 56 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
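
Before tuning anything, it can help to check the size of the dataset and how the classes are distributed, since accuracy is easier to interpret when the classes are balanced. The following is a minimal sketch of that check; the variable names `df` and `class_counts` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

# Load the dataset and build the same kind of DataFrame as in the cell above
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map the integer targets to species names via NumPy indexing
df["flower"] = iris.target_names[iris.target]

# Number of rows and columns (four features plus the label column)
print(df.shape)

# Number of samples per species
class_counts = df["flower"].value_counts()
print(class_counts)
```

Iris contains 150 samples split evenly across three species, so plain accuracy is a reasonable metric for the experiments that follow.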


Let's start by holding out a test set with `train_test_split` and tuning the parameters manually by trial and error.

In [109]:
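from sklearn.model_selection import train_test_split
# Hold out 30% of the samples for testing. No random_state is set,
# so the split (and the score below) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)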

In [110]:
from sklearn.model_selection import cross_val_score
# Build a Support Vector Classification (SVC) model
model = svm.SVC(kernel='rbf',C=30,gamma='auto')
model.fit(X_train,y_train)
model.score(X_test, y_test)

0.9111111111111111
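
A single train/test split gives a noisy estimate: the score depends on which samples happen to land in the test set. The sketch below repeats the same experiment with several seeds to illustrate this. The seeds `0` to `4` and the name `split_scores` are arbitrary choices made for this illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

split_scores = []
for seed in range(5):  # arbitrary seeds, used only to make each split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    # Same model and parameters as in the cell above
    model = svm.SVC(kernel="rbf", C=30, gamma="auto")
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

# One accuracy value per split; the spread shows how much a single split can mislead
print(split_scores)
```

This variation is the reason the next step averages over several folds instead of relying on one split.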

It's difficult to guess the right parameters for the model, so let's loop over several parameter values, score each combination with k-fold cross-validation, and record the average score.

In [None]:
# Using a for-loop to iterate over different kernel types and regularization parameter (C) values
import numpy as np
kernels = ['rbf', 'linear']
C = [1,10,20]
avg_scores = {}
for kval in kernels:
    for cval in C:
        cv_scores = cross_val_score(svm.SVC(kernel=kval,C=cval,gamma='auto'),iris.data, iris.target, cv=5)
        avg_scores[kval + '_' + str(cval)] = np.average(cv_scores)

avg_scores
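
To make clear what `cross_val_score` does for each combination, here is a rough equivalent written by hand for one setting (`kernel='rbf'`, `C=1`). It assumes that an integer `cv` with a classifier corresponds to an unshuffled `StratifiedKFold`, which is the documented scikit-learn behavior; the names `fold_scores`, `manual_mean`, and `library_mean` are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split the data into 5 folds that preserve the class proportions
skf = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Train on four folds, score on the remaining one
    model = svm.SVC(kernel="rbf", C=1, gamma="auto")
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

manual_mean = float(np.mean(fold_scores))

# The library helper should produce the same average
library_mean = float(
    cross_val_score(svm.SVC(kernel="rbf", C=1, gamma="auto"), X, y, cv=5).mean()
)
print(manual_mean, library_mean)
```

Every sample is used for testing exactly once, so the averaged score is less sensitive to a lucky or unlucky split.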

## Hyperparameter Search with Grid Search
scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) performs this parameter search for us, so we don't have to write the for-loop ourselves.

In [95]:
from sklearn.model_selection import GridSearchCV
# Cross-validate every combination of C and kernel with 5-fold CV
gs = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1,10,20],
    'kernel': ['rbf','linear']
}, cv=5, return_train_score=False)
gs.fit(iris.data, iris.target)
# Per-candidate timings and scores
gs.cv_results_

{'mean_fit_time': array([0.00040021, 0.00079727, 0.00119658, 0.00039496, 0.00080886,
        0.00018702]),
 'std_fit_time': array([0.00049016, 0.00039864, 0.00039806, 0.00048377, 0.00076687,
        0.00037403]),
 'mean_score_time': array([0.00059886, 0.00019956, 0.00019951, 0.00019903, 0.00079527,
        0.00019898]),
 'std_score_time': array([0.00048897, 0.00039911, 0.00039902, 0.00039806, 0.00074207,
        0.00039797]),
 'param_C': masked_array(data=[1, 1, 10, 10, 20, 20],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['rbf', 'linear', 'rbf', 'linear', 'rbf', 'linear'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 20, 'kernel': 'rbf'},
  {'C': 20, 'kernel': 'linear'}],
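
The `params` list in the output above shows the candidates that the search evaluated. As a sketch of how such a list is generated, scikit-learn's `ParameterGrid` can enumerate a grid directly. The grid below is chosen to match the six candidates in that output, and `n_cv_fits` is an illustrative name.

```python
from sklearn.model_selection import ParameterGrid

# Grid matching the six candidates listed in the output above
param_grid = {"C": [1, 10, 20], "kernel": ["rbf", "linear"]}

# Every combination of the listed values, in the order the search tries them
candidates = list(ParameterGrid(param_grid))
for candidate in candidates:
    print(candidate)

# With cv=5, each candidate is fitted once per fold
n_cv_fits = len(candidates) * 5
print(n_cv_fits)
```

The number of fits grows multiplicatively with every parameter added to the grid. By default (`refit=True`), the search also fits the best candidate once more on the full data.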


In [21]:
# Putting GridSearchCV result into a dataframe
df = pd.DataFrame(gs.cv_results_)
df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000598 | 0.000489 | 0.000598 | 0.000488 | 1 | rbf | `{'C': 1, 'kernel': 'rbf'}` | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 1 | 0.0006 | 0.00049 | 0.000599 | 0.000489 | 1 | linear | `{'C': 1, 'kernel': 'linear'}` | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 2 | 0.000605 | 0.000495 | 0.000394 | 0.000483 | 10 | rbf | `{'C': 10, 'kernel': 'rbf'}` | 0.966667 | 1.0 | 0.966667 | 0.966667 | 1.0 | 0.98 | 0.01633 | 1 |
| 3 | 0.000798 | 0.000399 | 0.0002 | 0.000399 | 10 | linear | `{'C': 10, 'kernel': 'linear'}` | 1.0 | 1.0 | 0.9 | 0.966667 | 1.0 | 0.973333 | 0.038873 | 4 |
| 4 | 0.000205 | 0.00041 | 0.000193 | 0.000386 | 20 | rbf | `{'C': 20, 'kernel': 'rbf'}` | 0.966667 | 1.0 | 0.9 | 0.966667 | 1.0 | 0.966667 | 0.036515 | 5 |
| 5 | 0.0002 | 0.0004 | 0.000198 | 0.000397 | 20 | linear | `{'C': 20, 'kernel': 'linear'}` | 1.0 | 1.0 | 0.9 | 0.933333 | 1.0 | 0.966667 | 0.042164 | 5 |


In [23]:
results = df[['param_C','param_kernel','mean_test_score']]
results

| | param_C | param_kernel | mean_test_score |
|---|---|---|---|
| 0 | 1 | rbf | 0.98 |
| 1 | 1 | linear | 0.98 |
| 2 | 10 | rbf | 0.98 |
| 3 | 10 | linear | 0.973333 |
| 4 | 20 | rbf | 0.966667 |
| 5 | 20 | linear | 0.966667 |


Let's plot the scores for each parameter combination in a heatmap to visualize the grid search results: 

In [None]:
import altair as alt
alt.Chart(results).mark_rect().encode(
    x='param_C:O',
    y='param_kernel:O',
    color='mean_test_score:Q'
)
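
If a chart is not needed, the same grid can be read as a plain table by pivoting the results so that kernels form the rows and `C` values form the columns. This is a small sketch using pandas only. The DataFrame is rebuilt here from the scores shown in the results table above so that the snippet runs on its own, and `score_grid` is an illustrative name.

```python
import pandas as pd

# Scores copied from the grid search results table above
results = pd.DataFrame({
    "param_C": [1, 1, 10, 10, 20, 20],
    "param_kernel": ["rbf", "linear", "rbf", "linear", "rbf", "linear"],
    "mean_test_score": [0.98, 0.98, 0.98, 0.973333, 0.966667, 0.966667],
})

# One row per kernel, one column per C value, cells hold the mean CV score
score_grid = results.pivot(
    index="param_kernel", columns="param_C", values="mean_test_score"
)
print(score_grid)
```

In the notebook itself, the existing `results` DataFrame can be pivoted directly instead of being retyped.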

In [24]:
print(f"From GridSearch, this is the best parameter combination {gs.best_params_} and resulting score {gs.best_score_:.2f}.")

From GridSearch, this is the best parameter combination {'C': 1, 'kernel': 'rbf'} and resulting score 0.98.
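
The best score reported by the search is a cross-validation score on the data used for tuning. One common way to check the selected model on data the search never saw is to split off a test set first and run the search on the training portion only. The sketch below assumes that workflow; the `random_state=0` seed, the stratified split, and the names `search` and `test_accuracy` are choices made for this illustration, not part of the original notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()

# Keep 30% of the data aside; the search never sees it
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

search = GridSearchCV(
    svm.SVC(gamma="auto"),
    {"C": [1, 10, 20], "kernel": ["rbf", "linear"]},
    cv=5,
)
search.fit(X_train, y_train)

# search.score() uses the best estimator, refitted on all of X_train
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

The refitted model is also available as `search.best_estimator_` for making predictions.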

## Hyperparameter Search with Randomized Search
GridSearchCV performs an exhaustive search over all parameter combinations, which becomes expensive when there are many parameters or many candidate values to search through. [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) instead evaluates a fixed number of randomly sampled combinations, set by `n_iter`.

In [102]:
import time
from sklearn.model_selection import RandomizedSearchCV
# Sample only n_iter=2 combinations from the parameter lists below
rs = RandomizedSearchCV(svm.SVC(gamma='auto'), {
        'C': [1,5,10,20,30,50],
        'kernel': ['rbf','linear','poly']
    },
    cv=5,
    return_train_score=False,
    n_iter=2
)
# Time the randomized search
start = time.time()
rs.fit(iris.data, iris.target)
end = time.time()
print(f"RandomizedSearchCV took {end-start:.2f} seconds")
df = pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','mean_test_score']]
df

| | param_C | param_kernel | mean_test_score |
|---|---|---|---|
| 0 | 10 | rbf | 0.98 |
| 1 | 1 | linear | 0.98 |
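
Randomized search is not limited to lists of values: a parameter can also be given as a distribution to sample from, which is useful for continuous parameters such as `C`. The following sketch assumes SciPy is available and uses `scipy.stats.loguniform`. The range `0.1` to `100`, `n_iter=10`, the seed, and the name `rs_continuous` are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

# C is drawn from a log-uniform distribution; kernel is picked from a list
param_distributions = {
    "C": loguniform(1e-1, 1e2),
    "kernel": ["rbf", "linear", "poly"],
}

rs_continuous = RandomizedSearchCV(
    svm.SVC(gamma="auto"),
    param_distributions,
    n_iter=10,       # number of sampled candidates
    cv=5,
    random_state=0,  # arbitrary seed so the sampled candidates are reproducible
)
rs_continuous.fit(iris.data, iris.target)

print(rs_continuous.best_params_, rs_continuous.best_score_)
```

A log-uniform distribution spreads the samples evenly across orders of magnitude, which tends to suit regularization parameters better than a uniform range.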


In [None]:
start = time.time()
gs.fit(iris.data, iris.target)
end = time.time()
print(f"GridSearchCV took {end-start:.2f} seconds")

GridSearchCV takes longer than RandomizedSearchCV here because it cross-validates every combination in the grid, while RandomizedSearchCV with `n_iter=2` evaluates only two. The runtime difference grows with the number of parameter combinations and with training time (for example, when you are working with large datasets). The scikit-learn example [Comparing randomized search and grid search for hyperparameter estimation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html) covers the difference between the two methods in more detail.

## Model Comparison with Grid Search

Now that we have seen how to find the best parameters for a given model, we can loop over several models, each with its own parameter ranges, and run a grid search for each one to find the best parameters for different types of classification models.

In [92]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,5,10,20,30,50],
            'kernel': ['rbf','linear','poly']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
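        # Note: multi_class is deprecated since scikit-learn 1.5 and may need to be dropped on newer versions
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),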
        'params': {
            'C': [1,5,10]
        }
    }
}


In [93]:
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(iris.data, iris.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df

| | model | best_score | best_params |
|---|---|---|---|
| 0 | svm | 0.98 | `{'C': 1, 'kernel': 'rbf'}` |
| 1 | random_forest | 0.953333 | `{'n_estimators': 5}` |
| 2 | logistic_regression | 0.966667 | `{'C': 5}` |


Based on the table above, the SVM with C=1 and a Radial Basis Function (RBF) kernel has the highest mean cross-validation accuracy (0.98) of the three models, so it is the best candidate for this iris flower classification task.
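
The scores in the table come from the same folds that were used to pick the parameters, so they can be slightly optimistic. One common refinement, shown here only as a sketch and not as part of the original workflow, is nested cross-validation: the grid search runs inside each fold of an outer cross-validation loop. The names `inner_search` and `outer_scores` are illustrative.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, cross_val_score

iris = datasets.load_iris()

# Inner loop: choose C and kernel by 5-fold grid search
inner_search = GridSearchCV(
    svm.SVC(gamma="auto"),
    {"C": [1, 5, 10, 20, 30, 50], "kernel": ["rbf", "linear", "poly"]},
    cv=5,
)

# Outer loop: score the whole tuning procedure on folds it never tuned on
outer_scores = cross_val_score(inner_search, iris.data, iris.target, cv=5)
print(outer_scores.mean())
```

The outer average estimates how well the tuning procedure as a whole generalizes, which makes comparisons between model families fairer.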

## Conclusion
In this example, we looked at how you can use scikit-learn to search over a range of parameter combinations and find the best parameter settings for your model using GridSearchCV and RandomizedSearchCV. We also looked at how you can apply this technique to search over different models to look for the best model-parameter combination for your machine learning task. 