# Hyperparameter Tuning

## Overview

In machine learning, we have model parameters and hyperparameters. **Model parameters** are the values that the model estimates from the given data (if the model is parametric), such as the weights of a logistic regression. **Model hyperparameters** are the settings we choose before training that control how the model parameters are estimated, such as the regularization strength `C`. Each model has its own hyperparameters, and different hyperparameter values give different results, so we want to find the values that give the best performance. **Hyperparameter Tuning** (or hyperparameter optimization) is the process of finding the combination of hyperparameters that maximizes model performance. It works by running multiple trials, where each trial trains the model with one combination. It is an important step in most ML projects because a well-tuned model can perform noticeably better than one left at its default settings. But how do we find these hyperparameters?<br>
In general, we need another dataset besides the training and test sets, which is called the **Cross Validation** or **Development** set. After choosing our hyperparameters and training our model on the training data, we measure performance on the development set (dev set) and keep changing the hyperparameters until performance on the dev set stops improving. Then we take those hyperparameters, train our model once more on the training set, and evaluate its final accuracy on the test set. Because the test set is never used to choose hyperparameters, no information leaks from it into the tuning process, so we get a more realistic performance number. Before we talk about methods of hyperparameter tuning, let's look at a method that lets us perform cross validation without a separate dev set, using only our usual training and test sets.
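
As a minimal sketch of the train/dev/test idea, the split can be done with two calls to `train_test_split`. The 60/20/20 proportions and the variable names `X_rest` and `X_dev` are only assumptions for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First carve out the test set (20% of all samples).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Then split the remainder into training and dev sets.
# 0.25 of the remaining 80% is roughly 20% of the full dataset.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_dev), len(X_test))
```

Hyperparameters would be compared on `X_dev`, and `X_test` would be touched only once at the very end.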

## K-Fold Cross Validation

In **K-Fold Cross Validation** we use the training set as both the training and the dev set. We split the training set into k equal parts (folds) and, with the hyperparameters fixed, train our model k times, each time on k-1 of those parts, and measure performance on the remaining part. At the end we have k performance numbers, and we use their average as the dev-set accuracy for those hyperparameters. Then we change the hyperparameters and repeat the process.<br>

**NOTE:** This method is preferable when we don't have a big dataset, where a train/dev/test split would leave too little data for training. In other words, each model gets trained on more of our data.<br>

**NOTE:** For big datasets we usually skip this method, since training k models becomes expensive, and use a train/dev/test split with proportions like 0.98/0.01/0.01 (or something similar).<br>
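
The splitting-and-averaging procedure described above can be sketched by hand. This is only an illustration of the mechanics, not how scikit-learn implements it: the folds come from a plain shuffled `np.array_split`, and the names `folds`, `scores`, and `mean_score` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

k = 5  # number of folds
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))   # shuffle the sample indices once
folds = np.array_split(indices, k)  # k (nearly) equal parts

scores = []
for i in range(k):
    dev_idx = folds[i]  # fold i plays the role of the dev set
    train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # the other k-1 folds
    # Fixed hyperparameters for every fold; scaling is fit on the training folds only.
    model = make_pipeline(StandardScaler(), LogisticRegression(C=1, max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[dev_idx], y[dev_idx]))

mean_score = float(np.mean(scores))  # the dev-set score for these hyperparameters
print(scores, mean_score)
```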

Let's see how to perform K-Fold cross validation with scikit-learn.

### Implementing K-Fold cross validation in scikit-learn

In [1]:
# all the libraries we need throughout the entire notebook
import numpy as np
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
# our breast cancer dataset- so we just split our data into train and test set
X,y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

In [3]:
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(penalty='l2',C=1) # using logistic regression with these hyperparameters to model our data
lr_cv = cross_val_score(lr,X_train,y_train,cv=10) # cv : number of split (our k)
accuracy = np.mean(lr_cv)
lr_cv,accuracy

(array([0.93478261, 1.        , 1.        , 0.97826087, 0.93478261,
        1.        , 1.        , 1.        , 1.        , 0.97777778]),
 0.9825603864734299)

As you can see, we get 10 accuracy values (one per fold, equal to our k), and the final accuracy for these hyperparameters is their average.<br>
Now that we know how to score our model for a given set of hyperparameters, let's talk about methods that help us find the optimum hyperparameters.

## Hyperparameter Tuning

In general, there are two types of hyperparameter tuning:<br>
   1. **Manual** hyperparameter tuning
   2. **Automated** hyperparameter tuning<br>

In manual HP tuning, as you can guess, we pick hyperparameters by hand and test them one by one. This gives us more control over the process, but tuning usually consists of many trials, so keeping track of them is hard and time-consuming. It isn't a very practical approach when there are a lot of hyperparameters to consider (and that's usually the case).<br>
In automated HP tuning, we use algorithms to run this process and find the optimum HPs. These algorithms generally have two steps:<br>
   * Step 1: We specify a set of hyperparameters and their limits. In other words, we define the search space for our algorithm.
   * Step 2: The algorithm does the rest. It runs the trials (training our model with each chosen set of hyperparameters) and finds the hyperparameters that give the best results.<br>
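
The two steps can be sketched with a bare-bones loop. This is a hand-rolled illustration rather than any library's search routine, and the candidate values in `candidate_C` are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 1: the search space (here just a handful of values for C).
candidate_C = [0.01, 0.1, 1.0, 10.0]

# Step 2: run one trial per candidate and remember its mean CV score.
results = {}
for C in candidate_C:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    results[C] = cross_val_score(model, X, y, cv=5).mean()

best_C = max(results, key=results.get)
print(results)
print("best C:", best_C)
```

The methods below differ mainly in how they decide which combinations to try in step 2.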

How do the algorithms do this? There are many classes of algorithms with many variations, and the advanced ones belong to the field of optimization because, in the end, this is an optimization problem. Here I will talk about 4 of these methods, and I think knowing these is enough!

### GridSearch

In **GridSearch** we define a grid of possible values for the hyperparameters, and the algorithm trains and evaluates the model on every possible combination and records its performance. Finally, it returns the best model with the best hyperparameters. Let's implement it with scikit-learn.

#### Implementing GridSearch with scikit-learn

In [4]:
from sklearn.model_selection import GridSearchCV
# the models we want to fit on our dataset; we will use them throughout the notebook.
# as you can see, I fixed some of the hyperparameters, but in general you can search over them too

lr = LogisticRegression(max_iter=1000,penalty='l2',random_state=42) #Logistic Regression
rf = RandomForestClassifier(random_state=42,n_jobs=-1)# Random forest

In [5]:
# our grid of hyperparameters
lr_params = {'C': np.logspace(4,-4,30),
             'solver':['lbfgs','liblinear']
             
}

rf_params = {'n_estimators':[200,400,600,800,1000],
             'criterion':['gini','entropy'],
             'bootstrap':[True,False],
             'max_depth':[3,7,9,11,13,None]
}

In [6]:
%%time
#logistic regression
lr_g_clf = GridSearchCV(lr,param_grid=lr_params,cv=10,n_jobs=-1)
lr_g_clf.fit(X_train,y_train)

CPU times: total: 391 ms
Wall time: 2.85 s


In [7]:
# our best parameters
lr_g_clf.best_params_

{'C': 0.7278953843983146, 'solver': 'liblinear'}

In [8]:
# score of our best parameters
lr_g_clf.best_score_

0.9825603864734299

In [9]:
# our model with best hyperparameters. now we test our model with chosen hyperparameter on our test dataset.
lr_g_clf.best_estimator_.score(X_test,y_test)

0.9824561403508771

In [10]:
%%time
#random forest
rf_g_clf = GridSearchCV(rf,param_grid=rf_params,cv=10,n_jobs=-1)
rf_g_clf.fit(X_train,y_train)

CPU times: total: 3.58 s
Wall time: 4min 47s


In [11]:
rf_g_clf.best_params_

{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': 7,
 'n_estimators': 600}

In [12]:
rf_g_clf.best_score_

0.9648309178743961

In [13]:
rf_g_clf.best_estimator_.score(X_test,y_test)

0.9649122807017544

**Some Notes:**<br>
   * As you can see, grid search is a time-consuming process, and a finer grid makes it even slower, because the number of combinations is the product of the number of values for each hyperparameter.
   * Another problem of grid search is that our domain is discrete, so the optimum value may lie between the values we defined.
   * `GridSearchCV` performs k-fold cross validation for us, so we don't have to do it ourselves. We only have to specify k, which is what `cv=` is for.
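
To see why the random forest search took so long, we can count the model fits. The snippet reproduces the `rf_params` grid so that it runs on its own; the names `combinations` and `n_fits` are only for this illustration.

```python
from itertools import product

# The random forest grid from this notebook, reproduced here.
rf_params = {
    'n_estimators': [200, 400, 600, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'max_depth': [3, 7, 9, 11, 13, None],
}
cv = 10  # number of folds used in the grid search

# Every combination is an element of the Cartesian product of the value lists.
combinations = list(product(*rf_params.values()))
n_fits = len(combinations) * cv

print(len(combinations))  # 5 * 2 * 2 * 6 = 120 combinations
print(n_fits)             # 120 * 10 = 1200 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` also trains the best combination once more on the whole training set at the end.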

Can we do better? Yes we can, let's see how.

### RandomSearch

In **RandomSearch** we again define a search space, but instead of a discrete grid we define a distribution for each hyperparameter. The algorithm then samples combinations at random from this space. How many? We have to specify that ourselves. Let's see how to implement it with scikit-learn.

#### Implementing RandomSearch with scikit-learn

In [14]:
from sklearn.model_selection import RandomizedSearchCV

In [15]:
lr_distribution = {
    'C':np.logspace(-4,4,num=1000),
    'solver':['lbfgs','liblinear']
}

rf_distribution = {
    'n_estimators':np.arange(200,1600,200),
    'criterion':['gini','entropy'],
    'bootstrap':[True,False],
    'max_depth':np.arange(3,15,2)
}

In [16]:
%%time
# random search for logistic regression
lr_r_clf = RandomizedSearchCV(lr,param_distributions=lr_distribution
                              ,cv=10,n_iter=50,n_jobs=-1,random_state=42)
lr_r_clf.fit(X_train,y_train)

CPU times: total: 203 ms
Wall time: 1.07 s


In [17]:
lr_r_clf.best_params_

{'solver': 'lbfgs', 'C': 0.14831025143361043}

In [18]:
lr_r_clf.best_score_

0.9824637681159419

In [19]:
lr_r_clf.best_estimator_.score(X_test,y_test)

0.9824561403508771

In [20]:
%%time
#random search for random forest
rf_r_clf = RandomizedSearchCV(rf,param_distributions={
    'n_estimators':np.arange(200,1600,100),
    'criterion':['gini','entropy'],
    'bootstrap':[True,False],
    'max_depth':np.arange(3,15,2)},cv=10,n_iter=5,n_jobs=-1)
rf_r_clf.fit(X_train,y_train)

CPU times: total: 3.23 s
Wall time: 23 s


In [21]:
rf_r_clf.best_params_

{'n_estimators': 1300,
 'max_depth': 13,
 'criterion': 'entropy',
 'bootstrap': False}

In [22]:
rf_r_clf.best_score_

0.962657004830918

In [23]:
rf_r_clf.best_estimator_.score(X_test,y_test)

0.9649122807017544

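**Some Notes:**<br>
   * For the same number of trials, random search often finds good hyperparameters more reliably than grid search, because every trial can try a new value of each hyperparameter instead of revisiting the same few grid values.
   * Here I pass arrays of values rather than true distributions, so the algorithm samples uniformly from the values we listed. To sample from a continuous distribution, pass a distribution object from the `scipy.stats` module instead.
   * `n_iter=` is the number of hyperparameter combinations the algorithm samples and tries.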
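
As a sketch of what passing a real distribution looks like, the example below uses `scipy.stats.loguniform` for `C`. The pipeline, the bounds, and the small `n_iter` are assumptions chosen to keep the example short and self-contained, not the settings used in the cells above.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# loguniform(a, b) samples continuously between a and b, uniformly on a log scale,
# which suits a regularization strength spanning several orders of magnitude.
param_distributions = {
    'logisticregression__C': loguniform(1e-4, 1e4),
    'logisticregression__solver': ['lbfgs', 'liblinear'],  # lists are sampled uniformly
}

search = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                            n_iter=10, cv=5, random_state=42)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```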

### Bayesian Optimization

In **Bayesian Optimization**, instead of trying random combinations of hyperparameter values, we look at the results obtained so far and use them to predict which region of the hyperparameter space might give better results. To do that, we need to predict how well a new combination will do and also model the uncertainty of that prediction.<br>
The whole idea of Bayesian optimization is to reduce the number of times we need to train our model on our dataset by evaluating only the most promising hyperparameters, based on previous results. In contrast, grid and random search are completely uninformed by past evaluation results and often spend a significant amount of time evaluating bad hyperparameters.<br>
These methods have 3 main parts:<br>
   1. Our search space of hyperparameters.
   2. The **objective function**, which trains our model with the given hyperparameters and returns its score.
   3. The **surrogate function**, which is a (usually probabilistic) model of our objective function and is much cheaper to evaluate and optimize than the objective function itself.<br>

The algorithm first evaluates a few randomly chosen hyperparameter combinations and fits the surrogate function to the results. It then picks the combination that looks most promising according to the surrogate, applies it to the objective function (our model), and records the result. Using that result, it updates the surrogate function and repeats the process until we reach our maximum number of iterations.<br>

**NOTE:** There are different methods for constructing the surrogate function, such as:
   * Gaussian Process (GP)
   * Tree-structured Parzen Estimator (TPE)
   * Random Forest
   * Neural Network<br>
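
The loop described above can be sketched on a toy problem. This is a simplified illustration, not how Scikit-Optimize or Hyperopt work internally. It assumes a Gaussian Process surrogate, an upper-confidence-bound rule for picking the next point, and a made-up one-dimensional `objective` standing in for "train the model and return its score".

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Hypothetical stand-in for an expensive model evaluation; maximum at x = 2.
    return -(x - 2.0) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 5.0, 201)        # the search space
tried = np.zeros(len(candidates), dtype=bool)  # which candidates were evaluated

# Start with a few random evaluations.
for i in rng.choice(len(candidates), size=3, replace=False):
    tried[i] = True

for _ in range(10):
    X_seen = candidates[tried].reshape(-1, 1)
    y_seen = objective(candidates[tried])

    # Fit the surrogate to everything evaluated so far.
    surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                         normalize_y=True, random_state=0)
    surrogate.fit(X_seen, y_seen)

    # Predicted score and its uncertainty for every candidate.
    mean, std = surrogate.predict(candidates.reshape(-1, 1), return_std=True)

    # Upper confidence bound: favor a high prediction or a high uncertainty.
    acquisition = mean + 2.0 * std
    acquisition[tried] = -np.inf  # never pick an already evaluated point
    tried[np.argmax(acquisition)] = True

evaluated_x = candidates[tried]
evaluated_y = objective(evaluated_x)
best_x = evaluated_x[np.argmax(evaluated_y)]
print("evaluations:", tried.sum(), "best x:", best_x)
```

Only 13 of the 201 candidates are ever passed to `objective`, which is the saving that matters when each evaluation means training a model.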

Several good libraries implement Bayesian optimization, such as Scikit-Optimize and Hyperopt. I will use Scikit-Optimize because, besides its own interface, it also offers a scikit-learn-style interface. Before using it, be sure to install the package and read the documentation.

#### Implementing Bayesian Optimization using Scikit-Optimize

In [24]:
from skopt import BayesSearchCV

In [25]:
%%time
lr_bo_clf = BayesSearchCV(lr,search_spaces={
    'C':(1e-4,1e4,'log-uniform'),
    'solver':['lbfgs','liblinear']
},n_iter=20,cv=10,n_jobs=-1,random_state=42)
lr_bo_clf.fit(X_train,y_train)

CPU times: total: 6.2 s
Wall time: 6.81 s


In [26]:
lr_bo_clf.best_params_

OrderedDict([('C', 0.9278459556357788), ('solver', 'lbfgs')])

In [27]:
lr_bo_clf.best_score_

0.9825603864734299

In [28]:
lr_bo_clf.best_estimator_.score(X_test,y_test)

0.9824561403508771

In [29]:
%%time
rf_bo_clf = BayesSearchCV(rf,search_spaces={
    'n_estimators':np.arange(200,1600,200),
    'criterion':['gini','entropy'],
    'bootstrap':[True,False],
    'max_depth':np.arange(3,15,2)
},cv=10,n_iter=10,n_jobs=-1)
rf_bo_clf.fit(X_train,y_train)

CPU times: total: 1.94 s
Wall time: 27.7 s


In [30]:
rf_bo_clf.best_params_

OrderedDict([('bootstrap', False),
             ('criterion', 'gini'),
             ('max_depth', 13),
             ('n_estimators', 600)])

In [31]:
rf_bo_clf.best_score_

0.9604830917874396

In [32]:
rf_bo_clf.best_estimator_.score(X_test,y_test)

0.956140350877193

### Successive Halving and Hyperband

If a hyperparameter configuration is bad, it will usually look bad at an early stage of the computation or on a simpler form of the problem. That is the whole idea of **Successive Halving**. In successive halving (SH) we have 3 parameters:
   * Number of configurations (${N}$): the number of hyperparameter combinations we start with.
   * Our budget (${B}$): the resource given to each configuration. It can be the number of samples, the number of iterations, or the number of estimators in the case of boosting methods.
   * Our factor (${k}$): usually 2 or 3.<br>

The SH algorithm goes something like this:
   1. Randomly sample a set of hyperparameter configurations. How many? It depends on our budget and factor.
   2. Evaluate the performance of all currently remaining configurations.
   3. Keep the best-scoring $\frac{1}{k}$ of the configurations and increase the budget by a factor of ${k}$.
   4. Repeat steps 2 and 3 until one configuration remains.
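
The four steps above can be sketched in plain Python. This is a toy simulation under made-up assumptions: each configuration has a hidden `quality`, and the hypothetical `evaluate` function returns a noisy score whose noise shrinks as the budget grows. No real model is trained.

```python
import random

def successive_halving(n_configs=27, min_budget=10, factor=3, seed=0):
    rng = random.Random(seed)
    # Step 1: sample configurations; each has a hidden "true quality" in [0, 1].
    configs = [{'id': i, 'quality': rng.random()} for i in range(n_configs)]

    def evaluate(config, budget):
        # Noisy score; more budget means a more reliable estimate.
        return config['quality'] + rng.gauss(0, 1.0 / budget)

    budget = min_budget
    schedule = []  # (number of configurations, budget) for each round
    while len(configs) > 1:
        schedule.append((len(configs), budget))
        # Step 2: evaluate all remaining configurations with the current budget.
        ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        # Step 3: keep the best 1/factor and multiply the budget by factor.
        configs = ranked[:max(1, len(configs) // factor)]
        budget *= factor
    # Step 4 is the loop itself: it stops once one configuration remains.
    return configs[0], schedule

winner, schedule = successive_halving()
print(schedule)  # [(27, 10), (9, 30), (3, 90)]
print(winner)
```

Most of the configurations only ever receive the smallest budget, which is where the time savings come from.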

So instead of wasting our time on configurations that won't perform well, we keep the most promising ones and test them with more resources in each iteration.<br>
As you can see, there is a trade-off between the number of configurations and the budget each one receives. So how do we choose these numbers? That is where **Hyperband** comes in. Hyperband runs SH several times with different trade-offs between the number of configurations and the budget, and keeps the best configuration found.<br>
**NOTE:** There is a variation of Hyperband called **BOHB** (short for Bayesian Optimization and Hyperband) that uses Bayesian optimization to choose which configurations to try, instead of sampling them randomly. Now let's implement SH using scikit-learn.

#### Implementing SH in scikit-learn

In [33]:
from sklearn.experimental import enable_halving_search_cv # this estimator is still experimental, so we have to enable it explicitly before importing it
from sklearn.model_selection import HalvingRandomSearchCV

In [34]:
%%time
lr_sh_clf = HalvingRandomSearchCV(lr,param_distributions={
    'C':np.logspace(-4,4,num=1000),
    'solver':['lbfgs','liblinear']
},factor=3,cv=10,n_jobs=-1)
lr_sh_clf.fit(X_train,y_train)

CPU times: total: 172 ms
Wall time: 222 ms


In [35]:
lr_sh_clf.best_score_

0.9804761904761904

In [36]:
lr_sh_clf.best_params_

{'solver': 'lbfgs', 'C': 12.856096069432965}

In [37]:
lr_sh_clf.best_estimator_.score(X_test,y_test)

0.9649122807017544

In [38]:
%%time
rf_sh_clf = HalvingRandomSearchCV(rf,param_distributions={
    'n_estimators':np.arange(200,1600,200),
    'criterion':['gini','entropy'],
    'bootstrap':[True,False],
    'max_depth':np.arange(3,17,2)
},factor=3,cv=10,n_jobs=-1)
rf_sh_clf.fit(X_train,y_train)

CPU times: total: 3.33 s
Wall time: 43.7 s


In [39]:
rf_sh_clf.best_params_

{'n_estimators': 1400,
 'max_depth': 5,
 'criterion': 'entropy',
 'bootstrap': False}

In [40]:
rf_sh_clf.best_score_

0.954920634920635

In [41]:
rf_sh_clf.best_estimator_.score(X_test,y_test)

0.9649122807017544

**NOTE:** As the import shows, this method is experimental, so check the documentation before using it.
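
To see the halving schedule that was actually used, a fitted search exposes it through the `n_candidates_` and `n_resources_` attributes. The sketch below is a small self-contained example. Its pipeline, search space, and `cv=5` are assumptions for illustration and differ from the cells above.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

search = HalvingRandomSearchCV(
    pipe,
    param_distributions={'logisticregression__C': loguniform(1e-4, 1e4)},
    factor=3, cv=5, random_state=42,
)
search.fit(X, y)

# One entry per SH iteration: fewer candidates, more samples each time.
for n_cand, n_res in zip(search.n_candidates_, search.n_resources_):
    print(f"{n_cand} candidates evaluated on {n_res} samples")
```

By default the budget is the number of training samples, so the early rounds run on very small subsets of the data.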

## Final notes on Hyperparameter Tuning

### Using pipeline in HP tuning

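We can also tune a whole pipeline, including its preprocessing steps. Because the pipeline is refit on the training part of every fold, the scaler and PCA never see the held-out fold. Let's see how.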

In [54]:
from sklearn.decomposition import PCA 
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier 

knn_pipe = make_pipeline(StandardScaler(),PCA(),KNeighborsClassifier())

knn_params = {'kneighborsclassifier__n_neighbors':[3,5,7,9,11,13,15],
              'kneighborsclassifier__metric':['euclidean','minkowski','manhattan'],
              'pca__n_components':[0.99,0.95,0.90,None]
}

knn_clf = RandomizedSearchCV(knn_pipe,param_distributions=knn_params,cv=10,n_iter=50)
knn_clf.fit(X_train,y_train)

In [55]:
knn_clf.best_score_

0.973719806763285

In [56]:
knn_clf.best_params_

{'pca__n_components': None,
 'kneighborsclassifier__n_neighbors': 7,
 'kneighborsclassifier__metric': 'minkowski'}

In [57]:
knn_clf.best_estimator_.score(X_test,y_test)

0.9736842105263158
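
The keys in `knn_params` follow scikit-learn's `<step name>__<parameter name>` convention. As a small sketch, the step names that `make_pipeline` generates and the parameters that can be tuned can be listed like this (the variable names `pipe` and `tunable` are only for this example):

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), PCA(), KNeighborsClassifier())

# make_pipeline names each step after its class, in lowercase.
print(list(pipe.named_steps))

# Every tunable parameter is addressed as <step name>__<parameter name>.
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)
```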

### HP tuning Packages

Various packages perform HP tuning for us, and each has its pros and cons. These are some of the most popular ones:<br>

   * Scikit-Optimize
   * Hyperopt
   * Ray-tune
   * SMAC
   * Scikit-Learn

Many frameworks (like PyTorch, TensorFlow, XGBoost, etc.) also have their own or companion HP tuning tools, so check them too.