# What is Sklearn's HalvingGridSearchCV and HalvingRandomizedSearchCV?
## Even faster hyperparameter tuning for massive models
<img src='images/tune.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@karolina-grabowska?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Karolina Grabowska</a>
        on 
        <a href='https://www.pexels.com/photo/person-holding-tuning-pegs-4472108/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### A Note on Terminology

Before we move on, let's make sure we are all on the same page about the terms I will be using today.

**1. Hyperparameters**: settings of a model that the user must choose before training, because the model cannot learn them from the training data. An example is the learning rate in `xgboost` estimators.

**2. Parameter Grid**: a dictionary with parameter names as keys and lists of candidate values as values. Here is a sample parameter grid for `XGBClassifier`:

```python
param_grid = {
    "max_depth": [3, 4, 5, 7],
    "gamma": [0, 0.25, 1],
    "scale_pos_weight": [1, 3, 5]
}
```
The size of a parameter grid, i.e. the number of all possible combinations, is found by multiplying the number of values listed for each parameter. So, the above grid has 4 * 3 * 3 = 36 possible combinations. In practice, parameter grids are often much larger than that.
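As a quick check of that arithmetic, here is a minimal sketch that counts and enumerates the combinations using only the standard library. The names `grid_size` and `candidates` are introduced here for illustration and are not part of any library.

```python
from itertools import product
from math import prod

param_grid = {
    "max_depth": [3, 4, 5, 7],
    "gamma": [0, 0.25, 1],
    "scale_pos_weight": [1, 3, 5],
}

# Grid size = product of the number of values listed for each parameter
grid_size = prod(len(values) for values in param_grid.values())

# Enumerate every combination as a {parameter name: value} dictionary
candidates = [
    dict(zip(param_grid, combo)) for combo in product(*param_grid.values())
]

print(grid_size)      # 36
print(candidates[0])  # {'max_depth': 3, 'gamma': 0, 'scale_pos_weight': 1}
```

Each dictionary in `candidates` is one set of hyperparameters that a search would have to evaluate.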

**3. Candidate**: a single combination from all possible sets of hyperparameters in a parameter grid.

**4. Resources or samples**: another name for the data at hand. One sample refers to a single row in training data.

**5. Iteration**: any single round in which a single set of hyperparameters is used on the training data.

### Brief Overview of `GridSearchCV` and `RandomizedSearchCV`

Since the main ideas of the new classes are related to GridSearch and RandomSearch, let me briefly give an overview of how they work.

GridSearch is an exhaustive, brute-force search. This means that every combination of hyperparameters is trained and evaluated using cross-validation. If there are 100 possible candidates and you are doing 5-fold cross-validation, the given estimator will be trained 500 times (500 iterations). Surely, this will take an excruciatingly long time for heavy models.

RandomizedSearch controls the number of iterations by sampling candidates at random from the grid instead of trying all of them. Its additional `n_iter` parameter sets how many candidates are drawn. If there are 1000 candidates and `n_iter` is set to 100, the search stops after the 100th candidate and returns the best result among those 100. This random sampling results in a much shorter training time, but the best result may not be as good as the one GridSearch would find.
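To make the difference in cost concrete, here is a rough sketch that only counts model fits and does not train anything. It assumes 5-fold cross-validation for both searches and stands in for the grid with a list of integer IDs; all names here are hypothetical.

```python
import random

n_candidates = 1000  # size of the parameter grid
cv_folds = 5         # assumed number of cross-validation folds
n_iter = 100         # number of candidates the randomized search draws

candidate_ids = list(range(n_candidates))

# Exhaustive search: every candidate is fitted once per fold
grid_fits = len(candidate_ids) * cv_folds

# Randomized search: only a random subset of candidates is fitted
rng = random.Random(0)
sampled_ids = rng.sample(candidate_ids, n_iter)  # drawn without replacement
random_fits = len(sampled_ids) * cv_folds

print(grid_fits, random_fits)  # 5000 500
```

The tenfold saving comes purely from skipping 900 candidates, which is also why the best combination in the grid may never be tried.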

If you want to learn more about them and see them in action, check out my separate article on the topic:
https://towardsdatascience.com/automatic-hyperparameter-tuning-with-sklearn-gridsearchcv-and-randomizedsearchcv-e94f53a518ee

### What Is Successive Halving?

While both GridSearch and RandomizedSearch train the candidates on all of the training data, HalvingGridSearch and HalvingRandomizedSearch take a smarter approach called successive halving. Let's see what it means in terms of HalvingGridSearch (HGS).

HGS is like a competition among all candidates (hyperparameter combinations). In the first iteration, HGS trains all candidates on a small portion of the training data. In the next iteration, only the best-performing candidates are kept, and they are given more resources to compete with. So, with each passing iteration, the 'surviving' candidates get more and more resources (training samples) until only a couple of hyperparameter sets are left standing. Lastly, these finalists are trained on the largest amount of data, ideally all of it, and the best one is picked.

Now, let's get more granular. The above process can be controlled by two arguments - `factor` and `min_resources`. `min_resources` takes an integer that specifies how many training samples to use in the first iteration. All candidates are trained on this data, and in each following iteration the number of samples grows by `factor` while the number of candidates shrinks by `factor`. The rounds continue in this manner until the best candidate is found.

To let the idea sink in, let's say we have 1000 samples and 20 candidates in the parameter grid. If we set `min_resources` to 20 and choose a `factor` of 2, here is how the iterations will unfold:
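The competition can be sketched in a few lines of plain Python. This is a toy illustration, not Sklearn's implementation: `successive_halving` and `noisy_score` are hypothetical names, each candidate is just a number standing for its "true quality", and the scoring function is assumed to get less noisy as it sees more samples.

```python
import random


def successive_halving(candidates, score, min_resources, max_resources, factor=2):
    """Toy successive halving: keep the top 1/factor of candidates each round."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    survivors = list(candidates)
    resources = min_resources
    history = []  # (n_samples, n_candidates) for every round
    while True:
        # Score every surviving candidate on the current budget, best first
        ranked = sorted(survivors, key=lambda c: score(c, resources), reverse=True)
        history.append((resources, len(ranked)))
        out_of_candidates = len(ranked) <= factor
        out_of_resources = resources * factor > max_resources
        if out_of_candidates or out_of_resources:
            return ranked[0], history
        survivors = ranked[: len(ranked) // factor]  # eliminate the rest
        resources *= factor                          # grow the budget


rng = random.Random(0)


def noisy_score(quality, n_samples):
    # Assumption: the estimate of a candidate's quality sharpens with more data
    return quality + rng.gauss(0, 1) / n_samples ** 0.5


qualities = [i / 20 for i in range(20)]  # 20 made-up candidates
winner, history = successive_halving(
    qualities, noisy_score, min_resources=20, max_resources=1000
)
print(history)  # [(20, 20), (40, 10), (80, 5), (160, 2)]
```

Because early rounds use few samples, a good candidate can be eliminated by noise, which is the price paid for the speed-up.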

In [20]:
pd.DataFrame({'iteration': [1, 2, 3, 4],
 'n_samples': [20, 40, 80, 160],
 'n_candidates': [20, 10, 5, 2],
 '*factor': [2, 2, 2, 2]}).set_index('iteration')

| iteration | n_samples | n_candidates | *factor |
|-----------|-----------|--------------|---------|
| 1         | 20        | 20           | 2       |
| 2         | 40        | 10           | 2       |
| 3         | 80        | 5            | 2       |
| 4         | 160       | 2            | 2       |


With `min_resources` set to 20 and 20 candidates, we can only run 4 iterations because we run out of candidates before exhausting all samples. This means the rest of the training data (1000 - 160 = 840 rows) is wasted and the best candidate is found by training on only 160 rows of data.

Similarly, we may also run out of samples before enough candidates are eliminated. For example, let's say we have 1000 samples and 300 candidates. We set `min_resources` to 50 and choose a factor of 2:

In [21]:
pd.DataFrame({'iteration': [1, 2, 3, 4, 5],
 'n_samples': [50, 100, 200, 400, 800],
 'n_candidates': [300, 150, 75, 37, 18],
 '*factor': [2, 2, 2, 2, 2]}).set_index('iteration')

| iteration | n_samples | n_candidates | *factor |
|-----------|-----------|--------------|---------|
| 1         | 50        | 300          | 2       |
| 2         | 100       | 150          | 2       |
| 3         | 200       | 75           | 2       |
| 4         | 400       | 37           | 2       |
| 5         | 800       | 18           | 2       |


As you can see, at the 5th iteration, we don't have enough resources to double further and we are left with 18 final candidates. These final candidates have no choice but to be trained on the full dataset, which is no different from plain GridSearch. This problem is even more evident with real-world datasets.

For example, the dataset we will be using today has 145k samples. So, depending on the size of our parameter grid, we need to choose a combination of `factor` and `min_resources` that wastes as few resources as possible and leaves as few candidates as possible in the final round.
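Both scenarios can be reproduced with a small helper. This is a simplified sketch that mirrors the two tables above (it rounds the number of candidates down at each step) and is not Sklearn's exact internal logic; `halving_schedule` is a hypothetical name.

```python
def halving_schedule(n_samples, n_candidates, min_resources, factor=2):
    """Return one row per iteration until candidates or samples run out."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    rows = []
    resources, candidates = min_resources, n_candidates
    while True:
        rows.append(
            {
                "iteration": len(rows) + 1,
                "n_samples": resources,
                "n_candidates": candidates,
            }
        )
        # Stop when too few candidates remain or the budget cannot grow again
        if candidates <= factor or resources * factor > n_samples:
            return rows
        resources *= factor
        candidates //= factor  # simplification: round down


for n_candidates, min_resources in [(20, 20), (300, 50)]:
    rows = halving_schedule(1000, n_candidates, min_resources)
    last = rows[-1]
    print(
        f"{len(rows)} iterations, "
        f"{1000 - last['n_samples']} samples unused, "
        f"{last['n_candidates']} candidates in the final round"
    )
# 4 iterations, 840 samples unused, 2 candidates in the final round
# 5 iterations, 200 samples unused, 18 candidates in the final round
```

Trying a few values of `min_resources` and `factor` with this helper makes it easy to see which combinations leave samples unused or too many finalists.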

That sure sounds like a lot of work, but fortunately you can pass `'exhaust'` to `min_resources` so that the starting number of resources is determined automatically from `factor` and the number of candidates, in a way that lets the last iteration use as many samples as possible. For example, for 1000 samples and a factor of two, setting `min_resources` to `'exhaust'` can set it to 250 (when the grid is small enough to need only three iterations), which gives 250, 500, and 1000 samples as we go through the iterations.
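Here is a minimal sketch of how such a starting point could be derived, based on my reading of the behaviour rather than on Sklearn's actual code. `exhaust_min_resources` is a hypothetical helper, the grid size of 4 candidates is an assumption chosen to match the 250 figure, and the real classes also enforce a lower bound that depends on the number of folds and classes, which is ignored here.

```python
def exhaust_min_resources(n_samples, n_candidates, factor):
    """Largest starting budget that still lets the final round use ~all samples."""
    # Number of times the candidate pool can be cut by `factor`
    n_halvings = 0
    while factor ** (n_halvings + 1) <= n_candidates:
        n_halvings += 1
    # Work backwards from the full dataset to the first iteration
    return n_samples // factor ** n_halvings


start = exhaust_min_resources(n_samples=1000, n_candidates=4, factor=2)
schedule = [start * 2 ** i for i in range(3)]
print(start, schedule)  # 250 [250, 500, 1000]
```

The idea is simply to fix the last iteration at the full dataset and divide by `factor` once for every elimination round that is needed.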

The official guide says that exhausting the number of samples will definitely lead to a more robust selection of parameters but might be a bit more time-consuming. In the next sections, we will explore just how much better the new classes are than their counterparts.

### Loading and Preparing Data

In [2]:
from prep import preprocess

rain = pd.read_csv('data/weatherAUS.csv')
rain.head()

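|   | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|------|----------|---------|---------|----------|-------------|----------|-------------|---------------|------------|-----|-------------|-------------|-------------|-------------|----------|----------|---------|---------|-----------|--------------|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |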


In [3]:
rain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

In [4]:
# Get the preprocessed feature and target arrays
X, y = preprocess(rain)

In [13]:
# The halving classes are experimental; this import must come first to enable them
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

xgb_cl = xgb.XGBClassifier(objective='binary:logistic', verbosity=0)

In [14]:
param_grid = {
    "max_depth": [3, 4, 5, 7],
    "gamma": [0, 0.25, 1],
    "scale_pos_weight": [1, 3, 5],
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}

In [15]:
xgb_cl.fit(X, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=0)

In [7]:
halving_cv = HalvingGridSearchCV(xgb_cl, param_grid, cv=3, scoring='roc_auc', n_jobs=-1,
                                min_resources='exhaust')

_ = halving_cv.fit(X, y)



In [8]:
halving_cv.best_score_

0.8492997996870751

In [9]:
halving_cv.best_params_

{'colsample_bytree': 0.5,
 'gamma': 0,
 'learning_rate': 0.05,
 'max_depth': 5,
 'scale_pos_weight': 1,
 'subsample': 0.8}
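To see which schedule a fitted halving search actually followed, you can inspect its `n_iterations_`, `n_resources_` and `n_candidates_` attributes. The sketch below does this on a small synthetic dataset with a decision tree so that it runs in seconds; the dataset, the estimator and the `demo_` names are stand-ins for illustration, not the rain data or the XGBoost model used above.

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for a real dataset
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# 4 * 2 = 8 candidates
demo_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5]}

demo_search = HalvingGridSearchCV(
    DecisionTreeClassifier(random_state=0),
    demo_grid,
    factor=2,
    min_resources="exhaust",
    cv=3,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

print("iterations:        ", demo_search.n_iterations_)
print("samples per round: ", demo_search.n_resources_)
print("candidates per round:", demo_search.n_candidates_)
```

Comparing the last entry of `n_resources_` with the number of rows in the data shows how close the search came to exhausting the available samples.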

### HalvingGridSearchCV

### Load Data