# Tuning the hyper-parameters of an estimator

From this guide: https://scikit-learn.org/stable/modules/grid_search.html

## Overview

Hyper-parameters are parameters that are not directly learnt within estimators. 

💡 In scikit-learn they are **passed as arguments to the constructor of the estimator classes**. Typical examples include `C`, `kernel` and `gamma` for Support Vector Classifier, `alpha` for Lasso, etc.

It is possible and recommended to search the hyper-parameter space for the best **cross validation** score.

💡 Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, **to find the names and current values for all parameters for a given estimator, use**:
```python
estimator.get_params()
```
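
As a small illustration of the call above, the sketch below inspects an `SVC` instance. The specific values `C=10` and `kernel="rbf"` are only chosen for the example.

```python
from sklearn.svm import SVC

# Construct an estimator with two explicit hyper-parameters (illustrative values).
clf = SVC(C=10, kernel="rbf")

# get_params() returns a dict mapping parameter names to their current values.
params = clf.get_params()
for name in ("C", "kernel", "gamma"):
    print(f"{name} = {params[name]!r}")
```

Any key of this dictionary can be used as a key in a search's parameter space.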

A search consists of:
1. an estimator (regressor or classifier such as `sklearn.svm.SVC()`);
2. **a parameter space**;
3. a method for searching or sampling candidates;
4. a cross-validation scheme; and
5. a score function.

Two generic approaches to parameter search are provided in scikit-learn: 
* `GridSearchCV` exhaustively considers all parameter combinations, 
* `RandomizedSearchCV` can sample a given number of candidates from a parameter space with a specified distribution. 

Both tools have successive-halving counterparts:
* `HalvingGridSearchCV` and 
* `HalvingRandomSearchCV`, 

which can be much faster at finding a good parameter combination.

## [Exhaustive Grid Search](https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search)

`GridSearchCV` exhaustively generates candidates from a grid of parameter values specified with the `param_grid` parameter. For instance, the following `param_grid`:

In [1]:
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

specifies that **two grids should be explored**: one with a linear kernel and `C` values in `[1, 10, 100, 1000]`, and the second one with an RBF kernel, and the cross-product of `C` values ranging in `[1, 10, 100, 1000]` and `gamma` values in `[0.001, 0.0001]`.

The `GridSearchCV` instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.
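
To see which candidates such a grid expands to without fitting anything, one option is `sklearn.model_selection.ParameterGrid`. The sketch below reuses the `param_grid` shown above and simply enumerates it.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

# Expand both grids into the full list of candidate parameter dictionaries.
candidates = list(ParameterGrid(param_grid))

# 4 linear candidates + 4 * 2 RBF candidates = 12 candidates in total.
print(len(candidates))
print(candidates[0])
```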

### [Example](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py)

This example shows how a **classifier** is optimized by **cross-validation**, which is done using the `GridSearchCV` object on a *development set* that **comprises only half of the available labeled data**.

The performance of the selected hyper-parameters and trained model is then measured on a dedicated *evaluation set* that was not used during the model selection step.

Note that this problem is too easy: the hyperparameter plateau is too flat, and the selected model is the same for precision and recall, with ties in quality.

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

In [3]:
# Loading the Digits dataset
digits = datasets.load_digits()

In [4]:
[x for x in dir(digits) if "_" not in x]

['DESCR', 'data', 'frame', 'images', 'target']

In [5]:
print("digits.images.shape:", digits.images.shape)
print("digits.target.shape:", digits.target.shape)

digits.images.shape: (1797, 8, 8)
digits.target.shape: (1797,)


In [6]:
# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

print("X.shape:", X.shape)
print("y.shape:", y.shape)

X.shape: (1797, 64)
y.shape: (1797,)


In [7]:
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

print("X_train.shape:", X_train.shape)
print("y_test.shape:", y_test.shape)

X_train.shape: (898, 64)
y_test.shape: (899,)


In [8]:
# Set the parameters by cross-validation
tuned_parameters = [
    # Grid 1:
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    # Grid 2:
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}
]

In [9]:
# Define scores: 
scores = ['precision', 'recall']

In [10]:
for score in scores:
    print("=" * 80)
    print(f"# Tuning hyper-parameters for {score}")
    print()

    clf = GridSearchCV(
        SVC(), 
        tuned_parameters,  # <-- The grid.
        scoring=f'{score}_macro',
        verbose=1
    )
    clf.fit(X_train, y_train)  # <-- Actually run the grid search.

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    
    # Note the "mean_test_score", "std_test_score", and "params" keys in `cv_results_`:
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

# Tuning hyper-parameters for precision

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters set found on development set:

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.959 (+/-0.028) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.983 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.983 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.974 (+/-0.012) for {'C': 1, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 10, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 100, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is t

### 🔑 See many good examples in the examples box at the [end of this section](https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search):

* **\[Done above\]** See [Parameter estimation using grid search with cross-validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py) for an example of Grid Search computation on the digits dataset.

* See [Sample pipeline for text feature extraction and evaluation](https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py) for an example of Grid Search coupling parameters from a text documents feature extractor (n-gram count vectorizer and TF-IDF transformer) with a classifier (here a linear SVM trained with SGD with either elastic net or L2 penalty) using a `pipeline.Pipeline` instance.

* See [Nested versus non-nested cross-validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py) for an example of Grid Search within a cross-validation loop on the iris dataset. **This is the best practice for evaluating the performance of a model with grid search**. *Essentially, the grid search plays the role of a validation step inside an outer cross-validation loop.*

* See [Demonstration of **multi-metric evaluation** on cross_val_score and GridSearchCV](https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py) for an example of `GridSearchCV` being used to evaluate multiple metrics simultaneously.

* See [Balance model complexity and cross-validated score](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_refit_callable.html#sphx-glr-auto-examples-model-selection-plot-grid-search-refit-callable-py) for an example of using the `refit=callable` interface in `GridSearchCV`. The example shows how this interface adds a certain amount of flexibility in identifying the “best” estimator. This interface can also be used in multiple metrics evaluation.

* See [Statistical comparison of models using grid search](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_stats.html#sphx-glr-auto-examples-model-selection-plot-grid-search-stats-py) for an example of how to do a statistical comparison on the outputs of `GridSearchCV`.

## [Randomized Parameter Optimization](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization)

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. 

`RandomizedSearchCV` implements a *randomized search over parameters*, where each setting is **sampled from a distribution over possible parameter values**. This has two main benefits over an exhaustive search:
* A budget can be chosen independent of the number of parameters and possible values.
* Adding parameters that do not influence the performance does not decrease efficiency.

Specifying how parameters should be sampled is done *using a dictionary*, very similar to specifying parameters for `GridSearchCV`. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the `n_iter` parameter. 

For each parameter, either a **distribution** over possible values or a **list** of discrete choices (which *will be sampled uniformly*) can be specified:
```python
{
    'C': scipy.stats.expon(scale=100),     # scipy **distribution**
    'gamma': scipy.stats.expon(scale=.1),  # scipy **distribution**
    'kernel': ['rbf'],                     # List
    'class_weight': ['balanced', None]     # List
}
```
This example uses the `scipy.stats` module, which contains many useful distributions for sampling parameters, such as `expon`, `gamma`, `uniform` or `randint`. In principle, any object can be passed that provides an `rvs` (random variate sample) method to sample a value. Consecutive calls to `rvs` should return independent random samples from the possible parameter values.

⚠️ For *continuous* parameters, such as `C` above, it is important to **specify a continuous distribution** to take full advantage of the randomization. This way, increasing `n_iter` will always lead to a finer search.

A continuous log-uniform random variable is available through `loguniform`. *This is a continuous version of log-spaced parameters*. For example, to specify `C` above, `loguniform(1, 100)` can be used **instead of** `[1, 10, 100]` or `np.logspace(0, 2, num=1000)`. `loguniform` is an alias of SciPy’s `stats.reciprocal`.

Mirroring the example above in grid search, we can specify a continuous random variable that is log-uniformly distributed between `1e0` and `1e3`:

In [11]:
from scipy.stats import loguniform  # available in SciPy >= 1.4

param_space = {
    'C': loguniform(1e0, 1e3),         # `loguniform` continuous, log spaced.
    'gamma': loguniform(1e-4, 1e-3),   # `loguniform` continuous, log spaced.
    'kernel': ['rbf'],                 # List, discrete.
    'class_weight':['balanced', None]  # List, discrete.
}
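
To get a feel for what a randomized search would draw from such a space, the candidates can be sampled directly with `sklearn.model_selection.ParameterSampler`. This is only a sketch: it assumes SciPy 1.4 or later so that `loguniform` can be imported from `scipy.stats`, and the values `n_iter=5` and `random_state=0` are arbitrary.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

param_space = {
    "C": loguniform(1e0, 1e3),           # continuous, log-spaced
    "gamma": loguniform(1e-4, 1e-3),     # continuous, log-spaced
    "kernel": ["rbf"],                   # discrete list, sampled uniformly
    "class_weight": ["balanced", None],  # discrete list, sampled uniformly
}

# Draw 5 candidate settings; n_iter plays the role of the search budget.
samples = list(ParameterSampler(param_space, n_iter=5, random_state=0))
for sample in samples:
    print(sample)
```

Increasing `n_iter` draws more points from the same continuous ranges, which is what makes the search finer.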

### [Example: Randomized vs Grid Search](https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py)

Compare randomized search and grid search for optimizing hyperparameters of a *linear SVM* with *SGD training*. All parameters that influence the learning are searched simultaneously (except for the number of estimators, which poses a time / quality tradeoff).

**The randomized search and the grid search explore exactly the same space of parameters.** The result in parameter settings is quite similar, while **the run time for randomized search is drastically lower**.

The randomized search may score slightly worse; this is likely a noise effect that would not carry over to a held-out test set.

Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.

In [12]:
from time import time

import numpy as np

import scipy.stats as stats
from scipy.stats import loguniform  # available in SciPy >= 1.4

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

In [13]:
# get some data
X, y = load_digits(return_X_y=True)
print(X.shape)
print(y.shape)

(1797, 64)
(1797,)


In [14]:
# build a classifier
clf = SGDClassifier(
    loss='hinge', 
    penalty='elasticnet',
    fit_intercept=True
)
print(clf)

SGDClassifier(penalty='elasticnet')


In [15]:
# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})"
                  .format(
                      results['mean_test_score'][candidate],
                      results['std_test_score'][candidate]
                  )
            )
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [16]:
# specify parameters and distributions to sample from
param_dist = {
    'average': [True, False],
    'l1_ratio': stats.uniform(0, 1),
    'alpha': loguniform(1e-4, 1e0)
}

In [17]:
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(
    clf, 
    param_distributions=param_dist,
    n_iter=n_iter_search
)

In [18]:
start = time()
random_search.fit(X, y)
print(f"RandomizedSearchCV took {(time() - start):.2f} seconds for {n_iter_search} candidate parameter settings.")
report(random_search.cv_results_)

RandomizedSearchCV took 15.52 seconds for 20 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.928 (std: 0.033)
Parameters: {'alpha': 0.8277590142094537, 'average': False, 'l1_ratio': 0.004245259225131637}

Model with rank: 2
Mean validation score: 0.923 (std: 0.026)
Parameters: {'alpha': 0.000808177876312611, 'average': True, 'l1_ratio': 0.20754959699827236}

Model with rank: 3
Mean validation score: 0.921 (std: 0.032)
Parameters: {'alpha': 0.0002004035975261302, 'average': False, 'l1_ratio': 0.9413608653373494}



In [19]:
# use a full grid over all parameters
from IPython.display import display
param_grid = {
    'average': [True, False],
    # Note that here we have to provide a list of discrete values for all!
    'l1_ratio': np.linspace(0, 1, num=10),
    'alpha': np.power(10, np.arange(-4, 1, dtype=float))
}
display(param_grid)

{'average': [True, False],
 'l1_ratio': array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
        0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ]),
 'alpha': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00])}

In [20]:
# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print(f"GridSearchCV took {(time() - start):.2f} seconds for {len(grid_search.cv_results_['params'])} candidate parameter settings.")
report(grid_search.cv_results_)

GridSearchCV took 82.59 seconds for 100 candidate parameter settings.


## [Searching for optimal parameters with successive halving](https://scikit-learn.org/stable/modules/grid_search.html#searching-for-optimal-parameters-with-successive-halving)

Scikit-learn also provides the 
* `HalvingGridSearchCV` and 
* `HalvingRandomSearchCV` 

estimators that can be used to search a parameter space using successive halving.

Successive halving (SH) is like **a tournament among candidate parameter combinations**. 
1. SH is an iterative selection process where all candidates (the parameter combinations) are evaluated with a small amount of **resources** at the first iteration. 
2. Only some of these candidates are selected for the next iteration, which will be allocated **more resources**. 
3. For parameter tuning, the **resource** is typically the *number of training samples*, but it can also be an *arbitrary numeric parameter* such as `n_estimators` in a random forest.

As illustrated in the figure below, only a subset of candidates ‘survive’ until the last iteration. These are the candidates that have consistently ranked among the top-scoring candidates across all iterations. Each iteration is allocated an *increasing amount of resources per candidate*, here the number of samples:

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_successive_halving_iterations_0012.png" width=600/>

**Parameters:**
* The `factor` (> 1) parameter controls the **rate at which the resources grow**, and the **rate at which the number of candidates decreases**. 
    * In each iteration, the number of resources per candidate is multiplied by `factor` and the number of candidates is divided by the same `factor`. 
    * Along with `resource` and `min_resources`, `factor` is the most important parameter to control the search in our implementation, though a value of `3` usually works well. 
    * `factor` **effectively controls the number of iterations** in `HalvingGridSearchCV` and the **number of candidates (by default) and iterations** in `HalvingRandomSearchCV`. 
* `aggressive_elimination=True` can also be used if the number of available resources is small. 
* More control is available through tuning the `min_resources` parameter.
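
The effect of `factor` can be sketched with plain arithmetic. The helper below, `halving_schedule`, is hypothetical and deliberately simplified: it is not scikit-learn's implementation and ignores details such as `aggressive_elimination` and the exact stopping rules. It only applies the rule described above: multiply the resources by `factor` and divide the candidates by `factor` at each iteration.

```python
import math


def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return a list of (n_candidates, n_resources) pairs, one per iteration.

    Simplified illustration only: stops when one candidate is left or when
    the next iteration would need more than max_resources.
    """
    schedule = []
    n_resources = min_resources
    while n_resources <= max_resources:
        schedule.append((n_candidates, n_resources))
        if n_candidates == 1:
            break
        n_candidates = math.ceil(n_candidates / factor)
        n_resources *= factor
    return schedule


# Enough resources: the tournament narrows down to a single candidate.
print(halving_schedule(27, 20, 1000, factor=3))
# Fewer iterations fit in the budget: several candidates remain at the end.
print(halving_schedule(27, 100, 1000, factor=3))
```

The second call shows why the amount of available resources matters: with a larger `min_resources`, the budget runs out before the candidates are narrowed down to one.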

⚠️ These estimators are still **experimental**: their predictions and their API might change without any deprecation cycle. To use them, you need to explicitly import `enable_halving_search_cv`:
```python
# explicitly require this experimental feature
from sklearn.experimental import enable_halving_search_cv  # noqa
# now you can import normally from model_selection
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import HalvingRandomSearchCV
```
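
A minimal end-to-end sketch of a halving search follows. The dataset, the decision tree and the grid values are assumptions chosen only to keep the example small; the attributes `n_iterations_`, `n_candidates_` and `n_resources_` report how the tournament unfolded.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

param_grid = {"max_depth": [2, 4, 6], "min_samples_split": [2, 5, 10]}
search = HalvingGridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    factor=3,        # resources x3 and candidates /3 at each iteration
    cv=5,
    random_state=0,  # controls the subsampling of training samples
).fit(X, y)

print("iterations:", search.n_iterations_)
print("candidates per iteration:", search.n_candidates_)
print("samples per iteration:", search.n_resources_)
print("best params:", search.best_params_)
```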

**📚 Detailed discussion of the `Halving` methods:**
* [Choosing min_resources and the number of candidates](https://scikit-learn.org/stable/modules/grid_search.html#choosing-min-resources-and-the-number-of-candidates)
* [Amount of resource and number of candidates at each iteration](https://scikit-learn.org/stable/modules/grid_search.html#amount-of-resource-and-number-of-candidates-at-each-iteration)
* [Choosing a resource](https://scikit-learn.org/stable/modules/grid_search.html#choosing-a-resource)
* [Exhausting the available resources](https://scikit-learn.org/stable/modules/grid_search.html#exhausting-the-available-resources)
* [Aggressive elimination of candidates](https://scikit-learn.org/stable/modules/grid_search.html#aggressive-elimination-of-candidates)

### [Analysing results with the `cv_results_` attribute](https://scikit-learn.org/stable/modules/grid_search.html#analysing-results-with-the-cv-results-attribute)

The `cv_results_` attribute contains useful information for analysing the results of a search. It can be converted to a `pandas` dataframe with `df = pd.DataFrame(est.cv_results_)`. 

The `cv_results_` attribute of `HalvingGridSearchCV` and `HalvingRandomSearchCV` is similar to that of `GridSearchCV` and `RandomizedSearchCV`, with additional information related to the successive halving process.

Here is an example with some of the columns of a (truncated) dataframe:

<table class="docutils align-default" style="margin-left:0; margin-right:auto;">
<colgroup>
<col style="width: 3%">
<col style="width: 5%">
<col style="width: 12%">
<col style="width: 13%">
<col style="width: 67%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"></th>
<th class="head"><p>iter</p></th>
<th class="head"><p>n_resources</p></th>
<th class="head"><p>mean_test_score</p></th>
<th class="head"><p>params</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>0</p></td>
<td><p>0</p></td>
<td><p>125</p></td>
<td><p>0.983667</p></td>
<td><p>{‘criterion’: ‘entropy’, ‘max_depth’: None, ‘max_features’: 9, ‘min_samples_split’: 5}</p></td>
</tr>
<tr class="row-odd"><td><p>1</p></td>
<td><p>0</p></td>
<td><p>125</p></td>
<td><p>0.983667</p></td>
<td><p>{‘criterion’: ‘gini’, ‘max_depth’: None, ‘max_features’: 8, ‘min_samples_split’: 7}</p></td>
</tr>
<tr class="row-even"><td><p>2</p></td>
<td><p>0</p></td>
<td><p>125</p></td>
<td><p>0.983667</p></td>
<td><p>{‘criterion’: ‘gini’, ‘max_depth’: None, ‘max_features’: 10, ‘min_samples_split’: 10}</p></td>
</tr>
<tr class="row-odd"><td><p>3</p></td>
<td><p>0</p></td>
<td><p>125</p></td>
<td><p>0.983667</p></td>
<td><p>{‘criterion’: ‘entropy’, ‘max_depth’: None, ‘max_features’: 6, ‘min_samples_split’: 6}</p></td>
</tr>
<tr class="row-even"><td><p>…</p></td>
<td><p>…</p></td>
<td><p>…</p></td>
<td><p>…</p></td>
<td><p>…</p></td>
</tr>
<tr class="row-odd"><td><p>15</p></td>
<td><p>2</p></td>
<td><p>500</p></td>
<td><p>0.951958</p></td>
<td><p>{‘criterion’: ‘entropy’, ‘max_depth’: None, ‘max_features’: 9, ‘min_samples_split’: 10}</p></td>
</tr>
<tr class="row-even"><td><p>16</p></td>
<td><p>2</p></td>
<td><p>500</p></td>
<td><p>0.947958</p></td>
<td><p>{‘criterion’: ‘gini’, ‘max_depth’: None, ‘max_features’: 10, ‘min_samples_split’: 10}</p></td>
</tr>
<tr class="row-odd"><td><p>17</p></td>
<td><p>2</p></td>
<td><p>500</p></td>
<td><p>0.951958</p></td>
<td><p>{‘criterion’: ‘gini’, ‘max_depth’: None, ‘max_features’: 10, ‘min_samples_split’: 4}</p></td>
</tr>
<tr class="row-even"><td><p>18</p></td>
<td><p>3</p></td>
<td><p>1000</p></td>
<td><p>0.961009</p></td>
<td><p>{‘criterion’: ‘entropy’, ‘max_depth’: None, ‘max_features’: 9, ‘min_samples_split’: 10}</p></td>
</tr>
<tr class="row-odd"><td><p>19</p></td>
<td><p>3</p></td>
<td><p>1000</p></td>
<td><p>0.955989</p></td>
<td><p>{‘criterion’: ‘gini’, ‘max_depth’: None, ‘max_features’: 10, ‘min_samples_split’: 4}</p></td>
</tr>
</tbody>
</table>

* Each row corresponds to a given parameter combination (a candidate) and a given iteration. 
* The iteration is given by the `iter` column. 
* (The `n_resources` column tells you how many resources were used.)

In the example above, the best parameter combination is `{'criterion': 'entropy', 'max_depth': None, 'max_features': 9, 'min_samples_split': 10}` since it has reached the last iteration (3) with the highest score: 0.96.
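
The same reading of the table can be reproduced with `pandas`. The sketch below runs a small halving search (the estimator, data and grid are assumptions for illustration, unrelated to the table above), converts `cv_results_` to a dataframe, and picks the top-scoring candidate of the last iteration.

```python
import pandas as pd
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
search = HalvingGridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 6], "min_samples_split": [2, 5, 10]},
    factor=3,
    random_state=0,
).fit(X, y)

# One row per (candidate, iteration) pair.
df = pd.DataFrame(search.cv_results_)
cols = ["iter", "n_resources", "mean_test_score", "params"]
print(df[cols])

# Keep the rows of the last iteration and take the highest mean test score.
last = df[df["iter"] == df["iter"].max()]
best_row = last.loc[last["mean_test_score"].idxmax()]
print(best_row["params"])
```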

## [Tips for parameter search](https://scikit-learn.org/stable/modules/grid_search.html#tips-for-parameter-search)



#### [Specifying an objective metric:](https://scikit-learn.org/stable/modules/grid_search.html#specifying-an-objective-metric)
By **default**, parameter search uses the `score` function of the estimator to evaluate a parameter setting. 
These are: 
* `sklearn.metrics.accuracy_score` for **classification** and 
* `sklearn.metrics.r2_score` for **regression**.

An alternative scoring function can be specified **via the `scoring` parameter** of most parameter search tools.



#### [Specifying multiple metrics for evaluation](https://scikit-learn.org/stable/modules/grid_search.html#specifying-multiple-metrics-for-evaluation)
`GridSearchCV` and `RandomizedSearchCV` allow specifying multiple metrics for the scoring parameter.

⚠️ But not the `Halving` searches!

Multimetric scoring can either be specified as:
* a `list` of strings of predefined scores names 
* or a `dict` mapping the scorer *name* to the scorer *function* and/or the predefined scorer name(s).

⚠️ When specifying multiple metrics, the `refit` parameter **must be set to the metric (string) for which the `best_params_` will be found and used to build the `best_estimator_` on the whole dataset**. If the search should not be refit, set `refit=False`. Leaving `refit` at its default value (`True`) will result in an **error** when using multiple metrics. See [example](https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py).
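
A minimal sketch of multi-metric scoring, assuming the iris dataset and a tiny grid purely for illustration. Each metric gets its own set of keys in `cv_results_` (for example `mean_test_accuracy`), and `refit` names the metric used to pick `best_params_`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(),
    {"C": [1, 10]},
    scoring=["accuracy", "f1_macro"],  # list of predefined scorer names
    refit="f1_macro",                  # metric used to select best_params_
    cv=5,
).fit(X, y)

results = search.cv_results_
print(results["mean_test_accuracy"])
print(results["mean_test_f1_macro"])
print(search.best_params_)
```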


#### [Composite estimators and parameter spaces](https://scikit-learn.org/stable/modules/grid_search.html#composite-estimators-and-parameter-spaces)

`GridSearchCV` and `RandomizedSearchCV` allow searching over **parameters of *composite* or *nested* estimators** such as:
* `Pipeline` [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline), 
* `ColumnTransformer` [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer), 
* `VotingClassifier` [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier) or 
* `CalibratedClassifierCV` [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html#sklearn.calibration.CalibratedClassifierCV)

using a dedicated `<estimator>__<parameter>` \[❗\] syntax:

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV  # <-- Composite/nested estimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons

In [23]:
X, y = make_moons()
print("X.shape:", X.shape)
print("y.shape:", y.shape)
print("X[0]:\n", X[0])
print("y[0]:\n", y[0])

X.shape: (100, 2)
y.shape: (100,)
X[0]:
 [0.1595999  0.98718178]
y[0]:
 0


In [24]:
calibrated_forest = CalibratedClassifierCV(
   base_estimator=RandomForestClassifier(n_estimators=10)
)
# ^ See this is COMPOSITE.

In [27]:
# Apply GridSearchCV
param_grid = {
    # ❗ <estimator>__<parameter> syntax:
    'base_estimator__max_depth': [2, 4, 6, 8]
}

search = GridSearchCV(
    calibrated_forest, 
    param_grid, 
    cv=5,
    verbose=2,
)

search.fit(X, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END ........................base_estimator__max_depth=2; total time=   0.1s
[CV] END ........................base_estimator__max_depth=2; total time=   0.1s
[CV] END ........................base_estimator__max_depth=2; total time=   0.1s
[CV] END ........................base_estimator__max_depth=2; total time=   0.1s
[CV] END ........................base_estimator__max_depth=2; total time=   0.1s
[CV] END ........................base_estimator__max_depth=4; total time=   0.1s
[CV] END ........................base_estimator__max_depth=4; total time=   0.1s
[CV] END ........................base_estimator__max_depth=4; total time=   0.1s
[CV] END ........................base_estimator__max_depth=4; total time=   0.1s
[CV] END ........................base_estimator__max_depth=4; total time=   0.1s
[CV] END ........................base_estimator__max_depth=6; total time=   0.1s
[CV] END ........................base_estimator__

GridSearchCV(cv=5,
             estimator=CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_estimators=10)),
             param_grid={'base_estimator__max_depth': [2, 4, 6, 8]}, verbose=2)

Here, `<estimator>` is the parameter name of the nested estimator, in this case `base_estimator`. 

If the meta-estimator is constructed as a collection of estimators as in `pipeline.Pipeline`, then `<estimator>` refers to the name of the estimator, see [Nested parameters](https://scikit-learn.org/stable/modules/compose.html#pipeline-nested-parameters). 
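
One way to discover the valid `<estimator>__<parameter>` names is to list the keys of `get_params()` on the composite estimator. The pipeline below is an assumption made up for this sketch; its step names `select` and `model` become the prefixes.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

pipe = Pipeline([("select", SelectKBest()), ("model", SVC())])

# Nested parameters are exposed as "<step name>__<parameter>".
names = sorted(name for name in pipe.get_params() if "__" in name)
print(names)
```

Any of the printed names can be used as a key in `param_grid` or `param_distributions`.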

In practice, there can be several levels of nesting:

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest

pipe = Pipeline(
    [
        ('select', SelectKBest()),
        ('model', calibrated_forest)  # From above example code.
    ]
)

param_grid = {
    # <estimator>=`select`, <parameter>=`k`:
    'select__k': [1, 2],
    # MULTIPLE NESTING <estimator>=`model`, <nested_estimator>=`base_estimator`, <parameter>=`max_depth`
    'model__base_estimator__max_depth': [2, 4, 6, 8]
}

search = GridSearchCV(pipe, param_grid, cv=5, verbose=2).fit(X, y)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END ....model__base_estimator__max_depth=2, select__k=1; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=1; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=1; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=1; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=1; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=2; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=2; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=2; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=2; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=2, select__k=2; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=4, select__k=1; total time=   0.1s
[CV] END ....model__base_estimator__max_depth=4, 

#### [Model selection: development and evaluation](https://scikit-learn.org/stable/modules/grid_search.html#model-selection-development-and-evaluation)

Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the parameters of the grid.

When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process: it is recommended to split the data into a **development set** (to be fed to the `GridSearchCV` instance) and an **evaluation set** to compute performance metrics.

This can be done by using the `train_test_split` utility function.

#### [Parallelism](https://scikit-learn.org/stable/modules/grid_search.html#parallelism)

The parameter search tools evaluate each parameter combination on each data fold independently. 

Computations *can be run in parallel* by using the keyword `n_jobs=-1`. 

See function signature for more details, and also the [Glossary entry for `n_jobs`](https://scikit-learn.org/stable/glossary.html#term-n_jobs).

#### [Robustness to failure](https://scikit-learn.org/stable/modules/grid_search.html#robustness-to-failure)

Some parameter settings may result in a failure to fit one or more folds of the data. 

By **default**, this will **cause the entire search to fail**, even if some parameter settings could be fully evaluated. 

Setting `error_score=0` (or `=np.nan`) **will make the procedure robust to such failure**, issuing a *warning* and setting the score for that fold to `0` (or `NaN`), but completing the search.
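
The sketch below illustrates this with a deliberately invalid setting. It assumes that `SVC` rejects a non-positive `C` at fit time, so every fold for `C=-1.0` fails; with `error_score=0` those folds score `0` and the search still completes.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C must be strictly positive, so every fit with C=-1.0 raises an error.
param_grid = {"C": [-1.0, 1.0]}
search = GridSearchCV(SVC(), param_grid, cv=5, error_score=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the fit-failure warnings for brevity
    search.fit(X, y)

# The failing candidate gets a mean score of 0; the valid one is selected.
print(search.cv_results_["mean_test_score"])
print(search.best_params_)
```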

## [Alternatives to brute force parameter search](https://scikit-learn.org/stable/modules/grid_search.html#alternatives-to-brute-force-parameter-search)

#### *Model specific* cross-validation

> Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for a single value of the parameter.

This feature can be leveraged to cross-validate that parameter more efficiently than a generic search would.

The most common parameter amenable to this strategy is the parameter **encoding the strength of the regularizer**. In this case we say that we compute the **regularization path** of the estimator.

Here is the list of such models:

<table class="longtable docutils align-default" style="margin-left:0; margin-right:auto;">
<colgroup>
<col style="width: 10%">
<col style="width: 90%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.linear_model.ElasticNetCV.html#sklearn.linear_model.ElasticNetCV" title="sklearn.linear_model.ElasticNetCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.ElasticNetCV</span></code></a>(*[,&nbsp;l1_ratio,&nbsp;…])</p></td>
<td><p>Elastic Net model with iterative fitting along a regularization path.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.linear_model.LarsCV.html#sklearn.linear_model.LarsCV" title="sklearn.linear_model.LarsCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.LarsCV</span></code></a>(*[,&nbsp;fit_intercept,&nbsp;…])</p></td>
<td><p>Cross-validated Least Angle Regression model.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV" title="sklearn.linear_model.LassoCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.LassoCV</span></code></a>(*[,&nbsp;eps,&nbsp;n_alphas,&nbsp;…])</p></td>
<td><p>Lasso linear model with iterative fitting along a regularization path.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.linear_model.LassoLarsCV.html#sklearn.linear_model.LassoLarsCV" title="sklearn.linear_model.LassoLarsCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.LassoLarsCV</span></code></a>(*[,&nbsp;fit_intercept,&nbsp;…])</p></td>
<td><p>Cross-validated Lasso, using the LARS algorithm.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV" title="sklearn.linear_model.LogisticRegressionCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.LogisticRegressionCV</span></code></a>(*[,&nbsp;Cs,&nbsp;…])</p></td>
<td><p>Logistic Regression CV (aka logit, MaxEnt) classifier.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.linear_model.MultiTaskElasticNetCV.html#sklearn.linear_model.MultiTaskElasticNetCV" title="sklearn.linear_model.MultiTaskElasticNetCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.MultiTaskElasticNetCV</span></code></a>(*[,&nbsp;…])</p></td>
<td><p>Multi-task L1/L2 ElasticNet with built-in cross-validation.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.linear_model.MultiTaskLassoCV.html#sklearn.linear_model.MultiTaskLassoCV" title="sklearn.linear_model.MultiTaskLassoCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.MultiTaskLassoCV</span></code></a>(*[,&nbsp;eps,&nbsp;…])</p></td>
<td><p>Multi-task Lasso model trained with L1/L2 mixed-norm as regularizer.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.linear_model.OrthogonalMatchingPursuitCV.html#sklearn.linear_model.OrthogonalMatchingPursuitCV" title="sklearn.linear_model.OrthogonalMatchingPursuitCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.OrthogonalMatchingPursuitCV</span></code></a>(*)</p></td>
<td><p>Cross-validated Orthogonal Matching Pursuit model (OMP).</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV" title="sklearn.linear_model.RidgeCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.RidgeCV</span></code></a>([alphas,&nbsp;…])</p></td>
<td><p>Ridge regression with built-in cross-validation.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.linear_model.RidgeClassifierCV.html#sklearn.linear_model.RidgeClassifierCV" title="sklearn.linear_model.RidgeClassifierCV"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.RidgeClassifierCV</span></code></a>([alphas,&nbsp;…])</p></td>
<td><p>Ridge classifier with built-in cross-validation.</p></td>
</tr>
</tbody>
</table>
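As an illustration of model-specific cross-validation, the sketch below uses `LassoCV` to select `alpha` along a regularization path. The synthetic dataset and the explicit `alphas` grid are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression problem (assumed, for illustration only).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Candidate regularization strengths; the path is computed once per fold.
alphas = np.logspace(-2, 2, 30)
model = LassoCV(alphas=alphas, cv=5).fit(X, y)

# mse_path_ holds the validation error for each alpha on each fold.
print(model.alpha_, model.mse_path_.shape)
```

After fitting, `model.alpha_` is the selected regularization strength and `model` is already refitted on the full data with that value.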

#### Information Criterion

> Some models can offer an **information-theoretic closed-form formula of the optimal estimate of the regularization parameter** by computing a single regularization path (instead of several when using cross-validation).

Here is the list of models benefiting from the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) for automated model selection:

<table class="longtable docutils align-default" style="margin-left:0; margin-right:auto;">
<colgroup>
<col style="width: 10%">
<col style="width: 90%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.linear_model.LassoLarsIC.html#sklearn.linear_model.LassoLarsIC" title="sklearn.linear_model.LassoLarsIC"><code class="xref py py-obj docutils literal notranslate"><span class="pre">linear_model.LassoLarsIC</span></code></a>([criterion,&nbsp;…])</p></td>
<td><p>Lasso model fit with Lars using BIC or AIC for model selection.</p></td>
</tr>
</tbody>
</table>
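A minimal sketch of criterion-based selection with `LassoLarsIC` follows. The synthetic dataset is an assumption of this example; no cross-validation folds are involved, since a single regularization path is scored with the chosen criterion.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# Synthetic regression problem (assumed, for illustration only).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# One path per model; alpha_ minimizes the chosen information criterion.
bic_model = LassoLarsIC(criterion="bic").fit(X, y)
aic_model = LassoLarsIC(criterion="aic").fit(X, y)

# criterion_ stores the criterion value for every alpha on the path.
print(bic_model.alpha_, aic_model.alpha_)
```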

#### Out of Bag Estimates

When using *ensemble* methods based upon **bagging**, i.e. generating new training sets using sampling with replacement, part of the training set remains unused. For each classifier in the ensemble, a different part of the training set is left out.

*This left out portion can be used to estimate the generalization error without having to rely on a separate validation set*. This estimate comes “for free” as no additional data is needed and can be used for model selection.

This is currently implemented in the following classes:

<table class="longtable docutils align-default" style="margin-left:0; margin-right:auto;">
<colgroup>
<col style="width: 10%">
<col style="width: 90%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier" title="sklearn.ensemble.RandomForestClassifier"><code class="xref py py-obj docutils literal notranslate"><span class="pre">ensemble.RandomForestClassifier</span></code></a>([…])</p></td>
<td><p>A random forest classifier.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor" title="sklearn.ensemble.RandomForestRegressor"><code class="xref py py-obj docutils literal notranslate"><span class="pre">ensemble.RandomForestRegressor</span></code></a>([…])</p></td>
<td><p>A random forest regressor.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier" title="sklearn.ensemble.ExtraTreesClassifier"><code class="xref py py-obj docutils literal notranslate"><span class="pre">ensemble.ExtraTreesClassifier</span></code></a>([…])</p></td>
<td><p>An extra-trees classifier.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor" title="sklearn.ensemble.ExtraTreesRegressor"><code class="xref py py-obj docutils literal notranslate"><span class="pre">ensemble.ExtraTreesRegressor</span></code></a>([n_estimators,&nbsp;…])</p></td>
<td><p>An extra-trees regressor.</p></td>
</tr>
<tr class="row-odd"><td><p><a class="reference internal" href="generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier" title="sklearn.ensemble.GradientBoostingClassifier"><code class="xref py py-obj docutils literal notranslate"><span class="pre">ensemble.GradientBoostingClassifier</span></code></a>(*[,&nbsp;…])</p></td>
<td><p>Gradient Boosting for classification.</p></td>
</tr>
<tr class="row-even"><td><p><a class="reference internal" href="generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor" title="sklearn.ensemble.GradientBoostingRegressor"><code class="xref py py-obj docutils literal notranslate"><span class="pre">ensemble.GradientBoostingRegressor</span></code></a>(*[,&nbsp;…])</p></td>
<td><p>Gradient Boosting for regression.</p></td>
</tr>
</tbody>
</table>
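As a sketch of model selection with out-of-bag estimates, the example below compares a few `max_depth` values of a random forest without any separate validation set. The dataset and the candidate depths are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Candidate depths are assumed for illustration; None means unlimited depth.
oob_scores = {}
for max_depth in (1, 2, None):
    forest = RandomForestClassifier(
        n_estimators=200, max_depth=max_depth, oob_score=True, random_state=0
    )
    forest.fit(X, y)
    # Accuracy measured on the samples each tree did not see during training.
    oob_scores[max_depth] = forest.oob_score_

best_depth = max(oob_scores, key=oob_scores.get)
print(oob_scores, best_depth)
```

Each forest is fitted once on the full data, so the comparison costs one fit per candidate rather than one fit per candidate and fold.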