# [Random forest (Leo Breiman, 2001)](src/paper/RandomForest/Random_Forest.pdf)
---
### 1. Decision trees

#### 1.1 Introduction
Throughout this part, we focus only on binary decision trees for classification. Multiclass classification trees exist but are not detailed here. Decision trees for regression are detailed in the *Regression trees* part.

Introduced by Breiman et al. in 1984 under the name CART (Classification And Regression Trees), binary decision trees are DAGs (directed acyclic graphs) that classify observations according to their features. For decision trees used for classification, the goal is to partition the feature space into hyperrectangles so as to separate the classes. Here is an example of a binary decision tree that aims to determine the risk of having a heart attack from the input variables $(\text{weight}, \text{age}, \text{smoker})$  

<div align="center">
  <img src="src/pics/RandomForest/DecisonTree.png" alt="a" width="750" height="500">
</div>

#### 1.2 Building decision trees

**Regression trees**

We consider a dataset $\mathcal{D} = (y, X_1, \dots, X_p) = \{ (y_1, (x_{1,1}, \dots, x_{1,p})), \dots, (y_n, (x_{n,1}, \dots, x_{n,p})) \}$ 
with 

* $y = (y_1, \dots, y_n) \in \mathbb{R}^n$
* $X_j = (x_{1,j}, \dots, x_{n,j}) \in \mathbb{R}^n, j \in \{1, \dots, p\}$.

We want to split the feature space ($\mathbb{R}^{p}$) into $M$ parts (i.e. regions) $(R_m)_{m \in \{ 1, \dots, M \}}$. **Here the regression model does not assign a class but a value to each region** (i.e. $\forall m \in \{ 1, \dots, M \}$, the region $R_m$ is assigned a constant $c_m \in \mathbb{R}$). The decision function is then: 

$$
f(x) = \sum_{m=1}^{M}c_m \mathbb{1}_{\{x \in R_m\}}
$$

where $\mathbb{1}$ denotes the indicator function: $\mathbb{1}_{\{x \in \mathcal{A}\}} = 1$ if $x \in \mathcal{A}$ and $0$ otherwise; more generally, $\mathbb{1}_{\{\bullet\}} = 1$ if the condition $(\bullet)$ holds and $0$ otherwise. The constants $c_m$ are chosen to minimize the empirical risk under the $l_2$ loss ($L(y_i,\hat{y_i}) = (y_i - \hat{y_i})^2$), so that the regression problem becomes the following optimization problem:

$$
\begin{alignat}{3}
\min_{(c_{m})_{m \in \{ 1, \dots, M \}}} \mathcal{R}_n(f) &= \min_{(c_m)_{m \in \{ 1, \dots, M \}}} \frac{1}{n} \sum_{i=1}^{n}L(y_i,\hat{y_i}) \\
                                                  &= \min_{(c_m)_{m \in \{ 1, \dots, M \}}} \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y_i})^2 \\
                                                  &= \min_{(c_m)_{m \in \{ 1, \dots, M \}}} \frac{1}{n} \sum_{i=1}^{n}(y_i - f(x_i))^2
\end{alignat}
$$

The solution is obtained by setting the gradient of the objective with respect to the constants $(c_m)_{m \in \{ 1, \dots, M \}}$ to zero. We therefore solve $\nabla \frac{1}{n} \sum_{i=1}^{n}(y_i - f(x_i))^2 = 0$, which gives: 

$$
\begin{alignat}{3}
\hat{c_{m}} &= \text{ave}(y_i | x_i \in R_m) \quad \forall m \in \{ 1, \dots, M \} \\
            &= \frac{1}{|R_m|}\sum_{i, x_i \in R_m} y_i \quad \forall m \in \{ 1, \dots, M \} \\
            &=\frac{\sum_{i=1}^{n} y_i \mathbb{1}_{\{x_i \in R_m\}}} {\sum_{i=1}^{n} \mathbb{1}_{\{x_i \in R_m\}}}, \quad \forall m \in \{ 1, \dots, M \}
\end{alignat}
$$

where $\hat{c_{m}}$ is the empirical estimate of the value $c_m$ given the input data for the region $R_m$, and $\text{ave}$ is the *average* function. $|R_m| = \sum_{i=1}^{n}\mathbb{1}_{\{x_i \in R_m \}}$ is the number of observations in region $R_m$. However, finding the optimal partition into regions $R_m$ is generally computationally infeasible. Instead, a greedy algorithm is used, which proceeds as follows:

Let $j$ be a splitting variable and $s$ a split point. Initially, the whole space is unsplit. The two half-spaces (regions) that together cover the whole space are written as follows: 

$$
R_1(j,s) = \{ x_i \in \mathcal{D} : x_{i,j} \leq s\}, \quad R_2(j,s) = \{ x_i \in \mathcal{D} : x_{i,j} > s\}
$$

We then minimize the following expression over the pair $(j,s)$:

$$
\min_{j,s} \left[ \min_{c_1} \sum_{x_{i} \in R_{1}(j,s)} (y_{i} - c_{1})^2 + \min_{c_2} \sum_{x_i \in R_{2}(j,s)} (y_{i} - c_{2})^2 \right]
$$

For a given pair $(j,s)$, the two inner minimizations are solved by: 

$$
\hat{c_1} = \text{ave}(y_i | x_i \in R_1(j,s)), \quad \hat{c_2} = \text{ave}(y_i | x_i \in R_2(j,s))
$$
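The search over $(j,s)$ described above can be sketched as an exhaustive loop over the features and the observed values. This is a minimal illustration with NumPy, not the CART implementation itself: the function name `best_split`, the choice of candidate thresholds (every observed value except the largest) and the toy data are assumptions made for the example.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search of the pair (j, s) minimizing the summed squared error."""
    n_samples, n_features = X.shape
    best = {"j": None, "s": None, "sse": np.inf, "c1": None, "c2": None}
    for j in range(n_features):
        # Candidate thresholds: every observed value except the largest,
        # so that both regions R1 and R2 are non-empty.
        for s in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= s                        # R1(j, s)
            right = ~left                              # R2(j, s)
            c1, c2 = y[left].mean(), y[right].mean()   # region averages
            sse = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()
            if sse < best["sse"]:
                best = {"j": j, "s": float(s), "sse": float(sse),
                        "c1": float(c1), "c2": float(c2)}
    return best

# Toy data (hypothetical): y jumps from 1 to 5 when the first feature exceeds 3
X = np.array([[1.0, 7.0], [2.0, 3.0], [3.0, 9.0],
              [4.0, 1.0], [5.0, 8.0], [6.0, 2.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

split = best_split(X, y)
print(split)
```

On this toy dataset the search selects the first feature with threshold 3, since that split makes both regions constant. Applying the same search again inside $R_1$ and $R_2$ is what grows the tree greedily.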

**Classification trees**

#### 1.3 Examples

**In two dimensions**

Consider a dataset $\mathcal{D} = (y, X_1, X_2) = \{ (y_1, (x_{1,1}, x_{1,2})), \dots, (y_n, (x_{n,1}, x_{n,2})) \}$ 
with $y = (y_1, \dots, y_n) \in \mathbb{R}^n$, $X_j = (x_{1,j}, \dots, x_{n,j}) \in \mathbb{R}^n, j \in \{1,2\}$. We want to split the space (i.e. the plane $(x_1, x_2)$) into $k$ rectangles so as to obtain well-defined zones that separate the classes.

1. We start by splitting the space in two along the boundary defined by the line $x_1 = t_1$. If $x_1 \leq t_1$, we are in the zone of class 1, called $R_1$; otherwise we are in $R_2$.
2. If $x_1 \leq t_1$, zone 1 ($R_1$) is split again in two by the line $x_2 = t_2$. If $x_2 \leq t_2$, we are in $R_3$, otherwise in $R_4$.
3. If $x_1 > t_1$, zone 2 ($R_2$) is split again in two by the line $x_2 = t_3$. If $x_2 \leq t_3$, we are in $R_5$, otherwise in $R_6$.
4. We repeat this process until the space is split into $k$ regions (the number of successive splits gives the depth of the tree).

<div align="center">
  <img src="src/pics/RandomForest/Regions.png" alt="a" width="350" height="150">
</div>

<div align="center">
  <img src="src/pics/RandomForest/decision_tree_2d.gif" alt="a" width="350" height="150">
</div>
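The nested splits listed in steps 1 to 3 amount to a small set of `if`/`else` rules. The sketch below only illustrates that reading: the function name `region` and the threshold values are hypothetical, chosen for the example.

```python
def region(x1, x2, t1, t2, t3):
    """Return the leaf region of the point (x1, x2) for the splits of steps 1 to 3."""
    if x1 <= t1:                            # first split: zone R1
        return "R3" if x2 <= t2 else "R4"   # second split inside R1
    return "R5" if x2 <= t3 else "R6"       # second split inside R2

# Hypothetical thresholds, chosen only for illustration
t1, t2, t3 = 0.5, 0.3, 0.7
points = [(0.2, 0.1), (0.2, 0.9), (0.8, 0.4), (0.8, 0.95)]
labels = [region(x1, x2, t1, t2, t3) for x1, x2 in points]
print(labels)  # ['R3', 'R4', 'R5', 'R6']
```

Each leaf of the tree corresponds to one rectangle of the plane, and following a path from the root to a leaf is the same as evaluating this chain of comparisons.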

### 2. Random forests
#### 2.1 Mathematical formulation of random forests

A random forest is an ensemble of several decision trees trained on different datasets. These datasets are generated by bagging (bootstrap aggregating) from the initial dataset. At each node, the split is chosen among a randomly selected subset of the dataset's features. For classification, the majority vote is taken. For regression, the mean of the values returned by all the trees is taken. The following image compares bagging and boosting (see 6. Boosting).

<div align="center">
  <img src="src/pics/RandomForest/BaggingVSBoosting.png" alt="a" width="600" height="300">
</div>

1. **Building a random forest**

Several training sets are generated by sampling with replacement from the initial dataset; the $i$-th bootstrap sample is denoted $D_{i}$. We denote by $\mathcal{D}$ the collection of datasets produced by the bootstrap:

$$
\mathcal{D} = (D_{i})_{i \in \{1, \dots, N\}}, \quad D_i = (Y_i, X_i)
$$

For each $D_{i} \in \mathcal{D}$, a decision tree $T_{i}$ is built.
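As an illustration of the bootstrap step, the sketch below draws row indices with replacement using NumPy. The sizes, the seed and the toy arrays are assumptions made for the example; one tree would then be trained on each pair `(X_i, y_i)`.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, for reproducibility only

n, N = 10, 3                          # n observations, N bootstrap samples (toy values)
X = np.arange(n * 2).reshape(n, 2)    # toy feature matrix: row i is (2i, 2i + 1)
y = np.arange(n)                      # toy targets: y_i = i

bootstrap_samples = []
for _ in range(N):
    idx = rng.integers(0, n, size=n)              # n row indices drawn with replacement
    bootstrap_samples.append((X[idx], y[idx]))    # one bootstrap dataset D_i

for i, (X_i, y_i) in enumerate(bootstrap_samples, start=1):
    print(f"D_{i}: {len(np.unique(y_i))} distinct observations out of {n}")
```

Because the draw is done with replacement, each bootstrap sample has the same size as the original dataset but usually contains repeated rows and leaves some observations out.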

2. **Variable selection**

For each tree, only a randomly drawn subset of the variables is considered at every node; the split is then chosen within this subset using a splitting criterion such as entropy or the Gini index.
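The two criteria mentioned here can be computed from the class proportions $p_k$ of a node: the Gini index is $1 - \sum_k p_k^2$ and the entropy is $-\sum_k p_k \log_2 p_k$. The sketch below is only illustrative: the helper names and the toy node are hypothetical, and the subset size $\sqrt{\text{number of features}}$ is an assumption (it matches the `max_features="sqrt"` default of scikit-learn's classifier).

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)       # fixed seed, for reproducibility only
n_features = 4
m = int(np.sqrt(n_features))         # subset size (assumed rule: square root)
candidate_features = rng.choice(n_features, size=m, replace=False)

node_labels = np.array([0, 0, 1, 1])  # toy node with two balanced classes
print(candidate_features, gini(node_labels), entropy(node_labels))
```

Both criteria are zero for a pure node and maximal when the classes are balanced, which is why a split that lowers them is considered a good split.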

3. **Making the prediction**

Each tree $T_{i}$ returns a prediction $\hat{y}_{i}$ (a class label for classification, a real value for regression). How these predictions are combined depends on the type of problem to solve: classification or regression. For classification, the majority class is returned, whereas for regression the mean of the $\hat{y_i}$ is returned.

#### 2.2 Random forests for classification

$$
\hat{y} = \text{mode} \{\hat{y_{1}}, \dots, \hat{y_{N}} \}
$$

where $\text{mode}$ denotes the mode function, which returns the most frequent prediction among the predictions of all the trees (i.e. the majority class). This function can be rewritten as:

$$
\text{mode}(x_1, \dots, x_N) = \arg\max_{c} \sum_{i=1}^{N} \mathbb{1}_{\{x_i = c\}}
$$

Example: $\text{mode}(\{1,1,2,3,2,2,2,4,1\}) = 2$
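The majority vote can be sketched with the standard library. The function name `mode` and the list of votes (taken from the example above) are illustrative only; in case of a tie, this version returns the value encountered first.

```python
from collections import Counter

def mode(values):
    """Return the most frequent value (ties resolved by first occurrence)."""
    return Counter(values).most_common(1)[0][0]

tree_votes = [1, 1, 2, 3, 2, 2, 2, 4, 1]   # one predicted class per tree
print(mode(tree_votes))  # 2
```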

#### 2.3 Random forests for regression

$$
\hat{y} = \frac{1}{N}\sum_{i = 1}^{N}\hat{y_{i}}
$$
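As a small numerical illustration of this average, the sketch below stores hypothetical tree predictions in a NumPy array (one row per tree, one column per observation) and averages over the trees.

```python
import numpy as np

# Hypothetical predictions: N = 3 trees (rows), 2 observations (columns)
tree_predictions = np.array([
    [2.0, 10.0],
    [3.0, 12.0],
    [4.0, 14.0],
])

# Forest prediction: mean over the trees, computed for each observation
forest_prediction = tree_predictions.mean(axis=0)
print(forest_prediction)
```

For the first observation the forest returns $(2 + 3 + 4) / 3 = 3$, and for the second $(10 + 12 + 14) / 3 = 12$.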

#### 2.4 Metrics for random forests

For classification: 

`accuracy_score`, `f1_score`, `roc_auc_score`, `confusion_matrix`

For regression:

`r2_score`, `mean_squared_error`, `mean_absolute_error`
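These names refer to functions of `sklearn.metrics`. The sketch below calls some of them on small hand-made label vectors; the values are hypothetical and chosen so that the results can be checked by hand.

```python
from sklearn.metrics import (accuracy_score, f1_score, confusion_matrix,
                             r2_score, mean_squared_error, mean_absolute_error)

# Classification: hypothetical true and predicted labels
y_true_cls = [0, 1, 1, 0]
y_pred_cls = [0, 1, 0, 0]
acc = accuracy_score(y_true_cls, y_pred_cls)    # 3 correct out of 4
f1 = f1_score(y_true_cls, y_pred_cls)           # F1 of the positive class (label 1)
cm = confusion_matrix(y_true_cls, y_pred_cls)   # rows: true class, columns: predicted class

# Regression: hypothetical true and predicted values
y_true_reg = [1.0, 2.0, 3.0]
y_pred_reg = [1.0, 2.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)
mae = mean_absolute_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)

print(acc, f1, cm.tolist(), mse, mae, r2)
```

Here a single regression error of 1 over three observations gives an MSE and an MAE of $1/3$, and $R^2 = 1 - 1/2 = 0.5$ because the total sum of squares around the mean is 2.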


#### 2.5 Training a random forest

**Classification**

<div align="center">
  <img src="src/pics/RandomForest/random_forest_classification_evolution.gif" alt="a" width="600" height="300">
</div>

Each decision tree produces a partition like the one shown in the animation of the decision tree part, here on a randomly generated dataset. The random forest takes all these results into account and keeps the regions most often returned by the trees of the forest. 

**Regression**

<div align="center">
  <img src="src/pics/RandomForest/random_forest_regression_evolution.gif" alt="a" width="600" height="300">
</div>

Among all the possible values of the prediction (the cloud of points generated by the predictions of the individual trees), the random forest returns the mean of the values returned by the trees of the forest.  

## 3 Examples in Python
---
### 3.1 Classification
#### 3.1.1 Importing the libraries

In [1]:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier                   # for classification
from sklearn.ensemble import RandomForestRegressor                    # for regression
from sklearn.datasets import load_iris, fetch_california_housing      # example datasets: iris for classification, fetch_california_housing for regression
from sklearn.model_selection import train_test_split, GridSearchCV    
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, f1_score, confusion_matrix, classification_report

#### 3.1.2 Loading the data
We use the iris dataset.

In [2]:
data = load_iris()

df = pd.DataFrame(data.data, columns = data.feature_names)
df["target"] = data.target

X = data.data # Features
y = data.target # Classes

df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


#### 3.1.3 Splitting the data into training and test sets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#### 3.1.4 Training the model
Here, we build a pipeline for the cases where one random forest is used for regression and another for classification. 

* The classification problem consists in predicting a species on the iris dataset
* The regression problem will be on ........

**Pipeline for the RandomForestClassifier**

For the Iris dataset, the pipeline is as follows: a grid search (`GridSearchCV`) looks for the best hyperparameters of the *RandomForestClassifier*, using 5-fold *cross-validation* to obtain more reliable score estimates. Before training the forests, the data are standardized. This step is not required here, since we only use random forests, which are insensitive to the scale of the data because the model does not rely on distance computations. It is, however, recommended when comparing with other models that can be sensitive to feature scale, such as logistic regression or SVMs (or linear regression and/or SVR for regression problems). The pipeline can be represented as follows:

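<pre>
for parameters in GridSearch:           # each candidate of the grid
    for fold in folds:                  # 5 folds
        Normalization                   # StandardScaler fitted on the 4 training folds
        Fit                             # RandomForestClassifier trained on the same 4 folds
        validation                      # score on the held-out 5th fold
    mean(scores)                        # average over the 5 folds
refit                                   # best candidate refitted on the whole training set
final_eval                              # on the best RandomForestClassifier
</pre>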

In [4]:
# ------------- Init -------------
pipe = Pipeline([
    ('scaler', StandardScaler()),          # Standardization
    ('rf', RandomForestClassifier(         # Model
        random_state=0,                    # seed
        verbose=True
    )) 
]) 

param_grid = {
    'rf__n_estimators': range(100, 200, 20),       # Number of trees
    'rf__max_depth': [None, 10, 20, 30, 50, 100],  # Maximum depth of the trees [None: no limit]
    'rf__min_samples_split': [2, 5, 10],           # Minimum number of samples required to split a node
}

scoring = {
    'accuracy': 'accuracy',
    'f1_macro': 'f1_macro',
    'precision_macro': 'precision_macro',
    'recall_macro': 'recall_macro'
}

grid_search = GridSearchCV(
    estimator=pipe,         # Pipeline: standardization + model
    param_grid=param_grid,  # Params
    cv=5,                   # Cross-validation
    scoring=scoring,        # Metrics
    refit='accuracy',       # Model selection by specific metric
    n_jobs=-1,              # CPU cores usage (all)
    verbose=2               # verbose
)

grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 90 candidates, totalling 450 fits


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.


#### 3.1.5 Extracting the best model
After training several models, we extract the best one from the grid search.

In [5]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rf__max_depth,param_rf__min_samples_split,param_rf__n_estimators,params,split0_test_accuracy,split1_test_accuracy,...,std_test_precision_macro,rank_test_precision_macro,split0_test_recall_macro,split1_test_recall_macro,split2_test_recall_macro,split3_test_recall_macro,split4_test_recall_macro,mean_test_recall_macro,std_test_recall_macro,rank_test_recall_macro
0,0.166302,0.010377,0.020627,0.002482,,2,100,"{'rf__max_depth': None, 'rf__min_samples_split...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
1,0.194416,0.009875,0.021696,0.001852,,2,120,"{'rf__max_depth': None, 'rf__min_samples_split...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
2,0.219371,0.013929,0.024199,0.005623,,2,140,"{'rf__max_depth': None, 'rf__min_samples_split...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
3,0.263531,0.039690,0.024896,0.002909,,2,160,"{'rf__max_depth': None, 'rf__min_samples_split...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
4,0.290408,0.018364,0.029766,0.003371,,2,180,"{'rf__max_depth': None, 'rf__min_samples_split...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,0.165636,0.003971,0.019081,0.000706,100,10,100,"{'rf__max_depth': 100, 'rf__min_samples_split'...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
86,0.196239,0.003916,0.021056,0.000531,100,10,120,"{'rf__max_depth': 100, 'rf__min_samples_split'...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
87,0.263218,0.035079,0.027068,0.005437,100,10,140,"{'rf__max_depth': 100, 'rf__min_samples_split'...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1
88,0.266722,0.010831,0.023745,0.002650,100,10,160,"{'rf__max_depth': 100, 'rf__min_samples_split'...",0.952381,0.904762,...,0.031138,1,0.952381,0.916667,0.910714,1.0,0.952381,0.946429,0.031944,1


Since this dataset is relatively simple, all the candidates displayed above obtain the same scores. Retrieving the best model:

In [6]:
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

In [7]:
best_model


In [8]:
best_params

{'rf__max_depth': None, 'rf__min_samples_split': 2, 'rf__n_estimators': 100}

In [9]:
best_score

np.float64(0.9428571428571428)

This is the mean cross-validated accuracy of the best parameter combination, accuracy being the metric configured in the grid search.
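To make the meaning of `best_score_` concrete, here is a minimal sketch. It assumes the iris dataset and a tiny hypothetical grid as stand-ins (neither is taken from the notebook above) and checks that `best_score_` is the mean test score stored in `cv_results_` for the winning candidate.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data and grid (assumptions made for this illustration only)
X_demo, y_demo = load_iris(return_X_y=True)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20]},
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",   # single metric, so refit uses it directly
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean of the per-fold accuracies of the best candidate
best_row = demo_search.best_index_
mean_cv_accuracy = demo_search.cv_results_["mean_test_score"][best_row]
print(demo_search.best_score_, mean_cv_accuracy)
```

Because it is an average over validation folds of the training data, this score can differ from the accuracy later measured on the held-out test set.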

#### 3.1.6 Predictions

In [10]:
y_pred = best_model.predict(X_test) # Predictions on the test data

[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished


#### 3.1.7 Model performance evaluation

In [11]:
f1_score(y_test, y_pred, average='macro')

1.0

In [12]:
accuracy_score(y_test, y_pred)

1.0

In [13]:
confusion_matrix(y_test, y_pred)

array([[19,  0,  0],
       [ 0, 13,  0],
       [ 0,  0, 13]])

In [14]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
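
The figures in the report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes the scikit-learn convention (rows are true classes, columns are predicted classes) and that every class is predicted at least once, so no division by zero occurs.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[19, 0, 0],
               [0, 13, 0],
               [0, 0, 13]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = number of predictions per class
recall = true_positives / cm.sum(axis=1)     # row sums = support of each class
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_positives.sum() / cm.sum()
macro_f1 = f1.mean()                         # unweighted mean over classes
print(accuracy, macro_f1)
```

Since the matrix is diagonal, every per-class precision and recall equals 1, which is why both the accuracy and the macro F1 are 1.0.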



---
### 3.2 Regression
#### 3.2.1 Loading the dataset and splitting into $X_{train}, X_{test}$

Here we use the California Housing dataset: the data is loaded, then split into $X_{train}$ and $X_{test}$.

In [15]:
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### 3.2.2 Model training

In [16]:
# ------------- Init -------------
pipe = Pipeline([
    ('scaler', StandardScaler()),          # Standardization
    ('rf', RandomForestRegressor(         # Model
        random_state=0,                    # seed
        verbose=True
    )) 
]) 

param_grid = {
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [None, 10, 20],
    'rf__min_samples_split': [2, 5],
}

scoring = {
    'r2': 'r2',
    'neg_mse': 'neg_mean_squared_error',
    'neg_mae': 'neg_mean_absolute_error'
}

grid_search = GridSearchCV(
    estimator=pipe,         # Pipeline: standardization + model
    param_grid=param_grid,  # Params
    cv=5,                   # Cross-validation
    scoring=scoring,        # Metrics
    refit='r2',             # Model selection by specific metric
    n_jobs=-1,              # CPU cores usage (all)
    verbose=2               # verbose
)

grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    4.5s



#### 3.2.3 Extracting the best model

In [17]:
pd.DataFrame(grid_search.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_rf__max_depth | param_rf__min_samples_split | param_rf__n_estimators | params | split0_test_r2 | split1_test_r2 | ... | std_test_neg_mse | rank_test_neg_mse | split0_test_neg_mae | split1_test_neg_mae | split2_test_neg_mae | split3_test_neg_mae | split4_test_neg_mae | mean_test_neg_mae | std_test_neg_mae | rank_test_neg_mae |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.50234 | 0.117841 | 0.103748 | 0.002086 | | 2 | 100 | {'rf__max_depth': None, 'rf__min_samples_split... | 0.792316 | 0.808461 | ... | 0.01219 | 6 | -0.340319 | -0.334512 | -0.341221 | -0.337477 | -0.336992 | -0.338104 | 0.002415 | 5 |
| 1 | 20.757374 | 0.081853 | 0.212534 | 0.018033 | | 2 | 200 | {'rf__max_depth': None, 'rf__min_samples_split... | 0.79404 | 0.809655 | ... | 0.01221 | 2 | -0.338044 | -0.332142 | -0.339361 | -0.335154 | -0.336343 | -0.336209 | 0.002489 | 1 |
| 2 | 9.70076 | 0.276415 | 0.084955 | 0.007275 | | 5 | 100 | {'rf__max_depth': None, 'rf__min_samples_split... | 0.79186 | 0.806984 | ... | 0.012316 | 8 | -0.341174 | -0.33546 | -0.341408 | -0.336875 | -0.337114 | -0.338406 | 0.002424 | 7 |
| 3 | 19.714474 | 0.315768 | 0.162108 | 0.023479 | | 5 | 200 | {'rf__max_depth': None, 'rf__min_samples_split... | 0.793506 | 0.808873 | ... | 0.01218 | 3 | -0.338529 | -0.33272 | -0.33972 | -0.335201 | -0.336696 | -0.336573 | 0.002469 | 3 |
| 4 | 6.473466 | 0.176779 | 0.044807 | 0.004309 | 10.0 | 2 | 100 | {'rf__max_depth': 10, 'rf__min_samples_split':... | 0.769698 | 0.781294 | ... | 0.011829 | 11 | -0.371934 | -0.371159 | -0.375282 | -0.367622 | -0.369966 | -0.371193 | 0.002511 | 11 |
| 5 | 13.406473 | 0.213723 | 0.087685 | 0.015979 | 10.0 | 2 | 200 | {'rf__max_depth': 10, 'rf__min_samples_split':... | 0.771182 | 0.782304 | ... | 0.011369 | 10 | -0.369979 | -0.369294 | -0.373323 | -0.367058 | -0.370081 | -0.369947 | 0.002009 | 9 |
| 6 | 7.044411 | 0.228609 | 0.042785 | 0.002131 | 10.0 | 5 | 100 | {'rf__max_depth': 10, 'rf__min_samples_split':... | 0.769808 | 0.78129 | ... | 0.011683 | 12 | -0.371983 | -0.371024 | -0.375561 | -0.367788 | -0.369907 | -0.371252 | 0.002568 | 12 |
| 7 | 13.560994 | 0.497549 | 0.078759 | 0.003278 | 10.0 | 5 | 200 | {'rf__max_depth': 10, 'rf__min_samples_split':... | 0.771462 | 0.782108 | ... | 0.011197 | 9 | -0.369893 | -0.369571 | -0.373514 | -0.366993 | -0.370126 | -0.370019 | 0.002078 | 10 |
| 8 | 10.108453 | 0.108852 | 0.093677 | 0.003428 | 20.0 | 2 | 100 | {'rf__max_depth': 20, 'rf__min_samples_split':... | 0.792892 | 0.807865 | ... | 0.012077 | 5 | -0.340523 | -0.334762 | -0.341553 | -0.336735 | -0.336952 | -0.338105 | 0.002535 | 6 |
| 9 | 18.427678 | 0.474731 | 0.147896 | 0.002683 | 20.0 | 2 | 200 | {'rf__max_depth': 20, 'rf__min_samples_split':... | 0.794119 | 0.809418 | ... | 0.012108 | 1 | -0.33811 | -0.332414 | -0.339728 | -0.335175 | -0.336642 | -0.336414 | 0.002508 | 2 |


With a more complex dataset, performance now varies with the model's hyperparameters: for instance, limiting `max_depth` to 10 lowers the first-fold $R^2$ from about 0.79 to about 0.77.
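
As a rough illustration of how the results can be read, the sketch below copies the `mean_test_neg_mae` column of the ten visible rows (the remaining rows of the grid are not shown in the output, so they are left out) and compares candidates. The variable names are hypothetical.

```python
# (max_depth, min_samples_split, n_estimators, mean_test_neg_mae)
# Values copied from the visible rows of cv_results_ above
visible_rows = [
    (None, 2, 100, -0.338104),
    (None, 2, 200, -0.336209),
    (None, 5, 100, -0.338406),
    (None, 5, 200, -0.336573),
    (10, 2, 100, -0.371193),
    (10, 2, 200, -0.369947),
    (10, 5, 100, -0.371252),
    (10, 5, 200, -0.370019),
    (20, 2, 100, -0.338105),
    (20, 2, 200, -0.336414),
]

# Scores are negated errors, so the best candidate has the largest value
best_by_mae = max(visible_rows, key=lambda row: row[3])

# Average MAE per max_depth, to see the effect of that hyperparameter alone
mae_by_depth = {}
for depth, _, _, neg_mae in visible_rows:
    mae_by_depth.setdefault(depth, []).append(-neg_mae)
mean_mae_by_depth = {depth: sum(v) / len(v) for depth, v in mae_by_depth.items()}

print(best_by_mae)
print(mean_mae_by_depth)
```

Among the visible rows, the lowest MAE is obtained with `max_depth=None`, whereas the grid search refits on $R^2$ (`refit='r2'`), so the selected parameters need not be the ones ranked first on MAE.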

In [18]:
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

In [19]:
best_model

Pipeline(steps=[('scaler', StandardScaler()),
                ('rf',
                 RandomForestRegressor(max_depth=20, n_estimators=200,
                                       random_state=0, verbose=True))])


In [20]:
best_params

{'rf__max_depth': 20, 'rf__min_samples_split': 2, 'rf__n_estimators': 200}

In [21]:
best_score

np.float64(0.8019263223659057)

#### 3.2.4 Predictions

In [22]:
y_pred = best_model.predict(X_test) # Predictions on the test data

[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 199 tasks      | elapsed:    0.2s
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.2s finished


#### 3.2.5 Model performance evaluation

In [23]:
r2_score(y_test, y_pred)

0.7993020604265566

In [24]:
mean_squared_error(y_test, y_pred)

0.2617010621912664
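
For reference, both metrics can be written out in a few lines. The sketch below uses a small hypothetical toy sample (not the housing data) to show the definitions $R^2 = 1 - SS_{res}/SS_{tot}$ and $MSE = SS_{res}/n$, then converts the MSE reported above into an RMSE, which is expressed in the same unit as the target.

```python
import math

def r2_and_mse(y_true, y_pred):
    """Return (R^2, MSE) computed from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot, ss_res / n

# Hypothetical toy values, for illustration only
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
toy_r2, toy_mse = r2_and_mse(toy_true, toy_pred)
print(toy_r2, toy_mse)

# RMSE corresponding to the test MSE printed above
reported_rmse = math.sqrt(0.2617010621912664)
print(reported_rmse)
```

The California Housing target is expressed in units of 100,000 USD, so an RMSE of roughly 0.51 corresponds to a typical error of about 51,000 USD.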


Note: `classification_report` does not support continuous values, so it cannot be used for regression.

# [Isolation forest (Fei Tony Liu, 2008)](src/paper/IsolationForest/Isolation_Forest.pdf)
---
### 1.1 Isolation Tree

Isolation forest applies the random forest principle to anomaly detection. The underlying idea is that the more anomalous a point is, the easier it is to isolate from the rest of the point cloud. To measure how easily a point of the *dataset* is isolated, each point is assigned an isolation score. As a reminder, three kinds of unusual values are distinguished in anomaly detection:

* **Extreme value**: minimum or maximum value taken by the dataset. Physically plausible but rare (e.g. 40°C in summer)
* **Aberrant value**: physically implausible value (e.g. a temperature of 752°C on Earth)
* **Anomaly**: physically acceptable value, but not in its context (e.g. a temperature of 30°C in winter)

Here we only consider the last kind of abnormal value, which makes this an unsupervised binary classification problem. To build an isolation forest, we first build isolation trees. These resemble the decision trees of a plain *random forest*, but without an optimization criterion such as the Gini index: each split is drawn at random. They keep the same stopping criteria, however.
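
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the reference implementation: the function name, the `max_depth` limit and the toy data are assumptions, only the branch containing the point of interest is grown, and the depth adjustment the paper applies to unfinished branches is left out.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=20):
    """Depth at which `point` gets isolated in one randomly grown isolation tree."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))            # random feature, no Gini-like criterion
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:                                # cannot split on this feature: stop here
        return depth
    threshold = rng.uniform(low, high)             # random split value
    # Keep only the rows falling on the same side of the split as `point`
    if point[feature] < threshold:
        same_side = [row for row in data if row[feature] < threshold]
    else:
        same_side = [row for row in data if row[feature] >= threshold]
    return path_length(point, same_side, rng, depth + 1, max_depth)

rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(63)]  # dense group of points
outlier = (10.0, 10.0)                                             # far from the group
data = cluster + [outlier]
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)         # point nearest the centre

def mean_depth(point, n_trees=200):
    """Average isolation depth of `point` over several random trees."""
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

outlier_depth = mean_depth(outlier)
inlier_depth = mean_depth(inlier)
print(outlier_depth, inlier_depth)
```

The isolated point is separated after far fewer random splits than a point in the middle of the cluster, which is exactly the signal the isolation score relies on.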

### 1.2 Score d'anomalies

For a sample of size $n$, the average path length (depth) in an isolation tree is given by:

$$
\begin{align}
C(n) &= 2H(n-1) - \frac{2(n-1)}{n}, \quad H(n) \approx \ln(n) + \gamma, \quad \gamma \approx 0.57721 \quad (\text{Euler-Mascheroni constant}) \\
     &\approx 2\ln(n-1) + 2\gamma - \frac{2(n-1)}{n}
\end{align}
$$
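
The formula translates directly into code. In the sketch below, the function names are hypothetical and the special cases for $n \leq 2$ (where the logarithmic approximation is not meaningful) follow a common convention rather than anything stated in this document.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def harmonic_approx(i):
    """Approximation of the harmonic number H(i) by ln(i) + gamma."""
    return math.log(i) + EULER_GAMMA

def average_path_length(n):
    """C(n) = 2 H(n-1) - 2 (n-1) / n for a sample of size n."""
    if n <= 1:
        return 0.0   # assumed convention: a single point is already isolated
    if n == 2:
        return 1.0   # assumed convention: one split separates two points
    return 2.0 * harmonic_approx(n - 1) - 2.0 * (n - 1) / n

c_256 = average_path_length(256)
print(c_256)
```

$C(n)$ grows logarithmically with $n$: for a sample of 256 points it is close to 10.24, which gives the scale against which observed depths are compared.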

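The anomaly score of a given point $x$ is then obtained by normalising its mean depth over the forest by $C(n)$:

$$
\boxed{S(x) = 2^{-\frac{\frac{1}{M} \sum_{m=1}^{M} h_m(x)}{C(n)}}}
$$

With $m \in \{ 1;M \}$, where $M$ is the number of estimators (trees) in the forest and $h_m(x)$ is the depth of $x$ in tree $m$. A score close to 1 indicates an anomaly, whereas a score well below 0.5 indicates a normal point.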

<div align="center">
  <img src="src/pics/RandomForest/isolation_forest.gif" alt="a" width="500" height="500">
</div>
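
As a small numerical check of the normalised score from the Isolation Forest paper, $S(x) = 2^{-\bar{h}(x)/C(n)}$ with $\bar{h}(x)$ the mean depth of $x$ over the trees, the sketch below evaluates it on hypothetical depth lists. The function names and the depth values are assumptions made for the illustration.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_length(n):
    """C(n) = 2 (ln(n-1) + gamma) - 2 (n-1) / n, valid here for n > 2."""
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(depths, n):
    """Score 2^(-mean depth / C(n)) from the depths of x in each tree."""
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / average_path_length(n))

n_samples = 256
shallow_depths = [1, 2, 1, 3]     # hypothetical depths of an easily isolated point
deep_depths = [11, 12, 10, 13]    # hypothetical depths of a point inside a cluster

score_anomaly = anomaly_score(shallow_depths, n_samples)
score_normal = anomaly_score(deep_depths, n_samples)
score_average = anomaly_score([average_path_length(n_samples)], n_samples)
print(score_anomaly, score_normal, score_average)
```

A point whose mean depth equals $C(n)$ gets a score of exactly 0.5, shallower points move towards 1 and deeper points towards 0.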


## 2 Python examples
---