A pipeline, or transformation chain, chains a transformer (or several) with a final estimator.

--> TRANSFORMER + ESTIMATOR = PIPELINE

A pipeline exposes its own fit, predict, and score methods.

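Concretely, when fit is called on the pipeline, the transformer is fitted on the training data and transforms it (fit_transform), then the estimator is fitted on the transformed data.
When predict is called, the transformer only applies its transform method (it is not refitted), and the estimator calls its predict method on the transformed data.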
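
To make the fit/predict flow concrete, here is a minimal sketch that reproduces by hand what the pipeline does and checks that both give the same predictions. The variable names (`X_demo`, `Xa`, `Xb`, `pipe`, `scaler`, `knn`) and the split used here are illustrative assumptions, kept separate from the variables used in the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data split (names chosen to avoid clobbering the notebook's variables)
X_demo, y_demo = load_iris(return_X_y=True)
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Pipeline version: a single fit and a single predict
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(Xa, ya)
pipe_pred = pipe.predict(Xb)

# Manual version of the same steps
scaler = StandardScaler()
knn = KNeighborsClassifier()
knn.fit(scaler.fit_transform(Xa), ya)            # fit: fit_transform, then fit
manual_pred = knn.predict(scaler.transform(Xb))  # predict: transform only, then predict

same = np.array_equal(pipe_pred, manual_pred)
print(same)
```

The scaler's statistics come from the training data only; at prediction time they are reused, not recomputed, which is what prevents information from the test set leaking into preprocessing.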

In [1]:
from sklearn.datasets import load_iris

In [2]:
iris = load_iris()
X = iris.data
y = iris.target

In [26]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

In [8]:
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier())

In [10]:
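# train_size=0.2 trains on only 30 of the 150 samples (test_size=0.2 is the more usual split); the outputs below reflect this choice
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=5)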

## Using GridSearchCV on a pipeline

In [23]:
model.get_params()

{'memory': None,
 'steps': [('standardscaler', StandardScaler()),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'standardscaler': StandardScaler(),
 'kneighborsclassifier': KNeighborsClassifier(),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'kneighborsclassifier__algorithm': 'auto',
 'kneighborsclassifier__leaf_size': 30,
 'kneighborsclassifier__metric': 'minkowski',
 'kneighborsclassifier__metric_params': None,
 'kneighborsclassifier__n_jobs': None,
 'kneighborsclassifier__n_neighbors': 5,
 'kneighborsclassifier__p': 2,
 'kneighborsclassifier__weights': 'uniform'}

In [37]:
params = {
    # 'standardscaler__copy' : [True, False],
    # 'standardscaler__with_mean' : [True, False],
    # 'standardscaler__with_std' : [True, False],
    'kneighborsclassifier__metric' : ['euclidean','minkowski','manhattan'],
    'kneighborsclassifier__n_neighbors' : np.arange(1,20)
}

In the parameter dictionary, each key is the name of the step (transformer or estimator), followed by two underscores, followed by the name of the hyperparameter, for example 'kneighborsclassifier__n_neighbors'.
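
The same `step__parameter` convention also works with `set_params`, which is a quick way to check that a key is spelled correctly before launching a grid search. The sketch below is illustrative: the step names `scaler` and `knn` in the second pipeline are hypothetical names chosen for the example, whereas `make_pipeline` derives its step names from the lowercased class names.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names each step after its class, in lowercase
auto_named = make_pipeline(StandardScaler(), KNeighborsClassifier())
auto_named.set_params(kneighborsclassifier__n_neighbors=3)
auto_k = auto_named.named_steps["kneighborsclassifier"].n_neighbors

# Pipeline lets you choose the step names (hypothetical names here),
# which shortens the keys used in a parameter grid
custom_named = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
custom_named.set_params(knn__n_neighbors=7, scaler__with_mean=False)
custom_k = custom_named.named_steps["knn"].n_neighbors
custom_with_mean = custom_named.named_steps["scaler"].with_mean

print(auto_k, custom_k, custom_with_mean)
```

With custom step names, the grid keys would become, for instance, 'knn__n_neighbors' instead of 'kneighborsclassifier__n_neighbors'.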

In [38]:
grid = GridSearchCV(model, param_grid=params, cv=4)

In [39]:
grid.fit(X_train, y_train)

In [46]:
grid.best_score_

0.9375

In [40]:
grid.best_params_

{'kneighborsclassifier__metric': 'manhattan',
 'kneighborsclassifier__n_neighbors': 5}

In [41]:
model = grid.best_estimator_

In [43]:
model.predict(X_test)

array([1, 1, 2, 0, 2, 1, 0, 2, 0, 1, 1, 1, 2, 2, 0, 0, 2, 2, 0, 0, 1, 2,
       0, 2, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 2, 0, 2, 2, 1, 0, 0,
       1, 1, 1, 2, 2, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1, 2, 1, 2, 2, 1, 0, 1,
       0, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 0, 2, 0, 0, 1, 0, 0, 2, 1, 0,
       2, 0, 1, 1, 0, 0, 2, 1, 1, 0, 0, 2, 1, 1, 0, 1, 2, 1, 0, 1, 2, 2,
       2, 2, 0, 0, 2, 2, 0, 1, 0, 0])

In [45]:
model.score(X_test, y_test)

0.9416666666666667