In [1]:
# https://iaml.it/blog/optimizing-sklearn-pipelines

Pipeline aka workflow

A pipeline executes a sequence of typical tasks: data normalization, imputation of missing values, outlier detection, dimensionality reduction, and classification.

### Pipeline Setup

In [3]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'])

In [4]:
#  import all modules for this tutorial
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

This pipeline is composed of the following steps:
1. Data Normalization: in this tutorial we compare three normalization methods: StandardScaler, RobustScaler, and QuantileTransformer.
2. Dimensionality Reduction: we selected Principal Component Analysis (PCA) and a univariate feature selection algorithm as possible candidates.
3. Regression: we apply a simple regularized linear method, although the method is easily extendable to other learning algorithms.

#### Manually implementing Pipeline

In [5]:
# Instantiate
scaler = StandardScaler()
pca = PCA()
ridge = Ridge()

In [6]:
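# Manually pass the training dataset through every step.
# Caution: this overwrites X_train with its scaled, PCA-transformed version,
# so later cells that reuse X_train fit on transformed data while X_test stays raw.
X_train = scaler.fit_transform(X_train)
X_train = pca.fit_transform(X_train)
ridge.fit(X_train, y_train)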

Ridge()

#### Scikit-learn pipeline object

In [7]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reduce_dim', PCA()),
        ('regressor', Ridge())
        ])

The pipeline module relies on the common interface that every scikit-learn estimator implements: fit, transform, and predict.
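As a minimal sketch of that shared interface, any object exposing fit and transform can be used as a pipeline step. The `MeanCenterer` class and the synthetic `X_demo`/`y_demo` data below are hypothetical names introduced only for illustration; they are not part of the original tutorial.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        # Learn the per-column mean from the training data only.
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the stored means to whatever data is passed in.
        return np.asarray(X) - self.mean_


# Synthetic, noiseless linear data (an assumption made for this example).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

demo_pipe = Pipeline([
    ('center', MeanCenterer()),
    ('regressor', Ridge(alpha=1e-3)),
])
demo_pipe.fit(X_demo, y_demo)

centered = demo_pipe.named_steps['center'].transform(X_demo)
demo_score = demo_pipe.score(X_demo, y_demo)
```

Inheriting from `TransformerMixin` supplies `fit_transform` for free, which is what the pipeline calls on intermediate steps during training.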

In [8]:
# train and test whole pipeline
pipe = pipe.fit(X_train, y_train)
print('Testing score: ', pipe.score(X_test, y_test))

Testing score:  -12056.504018831934


It is also possible to index the pipeline to access a specific element and retrieve a single value, for example the explained variance in the PCA step:

In [9]:
print(pipe.steps[1][1].explained_variance_)

[1.0026455 1.0026455 1.0026455 1.0026455 1.0026455 1.0026455 1.0026455
 1.0026455 1.0026455 1.0026455 1.0026455 1.0026455 1.0026455]
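Besides positional indexing through `steps`, a fitted step can also be retrieved by its name. The sketch below uses a small synthetic dataset from `make_regression` (an assumption made to keep the example self-contained) and shows three equivalent ways to reach the PCA step.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, random_state=0)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
]).fit(X_demo, y_demo)

by_position = demo_pipe.steps[1][1]             # (name, estimator) tuple, then the estimator
by_name = demo_pipe.named_steps['reduce_dim']   # lookup by step name
by_key = demo_pipe['reduce_dim']                # the pipeline itself can be indexed by name

# Fraction of variance explained by each principal component.
variance_ratio = by_name.explained_variance_ratio_
print(variance_ratio)
```

Name-based access tends to be more robust than positional access when steps are added or reordered.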


During training, fit_transform is invoked on every transformer in the pipeline (and fit on the final estimator); at test time, transform is called on each transformer, followed by predict or score on the final estimator.
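The following sketch spells out that call sequence by hand and checks it against a pipeline. It uses synthetic data and hypothetical variable names (`Xd_train`, `Z_train`, and so on), and it keeps the transformed arrays in separate variables so that the raw training data is not overwritten.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Training: fit_transform on each transformer, fit on the final estimator.
scaler, pca, ridge = StandardScaler(), PCA(), Ridge()
Z_train = pca.fit_transform(scaler.fit_transform(Xd_train))
ridge.fit(Z_train, yd_train)

# Test: transform only (no refitting), then predict.
Z_test = pca.transform(scaler.transform(Xd_test))
manual_pred = ridge.predict(Z_test)

# The pipeline performs the same sequence internally.
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
]).fit(Xd_train, yd_train)
pipe_pred = demo_pipe.predict(Xd_test)

print(np.allclose(manual_pred, pipe_pred))
```

Because the pipeline applies its own transformers, it should be given raw data for both fitting and scoring.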

#### Pipeline Tuning (base version)

Let's start with an example in which we optimize the number of components selected by the PCA and the regularization factor of the linear regression model.

In [10]:
# For PCA, we want to evaluate how the score varies with the number of components, from 1 to 10:
import numpy as np
n_features_to_test = np.arange(1, 11)

In [11]:
# For the regularization factor, we consider an exponential range of values (see the scikit-learn GridSearchCV tutorial)
alpha_to_test = 2.0**np.arange(-6, +6)

The two parameters interact and should be optimized jointly: a change in the number of PCA components may call for a different regularization factor, and vice versa. It is therefore important to evaluate all their combinations, and this is where the pipeline module really helps. First, we define a dictionary with all the parameters we would like to combine in the evaluation:

In [12]:
params = {'reduce_dim__n_components': n_features_to_test,
          'regressor__alpha': alpha_to_test}
# Note the naming: pipeline step name, double underscore, parameter name within the step
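The valid keys for such a dictionary can be listed with `get_params()`, and the same names work with `set_params()`. This is a small sketch on an unfitted pipeline; `demo_pipe` is a hypothetical name that mirrors the pipeline defined earlier.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
])

# Every key of the form '<step>__<parameter>' can be used in a search grid.
tunable = sorted(demo_pipe.get_params().keys())
print([name for name in tunable if '__' in name])

# The same naming scheme sets a nested parameter directly.
demo_pipe.set_params(reduce_dim__n_components=3, regressor__alpha=0.5)
```

A misspelled key raises an error when the search is fitted, so checking against `get_params()` first can save a failed run.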

In [13]:
from sklearn.model_selection import GridSearchCV
gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)
print('Final score is: ', gridsearch.score(X_test, y_test))

Fitting 5 folds for each of 120 candidates, totalling 600 fits
Final score is:  -2869.352576296183


In [14]:
gridsearch.best_params_

{'reduce_dim__n_components': 10, 'regressor__alpha': 32.0}
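Beyond `best_params_`, a fitted search also exposes the cross-validated scores of every candidate and a pipeline refitted with the best setting. The sketch below uses synthetic data and a deliberately small grid (both assumptions made for illustration) to show these attributes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
])
demo_params = {
    'reduce_dim__n_components': [2, 4, 6],
    'regressor__alpha': [0.1, 1.0, 10.0],
}
demo_search = GridSearchCV(demo_pipe, demo_params, cv=5).fit(X_demo, y_demo)

# Mean cross-validated score of each of the 3 x 3 candidates.
mean_scores = demo_search.cv_results_['mean_test_score']
print(demo_search.best_params_)
print(demo_search.best_score_)

# Pipeline refitted on the full training data with the best parameters.
tuned_pipe = demo_search.best_estimator_
```

Note that `best_score_` is a cross-validation score on the training data, whereas `gridsearch.score(X_test, y_test)` evaluates the refitted pipeline on held-out data.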

#### Pipeline Tuning (advanced version)

We can use the same approach to decide which algorithm to use at each step.

In [17]:
#  data normalization
scalers_to_test = [StandardScaler(), RobustScaler(), QuantileTransformer()]

In [18]:
# Add the normalization step itself as a parameter of the grid search
params = {'scaler': scalers_to_test,
          'reduce_dim__n_components': n_features_to_test,
          'regressor__alpha': alpha_to_test}

In theory, we could apply the same approach to the dimensionality reduction step, for example to choose between PCA and SelectKBest. The only problem is that the two expose different parameters: PCA is tuned through n_components, while SelectKBest is tuned through k.

Luckily, GridSearchCV also accepts a list of parameter dictionaries, each explored as a separate grid, which solves this issue as well:

In [19]:
params = [
    {'scaler': scalers_to_test,
     'reduce_dim': [PCA()],
     'reduce_dim__n_components': n_features_to_test,
     'regressor__alpha': alpha_to_test},

    {'scaler': scalers_to_test,
     'reduce_dim': [SelectKBest(f_regression)],
     'reduce_dim__k': n_features_to_test,
     'regressor__alpha': alpha_to_test}
]
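To see how many candidates such a list of dictionaries produces before launching the search, the grid can be expanded with `ParameterGrid`. This is a sketch that rebuilds the same lists under hypothetical `demo_` names so that it runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import QuantileTransformer, RobustScaler, StandardScaler

demo_scalers = [StandardScaler(), RobustScaler(), QuantileTransformer()]
demo_n_features = np.arange(1, 11)        # 10 values
demo_alphas = 2.0 ** np.arange(-6, 6)     # 12 values

demo_grid = ParameterGrid([
    {'scaler': demo_scalers,
     'reduce_dim': [PCA()],
     'reduce_dim__n_components': demo_n_features,
     'regressor__alpha': demo_alphas},
    {'scaler': demo_scalers,
     'reduce_dim': [SelectKBest(f_regression)],
     'reduce_dim__k': demo_n_features,
     'regressor__alpha': demo_alphas},
])

# Each dictionary expands independently: 3 scalers x 10 sizes x 12 alphas = 360.
n_candidates = len(demo_grid)
print(n_candidates)

# Keys never mix across dictionaries, so 'reduce_dim__k' only appears with SelectKBest.
n_select_k_best = sum('reduce_dim__k' in candidate for candidate in demo_grid)
```

With the default 5-fold cross-validation, each candidate is fitted five times, so the total cost grows quickly with every added option.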

In [21]:
# launch gridsearch
gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)


In [22]:
print('Final score is: ', gridsearch.score(X_test, y_test))

Final score is:  -11771.701767234736


In [23]:
gridsearch.best_params_

{'reduce_dim': SelectKBest(score_func=<function f_regression at 0x7fea1df79430>),
 'reduce_dim__k': 10,
 'regressor__alpha': 4.0,
 'scaler': StandardScaler()}

When the overall number of hyper-parameter combinations is very high, we might need to replace the optimization method, for example with a randomized search (RandomizedSearchCV):

https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization
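A randomized search over the same kind of pipeline could look like the sketch below. The synthetic dataset, the `n_iter=15` budget, and the log-uniform distribution for alpha are all assumptions chosen for illustration, not settings taken from the tutorial.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA()),
    ('regressor', Ridge()),
])

# Lists are sampled uniformly; objects with an rvs method are sampled as distributions.
param_distributions = {
    'scaler': [StandardScaler(), RobustScaler()],
    'reduce_dim__n_components': list(range(1, 9)),
    'regressor__alpha': loguniform(2.0 ** -6, 2.0 ** 5),
}

# Only n_iter candidates are evaluated, regardless of how large the space is.
demo_search = RandomizedSearchCV(
    demo_pipe, param_distributions, n_iter=15, random_state=0
).fit(X_demo, y_demo)

print(demo_search.best_params_)
```

The budget is fixed by `n_iter`, so the cost no longer grows with the number of values listed for each parameter.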