# Chapter 6. Pipelines.
# Part 1. Basics.

Notes:

1) The scaler must be fit on the training folds only, so the data has to be split into folds BEFORE it is scaled. Putting the scaler inside a pipeline and cross-validating the whole pipeline guarantees this.

2) A pipeline can have any number of stages. The only requirement is that every stage except the last one has a 'transform' method, so that its transformed output can be passed as input to the next stage.
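
Note 1 can be illustrated with a minimal sketch. It assumes the same breast cancer dataset and the MinMaxScaler/SVC stages used in the rest of this chapter; 'cross_val_score' and 'cv=5' are choices made for this illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The scaler is a stage of the pipeline, so for every split it is re-fit
# on the training folds only and then applied to the held-out fold.
pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])
scores = cross_val_score(pipe, X, y, cv=5)

print('Fold accuracies: {}'.format(scores))
print('Mean accuracy: {}'.format(scores.mean()))
```

Scaling the whole dataset first and cross-validating only the SVC would let the minimum and maximum of the held-out fold leak into the scaler, which is the situation the note warns about.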

Setting up:

In [1]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#-----dataset preparing
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
#scaling
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

#-----model initialization and building
svm = SVC().fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)

print('Test accuracy: {}'.format(svm.score(X_test_scaled, y_test)))

Test accuracy: 0.972027972027972


## - Building Pipelines.

Each stage is a tuple made of a name of your choice and the estimator to use.

In [2]:
from sklearn.pipeline import Pipeline

#initialization and building pipeline model:
pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])
pipe.fit(X_train, y_train)

#accuracy:
print('Test accuracy: {}'.format(pipe.score(X_test, y_test)))

Test accuracy: 0.972027972027972
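
To make explicit what the pipeline does behind the scenes, here is a rough hand-written equivalent of 'pipe.fit' and 'pipe.score' for this two-stage case. It is a sketch of the behaviour, not the library's actual implementation; the variable names 'manual_score' and 'pipe_score' are introduced only for this comparison.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Pipeline version.
pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

# Manual version: fit and transform with every stage except the last,
# then fit the last stage on the transformed training data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
svm = SVC().fit(X_train_scaled, y_train)

# Scoring reuses the already fitted scaler (transform only, no re-fit).
manual_score = svm.score(scaler.transform(X_test), y_test)

print('Pipeline: {}'.format(pipe_score))
print('Manual:   {}'.format(manual_score))
```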


There is also a shorthand, 'make_pipeline'. It builds the same pipeline, but the stage names are assigned automatically from the lowercased class names:

In [3]:
from sklearn.pipeline import make_pipeline

#standard syntax:
pipe_long = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC(C=100))])
#'make_pipeline' syntax:
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

#automatically assigned stage names inside the (name, estimator) tuples:
print('Pipelines stages:\n{}'.format(pipe_short.steps))

Pipelines stages:
[('minmaxscaler', MinMaxScaler()), ('svc', SVC(C=100))]


## - Building Pipelines (GridSearchCV)

GridSearchCV works with a pipeline the same way as with a single model (pass the pipeline as the estimator), but every key of the parameter grid must follow the pattern '<stage name>__<parameter name>', with a double underscore:

In [4]:
from sklearn.model_selection import GridSearchCV

param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best cross-validation accuracy: {}'.format(grid.best_score_))
print('Test accuracy: {}'.format(grid.score(X_test, y_test)))
print('Best params: {}'.format(grid.best_params_))

Best cross-validation accuracy: 0.9812311901504789
Test accuracy: 0.972027972027972
Best params: {'svm__C': 1, 'svm__gamma': 1}
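
If it is unclear which keys a grid may contain, the pipeline can list them. The sketch below uses 'get_params' and 'set_params', which every scikit-learn estimator provides; the values passed to 'set_params' are arbitrary and only serve as an illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])

# Every parameter of a stage is exposed as '<stage name>__<parameter name>'.
svm_params = sorted(name for name in pipe.get_params()
                    if name.startswith('svm__'))
print(svm_params)

# The same names are accepted by set_params, which is the mechanism
# a grid search relies on to try each combination.
pipe.set_params(svm__C=10, svm__gamma=0.1)
print(pipe.named_steps['svm'])
```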


## - Accessing stage attributes
To inspect the attributes of a fitted stage, use 'named_steps', a dictionary that maps each stage name to its estimator:

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())

pipe.fit(cancer.data)
components = pipe.named_steps['pca'].components_

print('Steps: {}\n'.format(pipe.named_steps))
print('PCA components shape: {}'.format(components.shape))

Steps: {'standardscaler-1': StandardScaler(), 'pca': PCA(n_components=2), 'standardscaler-2': StandardScaler()}

PCA components shape: (2, 30)
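
Besides 'named_steps', a pipeline can also be indexed directly. The sketch below assumes a reasonably recent scikit-learn release, where 'Pipeline' supports lookup by name, by position, and by slice; the variable names are made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
pipe.fit(cancer.data)

# Lookup by stage name and by position returns the same fitted object.
by_name = pipe['pca']
by_position = pipe[1]
print(by_name is by_position)

# A slice returns a sub-pipeline that shares the fitted stages,
# here the first scaler followed by PCA.
first_two = pipe[:2]
reduced = first_two.transform(cancer.data)
print('Reduced data shape: {}'.format(reduced.shape))
```

Slicing is convenient when only the preprocessing part of a fitted pipeline is needed, for example to plot the data after PCA.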


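## - Accessing stage attributes (GridSearchCV)

After a grid search, the pipeline refitted with the best parameters is available as 'best_estimator_', and its stages are reached through 'named_steps' as before: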

In [17]:
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)

param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

pipe = make_pipeline(StandardScaler(), LogisticRegression())
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

print('Best model:\n{}\n'.format(grid.best_estimator_))
print('Logreg stage:\n{}\n'.format(grid.best_estimator_.named_steps['logisticregression']))
print('Logreg coefs:\n{}\n'.format(grid.best_estimator_.named_steps['logisticregression'].coef_))

Best model:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(C=1))])

Logreg stage:
LogisticRegression(C=1)

Logreg coefs:
[[-0.43570655 -0.34266946 -0.40809443 -0.5344574  -0.14971847  0.61034122
  -0.72634347 -0.78538827  0.03886087  0.27497198 -1.29780109  0.04926005
  -0.67336941 -0.93447426 -0.13939555  0.45032641 -0.13009864 -0.10144273
   0.43432027  0.71596578 -1.09068862 -1.09463976 -0.85183755 -1.06406198
  -0.74316099  0.07252425 -0.82323903 -0.65321239 -0.64379499 -0.42026013]]



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
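
The raw coefficient array from cell In [17] is hard to read on its own. One possible way to interpret it is to pair each coefficient with its feature name, as in the sketch below. Two things here are assumptions that are not part of the original cell: 'max_iter=1000', which follows the suggestion in the warning text to raise the iteration limit, and the choice to show the five largest coefficients by absolute value.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)

# max_iter=1000 is an assumption made for this sketch (default is lower).
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

# Reach into the best pipeline and pair coefficients with feature names.
logreg = grid.best_estimator_.named_steps['logisticregression']
coefs = dict(zip(cancer.feature_names, logreg.coef_[0]))

# Five features with the largest absolute coefficient.
top = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:5]
for name, value in top:
    print('{:<25} {: .3f}'.format(name, value))
```

Because the features were standardized by the first stage, the coefficient magnitudes are on a comparable scale, which is what makes this ranking meaningful.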