# Chapter 6. Pipelines.

Notes:

1) The scaler should be fitted only on the training folds, so the data must be split into folds BEFORE scaling. Putting the scaler inside a pipeline and cross-validating the whole pipeline guarantees this order.

2) A pipeline can have any number of stages. The only requirement is that every stage except the last one has a 'transform' method, so that its transformed output can be passed as input to the next stage.
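Both notes can be sketched in a few lines. The example below is only an illustration: the extra PCA stage and `n_components=5` are assumptions that do not come from these notes, and `demo_pipe` is a made-up name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Three stages: two transformers followed by a final estimator.
# The PCA stage and n_components=5 are illustrative assumptions.
demo_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('pca', PCA(n_components=5)),
    ('svm', SVC()),
])

# Note 2: every stage except the last one provides fit and transform.
for name, step in demo_pipe.steps[:-1]:
    print(name, hasattr(step, 'fit'), hasattr(step, 'transform'))

# Note 1: cross_val_score clones the pipeline and fits it on the training
# part of each split, so the scaler and PCA never see the held-out fold.
scores = cross_val_score(demo_pipe, X, y, cv=5)
print('Mean cross-validation accuracy: {:.3f}'.format(scores.mean()))
```

Because the whole pipeline is passed to `cross_val_score`, the scaling is redone inside each split instead of once on all the data.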

#### Setting up:

In [7]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# ----- prepare the dataset
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
# fit the scaler on the training data only, then scale the training data
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# ----- fit the model on the scaled training data
svm = SVC().fit(X_train_scaled, y_train)
# scale the test data with the scaler fitted on the training data
X_test_scaled = scaler.transform(X_test)

print('Test accuracy: {}'.format(svm.score(X_test_scaled, y_test)))

Test accuracy: 0.972027972027972


## - Building Pipelines.

Each stage is given as a tuple of a name of your choice and an estimator instance.

In [8]:
from sklearn.pipeline import Pipeline

# create the pipeline and fit it on the unscaled training data:
pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])
pipe.fit(X_train, y_train)

#accuracy:
print('Test accuracy: {}'.format(pipe.score(X_test, y_test)))

Test accuracy: 0.972027972027972
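
The stage names chosen in the tuples can later be used to look up individual stages. As a minimal sketch (the variable names `named_pipe` and `auto_pipe` are made up for illustration), the example below also shows `make_pipeline`, a scikit-learn helper that is not used elsewhere in these notes and that derives the stage names from the lowercased class names:

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Explicit names: each stage is a (name, estimator) tuple.
named_pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])

# A stage can be looked up by the name it was given.
print(type(named_pipe.named_steps['svm']).__name__)

# make_pipeline builds the same kind of object but names the stages itself.
auto_pipe = make_pipeline(MinMaxScaler(), SVC())
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)
```

Explicit names are usually easier to read in parameter grids, while `make_pipeline` saves typing in quick experiments.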


## - Pipelines with GridSearchCV

Using GridSearchCV with a pipeline works as usual (pass the pipeline to GridSearchCV as the estimator), but each key in the parameter grid must be written as the stage name, two underscores, and the parameter name, for example 'svm__C':

In [12]:
from sklearn.model_selection import GridSearchCV

param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best cross-validation accuracy: {}'.format(grid.best_score_))
print('Test accuracy: {}'.format(grid.score(X_test, y_test)))
print('Best params: {}'.format(grid.best_params_))

Best cross-validation accuracy: 0.9812311901504789
Test accuracy: 0.972027972027972
Best params: {'svm__C': 1, 'svm__gamma': 1}
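
To see where the `svm__` prefix comes from and what the search returns, the following sketch may help. It rebuilds the same pipeline so that it runs on its own, and it uses a deliberately small grid (`small_grid`, an assumption made here to keep the run short) rather than the grid above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])

# get_params lists every valid grid key; those of the 'svm' stage
# all start with the stage name followed by two underscores.
svm_params = [p for p in pipe.get_params() if p.startswith('svm__')]
print(svm_params[:3])

# A reduced grid, chosen here only to keep the example fast.
small_grid = {'svm__C': [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid=small_grid, cv=5).fit(X_train, y_train)

# best_estimator_ is the whole pipeline refitted on all training data,
# so its stages can be inspected by name.
best_svm = grid.best_estimator_.named_steps['svm']
print(best_svm.C == grid.best_params_['svm__C'])
```

Since `best_estimator_` is a fitted pipeline, calling `grid.score(X_test, y_test)` scales the test data with the scaler fitted on the training data before the SVM makes its predictions.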
