# Python Explainer - How do pipelines work

Pieter Overdevest  
2022-11-28

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

#### Aim

To explain how pipelines work with KNN and the Iris data set.

#### Initialization

We start by importing (parts of) packages.

In [1]:
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.neighbors import KNeighborsClassifier

from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.preprocessing import StandardScaler

from sklearn import metrics, datasets

import pandas as pd

#### Get Iris data

The Iris dataset is a very simple dataset available through the
`datasets` module of scikit-learn. It contains 150 observations and five
variables. One variable, the target, describes which variety the
observation belongs to; the data contains 50 observations per variety.
The other four variables are the features, describing the geometry of
the flower (sepal and petal length and width).

In [2]:
# Load Iris data. The 'iris' object is a collection of data objects, that we each assign to other objects.
iris   = datasets.load_iris()

# Assign the predictor data (X), the target (y), the feature names and the target variable categories to individual objects.
ar_X_iris            = iris.data # 2D array
l_df_X_iris_names    = iris.feature_names

ar_y_iris             = iris.target
ar_y_iris_categories  = iris.target_names


# Show dimensions and names of the data.
print("Predictor data")
print(f"Shape:         {ar_X_iris.shape}")
print(f"Feature names: {l_df_X_iris_names}")
print("")
print("Response variable")
print(f"Shape:         {ar_y_iris.shape}")
print(f"Values:        {ar_y_iris_categories}")

Predictor data
Shape:         (150, 4)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Response variable
Shape:         (150,)
Values:        ['setosa' 'versicolor' 'virginica']

We use the predictor data (`ar_X_iris`) and the feature names to
construct a data frame.

In [3]:
df_X_iris = pd.DataFrame(ar_X_iris, columns = l_df_X_iris_names)

df_X_iris.head(5)

#### Split the data

We split the data into a train and a test set using the
`train_test_split()` function.

In [4]:
df_X_iris_train, df_X_iris_test, ar_y_iris_train, ar_y_iris_test = train_test_split(df_X_iris, ar_y_iris, test_size=0.33, random_state=42)

In [5]:
pd.DataFrame({
    'full':  pd.Series(ar_y_iris).value_counts(),
    'train': pd.Series(ar_y_iris_train).value_counts(),
    'test':  pd.Series(ar_y_iris_test).value_counts()
})

#### Build the pipeline

A pipeline is a sequence of transformers, optionally ending with a
single estimator. Examples of transformers include encoders, imputers
and scalers. The example below contains both transformers and a final
estimator. The sequence of steps is as follows:

1.  Scale the data.

2.  Apply PCA dimension reduction to 2 principal components.

3.  Model with k-Nearest Neighbour Classifier
    ([ref](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)).
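
To make the three steps above concrete, here is a minimal sketch that carries them out by hand and compares the result with a pipeline. It reloads Iris so that it runs on its own; the names `X_train`, `manual_pred` and `pipe_pred` are chosen for this illustration only and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# By hand: every transformer is fitted on the train data only,
# and its output is passed on to the next step.
scaler = StandardScaler()
pca = PCA(n_components=2)
knn = KNeighborsClassifier(n_neighbors=1)

X_train_scaled = scaler.fit_transform(X_train)
X_train_pca = pca.fit_transform(X_train_scaled)
knn.fit(X_train_pca, y_train)

# The test data is only transformed (never fitted) before predicting.
manual_pred = knn.predict(pca.transform(scaler.transform(X_test)))

# The same sequence wrapped in a pipeline.
pipe = Pipeline([
    ('std', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

same_predictions = bool(np.array_equal(manual_pred, pipe_pred))
print(f"Same predictions: {same_predictions}")
```

The pipeline saves the bookkeeping: a single `fit()` and a single `predict()` replace the chain of `fit_transform()` and `transform()` calls, and the test data cannot accidentally be used for fitting a transformer.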

There are two ways to define pipelines
([ref](https://stackoverflow.com/questions/40708077/what-is-the-difference-between-pipeline-and-make-pipeline-in-scikit#)),
using scikit-learn's:

1.  `Pipeline()` class
    ([ref](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)).
    The steps in the pipeline need to be named explicitly, so it is
    clear how to refer to them later on. These names don’t change
    if the estimator/transformer is replaced.

2.  `make_pipeline()` function
    ([ref](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)).
    This makes the pipeline definition shorter and arguably more
    readable. However, the step names are auto-generated using a
    straightforward rule (the lowercase class name of each estimator).
    So, if you want to refer to steps, you need to derive the names
    yourself, and you will have to review them whenever an estimator is
    changed.

So, use `make_pipeline()` for quick solutions and `Pipeline()` in case
of more elaborate pipelines.
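
As a small illustration of the naming difference, the sketch below builds the same three steps both ways and lists the resulting step names. The object names `pl_named` and `pl_auto` are chosen for this illustration only.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Step names are given explicitly.
pl_named = Pipeline([
    ('std', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])

# Step names are derived from the lowercase class names.
pl_auto = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=1),
)

# The 'steps' attribute holds (name, estimator) tuples.
named_step_names = [name for name, _ in pl_named.steps]
auto_step_names = [name for name, _ in pl_auto.steps]

print(named_step_names)  # ['std', 'pca', 'knn']
print(auto_step_names)   # ['standardscaler', 'pca', 'kneighborsclassifier']

# Individual steps can be retrieved by name.
knn_step = pl_auto.named_steps['kneighborsclassifier']
print(knn_step.n_neighbors)  # 1
```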

The `ColumnTransformer()` function
([ref](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer))
applies transformers to columns of an array or pandas DataFrame. It
enables processing of different columns in different ways. I mention it,
so you can look it up and use it as needed. It is demonstrated in ‘Ames
step-by-step’, not in this intermezzo.

The cell below shows the implementation of the example using either (A)
`Pipeline()` or (B) `make_pipeline()`.

In [6]:
# (A) Using Pipeline().
pl_Pipeline = Pipeline(
    
    [        
        ('std', StandardScaler()),
        ('pca', PCA(n_components = 2)),
        ('knn', KNeighborsClassifier(n_neighbors = 1))
    ],
    
    verbose = True
)

# (B) Using make_pipeline(). 
pl_make_pipeline = make_pipeline(
        
        StandardScaler(),
        PCA(n_components = 2),
        KNeighborsClassifier(n_neighbors = 1),

        verbose = True
)

Executing both pipelines as-is.

In [7]:
# (A) Using Pipeline() as-is.

print("Pipeline() as-is:")

# Fit the data using Pipeline().
pl_Pipeline.fit(df_X_iris_train, ar_y_iris_train)

# Score the data.
print(f"Accuracy: {metrics.accuracy_score(ar_y_iris_test, pl_Pipeline.predict(df_X_iris_test))}")


# (B) Using make_pipeline() as-is.

print("\nmake_pipeline() as-is:")

# Fit the data using make_pipeline().
pl_make_pipeline.fit(df_X_iris_train, ar_y_iris_train)

# Score the data.
print(f"Accuracy: {metrics.accuracy_score(ar_y_iris_test, pl_make_pipeline.predict(df_X_iris_test))}")

Pipeline() as-is:
[Pipeline] ............... (step 1 of 3) Processing std, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing knn, total=   0.0s
Accuracy: 0.9

make_pipeline() as-is:
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline]  (step 3 of 3) Processing kneighborsclassifier, total=   0.0s
Accuracy: 0.9

Passing the pipelines to `GridSearchCV()` shows how the two can be
combined. This is an example where the step names are required, and
simpler (self-chosen) names make your code easier to maintain. Each key
in `param_grid` is constructed from the step name and the name of the
hyperparameter, separated by a double underscore (`__`).
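
If you are unsure which keys are valid, `get_params()` on the pipeline lists them. The sketch below is one way to look them up; `pl_named` and `pl_auto` are illustrative names, and the filter on `__n_neighbors` is only there to keep the printed list short.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

pl_named = Pipeline([
    ('std', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])

pl_auto = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=1),
)

# get_params() returns a dict in which the hyperparameters of the steps
# appear as '<step name>__<hyperparameter name>'.
named_keys = sorted(k for k in pl_named.get_params() if k.endswith('__n_neighbors'))
auto_keys = sorted(k for k in pl_auto.get_params() if k.endswith('__n_neighbors'))

print(named_keys)  # ['knn__n_neighbors']
print(auto_keys)   # ['kneighborsclassifier__n_neighbors']

# Hyperparameters of other steps follow the same pattern.
print('pca__n_components' in pl_named.get_params())  # True
```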

In [14]:
# (A) - Using Pipeline() with GridSearchCV().

print("\nPipeline() with GridsearchCV():\n")

# Define grid.
l_param_grid = [{'knn__n_neighbors': [1, 2, 5, 10, 20, 50]}]

# Define grid instance (machine with original settings).
gs = GridSearchCV(pl_Pipeline, l_param_grid)

# Fit the train data using the grid of hyperparameters.
gs.fit(df_X_iris_train, ar_y_iris_train)

# Score the data.
print(f"Accuracy: {metrics.accuracy_score(ar_y_iris_test, gs.predict(df_X_iris_test))}")

# Which 'n_neighbors' gave the best results?
print(gs.best_params_)


Pipeline() with GridsearchCV():

[Pipeline] ............... (step 1 of 3) Processing std, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing knn, total=   0.0s
[Pipeline] ............... (step 1 of 3) Processing std, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing knn, total=   0.0s
[Pipeline] ............... (step 1 of 3) Processing std, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing knn, total=   0.0s
[Pipeline] ............... (step 1 of 3) Processing std, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing knn, total=   0.0s
[Pipeline] ............... (step 1 of 3) Processing std, total=   0.0s
[Pipeline] ............... (step 2 of 3) Pr

In [15]:
# (B) - Using make_pipeline() with GridSearchCV().

print("\nmake_pipeline() with GridsearchCV():\n")

# Define grid.
l_param_grid = [{'kneighborsclassifier__n_neighbors': [1, 2, 5, 10, 20, 50]}]

# Define grid instance (machine with original settings).
gs = GridSearchCV(pl_make_pipeline, l_param_grid)

# Fit the train data using the grid of hyperparameters.
gs.fit(df_X_iris_train, ar_y_iris_train)

# Score the data.
print(f"Accuracy: {metrics.accuracy_score(ar_y_iris_test, gs.predict(df_X_iris_test))}")

# Which 'n_neighbors' gave the best results?
print(gs.best_params_)


make_pipeline() with GridsearchCV():

[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline]  (step 3 of 3) Processing kneighborsclassifier, total=   0.0s
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline]  (step 3 of 3) Processing kneighborsclassifier, total=   0.0s
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline]  (step 3 of 3) Processing kneighborsclassifier, total=   0.0s
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   0.0s
[Pipeline]  (step 3 of 3) Processing kneighborsclassifier, total=   0.0s
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ............... (st
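
Besides `best_params_`, the fitted grid search also keeps the cross-validation score of every candidate in `cv_results_`. The sketch below is one possible way to tabulate them. It assumes the named pipeline from variant (A), leaves out `verbose` to keep the output short, and reloads Iris so that it runs on its own; the names `pipe`, `gs_example` and `df_results` are chosen for this illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

pipe = Pipeline([
    ('std', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])

# Same grid as in the cells above.
gs_example = GridSearchCV(pipe, [{'knn__n_neighbors': [1, 2, 5, 10, 20, 50]}])
gs_example.fit(X_train, y_train)

# One row per candidate, with its mean cross-validation score and rank.
df_results = pd.DataFrame(gs_example.cv_results_)[
    ['param_knn__n_neighbors', 'mean_test_score', 'rank_test_score']
]

print(df_results.sort_values('rank_test_score'))
print(gs_example.best_params_)
print(f"Best mean CV accuracy: {gs_example.best_score_:.3f}")
```

Note that `best_score_` is the mean cross-validation accuracy on the train data, which is not the same number as the accuracy on the test set printed in the cells above.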