## Building a Machine Learning Pipeline with Scikit-Learn

### Basic Overview of Pipelines
Pipelines are common in machine learning systems because they speed up and simplify preprocessing. A pipeline chains multiple estimators into one, which automates the workflow. This is extremely useful because the data usually has to go through a fixed sequence of processing steps. Pipelines also make it easy to put together baseline models and compare them on one or more metrics, although accessing individual parts of a pipeline can be tricky. The skeleton of a pipeline for a single model is fairly simple.

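### Our Example Data
The data used in this walkthrough is the wine quality dataset, loaded here from a local CSV file (wine.csv). Note that this is not the dataset returned by scikit-learn's load_wine, which is a separate wine classification dataset with different columns.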

In [83]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [84]:
# importing dataset
df = pd.read_csv('wine.csv')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [85]:
# Here, first, we split the data into a training and a test set.
from sklearn.model_selection import train_test_split

X = df.drop("quality", axis=1)
y = df["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [86]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# fill missing values with medians
imputer = SimpleImputer(strategy='median')
X_train_tr = imputer.fit_transform(X_train)
# scale the data
scaler = StandardScaler()
X_train_tr = scaler.fit_transform(X_train_tr)

# do the same for the test data, but call only transform, not fit:
# the imputer and scaler must not learn anything from the test data.
X_test_tr = imputer.transform(X_test)
X_test_tr = scaler.transform(X_test_tr)

After the split, the data has to be prepared before building the model: filling in missing values, scaling the features, one-hot encoding categorical features, and so on. The cell above handles the imputation and scaling; this dataset has no categorical features to encode.

In [87]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# instantiate the k-nearest neighbors regressor
knn = KNeighborsRegressor()

# train the knn model on training data
knn.fit(X_train_tr, y_train)

# make predictions on test data
y_pred = knn.predict(X_test_tr)

# measure the performance of the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(rmse)

0.6729908369856655


As you can see, there are many steps that need to be executed in the right order to train the model, and if you get the order wrong (for example, by fitting the scaler on the test data), the results can no longer be trusted. And this is just a simple example of an ML workflow. As you start working with more complicated models, the chances of making errors are much higher. This is where the pipeline comes in.

### What is a Pipeline?
A Pipeline is simply a method of chaining multiple steps together in which the output of the previous step is used as the input for the next step.

Let’s see how we can build the same model using a pipeline, assuming the data has already been split into a training and a test set.

### Option-1

In [88]:
# list all the steps here for building the model
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
)
# apply all the transformations to the training set and train a KNN model
pipe.fit(X_train, y_train)
# apply all the transformations to the test set and make predictions
y_pred = pipe.predict(X_test)
# measure the performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(rmse)

0.6729908369856655


That’s it. Every step of the model from start to finish is defined in a single object, and Scikit-Learn does everything for you. When we call the fit method, it fits each transformer on the training set, transforms the data, and trains the model on the result. When we call the predict method, it transforms the test set with the already fitted transformers and makes the predictions.
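
To make the order of operations concrete, here is a minimal sketch that compares a pipeline with the same steps chained by hand. It assumes synthetic data from make_regression with a few values blanked out (not the wine dataset), and the demo_ and manual_ names are chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Blank out roughly 5% of the values so the imputer has something to do
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Pipeline version: one fit call, one predict call
demo_pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
)
demo_pipe.fit(Xtr, ytr)
pipe_pred = demo_pipe.predict(Xte)

# Manual version: fit_transform on the training set, transform only on the test set
manual_imputer = SimpleImputer(strategy="median")
manual_scaler = StandardScaler()
manual_knn = KNeighborsRegressor()
Xtr_t = manual_scaler.fit_transform(manual_imputer.fit_transform(Xtr))
manual_knn.fit(Xtr_t, ytr)
manual_pred = manual_knn.predict(manual_scaler.transform(manual_imputer.transform(Xte)))

# Both routes should produce the same predictions
print(np.allclose(pipe_pred, manual_pred))
```

The pipeline does nothing the manual code cannot do; it just removes the opportunity to call the steps in the wrong order.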

Isn’t this simple and nice? A pipeline helps you hide complexity, just like functions do. It also helps you avoid leaking information from held-out data into the trained model during cross-validation, because the transformers are refit on the training part of each fold. It is easier to use and debug. If you don’t like something, you can replace that step with something else without making many changes to your code. It is also nicer for others to read and understand your code.
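
The point about leakage can be sketched with cross_val_score. The example below assumes synthetic data from make_regression (not the wine dataset). Because the whole pipeline is passed as the estimator, the scaler is refit on the training portion of every fold and never sees the validation portion.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Each of the 5 folds clones cv_pipe, fits it on 4/5 of the data,
# and scores it on the remaining 1/5
scores = cross_val_score(
    cv_pipe, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
cv_rmse = np.sqrt(-scores)
print(cv_rmse.mean())
```

Scaling the full dataset once before calling cross_val_score would instead compute the mean and variance from rows that later serve as validation data.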

Now, let’s see pipelines in more detail.

### How to use a Pipeline in Scikit-Learn?
The Pipeline in scikit-learn is built using a list of (key, value) pairs where the key is a string containing the name you want to give to a particular step and value is an estimator object for that step.

### Option-2

In [89]:
# list all the steps here for building the model
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('estimator', KNeighborsRegressor()),
])

# apply all the transformations to the training set and train a KNN model
pipe.fit(X_train, y_train)
# apply all the transformations to the test set and make predictions
y_pred = pipe.predict(X_test)
# measure the performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(rmse)

0.6729908369856655


There is also a shorthand syntax (make_pipeline) for making a pipeline that we saw earlier. It only takes the estimators and fills in the names automatically with the lowercase class names.

In [90]:
from sklearn.pipeline import make_pipeline
pipe_short = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor())
pipe_short
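
As a small sketch of the automatic naming, the snippet below lists the generated step names. The variable names are chosen for this illustration only. It also shows what happens when the same class appears twice, in which case scikit-learn appends a numeric suffix to keep the names unique.

```python
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

named_demo = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
)
# Names are the lowercased class names
step_names = [name for name, _ in named_demo.steps]
print(step_names)

# Two steps of the same class get "-1", "-2" suffixes
dup_demo = make_pipeline(StandardScaler(), StandardScaler())
dup_names = [name for name, _ in dup_demo.steps]
print(dup_names)
```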

### Rules for Creating a Pipeline
There are a few rules that you need to follow when creating a Pipeline in scikit-learn.

1. All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method).
2. The last estimator may be any type (transformer, classifier, etc.).
3. Names for the steps can be anything you like as long as they are unique and don’t contain double underscores, which are reserved for addressing step parameters during hyperparameter tuning.
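
The rules above can be checked with a short sketch. The helper try_fit is hypothetical (it is not part of scikit-learn), and the data is synthetic. Depending on the scikit-learn version, the validation may run when the Pipeline is constructed or when fit is called, so the helper wraps both.

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_regression(n_samples=50, n_features=3, random_state=0)


def try_fit(steps):
    """Hypothetical helper: build and fit a pipeline, returning the name of
    the exception type if the steps are rejected, or None if they are accepted."""
    try:
        Pipeline(steps).fit(X_demo, y_demo)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# Rule 1: an intermediate step without a transform method is rejected
rule1 = try_fit([("knn", KNeighborsRegressor()), ("scaler", StandardScaler())])

# Rule 3: duplicate names and names containing "__" are rejected
rule3_dup = try_fit([("step", StandardScaler()), ("step", KNeighborsRegressor())])
rule3_underscore = try_fit([("my__scaler", StandardScaler()), ("knn", KNeighborsRegressor())])

# A valid pipeline: transformer first, any estimator last, unique names
valid = try_fit([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])

print(rule1, rule3_dup, rule3_underscore, valid)
```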

### Accessing Steps of a Pipeline
The steps of a pipeline are stored as a list of (name, estimator) tuples in the steps attribute. Indexing the pipeline itself, by position or by name, returns just the estimator.

In [91]:
print(pipe.steps[0])
print(pipe.steps[1])
print(pipe[2])
print(pipe["imputer"])

('imputer', SimpleImputer(strategy='median'))
('scaler', StandardScaler())
KNeighborsRegressor()
SimpleImputer(strategy='median')


Pipeline’s named_steps attribute allows accessing steps by name with tab completion in interactive environments.

In [92]:
print(pipe.named_steps.imputer)

SimpleImputer(strategy='median')
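
Step access is most useful after fitting, when each step carries what it learned. Here is a minimal sketch, assuming synthetic data without missing values and a pipeline named fitted_demo created only for this illustration. It reads the medians stored by the imputer (statistics_) and the column means stored by the scaler (mean_).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)

fitted_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("estimator", KNeighborsRegressor()),
]).fit(X_demo, y_demo)

# Learned attributes end with an underscore and exist only after fit
medians = fitted_demo.named_steps["imputer"].statistics_
means = fitted_demo["scaler"].mean_
print(medians)
print(means)
```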


You can also use the slice notation to access them.

In [93]:
print(pipe[1:])

Pipeline(steps=[('scaler', StandardScaler()), ('estimator', KNeighborsRegressor())])


### Grid Search using a Pipeline
You can also do a grid search for hyperparameter optimization with a pipeline. The parameters of the estimators in the pipeline are addressed using the <estimator>__<parameter> syntax, where <estimator> is the name of the step.

In [94]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# create a pipeline
pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
)
# list of parameter values to try
param_grid = {
    "kneighborsregressor__n_neighbors": [3, 5, 8, 12, 15],
    "kneighborsregressor__weights": ["uniform", "distance"],
}
grid = GridSearchCV(pipe, param_grid=param_grid, scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)

Here, we wanted to tune the number-of-neighbors parameter of the KNN model, so we put a double underscore between the step name and the parameter name: kneighborsregressor__n_neighbors.
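
If you are unsure which names are valid, get_params lists every addressable parameter, and set_params accepts the same names. This is a small sketch with a pipeline called params_demo, created only for this illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

params_demo = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), KNeighborsRegressor()
)

# All parameters of the KNN step, in <step name>__<parameter> form
knn_params = sorted(
    p for p in params_demo.get_params() if p.startswith("kneighborsregressor__")
)
print(knn_params)

# The same names work with set_params, which is what GridSearchCV relies on
params_demo.set_params(kneighborsregressor__n_neighbors=8)
print(params_demo["kneighborsregressor"].n_neighbors)
```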

In [95]:
# best score after grid search
print(np.sqrt(-grid.best_score_))

0.6187124991308474


In [96]:
print(grid.best_estimator_)

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('kneighborsregressor',
                 KNeighborsRegressor(n_neighbors=15, weights='distance'))])


In [97]:
# the estimators can be accessed like this
print(grid.best_estimator_.named_steps.kneighborsregressor)
print(grid.best_estimator_['kneighborsregressor'])

KNeighborsRegressor(n_neighbors=15, weights='distance')
KNeighborsRegressor(n_neighbors=15, weights='distance')


In [98]:
# and to access the nested parameters of the estimators
print(grid.best_estimator_.named_steps.kneighborsregressor.n_neighbors)
print(grid.best_estimator_["kneighborsregressor"].n_neighbors)

15
15
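
The cross-validated score is computed on the training data only. By default GridSearchCV refits the best pipeline on the full training set, so the search object can be used directly to predict on held-out data. The sketch below assumes synthetic data and a smaller grid than the one above, and the name search is chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": [3, 5, 8]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(Xtr, ytr)

# The winning parameter combination, keyed by <step name>__<parameter>
print(search.best_params_)

# predict delegates to the refitted best pipeline
test_rmse = np.sqrt(mean_squared_error(yte, search.predict(Xte)))
print(test_rmse)
```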


#### We can go one step further.
So far, we have only worked with a single algorithm (K-Nearest Neighbors), but other algorithms might perform better. So now let’s try different algorithms and see which performs best. We will also try different options for preparing the data, all in a single grid search.

In [99]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

# pipeline for the model
pipe = Pipeline(
    [
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("regressor", RandomForestRegressor()),
    ]
)

# model tuning with GridSearchCV
param_grid = {
    "imputer__strategy": ["mean", "median", "most_frequent", "constant"],
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],
    "regressor": [
        KNeighborsRegressor(),
        LinearRegression(),
        RandomForestRegressor(random_state=42),
        DecisionTreeRegressor(random_state=42)
    ],
}
grid = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

grid.fit(X_train, y_train)

In [100]:
print(np.sqrt(-grid.best_score_))

0.6076141317775702


In [101]:
print(grid.best_estimator_)

Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', 'passthrough'),
                ('regressor', RandomForestRegressor(random_state=42))])


In [102]:
print(grid.best_estimator_.named_steps.imputer.strategy)

mean


Here, we tried 4 different algorithms with their default hyperparameters, and we also tested which scaler and imputation strategy works best with them. The best combination for this task is the RandomForestRegressor with no scaling (the scaler step was set to 'passthrough') and with the mean used to fill in the missing values. Another model that performed well is LinearRegression.
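
To compare all candidates rather than only the winner, and to tune model-specific hyperparameters at the same time, param_grid can be a list of dictionaries, and the results can be ranked through cv_results_. The sketch below assumes synthetic data and a deliberately small grid, and the names compare_pipe and compare_search are chosen for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

compare_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", KNeighborsRegressor()),
])

# One dictionary per model, so each model gets only its own hyperparameters
param_grid = [
    {
        "scaler": [StandardScaler(), MinMaxScaler()],
        "regressor": [KNeighborsRegressor()],
        "regressor__n_neighbors": [3, 8],
    },
    {
        "scaler": ["passthrough"],
        "regressor": [DecisionTreeRegressor(random_state=42)],
        "regressor__max_depth": [3, None],
    },
]

compare_search = GridSearchCV(
    compare_pipe, param_grid=param_grid, cv=3, scoring="neg_mean_squared_error"
)
compare_search.fit(X_demo, y_demo)

# One row per candidate, best first
results = pd.DataFrame(compare_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["param_regressor", "param_scaler", "mean_test_score"]
]
print(ranked)
```

Ranking the full table makes it possible to see how close the runner-up models are, which best_estimator_ alone does not show.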