# Pipelines and Model Persistence - W07D1
### Instructor: Eric Elmoznino
(Adapted from content by Arunabh Singh)

## Overview - Pipelines
- Motivation and example
- Feature unions
- Visualizing pipelines
- Hyperparameter tuning with pipelines
- Custom class in a pipeline
- Activity (time permitting)

---
## Motivation and example
Consider the following example of a diabetes vs. non-diabetes classification task in Sklearn.

In [2]:
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)

df.head()

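|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |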


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = df.drop(columns='class')
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27, stratify=y)

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

pca = PCA(n_components=3)
pca.fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)

model = LogisticRegression()
model.fit(X_train_pca, y_train)

X_test_scaled = scaler.transform(X_test)
X_test_pca = pca.transform(X_test_scaled)

y_pred = model.predict(X_test_pca)
acc = accuracy_score(y_test, y_pred)
print(f'Test set accuracy: {acc}')

Test set accuracy: 0.6948051948051948


There are several inconvenient things about this:
1. We have a lot of ugly code. We keep calling `.fit()` and `.transform()` on different objects, and
we keep having to rename transformed variables to avoid confusion later in our notebook.
2. Our modeling code is distributed and therefore error-prone. If we try running our model 
somewhere else and forget to copy over a step (e.g. we don't apply StandardScaler to the test set), 
then our model will not work as expected.
3. We cannot use convenient Sklearn functions/classes such as `cross_val_score()` or `GridSearchCV()`,
which take a single model as input. *Note that we can't just pass the final classifier/regressor to
these functions, because our preprocessing steps (e.g. `StandardScaler`, `PCA`, etc.) must only be
fit to the train set at each cross validation fold.*

#### The solution: Sklearn Pipelines

In [4]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                           ('pca', PCA(n_components=3)),
                           ('classifier', LogisticRegression())])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Test set accuracy: {acc}')

Test set accuracy: 0.6948051948051948


Notice how much cleaner this code is. The composite model created using `Pipeline`
can be used just like any other Sklearn model you have learned, which means that it
can also be passed to functions like `cross_val_score()`.
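
As a minimal sketch of that last point, the cell below passes a pipeline to `cross_val_score()`. It uses synthetic data from `make_classification` as a stand-in for the diabetes dataset (an assumption made so the example runs without a download), and the names `X_demo`, `y_demo`, and `cv_pipeline` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, random_state=0)

cv_pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                              ('pca', PCA(n_components=3)),
                              ('classifier', LogisticRegression())])

# The whole pipeline is cloned and re-fit on the training part of each fold,
# so the scaler and PCA never see that fold's validation data
scores = cross_val_score(cv_pipeline, X_demo, y_demo, cv=5)
print(f'Mean CV accuracy: {scores.mean():.3f}')
```

Because the preprocessing steps live inside the model, there is no way to accidentally fit them on the validation portion of a fold.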

To get a better understanding of what is happening under the hood,
let's try to build our own pipeline-like class that has some of the
same core functionality as the Sklearn one.

In [5]:
class BasicPipeline:
    
    def __init__(self, steps):
        self.steps = steps
        
    def fit(self, X, y=None):
        print('Called .fit()')
        # Fit all preprocessing modules and sequentially transform input using them
        for name, estimator in self.steps[:-1]:
            print(f'Fitting {name}')
            # Pass y along as well: supervised transformers (e.g. SelectKBest) need it
            estimator.fit(X, y)
            print(f'Transforming with {name}')
            X = estimator.transform(X)
        
        # Fit the final (prediction) module
        name, estimator = self.steps[-1]
        print(f'Fitting {name}\n')
        if y is not None:
            estimator.fit(X, y)
        else:
            estimator.fit(X)
        
        # Return fitted self so that we can write things like "model = model.fit(X, y)",
        # in addition to just "model.fit(X, y)" on its own line
        return self
        
    def predict(self, X):
        print('Called .predict()')
        # Sequentially transform input using all the preprocessing modules
        for name, estimator in self.steps[:-1]:
            print(f'Transforming with {name}')
            X = estimator.transform(X)
        
        # Predict using the final module
        name, estimator = self.steps[-1]
        print(f'Predicting with {name}\n')
        y_pred = estimator.predict(X)
        
        return y_pred

In [6]:
pipeline = BasicPipeline(steps=[('scaling', StandardScaler()),
                                ('pca', PCA(n_components=3)),
                                ('classifier', LogisticRegression())])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Test set accuracy: {acc}')

Called .fit()
Fitting scaling
Transforming with scaling
Fitting pca
Transforming with pca
Fitting classifier

Called .predict()
Transforming with scaling
Transforming with pca
Predicting with classifier

Test set accuracy: 0.6948051948051948
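
The real Sklearn `Pipeline` also keeps its fitted steps around, so they can be inspected afterwards. The sketch below illustrates this on synthetic data (an assumption, so that it runs on its own). `named_steps` and slicing are part of the `Pipeline` API, while the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

fitted = Pipeline(steps=[('scaling', StandardScaler()),
                         ('pca', PCA(n_components=3)),
                         ('classifier', LogisticRegression())])
fitted.fit(X_demo, y_demo)

# Each fitted step is available under the name it was given in `steps`
scaler_step = fitted.named_steps['scaling']
pca_step = fitted.named_steps['pca']
print(scaler_step.mean_.shape)             # one learned mean per input feature
print(pca_step.explained_variance_ratio_)  # share of variance per component

# Slicing returns a sub-pipeline; here, every step except the classifier
X_reduced = fitted[:-1].transform(X_demo)
print(X_reduced.shape)
```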


---
## Feature unions
`Pipeline` lets us specify a sequence of steps that will be executed one after the other (i.e. in *series*),
but what if we want branches in our process? For instance, what if we want to create two different sets
of features and use both of them when fitting our model?

For this type of application, we can use a `FeatureUnion`. It is an Sklearn class that runs several steps
on the same input (i.e. in *parallel*) and joins their outputs through *concatenation*. `FeatureUnion` and `Pipeline` objects can be nested inside one another as deeply as we want.

![](images/series_and_parallel.png)
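
To make the concatenation concrete, here is a small sketch that checks the output shape of a `FeatureUnion` on its own. The data is synthetic (an assumption, so the cell is self-contained), and `union`, `X_demo`, and `X_union` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data (assumption): 200 samples, 8 numeric features
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)

union = FeatureUnion([('pca', PCA(n_components=3)),
                      ('select_best', SelectKBest(k=6))])

# y is passed because SelectKBest scores each feature against the target
X_union = union.fit_transform(X_demo, y_demo)

print(X_demo.shape)   # (200, 8)
print(X_union.shape)  # (200, 9): 3 PCA components followed by 6 selected features
```

Both branches receive the same input, and their outputs are stacked column-wise in the order the branches were listed.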

In [8]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_selection import SelectKBest

feature_union = FeatureUnion([('pca', PCA(n_components=3)), 
                              ('select_best', SelectKBest(k=6))])

pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                           ('features', feature_union),
                           ('classifier', LogisticRegression())])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Test set accuracy: {acc}')

Test set accuracy: 0.7337662337662337


---
## Visualizing pipelines
Another advantage of having these pipelines is that we can quickly visualize complex workflows used in our
modeling as HTML, which can be helpful for debugging purposes or presentations.

<sub>*Note: I highly recommend you use this in your own presentations as a substitute for (or in addition to) code.*</sub>

In [9]:
# Display HTML representation in a jupyter context
from sklearn import set_config
set_config(display='diagram')

pipeline

Note that you can also click on the individual parts in the diagram (e.g. PCA) to see their arguments.

In [10]:
# Or, save the HTML to a file
from sklearn.utils import estimator_html_repr

with open('images/model_pipeline.html', 'w') as f:  
    f.write(estimator_html_repr(pipeline))

---
## Hyperparameter tuning with pipelines
Normally, if we want to tune hyperparameters using something like `GridSearchCV`, we need to pass it:
1. A model object.
2. A dictionary of (parameter name, list of values to try) pairs.

When not using pipelines, we can only tune hyperparameters for a single model (the one we specify as the
model in `GridSearchCV`). As we've seen, however, we can create composite models using `Pipeline`. We can
then pass this composite model to `GridSearchCV` and tune hyperparameters for multiple components at once.
Parameters of nested components are addressed by joining the step names and the parameter name with double
underscores, e.g. `features__pca__n_components`.
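
If you are unsure which names a composite model exposes, one option is to list them with `get_params()`. The cell below is a sketch along those lines; `demo_pipeline` and `param_names` are illustrative names, and the step names mirror the ones used elsewhere in this notebook.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

demo_union = FeatureUnion([('pca', PCA(n_components=3)),
                           ('select_best', SelectKBest(k=6))])
demo_pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                                ('features', demo_union),
                                ('classifier', RidgeClassifier())])

# Nested parameters follow the pattern <step name>__<parameter name>,
# with one extra level for steps that live inside the FeatureUnion
param_names = sorted(demo_pipeline.get_params().keys())
for name in param_names:
    if '__' in name:
        print(name)

# set_params() accepts the same names, which is what GridSearchCV relies on
demo_pipeline.set_params(features__pca__n_components=5, classifier__alpha=0.1)
```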

In [11]:
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

feature_union = FeatureUnion([('pca', PCA(n_components=3)), 
                              ('select_best', SelectKBest(k=6))])

pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                           ('features', feature_union),
                           ('classifier', RidgeClassifier())])

# Find the best hyperparameters using GridSearchCV on the train set
param_grid = {'classifier__alpha': [0.001, 0.01, 0.1], 
              'features__pca__n_components': [3, 5],
              'features__select_best__k': [1, 3, 6]}
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
best_acc = grid.score(X_test, y_test)
print(f'Best test set accuracy: {best_acc}\nAchieved with hyperparameters: {best_hyperparams}')

Best test set accuracy: 0.7467532467532467
Achieved with hyperparameters: {'classifier__alpha': 0.001, 'features__pca__n_components': 3, 'features__select_best__k': 3}


---
## Custom class in a pipeline
In some scenarios, the standard Sklearn models and preprocessing functions may not be enough, and you will
have to generate your own custom classes.
However, you'll still want the convenience of `Pipeline` and the advantages that come with it.

Here, we'll see how to embed your own custom class into an Sklearn `Pipeline`. We'll be using a dummy dataset where $y = 5x_1 + 2\sqrt{x_2}$. A linear regression model cannot find the appropriate solution, but a linear regression model that takes the square roots of the features (in addition to the features themselves) can.

In [12]:
import numpy as np

X = np.random.rand(1000, 2) * 10
y = 5 * X[:, 0] + 2 * np.sqrt(X[:, 1])

X_train, X_test, y_train, y_test = X[:600], X[600:], y[:600], y[600:]

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Take the square root ourselves; the `squared=False` argument was removed in newer Sklearn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Test set RMSE: {rmse}')

Test set RMSE: 0.33363728944960125


The performance just using linear regression is poor. Let's create a custom transformer
that generates a square-rooted version of the features, and use it in a pipeline.

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin

# Custom Sklearn transformers should:
# a) inherit from BaseEstimator and TransformerMixin
#    (these provide get_params()/set_params() and fit_transform() for free)
# b) implement the __init__(), fit(), and transform() methods
class SqrtTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        # Your __init__ function takes in arguments as input
        # and does some initialization, such as creating model parameters.
        # This transformer does not require any.
        print('init() called')

    def fit(self, X, y = None):
        # Your fit() function takes in an X (and optionally a y)
        # and fits its parameters to the data. It then returns "self".
        # This transformer does not fit anything, because it is parameterless.
        print('fit() called')
        return self

    def transform(self, X, y = None):
        # Your transform() function takes in an X (and optionally a y)
        # and spits out the transformed output.
        # This transformer returns the original features and their square root.
        print('transform() called')
        X_sqrt = np.sqrt(X)
        X = np.concatenate((X, X_sqrt), axis=1)
        return X

In [15]:
pipeline = Pipeline(steps=[('sqrt', SqrtTransformer()), 
                           ('regression', LinearRegression())])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
# Take the square root ourselves; the `squared=False` argument was removed in newer Sklearn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'\nTest set RMSE: {rmse}')

init() called
fit() called
transform() called
transform() called

Test set RMSE: 1.0961584504754316e-14
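
For a stateless transformation like this one, Sklearn's `FunctionTransformer` is a possible alternative to writing a full class. The sketch below is not the approach used above, just a variation on it: it wraps a helper function (`add_sqrt_features`, a name introduced here for illustration) and then inspects the learned coefficients. The data is regenerated with a fixed seed, which is an assumption made for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_sqrt_features(X):
    # Return the original columns followed by their square roots
    return np.concatenate((X, np.sqrt(X)), axis=1)

# Same kind of dummy data as above: y = 5*x1 + 2*sqrt(x2)
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 2)) * 10
y_demo = 5 * X_demo[:, 0] + 2 * np.sqrt(X_demo[:, 1])

sqrt_pipeline = Pipeline(steps=[('sqrt', FunctionTransformer(add_sqrt_features)),
                                ('regression', LinearRegression())])
sqrt_pipeline.fit(X_demo[:600], y_demo[:600])

# Columns are [x1, x2, sqrt(x1), sqrt(x2)], so the coefficients
# should come out close to [5, 0, 0, 2]
coefs = sqrt_pipeline.named_steps['regression'].coef_
print(np.round(coefs, 6))
```

Looking at the coefficients shows why the RMSE drops to nearly zero: the model puts its weight on $x_1$ and $\sqrt{x_2}$ and essentially ignores the other two columns.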


---
## Activity (time permitting)
- Go back to some sklearn code you've written in the past (either for a project or an exercise).
- Turn it into a pipeline.
- Post your before/after snippets in the Slack thread.
- Meet back in 10 minutes.
- Ask questions if you weren't sure how to do something.

The goals of this exercise are to:
- Get comfortable pipelining your preprocessing and modeling code.
- See many paired before/after examples.
- Fill any gap in understanding that arises when trying out pipelines yourself.