<center><h2>scikit-learn's Pipeline</h2></center>

<center><img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2016/06/Automate-Machine-Learning-Workflows-with-Pipelines-in-Python-and-scikit-learn.jpg" width="65%"/></center>

By The End Of This Session You Should Be Able To:
----

- Write more compact and readable code with scikit-learn's Pipeline
- Use Pipeline to automate cross validation and grid searching

Typical ML Workflow
------

1. Load tidy data
1. Split into train and test sets
1. Do feature engineering
1. Choose hyperparameters
1. Fit model
1. Evaluate model
1. Decide if business outcome was accomplished

Being a curious DS, you'll ask more and more questions …
----

What if I change this hyperparameter?

What if I change the algorithm?

I have low bias, but do I have low variance? I should try different variations of the data!

<center><h2>Analytics is combinatorial, time is linear.</h2></center>

scikit-learn's Pipeline is the answer!
------


- Pipelines make code more readable
- Pipelines make it easy to reorder, add, or remove steps
- You only have to call `fit` and `predict` once, on the pipeline, instead of on every step
- A grid search over the parameters of every step is easy to write

Source: https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline

scikit-learn's Transformer
------

For data preparation

`fit` – find parameters from training data (if needed)   
`transform` – apply to training or test data  

Let's take a look at the [Dataset transformations](https://scikit-learn.org/stable/data_transforms.html) documentation.
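
The `fit` / `transform` split above can be sketched with `StandardScaler`. This is a minimal illustration, not part of the session notebook; the tiny arrays `X_train_demo` and `X_test_demo` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data, for illustration only
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train_demo)                       # learn mean and std from training data only
train_scaled = scaler.transform(X_train_demo)  # apply to the training data
test_scaled = scaler.transform(X_test_demo)    # reuse the training mean and std on test data

print(scaler.mean_)   # mean learned from the training data
print(test_scaled)    # test point expressed in training-set standard deviations
```

The test data is never passed to `fit`, so nothing about it influences the learned mean and standard deviation.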

scikit-learn's Estimator
------

For modeling

`fit` – find parameters from training data   
`predict` – apply to training or test data

Examples: logistic regression (LR), k-nearest neighbors (k-NN), support vector machines (SVM), decision trees (DT), ...
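
The same two-step pattern for an Estimator might look like the sketch below. The choice of `KNeighborsClassifier` with `n_neighbors=5` and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # illustrative hyperparameter choice
knn.fit(X_tr, y_tr)                        # learn from the training data
y_pred = knn.predict(X_te)                 # apply to unseen data

print(y_pred[:5])
```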

scikit-learn's Pipeline
------

The `Pipeline` class manages Machine Learning pipelines

A series of Transformers on the data

The last step can be an Estimator (or another Transformer)

`Pipeline = [Transformer, Transformer, Estimator]`  
`Pipeline = [Transformer, Transformer, Transformer]`  
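
To make the `[Transformer, Transformer, Estimator]` idea concrete, here is a rough sketch that chains the steps by hand and then lets `Pipeline` do the same thing. The step names and `max_iter=1000` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# By hand: each Transformer is fit, then its output feeds the next step
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(max_iter=1000)
X_scaled = scaler.fit_transform(X)
X_reduced = pca.fit_transform(X_scaled)
clf.fit(X_reduced, y)
manual_pred = clf.predict(pca.transform(scaler.transform(X)))

# With Pipeline: the same chain behind one fit and one predict
pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

print(np.array_equal(manual_pred, pipe_pred))  # do both routes agree?
```

Three objects and five calls shrink to one object and two calls, and there is no way to forget a `transform` at prediction time.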

<center><h2>scikit-learn's Pipeline</h2></center>

<center><img src="https://iaml.it/blog/optimizing-sklearn-pipelines/images/pipeline-diagram.png" width="70%"/></center>

In [18]:
reset -fs

In [19]:
import numpy as np

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [20]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

Source: https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html

In [21]:
from sklearn.pipeline import Pipeline

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

pipe_dt = Pipeline([('scl', StandardScaler()),          # Transformer: Standardize
                    ('pca', PCA(n_components=2)),       # Transformer: Dimension Reduction
                    ('clf', DecisionTreeClassifier())]) # Estimator: ML algorithm

In [23]:
pipe_dt.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])

In [24]:
f"{pipe_dt.score(X_test, y_test):.4f}"

'0.8667'

Pipelines All The Way Down
--------



In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression())])

pipe_svm = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', SVC())])

# List of pipelines for ease of iteration
pipelines = [pipe_lr, pipe_svm, pipe_dt]

In [26]:
# Fit the pipelines 
for pipe in pipelines:
    pipe.fit(X_train, y_train) # One line fits all models 💥

In [27]:
# Compare accuracies
for pipe in pipelines:
    name = type(pipe.steps[-1][1]).__name__ # Class name of the final step
    print(f"{name:<24}: {pipe.score(X_test, y_test):.4f}")

LogisticRegression      : 0.9333
SVC                     : 0.9000
DecisionTreeClassifier  : 0.8667


Pipeline works well with Cross Validation
-----

A Pipeline behaves like a single estimator, so cross-validation refits every step (scaling, PCA, model) on the training folds only and nothing leaks from the held-out fold

In [28]:
from sklearn.model_selection import cross_val_score, KFold

kfold = KFold(n_splits=10) # random_state only applies when shuffle=True
results = cross_val_score(pipe_dt, # Put your pipeline where an Estimator would go
                          X_train, 
                          y_train, 
                          cv=kfold)
print(results.mean())

0.9166666666666666


Source: https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/ 
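
What `cross_val_score` does with a pipeline can be approximated by a manual loop. The sketch below is an illustration under assumptions (5 shuffled folds, a logistic regression instead of the decision tree so the scores are reproducible), not the library's actual implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_pipe = clone(pipe)                    # fresh, unfitted copy for each fold
    fold_pipe.fit(X[train_idx], y[train_idx])  # scaler sees the training fold only
    manual_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(pipe, X, y, cv=kfold)
print(np.allclose(manual_scores, auto_scores))  # do the two approaches agree?
```

Scaling the full dataset before splitting would let the held-out fold influence the scaler; keeping the scaler inside the pipeline avoids that.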

Pipeline is awesome for Grid Search
----- 

<center><img src="https://i.stack.imgur.com/cIDuR.png" width="75%"/></center>

In [29]:
from sklearn.model_selection import GridSearchCV

# Grid search over the parameters of each step in the pipeline
#                  <step_name>__<parameter>
grid_params = dict(pca__n_components=[1, 2, 3],
                   clf__max_depth=range(1, 5),
                   clf__criterion=['gini', 'entropy'],
                   clf__min_samples_leaf=range(3, 15))

gs = GridSearchCV(estimator=pipe_dt, # One estimator per search; loop over pipelines to compare several
                  param_grid=grid_params,
                  scoring='accuracy',
                  cv=10)

gs.fit(X_train, y_train)
f"{gs.score(X_test, y_test):.4f}"

'0.9667'

In [30]:
# Best parameters
gs.best_params_

{'clf__criterion': 'gini',
 'clf__max_depth': 4,
 'clf__min_samples_leaf': 4,
 'pca__n_components': 3}
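
The keys above follow the `<step name>__<parameter>` convention. A small sketch of how to discover and set such names by hand, using the same step names as `pipe_dt` (the pipeline is rebuilt here so the snippet is self-contained):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier())])

# List the parameter names that a grid could use for the 'clf' step
tunable = sorted(pipe.get_params().keys())
print([name for name in tunable if name.startswith('clf__')][:3])

# Set nested parameters by hand, much as a grid search does for each candidate
pipe.set_params(pca__n_components=3, clf__max_depth=4)
print(pipe.named_steps['pca'].n_components, pipe.named_steps['clf'].max_depth)
```

A misspelled key such as `clf__maxdepth` makes `set_params` raise an error, so `get_params()` is a quick way to check names before launching a long search.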

In [34]:
# Best model
gs.best_estimator_

Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_fe...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])
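
The search above tunes a single algorithm. One possible way to compare algorithms in the same search is to treat the final step itself as a parameter. The grids below (the values of `C` and `max_depth`, `cv=5`) are assumptions chosen to keep the example fast, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier())])  # placeholder, replaced by the grid

# A list of grids: each dict swaps in a different final step with its own parameters
param_grid = [
    {'clf': [LogisticRegression(max_iter=1000)], 'clf__C': [0.1, 1, 10]},
    {'clf': [SVC()], 'clf__C': [0.1, 1, 10]},
    {'clf': [DecisionTreeClassifier(random_state=0)], 'clf__max_depth': [2, 3, 4]},
]

search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

print(type(search.best_estimator_.named_steps['clf']).__name__)
print(f"{search.score(X_te, y_te):.4f}")
```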


<center><h2>Pipelines are even more important for text processing given the amount of data preprocessing needed</h2></center>


Summary
------

- From now on, use Pipeline
- Pipeline will help for hyperparameter tuning and algorithm comparisons

Further Study
------

- Custom Transformer
- FeatureUnion, which concatenates the output of transformers into a composite feature space
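
As a starting point for both topics, here is a minimal sketch. `ColumnRatio` is a hypothetical custom Transformer invented for this example; only `BaseEstimator`, `TransformerMixin`, `FeatureUnion`, and `PCA` come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion


class ColumnRatio(BaseEstimator, TransformerMixin):
    """Hypothetical custom Transformer: ratio of two columns as one new feature."""

    def __init__(self, numerator=0, denominator=1):
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X)
        ratio = X[:, self.numerator] / X[:, self.denominator]
        return ratio.reshape(-1, 1)  # keep the 2D shape scikit-learn expects


X, _ = load_iris(return_X_y=True)

# FeatureUnion runs each transformer on the same input and stacks the outputs side by side
union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('ratio', ColumnRatio(numerator=0, denominator=1))])
features = union.fit_transform(X)
print(features.shape)  # → (150, 3): 2 PCA columns + 1 ratio column
```

Because a `FeatureUnion` is itself a Transformer, it can be used as a step inside a `Pipeline`.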