<center><h2>scikit-learn's Pipeline</h2></center>

<center><img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2016/06/Automate-Machine-Learning-Workflows-with-Pipelines-in-Python-and-scikit-learn.jpg" width="65%"/></center>

By The End Of This Session You Should Be Able To:
----

- Write more compact and readable code with scikit-learn's Pipeline
- Use Pipeline to automate cross validation and grid searching

Typical ML Workflow
------

1. Load tidy data
1. Split into train and test sets
1. Do feature engineering
1. Choose hyperparameters
1. Fit model
1. Evaluate model
1. Decide if business outcome was accomplished
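A rough sketch of steps 1 to 6 done by hand, without `Pipeline`, may help show what gets automated later. The iris dataset, the `StandardScaler` step, and `max_depth=3` are illustrative assumptions, not choices made by the list above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# 1. Load tidy data
X, y = load_iris(return_X_y=True)

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 3. Feature engineering: learn the scaling from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Choose hyperparameters (max_depth=3 is an arbitrary illustrative value)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5. Fit model
model.fit(X_train_scaled, y_train)

# 6. Evaluate model
manual_accuracy = model.score(X_test_scaled, y_test)
print(f"{manual_accuracy:.3f}")
```

Every intermediate variable here (`scaler`, `X_train_scaled`, `X_test_scaled`) is bookkeeping that has to be repeated for each new question.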

Being a curious DS, you'll ask more and more questions …
----

What if I change this hyperparameter?

What if I change the algorithm?

I have low bias, but do I have low variance? I should try different variations of the data!

<center><h2>Analytics is combinatorial, time is linear.</h2></center>

scikit-learn's Pipeline is the answer!
------

- Pipelines make it easy to reorder, add, or remove steps
- Pipelines make it easy to adjust hyperparameters
- Pipelines make code more readable

Source: https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline
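A minimal sketch of the three points above, assuming a three-step pipeline with the hypothetical step names `scl`, `pca`, and `clf`:

```python
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier())])

# Adjust a hyperparameter: address it as <step name>__<parameter>
pipe.set_params(clf__max_depth=3)
depth = pipe.named_steps['clf'].max_depth

# Remove a step: rebuild from the list of (name, object) tuples, minus 'pca'
pipe_no_pca = Pipeline([(name, clone(obj))
                        for name, obj in pipe.steps if name != 'pca'])
step_names_without_pca = [name for name, _ in pipe_no_pca.steps]

# Swap the algorithm: replace the whole 'clf' step on a copy of the pipeline
pipe_swapped = clone(pipe).set_params(clf=LogisticRegression())

print(depth)
print(step_names_without_pca)
print(type(pipe_swapped.named_steps['clf']).__name__)
```

Because the steps are a plain list of named objects, each "what if" question becomes a one-line change.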

scikit-learn's Transformer class
------

For data preparation

`fit` – find parameters from training data (if needed)   
`transform` – apply to training or test data  

Let's take a look at [Dataset transformations](https://scikit-learn.org/stable/data_transforms.html)
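As a small illustration of `fit` and `transform`, here is a sketch that uses `StandardScaler` on the iris data (both are assumptions picked for the example): the scaler learns its statistics from the training set and reuses them on the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean_ and scale_ from training data only
X_train_scaled = scaler.transform(X_train)  # training data ends up with mean 0, std 1
X_test_scaled = scaler.transform(X_test)    # test data reuses the training statistics

print(scaler.mean_)                         # one learned mean per feature
print(np.round(X_train_scaled.mean(axis=0), 6))
```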

scikit-learn's Estimator class
------

For modeling

`fit` – find parameters from training data   
`predict` – apply to training or test data

Examples: logistic regression (LR), k-nearest neighbors (k-NN), support vector machines (SVM), decision trees (DT), ...
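The same `fit` then `predict` pattern, sketched with k-NN on the iris data (the dataset and `n_neighbors=5` are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)             # find parameters from training data
y_pred = knn.predict(X_test)          # apply to unseen data
accuracy = knn.score(X_test, y_test)  # mean accuracy of predict(X_test) against y_test

print(y_pred[:5])
print(f"{accuracy:.3f}")
```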

scikit-learn's Pipeline class
------

`Pipeline` manages machine learning workflows

Examples:  
`Pipeline = [Estimator]`  
`Pipeline = [Transformer, Transformer, Estimator]`  
`Pipeline = [Transformer, Transformer, Transformer]`  
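The three shapes listed above could look roughly like this in code. The step names and the choice of scaler, PCA, and decision tree are assumptions for illustration. A pipeline that ends in an estimator offers `predict`, while one made only of transformers offers `transform`.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pipeline = [Estimator]
estimator_only = Pipeline([('clf', DecisionTreeClassifier(random_state=0))])
estimator_only.fit(X, y)

# Pipeline = [Transformer, Transformer, Estimator]
full = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier(random_state=0))])
full.fit(X, y)
preds = full.predict(X)

# Pipeline = [Transformer, Transformer]: preprocessing only, no model at the end
prep_only = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=2))])
X_2d = prep_only.fit_transform(X)

print(preds.shape, X_2d.shape)
```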

<center><h2>scikit-learn's Pipeline</h2></center>

<center><img src="https://iaml.it/blog/optimizing-sklearn-pipelines/images/pipeline-diagram.png" width="70%"/></center>

In [76]:
reset -fs

In [77]:
import numpy as np

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [78]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

Source: https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html

In [79]:
from sklearn.pipeline import Pipeline

In [80]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

pipe_dt = Pipeline([('scl', StandardScaler()),          # Transformer: Standardize
                    ('pca', PCA(n_components=2)),       # Transformer: Dimension Reduction
                    ('clf', DecisionTreeClassifier())]) # Estimator: ML algorithm

In [81]:
pipe_dt.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])

In [82]:
f"{pipe_dt.score(X_test, y_test):.4f}"

'0.9000'

Pipelines All The Way Down
--------



In [83]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression())])

pipe_svm = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', SVC())])

# List of pipelines for ease of iteration
pipelines = [pipe_dt, pipe_lr, pipe_svm, ]

In [84]:
# Fit the pipelines 
for pipe in pipelines:
    pipe.fit(X_train, y_train) # One-liner fits all models 💥

In [85]:
# Compare accuracies
for pipe in pipelines:
    name = type(pipe.steps[-1][1]).__name__ # Class name of the final estimator
    print(f"{name:<24}: {pipe.score(X_test, y_test):.3f}")

DecisionTreeClassifier  : 0.900
LogisticRegression      : 0.933
SVC                     : 0.900


Pipeline works well with Cross Validation
-----

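A Pipeline behaves like a single estimator, so under cross-validation every step (scaling, PCA, model) is refit on each training fold and no information from the held-out fold leaks into preprocessing.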

In [94]:
from sklearn.model_selection import cross_val_score, KFold

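# No shuffling, so no random_state (recent scikit-learn raises an error
# if random_state is set while shuffle=False)
kfold = KFold(n_splits=10)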
results = cross_val_score(pipe_dt, # Put your pipeline where an Estimator would go
                          X_train, 
                          y_train, 
                          cv=kfold)
print(f"{results.mean():.4f}")

0.9083


Source: https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/ 
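One way to see that the preprocessing is refit per fold is to keep the fitted pipelines and compare what each scaler learned. This is a sketch under assumptions: it uses `cross_validate` with `return_estimator=True` on the full iris data and a shuffled 5-fold split, which is not the setup of the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier(random_state=0))])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
out = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

# One fitted pipeline per fold; each scaler saw only that fold's training rows
fold_means = np.array([est.named_steps['scl'].mean_
                       for est in out['estimator']])

print(fold_means.round(3))          # rows differ slightly from fold to fold
print(out['test_score'].round(3))   # one accuracy per fold
```

If the scaler had been fit once on all the data before cross-validating, every row of `fold_means` would be identical, which is exactly the leakage a Pipeline avoids.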

Pipeline is awesome for Grid Search
----- 

<center><img src="https://i.stack.imgur.com/cIDuR.png" width="75%"/></center>

Like strategies for the Battleship board game

In [87]:
from sklearn.model_selection import GridSearchCV

# Grid search over each element in pipeline
#                  <step name>__<parameter>
grid_params = dict(pca__n_components=[1, 2, 3],
                   clf__max_depth=range(1, 5),
                   clf__criterion=['gini', 'entropy'],
                   clf__min_samples_leaf=range(3, 15))

gs = GridSearchCV(estimator=pipe_dt,  
                  param_grid=grid_params,
                  scoring='accuracy',
                  cv=10)

gs.fit(X_train, y_train)
f"{gs.score(X_test, y_test):.4f}"

'0.9667'
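To find out which `<step name>__<parameter>` keys a pipeline accepts, and how large a grid really is, something like the following sketch may be useful. It rebuilds an equivalent pipeline and grid so it can run on its own. `ParameterGrid` is the scikit-learn helper that enumerates the combinations.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier())])

# Every nested parameter that can appear as a key in a grid
tunable = sorted(name for name in pipe.get_params() if '__' in name)

grid_params = dict(pca__n_components=[1, 2, 3],
                   clf__max_depth=list(range(1, 5)),
                   clf__criterion=['gini', 'entropy'],
                   clf__min_samples_leaf=list(range(3, 15)))

n_candidates = len(ParameterGrid(grid_params))  # 3 * 4 * 2 * 12 combinations
n_fits = n_candidates * 10                      # each one is fit once per fold when cv=10

print(tunable)
print(n_candidates, n_fits)
```

The grid grows multiplicatively with every added parameter, which is the "analytics is combinatorial" point in practice.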

In [88]:
# Best parameters
gs.best_params_

{'clf__criterion': 'entropy',
 'clf__max_depth': 4,
 'clf__min_samples_leaf': 4,
 'pca__n_components': 3}

In [89]:
# Best pipeline with the best hyperparameters (already refit on all of X_train, since refit=True by default)
gs.best_estimator_

Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])

In [90]:
# Best model with specific model parameters
gs.best_estimator_.get_params()['clf']

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

<center><h2>Grid Search Across Algorithms</h2></center>

In [91]:
from sklearn.base import BaseEstimator

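class DummyEstimator(BaseEstimator):
    """Placeholder for the 'clf' step; the grid search swaps in real estimators."""
    def fit(self, X, y=None): return self
    def score(self, X, y=None): return 0.0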
    
# Create a pipeline
pipe = Pipeline([('clf', DummyEstimator())]) # Placeholder Estimator

In [92]:
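# Create space of candidate learning algorithms and their hyperparameters
# Note: 'l1' needs a solver that supports it (e.g. 'liblinear'); the default
# solver changed to 'lbfgs' in scikit-learn 0.22, which does not support 'l1'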
search_space = [{'clf': [LogisticRegression()], # Actual Estimator
                 'clf__penalty': ['l1', 'l2'],
                 'clf__C': np.logspace(0, 4, 10)},
                
                {'clf': [DecisionTreeClassifier()],  # Actual Estimator
                 'clf__criterion': ['gini', 'entropy']}]


# Create grid search 
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)

#  Fit grid search
best_model = clf.fit(X_train, y_train)

# View best model
best_model.best_estimator_.get_params()['clf']

LogisticRegression(C=1291.5496650148827, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

Source: https://chrisalbon.com/machine_learning/model_selection/model_selection_using_grid_search/
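The same idea extends to whole preprocessing steps. The sketch below is a variation, not the approach from the source above: it assumes scikit-learn 0.21 or later, where the string `'passthrough'` can stand in for a step, and it uses a real estimator instead of a dummy as the placeholder. The grid values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(max_iter=1000))])

# Each dict is one sub-grid; 'passthrough' means "skip the PCA step"
search_space = [
    {'pca': [PCA(n_components=2), 'passthrough'],
     'clf': [LogisticRegression(max_iter=1000)],
     'clf__C': [0.1, 1.0, 10.0]},
    {'pca': [PCA(n_components=2), 'passthrough'],
     'clf': [DecisionTreeClassifier(random_state=0)],
     'clf__max_depth': [2, 3, 4]},
]

search = GridSearchCV(pipe, search_space, cv=5)
search.fit(X_train, y_train)

winner = type(search.best_estimator_.named_steps['clf']).__name__
print(winner, f"{search.best_score_:.3f}")
print(search.best_params_)
```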

<center><h2>Pipelines are even more important for text processing given the amount of data preprocessing needed</h2></center>
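A minimal text-processing sketch, assuming a tiny corpus and labels invented purely for illustration: tokenizing, stop-word removal, and TF-IDF weighting all live inside the pipeline, so raw strings go in and predictions come out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus and labels, made up for this example (1 = spam, 0 = not spam)
docs = ["free prize, claim your money now",
        "meeting moved to friday afternoon",
        "win money with this free offer",
        "please review the attached report"]
labels = [1, 0, 1, 0]

text_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('clf', LogisticRegression()),
])

text_pipe.fit(docs, labels)                       # raw strings in
pred = text_pipe.predict(["claim your free money"])

vocabulary = text_pipe.named_steps['tfidf'].vocabulary_
print(pred)
print(sorted(vocabulary))
```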


Summary
------

- From now on, use Pipeline
- Pipeline will help for hyperparameter tuning and algorithm comparisons

Further Study
------

- Custom Transformer
- FeatureUnion which concatenates the output of transformers into a composite feature space.
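As a starting point for both topics, here is a hedged sketch: `Log1pTransformer` is a hypothetical custom transformer written for this example, and the `FeatureUnion` places its output side by side with two PCA components.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical custom transformer: applies log(1 + x) element-wise."""

    def fit(self, X, y=None):
        return self              # stateless: nothing to learn

    def transform(self, X):
        return np.log1p(X)       # iris features are positive, so this is safe


X, y = load_iris(return_X_y=True)

# Concatenate 2 PCA components with the 4 log-transformed raw features
features = FeatureUnion([
    ('pca', Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=2))])),
    ('log', Log1pTransformer()),
])
combined = features.fit_transform(X)

# A FeatureUnion is itself a transformer, so it can be a pipeline step
pipe = Pipeline([('features', features),
                 ('clf', DecisionTreeClassifier(random_state=0))])
pipe.fit(X, y)

print(combined.shape)
```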