# Pipelines 

Scikit-learn has the notion of a pipeline. Using the `Pipeline` class, you can chain together transformers and models and treat the whole process like a single scikit-learn model. You can even insert custom logic.
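
As a minimal sketch of the idea, the block below chains a scaler and a classifier and then uses the result like any other estimator. The synthetic data and the names `X_demo`, `y_demo`, and `demo_pipe` are assumptions made for illustration; they are not part of the Titanic example that follows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipe = Pipeline([
    ('std', StandardScaler()),      # transformer stage
    ('clf', LogisticRegression()),  # final estimator stage
])

# The pipeline exposes the usual estimator methods
demo_pipe.fit(X_demo, y_demo)
demo_preds = demo_pipe.predict(X_demo)
demo_acc = demo_pipe.score(X_demo, y_demo)
```

Calling `.fit` runs `fit_transform` on each transformer in order and then fits the final estimator on the transformed data.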

## Classification Pipeline

In [1]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import pandas as pd 

df = pd.read_excel('titanic3.xls')
df


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,


In [2]:
df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [3]:
def tweak_titanic(df):
    df = df.drop(columns = [
        'name', 
        'ticket', 
        'home.dest',
        'boat',
        'body', 
        'cabin',
    ]).pipe(pd.get_dummies, drop_first = True)
    
    return df

In [4]:
class TitanicTransformer(BaseEstimator, TransformerMixin):
    
    def transform(self, X):
        # assumes X is output from reading Excel file
        X = tweak_titanic(X)
        X = X.drop(columns = 'survived')

        return X
    
    def fit(self, X, y=None):
        return self
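
The same pattern can be tried on a toy DataFrame. The sketch below is an illustration only: `DropColumns` and the `toy` frame are hypothetical names, not part of the Titanic code. It shows the two methods a custom transformer needs, with `fit` returning `self` when there is nothing to learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that drops a fixed list of columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing is learned from the data, so just return self
        return self

    def transform(self, X):
        # DataFrame.drop returns a new frame; the input is left unchanged
        return X.drop(columns=self.columns or [])


toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
dropped = DropColumns(columns=['b']).fit_transform(toy)
```

`TransformerMixin` supplies `fit_transform`, which is what `Pipeline` calls on every stage except the last during `.fit`.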
    


In [5]:
from sklearn import preprocessing

In [6]:
from sklearn.ensemble import RandomForestClassifier

In [7]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must run before importing IterativeImputer
from sklearn.impute import IterativeImputer

In [8]:
pipe = Pipeline(
[
    ('titan', TitanicTransformer()),
    ('impute', IterativeImputer()),
    ('std', preprocessing.StandardScaler()),
    ('rf', RandomForestClassifier())
])

With a pipeline in hand, we can call `.fit` and `.score` on it:

In [9]:
orig_df = pd.read_excel('titanic3.xls')

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(orig_df, orig_df.survived, test_size=0.3, random_state=42)

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.7989821882951654

Pipelines can be used in grid search. Our `param_grid` needs to have the parameters prefixed by the name of the pipe stage, followed by two underscores. In the example below, we add some parameters for the random forest stage: 

In [11]:
params ={
    'rf__max_features': [0.4, 'auto'],
    'rf__n_estimators' : [15,200]
}
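
One way to check which prefixed names are valid is to inspect `get_params()` on the pipeline. The sketch below uses a smaller stand-in pipeline (`small_pipe` is an assumed name) rather than the Titanic one, so it runs without the data file.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

small_pipe = Pipeline([
    ('std', StandardScaler()),
    ('rf', RandomForestClassifier()),
])

# Every nested parameter appears as '<stage name>__<parameter name>'
rf_param_names = sorted(k for k in small_pipe.get_params() if k.startswith('rf__'))

# The same names work with set_params, which is what grid search relies on
small_pipe.set_params(rf__n_estimators=15)
n_trees = small_pipe.named_steps['rf'].n_estimators
```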

In [12]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, cv = 3, param_grid = params)
grid.fit(orig_df, orig_df.survived)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('titan', TitanicTransformer()),
                                       ('impute', IterativeImputer()),
                                       ('std', StandardScaler()),
                                       ('rf', RandomForestClassifier())]),
             param_grid={'rf__max_features': [0.4, 'auto'],
                         'rf__n_estimators': [15, 200]})

Now we can pull out the best parameters and train the final model. 

In [13]:
grid.best_params_

{'rf__max_features': 0.4, 'rf__n_estimators': 200}

In [14]:
pipe.set_params(**grid.best_params_)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.7913486005089059

We can use the pipeline anywhere we would use a scikit-learn model, for example when computing metrics:

In [15]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, pipe.predict(X_test))

0.7799159974640744
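
ROC AUC is usually computed from scores or probabilities rather than hard class labels, and a pipeline whose final step supports `predict_proba` exposes that method too. The sketch below illustrates this on synthetic data; the names `auc_pipe`, `Xs_*`, and `ys_*` are assumptions, and the resulting values will differ from the Titanic numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, used only for illustration
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

auc_pipe = Pipeline([
    ('std', StandardScaler()),
    ('clf', LogisticRegression()),
])
auc_pipe.fit(Xs_train, ys_train)

# Probability of the positive class (column 1) for each test row
pos_proba = auc_pipe.predict_proba(Xs_test)[:, 1]
auc_from_proba = roc_auc_score(ys_test, pos_proba)

# For comparison: AUC computed from hard 0/1 predictions
auc_from_labels = roc_auc_score(ys_test, auc_pipe.predict(Xs_test))
```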

## Regression Pipeline

In [16]:
from sklearn.linear_model import LinearRegression

In [17]:
from sklearn import (model_selection, preprocessing)
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this cell needs an older release

In [18]:
b = load_boston()
bos_X = pd.DataFrame(b.data, columns = b.feature_names)
bos_y = b.target

In [19]:
bos_X_train, bos_X_test, bos_y_train, bos_y_test = model_selection.train_test_split(bos_X, bos_y, test_size=0.3, random_state=42)

In [21]:
from sklearn.pipeline import Pipeline

reg_pip = Pipeline(
[
    ('std', preprocessing.StandardScaler()),
    ('lr', LinearRegression())
])
reg_pip.fit(bos_X_train, bos_y_train)
reg_pip.score(bos_X_test, bos_y_test)

0.7112260057484934

If we want to pull parts out of the pipeline to examine their properties, we can do that with the `named_steps` attribute:

In [22]:
reg_pip.named_steps['lr'].intercept_

23.01581920903956

In [23]:
reg_pip.named_steps['lr'].coef_

array([-1.10834602,  0.80843998,  0.34313466,  0.81386426, -1.79804295,
        2.913858  , -0.29893918, -2.94251148,  2.09419303, -1.44706731,
       -2.05232232,  1.02375187, -3.88579002])
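
Pipelines can also be indexed directly, by stage name or by position, which returns the same fitted objects as `named_steps`. The sketch below uses synthetic regression data and an assumed name `reg_demo`, so it does not depend on the Boston dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration
X_reg, y_reg = make_regression(n_samples=100, n_features=3, noise=1.0, random_state=0)

reg_demo = Pipeline([
    ('std', StandardScaler()),
    ('lr', LinearRegression()),
])
reg_demo.fit(X_reg, y_reg)

by_name = reg_demo['lr']   # same object as reg_demo.named_steps['lr']
by_position = reg_demo[-1] # last stage of the pipeline
preprocessing_only = reg_demo[:-1]  # slicing returns a sub-pipeline

# Apply just the preprocessing stages to the data
X_scaled = preprocessing_only.transform(X_reg)
```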

We can use the pipeline in metric calculations as well:

In [24]:
from sklearn import metrics

metrics.mean_squared_error(bos_y_test, reg_pip.predict(bos_X_test))

21.517444231177198

## PCA Pipeline

In [25]:
X = pd.read_excel('X.xls')

In [27]:
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler


pca_pipe = Pipeline(
[
    ('std', StandardScaler()),
    ('pca', PCA())
])

X_pca = pca_pipe.fit_transform(X)

Using the `.named_steps` attribute, we can pull properties off of the PCA portion of the pipeline:

In [29]:
pca_pipe.named_steps['pca'].explained_variance_ratio_

array([0.22412174, 0.17480016, 0.16419943, 0.10385405, 0.09977459,
       0.08050957, 0.06302402, 0.05486078, 0.03452659, 0.00032905])
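
A common follow-up is to accumulate these ratios to see how many components are needed to reach a given share of the variance. The sketch below does this on random data; `X_rand`, `demo_pca_pipe`, and the 90% threshold are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random data, used only for illustration
rng = np.random.default_rng(0)
X_rand = rng.normal(size=(100, 4))

demo_pca_pipe = Pipeline([
    ('std', StandardScaler()),
    ('pca', PCA()),
])
demo_pca_pipe.fit(X_rand)

ratios = demo_pca_pipe.named_steps['pca'].explained_variance_ratio_
cumulative = np.cumsum(ratios)  # running total of explained variance

# Smallest number of components whose cumulative ratio reaches 0.9
n_for_90 = int(np.searchsorted(cumulative, 0.9) + 1)
```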

In [30]:
pca_pipe.named_steps['pca'].components_[0]

array([-0.01901925, -0.47269871,  0.27401873, -0.04223445,  0.0287613 ,
        0.46153822,  0.16841706, -0.43706922, -0.03158823,  0.5148668 ])