# Pipelines
> **Pipelines** keep our code neat and clean all the way from gathering and cleaning data to creating and fine-tuning models!

## Advantages of `Pipeline`

### Reduces Complexity
> You can focus on one part of the pipeline at a time and debug or adjust it as needed.

### Convenient
> You can summarize your fine-detail steps into the pipeline. That way you can focus on the big-picture aspects.

### Flexible

> You can apply the same pipeline to different models and run optimization techniques such as grid search and random search over its hyperparameters!
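
As a rough sketch of that flexibility, the final step of a pipeline can itself be treated as a hyperparameter. The step names `scaler` and `model`, the candidate estimators, and the parameter values below are all assumptions chosen for illustration, and the iris data is used only because it ships with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 'scaler' and 'model' are illustrative step names
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Each dict swaps a different estimator into the 'model' step
# and tunes one of that estimator's own hyperparameters
param_grid = [
    {'model': [LogisticRegression(max_iter=1000)], 'model__C': [0.1, 1.0]},
    {'model': [KNeighborsClassifier()], 'model__n_neighbors': [3, 5]},
    {'model': [DecisionTreeClassifier(random_state=0)], 'model__max_depth': [2, 4]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

# Name of the estimator class that scored best in cross-validation
best_model_name = type(search.best_estimator_.named_steps['model']).__name__
print(best_model_name, search.best_params_)
```

Passing a list of dicts keeps each model's hyperparameters separate, so the search never tries `n_neighbors` on a decision tree.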

## Example of Using `Pipeline`

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats as stats

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split, GridSearchCV,\
cross_val_score, RandomizedSearchCV

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [3]:
# Getting some data
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=27)

### Without the Pipeline class

In [4]:
# Define transformers (will adjust/massage the data)
imputer = SimpleImputer(strategy="median") # replaces missing values
std_scaler = StandardScaler() # scales the data

# Define the classifier (predictor) to train
dt_clf = DecisionTreeClassifier()

# Fit each transformer on the training data, then train the classifier
X_train_filled = imputer.fit_transform(X_train)
X_train_scaled = std_scaler.fit_transform(X_train_filled)
dt_clf.fit(X_train_scaled, y_train)

# Predict using the trained classifier (the same transformations must be
# repeated on the test data, using transform() rather than fit_transform())
X_test_filled = imputer.transform(X_test)
X_test_scaled = std_scaler.transform(X_test_filled)
y_pred = dt_clf.predict(X_test_scaled)

> Note that if we were to add more steps in this process, we'd have to change both the *training* and *testing* processes.

### With `Pipeline` Class

In [None]:
pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), 
        ('std_scaler', StandardScaler()),
        ('dt_clf', DecisionTreeClassifier()),
])


# Train the pipeline (transformations & predictor)
pipeline.fit(X_train, y_train)

# Predict using the pipeline (includes the transformers & trained predictor)
predicted = pipeline.predict(X_test)

> If we need to change our process, we change it _just once_ in the `Pipeline`.
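
One way to make such a change, sketched here under assumptions: `set_params` can replace a whole step or adjust a nested parameter. The step names `scaler` and `clf`, the switch to `MinMaxScaler`, and `max_depth=3` are illustrative choices, not part of the example above.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier(random_state=0)),
])

# Change the process in one place: swap the scaler and cap the tree depth
pipeline.set_params(scaler=MinMaxScaler(), clf__max_depth=3)

# Both training and prediction pick up the change automatically
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

Because the transformations live inside the pipeline, there is no separate test-time code path to keep in sync.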

## Grid Searching a Pipeline

In [5]:
penguins = sns.load_dataset('penguins')
penguins = penguins.dropna()

In [6]:
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


In [7]:
y = penguins.pop('species')
X_train, X_test, y_train, y_test = train_test_split(
    penguins, y, test_size=0.5, random_state=42)

In [8]:
X_train_nums = X_train.select_dtypes('float64')



In [9]:
X_train_cat = X_train.select_dtypes('object')



> An intermediate step: separate the numerical and categorical columns so each group can be transformed differently.

In [10]:
numerical_pipeline = Pipeline(steps=[
    ('ss', StandardScaler())
])
                
categorical_pipeline = Pipeline(steps=[
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`,
    # and `sparse` was removed in 1.4
    ('ohe', OneHotEncoder(drop='first',
                         sparse=False))
])

trans = ColumnTransformer(transformers=[
    ('numerical', numerical_pipeline, X_train_nums.columns),
    ('categorical', categorical_pipeline, X_train_cat.columns)
])
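
To see what a `ColumnTransformer` like this produces, here is a minimal sketch on a hand-made frame. The four rows are made-up values that only mimic the penguins columns, and the transformers are passed directly rather than wrapped in one-step pipelines to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the penguins features (values invented for illustration)
toy = pd.DataFrame({
    'bill_length_mm': [39.1, 46.5, 50.0, 37.8],
    'body_mass_g': [3750.0, 5200.0, 3900.0, 3400.0],
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe'],
    'sex': ['Male', 'Female', 'Male', 'Female'],
})

num_cols = ['bill_length_mm', 'body_mass_g']
cat_cols = ['island', 'sex']

toy_trans = ColumnTransformer(transformers=[
    ('numerical', StandardScaler(), num_cols),               # scale numeric columns
    ('categorical', OneHotEncoder(drop='first'), cat_cols),  # encode categories
])

transformed = toy_trans.fit_transform(toy)
print(transformed.shape)
```

The result has 5 columns: 2 scaled numeric columns, 2 island indicators (3 islands minus the dropped first category), and 1 sex indicator.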

In [11]:
model_pipe = Pipeline(steps=[
    ('trans', trans),
    ('knn', KNeighborsClassifier())
])

> Finally, we can fit the full pipeline just like a single estimator.

In [12]:
model_pipe.fit(X_train, y_train)

Pipeline(steps=[('trans',
                 ColumnTransformer(transformers=[('numerical',
                                                  Pipeline(steps=[('ss',
                                                                   StandardScaler())]),
                                                  Index(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], dtype='object')),
                                                 ('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 sparse=False))]),
                                                  Index(['island', 'sex'], dtype='object'))])),
                ('knn', KNeighborsClassifier())])

In [13]:
model_pipe.score(X_train, y_train)

0.9939759036144579

> Performing grid search on the full pipeline

In [14]:
pipe_grid = {'knn__n_neighbors': [3, 5, 7], 'knn__p': [1, 2, 3]}

gs_pipe = GridSearchCV(estimator=model_pipe, param_grid=pipe_grid)
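
The grid keys follow the pattern step name, double underscore, parameter name. A quick way to check which names are valid is `get_params()`. The sketch below assumes a simplified two-step pipeline (`ss` and `knn`) rather than the full `model_pipe` above.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the pipeline above (assumption: no ColumnTransformer)
demo_pipe = Pipeline(steps=[
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Every tunable parameter of the 'knn' step, in <step>__<parameter> form
knn_param_names = sorted(
    name for name in demo_pipe.get_params() if name.startswith('knn__'))
print(knn_param_names)
```

Nested steps chain the same way, so a parameter of a transformer inside a `ColumnTransformer` step would need one prefix per level.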

In [15]:
gs_pipe.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('trans',
                                        ColumnTransformer(transformers=[('numerical',
                                                                         Pipeline(steps=[('ss',
                                                                                          StandardScaler())]),
                                                                         Index(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], dtype='object')),
                                                                        ('categorical',
                                                                         Pipeline(steps=[('ohe',
                                                                                          OneHotEncoder(drop='first',
                                                                                                        sparse=False))]),
                                                               

In [16]:
pd.DataFrame(gs_pipe.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn__n_neighbors,param_knn__p,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.006886,0.001415,0.005512,0.002026,3,1,"{'knn__n_neighbors': 3, 'knn__p': 1}",1.0,1.0,0.969697,1.0,1.0,0.993939,0.012121,1
1,0.005278,0.000177,0.003808,5.4e-05,3,2,"{'knn__n_neighbors': 3, 'knn__p': 2}",1.0,1.0,0.969697,1.0,1.0,0.993939,0.012121,1
2,0.00477,0.000128,0.003662,9.1e-05,3,3,"{'knn__n_neighbors': 3, 'knn__p': 3}",1.0,1.0,0.969697,1.0,1.0,0.993939,0.012121,1
3,0.005134,0.000411,0.003782,0.000305,5,1,"{'knn__n_neighbors': 5, 'knn__p': 1}",1.0,1.0,0.969697,1.0,1.0,0.993939,0.012121,1
4,0.005306,0.000462,0.003821,0.000408,5,2,"{'knn__n_neighbors': 5, 'knn__p': 2}",1.0,1.0,0.969697,1.0,1.0,0.993939,0.012121,1
5,0.005406,0.000511,0.004092,0.000496,5,3,"{'knn__n_neighbors': 5, 'knn__p': 3}",1.0,1.0,0.969697,1.0,1.0,0.993939,0.012121,1
6,0.00518,0.000433,0.003918,0.000481,7,1,"{'knn__n_neighbors': 7, 'knn__p': 1}",1.0,1.0,0.969697,0.969697,1.0,0.987879,0.014845,7
7,0.0051,0.000492,0.003859,0.000416,7,2,"{'knn__n_neighbors': 7, 'knn__p': 2}",1.0,1.0,0.969697,0.969697,1.0,0.987879,0.014845,7
8,0.005052,0.000432,0.003957,0.000477,7,3,"{'knn__n_neighbors': 7, 'knn__p': 3}",1.0,1.0,0.969697,0.969697,1.0,0.987879,0.014845,7


In [17]:
gs_pipe.best_params_

{'knn__n_neighbors': 3, 'knn__p': 1}
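
After the search, the refit pipeline is available as `best_estimator_` and can be scored on held-out data. This sketch substitutes the iris data and a two-step pipeline so that it runs on its own. The grid matches the one above, but the data and therefore the scores are not the penguins results.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: iris stands in for the penguins data to keep this self-contained
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

demo_pipe = Pipeline(steps=[
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])
demo_grid = {'knn__n_neighbors': [3, 5, 7], 'knn__p': [1, 2, 3]}

demo_search = GridSearchCV(estimator=demo_pipe, param_grid=demo_grid)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, the best combination is refit on all of X_tr
best_pipe = demo_search.best_estimator_

# score() delegates to the refit pipeline, so scaling is applied automatically
test_accuracy = demo_search.score(X_te, y_te)
print(demo_search.best_params_, test_accuracy)
```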

## A Note on Data Leakage
Note we still have to be careful in performing a grid search!

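We can accidentally "leak" information by fitting transformations on the whole data set instead of on just the training portion! During cross-validation, this means each validation fold has already influenced the transformer before the model is scored on it.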
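
A small numeric sketch of the problem, using synthetic data and a single hypothetical cross-validation split: a scaler fit on every row learns a different mean than one fit on the training fold alone, so the validation rows have shaped the transformation.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature data set (values are illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(20, 1))

# Take the first of five train/validation splits
train_idx, val_idx = next(KFold(n_splits=5).split(X_demo))

leaky_scaler = StandardScaler().fit(X_demo)             # sees validation rows too
clean_scaler = StandardScaler().fit(X_demo[train_idx])  # sees training rows only

print(leaky_scaler.mean_, clean_scaler.mean_)
```

The difference is small here, but it means any score computed on the validation rows is no longer based on truly unseen data.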

### Example of leaking information

In [18]:
scaler = StandardScaler()
# Leak: the scaler is fit on ALL of X_train before cross-validation, so every
# validation fold has already contributed to the scaling statistics
scaled_data = scaler.fit_transform(X_train.select_dtypes('float64'))

parameters = {
    'n_neighbors': [1, 3, 5],
    'metric': ['minkowski', 'manhattan'],
    'weights': ['uniform', 'distance']
}

knn_clf = KNeighborsClassifier()
clf = GridSearchCV(knn_clf, parameters)
# Grid search on the pre-scaled data, which is where the leaked statistics are used
clf.fit(scaled_data, y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'metric': ['minkowski', 'manhattan'],
                         'n_neighbors': [1, 3, 5],
                         'weights': ['uniform', 'distance']})

### Example of Grid Search with no leakage

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier())
])

# Parameter names take the form `STEP_NAME__PARAMETER` (step name, double underscore, parameter)
parameters = {
    'scaler__with_mean': [True, False],
    'clf__n_neighbors': [1, 3, 5],
    'clf__metric': ['minkowski', 'manhattan'],
    'clf__weights': ['uniform', 'distance']
}

cv = GridSearchCV(pipeline, param_grid=parameters)

cv.fit(X_train.select_dtypes('float64'), y_train)
y_pred = cv.predict(X_test.select_dtypes('float64'))
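
The same leak-free pattern carries over to random search, which samples a fixed number of combinations instead of trying all of them. The following is a sketch under assumptions: iris stands in for the penguins data, and the `randint(1, 10)` range and `n_iter=5` are arbitrary illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: iris keeps the example self-contained
X_rand, y_rand = load_iris(return_X_y=True)

rand_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier()),
])

# Distributions are sampled; plain lists are sampled uniformly
param_distributions = {
    'clf__n_neighbors': randint(1, 10),   # integers 1 through 9
    'clf__weights': ['uniform', 'distance'],
}

rand_search = RandomizedSearchCV(
    rand_pipeline, param_distributions, n_iter=5, cv=5, random_state=0)
rand_search.fit(X_rand, y_rand)

print(rand_search.best_params_)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, just as in the grid search above.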