## Pipelining in Sklearn 

What is Pipelining?
- Pipelining chains multiple steps together: the output of step 1 becomes the input to step 2. Essentially, a pipeline is an ordered list of instructions used to preprocess our train and test data.
- It makes life easier: one object applies the same preprocessing, fitted on the training data, to both train and test data.
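
To make the chaining concrete, here is a minimal sketch comparing a pipeline with the same two steps written out by hand. The arrays are toy values made up for illustration; only `SimpleImputer`, `LogisticRegression` and `make_pipeline` come from scikit-learn.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (made up for illustration) with missing values
X_train = np.array([[10.0, 25.0], [20.0, 20.0], [np.nan, 5.0], [2.0, 3.0]])
y_train = np.array(["A", "A", "B", "B"])
X_test = np.array([[30.0, 12.0], [5.0, 10.0], [15.0, np.nan]])

# Manual version: fit the imputer on train, then reuse it on test
imputer = SimpleImputer()
clf = LogisticRegression()
clf.fit(imputer.fit_transform(X_train), y_train)
manual_pred = clf.predict(imputer.transform(X_test))

# Pipeline version: the same two steps chained into one estimator
pipe = make_pipeline(SimpleImputer(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both approaches should agree on every test row
print((manual_pred == pipe_pred).all())
```

The pipeline version removes the bookkeeping of calling `fit_transform` on train and `transform` on test in the right order.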


Tips adapted from https://github.com/justmarkham/scikit-learn-tips, a recommended repo for further sklearn tips.

In [None]:
import pandas as pd
import numpy as np
train = pd.DataFrame({'feat1':[10, 20, np.nan, 2], 'feat2':[25., 20, 5, 3], 'label':['A', 'A', 'B', 'B']})
test = pd.DataFrame({'feat1':[30., 5, 15], 'feat2':[12, 10, np.nan]})

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

In [None]:
imputer = SimpleImputer()
log_reg = LogisticRegression()

In [None]:
# Build a two-step pipeline with make_pipeline
pipe = make_pipeline(imputer,
                    log_reg)

In [None]:
features = ['feat1', 'feat2']

In [None]:
X, y = train[features], train['label']
X_new = test[features]

In [None]:
# pipeline applies the imputer to X before fitting the classifier
pipe.fit(X, y)

# pipeline applies the imputer to X_new before making predictions
# note: pipeline uses imputation values learned during the "fit" step
pipe.predict(X_new)
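
One way to see the imputation values learned during `fit` is to inspect the fitted imputer inside the pipeline. This is a small sketch that rebuilds the same toy `train` frame so it runs on its own; the variable names `fitted_imputer` and `learned_means` are just illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.DataFrame({'feat1': [10, 20, np.nan, 2],
                      'feat2': [25., 20, 5, 3],
                      'label': ['A', 'A', 'B', 'B']})
features = ['feat1', 'feat2']

pipe = make_pipeline(SimpleImputer(), LogisticRegression())
pipe.fit(train[features], train['label'])

# make_pipeline names each step after its lowercased class name
fitted_imputer = pipe.named_steps['simpleimputer']

# statistics_ holds the per-column fill values (column means by default)
learned_means = fitted_imputer.statistics_
print(learned_means)
```

These are the values used to fill missing entries in any new data passed to `pipe.predict`.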

## Pipeline vs. make_pipeline in Sklearn

Pipeline requires you to name each step; make_pipeline generates the step names for you. **make_pipeline is easier to use because it needs less syntax.**

- I would recommend using make_pipeline.
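
As a rough illustration of the naming difference, the sketch below builds the same two-step pipeline both ways and prints the step names. The names `'imputer'` and `'classifier'` are arbitrary choices made for this example.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

# make_pipeline derives step names from the lowercased class names
auto_named = make_pipeline(SimpleImputer(), LogisticRegression())

# Pipeline needs an explicit (name, estimator) tuple for every step
hand_named = Pipeline([('imputer', SimpleImputer()),
                       ('classifier', LogisticRegression())])

auto_names = [name for name, _ in auto_named.steps]
hand_names = [name for name, _ in hand_named.steps]
print(auto_names)
print(hand_names)

# Step names act as the prefix when setting nested parameters
auto_named.set_params(logisticregression__C=0.5)
hand_named.set_params(classifier__C=0.5)
```

The step names matter mostly when you tune or inspect individual steps, since parameters are addressed as `<step name>__<parameter>`.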

In [None]:
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)

In [None]:
cols = ['Embarked', 'Sex', 'Age', 'Fare']
X = df[cols]

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
imp = SimpleImputer()
clf = LogisticRegression()

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [None]:
# Pass tuples of (transformer, list of columns to apply it to)
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (imp, ['Age']),
    remainder='passthrough')

In [None]:
pipe = make_pipeline(ct, clf)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
# Uses more code: you pass a list of tuples, and each tuple requires (made-up name, transformer, list of columns)
ct = ColumnTransformer(
    [('encoder', ohe, ['Embarked', 'Sex']),
     ('imputer', imp, ['Age'])],
    remainder='passthrough') #any columns not named, pass through

In [None]:
pipe = Pipeline([('preprocessor', ct), ('classifier', clf)])

## Using FunctionTransformer to build custom transformers

Use Case: Feature engineering within a Column Transformer or Pipeline.
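
As a minimal sketch of the idea, `FunctionTransformer` can wrap any function that maps an array to an array. The example below wraps `np.log1p` and supplies `np.expm1` as the inverse; the input values are made up.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a NumPy function so it exposes fit/transform like any other transformer
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

values = np.array([[0.0], [9.0], [99.0]])

# fit does not learn anything here; transform simply calls np.log1p
logged = log_transformer.fit_transform(values)

# inverse_transform calls the supplied inverse function
restored = log_transformer.inverse_transform(logged)
print(logged)
print(restored)
```

Because the wrapped object follows the transformer interface, it can be dropped into a column transformer or pipeline like any built-in step.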



In [None]:
from sklearn.preprocessing import FunctionTransformer

In [None]:
X = pd.DataFrame({'Fare':[200, 300, 50, 900],
                  'Code':['X12', 'Y20', 'Z7', np.nan],
                  'Deck':['A101', 'C102', 'A200', 'C300']})

In [None]:
# Convert an existing function into a transformer
clip_values = FunctionTransformer(np.clip, kw_args={'a_min':100, 'a_max':600}) # clips values to the range [100, 600]

In [None]:
# convert custom function into a transformer
# extract the first letter from each string
def first_letter(df):
    return df.apply(lambda x: x.str.slice(0, 1))

# wrap the custom function in a FunctionTransformer
get_first_letter = FunctionTransformer(first_letter)

In [None]:
ct = make_column_transformer(
    (clip_values, ['Fare']),
    (get_first_letter, ['Code', 'Deck']))

In [None]:
X

In [None]:
ct.fit_transform(X)

# Preprocessing Pipeline Examples

In [None]:
cols = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

In [None]:
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

In [None]:
X

In [None]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]



In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer, ColumnTransformer
from sklearn.pipeline import make_pipeline, make_union, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score



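### Below are four different methods that use various combinations of the above functions. Methods 1 to 3 produce the same preprocessing and model; Method 4 differs because it adds a custom transformation and uses a different classifier.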

In [None]:
# Method 1: make_column_selector with make_pipeline (recommended method)

# set up preprocessing for numeric columns
imp_median = SimpleImputer(strategy='median', add_indicator=True)
scaler = StandardScaler()

# set up preprocessing for categorical columns
imp_constant = SimpleImputer(strategy='constant')
ohe = OneHotEncoder(handle_unknown='ignore')

# select columns by data type
num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')


# do all preprocessing
preprocessor = make_column_transformer(
    (make_pipeline(imp_median, scaler), num_cols),
    (make_pipeline(imp_constant, ohe), cat_cols))

# create a pipeline
pipe = make_pipeline(preprocessor, LogisticRegression())

In [None]:
# Method 2 - more complicated, with more steps
cat_feats = X.dtypes[X.dtypes == 'object'].index.tolist()
num_feats = X.dtypes[~X.dtypes.index.isin(cat_feats)].index.tolist()

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

# Using own function in Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

# we will start two separate pipelines for each type of features
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)

pipe_num = Pipeline([
    ("num_feats", keep_num),
    ("impute_num", imp_median),
    ("scaler", scaler)
])

pipe_cat = Pipeline([
    ('cat_feats', keep_cat),
    ('impute_cat', imp_constant),
    ('ohe', ohe)
])

union = FeatureUnion([('num_process', pipe_num), # FeatureUnion applies each pipeline to the same input and joins the outputs side by side
                     ('cat_process', pipe_cat)])

## OR use make_union, a shorthand version of FeatureUnion that names the steps automatically
#union = make_union(pipe_num, pipe_cat)

pipe = Pipeline([('all_features', union), 
                ('model', LogisticRegression())])
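
For a smaller picture of what a feature union does, here is a sketch with a single made-up column: two transformers receive the same input, and their outputs are concatenated column-wise. The choice of `StandardScaler` and `np.log1p` is only for illustration.

```python
import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

values = np.array([[1.0], [10.0], [100.0]])

# Both transformers see the same input column
union = make_union(StandardScaler(), FunctionTransformer(np.log1p))

# Output has one column per transformer: [scaled value, log1p value]
combined = union.fit_transform(values)
print(combined.shape)
print(combined)
```

In Method 2 the two branches first select different columns, so the union ends up joining the processed numeric features with the processed categorical features.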

In [None]:
# Method 3 - similar to Method 2 but with ColumnTransformer; it uses make_column_selector, so keep_num and keep_cat are not needed

# select columns by data type
num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')

pipe_num = Pipeline([
    ("impute_num", imp_median),
    ("scaler", scaler)
])

pipe_cat = Pipeline([
    ('impute_cat', imp_constant),
    ('ohe', ohe)
])

pre_process = ColumnTransformer([("num", pipe_num, num_cols),
                                ("cat",pipe_cat, cat_cols)],
                               remainder = 'passthrough') #Any columns not selected in the above two steps will not be modified
    
pipe = Pipeline([('all_features', pre_process), 
                ('model', LogisticRegression())])
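
As a quick sanity check that the Method 1 and Method 3 styles preprocess data the same way, the sketch below fits both on a tiny made-up frame and compares the outputs. The helper functions `numeric_steps`, `categorical_steps` and `to_dense` are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (made-up values) with one numeric and one categorical column
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 58.0],
    'Embarked': pd.Series(['S', 'C', np.nan, 'S'], dtype=object),
})

num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')

def numeric_steps():
    # fresh, unfitted estimators for the numeric branch
    return [SimpleImputer(strategy='median', add_indicator=True), StandardScaler()]

def categorical_steps():
    # fresh, unfitted estimators for the categorical branch
    return [SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore')]

# Method 1 style: make_column_transformer with make_pipeline
method_1 = make_column_transformer(
    (make_pipeline(*numeric_steps()), num_cols),
    (make_pipeline(*categorical_steps()), cat_cols))

# Method 3 style: ColumnTransformer with explicitly named Pipelines
imp_num, scaler = numeric_steps()
imp_cat, ohe = categorical_steps()
method_3 = ColumnTransformer([
    ('num', Pipeline([('impute_num', imp_num), ('scaler', scaler)]), num_cols),
    ('cat', Pipeline([('impute_cat', imp_cat), ('ohe', ohe)]), cat_cols)],
    remainder='passthrough')

def to_dense(matrix):
    # ColumnTransformer may return a sparse matrix; normalise to a dense array
    return matrix.toarray() if hasattr(matrix, 'toarray') else np.asarray(matrix)

out_1 = to_dense(method_1.fit_transform(toy))
out_3 = to_dense(method_3.fit_transform(toy))
print(np.allclose(out_1, out_3))
```

The only real difference between the two styles is whether you name the steps yourself.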

In [None]:
# Method 4 - similar to Method 1 but with FunctionTransformer, where we create a custom transformation
from sklearn.ensemble import RandomForestClassifier

# multiply Fare by 10; Fare is also selected by num_cols below, so this adds a new column next to the scaled Fare
def fare_x10(df):
    return df.apply(lambda x: x * 10)

# Wrap the function in a FunctionTransformer so it can be used in a ColumnTransformer or make_column_transformer
fare_x10_function = FunctionTransformer(fare_x10)

# set up preprocessing for numeric columns
imp_median = SimpleImputer(strategy='median', add_indicator=True)
scaler = StandardScaler()

# set up preprocessing for categorical columns
imp_constant = SimpleImputer(strategy='constant')
ohe = OneHotEncoder(handle_unknown='ignore')

# select columns by data type
num_cols = make_column_selector(dtype_include='number')
cat_cols = make_column_selector(dtype_exclude='number')


# do all preprocessing
preprocessor = make_column_transformer(
    (fare_x10_function, ["Fare"]),
    (make_pipeline(imp_median, scaler), num_cols),
    (make_pipeline(imp_constant, ohe), cat_cols), remainder = 'passthrough')

# create a pipeline
pipe = make_pipeline(preprocessor, RandomForestClassifier())

In [None]:
# cross-validate the pipeline
cross_val_score(pipe, X, y).mean()

In [None]:
# fit the pipeline and make predictions
pipe.fit(X, y)
pipe.predict(X_new)

In [None]:
import joblib
joblib.dump(pipe, "pipe.joblib")

In [None]:
#load pipeline
model = joblib.load('pipe.joblib')

model.predict(X_new)
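
A self-contained way to try the save and load round trip, without leaving a file behind, is to write to a temporary directory. This is only a sketch on made-up data; it assumes the pipeline is loaded with the same scikit-learn version that saved it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (made up for illustration)
X_toy = np.array([[10.0, 25.0], [20.0, 20.0], [np.nan, 5.0], [2.0, 3.0]])
y_toy = np.array(['A', 'A', 'B', 'B'])

pipe = make_pipeline(SimpleImputer(), LogisticRegression()).fit(X_toy, y_toy)

# Save the fitted pipeline, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipe.joblib')
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries its fitted preprocessing and model
same_predictions = (pipe.predict(X_toy) == restored.predict(X_toy)).all()
print(same_predictions)
```

Saving the whole pipeline, rather than only the model, keeps the fitted preprocessing steps together with the classifier.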