## Creating Pipelines

In notebooks [2](02-feature-engineering.ipynb) and [3](03-model-logistic-regression.ipynb) we developed a set of feature engineering stages and trained a logistic regression model. In this notebook we will combine them into a single pipeline.

Machine learning pipelines allow you to precisely specify a set of transformations which start with raw data and result in a model. They make it possible to re-train the same model repeatedly, using different parameter values, and to reapply these same transformations to raw data in production, resulting in predictions.  
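As a minimal sketch of the re-training idea, assuming a small synthetic dataset and a toy two-step pipeline (the data and the step names `scaler` and `clf` are hypothetical, not the ones used in this notebook), `set_params` changes a parameter of a nested step and `fit` then re-runs every stage:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only: two features and a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hypothetical toy pipeline; the step names are illustrative.
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Re-train the same pipeline with different regularisation strengths.
# Nested parameters are addressed as "<step name>__<parameter>".
scores = {}
for C in (0.01, 1.0, 100.0):
    toy_pipeline.set_params(clf__C=C)
    toy_pipeline.fit(X, y)
    scores[C] = toy_pipeline.score(X, y)  # accuracy on the training data
```

Because the scaler and the classifier live in one object, each call to `fit` refits both, so the preprocessing always matches the model it feeds.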

First we load our data and split it into training and test sets:

In [1]:
import numpy as np
import pandas as pd
df = pd.read_parquet("fraud-cleaned-sample.parquet")

In [2]:
from sklearn import model_selection
train, test = model_selection.train_test_split(df, random_state=43)

Now we load the pipeline steps we created in earlier notebooks. These are `feat_pipeline.pkl` and `lr.pkl`, corresponding to the feature engineering stages and the logistic regression model, respectively.

In [3]:
import cloudpickle as cp
# Use context managers so the files are closed once loading finishes
with open('feat_pipeline.pkl', 'rb') as f:
    feature_pipeline = cp.load(f)
with open('lr.pkl', 'rb') as f:
    model = cp.load(f)

Now we can combine these stages into a single pipeline:

In [4]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('features', feature_pipeline),
    ('model', model)
])

In [5]:
pipeline

Pipeline(steps=[('features',
                 Pipeline(steps=[('feature_extraction',
                                  ColumnTransformer(transformers=[('interarrival_scaler',
                                                                   Pipeline(steps=[('median_imputer',
                                                                                    SimpleImputer(strategy='median')),
                                                                                   ('interarrival_scaler',
                                                                                    RobustScaler())]),
                                                                   ['interarrival']),
                                                                  ('amount_scaler',
                                                                   RobustScaler(),
                                                                   ['amount']),
                                                             

Here you can see all the transformations and parameters used in the pipeline. 
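The same information can be read programmatically. The sketch below uses a hypothetical toy pipeline (the step names `scaler` and `clf` and the value `C=0.5` are assumptions for illustration) to show the `steps`, `named_steps` and `get_params` accessors that every scikit-learn `Pipeline` provides:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical toy pipeline, standing in for the one built above.
toy_pipeline = Pipeline([
    ("scaler", RobustScaler()),
    ("clf", LogisticRegression(C=0.5)),
])

# Step names, in the order they are applied.
step_names = [name for name, _ in toy_pipeline.steps]

# Look up an individual step by its name.
scaler = toy_pipeline.named_steps["scaler"]

# All parameters, with nested ones flattened to "<step name>__<parameter>".
params = toy_pipeline.get_params()
clf_C = params["clf__C"]
```

The flattened parameter names returned by `get_params` are the same keys that `set_params` accepts.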

We can refit the whole pipeline to training data:

In [6]:
pipeline.fit(train, y=train["label"])

Pipeline(steps=[('features',
                 Pipeline(steps=[('feature_extraction',
                                  ColumnTransformer(transformers=[('interarrival_scaler',
                                                                   Pipeline(steps=[('median_imputer',
                                                                                    SimpleImputer(strategy='median')),
                                                                                   ('interarrival_scaler',
                                                                                    RobustScaler())]),
                                                                   ['interarrival']),
                                                                  ('amount_scaler',
                                                                   RobustScaler(),
                                                                   ['amount']),
                                                             

We can use this pipeline to make predictions. Let's predict for our test set:

In [8]:
pipeline.predict(test)

array(['legitimate', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'legitimate'], dtype=object)

Let's now save this pipeline as one pickled object:

In [7]:
with open("pipeline.pkl", "wb") as f:
    cp.dump(pipeline, f)
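To illustrate why a single pickled object is convenient, here is a minimal round-trip sketch. It makes several assumptions: it uses the standard-library `pickle` module and an in-memory buffer in place of `cloudpickle` and `pipeline.pkl`, and a hypothetical toy pipeline trained on synthetic data with illustrative labels. The restored object should reproduce the original's predictions without any refitting:

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with string labels, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, "fraud", "legitimate")

# Hypothetical toy pipeline, fitted once.
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X, y)

# Serialise to an in-memory buffer (a stand-in for a file on disk).
buffer = io.BytesIO()
pickle.dump(toy_pipeline, buffer)

# Restore the pipeline, as a separate serving process might do.
buffer.seek(0)
restored = pickle.load(buffer)

# The restored pipeline applies the same fitted transformations and model.
same_predictions = np.array_equal(toy_pipeline.predict(X), restored.predict(X))
```

Since the fitted scaler and model travel together, the code that loads the pipeline only needs to call `predict` on raw data.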
