# Python Data Mining Quick Start Guide
## Ch 7 - Building a Data Processing Pipeline and Deploying
### Copyright: Nathan Greeneltch, PhD 2019

#### These code examples and descriptions are meant to accompany the book "Python Data Mining Quick Start Guide" by Nathan Greeneltch. For full background on the topics and introduction sections, please purchase the book.

In [1]:
# initial imports
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_context("paper", font_scale=1.5)
sns.set_style("white")

# Pipelining Your Analysis

A **pipelined** analysis is a series of steps stored as a single function or object. Besides providing a framework for your analysis, the most important reason for pipelining becomes apparent when you examine what is required to reproduce your workflow or apply it to new data. Now that you've seen a nice collection of various data mining methods, it's a good time to acknowledge some facts:

* Most analysis workflows have multiple steps (cleaning, scaling, transforming, clustering, etc.).
* In order to reproduce the workflow, all the steps must be done in the exact right order.
* Failure to reproduce the steps exactly can result in bad information, often failing silently.
* Humans make mistakes so we need to guard against them.

The perfect tool for guarding against mistakes is to build a pipeline, test it locally, and deploy the entire pipeline as a finished product. 
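
To make the idea of "a series of steps stored as a single function" concrete, here is a minimal sketch in plain NumPy. The function name `build_pipeline`, the two steps (drop rows with missing values, then scale), and the toy arrays are all assumptions made for illustration; they are not taken from the book's workflow.

```python
import numpy as np

def build_pipeline(train_data):
    """Learn scaling parameters once, then return one function that runs every step."""
    mean = train_data.mean(axis=0)
    std = train_data.std(axis=0)

    def pipeline(new_data):
        # step 1: cleaning, drop rows that contain missing values
        cleaned = new_data[~np.isnan(new_data).any(axis=1)]
        # step 2: scaling, reuse the statistics learned from the training data
        scaled = (cleaned - mean) / std
        return scaled

    return pipeline

# hypothetical toy data for illustration only
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
new = np.array([[2.0, 20.0], [np.nan, 5.0], [4.0, 40.0]])

pipeline = build_pipeline(train)
result = pipeline(new)  # the row with NaN is dropped, the rest is scaled
print(result.shape)
```

Because the steps and their order live inside one function, applying the workflow to new data is a single call and no step can be skipped or reordered by accident.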

TIP: It is a good idea to build your pipeline as you develop your analysis workflow. This will allow you to have confidence that the steps you applied are indeed captured correctly in the pipeline.

## Scikit-learn's Pipeline Object

Scikit-learn has a full-service **Pipeline** object that is compatible with objects that use both the transformer and estimator APIs. It also works together with **GridSearchCV**: you can pass the whole pipeline to a grid search for tuning, and the tuned pipeline will automatically be stored in the resulting object.

For our example, we will build a pipeline that transforms the data with PCA and then predicts labels with logistic regression. We begin by loading the iris dataset, importing the required modules, and splitting the data into train and test sets. We will use k-fold cross-validation in the grid search, so there is no need to make a separate validation set. Let's start with the following code:

In [8]:
### Building a Pipeline ###
# load iris and create X and y
from sklearn.datasets import load_iris
dataset = load_iris()
X,y = dataset.data, dataset.target

# import modules 
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# create train and test sets
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=.33)

Now we can instantiate the transformer and classifier objects and feed them into the pipeline (named **pipe**):

In [3]:
# instantiate the transformer and classifier objects
pca = PCA()
logistic = LogisticRegression(solver='liblinear', multi_class='ovr', C=1.5)

# instantiate a pipeline and add steps to the pipeline
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
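
The names given to each step matter, because Scikit-learn exposes every step's parameters under the pattern `<step name>__<parameter name>` (two underscores). The short sketch below inspects those names on a separate, hypothetical pipeline called `pipe_demo`; the default `LogisticRegression()` settings are an assumption made to keep the example small.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical pipeline used only to inspect parameter names
pipe_demo = Pipeline(steps=[('pca', PCA()), ('logistic', LogisticRegression())])

# look up a step by the name it was given
pca_step = pipe_demo.named_steps['pca']

# parameters of each step are exposed as <step name>__<parameter name>
param_names = sorted(pipe_demo.get_params().keys())
print([name for name in param_names if name.startswith('pca__')])

# the same naming is used when setting values on the pipeline
pipe_demo.set_params(pca__n_components=2, logistic__C=1.5)
print(pca_step.n_components)
```

These are the same keys a parameter grid uses to address individual steps during tuning.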

Next we will create the parameter grid that the grid search will use and instantiate the grid search object. Here we will test a few values of **n_components** for PCA and **C** for logistic regression using 5-fold cross-validation. Finally we fit our model to the data and print out the best parameters:

In [9]:
# set the parameter grid to be passed to the grid search
param_grid = {
    'pca__n_components': [2, 3, 4],
    'logistic__C': [0.5, 1, 5, 10],
}

# instantiate the grid search object and pass the pipe and param_grid
# (the iid argument was removed in scikit-learn 0.24, so it is not passed here)
model = GridSearchCV(pipe, param_grid, cv=5,
                     return_train_score=False)

# fit entire pipeline using grid search and 5-fold cross validation
model.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % model.best_score_)
print(model.best_params_)

Best parameter (CV score=0.959):
{'logistic__C': 5, 'pca__n_components': 3}


The full pipeline model can be used to predict on new data with the **.predict()** method:

In [10]:
# use the resulting pipeline to predict on new data
y_pred = model.predict(X_test)
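
One way to check these predictions is to compare them with the held-out labels. The sketch below is self-contained and rebuilds a similar model; the choice of accuracy as the metric, the fixed `random_state=0`, and the `LogisticRegression(max_iter=1000)` settings are assumptions made so the example runs repeatably on recent Scikit-learn versions, and they differ slightly from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
# random_state is an assumption added here so the sketch is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, random_state=0)

pipe = Pipeline(steps=[('pca', PCA()),
                       ('logistic', LogisticRegression(max_iter=1000))])
param_grid = {'pca__n_components': [2, 3, 4], 'logistic__C': [0.5, 1, 5, 10]}
model = GridSearchCV(pipe, param_grid, cv=5)
model.fit(X_train, y_train)

# score the predictions against the held-out test labels
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy: %0.3f" % test_accuracy)

# the pipeline refit with the best parameters is stored on the grid search object
best_pipe = model.best_estimator_
print(best_pipe.named_steps['pca'].n_components)
```

Calling `model.predict()` delegates to `best_estimator_`, so the PCA transform and the classifier are always applied together and in the right order.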

# Deployment of Model

Often in a production environment, deployment is the step where you release your model into the wild and let it run on unforeseen data. However, data mining also produces many local analysis workflows that don't necessarily need to be deployed but do need to be stored and re-loaded later in order to reproduce the analysis. Both of these use cases require what is called **model persistence**, which means the model must be stored so it can be loaded for later use. Python is an object-oriented language and, appropriately, Scikit-learn uses objects for most of its analysis routines. Storing an object is not as simple as storing a basic text file full of strings. It instead requires a process called **serialization** to store it in a reliable and error-free manner. One of the most popular serialization tools is **pickle**, a module in the Python standard library. It's what we will use for our serialization examples.
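
As a small illustration of what serialization means, the sketch below turns an ordinary Python object into bytes and back again, entirely in memory. The `record` dictionary is a hypothetical stand-in for a model object.

```python
import pickle

# hypothetical object standing in for a model
record = {'name': 'iris', 'n_features': 4, 'classes': [0, 1, 2]}

blob = pickle.dumps(record)    # serialize: Python object -> bytes
restored = pickle.loads(blob)  # deserialize: bytes -> a new Python object

print(type(blob))
print(restored == record, restored is record)
```

The restored object has the same contents as the original but is a separate object, which is exactly what is needed to rebuild a model in a later session.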

## Serialize Model and Store with Pickle Module

**Pickle** is compatible with Scikit-learn's transformers and estimators. Conveniently (and more importantly), it is also compatible with Scikit-learn's grid search and pipeline objects. It is very easy to use, as serialization and storage are accomplished with a single function called **pickle.dump()**. The following example will use pickle to serialize our pipeline model and store it in a file named "model.pkl":

In [11]:
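### Store Model for Later with Pickle ###
# import modules
import os
import pickle

# make sure the storage directory exists, then save the pipeline model to disk
os.makedirs('./model_storage', exist_ok=True)
with open('./model_storage/model.pkl', 'wb') as f:
    pickle.dump(model, f)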

## Load Serialized Model and Predict

Now, when we are ready to use the model either in production or locally, we simply load it back with pickle and store it in a new local object. We load and **deserialize** the file with the **pickle.load()** function and name the resulting object **model_load**. We can then use model_load as if it were the original version of the model. See the following code example for a demonstration:

In [12]:
# load the pipeline model from disk and deserialize
with open('./model_storage/model.pkl', 'rb') as f:
    model_load = pickle.load(f)

# use the loaded pipeline model to predict on new data
y_pred = model_load.predict(X_test)
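
A simple sanity check after loading is to confirm that the stored and reloaded models make identical predictions. The sketch below is self-contained; it fits a small pipeline on the full iris data and uses a temporary directory in place of `./model_storage`, both of which are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline(steps=[('pca', PCA(n_components=3)),
                       ('logistic', LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# a temporary directory stands in for ./model_storage
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(pipe, f)
    with open(path, 'rb') as f:
        pipe_load = pickle.load(f)

# the reloaded pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(pipe.predict(X), pipe_load.predict(X))
print(same_predictions)
```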

## Python-specific Deployment Concerns

Python is not a compiled language; it is interpreted at the time of execution. It is important to remember that when you follow the steps in this chapter, you are not pickling an executable program. You are simply pickling an object. At load time, the environment must be compatible with the contents of the object. Often that means matching library versions, since libraries change over time. Also, the default pickle protocol in Python 3 cannot be read by Python 2, so you will have to specify an older protocol (2 or lower) if the file must be loaded under Python 2.
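
One possible way to guard against version mismatches is to store version information next to the model and check it before loading the model itself. The sketch below writes two pickles back to back into an in-memory buffer; the `meta` layout and the decision to refuse loading on a mismatch are assumptions made for illustration, not a Scikit-learn convention.

```python
import io
import pickle
import sys

import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# hypothetical metadata record describing the environment used for pickling
meta = {
    'python_version': sys.version_info[:3],
    'sklearn_version': sklearn.__version__,
}

# write the metadata first, then the model, into the same stream
buffer = io.BytesIO()
pickle.dump(meta, buffer)
pickle.dump(clf, buffer)

# at load time, read the metadata and only then decide whether to load the model
buffer.seek(0)
meta_loaded = pickle.load(buffer)
versions_match = meta_loaded['sklearn_version'] == sklearn.__version__
clf_loaded = pickle.load(buffer) if versions_match else None

# an older protocol can be requested explicitly;
# protocol 2 is the highest one that Python 2 can read
legacy_blob = pickle.dumps({'C': 1.5}, protocol=2)
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

Lowering the protocol only changes the file format. The classes referenced inside the pickle must still be importable, in compatible versions, wherever the file is loaded.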

Lastly, a pickled object is similar to a zip file in that anyone can bundle up anything inside it, and you will not know it until you unpickle/unzip it. Unpickling can execute arbitrary code, so **security** should always be a concern with any file type that is not transparent, and you should only unpickle files from sources you trust.

NOTE: You should read the main pickle doc page for descriptions of compatibility and security before using. It is here: https://docs.python.org/3/library/pickle.html
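
One common precaution is to sign the pickled bytes with a secret key and verify the signature before unpickling. The sketch below uses the standard library `hmac` module; the `SECRET_KEY` value and the helper names `sign` and `safe_loads` are hypothetical and shown only for illustration. This detects tampering, but it does not make a pickle from an untrusted source safe.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b'replace-with-a-real-secret'  # hypothetical key for illustration

def sign(blob):
    """Return a hex HMAC-SHA256 signature for the pickled bytes."""
    return hmac.new(SECRET_KEY, blob, hashlib.sha256).hexdigest()

def safe_loads(blob, signature):
    """Unpickle only if the signature matches the bytes."""
    if not hmac.compare_digest(sign(blob), signature):
        raise ValueError("signature mismatch, refusing to unpickle")
    return pickle.loads(blob)

# pickle a small object and sign the resulting bytes
blob = pickle.dumps({'C': 5, 'n_components': 3})
signature = sign(blob)

# untouched bytes load normally
params = safe_loads(blob, signature)

# modified bytes are rejected before pickle ever reads them
tampered = blob + b'x'
try:
    safe_loads(tampered, signature)
    rejected = False
except ValueError:
    rejected = True
print(params, rejected)
```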