# Automate Machine Learning Workflows with Pipelines

><small><i>from the book 
"Machine Learning Mastery With Python: Understand Your Data, Create Accurate Models and Work Projects End-To-End"
by Jason Brownlee, Migrated to Jupyter with additions by Mitch Sanders 2017</i></small>




There are standard workflows in a machine learning project that can be automated. In Python
scikit-learn, Pipelines help to clearly define and automate these workflows. In this chapter you
will discover Pipelines in scikit-learn and how you can automate common machine learning
workflows. After completing this lesson you will know:

1. How to use pipelines to minimize data leakage.
2. How to construct a data preparation and modeling pipeline.
3. How to construct a feature extraction and modeling pipeline.

Let’s get started.


## Automating Machine Learning Workflows

There are standard workflows in applied machine learning. They are standard because they overcome
common problems like data leakage in your test harness. Python scikit-learn provides a Pipeline
utility to help automate machine learning workflows. Pipelines work by chaining a linear
sequence of data transforms together, culminating in a modeling process that can
be evaluated.

The goal is to ensure that all of the steps in the pipeline are constrained to the data available
for the evaluation, such as the training dataset or each fold of the cross-validation procedure.
You can learn more about Pipelines in scikit-learn by reading the Pipeline section of the user
guide. You can also review the API documentation for the Pipeline and FeatureUnion classes
and the pipeline module.

http://scikit-learn.org/stable/modules/pipeline.html

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline
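
To make the idea of a chained sequence concrete, here is a minimal sketch of a two-step Pipeline. The synthetic data and the names `X_demo`, `y_demo`, and `pipe` are assumptions made for illustration; they are not part of the chapter's examples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (illustrative assumption, not the chapter's dataset)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Each step is a (name, object) tuple; every step but the last must be a transformer
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

# fit() fits the scaler on X_demo, transforms X_demo, then fits the classifier
pipe.fit(X_demo, y_demo)

# predict() reuses the already fitted scaler before calling the classifier
predictions = pipe.predict(X_demo)
print(predictions.shape)
```

Because the scaler is only ever fitted inside `pipe.fit`, whatever data is passed to `fit` is the only data that can influence the transform.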


## Data Preparation and Modeling Pipeline

An easy trap to fall into in applied machine learning is leaking information from your test dataset
into your training dataset. To avoid this trap you need a robust test harness with strong separation of training and testing. This includes data preparation. Data preparation is one easy way to leak
knowledge of the whole dataset to the algorithm. For example, **preparing your data
using normalization or standardization on the entire dataset before splitting it would not
be a valid test because the training dataset would have been influenced** by the scale of the data
in the test set.

Links on Data Leakage:

https://insidebigdata.com/2014/11/26/ask-data-scientist-data-leakage/
https://www.quora.com/Whats-data-leakage-in-data-science
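
The leakage described above can be sketched in a few lines. This is an illustration on synthetic data, with hypothetical names such as `leaky_scaler` and `clean_scaler`; it only shows that a scaler fitted on all rows learns different statistics than one fitted on the training rows alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=7)

# Leaky: the mean and standard deviation are computed from training AND test rows
leaky_scaler = StandardScaler().fit(X_demo)

# Correct: statistics come from the training rows only and are reused on the test rows
clean_scaler = StandardScaler().fit(X_train)
X_train_scaled = clean_scaler.transform(X_train)
X_test_scaled = clean_scaler.transform(X_test)

# The two scalers disagree, so the leaky one has "seen" the test set
print(leaky_scaler.mean_)
print(clean_scaler.mean_)
```

The difference between the two sets of means is small here, but it is exactly the information from the test rows that should never reach the training step.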

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation
like standardization is constrained to each fold of your cross-validation procedure. The example
below demonstrates this important data preparation and model evaluation workflow on the
Pima Indians onset of diabetes dataset. The pipeline is defined with two steps:

1. Standardize the data.
2. Learn a Linear Discriminant Analysis model.

The pipeline is then evaluated using 10-fold cross-validation.


In [None]:
# Sequentially apply a list of transforms and a final estimator
# Create a pipeline that standardizes the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))

model = Pipeline(estimators)
# evaluate pipeline
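# shuffle=True is needed for random_state to have an effect;
# recent scikit-learn versions raise an error if it is omitted
kfold = KFold(n_splits=10, shuffle=True, random_state=7)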
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())



Notice how we create a Python list of steps that is provided to the Pipeline to process
the data. Also notice how **the Pipeline itself is treated like an estimator and is evaluated in its
entirety by the k-fold cross-validation procedure**. Running the example prints the mean
classification accuracy of the pipeline across the 10 folds.
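
To see what it means for the whole Pipeline to be evaluated inside each fold, the following sketch spells out the loop by hand and compares it with `cross_val_score`. It assumes synthetic data and 5 folds to stay small; the names `manual_scores` and `auto_scores` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (illustrative assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)

pipe = Pipeline([
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
])
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# Manual loop: a fresh copy of the pipeline is fitted on each training fold,
# so the scaler never sees the rows held out for scoring
manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    fold_model = clone(pipe)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# cross_val_score performs the same per-fold fit and score for us
auto_scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold)
print(np.allclose(manual_scores, auto_scores))
```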

## Feature Extraction and Modeling Pipeline
Feature extraction is another procedure that is susceptible to data leakage. Like data preparation,
**feature extraction procedures must be restricted to the data in your training dataset**. The
pipeline module provides a handy tool called the FeatureUnion which allows the results of multiple
feature selection and extraction procedures to be combined into a larger dataset on which a
model can be trained. Importantly, all the feature extraction and the feature union occur
within each fold of the cross-validation procedure. The example below demonstrates the pipeline
defined with four steps:

1. Feature Extraction with Principal Component Analysis (3 features).
2. Feature Extraction with Statistical Selection (6 features).
3. Feature Union.
4. Learn a Logistic Regression Model.

The pipeline is then evaluated using 10-fold cross-validation.
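
Before running the full example, it can help to look at what the feature union alone produces. The sketch below is an illustration on synthetic data with 8 input columns (an assumption chosen to match the 8 input attributes of the dataset); the 3 PCA components and 6 selected features are concatenated side by side.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic data with 8 input columns (illustrative assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 0).astype(int)

union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6)),
])

# y is needed because SelectKBest scores each feature against the target
X_combined = union.fit_transform(X_demo, y_demo)

# 3 principal components + 6 selected original features = 9 columns
print(X_combined.shape)
```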


In [None]:
# Create a pipeline that extracts features from the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# create feature union (concatenates the results of multiple transformer objects)
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)

# evaluate pipeline
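# shuffle=True is needed for random_state to have an effect;
# recent scikit-learn versions raise an error if it is omitted
kfold = KFold(n_splits=10, shuffle=True, random_state=7)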
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())


This example shows a Pipeline that extracts and combines features before modeling.


Notice how the FeatureUnion is itself a composite transformer that in turn is a single step in the final
Pipeline used to feed Logistic Regression. **This might get you thinking about how you can start
embedding pipelines within pipelines**. Running the example prints the mean classification
accuracy of the pipeline across the 10 folds.

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
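
As one possible way to embed a pipeline within a pipeline, the sketch below standardizes the data before PCA inside an inner Pipeline, places that inner Pipeline in a FeatureUnion, and uses the union as a step of the outer Pipeline. The synthetic data, the 5 folds, and the names `scaled_pca` and `nested_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (illustrative assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 8))
y_demo = (X_demo[:, 1] + X_demo[:, 4] > 0).astype(int)

# Inner pipeline: standardize, then reduce to 3 principal components
scaled_pca = Pipeline([
    ('standardize', StandardScaler()),
    ('pca', PCA(n_components=3)),
])

# The inner pipeline is just another transformer inside the union
union = FeatureUnion([
    ('scaled_pca', scaled_pca),
    ('select_best', SelectKBest(k=6)),
])

# Outer pipeline: combined features feed the classifier
nested_model = Pipeline([
    ('feature_union', union),
    ('logistic', LogisticRegression()),
])

# Every nested step is refitted on the training part of each fold
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(nested_model, X_demo, y_demo, cv=kfold)
print(scores.mean())
```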


## Summary
In this chapter you discovered the difficulties of data leakage in applied machine learning. You
discovered the Pipeline utilities in Python scikit-learn and how they can be used to automate
standard applied machine learning workflows. You learned how to use Pipelines in two important
use cases:
- Data preparation and modeling constrained to each fold of the cross-validation procedure.
- Feature extraction and feature union constrained to each fold of the cross-validation
procedure.


### Next
This completes the lessons on how to evaluate machine learning algorithms. In the next lesson
you will take your first look at how to improve algorithm performance on your problems by
using ensemble methods.



<hr>

### About the Pima Indians Diabetes Dataset

#### Attribute Information:

1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) 