# Automate Machine Learning Workflows with Pipelines

1. How to use pipelines to minimize data leakage.
2. How to construct a data preparation and modeling pipeline.
3. How to construct a feature extraction and modeling pipeline.

Applied machine learning has standard workflows. They are standard because they overcome common problems, such as data leakage in your test harness. Python's scikit-learn library provides a Pipeline utility to help automate these workflows. A pipeline chains a linear sequence of data transforms together, culminating in a modeling step, so that the whole chain can be evaluated as a single unit. The goal is to ensure that every step in the pipeline is constrained to the data available for the evaluation, such as the training dataset or each training fold of the cross-validation procedure.
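
As a rough illustration of what chaining means, the sketch below compares a two-step pipeline with the same steps applied by hand. The synthetic data and the names `X_demo`, `y_demo`, and `pipe` are assumptions made for this example; they are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 100 rows, 4 features, label depends on feature 0
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
y_demo = (X_demo[:, 0] > 50.0).astype(int)
X_train, X_test = X_demo[:80], X_demo[80:]
y_train, y_test = y_demo[:80], y_demo[80:]

# Pipeline route: one object holds the transform and the model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual route: fit the scaler on the training rows, then reuse it on the test rows
manual_scaler = StandardScaler().fit(X_train)
manual_clf = LogisticRegression().fit(manual_scaler.transform(X_train), y_train)
manual_pred = manual_clf.predict(manual_scaler.transform(X_test))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```

Both routes should give the same predictions, because calling `fit` on the pipeline fits each transform on the training rows only, and `predict` reuses those fitted transforms.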

# Data Preparation and Modeling Pipeline
An easy trap to fall into in applied machine learning is leaking information from your test dataset into your training dataset. To avoid this trap you need a robust test harness with strong separation of training and testing. This includes data preparation, which is one easy way to leak knowledge of the whole dataset to the algorithm. For example, applying normalization or standardization to the entire dataset before splitting it would not give a valid test, because the training dataset would have been influenced by the scale of the data in the test set.
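
The effect can be seen in a tiny made-up example. The arrays `train` and `test` below are hypothetical values, chosen so that the test rows have a much larger scale than the training rows; the sketch only compares the statistics a `StandardScaler` learns in each case.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data; the test rows are on a much larger scale
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[100.0], [200.0]])

# Leaky: the scaler sees the training and test rows together
leaky = StandardScaler().fit(np.vstack([train, test]))

# Clean: the scaler sees the training rows only
clean = StandardScaler().fit(train)

leaky_mean = leaky.mean_[0]  # (1 + 2 + 3 + 4 + 100 + 200) / 6
clean_mean = clean.mean_[0]  # (1 + 2 + 3 + 4) / 4 = 2.5
print(leaky_mean, clean_mean)
```

The leaky scaler's mean is pulled far away from the training values by the test rows, so any model trained on data scaled this way has indirectly seen the test set.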

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation like standardization is constrained to each fold of your cross-validation procedure. The example below demonstrates this important data preparation and model evaluation workflow on the Pima Indians onset of diabetes dataset. The pipeline is defined with two steps:

1. Standardize the data.
2. Learn a Linear Discriminant Analysis model.

In [1]:
# Pima Indians Diabetes Dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# load dataset, supplying the column names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv('pima-indians-diabetes.data', names=names)

# separate the DataFrame into input and output components
X = df.drop('class', axis='columns')
Y = df['class']

In [2]:
# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))

model = Pipeline(estimators)

# evaluate pipeline
# random_state is omitted: KFold only uses it when shuffle=True, and recent
# scikit-learn versions raise an error if it is set without shuffling
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.773462064252
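
To make the per-fold behaviour explicit, the sketch below spells out roughly what `cross_val_score` does with a pipeline: for every fold it refits a fresh copy of the whole pipeline, scaler included, on the training rows only. It runs on synthetic data from `make_classification` rather than the Pima file, and the names `X_syn`, `y_syn`, `pipe`, and `folds` are assumptions for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with 8 input features
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])
folds = KFold(n_splits=10)

# Manual loop: the scaler and the model are both refit on each training fold
manual_scores = []
for train_idx, test_idx in folds.split(X_syn):
    fold_model = clone(pipe)  # unfitted copy, so no state carries over between folds
    fold_model.fit(X_syn[train_idx], y_syn[train_idx])
    manual_scores.append(fold_model.score(X_syn[test_idx], y_syn[test_idx]))

# Library route: same pipeline, same folds
auto_scores = cross_val_score(pipe, X_syn, y_syn, cv=folds)
scores_match = np.allclose(manual_scores, auto_scores)
print(scores_match)
```

The held-out rows of each fold never reach `StandardScaler.fit`, which is the separation the pipeline is meant to guarantee.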


# Feature Extraction and Modeling Pipeline
Feature extraction is another procedure that is susceptible to data leakage. Like data preparation, feature extraction procedures must be restricted to the data in your training dataset. The scikit-learn pipeline module provides a handy tool called FeatureUnion, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all the feature extraction and the feature union occur within each fold of the cross-validation procedure. The example below demonstrates a pipeline defined with four steps:

1. Feature Extraction with Principal Component Analysis (3 features).
2. Feature Extraction with Statistical Selection (6 features).
3. Feature Union.
4. Learn a Logistic Regression Model.

In [3]:
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [4]:
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))

feature_union = FeatureUnion(features)

In [5]:
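# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
# liblinear was the default solver in older scikit-learn releases; the current
# default (lbfgs) may emit a convergence warning on this unscaled data
estimators.append(('logistic', LogisticRegression(solver='liblinear')))

model = Pipeline(estimators)

# evaluate pipeline, reusing the kfold object defined earlier
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())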

0.776042378674
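
To see what the feature union hands to the model, the sketch below applies the same two extractors to synthetic data and inspects the shape of the combined output. The data from `make_classification` and the names `X_syn`, `y_syn`, `union`, and `combined` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Hypothetical stand-in data with 8 input features
X_syn, y_syn = make_classification(
    n_samples=100, n_features=8, n_informative=4, random_state=0
)

union = FeatureUnion([
    ("pca", PCA(n_components=3)),          # 3 principal components
    ("select_best", SelectKBest(k=6)),     # 6 highest-scoring original features
])

# Each extractor is fit on the same input; the outputs are stacked column-wise
combined = union.fit_transform(X_syn, y_syn)
print(combined.shape)  # (100, 9)
```

The 3 PCA components and the 6 selected features sit side by side, so the final estimator in the pipeline trains on 9 columns.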
