## Pipeline 1: Data Preparation and Modelling
Data preparation is an easy way to leak information from the whole dataset, including the held-out test portion, into the learning algorithm.

For example, normalizing or standardizing the entire dataset before learning would not give a valid test, because the scale of the data in the test set would have influenced the training dataset.
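The difference can be seen in a minimal sketch. It uses a small synthetic array (an assumption made for illustration, not the diabetes data used later) and compares the statistics a scaler learns from all rows with those it learns from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a real dataset: 100 rows, 3 features
X_all = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_train, X_test = X_all[:80], X_all[80:]

# Leaky: the statistics are computed from every row, including the held-out ones
leaky_scaler = StandardScaler().fit(X_all)

# Safe: the statistics are computed from the training rows only,
# then reused unchanged to transform the held-out rows
safe_scaler = StandardScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)

print("mean from all rows:     ", leaky_scaler.mean_)
print("mean from training rows:", safe_scaler.mean_)
```

The two sets of means differ slightly, and that difference is exactly the information about the test rows that the leaky version passes into training.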

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation steps such as normalization and standardization are constrained to each fold of your cross-validation procedure.
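To make the per-fold behaviour concrete, the sketch below writes out by hand roughly what `cross_val_score` does with a pipeline: for each fold, a fresh copy of the pipeline is fit on the training rows only and scored on the held-out rows. The data comes from `make_classification`, and the names `X_demo`, `y_demo`, and `pipe` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])
cv = KFold(n_splits=5, shuffle=True, random_state=7)

# Manual loop: the scaler inside each cloned pipeline only ever sees training rows
manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    fold_pipe = clone(pipe)
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_pipe.score(X_demo[test_idx], y_demo[test_idx]))

# The same evaluation done by cross_val_score
auto_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

print("manual:", np.round(manual_scores, 3))
print("auto:  ", np.round(auto_scores, 3))
```

Because the integer `random_state` makes the folds reproducible, both approaches evaluate the same splits and should report the same per-fold accuracies.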

In [2]:
# Creating a pipeline that standardizes the data and creates a model
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline

In [4]:
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres','skin','test', 'mass', 'pedi', 'age', 'class']

dataframe = pd.read_csv(url, names = names)

dataframe.head()

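|   | preg | plas | pres | skin | test | mass | pedi  | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1 | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2 | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3 | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4 | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |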


In [10]:
# divide the data into features (X) and target (y)
X = dataframe.drop('class', axis = 1)
y = dataframe['class']

# create pipelines
estimator = []
estimator.append(('standardize', StandardScaler()))
estimator.append(('lda', LinearDiscriminantAnalysis()))


model = Pipeline(estimator)

# evaluate pipeline
seed = 7
kfold = KFold(n_splits = 10, random_state = seed, shuffle = True)
results = cross_val_score(model, X,y, cv = kfold)

print(results.mean())

0.7669685577580315
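`cross_val_score` fits copies of the pipeline, so the original object is left unfitted. To inspect what each step learned, the pipeline can be fit once and its steps read back through `named_steps`. The sketch below does this on synthetic stand-in data; the names `X_demo`, `y_demo`, and `pipe` are assumptions used only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])
pipe.fit(X_demo, y_demo)

# Each step is reachable by the name it was given in the pipeline
scaler = pipe.named_steps["standardize"]
print("per-feature means learned by the scaler: ", scaler.mean_)
print("per-feature scales learned by the scaler:", scaler.scale_)
print("training accuracy:", pipe.score(X_demo, y_demo))
```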


## Pipeline 2: Feature Extraction and Modelling
Feature extraction is another procedure that is susceptible to data leakage. Like data preparation, feature extraction procedures must be restricted to the data in your training dataset.

Pipelines provide a handy tool called FeatureUnion, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained.

All the feature extraction and the feature union occur within each fold of the cross-validation procedure.
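As a small illustration of what the union produces, the sketch below applies a `FeatureUnion` of PCA (3 components) and `SelectKBest` (6 features) to synthetic data with 8 input columns. The data and the names `X_demo`, `y_demo`, and `union` are assumptions for this example. The outputs of the two transformers are stacked side by side, so the combined matrix has 3 + 6 = 9 columns.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data with 8 input features (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

union = FeatureUnion([
    ("pca", PCA(n_components=3)),         # 3 projected components
    ("select_best", SelectKBest(k=6)),    # 6 of the original columns
])

# y is passed through so that SelectKBest can score features against the target
X_combined = union.fit_transform(X_demo, y_demo)
print(X_demo.shape, "->", X_combined.shape)
```

Note that the selected original columns and the PCA components overlap in the information they carry; the union simply concatenates them and leaves it to the model to weigh them.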

In the example below, the pipeline is defined with four steps:
* Feature extraction with PCA (3 features)
* Feature extraction with statistical selection (6 features)
* Feature union
* Learn a logistic regression model

The pipeline is then evaluated using 10-fold cross-validation.

In [14]:
# Create a pipeline that extracts features from the data then creates model
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest


In [17]:
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres','skin','test', 'mass', 'pedi', 'age', 'class']

dataframe = pd.read_csv(url, names = names)

array = dataframe.values
X = array[:,0:8]
y = array[:,8]

In [18]:
# Create feature union
features = []
features.append(('pca', PCA(n_components = 3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

# create pipeline
estimators = []
estimators.append(('feature union',feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)

# evaluate pipeline
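# shuffle defaults to False, so the folds are deterministic and no random_state is needed
# (recent scikit-learn versions raise an error if random_state is set without shuffle=True)
kfold = KFold(n_splits = 10)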
results = cross_val_score(model, X, y, cv = kfold)
print(results.mean())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.769565960355434
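The warning in the output suggests either raising `max_iter` or scaling the data. One possible way to act on both suggestions, sketched here as an assumption rather than as part of the original experiment, is to put a `StandardScaler` in front of the feature union and pass a larger `max_iter` to `LogisticRegression`. The example uses synthetic data and the hypothetical names `X_demo`, `y_demo`, and `scaled_model`; the value `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

feature_union = FeatureUnion([
    ("pca", PCA(n_components=3)),
    ("select_best", SelectKBest(k=6)),
])

scaled_model = Pipeline([
    ("standardize", StandardScaler()),                 # scaling happens inside each fold
    ("feature_union", feature_union),
    ("logistic", LogisticRegression(max_iter=1000)),   # more iterations for the solver
])

scores = cross_val_score(scaled_model, X_demo, y_demo, cv=KFold(n_splits=5))
print(scores.mean())
```

Because the scaler is a pipeline step, it is fit on the training rows of each fold only, so this change does not reintroduce leakage. The resulting score on the real data may differ from the one printed above.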


