There are standard workflows in a machine learning project that can be automated. In Python's
scikit-learn, Pipelines help to clearly define and automate these workflows. This section introduces Pipelines in scikit-learn and shows how you can use them to automate common machine learning
workflows, covering:

* How to use pipelines to minimize data leakage.
* How to construct a data preparation and modeling pipeline.
* How to construct a feature extraction and modeling pipeline.

### Automating Machine Learning Workflows

There are `standard workflows` in applied machine learning. They are standard because they overcome common problems like `data leakage` in your test harness. Python scikit-learn provides a `Pipeline utility` to help `automate machine learning workflows`. Pipelines work by allowing a `linear sequence` of data transforms to be chained together, culminating in a modeling process that can be evaluated.
 
The goal is to ensure that all of the steps in the pipeline are constrained to the data available
for the `evaluation`, such as the `training dataset` or each fold of the `cross-validation` procedure.

### Data Preparation and Modeling Pipeline

An easy trap to fall into in `applied machine learning` is `leaking information from your test dataset into your training dataset`. To avoid this trap you need a robust test harness with strong separation of `training` and `testing`. This includes `data preparation`. `Data preparation` is one easy way to leak knowledge of the whole dataset to the algorithm. For example, preparing your data using `normalization` or `standardization` on the entire dataset before splitting it would not
be a valid test, because the training data would have been influenced by the scale of the data in the test set.
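
As a minimal sketch of the difference, the snippet below fits a `StandardScaler` once on all rows (leaky) and once on the training rows only (safe). The data is synthetic and the variable names are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (assumption: not the Pima dataset).
rng = np.random.RandomState(7)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=7)

# Leaky: the mean and variance are computed from training and test rows together.
leaky_scaler = StandardScaler().fit(X)

# Safe: the scaler only ever sees the training rows,
# and the same fitted scaler is then applied to the test rows.
safe_scaler = StandardScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)

print("mean learned from all rows:     ", leaky_scaler.mean_)
print("mean learned from training rows:", safe_scaler.mean_)
```

The two sets of learned means differ, which is exactly the information about the test rows that the leaky version lets into training.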

`Pipelines` help you prevent `data leakage` in your test harness by ensuring that `data preparation` like `standardization` is constrained to each fold of your `cross-validation` procedure. The example below demonstrates this important data preparation and model evaluation workflow on the Pima Indians onset of diabetes dataset. 

The pipeline is defined with two steps:

* Standardize the data.
* Learn a Linear Discriminant Analysis model.

#### Create a pipeline that standardizes the data then creates a model

In [2]:
# Import Libraries

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [3]:
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# Split Dataset
X = array[:,0:8]
Y = array[:,8]

In [4]:
# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)

In [6]:
# evaluate pipeline
# random_state is only valid together with shuffle=True, so it is omitted here
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
print("accuracy of the setup on the dataset.")

0.773462064251538
accuracy of the setup on the dataset.


Notice how we create a Python list of steps that is provided to the Pipeline to process
the data. Also notice how the Pipeline itself is treated like an estimator and is evaluated in its
entirety by the k-fold cross-validation procedure.
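
Because a Pipeline behaves like any other estimator, it can also be fitted and used for prediction directly. The following is a small sketch on synthetic data (an assumption made so the snippet runs without the CSV file); `named_steps` gives access to each fitted step by the name used in the list.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 input features (assumption: illustrative data only).
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=7)

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])

# fit() fits the scaler, transforms the data, then fits the LDA model.
pipe.fit(X_demo, y_demo)

# predict() applies the fitted scaler before calling the model's predict().
predictions = pipe.predict(X_demo[:5])

# Each fitted step can be looked up by its name.
fitted_scaler = pipe.named_steps["standardize"]
print(predictions)
print(fitted_scaler.mean_.shape)
```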

Machine learning pipelines help automate machine learning workflows. They chain a sequence of data transformations together with a model, so that the whole sequence can be tested and evaluated as a single unit.

Machine learning (ML) pipelines consist of several steps to train a model. They are iterative: each step is repeated to keep improving the accuracy of the model. To build better machine learning models and get the most value from them, accessible, scalable and durable storage is also needed, which is one argument made for on-premises object storage.

* The main objective of having a proper pipeline for any ML model is to exercise control over it. A well-organised pipeline makes the implementation more flexible. It is like having an exploded view of a computer where you can pick out the faulty pieces and replace them. In our case, that means replacing a chunk of code.

* The term ML model refers to the model that is created by the training process.

* The learning algorithm finds patterns in the training data that map the input data attributes to the target (the answer to be predicted), and it outputs an ML model that captures these patterns.

* A model can have many dependencies. To make sure all features are available both offline and online for deployment, all of these components are stored in a central repository.

* A pipeline consists of a sequence of components, each of which is a collection of computations. Data is sent through these components and is transformed by those computations.
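
The idea of replacing one piece of a pipeline without touching the rest can be sketched in scikit-learn with `set_params`, which accepts a step name as a keyword to swap that step's estimator. The step names `standardize` and `model` below are illustrative choices, not names taken from the text above.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("model", LinearDiscriminantAnalysis()),
])

# Swap only the final step; the standardization step is left untouched.
pipe.set_params(model=LogisticRegression(max_iter=1000))

print([name for name, _ in pipe.steps])
print(type(pipe.named_steps["model"]).__name__)
```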


In this broader sense, `pipelines` are not one-way flows. They are cyclic in nature and enable iteration to improve the scores of the machine learning algorithms and make the model scalable.


Many of today’s ML models are `trained neural networks` capable of executing a specific task or of providing insights that move from `what happened` to `what will likely happen` (predictive analysis). These models are complex and are never really finished. Instead, a mathematical or computational procedure is applied repeatedly to the previous result, improving it each time to get a closer approximation to `solving the problem`. Data scientists therefore want more captured data as fuel for training ML models.

We’ll build a simple pipeline that standardizes our data, then create a model that we will evaluate with leave-one-out cross-validation.

In [49]:
#Import all our packages 
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [50]:
import pandas as pd
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df1 = pd.read_csv("iris_dataset", names=names)
df1.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [51]:
# load data
array = df1.values
X = array[:,0:4]
Y = array[:,4]

In [52]:
X.shape

(150, 4)

In [54]:
#creating pipeline
estimators = []
estimators.append(("standardize", StandardScaler()))
estimators.append(("lda" , LinearDiscriminantAnalysis()))
model = Pipeline(estimators)

In [13]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
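
A minimal sketch of the leave-one-out evaluation of this pipeline follows. It assumes the iris data is loaded from `sklearn.datasets.load_iris` rather than from the local `iris_dataset` file, so that the snippet is self-contained; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the iris data (150 rows, 4 features).
X_iris, y_iris = load_iris(return_X_y=True)

loo_model = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])

# Leave-one-out: each of the 150 rows is held out once as a one-row test set,
# and the scaler is refitted on the remaining 149 rows every time.
loo = LeaveOneOut()
loo_results = cross_val_score(loo_model, X_iris, y_iris, cv=loo)

# Each entry is 1.0 (held-out row classified correctly) or 0.0; the mean is the accuracy.
print(loo_results.mean())
```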

### Feature Extraction and Modeling Pipeline

`Feature extraction` is another procedure that is susceptible to `data leakage`. Like data preparation,
`feature extraction` procedures must be restricted to the data in your `training dataset`. The
`sklearn.pipeline` module provides a handy tool called the **FeatureUnion**, which allows the results of multiple
`feature selection` and `extraction procedures` to be combined into a larger dataset on which a
model can be trained. Importantly, all the `feature extraction` and the `feature union` occur
within each fold of the `cross-validation` procedure.

The pipeline is defined with four steps:

* Feature Extraction with Principal Component Analysis (3 features).
* Feature Extraction with Statistical Selection (6 features).
* Feature Union.
* Learn a Logistic Regression Model.

The pipeline is then evaluated using 10-fold cross-validation.
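
To make the union step concrete, the sketch below shows that a `FeatureUnion` concatenates the columns produced by each transformer side by side: 3 PCA components plus 6 selected features give 9 columns. The data is synthetic (an assumption so the snippet runs without the CSV file), with 8 input features to mirror the dataset used here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in with 8 input features (assumption: illustrative data only).
X_syn, y_syn = make_classification(
    n_samples=120, n_features=8, n_informative=4, random_state=7
)

union = FeatureUnion([
    ("pca", PCA(n_components=3)),           # contributes 3 columns
    ("select_best", SelectKBest(k=6)),      # contributes 6 columns
])

# Each transformer is fitted on the same input; the outputs are stacked column-wise.
combined = union.fit_transform(X_syn, y_syn)
print(X_syn.shape, "->", combined.shape)  # (120, 8) -> (120, 9)
```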

#### Create a pipeline that extracts features from the data then creates a model

In [14]:
# Import libraries

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [15]:
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [16]:
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

In [17]:
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)

In [18]:
# evaluate pipeline
# random_state is only valid together with shuffle=True, so it is omitted here
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7760423786739576


Notice how the `FeatureUnion` is itself a composite transformer built from named steps, and that it in turn is a single step in the `final Pipeline` used to feed `Logistic Regression`. This might get you thinking about how you can start
embedding pipelines within `pipelines`.
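
One way such nesting might look is sketched below: a small Pipeline (standardize, then PCA) is used as one branch of a `FeatureUnion`, which is itself a step in the outer Pipeline. The data is synthetic and the step names (`scaled_pca`, `features`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 input features (assumption: illustrative data only).
X_nest, y_nest = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=7
)

# Inner pipeline: standardize first, then project onto 3 principal components.
scaled_pca = Pipeline([
    ("standardize", StandardScaler()),
    ("pca", PCA(n_components=3)),
])

# The inner pipeline is one branch of the union, next to a plain selector.
nested_union = FeatureUnion([
    ("scaled_pca", scaled_pca),
    ("select_best", SelectKBest(k=6)),
])

# Outer pipeline: combined features feed the classifier.
nested_model = Pipeline([
    ("features", nested_union),
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Every nested step is refitted inside each cross-validation fold.
nested_scores = cross_val_score(nested_model, X_nest, y_nest, cv=KFold(n_splits=5))
print(nested_scores.mean())
```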