### Introduction to Pipelining
- Utility within scikit-learn in Python
- Extremely simple and useful tool for managing machine learning workflows
- Usefulness of pipelines:
    - Standardise the operations of your ML task
    - Chain them in a sequence, combine them in unions, and fine-tune parameters
    - Reproducibility
    - Value in persistence of entire pipeline objects
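
As a rough illustration of the persistence point, the sketch below saves and restores a fitted pipeline with the standard-library `pickle` module. The step names `scale` and `clf` and the choice of `max_iter=1000` are assumptions made for this example only. A pickled pipeline should only be loaded from a trusted source, and ideally with the same scikit-learn version that created it.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Preprocessing and model live in one object, so they are saved together.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# Serialize to bytes; pickle.dump(pipe, file) would write to disk instead.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# Check that the restored pipeline reproduces the original predictions.
same = bool((restored.predict(X) == pipe.predict(X)).all())
print(same)
```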

### What does a typical ML task entail?

- data preparation to varying degrees
    - getting a cleaned dataset from an initial state of disarray (data cleaning, data wrangling)
    - various pre-processing steps (dimensionality reduction, feature extraction)
- finish off with a prediction or a modeling task 

The `Pipeline` class is a manageable way to apply a series of data transformations followed by a final estimator (i.e. your choice of ML model).

To give a one line summary - `Pipeline of transforms with a final estimator.`

### Steps for modelling a Pipeline

1. Feature Engineering - Create features to best reflect the meaning behind data
2. Create an appropriate model to capture relationships between features; e.g. linear, non-linear
3. Select a loss function and fit the model
4. Evaluate model

Once you perform these steps, your pipeline is ready. 
Now you can use the model for prediction and/or inference.

### How Pipelines help with Data Preparation 

- `Data Preparation` should ensure strong separation of training and testing data
- A common problem in applied ML is overfitting to the evaluation data, which can occur when information from your testing set leaks into your training set

e.g. Normalizing or standardizing the entire dataset before splitting it would not give a valid test, because the training dataset would have been influenced by the scale of the data in the test set.

- Pipelines help you prevent data leakage
- Ensure that the data preparation is constrained to each fold of your cross validation procedure
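
A minimal sketch of the point above, assuming the iris data and 5-fold cross-validation (both chosen only for illustration): passing the whole pipeline to `cross_val_score` means the scaler is re-fitted on the training part of every fold, so the held-out part never influences the scaling.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# For each of the 5 folds, cross_val_score clones the pipeline and fits
# the scaler and the classifier on the training part of that fold only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling `X` once up front and then cross-validating only the classifier would let every fold see statistics computed from its own held-out data.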

### How Pipelines help with Feature Extraction and Modelling 

- `Feature extraction` is another procedure that is susceptible to data leakage
- The `sklearn.pipeline` module provides a handy tool called `FeatureUnion`
    - Allows the results of multiple feature selection and extraction procedures to be combined
    - Combined larger dataset can be used to train the model
- Thus, all the feature extraction and the feature union occur within each fold of the cross validation procedure.
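
The sketch below shows one way a `FeatureUnion` might be used inside a pipeline. The two extraction steps (`PCA` with 2 components and `SelectKBest` with `k=1`) and all step names are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

X, y = load_iris(return_X_y=True)

# Two extraction procedures whose outputs are concatenated column-wise.
combined = FeatureUnion([("pca", PCA(n_components=2)),
                         ("kbest", SelectKBest(k=1))])

pipe = Pipeline([("features", combined),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# 2 PCA components + 1 selected feature = 3 columns.
X_combined = pipe.named_steps["features"].transform(X)
print(X_combined.shape)
```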

### How to put together a Scikit-Learn Pipeline
- Put together feature transformers and models using `sklearn.pipeline.Pipeline` objects
- Create a pipeline from a list of `(name, estimator)` tuples: `pl = Pipeline([('feat', feat), ('mdl', mdl)])`
- Fit the model(s) in the pipeline using pl.fit(data, target)
- Predict from raw input data through the pipeline using pl.predict
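
The steps above can be sketched end to end as follows. The toy data, the `MinMaxScaler` transformer and the `DecisionTreeClassifier` model are stand-ins chosen for this example; any transformer and estimator could take the place of `feat` and `mdl`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: one feature, class 1 when it exceeds 5.
data = np.arange(10, dtype=float).reshape(-1, 1)
target = (data.ravel() > 5).astype(int)

feat = MinMaxScaler()                          # feature transformer
mdl = DecisionTreeClassifier(random_state=0)   # final estimator

# Each step is a (name, estimator) tuple.
pl = Pipeline([("feat", feat), ("mdl", mdl)])
pl.fit(data, target)

# Raw inputs are scaled by 'feat' before reaching 'mdl'.
pred = pl.predict(np.array([[1.0], [8.0]]))
print(pred)
```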

### Simple Examples of a Pipeline

#### Example 1 
- We use the iris dataset
- We perform preprocessing by standardizing the data
- We use a Logistic Regression classifier to assign each sample to its iris species

In [1]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [2]:
# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size= 0.2, random_state=42)
X_train.shape

(120, 4)

In [3]:
pipe_lr = Pipeline([('stdscr', StandardScaler()),
 ('clf', LogisticRegression(solver='newton-cg', multi_class='ovr'))])

# Standardize features by removing the mean and scaling to unit variance
# Logistic Regression 

In [4]:
pipe_lr.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('stdscr', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False))])

In [5]:
score = pipe_lr.score(X_test, y_test)
print('Logistic Regression pipeline test accuracy: %.3f' % score)

Logistic Regression pipeline test accuracy: 0.967


#### Example 2 
- We use the digits data from scikit-learn
- We perform preprocessing by Principal Component Analysis, where we keep the top 20 components
- We use a Logistic Regression classifier trained with Stochastic Gradient Descent and early stopping

In [6]:
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Note: from scikit-learn 1.1 this loss is spelled 'log_loss'; 'log' was removed in 1.3
logistic = SGDClassifier(loss='log', penalty='l2', early_stopping=True,
                         max_iter=10000, tol=1e-5, random_state=0)
pca = PCA(n_components = 20)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
pipe.fit(X_digits, y_digits)

Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=20, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logistic', SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=True, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.1...dom_state=0, shuffle=True, tol=1e-05,
       validation_fraction=0.1, verbose=0, warm_start=False))])

The above example was just to observe another kind of data transformation, and not necessarily a viable ML task.
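
The example imports `GridSearchCV` without using it. As a hedged sketch of how parameters of a pipeline step could be tuned, the code below searches over the number of PCA components. It swaps in `LogisticRegression` for the `SGDClassifier` to keep the sketch independent of the scikit-learn version, and the grid values and `cv=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = load_digits(return_X_y=True)

pipe = Pipeline([("pca", PCA()),
                 ("logistic", LogisticRegression(max_iter=2000))])

# Parameters of a step are addressed as '<step name>__<parameter name>'.
param_grid = {"pca__n_components": [10, 20]}

# The whole pipeline (PCA + classifier) is re-fitted inside every CV fold.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_digits, y_digits)

print(search.best_params_)
print(round(search.best_score_, 3))
```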

### Function Transformer
- Recall what a function transformer is
- It forwards the X (and optionally y) arguments to a user-defined function or function object and returns the result of this function
- Used in Data Pre-processing
- Somewhat like an `apply` in pandas
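
A small sketch of the idea, assuming `np.log1p` as the user-defined function (any callable that maps an array to an array would do):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain function so it can be used as a pipeline step.
log_transformer = FunctionTransformer(np.log1p)

X = np.array([[0.0, 1.0],
              [3.0, 7.0]])

# The wrapped function is applied to the whole array at once.
X_log = log_transformer.fit_transform(X)
print(X_log)
```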

But what if we cannot apply the same transformations to every individual feature of a data point in X?
This is why we need `Column Transformers`. 

### Column Transformer
- Datasets can often contain components that require different feature extraction and processing pipelines
- Datasets may have a mix of Categorical columns and Continuous Numeric columns, which will almost always need separate transformations
- Datasets may be stored in a Pandas DataFrame and different columns require different processing pipelines

- For example:
    - Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
    - You want to standardize the numerical columns but one-hot-encode the categorical ones

- The `ColumnTransformer` (added in scikit-learn 0.20) allows you to choose which columns get which transformations

- The ColumnTransformer takes a list of tuples, where each tuple has the following 3 entries:
    - The first value in the tuple is a name that labels it, 
    - the second is an instantiated estimator or transformation, 
    - and the third is a list of columns you want to apply the transformation to. 
- The tuple will look like this:
     `('name', SomeTransformer(parameters), columns)`
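
To make the tuple structure concrete, here is a minimal sketch on a toy DataFrame. The column names `height` and `colour` and their values are invented for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame, invented for illustration.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0],
                   "colour": ["red", "blue", "red", "green"]})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["height"]),   # (name, transformer, columns)
    ("cat", OneHotEncoder(), ["colour"]),
])

Xt = ct.fit_transform(df)
# 1 scaled numeric column + 3 one-hot columns (blue, green, red).
print(Xt.shape)
```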

### Example of a Pipeline using ColumnTransformer

- Let's work on the titanic dataset from class
- Import the data into a pandas DataFrame
- To avoid complications and focus on column transformations, let's drop the missing values

In [7]:
import pandas as pd

In [10]:
titanic = pd.read_csv("https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv")

titanic = titanic.dropna(subset=['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'])

- The target column is chosen as `survived`, as done in the homework
- The columns we work with are collected in the `features` DataFrame

In [12]:
target = titanic.survived.values
features = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']]
features.head()

| | pclass | sex | age | fare | embarked |
|---|---|---|---|---|---|
| 0 | 1 | female | 29.0 | 211.3375 | S |
| 1 | 1 | male | 0.9167 | 151.55 | S |
| 2 | 1 | female | 2.0 | 151.55 | S |
| 3 | 1 | male | 30.0 | 151.55 | S |
| 4 | 1 | female | 25.0 | 151.55 | S |


- Now let's apply `ColumnTransformer` to the different feature columns
- The numeric features in consideration are `age` and `fare`
- The categorical features in consideration are `pclass`, `sex` and `embarked`
- We standardize the numerical features, and one-hot-encode the categorical features as stated above
- We use `ColumnTransformer` to assign these transformations to their respective columns

In [13]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

In [14]:
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex', 'pclass']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)])

- Now we construct a pipeline using the transformer object
- Finally we use the pipeline to fit and predict

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

In [19]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=0)

model = make_pipeline(
    preprocessor,
    LogisticRegression(solver='lbfgs'))
model.fit(X_train, y_train)
print("logistic regression score: %f" % model.score(X_test, y_test))

logistic regression score: 0.804598
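
The cell above only scores the model. As a self-contained sketch of the predict step, the code below fits the same kind of pipeline on a few made-up rows (not real Titanic records) and predicts for one new raw row. The `handle_unknown='ignore'` setting is an addition not used in the example above; it is assumed here so that a category unseen during training does not raise an error at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with the same columns as `features`; not real Titanic data.
train = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 2],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "age": [29.0, 40.0, 18.0, 33.0, 54.0, 25.0],
    "fare": [200.0, 8.0, 30.0, 7.5, 90.0, 26.0],
    "embarked": ["S", "S", "C", "Q", "S", "C"],
})
labels = [1, 0, 1, 0, 0, 1]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),
    # Unseen categories are encoded as all zeros instead of raising.
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["embarked", "sex", "pclass"]),
])
model = make_pipeline(preprocessor, LogisticRegression(solver="lbfgs"))
model.fit(train, labels)

# A new raw row goes through the same preprocessing before prediction.
new_passengers = pd.DataFrame({
    "pclass": [2], "sex": ["female"], "age": [30.0],
    "fare": [20.0], "embarked": ["S"],
})
pred = model.predict(new_passengers)
proba = model.predict_proba(new_passengers)
print(pred, proba)
```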
