# Pipelines: Automating the Automatic Learning

**Pipelines** are a useful tool for organizing the full data science process!

Pipelines can keep our code neat and clean all the way from gathering & cleaning our data to creating & fine-tuning our models!

But like with all things, you need to know how to make a proper and useful pipeline:

![data_pipeline_xkcd](./images/data_pipeline_xkcd.png)

# Advantages

## Reduces Complexity
You can focus on one part of the pipeline at a time and debug or adjust parts as needed.

## Convenient
You can summarize your fine-detail steps into the pipeline. That way you can focus on the big-picture aspects.

## Flexible
The same pipeline can be applied to different models, and you can run optimization techniques like grid search and random search over its hyperparameters!

## Prevent Mistakes!
We can focus on one section at a time.

We can also ensure that data leakage doesn't occur between our training dataset and our validation/testing datasets!

![pipe_leaking_cartoon](./images/pipe_leaking_cartoon.jpg)
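
As a minimal sketch of the leakage point (the iris data, the step names, and the `random_state` value are assumptions chosen for illustration), cross-validating a whole pipeline means the scaler is refit on the training portion of every split, so statistics from the held-out portion never reach the model:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler and the classifier travel together, so each CV split
# fits the scaler on that split's training portion only.
leak_free = Pipeline([
    ("std_scaler", StandardScaler()),
    ("rf_clf", RandomForestClassifier(random_state=27)),
])

scores = cross_val_score(leak_free, X, y, cv=5)
print(scores.mean())
```

Scaling all of `X` first and cross-validating only the classifier would let the held-out rows influence the scaler's mean and standard deviation.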

# Example of Using a Pipeline
We can imagine doing the full steps planned out for a dataset. We technically don't need to use the `Pipeline` class, but it makes the process much more manageable.

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

In [10]:
# Getting some data
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=27)

## 1. Without the Pipeline class

In [11]:
# Define transformers (will adjust/massage the data)
imputer = SimpleImputer(strategy="median") # replaces missing values
std_scaler = StandardScaler() # scale the data
pca = PCA()

# Define the classifier (predictor) to train
rf_clf = RandomForestClassifier()

# Have the classifier (and full pipeline) learn/train/fit from the data
X_train_filled = imputer.fit_transform(X_train)
X_train_scaled = std_scaler.fit_transform(X_train_filled)
X_train_reduce = pca.fit_transform(X_train_scaled)
rf_clf.fit(X_train_reduce, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [12]:
# Predict using the trained classifier (still need to do the transformations)
X_test_filled = imputer.transform(X_test)
X_test_scaled = std_scaler.transform(X_test_filled)
X_test_reduce = pca.transform(X_test_scaled)  # reuse the PCA fitted on the training data
y_pred = rf_clf.predict(X_test_reduce)

> Note that if we were to add more steps in this process, we'd have to change both the *training* and *testing* processes.

## 2. With the Pipeline class

In [13]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
    ('pca', PCA()),
    ('rf_clf', RandomForestClassifier()),
])

# Train the pipeline (transformations & predictor)
pipeline.fit(X_train, y_train)

# Predict using the pipeline (includes the transforms & trained predictor)
predicted = pipeline.predict(X_test)

> If we need to change our process, we change it just *once* in the Pipeline.

**Notice** that each parameter of each component of the pipeline can be accessed by using the component's name, followed by a double underscore `__` and the parameter's name (for example, `pca__n_components`).
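
Here is a small sketch of that naming scheme in use, rebuilding the same pipeline as above. The grid values and the `random_state` settings are assumptions picked for illustration, not tuned recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
    ("pca", PCA()),
    ("rf_clf", RandomForestClassifier(random_state=27)),
])

# Set a single parameter: <step name>__<parameter name>
pipeline.set_params(pca__n_components=2)

# The same names work as keys in a grid search (illustrative values)
param_grid = {
    "pca__n_components": [2, 3],
    "rf_clf__n_estimators": [10, 50],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Because the search refits the entire pipeline for every candidate, the preprocessing steps are tuned together with the model.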

# Parts of a Pipeline

Scikit-learn has a class called [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that is very logical and versatile. We can break up the steps within a full process. But it'll help if we define what the different parts are.

## Estimator

This is any object in the pipeline that can take in data and *estimate* (or **learn**) some parameters.

This means regression and classification models are estimators, but so are objects that transform the original dataset (**transformers**), such as a standard scaler.

### Usage (Methods)

#### `fit`

All estimators estimate/learn by calling the `fit()` method and passing in the dataset. Other parameters can be set to "help" the estimator learn. These are called **hyperparameters**: parameters used to tweak the learning process. In scikit-learn they are usually set when the estimator is created rather than passed to `fit()`.
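
A tiny sketch of the difference between what we set and what `fit()` learns, using `StandardScaler` on a made-up array (`X_demo` is hypothetical data for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# Hyperparameters are chosen by us when the estimator is created...
scaler = StandardScaler(with_mean=True, with_std=True)

# ...while fit() estimates parameters from the data itself
scaler.fit(X_demo)

# Learned attributes end with an underscore by scikit-learn convention
print(scaler.mean_)   # per-feature means
print(scaler.scale_)  # per-feature standard deviations
```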

## Transformer

Some estimators can change the original data into something new, a **transformation**. You can think of examples of these **transformers** when you do scaling, data cleaning, or expanding/reducing a dataset.

### Usage (Methods) 

#### `transform`
Transformers will call the `transform()` method to apply the transformation to a dataset.

#### `fit_transform`
Remember that all estimators have a `fit()` method, so a transformer can use the `fit()` method to learn something about the given dataset. After learning with `fit()`, a transformation on the dataset can be made with the `transform()` method.

An example of this would be a transformer that performs min-max normalization on the dataset (such as scikit-learn's `MinMaxScaler`); the `fit()` method would learn the minimum and maximum of each feature and the `transform()` method would rescale the dataset.

When you call fit and transform with the same dataset, you can simply call the `fit_transform()` method. This essentially has the same results as calling `fit()` and then `transform()` on the dataset but possibly with some optimization and efficiencies baked in.
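
As a quick check of that equivalence, here is a sketch with `MinMaxScaler` on a made-up single-feature array (`X_demo` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature with values 1, 3, 5
X_demo = np.array([[1.0], [3.0], [5.0]])

# Two steps: learn the min/max, then rescale
two_step = MinMaxScaler()
two_step.fit(X_demo)
X_two_step = two_step.transform(X_demo)

# One step: learn and rescale in a single call
X_one_step = MinMaxScaler().fit_transform(X_demo)

print(two_step.data_min_, two_step.data_max_)  # what fit() learned
print(np.allclose(X_two_step, X_one_step))     # both routes agree
```

On held-out data we would call only `transform()`, so the minimum and maximum learned from the training data are reused.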

## Predictor

We've been using **predictors** whenever we've been making predictions with a classifier or regressor. We would use the `fit()` method to train our predictor object and then feed in new data to make predictions (based on what it learned in the fitting stage).

### Usage (Methods)
#### `predict`
As you can probably guess, the `predict()` method predicts results from a dataset given to it after being trained with the `fit()` method.

#### `score`
Predictors also have a `score()` method that can be used to evaluate how well the predictor performed on a dataset (such as the test set).
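
A short sketch of both methods on the iris data (the split and `random_state` values are assumptions for reproducibility). For scikit-learn classifiers, `score()` reports mean accuracy, which we can confirm against `accuracy_score`:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27)

# Train the predictor
rf_clf = RandomForestClassifier(random_state=27)
rf_clf.fit(X_train, y_train)

# predict() returns one label per row of the test set
y_pred = rf_clf.predict(X_test)

# score() predicts internally and compares against the true labels
test_accuracy = rf_clf.score(X_test, y_test)
print(test_accuracy == accuracy_score(y_test, y_pred))
```

Regressors follow the same pattern, but their `score()` returns the R² coefficient instead of accuracy.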

# Using a Pipeline
Check out Aurélien Géron's notebook of an [end-to-end ML project](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb) in his GitHub repo, which accompanies his book [Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/).