# Pipelines: Automating the Automatic Learning

Pipelines can keep our code neat and clean all the way from gathering & cleaning our data, to creating models & fine-tuning them!

**Advantages**: 
- Reduces complexity
- Convenient 
- Flexible 
- Can help prevent mistakes (like data leakage)

In [1]:
from sklearn.impute import SimpleImputer  # replaces missing data
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Getting some data
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=27)

## No Pipeline 

In [3]:
# Define transformers (will adjust/massage the data)
imputer = SimpleImputer(strategy="median") # replaces missing values
std_scaler = StandardScaler() # scales the data


# Define the classifier (predictor) to train
rf_clf = RandomForestClassifier()

# Have the transformers and the classifier learn/train/fit from the training data
X_train_filled = imputer.fit_transform(X_train)
X_train_scaled = std_scaler.fit_transform(X_train_filled)
rf_clf.fit(X_train_scaled, y_train)

# Predict using the trained classifier (still need to do the transformations)
X_test_filled = imputer.transform(X_test)
X_test_scaled = std_scaler.transform(X_test_filled)
y_pred = rf_clf.predict(X_test_scaled)



> **Note that if we were to add more steps in this process, we'd have to change both the training and testing processes.**

## With a Pipeline

In [4]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), 
        ('std_scaler', StandardScaler()),
        ('rf_clf', RandomForestClassifier()),
])


# Train the pipeline (transformations & predictor)
pipeline.fit(X_train, y_train)

# Predict using the pipeline (includes the transformers & trained predictor)
predicted = pipeline.predict(X_test)
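
Cross-validation is one place where the data-leakage advantage shows up. The following is a minimal sketch, assuming the same iris data and the same three steps as above; the name `cv_pipeline`, the 5 folds, and the fixed `random_state` are illustrative choices, not part of the original notebook. Because the whole pipeline is handed to `cross_val_score`, the imputer and scaler are re-fit on the training portion of each fold only.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

# Hypothetical name; same steps as the pipeline above
cv_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
    ('rf_clf', RandomForestClassifier(random_state=27)),
])

# Each fold fits the transformers on that fold's training part only,
# so statistics from the held-out part never reach the model
scores = cross_val_score(cv_pipeline, X, y, cv=5)
print(scores.mean())
```

Doing the same without a pipeline would mean repeating the fit/transform bookkeeping by hand inside every fold.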



In [None]:
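# sklearn_pandas and sklearn2pmml are separate packages from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline

# `dummies` is not defined in this notebook; assumed: a list of categorical column names
mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies]
)

lm = PMMLPipeline([("mapper", mapper),
                   ("onehot", OneHotEncoder()),
                   ("regressor", LinearRegression())])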

## Parts of the Pipeline 
Scikit-learn has a class called [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that is very logical and versatile. We can break up the steps within a full process. But it'll help if we define what the different parts are.
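
The individual steps stay accessible once a pipeline is built. A minimal sketch, assuming the same three steps used earlier (the variable names `scaler`, `preprocessing`, and `step_names` are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
    ('rf_clf', RandomForestClassifier()),
])

# Look up a step by the name it was given
scaler = pipeline.named_steps['std_scaler']

# Slicing returns a new Pipeline holding only the selected steps,
# here everything except the final classifier
preprocessing = pipeline[:-1]

step_names = [name for name, _ in pipeline.steps]
print(step_names)
```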

### Estimators 
* This is any object in the pipeline that can take in data and *estimate* (or **learn**) some parameters. 

This means regression and classification models are estimators, but so are objects that transform the original dataset ([Transformers](pipeline_intro.ipynb#Transformers)), such as a standard scaler.

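* fit()
    - All estimators estimate/learn by calling the `fit()` method and passing in the dataset (plus the labels, for supervised estimators). Any other parameters that "help" the estimator learn are called hyperparameters: they tweak the learning process and are set when the estimator is created (for example `strategy="median"` for the imputer), not learned from the data.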
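
A minimal sketch of the difference between a hyperparameter and a learned parameter, using `SimpleImputer` on a made-up toy array (`X_toy` is illustrative and not part of the notebook's data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value (made up for illustration)
X_toy = np.array([[1.0], [2.0], [np.nan], [9.0]])

# Hyperparameter: chosen by us when the estimator is created
imputer = SimpleImputer(strategy="median")
print(imputer.get_params()["strategy"])

# Learned parameter: estimated from the data by fit()
# (the median of the observed values 1, 2, 9 is 2)
imputer.fit(X_toy)
print(imputer.statistics_)
```

By scikit-learn convention, attributes learned during `fit()` end with an underscore, which makes them easy to tell apart from hyperparameters.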

### Transformers
Some estimators can change the original data to something new, a **transformation**. You can think of examples of these **transformers** when you do scaling, data cleaning, or expanding/reducing on a dataset.

* transform() 
    * Transformers use the `transform()` method to apply the transformation to a dataset. The transformer must be fitted with `fit()` first. 
    
* fit_transform()
    * Remember that all estimators have a fit() method, so a transformer can use the fit() method to learn something about the given dataset. After learning with fit(), a transformation on the dataset can be made with the transform() method.

    * An example of this would be a transformer that performs min-max normalization on the dataset; the fit() method would learn the minimum and maximum of each feature and the transform() method would rescale the dataset using those values.

    * When you call fit and transform with the same dataset, you can simply call the fit_transform() method. This essentially has the same results as calling fit() and then transform() on the dataset but possibly with some optimization and efficiencies baked in.
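
The normalization example can be sketched with scikit-learn's `MinMaxScaler`. This is a minimal illustration on a made-up single-column array (`X_toy`, `two_step`, and `one_step` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature dataset (made up for illustration)
X_toy = np.array([[0.0], [5.0], [10.0]])

# Two separate calls: learn the min/max, then rescale
scaler = MinMaxScaler()
scaler.fit(X_toy)
two_step = scaler.transform(X_toy)

# One call that does both on the same dataset
one_step = MinMaxScaler().fit_transform(X_toy)

print(scaler.data_min_, scaler.data_max_)
print(np.allclose(two_step, one_step))
```

On new data such as a test set, only `transform()` should be called, so that the values learned from the training set are reused.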

### Predictors 
We've been using predictors whenever we've been making predictions with a classifier or regressor. We would use the fit() method to train our predictor object and then feed in new data to make predictions (based on what it learned in the fitting stage).

* predict()
    * As you can probably guess, the predict() method predicts results for a dataset given to it, after the predictor has been trained with the fit() method.
    
* score()
    * Predictors also have a score() method that can be used to evaluate how well the predictor performed on a dataset (such as the test set).
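
A minimal sketch of both methods on a classifier, assuming the same iris split as earlier in the notebook (the fixed `random_state` on the classifier is an added assumption to make runs repeatable):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27)

rf_clf = RandomForestClassifier(random_state=27)
rf_clf.fit(X_train, y_train)

# predict() returns one predicted label per row
y_pred = rf_clf.predict(X_test)

# For classifiers, score() returns the mean accuracy on the given data
accuracy = rf_clf.score(X_test, y_test)
print(accuracy)
```

A `Pipeline` whose last step is a predictor exposes `predict()` and `score()` as well, applying the earlier transformers to the input first.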

## Resources 

* Check out Aurélien Géron's notebook of an [end-to-end ML project](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb) on his GitHub repo, based on his book [_Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed)_](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/)


* [Using Google Colab for GPU/TPU speed](https://www.analyticsvidhya.com/blog/2020/03/google-colab-machine-learning-deep-learning/)

* [DataSchool video on encoding categorical features for pipelines](https://www.dataschool.io/encoding-categorical-features-in-python/)