In this tutorial, you will learn how to use **pipelines** to clean up your modeling code.

# Introduction

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized.  Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
1. **Cleaner Code:** Accounting for data at each step of preprocessing can get messy.  With a pipeline, you won't need to manually keep track of your training and validation data at each step.
2. **Fewer Bugs:** There are fewer opportunities to misapply a step or forget a preprocessing step.
3. **Easier to Productionize:** It can be surprisingly hard to transition a model from a prototype to something deployable at scale.  We won't go into the many related concerns here, but pipelines can help.
4. **More Options for Model Validation:** You will see an example in the next tutorial, which covers cross-validation.

# Example

As in the previous tutorial, we will work with the [Melbourne Housing dataset](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home).  

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in `train_X`, `valid_X`, `train_y`, and `valid_y`. 

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# select target
y = data.Price

# separate data into training and validation sets
train_X, valid_X, train_y, valid_y = train_test_split(X, y)
```

### Without pipeline

As you've learned, we can use an **imputer** to fill in missing values and train a **random forest** model to make predictions.  This can be done without a pipeline with the code below:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

# define imputer and model
my_imputer = SimpleImputer()
my_model = RandomForestRegressor()

# imputation of training and validation sets
imputed_train_X = my_imputer.fit_transform(train_X)
imputed_valid_X = my_imputer.transform(valid_X)

# fit model
my_model.fit(imputed_train_X, train_y)

# generate predictions
predictions = my_model.predict(imputed_valid_X)
```

### With pipeline

Instead of dealing with the imputer and model training in separate steps, we can bundle them together in a pipeline with the `make_pipeline()` function as shown below. 

```python
from sklearn.pipeline import make_pipeline

# bundle imputer and model in a pipeline
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())

# imputation of train dataset, fit model
my_pipeline.fit(train_X, train_y)

# imputation of validation dataset, generate predictions
predictions = my_pipeline.predict(valid_X)
```

In the code cell above, we fit and predict using the pipeline as a fused whole.  

![pipelines](./images/tut4_pipeline.png)

This simplifies our original code in a few ways:
- With the pipeline, we transform the training data with the imputer and fit the model in a single line of code.  In contrast, without the pipeline, we had to work with the imputer and model in separate steps.
- With the pipeline, we supply the unprocessed features in `valid_X` to the `predict()` method; the pipeline automatically imputes the missing values, using the fill values learned from the training data, before generating predictions.  Without the pipeline, we had to remember to preprocess the validation features separately before making predictions.
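
To check that the two workflows really do the same work, the sketch below runs both on the same inputs and compares the predictions. It is a minimal illustration under a few assumptions that are not part of the tutorial: the data is a small synthetic table (hypothetical columns `a`, `b`, `c`) rather than the Melbourne dataset, and `random_state=0` and `n_estimators=50` are fixed on the split and on both forests so that the comparison is deterministic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (hypothetical, not the Melbourne dataset)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = 2 * X["a"] - X["b"] + rng.normal(scale=0.1, size=200)

# Blank out roughly 10% of the feature values so the imputer has work to do
X = X.mask(rng.random(X.shape) < 0.1)

train_X, valid_X, train_y, valid_y = train_test_split(X, y, random_state=0)

# Manual workflow: impute, fit, impute again, predict
imputer = SimpleImputer()
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(imputer.fit_transform(train_X), train_y)
manual_predictions = model.predict(imputer.transform(valid_X))

# Pipeline workflow: the same two steps, driven by fit() and predict()
pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
pipeline.fit(train_X, train_y)
pipeline_predictions = pipeline.predict(valid_X)

# With identical seeds, both workflows should agree
same = np.allclose(manual_predictions, pipeline_predictions)
print(same)

# make_pipeline names each step after its lowercased class name
print(list(pipeline.named_steps))
```

The fitted imputer and model remain accessible through `pipeline.named_steps`, so bundling the steps does not hide them from inspection.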

# Conclusion

This particular pipeline was only a small improvement in code elegance. But pipelines become increasingly valuable as your data processing becomes increasingly sophisticated.
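
As a rough illustration of what "more sophisticated" might look like, the sketch below applies different preprocessing to different columns and still exposes a single `fit()` and `predict()`. It goes beyond what this tutorial covers: `ColumnTransformer` and `OneHotEncoder` are real scikit-learn classes, but the toy data, the column names `rooms` and `type`, and the choice of steps are assumptions made purely for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a numeric column with gaps and a categorical column
X = pd.DataFrame({
    "rooms": [2, 3, np.nan, 4, 3, 2, 5, np.nan],
    "type": ["h", "u", "h", "t", "t", "u", "h", "t"],
})
y = pd.Series([500, 650, 480, 900, 700, 450, 1100, 620])

# Route each column to its own preprocessing step
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
])

# Bundle preprocessing and model exactly as before
full_pipeline = make_pipeline(
    preprocessor,
    RandomForestRegressor(n_estimators=50, random_state=0),
)

full_pipeline.fit(X, y)

# Predicting on the training rows only demonstrates the call, not model quality
predictions = full_pipeline.predict(X)
print(predictions.shape)
```

Doing the same by hand would mean tracking the imputed numeric column and the encoded categorical columns separately for every dataset you transform, which is the bookkeeping a pipeline takes over.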
