In this tutorial, you will learn how to use pipelines to clean up your modeling code.

# What are pipelines?

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized.  Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
1. **Cleaner Code:** Accounting for data at each step of preprocessing can get messy.  With a pipeline, you won't need to manually keep track of your training (_and validation_) data at each step.
2. **Fewer Bugs:** There are fewer opportunities to misapply a step or forget a preprocessing step.
3. **Easier to Productionize:** It can be surprisingly hard to transition a model from a prototype to something deployable at scale.  We won't go into the many related concerns here, but pipelines can help.
4. **More Options For Model Testing:** You will see an example in the next tutorial, which covers cross-validation.

# Example

We won't focus on the data loading step. For now, imagine you have already split your data into `train_X`, `test_X`, `train_y` and `test_y`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# read the data
data = pd.read_csv('../input/melb_data.csv')

# select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# select target
y = data.Price

# separate data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y)
```

You have a modeling process that uses an imputer to fill in missing values, followed by a random forest model to make predictions.  These can be bundled together with the `make_pipeline` function as shown below.

We fit and predict using this pipeline as a fused whole.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# bundle imputer and model in a pipeline
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())

# fit the imputer and the model on the training data
my_pipeline.fit(train_X, train_y)

# impute the test data with the fitted imputer, then generate predictions
predictions = my_pipeline.predict(test_X)
```
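A fitted pipeline is still inspectable, which helps when you want to confirm what each step learned. The sketch below is illustrative only: it uses small synthetic data in place of `melb_data.csv` (the values and the two-column layout are assumptions), and it relies on `make_pipeline` naming each step after its lowercased class name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the housing data (hypothetical values, not melb_data.csv)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Rooms': rng.integers(1, 6, size=200).astype(float),
    'Distance': rng.uniform(1, 30, size=200),
})
y = 100_000 * X['Rooms'] - 2_000 * X['Distance'] + rng.normal(0, 5_000, size=200)

# Blank out 10% of the Distance values so the imputer has work to do
X.loc[X.sample(frac=0.1, random_state=0).index, 'Distance'] = np.nan

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

my_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
my_pipeline.fit(train_X, train_y)

# Steps are addressable by name: 'simpleimputer' and 'randomforestregressor'
fitted_imputer = my_pipeline.named_steps['simpleimputer']
print(fitted_imputer.statistics_)  # per-column means learned from train_X only

# Score the whole bundle on the held-out data
predictions = my_pipeline.predict(test_X)
print(mean_absolute_error(test_y, predictions))
```

Note that the imputer's statistics come from the training data alone; the test data is only transformed, never used for fitting.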

For comparison, here is the code to do the same thing without pipelines:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

# define imputer and model
my_imputer = SimpleImputer()
my_model = RandomForestRegressor()

# imputation of train + test datasets
imputed_train_X = my_imputer.fit_transform(train_X)
imputed_test_X = my_imputer.transform(test_X)

# fit model
my_model.fit(imputed_train_X, train_y)

# generate predictions
predictions = my_model.predict(imputed_test_X)
```
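To check that the two approaches really do the same thing, you can run both on the same data with a fixed random seed and compare the predictions. This is a minimal sketch on synthetic arrays (the data, the seed and the `n_estimators` value are assumptions chosen to keep it small), not a benchmark.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data (hypothetical): 3 numeric features, target driven by the first
rng = np.random.default_rng(1)
train_X = rng.normal(size=(120, 3))
test_X = rng.normal(size=(40, 3))
train_y = 2.0 * train_X[:, 0] + rng.normal(scale=0.1, size=120)

# Blank out roughly 10% of the entries to simulate missing values
train_X[rng.random(train_X.shape) < 0.1] = np.nan
test_X[rng.random(test_X.shape) < 0.1] = np.nan

# With a pipeline
my_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)
my_pipeline.fit(train_X, train_y)
pipeline_preds = my_pipeline.predict(test_X)

# Without a pipeline: every intermediate array is tracked by hand
my_imputer = SimpleImputer()
my_model = RandomForestRegressor(n_estimators=20, random_state=0)
imputed_train_X = my_imputer.fit_transform(train_X)
imputed_test_X = my_imputer.transform(test_X)
my_model.fit(imputed_train_X, train_y)
manual_preds = my_model.predict(imputed_test_X)

# Same seed and same steps should give matching predictions
same = np.allclose(pipeline_preds, manual_preds)
print(same)
```

The manual version needs two extra variables for a single preprocessing step; each additional step adds another pair to keep straight.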

# Conclusion

This particular pipeline was only a small improvement in code elegance. But pipelines become increasingly valuable as your data processing becomes increasingly sophisticated.
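As one illustration of more sophisticated processing, the sketch below imputes numeric columns and one-hot encodes a categorical column inside a single pipeline. It goes beyond the example above: `ColumnTransformer` and `OneHotEncoder` are real scikit-learn classes, but the tiny data frame, the `Type` column and its values are made up for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two numeric columns with gaps, one categorical column
X = pd.DataFrame({
    'Rooms': [2, 3, np.nan, 4, 3, 2, 5, np.nan],
    'Landsize': [150.0, np.nan, 300.0, 420.0, 210.0, 180.0, np.nan, 260.0],
    'Type': ['h', 'u', 'h', 'h', 't', 'u', 'h', 't'],
})
y = pd.Series([500.0, 650.0, 700.0, 900.0, 640.0, 480.0, 1100.0, 720.0])

# Route each group of columns to its own preprocessing step
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(), ['Rooms', 'Landsize']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Type']),
])

# Preprocessing and model still behave as a single estimator
full_pipeline = make_pipeline(
    preprocessor,
    RandomForestRegressor(n_estimators=20, random_state=0),
)
full_pipeline.fit(X, y)

# New rows may contain missing numbers and a category never seen in training
new_rows = pd.DataFrame({
    'Rooms': [3, np.nan],
    'Landsize': [np.nan, 200.0],
    'Type': ['h', 'x'],
})
preds = full_pipeline.predict(new_rows)
print(preds)
```

Doing the same by hand would mean fitting and applying each transformer separately to the training and new data, then stitching the columns back together before every `fit` and `predict` call.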

# Your Turn
