# Pipelines 

## Introduction 
Pipelines are a simple way to keep data preprocessing and modeling code organized. A pipeline bundles the preprocessing and modeling steps together so that the whole bundle can be used as if it were a single step.

Pipelines make our code cleaner (we no longer need to manually keep track of training and validation data at each step) and less bug-prone (there are fewer opportunities to misapply or forget a preprocessing step). They are also easier to productionize and give us more options for model validation (we'll see this in **cross-validation**).

## Setting Up a Pipeline 

To set up a pipeline, we first define the preprocessing steps for each type of column and then combine them into a **ColumnTransformer**.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
# strategy='constant' replaces missing values with fill_value, which defaults to 0 for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
# strategy='most_frequent' fills missing values with the most frequent value in each
# column (here a string); the result is then one-hot encoded
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
# (numerical_cols and categorical_cols are assumed to be lists of column names defined beforehand)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
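
The preprocessor above relies on `numerical_cols` and `categorical_cols`, which are not defined in this section. As a minimal sketch, assuming the features live in a pandas DataFrame, one way to build these lists is to select columns by dtype. The DataFrame `X_toy` and its column names are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for this example
X_toy = pd.DataFrame({
    "rooms": [3, 2, np.nan, 4],
    "area": [120.0, 80.5, 95.0, np.nan],
    "suburb": ["north", np.nan, "south", "north"],
})

# Numeric columns go to the numerical transformer
numerical_cols = X_toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

In real data you may prefer to list the columns by hand, for example to leave out categorical columns with very many distinct values before one-hot encoding.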

Now that the preprocessor has been set up, we can define the model we wish to fit to the data. We then combine the preprocessor and the model to form our pipeline. Once that is done, we can fit the pipeline on the training data and validate it on the validation data.

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
# (`model` is assumed to be a scikit-learn estimator defined beforehand)
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocess the validation data and get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model, here with mean absolute error (MAE)
# (y_valid is assumed to hold the true targets for X_valid)
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
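
The cells above depend on names that are not defined in this section (`model`, `X_train`, `y_train`, `X_valid`, the column lists). The following is a self-contained sketch of the same workflow from start to finish. Everything that does not appear in the text is an assumption made for illustration: the toy data and its column names, the use of `train_test_split`, and the choice of `RandomForestRegressor` as the model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with a few missing values (illustration only)
data = pd.DataFrame({
    "rooms": [3, 2, np.nan, 4, 3, 5, 2, np.nan, 4, 3],
    "area": [120.0, 80.5, 95.0, np.nan, 110.0, 150.0, 70.0, 90.0, 130.0, 105.0],
    "suburb": ["north", np.nan, "south", "north", "east",
               "south", "east", "north", np.nan, "south"],
    "price": [300, 200, 240, 350, 280, 420, 180, 230, 340, 270],
})
X = data.drop(columns="price")
y = data["price"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

numerical_cols = ["rooms", "area"]
categorical_cols = ["suburb"]

# Same preprocessing as in the section above
numerical_transformer = SimpleImputer(strategy="constant")
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

# Assumption: a random forest stands in for the unspecified model
model = RandomForestRegressor(n_estimators=50, random_state=0)

my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", model),
])

# Imputation and encoding are fitted on the training data only,
# then reused unchanged when predicting on the validation data
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)

mae = mean_absolute_error(y_valid, preds)
print("MAE:", mae)
```

Because the preprocessing lives inside the pipeline, a single `fit` call learns both the imputation values and the model, and a single `predict` call applies the same transformations to new data.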