# <font color='#eb3483'> Pipelines with scikit-learn </font>
In class we covered how to do preprocessing, train/test splitting, and hyperparameter optimization. In this notebook we'll show you a handy tool for combining all these steps into a custom scikit-learn pipeline.

In [1]:
from IPython.display import Image
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()

import warnings
warnings.simplefilter("ignore")

### <font color='#eb3483'> Toy Data </font>

We'll use a custom 'toy' dataset that contains a nice mix of different data types.

In [2]:
data = pd.read_csv("data/data_processing.csv")

In [3]:
data.head()

Unnamed: 0,missing_col,col2,col3,outliers_col,outliers_col2,categorical_col,ordinal_col,text_col
0,19.0,21.0,1.579757,76,0.146815,mouse,normal,"The tunnel wound on and on, going fairly but n..."
1,12.0,87.0,0.257069,85,0.210569,dog,bad,"It had a perfectly round door like a porthole,..."
2,76.0,29.0,0.564826,49,2.427333,elephant,very bad,"It had a perfectly round door like a porthole,..."
3,42.0,66.0,0.34506,67,1.601458,elephant,very good,"It had a perfectly round door like a porthole,..."
4,49.0,43.0,0.861007,-14,22.375265,cat,very bad,"This hobbit was a very well-to-do hobbit, and ..."


In a previous section we saw how to use scikit-learn transformers (in the `preprocessing` module) to process data and prepare it for predictive modeling. A sequence of such processing steps is called a **Pipeline**.

For example, the pipeline we used in the previous section looked like this:

![pipeline](pipeline.png)

We are going to build this pipeline using scikit learn's [pipelines](http://scikit-learn.org/stable/modules/pipeline.html).

A scikit-learn [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) lets us chain together any number of transformers (which transform data), optionally ending with an estimator (which predicts).
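
As a minimal sketch of the idea, the block below chains one transformer and one estimator on synthetic data. The names `X_demo`, `y_demo` and `demo_pipeline` are made up for illustration and are not part of the toy dataset used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): y is an exact linear function of X.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0

# Each step is a (name, object) pair. Every step except the last
# must be a transformer; the last one can be an estimator.
demo_pipeline = Pipeline([
    ("scale", StandardScaler()),      # transformer: fit + transform
    ("model", LinearRegression()),    # estimator: fit + predict
])

demo_pipeline.fit(X_demo, y_demo)
demo_predictions = demo_pipeline.predict(X_demo)
```

Calling `fit` on the pipeline fits the scaler, transforms the data, and then fits the regression on the transformed data; `predict` reuses the already-fitted scaler before predicting.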

In [5]:
# Let's grab our different data types
target_variable = "col3"
numerical_cols = data.drop(columns=target_variable).select_dtypes(np.number).columns
categorical_col = ['categorical_col']
ordinal_col = ["ordinal_col"]
text_col = ['text_col']

### <font color='#eb3483'> Numerical Pipeline </font>

We will first create a pipeline for each kind of data, then we will see how to put them together into a single Pipeline.

First we create the preprocessing transformers the same way we did before.

In [6]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

In [7]:
from sklearn.pipeline import make_pipeline

A sklearn Pipeline is defined as a sequence of steps.

For example, for the numerical variables our pipeline will have 2 steps:
- first we will impute,
- then we will scale. 

We can create this numerical pipeline like this:

In [8]:
numerical_pipeline = make_pipeline(
    imputer,
    scaler
)

In [9]:
numerical_pipeline

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True))],
         verbose=False)

Now we can use `fit` and `transform` with the pipeline and it will apply all the transformers sequentially.

In [10]:
numerical_pipeline.fit_transform(data[numerical_cols])

array([[-1.06119853e+00, -1.03182710e+00,  9.81338211e-01,
        -1.12861489e-01],
       [-1.31267830e+00,  1.29266307e+00,  1.10324526e+00,
        -1.12655904e-01],
       [ 9.86565309e-01, -7.50070712e-01,  6.15617059e-01,
        -1.05507587e-01],
       ...,
       [-2.55267319e-16, -1.63055941e+00,  2.09260225e-01,
        -1.10465017e-01],
       [ 1.59730189e+00,  4.47393917e-01,  8.85111558e+00,
        -1.05060505e-01],
       [ 9.14713946e-01, -1.06704664e+00, -8.06631862e-01,
        -1.12963389e-01]])
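
To make "sequentially" concrete, here is a small sketch comparing manual chaining with the pipeline on a made-up array (`X_small` is hypothetical and unrelated to the toy dataset). Both routes should produce the same result.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small made-up array with one missing value.
X_small = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

# Manual chaining: the output of one transformer feeds the next.
imputed = SimpleImputer(strategy="mean").fit_transform(X_small)
manual_result = StandardScaler().fit_transform(imputed)

# The pipeline performs the same two steps in the same order.
pipeline_result = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
).fit_transform(X_small)

print(np.allclose(manual_result, pipeline_result))  # prints True
```

A value imputed with the column mean ends up at (numerically) zero after scaling, which is likely what tiny entries such as `-2.55e-16` in the output above correspond to.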

This is awesome: we don't need to chain the processing steps manually. But we still have a problem: we can't apply this numerical pipeline to the whole dataframe.

In [11]:
numerical_pipeline.fit_transform(data)

ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'mouse'

The pipeline fails when it tries to apply the imputer and the scaler to the non numerical columns.

We can fix this by applying a step that will select only the numerical columns before applying the imputer and the scaler.

We can use the package [mlxtend](https://github.com/rasbt/mlxtend/pull/378) (which provides additional functionality on top of scikit-learn) to import a `ColumnSelector`, a transformer that selects columns.

For example, we can do:

In [None]:
from mlxtend.feature_selection import ColumnSelector

numerical_col_selector = ColumnSelector(cols=numerical_cols)

We can apply this selector to the whole dataframe and it will automatically select the numerical columns

In [None]:
numerical_col_selector.fit_transform(data)
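
To show what a column-selecting transformer does conceptually, here is a simplified, hypothetical stand-in written with scikit-learn's base classes. `SimpleColumnSelector` is an illustrative name and is not mlxtend's implementation; it only assumes that the input is a pandas DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical, simplified selector: keeps the named DataFrame columns."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn: column selection is stateless.
        return self

    def transform(self, X):
        # Return the selected columns as a NumPy array.
        return X[list(self.cols)].to_numpy()


toy = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"], "c": [3.0, 4.0]})
selected = SimpleColumnSelector(cols=["a", "c"]).fit_transform(toy)
```

Because the class implements `fit` and `transform`, it can be used as a pipeline step like any other transformer.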

Now we can create a numerical pipeline that takes care of selecting the appropriate columns.

In [None]:
numerical_pipeline = make_pipeline(
    numerical_col_selector,
    imputer,
    scaler
)

And now we can apply this pipeline to the whole dataset

In [None]:
numerical_pipeline.fit_transform(data)[:5]

### <font color='#eb3483'> Text Pipeline </font>


We will proceed the same way with the text column. This pipeline requires a [DenseTransformer](https://rasbt.github.io/mlxtend/user_guide/preprocessing/DenseTransformer/) that transforms the output produced by TfidfVectorizer (a sparse matrix) into a numpy array.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from mlxtend.preprocessing import DenseTransformer

text_pipeline = make_pipeline(
    ColumnSelector(cols=text_col, drop_axis=True),
    TfidfVectorizer(),
    DenseTransformer()
)

In [None]:
text_pipeline.fit_transform(data)
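
The sparse-to-dense conversion can be sketched with scikit-learn and SciPy alone. The two sentences below are made up for illustration, and `.toarray()` is used here as a plain substitute for the dense conversion step, not as the mlxtend transformer itself.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences, for illustration only.
sentences = ["the hobbit lived in a hole", "the tunnel wound on and on"]

vectorizer = TfidfVectorizer()
tfidf_sparse = vectorizer.fit_transform(sentences)  # SciPy sparse matrix
tfidf_dense = tfidf_sparse.toarray()                # plain NumPy array

# One row per sentence, one column per vocabulary term.
print(sparse.issparse(tfidf_sparse), type(tfidf_dense).__name__, tfidf_dense.shape)
```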

### <font color='#eb3483'> Categorical Pipeline </font>
For categorical variables, we'll use **one-hot encoding**, which creates k binary columns (where k is the number of unique values of the variable). In each row, all of these columns are 0 except the one that represents the row's actual category.
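
As a quick illustration of the encoding itself, here is a sketch using `pandas.get_dummies` on a made-up series. This is only meant to show the shape of the result; it is not the encoder used in the pipeline.

```python
import pandas as pd

# Made-up categorical values with k = 3 unique categories.
animals = pd.Series(["mouse", "dog", "elephant", "dog"], name="animal")

# One binary column per unique category.
one_hot = pd.get_dummies(animals, dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column of its category.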

We can use the transformer `OneHotEncoder` that is part of the package [`category_encoders`](http://contrib.scikit-learn.org/categorical-encoding/), which is a package that adds additional categorical encoders that are compatible with scikit-learn.

In [None]:
from category_encoders import OneHotEncoder

As before, we select the column first and then apply one-hot encoding.

In [None]:
categorical_pipeline = make_pipeline(
     ColumnSelector(cols=categorical_col),
     OneHotEncoder()
)

categorical_pipeline.fit_transform(data)[:5]

### <font color='#eb3483'> Ordinal Pipeline </font>
To encode the ordinal column, we first select it with `ColumnSelector` and then apply `OrdinalEncoder`. `OrdinalEncoder` needs to know which columns to transform, and because the output of `ColumnSelector` is a NumPy array (which has no column names), we refer to the column by its position, 0.

In [None]:
from category_encoders import OrdinalEncoder

# ColumnSelector's output is an array, so we use the column 0 for ordinal encoder
ordinal_encoder = OrdinalEncoder(mapping=[
    {"col": 0, 
      "mapping": {
        "very bad": 0,
        "bad": 1,
        "normal": 2,
        "good": 3,
        "very good": 4
      } 
     }
])
ordinal_pipeline = make_pipeline(
    ColumnSelector(cols=ordinal_col),
    ordinal_encoder
)

ordinal_pipeline.fit_transform(data)[:5]
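
The effect of the mapping can be sketched with plain pandas. The series `ratings` is made up, and `Series.map` is used here only to illustrate the ordering, not as a replacement for the encoder in the pipeline.

```python
import pandas as pd

# Same ordering as the mapping above, applied with plain pandas.
quality_order = {"very bad": 0, "bad": 1, "normal": 2, "good": 3, "very good": 4}

ratings = pd.Series(["normal", "bad", "very bad", "very good"])
encoded = ratings.map(quality_order)

print(encoded.tolist())  # prints [2, 1, 0, 4]
```

Unlike one-hot encoding, the resulting integers preserve the order of the categories.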

## <font color='#eb3483'> Pipeline Union </font>

Now we have the individual pipelines that process each data type.

We just need a [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion) to join all of their outputs together. A feature union doesn't apply its steps sequentially. Instead, it passes the same input to every step independently and concatenates their outputs side by side.

![pipeline](pipeline.png)

In [None]:
from sklearn.pipeline import make_union

In [None]:
processing_pipeline = make_union(
    numerical_pipeline,
    text_pipeline,
    categorical_pipeline,
    ordinal_pipeline
)

In [None]:
processing_pipeline

Now we can use this FeatureUnion to transform the whole dataset

In [None]:
processing_pipeline.fit_transform(data)
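
Here is a small sketch of how a feature union behaves, using only scikit-learn and a made-up array. The two branches (`first_col`, `second_col`) are hypothetical helpers built with `FunctionTransformer` for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Each branch receives the full input and keeps a single column.
first_col = FunctionTransformer(lambda X: X[:, [0]])
second_col = FunctionTransformer(lambda X: X[:, [1]])

union = make_union(
    first_col,                                    # column 0, unchanged
    make_pipeline(second_col, StandardScaler()),  # column 1, scaled
)

# The branch outputs are stacked side by side.
combined = union.fit_transform(X_demo)
print(combined.shape)  # prints (3, 2)
```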

Finally, we just have to add an estimator to the end of the pipeline, so it can be trained with the transformed data.

In [None]:
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
estimator_pipeline = make_pipeline(
    processing_pipeline,
    estimator
)

Now that we added an estimator to the pipeline we can do `fit` and `predict`

In [None]:
estimator_pipeline.fit(data, data[target_variable])

In [None]:
estimator_pipeline.predict(data)[:5]

Pipelines do more than help organize the different steps: we can also use all of scikit-learn's tools with them! For example, we can run cross-validation with the estimator pipeline we just made.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
X = data.drop(columns=target_variable)
y = data[target_variable]

In [None]:
mae_cv = cross_val_score(estimator_pipeline, 
                X, 
                y,
                scoring='neg_mean_absolute_error', 
                cv=5
)
mae_cv

In [None]:
mae_cv.mean()
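
Note that `neg_mean_absolute_error` returns negated errors, so that higher is always better. The sketch below, on synthetic data with made-up names, shows how to flip the sign to read the scores as a regular MAE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

neg_mae = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_absolute_error", cv=5,
)

# Scores are negated errors, so flip the sign to read them as MAE.
mae_per_fold = -neg_mae
print(mae_per_fold.mean())
```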

## <font color='#eb3483'> Hyperparameter Optimization </font>

Because pipelines follow the scikit-learn convention of implementing the methods `fit`, `transform` and `predict`, we can pass a pipeline to a hyperparameter search just like any other estimator.

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
predictive_pipeline = Pipeline(
    [
     ("processing", processing_pipeline),
     ("estimator", LinearRegression())   
    ])

This pipeline has a lot of hyperparameters we can tweak: basically every parameter of every step in the pipeline. We can see their names like this:

In [None]:
sorted(predictive_pipeline.get_params().keys())
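
The parameter names follow the pattern `<step name>__<parameter name>` (with a double underscore). A minimal sketch on a smaller, hypothetical pipeline:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

small_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("estimator", LinearRegression()),
])

# Nested parameters are addressed as <step name>__<parameter name>.
param_names = sorted(small_pipeline.get_params().keys())
print("estimator__fit_intercept" in param_names)  # prints True

# set_params accepts the same names.
small_pipeline.set_params(estimator__fit_intercept=False)
```

These are the same names used as keys in a search's parameter dictionary.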

We'll focus on whether or not we want to fit an intercept in our linear regression model.

In [None]:
param_dist_random = {
    "estimator__fit_intercept": [True, False],
}

To add hyperparameter tuning to our pipeline, we simply pass our predictive pipeline as the `estimator` of the search.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
random_search_pipeline = RandomizedSearchCV(
    estimator=predictive_pipeline,
    param_distributions=param_dist_random,
    scoring="neg_mean_squared_error", n_jobs=-1,
    # Only 2 combinations exist, so at most 2 candidates are tried.
    n_iter=10,
)

Now we can perform the randomized search for the whole pipeline.

In [None]:
%%time
random_search_pipeline.fit(X, y)

Let's take a peek at our best model and its score.

In [None]:
print(random_search_pipeline.best_score_)
print(random_search_pipeline.best_estimator_)
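
As a self-contained sketch of the same idea, the block below runs a randomized search over `fit_intercept` on synthetic data with a clearly non-zero intercept. All names (`X_demo`, `y_demo`, `search`) are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an intercept of 5 (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 2))
y_demo = 2.0 * X_demo[:, 0] + 5.0 + rng.normal(scale=0.1, size=80)

search = RandomizedSearchCV(
    estimator=Pipeline([
        ("scaler", StandardScaler()),
        ("estimator", LinearRegression()),
    ]),
    param_distributions={"estimator__fit_intercept": [True, False]},
    n_iter=2,  # the search space contains only two candidates
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)

# The winning parameter combination, keyed by step__parameter name.
print(search.best_params_)
```

Besides `best_score_` and `best_estimator_`, the `best_params_` attribute gives a compact view of which settings won.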

We can even see what final steps we've applied in our best model (this is super handy if we're trying out different hyperparameters during data pre-processing)

In [None]:
random_search_pipeline.best_estimator_.steps[0]

## <font color='#eb3483'> Exporting Pipelines </font>

The great thing about scikit-learn pipelines is that we can save them and reuse them **without having to retrain them**. We can use the library joblib to do so.

For example, to save the pipeline we have built:

In [None]:
from joblib import dump, load
dump(estimator_pipeline, 'pipeline.joblib') 

In [None]:
reloaded = load("pipeline.joblib")

Now we can predict directly!

In [None]:
reloaded.predict(X)[:10]
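
A quick way to check a saved pipeline is a round trip: dump it, load it, and compare predictions. The sketch below does this on synthetic data and writes to a temporary directory; the file name `demo_pipeline.joblib` is made up for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 2))
y_demo = X_demo[:, 0] - X_demo[:, 1]

fitted = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    dump(fitted, path)
    restored = load(path)

# The restored pipeline should predict exactly like the original.
same_predictions = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # prints True
```

Keep in mind that a saved pipeline is generally only safe to load with the same library versions it was saved with.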