# How to Use Sklearn Pipelines For Ridiculously Neat Code
## Huge time-saver
<img src='images/boy.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@abhiram2244?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Abhiram Prakash</a>
        on 
        <a href='https://www.pexels.com/photo/man-sitting-on-edge-facing-sunset-915972/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [90]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### Why Do You Need a Pipeline?

Data cleaning and preparation are easily the most time-consuming and tedious tasks in machine learning. Most ML algorithms are fussy: some want normalized or standardized features, some want encoded variables, and some want both. Then there is the ever-present issue of missing values.

Dealing with them is no fun at all, especially since the same cleaning operations have to be repeated on the training, validation and test sets. Fortunately, Scikit-learn's `Pipeline` is a major productivity tool for this process: it cleans up the code and collapses all preprocessing and modeling steps into a single line of code. Here, check this out:

```python
# Fit lasso regression
pipe_lasso.fit(X_train, y_train)

# Predict on X_Test
preds = pipe_lasso.predict(X_test)

```

Above, `pipe_lasso` is an instance of such a pipeline: it fills the missing values in `X_train`, feature-scales the numerical columns, one-hot encodes the categorical variables, and finishes by fitting a Lasso regression. When you call `.predict`, the same steps are applied to `X_test` using the statistics learned from `X_train`, which is really awesome.

Pipelines combine everything I love about Scikit-learn: conciseness, consistency and ease of use. So, without further ado, let me show how you can build your own pipeline in a few minutes.

### Intro to Scikit-learn Pipelines

In this and coming sections, we will build the above `pipe_lasso` pipeline together for the [Ames Housing dataset](https://www.kaggle.com/c/home-data-for-ml-course/data) which is used for an [InClass competition](https://www.kaggle.com/c/home-data-for-ml-course/overview) on Kaggle. The dataset contains 81 variables on almost every aspect of a house and using these, you have to predict the house's price. Let's load the training and test sets:

In [91]:
train = pd.read_csv('data/train.csv')
X_test = pd.read_csv('data/test.csv')

train.iloc[:, 70:]

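| | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 0 | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 0 | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |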


Everything except the last column, `SalePrice`, is used as features. Before we do anything, let's split the training data into training and validation sets. We will use the final `X_test` set for predictions.

In [92]:
from sklearn.model_selection import train_test_split

X = train.drop('SalePrice', axis=1)
y = train.SalePrice

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.3, random_state=1121218)

Now, let's do basic exploration of the training set:

In [93]:
X_train.describe().T.iloc[:10] # First 10 numerical cols

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1022.0 | 728.62818 | 417.491868 | 1.0 | 374.5 | 734.5 | 1082.0 | 1459.0 |
| MSSubClass | 1022.0 | 57.030333 | 42.86121 | 20.0 | 20.0 | 50.0 | 70.0 | 190.0 |
| LotFrontage | 838.0 | 70.190931 | 24.110495 | 21.0 | 60.0 | 70.0 | 80.0 | 313.0 |
| LotArea | 1022.0 | 10472.601761 | 8782.768055 | 1491.0 | 7560.0 | 9571.0 | 11742.5 | 164660.0 |
| OverallQual | 1022.0 | 6.071429 | 1.374094 | 1.0 | 5.0 | 6.0 | 7.0 | 10.0 |
| OverallCond | 1022.0 | 5.578278 | 1.101703 | 1.0 | 5.0 | 5.0 | 6.0 | 9.0 |
| YearBuilt | 1022.0 | 1971.221135 | 29.863975 | 1875.0 | 1954.0 | 1973.0 | 2000.0 | 2009.0 |
| YearRemodAdd | 1022.0 | 1984.813112 | 20.67152 | 1950.0 | 1966.0 | 1994.0 | 2003.75 | 2010.0 |
| MasVnrArea | 1015.0 | 101.768473 | 180.299391 | 0.0 | 0.0 | 0.0 | 160.0 | 1600.0 |
| BsmtFinSF1 | 1022.0 | 441.294521 | 438.43075 | 0.0 | 0.0 | 381.0 | 707.5 | 2260.0 |


In [94]:
X_train.describe(include='object').T.iloc[:10] # First 10 object cols; np.object was removed from recent NumPy

| | count | unique | top | freq |
|---|---|---|---|---|
| MSZoning | 1022 | 5 | RL | 809 |
| Street | 1022 | 2 | Pave | 1017 |
| Alley | 67 | 2 | Grvl | 37 |
| LotShape | 1022 | 4 | Reg | 654 |
| LandContour | 1022 | 4 | Lvl | 920 |
| Utilities | 1022 | 2 | AllPub | 1021 |
| LotConfig | 1022 | 5 | Inside | 733 |
| LandSlope | 1022 | 3 | Gtl | 966 |
| Neighborhood | 1022 | 25 | NAmes | 156 |
| Condition1 | 1022 | 9 | Norm | 881 |


In [95]:
above_0_missing = X_train.isnull().sum() > 0

X_train.isnull().sum()[above_0_missing]

LotFrontage      184
Alley            955
MasVnrType         7
MasVnrArea         7
BsmtQual          30
BsmtCond          30
BsmtExposure      31
BsmtFinType1      30
BsmtFinType2      31
Electrical         1
FireplaceQu      480
GarageType        58
GarageYrBlt       58
GarageFinish      58
GarageQual        58
GarageCond        58
PoolQC          1018
Fence            821
MiscFeature      988
dtype: int64

19 features have NaNs. 
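Raw counts are easier to judge next to the share of rows they represent. The sketch below runs on a small made-up DataFrame (`toy` and its columns are hypothetical, not part of the housing data) and shows one way to list only the columns with missing values, sorted by their missing fraction:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing rows per column
missing_share = toy.isnull().mean()

# Keep only columns that actually have missing values, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Applied to `X_train`, the same two lines would rank columns such as `PoolQC` and `MiscFeature` at the top, which can help decide whether imputing them is worthwhile at all.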

In [96]:
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
print(f'There are {len(numerical_features)} numerical features:', '\n')
print(numerical_features)

There are 37 numerical features: 

['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


In [97]:
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
print(f'There are {len(categorical_features)} categorical features:', '\n')
print(categorical_features)

There are 43 categorical features: 

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']


Now, on to preprocessing. For numeric columns, we first fill the missing values with `SimpleImputer` using the mean and then scale the features with `MinMaxScaler`. For categoricals, we again use `SimpleImputer` to fill the missing values with the mode of each column and then one-hot encode them with `OneHotEncoder`. Most importantly, we do all of this in a pipeline. Let's import everything:

In [98]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

We create two small pipelines, one for numeric and one for categorical features:

In [99]:
numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # on scikit-learn < 1.2, use sparse=False
])

> Set `handle_unknown` to `ignore` to skip previously unseen labels; they are encoded as all zeros. Otherwise, `OneHotEncoder` throws an error if the test set contains labels that are not in the training set.
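To see what this setting changes, here is a minimal sketch on made-up color labels (the arrays and the `strict` encoder are hypothetical and only for illustration). The default sparse output is converted with `.toarray()` so the snippet does not depend on the `sparse`/`sparse_output` argument name:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_colors = np.array([['red'], ['green'], ['red']])
test_colors = np.array([['green'], ['blue']])  # 'blue' never appears in training

# With handle_unknown='ignore', an unseen label becomes a row of zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_colors)
encoded = encoder.transform(test_colors).toarray()
print(encoder.categories_)  # learned categories, sorted alphabetically
print(encoded)

# With the default handle_unknown='error', the same input raises ValueError
strict = OneHotEncoder().fit(train_colors)
try:
    strict.transform(test_colors)
    raised = False
except ValueError:
    raised = True
print('Default encoder raised an error:', raised)
```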

The [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class takes a list of tuples for its `steps` argument. Each tuple should have this pattern:

```
('name_of_transformer', transformer)
```

Each tuple is called a *step* and contains an arbitrary name and a transformer like `SimpleImputer`. The steps are chained and applied to the passed DataFrame in the given order.
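The step names are not just labels: they let you reach into a fitted pipeline later. As a small sketch on a made-up one-column array (`toy` and `toy_pipeline` are hypothetical names), the imputer's learned mean can be read back through `named_steps`:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# One column with a missing value in the middle
toy = np.array([[1.0], [np.nan], [3.0]])

toy_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler()),
])

# Steps run in order: the NaN is replaced by the mean, then values are scaled to [0, 1]
result = toy_pipeline.fit_transform(toy)

# Each fitted step is reachable by the name given in its tuple
learned_mean = toy_pipeline.named_steps['impute'].statistics_
print(learned_mean)
print(result.ravel())
```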

But these two pipelines are useless unless we specify which columns they should be applied to. For that, we will use another transformer - `ColumnTransformer`.

### Column Transformer

`Pipeline` objects made up of transformers have `fit`, `transform` and `fit_transform` methods, which can be used to transform the input array like this:

In [100]:
numeric_pipeline.fit_transform(X_train.select_dtypes(include='number'))

array([[0.49451303, 0.58823529, 0.16846209, ..., 0.        , 0.36363636,
        1.        ],
       [0.63443073, 0.        , 0.16846209, ..., 0.        , 0.18181818,
        0.5       ],
       [0.53017833, 0.        , 0.16780822, ..., 0.        , 0.54545455,
        0.25      ],
       ...,
       [0.97325103, 0.        , 0.16846209, ..., 0.        , 1.        ,
        0.        ],
       [0.98902606, 0.23529412, 0.21917808, ..., 0.        , 0.27272727,
        0.75      ],
       [0.50685871, 0.23529412, 0.15068493, ..., 0.        , 0.27272727,
        0.75      ]])

Above, we apply the new numeric preprocessor to `X_train` with `fit_transform`, selecting the columns with `select_dtypes`. But using the pipelines this way means we have to call each pipeline separately on its own columns, which is not what we want. What we want is a single preprocessor that performs both numeric and categorical transformations in a single line of code like this:

```python
full_processor.fit_transform(X_train)
```

To achieve this, we will use `ColumnTransformer` class:

In [101]:
from sklearn.compose import ColumnTransformer

full_processor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])

> Remember that `numerical_features` and `categorical_features` contain the respective names of columns from `X_train`.
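If you would rather not maintain the column lists by hand, Scikit-learn also provides `make_column_selector`, which picks columns by dtype at fit time. The following is a sketch on a made-up DataFrame (`toy` and `selector_processor` are hypothetical, and the bare transformers stand in for the two pipelines above); `sparse_threshold=0` is set only to force a dense array for printing:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: two numeric columns and one categorical column
toy = pd.DataFrame({
    'area': [50.0, 120.0, 80.0],
    'rooms': [2, 4, 3],
    'city': ['Oslo', 'Rome', 'Oslo'],
})

selector_processor = ColumnTransformer(transformers=[
    # Columns are chosen by dtype when fit is called, not by a fixed list
    ('number', MinMaxScaler(), make_column_selector(dtype_include='number')),
    ('category', OneHotEncoder(handle_unknown='ignore'),
     make_column_selector(dtype_exclude='number')),
], sparse_threshold=0)  # always return a dense array

transformed = selector_processor.fit_transform(toy)
print(transformed.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Explicit lists, as used in this article, make it obvious which columns go where; selectors are convenient when the set of columns may change.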

Similar to the `Pipeline` class, `ColumnTransformer` takes a list of tuples. Each tuple should contain an arbitrary step name, the transformer itself and the list of column names that the transformer should be applied to. Here, we are creating a column transformer with 2 steps using our numeric and categorical preprocessing pipelines. Now, we can use it to fully transform `X_train`:

In [102]:
full_processor.fit_transform(X_train)

array([[0.49451303, 0.58823529, 0.16846209, ..., 0.        , 1.        ,
        0.        ],
       [0.63443073, 0.        , 0.16846209, ..., 0.        , 0.        ,
        0.        ],
       [0.53017833, 0.        , 0.16780822, ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.97325103, 0.        , 0.16846209, ..., 0.        , 1.        ,
        0.        ],
       [0.98902606, 0.23529412, 0.21917808, ..., 0.        , 1.        ,
        0.        ],
       [0.50685871, 0.23529412, 0.15068493, ..., 0.        , 1.        ,
        0.        ]])

Note that most transformers return `numpy` arrays which means index and column names will be dropped. 

Finally, we managed to collapse all preprocessing steps into a single line of code. However, we can go even further. We can combine preprocessing and modeling to have even neater code.

### Final Pipeline With an Estimator

Adding an estimator (model) to a pipeline is as easy as creating a new pipeline which contains the above column transformer and the model itself. Let's import and instantiate `Lasso` and add it to a new pipeline with the `full_processor`:

In [103]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

lasso = Lasso(alpha=0.1)

lasso_pipeline = Pipeline(steps=[
    ('preprocess', full_processor),
    ('model', lasso)
])

> Warning! The order of steps matters! The estimator should always be the last step for the pipeline to work correctly.

That's it! We can now call `lasso_pipeline` just like we call any other model. When we call `.fit`, the pipeline applies all transformations before fitting an estimator:

In [104]:
_ = lasso_pipeline.fit(X_train, y_train)
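To make concrete what `.fit` and `.predict` do, the sketch below compares a small pipeline with the equivalent manual sequence of calls. It uses randomly generated data rather than the housing set, and every name in it (`X_toy`, `pipe`, `manual_preds`, and so on) is hypothetical:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data with a few missing values in one column
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
X_toy[::7, 1] = np.nan

X_fit, X_new = X_toy[:40], X_toy[40:]
y_fit = y_toy[:40]

# Pipeline version: one fit call, one predict call
pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler()),
    ('model', Lasso(alpha=0.01)),
])
pipe.fit(X_fit, y_fit)
pipe_preds = pipe.predict(X_new)

# Manual version: each transformer is fit on the training part only,
# then reused (transform, not fit_transform) on the new data
imputer = SimpleImputer(strategy='mean').fit(X_fit)
scaler = MinMaxScaler().fit(imputer.transform(X_fit))
model = Lasso(alpha=0.01).fit(scaler.transform(imputer.transform(X_fit)), y_fit)
manual_preds = model.predict(scaler.transform(imputer.transform(X_new)))

print(np.allclose(pipe_preds, manual_preds))
```

The point of the comparison is that the pipeline never refits the imputer or scaler on new data, which is exactly the bookkeeping that is easy to get wrong by hand.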

Let's evaluate our base model on the validation set (Remember, we have a separate testing set which we haven't touched so far):

In [105]:
preds = lasso_pipeline.predict(X_valid)
mean_absolute_error(y_valid, preds)

19830.527070323828

In [106]:
lasso_pipeline.score(X_valid, y_valid)

0.707975813475253

Great, our base pipeline works. Another great thing about pipelines is that they can be treated like any other model. In other words, we can plug one in anywhere we would use a Scikit-learn estimator. So, in the next section, we will use the pipeline in a grid search to find the optimal hyperparameters.
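As one illustration of "anywhere", a pipeline can be passed straight to `cross_val_score`. This is a sketch on synthetic data from `make_regression`, with a shortened hypothetical pipeline (`toy_pipeline`) in place of `lasso_pipeline`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the housing features
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

toy_pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', Lasso(alpha=0.1)),
])

# The scaler is refit inside every fold, so held-out rows never leak into preprocessing
scores = cross_val_score(toy_pipeline, X_toy, y_toy,
                         cv=5, scoring='neg_mean_absolute_error')
print('Mean MAE across folds:', -scores.mean())
```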

### Using Your Pipeline Everywhere

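The main hyperparameter for `Lasso` is `alpha`, which can range from 0 to infinity. For simplicity, we will first cross-validate only on values between 0 and 1 in steps of 0.05. Note that this grid includes `alpha=0`, which is plain least squares and which Scikit-learn advises against using with `Lasso`: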

In [107]:
from sklearn.model_selection import GridSearchCV

param_dict = {'model__alpha': np.arange(0, 1, 0.05)}

search = GridSearchCV(lasso_pipeline, param_dict, 
                      cv=10, 
                      scoring='neg_mean_absolute_error')

_ = search.fit(X_train, y_train)
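The key `model__alpha` follows Scikit-learn's `<step name>__<parameter>` convention: `model` is the name we gave the Lasso step, and `alpha` is its parameter. A small sketch (with a hypothetical two-step `toy_pipeline`) shows how to list the tunable names of a step and set one directly:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', Lasso(alpha=0.1)),
])

# get_params() exposes nested parameters as '<step>__<param>'
tunable = [name for name in toy_pipeline.get_params() if name.startswith('model__')]
print(tunable)

# The same names work with set_params, which is what GridSearchCV relies on
toy_pipeline.set_params(model__alpha=0.5)
print(toy_pipeline.named_steps['model'].alpha)
```

The same pattern reaches deeper levels too, so parameters of steps nested inside a `ColumnTransformer` can be tuned by chaining more `__` segments.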

Now, we can get the best score and parameters for `Lasso`:

In [108]:
print('Best score:', abs(search.best_score_))

Best score: 18211.751182491404


In [109]:
print('Best alpha:', search.best_params_)

Best alpha: {'model__alpha': 0.9500000000000001}


As you can see, the best `alpha` is 0.95, which is the very end of our interval, i.e. \[0, 1) with a step of 0.05. Since the best value sits on the boundary, the optimum may lie outside the interval, so we search again over a wider range:

In [110]:
param_dict = {'model__alpha': np.arange(1, 100, 5)}

search = GridSearchCV(lasso_pipeline, param_dict, 
                      cv=10, 
                      scoring='neg_mean_absolute_error')

_ = search.fit(X_train, y_train)

In [111]:
print('Best score:', abs(search.best_score_))

Best score: 16365.54751395061


In [112]:
print('Best alpha:', search.best_params_)

Best alpha: {'model__alpha': 76}
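Because useful values of `alpha` can span several orders of magnitude, a log-spaced grid is a common alternative to widening a linear one. The sketch below is not what produced the numbers above; it runs on synthetic data, and `toy_pipeline`, `log_grid` and `toy_search` are hypothetical names:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_toy, y_toy = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=0)

toy_pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', Lasso(max_iter=10_000)),
])

# Nine values from 0.01 to 100, evenly spaced on a log scale
log_grid = {'model__alpha': np.logspace(-2, 2, 9)}

toy_search = GridSearchCV(toy_pipeline, log_grid,
                          cv=5, scoring='neg_mean_absolute_error')
toy_search.fit(X_toy, y_toy)
print('Best alpha:', toy_search.best_params_)
```

A coarse log-spaced pass like this can be followed by a finer linear grid around the winner.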


With the better `alpha`, we get a significant drop in cross-validated MAE (which is good). Let's redefine our pipeline with `Lasso(alpha=76)`:

In [113]:
lasso = Lasso(alpha=76)

final_lasso_pipe = Pipeline(steps=[
    ('preprocess', full_processor),
    ('model', lasso)
])

Fit it to `X_train`, validate on `X_valid` and submit predictions for the competition using `X_test`:

In [114]:
_ = final_lasso_pipe.fit(X_train, y_train)
preds = final_lasso_pipe.predict(X_valid)

mean_absolute_error(y_valid, preds)

18330.014440409737
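Rebuilding the pipeline by hand is explicit, but it is not strictly required: with its default `refit=True`, `GridSearchCV` refits the best configuration on all the data passed to `fit` and exposes it as `best_estimator_`. A sketch on synthetic data (all names here are hypothetical) shows that this object is a ready-to-use pipeline:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_fit, X_new = X_toy[:150], X_toy[150:]
y_fit = y_toy[:150]

toy_pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', Lasso()),
])

# refit=True is the default: the winning pipeline is retrained on all of X_fit
toy_search = GridSearchCV(toy_pipeline, {'model__alpha': [0.01, 0.1, 1.0, 10.0]},
                          cv=5, scoring='neg_mean_absolute_error')
toy_search.fit(X_fit, y_fit)

# best_estimator_ is a fitted Pipeline carrying the selected alpha
best_pipe = toy_search.best_estimator_
best_preds = best_pipe.predict(X_new)

# Calling predict on the search object delegates to the same fitted pipeline
same = np.allclose(best_preds, toy_search.predict(X_new))
print(same)
```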

In [115]:
preds_final = final_lasso_pipe.predict(X_test)

# Use the Id column from the test file; X_test.index is just 0..n-1 here
output = pd.DataFrame({'Id': X_test['Id'],
                       'SalePrice': preds_final})
output.to_csv('submission.csv', index=False)

### Conclusion

In summary, pipelines bring several advantages to your daily workflow, such as compact code, ease of use and the ability to tune every step from one place. In the examples, we used a simple Lasso regression, but the pipeline we created could be used with virtually any Scikit-learn model. Go and use it to build something awesome!