# How to Create Custom Sklearn Transformers That Integrate Into Any Pipeline
## Do everything in Sklearn
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@tetrakiss?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Arseny Togulev</a>
        on 
        <a href='https://unsplash.com/s/photos/transformer?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# Setup

```python
import logging
import time

import catboost as cb
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
```

# Introduction

Single `fit`, single `predict` - how awesome would that be?

You get the data, fit your pipeline just once and it takes care of everything: preprocessing, feature engineering and modeling. All you have to do is call `predict` and get the output.

What kind of pipeline is *that* powerful? Yes, Sklearn has many transformers but it doesn't have one for every imaginable preprocessing scenario. So, is such a pipeline a *pipe* dream?

Absolutely not. Today, we will learn how to create custom Sklearn transformers that enable you to integrate virtually any function or data transformation into Sklearn's Pipeline class.

# What are Sklearn pipelines?

Below is a simple pipeline that imputes the missing values in numeric data, scales them and fits an XGBRegressor to `X`, `y`:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import xgboost as xgb

xgb_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    xgb.XGBRegressor(),
)

_ = xgb_pipe.fit(X, y)
```

I have talked at length about the nitty-gritty of Sklearn pipelines and their benefits in an [older post](https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d). Their most notable advantages are collapsing all preprocessing and modeling steps into a single estimator, preventing data leakage by never calling `fit` on validation sets and, as a bonus, keeping the code concise, reproducible and modular.
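To make the leakage point concrete, here is a minimal sketch. It assumes synthetic data from `make_regression` and uses `Ridge` as a stand-in for the XGBoost regressor above; both are illustrative choices, not part of the original example. Because the whole pipeline is handed to `cross_val_score`, the imputer and the scaler are refit on the training portion of every split and never see the validation fold during `fit`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration) with about 5% of values removed
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Ridge stands in for XGBRegressor so the sketch needs nothing beyond scikit-learn
demo_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), Ridge())

# Each split clones the pipeline and fits every step on that split's training part only
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
print(scores.shape)
```

The same call works unchanged once custom steps are added to the pipeline, which is the main reason to wrap custom logic as a transformer rather than apply it by hand beforehand.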

But this whole idea of atomic, neat pipelines breaks down when we need to perform operations that are not built into Sklearn as estimators. For example, what if you need to extract regex patterns to clean text data? What do you do if you want to create a new feature combining existing ones based on domain knowledge?

To preserve all the benefits that come with pipelines, you need a way to integrate your custom preprocessing and feature engineering logic into Sklearn. That's where custom transformers come into play.

# Integrating simple functions with `FunctionTransformer`

In this month's (September) Tabular Playground Series (TPS) competition on Kaggle, one of the ideas that boosted model performance significantly was adding the number of missing values in a row as a new feature. This is a custom step, not implemented in Sklearn, so let's create a function to achieve that after importing the data:

```python
tps_df = pd.read_csv("data/train.csv")
tps_df.head()
```

| | id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f110 | f111 | f112 | f113 | f114 | f115 | f116 | f117 | f118 | claim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.10859 | 0.004314 | -37.566 | 0.017364 | 0.28915 | -10.251 | 135.12 | 168900.0 | 399240000000000.0 | ... | -12.228 | 1.7482 | 1.9096 | -7.1157 | 4378.8 | 1.2096 | 861340000000000.0 | 140.1 | 1.0177 | 1 |
| 1 | 1 | 0.1009 | 0.29961 | 11822.0 | 0.2765 | 0.4597 | -0.83733 | 1721.9 | 119810.0 | 3874100000000000.0 | ... | -56.758 | 4.1684 | 0.34808 | 4.142 | 913.23 | 1.2464 | 7575100000000000.0 | 1861.0 | 0.28359 | 0 |
| 2 | 2 | 0.17803 | -0.00698 | 907.27 | 0.27214 | 0.45948 | 0.17327 | 2298.0 | 360650.0 | 12245000000000.0 | ... | -5.7688 | 1.2042 | 0.2629 | 8.1312 | 45119.0 | 1.1764 | 321810000000000.0 | 3838.2 | 0.4069 | 1 |
| 3 | 3 | 0.15236 | 0.007259 | 780.1 | 0.025179 | 0.51947 | 7.4914 | 112.51 | 259490.0 | 77814000000000.0 | ... | -34.858 | 2.0694 | 0.79631 | -16.336 | 4952.4 | 1.1784 | 4533000000000.0 | 4889.1 | 0.51486 | 1 |
| 4 | 4 | 0.11623 | 0.5029 | -109.15 | 0.29791 | 0.3449 | -0.40932 | 2538.9 | 65332.0 | 1907200000000000.0 | ... | -13.641 | 1.5298 | 1.1464 | -0.43124 | 3856.5 | 1.483 | -8991300000000.0 | NaN | 0.23049 | 1 |


```python
# Find the number of missing values across rows
tps_df.isnull().sum(axis=1)
```

```
0         1
1         0
2         5
3         2
4         8
         ..
957914    0
957915    4
957916    0
957917    1
957918    4
Length: 957919, dtype: int64
```

Let's create a function that takes a DataFrame as an input and implements the above operation:

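```python
def num_missing_row(X: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()

    # Per-row count of missing values and standard deviation of the missing-value mask
    is_missing = X.isnull()
    num_missing = is_missing.sum(axis=1)
    num_missing_std = is_missing.std(axis=1)

    # Add the above series as new features to the df
    X["#missing"] = num_missing
    X["num_missing_std"] = num_missing_std

    return X
```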

Now, adding this function into a pipeline is just as easy as passing it to the `FunctionTransformer`:

```python
from sklearn.preprocessing import FunctionTransformer

num_missing_estimator = FunctionTransformer(num_missing_row)
```

Passing a custom function to `FunctionTransformer` creates an estimator with `fit`, `transform` and `fit_transform` methods:

```python
# Check number of columns before
print(f"Number of features before preprocessing: {len(tps_df.columns)}")

# Apply the custom estimator
tps_df = num_missing_estimator.transform(tps_df)
print(f"Number of features after preprocessing: {len(tps_df.columns)}")
```

```
Number of features before preprocessing: 120
Number of features after preprocessing: 122
```
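The same pattern is easier to verify on a tiny frame. Below is a minimal sketch, assuming a made-up three-row DataFrame and a simplified helper named `count_missing` (both hypothetical, introduced only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_missing(X: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left unchanged
    X = X.copy()
    X["#missing"] = X.isnull().sum(axis=1)
    return X


# Made-up data: row 0 has one missing value, row 1 has two, row 2 has none
toy_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})
toy_estimator = FunctionTransformer(count_missing)

# For a stateless function, fit simply returns the estimator itself
assert toy_estimator.fit(toy_df) is toy_estimator

toy_out = toy_estimator.transform(toy_df)
print(toy_out["#missing"].tolist())  # [1, 2, 0]
print(list(toy_df.columns))  # ['a', 'b']
```

Copying inside the function is a design choice: it keeps the input frame intact, which avoids surprises when the same data is reused elsewhere in a notebook.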


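Because `num_missing_row` is stateless, there is no need to call `fit` first: for a `FunctionTransformer`, `fit` does not learn any parameters from the data and simply returns the estimator itself. The only requirement is that the passed function accepts the data as its first argument.
Note that `FunctionTransformer` calls the function with the data only (plus anything supplied through its `kw_args` parameter), so the target array is never forwarded to it. A `y=None` parameter in the signature is harmless, but it stays `None` even when the pipeline is fit with a target:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
import xgboost as xgb


# y is accepted here, but FunctionTransformer never passes it, so it stays None
def custom_function(X, y=None):
    ...  # custom logic goes here; the function must return the transformed data


estimator = FunctionTransformer(custom_function)  # no errors

# X and y are the feature matrix and target, as in the pipeline example above
custom_pipeline = make_pipeline(StandardScaler(), estimator, xgb.XGBRegressor())
custom_pipeline.fit(X, y)
```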
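If the function needs extra inputs, `FunctionTransformer` forwards the dictionary given as `kw_args` to it as keyword arguments. A minimal sketch, where `clip_values` and its `lower`/`upper` parameters are hypothetical names chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def clip_values(X, lower=0.0, upper=1.0):
    # Hypothetical helper: limit every value to the [lower, upper] range
    return np.clip(X, lower, upper)


# kw_args is passed to clip_values as keyword arguments on every transform call
clipper = FunctionTransformer(clip_values, kw_args={"lower": -1.0, "upper": 1.0})

X_small = np.array([[-5.0, 0.5], [2.0, -0.25]])
X_clipped = clipper.fit_transform(X_small)
print(X_clipped.tolist())  # [[-1.0, 0.5], [1.0, -0.25]]
```

Keeping such settings in `kw_args` rather than hard-coding them in the function also exposes them through `get_params`, so they can be adjusted like any other estimator parameter.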

`FunctionTransformer` also accepts an inverse of the passed function if you ever need to revert the changes:

```python
def custom_function(X, y=None):
    ...


def inverse_of_custom(X, y=None):
    ...


estimator = FunctionTransformer(func=custom_function, inverse_func=inverse_of_custom)
```
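As a concrete instance of a function and its inverse, here is a minimal sketch that uses NumPy's `np.log1p` and `np.expm1` on a small made-up array (the data and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.log1p computes log(1 + x) and np.expm1 computes exp(x) - 1, so they undo each other
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_pos = np.array([[0.0, 1.0], [10.0, 100.0]])
X_log = log_transformer.fit_transform(X_pos)
X_back = log_transformer.inverse_transform(X_log)

print(np.allclose(X_back, X_pos))  # True
```

A round trip like this is a quick sanity check that the two functions really are inverses before relying on `inverse_transform` later in a workflow.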

Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) for details on other arguments.