# How to Create Custom Sklearn Transformers That Integrate Into Any Pipeline
## Do everything in Sklearn
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@tetrakiss?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Arseny Togulev</a>
        on 
        <a href='https://unsplash.com/s/photos/transformer?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# Setup

```python
import logging
import time

import catboost as cb
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
```

# Introduction

Single `fit`, single `predict` - how awesome would that be?

You get the data, fit your pipeline once, and it takes care of everything: preprocessing, feature engineering, and modeling. All you have to do is call `predict` and collect the output.

What kind of pipeline is *that* powerful? Yes, Sklearn has many transformers but it doesn't have one for every imaginable preprocessing scenario. So, is such a pipeline a *pipe* dream?

Absolutely not. Today, we will learn how to create custom Sklearn transformers that enable you to integrate virtually any function or data transformation into Sklearn's Pipeline class.

# What are Sklearn pipelines?

Below is a simple pipeline that imputes missing values in numeric data, scales the features, and fits an `XGBRegressor` to `X` and `y`:

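```python
import xgboost as xgb
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps run in order: impute -> scale -> model
xgb_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),  # fill missing values with the column mean
    StandardScaler(),  # standardize each feature
    xgb.XGBRegressor(),  # final estimator
)

# X and y are assumed to be a numeric feature matrix and a target array
_ = xgb_pipe.fit(X, y)
```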

I have covered the nitty-gritty of Sklearn pipelines and their benefits at length in an [older post](https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d). Their most notable advantages are that they collapse all preprocessing and modeling steps into a single estimator and prevent data leakage by never calling `fit` on validation sets. As a bonus, they make the code concise, reproducible, and modular.
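To make the leakage point concrete, here is a minimal sketch of cross-validating a whole pipeline. The data is synthetic, and `Ridge` stands in for the XGBoost model only to keep the example dependency-free; both are assumptions made for illustration. Because the imputer and scaler live inside the pipeline, they are refit on the training portion of every fold and never see the validation rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): 200 rows, 5 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), Ridge())

# cross_val_score clones the pipeline and fits the clone on each training fold,
# so the imputation means and scaling statistics come from training rows only
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
print(scores.mean())
```

Had the imputer and scaler been fit on the full dataset before splitting, statistics from the validation rows would have leaked into training.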

But this whole idea of atomic, neat pipelines breaks down when we need to perform operations that are not built into Sklearn as estimators. For example, what if you need to extract regex patterns to clean text data? What if you want to create a new feature by combining existing ones based on domain knowledge?

To preserve all the benefits that come with pipelines, you need a way to integrate your custom preprocessing and feature engineering logic into Sklearn. That's where custom transformers come into play.
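As a rough preview, a custom transformer can be as small as a class with `fit` and `transform` methods that inherits from Sklearn's `BaseEstimator` and `TransformerMixin`. The sketch below is hypothetical: the `RatioFeatureAdder` class, its parameters, and the toy `mass`/`volume` data are invented for illustration. It adds a domain-knowledge feature (the ratio of two columns) and assumes the input is a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


class RatioFeatureAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that appends numerator / denominator as a column."""

    def __init__(self, numerator, denominator, name="ratio"):
        # Store constructor arguments unchanged so get_params and clone work
        self.numerator = numerator
        self.denominator = denominator
        self.name = name

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self to allow chaining
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X[self.name] = X[self.numerator] / X[self.denominator]
        return X


# Toy data, invented for illustration
df = pd.DataFrame({"mass": [2.0, 6.0, 12.0, 20.0], "volume": [1.0, 2.0, 3.0, 4.0]})
y = np.array([2.0, 3.0, 4.0, 5.0])

pipe = make_pipeline(
    RatioFeatureAdder("mass", "volume", name="density"),
    LinearRegression(),
)
pipe.fit(df, y)

# Slice off the final estimator to inspect what the transformer produced
transformed = pipe[:-1].transform(df)
print(transformed)
```

Because the class follows the estimator interface, it slots into `make_pipeline` like any built-in step and inherits `fit_transform` from `TransformerMixin`.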