#### Custom transformers

In this unit, we will see how to encapsulate a set of more advanced preprocessing steps done in Pandas into a Scikit-learn custom transformer.

The goal of this unit is to see how we can implement complex transformations in Scikit-learn. However, custom transformers are quite advanced tools, and everything that we will see in this unit can also be implemented outside Scikit-learn as a separate Pandas preprocessing step. It’s perfectly fine to do it with Pandas.

#### Messy bikes data
This time, we will work with a variant of the bike sharing data. The dataset is similar to the one from the previous unit but has missing values in the features.



In [3]:
import pandas as pd

data_df = pd.read_csv("c3_messy-bikes.csv")
data_df.head()

Unnamed: 0,temp,hum,windspeed,yr,workingday,holiday,weekday,season,weathersit,casual
0,0.344,0.806,,2011.0,no,,6.0,spring,cloudy,331
1,0.363,0.696,0.249,2011.0,,no,0.0,spring,cloudy,131
2,0.196,0.437,0.248,2011.0,yes,no,,,clear,120
3,,0.59,0.16,2011.0,yes,no,2.0,spring,,108
4,0.227,0.437,0.187,2011.0,yes,no,3.0,spring,clear,82


Note that the year yr and weekday values are encoded as floating point numbers instead of integers. We will also need to fix this.

To get a sense of the proportion of NaN entries in each column, let’s run:



In [4]:
data_df.isnull().mean()

temp          0.095759
hum           0.109439
windspeed     0.112175
yr            0.087551
workingday    0.088919
holiday       0.102599
weekday       0.097127
season        0.102599
weathersit    0.103967
casual        0.000000
dtype: float64

As we can see, input features contain approximately 10% missing values.
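To make the computation above concrete, here is a minimal sketch on a made-up four-row DataFrame (the `toy_df` values are invented for illustration and are not part of the bikes data). Since `isnull()` returns booleans, taking the column mean gives the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing temp out of four, two missing seasons out of four
toy_df = pd.DataFrame(
    {
        "temp": [0.3, np.nan, 0.2, 0.4],
        "season": ["spring", "spring", None, None],
    }
)

# True counts as 1 and False as 0, so the mean is the share of missing values
nan_share = toy_df.isnull().mean()
print(nan_share)
```

Here `temp` has a missing share of 0.25 and `season` of 0.5.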

#### Custom preprocessing
Let’s write a preprocess_f(df) function to perform the necessary preprocessing steps. To avoid modifying the original data, we will work on a copy of the df DataFrame.



In [5]:
import numpy as np


def preprocess_f(df):
    # Work on a copy
    df = df.copy()

    # Missing values in continuous features
    cont_vars = ["temp", "hum", "windspeed"]
    for c in cont_vars:
        df[c] = df[c].fillna(df[c].mean())  # replace by mean

    # Explicitly convert to string values
    # (applymap was deprecated in pandas 2.1 in favor of DataFrame.map)
    to_convert = ["yr", "weekday"]
    convert_f = lambda x: str(int(x)) if not np.isnan(x) else np.nan
    df[to_convert] = df[to_convert].applymap(convert_f)

    # .. in categorical ones: create 'missing' category
    cat_vars = ["yr", "workingday", "holiday", "weekday", "season", "weathersit"]
    df[cat_vars] = df[cat_vars].fillna("missing")

    # One-hot encoding
    df = pd.get_dummies(df)

    return df


preprocessed = preprocess_f(data_df)

In this code, we replace missing values in the continuous variables with their mean and add a missing category for the categorical ones.
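The two imputation strategies and the one-hot encoding can be sketched on a small made-up DataFrame. The `toy_df` values below are assumptions chosen for illustration only; the steps mirror what preprocess_f() does on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame(
    {
        "temp": [0.2, np.nan, 0.4],
        "season": ["spring", np.nan, "fall"],
    }
)

# Continuous feature: replace NaN by the column mean
toy_df["temp"] = toy_df["temp"].fillna(toy_df["temp"].mean())

# Categorical feature: create an explicit 'missing' category
toy_df["season"] = toy_df["season"].fillna("missing")

# One-hot encoding: one column per observed category
encoded = pd.get_dummies(toy_df)
print(encoded.columns.tolist())
```

The missing temp becomes roughly 0.3 (the mean of 0.2 and 0.4), and the encoded frame gets a `season_missing` column next to `season_fall` and `season_spring`.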

We get the following preprocessed DataFrame:



In [6]:
preprocessed.head()

Unnamed: 0,temp,hum,windspeed,casual,yr_2011,yr_2012,yr_missing,workingday_missing,workingday_no,workingday_yes,...,weekday_missing,season_fall,season_missing,season_spring,season_summer,season_winter,weathersit_clear,weathersit_cloudy,weathersit_missing,weathersit_rainy
0,0.344,0.806,0.188969,331,1,0,0,0,1,0,...,0,0,0,1,0,0,0,1,0,0
1,0.363,0.696,0.249,131,1,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
2,0.196,0.437,0.248,120,1,0,0,0,0,1,...,1,0,1,0,0,0,1,0,0,0
3,0.495543,0.59,0.16,108,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
4,0.227,0.437,0.187,82,1,0,0,0,0,1,...,0,0,0,1,0,0,1,0,0,0


In [7]:
preprocessed.columns

Index(['temp', 'hum', 'windspeed', 'casual', 'yr_2011', 'yr_2012',
       'yr_missing', 'workingday_missing', 'workingday_no', 'workingday_yes',
       'holiday_missing', 'holiday_no', 'holiday_yes', 'weekday_0',
       'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5',
       'weekday_6', 'weekday_missing', 'season_fall', 'season_missing',
       'season_spring', 'season_summer', 'season_winter', 'weathersit_clear',
       'weathersit_cloudy', 'weathersit_missing', 'weathersit_rainy'],
      dtype='object')

As we have seen in the last unit, we can encapsulate such preprocessing functions into a FunctionTransformer object which can then be used with Scikit-learn tools such as pipelines.

In [8]:
from sklearn.preprocessing import FunctionTransformer

preprocessor = FunctionTransformer(preprocess_f, validate=False)
preprocessed = preprocessor.fit_transform(data_df)

As we saw in the last unit, Scikit-learn transformers work with Numpy arrays, whereas our preprocess_f() function expects a Pandas DataFrame. To avoid any implicit conversion of the input, we set the validate parameter to False (this is also the default value since Scikit-learn 0.22).

You should get the same result as above.

In [9]:
preprocessed.head()

Unnamed: 0,temp,hum,windspeed,casual,yr_2011,yr_2012,yr_missing,workingday_missing,workingday_no,workingday_yes,...,weekday_missing,season_fall,season_missing,season_spring,season_summer,season_winter,weathersit_clear,weathersit_cloudy,weathersit_missing,weathersit_rainy
0,0.344,0.806,0.188969,331,1,0,0,0,1,0,...,0,0,0,1,0,0,0,1,0,0
1,0.363,0.696,0.249,131,1,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
2,0.196,0.437,0.248,120,1,0,0,0,0,1,...,1,0,1,0,0,0,1,0,0,0
3,0.495543,0.59,0.16,108,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
4,0.227,0.437,0.187,82,1,0,0,0,0,1,...,0,0,0,1,0,0,1,0,0,0
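As a smaller illustration of the same idea, the sketch below wraps a made-up DataFrame function in a FunctionTransformer. The `add_ratio` function and the `toy_df` data are hypothetical; the point is only that, with validate=False, the DataFrame is passed to the function as is and the function's DataFrame output is returned unchanged.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_ratio(df):
    # Hypothetical stateless transformation: add a derived column
    df = df.copy()
    df["ratio"] = df["a"] / df["b"]
    return df


toy_df = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 4.0]})

# validate=False: no conversion of the input to a Numpy array
wrapper = FunctionTransformer(add_ratio, validate=False)
out = wrapper.fit_transform(toy_df)
print(out)
```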


#### FunctionTransformer limitations
The FunctionTransformer has an important limitation: it can only encapsulate stateless transformations. For instance, let’s see what happens if we pass a single row to the preprocessor fitted above.

In [11]:
preprocessor.transform(data_df.iloc[:1])

Unnamed: 0,temp,hum,windspeed,casual,yr_2011,workingday_no,holiday_missing,weekday_6,season_spring,weathersit_cloudy
0,0.344,0.806,,331,1,1,1,1,1,1


This time, the output only has 10 columns instead of 30. This is because the get_dummies() call inside our preprocess_f() function creates one column per categorical value present in the data it receives. Here, we only pass a single data point, so get_dummies() creates a single column for each categorical variable, e.g. only weathersit_cloudy since weathersit='cloudy' for this first entry, but no weathersit_rainy as above.
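The effect is easy to reproduce in isolation. In this minimal sketch, the `toy_df` data is made up; it shows that get_dummies() only knows about the categories present in the rows it is given.

```python
import pandas as pd

# Hypothetical toy data with three weather categories
toy_df = pd.DataFrame({"weathersit": ["clear", "cloudy", "rainy"]})

# Encoding all rows creates one column per category
full_cols = pd.get_dummies(toy_df).columns.tolist()

# Encoding only the first row creates a single column
one_row_cols = pd.get_dummies(toy_df.iloc[:1]).columns.tolist()

print(full_cols)
print(one_row_cols)
```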

#### TransformerMixin object
To fix this, we need to create a custom transformer that saves the column names during fitting. This can be done by defining a subclass of the Scikit-learn BaseEstimator and TransformerMixin classes and by implementing the __init__(), fit() and transform() methods.

In [12]:
from sklearn.base import BaseEstimator, TransformerMixin


class PandasPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, preprocess_f):
        self.preprocess_f = preprocess_f

    def fit(self, X_df, y=None):
        # Check that we get a DataFrame
        assert isinstance(X_df, pd.DataFrame)

        # Preprocess data
        X_preprocessed = self.preprocess_f(X_df)

        # Save columns names/order for inference time
        self.columns_ = X_preprocessed.columns

        return self

    def transform(self, X_df):
        # Check that we get a DataFrame
        assert isinstance(X_df, pd.DataFrame)

        # Preprocess data
        X_preprocessed = self.preprocess_f(X_df)

        # Make sure to have the same features
        X_reindexed = X_preprocessed.reindex(columns=self.columns_, fill_value=0)

        return X_reindexed

In this implementation, we create a PandasPreprocessor transformer that takes a preprocess_f preprocessing function and saves it.

In its fit() method, we check that the input is a DataFrame object and preprocess it with the function. We then save the resulting columns in a new columns_ attribute, which we use in the transform() method to make sure that the transformed output has the same set of columns as the preprocessed data seen during fit().

Concretely, this is done with a simple df.reindex(columns=columns, fill_value=0) operation, which reindexes the DataFrame df such that it has the same columns as columns and in the same order. If a column is missing, it is created and filled with zeros; columns of df that are not in columns are dropped.
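The behavior of reindex() can be checked on a small example. The column names and values below are made up for illustration: `train_columns` plays the role of the saved columns_ attribute and `new_df` the role of a freshly preprocessed DataFrame.

```python
import pandas as pd

# Hypothetical columns saved at fit time
train_columns = pd.Index(["temp", "season_fall", "season_spring"])

# Hypothetical new data: different column order, one column missing
# (season_fall) and one column unseen at fit time (season_winter)
new_df = pd.DataFrame({"season_spring": [1], "temp": [0.3], "season_winter": [1]})

# Align to the training columns: missing ones are filled with 0,
# unknown ones are dropped, and the order follows train_columns
aligned = new_df.reindex(columns=train_columns, fill_value=0)
print(aligned)
```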

Let’s see if this fixes our issue.



In [13]:
preprocessor = PandasPreprocessor(preprocess_f)
preprocessor.fit(data_df)
preprocessor.transform(data_df.iloc[:1])

Unnamed: 0,temp,hum,windspeed,casual,yr_2011,yr_2012,yr_missing,workingday_missing,workingday_no,workingday_yes,...,weekday_missing,season_fall,season_missing,season_spring,season_summer,season_winter,weathersit_clear,weathersit_cloudy,weathersit_missing,weathersit_rainy
0,0.344,0.806,,331,1,0,0,0,1,0,...,0,0,0,1,0,0,0,1,0,0


This time, we get a DataFrame with the correct columns, but the missing entry in windspeed wasn’t replaced by the feature mean.

Again, the issue comes from our preprocess_f() implementation which is stateless - missing values are replaced by the mean of the current DataFrame which, in this case, contains a single entry. Since the mean of NaN is NaN, it didn’t impute the missing value. To solve the issue, we need to store the train mean in the fit() step.
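The following sketch reproduces the problem on a single made-up value. The `train_mean` number is an assumption standing in for a mean computed on a training set.

```python
import numpy as np
import pandas as pd

# A single-row feature whose only value is missing
one_row = pd.Series([np.nan], name="windspeed")

# The mean of an all-NaN Series is NaN, so imputing with it changes nothing
local_mean = one_row.mean()
still_missing = one_row.fillna(local_mean)

# Hypothetical mean computed earlier on the training data
train_mean = 0.19
imputed = one_row.fillna(train_mean)

print(still_missing.isnull().all())  # True
print(imputed.tolist())  # [0.19]
```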



In [14]:
class PandasPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.cat_vars_ = [
            "yr",
            "workingday",
            "holiday",
            "weekday",
            "season",
            "weathersit",
        ]
        self.cont_vars_ = ["temp", "hum", "windspeed"]
        self.to_convert_ = ["yr", "weekday"]

    def preprocess_f(self, X_df, train_mean):
        # Work on a copy
        X_df = X_df.copy()

        # Missing values in continuous features
        for c in self.cont_vars_:
            X_df[c] = X_df[c].fillna(train_mean[c])

        # Explicitly convert to string values
        # (applymap was deprecated in pandas 2.1 in favor of DataFrame.map)
        convert_f = lambda x: str(int(x)) if not np.isnan(x) else np.nan
        X_df[self.to_convert_] = X_df[self.to_convert_].applymap(convert_f)

        # .. in categorical ones: create 'missing' category
        X_df[self.cat_vars_] = X_df[self.cat_vars_].fillna("missing")

        # One-hot encoding
        X_df = pd.get_dummies(X_df)

        return X_df

    def fit(self, X_df, y=None):
        # Check that we get a DataFrame
        assert isinstance(X_df, pd.DataFrame)

        # Save train mean for continuous variables
        self.train_mean_ = X_df[self.cont_vars_].mean()

        # Preprocess data
        X_preprocessed = self.preprocess_f(X_df, self.train_mean_)

        # Save columns names/order for inference time
        self.columns_ = X_preprocessed.columns

        return self

    def transform(self, X_df):
        # Check that we get a DataFrame
        assert isinstance(X_df, pd.DataFrame)

        # Preprocess data
        X_preprocessed = self.preprocess_f(X_df, self.train_mean_)

        # Make sure to have the same features
        X_reindexed = X_preprocessed.reindex(columns=self.columns_, fill_value=0)

        return X_reindexed


Let’s look at the main differences from our previous implementation. First, the preprocess_f function and the lists of columns are now part of the object. We also pass a train_mean argument to the function. These values are computed at training time in the fit() step and stored in the train_mean_ attribute. Finally, in the transform() step, we reuse our preprocessing function with the train mean values.

Let’s test this new implementation.

In [16]:
preprocessor = PandasPreprocessor()
preprocessor.fit(data_df)
preprocessor.transform(data_df.iloc[:1])

Unnamed: 0,temp,hum,windspeed,casual,yr_2011,yr_2012,yr_missing,workingday_missing,workingday_no,workingday_yes,...,weekday_missing,season_fall,season_missing,season_spring,season_summer,season_winter,weathersit_clear,weathersit_cloudy,weathersit_missing,weathersit_rainy
0,0.344,0.806,0.188969,331,1,0,0,0,1,0,...,0,0,0,1,0,0,0,1,0,0


The windspeed value now corresponds to the mean wind speed computed during the fit() call.

#### Complete Pipeline
Let’s build a final pipeline with our new PandasPreprocessor custom transformer.



In [17]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Use our custom transformer in a pipeline
pipe = Pipeline(
    [("preprocessor", PandasPreprocessor()), ("estimator", LinearRegression())]
)

We can now evaluate the pipe object as if it were a standard estimator. We will use the train/test set methodology.



In [18]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as MAE

# Split data
X = data_df.drop("casual", axis=1)
y = data_df.casual
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Evaluate estimator
pipe.fit(X_tr, y_tr)
print("MAE: {:.2f}".format(MAE(y_te, pipe.predict(X_te))))

MAE: 293.58


It’s important to understand that our implementation handles the separation between train and test sets. The mean values are computed on the train set X_tr and used to replace missing values in both sets.
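This separation can be illustrated with a stripped-down transformer. The `ToyMeanImputer` class and the two small DataFrames below are hypothetical and much simpler than PandasPreprocessor; they only show that the statistics learned in fit() on the train data are the ones reused in transform() on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToyMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical minimal transformer: impute NaN with the train mean."""

    def fit(self, X_df, y=None):
        # State learned from the data passed to fit() only
        self.train_mean_ = X_df.mean()
        return self

    def transform(self, X_df):
        # Reuse the stored means, whatever data we receive here
        return X_df.fillna(self.train_mean_)


train = pd.DataFrame({"windspeed": [0.1, 0.3, np.nan]})
test = pd.DataFrame({"windspeed": [np.nan, 0.9]})

imputer = ToyMeanImputer().fit(train)
test_filled = imputer.transform(test)
print(test_filled)
```

The missing test value is replaced by roughly 0.2, the mean of the train values, and not by 0.9, the mean of the test values.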

This time, we get a slightly larger MAE score than in the previous units: about 294 vs. 280. The difference is due to the missing values in the data: the imputation cannot fully recover the lost information.

#### Summary
In this unit, we experimented with custom transformers from Scikit-learn and used them to encapsulate a complex set of Pandas preprocessing steps.

It’s interesting to note that Scikit-learn transformers can only transform the feature (column) axis, but not the data point (row) axis. For example, we cannot create transformers that drop data points, as is done with e.g. outlier removal. In the next unit, we will see how to do this by defining custom estimators.
