# A *fast* one hot encoder with sklearn and pandas

If you've worked in data science for any length of time, you've undoubtedly transformed your data into a one hot encoded format before. In this post we'll explore implementing a *fast* one hot encoder with  [scikit-learn](http://scikit-learn.org/stable/) and [pandas](https://pandas.pydata.org/).

# sklearn's one hot encoders

`sklearn` has implemented several classes for one hot encoding data from various formats (`DictVectorizer`, `OneHotEncoder` and `CategoricalEncoder` - not in current release). In this post we'll compare our implementation to `DictVectorizer` which is the most natural for working with `pandas.DataFrame`s.

## The pros of DictVectorizer

`DictVectorizer` has the following great features which we will preserve in our implementation

1. Works great in `sklearn` pipelines and train/test splits.
    - If feature was present during train time but not in the data at predict time
      `DictVectorizer` will automatically set the corresponding value to 0.
    - If a feature is present at predict time that was not at train time it is not
      encoded.
2. Features to be encoded are inferred from the data (user does not need to specify this).
    - This means numeric values in the input remain unchanged and `str` fields are
      encoded automatically.
3. We can get a mapping from feature names to the one hot encoded transformation values.
    - This is useful for looking at coefficients of feature importances of a model.

## The cons of DictVectorizer

`DictVectorizer` has two main blemishes (as related to a specific but common use case, see disclaimer below).

1. Transforming large `DataFrame`s is **slow**
2. The `.fit()` and `.transform()` signatures do not accept `DataFrame`s. To use `DictVectorizer`
    a `DataFrame` must first be converted to a `list` of `dict`s (which is also slow), e.g.
    
```python
    DictVectorizer().fit_transform(X.to_dict('records'))
```
    
Our implementation will guarantee the features of `DictVectorizer` listed in the pros section above and improve the conds by accepting a `DataFrame` as input and vastly improving the speed of the transformation. Our implementation will get a boost in performance by wrapping the super fast `pandas.get_dummies()` with a subclass of `sklearn.base.TransformerMixin`.

Before we get started let's compare the speed of `DictVectorizer` with `pandas.get_dummies()`.

## An improved one hot encoder

Our improved implementation will mimic the `DictVectorizer` interface (except that it accepts `DataFrame`s as input) by wrapping the super fast `pandas.get_dummies()` with a subclass of `sklearn.base.TransformerMixin`. Subclassing the `TransformerMixin` makes it easy for our class to integrate with popular `sklearn` paradigms such as their `Pipeline`s.

## Disclaimer

Note that we are specifically comparing the speed of `DictVectorizer` for the following use case only.

1. We are starting with a `DataFrame` which must be converted to a list of `dict`s
2. We are only interested in dense output, e.g. `DictVectorizer(sparse=False)`

# Time trials

Before getting started let's compare the speed of `DictVectorizer` with `pandas.get_dummies()`.

In [1]:
# first create *large* data set

import numpy as np
import pandas as pd

SIZE = 10000

df = pd.DataFrame({
    'int1': np.random.randint(0, 100, size=SIZE),
    'int2': np.random.randint(0, 100, size=SIZE),
    'float1': np.random.uniform(size=SIZE),
    'str1': np.random.choice([str(x) for x in range(10)], size=SIZE),
    'str1': np.random.choice([str(x) for x in range(75)], size=SIZE),
    'str1': np.random.choice([str(x) for x in range(150)], size=SIZE),
})

In [2]:
%%time
_ = pd.get_dummies(df)

CPU times: user 11.8 ms, sys: 7.16 ms, total: 18.9 ms
Wall time: 22 ms


As we can see, `pandas.get_dummies()` is fast. Let's take a look at `DictVectorizer`s speed.

In [3]:
%%time
from sklearn.feature_extraction import DictVectorizer
_ = DictVectorizer(sparse=False).fit_transform(df.to_dict('records'))

CPU times: user 407 ms, sys: 101 ms, total: 508 ms
Wall time: 759 ms


It's also informative to see that although `DictVectorizer` is slower than `pandas.get_dummies()`, converting the `DataFrame` to a `list` of `dict`s is the real bottleneck.

In [4]:
%%time
# time just to get list of dicts
df_dicts = df.to_dict('records')

CPU times: user 184 ms, sys: 6.82 ms, total: 191 ms
Wall time: 224 ms


In [5]:
%%time
# time to transform dicts
_ = DictVectorizer(sparse=False).fit_transform(df_dicts)

CPU times: user 55.1 ms, sys: 7.27 ms, total: 62.3 ms
Wall time: 92.7 ms


As we can see `pandas.get_dummies()` is *much* faster for our use case.

# Implemention

In [6]:
import sklearn


class GetDummies(sklearn.base.TransformerMixin):
    """Fast one-hot-encoder that makes use of pandas.get_dummies() safely
    on train/test splits.
    """
    def __init__(self, dtypes=None):
        self.input_columns = None
        self.final_columns = None
        if dtypes is None:
            dtypes = [object, 'category']
        self.dtypes = dtypes

    def fit(self, X, y=None, **kwargs):
        self.input_columns = list(X.select_dtypes(self.dtypes).columns)
        X = pd.get_dummies(X, columns=self.input_columns)
        self.final_columns = X.columns
        return self
        
    def transform(self, X, y=None, **kwargs):
        X = pd.get_dummies(X, columns=self.input_columns)
        X_columns = X.columns
        # if columns in X had values not in the data set used during
        # fit add them and set to 0
        missing = set(self.final_columns) - set(X_columns)
        for c in missing:
            X[c] = 0
        # remove any new columns that may have resulted from values in
        # X that were not in the data set when fit
        return X[self.final_columns]
    
    def get_feature_names(self):
        return tuple(self.final_columns)

In [7]:
%%time
# let's take a look at its speed
get_dummies = GetDummies()
get_dummies.fit_transform(df)

CPU times: user 14 ms, sys: 702 µs, total: 14.7 ms
Wall time: 16.8 ms


As we can see the GetDummies implentation has slowed down a bit from the original `pandas.get_dummes()` due to the overhead of making sure it handles train/test splits correctly, however its still super fast.

Let's also take a look at some of its other features.

In [8]:
%%time
# it works in sklearn pipelines too
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    GetDummies(),
    DecisionTreeClassifier(max_depth=3)
)
model.fit(df, np.random.choice([0, 1], size=SIZE))

CPU times: user 52.1 ms, sys: 22.3 ms, total: 74.4 ms
Wall time: 112 ms


In [9]:
# you can also pull out the feature names to look at feature importances

tree = model.steps[-1][-1]
importances = tree.feature_importances_
std = np.std(tree.feature_importances_)

indices = np.argsort(importances)[:10:-1]
feature_names = model.steps[0][-1].get_feature_names()

# Print the feature ranking
print("Feature ranking:")

for f in range(len(feature_names[:10])):
    print("%d. feature %s (%f)" % (f + 1, feature_names[indices[f]], importances[indices[f]]))

Feature ranking:
1. feature str1_47 (0.235331)
2. feature int2 (0.232977)
3. feature float1 (0.218621)
4. feature str1_118 (0.157708)
5. feature str1_97 (0.155362)
6. feature str1_131 (0.000000)
7. feature str1_144 (0.000000)
8. feature str1_143 (0.000000)
9. feature str1_142 (0.000000)
10. feature str1_141 (0.000000)


In [10]:
# train/test splits are safe!
get_dummies = GetDummies()

# create test data that demonstrates how GetDummies handles
# different train/test conditions
df1 = pd.DataFrame([
    [1, 'a', 'b', 'hi'],
    [2, 'c', 'd', 'there']
], columns=['foo', 'bar', 'baz', 'grok'])
df2 = pd.DataFrame([
    [3, 'a', 'e', 'whoa', 0],
    [4, 'c', 'b', 'whoa', 0],
    [4, 'c', 'b', 'there', 0],
], columns=['foo', 'bar', 'baz', 'grok', 'new'])

get_dummies.fit_transform(df1)

Unnamed: 0,foo,bar_a,bar_c,baz_b,baz_d,grok_hi,grok_there
0,1,1,0,1,0,1,0
1,2,0,1,0,1,0,1


In [11]:
# 1. the new values of 'e' and 'whoa' are not encoded
# 2. features baz_b and baz_d are both set to 0 when no suitable
#      value for baz is found
get_dummies.transform(df2)

Unnamed: 0,foo,bar_a,bar_c,baz_b,baz_d,grok_hi,grok_there
0,3,1,0,0,0,0,0
1,4,0,1,1,0,0,0
2,4,0,1,1,0,0,1


# Conclusion

I recommend using the `GetDummies` class as needed only. `sklearn` is a true industry standard and deviations thereof should occur only be when the size of the data necessitates such a change. For most modest sized data sets `DictVectorizer` has served me well. However, I've also reduced the time required to one hot encode my data from ~30 minutes to ~5 on data sets of around 100M records.