# Don't Explode - Encode

by Phillip Adkins, Principal Machine Learning Science Manager, Microsoft IDEAs 



# Read This Before You Start Preprocessing Your Data for Tree-Based Models!

Preprocessing data is a huge part of building a Machine Learning model.  Data in its raw form often can't be fed directly into an algorithm.  Preprocessing enables ML algorithms to ingest and learn from data.

There are standard preprocessing techniques which are useful in a variety of contexts.  Techniques such as standardization, one-hot-encoding, missing-value imputation and more are a great way to shape your data up for feeding into an algorithm ... unless they're not!  

When you're working with tree-based models, these techniques can range from useless to harmful.  They can waste time, increase the likelihood of errors, and blow up memory, time and compute requirements while *lowering* model quality! 

Let's talk about one particular preprocessing technique to think twice about when you're working with tree-based models, and what to do instead.


# One-Hot-Encoding - Not So Hot?

When working with categorical data, the #1 go-to method for converting categories into something an ML algorithm can ingest is One-Hot-Encoding.  And it's a great technique!  However, there are several reasons why One-Hot-Encoding is a poor choice for tree-based algorithms.

1) The number 1 reason is that tree-based algorithms simply don't work very well with one-hot-encodings. Because trees work by partitioning, a split on a one-hot column can only separate one category from everything else - there's no way for a single split to say "if country is USA or UK then X".  If the model wants to use "country", there will have to be a "USA only" branch in the tree, a "UK only" branch in the tree, etc.

2) It can either blow up memory consumption or force you to work with a sparse matrix

3) It alters the shape of your data so the columns no longer correspond 1:1 with your original data frame / table

Think about mapping feature importance back to the original features!  It's now very difficult to the point that you might just opt not to do it, or you may be forced to look at your feature importances very differently than you otherwise would have.  

For example, instead of a feature importance for "country", you'll have a feature importance for "country=USA", another one for "country=UK" etc - from which it will be hard to determine the relative importance of "country" as a whole unless you do some extra math and transformations on top of these fractured feature importances.

What if you want to do feature selection using feature importances?  Will you drop individual values of categorical variables from your dataframe?  These will be the first to go when you filter on feature importance if you've done one-hot-encoding.

One-hot-encoding makes a mess!

4) There's a simpler, more accurate, faster method
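To make points 2 and 3 concrete, here is a minimal sketch on made-up data (the `country` and `age` columns and the category count are assumptions chosen for illustration). It compares the width of a one-hot-encoded frame with a single-column encoding of the same categorical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column with up to 1,000 distinct values.
rng = np.random.default_rng(0)
n_rows, n_categories = 10_000, 1_000
df = pd.DataFrame({
    "country": rng.integers(0, n_categories, size=n_rows).astype(str),
    "age": rng.integers(18, 90, size=n_rows),
})

# One-hot encoding: one new column per distinct category value.
one_hot = pd.get_dummies(df, columns=["country"])

# Single-column alternative: replace each category with its frequency in the data.
counts = df["country"].map(df["country"].value_counts())
single_column = df.assign(country=counts)

print(one_hot.shape)        # one column per category, plus "age"
print(single_column.shape)  # same two columns as the original frame
```

The single-column frame keeps the same column names as the raw data, so anything computed per column (feature importances, for instance) still lines up with the original features.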


## Another Class of Encoding

There are many methods to encode categoricals.  One-Hot-Encoding, while it has many drawbacks, is ideal in one sense: you can feed it into essentially any algorithm, and the algorithm will have a reasonable chance at extracting signal from the categorical. This is especially true for models which use dot products or matrix multiplications - like linear models, neural networks etc.

However, tree models are much more flexible in how they can extract information from a single feature.  Since tree models work by chopping up a given feature into partitions, a tree model is capable of carving the data up into segments which are defined by the categorical without having a one-hot-encoding handed to it.

A whole class of encodings is defined by mapping a categorical vector to a single numeric vector.  There are many strategies for doing this, and there's an opportunity here for an ML Scientist to imbue the categorical representation with some extra usefulness through clever feature engineering.

A good example of this is *target encoding*.

With target encoding, we map each category to the mean value of the target within that category.  This can be an especially useful encoding method for converting the categorical to something ordinal that is clearly relevant to the predictive task.  

A caveat here is that it's easy to overfit if you're not careful.  Some ways around this include: using older historical data to compute the encoding (rather than your current training set), doing stacking etc.
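As a rough sketch of target encoding with the "older historical data" safeguard, the snippet below learns the category means on one frame and applies them to another. The data, the column names (`country`, `purchased`) and the fallback to the global mean for unseen categories are all assumptions made for the example:

```python
import pandas as pd

# Hypothetical older data used only to learn the encoding.
history = pd.DataFrame({
    "country": ["USA", "USA", "UK", "UK", "UK", "FR"],
    "purchased": [1, 0, 1, 1, 0, 0],
})
# Hypothetical current data that will be fed to the model.
current = pd.DataFrame({"country": ["UK", "USA", "FR", "DE"]})

# Learn the mapping category -> mean target on the older data only.
global_mean = history["purchased"].mean()
target_means = history.groupby("country")["purchased"].mean()

# Apply it to the current data; categories never seen before ("DE")
# fall back to the global mean.
current["country_te"] = current["country"].map(target_means).fillna(global_mean)
print(current)
```

Because the means come from a separate frame, the encoded value for a row never contains that row's own target.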

However, you might be surprised to find out that a tree-based model can work exceedingly well with encodings that at-a-glance seem like they wouldn't be any good, or that the model would be unlikely to extract signal from.  

Here are some examples of these types of encodings:
- map categorical to random float
- sort categorical alphabetically, map to index in that list (ordinal encoding)
- map categorical to its frequency in the data
- map categorical to its frequency RANK in the data
etc.  

Amazingly, tree-based models are totally fine using encodings like this - which has been demonstrated repeatedly in data science competitions and in professional applications.
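Each of the encodings listed above is only a few lines of pandas. This is an illustrative sketch on a made-up `country` series, not the implementation used in the experiments later on:

```python
import numpy as np
import pandas as pd

s = pd.Series(["USA", "UK", "USA", "FR", "USA", "UK"], name="country")
categories = sorted(s.unique())  # alphabetical: ['FR', 'UK', 'USA']

# 1) Random float per category (seeded so the mapping is reproducible).
rng = np.random.default_rng(0)
random_map = dict(zip(categories, rng.random(len(categories))))

# 2) Ordinal: index of the category in the alphabetically sorted list.
ordinal_map = {c: i for i, c in enumerate(categories)}

# 3) Frequency: how many times the category occurs in the data.
freq = s.value_counts()
freq_map = freq.to_dict()

# 4) Frequency rank: 1 = most frequent; ties are broken by order of appearance.
rank_map = freq.rank(method="first", ascending=False).astype(int).to_dict()

encoded = pd.DataFrame({
    "country": s,
    "random": s.map(random_map),
    "ordinal": s.map(ordinal_map),
    "frequency": s.map(freq_map),
    "frequency_rank": s.map(rank_map),
})
print(encoded)
```

Every variant produces exactly one numeric column per categorical column, which is what keeps the transformed data aligned with the raw data.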

# When do these encodings NOT work?

Here are some reasons these encodings may not work:

1) collisions in the map
2) too many categorical values
3) model not tuned properly to allow partitioning fine enough to zero in on specific categoricals
4) luck - the encoding might just for whatever reason obscure information from the splitting criterion and may prevent the tree from deciding to use the feature even though it's got information content. 
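A collision (reason 1) is easy to reproduce with count encoding: two categories that occur equally often receive the same number, and the model can no longer tell them apart. A small sketch on made-up data:

```python
import pandas as pd

s = pd.Series(["USA", "USA", "UK", "UK", "FR"], name="country")

# Count encoding: each category is replaced by its number of occurrences.
counts = s.value_counts()
encoded = s.map(counts)

# Group categories by their encoded value to find collisions.
code_to_categories = {}
for category, code in counts.items():
    code_to_categories.setdefault(int(code), []).append(category)

collisions = {
    code: sorted(cats)
    for code, cats in code_to_categories.items()
    if len(cats) > 1
}
print(collisions)  # {2: ['UK', 'USA']}
```

Running a check like this before training shows whether a collision involves categories that matter for the task.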

# Other Encodings

In order to circumvent some of these issues, there are other ways of encoding categoricals to help ensure that information can be extracted.

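There is also "binary encoding", which writes each category's integer index in binary and spreads the digits across a small number of columns.  (It is not the same as one-hot-encoding, even though one-hot-encoding also produces binary features.)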
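For intuition, binary encoding can be sketched by hand: give each category an integer index, then spread the base-2 digits of that index over a few columns. This is a hand-rolled illustration on made-up data; the column layout produced by a library encoder may differ (for example in how indices are assigned):

```python
import numpy as np
import pandas as pd

s = pd.Series(["USA", "UK", "FR", "DE", "USA", "JP"], name="country")

# Step 1: integer index per category (alphabetical here; any fixed order works).
categories = sorted(s.unique())  # ['DE', 'FR', 'JP', 'UK', 'USA']
ordinal_map = {c: i for i, c in enumerate(categories)}
codes = s.map(ordinal_map).to_numpy()

# Step 2: number of bits needed to represent every index.
n_bits = max(1, int(np.ceil(np.log2(len(categories)))))  # 5 categories -> 3 bits

# Step 3: extract the bits, most significant first.
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

binary = pd.DataFrame(bits, columns=[f"country_bit{i}" for i in range(n_bits)])
print(binary)
```

Five categories fit in three columns here, where one-hot encoding would need five.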

Dimensionality reduction applied to the one-hot-encoding (PCA or random projection, for example) is another alternative that may retain much of the information in the categorical while keeping the representation skinny and dense.

Binary encoding and projection-based methods do have the undesirable attribute that they destroy the 1:1 alignment between the raw columns and the transformed input columns, but they do retain density, so they can still be a good compromise if the other encodings aren't working for some reason.
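A random projection of a one-hot matrix can be sketched with NumPy alone. The data, the number of categories and the choice of 8 output columns are assumptions for illustration; in practice the one-hot matrix would normally be kept sparse rather than materialized densely as it is here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical categorical column with up to 200 distinct values.
s = pd.Series(rng.integers(0, 200, size=2_000).astype(str), name="country")

# Wide one-hot matrix: one column per category.
one_hot = pd.get_dummies(s).to_numpy(dtype=float)

# Gaussian random projection down to a handful of dense columns.
n_components = 8
projection = rng.normal(size=(one_hot.shape[1], n_components)) / np.sqrt(n_components)
dense = one_hot @ projection

print(one_hot.shape, "->", dense.shape)
```

Each category ends up as a fixed 8-dimensional vector, so the representation stays dense, but the output columns no longer correspond to any single raw feature value.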

# Experiments

Let's run some quick experiments to demonstrate the effectiveness of some of these methods.

This is easy to extend.  For now, we'll test the following 4 methods on a variety of datasets:
- one-hot-encoding
- ordinal encoding
- binary encoding
- count encoding (frequency encoding)

The experiment function below also supports target encoding, but it is not included in the runs reported here.

The Category Encoders package's ordinal encoder (which we're using here) assigns each category an arbitrary integer with no meaningful order.  Scikit-learn also has an OrdinalEncoder, but by default it maps categories to their rank in the sorted list of categories (alphabetical for strings).   

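On each dataset, the code is set up to do a mini random hyperparameter search on xgboost for each method.  In the run shown here, the search is reduced to a single fixed configuration (learning rate 0.1, max depth 3, up to 500 trees with early stopping), so every encoding is compared under the same settings.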

We'll then get a look at a table of output scores with which we can compare the results.

## Function: Run a Single Experiment

In [8]:
import pandas as pd
import numpy as np
import time
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders as ce


class ContinuousToFloatTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, continuous_columns):
        self.continuous_columns = continuous_columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_continuous = X[self.continuous_columns].astype(np.float64)
        return X_continuous

def categorical_encoding_experiment(dataset_filename, feature_names, target_name, categoricals,
                                    continuous, encoding_method, problem_type):
    
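    def run_experiment(X_train, y_train, X_val, y_val, problem_type, params):
        # Fit on the training split with early stopping on the validation split,
        # then score on that same validation split.
        # Note: xgboost >= 1.6 takes early_stopping_rounds and eval_metric in the
        # constructor; recent releases no longer accept them in fit().
        if problem_type == "classification":
            model = XGBClassifier(**params, early_stopping_rounds=10, eval_metric="logloss")
            model.fit(
                X_train,
                y_train,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
            preds = model.predict(X_val)
            return accuracy_score(y_val, preds)

        else:
            model = XGBRegressor(**params, early_stopping_rounds=10, eval_metric="rmse")
            model.fit(
                X_train,
                y_train,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
            preds = model.predict(X_val)
            return mean_squared_error(y_val, preds)

    df = pd.read_csv(dataset_filename)

    X = df.drop(target_name, axis=1)
    y = df[target_name]

    # Recent xgboost versions expect integer class labels (0..n_classes-1),
    # so map string targets such as "yes"/"no" to integer codes.
    if problem_type == "classification":
        y = y.astype("category").cat.codes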

    # Define categorical encoding method
    if encoding_method == "one_hot":
        encoder = ce.OneHotEncoder(cols=categoricals)
    elif encoding_method == "ordinal":
        encoder = ce.OrdinalEncoder(cols=categoricals)
    elif encoding_method == "binary":
        encoder = ce.BinaryEncoder(cols=categoricals)
    elif encoding_method == "target":
        encoder = ce.TargetEncoder(cols=categoricals)
    elif encoding_method == "count":
        encoder = ce.CountEncoder(cols=categoricals)
    else:
        raise ValueError("Invalid encoding_method")
        
    # Define column transformer
    preprocessor = ColumnTransformer(transformers=[("categorical", encoder, categoricals)], remainder="passthrough")

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

    X_train = preprocessor.fit_transform(X_train, y_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    random_search_results = []

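    # Placeholder for a random search: a single iteration with fixed values.
    # The commented-out expressions show the ranges a real search would sample.
    for i in range(1):
        learning_rate = 0.1  # np.random.uniform(0.01, 0.3)
        max_depth = 3  # np.random.randint(2, 5)
        n_estimators = 500  # np.random.randint(50, 200)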

        params = {
            "learning_rate": learning_rate,
            "max_depth": max_depth,
            "n_estimators": n_estimators,
            "random_state": 42
        }

        start_time = time.time()
        score = run_experiment(X_train, y_train, X_val, y_val, problem_type, params)
        elapsed_time = time.time() - start_time

        result = {
            "iteration": i + 1,
            "params": params,
            "score": score,
            "elapsed_time": elapsed_time
        }

        random_search_results.append(result)
        print(result)

    return {
        "dataset": dataset_filename,
        "problem_type": problem_type,
        "encoding_method": encoding_method,
        "results": random_search_results
    }


## Pull Data, Load It, and Run All Experiments 

In [9]:

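def load_and_save_dataset(url, column_names, columns_to_save, filename, sep):
    try:
        # Works when the file has a header row containing the expected column names.
        df = pd.read_csv(url, sep=sep)[columns_to_save]
    except KeyError:
        # No usable header row: re-read and supply the column names ourselves.
        df = pd.read_csv(url, sep=sep, header=None, names=column_names)[columns_to_save]
    df.to_csv(filename, index=False)
    return df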

def run_experiment(
    dataset_filename,
    feature_names,
    target_name,
    categoricals,
    continuous,
    problem_type,
    encoding_methods
):
    experiment_results = []
    for encoding_method in encoding_methods:
        experiment_results.append(
            categorical_encoding_experiment(
                dataset_filename,
                feature_names,
                target_name,
                categoricals,
                continuous,
                encoding_method,
                problem_type
            )
        )
    return experiment_results

def run_all_experiments():
    datasets = [
        {
            "url": "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
            "column_names": [
                "sex",
                "length",
                "diameter",
                "height",
                "whole_weight",
                "shucked_weight",
                "viscera_weight",
                "shell_weight",
                "rings"
            ],
            "filename": "abalone.csv",
            "target_name": "rings",
            "categoricals": ["sex"],
            "continuous": [
                "length",
                "diameter",
                "height",
                "whole_weight",
                "shucked_weight",
                "viscera_weight",
                "shell_weight"
            ],
            "problem_type": "regression"
        },

        {
            "url": "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv",
            "column_names": [
                "Survived",
                "Pclass",
                "Name",
                "Sex",
                "Age",
                "Siblings/Spouses Aboard",
                "Parents/Children Aboard",
                "Fare"
            ],
            "filename": "titanic.csv",
            "target_name": "Survived",
            "categoricals": ["Pclass", "Sex"],
            "continuous": ["Age", "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"],
            "problem_type": "classification"
        },
        {
            "url": "https://raw.githubusercontent.com/ayan-cs/bank-marketing-uciml/main/bank-full.csv",
            "column_names": [
                "age",
                "job",
                "marital",
                "education",
                "default",
                "housing",
                "loan",
                "contact",
                "month",
                "duration",
                "campaign",
                "pdays",
                "previous",
                "poutcome",
                "y"
            ],
            "separator":';',
            "filename": "bank.csv",
            "target_name": "y",
            "categoricals": [
                "job",
                "marital",
                "education",
                "default",
                "housing",
                "loan",
                "contact",
                "month",
                "poutcome"
            ],
            "continuous": [
                "age",
                "duration",
                "campaign",
                "pdays",
                "previous",
            ],
            "problem_type": "classification"
        },
        {
            "url": "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data",
            "column_names": [
                "class",
                "cap_shape",
                "cap_surface",
                "cap_color",
                "bruises",
                "odor",
                "gill_attachment",
                "gill_spacing",
                "gill_size",
                "gill_color",
                "stalk_shape",
                "stalk_root",
                "stalk_surface_above_ring",
                "stalk_surface_below_ring",
                "stalk_color_above_ring",
                "stalk_color_below_ring",
                "veil_type",
                "veil_color",
                "ring_number",
                "ring_type",
                "spore_print_color",
                "population",
                "habitat"
            ],
            "filename": "mushroom.csv",
            "target_name": "class",
            "categoricals": [
                "cap_shape",
                "cap_surface",
                "cap_color",
                "bruises",
                "odor",
                "gill_attachment",
                "gill_spacing",
                "gill_size",
                "gill_color",
                "stalk_shape",
                "stalk_root",
                "stalk_surface_above_ring",
                "stalk_surface_below_ring",
                "stalk_color_above_ring",
                "stalk_color_below_ring",
                "veil_type",
                "veil_color",
                "ring_number",
                "ring_type",
                "spore_print_color",
                "population",
                "habitat"
            ],
            "continuous": [],
            "problem_type": "classification"
        },
        {
            "url": "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
            "column_names": [
                "age",
                "workclass",
                "fnlwgt",
                "education",
                "education_num",
                "marital_status",
                "occupation",
                "relationship",
                "race",
                "sex",
                "capital_gain",
                "capital_loss",
                "hours_per_week",
                "native_country",
                "income"
            ],
            "filename": "adult.csv",
            "target_name": "income",
            "categoricals": [
                "workclass",
                "education",
                "marital_status",
                "occupation",
                "relationship",
                "race",
                "sex",
                "native_country"
            ],
            "continuous": [
                "age",
                "fnlwgt",
                "education_num",
                "capital_gain",
                "capital_loss",
                "hours_per_week"
            ],
            "problem_type": "classification"
        }
    ]


    encoding_methods = ["binary", "ordinal", "one_hot", "count"]

    all_experiment_results = []
    for dataset in datasets:
        print('doing', dataset['filename'])
        sep = dataset.get('separator', ',')
        columns_to_save = dataset['continuous'] + dataset['categoricals'] + [dataset['target_name']]
        df = load_and_save_dataset(dataset["url"], dataset['column_names'], columns_to_save, dataset["filename"], sep)
        experiment_results = run_experiment(
            dataset["filename"],
            dataset["column_names"],
            dataset["target_name"],
            dataset["categoricals"],
            dataset["continuous"],
            dataset["problem_type"],
            encoding_methods
        )
        all_experiment_results.extend(experiment_results)

    return all_experiment_results


all_experiment_results = run_all_experiments()

doing abalone.csv
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 4.889955867904465, 'elapsed_time': 0.19076967239379883}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 4.882568465268943, 'elapsed_time': 0.22170042991638184}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 4.958471274702026, 'elapsed_time': 0.2112436294555664}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 4.882568465268943, 'elapsed_time': 0.24178767204284668}
doing titanic.csv
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.8028169014084507, 'elapsed_time': 0.09649848937988281}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.7746478873239436, 'elapsed_time': 0.1018681526184082}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.8028169014084507, 'elapsed_time': 0.1271824836730957}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.8028169014084507, 'elapsed_time': 0.10581469535827637}
doing bank.csv
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.9087641692009953, 'elapsed_time': 2.3076682090759277}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.9084876969864528, 'elapsed_time': 3.197153091430664}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.9112524191318773, 'elapsed_time': 5.439720630645752}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.9095935858446226, 'elapsed_time': 3.8912975788116455}
doing mushroom.csv
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 1.0, 'elapsed_time': 4.080474615097046}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 1.0, 'elapsed_time': 2.138575792312622}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 1.0, 'elapsed_time': 5.378643989562988}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 1.0, 'elapsed_time': 1.2091398239135742}
doing adult.csv
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.8664107485604606, 'elapsed_time': 6.174733638763428}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.8714011516314779, 'elapsed_time': 4.378894805908203}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.872168905950096, 'elapsed_time': 10.834599256515503}
{'iteration': 1, 'params': {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500, 'random_state': 42}, 'score': 0.872168905950096, 'elapsed_time': 4.939168930053711}


## Parse and Display Results as Table

In [10]:
def results_to_dataframe(all_experiment_results):
    ''' construct dataframe, 
    extract best experimental result
    for each encoding technique '''
    def best_score(experiment):
        best = min if experiment['problem_type'] == 'regression' else max if experiment['problem_type'] == 'classification' else None
        return best(result['score'] for result in experiment['results' ])
        
    
    df = pd.DataFrame(all_experiment_results)
    df['score'] = [best_score(e) for e in all_experiment_results] 

    return df

df = results_to_dataframe(all_experiment_results)
df
df.set_index(['dataset', 'problem_type', 'encoding_method'])['score'].unstack('encoding_method')

| dataset | problem_type | binary | count | one_hot | ordinal |
|---|---|---|---|---|---|
| abalone.csv | regression | 4.889956 | 4.882568 | 4.958471 | 4.882568 |
| adult.csv | classification | 0.866411 | 0.872169 | 0.872169 | 0.871401 |
| bank.csv | classification | 0.908764 | 0.909594 | 0.911252 | 0.908488 |
| mushroom.csv | classification | 1.0 | 1.0 | 1.0 | 1.0 |
| titanic.csv | classification | 0.802817 | 0.802817 | 0.802817 | 0.774648 |


# Analysis

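For regression problems, we're computing the mean squared error; lower is better.  

For classification, it's "accuracy"; higher is better.  

Both are measured on the validation split, which is also the split used for early stopping, so the absolute numbers may be somewhat optimistic.  The comparison between encodings is still like-for-like.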

** Note: "accuracy" is not my favorite metric for understanding how well an ML algo is doing and has plenty of issues, but I think it's fair for comparing multiple algos.  I plan on dressing this up a bit later. 

In my experience, and that of many others, "Count Encoding" a.k.a. "Frequency Encoding" is a very strong and robust technique.  You can see that here: on every dataset it is the best, or tied for the best, of the alternate encoding techniques auditioned (binary, ordinal and count). 

Best Results:
- abalone: count / ordinal tied
- adult: count / one-hot tied
- bank: one-hot encoding
- mushroom: too easy - all 100%
- titanic: binary / count / one-hot tied
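The best method per dataset can be read off the results table programmatically. The sketch below re-enters the scores from the table by hand (rounded as displayed) and treats abalone as lower-is-better because its score is a mean squared error; the variable names are made up for the example:

```python
import pandas as pd

# Scores copied from the results table (rows: datasets, columns: encodings).
scores = pd.DataFrame(
    {
        "binary":  [4.889956, 0.866411, 0.908764, 1.0, 0.802817],
        "count":   [4.882568, 0.872169, 0.909594, 1.0, 0.802817],
        "one_hot": [4.958471, 0.872169, 0.911252, 1.0, 0.802817],
        "ordinal": [4.882568, 0.871401, 0.908488, 1.0, 0.774648],
    },
    index=["abalone", "adult", "bank", "mushroom", "titanic"],
)

# Regression datasets are scored by MSE (lower is better); the rest by accuracy.
lower_is_better = {"abalone"}

best_by_dataset = {}
for dataset, row in scores.iterrows():
    best = row.min() if dataset in lower_is_better else row.max()
    # Keep every method that reaches the best score, so ties are visible.
    best_by_dataset[dataset] = row.index[row == best].tolist()

for dataset, winners in best_by_dataset.items():
    print(f"{dataset}: {' / '.join(winners)}")
```

Listing ties explicitly matters here, because several of the datasets have more than one method at the top.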


The differences in accuracy here are very small.  On other datasets, they may be larger.

The takeaway from this table is that count encoding is typically about as good on average as one-hot, but it has the advantages alluded to above including simplicity of representation.  

# Conclusion

I highly recommend that you use count/frequency encoding instead of one-hot encoding when working with tree-based models unless you've got a very good reason!

# Followup

I'd like to repeat these experiments with more datasets and improve the depth of comparisons between methods. 

There's a lot more to explore here. 

Thanks for reading!