# Paper Implementation

This notebook comprises the code for our reimplementation of the
[paper titled 'Entity Embeddings of Categorical Variables'](https://arxiv.org/pdf/1604.06737.pdf).
The original code for the paper is available [here](https://github.com/entron/entity-embedding-rossmann)
and is written using the Keras framework, whereas we used PyTorch. We referred to the
original code to try and implement the paper as closely as possible.

The paper's experiments cover neural network, random forest, gradient
boosted trees, and KNN models, each with and without entity embeddings. We have implemented only
the neural networks here, as they are the core of the paper.

Our results are close to theirs (MAPE within about 0.02 in every setting), which suggests that our implementation is broadly correct.

## Data Preprocessing

Unlike the paper, we use `pandas` to help with preprocessing because it is easier.
The goal is to have a dataset with variables in TABLE 1 of the paper.
The steps we take:
1) Read in the dataset, only the columns that we will need.
1) Drop rows where `Sales` (the dependent variable) is zero.
1) Replace `Date` with `Day`, `Month`, and `Year`.
1) Merge in the `State` column from a separate sheet.
1) Encode all categorical columns into numeric types.
    These are all the columns except `Sales`.
1) Reorder the columns.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# 1.
relevant_columns = ["Store", "DayOfWeek", "Date", "Sales", "Promo"]
dataset = pd.read_csv("rossmann-store-sales/train.csv", usecols=relevant_columns)

# 2.
dataset = dataset[dataset["Sales"] != 0]

# 3.
dataset[["Year", "Month", "Day"]] = dataset["Date"].str.split("-", expand=True)
dataset.drop(columns=["Date"], inplace=True)

# 4.
state_df = pd.read_csv("rossmann-store-sales/store_states.csv")
dataset = pd.merge(dataset, state_df, how="left", on="Store")
del state_df # We delete large variables to save memory

# 5.
label_encoder = LabelEncoder()
for col in dataset.columns.difference(["Sales"]):
    dataset[col] = label_encoder.fit_transform(dataset[col])

# 6.
dataset = dataset[["Store", "DayOfWeek", "Day", "Month", "Year", "Promo", "State", "Sales"]]
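
Step 5 matters because an embedding layer indexes its table with integers from 0 to n - 1. `LabelEncoder` assigns each distinct value a code in that range, following the sorted order of the values. The stdlib-only sketch below mimics that behavior on made-up inputs; `label_encode` is a hypothetical helper for illustration, not part of scikit-learn.

```python
def label_encode(values):
    """Map each value to its rank among the sorted unique values."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes

# Made-up store ids: the codes start at 0 regardless of the raw ids.
store_codes, store_classes = label_encode([5, 1, 7, 1])
print(store_codes, store_classes)  # [1, 0, 2, 0] [1, 5, 7]

# Zero-padded month strings, as produced by splitting "YYYY-MM-DD" on "-".
month_codes, _ = label_encode(["12", "01", "07"])
print(month_codes)  # [2, 0, 1]
```

Because the date parts are zero-padded strings, their sorted order matches their numeric order, so the codes keep the calendar ordering.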

## Creating Tensors

We need 2 copies of the dataset here, which differ in their ordering:
- Shuffled set
    * Better for benchmarking models on statistical prediction accuracy
- Temporal set
    * Preserves time ordering, which is better for measuring how well the models generalize.
        The original data was in reverse chronological order, so we simply reverse it.

Next, because the range of the dependent variable is very large, we apply a log
transformation to it so that it is consistent with the output range of our models.
We encapsulate this in an `OutputEncoder` class, which makes it easy to apply
the transformation in both directions.
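
Written out, the scaling and its inverse look as follows. Here $y$ is a sales value, $y_{\max}$ is the largest sales value in the dataset, and $\tilde{y}$ is the scaled target (the symbols are our notation for this note, not necessarily the paper's):

$$
\tilde{y} = \frac{\log y}{\log y_{\max}}, \qquad y = \exp\left(\tilde{y} \, \log y_{\max}\right) = y_{\max}^{\tilde{y}}
$$

Assuming every remaining `Sales` value is at least 1 (rows with zero sales are dropped during preprocessing) and $y_{\max} > 1$, we get $0 \le \tilde{y} \le 1$, which is the interval that a sigmoid output layer covers.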

Finally, we define a function that creates the train/test split: the first 90% of
the rows form the training pool and the last 10% form the test set. Note that for
training, 200k rows are randomly sampled from the pool rather than using all of it.

In [2]:
import torch

# Shuffled set
shuffled_set = dataset.sample(frac=1)
shuffled_set = torch.tensor(shuffled_set.values, dtype=torch.float)

# Time sorted set
temporal_set = dataset.iloc[::-1].copy()
temporal_set = torch.tensor(temporal_set.values, dtype=torch.float)

del dataset

In [3]:
class OutputEncoder():
    def __init__(self, max_output):
        self.max_output = max_output

    def encode(self, output):
        with torch.no_grad():
            return torch.log(output) / torch.log(self.max_output)

    def decode(self, output):
        with torch.no_grad():
            return torch.exp(output * torch.log(self.max_output))

output_encoder = OutputEncoder(torch.max(temporal_set[:, -1]))

temporal_set[:, -1] = output_encoder.encode(temporal_set[:, -1])
shuffled_set[:, -1] = output_encoder.encode(shuffled_set[:, -1])

In [4]:
def test_train_split(dataset):
    split_threshold = int(0.9 * dataset.size(0))

    X_train = dataset[:split_threshold, :-1].long()
    X_test = dataset[split_threshold:, :-1].long()

    y_train = dataset[:split_threshold, -1]
    y_test = dataset[split_threshold:, -1]

    train_indices = torch.randperm(X_train.size(0))[:200_000]
    return X_train[train_indices], y_train[train_indices], X_test, y_test
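
The order of operations in `test_train_split` matters for the temporal set: the split happens first, so the test rows are always the last 10%, and only the training pool is subsampled. The stdlib-only sketch below shows the same idea on a toy list; `split_then_subsample`, its default values, and the toy sizes are assumptions made for illustration.

```python
import random

def split_then_subsample(rows, train_fraction=0.9, sample_size=5, seed=0):
    """Cut off the tail as the test set, then sample the training pool."""
    threshold = int(train_fraction * len(rows))
    train_pool, test = rows[:threshold], rows[threshold:]
    rng = random.Random(seed)
    # Sampling without replacement, like taking a prefix of a random permutation.
    sampled = rng.sample(train_pool, min(sample_size, len(train_pool)))
    return sampled, test

rows = list(range(20))  # pretend these are 20 time-ordered rows
train_sample, test = split_then_subsample(rows)
print(test)  # [18, 19]
```

No sampled training row can come from the test tail, so the temporal test set stays strictly later than everything the model trains on.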

## Creating Neural Networks

The architecture and all hyperparameters are copied from the paper. The input
is either converted to a one-hot representation or passed through an embedding
layer before being sent to a typical feedforward neural network.

For the implementation, we create a `parameters` dictionary that holds the
number of unique values and the embedding dimension for each variable. These numbers
come from TABLE 1 of the paper.

In [5]:
# {parameter_name: (unique_values, embedding_dimension)}
parameters = {
    "store": (1115, 10),
    "day_of_week": (7, 6),
    "day": (31, 10),
    "month": (12, 6),
    "year": (3, 2),
    "promotion": (2, 1),
    "state": (12, 6)
}
    
class EmbeddingNN(torch.nn.Module):
    def __init__(self):
        super(EmbeddingNN, self).__init__()

        emb_list = [torch.nn.Embedding(n, d) for n, d in parameters.values()]
        self.emb_layers = torch.nn.ModuleList(emb_list)

        input_size = sum(dim for _, dim in parameters.values())  # total embedding width

        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(input_size, 1000),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 500),
            torch.nn.ReLU(),
            torch.nn.Linear(500, 1),
            torch.nn.Sigmoid()
        )

    def forward(self, X):
        embeddings = torch.cat([emb(X[:, i]) for i, emb in enumerate(self.emb_layers)], dim=1)
        
        return self.feed_forward(embeddings)

class OneHotNN(torch.nn.Module):
    def __init__(self):
        super(OneHotNN, self).__init__()
        input_size = sum(n for n, _ in parameters.values())  # total one-hot width

        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(input_size, 1000),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 500),
            torch.nn.ReLU(),
            torch.nn.Linear(500, 1),
            torch.nn.Sigmoid()
        )

    def forward(self, X):
        one_hot = torch.cat([torch.nn.functional.one_hot(X[:, i], num_emb).float()
                             for i, (num_emb, _) in enumerate(parameters.values())], dim=1)

        return self.feed_forward(one_hot)
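
The two models differ only in how the first layer sees its input. The stdlib-only sketch below computes the input width each model gets from the `parameters` dictionary above. It then checks, on a made-up 3 x 2 table, that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix. The helper names and the table values are assumptions made for illustration.

```python
# Copied from the `parameters` dictionary above: (unique_values, embedding_dimension).
parameters = {
    "store": (1115, 10),
    "day_of_week": (7, 6),
    "day": (31, 10),
    "month": (12, 6),
    "year": (3, 2),
    "promotion": (2, 1),
    "state": (12, 6),
}

# Width of the first Linear layer's input under each encoding.
embedding_width = sum(dim for _, dim in parameters.values())
one_hot_width = sum(n for n, _ in parameters.values())
print(embedding_width, one_hot_width)  # 41 1182

def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector (length n) by an n x d matrix given as nested lists."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]

# Made-up embedding table shaped like the one for "year" (3 values, dimension 2).
year_table = [[0.1, -0.4], [0.7, 0.2], [-0.3, 0.9]]

index = 2
looked_up = year_table[index]                                    # embedding lookup
multiplied = vector_times_matrix(one_hot(index, 3), year_table)  # one-hot product
print(looked_up == multiplied)  # True
```

The lookup avoids materializing the 1182-wide one-hot input, and the embedding model's first dense layer sees only 41 learned features.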

## Training and Testing

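The learning rate, batch size, and number of epochs are copied from the paper.
Training minimizes the L1 loss on the log-scaled target, and we use the `MAPE`
(mean absolute percentage error) metric on the decoded sales values for scoring.
For the final evaluation, we train 5 models and average their predictions.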

In [6]:
def train_model(model, X, y):
    loss_fn = torch.nn.L1Loss()
    optim = torch.optim.Adam(model.parameters(), lr=0.001)

    epochs = 10
    batch_size = 128
    total_samples = len(X)

    model.train()
    for _ in range(epochs):
        for i in range(0, total_samples, batch_size):
            inputs = X[i:i+batch_size]
            targets = y[i:i+batch_size]
            
            optim.zero_grad()
            outputs = model(inputs).squeeze()
            loss = loss_fn(outputs, targets)
            loss.backward()
            optim.step()

def MAPE(models, X, y_true):
    for model in models:
        model.eval()
        
    with torch.no_grad():  # inference only, so skip building the autograd graph
        y_preds = [model(X).squeeze() for model in models]
    y_preds = [output_encoder.decode(y_pred) for y_pred in y_preds]

    stacked_preds = torch.stack(y_preds)
    y_pred = torch.mean(stacked_preds, dim=0)

    y_true = output_encoder.decode(y_true)
    return torch.mean(torch.abs((y_true - y_pred) / y_true))

def evaluate(cls, dataset):
    X_train, y_train, X_test, y_test = test_train_split(dataset)

    models = [cls() for _ in range(5)]
    for model in models:
        train_model(model, X_train, y_train)

    return MAPE(models, X_test, y_test)
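
To make the metric concrete, here is a stdlib-only sketch of what `MAPE` computes once the predictions are back on the sales scale: average the predictions across models, then take the mean absolute percentage error. The function name `ensemble_mape` and all numbers are made up for illustration.

```python
def ensemble_mape(predictions_per_model, y_true):
    """Average per-model predictions, then compute mean(|true - pred| / true)."""
    n_models = len(predictions_per_model)
    # zip(*...) groups the predictions that belong to the same sample.
    averaged = [sum(sample) / n_models for sample in zip(*predictions_per_model)]
    errors = [abs((t - p) / t) for t, p in zip(y_true, averaged)]
    return sum(errors) / len(errors)

# Two hypothetical models, two samples, already decoded to sales values.
predictions = [[90.0, 210.0], [110.0, 230.0]]
truth = [100.0, 200.0]

score = ensemble_mape(predictions, truth)
print(round(score, 3))  # 0.05
```

The averaged predictions are 100 and 220, so the per-sample errors are 0% and 10%, which gives a MAPE of 0.05.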

In [7]:
print(f"Shuffled OneHotNN: {evaluate(OneHotNN, shuffled_set):.3f}")
print(f"Shuffled EmbeddingNN: {evaluate(EmbeddingNN, shuffled_set):.3f}")
print(f"Temporal OneHotNN: {evaluate(OneHotNN, temporal_set):.3f}")
print(f"Temporal EmbeddingNN: {evaluate(EmbeddingNN, temporal_set):.3f}")

Shuffled OneHotNN: 0.079
Shuffled EmbeddingNN: 0.084
Temporal OneHotNN: 0.110
Temporal EmbeddingNN: 0.109


## Results

We get the following results. All numbers are `MAPE` scores, so lower is better.

|  | OneHotNN | EmbeddingNN |
| --- | --- | --- |
| Shuffled Data | 0.079 | 0.084 |
| Temporal Data | 0.110 | 0.109 |

For comparison, these are the paper's results:

|  | OneHotNN | EmbeddingNN |
| --- | --- | --- |
| Shuffled Data | 0.070 | 0.070 |
| Temporal Data | 0.101 | 0.093 |

Our results look good, though slightly worse than theirs: our MAPE is 0.009 to
0.016 higher in every setting. One possible cause is a difference in default
settings between Keras and PyTorch. There could also be a mistake in our
implementation, but we are satisfied with how close the numbers are.