# Paper Implementation

This notebook comprises the code for our reimplementation of the
[paper titled 'Entity Embeddings of Categorical Variables'](https://arxiv.org/pdf/1604.06737.pdf).
The original code for the paper is available [here](https://github.com/entron/entity-embedding-rossmann)
and is written using the Keras framework, whereas we used PyTorch. We referred to the
original code to implement the paper as closely as possible.

The paper's experiments cover neural network, random forest, gradient
boosted trees, and KNN models, each with and without entity embeddings. We have implemented only
the neural networks here as they are the core of the paper.

Our results are close to those reported in the paper, though not identical (see the Results section), which gives us reasonable confidence that our implementation is correct.

## Data Preprocessing

We use the `pandas` library for data preprocessing, structuring the dataset according to the variables listed in TABLE 1 of the paper. The steps are as follows.

### Steps

1. **Remove Irrelevant Data:**
   Read only the columns we need (`Store`, `DayOfWeek`, `Date`, `Sales`, `Promo`).

2. **Handle Zero Sales:**
   Exclude rows where `Sales` is zero, which correspond to days when the store was closed.

3. **Temporal Transformation:**
   Split the `Date` column into separate `Year`, `Month`, and `Day` columns so that each can be treated as a categorical variable.

4. **Include State Information:**
   Merge in the state of each store from `store_states.csv`, which adds geographical context. This step is also done in the paper.

5. **Encode Categorical Variables:**
   Label-encode every column except `Sales`, mapping each category to an integer index starting at 0.

6. **Reorder Columns:**
   Arrange the columns in a fixed order with `Sales` last. The models read features by column position, so this order must match the `parameters` dictionary defined later.
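Step 5 matters because the models later use each value as an index into an embedding table or a one-hot vector, so every column must hold integers from 0 to n-1. The sketch below mimics that mapping in plain Python (sorted distinct values become consecutive indices). The `label_encode` helper and the sample state codes are hypothetical and only illustrate the idea.

```python
def label_encode(values):
    """Map each distinct value to an integer index, in sorted order."""
    classes = sorted(set(values))
    index_of = {value: idx for idx, value in enumerate(classes)}
    return [index_of[v] for v in values], classes

# Hypothetical sample of a categorical column (illustrative values only).
states = ["HE", "BY", "NW", "BY", "HE"]
codes, classes = label_encode(states)
print(codes)    # [1, 0, 2, 0, 1]
print(classes)  # ['BY', 'HE', 'NW']
```

With this kind of encoding, a column with 12 distinct values always yields indices 0 through 11, which is what an embedding table with 12 rows expects.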



In [8]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# 1. We read relevant columns from the CSV file
relevant_columns = ["Store", "DayOfWeek", "Date", "Sales", "Promo"]
dataset = pd.read_csv("rossmann-store-sales/train.csv", usecols=relevant_columns)

# 2. We filter out rows where Sales are zero (i.e. the store was closed)
dataset = dataset[dataset["Sales"] != 0]

# 3. Extract Year, Month, and Day from the Date column. Then, drop the Date column. This gives us more categorical columns
dataset[["Year", "Month", "Day"]] = dataset["Date"].str.split("-", expand=True)
dataset.drop(columns=["Date"], inplace=True)

# 4. Merge dataset with store states information. 
state_df = pd.read_csv("rossmann-store-sales/store_states.csv")
dataset = pd.merge(dataset, state_df, how="left", on="Store")
del state_df  # We delete large variables to save memory

# 5. Apply label encoding to categorical columns (excluding Sales). This just assigns a number to each category
label_encoder = LabelEncoder()
for col in dataset.columns.difference(["Sales"]):
    dataset[col] = label_encoder.fit_transform(dataset[col])

# 6. Lastly, we select final relevant columns for the dataset
dataset = dataset[["Store", "DayOfWeek", "Day", "Month", "Year", "Promo", "State", "Sales"]]


## Creating Tensors

We prepare the data for modeling in three steps: building two orderings of the dataset, log-scaling the target, and splitting into training and test sets.

**Shuffled Set and Temporal Set**

We create two versions of our dataset to address distinct aspects of model evaluation:

- **Shuffled Set:**
  - Rows are randomly permuted before splitting, so the test set is drawn from the same time period as the training set. This is suited to benchmarking statistical prediction accuracy.

- **Temporal Set:**
  - Rows are kept in chronological order, so the test set contains only the most recent data. This assesses how well a model generalizes to the future.
  - The original CSV is in reverse chronological order, so we reverse it to obtain this ordering.

**Log Transformation for Scaling**

Sales values span a wide range, so we scale the target as `log(Sales) / log(max Sales)`. Zero-sales rows were removed earlier, so the logarithm is always defined, and the scaled target is at most 1, which matches the range of the sigmoid output layer of our models. The `OutputEncoder` class implements both the forward transformation (`encode`) and its inverse (`decode`).
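As a quick sanity check of the scaling, the following plain-Python sketch applies the same formula to scalars and confirms that decoding undoes encoding. The maximum of 40,000 is a made-up value for illustration, not the maximum of the real dataset.

```python
import math

def encode(sales, max_sales):
    """Scale sales to log(sales) / log(max_sales)."""
    return math.log(sales) / math.log(max_sales)

def decode(encoded, max_sales):
    """Invert encode(): exp(encoded * log(max_sales))."""
    return math.exp(encoded * math.log(max_sales))

max_sales = 40_000.0  # hypothetical maximum, for illustration only

for sales in (1.0, 5_000.0, 40_000.0):
    scaled = encode(sales, max_sales)
    restored = decode(scaled, max_sales)
    print(f"{sales:>8.0f} -> {scaled:.3f} -> {restored:.1f}")
```

The smallest possible value (1) maps to 0 and the maximum maps to 1, so every target falls inside the interval a sigmoid can produce.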

**Train/Test Splits**

We implement a function that splits a dataset into training and test sets:
- The first 90% of rows form the training set and the last 10% form the test set.
- Following the paper, 200k rows are randomly sampled from the training portion, which keeps training fast.
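The difference between the two orderings only shows up once the split is applied. The sketch below uses a toy list of 100 "days" in place of the real rows (an assumption for illustration) and a hypothetical `split_90_10` helper that cuts at the 90% mark.

```python
import random

def split_90_10(rows):
    """First 90% of rows for training, last 10% for testing."""
    threshold = int(0.9 * len(rows))
    return rows[:threshold], rows[threshold:]

# Hypothetical data: row i was recorded on day i.
days = list(range(100))

# Temporal set: every test day comes after every training day.
train_t, test_t = split_90_10(days)
print(max(train_t) < min(test_t))  # True

# Shuffled set: test days are scattered among the training days.
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
shuffled = days[:]
rng.shuffle(shuffled)
train_s, test_s = split_90_10(shuffled)
print(len(train_s), len(test_s))  # 90 10

# Subsample the training portion (the notebook draws 200k rows; here 50).
train_sample = rng.sample(train_s, 50)
```

On the temporal set the model is always tested on days it has never seen anything close to, which is why scores on that set are expected to be worse.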


In [9]:
import torch

# Shuffled set
# Randomly shuffle the dataset to create the shuffled set.
shuffled_set = dataset.sample(frac=1)
shuffled_set = torch.tensor(shuffled_set.values, dtype=torch.float)

# Time sorted set
# Reverse the chronological order to create the temporal set as the dataset is in reverse chronological order
temporal_set = dataset.iloc[::-1].copy()
temporal_set = torch.tensor(temporal_set.values, dtype=torch.float)

# Clean up memory by deleting the original dataset.
del dataset


In [10]:
class OutputEncoder():
    def __init__(self, max_output):
        # Initialize the OutputEncoder with the maximum output value.
        self.max_output = max_output

    def encode(self, output):
        # Logarithmically encode the output to handle a large output range.
        with torch.no_grad():
            return torch.log(output) / torch.log(self.max_output)

    def decode(self, output):
        # Decode the output by taking the exponential.
        with torch.no_grad():
            return torch.exp(output * torch.log(self.max_output))

# Create an instance of OutputEncoder with the maximum value from temporal_set.
output_encoder = OutputEncoder(torch.max(temporal_set[:, -1]))

# Apply encoding to the last column (Sales) in both temporal_set and shuffled_set.
temporal_set[:, -1] = output_encoder.encode(temporal_set[:, -1])
shuffled_set[:, -1] = output_encoder.encode(shuffled_set[:, -1])


In [11]:
def test_train_split(dataset):
    # Define the split threshold for 90% training and 10% testing.
    split_threshold = int(0.9 * dataset.size(0))

    # Extract features (X) and target variable (y) for training and testing sets.
    # Features are cast to long (int64) because torch.nn.Embedding and
    # torch.nn.functional.one_hot expect integer category indices.
    X_train = dataset[:split_threshold, :-1].long()
    X_test = dataset[split_threshold:, :-1].long()

    y_train = dataset[:split_threshold, -1]
    y_test = dataset[split_threshold:, -1]

    # Randomly sampling 200,000 rows for training as done in paper
    train_indices = torch.randperm(X_train.size(0))[:200_000]

    # Return the sampled training data and full testing data.
    return X_train[train_indices], y_train[train_indices], X_test, y_test


## Neural Network Architectures

We implemented two distinct neural network models: `EmbeddingNN` and `OneHotNN`. Each model's architecture closely aligns with the specifications outlined in TABLE 1 of the paper, incorporating specific strategies for handling categorical variables.

**EmbeddingNN**

The `EmbeddingNN` model passes each categorical variable through its own embedding layer. The number of rows in each embedding table equals the number of unique values of the variable, and the embedding dimension is taken from TABLE 1 of the paper; both are stored in the `parameters` dictionary. The concatenated embeddings feed a standard feedforward network with two hidden layers (1000 and 500 units, ReLU activations) and a single sigmoid output.

**OneHotNN**

In contrast, the `OneHotNN` model one-hot encodes each categorical variable and concatenates the resulting vectors. It uses the same feedforward network and hyperparameters as `EmbeddingNN`.
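The two input strategies are closely related: looking up row `i` of an embedding table gives the same vector as multiplying the one-hot vector for `i` by that table. The sketch below shows this with plain Python lists; the 3x2 weight table and the helper functions are made up for illustration and are not part of the notebook's models.

```python
def one_hot(index, num_classes):
    """Return a one-hot row vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def vec_times_matrix(vector, matrix):
    """Return vector @ matrix for a row vector and a list-of-rows matrix."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]

# Hypothetical embedding table: 3 categories, 2 dimensions each.
weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

category = 2
lookup = weights[category]  # what an embedding layer does
via_one_hot = vec_times_matrix(one_hot(category, 3), weights)

print(lookup)       # [0.5, 0.6]
print(via_one_hot)  # [0.5, 0.6]
```

Seen this way, an embedding layer is a linear layer on top of one-hot input that is stored and evaluated as a lookup table, so it never has to materialize the wide one-hot vector.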

*Model Architecture*

Both models are trained with the Adam optimizer (learning rate 0.001), a batch size of 128, and 10 epochs, following the original paper. The categorical inputs are either converted to a one-hot representation or passed through embedding layers, depending on the model.

*Parameters Dictionary*

The `parameters` dictionary stores, for each variable, the number of unique values and the embedding dimension. We take these values from TABLE 1 of the paper. The order of its entries must match the column order of the dataset, because both models read features by column position.
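The dictionary also determines how wide the first layer of each network is. The sketch below copies the values used in this notebook and computes the resulting input sizes and first-layer parameter counts; the `linear_param_count` helper is a hypothetical convenience function, not part of the models.

```python
# {feature: (unique_values, embedding_dimension)}, as used in this notebook
parameters = {
    "store": (1115, 10),
    "day_of_week": (7, 6),
    "day": (31, 10),
    "month": (12, 6),
    "year": (3, 2),
    "promotion": (2, 1),
    "state": (12, 6),
}

def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

embedding_input = sum(dim for _, dim in parameters.values())
one_hot_input = sum(n for n, _ in parameters.values())
embedding_tables = sum(n * dim for n, dim in parameters.values())

print(embedding_input)                          # 41
print(one_hot_input)                            # 1182
print(embedding_tables)                         # 11654
print(linear_param_count(embedding_input, 1000))  # 42000
print(linear_param_count(one_hot_input, 1000))    # 1183000
```

Even after adding the embedding tables themselves, the embedding model has a far smaller first stage than the one-hot model.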



In [12]:
# Define the parameters dictionary, where each key represents a feature and the associated tuple
# contains the number of unique values and the embedding dimension for that feature.
# {parameter_name: (unique_values, embedding_dimension)}
parameters = {
    "store": (1115, 10),
    "day_of_week": (7, 6),
    "day": (31, 10),
    "month": (12, 6),
    "year": (3, 2),
    "promotion": (2, 1),
    "state": (12, 6)
}


class EmbeddingNN(torch.nn.Module):
    def __init__(self):
        super(EmbeddingNN, self).__init__()

        # Create a list of embedding layers based on the parameters
        emb_list = [torch.nn.Embedding(n, d) for n, d in parameters.values()]
        self.emb_layers = torch.nn.ModuleList(emb_list)

        # Calculate the total input size for the feedforward network. This is equal to the sum of embedding dimensions
        # (the loop variable is named `dim` to avoid shadowing the built-in `tuple`)
        input_size = sum(dim for _, dim in parameters.values())

        # Define the feedforward network with ReLU activations and a sigmoid output. This is done using the Sequential module
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(input_size, 1000),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 500),
            torch.nn.ReLU(),
            torch.nn.Linear(500, 1),
            torch.nn.Sigmoid()
        )

    def forward(self, X):
        # Concatenate the embeddings obtained from each categorical feature using list comprehension
        # Iterate over the embedding layers and apply them to corresponding columns in the input X
        # The resulting embeddings are concatenated along the specified dimension (dim=1) to form a single tensor
        embeddings = torch.cat([emb(X[:, i])
                               for i, emb in enumerate(self.emb_layers)], dim=1)

        return self.feed_forward(embeddings) # Pass the concatenated embeddings to the feedforward network


class OneHotNN(torch.nn.Module):
    def __init__(self):
        super(OneHotNN, self).__init__()

        # Calculate the total input size for the feedforward network using the number of unique values
        # (the loop variable is named `num_values` to avoid shadowing the built-in `tuple`)
        input_size = sum(num_values for num_values, _ in parameters.values())

        # Define the feedforward network with ReLU activations and a sigmoid output as in paper
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(input_size, 1000),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 500),
            torch.nn.ReLU(),
            torch.nn.Linear(500, 1),
            torch.nn.Sigmoid()
        )

    def forward(self, X):
        # Convert input features into one-hot encoding and concatenate them
        one_hot = torch.cat([torch.nn.functional.one_hot(X[:, i], num_emb).float()
                             for i, (num_emb, _) in enumerate(parameters.values())], dim=1)

        return self.feed_forward(one_hot)

## Training and Testing

To maintain consistency with the referenced paper, we adopt the same values for
learning rate, batch size, and the number of epochs. Models are trained with an L1 loss
on the log-scaled target and evaluated with the Mean Absolute Percentage Error (MAPE)
on the decoded sales values; we use MAPE because it is more robust to outliers. For each
model class we train five models and average their decoded predictions, which reduces
the influence of any single model's idiosyncrasies on the reported score.
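To make the evaluation procedure concrete, here is a minimal plain-Python sketch of MAPE and of averaging the predictions of several models. The sales figures and the two sets of "model" predictions are invented for illustration, and `ensemble_mean` is a hypothetical helper name.

```python
def mape(y_pred, y_true):
    """Mean absolute percentage error, as a fraction (0.05 means 5%)."""
    errors = [abs((t - p) / t) for p, t in zip(y_pred, y_true)]
    return sum(errors) / len(errors)

def ensemble_mean(predictions):
    """Average the predictions of several models, sample by sample."""
    return [sum(column) / len(column) for column in zip(*predictions)]

# Invented true sales and predictions from two hypothetical models.
y_true = [100.0, 200.0, 400.0]
model_preds = [
    [90.0, 220.0, 380.0],
    [110.0, 180.0, 440.0],
]

averaged = ensemble_mean(model_preds)
print(averaged)  # [100.0, 200.0, 410.0]

for preds in model_preds:
    print(f"single model MAPE: {mape(preds, y_true):.4f}")
print(f"ensemble MAPE:     {mape(averaged, y_true):.4f}")
```

In this contrived example the individual errors point in opposite directions and largely cancel. Real models give no such guarantee, but averaging tends to smooth out run-to-run variation.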

In [13]:
def train_model(model, X, y):
    # Define loss function and optimizer
    loss_fn = torch.nn.L1Loss()
    optim = torch.optim.Adam(model.parameters(), lr=0.001)

    epochs = 10
    batch_size = 128
    total_samples = len(X)

    # Set the model to training mode
    model.train()
    
    for _ in range(epochs):
        for i in range(0, total_samples, batch_size):
            # Get a batch of data
            inputs = X[i:i+batch_size]
            targets = y[i:i+batch_size]
            # Zero the gradients, forward pass, backward pass, and optimization step
            optim.zero_grad()
            outputs = model(inputs).squeeze() # Squeeze to remove extra dimension
            loss = loss_fn(outputs, targets) # Calculate loss
            loss.backward() # Calculate gradients
            optim.step() # Update the parameters

def MAPE(y_pred, y_true):
    return torch.mean(torch.abs((y_true - y_pred) / y_true)) 

def evaluate(cls, dataset):
    # Split the dataset into train and test sets
    X_train, y_train, X_test, y_test = test_train_split(dataset)

    # Create and train 5 models
    models = [cls() for _ in range(5)]
    for model in models:
        train_model(model, X_train, y_train)

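    # Make predictions and calculate ensemble prediction.
    # Gradients are not needed for inference, so autograd is disabled to save memory.
    y_preds = []
    with torch.no_grad():
        for model in models:
            model.eval()
            y_pred = model(X_test).squeeze()
            y_pred = output_encoder.decode(y_pred)
            y_preds.append(y_pred)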

    stacked_preds = torch.stack(y_preds) # Stack along the first dimension. This is done to average the predictions
    y_pred = torch.mean(stacked_preds, dim=0)

    # Decode the true values and calculate MAPE
    y_true = output_encoder.decode(y_test)
    return MAPE(y_pred, y_true)


In [14]:
# Calculate and print the Mean Absolute Percentage Error (MAPE) values for both OneHotNN and EmbeddingNN models on both datasets (shuffled and temporal)
print(f"Shuffled OneHotNN: {evaluate(OneHotNN, shuffled_set):.3f}")
print(f"Shuffled EmbeddingNN: {evaluate(EmbeddingNN, shuffled_set):.3f}")
print(f"Temporal OneHotNN: {evaluate(OneHotNN, temporal_set):.3f}")
print(f"Temporal EmbeddingNN: {evaluate(EmbeddingNN, temporal_set):.3f}")

Shuffled OneHotNN: 0.075
Shuffled EmbeddingNN: 0.087
Temporal OneHotNN: 0.103
Temporal EmbeddingNN: 0.105


## Results

We obtained the following results, with the numbers representing the `MAPE` scores.

|  | OneHotNN | EmbeddingNN |
| --- | --- | --- |
| Shuffled Data | 0.075 | 0.087 |
| Temporal Data | 0.103 | 0.105 |

### Comparison with Paper's Results

For comparison, the paper reports the following `MAPE` scores.

|  | OneHotNN | EmbeddingNN |
| --- | --- | --- |
| Shuffled Data | 0.070 | 0.070 |
| Temporal Data | 0.101 | 0.093 |
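To put the gap between the two tables in numbers, the short sketch below computes the absolute and relative difference for each cell. The scores are copied from the tables above; the dictionary layout is just one convenient way to hold them.

```python
# MAPE scores copied from the two tables above.
ours = {
    ("Shuffled", "OneHotNN"): 0.075,
    ("Shuffled", "EmbeddingNN"): 0.087,
    ("Temporal", "OneHotNN"): 0.103,
    ("Temporal", "EmbeddingNN"): 0.105,
}
paper = {
    ("Shuffled", "OneHotNN"): 0.070,
    ("Shuffled", "EmbeddingNN"): 0.070,
    ("Temporal", "OneHotNN"): 0.101,
    ("Temporal", "EmbeddingNN"): 0.093,
}

gaps = {}
for key, our_score in ours.items():
    absolute = our_score - paper[key]
    relative = absolute / paper[key]
    gaps[key] = (absolute, relative)
    data, model = key
    print(f"{data:<8} {model:<12} +{absolute:.3f} absolute, {relative:.1%} relative")
```

The relative gaps range from roughly 2% to roughly 24%, with the largest gap on `EmbeddingNN` with shuffled data.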

Our scores are in the same range as the paper's but do not match them exactly. The gap is smallest for `OneHotNN` (0.075 vs. 0.070 on shuffled data, 0.103 vs. 0.101 on temporal data) and larger for `EmbeddingNN` (0.087 vs. 0.070 and 0.105 vs. 0.093). Unlike in the paper, `EmbeddingNN` did not outperform `OneHotNN` in our runs. We attribute this discrepancy to possible differences between the PyTorch and Keras implementations, such as framework defaults, or to issues in our PyTorch setup.
