# Preprocessing

We use `pandas` to help with preprocessing.
The goal is to have the variables in TABLE 1 of the paper and no more.
The steps we take:
1) Read in the dataset, only the columns that we will need.
1) Drop rows where `Sales` (our output variable) is zero.
1) Replace `Date` with `Day`, `Month`, and `Year`.
1) Merge in the `State` column from a separate sheet.
1) Encode all categorical columns into numeric types.
    These are all the columns except `Sales`.
1) Reorder the columns.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# 1.
relevant_columns = ["Store", "DayOfWeek", "Date", "Sales", "Promo"]
dataset = pd.read_csv("rossmann-store-sales/train.csv", usecols=relevant_columns)

# 2.
dataset = dataset[dataset["Sales"] != 0]

# 3.
dataset[["Year", "Month", "Day"]] = dataset["Date"].str.split("-", expand=True)
dataset.drop(columns=["Date"], inplace=True)

# 4.
state_df = pd.read_csv("rossmann-store-sales/store_states.csv")
dataset = pd.merge(dataset, state_df, how="left", on="Store")
del state_df # We delete large variables to save memory

# 5.
label_encoder = LabelEncoder()
for col in dataset.columns.difference(["Sales"]):
    dataset[col] = label_encoder.fit_transform(dataset[col])

# 6.
dataset = dataset[["Store", "DayOfWeek", "Day", "Month", "Year", "Promo", "State", "Sales"]]

# Creating Tensors

Mostly straighforward. We highlight the notable bits.

Firsly, the paper describes a transformation to make to the output variable so it is within the range of the sigmoid function. This normalization is done because the Sales column spans 4 orders of magnitudes, and hence doing so allows us to scale to the same range as
the neural network output
We create a `OutputEncoder` class to encapsulate this so it is easy to do this transformation in both directions.

Next we do the test/train split. The paper describes two ways to do it:
1) Preseving original temporal ordering 
    * Since, this way, the test data is of a future time whose probability distrubition has not yet been sampled by the model, it is a better predictor of the model's generalizabiltiy
2) Shuffling the data 
    * This is beneficial for benchmarking model's performance based on its statistical prediction accuracy

We implement both and state both results

In [2]:
import torch

X = torch.tensor(dataset.drop(columns=["Sales"]).values)
y = torch.tensor(dataset["Sales"].values, dtype=torch.float)
del dataset

In [3]:
class OutputEncoder():
    def __init__(self, max_output):
        self.max_output = max_output

    def encode(self, output):
        with torch.no_grad():
            return torch.log(output) / torch.log(self.max_output)

    def decode(self, output):
        with torch.no_grad():
            return torch.exp(output * torch.log(self.max_output))

output_encoder = OutputEncoder(torch.max(y))
y = output_encoder.encode(y)

In [4]:
# Temporal split (Already in correct order)
X_temporal = X.clone()
y_temporal = y.clone()

# Shuffled split
shuffled_indices = torch.randperm(X.size(0))

X_shuffled = X[shuffled_indices].clone()
y_shuffled = y[shuffled_indices].clone()

del X
del y

# Common split function
def test_train_split(X, y):
    split_threshold = int(0.9 * X.size(0))
    
    X_train = X[:split_threshold]
    X_test = X[split_threshold:]

    y_train = y[:split_threshold]
    y_test = y[split_threshold:]

    return X_train, y_train, X_test, y_test

# Creating Neural Networks

The architecture is defined well in the paper and we will follow it. The paper defines 2 ways to create the networks, with embeddings and with one-hot vectors. We again implement both ways.

* One-hot encoding NN
    * The NN with One-hot encodings creates one hot-encoded vectors for the inputs and feeds this into the model. It does not learn any intrinsic propertie/meanings for the features and only learns the output feature (Sales) distribuition based on the input features.
* Entity Embeddings NN
    * The NN with entity embeddings learns the embedding representation of each categorical feature. This means the model also learns the intrinsic properties/meanings of each feature along with the sales distribuition. This is beneficial when the model sees new data and is able to generalize to it better. It also means that the NN through entitiy emeddings restricts itself in a much smaller but meaningful parameter space. This reduces the chance that the network converges to local minimums far from the global minimum.

In [5]:
class EmbeddingNN(torch.nn.Module):
    def __init__(self):
        super(EmbeddingNN, self).__init__()
        # From TABLE 1. Each tuple is (unique_values, embedding_dimension)
        emb_dims = [(1115, 10), (7, 6), (31, 10), (12, 6), (3, 2), (2, 1), (12, 6)]

        self.embs = [torch.nn.Embedding(*args) for args in emb_dims]
        self.fc1 = torch.nn.Linear(sum(dim for _, dim in emb_dims), 1000)
        self.relu1 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(1000, 500)
        self.relu2 = torch.nn.ReLU()
        self.output = torch.nn.Linear(500, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, X):
        out = [emb(X[:, i].long()) for i, emb in enumerate(self.embs)] 
        out = torch.cat(out, dim=1)
        
        out = self.relu1(self.fc1(out)) 
        out = self.relu2(self.fc2(out))
        out = self.sigmoid(self.output(out))
        return out

class OneHotNN(torch.nn.Module):
    def __init__(self):
        super(OneHotNN, self).__init__()
        # Required to create the one-hot vectors
        self.one_hot_classes = [1115, 7, 31, 12, 3, 2, 12]

        self.fc1 = torch.nn.Linear(sum(self.one_hot_classes), 1000)
        self.relu1 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(1000, 500)
        self.relu2 = torch.nn.ReLU()
        self.output = torch.nn.Linear(500, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, X):
        out = [torch.nn.functional.one_hot(X[:, i], num_class).float()
                for i, num_class in enumerate(self.one_hot_classes)]
        out = torch.cat(out, dim=1)

        out = self.relu1(self.fc1(out))
        out = self.relu2(self.fc2(out))
        out = self.sigmoid(self.output(out))
        return out

# Training and Testing

We create some functions to reduce repetitive code when creating multiple models and training on different data.
Important things to note are that for predictions, 5 models are created and trained, then their predictions averaged, as mentioned in the paper.
The `MAPE` (mean absolute percent error) metric is used for scoring.

In [6]:
def train_model(model, X, y):
    loss_fn = torch.nn.MSELoss()
    optim = torch.optim.Adam(model.parameters(), lr=0.001)

    epochs = 10
    batch_size = 128
    total_samples = len(X)

    model.train()
    for _ in range(epochs):
        for i in range(0, total_samples, batch_size):
            inputs = X[i:i+batch_size]
            targets = y[i:i+batch_size]
            
            optim.zero_grad()
            outputs = model(inputs).squeeze()
            loss = loss_fn(outputs, targets)
            loss.backward()
            optim.step()

def MAPE(models, X, y_true):
    for model in models:
        model.eval()
        
    y_preds = [model(X).squeeze() for model in models]
    stacked_preds = torch.stack(y_preds)
    y_pred = torch.mean(stacked_preds, dim=0)

    y_pred = output_encoder.decode(y_pred)
    y_true = output_encoder.decode(y_true)

    return torch.mean(torch.abs((y_true - y_pred) / y_true))

def evaluate(cls, X, y):
    X_train, y_train, X_test, y_test = test_train_split(X, y)

    models = [cls() for _ in range(5)]
    for model in models:
        train_model(model, X_train, y_train)

    return MAPE(models, X_test, y_test)

In [7]:
print(f"Shuffled OneHotNN: {evaluate(OneHotNN, X_shuffled, y_shuffled):.3f}")
print(f"Shuffled EmbeddingNN: {evaluate(EmbeddingNN, X_shuffled, y_shuffled):.3f}")
print(f"Temporal OneHotNN: {evaluate(OneHotNN, X_temporal, y_temporal):.3f}")
print(f"Temporal EmbeddingNN: {evaluate(EmbeddingNN, X_temporal, y_temporal):.3f}")

Shuffled OneHotNN: 0.064
Shuffled EmbeddingNN: 0.073
Temporal OneHotNN: 0.122
Temporal EmbeddingNN: 0.131


# Results

We get the following results. Note that the numbers are `MAPE` score.

|  | OneHotNN | EmbeddingNN |
| --- | --- | --- |
| Shuffled Data | 0.064 | 0.073 |
| Temporal Data | 0.122 | 0.131 |

Compare with the paper's results

|  | OneHotNN | EmbeddingNN |
| --- | --- | --- |
| Shuffled Data | 0.070 | 0.070 |
| Temporal Data | 0.101 | 0.093 |

Our results are slightly worse than theirs. There might be a mistake in our implementation, but we could not find. Or it could just be random chance.