# Embedding Bag Test
---

Experimenting with applying an embedding bag (an embedding layer followed by an average of the resulting embedding vectors) to the categorical features of a time series dataframe.

## Import the necessary packages

In [1]:
import dask.dataframe as dd                # Dask to handle big data in dataframes
import pandas as pd                        # Pandas to load the data initially
from dask.distributed import Client        # Dask scheduler
import torch                               # PyTorch for tensor and deep learning operations
import data_utils as du                    # Data science and machine learning relevant methods

## Initialize variables

Data that we'll be using:

In [2]:
data_df = pd.DataFrame([[103, 0, 'dog'], 
                        [103, 0, 'cat'],
                        [103, 1, 'horse'],
                        [104, 0, 'bunny'],
                        [105, 0, 'horse'],
                        [105, 0, 'dog'],
                        [105, 0, 'cat'],
                        [105, 0, 'bunny'],
                        [105, 1, 'bunny'],
                        [105, 1, 'dog'],
                        [105, 1, 'horse']], columns=['id', 'ts', 'Var0'])
# Only use the line of code below if you want to test on Dask
# data_df = dd.from_pandas(data_df, npartitions=2)
# With Pandas, display the dataframe directly; with Dask, comment this line and uncomment the next one
data_df
# data_df.compute()

Embedding matrix used in the embedding layer:

In [3]:
embed_mtx = torch.FloatTensor([[0, 0, 0],
                               [-1, 0, 1],
                               [0, 1, -1],
                               [1, 1, 0],
                               [1, -1, 1]])
embed_mtx

Simple embedding layer:

In [4]:
simple_embed_layer = torch.nn.Embedding.from_pretrained(embed_mtx)
simple_embed_layer

Embedding bag, which looks up the embedding vectors and averages them per bag (`EmbeddingBag` uses `mode='mean'` by default):

In [5]:
bag_embed_layer = torch.nn.EmbeddingBag.from_pretrained(embed_mtx)
bag_embed_layer
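
To make the averaging step concrete, here is a minimal plain-Python sketch of what a mean-mode embedding bag computes. It is a reference illustration, not PyTorch's implementation: the function name `mean_embedding_bag` is hypothetical, and it assumes the common convention where a flat list of indices is split into bags by a list of start offsets.

```python
# Same values as embed_mtx above, written as nested lists
EMBED_MTX = [[0, 0, 0],
             [-1, 0, 1],
             [0, 1, -1],
             [1, 1, 0],
             [1, -1, 1]]

def mean_embedding_bag(indices, offsets, embed_mtx=EMBED_MTX):
    """Average the embedding rows of each bag (hypothetical helper).

    `indices` is a flat list of row indices; `offsets[i]` is the position
    in `indices` where bag i starts. Each bag ends where the next begins.
    """
    bounds = list(offsets) + [len(indices)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [embed_mtx[i] for i in indices[start:end]]
        if not rows:
            # Assumption of this sketch: an empty bag maps to a zero vector
            pooled.append([0.0] * len(embed_mtx[0]))
            continue
        # Column-wise mean over the rows in the bag
        pooled.append([sum(col) / len(rows) for col in zip(*rows)])
    return pooled

# Two bags: the first holds rows 1 and 2, the second holds row 3
example = mean_embedding_bag([1, 2, 3], [0, 2])
print(example)
```

The first bag averages `[-1, 0, 1]` and `[0, 1, -1]`, while the second bag contains a single row and therefore returns it unchanged.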

## Enumerate categories

In [6]:
data_df.Var0, enum_dict = du.embedding.enum_categorical_feature(data_df, 'Var0')
# With Pandas, display the dataframe directly; with Dask, comment this line and uncomment the next one
data_df
# data_df.compute()

In [7]:
enum_dict
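
As a rough illustration of the enumeration step, the sketch below maps each distinct category to an integer code. The helper `enumerate_categories` is hypothetical and is not the `data_utils` implementation, whose ordering and handling of missing values may differ. It assumes that index 0 is reserved (for example for missing values or padding), which would match the all-zero first row of the embedding matrix.

```python
def enumerate_categories(values):
    """Map each distinct category to an integer code, starting at 1.

    Hypothetical helper: index 0 is left unused on the assumption that
    it stands for missing values or padding.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            # Codes are assigned in order of first appearance
            mapping[value] = len(mapping) + 1
    codes = [mapping[value] for value in values]
    return codes, mapping

codes, mapping = enumerate_categories(['dog', 'cat', 'horse', 'bunny', 'horse', 'dog'])
print(codes)
print(mapping)
```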

In [8]:
# With Pandas, convert the values directly; with Dask, comment this line and uncomment the next one
data = torch.tensor(data_df.values)
# data = torch.tensor(data_df.compute().values)
data

## Apply embedding layer

In [9]:
simple_embed_layer(data[:, 2])

In [10]:
# Replace the categorical column with its embedding vector, keeping `id` and `ts`
embedded = torch.cat((data[:, :2].float(), simple_embed_layer(data[:, 2])), dim=1)
embed_data_df = pd.DataFrame(embedded.numpy(), columns=['id', 'ts', 'E0', 'E1', 'E2'])
# Only use the line of code below if you want to test on Dask
# embed_data_df = dd.from_pandas(embed_data_df, npartitions=2)
# With Pandas, display the dataframe directly; with Dask, comment this line and uncomment the next one
embed_data_df
# embed_data_df.compute()

## Apply embedding bag

Join the category codes of rows that share the same `id` and `ts` into a single `;`-separated string:

In [11]:
data_df.Var0 = data_df.Var0.astype(str)
# With Pandas, display the column directly; with Dask, comment this line and uncomment the next one
data_df.Var0
# data_df.Var0.compute()

In [12]:
# Join the string-typed codes of each (id, ts) group with ';'
data_df = data_df.groupby(['id', 'ts']).Var0.apply(lambda x: ';'.join(x)).to_frame().reset_index()
# With Pandas, display the dataframe directly; with Dask, comment this line and uncomment the next one
data_df
# data_df.compute()

Apply the embedding bag:

In [13]:
Var0_embed, Var0_offset = du.embedding.prepare_embed_bag(data_df, 'Var0')
print(f'Var0_embed: {Var0_embed}')
print(f'Var0_offset: {Var0_offset}')
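
For intuition about the two values printed above, here is a minimal sketch of how `;`-joined code strings could be turned into a flat index list plus bag offsets. The helper `split_into_bags` is hypothetical; the actual `du.embedding.prepare_embed_bag` may produce a different layout (for instance, extra offsets or tensor outputs), so treat this only as an illustration of the general idea.

```python
def split_into_bags(joined_rows, sep=';'):
    """Flatten separator-joined code strings into indices and start offsets.

    Hypothetical helper: each input string is one bag; `offsets[i]` is the
    position in `indices` where bag i begins.
    """
    indices, offsets = [], []
    for row in joined_rows:
        # The current length of `indices` is where this bag starts
        offsets.append(len(indices))
        indices.extend(int(token) for token in row.split(sep))
    return indices, offsets

# Three bags: codes {1, 2}, {3} and {4}
indices, offsets = split_into_bags(['1;2', '3', '4'])
print(indices)
print(offsets)
```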

In [14]:
bag_embed_layer(Var0_embed, Var0_offset)[:-1]