# Using the Transformer Networks

This notebook will guide you through the usage of a provided efficient implementation of Transformer Networks, to experiment with hyper-parameters and to perform ablation studies. This notebook will let you master accomplishing experiments with Transformer Networks and analyising the outcomes. Complete the code snippets where request and provide your observations. Feel free to refer the paper [Under the Hood of Transformer Networks for Trajectory Forecasting](https://arxiv.org/abs/2203.11878).

 

# Initial setup

Only if you run from Google Colab run those 2 cells to sync with Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
%cd /content/drive/MyDrive/TF4AML/

Start with the import

In [2]:
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import scipy
import os
import time

from transformer import baselineUtils
from transformer import individual_TF
from transformer.batch import subsequent_mask
from transformer.noam_opt import NoamOpt

In [3]:
# Select GPU device for the training if available
if not torch.cuda.is_available():
    device=torch.device("cpu")
    print("Current device:", device)
else:
    device=torch.device("cuda")
    print("Current device:", device, "- Type:", torch.cuda.get_device_name(0))

Current device: cuda - Type: NVIDIA GeForce RTX 3090


# Training and Testing

## Data Loading (setup the dataset for train, validation and test)

The subdatasets are 5 (ETH, Hotel, Univ, Zara1 and Zara2) we will leave one of them out for testing and train on the other 4. 

I.e. choosing ```dataset_name = 'zara1'``` the training set is composed by ETH, Hotel, Univ and Zara2 and tested on Zara1.

Moreover you can train and validate on a portion of the dataset setting percentage of the data (default is 50).

------ 

Each sequence is composed by an observed part to train the Encoder and a part we are attempting to predict with the Decoder. 

Generally the standard setup plans to use the first 8 points for the observation and the following 12 for the prediction.

------ 

Each created sequence has the shape (20, 4), where: 
- $N_{obs}+N_{pred} = 8 + 12 = 20$;
- Positions + Speeds = ( $x_i,\ y_i,\ u_i,\ v_i$) = ( $x_i,\ y_i,\ x_{i+1}-x_{i},\ y_{i+1} - y_i$ )

You can easily switch input type from position to speed setting the corresponding variable.

Speeds $u_i, v_i$ are generally more robust input and allow to avoid problems with the reference system.

------ 

Note: that $(u_0, v_0) = (0,0)$ and if speed are used the observed sequence has temporal length of $N_{obs} - 1$.

In [9]:
# Arguments to setup the datasets
dataset_name = 'zara1'
framework = 'regr'
obs_num = 8
preds_num = 12

# We limit the number of samples to a fixed percentage for the sake of time
perc_data = 50

# With predefined function we create dataset according to arguments
train_dataset,_ = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=True, perc_data=perc_data, verbose=True)
val_dataset, _  = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=False, perc_data=perc_data, verbose=True)
test_dataset, _ = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=False, eval=True, verbose=True)

# We create some folders to save model checkpoints
if not os.path.isdir("save_folder"):
    os.mkdir("save_folder")
if not os.path.isdir("save_folder/"+framework):
    os.mkdir("save_folder/"+framework)

if not os.path.isdir("save_folder/"+framework+"/"+dataset_name):
    os.mkdir("save_folder/"+framework+"/"+dataset_name)

start loading dataset
validation set size -> 0
001 / 007 - loading crowds_zara03_train.txt
002 / 007 - loading students003_train.txt
003 / 007 - loading uni_examples_train.txt
004 / 007 - loading biwi_eth_train.txt
005 / 007 - loading crowds_zara02_train.txt
006 / 007 - loading biwi_hotel_train.txt
007 / 007 - loading students001_train.txt
start loading dataset
validation set size -> 0
001 / 007 - loading biwi_eth_val.txt
002 / 007 - loading crowds_zara02_val.txt
003 / 007 - loading uni_examples_val.txt
004 / 007 - loading students001_val.txt
005 / 007 - loading crowds_zara03_val.txt
006 / 007 - loading students003_val.txt
007 / 007 - loading biwi_hotel_val.txt
start loading dataset
validation set size -> 0
001 / 001 - loading crowds_zara01.txt


In [10]:
input_type = 'speed'

if input_type == 'speed':
    input_idx_1 = 2
    input_idx_2 = 4
    first_element = 1
    
elif input_type == 'position':
    input_idx_1 = 0
    input_idx_2 = 2
    first_element = 0

deleted from here ---> Mean and Standard Deviation are computed across the full training dataset in order to normalize each sequence. 

This is manily due to the fact each subdateset is taken in different locations and with different camera settings. With this normalization step we ensure to uniformate each sequence feeded into the model. <--- to here

Added by Fabio ---> 

We compute the mean and standard deviation of positions or speeds across the full training dataset and use those to normalize each entry in the sequence.
This normalization is beneficial prior to processing with neural networks.

In [11]:
# After concatenating each observed and target sequence we compute the mean and std
mean = torch.cat((train_dataset[:]['src'][:, first_element:, input_idx_1:input_idx_2], train_dataset[:]['trg'][:, :, input_idx_1:input_idx_2]), 1).mean((0,1))
std  = torch.cat((train_dataset[:]['src'][:, first_element:, input_idx_1:input_idx_2], train_dataset[:]['trg'][:, :, input_idx_1:input_idx_2]), 1).std((0,1))

Following we create a torch dataloader that create the batches for each epoch.

In [12]:
batch_size = 512

tr_dl   = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
val_dl  = torch.utils.data.DataLoader(val_dataset,   batch_size=batch_size, shuffle=True, num_workers=0)
test_dl = torch.utils.data.DataLoader(test_dataset,  batch_size=batch_size, shuffle=False, num_workers=0)

## Model instantiation

We create an instance of our transformer with the chosen configuration. 

Then we allocate it to the GPU for forward and backward accelerated computation.

In [13]:
# The input for the encoder are speeds (u,v) or positions (x,y)
enc_input_size = 2
# The input for the decoder are speeds (u,v) or positions (x,y) concatenated with mask array for start_of_sequence token [0, 0] 
# Corresponding to start of sequence token the mask is 1 for the other speed input the mask is 0
dec_input_size = 3
# The output of the decoder are predicted speeds and corresponding mask that should be all zero (a loss for that is dedicated)
dec_output_size = 3

emb_size = 512
ff_size = 1024
heads = 8
layers = 6
dropout = 0.1

model = individual_TF.IndividualTF(enc_input_size, dec_input_size, dec_output_size, N=layers, d_model=emb_size, d_ff=ff_size, h=heads, dropout=dropout).to(device)

## Training and Validation Step

Here we create two classes that define the single iteration function for train and validation.

In [14]:
def train_step(model, batch, mean, std, device):

    # If input type is speed then input (or source 'src') has shape (B, N_obs-1, 2) because the first one is (0,0).
    # Otherwise, if input type is position then input  has shape (B, N_obs, 2).
    # Note that the input of the decoder are only the first  N_pred-1  GT future value then target ('trg') has shape (B, N_pred-1, 2).
    inp    = (batch['src'][:,  first_element:, input_idx_1:input_idx_2].to(device) - mean.to(device)) / std.to(device)
    target = (batch['trg'][:, :-1, input_idx_1:input_idx_2].to(device) - mean.to(device)) / std.to(device)

    # We create a third mask channel to append to the 2 speeds. 
    # This helps the decoder differentiating between start of sequence token (with mask token 1) and target speeds (with mask token 0)
    # Summarizing: start_of_seq token is (0,0) and the mask token is 1 ---> [0, 0, 1]
    #              target inputs are (u_i, v_i) and the mask token is 0 ---> [u_i, v_i, 0]
    start_of_seq = torch.Tensor([0, 0, 1]).unsqueeze(0).unsqueeze(1).repeat(target.shape[0], 1, 1).to(device)
    target_c = torch.zeros((target.shape[0], target.shape[1], 1)).to(device)
    target = torch.cat((target, target_c), -1)
    # Final decoder input is the concatenation of them along temporal dimension
    dec_inp = torch.cat((start_of_seq, target), 1)

    # Source attention is enabled between all the observed input (mask elements are setted to 1)
    src_att = torch.ones((inp.shape[0], 1, inp.shape[1])).to(device)
    # For the target attention we mask future elements to prevent model cheating (corresponding future mask elements are setted to False)
    # The mask is changed dinamically to use teacher forcing learning
    trg_att = subsequent_mask(dec_inp.shape[1]).repeat(dec_inp.shape[0], 1, 1).to(device)
    # Source, target and corresponding attention mask are passed to the model for the forward step
    pred = model(inp, dec_inp, src_att, trg_att)

    return pred


def eval_step(model, batch, mean, std, device, preds=12):

    # In the evaluation step we don't provide target to the decoder but we autoregressively input each prediction for the following one.
    inp = (batch['src'][:, first_element:, input_idx_1:input_idx_2].to(device) - mean.to(device)) / std.to(device)

    # The decoder input is the only start of sequence token [0, 0, 1]
    # Please note that now model has to predict also the third channel mask (See loss2 in the main loop)
    src_att = torch.ones((inp.shape[0], 1, inp.shape[1])).to(device)
    start_of_seq = torch.Tensor([0, 0, 1]).unsqueeze(0).unsqueeze(1).repeat(inp.shape[0], 1, 1).to(device)
    dec_inp = start_of_seq

    # We predict just one future speed and we append it to the decoder input for the next iteration (auto-regression)
    # At each step the target mask should be adapted
    for i in range(preds):
        trg_att = subsequent_mask(dec_inp.shape[1]).repeat(dec_inp.shape[0], 1, 1).to(device)
        out = model(inp, dec_inp, src_att, trg_att)
        dec_inp = torch.cat((dec_inp, out[:, -1:, :]), 1)

    # Note at the each iteration of the loop we re-append the start of seq token, so after the last iteration we need to remove it
    return dec_inp[:, 1:, :]

## Optimizer

Here we select the **optimizer** proposed in the original Transformer Networks paper of Vaswani et al.

It uses some initial warmup epochs, where the learning rate is increased. Then it slowly decreases according to a number of epoch and the chosen embedding size. The resulting formula is:

LR = $\frac{F}{\sqrt{D}} min( \frac{1}{\sqrt{epoch}},\ epoch \cdot W^{-\frac{3}{2}}) $

where F is a scaling factor, D is the model embedding size, W is the number of warmup epochs.

In [None]:
# Argument for the optimizer 
factor = 1.
warmup = 5

optim = NoamOpt(emb_size, factor, len(tr_dl)*warmup, torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

## Main 


Then we can train, validate and test our transformer epoch by epoch.

-------

The **losses** used are 2:

1.   $L_2$-loss distance between predicted $(\hat{\textbf{u}}, \hat{\textbf{v}})$ and GT $(\textbf{u}, \textbf{v})$ target speeds;
2.   $L_1$-loss for the target token mask. Note these should be all zero, so the loss is simply the mean.

-------

Moreover, the **metrics** used to validate the model goodness at Validation and Test time are the following:

1.   Mean Average Displacement (MAD): $L_2$-distance between *all* the $N_pred$ GT and predicted future ***positions***;
2.   Final Average Displacement (FAD): $L_2$-distance between the *last* GT and predicted future ***positions***;

-------

Note: If you restart the training for any reason, remember to instanciate again model and optimizer in order to reset them.

-------

In [16]:
# compute execution time of the cell
start_time = time.time()

# Argument for the training 
epoch = 0
max_epoch = 40          # Total number of epoch
ckp_save_step = 20      # Frequency for saving the model
log_step = 5           # Frequency for printing the loss


print("Start Training...\n")


for epoch in range(max_epoch):

    if (epoch+1) % log_step == 0:
        print("---> Epoch %03i/%03i <---  LR: %7.5f" % ((epoch+1), max_epoch, optim._rate))

    ###### TRAIN ######
    model.train()

    train_loss=0
    gt_posit = []
    pr_posit = []
    
    for id_b, batch in enumerate(tr_dl):

        # All the gradients are resetted to zero before the training step
        optim.optimizer.zero_grad()
        
        # We predict target speeds and we save the corresponing GTs
        pred_speed = train_step(model, batch, mean, std, device)
        gt_speed = (batch['trg'][:, :, input_idx_1:input_idx_2].to(device) - mean.to(device)) / std.to(device)

        # We compute the two losses, averaging on the batch
        loss1 = F.pairwise_distance(pred_speed[:, :, :2].contiguous().view(-1, 2), gt_speed.contiguous().view(-1, 2).to(device)).mean()
        loss2 = torch.abs(pred_speed[:, :, 2]).mean()
        loss = loss1 + loss2

        # We accumulate and visualize the loss at the end of the epoch. 
        # Note that here the loss is the mean on the batch but in the end we want the mean across the whole dataset.
        train_loss += loss.item() * batch['trg'].shape[0]

        loss.backward()
        optim.step()

        if input_type == 'speed':
            # If input type is speed, to compute MAD and FAD metrics we need to compute back positions from predicted speeds.
            # This is done easily adding cumulative (and denormalized) speeds to the last observed position. If last position in the input is (x_7, y_7) then:
            # (x_8, y_8)    =   (x_7, y_7) + (u_7, v_7)
            # (x_9, y_9)    =   (x_8, y_8) + (u_8, v_8)   =   (x_7, y_7) + (u_7, v_7) + (u_8, v_8)
            # (x_10, y_10)  =   (x_9, y_9) + (u_9, v_9)   =   (x_7, y_7) + (u_7, v_7) + (u_8, v_8) + (u_9, v_9)
            # We have always the last observed position (x_7, y_7) and we add progressively the "cumulated" speeds
            preds_tr_b = batch['src'][:, -1:, 0:2].cpu().numpy() + (pred_speed[:, :, 0:2].detach() * std.to(device) + mean.to(device)).cpu().numpy().cumsum(1)

        elif input_type == 'position':
            # If input type is position, we simply append the output
            preds_tr_b = (pred_speed[:, :, 0:2].detach() * std.to(device) + mean.to(device)).cpu().numpy()
        
        # We store both predicted and GT positions
        pr_posit.append(preds_tr_b)
        gt_posit.append(batch['trg'][:, :, 0:2])
        

    # After concatenation we compute MAD and FAD metrics
    gt_posit = np.concatenate(gt_posit, 0)
    pr_posit = np.concatenate(pr_posit, 0)
    mad, fad, errs = baselineUtils.distance_metrics(gt_posit, pr_posit)

    if (epoch+1) % log_step == 0:
        print('Total Train Loss: %7.4f - MAD: %7.4f - FAD: %7.4f' % (train_loss/len(tr_dl), mad, fad))



    ###### VALIDATION ######
    # Here is all the same exept for eval_step and computation of MAD and FAD metrics
    with torch.no_grad():
        model.eval()

        val_loss = 0
        gt_posit = []
        pr_posit = []

        for id_b, batch in enumerate(val_dl):
            
            pred_speed = eval_step(model, batch, mean, std, device, preds=preds_num)
            gt_speed = (batch['trg'][:, :, input_idx_1:input_idx_2].to(device) - mean.to(device)) / std.to(device)

            loss1 = F.pairwise_distance(pred_speed[:, :, 0:2].contiguous().view(-1, 2), gt_speed.contiguous().view(-1, 2).to(device)).mean()
            loss2 = torch.abs(pred_speed[:, :, 2]).mean()
            loss = loss1 + loss2
            val_loss += loss.item() * batch['trg'].shape[0]

            if input_type == 'speed':
                preds_tr_b = batch['src'][:, -1:, 0:2].cpu().numpy() + (pred_speed[:, :, 0:2] * std.to(device) + mean.to(device)).cpu().numpy().cumsum(1)
            elif input_type == 'position':
                preds_tr_b = (pred_speed[:, :, 0:2] * std.to(device) + mean.to(device)).cpu().numpy()
            
            # We store both predicted and GT positions
            pr_posit.append(preds_tr_b)
            gt_posit.append(batch['trg'][:, :, 0:2])
            
        # After concatenation we compute MAD and FAD metrics
        gt_posit = np.concatenate(gt_posit, 0)
        pr_posit = np.concatenate(pr_posit, 0)
        mad, fad, errs = baselineUtils.distance_metrics(gt_posit, pr_posit)

        if (epoch+1) % log_step == 0:
            print('Total Eval  Loss: %7.4f - MAD: %7.4f - FAD: %7.4f' % (val_loss/len(val_dl), mad, fad))



    ###### TEST ######
    # The test is same as eval 
    with torch.no_grad():
        model.eval()

        test_loss = 0
        gt = []
        pr = []
        
        for id_b, batch in enumerate(test_dl):

            pred_speed = eval_step(model, batch, mean, std, device, preds=preds_num)
            gt_speed = (batch['trg'][:, :, input_idx_1:input_idx_2].to(device) - mean.to(device)) / std.to(device)

            loss1 = F.pairwise_distance(pred_speed[:, :, 0:2].contiguous().view(-1, 2), gt_speed.contiguous().view(-1, 2).to(device)).mean()
            loss2 = torch.abs(pred_speed[:, :, 2]).mean()
            loss = loss1 + loss2
            test_loss += loss.item() * batch['trg'].shape[0]


            if input_type == 'speed':
                preds_tr_b = batch['src'][:, -1:, 0:2].cpu().numpy() + (pred_speed[:, :, 0:2] * std.to(device) + mean.to(device)).cpu().numpy().cumsum(1)
            elif input_type == 'position':
                preds_tr_b = (pred_speed[:, :, 0:2] * std.to(device) + mean.to(device)).cpu().numpy()

            pr.append(preds_tr_b)
            gt.append(batch['trg'][:, :, 0:2])


        gt = np.concatenate(gt, 0)
        pr = np.concatenate(pr, 0)
        mad, fad, errs = baselineUtils.distance_metrics(gt, pr)

        if (epoch+1) % log_step == 0:
            print('Total Test  Loss: %7.4f - MAD: %7.4f - FAD: %7.4f \n'% (test_loss/len(test_dl), mad, fad))

    # Here we save checkpoints to avoid repeated training
    if ((epoch+1) % (ckp_save_step) == 0):
        print("Saving checkpoint... \n ")
        torch.save(model.state_dict(), f'save_folder/{framework}/{dataset_name}/{(epoch+1):05d}.pth')



# print execution time
print("Total time: %s seconds" % (time.time() - start_time))

Start Training...

---> Epoch 010/100 <---  LR: 0.00222
Total Train Loss: 210.6400 - MAD:  0.2933 - FAD:  0.4105
Total Eval  Loss: 598.9922 - MAD:  1.0443 - FAD:  2.0041
Total Test  Loss: 695.5201 - MAD:  1.7321 - FAD:  2.8603 

---> Epoch 020/100 <---  LR: 0.00179
Total Train Loss: 119.3469 - MAD:  0.1324 - FAD:  0.1960
Total Eval  Loss: 255.3613 - MAD:  0.4306 - FAD:  0.9609
Total Test  Loss: 267.4828 - MAD:  0.4663 - FAD:  1.0341 

Saving checkpoint... 
 
---> Epoch 030/100 <---  LR: 0.00145
Total Train Loss: 116.3218 - MAD:  0.1264 - FAD:  0.1885
Total Eval  Loss: 312.5784 - MAD:  0.5698 - FAD:  1.3008
Total Test  Loss: 344.2546 - MAD:  0.6154 - FAD:  1.4784 

---> Epoch 040/100 <---  LR: 0.00125
Total Train Loss: 90.8324 - MAD:  0.1089 - FAD:  0.1587
Total Eval  Loss: 259.1963 - MAD:  0.4592 - FAD:  1.0382
Total Test  Loss: 283.3514 - MAD:  0.4906 - FAD:  1.1192 

Saving checkpoint... 
 


## Load a model


Here we leave a snippet of code to quickly load a model from a saved checkpoint. You can load model at specific epoch using this code before the main train-eval-test loop.

In [None]:
# Instanciate a new model and loading its parameters

# The input for the encoder are speeds (u,v) or positions (x,y)
enc_input_size = 2
# The input for the decoder are speeds (u,v) or positions (x,y) concatenated with mask array for start_of_sequence token [0, 0] 
# Corresponding to start of sequence token the mask is 1 for the other speed input the mask is 0
dec_input_size = 3
# The output of the decoder are predicted speeds and corresponding mask that should be all zero (a loss for that is dedicated)
dec_output_size = 3

emb_size = 512
ff_size = 2048
heads = 8
layers = 6
dropout = 0.1

model = individual_TF.IndividualTF(enc_input_size, dec_input_size, dec_output_size, N=layers, d_model=emb_size, d_ff=ff_size, h=heads, dropout=dropout).to(device)


# Loading arguments
epoch = 50
dataset_name = 'zara1'

path = f'save_folder/{dataset_name}/{(epoch):05d}.pth'
model.load_state_dict(torch.load(path))


# Setup correctly optimizer and its LR as well
factor = 1.
warmup = 10

optim = NoamOpt(emb_size, factor, len(tr_dl)*warmup, torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))
optim._step = epoch-1

## Visualization  (***2 POINTS***)

Here you can implement some function to create qualitative plots.

We recommend you the following:

1. Loss plot for Train, Eval and Test;
2. MAD plot for Train, Eval and Test;
3. FAD plot for Train, Eval and Test;
4. Trajectory positions (observed points, GT target points and predicted target points)

In [2]:
# Here your code 

## Report  (***4 POINTS***)

Here you can report comments and results for the experiments up to this point.

Perform experiments that improves the performances or that gives meaningfull insights.

I.e. what happens if we change model hyperparamenters? What if we change learning rate?

Please explain extensively the results and organize them clearly with tables, plots...

----

Here your report

# Ablation Studies


Here we ask you to change some settings in order to compare the benefit of some specific mechanism.

Please follow the instructions and create a small report for each point adding your comments supported by plots, tables with results or whatever you think is usefull.

Each extra study included to improve general performance or to draft a more complete analysis will be considered.

---

**Note:** to have a fair comparison we suggest to fix the setup (i.e. Regressive TF with speeds, obs=8, pred=12, ...) and change just the analysed module.

---

## 1. Substitute for the Prediction Framework  (***6 POINTS***)

---

The standard task is the regression of future speeds/positions. 

We propose to implement to different frameworks: Gaussian and Quantized.

---

### a.  Gaussian


Predicting normal distribution parameters mean vector $\mu = (\mu_x, \mu_y)$ and covariance matrix $\Sigma = \biggl( \begin{smallmatrix}\sigma_x^2 & \rho \sigma_x \sigma_y\\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{smallmatrix}\biggr)$ of future predicition. 

Then the model output dimension is 5: 2 for mean parameters $\mu_x, \mu_y$ and 3 for the covariance parameters $\sigma_x, \sigma_y, \rho$.

---

Note: consider carefully the following code snippet. In this way we force $\sigma_x, \sigma_y$ to be positive and $\rho$ to be in $[-1, 1]$

The following lines are meant to be a hint. Integrate those into the code of the previous cells.

In [None]:
# Arguments to setup the datasets
dataset_name = 'zara1'
framework = 'gauss'
obs_num = 8
preds_num = 12

# We limit the number of samples to a fixed percentage for the sake of time
perc_data = 50

# With predefined function we create dataset according to arguments
train_dataset,_ = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=True, perc_data=perc_data, verbose=True)
val_dataset, _  = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=False, perc_data=perc_data, verbose=True)
test_dataset, _ = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=False, eval=True, verbose=True)

# We create some folders to save model checkpoints
if not os.path.isdir("save_folder"):
    os.mkdir("save_folder")
if not os.path.isdir("save_folder/"+framework):
    os.mkdir("save_folder/"+framework)
if not os.path.isdir("save_folder/"+framework+"/"+dataset_name):
    os.mkdir("save_folder/"+framework+"/"+dataset_name)

In [None]:
model = individual_TF.IndividualTF(enc_input_size, dec_input_size, dec_output_size, N=layers, d_model=emb_size, d_ff=ff_size, h=heads, dropout=dropout).to(device)

output = model(inp, dec_inp, src_att, trg_att)

mux = output[:, :, 0].unsqueeze(2)
muy = output[:, :, 1].unsqueeze(2)
sx = torch.exp(output[:, :, 2]).unsqueeze(2)
sy = torch.exp(output[:, :, 3]).unsqueeze(2)
corr = torch.tanh(output[:, :, 4]).unsqueeze(2)

mean = torch.cat((mux, muy), dim=2).to(device)
cov = torch.cat((sx**2, corr*sx*sy, corr*sx*sy, sy**2), dim=2).view((-1, sx.size(1), 2, 2)).to(device)

Next prediction can be now sampled from the predicted distribution making the forecasting stochastic.

The loss used in this case is the NLL.  

Note: To relax the assumption you can also use predicted mean as input for following step (particularly in eval and test), avoiding the sampling and assuming identity as covariance matrix.

In [None]:
# Here your code 

Here your report

---

### b. Quantized

Transformer was originally introduced in the NLP i.e. for next word classification task.
    
To emulate this case we change dataset (clustering all possible speed in C classes) and model to classify the most likely one (with CE loss).

Here we provide a script for the quantized dataset, so you may adapt the final part of the model to output probability score for each class (output_size=1000 + softmax) followed by CE loss.

---

Note: In the quantized framework the start of sequence token is adapted: 

The class indices spans from 0 to 999, so we add index 1000 to represent the start of sequence token

The following lines are meant to be hint, wisely integrate them with the code in the previous cells.  

In [None]:
# Arguments to setup the datasets
dataset_name = 'zara1'
framework = 'quant'
obs_num = 8
preds_num = 12

# We limit the number of samples to a fixed percentage for the sake of time
perc_data = 50

# With predefined function we create dataset according to arguments
train_dataset,_ = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=True, perc_data=perc_data, verbose=True)
val_dataset, _  = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=False, perc_data=perc_data, verbose=True)
test_dataset, _ = baselineUtils.create_dataset('datasets', dataset_name, 0, obs_num, preds_num, delim='\t', train=False, eval=True, verbose=True)

# Load precomputed clusters to quantize the data
mat = scipy.io.loadmat(os.path.join('datasets', dataset_name, "clusters.mat"))
clusters=mat['centroids']
num_classes = clusters.shape[0]

# We create some folders to save model checkpoints
if not os.path.isdir("save_folder"):
    os.mkdir("save_folder")
if not os.path.isdir("save_folder/"+framework):
    os.mkdir("save_folder/"+framework)
if not os.path.isdir("save_folder/"+framework+"/"+dataset_name):
    os.mkdir("save_folder/"+framework+"/"+dataset_name)

In [None]:
# The input for the encoder are speeds (u,v) or positions (x,y)
enc_input_size = num_classes
# The input for the decoder are speeds (u,v) or positions (x,y) concatenated with mask array for start_of_sequence token [0, 0] 
# Corresponding to start of sequence token the mask is 1 for the other speed input the mask is 0
dec_input_size = num_classes+1
# The output of the decoder are predicted speeds and corresponding mask that should be all zero (a loss for that is dedicated)
dec_output_size = num_classes

emb_size = 512
ff_size = 2048
heads = 8
layers = 6
dropout = 0.1


model = individual_TF.IndividualTF(enc_input_size, dec_input_size, dec_output_size, N=layers, d_model=emb_size, d_ff=ff_size, h=heads, dropout=dropout).to(device)


# Inside the train and eval step we need to convert speed/position to cluster index
batch_size = batch['src'].shape[0]

# Associate the nearest class to each speed/position
speeds_inp=batch['src'][:,1:,2:4]
inp=torch.tensor(scipy.spatial.distance.cdist(speeds_inp.reshape(-1,2), clusters).argmin(axis=1).reshape(batch_size, -1)).to(device)

speeds_trg = batch['trg'][:,:,2:4]
target = torch.tensor(scipy.spatial.distance.cdist(speeds_trg.reshape(-1, 2), clusters).argmin(axis=1).reshape(batch_size, -1)).to(device)




# Class are indices from 0 to 999. 
# We add index 1000 to represent the start of sequence token
start_of_seq = torch.tensor([1000]).repeat(batch_size).unsqueeze(1).to(device)



# We predict class indexes of future speeds/positions
output = model(inp, dec_inp, src_att, trg_att)

loss = F.cross_entropy(output.view(-1, num_classes), target.view(-1), reduction='mean')



# To compute metrics we need positions. Then we convert back each predicted index to the relative centroid speed/position values
preds_tr_b = batch['src'][:,-1:,0:2].cpu().numpy() + clusters[output.cpu().numpy()].cumsum(1)
pr.append(preds_tr_b)

In [None]:
# Here your code 

Here your report

---

## 2. Increase Prediction Horizon (Short- or Long-term Forecasting)  (***3 POINTS***)

---

You can easily increase/decrease the number of predictions (i.e. pred = 4, 8, 12, 20, 30, 50 ....) in the dataloader and see the effect on the MAD/FAD metric.

Report your results in a table and/or plot and comment what you see.

---

In [None]:
# Here your code 

Here your report

---

## 3. Increasing Data Number  (***3 POINTS***)

Transformers are generally very large network and need a lot of data to perform well.

Try to vary the percentage data variable (i.e. 10, 25, 50, 75, 100) and see how the performance changes.

Please report here plots and/or tables for:

1. MAD and FAD metrics 

2. Computational time

In [None]:
# Here your code 

Here your report

---

## 4. Change input Type  (***2 POINTS*** - Bonus)

---

What happens if we change the input form speed type (u,v) to position one (x,y)?

Report then some quantitative results and plot trajectory predicted with both method to evaluate qualitative differences.

---

In [None]:
# Here your code 

Here your report

## 5. Positional Encoding  (***3 POINTS*** - Bonus)

A number of positional encodings have been proposed. 

Implement the plain positional encoding [0,1,2,3,4,...] and report your comments and results.

Change the commented class we prepared in the positional_encoding.py file and copy the class here. 

In [None]:
# Here your code 

Here your report


---

This notebook was created by Luca Franco and Alessandro Flaborea.