# Transformers for time-series processing

In this notebook, we continue working with the Jena Climate dataset [1] introduced in the Recurrent Neural Networks topic. We take two of its variables, temperature and pressure, sampled once per day, and build a model that forecasts the next reading. We start with a recap of the RNN (LSTM) model as a baseline and then design a basic transformer architecture suitable for time-series forecasting. Transformers have many hyperparameters that affect their performance, so the practical tasks focus on tuning those parameters and altering the architecture to improve the results.


__References__
1. Jena Climate Dataset, https://www.kaggle.com/datasets/mnassrib/jena-climate


In [None]:
# PyTorch Lightning
try:
    import pytorch_lightning as pl
    assert pl.__version__ == '1.9.4', "unexpected version"
except Exception:
    # Google Colab does not ship PyTorch Lightning by default (or ships a
    # different version), so install the pinned release if necessary.
    !pip install pytorch-lightning==1.9.4
    import pytorch_lightning as pl

In [None]:
import math
from time import time
from tqdm.auto import tqdm

import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.3f}'.format
from IPython.display import display

# Sklearn & stats tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse

# Neural Networks
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchmetrics

import pytorch_lightning as pl
from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.loggers.csv_logs import CSVLogger
# from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.callbacks.early_stopping import EarlyStopping


# Plotting
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('bmh')
mpl.rcParams['figure.figsize'] = 18, 8


# Use one GPU when available, otherwise fall back to the CPU
gpus = 1 if torch.cuda.is_available() else 0

## Data


### Load

In [None]:
!if [[ ! -f jena_climate_2009_2016.csv.zip ]] ; then wget -nc "https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip" ; fi

In [None]:
data = pd.read_csv("jena_climate_2009_2016.csv.zip")
data = data.iloc[::144] # keep every 144th row (readings come every 10 minutes, so one per day)

date_time_key = "Date Time"
# The timestamps are day-first (e.g. "01.01.2009 00:10:00"), so state the format explicitly
data[date_time_key] = pd.to_datetime(data[date_time_key], format="%d.%m.%Y %H:%M:%S")
data.set_index(date_time_key, inplace=True)
data.sort_index(inplace=True)
data.head()


### Choose two columns

The rest of the notebook uses the temperature and pressure columns: `T (degC)`, `p (mbar)`.

In [None]:
# Weather data
column_names = ["T (degC)", "p (mbar)"]
X = data[column_names].values
N = 7 * len(X) // 8  # approximately 1 year for testing

X_train, X_test = X[:N], X[N:]
print(X_train[0])


### Train/test split and preprocessing

In [None]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
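
The scaling step above fits the statistics on the training part only and reuses them for the test part. The following is a minimal NumPy sketch of the same idea, assuming per-column mean and population standard deviation as the scaling statistics; the `series` array is synthetic stand-in data, not the Jena readings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-column series (hypothetical stand-in for temperature and pressure)
series = rng.normal(loc=[10.0, 990.0], scale=[8.0, 8.0], size=(80, 2))
train, test = series[:70], series[70:]

# Statistics come from the training part only, so no test information leaks in
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0)

train_s = (train - mu) / sigma
test_s = (test - mu) / sigma   # reuse the training statistics
restored = test_s * sigma + mu  # inverse transform back to physical units

print(train_s.mean(axis=0).round(6), train_s.std(axis=0).round(6))
print(np.allclose(restored, test))
```

The test part is generally not exactly zero-mean and unit-variance after this transform, which is expected: it is measured against the training statistics.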

In [None]:
# plt.plot(scaler.inverse_transform(X_test))
plt.plot(X_test)
plt.title(f'Preprocessed (scaled) "{column_names}" (test sample)');

### Preparing to feed the data to PyTorch models

The following function translates a time series into a tuple of arrays:
- a series of sequences (windows) of the given length,
- target values shifted by the given number of steps from the end of the corresponding sequence.

Such a tuple is frequently used for building auto-regressive models, hence the function's name.

__Input X shape__: (n, p)<br>
__Output shapes__: (n - k, k, p), (n - k, p)

That is, it stacks each subsequence of `k` consecutive elements into a single matrix of feature vectors.
One can interpret the output as a `k`-dimensional encoding of each given measurement. It is sometimes called a _delayed-coordinate_ or _auto-regressive_ (AR) representation.
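
As a cross-check of the shapes above, here is a small sketch that builds the same kind of windows with NumPy's `sliding_window_view` instead of an explicit loop. The helper name `ar_windows` is hypothetical and covers only the one-step-ahead case (one target row right after each window).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ar_windows(X, k):
    """Return (windows, targets) for one-step-ahead forecasting.

    X: (n, p) -> windows: (n - k, k, p), targets: (n - k, p)
    """
    X = np.asarray(X)
    # sliding_window_view appends the window axis at the end: (n - k + 1, p, k)
    windows = sliding_window_view(X, window_shape=k, axis=0)
    # move the window axis next to the sample axis and drop the last window,
    # which has no following row to serve as its target
    windows = np.moveaxis(windows, -1, 1)[:-1]
    targets = X[k:]
    return windows, targets

toy = np.stack([np.arange(7), np.arange(7, 0, -1)], axis=1)  # shape (7, 2)
W, y = ar_windows(toy, k=3)
print(W.shape, y.shape)
print(W[0], "->", y[0])
```

The windows here are views into the original array, so nothing is copied until they are modified or converted to tensors.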

In [None]:
def AR_matrices(X, seq=1, tx=1, shift=1):
    # Builds auto-regressive (delayed-coordinate) matrices from X of shape (n, n_features).
    # X_AR[j]: the window X[i - seq:i] of `seq` consecutive rows, shape (seq, n_features)
    # Y_AR[j]: the `tx` rows starting (shift - 1) steps after the window ends,
    #          flattened to a vector of length tx * n_features
    # With tx=1 and shift=1 the shapes are (n - seq, seq, n_features) and (n - seq, n_features).
    X_AR = []
    Y_AR = []
    n_features = len(X[0])
    # stop early enough that every target slice holds exactly `tx` rows
    for i in range(seq, len(X) - shift - tx + 2):
        X_AR.append(X[i - seq:i])
        Y_AR.append(X[i + shift - 1:i + shift + tx - 1])

    return np.array(X_AR).reshape(-1, seq, n_features), np.array(Y_AR).reshape(-1, tx * n_features)


temp_X = np.concatenate([np.arange(7).reshape(-1, 1), np.arange(7, 0, -1).reshape(-1, 1)], axis=1)

temp_XAR, temp_YAR = AR_matrices(temp_X, seq=3, tx=1, shift=1)
print("Example: input shape:", temp_X.shape, '-> X:', temp_XAR.shape, "y:", temp_YAR.shape)
print(temp_X, '-> \nX:', temp_XAR, "\ny:", temp_YAR)
print("e.g.", temp_XAR[0], '->', temp_YAR[0])


We define a custom `torch.utils.data.Dataset` subclass that serves the source time series (`X` and `y`) to PyTorch models such as RNNs and Transformers as sequences of length `seq_len` with targets of length `target_len`.

In [None]:
class TimeseriesDataset(Dataset):   
    def __init__(self, X: np.ndarray, y: np.ndarray=None, step: int = 1,
                 target_len=1, reshape_features=None):
        self.X = torch.tensor(X).float()
        self.y = torch.tensor(y).float() if y is not None else None
        self.step = step
        self.target_len = target_len
        self.seq_len = self.X.shape[1]
        self.n_features = self.X.shape[-1]
        self.new_n_features = self.n_features if reshape_features is None else reshape_features

    def __len__(self):
        return len(self.X) - self.step - self.target_len + 2

    def __getitem__(self, index):
        x = self.X[index:index+self.step].reshape(-1, self.new_n_features)
        if self.y is not None:
            return (x,
                    self.y[index+self.step-1:index+self.step-1+self.target_len].flatten())  # TODO check if works with LSTM or squeeze it?
        else:
            return x

In [None]:
X_train_, y_train_ = AR_matrices(X_train[:100], seq=4, tx=1, shift=1)
print(X_train_[:5], y_train_[:5])
dataset = TimeseriesDataset(X_train_, y_train_, reshape_features=4)
X_0, y_0 = dataset[0]
print(X_0, y_0)

In [None]:
print(len(dataset), '\nDS[0]:', dataset[0], "\n", dataset[0][0].shape, dataset[0][1].shape)

`SequenceDataModule` inherits from `pl.LightningDataModule` and provides `Trainer` with properly constructed X batches of shape (`batch_size` x `seq_len` x `n_features`) and Y batches of shape (`batch_size` x `tgt_len * n_features`).

Details: https://pytorch-lightning.readthedocs.io/en/latest/extensions/datamodules.html

In [None]:
class SequenceDataModule(pl.LightningDataModule):
    def __init__(self, X_train, X_test, seq_len=1, batch_size=128, target_len=1,
                 num_workers=0, shift=1, reshape_features=None, train_size=0.8):
        super().__init__()
        self.seq_len = seq_len
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.tgt_len = target_len

        N = int(train_size * len(X_train))

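        # forward `shift` so the constructor argument actually takes effect
        self.X_train, self.y_train = AR_matrices(X_train[:N], seq=seq_len, tx=1, shift=shift)
        self.X_val, self.y_val = AR_matrices(X_train[N:], seq=seq_len, tx=1, shift=shift)
        self.X_test, self.y_test = AR_matrices(X_test, seq=seq_len, tx=1, shift=shift)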
        self.train_dataset = TimeseriesDataset(self.X_train, self.y_train, 
                                               target_len=target_len, reshape_features=reshape_features)
        self.val_dataset = TimeseriesDataset(self.X_val, self.y_val, 
                                             target_len=target_len, reshape_features=reshape_features)
        self.test_dataset = TimeseriesDataset(self.X_test, self.y_test,
                                              target_len=target_len, reshape_features=reshape_features)
        self.predict_dataset = TimeseriesDataset(self.X_test, reshape_features=reshape_features, 
                                                 target_len=target_len)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, 
                                  batch_size = self.batch_size, 
                                  shuffle = False,
                                  num_workers = self.num_workers)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, 
                                batch_size = self.batch_size, 
                                shuffle = False,
                                num_workers = self.num_workers)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, 
                                 batch_size = self.batch_size, 
                                 shuffle = False, 
                                 num_workers = self.num_workers)

    def predict_dataloader(self):
        return DataLoader(self.predict_dataset, 
                            batch_size = self.batch_size, 
                            shuffle = False, 
                            num_workers = self.num_workers)

In [None]:
# Aux training classes/functions

class LitProgressBar(pl.callbacks.TQDMProgressBar):
    def init_sanity_tqdm(self):
        return tqdm(disable=True)
    
    def init_validation_tqdm(self):
        return tqdm(disable=True)

    def get_metrics(self, trainer, model):
        # hide the version number
        items = super().get_metrics(trainer, model)
        items.pop("v_num", None)
        return items
    
def mat2vec(bT):
    t = bT.shape[1]
    bpad = np.pad(bT, ((0,t-1), (0,0)), mode='constant', constant_values=np.nan)
    bpadT = bpad.T.reshape(t,-1)
    concat_shift = np.concatenate([np.roll(bpadT[i], i).reshape(1,-1) for i in range(t)], axis=0)
    return np.nanmean(concat_shift, axis=0)

def plot_test_predict(y, y_hat, scaler=scaler):
    # Convert tensors to NumPy arrays and undo the scaling so errors are in physical units
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    if scaler:
        y = scaler.inverse_transform(y)
        y_hat = scaler.inverse_transform(y_hat)

    # One panel per target; squeeze=False keeps `axes` indexable even for a single target
    n_targets = y.shape[1]
    fig, axes = plt.subplots(1, n_targets, figsize=(18, 7), squeeze=False)
    for i, ax in enumerate(axes[0]):
        mae_i = np.mean(np.abs(y[:, i] - y_hat[:, i]))
        ax.plot(y[:, i], label='True', alpha=0.5)
        ax.plot(y_hat[:, i], label='Prediction', alpha=0.5)
        ax.set_title(f"Target {i+1}, mae: {mae_i:.2f}")
        ax.tick_params(labelsize=14)
        ax.legend(loc='best', fontsize=14)
        ax.grid(visible=True)

    print(f'MAE loss (sklearn), unscaled: {mae(y, y_hat):.2f}')
    plt.show()

def plot_training_metrics(logger):
    filename = f"{logger.log_dir}/metrics.csv"
    metrics = pd.read_csv(filename)
    train_loss = metrics[['train_loss', 'step', 'epoch']][~np.isnan(metrics['train_loss'])]
    val_loss = metrics[['val_loss', 'epoch']][~np.isnan(metrics['val_loss'])]

    fig, axes = plt.subplots(1, 2, figsize=(16, 5), dpi=100)
    axes[0].set_title('Train loss per epoch')
    axes[0].plot(train_loss['epoch'], train_loss['train_loss'])
    axes[1].set_title('Validation loss per epoch')
    axes[1].plot(val_loss['epoch'], val_loss['val_loss'], color='orange')
    plt.show(block = True)

    print('MSE:')
    print(f"\tTrain loss: {train_loss['train_loss'].iloc[-1]:.3f}")
    print(f"\tVal loss:   {val_loss['val_loss'].iloc[-1]:.3f}")

# pytorch model parameter statistics
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
# print(f'The model has {count_parameters(lstm_model):,} trainable parameters')

# LSTM model

We bring back the recurrent neural network from the previous topic as a baseline for the forecasting problem.

In [None]:
class LSTMRegressor(pl.LightningModule):
    '''
    Standard PyTorch Lightning module:
    https://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html
    '''
    def __init__(self, 
                 hidden_size=10, 
                 input_size=1,
                 seq_len=1, 
                 batch_size=16,
                 num_layers=1, 
                 dropout=0.1, 
                 learning_rate=1e-3,
                 criterion=nn.MSELoss(),  # an instance, since it is called as criterion(y_hat, y)
                 output_size=1):
        super(LSTMRegressor, self).__init__()
        self.hidden_size = hidden_size
        self.seq_len = seq_len
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.dropout = nn.Dropout(dropout)
        self.criterion = criterion
        self.learning_rate = learning_rate
        self.val_mae = torchmetrics.MeanAbsoluteError()
        self.test_mae = torchmetrics.MeanAbsoluteError()

        self.lstm = nn.LSTM(input_size=input_size, 
                            hidden_size=hidden_size,
                            num_layers=num_layers, # number of LSTM-layers.
                            dropout=dropout, 
                            batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        # lstm_out = (batch_size, seq_len, hidden_size)
        lstm_out, _ = self.lstm(x)
        dropout_out = self.dropout(lstm_out[:,-1])
        y_pred = self.linear(dropout_out)
        return y_pred

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        return loss

    def training_epoch_end(self, outputs):
        # outputs is a list of dicts
        mean_loss = torch.stack([l['loss'] for l in outputs]).mean()
        self.log('train_loss', mean_loss, prog_bar=True, on_step=False, on_epoch=True)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        self.val_mae.update(y_hat, y)
        return loss

    def validation_epoch_end(self, outputs):
        # outputs is a list of step losses
        self.log('val_loss', torch.mean(torch.stack(outputs)), prog_bar=True)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        self.test_mae.update(y_hat, y)
        return loss
    
    def test_epoch_end(self, outputs):
        # outputs is a list of step losses (tensors), so average them with torch
        self.log('test_loss', torch.stack(outputs).mean())

In [None]:
lstm_param = dict(
    seq_len = 14, # number of sequence elements
    input_size = 2, # number of features per sequence element
    target_len = 1, # number of sequence elements to predict
    output_size = 2, # number of features per output (target_len times input_size)
    hidden_size = 60,
    num_layers = 2,
    batch_size = 16,
    max_epochs = 100,
    dropout = 0.05,
    learning_rate = 1e-3,
    criterion = nn.MSELoss(),
)

In [None]:
lstm_dm = SequenceDataModule(
    X_train,
    X_test,
    seq_len = lstm_param['seq_len'],
    batch_size = lstm_param['batch_size'],
    target_len = lstm_param['target_len'])  

In [None]:
# check batches
dl = lstm_dm.train_dataloader()
for xb, yb in dl:
  break

print("X element:", xb[0], "\nY element:", yb[0], "\nlengths:", len(xb[0]), len(yb[0]), xb.shape, yb.shape, len(dl), len(X_train))

In [None]:
seed_everything(1)
lstm_model = LSTMRegressor(
    input_size = lstm_param['input_size'],
    hidden_size = lstm_param['hidden_size'],
    output_size = lstm_param['output_size'],
    seq_len = lstm_param['seq_len'],
    batch_size = lstm_param['batch_size'],
    criterion = lstm_param['criterion'],
    num_layers = lstm_param['num_layers'],
    dropout = lstm_param['dropout'],
    learning_rate = lstm_param['learning_rate'],
)

In [None]:
lstm_trainer = Trainer(
    max_epochs=lstm_param['max_epochs'],
    logger=CSVLogger('./logs', name='lstm', version='0'),
    gpus=gpus,
    log_every_n_steps=20,
    callbacks=[LitProgressBar(), EarlyStopping(monitor="val_loss", patience=20)],
)

## Training

In [None]:
lstm_trainer.fit(lstm_model, datamodule=lstm_dm);

In [None]:
plot_training_metrics(lstm_trainer.logger)

## Testing / Prediction

In [None]:
_, y = map(torch.cat, zip(*lstm_dm.test_dataloader()))
y_hat = torch.cat(lstm_trainer.predict(lstm_model, lstm_dm))
plot_test_predict(y, y_hat);

# Time series Transformer, basic implementation
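
The model in this section passes an additive causal mask to the encoder, built by `_generate_square_subsequent_mask`: entries are 0 where attention is allowed and `-inf` where a position would look into the future. The sketch below reproduces that mask in plain NumPy and applies it to uniform attention scores; the helper names `causal_mask` and `masked_softmax` are hypothetical and only illustrate the mechanism.

```python
import numpy as np

def causal_mask(sz):
    # 0 on and below the diagonal, -inf strictly above it
    mask = np.zeros((sz, sz))
    mask[np.triu_indices(sz, k=1)] = -np.inf
    return mask

def masked_softmax(scores, mask):
    # The mask is added to the raw scores before the softmax,
    # so masked positions receive exactly zero weight
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)
print(mask)
print(weights)
```

With equal scores, row `t` spreads its attention uniformly over positions `0..t` and puts nothing on later positions.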

In [None]:
class TSTransformerVanilla(pl.LightningModule):
    # d_model equals n_features here (no input projection), so n_features must be divisible by num_heads
    def __init__(self, n_features=1, num_layers=3, num_heads=7, dim_feedforward=2048,
                 dropout=0, learning_rate=1e-3, seq_len=8, output_size=1,
                 criterion=nn.MSELoss()):
        super(TSTransformerVanilla, self).__init__()

        self.encoder_layer = nn.TransformerEncoderLayer(d_model=n_features,
                                                        nhead=num_heads,
                                                        dim_feedforward=dim_feedforward,
                                                        dropout=dropout,
                                                        batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, 
                                                         num_layers=num_layers)        
        self.seq_len = seq_len
        self.decoder = self.create_decoder(n_features * seq_len, c_out=output_size, dropout=dropout)
        self.learning_rate = learning_rate
        self.mask = self._generate_square_subsequent_mask(self.seq_len)
        # self.mask = None
        self.init_weights()
        self.criterion = criterion
        self.val_mae = torchmetrics.MeanAbsoluteError()
        self.test_mae = torchmetrics.MeanAbsoluteError()
        self.encoder_dropout = nn.Dropout(dropout)

    def create_decoder(self, ndim, c_out=1, dropout=0):
        layers = [nn.Flatten()]
        if dropout:
            layers += [nn.Dropout(dropout)]
        layers += [nn.Linear(ndim, c_out)]
        return nn.Sequential(*layers)

    @torch.no_grad()
    def init_weights(self, initrange=1e-4):
        def _init(m):
            if isinstance(m, nn.Linear) or isinstance(m, nn.Conv2d):
                nn.init.xavier_uniform_(m.weight)
                if hasattr(m, 'bias') and m.bias is not None:
                    nn.init.normal_(m.bias, std=initrange)            
        self.apply(_init)

    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def forward(self, src):
        assert src.shape[1] == self.seq_len, ">> incorrect sequence len: got {} vs exp {}".format(src.shape[1], self.seq_len)
        src_dropout = self.encoder_dropout(src)
        output = self.transformer_encoder(src_dropout, mask=self.mask)
        output = self.decoder(output)
        return output

    def configure_optimizers(self):
        if self.mask is not None and self.mask.device != self.device:
            self.mask = self.mask.to(self.device)
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        # scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
        # scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.33, patience=10, verbose=True, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=1e-5, eps=1e-08)
        # return dict(optimizer=optimizer, 
        #             lr_scheduler=dict(scheduler=scheduler, monitor='val_loss', frequency=1, interval="epoch"))
        return optimizer

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        return loss

    def training_epoch_end(self, outputs):
        # outputs is a list of dicts
        mean_loss = torch.stack([l['loss'] for l in outputs]).mean()
        self.log('train_loss', mean_loss, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        self.val_mae.update(y_hat, y)
        return loss

    def validation_epoch_end(self, outputs):
        # outputs is a list of step losses
        self.log('val_loss', torch.mean(torch.stack(outputs)), prog_bar=True)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.criterion(y_hat, y)
        self.test_mae.update(y_hat, y)
        return loss
    
    def test_epoch_end(self, outputs):
        # outputs is a list of step losses (tensors), so average them with torch
        self.log('test_loss', torch.stack(outputs).mean())


In [None]:
tfr_param = dict(
                seq_len = 14, # number of elements in raw input sequence
                target_len = 1, # how many elements to predict
                output_size = 2, # number of features in output
                reshape_seq_len = 4, # new seq_len
                reshape_features = 7, # new n_features, must be divisible by num_heads
                dim_feedforward = 1024,
                num_heads = 7,
                num_layers = 2,
                max_epochs = 100,
                batch_size = 96,
                criterion = nn.MSELoss(),
                dropout = 0.05,
                learning_rate = 1e-3/2,
                )
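
The `reshape_seq_len` and `reshape_features` settings work because each raw window holds 14 x 2 = 28 values, which the dataset re-cuts into 4 rows of 7. A small NumPy sketch of what that row-major reshape does to one window follows; the labelling scheme (value = 10 * day + feature index) is only an illustration device.

```python
import numpy as np

seq_len, n_features = 14, 2
# Label every entry by its origin: 10 * day + feature index (0 = T, 1 = p)
window = np.array([[10 * t + f for f in range(n_features)]
                   for t in range(seq_len)])  # shape (14, 2)

# Row-major re-cut into 4 "tokens" of 7 values each
tokens = window.reshape(-1, 7)  # shape (4, 7)
print(tokens)
```

Each resulting row spans three and a half days, and a given column alternates between temperature and pressure from one row to the next, so the transformer's "features" here are not the original physical variables.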

In [None]:
tfr_dm = SequenceDataModule(
    X_train,
    X_test,
    seq_len = tfr_param['seq_len'],  # 14 raw time steps; the dataset reshapes each window to 4 x 7
    batch_size = tfr_param['batch_size'],
    target_len=tfr_param['target_len'],
    reshape_features=tfr_param['reshape_features'],
    shift=1,
    )

In [None]:
# check batches
dl = tfr_dm.train_dataloader()
for xb, yb in dl:
  break

# assert np.allclose(yb[0], X_train[tfr_dm.n_features + tfr_dm.seq_len -1])
print("X0 element:", xb[0], "\nY0 element:", yb[0], "\nX1:", xb[1], "\nlengths:", len(xb[0]), len(yb[0]), xb.shape, yb.shape, len(dl), len(X_train))

In [None]:
seed_everything(1)
tfr_model = TSTransformerVanilla(n_features=tfr_param['reshape_features'],
                                num_heads=tfr_param['num_heads'],
                                num_layers=tfr_param['num_layers'],
                                seq_len=tfr_param['reshape_seq_len'],
                                dim_feedforward=tfr_param['dim_feedforward'],
                                output_size=tfr_param['output_size'],
                                learning_rate=tfr_param['learning_rate'],
                                dropout=tfr_param['dropout'])

In [None]:
tfr_trainer = pl.Trainer(
    max_epochs=tfr_param['max_epochs'],
    gpus=gpus,
    # fast_dev_run=True,  # uncomment to check that the dataset has no serious bugs
    callbacks=[LitProgressBar(), EarlyStopping(monitor="val_loss", patience=15)],
    log_every_n_steps=20,
    logger=CSVLogger('logs', name='transformer', version='vanilla')
    )

## Training

In [None]:
tfr_trainer.fit(
    tfr_model,
    datamodule=tfr_dm);


In [None]:

plot_training_metrics(tfr_trainer.logger)

## Testing / Prediction

In [None]:
_, y = map(torch.cat, zip(*tfr_dm.test_dataloader()))
y_hat = torch.cat(tfr_trainer.predict(tfr_model, tfr_dm))
plot_test_predict(y, y_hat)

# Transformer + Positional Encoding

## Positional encoding (recap)

Recall the sinusoidal positional encoding:
$$\overrightarrow{p_{t}}^{(i)}=f(t)^{(i)}:= \begin{cases}\sin \left(\omega_{k} \cdot t\right), & \text { if } i=2 k \\ \cos \left(\omega_{k} \cdot t\right), & \text { if } i=2 k+1\end{cases}$$

where

$$\omega_{k}=\frac{1}{10000^{2 k / d}}, \quad k \in [0, d/2),$$

- $t$ is the index of an element in the sequence,
- $i \in [0, d)$ is the index of a feature,
- $d$ is the number of features per element (`d_model`).
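
One consequence of this formula is that the dot product of two encodings depends only on the distance between their positions, because $\sin a \sin b + \cos a \cos b = \cos(a - b)$. The sketch below checks this numerically in NumPy for the textbook form of the formula; it assumes an even `d` and the base 10000, and the helper name `sincos_encoding` is hypothetical.

```python
import numpy as np

def sincos_encoding(n_pos, d, base=10000.0):
    # Textbook variant: sine and cosine columns share the frequency omega_k
    assert d % 2 == 0, "this sketch assumes an even number of features"
    t = np.arange(n_pos)[:, None]        # positions, shape (n_pos, 1)
    k = np.arange(d // 2)[None, :]       # frequency index, shape (1, d/2)
    omega = 1.0 / base ** (2 * k / d)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(omega * t)      # even feature indices
    pe[:, 1::2] = np.cos(omega * t)      # odd feature indices
    return pe

pe_demo = sincos_encoding(n_pos=16, d=8)
gram = pe_demo @ pe_demo.T               # all pairwise dot products

# Entries on the same diagonal share the offset t - s and should coincide
offset_3 = np.diag(gram, k=3)
print(np.allclose(offset_3, offset_3[0]))
```

The main diagonal equals `d / 2`, since each sine/cosine pair contributes exactly 1 to the squared norm.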

In [None]:
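# Given these parameters, one can construct a tensor with this encoding.
# Note: `max_div` plays the role of 10000 in the formula, and the cosine columns use
# their own frequencies (odd feature indices), which also handles an odd d_model.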
seq_len = 4
d_model = 7
max_div = 2e2

pe = torch.zeros(seq_len, d_model)  # positional encoding
position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
div_term_sin = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(max_div) / d_model))
div_term_cos = torch.exp(torch.arange(1, d_model, 2).float() * (-math.log(max_div) / d_model))
pe[:, 0::2] = torch.sin(position * div_term_sin)
pe[:, 1::2] = torch.cos(position * div_term_cos)
print("Debug, div_term_sin", div_term_sin)

pe = pe - pe.mean()
pe = pe / pe.std()

In [None]:

plt.matshow(pe.numpy().T, origin='lower')
plt.colorbar()
plt.xlabel("sequence index")
plt.ylabel("feature index");

In [None]:
plt.matshow(pe.numpy().T, origin='lower')
plt.xlabel("sequence index")
plt.ylabel("feature index");

In [None]:
def Coord2dPosEncoding(q_len, d_model, exponential=False, normalize=True, eps=1e-3):
    x = .5 if exponential else 1
    # adjust the exponent x (at most 100 steps) until the encoding has near-zero mean
    for _ in range(100):
        cpe = 2 * (torch.linspace(0, 1, q_len).reshape(-1, 1) ** x) * (torch.linspace(0, 1, d_model).reshape(1, -1) ** x) - 1
        if abs(cpe.mean()) <= eps: break
        elif cpe.mean() > eps: x += .001
        else: x -= .001
    if normalize:
        cpe = cpe - cpe.mean()
        cpe = cpe / cpe.std()
        cpe = cpe / 10
    return cpe

cpe = Coord2dPosEncoding(seq_len, d_model, exponential=True, normalize=True)

In [None]:
plt.matshow(cpe.numpy().T, origin='lower')
plt.xlabel("sequence index")
plt.ylabel("feature index");

## Task 1 (Score: 2)

Implement the sin/cos encoding inside the `PositionalEncoding` function, following the logic above.

In [None]:
def PositionalEncoding(q_len, d_model, normalize=True, max_div=1e4): # TODO: define device
    pe = torch.zeros(q_len, d_model)
    # implement the sin/cos positional encoding and save it to the `pe` tensor
    
    # <YOUR CODE>
    raise NotImplementedError # <= remove this
    pe = ...
    # </YOUR CODE>

    if normalize:
        pe = pe - pe.mean()
        pe = pe / pe.std()
        pe = pe / 10
    return pe

In [None]:
class TSTransformerPositional(TSTransformerVanilla):
    # d_model : number of features
    def __init__(self, feature_size=1, num_layers=3, num_heads=7, dim_feedforward=2048,
                 dropout=0, learning_rate=1e-3, seq_len=8, output_size=1,
                 criterion=nn.MSELoss()):
        super(TSTransformerPositional, self).__init__(n_features=feature_size,
                                                      num_layers=num_layers,
                                                      num_heads=num_heads,
                                                      dim_feedforward=dim_feedforward,
                                                      dropout=dropout,
                                                      learning_rate=learning_rate,
                                                      seq_len=seq_len,
                                                      output_size=output_size,
                                                      criterion=criterion)

        # self.W_pos = self._positional_encoding(None, False, seq_len, feature_size)
        # self.W_pos = self._positional_encoding("zeros", True, seq_len, feature_size)
        # self.W_pos = self._positional_encoding("normal", True, seq_len, feature_size)
        # self.W_pos = self._positional_encoding("uniform", True, seq_len, feature_size)
        self.W_pos = self._positional_encoding("sincos", False, seq_len, feature_size, normalize=True, max_div=1e2)
        # self.W_pos = self._positional_encoding("sincos", True, seq_len, feature_size, normalize=False)
        # self.W_pos = self._positional_encoding("sincos", True, seq_len, feature_size, normalize=True)
        # self.W_pos = self._positional_encoding("lin2d", False, seq_len, feature_size,  normalize=False)
        # self.W_pos = self._positional_encoding("exp2d", False, seq_len, feature_size,  normalize=False)
        # self.W_pos = self._positional_encoding("exp2d", False, seq_len, feature_size,  normalize=True)
        self.dropout = nn.Dropout(dropout)

    def _positional_encoding(self, pe, learn_pe, q_len=1, d_model=1, normalize=True, max_div=1e4):
        if pe is None:
            W_pos = torch.zeros((q_len, d_model)) # pe = None and learn_pe = False can be used to measure impact of pe
            learn_pe = False
        elif pe == 'zeros':
            W_pos = torch.empty((q_len, d_model))
            nn.init.uniform_(W_pos, -0.02, 0.02)
        elif pe == 'normal' or pe == 'gauss':
            W_pos = torch.zeros((q_len, d_model))
            torch.nn.init.normal_(W_pos, mean=0.0, std=0.1)
        elif pe == 'uniform':
            W_pos = torch.zeros((q_len, d_model))
            nn.init.uniform_(W_pos, a=0.0, b=0.1)
        elif pe == 'lin2d':
            W_pos = Coord2dPosEncoding(q_len, d_model, exponential=False, normalize=normalize)
        elif pe == 'exp2d':
            W_pos = Coord2dPosEncoding(q_len, d_model, exponential=True, normalize=normalize)
        elif pe == 'sincos':
            W_pos = PositionalEncoding(q_len, d_model, normalize=normalize, max_div=max_div)
        else:
            raise ValueError(f"unknown positional encoding: {pe!r}")
        return nn.Parameter(W_pos, requires_grad=learn_pe)

    def forward(self, src):
        assert src.shape[1] == self.seq_len, "incorrect sequence len"
        src = self.dropout(src + self.W_pos)
        # src = src + self.W_pos
        output = self.transformer_encoder(src, mask=self.mask)
        output = self.decoder(output)
        return output



In [None]:
tfr_pos_model = TSTransformerPositional(feature_size=tfr_param['reshape_features'],
                                        num_heads=tfr_param['num_heads'],
                                        num_layers=tfr_param['num_layers'],
                                        seq_len=tfr_param['reshape_seq_len'],
                                        dim_feedforward=tfr_param['dim_feedforward'],
                                        output_size=tfr_param['output_size'],
                                        learning_rate=tfr_param['learning_rate'],
                                        dropout=tfr_param['dropout'])

In [None]:
# add early stop callback to pl.Trainer


seed_everything(1)
tfr_pos_trainer = pl.Trainer(
    max_epochs=tfr_param['max_epochs'],
    gpus=gpus,
    # fast_dev_run=True,  # uncomment to check that the network or dataset has no serious bugs
    callbacks=[LitProgressBar(), EarlyStopping(monitor="val_loss", patience=15)],
    log_every_n_steps = 20,
    logger=CSVLogger('logs', name='transformer', version='pos')
    )

tfr_pos_dm = SequenceDataModule(
    X_train,
    X_test,
    seq_len = tfr_param['seq_len'],
    batch_size = tfr_param['batch_size'],
    target_len=tfr_param['target_len'],
    reshape_features=tfr_param['reshape_features'],
    )

In [None]:
# tfr_param['batch_size'] = 96; tfr_param['learning_rate'] = 1e-3; tfr_param

## Training

In [None]:
tfr_pos_trainer.fit(
    tfr_pos_model,
    datamodule=tfr_pos_dm)

In [None]:
plot_training_metrics(tfr_pos_trainer.logger)

## Testing / Prediction

In [None]:
_, y = map(torch.cat, zip(*tfr_pos_dm.test_dataloader()))
y_hat = torch.cat(tfr_pos_trainer.predict(tfr_pos_model, tfr_pos_dm))
plot_test_predict(y, y_hat)

## Task 2 (Score: 2)

Compare the positional encodings implemented above (`None`, `zeros`, `normal`, `uniform`, `lin2d`, `exp2d`, `sincos`). Vary the `normalize` and `learn_pe` arguments of `_positional_encoding`.

Make a table of the test performance obtained under a similar computational budget.

# Task 3 (Score: 2) TSTransformerPositionalCat

Implement `TSTransformerPositionalCat` so that it concatenates the positional encoding tensor to `src` along the feature dimension instead of adding it element-wise.

In [None]:
class TSTransformerPositionalCat(TSTransformerPositional):
    # d_model : number of features
    def __init__(self, feature_size=1, num_layers=3, num_heads=7, dim_feedforward=2048,
                 dropout=0, learning_rate=1e-3, seq_len=8, output_size=1,
                 criterion=nn.MSELoss()):
        # NB feature size is doubled since we send 2-times larger tensor to the encoder
        super(TSTransformerPositionalCat, self).__init__(feature_size = feature_size * 2,
                                                         num_layers=num_layers,
                                                         num_heads=num_heads,
                                                         dim_feedforward=dim_feedforward,
                                                        dropout=dropout,
                                                        learning_rate=learning_rate,
                                                        seq_len=seq_len,
                                                        output_size=output_size,
                                                        criterion=criterion)

        self.W_pos = self._positional_encoding("sincos", False, seq_len, feature_size, normalize=False, max_div=1e3)
        # self.W_pos = self._positional_encoding("normal", True, seq_len, feature_size)
        # self.W_pos = self._positional_encoding("sincos", True, seq_len, feature_size, normalize=False)


    def forward(self, src):
        assert src.shape[1] == self.seq_len, "incorrect sequence len"
        # TODO: concatenate self.W_pos with every batch entry

        # src_pos = ... <YOUR CODE>
        raise NotImplementedError # <= remove this

        src = self.dropout(src_pos)

        output = self.transformer_encoder(src, self.mask)
        output = self.decoder(output)
        return output

In [None]:
seed_everything(1)
tfr_pos_cat_trainer = pl.Trainer(
    max_epochs=tfr_param['max_epochs'],
    gpus=gpus,
    # fast_dev_run=True,  # uncomment to check that the network or dataset has no serious bugs
    callbacks=[LitProgressBar(), EarlyStopping(monitor="val_loss", patience=20)],
    log_every_n_steps=20,
    logger=CSVLogger('logs', name='transformer', version='pos-cat')
    )

tfr_pos_cat_dm = SequenceDataModule(
    X_train,
    X_test,
    seq_len=tfr_param['seq_len'],
    batch_size=tfr_param['batch_size'],
    target_len=tfr_param['target_len'],
    reshape_features=tfr_param['reshape_features']
    )

In [None]:
tfr_pos_cat_model = TSTransformerPositionalCat(
    feature_size=tfr_param['reshape_features'],
    num_heads=tfr_param['num_heads'],
    num_layers=tfr_param['num_layers'],
    seq_len=tfr_param['reshape_seq_len'],
    dim_feedforward=tfr_param['dim_feedforward'],
    output_size=tfr_param['output_size'],
    learning_rate=tfr_param['learning_rate'],
    dropout=tfr_param['dropout']
)

## Training

In [None]:
tfr_pos_cat_trainer.fit(
    tfr_pos_cat_model,
    datamodule=tfr_pos_cat_dm)

In [None]:

plot_training_metrics(tfr_pos_cat_trainer.logger)

## Testing / Prediction

In [None]:
_, y = map(torch.cat, zip(*tfr_pos_cat_dm.test_dataloader()))
y_hat = torch.cat(tfr_pos_cat_trainer.predict(tfr_pos_cat_model, tfr_pos_cat_dm))
plot_test_predict(y[:], y_hat[:])

# Conclusion

This notebook presented a basic use case of transformers for forecasting a simple multivariate time series. For more advanced implementations, please refer to packages purpose-built for this kind of data, e.g.:
- https://github.com/timeseriesAI/tsai/
- https://github.com/jdb78/pytorch-forecasting