# Anomaly Detection Transformer Code
---
This notebook runs the anomaly detection code on the SMAP data set <br>

Paper link- [here](https://iclr.cc/virtual/2022/spotlight/7024) <br>
Code link- [here](https://github.com/thuml/Anomaly-Transformer)

Topics in the notebook <br>

1. Reading the SMAP data
2. DataLoader
3. Obtain the data embeddings
4. Create the attention layers

##### @Notebook author: Sarayu Vyakaranam (svyakara@purdue.edu)

---

---
1. Reading the SMAP data
---

In [1]:
import pandas as pd
import numpy as np

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
smap_train = np.load('/content/gdrive/MyDrive/Colab Notebooks/SMAP_train.npy')
smap_test = np.load('/content/gdrive/MyDrive/Colab Notebooks/SMAP_test.npy')
smap_test_label = np.load('/content/gdrive/MyDrive/Colab Notebooks/SMAP_test_label.npy')
type(smap_train)
print(smap_train)
print(len(smap_train))
print(len(smap_test))
print(len(smap_test_label))

[[0.999      0.         0.         ... 0.         0.         0.        ]
 [0.999      0.         0.         ... 0.         0.         0.        ]
 [0.999      0.         0.         ... 0.         0.         0.        ]
 ...
 [0.98775593 0.         0.         ... 0.         0.         0.        ]
 [0.98417906 0.         0.         ... 0.         0.         0.        ]
 [0.98417906 0.         0.         ... 0.         0.         0.        ]]
135183
427617
427617


In [5]:
print(smap_train.shape)

(135183, 25)


In [7]:
n1 = 100
n2 = 100
smap_subset = smap_train[:5]
print(smap_subset.shape)
print(smap_subset)

(5, 25)
[[0.999 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.   ]
 [0.999 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.   ]
 [0.999 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.   ]
 [0.999 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.   ]
 [0.999 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.   ]]


In [1]:
#check whether each row vector has only one non-zero entry
#check the sum of each row vector
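
The two checks noted in the comments above could be carried out along these lines. This is a minimal sketch on a small made-up array (`toy` is a stand-in for `smap_train`, which is not loaded here), and the helper name `summarize_rows` is an assumption used only for illustration.

```python
import numpy as np

def summarize_rows(data):
    """Return, for each row, the number of non-zero entries and the row sum."""
    nonzero_per_row = np.count_nonzero(data, axis=1)
    sum_per_row = data.sum(axis=1)
    return nonzero_per_row, sum_per_row

# Toy stand-in for smap_train: 4 time steps, 5 features (hypothetical values).
toy = np.array([
    [0.999, 0.0, 0.0, 0.0, 0.0],
    [0.999, 0.0, 1.0, 0.0, 0.0],
    [0.000, 0.0, 0.0, 0.0, 0.0],
    [0.984, 0.0, 0.0, 0.0, 0.0],
])

nonzero_per_row, sum_per_row = summarize_rows(toy)
print(nonzero_per_row)                 # non-zero count for every row
print(sum_per_row)                     # sum of every row
print(np.all(nonzero_per_row == 1))    # True only if every row has exactly one non-zero entry
```

On the real data, `summarize_rows(smap_train)` would take the place of the toy call.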

As loaded above, the SMAP input is a plain two-dimensional NumPy array, not a list of token and value pairs. Each row is one time step and each of the 25 columns is one feature, so `smap_train` has shape `(135183, 25)` and `smap_test` has 427617 rows.

There is no timestamp column. The order of the rows is the only time information, and the rows are treated as evenly spaced. This is what allows the series to be cut into fixed-length windows and used to train a time series anomaly detection model.

In the rows printed above, the first column holds a continuous value (for example 0.999 or 0.98775593) while the remaining 24 columns are zero. This is consistent with the spacecraft telemetry layout usually described for this benchmark, in which one column is the telemetry value and the others are one-hot encoded command flags, but this notebook does not verify that interpretation.

`SMAP_test_label.npy` has one entry per test row (427617 in total) that marks whether the time step is anomalous. Only the test split comes with labels in the files loaded here.

No columns are dropped or merged before training: the only preprocessing applied in the next section is standardization of each of the 25 columns.

---
2. DataLoader
---

In [8]:
import torch
import os
import random
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from PIL import Image
import numpy as np
import collections
import numbers
import math
import pandas as pd
from sklearn.preprocessing import StandardScaler
import pickle

In [76]:
#using only the SMAP class from the 'data_loader.py' file
class SMAPSegLoader(object):
    def __init__(self, data_path, win_size, step, mode="train"):
        #data_path must point to where the data is stored. For me it is '/content/gdrive/MyDrive/Colab Notebooks/'
        #initializations
        self.mode = mode
        self.step = step
        self.win_size = win_size
        #set scaler
        self.scaler = StandardScaler()
        #add path to train data
        data = np.load(data_path + "/SMAP_train.npy")
        #pass train data to scaler and transform
        self.scaler.fit(data)
        data = self.scaler.transform(data)
        #repeat same for test data
        test_data = np.load(data_path + "/SMAP_test.npy")
        self.test = self.scaler.transform(test_data)

        #store scaled and transformed train and test data
        self.train = data
        #set validation data also to test data as of now
        self.val = self.test
        #load the test labels
        self.test_labels = np.load(data_path + "/SMAP_test_label.npy")
        print("test:", self.test.shape)
        print("train:", self.train.shape)

    def __len__(self):

        #returning the length of input considered
        if self.mode == "train":
            return (self.train.shape[0] - self.win_size) // self.step + 1
        elif (self.mode == 'val'):
            return (self.val.shape[0] - self.win_size) // self.step + 1
        elif (self.mode == 'test'):
            return (self.test.shape[0] - self.win_size) // self.step + 1
        else:
            return (self.test.shape[0] - self.win_size) // self.win_size + 1

    def __getitem__(self, index):
        index = index * self.step
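        if self.mode == "train":
            #no training labels are loaded, so the first win_size test labels are returned only as a placeholder
            return np.float32(self.train[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])
        elif (self.mode == 'val'):
            #the same placeholder labels are returned for the validation windows
            return np.float32(self.val[index:index + self.win_size]), np.float32(self.test_labels[0:self.win_size])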
        elif (self.mode == 'test'):
            return np.float32(self.test[index:index + self.win_size]), np.float32(
                self.test_labels[index:index + self.win_size])
        else:
            return np.float32(self.test[
                              index // self.step * self.win_size:index // self.step * self.win_size + self.win_size]), np.float32(
                self.test_labels[index // self.step * self.win_size:index // self.step * self.win_size + self.win_size])

#modified function from 'data_loader.py' because we deal only with SMAP data
def get_loader_segment(data_path, batch_size, win_size=100, step=100, mode='train', dataset='SMAP'):
    if dataset != 'SMAP':
        raise ValueError(f"only the SMAP dataset is supported here, got {dataset!r}")
    #note: the SMAP loader is built with a stride of 1 here, so the `step` argument is ignored
    dataset = SMAPSegLoader(data_path, win_size, 1, mode)

    #shuffle only the training windows
    shuffle = (mode == 'train')
    print(batch_size)
    data_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=shuffle, num_workers=0)
    return data_loader
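
The scaling step inside `SMAPSegLoader` fits its statistics on the training split only and reuses them for the test split. A rough NumPy equivalent is sketched below, assuming the usual z-score definition with the population standard deviation and a scale of 1 for constant columns (the behaviour `StandardScaler` documents for zero variance). The helpers `fit_zscore` and `apply_zscore` and the toy arrays are hypothetical and not part of the original code.

```python
import numpy as np

def fit_zscore(train):
    """Compute per-column mean and scale from the training data only."""
    mean = train.mean(axis=0)
    scale = train.std(axis=0)          # population standard deviation (ddof=0)
    scale[scale == 0.0] = 1.0          # constant columns: avoid division by zero
    return mean, scale

def apply_zscore(data, mean, scale):
    """Standardize any split with statistics learned from the training split."""
    return (data - mean) / scale

# Hypothetical toy data: 3 features, the last one constant.
train = np.array([[1.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [3.0, 30.0, 0.0]])
test = np.array([[4.0, 40.0, 0.0]])

mean, scale = fit_zscore(train)
train_scaled = apply_zscore(train, mean, scale)
test_scaled = apply_zscore(test, mean, scale)

print(train_scaled.mean(axis=0))   # approximately zero for every column
print(test_scaled)                 # test values expressed in training-set units
```

Reusing the training statistics keeps the test windows on the same scale the model saw during training, and avoids leaking information from the test split into the preprocessing.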

In [51]:
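#not needed: DataLoader is imported from torch.utils.data above, so no separate package has to be installed
#pip install dataloader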



In [96]:
path = '/content/gdrive/MyDrive/Colab Notebooks/'
dataset_name = 'SMAP'
train_loader = get_loader_segment(data_path=path, batch_size=8, win_size=20,mode='train',dataset=dataset_name)
vali_loader = get_loader_segment(data_path=path, batch_size=8, win_size=20,mode='val',dataset=dataset_name)
test_loader = get_loader_segment(data_path=path, batch_size=8, win_size=20,mode='test',dataset=dataset_name)

test: (427617, 25)
train: (135183, 25)
8
test: (427617, 25)
train: (135183, 25)
8
test: (427617, 25)
train: (135183, 25)
8


In [114]:
#understanding the data loader better

for batch, labels in train_loader:
  #note each batch holds 8 observations, each observation is a 20X25 window
  print(len(batch))
  print(len(labels))
  print(batch)
  print('-------')
  #note each label vector has length 20, so the labels for the whole batch have shape 8X20
  print(labels)
  break

8
8
tensor([[[-0.0742, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [-0.0776, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [-0.1485, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         ...,
         [-0.1815, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [-0.1688, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [-0.1745, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038]],

        [[ 1.5050, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [ 1.5050, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [ 1.5050, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         ...,
         [ 1.5050, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [ 1.5050, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [ 1.5050, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038]],

        [[-0.4330, -0.1684, -0.0734,  ..., -0.1672, -0.0067, -0.0038],
         [-0.4321, -0.1684, -0.0734,  ...

In [115]:
print(type(train_loader))
print(dir(train_loader))
length_train = 0
length_test = 0
length_vali = 0

#training data
for nth_batch, (batch,_) in enumerate(train_loader):
  length_train += len(batch)
  if nth_batch == 0:
    print('the first batch')
    print(f'length of each observation is (window size)- {len(batch[0])}')
    #each tensor below has 20X25 2D form where 20 is the window size and 25 is fixed by the input data
    for observation in batch:
      print(observation)
#enumerate starts at 0, so the number of batches is nth_batch + 1
print(f'the number of windows in training data = {length_train}')
print(f'the number of batches in training data = {nth_batch + 1}')

#test data
for nth_batch, (batch,_) in enumerate(test_loader):
  length_test += len(batch)
print(f'the number of windows in testing data = {length_test}')
print(f'the number of batches in testing data = {nth_batch + 1}')

#validation data
for nth_batch, (batch,_) in enumerate(vali_loader):
  length_vali += len(batch)
print(f'the number of windows in validation data = {length_vali}')
print(f'the number of batches in validation data = {nth_batch + 1}')

<class 'torch.utils.data.dataloader.DataLoader'>
['_DataLoader__initialized', '_DataLoader__multiprocessing_context', '_IterableDataset_len_called', '__annotations__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '__parameters__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_auto_collation', '_dataset_kind', '_get_iterator', '_index_sampler', '_is_protocol', '_iterator', 'batch_sampler', 'batch_size', 'check_worker_number_rationality', 'collate_fn', 'dataset', 'drop_last', 'generator', 'multiprocessing_context', 'num_workers', 'persistent_workers', 'pin_memory', 'pin_memory_device', 'prefetch_factor', 'sampler', 'timeout', 'worker_init_fn']
the first batch
leng

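Note: the loaders yield slightly fewer items than there are input rows. This is expected, because the loader counts windows rather than rows. With `win_size=20` and a stride of 1, a series of N rows gives (N - 20) // 1 + 1 = N - 19 windows, so the 135183 training rows should yield 135164 windows and the 427617 test rows should yield 427598.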
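
The gap between the number of rows and the number of items counted above follows from the `__len__` formula of `SMAPSegLoader`. The sketch below reproduces that arithmetic. The helper names `num_windows` and `num_batches` are assumptions for illustration, and the batch count assumes the `DataLoader` default of keeping the last, smaller batch.

```python
import math

def num_windows(num_rows, win_size, step=1):
    """Number of sliding windows, as in SMAPSegLoader.__len__ for train/val/test."""
    return (num_rows - win_size) // step + 1

def num_batches(num_items, batch_size):
    """Batches produced when the final partial batch is kept (drop_last=False)."""
    return math.ceil(num_items / batch_size)

train_rows, test_rows = 135183, 427617   # shapes printed earlier in the notebook
win_size, batch_size = 20, 8             # values passed to get_loader_segment

train_windows = num_windows(train_rows, win_size)
test_windows = num_windows(test_rows, win_size)

print(train_windows, train_rows - train_windows)   # 135164 windows, 19 fewer than rows
print(test_windows, test_rows - test_windows)      # 427598 windows, 19 fewer than rows
print(num_batches(train_windows, batch_size))      # 16896 batches of up to 8 windows
```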

---
3. Obtain the data embeddings
---

In [98]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm
import math

In [101]:
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        # Compute the positional encodings once in log space.
        super(PositionalEmbedding, self).__init__()

        #d_model is the dimension of the embedding: each time step is represented by a vector of d_model values.
        #d_model is a hyperparameter that can be tuned to improve the performance of the model
        #this notebook uses d_model=512 when the embedding is created below
        pe = torch.zeros(max_len, d_model).float()
        #the sinusoidal encoding is a fixed table, not a learned parameter, so it must not receive gradients.
        #note: the attribute is spelled requires_grad; assigning to `require_grad` only creates an unused attribute.
        #A tensor created with torch.zeros does not require gradients by default, and register_buffer below
        #also keeps the table out of the trainable parameters.
        pe.requires_grad = False

        #torch.arange(0, max_len) creates the positions 0, 1, ..., max_len - 1.
        #float() converts them to floating point numbers
        #unsqueeze(1) adds a second dimension, giving shape (max_len, 1).
        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

        #div_term starts from the even indices 0, 2, 4, ... below d_model.
        #Each index 2i is multiplied by -log(10000) / d_model and exponentiated, giving 10000^(-2i / d_model).
        #This frequency decays as the column index increases (it does not depend on the position).
        #position * div_term broadcasts to shape (max_len, d_model / 2) and is the argument of the sinusoids:
        #even columns hold sin(position * div_term) and odd columns hold cos(position * div_term).
        #The resulting table has shape (max_len, d_model); forward() slices it to the window length.
        #In DataEmbedding the table is added to the value embedding before the attention layers.
        #The positional encoding gives the model access to the order of the time steps,
        #which attention alone does not see.
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

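    #The forward function is not called when the model is initialized.
    #It is called when the module is applied to an input. It ignores the values in x and only uses x.size(1),
    #returning the first L rows of the table with shape (1, L, d_model), which broadcasts over the batch.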
    def forward(self, x):
        return self.pe[:, :x.size(1)]


class TokenEmbedding(nn.Module):
    def __init__(self, c_in, d_model):
        super(TokenEmbedding, self).__init__()
        #compare the version numerically: a plain string comparison would rank '1.10.0' below '1.5.0'
        major, minor = (int(part) for part in torch.__version__.split('+')[0].split('.')[:2])
        padding = 1 if (major, minor) >= (1, 5) else 2
        #1D convolutional layers are used to extract features from sequences.
        #In the case of the TokenEmbedding class, the 1D convolutional layer slides along the time axis of each window.
        #It takes the 25 input features of neighbouring time steps as input and outputs a sequence of feature vectors.
        #The number of features in the output sequence equals the number of filters (out_channels = d_model).
        #The filters in the 1D convolutional layer are learned during training.

        #kernel_size specifies the size of the kernel in the 1D convolution
        self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=d_model,kernel_size=3, padding=padding, padding_mode='circular', bias=False)

        #self.modules() is a method that returns an iterator over this module and all of its submodules.
        #TokenEmbedding has a single submodule, tokenConv, the 1D convolutional layer defined above.
        #(The positional encoding lives in the separate PositionalEmbedding class, not here.)
        #The loop below visits every module and initializes the weights of each Conv1d layer it finds.
        for m in self.modules():
            #only convolutional layers are re-initialized
            if isinstance(m, nn.Conv1d):
                #The nn.init.kaiming_normal_ function in PyTorch is used to initialize the weights of a neural network layer.
                #The function initializes the weights using a normal distribution with a variance that is inversely proportional to the fan-in of the layer.
                #It is often used in conjunction with ReLU activation functions.
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')

    def forward(self, x):
        #forward does 3 things-
        #1. Permutes the input x from (batch, sequence, feature) to (batch, feature, sequence), because Conv1d expects the channel dimension second.
        #2. Passes the permuted tensor to the tokenConv layer, which outputs (batch, d_model, sequence).
        #3. Transposes the output back to (batch, sequence, d_model).
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x


class DataEmbedding(nn.Module):
    #the c_in taken here is the enc_in size- here input_c = 25 and output_c = 25 (see shell script, these are parameters taken by the parser)
    #note: see build_model in the main.py file- the model is initialized using input_c = 25 and output_c = 25
    def __init__(self, c_in, d_model, dropout=0.0):
        super(DataEmbedding, self).__init__()

        self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model)
        self.position_embedding = PositionalEmbedding(d_model=d_model)

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        x = self.value_embedding(x) + self.position_embedding(x)
        return self.dropout(x)
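
To make the sinusoidal formula concrete, the sketch below rebuilds the same table outside the class and compares one entry against the closed form `sin(pos / 10000^(2i / d_model))` (and `cos` for the following odd column). The function name `sinusoidal_table` and the small sizes are assumptions chosen for illustration, and `d_model` is assumed to be even.

```python
import math
import torch

def sinusoidal_table(max_len, d_model):
    """Fixed sinusoidal positional encodings of shape (max_len, d_model)."""
    position = torch.arange(0, max_len).float().unsqueeze(1)           # (max_len, 1)
    div_term = (torch.arange(0, d_model, 2).float()
                * -(math.log(10000.0) / d_model)).exp()                # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even columns
    pe[:, 1::2] = torch.cos(position * div_term)   # odd columns
    return pe

pe = sinusoidal_table(max_len=50, d_model=16)

# Compare one entry with the closed form: position 7, column pair i = 3.
pos, i, d_model = 7, 3, 16
angle = pos / (10000 ** (2 * i / d_model))
print(pe.shape)
print(torch.isclose(pe[pos, 2 * i], torch.tensor(math.sin(angle)), atol=1e-5))
print(torch.isclose(pe[pos, 2 * i + 1], torch.tensor(math.cos(angle)), atol=1e-5))
```

Because the table depends only on the position and the column index, it can be computed once and stored as a buffer, which is what `PositionalEmbedding` does.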


In [117]:
# Embedding for the training data
enc_in = 25 #number of input features per time step
embedding = DataEmbedding(c_in=enc_in, d_model=512, dropout=0.0)
print(embedding)
print(type(embedding))

DataEmbedding(
  (value_embedding): TokenEmbedding(
    (tokenConv): Conv1d(25, 512, kernel_size=(3,), stride=(1,), padding=(1,), bias=False, padding_mode=circular)
  )
  (position_embedding): PositionalEmbedding()
  (dropout): Dropout(p=0.0, inplace=False)
)
<class '__main__.DataEmbedding'>


In [129]:
for nth_batch, (batch, _) in enumerate(train_loader):
  #the second element of each item holds the labels, which are not needed for the embedding
  batch_embedding = embedding(batch)
  print(batch_embedding.shape)
  print(len(batch_embedding))
  print(batch_embedding[0].shape)
  print(len(batch_embedding[0]))
  #the size of each embedding is 20X512, which is win_size X d_model
  #print a sample embedding
  print(batch_embedding[0])
  break

torch.Size([8, 20, 512])
8
torch.Size([20, 512])
20
tensor([[ 9.9732e-02,  1.2029e+00, -3.8141e-01,  ...,  9.6707e-01,
         -3.4434e-01,  9.1254e-01],
        [ 1.0001e+00,  8.2361e-01,  2.9003e-01,  ...,  9.6176e-01,
         -3.9440e-01,  8.2551e-01],
        [ 1.0665e+00, -1.2842e-01,  4.0459e-01,  ...,  9.6126e-01,
         -3.9475e-01,  8.2282e-01],
        ...,
        [-8.1695e-01,  7.7838e-04, -1.1457e+00,  ...,  9.6404e-01,
         -3.9339e-01,  8.3413e-01],
        [-6.1089e-01,  9.7003e-01, -1.5262e+00,  ...,  9.6435e-01,
         -4.1689e-01,  8.0486e-01],
        [ 2.9268e-01,  1.2156e+00, -9.8844e-01,  ...,  9.7389e-01,
         -4.0166e-01,  8.6412e-01]], grad_fn=<SelectBackward0>)
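
The shape bookkeeping inside `TokenEmbedding.forward` can be traced step by step on a random tensor. The sketch below assumes the batch size, window size, and `d_model` used in this notebook (8, 20, 512) and a recent PyTorch version in which `padding=1` applies. The variable names are illustrative only.

```python
import torch
import torch.nn as nn

batch_size, win_size, c_in, d_model = 8, 20, 25, 512

# Same layer configuration as tokenConv above.
conv = nn.Conv1d(in_channels=c_in, out_channels=d_model, kernel_size=3,
                 padding=1, padding_mode='circular', bias=False)

x = torch.randn(batch_size, win_size, c_in)   # (B, L, C): what the DataLoader yields
x_channels_first = x.permute(0, 2, 1)         # (B, C, L): layout expected by Conv1d
y = conv(x_channels_first)                    # (B, d_model, L): length kept by padding=1
out = y.transpose(1, 2)                       # (B, L, d_model): back to time-major

print(x_channels_first.shape)
print(y.shape)
print(out.shape)

# Circular padding wraps around the window, so shifting the input along time
# shifts the output by the same amount.
shifted = conv(torch.roll(x_channels_first, shifts=1, dims=2))
print(torch.allclose(shifted, torch.roll(y, shifts=1, dims=2), atol=1e-5))
```

With a kernel of size 3 and one padded step on each side, the output keeps the window length of 20, which is why the embedding printed above has shape 20X512 per observation.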


---
4. Create the attention layers
---

In [130]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
from math import sqrt
import os


In [132]:
#used to create a mask that prevents the model from attending to future tokens when predicting the current token.
#The mask is a boolean matrix that is True strictly above the diagonal (the future positions) and False on and below it
#Positions marked True are filled with -inf before the softmax, so each token attends only to itself and to earlier tokens
class TriangularCausalMask():
    def __init__(self, B, L, device="cpu"):
        #define the mask shape- to be B=batch size, L=sequence length, notice L is being passed to the mask function
        mask_shape = [B, 1, L, L]
        with torch.no_grad():
            self._mask = torch.triu(torch.ones(mask_shape, dtype=torch.bool), diagonal=1).to(device)

    @property
    def mask(self):
        return self._mask


class AnomalyAttention(nn.Module):
    def __init__(self, win_size, mask_flag=True, scale=None, attention_dropout=0.0, output_attention=False):
        super(AnomalyAttention, self).__init__()
        self.scale = scale
        self.mask_flag = mask_flag
        self.output_attention = output_attention
        self.dropout = nn.Dropout(attention_dropout)
        window_size = win_size
        #tensor that stores the distance |i - j| between every pair of positions i and j inside one window.
        #These distances feed the Gaussian kernel of the prior association in forward().
        #It is built on the CPU here, so the layer can be created without a GPU; forward() moves it to the right device.
        idx = torch.arange(window_size)
        self.distances = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs().float()

    def forward(self, queries, keys, values, sigma, attn_mask):
        #B : the batch size, L : the query sequence length (the window size),
        #H : the number of attention heads, E : the per-head dimension of queries and keys (d_model // n_heads by default).
        B, L, H, E = queries.shape
        #S : the key/value sequence length, D : the per-head dimension of the values
        _, S, _, D = values.shape
        #scaling to prevent attention weights from becoming too large
        scale = self.scale or 1. / sqrt(E)

        #the shape of scores is B H L S (S equals L here because this is self-attention over one window)
        #The scores tensor contains the attention scores between the queries and keys tensors
        #the einsum string below specifies the tensor contraction to be performed
        #blhe: the index labels of the first tensor, queries: batch (b), query position (l), head (h), per-head dimension (e).
        #bshe: the index labels of the second tensor, keys: the same layout, with key position (s) in place of l.
        #->: separates the input labels from the output labels.
        #bhls: the index labels of the output, scores: batch (b), head (h), query position (l), key position (s).
        #the following operations are performed-
        #1. For each batch b and head h, every query vector is multiplied elementwise with every key vector.
        #2. The products are summed over the per-head dimension e, which appears in the inputs but not in the output.
        #3. The result is laid out in the order given after the arrow, b h l s.

        #note: L and S need not be the same in general, and neither do E and D
        scores = torch.einsum("blhe,bshe->bhls", queries, keys)


        #apply mask if needed
        if self.mask_flag:
            if attn_mask is None:
                attn_mask = TriangularCausalMask(B, L, device=queries.device)
            scores.masked_fill_(attn_mask.mask, -np.inf)

        #scale
        attn = scale * scores

        sigma = sigma.transpose(1, 2)  # B L H ->  B H L
        window_size = attn.shape[-1]
        sigma = torch.sigmoid(sigma * 5) + 1e-5
        sigma = torch.pow(3, sigma) - 1
        sigma = sigma.unsqueeze(-1).repeat(1, 1, 1, window_size)  # B H L L, the same shape as the scores
        prior = self.distances.unsqueeze(0).unsqueeze(0).repeat(sigma.shape[0], sigma.shape[1], 1, 1).to(sigma.device)
        prior = 1.0 / (math.sqrt(2 * math.pi) * sigma) * torch.exp(-prior ** 2 / 2 / (sigma ** 2))

        series = self.dropout(torch.softmax(attn, dim=-1))
        V = torch.einsum("bhls,bshd->blhd", series, values)

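        if self.output_attention:
            return (V.contiguous(), series, prior, sigma)
        else:
            #AttentionLayer unpacks four values, so return placeholders for the three that are skipped
            return (V.contiguous(), None, None, None)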


class AttentionLayer(nn.Module):
    def __init__(self, attention, d_model, n_heads, d_keys=None,
                 d_values=None):
        super(AttentionLayer, self).__init__()

        d_keys = d_keys or (d_model // n_heads)
        d_values = d_values or (d_model // n_heads)
        self.norm = nn.LayerNorm(d_model)
        self.inner_attention = attention
        self.query_projection = nn.Linear(d_model,
                                          d_keys * n_heads)
        self.key_projection = nn.Linear(d_model,
                                        d_keys * n_heads)
        self.value_projection = nn.Linear(d_model,
                                          d_values * n_heads)
        self.sigma_projection = nn.Linear(d_model,
                                          n_heads)
        self.out_projection = nn.Linear(d_values * n_heads, d_model)

        self.n_heads = n_heads

    def forward(self, queries, keys, values, attn_mask):
        B, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads
        x = queries
        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)
        sigma = self.sigma_projection(x).view(B, L, H)

        out, series, prior, sigma = self.inner_attention(
            queries,
            keys,
            values,
            sigma,
            attn_mask
        )
        out = out.view(B, L, -1)

        return self.out_projection(out), series, prior, sigma
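
As a rough CPU-only illustration of the pieces inside `AnomalyAttention`, the sketch below computes masked attention scores with the same `einsum` patterns and a Gaussian prior over the distances `|i - j|`. All sizes are made up, and `sigma` is a fixed constant here rather than the learned per-position projection used in the layer, so this is a simplified sketch and not a replacement for the class above.

```python
import math
import torch

B, L, H, E = 2, 5, 3, 4                      # batch, window length, heads, per-head dim
queries = torch.randn(B, L, H, E)
keys = torch.randn(B, L, H, E)
values = torch.randn(B, L, H, E)

# Scores: contract over the per-head dimension e -> (B, H, L, S).
scores = torch.einsum("blhe,bshe->bhls", queries, keys)

# Causal mask: True strictly above the diagonal marks future positions.
future = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

# Scaled softmax over the key axis gives the series association.
series = torch.softmax(scores / math.sqrt(E), dim=-1)
out = torch.einsum("bhls,bshd->blhd", series, values)

# Gaussian prior over relative distance |i - j| with a fixed, hypothetical sigma.
idx = torch.arange(L)
distances = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs().float()   # (L, L)
sigma = 1.5
prior = torch.exp(-distances ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(scores.shape, series.shape, out.shape)
print(series[0, 0])    # lower-triangular rows that each sum to 1
print(prior)           # largest on the diagonal, decaying with distance
```

The series association is data dependent, while the prior depends only on how far apart two positions are, which is why the layer returns both.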
