![logo](https://github.com/donatellacea/DL_tutorials/blob/main/notebooks/figures/1128-191-max.png?raw=true)

# **XAI in Deep Learning-Based Signal Analysis: Transformers**

In this notebook, we introduce the Transformer architecture and highlight its significance in natural language processing (NLP).

---

## Getting Started

### Setup Colab environment

If you have installed the packages and requirements on your own machine, you can skip this section and start from the import section.
Otherwise, you can follow and execute the tutorial in your browser. To start working on the notebook, click on the following button. This opens the page in the Colab environment, where you can execute the code yourself.

<a href="https://colab.research.google.com/github/HelmholtzAI-Consultants-Munich/Zero2Hero---Introduction-to-XAI/blob/Juelich-2023/data_and_models/Model-Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Now that you have opened the notebook in Colab, follow these steps:

1. Run the cell below to connect your Google Drive to Colab and clone the tutorial repository.
2. Allow this notebook to access your Google Drive files: click 'Yes' and select your account.
3. When prompted with "Google Drive for desktop wants to access your Google Account", click 'Allow'.

At this point, a folder has been created in your Drive, and you can navigate to it through the left-hand panel in Colab. You may also receive an email informing you about the access to your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive
!git clone --branch Juelich-2023 https://github.com/HelmholtzAI-Consultants-Munich/XAI-Tutorials.git
%cd XAI-Tutorials/data_and_models

### Imports

In [2]:
import math
import torch
import torch.nn as nn

## 1. **Encoder-Decoder Structure**
 
Transformers are a neural network architecture that has become a cornerstone of natural language processing (NLP) and beyond. They were introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Their key innovation is the **Attention Mechanism**, in particular self-attention.

The Transformer architecture is divided into two main parts:

1. **Encoder:**
   - The encoder processes the input data (like a sentence in a translation task) and encodes it into a context-rich representation. It consists of a stack of layers, each containing a self-attention mechanism and a feed-forward neural network.

2. **Decoder:**
   - The decoder takes the encoded input and generates the output sequence (like a translated sentence). It also consists of a stack of layers. Each layer contains two attention mechanisms and a feed-forward neural network: a masked self-attention mechanism, which prevents the decoder from seeing future tokens in the output sequence, followed by an attention mechanism that attends to the output of the encoder.

<img src="..//docs/source/_figures/transformers.png" alt="transformers" width="700" height="700">

In [3]:
# Define the transformer model
class Transformer(nn.Module):
    def __init__(self, input_vocab_size, output_vocab_size, hidden_size, num_layers, num_heads, dropout):
        super().__init__()
        self.encoder = Encoder(input_vocab_size, hidden_size, num_layers, num_heads, dropout)
        self.decoder = Decoder(output_vocab_size, hidden_size, num_layers, num_heads, dropout)
        self.output_layer = nn.Linear(hidden_size, output_vocab_size)

    def forward(self, input_seq, output_seq):
        encoder_output, encoder_mask = self.encoder(input_seq)
        # The decoder returns a single tensor, so there is nothing to unpack here
        decoder_output = self.decoder(output_seq, encoder_output, encoder_mask)
        output_logits = self.output_layer(decoder_output)
        return output_logits

## 2. **Input Processing**

Before the input is fed into the transformer encoder, it goes through the following steps:
- **Tokenization:** A fundamental step in text processing and NLP that splits text into smaller units called tokens. Tokens are often words, but they can also be characters, subwords, or even sentences, depending on the level of tokenization.
  Example:
  - Text: "Natural Language Processing is fascinating."
  - Tokens: ["Natural", "Language", "Processing", "is", "fascinating"]

- **Input Embeddings:** The input sequence (e.g., a sentence) is converted into a sequence of vectors. This is done through embeddings which map words or tokens to high-dimensional vectors.

- **Positional Encodings:** Since the Transformer does not have recurrent or convolutional layers, it uses positional encodings to add information about the position of each token in the sequence. These positional encodings have the same dimension as the embeddings and are added to them.
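The three steps above can be sketched end to end. This is only a minimal illustration, assuming a naive whitespace tokenizer, a toy vocabulary built from the example sentence, and an embedding size of 8. None of these choices come from the model defined in this notebook, and real systems typically use subword tokenizers.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

text = "Natural Language Processing is fascinating"

# 1. Tokenization: naive whitespace split (an assumption for illustration)
tokens = text.split()

# Toy vocabulary; index 0 is reserved for padding
vocab = {"<pad>": 0}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
token_ids = torch.tensor([[vocab[tok] for tok in tokens]])  # shape (1, 5)

# 2. Input embeddings: each token id is mapped to a vector
embed_dim = 8
embedding = nn.Embedding(len(vocab), embed_dim, padding_idx=0)
embedded = embedding(token_ids)  # shape (1, 5, 8)

# 3. Sinusoidal positional encodings with the same dimension as the embeddings
position = torch.arange(len(tokens), dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim))
pos_enc = torch.zeros(len(tokens), embed_dim)
pos_enc[:, 0::2] = torch.sin(position * div_term)  # even dimensions
pos_enc[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

# The encoder input is the sum of embeddings and positional encodings
encoder_input = embedded + pos_enc.unsqueeze(0)
print(tokens)
print(token_ids.shape, encoder_input.shape)
```

Because the positional encodings are added rather than concatenated, the encoder input keeps the embedding dimension.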

In [4]:
# Define the encoder
class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, num_heads, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.positional_encoding = PositionalEncoding(hidden_size, dropout)
        self.layers = nn.ModuleList([EncoderLayer(hidden_size, num_heads, dropout) for _ in range(num_layers)])

    def forward(self, input_seq):
        input_embedded = self.embedding(input_seq)
        input_encoded = self.positional_encoding(input_embedded)
        encoder_mask = input_seq == 0
        for layer in self.layers:
            input_encoded = layer(input_encoded, encoder_mask)
        return input_encoded, encoder_mask
    
# Define the positional encoding layer
class PositionalEncoding(nn.Module):
    def __init__(self, hidden_size, dropout, max_length=5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        positional_encoding = torch.zeros(max_length, hidden_size)
        position = torch.arange(0, max_length, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, hidden_size, 2).float() * (-math.log(10000.0) / hidden_size))
        positional_encoding[:, 0::2] = torch.sin(position * div_term)
        positional_encoding[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('positional_encoding', positional_encoding)

    def forward(self, input_tensor):
        input_tensor = input_tensor + self.positional_encoding[:input_tensor.size(1), :].unsqueeze(0)
        input_tensor = self.dropout(input_tensor)
        return input_tensor
    

## 3. **The Encoder Part**

The encoder in the Transformer architecture is designed to process and encode input sequences. It's a critical component for understanding and representing the input data in a form that the decoder can then use for tasks like translation or text generation. 
The encoder part is composed of:

- **Stack of Layers:** The encoder is composed of a stack of identical layers. The number of layers varies (e.g., the original Transformer model uses 6 layers), but each layer has the same structure.
- **Two Sub-Layers in Each Encoder Layer:**
  - A multi-head self-attention mechanism.
  - A position-wise fully connected feed-forward network.

The output of the final encoder layer is a sequence of vectors, one per input token, representing the input sequence. The decoder then uses this output as the keys and values of its encoder-decoder attention.

In [5]:
# Define the encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, hidden_size, num_heads, dropout):
        super().__init__()
        self.self_attention = MultiHeadAttention(hidden_size, num_heads, dropout)
        self.feed_forward = FeedForward(hidden_size, dropout)

    def forward(self, input_encoded, encoder_mask):
        self_attention_output, _ = self.self_attention(input_encoded, input_encoded, input_encoded, encoder_mask)
        feed_forward_output = self.feed_forward(self_attention_output)
        return feed_forward_output

## 4. **The Self-Attention Mechanism**

The goal of self-attention is to generate a representation of each element in a sequence by considering the entire sequence. For example, in a sentence, the representation of a word is computed by attending to all words in the sentence, including the word itself.

Self-attention consists of the following steps:

- **Queries, Keys, and Values:** For each element in the input sequence, three vectors are computed: a query vector (Q), a key vector (K), and a value vector (V). These are typically created by multiplying the input embeddings with three different learned weight matrices.
- **Attention Scores:** The model calculates an attention score for each pair of elements in the sequence. This is typically done by taking the dot product of the query vector of one element with the key vector of another, which indicates how much focus to put on that other element when encoding the current one.
- **Scaling and Normalization:** The dot product scores are scaled down (usually by the square root of the dimension of the key vectors), and a softmax function is applied to obtain the final attention weights. This normalization ensures that the weights each element assigns across the sequence sum to 1.
- **Weighted Sum:** The output for each element is then a weighted sum of the value vectors, where the weights are the attention weights obtained from the softmax. This results in a new representation for each element that incorporates information from the entire sequence.

<img src="..//docs/source/_figures/self-attention.png" alt="self-attention" width="700" height="700">
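In matrix form, the computation described in this section is commonly written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors. The following sketch walks through it for a single sequence. The sizes and the random weight matrices `W_q`, `W_k`, `W_v` are placeholders chosen for illustration; in a trained model they are learned.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)  # one sequence of 4 token vectors

# Three weight matrices (random here; learned in a real model)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: dot product of every query with every key, scaled by sqrt(d_k)
scores = Q @ K.T / math.sqrt(K.size(-1))  # shape (4, 4)

# Softmax over the key dimension, so each row sums to 1
weights = torch.softmax(scores, dim=-1)

# Weighted sum of the value vectors
output = weights @ V  # shape (4, 8)

print(weights.sum(dim=-1))
print(output.shape)
```

Row `i` of `weights` shows how strongly element `i` attends to every element of the sequence, including itself.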

## 5. **The Mechanism of Multi-Head Attention**

In multi-head attention, the attention mechanism is run in parallel multiple times. Each parallel run is known as a "head."
Each head learns to pay attention to different parts of the input, allowing the model to capture various aspects of the information (like different types of syntactic or semantic relationships).

The outputs of all attention heads are concatenated and then linearly transformed into the final output. This combination allows the model to pay attention to information from different representation subspaces at different positions.


<img src="..//docs/source/_figures/multi-head-attention.png" alt="multi-head-attention" width="700" height="700">
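The splitting into heads and the concatenation afterwards are mostly a matter of reshaping tensors. The sketch below illustrates the shapes involved, assuming 4 heads and a hidden size of 16. To stay short, it leaves out the learned query, key, and value projections and the final linear transformation, so the input itself plays all three roles.

```python
import torch

torch.manual_seed(0)

batch_size, seq_len, hidden_size, num_heads = 2, 5, 16, 4
head_size = hidden_size // num_heads

x = torch.randn(batch_size, seq_len, hidden_size)

# Split the last dimension into heads: (batch, heads, seq_len, head_size)
heads = x.view(batch_size, seq_len, num_heads, head_size).transpose(1, 2)

# Each head computes its own attention weights over the sequence
scores = heads @ heads.transpose(-2, -1) / head_size ** 0.5
weights = torch.softmax(scores, dim=-1)  # (batch, heads, seq_len, seq_len)
per_head_output = weights @ heads        # (batch, heads, seq_len, head_size)

# Concatenate the heads back into one vector per position
concatenated = per_head_output.transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_size)

print(weights.shape)
print(concatenated.shape)
```

Each head works on a slice of size `hidden_size // num_heads`, so running several heads costs roughly the same as a single head over the full hidden size.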

In [6]:
# Define the multi-head attention layer
class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_size, num_heads, dropout):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.head_size = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.output_layer = nn.Linear(hidden_size, hidden_size)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        query = self.query(query).view(batch_size, -1, self.num_heads, self.head_size).transpose(1, 2)
        key = self.key(key).view(batch_size, -1, self.num_heads, self.head_size).transpose(1, 2)
        value = self.value(value).view(batch_size, -1, self.num_heads, self.head_size).transpose(1, 2)
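        # Scaled dot-product scores: (batch, heads, query_len, key_len)
        attention_scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_size)
        if mask is not None:
            if mask.dtype == torch.bool:
                # Boolean padding mask (batch, key_len): True marks padded keys to ignore
                attention_scores = attention_scores.masked_fill(mask.unsqueeze(1).unsqueeze(2), float('-inf'))
            else:
                # Additive float mask (query_len, key_len): 0 keeps a position, -inf blocks it
                attention_scores = attention_scores + mask
        attention_probs = torch.softmax(attention_scores, dim=-1)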
        attention_probs = self.dropout(attention_probs)
        attention_output = torch.matmul(attention_probs, value)
        attention_output = attention_output.transpose(1, 2).contiguous().view(batch_size, -1, self.hidden_size)
        output = self.output_layer(attention_output)
        return output, attention_probs

## 6. **Position-Wise Feed-Forward Networks**

- **Local Processing:** Each layer also contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
- **Purpose:** While the self-attention layers help with representing the relationship between different words (or tokens) in the sequence, the feed-forward network helps to process each word individually.
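"Applied to each position separately and identically" can be checked directly. The sketch below uses a small stand-in network with an assumed hidden size of 8 and no dropout. It compares the result of processing the whole sequence at once with the result of processing each position on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size, seq_len = 8, 5

# Two linear transformations with a ReLU in between
ffn = nn.Sequential(
    nn.Linear(hidden_size, hidden_size * 4),
    nn.ReLU(),
    nn.Linear(hidden_size * 4, hidden_size),
)

x = torch.randn(1, seq_len, hidden_size)

# Apply the network to the whole sequence at once ...
full_output = ffn(x)

# ... and to each position on its own
per_position = torch.stack([ffn(x[:, i, :]) for i in range(seq_len)], dim=1)

print(torch.allclose(full_output, per_position, atol=1e-5))
```

Both results agree, which shows that the feed-forward network does not mix information between positions. Only the attention sub-layers do that.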


In [7]:
# Define the position-wise feed-forward layer
class FeedForward(nn.Module):
    def __init__(self, hidden_size, dropout):
        super().__init__()
        self.hidden_layer = nn.Linear(hidden_size, hidden_size * 4)
        self.output_layer = nn.Linear(hidden_size * 4, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_tensor):
        hidden_output = nn.ReLU()(self.hidden_layer(input_tensor))
        hidden_output = self.dropout(hidden_output)
        output = self.output_layer(hidden_output)
        return output

## 7. **The Decoder Part**

In the Transformer model, the decoder is responsible for generating an output sequence, typically in tasks like machine translation, text generation, or summarization. During training, the decoder operates in conjunction with the encoder, but with some additional mechanisms that facilitate sequence generation. Here's how the decoder part works during training:

### 1. **Overall Structure**

The Transformer decoder has a structure similar to the encoder but with some key differences:

- **Multiple Layers:** Like the encoder, the decoder is composed of multiple identical layers.
- **Self-Attention and Encoder-Attention:** Each layer in the decoder includes two sub-layers of multi-head attention mechanisms (self-attention and encoder-attention) and a feed-forward neural network.

### 2. **Input to the Decoder**

- **Shifted Right:** The input to the decoder during training is typically the target sequence (what the model is expected to generate) shifted right by one position. This shifting ensures that the prediction for a certain position is made by only considering the known output up to that point.
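As a small illustration of the shift, the sketch below builds the decoder input from a target sequence. The special-token ids `BOS` and `EOS` and the token values are hypothetical; this notebook does not define a vocabulary with start or end tokens.

```python
import torch

BOS, EOS = 1, 2  # hypothetical special-token ids

# What the model is expected to produce, one row per sequence
target = torch.tensor([[5, 6, 7, EOS]])

# Decoder input: prepend BOS and drop the last token
bos_column = torch.full((target.size(0), 1), BOS)
decoder_input = torch.cat([bos_column, target[:, :-1]], dim=1)

print(decoder_input.tolist())  # [[1, 5, 6, 7]]
print(target.tolist())         # [[5, 6, 7, 2]]
```

At position `t`, the decoder sees `decoder_input[:, :t + 1]` and is trained to predict `target[:, t]`.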

### 3. **Masked Self-Attention**

- **Preventing Future Information Leakage:** The self-attention mechanism in the decoder is modified with masking to prevent positions from attending to subsequent positions. This masking ensures that the predictions for a particular word can only depend on the known outputs (words) that come before it, mimicking the sequential generation during inference.
- **Sequential Dependence:** This mechanism is critical for learning the dependency of a word on its predecessors in the sequence.
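The effect of the mask on the attention weights can be seen in a small example. This sketch uses random scores for a sequence of length 4 and is independent of the classes defined in this notebook.

```python
import torch

torch.manual_seed(0)

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

# True above the diagonal marks future positions that must be hidden
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float('-inf'))

# After the softmax, masked positions receive a weight of exactly 0
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```

The resulting matrix is lower triangular: the first position can only attend to itself, the second to the first two positions, and so on.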

### 4. **Encoder-Decoder Attention**

- **Interaction with Encoder Outputs:** After the masked self-attention, each decoder layer has an encoder-decoder attention mechanism. In this step, the queries come from the previous layer of the decoder, and the keys and values come from the output of the encoder. This allows the decoder to focus on relevant parts of the input sequence (from the encoder), which is essential for tasks like translation where alignment between input and output sequences is crucial.
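The following sketch shows where queries, keys, and values come from in this step and which shapes result. It assumes a source length of 6 and a target length of 3, and it omits the learned projections and multiple heads for brevity.

```python
import math
import torch

torch.manual_seed(0)

src_len, tgt_len, d_model = 6, 3, 8
encoder_output = torch.randn(src_len, d_model)  # one vector per source token
decoder_state = torch.randn(tgt_len, d_model)   # one vector per target token

# Queries come from the decoder; keys and values come from the encoder
Q, K, V = decoder_state, encoder_output, encoder_output

weights = torch.softmax(Q @ K.T / math.sqrt(d_model), dim=-1)  # (tgt_len, src_len)
context = weights @ V                                          # (tgt_len, d_model)

print(weights.shape, context.shape)
```

Each row of `weights` is a distribution over the source tokens, which can be read as a soft alignment between one output position and the input sequence.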

### 5. **Output Generation**

- **Linear Layer and Softmax:** The output of the decoder's top layer is passed through a linear layer followed by a softmax layer. This step generates probabilities for each word in the model's vocabulary as the next output in the sequence.
- **Training Objective:** During training, the decoder is trained to predict the next word in the target sequence given the previous words. This is typically done using a cross-entropy loss between the predicted probabilities and the actual next word in the sequence.
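A minimal sketch of this last step, with random tensors standing in for the decoder output and the gold next tokens. The sizes are arbitrary. Note that `torch.nn.functional.cross_entropy` expects raw logits and applies the log-softmax internally, so the explicit softmax below is only needed to inspect the probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, tgt_len, hidden_size, vocab_size = 2, 4, 16, 10
decoder_output = torch.randn(batch_size, tgt_len, hidden_size)
next_tokens = torch.randint(1, vocab_size, (batch_size, tgt_len))  # gold next token per position

# Linear layer: project each decoder vector onto the vocabulary
output_layer = nn.Linear(hidden_size, vocab_size)
logits = output_layer(decoder_output)  # (batch, tgt_len, vocab)

# Softmax turns the logits into a distribution over the vocabulary
probs = torch.softmax(logits, dim=-1)

# Cross-entropy between the predicted distribution and the actual next token
loss = F.cross_entropy(logits.reshape(-1, vocab_size), next_tokens.reshape(-1))

print(probs.sum(dim=-1))
print(loss.item())
```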

### 6. **Teacher Forcing**

- **Use of Actual Output for Next Input:** In training, the actual target sequence (shifted right) is used as input to the decoder. This technique, known as "teacher forcing," helps stabilize and speed up training but can lead to issues like exposure bias during inference.
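The difference between teacher forcing and free-running generation can be illustrated with a toy example. The function `toy_next_token` is a hypothetical stand-in for a trained decoder, written so that it makes one deliberate mistake; it is not part of this notebook's model.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for a trained decoder: predicts last token + 1,
    but makes a mistake whenever the last token is 3."""
    last = prefix[-1]
    return 9 if last == 3 else last + 1

BOS = 0
target = [1, 2, 3, 4, 5]

# Teacher forcing: every step is conditioned on the ground-truth prefix
teacher_forced = [toy_next_token([BOS] + target[:t]) for t in range(len(target))]

# Free-running (inference): every step is conditioned on the model's own predictions
generated = [BOS]
for _ in range(len(target)):
    generated.append(toy_next_token(generated))
free_running = generated[1:]

print(teacher_forced)  # [1, 2, 3, 9, 5]
print(free_running)    # [1, 2, 3, 9, 10]
```

With teacher forcing, the single mistake stays local because the next step sees the correct token again. In free-running mode, the mistake is fed back as input and affects every later prediction, which is the situation exposure bias refers to.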


In [8]:
# Define the decoder
class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, num_heads, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.positional_encoding = PositionalEncoding(hidden_size, dropout)
        self.layers = nn.ModuleList([DecoderLayer(hidden_size, num_heads, dropout) for _ in range(num_layers)])

    def forward(self, output_seq, encoder_output, encoder_mask):
        output_embedded = self.embedding(output_seq)
        output_encoded = self.positional_encoding(output_embedded)
        # Build the causal mask on the same device as the input
        decoder_mask = self.generate_square_subsequent_mask(output_seq.size(1)).to(output_seq.device)
        for layer in self.layers:
            output_encoded = layer(output_encoded, encoder_output, decoder_mask, encoder_mask)
        return output_encoded

    def generate_square_subsequent_mask(self, size):
        # Additive mask: 0.0 on and below the diagonal, -inf above it (future positions)
        mask = torch.triu(torch.full((size, size), float('-inf')), diagonal=1)
        return mask

In [9]:
# Define the decoder layer
class DecoderLayer(nn.Module):
    def __init__(self, hidden_size, num_heads, dropout):
        super().__init__()
        self.self_attention = MultiHeadAttention(hidden_size, num_heads, dropout)
        self.encoder_attention = MultiHeadAttention(hidden_size, num_heads, dropout)
        self.feed_forward = FeedForward(hidden_size, dropout)

    def forward(self, output_encoded, encoder_output, decoder_mask, encoder_mask):
        self_attention_output, _ = self.self_attention(output_encoded, output_encoded, output_encoded, decoder_mask)
        encoder_attention_output, _ = self.encoder_attention(self_attention_output, encoder_output, encoder_output, encoder_mask)
        feed_forward_output = self.feed_forward(encoder_attention_output)
        return feed_forward_output

## 8. **Put It All Together**

In [10]:
model = Transformer(5, 5, hidden_size=512, num_layers=6, num_heads=8, dropout=0.1)
model

Transformer(
  (encoder): Encoder(
    (embedding): Embedding(5, 512)
    (positional_encoding): PositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attention): MultiHeadAttention(
          (query): Linear(in_features=512, out_features=512, bias=True)
          (key): Linear(in_features=512, out_features=512, bias=True)
          (value): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (output_layer): Linear(in_features=512, out_features=512, bias=True)
        )
        (feed_forward): FeedForward(
          (hidden_layer): Linear(in_features=512, out_features=2048, bias=True)
          (output_layer): Linear(in_features=2048, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): EncoderLayer(
        (self_attention): MultiHeadAttention(
          (query): Linear(in_features=512,