# Deep Learning

# Tutorial 22: Transformer Model

In this tutorial, we will cover:

- Architecture of a Transformer Model

Prerequisites:

- Python, PyTorch, Deep Learning Training, Stochastic Gradient Descent

My contact:

- Niklas Beuter (niklas.beuter@th-luebeck.de)

Course:

- Slides and notebooks will be available at https://lernraum.th-luebeck.de/course/view.php?id=5383

## Expected Outcomes
* 

# Introduction to Transformers

Transformers are a type of neural network architecture that has revolutionized the field of natural language processing (NLP) and has been extended to other domains such as computer vision. Introduced by Vaswani et al. in the paper ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) in 2017, transformers have become the backbone of many state-of-the-art models like BERT, GPT-3, and T5.

## Key Concepts

### Attention Mechanism

At the core of transformers is the attention mechanism, which allows the model to focus on different parts of the input sequence when producing each output element. The attention mechanism computes a weighted sum of input elements, where the weights are determined dynamically based on the input.
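As a minimal sketch of this weighted sum, the snippet below implements the scaled dot-product attention from the paper, softmax(QK^T / sqrt(d_k)) V. The function name `attention_sketch` and the tensor sizes are illustrative assumptions, not part of the tutorial's model code.

```python
import math
import torch

def attention_sketch(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.shape[-1]
    # Similarity score between every query and every key
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    # Each row of weights is a probability distribution over the keys
    weights = torch.softmax(scores, dim=-1)
    # The output is a weighted sum of the value vectors
    return weights @ value, weights

torch.manual_seed(0)
q = torch.randn(1, 3, 4)   # (batch, query positions, d_k)
k = torch.randn(1, 5, 4)   # (batch, key positions, d_k)
v = torch.randn(1, 5, 8)   # (batch, key positions, d_v)
out, weights = attention_sketch(q, k, v)
print(out.shape)           # torch.Size([1, 3, 8])
print(weights.sum(dim=-1)) # every row of weights sums to 1
```

The weights depend on the inputs themselves, so the same function yields a different mixing pattern for every sequence.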

### Self-Attention

Self-attention, also known as intra-attention, is a type of attention mechanism that relates different positions of a single sequence to compute a representation of the sequence. This is crucial for capturing dependencies regardless of their distance in the input sequence.
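The following sketch illustrates what "a single sequence" means in practice: queries, keys, and values are all projections of the same tensor `x`. It assumes a single head and small, arbitrary sizes. The projection names `w_q`, `w_k`, and `w_v` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size, seq_len = 8, 6
x = torch.randn(1, seq_len, embed_size)   # one sequence of 6 token embeddings

# Queries, keys, and values are all projections of the same sequence x
w_q = nn.Linear(embed_size, embed_size, bias=False)
w_k = nn.Linear(embed_size, embed_size, bias=False)
w_v = nn.Linear(embed_size, embed_size, bias=False)
q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.transpose(-2, -1) / embed_size ** 0.5
weights = torch.softmax(scores, dim=-1)   # (1, seq_len, seq_len)
out = weights @ v                         # (1, seq_len, embed_size)

# Every position attends to every other position in a single step,
# so the first token has a direct, non-zero weight on the last token.
print(weights.shape)                      # torch.Size([1, 6, 6])
print(weights[0, 0, -1].item())           # weight of token 0 on the last token
```

Because the weight matrix covers all pairs of positions, the path between two tokens has length one regardless of how far apart they are.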

### Multi-Head Attention

Transformers use multi-head attention to enhance the model's ability to focus on different parts of the sequence from multiple perspectives. This involves running several attention mechanisms in parallel, known as "heads," and then concatenating their outputs.
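The sketch below shows one way to split an embedding into heads, run attention per head, and concatenate the results. Random tensors stand in for the projected queries, keys, and values, and the final output projection is omitted. The sizes and the helper `split_heads` are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
N, seq_len, embed_size, heads = 2, 5, 16, 4
head_dim = embed_size // heads            # 4 dimensions per head

# In a real layer q, k, v come from learned projections; random tensors stand in here
q = torch.randn(N, seq_len, embed_size)
k = torch.randn(N, seq_len, embed_size)
v = torch.randn(N, seq_len, embed_size)

def split_heads(t):
    # (N, seq_len, embed_size) -> (N, heads, seq_len, head_dim)
    return t.reshape(N, seq_len, heads, head_dim).transpose(1, 2)

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

# Each head computes its own attention weights over the sequence
scores = qh @ kh.transpose(-2, -1) / head_dim ** 0.5   # (N, heads, seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)
per_head = weights @ vh                                # (N, heads, seq_len, head_dim)

# Concatenate the heads back into one vector per position
out = per_head.transpose(1, 2).reshape(N, seq_len, embed_size)
print(weights.shape)  # torch.Size([2, 4, 5, 5])
print(out.shape)      # torch.Size([2, 5, 16])
```

Each head works on a `head_dim`-sized slice, so the total cost is similar to a single head over the full embedding, while every head can learn a different attention pattern.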

### Positional Encoding

Since transformers do not have a built-in notion of the order of the sequence (unlike RNNs), positional encoding is used to inject information about the position of each token in the sequence. The original paper does this by adding fixed sine and cosine functions of different frequencies to the input embeddings. Learned position embeddings, as used in the implementation later in this tutorial, are a common alternative.
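A minimal sketch of the sinusoidal scheme from the original paper, assuming an even `embed_size`. The function name `sinusoidal_positional_encoding` is an illustrative choice.

```python
import math
import torch

def sinusoidal_positional_encoding(max_length, embed_size):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    position = torch.arange(max_length, dtype=torch.float32).unsqueeze(1)  # (max_length, 1)
    # 1 / 10000^(2i/d) for every even dimension index 2i
    div_term = torch.exp(
        torch.arange(0, embed_size, 2, dtype=torch.float32) * (-math.log(10000.0) / embed_size)
    )
    pe = torch.zeros(max_length, embed_size)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_length=100, embed_size=16)
embeddings = torch.randn(2, 10, 16)   # (batch, seq_len, embed_size)
encoded = embeddings + pe[:10]        # the encoding is broadcast over the batch
print(pe.shape)     # torch.Size([100, 16])
print(pe[0, :4])    # position 0: sin(0) = 0, cos(0) = 1 -> tensor([0., 1., 0., 1.])
```

The encoding is fixed rather than learned, so it adds no parameters and is defined for any position up to `max_length`.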

## Architecture

### Encoder

The encoder consists of multiple identical layers, each containing two main components:
1. **Multi-Head Self-Attention Mechanism**: Allows the model to weigh the relevance of different tokens in the input sequence.
2. **Position-wise Feed-Forward Neural Network**: Applied independently to each position, this consists of two linear transformations with a ReLU activation in between.
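To illustrate what "applied independently to each position" means, the sketch below runs a small feed-forward network on a whole sequence and on a single position, then compares the results. The sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size, forward_expansion = 8, 4

# Two linear transformations with a ReLU in between
feed_forward = nn.Sequential(
    nn.Linear(embed_size, forward_expansion * embed_size),
    nn.ReLU(),
    nn.Linear(forward_expansion * embed_size, embed_size),
)

x = torch.randn(2, 5, embed_size)   # (batch, seq_len, embed_size)
full = feed_forward(x)              # applied to the whole sequence at once

# Applying the same network to one position alone gives the same result,
# because nn.Linear only acts on the last (embedding) dimension
single = feed_forward(x[:, 3, :])
matches = torch.allclose(full[:, 3, :], single, atol=1e-5)
print(full.shape)   # torch.Size([2, 5, 8])
print(matches)
```

Information is mixed across positions only in the attention sub-layer. The feed-forward sub-layer transforms each token representation on its own.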

### Decoder

The decoder also consists of multiple identical layers. Each layer has three components, one more than an encoder layer:
1. **Masked Multi-Head Self-Attention**: Similar to the encoder's self-attention, but prevents attending to future tokens in the sequence to ensure the model is autoregressive.
2. **Encoder-Decoder Attention**: Allows the decoder to focus on relevant parts of the input sequence.
3. **Position-wise Feed-Forward Neural Network**: Same as in the encoder.
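The masking in the first component can be sketched as follows. The example combines a lower-triangular (causal) mask with a padding mask, under the assumption that index 0 is the padding token. The token values are made up.

```python
import torch

pad_idx = 0                              # hypothetical padding index
trg = torch.tensor([[5, 7, 9, 0, 0],     # 3 real tokens, 2 padding tokens
                    [4, 6, 8, 3, 2]])    # 5 real tokens
N, trg_len = trg.shape

# Lower-triangular matrix: position i may attend to positions 0..i only
causal = torch.tril(torch.ones(trg_len, trg_len, dtype=torch.bool))
# Padding mask: True for real tokens, broadcast over all query positions
not_pad = (trg != pad_idx).unsqueeze(1).unsqueeze(2)   # (N, 1, 1, trg_len)
mask = causal & not_pad                                # (N, 1, trg_len, trg_len)

scores = torch.zeros(N, 1, trg_len, trg_len)           # dummy attention scores
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights[0, 0])   # no weight on future or padding positions
```

Masked positions receive a score of negative infinity, so the softmax assigns them a weight of exactly zero and each position can only draw on earlier, non-padding tokens.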

### Training

Transformers are typically trained with teacher forcing: at each time step, the decoder receives the ground-truth previous token from the training data rather than its own prediction. This allows all target positions to be trained in parallel, but it creates a mismatch with inference, where the decoder has to consume its own predictions. The errors that accumulate from this mismatch are known as exposure bias.
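A small sketch of how teacher forcing is usually set up for a decoder: the target sequence is shifted by one position to form the input and the label. The special-token indices and word indices here are hypothetical.

```python
import torch

sos, eos = 1, 2                                # hypothetical special-token indices
trg = torch.tensor([[sos, 11, 12, 13, eos]])   # one target sentence

decoder_input = trg[:, :-1]    # <sos> w1 w2 w3   (what the decoder sees)
decoder_target = trg[:, 1:]    # w1 w2 w3 <eos>   (what it should predict)

for step in range(decoder_input.shape[1]):
    seen = decoder_input[0, : step + 1].tolist()
    print(f"given {seen} -> predict {decoder_target[0, step].item()}")
```

At every step the decoder is conditioned on the correct prefix, even if its own prediction at the previous step would have been wrong.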

## Applications

Transformers have been successfully applied to a wide range of NLP tasks:
- **Machine Translation**: Models like T5 and MarianMT provide state-of-the-art translation capabilities.
- **Text Summarization**: BART and T5 can generate concise and coherent summaries of long documents.
- **Question Answering**: BERT and its variants excel in understanding context and retrieving accurate answers from text.
- **Text Generation**: GPT-3 has demonstrated impressive text generation capabilities, from writing essays to generating code.

## Conclusion

Transformers have significantly advanced the capabilities of neural networks in NLP and beyond. Their ability to handle long-range dependencies and parallelize training has opened up new possibilities in various domains. As research continues, we can expect transformers to play an even more prominent role in machine learning applications.



In [None]:
!pip install torch torchvision accelerate transformers torchtext spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(MultiHeadAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

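        # Note: these projections act on the last dimension (head_dim) after the input
        # has been split into heads, so the same weights are shared by all heads.
        # The original paper uses a separate projection per head.
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)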

    def forward(self, values, keys, query, mask):
        # Preparation of key, queries and values
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Multiplication of queries and keys (in batch)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Softmax over the key dimension, with scores scaled by 1/sqrt(d_k),
        # where d_k is the per-head dimension as in the original paper
        attention = torch.softmax(energy / (self.head_dim ** (1 / 2)), dim=3)

        # Multiplication of result with values
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        # Dropout, Norm and Skip connection (attention + query, forward + x)
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out

class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList(
            [
                TransformerBlock(embed_size, heads, dropout=dropout, forward_expansion=forward_expansion)
                for _ in range(num_layers)
            ]
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))

        for layer in self.layers:
            out = layer(out, out, out, mask)

        return out

class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()
        self.attention = MultiHeadAttention(embed_size, heads)
        self.norm = nn.LayerNorm(embed_size)
        self.transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        attention = self.attention(x, x, x, trg_mask)
        # Dropout, Normalization and Skip connection
        query = self.dropout(self.norm(attention + x))
        out = self.transformer_block(value, key, query, src_mask)
        return out

class Decoder(nn.Module):
    def __init__(self, trg_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length):
        super(Decoder, self).__init__()
        self.device = device
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
                for _ in range(num_layers)
            ]
        )

        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        x = self.dropout(self.word_embedding(x) + self.position_embedding(positions))

        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)

        out = self.fc_out(x)
        return out

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, embed_size=256, num_layers=6, forward_expansion=4, heads=8, dropout=0, device="cuda", max_length=100):
        super(Transformer, self).__init__()

        self.encoder = Encoder(src_vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length)
        self.decoder = Decoder(trg_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length)

        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    def make_src_mask(self, src):
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(N, 1, trg_len, trg_len)
        return trg_mask.to(self.device)

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)

        enc_src = self.encoder(src, src_mask)
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out

# Example: initialize the model and run a forward pass
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src_vocab_size = 10000
trg_vocab_size = 10000
src_pad_idx = 1
trg_pad_idx = 1

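# Pass `device` explicitly: the constructor default is "cuda", which fails on CPU-only machines
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(device)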
src = torch.randint(0, src_vocab_size, (32, 100)).to(device)
trg = torch.randint(0, trg_vocab_size, (32, 100)).to(device)

out = model(src, trg)
print(out.shape)  # Expected: [32, 100, trg_vocab_size]

## Example: Training a Small Translation Model

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import spacy
import random
import numpy as np

# Load the spaCy models to use their tokenizers
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

# Tokenizer functions for German and English
def tokenize_de(text):
    return [tok.text.lower() for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text.lower() for tok in spacy_en.tokenizer(text)]

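# Preprocess the data and build the vocabulary
def build_vocab(tokenized_texts, min_freq=1):
    # Count how often each token occurs
    freq = {}
    for text in tokenized_texts:
        for token in text:
            freq[token] = freq.get(token, 0) + 1

    # Filter by frequency first, then number the kept tokens consecutively,
    # so the special tokens appended below never collide with an existing index
    kept_tokens = [token for token, count in freq.items() if count >= min_freq]
    vocab = {token: idx for idx, token in enumerate(kept_tokens)}
    vocab['<pad>'] = len(vocab)
    vocab['<sos>'] = len(vocab)
    vocab['<eos>'] = len(vocab)
    return vocab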

# Define the dataset
class TranslationDataset(Dataset):
    def __init__(self, src_texts, trg_texts, src_vocab, trg_vocab):
        self.src_texts = src_texts
        self.trg_texts = trg_texts
        self.src_vocab = src_vocab
        self.trg_vocab = trg_vocab

    def __len__(self):
        return len(self.src_texts)

    def __getitem__(self, idx):
        src = self.src_texts[idx]
        trg = self.trg_texts[idx]
        
        src_indices = [self.src_vocab.get('<sos>')] + [self.src_vocab.get(token, self.src_vocab['<pad>']) for token in src] + [self.src_vocab.get('<eos>')]
        trg_indices = [self.trg_vocab.get('<sos>')] + [self.trg_vocab.get(token, self.trg_vocab['<pad>']) for token in trg] + [self.trg_vocab.get('<eos>')]
        
        return torch.tensor(src_indices), torch.tensor(trg_indices)

# Padding function (used as collate_fn by the DataLoader)
def pad_sequences(batch):
    src_batch, trg_batch = zip(*batch)

    max_src_len = max(len(seq) for seq in src_batch)
    max_trg_len = max(len(seq) for seq in trg_batch)

    # Fill with the '<pad>' index rather than 0: index 0 belongs to a real token.
    # src_vocab and trg_vocab are the module-level vocabularies built further down.
    padded_src = torch.full((len(src_batch), max_src_len), src_vocab['<pad>'], dtype=torch.long)
    padded_trg = torch.full((len(trg_batch), max_trg_len), trg_vocab['<pad>'], dtype=torch.long)

    for i, (src_seq, trg_seq) in enumerate(zip(src_batch, trg_batch)):
        padded_src[i, :len(src_seq)] = src_seq
        padded_trg[i, :len(trg_seq)] = trg_seq

    return padded_src, padded_trg

# Load example data (only three sentences, for demonstration purposes)
src_texts = ["ein beispiel satz", "noch ein beispiel", "ein letzter satz"]
trg_texts = ["a sample sentence", "another example", "one last sentence"]

src_tokenized = [tokenize_de(text) for text in src_texts]
trg_tokenized = [tokenize_en(text) for text in trg_texts]

src_vocab = build_vocab(src_tokenized)
trg_vocab = build_vocab(trg_tokenized)

# Make sure every token is in the vocabulary (a no-op with min_freq=1)
for text in src_tokenized:
    for token in text:
        if token not in src_vocab:
            src_vocab[token] = len(src_vocab)
for text in trg_tokenized:
    for token in text:
        if token not in trg_vocab:
            trg_vocab[token] = len(trg_vocab)

# Create the dataset and the DataLoader
dataset = TranslationDataset(src_tokenized, trg_tokenized, src_vocab, trg_vocab)
data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=pad_sequences)

# Create the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src_pad_idx = src_vocab['<pad>']
trg_pad_idx = trg_vocab['<pad>']
trg_sos_idx = trg_vocab['<sos>']
trg_eos_idx = trg_vocab['<eos>']

src_vocab_size = len(src_vocab)
trg_vocab_size = len(trg_vocab)
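# Pass `device` explicitly: the constructor default is "cuda", which fails on CPU-only machines
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(device)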

# Loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=trg_pad_idx)
optimizer = optim.Adam(model.parameters(), lr=0.0005)

# Training loop
num_epochs = 10

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0

    for i, (src, trg) in enumerate(data_loader):
        src = src.to(device)
        trg = trg.to(device)

        # This is the key step for training (teacher forcing):
        # the decoder input drops the last token, and the target is the same sequence
        # shifted by one position, so each position has to predict the next token
        trg_input = trg[:, :-1]
        trg_target = trg[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, trg_input)
        output = output.view(-1, output.shape[-1])

        loss = criterion(output, trg_target)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(data_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("Training finished.")

# Function for generating predictions (greedy decoding)
def translate_sentence(model, sentence, src_vocab, trg_vocab, device, max_length=50):
    model.eval()
    tokens = tokenize_de(sentence)
    # Add start and end tokens; unknown words fall back to '<pad>', as in TranslationDataset
    tokens = [src_vocab['<sos>']] + [src_vocab.get(token, src_vocab['<pad>']) for token in tokens] + [src_vocab['<eos>']]
    
    src_tensor = torch.LongTensor(tokens).unsqueeze(0).to(device)
    
    src_mask = model.make_src_mask(src_tensor)
    with torch.no_grad():
        enc_src = model.encoder(src_tensor, src_mask)

    trg_indexes = [trg_vocab['<sos>']]

    for i in range(max_length):
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        trg_mask = model.make_trg_mask(trg_tensor)
        with torch.no_grad():
            output = model.decoder(trg_tensor, enc_src, src_mask, trg_mask)
        
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)

        if pred_token == trg_vocab['<eos>']:
            break

    # Invert the vocabulary once to map indices back to tokens
    idx_to_token = {idx: token for token, idx in trg_vocab.items()}
    trg_tokens = [idx_to_token[i] for i in trg_indexes]
    return trg_tokens[1:]

In [None]:
# Example: test the model on one sentence
example_sentence = "ein beispiel satz"
prediction = translate_sentence(model, example_sentence, src_vocab, trg_vocab, device)

print(f"Input Sentence: {example_sentence}")
print(f"Predicted Translation: {' '.join(prediction)}")

In [None]:
trg_vocab.keys()