<h1>1. Create Encoder and Decoder Layers</h1>

In [1]:
#Import important libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


## 1.1 Create Positional Encoding

<span style="font-size:18px;">

###  Definition

<strong>Positional Encoding</strong> is a technique used in Transformer models to inject information about the position of each token in a sequence into its embedding.

Since Transformers do not process data sequentially (unlike RNNs), they have **no inherent sense of word order**. Positional encoding solves this by allowing the model to distinguish between tokens at different positions and capture sentence structure.

Without positional encoding, a Transformer would struggle to process sequential data effectively.

---

###  Example of Positional Encoding

Suppose we have a Transformer model that translates English sentences into French.

Consider the sentence:

> **"The cat sat on the mat."**

#### Step 1: Tokenization

The sentence is first tokenized into individual tokens:

```text
["The", "cat", "sat", "on", "the", "mat"]
```

#### Step 2: Word Embeddings

Each token is then mapped to a high-dimensional vector through an embedding layer. These embeddings capture **semantic meaning**, but **do not encode word order**.

```text
Embeddings = {E₁, E₂, E₃, E₄, E₅, E₆}
```

where each embedding $E_i$ is a 4-dimensional vector.

#### Step 3: Adding Positional Encoding

To provide the model with information about token positions, **positional encodings** are added to the word embeddings:

```text
Final Input = Word Embedding + Positional Encoding
```

This ensures that each token has a **unique representation based on both meaning and position**, allowing the model to understand word order.

---

###  Positional Encoding Formula

The original Transformer paper uses **sinusoidal positional encodings**, defined as:

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$

$$
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$

where:

- $pos$ : position of the token in the sequence  
- $i$ : dimension index  
- $d_{model}$ : embedding dimension  

This formulation allows the model to generalize to sequence lengths longer than those seen during training.
</span>

In [2]:
#Implementation of Positional Encoding

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 500):
        super().__init__()

        #PE(pos, 2i) = sin(pos / 1000^(2i/d_model))
        #PE(pos, 2i+1) = cos(pos / 1000^(2i/d_model))
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position*div_term) #even index
        pe[:, 1::2] = torch.cos(position*div_term) #odd index
        pe = pe.unsqueeze(0) #add dimension to index 0 
        self.register_buffer('pe', pe)
        
    #Word Embedding + Positional Encoding
    def forward(self, x):
        return x + self.pe[:, :x.size(1), :] 

## 1.2 Create Multi Head Attention

<span style="font-size:18px;">

### Understanding self-attention mechanism

Before diving into multi-head attention, let’s first understand the standard <strong>self-attention mechanism</strong>, also known as <strong>scaled dot-product attention</strong>.

Given a set of input vectors, self-attention computes attention scores to determine how much focus each element in the sequence should have on the others. This is done using three key matrices:

- Query (Q): Represents the current word's relationship with others.
- Key (K): Represents the words that are being compared against.
- Value (V): Contains the actual word representations.

<br><br>

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20231212180658/selfattne.png" style="display: block; margin: 0 auto;" width="600">

<br>
<br>
<br>
$$
\begin{aligned}
\text{Attention}(Q, K, V) &= softmax\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V
\end{aligned}
$$
<br>

### What is Multi-Head Attention?
Multi-head attention extends self-attention by splitting the input into multiple heads, enabling the model to capture diverse relationships and patterns.

Instead of using a single set of $Q$, $K$, $V$ matrices, the input embeddings are projected into multiple sets (heads), each with its own $Q$, $K$, $V$:

1. **Linear Transformation**: The input $X$ is projected into multiple smaller-dimensional subspaces using different weight matrices.

$$  
\mathbf{Q}_i = \mathbf{X} \mathbf{W}^Q_i, \quad 
\mathbf{K}_i = \mathbf{X} \mathbf{W}^K_i, \quad 
\mathbf{V}_i = \mathbf{X} \mathbf{W}^V_i
  $$

where $i$ denotes the head index.
Independent Attention Computation: Each head independently computes its own self-attention using the scaled dot-product formula.
Concatenation: The outputs from all heads are concatenated.
Final Linear Transformation: A final weight matrix is applied to transform the concatenated output into the desired dimension.

2. **Independent Attention Computation**: Each head independently computes its own self-attention using the scaled dot-product formula.
3. **Concatenation**: The outputs from all heads are concatenated.
4. **Final Linear Transformation**: A final weight matrix is applied to transform the concatenated output into the desired dimension.

<br>
<br>
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20231212181418/multihead.png"style="display: block; margin: 0 auto;" width="600">
<br>
<br>

Mathematically, multi-head attention is expressed as:
$$
    MultiHead(Q, K, V) = Concat(head_1, head_2,...,head_h) \mathbf{W}^O
$$
$where$
$$
    head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
$$

$W^O$ is a final weight matrix to project the concatenated output back into the model's required dimensions.

### Why use multi attention head?

Multi-head attention provides several advantages:

- **Captures different relationship**: Different heads attend to different aspects of the input.
- **Improves learning efficiency**: By operating in parallel, multiple heads allow for better learning of dependencies.
- **Enhances robustness**: The model doesn’t rely on a single attention pattern, reducing overfitting.

</span>

In [3]:
#Implementation of MultiHeadAttention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(0.1)
    
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = F.softmax(attn_scores, dim=-1)
        attn_probs = self.dropout(attn_probs)
        return torch.matmul(attn_probs, V)
    
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        Q = self.w_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.w_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.w_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        
        return self.w_o(attn_output)

## 1.3 Create Feed Forward Network

<span style="font-size:18px;">

### Key characteristic of FNN:
1. **Fully connected layers**:
   <br>
    The FFN comprises two linear (fully connected) layers that transform the input data. The first layer expands the input dimension from $dmodel$=512 to a larger dimension $dff$=2048, and the second layer projects it back to $dmodel$.
2. **Activation Function**:
   <br>
   A Rectified Linear Unit (ReLU) activation function is applied between these two linear layers. This function is defined as ReLU(x)=max(0,x) and is used to introduce non-linearity into the model, helping it to learn more complex patterns.
3. **Position-wise Processing**:
   <br>
   Despite the sequential nature of the input data, each position (i.e., each word’s representation in a sentence) is processed independently with the same FFN. This is akin to applying the same transformation across all positions, ensuring uniformity in extracting features from different parts of the input sequence.

### Mathematical Representation:
$$
FFN(X) = Activation(0, xW_1 + b_1)W_2 + b_2
$$

Here, Activation represents the non-linear activation function, W₁ and W₂ are weight matrices, and b₁ and b₂ are bias vectors. The presence of the activation function allows FFN to break the linearity and alter the input’s distribution and topological structure.
</span>

In [4]:
#Implementation of FeedForward
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout=0.1): 
        #d_ff: hidden layer dimension 
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff) #input layer
        self.linear2 = nn.Linear(d_ff, d_model) #hidden layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

## 1.4 Create Encoder Layer

<span style="font-size:18px;">

### Encoder Architechture in Transformers:
<img src="https://pytorch.org/wp-content/uploads/2024/11/2022-7-12-a-better-transformer-for-fast-transformer-encoder-inference-1.png" style="display: block; margin: 0 auto;" width="600">
</span>

In [5]:
#Implementation of Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int , d_ff: int, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.ff(x)
        return self.norm2(x + self.dropout(ff_output))
    

## 1.5 Create Decoder Layer

<span style="font-size:18px;">

### Decoder Architechture:
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*uVD7vobj-o1tU-MP.png" style="display: block; margin: 0 auto;" width="600">

</span>

In [6]:
#Implementation of Decoder Layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout=0.1):
        super().__init__()
        self.masked_attn = MultiHeadAttention(d_model, num_heads)
        self.enc_dec_attn = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        attn_output = self.masked_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.enc_dec_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.ff(x)
        return self.norm3(x + self.dropout(ff_output))

# 2. Create Transformers
<span style="font-size:18px">
    
## 2.1 The Padding Mask:
The Padding mask could seem trivial at first sight, but it has its own quibbles. First reason on why it is necessary: Not all the sentences have the same lenght!
<br><br>
We:

- Add Padding tokens to bring all the sentences to have the same lenght;
- Create a mask that is able to block the softmax function to consider these uninformative tokens.

**What is the shape of the Padding Mask?**
First, if we want to talk about padding mask we need to consider the Batch size > 1 that we’ll name B. Hence, Q ∈ R^{B × L × E}, K ∈ R^{B × L × E}, V ∈ R^{B × L × E}, L is the sequence length and E is the embedding size.

Now, we’ll use an arbitrary value for the padding token [PAD] , to align all the |B| sequences to the same lenght L .

As an example, the “proto-padding-mask” where |B| = 4 and |L| = 6 , will be:
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*hzXXmCShuznJz0ixirWUTg.png" style="display: block; margin: 0 auto;" width="600">
<br> <br>
Remember that the scaled-dot-product attention function that works with a generic mask is:
<br> <br>
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*x1ZxEXba7MQ1A4wQQTnW1Q.png" style="display: block; margin: 0 auto;" width="600">
<br><br>
for the operation QK^{T} the transposition for the tensor K is done only on the last two dimensions (the batch dim is not considered), so:
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*PgEHfQHa-Xz7cKm9qsDDPQ.png" style="display: block; margin: 0 auto;" width="600">
<br><br>
Now, for each sentence in the set of size | B | we have a L × L matrix that should be masked. To better understand how to construct our padding mask we can make and example with a single sentence, let’s say the third row!
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*I7cV1G1QPT-HO0SvLohsBw.png" style="display: block; margin: 0 auto;" width="600">
<br><br>
Considering every element like x_7 ∈ R^{E} . So,
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*EltJK6FwLh0PtyTXZxUYag.png" style="display: block; margin: 0 auto;" width="600">
<br><br>
It’s easy to see that every position in which we have a multiplication by the padding token (actually a dot product because every entry is ∈ R^{E} ) should be masked because is uninformative.

Hence, our padding mask for the third sentence will be:
<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*ieXRW2lU96ivyhs9YBPmIg.png" style="display: block; margin: 0 auto;" width="600">
</span>

## 2.2 Look Ahead Mask:
<span style="font-size:18px">
The look-ahead mask was originally used in the Attention is All You Need paper for the original transformer. The look-ahead mask is used so that the model can be trained on an entire sequence of text at once as opposed to training the model one word at a time. The original transformer model is what’s called an autoregressive model. This means it predicts using only data from the past. The original transformer was made for translation, so this type of model makes sense. When predicting the translated sentence, the model will predict words one at a time. Say I had a sentence:

“How are you”

The model would translate the sentence to Spanish one word at a time:

Prediction 1: Given “”, the model predicts the next word is “cómo”

Prediction 2: Given “cómo”, the model predicts the next word is “estás”

Prediction 3: Given “cómo estás” the model predicts the next word is “<END>” signifying the end of the sequence

What if we wanted the model to learn this translation? Then we could feed it one word at a time, resulting in three predictions from the model. This process is very slow as it requires S (the sequence length) inferences from the model to get a single sentence translation prediction from the model. Instead, we feed it the whole sentence “cómo estás <END> …” and use a clever masking trick so the model cannot look ahead at future tokens, only past tokens. This way it requires a single inference step to get an entire sentence translation from the model.

The formula for self-attention with a look-ahead mask is the same as the padding mask. The only change has to do with the mask itself.
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*Yh4jMeqVpf7KlD_aBHO62A.png" style="display: block; margin: 0 auto;" width="600">
<br><br>

The mask has a triangle of -∞ in the upper right and 0s elsewhere. Let’s see how this affects the softmax of the weight matrix.
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*G1UoOVV3F7iTLYr547xU6A.png" style="display: block; margin: 0 auto;" width="600">
The weight matrix has some interesting results. The first-row aQ is only weighted by itself aᴷ. Since a is the first token in the sequence, it should not be affected by any other token in the sequence as none of the other tokens exist yet.

On the second row, b is affected by both a and b. Since b is the second token, it should only be affected by the first token, a.

In the last row, the last token in the sequence, D, is affected by all other tokens as the last token in the sequence should have context of all other tokens in the sequence.

Finally, let’s see how the mask affects the output of the attention function.
<br><br>
<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*Dcy934GlldVKQcbOPL_yKw.png" style="display: block; margin: 0 auto;" width="600">
<br><br>
Similar to the weight matrix, the resulting vectors are only affected by the tokens preceding the token represented in that vector. The new token embedding of a is in the first row of the resulting vector. Since this token only has context of itself, it will only be a combination of itself.

The second token b has context of a, so the resulting vector is a combination of a and b.

The last token D has context of all other tokens, so the resulting vector is a combination of all other tokens.

</span>


In [7]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size: int, tgt_vocab_size: int, d_model: int = 512, num_heads: int = 8,
                num_layers: int = 6, d_ff: int = 2048, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.final_linear = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_padding_mask(self, seq, pad_token=0):
        return (seq != pad_token).unsqueeze(1).unsqueeze(2)

    def create_look_ahead_mask(self, seq_len):
        mask = (torch.triu(torch.ones(seq_len, seq_len), diagonal=1)).bool()
        return ~mask

    def forward(self, src, tgt, src_pad_mask=None, tgt_pad_mask=None):
        if src_pad_mask is None:
            src_pad_mask = self.create_padding_mask(src)
        if tgt_pad_mask is None:
            tgt_pad_mask = self.create_padding_mask(tgt)

        tgt_look_ahead_mask = self.create_look_ahead_mask(tgt.size(1)).to(tgt.device)
        tgt_mask = tgt_pad_mask & tgt_look_ahead_mask

        src_emb = self.dropout(self.pos_encoding(self.src_embedding(src) * math.sqrt(self.src_embedding.embedding_dim)))
        tgt_emb = self.dropout(self.pos_encoding(self.tgt_embedding(tgt) * math.sqrt(self.tgt_embedding.embedding_dim)))

        enc_output = src_emb
        for layer in self.encoder_layers:
            enc_output = layer(enc_output, src_pad_mask)

        dec_output = tgt_emb
        for layer in self.decoder_layers:
            dec_output = layer(dec_output, enc_output, src_pad_mask, tgt_mask)

        return self.final_linear(dec_output)

# 3. Create Dataset

<span style="font-size:18px">
For the training dataset, I'll use the ncduy/mt-en-vi for training a seq2seq transformers.
</span>

In [8]:
from datasets import load_dataset

ds = load_dataset("ncduy/mt-en-vi")

README.md: 0.00B [00:00, ?B/s]

train.csv:   0%|          | 0.00/597M [00:00<?, ?B/s]

valid.csv:   0%|          | 0.00/2.45M [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/2.43M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2884451 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11316 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11225 [00:00<?, ? examples/s]

In [9]:
train = ds["train"].select(range(10000))
train['en'][:5]

["- Sorry, that question's not on here.",
 'He wants you to come with him immediately.',
 'I thought we could use some company.',
 'It was founded in 2008 by this anonymous programmer using a pseudonym Satoshi Nakamoto.',
 'With both of these methods, no two prints are exactly alike, but both reveal dramatic images of the fish.']

In [10]:
src_texts = []
for text in train['en'][:10000]:
    src_texts.append(text) 

tgt_texts = []
for text in train['vi'][:10000]:
    tgt_texts.append(text)

In [16]:
from torch.utils.data import Dataset, DataLoader
import torch

BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"

#Build vocab from the dataset
def build_vocab(texts, min_freq=1):
    vocab = {PAD_TOKEN: 0, BOS_TOKEN: 1, EOS_TOKEN: 2, UNK_TOKEN: 3}
    idx = 4
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = idx
                idx += 1
    return vocab

src_vocab = build_vocab(src_texts) #English vocab
tgt_vocab = build_vocab(tgt_texts) #Vietnamese vocab

src_vocab_size = len(src_vocab)
tgt_vocab_size = len(tgt_vocab)

def text_to_indices(text, vocab, is_src=True):
''' E.g: src_texts = ["- Sorry, that question's not on here.",
                        'He wants you to come with him immediately.',
                        'I thought we could use some company.',
                        'It was founded in 2008 by this anonymous programmer using a pseudonym Satoshi Nakamoto.',
                        'With both of these methods, no two prints are exactly alike, but both reveal dramatic images of the fish.']
    
    Output (conceptual):
    [BOS, idx1, idx2, ..., idxN, EOS]                   
'''
    words = text.lower().split() if is_src else text.split()
    return [vocab.get(BOS_TOKEN, 1)] + [vocab.get(w, vocab[UNK_TOKEN]) for w in words] + [vocab.get(EOS_TOKEN, 2)]



class TranslationDataset(Dataset):
    def __init__(self, src_texts, tgt_texts):
        self.src = [torch.tensor(text_to_indices(s, src_vocab)) for s in src_texts]
        self.tgt = [torch.tensor(text_to_indices(t, tgt_vocab, is_src=False)) for t in tgt_texts]
    
    def __len__(self):
        return len(self.src)
    
    def __getitem__(self, idx):
        return self.src[idx], self.tgt[idx]

dataset = TranslationDataset(src_texts, tgt_texts)

def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)
    src_padded = torch.nn.utils.rnn.pad_sequence(src_batch, padding_value=src_vocab[PAD_TOKEN], batch_first=True)
    tgt_padded = torch.nn.utils.rnn.pad_sequence(tgt_batch, padding_value=tgt_vocab[PAD_TOKEN], batch_first=True)

    tgt_input = tgt_padded[:, :-1]
    tgt_target = tgt_padded[:, 1:]
    return src_padded, tgt_input, tgt_target

dataloader = DataLoader(
    dataset,
    batch_size=2,
    collate_fn=collate_fn,
    shuffle=True,
    pin_memory=True
)

In [17]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

model = Transformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=128,
    num_heads=4,
    num_layers=2,
    d_ff=256,
    max_len=512,
    dropout=0.1
).to(device)

model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=tgt_vocab[PAD_TOKEN])

Using device: cuda


In [None]:
class Config():
    def __init__(self,
                model: nn.Module,
                crtiterion: nn.Module,
                optimizer: optim.Optimizer,
                epochs: int = 10,
                tgt_vocab_size: int = 1000):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        self.epochs = epochs
        self.tgt_vocab_size = tgt_vocab_size

    def __call__(self, dataloader: DataLoader):
        return self.train(dataloader)

    def train(self, dataloader: DataLoader):
        for epoch in range(self.epochs):
            total_loss = 0
            for src, tgt_input, tgt_target in dataloader:
                optimizer.zero_grad()
                
                output = self.model(src, tgt_input) 
                
                loss = self.criterion(
                    output.reshape(-1, self.tgt_vocab_size),
                    tgt_target.reshape(-1)
                )
                
                loss.backward()
                self.optimizer.step()
                
                total_loss += loss.item()
            
            avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {avg_loss:.4f}")
            
        return {"final_loss": avg_loss}

In [23]:
for epoch in range(epochs):
    total_loss = 0

    for step, (src, tgt_input, tgt_target) in enumerate(dataloader):

        src = src.to(device)
        tgt_input = tgt_input.to(device)
        tgt_target = tgt_target.to(device)

        optimizer.zero_grad()

        output = model(src, tgt_input)

        loss = criterion(
            output.reshape(-1, tgt_vocab_size),
            tgt_target.reshape(-1)
        )

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    
    print(f"Epoch [{epoch+1}/{epochs}] "
          f"Last Step Loss: {loss.item():.4f} | "
          f"Avg Loss: {total_loss / len(dataloader):.4f}")


Epoch [1/5] Last Step Loss: 4.5108 | Avg Loss: 4.2390
Epoch [2/5] Last Step Loss: 3.2525 | Avg Loss: 4.0939
Epoch [3/5] Last Step Loss: 4.7037 | Avg Loss: 3.9689
Epoch [4/5] Last Step Loss: 4.1405 | Avg Loss: 3.8627
Epoch [5/5] Last Step Loss: 3.9953 | Avg Loss: 3.7594


In [21]:
def translate(sentence, max_len=50):
    model.eval()
    device = next(model.parameters()).device

    sentence = sentence.lower()

    src_indices = torch.tensor(
        [text_to_indices(sentence, src_vocab, is_src=True)],
        device=device
    )

    tgt_indices = torch.tensor(
        [[tgt_vocab[BOS_TOKEN]]],
        device=device
    )

    with torch.no_grad():
        for _ in range(max_len):
            output = model(src_indices, tgt_indices)

            next_token = output[:, -1, :].argmax(dim=-1)

            tgt_indices = torch.cat(
                [tgt_indices, next_token.unsqueeze(1)],
                dim=1
            )

            if next_token.item() == tgt_vocab[EOS_TOKEN]:
                break

    inv_tgt_vocab = {v: k for k, v in tgt_vocab.items()}
    words = [
        inv_tgt_vocab.get(i.item(), "")
        for i in tgt_indices[0, 1:]
        if i.item() != tgt_vocab[EOS_TOKEN]
    ]

    return " ".join(words)


In [24]:
print(translate("He wants you to come with him immediately."))

<unk> có thể làm gì với chúng ta có thể làm sao không.


In [25]:
torch.save({
    "model": model.state_dict(),
    "src_vocab": src_vocab,
    "tgt_vocab": tgt_vocab
}, "transformer_translation.pt")