## Introduction

**Machine Translation (MT)** aims to **automatically translate** text or speech from one **natural language** to another.
It integrates concepts and techniques from **linguistics**, **computer science**, **probability and statistics**, and **artificial intelligence** to develop systems capable of producing accurate translations between human languages.

Modern MT systems such as **Google Translate** and **Bing Translator** produce **high-quality translations**, are integrated into many platforms, and support translation between **over 100 natural languages**.

**The input and output of the MT problem are:**

* **Input:** Source language text.

  * Example (Vietnamese input): *"Tôi đang học NLP"*
* **Output:** Translated text in the target language.

  * Example (English translation): *"I am learning NLP"*

**Approaches to Machine Translation**:

To effectively solve the **machine translation** problem, we need to focus on optimizing two key components:

* **Part 1:** The **learning algorithm**, which estimates the model's parameter set **θ** from parallel training data.
* **Part 2:** The **decoding algorithm**, responsible for generating the best possible translation for the given input text.
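
Under the usual probabilistic formulation, the two parts can be written as follows. This is a sketch rather than a definition taken from the text above: the symbols $x$ (source sentence), $y$ (target sentence), and $D$ (parallel training corpus) are assumed notation.

$$
\theta^{*} = \arg\max_{\theta} \sum_{(x, y) \in D} \log p(y \mid x; \theta)
\qquad\qquad
\hat{y} = \arg\max_{y} \; p(y \mid x; \theta^{*})
$$

The first expression is the learning problem (Part 1), and the second is the decoding problem (Part 2) for a new source sentence $x$.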

Currently, there are **three main approaches** to machine translation:

1. **Rule-based Machine Translation (RBMT):** Translation based on linguistic rules.
2. **Statistical Machine Translation (SMT):** Translation based on statistical models and probability.
3. **Neural Machine Translation (NMT):** Translation using neural network architectures.


**Focus of the Project**:

Among these approaches, **Neural Machine Translation (NMT)** has advanced the most in recent years and generally produces **better translation quality** than RBMT and SMT.
Therefore, this project focuses on **neural network–based methods**, consisting of two main parts:

1. **Method 1:** Building a machine translation model using the **Transformer architecture**.
2. **Method 2:** Building a machine translation model using **Pre-trained Language Models** such as **BERT** and **GPT**.


## Transformer Model

### Data

In [None]:
# Install dependencies
!pip install -q datasets sacrebleu

In [None]:
# download dataset
from datasets import load_dataset

data = load_dataset(
    "mt_eng_vietnamese",
    "iwslt2015-en-vi"
)

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 133318
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1269
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1269
    })
})

In [None]:
data['train']

Dataset({
    features: ['translation'],
    num_rows: 133318
})

In [None]:
data['train']['translation'][0]

{'en': 'Rachel Pike : The science behind a climate headline',
 'vi': 'Khoa học đằng sau một tiêu đề về khí hậu'}

### Tokenization

In [None]:
# tokenization
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

SRC_LANGUAGE = 'en'
TGT_LANGUAGE = 'vi'

token_transform = {} # tokenizer
vocab_transform = {} # vocab

token_transform[SRC_LANGUAGE] = get_tokenizer('basic_english')
token_transform[TGT_LANGUAGE] = get_tokenizer('basic_english') # the same English tokenizer is reused for Vietnamese
token_transform

{'en': <function torchtext.data.utils._basic_english_normalize(line)>,
 'vi': <function torchtext.data.utils._basic_english_normalize(line)>}

In [None]:
# Building vocabulary
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

def yield_tokens(data_iter, language):
    for data_sample in data_iter['translation']:
        yield token_transform[language](data_sample[language])

for language in [SRC_LANGUAGE, TGT_LANGUAGE]: # en, vi
    train_iter = data['train']

    vocab_transform[language] = build_vocab_from_iterator(
        yield_tokens(train_iter, language),
        min_freq=2,
        specials=special_symbols,
        special_first=True
    )
    vocab_transform[language].set_default_index(UNK_IDX)
    print(f'{language} vocab length: {len(vocab_transform[language].get_stoi())}')

en vocab length: 29114
vi vocab length: 12099


In [None]:
vocab_transform[SRC_LANGUAGE].get_itos()[:10]

['<unk>', '<pad>', '<bos>', '<eos>', ',', '.', 'the', 'and', 'to', '&apos']

In [None]:
vocab_transform[TGT_LANGUAGE].get_itos()[:10]

['<unk>', '<pad>', '<bos>', '<eos>', ',', '.', 'và', 'tôi', 'là', 'một']

In [None]:
len(vocab_transform[SRC_LANGUAGE]), len(vocab_transform[TGT_LANGUAGE])

(29114, 12099)
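
To make the roles of `min_freq`, `specials`, and the default index concrete, here is a minimal pure-Python sketch of frequency-thresholded vocabulary building. It is not torchtext's implementation: `build_vocab` and `lookup` are hypothetical helpers, and the token ordering may differ from what `build_vocab_from_iterator` produces.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']
UNK = 0  # index of '<unk>', used for out-of-vocabulary tokens

def build_vocab(token_lists, min_freq=2, specials=SPECIALS):
    """Hypothetical helper: keep tokens seen at least min_freq times, specials first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))  # most frequent first, ties alphabetical
    itos = list(specials) + [t for t in kept if t not in specials]
    stoi = {t: i for i, t in enumerate(itos)}
    return stoi, itos

def lookup(stoi, tokens, default=UNK):
    """Hypothetical helper: map tokens to ids, falling back to the default index."""
    return [stoi.get(t, default) for t in tokens]

corpus = [['the', 'science', 'the', 'climate'], ['the', 'climate', 'headline']]
stoi, itos = build_vocab(corpus)
ids = lookup(stoi, ['the', 'climate', 'rachel'])
print(itos)  # 'science' and 'headline' occur once, so they are dropped
print(ids)   # 'rachel' is unseen, so it maps to UNK
```

Tokens below the frequency threshold and tokens never seen in training both end up as `<unk>`, which is why rare words cannot be translated by the model trained later.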

In [None]:
# def lowercase(text):
#   return text.lower()

# def remove_punctuation(text):
#   import string
#   return ''.join([char for char in text if char not in string.punctuation])

# # Combine the transformations
# text_transform = sequential_transforms(lowercase, remove_punctuation)

# # Example usage
# original_text = "Hello, World! How are you?"
# transformed_text = text_transform(original_text)

# print(original_text)  # Output: Hello, World! How are you?
# print(transformed_text) # Output: hello world how are you

In [None]:
vocab_transform['en'](['rachel', 'pike', 'the', 'science', 'behind', 'a', 'climate', 'headline'])

[6429, 17576, 6, 295, 553, 11, 682, 5334]

### Dataloader

In [None]:
# Dataloader
import torch
from torch.nn.utils.rnn import pad_sequence

# helper function to chain sequential operations
def sequential_transforms(*transforms): # *transforms accepts any number of transformation functions
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input

    return func

# function to add BOS /EOS and create tensor for input sequence indices
def tensor_transform(token_ids):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# src and tgt language text transforms to convert raw strings into tensors of indices
text_transform = {} # token --> indices

for language in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[language] = sequential_transforms(
        token_transform[language], # Tokenization
        vocab_transform[language], # Numericalization
        tensor_transform # Add BOS /EOS and create tensor
    )
    print(text_transform[language])

<function sequential_transforms.<locals>.func at 0x792c9ed365f0>
<function sequential_transforms.<locals>.func at 0x792c9ed36680>


In [None]:
from torch.utils.data import DataLoader

# function to collate data samples into batch tensors
def collate_fn(batch):
    # batch_size = 2
    # [{'en': 'Rachel Pike : The science behind a climate headline', 'vi': 'Khoa học đằng sau một tiêu đề về khí hậu'},
    # {'en': 'In 4 minutes , atmospheric chemist Rachel Pike provides a glimpse of the massive scientific effort behind the bold headlines on climate change , with her team -- one of thousands who contributed -- taking a risky flight over the rainforest in pursuit of data on a key molecule .', 'vi': 'Trong 4 phút , chuyên gia hoá học khí quyển Rachel Pike giới thiệu sơ lược về những nỗ lực khoa học miệt mài đằng sau những tiêu đề táo bạo về biến đổi khí hậu , cùng với đoàn nghiên cứu của mình -- hàng ngàn người đã cống hiến cho dự án này -- một chuyến bay mạo hiểm qua rừng già để tìm kiếm thông tin về một phân tử then chốt .'}]
    src_batch, tgt_batch = [], []
    for sample in batch:
        src_sample, tgt_sample = sample[SRC_LANGUAGE], sample[TGT_LANGUAGE]
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample).to(dtype=torch.int64))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample).to(dtype=torch.int64))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX, batch_first=True)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX, batch_first=True)
    return src_batch, tgt_batch
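
The effect of `tensor_transform` followed by `pad_sequence(..., batch_first=True)` can be sketched without PyTorch. This is only an illustration on plain lists: `add_bos_eos` and `pad_batch` are hypothetical helpers, and the index values follow the special-symbol indices defined earlier.

```python
PAD, BOS, EOS = 1, 2, 3  # same indices as PAD_IDX, BOS_IDX, EOS_IDX above

def add_bos_eos(token_ids):
    """Wrap a list of token ids with the <bos> and <eos> indices."""
    return [BOS] + list(token_ids) + [EOS]

def pad_batch(sequences, pad_value=PAD):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

batch = [add_bos_eos([6429, 17576, 6]), add_bos_eos([295])]
padded = pad_batch(batch)
print(padded)
```

Because padding is done per batch, the padded length changes from batch to batch; it only depends on the longest sentence in that batch.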

In [None]:
BATCH_SIZE = 8

train_dataloader = DataLoader(
        data['train']['translation'],
        batch_size=BATCH_SIZE ,
        collate_fn=collate_fn
)

valid_dataloader = DataLoader (
        data['validation']['translation'],
        batch_size=BATCH_SIZE,
        collate_fn=collate_fn
)

test_dataloader = DataLoader (
        data['test']['translation'],
        batch_size=BATCH_SIZE,
        collate_fn=collate_fn
)

In [None]:
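# use a new name so the `data` DatasetDict is not overwritten (it is needed again for evaluation)
batch = next(iter(train_dataloader))
batch[0].shape, batch[1].shape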

(torch.Size([8, 52]), torch.Size([8, 78]))

### Modeling

In [None]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
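        pos_embedding = pos_embedding.unsqueeze(0) # [1, maxlen, emb_size], matches batch_first=True inputs

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        # token_embedding: [batch, seq_len, emb_size]; slice the table along the sequence dimension
        return self.dropout(token_embedding + self.pos_embedding[:, :token_embedding.size(1), :])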

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size) # Scales the embeddings (optional)


# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()

        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout,
                                       batch_first=True)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):

        src_emb = self.positional_encoding(self.src_tok_emb(src)) # [8, 52, 512]
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg)) # [8, 77, 512]

        # positional order: src_mask, tgt_mask, memory_mask,
        # src_key_padding_mask, tgt_key_padding_mask, memory_key_padding_mask
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, context: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(self.tgt_tok_emb(tgt)), context, tgt_mask)
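
The table built in `PositionalEncoding.__init__` follows the sinusoidal scheme: even columns hold `sin(pos / 10000^(i / emb_size))` and odd columns hold the matching `cos`. The following pure-Python sketch computes the same values on a tiny example; `sinusoidal_encoding` is a hypothetical helper written for illustration, not part of the model code.

```python
import math

def sinusoidal_encoding(max_len, emb_size):
    """Return a max_len x emb_size table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = [0.0] * emb_size
        for i in range(0, emb_size, 2):
            angle = pos / (10000 ** (i / emb_size))  # same as pos * exp(-i * ln(10000) / emb_size)
            row[i] = math.sin(angle)                 # even column
            if i + 1 < emb_size:
                row[i + 1] = math.cos(angle)         # odd column
        table.append(row)
    return table

pe = sinusoidal_encoding(max_len=4, emb_size=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
```

Each position gets a distinct row, which is what lets the attention layers distinguish word order.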

In [None]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    '''
    return:
        src_mask: attention mask for the encoder input (all False, nothing is masked)
        tgt_mask: causal mask for the decoder input
        src_padding_mask: True where src tokens are padding
        tgt_padding_mask: True where tgt tokens are padding
    '''
    src_seq_len = src.shape[1]
    tgt_seq_len = tgt.shape[1]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX)
    tgt_padding_mask = (tgt == PAD_IDX)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
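
As a small illustration of what these masks contain, the sketch below rebuilds them with plain lists. `causal_mask` and `padding_mask` are hypothetical stand-ins for `generate_square_subsequent_mask` and the `== PAD_IDX` comparisons; they assume the same conventions (0.0 means "may attend", `-inf` means "blocked", `True` means "is padding").

```python
NEG_INF = float('-inf')

def causal_mask(size):
    """Row i may attend to columns 0..i; later positions are blocked with -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(size)] for i in range(size)]

def padding_mask(token_ids, pad_idx=1):
    """True at positions that hold the padding index."""
    return [tok == pad_idx for tok in token_ids]

mask = causal_mask(3)
for row in mask:
    print(row)

pad = padding_mask([2, 7, 78, 3, 1, 1])
print(pad)
```

The causal mask stops the decoder from looking at future target tokens during training, while the padding masks keep attention away from `<pad>` positions.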

In [None]:
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE]) # 29114
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE]) # 12099
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128 # note: the DataLoaders above were already created with batch size 8
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

In [None]:
TGT_VOCAB_SIZE

12099

In [None]:
src_ids, tgt_ids = next(iter(train_dataloader))
src_ids = src_ids.to(DEVICE)
tgt_ids = tgt_ids.to(DEVICE)

tgt_input = tgt_ids[:, :-1]
tgt_output = tgt_ids[:, 1:]

src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src_ids, tgt_input)
logits = transformer(src_ids, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_output.reshape(-1))
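
The `[:, :-1]` and `[:, 1:]` slices implement teacher forcing: the decoder reads the target up to position t and is trained to predict the token at position t+1. A minimal sketch on a single sequence of ids (a plain list instead of a tensor, so this is an illustration rather than the training code):

```python
tgt_ids = [2, 7, 78, 66, 156, 5, 3]  # <bos> ... <eos>

tgt_input = tgt_ids[:-1]   # what the decoder sees: everything except the last token
tgt_output = tgt_ids[1:]   # what it must predict: everything except <bos>

# each pair is (token fed to the decoder, token it should predict next)
pairs = list(zip(tgt_input, tgt_output))
print(pairs)
```

This is also why `tgt_input` has length 77 when the padded target batch has length 78.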



In [None]:
src_padding_mask[0]

tensor([False, False, False, False, False, False, False, False, False, False,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True], device='cuda:0')

In [None]:
src_ids.shape

torch.Size([8, 52])

In [None]:
tgt_input.shape

torch.Size([8, 77])

In [None]:
logits.shape

torch.Size([8, 77, 12099])

In [None]:
loss

tensor(9.5250, device='cuda:0', grad_fn=<NllLossBackward0>)
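
A quick sanity check on this value: an untrained model that spreads probability uniformly over the target vocabulary would have a cross-entropy of `ln(vocab_size)`. The short calculation below is a rough reference point, not an exact prediction for a randomly initialised network.

```python
import math

TGT_VOCAB_SIZE = 12099  # target vocabulary size reported above

# cross-entropy (in nats) of a uniform distribution over the vocabulary
uniform_loss = math.log(TGT_VOCAB_SIZE)
print(f"{uniform_loss:.2f}")
```

The observed initial loss of about 9.5 is close to this baseline, which suggests the loss and label shapes are wired up as intended.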

### Trainer

In [None]:
import time

def train_epoch(model, optimizer, criterion, train_dataloader, device):
    model.train()
    losses = []

    for src_ids, tgt_ids in train_dataloader:
        src_ids = src_ids.to(device)
        tgt_ids = tgt_ids.to(device)

        tgt_input = tgt_ids[:, :-1]
        tgt_output = tgt_ids[:, 1:]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src_ids, tgt_input)
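        try:
            output = model(
                src_ids, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask
            )
        except Exception:
            # log the offending batch shapes, then re-raise so the failure is not silently ignored
            print(src_ids.shape, tgt_input.shape)
            raise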

        optimizer.zero_grad()

        loss = criterion(
            output.reshape(-1, output.shape[-1]),
            tgt_output.reshape(-1))
        loss.backward()

        optimizer.step()
        losses.append(loss.item())

    return sum(losses) / len(losses)

def evaluate(model, data_loader, criterion, device):
    model.eval()
    losses = []
    with torch.no_grad():
        for src_ids, tgt_ids in data_loader:
            src_ids = src_ids.to(device)
            tgt_ids = tgt_ids.to(device)

            tgt_input = tgt_ids[:, :-1]
            tgt_output = tgt_ids[:, 1:]

            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src_ids, tgt_input)
            output = model(
                src_ids, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask
            )
            loss = criterion(
                output.reshape(-1, output.shape[-1]),
                tgt_output.reshape(-1)
            )
            losses.append(loss.item())
    return sum(losses) / len(losses)

def train(model, train_dataloader, valid_dataloader, optimizer, criterion, device, epochs):
    for epoch in range(1, epochs+1):
        start_time = time.time()
        train_loss = train_epoch(model, optimizer, criterion, train_dataloader, device)
        valid_loss = evaluate(model, valid_dataloader, criterion, device)
        end_time = time.time()
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {valid_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))

### Training

In [None]:
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128 # note: the DataLoaders above were already created with batch size 8
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
transformer = transformer.to(DEVICE)

criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

epochs = 5
train(transformer, train_dataloader, valid_dataloader, optimizer, criterion, DEVICE, epochs)

Epoch: 1, Train loss: 4.585, Val loss: 4.135, Epoch time = 646.748s
Epoch: 2, Train loss: 3.957, Val loss: 3.824, Epoch time = 641.112s
Epoch: 3, Train loss: 3.712, Val loss: 3.669, Epoch time = 636.455s
Epoch: 4, Train loss: 3.554, Val loss: 3.550, Epoch time = 631.623s
Epoch: 5, Train loss: 3.442, Val loss: 3.477, Epoch time = 632.205s
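
Cross-entropy values are easier to interpret as perplexity, `exp(loss)`. The sketch below converts the validation losses printed above. Treat the result as approximate, since the reported loss is an average of per-batch means rather than an exact per-token average.

```python
import math

# validation losses copied from the training log above
val_losses = [4.135, 3.824, 3.669, 3.550, 3.477]

perplexities = [math.exp(loss) for loss in val_losses]
for epoch, ppl in enumerate(perplexities, start=1):
    print(f"Epoch {epoch}: perplexity ~ {ppl:.1f}")
```

A perplexity near 32 after five epochs means the model is, loosely speaking, still choosing among a few dozen plausible next tokens at each step.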


### Inference

In [None]:
tensor = torch.rand(1, 2, 3)
print(tensor)
print(tensor.shape)
print('-'*35)
out = tensor.transpose(0, 1)
print(out)
print(out.shape)
print('-'*35)
out = out[:, -1]
print(out)
print(out.shape)
print('-'*35)

tensor([[[0.8702, 0.9839, 0.0636],
         [0.5654, 0.8888, 0.7106]]])
torch.Size([1, 2, 3])
-----------------------------------
tensor([[[0.8702, 0.9839, 0.0636]],

        [[0.5654, 0.8888, 0.7106]]])
torch.Size([2, 1, 3])
-----------------------------------
tensor([[0.8702, 0.9839, 0.0636],
        [0.5654, 0.8888, 0.7106]])
torch.Size([2, 3])
-----------------------------------


In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    # Encode the source sequence using the model encoder
    context = model.encode(src, src_mask)

    # Create the initial target sequence containing only the start-of-sentence symbol
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        context = context.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(1)).type(torch.bool)).to(DEVICE)

        # Decode based on the current target sequence and encoded source representation
        out = model.decode(ys, context, tgt_mask) # [1, 1, 512], [1, 2, 512], [1, 3, 512], [1, 4, 512], [1, 5, 512], [1, 6, 512]
        out = out.transpose(0, 1)                 # [1, 1, 512], [2, 1, 512], [3, 1, 512], [4, 1, 512], [5, 1, 512], [6, 1, 512]
        prob = model.generator(out[:, -1])        # [1, 12099],  [2, 12099],  [3, 12099],  [4, 12099],  [5, 12099],  [6, 12099]
        # the transpose moves the sequence dimension to the front; out[:, -1] then drops the size-1 batch dimension

        _, next_word = torch.max(prob, dim=1)
        next_word = next_word[-1].item()

        # [[2, 7]], [[ 2,  7, 78]], [[ 2,  7, 78, 66]], [[ 2,  7, 78, 66, 156]], [[ 2,  7, 78, 66, 156, 5]], [[ 2,  7, 78, 66, 156, 5, 3]]
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)

        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(1, -1)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(model, src, src_mask, max_len=num_tokens+5, start_symbol=BOS_IDX).flatten()

    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
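
Stripped of the tensor bookkeeping, greedy decoding is a short loop: score the next token, take the argmax, append it, and stop at `<eos>` or at the length limit. The sketch below shows that control flow with a toy scoring function. `greedy_decode_sketch` and `toy_scores` are hypothetical; `toy_scores` merely stands in for the trained model.

```python
BOS, EOS = 2, 3  # same indices as BOS_IDX and EOS_IDX

def greedy_decode_sketch(next_token_scores, max_len, start_symbol=BOS, end_symbol=EOS):
    """Repeatedly append the highest-scoring next token until <eos> or max_len."""
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = next_token_scores(ys)                               # one score per vocabulary id
        next_word = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ys.append(next_word)
        if next_word == end_symbol:
            break
    return ys

def toy_scores(prefix):
    """Hypothetical stand-in for the model: emits token 4, then 5, then <eos>."""
    script = {1: 4, 2: 5}
    best = script.get(len(prefix), EOS)
    return [1.0 if i == best else 0.0 for i in range(6)]

decoded = greedy_decode_sketch(toy_scores, max_len=10)
print(decoded)
```

Greedy decoding commits to the single best token at every step, so it is fast but can miss translations that a wider search such as beam search would find.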

In [None]:
translate(transformer, "i go to school")

torch.Size([1, 1, 512])
torch.Size([1, 1, 512])
torch.Size([1, 12099])
tensor([[2, 7]], device='cuda:0')

torch.Size([1, 2, 512])
torch.Size([2, 1, 512])
torch.Size([2, 12099])
tensor([[ 2,  7, 78]], device='cuda:0')

torch.Size([1, 3, 512])
torch.Size([3, 1, 512])
torch.Size([3, 12099])
tensor([[ 2,  7, 78, 66]], device='cuda:0')

torch.Size([1, 4, 512])
torch.Size([4, 1, 512])
torch.Size([4, 12099])
tensor([[  2,   7,  78,  66, 156]], device='cuda:0')

torch.Size([1, 5, 512])
torch.Size([5, 1, 512])
torch.Size([5, 12099])
tensor([[  2,   7,  78,  66, 156,   5]], device='cuda:0')

torch.Size([1, 6, 512])
torch.Size([6, 1, 512])
torch.Size([6, 12099])
tensor([[  2,   7,  78,  66, 156,   5,   3]], device='cuda:0')



' tôi đi học trường . '

In [None]:
translate(transformer, "How are you today?")

' bạn đang ở đây hôm nay ? '

In [None]:
translate(transformer, "wassup dawg")

' <unk> <unk> '

In [None]:
from tqdm import tqdm
import sacrebleu

pred_sentences, tgt_sentences = [], []
for sample in tqdm(data['test']['translation']):
    src_sentence = sample[SRC_LANGUAGE]
    tgt_sentence = sample[TGT_LANGUAGE]

    pred_sentence = translate(transformer, src_sentence)
    pred_sentences.append(pred_sentence)

    tgt_sentences.append(tgt_sentence)

bleu_score = sacrebleu.corpus_bleu(pred_sentences, [tgt_sentences], force=True)
bleu_score

100%|██████████| 1269/1269 [02:15<00:00,  9.35it/s]


BLEU = 7.12 44.0/16.1/6.1/2.3 (BP = 0.717 ratio = 0.751 hyp_len = 25322 ref_len = 33738)
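
The numbers in this line fit together as follows: BLEU is the geometric mean of the four n-gram precisions multiplied by a brevity penalty (BP) that punishes hypotheses shorter than the references. The sketch below recomputes both from the values printed above. Because the precisions are rounded to one decimal in the output, the recomputed score is only expected to be close to the reported 7.12, not identical.

```python
import math

# values copied from the sacreBLEU output above
hyp_len, ref_len = 25322, 33738
precisions = [44.0, 16.1, 6.1, 2.3]  # 1-gram to 4-gram precision, in percent

# brevity penalty: 1 if the hypotheses are longer than the references, else exp(1 - ref/hyp)
bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)

# geometric mean of the n-gram precisions
geo_mean = math.exp(sum(math.log(p / 100) for p in precisions) / len(precisions))

bleu = 100 * bp * geo_mean
print(f"BP = {bp:.3f}, BLEU ~ {bleu:.2f}")
```

A BP of about 0.717 shows that the translations are noticeably shorter than the references (ratio 0.751), so output length is one clear source of lost score in addition to the low higher-order n-gram precisions.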