# Problem 3: **Machine Translation** with **Transformer Model**

## Problem Description

In this problem, we will use a **Transformer model** to translate English sentences into Vietnamese. This is a challenging task that involves understanding the semantics of the English language and generating appropriate Vietnamese translations.

The Transformer is a powerful model in its own right, so we do not need pretrained word embedding vectors as we did with the LSTM.

## Requirements

1. **Load and Preprocess the Data**: The first step is to load the sentence pairs from the provided dataset in the files `Data/en_sents` and `Data/vi_sents`. These sentences should then be preprocessed into a format that our Transformer model can use. This preprocessing might involve steps such as tokenization, converting the tokens into integers, and padding the sequences to the same length.

2. **Model**: Implement a **Transformer model** for this task. The model should take as input the integer sequences representing the English sentences and output the predicted Vietnamese translations.

3. **Evaluation**: Evaluate the performance of the **Transformer model**. Use appropriate metrics for this evaluation, such as the BLEU score which is commonly used for machine translation tasks.

4. **Report Your Results**: After evaluating your model, report the results. This should include the performance of your model on the test set. Discuss any significant findings or observations from your results.


In [1]:
import pandas as pd
import spacy
import math
from tqdm import tqdm
import torch
from torch import nn
import lightning as pl
from typing import Iterable, List, Callable
from torch.utils.data import Dataset, DataLoader
from underthesea import sent_tokenize, text_normalize, word_tokenize
from torchmetrics.text import BLEUScore

In [2]:
def get_default_devices():
  # Pick GPU if available, else CPU
  if torch.cuda.is_available():
    print(f"Using device: {torch.cuda.get_device_name(0)}")
    return torch.device('cuda')
  elif torch.backends.mps.is_available():
    print('Using device: Apple ARM GPU')
    return torch.device('mps')  # Apple M1 GPU
  else:
    print("Using device: CPU")
    return torch.device('cpu')

In [3]:
device = get_default_devices()

Using device: NVIDIA GeForce RTX 3070 Ti


#### Load the Dataset

In [4]:
def read_text(en_file, vi_file):
    """
    Read text pairs from files, then build a dataframe with two columns: english and vietnamese
    """
    
    with open(en_file, 'r', encoding= 'utf-8') as f:
        en_lines = f.readlines()
        
    with open(vi_file, 'r', encoding= 'utf-8') as f:
        vi_lines = f.readlines()
        
    data = pd.DataFrame({'english': en_lines, 'vietnamese': vi_lines})
    
    return data

In [6]:
# Read the data file
df = read_text('Data/en_sents', 'Data/vi_sents')

In [7]:
# Print the first 5 rows
df.head()

|   | english | vietnamese |
|---|---------|------------|
| 0 | Please put the dustpan in the broom closet\n | xin vui lòng đặt người quét rác trong tủ chổi\n |
| 1 | Be quiet for a moment.\n | im lặng một lát\n |
| 2 | Read this\n | đọc này\n |
| 3 | Tom persuaded the store manager to give him ba... | tom thuyết phục người quản lý cửa hàng trả lại... |
| 4 | Friendship consists of mutual understanding\n | tình bạn bao gồm sự hiểu biết lẫn nhau\n |


In [8]:
print(f'We have total {len(df)} pairs of sentences.')

We have total 254090 pairs of sentences.


# Data Preprocessing

In this step, we will preprocess our data to make it suitable for our Transformer model.

For English, we will use the spaCy library, which is a powerful tool for natural language processing.

For Vietnamese, we will use the underthesea library, which is a Vietnamese Natural Language Processing Toolkit.

## Tokenizing Vietnamese Text with Underthesea

### Overview

Underthesea is a powerful toolkit for processing Vietnamese language data. It provides functionalities for various tasks such as tokenization, part-of-speech tagging, named entity recognition, and more.

### Usage

To use Underthesea for tokenizing Vietnamese text, follow these steps:

1. **Installation**: First, ensure that you have Underthesea installed. If not, you can install it using pip:

```bash
pip install underthesea
```

2. **Tokenization**: With Underthesea installed, you can tokenize Vietnamese text by simply calling the `word_tokenize` function on the text. Here is an example:

```python
from underthesea import word_tokenize

text = "Underthesea là thư viện xử lý ngôn ngữ tự nhiên Tiếng Việt."
tokens = word_tokenize(text)
```

In this example, `tokens` will be a list of tokens extracted from the input text.

Remember to always refer to the official documentation or repository for the most accurate and updated information.

In [9]:
# Download the models if necessary
if not spacy.util.is_package('en_core_web_md'):
    spacy.cli.download('en_core_web_md')
    
# Load the models
nlp_en = spacy.load('en_core_web_md')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [10]:
def tokenize_vi(text: str) -> List[str]:
    """
    Tokenize a Vietnamese text into a flat list of lowercased word tokens.
    
    Args:
        text (str): the input text
        
    Returns:
        List[str]: the word tokens of all sentences, in order
    """
    # Step 1: Sentence tokenization
    sentences = sent_tokenize(text)
    
    # Step 2: Text normalization with underthesea's text_normalize
    sentences = [text_normalize(sentence) for sentence in sentences]
    
    # Step 3: Word tokenization (multi-syllable words stay together as one token)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    
    # Flatten the per-sentence lists into a single token list
    tokens = [word for sentence in tokenized_sentences for word in sentence]
    
    # Lowercase all tokens
    return [word.lower() for word in tokens]

In [11]:
# Test the function
tokenize_vi('Tôi là sinh_viên trường Đại_học Bách_Khoa Hà_Nội. Tôi học ngành Khoa_học_máy_tính.')

['tôi',
 'là',
 'sinh_viên',
 'trường',
 'đại_học',
 'bách_khoa hà_nội',
 '.',
 'tôi',
 'học',
 'ngành',
 'khoa_học_máy_tính',
 '.']

In [12]:
def tokenize_en(sentence):
    return [tok.text for tok in nlp_en.tokenizer(sentence)]

In [13]:
sentence = "I love you."
tokenize_en(sentence)

['I', 'love', 'you', '.']

In [14]:
# Define the tokenizer for English
en_tokenizer = tokenize_en

# Define the tokenizer for Vietnamese
vi_tokenizer = tokenize_vi

In [15]:
class TranslationDataset(Dataset):
    def __init__(self, data, src_vocab, tgt_vocab, src_tokenizer, tgt_tokenizer):
        self.data = data
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        src, tgt = self.data.iloc[idx]
        src_tokens = self.src_tokenizer(src.lower())
        tgt_tokens = self.tgt_tokenizer(tgt.lower())
        

        # Add '<sos>' and '<eos>' tokens to the source and target sentences
        src_tokens = ['<sos>'] + src_tokens + ['<eos>']
        tgt_tokens = ['<sos>'] + tgt_tokens + ['<eos>']
        
        # Convert tokens to IDs using the vocabularies
        src_ids = [self.src_vocab[token] if token in self.src_vocab else self.src_vocab['<unk>'] for token in src_tokens]
        tgt_ids = [self.tgt_vocab[token] if token in self.tgt_vocab else self.tgt_vocab['<unk>'] for token in tgt_tokens]
        
        # Convert lists to tensors
        src_tensor = torch.tensor(src_ids, dtype=torch.long)
        tgt_tensor = torch.tensor(tgt_ids, dtype=torch.long)
        
        return src_tensor, tgt_tensor


In [16]:
def build_vocab(sentences, tokenizer):
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    idx = 4
    for sentence in sentences:
        for token in tokenizer(sentence):
            if token not in vocab:
                vocab[token] = idx
                idx += 1
    return vocab
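
The sample lookup later in this notebook maps the lowercased token `i` to index 3 (`<unk>`), and the printed source vocabulary contains a newline token. Both suggest a normalization mismatch: `build_vocab` sees the raw lines, while `TranslationDataset` lowercases text before lookup. A minimal sketch of one way to align the two is shown here. `build_vocab_normalized` is a hypothetical name, and whitespace splitting stands in for the spaCy and underthesea tokenizers so the example runs on its own.

```python
from typing import Callable, Dict, Iterable, List

SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def build_vocab_normalized(sentences: Iterable[str],
                           tokenizer: Callable[[str], List[str]]) -> Dict[str, int]:
    """Hypothetical variant of build_vocab that normalizes text before tokenizing."""
    vocab = {token: idx for idx, token in enumerate(SPECIALS)}
    for sentence in sentences:
        # Strip the trailing newline left by readlines() and lowercase,
        # matching the lowercasing applied at lookup time.
        for token in tokenizer(sentence.strip().lower()):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


# str.split is only a stand-in tokenizer for this illustration.
toy_vocab = build_vocab_normalized(["I love you\n", "I love tea\n"], str.split)
print(toy_vocab)
```

With this variant, a vocabulary built once would need to be rebuilt (and the model retrained) before the indices line up with an existing checkpoint.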

In [17]:
class TranslationDataModule(pl.LightningDataModule):
    def __init__(self, df, src_tokenizer, tgt_tokenizer, batch_size=32):
        super().__init__()
        self.df = df
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer
        self.batch_size = batch_size
        self.src_vocab = None
        self.tgt_vocab = None

    def setup(self, stage=None):
        if self.src_vocab is not None and self.tgt_vocab is not None:
            return
        
        # Build source and target vocabularies using the custom function
        self.src_vocab = build_vocab(self.df['english'], self.src_tokenizer)
        self.tgt_vocab = build_vocab(self.df['vietnamese'], self.tgt_tokenizer)

        # Create datasets
        translation_dataset = TranslationDataset(
            self.df[['english', 'vietnamese']], 
            self.src_vocab, self.tgt_vocab, 
            self.src_tokenizer, self.tgt_tokenizer
        )
        
        # Calculate the size of the training and validation sets
        train_size = int(len(translation_dataset) * 0.8)
        val_size = len(translation_dataset) - train_size
        
        # Split the dataset into training and validation sets
        self.train_dataset, self.val_dataset = torch.utils.data.random_split(translation_dataset, [train_size, val_size])
 
        
    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True, collate_fn = self.collate_fn)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False, collate_fn=self.collate_fn)

    def collate_fn(self, batch):
        src_batch, tgt_batch = zip(*batch)
        src_batch = torch.nn.utils.rnn.pad_sequence(src_batch, padding_value=self.src_vocab['<pad>'], batch_first=True)
        tgt_batch = torch.nn.utils.rnn.pad_sequence(tgt_batch, padding_value=self.tgt_vocab['<pad>'], batch_first=True)
        return src_batch, tgt_batch
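
To make the role of `collate_fn` concrete, here is a small plain-Python sketch of what batch padding does: every sequence is right-padded with the `<pad>` index up to the longest sequence in the batch, and a boolean mask records which positions are padding. The helper names `pad_batch` and `padding_mask` are hypothetical and only illustrate the idea; the notebook itself relies on `torch.nn.utils.rnn.pad_sequence`.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def padding_mask(padded: List[List[int]], pad_id: int = 0) -> List[List[bool]]:
    """True marks a padding position, which attention should ignore."""
    return [[token == pad_id for token in seq] for seq in padded]


# Two toy ID sequences of different lengths: <sos>=1, <eos>=2, <pad>=0.
batch = [[1, 4, 5, 6, 2], [1, 7, 2]]
padded = pad_batch(batch)
mask = padding_mask(padded)
print(padded)
print(mask)
```

The mask corresponds to the `src == self.pad_idx` comparison used for the key padding masks in the model's `forward` method.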


In [18]:
PAD_IDX = 0

In [19]:
class TranslationModel(pl.LightningModule):
    def __init__(self, src_vocab_size, tgt_vocab_size, pad_idx=None, d_model=512, nhead=8, 
                 num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.pad_idx = pad_idx
        self.d_model = d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead, num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True
        )
      
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
     
        self.pos_encoder = nn.Parameter(torch.zeros(1, 5000, d_model))  # Max seq len = 5000
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx) if pad_idx is not None else nn.CrossEntropyLoss()

        # Initialize weights
        nn.init.xavier_uniform_(self.src_embed.weight)
        nn.init.xavier_uniform_(self.tgt_embed.weight)
        nn.init.xavier_uniform_(self.fc_out.weight)


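    def forward(self, src, tgt):
        # Causal masks: position i may only attend to positions <= i.
        # Note: the decoder needs this mask; applying one to the source as well is
        # unusual, since an encoder normally attends to the whole sentence.
        src_mask = self.transformer.generate_square_subsequent_mask(src.size(1)).to(src.device)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)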

        src_padding_mask = (src == self.pad_idx) if self.pad_idx is not None else None
        tgt_padding_mask = (tgt == self.pad_idx) if self.pad_idx is not None else None

        src_embedded = self.src_embed(src) + self.pos_encoder[:, :src.size(1), :]
        tgt_embedded = self.tgt_embed(tgt) + self.pos_encoder[:, :tgt.size(1), :]

        output = self.transformer(
            src_embedded, tgt_embedded, 
            src_mask=src_mask, tgt_mask=tgt_mask, 
            memory_mask=None,
            src_key_padding_mask=src_padding_mask, tgt_key_padding_mask=tgt_padding_mask
        )

        return self.fc_out(output)


    def training_step(self, batch, batch_idx):
        src, tgt = batch
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]
        
        output = self(src, tgt_input)
        loss = self.loss_fn(output.reshape(-1, output.shape[-1]), tgt_output.reshape(-1))


        self.log('train_loss', loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        src, tgt = batch
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]
        
        output = self(src, tgt_input)
        loss = self.loss_fn(output.reshape(-1, output.shape[-1]), tgt_output.reshape(-1))


        self.log('val_loss', loss, prog_bar=True)
        return loss


    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)  # Lower initial LR
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=3)
        return {
            'optimizer': optimizer,
            'lr_scheduler': scheduler,
            'monitor': 'val_loss'
        }
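
Two ideas in `training_step` and `forward` are easier to see on a toy example: the one-position shift between decoder input and labels (teacher forcing), and the causal mask that hides future positions. The following is a plain-Python sketch under the assumption that `<sos>` is 1 and `<eos>` is 2, as in the vocabularies above. The function names are hypothetical.

```python
from typing import List, Tuple


def shift_for_teacher_forcing(tgt_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Decoder input drops the last token; the labels drop the first."""
    return tgt_ids[:-1], tgt_ids[1:]


def causal_mask(size: int) -> List[List[float]]:
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(size)]
            for row in range(size)]


tgt = [1, 18, 19, 2]  # <sos>, two word IDs, <eos>
decoder_input, labels = shift_for_teacher_forcing(tgt)
mask = causal_mask(len(decoder_input))

print(decoder_input)  # what the decoder sees
print(labels)         # what it must predict at each position
for row in mask:
    print(row)
```

At position `i` the decoder sees `decoder_input[: i + 1]` and is trained to predict `labels[i]`, so the mask keeps it from reading the answer.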

In [20]:
# Train and evaluate the model

# Assuming the DataModule has been set up correctly

data_module = TranslationDataModule(df, en_tokenizer, vi_tokenizer, batch_size=32)

# Call the setup method to initialize vocabularies
data_module.setup()

# Access the vocabularies from the data module
src_vocab = data_module.src_vocab
tgt_vocab = data_module.tgt_vocab

# Get the vocabulary sizes
src_vocab_size = len(src_vocab)
tgt_vocab_size = len(tgt_vocab)


### Save src_vocab and tgt_vocab as .pkl files for convenient loading

In [21]:
import pickle

with open("src_vocab.pkl", "wb") as f:
    pickle.dump(src_vocab, f)

with open("tgt_vocab.pkl", "wb") as f:
    pickle.dump(tgt_vocab, f)
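
For completeness, here is a minimal sketch of the reverse step, reading a pickled vocabulary back and inverting it for ID-to-token lookup. It uses a small toy dictionary and a temporary directory so that it does not touch the files written above. The names `toy_vocab` and `restored` are illustrative only.

```python
import os
import pickle
import tempfile

toy_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "xin": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vocab.pkl")

    # Save the dictionary exactly as the cell above does.
    with open(path, "wb") as f:
        pickle.dump(toy_vocab, f)

    # Load it back in a later session.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Invert the mapping to turn predicted IDs back into tokens.
id_to_word = {idx: word for word, idx in restored.items()}
print(restored == toy_vocab, id_to_word[4])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.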

In [22]:
sentence = "I love learning new languages"
tokens = en_tokenizer(sentence.lower())
print(tokens)  # Check if tokens make sense
print([src_vocab.get(t, src_vocab['<unk>']) for t in tokens])  # Check IDs

['i', 'love', 'learning', 'new', 'languages']
[3, 296, 965, 283, 1333]


In [23]:
for key, value in src_vocab.items():
    print(f"Word (Key): {key}, Index (Value): {value}")

Word (Key): <pad>, Index (Value): 0
Word (Key): <sos>, Index (Value): 1
Word (Key): <eos>, Index (Value): 2
Word (Key): <unk>, Index (Value): 3
Word (Key): Please, Index (Value): 4
Word (Key): put, Index (Value): 5
Word (Key): the, Index (Value): 6
Word (Key): dustpan, Index (Value): 7
Word (Key): in, Index (Value): 8
Word (Key): broom, Index (Value): 9
Word (Key): closet, Index (Value): 10
Word (Key): 
, Index (Value): 11
Word (Key): Be, Index (Value): 12
Word (Key): quiet, Index (Value): 13
Word (Key): for, Index (Value): 14
Word (Key): a, Index (Value): 15
Word (Key): moment, Index (Value): 16
Word (Key): ., Index (Value): 17
Word (Key): Read, Index (Value): 18
Word (Key): this, Index (Value): 19
Word (Key): Tom, Index (Value): 20
Word (Key): persuaded, Index (Value): 21
Word (Key): store, Index (Value): 22
Word (Key): manager, Index (Value): 23
Word (Key): to, Index (Value): 24
Word (Key): give, Index (Value): 25
Word (Key): him, Index (Value): 26
Word (Key): back, Index (Value): 2

In [23]:
for k, v in tgt_vocab.items():
  print(f"Word (Key): {k}, Index (Value): {v}")

Word (Key): <pad>, Index (Value): 0
Word (Key): <sos>, Index (Value): 1
Word (Key): <eos>, Index (Value): 2
Word (Key): <unk>, Index (Value): 3
Word (Key): xin, Index (Value): 4
Word (Key): vui lòng, Index (Value): 5
Word (Key): đặt, Index (Value): 6
Word (Key): người, Index (Value): 7
Word (Key): quét, Index (Value): 8
Word (Key): rác, Index (Value): 9
Word (Key): trong, Index (Value): 10
Word (Key): tủ, Index (Value): 11
Word (Key): chổi, Index (Value): 12
Word (Key): im lặng, Index (Value): 13
Word (Key): một, Index (Value): 14
Word (Key): lát, Index (Value): 15
Word (Key): đọc, Index (Value): 16
Word (Key): này, Index (Value): 17
Word (Key): tom, Index (Value): 18
Word (Key): thuyết phục, Index (Value): 19
Word (Key): người quản lý, Index (Value): 20
Word (Key): cửa hàng, Index (Value): 21
Word (Key): trả, Index (Value): 22
Word (Key): lại, Index (Value): 23
Word (Key): tiền, Index (Value): 24
Word (Key): cho, Index (Value): 25
Word (Key): anh, Index (Value): 26
Word (Key): ta, Ind

In [24]:
# Print some key-value pairs from tgt_vocab
for i, (word, idx) in enumerate(tgt_vocab.items()):
    print(f"{word}: {idx}")
    if i == 10:  # Print only first 10 words
        break

# Print tokenized Vietnamese sentences with their indexed values
for sentence in df['vietnamese'][:5]:  
    tokens = tokenize_vi(sentence)
    indexed_tokens = [tgt_vocab.get(word, tgt_vocab['<unk>']) for word in tokens]
    print("Tokens:", tokens)
    print("Indexed Tokens:", indexed_tokens)


<pad>: 0
<sos>: 1
<eos>: 2
<unk>: 3
xin: 4
vui lòng: 5
đặt: 6
người: 7
quét: 8
rác: 9
trong: 10
Tokens: ['xin', 'vui lòng', 'đặt', 'người', 'quét', 'rác', 'trong', 'tủ', 'chổi']
Indexed Tokens: [4, 5, 6, 7, 8, 9, 10, 11, 12]
Tokens: ['im lặng', 'một', 'lát']
Indexed Tokens: [13, 14, 15]
Tokens: ['đọc', 'này']
Indexed Tokens: [16, 17]
Tokens: ['tom', 'thuyết phục', 'người quản lý', 'cửa hàng', 'trả', 'lại', 'tiền', 'cho', 'anh', 'ta', '.']
Indexed Tokens: [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]
Tokens: ['tình', 'bạn', 'bao gồm', 'sự', 'hiểu biết', 'lẫn', 'nhau']
Indexed Tokens: [29, 30, 31, 32, 33, 34, 35]


In [25]:
print("<sos> token index:", tgt_vocab.get("<sos>", "Not Found"))

<sos> token index: 1


In [26]:
pad_idx = src_vocab["<pad>"]

In [27]:
# Now you can initialize the model
model = TranslationModel(
    src_vocab_size=src_vocab_size, 
    tgt_vocab_size=tgt_vocab_size,
    pad_idx = pad_idx, 
    d_model=512,               
    nhead=8,                   
    num_encoder_layers=6,      
    num_decoder_layers=6,      
    dim_feedforward=2048,      
    dropout=0.1                
)

# Set the padding index in the model
# model.set_padding_index(data_module.src_vocab['<pad>'])

In [24]:
type(src_vocab)

dict

In [29]:
# Early stopping callback
early_stop_callback = pl.pytorch.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min', verbose=True)

# Model checkpoint callback
checkpoint_callback = pl.pytorch.callbacks.ModelCheckpoint(monitor='val_loss', mode='min', verbose=True)

In [30]:
# Create the trainer
trainer = pl.Trainer(max_epochs=10, devices=1, callbacks=[early_stop_callback, checkpoint_callback])

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
e:\Translator\nlp_translator_env\lib\site-packages\lightning\pytorch\trainer\connectors\logger_connector\logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default


In [31]:
trainer.fit(model,data_module)

You are using a CUDA device ('NVIDIA GeForce RTX 3070 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type             | Params | Mode 
----------------------------------------------------------
0 | transformer  | Transformer      | 44.1 M | train
1 | src_embed    | Embedding        | 11.7 M | train
2 | tgt_embed    | Embedding        | 7.7 M  | train
3 | fc_out       | Linear           | 7.7 M  | train
4 | loss_fn      | CrossEntropyLoss | 0      | train
  | other params | n/a              | 2.6 M  | n/a  
----------------------------------------------------------
73.8 M    Trainable params
0         Non-trainable params
73.8 M    Total params
295.212   Total esti

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

e:\Translator\nlp_translator_env\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=23` in the `DataLoader` to improve performance.
e:\Translator\nlp_translator_env\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=23` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved. New best score: 1.276
Epoch 0, global step 6353: 'val_loss' reached 1.27610 (best 1.27610), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=0-step=6353.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.330 >= min_delta = 0.0. New best score: 0.946
Epoch 1, global step 12706: 'val_loss' reached 0.94640 (best 0.94640), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=1-step=12706.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.124 >= min_delta = 0.0. New best score: 0.822
Epoch 2, global step 19059: 'val_loss' reached 0.82241 (best 0.82241), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=2-step=19059.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.056 >= min_delta = 0.0. New best score: 0.766
Epoch 3, global step 25412: 'val_loss' reached 0.76599 (best 0.76599), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=3-step=25412.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.037 >= min_delta = 0.0. New best score: 0.729
Epoch 4, global step 31765: 'val_loss' reached 0.72918 (best 0.72918), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=4-step=31765.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.018 >= min_delta = 0.0. New best score: 0.711
Epoch 5, global step 38118: 'val_loss' reached 0.71073 (best 0.71073), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=5-step=38118.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.016 >= min_delta = 0.0. New best score: 0.695
Epoch 6, global step 44471: 'val_loss' reached 0.69517 (best 0.69517), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=6-step=44471.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.010 >= min_delta = 0.0. New best score: 0.685
Epoch 7, global step 50824: 'val_loss' reached 0.68489 (best 0.68489), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=7-step=50824.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.007 >= min_delta = 0.0. New best score: 0.678
Epoch 8, global step 57177: 'val_loss' reached 0.67817 (best 0.67817), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=8-step=57177.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved by 0.007 >= min_delta = 0.0. New best score: 0.671
Epoch 9, global step 63530: 'val_loss' reached 0.67141 (best 0.67141), saving model to 'e:\\Translator\\nlp_translator_env\\lightning_logs\\version_11\\checkpoints\\epoch=9-step=63530.ckpt' as top 1
`Trainer.fit` stopped: `max_epochs=10` reached.


In [32]:
# Evaluate the model
trainer.validate(model, data_module)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validation: |          | 0/? [00:00<?, ?it/s]

[{'val_loss': 0.6714078783988953}]

In [28]:
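# The model does not save its hyperparameters, so every constructor argument
# (including pad_idx) has to be passed again when loading the checkpoint.
transformer_model = TranslationModel.load_from_checkpoint(
    "./lightning_logs/version_4/checkpoints/epoch=9-step=63530.ckpt",
    src_vocab_size=src_vocab_size, 
    tgt_vocab_size=tgt_vocab_size, 
    pad_idx=pad_idx,
    d_model=512,               
    nhead=8,                   
    num_encoder_layers=6,      
    num_decoder_layers=6,      
    dim_feedforward=2048,      
    dropout=0.1        
)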

In [29]:
import os

checkpoint_dir = "./lightning_logs/version_4/checkpoints"
checkpoints = [f for f in os.listdir(checkpoint_dir) if f.endswith(".ckpt")]
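# Sort by modification time; a plain name sort would place "epoch=9" after "epoch=10"
checkpoints.sort(key=lambda f: os.path.getmtime(os.path.join(checkpoint_dir, f)))
latest_checkpoint = os.path.join(checkpoint_dir, checkpoints[-1])  # Most recently written checkpoint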

print("Latest Checkpoint:", latest_checkpoint)

Latest Checkpoint: ./lightning_logs/version_4/checkpoints\epoch=9-step=63530.ckpt


In [30]:
import shutil

# Define destination
destination_path = "translator_model.ckpt"

# Copy checkpoint to custom path
shutil.copy(latest_checkpoint, destination_path)

print(f"Checkpoint saved as {destination_path}")

Checkpoint saved as translator_model.ckpt


In [31]:
def translate_sentence(sentence, model, src_vocab, tgt_vocab, src_tokenizer, device='cuda', max_len=50):
    model.eval()  # Set the model to evaluation mode
    
    if device == 'cuda' and not torch.cuda.is_available():
        device = 'cpu'
    model = model.to(device)  # Keep the model on the same device as the input tensors
    
    # Tokenize and add <sos> and <eos>
    tokens = ['<sos>'] + src_tokenizer(sentence.lower()) + ['<eos>']
    src_ids = [src_vocab.get(token, src_vocab['<unk>']) for token in tokens]
    
    # Convert to tensor and move to device
    src_tensor = torch.tensor(src_ids, dtype=torch.long, device=device).unsqueeze(0)
    
    # Initialize target sequence with <sos>
    tgt_ids = [tgt_vocab['<sos>']]
    tgt_tensor = torch.tensor([tgt_ids], dtype=torch.long, device=device)  # (1, 1)
    
    # Invert tgt_vocab for ID-to-token mapping
    id_to_word = {idx: word for word, idx in tgt_vocab.items()}
    
    # Inference loop
    with torch.no_grad():
        for _ in range(max_len):
            # The model builds its own causal and padding masks inside forward()
            output = model(src_tensor, tgt_tensor)
            
            next_token_id = output[:, -1].argmax(dim=-1).item()
            tgt_ids.append(next_token_id)
            
            if next_token_id == tgt_vocab['<eos>']:
                break
            
            # Append next token correctly
            next_token = torch.tensor([[next_token_id]], dtype=torch.long, device=device)
            tgt_tensor = torch.cat([tgt_tensor, next_token], dim=1)
    
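    # Drop <sos>, and <eos> if decoding stopped on it (it is absent when max_len is hit)
    out_ids = tgt_ids[1:]
    if out_ids and out_ids[-1] == tgt_vocab['<eos>']:
        out_ids = out_ids[:-1]
    
    # Convert IDs to tokens
    translated_sentence = ' '.join(id_to_word.get(i, '<unk>') for i in out_ids)
    
    return translated_sentence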
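
The loop in `translate_sentence` is greedy decoding: at each step it keeps only the single highest-scoring next token and stops at `<eos>` or after `max_len` steps. The sketch below isolates that control flow. The function `toy_score_next` is a hypothetical stand-in for the Transformer (it simply follows a fixed script), so the example runs without a trained model.

```python
from typing import Callable, Dict, List


def greedy_decode(score_next: Callable[[List[int]], Dict[int, float]],
                  sos_id: int, eos_id: int, max_len: int = 50) -> List[int]:
    """Append the highest-scoring next token until <eos> or max_len is reached."""
    generated = [sos_id]
    for _ in range(max_len):
        scores = score_next(generated)
        next_id = max(scores, key=scores.get)  # argmax over candidate token IDs
        if next_id == eos_id:
            break
        generated.append(next_id)
    return generated[1:]  # drop <sos>; <eos> is never appended


def toy_score_next(prefix: List[int]) -> Dict[int, float]:
    """Hypothetical stand-in for the model: emits 4, 5, 6 and then <eos> (ID 2)."""
    script = {1: 4, 4: 5, 5: 6, 6: 2}
    preferred = script[prefix[-1]]
    return {token_id: (1.0 if token_id == preferred else 0.0) for token_id in range(7)}


decoded = greedy_decode(toy_score_next, sos_id=1, eos_id=2)
truncated = greedy_decode(toy_score_next, sos_id=1, eos_id=2, max_len=2)
print(decoded)
print(truncated)
```

Greedy decoding is simple and fast, but it can miss better translations that begin with a slightly less likely token; beam search is a common alternative.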

In [32]:
# Usage
english_sentence = "How many people in your family"
translated_sentence = translate_sentence(
    english_sentence,
    transformer_model,
    src_vocab,
    tgt_vocab,
    en_tokenizer,
    device='cuda'
)
print(translated_sentence)

có bao nhiêu người trong gia đình bạn


In [33]:
# Calculate the BLEU score
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Ensure NLTK is properly initialized
nltk.download('punkt')

def calculate_bleu(reference_sentences, translated_sentence):
    """
    Calculate the BLEU score for a translated sentence against reference sentences.
    
    Args:
    - reference_sentences (list of list of str): A list of reference translations (each reference is a list of tokens).
    - translated_sentence (list of str): The translated sentence (a list of tokens).
    
    Returns:
    - bleu_score (float): The calculated BLEU score.
    """
    # Smoothing function to avoid zero scores for short translations
    smoothing_fn = SmoothingFunction().method4
    
    # Calculate the BLEU score using NLTK's sentence_bleu
    bleu_score = sentence_bleu(reference_sentences, translated_sentence, smoothing_function=smoothing_fn)
    
    return bleu_score

# Example usage
reference = [["this", "is", "a", "test"], ["this", "is", "test"]]
translated = ["this", "is", "a", "test"]

bleu_score = calculate_bleu(reference, translated)
print(f"BLEU score: {bleu_score:.2f}")


BLEU score: 1.00


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admins\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
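
The example above scores a single hand-written sentence. To evaluate a translation model, BLEU is usually computed over a whole set of held-out pairs. The following is a plain-Python sketch of unsmoothed corpus-level BLEU with one reference per hypothesis, meant to show how the metric is assembled (clipped n-gram precision, geometric mean, brevity penalty). The function `corpus_bleu_sketch` is a hypothetical name, and the token lists are toy data, not model output. In practice the hypotheses would come from something like `translate_sentence` applied to validation rows, and a library implementation would be preferable.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(references: List[List[str]],
                       hypotheses: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU with exactly one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


refs = [["this", "is", "a", "small", "test"], ["read", "this", "book", "right", "now"]]
perfect = corpus_bleu_sketch(refs, refs)
shorter = corpus_bleu_sketch([refs[0]], [refs[0][:4]])
print(f"perfect: {perfect:.2f}, shorter hypothesis: {shorter:.4f}")
```

Aggregating counts over the whole corpus before taking the geometric mean is what distinguishes corpus BLEU from an average of per-sentence scores.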


Remarks
- The model achieves a fairly good BLEU score. However, when an English sentence is given as input for prediction, the results are not as expected.