# Machine Translation with Seq2Seq Model

In this notebook, we will explore the fascinating world of **Machine Translation** using a **Seq2Seq model**. Our goal is to build a model that can translate text from one language to another.

Key Highlights:

1. **Seq2Seq Model**: We will be using a Sequence-to-Sequence model, a type of model that converts an input sequence into an output sequence. It's widely used in tasks such as machine translation, speech recognition, and more.

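2. **Beam Search**: To improve the quality of our translations, we will implement Beam Search, a heuristic search algorithm that keeps only the `k` highest-scoring partial translations at each decoding step instead of committing to the single most likely next word.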

3. **BLEU Score**: To evaluate the performance of our model, we will use the Bilingual Evaluation Understudy (BLEU) score. It's a popular metric for machine translation that measures the n-gram overlap between the translated text and one or more reference translations.
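
To make the beam search idea concrete before we reach the model, here is a minimal sketch on a toy problem. The `TOY_MODEL` table and its probabilities are hypothetical and only stand in for a real decoder; the point is to show how a wider beam can find a more probable sentence than greedy decoding.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical next-token probabilities, keyed by the last generated token.
# A real model would compute these from the decoder state instead.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<sos>": {"le": 0.6, "un": 0.4},
    "le": {"chat": 0.5, "chien": 0.4, "<eos>": 0.1},
    "un": {"chat": 0.9, "<eos>": 0.1},
    "chat": {"<eos>": 1.0},
    "chien": {"<eos>": 1.0},
}


def beam_search(beam_width: int, max_len: int = 5) -> List[Tuple[List[str], float]]:
    """Return finished hypotheses as (tokens, log-probability), best first."""
    beams = [(["<sos>"], 0.0)]
    finished = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_width highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == "<eos>" else beams).append((tokens, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)


greedy = beam_search(beam_width=1)  # a beam of width 1 is greedy decoding
wide = beam_search(beam_width=2)
print(greedy[0][0])  # ['<sos>', 'le', 'chat', '<eos>']
print(wide[0][0])    # ['<sos>', 'un', 'chat', '<eos>']
```

In this toy table, greedy decoding commits to "le" because it is the most likely first word, while a beam of width 2 keeps "un" alive and ends up with the more probable full sentence.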

Stay tuned as we dive into the code and unravel the intricacies of machine translation!

# Data Source

The data we will be using for this project is the **Bilingual Sentence Pairs** dataset, which can be found at the following link:

[https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs](https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs)

This dataset contains pairs of sentences in different languages, making it an excellent resource for our machine translation task.

In [None]:
import pandas as pd
import spacy
from lightning.pytorch.utilities.types import TRAIN_DATALOADERS
from tqdm import tqdm
import torch
from torch import nn
import lightning as pl
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vocab, GloVe
from gensim.models import KeyedVectors
from typing import Iterable, List, Callable
from torch.utils.data import Dataset, DataLoader
from torchmetrics.text import BLEUScore

#### Load the Dataset

In [3]:
def read_text(file_name: str) -> pd.DataFrame:
    """
    The data file contains multiple lines of text. Each line contains a pair of sentences and attribution information.
    The three parts are separated by tab characters. This function reads the data file and returns a data frame.
    
    Args:
        file_name (str): the name of the data file
        
    Returns:
        pd.DataFrame: a data frame containing the data
    """
    
    # Read each line, split it by tab characters, and store the result in a list
    # The file contains accented French characters, so we read it explicitly as UTF-8
    with open(file_name, 'r', encoding='utf-8') as f:
        lines = [line.strip().split('\t') for line in f]
        
    # Drop empty or malformed lines (anything that does not have exactly three fields)
    lines = [line for line in lines if len(line) == 3]
    
    # Convert the list to a data frame
    df = pd.DataFrame(lines, columns=['english', 'french', 'attribution'])
    
    return df
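
As a quick illustration of the parsing logic, the sketch below applies the same split-and-filter steps to a few hypothetical lines. The `<attribution>` placeholder stands in for the real attribution text, which is not reproduced here.

```python
# Hypothetical lines in the same tab-separated layout as the data file
raw_lines = [
    "Go.\tVa !\t<attribution>\n",
    "\n",  # an empty line, which should be dropped
    "Hi.\tSalut !\t<attribution>\n",
]

# Split each line into its tab-separated fields
rows = [line.strip().split("\t") for line in raw_lines]

# Keep only rows with exactly three fields
rows = [row for row in rows if len(row) == 3]

print(rows)
```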

In [4]:
# Load the dataset
df = read_text('Data/fra.txt')

In [5]:
# Print the first 5 rows
df.head()

Unnamed: 0,english,french,attribution
0,Go.,Va !,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1,Go.,Marche.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
2,Go.,Bouge !,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
3,Hi.,Salut !,CC-BY 2.0 (France) Attribution: tatoeba.org #5...
4,Hi.,Salut.,CC-BY 2.0 (France) Attribution: tatoeba.org #5...


In [6]:
print(f'We have total {len(df)} sentence pairs in the dataset.')

We have total 185583 sentence pairs in the dataset.


# Data Preprocessing

In this step, we will preprocess our data to make it suitable for our Seq2Seq model. We will use the spaCy library, which is a powerful tool for natural language processing. Specifically, we will use two models from spaCy: one for English and one for French.

The preprocessing steps include:
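1. Removing non-alphabetic tokens: We keep only the tokens for which spaCy's `is_alpha` is true. This removes punctuation, which can introduce unnecessary complexity into our model. It also removes digits and contraction pieces such as `n't`, so "can't" is reduced to "ca".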
2. Converting to lower case: This ensures that our model does not treat the same word in different cases as different words.

In [7]:
# Download the spaCy model for English and French
if not spacy.util.is_package('en_core_web_md'):
    spacy.cli.download('en_core_web_md')
    
if not spacy.util.is_package('fr_core_news_md'):
    spacy.cli.download('fr_core_news_md')

In [8]:
# Load the spaCy models
nlp_en = spacy.load('en_core_web_md')
nlp_fr = spacy.load('fr_core_news_md')

In [9]:
# Print random 5 rows
df.sample(5, random_state=42)

Unnamed: 0,english,french,attribution
179927,I can't concentrate if you keep tapping me on ...,Je ne peux pas me concentrer si tu continues d...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
50731,Where are your things?,Où sont tes affaires ?,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
113343,Tom has too many strange ideas.,Tom a trop d'idées étranges.,CC-BY 2.0 (France) Attribution: tatoeba.org #5...
46173,How is your new class?,Comment est ta nouvelle classe ?,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
150522,Such a thing has never happened before.,C'est du jamais-vu.,CC-BY 2.0 (France) Attribution: tatoeba.org #8...


In [10]:
# Register the tqdm function with pandas to show a progress bar when applying the function to a dataframe
tqdm.pandas()

# Clean the English text
df['english'] = df['english'].progress_apply(
    lambda x: ' '.join([token.text.lower() for token in nlp_en.tokenizer(x) if token.is_alpha])
)

# Clean the French text
df['french'] = df['french'].progress_apply(
    lambda x: ' '.join([token.text.lower() for token in nlp_fr.tokenizer(x) if token.is_alpha])
)

100%|██████████| 185583/185583 [00:02<00:00, 63153.13it/s]
100%|██████████| 185583/185583 [00:05<00:00, 32466.15it/s]


In [11]:
# Print random 5 rows
df.sample(5, random_state=42)

Unnamed: 0,english,french,attribution
179927,i ca concentrate if you keep tapping me on the...,je ne peux pas me concentrer si tu continues d...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
50731,where are your things,où sont tes affaires,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
113343,tom has too many strange ideas,tom a trop idées étranges,CC-BY 2.0 (France) Attribution: tatoeba.org #5...
46173,how is your new class,comment est ta nouvelle classe,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
150522,such a thing has never happened before,est du jamais vu,CC-BY 2.0 (France) Attribution: tatoeba.org #8...


# TorchText

[TorchText](https://pytorch.org/text/stable/index.html) is a PyTorch package that makes text processing easier and more convenient. It provides essential tools for preprocessing text data, including tokenization, building vocabulary, and batching data for input into a model.

In this notebook, we will use TorchText for the following tasks:

1. **Tokenization**: We will use the `get_tokenizer` function to create a tokenizer that splits our sentences into tokens (words).

2. **Building Vocabulary**: We will use the `build_vocab_from_iterator` function to create a vocabulary from our dataset. This vocabulary will map each token to a unique integer, which our model can work with.

3. **Text to Integer Sequence Conversion**: We will create a custom function `text_transform` that uses our vocabulary to convert our sentences into sequences of integers.

4. **Batching**: We will pad every sequence to a fixed length ourselves and then use PyTorch's `DataLoader` to create batches of our data.

By using TorchText, we can greatly simplify the preprocessing of our text data and ensure that it is done in a way that is optimal for our PyTorch model.


In [12]:
# Define the tokenizer
en_tokenizer = get_tokenizer('spacy', language='en_core_web_md')
fr_tokenizer = get_tokenizer('spacy', language='fr_core_news_md')

The `yield_tokens` function is a generator function that tokenizes text data from an iterable (like a list or a DataFrame column) and yields one list of tokens per text.

Here's how it works:

1. The function takes two arguments: `data_iter`, which is an iterable of text data, and `tokenizer`, which is a callable (like a function) that takes a string and returns a list of tokens.

2. The function iterates over `data_iter`.

3. It applies the `tokenizer` to each `text`, which splits the text into tokens.

4. It then yields that list of tokens. Because it's a generator function, it doesn't tokenize the whole dataset at once but produces one text's tokens at a time. This is memory-efficient when dealing with large amounts of text data.

The reason for using this function is to create a stream of tokens from the text data. These tokens are used to build a vocabulary for text processing. The vocabulary maps each unique token to a unique integer, which can be used as input to a machine learning model.

In [13]:
def yield_tokens(data_iter: Iterable, tokenizer: Callable[[str], List[str]]) -> Iterable[List[str]]:
    """
    Yield one list of tokens for each text in the data iterator.
    
    Args:
        data_iter (Iterable): the data iterator
        tokenizer (Callable[[str], List[str]]): the tokenizer
        
    Yields:
        List[str]: the tokens of one text
    """
    
    for text in data_iter:
        yield tokenizer(text)
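
The lazy behaviour of the generator can be seen in a small sketch. The function is repeated here so the snippet runs on its own, and `str.split` is used as a stand-in for the spaCy tokenizer (an assumption made for brevity).

```python
from typing import Callable, Iterable, Iterator, List


def yield_tokens(data_iter: Iterable[str], tokenizer: Callable[[str], List[str]]) -> Iterator[List[str]]:
    """Same logic as above, repeated so this snippet is self-contained."""
    for text in data_iter:
        yield tokenizer(text)


sentences = ["where are your things", "how is your new class"]

# str.split stands in for the spaCy tokenizer
token_stream = yield_tokens(sentences, str.split)

first = next(token_stream)  # only the first sentence has been tokenized so far
rest = list(token_stream)   # consume the remaining sentences

print(first)  # ['where', 'are', 'your', 'things']
print(rest)   # [['how', 'is', 'your', 'new', 'class']]
```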

The `build_vocab_from_iterator` function in TorchText is used to build a vocabulary from an iterator that yields lists (or iterators) of tokens.

Here's a step-by-step explanation with a visualization example:

1. The function takes an iterator of tokenized text data. This iterator could be a list of sentences, where each sentence is a list of tokens.

2. The function iterates over this iterator, and for each list of tokens, it adds each token to the vocabulary.

3. The vocabulary is essentially a dictionary where each unique token is a key and the corresponding value is a unique integer. Special tokens come first, and the remaining tokens are ordered by descending frequency, so the most common words get the smallest indices.

4. The function returns this vocabulary.

Here's a visualization example:

Suppose we have the following tokenized text data:

```
[
    ['I', 'love', 'coding'],
    ['coding', 'is', 'fun'],
    ['I', 'love', 'AI']
]
```

Here `I`, `love`, and `coding` each appear twice, and the other tokens appear once. The `build_vocab_from_iterator` function will build a vocabulary like the following from this data (the order among equally frequent tokens may depend on the TorchText version; here ties are broken alphabetically):

```
{
    'I': 0,
    'coding': 1,
    'love': 2,
    'AI': 3,
    'fun': 4,
    'is': 5
}
```

Note: The actual integer values may be different depending on the special tokens you add to the vocabulary (like `<unk>`, `<pad>`, `<sos>`, and `<eos>`), but the concept is the same.
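
The following pure-Python sketch mimics this behaviour with a `Counter`. It is a simplified stand-in, not TorchText's implementation: the helper name `build_vocab` is hypothetical, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_vocab(token_lists: Iterable[List[str]], specials: List[str]) -> Dict[str, int]:
    """Map tokens to ids: specials first, then by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Assumption: tokens with the same frequency are ordered alphabetically
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(specials + ordered)}


data = [["I", "love", "coding"], ["coding", "is", "fun"], ["I", "love", "AI"]]
vocab = build_vocab(data, specials=["<pad>", "<sos>", "<eos>", "<unk>"])

print(vocab["<pad>"], vocab["I"], vocab["coding"], vocab["love"])  # 0 4 5 6
```

Because the four special tokens occupy indices 0 to 3, every regular token is shifted by four compared with a vocabulary built without them.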


In [14]:
# Build the English and French vocabularies
en_vocab = build_vocab_from_iterator(yield_tokens(df['english'], en_tokenizer), specials=['<pad>', '<sos>', '<eos>', '<unk>'])
fr_vocab = build_vocab_from_iterator(yield_tokens(df['french'], fr_tokenizer), specials=['<pad>', '<sos>', '<eos>', '<unk>'])

# Default index is the index of <unk>
en_vocab.set_default_index(en_vocab['<unk>'])
fr_vocab.set_default_index(fr_vocab['<unk>'])

In [15]:
# Print the size of the vocabularies and some first tokens
print(f'English vocabulary size: {len(en_vocab)}, first 10 tokens: {list(en_vocab.get_itos())[:10]}')
print(f'French vocabulary size: {len(fr_vocab)}, first 10 tokens: {list(fr_vocab.get_itos())[:10]}')

English vocabulary size: 14372, first 10 tokens: ['<pad>', '<sos>', '<eos>', '<unk>', 'i', 'you', 'to', 'the', 'a', 'do']
French vocabulary size: 24666, first 10 tokens: ['<pad>', '<sos>', '<eos>', '<unk>', 'je', 'de', 'pas', 'est', 'que', 'à']


In [16]:
def text_transform(vocab: Vocab, tokenizer: Callable[[str], List[str]], text: str) -> List[int]:
    """
    Transform a text into a list of integers.
    
    Args:
        vocab (Vocab): the vocabulary
        tokenizer (Callable[[str], List[str]]): the tokenizer
        text (str): the input text
        
    Returns:
        List[int]: the list of integers
    """
    
    return [vocab[token] for token in tokenizer(text)]
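
`text_transform` relies on the vocabulary's default index to map unseen tokens to `<unk>`. The sketch below shows the same idea with a plain dictionary, and also wraps the result in `<sos>` and `<eos>` markers, a common step when preparing sequences for a decoder. The toy vocabulary, its ids, and the `encode` helper are hypothetical, and `str.split` stands in for the spaCy tokenizer.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; the ids are made up for illustration
toy_vocab: Dict[str, int] = {
    "<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3,
    "where": 4, "are": 5, "you": 6,
}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Wrap the token ids in <sos>/<eos>; unseen tokens fall back to <unk>."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.split()]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


print(encode("where are you", toy_vocab))     # [1, 4, 5, 6, 2]
print(encode("where are zebras", toy_vocab))  # [1, 4, 5, 3, 2]
```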

In [17]:
# Define the text transforms by adding <sos> and <eos> tokens, and converting the text to a list of integers
text_transform_en = lambda text: [en_vocab['<sos>']] + text_transform(en_vocab, en_tokenizer, text) + [en_vocab['<eos>']]
text_transform_fr = lambda text: [fr_vocab['<sos>']] + text_transform(fr_vocab, fr_tokenizer, text) + [fr_vocab['<eos>']]

# Transform the English and French sentences
df['english_transform'] = df['english'].progress_apply(text_transform_en)
df['french_transform'] = df['french'].progress_apply(text_transform_fr)

100%|██████████| 185583/185583 [00:02<00:00, 80292.56it/s]
100%|██████████| 185583/185583 [00:03<00:00, 52662.03it/s]


In [18]:
# Print random 5 rows
df.sample(5, random_state=42)

Unnamed: 0,english,french,attribution,english_transform,french_transform
179927,i ca concentrate if you keep tapping me on the...,je ne peux pas me concentrer si tu continues d...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...,"[1, 4, 50, 1937, 63, 5, 178, 9793, 19, 34, 7, ...","[1, 4, 10, 57, 6, 27, 1811, 49, 15, 5663, 5, 2..."
50731,where are your things,où sont tes affaires,CC-BY 2.0 (France) Attribution: tatoeba.org #1...,"[1, 86, 23, 26, 208, 2]","[1, 75, 51, 185, 627, 2]"
113343,tom has too many strange ideas,tom a trop idées étranges,CC-BY 2.0 (France) Attribution: tatoeba.org #5...,"[1, 11, 62, 96, 129, 656, 971, 2]","[1, 16, 17, 109, 1007, 3896, 2]"
46173,how is your new class,comment est ta nouvelle classe,CC-BY 2.0 (France) Attribution: tatoeba.org #3...,"[1, 41, 10, 26, 144, 539, 2]","[1, 77, 7, 129, 302, 582, 2]"
150522,such a thing has never happened before,est du jamais vu,CC-BY 2.0 (France) Attribution: tatoeba.org #8...,"[1, 306, 8, 216, 62, 91, 172, 145, 2]","[1, 7, 42, 73, 139, 2]"


In [19]:
# Let's check the maximum length of the English and French sentences
en_max_len = df['english_transform'].apply(len).max()
fr_max_len = df['french_transform'].apply(len).max()

In [20]:
print(f'Maximum length of English sentences: {en_max_len}')
print(f'Maximum length of French sentences: {fr_max_len}')

Maximum length of English sentences: 46
Maximum length of French sentences: 57


The `pad_sequence` function is used to ensure that all sequences in a batch have the same length by padding shorter sequences with a specific value, usually 0.

Here's how it works:

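1. The function takes four arguments: `sequence`, which is a list of integers representing a sequence; `max_len`, which is the desired length for all sequences; `vocab`, which is used to look up the index of the `<pad>` token; and `pad_first`, which decides whether the padding goes at the beginning or the end.

2. The function computes how many padding tokens are needed: `max_len - len(sequence)`.

3. It adds that many padding values (`<pad>` token, represented by the integer 0) to the beginning of `sequence` if `pad_first` is `True`, otherwise to the end. A sequence that is already `max_len` or longer is returned unchanged.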

4. The function then returns the padded sequence.

Here's a visualization example:

Suppose we have the following sequence and max_length:

```python
sequence = [1, 3, 2]
max_length = 5
```

With the default `pad_first=True`, the `pad_sequence` function will pad the sequence with 0s at the beginning until its length is 5:

```python
padded_sequence = [0, 0, 1, 3, 2]
```

This is useful in batch processing where all sequences need to have the same length for the computations to work. The model does not skip padding on its own; in this notebook, the padded positions of the target are excluded from the loss by passing the `<pad>` index as `ignore_index`.


In [21]:
def pad_sequence(sequence: List[int], max_len: int, vocab: Vocab, pad_first: bool = True) -> List[int]:
    """
    Pad a sequence with <pad> tokens.
    
    Args:
        sequence (List[int]): the input sequence
        max_len (int): the maximum length
        vocab (Vocab): the vocabulary
        pad_first (bool): whether to pad at the beginning or the end
        
    Returns:
        List[int]: the padded sequence (returned unchanged if it is already max_len or longer)
    """
    
    # Calculate the number of tokens to pad
    pad_len = max_len - len(sequence)
    
    # Pad the sequence
    if pad_first:
        sequence = [vocab['<pad>']] * pad_len + sequence
    else:
        sequence = sequence + [vocab['<pad>']] * pad_len
        
    return sequence
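
The three cases (pad at the start, pad at the end, and a sequence that is already too long) can be checked with a small stand-alone sketch. It uses a fixed `PAD_ID` instead of a vocabulary lookup, which is an assumption that matches the `<pad>` index used in this notebook.

```python
from typing import List

PAD_ID = 0  # assumption: <pad> has index 0, as in the vocabularies above


def pad(sequence: List[int], max_len: int, pad_first: bool = True) -> List[int]:
    """Pad with PAD_ID up to max_len; longer sequences are returned unchanged."""
    padding = [PAD_ID] * max(0, max_len - len(sequence))
    return padding + sequence if pad_first else sequence + padding


print(pad([1, 3, 2], 5, pad_first=True))   # [0, 0, 1, 3, 2]
print(pad([1, 3, 2], 5, pad_first=False))  # [1, 3, 2, 0, 0]
print(pad([1, 3, 2], 2))                   # [1, 3, 2] (no truncation)
```

Since nothing is truncated, `max_len` has to be at least the length of the longest sequence for all rows to end up with the same length.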

In LSTM models, the order of the sequence matters because the LSTM maintains an internal state that is updated for each element in the sequence. If you pad at the end of the sequence, the LSTM will update its state based on these padding tokens, which are meaningless and could potentially lead to less accurate predictions.

On the other hand, if you pad at the beginning of the sequence, the LSTM will start updating its state based on the meaningful tokens right away, as soon as it encounters them. The padding tokens at the beginning of the sequence will have less impact on the final state of the LSTM, leading to more accurate predictions.

This is especially important when using LSTM models with a fixed maximum sequence length, where sequences shorter than the maximum length need to be padded. By padding at the beginning of the sequence, you ensure that the LSTM's state is influenced as much as possible by the meaningful tokens in the sequence.


In [22]:
# Pad the English and French sentences
# We pad first for the English sentences, and pad last for the French sentences
df['english_transform'] = df['english_transform'].progress_apply(lambda x: pad_sequence(x, en_max_len, en_vocab, True))
df['french_transform'] = df['french_transform'].progress_apply(lambda x: pad_sequence(x, fr_max_len, fr_vocab, False))

100%|██████████| 185583/185583 [00:00<00:00, 765287.50it/s]
100%|██████████| 185583/185583 [00:00<00:00, 423571.10it/s]


In [23]:
# Print random 5 rows
df.sample(5, random_state=42)

Unnamed: 0,english,french,attribution,english_transform,french_transform
179927,i ca concentrate if you keep tapping me on the...,je ne peux pas me concentrer si tu continues d...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 4, 10, 57, 6, 27, 1811, 49, 15, 5663, 5, 2..."
50731,where are your things,où sont tes affaires,CC-BY 2.0 (France) Attribution: tatoeba.org #1...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 75, 51, 185, 627, 2, 0, 0, 0, 0, 0, 0, 0, ..."
113343,tom has too many strange ideas,tom a trop idées étranges,CC-BY 2.0 (France) Attribution: tatoeba.org #5...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 16, 17, 109, 1007, 3896, 2, 0, 0, 0, 0, 0,..."
46173,how is your new class,comment est ta nouvelle classe,CC-BY 2.0 (France) Attribution: tatoeba.org #3...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 77, 7, 129, 302, 582, 2, 0, 0, 0, 0, 0, 0,..."
150522,such a thing has never happened before,est du jamais vu,CC-BY 2.0 (France) Attribution: tatoeba.org #8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 7, 42, 73, 139, 2, 0, 0, 0, 0, 0, 0, 0, 0,..."


In [24]:
class TranslationDataset(Dataset):
    def __init__(self, english, french):
        self.english = english.apply(lambda x: torch.tensor(x, dtype=torch.long))
        self.french = french.apply(lambda x: torch.tensor(x, dtype=torch.long))
        
    def __len__(self):
        return len(self.english)
    
    def __getitem__(self, idx):
        # Use positional indexing so this also works when the Series index is not 0..N-1
        return {
            'english': self.english.iloc[idx],
            'french': self.french.iloc[idx]
        }

In [25]:
class TranslationDataModule(pl.LightningDataModule):
    def __init__(self, df: pd.DataFrame, batch_size: int = 32):
        super().__init__()
        
        self.df = df
        self.batch_size = batch_size
        
        # Create the datasets
        self.translation_dataset = TranslationDataset(self.df['english_transform'], self.df['french_transform'])
        
        # Train/validation split
        train_size = int(0.8 * len(self.df))
        val_size = len(self.df) - train_size
        self.train_dataset, self.val_dataset = torch.utils.data.random_split(self.translation_dataset, [train_size, val_size])
        
    def train_dataloader(self) -> TRAIN_DATALOADERS:
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
    
    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.val_dataset, batch_size=self.batch_size)

In [26]:
# Create the data module
data_module = TranslationDataModule(df)

### Seq2Seq Model

The **Seq2Seq model**, also known as the **Sequence-to-Sequence model**, is a type of model that converts an input sequence into an output sequence. It's widely used in tasks such as machine translation, speech recognition, and more.

The Seq2Seq model consists of two main components:

1. **Encoder**: The encoder processes the input sequence and returns its own internal state. For each input element, the encoder updates its state. After processing the entire input sequence, the encoder outputs its final state, which serves as the "context" of the sequence.

2. **Decoder**: The decoder uses the context (the final state of the encoder) as its initial state and produces the output sequence element by element. The decoder is also a recurrent network: at each step, its input is the token it produced at the previous step (or a start-of-sequence token at the first step), together with its own recurrent state carried over from the previous step.

Here's a visualization of the Seq2Seq model:

```
Input Sequence -> | Encoder | -> Context -> | Decoder | -> Output Sequence
```

#### Embedding Layer

The embedding layer in a neural network is used to transform sparse categorical data, like words in a text dataset, into a dense vector representation that the network can work with. There are two main ways to use the embedding layer:

1. **Self-Trained Embeddings**: In this case, the embedding layer is initialized with random weights and learns an embedding for each word in the vocabulary during the training of the network. This is a good option when you don't have a lot of domain-specific knowledge about the relationships between your categories.

2. **Pre-Trained Embeddings**: In this case, the embedding layer is initialized with the weights from a pre-trained embedding, like Word2Vec or GloVe. These embeddings are trained on large corpora and can capture a lot of semantic information about words. This is a good option when your dataset is small and you want to leverage external knowledge.

Here's example code for these two cases:

```python
import torch
from torch import nn

# Vocabulary size and embedding dimension
vocab_size = 5000
embed_dim = 300

# 1. Self-Trained Embeddings
self_trained_embedding = nn.Embedding(vocab_size, embed_dim)

# 2. Pre-Trained Embeddings
# Load pre-trained embeddings (replace with actual code to load your embeddings)
pretrained_embeddings = torch.randn(vocab_size, embed_dim)  # Note: we need to load from a pretrained model
pretrained_embedding = nn.Embedding.from_pretrained(pretrained_embeddings)
```

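In the self-trained embeddings example, the `nn.Embedding` layer is initialized with random weights. In the pre-trained embeddings example, the `nn.Embedding` layer is initialized with weights from `pretrained_embeddings`, which is a tensor that you would typically load from a pre-trained embedding file. Note that `nn.Embedding.from_pretrained` freezes the weights by default (`freeze=True`); pass `freeze=False` if you want them to be updated during training.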


#### Freezing the Pre-Trained Embedding Layer

When using pre-trained embeddings, we have two options:

1. **Freeze the Embedding Layer**: In this case, the weights of the pre-trained embedding layer are kept constant during training. This means that the semantic information captured by the pre-trained embeddings is preserved, and the model cannot modify these embeddings to better fit the training data. This is a good option when your dataset is small and you want to leverage the semantic information in the pre-trained embeddings as much as possible.

2. **Fine-Tune the Embedding Layer**: In this case, the weights of the pre-trained embedding layer are updated during training. This means that the model can modify these embeddings to better fit the training data. This is a good option when your dataset is large and you believe that the pre-trained embeddings may not be optimal for your specific task.

You can decide whether to freeze the pre-trained embedding layer by setting the `requires_grad` attribute of the embedding layer's parameters. If `requires_grad` is `False`, the parameters are frozen and will not be updated during training. If `requires_grad` is `True`, the parameters will be updated during training.

Here's example code showing how to freeze and unfreeze the pre-trained embedding layer:

```python
# Freeze the pre-trained embedding layer
for param in pretrained_embedding.parameters():
    param.requires_grad = False

# Unfreeze the pre-trained embedding layer
for param in pretrained_embedding.parameters():
    param.requires_grad = True
```

In the first block of code, the `requires_grad` attribute of the embedding layer's parameters is set to `False`, freezing the parameters. In the second block of code, the `requires_grad` attribute is set to `True`, allowing the parameters to be updated during training.
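
As a small self-contained check, the sketch below inspects the `requires_grad` flag directly and builds an optimizer over the trainable parameters only. The random `weights` tensor is a placeholder for real pre-trained vectors, and the choice of Adam with this learning rate is only an example.

```python
import torch
from torch import nn

weights = torch.randn(10, 4)  # placeholder for real pre-trained vectors

frozen = nn.Embedding.from_pretrained(weights)  # freeze=True is the default
trainable = nn.Embedding.from_pretrained(weights, freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True

# Only parameters with requires_grad=True need to be handed to the optimizer
params = [p for p in trainable.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=1e-3)
```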

In [27]:
# For English, we use the existing GloVe embedding from torchtext
en_glove = GloVe(name='6B', dim=300)

# Create the embedding matrix
en_embedding_matrix = en_glove.get_vecs_by_tokens(en_vocab.get_itos())

.vector_cache/glove.6B.zip: 862MB [26:28, 543kB/s]                                
100%|█████████▉| 399999/400000 [00:19<00:00, 20402.71it/s]


In [28]:
import os

# Check if the word embedding file for French exists
if not os.path.exists('Data/wiki.multi.fr.vec'):
    # Download the word embedding file for French
    !curl -o Data/wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec

In [29]:
# Load the pre-trained French vectors (the file is stored in the word2vec text format)
word2vec_model = KeyedVectors.load_word2vec_format('Data/wiki.multi.fr.vec')

# Get the size of the embeddings
embed_size = word2vec_model.vector_size

# Get the list of words in the vocabulary
fr_vocab_words = fr_vocab.get_itos()

# Initialize embedding matrix
fr_embedding_matrix = torch.zeros(len(fr_vocab_words), embed_size)

# Fill in the embedding matrix
for i, word in enumerate(fr_vocab_words):
    # Check if the word is in the Word2Vec model's vocabulary
    if word in word2vec_model:
        fr_embedding_matrix[i] = torch.tensor(word2vec_model[word])
    else:
        # If the word is not in the Word2Vec model's vocabulary, leave its embedding as zeros
        pass
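
Words that are missing from the pre-trained vectors keep an all-zero row, so it can be useful to check how much of the vocabulary is actually covered. The sketch below shows one way to do that with hypothetical stand-ins: a tiny word list and a plain dictionary playing the role of the pre-trained vectors.

```python
from typing import Dict, List

# Hypothetical stand-ins for the vocabulary and the pre-trained vectors
vocab_words: List[str] = ["<pad>", "<sos>", "<eos>", "<unk>", "je", "de", "pas"]
pretrained: Dict[str, List[float]] = {"je": [0.1, 0.2], "de": [0.3, 0.4]}

embed_size = 2
matrix = [[0.0] * embed_size for _ in vocab_words]
missing = []

for i, word in enumerate(vocab_words):
    if word in pretrained:
        matrix[i] = pretrained[word]
    else:
        missing.append(word)  # this row stays all zeros

coverage = 1 - len(missing) / len(vocab_words)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

In this toy example the special tokens are reported as missing as well, since pre-trained vectors do not contain entries such as `<sos>`.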

In [31]:
class Seq2Seq(pl.LightningModule):
    def __init__(self, en_embedding_matrix, fr_embedding_matrix, hidden_size, output_size, max_output_len):
        """
        Initialize the model.
        
        Args:
            en_embedding_matrix (torch.Tensor): the embedding matrix for English
            fr_embedding_matrix (torch.Tensor): the embedding matrix for French
            hidden_size (int): the hidden size
            output_size (int): the output size
            max_output_len (int): the maximum output length
        """
        super().__init__()
        
        # Maximum output length
        self.max_output_len = max_output_len
        
        # Special tokens
        # These tokens are the same in French and English dictionaries
        self.sos_token = fr_vocab['<sos>']
        self.eos_token = fr_vocab['<eos>']
        self.pad_token = fr_vocab['<pad>']
        
        # Embedding layers
        # We let the model fine-tune the embeddings, so we pass freeze=False (from_pretrained freezes by default)
        self.en_embedding = nn.Embedding.from_pretrained(en_embedding_matrix, freeze=False)
        self.fr_embedding = nn.Embedding.from_pretrained(fr_embedding_matrix, freeze=False)
        
        # Encoder
        self.encoder = nn.LSTM(
            input_size=en_embedding_matrix.size(1),     # 300
            hidden_size=hidden_size,
            batch_first=True    # This makes the data shape (batch size, sequence length, features)
        )
        self.encoder_dropout = nn.Dropout(0.2)
        
        # Decoder
        self.decoder = nn.LSTM(
            input_size=fr_embedding_matrix.size(1),     # 300
            hidden_size=hidden_size,
            batch_first=True    # This makes the data shape (batch size, sequence length, features)
        )
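        self.decoder_leaky_relu = nn.LeakyReLU()    # Note: defined here but not applied in _decoder_forward
        # hidden size -> French vocabulary size (about 25k): one score (logit) per token.
        # The token with the highest score is chosen as the prediction.
        self.decoder_fc = nn.Linear(hidden_size, output_size)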
        
    def _encoder_forward(self, x):
        """
        Forward pass of the encoder.
        
        Args:
            x (torch.Tensor): the input tensor
            
        Returns:
            torch.Tensor: the output tensor
            torch.Tensor: the hidden state
            torch.Tensor: the cell state
        """
        # Embed the input
        x = self.en_embedding(x)
        
        # Dropout
        en_embedded = self.encoder_dropout(x)
        
        # Pass through the encoder LSTM
        encoder_output, (hidden_state, cell_state) = self.encoder(en_embedded)
        
        # Return the output, hidden state, and cell state
        return encoder_output, (hidden_state, cell_state)
    
    def _decoder_forward(self, x, hidden_and_cell):
        """
        Forward pass of the decoder.
        
        Args:
            x (torch.Tensor): the input tensor
            hidden_and_cell (tuple): the hidden state and cell state
            
        Returns:
            torch.Tensor: the output tensor
            torch.Tensor: the hidden state
            torch.Tensor: the cell state
        """
        # Embed the input
        x = self.fr_embedding(x)
        
        # Unpack the hidden state and cell state
        hidden_state, cell_state = hidden_and_cell
        
        # Pass through the decoder LSTM
        # The decoder_output is the hidden state of the last layer of the LSTM
        # The shape is (batch size, sequence length, hidden size)
        decoder_output, (hidden_state, cell_state) = self.decoder(x, (hidden_state, cell_state))
        
        # Because each time we run the _decoder_forward, we only pass one token, the sequence length is 1
        # It means that the decoder_output shape is (batch size, 1, hidden size)
        # We want it to become (batch size, hidden size) so that we can pass it to the fully connected layer
        decoder_output = decoder_output.squeeze(1)
        
        # Pass through the fully connected layer to predict the next token
        decoder_output = self.decoder_fc(decoder_output)        # Shape (batch size, output size)
        
        # Return the output, hidden state, and cell state
        return decoder_output, (hidden_state, cell_state)
    
    def forward(self, x):
        """
        Forward pass of the Seq2Seq model.
        
        Args:
            x (torch.Tensor): the input tensor
            
        Returns:
            torch.Tensor: the output tensor
        """
        
        # Pass the input through the encoder
        encoder_output, (hidden_state, cell_state) = self._encoder_forward(x)
        
        # Get the batch size
        batch_size = x.size(0)
        
        # Prepare the input for the decoder
        # The shape of the input should be (batch size, 1) because we are passing one token at a time
        decoder_input = torch.full((batch_size, 1), self.sos_token, dtype=torch.long, device=x.device)
        
        # Create a list to store the outputs
        decoder_outputs = []
        for _ in range(self.max_output_len):
            # Pass the input through the decoder
            # The shape of the output is (batch size, output size)
            decoder_output, (hidden_state, cell_state) = self._decoder_forward(decoder_input, (hidden_state, cell_state))
            
            # Add the output to the list
            # The purpose of this storage is to calculate the loss later
            decoder_outputs.append(decoder_output)
            
            # Get the predicted token
            # The predicted_token is the index of the token with the highest probability
            predicted_token = decoder_output.argmax(1)
            
            # argmax is not differentiable, so no gradient flows through the predicted token; detach() makes this explicit
            decoder_input = predicted_token.detach()
            
            # Reshape the predicted token to (batch size, 1) so that it can be passed to the decoder in the next iteration
            decoder_input = decoder_input.unsqueeze(1)
            
        # Stack the decoder outputs
        decoder_outputs = torch.stack(decoder_outputs, dim=1)
        
        return decoder_outputs
    
    def _shared_step(self, batch, stage: str):
        # Get the input and target
        x = batch['english']
        y = batch['french']
        
        # Get the output
        # The shape of the output is (batch_size, max_output_len, output_size)
        output = self(x)
        
        # Flatten the output to (batch_size * max_output_len, output_size)
        # and the target to (batch_size * max_output_len), so that the loss compares
        # the prediction at each position with the target token at the same position
        output = output.reshape(-1, output.size(-1))
        y = y.reshape(-1)
        
        # Calculate the loss, ignoring the padded target positions
        loss = nn.CrossEntropyLoss(ignore_index=self.pad_token)(output, y)
        
        # Log the loss as 'train_loss' or 'val_loss'
        self.log(f'{stage}_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        
        return loss
    
    def training_step(self, batch, batch_idx):
        return self._shared_step(batch, 'train')
    
    def validation_step(self, batch, batch_idx):
        return self._shared_step(batch, 'val')
    
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)
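
The decoding loop in `forward` is greedy: at every step it picks the highest-scoring token and feeds it back as the next input, and it always runs for `max_output_len` steps. The following framework-free sketch shows that loop in isolation. The names `greedy_decode` and `toy_step` and the token ids `SOS` and `EOS` are hypothetical and not part of the notebook; `toy_step` only stands in for `_decoder_forward`.

```python
from typing import Callable, List, Tuple

# Hypothetical special-token ids, chosen only for this illustration
SOS, EOS = 0, 1


def greedy_decode(step: Callable[[int, int], Tuple[List[float], int]],
                  state: int, max_len: int) -> List[int]:
    """Feed each predicted token back in as the next input for max_len steps."""
    token = SOS
    outputs = []
    for _ in range(max_len):
        scores, state = step(token, state)
        # argmax over the vocabulary scores
        token = max(range(len(scores)), key=scores.__getitem__)
        outputs.append(token)
    return outputs


def toy_step(token: int, state: int) -> Tuple[List[float], int]:
    """Stand-in decoder: favours token (state + 2) for three steps, then EOS."""
    scores = [0.0] * 6
    target = state + 2 if state < 3 else EOS
    scores[target] = 1.0
    return scores, state + 1


print(greedy_decode(toy_step, state=0, max_len=5))  # [2, 3, 4, 1, 1]
```

As in `forward`, the loop does not stop at the end-of-sequence token, so anything after the first `EOS` has to be trimmed afterwards.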

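The loss in the training and validation steps uses `ignore_index=self.pad_token`, so padded target positions are left out of the average. The sketch below reproduces that idea in plain Python to make the effect visible. It is an illustration, not PyTorch's implementation; `masked_cross_entropy` and the padding id `PAD = 0` are assumptions made for this example.

```python
import math
from typing import List

PAD = 0  # hypothetical padding id for this illustration


def masked_cross_entropy(logits: List[List[float]], targets: List[int],
                         pad: int = PAD) -> float:
    """Mean negative log-likelihood over positions whose target is not `pad`."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad:
            continue  # padded position: contributes nothing to the loss
        # Log of the softmax normaliser, computed in a numerically stable way
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    # Returning 0.0 when every position is padding is a choice made for this sketch
    return total / count if count else 0.0


logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [5.0, 0.0, 0.0]]
with_padding = masked_cross_entropy(logits, [1, 1, PAD])
without_padding = masked_cross_entropy(logits[:2], [1, 1])
print(with_padding == without_padding)  # True
```

The padded third position changes neither the sum nor the number of positions being averaged, which is why both calls give the same value.
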
In [32]:
# Create the model
seq2seq = Seq2Seq(en_embedding_matrix, fr_embedding_matrix, hidden_size=300, output_size=len(fr_vocab), max_output_len=fr_max_len)

In [33]:
# Early stopping callback
early_stop_callback = pl.pytorch.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min', verbose=True)

# Model checkpoint callback
checkpoint_callback = pl.pytorch.callbacks.ModelCheckpoint(monitor='val_loss', mode='min', verbose=True)

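With `patience=3` and `mode='min'`, early stopping ends training once `val_loss` has failed to improve for three consecutive validation checks. The sketch below shows that rule in simplified form. It is not Lightning's implementation; `stopping_epoch` and the loss values are hypothetical.

```python
from typing import List, Optional


def stopping_epoch(val_losses: List[float], patience: int = 3,
                   min_delta: float = 0.0) -> Optional[int]:
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Made-up validation losses: the best value (1.5) is reached at epoch 2
losses = [2.1, 1.7, 1.5, 1.6, 1.55, 1.52, 1.4]
print(stopping_epoch(losses, patience=3))  # 5
```

Training stops at epoch 5 even though epoch 6 would have improved, which is the trade-off a small patience value makes.
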
In [34]:
# Create the trainer
trainer = pl.Trainer(max_epochs=100, devices=-1, callbacks=[early_stop_callback, checkpoint_callback])

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [36]:
# Train the model
# (left commented out in this run; the next cell loads a previously saved checkpoint instead)
# trainer.fit(seq2seq, data_module)

In [37]:
# Load the best model
seq2seq = Seq2Seq.load_from_checkpoint(
    "./lightning_logs/version_26/checkpoints/epoch=9-step=11600.ckpt",
    en_embedding_matrix=en_embedding_matrix,
    fr_embedding_matrix=fr_embedding_matrix,
    hidden_size=300,
    output_size=len(fr_vocab),
    max_output_len=fr_max_len
)

In [38]:
# Load the tensorboard notebook extension
%load_ext tensorboard

In [39]:
# Start tensorboard
%tensorboard --logdir lightning_logs/

In [40]:
def translate_sentence(sentence: str) -> str:
    """
    Translate a sentence from English to French.
    
    Args:
        sentence (str): the input sentence
        
    Returns:
        str: the translated sentence
    """
    
    # Tokenize the sentence
    tokens = ' '.join([token.text.lower() for token in nlp_en.tokenizer(sentence) if token.is_alpha])
    
    # Transform the sentence
    transformed = text_transform_en(tokens)
    
    # Pad the sentence
    padded = pad_sequence(transformed, en_max_len, en_vocab, pad_first=True)
    
    # Convert the sentence to a tensor
    tensor = torch.tensor(padded, dtype=torch.long).unsqueeze(0).to(seq2seq.device)
    
    # Get the output without tracking gradients
    with torch.no_grad():
        output = seq2seq(tensor)
    
    # Get the predicted token IDs as a list of integers
    predicted = output.argmax(dim=-1).squeeze(0).tolist()
    
    # Remove the <sos> token
    predicted = predicted[1:]
    
    # Cut at the <eos> token, if the model produced one
    eos_id = fr_vocab['<eos>']
    if eos_id in predicted:
        predicted = predicted[:predicted.index(eos_id)]
    
    # Convert the integers to words and join them
    return ' '.join(fr_vocab.lookup_tokens(predicted))

In [41]:
# Translate some sentences
translate_sentence('I am a student.')

'je suis étudiant'

In [42]:
# Translate some sentences
translate_sentence('There is a cat on the table.')

'il y a un chat sur la chat'

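The post-processing after decoding drops the first position (the `<sos>` slot) and cuts the sequence at the first `<eos>`. Because `list.index` raises `ValueError` when the value is absent, it helps to check for `<eos>` first. A minimal sketch of that step as a standalone helper follows. The name `trim_at_eos` and the token ids in the example are hypothetical.

```python
from typing import List


def trim_at_eos(token_ids: List[int], eos_id: int,
                drop_first: bool = True) -> List[int]:
    """Drop the leading <sos> position and cut at the first <eos>, if any."""
    ids = token_ids[1:] if drop_first else list(token_ids)
    if eos_id in ids:
        ids = ids[:ids.index(eos_id)]
    return ids


# Hypothetical ids: 1 = <sos>, 2 = <eos>, 0 = <pad>
print(trim_at_eos([1, 5, 6, 2, 0, 0], eos_id=2))  # [5, 6]
print(trim_at_eos([1, 5, 6, 7], eos_id=2))        # [5, 6, 7]
```

The second call shows the case where the model never emits `<eos>` within `max_output_len` steps: the sequence is returned in full instead of raising an error.
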
In [43]:
# Write a translation function that translates from token IDs to token IDs
def translate_sentence_ids(sentence) -> List[int]:
    """
    Translate a single sentence from English token IDs to French token IDs.
    
    Args:
        sentence (List[int] or torch.Tensor): the input token IDs, with or without a leading batch dimension of size 1
        
    Returns:
        List[int]: the translated sentence
    """    
    # Convert the sentence to a tensor
    # torch.as_tensor reuses an existing tensor instead of copying it
    tensor = torch.as_tensor(sentence, dtype=torch.long).to(seq2seq.device)
    
    # Add the batch dimension if it is missing
    if tensor.dim() == 1:
        tensor = tensor.unsqueeze(0)
    
    # Get the output without tracking gradients
    with torch.no_grad():
        output = seq2seq(tensor)
    
    # Get the predicted token IDs as a list of integers
    predicted = output.argmax(dim=-1).squeeze(0).tolist()
    
    # Remove the <sos> token
    predicted = predicted[1:]
    
    # Cut at the <eos> token, if the model produced one
    eos_id = fr_vocab['<eos>']
    if eos_id in predicted:
        predicted = predicted[:predicted.index(eos_id)]
        
    return predicted

In [44]:
from torchmetrics.text import BLEUScore

# Create the BLEU score metric
bleu_score = BLEUScore()

# Set the model to evaluation mode
seq2seq.eval()

# Loop through the validation dataloader
for batch in tqdm(data_module.val_dataloader()):
    # Get the input and target
    x = batch['english']
    y = batch['french']
    
    # Translate the input sentences
    translated_sentences = []
    for sentence in x:
        translated = translate_sentence_ids(sentence.unsqueeze(0))
        translated_sentences.append(translated)
    
    # Process the target sentences
    target_sentences = []
    for sentence in y:
        # Remove padding, <sos>, and everything after <eos>
        cleaned = sentence[sentence != fr_vocab['<pad>']][1:]
        if fr_vocab['<eos>'] in cleaned:
            cleaned = cleaned[:cleaned.tolist().index(fr_vocab['<eos>'])]
        target_sentences.append(cleaned.tolist())
    
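    # Convert token IDs to words and join them into strings
    # Fetch the index-to-token list once instead of once per token
    itos = fr_vocab.get_itos()
    translated_sentences = [' '.join([itos[idx] for idx in sent]) for sent in translated_sentences]
    # BLEUScore expects a list of reference strings for each prediction, hence the inner list
    target_sentences = [[' '.join([itos[idx] for idx in sent])] for sent in target_sentences]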
    
    # Update the BLEU score
    bleu_score.update(translated_sentences, target_sentences)

# Compute the final BLEU score
final_bleu_score = bleu_score.compute()
print(f"BLEU Score: {final_bleu_score:.4f}")

100%|██████████| 1160/1160 [21:56<00:00,  1.14s/it]

BLEU Score: 0.3417
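
BLEU is built from clipped n-gram precisions: each n-gram in the candidate counts as a match at most as often as it appears in the reference. The sketch below computes that single ingredient in plain Python. It is not the full metric, which also combines several n-gram orders (up to 4 by default in `BLEUScore`) and applies a brevity penalty. The function names are hypothetical, and the reference sentence is an assumed correct translation used only for illustration.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # A Counter returns 0 for n-grams missing from the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


candidate = 'il y a un chat sur la chat'   # model output shown earlier
reference = 'il y a un chat sur la table'  # assumed reference, for illustration

print(clipped_precision(candidate, reference, 1))  # 0.875
print(clipped_precision(candidate, reference, 2))
```

The repeated `chat` is credited only once because the reference contains it once, so 7 of the 8 unigrams match; for bigrams, 6 of 7 match.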



