# **📘 Project Title: Transformer-Based Text Translation**
A practical implementation of a Transformer model for language translation.

# 🧠 **Overview**
This notebook demonstrates the inference pipeline of a trained Transformer model for text translation. It loads a model trained on the Opus Books "en-it" dataset and uses it to translate both sample sentences and custom input.

# 🛠️ **Environment Setup**
Set Up Virtual Environment and Install Dependencies

In [None]:
%env PYTHONPATH =

env: PYTHONPATH=


In [None]:
!pip install virtualenv

Collecting virtualenv
  Downloading virtualenv-20.30.0-py3-none-any.whl.metadata (4.5 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading virtualenv-20.30.0-py3-none-any.whl (4.3 MB)
Downloading distlib-0.3.9-py2.py3-none-any.whl (468 kB)
Installing collected packages: distlib, virtualenv
Successfully installed distlib-0.3.9 virtualenv-20.30.0


In [None]:
!virtualenv myenv

created virtual environment CPython3.11.12.final.0-64 in 2144ms
  creator CPython3Posix(dest=/content/myenv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==25.0.1, setuptools==78.1.0, wheel==0.45.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [None]:
!myenv/bin/python --version

Python 3.11.12


Install the dependencies at the versions pinned in `requirements.txt`:


In [None]:
# Make sure we're using the virtual environment's pip
!myenv/bin/pip install numpy==1.24.3
!myenv/bin/pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
!myenv/bin/pip install datasets==2.15.0 tokenizers==0.13.3 torchmetrics==1.0.3
!myenv/bin/pip install tensorboard==2.13.0 tqdm altair==5.1.1 wandb==0.15.9

Collecting numpy==1.24.3
  Downloading numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Downloading numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
Successfully installed numpy-1.24.3
Collecting torch==2.0.1
  Downloading torch-2.0.1-cp311-cp311-manylinux1_x86_64.whl.metadata (24 kB)
Collecting torchvision==0.15.2
  Downloading torchvision-0.15.2-cp311-cp311-manylinux1_x86_64.whl.metadata (11 kB)
Collecting torchaudio==2.0.2
  Downloading torchaudio-2.0.2-cp311-cp311-manylinux1_x86_64.whl.metadata (1.2 kB)
Collecting filelock (from torch==2.0.1)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting typing-extensions (from torch==2.0.1)
  Downloading typing_extensions-4.13.2-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from 
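
Pinned installs are easy to get subtly wrong, so it can help to compare the pins against what is actually installed. The following is a minimal sketch using only the standard library. `parse_pins` and `check_pins` are hypothetical helper names, and the sample lines stand in for the contents of `requirements.txt`, which is not shown in this notebook.

```python
from importlib import metadata


def parse_pins(lines):
    """Turn 'name==version' lines into {name: version}, skipping comments and unpinned entries."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed_or_None)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Sample lines (assumed for illustration, not read from the real requirements.txt)
pins = parse_pins(["numpy==1.24.3", "torch==2.0.1", "tqdm  # unpinned", "# a comment"])
print(check_pins(pins))
```

`importlib.metadata` inspects the interpreter that runs the code, so to check `myenv` the snippet would have to be executed with `myenv/bin/python` rather than in the notebook kernel.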

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Using cached xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Using cached multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.

# 🗂️ **Mount Google Drive**
Imports the Colab `drive` module and mounts your Google Drive, giving the notebook read/write access to the files stored there.

Then changes the working directory to the project folder in Drive.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change into the project directory
%cd /content/drive/MyDrive/transformer_project_major

Mounted at /content/drive
/content/drive/MyDrive/transformer_project_major


# 📦 **Import all the needed libraries**

Imports `DataLoader` from `torch.utils.data`, which handles batching, shuffling, and parallel loading of data.

Imports `load_dataset` from the Hugging Face `datasets` library, which downloads and loads datasets such as Opus Books.

Imports the base `Tokenizer` class from the Hugging Face `tokenizers` library. This class handles encoding text into token IDs and decoding token IDs back into text.

Imports `WordLevel` and `WordLevelTrainer`, which are used to build a word-level vocabulary from the training data, including special tokens like [PAD], [SOS], and [EOS].

Imports the `Whitespace` pre-tokenizer from `tokenizers.pre_tokenizers`. It splits text on whitespace and separates punctuation from words, a straightforward way to prepare text before training the tokenizer.

Imports the remaining classes and helpers from the project's own files: `model.py`, `dataset.py`, and `config.py`.
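
The tokenizer pieces described above can be pictured without the library. The following dependency-free sketch mimics, in plain Python, what a whitespace pre-tokenizer and a word-level vocabulary do. It is not the `tokenizers` implementation: the helper names are hypothetical, and the `[UNK]` token is an assumption (the text above names only [PAD], [SOS], and [EOS]). The regular expression is the pattern the `tokenizers` documentation gives for `Whitespace`.

```python
import re
from collections import Counter

# Special tokens get the lowest IDs; [UNK] is an assumed addition for unknown words.
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[SOS]", "[EOS]"]


def whitespace_split(text):
    """Split into runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


def build_vocab(sentences, min_frequency=1):
    """Map every sufficiently frequent token to an integer ID."""
    counts = Counter(tok for s in sentences for tok in whitespace_split(s))
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok, freq in counts.most_common():
        if freq >= min_frequency:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Look up each token, falling back to [UNK] for unseen words."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in whitespace_split(text)]


vocab = build_vocab(["Hello, how are you?", "How are you today?"])
print(whitespace_split("Hello, how are you?"))  # ['Hello', ',', 'how', 'are', 'you', '?']
print(encode("Hello there?", vocab))            # 'there' is unseen, so it maps to [UNK]
```

Because the vocabulary is word-level, any word that never appeared in the training data can only be represented as the unknown token, which is one reason translations of unusual input may degrade.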

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from tqdm import tqdm
from model import Transformer, build_transformer
from dataset import BilingualDataset, causal_mask
from config import get_config

In [None]:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from pathlib import Path
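
`causal_mask` comes from `dataset.py`, which is not reproduced in this notebook. As a rough, dependency-free illustration of the idea (not the project's actual implementation), a causal mask lets position `i` of the decoder input attend only to positions `0..i`. The name `sketch_causal_mask` and the nested-list representation are assumptions made for this example; the real helper presumably returns a tensor.

```python
def sketch_causal_mask(size):
    """Return a size x size grid; entry [i][j] is True if position i may attend to position j."""
    return [[j <= i for j in range(size)] for i in range(size)]


for row in sketch_causal_mask(4):
    # 1 = visible, 0 = hidden (a future position)
    print(" ".join("1" if visible else "0" for visible in row))
```

The printed grid is lower-triangular: each row sees itself and everything before it, and nothing after.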

# 🔍 **Creating the Beam Search Function**
The `beam_search_decode` function implements beam search, a decoding algorithm used at inference time in machine translation (and similar NLP tasks) with a Transformer model. Instead of greedily selecting the single most likely next token at each step, as greedy decoding does, beam search keeps the `beam_size` highest-scoring partial translations (beams) at each step and extends each of them. This often produces more fluent and accurate translations, at the cost of more decoder calls per step.

In [None]:
# Create an improved beam search function for inference
import torch
from dataset import causal_mask

def beam_search_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device, beam_size=5):
    """Beam search for better translation quality"""
    sos_idx = tokenizer_tgt.token_to_id('[SOS]')
    eos_idx = tokenizer_tgt.token_to_id('[EOS]')

    # Encode the source sentence
    encoder_output = model.encode(source, source_mask)

    # Initialize the beam with start token
    sequences = [(torch.empty(1, 1).fill_(sos_idx).type_as(source).to(device), 0.0)]

    # Beam search
    for _ in range(max_len):
        new_sequences = []

        # Expand each current sequence
        for seq, score in sequences:
            # If sequence ended with EOS, keep it unchanged
            if seq.size(1) > 1 and seq[0, -1].item() == eos_idx:
                new_sequences.append((seq, score))
                continue

            # Create decoder mask for this sequence
            decoder_mask = causal_mask(seq.size(1)).type_as(source_mask).to(device)

            # Get next token probabilities
            out = model.decode(encoder_output, source_mask, seq, decoder_mask)
            prob = model.project(out[:, -1])
            log_prob = torch.log_softmax(prob, dim=-1)

            # Get top-k token candidates
            topk_probs, topk_indices = torch.topk(log_prob, beam_size, dim=1)

            # Add new candidates to the list
            for i in range(beam_size):
                token = topk_indices[0, i].unsqueeze(0).unsqueeze(0)
                new_seq = torch.cat([seq, token], dim=1)
                new_score = score + topk_probs[0, i].item()
                new_sequences.append((new_seq, new_score))

        # Select top-k sequences
        new_sequences.sort(key=lambda x: x[1], reverse=True)
        sequences = new_sequences[:beam_size]

        # Check if all sequences have ended or reached max length
        if all((seq.size(1) > 1 and seq[0, -1].item() == eos_idx) or seq.size(1) >= max_len
               for seq, _ in sequences):
            break

    # Return the best sequence
    return sequences[0][0].squeeze(0)
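
To see why keeping several beams can beat greedy decoding, it may help to run the same loop on a tiny hand-made model. The sketch below is only an illustration: the probability table is invented, the function works on string tokens instead of tensors, and `toy_beam_search` is a hypothetical name. With `beam_size=1` it behaves like greedy decoding.

```python
import math

# Invented next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.4, "y": 0.3, "z": 0.3},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def toy_beam_search(beam_size, max_len=5):
    """Return (tokens, log_prob) of the best sequence found."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams are carried over
                continue
            for tok, p in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0]


greedy_tokens, greedy_score = toy_beam_search(beam_size=1)
beam_tokens, beam_score = toy_beam_search(beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'a', 'x', '</s>'] 0.24
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'b', 'x', '</s>'] 0.36
```

Greedy decoding commits to `a` because it is the likeliest first token, while a beam of two keeps `b` alive long enough to discover that `b x` is the more probable sequence overall.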

The next cell loads the checkpoint from the 30th training epoch (`tmodel_30.pt`), which is treated as the best model here, and prints the BLEU score stored with it.

In [None]:
# Load the 30th epoch model for inference
from model import build_transformer
import torch
from config import get_config, get_weights_file_path
from tokenizers import Tokenizer
from pathlib import Path

# Get configuration
cfg = get_config()
cfg['model_folder'] = 'weights'
cfg['tokenizer_file'] = 'vocab/tokenizer_{0}.json'

# Load tokenizers
tokenizer_src = Tokenizer.from_file(cfg['tokenizer_file'].format(cfg['lang_src']))
tokenizer_tgt = Tokenizer.from_file(cfg['tokenizer_file'].format(cfg['lang_tgt']))

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Build model
model = build_transformer(
    tokenizer_src.get_vocab_size(),
    tokenizer_tgt.get_vocab_size(),
    cfg['seq_len'],
    cfg['seq_len'],
    d_model=cfg['d_model']
).to(device)

# Directly load the 30th epoch model
model_path = get_weights_file_path(cfg, "30")

# Check if the file exists
if Path(model_path).exists():
    state = torch.load(model_path, map_location=device)
    model.load_state_dict(state['model_state_dict'])
    model.eval()
    print(f"Loaded 30th epoch model from {model_path}")
    print(f"BLEU score: {state.get('bleu_score', 'N/A')}")
else:
    print(f"30th epoch model not found at {model_path}")

Using device: cpu
Loaded 30th epoch model from opus_books_weights/tmodel_30.pt
BLEU score: 0.28094117647058825
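
The printed path is `opus_books_weights/tmodel_30.pt` even though `model_folder` was set to `weights`, which hints at how `get_weights_file_path` assembles the location. Since `config.py` is not shown in this notebook, the following is only a guess at that behaviour: the function name `sketch_weights_file_path`, the `datasource` and `model_basename` keys, and their values are all assumptions inferred from the output above.

```python
from pathlib import Path


def sketch_weights_file_path(config, epoch):
    """Hypothetical stand-in for config.get_weights_file_path."""
    folder = f"{config['datasource']}_{config['model_folder']}"  # e.g. opus_books_weights
    filename = f"{config['model_basename']}{epoch}.pt"           # e.g. tmodel_30.pt
    return str(Path(folder) / filename)


# Assumed config values, chosen to reproduce the path printed by the cell above
example_cfg = {
    "datasource": "opus_books",
    "model_folder": "weights",
    "model_basename": "tmodel_",
}
print(sketch_weights_file_path(example_cfg, "30"))
```

If the real helper works this way, changing `cfg['model_folder']` alone changes only the suffix of the directory name, so the checkpoint has to live in a folder that also carries the dataset prefix.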


# 🗣️ **Translation Function**
Creates a utility function for translating text using the trained model. This function handles tokenization, beam search decoding, and post-processing.

In [None]:
# Define translation function with beam search
def translate(sentence, model, tokenizer_src, tokenizer_tgt, max_len, device, beam_size=5):
    """Translate a sentence using beam search"""
    model.eval()

    # Tokenize the source sentence
    tokens = tokenizer_src.encode(sentence).ids

    # Add SOS and EOS tokens
    tokens = [tokenizer_src.token_to_id('[SOS]')] + tokens + [tokenizer_src.token_to_id('[EOS]')]

    # Convert to tensor and create mask
    src = torch.LongTensor([tokens]).to(device)
    src_mask = (src != tokenizer_src.token_to_id('[PAD]')).unsqueeze(1).unsqueeze(1).int().to(device)

    # Translate with beam search (no gradients are needed at inference time)
    with torch.no_grad():
        output_tokens = beam_search_decode(
            model, src, src_mask, tokenizer_src, tokenizer_tgt, max_len, device, beam_size
        )

    # Convert token IDs to text
    output_text = tokenizer_tgt.decode(output_tokens.detach().cpu().tolist())

    # Remove special tokens
    output_text = output_text.replace('[SOS]', '').replace('[EOS]', '').strip()

    return output_text
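
A word-level tokenizer treats punctuation marks as separate tokens, and decoding joins tokens with spaces, which is why decoded output such as `Ciao , come stai ?` has a space before each punctuation mark. If cleaner text is wanted, one option is a small extra post-processing step. The helper below is a hypothetical addition, not part of the project code, and it only handles a handful of common closing punctuation marks.

```python
import re


def tidy_punctuation(text):
    """Remove the space that word-level decoding leaves before closing punctuation."""
    return re.sub(r"\s+([,.;:!?])", r"\1", text).strip()


print(tidy_punctuation("Ciao , come stai ?"))                # Ciao, come stai?
print(tidy_punctuation("Grazie a te , per il tuo aiuto ."))  # Grazie a te, per il tuo aiuto.
```

It could be applied to the string returned by `translate`. Apostrophes and quotation marks are deliberately left alone here, because rejoining them correctly depends on the language.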

# 🧠 **Model Inference**
Tests the loaded model (the 30th-epoch checkpoint, chosen because its BLEU score was the highest) on a set of example sentences covering common phrases. The sample output includes both correct translations and mistranslations.

In [None]:
# Test with example sentences
test_sentences = [
    "Hello, how are you?",
    "I like to read books.",
    "What is your name?",
    "The weather is nice today.",
    "Thank you for your help.",
    "Goodbye, see you tomorrow.",
    "Can you help me?",
    "I don't understand.",
    "Please speak more slowly.",
    "Where is the bathroom?"
]

print("\nTesting with example sentences:")
print("-" * 80)

for sentence in test_sentences:
    translation = translate(sentence, model, tokenizer_src, tokenizer_tgt, cfg['seq_len'], device)
    print(f"EN: {sentence}")
    print(f"IT: {translation}")
    print("-" * 80)


Testing with example sentences:
--------------------------------------------------------------------------------
EN: Hello, how are you?
IT: Ciao , come stai ?
--------------------------------------------------------------------------------
EN: I like to read books.
IT: Mi piace leggere i libri .
--------------------------------------------------------------------------------
EN: What is your name?
IT: Che cosa volete dire ?
--------------------------------------------------------------------------------
EN: The weather is nice today.
IT: Il tempo è male .
--------------------------------------------------------------------------------
EN: Thank you for your help.
IT: Grazie a te , per il tuo aiuto .
--------------------------------------------------------------------------------
EN: Goodbye, see you tomorrow.
IT: Andiamo , ti prego .
--------------------------------------------------------------------------------
EN: Can you help me?
IT: Forse non vi ?
-------------------------------

# 💬 **Interactive Interface**
Creates a simple prompt loop that translates custom sentences as they are typed.

This allows testing the model on your own input. Enter `q` to quit.


In [None]:
# Create interactive translation interface
def interactive_translation():
    """Interactive translation interface"""
    print("\n" + "=" * 80)
    print("Interactive English to Italian Translator")
    print("Enter text to translate (or 'q' to quit)")
    print("=" * 80)

    while True:
        # Get input from user
        sentence = input("\nEN > ")

        # Exit if requested
        if sentence.lower() == 'q':
            break

        # Translate
        translation = translate(sentence, model, tokenizer_src, tokenizer_tgt, cfg['seq_len'], device)

        # Show result
        print(f"IT > {translation}")

# Run the interactive translator
interactive_translation()


Interactive English to Italian Translator
Enter text to translate (or 'q' to quit)

EN > hello
IT > Ciao

EN > q
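
The loop above reads directly from `input`, which makes it awkward to exercise automatically. As a sketch of one alternative, the version below takes the translator and the I/O functions as arguments, strips surrounding whitespace, and skips blank lines. `run_translator` and its parameters are hypothetical names, and the scripted run uses upper-casing as a stand-in for the model, which is of course not a translation.

```python
def run_translator(translate_fn, read_line=input, write_line=print):
    """Prompt loop with the translator and the I/O functions passed in."""
    while True:
        sentence = read_line("\nEN > ").strip()
        if sentence.lower() == "q":
            break
        if not sentence:
            continue  # skip blank lines instead of translating an empty string
        write_line(f"IT > {translate_fn(sentence)}")


# Scripted run: three "typed" lines and a stand-in translator
scripted = iter(["hello", "", " q "])
outputs = []
run_translator(str.upper, read_line=lambda prompt: next(scripted), write_line=outputs.append)
print(outputs)  # ['IT > HELLO']
```

In this notebook it could be called as `run_translator(lambda s: translate(s, model, tokenizer_src, tokenizer_tgt, cfg['seq_len'], device))`.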
