# Text Translation using Transformer Model



## Dataset Description
- **Dataset**: IWSLT2017 English-German translation dataset.
- **Purpose**: Benchmarking translation models on small-scale datasets.
- **Content**: Parallel English-German sentence pairs for machine translation tasks.
- **Source**: Hugging Face `datasets` library.

1. Installing necessary packages 

In [None]:
!pip install torch torchvision torchaudio --quiet
!pip install datasets sacrebleu --quiet
!pip install tokenizers --quiet

2. Importing libraries and loading the dataset from Hugging Face


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from collections import Counter
import random
import numpy as np
import sacrebleu
import time
from tokenizers import ByteLevelBPETokenizer 

raw_datasets = load_dataset("iwslt2017", "iwslt2017-en-de")
train_data = raw_datasets['train']
val_data = raw_datasets['validation']
test_data = raw_datasets['test']
print(train_data[0])

The repository for iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


en-de.zip:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/206112 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8079 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/888 [00:00<?, ? examples/s]

{'translation': {'de': 'Vielen Dank, Chris.', 'en': 'Thank you so much, Chris.'}}


3. BPE Tokenizer Training and Vocabulary Creation

- Prepares text files for English and German translations from the training data.
- Trains Byte Pair Encoding (BPE) tokenizers for English and German with a vocabulary size of 10,000.
- Defines special tokens (`<pad>`, `<sos>`, `<eos>`, `<unk>`) for handling padding, start/end of sequences, and unknown tokens.
- Extracts vocabularies and token-to-ID mappings for both languages.
- Prints vocabulary sizes and IDs of special tokens for verification.

In [None]:
with open("en_texts.txt", "w", encoding="utf-8") as f:
    for text in [x['translation']['en'] for x in train_data]:
        f.write(text + "\n")

with open("de_texts.txt", "w", encoding="utf-8") as f:
    for text in [x['translation']['de'] for x in train_data]:
        f.write(text + "\n")

en_tokenizer = ByteLevelBPETokenizer()
en_tokenizer.train(files=["en_texts.txt"], vocab_size=10000, min_frequency=2, special_tokens=[
    "<pad>", "<sos>", "<eos>", "<unk>"
])

de_tokenizer = ByteLevelBPETokenizer()
de_tokenizer.train(files=["de_texts.txt"], vocab_size=10000, min_frequency=2, special_tokens=[
    "<pad>", "<sos>", "<eos>", "<unk>"
])

SRC_vocab = en_tokenizer.get_vocab()
TGT_vocab = de_tokenizer.get_vocab()

SRC_itos = {i: en_tokenizer.id_to_token(i) for i in range(len(SRC_vocab))}
TGT_itos = {i: de_tokenizer.id_to_token(i) for i in range(len(TGT_vocab))}

PAD_IDX_EN = en_tokenizer.token_to_id("<pad>")
SOS_IDX_EN = en_tokenizer.token_to_id("<sos>")
EOS_IDX_EN = en_tokenizer.token_to_id("<eos>")
UNK_IDX_EN = en_tokenizer.token_to_id("<unk>")

PAD_IDX_DE = de_tokenizer.token_to_id("<pad>")
SOS_IDX_DE = de_tokenizer.token_to_id("<sos>")
EOS_IDX_DE = de_tokenizer.token_to_id("<eos>")
UNK_IDX_DE = de_tokenizer.token_to_id("<unk>")

print(f"EN vocab size: {len(SRC_vocab)}, DE vocab size: {len(TGT_vocab)}")
print(f"EN <pad> ID: {PAD_IDX_EN}, DE <pad> ID: {PAD_IDX_DE}")

EN vocab size: 10000, DE vocab size: 10000
EN <pad> ID: 0, DE <pad> ID: 0
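
To make the BPE training step more concrete, here is a minimal pure-Python sketch of a single merge iteration on a toy corpus. It is only an illustration of the idea, not how the `tokenizers` library works internally (that operates on bytes and is far more optimized). The helper names `count_pairs` and `merge_pair` and the toy word frequencies are assumptions made up for this example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): each word is a tuple of characters with a frequency.
toy_words = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w"): 3,
}

pairs = count_pairs(toy_words)
best_pair = max(pairs, key=pairs.get)      # most frequent adjacent pair
toy_words = merge_pair(toy_words, best_pair)
print(best_pair, toy_words)
```

Repeating these two steps until the vocabulary reaches the target size (10,000 in this notebook) yields the merge table that the tokenizer later applies to new text.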


4. Dataset and DataLoader (updated for BPE)
- Defines a function `encode_bpe` to encode text using BPE tokenizers and add `<sos>` and `<eos>` tokens.
- Implements `TranslationDataset` to process source and target sequences using BPE tokenization.
- Defines `collate_fn` for padding sequences in batches.
- Creates `DataLoader` objects for training, validation, and test datasets with a batch size of 64.


In [None]:

MAX_LEN = 64 

def encode_bpe(text, tokenizer, max_len=MAX_LEN):
    encoded = tokenizer.encode(text)
    
    ids = encoded.ids
   
    res_ids = [tokenizer.token_to_id("<sos>")] + ids + [tokenizer.token_to_id("<eos>")]
    
    if len(res_ids) > max_len:
        res_ids = res_ids[:max_len-1] + [tokenizer.token_to_id("<eos>")]
    return res_ids

class TranslationDataset(Dataset):
    def __init__(self, data, en_tokenizer, de_tokenizer):
        self.src = [encode_bpe(x['translation']['en'], en_tokenizer) for x in data]
        self.tgt = [encode_bpe(x['translation']['de'], de_tokenizer) for x in data]
    def __len__(self):
        return len(self.src)
    def __getitem__(self, idx):
        return torch.tensor(self.src[idx], dtype=torch.long), torch.tensor(self.tgt[idx], dtype=torch.long)

def collate_fn(batch):
    src, tgt = zip(*batch)
    src = nn.utils.rnn.pad_sequence(src, padding_value=PAD_IDX_EN)
    tgt = nn.utils.rnn.pad_sequence(tgt, padding_value=PAD_IDX_DE)
    return src, tgt

train_ds = TranslationDataset(train_data, en_tokenizer, de_tokenizer)
val_ds   = TranslationDataset(val_data, en_tokenizer, de_tokenizer)
test_ds  = TranslationDataset(test_data, en_tokenizer, de_tokenizer)

BATCH_SIZE = 64
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
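
The truncation rule in `encode_bpe` and the batch layout produced by `collate_fn` can be checked on toy data without any tokenizer. The sketch below is a plain-Python approximation: the IDs `SOS = 1` and `EOS = 2` are assumed from the order of the special tokens (only `<pad>` = 0 is confirmed by the output above), `MAX_LEN` is shortened to 8 so that truncation is visible, and `add_special_tokens` and `pad_time_major` are hypothetical helper names.

```python
PAD, SOS, EOS = 0, 1, 2   # assumed IDs; only PAD = 0 is confirmed above
MAX_LEN = 8               # shortened from 64 for illustration

def add_special_tokens(ids, max_len=MAX_LEN):
    """Wrap ids in <sos>/<eos>; truncate so the result has at most max_len items."""
    out = [SOS] + ids + [EOS]
    if len(out) > max_len:
        # Keep the first max_len - 1 items and force <eos> at the end.
        out = out[:max_len - 1] + [EOS]
    return out

def pad_time_major(seqs, pad=PAD):
    """Pad to the longest sequence and return a (seq_len, batch) nested list."""
    longest = max(len(s) for s in seqs)
    padded = [s + [pad] * (longest - len(s)) for s in seqs]
    # Transpose (batch, seq_len) -> (seq_len, batch), the layout that
    # pad_sequence uses by default (batch_first=False).
    return [list(step) for step in zip(*padded)]

short_seq = add_special_tokens([10, 11])
long_seq = add_special_tokens(list(range(10, 30)))
batch = pad_time_major([short_seq, long_seq])

print(short_seq)
print(long_seq)
print(len(batch), len(batch[0]))
```

Because the first dimension is the time step, slicing such as `tgt[:-1, :]` in the training code removes the last time step of every sequence in the batch.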

5. Transformer Model Definition 
- Defines a Transformer-based sequence-to-sequence model for translation.
- Includes:
  - Token embeddings for source and target languages with padding indices.
  - Positional embeddings for sequence positions.
  - Transformer architecture with encoder and decoder layers.
  - A linear layer (`generator`) to predict target tokens.
- Handles padding masks for source and target sequences during training.

In [None]:
class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, emb_dim=256, nhead=4, num_layers=3, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.src_tok_emb = nn.Embedding(src_vocab_size, emb_dim, padding_idx=PAD_IDX_EN)
        self.tgt_tok_emb = nn.Embedding(tgt_vocab_size, emb_dim, padding_idx=PAD_IDX_DE)
        
        self.pos_encoder = nn.Embedding(MAX_LEN + 2, emb_dim)

        self.transformer = nn.Transformer(
            d_model=emb_dim, nhead=nhead, num_encoder_layers=num_layers,
            num_decoder_layers=num_layers, dim_feedforward=dim_feedforward, dropout=dropout,
            batch_first=False
        )
        self.generator = nn.Linear(emb_dim, tgt_vocab_size)

    def forward(self, src, tgt, src_padding_mask=None, tgt_padding_mask=None):
        src_seq_len, N = src.shape
        tgt_seq_len, N = tgt.shape

        src_pos = torch.arange(0, src_seq_len, device=src.device).unsqueeze(1).expand(src_seq_len, N)
        tgt_pos = torch.arange(0, tgt_seq_len, device=tgt.device).unsqueeze(1).expand(tgt_seq_len, N)

        src_emb = self.src_tok_emb(src) + self.pos_encoder(src_pos)
        tgt_emb = self.tgt_tok_emb(tgt) + self.pos_encoder(tgt_pos)

        src_padding_mask = (src == PAD_IDX_EN).transpose(0, 1)
        tgt_padding_mask = (tgt == PAD_IDX_DE).transpose(0, 1)

        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_seq_len).to(src.device)

        out = self.transformer(
            src_emb, tgt_emb,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=src_padding_mask
        )
        return self.generator(out)
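
The two kinds of masks used in `forward` can be pictured with a small plain-Python sketch. It mirrors the layout of the tensors only: the causal mask holds `0.0` where attention is allowed and `-inf` above the diagonal, and the padding mask is a `(batch, seq_len)` grid of booleans. The function names and the toy token IDs are assumptions for illustration.

```python
NEG_INF = float("-inf")
PAD = 0  # padding ID, as printed by the tokenizer cell

def causal_mask(size):
    """0.0 where attention is allowed, -inf where a position would look ahead."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)]
            for row in range(size)]

def key_padding_mask(batch_first_ids, pad=PAD):
    """True marks padding positions that attention should ignore."""
    return [[tok == pad for tok in seq] for seq in batch_first_ids]

mask = causal_mask(3)
pad_mask = key_padding_mask([[1, 7, 2, 0], [1, 8, 9, 2]])

for row in mask:
    print(row)
print(pad_mask)
```

Row `i` of the causal mask lets target position `i` attend to positions `0..i` only, which is what keeps the decoder from seeing future tokens during training.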

6. Training and Evaluation Functions 
- Configures the model, optimizer, and loss function (`CrossEntropyLoss` with padding ignored).
- Implements:
  - `train_epoch`: Trains the model for one epoch using teacher forcing.
  - `evaluate`: Evaluates the model on validation data.
  - `translate`: Translates a source sentence using greedy decoding.
  - `calc_bleu`: Computes BLEU score for translation quality using `sacrebleu`.
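
Teacher forcing in `train_epoch` comes down to shifting the target sequence by one position: the decoder receives everything except the last token and is asked to predict everything except the first. A minimal sketch with made-up token IDs (`SOS = 1` and `EOS = 2` are assumed values):

```python
SOS, EOS = 1, 2  # assumed special-token IDs

target = [SOS, 45, 67, 89, EOS]

decoder_input = target[:-1]   # what the decoder sees
labels = target[1:]           # what it must predict at each step

for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: input {seen:>2} -> predict {expected}")
```

This corresponds to `tgt[:-1, :]` and `tgt[1:, :]` in the training code, applied along the time dimension of the whole batch.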

In [None]:

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TransformerModel(len(SRC_vocab), len(TGT_vocab)).to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX_DE)

def train_epoch(model, loader):
    model.train()
    total_loss = 0
    for src, tgt in loader:
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        optimizer.zero_grad()
        out = model(src, tgt[:-1, :])
        out = out.reshape(-1, out.shape[-1])
        tgt_y = tgt[1:, :].reshape(-1)
        loss = criterion(out, tgt_y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def evaluate(model, loader):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for src, tgt in loader:
            src, tgt = src.to(DEVICE), tgt.to(DEVICE)
            out = model(src, tgt[:-1, :])
            out = out.reshape(-1, out.shape[-1])
            tgt_y = tgt[1:, :].reshape(-1)
            loss = criterion(out, tgt_y)
            total_loss += loss.item()
    return total_loss / len(loader)

@torch.no_grad()  # inference only: avoid building an autograd graph when called outside calc_bleu
def translate(model, src_sentence, max_len=MAX_LEN):
    model.eval()
    src_ids = torch.tensor(encode_bpe(src_sentence, en_tokenizer), dtype=torch.long).unsqueeze(1).to(DEVICE)
    
    src_len = src_ids.shape[0]
    src_pos = torch.arange(0, src_len, device=DEVICE).unsqueeze(1)
    src_emb = model.src_tok_emb(src_ids) + model.pos_encoder(src_pos)
    
    src_padding_mask_single = (src_ids == PAD_IDX_EN).transpose(0, 1)
    memory = model.transformer.encoder(src_emb, src_key_padding_mask=src_padding_mask_single)
    
    ys = torch.tensor([[SOS_IDX_DE]], dtype=torch.long).to(DEVICE)
    
    for i in range(max_len - 1):
        tgt_pos = torch.arange(0, ys.shape[0], device=DEVICE).unsqueeze(1)
        tgt_emb = model.tgt_tok_emb(ys) + model.pos_encoder(tgt_pos)
        
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(ys.shape[0]).to(DEVICE)
        
        out = model.transformer.decoder(
            tgt_emb,
            memory,
            tgt_mask=tgt_mask,
            memory_key_padding_mask=src_padding_mask_single
        )
        
        logits = model.generator(out[-1, :])
        next_word_id = logits.argmax(1).item()
        
        ys = torch.cat([ys, torch.tensor([[next_word_id]], device=DEVICE)], dim=0)
        
        if next_word_id == EOS_IDX_DE:
            break
    
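    # Drop <sos> and filter special tokens. Slicing with [1:] (not [1:-1]) keeps the
    # last generated token when decoding stops at max_len without producing <eos>.
    pred_tokens_ids = [id_val for id_val in ys[1:, 0].tolist() if id_val not in (PAD_IDX_DE, SOS_IDX_DE, EOS_IDX_DE)]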
    
    pred_sentence = de_tokenizer.decode(pred_tokens_ids)
    return pred_sentence

def calc_bleu(model, loader, num_batches=30):
    refs, hyps = [], []
    model.eval()
    with torch.no_grad():
        for i, (src_batch, tgt_batch) in enumerate(loader):
            if i >= num_batches: break
            for b in range(src_batch.shape[1]):
                src_ids_original = [id.item() for id in src_batch[:, b].cpu().numpy() if id != PAD_IDX_EN]
                src_decoded_text = en_tokenizer.decode([id_val for id_val in src_ids_original if id_val not in [SOS_IDX_EN, EOS_IDX_EN, PAD_IDX_EN]])

                tgt_ids_original = [id.item() for id in tgt_batch[:, b].cpu().numpy() if id != PAD_IDX_DE]
                tgt_decoded_text = de_tokenizer.decode([id_val for id_val in tgt_ids_original if id_val not in [SOS_IDX_DE, EOS_IDX_DE, PAD_IDX_DE]])
                
                pred = translate(model, src_decoded_text)
                
                refs.append([tgt_decoded_text])
                hyps.append(pred)
                
    bleu = sacrebleu.corpus_bleu(hyps, list(zip(*refs)))
    return bleu.score
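
The control flow of greedy decoding in `translate` can be isolated from the model. In the sketch below, `toy_next_token_scores` is a hypothetical stand-in for the decoder plus `generator` layer, and `SOS = 1`, `EOS = 2` are assumed IDs. Only the loop structure is meant to match the notebook.

```python
SOS, EOS = 1, 2  # assumed special-token IDs

def toy_next_token_scores(prefix):
    """Hypothetical stand-in for the decoder: scores over a 6-token vocabulary."""
    scores = [0.0] * 6
    # Prefer token 3, then 4, then 5, then <eos>, depending on the prefix length.
    preferred = [3, 4, 5, EOS]
    scores[preferred[min(len(prefix) - 1, len(preferred) - 1)]] = 1.0
    return scores

def greedy_decode(score_fn, max_len=10):
    """Append the highest-scoring token until <eos> is produced or max_len is hit."""
    ys = [SOS]
    for _ in range(max_len - 1):
        scores = score_fn(ys)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ys.append(next_id)
        if next_id == EOS:
            break
    # Strip special tokens before detokenizing.
    return [tok for tok in ys if tok not in (SOS, EOS)]

decoded = greedy_decode(toy_next_token_scores)
print(decoded)
```

Greedy decoding commits to the single best token at every step, which is simple and fast but can miss translations that a wider search would find.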



7. Training Loop and Final Evaluation 
- Runs the training loop for 20 epochs, printing training loss, validation loss, and BLEU score after each epoch.
- Saves the model with the best validation BLEU score.
- Loads the best model for final evaluation on the test set.
- Computes BLEU score on the test set and prints example translations for inspection.

In [None]:
import time
import torch.nn as nn

EPOCHS = 20 

best_val_bleu = -1.0
model_save_path = 'best_transformer_model_bleu.pt' 

print("Starting training loop...")
for epoch in range(1, EPOCHS + 1):
    start = time.time()
    print(f"\n--- Starting Epoch {epoch}/{EPOCHS} ---")
    
    train_loss = train_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    

    val_bleu = calc_bleu(model, val_loader, num_batches=30) 
    
    print(f"Epoch {epoch} | Train Loss: {train_loss:.3f} | Val Loss: {val_loss:.3f} | Val BLEU: {val_bleu:.2f} | Time: {time.time()-start:.1f}s")
    
    if val_bleu > best_val_bleu:
        best_val_bleu = val_bleu
        torch.save(model.state_dict(), model_save_path)
        print(f"*** New best validation BLEU: {best_val_bleu:.2f}. Model saved to '{model_save_path}'! ***")
    else:
        print(f"Validation BLEU did not improve. Current best: {best_val_bleu:.2f}")


print("\nTraining complete.")



Starting training loop...

--- Starting Epoch 1/20 ---




Epoch 1 | Train Loss: 5.559 | Val Loss: 4.881 | Val BLEU: 5.60 | Time: 244.4s
*** New best validation BLEU: 5.60. Model saved to 'best_transformer_model_bleu.pt'! ***

--- Starting Epoch 2/20 ---
Epoch 2 | Train Loss: 4.486 | Val Loss: 4.077 | Val BLEU: 9.01 | Time: 248.0s
*** New best validation BLEU: 9.01. Model saved to 'best_transformer_model_bleu.pt'! ***

--- Starting Epoch 3/20 ---
Epoch 3 | Train Loss: 3.868 | Val Loss: 3.574 | Val BLEU: 11.70 | Time: 245.9s
*** New best validation BLEU: 11.70. Model saved to 'best_transformer_model_bleu.pt'! ***

--- Starting Epoch 4/20 ---
Epoch 4 | Train Loss: 3.478 | Val Loss: 3.309 | Val BLEU: 13.58 | Time: 247.5s
*** New best validation BLEU: 13.58. Model saved to 'best_transformer_model_bleu.pt'! ***

--- Starting Epoch 5/20 ---
Epoch 5 | Train Loss: 3.224 | Val Loss: 3.113 | Val BLEU: 15.50 | Time: 246.3s
*** New best validation BLEU: 15.50. Model saved to 'best_transformer_model_bleu.pt'! ***

--- Starting Epoch 6/20 ---
Epoch 6 | Trai

Insights:


Model Performance: 
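- The Transformer model achieved a final validation BLEU score of 20.79, a reasonable result for a small model (3 encoder and 3 decoder layers, 256-dimensional embeddings) trained from scratch on IWSLT2017 English-German.
- In the epochs logged above, validation BLEU improved every epoch, from 5.60 after epoch 1 to 15.50 after epoch 5, while training and validation loss both fell.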

BLEU score:
- BLEU scores above 20 are typical for small-scale experiments like this one.
- Systems trained on larger, more diverse corpora typically target BLEU scores of 30-40 or higher.

The model demonstrates reasonable translation quality for the IWSLT2017 dataset, which is suitable for benchmarking translation models on small-scale datasets.


8. Final Test BLEU Calculation and Example Translations
- Loads the best model saved during training.
- Computes BLEU score on the test set using all batches.
- Prints example translations for comparison between source, target, and predicted sentences.
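
One detail of the BLEU computation that is easy to get wrong is the shape of the references. `sacrebleu.corpus_bleu` takes a list of hypotheses and a list of reference streams, where each stream is a list parallel to the hypotheses. `calc_bleu` collects one inner list per sentence and transposes it with `zip(*refs)`. A small sketch of that transposition, using made-up sentences:

```python
hyps = ["das ist ein Test", "guten Morgen"]

# One inner list per sentence, as collected inside calc_bleu.
refs_per_sentence = [["das ist ein Test"], ["guten Morgen zusammen"]]

# Transpose to one list per reference stream, each parallel to hyps.
ref_streams = [list(stream) for stream in zip(*refs_per_sentence)]

print(ref_streams)
# sacrebleu.corpus_bleu(hyps, ref_streams) would then score the whole corpus.
```

With a single reference per sentence, as in this notebook, the result is one stream whose length equals the number of hypotheses.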

In [8]:
print("Loading best model for final evaluation on test set...")
model.load_state_dict(torch.load(model_save_path, map_location=DEVICE))

print("\n--- Final Test BLEU Calculation ---")

test_bleu = calc_bleu(model, test_loader, num_batches=len(test_loader))
print(f"Test BLEU: {test_bleu:.2f}")

print("\n--- Example Translations ---")
for i in range(5): 
    en_orig = test_data[i]['translation']['en']
    de_orig = test_data[i]['translation']['de']
    pred_de = translate(model, en_orig)
    print(f"EN: {en_orig}\nDE: {de_orig}\nPRED: {pred_de}\n---")

Loading best model for final evaluation on test set...

--- Final Test BLEU Calculation ---
Test BLEU: 22.07

--- Example Translations ---
EN: Several years ago here at TED, Peter Skillman  introduced a design challenge  called the marshmallow challenge.
DE: Vor einigen Jahren, hier bei TED, stellte Peter Skillman einen Design-Wettbewerb namens "Die Marshmallow-Herausforderung" vor.
PRED: Vor einigen Jahren hier bei TED, Peter Skillman  stellte eine Design-Herausforderung  namens "Maymeistungs-Herausforderung".
---
EN: And the idea's pretty simple:  Teams of four have to build the tallest free-standing structure  out of 20 sticks of spaghetti,  one yard of tape, one yard of string  and a marshmallow.
DE: Die Idee ist ziemlich einfach. Vierer-teams müssen die größtmögliche freistehende Struktur mit 20 Spaghetti, ca. 1m Klebeband, ca. 1m Faden und einem Marshmallow bauen.
PRED: Und die Idee ist ziemlich einfach:  Teleams von vier müssen die höchsten kostenlose Struktur bauen,  aus 20 Stö

Conclusion:

- The model achieved a test BLEU score of 22.07 on the IWSLT2017 English-German test set.
- Translations capture the overall sentence structure but contain errors on rare words and compounds, for example "Maymeistungs-Herausforderung" for "marshmallow challenge" and "kostenlose" ("free of charge") for "free-standing".