# Transformer Translation Model - Google Colab Example

This notebook demonstrates training a Transformer model for German-to-English translation using the Multi30k dataset.

**Requirements:**
- Runtime: GPU (go to Runtime → Change runtime type → GPU)
- Estimated training time: ~20-30 minutes on GPU

**What this notebook does:**
1. Clones the transformer implementation repository
2. Installs required dependencies
3. Trains a Transformer model on German→English translation
4. Shows sample translations during training

## 1. Setup: Clone Repository and Install Dependencies

In [None]:
# Clone the repository
!git clone https://github.com/TopThisHat/deep-learning-implementation.git
%cd deep-learning-implementation

In [None]:
# Install the transformer library and its dependencies
!pip install -e .

# Verify installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Verify GPU Access

Make sure you're using a GPU runtime for faster training. If the cell below shows "Using device: cpu", go to **Runtime → Change runtime type** and select **GPU**.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if device.type == "cuda":
    print("✓ GPU is available!")
    print(f"  Device name: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️  Running on CPU. For faster training, enable GPU in Runtime → Change runtime type.")

## 3. Configure Training Parameters

You can adjust these parameters based on your needs:
- **Smaller model** (faster training, lower quality): Reduce `d_model`, `n_layers`, `n_epochs`
- **Better results** (slower training): Increase `n_epochs`, `vocab_size`

In [None]:
# Model hyperparameters
d_model = 256          # Embedding dimension
n_heads = 8            # Number of attention heads
n_layers = 4           # Number of encoder/decoder layers
d_ff = 1024           # Feed-forward dimension
dropout = 0.1          # Dropout rate
max_len = 128          # Maximum sequence length
vocab_size = 8000      # Vocabulary size for BPE

# Training hyperparameters
batch_size = 64        # Batch size (reduce if out of memory)
n_epochs = 10          # Number of training epochs
warmup_steps = 2000    # Learning rate warmup steps

print("Training configuration:")
print(f"  Model: d_model={d_model}, n_layers={n_layers}, n_heads={n_heads}")
print(f"  Training: batch_size={batch_size}, epochs={n_epochs}")
print(f"  Vocab: {vocab_size} tokens per language")
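
To get a feel for how `d_model`, `n_layers`, and `vocab_size` affect model size, the rough estimate below counts weight matrices only. It assumes a standard encoder-decoder layout with separate source and target embeddings and an untied output projection. The repository's `Transformer` may differ (biases, layer norms, and any weight tying are ignored here), so treat the result as a ballpark. `estimate_params` is a hypothetical helper, not part of the library.

```python
def estimate_params(d_model, n_layers, d_ff, vocab_size):
    """Ballpark weight count for a standard encoder-decoder Transformer.

    Assumptions: separate source/target embeddings, an untied output
    projection, and no biases or layer-norm parameters.
    """
    attn = 4 * d_model * d_model           # Q, K, V and output projections
    ffn = 2 * d_model * d_ff               # two linear layers
    encoder = n_layers * (attn + ffn)      # self-attention + feed-forward
    decoder = n_layers * (2 * attn + ffn)  # self-attention + cross-attention + feed-forward
    embeddings = 2 * vocab_size * d_model  # source and target embedding tables
    output_proj = d_model * vocab_size     # final projection to the target vocabulary
    return encoder + decoder + embeddings + output_proj

baseline = estimate_params(d_model=256, n_layers=4, d_ff=1024, vocab_size=8000)
smaller = estimate_params(d_model=128, n_layers=2, d_ff=512, vocab_size=8000)
print(f"baseline: ~{baseline / 1e6:.1f}M weights")
print(f"smaller:  ~{smaller / 1e6:.1f}M weights")
```

Under these assumptions the embedding tables and output projection account for a large share of the total, which is why `vocab_size` matters as much as depth for a model this small.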

## 4. Load and Prepare Dataset

We'll use the Multi30k German-English translation dataset. It contains roughly 30,000 sentence pairs: 29,000 for training, plus validation and test splits of about 1,000 pairs each.

In [None]:
from datasets import load_dataset

print("Loading Multi30k dataset...")
try:
    dataset = load_dataset("bentrevett/multi30k")
    print("✓ Dataset loaded successfully")
except Exception as e:
    print(f"Could not load bentrevett/multi30k: {e}")
    print("Trying alternative dataset source...")
    dataset = load_dataset(
        "wmt14", 
        "de-en", 
        split={
            "train": "train[:30000]", 
            "validation": "validation[:1000]", 
            "test": "test[:1000]"
        }
    )
    print("✓ Alternative dataset loaded")

# Extract texts based on dataset format
if "translation" in dataset["train"].features:
    # WMT format
    train_src = [ex["translation"]["de"] for ex in dataset["train"]]
    train_tgt = [ex["translation"]["en"] for ex in dataset["train"]]
    val_src = [ex["translation"]["de"] for ex in dataset["validation"]]
    val_tgt = [ex["translation"]["en"] for ex in dataset["validation"]]
else:
    # Multi30k format
    train_src = [ex["de"] for ex in dataset["train"]]
    train_tgt = [ex["en"] for ex in dataset["train"]]
    val_src = [ex["de"] for ex in dataset["validation"]]
    val_tgt = [ex["en"] for ex in dataset["validation"]]

print(f"\nDataset statistics:")
print(f"  Training examples: {len(train_src):,}")
print(f"  Validation examples: {len(val_src):,}")
print(f"\nSample data:")
print(f"  German: {train_src[0]}")
print(f"  English: {train_tgt[0]}")

## 5. Train Tokenizers

We'll train BPE (Byte-Pair Encoding) tokenizers for both German and English.

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Special tokens
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
SPECIAL_TOKENS = [PAD_TOKEN, UNK_TOKEN, BOS_TOKEN, EOS_TOKEN]

def train_tokenizer(texts, vocab_size=8000):
    """Train a BPE tokenizer on the given texts."""
    tokenizer = Tokenizer(BPE(unk_token=UNK_TOKEN))
    tokenizer.pre_tokenizer = Whitespace()
    
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        min_frequency=2,
    )
    
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer

print("Training German tokenizer...")
src_tokenizer = train_tokenizer(train_src, vocab_size=vocab_size)
print(f"✓ German vocabulary size: {src_tokenizer.get_vocab_size()}")

print("Training English tokenizer...")
tgt_tokenizer = train_tokenizer(train_tgt, vocab_size=vocab_size)
print(f"✓ English vocabulary size: {tgt_tokenizer.get_vocab_size()}")

# Test tokenizers
print(f"\nTokenization example:")
test_de = train_src[0]
tokens = src_tokenizer.encode(test_de)
print(f"  Text: {test_de}")
print(f"  Tokens: {tokens.tokens[:10]}...")
print(f"  IDs: {tokens.ids[:10]}...")
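
To make the idea behind BPE concrete, here is a toy pure-Python version of the merge-learning loop: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. This only illustrates the algorithm. It is not how the `tokenizers` library implements it, and `learn_bpe_merges` is a hypothetical helper.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a list of words (toy version, starting from characters)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        merges.append(best)
        # Rewrite every word, replacing the best pair with a merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

Each merge adds one entry to the vocabulary, so `vocab_size` effectively caps how many merges the trainer performs on top of the base alphabet and special tokens.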

## 6. Create Dataset and DataLoader

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class TranslationDataset(Dataset):
    """Dataset for translation task."""
    
    def __init__(self, src_texts, tgt_texts, src_tokenizer, tgt_tokenizer, max_len=128):
        self.src_texts = src_texts
        self.tgt_texts = tgt_texts
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer
        self.max_len = max_len
        
        # Get special token ids
        self.src_pad_id = src_tokenizer.token_to_id(PAD_TOKEN)
        self.src_bos_id = src_tokenizer.token_to_id(BOS_TOKEN)
        self.src_eos_id = src_tokenizer.token_to_id(EOS_TOKEN)
        
        self.tgt_pad_id = tgt_tokenizer.token_to_id(PAD_TOKEN)
        self.tgt_bos_id = tgt_tokenizer.token_to_id(BOS_TOKEN)
        self.tgt_eos_id = tgt_tokenizer.token_to_id(EOS_TOKEN)
    
    def __len__(self):
        return len(self.src_texts)
    
    def __getitem__(self, idx):
        src_text = self.src_texts[idx]
        tgt_text = self.tgt_texts[idx]
        
        # Tokenize
        src_ids = self.src_tokenizer.encode(src_text).ids
        tgt_ids = self.tgt_tokenizer.encode(tgt_text).ids
        
        # Add BOS and EOS tokens, truncate if needed
        src_ids = [self.src_bos_id] + src_ids[: self.max_len - 2] + [self.src_eos_id]
        tgt_ids = [self.tgt_bos_id] + tgt_ids[: self.max_len - 2] + [self.tgt_eos_id]
        
        return src_ids, tgt_ids

def collate_fn(batch, src_pad_id, tgt_pad_id):
    """Collate function for DataLoader with dynamic padding."""
    src_batch, tgt_batch = zip(*batch)
    
    # Find max lengths in this batch
    src_max_len = max(len(s) for s in src_batch)
    tgt_max_len = max(len(t) for t in tgt_batch)
    
    # Pad sequences
    src_padded = []
    tgt_padded = []
    
    for src, tgt in zip(src_batch, tgt_batch):
        src_padded.append(src + [src_pad_id] * (src_max_len - len(src)))
        tgt_padded.append(tgt + [tgt_pad_id] * (tgt_max_len - len(tgt)))
    
    return torch.tensor(src_padded), torch.tensor(tgt_padded)

# Get special token IDs
src_pad_id = src_tokenizer.token_to_id(PAD_TOKEN)
tgt_pad_id = tgt_tokenizer.token_to_id(PAD_TOKEN)

# Create datasets
train_dataset = TranslationDataset(
    train_src, train_tgt, src_tokenizer, tgt_tokenizer, max_len=max_len
)
val_dataset = TranslationDataset(
    val_src, val_tgt, src_tokenizer, tgt_tokenizer, max_len=max_len
)

# Create dataloaders
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=lambda b: collate_fn(b, src_pad_id, tgt_pad_id),
    num_workers=0,
)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=lambda b: collate_fn(b, src_pad_id, tgt_pad_id),
    num_workers=0,
)

print("✓ Created datasets and dataloaders")
print(f"  Train batches: {len(train_loader)}")
print(f"  Val batches: {len(val_loader)}")
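
The collate function pads each batch only to the length of its own longest sequence, and the training loop later splits every padded target into a decoder input and an expected output offset by one position. The small pure-Python example below shows both steps on made-up token ids. The special-token ids are an assumption (they follow the order of `SPECIAL_TOKENS`), and `pad_batch` is a hypothetical stand-in for `collate_fn`.

```python
PAD, BOS, EOS = 0, 2, 3  # assumed ids, matching the order of SPECIAL_TOKENS

def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

batch = [[BOS, 11, 12, 13, EOS], [BOS, 21, EOS]]
padded = pad_batch(batch, PAD)

# Teacher forcing: the decoder sees tokens up to position t and predicts position t + 1.
tgt_input = [seq[:-1] for seq in padded]
tgt_output = [seq[1:] for seq in padded]

print(padded)      # both rows now have length 5
print(tgt_input)   # drops the last position
print(tgt_output)  # drops the first position (BOS)
```

Padding positions in `tgt_output` carry no information, which is why the loss and the token counts in the training loop exclude them.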

## 7. Create Transformer Model

Now we'll instantiate the Transformer model with our configured parameters.

In [None]:
from transformer import Transformer

src_vocab_size = src_tokenizer.get_vocab_size()
tgt_vocab_size = tgt_tokenizer.get_vocab_size()

model = Transformer(
    src_vocab_size=src_vocab_size,
    tgt_vocab_size=tgt_vocab_size,
    d_model=d_model,
    n_heads=n_heads,
    n_layers=n_layers,
    d_ff=d_ff,
    dropout=dropout,
    max_len=max_len + 10,
).to(device)

n_params = sum(p.numel() for p in model.parameters())
print("✓ Model created")
print(f"  Parameters: {n_params:,}")
print(f"  Memory: ~{n_params * 4 / 1e6:.1f} MB (fp32)")

## 8. Define Training Functions

We'll define functions for training, evaluation, and translation.

In [None]:
import math
import time
from transformer.functional import create_padding_mask
from transformer.loss import LabelSmoothingLoss
from transformer.optim import create_noam_optimizer

def train_epoch(model, dataloader, scheduler, criterion, device, src_pad_id, tgt_pad_id):
    """Train for one epoch."""
    model.train()
    total_loss = 0.0
    total_tokens = 0
    
    for batch_idx, (src, tgt) in enumerate(dataloader):
        src = src.to(device)
        tgt = tgt.to(device)
        
        # Target input (shift right) and target output
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]
        
        # Create padding masks
        src_mask = create_padding_mask(src, src_pad_id)
        tgt_mask = create_padding_mask(tgt_input, tgt_pad_id)
        
        # Forward pass
        scheduler.zero_grad()
        logits = model(src, tgt_input, src_mask, tgt_mask)
        
        # Compute loss (flatten for cross entropy)
        loss = criterion(
            logits.reshape(-1, logits.size(-1)),
            tgt_output.reshape(-1),
        )
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        scheduler.optimizer.step()
        scheduler.step()
        
        # Count non-padding tokens for accurate loss
        n_tokens = (tgt_output != tgt_pad_id).sum().item()
        total_loss += loss.item() * n_tokens
        total_tokens += n_tokens
        
        if (batch_idx + 1) % 100 == 0:
            avg_loss = total_loss / total_tokens
            lr = scheduler.get_lr()
            print(f"  Batch {batch_idx + 1}/{len(dataloader)} | Loss: {avg_loss:.4f} | LR: {lr:.2e}")
    
    return total_loss / total_tokens

@torch.no_grad()
def evaluate(model, dataloader, criterion, device, src_pad_id, tgt_pad_id):
    """Evaluate the model."""
    model.eval()
    total_loss = 0.0
    total_tokens = 0
    
    for src, tgt in dataloader:
        src = src.to(device)
        tgt = tgt.to(device)
        
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]
        
        src_mask = create_padding_mask(src, src_pad_id)
        tgt_mask = create_padding_mask(tgt_input, tgt_pad_id)
        
        logits = model(src, tgt_input, src_mask, tgt_mask)
        
        loss = criterion(
            logits.reshape(-1, logits.size(-1)),
            tgt_output.reshape(-1),
        )
        
        n_tokens = (tgt_output != tgt_pad_id).sum().item()
        total_loss += loss.item() * n_tokens
        total_tokens += n_tokens
    
    return total_loss / total_tokens

@torch.no_grad()
def translate(model, src_text, src_tokenizer, tgt_tokenizer, device, max_len=64):
    """Translate a source sentence."""
    model.eval()
    
    # Get special token ids
    src_bos_id = src_tokenizer.token_to_id(BOS_TOKEN)
    src_eos_id = src_tokenizer.token_to_id(EOS_TOKEN)
    src_pad_id = src_tokenizer.token_to_id(PAD_TOKEN)
    
    tgt_bos_id = tgt_tokenizer.token_to_id(BOS_TOKEN)
    tgt_eos_id = tgt_tokenizer.token_to_id(EOS_TOKEN)
    tgt_pad_id = tgt_tokenizer.token_to_id(PAD_TOKEN)
    
    # Encode source
    src_ids = src_tokenizer.encode(src_text).ids
    src_ids = [src_bos_id] + src_ids + [src_eos_id]
    src = torch.tensor([src_ids], device=device)
    src_mask = create_padding_mask(src, src_pad_id)
    
    # Start with BOS token
    tgt_ids = [tgt_bos_id]
    
    for _ in range(max_len):
        tgt = torch.tensor([tgt_ids], device=device)
        tgt_mask = create_padding_mask(tgt, tgt_pad_id)
        
        logits = model(src, tgt, src_mask, tgt_mask)
        next_token = logits[0, -1].argmax().item()
        
        if next_token == tgt_eos_id:
            break
        
        tgt_ids.append(next_token)
    
    # Decode (skip BOS)
    return tgt_tokenizer.decode(tgt_ids[1:])

print("✓ Training functions defined")
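
The training functions above rely on a label-smoothed loss. As a rough illustration of what smoothing does to the target, the sketch below builds the smoothed distribution for a single token. It assumes one common variant (the true token gets `1 - smoothing` and the rest is spread evenly over all other non-padding tokens); `transformer.loss.LabelSmoothingLoss` may distribute the mass differently, and `smoothed_targets` is a hypothetical helper.

```python
def smoothed_targets(target_id, vocab_size, pad_id, smoothing=0.1):
    """Build a label-smoothed target distribution for one non-padding token.

    Assumed variant: the true token gets 1 - smoothing, and the remaining
    mass is spread evenly over every other token except padding.
    """
    other = smoothing / (vocab_size - 2)  # exclude the true token and the pad token
    dist = [other] * vocab_size
    dist[target_id] = 1.0 - smoothing
    dist[pad_id] = 0.0
    return dist

dist = smoothed_targets(target_id=3, vocab_size=6, pad_id=0, smoothing=0.1)
print([round(p, 3) for p in dist])
```

Because the target is no longer one-hot, the model is discouraged from becoming overconfident, at the cost of a reported loss that does not approach zero the way plain cross-entropy can.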

## 9. Train the Model

Now we'll train the model for the specified number of epochs. This will take approximately 20-30 minutes on a GPU.
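
The training cell uses a Noam-style learning rate schedule with `warmup_steps` warmup steps. Assuming `create_noam_optimizer` follows the formula from "Attention Is All You Need" (the repository's version may differ in details), the learning rate rises linearly during warmup and then decays with the inverse square root of the step. `noam_lr` below is a hypothetical helper that reproduces that formula.

```python
def noam_lr(step, d_model=256, warmup_steps=2000, factor=1.0):
    """Learning rate at a given step, per the schedule in "Attention Is All You Need"."""
    step = max(step, 1)  # avoid raising 0 to a negative power at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 500, 2000, 4500):
    print(f"step {step:>5}: lr = {noam_lr(step):.2e}")
```

With about 29,000 training pairs and `batch_size = 64`, one epoch is roughly 454 steps, so 10 epochs come to about 4,500 steps. A warmup of 2,000 steps therefore covers a little under half of training, which is worth keeping in mind when changing `n_epochs` or `batch_size`.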

In [None]:
# Create optimizer and scheduler
scheduler = create_noam_optimizer(
    model, d_model=d_model, warmup_steps=warmup_steps, factor=1.0
)

# Loss function
criterion = LabelSmoothingLoss(
    vocab_size=tgt_vocab_size,
    pad_token=tgt_pad_id,
    smoothing=0.1,
)

# Training loop
print("Starting training...\n")
best_val_loss = float("inf")

for epoch in range(n_epochs):
    start_time = time.time()
    
    print(f"Epoch {epoch + 1}/{n_epochs}")
    train_loss = train_epoch(
        model, train_loader, scheduler, criterion,
        device, src_pad_id, tgt_pad_id
    )
    val_loss = evaluate(model, val_loader, criterion, device, src_pad_id, tgt_pad_id)
    
    elapsed = time.time() - start_time
    
    print(f"\n  Train Loss: {train_loss:.4f} | Train PPL: {math.exp(train_loss):.2f}")
    print(f"  Val Loss: {val_loss:.4f} | Val PPL: {math.exp(val_loss):.2f}")
    print(f"  Time: {elapsed:.1f}s")
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
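        torch.save({
            "model_state_dict": model.state_dict(),
            "src_vocab_size": src_vocab_size,
            "tgt_vocab_size": tgt_vocab_size,
            "d_model": d_model,
            "n_heads": n_heads,
            "n_layers": n_layers,
            # Also store the remaining constructor arguments so the model
            # can be rebuilt from the checkpoint alone.
            "d_ff": d_ff,
            "dropout": dropout,
            "max_len": max_len + 10,
        }, "best_translation_model.pt")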
        print("  ✓ Saved best model")
    
    # Sample translations
    print("\n  Sample translations:")
    for i, src_text in enumerate(val_src[:3]):
        translation = translate(model, src_text, src_tokenizer, tgt_tokenizer, device)
        reference = val_tgt[i]
        print(f"\n    [{i+1}] DE: {src_text}")
        print(f"        Predicted EN: {translation}")
        print(f"        Reference EN: {reference}")
    print("\n" + "="*80 + "\n")

print("\n🎉 Training complete!")
print(f"Best validation loss: {best_val_loss:.4f} | PPL: {math.exp(best_val_loss):.2f}")

## 10. Try Your Own Translations

Now you can test the model with your own German sentences!

In [None]:
# Try your own translations here!
test_sentences = [
    "Ein Mann sitzt auf einer Bank.",
    "Eine Frau geht durch den Park.",
    "Das Wetter ist heute schön.",
    "Ich liebe maschinelles Lernen.",
]

print("Custom translations:\n")
for i, de_text in enumerate(test_sentences, 1):
    en_translation = translate(model, de_text, src_tokenizer, tgt_tokenizer, device)
    print(f"{i}. DE: {de_text}")
    print(f"   EN: {en_translation}\n")

## 11. Download Model (Optional)

You can download the trained model to use it later.

In [None]:
from google.colab import files

# Download the best model
print("Downloading model...")
files.download('best_translation_model.pt')
print("✓ Model downloaded!")

## Summary

In this notebook, we:
1. ✅ Cloned the Transformer implementation repository
2. ✅ Installed dependencies and verified GPU access
3. ✅ Loaded and prepared the Multi30k German-English dataset
4. ✅ Trained BPE tokenizers for both languages
5. ✅ Created and trained a Transformer model
6. ✅ Evaluated the model and generated sample translations

### Next Steps:
- Experiment with different hyperparameters (model size, learning rate, etc.)
- Try training for more epochs for better quality
- Test with different language pairs
- Implement beam search for better translation quality
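
As a starting point for the beam search item, here is a minimal, model-agnostic sketch. It is not part of the repository: `beam_search` and `toy_step` are hypothetical, and `step_fn` stands in for a call that would run the decoder on a prefix and return log-probabilities for the next token. The toy table is chosen so that greedy decoding (beam size 1) picks a locally attractive first token that leads to a worse overall sequence.

```python
import math

def beam_search(step_fn, bos_id, eos_id, beam_size=3, max_len=10):
    """Generic beam search over token ids.

    step_fn(prefix) must return a dict {token_id: log_prob} for the next token.
    Returns the highest-scoring sequence (including BOS, and EOS if reached).
    """
    beams = [([bos_id], 0.0)]  # (token ids, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token_id, log_prob in step_fn(tokens).items():
                candidates.append((tokens + [token_id], score + log_prob))
        # Keep only the best `beam_size` continuations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda c: c[1])[0]

def toy_step(prefix):
    """Hand-written next-token log-probabilities (0 = BOS, 1 = EOS)."""
    table = {
        (0,): {2: math.log(0.6), 3: math.log(0.4)},
        (0, 2): {1: math.log(0.4), 2: math.log(0.3), 3: math.log(0.3)},
        (0, 3): {1: math.log(0.95), 2: math.log(0.05)},
    }
    return table.get(tuple(prefix), {1: 0.0})

print(beam_search(toy_step, bos_id=0, eos_id=1, beam_size=1))  # greedy path
print(beam_search(toy_step, bos_id=0, eos_id=1, beam_size=2))  # wider beam
```

This version compares raw cumulative log-probabilities, which favors shorter outputs. A practical implementation would typically add length normalization and batch the decoder calls.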

### Resources:
- [Attention Is All You Need (Original Paper)](https://arxiv.org/abs/1706.03762)
- [Repository](https://github.com/TopThisHat/deep-learning-implementation)