### **State University of Campinas - UNICAMP** </br>
**Course**: MC886A </br>
**Professor**: Marcelo da Silva Reis </br>
**TA (PED)**: Marcos Vinicius Souza Freire

---

### **Hands-On: Transformers and Attention Mechanisms**
##### Notebook: 01 Translate with Transformers
---

**Objectives:** Translate from German to English using Transformers.

---

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from torch.nn.utils.rnn import pad_sequence
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from collections import Counter
import re
import time
from tqdm import tqdm
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import random

In [2]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


In [3]:
# ====================== Data Preparation ======================
def create_synthetic_data(num_samples=50000):
    """Create synthetic German-English translation pairs"""

    # German patterns and their English translations
    patterns = [
        # Basic patterns
        ("Ich bin {age} Jahre alt", "I am {age} years old", [str(i) for i in range(18, 80)]),
        ("Das ist {color}", "This is {color}", ["rot", "blau", "grün", "gelb", "schwarz", "weiß"]),
        ("Ich liebe {food}", "I love {food}", ["Pizza", "Pasta", "Brot", "Käse", "Fleisch", "Gemüse"]),
        ("Der {animal} ist groß", "The {animal} is big", ["Hund", "Katze", "Elefant", "Löwe", "Tiger"]),
        ("Ich gehe zum {place}", "I go to the {place}", ["Markt", "Schule", "Park", "Geschäft", "Restaurant"]),

        # More complex patterns
        ("Es ist {time} Uhr", "It is {time} o'clock", [str(i) for i in range(1, 13)]),
        ("Ich wohne in {city}", "I live in {city}", ["Berlin", "München", "Hamburg", "Köln", "Frankfurt"]),
        ("Das Wetter ist {weather}", "The weather is {weather}", ["schön", "schlecht", "kalt", "warm", "regnerisch"]),
        ("Ich arbeite als {job}", "I work as a {job}", ["Lehrer", "Arzt", "Ingenieur", "Koch", "Verkäufer"]),
        ("Die {thing} ist {adj}", "The {thing} is {adj}", ["Auto", "Haus", "Blume", "Musik", "Farbe"]),
    ]

    adjectives = ["schön", "gut", "neu", "alt", "groß", "klein", "schnell", "langsam"]

    src_sentences = []
    tgt_sentences = []

    for _ in range(num_samples):
        pattern_de, pattern_en, values = random.choice(patterns)

        if "{color}" in pattern_de:
            colors_de = ["rot", "blau", "grün", "gelb", "schwarz", "weiß"]
            colors_en = ["red", "blue", "green", "yellow", "black", "white"]
            color_idx = random.randint(0, len(colors_de)-1)
            src = pattern_de.format(color=colors_de[color_idx])
            tgt = pattern_en.format(color=colors_en[color_idx])
        elif "{food}" in pattern_de:
            foods_de = ["Pizza", "Pasta", "Brot", "Käse", "Fleisch", "Gemüse"]
            foods_en = ["pizza", "pasta", "bread", "cheese", "meat", "vegetables"]
            food_idx = random.randint(0, len(foods_de)-1)
            src = pattern_de.format(food=foods_de[food_idx])
            tgt = pattern_en.format(food=foods_en[food_idx])
        elif "{animal}" in pattern_de:
            animals_de = ["Hund", "Katze", "Elefant", "Löwe", "Tiger"]
            animals_en = ["dog", "cat", "elephant", "lion", "tiger"]
            animal_idx = random.randint(0, len(animals_de)-1)
            src = pattern_de.format(animal=animals_de[animal_idx])
            tgt = pattern_en.format(animal=animals_en[animal_idx])
        elif "{place}" in pattern_de:
            places_de = ["Markt", "Schule", "Park", "Geschäft", "Restaurant"]
            places_en = ["market", "school", "park", "store", "restaurant"]
            place_idx = random.randint(0, len(places_de)-1)
            src = pattern_de.format(place=places_de[place_idx])
            tgt = pattern_en.format(place=places_en[place_idx])
        elif "{weather}" in pattern_de:
            weather_de = ["schön", "schlecht", "kalt", "warm", "regnerisch"]
            weather_en = ["nice", "bad", "cold", "warm", "rainy"]
            weather_idx = random.randint(0, len(weather_de)-1)
            src = pattern_de.format(weather=weather_de[weather_idx])
            tgt = pattern_en.format(weather=weather_en[weather_idx])
        elif "{job}" in pattern_de:
            jobs_de = ["Lehrer", "Arzt", "Ingenieur", "Koch", "Verkäufer"]
            jobs_en = ["teacher", "doctor", "engineer", "cook", "salesperson"]
            job_idx = random.randint(0, len(jobs_de)-1)
            src = pattern_de.format(job=jobs_de[job_idx])
            tgt = pattern_en.format(job=jobs_en[job_idx])
        elif "{thing}" in pattern_de and "{adj}" in pattern_de:
            things_de = ["Auto", "Haus", "Blume", "Musik", "Farbe"]
            things_en = ["car", "house", "flower", "music", "color"]
            thing_idx = random.randint(0, len(things_de)-1)
            adj_idx = random.randint(0, len(adjectives)-1)
            adj_en = ["beautiful", "good", "new", "old", "big", "small", "fast", "slow"][adj_idx]
            src = pattern_de.format(thing=things_de[thing_idx], adj=adjectives[adj_idx])
            tgt = pattern_en.format(thing=things_en[thing_idx], adj=adj_en)
        else:
            value = random.choice(values)
            src = pattern_de.format(**{list(re.findall(r'\{(\w+)\}', pattern_de))[0]: value})
            tgt = pattern_en.format(**{list(re.findall(r'\{(\w+)\}', pattern_en))[0]: value})

        src_sentences.append(src)
        tgt_sentences.append(tgt)

    return src_sentences, tgt_sentences

In [4]:
class TranslationDataset(Dataset):
    def __init__(self, src_sentences, tgt_sentences, src_vocab, tgt_vocab, max_length=50):
        self.src_sentences = src_sentences
        self.tgt_sentences = tgt_sentences
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        self.max_length = max_length

    def __len__(self):
        return len(self.src_sentences)

    def __getitem__(self, idx):
        src_tokens = ['<sos>'] + self.src_sentences[idx].lower().split() + ['<eos>']
        tgt_tokens = ['<sos>'] + self.tgt_sentences[idx].lower().split() + ['<eos>']

        src_ids = [self.src_vocab.get(token, self.src_vocab['<unk>'])
                  for token in src_tokens[:self.max_length]]
        tgt_ids = [self.tgt_vocab.get(token, self.tgt_vocab['<unk>'])
                  for token in tgt_tokens[:self.max_length]]

        return (torch.tensor(src_ids, dtype=torch.long),
                torch.tensor(tgt_ids, dtype=torch.long))

In [5]:
def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)

    # Pad source sequences
    src_padded = pad_sequence(src_batch, batch_first=True, padding_value=0)
    src_mask = (src_padded != 0).float()

    # Pad target sequences
    tgt_padded = pad_sequence(tgt_batch, batch_first=True, padding_value=0)
    tgt_mask = (tgt_padded != 0).float()

    return (src_padded.to(device), src_mask.to(device),
            tgt_padded.to(device), tgt_mask.to(device))

def build_vocab(sentences, vocab_size=5000):
    """Build vocabulary from sentences"""
    token_counter = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        token_counter.update(tokens)

    vocab = {'<pad>': 0, '<unk>': 1, '<sos>': 2, '<eos>': 3}
    for idx, (token, _) in enumerate(token_counter.most_common(vocab_size), start=4):
        vocab[token] = idx

    return vocab

In [6]:
# ====================== Model Architecture ======================
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        embedded = self.dropout(self.embedding(src))  # [batch_size, src_len, emb_dim]

        # Pack padded sequence for efficiency
        lengths = src_mask.sum(dim=1).cpu()
        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths, batch_first=True, enforce_sorted=False
        )

        packed_outputs, hidden = self.rnn(packed_embedded)
        # force the unpacked outputs back to the full padded length:
        outputs, _ = nn.utils.rnn.pad_packed_sequence(
            packed_outputs,
            batch_first=True,
            total_length=src.size(1)
        )


        # hidden is [2, batch_size, enc_hid_dim] (bidirectional)
        # Combine forward and backward hidden states
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)))
        # hidden is now [batch_size, dec_hid_dim]

        return outputs, hidden

In [7]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs, mask):
        batch_size = encoder_outputs.shape[0]
        src_len = encoder_outputs.shape[1]

        # Repeat hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)

        # Calculate energy
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)

        # Handle size mismatch: truncate mask if longer than src_len
        if mask.size(1) > src_len:
            mask = mask[:, :src_len]
        # Pad mask if shorter than src_len
        elif mask.size(1) < src_len:
            pad = torch.zeros(batch_size, src_len - mask.size(1), device=mask.device)
            mask = torch.cat([mask, pad], dim=1)

        # Mask padded positions
        attention = attention.masked_fill(mask == 0, -1e10)

        return torch.softmax(attention, dim=1)

In [9]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim, batch_first=True)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs, mask):
        # input: [batch_size]
        # hidden: [batch_size, dec_hid_dim]
        # encoder_outputs: [batch_size, src_len, enc_hid_dim * 2]
        # mask: [batch_size, src_len]

        input = input.unsqueeze(1)  # [batch_size, 1]
        embedded = self.dropout(self.embedding(input))  # [batch_size, 1, emb_dim]

        # Calculate attention
        a = self.attention(hidden, encoder_outputs, mask)  # [batch_size, src_len]
        a = a.unsqueeze(1)  # [batch_size, 1, src_len]

        # Calculate weighted context vector
        weighted = torch.bmm(a, encoder_outputs)  # [batch_size, 1, enc_hid_dim * 2]

        # Concatenate embedding and context
        rnn_input = torch.cat((embedded, weighted), dim=2)  # [batch_size, 1, emb_dim + enc_hid_dim * 2]

        # Pass through RNN
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        # output: [batch_size, 1, dec_hid_dim]
        # hidden: [1, batch_size, dec_hid_dim]

        # Remove sequence dimension
        output = output.squeeze(1)  # [batch_size, dec_hid_dim]
        weighted = weighted.squeeze(1)  # [batch_size, enc_hid_dim * 2]
        embedded = embedded.squeeze(1)  # [batch_size, emb_dim]

        # Generate prediction
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=1))

        return prediction, hidden.squeeze(0), a.squeeze(1)

In [10]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device

    def create_mask(self, src):
        mask = (src != self.src_pad_idx).float()
        return mask

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        tgt_len = tgt.shape[1]
        tgt_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
        attention_weights = torch.zeros(batch_size, tgt_len, src.shape[1]).to(self.device)

        src_mask = self.create_mask(src)
        encoder_outputs, hidden = self.encoder(src, src_mask)

        input = tgt[:, 0]  # Start with <sos> token

        for t in range(1, tgt_len):
            output, hidden, attn_weights = self.decoder(input, hidden, encoder_outputs, src_mask)
            outputs[:, t] = output
            attention_weights[:, t] = attn_weights

            # Teacher forcing
            teacher_force = np.random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = tgt[:, t] if teacher_force else top1

        return outputs, attention_weights

In [11]:
# ====================== Training & Evaluation ======================
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    pbar = tqdm(iterator, desc="Training")

    for i, (src, src_mask, tgt, tgt_mask) in enumerate(pbar):
        optimizer.zero_grad()

        # Forward pass
        output, _ = model(src, tgt[:, :-1], teacher_forcing_ratio=0.8)  # Higher teacher forcing

        # Reshape for loss calculation
        output_dim = output.shape[-1]
        output = output.contiguous().view(-1, output_dim)
        tgt = tgt[:, 1:].contiguous().view(-1)

        # Calculate loss
        loss = criterion(output, tgt)
        loss.backward()

        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        epoch_loss += loss.item()
        pbar.set_postfix(loss=loss.item())

    return epoch_loss / len(iterator)

In [12]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for (src, src_mask, tgt, tgt_mask) in tqdm(iterator, desc="Evaluating"):
            output, _ = model(src, tgt[:, :-1], teacher_forcing_ratio=0)  # No teacher forcing
            output_dim = output.shape[-1]
            output = output.contiguous().view(-1, output_dim)
            tgt = tgt[:, 1:].contiguous().view(-1)
            loss = criterion(output, tgt)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [13]:
def calculate_bleu(model, data_loader, src_vocab, tgt_vocab, max_samples=100):
    """Calculate BLEU score on a subset of data"""
    model.eval()
    bleu_scores = []
    smoothie = SmoothingFunction().method4

    # Create reverse vocab mappings
    src_idx_to_token = {idx: token for token, idx in src_vocab.items()}
    tgt_idx_to_token = {idx: token for token, idx in tgt_vocab.items()}

    sample_count = 0
    with torch.no_grad():
        for (src, src_mask, tgt, tgt_mask) in data_loader:
            if sample_count >= max_samples:
                break

            for i in range(min(src.shape[0], max_samples - sample_count)):
                # Get source and target sentences
                src_tokens = [src_idx_to_token[idx.item()] for idx in src[i]
                             if idx.item() != 0 and idx.item() in src_idx_to_token]
                tgt_tokens = [tgt_idx_to_token[idx.item()] for idx in tgt[i]
                             if idx.item() != 0 and idx.item() in tgt_idx_to_token
                             and tgt_idx_to_token[idx.item()] not in ['<sos>', '<eos>']]

                # Generate translation
                translated = translate_sentence(model, src[i:i+1], src_vocab, tgt_vocab)
                translated_tokens = translated.split()

                if len(translated_tokens) > 0 and len(tgt_tokens) > 0:
                    try:
                        bleu = sentence_bleu([tgt_tokens], translated_tokens, smoothing_function=smoothie)
                        bleu_scores.append(bleu)
                    except:
                        pass

                sample_count += 1
                if sample_count >= max_samples:
                    break

    return np.mean(bleu_scores) if bleu_scores else 0.0

In [14]:
def translate_sentence(model, src_tensor, src_vocab, tgt_vocab, max_len=50):
    """Translate a single sentence"""
    model.eval()

    # Create reverse vocab mapping
    tgt_idx_to_token = {idx: token for token, idx in tgt_vocab.items()}

    # src_mask = (src_tensor != 0).float()
    src_mask = (src_tensor != src_vocab['<pad>']).float()

    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor, src_mask)

    trg_indexes = [tgt_vocab['<sos>']]

    for i in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)

        with torch.no_grad():
            output, hidden, _ = model.decoder(trg_tensor, hidden, encoder_outputs, src_mask)

        pred_token = output.argmax(1).item()
        trg_indexes.append(pred_token)

        if pred_token == tgt_vocab['<eos>']:
            break

    trg_tokens = [tgt_idx_to_token.get(idx, '<unk>') for idx in trg_indexes[1:-1]]
    return ' '.join(trg_tokens)

In [15]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [16]:
# ====================== Visualization Functions ======================
def plot_training_curves(train_losses, val_losses, bleu_scores):
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Training Progress', 'BLEU Score Progress'),
        specs=[[{"secondary_y": False}, {"secondary_y": False}]]
    )

    # Loss curves
    fig.add_trace(
        go.Scatter(x=list(range(1, len(train_losses)+1)), y=train_losses,
                  mode='lines+markers', name='Train Loss', line=dict(color='red')),
        row=1, col=1
    )
    fig.add_trace(
        go.Scatter(x=list(range(1, len(val_losses)+1)), y=val_losses,
                  mode='lines+markers', name='Val Loss', line=dict(color='blue')),
        row=1, col=1
    )

    # BLEU scores
    fig.add_trace(
        go.Scatter(x=list(range(1, len(bleu_scores)+1)), y=bleu_scores,
                  mode='lines+markers', name='BLEU Score', line=dict(color='green')),
        row=1, col=2
    )

    fig.update_xaxes(title_text="Epoch", row=1, col=1)
    fig.update_yaxes(title_text="Loss", row=1, col=1)
    fig.update_xaxes(title_text="Epoch", row=1, col=2)
    fig.update_yaxes(title_text="BLEU Score", row=1, col=2)

    fig.update_layout(height=400, showlegend=True, title_text="Training Progress")
    fig.show()

In [17]:
def plot_attention(attention, src_tokens, tgt_tokens):
    """Plot attention heatmap"""
    fig = go.Figure(data=go.Heatmap(
        z=attention,
        x=src_tokens,
        y=tgt_tokens,
        colorscale='Blues',
        hoverongaps=False
    ))
    fig.update_layout(
        title='Attention Weights Visualization',
        xaxis_title='Source Tokens',
        yaxis_title='Target Tokens',
        width=800,
        height=600,
        xaxis_tickangle=-45
    )
    fig.show()

In [18]:
# ====================== Main Execution ======================
def main():
    # Hyperparameters
    BATCH_SIZE = 32
    ENC_EMB_DIM = 128
    DEC_EMB_DIM = 128
    ENC_HID_DIM = 256
    DEC_HID_DIM = 256
    ENC_DROPOUT = 0.3
    DEC_DROPOUT = 0.3
    N_EPOCHS = 15
    CLIP = 1.0
    LEARNING_RATE = 0.0005  # Lower learning rate
    VOCAB_SIZE = 2000

    print("Creating synthetic German-English dataset...")
    src_sentences, tgt_sentences = create_synthetic_data(num_samples=10000)

    print(f"Dataset size: {len(src_sentences)} sentence pairs")
    print("\nSample translations:")
    for i in range(5):
        print(f"DE: {src_sentences[i]}")
        print(f"EN: {tgt_sentences[i]}\n")

    # Build vocabularies
    print("Building vocabularies...")
    src_vocab = build_vocab(src_sentences, vocab_size=VOCAB_SIZE)
    tgt_vocab = build_vocab(tgt_sentences, vocab_size=VOCAB_SIZE)

    print(f"Source vocabulary size: {len(src_vocab)}")
    print(f"Target vocabulary size: {len(tgt_vocab)}")

    # Create dataset
    dataset = TranslationDataset(src_sentences, tgt_sentences, src_vocab, tgt_vocab)

    # Split dataset
    train_size = int(0.8 * len(dataset))
    val_size = int(0.1 * len(dataset))
    test_size = len(dataset) - train_size - val_size

    train_data, val_data, test_data = random_split(
        dataset, [train_size, val_size, test_size]
    )

    # Create dataloaders
    train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
    val_loader = DataLoader(val_data, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    # Initialize model
    INPUT_DIM = len(src_vocab)
    OUTPUT_DIM = len(tgt_vocab)
    SRC_PAD_IDX = src_vocab['<pad>']

    attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
    enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
    dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)
    model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)

    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

    # Initialize weights
    def init_weights(m):
        for name, param in m.named_parameters():
            if 'weight' in name:
                nn.init.normal_(param.data, mean=0, std=0.01)
            else:
                nn.init.constant_(param.data, 0)

    model.apply(init_weights)

    # Optimizer and loss
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
    criterion = nn.CrossEntropyLoss(ignore_index=tgt_vocab['<pad>'])

    # Training loop
    best_val_loss = float('inf')
    train_losses = []
    val_losses = []
    bleu_scores = []

    print("\nStarting training...")
    for epoch in range(N_EPOCHS):
        start_time = time.time()

        train_loss = train(model, train_loader, optimizer, criterion, CLIP)
        val_loss = evaluate(model, val_loader, criterion)

        # Calculate BLEU score
        bleu = calculate_bleu(model, val_loader, src_vocab, tgt_vocab, max_samples=50)

        end_time = time.time()
        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        train_losses.append(train_loss)
        val_losses.append(val_loss)
        bleu_scores.append(bleu)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pt')

        print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss:.3f} | Val. Loss: {val_loss:.3f} | BLEU: {bleu:.3f}')

    # Plot results
    plot_training_curves(train_losses, val_losses, bleu_scores)

    # Test translations
    print("\n" + "="*50)
    print("TRANSLATION EXAMPLES")
    print("="*50)

    test_sentences = [
        "Ich bin 25 Jahre alt",
        "Das ist rot",
        "Ich liebe Pizza",
        "Der Hund ist groß",
        "Es ist 3 Uhr",
        "Das Wetter ist schön"
    ]

    model.eval()
    for sent in test_sentences:
        # Convert to tensor
        tokens = ['<sos>'] + sent.lower().split() + ['<eos>']
        src_ids = [src_vocab.get(token, src_vocab['<unk>']) for token in tokens]
        src_tensor = torch.LongTensor(src_ids).unsqueeze(0).to(device)

        translation = translate_sentence(model, src_tensor, src_vocab, tgt_vocab)
        print(f"German:  {sent}")
        print(f"English: {translation}")
        print()

    # Attention visualization
    print("="*50)
    print("ATTENTION VISUALIZATION")
    print("="*50)

    # Get a sample from test data
    sample_batch = next(iter(test_loader))
    src, src_mask, tgt, tgt_mask = sample_batch

    # Pick first sample
    sample_src = src[0:1]
    sample_tgt = tgt[0:1]

    model.eval()
    with torch.no_grad():
        _, attention_weights = model(sample_src, sample_tgt[:, :-1], teacher_forcing_ratio=0)

    # Convert indices back to tokens
    src_idx_to_token = {idx: token for token, idx in src_vocab.items()}
    tgt_idx_to_token = {idx: token for token, idx in tgt_vocab.items()}

    src_tokens = [src_idx_to_token[idx.item()] for idx in sample_src[0]
                  if idx.item() != 0 and idx.item() in src_idx_to_token]
    tgt_tokens = [tgt_idx_to_token[idx.item()] for idx in sample_tgt[0, 1:]
                  if idx.item() != 0 and idx.item() in tgt_idx_to_token]

    if len(src_tokens) > 0 and len(tgt_tokens) > 0:
        # Get attention weights for visualization
        attn_to_plot = attention_weights[0, 1:len(tgt_tokens)+1, :len(src_tokens)].cpu().numpy()
        plot_attention(attn_to_plot, src_tokens, tgt_tokens)

    print("Training completed successfully!")
    print(f"Best validation loss: {best_val_loss:.3f}")
    print(f"Final BLEU score: {bleu_scores[-1]:.3f}")

In [19]:
if __name__ == "__main__":
    try:
        main()
    except ImportError as e:
        if "nltk" in str(e):
            print("NLTK not found. Installing a simple BLEU implementation...")

            def simple_bleu(reference, candidate, n=4):
                """Simple BLEU implementation without NLTK"""
                if len(candidate) == 0:
                    return 0.0

                # Calculate precision for different n-grams
                precisions = []
                for i in range(1, n+1):
                    ref_ngrams = [tuple(reference[j:j+i]) for j in range(len(reference)-i+1)]
                    cand_ngrams = [tuple(candidate[j:j+i]) for j in range(len(candidate)-i+1)]

                    if len(cand_ngrams) == 0:
                        precisions.append(0)
                        continue

                    matches = 0
                    for ngram in cand_ngrams:
                        if ngram in ref_ngrams:
                            matches += 1

                    precisions.append(matches / len(cand_ngrams))

                # Geometric mean of precisions
                if any(p == 0 for p in precisions):
                    return 0.0

                geometric_mean = np.exp(np.mean(np.log(precisions)))

                # Brevity penalty
                bp = min(1, np.exp(1 - len(reference) / len(candidate)))

                return bp * geometric_mean

            # Replace the BLEU calculation function
            def calculate_bleu(model, data_loader, src_vocab, tgt_vocab, max_samples=100):
                """Calculate BLEU score using simple implementation"""
                model.eval()
                bleu_scores = []

                src_idx_to_token = {idx: token for token, idx in src_vocab.items()}
                tgt_idx_to_token = {idx: token for token, idx in tgt_vocab.items()}

                sample_count = 0
                with torch.no_grad():
                    for (src, src_mask, tgt, tgt_mask) in data_loader:
                        if sample_count >= max_samples:
                            break

                        for i in range(min(src.shape[0], max_samples - sample_count)):
                            tgt_tokens = [tgt_idx_to_token[idx.item()] for idx in tgt[i]
                                         if idx.item() != 0 and idx.item() in tgt_idx_to_token
                                         and tgt_idx_to_token[idx.item()] not in ['<sos>', '<eos>']]

                            translated = translate_sentence(model, src[i:i+1], src_vocab, tgt_vocab)
                            translated_tokens = translated.split()

                            if len(translated_tokens) > 0 and len(tgt_tokens) > 0:
                                bleu = simple_bleu(tgt_tokens, translated_tokens)
                                bleu_scores.append(bleu)

                            sample_count += 1
                            if sample_count >= max_samples:
                                break

                return np.mean(bleu_scores) if bleu_scores else 0.0

            # Re-run main with simple BLEU
            main()
        else:
            raise e

Creating synthetic German-English dataset...
Dataset size: 10000 sentence pairs

Sample translations:
DE: Das Wetter ist schön
EN: The weather is nice

DE: Ich gehe zum Markt
EN: I go to the market

DE: Das ist gelb
EN: This is yellow

DE: Das ist rot
EN: This is red

DE: Ich arbeite als Verkäufer
EN: I work as a salesperson

Building vocabularies...
Source vocabulary size: 144
Target vocabulary size: 145
Model parameters: 1,778,065

Starting training...


Training: 100%|██████████| 250/250 [00:32<00:00,  7.81it/s, loss=1.65]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 29.30it/s]


Epoch: 01 | Time: 0m 33s
	Train Loss: 2.741 | Val. Loss: 2.005 | BLEU: 0.112


Training: 100%|██████████| 250/250 [00:48<00:00,  5.15it/s, loss=1.41]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 29.45it/s]


Epoch: 02 | Time: 0m 50s
	Train Loss: 1.567 | Val. Loss: 1.513 | BLEU: 0.159


Training: 100%|██████████| 250/250 [00:32<00:00,  7.71it/s, loss=1.28]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 25.35it/s]


Epoch: 03 | Time: 0m 34s
	Train Loss: 1.413 | Val. Loss: 1.390 | BLEU: 0.199


Training: 100%|██████████| 250/250 [00:31<00:00,  7.94it/s, loss=1.44]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 30.48it/s]


Epoch: 04 | Time: 0m 32s
	Train Loss: 1.343 | Val. Loss: 1.303 | BLEU: 0.243


Training: 100%|██████████| 250/250 [00:34<00:00,  7.18it/s, loss=1.16]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 28.60it/s]


Epoch: 05 | Time: 0m 36s
	Train Loss: 1.264 | Val. Loss: 1.236 | BLEU: 0.292


Training: 100%|██████████| 250/250 [00:31<00:00,  7.82it/s, loss=1.28]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 19.44it/s]


Epoch: 06 | Time: 0m 34s
	Train Loss: 1.201 | Val. Loss: 1.163 | BLEU: 0.305


Training: 100%|██████████| 250/250 [00:31<00:00,  7.88it/s, loss=1.14]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 30.00it/s]


Epoch: 07 | Time: 0m 33s
	Train Loss: 1.143 | Val. Loss: 1.125 | BLEU: 0.310


Training: 100%|██████████| 250/250 [00:32<00:00,  7.66it/s, loss=1.05]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 29.70it/s]


Epoch: 08 | Time: 0m 34s
	Train Loss: 1.105 | Val. Loss: 1.085 | BLEU: 0.351


Training: 100%|██████████| 250/250 [00:31<00:00,  7.89it/s, loss=1.11]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 29.93it/s]


Epoch: 09 | Time: 0m 33s
	Train Loss: 1.080 | Val. Loss: 1.109 | BLEU: 0.299


Training: 100%|██████████| 250/250 [00:32<00:00,  7.77it/s, loss=1.04]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 30.29it/s]


Epoch: 10 | Time: 0m 33s
	Train Loss: 1.072 | Val. Loss: 1.059 | BLEU: 0.349


Training: 100%|██████████| 250/250 [00:31<00:00,  8.05it/s, loss=1.05]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 20.64it/s]


Epoch: 11 | Time: 0m 33s
	Train Loss: 1.064 | Val. Loss: 1.048 | BLEU: 0.336


Training: 100%|██████████| 250/250 [00:31<00:00,  7.88it/s, loss=1.04]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 30.35it/s]


Epoch: 12 | Time: 0m 33s
	Train Loss: 1.048 | Val. Loss: 1.042 | BLEU: 0.349


Training: 100%|██████████| 250/250 [00:32<00:00,  7.73it/s, loss=1.03]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 30.65it/s]


Epoch: 13 | Time: 0m 33s
	Train Loss: 1.047 | Val. Loss: 1.040 | BLEU: 0.349


Training: 100%|██████████| 250/250 [00:31<00:00,  7.96it/s, loss=1.02]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 30.27it/s]


Epoch: 14 | Time: 0m 32s
	Train Loss: 1.041 | Val. Loss: 1.060 | BLEU: 0.349


Training: 100%|██████████| 250/250 [00:32<00:00,  7.79it/s, loss=1.06]
Evaluating: 100%|██████████| 32/32 [00:01<00:00, 30.62it/s]


Epoch: 15 | Time: 0m 33s
	Train Loss: 1.034 | Val. Loss: 1.029 | BLEU: 0.349



TRANSLATION EXAMPLES
German:  Ich bin 25 Jahre alt
English: am 25 years old

German:  Das ist rot
English: is red

German:  Ich liebe Pizza
English: love pizza

German:  Der Hund ist groß
English: dog is big

German:  Es ist 3 Uhr
English: is 3 o'clock

German:  Das Wetter ist schön
English: weather is nice

ATTENTION VISUALIZATION


Training completed successfully!
Best validation loss: 1.029
Final BLEU score: 0.349
