# **Neural Machine Translation using Transformer (English → Spanish)**

# Introduction:

 This notebook demonstrates a complete Transformer-based Neural Machine Translation (NMT) system
 built entirely from scratch using PyTorch. The model translates English sentences into Spanish.

# Key Features:
 - Implements the full Transformer architecture: Encoder, Decoder, Attention, and FFN layers
 - Uses real parallel sentence data (English-Spanish) from the spa-eng dataset
 - Includes data preprocessing, tokenization, vocabulary building, and padding
 - Trains using cross-entropy loss and the Adam optimizer, with a linear warmup learning-rate schedule and gradient clipping
 - Performs greedy decoding to generate translations at inference time

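 This project is designed for educational purposes: it aims to give a deep understanding of how the Transformer works internally. The model itself does not rely on high-level libraries such as Hugging Face; only the learning-rate scheduler helper (`get_linear_schedule_with_warmup`) is imported from `transformers`.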


# 1. Setup

In [None]:
import torch
import random
import matplotlib.pyplot as plt
import torch.nn as nn

from dataloader_generator import (
    prepareData, load_and_preprocess_data, TranslationDataset,
    collate_batch, PAD_token, SOS_token, EOS_token, MAX_LENGTH
)
from model import Transformer
from utils import train_transformer
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup

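# Seed both RNGs: torch for training, Python's random for the sampled examples in section 8
torch.manual_seed(13)
random.seed(13)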
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# 2. Hyperparameters

In [None]:
# Hyperparameters
BATCH_SIZE = 32
learning_rate = 1e-3
num_epochs = 2000

num_layers = 2
embed_size = 256  # Try 512 for better performance on GPU
d_out_n_heads = embed_size
ffn_hidden_dim = 4 * embed_size
num_heads = 4  # Should divide d_out_n_heads evenly
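
The divisibility requirement noted in the comment above can be made explicit with a small check. This is only a sketch that repeats the values from this cell; `head_dim` is a name introduced here for illustration and may not match what `model.py` uses internally.

```python
# Values copied from the hyperparameter cell above
embed_size = 256
num_heads = 4
d_out_n_heads = embed_size
ffn_hidden_dim = 4 * embed_size

# Each attention head works on an equal slice of the projection,
# so the projection width must be a multiple of the head count.
assert d_out_n_heads % num_heads == 0, "num_heads must divide d_out_n_heads"

head_dim = d_out_n_heads // num_heads  # hypothetical name: per-head width
print(f"head_dim={head_dim}, ffn_hidden_dim={ffn_hidden_dim}")
```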


# 3. Dataset Preparation

In [None]:
# Load, normalize and tokenize data
text_pairs = load_and_preprocess_data()

# Keep only the first translation seen for each English sentence
seen_eng = set()
unique_text_pairs = []
for eng, spa in text_pairs:
    if eng not in seen_eng:
        unique_text_pairs.append((eng, spa))
        seen_eng.add(eng)
text_pairs = unique_text_pairs

# Prepare vocabulary and pairs
input_lang, output_lang, pairs = prepareData('eng', 'spa', text_pairs)
dataset = TranslationDataset(pairs, input_lang, output_lang)

# Create DataLoader
train_dl = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
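
`collate_batch` lives in `dataloader_generator` and is not shown in this notebook. A collate function of this kind typically pads every sequence in a batch to the length of the longest one. The sketch below illustrates that idea only; `pad_collate` and `PAD_IDX = 0` are hypothetical stand-ins, not the project's actual function or padding index.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # hypothetical padding index; the notebook imports its own PAD_token


def pad_collate(batch):
    """Pad variable-length (source, target) index tensors to a common length."""
    src_seqs, tgt_seqs = zip(*batch)
    src = pad_sequence(list(src_seqs), batch_first=True, padding_value=PAD_IDX)
    tgt = pad_sequence(list(tgt_seqs), batch_first=True, padding_value=PAD_IDX)
    return src, tgt


# Two toy sentence pairs with different lengths
batch = [
    (torch.tensor([5, 6, 7]), torch.tensor([8, 9])),
    (torch.tensor([5]), torch.tensor([8, 9, 10, 11])),
]
src, tgt = pad_collate(batch)
print(src.shape, tgt.shape)
```

Source and target are padded independently, so each tensor is only as wide as the longest sequence on its own side of the batch.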


# 4. Model Initialization

In [None]:
src_vocab_size, target_vocab_size = input_lang.n_words, output_lang.n_words

# Initialize Transformer model
transformer_model = Transformer(
    num_layers=num_layers,
    src_vocab_size=src_vocab_size,
    target_vocab_size=target_vocab_size,
    embed_size=embed_size,
    d_out_n_heads=d_out_n_heads,
    num_heads=num_heads,
    ffn_hidden_dim=ffn_hidden_dim,
    dropout=0.5,
    context_length=MAX_LENGTH,
    qkv_bias=False,
    PAD_token=PAD_token
).to(device)
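
For orientation, the computation that `num_heads` and `d_out_n_heads` configure can be sketched as generic scaled dot-product attention. This is not the code in `model.py`, which is not shown here; the function name, the mask convention (True means "may attend") and the tensor shapes are assumptions made for the example.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, head_dim); mask: bool, True = may attend."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Blocked positions get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
q = k = v = torch.randn(2, 4, 5, 64)  # 4 heads of width 256 / 4 = 64
causal = torch.tril(torch.ones(5, 5, dtype=torch.bool))  # decoder-style mask
out, attn = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape, attn.shape)
```

The causal mask shown is the one a decoder uses for self-attention, so position `i` cannot look at positions after `i`. Encoder self-attention and cross-attention would instead mask only padding.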


# 5. Loss, Optimizer & Scheduler

In [None]:
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_token)
optimizer = torch.optim.Adam(transformer_model.parameters(), lr=learning_rate)

total_steps = len(train_dl) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps
)
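
The schedule requested above multiplies the base learning rate by a factor that rises linearly from 0 to 1 over the warmup steps and then falls linearly back to 0 at `total_steps`. The plain-Python sketch below reproduces that shape for illustration; `linear_warmup_decay` is a hypothetical helper, not a function from `transformers`.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplicative LR factor: linear ramp up, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Example: 1000 optimizer steps with 10% warmup, base learning rate 1e-3
base_lr = 1e-3
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay(step, 100, 1000)
    print(step, base_lr * factor)
```

Because `total_steps` is counted in batches (`len(train_dl) * num_epochs`), a schedule like this only matches its intended shape if the scheduler is stepped once per batch rather than once per epoch.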


# 6. Train the Transformer



In [None]:
train_loss = train_transformer(
    transformer_model,
    train_dl,
    num_epochs,
    loss_fn,
    optimizer,
    scheduler,
    device,
    input_lang,
    output_lang,
    clip_norm=True,
    max_norm=1.0,
    MAX_LENGTH=MAX_LENGTH,
    SOS_token=SOS_token,
    EOS_token=EOS_token
)
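
`train_transformer` is defined in `utils` and is not shown here. A single optimization step for this kind of model usually combines teacher forcing (the decoder sees the target shifted right by one position) with gradient-norm clipping. The sketch below illustrates that pattern under stated assumptions: `train_step`, `ToySeq2Seq` and `PAD_IDX` are hypothetical, and the toy model merely stands in for the real Transformer.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # hypothetical padding index


class ToySeq2Seq(nn.Module):
    """Tiny stand-in model: maps (src, tgt_in) to per-position vocabulary logits."""

    def __init__(self, vocab_size=12, dim=16):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, dim)
        self.tgt_emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src, tgt_in):
        context = self.src_emb(src).mean(dim=1, keepdim=True)  # (batch, 1, dim)
        return self.out(self.tgt_emb(tgt_in) + context)


def train_step(model, src, tgt, loss_fn, optimizer, max_norm=1.0):
    # Teacher forcing: feed tokens [0..n-1], predict tokens [1..n]
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
    logits = model(src, tgt_in)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients so their global L2 norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = ToySeq2Seq()
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
src = torch.randint(1, 12, (4, 6))
tgt = torch.randint(1, 12, (4, 5))
loss = train_step(model, src, tgt, loss_fn, optimizer)
print(f"loss: {loss:.4f}")
```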


# 7. Loss Curve

In [None]:
plt.figure(figsize=(10,5))
plt.plot(train_loss)
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True)
plt.title("Transformer Training Loss vs. Epoch")
plt.show()


# 8. Translation Examples

In [None]:
transformer_model.eval()  # make sure dropout is disabled before generating
print("Sample translations:")
for _ in range(10):
    eng, spa = random.choice(text_pairs)
    print(f"Input: {eng}")
    print(f"Target: {spa}")
    result = transformer_model.generate(
        eng,
        input_lang,
        output_lang,
        max_len=MAX_LENGTH,
        SOS_token=SOS_token,
        EOS_token=EOS_token
    )
    predicted_tokens = result['output']
    predicted_sentence = " ".join([output_lang.index2word.get(idx, 'UNK') for idx in predicted_tokens])
    print(f"Predicted: {predicted_sentence}")
    print("-" * 80)
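
The internals of `generate` are not shown in this notebook. Greedy decoding in general picks the highest-scoring token at every step and stops at the end-of-sentence token or at a length limit. The sketch below shows that loop in plain Python; `greedy_decode`, `toy_scores` and the token indices are hypothetical and exist only for the example.

```python
def greedy_decode(next_token_scores, sos_token, eos_token, max_len):
    """Repeatedly append the highest-scoring next token until EOS or max_len."""
    tokens = [sos_token]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the SOS token


def toy_scores(prefix):
    """Stand-in for a model: scripted to emit tokens 3, 4, 2 and then EOS (1)."""
    script = [3, 4, 2, 1]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(5)]


decoded = greedy_decode(toy_scores, sos_token=0, eos_token=1, max_len=10)
truncated = greedy_decode(toy_scores, sos_token=0, eos_token=1, max_len=2)
print(decoded, truncated)
```

In a real model, `next_token_scores` would run the decoder on the tokens generated so far and return the logits for the last position.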


# **Conclusion**

 - We implemented a Transformer model from scratch for English-to-Spanish translation.
 - The model includes core Transformer components:
   * Positional Encoding
   * Multi-head Self-Attention
   * Cross-Attention (Decoder to Encoder)
   * Feed Forward Networks
   * Layer Normalization and Dropout
 - Training used the Adam optimizer, a linear warmup learning-rate schedule, and gradient clipping (max norm 1.0).
 - Greedy decoding was used to generate predictions from the trained model.
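 - Convergence can be checked from the training-loss plot in section 7.
 - The sample translations in section 8 are drawn from the training pairs, so they show how well the model fits the data rather than how it generalizes; no held-out evaluation is included.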

 Future Improvements:
 - Implement beam search decoding, which keeps several candidate translations per step and usually yields better output than greedy decoding.
 - Train on larger datasets for broader generalization.
 - Add label smoothing to improve generalization.
 - Incorporate model checkpointing for save/load support.
 - Explore multilingual support and pretraining methods (e.g., masked language modeling).
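
Two of the improvements above need very little code in PyTorch. The sketch below shows label smoothing through the `label_smoothing` argument of `nn.CrossEntropyLoss` and a basic save/load round trip with `state_dict`. It is illustrative only: `PAD_IDX`, the smoothing value 0.1, the stand-in `nn.Linear` model and the checkpoint layout are assumptions, not part of this project.

```python
import os
import tempfile

import torch
import torch.nn as nn

PAD_IDX = 0  # hypothetical padding index

# Label smoothing: spread a small share of probability mass over non-target classes
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

# Stand-in model and optimizer; the real notebook would use transformer_model
model = nn.Linear(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save everything needed to resume training
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": 3},
    ckpt_path,
)

# Restore into a freshly constructed model with the same architecture
restored = nn.Linear(8, 4)
state = torch.load(ckpt_path)
restored.load_state_dict(state["model"])
print("resumed from epoch", state["epoch"])
```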
