# Introduction

Machine translation is a core task in Natural Language Processing (NLP): automatically translating a sentence from one language into another. This notebook demonstrates a basic **Sequence-to-Sequence (Seq2Seq)** model in PyTorch that translates **English to French** without any attention mechanism.

The key components of this project include:

- Preprocessing English–French sentence pairs (the `fra-eng` archive from manythings.org, which is drawn from the Tatoeba project)
- Building vocabulary and mapping words to indices
- Implementing a **bidirectional GRU encoder** and a **GRU decoder**
- Using **teacher forcing** during training
- Handling variable-length sequences with padding and packing
- Applying **gradient clipping** to stabilize training
- Performing inference using greedy decoding (no beam search)

This notebook serves as a minimal, working baseline for Seq2Seq translation, ideal for learning purposes or benchmarking before introducing more advanced techniques like attention or transformers.

---


Step 1: Install & Import Dependencies

In [None]:
# Install packages (if needed)
# !pip install torch matplotlib

import os
import re
import time
import unicodedata
import random
import torch
import matplotlib.pyplot as plt
from torch import nn
from torch.utils.data import Dataset, DataLoader


Step 2: Preprocessing Utilities

In [None]:
# Special tokens
PAD_token = 0
SOS_token = 1
EOS_token = 2
MAX_LENGTH = 10

def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

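def normalizeString(s):
    # Lowercase, trim, and strip accents.
    s = unicodeToAscii(s.lower().strip())
    # Put a space before sentence punctuation so it becomes its own token.
    s = re.sub(r"([.!?])", r" \1", s)
    # Collapse every run of other characters into one space. Note that "." is not
    # in the kept set, so periods are removed while "!" and "?" survive.
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)
    return s.strip()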

class Language_Dictionary_Builder:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.n_words = 3

    def addSentence(self, sentence):
        for word in sentence.split(" "):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
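
To make the preprocessing concrete, here is a small standalone sketch that repeats the same normalization steps on a few sample sentences and then builds a toy word-to-index map. The snake_case function names and the sample sentences are illustrative assumptions, not part of the notebook; the regular expressions are copied from the cell above.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks (category "Mn").
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def normalize_string(s):
    # Same three steps as normalizeString in the cell above.
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z!?]+", " ", s)
    return s.strip()

samples = ["Je suis très content!", "Hello, world.", "Où est-il ?"]
normalized = [normalize_string(s) for s in samples]
for raw, norm in zip(samples, normalized):
    print(f"{raw!r} -> {norm!r}")

# Toy vocabulary built the way addWord does it: ids 0-2 are reserved for
# PAD/SOS/EOS, so the first real word gets id 3.
word2index = {}
for sentence in normalized:
    for word in sentence.split(" "):
        word2index.setdefault(word, len(word2index) + 3)
print(word2index)
```

Accents, commas, hyphens, and periods all disappear, while `!` and `?` are kept as separate tokens.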


Step 3: Load and Prepare Dataset

In [None]:
# Download dataset
if not os.path.exists('fra.txt'):
    !wget -q https://www.manythings.org/anki/fra-eng.zip
    !unzip -oq fra-eng.zip

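text_pairs = []
# Each line is expected to look like "english<TAB>french<TAB>attribution";
# keep only the first two fields.
with open('fra.txt', 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            continue
        eng, fra = parts[0], parts[1]
        text_pairs.append((normalizeString(eng), normalizeString(fra)))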

def filterPair(p):
    return len(p[0].split(" ")) < MAX_LENGTH and len(p[1].split(" ")) < MAX_LENGTH

def prepareData(lang1, lang2, pairs, reverse=False):
    if reverse:
        pairs = [tuple(reversed(p)) for p in pairs]
        input_lang = Language_Dictionary_Builder(lang2)
        output_lang = Language_Dictionary_Builder(lang1)
    else:
        input_lang = Language_Dictionary_Builder(lang1)
        output_lang = Language_Dictionary_Builder(lang2)

    pairs = [pair for pair in pairs if filterPair(pair)]

    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])

    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra', text_pairs)


Step 4: Dataset + Dataloader

In [None]:
class DatasetEngFra(Dataset):
    def __init__(self, data, input_lang, output_lang):
        self.data = data
        self.input_lang = input_lang
        self.output_lang = output_lang

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        eng, fra = self.data[idx]
        eng_idx = [self.input_lang.word2index[word] for word in eng.split()]
        fra_idx = [self.output_lang.word2index[word] for word in fra.split()]
        fra_idx = [SOS_token] + fra_idx + [EOS_token]
        return torch.tensor(eng_idx), torch.tensor(fra_idx)

def collate_batch(batch):
    eng_batch, fra_input, fra_target, eng_len, fra_len = [], [], [], [], []
    for eng, fra in batch:
        eng_batch.append(eng)
        eng_len.append(len(eng))

        fra_input.append(fra[:-1])
        fra_len.append(len(fra) - 1)
        fra_target.append(fra[1:])

    eng_pad = nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=PAD_token)
    fra_input_pad = nn.utils.rnn.pad_sequence(fra_input, batch_first=True, padding_value=PAD_token)
    fra_target_pad = nn.utils.rnn.pad_sequence(fra_target, batch_first=True, padding_value=PAD_token)

    return eng_pad, fra_input_pad, fra_target_pad, torch.tensor(eng_len), torch.tensor(fra_len)

dataset = DatasetEngFra(pairs, input_lang, output_lang)
train_dl = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_batch)
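
The key idea in `collate_batch` is that the decoder input and the training target are the same sequence shifted by one position, and that both are padded to the longest sequence in the batch. The sketch below shows this on two made-up target sentences; the word ids 4, 7, 8, and 9 are arbitrary placeholders.

```python
import torch
from torch import nn

PAD, SOS, EOS = 0, 1, 2  # same ids as PAD_token, SOS_token, EOS_token above

# Two hypothetical target sentences already mapped to word ids.
targets = [torch.tensor([SOS, 7, 8, 9, EOS]), torch.tensor([SOS, 4, EOS])]

dec_inputs = [t[:-1] for t in targets]   # drop EOS: what the decoder reads
dec_targets = [t[1:] for t in targets]   # drop SOS: what the decoder must predict

inp = nn.utils.rnn.pad_sequence(dec_inputs, batch_first=True, padding_value=PAD)
tgt = nn.utils.rnn.pad_sequence(dec_targets, batch_first=True, padding_value=PAD)
lengths = torch.tensor([len(t) for t in dec_inputs])

print(inp)      # tensor([[1, 7, 8, 9], [1, 4, 0, 0]])
print(tgt)      # tensor([[7, 8, 9, 2], [4, 2, 0, 0]])
print(lengths)  # tensor([4, 2])
```

At each position the decoder reads `inp[:, t]` and is trained to predict `tgt[:, t]`. The padded positions are excluded from the loss later through `ignore_index=PAD_token`.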


Step 5: Define Encoder & Decoder

In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, hidden_size, padding_idx=PAD_token)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, x, lengths):
        embedded = self.embedding(x)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)
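        outputs, hidden = self.gru(packed)  # outputs stays a PackedSequence; only hidden is used downstream
        # Sum the forward (even rows) and backward (odd rows) final states so the result
        # has shape (num_layers, batch, hidden_size), matching the unidirectional decoder.
        hidden = hidden[0::2] + hidden[1::2]
        return outputs, hidden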

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size, padding_idx=PAD_token)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, hidden, input_seq, lengths):
        embedded = self.embedding(input_seq)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)
        output, hidden = self.gru(packed, hidden)
        output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
        return nn.functional.log_softmax(self.out(output), dim=-1), hidden

    def decode_step(self, input_token, hidden):
        embedded = self.embedding(input_token)
        output, hidden = self.gru(embedded, hidden)
        return nn.functional.softmax(self.out(output), dim=-1), hidden
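
The shape handling between the bidirectional encoder and the unidirectional decoder is easy to get wrong, so here is a minimal shape check on random data. The toy sizes and the bare `nn.GRU` modules are assumptions for illustration; the slicing is the same as in `EncoderRNN.forward`.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, hidden_size = 3, 5, 8  # toy sizes

gru = nn.GRU(hidden_size, hidden_size, batch_first=True, bidirectional=True)
x = torch.randn(batch, seq_len, hidden_size)
lengths = torch.tensor([5, 3, 2])  # true lengths; the rest of each row is padding

# Packing lets the GRU skip padded positions, so the final hidden state of each
# sequence comes from its last real token rather than from padding.
packed = nn.utils.rnn.pack_padded_sequence(
    x, lengths, batch_first=True, enforce_sorted=False
)
_, hidden = gru(packed)
print(hidden.shape)  # (num_layers * 2, batch, hidden_size)

# Even rows hold forward-direction states, odd rows backward-direction states.
summed = hidden[0::2] + hidden[1::2]
print(summed.shape)  # (num_layers, batch, hidden_size)

# The summed state can initialize a unidirectional GRU of the same width.
decoder_gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
out, _ = decoder_gru(torch.randn(batch, 1, hidden_size), summed)
print(out.shape)
```

Summing keeps the hidden size unchanged. Concatenating the two directions and projecting back down with a linear layer is a common alternative, at the cost of extra parameters.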


Step 6: Training Function

In [None]:
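def grad_norm(model):
    # Global L2 norm of all parameter gradients (handy for monitoring; not called by the training loop below).
    return sum(p.grad.norm(2) ** 2 for p in model.parameters() if p.grad is not None) ** 0.5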

def train_no_attention(encoder, decoder, train_dl, num_epochs, loss_fn, encoder_opt, decoder_opt, device, clip_grad=True, max_norm=1.0):
    encoder.train()
    decoder.train()
    all_losses = []

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for eng, fra_in, fra_tgt, eng_len, fra_len in train_dl:
            eng, fra_in, fra_tgt = eng.to(device), fra_in.to(device), fra_tgt.to(device)
            eng_len, fra_len = eng_len.to(device), fra_len.to(device)

            _, enc_hidden = encoder(eng, eng_len)
            output, _ = decoder(enc_hidden, fra_in, fra_len)
            loss = loss_fn(output.reshape(-1, output.size(-1)), fra_tgt.reshape(-1))
            loss.backward()

            if clip_grad:
                nn.utils.clip_grad_norm_(encoder.parameters(), max_norm)
                nn.utils.clip_grad_norm_(decoder.parameters(), max_norm)

            encoder_opt.step(); decoder_opt.step()
            encoder_opt.zero_grad(); decoder_opt.zero_grad()

            epoch_loss += loss.item() * eng.size(0)

        avg_loss = epoch_loss / len(train_dl.dataset)
        all_losses.append(avg_loss)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

    return all_losses
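
The training loop clips the encoder and decoder gradients separately, so each module's gradient norm is capped at `max_norm` on its own. The sketch below shows what `clip_grad_norm_` does and how a single joint clip over both modules could look instead. The two `nn.Linear` layers are stand-ins for the real encoder and decoder, and the inflated loss is only there to trigger clipping.

```python
import itertools
import torch
from torch import nn

torch.manual_seed(0)
enc = nn.Linear(4, 4)  # stand-in for the encoder
dec = nn.Linear(4, 2)  # stand-in for the decoder

# Deliberately large loss so the gradients exceed the clipping threshold.
loss = (dec(enc(torch.randn(16, 4))) * 100).pow(2).mean()
loss.backward()

params = list(itertools.chain(enc.parameters(), dec.parameters()))
max_norm = 1.0

# clip_grad_norm_ rescales the gradients in place and returns the total norm
# measured before clipping.
norm_before = float(nn.utils.clip_grad_norm_(params, max_norm))
norm_after = float(torch.sqrt(sum(p.grad.norm(2) ** 2 for p in params)))
print(f"before: {norm_before:.2f}, after: {norm_after:.4f}")
```

With separate clipping, as in the notebook, the combined norm of both modules can reach about `sqrt(2) * max_norm`. Either variant bounds the update size; the joint version preserves the relative gradient scale between encoder and decoder.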


Step 7: Translate Sentences

In [None]:
def translate(encoder, decoder, sentence, input_lang, output_lang, device, max_len=MAX_LENGTH):
    encoder.eval(); decoder.eval()
    with torch.no_grad():
        # Unknown words fall back to PAD_token, since the vocabulary has no dedicated UNK entry.
        idxs = [input_lang.word2index.get(w, PAD_token) for w in sentence.split()]
        input_tensor = torch.tensor(idxs).unsqueeze(0).to(device)
        _, hidden = encoder(input_tensor, torch.tensor([len(idxs)]).to(device))

        next_token = torch.tensor([[SOS_token]], device=device)
        output_words = []

        for _ in range(max_len):
            pred, hidden = decoder.decode_step(next_token, hidden)
            next_token = torch.argmax(pred, dim=-1)
            if next_token.item() == EOS_token:
                break
            output_words.append(output_lang.index2word.get(next_token.item(), "<UNK>"))

    return ' '.join(output_words)
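
The loop in `translate` is greedy: it commits to the single most likely token at every step. For comparison, here is a small beam search sketch over a hand-written probability table. `toy_step` and `TABLE` are hypothetical stand-ins for `decoder.decode_step`, chosen so that the greedy path is not the most probable complete sequence.

```python
import math

SOS, EOS = 1, 2

# Hypothetical next-token probabilities keyed by the prefix decoded so far.
TABLE = {
    (SOS,): {3: 0.6, 4: 0.4},
    (SOS, 3): {5: 0.4, 6: 0.3, EOS: 0.3},
    (SOS, 4): {EOS: 0.9, 5: 0.1},
}

def toy_step(prefix):
    # Any prefix missing from the table ends the sentence with probability 1.
    probs = TABLE.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(step_fn, beam_width=2, max_len=10):
    beams = [([SOS], 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Hypotheses ending in EOS are set aside; the rest keep expanding.
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(toy_step, beam_width=1)
beam_tokens, beam_score = beam_search(toy_step, beam_width=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # [1, 3, 5, 2] 0.24
print(beam_tokens, round(math.exp(beam_score), 2))      # [1, 4, 2] 0.36
```

A beam width of 1 reproduces greedy decoding. This sketch applies no length normalization, so it tends to favor shorter hypotheses; a real implementation would usually add one.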


Step 8: Train the Model

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder = EncoderRNN(input_lang.n_words, 128).to(device)
decoder = DecoderRNN(128, output_lang.n_words).to(device)

loss_fn = nn.NLLLoss(ignore_index=PAD_token)
encoder_opt = torch.optim.Adam(encoder.parameters(), lr=0.001)
decoder_opt = torch.optim.Adam(decoder.parameters(), lr=0.001)

train_loss = train_no_attention(encoder, decoder, train_dl, num_epochs=40, loss_fn=loss_fn,
                                encoder_opt=encoder_opt, decoder_opt=decoder_opt, device=device)


Step 9: Plot Loss

In [None]:
plt.plot(train_loss)
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True)
plt.show()


Step 10: Sample Translations

In [None]:
for _ in range(10):
    eng, fra = random.choice(pairs)
    print(f"Input: {eng}")
    print(f"Target: {fra}")
    print(f"Predicted: {translate(encoder, decoder, eng, input_lang, output_lang, device)}")
    print("-" * 60)


# Conclusion & Next Steps

In this notebook, we implemented a complete **English-to-French sequence-to-sequence model** without any attention mechanism. Even without attention, the model learned reasonable translations for short, well-structured sentences.

---

## Key Takeaways

- The **encoder-decoder** architecture is effective for sequence modeling tasks such as translation.
- **GRUs** (Gated Recurrent Units) use gating to carry information across more time steps than a plain RNN, which helps with longer-range dependencies.
- Proper **padding**, **batching**, and **packed sequences** allow training on variable-length inputs efficiently.
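- **Teacher forcing** feeds the ground-truth previous token to the decoder at every training step instead of the model's own prediction, which speeds up training.
- **Gradient clipping** rescales gradients whose norm exceeds a threshold (`max_norm=1.0` here), which guards against exploding gradients, a common problem in RNN-based models.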

---

## Performance Insights

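- The model works best for **short sentences** that closely match the training patterns; it was trained only on pairs where both sides have fewer than `MAX_LENGTH = 10` tokens.
- Generalization is limited: without attention, the decoder sees only a single fixed-size hidden state summarizing the whole source sentence, and greedy decoding commits to one token at a time with no beam search.
- Evaluation was done qualitatively, by inspecting sample translations drawn from the training pairs; a held-out test set and metrics like **BLEU score** would give a more rigorous assessment.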
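
As a first step toward quantitative evaluation, the sketch below computes two simple scores over (reference, prediction) string pairs: exact-match rate and a position-wise token accuracy. Both are much cruder than BLEU, and the function name and example sentences are illustrative assumptions rather than part of the notebook.

```python
def token_accuracy(reference, hypothesis):
    """Position-wise token matches divided by the longer of the two lengths."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref and not hyp:
        return 1.0
    matches = sum(r == h for r, h in zip(ref, hyp))
    return matches / max(len(ref), len(hyp))

# Hypothetical (target, predicted) pairs, e.g. collected from translate().
examples = [
    ("je suis content", "je suis content"),
    ("il est tard", "il est la"),
    ("merci beaucoup", "merci"),
]

exact_match = sum(ref == hyp for ref, hyp in examples) / len(examples)
mean_token_acc = sum(token_accuracy(ref, hyp) for ref, hyp in examples) / len(examples)

print(f"exact match:    {exact_match:.3f}")
print(f"token accuracy: {mean_token_acc:.3f}")
```

Position-wise accuracy penalizes a single inserted or dropped word heavily, because every later token shifts. This is one reason n-gram metrics such as BLEU are preferred for translation.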

---

## Future Improvements

Here are some ways to improve or extend this project:

1. **Add Attention Mechanism**
   - Use Luong or Bahdanau-style attention to allow dynamic focus on input tokens during decoding.

2. **Use Beam Search for Inference**
   - Improves output quality by exploring multiple decoding paths.

3. **Train on Larger and More Diverse Data**
   - Use the full Tatoeba corpus and increase vocabulary coverage.

4. **Pretrained Embeddings**
   - Integrate GloVe, FastText, or multilingual embeddings to boost language understanding.

5. **Evaluation Metrics**
   - Add BLEU score, token accuracy, or sequence accuracy to quantitatively assess performance.

6. **Checkpointing**
   - Add support for saving and loading trained models.
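
To give a feel for the first item, here is a minimal sketch of Luong-style dot-product attention over padded encoder outputs. It is not wired into the notebook's model: the function name, tensor shapes, and mask convention are assumptions for illustration.

```python
import torch

def dot_product_attention(decoder_state, encoder_outputs, src_mask):
    """
    decoder_state:   (batch, hidden)           current decoder hidden state
    encoder_outputs: (batch, src_len, hidden)  one vector per source token
    src_mask:        (batch, src_len) bool     True for real tokens, False for padding
    """
    # Score each source position by its dot product with the decoder state.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    # Padded positions get -inf so they receive zero weight after softmax.
    scores = scores.masked_fill(~src_mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                                 # (batch, src_len)
    # The context vector is the attention-weighted sum of encoder outputs.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)   # (batch, hidden)
    return context, weights

torch.manual_seed(0)
enc_out = torch.randn(2, 4, 8)
dec_state = torch.randn(2, 8)
mask = torch.tensor([[True, True, True, True],
                     [True, True, False, False]])

context, weights = dot_product_attention(dec_state, enc_out, mask)
print(context.shape)         # torch.Size([2, 8])
print(weights.sum(dim=-1))   # each row sums to 1
```

Applying this to the notebook's encoder would take two extra steps. The packed `outputs` would need to be unpacked with `pad_packed_sequence`, and because the encoder is bidirectional its outputs are `2 * hidden_size` wide, so they would have to be projected or summed to match the decoder state.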

---

This project serves as a solid foundation for anyone looking to dive into neural machine translation. With a clear structure and modular design, it's ready to be extended into more powerful architectures like attention-based models or transformers.

Happy translating!
