# 📖 Replication of *“Effective Approaches to Attention-based Neural Machine Translation”*  
*(Luong, Pham, Manning, 2015, Stanford NLP Group)*  
Paper: https://arxiv.org/pdf/1508.04025

---

## 📝 Abstract
This paper explores **architectures for attention-based NMT**, introducing two main variants:  
- **Global attention**: attends to *all source words* at every decoding step.  
- **Local attention**: attends to a *subset (window)* of source words, making it more efficient.  

Additionally, the paper proposes the **input-feeding approach**, where past alignment decisions are fed into the decoder to improve consistency.  
On large-scale English–German WMT tasks, these models achieve **significant BLEU improvements** (up to +5.0 BLEU) over non-attentional baselines that already use techniques such as dropout, setting new state-of-the-art results.

---

## 🎯 Purpose
- Investigate **different types of attention mechanisms** for neural machine translation (NMT).  
- Compare **global vs. local attention** for accuracy and efficiency.  
- Explore **alignment functions** (dot, general, concat, location).  
- Improve the **handling of long sentences** and rare words in translation.  
- Establish stronger NMT baselines beyond Bahdanau et al. (2014).

---

## 🔬 Methodology
- **Model backbone**: 4-layer stacked LSTMs (encoder–decoder) with 1000 hidden units.  
- **Attention types**:
  - *Global attention*: soft alignment over all encoder states.  
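  - *Local attention*: soft alignment over a window \( [p_t - D, p_t + D] \) of encoder states centered on an aligned position \( p_t \).  
    - Local-m: monotonic assumption \( p_t = t \).  
    - Local-p: predictive \( p_t = S \cdot \sigma(v_p^\top \tanh(W_p h_t)) \), where \( S \) is the source length; the alignment weights are additionally scaled by a Gaussian centered on \( p_t \) with \( \sigma = D/2 \).  
- **Input-feeding**: attentional vectors \( \tilde{h}_t \) are concatenated with the decoder input at the next time step, so the model is aware of its previous alignment choices.  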
- **Alignment functions**:
  - Dot: \( \text{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s \).  
  - General: \( h_t^\top W_a \bar{h}_s \).  
  - Concat: \( v_a^\top \tanh(W_a [h_t; \bar{h}_s]) \).  
  - Location-based: \( a_t = \text{softmax}(W_a h_t) \).  
- **Training**: WMT’14 English–German (4.5M pairs, 50K vocab), dropout regularization, SGD.
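
The content-based alignment functions listed above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the class name `LuongScore`, the `method` argument, and the assumption that encoder and decoder states share the same size `H` are all choices made for this example.

```python
import torch
import torch.nn as nn

class LuongScore(nn.Module):
    """Content-based score functions: 'dot', 'general', or 'concat' (illustrative sketch)."""
    def __init__(self, hidden_size, method="general"):
        super().__init__()
        self.method = method
        if method == "general":
            self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)
        elif method == "concat":
            self.W_a = nn.Linear(hidden_size * 2, hidden_size, bias=False)
            self.v_a = nn.Linear(hidden_size, 1, bias=False)
        elif method != "dot":
            raise ValueError(f"unknown method: {method}")

    def forward(self, h_t, h_s):
        # h_t: [B, H] current decoder state; h_s: [B, S, H] encoder states
        if self.method == "dot":
            return torch.bmm(h_s, h_t.unsqueeze(2)).squeeze(2)            # [B, S]
        if self.method == "general":
            return torch.bmm(self.W_a(h_s), h_t.unsqueeze(2)).squeeze(2)  # [B, S]
        # concat: broadcast h_t over source positions, then score with a small MLP
        h_t_exp = h_t.unsqueeze(1).expand(-1, h_s.size(1), -1)             # [B, S, H]
        return self.v_a(torch.tanh(self.W_a(torch.cat((h_t_exp, h_s), dim=2)))).squeeze(2)

# Usage: scores become alignment weights a_t(s) after a softmax over source positions
torch.manual_seed(0)
h_t, h_s = torch.randn(2, 8), torch.randn(2, 5, 8)
for method in ("dot", "general", "concat"):
    weights = torch.softmax(LuongScore(8, method)(h_t, h_s), dim=1)
    print(method, tuple(weights.shape))
```

The location-based variant is omitted here because it depends only on \( h_t \) and a fixed maximum source length, not on the encoder states.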

---

## 📊 Results
- **English → German (WMT’14)**:
  - +1.3 BLEU from reversing source sequence.  
  - +1.4 BLEU from dropout.  
  - +2.8 BLEU from global attention.  
  - +1.3 BLEU from input feeding.  
  - +0.9 BLEU from local-p attention.  
  - +1.9 BLEU from `<unk>` replacement.  
  - Final **ensemble**: **23.0 BLEU** vs. 21.6 of prior best system.  

- **English → German (WMT’15)**:
  - Final ensemble with `<unk>` replacement achieves **25.9 BLEU**, new **state-of-the-art**, +1.0 BLEU over the best NMT + 5-gram rerank baseline.  

- **German → English (WMT’15)**:
  - Global attention +2.2 BLEU.  
  - Input feeding +1.0 BLEU.  
  - Global attention (dot) + dropout + input feeding + `<unk>` replacement → **24.9 BLEU**.  
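
As a quick sanity check on the English → German (WMT'14) numbers, the snippet below adds up the listed gains. It assumes the gains are cumulative, each measured against the configuration on the previous line, which is how the list reads; the dictionary keys are labels chosen for this example.

```python
# Incremental BLEU gains for English → German (WMT'14), in the order listed above
gains = {
    "reverse source": 1.3,
    "dropout": 1.4,
    "global attention": 2.8,
    "input feeding": 1.3,
    "local-p attention": 0.9,
    "unk replacement": 1.9,
}

# Gain from all single-model techniques combined
total_gain = round(sum(gains.values()), 1)

# Gain attributable to the attention-related steps alone
attention_gain = round(
    gains["global attention"] + gains["input feeding"] + gains["local-p attention"], 1
)

# Margin of the final ensemble (23.0) over the prior best system (21.6)
ensemble_margin = round(23.0 - 21.6, 1)

print(f"all techniques: +{total_gain} BLEU")        # +9.6
print(f"attention steps: +{attention_gain} BLEU")   # +5.0
print(f"ensemble margin: +{ensemble_margin} BLEU")  # +1.4
```

Under that assumption, the three attention-related steps account for the +5.0 BLEU quoted in the abstract.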

---

## ✅ Conclusion
- **Both global and local attention improve NMT**, with local-p offering strong efficiency–accuracy tradeoffs.  
- **Input feeding** helps maintain alignment coverage.  
- **Different alignment functions matter**: dot works well for global attention and general for local attention, while concat and location-based scoring lag behind.  
- Attention mechanisms are especially beneficial for **handling long sentences** and **rare word translation**.  
- With ensembles, these approaches set **new SOTA results** on WMT’14/15 English–German tasks.  

This work demonstrates that **carefully designed attention mechanisms are critical to advancing neural machine translation**, directly bridging Bahdanau’s additive attention and the Q–K–V framework later popularized in Transformers.

---


In [1]:
# ===== 0. Imports =====
import torch, torch.nn as nn, torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"


In [3]:
# Small English→French dataset
pairs = [
    ("i am a student", "je suis un etudiant"),
    ("he likes apples", "il aime les pommes"),
    ("she loves music", "elle aime la musique"),
    ("we are friends", "nous sommes amis"),
    ("they play football", "ils jouent au football")
]

# Build vocab
def build_vocab(sentences):
    tokens = set()
    for s in sentences:
        tokens.update(s.split())
    vocab = {tok: idx+4 for idx, tok in enumerate(sorted(tokens))}
    vocab["<pad>"] = 0
    vocab["<sos>"] = 1
    vocab["<eos>"] = 2
    vocab["<unk>"] = 3
    return vocab

src_vocab = build_vocab([src for src, _ in pairs])
trg_vocab = build_vocab([trg for _, trg in pairs])
inv_trg_vocab = {i: t for t, i in trg_vocab.items()}

def encode(sentence, vocab, max_len=8):
    tokens = sentence.split()
    idxs = [vocab["<sos>"]] + [vocab.get(t, vocab["<unk>"]) for t in tokens] + [vocab["<eos>"]]
    idxs += [vocab["<pad>"]] * (max_len - len(idxs))
    return torch.tensor(idxs)

src_tensors = torch.stack([encode(src, src_vocab) for src, _ in pairs])
trg_tensors = torch.stack([encode(trg, trg_vocab) for _, trg in pairs])


In [4]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.LSTM(embed_size, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, x):
        emb = self.embedding(x)
        outputs, (hidden, cell) = self.rnn(emb)
        return outputs, (hidden, cell)


In [9]:
# ===== Global Attention =====
class GlobalAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size*2, hidden_size, bias=False)

    def forward(self, hidden, encoder_outputs):
        # Project encoder outputs down from 2H → H
        encoder_proj = self.linear(encoder_outputs)   # [B,T,H]
        scores = torch.bmm(encoder_proj, hidden.unsqueeze(2)).squeeze(2) # [B,T]
        attn_weights = torch.softmax(scores, dim=1)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_proj).squeeze(1)
        return context, attn_weights

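# ===== Local Attention (predictive, local-p) =====
class LocalAttention(nn.Module):
    def __init__(self, hidden_size, window=3):
        super().__init__()
        # Project BiLSTM outputs 2H → H so they match the decoder state (as in GlobalAttention)
        self.linear = nn.Linear(hidden_size*2, hidden_size, bias=False)
        self.Wp = nn.Linear(hidden_size, hidden_size)
        self.vp = nn.Linear(hidden_size, 1)
        self.window = window  # D: half-width of the attention window

    def forward(self, hidden, encoder_outputs):
        B, T, _ = encoder_outputs.size()
        encoder_proj = self.linear(encoder_outputs)                        # [B,T,H]
        # predict alignment position p_t in (0, T)
        p_t = T * torch.sigmoid(self.vp(torch.tanh(self.Wp(hidden))))     # [B,1]
        scores = torch.bmm(encoder_proj, hidden.unsqueeze(2)).squeeze(2)  # [B,T]
        # keep only positions inside the window [p_t - D, p_t + D]
        pos = torch.arange(T, device=hidden.device).unsqueeze(0)          # [1,T]
        outside = (pos - p_t.floor()).abs() > self.window                 # [B,T]
        attn_weights = torch.softmax(scores.masked_fill(outside, float("-inf")), dim=1)
        # Gaussian centered on p_t (sigma = D/2), as in the paper; keeps p_t differentiable
        sigma = self.window / 2
        attn_weights = attn_weights * torch.exp(-(pos - p_t) ** 2 / (2 * sigma ** 2))
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_proj).squeeze(1)  # [B,H]
        return context, attn_weights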


In [13]:
class DecoderWithAttention(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, attention):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # RNN input is [embedding; context], i.e. embed_size + hidden_size
        self.rnn = nn.LSTM(embed_size + hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size*2, vocab_size)  # combine [output, context]
        self.attention = attention

    def forward(self, x, hidden, encoder_outputs, prev_context):
        # NOTE: prev_context is accepted but not used; the context is recomputed from the
        # previous hidden state at every step, a simplification of the paper's input feeding.
        x = x.unsqueeze(1)                      # [B] → [B,1]
        emb = self.embedding(x)                 # [B,1,E]
        # query attention with the top-layer hidden state
        context, attn_weights = self.attention(hidden[0][-1], encoder_outputs)  # [B,H]
        rnn_input = torch.cat((emb, context.unsqueeze(1)), dim=2)               # [B,1,E+H]
        output, hidden = self.rnn(rnn_input, hidden)
        pred = self.fc(torch.cat((output.squeeze(1), context), dim=1))          # [B,vocab]
        return pred, hidden, context, attn_weights
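
The decoder above accepts `prev_context` but never uses it, and it queries attention with the previous hidden state. For comparison, here is a hedged sketch of a decoder step that follows the paper's ordering more closely: run the RNN first, attend with the current state \( h_t \), form \( \tilde{h}_t = \tanh(W_c [c_t; h_t]) \), and feed \( \tilde{h}_t \) into the next step. The class name `InputFeedingDecoderStep`, the single-layer `nn.LSTMCell`, the dot score, and encoder states of size `H` are assumptions made for this example.

```python
import torch
import torch.nn as nn

class InputFeedingDecoderStep(nn.Module):
    """One decoder step with Luong-style input feeding (hypothetical sketch)."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # input feeding: the previous attentional vector is concatenated to the embedding
        self.rnn = nn.LSTMCell(embed_size + hidden_size, hidden_size)
        self.W_c = nn.Linear(hidden_size * 2, hidden_size, bias=False)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state, enc_states, prev_attn_vec):
        # x: [B] token ids; state: (h, c), each [B, H]
        # enc_states: [B, S, H]; prev_attn_vec: [B, H]
        rnn_in = torch.cat((self.embedding(x), prev_attn_vec), dim=1)
        h_t, c_t = self.rnn(rnn_in, state)
        # global attention with the dot score, computed from the current state h_t
        scores = torch.bmm(enc_states, h_t.unsqueeze(2)).squeeze(2)        # [B, S]
        a_t = torch.softmax(scores, dim=1)
        context = torch.bmm(a_t.unsqueeze(1), enc_states).squeeze(1)       # [B, H]
        attn_vec = torch.tanh(self.W_c(torch.cat((context, h_t), dim=1)))  # attentional vector
        logits = self.out(attn_vec)
        return logits, (h_t, c_t), attn_vec, a_t

# Usage: greedy decoding for a few steps on random encoder states
B, S, H = 2, 5, 16
step = InputFeedingDecoderStep(vocab_size=20, embed_size=8, hidden_size=H)
state = (torch.zeros(B, H), torch.zeros(B, H))
attn_vec = torch.zeros(B, H)          # no alignment history at the first step
enc_states = torch.randn(B, S, H)
tokens = torch.tensor([1, 1])         # assumed <sos> id
for _ in range(3):
    logits, state, attn_vec, a_t = step(tokens, state, enc_states, attn_vec)
    tokens = logits.argmax(dim=1)
print(tuple(logits.shape), tuple(a_t.shape))
```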


In [14]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder, self.decoder, self.device = encoder, decoder, device

    def forward(self, src, trg, teacher_forcing=0.5):
        B, T = trg.size()
        vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(B, T, vocab_size).to(self.device)

        enc_out, hidden = self.encoder(src)
        # combine BiLSTM: sum forward and backward final states → [1,B,H]
        hidden = (hidden[0][::2] + hidden[0][1::2], hidden[1][::2] + hidden[1][1::2])

        dec_input = trg[:,0]  # <sos> tokens
        # hidden[0] is [1,B,H], so the hidden size is dimension 2 (dimension 1 is the batch)
        context = torch.zeros(B, hidden[0].size(2)).to(self.device)  # [B,H]

        for t in range(1,T):
            out, hidden, context, _ = self.decoder(dec_input, hidden, enc_out, context)
            outputs[:,t,:] = out
            # teacher forcing: feed the gold token with probability `teacher_forcing`
            dec_input = trg[:,t] if torch.rand(1).item() < teacher_forcing else out.argmax(1)
        return outputs
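
The "combine BiLSTM" slicing above relies on how `nn.LSTM` lays out its final states. The following shape check is a small sketch with arbitrary sizes chosen for illustration; it shows that even indices hold the forward direction, odd indices hold the backward direction, and the hidden size sits in the last dimension.

```python
import torch
import torch.nn as nn

B, T, E, H = 5, 8, 32, 64
rnn = nn.LSTM(E, H, batch_first=True, bidirectional=True)
outputs, (h_n, c_n) = rnn(torch.randn(B, T, E))

# h_n stacks directions per layer as [forward, backward]: [num_layers * 2, B, H]
forward_h, backward_h = h_n[::2], h_n[1::2]   # each [num_layers, B, H]
merged_h = forward_h + backward_h             # [num_layers, B, H]

print(outputs.shape)   # torch.Size([5, 8, 128])
print(h_n.shape)       # torch.Size([2, 5, 64])
print(merged_h.shape)  # torch.Size([1, 5, 64])
```

Summing the two directions keeps the decoder state at size `H`, while the per-position encoder outputs stay at `2H`, which is why the attention modules project them back down.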


In [15]:
embed_size, hidden_size = 32, 64
attention = GlobalAttention(hidden_size)   # swap with LocalAttention(hidden_size)
encoder = Encoder(len(src_vocab)+1, embed_size, hidden_size).to(device)
decoder = DecoderWithAttention(len(trg_vocab)+1, embed_size, hidden_size, attention).to(device)
model = Seq2Seq(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)
opt = optim.Adam(model.parameters(), lr=0.01)

src_tensors, trg_tensors = src_tensors.to(device), trg_tensors.to(device)

for epoch in range(200):
    model.train()
    opt.zero_grad()
    out = model(src_tensors, trg_tensors)
    out_dim = out.shape[-1]
    loss = criterion(out[:,1:,:].reshape(-1, out_dim), trg_tensors[:,1:].reshape(-1))
    loss.backward()
    opt.step()
    if (epoch+1)%50==0:
        print(f"Epoch {epoch+1}/200 | Loss {loss.item():.4f}")


Epoch 50/200 | Loss 0.0053
Epoch 100/200 | Loss 0.0014
Epoch 150/200 | Loss 0.0009
Epoch 200/200 | Loss 0.0006


In [16]:
def translate(sentence, model, src_vocab, trg_vocab, inv_trg_vocab, max_len=8):
    model.eval()
    src = encode(sentence, src_vocab).unsqueeze(0).to(device)
    trg_idx = [trg_vocab["<sos>"]]
    with torch.no_grad():  # inference only, no gradients needed
        enc_out, hidden = model.encoder(src)
        # sum forward and backward final states of the BiLSTM
        hidden = (hidden[0][::2]+hidden[0][1::2], hidden[1][::2]+hidden[1][1::2])
        # hidden[0] is [1,1,H], so the hidden size is dimension 2
        context = torch.zeros(1, hidden[0].size(2)).to(device)  # [1,H]

        for _ in range(max_len):
            x = torch.tensor([trg_idx[-1]]).to(device)
            out, hidden, context, _ = model.decoder(x, hidden, enc_out, context)
            pred = out.argmax(1).item()
            trg_idx.append(pred)
            if pred == trg_vocab["<eos>"]:
                break

    # drop special tokens (<pad>, <sos>, <eos>, <unk>) from the output
    return " ".join([inv_trg_vocab[i] for i in trg_idx if i not in (0,1,2,3)])

print(translate("i am a student", model, src_vocab, trg_vocab, inv_trg_vocab))
print(translate("she loves music", model, src_vocab, trg_vocab, inv_trg_vocab))


je suis un etudiant
elle aime la musique


# 🔎 Evolution of Attention: Bahdanau (2014) → Luong (2015) → Transformer (2017)

| Aspect | Bahdanau et al. (2014) – Additive Attention | Luong et al. (2015) – Multiplicative Attention | Vaswani et al. (2017) – Scaled Dot-Product Attention (Transformer) | Why it matters |
|---|---|---|---|---|
| **Problem Addressed** | Fixed-length bottleneck in seq2seq translation | Efficiency and design choices in NMT attention | Full parallelization and long-range dependencies in sequences | Each step builds toward scalable and general attention |
| **Attention Type** | *Additive*: feed-forward MLP computes alignment | *Multiplicative*: dot/general scoring (concat and location variants also compared); Global vs Local | *Scaled dot-product*: multi-head self-attention | Transformers unify and generalize earlier ideas |
| **Score Function** | $$ e_{ij} = v^\top \tanh\!\big(W [s_{i-1}; h_j]\big) $$ | Global (dot): $$ \text{score}(h_t,\bar{h}_s) = h_t^\top \bar{h}_s $$ Local-p (predictive): $$ p_t = S \cdot \sigma(v_p^\top \tanh(W_p h_t)) $$ | $$ \text{score}(Q,K) = \frac{QK^\top}{\sqrt{d_k}} $$ | Bahdanau = additive MLP, Luong = efficient multiplicative, Transformer = normalized dot-product |
| **Weights** | $$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})} $$ | $$ \alpha_t(s) = \text{softmax}(\text{score}(h_t,\bar{h}_s)) $$ | $$ \alpha = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) $$ | All normalize via softmax, but Transformers compute in parallel for all tokens |
| **Context Vector** | $$ c_i = \sum_j \alpha_{ij} h_j $$ | $$ c_t = \sum_s \alpha_t(s) \bar{h}_s $$ | $$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$ | Context always = weighted sum; formulation becomes more general |
| **Inputs (Q,K,V analogy)** | Query = decoder hidden $s_{i-1}$; Keys/Values = encoder states $h_j$ | Query = decoder hidden $h_t$; Keys/Values = encoder states | Queries, Keys, Values all derived from embeddings via linear projections | Transformer formalizes Q–K–V explicitly, enabling self-attention |
| **Architecture** | Encoder: BiRNN (GRU), Decoder: GRU | Encoder–Decoder: stacked LSTMs (4×1000 units) | Encoder–Decoder: multi-head self-attention + FFN layers | Transformer removes recurrence, relies purely on attention |
| **Extra Techniques** | First to visualize attention alignments | Input-feeding; local/global modes; `<unk>` replacement | Multi-head attention; positional encoding; residuals & normalization | Stepwise refinements toward GPT-scale architectures |
| **Results** | Better BLEU vs vanilla seq2seq; big gains on long sentences | +5 BLEU over baselines; new SOTA on WMT’15 English–German | New state-of-the-art across tasks; foundation for GPT/BERT | Each stage pushed NMT and sequence modeling forward |
| **Impact** | Birth of attention mechanism | Practical recipes for effective NMT attention | Foundation of modern LLMs (GPT, BERT, etc.) | Historical lineage: Bahdanau → Luong → Transformer |
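
To make the last column of the formulas concrete, here is a minimal sketch of scaled dot-product attention, used the way the Q–K–V analogy in the table describes: one decoder state as the query, encoder states as both keys and values. The function name and tensor sizes are assumptions for this example, and multi-head projections and masking are left out.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q: [B, T_q, d_k], K: [B, T_k, d_k], V: [B, T_k, d_v]
    d_k = Q.size(-1)
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)  # [B, T_q, T_k]
    weights = torch.softmax(scores, dim=-1)                    # rows sum to 1
    return torch.bmm(weights, V), weights

# Usage: a single decoder query attending over five encoder states
torch.manual_seed(0)
h_t = torch.randn(2, 1, 16)   # query: one decoder state per batch element
h_s = torch.randn(2, 5, 16)   # keys and values: encoder states
context, weights = scaled_dot_product_attention(h_t, h_s, h_s)
print(context.shape, weights.shape)  # torch.Size([2, 1, 16]) torch.Size([2, 1, 5])
```

Apart from the \( 1/\sqrt{d_k} \) scaling, this is the same computation as Luong's global attention with the dot score.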


## 📌 Summary

- **Bahdanau (2014)**: introduced additive attention, in which the decoder looks back at encoder states via an alignment model.
- **Luong (2015)**: simplified to multiplicative scoring (dot/general), added global/local modes and input feeding, and scaled to WMT benchmarks.
- **Vaswani (2017)**: formalized attention into the Q–K–V framework and introduced multi-head scaled dot-product attention, enabling Transformers and GPT.