# Exercises - Neural Machine Translation

This lab implements three approaches to neural machine translation, each intended to improve on the translation quality of the previous one. All implementations should translate German to English using the Multi30k dataset, which contains roughly 30,000 parallel English-German sentence pairs ([raw data on GitHub](https://github.com/multi30k/dataset/tree/master/data/task1/raw)).



## 1. Sequence to Sequence Learning with Neural Networks

Implement the foundational encoder-decoder architecture using LSTM networks. The encoder reads an entire German sentence and compresses it into a single context vector; the decoder then uses this vector to generate the English translation word by word.
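As a rough orientation, the encoder-decoder flow described above could be sketched in PyTorch as follows. This is a minimal sketch, not a reference solution: the class names (`Encoder`, `Decoder`, `Seq2Seq`), the dimensions, the dropout rate, and the `teacher_forcing_ratio` default are all assumptions chosen for illustration, and tensors are laid out as `[seq_len, batch]`.

```python
import random

import torch
import torch.nn as nn


class Encoder(nn.Module):
    # Hypothetical names and sizes; only the overall structure follows the task text.
    def __init__(self, vocab_size, emb_dim, hid_dim, n_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: [src_len, batch]
        embedded = self.dropout(self.embedding(src))
        _, (hidden, cell) = self.rnn(embedded)
        # hidden, cell: [n_layers, batch, hid_dim]; together they act as the context
        return hidden, cell


class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, n_layers=2, dropout=0.5):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hid_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token, hidden, cell):
        # token: [batch]; decode a single time step
        embedded = self.dropout(self.embedding(token.unsqueeze(0)))
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        return self.fc_out(output.squeeze(0)), hidden, cell


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # trg: [trg_len, batch]; trg[0] is assumed to hold the <sos> token
        trg_len, batch = trg.shape
        outputs = torch.zeros(trg_len, batch, self.decoder.vocab_size, device=src.device)
        hidden, cell = self.encoder(src)
        token = trg[0]
        for t in range(1, trg_len):
            logits, hidden, cell = self.decoder(token, hidden, cell)
            outputs[t] = logits
            # Teacher forcing: sometimes feed the ground-truth token instead of the prediction
            use_teacher = random.random() < teacher_forcing_ratio
            token = trg[t] if use_teacher else logits.argmax(1)
        return outputs


# Shape check on random token ids (toy vocabulary sizes)
model = Seq2Seq(Encoder(20, 8, 16), Decoder(25, 8, 16))
src = torch.randint(0, 20, (7, 3))
trg = torch.randint(0, 25, (5, 3))
out = model(src, trg)  # [trg_len, batch, trg_vocab_size]
```

Position 0 of `out` is left as zeros because the decoder never predicts the `<sos>` token; when computing the loss, one would typically drop that position from both the outputs and the targets.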


**Implementation Tasks:**

1. **Preprocessing**
   - Load and tokenize the Multi30k German-English dataset
   - Build vocabulary dictionaries for both languages with special tokens (`<sos>`, `<eos>`, `<unk>`, `<pad>`)
   - Implement data loaders with proper padding and batching

2. **Encoder**
   - Create a 2-layer LSTM encoder that processes German sentences
   - Implement embedding layers for input tokens
   - Add dropout layers for regularization
   - Extract final hidden and cell states as context vectors

3. **Decoder**
   - Build a 2-layer LSTM decoder that generates English translations
   - Implement teacher forcing mechanism for training
   - Add linear output layer to predict vocabulary probabilities
   - Handle variable-length sequence generation

4. **Training & Evaluation**
   - Implement training loop with proper loss calculation (`CrossEntropyLoss`, ignoring `<pad>` positions)
   - Add gradient clipping to prevent exploding gradients
   - Evaluate model performance using BLEU score
   - Save best model based on validation loss
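
The preprocessing step from the list above can be illustrated without any framework. The sketch below assumes whitespace tokenization (a real solution would more likely use a proper tokenizer), a particular ordering of the special tokens, and hypothetical helper names (`build_vocab`, `numericalize`, `pad_batch`); the two German sentences are toy data, not taken from Multi30k.

```python
from collections import Counter

# Assumed ordering of the special tokens; any fixed ordering works.
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]


def build_vocab(tokenized_sentences, min_freq=2):
    """Map every token seen at least `min_freq` times to an integer id."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = SPECIALS + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, vocab):
    """Convert tokens to ids, wrapping the sentence in <sos> ... <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokens]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


# Toy corpus, tokenized by whitespace for illustration only
corpus = [
    "ein mann steht am herd".split(),
    "zwei hunde spielen im schnee".split(),
]
vocab = build_vocab(corpus, min_freq=1)

sentences = corpus + ["zwei katzen".split()]  # "katzen" is out of vocabulary
batch = pad_batch([numericalize(s, vocab) for s in sentences], vocab["<pad>"])
```

Every row of `batch` has the same length, so it can be turned into a tensor directly; the out-of-vocabulary word maps to the `<unk>` id, and the `<pad>` id is the one to exclude from the loss.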


## 2. Learning Phrase Representations using RNN Encoder-Decoder

Build upon the first approach by replacing the LSTMs with GRU (Gated Recurrent Unit) networks for better computational efficiency. Also improve how the decoder uses context information by giving it access to both the previous hidden state and the context vector at each decoding step.
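
One possible shape for such a decoder is sketched below. The class name `ContextDecoder` and all dimensions are assumptions; the sketch also assumes the context vector is the encoder's final hidden state with shape `[1, batch, hid_dim]`, which is reused unchanged at every step.

```python
import torch
import torch.nn as nn


class ContextDecoder(nn.Module):
    # Hypothetical name and sizes, for illustration only.
    def __init__(self, vocab_size, emb_dim, hid_dim, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # The GRU input is the token embedding concatenated with the context vector
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim)
        # The output layer sees the embedding, the new hidden state, and the context
        self.fc_out = nn.Linear(emb_dim + hid_dim * 2, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token, hidden, context):
        # token: [batch]; hidden, context: [1, batch, hid_dim]
        embedded = self.dropout(self.embedding(token.unsqueeze(0)))  # [1, batch, emb_dim]
        rnn_input = torch.cat((embedded, context), dim=2)
        _, hidden = self.rnn(rnn_input, hidden)
        features = torch.cat(
            (embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), dim=1
        )
        return self.fc_out(features), hidden


decoder = ContextDecoder(vocab_size=25, emb_dim=8, hid_dim=16)
context = torch.randn(1, 3, 16)           # stand-in for the encoder's final hidden state
token = torch.zeros(3, dtype=torch.long)  # stand-in for a batch of <sos> ids
logits, hidden = decoder(token, context, context)  # first step: hidden starts as the context
```

On later steps the returned `hidden` is fed back in while `context` stays fixed, so the decoder no longer has to carry the full source information in its hidden state alone.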

**Implementation Tasks:**

1. **Architecture Design**
   - Replace LSTM with single-layer GRU networks in both encoder and decoder
   - Modify decoder to concatenate embeddings with context vector at each time step
   - Implement enhanced linear output layer that uses embeddings, hidden state, and context

2. **Context Vector**
   - Ensure context vector is passed to decoder at every time step (not just initialization)
   - Modify decoder forward pass to accept three inputs: input token, hidden state, and context vector
   - Implement proper tensor concatenation for enhanced feature representation

3. **Weight Initialization**
   - Initialize all parameters using normal distribution (mean=0, std=0.01)
   - Ensure consistent initialization across all model components

4. **Analysis**
   - Compare training time and memory usage with LSTM implementation
   - Analyze BLEU score improvements over baseline seq2seq model
   - Generate sample translations and compare quality
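
For the weight-initialization and analysis tasks above, a small sketch may help. The helper names `init_weights` and `count_parameters` and the layer sizes are assumptions; parameter count is used here only as a cheap first proxy for the memory comparison, not as a substitute for measuring training time.

```python
import torch
import torch.nn as nn


def init_weights(model, std=0.01):
    """Draw every parameter from N(0, std^2), as the task describes."""
    for param in model.parameters():
        nn.init.normal_(param.data, mean=0.0, std=std)


def count_parameters(model):
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Assumed sizes, chosen only to make the comparison concrete
lstm = nn.LSTM(input_size=256, hidden_size=512)
gru = nn.GRU(input_size=256, hidden_size=512)
init_weights(gru)

lstm_params = count_parameters(lstm)
gru_params = count_parameters(gru)
```

A GRU layer has three gate blocks where an LSTM layer has four, so for equal sizes `gru_params` is three quarters of `lstm_params`. In a full model, `init_weights` can be applied once to the whole network so that all components are initialized consistently.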


## 3. Neural Machine Translation by Jointly Learning to Align and Translate

Implement attention mechanisms to address the context vector bottleneck. The model learns to focus on different parts of the source sentence when generating each word of the translation. Include bidirectional encoding to better capture context from both directions.
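
The two ideas in this paragraph, bidirectional encoding and attention over the source positions, could be sketched as below. This is a simplified illustration under stated assumptions: the class names `BiEncoder` and `Attention` and all dimensions are hypothetical, the energy function is an additive (tanh) form, and padding positions are not masked.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiEncoder(nn.Module):
    # Hypothetical name and sizes, for illustration only.
    def __init__(self, vocab_size, emb_dim, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

    def forward(self, src):
        # src: [src_len, batch]
        outputs, hidden = self.rnn(self.embedding(src))
        # outputs: [src_len, batch, 2 * enc_hid_dim]; hidden: [2, batch, enc_hid_dim]
        # hidden[0] is the final forward state, hidden[1] the final backward state
        hidden = torch.tanh(self.fc(torch.cat((hidden[0], hidden[1]), dim=1)))
        return outputs, hidden  # hidden: [batch, dec_hid_dim]


class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: [batch, dec_hid_dim]; encoder_outputs: [src_len, batch, 2 * enc_hid_dim]
        src_len = encoder_outputs.shape[0]
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)   # [batch, src_len, dec_hid_dim]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)   # [batch, src_len, 2 * enc_hid_dim]
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        scores = self.v(energy).squeeze(2)                   # [batch, src_len]
        weights = F.softmax(scores, dim=1)                   # one distribution per sentence
        # Weighted sum of encoder outputs: [batch, 2 * enc_hid_dim]
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights


encoder = BiEncoder(vocab_size=20, emb_dim=8, enc_hid_dim=16, dec_hid_dim=24)
attention = Attention(enc_hid_dim=16, dec_hid_dim=24)
src = torch.randint(0, 20, (7, 3))
enc_outputs, hidden = encoder(src)
context, weights = attention(hidden, enc_outputs)
```

Each row of `weights` sums to 1 over the source positions, which is what makes it usable for the alignment heatmaps later on. A fuller implementation would probably mask `<pad>` positions before the softmax so that padding receives zero weight.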

**Implementation Tasks:**

1. **Bidirectional Encoder**
   - Implement bidirectional GRU that reads source sentence in both directions
   - Concatenate forward and backward hidden states for richer representations
   - Add linear transformation layer to project concatenated states to decoder dimension

2. **Attention Mechanism**
   - Create attention module that computes alignment scores between decoder hidden state and all encoder outputs
   - Implement energy function using linear layers and tanh activation
   - Apply softmax to get attention weights over source positions
   - Compute weighted context vector using attention weights

3. **Enhanced Decoder**
   - Modify decoder to accept encoder outputs and compute attention at each step
   - Concatenate current embedding and weighted context as RNN input; pass previous hidden state as the recurrent state
   - Update output layer to use embedding, hidden state, and attended context for prediction

4. **Attention Visualization**
   - Implement function to extract and store attention weights during translation
   - Create visualization plots showing attention alignment between source and target words
   - Generate attention heatmaps for sample translations

5. **Advanced Training**
   - Implement proper weight initialization (normal for weights, zero for biases)
   - Monitor both training loss and validation BLEU score
   - Save attention weights for analysis and visualization

6. **Evaluation**
   - Evaluate final model on test set and report BLEU scores
   - Compare translation quality across all three implementations
   - Analyze attention patterns for different sentence types and lengths
   - Generate qualitative analysis of translation improvements
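
To make the BLEU-based comparison in the evaluation tasks concrete, here is a small standard-library sketch of corpus-level BLEU. It assumes one reference per hypothesis, n-grams up to length 4 with uniform weights, and no smoothing; the function names `ngrams` and `corpus_bleu` are hypothetical, and an established library implementation is likely the better choice for reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses that are shorter than the references
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy sentences for illustration, not model output
reference = "two men are standing at the stove".split()
perfect = corpus_bleu([reference], [reference])
partial = corpus_bleu(["two men are standing at a stove".split()], [reference])
unrelated = corpus_bleu(["a dog runs".split()], [reference])
```

Running the same function over the test-set translations of all three models keeps the comparison consistent, since BLEU values depend on tokenization and smoothing choices.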