# Deep Learning: Assignment #3
## Submission date: 07/01/2026, 23:59.
### Topics:
- RNNs
- GRUs
- LSTMs
- Transformers


**Submitted by:**

- **Student 1 — Name, ID**
- **Student 2 — Name, ID**


**Assignment Instructions:**

· Submissions are in **pairs only**. Write both names + IDs at the top of the notebook.

· Keep your code **clean, concise, and readable**.



· <font color='red'>Write your textual answers in red.</font>  
(e.g., `<span style="color:red">your answer here</span>`)

· All figures, printed results, and outputs should remain visible in the notebook.  
Run **all cells** before submitting and **do not clear outputs**.

· Use relative paths — **no absolute file paths** pointing to local machines.

· **Important:** Your submission must be entirely your own.  


## Question 1 — Chatbot Tutorial (35 Points)

In this question, you will build a generative chatbot using a seq2seq model trained on the Cornell Movie-Dialogs Corpus.


### Load & Preprocess Data


To start, download and unzip the data ZIP file.


In [None]:
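!mkdir -p data
# Note: each `!` command runs in its own subshell, so `!cd data` would not change
# the notebook's working directory. The archive is therefore downloaded and
# extracted into the current directory, producing ./movie-corpus.
!wget https://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip
!unzip -q movie-corpus.zip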

After loading the data, let’s import some necessities.

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import json


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

The next step is to reformat our data file and load the data into
structures that we can work with.

The [Cornell Movie-Dialogs
Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
is a rich dataset of movie character dialog:

-  220,579 conversational exchanges between 10,292 pairs of movie
   characters
-  9,035 characters from 617 movies
-  304,713 total utterances

This dataset is large and diverse, and there is a great variation of
language formality, time periods, sentiment, etc. Our hope is that this
diversity makes our model robust to many forms of inputs and queries.

First, we’ll take a look at some lines of our datafile to see the
original format.




In [None]:
!ls movie-corpus

In [None]:
corpus = "movie-corpus"
corpus_name = "movie-corpus"

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "utterances.jsonl"))

#### Create formatted data file


The following functions facilitate the parsing of the raw
*utterances.jsonl* data file.


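In the next cell, you'll find the functions:

-  ``loadLinesAndConversations`` splits each line of the file into a
   dictionary of lines (with fields ``lineID``, ``characterID``, and
   ``text``) and groups them into conversations (with fields
   ``conversationID``, ``movieID``, and ``lines``).
-  ``extractSentencePairs`` extracts pairs of sentences from
   conversations.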
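As a small illustration of the pairing logic, the sketch below applies the same consecutive-line pairing to a hypothetical four-line conversation. The sample lines and the names `toy_lines` and `toy_pairs` are made up for this example; pairs with an empty side are skipped.

```python
# Hypothetical conversation, already in chronological order
toy_lines = ["Hi.", "Hello.", "  ", "Bye."]

toy_pairs = []
for i in range(len(toy_lines) - 1):  # the last line has no reply
    query = toy_lines[i].strip()
    reply = toy_lines[i + 1].strip()
    if query and reply:  # skip pairs where either side is empty
        toy_pairs.append([query, reply])

print(toy_pairs)  # [['Hi.', 'Hello.']]
```

Each line except the last is used as a query for the line that follows it, so a conversation with *n* non-empty lines yields *n - 1* pairs.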


> Run the code in the next cell.

In [None]:
# Splits each line of the file to create lines and conversations
def loadLinesAndConversations(fileName):
    lines = {}
    conversations = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            lineJson = json.loads(line)
            # Extract fields for line object
            lineObj = {}
            lineObj["lineID"] = lineJson["id"]
            lineObj["characterID"] = lineJson["speaker"]
            lineObj["text"] = lineJson["text"]
            lines[lineObj['lineID']] = lineObj

            # Extract fields for conversation object
            if lineJson["conversation_id"] not in conversations:
                convObj = {}
                convObj["conversationID"] = lineJson["conversation_id"]
                convObj["movieID"] = lineJson["meta"]["movie_id"]
                convObj["lines"] = [lineObj]
            else:
                convObj = conversations[lineJson["conversation_id"]]
                convObj["lines"].insert(0, lineObj)
            conversations[convObj["conversationID"]] = convObj

    return lines, conversations


# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations.values():
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Skip pairs in which either sentence is empty
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs

Now we’ll call these functions and create the file. We’ll call it
*formatted_movie_lines.txt*.




In [None]:
# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict and conversations dict
lines = {}
conversations = {}
# Load lines and conversations
print("\nProcessing corpus into lines and conversations...")
lines, conversations = loadLinesAndConversations(os.path.join(corpus, "utterances.jsonl"))

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)

#### Load and Trim Data

We now move from raw text to a representation that a neural network can process.
Our next task is to build a **vocabulary** and load the **query/response pairs**
into memory.

Unlike images, text does not come with an inherent mapping to a numerical space.
A sequence model expects **integer token indices**, so we must define a mapping
from each unique word in the dataset to a discrete index.

To do this, we define a `Voc` (vocabulary) class that maintains:

- `word2index`: a mapping from each word to an integer index  
- `index2word`: the inverse mapping from indices back to words  
- `word2count`: a frequency table used for trimming rare words  
- `num_words`: the current vocabulary size  

In addition, we reserve a small set of **special tokens**:

- `PAD` for padding shorter sequences in a batch  
- `SOS` to mark the start of a sequence for the decoder  
- `EOS` to mark the end of a sequence  
- `UNK` to represent words that are not in the vocabulary  

Later, we will remove infrequent words from the vocabulary using a minimum
frequency threshold (`MIN_COUNT`). This reduces noise and decreases the effective
problem size, which often improves training stability and convergence.
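To make the word-to-index idea concrete, here is a minimal sketch that uses a plain dictionary rather than the `Voc` class. The name `toy_word2index` and the sample sentences are assumptions for illustration only.

```python
PAD, SOS, EOS, UNK = 0, 1, 2, 3  # reserved indices for the special tokens

toy_word2index = {}
for word in "hello world hello".split(" "):
    if word not in toy_word2index:
        # New words receive the next free index after the special tokens
        toy_word2index[word] = 4 + len(toy_word2index)

# Encode a sentence: unknown words fall back to UNK, and EOS is appended
encoded = [toy_word2index.get(w, UNK) for w in "hello there".split(" ")] + [EOS]

print(toy_word2index)  # {'hello': 4, 'world': 5}
print(encoded)         # [4, 3, 2]
```

The word "there" never appeared when the mapping was built, so it is encoded as `UNK` (index 3).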


In [None]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token
UNK_token = 3  # Unknown word token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {
            PAD_token: "PAD",
            SOS_token: "SOS",
            EOS_token: "EOS",
            UNK_token: "UNK",
        }
        self.num_words = 4  # Count default tokens

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []
        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {
            PAD_token: "PAD",
            SOS_token: "SOS",
            EOS_token: "EOS",
            UNK_token: "UNK",
        }
        self.num_words = 4  # Count default tokens

        for word in keep_words:
            self.addWord(word)


We begin by converting Unicode strings to ASCII using `unicodeToAscii`.
Next, all text is lowercased and non-letter characters are removed while
preserving basic punctuation (`normalizeString`).
Finally, to promote stable training and reduce unnecessary computation,
we keep only sentence pairs in which both sentences contain fewer than
`MAX_LENGTH` words (`filterPairs`).
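The effect of these normalization steps is easiest to see on one string. The sketch below is a self-contained approximation of the pipeline; the function name `toy_normalize` and the sample sentence are hypothetical.

```python
import re
import unicodedata

def toy_normalize(s):
    # Lowercase, then drop combining accent marks after NFD decomposition
    s = s.lower().strip()
    s = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", " ", s)   # replace everything else with a space
    return re.sub(r"\s+", " ", s).strip()  # collapse repeated whitespace

result = toy_normalize("Aren't you coming, José?!")
print(result)  # aren t you coming jose ? !
```

Note that the apostrophe and comma are replaced by spaces, which is why contractions such as "let s" appear split in the sample pairs.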


In [None]:
MAX_LENGTH = 10  # Maximum sentence length to consider

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

def readVocs(datafile, corpus_name):
    print("Reading lines...")
    lines = open(datafile, encoding='utf-8').read().strip().split('\n')
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs

save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)

print("\nSample pairs:")
for pair in pairs[:10]:
    print(pair)


Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 64313 sentence pairs
Counting words...
Counted words: 18083

Sample pairs:
['they do to !', 'they do not !']
['she okay ?', 'i hope so .']
['wow', 'let s go .']
['what good stuff ?', 'the real you .']
['the real you .', 'like my fear of wearing pastels ?']
['do you listen to this crap ?', 'what crap ?']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['have fun tonight ?', 'tons']


Another technique that often improves training efficiency is **trimming
rarely used words** from the vocabulary. Intuitively, words that appear
only a handful of times contribute little to the learning signal, while
significantly increasing the size of the vocabulary.

We perform trimming as a two-step process:

1. Remove words that appear fewer than `MIN_COUNT` times from the
   vocabulary.
2. Remove sentence pairs that contain any of the trimmed words, ensuring
   that all remaining training examples are fully represented in the
   vocabulary.

This procedure reduces noise, lowers the dimensionality of the problem,
and often leads to faster and more stable convergence during training.
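The two steps can be sketched on a toy dataset as follows. This is an illustration using `collections.Counter`, not the notebook's `Voc.trim`; the pairs and the threshold `min_count = 2` are invented for the example.

```python
from collections import Counter

toy_pairs = [
    ["hello there", "hello"],
    ["hello friend", "goodbye friend"],
    ["hello", "hello friend"],
]
min_count = 2  # hypothetical threshold for this toy example

# Step 1: count words and keep those that appear at least min_count times
counts = Counter(w for pair in toy_pairs for s in pair for w in s.split(" "))
keep = {w for w, c in counts.items() if c >= min_count}

# Step 2: keep only pairs whose words all survived step 1
kept_pairs = [p for p in toy_pairs
              if all(w in keep for s in p for w in s.split(" "))]

print(sorted(keep))  # ['friend', 'hello']
print(kept_pairs)    # [['hello', 'hello friend']]
```

A single rare word is enough to discard a whole pair, which explains why trimming the vocabulary also shrinks the number of training pairs.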


In [None]:
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(
        len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)
    ))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)


keep_words 7833 / 18079 = 0.4333
Trimmed from 64313 pairs to 53131, 0.8261 of total


### Preparing Data for the Model



So far, we have transformed raw conversational text into a cleaned and
trimmed set of *(input, response)* sentence pairs, along with a vocabulary
that maps words to integer indices.

However, neural sequence models do not operate directly on text. Instead,
they expect **numerical tensors** as input. In this section, we convert
our sentence pairs into padded tensors that can be efficiently processed
by the encoder–decoder model.

To accelerate training and make effective use of GPU parallelism, we will
train the model using **mini-batches** rather than individual sentence
pairs. This introduces an additional challenge: sentences within a batch
may have different lengths.

To handle variable-length sequences, we adopt the following conventions:

- Sentences are converted to sequences of word indices and terminated
  with an `EOS_token`.
- All sequences in a batch are padded to the length of the longest
  sequence using the `PAD_token`.
- Batched input tensors are shaped as  
  **(max_sequence_length, batch_size)**,  
  so that each time step can be processed across all sequences in parallel.

This layout is particularly convenient for recurrent models, which
process input one time step at a time.


In addition to the padded input and target tensors, we also construct:

- A **lengths tensor**, which stores the true (unpadded) length of each
  input sequence. This will later be used to efficiently process variable-
  length sequences.
- A **binary mask tensor** for the target sequences, where padded positions
  are marked with 0 and valid tokens with 1. This allows us to ignore padded
  values when computing the training loss.
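A minimal sketch of these conventions on a toy batch is shown below. The index values are arbitrary and the variable names are assumptions for illustration.

```python
import itertools
import torch

PAD = 0
# Three index sequences of different lengths, each ending with EOS (= 2)
toy_batch = [[5, 6, 7, 2], [8, 2], [9, 10, 2]]

lengths = torch.tensor([len(seq) for seq in toy_batch])
# zip_longest pads the short sequences and transposes to (max_len, batch_size)
padded = torch.LongTensor(list(itertools.zip_longest(*toy_batch, fillvalue=PAD)))
mask = padded != PAD  # True for real tokens, False for padding

print(padded.shape)  # torch.Size([4, 3])
print(padded[:, 1])  # tensor([8, 2, 0, 0])
```

Each column of `padded` holds one sentence, so indexing a row gives the tokens of all sentences at a single time step.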

The following helper functions implement this batching pipeline.


In [None]:
def indexesFromSentence(voc, sentence):
    return [voc.word2index.get(word, UNK_token) for word in sentence.split(' ')] + [EOS_token]


def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))


def binaryMatrix(l, value=PAD_token):
    # Build a 0/1 mask with the same layout as `l`: 0 where the token equals `value`, 1 elsewhere
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m


# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths


# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max(len(indexes) for indexes in indexes_batch)
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len


# Returns all items for a given batch of sentence pairs
def batch2TrainData(voc, pair_batch):
    # Sort pairs by input sentence length (descending)
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)

    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])

    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


### Model Architecture
We use a Seq2Seq model with an Encoder (Bi-GRU), Decoder (GRU), and Global Attention.




#### Sequence-to-Sequence Architecture

The core of our chatbot is a **sequence-to-sequence (seq2seq)** model.
The objective of a seq2seq model is to map a variable-length input
sequence to a variable-length output sequence using a fixed-size neural
network.

[Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) demonstrated
that this task can be accomplished by composing two recurrent neural
networks:

- An **encoder**, which processes the input sequence and compresses it
  into an internal representation.
- A **decoder**, which generates the output sequence one token at a time,
  conditioned on the encoder’s representation.

In the context of conversational modeling, the encoder reads the input
sentence (the query), and the decoder generates the response.

This encoder–decoder formulation underlies many modern sequence modeling
approaches in machine translation, dialogue systems, and text
generation.


#### Encoder

In this part, you will implement the **encoder** component of the
sequence-to-sequence model.

The encoder processes the input sentence one token (word) at a time.
At each time step, it produces an output vector and updates an internal
**hidden state** that summarizes the sequence observed so far.

You should implement the encoder using a **multi-layer, bidirectional
Gated Recurrent Unit (GRU)**. The GRU was introduced by
[Cho et al. (2014)](https://arxiv.org/pdf/1406.1078v3.pdf).
GRUs extend standard recurrent neural networks by incorporating gating
mechanisms that regulate information flow, enabling more effective
modeling of long-term dependencies.



When processing padded mini-batches, you must correctly pack and unpack
sequences using:

- `nn.utils.rnn.pack_padded_sequence`
- `nn.utils.rnn.pad_packed_sequence`

This ensures that the GRU does not perform unnecessary computation over
padding tokens.
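The following sketch shows the pack and unpack calls on a toy batch with a small unidirectional GRU. It is not the encoder you are asked to implement; the tensor sizes and variable names are assumptions, and the lengths are sorted in descending order, which is what `pack_padded_sequence` expects by default.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
max_len, batch_size, emb_dim, hidden = 4, 3, 5, 6

# Toy "embedded" batch of shape (max_len, batch_size, emb_dim)
embedded = torch.randn(max_len, batch_size, emb_dim)
lengths = torch.tensor([4, 3, 2])  # true lengths, descending, on the CPU

gru = nn.GRU(emb_dim, hidden)

# Pack so that the GRU skips the padded positions
packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths)
packed_out, h = gru(packed)

# Unpack back to a padded tensor of shape (max_len, batch_size, hidden)
outputs, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out)

print(outputs.shape)  # torch.Size([4, 3, 6])
print(h.shape)        # torch.Size([1, 3, 6])
```

After unpacking, the output rows beyond each sequence's true length are filled with zeros, and `h` holds the state at the last valid step of each sequence.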


**Encoder Inputs**

- `input_seq`: Padded batch of input sentences  
  Shape: *(max_length, batch_size)*

- `input_lengths`: True lengths of each sentence in the batch  
  Shape: *(batch_size)*

**Encoder Outputs**

- `outputs`: Output features from the last hidden layer of the GRU  
  (sum of bidirectional outputs)  
  Shape: *(max_length, batch_size, hidden_size)*

- `hidden`: Final hidden state of the GRU  
  Shape: *(n_layers × num_directions, batch_size, hidden_size)*


In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super().__init__()
        # TODO: Implement
        raise NotImplementedError

    def forward(self, input_seq, input_lengths, hidden=None):
        # TODO: Implement
        raise NotImplementedError


#### Decoder with Attention

The decoder generates the response sequence in a token-by-token manner.
At each time step, it predicts the next word based on:

- its current hidden state,
- the previously generated word, and
- a **context vector** computed from the encoder outputs.

A limitation of vanilla seq2seq models is their reliance on a single
fixed-length context vector to represent the entire input sequence.
This bottleneck becomes particularly problematic for long input
sentences.


Attention mechanisms address this bottleneck by letting the decoder
focus on different parts of the input sequence at each decoding step.
[Luong et al. (2015)](https://arxiv.org/abs/1508.04025) proposed
**Global Attention**, in which the decoder attends to *all* encoder
hidden states at every time step. Attention weights are computed using
the decoder’s current hidden state and the encoder outputs via a
parameterized **score function**.



At a given decoding step $t$, the attention mechanism computes an
alignment score between the current decoder hidden state $h_t$ and each encoder output $\bar{h}_s$.

Luong attention defines three possible **score functions**:

- **Dot**:  
  $\text{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s$

- **General**:  
  $\text{score}(h_t, \bar{h}_s) = h_t^\top W \bar{h}_s$

- **Concat**:  
  $\text{score}(h_t, \bar{h}_s) = v^\top \tanh(W [h_t ; \bar{h}_s])$
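For reference, in the formulation of Luong et al. (2015) the scores are turned into attention weights with a softmax over the source positions, and the context vector is the corresponding weighted sum of encoder outputs. The symbols $a_t(s)$ and $c_t$ follow the paper and are introduced here for illustration:

$$
a_t(s) = \frac{\exp\big(\text{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\text{score}(h_t, \bar{h}_{s'})\big)},
\qquad
c_t = \sum_{s} a_t(s)\, \bar{h}_s
$$

The weights $a_t(s)$ are non-negative and sum to 1 over the source positions $s$.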



In [None]:
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super().__init__()
        # TODO: validate method
        # TODO: define parameters for the chosen attention method

    def forward(self, hidden, encoder_outputs):
        # TODO: compute attention energies
        # TODO: normalize with softmax
        # TODO: return attention weights of shape (batch_size, 1, max_length)
        raise NotImplementedError

#### Decoder with Luong Attention

We now define the decoder. Unlike the encoder (which processes the entire input
sequence in one forward pass), the decoder is executed **one time step at a time**:
at each step it receives a single token and produces a probability distribution
over the vocabulary for the next token.

Concretely, at decoding step $t$:

- The input to the decoder is a tensor `input_step` of shape **(1, batch_size)**,
  containing one token per sequence.
- After embedding, this becomes **(1, batch_size, hidden_size)**.
- The decoder updates its hidden state using a unidirectional GRU.



We use the global attention mechanism of
[Luong et al. (2015)](https://arxiv.org/abs/1508.04025).


At each decoding step, the decoder should:

1. Embed the current input token.
2. Run one GRU step to update the decoder hidden state.
3. Compute attention weights using the current decoder state and encoder outputs.
4. Use the attention weights to compute a context vector.
5. Combine the decoder state and context vector.
6. Predict a probability distribution over the vocabulary.
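Steps 4 to 6 can be written compactly, following the formulation in Luong et al. (2015). Here $c_t$ denotes the context vector (the attention-weighted sum of encoder outputs), $h_t$ the decoder GRU state, and $W_c$, $W_s$ learned weight matrices. These symbol names follow the paper and are not fixed by this notebook:

$$
\tilde{h}_t = \tanh\big(W_c [c_t ; h_t]\big), \qquad
p(y_t \mid y_{<t}, x) = \text{softmax}\big(W_s \tilde{h}_t\big)
$$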


---

**Inputs**

- `input_step`: one time step (one token) for each sequence in the batch  
  Shape: **(1, batch_size)**

- `last_hidden`: previous decoder hidden state  
  Shape: **(n_layers, batch_size, hidden_size)**

- `encoder_outputs`: encoder outputs for all time steps  
  Shape: **(max_length, batch_size, hidden_size)**

**Outputs**

- `output`: probability distribution over the vocabulary for the next token  
  Shape: **(batch_size, voc.num_words)**

- `hidden`: updated decoder hidden state  
  Shape: **(n_layers, batch_size, hidden_size)**


In [None]:
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super().__init__()
        # TODO: Implement
        raise NotImplementedError

    def forward(self, input_step, last_hidden, encoder_outputs):
        # TODO: Implement
        raise NotImplementedError

### Define Training Procedure





We now define the components required to train our seq2seq chatbot.
Training is performed over mini-batches of padded sequences, which
requires special care when computing the loss and updating model
parameters.

#### Masked Loss

Because batches contain padded sequences, not all positions in the target
tensor correspond to valid words. We therefore compute the loss only over
non-padding positions.

The function below computes a **masked negative log-likelihood loss**,
given the decoder’s output distribution, the target tokens, and a binary
mask that indicates which positions are valid (i.e., not `PAD_token`).
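A small worked example may help to see what the mask does. The probabilities, targets, and mask below are invented; the third position is treated as padding and does not contribute to the loss.

```python
import torch

# Toy softmax outputs for 3 sequences over a 3-word vocabulary
probs = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])
target = torch.tensor([0, 1, 2])
mask = torch.tensor([True, True, False])  # last position is padding

# Probability assigned to each target token, then its negative log
picked = torch.gather(probs, 1, target.view(-1, 1)).squeeze(1)
nll = -torch.log(picked)

# Average only over the valid (non-padded) positions
toy_loss = nll.masked_select(mask).mean()
print(round(toy_loss.item(), 4))  # 0.2899
```

The result is the mean of $-\ln 0.7$ and $-\ln 0.8$; the padded entry with probability 0.4 is ignored.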


In [None]:
def maskNLLLoss(inp, target, mask):
    """
    inp:    (batch_size, vocab_size) softmax probabilities
    target: (batch_size,) target token indices
    mask:   (batch_size,) boolean mask (True for valid tokens, False for PAD)
    """
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()

#### Single Training Iteration

We now describe a **single training iteration**, corresponding to one
mini-batch update.

During each iteration, we apply two commonly used techniques to improve
training stability and convergence:

- **Teacher forcing**: with some probability, the decoder receives the
  ground-truth token as its next input instead of its own prediction.
  This can significantly speed up training, but excessive teacher
  forcing may lead to poor performance at inference time.

- **Gradient clipping**: gradients are clipped to a maximum norm to
  prevent the exploding gradient problem, which is particularly common
  in recurrent neural networks.
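Gradient clipping by norm can be sketched on a toy model as follows. The linear model, the scaled inputs, and the threshold `clip = 1.0` are assumptions chosen only to produce large gradients; `clip_grad_norm_` rescales the gradients in place and returns the total norm measured before clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 1)
x = torch.randn(8, 4) * 100.0  # large inputs to provoke large gradients

toy_loss = (toy_model(x) ** 2).mean()
toy_loss.backward()

clip = 1.0
# Rescale all gradients so that their combined L2 norm is at most `clip`
norm_before = nn.utils.clip_grad_norm_(toy_model.parameters(), clip)

# Recompute the combined gradient norm after clipping
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in toy_model.parameters()))
print(float(norm_before), float(norm_after))
```

Clipping changes only the magnitude of the update, not its direction, because all gradients are scaled by the same factor.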

The sequence of operations for a single iteration is listed in the implementation part below.


#### Single Training Iteration (Implement)

In this part you will implement a **single mini-batch update** for the
seq2seq chatbot.

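You must handle two key ideas we covered in class:

- **Teacher forcing**: with probability `teacher_forcing_ratio`, feed the ground-truth token as the next decoder input instead of the decoder's own prediction.
- **Gradient clipping**: clip gradient norms to avoid exploding gradients in RNN training.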

Your implementation should follow this high-level sequence:



**Sequence of Operations**

1. Forward pass the input batch through the encoder.
2. Initialize the decoder input with `SOS_token` and the decoder hidden
   state with the encoder’s final hidden state.
3. Decode one time step at a time:
   - apply teacher forcing with probability `teacher_forcing_ratio`
   - otherwise, feed the decoder’s own prediction as the next input
4. Compute and accumulate masked loss.
5. Backpropagate gradients.
6. Clip gradients.
7. Update the encoder and decoder parameters with their optimizers.


In [None]:
def train(input_variable, lengths, target_variable, mask, max_target_len,
          encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer,
          batch_size, clip, max_length=MAX_LENGTH):
    """
    Performs a single mini-batch update step.

    Inputs:
        input_variable:  (max_input_len, batch_size)
        lengths:         (batch_size,)  (must be on CPU for packing)
        target_variable: (max_target_len, batch_size)
        mask:            (max_target_len, batch_size) boolean
        max_target_len:  int

    Returns:
        avg_loss: average masked loss per non-PAD token (float)
    """

    # Move tensors to the correct device
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    lengths = lengths.to("cpu")  # required by pack_padded_sequence


    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]]).to(device)


    use_teacher_forcing = (random.random() < teacher_forcing_ratio)

    # We will accumulate loss over time steps
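    loss = 0
    print_losses = []
    n_totals = 0

    # TODO: Implement the encoder forward pass, the decoding loop with masked
    # loss accumulation, backpropagation, gradient clipping, and optimizer updates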


    # Return average loss per non-pad token
    return sum(print_losses) / n_totals


#### Training Loop

We now wrap the single training iteration into a full training loop.
At each iteration, a random mini-batch is sampled and one parameter
update is performed.

We periodically print training statistics and save model checkpoints,
which include the encoder and decoder parameters, optimizer states, and
vocabulary information.
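As a rough sketch of how a checkpoint of this kind can be written and read back, the example below saves a toy model and optimizer to a temporary directory. The dictionary keys, file name, and toy model are assumptions for illustration and do not match the notebook's checkpoint layout exactly.

```python
import os
import tempfile
import torch
import torch.nn as nn

toy_model = nn.Linear(3, 2)
toy_opt = torch.optim.SGD(toy_model.parameters(), lr=0.1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "5_checkpoint.tar")
    # Save the iteration counter together with model and optimizer state
    torch.save({
        "iteration": 5,
        "model": toy_model.state_dict(),
        "opt": toy_opt.state_dict(),
    }, path)

    # Restore into a freshly constructed model and resume from the next iteration
    checkpoint = torch.load(path)
    restored = nn.Linear(3, 2)
    restored.load_state_dict(checkpoint["model"])
    start_iteration = checkpoint["iteration"] + 1

print(start_iteration)  # 6
```

Storing the optimizer state alongside the model parameters allows training to resume without resetting optimizer statistics.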


In [None]:
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
               embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration,
               batch_size, print_every, save_every, clip, corpus_name, loadFilename):

    training_batches = [
        batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
        for _ in range(n_iteration)
    ]

    print('Initializing ...')
    start_iteration = 1
    print_loss = 0

    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        input_variable, lengths, target_variable, mask, max_target_len = training_batches[iteration - 1]

        loss = train(input_variable, lengths, target_variable, mask, max_target_len,
                     encoder, decoder, embedding,
                     encoder_optimizer, decoder_optimizer,
                     batch_size, clip)

        print_loss += loss

        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print(f"Iteration: {iteration}; Percent complete: {iteration / n_iteration * 100:.1f}%; Average loss: {print_loss_avg:.4f}")
            print_loss = 0

        if iteration % save_every == 0:
            directory = os.path.join(
                save_dir, model_name, corpus_name,
                f'{encoder_n_layers}-{decoder_n_layers}_{hidden_size}'
            )
            os.makedirs(directory, exist_ok=True)

            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, f'{iteration}_checkpoint.tar'))


### Model Evaluation


After training the chatbot model, we would like to interact with it and
generate responses to user-provided input sentences. To do so, we must
define how the decoder generates an output sequence from the encoded
input.

During training, the decoder may receive ground-truth tokens as input
(teacher forcing). During evaluation, however, the model must rely
entirely on its own predictions. This requires defining an explicit
**decoding strategy**.

#### Greedy Decoding

We use **greedy decoding** to generate responses at evaluation time.

Formally, given the decoder output distribution at time step $t$,
greedy decoding chooses:

$$
\hat{w}_t = \arg\max_w \; p(w \mid w_{<t}, x)
$$

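That is, at every step it selects the single most probable next token. This locally optimal choice does not necessarily produce the most probable overall sequence. Despite this limitation, it is simple, efficient, and serves as a strong baseline for sequence generation.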


To implement greedy decoding, we define a `GreedySearchDecoder` module.
This module wraps the encoder and decoder and performs decoding one time step at a time.

Given an input sentence, decoding proceeds as follows:

**Computation Graph**

1. Forward the input sequence through the encoder.
2. Initialize the decoder hidden state using the encoder’s final hidden state.
3. Initialize the decoder input with the `SOS_token`.
4. Iteratively decode one token at a time:
   - Forward pass through the decoder.
   - Select the most probable token.
   - Feed the selected token as input to the next step.
5. Collect and return the generated tokens and their scores.
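The token selection in step 4 can be illustrated on a single hypothetical decoder output. The probabilities below are invented; `torch.max` with `dim=1` returns both the best score and its index for each row.

```python
import torch

# Hypothetical decoder output for batch_size = 1 and a 3-word vocabulary
toy_output = torch.tensor([[0.1, 0.6, 0.3]])

score, token = torch.max(toy_output, dim=1)  # best probability and its index
next_input = token.unsqueeze(0)              # shape (1, 1), ready for the next step

print(token.item())      # 1
print(next_input.shape)  # torch.Size([1, 1])
```

The extra dimension restores the *(1, batch_size)* layout that the decoder expects for its input at each step.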


In [None]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

#### Evaluating Text Input

With the decoding procedure defined, we can now evaluate individual input sentences.

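The `evaluate` function handles the low-level mechanics of evaluation: it maps the sentence to word indexes (falling back to `UNK_token` for out-of-vocabulary words), builds a batch of size 1 together with its lengths tensor, runs the searcher, and maps the returned indexes back to words.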



In [None]:
def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    """
    Evaluate a single input sentence using greedy decoding.
    Out-of-vocabulary words are mapped to UNK_token.
    """
    # words -> indexes (UNK-safe)
    indexes = [
        voc.word2index.get(word, UNK_token)
        for word in sentence.split(' ')
    ] + [EOS_token]

    indexes_batch = [indexes]

    # Create lengths tensor
    lengths = torch.tensor([len(indexes)])

    # Prepare input batch (max_length, batch_size=1)
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    input_batch = input_batch.to(device)
    lengths = lengths.to("cpu")

    # Decode sentence
    tokens, scores = searcher(input_batch, lengths, max_length)

    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    """
    Interactive chatbot interface.
    Type 'q' or 'quit' to exit.
    """
    while True:
        # Get input sentence
        input_sentence = input('> ')
        if input_sentence in ('q', 'quit'):
            break

        # Normalize input
        input_sentence = normalizeString(input_sentence)

        # Evaluate sentence
        output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)

        # Remove EOS and PAD tokens from output
        output_words = [w for w in output_words if w not in ('EOS', 'PAD')]

        print('Bot:', ' '.join(output_words))


### Run Model

Finally, it is time to run our model!

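Regardless of whether we want to train or test the chatbot model, we
must initialize the individual encoder and decoder models. In the
following block, we set the model configuration, optionally load a
checkpoint, and build the models. Feel free to try different
configurations to optimize performance.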

In [None]:
# Configure models
model_name = 'cb_model'
attn_model = 'dot'
#attn_model = 'general'
#attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000
#loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                            '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                            '{}_checkpoint.tar'.format(checkpoint_iter))


# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!


#### Run Training

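Run the following block to train the model.

First we set the training parameters, then we initialize the optimizers, and finally we call the `trainIters` function to run the training iterations.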

In [None]:
# Configure training/optimization
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 1
save_every = 500

# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# Move any loaded optimizer state to the active device (works on CPU and GPU)
for optimizer in (encoder_optimizer, decoder_optimizer):
    for state in optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)

# Run training iterations
print("Starting Training!")
trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name, loadFilename)

#### Run Evaluation

To chat with your model, run the following block.

In [None]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (type 'q' or 'quit' to exit)
evaluateInput(encoder, decoder, searcher, voc)

### Beam Search Decoding


So far, we used **greedy decoding**, which selects the most likely next token at each step.
Greedy decoding is fast, but it may miss better full sequences because it commits to local choices.

In this task you will implement **beam search**, as discussed in class.

**Requirements**
- Implement `BeamSearchDecoder` similar in interface to `GreedySearchDecoder`.
- Use beam width `k` (beam size) as a configurable argument.
- Use **log-probabilities** (sum of log-probs) to score sequences.
- Return the best decoded token sequence (and optionally its scores).
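
To make the scoring rule concrete, here is a minimal sketch of beam search on a toy next-token table. It is not the required `BeamSearchDecoder`: the table `TOY_PROBS` and the function `toy_beam_search` are hypothetical names invented for illustration, and the "model" only looks at the previous token. The sketch shows how summing log-probabilities lets a beam of width 2 recover a sequence that greedy decoding (width 1) misses.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the previous token.
TOY_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.4, "y": 0.3, "z": 0.3},
    "b": {"x": 0.9, "y": 0.1},
}


def toy_beam_search(start, steps, k):
    """Keep the k prefixes with the highest summed log-probability."""
    beams = [(0.0, [start])]  # (score, token sequence)
    for _ in range(steps):
        expanded = []
        for score, seq in beams:
            for tok, p in TOY_PROBS[seq[-1]].items():
                expanded.append((score + math.log(p), seq + [tok]))
        # Rank all expansions and keep only the k best.
        expanded.sort(key=lambda cand: cand[0], reverse=True)
        beams = expanded[:k]
    return beams


greedy = toy_beam_search("<s>", steps=2, k=1)[0]  # k=1 is greedy decoding
beam = toy_beam_search("<s>", steps=2, k=2)[0]
print(greedy[1], round(math.exp(greedy[0]), 2))
print(beam[1], round(math.exp(beam[0]), 2))
```

Greedy commits to `a` at the first step (probability 0.6) and ends with a sequence of probability 0.24, while the width-2 beam also keeps `b` and finds `b x` with probability 0.36.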

In [None]:
class BeamSearchDecoder(nn.Module):
    def __init__(self, encoder, decoder, beam_size=5):
        super(BeamSearchDecoder, self).__init__()
        # Implement here

    def forward(self, input_seq, input_length, max_length):
        # Implement here
        raise NotImplementedError

Run the following cell and try the **same 10 prompts** with greedy vs. beam search.
Briefly comment: do you observe differences? When might beam search help, and when might it hurt?


In [None]:
greedy_searcher = GreedySearchDecoder(encoder, decoder)
beam_searcher = BeamSearchDecoder(encoder, decoder)

evaluateInput(encoder, decoder, greedy_searcher, voc)
evaluateInput(encoder, decoder, beam_searcher, voc)

### Understanding & Reflection Questions

### Model Architecture
We use a Seq2Seq model with an Encoder (Bi-GRU), Decoder (GRU), and Global Attention.


## Question 2 — Neural Machine Translation (35 Points)

Translating Portuguese to English using an Encoder-Decoder RNN with Attention.


In [None]:
import os
import math
import random
import time
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from tqdm.auto import tqdm
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, Sampler
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import tensorflow_datasets as tfds

# Seeds
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")


### Data Loading

In [None]:
import tensorflow_datasets as tfds

# Load Portuguese-English dataset
dataset, info = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True)
train_examples, val_examples, test_examples = dataset['train'], dataset['validation'], dataset['test']

# Helper to convert TF dataset to list of strings
def tf_to_list(tf_dataset):
    pt_list, en_list = [], []
    for pt, en in tfds.as_numpy(tf_dataset):
        pt_list.append(pt.decode('utf-8'))
        en_list.append(en.decode('utf-8'))
    return pt_list, en_list

train_pt, train_en = tf_to_list(train_examples)
val_pt,   val_en   = tf_to_list(val_examples)
test_pt,  test_en  = tf_to_list(test_examples)

print(f"Train size: {len(train_pt)}")
print(f"Val size:   {len(val_pt)}")
print(f"Test size:  {len(test_pt)}")


### Tokenization & Vocabulary

In [None]:
# Special tokens for the translation vocabularies
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"

SPECIAL_TOKENS = [PAD_TOKEN, UNK_TOKEN, BOS_TOKEN, EOS_TOKEN]

def tokenize(sentence: str):
    return sentence.strip().split()


class Vocabulary:
    def __init__(self, sentences, min_freq=2):
        """
        sentences: list of raw text sentences
        min_freq : minimum frequency for a word to be included
        """
        counter = Counter()
        for sent in sentences:
            counter.update(tokenize(sent))

        # Initialize vocab with special tokens
        self.itos = list(SPECIAL_TOKENS)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

        # Add frequent words
        for word, freq in counter.items():
            if freq >= min_freq and word not in self.stoi:
                self.stoi[word] = len(self.itos)
                self.itos.append(word)

    def encode(self, sentence, add_bos=False, add_eos=False):
        tokens = tokenize(sentence)
        ids = [self.stoi.get(tok, self.stoi[UNK_TOKEN]) for tok in tokens]

        if add_bos:
            ids = [self.stoi[BOS_TOKEN]] + ids
        if add_eos:
            ids = ids + [self.stoi[EOS_TOKEN]]

        return ids

    def decode(self, ids):
        words = []
        for idx in ids:
            word = self.itos[idx]
            if word in {BOS_TOKEN, PAD_TOKEN}:
                continue
            if word == EOS_TOKEN:
                break
            words.append(word)
        return " ".join(words)

    def __len__(self):
        return len(self.itos)


# Build vocabularies
pt_vocab = Vocabulary(train_pt, min_freq=2)
en_vocab = Vocabulary(train_en, min_freq=2)

print("Portuguese vocab size:", len(pt_vocab))
print("English vocab size:", len(en_vocab))


### Batching

In [None]:
class TranslationDataset(Dataset):
    def __init__(self, src_sentences, tgt_sentences, src_vocab, tgt_vocab):
        assert len(src_sentences) == len(tgt_sentences)
        self.src = src_sentences
        self.tgt = tgt_sentences
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        src_ids = self.src_vocab.encode(self.src[idx])
        tgt_ids = self.tgt_vocab.encode(self.tgt[idx], add_bos=True, add_eos=True)
        return src_ids, tgt_ids

def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)

    src_lens = torch.tensor([len(x) for x in src_batch], dtype=torch.long)
    tgt_lens = torch.tensor([len(x) for x in tgt_batch], dtype=torch.long)

    src_pad = pad_sequence(
        [torch.tensor(x, dtype=torch.long) for x in src_batch],
        batch_first=True,
        padding_value=pt_vocab.stoi[PAD_TOKEN]
    )
    tgt_pad = pad_sequence(
        [torch.tensor(x, dtype=torch.long) for x in tgt_batch],
        batch_first=True,
        padding_value=en_vocab.stoi[PAD_TOKEN]
    )

    return src_pad, src_lens, tgt_pad, tgt_lens

class BucketBatchSampler(Sampler):
    def __init__(self, lengths, batch_size, bucket_size=2048, shuffle=True, drop_last=False):
        self.lengths = np.asarray(lengths)
        self.batch_size = batch_size
        self.bucket_size = bucket_size
        self.shuffle = shuffle
        self.drop_last = drop_last
        self.indices = np.arange(len(self.lengths))

    def __iter__(self):
        idxs = self.indices.copy()
        if self.shuffle:
            np.random.shuffle(idxs)

        for i in range(0, len(idxs), self.bucket_size):
            bucket = idxs[i:i + self.bucket_size]
            bucket = bucket[np.argsort(self.lengths[bucket])]

            for j in range(0, len(bucket), self.batch_size):
                batch = bucket[j:j + self.batch_size]
                if self.drop_last and len(batch) < self.batch_size:
                    continue
                yield batch.tolist()

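    def __len__(self):
        # Count batches bucket by bucket: every bucket may end in a partial batch
        n = 0
        for i in range(0, len(self.indices), self.bucket_size):
            size = min(self.bucket_size, len(self.indices) - i)
            n += size // self.batch_size
            if not self.drop_last and size % self.batch_size != 0:
                n += 1
        return n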

train_dataset = TranslationDataset(train_pt, train_en, pt_vocab, en_vocab)
val_dataset   = TranslationDataset(val_pt,   val_en,   pt_vocab, en_vocab)

train_src_lengths = [len(pt_vocab.encode(s)) for s in train_pt]
val_src_lengths   = [len(pt_vocab.encode(s)) for s in val_pt]

BATCH_SIZE = 16  # You may experiment with other batch sizes

train_sampler = BucketBatchSampler(train_src_lengths, batch_size=BATCH_SIZE, bucket_size=2048, shuffle=True)
val_sampler   = BucketBatchSampler(val_src_lengths,   batch_size=BATCH_SIZE, bucket_size=2048, shuffle=False)

train_loader = DataLoader(train_dataset, batch_sampler=train_sampler, collate_fn=collate_fn)
val_loader   = DataLoader(val_dataset,   batch_sampler=val_sampler,   collate_fn=collate_fn)

print("Train batches:", len(train_loader))
print("Val batches:", len(val_loader))

### Model Architecture

#### Encoder

In [None]:
class EncoderBiLSTM(nn.Module):
    def __init__(self, src_vocab_size, emb_dim, hidden_dim, pad_idx, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, emb_dim, padding_idx=pad_idx)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.pad_idx = pad_idx

    def forward(self, src_pad, src_lens):
        # src_pad: (B, T_src)
        # src_lens: (B,)

        embedded = self.dropout(self.embedding(src_pad))

        # Pack padded sequence
        # Note: enforce_sorted=False is generally safer if not pre-sorted
        packed = pack_padded_sequence(embedded, src_lens.cpu(), batch_first=True, enforce_sorted=False)

        outputs, (hidden, cell) = self.lstm(packed)

        # Unpack
        outputs, _ = pad_packed_sequence(outputs, batch_first=True)

        return outputs, (hidden, cell)

#### Attention

In [None]:
class DotProductAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Project encoder states (2H) to decoder hidden dim (H)
        self.W_p = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, dec_h, enc_hiddens, src_lens):
        # dec_h: (B, H)
        # enc_hiddens: (B, T_src, 2H)
        # src_lens: (B,)

        # Project encoder hiddens
        enc_hiddens_proj = self.W_p(enc_hiddens) # (B, T_src, H)

        # Compute scores: (B, T_src)
        # dec_h.unsqueeze(2): (B, H) -> (B, H, 1)
        # bmm: (B, T_src, H) x (B, H, 1) -> (B, T_src, 1), then squeeze -> (B, T_src)
        scores = torch.bmm(enc_hiddens_proj, dec_h.unsqueeze(2)).squeeze(2)

        # Masking
        max_len = enc_hiddens.size(1)
        src_lens = src_lens.to(scores.device)
        # mask[i, j] is True if j < len[i] (valid), False if padding
        mask = torch.arange(max_len, device=scores.device).expand(len(src_lens), max_len) < src_lens.unsqueeze(1)

        scores = scores.masked_fill(~mask, -1e9)

        attn_weights = F.softmax(scores, dim=1) # (B, T_src)

        # Context vector: (B, 2H)
        # (B, 1, T_src) x (B, T_src, 2H) -> (B, 1, 2H)
        context = torch.bmm(attn_weights.unsqueeze(1), enc_hiddens).squeeze(1)

        return context, attn_weights

#### Decoder Components

In [None]:
class DecoderInit(nn.Module):
    """
    Projects final BiLSTM encoder states -> initial UniLSTM decoder states.

    Input:
      h_n, c_n: (2, B, H)  (forward + backward)
    Output:
      h0_dec, c0_dec: (B, H)
    """
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_proj = nn.Linear(hidden_dim * 2, hidden_dim)
        self.cell_proj = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, h_n, c_n):
        # Concatenate forward and backward states along dim 1
        # h_n shape is (num_directions, batch, hidden_size) -> (2, B, H)
        # Cat (B, H) and (B, H) -> (B, 2H)
        h_cat = torch.cat((h_n[0], h_n[1]), dim=1)
        c_cat = torch.cat((c_n[0], c_n[1]), dim=1)

        h0_dec = torch.tanh(self.hidden_proj(h_cat))
        c0_dec = torch.tanh(self.cell_proj(c_cat))

        return h0_dec, c0_dec

In [None]:
class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size, emb_dim, hidden_dim, pad_idx, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(tgt_vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTMCell(emb_dim, hidden_dim)
        self.attention = DotProductAttention(hidden_dim)
        self.dropout = nn.Dropout(dropout)

        # Combine hidden (H) + context (2H) -> H
        self.combine_proj = nn.Linear(hidden_dim + hidden_dim * 2, hidden_dim)

    def forward(self, tgt_pad, dec_init_state, enc_hiddens, src_lens):
        """
        tgt_pad:       (B, T_tgt) includes <bos> ... <eos>
        dec_init_state: (h0_dec, c0_dec) each (B, H)
        enc_hiddens:   (B, T_src, 2H)
        src_lens:      (B,)

        returns:
          attn_vecs: (B, T_tgt-1, H)
        """
        batch_size, seq_len = tgt_pad.size()
        h, c = dec_init_state

        embedded = self.embedding(tgt_pad) # (B, T, E)
        embedded = self.dropout(embedded)

        attn_vecs = []

        # Decode T-1 steps (predicting 2nd token onwards)
        # Using teacher forcing: input at t is tgt_pad[:, t]
        for t in range(seq_len - 1):
            x_t = embedded[:, t, :] # (B, E)

            h, c = self.lstm(x_t, (h, c))

            context, _ = self.attention(h, enc_hiddens, src_lens)

            combined = torch.cat((h, context), dim=1)
            attn_vec = torch.tanh(self.combine_proj(combined))

            attn_vecs.append(attn_vec)

        attn_vecs = torch.stack(attn_vecs, dim=1) # (B, T-1, H)
        return attn_vecs

In [None]:
class OutputProjection(nn.Module):
    def __init__(self, hidden_dim, tgt_vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, attn_vecs):
        """
        attn_vecs: (B, T, H)
        returns:
          logits: (B, T, |V_tgt|)
        """
        return self.proj(attn_vecs)

#### Full Seq2Seq Model

In [None]:
class Seq2SeqNMT(nn.Module):
    def __init__(self, encoder, decoder, dec_init, out_proj):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.dec_init = dec_init
        self.out_proj = out_proj

    def forward(self, src_pad, src_lens, tgt_pad):
        enc_hiddens, (h_n, c_n) = self.encoder(src_pad, src_lens)
        dec_init_state = self.dec_init(h_n, c_n)
        attn_vecs = self.decoder(tgt_pad, dec_init_state, enc_hiddens, src_lens)
        logits = self.out_proj(attn_vecs)
        return logits

### Training

In [None]:
# Hyperparameters
EMB_DIM = 128
HIDDEN_DIM = 128
DROPOUT = 0.1
N_EPOCHS = 25
LEARNING_RATE = 0.0005
CLIP = 1.0

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize model components
encoder = EncoderBiLSTM(len(pt_vocab), EMB_DIM, HIDDEN_DIM, pt_vocab.stoi[PAD_TOKEN], DROPOUT)
decoder = Decoder(len(en_vocab), EMB_DIM, HIDDEN_DIM, en_vocab.stoi[PAD_TOKEN], DROPOUT)
dec_init = DecoderInit(HIDDEN_DIM)
out_proj = OutputProjection(HIDDEN_DIM, len(en_vocab))

model = Seq2SeqNMT(encoder, decoder, dec_init, out_proj).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index=en_vocab.stoi[PAD_TOKEN])

train_losses = []
val_losses = []

print(f"Training on {device}...")

for epoch in range(N_EPOCHS):
    model.train()
    epoch_loss = 0

    # Train Loop
    for src_pad, src_lens, tgt_pad, tgt_lens in tqdm(train_loader, desc=f"Epoch {epoch+1}/{N_EPOCHS}", leave=False):
        src_pad, tgt_pad = src_pad.to(device), tgt_pad.to(device)
        # src_lens can stay on the CPU: EncoderBiLSTM calls .cpu() on it before packing.

        optimizer.zero_grad()

        # Forward
        logits = model(src_pad, src_lens, tgt_pad)

        # Targets are shifted by 1 (exclude BOS)
        targets = tgt_pad[:, 1:]

        # Flatten
        logits = logits.reshape(-1, logits.shape[-1])
        targets = targets.reshape(-1)

        loss = criterion(logits, targets)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        optimizer.step()

        epoch_loss += loss.item()

    avg_train_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # Validation Loop
    model.eval()
    epoch_val_loss = 0
    with torch.no_grad():
        for src_pad, src_lens, tgt_pad, tgt_lens in val_loader:
            src_pad, tgt_pad = src_pad.to(device), tgt_pad.to(device)
            logits = model(src_pad, src_lens, tgt_pad)
            targets = tgt_pad[:, 1:]

            logits = logits.reshape(-1, logits.shape[-1])
            targets = targets.reshape(-1)

            loss = criterion(logits, targets)
            epoch_val_loss += loss.item()

    avg_val_loss = epoch_val_loss / len(val_loader)
    val_losses.append(avg_val_loss)

    print(f"Epoch {epoch+1}: Train Loss = {avg_train_loss:.4f} | Val Loss = {avg_val_loss:.4f}")

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

### Evaluation (BLEU)

In [None]:
def beam_search_decode(model, src_sentence, beam_size=5, max_len=50):
    model.eval()
    with torch.no_grad():
        # Encode
        src_ids = pt_vocab.encode(src_sentence)
        # Add batch dim
        src_tensor = torch.tensor([src_ids], dtype=torch.long, device=device)
        src_len = torch.tensor([len(src_ids)], dtype=torch.long)

        enc_hiddens, (h_n, c_n) = model.encoder(src_tensor, src_len)
        h, c = model.dec_init(h_n, c_n)

        # Candidates: (score, [token_ids], (h, c))
        # Start with BOS
        candidates = [(0.0, [en_vocab.stoi[BOS_TOKEN]], (h, c))]
        completed = []

        for _ in range(max_len):
            new_candidates = []
            for score, seq, (curr_h, curr_c) in candidates:
                if seq[-1] == en_vocab.stoi[EOS_TOKEN]:
                    completed.append((score, seq))
                    continue

                inp_token = torch.tensor([seq[-1]], dtype=torch.long, device=device)

                # Decoder Step (manual expansion of Decoder logic)
                # 1. Embed
                embed = model.decoder.embedding(inp_token) # (1, E)
                embed = model.decoder.dropout(embed)  # No-op here: model.eval() disables dropout

                # 2. LSTM
                new_h, new_c = model.decoder.lstm(embed, (curr_h, curr_c))

                # 3. Attention
                context, _ = model.decoder.attention(new_h, enc_hiddens, src_len.to(device))

                # 4. Combine
                combined = torch.cat((new_h, context), dim=1)
                attn_vec = torch.tanh(model.decoder.combine_proj(combined))

                # 5. Output Proj
                logits = model.out_proj(attn_vec)
                log_probs = F.log_softmax(logits, dim=1)

                topv, topi = log_probs.topk(beam_size)

                for v, i in zip(topv[0], topi[0]):
                    new_candidates.append((score + v.item(), seq + [i.item()], (new_h, new_c)))

            if not new_candidates:
                break

            new_candidates.sort(key=lambda x: x[0], reverse=True)
            candidates = new_candidates[:beam_size]

        if not completed:
            completed = [(c[0], c[1]) for c in candidates]

        completed.sort(key=lambda x: x[0], reverse=True)
        best_seq = completed[0][1]

        return en_vocab.decode(best_seq)

def compute_bleu(references, hypotheses):
    # Calculate BLEU-4 score
    precisions = []
    for n in range(1, 5):
        correct = 0
        total = 0
        for ref, hyp in zip(references, hypotheses):
            ref_tokens = ref.split()
            hyp_tokens = hyp.split()

            ref_counts = Counter([tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens)-n+1)])
            hyp_counts = Counter([tuple(hyp_tokens[i:i+n]) for i in range(len(hyp_tokens)-n+1)])

            for gram, count in hyp_counts.items():
                total += count
                correct += min(count, ref_counts.get(gram, 0))

        if total > 0:
            precisions.append(correct / total)
        else:
            precisions.append(0)

    if min(precisions) == 0:
        return 0.0

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)

    ref_len = sum(len(r.split()) for r in references)
    hyp_len = sum(len(h.split()) for h in hypotheses)

    if hyp_len > ref_len:
        bp = 1.0
    else:
        bp = math.exp(1 - ref_len / hyp_len) if hyp_len > 0 else 0

    return bp * geo_mean

# Evaluate on Test Set
print("Evaluating on Test Set...")
hypotheses = []
references = []

# Decode every sentence in the test split (test_pt / test_en)
for pt, en in tqdm(zip(test_pt, test_en), total=len(test_pt), desc="Decoding"):
    hyp = beam_search_decode(model, pt, beam_size=5)
    hypotheses.append(hyp)
    references.append(en)

bleu_score = compute_bleu(references, hypotheses)
print(f"Corpus-Level BLEU Score: {bleu_score*100:.2f}")

# Show some examples
print("\nExample Translations:")
for i in range(5):
    print(f"Src: {test_pt[i]}")
    print(f"Ref: {test_en[i]}")
    print(f"Hyp: {hypotheses[i]}")
    print("-" * 30)

### Q2: Understanding & Reflection

**1. Parallel Corpus**: Source-target sentence pairs used for supervised training.

**2. Special Tokens**:
* `<pad>`: Batch padding.
* `<unk>`: Unknown words.
* `<bos>`/`<eos>`: Sequence boundaries.
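
As a small illustration of how these tokens show up in a batch, the sketch below encodes two sentences and pads them to a common length. The vocabulary `stoi` and the helpers `encode` and `pad_batch` are hypothetical stand-ins, not the `Vocabulary` class or `collate_fn` used above.

```python
# Hypothetical toy vocabulary; the ids are illustrative only.
stoi = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3, "the": 4, "cat": 5, "sat": 6}


def encode(sentence):
    """Map words to ids, fall back to <unk>, and wrap with <bos>/<eos>."""
    ids = [stoi.get(word, stoi["<unk>"]) for word in sentence.split()]
    return [stoi["<bos>"]] + ids + [stoi["<eos>"]]


def pad_batch(seqs):
    """Right-pad every sequence with <pad> up to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [s + [stoi["<pad>"]] * (max_len - len(s)) for s in seqs]


batch = pad_batch([encode("the cat sat"), encode("the dog")])
for row in batch:
    print(row)
```

Here "dog" is not in the toy vocabulary, so it maps to the `<unk>` id, and the shorter sentence receives one trailing `<pad>` id.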

**3. Encoder-Decoder Init**: The bidirectional encoder returns two final hidden states and two final cell states, one per direction. We concatenate each pair and project it from 2H to H to initialize the unidirectional decoder.

**4. Attention**: Calculates a weighted sum of encoder outputs (context) for each decoding step, focusing on relevant source info.
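
The weighted sum can be sketched in plain Python for a single decoder step. This is a simplified illustration, not the `DotProductAttention` module above: the function `attend` is a hypothetical name, the keys double as values, and there is no learned projection. It assumes `length` is at least 1.

```python
import math


def attend(query, keys, values, length):
    """Dot-product attention over the first `length` positions; the rest is padding."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Mask padded positions so they get zero weight after the softmax.
    scores = [s if i < length else float("-inf") for i, s in enumerate(scores)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # third position is padding
context, weights = attend([2.0, 0.0], keys, keys, length=2)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The padded position would have had the largest raw score, but the mask forces its weight to zero, so the context vector is built only from the two real source positions.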

**5. Teacher Forcing**: Uses ground-truth inputs during training. Inference uses model predictions. Discrepancy causes exposure bias.
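
The difference between the two regimes can be sketched with plain lists. The token ids, the helper `free_running`, and the dictionary-based `fake_model` are all hypothetical and only serve to show the mechanics: under teacher forcing the inputs are the gold sequence shifted by one, while at inference each prediction becomes the next input.

```python
# Hypothetical token ids for one target sentence: <bos>=2, <eos>=3.
tgt = [2, 10, 11, 12, 3]

# Teacher forcing: the decoder reads the gold prefix and predicts the next gold token.
decoder_inputs = tgt[:-1]  # <bos>, w1, w2, w3
targets = tgt[1:]          # w1, w2, w3, <eos>
for step, (inp, gold) in enumerate(zip(decoder_inputs, targets)):
    print(f"step {step}: input={inp} -> target={gold}")


def free_running(step_fn, bos, eos, max_len):
    """Inference-style decoding: each prediction becomes the next input."""
    out, tok = [], bos
    for _ in range(max_len):
        tok = step_fn(tok)
        if tok == eos:
            break
        out.append(tok)
    return out


# Stand-in "model" that makes one mistake: it predicts 99 after 10 instead of 11.
fake_model = {2: 10, 10: 99, 99: 3}.get
generated = free_running(fake_model, bos=2, eos=3, max_len=10)
print(generated)
```

Once the stand-in model emits the wrong token, every later step is conditioned on that mistake, a situation the model never encountered when trained purely with teacher forcing.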

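**6. BLEU**: Corpus-level metric that combines clipped n-gram precisions (n = 1 to 4) with a brevity penalty. Beam search often scores higher than greedy decoding because it keeps several candidate sequences instead of committing to a single one.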
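
A minimal sketch of the two ingredients of BLEU, assuming whitespace tokenization and a single reference per hypothesis. The helper names `clipped_precision` and `brevity_penalty` are hypothetical; they mirror the logic of `compute_bleu` above on one sentence pair rather than a whole corpus.

```python
import math
from collections import Counter


def clipped_precision(ref, hyp, n):
    """Modified n-gram precision: a hypothesis n-gram counts at most as often as it occurs in the reference."""
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


def brevity_penalty(ref_len, hyp_len):
    """Penalize hypotheses that are not longer than the reference."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)


ref = "the cat is on the mat".split()
hyp = "the the the the".split()
p1 = clipped_precision(ref, hyp, 1)       # "the": 4 in hyp, clipped to 2 in ref
bp = brevity_penalty(len(ref), len(hyp))  # exp(1 - 6/4)
print(p1, round(bp, 4))
```

Clipping keeps a degenerate hypothesis that repeats a common word from reaching a perfect unigram precision, and the brevity penalty lowers the score of hypotheses shorter than the reference.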

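**7. Length Effects**: Translation quality tends to degrade on long sentences because a fixed-size vector must summarize the whole source. Attention mitigates this by letting the decoder consult every encoder state at each step.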

