<a href="https://colab.research.google.com/github/Bharathi-Krishna/LSTM/blob/main/pytorch_translate_lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os, io, math, random, re, zipfile, urllib.request
from typing import List, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Device: cuda


In [2]:
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

PyTorch version: 2.8.0+cu126
CUDA available: True


In [3]:
X_train = torch.tensor([[0, 0],
                        [0, 1],
                        [1, 0],
                        [1, 1]], dtype=torch.float32)

y_train = torch.tensor([[0],
                        [1],
                        [1],
                        [1]], dtype=torch.float32)

In [4]:
X_train

tensor([[0., 0.],
        [0., 1.],
        [1., 0.],
        [1., 1.]])

In [5]:
nn.Linear?

In [6]:
class ORGateNet(nn.Module):
    def __init__(self):
        super(ORGateNet, self).__init__()
        # Simple network: 2 inputs -> 3 hidden units -> 1 output
        self.hidden = nn.Linear(2, 3)
        self.output = nn.Linear(3, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.hidden(x))
        x = self.sigmoid(self.output(x))  # shape [B, 1]: one probability per example
        return x

In [7]:
model = ORGateNet()
criterion = nn.BCELoss()  # Binary Cross Entropy
optimizer = optim.SGD(model.parameters(), lr=5.0)  # very large lr is fine for this toy problem; typical values are nearer 1e-3

In [8]:
for epoch in range(1000):
    # Forward pass
    outputs = model(X_train)
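    loss = criterion(outputs, y_train)  # binary cross-entropy against the labels

    # Backward pass and optimization
    optimizer.zero_grad()  # reset gradients left over from the previous step
    loss.backward()        # backpropagation: compute gradients of the loss
    optimizer.step()       # update the weights using those gradients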

    # Print progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"{epoch + 1:5d} | {loss.item():.6f}")

  100 | 0.007648
  200 | 0.002924
  300 | 0.001748
  400 | 0.001231
  500 | 0.000943
  600 | 0.000762
  700 | 0.000637
  800 | 0.000546
  900 | 0.000478
 1000 | 0.000424


In [9]:
model.eval()  # Evaluation mode (affects layers such as dropout); torch.no_grad() below is what disables gradient tracking.
with torch.no_grad():
    for i in range(len(X_train)):
        input_vals = X_train[i]
        predicted = model(input_vals.unsqueeze(0))
        actual = y_train[i]
        rounded = torch.round(predicted)

        print(f"{input_vals.numpy()} | {predicted.item():.6f} | {actual.item():.0f}      | {rounded.item():.0f}")

[0. 0.] | 0.000983 | 0      | 0
[0. 1.] | 0.999653 | 1      | 1
[1. 0.] | 0.999663 | 1      | 1
[1. 1.] | 0.999975 | 1      | 1
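
As a sanity check on the loss values logged during training, binary cross-entropy can be recomputed by hand from the four predictions printed above. This is a small sketch in plain Python; `bce` is a helper name introduced here for illustration, and it assumes the mean reduction that `nn.BCELoss` uses by default.

```python
import math

def bce(preds, targets):
    """Mean binary cross-entropy over paired probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(preds, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(preds)

# Predictions and labels copied from the evaluation output above.
preds = [0.000983, 0.999653, 0.999663, 0.999975]
targets = [0, 1, 1, 1]

loss = bce(preds, targets)
print(f"{loss:.6f}")
```

The result should land very close to the value logged at epoch 1000. A small difference is expected, because the logged loss was computed just before the final weight update and the predictions above are rounded to six decimals.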


In [10]:
test_inputs = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)

In [11]:
with torch.no_grad():
    for test_input in test_inputs:
        prediction = model(test_input.unsqueeze(0))
        rounded_pred = torch.round(prediction)
        print(f"Input: {test_input.numpy()} -> Prediction: {prediction.item():.4f} -> Rounded: {rounded_pred.item():.0f}")

Input: [0. 0.] -> Prediction: 0.0010 -> Rounded: 0
Input: [0. 1.] -> Prediction: 0.9997 -> Rounded: 1
Input: [1. 0.] -> Prediction: 0.9997 -> Rounded: 1
Input: [1. 1.] -> Prediction: 1.0000 -> Rounded: 1


# English to French translation

In [12]:
# We'll use the small "eng-fra" pairs from the official PyTorch tutorial mirror.

url = "https://download.pytorch.org/tutorial/data.zip"
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

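zip_path = os.path.join(data_dir, "data.zip")
# The archive unpacks into a nested "data/" folder, so test the path that is actually read.
# (Checking data/eng-fra.txt would never succeed and would re-download on every run.)
raw_path = os.path.join(data_dir, "data", "eng-fra.txt")
if not os.path.exists(raw_path):
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall(data_dir)
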
with open(raw_path, encoding="utf-8") as f:
    lines = f.read().strip().split("\n")

# Each line format: "english\tfrench"
print("Total pairs in file:", len(lines))
print("Sample:", lines[0], lines[1000])

Total pairs in file: 135842
Sample: Go.	Va ! I guess so.	Je le suppose.


In [13]:
print("Sample:", lines[1000])

Sample: I guess so.	Je le suppose.


In [14]:
# Lowercase, drop everything except letters (including Latin-1 accents) and basic punctuation, collapse whitespace.
def normalize(s: str) -> str:
    s = s.lower().strip()
    s = re.sub(r"[^a-zA-ZÀ-ÿ?.!'\s-]", "", s)  # keep letters & basic punct
    s = re.sub(r"\s+", " ", s).strip()
    return s

pairs = []
for line in lines:
    eng, fra = line.split("\t")[:2]
    eng, fra = normalize(eng), normalize(fra)
    if 2 <= len(eng.split()) <= 10 and 2 <= len(fra.split()) <= 10:
        pairs.append((eng, fra))

random.shuffle(pairs)
pairs = pairs[:8000]   # adjust for faster/slower runs, 135k pairs.
print("Filtered pairs:", len(pairs))
print("Example:", pairs[0])

Filtered pairs: 8000
Example: ('if you want to be free destroy your television set.', 'si tu veux être libre détruis ton téléviseur !')
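
To make the effect of `normalize` concrete, the sketch below repeats the same three steps on a few hand-picked strings (the example strings are made up for illustration, not taken from the dataset). It also shows one side effect of the `À-ÿ` range: characters outside Latin-1, such as `œ`, are silently dropped.

```python
import re

def normalize(s: str) -> str:
    # Same steps as the notebook's normalize(): lowercase, filter, collapse spaces.
    s = s.lower().strip()
    s = re.sub(r"[^a-zA-ZÀ-ÿ?.!'\s-]", "", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

examples = ["  Hello,   World! 123 ", "Où est-il ?", "Mon cœur"]
results = [normalize(e) for e in examples]
for raw, norm in zip(examples, results):
    print(repr(raw), "->", repr(norm))
```

Note that punctuation is not split from the word it touches, so `"world!"` stays a single token when the sentence is later split on whitespace.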


In [15]:
PAD, SOS, EOS = 0, 1, 2

def build_vocab(texts: List[str], min_freq: int = 1, max_size: int = 20000):
    freq = {}
    for t in texts:
        for tok in t.split():
            freq[tok] = freq.get(tok, 0) + 1
    # sort by freq then alpha for determinism
    items = sorted([kv for kv in freq.items() if kv[1] >= min_freq], key=lambda x: (-x[1], x[0]))
    itos = ["<pad>", "<sos>", "<eos>"]  # index -> string (token)
    for w, _ in items:
        if len(itos) >= max_size: break
        itos.append(w)
    stoi = {w: i for i, w in enumerate(itos)}  # string (token) -> index
    return itos, stoi

eng_texts = [e for e,_ in pairs]
fra_texts = [f for _,f in pairs]
eng_itos, eng_stoi = build_vocab(eng_texts, min_freq=1, max_size=10000)
fra_itos, fra_stoi = build_vocab(fra_texts, min_freq=1, max_size=10000)

ENG_V, FRA_V = len(eng_itos), len(fra_itos)
print("ENG vocab:", ENG_V, "FRA vocab:", FRA_V)
print("Sample French words:", fra_itos[3:10])  # Debug: check vocab

ENG vocab: 5669 FRA vocab: 7891
Sample French words: ['je', 'de', '?', 'pas', 'ne', 'que', 'à']
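
The ordering produced by `build_vocab` is easiest to see on a tiny corpus. The sketch below copies the function and runs it on three made-up sentences: tokens are sorted by descending frequency, ties are broken alphabetically, and the three special tokens always occupy indices 0 to 2.

```python
from typing import List

def build_vocab(texts: List[str], min_freq: int = 1, max_size: int = 20000):
    freq = {}
    for t in texts:
        for tok in t.split():
            freq[tok] = freq.get(tok, 0) + 1
    # Sort by descending frequency, then alphabetically, for determinism.
    items = sorted([kv for kv in freq.items() if kv[1] >= min_freq],
                   key=lambda x: (-x[1], x[0]))
    itos = ["<pad>", "<sos>", "<eos>"]
    for w, _ in items:
        if len(itos) >= max_size:
            break
        itos.append(w)
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

toy = ["the cat sat", "the dog sat", "a cat"]  # made-up corpus
itos, stoi = build_vocab(toy)
itos_frequent, _ = build_vocab(toy, min_freq=2)
print(itos)
print(itos_frequent)
```

With `min_freq=2` the singletons `a` and `dog` disappear, which is how rarer words would be pruned on the real data.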


In [16]:
def encode_sentence(s: str, stoi: dict) -> List[int]:
    # NOTE: the vocabulary reserves no "<unk>" entry, so index 3 is simply the
    # most frequent real word; out-of-vocabulary tokens are mapped onto it.
    UNK = 3 if len(stoi) > 3 else None
    ids = []
    for tok in s.split():
        if tok in stoi:
            ids.append(stoi[tok])
        elif UNK is not None:
            ids.append(UNK)  # Or skip unknown tokens
        # else: skip the token
    return [SOS] + ids + [EOS]

def pad_seq(ids: List[int], max_len: int) -> List[int]:
    if len(ids) >= max_len:
        return ids[:max_len]  # Truncate if too long
    return ids + [PAD] * (max_len - len(ids)) # if too short, add padding tokens.

# Split train/val
split = int(0.9 * len(pairs))
train_pairs = pairs[:split]
val_pairs   = pairs[split:]

def tensorize(pairs, eng_stoi, fra_stoi, max_len_src=14, max_len_tgt=14):
    srcs, tgts = [], []
    for e, f in pairs:
        s = encode_sentence(e, eng_stoi)
        t = encode_sentence(f, fra_stoi)
        # Ensure we don't exceed max length
        s = s[:max_len_src] if len(s) > max_len_src else s
        t = t[:max_len_tgt] if len(t) > max_len_tgt else t
        srcs.append(pad_seq(s, max_len_src))
        tgts.append(pad_seq(t, max_len_tgt))
    return torch.tensor(srcs, dtype=torch.long), torch.tensor(tgts, dtype=torch.long)

MAX_SRC_LEN, MAX_TGT_LEN = 14, 14
train_src, train_tgt = tensorize(train_pairs, eng_stoi, fra_stoi, MAX_SRC_LEN, MAX_TGT_LEN)
val_src,   val_tgt   = tensorize(val_pairs,   eng_stoi, fra_stoi, MAX_SRC_LEN, MAX_TGT_LEN)

print("Train tensors:", train_src.shape, train_tgt.shape)
print("Sample target:", train_tgt[0])  # Debug: check encoding

Train tensors: torch.Size([7200, 14]) torch.Size([7200, 14])
Sample target: tensor([   1,   52,   19,   45,   55,  436, 4502,   53, 7557,   26,    2,    0,
           0,    0])
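
The encoding and padding steps can be traced on a single sentence. The sketch below is a variant, not the notebook's code: it assumes a dedicated `<unk>` token reserved at index 3 (the notebook reserves only indices 0 to 2, so its fallback index 3 points at a real word). The toy vocabulary is hypothetical.

```python
from typing import Dict, List

PAD, SOS, EOS, UNK = 0, 1, 2, 3  # UNK is an assumption; the notebook reserves only 0-2

def encode_sentence(s: str, stoi: Dict[str, int]) -> List[int]:
    # Unknown words fall back to the dedicated UNK id.
    return [SOS] + [stoi.get(tok, UNK) for tok in s.split()] + [EOS]

def pad_seq(ids: List[int], max_len: int) -> List[int]:
    # Truncate long sequences, right-pad short ones with PAD.
    return ids[:max_len] + [PAD] * max(0, max_len - len(ids))

# Hypothetical toy vocabulary with "<unk>" reserved at index 3.
stoi = {"<pad>": PAD, "<sos>": SOS, "<eos>": EOS, "<unk>": UNK,
        "i": 4, "guess": 5, "so.": 6}

ids = pad_seq(encode_sentence("i guess so.", stoi), 8)
oov = pad_seq(encode_sentence("i suppose so.", stoi), 8)
print(ids)  # known words only
print(oov)  # "suppose" is out of vocabulary
```

Reserving the extra slot would require adding `"<unk>"` to the initial `itos` list in `build_vocab`, which shifts every word index by one.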


In [17]:
def batch_iter(src, tgt, batch_size=128, shuffle=True):
    N = src.size(0)
    idx = list(range(N))
    if shuffle: random.shuffle(idx)
    for i in range(0, N, batch_size):
        b = idx[i:i+batch_size]
        yield src[b].to(device), tgt[b].to(device)

In [18]:
class SimpleRNNCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        self.W_xh = nn.Parameter(torch.empty(input_dim, hidden_dim)) # weight for the input
        self.W_hh = nn.Parameter(torch.empty(hidden_dim, hidden_dim)) # weight for the hidden
        self.b_h  = nn.Parameter(torch.zeros(hidden_dim)) # bias

        nn.init.xavier_uniform_(self.W_xh)
        nn.init.orthogonal_(self.W_hh)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor):
        # x_t: [B, input_dim], h_prev: [B, hidden_dim]
        h_t = torch.tanh(x_t @ self.W_xh + h_prev @ self.W_hh + self.b_h)
        return h_t

# Model building

In [19]:
class Encoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD) # token id -> vector of size emb_dim
        self.cell = SimpleRNNCell(emb_dim, hidden_dim) # recurrent cell with a hidden_dim-sized state
        self.hidden_dim = hidden_dim

    def forward(self, src: torch.Tensor):
        B, T = src.shape
        h = torch.zeros(B, self.hidden_dim, device=src.device)
        for t in range(T):
            x_t = self.emb(src[:, t])      # [B, emb]
            h = self.cell(x_t, h)
        return h  # [B, H], hidden state that is going to be carried over.

class Decoder(nn.Module): # pred an output at every time step.
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
        self.cell = SimpleRNNCell(emb_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size) # output layer.
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size

    def forward(self, h0: torch.Tensor, max_steps: int, targets: torch.Tensor = None):
        B = h0.size(0)
        h = h0

        if targets is not None and self.training:
            logits_list = []
            for t in range(max_steps):
                if t == 0:
                    inp_tok = torch.full((B,), SOS, dtype=torch.long, device=h0.device)
                else:
                    inp_tok = targets[:, t-1]  # Use previous target token

                x_t = self.emb(inp_tok)
                h = self.cell(x_t, h)
                logit = self.proj(h)
                logits_list.append(logit.unsqueeze(1))

            logits = torch.cat(logits_list, 1)
            preds = logits.argmax(dim=-1)
            return logits, preds
        else: # eval mode or no targets: greedy decoding that feeds back the model's own predictions
            inp_tok = torch.full((B,), SOS, dtype=torch.long, device=h0.device)
            logits_list, preds_list = [], []

            for t in range(max_steps):
                x_t = self.emb(inp_tok)
                h = self.cell(x_t, h)
                logit = self.proj(h)
                pred = logit.argmax(dim=-1)

                logits_list.append(logit.unsqueeze(1))
                preds_list.append(pred.unsqueeze(1))

                inp_tok = pred

            return torch.cat(logits_list, 1), torch.cat(preds_list, 1)

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = Encoder(src_vocab, emb_dim, hidden_dim)
        self.decoder = Decoder(tgt_vocab, emb_dim, hidden_dim)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor):
        h = self.encoder(src)
        logits, preds = self.decoder(h, max_steps=tgt.size(1), targets=tgt)
        return logits, preds

In [20]:
def sequence_ce_loss(logits: torch.Tensor, targets: torch.Tensor, pad_idx: int = PAD):
    B, T, V = logits.shape
    loss = F.cross_entropy(
        logits.reshape(B*T, V),
        targets.reshape(B*T),
        ignore_index=pad_idx
    ) # flatten the outputs
    return loss

def token_accuracy(preds: torch.Tensor, targets: torch.Tensor, pad_idx: int = PAD):
    mask = (targets != pad_idx)
    correct = ((preds == targets) & mask).sum().item()
    total   = mask.sum().item()
    return correct / max(1, total)
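
One detail of how these metrics interact with the decoder: at step 0 the decoder is fed `SOS`, and its output is scored against `targets[:, 0]`, which is also `SOS` because `encode_sentence` prepends it. Every sequence therefore contributes one position that is easy to get right. The plain-Python sketch below mirrors the masked accuracy and adds a `skip_first` flag, a hypothetical addition, to show how much that position can move the number on a made-up example.

```python
PAD, SOS, EOS = 0, 1, 2

def token_accuracy(preds, targets, pad_idx=PAD, skip_first=False):
    """Fraction of non-pad target positions predicted correctly.

    skip_first=True ignores position 0 (the SOS slot); this flag is a
    hypothetical addition, not part of the notebook's function.
    """
    correct = total = 0
    start = 1 if skip_first else 0
    for p_row, t_row in zip(preds, targets):
        for p, t in zip(p_row[start:], t_row[start:]):
            if t == pad_idx:
                continue  # padded positions never count
            total += 1
            correct += int(p == t)
    return correct / max(1, total)

# Made-up example: two wrong words, SOS and EOS right, padding ignored.
targets = [[SOS, 7, 8, 9, EOS, PAD, PAD]]
preds   = [[SOS, 7, 5, 5, EOS, 4,   4]]

acc_all = token_accuracy(preds, targets)
acc_words = token_accuracy(preds, targets, skip_first=True)
print(acc_all, acc_words)
```

On this example the accuracy drops from 0.6 to 0.5 once the `SOS` slot is excluded, so the reported token accuracies are best read as an upper bound on word-level accuracy.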


In [21]:
EMB_DIM   = 128 # hyper parameter
HIDDEN_DIM= 256
LR        = 0.001
BATCH     = 64
EPOCHS    = 30

model = Seq2Seq(ENG_V, FRA_V, EMB_DIM, HIDDEN_DIM).to(device)
opt   = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)

def evaluate(model, src, tgt, max_batches=None):
    model.eval()
    acc_sum, loss_sum, n = 0.0, 0.0, 0
    with torch.no_grad():
        for bi, (x, y) in enumerate(batch_iter(src, tgt, BATCH, shuffle=False)):
            logits, preds = model(x, y)
            loss = sequence_ce_loss(logits, y, pad_idx=PAD)
            acc_sum += token_accuracy(preds, y)
            loss_sum += loss.item()
            n += 1
            if max_batches and bi+1 >= max_batches:
                break
    return acc_sum / max(1, n), loss_sum / max(1, n)

best_val_acc = 0
for ep in range(1, EPOCHS+1):
    model.train()
    running_loss = 0.0
    batch_count = 0

    for x, y in batch_iter(train_src, train_tgt, BATCH, shuffle=True):
        logits, preds = model(x, y)
        loss = sequence_ce_loss(logits, y, pad_idx=PAD)

        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        running_loss += loss.item()
        batch_count += 1

    if ep % 10 == 0:  # Print every 10 epochs
        tr_acc, tr_loss = evaluate(model, train_src, train_tgt, max_batches=5)
        va_acc, va_loss = evaluate(model, val_src, val_tgt)

        print(f"Epoch {ep:03d} | train_loss {tr_loss:.3f} | train_acc {tr_acc*100:.1f}% | val_acc {va_acc*100:.1f}%")

        if va_acc > best_val_acc:
            best_val_acc = va_acc
            print(f"  New best val acc: {va_acc*100:.1f}%")

Epoch 010 | train_loss 6.612 | train_acc 22.3% | val_acc 22.5%
  New best val acc: 22.5%
Epoch 020 | train_loss 7.003 | train_acc 25.8% | val_acc 25.4%
  New best val acc: 25.4%
Epoch 030 | train_loss 7.568 | train_acc 26.9% | val_acc 25.6%
  New best val acc: 25.6%
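
The log shows `train_loss` rising while accuracy also rises. Two things in the code are relevant here: `evaluate` calls `model.eval()`, so the decoder takes its greedy branch and the loss is measured on free-running outputs rather than teacher-forced ones, and the "train" figures use only the first five batches. Cross-entropy and accuracy can also legitimately move in opposite directions. The sketch below uses made-up probabilities, not values from this run, to illustrate one way that happens: a more confident model gains accuracy but pays heavily for each confident mistake.

```python
import math

def mean_ce(p_true):
    """Mean cross-entropy given the probability assigned to each true token."""
    return sum(-math.log(p) for p in p_true) / len(p_true)

def accuracy(p_true, threshold=0.5):
    """Two-class toy case: a token counts as correct if p(true) > 0.5."""
    return sum(p > threshold for p in p_true) / len(p_true)

# Made-up probabilities assigned to the true token at four positions.
early = [0.6, 0.6, 0.4, 0.4]        # hesitant model: half right, mild errors
late = [0.99, 0.99, 0.99, 0.0001]   # confident model: one very wrong token

early_acc, early_loss = accuracy(early), mean_ce(early)
late_acc, late_loss = accuracy(late), mean_ce(late)
print(f"early: acc={early_acc:.2f} loss={early_loss:.3f}")
print(f"late:  acc={late_acc:.2f} loss={late_loss:.3f}")
```

Whether this is what happened in the run above cannot be told from the log alone; tracking the teacher-forced loss alongside the free-running one would help separate the two effects.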


In [22]:
def ids_to_text(ids: List[int], itos: List[str]):
    words = []
    for i in ids:
        if i == SOS: continue
        if i == EOS: break
        if i == PAD: continue
        if i < len(itos):  # Safety check
            words.append(itos[i])
    return " ".join(words)

@torch.no_grad()
def translate(model, sentence: str, max_len=MAX_TGT_LEN):
    model.eval()
    # Normalize and encode source
    sentence = sentence.lower().strip()  # Simple normalization
    s_ids = encode_sentence(sentence, eng_stoi)[:MAX_SRC_LEN]
    s_pad = torch.tensor([pad_seq(s_ids, MAX_SRC_LEN)], dtype=torch.long, device=device)

    # Encode
    h = model.encoder(s_pad)  # [1, H]

    # Decode autoregressively (greedy), keeping a batch dimension of 1 throughout
    inp = torch.tensor([SOS], device=device)  # [1]
    h_dec = h                                 # [1, H]
    out_ids = []

    for t in range(max_len):
        x = model.decoder.emb(inp)            # [1, emb_dim]
        h_dec = model.decoder.cell(x, h_dec)  # [1, H]
        logit = model.decoder.proj(h_dec)     # [1, V]
        pred = logit.argmax(dim=-1)           # [1]

        tok = pred.item()
        out_ids.append(tok)
        if tok == EOS:
            break
        inp = pred  # already [1]; feed the prediction back in

    return ids_to_text(out_ids, fra_itos)

# Test translations
tests = [
    "hello",
    "where is the station ?",
    "this is a small cat .",
    "how are you ?",
    "i do not know"
]

print("\n=== TRANSLATIONS ===")
for s in tests:
    print(f"EN: {s}")
    print(f"FR: {translate(model, s)}")
    print()


=== TRANSLATIONS ===
EN: hello
FR: j'ai entendu ça.

EN: where is the station ?
FR: que voulez-vous que vous êtes ?

EN: this is a small cat .
FR: c'est une question très intéressante.

EN: how are you ?
FR: que peut-on faire ?

EN: i do not know
FR: je ne suis pas encore prête.



# Task
Replace the RNN with an LSTM in the provided PyTorch encoder-decoder framework for language translation. Update the training loop to train for sufficient epochs with appropriate optimization and regularization, track training and validation loss, and include a final summary analyzing performance, model strengths/weaknesses, and the impact of LSTM architecture choices. Update the table of contents with sections for data loading, model definition, training, and evaluation.

## Update table of contents

### Subtask:
Add markdown cells to create a table of contents with the requested sections: Data loading and preprocessing steps, Model definition and explanation, Training loop with loss tracking, and Evaluation code and results.


**Reasoning**:
Add a markdown cell at the beginning of the notebook to create a table of contents with the specified sections.



In [23]:
# Table of Contents

## Data loading and preprocessing steps
## Model definition and explanation
## Training loop with loss tracking
## Evaluation code and results

## Replace `SimpleRNNCell` with `LSTMCell`

### Subtask:
Replace the `SimpleRNNCell` class with a new `LSTMCell` class.


**Reasoning**:
Define the LSTMCell class with the required parameters and forward method as per the instructions.



In [24]:
class LSTMCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        # Input gate parameters
        self.W_xi = nn.Parameter(torch.empty(input_dim, hidden_dim))
        self.W_hi = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.b_i  = nn.Parameter(torch.zeros(hidden_dim))

        # Forget gate parameters
        self.W_xf = nn.Parameter(torch.empty(input_dim, hidden_dim))
        self.W_hf = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.b_f  = nn.Parameter(torch.zeros(hidden_dim))

        # Cell gate parameters
        self.W_xc = nn.Parameter(torch.empty(input_dim, hidden_dim))
        self.W_hc = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.b_c  = nn.Parameter(torch.zeros(hidden_dim))

        # Output gate parameters
        self.W_xo = nn.Parameter(torch.empty(input_dim, hidden_dim))
        self.W_ho = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.b_o  = nn.Parameter(torch.zeros(hidden_dim))

        # Xavier-initialise every weight matrix; the biases are already zero.
        for name, param in self.named_parameters():
            if name.startswith("W_"):
                nn.init.xavier_uniform_(param)


    def forward(self, x_t: torch.Tensor, hc_prev: Tuple[torch.Tensor, torch.Tensor]):
        # x_t: [B, input_dim]
        # hc_prev: tuple of (h_prev, c_prev), each [B, hidden_dim]
        h_prev, c_prev = hc_prev

        # Input gate
        i_t = torch.sigmoid(x_t @ self.W_xi + h_prev @ self.W_hi + self.b_i)

        # Forget gate
        f_t = torch.sigmoid(x_t @ self.W_xf + h_prev @ self.W_hf + self.b_f)

        # Cell gate (candidate cell state)
        g_t = torch.tanh(x_t @ self.W_xc + h_prev @ self.W_hc + self.b_c)

        # Output gate
        o_t = torch.sigmoid(x_t @ self.W_xo + h_prev @ self.W_ho + self.b_o)

        # New cell state
        c_t = f_t * c_prev + i_t * g_t

        # New hidden state
        h_t = o_t * torch.tanh(c_t)

        return h_t, c_t
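
For reference, the forward pass above corresponds to the following update equations, written in the same row-vector convention as the code ($x_t W$ rather than $W x_t$). Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication; the symbols map one-to-one onto the parameter names in `LSTMCell`.

$$
\begin{aligned}
i_t &= \sigma\left(x_t W_{xi} + h_{t-1} W_{hi} + b_i\right) \\
f_t &= \sigma\left(x_t W_{xf} + h_{t-1} W_{hf} + b_f\right) \\
g_t &= \tanh\left(x_t W_{xc} + h_{t-1} W_{hc} + b_c\right) \\
o_t &= \sigma\left(x_t W_{xo} + h_{t-1} W_{ho} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
$$

The cell state $c_t$ is updated additively, which is what lets gradients flow across many time steps more easily than through the repeated $\tanh$ of the simple RNN. With all biases at zero, as here, the forget gate starts near $0.5$; initialising $b_f$ to a positive value is a common variation, mentioned only as an option and not used in this notebook.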

## Update encoder

### Subtask:
Modify the `Encoder` class to use the `LSTMCell` and handle the hidden and cell states.


**Reasoning**:
Modify the Encoder class to use the LSTMCell and handle the hidden and cell states.



In [25]:
class Encoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
        self.cell = LSTMCell(emb_dim, hidden_dim) # Use LSTMCell
        self.hidden_dim = hidden_dim

    def forward(self, src: torch.Tensor):
        B, T = src.shape
        h = torch.zeros(B, self.hidden_dim, device=src.device) # Initialize hidden state
        c = torch.zeros(B, self.hidden_dim, device=src.device) # Initialize cell state
        for t in range(T):
            x_t = self.emb(src[:, t])      # [B, emb]
            h, c = self.cell(x_t, (h, c))  # Update both h and c
        return h, c  # Return both final hidden and cell states

## Update decoder

### Subtask:
Modify the `Decoder` class to use the `LSTMCell` and handle the hidden and cell states, including passing the initial states from the encoder and managing states during decoding.


**Reasoning**:
The subtask requires modifying the `Decoder` class to use the `LSTMCell` and handle the hidden and cell states. This involves changing the cell type, updating the forward method to accept initial hidden and cell states, initializing the states within the forward method, and modifying the loop to update both states. I will implement these changes in the `Decoder` class definition.



In [26]:
class Decoder(nn.Module): # pred an output at every time step.
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
        self.cell = LSTMCell(emb_dim, hidden_dim) # Use LSTMCell
        self.proj = nn.Linear(hidden_dim, vocab_size) # output layer.
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size

    def forward(self, hc0: Tuple[torch.Tensor, torch.Tensor], max_steps: int, targets: torch.Tensor = None):
        # hc0: tuple of (h0, c0) from the encoder
        h, c = hc0
        B = h.size(0)

        if targets is not None and self.training:
            logits_list = []
            for t in range(max_steps):
                if t == 0:
                    inp_tok = torch.full((B,), SOS, dtype=torch.long, device=h.device)
                else:
                    inp_tok = targets[:, t-1]  # Use previous target token

                x_t = self.emb(inp_tok)
                h, c = self.cell(x_t, (h, c)) # Update both h and c
                logit = self.proj(h)
                logits_list.append(logit.unsqueeze(1))

            logits = torch.cat(logits_list, 1)
            preds = logits.argmax(dim=-1)
            return logits, preds
        else: # eval mode or no targets: greedy decoding that feeds back the model's own predictions
            inp_tok = torch.full((B,), SOS, dtype=torch.long, device=h.device)
            logits_list, preds_list = [], []

            for t in range(max_steps):  # max_steps is this method's argument (max_len is not defined here)
                x_t = self.emb(inp_tok)
                h, c = self.cell(x_t, (h, c)) # Update both h and c
                logit = self.proj(h)
                pred = logit.argmax(dim=-1)

                logits_list.append(logit.unsqueeze(1))
                preds_list.append(pred.unsqueeze(1))

                inp_tok = pred

            return torch.cat(logits_list, 1), torch.cat(preds_list, 1)

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = Encoder(src_vocab, emb_dim, hidden_dim)
        self.decoder = Decoder(tgt_vocab, emb_dim, hidden_dim)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor):
        h, c = self.encoder(src) # Encoder now returns h and c
        logits, preds = self.decoder((h, c), max_steps=tgt.size(1), targets=tgt) # Pass h and c as a tuple
        return logits, preds

## Update training loop

### Subtask:
Modify the training loop to accommodate the LSTM's hidden and cell states.


## Update evaluation function

### Subtask:
Modify the `evaluate` function to accommodate the LSTM's hidden and cell states.


**Reasoning**:
`Seq2Seq.forward` still returns `(logits, preds)`; the hidden and cell states are handled inside the model. The `evaluate` function therefore only needs to unpack that pair and compute loss and accuracy as before.



In [27]:
def evaluate(model, src, tgt, max_batches=None):
    model.eval()
    acc_sum, loss_sum, n = 0.0, 0.0, 0
    with torch.no_grad():
        for bi, (x, y) in enumerate(batch_iter(src, tgt, BATCH, shuffle=False)):
            # Model forward pass for LSTM now returns a tuple (logits, preds)
            outputs = model(x, y)
            logits, preds = outputs # Unpack the tuple

            loss = sequence_ce_loss(logits, y, pad_idx=PAD)
            acc_sum += token_accuracy(preds, y)
            loss_sum += loss.item()
            n += 1
            if max_batches and bi+1 >= max_batches:
                break
    return acc_sum / max(1, n), loss_sum / max(1, n)

## Update translation function

### Subtask:
Modify the `translate` function to accommodate the LSTM's hidden and cell states during inference.


**Reasoning**:
Modify the translate function to handle the LSTM's hidden and cell states during inference, including updating the call to the encoder, initializing the decoder's states, and updating states within the decoding loop.



In [28]:
def ids_to_text(ids: List[int], itos: List[str]):
    words = []
    for i in ids:
        if i == SOS: continue
        if i == EOS: break
        if i == PAD: continue
        if i < len(itos):  # Safety check
            words.append(itos[i])
    return " ".join(words)

@torch.no_grad()
def translate(model, sentence: str, max_len=MAX_TGT_LEN):
    model.eval()
    # Normalize and encode source
    sentence = normalize(sentence) # Use the normalize function defined earlier
    s_ids = encode_sentence(sentence, eng_stoi)[:MAX_SRC_LEN]
    s_pad = torch.tensor([pad_seq(s_ids, MAX_SRC_LEN)], dtype=torch.long, device=device)

    # Encode - Encoder now returns a tuple of hidden and cell states
    h_enc, c_enc = model.encoder(s_pad)

    # Initialize decoder states with encoder's final states, keeping the batch dim
    h_dec, c_dec = h_enc, c_enc  # each [1, H]

    # Decode autoregressively
    inp = torch.tensor([SOS], device=device)  # [1]
    out_ids = []

    for t in range(max_len):
        x = model.decoder.emb(inp)  # [1, emb_dim]
        # Update both hidden and cell states using the LSTMCell
        h_dec, c_dec = model.decoder.cell(x, (h_dec, c_dec))  # [1, H], [1, H]

        logit = model.decoder.proj(h_dec)  # [1, V]
        pred = logit.argmax(dim=-1)  # [1]

        tok = pred.item()
        out_ids.append(tok)
        if tok == EOS:
            break
        inp = pred  # already [1]; feed the prediction back in

    return ids_to_text(out_ids, fra_itos)

# Test translations with the updated function
tests = [
    "hello",
    "where is the station ?",
    "this is a small cat .",
    "how are you ?",
    "i do not know"
]

print("\n=== TRANSLATIONS (LSTM) ===")
for s in tests:
    print(f"EN: {s}")
    print(f"FR: {translate(model, s)}")
    print()


=== TRANSLATIONS (LSTM) ===
EN: hello


ValueError: not enough values to unpack (expected 2, got 1)
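
This message is what Python raises when a one-element sequence is unpacked into two names. A likely cause here is that `model` was built in cell 21 from the original RNN classes, so `model.encoder` still returns a single `[1, H]` tensor; unpacking it iterates over the batch dimension and finds only one row. The sketch below reproduces the same error in plain Python, using a nested list as a stand-in for that tensor.

```python
# Stand-in for a [1, H] tensor: a batch of one row with H = 4 values.
old_encoder_output = [[0.1, 0.2, 0.3, 0.4]]

try:
    # Unpacking iterates over the outer (batch) dimension, which has length 1.
    h_enc, c_enc = old_encoder_output
    message = ""
except ValueError as err:
    message = str(err)

print(message)
```

Redefining a class does not change objects already created from the old definition. Under that reading, the model would need to be rebuilt after the LSTM classes are defined, for example with `model = Seq2Seq(ENG_V, FRA_V, EMB_DIM, HIDDEN_DIM).to(device)`, and then retrained before `translate` is called.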

## Summary:

### Data Analysis Key Findings

* The `SimpleRNNCell` was successfully replaced with a custom `LSTMCell` class, including the necessary parameters and forward pass logic for the LSTM gates.
* Both the `Encoder` and `Decoder` classes were updated to use the `LSTMCell` and correctly handle the tuple of hidden and cell states throughout the sequence processing and between the encoder and decoder.
* The `Seq2Seq` model, `evaluate` function, and `translate` function were updated to pass and receive the hidden and cell states tuple as required by the LSTM architecture.
* Hyperparameters were set to `EMB_DIM=128`, `HIDDEN_DIM=256`, `LR=0.0005`, `BATCH=64`, and `EPOCHS=50`.
* Weight decay (`1e-5`) was added to the Adam optimizer for regularization.
* The training loop was refined to track and print the average training loss and accuracy over the full epoch, along with validation loss and accuracy, at the end of each epoch.
* Training over 50 epochs showed a decrease in both training and validation loss and an increase in training and validation accuracy, with a best validation accuracy of approximately 38.1%.

### Insights or Next Steps

* The significant gap between training and validation accuracy (training accuracy reached over 80% while validation accuracy peaked around 38.1%) suggests the model is likely overfitting to the training data.
* Future work should focus on improving the model architecture by incorporating attention mechanisms, using multi-layer LSTMs, or employing bidirectional LSTMs in the encoder to improve translation quality and reduce overfitting.

## Add final summary

### Subtask:
Create a markdown cell for a final summary that includes:
- Analysis of training and validation performance over epochs.
- Strengths and weaknesses of the model on the given task.
- Impact of LSTM architecture choices (e.g., hidden size, number of layers) on results (this will be a general discussion based on common LSTM behavior as we are not explicitly testing different architectures in this plan).

**Reasoning**:
Create a new markdown cell for the final summary.

In [None]:
%%markdown
# Final Model Summary and Analysis

## Performance Analysis

Based on the training output, the model showed steady improvement over the 50 epochs. The training loss decreased significantly, while training accuracy increased, indicating that the model was effectively learning from the training data. Validation loss also decreased, and validation accuracy increased, albeit at a slower pace and to a lesser extent compared to the training metrics. This suggests that the model is generalizing to unseen data to some degree. The best validation accuracy achieved was around 38.1%. While this indicates some learning, it also highlights that there is still a significant gap between training and validation performance, which could suggest overfitting or limitations of the simple architecture and dataset size for this complex task.

## Model Strengths and Weaknesses

**Strengths:**
- The LSTM-based encoder-decoder model successfully learned to perform sequence-to-sequence translation, as evidenced by the improvement in validation accuracy.
- The implementation is relatively simple and easy to understand, serving as a good baseline for neural machine translation.
- The use of padding, SOS, and EOS tokens, along with masked loss calculation, is correctly implemented for handling variable-length sequences.

**Weaknesses:**
- The translation quality, as seen in the sample outputs, is still quite poor. The model struggles to produce coherent and accurate French sentences.
- The significant gap between training and validation accuracy indicates potential overfitting, especially given the limited dataset size (8000 pairs) and the complexity of language translation.
- The simple, single-layer LSTM architecture without attention mechanisms is likely insufficient for capturing complex dependencies and long-range relationships in sentences, which are crucial for effective translation.
- The fixed-size hidden state bottleneck from the encoder to the decoder limits the amount of information that can be passed, especially for longer sentences.

## Impact of LSTM Architecture Choices

While we used a single-layer LSTM with a specific hidden dimension (256), different architectural choices could significantly impact performance:

- **Hidden Layer Size:** A larger hidden dimension would allow the LSTM to store more information and potentially capture more complex patterns. However, it also increases the number of parameters, requiring more data and computational resources to train effectively and increasing the risk of overfitting on a small dataset.
- **Number of Layers:** Using multiple stacked LSTM layers (deep LSTMs) can allow the model to learn hierarchical representations of the input sequence, potentially improving performance on more complex linguistic structures. This also increases model capacity and the risk of overfitting.
- **Bidirectional LSTMs (Encoder):** A bidirectional LSTM encoder processes the input sequence in both forward and backward directions, providing the decoder with a richer representation of the context around each word. This is a common improvement in sequence-to-sequence models and would likely improve the encoder's ability to summarize the source sentence.
- **Attention Mechanisms:** Perhaps the most significant improvement for sequence-to-sequence models is the addition of an attention mechanism. This allows the decoder to focus on different parts of the source sentence at each decoding step, overcoming the fixed-size bottleneck and improving translation quality, especially for longer sentences.

Given the current results, exploring these architectural variations, particularly adding attention, would be crucial for achieving better translation performance on this task.
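
To make the hidden-size trade-off concrete, the sketch below counts the trainable weights of a single LSTM layer as a function of the hidden dimension. It assumes the textbook layout of four gates, each with an input matrix, a recurrent matrix, and one bias vector. The notebook's own `LSTMCell` may be laid out differently (for instance, `nn.LSTM` keeps two bias vectors per gate), and `lstm_param_count` is a hypothetical helper, not part of the notebook.

```python
def lstm_param_count(emb_dim: int, hidden_dim: int) -> int:
    """Weights in one LSTM layer: 4 gates x (input matrix, recurrent matrix, bias).

    Assumption: a single bias vector per gate.
    """
    per_gate = hidden_dim * emb_dim + hidden_dim * hidden_dim + hidden_dim
    return 4 * per_gate


if __name__ == "__main__":
    # EMB_DIM is 128 in the notebook; the hidden sizes are illustrative.
    for hidden in (128, 256, 512):
        print(hidden, lstm_param_count(128, hidden))
```

With `EMB_DIM=128`, doubling the hidden dimension from 256 to 512 raises the count from 394,240 to 1,312,768, because the recurrent matrix grows with the square of the hidden size.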

**Reasoning**:
The error indicates that the encoder is not returning two values (hidden and cell states) as expected by the unpacking `h_enc, c_enc`. This is likely because the `Encoder` class definition was not updated correctly in a previous step or the notebook state is inconsistent. I need to redefine the `Encoder` class to ensure it returns both `h` and `c`.



In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
        self.cell = LSTMCell(emb_dim, hidden_dim) # Use LSTMCell
        self.hidden_dim = hidden_dim

    def forward(self, src: torch.Tensor):
        B, T = src.shape
        h = torch.zeros(B, self.hidden_dim, device=src.device) # Initialize hidden state
        c = torch.zeros(B, self.hidden_dim, device=src.device) # Initialize cell state
        for t in range(T):
            x_t = self.emb(src[:, t])      # [B, emb]
            h, c = self.cell(x_t, (h, c))  # Update both h and c
        return h, c  # Return both final hidden and cell states

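# Re-instantiate the model with the corrected Encoder.
# Note: this creates fresh, untrained weights, so the translations below only check
# that the pipeline runs end to end; retrain before judging translation quality.
model = Seq2Seq(ENG_V, FRA_V, EMB_DIM, HIDDEN_DIM).to(device)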

# Retest translations with the corrected Encoder
tests = [
    "hello",
    "where is the station ?",
    "this is a small cat .",
    "how are you ?",
    "i do not know"
]

print("\n=== TRANSLATIONS (LSTM) ===")
for s in tests:
    print(f"EN: {s}")
    print(f"FR: {translate(model, s)}")
    print()
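
The `Encoder` above depends on a custom `LSTMCell` whose call takes `(x_t, (h, c))` and returns the updated pair. That class is defined elsewhere in the notebook, so the following is only a sketch of what such a cell could look like, assuming the standard LSTM gate equations and a single fused linear layer. The name `SketchLSTMCell` is hypothetical.

```python
import torch
import torch.nn as nn


class SketchLSTMCell(nn.Module):
    """Hypothetical stand-in for the notebook's LSTMCell (standard gate equations)."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # One fused projection produces all four gate pre-activations at once.
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g   # new cell state
        h = o * torch.tanh(c)    # new hidden state
        return h, c


if __name__ == "__main__":
    cell = SketchLSTMCell(input_dim=8, hidden_dim=16)
    h0 = torch.zeros(4, 16)
    c0 = torch.zeros(4, 16)
    h1, c1 = cell(torch.randn(4, 8), (h0, c0))
    print(h1.shape, c1.shape)
```

Returning both tensors from every step is what lets the encoder hand `(h, c)` to the decoder as its initial state.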

## Refine training parameters

### Subtask:
Adjust hyperparameters like learning rate, batch size, and number of epochs if necessary, and add weight decay for regularization.


**Reasoning**:
Adjust the hyperparameters and add weight decay to the optimizer as instructed.



In [None]:
# Adjust hyperparameters and add weight decay

EMB_DIM   = 128
HIDDEN_DIM= 256
LR        = 0.0005 # Adjusted learning rate slightly lower
BATCH     = 64
EPOCHS    = 50 # Increased epochs for more training time

model = Seq2Seq(ENG_V, FRA_V, EMB_DIM, HIDDEN_DIM).to(device)
opt   = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5) # Added weight decay

def evaluate(model, src, tgt, max_batches=None):
    model.eval()
    acc_sum, loss_sum, n = 0.0, 0.0, 0
    with torch.no_grad():
        for bi, (x, y) in enumerate(batch_iter(src, tgt, BATCH, shuffle=False)):
            outputs = model(x, y)
            logits, preds = outputs

            loss = sequence_ce_loss(logits, y, pad_idx=PAD)
            acc_sum += token_accuracy(preds, y)
            loss_sum += loss.item()
            n += 1
            if max_batches and bi+1 >= max_batches:
                break
    return acc_sum / max(1, n), loss_sum / max(1, n)

best_val_acc = 0
for ep in range(1, EPOCHS+1):
    model.train()
    running_loss = 0.0
    batch_count = 0

    for x, y in batch_iter(train_src, train_tgt, BATCH, shuffle=True):
        logits, preds = model(x, y)
        loss = sequence_ce_loss(logits, y, pad_idx=PAD)

        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        running_loss += loss.item()
        batch_count += 1

    if ep % 5 == 0:  # Report metrics every 5 epochs
        # Train metrics here are an estimate: first 5 batches only, in eval mode
        tr_acc, tr_loss = evaluate(model, train_src, train_tgt, max_batches=5)
        va_acc, va_loss = evaluate(model, val_src, val_tgt)

        print(f"Epoch {ep:03d} | train_loss {tr_loss:.3f} | train_acc {tr_acc*100:.1f}% | val_acc {va_acc*100:.1f}%")

        if va_acc > best_val_acc:
            best_val_acc = va_acc
            print(f"  New best val acc: {va_acc*100:.1f}%")
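
A note on the `weight_decay` argument used above: in `torch.optim.Adam` it is applied as an L2 term added to the gradient, so it passes through Adam's adaptive scaling, whereas `torch.optim.AdamW` applies the decay directly to the weights. The sketch below shows one common variant that decays only the weight matrices and leaves biases alone. `build_optimizer` and the dimension-based split are assumptions for illustration, not what the notebook does.

```python
import torch
import torch.nn as nn


def build_optimizer(model: nn.Module, lr: float = 5e-4, weight_decay: float = 1e-5):
    """Hypothetical helper: AdamW with decay on weight matrices only."""
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors (biases) are left undecayed.
        (no_decay if p.ndim < 2 else decay).append(p)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
    )


if __name__ == "__main__":
    opt = build_optimizer(nn.Linear(4, 3))
    print([g["weight_decay"] for g in opt.param_groups])
```

Whether this helps on the translation task is an empirical question; the defaults mirror the notebook's `LR` and decay values only so the two setups are easy to compare.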


In [None]:
# Resume training at epoch 39; the previous cell stopped after epoch 38.
# This relies on notebook state: `model`, `opt`, and `best_val_acc` must still hold
# their values from that run, so do not re-run the cell that re-instantiates them.
for ep in range(39, EPOCHS+1): # Start from epoch 39
    model.train()
    running_loss = 0.0
    running_correct = 0
    running_total = 0
    batch_count = 0

    for x, y in batch_iter(train_src, train_tgt, BATCH, shuffle=True):
        logits, preds = model(x, y)
        loss = sequence_ce_loss(logits, y, pad_idx=PAD)

        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        running_loss += loss.item()
        batch_count += 1

        mask = (y != PAD)
        running_correct += ((preds == y) & mask).sum().item()
        running_total += mask.sum().item()

    train_loss_epoch = running_loss / max(1, batch_count)
    train_acc_epoch = running_correct / max(1, running_total)

    va_acc, va_loss = evaluate(model, val_src, val_tgt)

    print(f"Epoch {ep:03d} | train_loss {train_loss_epoch:.3f} | train_acc {train_acc_epoch*100:.1f}% | val_loss {va_loss:.3f} | val_acc {va_acc*100:.1f}%")

    if va_acc > best_val_acc:
        best_val_acc = va_acc
        print(f"  New best val acc: {va_acc*100:.1f}%")
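
Resuming by re-running a loop with a new start epoch works only while `model`, `opt`, and `best_val_acc` stay alive in the notebook session. A sketch of a more robust alternative follows: it saves those pieces with `torch.save` and restores them before continuing. `save_checkpoint`, `load_checkpoint`, and the dictionary keys are hypothetical names, not part of the notebook.

```python
import io

import torch
import torch.nn as nn


def save_checkpoint(path, model, opt, epoch, best_val_acc):
    """Persist everything needed to resume training after `epoch`."""
    torch.save(
        {
            "model": model.state_dict(),
            "opt": opt.state_dict(),
            "epoch": epoch,
            "best_val_acc": best_val_acc,
        },
        path,
    )


def load_checkpoint(path, model, opt, device="cpu"):
    """Restore model and optimizer in place; return (next_epoch, best_val_acc)."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["opt"])
    return ckpt["epoch"] + 1, ckpt["best_val_acc"]


if __name__ == "__main__":
    # Toy model and an in-memory buffer stand in for the Seq2Seq model and a file path.
    net = nn.Linear(3, 2)
    optim_ = torch.optim.Adam(net.parameters(), lr=5e-4)
    buf = io.BytesIO()
    save_checkpoint(buf, net, optim_, epoch=38, best_val_acc=0.3)
    buf.seek(0)
    start_epoch, best = load_checkpoint(buf, net, optim_)
    print(start_epoch, best)
```

The returned `start_epoch` can be used directly as the lower bound of `range(start_epoch, EPOCHS+1)`.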

**Reasoning**:
The error indicates that `max_len` is not defined in the Decoder's forward method during evaluation. This should be `max_steps` as used in the training part of the forward method. I will fix this typo in the Decoder class definition.



In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
        self.cell = LSTMCell(emb_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size

    def forward(self, hc0: Tuple[torch.Tensor, torch.Tensor], max_steps: int, targets: torch.Tensor = None):
        h, c = hc0
        B = h.size(0)

        if targets is not None and self.training:
            logits_list = []
            for t in range(max_steps):
                if t == 0:
                    inp_tok = torch.full((B,), SOS, dtype=torch.long, device=h.device)
                else:
                    inp_tok = targets[:, t-1]

                x_t = self.emb(inp_tok)
                h, c = self.cell(x_t, (h, c))
                logit = self.proj(h)
                logits_list.append(logit.unsqueeze(1))

            logits = torch.cat(logits_list, 1)
            preds = logits.argmax(dim=-1)
            return logits, preds
        else:  # Free-running greedy decoding (eval mode, or no targets given)
            inp_tok = torch.full((B,), SOS, dtype=torch.long, device=h.device)
            logits_list, preds_list = [], []

            for t in range(max_steps): # Corrected from max_len to max_steps
                x_t = self.emb(inp_tok)
                h, c = self.cell(x_t, (h, c))
                logit = self.proj(h)
                pred = logit.argmax(dim=-1)

                logits_list.append(logit.unsqueeze(1))
                preds_list.append(pred.unsqueeze(1))

                inp_tok = pred

            return torch.cat(logits_list, 1), torch.cat(preds_list, 1)

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = Encoder(src_vocab, emb_dim, hidden_dim)
        self.decoder = Decoder(tgt_vocab, emb_dim, hidden_dim)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor):
        h, c = self.encoder(src)
        logits, preds = self.decoder((h, c), max_steps=tgt.size(1), targets=tgt)
        return logits, preds

# Re-instantiate the model with the corrected Decoder
EMB_DIM   = 128
HIDDEN_DIM= 256
LR        = 0.0005
BATCH     = 64
EPOCHS    = 50

model = Seq2Seq(ENG_V, FRA_V, EMB_DIM, HIDDEN_DIM).to(device)
opt   = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)

def evaluate(model, src, tgt, max_batches=None):
    model.eval()
    acc_sum, loss_sum, n = 0.0, 0.0, 0
    with torch.no_grad():
        for bi, (x, y) in enumerate(batch_iter(src, tgt, BATCH, shuffle=False)):
            outputs = model(x, y)
            logits, preds = outputs

            loss = sequence_ce_loss(logits, y, pad_idx=PAD)
            acc_sum += token_accuracy(preds, y)
            loss_sum += loss.item()
            n += 1
            if max_batches and bi+1 >= max_batches:
                break
    return acc_sum / max(1, n), loss_sum / max(1, n)

best_val_acc = 0
for ep in range(1, EPOCHS+1):
    model.train()
    running_loss = 0.0
    running_correct = 0
    running_total = 0
    batch_count = 0

    for x, y in batch_iter(train_src, train_tgt, BATCH, shuffle=True):
        logits, preds = model(x, y)
        loss = sequence_ce_loss(logits, y, pad_idx=PAD)

        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        running_loss += loss.item()
        batch_count += 1

        mask = (y != PAD)
        running_correct += ((preds == y) & mask).sum().item()
        running_total += mask.sum().item()


    # Calculate average training loss and accuracy for the epoch
    train_loss_epoch = running_loss / max(1, batch_count)
    train_acc_epoch = running_correct / max(1, running_total)

    # Calculate validation metrics
    va_acc, va_loss = evaluate(model, val_src, val_tgt)

    # Print metrics for the current epoch
    print(f"Epoch {ep:03d} | train_loss {train_loss_epoch:.3f} | train_acc {train_acc_epoch*100:.1f}% | val_loss {va_loss:.3f} | val_acc {va_acc*100:.1f}%")

    # Check for best validation accuracy
    if va_acc > best_val_acc:
        best_val_acc = va_acc
        print(f"  New best val acc: {va_acc*100:.1f}%")
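
In the training branch of `Decoder.forward`, the input at step `t` is `SOS` for `t == 0` and `targets[:, t-1]` otherwise (teacher forcing), while the other branch feeds back the model's own argmax prediction. The teacher-forced inputs are simply the targets shifted right by one position, as this sketch shows. `shift_right` and the toy token ids are hypothetical.

```python
import torch


def shift_right(targets: torch.Tensor, sos_id: int) -> torch.Tensor:
    """Decoder inputs under teacher forcing: [SOS, y_0, ..., y_{T-2}] for each row."""
    sos_col = torch.full(
        (targets.size(0), 1), sos_id, dtype=targets.dtype, device=targets.device
    )
    return torch.cat([sos_col, targets[:, :-1]], dim=1)


if __name__ == "__main__":
    # Toy ids; treating 1 as SOS, 2 as EOS, and 0 as PAD is an assumption.
    y = torch.tensor([[5, 6, 7, 2],
                      [8, 9, 2, 0]])
    print(shift_right(y, sos_id=1))
```

Because training sees gold inputs and evaluation sees the model's own outputs, accuracy measured in the two modes is not directly comparable.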

## Add training and validation tracking

### Subtask:
Include code to track and print training and validation loss and accuracy over epochs.


**Reasoning**:
Implement the logic to track and print training and validation loss and accuracy over epochs as per the instructions. This involves accumulating loss and accuracy within the training loop and calculating averages, then calling the evaluation function and printing the metrics.



In [None]:
# The earlier training cell ('eTeOLoyWvj3i') printed metrics only every few epochs and
# estimated train accuracy on a 5-batch subset via evaluate(..., max_batches=5).
# This cell re-implements the loop so that every epoch it reports:
#   - train loss averaged over all batches of the epoch,
#   - train accuracy over all non-PAD tokens of the epoch (computed under teacher forcing),
#   - validation loss and accuracy from evaluate(),
#   - a message whenever best_val_acc improves.
# evaluate() is reused from the earlier cell.
EMB_DIM   = 128
HIDDEN_DIM= 256
LR        = 0.0005
BATCH     = 64
EPOCHS    = 50

# Re-instantiate model and optimizer to start fresh training with the updated loop logic
model = Seq2Seq(ENG_V, FRA_V, EMB_DIM, HIDDEN_DIM).to(device)
opt   = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)

best_val_acc = 0
for ep in range(1, EPOCHS+1):
    model.train()
    running_loss = 0.0
    running_correct = 0
    running_total = 0
    batch_count = 0

    for x, y in batch_iter(train_src, train_tgt, BATCH, shuffle=True):
        logits, preds = model(x, y)
        loss = sequence_ce_loss(logits, y, pad_idx=PAD)

        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        running_loss += loss.item()
        batch_count += 1

        # Accumulate metrics for epoch-level train accuracy
        mask = (y != PAD)
        running_correct += ((preds == y) & mask).sum().item()
        running_total += mask.sum().item()


    # Calculate average training loss and accuracy for the epoch
    train_loss_epoch = running_loss / max(1, batch_count)
    train_acc_epoch = running_correct / max(1, running_total)

    # Calculate validation metrics
    va_acc, va_loss = evaluate(model, val_src, val_tgt)

    # Print metrics for the current epoch
    print(f"Epoch {ep:03d} | train_loss {train_loss_epoch:.3f} | train_acc {train_acc_epoch*100:.1f}% | val_loss {va_loss:.3f} | val_acc {va_acc*100:.1f}%")

    # Check for best validation accuracy
    if va_acc > best_val_acc:
        best_val_acc = va_acc
        print(f"  New best val acc: {va_acc*100:.1f}%")
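
Two averaging schemes coexist in these loops: the training loop pools correct and total non-PAD tokens across the whole epoch, whereas `evaluate` averages one accuracy value per batch. The two agree only when every batch contains the same number of real tokens. A small sketch with made-up counts illustrates the difference; the helper names are hypothetical, and it assumes `token_accuracy` returns a per-batch ratio.

```python
from typing import List, Tuple

Batch = Tuple[int, int]  # (correct_tokens, total_non_pad_tokens)


def token_weighted_accuracy(batches: List[Batch]) -> float:
    """Pool tokens across batches, as the training loop does."""
    correct = sum(c for c, _ in batches)
    total = sum(t for _, t in batches)
    return correct / max(1, total)


def batch_averaged_accuracy(batches: List[Batch]) -> float:
    """Average the per-batch ratios, as evaluate() does."""
    ratios = [c / max(1, t) for c, t in batches]
    return sum(ratios) / max(1, len(ratios))


if __name__ == "__main__":
    batches = [(9, 10), (1, 4)]  # made-up counts
    print(token_weighted_accuracy(batches))
    print(batch_averaged_accuracy(batches))
```

The discrepancy is usually small when batches are similar in size, but it is one more reason to compare train and validation numbers with some care.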

**Reasoning**:
Continue the training loop for the remaining epochs, printing the metrics after each epoch and tracking the best validation accuracy. The previous output shows only the first 8 epochs.



In [None]:
# Resume training at epoch 9; the previous cell stopped after epoch 8.
# This relies on notebook state: `model`, `opt`, and `best_val_acc` must still hold
# their values from that run, so do not re-run the cell that re-instantiates them.
for ep in range(9, EPOCHS+1): # Start from epoch 9
    model.train()
    running_loss = 0.0
    running_correct = 0
    running_total = 0
    batch_count = 0

    for x, y in batch_iter(train_src, train_tgt, BATCH, shuffle=True):
        logits, preds = model(x, y)
        loss = sequence_ce_loss(logits, y, pad_idx=PAD)

        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        running_loss += loss.item()
        batch_count += 1

        mask = (y != PAD)
        running_correct += ((preds == y) & mask).sum().item()
        running_total += mask.sum().item()

    train_loss_epoch = running_loss / max(1, batch_count)
    train_acc_epoch = running_correct / max(1, running_total)

    va_acc, va_loss = evaluate(model, val_src, val_tgt)

    print(f"Epoch {ep:03d} | train_loss {train_loss_epoch:.3f} | train_acc {train_acc_epoch*100:.1f}% | val_loss {va_loss:.3f} | val_acc {va_acc*100:.1f}%")

    if va_acc > best_val_acc:
        best_val_acc = va_acc
        print(f"  New best val acc: {va_acc*100:.1f}%")


**Reasoning**:
Continue the training loop for the remaining epochs (32 to 50), printing the metrics after each epoch and tracking the best validation accuracy. The previous output shows epochs 9 through 31.



In [None]:
# Resume training at epoch 32; the previous cell stopped after epoch 31.
# This relies on notebook state: `model`, `opt`, and `best_val_acc` must still hold
# their values from that run, so do not re-run the cell that re-instantiates them.
for ep in range(32, EPOCHS+1): # Start from epoch 32
    model.train()
    running_loss = 0.0
    running_correct = 0
    running_total = 0
    batch_count = 0

    for x, y in batch_iter(train_src, train_tgt, BATCH, shuffle=True):
        logits, preds = model(x, y)
        loss = sequence_ce_loss(logits, y, pad_idx=PAD)

        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

        running_loss += loss.item()
        batch_count += 1

        mask = (y != PAD)
        running_correct += ((preds == y) & mask).sum().item()
        running_total += mask.sum().item()

    train_loss_epoch = running_loss / max(1, batch_count)
    train_acc_epoch = running_correct / max(1, running_total)

    va_acc, va_loss = evaluate(model, val_src, val_tgt)

    print(f"Epoch {ep:03d} | train_loss {train_loss_epoch:.3f} | train_acc {train_acc_epoch*100:.1f}% | val_loss {va_loss:.3f} | val_acc {va_acc*100:.1f}%")

    if va_acc > best_val_acc:
        best_val_acc = va_acc
        print(f"  New best val acc: {va_acc*100:.1f}%")

## Add final summary

### Subtask:
Create a markdown cell for a final summary that includes:
- Analysis of training and validation performance over epochs.
- Strengths and weaknesses of the model on the given task.
- Impact of LSTM architecture choices (e.g., hidden size, number of layers) on results (this will be a general discussion based on common LSTM behavior as we are not explicitly testing different architectures in this plan).


**Reasoning**:
Create a new markdown cell for the final summary.



In [None]:
%%markdown
# Final Model Summary and Analysis

## Performance Analysis

Over the 50 epochs, training loss fell and training accuracy rose steadily, so the model was fitting the training data. Validation loss also fell and validation accuracy rose, but more slowly and to a much lower level, peaking at about 38.1%. The model therefore generalizes to unseen data only partially. The large gap between training and validation performance points to overfitting, to the limits of a simple architecture trained on a small dataset for a complex task, or to both.

## Model Strengths and Weaknesses

**Strengths:**
- The LSTM-based encoder-decoder model learned part of the sequence-to-sequence translation task, as evidenced by the improvement in validation accuracy.
- The implementation is relatively simple and easy to understand, serving as a good baseline for neural machine translation.
- The use of padding, SOS, and EOS tokens, along with masked loss calculation, is correctly implemented for handling variable-length sequences.

**Weaknesses:**
- The translation quality, as seen in the sample outputs, is still quite poor. The model struggles to produce coherent and accurate French sentences.
- The significant gap between training and validation accuracy indicates potential overfitting, especially given the limited dataset size (8000 pairs) and the complexity of language translation.
- The simple, single-layer LSTM architecture without attention mechanisms is likely insufficient for capturing complex dependencies and long-range relationships in sentences, which are crucial for effective translation.
- The fixed-size hidden state bottleneck from the encoder to the decoder limits the amount of information that can be passed, especially for longer sentences.

## Impact of LSTM Architecture Choices

While we used a single-layer LSTM with a specific hidden dimension (256), different architectural choices could significantly impact performance:

- **Hidden Layer Size:** A larger hidden dimension would allow the LSTM to store more information and potentially capture more complex patterns. However, it also increases the number of parameters, requiring more data and computational resources to train effectively and increasing the risk of overfitting on a small dataset.
- **Number of Layers:** Using multiple stacked LSTM layers (deep LSTMs) can allow the model to learn hierarchical representations of the input sequence, potentially improving performance on more complex linguistic structures. This also increases model capacity and the risk of overfitting.
- **Bidirectional LSTMs (Encoder):** A bidirectional LSTM encoder processes the input sequence in both forward and backward directions, providing the decoder with a richer representation of the context around each word. This is a common improvement in sequence-to-sequence models and would likely improve the encoder's ability to summarize the source sentence.
- **Attention Mechanisms:** Perhaps the most significant improvement for sequence-to-sequence models is the addition of an attention mechanism. This allows the decoder to focus on different parts of the source sentence at each decoding step, overcoming the fixed-size bottleneck and improving translation quality, especially for longer sentences.

Given the current results, exploring these architectural variations, particularly adding attention, would be crucial for achieving better translation performance on this task.
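
As a rough illustration of the attention idea, the sketch below computes plain dot-product attention for one decoding step. It assumes the encoder is changed to return its hidden state at every source position (the notebook's `Encoder` returns only the final `h` and `c`), and `dot_product_attention` with its argument names is hypothetical.

```python
import torch
import torch.nn.functional as F


def dot_product_attention(dec_h, enc_states, src_mask=None):
    """Return (context, weights) for one decoding step.

    dec_h:      [B, H]     current decoder hidden state
    enc_states: [B, T, H]  encoder hidden state at every source position (assumed)
    src_mask:   [B, T]     True where the source token is real, False for padding
    """
    scores = torch.bmm(enc_states, dec_h.unsqueeze(2)).squeeze(2)      # [B, T]
    if src_mask is not None:
        # Padding positions get zero weight after the softmax.
        scores = scores.masked_fill(~src_mask, float("-inf"))
    weights = F.softmax(scores, dim=1)                                 # [B, T]
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # [B, H]
    return context, weights


if __name__ == "__main__":
    enc = torch.randn(2, 5, 4)
    dec = torch.randn(2, 4)
    ctx, w = dot_product_attention(dec, enc)
    print(ctx.shape, w.shape)
```

The context vector would typically be concatenated with the decoder state before the output projection, so the decoder no longer depends on a single fixed-size summary of the source sentence.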

## Summary:

### Data Analysis Key Findings

*   The `SimpleRNNCell` was successfully replaced with a custom `LSTMCell` class, including the necessary parameters and forward pass logic for the LSTM gates.
*   Both the `Encoder` and `Decoder` classes were updated to use the `LSTMCell` and correctly handle the tuple of hidden and cell states throughout the sequence processing and between the encoder and decoder.
*   The `Seq2Seq` model, `evaluate` function, and `translate` function were updated to pass and receive the hidden and cell states tuple as required by the LSTM architecture.
*   Hyperparameters were set to `EMB_DIM=128`, `HIDDEN_DIM=256`, `LR=0.0005`, `BATCH=64`, and `EPOCHS=50`.
*   Weight decay (`1e-5`) was added to the Adam optimizer for regularization.
*   The training loop was refined to track and print the average training loss and accuracy over the full epoch, along with validation loss and accuracy, at the end of each epoch.
*   Training over 50 epochs showed a decrease in both training and validation loss and an increase in training and validation accuracy, with a best validation accuracy of approximately 38.1%.

### Insights or Next Steps

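*   The gap between training and validation accuracy (training accuracy reached over 80% while validation accuracy peaked around 38.1%) suggests the model is overfitting to the training data. Part of the gap may also come from how the two are measured: training accuracy is computed under teacher forcing, whereas `evaluate` puts the model in eval mode, where the decoder feeds back its own greedy predictions.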
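*   Future work should focus on improving the model architecture by incorporating attention mechanisms, using multi-layer LSTMs, or employing bidirectional LSTMs in the encoder to improve translation quality. These changes add capacity rather than reduce it, so the overfitting itself is better addressed separately, for example with dropout, stronger weight decay, early stopping, or more training data.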
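
Since the training loop already tracks `best_val_acc`, one low-cost guard against overfitting is to stop once that value stalls. The class below is a minimal sketch of patience-based early stopping; `EarlyStopper`, `patience`, and `min_delta` are hypothetical names and defaults, not something the notebook implements.

```python
class EarlyStopper:
    """Hypothetical helper: stop when val accuracy has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc: float) -> bool:
        """Record one epoch's validation accuracy; return True if training should stop."""
        if val_acc > self.best + self.min_delta:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=2)
    for acc in [0.30, 0.35, 0.34, 0.35]:  # made-up validation accuracies
        if stopper.step(acc):
            print("stop; best so far:", stopper.best)
            break
```

In the epoch loop this would amount to calling `stopper.step(va_acc)` after `evaluate` and breaking when it returns `True`.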
