Below is the derivation for Question 1 under the assumptions:

* embedding size = $m$
* hidden size (encoder & decoder) = $k$
* input and output sequence length = $T$
* (source and target) vocabulary size = $V$
* simple RNN cell (not LSTM/GRU), one layer each

---

### (a) Total number of computations (per example, forward pass)

1. **Encoder RNN** (over $T$ time-steps)
   At each step you compute

   $$
     h_t = \phi\bigl(W_x x_t + W_h h_{t-1} + b\bigr)
   $$

   * $W_x x_t$:  $m\times k$ matrix–vector multiply ⇒ $m\,k$ multiplies + $m\,k$ adds
   * $W_h h_{t-1}$:  $k\times k$ multiply ⇒ $k^2$ multiplies + $k^2$ adds
   * elementwise nonlinearity: $\sim k$ ops

   **≈** $(2\,m\,k + 2\,k^2 + k)$ flops per step
   → over $T$ steps:

   $$
     \underbrace{T\,(2\,m\,k + 2\,k^2 + k)}_{\text{encoder}}
   $$

2. **Decoder RNN + output projection** (also $T$ steps)

   * RNN cell (same cost as above): $(2\,m\,k + 2\,k^2 + k)$ per step
   * output layer $o_t = W_o\,h_t + b_o$: $k\times V$ mat-vec ⇒ $k\,V$ multiplies + $k\,V$ adds

   **≈** $(2\,m\,k + 2\,k^2 + k) + 2\,k\,V$ flops per step
   → over $T$ steps:

   $$
     \underbrace{T\,(2\,m\,k + 2\,k^2 + k + 2\,k\,V)}_{\text{decoder + projection}}
   $$

3. **Total**

$$
  \boxed{
    \underbrace{T\,(2\,m\,k + 2\,k^2 + k)}_{\text{encoder}}
    \;+\;
    \underbrace{T\,(2\,m\,k + 2\,k^2 + k + 2\,k\,V)}_{\text{decoder}}
    \;=\;
    T\bigl(4\,m\,k \;+\;4\,k^2 \;+\;2\,k \;+\;2\,k\,V\bigr)
  }
$$

If you count only the multiplies (ignoring adds and activation overhead), each matrix–vector product contributes half as many operations, so the total becomes

$$
  T\,(m\,k + k^2) + T\,(m\,k + k^2 + k\,V)
  \;=\;
  T\bigl(2\,m\,k +2\,k^2 + k\,V\bigr).
$$
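The boxed flop total in (a) can be checked numerically. The sketch below is a minimal pure-Python version of that count; the function names and the sample sizes (`m=16, k=32, T=10, V=50`) are illustrative assumptions, and it uses the same approximations as the derivation (bias additions, embedding lookups, and the softmax are not counted).

```python
def rnn_cell_flops(m: int, k: int) -> int:
    """Approximate flops for one step of h_t = phi(W_x x_t + W_h h_{t-1} + b)."""
    input_term = 2 * m * k      # W_x x_t: m*k multiplies + m*k adds
    recurrent_term = 2 * k * k  # W_h h_{t-1}: k^2 multiplies + k^2 adds
    activation = k              # elementwise nonlinearity
    return input_term + recurrent_term + activation


def seq2seq_flops(m: int, k: int, T: int, V: int) -> dict:
    """Per-example forward-pass flops, split into encoder and decoder parts."""
    encoder = T * rnn_cell_flops(m, k)
    # The decoder adds the k x V output projection at every step.
    decoder = T * (rnn_cell_flops(m, k) + 2 * k * V)
    return {"encoder": encoder, "decoder": decoder, "total": encoder + decoder}


# Hypothetical sizes, chosen only to exercise the formula.
m, k, T, V = 16, 32, 10, 50
counts = seq2seq_flops(m, k, T, V)

# The itemised sum should agree with the closed form T(4mk + 4k^2 + 2k + 2kV).
closed_form = T * (4 * m * k + 4 * k * k + 2 * k + 2 * k * V)
assert counts["total"] == closed_form
print(counts)
```

For realistic vocabulary sizes the $2\,k\,V$ projection term tends to dominate the decoder cost, which is easy to see by increasing `V` in the call above.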

---

### (b) Total number of parameters

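1. **Input embedding** (a single lookup table; a separate target-side embedding, as in the `Decoder` class below, would add another $V\,m$)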

$$
  \mathbf{E}\in\mathbb R^{V \times m}
  \quad\Longrightarrow\quad
  V\,m
$$

2. **Encoder RNN (one layer)**

   * $W_x\in\mathbb R^{m\times k}$: $m\,k$
   * $W_h\in\mathbb R^{k\times k}$: $k^2$
   * bias $b\in\mathbb R^k$: $k$
     **⇒** $m\,k + k^2 + k$

3. **Decoder RNN (one layer)**
   (input size $m$ if the decoder is fed token embeddings, or $k$ if it is fed the previous hidden state as its input; we assume $m$, the same as the encoder)
   **⇒** $m\,k + k^2 + k$

4. **Output layer**

   * $W_o\in\mathbb R^{k\times V}$: $k\,V$
   * bias $b_o\in\mathbb R^V$: $V$
     **⇒** $k\,V + V$

---

Putting it all together:

$$
\boxed{
  \underbrace{V\,m}_{\text{embedding}}
  \;+\;
  \underbrace{(m\,k + k^2 + k)}_{\text{encoder}}
  \;+\;
  \underbrace{(m\,k + k^2 + k)}_{\text{decoder}}
  \;+\;
  \underbrace{(k\,V + V)}_{\text{output}}
  \;=\;
  V\,m \;+\; 2\,m\,k \;+\; 2\,k^2 \;+\; 2\,k \;+\; k\,V \;+\; V
}
$$

Omitting biases for brevity, the leading terms are

$$
  \boxed{V\,m \;+\;2\,m\,k \;+\;2\,k^2 \;+\;k\,V}.
$$
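The parameter total in (b) can be reproduced the same way. This is a minimal sketch with assumed names and sample sizes. The `biases_per_rnn` argument is an addition for illustration: the derivation uses one bias vector per RNN, whereas PyTorch's `nn.RNN` stores two (`bias_ih` and `bias_hh`), so a count taken from a PyTorch model would be larger by $k$ per RNN layer.

```python
def seq2seq_params(m: int, k: int, V: int, biases_per_rnn: int = 1) -> dict:
    """Parameter count for the single-layer RNN encoder-decoder in (b)."""
    embedding = V * m                         # E in R^{V x m}
    rnn = m * k + k * k + biases_per_rnn * k  # W_x, W_h and bias vector(s)
    output = k * V + V                        # W_o and b_o
    return {
        "embedding": embedding,
        "encoder": rnn,
        "decoder": rnn,
        "output": output,
        "total": embedding + 2 * rnn + output,
    }


# Hypothetical sizes, chosen only to exercise the formula.
m, k, V = 16, 32, 50
params = seq2seq_params(m, k, V)

# Should agree with the boxed closed form Vm + 2mk + 2k^2 + 2k + kV + V.
closed_form = V * m + 2 * m * k + 2 * k * k + 2 * k + k * V + V
assert params["total"] == closed_form

# With two bias vectors per RNN (the PyTorch nn.RNN layout) the total grows by 2k.
pytorch_style = seq2seq_params(m, k, V, biases_per_rnn=2)
assert pytorch_style["total"] == closed_form + 2 * k
print(params["total"], pytorch_style["total"])
```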

---

**Summary**

* **(a)** $\displaystyle \approx T\,(4\,m\,k +4\,k^2 +2\,k\,V)$ flops, i.e. $\mathcal O\bigl(T\,(m\,k + k^2 + k\,V)\bigr)$
* **(b)** $\displaystyle V\,m +2\,m\,k +2\,k^2 +k\,V$ (plus lower-order biases) parameters.


In [None]:
import wandb
import torch
import torch.nn as nn


### ANSWER 1

In [1]:


class Encoder(nn.Module):
    def __init__(self,
                 src_vocab_size:int,
                 embed_size:int,
                 hidden_size:int,
                 num_layers:int = 1,
                 cell_type:str = "RNN",
                 dropout:float = 0.0):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, embed_size)
        # choose RNN/LSTM/GRU
        rnn_cls = {"RNN": nn.RNN, "LSTM": nn.LSTM, "GRU": nn.GRU}[cell_type]
        self.rnn = rnn_cls(
            input_size=embed_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers>1 else 0.0
        )
        
    def forward(self, src):
        # src: (batch, src_len)
        emb = self.embedding(src)               # (batch, src_len, embed_size)
        outputs, hidden = self.rnn(emb)         # outputs: (batch, src_len, hidden)
        # hidden: 
        #   RNN/GRU -> (num_layers, batch, hidden)
        #   LSTM    -> tuple of two such tensors (h_n, c_n)
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self,
                 trg_vocab_size:int,
                 embed_size:int,
                 hidden_size:int,
                 num_layers:int = 1,
                 cell_type:str = "RNN",
                 dropout:float = 0.0):
        super().__init__()
        self.embedding = nn.Embedding(trg_vocab_size, embed_size)
        rnn_cls = {"RNN": nn.RNN, "LSTM": nn.LSTM, "GRU": nn.GRU}[cell_type]
        self.rnn = rnn_cls(
            input_size=embed_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers>1 else 0.0
        )
        self.fc_out = nn.Linear(hidden_size, trg_vocab_size)
        
    def forward(self, input_token, hidden):
        # input_token: (batch,)  — a single timestep token
        input_token = input_token.unsqueeze(1)   # (batch, 1)
        emb = self.embedding(input_token)        # (batch, 1, embed_size)
        output, hidden = self.rnn(emb, hidden)   # output: (batch, 1, hidden)
        prediction = self.fc_out(output.squeeze(1))  
        # prediction: (batch, trg_vocab_size)
        return prediction, hidden

class Seq2Seq(nn.Module):
    def __init__(self,
                 encoder:Encoder,
                 decoder:Decoder,
                 device:torch.device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio:float = 0.5):
        batch_size, trg_len = trg.shape
        trg_vocab_size = self.decoder.embedding.num_embeddings
        
        # tensor to store decoder outputs
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size, device=self.device)

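        # 1. Encode the whole source sequence. The final hidden state is passed to the
        #    decoder unchanged, so encoder and decoder must use the same cell type,
        #    hidden size, and number of layers.
        _, hidden = self.encoder(src)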
        
        # 2. first input to the decoder is the <sos> tokens
        input_token = trg[:,0]  

        for t in range(1, trg_len):
            # 3. decode one token
            pred, hidden = self.decoder(input_token, hidden)
            outputs[:, t] = pred
            
            # 4. decide if we do teacher forcing
            teacher_force = (torch.rand(1).item() < teacher_forcing_ratio)
            top1 = pred.argmax(1)  # (batch,)
            
            input_token = trg[:, t] if teacher_force else top1

        return outputs


In [3]:
# SRC_VOCAB = len(src_token_to_idx)   # e.g. romanized chars
# TRG_VOCAB = len(trg_token_to_idx)   # e.g. Devanagari chars
# EMB_SIZE  = 64
# HID_SIZE  = 128
# LAYERS    = 2
# CELL      = "LSTM"   # or "RNN", "GRU"
# DROPOUT   = 0.3

# enc = Encoder(SRC_VOCAB, EMB_SIZE, HID_SIZE, LAYERS, CELL, DROPOUT)
# dec = Decoder(TRG_VOCAB, EMB_SIZE, HID_SIZE, LAYERS, CELL, DROPOUT)
# model = Seq2Seq(enc, dec, device).to(device)


In [10]:
# Do not hard-code the API key; wandb.login() reads it from the WANDB_API_KEY
# environment variable or prompts for it interactively.
wandb.login()

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33med24s401[0m ([33med24s401-indian-institute-of-technology-madras[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

### ANSWER 2

In [None]:
# train.py
import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from your_seq2seq_module import Encoder, Decoder, Seq2Seq  # import from your code
from dakshina_loader import DakshinaDataset              # or however you load hi lexicons

def train():
    # 1️⃣ Initialize a new W&B run
    wandb.init()
    config = wandb.config   # holds all sweep parameters

    # 2️⃣ Prepare data
    train_ds = DakshinaDataset(lang="hi", split="train")
    val_ds   = DakshinaDataset(lang="hi", split="dev")
    train_dl = DataLoader(train_ds, batch_size=config.batch_size, shuffle=True)
    val_dl   = DataLoader(val_ds,   batch_size=config.batch_size)

    # 3️⃣ Build model with hyperparameters from config
    enc = Encoder(
        src_vocab_size = train_ds.src_vocab_size,
        embed_size     = config.embed_size,
        hidden_size    = config.hidden_size,
        num_layers     = config.enc_layers,
        cell_type      = config.cell_type,
        dropout        = config.dropout,
    )
    dec = Decoder(
        trg_vocab_size = train_ds.trg_vocab_size,
        embed_size     = config.embed_size,
        hidden_size    = config.hidden_size,
        num_layers     = config.dec_layers,
        cell_type      = config.cell_type,
        dropout        = config.dropout,
    )
    model = Seq2Seq(enc, dec, device=wandb.config.device).to(wandb.config.device)

    optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
    criterion = nn.CrossEntropyLoss(ignore_index=train_ds.pad_idx)

    # 4️⃣ Training loop
    for epoch in range(config.epochs):
        model.train()
        total_loss = 0.0

        for src, trg in train_dl:
            src, trg = src.to(wandb.config.device), trg.to(wandb.config.device)
            optimizer.zero_grad()
            outputs = model(src, trg, teacher_forcing_ratio=0.5)
            # outputs: (B, T, V) -> reshape for CE
            B, T, V = outputs.shape
            loss = criterion(outputs[:,1:,:].reshape(-1, V),
                             trg[:,1:].reshape(-1))
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        # 5️⃣ Validation
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        with torch.no_grad():
            for src, trg in val_dl:
                src, trg = src.to(wandb.config.device), trg.to(wandb.config.device)
                outputs = model(src, trg, teacher_forcing_ratio=0.0)
                B, T, V = outputs.shape
                loss = criterion(outputs[:,1:,:].reshape(-1, V),
                                 trg[:,1:].reshape(-1))
                val_loss += loss.item()

                # compute simple accuracy over non-pad tokens
                preds = outputs.argmax(-1)
                mask = trg[:,1:] != train_ds.pad_idx
                val_correct += (preds[:,1:][mask] == trg[:,1:][mask]).sum().item()
                val_total   += mask.sum().item()

        val_acc = val_correct / max(val_total, 1)  # guard against an empty validation set

        # 6️⃣ Log to W&B
        wandb.log({
            "train_loss": total_loss/len(train_dl),
            "val_loss":   val_loss/len(val_dl),
            "val_accuracy": val_acc,
            "epoch": epoch
        })

    # 7️⃣ (Optional) Save the final model weights and upload them to W&B
    torch.save(model.state_dict(), "model.pt")
    wandb.save("model.pt")


if __name__ == "__main__":
    train()
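`train()` reads every hyperparameter from `wandb.config`, so it is meant to be driven by a W&B sweep. The block below is a sketch of a sweep configuration that supplies each key the function reads. All values, the search method, and the project name are illustrative assumptions, not settings taken from this notebook. The layer counts are pinned to a single value each because `Seq2Seq` passes the encoder's hidden state straight to the decoder, which only works when both have the same number of layers.

```python
# Keys that train() reads from wandb.config.
REQUIRED_KEYS = {
    "batch_size", "embed_size", "hidden_size", "enc_layers", "dec_layers",
    "cell_type", "dropout", "learning_rate", "epochs", "device",
}

# Hypothetical search space; adjust the values to your compute budget.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "batch_size": {"values": [32, 64, 128]},
        "embed_size": {"values": [32, 64, 128]},
        "hidden_size": {"values": [64, 128, 256]},
        "enc_layers": {"value": 1},  # kept equal to dec_layers (see note above)
        "dec_layers": {"value": 1},
        "cell_type": {"values": ["RNN", "GRU", "LSTM"]},
        "dropout": {"values": [0.0, 0.2, 0.3]},
        "learning_rate": {"values": [1e-3, 1e-4]},
        "epochs": {"value": 10},
        "device": {"value": "cpu"},  # or "cuda" when a GPU is available
    },
}

# Fail early if the sweep would leave a config attribute undefined.
missing = REQUIRED_KEYS - set(sweep_config["parameters"])
assert not missing, f"sweep is missing keys: {sorted(missing)}"

# Launching the sweep (requires wandb and the train() function above):
# sweep_id = wandb.sweep(sweep_config, project="my-project")  # hypothetical project name
# wandb.agent(sweep_id, function=train, count=10)
```

The metric name matches the `val_accuracy` key that `train()` logs, which is what lets the Bayesian search rank runs.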
