# Problem Statement

In this assignment you will experiment with the [Dakshina dataset](https://github.com/google-research-datasets/dakshina) released by Google. This dataset contains pairs of the following form: 

$x$.      $y$

ajanabee अजनबी.

i.e., a word in the native script and its corresponding transliteration in the Latin script (the way we type while chatting with our friends on WhatsApp etc). Given many such $(x_i, y_i)_{i=1}^n$ pairs your goal is to train a model $y = \hat{f}(x)$ which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर). 

As you would realise this is the problem of mapping a sequence of characters in one language to a sequence of characters in another language. Notice that this is a scaled down version of the problem of translation where the goal is to translate a sequence of **words** in one language to a sequence of words in another language (as opposed to sequence of **characters** here).

Read these blogs to understand how to build neural sequence to sequence models: [blog1](https://keras.io/examples/nlp/lstm_seq2seq/), [blog2](https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/)

In [3]:
#This cell contains necessary code for dataset preprocessing and at the I print few examples for looking how the dataset looks like
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

trainPth = "/mnt/e_disk/DA6401_Assignment3/dataset/dakshina_dataset_v1.0/ta/lexicons/ta.translit.sampled.train.tsv"
devPth   = "/mnt/e_disk/DA6401_Assignment3/dataset/dakshina_dataset_v1.0/ta/lexicons/ta.translit.sampled.dev.tsv"
testPth = "/mnt/e_disk/DA6401_Assignment3/dataset/dakshina_dataset_v1.0/ta/lexicons/ta.translit.sampled.test.tsv"
def get_vocab(paths):
    chars = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                native, roman, _ = line.strip().split("\t")
                chars.update(native)
                chars.update(roman)
    return chars

def get_char2idx(char_set):
    chars = ["<pad>", "<sos>", "<eos>", "<unk>"] + sorted(char_set)
    return {ch: i for i, ch in enumerate(chars)}, chars



char_set = get_vocab([trainPth, devPth])
roman2idx, idx2roman = get_char2idx(set(c for c in char_set if c.isascii()))
dev2idx, idx2dev = get_char2idx(set(c for c in char_set if not c.isascii()))

class TranslitDataset(Dataset):
    def __init__(self, path, src_c2i, tgt_c2i, max_len=32):
        self.data = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                native, roman, _ = line.strip().split("\t")
                self.data.append((roman, native))
        self.src_c2i = src_c2i
        self.tgt_c2i = tgt_c2i
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        roman, native = self.data[i]
        src = [self.src_c2i.get(c, self.src_c2i["<unk>"]) for c in roman[:self.max_len]]
        tgt = [self.tgt_c2i["<sos>"]] + \
              [self.tgt_c2i.get(c, self.tgt_c2i["<unk>"]) for c in native[:self.max_len - 1]] + \
              [self.tgt_c2i["<eos>"]]
        return torch.tensor(src), torch.tensor(tgt)

def pad_batch(batch):
    src, tgt = zip(*batch)
    src = pad_sequence(src, batch_first=True, padding_value=roman2idx["<pad>"])
    tgt = pad_sequence(tgt, batch_first=True, padding_value=dev2idx["<pad>"])
    return src, tgt

train_ds = TranslitDataset(trainPth, roman2idx, dev2idx, max_len=32)
dev_ds   = TranslitDataset(devPth, roman2idx, dev2idx, max_len=32)
test_ds   = TranslitDataset(testPth, roman2idx, dev2idx, max_len=32)

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, collate_fn=pad_batch)
dev_loader   = DataLoader(dev_ds, batch_size=32, shuffle=False, collate_fn=pad_batch)
test_loader   = DataLoader(test_ds, batch_size=32, shuffle=False, collate_fn=pad_batch)


print("Train set")
for i in range(5):
    src, tgt = train_ds[i]
    roman = ''.join([idx2roman[idx] for idx in src])
    native = ''.join([idx2dev[idx] for idx in tgt[1:-1]])  # skip <sos> and <eos>
    print(f"{i+1}. Roman: {roman:20s}  →  Native: {native}")
print("Dev set")
for i in range(5):
    src, tgt = dev_ds[i]
    roman = ''.join([idx2roman[idx] for idx in src])
    native = ''.join([idx2dev[idx] for idx in tgt[1:-1]])
    print(f"{i+1}. Roman: {roman:20s}  →  Native: {native}")


Train set
1. Roman: fiat                  →  Native: ஃபியட்
2. Roman: phiyat                →  Native: ஃபியட்
3. Roman: piyat                 →  Native: ஃபியட்
4. Roman: firaans               →  Native: ஃபிரான்ஸ்
5. Roman: france                →  Native: ஃபிரான்ஸ்
Dev set
1. Roman: fire                  →  Native: ஃபயர்
2. Roman: phayar                →  Native: ஃபயர்
3. Roman: baar                  →  Native: ஃபார்
4. Roman: bar                   →  Native: ஃபார்
5. Roman: far                   →  Native: ஃபார்


## Question 1 (15 Marks)
Build a RNN based seq2seq model which contains the following layers: (i) input layer for character embeddings (ii) one encoder RNN which sequentially encodes the input character sequence (Latin) (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari). 

The code should be flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU) and the number of layers in the encoder and decoder can be changed.

(a) What is the total number of computations done by your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)

(b) What is the total number of parameters in your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder and the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)

In [4]:
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, num_layers, cell_type='LSTM'):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        rnn_cls = {'RNN': nn.RNN, 'LSTM': nn.LSTM, 'GRU': nn.GRU}[cell_type]
        self.rnn = rnn_cls(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.cell_type = cell_type

    def forward(self, src):
        embedded = self.embedding(src)
        outputs, hidden = self.rnn(embedded)
        return hidden  # hidden is tuple if LSTM, tensor otherwise


class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hidden_dim, num_layers, cell_type='LSTM'):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim)
        rnn_cls = {'RNN': nn.RNN, 'LSTM': nn.LSTM, 'GRU': nn.GRU}[cell_type]
        self.rnn = rnn_cls(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.cell_type = cell_type

    def forward(self, input, hidden):
        # input: [batch_size]
        input = input.unsqueeze(1)  # [batch_size, 1]
        embedded = self.embedding(input)  # [batch_size, 1, emb_dim]
        output, hidden = self.rnn(embedded, hidden)
        prediction = self.fc_out(output.squeeze(1))  # [batch_size, output_dim]
        return prediction, hidden


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device, cell_type='LSTM'):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        self.cell_type = cell_type

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        batch_size, tgt_len = tgt.shape
        tgt_vocab_size = self.decoder.embedding.num_embeddings

        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)

        hidden = self.encoder(src)

        # First input to decoder is <sos> token
        input = tgt[:, 0]

        for t in range(1, tgt_len):
            output, hidden = self.decoder(input, hidden)
            outputs[:, t] = output
            top1 = output.argmax(1)
            input = tgt[:, t] if torch.rand(1).item() < teacher_forcing_ratio else top1

        return outputs


(a) What is the total number of computations done by your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)

Given: 
- Input embedding size: m

- Hidden size (encoder and decoder): k

- Input and output sequence length: T

- Vocabulary size (same for source and target): V

- One encoder layer and one decoder layer

- Model uses LSTM / GRU / RNN cells

- Encoder:
    - RNN: Cost per step=m⋅k+k⋅k=k(m+k)
    - LSTM: Cost per step=4[k(m+k)]=4k(m+k) DUE TO 4 gates
    - GRU: Cost per step=3k(m+k)
    - Total cost for encoder = T⋅(cost per step)

- Decoder:
    - RNN: Cost per step=m⋅k+k⋅k=k(m+k)
    - LSTM: Cost per step=4[k(m+k)]=4k(m+k) DUE TO 4 gates
    - GRU: Cost per step=3k(m+k)
    - Output projection cost per step: Hidden→Vocab:k⋅V
    - Total cost for decoder = T⋅[RNN cost per step+k⋅V]

Total cost = 
                - RNN = T⋅[2k(m+k)+kV]
                - LSTM = T⋅[8k(m+k)+kV]
                - GRU = T⋅[6k(m+k)+kV]

(b) What is the total number of parameters in your network? (assume that the input embedding size is $m$, encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder and the length of the input and output sequence is the same, i.e., $T$, the size of the vocabulary is the same for the source and target language, i.e., $V$)

# Question 2 (10 Marks)

You will now train your model using any one language from the [Dakshina dataset](https://github.com/google-research-datasets/dakshina) (I would suggest pick a language that you can read so that it is easy to analyse the errors). Use the standard train, dev, test set from the folder dakshina_dataset_v1.0/hi/lexicons/ (replace hi by the language of your choice)

Using the sweep feature in wandb find the best hyperparameter configuration. Here are some suggestions but you are free to decide which hyperparameters you want to explore

- input embedding size: 16, 32, 64, 256, ...
- number of encoder layers: 1, 2, 3 
- number of decoder layers: 1, 2, 3 
- hidden layer size: 16, 32, 64, 256, ...
- cell type: RNN, GRU, LSTM
- dropout: 20%, 30% (btw, where will you add dropout? you should read up a bit on this)
- beam search in decoder with different beam sizes: 

Based on your sweep please paste the following plots which are automatically generated by wandb:
- accuracy v/s created plot (I would like to see the number of experiments you ran to get the best configuration). 
- parallel co-ordinates plot
- correlation summary table (to see the correlation of each hyperparameter with the loss/accuracy)

Also write down the hyperparameters and their values that you sweeped over. Smart strategies to reduce the number of runs while still achieving a high accuracy would be appreciated. Write down any unique strategy that you tried for efficiently searching the hyperparameters.


In [2]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as opt
import wandb
from torch.utils.data import DataLoader, Dataset
from collections import defaultdict

# Define Encoder
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, num_layers, cell_type):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = getattr(nn, cell_type.upper())(emb_dim, hidden_dim, num_layers, batch_first=True)
        
    def forward(self, src):
        embedded = self.embedding(src)
        outputs, hidden = self.rnn(embedded)
        return hidden

# Define Decoder
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hidden_dim, num_layers, cell_type):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = getattr(nn, cell_type.upper())(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, input, hidden):
        embedded = self.embedding(input.unsqueeze(1))
        output, hidden = self.rnn(embedded, hidden)
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden

# Define Seq2Seq
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        output_dim = self.decoder.fc_out.out_features
        outputs = torch.zeros(batch_size, trg_len, output_dim).to(self.device)
        hidden = self.encoder(src)
        input = trg[:, 0]  # <sos> token
        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden)
            outputs[:, t] = output
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            input = trg[:, t] if teacher_force else output.argmax(1)
        return outputs


# Model Trainer
def train_model(config=None):
    with wandb.init(config=config):
        config = wandb.config
        wandb.run.name = f"cell_{config.cell_type}/hid_{config.hidden_dim}/emb_{config.emb_dim}/lay_{config.num_layers}/lr_{config.lr}"

        device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

        # Vocab and dataset setup
        vocab_size = config.vocab_size
        input_dim = output_dim = vocab_size

        encoder = Encoder(input_dim, config.emb_dim, config.hidden_dim, config.num_layers, config.cell_type)
        decoder = Decoder(output_dim, config.emb_dim, config.hidden_dim, config.num_layers, config.cell_type)
        model = Seq2Seq(encoder, decoder, device).to(device)

        optimizer = opt.Adam(model.parameters(), lr=config.lr)
        criterion = nn.CrossEntropyLoss(ignore_index=0)

        best_val_loss = float('inf')
        best_epoch = 0
        save_path = os.path.join(wandb.run.dir, 'best_model.pth')

        for epoch in range(config.epochs):
            model.train()
            epoch_loss, epoch_correct, total = 0, 0, 0
            for src, trg in train_loader:
                src, trg = src.to(device), trg.to(device)
                optimizer.zero_grad()
                output = model(src, trg)  # output: [batch, trg_len, vocab_size]
                output = output[:, 1:].reshape(-1, vocab_size)
                trg_reshaped = trg[:, 1:].reshape(-1)

                loss = criterion(output, trg_reshaped)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item() * src.size(0)

                # Accuracy
                preds = output.argmax(1)
                mask = trg_reshaped != 0
                epoch_correct += ((preds == trg_reshaped) & mask).sum().item()
                total += mask.sum().item()

            train_loss = epoch_loss / len(train_ds)
            train_acc = epoch_correct / total if total > 0 else 0

            # Validation
            model.eval()
            val_loss = 0.0
            val_correct, val_total = 0, 0
            with torch.no_grad():
                for src, trg in dev_loader:
                    src, trg = src.to(device), trg.to(device)
                    output = model(src, trg, teacher_forcing_ratio=0.0)
                    output = output[:, 1:].reshape(-1, vocab_size)
                    trg_reshaped = trg[:, 1:].reshape(-1)

                    loss = criterion(output, trg_reshaped)
                    val_loss += loss.item() * src.size(0)

                    preds = output.argmax(1)
                    mask = trg_reshaped != 0
                    val_correct += ((preds == trg_reshaped) & mask).sum().item()
                    val_total += mask.sum().item()

            val_loss /= len(dev_ds)
            val_acc = val_correct / val_total if val_total > 0 else 0

            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_epoch = epoch + 1
                torch.save(model.state_dict(), save_path)
                artifact = wandb.Artifact('best-model', type='model')
                artifact.add_file(save_path)
                wandb.log_artifact(artifact)

            wandb.log({
                'epoch': epoch + 1,
                'train_loss': train_loss,
                'val_loss': val_loss,
                'train_accuracy': train_acc,
                'val_accuracy': val_acc
            })


# Sweep config
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    'parameters': {
        'epochs': {'values': [10, 15]},
        'vocab_size': {'value': 50},
        'seq_len': {'value': 20},
        'emb_dim': {'values': [64, 128]},
        'hidden_dim': {'values': [128, 256]},
        'num_layers': {'values': [1, 2]},
        'cell_type': {'values': ['RNN', 'GRU', 'LSTM']},
        'lr': {'values': [1e-3, 1e-4]},
        'batch_size': {'values': [32, 64]}
    }
}

sweep_id = wandb.sweep(sweep_config, project='RNN-Seq2Seq')
wandb.agent(sweep_id, function=train_model, count=5)


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Create sweep with ID: 4gp89sqb
Sweep URL: https://wandb.ai/navaneeth001/RNN-Seq2Seq/sweeps/4gp89sqb


[34m[1mwandb[0m: Agent Starting Run: 5m89ii67 with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	cell_type: RNN
[34m[1mwandb[0m: 	emb_dim: 64
[34m[1mwandb[0m: 	epochs: 10
[34m[1mwandb[0m: 	hidden_dim: 256
[34m[1mwandb[0m: 	lr: 0.0001
[34m[1mwandb[0m: 	num_layers: 2
[34m[1mwandb[0m: 	seq_len: 20
[34m[1mwandb[0m: 	vocab_size: 50
[34m[1mwandb[0m: Currently logged in as: [33mnavaneeth001[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


0,1
epoch,▁▂▃▃▄▅▆▆▇█
train_accuracy,▁▄▅▆▆▇▇▇▇█
train_loss,█▄▃▃▂▂▂▁▁▁
val_accuracy,▄▅▆█▁▇▅▆▇▅
val_loss,██▄▅█▃▄▁█▄

0,1
epoch,10.0
train_accuracy,0.37242
train_loss,2.14493
val_accuracy,0.19452
val_loss,2.75314


[34m[1mwandb[0m: Agent Starting Run: mxx5i166 with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	cell_type: LSTM
[34m[1mwandb[0m: 	emb_dim: 128
[34m[1mwandb[0m: 	epochs: 10
[34m[1mwandb[0m: 	hidden_dim: 128
[34m[1mwandb[0m: 	lr: 0.001
[34m[1mwandb[0m: 	num_layers: 1
[34m[1mwandb[0m: 	seq_len: 20
[34m[1mwandb[0m: 	vocab_size: 50


0,1
epoch,▁▂▃▃▄▅▆▆▇█
train_accuracy,▁▆▇▇▇█████
train_loss,█▃▂▂▂▁▁▁▁▁
val_accuracy,▁▅▇▇▇█████
val_loss,█▃▂▂▁▁▁▁▂▂

0,1
epoch,10.0
train_accuracy,0.92036
train_loss,0.3229
val_accuracy,0.76671
val_loss,1.06094


[34m[1mwandb[0m: Agent Starting Run: 4ukfqgll with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	cell_type: RNN
[34m[1mwandb[0m: 	emb_dim: 64
[34m[1mwandb[0m: 	epochs: 10
[34m[1mwandb[0m: 	hidden_dim: 256
[34m[1mwandb[0m: 	lr: 0.0001
[34m[1mwandb[0m: 	num_layers: 1
[34m[1mwandb[0m: 	seq_len: 20
[34m[1mwandb[0m: 	vocab_size: 50


0,1
epoch,▁▂▃▃▄▅▆▆▇█
train_accuracy,▁▅▆▆▇▇▇███
train_loss,█▄▃▂▂▂▁▁▁▁
val_accuracy,▁▆▇▅▇▇▇▇█▇
val_loss,█▂▂▂▂▂▁▁▁▁

0,1
epoch,10.0
train_accuracy,0.35768
train_loss,2.19766
val_accuracy,0.19758
val_loss,2.73217


[34m[1mwandb[0m: Agent Starting Run: as7lo7r2 with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	cell_type: LSTM
[34m[1mwandb[0m: 	emb_dim: 128
[34m[1mwandb[0m: 	epochs: 15
[34m[1mwandb[0m: 	hidden_dim: 128
[34m[1mwandb[0m: 	lr: 0.0001
[34m[1mwandb[0m: 	num_layers: 1
[34m[1mwandb[0m: 	seq_len: 20
[34m[1mwandb[0m: 	vocab_size: 50


0,1
epoch,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
train_accuracy,▁▃▄▅▅▆▇▇▇▇█████
train_loss,█▆▅▄▃▃▂▂▂▂▁▁▁▁▁
val_accuracy,▁▂▃▄▅▅▆▇▇▇▇████
val_loss,█▇▆▅▄▃▃▂▂▂▁▁▁▁▁

0,1
epoch,15.0
train_accuracy,0.85372
train_loss,0.56528
val_accuracy,0.72734
val_loss,1.08085


[34m[1mwandb[0m: Agent Starting Run: f2jrev21 with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	cell_type: LSTM
[34m[1mwandb[0m: 	emb_dim: 128
[34m[1mwandb[0m: 	epochs: 15
[34m[1mwandb[0m: 	hidden_dim: 128
[34m[1mwandb[0m: 	lr: 0.001
[34m[1mwandb[0m: 	num_layers: 1
[34m[1mwandb[0m: 	seq_len: 20
[34m[1mwandb[0m: 	vocab_size: 50


0,1
epoch,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
train_accuracy,▁▅▆▇▇▇▇████████
train_loss,█▄▃▂▂▂▂▂▁▁▁▁▁▁▁
val_accuracy,▁▄▅▇▇▇▇███████▇
val_loss,█▄▄▂▂▂▂▂▁▂▂▂▂▂▃

0,1
epoch,15.0
train_accuracy,0.93539
train_loss,0.26357
val_accuracy,0.76376
val_loss,1.13907


# Question 3 (15 Marks)
Based on the above plots write down some insightful observations. For example, 
- RNN based model takes longer time to converge than GRU or LSTM
- using smaller sizes for the hidden layer does not give good results
- dropout leads to better performance 

# Question 4 (10 Marks)

You will now apply your best model on the test data (You shouldn't have used test data so far. All the above experiments should have been done using train and val data only). 

(a) Use the best model from your sweep and report the accuracy on the test set (the output is correct only if it exactly matches the reference output). 

(b) Provide sample inputs from the test data and predictions made by your best model (more marks for presenting this grid creatively). Also upload all the predictions on the test set in a folder **predictions_vanilla** on your github project.

(c) Comment on the errors made by your model (simple insightful bullet points)

- The model makes more errors on consonants than vowels
- The model makes more errors on longer sequences
- I am thinking confusion matrix but may be it's just me!

In [8]:
import wandb
from wandb import Api
import torch
import os
from torch.utils.data import DataLoader
import pandas as pd
import matplotlib.pyplot as plt

# Load best run from sweep
ENTITY      = 'navaneeth001'
PROJECT     = 'RNN-Seq2Seq'
SWEEP_ID    = '4gp89sqb'
ARTIFACT_REF= 'navaneeth001/RNN-Seq2Seq/best-model:v40'

api         = Api()
sweep       = api.sweep(f"{ENTITY}/{PROJECT}/{SWEEP_ID}")
runs        = sweep.runs
best_run    = max(runs, key=lambda r: r.summary.get('val_acc', 0))
cfg         = best_run.config

# Start evaluation run
eval_run = wandb.init(
    project=PROJECT,
    entity=ENTITY,
    job_type='evaluation'
)

# Load artifact
artifact       = eval_run.use_artifact(ARTIFACT_REF, type='model')
download_dir   = artifact.download()
model_path     = os.path.join(download_dir, 'best_model.pth')
print(f"Loaded model artifact to: {model_path}")

# # Load vocab (replace with your actual vocab load)
# roman2idx, idx2roman = ...  # dicts
# dev2idx, idx2dev     = ...

# Dataset and DataLoader
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False, collate_fn=pad_batch)

# Define your Seq2Seq model
model = Seq2Seq(
    input_dim=len(roman2idx),
    output_dim=len(dev2idx),
    emb_dim=cfg.get('emb_dim', 256),
    enc_hidden_dim=cfg.get('enc_hidden_dim', 512),
    dec_hidden_dim=cfg.get('dec_hidden_dim', 512),
    enc_layers=cfg.get('enc_layers', 1),
    dec_layers=cfg.get('dec_layers', 1),
    dropout=cfg.get('dropout', 0.1)
)

# Load model state
device = 'cuda' if torch.cuda.is_available() else 'cpu'
state  = torch.load(model_path, map_location=device)
model.load_state_dict(state)
model.to(device).eval()

# Prediction function
def predict(model, src, sos_idx, eos_idx, max_len=32):
    model.eval()
    with torch.no_grad():
        results = []
        for s in src:
            s = s.unsqueeze(0).to(device)
            tgt_seq = [sos_idx]
            for _ in range(max_len):
                tgt_tensor = torch.tensor(tgt_seq).unsqueeze(0).to(device)
                out = model(s, tgt_tensor, teacher_forcing_ratio=0.0)
                next_token = out[0, -1].argmax().item()
                if next_token == eos_idx:
                    break
                tgt_seq.append(next_token)
            results.append(tgt_seq[1:])
        return results

# Evaluate and collect samples
all_romans, all_preds, all_refs = [], [], []
correct, total = 0, 0

for src_batch, tgt_batch in test_loader:
    src_batch = src_batch.to(device)
    tgt_batch = tgt_batch.to(device)
    preds = predict(model, src_batch, dev2idx["<sos>"], dev2idx["<eos>"])

    for src, pred, tgt in zip(src_batch, preds, tgt_batch):
        roman = ''.join([idx2roman[idx.item()] for idx in src if idx2roman[idx.item()] != "<pad>"])
        pred_str = ''.join([idx2dev[idx] for idx in pred])
        ref_str = ''.join([
            idx2dev[idx.item()]
            for idx in tgt[1:]
            if idx.item() not in (dev2idx["<pad>"], dev2idx["<eos>"])
        ])

        all_romans.append(roman)
        all_preds.append(pred_str)
        all_refs.append(ref_str)

        # Simple accuracy comparison
        if pred_str == ref_str:
            correct += 1
        total += 1

# Final test accuracy
test_acc = 100.0 * correct / total
print(f"Test Accuracy: {test_acc:.2f}%")

# Visualization
df = pd.DataFrame({
    "Roman Input": all_romans[:10],
    "Predicted Tamil": all_preds[:10],
    "Ground Truth": all_refs[:10]
})
print(df)

fig, ax = plt.subplots(figsize=(10, 2))
ax.axis('off')
table = ax.table(cellText=df.values, colLabels=df.columns, cellLoc='center', loc='center')
table.scale(1, 2)
plt.title("Sample Transliteration Predictions")
plt.show()

eval_run.finish()


[34m[1mwandb[0m:   1 of 1 files downloaded.  


Loaded model artifact to: /mnt/e_disk/DA6401_Assignment3/artifacts/best-model:v40/best_model.pth


TypeError: Seq2Seq.__init__() got an unexpected keyword argument 'input_dim'

# Question 5 (20 Marks)

Now add an attention network to your basis sequence to sequence model and train the model again. For the sake of simplicity you can use a single layered encoder and a single layered decoder (if you want you can use multiple layers also). Please answer the following questions:

(a) Did you tune the hyperparameters again? If yes please paste appropriate plots below.

(b) Evaluate your best model on the test set and report the accuracy. Also upload all the predictions on the test set in a folder **predictions_attention** on your github project.

(c) Does the attention based model perform better than the vanilla model? If so, can you check some of the errors that this model corrected and note down your inferences (i.e., outputs which were predicted incorrectly by your best seq2seq model are predicted correctly by this model)

(d) In a 3 x 3 grid paste the attention heatmaps for 10 inputs from your test data (read up on what are attention heatmaps).


# Question 6 (20 Marks)

This a challenge question and most of you will find it hard. 

I like the visualisation in the figure captioned "Connectivity" in this [article](https://distill.pub/2019/memorization-in-rnns/#appendix-autocomplete). Make a similar visualisation for your model. Please look at this [blog](https://medium.com/data-science/visualising-lstm-activations-in-keras-b50206da96ff) for some starter code. The goal is to figure out the following: When the model is decoding the $i$-th character in the output which is the input character that it is looking at?

Have fun!