Of the neural architectures tried, the best results came from an RNN (GRU) based sequence-to-sequence model with attention.
The model uses a Bidirectional GRU (BiGRU) encoder, an attention mechanism, and a GRU-based decoder.
Model Components:

(a) Encoder (BiGRU):
•	Inputs: A sequence of word indices (source sentence).
•	Embedding layer: Converts word indices into dense vector representations.
•	Bidirectional GRU: Processes the sentence in both forward and backward directions, concatenating hidden states.
•	Final hidden state: A linear layer transforms the concatenated hidden state to the correct shape before passing it to the decoder.

(b) Attention Mechanism:
•	Computes alignment scores between the decoder hidden state and encoder outputs.
•	Uses a feedforward network to learn attention scores.
•	Applies a softmax over the scores to obtain attention weights.
•	These weights determine how much focus each encoder output should get when generating the next word.
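The scoring steps listed above can be sketched in plain Python. This is a minimal illustration rather than the notebook's PyTorch module: the function name `additive_attention`, the toy weight matrix `W`, the vector `v`, and all numeric values are assumptions made up for the example.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(hidden, encoder_outputs, W, v):
    """Score each encoder output against the decoder hidden state.

    hidden:          list of H floats (decoder hidden state)
    encoder_outputs: list of T lists, each with E floats
    W:               matrix with rows of length H + E (feedforward layer)
    v:               list with one weight per row of W
    """
    scores = []
    for enc in encoder_outputs:
        concat = hidden + enc  # [hidden; encoder output]
        energy = [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder outputs
    dim = len(encoder_outputs[0])
    context = [sum(a * enc[d] for a, enc in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context

# Toy values, chosen only for illustration
hidden = [0.5, -0.2]
encoder_outputs = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 0.3]]
W = [[0.1, 0.2, 0.3, -0.1, 0.4], [-0.3, 0.1, 0.2, 0.5, -0.2]]
v = [1.0, -1.0]

weights, context = additive_attention(hidden, encoder_outputs, W, v)
print(weights)  # one weight per source position, summing to 1
print(context)  # same dimensionality as a single encoder output
```

In the actual model the encoder outputs have size 2 × HIDDEN_SIZE because the encoder is bidirectional, which is why the scoring layer takes 3 × HIDDEN_SIZE inputs.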

(c) Decoder (GRU + Attention):
•	Inputs: Previous word, decoder hidden state, and encoder outputs.
•	Uses an attention mechanism to get a context vector.
•	GRU processes the combined input (embedded word + context vector).
•	A fully connected layer predicts the next word.
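•	During training, teacher forcing is applied with probability TEACHER_FORCING_RATIO, decided once per batch: the actual previous target word is fed as input; otherwise the decoder's own previous prediction is fed back.
•	At inference time, beam search is used to decode the output sequence.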
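To show why beam search can beat greedy decoding, here is a small self-contained sketch. It is not the notebook's inference loop: the `beam_search` helper, the `TABLE` of next-token probabilities, and the tokens `a`, `b`, `x`, `y`, `z` are hypothetical, and unlike the notebook's loop this version stops early once every hypothesis has ended.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token generated
TABLE = {
    "<SOS>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "<EOS>": 0.1},
    "x": {"<EOS>": 1.0},
    "y": {"<EOS>": 1.0},
    "z": {"<EOS>": 1.0},
}

def step_fn(seq):
    # Stand-in for the decoder: log-probabilities of each possible next token
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

def beam_search(step_fn, sos, eos, beam_width, max_len):
    beam = [([sos], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            if seq[-1] == eos:
                # Finished hypotheses are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the beam_width highest-scoring hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beam):
            break
    return beam[0]

best_seq, best_score = beam_search(step_fn, "<SOS>", "<EOS>", beam_width=2, max_len=5)
greedy_seq, greedy_score = beam_search(step_fn, "<SOS>", "<EOS>", beam_width=1, max_len=5)

print(best_seq, round(math.exp(best_score), 2))
print(greedy_seq, round(math.exp(greedy_score), 2))
```

With a beam width of 1 the search commits to `a` (probability 0.6) and ends with total probability 0.3, while a width of 2 keeps `b` alive and finds the sequence through `z` with probability 0.36.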

(d) Loss function: cross-entropy loss, with padding positions (index 0) ignored.
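The effect of ignoring padding can be sketched without PyTorch. The helper `masked_cross_entropy` below is a hypothetical stand-in written for illustration; it averages the negative log-likelihood over non-padding targets only, which is how a mean-reduced cross-entropy with an ignored index behaves. Returning 0.0 when every target is padding is a choice made for this sketch.

```python
import math

PAD = 0  # index of <PAD> in the vocabulary

def masked_cross_entropy(logits, targets, ignore_index=PAD):
    """Mean cross-entropy over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        m = max(row)
        # log of the softmax denominator, computed stably
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    # Sketch-only convention: 0.0 if every target was padding
    return total / count if count else 0.0

# Two positions over a 3-word vocabulary; the second target is padding
logits = [[0.0, 0.0, 0.0], [2.0, 1.0, 0.0]]
targets = [1, PAD]

loss = masked_cross_entropy(logits, targets)
print(round(loss, 4))  # 1.0986, i.e. log(3) from the single real token
```

Without the mask, the padded position would be scored as if `<PAD>` were a word to predict, pulling the average toward whatever the model outputs on padding.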

(e) Hyperparameters:
HIDDEN_SIZE = 512
EMBED_SIZE = 512
BATCH_SIZE = 256
SEQ_LENGTH = 15
DROPOUT = 0.3
LEARNING_RATE = 0.001
EPOCHS = 22
TEACHER_FORCING_RATIO = 0.5
BEAM_WIDTH = 4



In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import json
import numpy as np
import re
import string
import nltk
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
## Mount Google drive to import and export Data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
############### English-Bengali Translation starts here #################

In [5]:
with open('/content/drive/MyDrive/NLP Data/train_data1.json', 'r') as file: # Replace this path with the dataset path in your local machine
    data = json.load(file)

In [6]:
# Process JSON data
source_sentences_train = []
target_sentences_train = []

source_sentences_val = []
target_sentences_val = []

id_train = []
id_val = []

In [7]:
for language_pair, language_data in data.items():
  print(f"Language Pair: {language_pair}")

Language Pair: English-Bengali
Language Pair: English-Hindi


In [8]:
for language_pair, language_data in data.items():
    if(language_pair == "English-Bengali"):
      print(f"Language Pair: {language_pair}")
      for data_type, data_entries in language_data.items():
          print(f"  Data Type: {data_type}")
          for entry_id, entry_data in data_entries.items():
              source = entry_data["source"]
              target = entry_data["target"]
              if (data_type == "Test"):
                source_sentences_val.append(source)
                target_sentences_val.append(target)
                id_val.append(entry_id)
              else:
                source_sentences_train.append(source)
                target_sentences_train.append(target)
                id_train.append(entry_id)

Language Pair: English-Bengali
  Data Type: Train


In [9]:
with open('/content/drive/MyDrive/NLP Data/test_data1_final.json', 'r') as file: # Replace this path with the dataset path in your local machine
    data = json.load(file)

In [10]:
source_sentences_val = []
for language_pair, language_data in data.items():
    if language_pair == "English-Bengali":
        print(f"Language Pair: {language_pair}")
        for data_type, data_entries in language_data.items():
            print(f"  Data Type: {data_type}")
            if data_type == "Test":
                for entry_data in data_entries.values():
                    source_sentences_val.append(entry_data["source"])

Language Pair: English-Bengali
  Data Type: Test


In [11]:
print(len(source_sentences_train))
print(len(target_sentences_train))

print(len(source_sentences_val))
print(len(target_sentences_val))

68849
68849
19672
0


In [12]:
import nltk
from collections import Counter

In [13]:
# Function to preprocess and remove punctuation and numbers
def preprocess_and_remove_punctuation(sentence):
    # Remove punctuation and numbers
    sentence = ''.join([char for char in sentence if char not in string.punctuation and not char.isdigit()])
    #sentence = ''.join([char for char in sentence if char not in string.punctuation ])
    return sentence

In [14]:
# Tokenization and Lowercasing
def preprocess(sentences):
    tokenized_sentences = [nltk.word_tokenize(preprocess_and_remove_punctuation(sentence.lower())) for sentence in sentences]
    return tokenized_sentences

In [None]:
#target_sentences_train = [re.sub(r'[a-zA-Z]','',bn) for bn in target_sentences_train] #optional

In [15]:
english_tokens = preprocess(source_sentences_train)
english_test=preprocess(source_sentences_val)
bengali_tokens = preprocess(target_sentences_train)
bengali_test=preprocess(target_sentences_val)

In [16]:
en_train=english_tokens
en_test=english_test
bn_train=bengali_tokens
bn_test=bengali_test

In [17]:
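en_index2word = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
bn_index2word = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# Track seen tokens in sets so each membership check is O(1)
# instead of scanning the whole list
en_seen = set(en_index2word)
bn_seen = set(bn_index2word)

for ds in [en_train, en_test]:
    for sent in ds:
        for token in sent:
            if token not in en_seen:
                en_seen.add(token)
                en_index2word.append(token)

for ds in [bn_train, bn_test]:
    for sent in ds:
        for token in sent:
            if token not in bn_seen:
                bn_seen.add(token)
                bn_index2word.append(token)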

In [18]:
len(en_index2word)

60964

In [19]:
len(bn_index2word)

102438

In [20]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [21]:
en_word2index = {token: idx for idx, token in enumerate(en_index2word)}
bn_word2index = {token: idx for idx, token in enumerate(bn_index2word)}

In [22]:
len(bn_word2index)

102438

In [23]:
len(en_word2index)

60964

In [24]:
en_lengths = sum([len(sent) for sent in en_train])/len(en_train)
bn_lengths = sum([len(sent) for sent in bn_train])/len(bn_train)

In [25]:
seq_length = 25

In [26]:
def encode_and_pad(vocab, sent, max_length):
    sos = [vocab["<SOS>"]]
    eos = [vocab["<EOS>"]]
    pad = [vocab["<PAD>"]]
    unk = vocab["<UNK>"]

    # Map out-of-vocabulary tokens to <UNK> instead of raising KeyError
    encoded = [vocab.get(w, unk) for w in sent]

    if len(sent) < max_length - 2: # -2 for SOS and EOS
        n_pads = max_length - 2 - len(sent)
        return sos + encoded + eos + pad * n_pads
    else: # sent fills or exceeds max_length; truncating
        truncated = encoded[:max_length - 2]
        return sos + truncated + eos

In [27]:
en_train_encoded = [encode_and_pad(en_word2index, sent, seq_length) for sent in en_train]
en_test_encoded = [encode_and_pad(en_word2index, sent, seq_length) for sent in en_test]
bn_train_encoded = [encode_and_pad(bn_word2index, sent, seq_length) for sent in bn_train]
bn_test_encoded = [encode_and_pad(bn_word2index, sent, seq_length) for sent in bn_test]

In [28]:
en_train_encoded[1]

[1,
 24,
 25,
 26,
 27,
 9,
 28,
 29,
 30,
 31,
 25,
 32,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [30]:
bn_train_encoded[1]

[1,
 4,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [29]:
batch_size = 256

train_x = np.array(en_train_encoded)
train_y = np.array(bn_train_encoded)
test_x = np.array(en_test_encoded)
test_y = np.array(bn_test_encoded)

train_ds = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
test_ds = TensorDataset(torch.from_numpy(test_x))


train_dl = DataLoader(train_ds, shuffle=True, batch_size=batch_size, drop_last=True)

In [31]:
hidden_size = 512

In [32]:
losses = []

In [34]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import random
import torch.nn.functional as F
from tqdm import tqdm
from nltk.translate.bleu_score import corpus_bleu

# Hyperparameters
HIDDEN_SIZE = 512
EMBED_SIZE = 512
BATCH_SIZE = 256
SEQ_LENGTH = 15
DROPOUT = 0.3
LEARNING_RATE = 0.001
EPOCHS = 15
TEACHER_FORCING_RATIO = 0.5


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Encoder with BiGRU
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, EMBED_SIZE)
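        # dropout is omitted: nn.GRU applies it only between stacked layers,
        # so with a single layer it has no effect and only triggers a warning
        self.gru = nn.GRU(EMBED_SIZE, hidden_size,
                         bidirectional=True,
                         batch_first=True)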
        self.fc = nn.Linear(hidden_size*2, hidden_size)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, hidden = self.gru(embedded)

        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        hidden = torch.tanh(self.fc(hidden)).unsqueeze(0)

        return outputs, hidden

# Attention Mechanism
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.attn = nn.Linear(hidden_size*3, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        batch_size = encoder_outputs.size(0)
        seq_len = encoder_outputs.size(1)

        hidden = hidden.repeat(seq_len, 1, 1).permute(1,0,2)
        energy = torch.cat((hidden, encoder_outputs), dim=2)
        energy = torch.tanh(self.attn(energy))
        attention = self.v(energy).squeeze(2)

        return F.softmax(attention, dim=1)

# Decoder with Attention
class Decoder(nn.Module):
    def __init__(self, output_size, hidden_size):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size

        self.embedding = nn.Embedding(output_size, EMBED_SIZE)
        self.attention = Attention(hidden_size)
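        # dropout is omitted: nn.GRU applies it only between stacked layers,
        # so with a single layer it has no effect and only triggers a warning
        self.gru = nn.GRU(EMBED_SIZE + hidden_size*2, hidden_size,
                         batch_first=True)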
        self.fc = nn.Linear(hidden_size*3, output_size)
        self.dropout = nn.Dropout(DROPOUT)

    def forward(self, x, hidden, encoder_outputs):
        x = x.unsqueeze(1)
        embedded = self.dropout(self.embedding(x))

        attn_weights = self.attention(hidden, encoder_outputs)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)

        gru_input = torch.cat((embedded, context), dim=2)
        output, hidden = self.gru(gru_input, hidden)

        output = torch.cat((output.squeeze(1), context.squeeze(1)), dim=1)
        prediction = self.fc(output)

        return prediction, hidden, attn_weights



# Training Setup
encoder = Encoder(len(en_index2word), HIDDEN_SIZE).to(device)
decoder = Decoder(len(bn_index2word), HIDDEN_SIZE).to(device)

optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                     lr=LEARNING_RATE)
# Loss
criterion = nn.CrossEntropyLoss(ignore_index=0)

# Training loop

encoder.train()
decoder.train()

for epoch in range(EPOCHS):
  total_loss = 0
  for batch_idx, (src, trg) in tqdm(enumerate(train_dl), total=len(train_dl)):
      src, trg = src.to(device), trg.to(device)
      batch_size = src.size(0)

      encoder_outputs, encoder_hidden = encoder(src)
      decoder_input = torch.tensor([bn_word2index["<SOS>"]]*batch_size, device=device)
      decoder_hidden = encoder_hidden

      loss = 0
      use_teacher_forcing = random.random() < TEACHER_FORCING_RATIO

      for t in range(1, SEQ_LENGTH):
          decoder_output, decoder_hidden, _ = decoder(
              decoder_input,
              decoder_hidden,
              encoder_outputs
          )

          loss += criterion(decoder_output, trg[:, t])
          decoder_input = trg[:, t] if use_teacher_forcing else decoder_output.argmax(1)

      optimizer.zero_grad()
      loss.backward()
      torch.nn.utils.clip_grad_norm_(encoder.parameters(), 1.0)
      torch.nn.utils.clip_grad_norm_(decoder.parameters(), 1.0)
      optimizer.step()

      total_loss += loss.item()/SEQ_LENGTH

  print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_dl):.4f}')



100%|██████████| 268/268 [06:07<00:00,  1.37s/it]


Epoch 1, Loss: 7.3330


100%|██████████| 268/268 [06:12<00:00,  1.39s/it]


Epoch 2, Loss: 5.5208


100%|██████████| 268/268 [06:10<00:00,  1.38s/it]


Epoch 3, Loss: 4.0233


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]


Epoch 4, Loss: 3.2361


100%|██████████| 268/268 [06:11<00:00,  1.38s/it]


Epoch 5, Loss: 2.9795


100%|██████████| 268/268 [06:11<00:00,  1.38s/it]


Epoch 6, Loss: 2.6667


100%|██████████| 268/268 [06:11<00:00,  1.38s/it]


Epoch 7, Loss: 2.3507


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]


Epoch 8, Loss: 2.1034


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]


Epoch 9, Loss: 1.8340


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]


Epoch 10, Loss: 1.6331


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]


Epoch 11, Loss: 1.4657


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]


Epoch 12, Loss: 1.3777


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]


Epoch 13, Loss: 1.2634


100%|██████████| 268/268 [06:10<00:00,  1.38s/it]


Epoch 14, Loss: 1.1726


100%|██████████| 268/268 [06:11<00:00,  1.39s/it]

Epoch 15, Loss: 1.0780





In [40]:
# Inference with beam search
beam_width=5
encoder.eval()
decoder.eval()
hypotheses = []

# Load Test data
with open('/content/drive/MyDrive/NLP Data/test_data1_final.json', 'r') as f:
    val_data = json.load(f)

# Extract English-Bengali sources
val_sources = []
if "English-Bengali" in val_data:
    Test_entries = val_data["English-Bengali"].get("Test", {})
    for entry in Test_entries.values():
        val_sources.append(entry["source"])

# Preprocess and encode
val_encoded = []
valid_sources = []
for sent in val_sources:
    try:
        cleaned = preprocess_and_remove_punctuation(sent.lower())
        tokenized = nltk.word_tokenize(cleaned)
        encoded = encode_and_pad(en_word2index, tokenized, SEQ_LENGTH)
        val_encoded.append(encoded)
        valid_sources.append(sent)
    except Exception as e:
        print(f"Skipping: {sent[:50]}... | Error: {str(e)}")
        continue

val_ds = TensorDataset(torch.LongTensor(val_encoded))

print(f"\nValidating on {len(val_ds)} samples...")
val_outs = []
# Translation loop with beam search
with torch.no_grad():
    for i in range(len(val_ds)):
        src = val_ds[i][0].unsqueeze(0).to(device)

        # Encode
        encoder_outputs, encoder_hidden = encoder(src)

        # Initialize beam search
        beam = [(torch.tensor([bn_word2index["<SOS>"]], device=device), encoder_hidden, [], 0)]

        for t in range(SEQ_LENGTH):
            new_beam = []
            for decoder_input, decoder_hidden, sequence, score in beam:
                if decoder_input.item() == bn_word2index["<EOS>"]:
                    new_beam.append((decoder_input, decoder_hidden, sequence, score))
                    continue

                # Decode
                decoder_output, decoder_hidden, _ = decoder(
                    decoder_input,
                    decoder_hidden,
                    encoder_outputs)

                # Get top k candidates
                log_probs = torch.log_softmax(decoder_output, dim=-1)
                topk_probs, topk_ids = log_probs.topk(beam_width)

                for j in range(beam_width):
                    token_id = topk_ids[0][j].item()
                    token_prob = topk_probs[0][j].item()
                    new_score = score + token_prob

                    if token_id == bn_word2index["<EOS>"]:
                        new_beam.append((torch.tensor([token_id], device=device), decoder_hidden, sequence, new_score))
                    else:
                        if token_id < len(bn_index2word):
                            token = bn_index2word[token_id]
                        else:
                            token = "<UNK>"

                        if token not in ["<PAD>", "<SOS>", "<EOS>"]:
                            new_sequence = sequence + [token]
                        else:
                            new_sequence = sequence

                        new_beam.append((torch.tensor([token_id], device=device), decoder_hidden, new_sequence, new_score))

            # Keep top k hypotheses
            beam = sorted(new_beam, key=lambda x: x[3], reverse=True)[:beam_width]

        # Select the best hypothesis
        best_hypothesis = max(beam, key=lambda x: x[3])
        val_outs.append(" ".join(best_hypothesis[2]))


# Collect the IDs from the file loaded in this cell, in file order, so they align with val_outs
val_ids = list(val_data["English-Bengali"]["Test"].keys())
# val_outs
print('Test Complete')


Validating on 19672 samples...
Test Complete


In [47]:
# Inferred data to dataframe
df0 = pd.DataFrame()

df0["ID"] = val_ids
df0["Translation"] = val_outs

df0.head(5)

   ID      Translation
0  177039  বর্তমান ঘটনাগুলি ঘটনা
1  177040  ভগবানের জায়গা হিসাবে তাঁর নিউক্লিয়ার তাঁকে হয...
2  177041  বুকে বুকে নেওয়ার পরে বুকের অথবা পেটে করার করার...
3  177042  সেই সময়ে যখন সে সেই বোঝা গেছে যে বাচ্চার বাচ্চ...
4  177043  অস্ট্রেলিয়া হল আখের উল্লেখযোগ্য গুরুত্বপূর্ণ ...


In [None]:
df0 = pd.DataFrame()

df0["ID"] = val_ids
df0["Translation"] = val_outs

df0.head(5)

   ID      Translation
0  177039  বর্তমান অনুষ্ঠান
1  177040  কিন্তু যেই তার মা বাবার কিন্তু তাঁকে কিন্তু তা...
2  177041  বুকে রোগের পর চার বার অথবা চার বার করার পর আরা...
3  177042  তার এটা উপর জোর দেওয়া হয় আর তার নিজের সময়ে ধরা...
4  177043  এর থেকে উল্লেখযোগ্য উল্লেখযোগ্য গুরুত্বপূর্ণ ।


In [None]:
# Saving dataframe to csv
df0.to_csv("/content/drive/MyDrive/NLP Data OP files/final_test_1/answersB.csv", index = False)

In [None]:
############### English-Hindi Translation starts here #################

In [None]:
with open('/content/drive/MyDrive/NLP Data/train_data1.json', 'r') as file:
    data = json.load(file)

In [None]:
# Process JSON data
source_sentences_train = []
target_sentences_train = []

source_sentences_val = []
target_sentences_val = []

id_train = []
id_val = []

In [None]:
for language_pair, language_data in data.items():
  print(f"Language Pair: {language_pair}")

Language Pair: English-Bengali
Language Pair: English-Hindi


In [None]:
for language_pair, language_data in data.items():
    if(language_pair == "English-Hindi"):
      print(f"Language Pair: {language_pair}")
      for data_type, data_entries in language_data.items():
          print(f"  Data Type: {data_type}")
          for entry_id, entry_data in data_entries.items():
              source = entry_data["source"]
              target = entry_data["target"]
              if (data_type == "Test"):
                source_sentences_val.append(source)
                target_sentences_val.append(target)
                id_val.append(entry_id)
              else:
                source_sentences_train.append(source)
                target_sentences_train.append(target)
                id_train.append(entry_id)

Language Pair: English-Hindi
  Data Type: Train


In [None]:
with open('/content/drive/MyDrive/NLP Data/test_data1_final.json', 'r') as file:
    data = json.load(file)

In [None]:
source_sentences_val = []
for language_pair, language_data in data.items():
    if language_pair == "English-Hindi":
        print(f"Language Pair: {language_pair}")
        for data_type, data_entries in language_data.items():
            print(f"  Data Type: {data_type}")
            if data_type == "Test":
                for entry_data in data_entries.values():
                    source_sentences_val.append(entry_data["source"])

Language Pair: English-Hindi
  Data Type: Test


In [None]:
print(len(source_sentences_train))
print(len(target_sentences_train))

print(len(source_sentences_val))
print(len(target_sentences_val))

80797
80797
23085
0


In [None]:
import nltk
from collections import Counter

In [None]:
# Function to preprocess and remove punctuation and numbers
def preprocess_and_remove_punctuation(sentence):
    # Remove punctuation and numbers
    sentence = ''.join([char for char in sentence if char not in string.punctuation and not char.isdigit()])
    #sentence = ''.join([char for char in sentence if char not in string.punctuation ])
    return sentence

In [None]:
# Tokenization and Lowercasing
def preprocess(sentences):
    tokenized_sentences = [nltk.word_tokenize(preprocess_and_remove_punctuation(sentence.lower())) for sentence in sentences]
    return tokenized_sentences

In [None]:
#target_sentences_train = [re.sub(r'[a-zA-Z]','',hi) for hi in target_sentences_train] #optional

In [None]:
english_tokens = preprocess(source_sentences_train)
english_test=preprocess(source_sentences_val)
hindi_tokens = preprocess(target_sentences_train)
hindi_test=preprocess(target_sentences_val)

In [None]:
en_train=english_tokens
en_test=english_test
hi_train=hindi_tokens
hi_test=hindi_test

In [None]:
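en_index2word = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
hi_index2word = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# Track seen tokens in sets so each membership check is O(1)
# instead of scanning the whole list
en_seen = set(en_index2word)
hi_seen = set(hi_index2word)

for ds in [en_train, en_test]:
    for sent in ds:
        for token in sent:
            if token not in en_seen:
                en_seen.add(token)
                en_index2word.append(token)

for ds in [hi_train, hi_test]:
    for sent in ds:
        for token in sent:
            if token not in hi_seen:
                hi_seen.add(token)
                hi_index2word.append(token)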

In [None]:
len(en_index2word)

64779

In [None]:
len(hi_index2word)

75179

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
en_word2index = {token: idx for idx, token in enumerate(en_index2word)}
hi_word2index = {token: idx for idx, token in enumerate(hi_index2word)}

In [None]:
len(en_word2index)

64779

In [None]:
len(hi_word2index)

75179

In [None]:
en_lengths = sum([len(sent) for sent in en_train])/len(en_train)
hi_lengths = sum([len(sent) for sent in hi_train])/len(hi_train)

In [None]:
seq_length = 25

In [None]:
def encode_and_pad(vocab, sent, max_length):
    sos = [vocab["<SOS>"]]
    eos = [vocab["<EOS>"]]
    pad = [vocab["<PAD>"]]
    unk = vocab["<UNK>"]

    # Map out-of-vocabulary tokens to <UNK> instead of raising KeyError
    encoded = [vocab.get(w, unk) for w in sent]

    if len(sent) < max_length - 2: # -2 for SOS and EOS
        n_pads = max_length - 2 - len(sent)
        return sos + encoded + eos + pad * n_pads
    else: # sent fills or exceeds max_length; truncating
        truncated = encoded[:max_length - 2]
        return sos + truncated + eos

In [None]:
en_train_encoded = [encode_and_pad(en_word2index, sent, seq_length) for sent in en_train]
en_test_encoded = [encode_and_pad(en_word2index, sent, seq_length) for sent in en_test]
hi_train_encoded = [encode_and_pad(hi_word2index, sent, seq_length) for sent in hi_train]
hi_test_encoded = [encode_and_pad(hi_word2index, sent, seq_length) for sent in hi_test]

In [None]:
batch_size = 256

train_x = np.array(en_train_encoded)
train_y = np.array(hi_train_encoded)
test_x = np.array(en_test_encoded)
test_y = np.array(hi_test_encoded)

train_ds = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
test_ds = TensorDataset(torch.from_numpy(test_x))


train_dl = DataLoader(train_ds, shuffle=True, batch_size=batch_size, drop_last=True)

In [None]:
hidden_size = 512

In [None]:
seq_length = 25

In [None]:
losses = []

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import random
import torch.nn.functional as F
from tqdm import tqdm
from nltk.translate.bleu_score import corpus_bleu

# Hyperparameters
HIDDEN_SIZE = 512
EMBED_SIZE = 512
BATCH_SIZE = 256
SEQ_LENGTH = 25
DROPOUT = 0.3
LEARNING_RATE = 0.001
EPOCHS = 15
TEACHER_FORCING_RATIO = 0.5


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Encoder with BiGRU
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, EMBED_SIZE)
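        # dropout is omitted: nn.GRU applies it only between stacked layers,
        # so with a single layer it has no effect and only triggers a warning
        self.gru = nn.GRU(EMBED_SIZE, hidden_size,
                         bidirectional=True,
                         batch_first=True)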
        self.fc = nn.Linear(hidden_size*2, hidden_size)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, hidden = self.gru(embedded)

        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        hidden = torch.tanh(self.fc(hidden)).unsqueeze(0)

        return outputs, hidden

# Attention Mechanism
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.attn = nn.Linear(hidden_size*3, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        batch_size = encoder_outputs.size(0)
        seq_len = encoder_outputs.size(1)

        hidden = hidden.repeat(seq_len, 1, 1).permute(1,0,2)
        energy = torch.cat((hidden, encoder_outputs), dim=2)
        energy = torch.tanh(self.attn(energy))
        attention = self.v(energy).squeeze(2)

        return F.softmax(attention, dim=1)

# Decoder with Attention
class Decoder(nn.Module):
    def __init__(self, output_size, hidden_size):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size

        self.embedding = nn.Embedding(output_size, EMBED_SIZE)
        self.attention = Attention(hidden_size)
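        # dropout is omitted: nn.GRU applies it only between stacked layers,
        # so with a single layer it has no effect and only triggers a warning
        self.gru = nn.GRU(EMBED_SIZE + hidden_size*2, hidden_size,
                         batch_first=True)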
        self.fc = nn.Linear(hidden_size*3, output_size)
        self.dropout = nn.Dropout(DROPOUT)

    def forward(self, x, hidden, encoder_outputs):
        x = x.unsqueeze(1)
        embedded = self.dropout(self.embedding(x))

        attn_weights = self.attention(hidden, encoder_outputs)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)

        gru_input = torch.cat((embedded, context), dim=2)
        output, hidden = self.gru(gru_input, hidden)

        output = torch.cat((output.squeeze(1), context.squeeze(1)), dim=1)
        prediction = self.fc(output)

        return prediction, hidden, attn_weights



# Training Setup
encoder = Encoder(len(en_index2word), HIDDEN_SIZE).to(device)
decoder = Decoder(len(hi_index2word), HIDDEN_SIZE).to(device)

optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                     lr=LEARNING_RATE)

# Loss
criterion = nn.CrossEntropyLoss(ignore_index=0)

# Training loop

encoder.train()
decoder.train()

for epoch in range(EPOCHS):
  total_loss = 0
  for batch_idx, (src, trg) in tqdm(enumerate(train_dl), total=len(train_dl)):
      src, trg = src.to(device), trg.to(device)
      batch_size = src.size(0)

      encoder_outputs, encoder_hidden = encoder(src)
      decoder_input = torch.tensor([hi_word2index["<SOS>"]]*batch_size, device=device)
      decoder_hidden = encoder_hidden

      loss = 0
      use_teacher_forcing = random.random() < TEACHER_FORCING_RATIO

      for t in range(1, SEQ_LENGTH):
          decoder_output, decoder_hidden, _ = decoder(
              decoder_input,
              decoder_hidden,
              encoder_outputs
          )

          loss += criterion(decoder_output, trg[:, t])
          decoder_input = trg[:, t] if use_teacher_forcing else decoder_output.argmax(1)

      optimizer.zero_grad()
      loss.backward()
      torch.nn.utils.clip_grad_norm_(encoder.parameters(), 1.0)
      torch.nn.utils.clip_grad_norm_(decoder.parameters(), 1.0)
      optimizer.step()

      total_loss += loss.item()/SEQ_LENGTH

  print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_dl):.4f}')



100%|██████████| 315/315 [09:13<00:00,  1.76s/it]


Epoch 1, Loss: 6.0218


100%|██████████| 315/315 [09:22<00:00,  1.78s/it]


Epoch 2, Loss: 4.5396


100%|██████████| 315/315 [09:23<00:00,  1.79s/it]


Epoch 3, Loss: 3.5794


100%|██████████| 315/315 [09:22<00:00,  1.79s/it]


Epoch 4, Loss: 3.1256


100%|██████████| 315/315 [09:22<00:00,  1.79s/it]


Epoch 5, Loss: 2.8819


100%|██████████| 315/315 [09:22<00:00,  1.79s/it]


Epoch 6, Loss: 2.7215


100%|██████████| 315/315 [09:23<00:00,  1.79s/it]


Epoch 7, Loss: 2.5237


100%|██████████| 315/315 [09:23<00:00,  1.79s/it]


Epoch 8, Loss: 2.3660


100%|██████████| 315/315 [09:22<00:00,  1.79s/it]


Epoch 9, Loss: 2.2135


100%|██████████| 315/315 [09:22<00:00,  1.79s/it]


Epoch 10, Loss: 2.0152


100%|██████████| 315/315 [09:23<00:00,  1.79s/it]


Epoch 11, Loss: 1.9918


100%|██████████| 315/315 [09:23<00:00,  1.79s/it]


Epoch 12, Loss: 1.8745


100%|██████████| 315/315 [09:24<00:00,  1.79s/it]


Epoch 13, Loss: 1.7860


100%|██████████| 315/315 [09:23<00:00,  1.79s/it]


Epoch 14, Loss: 1.7225


100%|██████████| 315/315 [09:24<00:00,  1.79s/it]

Epoch 15, Loss: 1.6773





In [None]:
# Inference
beam_width=5
encoder.eval()
decoder.eval()
hypotheses = []

# Load Test data
with open('/content/drive/MyDrive/NLP Data/test_data1_final.json', 'r') as f:
    val_data = json.load(f)

# Extract English-Hindi sources
val_sources = []
if "English-Hindi" in val_data:
    Test_entries = val_data["English-Hindi"].get("Test", {})
    for entry in Test_entries.values():
        val_sources.append(entry["source"])

# Preprocess and encode
val_encoded = []
valid_sources = []
for sent in val_sources:
    try:
        cleaned = preprocess_and_remove_punctuation(sent.lower())
        tokenized = nltk.word_tokenize(cleaned)
        encoded = encode_and_pad(en_word2index, tokenized, SEQ_LENGTH)
        val_encoded.append(encoded)
        valid_sources.append(sent)
    except Exception as e:
        print(f"Skipping: {sent[:50]}... | Error: {str(e)}")
        continue

val_ds = TensorDataset(torch.LongTensor(val_encoded))

print(f"\nValidating on {len(val_ds)} samples...")
val_outs = []
# Translation loop with beam search
with torch.no_grad():
    for i in range(len(val_ds)):
        src = val_ds[i][0].unsqueeze(0).to(device)

        # Encode
        encoder_outputs, encoder_hidden = encoder(src)

        # Initialize beam search
        beam = [(torch.tensor([hi_word2index["<SOS>"]], device=device), encoder_hidden, [], 0)]  # (decoder_input, hidden, sequence, score)

        for t in range(SEQ_LENGTH):
            new_beam = []
            for decoder_input, decoder_hidden, sequence, score in beam:
                if decoder_input.item() == hi_word2index["<EOS>"]:
                    new_beam.append((decoder_input, decoder_hidden, sequence, score))
                    continue

                # Decode
                decoder_output, decoder_hidden, _ = decoder(
                    decoder_input,
                    decoder_hidden,
                    encoder_outputs
                )

                # Get top k candidates
                log_probs = torch.log_softmax(decoder_output, dim=-1)
                topk_probs, topk_ids = log_probs.topk(beam_width)

                for j in range(beam_width):
                    token_id = topk_ids[0][j].item()
                    token_prob = topk_probs[0][j].item()
                    new_score = score + token_prob

                    if token_id == hi_word2index["<EOS>"]:
                        new_beam.append((torch.tensor([token_id], device=device), decoder_hidden, sequence, new_score))
                    else:
                        if token_id < len(hi_index2word):
                            token = hi_index2word[token_id]
                        else:
                            token = "<UNK>"

                        if token not in ["<PAD>", "<SOS>", "<EOS>"]:
                            new_sequence = sequence + [token]
                        else:
                            new_sequence = sequence

                        new_beam.append((torch.tensor([token_id], device=device), decoder_hidden, new_sequence, new_score))

            # Keep top k hypotheses
            beam = sorted(new_beam, key=lambda x: x[3], reverse=True)[:beam_width]

        # Select the best hypothesis
        best_hypothesis = max(beam, key=lambda x: x[3])
        val_outs.append(" ".join(best_hypothesis[2]))

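# IDs of the test sentences, in dataset order.
# Note: this assumes no sentence was skipped in the encoding loop above;
# a skipped sentence would leave val_ids longer than val_outs.
val_ids = list(data["English-Hindi"]["Test"].keys())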
print('Test Complete')


Validating on 23085 samples...
Test Complete
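
The expand-and-prune logic of the beam-search loop above can be illustrated without the model. The following is a minimal sketch, not the notebook's implementation: `step_fn`, `toy_step`, and `length_penalty` are hypothetical names introduced here. `step_fn` stands in for the decoder call and returns a dictionary of log-probabilities for the next token. With `length_penalty=0.0` hypotheses are ranked by raw cumulative log-probability, as in the loop above; a positive value divides the score by the hypothesis length raised to that power, which is one common way to reduce the bias towards short outputs.

```python
import math


def beam_search(step_fn, sos_id, eos_id, beam_width, max_len, length_penalty=0.0):
    """Beam search over a hypothetical step_fn(tokens) -> {token_id: log_prob}."""

    def rank(hyp):
        tokens, score, _ = hyp
        generated = max(1, len(tokens) - 1)  # do not count <SOS>
        return score / (generated ** length_penalty)

    # Each hypothesis is (token_ids, cumulative_log_prob, finished)
    beam = [([sos_id], 0.0, False)]
    for _ in range(max_len):
        candidates = []
        for tokens, score, finished in beam:
            if finished:
                # Finished hypotheses are carried over unchanged
                candidates.append((tokens, score, True))
                continue
            log_probs = step_fn(tokens)
            # Expand only the beam_width most likely next tokens
            top = sorted(log_probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
            for token_id, log_prob in top:
                candidates.append((tokens + [token_id], score + log_prob, token_id == eos_id))
        # Prune to the best beam_width hypotheses
        beam = sorted(candidates, key=rank, reverse=True)[:beam_width]
        if all(finished for _, _, finished in beam):
            break

    best_tokens, best_score, _ = max(beam, key=rank)
    return [t for t in best_tokens if t not in (sos_id, eos_id)], best_score


def toy_step(tokens):
    # Toy next-token table: 0 = <SOS>, 1 = <EOS>, 2 and 3 are ordinary words
    if tokens[-1] == 0:
        return {2: math.log(0.6), 3: math.log(0.4)}
    if tokens[-1] == 2:
        return {1: math.log(0.5), 3: math.log(0.5)}
    return {1: math.log(0.9), 2: math.log(0.1)}


best_tokens, best_score = beam_search(toy_step, sos_id=0, eos_id=1, beam_width=2, max_len=5)
print(best_tokens, math.exp(best_score))
```

In this toy table, greedy decoding would commit to token 2 first (probability 0.6) and could reach a total probability of at most 0.3, while a beam of width 2 keeps the 0.4 branch alive and finishes on token 3 with total probability 0.36.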


In [None]:
# Inferred data to dataframe
df1 = pd.DataFrame()

df1["ID"] = val_ids
df1["Translation"] = val_outs

df1.head(5)

Unnamed: 0,ID,Translation
0,540139,और फिर हमें छात्रों की जरूरत को जरूरत की आवश्य...
1,540140,जनवरी के लिए पहले क्या कार्यक्रम का पता क्या है
2,540141,इंदिरा गांधी राष्‍ट्रीय उद्यान में जिले का नवम...
3,540142,स्थानीय स्थानीय ने भी जो पत्थर पर आधारित हुए ह...
4,540143,इस इस पर कुछ कुछ भी खास
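
The ID column is built from the full test set, while the encoding loop may skip sentences that raise an exception, so the two lists can drift apart. A minimal sketch of keeping them paired is shown below. It assumes the test split is a mapping from ID to source sentence, which may not match the actual structure of `data`; `encode_with_ids` and `toy_encode` are hypothetical helpers, with `toy_encode` standing in for the notebook's preprocessing and `encode_and_pad` steps.

```python
def encode_with_ids(id_to_sentence, encode_fn):
    """Encode each sentence, keeping IDs only for sentences that encode successfully."""
    kept_ids, encoded, skipped_ids = [], [], []
    for sent_id, sentence in id_to_sentence.items():
        try:
            encoded.append(encode_fn(sentence))
            kept_ids.append(sent_id)
        except Exception:
            # Record the ID so the skipped row can be reported or filled in later
            skipped_ids.append(sent_id)
    return kept_ids, encoded, skipped_ids


def toy_encode(sentence):
    # Hypothetical stand-in for preprocessing + encode_and_pad
    if not sentence.strip():
        raise ValueError("empty sentence")
    return [len(word) for word in sentence.split()]


test_set = {"101": "hello world", "102": "   ", "103": "good morning"}
kept_ids, encoded, skipped_ids = encode_with_ids(test_set, toy_encode)

# IDs and encoded inputs stay the same length by construction
assert len(kept_ids) == len(encoded)
print(kept_ids, skipped_ids)
```

Building the DataFrame from `kept_ids` rather than from the full key list would then avoid a length mismatch when a sentence is skipped.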


In [None]:
# Dataframe to csv
df1.to_csv("/content/drive/MyDrive/NLP Data OP files/final_test_1/answersH.csv", index=False)

In [None]:
# Create DataFrames for Bengali and Hindi from CSV (the Bengali and Hindi models were run
# in two separate notebooks, so the DataFrames have to be recreated from the stored CSV files)
df_B = pd.read_csv("/content/drive/MyDrive/NLP Data OP files/final_test_1/answersB.csv")  # Bengali
df_H = pd.read_csv("/content/drive/MyDrive/NLP Data OP files/final_test_1/answersH.csv")  # Hindi

In [None]:
# Concat and create one single Dataframe
df3 = pd.concat([df_B, df_H])

In [None]:
# Basic data sanity check
df3.head(10)

Unnamed: 0,ID,Translation
0,177039,বর্তমান অনুষ্ঠান
1,177040,কিন্তু যেই তার মা বাবার কিন্তু তাঁকে কিন্তু তা...
2,177041,বুকে রোগের পর চার বার অথবা চার বার করার পর আরা...
3,177042,তার এটা উপর জোর দেওয়া হয় আর তার নিজের সময়ে ধরা...
4,177043,এর থেকে উল্লেখযোগ্য উল্লেখযোগ্য গুরুত্বপূর্ণ ।
5,177044,আমার রাতের খাবারের সাথে কি
6,177045,সালে নিউ নতুন হওয়া উনি সাথে উনি দাবি করেন যে ...
7,177046,মাদ্রাজ হাইকোর্ট দেব বিচারপতি “ হল প্রধান প্রধান।
8,177047,মজার কথা হল যে যে শহরের শহরের শহরের আশেপাশে বা...
9,177048,এটি তাদের তাদের ভারতীয় ভারতীয় পরিচয় করে।


In [None]:
df3.tail(10)

Unnamed: 0,ID,Translation
23075,563214,दुल्हन को अपने हाथों से हुए हाथों के हाथों को ...
23076,563215,गंगा नदी दक्षिण पूर्व ओर दक्षिण पूर्व ओर ओर ओर...
23077,563216,जिस आदमी जिसके जिसके जिसके हुआ हो गया है कि को...
23078,563217,गुलजार का टावर होटल सालों होटल होटल रैफ्लस होट...
23079,563218,रेडियो पर पाकिस्तान से बचने के बाद से उन्होंने...
23080,563219,राम को एक और एक ईमेल भेजें कि यह कि कि कि उसे ...
23081,563220,इस पर पहल की पहल के सरकारी ने सरकारी प्रोफेसर ...
23082,563221,टी प्रारंभिक चरण पर एक मशीन में में पर पर पर प...
23083,563222,बायें पैर के हाथों पैर को हाथ पैर के साथ के घु...
23084,563223,हमारी जल बिल बहुत अधिक था ।


In [None]:
# Create the final output file

In [None]:
df3.to_csv("/content/drive/MyDrive/NLP Data OP files/final_test_1/answersBH.csv", index=False)

In [None]:
filtered_data = pd.read_csv('/content/drive/MyDrive/NLP Data OP files/final_test_1/answersBH.csv')

In [None]:
answer = "/content/drive/MyDrive/NLP Data OP files/final_test_1/answer.csv"
# Write a tab-separated file with each translation wrapped in double quotes
with open(answer, "w", encoding="utf-8") as f:
    f.write("ID\tTranslation\n")
    for row_id, translation in zip(filtered_data["ID"], filtered_data["Translation"]):
        f.write(f'{row_id}\t"{translation}"\n')
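
Writing the quotes by hand works as long as no translation contains a double quote or a tab. As a hedged alternative, the standard `csv` module can produce the same layout and escape embedded quotes by doubling them. Whether the submission format accepts doubled quotes is an assumption that would need checking; `write_answers` and the sample rows are hypothetical.

```python
import csv
import io


def write_answers(rows, handle):
    """Write (ID, translation) pairs as tab-separated text with quoted translations."""
    # The header is written by hand so that it stays unquoted
    handle.write("ID\tTranslation\n")
    writer = csv.writer(
        handle,
        delimiter="\t",
        quoting=csv.QUOTE_NONNUMERIC,  # quote strings, leave integer IDs bare
        lineterminator="\n",
    )
    for row_id, translation in rows:
        writer.writerow([int(row_id), str(translation)])


# In-memory demonstration; for a real file, open it with newline="" and encoding="utf-8"
rows = [(540139, "plain text"), (540140, 'text with "quotes" inside')]
buffer = io.StringIO()
write_answers(rows, buffer)
text = buffer.getvalue()

# Reading the text back recovers the original strings
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(text)
```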

In [None]:
import zipfile
import os

def zip_file(file_path, zip_path):
    """Compress a single file into a zip archive stored at zip_path."""
    try:
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            zipf.write(file_path, os.path.basename(file_path))
        print(f"Successfully zipped '{file_path}' and stored at '{zip_path}'")
    except FileNotFoundError:
        print(f"Error: File not found at '{file_path}'")
    except Exception as e:
        print(f"An error occurred: {e}")

# Define the paths
answer_file = "/content/drive/MyDrive/NLP Data OP files/final_test_1/answer.csv"
zip_store_path = "/content/drive/MyDrive/NLP Data OP files/final_test_1/answer.zip"

# Call the function to zip the file
zip_file(answer_file, zip_store_path)

Successfully zipped '/content/drive/MyDrive/NLP Data OP files/final_test_1/answer.csv' and stored at '/content/drive/MyDrive/NLP Data OP files/final_test_1/answer.zip'
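
Before submitting, it can be worth confirming that the archive opens and contains the expected file. The sketch below is one way to do this with the standard `zipfile` module; `verify_zip` is a hypothetical helper, and the demonstration uses a temporary directory rather than the Drive paths above.

```python
import os
import tempfile
import zipfile


def verify_zip(zip_path, expected_name):
    """Return True if the archive passes its CRC check and contains expected_name."""
    with zipfile.ZipFile(zip_path) as zipf:
        # testzip() returns the name of the first corrupt member, or None if all are fine
        if zipf.testzip() is not None:
            return False
        return expected_name in zipf.namelist()


# Self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "answer.csv")
    zip_path = os.path.join(tmp, "answer.zip")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write('ID\tTranslation\n1\t"example"\n')
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(csv_path, os.path.basename(csv_path))
    archive_ok = verify_zip(zip_path, "answer.csv")
    missing_ok = verify_zip(zip_path, "missing.csv")

print(archive_ok, missing_ok)
```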
