<h1 style="color:#153BE4; font-size:50px; font-family:serif; background:#90BD59; padding-bottom:20px;padding-top:20px; border-radius: 15px;
           font-weight:bold" 
    align="center"> 
    Neural Machine Translation using LSTM (Eng to Kan)
</h1>

<p style="font-size:18px">
    Neural machine translation (NMT) is a well-known NLP task in which the goal is to translate text from one language to another using an encoder-decoder architecture. The encoder reads the source-language sentence and produces an information-rich representation, which is fed into the decoder to generate the words of the target language. Decoding is sequential and autoregressive: the decoder receives one token at a time and predicts the next token, so in this sense it acts as a generative model. The diagram below illustrates the model architecture.
</p>
<img src="https://i.postimg.cc/y8dvcFrH/Sequence-to-sequence-encoder-decoder-NMT-model.jpg"  style="display: block; margin-left: auto; margin-right: auto;"/>
<br>
<p style="font-size:18px">
    For our purpose we use LSTMs for both the encoder and the decoder. We use pretrained GloVe embeddings to represent English words and IndicBERT embeddings to represent Kannada tokens.
</p>

## Load the libraries

In [1]:
from transformers import AutoModel, AutoTokenizer
import torch
from tqdm import tqdm
import pandas as pd
import pickle
import re
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import numpy as np
from nltk.tokenize import word_tokenize
import tensorflow as tf
import random
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
import os



## Load the downloaded GloVe embeddings

In [2]:
import numpy as np
from tqdm import tqdm
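glove_word_to_idx = {}  # word -> integer index (line number in the GloVe file)
glove_vectors = {}      # integer index -> embedding vector as a torch tensor
with open('/kaggle/input/glove6b100d-embedding/glove.6B.100d.txt', encoding='utf-8') as fp:
    for i, line in enumerate(tqdm(fp.readlines())):
        # Each line holds a word followed by its vector components
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        glove_word_to_idx[word] = i
        glove_vectors[i] = torch.tensor(vector_dimensions)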

100%|██████████| 400000/400000 [00:16<00:00, 24341.58it/s]


In [None]:
## Don't run this cell: the embeddings it builds are saved to pickle files and loaded from disk below

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert',keep_accents=True,padding=True)
model = AutoModel.from_pretrained('ai4bharat/indic-bert')

# Set the model to evaluation mode
model.eval()

# Get the full vocabulary
vocab = tokenizer.get_vocab()

idx=0
ids_to_idx={}
idx_to_ids={}

# Filter the vocabulary to include only Kannada words
kannada_words = {word: idx for word, idx in vocab.items() if any(0x0C80 <= ord(c) <= 0x0CFF for c in word)}
special = ['<unk>','<pad>','▁']
for s in special:
    tokens = tokenizer.encode(s, return_tensors="pt")
    ids = tokens
    with torch.no_grad():
        outputs = model(tokens)
    embeddings = outputs.last_hidden_state[0]
    for ids,emb in zip(tokens[0],embeddings):
        ids=ids.item()
        word = tokenizer.convert_ids_to_tokens(ids)
        if ids not in ids_to_idx.keys():
            ids_to_idx[ids] = {'word':word,'word_index':ids,'emb':emb}
            idx_to_ids[ids] = {'word':word,'word_index':ids, 'emb':emb }
    
idx = 4
for words in tqdm(kannada_words.keys()):
    if idx==8:
        idx=9
    tokens = torch.tensor([2,kannada_words[words],3]).unsqueeze(0)
    with torch.no_grad():
        outputs = model(tokens)
    embeddings = outputs.last_hidden_state[0][1]
    ids_to_idx[kannada_words[words]] = {'word':words,'word_index':idx, 'emb':embeddings}
    idx_to_ids[idx] = {'word':words,'word_index': kannada_words[words], 'emb':embeddings}
    idx+=1
            
import pickle

# Serialize the ids_to_idx dictionary
with open('kannada_word_embeddings_ids2idx.pkl', 'wb') as f:
    pickle.dump(ids_to_idx, f)
    

# Serialize the idx_to_ids dictionary
with open('kannada_word_embeddings_idx2ids.pkl', 'wb') as f:
    pickle.dump(idx_to_ids, f)

 52%|█████▏    | 11206/21383 [04:28<04:25, 38.40it/s]

## Load the data

In [3]:
Data_train = pd.read_csv('/kaggle/input/translation-dataset/team24_kn/team24_kn_train.csv',encoding='utf-8')
Data_valid =  pd.read_csv('/kaggle/input/translation-dataset/team24_kn/team24_kn_valid.csv',encoding='utf-8')
Data_test =  pd.read_csv('/kaggle/input/translation-dataset/team24_kn/team24_kn_test.csv',encoding='utf-8')

## Load the Kannada word embeddings

<p style="font-size:18px">
    The embeddings are stored in two pickle files. The first contains a dictionary that maps each original tokenizer id to its word, its new index, and its embedding. The second contains the inverse mapping, from each new index to the word, its original tokenizer id, and its embedding.
</p>
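
The layout of the two dictionaries can be sketched with a toy vocabulary. The token ids and embedding values below are made up for illustration (the real entries hold IndicBERT vectors); only the dictionary structure follows the cell that builds the pickles.

```python
# Toy stand-ins: original tokenizer ids are sparse, the new indices are compact.
# All ids and vectors here are hypothetical.
toy_vocab = {"▁ಈ": 5012, "▁ಬಗ್ಗೆ": 7340, "▁ಎಂದು": 9121}

ids_to_idx = {}  # original tokenizer id -> {'word', 'word_index' (new index), 'emb'}
idx_to_ids = {}  # new compact index     -> {'word', 'word_index' (original id), 'emb'}

new_index = 4  # the notebook starts Kannada tokens at index 4
for word, original_id in toy_vocab.items():
    emb = [0.0, 0.0]  # placeholder for the real embedding vector
    ids_to_idx[original_id] = {"word": word, "word_index": new_index, "emb": emb}
    idx_to_ids[new_index] = {"word": word, "word_index": original_id, "emb": emb}
    new_index += 1

# Round trip: tokenizer ids -> model indices -> words
encoded = [ids_to_idx[i]["word_index"] for i in [5012, 9121]]
decoded = [idx_to_ids[i]["word"] for i in encoded]
print(encoded, decoded)
```

Keeping both directions avoids a linear search when converting model predictions back to words.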

In [None]:
file_path = '/kaggle/input/d/anikbhowmickae20b102/kannada-embeddings/kannada_word_embeddings_ids2idx.pkl'

# Open the pickle file for reading in binary mode
with open(file_path, 'rb') as f:
    ids_to_idx = pickle.load(f)
    
file_path = '/kaggle/input/d/anikbhowmickae20b102/kannada-embeddings/kannada_word_embeddings_idx2ids.pkl'

# Open the pickle file for reading in binary mode
with open(file_path, 'rb') as f:
    idx_to_ids = pickle.load(f)

In [51]:
ids_to_idx[0]

{'word': '<pad>',
 'word_index': 0,
 'emb': tensor([ 1.8497e-01,  4.7676e-02,  1.7916e-02, -1.8901e-01, -2.1319e-02,
         -4.3125e-02, -1.5480e-01,  2.7509e-02, -4.1240e-02,  1.9308e-01,
         -2.2678e-01,  2.6876e-01,  7.7441e-02, -1.1246e-01, -1.3378e-01,
          1.9812e-02,  6.3040e-03, -1.5274e-01,  7.5757e-03, -7.5364e-01,
         -1.5213e-01, -2.0996e-01,  5.1003e-02, -8.9232e-02,  3.2150e-01,
          4.5672e-01, -1.3179e-01, -7.0628e-02, -1.4835e-01, -5.5398e-02,
          2.2579e-01,  2.9330e-01,  3.1553e-01, -2.3183e-01,  2.7504e-02,
          9.0284e-02,  3.1039e-01, -2.3230e-02, -1.2556e-01,  7.2299e-03,
         -5.5922e-01, -7.7349e-02, -2.1687e-01, -4.3368e-01, -1.9727e-02,
          1.3920e-01, -2.7258e-01, -7.1891e-02,  7.5716e-03,  3.9337e-02,
         -1.3009e-01, -2.7763e-01,  1.0540e-01,  2.8175e-01,  2.1879e-01,
          3.0248e-01, -1.3106e-01,  1.3361e-01,  5.5977e-02,  2.3551e-02,
          1.3561e-01, -4.0450e-01, -2.8044e-02,  1.5109e-01, -1.4453e

## Load the AutoTokenizer for IndicBERT

<p style="font-size:18px">
    Perform preprocessing for the Kannada sentences.
</p>
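
As a quick illustration of what the cleaning function in the following cell keeps and drops, here is a standalone copy of it applied to a made-up sample string. The sample is an assumption chosen to show one subtle point: in Python 3, `\d` matches any Unicode decimal digit, so Kannada digits are removed as well.

```python
import re

def preprocess_text(text):
    # Keep only characters from the Kannada Unicode block and whitespace
    cleaned_text = re.sub(r'[^\u0C80-\u0CFF\s]', '', text)
    # Remove digits; this also matches Kannada digits such as ೧೨೩
    cleaned_text = re.sub(r'\d+', '', cleaned_text)
    # Full stops were already removed by the first substitution
    cleaned_text = re.sub(r'\.', '', cleaned_text)
    # Collapse runs of whitespace into a single space
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    return cleaned_text.strip()

sample = "ಇದು test 123 ೧೨೩ ಒಂದು."  # hypothetical mixed-script input
cleaned = preprocess_text(sample)
print(cleaned)
```

Only the two Kannada words survive, separated by a single space.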

In [5]:
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert',keep_accents=True,padding=True)


def preprocess_text(text):
    # Keep only characters from the Kannada Unicode block and whitespace;
    # this already drops Latin letters, ASCII digits, punctuation, and emojis
    cleaned_text = re.sub(r'[^\u0C80-\u0CFF\s]', '', text)
    cleaned_text = re.sub(r'\d+', '', cleaned_text)  # Remove remaining digits (in Python 3, \d also matches Kannada digits)
    cleaned_text = re.sub(r'\.', '', cleaned_text)  # No-op after the first substitution; kept as a safeguard
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)  # Collapse runs of whitespace
    return cleaned_text.strip()  # Remove leading and trailing whitespace



def get_tokens(sentences):
    # Tokenize each sentence into a 1-D tensor of IndicBERT token ids
    tokenized_sentences = [tokenizer.encode(sent, return_tensors='pt')[0] for sent in sentences]
    # Convert the tensors to lists of plain Python ints
    tokenized_sentences_int_org = [[int(token) for token in tokens] for tokens in tokenized_sentences]
    # Map the original tokenizer ids to the compact indices stored in ids_to_idx
    tokenized_sentences_int_mod = [[ids_to_idx[i]['word_index'] for i in tokens] for tokens in tokenized_sentences_int_org]
    # Find the maximum sequence length
    max_length = max(len(tokens) for tokens in tokenized_sentences)
    return max_length, tokenized_sentences_int_mod

def pad_sequences(sentences,max_length):
    padded_seq = [sent + [0] * (max_length - len(sent)) for sent in sentences]
    return padded_seq

def embeddings_target_org(Y):
    all_sequences_embeddings = []
    for sequence in Y:
        embeddings_list = [idx_to_ids[i.item()]['emb'] for i in sequence]
        # Stack the list of embeddings into a single tensor for the current sequence
        sequence_embeddings_tensor = torch.stack(embeddings_list)
        # Append the stacked tensor to the list of all sequences' embeddings
        all_sequences_embeddings.append(sequence_embeddings_tensor)
    all_sequences_tensor = torch.stack(all_sequences_embeddings)
    
    return all_sequences_tensor


<p style="font-size:18px">
    Preprocess the English sentences.
</p>

In [6]:
def preprocess_source(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    
    return text

def get_tokens_source(sentences):
    tokenized_sent = [[glove_word_to_idx[word] for word in sent.split() if word in glove_word_to_idx] for sent in sentences]
    return tokenized_sent

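def pad_source(y):
    # Pad every sentence to the length of the longest one.
    # Note: the padding id 2 is also a regular GloVe index, so padded positions
    # are embedded with that token's vector rather than a dedicated padding vector.
    max_len = max((len(sent) for sent in y), default=0)
    padded_seq = [sent + [2] * (max_len - len(sent)) for sent in y]
    return padded_seq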

def embeddings_source(text):
    all_sequences_embeddings = []
    for sent in text:
        embeddings_list = [glove_vectors[i.item()] for i in sent]
        sequence_embeddings_tensor = torch.stack(embeddings_list)
        # Append the stacked tensor to the list of all sequences' embeddings
        all_sequences_embeddings.append(sequence_embeddings_tensor)
    all_sequences_tensor = torch.stack(all_sequences_embeddings)
    
    return all_sequences_tensor

In [7]:
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X,dtype=torch.int32)
        self.y = torch.tensor(y,dtype=torch.int32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

## Preprocess and pad the sentences

In [8]:
Data_train['target']=Data_train['target'].apply(preprocess_text)
Data_valid['target']=Data_valid['target'].apply(preprocess_text)
Data_test['target']=Data_test['target'].apply(preprocess_text)

max_len, Y_train = get_tokens(Data_train['target'].values)
Y_train = pad_sequences(Y_train,max_len)

max_len, Y_valid = get_tokens(Data_valid['target'].values)
Y_valid = pad_sequences(Y_valid,max_len)

max_len, Y_test = get_tokens(Data_test['target'].values)
Y_test = pad_sequences(Y_test,max_len)

Data_train['source'] = Data_train['source'].apply(preprocess_source)
Data_valid['source'] = Data_valid['source'].apply(preprocess_source)
Data_test['source'] = Data_test['source'].apply(preprocess_source)

X_train = get_tokens_source(Data_train['source'].values)
X_train = pad_source(X_train)

X_valid = get_tokens_source(Data_valid['source'].values)
X_valid = pad_source(X_valid)

X_test = get_tokens_source(Data_test['source'].values)
X_test = pad_source(X_test)

In [57]:
Data_train['source'][0]

'the role of parents'

In [58]:
X_train[0]

[0,
 542,
 3,
 1108,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2]

In [9]:
train_dataset = CustomDataset(X_train, Y_train)
valid_dataset = CustomDataset(X_valid, Y_valid)
test_dataset = CustomDataset(X_test, Y_test)
# Create a DataLoader
Train_data = DataLoader(train_dataset, batch_size=128, shuffle=False)
Val_data = DataLoader(valid_dataset, batch_size=128, shuffle=False)
Test_data = DataLoader(test_dataset, batch_size=128, shuffle=False)

In [10]:
num_vocabs_kannada = len(ids_to_idx)# about 21K words+subwords

## Model architecture
<p style="font-size:18px">
    The decoder is trained with teacher forcing: its next input is either the ground-truth token or its own previous prediction. In the code below this choice is made once per batch, with probability 0.5 of using the ground-truth tokens. Mixing the two makes the model more robust at inference time, when only the previously predicted token is available to predict the next one.
</p>
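
The idea can be sketched without any neural network. In the minimal example below, `toy_step` is a hypothetical stand-in for one decoder step, and the token values are made up; the sketch only shows how the choice between ground truth and prediction changes what the decoder is fed.

```python
import random

def decode(step_fn, targets, teacher_forcing_ratio, rng):
    """Run a decoder over one target sequence; return the inputs it was fed and its predictions."""
    # One decision per sequence, mirroring the per-batch decision in the model below
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    decoder_input = targets[0]  # start token
    fed_inputs, predictions = [], []
    for t in range(1, len(targets)):
        fed_inputs.append(decoder_input)
        predicted = step_fn(decoder_input)
        predictions.append(predicted)
        # Next input: ground truth under teacher forcing, otherwise the model's own prediction
        decoder_input = targets[t] if use_teacher_forcing else predicted
    return fed_inputs, predictions

def toy_step(token):
    # Hypothetical "decoder": always predicts token + 1
    return token + 1

targets = [2, 10, 20, 30]
forced, _ = decode(toy_step, targets, teacher_forcing_ratio=1.0, rng=random.Random(0))
free, _ = decode(toy_step, targets, teacher_forcing_ratio=0.0, rng=random.Random(0))
print(forced)  # inputs follow the ground truth
print(free)    # inputs follow the model's own predictions
```

With a ratio of 1.0 the decoder never sees its own mistakes, and with 0.0 it always does, which is the situation at inference time.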

In [20]:
class Encoder(torch.nn.Module):
    def __init__(self, embedding_dim=100, hidden_size=256):
        super(Encoder, self).__init__()
        self.embedding = embeddings_source
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_size, num_layers = 1, bidirectional=False, dropout=0.2)
        
    def forward(self, inputs):
        embedded = self.embedding(inputs).to(inputs.device)
        outputs, (hidden, cell) = self.lstm(embedded)
        
        return outputs, hidden, cell
    
class Decoder(torch.nn.Module):
    def __init__(self, embedding_dim=768, hidden_size=256):
        super(Decoder, self).__init__()
        self.embedding = embeddings_target_org
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_size, num_layers = 1, bidirectional=False, dropout=0.2)
        self.linear = torch.nn.Linear(hidden_size,num_vocabs_kannada)
        
    def forward(self,inputs,hidden_state=None, cell_state=None):
        inputs = inputs.unsqueeze(0)
        embedded = self.embedding(inputs).to(inputs.device)
    
        outputs, (hidden, cell) = self.lstm(embedded, (hidden_state,cell_state))
        
        outputs = self.linear(outputs)
        
        return outputs.squeeze(0), hidden, cell
    
    
class Seq2Seq(torch.nn.Module):
    def __init__(self):
        super(Seq2Seq,self).__init__()
        self.encoder_model = Encoder()
        self.decoder_model = Decoder()
    
    
    def forward(self, inputs, target=None):
        # inputs: (source_len, batch); target: (target_len, batch), or None at inference
        teacher_forcing_ratio = 0.5
        output_seq_len = 172  # fixed number of decoding steps
        encoder_outputs, encoder_hidden, encoder_cell = self.encoder_model(inputs)

        # Initialize the decoder state with the encoder's final hidden and cell states
        decoder_hidden = encoder_hidden
        decoder_cell = encoder_cell
        batch_size = inputs.size(1)

        # First decoder input: the first target token if given, otherwise token id 2 (used as SOS)
        decoder_input = (target[0] if target is not None
                         else torch.tensor([2] * batch_size, device=inputs.device))

        outputs = torch.zeros(output_seq_len, batch_size, num_vocabs_kannada).to(inputs.device)
        # Teacher forcing is decided once per forward pass (per batch), not per time step
        use_teacher_forcing = (random.random() < teacher_forcing_ratio) and target is not None

        # Decode one time step at a time; outputs[0] stays all zeros
        for t in range(1, output_seq_len):
            decoder_output, decoder_hidden, decoder_cell = self.decoder_model(decoder_input, decoder_hidden, decoder_cell)
            outputs[t] = decoder_output
            # Next input: ground-truth token under teacher forcing, else the greedy prediction
            decoder_input = target[t] if use_teacher_forcing else outputs[t].argmax(1)

        return outputs
        

In [205]:
Data_train['target'][0]# one sample

'ಆಗಿನ ಪೋಷಕರ ಪಾತ್ರವೂ'

In [203]:
''.join([idx_to_ids[id.item()]['word'] for id in y[0] if id not in [1,0,2,3,8]])

'▁ಆಗಿನ▁ಪೋಷಕರ▁ಪಾತ್ರವೂ'

## BLEU score metrics

<p style="font-size:18px">
    To evaluate translation quality we use BLEU-1, BLEU-2, BLEU-3 and BLEU-4.
</p>
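
BLEU-n is built from clipped (modified) n-gram precisions up to order n. The sketch below computes only that precision term in plain Python, on a made-up English example; it is a simplified illustration, not the full metric, since NLTK's `sentence_bleu` also combines the orders using the given weights and applies a brevity penalty and smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat is on the mat".split()
candidate = "the the the cat".split()
p1 = modified_precision(reference, candidate, 1)
p2 = modified_precision(reference, candidate, 2)
print(p1, p2)
```

Clipping is what stops a candidate from scoring well by repeating a common word, a pattern that also shows up in some of the model outputs later in this notebook.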

In [21]:
chencherry = SmoothingFunction()
def id_to_word(texts):
    to_remove = [1,0,2,3,8]
    cleaned_texts = []
    for text in texts:
        final = [idx_to_ids[id.item()]['word'] for id in text if id not in to_remove]
        cleaned_texts.append(final)
    return cleaned_texts

def calculate_bleu(y_true, y_pred,wt):
    y_true = id_to_word(y_true)
    y_pred = id_to_word(y_pred)
    #print(y_true)
    #print(y_pred)
    bleu_scores = torch.tensor([sentence_bleu([s1],s2,weights = wt,smoothing_function=chencherry.method1)  for s1,s2 in zip(y_true,y_pred)],dtype=torch.float32)
    #print(bleu_scores)
    return torch.mean(bleu_scores)

def bleu_1_score(y_true,y_pred):
    return calculate_bleu(y_true, y_pred,[1,0,0,0])

def bleu_2_score(y_true,y_pred):
    return calculate_bleu(y_true, y_pred,[1/2,1/2,0,0])

def bleu_3_score(y_true,y_pred):
    return calculate_bleu(y_true, y_pred,[1/3,1/3,1/3,0])

def bleu_4_score(y_true,y_pred):
    return calculate_bleu(y_true, y_pred,[1/4,1/4,1/4,1/4])

In [22]:
class CSVLogger:
    def __init__(self, filename, fieldnames):
        self.filename = filename
        self.fieldnames = fieldnames
        self.is_first_row = not os.path.exists(filename)
        self.df = pd.DataFrame(columns=self.fieldnames)
        self.df.to_csv(self.filename, index=False)

    def log(self, values):
        self.df.loc[len(self.df)]=values
        self.df.to_csv(self.filename,index=False)
        self.is_first_row = False

    def close(self):
        pass  # Nothing to do for closing a Pandas-based logger

names = ['epoch', 'loss', 'bleu_1', 'bleu_2', 'bleu_3', 'bleu_4',
         'val_loss', 'val_bleu_1', 'val_bleu_2', 'val_bleu_3', 'val_bleu_4']

CSV_logger = CSVLogger('training_logs.csv', fieldnames=names)

## Training loop

In [23]:
def train_loop(model, criterion, optimizer, device, Trainloader,Val_loader, Epochs=10):    
    model = model.to(device)
    prev_best_loss = float('inf') 
    for epoch in range(Epochs):  # loop over the dataset multiple times
        print(f"Epoch {epoch+1}")
        # Training phase
        model.train()  # Set the model to training mode
        train_loss = 0.0
        val_loss = 0.0
        bleu_1 = 0.0
        bleu_2 = 0.0
        bleu_3 = 0.0
        bleu_4 = 0.0
        
        bleu_1_val = 0.0
        bleu_2_val = 0.0
        bleu_3_val = 0.0
        bleu_4_val = 0.0
        
        progress_bar = tf.keras.utils.Progbar(len(Trainloader))
        for i, data in enumerate(Trainloader):
            inputs, targets = data
            # transpose the samples so the batch comes on last dimension
            inputs = inputs.T.to(device)
            targets = targets.T.to(device)
            #print(targets.shape)
            optimizer.zero_grad()
            outputs = model(inputs,targets)
            outputs_flattened = outputs[1:].reshape(-1, outputs.shape[-1])
            targets_flattened = targets[1:].reshape(-1)
            loss = criterion(outputs_flattened, targets_flattened.to(torch.long))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(),max_norm = 1)
            optimizer.step()
            
            Y_pred= torch.argmax(outputs,axis=-1).T.detach()
            Y_true=targets.T.detach()
            
            train_loss=train_loss*i
            train_loss += loss.item()
            train_loss/=(i+1)
            
            bleu_1=bleu_1*i
            bleu_1+=bleu_1_score(Y_true,Y_pred)
            bleu_1/=(i+1)

            bleu_2=bleu_2*i
            bleu_2+=bleu_2_score(Y_true,Y_pred)
            bleu_2/=(i+1)
            
            bleu_3=bleu_3*i
            bleu_3+=bleu_3_score(Y_true,Y_pred)
            bleu_3/=(i+1)
            
            bleu_4=bleu_4*i
            bleu_4+=bleu_4_score(Y_true,Y_pred)
            bleu_4/=(i+1)
            
        
                
            # Update progress bar
            progress_bar.update(i + 1, [('train_loss', train_loss),('bleu_1', bleu_1),('bleu_2', bleu_2),
                                        ('bleu_3', bleu_3),('bleu_4', bleu_4)])
            
            del inputs
            del targets
            del outputs
            torch.cuda.empty_cache()


        # Validation phase
        model.eval()  # Set the model to evaluation mode
        
        with torch.no_grad():
            progress_bar = tf.keras.utils.Progbar(len(Val_loader))
            for i, data in enumerate(Val_loader):
                inputs, targets = data
                inputs, targets = inputs.T.to(device), targets.T.to(device)
                outputs = model(inputs)
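                # Note: unlike the training phase, position 0 (whose logits are always zero) is included here,
                # so val_loss is not directly comparable to train_loss
                outputs_flattened = outputs.reshape(-1, outputs.shape[-1])
                targets_flattened = targets.reshape(-1)
                loss = criterion(outputs_flattened, targets_flattened.to(torch.long))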
                
                Y_pred= torch.argmax(outputs,axis=-1).T.detach()
                Y_true=targets.T.detach()
                
                val_loss=val_loss*i
                val_loss += loss.item()
                val_loss/=(i+1)
                
                bleu_1_val=bleu_1_val*i
                bleu_1_val+=bleu_1_score(Y_true,Y_pred)
                bleu_1_val/=(i+1)

                bleu_2_val=bleu_2_val*i
                bleu_2_val+=bleu_2_score(Y_true,Y_pred)
                bleu_2_val/=(i+1)
            
                bleu_3_val=bleu_3_val*i
                bleu_3_val+=bleu_3_score(Y_true,Y_pred)
                bleu_3_val/=(i+1)
            
                bleu_4_val=bleu_4_val*i
                bleu_4_val+=bleu_4_score(Y_true,Y_pred)
                bleu_4_val/=(i+1)
                
                progress_bar.update(i + 1, [('val_loss', val_loss),('val_bleu_1', bleu_1_val),('val_bleu_2', bleu_2_val),
                                        ('val_bleu_3', bleu_3_val),('val_bleu_4', bleu_4_val)])
                
                del inputs
                del targets
                del outputs
                torch.cuda.empty_cache()
                
                
        logs={'epoch':epoch,'loss':train_loss, 'bleu_1':bleu_1,'bleu_2':bleu_2,'bleu_3':bleu_3,'bleu_4':bleu_4,
           'val_loss':val_loss, 'val_bleu_1':bleu_1_val, 'val_bleu_2':bleu_2_val, 'val_bleu_3':bleu_3_val, 'val_bleu_4':bleu_4_val}
        CSV_logger.log(logs)


        # Print statistics
        print(f"Val Loss: {val_loss:.4f}, val_bleu_1: {bleu_1_val:.4f}, val_bleu_2: {bleu_2_val:.4f}, val_bleu_3: {bleu_3_val:.4f}, val_bleu_4: {bleu_4_val:.4f},")
        
        
        
        # Save model if validation loss improves
        if val_loss < prev_best_loss:
            print(f"Validation loss improved from {prev_best_loss:.4f}. Saving model...")
            torch.save(model.state_dict(), f"LSTM_model_epoch_{epoch+1}_val_loss_{val_loss:.4f}.pth")
            prev_best_loss = val_loss
        else:
            print(f"Validation loss did not improve from {prev_best_loss:.4f}")

    print("Finished Training")
    
    return model
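
The training loop above keeps its running averages with the update `avg = (avg * i + x) / (i + 1)`. A minimal pure-Python sketch, with made-up loss values, shows that this matches the ordinary mean of the values seen so far:

```python
def update_running_mean(current_mean, new_value, i):
    """Fold the i-th value (0-based) into the mean of the previous i values."""
    return (current_mean * i + new_value) / (i + 1)

batch_losses = [0.5, 0.3, 0.4, 0.8]  # hypothetical per-batch losses
running = 0.0
for i, loss in enumerate(batch_losses):
    running = update_running_mean(running, loss, i)

expected = sum(batch_losses) / len(batch_losses)
print(running, expected)
```

Note that the Keras `Progbar` averages the values it is given over time unless they are listed as stateful metrics, which may be why the progress-bar numbers differ slightly from the printed epoch summary.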

In [24]:
device = torch.device('cpu')
device

device(type='cpu')

In [26]:
model = Seq2Seq()
model.load_state_dict(torch.load("/kaggle/input/translation-model-english-kan/LSTM_model_epoch_11_val_loss_0.5548.pth"))
#device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
device

device(type='cpu')

## Train the model
<p style="font-size:18px">
    Please note that the model was trained for nearly 50 epochs in total, with each epoch taking almost 3 hours, so we could not continue training for longer to obtain better results. The model was checkpointed, reloaded, and trained further several times.
</p>

In [None]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

trained_model = train_loop(model, criterion, optimizer, device, Train_data,Val_data, Epochs=6)

Epoch 1
547/547 ━━━━━━━━━━━━━━━━━━━━ 7520s 14s/step - train_loss: 0.4375 - bleu_1: 0.0592 - bleu_2: 0.0244 - bleu_3: 0.0167 - bleu_4: 0.0143
157/157 ━━━━━━━━━━━━━━━━━━━━ 2035s 13s/step - val_loss: 0.5520 - val_bleu_1: 0.0440 - val_bleu_2: 0.0194 - val_bleu_3: 0.0141 - val_bleu_4: 0.0125
Val Loss: 0.5537, val_bleu_1: 0.0443, val_bleu_2: 0.0196, val_bleu_3: 0.0142, val_bleu_4: 0.0125,
Validation loss improved from inf. Saving model...
Epoch 2
547/547 ━━━━━━━━━━━━━━━━━━━━ 7530s 14s/step - train_loss: 0.4296 - bleu_1: 0.0628 - bleu_2: 0.0257 - bleu_3: 0.0175 - bleu_4: 0.0150
157/157 ━━━━━━━━━━━━━━━━━━━━ 2043s 13s/step - val_loss: 0.5518 - val_bleu_1: 0.0460 - val_bleu_2: 0.0203 - val_bleu_3: 0.0147 - val_bleu_4: 0.0129
Val Loss: 0.5535, val_bleu_1: 0.0457, val_bleu_2: 0.0203, val_bleu_3: 0.0146, val_bleu_4: 0.0128,
Validation loss improved from 0.5537. Sav

In [20]:
def evaluate(model, criterion, device, data_loader): 
    model.eval()
    with torch.no_grad():
        val_loss = 0.0
        bleu_1_val = 0.0
        bleu_2_val = 0.0
        bleu_3_val = 0.0
        bleu_4_val = 0.0
        progress_bar = tf.keras.utils.Progbar(len(data_loader))
        for i, data in enumerate(data_loader):
            inputs, targets = data
            inputs, targets = inputs.T.to(device), targets.T.to(device)
            outputs = model(inputs)
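            # Note: as in the validation phase of train_loop, position 0 (whose logits are always zero)
            # is included in the loss here
            outputs_flattened = outputs.reshape(-1, outputs.shape[-1])
            #print(outputs_flattened.shape)
            targets_flattened = targets.reshape(-1)
            #print(targets.shape)
            loss = criterion(outputs_flattened, targets_flattened.to(torch.long))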
                
            Y_pred= torch.argmax(outputs,axis=-1).T.detach()
            Y_true=targets.T.detach()
                
            val_loss=val_loss*i
            val_loss += loss.item()
            val_loss/=(i+1)
                
            bleu_1_val=bleu_1_val*i
            bleu_1_val+=bleu_1_score(Y_true,Y_pred)
            bleu_1_val/=(i+1)

            bleu_2_val=bleu_2_val*i
            bleu_2_val+=bleu_2_score(Y_true,Y_pred)
            bleu_2_val/=(i+1)
            
            bleu_3_val=bleu_3_val*i
            bleu_3_val+=bleu_3_score(Y_true,Y_pred)
            bleu_3_val/=(i+1)
            
            bleu_4_val=bleu_4_val*i
            bleu_4_val+=bleu_4_score(Y_true,Y_pred)
            bleu_4_val/=(i+1)
                
            del inputs
            del targets
            del outputs
            torch.cuda.empty_cache()
            
            progress_bar.update(i + 1, [('test_loss', val_loss),('test_bleu_1', bleu_1_val),('test_bleu_2', bleu_2_val),
                                        ('test_bleu_3', bleu_3_val),('test_bleu_4', bleu_4_val)])
            
    print(f"test Loss: {val_loss:.4f}, test_bleu_1: {bleu_1_val:.4f}, test_bleu_2: {bleu_2_val:.4f}, test_bleu_3: {bleu_3_val:.4f}, test_bleu_4: {bleu_4_val:.4f},")

In [21]:
criterion = torch.nn.CrossEntropyLoss()
evaluate(model,criterion,device,Test_data)

79/79 ━━━━━━━━━━━━━━━━━━━━ 611s 8s/step - test_loss: 0.6867 - test_bleu_1: 0.0408 - test_bleu_2: 0.0179 - test_bleu_3: 0.0130 - test_bleu_4: 0.0113
test Loss: 0.6868, test_bleu_1: 0.0410, test_bleu_2: 0.0180, test_bleu_3: 0.0129, test_bleu_4: 0.0112,


In [18]:
sentences

[['▁ರಷ್ಟು', '▁ರೂ'],
 ['▁ಈ', '▁ಬಗ್ಗೆ', '▁ಸರ್ಕಾರ', 'ೂ', '▁ಯಾವುದೇ', '▁ಯಾವುದೇ', '▁ಎಂದು'],
 ['▁ರಷ್ಟು', '▁ಕೋಟಿ', '▁ರೂ'],
 ['▁ಎಂದು', '▁ಅವರು'],
 ['▁ಇದು', '▁ತುಂಬಾ', '▁ಸುಲಭ'],
 [],
 ['▁ಈ', '▁ಮತ್ತು', '▁ಮತ್ತು', 'ೕ', '▁ಮತ್ತು', '▁ಮತ್ತು'],
 ['▁ನಾನು', '▁ಏನು'],
 ['▁ಇದು', '▁ಒಂದು'],
 ['▁ಈ',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು'],
 ['▁ಈ',
  '▁ವೇಳೆ',
  '▁ಅವರು',
  '▁ನರೇಂದ್ರ',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು',
  '▁ಮತ್ತು'],
 ['▁ಯೆ', 'ಹೋ', 'ವನ', '▁ಯೆ', 'ಹೋ', 'ವನ'],
 ['▁ನಾನು', '▁ನಾನು'],
 ['▁ಮಾಜಿ', '▁ಸಚಿವ'],
 ['▁ಈ', '▁ಬಾರಿ'],
 ['▁ಪ್ರಧಾನಿ',
  '▁ನರೇಂದ್ರ',
  '▁ಮೋದಿ',
  '▁ಅವರು',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ',
  '▁ಗಾಂಧಿ'],
 ['▁ನಾನು', '▁ನಾನು'],
 ['▁ಆದರೆ', '▁ಯೆ', 'ಹೋ', 'ವನ', '▁ಯೆ', 'ಹೋ', 'ವನ'],
 ['▁ಆದರೆ', '▁ಈ', '▁ಬಗ್ಗೆ', '▁ಈ', '▁ಬಗ್ಗೆ', '▁ಎಂದು', '▁ಎಂದು'],
 ['▁ಆದರೆ', '▁ಈ', '▁ಈ'],
 ['▁ಯೆ', 'ಹೋ', 'ವನ', 'ನ್ನು', '▁ಯೆ', 'ಹೋ', 'ವನ'],

In [30]:
def get_final_sent(sent):
    final = "".join(sent)
    cleaned_text = final.replace('▁', ' ')
    return cleaned_text.strip()
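
As a small usage example, here is a standalone copy of the function applied to one of the token lists shown in the output above. The `▁` character marks the start of a new word in the tokenizer's subword pieces, so pieces without it are glued to the preceding piece.

```python
def get_final_sent(sent):
    # Concatenate the subword pieces, then turn each word-start marker into a space
    final = "".join(sent)
    cleaned_text = final.replace('▁', ' ')
    return cleaned_text.strip()

pieces = ['▁ಈ', '▁ಬಗ್ಗೆ', '▁ಸರ್ಕಾರ', 'ೂ']  # the last piece continues the previous word
sentence = get_final_sent(pieces)
print(sentence)
```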

## Output with teacher forcing

<p style="font-size:18px">
    Many of these translations make no sense, some are partially correct, and some are incomplete. This suggests that the model needs more training epochs, but with our limited resources we could not do better than this.
</p>
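
One factor that may be worth checking besides training time: the loss in this notebook is averaged over every position of the padded target, including the `<pad>` positions (index 0), which are easy to predict. The pure-Python sketch below uses made-up per-token losses to show how padding can pull the average down; in PyTorch, `torch.nn.CrossEntropyLoss(ignore_index=0)` is the usual way to exclude such positions, although that variant was not tried here.

```python
def mean_loss(token_losses, targets, ignore_index=None):
    """Average per-token losses, optionally skipping positions whose target equals ignore_index."""
    kept = [loss for loss, tgt in zip(token_losses, targets) if tgt != ignore_index]
    return sum(kept) / len(kept)

PAD = 0
targets = [41, 17, 93] + [PAD] * 7            # 3 real tokens followed by 7 padding positions
token_losses = [4.0, 5.0, 6.0] + [0.01] * 7   # hypothetical: padding is predicted almost perfectly

all_positions = mean_loss(token_losses, targets)
real_only = mean_loss(token_losses, targets, ignore_index=PAD)
print(all_positions, real_only)
```

In this toy case the average over all positions looks far better than the average over real tokens, so a low reported loss alongside low BLEU scores is not necessarily a contradiction.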

In [50]:
sentences=[]
for x,y in tqdm(Test_data):
    Y_pred= torch.argmax(model(x.T,y.T),axis=-1).T.detach()
    sentences.extend(id_to_word(Y_pred))

100%|██████████| 79/79 [04:17<00:00,  3.26s/it]


## Inference


In [51]:
idx=380
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  here  cases were registered
the predicted sentence:  ಈ ಸಂಬಂಧ ಪೊಲೀಸರು
the actual sentence:  ಪ್ರಕರಣಗಳು ಬಾಕಿ ಇವೆ ಎಂದು ಮಾಹಿತಿ ನೀಡಿದರು


In [52]:
idx=60
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  it is called homa
the predicted sentence:  ಇದು ಒಂದು
the actual sentence:  ಅದಕ್ಕೆ ಭಸ್ಮಶಯ್ಯೆ ಎಂದು ಹೆಸರು


In [53]:
idx=230
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  nothing was heard
the predicted sentence:  ನಾನು ಸುಮ್ಮನೆ
the actual sentence:  ಯಾವ ಸದ್ದೂ ಕೇಳಲಿಲ್ಲ


In [54]:
idx=990
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  what is a parliamentary committee
the predicted sentence:  ಬಿಜೆಪಿ ಕಾಂಗ್ರೆಸ್
the actual sentence:  ಸಮನ್ವಯ ಸಮಿತಿ ಅಂದರೆ ಏನು


## Output without teacher forcing

In [55]:
sentences=[]
for x,y in tqdm(Test_data):
    Y_pred= torch.argmax(model(x.T),axis=-1).T.detach()
    sentences.extend(id_to_word(Y_pred))

100%|██████████| 79/79 [05:17<00:00,  4.02s/it]


In [59]:
idx=380
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  here  cases were registered
the predicted sentence:  ಈ ಸಂಬಂಧ ಪೊಲೀಸರು
the actual sentence:  ಪ್ರಕರಣಗಳು ಬಾಕಿ ಇವೆ ಎಂದು ಮಾಹಿತಿ ನೀಡಿದರು


In [60]:
idx=60
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  it is called homa
the predicted sentence:  ಇದು ಒಂದು
the actual sentence:  ಅದಕ್ಕೆ ಭಸ್ಮಶಯ್ಯೆ ಎಂದು ಹೆಸರು


In [61]:
idx=230
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  nothing was heard
the predicted sentence:  ನಾನು ಸುಮ್ಮನೆ
the actual sentence:  ಯಾವ ಸದ್ದೂ ಕೇಳಲಿಲ್ಲ


In [62]:
idx=990
print("the source sentence: ",Data_test['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_test['target'][idx])

the source sentence:  what is a parliamentary committee
the predicted sentence:  ಬಿಜೆಪಿ ಅಭ್ಯರ್ಥಿ
the actual sentence:  ಸಮನ್ವಯ ಸಮಿತಿ ಅಂದರೆ ಏನು


## Results on the training data

## With teacher forcing

In [28]:
x,y = next(iter(Train_data))
Y_pred= torch.argmax(model(x.T,y.T),axis=-1).T.detach()
sentences = id_to_word(Y_pred)

In [31]:
idx=0
print("the source sentence: ",Data_train['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_train['target'][idx])

the source sentence:  the role of parents
the predicted sentence:  ಅವರುುು
the actual sentence:  ಆಗಿನ ಪೋಷಕರ ಪಾತ್ರವೂ


In [32]:
idx=80
print("the source sentence: ",Data_train['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_train['target'][idx])

the source sentence:  there are numerous complaints regarding this issue
the predicted sentence:  ಈ ಬಗ್ಗೆ ಈ ಕ್ರಮ ಎಂದು
the actual sentence:  ಈ ಬಗ್ಗೆ ಸಾಕಷ್ಟು ದೂರುಗಳೂ ಬರುತ್ತಿವೆ


In [33]:
idx=127
print("the source sentence: ",Data_train['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_train['target'][idx])

the source sentence:  the existing mlas will contest elections on  seats
the predicted sentence:  ಈ ಚುನಾವಣೆ ಮತ್ತು ಬಿಜೆಪಿ ಬಿಜೆಪಿ
the actual sentence:  ಹಾಲಿ ಶಾಸಕರು ಸ್ಥಾನಗಳಲ್ಲಿ ಸ್ಪರ್ಧಿಸಲಿದ್ದಾರೆ


## Without teacher forcing

In [34]:
Y_pred= torch.argmax(model(x.T),axis=-1).T.detach()
sentences = id_to_word(Y_pred)

In [35]:
idx=0
print("the source sentence: ",Data_train['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_train['target'][idx])

the source sentence:  the role of parents
the predicted sentence:  ಅವರು
the actual sentence:  ಆಗಿನ ಪೋಷಕರ ಪಾತ್ರವೂ


In [36]:
idx=80
print("the source sentence: ",Data_train['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_train['target'][idx])

the source sentence:  there are numerous complaints regarding this issue
the predicted sentence:  ಈ ಬಗ್ಗೆ ಈ ಕ್ರಮ
the actual sentence:  ಈ ಬಗ್ಗೆ ಸಾಕಷ್ಟು ದೂರುಗಳೂ ಬರುತ್ತಿವೆ


In [37]:
idx=127
print("the source sentence: ",Data_train['source'][idx])
print("the predicted sentence: ",get_final_sent(sentences[idx]))
print("the actual sentence: ",Data_train['target'][idx])

the source sentence:  the existing mlas will contest elections on  seats
the predicted sentence:  ಈ ಸಂಬಂಧ ವಿಧಾನಸಭಾ ಚುನಾವಣೆಗೆ ಚುನಾವಣೆ
the actual sentence:  ಹಾಲಿ ಶಾಸಕರು ಸ್ಥಾನಗಳಲ್ಲಿ ಸ್ಪರ್ಧಿಸಲಿದ್ದಾರೆ
