# AIM

Low-level language translation can be done just by looking words up in a dictionary of the respective languages, or by searching for the meaning of each word in the required language. However, a word in one language does not necessarily keep the same meaning when it is translated in a given context. Hence, in this work, translation together with context understanding is done using an RNN-based seq2seq model to improve the neural language translation process.

In [83]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k#dataset
from torchtext.data import Field, BucketIterator

import spacy

import random
import math
import os

In [84]:
#to ensure results are reproducible, set these seeds
SEED=2031
random.seed(SEED) #teacher forcing draws from Python's random module
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic=True

In [85]:
spacy_en=spacy.load("en_core_web_sm")
spacy_ger=spacy.load("de_core_news_sm")

# Data Preprocessing 

In [86]:
#functions to tokenize a sentence into a list of token strings
def tokenize_english(sentence):
    return [tok.text for tok in spacy_en.tokenizer(sentence)]
def tokenize_german(sentence):
    return [tok.text for tok in spacy_ger.tokenizer(sentence)]

### Functions of the torchtext Field module

The Field class in torchtext is a handy way to describe how the data is to be processed. Here, it is used for tokenization and for converting all the tokens to lower case. For the sequence-to-sequence model, a start-of-string token (`<sos>`) is added at the beginning of each sentence and an end-of-string token (`<eos>`) at the end.


So, a seq2seq model starts generating tokens as soon as it sees the start-of-string token and continues until it sees the end-of-string token.
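
As a rough illustration of what these Field settings do to one sentence, the plain-Python sketch below lowercases the tokens and wraps them in the two markers. The helper name `wrap_tokens` and the whitespace split are assumptions made for this example; the notebook itself tokenizes with spaCy.

```python
def wrap_tokens(sentence, init_token="<sos>", eos_token="<eos>"):
    """Lowercase a sentence, split it on whitespace, and add the boundary tokens."""
    tokens = sentence.lower().split()  # stand-in for the spaCy tokenizer
    return [init_token] + tokens + [eos_token]


wrapped = wrap_tokens("Zwei Hunde spielen im Park")
print(wrapped)
```

The decoder is later fed the first element (`<sos>`) as its first input and is trained to emit `<eos>` as its last output.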

In [87]:
source=Field(tokenize=tokenize_german, init_token='<sos>', eos_token='<eos>', lower=True)
target=Field(tokenize=tokenize_english, init_token='<sos>', eos_token='<eos>', lower=True)

In [88]:
train_data, valid_data, test_data=Multi30k.splits(exts=('.de', '.en'), fields=(source, target))

### Corpus building

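Build the vocabulary for each language so that every token in the language has an integer index. The embedding layers look these indices up, which is equivalent to multiplying a one-hot encoded vector by the embedding matrix. With min_freq=2, tokens that occur fewer than two times in the training data are left out of the vocabulary and are mapped to the unknown token instead.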

In [89]:
source.build_vocab(train_data, min_freq=2)
target.build_vocab(train_data, min_freq=2)
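
The effect of `min_freq` can be sketched in plain Python. This is only an illustration of the idea, not torchtext's implementation: `build_vocab`, the toy corpus, and the order of the special tokens are assumptions made for the example.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


corpus = [["a", "dog", "runs"], ["a", "cat", "runs"], ["a", "bird", "sings"]]
stoi, itos = build_vocab(corpus)

unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["a", "dog", "runs"]]
print(itos)
print(encoded)
```

Only "a" and "runs" occur at least twice, so "dog" is encoded with the index of the unknown token.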

### Creating Iterators

Iterators are created to split the dataset into batches. The sentences in each batch have to be sorted, padded and moved to the appropriate device, which can be done by bucket iterators. These iterators return batches of data that have a src attribute and a trg attribute, both already converted from tokens into index form.

###### Advantage of using Bucket iterator: 

Bucket iterators build the batches by collecting sentences of similar length together, so that each batch requires a minimum amount of padding.
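
To see why grouping by length helps, the sketch below counts the pad tokens needed for the same set of sentence lengths with and without sorting. The lengths are randomly generated, `padding_needed` is a hypothetical helper, and fully sorting the data is a simplification of what a bucketing iterator does.

```python
import random


def padding_needed(lengths, batch_size):
    """Count pad tokens when each consecutive batch is padded to its longest sentence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


rng = random.Random(0)
lengths = [rng.randint(5, 40) for _ in range(1024)]  # made-up sentence lengths

pad_unsorted = padding_needed(lengths, batch_size=128)
pad_bucketed = padding_needed(sorted(lengths), batch_size=128)
print(pad_unsorted, pad_bucketed)
```

Less padding means fewer wasted time steps in the GRU for every batch.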

In [90]:
Batch_size=128
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
device.type

'cuda'

In [91]:
#iterator
train_it, valid_it, test_it=BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=Batch_size, device=device)

# Creating the model

Building the seq2seq model

### 1. Encoder

The encoder uses a single-layer GRU. Dropout is applied to the embeddings only, not inside the GRU. At each time step, the GRU takes the previous hidden state and the embedding of the current input token that is passed to the encoder. The Encoder class is a subclass of nn.Module.

In [92]:
class Encoder(nn.Module):
    def __init__(self, inp_dim, emb_dim, hid_dim, dropout):
        super().__init__()
        self.input_dim=inp_dim #vocabulary size of the source language (german) 
        self.emb_dim=emb_dim #output dimension of embedding layer
        self.hidden_dim=hid_dim# GRUs output dimension
        
        self.embedding=nn.Embedding(inp_dim, emb_dim)
        self.rnn=nn.GRU(emb_dim, hid_dim)
        self.dropout=nn.Dropout(dropout) #using this for embedding layer
        
    def forward(self, source):
        #the batch of source sentences is converted into dense representations by the embedding layer,
        #then those dense representations are passed on to the GRU.
        #unlike an LSTM, a GRU doesn't return a cell state
        
        embedded=self.dropout(self.embedding(source))#shape of source=[sentence len, batch_size]
        outputs, hidden=self.rnn(embedded)#shape of embedded=[sentence len, batch_size, emb_dim]
        
        #shape of outputs=[sentence len, batch_size, hid_dim*n_directions]; n_directions is one as the GRU is not bidirectional
        #shape of hidden=[n_layers*n_directions, batch_size, hid_dim]
        
        return hidden #the thought or the context vector        

### 2. Decoder


This decoder differs from the one in the conventional seq2seq model. As in the conventional model, the initial hidden state is the output of the encoder (the context vector). In the conventional model, however, the decoder has to decode every token from its memory of that context vector, which it saw only at the first step. This weakens the decoder's ability to generate tokens after n time steps.

##### The intuition used here is:
The conventional seq2seq decoder generates its output from the previous hidden state and the current token. Here, the same context vector is passed again at every time step, so the output is produced from the current input token, the context vector and the previous hidden state. Thus the GRU in the decoder is built as GRU(emb_dim+hid_dim, hid_dim): its input size is emb_dim+hid_dim and its hidden size is hid_dim.


A linear (dense) layer is the top layer of the decoder.


###### In the forward pass:

The embedding of the current input token and the context vector are concatenated before being fed to the GRU.
The current hidden state, the embedding of the current token and the context vector are also concatenated before being passed to the dense layer.
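
The decoder step described above can be written compactly. The symbols are introduced here for illustration only and are not the notebook's own: $z$ is the context vector returned by the encoder, $e(y_{t-1})$ is the embedding (after dropout) of the previous target token, $h_t$ is the decoder hidden state, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.

$$
h_0 = z, \qquad
h_t = \mathrm{GRU}\big([\,e(y_{t-1});\, z\,],\; h_{t-1}\big), \qquad
\hat{y}_t = W\,[\,e(y_{t-1});\, h_t;\, z\,] + b
$$

With emb_dim=256 and hid_dim=512, the GRU input has 256+512=768 features and the dense layer input has 256+2*512=1280 features.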

In [93]:
class Decoder(nn.Module):
    def __init__(self, out_dim, emb_dim, hid_dim, dropout):
        super().__init__()
        self.emb_dim=emb_dim
        self.hid_dim=hid_dim
        self.out_dim=out_dim #target corpus dim
        self.embedding=nn.Embedding(out_dim, emb_dim)
        self.rnn=nn.GRU(emb_dim+hid_dim, hid_dim)#input dimension comes from the current token(emb_dim) and the context vector(hid_dim) and output is hid_dim
        self.dense=nn.Linear(emb_dim+hid_dim*2, out_dim) #input - emb_dim(current token) and the hid_dim*2(context vector + current hidden state) is passed and output is out_dimension
        self.dropout=nn.Dropout(dropout)
        
    def forward(self, input, hidden, context):
        #input size= [batch_size]
        #hidden size= [n_layers*ndirections, batch_size, hid_dim] = [1*1, batch_size, hid_dim]
        #context size= [n_layers*n_directions, batch_size, hid_dim] = [1*1, batch_size, hid_dim]
        
        input=input.unsqueeze(0)
        #input size=[1, batch_size]
        embed=self.dropout(self.embedding(input)) #shape = [1, batch_size, emb_dim]
        concatenated_embed_context=torch.cat((embed, context), dim=2)
        #now, its size becomes [1, batch_size, emb_dim+hid_dim]
        output, hidden=self.rnn(concatenated_embed_context, hidden)
        #output size=[sent_len, batch_size, hid_dim*ndirection] = [1, batch_size, hid_dim]
        #hidden size=[n_layers*n_directions, batch_size, hid_dim] = [1, batch_size, hid_dim]
        #since only one token is predicted at a time, the sentence length is one
        #now, concatenating current hidden state, current token, context vector and pass it to dense layer
        output=torch.cat((embed.squeeze(0), hidden.squeeze(0), context.squeeze(0)), dim=1)
        #output size= [batch_size, emb_dim+hid_dim*2] where emb_dim+hid_dim*2 comes from the current token, current hidden, context
        prediction=self.dense(output)
        return prediction, hidden #returned the prediction, hidden state

### Seq2Seq class which binds the encoder and decoder 

In [94]:
class seq2seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder=encoder
        self.decoder=decoder
        self.device=device
        
    def forward(self, source, target, tfr=0.5):
        #tfr=teacher forcing ratio: the probability that the ground truth token from the target,
        #rather than the decoder's own prediction, is used as the next input to the decoder
        #if tfr is 0.25, the ground truth token is used in 25% of the steps and the decoder's prediction in the remaining 75%
        #source shape=[sen_len, batch_size]
        #target shape=[sen_len, batch_size]
        batch_size=target.shape[1]
        max_len=target.shape[0]
        target_vec_size=self.decoder.out_dim
        #initialize the tensor that holds output from decoder (i.e. the tensor initialized to zeros)
        outputs=torch.zeros(max_len, batch_size, target_vec_size).to(self.device)
        
        context=self.encoder(source)#forms the initial hidden state for the decoder
        hidden=context
        input=target[0, :]
        
        for t in range(1, max_len):
            output, hidden=self.decoder(input, hidden, context)
            outputs[t]=output
            tforce=random.random()<tfr
            top1=output.max(1)[1]
            input=(target[t] if tforce else top1)
            
        return outputs
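
The teacher forcing decision inside the loop can be isolated in a small plain-Python sketch. It is not the model above: `decode_with_teacher_forcing`, `predict_next` and the toy model `always_wrong` are hypothetical names, and tokens are strings instead of index tensors.

```python
import random


def decode_with_teacher_forcing(target, predict_next, tfr=0.5, rng=random):
    """Run a decoding loop, taking each next input from the target or from the prediction."""
    inp = target[0]  # the <sos> token starts decoding
    predictions = []
    used_ground_truth = []
    for t in range(1, len(target)):
        guess = predict_next(inp)
        predictions.append(guess)
        teacher_force = rng.random() < tfr  # same test as in the forward method
        used_ground_truth.append(teacher_force)
        inp = target[t] if teacher_force else guess
    return predictions, used_ground_truth


def always_wrong(token):
    """Toy model that never predicts a real token."""
    return "???"


target = ["<sos>", "two", "dogs", "play", "<eos>"]
_, forced_all = decode_with_teacher_forcing(target, always_wrong, tfr=1.0)
_, forced_none = decode_with_teacher_forcing(target, always_wrong, tfr=0.0)
print(forced_all, forced_none)
```

With a ratio of 1.0 the decoder always sees the ground truth, and with a ratio of 0.0 it only ever sees its own predictions, which is the setting used for evaluation.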

### Defining the model, optimizer and loss criterion

In [95]:
inp_dim=len(source.vocab)
out_dim=len(target.vocab)
encoder_emb_dim=256
decoder_emb_dim=256
hid_dim=512
encoder_dropout=0.5
decoder_dropout=0.5

encoder=Encoder(inp_dim, encoder_emb_dim, hid_dim, encoder_dropout)
decoder=Decoder(out_dim, decoder_emb_dim, hid_dim, decoder_dropout)
model=seq2seq(encoder, decoder, device).to(device)

In [96]:
model

seq2seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): GRU(768, 512)
    (dense): Linear(in_features=1280, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

Here, the loss is not calculated for the padding tokens, since they are not part of the actual source and target sequences. Thus, the padding index is ignored while calculating the loss.

In [97]:
optimizer=optim.Adam(model.parameters())

padded_index=target.vocab.stoi['<pad>']
criterion=nn.CrossEntropyLoss(ignore_index=padded_index)
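
What ignoring an index means for the averaged loss can be sketched without PyTorch. The function below is a plain-Python stand-in written for this example, not the library code: `cross_entropy_ignore`, the toy logits and the pad index `PAD = 1` are all assumptions.

```python
import math


def cross_entropy_ignore(logits, targets, ignore_index):
    """Mean negative log-likelihood over the targets that are not `ignore_index`."""
    total, counted = 0.0, 0
    for row, tgt in zip(logits, targets):
        if tgt == ignore_index:
            continue  # padded position: contributes nothing to the loss
        shift = max(row)  # subtract the max for numerical stability
        log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in row))
        total += log_sum_exp - row[tgt]
        counted += 1
    return total / counted


PAD = 1  # assumed pad index for this toy example
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [5.0, 5.0, 5.0]]
targets = [0, 2, PAD]

loss_masked = cross_entropy_ignore(logits, targets, ignore_index=PAD)
loss_first_two = cross_entropy_ignore(logits[:2], targets[:2], ignore_index=PAD)
print(loss_masked, loss_first_two)
```

The padded third position changes neither the sum nor the count, so both calls give the same value.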

### Build train, validation functions

In [104]:
def train(model, iterator, optimizer, criterion, clip): #clip= to prevent gradient explosion
    model.train()
    epoch_loss=0
    train_loss=[]
    for i, batch in enumerate(iterator):
        source=batch.src
        target=batch.trg #size= [sentlen, batch_size]
        optimizer.zero_grad()
        output=model(source, target)#size= [sent_len, batch_size, output_dim]
        #flatten the output with view and skip position 0 (the <sos> step, where outputs is all zeros) before calculating the loss
        loss=criterion(output[1:].view(-1, output.shape[2]), target[1:].view(-1)) #the first time step of the output and the target is dropped
        #shapes passed to the loss fn: output=[(sent_len-1)*batch_size, output_dim]
        #target=[(sent_len-1)*batch_size]
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss+=loss.item()
        train_loss.append(epoch_loss) #running (cumulative) loss after each batch
        
    return(epoch_loss/len(iterator), train_loss)
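
The `clip` argument limits the joint norm of all gradients. The idea behind clipping by global norm can be sketched in plain Python as below. This is a simplified illustration on a flat list of floats rather than on parameter tensors, and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # eps avoids division by zero
    return [g * scale for g in grads], total_norm


grads = [30.0, 40.0]  # joint norm is 50
clipped, norm_before = clip_by_global_norm(grads, max_norm=10.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Every gradient is scaled by the same factor, so the direction of the update is kept and only its size is limited.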

In [105]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss=0
    val_loss=[]
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            source=batch.src
            target=batch.trg
            output=model(source, target, 0)#teacher forcing ratio is zero: the decoder always feeds back its own predictions
            loss=criterion(output[1:].view(-1, output.shape[2]), target[1:].view(-1))
            epoch_loss+=loss.item()
            val_loss.append(epoch_loss)
        
    return(epoch_loss/len(iterator), val_loss)

### Set the path and name used to save the model after training

In [106]:
DIR='models'
model_dir=os.path.join(DIR, 'seq2seq_model.pt')

### Train, Validate and Test model along with analysis

In [107]:
n_epochs=10
clip=10

best_loss=float('inf')

os.makedirs(DIR, exist_ok=True) #create the directory only if it does not exist yet

In [108]:
train_losses=[]
val_losses=[]
for epoch in range(n_epochs):
    train_loss, train_loss_list=train(model, train_it, optimizer, criterion, clip)
    validation_loss, val_loss_list=evaluate(model, valid_it, criterion)
    
    train_losses.append(train_loss)
    val_losses.append(validation_loss)
    
    if(validation_loss <best_loss):
        best_loss=validation_loss
        torch.save(model.state_dict(), model_dir)
    
    print("epoch number= ", epoch)
    print("Train loss= ",train_loss)
    print("Validation loss=", validation_loss)

epoch number=  0
Train loss=  4.66933508171384
Validation loss= 4.383155465126038
epoch number=  1
Train loss=  3.7242734694795985
Validation loss= 4.006883412599564
epoch number=  2
Train loss=  3.2578400767322155
Validation loss= 3.7168424129486084
epoch number=  3
Train loss=  2.9368634633555812
Validation loss= 3.610551595687866
epoch number=  4
Train loss=  2.7077340207961162
Validation loss= 3.633746385574341
epoch number=  5
Train loss=  2.504183604328643
Validation loss= 3.655905544757843
epoch number=  6
Train loss=  2.35830334892357
Validation loss= 3.578988701105118
epoch number=  7
Train loss=  2.218757124199216
Validation loss= 3.623843163251877
epoch number=  8
Train loss=  2.106057360833962
Validation loss= 3.6249634325504303
epoch number=  9
Train loss=  2.0388261615442285
Validation loss= 3.6548089385032654


In [109]:
#if one wants to plot the train and validation loss graph 
# import matplotlib.pyplot as plt
# plt.figure(figsize=(15, 15))
# plt.plot(train_losses, label="train")
# plt.plot(val_losses, label="validation")
# plt.xlabel("epochs")
# plt.ylabel("loss")
# plt.legend()
# plt.title("Train vs Validation graph")

In [111]:
model.load_state_dict(torch.load(model_dir))
test_loss, test_losses=evaluate(model, test_it, criterion)
print("loss=", test_loss)

loss= 3.546379894018173
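
Since the cross-entropy loss is measured with the natural logarithm, its exponential can be read as a perplexity. The sketch below converts the test loss printed above; the variable names are chosen for this example, and because the loss is an average of per-batch means, the result is only an approximate per-token perplexity.

```python
import math

test_loss = 3.546379894018173  # value printed by the evaluation cell above
test_perplexity = math.exp(test_loss)
print(f"loss = {test_loss:.3f} | perplexity = {test_perplexity:.2f}")
```

A perplexity of roughly 35 means the model is, on average, about as uncertain as if it were choosing uniformly among that many tokens at each step.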
