# AIM

Low-level language translation can be done just by looking words up in a dictionary of the respective languages, or by searching for the meaning of each word in the target language. However, a word in one language does not necessarily carry the same meaning in another language once it is translated in a given context. Hence, in this work translation is performed together with context understanding, using an RNN-based seq2seq model to improve the neural language translation process.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F # for softmax operation

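# Field, BucketIterator and Multi30k.splits belong to the legacy torchtext API
# (moved to torchtext.legacy in 0.9 and removed later), so an older torchtext is assumed here
from torchtext.datasets import Multi30k  # dataset
from torchtext.data import Field, BucketIterator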

import spacy

import random
import math
import os

In [2]:
#to ensure results are reproducible, set these
SEED=2031
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic=True

In [3]:
spacy_en=spacy.load("en_core_web_sm")
spacy_ger=spacy.load("de_core_news_sm")

# Data Preprocessing 

In [4]:
#functions to tokenize
def tokenize_english(sentence):
    return [tok.text for tok in spacy_en.tokenizer(sentence)]
def tokenize_german(sentence):
    return [tok.text for tok in spacy_ger.tokenizer(sentence)]

### Functions of the torchtext Field module

The Field class in torchtext provides a handy way to declare how the data is to be processed. Here, Field is used for tokenization and for converting all tokens to lower case. For the sequence-to-sequence model, each sentence is also wrapped with a start-of-string token at the beginning and an end-of-string token at the end.

So, the seq2seq model starts generating tokens as soon as it sees the start-of-string token and continues until it reaches the end-of-string token.
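The effect of these Field settings on a single sentence can be sketched in plain Python. This is only an illustration: the helper name preprocess is hypothetical, and whitespace splitting stands in for the spaCy tokenizers used in this notebook.

```python
def preprocess(sentence, init_token="<sos>", eos_token="<eos>"):
    """Lowercase, split on whitespace, and wrap with boundary tokens."""
    tokens = sentence.lower().split()  # assumption: whitespace split instead of spaCy
    return [init_token] + tokens + [eos_token]

print(preprocess("Two men are cooking"))
# prints: ['<sos>', 'two', 'men', 'are', 'cooking', '<eos>']
```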

In [5]:
source=Field(tokenize=tokenize_german, init_token='<sos>', eos_token='<eos>', lower=True)
target=Field(tokenize=tokenize_english, init_token='<sos>', eos_token='<eos>', lower=True)

In [6]:
train_data, valid_data, test_data=Multi30k.splits(exts=('.de', '.en'), fields=(source, target))

### Corpus building

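A vocabulary is built for the tokens of each language so that every token has an index. These indices are what the model uses internally: here they are looked up in the embedding layers, which is equivalent to multiplying by a one-hot encoded vector. Tokens that appear fewer than min_freq=2 times in the training data are mapped to the unknown token.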

In [7]:
source.build_vocab(train_data, min_freq=2)
target.build_vocab(train_data, min_freq=2)
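As a rough illustration of what building a vocabulary with a minimum frequency involves, here is a minimal sketch in plain Python. The helper build_vocab, the toy corpus and the order of the special tokens are assumptions made for this example, not the torchtext implementation.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map every sufficiently frequent token to an integer index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)  # special tokens take the first indices
    # most frequent tokens first, ties broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

corpus = [["a", "man", "runs"], ["a", "dog", "runs"], ["a", "cat", "sleeps"]]
stoi, itos = build_vocab(corpus)

# tokens seen fewer than min_freq times fall back to the <unk> index
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["a", "cat", "runs"]]
print(itos)     # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'runs']
print(encoded)  # [4, 0, 5]
```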

### Creating Iterators

To create batches out of the dataset, iterators are used. They sort the examples, pad them and move them to the appropriate device. This can be done with bucket iterators. These iterators return batches of data that have a src attribute and a trg attribute, with all tokens already converted to their index form.

###### Advantage of using Bucket iterator: 

Bucket iterators build the batches in such a way that each batch needs a minimum amount of padding, by collecting sentences of similar length together.
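The saving in padding can be illustrated with a small experiment in plain Python. The sentence lengths below are synthetic, and a single global sort is used as a simplification of what a bucketing iterator does.

```python
import random

def make_batches(lengths, batch_size):
    """Split a list of sentence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def padding_needed(batches):
    """Count the pad tokens needed to pad every sentence to its batch maximum."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

rng = random.Random(0)
lengths = [rng.randint(5, 40) for _ in range(1024)]  # synthetic sentence lengths

unsorted_pad = padding_needed(make_batches(lengths, 128))
bucketed_pad = padding_needed(make_batches(sorted(lengths), 128))
print("padding without bucketing:", unsorted_pad)
print("padding with bucketing:   ", bucketed_pad)
```

Grouping sentences of similar length means each batch is padded only up to a length close to that of its own sentences, rather than up to the longest sentence that happens to land in the batch.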

In [8]:
Batch_size=128
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
device.type

'cuda'

In [9]:
#iterator
train_it, valid_it, test_it=BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=Batch_size, device=device)

# Creating the model

Building the seq2seq model with attention

### 1. Encoder

Within the encoder class, a GRU is used. It is a single-layer GRU, so no dropout is applied inside the GRU itself; dropout is applied only to the embeddings. At each step the GRU takes the previous hidden state and the current input token that is passed to the encoder. The Encoder class is a subclass of nn.Module. The GRU used here is bidirectional.

Here, the encoder hidden dim and the decoder hidden dim are allowed to differ (in this notebook both are set to 512).

##### The encoder's final hidden state is the concatenation of the forward and the backward hidden states. However, the decoder only has a forward unit (one direction). So, here, the last two states are concatenated and passed through the dense layer with a tanh activation to reshape the output to the dimension required by the decoder.

Unlike the previous architecture, all the outputs of the encoder (all the hidden states generated while reading the input) are returned. These outputs are required by the attention module to create the associations.

In [10]:
class Encoder(nn.Module):
    def __init__(self, inp_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout): 
        super().__init__()
        self.input_dim=inp_dim #vocabulary size of the source language (german) 
        self.emb_dim=emb_dim #output dimension of embedding layer
        self.encoder_hid_dim= enc_hid_dim # GRUs output dimension
        self.decoder_hid_dim= dec_hid_dim
        self.embedding=nn.Embedding(inp_dim, emb_dim)
        self.rnn=nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)#Bidirectional GRU
        self.dropout=nn.Dropout(dropout) #using this for embedding layer
        self.dense=nn.Linear(enc_hid_dim*2, dec_hid_dim)
    
    def forward(self, source): 
        embedded=self.dropout(self.embedding(source))#shape of the source= [sentence len, batch_size]
        outputs, hidden=self.rnn(embedded)#shape of embedded is= [sentence len, batch size, emb_dim]
        #shape of outputs=[sentence len, batch_size, enc_hid_dim*n directions], here no. of directions is two as the GRU is bidirectional
        #shape of hidden=[n layers*n directions, batch_size, enc_hid_dim]
        #hidden[-2] is the last forward state and hidden[-1] is the last backward state
        hidden=torch.tanh(self.dense(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)))
        
        return outputs, hidden  

### Attention module

In [11]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.encoder_hid_dim= enc_hid_dim # GRUs output dimension
        self.decoder_hid_dim= dec_hid_dim
        self.attention=nn.Linear((enc_hid_dim*2)+dec_hid_dim, dec_hid_dim)
        self.vec=nn.Parameter(torch.rand(dec_hid_dim))
        
    def forward(self, hidden, encoder_outputs):
        #hidden size=[batch_size, dec_hid_dim]
        #encoder_outputs size=[src set len, batch_size, encoder_hid_dim*2]
        batch_size=encoder_outputs.shape[1] #cols
        src_len=encoder_outputs.shape[0] #rows
        
        #now repeat the hidden state till the source length number of times
        hidden=hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs=encoder_outputs.permute(1, 0, 2)
        #hidden_shape=[batch_size, src_sent_len, dec_hid_size]
        #encoder_outputs_shape=[batch_size, src_sent_len, enc_hid_dim*2]
        #concatenate them and pass them to dense layer
        association=torch.tanh(self.attention(torch.cat((hidden, encoder_outputs), dim=2)))
        #association_shape=[batch_size, src_sent_len, dec_hid_size]#cuz of dense
        
        #reshape the association tensor
        association=association.permute(0, 2, 1)
        #association shape=[batch_size, dec_hid_dim, src_sent_len]
        #reshape the vec tensor
        vec=self.vec.repeat(batch_size, 1).unsqueeze(1)
        #vec shape= [batch_size, 1, dec_hid_dim]
        #obtain the product of the vec and association and squeeze the extra dimension
        attention=torch.bmm(vec, association).squeeze(1)
        #attention shape=[batch_size, src_sent_len]
        return (F.softmax(attention, dim=1))
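The arithmetic behind the attention weights can be sketched without PyTorch. In this hedged example the scores are made-up numbers standing in for the output of the learned scoring layers, and the helper names softmax and attend are hypothetical. It shows the softmax that ends the Attention module and the weighted sum that the decoder later computes with a batch matrix multiplication.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, encoder_outputs):
    """Return the attention weights and the weighted sum of encoder outputs."""
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context

# three source positions, each with a 2-dimensional encoder output (toy values)
scores = [2.0, 0.5, -1.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(scores, encoder_outputs)
print(weights)
print(context)
```

The source position with the highest score receives the largest weight, so its encoder output dominates the resulting context vector.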

### 2. Decoder


This step differs from the conventional seq2seq model. Here, the initial hidden state is the output from the encoder (context vector). In a conventional model, the decoder has to decode the current token based only on its memory of the context vector that it saw at the very first step. This limits the decoder's capability to generate tokens after n time steps.

##### The intuition used here is:

A context vector is passed along with the previous hidden state and the current target token at each time step. A conventional seq2seq model generates its output from the previous hidden state and the current token only. Here, the output is produced from the current input token, the context vector and the previous hidden state. In this notebook the context vector is the attention-weighted sum of the encoder outputs, recomputed at every time step. Thus the GRU in the decoder is constructed with the dimensions (enc_hid_dim*2+emb_dim, dec_hid_dim).

A linear (dense) layer is the top layer of the decoder.

###### In the forward pass:

The embedding of the current input token and the context vector are concatenated before being fed to the GRU.
Also, the current hidden state, the current token embedding and the context vector are concatenated before being passed to the dense layer.

In [12]:
class Decoder(nn.Module):
    def __init__(self, out_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.emb_dim=emb_dim #output dimension of embedding layer
        self.encoder_hid_dim= enc_hid_dim # GRUs output dimension
        self.out_dim=out_dim #target corpus dim
        self.attention=attention
        self.embedding=nn.Embedding(out_dim, emb_dim)
        self.rnn=nn.GRU((enc_hid_dim*2)+emb_dim, dec_hid_dim)
        self.dense=nn.Linear((enc_hid_dim*2)+dec_hid_dim+emb_dim, out_dim)
        self.dropout=nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs):
        #input size= [batch_size]
        #(context) hidden size= [batch_size, dec_hid_dim] 
        #encoder outputs size= [source_sent_len, batch_size, enc_hid_dim*2]
        
        input=input.unsqueeze(0)
        #input size=[1, batch_size]
        embedded=self.dropout(self.embedding(input)) #out shape = [1, batch_size, emb_dim]
        a=self.attention(hidden, encoder_outputs)
        #a shape=[batch_size, src_len]
        a=a.unsqueeze(1)
        #a shape=[batch_size, 1, src_len]
        encoder_outputs=encoder_outputs.permute(1, 0, 2)
        #encoder_outputs shape=[batch_size, source_sent_len, enc_hid_dim*2]
        
        #performing batch matrix mult
        weighted=torch.bmm(a, encoder_outputs)
        #weighted_shape=[batch_size, 1, enc_hid_dim*2]
        
        #concatenate weighted with embedded input representation to be fed into the GRU along with prev hidd state
        weighted=weighted.permute(1, 0, 2)
        #weighted_shape=[1, batch_size, enc_hid_dim*2]
        
        rnn_input=torch.cat((embedded, weighted), dim=2)
        #rnn_input_shape=[1, batch_size, (enc_hid_dim*2)+emb_dim]
        
        output, hidden=self.rnn(rnn_input, hidden.unsqueeze(0))
        #in general: output=[sent_len, batch_size, dec_hid_dim*n_dir], hidden=[n_layers*n_dir, batch_size, dec_hid_dim]
        #here sent_len, n_layers and n_dir are all 1, so:
        #output=[1, batch_size, dec_hid_dim]
        #hidden=[1, batch_size, dec_hid_dim]
        #weighted_shape=[1, batch_size, enc_hid_dim*2]

        #removing the extra unwanted dimension
        embedded=embedded.squeeze(0)
        output=output.squeeze(0)
        weighted=weighted.squeeze(0)
        
        output=self.dense(torch.cat((output, weighted, embedded), dim=1))
        #output=[batch_size, out_dim]
        return output, hidden.squeeze(0)

### Seq2Seq class which binds the encoder and decoder 

In [13]:
class  seq2seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder=encoder
        self.decoder=decoder
        self.device=device
        
    def forward(self, source, target, tfr=0.5):
        #tfr= teacher forcing ratio: the probability of feeding the actual ground truth token from the target,
        #rather than the decoder's own prediction, as the input for predicting the next target token
        #if the tfr is 0.25, the ground truth token is used in 25% of cases and the decoder's prediction in the remaining 75%
        #source shape=[sen_len, batch_size]
        #target shape=[sen_len, batch_size]
        batch_size=target.shape[1]
        max_len=target.shape[0]
        target_vec_size=self.decoder.out_dim
        #initialize the tensor that holds output from decoder (i.e. the tensor initialized to zeros)
        outputs=torch.zeros(max_len, batch_size, target_vec_size).to(self.device)
        
        encoder_outputs, hidden =self.encoder(source)#forms the initial hidden state for the decoder
        input=target[0, :]
        
        for t in range(1, max_len):
            output, hidden=self.decoder(input, hidden, encoder_outputs)
            outputs[t]=output
            tforce=random.random()<tfr
            top1=output.max(1)[1]
            input=(target[t] if tforce else top1)
            
        return outputs
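The teacher forcing decision inside the loop is a biased coin flip. The following sketch isolates it in plain Python. The helper next_decoder_input is hypothetical, and the strings "gt" and "pred" stand in for token indices.

```python
import random

def next_decoder_input(ground_truth, prediction, tfr, rng=random):
    """Pick the ground truth with probability tfr, else the model's prediction."""
    return ground_truth if rng.random() < tfr else prediction

rng = random.Random(2031)
picks = [next_decoder_input("gt", "pred", 0.5, rng) for _ in range(10_000)]
share = picks.count("gt") / len(picks)
print("share of ground truth inputs:", share)  # close to 0.5
```

With tfr=1.0 the decoder always receives the ground truth token, and with tfr=0 it always receives its own prediction, which is the setting used for evaluation in this notebook.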

### Defining the hyperparameters and instantiating the model

In [14]:
inp_dim=len(source.vocab)
out_dim=len(target.vocab)
encoder_emb_dim=256
decoder_emb_dim=256
enc_hid_dim=512
dec_hid_dim=512
encoder_dropout=0.5
decoder_dropout=0.5

attention=Attention(enc_hid_dim, dec_hid_dim)
encoder=Encoder(inp_dim, encoder_emb_dim, enc_hid_dim, dec_hid_dim, encoder_dropout)
decoder=Decoder(out_dim, decoder_emb_dim, enc_hid_dim, dec_hid_dim, decoder_dropout, attention)
model=seq2seq(encoder, decoder, device).to(device)

In [15]:
model

seq2seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense): Linear(in_features=1024, out_features=512, bias=True)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attention): Linear(in_features=1536, out_features=512, bias=True)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (dense): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
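The layer sizes in the printout above can be checked by hand from the hyperparameters. The short calculation below only repeats the arithmetic used in the class constructors; the variable names on the left are chosen for this example.

```python
emb_dim = 256
enc_hid_dim = 512
dec_hid_dim = 512

# the bidirectional encoder concatenates forward and backward states
encoder_dense_in = enc_hid_dim * 2
# attention sees the decoder hidden state next to each encoder output
attention_in = enc_hid_dim * 2 + dec_hid_dim
# the decoder GRU input is the weighted encoder output plus the token embedding
decoder_rnn_in = enc_hid_dim * 2 + emb_dim
# the final dense layer sees GRU output, weighted encoder output and embedding
decoder_dense_in = enc_hid_dim * 2 + dec_hid_dim + emb_dim

print(encoder_dense_in, attention_in, decoder_rnn_in, decoder_dense_in)
# prints: 1024 1536 1280 1792
```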

Here, the loss is not calculated for the padding tokens, since they are not part of the actual source and target sequences. Thus, the padding index is ignored while calculating the loss.

In [16]:
optimizer=optim.Adam(model.parameters())

padded_index=target.vocab.stoi['<pad>']
criterion=nn.CrossEntropyLoss(ignore_index=padded_index)
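What ignoring the padding index means for the averaged loss can be sketched in plain Python. This is a simplified illustration: the helper masked_cross_entropy is hypothetical and starts from log-probabilities, whereas nn.CrossEntropyLoss takes raw scores; the padding index 1 and the probabilities are made up.

```python
import math

def masked_cross_entropy(log_probs, targets, ignore_index):
    """Average the negative log-likelihood over targets that are not ignored."""
    losses = [-lp[t] for lp, t in zip(log_probs, targets) if t != ignore_index]
    return sum(losses) / len(losses)  # assumes at least one non-ignored target

PAD = 1  # assumed padding index for this toy example
probs = [
    [0.50, 0.25, 0.25],  # target 0
    [0.25, 0.25, 0.50],  # target 2
    [0.10, 0.80, 0.10],  # target is padding, so it must not count
]
log_probs = [[math.log(p) for p in row] for row in probs]
targets = [0, 2, PAD]

loss = masked_cross_entropy(log_probs, targets, ignore_index=PAD)
print(loss)  # equals log(2), about 0.693, the average over the two real tokens
```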

### Build train, validation functions

In [17]:
def train(model, iterator, optimizer, criterion, clip): #clip= to prevent gradient explosion
    model.train()
    epoch_loss=0
    train_loss=[]
    for i, batch in enumerate(iterator):
        source=batch.src
        target=batch.trg #size= [sentlen, batch_size]
        optimizer.zero_grad()
        output=model(source, target)#size= [sent_len, batch_size, output_dim]
        #flatten the output using the view method and skip the first position (the <sos> token) of the output and the target before calculating the loss
        loss=criterion(output[1:].view(-1, output.shape[2]), target[1:].view(-1)) #so the first row (time step 0) of the output and the target is removed before they are passed into the criterion
        #shapes passed to the loss fn: output=[(sent_len-1)*batch_size, output_dim]
        #target=[(sent_len-1)*batch_size]
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss+=loss.item()
        train_loss.append(epoch_loss) #running (cumulative) loss after each batch
        
    return(epoch_loss/len(iterator), train_loss)
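The idea behind clipping the gradient norm can be sketched in plain Python. This is an illustration of the rescaling step under the assumption that all gradients are flattened into one list; the helper clip_by_global_norm is hypothetical and is not the PyTorch routine itself.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale the gradients so that their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=10.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)  # 50.0
print(norm_after)   # just under 10

small, _ = clip_by_global_norm([0.3, 0.4], max_norm=10.0)
print(small)        # unchanged, since the norm is already below max_norm
```

The direction of the gradient is preserved and only its length is limited, which is what keeps a single large batch gradient from destabilising training.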

In [18]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss=0
    val_loss=[]
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            source=batch.src
            target=batch.trg
            output=model(source, target, 0)#teacher forcing ratio is zero, so the decoder always feeds back its own predictions
            loss=criterion(output[1:].view(-1, output.shape[2]), target[1:].view(-1))
            epoch_loss+=loss.item()
            val_loss.append(epoch_loss)
        
    return(epoch_loss/len(iterator), val_loss)

###  Set the path and name of the model file to save it after training

In [19]:
DIR='models'
model_dir=os.path.join(DIR, 'seq2seq_model_atten.pt')

### Train, Validate and Test model along with analysis

In [20]:
n_epochs=10
clip=10

best_loss=float('inf')

os.makedirs(DIR, exist_ok=True) #create the directory only if it does not exist yet

In [21]:
train_losses=[]
val_losses=[]
for epoch in range(n_epochs):
    train_loss, train_loss_list=train(model, train_it, optimizer, criterion, clip)
    validation_loss, val_loss_list=evaluate(model, valid_it, criterion)
    
    train_losses.append(train_loss)
    val_losses.append(validation_loss)
    
    if(validation_loss <best_loss):
        best_loss=validation_loss
        torch.save(model.state_dict(), model_dir)
    
    print("epoch number= ", epoch)
    print("Train loss= ",train_loss)
    print("Validation loss=", validation_loss)

epoch number=  0
Train loss=  4.461344998313467
Validation loss= 4.068058580160141
epoch number=  1
Train loss=  3.4776270515593137
Validation loss= 3.7529342472553253
epoch number=  2
Train loss=  3.0673553365967874
Validation loss= 3.626277804374695
epoch number=  3
Train loss=  2.7754533637462733
Validation loss= 3.5566119253635406
epoch number=  4
Train loss=  2.560871342730417
Validation loss= 3.5911119282245636
epoch number=  5
Train loss=  2.3784026359146386
Validation loss= 3.53265181183815
epoch number=  6
Train loss=  2.2311025936698075
Validation loss= 3.590732663869858
epoch number=  7
Train loss=  2.1109483252537933
Validation loss= 3.588593691587448
epoch number=  8
Train loss=  2.0279367715776755
Validation loss= 3.6424492597579956
epoch number=  9
Train loss=  1.9450143491644165
Validation loss= 3.567496955394745
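Since the checkpoint is only overwritten when the validation loss improves, the log above determines which epoch ends up in the saved file. A small check, with the validation losses copied from the output above:

```python
val_losses = [
    4.068058580160141, 3.7529342472553253, 3.626277804374695,
    3.5566119253635406, 3.5911119282245636, 3.53265181183815,
    3.590732663869858, 3.588593691587448, 3.6424492597579956,
    3.567496955394745,
]

# index of the smallest validation loss, i.e. the last epoch that was saved
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
print(best_epoch, val_losses[best_epoch])
# prints: 5 3.53265181183815
```

The training loss keeps falling after epoch 5 while the validation loss does not improve further, which suggests that the model starts to overfit from that point on.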


In [22]:
#if one wants to plot the train and validation loss graph 
# import matplotlib.pyplot as plt
# plt.figure(figsize=(15, 15))
# plt.plot(train_losses, label="train")
# plt.plot(val_losses, label="validation")
# plt.xlabel("epochs")
# plt.ylabel("loss")
# plt.legend()
# plt.title("Train vs Validation graph")
In [23]:
model.load_state_dict(torch.load(model_dir))
test_loss, test_losses=evaluate(model, test_it, criterion)
print("loss=", test_loss)

loss= 3.520799547433853
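One common way to read a cross-entropy loss is to convert it to perplexity by exponentiating it. The notebook does not report this metric, so the following is only an optional sketch using the test loss printed above.

```python
import math

test_loss = 3.520799547433853  # value copied from the output above
perplexity = math.exp(test_loss)
print("perplexity=", perplexity)
```

A lower perplexity means that the model is, on average, less uncertain about the next target token.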
