#### Objective: Apply a sequence-to-sequence (Seq2Seq) model to English-to-Hindi machine translation using the IITB English-Hindi corpus
####  1. In this notebook we build a simple encoder from a 2-layer LSTM with dropout and pass its final hidden state and cell state to the decoder
####  2. The decoder takes the hidden and cell states from the encoder, together with the target sentence one token at a time, as input to make its predictions
####  3. In the Seq2Seq architecture, each training step feeds the decoder either the ground-truth token (teacher forcing) or the decoder's own previous prediction. Teacher forcing speeds up the convergence of the algorithm


### About the Dataset

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. The corpus was used at the Workshop on Asian Language Translation Shared Task in 2016 and 2017 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.


Please refer to http://www.cfilt.iitb.ac.in/iitb_parallel/ for more details on the data source and for downloading the datasets.

In [25]:
### Import the required packages 

In [1]:
import pandas as pd  # DataFrame manipulation
import os  # OS operations such as changing directories

import spacy  # spaCy tokeniser, used when processing the English text
import numpy as np  # numeric operations on arrays and matrices


# PyTorch and torchtext
import torch
from torchtext import data  # data-processing pipeline (Field, TabularDataset, BucketIterator)
SEED = 1234
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  # make cuDNN deterministic so experiments are reproducible
import torch.nn as nn
import torch.optim as optim  # optimisers such as Adam and SGD
import torch.nn.functional as F


import random
from sklearn.metrics import classification_report
from matplotlib import pyplot as plt
import pyprind
%matplotlib inline  
import time
import re ### This will help us in writing regex for cleaning our data

In [2]:
path = 'C:\\Users\\ashwinku\\Desktop\\Pytorch\\Neural_Machine_Translation\\Data\\parallel\\'

#### Define a function that takes the English file and the Hindi file and returns a DataFrame

In [4]:
def readfile(englishfile,hindifile):
    '''Read the parallel English and Hindi files and return them as one DataFrame.'''
    # Read each file into a list with one sentence per line; `with` closes the file afterwards
    with open(englishfile,encoding='utf-8') as f_en:
        english_list = list(f_en)
    with open(hindifile,encoding='utf-8') as f_hi:
        hindi_list = list(f_hi)
    print ("Length of hindi list is :",len(hindi_list))
    print ("Length of english list is :",len(english_list))
    # Dictionary of lists, one entry per column
    cols_dict = {'hindi_text': hindi_list, 'english_text':english_list}
    # Create the DataFrame from cols_dict
    df_name = pd.DataFrame(cols_dict)
    print ("Shape of data is :",df_name.shape)
    print ("Columns of data is :",df_name.columns)
    return df_name
    
    

### Call the function on the training dataset and the dev dataset

In [26]:
hin_en_df = readfile('IITB.en-hi.en', 'IITB.en-hi.hi')
dev_en_df = readfile('dev.en','dev.hi')

Length of hindi list is : 1561840
Length of english list is : 1561840
Shape of data is : (1561840, 2)
Columns of data is : Index(['hindi_text', 'english_text'], dtype='object')
Length of hindi list is : 520
Length of english list is : 520
Shape of data is : (520, 2)
Columns of data is : Index(['hindi_text', 'english_text'], dtype='object')


### Let's inspect the DataFrame and check a few rows to make sure the data is aligned

In [6]:
pd.options.display.max_colwidth = 100
print (hin_en_df.iloc[10000,[0,1]])
print (hin_en_df.iloc[987654,[0,1]])
print (hin_en_df.iloc[1560000,[0,1]])

hindi_text       बनाएँ\n
english_text    Create\n
Name: 10000, dtype: object
hindi_text         पुनरावर्ती आमवात आक्रमण प्राय एक से तीन संधियों को सम्मिलित करता है\n
english_text    Attacks of palindromic rheumatism usually involve one to three joints.\n
Name: 987654, dtype: object
hindi_text        हमें पाठ्यक्रम में विषय के रूप में शामिल करके मानव अधिकारों के प्रति जागरूकता बढ़ानी चाहिए।\n
english_text    We must increase awareness for human rights by including it as a subject in school curricula.\n
Name: 1560000, dtype: object
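
Manual spot checks like these can be complemented by a simple automated check. The following is a minimal sketch and not part of the original pipeline: the helper name `suspicious_pairs` and the ratio threshold of 3.0 are assumptions chosen for illustration. It flags sentence pairs where one side is empty or where the token counts differ by a large factor, which is often a sign of misalignment.

```python
def suspicious_pairs(english_lines, hindi_lines, max_ratio=3.0):
    """Return indices of sentence pairs that look misaligned (hypothetical helper)."""
    flagged = []
    for i, (en, hi) in enumerate(zip(english_lines, hindi_lines)):
        n_en, n_hi = len(en.split()), len(hi.split())
        if n_en == 0 or n_hi == 0:
            flagged.append(i)          # one side is empty
        elif max(n_en, n_hi) / min(n_en, n_hi) > max_ratio:
            flagged.append(i)          # lengths differ by more than max_ratio
    return flagged


# Toy data: pair 0 is fine, pair 1 has very different lengths, pair 2 has an empty side
english = ["Create\n", "We must increase awareness for human rights.\n", "\n"]
hindi = ["बनाएँ\n", "हाँ\n", "कुछ पाठ\n"]
print(suspicious_pairs(english, hindi))
```

A length ratio is only a rough heuristic; flagged pairs are candidates for inspection rather than confirmed errors.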


In [7]:
## Save the dev set as CSV so that torchtext can load it as the validation split

dev_en_df.to_csv("dev.csv")

In [None]:
### With only an 8GB GPU, we were getting memory errors on more than 20% of the data
    1. So we draw a 20% random sample of the training data

In [8]:
### Create the 20% sample and store it as CSV for processing in the model
hin_en_df.sample(frac =0.20,random_state=1234).to_csv("sample.csv")

In [9]:
pd.read_csv("sample.csv").shape

(312368, 3)
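
The shape follows from the sampling fraction: 20% of 1,561,840 rows is 312,368, and the third column is the DataFrame index that `to_csv` writes by default. The sketch below illustrates seeded sampling with the standard library instead of pandas, so the rows it selects differ from the ones `DataFrame.sample` picks; `sample_indices` is a hypothetical helper used only to show that a fixed seed makes the sample reproducible.

```python
import random

def sample_indices(n_rows, frac, seed):
    """Pick round(frac * n_rows) distinct row indices, reproducibly for a given seed."""
    k = round(frac * n_rows)
    return random.Random(seed).sample(range(n_rows), k)

first = sample_indices(1561840, 0.20, seed=1234)
second = sample_indices(1561840, 0.20, seed=1234)
print(len(first))        # number of sampled rows
print(first == second)   # the same seed gives the same sample
```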

### We will define the data cleaning pipeline for Hindi as well as English
    1. For Hindi we may later want to use the Hindi tokeniser from the indic-nlp library, but initially let's use a simple whitespace-based tokeniser, as it is tokenising correctly
    2. Do a little cleaning on the English and Hindi text separately
    3. We will use these tokenisation functions when processing the data in torchtext

In [10]:
### We will use the spaCy tokeniser and disable the parser and other components for speed
nlp = spacy.load('en_core_web_sm',disable=['parser', 'tagger', 'ner'])

###  Define the tokenizer function to be used later on
def tokenizer(s): 
    return [w.text.lower() for w in nlp(corp_clean(s))]

### Every text should be cleaned before it is tokenised
def corp_clean(text):
#     text = re.sub(r'[^A-Za-z0-9]+', ' ', text) # remove non alphanumeric character
    text = re.sub(r'https?:/\/\S+', ' ', text) # remove links
    text = text.replace("\\"," ") # replace backslashes with spaces
    text = re.sub(r'\n',' ',text) # replace newline characters with spaces
#     print ("English text is ",text)
    return text.strip()


def tokenizer_hindi(j):
    return [w.lower().strip() for w in hindi_clean(j).split(" ") if w != '']

def hindi_clean(text1):
    text1 = re.sub(r'https?:/\/\S+', ' ', text1)
    text1 = text1.replace("\\"," ")
    text1 = re.sub(r'\n',' ',text1)
    text1 = re.sub('।',' । ',text1)
#     print ("hindi token is :",text1)
    return text1.strip()
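
To see what the Hindi cleaning and tokenisation produce, here is a small self-contained check. It repeats the two Hindi functions from the cell above so that it runs on its own; the sample sentences and the example URL are made up for illustration.

```python
import re

def hindi_clean(text1):
    text1 = re.sub(r'https?:/\/\S+', ' ', text1)  # remove links
    text1 = text1.replace("\\", " ")              # replace backslashes with spaces
    text1 = re.sub(r'\n', ' ', text1)             # replace newlines with spaces
    text1 = re.sub('।', ' । ', text1)             # make the danda (sentence end) its own token
    return text1.strip()

def tokenizer_hindi(j):
    # Split on single spaces and drop the empty strings left by repeated spaces
    return [w.lower().strip() for w in hindi_clean(j).split(" ") if w != '']

print(tokenizer_hindi("मानव अधिकार।\n"))
print(tokenizer_hindi("देखें http://example.com अभी।"))
```

The danda is split off as a separate token, in the same way a full stop would be in English, and the link is dropped entirely.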
    

#### Define the data cleaning pipeline using torchtext Field and TabularDataset
    1. Add <sos> and <eos> tokens to the English as well as the Hindi sentences
    2. Use a different tokeniser for each language

In [11]:
# The torchtext Field class lets us define how we want to process our data

### Definition for processing the English (source) field
eng_field = data.Field(sequential=True, init_token = '<sos>',eos_token = '<eos>',
                       tokenize=tokenizer,  use_vocab=True )

### Definition for processing the Hindi (target) field
hindi_field = data.Field(sequential=True,init_token = '<sos>',eos_token = '<eos>', # sequential=True marks the field as sequence data
                       tokenize=tokenizer_hindi,  use_vocab=True)


### Map each CSV column, in order, to the field that processes it
train_val_fields = [('unnamed', None), # index column written by to_csv; not needed, so no processing
    ('hindi_text', hindi_field), # Hindi sentences (target)
    ('eng_text', eng_field), # English sentences (source)
                   ]

### Read the tabular data and create the train and validation datasets (the validation set is bound to the name `test`)
train, test = data.TabularDataset.splits(path=path, 
                                            format='csv', 
                                            train='sample.csv', 
                                            validation='dev.csv', 
                                            fields=train_val_fields, 
                                            skip_header=True)

### Let's create the vocab for both Hindi and English. For Hindi we use a high frequency threshold to reduce the vocab size and, with it, the size of the model's output layer and predictions

In [12]:
### Build the English vocabulary from the training data, keeping tokens that occur at least twice
eng_field.build_vocab(train, min_freq = 2)
# Build the Hindi vocabulary, keeping only tokens that occur at least 30 times
hindi_field.build_vocab(train,min_freq = 30)

### Select the device for the batches and the model (falls back to CPU if CUDA is unavailable)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


#### Print the number of distinct English tokens seen (vocab.freqs is counted before the min_freq filter)
print (" The number of distinct vocab is :",len(eng_field.vocab.freqs))
#### Print the number of distinct Hindi tokens seen (also before the min_freq filter)
print (" The number of distinct vocab is :",len(hindi_field.vocab.freqs))

#### Check the index of the <sos> token in the Hindi vocabulary
print ("Indice of word the is :",hindi_field.vocab.stoi['<sos>'])



### Create the batch iterators over the train and validation datasets
train_iterator, test_iterator = data.BucketIterator.splits(datasets=(train, test), # the train and validation TabularDatasets
                                            batch_size = 32,  # batch size for both iterators
                                            sort_key=lambda x: len(x.eng_text), # bucket examples by English sentence length
                                            device=device, # torch.device on which the batch tensors are created
                                            sort_within_batch=True, 
                                            repeat=False)



 The number of distinct vocab is : 99971
 The number of distinct vocab is : 209135
Indice of word the is : 2
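
Note that the counts printed above come from `vocab.freqs`, which holds every distinct token seen in the training data before the `min_freq` filter is applied. The vocabularies the model actually uses are smaller: 53,098 English and 9,711 Hindi entries, as the model summary further down shows. The sketch below illustrates the idea with the standard library only. `build_vocab` and `numericalize` are hypothetical helpers, not the torchtext implementation, and the order of the special tokens is an assumption chosen to match the `<sos>` index of 2 printed above.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map tokens to indices, keeping only tokens seen at least min_freq times."""
    freqs = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically
    for tok, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return freqs, itos, stoi

def numericalize(sentence, stoi):
    """Replace each token by its index; tokens outside the vocabulary map to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in sentence]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "bird", "flew", "away"]]
freqs, itos, stoi = build_vocab(corpus, min_freq=2)
print(len(freqs), len(itos))                        # distinct tokens seen vs. vocabulary size
print(numericalize(["the", "bird", "sat"], stoi))   # "bird" is rare, so it becomes <unk>
```

A higher threshold shrinks the vocabulary, and with it the embedding and output layers, at the cost of mapping more rare words to `<unk>`.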


#### Let's start defining the model
1. An encoder with LSTM layers, whose final hidden and cell states are passed to the decoder.
2. A decoder that uses those cell and hidden states as its initial state

In [13]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

In [14]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

### Sequence-to-Sequence Model
1. receiving the input/source sentence
2. using the encoder to produce the context vectors
3. using the decoder to produce the predicted output/target sentence

In [15]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
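
The teacher-forcing decision in the loop above is a single biased coin flip per time step. As a minimal sketch of just that decision, separated from the model, the hypothetical helper `next_decoder_input` below picks between the ground-truth token and the predicted token; the string tokens and the seeded generator are stand-ins for illustration.

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability teacher_forcing_ratio, else the prediction."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token

rng = random.Random(1234)
trials = 10000
used_truth = sum(
    next_decoder_input("truth", "pred", 0.5, rng) == "truth" for _ in range(trials)
)
print(used_truth / trials)   # close to 0.5
```

With a ratio of 1.0 the decoder always sees the ground truth, and with 0.0 it always sees its own predictions, which is the situation it faces at inference time.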

#### Define the model parameters and load the model to cuda

In [28]:
INPUT_DIM = len(eng_field.vocab)
OUTPUT_DIM = len(hindi_field.vocab)
ENC_EMB_DIM = 128
DEC_EMB_DIM = 128
HID_DIM = 256
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

### Initialise the parameters uniformly between -0.08 and 0.08; this is based on the seq2seq model paper

In [29]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(53098, 128)
    (rnn): LSTM(128, 256, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5)
  )
  (decoder): Decoder(
    (embedding): Embedding(9711, 128)
    (rnn): LSTM(128, 256, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=256, out_features=9711, bias=True)
    (dropout): Dropout(p=0.5)
  )
)

In [30]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 12,378,479 trainable parameters
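
This total can be checked by hand from the layer sizes in the model summary. The sketch below is a worked calculation rather than model code: `lstm_params` is a hypothetical helper that counts, for each LSTM layer, four gates with input weights, recurrent weights, and the two bias vectors that `nn.LSTM` keeps per layer.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional LSTM with two bias vectors per layer."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * (in_size + hidden_size)   # input and recurrent weights
        total += 2 * 4 * hidden_size                         # bias_ih and bias_hh
    return total

INPUT_DIM, OUTPUT_DIM = 53098, 9711      # vocabulary sizes from the model summary
EMB_DIM, HID_DIM, N_LAYERS = 128, 256, 2

encoder = INPUT_DIM * EMB_DIM + lstm_params(EMB_DIM, HID_DIM, N_LAYERS)
decoder = (OUTPUT_DIM * EMB_DIM + lstm_params(EMB_DIM, HID_DIM, N_LAYERS)
           + HID_DIM * OUTPUT_DIM + OUTPUT_DIM)   # embedding + LSTM + linear weights and bias
print(f"{encoder + decoder:,}")
```

The English embedding table alone accounts for more than half of the parameters, and the output layer for roughly 2.5 million, which is why the vocabulary thresholds matter for model size.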


In [31]:
optimizer = optim.Adam(model.parameters())

In [32]:
hindi_pad_idx = hindi_field.vocab.stoi[hindi_field.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = hindi_pad_idx )
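
Setting `ignore_index` to the padding index means that padded target positions contribute nothing to the loss, and the mean is taken over the remaining positions only. The following is a minimal pure-Python sketch of that behaviour, not the PyTorch implementation; the function name, the toy logits, and the padding index of 1 are assumptions for illustration.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue                      # padding position: skipped entirely
        m = max(row)                      # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count

PAD = 1   # assumed padding index for this toy example
logits = [[2.0, 0.0, 1.0], [0.5, 0.5, 3.0], [9.0, 9.0, 9.0]]
targets = [0, 2, PAD]

with_padding = cross_entropy(logits, targets, ignore_index=PAD)
without_padding = cross_entropy(logits[:2], targets[:2], ignore_index=PAD)
print(with_padding, without_padding)   # the padded position does not change the loss
```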

In [33]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.eng_text
        trg = batch.hindi_text
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
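
The `clip_grad_norm_` call in the training loop rescales all gradients by a common factor whenever their combined L2 norm exceeds `clip`, which keeps the exploding gradients that LSTMs are prone to in check. The sketch below shows the idea on plain Python lists; `clip_by_global_norm` is a hypothetical helper, and the small `eps` in the denominator is an assumption that mirrors the usual safeguard against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]   # two toy "parameters"; their global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is limited.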

In [22]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [23]:
N_EPOCHS = 10
CLIP = 1

import math
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
#     valid_loss = evaluate(model, test_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 25m 46s
	Train Loss: 4.516 | Train PPL:  91.445
Epoch: 02 | Time: 25m 39s
	Train Loss: 3.948 | Train PPL:  51.854
Epoch: 03 | Time: 25m 39s
	Train Loss: 3.685 | Train PPL:  39.835
Epoch: 04 | Time: 25m 40s
	Train Loss: 3.519 | Train PPL:  33.762
Epoch: 05 | Time: 25m 57s
	Train Loss: 3.403 | Train PPL:  30.042
Epoch: 06 | Time: 25m 59s
	Train Loss: 3.319 | Train PPL:  27.626
Epoch: 07 | Time: 26m 0s
	Train Loss: 3.250 | Train PPL:  25.795
Epoch: 08 | Time: 26m 2s
	Train Loss: 3.196 | Train PPL:  24.435
Epoch: 09 | Time: 26m 5s
	Train Loss: 3.149 | Train PPL:  23.318
Epoch: 10 | Time: 26m 2s
	Train Loss: 3.109 | Train PPL:  22.408
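
The perplexity column in this log is the exponential of the mean cross-entropy loss, so a loss of about 3.1 means the model is, on average, roughly as uncertain as if it were choosing uniformly among about 22 tokens. The short check below recomputes it for the first and last epochs; the small differences arise because the logged loss is rounded to three decimals.

```python
import math

# (train loss, perplexity) pairs copied from the first and last epochs of the log
logged = [(4.516, 91.445), (3.109, 22.408)]

for loss, ppl in logged:
    recomputed = math.exp(loss)
    print(f"loss {loss:.3f} -> exp(loss) {recomputed:.3f} (logged PPL {ppl:.3f})")
```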


In [24]:
torch.save(model.state_dict(), 'hi_en_model1.pt')
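
Once the model is saved, translating a new sentence means running the decoder without teacher forcing: start from `<sos>`, feed back the most likely token at each step, and stop at `<eos>` or at a length limit. The sketch below shows only that control flow. The `toy_step` function and its score table are stand-ins for the trained decoder (an assumption for illustration, not the notebook's model); with the real model, a step would embed the token, run the LSTM, and return scores over the Hindi vocabulary.

```python
def greedy_decode(step_fn, state, sos="<sos>", eos="<eos>", max_len=20):
    """Generate tokens one at a time, always feeding back the highest-scoring token."""
    token, output = sos, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)   # argmax over the vocabulary
        if token == eos:
            break
        output.append(token)
    return output

# Toy stand-in for a decoder step: a fixed table of next-token scores
TABLE = {
    "<sos>": {"नमस्ते": 0.9, "दुनिया": 0.1, "<eos>": 0.0},
    "नमस्ते": {"नमस्ते": 0.1, "दुनिया": 0.7, "<eos>": 0.2},
    "दुनिया": {"नमस्ते": 0.1, "दुनिया": 0.1, "<eos>": 0.8},
}

def toy_step(token, state):
    return TABLE[token], state   # the state is passed through unchanged in this toy

print(greedy_decode(toy_step, state=None))
```

The `max_len` limit matters in practice, because a model that never predicts `<eos>` would otherwise loop forever.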

### This is a basic model; we can improve the following in further exercises
1. Use only one context vector, i.e. the hidden state, to reduce the complexity of the model
2. Apply attention to improve the model performance
3. Try some pretrained neural machine translation models