# A Comprehensive Guide to Neural Machine Translation using Seq2Seq Modelling using PyTorch

In this post, we will be building a sequence to sequence deep learning model using PyTorch and TorchText. Here I am doing an German to English neural machine translation. But the same concept can be extended to other problems such as Named Entity Recognition (NER), Text Summarization etc,.

# Table of Contents:
## 1. Introduction
## 2. Data Preparation and Pre-processing
## 3. Long Short Term Memory (LSTM) - Under the Hood
## 4. Encoder Model Architecture (Seq2Seq)¶
## 5. Encoder Code Implementation (Seq2Seq)
## 6. Decoder Model Architecture (Seq2Seq)
## 7. Decoder Code Implementation (Seq2Seq)
## 8. Seq2Seq (Encoder + Decoder) Interface
## 9. Seq2Seq (Encoder + Decoder) Code Implementation
## 10. Seq2Seq Model Training
## 11. Seq2Seq Model Inference

# 1. Introduction

Here I am doing a German to English neural machine translation. But the same concept can be extended to other problems such as Named Entity Recognition (NER), Text Summarization, etc,.

So the Sequence to Sequence (seq2seq) model in this post uses an encoder-decoder architecture, which uses a type of RNN called LSTM (Long Short Term Memory), where the encoder neural network encodes the input german sequence into a single vector, also called as a Context Vector.
This Context Vector is said to contain the abstract representation of the input german sequence.

This vector is then passed into the decoder neural network, which is used to output the corresponding English translation sentence, one word at a time.

# Necessary Imports 

In [1]:
!pip install torchtext==0.6.0 --quiet
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
import numpy as np
import pandas as pd
import spacy
import random
from torchtext.data.metrics import bleu_score
from pprint import pprint
from torch.utils.tensorboard import SummaryWriter
from torchsummary import summary
'''
# Seeding for reproducible results everytime
SEED = 777

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True'''

[K     |████████████████████████████████| 64 kB 1.7 MB/s 
[K     |████████████████████████████████| 1.2 MB 9.0 MB/s 
[?25h

'\n# Seeding for reproducible results everytime\nSEED = 777\n\nrandom.seed(SEED)\nnp.random.seed(SEED)\ntorch.manual_seed(SEED)\ntorch.cuda.manual_seed(SEED)\ntorch.backends.cudnn.deterministic = True'

# 2. Data Preparation & Pre-processing

Loading the SpaCy's vocabulary for our desired languages. SpaCy also supports many languages like french, german etc,.



In [2]:
!python -m spacy download en --quiet
!python -m spacy download de --quiet

[K     |████████████████████████████████| 12.0 MB 4.3 MB/s 
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
[K     |████████████████████████████████| 14.9 MB 1.3 MB/s 
[?25h  Building wheel for de-core-news-sm (setup.py) ... [?25l[?25hdone
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('de_core_news_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.7/dist-packages/de_core_news_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/de
You can now load the model via spacy.load('de')


In [3]:
spacy_german = spacy.load("de")
spacy_english = spacy.load("en")

Now let's create custom tokenization methods for the languages. Tokenization is a process of breaking the sentence into a list of individual tokens (words).

We can make use of PyTorch's TorchText library for data pre-processing and SpaCy for vocabulary building (English and German) & tokenization of our data.

In [4]:
def tokenize_german(text):
  return [token.text for token in spacy_german.tokenizer(text)]

def tokenize_english(text):
  return [token.text for token in spacy_english.tokenizer(text)]

### Sample Run ###

sample_text = "I love machine learning"
print(tokenize_english(sample_text))

['I', 'love', 'machine', 'learning']


Torch text is a powerful library for making the text data ready for variety of NLP tasks. It has all the tools to perform preprocessing on the textual data.

Let's see some of the process it can do,

1. Train/ Valid/ Test Split: partition your data into a specified train/ valid/ test set.

2. File Loading: load the text corpus of various formats (.txt,.json,.csv).
3. Tokenization: breaking sentences into list of words.
4. Vocab: Generate a list of vocabulary from the text corpus.
5. Words to Integer Mapper: Map words into integer numbers for the entire corpus and vice versa.
6. Word Vector: Convert a word from higher dimension to lower dimension (Word Embedding).
7. Batching: Generate batches of sample.

So once we get to understand what can be done in torch text, let's talk about how it can be implemented in the torch text module. Here we are going to make use of 3 classes under torch text.

1. Fields :
> This is a class under the torch text, where we specify how the preprocessing should be done on our data corpus.
2. TabularDataset : 
> Using this class, we can actually define the Dataset of columns stored in CSV, TSV, or JSON format and also map them into integers.
3. BucketIterator :
> Using this class, we can perform padding our data for approximation and make batches with our data for model training.

Here our source language (SRC - Input) is German and target language (TRG - Output) is English. We also add 2 extra tokens "start of sequence" <sos> and "end of sequence" <EOS> for effective model training.

In [5]:
german = Field(tokenize=tokenize_german,
               lower=True,
               init_token="<sos>",
               eos_token="<eos>")

english = Field(tokenize=tokenize_english,
               lower=True,
               init_token="<sos>",
               eos_token="<eos>")

train_data, valid_data, test_data = Multi30k.splits(exts = (".de", ".en"),
                                                    fields=(german, english))

german.build_vocab(train_data, max_size=10000, min_freq=3)
english.build_vocab(train_data, max_size=10000, min_freq=3)

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:02<00:00, 525kB/s]


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 90.8kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 85.8kB/s]


In [6]:
print(f"Unique tokens in source (de) vocabulary: {len(german.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(english.vocab)}")

Unique tokens in source (de) vocabulary: 5376
Unique tokens in target (en) vocabulary: 4556


In [7]:
# dir(english.vocab)

print(english.vocab.__dict__.keys())
print(list(english.vocab.__dict__.values()))
e = list(english.vocab.__dict__.values())
for i in e:
  print(i)

dict_keys(['freqs', 'itos', 'unk_index', 'stoi', 'vectors'])
0
None


In [8]:
word_2_idx = dict(e[3])
idx_2_word = {}
for k,v in word_2_idx.items():
  idx_2_word[v] = k

# Dataset sneek peek

In [9]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

print(train_data[5].__dict__.keys())
pprint(train_data[5].__dict__.values())

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
dict_keys(['src', 'trg'])
dict_values([['ein', 'mann', 'in', 'grün', 'hält', 'eine', 'gitarre', ',', 'während', 'der', 'andere', 'mann', 'sein', 'hemd', 'ansieht', '.'], ['a', 'man', 'in', 'green', 'holds', 'a', 'guitar', 'while', 'the', 'other', 'man', 'observes', 'his', 'shirt', '.']])


After setting the language pre-processing criteria, the next step is to create batches of training, testing and validation data using iterators.

Creating batches is an exhaustive process, luckily we can make use of TorchText's iterator libraries.

Here we are using BucketIterator for effective padding of source and target sentences. We can access the source (german) batch of data using .src attribute and it's correspondign (english) batch of data using .trg attribute.

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 32

train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), 
                                                                      batch_size = BATCH_SIZE, 
                                                                      sort_within_batch=True,
                                                                      sort_key=lambda x: len(x.src),
                                                                      device = device)

## Actual text data before tokenized

In [11]:
count = 0
max_len_eng = []
max_len_ger = []
for data in train_data:
  max_len_ger.append(len(data.src))
  max_len_eng.append(len(data.trg))
  if count < 10 :
    print("German - ",*data.src, " Length - ", len(data.src))
    print("English - ",*data.trg, " Length - ", len(data.trg))
    print()
  count += 1

print("Maximum Length of English sentence {} and German sentence {} in the dataset".format(max(max_len_eng),max(max_len_ger)))
print("Minimum Length of English sentence {} and German sentence {} in the dataset".format(min(max_len_eng),min(max_len_ger)))

German -  zwei junge weiße männer sind im freien in der nähe vieler büsche .  Length -  13
English -  two young , white males are outside near many bushes .  Length -  11

German -  mehrere männer mit schutzhelmen bedienen ein antriebsradsystem .  Length -  8
English -  several men in hard hats are operating a giant pulley system .  Length -  12

German -  ein kleines mädchen klettert in ein spielhaus aus holz .  Length -  10
English -  a little girl climbing into a wooden playhouse .  Length -  9

German -  ein mann in einem blauen hemd steht auf einer leiter und putzt ein fenster .  Length -  15
English -  a man in a blue shirt is standing on a ladder cleaning a window .  Length -  15

German -  zwei männer stehen am herd und bereiten essen zu .  Length -  10
English -  two men are at the stove preparing food .  Length -  9

German -  ein mann in grün hält eine gitarre , während der andere mann sein hemd ansieht .  Length -  16
English -  a man in green holds a guitar while the other

In [12]:
count = 0
for data in train_iterator:
  if count < 1 :
    print("Shapes", data.src.shape, data.trg.shape)
    print()
    print("German - ",*data.src, " Length - ", len(data.src))
    print()
    print("English - ",*data.trg, " Length - ", len(data.trg))
    temp_ger = data.src
    temp_eng = data.trg
    count += 1

Shapes torch.Size([10, 32]) torch.Size([16, 32])

German -  tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2], device='cuda:0') tensor([ 18,   5,  18,   5,   5, 105,   8,   5,  18, 216,   5,   5,   5,   5,
        191,  30,   8,  39,   8,   5,  18,   5,   8, 427,   5,   8,   5,   5,
          5, 298,   5,   0], device='cuda:0') tensor([  30, 1194,   30,   49,   96, 1842,   16,    0,  103,   45,   13,   70,
          26,    0,  308,    7,   16,  104,   16,  683,   45, 2371,   36, 2463,
          26,   26, 4145,  272,   13,  928,   13,   11], device='cuda:0') tensor([ 84,  60,   7,  61,  13, 264,   7,  32,  80,  11, 283,  26,  10, 212,
        137, 175,  68,  48, 283, 367, 261, 921,  80,  11,   7,  45,  68,  13,
         29,  19,   7,  14], device='cuda:0') tensor([  12,    8, 2016,   14,  159,   22,    6, 1409,    7, 3305,   19,   10,
           5,   11,   21,  627,  590,   11,    5,   11,   22,    5,   38, 4345,
           6

In [13]:
temp_eng_idx = (temp_eng).cpu().detach().numpy()
temp_ger_idx = (temp_ger).cpu().detach().numpy()

I just experimented with a batch size of 32 and a sample target batch is shown below. The sentences are tokenized into list of words and indexed according to the vocabulary. The "pad" token gets an index of 1.

Each column corresponds to a sentence indexed into numbers and we have 32 such sentences in a single target batch and the number of rows corresponds to the maximum length of that sentence. Short sentences are padded with 1 to compensate.
The table (Idx.csv) contains the numerical indices of the words, which is later fed into the word embedding and converted into dense representation for Seq2Seq processing.

In [14]:
df_eng_idx = pd.DataFrame(data = temp_eng_idx, columns = [str("S_")+str(x) for x in np.arange(1, 33)])
df_eng_idx.index.name = 'Time Steps'
df_eng_idx.index = df_eng_idx.index + 1 
# df_eng_idx.to_csv('/content/idx.csv')
df_eng_idx

Unnamed: 0_level_0,S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8,S_9,S_10,S_11,S_12,S_13,S_14,S_15,S_16,S_17,S_18,S_19,S_20,S_21,S_22,S_23,S_24,S_25,S_26,S_27,S_28,S_29,S_30,S_31,S_32
Time Steps,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1
1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
2,16,4,16,4,24,110,4,4,16,251,4,4,4,4,157,30,4,7,4,4,16,4,4,59,4,4,4,21,4,216,4,176
3,30,154,30,55,9,2072,14,0,24,50,9,53,24,3148,1096,6,14,77,14,625,666,1922,38,2276,34,24,825,145,9,228,9,10
4,17,152,6,407,325,17,6,586,127,22,10,34,34,3329,6,86,10,10,961,2855,270,10,12,291,6,50,697,9,10,77,41,4
5,6,51,1003,1050,20,191,21,35,6,2061,475,11,11,2525,448,338,692,222,21,13,111,574,24,77,4,761,169,89,36,4,6,0
6,4,4,352,889,245,4019,86,10,623,8,4,1146,33,249,78,17,698,13,1054,1751,12,4,127,13,162,21,4,71,6,1084,4,0
7,29,363,17,8,76,49,23,844,1606,66,1750,0,37,13,40,6,28,31,180,1593,759,240,17,216,22,160,50,18,7,54,25,722
8,11,12,2202,4,49,21,10,60,5,768,2028,41,0,4,6,4,6,1972,69,5,5,12,37,228,4421,3122,102,4,39,4,981,2229
9,25,391,60,82,7,774,1677,7,3,17,13,3,5,24,4,807,7,1851,7,3,3,7,188,11,3049,160,825,0,226,520,5,13
10,1522,6,307,14,318,105,5,47,1,119,1766,1,3,33,398,5,168,847,47,1,1,1016,6,4,5,208,12,5,7,5,3,4


In [15]:
df_eng_word = pd.DataFrame(columns = [str("S_")+str(x) for x in np.arange(1, 33)])
df_eng_word = df_eng_idx.replace(idx_2_word)
# df_eng_word.to_csv('/content/Words.csv')
df_eng_word

Unnamed: 0_level_0,S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8,S_9,S_10,S_11,S_12,S_13,S_14,S_15,S_16,S_17,S_18,S_19,S_20,S_21,S_22,S_23,S_24,S_25,S_26,S_27,S_28,S_29,S_30,S_31,S_32
Time Steps,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1
1,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>
2,two,a,two,a,young,four,a,a,two,five,a,a,a,a,bicycle,men,a,the,a,a,two,a,a,large,a,a,a,an,a,construction,a,there
3,men,couple,men,child,man,toddlers,woman,<unk>,young,women,man,little,young,plush,riders,in,woman,building,woman,bird,females,pilot,group,commercial,boy,young,tattoo,old,man,workers,man,is
4,are,walk,in,putting,hard,are,in,wet,boys,wearing,is,boy,boy,cartoon,in,orange,is,is,pulls,soars,jump,is,of,brick,in,women,artist,man,is,building,walking,a
5,in,up,protective,eye,at,being,an,dog,in,scarves,pulling,and,and,mascot,number,uniforms,blowing,covered,an,with,off,cleaning,young,building,a,wears,doing,stands,standing,a,in,<unk>
6,a,a,gear,makeup,work,entertained,orange,is,suits,on,a,4,girl,posing,riding,are,bubbles,with,inflatable,wings,of,a,boys,with,pool,an,a,next,in,frame,a,<unk>
7,blue,set,are,on,over,by,shirt,splashing,wrestle,their,wheelbarrow,<unk>,playing,with,down,in,while,red,boat,spread,swings,window,are,construction,wearing,"""",women,to,the,for,white,do
8,and,of,sawing,a,by,an,is,through,.,heads,laden,walking,<unk>,a,in,a,in,&,into,.,.,of,playing,workers,spiderman,obama,'s,a,street,a,robe,tournament
9,white,stairs,through,another,the,accordion,weaving,the,<eos>,are,with,<eos>,.,young,a,tunnel,the,amp,the,<eos>,<eos>,the,baseball,and,floaties,"""",tattoo,<unk>,during,structure,.,with
10,speed,in,metal,woman,lake,player,.,water,<pad>,talking,bricks,<pad>,<eos>,girl,parade,.,middle,;,water,<pad>,<pad>,airplane,in,a,.,t,of,.,the,.,<eos>,a


# 3. Long Short Term Memory (LSTM) - Under the Hood

<img src="https://cdn-images-1.medium.com/max/2560/1*sQBwBtwCwqqXY0k5O0ZvMg.png">

The above picture shows the units present under a single LSTM Cell. I will add some references to learn more about LSTM in the last and why it works well for long sequences.

But to simply put, Vanilla RNN, Gated Recurrent Unit (GRU) is not able to capture the long term dependencies due to its nature of design and suffers heavily by the Vanishing Gradient problem, which makes the rate of change in weights and bias values negligible, resulting in poor generalization.

But LSTM has some special units called gates (Remember gate, Forget gate, Update gate), which helps to overcome the problems stated before.

Inside the LSTM cell, we have a bunch of mini neural networks with sigmoid and TanH activations at the final layer and few vector adder, Concat, multiplications operations.

1. Sigmoid NN → Squishes the values between 0 and 1. Say a value closer to 0 means to forget and a value closer to 1 means to remember.

2. Embedding NN → Converts the input word indices into word embedding.

3. TanH NN → Squishes the values between -1 and 1. Helps to regulate the vector values from either getting exploded to the maximum or shrank to the minimum.

4. The hidden state and the cell state are referred to here as the context vector, which are the outputs from the LSTM cell. The input is the sentence's numerical indexes fed into the embedding NN.


# 4. Encoder Model Architecture (Seq2Seq)

Before moving to seq2seq model, we need to create Encoder ,Decoder and create a interface between them in the seq2seq model.

Let's pass the german input sequence "Ich liebe tief lernen" which translates to "I love deep learning" in english.




<img src="https://cdn-images-1.medium.com/max/1200/1*aNcybCTdPlrXsCwIo1OfTg.png">

For a lighter note, let's explain the process happening in the above image. The Encoder of the Seq2Seq model takes one input at a time. Our input German word sequence is "ich Liebe Tief Lernen".

Also, we append the start of sequence "SOS" and the end of sentence "EOS" tokens in the starting and in the ending of the input sentence.

Therefore at 
At time step-0, the "SOS" token is sent,
At time step-1 the token "ich" is sent,
At time step-2 the token "Liebe" is sent,
At time step-3 the token "Tief" is sent,
At time step-4 the token "Lernen" is sent,
At time step-4 the token "EOS" is sent.

And the first block in the Encoder architecture is the word embedding layer [shown in green block], which converts the input indexed word into a dense vector representation called word embedding (sizes - 100/200/300).

Then our word embedding vector is sent to the LSTM cell, where it is combined with the hidden state (hs), and the cell state (cs) of the previous time step and the encoder block outputs a new hs and a cs which is passed to the next LSTM  cell. 

It is understood that the hs and cs captured some vector representation of the sentence so far.

At time step-0, the hidden state and cell state are either initialized fully of zeros or random numbers.

Then after we sent pass all our input german word sequence, a context vector [shown in yellow block] (hs, cs) is finally obtained, which is a dense representation of the word sequence and can be sent to the decoder's first LSTM (hs, cs) for the corresponding English translation.

In the above figure, we use 2 layer LSTM  architecture, where we connect the first LSTM to the second LSTM and we then we obtain 2 context vectors stacked on top as the final output.

It is a must that we design identical encoder and decoder blocks in the seq2seq model.

The above visualization is applicable for a single sentence from a batch. Say we have a batch size of 5 (Experimental), then we pass 5 sentences at a time to the Encoder, which looks like the below figure.

<img src="https://cdn-images-1.medium.com/max/1200/1*xP8MgIfKwjStFDUo0_W3QA.png">

##  The same concept is extended to a batch size of 5 (experimental), where we consider 5 input sentences and the first token from each sentences is sent to the encoder at a time. Each sequences in the batch is maintained to have the same length using the padding token.

# 5. Encoder Code Implementation (Seq2Seq)

In [16]:
class EncoderLSTM(nn.Module):
  def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
    super(EncoderLSTM, self).__init__()

    # Size of the one hot vectors that will be the input to the encoder
    #self.input_size = input_size

    # Output size of the word embedding NN
    #self.embedding_size = embedding_size

    # Dimension of the NN's inside the lstm cell/ (hs,cs)'s dimension.
    self.hidden_size = hidden_size

    # Number of layers in the lstm
    self.num_layers = num_layers

    # Regularization parameter
    self.dropout = nn.Dropout(p)
    self.tag = True

    # Shape --------------------> (5376, 300) [input size, embedding dims]
    self.embedding = nn.Embedding(input_size, embedding_size)
    
    # Shape -----------> (300, 2, 1024) [embedding dims, hidden size, num layers]
    self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers, dropout = p)

  # Shape of x (26, 32) [Sequence_length, batch_size]
  def forward(self, x):

    # Shape -----------> (26, 32, 300) [Sequence_length , batch_size , embedding dims]
    embedding = self.dropout(self.embedding(x))
    
    # Shape --> outputs (26, 32, 1024) [Sequence_length , batch_size , hidden_size]
    # Shape --> (hs, cs) (2, 32, 1024) , (2, 32, 1024) [num_layers, batch_size size, hidden_size]
    outputs, (hidden_state, cell_state) = self.LSTM(embedding)

    return hidden_state, cell_state

input_size_encoder = len(german.vocab)
encoder_embedding_size = 300
hidden_size = 1024
num_layers = 2
encoder_dropout = 0.5

encoder_lstm = EncoderLSTM(input_size_encoder, encoder_embedding_size,
                           hidden_size, num_layers, encoder_dropout).to(device)
print(encoder_lstm)

EncoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embedding): Embedding(5376, 300)
  (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
)


# 6. Decoder Model Architecture (Seq2Seq)

<img src="https://cdn-images-1.medium.com/max/800/1*FtDDCniBMb8HXYEM6PRohQ.png">

The decoder also does a single step at a time.

The Context Vector from the Encoder block is provided as the hidden state (hs) and cell state (cs) for the decoder's first LSTM block.

The start of the sentence "SOS"  token is passed to the embedding NN, then passed to the first LSTM cell of the decoder, and finally, it is passed through a linear layer [Shown in Pink color], which provides an output English token prediction probabilities (4556 Probabilities), hidden state (hs), Cell State (cs). 

The output word with the highest probability is chosen, hidden state (hs), Cell State (cs) is passed as the inputs to the next LSTM cell and this process is executed until it reaches the end of sentences "EOS".

The subsequent layers will use the hidden and cell state from the previous time steps.

The above visualization is applicable for a single sentence from a batch. Say we have a batch size of 5 (Experimental), then we pass 5 sentences at a time to the Encoder, which provide 5 sets of Context Vectors, and they all are passed into the Decoder, which looks like the below figure.

## Teach Force Ratio:

In addition to other blocks, you will also see the block shown below in the Decoder of the Seq2Seq architecture.

While model training, we send the inputs (German Sequence) and targets (English Sequence). After the context vector is obtained from the Encoder, we send them Vector and the target to the Decoder for translation.

But during model Inference, the target is generated from the decoder based on the generalization of the training data. So the output predicted words are sent as the next input word to the decoder until a <SOS> token is obtained.

So in model training itself, we can use the teach force ratio (tfr), where we can actually control the flow of input words to the decoder.

<img src="https://cdn-images-1.medium.com/max/600/1*YJpyqouvpmu4_Ej9ockl4A.png">

Teach Force Ratio methodWe can send the actual target words to the decoder part while training (Shown in Green Color).

We can also send the predicted target word, as the input to the decoder (Shown in Red Color).

Whether sending either of the words (actual target or predicted target) can be regulated with a probability of 50% so at any time step one of them is passed during the training.

This method is like a Regularization so that the model trains efficiently during the process.

The above visualization is applicable for a single sentence from a batch. Say we have a batch size of 5 (Experimental), then we pass 5 sentences at a time to the Encoder, which provide 5 sets of Context Vectors, and they all are passed into the Decoder, which looks like the below figure.

<img src="https://cdn-images-1.medium.com/max/2560/1*UPyGSZSuIQ52IjyFdPpm6A.png">

The same concept is extended to a batch size of 5 (experimental), where we consider 5 input encoder's context vectors and the first token <"sos"> is sent to the decoder at a time. 

# Note: Hidden size of Encoder and Decoder should be same

# 7. Decoder Code Implementation (Seq2Seq)

In [17]:
class DecoderLSTM(nn.Module):
  def __init__(self, input_size, embedding_size, hidden_size, num_layers, p, output_size):
    super(DecoderLSTM, self).__init__()

    # Size of the one hot vectors that will be the input to the encoder
    #self.input_size = input_size

    # Output size of the word embedding NN
    #self.embedding_size = embedding_size

    # Dimension of the NN's inside the lstm cell/ (hs,cs)'s dimension.
    self.hidden_size = hidden_size

    # Number of layers in the lstm
    self.num_layers = num_layers

    # Size of the one hot vectors that will be the output to the encoder (English Vocab Size)
    self.output_size = output_size

    # Regularization parameter
    self.dropout = nn.Dropout(p)

    # Shape --------------------> (5376, 300) [input size, embedding dims]
    self.embedding = nn.Embedding(input_size, embedding_size)

    # Shape -----------> (300, 2, 1024) [embedding dims, hidden size, num layers]
    self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers, dropout = p)

    # Shape -----------> (1024, 4556) [embedding dims, hidden size, num layers]
    self.fc = nn.Linear(hidden_size, output_size)

  # Shape of x (32) [batch_size]
  def forward(self, x, hidden_state, cell_state):

    # Shape of x (1, 32) [1, batch_size]
    x = x.unsqueeze(0)

    # Shape -----------> (1, 32, 300) [1, batch_size, embedding dims]
    embedding = self.dropout(self.embedding(x))

    # Shape --> outputs (1, 32, 1024) [1, batch_size , hidden_size]
    # Shape --> (hs, cs) (2, 32, 1024) , (2, 32, 1024) [num_layers, batch_size size, hidden_size] (passing encoder's hs, cs - context vectors)
    outputs, (hidden_state, cell_state) = self.LSTM(embedding, (hidden_state, cell_state))

    # Shape --> predictions (1, 32, 4556) [ 1, batch_size , output_size]
    predictions = self.fc(outputs)

    # Shape --> predictions (32, 4556) [batch_size , output_size]
    predictions = predictions.squeeze(0)

    return predictions, hidden_state, cell_state

input_size_decoder = len(english.vocab)
decoder_embedding_size = 300
hidden_size = 1024
num_layers = 2
decoder_dropout = 0.5
output_size = len(english.vocab)

decoder_lstm = DecoderLSTM(input_size_decoder, decoder_embedding_size,
                           hidden_size, num_layers, decoder_dropout, output_size).to(device)
print(decoder_lstm)

DecoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embedding): Embedding(4556, 300)
  (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
  (fc): Linear(in_features=1024, out_features=4556, bias=True)
)


# 8. Seq2Seq (Encoder + Decoder) Interface

<img src="https://cdn-images-1.medium.com/max/1200/1*d9kP4XoWGnIcmyhX-g4Xvw.png">

## The final seq2seq implementation looks like the figure above.

## 1. Provide both input (German) and output (English) sentences.
## 2. Pass the input sequence to the encoder and extract context vectors.
## 3. Pass the output sequence to the decoder, context vecotr from encoder to produce the predicted output sequence.


<img src="https://cdn-images-1.medium.com/max/1200/1*7SVU_REkIUALmegTbFI9Fw.png">

In [18]:
for batch in train_iterator:
  print(batch.src.shape)
  print(batch.trg.shape)
  break

x = batch.trg[1]
print(x)

torch.Size([10, 32])
torch.Size([14, 32])
tensor([  46,    4,    4,   16,    4,   16,    4,    4,    4,    4,    4,    4,
           9,    4,    4,   16,    4,    4,   48,  324,   19,   16,  216,  110,
           4, 2037,   64,    4,  228,    4,    4,    4], device='cuda:0')


# 9. Seq2Seq (Encoder + Decoder) Code Implementation

In [19]:
class Seq2Seq(nn.Module):
  def __init__(self, Encoder_LSTM, Decoder_LSTM):
    super(Seq2Seq, self).__init__()
    self.Encoder_LSTM = Encoder_LSTM
    self.Decoder_LSTM = Decoder_LSTM

  def forward(self, source, target, tfr=0.5):
    # Shape - Source : (10, 32) [(Sentence length German + some padding), Number of Sentences]
    batch_size = source.shape[1]

    # Shape - Source : (14, 32) [(Sentence length English + some padding), Number of Sentences]
    target_len = target.shape[0]
    target_vocab_size = len(english.vocab)
    
    # Shape --> outputs (14, 32, 5766) 
    outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)

    # Shape --> (hs, cs) (2, 32, 1024) ,(2, 32, 1024) [num_layers, batch_size size, hidden_size] (contains encoder's hs, cs - context vectors)
    hidden_state, cell_state = self.Encoder_LSTM(source)

    # Shape of x (32 elements)
    x = target[0] # Trigger token <SOS>

    for i in range(1, target_len):
      # Shape --> output (32, 5766) 
      output, hidden_state, cell_state = self.Decoder_LSTM(x, hidden_state, cell_state)
      outputs[i] = output
      best_guess = output.argmax(1) # 0th dimension is batch size, 1st dimension is word embedding
      x = target[i] if random.random() < tfr else best_guess # Either pass the next word correctly from the dataset or use the earlier predicted word

    # Shape --> outputs (14, 32, 5766) 
    return outputs


In [20]:
# Hyperparameters

learning_rate = 0.001
writer = SummaryWriter(f"runs/loss_plot")
step = 0

model = Seq2Seq(encoder_lstm, decoder_lstm).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

pad_idx = english.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

In [21]:
model

Seq2Seq(
  (Encoder_LSTM): EncoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(5376, 300)
    (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
  )
  (Decoder_LSTM): DecoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(4556, 300)
    (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=1024, out_features=4556, bias=True)
  )
)

In [22]:
def translate_sentence(model, sentence, german, english, device, max_length=50):
    spacy_ger = spacy.load("de")

    if type(sentence) == str:
        tokens = [token.text.lower() for token in spacy_ger(sentence)]
    else:
        tokens = [token.lower() for token in sentence]
    tokens.insert(0, german.init_token)
    tokens.append(german.eos_token)
    text_to_indices = [german.vocab.stoi[token] for token in tokens]
    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

    # Build encoder hidden, cell state
    with torch.no_grad():
        hidden, cell = model.Encoder_LSTM(sentence_tensor)

    outputs = [english.vocab.stoi["<sos>"]]

    for _ in range(max_length):
        previous_word = torch.LongTensor([outputs[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.Decoder_LSTM(previous_word, hidden, cell)
            best_guess = output.argmax(1).item()

        outputs.append(best_guess)

        # Model predicts it's the end of the sentence
        if output.argmax(1).item() == english.vocab.stoi["<eos>"]:
            break

    translated_sentence = [english.vocab.itos[idx] for idx in outputs]
    return translated_sentence[1:]

def bleu(data, model, german, english, device):
    targets = []
    outputs = []

    for example in data:
        src = vars(example)["src"]
        trg = vars(example)["trg"]

        prediction = translate_sentence(model, src, german, english, device)
        prediction = prediction[:-1]  # remove <eos> token

        targets.append([trg])
        outputs.append(prediction)

    return bleu_score(outputs, targets)

def checkpoint_and_save(model, best_loss, epoch, optimizer, epoch_loss):
    print('saving')
    print()
    state = {'model': model,'best_loss': best_loss,'epoch': epoch,'rng_state': torch.get_rng_state(), 'optimizer': optimizer.state_dict(),}
    torch.save(state, '/content/checkpoint-NMT')
    torch.save(model.state_dict(),'/content/checkpoint-NMT-SD')

# 10. Seq2Seq Model Training

In [None]:
epoch_loss = 0.0
num_epochs = 100
best_loss = 999999
best_epoch = -1
sentence1 = "ein mann in einem blauen hemd steht auf einer leiter und putzt ein fenster"
ts1  = []

for epoch in range(num_epochs):
  print("Epoch - {} / {}".format(epoch+1, num_epochs))
  model.eval()
  translated_sentence1 = translate_sentence(model, sentence1, german, english, device, max_length=50)
  print(f"Translated example sentence 1: \n {translated_sentence1}")
  ts1.append(translated_sentence1)

  model.train(True)
  for batch_idx, batch in enumerate(train_iterator):
    input = batch.src.to(device)
    target = batch.trg.to(device)

    # Pass the input and target for model's forward method
    output = model(input, target)
    output = output[1:].reshape(-1, output.shape[2])
    target = target[1:].reshape(-1)

    # Clear the accumulating gradients
    optimizer.zero_grad()

    # Calculate the loss value for every epoch
    loss = criterion(output, target)

    # Calculate the gradients for weights & biases using back-propagation
    loss.backward()

    # Clip the gradient value is it exceeds > 1
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

    # Update the weights values using the gradients we calculated using bp 
    optimizer.step()
    step += 1
    epoch_loss += loss.item()
    writer.add_scalar("Training loss", loss, global_step=step)

  if epoch_loss < best_loss:
    best_loss = epoch_loss
    best_epoch = epoch
    checkpoint_and_save(model, best_loss, epoch, optimizer, epoch_loss) 
    if ((epoch - best_epoch) >= 10):
      print("no improvement in 10 epochs, break")
      break
  print("Epoch_Loss - {}".format(loss.item()))
  print()
  
print(epoch_loss / len(train_iterator))

score = bleu(test_data[1:100], model, german, english, device)
print(f"Bleu score {score*100:.2f}")

Epoch - 1 / 100
Translated example sentence 1: 
 ['backpacks', 'muzzle', 'pole', 'restaurants', 'tropical', 'tropical', 'necklaces', 'necklaces', 'daylight', 'daylight', 'masked', 'masked', 'paws', 'anvil', 'bundle', 'bundle', 'petals', 'petals', 'spreading', 'packages', 'train', 'took', 'pulled', 'spectators', 'himself', 'commercial', 'cartoon', 'commercial', 'snowcapped', 'snowcapped', 'reads', 'screaming', 'length', 'length', 'length', 'length', 'length', 'reflection', 'petals', 'petals', 'petals', 'length', 'length', 'bathing', 'train', 'pulled', 'spectators', 'petals', 'petals', 'train']
saving

Epoch_Loss - 4.3871965408325195

Epoch - 2 / 100
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'a', 'a', 'a', 'a', 'a', '.', '<eos>']
Epoch_Loss - 4.3368682861328125

Epoch - 3 / 100
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'black', 'shirt', 'and', 'a', 'is', 'standing', 'on', 'a', 'bench', '.', '<eos>']
Epoch_Loss - 3.1559009552001953

Epo

In [None]:
#%load_ext tensorboard
%tensorboard --logdir runs/

# 11. Seq2Seq Model Inference

In [None]:
progress  = []
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
for i, sen in enumerate(ts1):
  progress.append(TreebankWordDetokenizer().detokenize(sen))
print(progress)

In [None]:
progress_df = pd.DataFrame(data = progress, columns=['Predicted Sentence'])
progress_df.index.name = "Epochs"
progress_df.to_csv('/content/predicted_sentence.csv')
progress_df.head()

# Model Inference

In [None]:
model.eval()
test_sentences  = ["Zwei Männer gehen die Straße entlang", "Kinder spielen im Park.", "Diese Stadt verdient eine bessere Klasse von Verbrechern. Der Spaßvogel"]
actual_sentences  = ["Two men are walking down the street", "Children play in the park", "This city deserves a better class of criminals. The joker"]
pred_sentences = []

for idx, i in enumerate(test_sentences):
  model.eval()
  translated_sentence = translate_sentence(model, i, german, english, device, max_length=50)
  progress.append(TreebankWordDetokenizer().detokenize(translated_sentence))
  print("German : {}".format(i))
  print("Actual Sentence in English : {}".format(actual_sentences[idx]))
  print("Predicted Sentence in English : {}".format(progress[-1]))
  print()
