![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

**Seq2Seq (Encoder-Decoder) Model Architecture** has become ubiquitous due to the advancement of the **Transformer** architecture in recent years. Large corporations have started to train huge networks and publish them to the research community. Recently OpenAI licensed its most advanced pre-trained Transformer model, **GPT-3**, to Microsoft. Even though RNNs are now rarely used in practice, anyone starting to learn the most advanced algorithms still needs to understand how to implement a Seq2Seq model using just an RNN and its variants (LSTM, GRU). In this notebook, we are going to implement **Machine Translation using Recurrent Neural Network and PyTorch** from scratch.

![RNN](../../../images/RNN.png)

# Prerequisites

Before you start this notebook, you should:
1. Know the basics of PyTorch and how to implement a neural network with it


2. Understand the concept of a recurrent neural network (RNN)


3. Have a basic understanding of the seq-to-seq (encoder-decoder) architecture

# What will we accomplish?

Steps to implement machine translation using a Recurrent Neural Network with PyTorch:

> Step 1: Multi30k dataset preparation 

> Step 2: Encoder-decoder Model Architecture

> Step 3: Model Training and Evaluation

> Step 4: Inference and prediction

# Notebook Content

* [Import Libraries](#Import-Libraries)


* [Dataset Preparation](#Dataset-Preparation)


* [Encoder-Decoder Model Architecture](#Encoder-Decoder-Model-Architecture)

    * [Encoder Model Using Pytorch](#Encoder-Model-Using-Pytorch)
        * [__init__()](#__init__())
        * [forward()](#forward())
    
    * [Decoder Model using PyTorch](#Decoder-Model-using-PyTorch)
        * [One Time Step of Decoder](#One-Time-Step-of-Decoder)
        * [Decoder Model](#Decoder-Model)
        * [Teaching Force](#Teaching-Force)


* [Combine Encoder and Decoder](#Combine-Encoder-and-Decoder)


* [Model Initialization](#Model-Initialization)


* [Training Loop](#Training-Loop)


* [Inference](#Inference)
    
    * [predict()](#predict())
    

* [Conclusion](#Conclusion)

# Import Libraries

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator
import spacy
import numpy as np
import random
from tqdm import tqdm

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


# Dataset Preparation

We will be using the Multi30k dataset with the spaCy tokenizer.

If you are interested in learning more about spaCy, please visit the following link: https://www.presentslide.in/2019/07/implementing-spacy-advanced-natural-language-processing.html

The `get_datasets()` function is where we prepare the dataset. We reverse the German tokens so that the first words of the source sentence are the last ones the encoder reads. This places the beginning of the German sentence close to the point where the decoder starts generating the beginning of the English sentence, which makes it easier for the LSTM to connect the two.
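To make the effect of the reversal concrete, here is a minimal sketch. It uses plain whitespace splitting as a stand-in for the spaCy tokenizer, and the function name `tokenize_reversed` and the example sentence are illustrative assumptions, not part of the notebook.

```python
def tokenize_reversed(text):
    """Lowercase, split on whitespace, and reverse the token order."""
    return text.lower().split()[::-1]


source_sentence = "Ein Mann läuft schnell ."
tokens = tokenize_reversed(source_sentence)
print(tokens)

# With <sos>/<eos> added around the reversed tokens, the original first
# word ('ein') is the last real token the encoder reads.
encoder_input = ["<sos>"] + tokens + ["<eos>"]
print(encoder_input)
```

In the notebook itself the lowercasing and the `<sos>`/`<eos>` tokens are handled by the `Field` object rather than by the tokenizer function.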

In [3]:
def get_datasets(batch_size=128):
    # Load the spaCy language models (they must already be installed)
    spacy_de = spacy.load('de_core_news_sm')
    spacy_en = spacy.load('en_core_web_sm')

    # define the tokenizer
    def tokenize_de(text):
        return [token.text for token in spacy_de.tokenizer(text)][::-1]

    def tokenize_en(text):
        return [token.text for token in spacy_en.tokenizer(text)]

    # Create the torchtext Fields
    source = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
    target = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

    # Splits the data in Train, Test and Validation data
    train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(source, target), root='../../../resources/.data', train='train', validation='val', test='test2016')

    # Build the vocabulary for both the language
    source.build_vocab(train_data, min_freq=3)
    target.build_vocab(train_data, min_freq=3)

    # Create the Iterator using builtin Bucketing
    train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data),
                                                                          batch_size=batch_size,
                                                                          sort_within_batch=True,
                                                                          sort_key=lambda x: len(x.src),
                                                                          device=device)
    
    return train_iterator, valid_iterator, test_iterator, source, target

# Encoder-Decoder Model Architecture

Below is the diagram of the basic **Encoder-Decoder Model Architecture**. We feed the input (source) text to the Encoder and the output (target) text to the Decoder. The Encoder passes its final state, called the Context Vectors, to the Decoder so that the Decoder can generate the translation.

![Model Architecture](../../../images/model_architecture.jpg)

This is a very simplified version of the architecture. As we build each part, we will focus more on specifics. 

The Encoder-Decoder model can be used in different fields of Artificial Intelligence such as **Machine Translation**, **Named Entity Recognition**, **Summarization**, **Chat-Bot**, **Question-Answering** and many more.

Here we will be translating from **German to English**. For the data source we will use the Multi30k dataset shipped with torchtext, as it takes much less computation power to train on. You can use Google Colab to train your model if you don’t have access to a GPU.

We will start with a simple Encoder-Decoder architecture, then gradually move to more complex versions.

## Encoder Model Using Pytorch

We will defer the simple data processing steps until the model is ready. For now, just note that each input sentence is a sequence of string tokens that starts with `<sos>` and ends with `<eos>`. Take a look at a simple version of the encoder architecture.

![Encoder](../../../images/encoder.png)

As you already know, a neural network can only **work with numbers**, so we first need to **convert each word to a unique integer token**, then use **One-Hot Encoding to represent each word** (depicted as `one-hot` in the diagram above). This is taken care of as part of the preprocessing.
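The word-to-integer step can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper names (`build_vocab`, `numericalize`, `one_hot`), the order of the special tokens, and the toy corpus are hypothetical, and in the notebook this work is done by torchtext's `Field.build_vocab()`.

```python
from collections import Counter


def build_vocab(sentences, min_freq=1,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map every token seen at least min_freq times to a unique integer."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = list(specials) + sorted(t for t, c in counts.items() if c >= min_freq)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi):
    """Wrap the tokens in <sos>/<eos>; unknown words map to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(t, unk) for t in ["<sos>"] + tokens + ["<eos>"]]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec


corpus = [["a", "dog", "runs"], ["a", "cat", "sleeps"]]
stoi, itos = build_vocab(corpus)

ids = numericalize(["a", "bird", "runs"], stoi)  # 'bird' is out of vocabulary
print(ids)
print(one_hot(ids[1], len(itos)))
```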

We will use PyTorch to create the **embedding** and **RNN** layers. We create a subclass of the `torch.nn.Module` class and define its `__init__()` and `forward()` methods.

### `__init__()`

The Embedding layer takes the **input data** (token indices) and outputs the **embedding vector**, hence the dimensions of both need to be defined as `vocab_len` and `embedding_dim`.

The `vocab_len` is simply the **number of unique words** present in our vocabulary. After pre-processing the data, we can count the number of unique words in our vocabulary and use that count here.

The `embedding_dim` is the **output dimension** of the **embedding vector**. A common choice is **256-512** for a demo like the one we are building here.

Next we define our LSTM layer, which takes a vector of size `embedding_dim` as input and produces 3 outputs: `hidden`, `cell` and `output`. Here we need to define the **number of hidden units in the LSTM**, which is set using `hidden_dim`. Again, this is just a number, and we will set it to **1024**.

LSTMs can be stacked, hence we pass `n_layers` as a parameter. A single layer is enough for a first implementation; the model created later in this notebook uses **2 layers**.

### `forward()`

The forward function is very straightforward. Notice we are using a **dropout layer** after the **embedding layer**; this is entirely optional.

The encoder is the simplest part of the code. Notice that we never specify the **batch size** or the **time dimension (sentence length)**; PyTorch handles both dynamically.

The Embedding layer uses `vocab_len` to size its lookup table. Looking up a row by token index gives the same result as multiplying the token's one-hot vector by the embedding matrix, so the one-hot representation never has to be built explicitly.
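The relationship between the embedding lookup and the one-hot representation can be checked with a small sketch. The sizes and the toy batch below are arbitrary assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_len, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_len, embedding_dim)

# A toy batch of token indices shaped [time_dimension, batch_size] = [3, 2]
input_batch = torch.tensor([[1, 4], [7, 2], [9, 0]])

# Direct lookup: [3, 2] -> [3, 2, 4]
looked_up = embedding(input_batch)

# Explicit one-hot vectors times the embedding matrix: [3, 2, 10] @ [10, 4]
one_hot = F.one_hot(input_batch, num_classes=vocab_len).float()
multiplied = one_hot @ embedding.weight

print(looked_up.shape)
print(torch.allclose(looked_up, multiplied))
```

The lookup avoids materializing the `[time_dimension, batch_size, vocab_len]` one-hot tensor, which matters once the vocabulary has thousands of entries.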

Another important point is that we can feed an entire batch to the encoder at once. A batch has the dimension `[time_dimension, batch_size]`. In PyTorch, if we don’t pass `hidden` and `cell` to the LSTM module, it initializes them to zeros for us and processes the entire batch at once.

So the output (`outputs`, `hidden`, `cell`) of the LSTM module is the result after processing every time step of every sentence in the batch. We do not need the `outputs` tensor from the LSTM, as we only pass the **context vector**, which consists of the `hidden` and `cell` vectors, to the **decoder block**. Hence we return just those two from the function.

Note: Since we are using an LSTM we have the additional **cell state**; with a GRU we would have only the hidden state.
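The shapes involved can be inspected with a short sketch. The sizes are arbitrary assumptions (much smaller than the notebook's 256 and 1024) so that it runs instantly on a CPU.

```python
import torch
import torch.nn as nn

seq_len, batch_size, embedding_dim, hidden_dim, n_layers = 5, 3, 8, 16, 2

# Stand-in for the embedded batch: [time_dimension, batch_size, embedding_dim]
embedded = torch.randn(seq_len, batch_size, embedding_dim)

lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers)
outputs, (hidden, cell) = lstm(embedded)  # initial states default to zeros

print(outputs.shape)  # one vector per time step, from the top layer only
print(hidden.shape)   # one final vector per layer
print(cell.shape)

# For a unidirectional LSTM, the top layer's final hidden state is the
# last time step of `outputs`.
print(torch.allclose(outputs[-1], hidden[-1]))

# A GRU returns only a hidden state, with no cell state.
gru = nn.GRU(embedding_dim, hidden_dim, n_layers)
gru_outputs, gru_hidden = gru(embedded)
print(gru_hidden.shape)
```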

In [4]:
class Encoder(nn.Module):
    def __init__(self, vocab_len, embedding_dim, hidden_dim, n_layers, dropout_prob):
        super().__init__()

        self.embedding = nn.Embedding(vocab_len, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout_prob)

        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, input_batch):
        embed = self.dropout(self.embedding(input_batch))
        outputs, (hidden, cell) = self.rnn(embed)

        return hidden, cell

## Decoder Model using PyTorch

The Decoder is implemented in **two steps**. Let’s understand more from the diagram below.

![Decoder](../../../images/decoder.png)

The decoder’s input at time step `t` **depends** on the output of the previous time step `t−1`. At the first step it takes the **output of the Encoder** (`hidden` and `cell`) as its initial hidden and cell state.

We will first create a Decoder model for just **one time step of the decoder** and later add a wrapper for the entire time sequence.

### One Time Step of Decoder

A single time step of the decoder looks like the following diagram. All we need to implement is one **Embedding Layer**, one **LSTM** and one **Linear Layer**.

![One Time Step Decoder](../../../images/one_step_decoder.png)

**Note**: Some implementations (e.g. the official PyTorch documentation) use a **LogSoftMax** layer after the **Linear layer**. Since we only need the most probable token here, and `nn.CrossEntropyLoss` applies log-softmax internally during training, we omit **LogSoftMax** and use the output of the **Linear layer** directly. LogSoftMax can be useful in other use cases such as **Beam Search**.
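The reason the most probable token can be read directly from the Linear layer is that log-softmax only shifts every score by the same constant, so the ranking does not change. A minimal pure-Python sketch, with arbitrary example scores:

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.5, -0.3, 4.2, 0.0]  # raw scores, as a Linear layer would produce
log_probs = log_softmax(logits)

# Both pick the same index.
print(argmax(logits), argmax(log_probs))

# The exponentiated log-probabilities sum to 1, unlike the raw scores.
print(round(sum(math.exp(lp) for lp in log_probs), 6))
```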

The code for `OneStepDecoder` is very simple to implement. There are, however, a few important points to notice.

Since the **output of the Linear layer** determines the **input to the Embedding layer** of the **next time step**, the output dimension must equal the decoder’s input dimension, which is the target vocabulary size. Here we name it `input_output_dim`.

Secondly, `target_token` is one-dimensional, with shape `[batch_size]`, as we pass only the previously generated (most probable) word index for each sentence in the batch. However, as discussed previously, the LSTM expects its input to carry a time dimension, i.e. token indices shaped `[time_dimension, batch_size]`. Hence we call `unsqueeze(0)` to add a **time dimension** of size 1.

We then take the output of the LSTM and remove this `time_dimension` again before passing it to the **Linear layer**.
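The shape changes in one decoder step can be traced with the sketch below. The layer sizes are small arbitrary assumptions, and the zero initial states stand in for the encoder's `hidden` and `cell`.

```python
import torch
import torch.nn as nn

batch_size, vocab_size, embedding_dim, hidden_dim, n_layers = 4, 20, 8, 16, 2

embedding = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers)
fc = nn.Linear(hidden_dim, vocab_size)

# One previously generated token index per sentence: [batch_size]
target_token = torch.randint(0, vocab_size, (batch_size,))
# Stand-ins for the encoder's context vectors: [n_layers, batch_size, hidden_dim]
hidden = torch.zeros(n_layers, batch_size, hidden_dim)
cell = torch.zeros(n_layers, batch_size, hidden_dim)

step_input = target_token.unsqueeze(0)              # [1, batch_size]
embedded = embedding(step_input)                    # [1, batch_size, embedding_dim]
output, (hidden, cell) = rnn(embedded, (hidden, cell))  # [1, batch_size, hidden_dim]
logits = fc(output.squeeze(0))                      # [batch_size, vocab_size]
next_token = logits.argmax(1)                       # [batch_size]

print(step_input.shape, output.shape, logits.shape, next_token.shape)
```

`next_token` has the same shape as `target_token`, which is what allows it to be fed straight back in as the input of the following step.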

In [5]:
class OneStepDecoder(nn.Module):
    def __init__(self, input_output_dim, embedding_dim, hidden_dim, n_layers, dropout_prob):
        super().__init__()
        # self.input_output_dim will be used later
        self.input_output_dim = input_output_dim

        self.embedding = nn.Embedding(input_output_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout_prob)
        self.fc = nn.Linear(hidden_dim, input_output_dim)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, target_token, hidden, cell):
        target_token = target_token.unsqueeze(0)
        embedding_layer = self.dropout(self.embedding(target_token))
        output, (hidden, cell) = self.rnn(embedding_layer, (hidden, cell))

        linear = self.fc(output.squeeze(0))

        return linear, hidden, cell

### Decoder Model

Now we are ready to build the **full Decoder model**. First, pass the instance of `OneStepDecoder` to the constructor.

The main objective is to call the `OneStepDecoder` once for every **time step in our batch**.

So far we have ignored the **time and batch dimensions**, as PyTorch handled them automatically; now we need to read them (`target_len`, `batch_size`) from `target`.

To store the `output` of each **decoder time step for every sentence in the batch**, we create a tensor named `predictions`.

Next, take the very first input from the target data (which will be `<sos>`) and pass it along with the **`hidden`** and **`cell`** from the encoder. The `input`, `hidden` and `cell` variables are overwritten at each subsequent time step.

Finally, loop through the time steps (remember that the **sequence length and batch size can vary from batch to batch**) and call the `one_step_decoder`. Store the predicted output in the `predictions` tensor and get the most probable word token by calling the `argmax()` function.

### Teaching Force

Now we will add one more concept, called **Teacher Forcing**, to the **Decoder** model.

When a `OneStepDecoder` step **predicts the wrong word**, the next step receives the wrong input and struggles to learn, and the error propagates through the remaining tokens in the sequence. This leads to **very slow convergence** of the model.

One way of addressing this problem is to **randomly provide the correct input to the `OneStepDecoder`**, irrespective of the output from the previous time step. This way we force the current `OneStepDecoder` to **learn from correct data**, which leads to **faster convergence**. The process is called **Teacher Forcing**, since we intermittently help the decoder learn from the correct target sequence.

Below is the updated version of the previous diagram to give an intuition for the idea. As you can see, all we are doing is randomly choosing between the **previous step’s output** and the **actual target**.

![Teaching Force](../../../images/teaching_force.png)

The code is straightforward. We want to control how much teacher forcing to use, so we pass the ratio as a parameter; during inference we won’t use it at all.

The following code is the part of our Decoder loop that enables Teacher Forcing. We can set `teacher_forcing_ratio` to 0 to disable it at inference time.

    do_teacher_forcing = random.random() < teacher_forcing_ratio
    
    input = target[t] if do_teacher_forcing else input
    
Notice that `teacher_forcing_ratio` is passed as an **argument** to the `forward()` method and not to the constructor, so that the value can be changed over the course of training. We can use more teacher forcing at the beginning of training and reduce the value as training progresses, so that the network learns to rely on its own predictions.
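One possible way to reduce the ratio over time is a linear decay per epoch. The sketch below is only an illustration: `teacher_forcing_schedule` and `choose_inputs` are hypothetical helpers, and the training loop in this notebook simply keeps the default ratio of 0.5.

```python
import random


def teacher_forcing_schedule(epoch, total_epochs, start=1.0, end=0.0):
    """Linearly decay the ratio from `start` (epoch 1) to `end` (last epoch)."""
    if total_epochs <= 1:
        return start
    fraction = (epoch - 1) / (total_epochs - 1)
    return start + (end - start) * fraction


def choose_inputs(targets, predictions, ratio, rng):
    """For each step, use the ground-truth token with probability `ratio`."""
    return [t if rng.random() < ratio else p
            for t, p in zip(targets, predictions)]


rng = random.Random(0)
targets = ["a", "dog", "runs", "."]
predictions = ["the", "cat", "runs", "!"]

for epoch in (1, 3, 5):
    ratio = teacher_forcing_schedule(epoch, total_epochs=5)
    print(epoch, ratio, choose_inputs(targets, predictions, ratio, rng))
```

With such a schedule, the value would be passed on each call, for example `model(src, trg, teacher_forcing_ratio=ratio)`.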

In [6]:
class Decoder(nn.Module):
    def __init__(self, one_step_decoder, device):
        super().__init__()
        self.one_step_decoder = one_step_decoder
        self.device = device

    def forward(self, target, hidden, cell, teacher_forcing_ratio=0.5):
        target_len, batch_size = target.shape[0], target.shape[1]
        target_vocab_size = self.one_step_decoder.input_output_dim
        # Store the predictions in an array for loss calculations
        predictions = torch.zeros(target_len, batch_size, target_vocab_size).to(self.device)
        # Take the very first word token, which will be sos
        input = target[0, :]

        # Loop through all the time steps, starts from 1
        for t in range(1, target_len):
            predict, hidden, cell = self.one_step_decoder(input, hidden, cell)

            predictions[t] = predict
            input = predict.argmax(1)

            # Teacher forcing
            do_teacher_forcing = random.random() < teacher_forcing_ratio
            input = target[t] if do_teacher_forcing else input

        return predictions

## Combine Encoder and Decoder

The next step is to **combine the Encoder and Decoder** models. The diagram below shows the model hierarchy. We already have the Encoder and Decoder models; we now combine them in a model named `EncoderDecoder`.

![Encoder Decoder Model](../../../images/encoder-decoder_model.png)

In [7]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        hidden, cell = self.encoder(source)
        outputs = self.decoder(target, hidden, cell, teacher_forcing_ratio)

        return outputs

# Model Initialization

This part is similar to any other PyTorch program. Initialize the model, optimizer and loss function.

In [8]:
def create_model(source, target):
    # Define the required dimensions and hyper parameters
    embedding_dim = 256
    hidden_dim = 1024
    dropout = 0.5

    # Instantiate the models
    encoder = Encoder(len(source.vocab), embedding_dim, hidden_dim, n_layers=2, dropout_prob=dropout)
    one_step_decoder = OneStepDecoder(len(target.vocab), embedding_dim, hidden_dim, n_layers=2, dropout_prob=dropout)
    decoder = Decoder(one_step_decoder, device)

    model = EncoderDecoder(encoder, decoder)

    model = model.to(device)

    # Define the optimizer
    optimizer = optim.Adam(model.parameters())

    # Makes sure the CrossEntropyLoss ignores the padding tokens.
    TARGET_PAD_IDX = target.vocab.stoi[target.pad_token]
    criterion = nn.CrossEntropyLoss(ignore_index=TARGET_PAD_IDX)

    return model, optimizer, criterion

# Training Loop

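This code is also very generic, except for one part: we discard the first position of both the model output and the target sequence. The decoder loop starts at `t = 1`, so `predictions[0]` is never filled (it stays all zeros), and `target[0]` is just the `<sos>` token.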
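The slicing and flattening before the loss can be sketched on toy tensors. All sizes and token indices below are assumptions for illustration (here 2 stands for `<sos>`, 3 for `<eos>` and 1 for `<pad>`), and `reshape` is used in place of the notebook's `view`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
target_len, batch_size, vocab_size, pad_idx = 5, 2, 7, 1

# Stand-in for the Decoder output: position 0 is never written, so it stays zero.
output = torch.zeros(target_len, batch_size, vocab_size)
output[1:] = torch.randn(target_len - 1, batch_size, vocab_size)

# Toy targets shaped [target_len, batch_size]; the second sentence is shorter.
trg = torch.tensor([[2, 2],
                    [4, 5],
                    [6, 3],
                    [3, 1],
                    [1, 1]])

# Drop position 0, then flatten time and batch into one dimension.
flat_output = output[1:].reshape(-1, vocab_size)  # [(target_len - 1) * batch_size, vocab_size]
flat_trg = trg[1:].reshape(-1)                    # [(target_len - 1) * batch_size]

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(flat_output, flat_trg)

# The mean is taken over the non-padding positions only.
per_token = nn.CrossEntropyLoss(ignore_index=pad_idx, reduction="none")(flat_output, flat_trg)
manual = per_token[flat_trg != pad_idx].mean()

print(flat_output.shape, flat_trg.shape)
print(torch.allclose(loss, manual))
```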

In [9]:
def train(train_iterator, valid_iterator, source, target, epochs=10):
    model, optimizer, criterion = create_model(source, target)

    clip = 1

    for epoch in range(1, epochs + 1):
        pbar = tqdm(total=len(train_iterator), bar_format='{l_bar}{bar:10}{r_bar}{bar:-10b}', unit=' batches', ncols=200)

        training_loss = []
        # set training mode
        model.train()

        # Loop through the training batch
        for i, batch in enumerate(train_iterator):
            # Get the source and target tokens
            src = batch.src
            trg = batch.trg

            optimizer.zero_grad()

            # Forward pass
            output = model(src, trg)

            # reshape the output
            output_dim = output.shape[-1]

            # Discard the first token as this will always be 0
            output = output[1:].view(-1, output_dim)

            # Discard the sos token from target
            trg = trg[1:].view(-1)

            # Calculate the loss
            loss = criterion(output, trg)

            # back propagation
            loss.backward()

            # Gradient Clipping for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

            optimizer.step()

            training_loss.append(loss.item())

            pbar.set_postfix(
                epoch=f" {epoch}, train loss= {round(sum(training_loss) / len(training_loss), 4)}", refresh=True)
            pbar.update()

        with torch.no_grad():
            # Set the model to eval
            model.eval()

            validation_loss = []

            # Loop through the validation batch
            for i, batch in enumerate(valid_iterator):
                src = batch.src
                trg = batch.trg

                # Forward pass
                output = model(src, trg, 0)

                output_dim = output.shape[-1]

                output = output[1:].view(-1, output_dim)
                trg = trg[1:].view(-1)

                # Calculate Loss
                loss = criterion(output, trg)

                validation_loss.append(loss.item())

        pbar.set_postfix(
            epoch=f" {epoch}, train loss= {round(sum(training_loss) / len(training_loss), 4)}, val loss= {round(sum(validation_loss) / len(validation_loss), 4)}",
            refresh=False)
        pbar.close()

    return model

In [10]:
train_iterator, valid_iterator, test_iterator, source, target = get_datasets(batch_size=128)
model = train(train_iterator, valid_iterator, source, target, epochs=25)

checkpoint = {
    'model_state_dict': model.state_dict(),
    'source': source.vocab,
    'target': target.vocab
}

torch.save(checkpoint, 'model/nmt-model-lstm-25.pth')

# Inference

Now we will learn how to make predictions using **Encoder Decoder Models**. Inference on Seq2Seq models is not as straightforward as on other models, so let’s go through it in detail.

The hierarchy of the inference model is a bit different. We no longer call the `forward()` of the EncoderDecoder and Decoder models; they are only instantiated so that the checkpoint weights can be loaded, and we call the Encoder and the OneStepDecoder directly. Here are the high-level steps:

1. Load the model and vocabulary from the checkpoint file.


2. Load the Test (Unseen) dataset.


3. Convert each source token to integer values using the vocabulary.


4. Take the integer value of `<sos>` from the target vocabulary.


5. Run the forward pass of the Encoder.


6. Use the hidden and cell vectors of the Encoder and, in a loop, run the forward pass of the OneStepDecoder until a fixed maximum number of steps is reached (30 in the code below) or until the model generates `<eos>`.


7. Record the most probable word inside the loop.


8. Find the corresponding word in the target vocabulary and print it to the console.

![Inference on Machine Translation Model](../../../images/inference_MT.png)
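Steps 4 to 8 amount to a greedy decoding loop, which can be sketched independently of PyTorch. In this sketch `greedy_decode`, `toy_step` and the five-word vocabulary are hypothetical stand-ins: `toy_step` plays the role of the trained `OneStepDecoder` by returning fixed scores from a script.

```python
def greedy_decode(step_fn, state, sos_index, eos_index, max_steps=30):
    """Feed the most probable token back in until <eos> or max_steps."""
    token = sos_index
    generated = []
    for _ in range(max_steps):
        scores, state = step_fn(token, state)
        # Greedy choice: index of the highest score
        token = max(range(len(scores)), key=lambda i: scores[i])
        if token == eos_index:
            break
        generated.append(token)
    return generated


itos = ["<sos>", "<eos>", "three", "people", "sit"]
SCRIPT = [2, 3, 4, 1]  # the token the toy decoder favors at each step


def toy_step(token, state):
    """Stand-in for one decoder step; `state` is just a step counter here."""
    scores = [0.0] * len(itos)
    scores[SCRIPT[state]] = 1.0
    return scores, state + 1


ids = greedy_decode(toy_step, 0, sos_index=0, eos_index=1)
print([itos[i] for i in ids])

# The step limit stops generation even if <eos> never appears.
truncated = greedy_decode(toy_step, 0, sos_index=0, eos_index=1, max_steps=2)
print([itos[i] for i in truncated])
```

The step limit guards against a model that never emits `<eos>`, which would otherwise loop forever.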

In [11]:
# load the model, dataset and vocabulary.
def load_models_and_test_data(file_name):
    test_data = get_test_datasets()
    checkpoint = torch.load(file_name)
    source_vocab = checkpoint['source']
    target_vocab = checkpoint['target']
    model = create_model_for_inference(source_vocab, target_vocab)
    model.load_state_dict(checkpoint['model_state_dict'])

    return model, source_vocab, target_vocab, test_data

In [12]:
def get_test_datasets():
    # Load the spaCy language models (they must already be installed)
    spacy_de = spacy.load('de_core_news_sm')
    spacy_en = spacy.load('en_core_web_sm')

    # define the tokenizer
    def tokenize_de(text):
        return [token.text for token in spacy_de.tokenizer(text)][::-1]

    def tokenize_en(text):
        return [token.text for token in spacy_en.tokenizer(text)]

    # Create the torchtext Fields
    source = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
    target = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

    # Splits the data in Train, Test and Validation data
    _, _, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(source, target), test="test2016")

    return test_data

In [13]:
def create_model_for_inference(source_vocab, target_vocab):
    # Define the required dimensions and hyper parameters
    embedding_dim = 256
    hidden_dim = 1024
    dropout = 0.5

    # Instantiate the models
    encoder = Encoder(len(source_vocab), embedding_dim, hidden_dim, n_layers=2, dropout_prob=dropout)
    one_step_decoder = OneStepDecoder(len(target_vocab), embedding_dim, hidden_dim, n_layers=2, dropout_prob=dropout)
    decoder = Decoder(one_step_decoder, device)

    model = EncoderDecoder(encoder, decoder)

    model = model.to(device)

    return model

## `predict()`

Get the specific example from the test dataset using its id and convert its (already tokenized) source sentence to a list of integers using the source vocabulary.

Then create a batch of 1 using the `unsqueeze()` function.

Set the model to **`eval`** mode for inference and call the forward method of the Encoder, passing the whole numericalized source sentence at once.

Create a list named `outputs` to store the generated words. Loop a fixed number of times (or until the end-of-sentence token is generated), calling the `forward` method of `OneStepDecoder` directly.

Then find the most probable predicted output and save the corresponding word from the target vocabulary.

Print the ground truth and the predicted sentence.

In [14]:
def predict(id, model, source_vocab, target_vocab, test_data, debug=False):
    src = vars(test_data.examples[id])['src']
    trg = vars(test_data.examples[id])['trg']

    # Convert each source token to integer values using the vocabulary
    tokens = ['<sos>'] + [token.lower() for token in src] + ['<eos>']
    src_indexes = [source_vocab.stoi[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    model.eval()

    # Run the forward pass of the encoder
    hidden, cell = model.encoder(src_tensor)

    # Take the integer value of <sos> from the target vocabulary.
    trg_index = [target_vocab.stoi['<sos>']]
    next_token = torch.LongTensor(trg_index).to(device)

    outputs = []
    trg_indexes = []

    with torch.no_grad():
        # Use the hidden and cell vectors of the Encoder and, in a loop,
        # run the forward pass of the OneStepDecoder until the step limit
        # (30 here) is reached or <eos> has been generated by the model.
        for _ in range(30):
            output, hidden, cell = model.decoder.one_step_decoder(next_token, hidden, cell)

            # Take the most probable word
            next_token = output.argmax(1)

            trg_indexes.append(next_token.item())

            predicted = target_vocab.itos[output.argmax(1).item()]
            if predicted == '<eos>':
                break
            else:
                outputs.append(predicted)
    if debug:
        print(f'Ground Truth    = {" ".join(trg)}')
        print(f'Predicted Label = {" ".join(outputs)}')

    predicted_words = [target_vocab.itos[i] for i in trg_indexes]

    return predicted_words

In [15]:
from torchtext.data.metrics import bleu_score

def cal_bleu_score(dataset, model, source_vocab, target_vocab):
    targets = []
    predictions = []

    for i in range(len(dataset)):
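        # Read from the `dataset` argument rather than the global `test_data`
        target = vars(dataset.examples[i])['trg']
        predicted_words = predict(i, model, source_vocab, target_vocab, dataset)
        # NOTE: predict() returns only generated tokens (no leading <sos>), so
        # [1:-1] drops the first generated word as well as the last token
        # (normally <eos>); the score printed below was computed this way.
        predictions.append(predicted_words[1:-1])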
        targets.append([target])

    print(f'BLEU Score: {round(bleu_score(predictions, targets) * 100, 2)}')

In [16]:
checkpoint_file = 'model/nmt-model-lstm-25.pth'
model, source_vocab, target_vocab, test_data = load_models_and_test_data(checkpoint_file)
predict(1, model, source_vocab, target_vocab, test_data, debug=True)
predict(2, model, source_vocab, target_vocab, test_data, debug=True)
predict(14, model, source_vocab, target_vocab, test_data, debug=True)
predict(20, model, source_vocab, target_vocab, test_data, debug=True)

cal_bleu_score(test_data, model, source_vocab, target_vocab)

Ground Truth    = a boston terrier is running on lush green grass in front of a white fence .
Predicted Label = a german athlete runs in the grass in front of a white fence .
Ground Truth    = a girl in karate uniform breaking a stick with a front kick .
Predicted Label = a girl in a karate leotard is using a saw to a toddler .
Ground Truth    = three people sit in a cave .
Predicted Label = three people are sitting in a hut .
Ground Truth    = people standing outside of a building .
Predicted Label = people standing outside of a building .
BLEU Score: 19.22


Some of the predictions made by our model work nicely. Remember that the model never saw the test data during training, so it has generalized reasonably well for shorter sentences. However, there are also some not-so-good predictions, and the longer examples above show the RNN **struggling with long sentences**.

# Conclusion

This tutorial walks through the implementation details of Machine Translation using an Encoder-Decoder model with RNNs. There are many advancements over this basic RNN model; a natural next step is to add an attention mechanism to this network to improve performance.

# Contributors

**Author**
<br>Chee Lam

# References

1. [Machine Translation using Recurrent Neural Network and PyTorch](http://www.adeveloperdiary.com/data-science/deep-learning/nlp/machine-translation-recurrent-neural-network-pytorch/)