<a href="https://colab.research.google.com/github/Dagobert42/NMT-Attention/blob/main/Neural_Machine_Translation_with_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

The goal of this notebook is to implement the **RNNsearch-50** model, i.e., the encoder-decoder system with attention, for a language pair other than English-German, German-English, English-French, or French-English. We choose **German-Italian** as our example language pair.

Furthermore, we loosely follow the **test-driven development** ("TDD") paradigm while replicating the original system described in the paper:

    Bahdanau, Cho & Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.

An in-depth walk-through of the project is given in the accompanying report which can be found here ADD LINK

# 1. Setup

Please choose which dependencies to install in your environment. On re-runs you can uncheck the boxes to save some time.

In [None]:
torch = True #@param {type:"boolean"}
if torch:
    !pip install --upgrade torch

torchtext = True #@param {type:"boolean"}
if torchtext:
    !pip install --upgrade torchtext

spacy_packages = True #@param {type:"boolean"}
if spacy_packages:
    !pip install --upgrade spacy
    !python -m spacy download en_core_web_sm
    !python -m spacy download de_core_news_sm
    !python -m spacy download it_core_news_sm
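
As a minimal sketch (not part of the original notebook), the following helper reports which packages cannot be imported yet, which may help decide which boxes to leave checked on a re-run. The name `missing_packages` is our own; the helper relies only on `importlib.util.find_spec` from the standard library.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages this notebook installs in the cell above.
print(missing_packages(['torch', 'torchtext', 'spacy']))
```

An empty list means all three packages are already available. The check does not cover the SpaCy language models, which are downloaded separately.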

Next let us import all the modules we are going to be using.

In [41]:
# access to translation datasets
import torchtext

# model implementation
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

# word tokenization
import spacy

# utilities
import random
import time

# 2. Data

## 2.1 Dataset

For data we use the Web Inventory of Transcribed and Translated Talks, which is available as the torchtext dataset `IWSLT2017`. On re-runs you can uncheck the box to save some time.

In [49]:
from torchtext.datasets import IWSLT2017

load_dataset = False #@param {type:"boolean"}
if load_dataset:
    train_iter, test_iter, val_iter = IWSLT2017(split=('train', 'test', 'valid'), language_pair=('de', 'it'))
src_sentence, trg_sentence = next(train_iter)
print('Example:\n')
print('source ->', src_sentence)
print('target ->', trg_sentence)

print('train examples:', len(train_iter), end='\t')
print('test:', len(test_iter), end='\t')
print('validation:', len(val_iter))

Example:

source -> Das meine ich ernst, teilweise deshalb -- weil ich es wirklich brauchen kann!

target -> per i tanti, lusinghieri commenti, anche perché... Ne ho bisogno!!!

train examples: 205465	test: 1567	validation: 923


## 2.2 Vocabulary

The vocabulary converts words to indices and vice versa. Unknown words are mapped to the `<U>` token. The vocabulary can be filtered by word count and provides methods for converting between index sequences and sentences. It also takes an optional SpaCy `nlp` object which, when given, is used for tokenization.

In [51]:
class Vocab:
    """
    A vocabulary holding dictionaries for converting words to indices and back.
    """
    class Entry:
        def __init__(self, id):
            """ An entry to the vocabulary. With index and count. """
            self.id = id
            self.count = 1

        def __repr__(self):
            """ String representation for printing. """
            return str((self.id, self.count))

    def __init__(self, text=None, spacy_nlp=None):
        """
        Creates a vocabulary over an input text in the form of
        dictionaries for indices and word counts. Hand in a
        spacy_nlp object to make use of SpaCy for tokenization.
        """
        self.SPECIALS = '<S> <E> <U>'
        self.spacy_nlp = spacy_nlp
        self.words = {}
        self.ids = {}

        # add special tokens to vocabulary
        for word in self.SPECIALS.split():
            id = len(self.words.keys())
            self.words[word] = self.Entry(id)
            self.ids[id] = word

        if text:
            self.append(text)

    def append(self, txt):
        """ Adds a string token by token to the vocabulary. """
        # use SpaCy for tokenization if requested, else split on whitespace
        if self.spacy_nlp:
            tokens = [tok.text for tok in self.spacy_nlp.tokenizer(txt)]
        else:
            tokens = txt.split()

        for word in tokens:
            if word not in self.words:
                # new word: assign the next free index
                next_id = len(self.words)
                self.words[word] = self.Entry(next_id)
                self.ids[next_id] = word
            else:
                self.words[word].count += 1

    def filter(self, n_samples, descending=True):
        """
        Reduces this vocab's dictionary to n_samples
        after sorting by word count.
        """
        # Note: self.ids is left unchanged, and the special tokens are
        # ranked by count like any other word.
        sorted_list = sorted(
                self.words.items(),
                key=lambda item: item[1].count,
                reverse=descending)
        self.words = dict(sorted_list[:n_samples])
        return self

    def get_indeces(self, sentence):
        """ Converts a sentence into a list of indices from this vocabulary. """
        END = 1  # index of '<E>'
        UNK = 2  # index of '<U>'
        if self.spacy_nlp:
            # look up the token text, since the dictionary keys are strings
            tokens = [tok.text for tok in self.spacy_nlp.tokenizer(sentence)]
        else:
            tokens = sentence.split()
        seq = [self.words[word].id if word in self.words else UNK
               for word in tokens]
        seq.append(END)
        return seq
    
    def get_sentence(self, indeces):
        """
        Converts a list of indices into a readable sentence
        using words from this vocabulary.
        """
        return ' '.join([self.ids[id] for id in indeces])
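
The `filter` method above leaves `self.ids` untouched and ranks the special tokens like ordinary words. As a hedged, standalone illustration (independent of the `Vocab` class; the function names and the toy corpus are assumptions made for this sketch), count-based filtering can also reserve the special tokens and reassign contiguous indices:

```python
from collections import Counter

SPECIALS = ['<S>', '<E>', '<U>']  # same special tokens as in the class above


def build_filtered_vocab(sentences, n_words):
    """Keep the n_words most frequent words, always preceded by the specials."""
    counts = Counter(word for s in sentences for word in s.split())
    kept = [word for word, _ in counts.most_common(n_words)]
    words = SPECIALS + [w for w in kept if w not in SPECIALS]
    word_to_id = {w: i for i, w in enumerate(words)}
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word


def encode(sentence, word_to_id):
    """Map words to indices, using <U> for unknown words and appending <E>."""
    unk, end = word_to_id['<U>'], word_to_id['<E>']
    return [word_to_id.get(w, unk) for w in sentence.split()] + [end]


def decode(ids, id_to_word):
    """Map indices back to a whitespace-joined sentence."""
    return ' '.join(id_to_word[i] for i in ids)


# Toy corpus, made up for illustration only.
corpus = ["the cat sat", "the dog sat", "the bird flew"]
word_to_id, id_to_word = build_filtered_vocab(corpus, n_words=2)

ids = encode("the cat sat", word_to_id)
print(ids)                      # [3, 2, 4, 1]
print(decode(ids, id_to_word))  # the <U> sat <E>
```

Only `the` and `sat` survive the filter here, so `cat` is encoded as the unknown token.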

We make sure the vocabulary works as intended by running tests on a tiny corpus. Note that during development these tests were written **first**; the corresponding functionality was implemented in the class afterwards.

In [52]:
TEST_CORPUS_EN = """The final project should implement a system 
        related to deep learning for NLP using the Py- Torch library 
        and test it. The project is documented in an ACL-style paper 
        that adheres to the standards of practice in computational 
        linguistics."""

test_vocab = Vocab(TEST_CORPUS_EN)

# run some tests on our vocabulary implementation

# BASIC
assert(test_vocab.ids[0] == '<S>') # <S> should always be first
assert(test_vocab.ids[3] == 'The')
assert(test_vocab.words['The'].id == 3)
assert(test_vocab.words['the'].count == 2)

# SPACY OPTION
english_nlp = spacy.load('en_core_web_sm')
test_vocab = Vocab(TEST_CORPUS_EN, spacy_nlp=english_nlp)

assert(test_vocab.ids[0] == '<S>') # <S> should always be first
assert(test_vocab.ids[5] == 'project')
assert(test_vocab.words['project'].id == 5)
assert(test_vocab.words['project'].count == 2)

# VOCAB FILTER
top_30 = test_vocab.filter(n_samples=30)

assert(len(test_vocab.words) == 30)
assert(len(top_30.words) == 30)
assert(top_30 == test_vocab)

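# SENT VEC CONVERSION
# the expected indices below assume whitespace tokenization,
# so this check uses a vocabulary built without SpaCy
plain_vocab = Vocab(TEST_CORPUS_EN)
vec = plain_vocab.get_indeces("This is a test")
sent = plain_vocab.get_sentence(vec)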

assert(vec == [2, 24, 8, 22, 1])
assert(sent == '<U> is a test <E>')
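
The bare assertions above stop at the first failure. As an optional sketch of the same test-first idea, the checks could be grouped into named test cases with the standard `unittest` module. The `encode` function and the small dictionary below are hypothetical stand-ins for `Vocab.get_indeces`, so that the example runs on its own.

```python
import unittest


def encode(sentence, word_to_id, unk=2, end=1):
    """Hypothetical stand-in for Vocab.get_indeces (whitespace tokenization)."""
    return [word_to_id.get(w, unk) for w in sentence.split()] + [end]


class EncodeTest(unittest.TestCase):
    def setUp(self):
        # Tiny hand-written vocabulary, for illustration only.
        self.word_to_id = {'<S>': 0, '<E>': 1, '<U>': 2, 'a': 3, 'test': 4}

    def test_known_words_map_to_their_ids(self):
        self.assertEqual(encode('a test', self.word_to_id), [3, 4, 1])

    def test_unknown_words_map_to_unk(self):
        self.assertEqual(encode('another test', self.word_to_id), [2, 4, 1])


# Run the suite explicitly so that this also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EncodeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method reports separately, so a failing conversion does not hide the results of the remaining checks.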

# 3. Model

## 3.1 Encoder
