## **Prerequisites**

### **Check which GPU you got**
If you were not assigned the P100-PCIE GPU, open the Runtime dropdown at the top of the page and select "Factory reset runtime".

In [2]:
!nvidia-smi

Tue Apr 14 22:22:34 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

### **Mount and Install Packages**
Mount Google Drive, then install the spaCy language models, torchtext, and PyTorch.

In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import os
os.chdir("/content/drive/My Drive/English-to-French-Translation")
!ls

!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

!pip3 install 'torchtext==0.5.0'

!pip3 install torch torchvision

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
'BOW-Based Text Classification - Notebook.ipynb'
 data
 embeddings
 experiments
 models
'Notebook - EuroParl.ipynb'
'Notebook - Hansard.ipynb'
'Notebook - Multi30k.ipynb'
'Notebook - MyCorpus - Copy.ipynb'
'Notebook - MyCorpus.ipynb'
 README.md
 src
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Collecting fr_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releas

### **Important: Reset Runtime**

Note: Google Colab has a small quirk. After installing spaCy, you need to restart the notebook runtime.

There are two ways to do this:
1. Click on the Runtime dropdown and select "Restart Runtime". Once that is done, proceed to the next step (there is no need to remount the drive).
2. Run the code below. It kills the current process, which effectively restarts the runtime.

In [0]:
import os
os.kill(os.getpid(), 9)

## **Get the data**

### **Get a list of vocabs:**
The vocabularies are already stored in the Google Drive folder, so we only have to load them.

What will they contain?
* ```source_vocabs``` and ```target_vocabs``` each contain two dictionaries:
  1. A mapping of words (strings) to an integer, and
  2. A mapping of integers to words
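
As a rough illustration of these two mappings, here is a toy sketch in plain Python. The helper `build_mappings` and the three-word vocabulary are made up for this example; the project's `vocabs` module may organise things differently.

```python
def build_mappings(words):
    """Build a word -> id dictionary and its inverse from a list of unique words."""
    word2id = {word: idx for idx, word in enumerate(words)}
    id2word = {idx: word for word, idx in word2id.items()}
    return word2id, id2word


word2id, id2word = build_mappings(["a", "dog", "runs"])

# Words outside the vocabulary fall back to an extra "unknown" id,
# placed just past the last real word
unk_id = len(word2id)
ids = [word2id.get(w, unk_id) for w in ["a", "cat", "runs"]]
print(ids)  # [0, 3, 2]
```

Placing the unknown id right after the last real word keeps every id in one contiguous range, which is convenient when sizing an embedding table.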


In [1]:
import sys
sys.path.append('/content/drive/My Drive/English-to-French-Translation/src/dataloader/')

import vocabs

models_dir = "/content/drive/My Drive/English-to-French-Translation/models/Multi30k/"
source_vocabs = vocabs.load_vocabs_from_file(models_dir + 'vocab.english.gz')
target_vocabs = vocabs.load_vocabs_from_file(models_dir + 'vocab.french.gz')

Loaded 5932 words
Loaded 6520 words


### **Tokenize and split the dataset**
We will have three types of datasets:
1. Training data: the data used to train our model
2. Validation (val) data: the data used to evaluate our model at each step of the training process
3. Test data: the data used to evaluate our model after all the training is done

How do we get the three types of data?
* The test set is already in the `Testing` folder
* The training and validation sets are a random 75% / 25% split of the data in the `Training` folder
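
The idea behind a random train/validation split can be sketched in plain Python. This is only an illustration of the concept; `split_indices` and its `seed` argument are hypothetical names, and the notebook itself relies on `torch.utils.data.random_split`.

```python
import random


def split_indices(num_items, train_ratio, seed=0):
    """Shuffle item indices and cut them into a training part and a validation part."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    num_train = int(num_items * train_ratio)
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_indices(num_items=20, train_ratio=0.75)
print(len(train_idx), len(val_idx))  # 15 5
```

Every item lands in exactly one of the two parts, so the validation data is never seen during training.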

The code to get our three types of datasets is:

In [2]:
import sys
sys.path.append('/content/drive/My Drive/English-to-French-Translation/src/dataloader')

import os
from tqdm.notebook import tqdm

import torch
import numpy as np
import utils
import vocabs

class Seq2SeqDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        dir_: str,
        source_vocabs: vocabs.VocabDataset,
        target_vocabs: vocabs.VocabDataset,
        source_lang: str,
        target_lang: str,
    ):

        """ Initialize the dataset from a directory of parallel texts
            Note:
                The parallel text in 'dir_' must contain source and target transcriptions
                with the file extension 'source_lang' and 'target_lang'

            Parameters
            ----------
            dir_ : str
                A path to the directory of parallel text
            source_vocabs : VocabDataset
                A vocab dataset of the source text
            target_vocabs : VocabDataset
                A vocab dataset of the target text
            source_lang : { 'en', 'fr' }
                The source language
            target_lang : { 'en', 'fr' }
                The target language
        """

        # Get the spacy instances
        source_spacy = utils.get_spacy_instance(source_lang)
        target_spacy = utils.get_spacy_instance(target_lang)

        # Get all the files with the text
        transcriptions = utils.get_parallel_text(dir_, [source_lang, target_lang])

        source_filepaths = [os.path.join(dir_, trans[0]) for trans in transcriptions]
        target_filepaths = [os.path.join(dir_, trans[1]) for trans in transcriptions]

        source_l = utils.read_transcription_files(source_filepaths, source_spacy)
        target_l = utils.read_transcription_files(target_filepaths, target_spacy)

        source_word2id = source_vocabs.get_word2id()
        target_word2id = target_vocabs.get_word2id()

        F_unk, F_pad = range(len(source_word2id), len(source_word2id) + 2)
        E_unk, E_sos, E_eos, E_pad = range(len(target_word2id), len(target_word2id) + 4)

        corpus_iterator = zip(target_l, source_l)
        corpus_size = utils.get_size_of_corpus(source_filepaths)

        pairs = []
        E_lens = []
        F_lens = []

        for (e, e_fn, _), (f, f_fn, _) in tqdm(corpus_iterator, total=corpus_size):
            assert e_fn[:-2] == f_fn[:-2]

            if not e or not f:
                continue

            # Skip sentences > 50 words
            if len(f) > 50 or len(e) > 50:
                continue

            # Skip sentences with no words
            if len(f) <= 0 or len(e) <= 0:
                print("Found a sentence with no words in it!")
                continue

            F = torch.tensor([source_word2id.get(w, F_unk) for w in f])
            E = torch.tensor(
                [E_sos] + [target_word2id.get(w, E_unk) for w in e] + [E_eos]
            )

            # Validate the contents of E and F
            if torch.any(F < 0) or torch.any(F > F_unk):
                print("F_unk:", F_unk)
                print("F:", F)
                raise ValueError("Contents of F should be <= F_unk and >= 0!")

            if torch.any(E < 0) or torch.any(E > E_eos):
                print("E_eos:", E_eos)
                print("E:", E)
                raise ValueError("Contents of E should be <= E_eos and >= 0!")

            # Skip sentences that don't have any words in the vocab
            if torch.all(F == F_unk) and torch.all(E[1:-1] == E_unk):
                continue

            pairs.append((F, E))
            E_lens.append(E.size()[0])
            F_lens.append(F.size()[0])

        print("Number of sentence pairs:", len(pairs))
        print("Avg. num words in target text:", np.mean(E_lens))
        print("Std. num words in target text:", np.std(E_lens))
        print("Avg. num words in source text:", np.mean(F_lens))
        print("Std. num words in source text:", np.std(F_lens))

        self.dir_ = dir_
        self.source_vocab_size = len(source_word2id) + 2  # pad id and unk
        self.source_unk = F_unk
        self.source_pad_id = F_pad

        self.target_unk = E_unk
        self.target_sos = E_sos
        self.target_eos = E_eos
        self.target_pad_id = E_pad
        self.target_vocab_size = len(target_word2id) + 4  # unk, sos, eos, and pad id
        self.pairs = pairs

    def __len__(self):
        """ Returns the number of parallel texts in this dataset """
        return len(self.pairs)

    def __getitem__(self, i):
        """ Returns the i-th parallel texts in this dataset """
        return self.pairs[i]


class Seq2SeqDataLoader(torch.utils.data.DataLoader):
    def __init__(self, dataset, source_pad_id, target_pad_id, **kwargs):
        """ Loads the dataset for the model
            It can load the dataset in parallel by setting 'num_workers' param > 0

            Parameters
            ----------
            dataset : Seq2SeqDataset
                The parallel text dataset
            source_pad_id : int
                An ID used to pad the source text for batching
            target_pad_id : int
                An ID used to pad the target text for batching
        """
        super().__init__(dataset, collate_fn=self.collate, **kwargs)

        self.source_pad_id = source_pad_id
        self.target_pad_id = target_pad_id

    def collate(self, batch):
        """ Given a batch of source and target texts, it will pad it
            Specifically, it pads F with self.source_pad_id and E with self.target_pad_id

            Parameters
            ----------
            batch : A list of (F, E) pairs where F and E are both torch.tensor

            Returns
            -------
            (F, F_lens, E, E_lens) : tuple
                F is a torch.tensor of size (S, N)
                F_lens is a torch.tensor of size (N,) with the unpadded length of each F
                E is a torch.tensor of size (T, N)
                E_lens is a torch.tensor of size (N,) with the unpadded length of each E
        """
        F, E = zip(*batch)
        F_lens = torch.tensor([f.size()[0] for f in F])
        E_lens = torch.tensor([e.size()[0] for e in E])

        F = torch.nn.utils.rnn.pad_sequence(F, batch_first=False, padding_value=self.source_pad_id)
        E = torch.nn.utils.rnn.pad_sequence(E, batch_first=False, padding_value=self.target_pad_id)

        return F, F_lens, E, E_lens


train_dir = "/content/drive/My Drive/English-to-French-Translation/data/Multi30k/Training"
test_dir = "/content/drive/My Drive/English-to-French-Translation/data/Multi30k/Testing"

source_lang = "en"
target_lang = "fr"
train_val_ratio = 0.75
batch_size = 64

dataset = Seq2SeqDataset(
    train_dir,
    source_vocabs,
    target_vocabs,
    source_lang,
    target_lang,
)

num_training_data = int(len(dataset) * train_val_ratio)
num_val_data = len(dataset) - num_training_data

train_dataset, val_dataset = torch.utils.data.random_split(
    dataset, [num_training_data, num_val_data]
)

train_dataloader = Seq2SeqDataLoader(
    train_dataset,
    dataset.source_pad_id,
    dataset.target_pad_id,
    batch_size=batch_size,
    shuffle=True,
    pin_memory=True,
    num_workers=10,
)
val_dataloader = Seq2SeqDataLoader(
    val_dataset,
    dataset.source_pad_id,
    dataset.target_pad_id,
    batch_size=batch_size,
    shuffle=True,
    pin_memory=True,
    num_workers=10,
)

test_dataset = Seq2SeqDataset(
    test_dir,
    source_vocabs,
    target_vocabs,
    source_lang,
    target_lang,
)
test_dataloader = Seq2SeqDataLoader(
    test_dataset,
    dataset.source_pad_id,
    dataset.target_pad_id,
    batch_size=batch_size,
    shuffle=True,
    pin_memory=True,
    num_workers=10,
)



Number of sentence pairs: 29460
Avg. num words in target text: 16.263951120162933
Std. num words in target text: 4.727645134343328
Avg. num words in source text: 13.071554650373388
Std. num words in source text: 4.057199491280985



Number of sentence pairs: 1014
Avg. num words in target text: 16.355029585798817
Std. num words in target text: 4.774417087948714
Avg. num words in source text: 13.231755424063117
Std. num words in source text: 4.053699880023548
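
To make the batching step more concrete, here is a plain-Python sketch of time-major padding, which is the layout produced by padding with `batch_first=False`. The helper `pad_time_major` and the toy ids are made up for illustration; the notebook itself uses `torch.nn.utils.rnn.pad_sequence`.

```python
def pad_time_major(sequences, pad_id):
    """Pad variable-length id lists and lay them out as (max_len, batch_size)."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (batch_size, max_len) to (max_len, batch_size)
    return [list(step) for step in zip(*padded)]


batch = [[4, 7, 9], [5, 2], [8]]
lengths = [len(seq) for seq in batch]
padded = pad_time_major(batch, pad_id=0)

print(lengths)  # [3, 2, 1]
print(padded)   # [[4, 5, 8], [7, 2, 0], [9, 0, 0]]
```

Row `t` of the result holds the `t`-th token of every sequence, which is why the lengths are returned alongside the padded batch: they record where the real tokens end.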


## **Our Model**

### **The Encoder**

In [0]:
import torch
from torch import nn


class Encoder(nn.Module):
  def __init__(
    self,
    source_vocab_size : int,
    word_embedding_size : int,
    source_pad_id : int,
    num_hidden_layers: int = 2,
    hidden_state_size: int = 512,
    dropout_value: float = 0.1,
    rnn_dropout_value: float = 0.5,
    cell_type: str = "lstm",
  ):
    super().__init__()

    # The embedding layer
    # It should not learn the padding tokens from the source sequence
    self.embedding = nn.Embedding(
      source_vocab_size, word_embedding_size, padding_idx=source_pad_id
    )

    self.pad_id = source_pad_id
    self.word_embedding_size = word_embedding_size

    # Add dropout
    self.dropout_value = dropout_value
    self.dropout = nn.Dropout(self.dropout_value)

    self.num_hidden_layers = num_hidden_layers
    self.hidden_state_size = hidden_state_size
    self.cell_type = cell_type
    self.rnn_dropout_value = rnn_dropout_value

    # Construct the RNN
    if self.cell_type == "rnn":
      self.rnn = nn.RNN(
        self.word_embedding_size,
        hidden_size=self.hidden_state_size,
        num_layers=self.num_hidden_layers,
        dropout=self.rnn_dropout_value,
        bidirectional=True,
      )

    elif self.cell_type == "lstm":
      self.rnn = nn.LSTM(
        self.word_embedding_size,
        hidden_size=self.hidden_state_size,
        num_layers=self.num_hidden_layers,
        dropout=self.rnn_dropout_value,
        bidirectional=True,
      )

    elif self.cell_type == "gru":
      self.rnn = nn.GRU(
        self.word_embedding_size,
        hidden_size=self.hidden_state_size,
        num_layers=self.num_hidden_layers,
        dropout=self.rnn_dropout_value,
        bidirectional=True,
      )

  def reset_parameters(self):
    ''' Resets the parameters '''
    
    self.embedding.reset_parameters()
    self.rnn.reset_parameters()

  def forward(self, F, F_lens):
    ''' Performs a forward propagation of the Encoder

        Parameters
        ----------
        F : torch.LongTensor(S, N)
          A batch of padded sequences in the source language
        F_lens : torch.LongTensor(N)
          The length of each sequence in the current batch

        Returns
        -------
        output : torch.FloatTensor(S, N, 2 * H)
          The hidden states of the RNN, with both directions concatenated
    '''
      
    # Get the embeddings for the inputs
    x = self.embedding(F)

    # Randomly zero out elements of the embeddings with probability self.dropout_value
    x = self.dropout(x)

    # Pack the embedded inputs into a PackedSequence object
    # packed_input has two main attributes: data and batch_sizes
    # Note: recent PyTorch versions expect the lengths on the CPU
    packed_input = nn.utils.rnn.pack_padded_sequence(
      x, F_lens.cpu(), enforce_sorted=False
    )

    # Pass the packed embedded input into the RNN (it accepts a PackedSequence object)
    # Note: hidden_states in LSTM is a tuple of (hidden_state, cell_state)
    packed_output, hidden_states = self.rnn(packed_input)

    # Unpack the packed outputs
    output, output_sizes = nn.utils.rnn.pad_packed_sequence(
      packed_output, padding_value=self.pad_id
    )

    return output
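
Packing lets the RNN skip padded positions: at each time step, only the sequences that still have a real token are processed. The plain-Python sketch below computes that per-step count, which mirrors the idea behind the `batch_sizes` field of a `PackedSequence`. The helper `active_per_step` is hypothetical and not part of PyTorch.

```python
def active_per_step(lengths):
    """For each time step, count how many sequences still have a real token."""
    return [sum(1 for n in lengths if n > t) for t in range(max(lengths))]


print(active_per_step([3, 1, 2]))  # [3, 2, 1]
```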

### **The Decoder**


In [0]:
import torch
from torch import nn

class Decoder(torch.nn.Module):
  def __init__(
    self,
    target_vocab_size : int,
    word_embedding_size : int,
    target_eos : int,
    hidden_state_size : int = 1024,
    cell_type : str = "lstm",
    dropout_value : float = 0.5,
  ):
    super().__init__()

    self.target_vocab_size = target_vocab_size
    self.word_embedding_size = word_embedding_size
    self.target_eos = target_eos

    self.dropout_value = dropout_value    
    self.hidden_state_size = hidden_state_size
    self.cell_type = cell_type

    # Create the embedding without learning the EOS token
    self.embedding = nn.Embedding(target_vocab_size, word_embedding_size, padding_idx=target_eos)

    # Construct the dropout
    self.dropout = torch.nn.Dropout(self.dropout_value, inplace=True)

    # Construct the cells that takes in the word embeddings
    if self.cell_type == "lstm":
      self.cell = torch.nn.LSTMCell(
        self.word_embedding_size, self.hidden_state_size
      )

    elif self.cell_type == "gru":
      self.cell = torch.nn.GRUCell(
        self.word_embedding_size, self.hidden_state_size
      )

    elif self.cell_type == "rnn":
      self.cell = torch.nn.RNNCell(
        self.word_embedding_size, self.hidden_state_size
      )

    # Construct a simple feed-forward neural network that takes the hidden states of the RNN
    # and outputs some set of words
    self.ff = torch.nn.Linear(self.hidden_state_size, self.target_vocab_size)

  def reset_parameters(self):
    ''' Resets the parameters '''

    self.embedding.reset_parameters()
    self.cell.reset_parameters()
    self.ff.reset_parameters()

  def get_first_hidden_state(self, h, F_lens):
    ''' Obtains the first hidden state of the Decoder, which is the last 
        hidden state of the Encoder.

        Parameters
        ----------
        h : torch.FloatTensor(S, N, 2 * H)
          The hidden states of the Encoder

          Note: h[..., : self.hidden_state_size // 2] holds the
                forward hidden states of the Encoder while
                h[..., self.hidden_state_size // 2 :] holds the
                backward hidden states of the Encoder

        F_lens : torch.LongTensor(N, )
          The lengths of sequences in the current batch
        
        Returns
        -------
        htilde_tm1 : torch.FloatTensor(N, 2 * H)
          The initial hidden state of the Decoder
    '''
    # Check that the input sizes are correct:
    # assert h.size()[2] == self.hidden_state_size
    # assert list(F_lens.size()) == [h.size()[1]]

    # Transpose h from (S, N, 2 * H) to (N, S, 2 * H)
    h_transpose = h.permute(1, 0, 2)

    # Getting the hidden states at the last valid timestep, F_lens - 1 (Size: (N, 1, 2 * H))
    forward_states = torch.gather(
      h_transpose,
      1,
      (F_lens - 1).view(-1, 1).unsqueeze(2).repeat(1, 1, self.hidden_state_size),
    )

    # Keep only the forward half, going from (N, 1, 2 * H) to (N, H)
    forward_states = forward_states[..., 0, 0 : self.hidden_state_size // 2]

    # Getting the backward hidden states at the first timestep (Size: (N, H))
    backward_states = h_transpose[..., 0, self.hidden_state_size // 2 :]

    # Combining the forward and backward states (Size: (N, 2 * H))
    htilde_tm1 = torch.cat((forward_states, backward_states), dim=1)

    # Check that the sizes are correct:
    # time = h.size()[0]
    # num_sequences = h.size()[1]
    # assert list(htilde_tm1.size()) == [num_sequences, self.hidden_state_size]

    return htilde_tm1

  def forward(self, E_tm1, htilde_tm1, h, F_lens):
    ''' Performs a forward propagation of the Decoder

        Parameters
        ----------
        E_tm1 : torch.LongTensor(N, )
          The target tokens from the previous timestep
        htilde_tm1 : torch.FloatTensor(N, 2 * H)
          The hidden state from the previous timestep, or None at the first timestep
          (for an LSTM, a tuple of the hidden state and the cell state)
        h : torch.FloatTensor(S, N, 2 * H)
          The hidden states from the Encoder
        F_lens : torch.LongTensor(N, )
          The lengths of each sequence in the current batch

        Returns
        -------
        logits_t, htilde_t : torch.FloatTensor(N, self.target_vocab_size), 
                             torch.FloatTensor(N, 2 * H)
          The logits and the hidden states of the current timestep
    '''
    if htilde_tm1 is None:
      htilde_tm1 = self.get_first_hidden_state(h, F_lens)

      # For an LSTM, we initialize the cell states with only zeros
      if self.cell_type == "lstm":
        htilde_tm1 = (htilde_tm1, torch.zeros_like(htilde_tm1))

    # Embed the input
    xtilde_t = self.embedding(E_tm1)

    # Apply dropout
    xtilde_t = self.dropout(xtilde_t)

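    # Get the hidden state from the embedded input
    htilde_t = self.cell(xtilde_t, htilde_tm1)

    # An LSTMCell returns (hidden_state, cell_state); only the hidden state
    # is fed to the output layer
    if self.cell_type == "lstm":
      logits_t = self.ff(htilde_t[0])
    else:
      logits_t = self.ff(htilde_t)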

    return logits_t, htilde_t
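
The way the Decoder picks its first hidden state can be sketched in plain Python: the forward direction of a bidirectional encoder finishes at the last real token, while the backward direction finishes at the first token. The helper `first_decoder_state` and the toy values below are made up for illustration and use nested lists instead of tensors.

```python
def first_decoder_state(h, lengths, hidden_size):
    """h[s][n] is a list of 2 * hidden_size values: forward half, then backward half."""
    states = []
    for n, length in enumerate(lengths):
        forward = h[length - 1][n][:hidden_size]  # forward pass ends at the last real token
        backward = h[0][n][hidden_size:]          # backward pass ends at the first token
        states.append(forward + backward)
    return states


# Toy encoder output with S = 3 steps, N = 2 sequences, hidden_size = 1.
# Each entry is [forward_value, backward_value]; sequence 1 has length 2,
# so its entry at step 2 is padding.
h = [
    [[0.1, 0.9], [0.4, 0.6]],
    [[0.2, 0.8], [0.5, 0.7]],
    [[0.3, 0.7], [0.0, 0.0]],
]
print(first_decoder_state(h, lengths=[3, 2], hidden_size=1))
# [[0.3, 0.9], [0.5, 0.6]]
```

Note that the padded entry of the shorter sequence is never read, which is why the sequence lengths are needed here.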


### **The Seq2Seq**

In [0]:
import torch
from torch import nn


class Seq2Seq(nn.Module):
  def __init__(
    self,
    encoder,
    decoder,
    source_vocab_size,
    target_vocab_size,
    target_sos,
    target_eos,
    beam_width=4
  ):

    super().__init__()

    assert decoder.hidden_state_size == 2 * encoder.hidden_state_size
    assert decoder.cell_type == encoder.cell_type
    
    # Explicitly register the encoder and decoder as submodules so that they move with the parent's device
    self.add_module('encoder', encoder)
    self.add_module('decoder', decoder)

    self.cell_type = encoder.cell_type

    self.source_vocab_size = source_vocab_size
    self.target_vocab_size = target_vocab_size
    self.target_sos = target_sos
    self.target_eos = target_eos
    self.encoder_hidden_size = encoder.hidden_state_size

    self.beam_width = beam_width

  def reset_parameters(self):
    self.encoder.reset_parameters()
    self.decoder.reset_parameters()

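  def get_target_padding_mask(self, E):
    ''' Marks an EOS as padding when the token right before it is also EOS,
        so the first EOS of each sequence is kept (position 0 is compared with itself) '''
    pad_mask = E == self.target_eos  # (T - 1, N)
    pad_mask = pad_mask & torch.cat([pad_mask[:1], pad_mask[:-1]], 0)
    return pad_mask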

  def forward(
    self, F, F_lens, E=None, max_T=100, on_max="raise", teacher_forcing_ratio=0.5
  ):  
    # get_logits() always needs E because the beam search path below is disabled
    if E is None:
      raise RuntimeError("E must be set while beam search is disabled")

    h = self.encoder(F, F_lens)  # (S, N, 2 * H)

    return self.get_logits(h, F_lens, E, teacher_forcing_ratio)

    # if self.training:
    #   return self.get_logits(h, F_lens, E, teacher_forcing_ratio)
    # else:
    #   return self.beam_search(h, F_lens, max_T, on_max)

  def get_logits(self, h, F_lens, E, teacher_forcing_ratio):
    ''' Returns a set of logits by forward propagating through the Seq2Seq model

        Parameters
        ----------
        h : torch.FloatTensor(S, N, 2 * H)
          The hidden states from the Encoder
        F_lens : torch.LongTensor(N, )
          The lengths of each input sequence in the source language, in the current batch
        E : torch.LongTensor(T, N)
          The expected sequences in the target language, in the current batch
        teacher_forcing_ratio : float
          The % of the time to do teacher forcing

        Returns
        -------
        logits : torch.FloatTensor(T - 1, N, Vo)
          It is the output of each sequence from the decoder in the target language
          It does not contain SOS as the first token in the logits[] array
    '''
    max_time = E.size()[0]
    num_sequences = E.size()[1]

    # The logits array where logits_array[t] = logits_t at time t
    logits_array = []

    # Initially all sequences will start with an SOS (Size: (N,))
    E_tm1 = E[0, :]

    # The hidden state from the previous timestep
    # Initially it is None so that the forward() step will
    # use the default, first hidden state
    htilde_tm1 = None

    for cur_time in range(0, max_time - 1):

      # Move forward one step
      # The logits_t[i, :] are the unnormalized scores over all target words for sequence i
      # The h_t[i, :] is the hidden state for sequence i
      logits_t, h_t = self.decoder(E_tm1, htilde_tm1, h, F_lens)

      logits_array.append(logits_t)
      htilde_tm1 = h_t

      # See if we are doing teacher forcing or not
      do_teacher_force = torch.rand(1) < teacher_forcing_ratio

      if do_teacher_force:
        E_tm1 = E[cur_time + 1, :]
      else:
        E_tm1 = logits_t.argmax(1)

    # Convert the array of logits into a tensor of logits
    logits = torch.stack(logits_array)

    return logits

  def beam_search(self, h, F_lens, max_T, on_max):
    # beam search
    assert not self.training
    
    htilde_tm1 = self.decoder.get_first_hidden_state(h, F_lens)
    logpb_tm1 = torch.where(
      torch.arange(self.beam_width, device=h.device) > 0,  # K
      torch.full_like(htilde_tm1[..., 0].unsqueeze(1), -float("inf")),  # k > 0
      torch.zeros_like(htilde_tm1[..., 0].unsqueeze(1)),  # k == 0
    )  # (N, K)
    
    assert torch.all(logpb_tm1[:, 0] == 0.0)
    assert torch.all(logpb_tm1[:, 1:] == -float("inf"))
    
    b_tm1_1 = torch.full_like(  # (t, N, K)
      logpb_tm1, self.target_sos, dtype=torch.long
    ).unsqueeze(0)
    
    # We treat each beam within the batch as just another batch when
    # computing logits, then recover the original batch dimension by
    # reshaping
    htilde_tm1 = htilde_tm1.unsqueeze(1).repeat(1, self.beam_width, 1)
    htilde_tm1 = htilde_tm1.flatten(end_dim=1)  # (N * K, 2 * H)
    
    if self.cell_type == "lstm":
      htilde_tm1 = (htilde_tm1, torch.zeros_like(htilde_tm1))

    h = h.unsqueeze(2).repeat(1, 1, self.beam_width, 1)
    h = h.flatten(1, 2)  # (S, N * K, 2 * H)
    F_lens = F_lens.unsqueeze(-1).repeat(1, self.beam_width).flatten()
    v_is_eos = torch.arange(self.target_vocab_size, device=h.device)
    v_is_eos = v_is_eos == self.target_eos  # (V,)
    t = 0
    on_max = "halt"

    while torch.any(b_tm1_1[-1, :, 0] != self.target_eos):
      if t == max_T:
        if on_max == "raise":
          raise RuntimeError(
            f"Beam search has not finished by t={t}. Increase the "
            f"number of parameters and train longer"
          )
        elif on_max == "halt":
          print("Beam search not finished. Halted")

          # Add EOS to the end of each sequence
          b_tm1_1[-1, :, 0] = self.target_eos
          break

      finished = b_tm1_1[-1] == self.target_eos
      E_tm1 = b_tm1_1[-1].flatten()  # (N * K,)
      logits_t, htilde_t = self.decoder(E_tm1, htilde_tm1, h, F_lens)
      logits_t = logits_t.view(
        -1, self.beam_width, self.target_vocab_size
      )  # (N, K, V)
      logpy_t = torch.nn.functional.log_softmax(logits_t, -1)

      # We length-normalize the extensions of the unfinished paths
      if t:
        logpb_tm1 = torch.where(finished, logpb_tm1, logpb_tm1 * (t / (t + 1)))
        logpy_t = logpy_t / (t + 1)
      
      # For any path that's finished:
      # - v == <eos> gets log prob 0
      # - v != <eos> gets log prob -inf
      logpy_t = logpy_t.masked_fill(finished.unsqueeze(-1) & v_is_eos, 0.0)
      logpy_t = logpy_t.masked_fill(
        finished.unsqueeze(-1) & (~v_is_eos), -float("inf")
      )
      
      if self.cell_type == "lstm":
        htilde_t = (
          htilde_t[0].view(-1, self.beam_width, 2 * self.encoder_hidden_size),
          htilde_t[1].view(-1, self.beam_width, 2 * self.encoder_hidden_size),
        )
      else:
        htilde_t = htilde_t.view(
          -1, self.beam_width, 2 * self.encoder_hidden_size
        )
      
      b_t_0, b_t_1, logpb_t = self.update_beam(
        htilde_t, b_tm1_1, logpb_tm1, logpy_t
      )
      
      del logits_t, logpy_t, finished, htilde_t
      
      if self.cell_type == "lstm":
        htilde_tm1 = (b_t_0[0].flatten(end_dim=1), b_t_0[1].flatten(end_dim=1))
      else:
        htilde_tm1 = b_t_0.flatten(end_dim=1)  # (N * K, 2 * H)
      
      logpb_tm1, b_tm1_1 = logpb_t, b_t_1
      t += 1
    # Return the paths rather than the hidden states (Size: (t, N, K))
    return b_tm1_1

  def update_beam(self, htilde_t, b_tm1_1, logpb_tm1, logpy_t):
    cur_time = b_tm1_1.size()[0]
    num_sequences = b_tm1_1.size()[1]

    # Transpose b_tm1_1 so that instead of (t, N, self.beam_width) it is (N, t, self.beam_width)
    b_tm1_1_transpose = b_tm1_1.permute(1, 0, 2)

    # Expand logpb_tm1 with a 3rd dimension so that logpb_tm1_expanded[i, j, k] = logpb_tm1[i, j] for all k in V
    # So now its size is (N, self.beam_width, self.target_vocab_size)
    logpb_tm1_expanded = logpb_tm1.unsqueeze(2).expand_as(logpy_t)

    # Compute logpb_tm1_k_v
    # Size: (N, self.beam_width, self.target_vocab_size)
    logpb_tm1_k_v = logpb_tm1_expanded + logpy_t

    # Flatten logpb_tm1_k_v's size from (N, self.beam_width, self.target_vocab_size)
    # to (N, self.beam_width * self.target_vocab_size)
    logpb_tm1_k_v_flattened = torch.flatten(logpb_tm1_k_v, start_dim=1)

    # Get the top k items per sequence
    # Size: (N, self.beam_width)
    top_k_item_indices = torch.topk(
      logpb_tm1_k_v_flattened, self.beam_width, dim=1
    ).indices

    # Decompose each index into (j, v)
    # Sizes for both matrices: (N, self.beam_width)
    top_k_item_indices_j = top_k_item_indices // self.target_vocab_size
    top_k_item_indices_v = top_k_item_indices - (
      top_k_item_indices_j * self.target_vocab_size
    )

    # This is a sequence of hidden states for each new beam
    # Note that if we have an LSTM we need to also update the cell states
    # b_t_0[i, k, :] = htilde_t[i, top_k_item_indices_j[i, k], :]
    # <-> b_t_0[a, b, c] = htilde_t[a, top_k_item_indices_j[a, b, c], c]
    b_t_0 = None
    if self.cell_type == "lstm":
      hidden_states = htilde_t[0]
      cell_states = htilde_t[1]

      b_t_0_hidden_states = torch.gather(
        hidden_states,
        1,
        top_k_item_indices_j.unsqueeze(2).expand_as(hidden_states),
      )

      b_t_0_cell_states = torch.gather(
        cell_states, 1, top_k_item_indices_j.unsqueeze(2).expand_as(cell_states)
      )

      b_t_0 = (b_t_0_hidden_states, b_t_0_cell_states)

    else:
      b_t_0 = torch.gather(
        htilde_t, 1, top_k_item_indices_j.unsqueeze(2).expand_as(htilde_t)
      )

    # Stores the k best new path so far for each new beam
    # It first copies the best paths in b_tm1_1_transpose to b_t_1
    # Size: (N, T, W)
    b_t_1 = torch.gather(
      b_tm1_1_transpose,
      2,
      top_k_item_indices_j.unsqueeze(1).expand_as(b_tm1_1_transpose),
    )

    # Add the words that make the best path to the end of b_t_1
    # Size: (N, T + 1, W)
    b_t_1 = torch.cat((b_t_1, top_k_item_indices_v.unsqueeze(1)), 1)

    # Rotate b_t_1 so that its size (N, t + 1, self.beam_width) is now (t + 1, N, self.beam_width)
    b_t_1 = b_t_1.permute(1, 0, 2)

    # Stores the probabilities of our new paths for each new beam
    # logpb_t[i, k] = logpb_tm1_k_v[i, top_k_item_indices_j[i, k], top_k_item_indices_v[i, k]]
    logpb_t = torch.gather(
      logpb_tm1_k_v_flattened,
      1,
      top_k_item_indices_j * self.target_vocab_size + top_k_item_indices_v,
    )

    return (b_t_0, b_t_1, logpb_t)
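
The core trick in `update_beam` is to flatten the (path, word) score table, take the top entries, and then turn each flat index back into a path index and a word index. The sketch below shows that step for a single sequence in plain Python. The helper `top_k_extensions` and the toy scores are made up for illustration and leave out batching and hidden-state bookkeeping.

```python
def top_k_extensions(scores, beam_width):
    """scores[k][v] is the log-probability of extending path k with word v."""
    vocab_size = len(scores[0])
    flat = [score for row in scores for score in row]
    # Indices of the beam_width highest scores in the flattened list
    best = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:beam_width]
    # A flat index i corresponds to path i // vocab_size and word i % vocab_size
    return [(i // vocab_size, i % vocab_size, flat[i]) for i in best]


scores = [
    [-1.0, -3.0, -2.0],  # extensions of path 0
    [-0.5, -4.0, -2.5],  # extensions of path 1
]
print(top_k_extensions(scores, beam_width=2))
# [(1, 0, -0.5), (0, 0, -1.0)]
```

Both surviving paths may come from the same parent path, which is why the hidden states have to be gathered by path index afterwards.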

## **Training:**

### **How to build our model:**

In [0]:
def make_model(source_vocab_size, target_vocab_size, source_pad_id, target_sos, target_eos, device):
  ''' Makes the model

      Parameters
      ----------
      source_vocab_size : int
        The vocabulary size of the source language
      target_vocab_size : int
        The vocabulary size of the target language
      source_pad_id : int
        The ID of a padding token in the source language
      target_sos : int
        The ID of a SOS token in the target language
      target_eos : int
        The ID of an EOS token in the target language
      device : torch.device
        The device to run the model on

      Returns
      -------
      seq2seq : Seq2Seq
        The model
  '''

  cell_type = 'gru'

  source_word_embedding_size = 256
  target_word_embedding_size = 256

  encoder_num_hidden_layers = 2
  encoder_hidden_size = 512
  encoder_dropout = 0.1
  encoder_rnn_dropout = 0.5

  decoder_dropout = 0.1

  beam_width = 3

  # Building the encoder
  encoder = Encoder(
    source_vocab_size,
    source_word_embedding_size,
    source_pad_id,
    num_hidden_layers=encoder_num_hidden_layers,
    hidden_state_size=encoder_hidden_size,
    dropout_value=encoder_dropout,
    rnn_dropout_value=encoder_rnn_dropout,
    cell_type=cell_type
  )

  # Building the decoder
  decoder = Decoder(
    target_vocab_size,
    target_word_embedding_size,
    target_eos,
    hidden_state_size=2 * encoder.hidden_state_size,
    cell_type=cell_type,
    dropout_value=decoder_dropout
  )

  # Building the seq2seq model
  seq2seq = Seq2Seq(
    encoder,
    decoder,
    source_vocab_size,
    target_vocab_size,
    target_sos,
    target_eos,
    beam_width=beam_width
  )

  seq2seq.to(device)

  return seq2seq

### **How to train for one epoch:**

In [0]:
from torchtext.data.metrics import bleu_score
from tqdm.notebook import tqdm

def train_for_one_epoch(model, loss_function, optimizer, train_dataloader, device, loss_function_ignore_idx):
  ''' Trains the model on the training set

      Parameters
      ----------
      model : Seq2Seq
        The model
      loss_function : torch.nn.Module
        The loss function (ex: torch.nn.CrossEntropyLoss)
      optimizer : torch.optim.Optimizer
        The optimizer (ex: torch.optim.SGD, torch.optim.Adam)
      train_dataloader : Seq2SeqDataLoader
        The dataloader for the training set
      device : torch.device
        The device to run predictions on
      loss_function_ignore_idx : int
        A value that the loss function ignores when it is computing the loss

      Returns
      -------
      loss : float
        The loss for the training set
  '''
  train_loss = 0.0

  model.train()

  for _, (F, F_lens, E, E_lens) in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):

    # Send the data to the specified device
    F = F.to(device)
    F_lens = F_lens.to(device)
    E = E.to(device)

    # Zeros out the model's previous gradient with ``optimizer.zero_grad()``
    optimizer.zero_grad()

    # Get the next-token logits (Size: (T - 1, N, model.target_vocab_size))
    logits = model(F, F_lens, E=E, teacher_forcing_ratio=0.5)

    # Remove the SOS (Size: (T - 1, N))
    E = E[1:, :]

    # Find the padded positions and replace them with the index the loss function ignores
    pad_mask = model.get_target_padding_mask(E)
    E = E.masked_fill(pad_mask, loss_function_ignore_idx)

    # Flatten the logits so that it is ((T - 1) * N, model.target_vocab_size)
    flattened_logits = logits.view(-1, logits.shape[2])

    # Flatten the expected output so that it is ((T - 1) * N)
    flattened_E = E.view(-1)

    # Compute the loss
    loss = loss_function(flattened_logits, flattened_E)
    train_loss += loss.item()

    # Back propagate
    loss.backward()
    optimizer.step()

    # Free the batch tensors early to reduce GPU memory pressure when using CUDA
    del F, F_lens, E, logits, loss

  return train_loss / len(train_dataloader)    
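
The training loop above replaces padded target positions with `loss_function_ignore_idx` so that they do not contribute to the loss. Below is a minimal pure-Python sketch of what a mean cross-entropy with an ignored index computes. The function name `masked_cross_entropy` and the toy inputs are hypothetical; the notebook itself relies on `torch.nn.CrossEntropyLoss(ignore_index=-1)`.

```python
import math


def masked_cross_entropy(logits, targets, ignore_index=-1):
    """Mean negative log-likelihood over the positions whose target is not ignore_index.

    logits  : list of rows, one row of unnormalized scores per position
    targets : list of class IDs, one per position
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log-sum-exp of the row
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    # In this sketch, a batch with only ignored positions yields NaN
    return total / count if count else float("nan")


# Three flattened positions; the last one is padding (target == -1)
toy_logits = [[0.0, 0.0], [2.0, 0.0], [5.0, -5.0]]
toy_targets = [0, 1, -1]
toy_loss = masked_cross_entropy(toy_logits, toy_targets)

# Changing the logits at the padded position leaves the loss unchanged
same_loss = masked_cross_entropy([[0.0, 0.0], [2.0, 0.0], [-9.0, 9.0]], toy_targets)
```

Because the average is taken only over non-ignored positions, batches with a lot of padding are not rewarded with an artificially low loss.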

### **How to evaluate our model:**

In [0]:
from torchtext.data.metrics import bleu_score
from tqdm.notebook import tqdm

def evaluate_model(model, 
                   loss_function, 
                   test_dataloader, 
                   device, 
                   target_sos, 
                   target_eos, 
                   target_pad_id, 
                   target_id2word,
                   loss_function_ignore_idx):
  ''' Evaluates the model by computing its loss and BLEU score over a test dataset

      Parameters
      ----------
      model : Seq2Seq
        The model
      loss_function : torch.nn.Module
        The loss function (ex: torch.nn.CrossEntropyLoss)
      test_dataloader : Seq2SeqDataLoader
        The dataloader for the test set
      device : torch.device
        The device to run predictions on
      target_sos : int
        The ID of a SOS token in the target language
      target_eos : int
        The ID of an EOS token in the target language
      target_pad_id : int
        The ID of a padding token in the target language
      target_id2word : { int : str }
        A mapping of word IDs in the target language to their string representations
      loss_function_ignore_idx : int
        A value that the loss function ignores when it is computing the loss

      Returns
      -------
      loss, bleu_score : float, float
        The loss and BLEU score for the test set, each averaged over the batches
  '''
  test_bleu = 0.0
  test_loss = 0.0

  model.eval()

  with torch.no_grad():
    for _, (F, F_lens, E, E_lens) in tqdm(enumerate(test_dataloader), total=len(test_dataloader)):

      # Send the data to the proper device
      F = F.to(device)
      F_lens = F_lens.to(device)
      E = E.to(device)

      # Get the logits without teacher forcing (teacher_forcing_ratio=0)
      logits = model(F, F_lens, E=E, teacher_forcing_ratio=0)

      # Remove the SOS (Size: (T - 1, N))
      E = E[1:, :]

      # Find the padded positions and replace them with the index the loss function ignores
      pad_mask = model.get_target_padding_mask(E)
      E = E.masked_fill(pad_mask, loss_function_ignore_idx)

      # Get the candidates from the argmax of logits
      E_cand = logits.argmax(2)

      # Compute the loss
      flattened_logits = logits.view(-1, logits.shape[2])
      flattened_E = E.view(-1)
      test_loss += loss_function(flattened_logits, flattened_E).item()

      # Computes the total BLEU score of the batch
      test_bleu += compute_batch_total_bleu(
        E, E_cand, target_sos, target_eos, target_id2word
      )

      # Free the batch tensors early to reduce GPU memory pressure when using CUDA
      del F, F_lens, E, E_cand

  test_loss /= len(test_dataloader)
  test_bleu /= len(test_dataloader)

  return test_loss, test_bleu

def compute_batch_total_bleu(
  E_ref, E_cand, target_sos, target_eos, target_vocab_id2word
):
  ''' Computes the BLEU score over the entire batch

      Parameters
      ----------
      E_ref : torch.LongTensor(S - 1, N)
        A batch of expected translations (token IDs) in the target language
      E_cand : torch.LongTensor(S - 1, N)
        A batch of translations (token IDs) predicted by the model in the target language
      target_sos : int
        The ID of a SOS token in the target language
      target_eos : int
        The ID of an EOS token in the target language
      target_vocab_id2word : { int : str }
        A mapping of word IDs in the target language to their string representations

      Returns
      -------
      bleu_score : float
        The corpus-level BLEU score computed over all sequences in the current batch
  '''
  num_sequences = E_ref.size()[1]

  # Transpose the tensors (N, S)
  E_ref = E_ref.t()
  E_cand = E_cand.t()

  # Word lists for the whole batch: one reference list and one candidate per sequence
  E_ref_words = []
  E_cand_words = []

  for i in range(num_sequences):

    # Convert the tensors to a python list
    E_ref_seq = E_ref[i].tolist()
    E_cand_seq = E_cand[i].tolist()

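    # Drop the first token of each sequence.
    # NOTE: evaluate_model() already strips the SOS from E_ref, and E_cand (an argmax
    # over the logits) is not prefixed with one, so this also drops the first real token.
    E_ref_seq = E_ref_seq[1:]
    E_cand_seq = E_cand_seq[1:]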

    # Remove the EOS
    if target_eos in E_ref_seq:
      E_ref_seq = E_ref_seq[0 : E_ref_seq.index(target_eos)]
    
    if target_eos in E_cand_seq:
      E_cand_seq = E_cand_seq[0 : E_cand_seq.index(target_eos)]

    # Convert to words
    E_ref_seq_words = [
      target_vocab_id2word.get(word_id, "NAN") for word_id in E_ref_seq
    ]
    E_cand_seq_words = [
      target_vocab_id2word.get(word_id, "NAN") for word_id in E_cand_seq
    ]

    E_ref_words.append([E_ref_seq_words])
    E_cand_words.append(E_cand_seq_words)

  return bleu_score(E_cand_words, E_ref_words)
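
The per-sequence preprocessing in `compute_batch_total_bleu` (truncate at the first EOS, then map IDs to words) can be sketched on its own as follows. The helper name `ids_to_words`, the toy vocabulary, and the EOS ID of 2 are hypothetical values chosen only for illustration.

```python
def ids_to_words(token_ids, eos_id, id2word, unknown="NAN"):
    """Cut a sequence at its first EOS and convert the remaining IDs to words."""
    if eos_id in token_ids:
        token_ids = token_ids[: token_ids.index(eos_id)]
    return [id2word.get(token_id, unknown) for token_id in token_ids]


# Hypothetical vocabulary and EOS ID, for illustration only
example_id2word = {3: "le", 4: "chat", 5: "dort"}
example_eos = 2

# Everything from the EOS onward (including the -1 padding fill) is discarded
reference = ids_to_words([3, 4, 5, 2, -1, -1], example_eos, example_id2word)
candidate = ids_to_words([3, 4, 2, 5, 5, 5], example_eos, example_id2word)

# bleu_score expects one token list per candidate and a list of
# reference token lists per candidate, hence the extra nesting
candidate_corpus = [candidate]
references_corpus = [[reference]]
```

Truncating at the EOS matters because the decoder keeps emitting tokens after it, and those trailing tokens would otherwise count against the candidate's n-gram precision.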


### **How to train for many epochs:**

In [17]:
from torchtext.data.metrics import bleu_score
from tqdm.notebook import tqdm

def train():
  ''' The main function to train and validate our model.
      It uses early stopping on the validation loss (controlled by `patience`)
      instead of a hard-coded number of epochs

      Returns
      -------
      model : Seq2Seq
        The trained Seq2Seq model
  '''
  global source_vocabs, target_vocabs
  global dataset, train_dataset, val_dataset
  global train_dataloader, val_dataloader

  device = torch.device("cuda")

  # Build our model
  model = make_model(dataset.source_vocab_size, 
                     dataset.target_vocab_size,
                     dataset.source_pad_id,
                     dataset.target_sos,
                     dataset.target_eos,
                     device)
  
  patience = 3  # Stop after this many consecutive epochs without a validation-loss improvement
  num_epochs = float("inf")  # No hard cap on epochs; early stopping ends the loop

  best_val_loss = float("inf")

  num_poor = 0
  epoch = 1
  
  optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

  while epoch <= num_epochs and num_poor < patience:
      
    # Train
    loss_function = torch.nn.CrossEntropyLoss(ignore_index=-1)
    train_loss = train_for_one_epoch(model, 
                                     loss_function, 
                                     optimizer, 
                                     train_dataloader, 
                                     device,
                                     -1)
    
    # Evaluate the model
    val_loss, val_bleu = evaluate_model(model, 
                                        loss_function, 
                                        val_dataloader, 
                                        device, 
                                        dataset.target_sos, 
                                        dataset.target_eos,
                                        dataset.target_pad_id,
                                        target_vocabs.get_id2word(),
                                        -1)

    print(f"Epoch {epoch}: Train loss={train_loss}, Val loss={val_loss}, Val BLEU={val_bleu}")

    if val_loss > best_val_loss:
      num_poor += 1

    else:
      num_poor = 0
      best_val_loss = val_loss

    epoch += 1

  return model 

trained_model = train()

Epoch 1: Train loss=2.461742557197637, Val loss=2.512489732997171, Val BLEU=0.0677292187777655


Epoch 2: Train loss=1.8011267281681127, Val loss=2.3239781856536865, Val BLEU=0.1270369076981229


Epoch 3: Train loss=1.539661431071386, Val loss=2.169709819144216, Val BLEU=0.1658259977862291


Epoch 4: Train loss=1.3273027968199955, Val loss=2.13703636876468, Val BLEU=0.20195953367244918


Epoch 5: Train loss=1.1699273679297784, Val loss=2.0853732382429055, Val BLEU=0.22761253945509208


Epoch 6: Train loss=1.0454526147401402, Val loss=2.0295793917672387, Val BLEU=0.24141758965917443


Epoch 7: Train loss=0.9399219665224152, Val loss=2.0343396005959344, Val BLEU=0.25018502063677966


Epoch 8: Train loss=0.8112135468363073, Val loss=2.0626684396431365, Val BLEU=0.2617388283218972


Epoch 9: Train loss=0.7448372871889545, Val loss=2.09374846569423, Val BLEU=0.26424553191874045
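
The run stops after epoch 9 because of the patience rule in `train()`. Below is a minimal sketch that replays that rule on the validation losses printed above (rounded to four decimals). The function name `early_stopping_epoch` is hypothetical; the comparison logic mirrors the `val_loss > best_val_loss` check in the training loop.

```python
def early_stopping_epoch(val_losses, patience):
    """Replay the patience rule and return (epoch training stops at, best epoch)."""
    best_loss = float("inf")
    best_epoch = None
    num_poor = 0

    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss > best_loss:
            num_poor += 1  # no improvement this epoch
        else:
            num_poor = 0  # improvement: reset the counter
            best_loss = val_loss
            best_epoch = epoch

        if num_poor >= patience:
            return epoch, best_epoch

    return len(val_losses), best_epoch


# Validation losses from the log above, rounded to four decimals
logged_val_losses = [2.5125, 2.3240, 2.1697, 2.1370, 2.0854, 2.0296, 2.0343, 2.0627, 2.0937]

stop_epoch, best_epoch = early_stopping_epoch(logged_val_losses, patience=3)
```

The validation loss is lowest at epoch 6 and then rises for three consecutive epochs, which exhausts the patience of 3. Note that `train()` returns the model as it stands after the final epoch, not the weights from the best epoch; saving a checkpoint whenever the validation loss improves would be one way to keep the best ones.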


## **Testing**
We are going to test our model on the test set

In [0]:
def test():
  ''' Used to test the model on the testing data'''
  
  global source_vocabs, target_vocabs
  global dataset, test_dataset, test_dataloader
  global trained_model

  device = torch.device("cuda")
    
  # The loss function
  loss_function = torch.nn.CrossEntropyLoss(ignore_index=-1)

  # Evaluate the model
  test_loss, test_bleu = evaluate_model(trained_model, 
                                        loss_function, 
                                        test_dataloader, 
                                        device, 
                                        test_dataset.target_sos, 
                                        test_dataset.target_eos,
                                        test_dataset.target_pad_id,
                                        target_vocabs.get_id2word(),
                                        -1)

  print(f"Test loss={test_loss}, Test Bleu={test_bleu}")

test()

Test loss=1.9890752881765366, Test Bleu=0.29387823445831546
