# Sub-word modeling and convolutional networks 
## Introduction
We will be exploring two key concepts in this notebook: sub-word modeling and convolutional networks for NLP tasks. We will apply these concepts to a natural machine translation model. This notebook is solution to coding sections of [Assignment #5](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/assignments/a5.pdf) of Stanford's ["CS224n: Natural Language Processing with Deep Learning"](http://web.stanford.edu/class/cs224n/) course. Contents of this notebook are taken from the course materials. <br> 
In a previous notebook, I implemented an attentions based NMT model (you can find it [here](https://arashsaeidpour.github.io/)), which can be divided into 4 distinct stages:
1. **Embedding layer:** Converts raw input text (for both the source and target sentences) to a sequence
of dense word vectors via lookup.
2. **Encoder:** A RNN that encodes the source sentence as a sequence of encoder hidden states.
3. **Decoder:** A RNN that operates over the target sentence and attends to the encoder hidden states to
produce a sequence of decoder hidden states.
4. **Output prediction layer:** A linear layer with softmax that produces a probability distribution for
the next target word on each decoder timestep.

All 4 of these stages perform at a word level. In this notebook, we are going to replace stage (1) with a character-based CNN encoder, and we will improve stage (4) by adding a character-based LSTM

## Conda environment
First you need to create a conda virtual environment with all the necessary packages to run the code. Run the following command from within the repo directory to create a new env named "nmt": 

Activate the "nmt" that you just created:

Installing the IPython kernel in your env:

Now switch your notebook's kernel to "nmt" env.

## Implementations

### 1- Character-based convolutional encoder for NMT

![image.png](attachment:image.png)

<center> Fig1. Character-based convolutional encoder, which ultimately produces a word
embedding of length eword. <\center>

In [21]:
import math
import sys
import pickle
import time
from collections import namedtuple

from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
import numpy as np
from typing import List, Tuple, Dict, Set, Union
from tqdm import tqdm
from utils import read_corpus, batch_iter
from vocab import Vocab, VocabEntry

import torch
import torch.nn.utils
import torch.nn as nn
import torch.nn.functional as F

Hypothesis = namedtuple('Hypothesis', ['value', 'score'])
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence

import random

### Building the vocabulary from corpus

We start off by building the `vocab` from source and target corpora using `Vocab` class that has already been implemented in `vocab.py`:

In [2]:
vocab_size=50000   # Size of vocabulary for both source and target languages
freq_cutoff=2      # Words that were repeated less tha nthis value in corpus won't be included in the vocabulary

src_sents = read_corpus('./en_es_data/train.es', source='src')
tgt_sents = read_corpus('./en_es_data/train.en', source='tgt')

vocab = Vocab.build(src_sents, tgt_sents, vocab_size, freq_cutoff)

initialize source vocabulary ..
number of word types: 172418, number of word types w/ frequency >= 2: 80623
initialize target vocabulary ..
number of word types: 128873, number of word types w/ frequency >= 2: 64215


In [3]:
vocab.save('vocab.json')

We'll implement `words2charindices()` method in `VocabEntry` class (from `vocab.py`), which maps each character to its corresponding index in the character vocabulary (already implemented):

In order to apply tensor operations, we must ensure that all words are padded/truncated
to max word length $m_{word} = 21$, and all sentences should be padded to the length of the longest sentence in the batch. We will implement the `pad_sents_char` function which will produce these padded sentences. A padding word is represented by $m_{word}$ `<PAD>`-characters:

In [4]:
def pad_sents_char(sents, char_pad_token):
    """ Pad list of sentences according to the longest sentence in the batch and max_word_length.
    @param sents (list[list[list[int]]]): list of sentences, result of `words2charindices()`
        from `vocab.py`
    @param char_pad_token (int): index of the character-padding token
    @returns sents_padded (list[list[list[int]]]): list of sentences where sentences/words shorter
        than the max length sentence/word are padded out with the appropriate pad token, such that
        each sentence in the batch now has same number of words and each word has an equal
        number of characters
        Output shape: (batch_size, max_sentence_length, max_word_length)
    """
    # Words longer than 21 characters should be truncated
    max_word_length = 21
    len_sents=[len(sentc) for sentc in sents]
    max_len_sentc=max(len_sents)
    sents_padded=[]
    
    for length,sentc in zip(len_sents,sents):
        sentc_padded=[]
        for word in sentc:
            word_fixed=word.copy()
            if len(word)>max_word_length:
                word_fixed=word_fixed[:max_word_length]
            elif len(word)<max_word_length:
                word_fixed=word_fixed+[char_pad_token]*(max_word_length-len(word))
            sentc_padded.append(word_fixed)
            
        if len(sentc_padded)<max_len_sentc:
            sentc_padded= sentc_padded + [max_word_length*[char_pad_token]] * (max_len_sentc - length)
        sents_padded.append(sentc_padded)

    return sents_padded

Next we'll implement `to_input_tensor_char()` method in `VocabEntry` class (from `vocab.py`), which converts the padded sentences to to torch tensors (already implemented):

Now we are going to implement the **Highway network** as a [torch.nn module](https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_module.html), which takes the output of convolutional network, $\text{x}_\text{conv_out}$ and returns a word embedding:

In [5]:
class Highway(nn.Module):
    def __init__(self, eword , dropout_rate=0.3):
        """
        @eword:word embeding size (eword)

        """
        super(Highway, self).__init__()
        self.eword=eword
        self.linear_proj = nn.Linear(eword,eword,bias=True)
        self.linear_gate = nn.Linear(eword,eword,bias=True)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x_conv_out):
        """
        @x_conv_out: output of CNN , torch tensor of size eword
        
        @returns x_word_emb: word embedding of size eword 

        """
        x_proj_input = self.linear_proj(x_conv_out)
        x_proj=F.relu(x_proj_input)
        
        x_gate_input = self.linear_gate(x_conv_out)
        x_gate=torch.sigmoid(x_gate_input)
        
        x_highway=x_gate * x_proj + (1-x_gate) * x_conv_out
        x_word_emb=self.dropout(x_highway)

        return x_word_emb

Next we will implement the **convolutional network**, `cnn`, as a [torch.nn module](https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_module.html). `cnn` takes $\text{x}_\text{reshaped}$ as input and returns $\text{x}_\text{conv_out}$. We will be using kernel size of $k=5$ for our `Conv1d` layer: 

In [6]:
class CNN(nn.Module):
    def __init__(self,echar,f,k=5):
        """
        @echar:character embeding size (echar)
        @f: number of filters (channels)
        @k: window size
        """
        super(CNN, self).__init__()
        self.conv1d = nn.Conv1d(echar,f,k)
        
    def forward(self, x_reshaped):
        """
        @x_reshaped: x_emb reshaped to (echar,mword)
        
        @returns x_conv_out: outout of CNN with shape 

        """
        x_conv=self.conv1d(x_reshaped)
        x_conv_relued=torch.relu(x_conv)
        x_conv_out=x_conv_relued.max(2)
    
        return x_conv_out[0]

Now we proceed to implement the `ModelEmbeddings`, which converts input words to their cnn-based embeddings. We'll use drop-out rate of 0.3 and character embedding size of 50:

In [7]:
class ModelEmbeddings(nn.Module): 
    """
    Class that converts input words to their CNN-based embeddings.
    """
    def __init__(self, embed_size, vocab):
        """
        Init the Embedding layer for one language
        @param embed_size (int): Embedding size (dimensionality) for the output 
        @param vocab (VocabEntry): VocabEntry object. See vocab.py for documentation.
        """
        super(ModelEmbeddings, self).__init__()
        self.embed_size=embed_size
        self.embeddings = nn.Embedding(len(vocab.char2id),50, padding_idx=vocab.char2id['<pad>'])
        self.cnn=CNN(echar=50,f=embed_size,k=5)
        self.highway=Highway(eword=embed_size,dropout_rate=0.3)

    def forward(self, input):
        """
        Looks up character-based CNN embeddings for the words in a batch of sentences.
        @param input: Tensor of integers of shape (sentence_length, batch_size, max_word_length) where
            each integer is an index into the character vocabulary

        @param output: Tensor of shape (sentence_length, batch_size, embed_size), containing the 
            CNN-based embeddings for each word of the sentences in the batch
        """
        sentence_length, batch_size, max_word_length=input.shape
        x_emb=self.embeddings(input)
        x_reshaped=x_emb.view(-1,50,max_word_length)
        x_conv_out=self.cnn(x_reshaped)
        x_word_embd=self.highway(x_conv_out)
        
        return x_word_embd.view(sentence_length, batch_size,-1)

Now we will modify the `NMT` class that we have already implemented in [assignment four](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/assignments/a4.pdf) to include character-level embedding:

In [8]:
class NMT(nn.Module):
    """ Simple Neural Machine Translation Model:
        - Bidrectional LSTM Encoder
        - Unidirection LSTM Decoder
        - Global Attention Model (Luong, et al. 2015)
    """
    def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2, no_char_decoder=False):
        """ Init NMT Model.

        @param embed_size (int): Embedding size (dimensionality)
        @param hidden_size (int): Hidden Size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        @param dropout_rate (float): Dropout probability, for attention
        """
        super(NMT, self).__init__()
        self.model_embeddings_source = ModelEmbeddings(embed_size, vocab.src)
        self.model_embeddings_target = ModelEmbeddings(embed_size, vocab.tgt)

        self.hidden_size = hidden_size
        self.dropout_rate = dropout_rate
        self.vocab = vocab
        self.encoder = None 
        self.decoder = None
        self.h_projection = None
        self.c_projection = None
        self.att_projection = None
        self.combined_output_projection = None
        self.target_vocab_projection = None
        self.dropout = None
        
        self.encoder = nn.LSTM(hidden_size,hidden_size,bidirectional=True,bias=True)
        self.decoder = nn.LSTMCell(embed_size + hidden_size,hidden_size,bias=True)
        self.h_projection = nn.Linear(2*hidden_size,hidden_size,bias=False)
        self.c_projection = nn.Linear(2*hidden_size,hidden_size,bias=False)
        self.att_projection = nn.Linear(2*hidden_size,hidden_size,bias=False)
        self.combined_output_projection = nn.Linear(3*hidden_size,hidden_size,bias=False)
        self.target_vocab_projection = nn.Linear(hidden_size,len(vocab.tgt),bias=False)
        self.dropout = nn.Dropout(dropout_rate)


        if not no_char_decoder:
           self.charDecoder = CharDecoder(hidden_size, target_vocab=vocab.tgt)
        else:
           self.charDecoder = None

    def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        @param source (List[List[str]]): list of source sentence tokens
        @param target (List[List[str]]): list of target sentence tokens, wrapped by `<s>` and `</s>`

        @returns scores (Tensor): a variable/tensor of shape (b, ) representing the
                                    log-likelihood of generating the gold-standard target sentence for
                                    each example in the input batch. Here b = batch size.
        """
        # Compute sentence lengths
        source_lengths = [len(s) for s in source]

        source_padded_chars = self.vocab.src.to_input_tensor_char(source, device=self.device) 
        target_padded = self.vocab.tgt.to_input_tensor(target, device=self.device)
        target_padded_chars = self.vocab.tgt.to_input_tensor_char(target, device=self.device)   # Tensor: (tgt_len, b)
 
        enc_hiddens, dec_init_state = self.encode(source_padded_chars, source_lengths)
        enc_masks = self.generate_sent_masks(enc_hiddens, source_lengths)
        combined_outputs = self.decode(enc_hiddens, enc_masks, dec_init_state, target_padded_chars)


        P = F.log_softmax(self.target_vocab_projection(combined_outputs), dim=-1)

        # Zero out, probabilities for which we have nothing in the target text
        target_masks = (target_padded != self.vocab.tgt['<pad>']).float()

        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(-1) * target_masks[1:]
        scores = target_gold_words_log_prob.sum() 



        if self.charDecoder is not None:
            max_word_len = target_padded_chars.shape[-1]
            

            target_words = target_padded[1:].contiguous().view(-1)
            target_chars = target_padded_chars[1:].view(-1, max_word_len)
            target_outputs = combined_outputs.view(-1, 256)

            target_chars_oov = target_chars #torch.index_select(target_chars, dim=0, index=oovIndices)
            rnn_states_oov = target_outputs #torch.index_select(target_outputs, dim=0, index=oovIndices)
            oovs_losses = self.charDecoder.train_forward(target_chars_oov.t(), (rnn_states_oov.unsqueeze(0), rnn_states_oov.unsqueeze(0)))
            scores = scores - oovs_losses


        return scores


    def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """ Apply the encoder to source sentences to obtain encoder hidden states.
            Additionally, take the final states of the encoder and project them to obtain initial states for decoder.
        @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b, max_word_length), where
                                        b = batch_size, src_len = maximum source sentence length. Note that
                                       these have already been sorted in order of longest to shortest sentence.
        @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
        @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
                                        b = batch size, src_len = maximum source sentence length, h = hidden size.
        @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
                                                hidden state and cell.
        """
        enc_hiddens, dec_init_state = None, None
        src_len, b, max_word_length = source_padded.shape
        X = self.model_embeddings_source(source_padded)
        X = pack_padded_sequence(X, source_lengths, batch_first=False)
        enc_hiddens, (last_hidden, last_cell)  = self.encoder(X)
        enc_hiddens , output_lens = pad_packed_sequence(enc_hiddens,batch_first=False)
        enc_hiddens = enc_hiddens.permute(1,0,2)
        
        
        h_encoder_0 = torch.cat((last_hidden[0,:,:],last_hidden[1,:,:]),1)
        init_decoder_hidden=self.h_projection(h_encoder_0)
        
        c_encoder_0 = torch.cat((last_cell[0,:,:],last_cell[1,:,:]),1)
        init_decoder_cell=self.h_projection(c_encoder_0)
        
        dec_init_state = (init_decoder_hidden, init_decoder_cell)


        return enc_hiddens, dec_init_state


    def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
                dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
        """Compute combined output vectors for a batch.
        @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
                                     b = batch size, src_len = maximum source sentence length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
                                     b = batch size, src_len = maximum source sentence length.
        @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
        @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b, max_word_length), where
                                       tgt_len = maximum target sentence length, b = batch size.
        @returns combined_outputs (Tensor): combined output tensor  (tgt_len, b,  h), where
                                        tgt_len = maximum target sentence length, b = batch_size,  h = hidden size
        """
        # Chop of the <END> token for max length sentences.
        target_padded = target_padded[:-1]

        # Initialize the decoder state (hidden and cell)
        dec_state = dec_init_state

        # Initialize previous combined output vector o_{t-1} as zero
        batch_size = enc_hiddens.size(0)
        o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

        # Initialize a list we will use to collect the combined output o_t on each step
        combined_outputs = []

        enc_hiddens_proj = self.att_projection(enc_hiddens)
        Y = self.model_embeddings_target(target_padded)
        for Y_t in torch.split(Y,1,0):
            Y_t=Y_t.squeeze(0)
            Ybar_t = torch.cat((Y_t,o_prev),1)
            dec_state , o_t , e_t = self.step(Ybar_t,dec_state,enc_hiddens,enc_hiddens_proj,enc_masks)
            combined_outputs.append(o_t)
            o_prev=o_t
        combined_outputs=torch.stack(combined_outputs)


        return combined_outputs


    def step(self, Ybar_t: torch.Tensor,
            dec_state: Tuple[torch.Tensor, torch.Tensor],
            enc_hiddens: torch.Tensor,
            enc_hiddens_proj: torch.Tensor,
            enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
        """ Compute one forward step of the LSTM decoder, including the attention computation.
        @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
                                where b = batch size, e = embedding size, h = hidden size.
        @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
        @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
                                    src_len = maximum source length, h = hidden size.
        @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
                                    where b = batch size, src_len = maximum source length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
                                    where b = batch size, src_len is maximum source length.
        @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's new hidden state, second tensor is decoder's new cell.
        @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
        @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
                                Note: You will not use this outside of this function.
                                      We are simply returning this value so that we can sanity check
                                      your implementation.
        """

        combined_output = None
        dec_state = self.decoder(Ybar_t,dec_state)
        dec_hidden, dec_cell = dec_state[0],dec_state[1]
        e_t = torch.bmm (enc_hiddens_proj,dec_hidden.unsqueeze(2))
        e_t=e_t.squeeze(2)

        # Set e_t to -inf where enc_masks has 1
        if enc_masks is not None:
            e_t.data.masked_fill_(enc_masks.byte(), -float('inf'))

        alpha_t=F.softmax(e_t,1)
        a_t = torch.bmm(alpha_t.unsqueeze(1),enc_hiddens).squeeze(1)
        U_t=torch.cat((dec_hidden,a_t),1)
        V_t=self.combined_output_projection(U_t)
        O_t=torch.tanh(V_t)
        O_t=self.dropout(O_t)

        

        combined_output = O_t
        return dec_state, combined_output, e_t

    def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_lengths: List[int]) -> torch.Tensor:
        """ Generate sentence masks for encoder hidden states.

        @param enc_hiddens (Tensor): encodings of shape (b, src_len, 2*h), where b = batch size,
                                     src_len = max source length, h = hidden size.
        @param source_lengths (List[int]): List of actual lengths for each of the sentences in the batch.

        @returns enc_masks (Tensor): Tensor of sentence masks of shape (b, src_len),
                                    where src_len = max source length, h = hidden size.
        """
        enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
        for e_id, src_len in enumerate(source_lengths):
            enc_masks[e_id, src_len:] = 1
        return enc_masks.to(self.device)


    def beam_search(self, src_sent: List[str], beam_size: int=5, max_decoding_time_step: int=70) -> List[Hypothesis]:
        """ Given a single source sentence, perform beam search, yielding translations in the target language.
        @param src_sent (List[str]): a single source sentence (words)
        @param beam_size (int): beam size
        @param max_decoding_time_step (int): maximum number of time steps to unroll the decoding RNN
        @returns hypotheses (List[Hypothesis]): a list of hypothesis, each hypothesis has two fields:
                value: List[str]: the decoded target sentence, represented as a list of words
                score: float: the log-likelihood of the target sentence
        """
    
        src_sents_var = self.vocab.src.to_input_tensor_char([src_sent], self.device)

        src_encodings, dec_init_vec = self.encode(src_sents_var, [len(src_sent)])
        src_encodings_att_linear = self.att_projection(src_encodings)

        h_tm1 = dec_init_vec
        att_tm1 = torch.zeros(1, self.hidden_size, device=self.device)

        eos_id = self.vocab.tgt['</s>']

        hypotheses = [['<s>']]
        hyp_scores = torch.zeros(len(hypotheses), dtype=torch.float, device=self.device)
        completed_hypotheses = []


        t = 0
        while len(completed_hypotheses) < beam_size and t < max_decoding_time_step:
            t += 1
            hyp_num = len(hypotheses)

            exp_src_encodings = src_encodings.expand(hyp_num,
                                                     src_encodings.size(1),
                                                     src_encodings.size(2))

            exp_src_encodings_att_linear = src_encodings_att_linear.expand(hyp_num,
                                                                           src_encodings_att_linear.size(1),
                                                                           src_encodings_att_linear.size(2))

            y_tm1 = self.vocab.tgt.to_input_tensor_char(list([hyp[-1]] for hyp in hypotheses), device=self.device)
            y_t_embed = self.model_embeddings_target(y_tm1)
            y_t_embed = torch.squeeze(y_t_embed, dim=0)


            x = torch.cat([y_t_embed, att_tm1], dim=-1)

            (h_t, cell_t), att_t, _  = self.step(x, h_tm1,
                                                      exp_src_encodings, exp_src_encodings_att_linear, enc_masks=None)

            log_p_t = F.log_softmax(self.target_vocab_projection(att_t), dim=-1)

            live_hyp_num = beam_size - len(completed_hypotheses)
            contiuating_hyp_scores = (hyp_scores.unsqueeze(1).expand_as(log_p_t) + log_p_t).view(-1)
            top_cand_hyp_scores, top_cand_hyp_pos = torch.topk(contiuating_hyp_scores, k=live_hyp_num)

            prev_hyp_ids = top_cand_hyp_pos / len(self.vocab.tgt)
            hyp_word_ids = top_cand_hyp_pos % len(self.vocab.tgt)

            new_hypotheses = []
            live_hyp_ids = []
            new_hyp_scores = []

            decoderStatesForUNKsHere = []
            for prev_hyp_id, hyp_word_id, cand_new_hyp_score in zip(prev_hyp_ids, hyp_word_ids, top_cand_hyp_scores):
                prev_hyp_id = prev_hyp_id.item()
                hyp_word_id = hyp_word_id.item()
                cand_new_hyp_score = cand_new_hyp_score.item()

                hyp_word = self.vocab.tgt.id2word[hyp_word_id]

                # Record output layer in case UNK was generated
                if hyp_word == "<unk>":
                   hyp_word = "<unk>"+str(len(decoderStatesForUNKsHere))
                   decoderStatesForUNKsHere.append(att_t[prev_hyp_id])

                new_hyp_sent = hypotheses[prev_hyp_id] + [hyp_word]
                if hyp_word == '</s>':
                    completed_hypotheses.append(Hypothesis(value=new_hyp_sent[1:-1],
                                                           score=cand_new_hyp_score))
                else:
                    new_hypotheses.append(new_hyp_sent)
                    live_hyp_ids.append(prev_hyp_id)
                    new_hyp_scores.append(cand_new_hyp_score)

            if len(decoderStatesForUNKsHere) > 0 and self.charDecoder is not None: # decode UNKs
                decoderStatesForUNKsHere = torch.stack(decoderStatesForUNKsHere, dim=0)
                decodedWords = self.charDecoder.decode_greedy((decoderStatesForUNKsHere.unsqueeze(0), decoderStatesForUNKsHere.unsqueeze(0)), max_length=21, device=self.device)
                assert len(decodedWords) == decoderStatesForUNKsHere.size()[0], "Incorrect number of decoded words"
                for hyp in new_hypotheses:
                  if hyp[-1].startswith("<unk>"):
                        hyp[-1] = decodedWords[int(hyp[-1][5:])]#[:-1]

            if len(completed_hypotheses) == beam_size:
                break

            live_hyp_ids = torch.tensor(live_hyp_ids, dtype=torch.long, device=self.device)
            h_tm1 = (h_t[live_hyp_ids], cell_t[live_hyp_ids])
            att_tm1 = att_t[live_hyp_ids]

            hypotheses = new_hypotheses
            hyp_scores = torch.tensor(new_hyp_scores, dtype=torch.float, device=self.device)

        if len(completed_hypotheses) == 0:
            completed_hypotheses.append(Hypothesis(value=hypotheses[0][1:],
                                                   score=hyp_scores[0].item()))

        completed_hypotheses.sort(key=lambda hyp: hyp.score, reverse=True)
        return completed_hypotheses

    @property
    def device(self) -> torch.device:
        """ Determine which device to place the Tensors upon, CPU or GPU.
        """
        return self.att_projection.weight.device

    @staticmethod
    def load(model_path: str, no_char_decoder=False):
        """ Load the model from a file.
        @param model_path (str): path to model
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = NMT(vocab=params['vocab'], no_char_decoder=no_char_decoder, **args)
        model.load_state_dict(params['state_dict'])

        return model

    def save(self, path: str):
        """ Save the odel to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)

        params = {
            'args': dict(embed_size=self.model_embeddings_source.embed_size, hidden_size=self.hidden_size, dropout_rate=self.dropout_rate),
            'vocab': self.vocab,
            'state_dict': self.state_dict()
        }

        torch.save(params, path)


### 2- Character-based LSTM decoder for NMT
We will now add a LSTM-based character-level decoder to our NMT system, based on [Luong & Manning’s work](https://arxiv.org/abs/1604.00788). The main idea is that when our word-level decoder produces an `<UNK>` token, we run our character-level decoder (which you can think of as a character-level conditional language model) to instead generate the target word one character at a time, as shown in Figure 2. This will help us to produce rare and out-of-vocabulary target words.

![image.png](attachment:image.png)

<center> A character-based decoder which is triggered if the word-based decoder produces an UNK. Figure courtesy of Luong & Manning.<\center> 

This model can be divided into 3 parts:
1. **Forward computation of Character Decoder:** Given a sequence of characters, provides the the probability distribution for the next character in the sequence,
2. **Training of Character Decoder:** In the training stage, we train the character decoder on every word in the target sequence. On each training iteration, we add loss of character-based decoder to the loss of the word-based decoder, so that we simultaneously train the word-based model and character-based decoder,
3. **Decoding from the Character Decoder:** At test time, we first produce a word using the word-based NMT model. If the translation contains any `<UNK>` tokens, we then use a *greedy algorithm* with character-based decoder to generate a sequence of characters to replace `<UNK>` tokens as shown below:

![image.png](attachment:image.png)

Character-based decoder described above is implemented in `CharDecoder`. `CharDecoder` is a torch [nn.module](https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_module.html) and contains the following methods:
- `__init__`: Class constructor, this is where we initialize all the layers,
- `forward` : This is the function that produces the probability of next character given a sequence of character as described in **Forward computation of Character Decoder**,
- `train_forward` : This function computes the loss of character-based decoder on a batch of data,
- `decode_greedy` : This function perform the `greedy algorithm` described above to generate a word.

In [9]:
class CharDecoder(nn.Module):
    def __init__(self, hidden_size, char_embedding_size=50, target_vocab=None):
        """ Init Character Decoder.

        @param hidden_size (int): Hidden size of the decoder LSTM
        @param char_embedding_size (int): dimensionality of character embeddings
        @param target_vocab (VocabEntry): vocabulary for the target language. See vocab.py for documentation.
        """
        super(CharDecoder,self).__init__()
        target_vocab_size=len(target_vocab.char2id)
        self.hidden_size=hidden_size
        self.charDecoder = nn.LSTM(char_embedding_size,hidden_size,bias=True)
        self.char_output_projection = nn.Linear(hidden_size,target_vocab_size,bias=True)
        self.decoderCharEmb = nn.Embedding(target_vocab_size,char_embedding_size, padding_idx=target_vocab.char2id['<pad>'])
        self.target_vocab = target_vocab
       
  
    def forward(self, input, dec_hidden=None):
        """ Forward pass of character decoder.

        @param input: tensor of integers, shape (length, batch)
        @param dec_hidden: internal state of the LSTM before reading the input characters. A tuple of two tensors of shape (1, batch, hidden_size)

        @returns scores: called s_t in the PDF, shape (length, batch, self.vocab_size)
        @returns dec_hidden: internal state of the LSTM after reading the input characters. A tuple of two tensors of shape (1, batch, hidden_size)
        """
        emb_input=self.decoderCharEmb(input)
        
        if dec_hidden==None:
        
            h0=torch.zeros(1,input.shape[1],self.hidden_size)
            c0=torch.zeros(1,input.shape[1],self.hidden_size)
        else:
            h0,c0=dec_hidden[0],dec_hidden[1]
        
        ht,(hn,cn)=self.charDecoder(emb_input,(h0,c0))
        
        dec_hidden=(hn,cn)
        
        s_t=self.char_output_projection(ht)
        
    
        return s_t,dec_hidden

    def train_forward(self, char_sequence, dec_hidden=None):
        """ Forward computation during training.

        @param char_sequence: tensor of integers, shape (length, batch). Note that "length" here and in forward() need not be the same.
        @param dec_hidden: initial internal state of the LSTM, obtained from the output of the word-level decoder. A tuple of two tensors of shape (1, batch, hidden_size)

        @returns The cross-entropy loss, computed as the *sum* of cross-entropy losses of all the words in the batch.
        """
    
        
        input_decoder=char_sequence[:-1,:]
        true_output_decoder=char_sequence[1:,:]
        s_t,dec_hidden=self.forward(input_decoder, dec_hidden)
        
        
        true_output_decoder = true_output_decoder.contiguous().view(-1) 
        s_t  = s_t.view(-1, s_t.shape[-1])
        
        cross_entropy_loss=nn.CrossEntropyLoss(ignore_index=self.target_vocab.char2id['<pad>'],reduction='sum')
        return cross_entropy_loss(s_t,true_output_decoder)
    
    

    def decode_greedy(self, initialStates, device, max_length=21):
        """ Greedy decoding
        @param initialStates: initial internal state of the LSTM, a tuple of two tensors of size (1, batch, hidden_size)
        @param device: torch.device (indicates whether the model is on CPU or GPU)
        @param max_length: maximum length of words to decode

        @returns decodedWords: a list (of length batch) of strings, each of which has length <= max_length.
                              The decoded strings should NOT contain the start-of-word and end-of-word characters.
        """

   
        batch_size=initialStates[0].shape[1]
        output_words=['' for i in range(batch_size)]
        current_char_idx=torch.tensor([self.target_vocab.start_of_word]*batch_size,device=device).view(1,-1)
       
        h_t,c_t=initialStates[0],initialStates[1]
        
        for i in range(max_length):
            
            current_char_emb = self.decoderCharEmb(current_char_idx)
            
            _,(h_t,c_t)=self.charDecoder(current_char_emb,(h_t,c_t))
            s_t=self.char_output_projection(h_t)
            p_t=nn.functional.softmax(s_t,dim=2)
            current_char_idx=torch.argmax(p_t,dim=2)
           
           
            output_words=[output_words[j]+self.target_vocab.id2char[idx] for j,idx in enumerate(current_char_idx[0,:].tolist())]
        
        output_words=[output_word.split('}')[0] for output_word in output_words]
        return output_words

## Training

In [10]:
from utils import read_corpus, batch_iter
from vocab import Vocab, VocabEntry
import time
import math

Load train and dev sets from corresponding source and target corpus files using `read_corpus` (implemented in `utils`):

In [11]:
train_data_src = read_corpus('./en_es_data/train.es', source='src')
train_data_tgt = read_corpus('./en_es_data/train.en', source='tgt')
dev_data_src = read_corpus('./en_es_data/dev.es', source='src')
dev_data_tgt = read_corpus('./en_es_data/dev.en', source='tgt')
train_data = list(zip(train_data_src, train_data_tgt))
dev_data = list(zip(dev_data_src, dev_data_tgt))

In [12]:
print('There are %d samples (sentences) in the train set.' %(len(train_data_src)))
print('There are %d samples (sentences) in the dev set.' %(len(dev_data_src)))

There are 216617 samples (sentences) in the train set.
There are 851 samples (sentences) in the dev set.


Define the model's hyperparameters, Load `vocab` from vocab file which contains token-to-index look-up tables for source and target languages and initialize the model:

In [13]:
train_batch_size = 32  # batch size
clip_grad = 5.0 # gradient clipping
valid_niter = 2000 #perform validation after how many iterations
log_every = 10 # How often log (save) the model
model_save_path = 'model.bin'

vocab = Vocab.load('vocab.json')

model = NMT(embed_size=256,
            hidden_size=256,
            dropout_rate=0.3,
            vocab=vocab)

model.train()


NMT(
  (model_embeddings_source): ModelEmbeddings(
    (embeddings): Embedding(96, 50, padding_idx=0)
    (cnn): CNN(
      (conv1d): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
    )
    (highway): Highway(
      (linear_proj): Linear(in_features=256, out_features=256, bias=True)
      (linear_gate): Linear(in_features=256, out_features=256, bias=True)
      (dropout): Dropout(p=0.3)
    )
  )
  (model_embeddings_target): ModelEmbeddings(
    (embeddings): Embedding(96, 50, padding_idx=0)
    (cnn): CNN(
      (conv1d): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
    )
    (highway): Highway(
      (linear_proj): Linear(in_features=256, out_features=256, bias=True)
      (linear_gate): Linear(in_features=256, out_features=256, bias=True)
      (dropout): Dropout(p=0.3)
    )
  )
  (encoder): LSTM(256, 256, bidirectional=True)
  (decoder): LSTMCell(512, 256)
  (h_projection): Linear(in_features=512, out_features=256, bias=False)
  (c_projection): Linear(in_features=512, out_featu

Initialize model parameter using a *Unifrom* distribution ranging [-0.1,0.1]:

In [14]:
for p in model.parameters():
    p.data.uniform_(-0.1, 0.1)

We'll define `evaluate_ppl` to evaluate perplexity on dev set while training:

In [15]:
def evaluate_ppl(model, dev_data, batch_size=32):
    """ Evaluate perplexity on dev sentences
    @param model (NMT): NMT Model
    @param dev_data (list of (src_sent, tgt_sent)): list of tuples containing source and target sentence
    @param batch_size (batch size)
    @returns ppl (perplixity on dev sentences)
    """
    was_training = model.training
    model.eval()

    cum_loss = 0.
    cum_tgt_words = 0.

    # no_grad() signals backend to throw away all gradients
    with torch.no_grad():
        for src_sents, tgt_sents in batch_iter(dev_data, batch_size):
            loss = -model(src_sents, tgt_sents).sum()

            cum_loss += loss.item()
            tgt_word_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
            cum_tgt_words += tgt_word_num_to_predict

        ppl = np.exp(cum_loss / cum_tgt_words)

    if was_training:
        model.train()

    return ppl

Define `vocab_mask` which contains 1s everywhere except the 'pad' token, move the model to `GPU` and define the optimizer (I am using `Adam` optimizer with learning_rate=0.001):

In [16]:
vocab_mask = torch.ones(len(vocab.tgt))
vocab_mask[vocab.tgt['<pad>']] = 0



if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")


model = model.to(device)


optimizer = torch.optim.Adam(model.parameters(), lr= 0.001)

num_trial = 0
train_iter = patience = cum_loss = report_loss = cum_tgt_words = report_tgt_words = 0
cum_examples = report_examples = epoch = valid_num = 0
hist_valid_scores = []

max_epochs=30       ### Maximum number of training epochs
patience_iters = 5   ### wait for how many iterations to decay learning rate
max_num_trial = 5    ### terminate training after how many trials
lr_decay = 0.5       ### lr decay rate

train_time = begin_time = time.time()

## Action time!

In [None]:
while True:
    epoch += 1

    for src_sents, tgt_sents in batch_iter(train_data, batch_size=train_batch_size, shuffle=True):
        train_iter += 1

        optimizer.zero_grad()

        batch_size = len(src_sents)

        example_losses = -model(src_sents, tgt_sents) # (batch_size,)
        batch_loss = example_losses.sum()
        loss = batch_loss / batch_size

        loss.backward()

        # clip gradient
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)

        optimizer.step()

        batch_losses_val = batch_loss.item()
        report_loss += batch_losses_val
        cum_loss += batch_losses_val

        tgt_words_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
        report_tgt_words += tgt_words_num_to_predict
        cum_tgt_words += tgt_words_num_to_predict
        report_examples += batch_size
        cum_examples += batch_size

        if train_iter % log_every == 0:
            print('epoch %d, iter %d, avg. loss %.2f, avg. ppl %.2f ' \
                  'cum. examples %d, speed %.2f words/sec, time elapsed %.2f sec' % (epoch, train_iter,
                                                                                     report_loss / report_examples,
                                                                                     math.exp(report_loss / report_tgt_words),
                                                                                     cum_examples,
                                                                                     report_tgt_words / (time.time() - train_time),
                                                                                     time.time() - begin_time))

            train_time = time.time()
            report_loss = report_tgt_words = report_examples = 0.

        # perform validation
        if train_iter % valid_niter == 0:
            print('epoch %d, iter %d, cum. loss %.2f, cum. ppl %.2f cum. examples %d' % (epoch, train_iter,
                                                                                     cum_loss / cum_examples,
                                                                                     np.exp(cum_loss / cum_tgt_words),
                                                                                     cum_examples))

            cum_loss = cum_examples = cum_tgt_words = 0.
            valid_num += 1

            print('begin validation ...')

            # compute dev. ppl and bleu
            dev_ppl = evaluate_ppl(model, dev_data, batch_size=128)   # dev batch size can be a bit larger
            valid_metric = -dev_ppl

            print('validation: iter %d, dev. ppl %f' % (train_iter, dev_ppl))

            is_better = len(hist_valid_scores) == 0 or valid_metric > max(hist_valid_scores)
            hist_valid_scores.append(valid_metric)

            if is_better:
                patience = 0
                print('save currently the best model to [%s]' % model_save_path)
                model.save(model_save_path)

                # also save the optimizers' state
                torch.save(optimizer.state_dict(), model_save_path + '.optim')
            elif patience < patience_iters:
                patience += 1
                print('hit patience %d' % patience)

                if patience == patience_iters:
                    num_trial += 1
                    print('hit #%d trial' % num_trial)
                    if num_trial == max_num_trial:
                        print('early stop!')
                        exit(0)

                    # decay lr, and restore from previously best checkpoint
                    lr = optimizer.param_groups[0]['lr'] * lr_decay
                    print('load previously best model and decay learning rate to %f' % lr)

                    # load model
                    params = torch.load(model_save_path, map_location=lambda storage, loc: storage)
                    model.load_state_dict(params['state_dict'])
                    model = model.to(device)

                    print('restore parameters of the optimizers')
                    optimizer.load_state_dict(torch.load(model_save_path + '.optim'))

                    # set new lr
                    for param_group in optimizer.param_groups:
                        param_group['lr'] = lr

                    # reset patience
                    patience = 0

    if epoch == max_epochs:
        print('reached maximum number of epochs!')
        break

# Test time!

In [18]:
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
from tqdm import tqdm

First we will define `beam_search` function, which takes a list of sentences in source language and returns list of hypothesis translation for each sentence:

In [36]:
def beam_search(model: NMT, test_data_src: List[List[str]], beam_size: int, max_decoding_time_step: int) -> List[List[Hypothesis]]:
    """ Run beam search to construct hypotheses for a list of src-language sentences.
    @param model (NMT): NMT Model
    @param test_data_src (List[List[str]]): List of sentences (words) in source language, from test set.
    @param beam_size (int): beam_size (# of hypotheses to hold for a translation at every step)
    @param max_decoding_time_step (int): maximum sentence length that Beam search can produce
    @returns hypotheses (List[List[Hypothesis]]): List of Hypothesis translations for every source sentence.
    """
    was_training = model.training
    model.eval()

    hypotheses = []
    with torch.no_grad():
        for src_sent in tqdm(test_data_src, desc='Decoding', file=sys.stdout):
            example_hyps = model.beam_search(src_sent, beam_size=beam_size, max_decoding_time_step=max_decoding_time_step)

            hypotheses.append(example_hyps)

    if was_training: model.train(was_training)

    return hypotheses

Load the test data and trained model and move the model to GPU if available:

In [None]:
mkdir -p outputs
    touch outputs/test_outputs_local_q2.txt
    python run.py decode model.bin ./en_es_data/test_tiny.es ./en_es_data/test_tiny.en outputs/test_outputs_local_q2.txt 

In [34]:
test_data_src = read_corpus('./en_es_data/test.es', source='src')
test_data_tgt = read_corpus('./en_es_data/test.en', source='tgt')

model = NMT.load('model.bin',no_char_decoder=False)


if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")


model = model.to(device)

Perform *beam search* on test data sentences, we will use beam size of 5 (i.e. 5 hypotheses translations for each sentence):

In [37]:
beam_size=5
max_decoding_time_steps=70 #maximum number of decoding time steps
hypotheses = beam_search(model, test_data_src,
                         beam_size=beam_size,
                         max_decoding_time_step=max_decoding_time_steps)

Decoding: 100%|██████████| 4/4 [00:00<00:00,  5.13it/s]


Now let's take a look at one of test sentences and its translations:

In [38]:
i=0
source_sentence = ' '.join(test_data_src[i])
print('Source sentence: \n%s\n' %(source_sentence))
print('5 hypotheses translations:')
for hyp_trans in hypotheses[i]:
    hyp_trans_str=' '.join(hyp_trans.value)
    print(hyp_trans_str)

Source sentence: 
Es una historia verdadera -- cada parte de esto es verdad.

5 hypotheses translations:
It's a true story -- every bit of this is true.
It's a true story -- every bit of this is true.
It's a true -- -- every bit of this is true.
It's a true story to every bit of this is true.
I a true story -- every bit of this is true.


Calculate the **BLEU** score

In [39]:
references=test_data_tgt
top_hypotheses = [hyps[0] for hyps in hypotheses]


if references[0][0] == '<s>':
        references = [ref[1:-1] for ref in references]
bleu_score=corpus_bleu([[ref] for ref in references],
                             [hyp.value for hyp in top_hypotheses])
print('BLEU score on test data is %f' %(bleu_score))

BLEU score on test data is 0.992979
