# Project 3 (part 2): Language Translation with Neural Networks
## CS4740/5740 Fall 2021

Names:

Netids:

### Project Submission Due: November 23, 2021
Please submit the **pdf file** of this notebook on **Gradescope**, and the **ipynb** on **CMS**. For instructions on generating pdf and ipynb files, please refer to project 1 or project 2 instructions.



## Introduction
In this project we will use a **neural network**, specifically a Recurrent Neural Network (RNN), to perform neural machine translation (i.e. translating from one language into another).

The project is divided into parts. In **Part 1**, you will implement an RNN model for performing the neural machine translation. In **Part 2**, you will analyze these models in two types of comparative studies and in **Part 3** you will answer questions describing what you have learned through this project. You also will be required to submit a description of libraries used, how your group divided up the work, and your feedback regarding the assignment (**Part 4**).

The writeup for the document is linked [here](https://docs.google.com/document/d/1IWgYqS6M4G_gJowM97Bq8g5smsGOa75dutIUGjolpAI/edit).

## Advice 🚀
As always, the report is important! The report is where you get to show
that you understand not only what you are doing but also why and how you are doing it. So be clear, organized and concise; avoid vagueness and excess verbiage. Spend time doing error analysis for the models. This is how you understand the advantages and drawbacks of the systems you build. The reports should read more like the papers that we have been writing critiques for.

## Dataset
You are given access to a set of parallel sentences. In each pair, one sentence is written in modern English (the "source") and the other in Shakespearean English (the "target"). For this project, given a modern English sentence, you will need to translate it into Shakespearean English. This task is usually called (Neural) Machine Translation; we will refer to it as NMT throughout the project.

We will minimally preprocess the source/target sentences and handle tokenization in what we release. For this assignment, we do not anticipate any further preprocessing to be done by you. Should you choose to do so, it would be interesting to hear about in the report (along with whether or not it helped performance), but it is not a required aspect of the assignment.

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=True)

source_path = os.path.join(os.getcwd(), "drive", "My Drive", "CS_4740_FA21_p3", "source.txt") # replace based on your Google drive organization
target_path = os.path.join(os.getcwd(), "drive", "My Drive", "CS_4740_FA21_p3", "target.txt") # replace based on your Google drive organization
test_path = os.path.join(os.getcwd(), "drive", "My Drive", "CS_4740_FA21_p3", "source_test.txt") # replace based on your Google drive organization

Mounted at /content/drive


## Import libraries and connect to Google Drive

In [None]:
!pip install -U gensim



In [None]:
!pip3 install sentencepiece
from collections import Counter, namedtuple
from itertools import chain
import json
import math
import os
from pathlib import Path
import random
import time
import sys
from tqdm.notebook import tqdm, trange
from typing import List, Tuple, Dict, Set, Union


import gensim
import nltk
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
import numpy as np
import sentencepiece as spm
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torch.nn import init
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
import torch.nn.utils
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence




In [None]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
! pip install -qqq wandb

In [None]:
import wandb
! wandb login

[34m[1mwandb[0m: Currently logged in as: [33mchipmunkez[0m (use `wandb login --relogin` to force relogin)


In [None]:
!wandb online

W&B online, running your script from this directory will now sync to the cloud.


# Part 1: Recurrent Neural Network
Recurrent neural networks have been the workhorse of NLP for a number of years. A fundamental reason for this success is they can inherently deal with _variable_ length sequences. This is axiomatically important for natural language; words are formed from a variable number of characters, sentences from a variable number of words, paragraphs from a variable number of sentences, and so forth. This differs from a field like Computer Vision where images are (generally) of a fixed size.
<br></br>
This is also a very different scenario from that of the classifiers we have studied (e.g. Naive Bayes, Perceptron Learning, Feedforward Neural Networks), which take in a fixed-length vector.
<br></br>
To clarify this, we can think of the _types_ of the mathematical functions described by a FFNN and an RNN. What is critical to note in what follows is that k (the length of a sequence) need not be constant
across examples.

Below we define the general problem set up of FFNNs and RNNs.

$\textbf{FFNN.}$ \
$Input: \text{We have an input vector }\vec{x} \in \mathcal{R}^d$ \
$Model\text{ }Output: \text{The model has some intermediate output }\vec{z} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$Final\text{ }Output: \text{ The model outputs a vector } \vec{y} \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
$\vec{y}$ satisfies the constraint of being a probability distribution, i.e. $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}[i] = 1$ and $\underset{i \in \mid \mathcal{Y} \mid}{min} \text{ }\vec{y}[i] \geq 0$, which is achieved via _Softmax_ applied to $\vec{z}$.
<br></br>
$\textbf{RNN.}$ \
$Input: \text{The model takes as input a sequence of vectors } \vec{x}_1,\vec{x}_2, \dots, \vec{x}_k; \vec{x}_i \in \mathcal{R}^d$ \
$Model\text{ }Output: \text{The model generates some intermediate sequence output } \vec{z}_1,\vec{z}_2, \dots, \vec{z}_k; \vec{z}_i \in \mathcal{R}^{h}, \text{ where h is the hidden state size.}$ \
$Final\text{ }Output: \text{The model generates some final sequence output } \vec{y}_1, \dots, \vec{y}_k \in \mathcal{R}^{\mid \mathcal{Y}\mid}$ \
Each $\vec{y}_j$ satisfies the constraint of being a probability distribution, i.e. $\underset{i \in \mid \mathcal{Y} \mid}{\sum} \vec{y}_j[i] = 1$ and $\underset{i \in \mid \mathcal{Y} \mid}{min} \text{ }\vec{y}_j[i] \geq 0$, which is achieved by the process described later in this report and as you have seen in class.

Intuitively, an RNN takes in a sequence of vectors and computes a new vector corresponding to each vector in the original sequence. It achieves this by processing the input sequence one vector at a time to (a) compute an updated representation of the entire sequence (which is then re-used when processing the next vector in the input sequence), and (b) produce an output for the current position. The vector computed in (a) therefore contains information not only about the current input vector but also about the previous input vectors. Hence, $\vec{z}_j$ is computed after having observed $\vec{x}_1, \dots, \vec{x}_j$. As such, we can treat the last vector computed by the RNN, i.e. $\vec{z}_k$, as a representation of the entire sequence. Accordingly, we can use it as the input to a single-layer linear classifier to compute the vector $\vec{y}$ needed for classification.

$$\vec{y}_j = Softmax(W\vec{z}_j); \text{ where }W\in \mathcal{R}^{\mid \mathcal{Y}\mid \times h} \text{ is a weight matrix that is learned through training}$$
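
To make the contrast with a fixed-length classifier concrete, here is a minimal, dependency-free sketch of the recurrence. It is not the model you will implement: the update rule $z_j = \tanh(W_x x_j + W_z z_{j-1})$, the random weights, and the names `rnn_last_state`, `W_x`, and `W_z` are illustrative assumptions. The only point is that sequences of different length k map to a $\vec{z}_k$ of the same size and to a valid distribution $\vec{y}$.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols):
    """Random weights, standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def rnn_last_state(xs, W_x, W_z):
    """Run a vanilla RNN over xs and return the final state z_k."""
    z = [0.0] * len(W_z)                     # z_0: all-zero initial state
    for x in xs:                             # one update per input vector
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_z, z))]
        z = [math.tanh(p) for p in pre]
    return z

random.seed(0)
d, h, n_classes = 3, 4, 2                    # input, hidden, and output sizes
W_x, W_z, W = rand_matrix(h, d), rand_matrix(h, h), rand_matrix(n_classes, h)

short_seq = [[random.random() for _ in range(d)] for _ in range(2)]  # k = 2
long_seq = [[random.random() for _ in range(d)] for _ in range(7)]   # k = 7

for seq in (short_seq, long_seq):
    z_k = rnn_last_state(seq, W_x, W_z)      # always has length h
    y = softmax(matvec(W, z_k))              # always has length n_classes
    print(len(seq), len(z_k), [round(p, 3) for p in y])
```

The same weights handle both sequences; only the number of loop iterations changes with k.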

In Machine Translation, our goal is to convert a sentence from the source language (e.g. Modern English) to the target language (e.g. Shakespearean English). In this assignment, we will implement a sequence-to-sequence (Seq2Seq) network, to build a Neural Machine Translation (NMT) system. In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional RNN Encoder and a Unidirectional RNN Decoder. We'll recap the theoretical component here and in the modules where you are writing code, we will repeat the steps more explicitly in an algorithmic manner.

<Insert diagram here>

Given a sentence in the source language, we look up the word embeddings from an embeddings matrix, yielding $x_1,\dots, x_n$ ($x_i \in R^{e}$), where n is the length of the source sentence and e is the embedding size. We feed these embeddings to the bidirectional encoder, yielding hidden states for both the forward (→) and backward (←) RNNs. The forward and backward versions are concatenated to give hidden states $h_i^{enc}$


$$h_i^{enc} = [\overrightarrow{h_i^{enc}}; \overleftarrow{h_i^{enc}}] \text{ where }h_i^{enc} \in R^{2h}, \overrightarrow{h_i^{enc}}, \overleftarrow{h_i^{enc}} \in R^{h}$$


We then initialize the decoder’s first hidden state $h_0^{dec}$ with a linear projection of the encoder’s final hidden states, i.e. the forward RNN's state at position $n$ and the backward RNN's state at position $1$ (the last one it computes):

$$h_0^{dec} = W_h[\overrightarrow{h_n^{enc}}; \overleftarrow{h_1^{enc}}] \text{ where }h_0^{dec} \in R^{h}, W_h \in R^{h \times 2h}$$
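
The encoder description above can be traced with a small pure-Python sketch. It assumes a vanilla tanh RNN with random weights for both directions; `run_rnn`, `rand_matrix`, and `W_proj` (which plays the role of $W_h$) are hypothetical names chosen for illustration, not part of the assignment skeleton. Note which state of each direction counts as "final".

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    """Random weights, standing in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def run_rnn(xs, W_x, W_rec):
    """Return the hidden state after each input, in processing order."""
    state, states = [0.0] * len(W_rec), []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_rec, state))]
        state = [math.tanh(p) for p in pre]
        states.append(state)
    return states

rng = random.Random(0)
e, h, n = 3, 2, 5                            # embedding size, hidden size, source length
xs = [[rng.random() for _ in range(e)] for _ in range(n)]

fwd = run_rnn(xs, rand_matrix(h, e, rng), rand_matrix(h, h, rng))
# The backward RNN reads the sentence right to left; reverse its outputs
# so that bwd[i] lines up with source position i.
bwd = run_rnn(xs[::-1], rand_matrix(h, e, rng), rand_matrix(h, h, rng))[::-1]

# Per-position encoder states: list concatenation gives vectors of length 2h.
h_enc = [f + b for f, b in zip(fwd, bwd)]

# The forward RNN finishes on the last token, the backward RNN on the first.
final = fwd[-1] + bwd[0]
W_proj = rand_matrix(h, 2 * h, rng)          # shape h x 2h
h0_dec = matvec(W_proj, final)               # decoder's initial hidden state

print(len(h_enc), len(h_enc[0]), len(h0_dec))
```

In PyTorch the two directions come from a single bidirectional module, but the shapes follow the same pattern: 2h per encoder position, h for the projected decoder state.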

With the decoder initialized, we must now feed it a target sentence. On the $t^{th}$ step, we look up the embedding for the $t^{th}$ word, $y_t \in R^{e}$. We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in R^{h}$ from the previous timestep (defined below; it is simply the output of the previous step) to produce $\overline{y_t} \in R^{e+h}$. Note that for the first target word (i.e. the start token), $o_0$ is usually a zero vector (but it can be random or a learned vector as well). We then feed $\overline{y_t}$ as input to the decoder.

$$ h_t^{dec} = Decoder(\overline{y_t}, h_{t-1}^{dec})\text{ where }h_{t-1}^{dec} \in R^{h}$$

We can take the decoder hidden state $h_t^{dec}$ and pass this through a linear layer to obtain an intermediate output $v_t$. This is then passed through an activation function (like tanh) to obtain our combined-output vector $o_t$

$$v_t = W_v h_t^{dec} \text{ where } W_v \in R^{h \times h}, v_t \in R^{h}$$
$$o_t = \tanh{(v_t)} \text{ where } o_t \in R^{h}$$

Then, we produce a probability distribution $P_t$ over target words at the $t^{th}$ timestep.

$$P_t = Softmax(W_{v_{target}} o_t) \text{ where }P_t \in R^{V_{target}}, W_{v_{target}}\in R^{V_{target} \times h}$$


Here, $V_{target}$ is the size of the target vocabulary. Finally, to train the network we then compute the softmax cross entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target word at timestep t:

$$Loss(Model) = CrossEntropy(P_t, g_t)$$
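
The decoder equations above can be traced for a single timestep with the following sketch. It assumes a vanilla tanh cell for `Decoder` and random weights; `W_in` and `W_rec` (the cell's input and recurrent weights) and `decoder_step` are hypothetical names, while `W_v` and `W_vocab` stand in for $W_v$ and $W_{v_{target}}$. With a one-hot $g_t$, the cross entropy reduces to the negative log-probability of the gold word.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
e, h, V = 3, 4, 6                            # embedding, hidden, target-vocabulary sizes

def rand_matrix(rows, cols):
    """Random weights, standing in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W_in, W_rec = rand_matrix(h, e + h), rand_matrix(h, h)   # assumed cell weights
W_v, W_vocab = rand_matrix(h, h), rand_matrix(V, h)

def decoder_step(y_emb, o_prev, h_prev):
    """One decoder timestep: returns (h_t, o_t, P_t)."""
    y_bar = y_emb + o_prev                   # concatenate [y_t ; o_{t-1}], length e + h
    pre = [a + b for a, b in zip(matvec(W_in, y_bar), matvec(W_rec, h_prev))]
    h_t = [math.tanh(p) for p in pre]        # cell update
    o_t = [math.tanh(v) for v in matvec(W_v, h_t)]   # combined-output vector
    P_t = softmax(matvec(W_vocab, o_t))      # distribution over target words
    return h_t, o_t, P_t

y_emb = [rng.random() for _ in range(e)]     # embedding of the start token
o_0 = [0.0] * h                              # zero vector for the first step
h_0 = [rng.uniform(-1, 1) for _ in range(h)] # would come from the encoder projection
gold = 2                                     # index of the correct next word

h_1, o_1, P_1 = decoder_step(y_emb, o_0, h_0)
loss = -math.log(P_1[gold])                  # cross entropy against a one-hot target
print(round(sum(P_1), 6), round(loss, 4))
```

During training, `h_1` and `o_1` are fed into the next step together with the embedding of the next gold target word, and the per-step losses are summed over the sentence.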

Now that we have described the model, let’s try implementing it for Modern English to Shakespearean English translation.









### How do we evaluate NMT models?
We can evaluate these models in a few different ways. Recall in lecture that we called these encoder-decoder models "Conditional Language Models" since they condition on some prefix before generating text similar to the language models we have seen before. Therefore, we can use **perplexity** to measure the performance of our model.
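
As a quick illustration of the metric, perplexity is the exponential of the average negative log-probability that the model assigns to the gold target tokens. The sketch below assumes you have already collected those per-token probabilities; the function name `perplexity` and the example numbers are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each gold target token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

confident = [0.9, 0.8, 0.95, 0.85]           # model puts high mass on the gold tokens
uncertain = [0.2, 0.1, 0.25, 0.15]           # model spreads its mass elsewhere

print(perplexity(confident), perplexity(uncertain))
```

Lower is better: a model that assigns uniform probability 1/V to every token has perplexity exactly V.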

However, perplexity is more of an intrinsic measure, so we would also like to directly measure how close the model's output is to the reference translations. How do we do this? We can look at how well our translation _overlaps_ with the reference translation. A common metric for this is the **BLEU** (Bilingual Evaluation Understudy) metric. The approach works by counting the n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram (unigram) is a single token and a bigram is a pair of adjacent tokens. An n-gram counts as a match regardless of where in the sentence it appears. BLEU uses n-grams of size 1 to 4 in its computation.
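
The n-gram matching at the heart of BLEU can be sketched in a few lines. This shows only the clipped ("modified") n-gram precision for a single reference; full BLEU combines these precisions for n = 1 to 4 with a geometric mean and a brevity penalty, which the `nltk.translate.bleu_score` functions imported earlier in this notebook compute for you. The helper names and the example sentences are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Clip each count so a repeated n-gram is not rewarded more often
    # than it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = "thou art a villain".split()
candidate = "thou art a a villain".split()

print(modified_precision(candidate, reference, 1))  # 4 of 5 unigrams match -> 0.8
print(modified_precision(candidate, reference, 2))  # 3 of 4 bigrams match -> 0.75
```

The duplicated "a" is counted only once because of clipping, which is why the unigram precision is 0.8 rather than 1.0.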

## Part 1: Rules
**Part 1** requires implementing an RNN in PyTorch for translation. Countless blog posts, internet tutorials and other implementations available publicly (and privately) do precisely this. In fact, many students in [Cornell NLP](https://nlp.cornell.edu/people/) likely have some code for doing this or something similar on their Github. You **cannot** use any such code (though you may use anything you find in course notes or course texts) irrespective of whether you cite it or not.

Submissions will be passed through the MOSS system, which is a sophisticated system for detecting plagiarism in code and is robust in the sense that it tries to find alignments in the underlying semantics of the code and not just the surface level syntax. Similarly, the course staff are also quite astute with respect to programming neural models for NLP and we will strenuously look at your code. We flagged multiple groups for this last year, so we strongly suggest you resist any such temptation (if the Academic Integrity policy alone is insufficient at dissuading you).

## 1.1 RNN Implementation

Recall from the previous portion of this assignment as well as the PyTorch tutorial we used a `Data loader` component; we will want to use something similar here as well as a new `NMT` component. We don't envision that it will be useful to copy and modify the previous `Data loader` here. We have included some stubs to help give you a place to start for the NMT.

Additionally, we remind you that the previous assignment furnishes a near-functional implementation of a similar neural model (but for a different task). If you successfully completed the FFNN bug fixes, it will be wholly functional. Using it as a guide for Part 1 below is both prudent and suggested.







### 1.1.1 Data loading

In [None]:
Hypothesis = namedtuple('Hypothesis', ['value', 'score'])

In [None]:
def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
        The paddings should be at the end of each sentence.
    :param sents: list of sentences, where each sentence
                                    is represented as a list of words
    :type sents: list[list[str]]
    :param pad_token: padding token
    :type pad_token: str
    :returns sents_padded: list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    :rtype: list[list[str]]
    """
    sents_padded = []

    max_len = max([len(sent) for sent in sents])
    sents_padded = [(sent + ([pad_token] * (max_len - len(sent)))) for sent in sents]

    return sents_padded
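
For a feel of what padding does to a batch, here is a small standalone sketch. `pad_batch` is a hypothetical stand-in that mirrors the behaviour of `pad_sents` so the example runs on its own; the transpose at the end mimics, with plain lists, the (max_sentence_length, batch_size) layout that `to_input_tensor` returns further below.

```python
def pad_batch(sents, pad_token):
    """Right-pad every sentence to the length of the longest one."""
    max_len = max(len(sent) for sent in sents)
    return [sent + [pad_token] * (max_len - len(sent)) for sent in sents]

batch = [["to", "be", "or", "not"], ["alas"], ["good", "night"]]
padded = pad_batch(batch, "<pad>")
for sent in padded:
    print(sent)

# Transpose from (batch, max_len) to (max_len, batch): one row per time step.
time_major = [list(step) for step in zip(*padded)]
print(time_major[0])
```

Each row of `time_major` holds the tokens that the RNN sees at one time step across the whole batch.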

In [None]:
def read_corpus(file_path, source):
    """ Read file, where each sentence is delineated by a `\n`.
    :param file_path: path to file containing corpus
    :type file_path: str
    :param source: "tgt" or "src" indicating whether text
        is of the source language or target language
    :type source: str
    """
    data = []
    with open(file_path) as f:  # the file is closed once all lines are read
        for line in f:
            sent = nltk.word_tokenize(line)
            # only append <s> and </s> to the target sentence
            if source == 'tgt':
                sent = ['<s>'] + sent + ['</s>']
            data.append(sent)

    return data

In [None]:
class Vocab(object):
    """ Vocabulary, i.e. structure containing either
    src or tgt language terms.
    """
    def __init__(self, word2id=None):
        """ Init Vocab Instance.
        
        :param word2id: dictionary mapping words 2 indices
        :type word2id: dict[str, int]
        """
        if word2id:
            self.word2id = word2id
        else:
            self.word2id = dict()
            self.word2id['<pad>'] = 0   # Pad Token
            self.word2id['<s>'] = 1     # Start Token
            self.word2id['</s>'] = 2    # End Token
            self.word2id['<unk>'] = 3   # Unknown Token
        self.unk_id = self.word2id['<unk>']
        self.id2word = {v: k for k, v in self.word2id.items()}

    def __getitem__(self, word):
        """ Retrieve word's index. Return the index for the unk
        token if the word is out of vocabulary.
        
        :param word: word to look up
        :type word: str
        :returns: index of word
        :rtype: int
        """
        return self.word2id.get(word, self.unk_id)

    def __contains__(self, word):
        """ Check if word is captured by Vocab.
        
        :param word: word to look up
        :type word: str
        :returns: whether word is in vocab
        :rtype: bool
        """
        return word in self.word2id

    def __setitem__(self, key, value):
        """ Raise error, if one tries to edit the Vocab directly.
        """
        raise ValueError('vocabulary is readonly')

    def __len__(self):
        """ Compute number of words in Vocab.
        
        :returns: number of words in Vocab
        :rtype: int
        """
        return len(self.word2id)

    def __repr__(self):
        """ Representation of Vocab to be used
        when printing the object.
        """
        return 'Vocabulary[size=%d]' % len(self)

    def id2word(self, wid):
        """ Return mapping of index to word.
        
        :param wid: word index
        :type wid: int
        :returns: word corresponding to index
        :rtype: str
        """
        return self.id2word[wid]

    def add(self, word):
        """ Add word to Vocab, if it is previously unseen.
        
        :param word: to add to Vocab
        :type word: str
        :returns: index that the word has been assigned
        :rtype: int
        """
        if word not in self:
            wid = self.word2id[word] = len(self)
            self.id2word[wid] = word
            return wid
        else:
            return self[word]

    def words2indices(self, sents):
        """ Convert list of words or list of sentences of words
        into list or list of list of indices.
        
        :param sents: sentence(s) in words
        :type sents: Union[List[str], List[List[str]]]
        :returns: sentence(s) in indices
        :rtype: Union[List[int], List[List[int]]]
        """
        if type(sents[0]) == list:
            return [[self[w] for w in s] for s in sents]
        else:
            return [self[w] for w in sents]

    def indices2words(self, word_ids):
        """ Convert list of indices into words.
        
        :param word_ids: list of word ids
        :type word_ids: List[int]
        :returns: list of words
        :rtype: List[str]
        """
        return [self.id2word[w_id] for w_id in word_ids]

    def to_input_tensor(self, sents: List[List[str]], device: torch.device) -> torch.Tensor:
        """ Convert list of sentences (words) into tensor with necessary padding for 
        shorter sentences.
        
        :param sents: list of sentences (words)
        :type sents: List[List[str]]
        :param device: Device on which to load the tensor, ie. CPU or GPU
        :type device: torch.device
        :returns: Sentence tensor of (max_sentence_length, batch_size)
        :rtype: torch.Tensor
        """

        word_ids = self.words2indices(sents)
        sents_t = pad_sents(word_ids, self['<pad>'])
        sents_var = torch.tensor(sents_t, dtype=torch.long, device=device)
        return torch.t(sents_var)

    @staticmethod
    def from_corpus(corpus, size, freq_cutoff=2):
        """ Given a corpus construct a Vocab.
        
        :param corpus: corpus of text produced by read_corpus function
        :type corpus: List[List[str]]
        :param size: # of words in vocabulary
        :type size: int
        :param freq_cutoff: if word occurs n < freq_cutoff times, drop the word
        :type freq_cutoff: int
        :returns: Vocab instance produced from provided corpus
        :rtype: Vocab
        """
        vocab_entry = Vocab()
        word_freq = Counter(chain(*corpus))
        valid_words = [w for w, v in word_freq.items() if v >= freq_cutoff]
        print('number of word types: {}, number of word types w/ frequency >= {}: {}'
              .format(len(word_freq), freq_cutoff, len(valid_words)))
        top_k_words = sorted(valid_words, key=lambda w: word_freq[w], reverse=True)[:size]
        for word in top_k_words:
            vocab_entry.add(word)
        return vocab_entry
    
    @staticmethod
    def from_subword_list(subword_list):
        """Given a list of subwords, construct the Vocab.
        
        :param subword_list: list of subwords in corpus
        :type subword_list: List[str]
        :returns: Vocab instance produced from provided list
        :rtype: Vocab
        """
        vocab_entry = Vocab()
        for subword in subword_list:
            vocab_entry.add(subword)
        return vocab_entry

In [None]:
print('initialize source vocabulary ..')
src_sents = read_corpus(source_path, "src")
src = Vocab.from_corpus(src_sents, 20000, 2) # 7098, 9422

print('initialize target vocabulary ..')
tgt_sents = read_corpus(target_path, "tgt")
tgt = Vocab.from_corpus(tgt_sents, 20000, 2) # 6893, 10956

initialize source vocabulary ..
number of word types: 13252, number of word types w/ frequency >= 2: 9167
initialize target vocabulary ..
number of word types: 15216, number of word types w/ frequency >= 2: 10725


In [None]:
## YOUR CODE HERE
# Train embeddings or load embeddings
# or use other feature representation for words (e.g 1 hot encoding)
#
# We want a numpy array that has shape |V| x |embedding size| that can potentially
# be passed into our NMT model for our pretrained_source / pretrained_target
# arguments. This allows our model to start off with a good starting point and
# we can decide whether to keep our embeddings static or update them as we go.
#
# Some ideas as to what to do here are using pre-trained word embeddings from gensim
# >>> import gensim.downloader as api
# >>> model = api.load("glove-wiki-gigaword-300")  # load glove vectors
# >>> model['chicken'] # Get word vector for chicken (api.load returns KeyedVectors, which are indexed directly)
#
# OR potentially train your own new embeddings using the SkipGram algorithm discussed in lecture.
# >>> model = gensim.models.Word2Vec(sentences, min_count=1, vector_size=300, sg=1, negative=5)
# Tutorial: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
#
# OR explicitly choose to do nothing here and the embeddings are learned end-to-end during training in the NMT class

In [None]:
# Some ideas as to what to do here are using pre-trained word embeddings from gensim
# import gensim.downloader as api
# model = api.load("glove-wiki-gigaword-300")  # load glove vectors

In [None]:
# train your own new embeddings using the SkipGram algorithm
# model_src = gensim.models.Word2Vec(src_sents, min_count=1, vector_size=256, sg=1, negative=5)
# model_tgt = gensim.models.Word2Vec(tgt_sents, min_count=1, vector_size=256, sg=1, negative=5)
# One shared skip-gram model trained on source + target sentences.
model_src = gensim.models.Word2Vec(src_sents + tgt_sents, min_count=1, vector_size=256, sg=1, negative=5)
embed_dim = model_src.wv.vector_size
src_embed_vectors = torch.zeros([len(src), embed_dim])
tgt_embed_vectors = torch.zeros([len(tgt), embed_dim])

# Copy each vocabulary word's vector into the row given by its index.
# Words without a trained vector (e.g. <pad>, <unk>) keep an all-zero row.
for word, idx in src.word2id.items():
  if word in model_src.wv:
    src_embed_vectors[idx, :] = torch.from_numpy(model_src.wv[word])

# The target side uses the shared model as well: `model_tgt` is never defined
# above, and the old bare `except` hid the resulting NameError, leaving every
# target row at zero.
for word, idx in tgt.word2id.items():
  if word in model_src.wv:
    tgt_embed_vectors[idx, :] = torch.from_numpy(model_src.wv[word])

In [None]:
# Split into training and validation data
train_data_src, val_data_src, train_data_tgt, val_data_tgt = train_test_split(src_sents, tgt_sents, test_size=0.045922, random_state=42)

In [None]:
train_data = list(zip(train_data_src, train_data_tgt))
val_data = list(zip(val_data_src, val_data_tgt))

### 1.1.2 NMT Model Implementation

For the implementation below, we have given a framework / skeleton for your code. Within the skeleton are sections that define where you should place your code.

In [None]:
def generate_sent_masks(enc_hiddens: torch.Tensor, source_lengths: List[int], device: torch.device) -> torch.Tensor:
    """ Generate sentence masks for encoder hidden states.

    :param enc_hiddens: encodings of shape (b, src_len, 2*h), where b = batch size,
        src_len = max source length, h = hidden size.
    :type enc_hiddens: torch.Tensor
    :param source_lengths: List of actual lengths for each of the sentences in the batch.   
    :type source_lengths: List[int]
    :param device: Device on which to load the tensor, ie. CPU or GPU
    :type device: torch.device
    :returns: Tensor of sentence masks of shape (b, src_len),
        where src_len = max source length, h = hidden size.
    :rtype: torch.Tensor
    """
    enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
    for e_id, src_len in enumerate(source_lengths):
        enc_masks[e_id, src_len:] = 1
    return enc_masks.to(device)

In [None]:
class Encoder(nn.Module):
    def __init__(self, embed_size, hidden_size, source_embeddings, enc_type):
        """
        """
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embed_size = embed_size
        self.embedding = source_embeddings
        self.enc_type = enc_type
        ### YOUR CODE HERE (~2 Lines)
        ### TODO - Initialize the following variables:
        if self.enc_type == 'LSTM':
          self.encoder = torch.nn.LSTM(input_size=self.embed_size, hidden_size=self.hidden_size, bias=True, bidirectional=True)  # (Bidirectional LSTM with bias)
          self.c_projection = nn.Linear(in_features=2*self.hidden_size, out_features=self.hidden_size, bias=False)  # (Linear Layer with no bias), projects the final cell state
        if self.enc_type == 'RNN':
          self.encoder = torch.nn.RNN(input_size=self.embed_size, hidden_size=self.hidden_size, bias=True, bidirectional=True)  # (Bidirectional RNN with bias)
        self.h_projection = nn.Linear(in_features=2*self.hidden_size, out_features=self.hidden_size, bias=False)  # (Linear Layer with no bias), called W_{h} above.
        
        ###
        ### Note that you are free to use any architecture (vanilla RNN, LSTM, GRU)
        ### that you would like. Additionally, you are free to use any hyperparameters
        ### that you would like (e.g. number of layers). You will discuss your choice
        ### of hyperparameters in the write up later as well.
        ###
        ### Use the following docs to properly initialize these variables:
        ###     RNN:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.RNN
        ###     LSTM:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
        ###     Linear Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Linear

        
    
    def forward(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """
        """
        enc_hiddens, dec_init_state = None, None

        ### YOUR CODE HERE (~ 8 Lines)
        ### TODO:
        ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
        ###         src_len = maximum source sentence length, b = batch size, e = embedding size. Note
        ###         that there is no initial hidden state or cell for the encoder.
        X = self.embedding(source_padded)
        ###     2. Compute `enc_hiddens`, `last_hidden` by applying the encoder to `X`.
        ###         - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
        ###           (With its default `enforce_sorted=True`, the batch must be sorted by decreasing length.)
        packed = pack_padded_sequence(X, source_lengths)
        enc_hiddens, last_hidden = self.encoder(packed)
        ###         - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
        pad, _ = pad_packed_sequence(enc_hiddens, batch_first=False)
        ###         - Note that the shape of the tensor returned by the encoder is (src_len, b, h*2) and we want to
        ###           return a tensor of shape (b, src_len, h*2) as `enc_hiddens`.
        enc_hiddens = torch.permute(pad, (1, 0, 2))
        ###     3. Compute `dec_init_state` = init_decoder_hidden:
        ###         - `init_decoder_hidden`:
        ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forward and backwards.
        ###             Concatenate the forward and backward tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
        ###             This is h_0^{dec} in above in the writeup. Here b = batch size, h = hidden size
        ###
        
        if self.enc_type == 'LSTM':
          # For an LSTM, `last_hidden` is a (hidden, cell) tuple, each of shape (2, b, h).
          last_hidden_hid, last_cell = last_hidden
          # Concatenate the forward ([0]) and backward ([1]) final states -> (b, 2*h).
          last_hidden_interm = torch.cat((last_hidden_hid[0], last_hidden_hid[1]), 1)
          last_cell_interm = torch.cat((last_cell[0], last_cell[1]), 1)
          init_decoder_hidden = self.h_projection(last_hidden_interm)
          init_decoder_cell = self.c_projection(last_cell_interm)
          dec_init_state = (init_decoder_hidden, init_decoder_cell)

        if self.enc_type == 'RNN':
          # For a vanilla RNN, `last_hidden` is a single tensor of shape (2, b, h).
          last_hidden_interm = torch.cat((last_hidden[0], last_hidden[1]), 1)
          init_decoder_hidden = self.h_projection(last_hidden_interm)
          dec_init_state = init_decoder_hidden
        ### See the following docs, as you may need to use some of the following functions in your implementation:
        ###     Pack the padded sequence X before passing to the encoder:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence
        ###     Pad the packed sequence, enc_hiddens, returned by the encoder:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_packed_sequence
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tensor Permute:
        ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute
        

        ### END YOUR CODE

        return enc_hiddens, dec_init_state

In [None]:
class Decoder(nn.Module):
    def __init__(self, embed_size, hidden_size, target_embedding, device, dec_type):
        """
        """
        super(Decoder, self).__init__()
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.device = device
        self.embedding = target_embedding
        output_vocab_size = self.embedding.weight.size(0)
        self.softmax = nn.Softmax(dim=1)
        self.activation = nn.Tanh()
        self.dec_type = dec_type

        ### YOUR CODE HERE (~3 lines)
        if self.dec_type == 'LSTM':
          self.decoder = torch.nn.LSTMCell(input_size=self.embed_size+self.hidden_size, hidden_size=self.hidden_size, bias=True)  # (LSTM Cell with bias)
        if self.dec_type == 'RNN':
          self.decoder = torch.nn.RNNCell(input_size=self.embed_size+self.hidden_size, hidden_size=self.hidden_size, bias=True, nonlinearity='tanh')  # (RNN Cell with bias)
        
        self.combined_output_projection = nn.Linear(in_features=self.hidden_size, out_features=self.hidden_size, bias=False)  # (Linear Layer with no bias), called W_{v} above.
        self.target_vocab_projection = nn.Linear(in_features=self.hidden_size, out_features=output_vocab_size)  # called W_{v_{target}} above; note that this keeps nn.Linear's default bias=True
        ###     self.combined_output_projection (Linear Layer with no bias), called W_{v} above.
        ###     self.target_vocab_projection (Linear Layer with no bias), called W_{v_{target}} above.
        ###
        ### Use the following docs to properly initialize these variables:
        ###     RNN Cell:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.RNNCell
        ###     LSTM Cell:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
        ###     Linear Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Linear

 

    
    def forward(self, enc_hiddens: torch.Tensor,
                dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
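        """ Run the decoder over a batch of target sentences with teacher forcing.

        :param enc_hiddens: Encoder hidden states with shape (b, src_len, h*2)
        :param dec_init_state: Initial decoder state (a tensor for an RNN, a (hidden, cell) tuple for an LSTM)
        :param target_padded: Gold-standard padded target sentences with shape (tgt_len, b)
        :returns combined_outputs: Combined output tensor with shape (tgt_len - 1, b, h)
        """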
        # Chop off the <END> token for max length sentences.
        target_padded = target_padded[:-1]

        dec_state = dec_init_state

        # Initialize previous combined output vector o_{t-1} as zero
        # print('enc_hidden_size', enc_hiddens.size())
        batch_size = enc_hiddens.size(0)
        # print('batch_size', batch_size)
        # print('hidden_size', hidden_size)
        o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

        # Initialize a list we will use to collect the combined output o_t on each step
        combined_outputs = []

        ### YOUR CODE HERE (~9 Lines)
        ### TODO:
        ###     1. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
        ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
        ###     2. Use the torch.split function to iterate over the time dimension of Y.
 
        ###         Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
        ###             - Squeeze Y_t into a tensor of dimension (b, e). 
        ###             - Construct Ybar_t by concatenating Y_t with o_prev on their last dimension
        ###             - Use the step function to compute the Decoder's next (cell, state) values
        ###               as well as the new combined output o_t.
        ###             - Append o_t to combined_outputs
        ###             - Update o_prev to the new o_t.
        ###     3. Use torch.stack to convert combined_outputs from a list length tgt_len of
        ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
        ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
        ###
        ### Note:
        ###    - When using the squeeze() function make sure to specify the dimension you want to squeeze
        ###      over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
        ###   
        ### You may find some of these functions useful:
        ###     Zeros Tensor:
        ###         https://pytorch.org/docs/stable/torch.html#torch.zeros
        ###     Tensor Splitting (iteration):
        ###         https://pytorch.org/docs/stable/torch.html#torch.split
        ###     Tensor Dimension Squeezing:
        ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tensor Stacking:
        ###         https://pytorch.org/docs/stable/torch.html#torch.stack
        
        Y = self.embedding(target_padded)
       # print("target_padded", target_padded.size())
       # print("Y.size", Y.size())

        for Y_t in torch.split(Y, 1):
         # print("Y_t", Y_t.size())
          Y_t = torch.squeeze(Y_t, 0) # (1, b, e) -> (b, e)
         
          Ybar_t = torch.cat((Y_t, o_prev), 1)
          dec_state, o_t = self.step(Ybar_t, dec_state, enc_hiddens)
          combined_outputs.append(o_t)
          o_prev = o_t
        combined_outputs = torch.stack(combined_outputs)

        ### END YOUR CODE

        return combined_outputs
    
    def step(self, Ybar_t: torch.Tensor,
            dec_state: Tuple[torch.Tensor, torch.Tensor],
            enc_hiddens: torch.Tensor) -> Tuple[Tuple, torch.Tensor]:
        """ Compute one forward step of the RNN/LSTM decoder.

        :param Ybar_t: Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
                                where b = batch size, e = embedding size, h = hidden size.
        :type Ybar_t: torch.Tensor
        :param dec_state: Tensors with shape (b, h), where b = batch size, h = hidden size.
                Tensor is decoder's prev hidden state
        :type dec_state: torch.Tensor
        :param enc_hiddens: Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
                                    src_len = maximum source length, h = hidden size.
        :type enc_hiddens: torch.Tensor

        :returns dec_state: Tensors with shape (b, h), where b = batch size, h = hidden size.
                Tensor is decoder's new hidden state. For an LSTM, this should be a tuple
                of the hidden state and cell state.
        :returns combined_output: Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
        """

        combined_output = None

        ### YOUR CODE HERE (~2 Lines)
        ### TODO:
        ###     1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
        ###     2. Rename dec_state to dec_hidden
        ###
        ###       Hints:
        ###         - dec_hidden is shape (b, h) and corresponds to h^dec_t above
        ###
       
        # Ybar_t_dec_state = torch.cat((Ybar_t, dec_state), 1)
        dec_hidden = self.decoder(Ybar_t, dec_state)

        ### END YOUR CODE
        
        ### YOUR CODE HERE (~2 Lines)
        ### TODO:
        ###     1. Apply the combined output projection layer to h^dec_t to compute tensor V_t
        ###     2. Compute tensor O_t by applying the Tanh function.
        ###
        if self.dec_type == 'LSTM':
          # LSTMCell returns (h_t, c_t); project the hidden state h_t only.
          V_t = self.combined_output_projection(dec_hidden[0])
        elif self.dec_type == 'RNN':
          V_t = self.combined_output_projection(dec_hidden)
        O_t = self.activation(V_t)

        ### Use the following docs to implement this functionality:
        ###     Softmax:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.functional.softmax
        ###     Batch Multiplication:
        ###        https://pytorch.org/docs/stable/torch.html#torch.bmm
        ###     Tensor View:
        ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tanh:
        ###         https://pytorch.org/docs/stable/torch.html#torch.tanh


        ### END YOUR CODE
        dec_state = dec_hidden
        combined_output = O_t
        return dec_state, combined_output
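
The `dec_hidden[0]` indexing in `step` above follows from how the two cell types return their state. The sketch below illustrates this with hypothetical toy sizes; it is not part of the assignment code.

```python
import torch

# Hypothetical toy sizes for illustration only.
batch_size, embed_size, hidden_size = 2, 3, 4
Ybar_t = torch.randn(batch_size, embed_size + hidden_size)   # (b, e + h)

rnn_cell = torch.nn.RNNCell(embed_size + hidden_size, hidden_size)
lstm_cell = torch.nn.LSTMCell(embed_size + hidden_size, hidden_size)

h0 = torch.zeros(batch_size, hidden_size)
c0 = torch.zeros(batch_size, hidden_size)

rnn_state = rnn_cell(Ybar_t, h0)            # a single tensor h_t of shape (b, h)
lstm_state = lstm_cell(Ybar_t, (h0, c0))    # a tuple (h_t, c_t), each of shape (b, h)

print(type(rnn_state).__name__, tuple(rnn_state.shape))
print(type(lstm_state).__name__, tuple(lstm_state[0].shape), tuple(lstm_state[1].shape))
```

Because the LSTM state is a tuple, any code that projects or indexes the hidden state has to pick out its first element, whereas the RNN state can be used directly.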

In [None]:
class NMT(nn.Module):
    """ Simple Neural Machine Translation Model:
        - Bidirectional RNN Encoder
        - Unidirectional RNN Decoder
    """
    def __init__(self, embed_size, hidden_size, src_vocab, tgt_vocab, device=torch.device("cpu"), pretrained_source=None,pretrained_target=None,LSTM_RNN='RNN',):
        """ Init NMT Model.

        :param embed_size: Embedding size (dimensionality)
        :type embed_size: int
        :param hidden_size: Hidden Size, the size of hidden states (dimensionality)
        :type hidden_size: int
        :param src_vocab: Vocabulary object containing src language
        :type src_vocab: Vocab
        :param tgt_vocab: Vocabulary object containing tgt language
        :type tgt_vocab: Vocab
        :param device: torch device to put all modules on
        :type device: torch.device
        :param pretrained_source: Matrix of pre-trained source word embeddings
        :type pretrained_source: Optional[torch.Tensor]
        :param pretrained_target: Matrix of pre-trained target word embeddings
        :type pretrained_target: Optional[torch.Tensor]
        """
        super(NMT, self).__init__()
        self.device=device
        self.embed_size = embed_size
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        src_pad_token_idx = src_vocab['<pad>']
        tgt_pad_token_idx = tgt_vocab['<pad>']
        self.source_embedding = nn.Embedding(len(src_vocab), embed_size, padding_idx=src_pad_token_idx)
        self.target_embedding = nn.Embedding(len(tgt_vocab), embed_size, padding_idx=tgt_pad_token_idx)
        self.LSTM_RNN = LSTM_RNN
        with torch.no_grad():
            if pretrained_source is not None:
                self.source_embedding.weight.data = pretrained_source
                # TODO: Decide if we want the embeddings to update as we train
                self.source_embedding.weight.requires_grad = True #False
        
            if pretrained_target is not None:
                self.target_embedding.weight.data = pretrained_target
                # TODO: Decide if we want the embeddings to update as we train
                self.target_embedding.weight.requires_grad = True # False
        
        self.hidden_size = hidden_size

        self.encoder = Encoder(
            embed_size=embed_size,
            hidden_size=hidden_size,
            source_embeddings=self.source_embedding,
            enc_type = self.LSTM_RNN,
        )
        self.decoder = Decoder(
            embed_size=embed_size,
            hidden_size=hidden_size,
            target_embedding=self.target_embedding,
            device=self.device,
            dec_type = self.LSTM_RNN,
        )


    def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        :param source: list of source sentence tokens
        :type source: List[List[str]]
        :param target: list of target sentence tokens, wrapped by `<s>` and `</s>`
        :type target: List[List[str]]
        :returns scores: a variable/tensor of shape (b, ) representing the
                                    log-likelihood of generating the gold-standard target sentence for
                                    each example in the input batch. Here b = batch size.
        :rtype: torch.Tensor
        """
        # Compute sentence lengths
        source_lengths = [len(s) for s in source]

        # Convert list of lists into tensors
        source_padded = self.src_vocab.to_input_tensor(source, device=self.device)   # Tensor: (src_len, b)
        target_padded = self.tgt_vocab.to_input_tensor(target, device=self.device)   # Tensor: (tgt_len, b)
        
        ###     Run the network forward:
        ###     1. Apply the encoder to `source_padded` by calling `self.encode()`
        ###     2. Generate sentence masks for `source_padded` by calling `self.generate_sent_masks()`
        ###     3. Apply the decoder to compute combined-output by calling `self.decode()`
        ###     4. Compute log probability distribution over the target vocabulary using the
        ###        combined_outputs returned by the `self.decode()` function.

        enc_hiddens, dec_init_state = self.encode(source_padded, source_lengths)
        enc_masks = generate_sent_masks(enc_hiddens, source_lengths, self.device)
        combined_outputs = self.decode(enc_hiddens, dec_init_state, target_padded)
        P = F.log_softmax(self.decoder.target_vocab_projection(combined_outputs), dim=-1)

        # Zero out, probabilities for which we have nothing in the target text
        target_masks = (target_padded != self.tgt_vocab['<pad>']).float()
        
        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(-1) * target_masks[1:]
        scores = target_gold_words_log_prob.sum(dim=0)
        return scores


    def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """ Apply the encoder to source sentences to obtain encoder hidden states.
            Additionally, take the final states of the encoder and project them to obtain initial states for decoder.

        :param source_padded: Tensor of padded source sentences with shape (src_len, b), where
            b = batch_size, src_len = maximum source sentence length. Note that these have
            already been sorted in order of longest to shortest sentence.
        :type source_padded: torch.Tensor
        :param source_lengths: List of actual lengths for each of the source sentences in the batch
        :type source_lengths: List[int]
        :returns: Tuple of two items. The first is Tensor of hidden units with shape (b, src_len, h*2),
            where b = batch size, src_len = maximum source sentence length, h = hidden size. The second is
            Tuple of tensors representing the decoder's initial hidden state and cell.
        :rtype: Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
        """
        return self.encoder(source_padded, source_lengths)


    def decode(self, enc_hiddens: torch.Tensor,
                dec_init_state: torch.Tensor, target_padded: torch.Tensor) -> torch.Tensor:
        """Compute combined output vectors for a batch.

        :param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
                                     b = batch size, src_len = maximum source sentence length, h = hidden size.
        :param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
        :param target_padded: Gold-standard padded target sentences (tgt_len, b), where
                                       tgt_len = maximum target sentence length, b = batch size. 

        :returns combined_outputs: combined output tensor  (tgt_len, b,  h), where
                                    tgt_len = maximum target sentence length, b = batch_size,  h = hidden size
        :rtype: torch.Tensor
        """
        return self.decoder(enc_hiddens, dec_init_state, target_padded)

    def beam_search(self, src_sent: List[str], beam_size: int=5, max_decoding_time_step: int=70) -> List[Hypothesis]:
        """ Given a single source sentence, perform beam search, yielding translations in the target language.
        :param src_sent: a single source sentence (words)
        :type src_sent: List[str]
        :param beam_size: beam size
        :type beam_size: int
        :param max_decoding_time_step: maximum number of time steps to unroll the decoding RNN
        :type max_decoding_time_step: int
        :returns hypotheses: a list of hypotheses; each hypothesis has two fields:
                value: List[str]: the decoded target sentence, represented as a list of words
                score: float: the log-likelihood of the target sentence
        :rtype: List[Hypothesis]
        """
        src_sents_var = self.src_vocab.to_input_tensor([src_sent], self.device)

        src_encodings, dec_init_vec = self.encode(src_sents_var, [len(src_sent)])

        h_tm1 = dec_init_vec
        att_tm1 = torch.zeros(1, self.hidden_size, device=self.device)

        eos_id = self.tgt_vocab['</s>']

        hypotheses = [['<s>']]
        hyp_scores = torch.zeros(len(hypotheses), dtype=torch.float, device=self.device)
        completed_hypotheses = []

        t = 0
        while len(completed_hypotheses) < beam_size and t < max_decoding_time_step:
            t += 1
            hyp_num = len(hypotheses)

            exp_src_encodings = src_encodings.expand(hyp_num,
                                                     src_encodings.size(1),
                                                     src_encodings.size(2))

            y_tm1 = torch.tensor([self.tgt_vocab[hyp[-1]] for hyp in hypotheses], dtype=torch.long, device=self.device)
            y_t_embed = self.target_embedding(y_tm1)

            x = torch.cat([y_t_embed, att_tm1], dim=-1)

            h_t, att_t = self.decoder.step(x, h_tm1,
                                exp_src_encodings)
            
            # An LSTM state is a (hidden, cell) tuple; unpack it so both parts can be indexed below.
            if self.LSTM_RNN == 'LSTM':
              h_t, c_t = h_t

            # log probabilities over target words
            log_p_t = F.log_softmax(self.decoder.target_vocab_projection(att_t), dim=-1)

            live_hyp_num = beam_size - len(completed_hypotheses)
            continuing_hyp_scores = (hyp_scores.unsqueeze(1).expand_as(log_p_t) + log_p_t).view(-1)
            top_cand_hyp_scores, top_cand_hyp_pos = torch.topk(continuing_hyp_scores, k=live_hyp_num)

            prev_hyp_ids = torch.div(top_cand_hyp_pos, len(self.tgt_vocab), rounding_mode='floor')
            hyp_word_ids = top_cand_hyp_pos % len(self.tgt_vocab)

            new_hypotheses = []
            live_hyp_ids = []
            new_hyp_scores = []

            for prev_hyp_id, hyp_word_id, cand_new_hyp_score in zip(prev_hyp_ids, hyp_word_ids, top_cand_hyp_scores):
                prev_hyp_id = prev_hyp_id.item()
                hyp_word_id = hyp_word_id.item()
                cand_new_hyp_score = cand_new_hyp_score.item()

                hyp_word = self.tgt_vocab.id2word[hyp_word_id]
                new_hyp_sent = hypotheses[prev_hyp_id] + [hyp_word]
                if hyp_word == '</s>':
                    completed_hypotheses.append(Hypothesis(value=new_hyp_sent[1:-1],
                                                           score=cand_new_hyp_score))
                else:
                    new_hypotheses.append(new_hyp_sent)
                    live_hyp_ids.append(prev_hyp_id)
                    new_hyp_scores.append(cand_new_hyp_score)

            if len(completed_hypotheses) == beam_size:
                break

            live_hyp_ids = torch.tensor(live_hyp_ids, dtype=torch.long, device=self.device)
            if self.LSTM_RNN == 'RNN':
              h_tm1 = h_t[live_hyp_ids]
            elif self.LSTM_RNN == 'LSTM':
              # Carry forward both the hidden and cell states of the surviving hypotheses.
              h_tm1 = h_t[live_hyp_ids], c_t[live_hyp_ids]
            att_tm1 = att_t[live_hyp_ids]

            hypotheses = new_hypotheses
            hyp_scores = torch.tensor(new_hyp_scores, dtype=torch.float, device=self.device)

        if len(completed_hypotheses) == 0:
            completed_hypotheses.append(Hypothesis(value=hypotheses[0][1:],
                                                   score=hyp_scores[0].item()))

        completed_hypotheses.sort(key=lambda hyp: hyp.score, reverse=True)

        return completed_hypotheses


    def greedy(self, src_sent: List[str], max_decoding_time_step: int=70) -> List[Hypothesis]:
        return self.beam_search(src_sent, beam_size=1, max_decoding_time_step=max_decoding_time_step)


    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        @param model_path (str): path to model
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = NMT(
            src_vocab=params['vocab']['source'],
            tgt_vocab=params['vocab']['target'],
            **args
        )
        model.load_state_dict(params['state_dict'])

        return model

    def save(self, path: str):
        """ Save the model to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)

        params = {
            # Save the cell type too, so that load() rebuilds the same architecture.
            'args': dict(embed_size=self.embed_size, hidden_size=self.hidden_size, LSTM_RNN=self.LSTM_RNN),
            'vocab': dict(source=self.src_vocab, target=self.tgt_vocab),
            'state_dict': self.state_dict()
        }

        torch.save(params, path)
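
In `beam_search` above, the scores of all (hypothesis, next word) pairs are flattened into one vector before the top candidates are picked, and each flat position is then split back into a hypothesis id and a word id. The following is a minimal pure-Python sketch of that index arithmetic; the scores are hypothetical, and `sorted` stands in for `torch.topk`.

```python
# Hypothetical toy scores: 2 live hypotheses, vocabulary of 4 words.
vocab_size = 4
scores = [
    [-1.0, -0.2, -3.0, -2.5],   # continuations of hypothesis 0
    [-0.9, -4.0, -0.1, -5.0],   # continuations of hypothesis 1
]

# Flatten row by row, as .view(-1) does for a (hyp_num, vocab_size) tensor.
flat = [s for row in scores for s in row]

# Take the k best flat positions (torch.topk plays this role in beam_search).
k = 3
top_positions = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k]

# Recover (previous hypothesis id, word id) from each flat position:
# floor division gives the row, the remainder gives the column.
candidates = [divmod(pos, vocab_size) for pos in top_positions]
print(candidates)  # [(1, 2), (0, 1), (1, 0)]
```

This mirrors the `torch.div(..., rounding_mode='floor')` and `%` lines in the method: the quotient identifies which hypothesis is being extended and the remainder identifies the word appended to it.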

In [None]:
def batch_iter(data, batch_size, shuffle=False):
    """ Yield batches of source and target sentences reverse sorted by length (largest to smallest).
    :param data: list of tuples containing source and target sentence. i.e.
        (list of (src_sent, tgt_sent))
    :type data: List[Tuple[List[str], List[str]]]
    :param batch_size: batch size
    :type batch_size: int
    :param shuffle: whether to randomly shuffle the dataset
    :type shuffle: boolean
    """
    batch_num = math.ceil(len(data) / batch_size)
    index_array = list(range(len(data)))

    if shuffle:
        np.random.shuffle(index_array)

    for i in range(batch_num):
        indices = index_array[i * batch_size: (i + 1) * batch_size]
        examples = [data[idx] for idx in indices]

        examples = sorted(examples, key=lambda e: len(e[0]), reverse=True)
        src_sents = [e[0] for e in examples]
        tgt_sents = [e[1] for e in examples]

        yield src_sents, tgt_sents

In [None]:
def evaluate_ppl(model, val_data, batch_size=32):
    """ Evaluate perplexity on dev sentences
    :param model: NMT Model
    :type model: NMT
    :param val_data: list of tuples containing source and target sentence.
        i.e. (list of (src_sent, tgt_sent))
    :type val_data: List[Tuple[List[str], List[str]]]
    :param batch_size: size of batches to extract
    :type batch_size: int
    :returns ppl: perplexity on val sentences
    """
    was_training = model.training
    model.eval()

    cum_loss = 0.
    cum_tgt_words = 0.

    # no_grad() disables gradient tracking, so no computation graph is built during evaluation
    with torch.no_grad():
        for src_sents, tgt_sents in batch_iter(val_data, batch_size):
            loss = -model(src_sents, tgt_sents).sum()

            cum_loss += loss.item()
            tgt_word_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
            cum_tgt_words += tgt_word_num_to_predict

        ppl = np.exp(cum_loss / cum_tgt_words)
        avg_val_loss = cum_loss/len(val_data)

        wandb.log({
            "avg. val loss": avg_val_loss,
            "avg. val perplexity": ppl
        })

    if was_training:
        model.train()

    return ppl


def compute_corpus_level_bleu_score(references: List[List[str]], hypotheses: List[Hypothesis]) -> float:
    """ Given decoding results and reference sentences, compute corpus-level BLEU score.
    :param references: a list of gold-standard reference target sentences
    :type references: List[List[str]]
    :param hypotheses: a list of hypotheses, one for each reference
    :type hypotheses: List[Hypothesis]
    :returns bleu_score: corpus-level BLEU score
    """
    if references[0][0] == '<s>':
        references = [ref[1:-1] for ref in references]
    bleu_score = corpus_bleu([[ref] for ref in references],
                             [hyp.value for hyp in hypotheses])
    return bleu_score


def evaluate_bleu(references, model, source):
    """Generate decoding results and compute BLEU score.
    :param model: NMT Model
    :type model: NMT
    :param references: a list of gold-standard reference target sentences
    :type references: List[List[str]]
    :param source: a list of source sentences
    :type source: List[List[str]]
    :returns bleu_score: corpus-level BLEU score
    """
    with torch.no_grad():
        top_hypotheses = []
        for s in tqdm(source, leave=False):
            hyps = model.beam_search(s, beam_size=16, max_decoding_time_step=(len(s)+10))
            top_hypotheses.append(hyps[0])
    
    s1 = compute_corpus_level_bleu_score(references, top_hypotheses)
    
    return s1
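
The perplexity returned by `evaluate_ppl` above is the exponential of the average negative log-likelihood per predicted target word. A small worked example, using hypothetical batch totals rather than real model output:

```python
import math

# Hypothetical summed negative log-likelihoods (natural log) for two batches,
# together with the number of target words predicted in each batch
# (the leading <s> is not counted, matching evaluate_ppl).
batch_nll = [46.05, 23.03]
batch_tgt_words = [20, 10]

cum_loss = sum(batch_nll)
cum_tgt_words = sum(batch_tgt_words)

avg_nll_per_word = cum_loss / cum_tgt_words
ppl = math.exp(avg_nll_per_word)
print(round(avg_nll_per_word, 4), round(ppl, 2))
```

Here the average per-word loss is close to ln 10, so the perplexity comes out close to 10: the model is about as uncertain as if it were choosing uniformly among ten words at each step.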

In [None]:
def train_and_evaluate(model, train_data, val_data, optimizer, epochs=10, train_batch_size=32, clip_grad=2, log_every = 100, valid_niter = 500, model_save_path="NMT_model.ckpt"):
    num_trial = 0
    cum_examples = report_examples = epoch = valid_num = 0
    hist_valid_scores = []
    train_iter = patience = cum_loss = report_loss = cum_tgt_words = report_tgt_words = 0

    print('Begin Maximum Likelihood training')
    train_time = begin_time = time.time()

    val_data_tgt = [tgt for _, tgt in val_data]
    val_data_src = [src for src, _ in val_data]

    for epoch in tqdm(range(epochs)):
        wandb.log({"epoch": epoch})
        for src_sents, tgt_sents in batch_iter(train_data, batch_size=train_batch_size, shuffle=True):
            train_iter += 1
            
            optimizer.zero_grad()
            
            batch_size = len(src_sents)
            
            example_losses = -model(src_sents, tgt_sents)
            batch_loss = example_losses.sum()
            loss = batch_loss / batch_size
            loss.backward()
            
            # clip gradient
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
            
            optimizer.step()
            
            batch_losses_val = batch_loss.item()
            report_loss += batch_losses_val
            cum_loss += batch_losses_val
            
            tgt_words_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
            report_tgt_words += tgt_words_num_to_predict
            cum_tgt_words += tgt_words_num_to_predict
            report_examples += batch_size
            cum_examples += batch_size

            if train_iter % log_every == 0:
                print('epoch %d, iter %d, avg. loss %.2f, avg. ppl %.2f ' \
                        'cum. examples %d, speed %.2f words/sec, time elapsed %.2f sec' % (epoch, train_iter,
                                                                                            report_loss / report_examples,
                                                                                            math.exp(report_loss / report_tgt_words),
                                                                                            cum_examples,
                                                                                            report_tgt_words / (time.time() - train_time),
                                                                                            time.time() - begin_time))

                wandb.log({
                    "avg. train loss": (report_loss/report_examples),
                    "avg. train perplexity": math.exp(report_loss / report_tgt_words)
                })                                                                            
                train_time = time.time()
                report_loss = report_tgt_words = report_examples = 0.

                

            # perform validation
            if train_iter % valid_niter == 0:
                print('epoch %d, iter %d, cum. loss %.2f, cum. ppl %.2f cum. examples %d' % (epoch, train_iter,
                                                                                            cum_loss / cum_examples,
                                                                                            np.exp(cum_loss / cum_tgt_words),
                                                                                            cum_examples))
                

                cum_loss = cum_examples = cum_tgt_words = 0.
                valid_num += 1

                print('begin validation ...')

                # compute dev. ppl and bleu
                dev_ppl = evaluate_ppl(model, val_data, batch_size=128)   # dev batch size can be a bit larger
                valid_metric = -dev_ppl
                
                bleu_score = evaluate_bleu(val_data_tgt, model, val_data_src)*100

                print('validation: iter %d, dev. ppl %f, bleu_score %f' % (train_iter, dev_ppl, bleu_score))
                wandb.log({
                    "bleu_score": bleu_score
                })
                is_better = len(hist_valid_scores) == 0 or bleu_score > max(hist_valid_scores)
                hist_valid_scores.append(bleu_score)

                if is_better:
                    print('save currently the best model to [%s]' % model_save_path)
                    model.save(model_save_path)

                    # also save the optimizers' state
                    torch.save(optimizer.state_dict(), model_save_path + '.optim')
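
The gradient clipping step in `train_and_evaluate` above rescales the gradients whenever their overall norm exceeds `clip_grad`. A minimal sketch with a hand-set gradient (the parameter and its gradient are hypothetical, chosen so the norm is exactly 5):

```python
import torch

# A single parameter with a hand-set gradient of norm 5 (3-4-5 triangle).
param = torch.nn.Parameter(torch.zeros(2))
param.grad = torch.tensor([3.0, 4.0])

clip_grad = 2.0
# Returns the total norm measured before clipping and rescales grads in place.
total_norm = torch.nn.utils.clip_grad_norm_([param], clip_grad)

print(float(total_norm))          # norm before clipping
print(float(param.grad.norm()))   # norm after clipping, approximately clip_grad
```

The direction of the gradient is preserved; only its length is reduced, which limits the size of any single update without changing where it points.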


In [None]:
embed_size = 256
hidden_size = 512
src_vocab = src
tgt_vocab = tgt

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [None]:
wandb.config = {
    "embed_size": 512,
    "hidden_size": 512,
    "epochs": 30,
    "train_batch_size": 256, 
    "clip_grad":2,
    "lr": 1e-3
}
config = wandb.config

In [None]:
epochs = 70
train_batch_size = 128
clip_grad = 2
log_every = 100
valid_niter = 500
model_save_path = "NMT_model.ckpt"

In [None]:
# len(train_data)/config["train_batch_size"]

In [None]:
model = NMT(
    config["embed_size"],
    config["hidden_size"],
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None, # src_embed_vectors,
    pretrained_target=None # tgt_embed_vectors,
)
# model = baseline_nmt
model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

In [None]:
tgt_embed_vectors.size()

torch.Size([10727, 256])

In [None]:
# Define each of the variables then you can run this command!
wandb.init(project = 'nlp-p3-demo', entity = "chipmunkez", reinit = True)
wandb.watch(model) #not necessary but will help you track gradients
train_and_evaluate(
    model,
    train_data,
    val_data,
    optimizer,
    config["epochs"],
    config["train_batch_size"],
    config["clip_grad"],
    log_every,
    valid_niter,
    model_save_path
) 


In [None]:
# import torch
# torch.cuda.empty_cache()

In [None]:
# import gc
# del variables
# gc.collect()

## 1.2 Part 2 Report
For Section 1, your report should have a description of each major step of implementing the RNN accompanied by the associated code-snippet. Each step should have an explanation for why you decided to do something (when one could reasonably accomplish the same step in a different way); your justification should not be based on empirical results in this section but should relate to something we said in class, something mentioned in any of the course texts, or some other source (i.e. literature in NLP or official PyTorch documentation). **Unjustified, vague, and/or under-substantiated explanations will not receive credit.** As a reminder, the template for the write up is linked [here](https://docs.google.com/document/d/1IWgYqS6M4G_gJowM97Bq8g5smsGOa75dutIUGjolpAI/edit).

Things to include:

1. _Representation_ \
Each $\vec{x}_i$ needs to be produced in some way and should correspond to word $i$ in the text. This is different from the text classification approaches we have studied previously (BoW for example) where the entire document is represented with a single vector. Where and how is this being done for the RNN?

2. _Initialization_ \
There will be weights that you update in training the RNN. Where and how are these initialized?

3. _Training_ \
You are given the entire training set of N examples. How do you make use of this training set? How does the model modify its weights in training (this likely entails somewhere where gradients are computed and somewhere else where these gradients are used to update the model)? Note: This is code you may not have written but that we have written for you!

4. _Model_ \
This is the core model code, i.e., where and how you apply the RNN to the $\vec{x}_i$.


5. _Stopping_ \
How does your training procedure terminate? Note: This is code you may not have written but that we have written for you!

6. _Hyperparameters_ \
To run your model, you must fix some hyperparameters, such as $h$ (the hidden dimensionality of the $\vec{z}_i$ referenced above). Be sure to exhaustively describe these hyperparameters and why you set them as you did (this almost certainly will require some brief exploration: we suggest the course text by Yoav Goldberg as well as possibly the official PyTorch documentation). Be sure to accurately cite either source.
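
To make items 1 and 4 concrete, here is a toy, pure-Python sketch of the difference between a single bag-of-words vector and one vector $\vec{x}_i$ per token, followed by a simple Elman-style recurrence over those vectors. It is not the assignment's PyTorch code: the vocabulary, the embedding table, the weights, and the names `bag_of_words`, `embed`, and `rnn_step` are all made up for illustration.

```python
import math

# Hypothetical toy vocabulary and 2-dimensional embedding table (made-up numbers).
vocab = {"<unk>": 0, "i": 1, "love": 2, "thee": 3}
embedding = [
    [0.0, 0.0],   # <unk>
    [0.1, -0.2],  # i
    [0.4, 0.3],   # love
    [-0.5, 0.2],  # thee
]

def bag_of_words(tokens):
    """One count vector for the whole sentence (word order is lost)."""
    counts = [0] * len(vocab)
    for tok in tokens:
        counts[vocab.get(tok, vocab["<unk>"])] += 1
    return counts

def embed(tokens):
    """One vector x_i per token i (word order is kept)."""
    return [embedding[vocab.get(tok, vocab["<unk>"])] for tok in tokens]

def rnn_step(x, z_prev, W, U, b):
    """Elman-style update: z_i = tanh(W x_i + U z_{i-1} + b)."""
    h = len(b)
    return [
        math.tanh(
            sum(W[r][c] * x[c] for c in range(len(x)))
            + sum(U[r][c] * z_prev[c] for c in range(h))
            + b[r]
        )
        for r in range(h)
    ]

# Made-up weights for a hidden size h = 2.
W = [[0.5, -0.1], [0.2, 0.3]]
U = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

tokens = ["i", "love", "thee"]
xs = embed(tokens)
z = [0.0, 0.0]  # initial hidden state
states = []
for x in xs:  # the same weights are applied at every position
    z = rnn_step(x, z, W, U, b)
    states.append(z)

print(bag_of_words(tokens))  # a single vector for the sentence
print(len(xs), len(states))  # one input vector and one hidden state per token
```

In your report, point to where your own code performs the analogous lookup and recurrence rather than to a toy like this one.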



### 1.2.1 Representation


### 1.2.2 Initialization


### 1.2.3 Training


### 1.2.4 Model


### 1.2.5 Stopping


### 1.2.6 Hyperparameters


# Part 2: Analysis
In **Part 2**, you will conduct a comprehensive analysis of these Neural Machine Translation models, focusing on two comparative settings.

## Part 2 Note
You will be required to submit the code used in finding these results on CMSX. This code should be legible and we will consult it if we find issues in the results. It is worth noting that in **Part 1**, we primarily are considering the correctness of the code-snippets in the report. If your model is flawed in a way that isn’t exposed by those snippets, this will likely surface in your results for **Part 2**. We will deduct points for correctness in this section to reflect this and we will try to localize where the error is (or where we think it is, if it is opaque from your code). That said, we will be lenient about absolute performance (within reason) in this section. As a reminder, the template for the write up is linked [here](https://docs.google.com/document/d/1IWgYqS6M4G_gJowM97Bq8g5smsGOa75dutIUGjolpAI/edit).

## Part 2.1: Within-model comparison
In **Part 2.1: Within-Model Comparison**, you will need to study what happens when you change parameters within a model.

A large aspect of rigorous experimentation in NLP (and other domains) is the _ablation study_. In this, we _ablate_ or remove aspects of a more complex model, making it less complex, to evaluate whether each aspect was necessary. To be concrete, for this part, you should train 4 variants of the RNN model and describe them as we do below:

1. Baseline model
2. Baseline model made more complex by modification $A$ (e.g. changing the hidden dimensionality from $h$ to $2h$).
3. Baseline model made more complex by modification $B$ (where $B$ is an entirely distinct/different update from $A$).
4. Baseline model with both modifications $A$ and $B$ applied.

Under the framing of an ablation study, you would describe this as beginning with model 4 and then ablating (i.e. removing) each of the two modifications, in turn; and then removing both to see if they were genuinely necessary for the performance you observe.
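
The four variants form a 2×2 grid over the two modifications. A minimal sketch of enumerating that grid is below; the dictionary keys and values are hypothetical labels chosen for illustration, and `ablation_grid` is not part of the provided code.

```python
from itertools import product

# Hypothetical baseline settings (placeholder labels, not the notebook's config).
baseline = {"LSTM_RNN": "RNN", "pretrained": False}

# Each modification changes exactly one setting of the baseline.
mod_a = {"pretrained": True}
mod_b = {"LSTM_RNN": "LSTM"}

def ablation_grid(base, a, b):
    """Return the four configurations: baseline, +A, +B, and +A+B."""
    grid = {}
    for use_a, use_b in product([False, True], repeat=2):
        cfg = dict(base)  # copy so the baseline itself is never mutated
        if use_a:
            cfg.update(a)
        if use_b:
            cfg.update(b)
        name = "baseline" + ("+A" if use_a else "") + ("+B" if use_b else "")
        grid[name] = cfg
    return grid

for name, cfg in ablation_grid(baseline, mod_a, mod_b).items():
    print(name, cfg)
```

Keeping every other setting fixed across the four runs is what lets you attribute a change in performance to $A$, to $B$, or to their combination.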

Once you describe each of the four models, report the quantitative BLEU score and perplexity. Conclude by performing a nuanced analysis.
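
As a reminder when reading the two metrics: higher BLEU is better, while lower perplexity is better. Perplexity is conventionally the exponential of the average negative log-likelihood per target token. The sketch below assumes natural-log losses summed over each sentence; `sentence_nlls` and `sentence_lengths` are hypothetical inputs, and whether `train_and_evaluate` computes its logged values in exactly this way is an assumption you should check against the provided code.

```python
import math

def perplexity(sentence_nlls, sentence_lengths):
    """exp(total negative log-likelihood / total number of target tokens)."""
    total_nll = sum(sentence_nlls)
    total_tokens = sum(sentence_lengths)
    return math.exp(total_nll / total_tokens)

# Sanity check: a model that assigns probability 1/V to every token
# has a perplexity of (approximately, up to rounding) V.
V = 1000
lengths = [5, 12, 8]
uniform_nlls = [n * math.log(V) for n in lengths]
print(round(perplexity(uniform_nlls, lengths), 6))
```

The uniform case is a useful reference point: a perplexity close to the vocabulary size means the model is doing little better than guessing.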

The descriptive analysis can take one of two forms:

1. _Nuanced quantitative analysis_ \
If you choose this option, you will need to further break down the quantitative statistics you reported initially. Here are some strategies to prime your thinking. One possible starting point: if model $X$ achieves greater accuracy than model $Y$, to what extent does $X$ get everything correct that $Y$ gets correct? Alternatively, how is model performance affected if you measure it on a specific stratum/subset of the source sentences?

2. _Nuanced qualitative analysis_ \
If you choose this option, you will need to select individual examples and try to explain or reason about why one model may be getting them right whereas the other isn’t. Are there any examples that all 4 models get right or wrong and, if so, can you hypothesize a reason why this occurs?
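
Both quantitative strategies in option 1 reduce to simple bookkeeping over per-sentence results. The following is a minimal sketch under stated assumptions: `correct_x` and `correct_y` are hypothetical per-sentence booleans (for example, exact match against the reference), `scores` is any per-sentence metric you choose, and the length bucket edges are arbitrary.

```python
from collections import defaultdict

def agreement_table(correct_x, correct_y):
    """Count examples by whether model X and model Y each got them right."""
    table = {"both": 0, "only_x": 0, "only_y": 0, "neither": 0}
    for cx, cy in zip(correct_x, correct_y):
        if cx and cy:
            table["both"] += 1
        elif cx:
            table["only_x"] += 1
        elif cy:
            table["only_y"] += 1
        else:
            table["neither"] += 1
    return table

def mean_score_by_length(sources, scores, edges=(5, 10, 20)):
    """Average a per-sentence score within source-length buckets."""
    buckets = defaultdict(list)
    for src, score in zip(sources, scores):
        n = len(src)
        # First bucket whose upper edge fits; otherwise the overflow bucket.
        label = next((f"<={e}" for e in edges if n <= e), f">{edges[-1]}")
        buckets[label].append(score)
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}

# Made-up data purely to show the output shape.
correct_x = [True, True, False, True]
correct_y = [True, False, False, False]
print(agreement_table(correct_x, correct_y))

sources = [["w"] * 3, ["w"] * 7, ["w"] * 7, ["w"] * 25]
scores = [1.0, 0.5, 0.0, 0.25]
print(mean_score_by_length(sources, scores))
```

A large `only_y` count would show that the stronger model is not simply a superset of the weaker one, which is worth discussing in the report.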


**NOTE:** Although we provide individual code sections below for each of the configurations, the report should be written with all of them in mind, discussing all of their performances and carrying out the nuanced analysis with _all_ of the models.

The function below will be useful for analyzing translations by piecing back together the prediction into a cohesive sequence of tokens.

In [None]:
import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

### 2.1.1 Configuration 1
Modify the code below for this configuration.

In [None]:
baseline_nmt = NMT(
    config["embed_size"],
    config["hidden_size"],
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None, # src_embed_vectors,
    pretrained_target=None, # tgt_embed_vectors,
    LSTM_RNN = 'RNN'
)
baseline_nmt.to(device)
baseline_nmt.train()
optimizer = torch.optim.Adam(baseline_nmt.parameters(), lr=1e-3)

In [None]:
# Define each of the variables then you can run this command!
wandb.init(project = 'nlp-p3-demo', entity = "chipmunkez", reinit = True)
wandb.watch(baseline_nmt) #not necessary but will help you track gradients
train_and_evaluate(
    baseline_nmt,
    train_data,
    val_data,
    optimizer,
    config["epochs"],
    config["train_batch_size"],
    config["clip_grad"],
    log_every,
    valid_niter,
    model_save_path
)

### 2.1.2 Configuration 2
Modify the code below for this configuration.

In [None]:
mod_a_nmt = NMT(
    config["embed_size"],
    config["hidden_size"],
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=src_embed_vectors,
    pretrained_target=tgt_embed_vectors,
    LSTM_RNN = 'RNN'
)
mod_a_nmt.to(device)
mod_a_nmt.train()
optimizer = torch.optim.Adam(mod_a_nmt.parameters(), lr=1e-3)

In [None]:
# Define each of the variables then you can run this command!
wandb.init(project = 'nlp-p3-demo', entity = "chipmunkez", reinit = True)
wandb.watch(mod_a_nmt) #not necessary but will help you track gradients
train_and_evaluate(
    mod_a_nmt,
    train_data,
    val_data,
    optimizer,
    config["epochs"],
    config["train_batch_size"],
    config["clip_grad"],
    log_every,
    valid_niter,
    model_save_path
) 

[34m[1mwandb[0m: Currently logged in as: [33mchipmunkez[0m (use `wandb login --relogin` to force relogin)


Begin Maximum Likelihood training


  0%|          | 0/30 [00:00<?, ?it/s]

epoch 0, iter 100, avg. loss 83.98, avg. ppl 433.10 cum. examples 25600, speed 22173.16 words/sec, time elapsed 15.97 sec
epoch 1, iter 200, avg. loss 76.21, avg. ppl 240.00 cum. examples 50963, speed 21901.40 words/sec, time elapsed 32.07 sec
epoch 1, iter 300, avg. loss 74.04, avg. ppl 198.32 cum. examples 76326, speed 21633.70 words/sec, time elapsed 48.48 sec
epoch 2, iter 400, avg. loss 70.90, avg. ppl 162.55 cum. examples 101926, speed 22266.14 words/sec, time elapsed 64.50 sec
epoch 3, iter 500, avg. loss 68.53, avg. ppl 140.53 cum. examples 127289, speed 22372.50 words/sec, time elapsed 80.21 sec
epoch 3, iter 500, cum. loss 74.74, cum. ppl 216.10 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 500, dev. ppl 143.045360, bleu_score 0.319993
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 3, iter 600, avg. loss 67.92, avg. ppl 130.15 cum. examples 25363, speed 5167.76 words/sec, time elapsed 148.67 sec
epoch 4, iter 700, avg. loss 65.50, avg. ppl 113.62 cum. examples 50963, speed 21233.79 words/sec, time elapsed 165.36 sec
epoch 5, iter 800, avg. loss 64.42, avg. ppl 102.20 cum. examples 76326, speed 22993.70 words/sec, time elapsed 180.72 sec
epoch 5, iter 900, avg. loss 63.62, avg. ppl 95.00 cum. examples 101689, speed 21873.71 words/sec, time elapsed 196.92 sec
epoch 6, iter 1000, avg. loss 61.31, avg. ppl 82.94 cum. examples 127289, speed 22045.59 words/sec, time elapsed 213.03 sec
epoch 6, iter 1000, cum. loss 64.55, cum. ppl 103.54 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 1000, dev. ppl 92.994019, bleu_score 0.639857
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 7, iter 1100, avg. loss 60.55, avg. ppl 76.12 cum. examples 25363, speed 4369.16 words/sec, time elapsed 294.16 sec
epoch 7, iter 1200, avg. loss 59.17, avg. ppl 70.99 cum. examples 50726, speed 23122.37 words/sec, time elapsed 309.39 sec
epoch 8, iter 1300, avg. loss 57.90, avg. ppl 63.39 cum. examples 76326, speed 22128.11 words/sec, time elapsed 325.53 sec
epoch 9, iter 1400, avg. loss 56.54, avg. ppl 58.92 cum. examples 101689, speed 21973.71 words/sec, time elapsed 341.55 sec
epoch 9, iter 1500, avg. loss 55.95, avg. ppl 55.86 cum. examples 127052, speed 22542.34 words/sec, time elapsed 357.20 sec
epoch 9, iter 1500, cum. loss 58.02, cum. ppl 64.64 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 1500, dev. ppl 70.818666, bleu_score 0.924782
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 10, iter 1600, avg. loss 54.37, avg. ppl 49.79 cum. examples 25600, speed 4526.13 words/sec, time elapsed 435.89 sec
epoch 11, iter 1700, avg. loss 53.42, avg. ppl 47.02 cum. examples 50963, speed 22387.74 words/sec, time elapsed 451.61 sec
epoch 11, iter 1800, avg. loss 53.07, avg. ppl 44.93 cum. examples 76326, speed 21885.95 words/sec, time elapsed 467.77 sec
epoch 12, iter 1900, avg. loss 51.41, avg. ppl 39.89 cum. examples 101926, speed 22358.52 words/sec, time elapsed 483.74 sec
epoch 13, iter 2000, avg. loss 50.32, avg. ppl 37.70 cum. examples 127289, speed 22140.99 words/sec, time elapsed 499.63 sec
epoch 13, iter 2000, cum. loss 52.52, cum. ppl 43.64 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 2000, dev. ppl 57.859854, bleu_score 1.171994
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 13, iter 2100, avg. loss 50.03, avg. ppl 36.37 cum. examples 25363, speed 4310.16 words/sec, time elapsed 581.54 sec
epoch 14, iter 2200, avg. loss 48.09, avg. ppl 32.36 cum. examples 50963, speed 22026.96 words/sec, time elapsed 597.62 sec
epoch 15, iter 2300, avg. loss 48.30, avg. ppl 31.14 cum. examples 76326, speed 22093.98 words/sec, time elapsed 613.74 sec
epoch 15, iter 2400, avg. loss 47.16, avg. ppl 30.05 cum. examples 101689, speed 22609.23 words/sec, time elapsed 629.29 sec
epoch 16, iter 2500, avg. loss 45.65, avg. ppl 26.95 cum. examples 127289, speed 22388.02 words/sec, time elapsed 645.14 sec
epoch 16, iter 2500, cum. loss 47.84, cum. ppl 31.22 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 2500, dev. ppl 49.106040, bleu_score 1.381714
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 17, iter 2600, avg. loss 45.06, avg. ppl 25.75 cum. examples 25363, speed 4428.80 words/sec, time elapsed 724.58 sec
epoch 17, iter 2700, avg. loss 45.32, avg. ppl 25.44 cum. examples 50726, speed 21902.97 words/sec, time elapsed 740.80 sec
epoch 18, iter 2800, avg. loss 43.68, avg. ppl 22.79 cum. examples 76326, speed 21918.44 words/sec, time elapsed 757.12 sec
epoch 19, iter 2900, avg. loss 42.74, avg. ppl 21.94 cum. examples 101689, speed 22504.58 words/sec, time elapsed 772.71 sec
epoch 19, iter 3000, avg. loss 42.79, avg. ppl 21.61 cum. examples 127052, speed 21539.55 words/sec, time elapsed 789.11 sec
epoch 19, iter 3000, cum. loss 43.92, cum. ppl 23.44 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 3000, dev. ppl 44.612018, bleu_score 1.623025
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 20, iter 3100, avg. loss 41.22, avg. ppl 19.60 cum. examples 25600, speed 4555.79 words/sec, time elapsed 866.95 sec
epoch 21, iter 3200, avg. loss 41.15, avg. ppl 18.95 cum. examples 50963, speed 23187.82 words/sec, time elapsed 882.25 sec
epoch 21, iter 3300, avg. loss 40.42, avg. ppl 18.35 cum. examples 76326, speed 21158.47 words/sec, time elapsed 898.91 sec
epoch 22, iter 3400, avg. loss 39.07, avg. ppl 16.79 cum. examples 101926, speed 22604.83 words/sec, time elapsed 914.59 sec
epoch 23, iter 3500, avg. loss 39.39, avg. ppl 16.48 cum. examples 127289, speed 21796.15 words/sec, time elapsed 930.95 sec
epoch 23, iter 3500, cum. loss 40.25, cum. ppl 17.99 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 3500, dev. ppl 40.377807, bleu_score 1.264198
epoch 23, iter 3600, avg. loss 38.45, avg. ppl 16.13 cum. examples 25363, speed 4602.94 words/sec, time elapsed 1007.15 sec
epoch 24, iter 3700, avg. loss 37.44, avg. ppl 14.73 cum. examples 50963, speed 22723.13 words/sec, time elapsed 1022.83 sec
epoch 25, iter 3800, avg. loss 37.08, avg. ppl 14.47 cum. examples 76326, speed 22121.89 words/sec, time elapsed 1038.74 sec
epoch 25, iter 3900, avg. loss 37.00, avg. ppl 14.22 cum. examples 101689, speed 21930.99 words/sec, time elapsed 1054.86 sec
epoch 26, iter 4000, avg. loss 35.61, avg. ppl 13.05 cum. examples 127289, speed 21752.75 words/sec, time elapsed 1071.18 sec
epoch 26, iter 4000, cum. loss 37.12, cum. ppl 14.48 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 4000, dev. ppl 38.179109, bleu_score 1.232265
epoch 27, iter 4100, avg. loss 35.96, avg. ppl 12.83 cum. examples 25363, speed 4971.61 words/sec, time elapsed 1143.08 sec
epoch 27, iter 4200, avg. loss 35.04, avg. ppl 12.72 cum. examples 50726, speed 23061.02 words/sec, time elapsed 1158.23 sec
epoch 28, iter 4300, avg. loss 34.48, avg. ppl 11.81 cum. examples 76326, speed 22058.01 words/sec, time elapsed 1174.44 sec
epoch 29, iter 4400, avg. loss 34.07, avg. ppl 11.64 cum. examples 101689, speed 22206.94 words/sec, time elapsed 1190.29 sec
epoch 29, iter 4500, avg. loss 33.65, avg. ppl 11.29 cum. examples 127052, speed 22605.00 words/sec, time elapsed 1205.87 sec
epoch 29, iter 4500, cum. loss 34.64, cum. ppl 12.04 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 4500, dev. ppl 37.224074, bleu_score 1.511197
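
The log above prints `save currently the best model` after some validations but not others. The pattern is consistent with checkpointing whenever validation BLEU reaches a new high, rather than whenever dev perplexity improves (perplexity improves at iter 3500, yet nothing is saved). The criterion inside `train_and_evaluate` is not shown here, so treat this as an inference. A minimal sketch that replays the logged BLEU scores:

```python
# (iteration, validation BLEU) copied from the Configuration 2 log above.
val_bleu = [
    (500, 0.319993), (1000, 0.639857), (1500, 0.924782),
    (2000, 1.171994), (2500, 1.381714), (3000, 1.623025),
    (3500, 1.264198), (4000, 1.232265), (4500, 1.511197),
]

def new_best_iterations(history):
    """Return the iterations at which the score strictly exceeds all earlier ones."""
    best = float("-inf")
    saved = []
    for iteration, score in history:
        if score > best:
            best = score
            saved.append(iteration)
    return saved

print(new_best_iterations(val_bleu))  # [500, 1000, 1500, 2000, 2500, 3000]
```

These are exactly the iterations at which the log reports a save, so the checkpoint left on disk after this run corresponds to iter 3000 and not to the final iteration.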


### 2.1.3 Configuration 3
Modify the code below for this configuration.

In [None]:
mod_b_nmt = NMT(
    config["embed_size"],
    config["hidden_size"],
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=None, 
    pretrained_target=None, 
    LSTM_RNN = 'LSTM'
)
# model = baseline_nmt
mod_b_nmt.to(device)
mod_b_nmt.train()
optimizer = torch.optim.Adam(mod_b_nmt.parameters(), lr=1e-3)

In [None]:
# Define each of the variables then you can run this command!
wandb.init(project = 'nlp-p3-demo', entity = "chipmunkez", reinit = True)
wandb.watch(mod_b_nmt) #not necessary but will help you track gradients
train_and_evaluate(
    mod_b_nmt,
    train_data,
    val_data,
    optimizer,
    config["epochs"],
    config["train_batch_size"],
    config["clip_grad"],
    log_every,
    valid_niter,
    model_save_path
) 

0,1
avg. train loss,█▇▇▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁
avg. train perplexity,█▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
avg. val loss,█▆▄▃▂▂▁▁▁
avg. val perplexity,█▅▃▂▂▁▁▁▁
bleu_score,▁▃▄▆▇█▆▆▇
epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███

0,1
avg. train loss,33.65379
avg. train perplexity,11.28784
avg. val loss,49.86752
avg. val perplexity,37.22407
bleu_score,1.5112
epoch,29.0


Begin Maximum Likelihood training


  0%|          | 0/30 [00:00<?, ?it/s]

epoch 0, iter 100, avg. loss 81.18, avg. ppl 333.84 cum. examples 25600, speed 15836.58 words/sec, time elapsed 22.58 sec
epoch 1, iter 200, avg. loss 68.54, avg. ppl 138.57 cum. examples 50963, speed 15964.64 words/sec, time elapsed 44.67 sec
epoch 1, iter 300, avg. loss 62.26, avg. ppl 89.21 cum. examples 76326, speed 15458.43 words/sec, time elapsed 67.41 sec
epoch 2, iter 400, avg. loss 57.41, avg. ppl 62.24 cum. examples 101926, speed 16280.68 words/sec, time elapsed 89.27 sec
epoch 3, iter 500, avg. loss 54.09, avg. ppl 49.01 cum. examples 127289, speed 15338.14 words/sec, time elapsed 112.25 sec
epoch 3, iter 500, cum. loss 64.71, cum. ppl 104.97 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 500, dev. ppl 53.734972, bleu_score 3.826354
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 3, iter 600, avg. loss 51.60, avg. ppl 40.53 cum. examples 25363, speed 4182.21 words/sec, time elapsed 196.77 sec
epoch 4, iter 700, avg. loss 47.40, avg. ppl 30.15 cum. examples 50963, speed 15865.35 words/sec, time elapsed 219.23 sec
epoch 5, iter 800, avg. loss 44.58, avg. ppl 24.87 cum. examples 76326, speed 15750.20 words/sec, time elapsed 241.57 sec
epoch 5, iter 900, avg. loss 42.70, avg. ppl 21.37 cum. examples 101689, speed 15886.35 words/sec, time elapsed 263.83 sec
epoch 6, iter 1000, avg. loss 38.42, avg. ppl 15.86 cum. examples 127289, speed 15960.84 words/sec, time elapsed 286.13 sec
epoch 6, iter 1000, cum. loss 44.93, cum. ppl 25.26 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 1000, dev. ppl 31.324339, bleu_score 8.271961
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 7, iter 1100, avg. loss 36.18, avg. ppl 13.52 cum. examples 25363, speed 4219.70 words/sec, time elapsed 369.63 sec
epoch 7, iter 1200, avg. loss 34.52, avg. ppl 11.90 cum. examples 50726, speed 15900.18 words/sec, time elapsed 391.87 sec
epoch 8, iter 1300, avg. loss 30.44, avg. ppl 8.93 cum. examples 76326, speed 16314.95 words/sec, time elapsed 413.68 sec
epoch 9, iter 1400, avg. loss 29.12, avg. ppl 8.01 cum. examples 101689, speed 15569.27 words/sec, time elapsed 436.48 sec
epoch 9, iter 1500, avg. loss 26.93, avg. ppl 7.00 cum. examples 127052, speed 16098.21 words/sec, time elapsed 458.28 sec
epoch 9, iter 1500, cum. loss 31.44, cum. ppl 9.58 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 1500, dev. ppl 24.790620, bleu_score 13.729393
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 10, iter 1600, avg. loss 23.77, avg. ppl 5.54 cum. examples 25600, speed 4248.34 words/sec, time elapsed 541.95 sec
epoch 11, iter 1700, avg. loss 22.76, avg. ppl 5.14 cum. examples 50963, speed 15827.47 words/sec, time elapsed 564.23 sec
epoch 11, iter 1800, avg. loss 21.36, avg. ppl 4.63 cum. examples 76326, speed 15294.58 words/sec, time elapsed 587.36 sec
epoch 12, iter 1900, avg. loss 18.66, avg. ppl 3.80 cum. examples 101926, speed 16020.35 words/sec, time elapsed 609.68 sec
epoch 13, iter 2000, avg. loss 17.53, avg. ppl 3.56 cum. examples 127289, speed 15671.36 words/sec, time elapsed 632.05 sec
epoch 13, iter 2000, cum. loss 20.82, cum. ppl 4.47 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 2000, dev. ppl 23.752127, bleu_score 19.820404
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 13, iter 2100, avg. loss 16.77, avg. ppl 3.33 cum. examples 25363, speed 4243.15 words/sec, time elapsed 715.40 sec
epoch 14, iter 2200, avg. loss 14.43, avg. ppl 2.83 cum. examples 50963, speed 15933.09 words/sec, time elapsed 737.68 sec
epoch 15, iter 2300, avg. loss 13.84, avg. ppl 2.70 cum. examples 76326, speed 15715.36 words/sec, time elapsed 760.18 sec
epoch 15, iter 2400, avg. loss 13.08, avg. ppl 2.56 cum. examples 101689, speed 15781.46 words/sec, time elapsed 782.56 sec
epoch 16, iter 2500, avg. loss 11.19, avg. ppl 2.24 cum. examples 127289, speed 15841.51 words/sec, time elapsed 804.94 sec
epoch 16, iter 2500, cum. loss 13.86, cum. ppl 2.71 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 2500, dev. ppl 24.492402, bleu_score 25.907401
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 17, iter 2600, avg. loss 10.88, avg. ppl 2.18 cum. examples 25363, speed 4241.56 words/sec, time elapsed 888.62 sec
epoch 17, iter 2700, avg. loss 10.28, avg. ppl 2.10 cum. examples 50726, speed 15785.52 words/sec, time elapsed 910.94 sec
epoch 18, iter 2800, avg. loss 8.88, avg. ppl 1.89 cum. examples 76326, speed 15923.05 words/sec, time elapsed 933.32 sec
epoch 19, iter 2900, avg. loss 8.42, avg. ppl 1.84 cum. examples 101689, speed 15862.73 words/sec, time elapsed 955.47 sec
epoch 19, iter 3000, avg. loss 8.34, avg. ppl 1.82 cum. examples 127052, speed 15526.22 words/sec, time elapsed 978.28 sec
epoch 19, iter 3000, cum. loss 9.36, cum. ppl 1.96 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 3000, dev. ppl 27.293550, bleu_score 30.816940
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 20, iter 3100, avg. loss 6.98, avg. ppl 1.65 cum. examples 25600, speed 4260.70 words/sec, time elapsed 1061.61 sec
epoch 21, iter 3200, avg. loss 6.82, avg. ppl 1.63 cum. examples 50963, speed 15964.74 words/sec, time elapsed 1083.75 sec
epoch 21, iter 3300, avg. loss 6.55, avg. ppl 1.60 cum. examples 76326, speed 15855.67 words/sec, time elapsed 1106.03 sec
epoch 22, iter 3400, avg. loss 5.61, avg. ppl 1.50 cum. examples 101926, speed 15389.61 words/sec, time elapsed 1129.10 sec
epoch 23, iter 3500, avg. loss 5.43, avg. ppl 1.47 cum. examples 127289, speed 16132.21 words/sec, time elapsed 1151.22 sec
epoch 23, iter 3500, cum. loss 6.28, cum. ppl 1.57 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 3500, dev. ppl 30.551156, bleu_score 35.628826
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 23, iter 3600, avg. loss 5.27, avg. ppl 1.47 cum. examples 25363, speed 4212.91 words/sec, time elapsed 1234.28 sec
epoch 24, iter 3700, avg. loss 4.52, avg. ppl 1.38 cum. examples 50963, speed 15787.64 words/sec, time elapsed 1256.85 sec
epoch 25, iter 3800, avg. loss 4.46, avg. ppl 1.38 cum. examples 76326, speed 15527.59 words/sec, time elapsed 1279.55 sec
epoch 25, iter 3900, avg. loss 4.34, avg. ppl 1.37 cum. examples 101689, speed 16340.10 words/sec, time elapsed 1301.16 sec
epoch 26, iter 4000, avg. loss 3.64, avg. ppl 1.30 cum. examples 127289, speed 16122.93 words/sec, time elapsed 1323.11 sec
epoch 26, iter 4000, cum. loss 4.45, cum. ppl 1.38 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 4000, dev. ppl 34.794210, bleu_score 40.209022
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]


epoch 27, iter 4100, avg. loss 3.69, avg. ppl 1.30 cum. examples 25363, speed 4284.76 words/sec, time elapsed 1406.38 sec
epoch 27, iter 4200, avg. loss 3.52, avg. ppl 1.29 cum. examples 50726, speed 15695.98 words/sec, time elapsed 1428.74 sec
epoch 28, iter 4300, avg. loss 3.08, avg. ppl 1.25 cum. examples 76326, speed 15921.45 words/sec, time elapsed 1450.94 sec
epoch 29, iter 4400, avg. loss 3.01, avg. ppl 1.24 cum. examples 101689, speed 15927.04 words/sec, time elapsed 1473.30 sec
epoch 29, iter 4500, avg. loss 3.02, avg. ppl 1.24 cum. examples 127052, speed 15555.08 words/sec, time elapsed 1495.95 sec
epoch 29, iter 4500, cum. loss 3.26, cum. ppl 1.26 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 4500, dev. ppl 39.951869, bleu_score 43.040271
save currently the best model to [NMT_model.ckpt]


save model parameters to [NMT_model.ckpt]
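
In this run the two validation metrics disagree about the best checkpoint: dev perplexity bottoms out early and then rises, while BLEU keeps climbing and training perplexity falls to about 1.24. Falling training perplexity with rising dev perplexity is the usual signature of overfitting in likelihood terms, so this divergence is worth addressing in the analysis. A small sketch that locates the best iteration under each metric, using only values from the log:

```python
# (iteration, dev perplexity, BLEU) copied from the Configuration 3 log above.
validation = [
    (500, 53.734972, 3.826354),
    (1000, 31.324339, 8.271961),
    (1500, 24.790620, 13.729393),
    (2000, 23.752127, 19.820404),
    (2500, 24.492402, 25.907401),
    (3000, 27.293550, 30.816940),
    (3500, 30.551156, 35.628826),
    (4000, 34.794210, 40.209022),
    (4500, 39.951869, 43.040271),
]

best_by_ppl = min(validation, key=lambda row: row[1])   # lower is better
best_by_bleu = max(validation, key=lambda row: row[2])  # higher is better

print(best_by_ppl[0], best_by_bleu[0])  # 2000 4500
```

Which checkpoint you report, and why, is a choice that should be stated explicitly alongside the numbers.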


### 2.1.4 Configuration 4
Modify the code below for this configuration.

In [None]:
# config 4
both_mods_nmt = NMT(
    config["embed_size"],
    config["hidden_size"],
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=src_embed_vectors, 
    pretrained_target=tgt_embed_vectors, 
    LSTM_RNN = 'LSTM'
)
# model = baseline_nmt
both_mods_nmt.to(device)
both_mods_nmt.train()
optimizer = torch.optim.Adam(both_mods_nmt.parameters(), lr=1e-3)

In [None]:

# Define each of the variables then you can run this command!
wandb.init(project = 'nlp-p3-demo', entity = "chipmunkez", reinit = True)
wandb.watch(both_mods_nmt) #not necessary but will help you track gradients
train_and_evaluate(
    both_mods_nmt,  # must be the model whose parameters the optimizer above holds
    train_data,
    val_data,
    optimizer,
    config["epochs"],
    config["train_batch_size"],
    config["clip_grad"],
    log_every,
    valid_niter,
    model_save_path
) 

0,1
avg. train loss,█▇▆▆▆▅▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
avg. train perplexity,█▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
avg. val loss,█▃▁▁▁▂▃▄▅
avg. val perplexity,█▃▁▁▁▂▃▄▅
bleu_score,▁▂▃▄▅▆▇▇█
epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███

0,1
avg. train loss,3.02135
avg. train perplexity,1.24299
avg. val loss,50.84255
avg. val perplexity,39.95187
bleu_score,43.04027
epoch,29.0


Begin Maximum Likelihood training


  0%|          | 0/30 [00:00<?, ?it/s]

epoch 0, iter 100, avg. loss 129.44, avg. ppl 10702.97 cum. examples 25600, speed 22303.24 words/sec, time elapsed 16.01 sec
epoch 1, iter 200, avg. loss 129.12, avg. ppl 10699.69 cum. examples 50963, speed 21375.15 words/sec, time elapsed 32.53 sec
epoch 1, iter 300, avg. loss 128.65, avg. ppl 10700.78 cum. examples 76326, speed 23410.70 words/sec, time elapsed 47.55 sec
epoch 2, iter 400, avg. loss 129.12, avg. ppl 10701.73 cum. examples 101926, speed 22485.86 words/sec, time elapsed 63.39 sec
epoch 3, iter 500, avg. loss 128.50, avg. ppl 10700.62 cum. examples 127289, speed 22752.51 words/sec, time elapsed 78.83 sec
epoch 3, iter 500, cum. loss 128.97, cum. ppl 10701.16 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
save model parameters to [NMT_model.ckpt]


validation: iter 500, dev. ppl 10700.690941, bleu_score 18.117572
save currently the best model to [NMT_model.ckpt]
epoch 3, iter 600, avg. loss 129.59, avg. ppl 10701.10 cum. examples 25363, speed 2789.04 words/sec, time elapsed 205.85 sec
epoch 4, iter 700, avg. loss 128.46, avg. ppl 10697.72 cum. examples 50963, speed 22918.41 words/sec, time elapsed 221.32 sec
epoch 5, iter 800, avg. loss 129.78, avg. ppl 10705.90 cum. examples 76326, speed 21179.08 words/sec, time elapsed 238.07 sec
epoch 5, iter 900, avg. loss 128.98, avg. ppl 10699.83 cum. examples 101689, speed 23406.75 words/sec, time elapsed 253.13 sec
epoch 6, iter 1000, avg. loss 128.72, avg. ppl 10700.44 cum. examples 127289, speed 22381.17 words/sec, time elapsed 269.00 sec
epoch 6, iter 1000, cum. loss 129.10, cum. ppl 10701.00 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 1000, dev. ppl 10700.690941, bleu_score 18.117572
epoch 7, iter 1100, avg. loss 129.24, avg. ppl 10701.35 cum. examples 25363, speed 2797.99 words/sec, time elapsed 395.27 sec
epoch 7, iter 1200, avg. loss 129.26, avg. ppl 10701.67 cum. examples 50726, speed 22077.11 words/sec, time elapsed 411.27 sec
epoch 8, iter 1300, avg. loss 128.59, avg. ppl 10699.24 cum. examples 76326, speed 22583.19 words/sec, time elapsed 426.99 sec
epoch 9, iter 1400, avg. loss 129.24, avg. ppl 10700.25 cum. examples 101689, speed 22648.08 words/sec, time elapsed 442.59 sec
epoch 9, iter 1500, avg. loss 129.39, avg. ppl 10703.97 cum. examples 127052, speed 22049.09 words/sec, time elapsed 458.63 sec
epoch 9, iter 1500, cum. loss 129.14, cum. ppl 10701.30 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 1500, dev. ppl 10700.690941, bleu_score 18.117572
epoch 10, iter 1600, avg. loss 128.97, avg. ppl 10701.25 cum. examples 25600, speed 2802.51 words/sec, time elapsed 585.60 sec
epoch 11, iter 1700, avg. loss 128.97, avg. ppl 10699.89 cum. examples 50963, speed 23039.76 words/sec, time elapsed 600.91 sec
epoch 11, iter 1800, avg. loss 129.28, avg. ppl 10702.32 cum. examples 76326, speed 22187.31 words/sec, time elapsed 616.83 sec
epoch 12, iter 1900, avg. loss 129.67, avg. ppl 10700.93 cum. examples 101926, speed 21711.06 words/sec, time elapsed 633.31 sec
epoch 13, iter 2000, avg. loss 128.33, avg. ppl 10700.40 cum. examples 127289, speed 22662.17 words/sec, time elapsed 648.79 sec
epoch 13, iter 2000, cum. loss 129.04, cum. ppl 10700.96 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 2000, dev. ppl 10700.690941, bleu_score 18.117572
epoch 13, iter 2100, avg. loss 129.21, avg. ppl 10702.13 cum. examples 25363, speed 2777.43 words/sec, time elapsed 775.97 sec
epoch 14, iter 2200, avg. loss 129.20, avg. ppl 10699.67 cum. examples 50963, speed 22647.78 words/sec, time elapsed 791.71 sec
epoch 15, iter 2300, avg. loss 129.29, avg. ppl 10701.85 cum. examples 76326, speed 21810.06 words/sec, time elapsed 807.91 sec
epoch 15, iter 2400, avg. loss 128.72, avg. ppl 10701.96 cum. examples 101689, speed 22552.35 words/sec, time elapsed 823.52 sec
epoch 16, iter 2500, avg. loss 128.81, avg. ppl 10703.36 cum. examples 127289, speed 22224.01 words/sec, time elapsed 839.51 sec
epoch 16, iter 2500, cum. loss 129.05, cum. ppl 10701.79 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 2500, dev. ppl 10700.690941, bleu_score 18.117572
epoch 17, iter 2600, avg. loss 129.69, avg. ppl 10699.48 cum. examples 25363, speed 2793.06 words/sec, time elapsed 966.44 sec
epoch 17, iter 2700, avg. loss 128.72, avg. ppl 10700.62 cum. examples 50726, speed 22853.54 words/sec, time elapsed 981.83 sec
epoch 18, iter 2800, avg. loss 128.24, avg. ppl 10701.41 cum. examples 76326, speed 22712.77 words/sec, time elapsed 997.41 sec
epoch 19, iter 2900, avg. loss 129.67, avg. ppl 10701.40 cum. examples 101689, speed 22534.76 words/sec, time elapsed 1013.14 sec
epoch 19, iter 3000, avg. loss 129.31, avg. ppl 10700.65 cum. examples 127052, speed 21721.63 words/sec, time elapsed 1029.42 sec
epoch 19, iter 3000, cum. loss 129.12, cum. ppl 10700.71 cum. examples 127052
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 3000, dev. ppl 10700.690941, bleu_score 18.117572
epoch 20, iter 3100, avg. loss 129.36, avg. ppl 10701.59 cum. examples 25600, speed 2817.72 words/sec, time elapsed 1156.09 sec
epoch 21, iter 3200, avg. loss 128.24, avg. ppl 10698.79 cum. examples 50963, speed 22805.74 words/sec, time elapsed 1171.46 sec
epoch 21, iter 3300, avg. loss 129.61, avg. ppl 10703.05 cum. examples 76326, speed 22195.86 words/sec, time elapsed 1187.43 sec
epoch 22, iter 3400, avg. loss 128.64, avg. ppl 10699.52 cum. examples 101926, speed 23145.55 words/sec, time elapsed 1202.76 sec
epoch 23, iter 3500, avg. loss 129.45, avg. ppl 10700.90 cum. examples 127289, speed 22749.02 words/sec, time elapsed 1218.32 sec
epoch 23, iter 3500, cum. loss 129.06, cum. ppl 10700.78 cum. examples 127289
begin validation ...


  0%|          | 0/1837 [00:00<?, ?it/s]

validation: iter 3500, dev. ppl 10700.690941, bleu_score 18.117572
epoch 23, iter 3600, avg. loss 129.12, avg. ppl 10703.05 cum. examples 25363, speed 2779.72 words/sec, time elapsed 1345.30 sec
epoch 24, iter 3700, avg. loss 129.23, avg. ppl 10702.55 cum. examples 50963, speed 22449.09 words/sec, time elapsed 1361.18 sec
epoch 25, iter 3800, avg. loss 128.99, avg. ppl 10696.72 cum. examples 76326, speed 22452.67 words/sec, time elapsed 1376.89 sec
epoch 25, iter 3900, avg. loss 128.99, avg. ppl 10704.17 cum. examples 101689, speed 22466.22 words/sec, time elapsed 1392.58 sec
epoch 26, iter 4000, avg. loss 128.77, avg. ppl 10700.84 cum. examples 127289, speed 22208.13 words/sec, time elapsed 1408.58 sec
epoch 26, iter 4000, cum. loss 129.02, cum. ppl 10701.47 cum. examples 127289
begin validation ...
validation: iter 4000, dev. ppl 10700.690941, bleu_score 18.117572
epoch 27, iter 4100, avg. loss 129.09, avg. ppl 10699.34 cum. examples 25363, speed 2783.54 words/sec, time elapsed 1535.36 sec
epoch 27, iter 4200, avg. loss 129.36, avg. ppl 10703.28 cum. examples 50726, speed 22413.61 words/sec, time elapsed 1551.14 sec
epoch 28, iter 4300, avg. loss 129.45, avg. ppl 10699.47 cum. examples 76326, speed 22476.90 words/sec, time elapsed 1567.03 sec
epoch 29, iter 4400, avg. loss 128.94, avg. ppl 10704.43 cum. examples 101689, speed 22136.06 words/sec, time elapsed 1582.95 sec
epoch 29, iter 4500, avg. loss 128.81, avg. ppl 10699.58 cum. examples 127052, speed 22537.99 words/sec, time elapsed 1598.58 sec
epoch 29, iter 4500, cum. loss 129.13, cum. ppl 10701.22 cum. examples 127052
begin validation ...
validation: iter 4500, dev. ppl 10700.690941, bleu_score 18.117572


In [None]:
both_mods_nmt_1 = NMT(
    config["embed_size"],
    config["hidden_size"],
    src_vocab,
    tgt_vocab,
    device=device,
    pretrained_source=src_embed_vectors,
    pretrained_target=tgt_embed_vectors,
    LSTM_RNN="LSTM",
)
both_mods_nmt_1.to(device)
both_mods_nmt_1.train()
optimizer = torch.optim.Adam(both_mods_nmt_1.parameters(), lr=1e-3)
# Define each of the variables used below, then run this cell.
wandb.init(project="nlp-p3-demo", entity="chipmunkez", reinit=True)
wandb.watch(both_mods_nmt_1)  # optional, but helps you track gradients
train_and_evaluate(
    both_mods_nmt_1,
    train_data,
    val_data,
    optimizer,
    config["epochs"],
    config["train_batch_size"],
    config["clip_grad"],
    log_every,
    valid_niter,
    model_save_path
) 

### 2.1.5 Report
Describe your variants in the ablation style described earlier, report the results, and then perform a nuanced analysis of them.

# Part 3: Questions
In **Part 3**, you will need to answer the three questions below. We expect answers to be to-the-point; answers that are vague, meandering, or imprecise **will receive fewer points** than a precise but partially correct answer.

## 3.1 Q1
Earlier in the course, we studied models that make use of _Markov_ assumptions. Recurrent neural networks do not make any such assumption. That said, RNNs are known to struggle with long-distance dependencies. What is a fundamental reason why this is the case?

## 3.2 Q2
In applying RNNs to tasks in NLP, we have discovered that (at least for tasks in English) feeding a sentence into an RNN backwards (i.e. inputting the sequence of vectors corresponding to ($course$, $great$, $a$, $is$, $NLP$) instead of ($NLP$, $is$, $a$, $great$, $course$)) tends to improve performance. Why might this be the case?
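
To make the setup concrete, here is a minimal sketch of what "feeding the sentence in backwards" means at the token level. The helper name `reverse_source` and the toy batch are assumptions made for illustration; they are not part of the provided starter code, and the sketch only shows the reordering, not why it helps.

```python
def reverse_source(batch):
    """Return a copy of the batch with every token sequence reversed."""
    return [list(reversed(tokens)) for tokens in batch]


# Toy batch containing the example sentence from the question.
batch = [["NLP", "is", "a", "great", "course"]]
reversed_batch = reverse_source(batch)

print(reversed_batch[0])  # ['course', 'great', 'a', 'is', 'NLP']
```

The original batch is left unmodified, so both orderings can be fed to a model and compared.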

## 3.3 Q3
In using RNNs and word embeddings for NLP tasks, we are no longer required to engineer specific features that are useful for the task; the model discovers them automatically. Stated differently, it seems that neural models tend to discover better features than human researchers can directly specify. This comes at the cost of systems having to consume tremendous amounts of data to learn these kinds of patterns from the data. Beyond concerns of dataset size (and the computational resources required to process and train using this data as well as the further environmental harm that results from this process), why might we disfavor RNN models?

# Part 4: Miscellaneous
List the libraries you used and the sources you referenced and cited (labelled with the section in which you referred to them). Include a description of how your group split up the work. Include brief feedback on this assignment.

**Each section must be clearly labelled and complete, and its pages must be correctly assigned to the corresponding Gradescope rubric item.** If you follow these steps for each of the 4 components requested, you are guaranteed full credit for this section. Otherwise, you will receive no credit for this section.

# Part 5: Gradescope Submission

Note: This section is not required; however, we will have a Gradescope submission open so that you can submit predictions and see how your models compare against one another!
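
As a rough illustration of the file this section produces, the sketch below writes a tab-separated file with an `Id` / `Predicted` header row and one row per test sentence. `Hypothesis`, `EchoModel`, and `write_submission` are hypothetical stand-ins invented for this example (the stub simply echoes its input), and the plain `" ".join` replaces the notebook's `untokenize` step. Only the `beam_search(...)[0].value` access pattern and the header mirror the submission cell in this notebook.

```python
import csv
import tempfile
from collections import namedtuple
from pathlib import Path

# Hypothetical hypothesis type: only the `.value` field is used here.
Hypothesis = namedtuple("Hypothesis", ["value"])


class EchoModel:
    """Stub model whose beam_search returns the source tokens unchanged."""

    def beam_search(self, src_tokens, beam_size=16, max_decoding_time_step=50):
        return [Hypothesis(value=list(src_tokens))]


def write_submission(path, model, preprocess, sentences):
    """Write an Id/Predicted TSV with one row per input sentence."""
    with Path(path).open("w") as fp:
        fp.write("Id\tPredicted\n")
        for idx, sentence in enumerate(sentences):
            tokens = preprocess(sentence)
            best = model.beam_search(
                tokens,
                beam_size=16,
                max_decoding_time_step=len(tokens) + 10,
            )[0]
            fp.write(f"{idx}\t{' '.join(best.value)}\n")


sentences = ["good morning", "where are you going"]
out_path = Path(tempfile.mkdtemp()) / "submission.tsv"
write_submission(out_path, EchoModel(), str.split, sentences)

# Read the file back to inspect the rows that were written.
with out_path.open() as fp:
    rows = list(csv.reader(fp, delimiter="\t"))
print(rows)
```

Running a stub like this first is a cheap way to check the file layout before spending time on beam search with a trained model.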

In [None]:
# Set up the inputs for the Gradescope submission function
gradescope_model = None  # replace None with the trained model you want to submit
nmt_document_preprocessor = lambda x: nltk.word_tokenize(x) # This is for your RNN
file_name = "submission.tsv"

In [None]:
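def generate_submission(filename, model, document_preprocessor, test):
    """Write one beam-search translation per test sentence to a TSV file.

    Note: `untokenize` is defined in the "Live running demo" cell below;
    make sure it has been defined before calling this function.
    """
    with Path(filename).open("w") as fp:
        fp.write("Id\tPredicted\n")  # header row
        for idx, input_string in tqdm(enumerate(test), total=len(test)):
            translation = untokenize(
                model.beam_search(
                    document_preprocessor(input_string),
                    beam_size=16,
                    # decoding step limit: input length in characters, plus 10
                    max_decoding_time_step=len(input_string)+10
                )[0].value)  # [0] is the first hypothesis returned
            fp.write(f"{idx}\t{translation}\n")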

In [None]:
with open(test_path) as fp:
    test = [line for line in fp]

In [None]:
generate_submission(file_name, gradescope_model, nmt_document_preprocessor, test)

# Live running demo

In [None]:
#@title Translation
#@markdown Enter a sentence to see the translation
input_string = "" #@param {type:"string"}
model_type = "both_mods_nmt" #@param ["baseline_nmt", "mod_a_nmt", "mod_b_nmt", "both_mods_nmt"]
from IPython.display import HTML, display

import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

output = ""

# BAD PRACTICE BELOW: looking a model up by name in globals() is fragile;
# an explicit {name: model} dict would be safer.
model_used = globals()[model_type]

with torch.no_grad():
    # RUN MODEL
    translation = untokenize(model_used.beam_search(
        nmt_document_preprocessor(input_string),
        beam_size=64,
        max_decoding_time_step=len(input_string)+10
    )[0].value)

# Generate nice display
output += '<p style="font-family:verdana; font-size:110%;">'
output += " Input sequence: "+input_string+"</p>"
output += '<p style="font-family:verdana; font-size:110%;">'
output += f" Translation to Shakespeare: {translation}</p><hr>"
output = "<h3>Results:</h3>" + output

display(HTML(output))