# **Assignment 1: Translation with a Sequence to Sequence Network**

This assignment is split into two sections: Neural Machine Translation with (1) RNNs and (2) Transformers. More specifically, in Machine Translation our goal is to convert a sentence from the source language (e.g. Vietnamese) to the target language (e.g. English). In this assignment, we will implement a sequence-to-sequence (Seq2Seq) network based on two architectures, **RNNs with Attention** and **Transformer**, to build a Neural Machine Translation (NMT) system.

That's a lot to digest, so the goal of this assignment is to break it down into easy-to-understand parts. In this assignment you will:

- Prepare the data.
- Implement necessary components:
    - With RNNs and attention architecture:
        - Embedding Layer: to initialize the necessary word embeddings
        - Declare basic components of our model.
        - The Encoder & Decoder

    - With Transformer Architecture:
        - Positional embeddings.
        - Transformer Layer

- Build & train our two models.
- Generate translations.

**Requirements**

First, apart from the standard libraries, we need to install some packages:

1. `sentencepiece`: to build your own vocabulary
2. `sacrebleu`: to evaluate our model using the BLEU score metric

In [None]:
%%capture
!pip install sentencepiece==0.1.97
!pip install tqdm==4.29.1
!pip install sacrebleu
!pip install nltk
!pip install 'portalocker>=2.0.0'

Below, we import our standard libraries.

In [None]:
# Standard libraries
import sys
import json
import time
import math
import numpy as np
from typing import List, Tuple, Dict, Set, Union
from collections import Counter, namedtuple
from itertools import chain
from dataclasses import dataclass

# to compute BLEU score
import sacrebleu

# Pytorch
import torch
import torch.nn as nn
import torch.nn.utils
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence

# To train vocabulary
import sentencepiece as spm

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
#@title Default hyperparameters
@dataclass
class Args:
    cuda: str = "cuda:0"
    train_src: str = "data/train.vi"
    train_tgt: str = "data/train.en"
    dev_src: str = "data/dev.vi"
    dev_tgt: str = 'data/dev.en'
    vocab_file: str = 'vocab.json'
    src_vocab_size: int = 15000
    tgt_vocab_size: int = 21000
    seed: int = 0
    batch_size: int = 32
    max_len: int = 320
    embed_size: int = 1024
    hidden_size: int = 768
    clip_grad: float = 5.0                  # gradient clipping
    log_every: int = 10                     # log every
    max_epoch: int = 100                     # max epoch
    patience: int = 5                       # wait for how many iterations to decay learning rate
    max_num_trial: int = 5                  # terminate training after how many trials
    lr_decay: float = 0.5                   # learning rate decay
    beam_size: int = 5                      # beam size
    lr: float = 0.001                       # learning rate
    uniform_init: float = 0.1               # uniformly initialize all parameters
    model_save_path: str = 'lstm_model.bin' # model save path
    valid_niter: int = 2000                 # perform validation after how many iterations
    dropout: float = 0.3
    max_decoding_time_step: int = 70        # maximum number of decoding time steps

args = Args()
device = torch.device(args.cuda) if torch.cuda.is_available() else torch.device("cpu")

seed = int(args.seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed * 13 // 7)

# Data Preparation


## Loading data files
The data for this project is a set of thousands of Vietnamese-to-English
translation pairs. We first download the files and save them to the `data` folder.

In [None]:
%%capture
!mkdir data

import os
import re

def download_file(google_drive_link, to_save_path):
    match = re.search(r"/file/d/(.*?)/", google_drive_link)
    file_id = match.group(1) if match else None
    new_path = f"https://drive.google.com/uc?id={file_id}"

    os.system(f"wget --no-check-certificate {new_path} -O {to_save_path}")



data_path = 'https://drive.google.com/file/d/1eq68XlKxWBFCj4YgMRl2N5YdrZvB9FDs/view?usp=sharing'
download_file(data_path, args.train_src)

data_path = 'https://drive.google.com/file/d/1679j2kIvdl8Oe_WRSX0vi62JtOrhr1GD/view?usp=sharing'
download_file(data_path, args.train_tgt)


data_path = 'https://drive.google.com/file/d/1p0tBxnD-MVXyve772omfq1nFDraeI_sO/view?usp=sharing'
download_file(data_path, args.dev_src)

data_path = 'https://drive.google.com/file/d/1ZvBBTUwzYJuN4J8WCZ9-kZiBEm4iPpiL/view?usp=sharing'
download_file(data_path, args.dev_tgt)
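
The regular expression inside `download_file` pulls the file ID out of a Google Drive sharing link. The sketch below isolates that step. `extract_drive_id` is a hypothetical helper name (not part of the assignment code), and raising an error on a non-matching link is an assumption about the desired behaviour; the original would build a URL containing `None` instead.

```python
import re

def extract_drive_id(google_drive_link):
    """Return the file ID embedded in a '/file/d/<id>/' sharing link.

    Illustrative helper only; it is not part of the assignment code.
    """
    match = re.search(r"/file/d/(.*?)/", google_drive_link)
    if match is None:
        # Fail loudly instead of producing a download URL with id=None.
        raise ValueError(f"no file ID found in link: {google_drive_link}")
    return match.group(1)

link = "https://drive.google.com/file/d/1eq68XlKxWBFCj4YgMRl2N5YdrZvB9FDs/view?usp=sharing"
file_id = extract_drive_id(link)
print(file_id)
print(f"https://drive.google.com/uc?id={file_id}")
```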

## Logging data files

Looking at the raw data is a good starting point. The data is stored as parallel source-target (src-tgt) files of raw, unprocessed text. The next section shows how this text is preprocessed with a subword tokenizer into a form better suited to the model.

In [None]:
def read_file(file_path):
  """ Read file, where each sentence is delineated by a `\n`.
  @param file_path (str): path to file containing corpus
  """
  data = []
  with open(file_path, 'r', encoding='utf8') as f:
    for line in f:
      data.append(line.strip())

  return data

In [None]:
def logging_data(file_path_src, file_path_tgt, num_data):
    """ Log file, where each data item is delineated by a `\n`.
    @param file_path_src (str): path to file containing source corpus
    @param file_path_tgt (str): path to file containing target corpus
    @param num_data (int): number of data to log
    """

    data_src = read_file(file_path_src)
    data_tgt = read_file(file_path_tgt)

    for i in range(num_data):
      print("Src sent: ", data_src[i])
      print("Tgt sent: ", data_tgt[i])
      print("\n")

In [None]:
logging_data(args.train_src, args.train_tgt, 5)

Src sent:  Khoa học đằng sau một tiêu đề về khí hậu
Tgt sent:  Rachel Pike : The science behind a climate headline


Src sent:  Trong 4 phút , chuyên gia hoá học khí quyển Rachel Pike giới thiệu sơ lược về những nỗ lực khoa học miệt mài đằng sau những tiêu đề táo bạo về biến đổi khí hậu , cùng với đoàn nghiên cứu của mình -- hàng ngàn người đã cống hiến cho dự án này -- một chuyến bay mạo hiểm qua rừng già để tìm kiếm thông tin về một phân tử then chốt .
Tgt sent:  In 4 minutes , atmospheric chemist Rachel Pike provides a glimpse of the massive scientific effort behind the bold headlines on climate change , with her team -- one of thousands who contributed -- taking a risky flight over the rainforest in pursuit of data on a key molecule .


Src sent:  Tôi muốn cho các bạn biết về sự to lớn của những nỗ lực khoa học đã góp phần làm nên c

# Let's start your assignment implementation 💪


## Q1: Padding function (5 points)
In order to apply tensor operations, we must ensure that the sentences in a given batch are of the same length. Thus, we must identify the longest sentence in a batch and pad others to be the same length. Implement the `pad_sents` function, which shall produce these padded sentences.

In [None]:
def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
        The paddings should be at the end of each sentence.
    @param sents (list[list[str]]): list of sentences, where each sentence
                                    is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    """
    sents_padded = []

    ### YOUR CODE HERE (~6 Lines)
    # Pad with plain Python lists so the result keeps the input element type
    # (np.concatenate would return NumPy arrays and can silently change dtypes).
    max_len = max((len(sent) for sent in sents), default=0)
    sents_padded = [list(sent) + [pad_token] * (max_len - len(sent)) for sent in sents]

    ### END YOUR CODE

    return sents_padded

In [None]:
sents = [
    "VietAI là tổ chức phi lợi nhuận.",
    "Theo báo cáo mới nhất của Linkedin về danh sách việc làm triển vọng với mức lương hấp dẫn năm 2020, các chức danh công việc liên quan đến AI đều xếp thứ hạng cao.",
    "Sứ mệnh của VietAI là ươm mầm tài năng về trí tuệ nhân tạo và xây dựng một cộng đồng các chuyên gia trong lĩnh vực trí tuệ nhân tạo đẳng cấp quốc tế.",
    "Chúng ta đang trên hành trình tiến bộ và dân chủ hoá trí tuệ nhân tạo thông qua mã nguồn mở và khoa học mở"]

sents = [s.split() for s in sents]
padded_sents = pad_sents(sents, pad_token="<PAD>")

for sent in padded_sents:
    assert len(sent) == len(padded_sents[0])
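
As a quick illustration of the expected behaviour, the sketch below pads a toy batch using plain Python lists. `pad_to_longest` is a hypothetical stand-in for `pad_sents`, written only so that the example is self-contained; it is one possible way to do the padding, not the reference solution.

```python
def pad_to_longest(sents, pad_token):
    """Right-pad every sentence to the length of the longest one."""
    max_len = max((len(s) for s in sents), default=0)
    return [list(s) + [pad_token] * (max_len - len(s)) for s in sents]

toy_batch = [["a", "b", "c"], ["d"], ["e", "f"]]
padded = pad_to_longest(toy_batch, "<pad>")
for sent in padded:
    print(sent)
# ['a', 'b', 'c']
# ['d', '<pad>', '<pad>']
# ['e', 'f', '<pad>']
```

The padding goes at the end of each sentence, and the longest sentence is returned unchanged.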

Below, we define the `VocabEntry` class. The `VocabEntry` class is a vocabulary entry that contains a dictionary that maps words to indices and provides methods to convert words to indices, indices to words, and sentences to tensors. The purpose of this class is to facilitate the management of the vocabulary.

In [None]:
class VocabEntry(object):
    """ Vocabulary Entry, i.e. structure containing either
    src or tgt language terms.
    """
    def __init__(self, word2id=None):
        """ Init VocabEntry Instance.
        @param word2id (dict): dictionary mapping words 2 indices
        """
        if word2id:
            self.word2id = word2id
        else:
            self.word2id = dict()
            self.word2id['<pad>'] = 0   # Pad Token
            self.word2id['<s>'] = 1 # Start Token
            self.word2id['</s>'] = 2    # End Token
            self.word2id['<unk>'] = 3   # Unknown Token
        self.unk_id = self.word2id['<unk>']
        self.id2word = {v: k for k, v in self.word2id.items()}

    def __getitem__(self, word):
        """ Retrieve word's index. Return the index for the unk
        token if the word is out of vocabulary.
        @param word (str): word to look up.
        @returns index (int): index of word
        """
        return self.word2id.get(word, self.unk_id)

    def __contains__(self, word):
        """ Check if word is captured by VocabEntry.
        @param word (str): word to look up
        @returns contains (bool): whether word is contained
        """
        return word in self.word2id

    def __setitem__(self, key, value):
        """ Raise error, if one tries to edit the VocabEntry.
        """
        raise ValueError('vocabulary is readonly')

    def __len__(self):
        """ Compute number of words in VocabEntry.
        @returns len (int): number of words in VocabEntry
        """
        return len(self.word2id)

    def __repr__(self):
        """ Representation of VocabEntry to be used
        when printing the object.
        """
        return 'Vocabulary[size=%d]' % len(self)

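    def id2word(self, wid):
        """ Return mapping of index to word.
        Note: `__init__` assigns a dict to `self.id2word`, which shadows this
        method on every instance; index the dict (`self.id2word[wid]`) instead.
        @param wid (int): word index
        @returns word (str): word corresponding to index
        """
        return self.id2word[wid]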

    def add(self, word):
        """ Add word to VocabEntry, if it is previously unseen.
        @param word (str): word to add to VocabEntry
        @return index (int): index that the word has been assigned
        """
        if word not in self:
            wid = self.word2id[word] = len(self)
            self.id2word[wid] = word
            return wid
        else:
            return self[word]

    def words2indices(self, sents):
        """ Convert list of words or list of sentences of words
        into list or list of list of indices.
        @param sents (list[str] or list[list[str]]): sentence(s) in words
        @return word_ids (list[int] or list[list[int]]): sentence(s) in indices
        """
        try:
            if type(sents[0]) == list:
                for i in range(len(sents)):
                    # set max length
                    sents[i] = sents[i][:args.max_len]
                return [[self[w] for w in s] for s in sents]
            else:
                # set max length
                sents = sents[:args.max_len]
                return [[self[w] for w in sents]]
        except Exception as e:
            print(e, sents)
            return []

    def indices2words(self, word_ids):
        """ Convert list of indices into words.
        @param word_ids (list[int]): list of word ids
        @return sents (list[str]): list of words
        """
        return [self.id2word[w_id] for w_id in word_ids]

    def to_input_tensor(self, sents: List[List[str]], device: torch.device = device) -> torch.Tensor:
        """ Convert list of sentences (words) into tensor with necessary padding for
        shorter sentences.

        @param sents (List[List[str]]): list of sentences (words)
        @param device: device on which to load the tensor, i.e. CPU or GPU

        @returns sents_var: tensor of (max_sentence_length, batch_size)
        """
        word_ids = self.words2indices(sents)
        sents_t = pad_sents(word_ids, self['<pad>'])
        sents_var = torch.tensor(sents_t, dtype=torch.long, device=device)
        return torch.t(sents_var)

    @staticmethod
    def from_corpus(corpus, size, freq_cutoff=2):
        """ Given a corpus construct a Vocab Entry.
        @param corpus (list[str]): corpus of text produced by read_corpus function
        @param size (int): # of words in vocabulary
        @param freq_cutoff (int): if word occurs n < freq_cutoff times, drop the word
        @returns vocab_entry (VocabEntry): VocabEntry instance produced from provided corpus
        """
        vocab_entry = VocabEntry()
        word_freq = Counter(chain(*corpus))
        valid_words = [w for w, v in word_freq.items() if v >= freq_cutoff]
        print('number of word types: {}, number of word types w/ frequency >= {}: {}'
              .format(len(word_freq), freq_cutoff, len(valid_words)))
        top_k_words = sorted(valid_words, key=lambda w: word_freq[w], reverse=True)[:size]
        for word in top_k_words:
            vocab_entry.add(word)
        return vocab_entry

    @staticmethod
    def from_subword_list(subword_list):
        vocab_entry = VocabEntry()
        for subword in subword_list:
            vocab_entry.add(subword)
        return vocab_entry
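
To make the lookup behaviour concrete, here is a minimal sketch of the word-to-index logic with plain dictionaries. It mirrors the special tokens and the `<unk>` fallback described above, but the names `word2id`, `id2word`, and `lookup` are local to this toy example and the words are made up for illustration.

```python
# Toy stand-in for the lookup logic in VocabEntry (illustration only).
word2id = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3}
for piece in ["science", "climate", "headline"]:
    # New words receive the next free index, as in VocabEntry.add.
    word2id.setdefault(piece, len(word2id))
id2word = {idx: word for word, idx in word2id.items()}
unk_id = word2id["<unk>"]

def lookup(word):
    """Return the index of `word`, or the <unk> index if it is unseen."""
    return word2id.get(word, unk_id)

sentence = ["science", "climate", "rainforest"]   # "rainforest" was never added
ids = [lookup(w) for w in sentence]
print(ids)                         # [4, 5, 3]
print([id2word[i] for i in ids])   # ['science', 'climate', '<unk>']
```

Mapping back from indices is lossy for out-of-vocabulary words: once a word has been replaced by `<unk>`, the original string cannot be recovered.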

Afterwards, we use a `Vocab` class to wrap vocabulary used for both the source and target languages in a machine translation task. It is composed of two `VocabEntry` objects, one for the source language and one for the target language.

The build method is used to construct a `Vocab` object from a list of subwords generated by **SentencePiece** for both the source and target languages. Then, we save them to a JSON file.

In [None]:
class Vocab(object):
    """ Vocab encapsulating src and target languages.
    """
    def __init__(self, src_vocab: VocabEntry, tgt_vocab: VocabEntry):
        """ Init Vocab.
        @param src_vocab (VocabEntry): VocabEntry for source language
        @param tgt_vocab (VocabEntry): VocabEntry for target language
        """
        self.src = src_vocab
        self.tgt = tgt_vocab

    @staticmethod
    def build(src_sents, tgt_sents) -> 'Vocab':
        """ Build Vocabulary.
        @param src_sents (list[str]): Source subwords provided by SentencePiece
        @param tgt_sents (list[str]): Target subwords provided by SentencePiece
        """

        print('initialize source vocabulary ..')
        src = VocabEntry.from_subword_list(src_sents)

        print('initialize target vocabulary ..')
        tgt = VocabEntry.from_subword_list(tgt_sents)

        return Vocab(src, tgt)

    def save(self, file_path):
        """ Save Vocab to file as JSON dump.
        @param file_path (str): file path to vocab file
        """
        with open(file_path, 'w') as f:
            json.dump(dict(src_word2id=self.src.word2id, tgt_word2id=self.tgt.word2id), f, indent=2)

    @staticmethod
    def load(file_path):
        """ Load vocabulary from JSON dump.
        @param file_path (str): file path to vocab file
        @returns Vocab object loaded from JSON dump
        """
        with open(file_path, 'r') as f:
            entry = json.load(f)
        src_word2id = entry['src_word2id']
        tgt_word2id = entry['tgt_word2id']

        return Vocab(VocabEntry(src_word2id), VocabEntry(tgt_word2id))

    def __repr__(self):
        """ Representation of Vocab to be used
        when printing the object.
        """
        return 'Vocab(source %d words, target %d words)' % (len(self.src), len(self.tgt))


def get_vocab_list(file_path, source, vocab_size):
    """ Use SentencePiece to tokenize and acquire list of unique subwords.
    @param file_path (str): file path to corpus
    @param source (str): tgt or src
    @param vocab_size: desired vocabulary size
    """
    spm.SentencePieceTrainer.Train(input=file_path, model_prefix=source, vocab_size=vocab_size)     # train the spm model; this saves <source>.model and <source>.vocab
    sp = spm.SentencePieceProcessor()   # create a processor instance
    sp.Load('{}.model'.format(source))  # loads tgt.model or src.model
    sp_list = [sp.IdToPiece(piece_id) for piece_id in range(sp.GetPieceSize())] # this is the list of subwords
    return sp_list
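
The JSON file written by `Vocab.save` has two top-level keys, `src_word2id` and `tgt_word2id`. The sketch below reproduces that layout with toy dictionaries and a temporary file, so it runs without the rest of the notebook; the entries `xin` and `hello` and the file name `toy_vocab.json` are made up for illustration.

```python
import json
import os
import tempfile

# Toy vocabularies; the real ones come from SentencePiece.
src_word2id = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "xin": 4}
tgt_word2id = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "hello": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vocab.json")
    # Same layout as Vocab.save: one dict per language.
    with open(path, "w") as f:
        json.dump(dict(src_word2id=src_word2id, tgt_word2id=tgt_word2id), f, indent=2)
    # Same access pattern as Vocab.load.
    with open(path, "r") as f:
        entry = json.load(f)

print(sorted(entry.keys()))            # ['src_word2id', 'tgt_word2id']
print(entry["tgt_word2id"]["hello"])   # 4
```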

## Train and save our vocabulary to a json file

In [None]:
print('read in source sentences: %s' % args.train_src)
print('read in target sentences: %s' % args.train_tgt)

src_sents = get_vocab_list(args.train_src, source='src', vocab_size=args.src_vocab_size)
tgt_sents = get_vocab_list(args.train_tgt, source='tgt', vocab_size=args.tgt_vocab_size)
vocab = Vocab.build(src_sents, tgt_sents)
print('generated vocabulary, source %d words, target %d words' % (len(src_sents), len(tgt_sents)))

vocab.save(args.vocab_file)
print('vocabulary saved to %s' % args.vocab_file)

read in source sentences: data/train.vi
read in target sentences: data/train.en
initialize source vocabulary ..
initialize target vocabulary ..
generated vocabulary, source 15000 words, target 21000 words
vocabulary saved to vocab.json


## Read sentence pairs for training
The full process for preparing the data is:

- Read text file  into pairs
- Encode raw text into subwords
- Add word lists into our data

In [None]:
def read_corpus(file_path, source):
    """ Read file, where each sentence is delineated by a `\n`.
    @param file_path (str): path to file containing corpus
    @param source (str): "tgt" or "src" indicating whether text
        is of the source language or target language
    """
    data = []
    sp = spm.SentencePieceProcessor()
    sp.load('{}.model'.format(source))

    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            subword_tokens = sp.encode_as_pieces(line)
            # only append <s> and </s> to the target sentence
            if source == 'tgt':
                subword_tokens = ['<s>'] + subword_tokens + ['</s>']
            data.append(subword_tokens)

    return data

train_data_src = read_corpus(args.train_src, source='src')
train_data_tgt = read_corpus(args.train_tgt, source='tgt')

dev_data_src = read_corpus(args.dev_src, source='src')
dev_data_tgt = read_corpus(args.dev_tgt, source='tgt')

train_data = list(zip(train_data_src, train_data_tgt))
dev_data = list(zip(dev_data_src, dev_data_tgt))

We now print a few pairs of training data after subword encoding to get a feel for the data. When raw text is encoded into subwords, a single word can be represented as a concatenation of several subwords. For instance, `"Trong"` becomes `"_Tro"` and `"ng"`, while `"dòng"` becomes `"_dò"` and `"ng"`. Both words share the subword `"ng"`, which reduces the number of entries needed in the vocabulary.
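
Going the other way, subword pieces can be turned back into text by concatenating them and replacing the word-boundary marker (U+2581, which SentencePiece places at the start of a word) with a space. A minimal sketch, using the hypothetical helper name `pieces_to_text`:

```python
WORD_BOUNDARY = "\u2581"  # marker SentencePiece puts at the start of a word

def pieces_to_text(pieces):
    """Join subword pieces and turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

pieces = [WORD_BOUNDARY + "Rachel", WORD_BOUNDARY + "P", "ike",
          WORD_BOUNDARY, ":", WORD_BOUNDARY + "The", WORD_BOUNDARY + "science"]
print(pieces_to_text(pieces))  # Rachel Pike : The science
```

Pieces without the marker, such as `ike`, attach to the preceding piece, which is how `P` and `ike` recombine into `Pike`.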

In [None]:
n = 5
for i in range(n):
  print("Src sent: " + "|".join(train_data_src[i]))
  print("Tgt sent: " + "|".join(train_data_tgt[i]))
  print("\n")

Src sent: ▁Khoa|▁học|▁đ|ằng|▁sau|▁một|▁tiêu|▁đề|▁về|▁khí|▁hậu
Tgt sent: <s>|▁Rachel|▁P|ike|▁|:|▁The|▁science|▁|behind|▁a|▁climate|▁headline|</s>


Src sent: ▁Tro|ng|▁4|▁phút|▁,|▁chuyên|▁gia|▁hoá|▁học|▁khí|▁quyển|▁Rachel|▁P|ike|▁giới|▁t|hiệu|▁sơ|▁lược|▁về|▁những|▁nỗ|▁lực|▁khoa|▁học|▁m|iệt|▁mà|i|▁đ|ằng|▁sau|▁những|▁tiêu|▁đề|▁táo|▁bạo|▁về|▁biến|▁đổi|▁khí|▁hậu|▁,|▁cùng|▁với|▁đoàn|▁nghiên|▁cứu|▁của|▁mình|▁--|▁hàng|▁ngàn|▁người|▁đã|▁cố|ng|▁|hiến|▁cho|▁dự|▁án|▁này|▁--|▁một|▁chuyến|▁bay|▁mạo|▁hiểm|▁qua|▁rừng|▁già|▁để|▁tìm|▁kiếm|▁thông|▁tin|▁về|▁một|▁phân|▁tử|▁the|n|▁chố|t|▁.
Tgt sent: <s>|▁In|▁4|▁minutes|▁|,|▁atmospher|ic|▁chemist|▁Rachel|▁P|ike|▁provide|s|▁a|▁glimpse

We define the `batch_iter` function to iterate through the given data in batches of a specified size, where each batch contains source and target sentences.

Within each batch, the sentence pairs are sorted by source-sentence length in descending order, so that longer sentences come first.

The function takes three arguments: the data to iterate through, the batch size, and a flag indicating whether to shuffle the data randomly or not.

In [None]:
def batch_iter(data, batch_size, shuffle=False):
    """ Yield batches of source and target sentences reverse sorted by length (largest to smallest).
    @param data (list of (src_sent, tgt_sent)): list of tuples containing source and target sentence
    @param batch_size (int): batch size
    @param shuffle (boolean): whether to randomly shuffle the dataset
    """
    batch_num = math.ceil(len(data) / batch_size)
    index_array = list(range(len(data)))

    if shuffle:
        np.random.shuffle(index_array)

    for i in range(batch_num):
        indices = index_array[i * batch_size: (i + 1) * batch_size]
        examples = [data[idx] for idx in indices]

        examples = sorted(examples, key=lambda e: len(e[0]), reverse=True)
        src_sents, tgt_sents = list(), list()
        for src_sent, tgt_sent in examples:
            if len(src_sent) > 0 and len(tgt_sent) > 0:
                src_sents.append(src_sent)
                tgt_sents.append(tgt_sent)
        yield src_sents, tgt_sents
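
The batching logic can be traced on a toy dataset. The sketch below repeats the slicing and sorting steps of `batch_iter` without shuffling and records only the source lengths in each batch; the data is made up for illustration.

```python
import math

# Toy (src, tgt) pairs; token lists stand in for subword sequences.
data = [(["a"] * n, ["<s>", "x", "</s>"]) for n in [3, 1, 4, 1, 5, 9, 2]]
batch_size = 3

batch_num = math.ceil(len(data) / batch_size)
batches = []
for i in range(batch_num):
    examples = data[i * batch_size:(i + 1) * batch_size]
    # Sort each batch by source length, longest first.
    examples = sorted(examples, key=lambda e: len(e[0]), reverse=True)
    batches.append([len(src) for src, _ in examples])

print(batch_num)  # 3
print(batches)    # [[4, 3, 1], [9, 5, 1], [2]]
```

The last batch may be smaller than `batch_size`. Sorting happens within a batch only, which matches what `pack_padded_sequence` expects with its default `enforce_sorted=True`.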

# The Seq2Seq Model 1: RNNs with global attention

In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional LSTM Encoder and a Unidirectional LSTM Decoder.

<img src="https://i.ibb.co/pjRW6tC/arc.png" alt="arc" border="0" width=70%>

# Model description (training procedure)

Given a sentence in the source language, we look up the subword embeddings from an **embeddings matrix**, yielding $x_1,...,x_m$ ($x_i \in \mathbb{R}^e$), where $m$ is the length of the source sentence and $e$ is the embedding size. We feed the embeddings to the bidirectional encoder, yielding hidden states and cell states for both the forwards (→) and backwards (←) LSTMs. The forwards and backwards versions are concatenated to give hidden states $h^{enc}_i$ and cell states $c^{enc}_i$ :

$$ h^{enc}_i = [\overleftarrow{h^{enc}_i}; \overrightarrow{h^{enc}_i}] \:\: \text{where} \:\: h^{enc}_i \in \mathbb{R}^{2h \times 1} $$
$$ c^{enc}_i = [\overleftarrow{c^{enc}_i}; \overrightarrow{c^{enc}_i}] \:\: \text{where} \:\: c^{enc}_i \in \mathbb{R}^{2h \times 1} $$ \\

We then initialize the **decoder**'s first hidden state $h^{dec}_0$ and cell state $c^{dec}_0$ with a linear projection of the encoder's final hidden state and final cell state.

$$ h^{dec}_0 = W_h[\overleftarrow{h^{enc}_1}; \overrightarrow{h^{enc}_m}] \:\: \text{where} \:\: h^{dec}_0 \in \mathbb{R}^{h \times 1} $$
$$ c^{dec}_0 = W_c[\overleftarrow{c^{enc}_1}; \overrightarrow{c^{enc}_m}] \:\: \text{where} \:\: c^{dec}_0 \in \mathbb{R}^{h \times 1} $$ \\

With the decoder initialized, we must now feed it a target sentence. On the $t^{th}$ step, we look up the embedding for the $t^{th}$ subword, $y_t \in \mathbb{R}^{e \times 1}$ . We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in \mathbb{R}^{h \times 1}$ from the previous timestep (we will explain what this is later!) to produce $\bar{y_t} \in \mathbb{R}^{(e+h) \times 1}$. Note that for the first target subword (i.e. the start token) $o_0$ is a zero-vector. We then feed $\bar{y_t}$ as input to the decoder.


$$ h^{dec}_t , c^{dec}_t = \text{Decoder}(\bar{y_t},  h^{dec}_{t-1} , c^{dec}_{t-1} ) \:\:\: \text{where} \:\:\: h^{dec}_t \in \mathbb{R}^{h \times 1} , c^{dec}_t \in \mathbb{R}^{h \times 1} $$ \\

We then use $h^{dec}_t$ to compute multiplicative attention over $h^{enc}_1,..., h^{enc}_m$ :

$$ e_{t,i} = (h_t^{dec})^TW_{attProj}h^{enc}_i \:\:\: \text{where} \:\:\: e_t \in \mathbb{R}^{m \times 1}, W_{attProj} \in \mathbb{R}^{h \times 2h} $$

$$ \alpha_t = softmax(e_t) \:\:\: \text{where} \:\:\: \alpha_t \in \mathbb{R}^{m \times 1}$$

$$ a_t = \sum_{i=1}^m \alpha_{t, i} h^{enc}_i \:\:\: \text{where} \:\:\: a_t \in \mathbb{R}^{2h \times 1}$$ \\
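
The three attention equations can be checked by hand on a tiny example. The sketch below uses plain Python lists rather than tensors, with a toy hidden size $h = 1$ and $m = 3$ encoder states; all numbers and the helper names `matvec` and `softmax` are made up for illustration.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

h = 1                                          # toy hidden size
h_dec = [1.0]                                  # decoder state, length h
h_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 encoder states, length 2h
W_att = [[1.0, 2.0]]                           # W_attProj, shape h x 2h

# e_{t,i} = (h_dec)^T W_attProj h_enc_i
e_t = [sum(a * b for a, b in zip(h_dec, matvec(W_att, h_i))) for h_i in h_enc]
# alpha_t = softmax(e_t)
alpha_t = softmax(e_t)
# a_t = sum_i alpha_{t,i} h_enc_i, a vector of length 2h
a_t = [sum(alpha * h_i[k] for alpha, h_i in zip(alpha_t, h_enc)) for k in range(2 * h)]

print(e_t)  # [1.0, 2.0, 3.0]
print([round(v, 3) for v in alpha_t])
print([round(v, 3) for v in a_t])
```

The third encoder state gets the highest score, so it receives the largest attention weight and dominates the attention output.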

We now concatenate the attention output $a_t$ with the decoder hidden state $h^{dec}_t$ and pass this through a linear layer, tanh, and dropout to attain the *combined-output* vector $o_t$.

$$ u_t = [a_t;h^{dec}_t] \:\:\: \text{where} \:\:\: u_t \in \mathbb{R}^{3h \times 1} $$

$$ v_t = W_uu_t \:\:\: \text{where} \:\:\: v_t \in \mathbb{R}^{h \times 1},W_u \in \mathbb{R}^{h \times 3h}$$

$$ o_t = dropout(tanh(v_t)) \:\:\: \text{where} \:\:\: o_t \in \mathbb{R}^{h \times 1}$$ \\
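
Keeping track of dimensions helps when implementing these equations. The sketch below lists the sizes implied by the formulas, using the default `embed_size` and `hidden_size` from `Args`. The shapes of $W_h$ and $W_c$ are inferred here from the stated input and output dimensions rather than given explicitly in the text.

```python
e, h = 1024, 768   # embed_size and hidden_size defaults from Args

shapes = {
    "decoder input [y_t; o_{t-1}]": (e + h,),
    "h_enc_i, c_enc_i": (2 * h,),
    "W_h, W_c (inferred)": (h, 2 * h),
    "W_attProj": (h, 2 * h),
    "u_t = [a_t; h_dec_t]": (3 * h,),
    "W_u": (h, 3 * h),
    "o_t": (h,),
}
for name, shape in shapes.items():
    print(f"{name:30s} {shape}")
```

With these defaults the decoder cell takes an input of size 1792, and the combined-output projection maps a vector of size 2304 down to 768.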

Then, we produce a probability distribution $P_t$ over target subwords at the $t^{th}$ timestep:

$$ P_t = softmax(W_{vocab}o_t) \:\:\: \text{where} \:\:\: P_t \in \mathbb{R}^{V_t \times 1}, W_{vocab}\in \mathbb{R}^{V_t \times h} $$

Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we then compute the cross entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target subword at timestep $t$:

$$ J_t(\theta) = CrossEntropy(P_t, g_t)$$
Here, $\theta$ represents all the parameters of the model and $J_t(\theta)$ is the loss on step $t$ of the decoder.
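
Because $g_t$ is one-hot, the cross entropy reduces to the negative log-probability of the gold subword. A minimal worked example in plain Python, with made-up logits over a toy vocabulary of three subwords:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

logits = [2.0, 1.0, 0.0]   # toy values of W_vocab o_t
gold = 0                   # index of the gold subword at step t
g_t = [1.0 if i == gold else 0.0 for i in range(len(logits))]

P_t = softmax(logits)
# Full cross entropy: -sum_i g_t[i] * log P_t[i]
J_full = -sum(g * math.log(p) for g, p in zip(g_t, P_t))
# With a one-hot target this equals -log P_t[gold]
J_t = -math.log(P_t[gold])

print(round(P_t[gold], 4))
print(round(J_t, 4))
```

The two ways of computing the loss agree, and the loss shrinks towards zero as the probability of the gold subword approaches one.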

Now that we have described the model, let's try implementing it for Vietnamese to English translation!




## Q2 (5 points) Embedding Layer Initialization

Implement the `__init__` function to initialize the necessary source and target embeddings.

In [None]:
class ModelEmbeddings(nn.Module):
    """
    Class that converts input words to their embeddings.
    """
    def __init__(self, embed_size, vocab):
        """
        Init the Embedding layers.

        @param embed_size (int): Embedding size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        """
        super(ModelEmbeddings, self).__init__()
        self.embed_size = embed_size

        # default values
        self.source = None
        self.target = None

        src_pad_token_idx = vocab.src['<pad>']
        tgt_pad_token_idx = vocab.tgt['<pad>']

        ### YOUR CODE HERE (~2 Lines)
        ### TODO - Initialize the following variables:
        ###     self.source (Embedding Layer for source language)
        ###     self.target (Embedding Layer for target language)
        ###
        ### Note:
        ###     1. `vocab` object contains two vocabularies:
        ###            `vocab.src` for source
        ###            `vocab.tgt` for target
        ###     2. You can get the length of a specific vocabulary by running:
        ###             `len(vocab.<specific_vocabulary>)`
        ###     3. Remember to include the padding token for the specific vocabulary
        ###        when creating your Embedding.
        ###
        ### Use the following docs to properly initialize these variables:
        ###     Embedding Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding

        self.source = nn.Embedding(len(vocab.src), self.embed_size, padding_idx=src_pad_token_idx)
        self.target = nn.Embedding(len(vocab.tgt), self.embed_size, padding_idx=tgt_pad_token_idx)


        ### END YOUR CODE

In [None]:
vocab.tgt

Vocabulary[size=21001]

## Q3-6 RNN with Global attention NMT model


### Q3 (10 points) Initialize layers in NMT model
Implement the `__init__` function to initialize the necessary model layers (LSTM, projection, and dropout) for the NMT system.

###  Q4 (15 points) Encoder
Implement the `encode` function. This function converts the padded source sentences into the tensor $X$, generates $h^{enc}_1 , . . . , h^{enc}_m $, and computes the initial hidden state $h^{dec}_0$ and initial cell state $c^{dec}_0$ for the $\text{Decoder}$.


### Q5 (15 points) Decoder
Implement the `decode` function. This function constructs $\bar{y}$ and runs the step function over every timestep for the input.



### Q6: (20 points) Decoder step
Implement the `step` function. This function applies the Decoder's LSTM cell for a single timestep, computing the encoding of the target subword $h^{dec}_t$ , the attention scores $e_t$, attention distribution $\alpha_t$, the attention output $a_t$, and finally the combined output $o_t$.

In [None]:
Hypothesis = namedtuple('Hypothesis', ['value', 'score'])

class NMT(nn.Module):
    """ Simple Neural Machine Translation Model:
        - Bidirectional LSTM Encoder
        - Unidirectional LSTM Decoder
        - Global Attention Model (Luong, et al. 2015)
    """

    def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
        """ Init NMT Model.

        @param embed_size (int): Embedding size (dimensionality)
        @param hidden_size (int): Hidden Size, the size of hidden states (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        @param dropout_rate (float): Dropout probability, for attention
        """
        super(NMT, self).__init__()
        self.model_embeddings = ModelEmbeddings(embed_size, vocab)
        self.hidden_size = hidden_size
        print(hidden_size)
        self.dropout_rate = dropout_rate
        self.vocab = vocab

        # default values
        self.encoder = None
        self.decoder = None
        self.h_projection = None
        self.c_projection = None
        self.att_projection = None
        self.combined_output_projection = None
        self.target_vocab_projection = None
        self.dropout = None
        # For sanity check only, not relevant to implementation
        self.gen_sanity_check = False
        self.counter = 0

        ### YOUR CODE HERE (~9 Lines)
        ### TODO - Initialize the following variables IN THIS ORDER:
        ###     self.post_embed_cnn (Conv1d layer with kernel size 2, input and output channels = embed_size,
        ###         padding = same to preserve output shape )
        ###     self.encoder (Bidirectional LSTM with bias)
        ###     self.decoder (LSTM Cell with bias)
        ###     self.h_projection (Linear Layer with no bias), called W_{h} .
        ###     self.c_projection (Linear Layer with no bias), called W_{c} .
        ###     self.att_projection (Linear Layer with no bias), called W_{attProj}.
        ###     self.combined_output_projection (Linear Layer with no bias), called W_{u}.
        ###     self.target_vocab_projection (Linear Layer with no bias), called W_{vocab}.
        ###     self.dropout (Dropout Layer)
        ###
        ### Use the following docs to properly initialize these variables:
        ###     LSTM:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
        ###     LSTM Cell:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
        ###     Linear Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
        ###     Dropout Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout
        ###     Conv1D Layer:
        ###         https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html

        self.post_embed_cnn = nn.Conv1d(embed_size, embed_size, kernel_size=2, padding="same")
        self.encoder = nn.LSTM(embed_size, hidden_size, bias=True, bidirectional=True)  # final h and c each have shape (2, b, hidden_size)
        self.decoder = nn.LSTMCell(embed_size + hidden_size, hidden_size, bias=True)  # input is [Y_t; o_prev] with shape (b, e + h)
        self.h_projection = nn.Linear(2*hidden_size, hidden_size, bias=False)
        self.c_projection = nn.Linear(2*hidden_size, hidden_size, bias=False)
        self.att_projection = nn.Linear(2*hidden_size, hidden_size, bias=False)
        self.combined_output_projection = nn.Linear(3*hidden_size, hidden_size, bias=False)
        self.target_vocab_projection = nn.Linear(hidden_size, len(self.vocab.tgt), bias=False)
        self.dropout = nn.Dropout(p=self.dropout_rate)

        ### END YOUR CODE

    def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        @param source (List[List[str]]): list of source sentence tokens
        @param target (List[List[str]]): list of target sentence tokens, wrapped by `<s>` and `</s>`

        @returns scores (Tensor): a variable/tensor of shape (b, ) representing the
                                    log-likelihood of generating the gold-standard target sentence for
                                    each example in the input batch. Here b = batch size.
        """
        # Compute sentence lengths, capped at args.max_len
        source_lengths = [min(len(s), args.max_len) for s in source]

        # Convert list of lists into tensors
        source_padded = self.vocab.src.to_input_tensor(source, device=self.device)  # Tensor: (src_len, b)
        target_padded = self.vocab.tgt.to_input_tensor(target, device=self.device)  # Tensor: (tgt_len, b)

        ###     Run the network forward:
        ###     1. Apply the encoder to `source_padded` by calling `self.encode()`
        ###     2. Generate sentence masks for `source_padded` by calling `self.generate_sent_masks()`
        ###     3. Apply the decoder to compute combined-output by calling `self.decode()`
        ###     4. Compute log probability distribution over the target vocabulary using the
        ###        combined_outputs returned by the `self.decode()` function.

        enc_hiddens, dec_init_state = self.encode(source_padded, source_lengths)
        enc_masks = self.generate_sent_masks(enc_hiddens, source_lengths)
        combined_outputs = self.decode(enc_hiddens, enc_masks, dec_init_state, target_padded)
        P = F.log_softmax(self.target_vocab_projection(combined_outputs), dim=-1)

        # Zero out, probabilities for which we have nothing in the target text
        target_masks = (target_padded != self.vocab.tgt['<pad>']).float()

        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(
            -1) * target_masks[1:]
        scores = target_gold_words_log_prob.sum(dim=0)
        return scores

    def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[
        torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """ Apply the encoder to source sentences to obtain encoder hidden states.
            Additionally, take the final states of the encoder and project them to obtain initial states for decoder.

        @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b), where
                                        b = batch_size, src_len = maximum source sentence length. Note that
                                       these have already been sorted in order of longest to shortest sentence.
        @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
        @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
                                        b = batch size, src_len = maximum source sentence length, h = hidden size.
        @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
                                                hidden state and cell. Both tensors should have shape (b, h).
        """
        enc_hiddens, dec_init_state = None, None

        ### YOUR CODE HERE (~ 11 Lines)
        ### TODO:
        ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
        ###         src_len = maximum source sentence length, b = batch size, e = embedding size. Note
        ###         that there is no initial hidden state or cell for the encoder.
        ###     2. Apply the post_embed_cnn layer. Before feeding X into the CNN, first use torch.permute to change the
        ###         shape of X to (b, e, src_len). After getting the output from the CNN, still stored in the X variable,
        ###         remember to use torch.permute again to revert X back to its original shape.
        ###     3. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
        ###         - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
        ###         - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
        ###         - Note that the shape of the tensor output returned by the encoder RNN is (src_len, b, h*2) and we want to
        ###           return a tensor of shape (b, src_len, h*2) as `enc_hiddens`, so you may need to do more permuting.
        ###         - Note on using pad_packed_sequence -> For batched inputs, you need to make sure that each of the
        ###           individual input examples has the same shape.
        ###     4. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
        ###         - `init_decoder_hidden`:
        ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
        ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
        ###             This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size
        ###         - `init_decoder_cell`:
        ###             `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
        ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the c_projection layer to this in order to compute init_decoder_cell.
        ###             This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size
        ###
        ### See the following docs, as you may need to use some of the following functions in your implementation:
        ###     Pack the padded sequence X before passing to the encoder:
        ###         https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html
        ###     Pad the packed sequence, enc_hiddens, returned by the encoder:
        ###         https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/generated/torch.cat.html
        ###     Tensor Permute:
        ###         https://pytorch.org/docs/stable/generated/torch.permute.html
        ###     Tensor Reshape (a possible alternative to permute):
        ###         https://pytorch.org/docs/stable/generated/torch.Tensor.reshape.html
        # batch_size: int = 32
        # embed_size: int = 1024
        # hidden_size: int = 768
        # X shape:  torch.Size([86, 32, 1024])
        # X shape:  torch.Size([32, 1024, 86])
        # X shape:  torch.Size([32, 1024, 86])
        # X shape:  torch.Size([86, 32, 1024])
        # enc_hiddens shape:  torch.Size([86, 32, 1536])
        # init_decoder_hidden shape: torch.Size([32, 768])
        # init_decoder_cell shape: torch.Size([32, 768])
        X = self.model_embeddings.source(source_padded)
        X = torch.permute(X, (1,2,0))
        X = self.post_embed_cnn(X)
        X = torch.permute(X, (2,0,1))

        X = pack_padded_sequence(X, lengths=source_lengths, batch_first=False)
        enc_hiddens, (last_hidden, last_cell) = self.encoder(X) # enc_hiddens: (src_len, b, h*2); last_hidden, last_cell: (2, b, h).
        enc_hiddens, _ = pad_packed_sequence(enc_hiddens, batch_first=True)

        last_hidden = torch.cat((last_hidden[0], last_hidden[1]), dim=1)  # (b, 2*h)
        init_decoder_hidden = self.h_projection(last_hidden)  # (b, h)

        last_cell = torch.cat((last_cell[0], last_cell[1]), dim=1)  # (b, 2*h)
        init_decoder_cell = self.c_projection(last_cell)  # (b, h); uses W_{c}, not W_{h}

        dec_init_state = (init_decoder_hidden, init_decoder_cell)
        ### END YOUR CODE

        return enc_hiddens, dec_init_state

    def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
               dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
        """Compute combined output vectors for a batch.

        @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
                                     b = batch size, src_len = maximum source sentence length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
                                     b = batch size, src_len = maximum source sentence length.
        @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
        @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
                                       tgt_len = maximum target sentence length, b = batch size.

        @returns combined_outputs (Tensor): combined output tensor  (tgt_len, b,  h), where
                                        tgt_len = maximum target sentence length, b = batch_size,  h = hidden size
        """
        # Chop off the <END> token for max length sentences.
        target_padded = target_padded[:-1]

        # Initialize the decoder state (hidden and cell)
        dec_state = dec_init_state

        # Initialize previous combined output vector o_{t-1} as zero
        batch_size = enc_hiddens.size(0)
        o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

        # Initialize a list we will use to collect the combined output o_t on each step
        combined_outputs = []

        ### YOUR CODE HERE (~9 Lines)
        ### TODO:
        ###     1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
        ###         which should be shape (b, src_len, h),
        ###         where b = batch size, src_len = maximum source length, h = hidden size.
        ###         This is applying W_{attProj} to h^enc, as described in the PDF.
        ###     2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
        ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
        ###     3. Use the torch.split function to iterate over the time dimension of Y.
        ###         Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
        ###             - Squeeze Y_t into a tensor of dimension (b, e).
        ###             - Construct Ybar_t by concatenating Y_t with o_prev on their last dimension
        ###             - Use the `step` function to compute the Decoder's next (hidden, cell) values
        ###               as well as the new combined output o_t.
        ###             - Append o_t to combined_outputs
        ###             - Update o_prev to the new o_t.
        ###     4. Use torch.stack to convert combined_outputs from a list length tgt_len of
        ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
        ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
        ###
        ### Note:
        ###    - When using the squeeze() function make sure to specify the dimension you want to squeeze
        ###      over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
        ###
        ### You may find some of these functions useful:
        ###     Zeros Tensor:
        ###         https://pytorch.org/docs/stable/torch.html#torch.zeros
        ###     Tensor Splitting (iteration):
        ###         https://pytorch.org/docs/stable/torch.html#torch.split
        ###     Tensor Dimension Squeezing:
        ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tensor Stacking:
        ###         https://pytorch.org/docs/stable/torch.html#torch.stack

        enc_hiddens_proj = self.att_projection(enc_hiddens) # (b, src_len, h*2) -> (b, src_len, h)
        Y = self.model_embeddings.target(target_padded) # (tgt_len, b, e)

        # Iterate over the time dimension; torch.split yields chunks of shape (1, b, e).
        for Y_t in torch.split(Y, 1, dim=0):
            Y_t = Y_t.squeeze(dim=0)  # (1, b, e) -> (b, e)
            Ybar_t = torch.cat((Y_t, o_prev), dim=1)  # (b, e + h)
            dec_state, o_t, e_t = self.step(Ybar_t, dec_state, enc_hiddens, enc_hiddens_proj, enc_masks)
            combined_outputs.append(o_t)
            o_prev = o_t

        combined_outputs = torch.stack(combined_outputs, dim=0)
        ### END YOUR CODE

        return combined_outputs

    def step(self, Ybar_t: torch.Tensor,
             dec_state: Tuple[torch.Tensor, torch.Tensor],
             enc_hiddens: torch.Tensor,
             enc_hiddens_proj: torch.Tensor,
             enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
        """ Compute one forward step of the LSTM decoder, including the attention computation.

        @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
                                where b = batch size, e = embedding size, h = hidden size.
        @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
        @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
                                    src_len = maximum source length, h = hidden size.
        @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
                                    where b = batch size, src_len = maximum source length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
                                    where b = batch size, src_len is maximum source length.

        @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's new hidden state, second tensor is decoder's new cell.
        @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
        @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
                                Note: You will not use this outside of this function.
                                      We are simply returning this value so that we can sanity check
                                      your implementation.
        """

        combined_output = None

        ### YOUR CODE HERE (~3 Lines)
        ### TODO:
        ###     1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
        ###     2. Split dec_state into its two parts (dec_hidden, dec_cell)
        ###     3. Compute the attention scores e_t, a Tensor shape (b, src_len).
        ###        Note: b = batch_size, src_len = maximum source length, h = hidden size.
        ###
        ###       Hints:
        ###         - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
        ###         - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
        ###         - Use batched matrix multiplication (torch.bmm) to compute e_t (be careful about the input/ output shapes!)
        ###         - To get the tensors into the right shapes for bmm, you will need to do some squeezing and unsqueezing.
        ###         - When using the squeeze() function make sure to specify the dimension you want to squeeze
        ###             over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
        ###
        ### Use the following docs to implement this functionality:
        ###     Batch Multiplication:
        ###         https://pytorch.org/docs/stable/torch.html#torch.bmm
        ###     Tensor Unsqueeze:
        ###         https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
        ###     Tensor Squeeze:
        ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze

        dec_state = self.decoder(Ybar_t, dec_state) # (b, h)
        (dec_hidden, dec_cell) = dec_state

        e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)

        ### END YOUR CODE

        # Set e_t to -inf where enc_masks has 1
        if enc_masks is not None:
            e_t.data.masked_fill_(enc_masks.bool(), -float('inf'))

        ### YOUR CODE HERE (~6 Lines)
        ### TODO:
        ###     1. Apply softmax to e_t to yield alpha_t
        ###     2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
        ###         attention output vector, a_t.
        ###     Hints:
        ###           - alpha_t is shape (b, src_len)
        ###           - enc_hiddens is shape (b, src_len, 2h)
        ###           - a_t should be shape (b, 2h)
        ###           - You will need to do some squeezing and unsqueezing.
        ###     Note: b = batch size, src_len = maximum source length, h = hidden size.
        ###
        ###     3. Concatenate dec_hidden with a_t to compute tensor U_t
        ###     4. Apply the combined output projection layer to U_t to compute tensor V_t
        ###     5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
        ###
        ### Use the following docs to implement this functionality:
        ###     Softmax:
        ###         https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.softmax
        ###     Batch Multiplication:
        ###        https://pytorch.org/docs/stable/torch.html#torch.bmm
        ###     Tensor View:
        ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tanh:
        ###         https://pytorch.org/docs/stable/torch.html#torch.tanh

        alpha_t = F.softmax(e_t,dim=1) # (b, src_len)
        a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)

        U_t = torch.cat((a_t,dec_hidden), dim=1)
        V_t = self.combined_output_projection(U_t)
        O_t = self.dropout(torch.tanh(V_t))

        ### END YOUR CODE

        combined_output = O_t
        return dec_state, combined_output, e_t

    def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_lengths: List[int]) -> torch.Tensor:
        """ Generate sentence masks for encoder hidden states.

        @param enc_hiddens (Tensor): encodings of shape (b, src_len, 2*h), where b = batch size,
                                     src_len = max source length, h = hidden size.
        @param source_lengths (List[int]): List of actual lengths for each of the sentences in the batch.

        @returns enc_masks (Tensor): Tensor of sentence masks of shape (b, src_len),
                                    where b = batch size, src_len = max source length.
                                    Entries are 1 at padding positions and 0 elsewhere.
        """
        enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
        for e_id, src_len in enumerate(source_lengths):
            enc_masks[e_id, src_len:] = 1
        return enc_masks.to(self.device)

    def beam_search(self, src_sent: List[str], beam_size: int = 5, max_decoding_time_step: int = 70) -> List[
        Hypothesis]:
        """ Given a single source sentence, perform beam search, yielding translations in the target language.
        @param src_sent (List[str]): a single source sentence (words)
        @param beam_size (int): beam size
        @param max_decoding_time_step (int): maximum number of time steps to unroll the decoding RNN
        @returns hypotheses (List[Hypothesis]): a list of hypotheses; each hypothesis has two fields:
                value: List[str]: the decoded target sentence, represented as a list of words
                score: float: the log-likelihood of the target sentence
        """
        src_sents_var = self.vocab.src.to_input_tensor([src_sent], self.device)

        src_encodings, dec_init_vec = self.encode(src_sents_var, [len(src_sent)])
        src_encodings_att_linear = self.att_projection(src_encodings)

        h_tm1 = dec_init_vec
        att_tm1 = torch.zeros(1, self.hidden_size, device=self.device)

        eos_id = self.vocab.tgt['</s>']

        hypotheses = [['<s>']]
        hyp_scores = torch.zeros(len(hypotheses), dtype=torch.float, device=self.device)
        completed_hypotheses = []

        t = 0
        while len(completed_hypotheses) < beam_size and t < max_decoding_time_step:
            t += 1
            hyp_num = len(hypotheses)

            exp_src_encodings = src_encodings.expand(hyp_num,
                                                     src_encodings.size(1),
                                                     src_encodings.size(2))

            exp_src_encodings_att_linear = src_encodings_att_linear.expand(hyp_num,
                                                                           src_encodings_att_linear.size(1),
                                                                           src_encodings_att_linear.size(2))

            y_tm1 = torch.tensor([self.vocab.tgt[hyp[-1]] for hyp in hypotheses], dtype=torch.long, device=self.device)
            y_t_embed = self.model_embeddings.target(y_tm1)

            x = torch.cat([y_t_embed, att_tm1], dim=-1)

            (h_t, cell_t), att_t, _ = self.step(x, h_tm1,
                                                exp_src_encodings, exp_src_encodings_att_linear, enc_masks=None)

            # log probabilities over target words
            log_p_t = F.log_softmax(self.target_vocab_projection(att_t), dim=-1)

            live_hyp_num = beam_size - len(completed_hypotheses)
            continuing_hyp_scores = (hyp_scores.unsqueeze(1).expand_as(log_p_t) + log_p_t).view(-1)
            top_cand_hyp_scores, top_cand_hyp_pos = torch.topk(continuing_hyp_scores, k=live_hyp_num)

            prev_hyp_ids = torch.div(top_cand_hyp_pos, len(self.vocab.tgt), rounding_mode='floor')
            hyp_word_ids = top_cand_hyp_pos % len(self.vocab.tgt)

            new_hypotheses = []
            live_hyp_ids = []
            new_hyp_scores = []

            for prev_hyp_id, hyp_word_id, cand_new_hyp_score in zip(prev_hyp_ids, hyp_word_ids, top_cand_hyp_scores):
                prev_hyp_id = prev_hyp_id.item()
                hyp_word_id = hyp_word_id.item()
                cand_new_hyp_score = cand_new_hyp_score.item()

                hyp_word = self.vocab.tgt.id2word[hyp_word_id]
                new_hyp_sent = hypotheses[prev_hyp_id] + [hyp_word]
                if hyp_word == '</s>':
                    completed_hypotheses.append(Hypothesis(value=new_hyp_sent[1:-1],
                                                           score=cand_new_hyp_score))
                else:
                    new_hypotheses.append(new_hyp_sent)
                    live_hyp_ids.append(prev_hyp_id)
                    new_hyp_scores.append(cand_new_hyp_score)

            if len(completed_hypotheses) == beam_size:
                break

            live_hyp_ids = torch.tensor(live_hyp_ids, dtype=torch.long, device=self.device)
            h_tm1 = (h_t[live_hyp_ids], cell_t[live_hyp_ids])
            att_tm1 = att_t[live_hyp_ids]

            hypotheses = new_hypotheses
            hyp_scores = torch.tensor(new_hyp_scores, dtype=torch.float, device=self.device)

        if len(completed_hypotheses) == 0:
            completed_hypotheses.append(Hypothesis(value=hypotheses[0][1:],
                                                   score=hyp_scores[0].item()))

        completed_hypotheses.sort(key=lambda hyp: hyp.score, reverse=True)

        return completed_hypotheses

    @property
    def device(self) -> torch.device:
        """ Determine which device to place the Tensors upon, CPU or GPU.
        """
        return self.model_embeddings.source.weight.device

    @staticmethod
    def load(model_path: str):
        """ Load the model from a file.
        @param model_path (str): path to model
        """
        params = torch.load(model_path, map_location=lambda storage, loc: storage)
        args = params['args']
        model = NMT(vocab=params['vocab'], **args)
        model.load_state_dict(params['state_dict'])

        return model

    def save(self, path: str):
        """ Save the model to a file.
        @param path (str): path to the model
        """
        print('save model parameters to [%s]' % path, file=sys.stderr)

        params = {
            'args': dict(embed_size=self.model_embeddings.embed_size, hidden_size=self.hidden_size,
                         dropout_rate=self.dropout_rate),
            'vocab': self.vocab,
            'state_dict': self.state_dict()
        }

        torch.save(params, path)
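
The packing and unpacking that `encode` performs can be tried in isolation. The sketch below is a minimal illustration, not the assignment's solution: the sizes, the `lengths` list, and the variable names are made up for the example, and it assumes the batch is already sorted from longest to shortest, which `pack_padded_sequence` requires by default.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy sizes (assumptions for this example only).
src_len, b, e, h = 5, 3, 4, 6
lengths = [5, 3, 2]                      # true sentence lengths, longest first

X = torch.randn(src_len, b, e)           # stand-in for embedded source tokens
encoder = nn.LSTM(e, h, bidirectional=True)

# Pack so the LSTM skips the padded time steps of the shorter sentences.
packed = pack_padded_sequence(X, lengths)
packed_out, (last_hidden, last_cell) = encoder(packed)

# Unpack to a regular tensor; batch_first=True gives (b, src_len, 2*h).
enc_hiddens, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Concatenate the final forward and backward states: (2, b, h) -> (b, 2*h).
init_input = torch.cat((last_hidden[0], last_hidden[1]), dim=1)

print(enc_hiddens.shape, init_input.shape)
```

Padded positions in `enc_hiddens` come back as zeros, and the final forward state of each sentence is the output at its last real token rather than at the padded end. That is the practical reason for packing before calling the encoder.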

Now it's time to get things running!
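
Before training, the attention arithmetic used in `step` can be sanity-checked on toy tensors. The following is a small sketch under assumed sizes: random tensors stand in for the real encoder and decoder states, and `source_lengths` is invented for the example. It shows how the two `torch.bmm` calls produce the scores and the attention output, and how masked positions end up with zero attention weight.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes (assumptions for this example only).
b, src_len, h = 2, 4, 3
source_lengths = [4, 2]

enc_hiddens = torch.randn(b, src_len, 2 * h)     # encoder states
enc_hiddens_proj = torch.randn(b, src_len, h)    # stand-in for W_{attProj} h^enc
dec_hidden = torch.randn(b, h)                   # stand-in for h^dec_t

# Mask is 1 at padding positions, matching generate_sent_masks.
enc_masks = torch.zeros(b, src_len)
for i, n in enumerate(source_lengths):
    enc_masks[i, n:] = 1

# (b, src_len, h) x (b, h, 1) -> (b, src_len, 1) -> (b, src_len)
e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)
e_t = e_t.masked_fill(enc_masks.bool(), -float('inf'))

alpha_t = F.softmax(e_t, dim=1)                  # (b, src_len), rows sum to 1

# (b, 1, src_len) x (b, src_len, 2h) -> (b, 1, 2h) -> (b, 2h)
a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)

print(alpha_t)
print(a_t.shape)
```

Passing an explicit dimension to `squeeze` keeps the batch dimension intact even when `b = 1`.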

## Evaluation functions


In [None]:
def evaluate_ppl(model, dev_data, batch_size=32):
    """ Evaluate perplexity on dev sentences
    @param model (NMT): NMT Model
    @param dev_data (list of (src_sent, tgt_sent)): list of tuples containing source and target sentence
    @param batch_size (int): batch size
    @returns ppl (float): perplexity on dev sentences
    """
    was_training = model.training
    model.eval()

    cum_loss = 0.
    cum_tgt_words = 0.

    # no_grad() disables gradient tracking, which saves memory and time during evaluation
    with torch.no_grad():
        for src_sents, tgt_sents in batch_iter(dev_data, batch_size):
            loss = -model(src_sents, tgt_sents).sum()

            cum_loss += loss.item()
            tgt_word_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
            cum_tgt_words += tgt_word_num_to_predict

        ppl = np.exp(cum_loss / cum_tgt_words)

    if was_training:
        model.train()

    return ppl

def compute_corpus_level_bleu_score(references: List[List[str]], hypotheses: List[Hypothesis]) -> Tuple[float, List[str], List[str]]:
    """ Given decoding results and reference sentences, compute corpus-level BLEU score.
    @param references (List[List[str]]): a list of gold-standard reference target sentences
    @param hypotheses (List[Hypothesis]): a list of hypotheses, one for each reference
    @returns bleu_score (float): corpus-level BLEU score
    @returns detokened_refs (List[str]): detokenized reference sentences
    @returns detokened_hyps (List[str]): detokenized hypothesis sentences
    """
    # remove the start and end tokens
    if references[0][0] == '<s>':
        references = [ref[1:-1] for ref in references]

    # detokenize the subword pieces to get full sentences
    detokened_refs = [''.join(pieces).replace('▁', ' ') for pieces in references]
    detokened_hyps = [''.join(hyp.value).replace('▁', ' ') for hyp in hypotheses]

    # sacreBLEU can take multiple references (golden example per sentence) but we only feed it one
    bleu = sacrebleu.corpus_bleu(detokened_hyps, [detokened_refs])

    return bleu.score, detokened_refs, detokened_hyps
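
The detokenization step in the BLEU function relies on the SentencePiece convention that the character `▁` (U+2581) marks the start of a new word. A minimal sketch of that step on its own follows; the helper name `detokenize` and the sample pieces are made up for illustration.

```python
from typing import List

WORD_BOUNDARY = '\u2581'  # the SentencePiece word-boundary marker '▁'


def detokenize(pieces: List[str]) -> str:
    """Join subword pieces and turn boundary markers back into spaces."""
    return ''.join(pieces).replace(WORD_BOUNDARY, ' ')


# Hypothetical subword pieces for the sentence "Hello world!"
pieces = ['\u2581Hello', '\u2581wor', 'ld', '!']
text = detokenize(pieces)

print(repr(text))          # the joined string keeps a leading space
print(repr(text.strip()))  # strip() removes it if a clean sentence is needed
```

Pieces without the marker, such as `ld` and `!`, attach to the preceding piece, which is how subwords are merged back into full words.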

## Training the model

In [None]:
# Initialize our model and optimizer
model = NMT(embed_size=args.embed_size,
            hidden_size=args.hidden_size,
            dropout_rate=float(args.dropout),
            vocab=vocab)
model.train()

uniform_init = float(args.uniform_init)
if np.abs(uniform_init) > 0.:
    print('uniformly initialize parameters [-%f, +%f]' % (uniform_init, uniform_init), file=sys.stderr)
    for p in model.parameters():
        p.data.uniform_(-uniform_init, uniform_init)

model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=float(args.lr))

768


uniformly initialize parameters [-0.100000, +0.100000]


We will first train our model on a small training set of 50 samples and evaluate it on a small dev set of 50 samples.

In [None]:
max_train = 50

train_data_small = [val for val in train_data if len(val[0]) > 3][:max_train]
dev_data_small = [val for val in dev_data if len(val[0]) > 3][:max_train]
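
The training loop that follows normalizes the same summed loss in two different ways: by the number of sentences when reporting the average loss, and by the number of target words (excluding the leading `<s>`) when reporting perplexity. A small worked example makes the difference concrete. The log-likelihood values and sentences here are made up and merely stand in for the model's output.

```python
import math

# Hypothetical per-sentence log-likelihoods, standing in for model(src_sents, tgt_sents).
sentence_log_likelihoods = [-6.0, -9.0, -3.0]
tgt_sents = [
    ['<s>', 'a', 'b', '</s>'],
    ['<s>', 'a', 'b', 'c', 'd', '</s>'],
    ['<s>', 'a', '</s>'],
]

batch_size = len(tgt_sents)
batch_loss = -sum(sentence_log_likelihoods)        # summed negative log-likelihood
num_words = sum(len(s[1:]) for s in tgt_sents)     # predicted tokens, omitting `<s>`

avg_loss_per_example = batch_loss / batch_size     # normalized per sentence
perplexity = math.exp(batch_loss / num_words)      # normalized per predicted word

print(f'avg. loss {avg_loss_per_example:.2f}, ppl {perplexity:.2f}')
```

Because perplexity is normalized per word, it stays comparable across batches with different sentence lengths, while the per-example loss grows with sentence length.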

In [None]:
num_trial = 0
train_iter = patience = cum_loss = report_loss = cum_tgt_words = report_tgt_words = 0
cum_examples = report_examples = epoch = valid_num = 0
hist_valid_scores = []
train_time = begin_time = time.time()
print('begin Maximum Likelihood training')

for epoch in range(args.max_epoch):
    for src_sents, tgt_sents in batch_iter(train_data_small, batch_size=args.batch_size, shuffle=True):
        train_iter += 1

        optimizer.zero_grad()

        batch_size = len(src_sents)

        example_losses = -model(src_sents, tgt_sents) # (batch_size,)
        batch_loss = example_losses.sum()
        loss = batch_loss / batch_size

        loss.backward()

        # clip gradient
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip_grad)

        optimizer.step()

        batch_losses_val = batch_loss.item()
        report_loss += batch_losses_val
        cum_loss += batch_losses_val

        tgt_words_num_to_predict = sum(len(s[1:]) for s in tgt_sents)  # omitting leading `<s>`
        report_tgt_words += tgt_words_num_to_predict
        cum_tgt_words += tgt_words_num_to_predict
        report_examples += batch_size
        cum_examples += batch_size

        if train_iter % args.log_every == 0:
            print('epoch %d, iter %d, avg. loss %.2f, avg. ppl %.2f ' \
                    'cum. examples %d, speed %.2f words/sec, time elapsed %.2f sec' % (epoch, train_iter,
                                                                                        report_loss / report_examples,
                                                                                        math.exp(report_loss / report_tgt_words),
                                                                                        cum_examples,
                                                                                        report_tgt_words / (time.time() - train_time),
                                                                                        time.time() - begin_time), file=sys.stderr)

            train_time = time.time()
            report_loss = report_tgt_words = report_examples = 0.

        # perform validation
        if train_iter % args.valid_niter == 0:
            print('epoch %d, iter %d, cum. loss %.2f, cum. ppl %.2f cum. examples %d' % (epoch, train_iter,
                                                                                        cum_loss / cum_examples,
                                                                                        np.exp(cum_loss / cum_tgt_words),
                                                                                        cum_examples), file=sys.stderr)

            cum_loss = cum_examples = cum_tgt_words = 0.
            valid_num += 1

            print('begin validation ...', file=sys.stderr)

            # compute dev. ppl and bleu
            dev_ppl = evaluate_ppl(model, dev_data, batch_size=128)   # dev batch size can be a bit larger
            valid_metric = -dev_ppl

            print('validation: iter %d, dev. ppl %f' % (train_iter, dev_ppl), file=sys.stderr)

            is_better = len(hist_valid_scores) == 0 or valid_metric > max(hist_valid_scores)
            hist_valid_scores.append(valid_metric)

            if is_better:
                patience = 0
                print('save currently the best model to [%s]' % args.model_save_path, file=sys.stderr)
                model.save(args.model_save_path)

                # also save the optimizers' state
                torch.save(optimizer.state_dict(), args.model_save_path + '.optim')
            elif patience < int(args.patience):
                patience += 1
                print('hit patience %d' % patience, file=sys.stderr)

                if patience == int(args.patience):
                    num_trial += 1
                    print('hit #%d trial' % num_trial, file=sys.stderr)
                    if num_trial == int(args.max_num_trial):
                        print('early stop!', file=sys.stderr)
                        exit(0)

                    # decay lr, and restore from previously best checkpoint
                    lr = optimizer.param_groups[0]['lr'] * float(args.lr_decay)
                    print('load previously best model and decay learning rate to %f' % lr, file=sys.stderr)

                    # load model
                    params = torch.load(args.model_save_path, map_location=lambda storage, loc: storage)
                    model.load_state_dict(params['state_dict'])
                    model = model.to(device)

                    print('restore parameters of the optimizers', file=sys.stderr)
                    optimizer.load_state_dict(torch.load(args.model_save_path + '.optim'))

                    # set new lr
                    for param_group in optimizer.param_groups:
                        param_group['lr'] = lr

                    # reset patience
                    patience = 0
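
The loop above reports perplexity as the exponential of the summed loss divided by the number of predicted target words. A minimal sketch of that arithmetic, using made-up loss values and word counts (the helper name `perplexity` is only for illustration):

```python
import math

def perplexity(total_neg_log_likelihood: float, num_target_words: int) -> float:
    """exp of the average per-word negative log-likelihood."""
    return math.exp(total_neg_log_likelihood / num_target_words)

# Hypothetical batch: two sentences whose summed losses are 12.0 and 18.0,
# predicting 6 and 9 target words respectively (the leading <s> is not counted).
example_losses = [12.0, 18.0]
target_word_counts = [6, 9]

ppl = perplexity(sum(example_losses), sum(target_word_counts))
print(f"perplexity = {ppl:.4f}")  # exp(30 / 15) = exp(2)
```

Because the loss is divided by the word count rather than the sentence count, perplexity stays comparable across batches with different sentence lengths.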

## Testing the model

Beam search is a decoding algorithm used in many NLP and speech recognition models as the final decision-making step: at each time step it keeps only the highest-scoring partial outputs (the beam) instead of following a single greedy choice. Beam size is usually 4-10. Increasing the beam size raises the computational cost and can even lead to worse quality. The figure is taken from the blog [Sequence to Sequence (seq2seq) and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html).

<img src="https://drive.google.com/uc?id=1rjVuANPjwTQi33beysq_LRNCxd6xhRHY" width="800" height="400"/>

The model uses the `beam_search` function to generate translations in the target language. `beam_search` returns a list of hypotheses for the translated sentence, each carrying its value and its score, sorted in descending order of score, so the first element is the model's best translation. On the training data the samples should be very good, with a BLEU score of 100. On the validation data, however, they probably won't make sense (because we're overfitting).
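
To make the idea concrete, here is a minimal toy sketch of beam search. It is not the `beam_search` method of our model: `TOY_MODEL` is a hypothetical table of next-token probabilities that depends only on the previous token, and `beam_search_toy` is an illustrative name. It follows one common variant in which finished hypotheses compete for beam slots.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical next-token distribution, keyed by the previous token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"a": 0.1, "b": 0.3, "</s>": 0.6},
    "b":   {"a": 0.2, "b": 0.1, "</s>": 0.7},
}

def beam_search_toy(beam_size: int, max_len: int) -> List[Tuple[List[str], float]]:
    """Return finished hypotheses as (tokens, log-probability), best first."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for nxt, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [nxt], score + math.log(p)))
        # Keep only the `beam_size` best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished

hyps = beam_search_toy(beam_size=2, max_len=4)
best_tokens, best_score = hyps[0]
print(best_tokens, round(math.exp(best_score), 3))
```

Scores are summed log-probabilities, so the returned list is already sorted from the most to the least likely hypothesis, mirroring the ordering described above.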

In [None]:
# @title Testing on training set
references = []
hypotheses = []

num_logs = 5
if (num_logs > len(train_data_small)):
  num_logs = len(train_data_small)

for i in range(num_logs):
  src, tgt = train_data_small[i]
  references = [tgt]
  hypotheses = model.beam_search(src)
  bleu_score, detokened_refs, detokened_hyps = compute_corpus_level_bleu_score(references, hypotheses)
  print(f"Sample {i}:")
  print("Bleu score: " + str(bleu_score))
  print("Source: " + str(src))
  print("Reference: " + str(detokened_refs))
  print("Hypotheses: " + str(detokened_hyps))
  print('\n')



Sample 0:
Bleu score: 0.0
Source: ['▁Khoa', '▁học', '▁đ', 'ằng', '▁sau', '▁một', '▁tiêu', '▁đề', '▁về', '▁khí', '▁hậu']
Reference: [' Rachel Pike : The science behind a climate headline']
Hypotheses: ['200 TNCband populat Beatle populat Beatle populatocracynnec onto combine Centi Embarrass intersect Horse McG implore populat shatter populat adequateamountfr crow populat contract populat Philistine breaker breaker His idolpowerful crow bridge populat rejection rejection rejection rejection rejection bees good Pall PallPTpush MDG populat matters Wire dewormcognitive Sar 360hub medal Defen trilob trilob Accord Shack rejection toolbox trilob trilob trilob trilob Defen']


Sample 1:
Bleu score: 0.0
Source: ['▁Tro', 'ng', '▁4', '▁phút', '▁,', '▁chuyên', '▁gia', '▁hoá', '▁học', '▁khí', '▁quyển', '▁Rachel', '▁P', 'ike', '▁giới', '▁t', 'hiệu', '▁sơ', '▁lược', '▁về', '▁những', '▁nỗ', '▁lực', '▁k

In [None]:
# @title Testing on evaluate set
references = []
hypotheses = []

num_logs = 5
if (num_logs > len(dev_data_small)):
  num_logs = len(dev_data_small)

for i in range(num_logs):
  src, tgt = dev_data_small[i]
  references = [tgt]
  hypotheses = model.beam_search(src)
  bleu_score, detokened_refs, detokened_hyps = compute_corpus_level_bleu_score(references, hypotheses)
  print(f"Sample {i}:")
  print("Bleu score: " + str(bleu_score))
  print("Reference: " + str(detokened_refs))
  print("Hypotheses: " + str(detokened_hyps))
  print('\n')

Sample 0:
Bleu score: 0.0
Reference: [' How can I speak in 10 minutes about the bonds of women over three generations , about how the astonishing strength of those bonds took hold in the life of a four-year-old girl huddled with her young sister , her mother and her grandmother for five days and nights in a small boat in the China Sea more than 30 years ago , bonds that took hold in the life of that small girl and never let go -- that small girl now living in San Francisco and speaking to you today ?']
Hypotheses: ['utonomousber Embarrass Run populat ejaculatenox Sek stilltuitous Russia Laf apprehensi Justin casual nail wisphas Espepick Monet gate competitorfusionmoor prudenteadershipwife audience stripe intimidate Minetown8towntown pillow synchron museum heterosexual Ata Ata Russia Robbia Aunt commercial Dil telephon discoverer wrench modest reboot terabyte reboot underdevelop hum sampl populat populat joint hum blogger lawnmowerperformance intersect blogger tillzing commut']


Sample

# The Seq2Seq Model 2: Transformer
In this part, you will train a sequence-to-sequence Transformer model to translate Vietnamese into English. The Transformer was originally proposed in "Attention Is All You Need" by Vaswani et al. (2017).

<img src="https://www.tensorflow.org/images/tutorials/transformer/apply_the_transformer_to_machine_translation.gif" alt="Applying the Transformer to machine translation">

Figure 2: Applying the Transformer to machine translation. Source: [Google AI Blog](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html).

A Transformer is a sequence-to-sequence encoder-decoder model similar to the model in the [NMT with attention tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention).
A single-layer Transformer takes a little more code to write, but is almost identical to that encoder-decoder RNN model. The only difference is that the RNN layers are replaced with self-attention layers.

<table>
<tr>
  <th>The <a href=https://www.tensorflow.org/text/tutorials/nmt_with_attention>RNN+Attention model</a></th>
  <th>A 1-layer transformer</th>
</tr>
<tr>
  <td>
   <img width=411 src="https://www.tensorflow.org/images/tutorials/transformer/RNN+attention-words.png"/>
  </td>
  <td>
   <img width=400 src="https://www.tensorflow.org/images/tutorials/transformer/Transformer-1layer-words.png"/>
  </td>
</tr>
</table>

### The Embedding and Positional Encoding Layer

The inputs to both the encoder and decoder use the same embedding and positional encoding logic.

<table>
<tr>
  <th colspan=1>The embedding and positional encoding layer</th>
</tr>
<tr>
  <td>
   <img src="https://www.tensorflow.org/images/tutorials/transformer/PositionalEmbedding.png"/>
  </td>
</tr>
</table>

The formula for calculating the positional encoding (implemented in Python below) is as follows:

$$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})} $$
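
As a quick sanity check of the two formulas, here is a minimal plain-Python sketch that evaluates them for a single position. The function name `positional_encoding` and the tiny `d_model=4` are chosen only for illustration.

```python
import math

def positional_encoding(pos: int, d_model: int) -> list:
    """Sinusoidal encoding of one position: sine on even indices, cosine on odd ones."""
    pe = [0.0] * d_model
    for two_i in range(0, d_model, 2):
        angle = pos / (10000 ** (two_i / d_model))
        pe[two_i] = math.sin(angle)
        if two_i + 1 < d_model:
            pe[two_i + 1] = math.cos(angle)
    return pe

pe0 = positional_encoding(pos=0, d_model=4)
pe1 = positional_encoding(pos=1, d_model=4)
print(pe0)  # at position 0 every sine term is 0 and every cosine term is 1
print(pe1)
```

Each pair of dimensions shares one frequency, and the frequency drops as the dimension index grows, so low dimensions change quickly with position while high dimensions change slowly.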

## Q7: Transformer Embedding Layer (10 points)
Implement `TransformerEmbedding`, which combines a lookup embedding with the positional encoding, for the Transformer model.

In [None]:
class TransformerEmbedding(nn.Module):
    """
    Class that combines token embeddings with positional embeddings.
    """
    def __init__(self, vocab_size, embedding_size, max_len, dropout_rate):
        """
        Init the Transformer Embedding layer.

        @param vocab_size (int): Vocabulary size (number of unique tokens)
        @param embedding_size (int): Embedding size (dimensionality)
        @param max_len (int): Maximum sequence length
        @param dropout_rate (float): Dropout probability
        """
        super().__init__()
        # default values
        self.embedding_size = embedding_size
        self.token_embedding = None
        self.dropout = None
        pos_embedding = None

        ### YOUR CODE HERE
        ### TODO - Implement the positional embedding and initialize the following variables:
        ###     self.token_embedding (Embedding Layer)
        ###     self.pos_embedding (Positional Embedding Layer). Note that pos_embedding is not a learnable
        ###         parameter, so we should use the self.register_buffer function to initialize it.
        ###     self.dropout (Dropout Layer)
        ###
        ### Note:
        ###     1. `vocab_size` represents the size of the vocabulary (number of unique tokens)
        ###     2. `embedding_size` represents the size of each embedding vector
        ###     3. `max_len` represents the maximum sequence length
        ###     4. `dropout_rate` represents the dropout probability
        ###
        ### Use the following docs to properly initialize these variables:
        ###     Embedding Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
        ###     Dropout Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout
        ###

        self.token_embedding = nn.Embedding(vocab_size, self.embedding_size)
        self.dropout = nn.Dropout(p=dropout_rate)
        pos_embedding = self.init_pos_embedding(max_len, self.embedding_size)

        ### END YOUR CODE
        self.register_buffer('pos_embedding', pos_embedding)

    def init_pos_embedding(self, max_len, embedding_size):
        # same size with input matrix (for adding with input matrix)
        encoding = torch.zeros(max_len, embedding_size)
        encoding.requires_grad = False  # we don't need to compute gradient

        pos = torch.arange(0, max_len)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D unsqueeze to represent word's position

        _2i = torch.arange(0, embedding_size, step=2).float()
        # 'i' means index of d_model (e.g. embedding size = 50, 'i' = [0,50])
        # "step=2" means 'i' multiplied with two (same with 2 * i)

        encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / embedding_size)))
        encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / embedding_size)))

        return encoding

    def forward(self, x):
        """
        Maps input sequences of tokens to their embeddings.

        @param x (Tensor): Input tensor of tokens with shape (seq_len, batch_size)

        @returns embedded (Tensor): Tensor of token embeddings with shape (seq_len, batch_size, embedding_size)
        """
        batch_size = x.size(1)
        # print(f"x: ", x.shape)

        # Retrieve token embeddings
        embedded_tokens = self.token_embedding(x) ### YOUR CODE HERE (~1 Line) ###
        # print(f"embedded_tokensx: ", embedded_tokens.shape)
        # Retrieve positional embeddings for the appropriate segment of the input sequence
        embedded_positions =  self.pos_embedding[:x.size(0), :] ### YOUR CODE HERE (~1 Line) ###
        embedded_positions = embedded_positions.unsqueeze(1).expand(-1, batch_size, -1)

        # print(f"embedded_positions: ", embedded_positions.shape)

        # Add token and positional embeddings together, apply dropout, and return
        embedded = self.dropout(embedded_tokens + embedded_positions) ### YOUR CODE HERE (~1 Line) ###
        return embedded

## The Transformer model

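For convenience, we will use the `nn.Transformer` layer from PyTorch. The figure below shows a 4-layer Transformer; the model trained later in this notebook is configured with 3 encoder layers and 3 decoder layers.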

<table>
<tr>
  <th colspan=1>The original Transformer diagram</th>
  <th colspan=1>A representation of a 4-layer Transformer</th>
</tr>
<tr>
  <td>
   <img width=400 src="https://www.tensorflow.org/images/tutorials/transformer/transformer.png"/>
  </td>
  <td>
   <img width=307 src="https://www.tensorflow.org/images/tutorials/transformer/Transformer-4layer-compact.png"/>
  </td>
</tr>
</table>

## Q8-9 Transformer NMT model

### Q8: (5 points) Initialize layers in TransformerNMT model
Implement the `__init__` function to initialize the
necessary modules for our TransformerNMT model.

### Q9: (10 points) Implement the forward function
Complete the `forward` function in the TransformerNMT class



In [None]:
from torch import Tensor
import torch
import torch.nn as nn
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BOS_IDX, PAD_IDX, EOS_IDX, UNK_IDX = vocab.tgt["<s>"], vocab.tgt["<pad>"], vocab.tgt["</s>"], vocab.tgt["<unk>"]

class TransformerNMT(nn.Module):
    """ Neural Machine Translation Model with Transformer:
        - Encoder with stacked self-attention and feedforward layers
        - Decoder with stacked self-attention, encoder-decoder attention, and feedforward layers
    """
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 max_len: int = 5000,
                 dropout: float = 0.1):
        """ Init TransformerNMT NMT Model.
        @param num_encoder_layers (int): The number of sub-layers in the Encoder Transformer
        @param num_decoder_layers (int): The number of sub-layers in the Decoder Transformer
        @param emb_size (int): Hidden Size, the size of hidden states (dimensionality)
        @param nhead (int): The number of heads in the multiheadattention
        @param src_vocab_size (int): The vocab size of src languages
        @param tgt_vocab_size (int): The vocab size of tgt languages
        @param dim_feedforward (int): The dimension of the feedforward network model
        @param max_len (int): Maximum sequence length
        @param dropout (float): Dropout probability, for attention
        """

        super(TransformerNMT, self).__init__()

        self.src_embedding = None
        self.tgt_embedding = None

        self.transformer = None
        self.target_vocab_projection = None

        ### YOUR CODE HERE
        ### TODO - Initialize the following variables IN THIS ORDER:
        ###     self.src_embedding: Transformer Embedding Layer used for source language
        ###     self.tgt_embedding: Transformer Embedding Layer used for target language
        ###     self.transformer: Transformer layer
        ###     self.target_vocab_projection (Linear Layer with no bias), mapping hidden representation to the vocab distribution
        ###
        ### Use the following docs to properly initialize these variables:
        ###     Transformer
        ###         https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html

        self.src_embedding = TransformerEmbedding(src_vocab_size, emb_size, max_len, dropout)
        self.tgt_embedding = TransformerEmbedding(tgt_vocab_size, emb_size, max_len, dropout)

        self.transformer = nn.Transformer(emb_size, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, dropout)
        self.target_vocab_projection = nn.Linear(emb_size, tgt_vocab_size, bias=False)

        ### END YOUR CODE


    def forward(self,
                src: Tensor,
                tgt: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        """Forward pass through the neural machine translation (NMT) model.
        @param src (Tensor): Source sequence tensor.
        @param tgt (Tensor): Target sequence tensor.
        @param src_mask (Tensor): Mask for the source sequence.
        @param tgt_mask (Tensor): Mask for the target sequence.
        @param src_padding_mask (Tensor): Padding mask for the source sequence.
        @param tgt_padding_mask (Tensor): Padding mask for the target sequence.
        @param memory_key_padding_mask (Tensor): Padding mask for memory keys.

        @returns Tensor: Output tensor representing the NMT model's predictions for the target sequence.
        """

        ### Q9 APPROACH 01
        ### YOUR CODE HERE
        ### TODO - Implement the forward function:
        ###     1. Compute `src_emb` and `tgt_emb` from `src` and `tgt` using TransformerEmbedding,
        ###     which return shape (src_len, b, e).
        ###     src_len = maximum source sentence length, b = batch size, e = embedding size.
        ###     2. Apply the self.transformer to compute the decoder output with shape (tgt_len, b, e).
        ###     tgt_len = maximum target sentence length, b = batch size, e = embedding size.
        ###     3. Map the decoder output to the vocab distribution using self.target_vocab_projection and return it,
        ###     which has shape (tgt_len, b, tgt_vocab_size).

        # src_emb = self.src_embedding(src)
        # tgt_emb = self.tgt_embedding(tgt)
        # out_decode = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask,
        #                               src_key_padding_mask=src_padding_mask,
        #                               tgt_key_padding_mask=tgt_padding_mask,
        #                               memory_key_padding_mask=memory_key_padding_mask)
        # output = self.target_vocab_projection(out_decode)

        ### Use the following docs for how to use nn.Transformer forward function:
        ###     Transformer
        ###         https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html

        ### Q9 APPROACH 02
        ### YOUR CODE HERE
        ### TODO - Implement the forward function:
        ###     1. Compute `memory` from `src` and `src_mask` using self.encode,
        ###     which return shape (src_len, b, e).
        ###     2. Apply the self.decode to compute the decoder output with shape (tgt_len, b, e).
        ###     tgt_len = maximum target sentence length, b = batch size, e = embedding size.
        ###     3. Map the decoder output to the vocab distribution using self.target_vocab_projection and return it,
        ###     which has shape (tgt_len, b, tgt_vocab_size).

        memory = self.encode(src, src_mask, src_padding_mask)
        out_decode = self.decode(tgt, memory, tgt_mask, tgt_padding_mask, memory_key_padding_mask)
        output = self.target_vocab_projection(out_decode)

        return output
        ### END YOUR CODE

    def encode(self, src: Tensor, src_mask: Tensor, src_padding_mask):
        return self.transformer.encoder(self.src_embedding(src), src_mask, src_padding_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor, tgt_padding_mask,
               memory_key_padding_mask=None):
        # memory_key_padding_mask keeps the decoder from attending to padded source positions
        return self.transformer.decoder(self.tgt_embedding(tgt), memory, tgt_mask,
                                        tgt_key_padding_mask=tgt_padding_mask,
                                        memory_key_padding_mask=memory_key_padding_mask)


def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
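
To see what these masks contain, here is a minimal plain-Python sketch that mirrors the logic of `generate_square_subsequent_mask` and the padding masks with nested lists instead of tensors. The helper names, the padding index `0`, and the token ids are assumptions made only for this illustration.

```python
NEG_INF = float("-inf")

def causal_mask(size):
    """Row i may attend to columns 0..i (value 0.0); later columns get -inf."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)]
            for row in range(size)]

def padding_mask(time_major_ids, pad_idx):
    """Turn (seq_len, batch) ids into (batch, seq_len) booleans, True at padding."""
    return [[tok == pad_idx for tok in sentence] for sentence in zip(*time_major_ids)]

PAD = 0  # hypothetical padding index
ids = [[1, 1], [5, 7], [6, 2], [2, 0]]  # hypothetical (seq_len=4, batch=2) token ids

for row in causal_mask(3):
    print(row)
print(padding_mask(ids, PAD))
```

The causal mask is added to the attention scores, so `-inf` entries become zero weight after the softmax and a target position never sees later positions. The padding mask is boolean and marks the positions to skip in each sentence.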

## Train our full model



In [None]:
# @title Model Initialization

import torch
torch.manual_seed(0)

args.src_vocab_size = len(vocab.src)
args.tgt_vocab_size = len(vocab.tgt)
args.emb_size = 768
args.n_heads = 8
args.ffn_hid_dim = 768
args.batch_size = 32
args.n_encoder_layers = 3
args.n_decoder_layers = 3
args.lr = 1e-4
args.dropout = 0.1

transformer = TransformerNMT(args.n_encoder_layers, args.n_decoder_layers, args.emb_size,
                                 args.n_heads, args.src_vocab_size, args.tgt_vocab_size, args.ffn_hid_dim, dropout=args.dropout)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=args.lr, betas=(0.9, 0.98), eps=1e-8)



In [None]:
# @title Helper function
from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            try:
                txt_input = transform(txt_input)
            except Exception as e:
                if "device" in str(e):
                    print(e)
                    txt_input = transform(txt_input, device=device)
                else:
                    raise ValueError(e)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    # print(token_ids)
    return torch.cat((torch.tensor([BOS_IDX]).to(device),
                      token_ids,
                      torch.tensor([EOS_IDX]).to(device))).to(device)


def convert_to_tensor_src(txt):
    # print(torch.tensor([BOS_IDX]).shape, vocab.src.to_input_tensor(txt).shape, torch.tensor([EOS_IDX]).shape)
    # print(torch.tensor([BOS_IDX]), vocab.src.to_input_tensor(txt).reshape(1,), torch.tensor([EOS_IDX]))
    return torch.cat((torch.tensor([BOS_IDX]).view(1, 1).to(device),
                      vocab.src.to_input_tensor(txt).to(device),
                      torch.tensor([EOS_IDX]).view(1, 1).to(device)), dim=0).view(-1)

def convert_to_tensor_tgt(txt):
    return torch.cat((torch.tensor([BOS_IDX]).view(1, 1).to(device),
                      vocab.tgt.to_input_tensor(txt).to(device),
                      torch.tensor([EOS_IDX]).view(1, 1).to(device)), dim=0).view(-1)

# ``src`` and ``tgt`` language text transforms to convert raw strings into tensors indices
text_transform = {}

SRC_LANGUAGE, TGT_LANGUAGE = "vi", "en"
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    if ln == SRC_LANGUAGE:
        text_transform[ln] = convert_to_tensor_src # Add BOS/EOS and create tensor
    if ln == TGT_LANGUAGE:
        text_transform[ln] = convert_to_tensor_tgt# Add BOS/EOS and create tensor


# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        # print(src_sample, tgt_sample)
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
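
With its default `batch_first=False`, `pad_sequence` returns a batch of shape (max_len, batch_size). A minimal plain-Python stand-in that mimics this layout is sketched below; `pad_time_major`, the padding index `0`, and the token ids are assumptions for illustration and not part of the notebook.

```python
from typing import List

def pad_time_major(sequences: List[List[int]], padding_value: int) -> List[List[int]]:
    """Pad to the longest sequence and return rows indexed by time step."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose (batch, max_len) -> (max_len, batch).
    return [list(step) for step in zip(*padded)]

PAD = 0  # hypothetical padding index
batch = [[1, 5, 6, 2], [1, 7, 2]]  # hypothetical token ids, with 1 = <s> and 2 = </s>
time_major = pad_time_major(batch, PAD)
print(time_major)
```

Each row holds one time step across the whole batch, which is why the rest of the notebook slices along dimension 0 to shift the target and transposes before building the padding masks.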

In [None]:
# @title Training the model

from torch.utils.data import DataLoader

def train_epoch(model, optimizer, epoch):
    model.train()
    losses = 0
    # train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_data_small, batch_size=args.batch_size, collate_fn=collate_fn)
    # train_dataloader = batch_iter(train_data, BATCH_SIZE)

    for i, (src, tgt) in enumerate(train_dataloader):
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

        if i % args.log_every == 0:

            print('epoch %d, iter %d, losses %.2f, avg. loss %.2f'
                     % (epoch, i, losses, losses/(i+1)), file=sys.stderr)


    return losses / len(train_dataloader)  # number of batches, without iterating the loader again


@torch.no_grad()  # gradients are not needed during evaluation
def evaluate(model):
    model.eval()
    losses = 0

    # val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(dev_data_small, batch_size=args.batch_size, collate_fn=collate_fn)
    # val_dataloader = batch_iter(dev_data, BATCH_SIZE)

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / len(val_dataloader)  # number of batches, without iterating the loader again
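
Both functions feed the decoder `tgt[:-1]` and compare its predictions against `tgt[1:]`. A minimal sketch of that shift on a single padded sentence, using string tokens instead of indices purely for readability:

```python
BOS, EOS, PAD = "<s>", "</s>", "<pad>"

# One padded target sentence, laid out along the time dimension.
tgt = [BOS, "the", "cat", EOS, PAD]

tgt_input = tgt[:-1]   # what the decoder reads at each step
tgt_out = tgt[1:]      # what the decoder must predict at each step

for step, (read, predict) in enumerate(zip(tgt_input, tgt_out)):
    counted = predict != PAD  # padding targets are skipped, as with ignore_index=PAD_IDX
    print(step, read, "->", predict, "(counted)" if counted else "(ignored)")

num_counted = sum(tok != PAD for tok in tgt_out)
print("positions contributing to the loss:", num_counted)
```

At every step the decoder sees the tokens up to position `t` and is trained to predict the token at position `t + 1`; targets equal to the padding token do not contribute to the loss.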

We also pick a small subset of 50 samples for training and 50 samples for validation.

In [None]:
train_data_small = [val for val in train_data if len(val[0]) > 3][:50]
dev_data_small = [val for val in dev_data if len(val[0]) > 3][:50]

In [None]:
from timeit import default_timer as timer
args.max_epoch = 200

for epoch in range(args.max_epoch):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer, epoch)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))

epoch 0, iter 0, losses 10.00, avg. loss 10.00


Epoch: 0, Train loss: 9.789, Val loss: 9.355, Epoch time = 0.628s


epoch 1, iter 0, losses 9.31, avg. loss 9.31


Epoch: 1, Train loss: 9.221, Val loss: 9.094, Epoch time = 0.316s


epoch 2, iter 0, losses 9.01, avg. loss 9.01


Epoch: 2, Train loss: 8.931, Val loss: 8.901, Epoch time = 0.331s


epoch 3, iter 0, losses 8.77, avg. loss 8.77


Epoch: 3, Train loss: 8.689, Val loss: 8.743, Epoch time = 0.336s


epoch 4, iter 0, losses 8.55, avg. loss 8.55


Epoch: 4, Train loss: 8.470, Val loss: 8.594, Epoch time = 0.341s


epoch 5, iter 0, losses 8.32, avg. loss 8.32


Epoch: 5, Train loss: 8.240, Val loss: 8.449, Epoch time = 0.341s


epoch 6, iter 0, losses 8.10, avg. loss 8.10


Epoch: 6, Train loss: 8.020, Val loss: 8.304, Epoch time = 0.346s


epoch 7, iter 0, losses 7.88, avg. loss 7.88


Epoch: 7, Train loss: 7.791, Val loss: 8.159, Epoch time = 0.331s


epoch 8, iter 0, losses 7.66, avg. loss 7.66


Epoch: 8, Train loss: 7.571, Val loss: 8.015, Epoch time = 0.320s


epoch 9, iter 0, losses 7.43, avg. loss 7.43


Epoch: 9, Train loss: 7.336, Val loss: 7.874, Epoch time = 0.321s


epoch 10, iter 0, losses 7.20, avg. loss 7.20


Epoch: 10, Train loss: 7.109, Val loss: 7.737, Epoch time = 0.319s


epoch 11, iter 0, losses 6.99, avg. loss 6.99


Epoch: 11, Train loss: 6.891, Val loss: 7.606, Epoch time = 0.322s


epoch 12, iter 0, losses 6.77, avg. loss 6.77


Epoch: 12, Train loss: 6.667, Val loss: 7.482, Epoch time = 0.322s


epoch 13, iter 0, losses 6.57, avg. loss 6.57


Epoch: 13, Train loss: 6.462, Val loss: 7.367, Epoch time = 0.319s


epoch 14, iter 0, losses 6.38, avg. loss 6.38


Epoch: 14, Train loss: 6.265, Val loss: 7.264, Epoch time = 0.320s


epoch 15, iter 0, losses 6.19, avg. loss 6.19


Epoch: 15, Train loss: 6.080, Val loss: 7.173, Epoch time = 0.323s


epoch 16, iter 0, losses 6.03, avg. loss 6.03


Epoch: 16, Train loss: 5.911, Val loss: 7.095, Epoch time = 0.318s


epoch 17, iter 0, losses 5.87, avg. loss 5.87


Epoch: 17, Train loss: 5.755, Val loss: 7.032, Epoch time = 0.321s


epoch 18, iter 0, losses 5.74, avg. loss 5.74


Epoch: 18, Train loss: 5.611, Val loss: 6.981, Epoch time = 0.321s


epoch 19, iter 0, losses 5.61, avg. loss 5.61


Epoch: 19, Train loss: 5.480, Val loss: 6.943, Epoch time = 0.324s


epoch 20, iter 0, losses 5.50, avg. loss 5.50


Epoch: 20, Train loss: 5.362, Val loss: 6.910, Epoch time = 0.321s


epoch 21, iter 0, losses 5.40, avg. loss 5.40


Epoch: 21, Train loss: 5.259, Val loss: 6.883, Epoch time = 0.324s


epoch 22, iter 0, losses 5.31, avg. loss 5.31


Epoch: 22, Train loss: 5.166, Val loss: 6.864, Epoch time = 0.321s


epoch 23, iter 0, losses 5.24, avg. loss 5.24


Epoch: 23, Train loss: 5.083, Val loss: 6.850, Epoch time = 0.323s


epoch 24, iter 0, losses 5.17, avg. loss 5.17


Epoch: 24, Train loss: 5.010, Val loss: 6.847, Epoch time = 0.323s


epoch 25, iter 0, losses 5.12, avg. loss 5.12


Epoch: 25, Train loss: 4.946, Val loss: 6.839, Epoch time = 0.324s


epoch 26, iter 0, losses 5.06, avg. loss 5.06


Epoch: 26, Train loss: 4.888, Val loss: 6.838, Epoch time = 0.321s


epoch 27, iter 0, losses 5.02, avg. loss 5.02


Epoch: 27, Train loss: 4.836, Val loss: 6.860, Epoch time = 0.328s


epoch 28, iter 0, losses 4.99, avg. loss 4.99


Epoch: 28, Train loss: 4.804, Val loss: 6.876, Epoch time = 0.337s


epoch 29, iter 0, losses 4.95, avg. loss 4.95


Epoch: 29, Train loss: 4.767, Val loss: 6.879, Epoch time = 0.351s


epoch 30, iter 0, losses 4.96, avg. loss 4.96


Epoch: 30, Train loss: 4.741, Val loss: 6.903, Epoch time = 0.353s


epoch 31, iter 0, losses 4.91, avg. loss 4.91


Epoch: 31, Train loss: 4.698, Val loss: 6.879, Epoch time = 0.354s


epoch 32, iter 0, losses 4.88, avg. loss 4.88


Epoch: 32, Train loss: 4.674, Val loss: 6.915, Epoch time = 0.323s


epoch 33, iter 0, losses 4.85, avg. loss 4.85


Epoch: 33, Train loss: 4.646, Val loss: 6.909, Epoch time = 0.323s


epoch 34, iter 0, losses 4.82, avg. loss 4.82


Epoch: 34, Train loss: 4.619, Val loss: 6.910, Epoch time = 0.322s


epoch 35, iter 0, losses 4.81, avg. loss 4.81


Epoch: 35, Train loss: 4.593, Val loss: 6.918, Epoch time = 0.323s


epoch 36, iter 0, losses 4.78, avg. loss 4.78


Epoch: 36, Train loss: 4.558, Val loss: 6.904, Epoch time = 0.322s


epoch 37, iter 0, losses 4.77, avg. loss 4.77


Epoch: 37, Train loss: 4.537, Val loss: 6.933, Epoch time = 0.329s


epoch 38, iter 0, losses 4.75, avg. loss 4.75


Epoch: 38, Train loss: 4.508, Val loss: 6.919, Epoch time = 0.320s


epoch 39, iter 0, losses 4.73, avg. loss 4.73


Epoch: 39, Train loss: 4.487, Val loss: 6.931, Epoch time = 0.326s


epoch 40, iter 0, losses 4.70, avg. loss 4.70


Epoch: 40, Train loss: 4.461, Val loss: 6.928, Epoch time = 0.324s


epoch 41, iter 0, losses 4.68, avg. loss 4.68


Epoch: 41, Train loss: 4.438, Val loss: 6.931, Epoch time = 0.326s


epoch 42, iter 0, losses 4.66, avg. loss 4.66


Epoch: 42, Train loss: 4.415, Val loss: 6.927, Epoch time = 0.329s


epoch 43, iter 0, losses 4.64, avg. loss 4.64


Epoch: 43, Train loss: 4.389, Val loss: 6.954, Epoch time = 0.327s


epoch 44, iter 0, losses 4.62, avg. loss 4.62


Epoch: 44, Train loss: 4.367, Val loss: 6.930, Epoch time = 0.326s


epoch 45, iter 0, losses 4.60, avg. loss 4.60


Epoch: 45, Train loss: 4.347, Val loss: 6.952, Epoch time = 0.324s


epoch 46, iter 0, losses 4.58, avg. loss 4.58


Epoch: 46, Train loss: 4.316, Val loss: 6.940, Epoch time = 0.322s


epoch 47, iter 0, losses 4.56, avg. loss 4.56


Epoch: 47, Train loss: 4.296, Val loss: 6.957, Epoch time = 0.322s


epoch 48, iter 0, losses 4.54, avg. loss 4.54


Epoch: 48, Train loss: 4.271, Val loss: 6.943, Epoch time = 0.331s


epoch 49, iter 0, losses 4.53, avg. loss 4.53


Epoch: 49, Train loss: 4.244, Val loss: 6.969, Epoch time = 0.325s


epoch 50, iter 0, losses 4.50, avg. loss 4.50


Epoch: 50, Train loss: 4.213, Val loss: 6.939, Epoch time = 0.318s


epoch 51, iter 0, losses 4.48, avg. loss 4.48


Epoch: 51, Train loss: 4.192, Val loss: 6.994, Epoch time = 0.331s


epoch 52, iter 0, losses 4.46, avg. loss 4.46


Epoch: 52, Train loss: 4.168, Val loss: 6.960, Epoch time = 0.343s


epoch 53, iter 0, losses 4.44, avg. loss 4.44


Epoch: 53, Train loss: 4.145, Val loss: 7.014, Epoch time = 0.340s


epoch 54, iter 0, losses 4.43, avg. loss 4.43


Epoch: 54, Train loss: 4.124, Val loss: 6.977, Epoch time = 0.349s


epoch 55, iter 0, losses 4.40, avg. loss 4.40


Epoch: 55, Train loss: 4.099, Val loss: 7.013, Epoch time = 0.352s


epoch 56, iter 0, losses 4.37, avg. loss 4.37


Epoch: 56, Train loss: 4.062, Val loss: 6.983, Epoch time = 0.341s


epoch 57, iter 0, losses 4.35, avg. loss 4.35


Epoch: 57, Train loss: 4.043, Val loss: 7.010, Epoch time = 0.325s


epoch 58, iter 0, losses 4.32, avg. loss 4.32


Epoch: 58, Train loss: 4.008, Val loss: 7.024, Epoch time = 0.329s


epoch 59, iter 0, losses 4.29, avg. loss 4.29


Epoch: 59, Train loss: 3.982, Val loss: 7.020, Epoch time = 0.326s


epoch 60, iter 0, losses 4.28, avg. loss 4.28


Epoch: 60, Train loss: 3.965, Val loss: 7.001, Epoch time = 0.328s


epoch 61, iter 0, losses 4.30, avg. loss 4.30


Epoch: 61, Train loss: 3.958, Val loss: 7.110, Epoch time = 0.328s


epoch 62, iter 0, losses 4.32, avg. loss 4.32


Epoch: 62, Train loss: 3.970, Val loss: 7.029, Epoch time = 0.328s


epoch 63, iter 0, losses 4.25, avg. loss 4.25


Epoch: 63, Train loss: 3.905, Val loss: 7.061, Epoch time = 0.330s


epoch 64, iter 0, losses 4.24, avg. loss 4.24


Epoch: 64, Train loss: 3.895, Val loss: 7.028, Epoch time = 0.328s


epoch 65, iter 0, losses 4.24, avg. loss 4.24


Epoch: 65, Train loss: 3.917, Val loss: 7.089, Epoch time = 0.329s


epoch 66, iter 0, losses 4.18, avg. loss 4.18


Epoch: 66, Train loss: 3.891, Val loss: 7.099, Epoch time = 0.330s


epoch 67, iter 0, losses 4.16, avg. loss 4.16


Epoch: 67, Train loss: 3.813, Val loss: 7.026, Epoch time = 0.329s


epoch 68, iter 0, losses 4.15, avg. loss 4.15


Epoch: 68, Train loss: 3.823, Val loss: 7.058, Epoch time = 0.327s


epoch 69, iter 0, losses 4.10, avg. loss 4.10


Epoch: 69, Train loss: 3.768, Val loss: 7.099, Epoch time = 0.329s


epoch 70, iter 0, losses 4.09, avg. loss 4.09


Epoch: 70, Train loss: 3.740, Val loss: 7.076, Epoch time = 0.330s


epoch 71, iter 0, losses 4.06, avg. loss 4.06


Epoch: 71, Train loss: 3.712, Val loss: 7.089, Epoch time = 0.329s


epoch 72, iter 0, losses 4.05, avg. loss 4.05


Epoch: 72, Train loss: 3.690, Val loss: 7.095, Epoch time = 0.330s


epoch 73, iter 0, losses 4.00, avg. loss 4.00


Epoch: 73, Train loss: 3.647, Val loss: 7.120, Epoch time = 0.331s


epoch 74, iter 0, losses 4.00, avg. loss 4.00


Epoch: 74, Train loss: 3.629, Val loss: 7.139, Epoch time = 0.332s


epoch 75, iter 0, losses 3.97, avg. loss 3.97


Epoch: 75, Train loss: 3.591, Val loss: 7.090, Epoch time = 0.328s


epoch 76, iter 0, losses 3.94, avg. loss 3.94


Epoch: 76, Train loss: 3.565, Val loss: 7.137, Epoch time = 0.344s


epoch 77, iter 0, losses 3.91, avg. loss 3.91


Epoch: 77, Train loss: 3.540, Val loss: 7.143, Epoch time = 0.345s


epoch 78, iter 0, losses 3.89, avg. loss 3.89


Epoch: 78, Train loss: 3.519, Val loss: 7.162, Epoch time = 0.347s


epoch 79, iter 0, losses 3.87, avg. loss 3.87


Epoch: 79, Train loss: 3.494, Val loss: 7.152, Epoch time = 0.358s


epoch 80, iter 0, losses 3.85, avg. loss 3.85


Epoch: 80, Train loss: 3.468, Val loss: 7.154, Epoch time = 0.338s


epoch 81, iter 0, losses 3.83, avg. loss 3.83


Epoch: 81, Train loss: 3.449, Val loss: 7.193, Epoch time = 0.330s


epoch 82, iter 0, losses 3.79, avg. loss 3.79


Epoch: 82, Train loss: 3.422, Val loss: 7.195, Epoch time = 0.316s


epoch 83, iter 0, losses 3.77, avg. loss 3.77


Epoch: 83, Train loss: 3.387, Val loss: 7.193, Epoch time = 0.332s


epoch 84, iter 0, losses 3.75, avg. loss 3.75


Epoch: 84, Train loss: 3.343, Val loss: 7.206, Epoch time = 0.330s


epoch 85, iter 0, losses 3.74, avg. loss 3.74


Epoch: 85, Train loss: 3.341, Val loss: 7.240, Epoch time = 0.328s


epoch 86, iter 0, losses 3.70, avg. loss 3.70


Epoch: 86, Train loss: 3.309, Val loss: 7.207, Epoch time = 0.378s


epoch 87, iter 0, losses 3.69, avg. loss 3.69


Epoch: 87, Train loss: 3.282, Val loss: 7.268, Epoch time = 0.411s


epoch 88, iter 0, losses 3.66, avg. loss 3.66


Epoch: 88, Train loss: 3.250, Val loss: 7.268, Epoch time = 0.338s


epoch 89, iter 0, losses 3.62, avg. loss 3.62


Epoch: 89, Train loss: 3.220, Val loss: 7.253, Epoch time = 0.345s


epoch 90, iter 0, losses 3.61, avg. loss 3.61


Epoch: 90, Train loss: 3.197, Val loss: 7.270, Epoch time = 0.331s


epoch 91, iter 0, losses 3.58, avg. loss 3.58


Epoch: 91, Train loss: 3.183, Val loss: 7.287, Epoch time = 0.333s


epoch 92, iter 0, losses 3.56, avg. loss 3.56


Epoch: 92, Train loss: 3.156, Val loss: 7.364, Epoch time = 0.334s


epoch 93, iter 0, losses 3.63, avg. loss 3.63


Epoch: 93, Train loss: 3.164, Val loss: 7.302, Epoch time = 0.333s


epoch 94, iter 0, losses 3.59, avg. loss 3.59


Epoch: 94, Train loss: 3.139, Val loss: 7.372, Epoch time = 0.333s


epoch 95, iter 0, losses 3.58, avg. loss 3.58


Epoch: 95, Train loss: 3.148, Val loss: 7.289, Epoch time = 0.331s


epoch 96, iter 0, losses 3.51, avg. loss 3.51


Epoch: 96, Train loss: 3.154, Val loss: 7.340, Epoch time = 0.334s


epoch 97, iter 0, losses 3.46, avg. loss 3.46


Epoch: 97, Train loss: 3.077, Val loss: 7.409, Epoch time = 0.332s


epoch 98, iter 0, losses 3.48, avg. loss 3.48


Epoch: 98, Train loss: 3.034, Val loss: 7.344, Epoch time = 0.333s


epoch 99, iter 0, losses 3.45, avg. loss 3.45


Epoch: 99, Train loss: 3.019, Val loss: 7.338, Epoch time = 0.349s


epoch 100, iter 0, losses 3.41, avg. loss 3.41


Epoch: 100, Train loss: 2.980, Val loss: 7.421, Epoch time = 0.353s


epoch 101, iter 0, losses 3.39, avg. loss 3.39


Epoch: 101, Train loss: 2.954, Val loss: 7.389, Epoch time = 0.338s


epoch 102, iter 0, losses 3.36, avg. loss 3.36


Epoch: 102, Train loss: 2.929, Val loss: 7.389, Epoch time = 0.361s


epoch 103, iter 0, losses 3.34, avg. loss 3.34


Epoch: 103, Train loss: 2.898, Val loss: 7.414, Epoch time = 0.345s


epoch 104, iter 0, losses 3.29, avg. loss 3.29


Epoch: 104, Train loss: 2.851, Val loss: 7.431, Epoch time = 0.337s


epoch 105, iter 0, losses 3.29, avg. loss 3.29


Epoch: 105, Train loss: 2.842, Val loss: 7.432, Epoch time = 0.339s


epoch 106, iter 0, losses 3.25, avg. loss 3.25


Epoch: 106, Train loss: 2.805, Val loss: 7.462, Epoch time = 0.336s


epoch 107, iter 0, losses 3.23, avg. loss 3.23


Epoch: 107, Train loss: 2.783, Val loss: 7.462, Epoch time = 0.336s


epoch 108, iter 0, losses 3.20, avg. loss 3.20


Epoch: 108, Train loss: 2.765, Val loss: 7.487, Epoch time = 0.338s


epoch 109, iter 0, losses 3.17, avg. loss 3.17


Epoch: 109, Train loss: 2.725, Val loss: 7.492, Epoch time = 0.340s


epoch 110, iter 0, losses 3.15, avg. loss 3.15


Epoch: 110, Train loss: 2.706, Val loss: 7.536, Epoch time = 0.337s


epoch 111, iter 0, losses 3.14, avg. loss 3.14


Epoch: 111, Train loss: 2.686, Val loss: 7.517, Epoch time = 0.335s


epoch 112, iter 0, losses 3.11, avg. loss 3.11


Epoch: 112, Train loss: 2.656, Val loss: 7.525, Epoch time = 0.339s


epoch 113, iter 0, losses 3.07, avg. loss 3.07


Epoch: 113, Train loss: 2.649, Val loss: 7.558, Epoch time = 0.335s


epoch 114, iter 0, losses 3.03, avg. loss 3.03


Epoch: 114, Train loss: 2.589, Val loss: 7.548, Epoch time = 0.334s


epoch 115, iter 0, losses 3.03, avg. loss 3.03


Epoch: 115, Train loss: 2.573, Val loss: 7.606, Epoch time = 0.336s


epoch 116, iter 0, losses 3.02, avg. loss 3.02


Epoch: 116, Train loss: 2.562, Val loss: 7.605, Epoch time = 0.320s


epoch 117, iter 0, losses 2.99, avg. loss 2.99


Epoch: 117, Train loss: 2.548, Val loss: 7.637, Epoch time = 0.334s


epoch 118, iter 0, losses 2.96, avg. loss 2.96


Epoch: 118, Train loss: 2.550, Val loss: 7.610, Epoch time = 0.331s


epoch 119, iter 0, losses 2.95, avg. loss 2.95


Epoch: 119, Train loss: 2.500, Val loss: 7.587, Epoch time = 0.336s


epoch 120, iter 0, losses 2.91, avg. loss 2.91


Epoch: 120, Train loss: 2.458, Val loss: 7.684, Epoch time = 0.334s


epoch 121, iter 0, losses 2.91, avg. loss 2.91


Epoch: 121, Train loss: 2.443, Val loss: 7.649, Epoch time = 0.337s


epoch 122, iter 0, losses 2.87, avg. loss 2.87


Epoch: 122, Train loss: 2.409, Val loss: 7.719, Epoch time = 0.334s


epoch 123, iter 0, losses 2.84, avg. loss 2.84


Epoch: 123, Train loss: 2.378, Val loss: 7.679, Epoch time = 0.353s


epoch 124, iter 0, losses 2.81, avg. loss 2.81


Epoch: 124, Train loss: 2.352, Val loss: 7.713, Epoch time = 0.349s


epoch 125, iter 0, losses 2.78, avg. loss 2.78


Epoch: 125, Train loss: 2.329, Val loss: 7.744, Epoch time = 0.360s


epoch 126, iter 0, losses 2.75, avg. loss 2.75


Epoch: 126, Train loss: 2.312, Val loss: 7.755, Epoch time = 0.358s


epoch 127, iter 0, losses 2.72, avg. loss 2.72


Epoch: 127, Train loss: 2.271, Val loss: 7.727, Epoch time = 0.334s


epoch 128, iter 0, losses 2.71, avg. loss 2.71


Epoch: 128, Train loss: 2.257, Val loss: 7.798, Epoch time = 0.338s


epoch 129, iter 0, losses 2.70, avg. loss 2.70


Epoch: 129, Train loss: 2.239, Val loss: 7.806, Epoch time = 0.338s


epoch 130, iter 0, losses 2.70, avg. loss 2.70


Epoch: 130, Train loss: 2.225, Val loss: 7.782, Epoch time = 0.344s


epoch 131, iter 0, losses 2.68, avg. loss 2.68


Epoch: 131, Train loss: 2.209, Val loss: 7.851, Epoch time = 0.336s


epoch 132, iter 0, losses 2.62, avg. loss 2.62


Epoch: 132, Train loss: 2.184, Val loss: 7.805, Epoch time = 0.341s


epoch 133, iter 0, losses 2.60, avg. loss 2.60


Epoch: 133, Train loss: 2.157, Val loss: 7.849, Epoch time = 0.337s


epoch 134, iter 0, losses 2.58, avg. loss 2.58


Epoch: 134, Train loss: 2.131, Val loss: 7.871, Epoch time = 0.340s


epoch 135, iter 0, losses 2.54, avg. loss 2.54


Epoch: 135, Train loss: 2.098, Val loss: 7.867, Epoch time = 0.342s


epoch 136, iter 0, losses 2.51, avg. loss 2.51


Epoch: 136, Train loss: 2.060, Val loss: 7.889, Epoch time = 0.339s


epoch 137, iter 0, losses 2.48, avg. loss 2.48


Epoch: 137, Train loss: 2.049, Val loss: 7.888, Epoch time = 0.336s


epoch 138, iter 0, losses 2.49, avg. loss 2.49


Epoch: 138, Train loss: 2.023, Val loss: 7.933, Epoch time = 0.345s


epoch 139, iter 0, losses 2.45, avg. loss 2.45


Epoch: 139, Train loss: 1.993, Val loss: 7.980, Epoch time = 0.340s


epoch 140, iter 0, losses 2.42, avg. loss 2.42


Epoch: 140, Train loss: 1.968, Val loss: 7.957, Epoch time = 0.344s


epoch 141, iter 0, losses 2.39, avg. loss 2.39


Epoch: 141, Train loss: 1.936, Val loss: 7.989, Epoch time = 0.352s


epoch 142, iter 0, losses 2.36, avg. loss 2.36


Epoch: 142, Train loss: 1.918, Val loss: 7.961, Epoch time = 0.343s


epoch 143, iter 0, losses 2.34, avg. loss 2.34


Epoch: 143, Train loss: 1.900, Val loss: 8.012, Epoch time = 0.339s


epoch 144, iter 0, losses 2.32, avg. loss 2.32


Epoch: 144, Train loss: 1.882, Val loss: 8.048, Epoch time = 0.335s


epoch 145, iter 0, losses 2.30, avg. loss 2.30


Epoch: 145, Train loss: 1.885, Val loss: 8.057, Epoch time = 0.341s


epoch 146, iter 0, losses 2.28, avg. loss 2.28


Epoch: 146, Train loss: 1.875, Val loss: 8.038, Epoch time = 0.352s


epoch 147, iter 0, losses 2.29, avg. loss 2.29


Epoch: 147, Train loss: 1.873, Val loss: 8.130, Epoch time = 0.352s


epoch 148, iter 0, losses 2.31, avg. loss 2.31


Epoch: 148, Train loss: 1.849, Val loss: 8.158, Epoch time = 0.357s


epoch 149, iter 0, losses 2.25, avg. loss 2.25


Epoch: 149, Train loss: 1.797, Val loss: 8.093, Epoch time = 0.368s


epoch 150, iter 0, losses 2.22, avg. loss 2.22


Epoch: 150, Train loss: 1.774, Val loss: 8.095, Epoch time = 0.340s


epoch 151, iter 0, losses 2.17, avg. loss 2.17


Epoch: 151, Train loss: 1.733, Val loss: 8.101, Epoch time = 0.340s


epoch 152, iter 0, losses 2.16, avg. loss 2.16


Epoch: 152, Train loss: 1.712, Val loss: 8.146, Epoch time = 0.340s


epoch 153, iter 0, losses 2.12, avg. loss 2.12


Epoch: 153, Train loss: 1.687, Val loss: 8.193, Epoch time = 0.341s


epoch 154, iter 0, losses 2.11, avg. loss 2.11


Epoch: 154, Train loss: 1.685, Val loss: 8.224, Epoch time = 0.345s


epoch 155, iter 0, losses 2.08, avg. loss 2.08


Epoch: 155, Train loss: 1.652, Val loss: 8.267, Epoch time = 0.341s


epoch 156, iter 0, losses 2.04, avg. loss 2.04


Epoch: 156, Train loss: 1.619, Val loss: 8.238, Epoch time = 0.340s


epoch 157, iter 0, losses 2.03, avg. loss 2.03


Epoch: 157, Train loss: 1.603, Val loss: 8.298, Epoch time = 0.345s


epoch 158, iter 0, losses 2.00, avg. loss 2.00


Epoch: 158, Train loss: 1.577, Val loss: 8.351, Epoch time = 0.340s


epoch 159, iter 0, losses 1.97, avg. loss 1.97


Epoch: 159, Train loss: 1.555, Val loss: 8.295, Epoch time = 0.341s


epoch 160, iter 0, losses 1.94, avg. loss 1.94


Epoch: 160, Train loss: 1.539, Val loss: 8.314, Epoch time = 0.350s


epoch 161, iter 0, losses 1.92, avg. loss 1.92


Epoch: 161, Train loss: 1.521, Val loss: 8.317, Epoch time = 0.342s


epoch 162, iter 0, losses 1.90, avg. loss 1.90


Epoch: 162, Train loss: 1.504, Val loss: 8.320, Epoch time = 0.349s


epoch 163, iter 0, losses 1.91, avg. loss 1.91


Epoch: 163, Train loss: 1.505, Val loss: 8.358, Epoch time = 0.340s


epoch 164, iter 0, losses 1.90, avg. loss 1.90


Epoch: 164, Train loss: 1.488, Val loss: 8.370, Epoch time = 0.352s


epoch 165, iter 0, losses 1.90, avg. loss 1.90


Epoch: 165, Train loss: 1.472, Val loss: 8.437, Epoch time = 0.340s


epoch 166, iter 0, losses 1.82, avg. loss 1.82


Epoch: 166, Train loss: 1.438, Val loss: 8.483, Epoch time = 0.346s


epoch 167, iter 0, losses 1.80, avg. loss 1.80


Epoch: 167, Train loss: 1.402, Val loss: 8.447, Epoch time = 0.348s


epoch 168, iter 0, losses 1.79, avg. loss 1.79


Epoch: 168, Train loss: 1.411, Val loss: 8.444, Epoch time = 0.348s


epoch 169, iter 0, losses 1.82, avg. loss 1.82


Epoch: 169, Train loss: 1.405, Val loss: 8.432, Epoch time = 0.367s


epoch 170, iter 0, losses 1.74, avg. loss 1.74


Epoch: 170, Train loss: 1.363, Val loss: 8.523, Epoch time = 0.355s


epoch 171, iter 0, losses 1.71, avg. loss 1.71


Epoch: 171, Train loss: 1.345, Val loss: 8.575, Epoch time = 0.374s


epoch 172, iter 0, losses 1.72, avg. loss 1.72


Epoch: 172, Train loss: 1.326, Val loss: 8.542, Epoch time = 0.372s


epoch 173, iter 0, losses 1.68, avg. loss 1.68


Epoch: 173, Train loss: 1.304, Val loss: 8.571, Epoch time = 0.357s


epoch 174, iter 0, losses 1.65, avg. loss 1.65


Epoch: 174, Train loss: 1.284, Val loss: 8.630, Epoch time = 0.348s


epoch 175, iter 0, losses 1.63, avg. loss 1.63


Epoch: 175, Train loss: 1.252, Val loss: 8.661, Epoch time = 0.349s


epoch 176, iter 0, losses 1.63, avg. loss 1.63


Epoch: 176, Train loss: 1.253, Val loss: 8.681, Epoch time = 0.345s


epoch 177, iter 0, losses 1.58, avg. loss 1.58


Epoch: 177, Train loss: 1.223, Val loss: 8.658, Epoch time = 0.349s


epoch 178, iter 0, losses 1.59, avg. loss 1.59


Epoch: 178, Train loss: 1.209, Val loss: 8.660, Epoch time = 0.349s


epoch 179, iter 0, losses 1.56, avg. loss 1.56


Epoch: 179, Train loss: 1.200, Val loss: 8.720, Epoch time = 0.343s


epoch 180, iter 0, losses 1.52, avg. loss 1.52


Epoch: 180, Train loss: 1.172, Val loss: 8.783, Epoch time = 0.347s


epoch 181, iter 0, losses 1.49, avg. loss 1.49


Epoch: 181, Train loss: 1.158, Val loss: 8.740, Epoch time = 0.350s


epoch 182, iter 0, losses 1.48, avg. loss 1.48


Epoch: 182, Train loss: 1.155, Val loss: 8.711, Epoch time = 0.348s


epoch 183, iter 0, losses 1.45, avg. loss 1.45


Epoch: 183, Train loss: 1.114, Val loss: 8.756, Epoch time = 0.352s


epoch 184, iter 0, losses 1.47, avg. loss 1.47


Epoch: 184, Train loss: 1.123, Val loss: 8.801, Epoch time = 0.350s


epoch 185, iter 0, losses 1.43, avg. loss 1.43


Epoch: 185, Train loss: 1.085, Val loss: 8.846, Epoch time = 0.351s


epoch 186, iter 0, losses 1.41, avg. loss 1.41


Epoch: 186, Train loss: 1.086, Val loss: 8.853, Epoch time = 0.349s


epoch 187, iter 0, losses 1.41, avg. loss 1.41


Epoch: 187, Train loss: 1.071, Val loss: 8.862, Epoch time = 0.346s


epoch 188, iter 0, losses 1.40, avg. loss 1.40


Epoch: 188, Train loss: 1.059, Val loss: 8.938, Epoch time = 0.345s


epoch 189, iter 0, losses 1.38, avg. loss 1.38


Epoch: 189, Train loss: 1.040, Val loss: 8.910, Epoch time = 0.357s


epoch 190, iter 0, losses 1.34, avg. loss 1.34


Epoch: 190, Train loss: 1.028, Val loss: 8.907, Epoch time = 0.351s


epoch 191, iter 0, losses 1.33, avg. loss 1.33


Epoch: 191, Train loss: 1.006, Val loss: 8.958, Epoch time = 0.370s


epoch 192, iter 0, losses 1.35, avg. loss 1.35


Epoch: 192, Train loss: 1.016, Val loss: 9.012, Epoch time = 0.367s


epoch 193, iter 0, losses 1.29, avg. loss 1.29


Epoch: 193, Train loss: 0.972, Val loss: 9.077, Epoch time = 0.369s


epoch 194, iter 0, losses 1.28, avg. loss 1.28


Epoch: 194, Train loss: 0.970, Val loss: 9.078, Epoch time = 0.371s


epoch 195, iter 0, losses 1.25, avg. loss 1.25


Epoch: 195, Train loss: 0.956, Val loss: 9.020, Epoch time = 0.360s


epoch 196, iter 0, losses 1.24, avg. loss 1.24


Epoch: 196, Train loss: 0.935, Val loss: 9.038, Epoch time = 0.358s


epoch 197, iter 0, losses 1.19, avg. loss 1.19


Epoch: 197, Train loss: 0.889, Val loss: 9.027, Epoch time = 0.353s


epoch 198, iter 0, losses 1.25, avg. loss 1.25


Epoch: 198, Train loss: 0.927, Val loss: 8.998, Epoch time = 0.353s


epoch 199, iter 0, losses 1.20, avg. loss 1.20


Epoch: 199, Train loss: 0.905, Val loss: 9.149, Epoch time = 0.350s


## Testing the model

Greedy Decoding: At each step, pick the most probable token.

The most straightforward decoding strategy is greedy: at each step, generate the token with the highest probability. This is a good baseline for testing our model, but the method is inherently flawed: the best token at the current step does not necessarily lead to the best overall sequence.
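
To make the weakness of greedy decoding concrete, here is a minimal sketch on a toy next-token table. The table `TOY_PROBS` and the helpers `toy_greedy` and `toy_exhaustive` are hypothetical names made up for this illustration; they are not part of the assignment code and do not involve the Transformer model.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}


def toy_greedy(start="<s>", steps=2):
    """Pick the most probable next token at every step."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        dist = TOY_PROBS[seq[-1]]
        tok = max(dist, key=dist.get)  # locally best token
        logp += math.log(dist[tok])
        seq.append(tok)
    return seq[1:], math.exp(logp)


def toy_exhaustive(start="<s>"):
    """Score every two-token sequence and keep the most probable one."""
    best_seq, best_p = None, 0.0
    for first, p1 in TOY_PROBS[start].items():
        for second, p2 in TOY_PROBS[first].items():
            if p1 * p2 > best_p:
                best_seq, best_p = [first, second], p1 * p2
    return best_seq, best_p


greedy_seq, greedy_p = toy_greedy()
best_seq, best_p = toy_exhaustive()
print("greedy:    ", greedy_seq, round(greedy_p, 2))
print("exhaustive:", best_seq, round(best_p, 2))
```

In this toy table, greedy commits to `a` because 0.6 > 0.4, yet the sequence starting with `b` ends up more probable overall (0.4 × 0.9 versus 0.6 × 0.55). Beam search is a common way to reduce this gap by keeping several candidate prefixes at each step.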

## Q10: (5 points) Greedy Decoding
Implement the `greedy_decode` function, which generates the output sequence
one token at a time with our TransformerNMT model.

In [None]:
torch.ones(1, 1).fill_(10).type(torch.long).to(DEVICE)

tensor([[10]], device='cuda:0')

In [None]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    """ Generates a sequence using greedy decoding based on the given model.
    @param model (torch.nn.Module): The transformer model used for decoding.
    @param src (torch.Tensor): The source sequence tensor of shape (src_seq_len, 1).
    @param src_mask (torch.Tensor): The mask for the source sequence of shape (src_seq_len, src_seq_len).
    @param max_len (int): The maximum length of the generated sequence.
    @param start_symbol (int): The starting symbol for the decoding process.

    @returns: torch.Tensor: The generated sequence tensor of shape (tgt_seq_len, 1), with 0 < tgt_seq_len <= max_len.

    Note:
        The decoding process is performed using greedy decoding, where at each step,
        the model predicts the next word in the sequence based on the highest probability.
    """

    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    ### YOUR CODE HERE (~12 lines)
    memory = model.encode(src, src_mask, src_padding_mask=None)

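    for i in range(max_len - 1):
        # Causal mask so each position attends only to earlier target tokens
        tgt_mask = generate_square_subsequent_mask(ys.size(0)).to(DEVICE)

        out = model.decode(ys, memory, tgt_mask, tgt_padding_mask=None)

        # Project the last time step onto the target vocabulary and take the argmax
        prob = model.target_vocab_projection(out[-1, :, :])
        next_word = torch.argmax(prob, dim=1).unsqueeze(1).to(DEVICE)

        ys = torch.cat([ys, next_word], dim=0)

        # Stop at end-of-sentence (assumes EOS_IDX is defined alongside BOS_IDX)
        if next_word.item() == EOS_IDX:
            break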

    ### END YOUR CODE
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 20, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab.tgt.indices2words(list(tgt_tokens.cpu().numpy()))).replace("<s>", "").replace("</s>", "")
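
The string returned by `translate` still contains SentencePiece pieces separated by spaces, with `▁` marking word boundaries. As a minimal sketch of how such pieces could be turned back into readable text, the helper below joins them and restores the spaces. The name `pieces_to_text` and its `specials` parameter are hypothetical and introduced only for this illustration.

```python
def pieces_to_text(pieces, specials=("<s>", "</s>")):
    """Join subword pieces into plain text.

    Assumes the SentencePiece convention that '\u2581' (shown as '▁')
    marks the start of a new word; `specials` are dropped.
    """
    kept = [p for p in pieces if p not in specials]
    text = "".join(kept).replace("\u2581", " ")
    # Collapse repeated spaces and trim both ends
    return " ".join(text.split())


example = ["<s>", "▁Rachel", "▁P", "ike", "▁", ":", "▁The", "▁science", "</s>"]
print(pieces_to_text(example))
```

Applying a helper like this to the list returned by `vocab.tgt.indices2words(...)`, before the pieces are joined with spaces, would make the printed translations easier to compare with the references.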

In [None]:
# @title Testing on training set
num_logs = 5
if (num_logs > len(train_data_small)):
  num_logs = len(train_data_small)

for i in range(num_logs):
  src_sentence, tgt_sentence = train_data_small[i]
  translation = translate(transformer, src_sentence)
  print(f"Sample {i}:")
  print("Source: " + " ".join(src_sentence))
  print("Reference: " + " ".join(tgt_sentence).replace("<s>", "").replace("</s>", ""))
  print("Translation: " + translation)
  print('\n')

Sample 0:
Source: ▁Khoa ▁học ▁đ ằng ▁sau ▁một ▁tiêu ▁đề ▁về ▁khí ▁hậu
Reference:  ▁Rachel ▁P ike ▁ : ▁The ▁science ▁ behind ▁a ▁climate ▁headline
Translation:   ▁We ▁have ▁get ▁get ▁special ▁flight ▁clearance ▁.


Sample 1:
Source: ▁Tro ng ▁4 ▁phút ▁, ▁chuyên ▁gia ▁hoá ▁học ▁khí ▁quyển ▁Rachel ▁P ike ▁giới ▁t hiệu ▁sơ ▁lược ▁về ▁những ▁nỗ ▁lực ▁khoa ▁học ▁m iệt ▁mà i ▁đ ằng ▁sau ▁những ▁tiêu ▁đề ▁táo ▁bạo ▁về ▁biến ▁đổi ▁khí ▁hậu ▁, ▁cùng ▁với ▁đoàn ▁nghiên ▁cứu ▁của ▁mình ▁-- ▁hàng ▁ngàn ▁người ▁đã ▁cố ng ▁ hiến ▁cho ▁dự ▁án ▁này ▁-- ▁một ▁chuyến ▁bay ▁mạo ▁hiểm ▁qua ▁rừng ▁già ▁để ▁tìm ▁kiếm ▁thông ▁tin ▁về ▁một ▁phân ▁tử ▁the n ▁chố t ▁.
R

In [None]:
# @title Testing on evaluate set
num_logs = 5
if (num_logs > len(dev_data_small)):
  num_logs = len(dev_data_small)

for i in range(num_logs):
  src_sentence, tgt_sentence = dev_data_small[i]
  translation = translate(transformer, src_sentence)
  print(f"Sample {i}:")
  print("Source: " + " ".join(src_sentence))
  print("Reference: " + " ".join(tgt_sentence).replace("<s>", "").replace("</s>", ""))
  print("Translation: " + translation)
  print('\n')

Sample 0:
Source: ▁Là m ▁sao ▁tôi ▁có ▁thể ▁trình ▁bày ▁trong ▁10 ▁phút ▁về ▁sợ i ▁dây ▁liên ▁kết ▁những ▁người ▁phụ ▁nữ ▁qua ▁ba ▁thế ▁hệ ▁, ▁về ▁việc ▁làm ▁thế ▁nào ▁những ▁sợ i ▁dây ▁mạnh ▁mẽ ▁đáng ▁kinh ▁ngạc ▁ấy ▁đã ▁ní u ▁ch ặt ▁lấy ▁cuộc ▁sống ▁của ▁một ▁cô ▁bé ▁bốn ▁tuổi ▁co ▁qu ắp ▁với ▁đ ứa ▁em ▁gái ▁nhỏ ▁của ▁cô ▁bé ▁, ▁với ▁mẹ ▁và ▁bà ▁trong ▁suốt ▁năm ▁ngày ▁đêm ▁trên ▁con ▁thuyền ▁nhỏ ▁lên h ▁đê nh ▁trên ▁B iển ▁Đông ▁hơn ▁30 ▁năm ▁trước ▁, ▁những ▁sợ i ▁dây ▁liên ▁kết ▁đã ▁ní u ▁lấy ▁cuộc ▁đời ▁cô ▁bé ▁ấy ▁và ▁không ▁bao ▁giờ ▁ rời ▁đi ▁-- ▁cô ▁bé ▁ấy ▁giờ ▁sống ▁ở ▁San ▁Francis co ▁và ▁đang

# Great work! You have completed all the tasks in this assignment 👏