# Assignment 3

You have now learnt about sequence-to-sequence models in the context of machine translation. In the upcoming lectures, you will see that these models are useful for a wide variety of tasks, wherever the source sequence and the target sequence have different lengths. You have also been introduced to phonemes, which are the building blocks of speech. 

In this assignment, you will build sequence-to-sequence models for pronunciation prediction of English words, which simply means that given a word (sequence of characters), the model should predict its pronunciation (sequence of phonemes). You should be able to see why this is a straightforward application of sequence-to-sequence models.

The input is a sequence of characters making up an English word e.g. `a l g e b r a i c a l l y`. The output should be a sequence of phonemes that describe the pronunciation. For the example above, the desired output should be `AE2 L JH AH0 B R EY1 IH0 K L IY0`. The data for this task was obtained from the [CMU Pronunciation Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict). We will use a small subset of the CMU dict data.

# Setup

For this assignment, as in the previous one, we will be using Google Colab, for both code as well as descriptive questions. Your task is to finish all the questions in the Colab notebook and then upload a PDF version of the notebook, and a viewable link on Gradescope. 

### Google colaboratory

Before getting started, get familiar with google colaboratory:
https://colab.research.google.com/notebooks/welcome.ipynb

This is a neat python environment that works in the cloud and does not require you to
set up anything on your personal machine
(it also has some built-in IDE features that make writing code easier).
Moreover, it allows you to copy any existing collaboratory file, alter it and share
with other people.

### Submission

Before you start working on this homework do the following steps:

1. Press __File > Save a copy in Drive...__ tab. This will allow you to have your own copy and change it.
2. Follow all the steps in this collaboratory file and write / change / uncomment code as necessary.
3. Do not forget to occasionally press __File > Save__ tab to save your progress.
4. After all the changes are done and progress is saved press __Share__ button (top right corner of the page), press __get shareable link__ and make sure you have the option __Anyone with the link can view__ selected.
5. After completing the notebook, press __File > Download .ipynb__ to download a local copy on your computer, and then use `jupyter nbconvert --to pdf intro-hlt-hw3.ipynb` to convert the notebook into PDF format for uploading on Gradescope.



```
# Paste your Colab notebook link here
https://colab.research.google.com/drive/1iY8kLGBpxmfc-BPYYQP7DrpQ_zJ5JUB0?usp=sharing
```



# Dataset

For your convenience, we have preselected a subset of the CMU pronunciation dictionary (input-output pairs) and split the subset into training, validation (dev) and test sets.

In [1]:
!wget https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-seq2seq-hw/cmudict.{dev,train,test}.src -q -nc
!wget https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-seq2seq-hw/cmudict.{dev,train,test}.tgt -q -nc
!wget https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-seq2seq-hw/cmudict.small.train.src -q -nc
!wget https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-seq2seq-hw/cmudict.small.train.tgt -q -nc

In [2]:
# This is the raw contents of the .src and .tgt files
# The data for this assignment is a subset of cmudict https://github.com/cmusphinx/cmudict
!pr -m -t cmudict.small.train.src cmudict.small.train.tgt | head

s t a g g e r i n g l y		    S T AE1 G ER0 IH2 NG L IY2
p o c a h o n t a s		    P OW2 K AH0 HH AA1 N T AH0 S
h y p e r b o r e a n		    HH AY2 P ER0 B AO1 R IY0 AH0 N
s p a g n o l i			    S P AA0 G N OW1 L IY0
m i n s t a r ' s		    M IH1 N S T AA2 R Z
a n a l o g i c			    AE2 N AH0 L AA1 JH IH0 K
p o i g n a n t l y		    P OY1 N Y AH0 N T L IY0
m c l a u r i n			    M AH0 K L AO1 R AH0 N
c o n v e r s a t i o n s	    K AA2 N V ER0 S EY1 SH AH0 N Z
g r o s s e n b a c h e r	    G R AA1 S IH0 N B AA0 K ER0


In [3]:
import math
import random
import torch
import time
random.seed(1234)
torch.manual_seed(1234)
torch.cuda.set_device(0)

# Data Reader

We will use the `ParallelCorpus` class below to load, process and iterate through the data. Since you worked quite a bit on data loading implementation in the previous assignment, we will provide you the loader for this one, so you can get on with the more interesting parts of this assignment.

Note the use of special symbols `<BOS>`, `<EOS>` and `<UNK>`. All the sequences (both input and output) are made to begin with `<BOS>` (Begin of sequence) and end with `<EOS>` (End of sequence). The `<UNK>` symbol is used if we encouter any new symbol that we have not seen (unknown symbol).

In [5]:
SPL_SYMS = ['<BOS>', '<EOS>', '<UNK>']


class ParallelCorpus(object):
    def __init__(self,
                 src_file, tgt_file,
                 src_vocab=None, tgt_vocab=None):
        self.src_vocab = self.make_vocab(src_file, src_vocab)
        self.tgt_vocab = self.make_vocab(tgt_file, tgt_vocab)
        self.src_idx2vocab = self.make_idx2vocab(self.src_vocab)
        self.tgt_idx2vocab = self.make_idx2vocab(self.tgt_vocab)
        self.src_data = self.numberize(src_file, self.src_vocab)
        self.tgt_data = self.numberize(tgt_file, self.tgt_vocab) if tgt_file is not None else None
        assert len(self.src_data) == len(self.tgt_data), 'Source and Target have unequal lengths!'
        self.data_size = len(self.src_data)

    def numberize(self, txt, vocab):
        data = []
        with open(txt, 'r', encoding='utf8') as corpus:
            for l in corpus:
                d = [vocab['<BOS>']] + [vocab.get(tok, vocab['<UNK>']) for tok in l.strip().split()] + [vocab['<EOS>']]
                d = torch.Tensor(d).long()
                d = d.unsqueeze(0) # shape = (1, N)
                data.append((d, l.strip()))
        return data

    def make_idx2vocab(self, vocab):
        if vocab is not None:
            idx2vocab = {v: k for k, v in vocab.items()}
            return idx2vocab
        else:
            return None

    def make_vocab(self, txt, vocab):
        if vocab is None and txt is not None:
            v = {i: idx for idx, i in enumerate(SPL_SYMS)}
            with open(txt, 'r', encoding='utf8') as corpus:
                for line in corpus:
                    for token in line.strip().split():
                        v[token] = v.get(token, len(v))
            return v
        else:
            return vocab

    def get(self, idx):
        if self.tgt_data is not None:
            return self.src_data[idx], self.tgt_data[idx]
        else:
            return self.src_data[idx], (None, None)

In [6]:
train_corpus = ParallelCorpus('cmudict.small.train.src',
                              'cmudict.small.train.tgt')
dev_corpus = ParallelCorpus('cmudict.dev.src', 
                            'cmudict.dev.tgt',
                            train_corpus.src_vocab, 
                            train_corpus.tgt_vocab)
test_corpus = ParallelCorpus('cmudict.test.src',
                             'cmudict.test.tgt',
                             train_corpus.src_vocab, 
                             train_corpus.tgt_vocab)
print(train_corpus.data_size, dev_corpus.data_size, test_corpus.data_size)

10000 2000 5000


## Part 1: Seq2Seq with LSTM Language Models

In the first part of the assignment we will build a simple seq2seq model that should learn to predict the pronunciation of English words. The `EncoderDecoder` model consists (as the name suggests) of an `Encoder` and a `Decoder`, both of which are instances of an LSTM Language Model. LSTMs (Long Short-Term Memory) are an extention of RNNs (that you used for Assignment 2) designed to capture long(er) term dependencies in any sequence. The key difference between an LSTM and an RNN is that LSTMs keep track of a `cell state` in addition to a `hidden state`. If you are unfamiliar with LSTMs [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) is an excellent resource. Note: you don't have to memorize the internals of an LSTM, for this assignment just knowing that LSTMs expect three inputs 1. a representation of a word 2. previous hidden state and 3. the previous cell state is sufficient. In PyTorch LSTMs "wrap" the hidden state (lets call it `h`) and the cell state (lets call it `c`) into a tuple `(h,c)`.

As mentioned before, the Encoder and Decoder are going to be instances of an LSTM LM. Complete the code block below to set up the `LSTMLM` class. _(15 points)_

In [19]:
class LSTMLM(torch.nn.Module):
  def __init__(self,
              vocab_size,
              embedding_size,
              hidden_size,
              num_layers=1,
              dropout=0.1):
    super().__init__()
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.dropout = torch.nn.Dropout(dropout)
    
    #TODO: create an embedding layer here
    #TODO: the embedding layer takes a sequence of ints and converts them into a sequence of real-valued vectors
    #TODO: see https://pytorch.org/docs/stable/nn.html?highlight=embedding#torch.nn.Embedding
    self.embedding = torch.nn.Embedding(self.vocab_size,self.embedding_size)

    #TODO: create a unidirectional RNN-LSTM here
    #TODO: Note: use the batch_first=True option, 
    #TODO: the rest of the code assumes the first dimension of any tensor is the batch_size (which is 1 for simplicity)
    #TODO: see https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
    self.rnn = torch.nn.LSTM(self.embedding_size,self.hidden_size,self.num_layers,batch_first=True,dropout=dropout)

    #TODO: create a Linear layer
    #TODO: https://pytorch.org/docs/stable/nn.html?highlight=linear#torch.nn.Linear
    self.output = torch.nn.Linear(self.hidden_size,self.vocab_size)
           

  def forward(self, x, init_hidden_state=None):
    # x should be of shape (1, N) where N is the length of the source
    assert x.shape[0] == 1 # only supporting batch size of 1 to keep the implementation simple
    x_len = len(x)
    x = x.cuda() if self.embedding.weight.is_cuda else x
    #TODO: use `self.embedding` to convert x into a sequence of vector representations
    #TODO: name the result `emb` it should have shape (1, N, embedding_size)
    emb = self.embedding(x)
    emb = self.dropout(emb) # keep this line here

    #TODO: supply the embeddings `emb` to the RNN-LSTM
    #TODO: the LSTM requires a tuple as it's initial state (h0, c0) 
    #TODO: where `h0` is the initial hidden state and `c0` is the cell_state
    #TODO: if the argument init_hidden_state is None, create a tuple (h0, c0) where h0=zeros and c0=zeros.
    #TODO: if init_hidden_state is not None, then supply it as the second argument to the RNN-LSTM

    if init_hidden_state is None:
      h = torch.zeros((self.num_layers, x_len, self.hidden_size))
      c = torch.zeros((self.num_layers, x_len, self.hidden_size))
      h = h.cuda() if self.embedding.weight.is_cuda else h
      c = c.cuda() if self.embedding.weight.is_cuda else c
    else:
      h, c = init_hidden_state

    #TODO: the LSTM will output 2 objects
    #TODO: the first is a tensor representing the hidden states of the encoder (name the first object `hidden_states`)
    #TODO: the second is a tuple representing (final hidden state, final cell state)
    #TODO: each item in the tuple is a tensor. 
    #TODO: name the tuple as `final_state`.
    #TODO: hidden_states should have size (1, N, hidden_size)
    #TODO: each tensor in the final_state should have size (1, 1, hidden_size)

    hidden_states, final_state = self.rnn(emb, (h, c))
  
    hidden_states = self.dropout(hidden_states) # keep this line here

    #TODO: supply the hidden_states (after dropout) to the ourput layer
    #TODO: the output layer will "project" from hidden_size to output_size
    #TODO: i.e. the result of the output layer will be of size (1, N, vocab_size)
    #TODO: https://pytorch.org/docs/stable/nn.html?highlight=linear#torch.nn.Linear
    #TODO: name the result of the output layer as "output_dist"
    #hidden_states = hidden_states.reshape(hidden_states.size(0)*hidden_states.size(1),hidden_states.size(2))
    output_dist = self.output(hidden_states)

    return output_dist, hidden_states, final_state # do not change this line
    
  def generate(self, start_idx, end_idx, init_hidden_state=None, idx2vocab=None, max_len=50):
    """ Performs the sequence generation process
    Args:
      start_idx: is the starting symbol (i.e. the integer corresponding to the <BOS> symbol)
      end_idx: is the end symbol (i.e. the integer corresponding to the <EOS> symbol)
      init_hidden_state: (optional) is the tuple (hidden_state, cell_state) at the starting step of generation
      If init_hidden_state is None, zero vectors are used for the hidden and cell states.
      idx2vocab: A dictionary to convert integer representation of symbols to human-readable symbols i.e. characters.
      max_len: The maximum length of the generated sequence.
    Returns:
      out: A List of generated characters
    """
    if init_hidden_state is None:
      h = torch.zeros((self.num_layers, 1, self.hidden_size))
      c = torch.zeros((self.num_layers, 1, self.hidden_size))
      h = h.cuda() if self.embedding.weight.is_cuda else h
      c = c.cuda() if self.embedding.weight.is_cuda else c
    else:
      h, c = init_hidden_state
    
    inp = torch.tensor([start_idx]).long().unsqueeze(0)
    inp = inp.cuda() if self.embedding.weight.is_cuda else inp

    out = []
    for _ in range(max_len):
      with torch.no_grad():
        emb = self.embedding(inp)
        o, (h, c) = self.rnn(emb, (h, c))
        o_dist = self.output(o)
        _, pred = o_dist.max(dim=2)
        if pred.item() == end_idx:
          break
        out.append(pred.item() if idx2vocab is None else idx2vocab[pred.item()])
        inp = pred
    return out

In [20]:
class EncoderDecoder(torch.nn.Module):
  def __init__(self,
              src_vocab_size,
              tgt_vocab_size,
              embedding_size,
              hidden_size,
              num_layers=1,
              dropout=0.1,
              max_grad_norm=5.0):
    super().__init__()
    self.hidden_size = hidden_size
    self.embedding_size = embedding_size
    self.num_layers = num_layers
    self.max_grad_norm = max_grad_norm
    self.encoder = LSTMLM(src_vocab_size, embedding_size, hidden_size, num_layers, dropout)
    self.decoder = LSTMLM(tgt_vocab_size, embedding_size, hidden_size, num_layers, dropout)
    self.log_smax = torch.nn.LogSoftmax(dim=-1)
    self.loss = torch.nn.NLLLoss(reduction='mean', ignore_index=-1)
    #We are going to package the optimizer inside this class
    self.optimizer = torch.optim.Adam(self.parameters())

  def train_step(self, x, y):
    """ Performs one step of SGD
    Args:
        x: the input sequence, its size should be: (1, src_length)
        y: the output sequence, its size should be (1, tgt_length)
    Returns:
        loss: the loss for this example (note this is just for logging it is not a pytorch tensor)
        accuracy: the accuracy for this example
    """
    self.optimizer.zero_grad()
    _loss, acc = self(x, y)
    _loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(self.parameters(),
                                                self.max_grad_norm)

    if math.isnan(grad_norm):
      print('skipping update grad_norm is nan!')
    else:
      self.optimizer.step()
    loss = _loss.item()
    return loss, acc

  def forward(self, x, y):
    out_src, hidden_src, final_src = self.encoder(x)
    
    y_input = y[:, :-1]
    y_output = y[:, 1:]

    out_tgt, hidden_tgt, final_tgt = self.decoder(y_input, final_src)
    out_tgt_lsm = self.log_smax(out_tgt)
    
    
    loss = self.loss(out_tgt_lsm.squeeze(0), y_output.squeeze(0))
    _, pred = out_tgt_lsm.max(dim=2)
    accuracy = float((pred == y_output).sum()) / y_output.numel()
    return loss, accuracy

  def generate(self, x, start_idx, end_idx, idx2vocab=None, max_len=50):
    """Generates and output sequence, i.e. predicts the output sequence.
    Args:
        x: sequence of input tokens. shape should be (1, src_length)
        start_idx: index of the <BOS> token
        end_idx: index of the <EOS> token
        idx2vocab: a dictionary that maps indexes to the output tokens
    Returns:
        out: list of strings
    """
    _, _, final_src = self.encoder(x)
    out = self.decoder.generate(start_idx, end_idx, final_src, idx2vocab)
    return out

__Question:__ Look at the forward function of the `EncoderDecoder` class. What is the purpose of the following lines: _(5 points)_

```
    y_input = y[:, :-1]
    y_output = y[:, 1:]
```

__Answer__: In the training of LSTM, the last step in predictions will be the first future element. We make a loop where this element is the input. Because of stateful, the model will understand it's a new input step of the previous sequence instead of a new sequence.

__Question__: In the last homework, the generation function of the RNN LM had a temperature parameter to control the randomness of the generated output. What implicit temperature are we using in the above model? _(5 points)_

__Answer:__The dropout operation functions as temperature which is used to control the randomness of predictions. The logic of dropout is for adding noise to the neurons in order not to be dependent on any specific neuron. By adding drop out for LSTM cells, there is a chance for forgetting something that should not be forgotten.

We create an instance of our EncoderDecoder below.

In [21]:
model = EncoderDecoder(src_vocab_size=len(train_corpus.src_vocab),
                       tgt_vocab_size=len(train_corpus.tgt_vocab),
                       embedding_size=73,
                       hidden_size=73,
                       num_layers=1)
model = model.cuda()
print(model)
print('num parameters:', sum([p.numel() for p in model.parameters()]))

EncoderDecoder(
  (encoder): LSTMLM(
    (dropout): Dropout(p=0.1, inplace=False)
    (embedding): Embedding(31, 73)
    (rnn): LSTM(73, 73, batch_first=True, dropout=0.1)
    (output): Linear(in_features=73, out_features=31, bias=True)
  )
  (decoder): LSTMLM(
    (dropout): Dropout(p=0.1, inplace=False)
    (embedding): Embedding(75, 73)
    (rnn): LSTM(73, 73, batch_first=True, dropout=0.1)
    (output): Linear(in_features=73, out_features=75, bias=True)
  )
  (log_smax): LogSoftmax(dim=-1)
  (loss): NLLLoss()
)
num parameters: 102014


  "num_layers={}".format(dropout, num_layers))


## Training routine

The `train` method defines our training routine. For the first seq2seq model `max_epochs` should be at least 15. For the second model, max_epochs can be reduced to 10.

In [7]:
def train(model, train_corpus, dev_corpus, max_epochs=15):
  sum_loss, sum_acc = 0., 0.
  train_instances_idxs = list(range(train_corpus.data_size))
  st = time.time()
  for epoch_i in range(max_epochs):
    sum_loss, sum_acc = 0., 0.
    random.shuffle(train_instances_idxs)
    model.train()
    for i in train_instances_idxs:
      (x, _), (y, _) = train_corpus.get(i)
      x, y = x.cuda(), y.cuda()
      l, a = model.train_step(x, y)
      sum_loss += l
      sum_acc += a
    print(f"epoch: {epoch_i} time elapsed: {time.time() - st:.2f}")
    print(f"train loss: {sum_loss/train_corpus.data_size:.4f} train acc: {sum_acc/train_corpus.data_size:.4f}")
    sum_loss, sum_acc = 0., 0.
    model.eval()
    for dev_i in range(dev_corpus.data_size):
      (x, x_str), (y, y_str) = dev_corpus.get(dev_i)
      x, y = x.cuda(), y.cuda()
      with torch.no_grad():
        l, a = model(x, y)
        sum_loss += l.item()
        sum_acc += a
    print(f"  dev loss: {sum_loss/dev_corpus.data_size:.4f}   dev acc: {sum_acc/dev_corpus.data_size:.4f}")
  return model

## Evaluation Routine
We are going to evaluate our model's predictions using Character Error Rate (CER). This measures the number of edits (insertions, deletions and substitutions) needed to convert our model's prediction to the correct output sequence. [Edit distance computation](https://nlp.stanford.edu/IR-book/html/htmledition/edit-distance-1.html) is one of the popular applications of dynamic programming, and is used for measuring character/phone/word error rates in speech recognition.

Complete the following function which takes two sequences (list of characters) and computes the edit distance between them. Verify your implementation for a few pairs of sequences by comparing the output of your function with the output obtained from `ed.eval()` _(10 points)_

In [8]:
from typing import List

Sequence = List[str]

def compute_edit_distance(s: Sequence, t: Sequence) -> int:
  # TODO: Your implementation here
  if s == t: return 0
  elif len(s) == 0: return len(t)
  elif len(t) == 0: return len(s)
  v0 = [None] * (len(t) + 1)
  v1 = [None] * (len(t) + 1)
  for i in range(len(v0)):
      v0[i] = i
  for i in range(len(s)):
      v1[0] = i + 1
      for j in range(len(t)):
        cost = 0 if s[i] == t[j] else 1
        v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
      for j in range(len(v0)):
        v0[j] = v1[j]
                
  return v1[len(t)]

In [9]:
import editdistance as ed

# Example 1
seq1 = ['H','E','L','L','O']
seq2 = ['H','E','L','P']
assert compute_edit_distance(seq1,seq2) == ed.eval(seq1,seq2)

# Example 2
# TODO
seq3 = ['A','P','P','L','E']
seq4= ['A','P','E','X'] 
assert compute_edit_distance(seq3,seq4) == ed.eval(seq3,seq4)
# Example 3
# TODO 
seq5 = ['P','L','A','Y']
seq6 = ['P','L','A','N','T']
assert compute_edit_distance(seq5,seq6) == ed.eval(seq5,seq6)

The below routine evaluates the trained model.

In [10]:
def evaluate(model, test_corpus):
  print('Evaluation:')
  sum_cer = 0.0
  model.eval()
  for test_i in range(test_corpus.data_size):
    (x, x_str), (y, y_str) = test_corpus.get(test_i)
    x, y = x.cuda(), y.cuda()
    pred_seq = model.generate(x,
                              test_corpus.tgt_vocab[SPL_SYMS[0]],
                              test_corpus.tgt_vocab[SPL_SYMS[1]],
                              test_corpus.tgt_idx2vocab)
    cer = float(ed.eval(y_str.split(), pred_seq)) / len(y_str.split())
    y_hat = ' '.join(pred_seq)
    x_str = ''.join(x_str.split())
    print(f"{test_i} {x_str} pred: {y_hat} ref: {y_str} cer: {cer:.4f}")
    sum_cer += cer
  print(f"Avg CER: {sum_cer/test_corpus.data_size:.4f}")

## Training the EncoderDecoder

In [23]:
model = train(model, train_corpus, dev_corpus) #takes around 15-25 min

epoch: 0 time elapsed: 36.12
train loss: 2.1759 train acc: 0.4222
  dev loss: 1.5276   dev acc: 0.5830
epoch: 1 time elapsed: 73.56
train loss: 1.4335 train acc: 0.6000
  dev loss: 1.2273   dev acc: 0.6552
epoch: 2 time elapsed: 110.91
train loss: 1.2218 train acc: 0.6512
  dev loss: 1.0744   dev acc: 0.6919
epoch: 3 time elapsed: 148.08
train loss: 1.0973 train acc: 0.6823
  dev loss: 0.9937   dev acc: 0.7124
epoch: 4 time elapsed: 185.62
train loss: 1.0153 train acc: 0.7036
  dev loss: 0.9328   dev acc: 0.7322
epoch: 5 time elapsed: 222.94
train loss: 0.9598 train acc: 0.7185
  dev loss: 0.8926   dev acc: 0.7424
epoch: 6 time elapsed: 260.31
train loss: 0.9141 train acc: 0.7302
  dev loss: 0.8793   dev acc: 0.7434
epoch: 7 time elapsed: 297.51
train loss: 0.8747 train acc: 0.7407
  dev loss: 0.8323   dev acc: 0.7560
epoch: 8 time elapsed: 334.60
train loss: 0.8427 train acc: 0.7488
  dev loss: 0.8198   dev acc: 0.7601
epoch: 9 time elapsed: 371.89
train loss: 0.8163 train acc: 0.7550

In [24]:
evaluate(model, test_corpus)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
1 utility's pred: Y UW0 T IH1 L AH0 T IY0 Z ref: Y UW0 T IH1 L AH0 T IY0 Z cer: 0.0000
2 rothenberg pred: R OW1 T HH EH2 N D ER0 ref: R AO1 TH AH0 N B ER0 G cer: 0.7500
3 kinesiology pred: K IH2 N AH0 S IH1 L AH0 JH IY0 ref: K IH2 N IH0 S IY2 AA1 L AH0 JH IY0 cer: 0.2727
4 reclassified pred: R IY0 K L IH1 S T IH0 D ref: R IY0 K L AE1 S AH0 F AY2 D cer: 0.4000
5 substantive pred: S AH1 B S T AH0 N T IH0 D ref: S AH1 B S T AH0 N T IH0 V cer: 0.1000
6 situations pred: S AY2 T UW0 EY1 SH AH0 N Z ref: S IH2 CH UW0 EY1 SH AH0 N Z cer: 0.2222
7 lobianco pred: L OW1 B AH0 N AA1 K ref: L OW0 B IY0 AA1 N K OW0 cer: 0.6250
8 participants' pred: P AA0 R T IH1 S AH0 P EY2 T S ref: P AA0 R T IH1 S AH0 P AH0 N T S cer: 0.1667
9 transportable pred: T R AE0 N S P EH1 R T AH0 B AH0 L ref: T R AE0 N S P AO1 R T AH0 B AH0 L cer: 0.0769
10 inscribes pred: IH2 N S K R AY1 B Z ref: IH2 N S K R AY1 B Z cer: 0.0000
11 grandbaby pred: G R AE1 N D 

__Question:__ Report the number of params used by the model. _(2 points)_

__Answer:__ 
num parameters: 102014

__Question:__ Report the training and validation loss at the end of 15 epochs. _(3 points)_

__Answer:__ 
train loss: 0.7265 
  dev loss: 0.7573   

__Question:__ Report the training and validation accuracies at the end of 15 epochs. _(2 points)_

__Answer:__
train acc: 0.7798
dev acc: 0.7790

__Question:__ Report the test average CER. _(3 points)_

__Answer:__ 
Avg CER: 0.3601

## Part 2: Seq2Seq with Attention

Next we will modify our EncoderDecoder to use an Attention mechanism.
We will resue the RNNLM class you made in part 1 for the encoder.

For the Decoder, we need to use an `AttentionDecoder`. The `AttentionDecoder`, inturn requires an `Attention` object.

We will first build the `Attention` class. This class accepts 1) a sequence of encoder hidden states and 2) a decoder state as inputs. The output is a `context_vector` which is basically a weighted sum of the encoder hidden states. The weights (using which we computed the weighted sum) is also computed in this class. These weights are called attention weights.

Complete the following code block to finish the attention mechanism. _(15 points)_

In [11]:
class Attention(torch.nn.Module):
  def __init__(self,
                hidden_size):
    super().__init__()
    self.hidden_size = hidden_size
    m = torch.Tensor(hidden_size, hidden_size)
    torch.nn.init.xavier_uniform_(m)
    self.attn_weight_matrix = torch.nn.Parameter(m)
      
  def forward(self, encoder_states, prev_decoder_state):
    """Computes a context vector from encoder_states and prev_decoder_state
    Args:
        encoder_states: sequence of encoder hidden states. shape should be (1, src_length, hidden_size) = (1,12,64)
        prev_decoder_state: the hidden state from the previous decoder time-step. shape should be (1, 1, hidden_size)
    Returns:
        context_vector: context vector summarizing the encoder states for the current time-step in the decoder. shape should be (1, 1, hidden_size)
    """
    #TODO: reshape the prev_decoder from (1, 1, hidden_size) to (1, hidden_size)

    prev_decoder_state = prev_decoder_state.squeeze(1)
    #print(prev_decoder_state.size()) = (1,64)

    #TODO: compute unnormalized attention weights for each position of the source encoding
    #TODO: this can be done using two matrix multipications
    #TODO: refer to https://pytorch.org/docs/stable/torch.html?highlight=torch%20mm#torch.mm
    #TODO: name the result of the two matrix multiplcations as `attn_wts`
    #TODO: the shape of `attn_wts` should be (src_length, 1)
    encoder_states = encoder_states.squeeze(-3) # srclen,64
    #print('after squeeze, encoder_states.size = ', encoder_states.size())
    
    attn_wts = torch.matmul(encoder_states,prev_decoder_state.T)
    #print("weights size(should be (scr,1)) = ",attn_wts.size())

    #TODO: scale the attention weights (i.e. divide by the sqrt of the length of the src sequence)

    attn_wts = attn_wts / math.sqrt(encoder_states.size()[0])

    #TODO: normalize the attention weights using the softmax function in pytorch
    #TODO: name the normalized attention weights as `attn_probs`

    attn_probs = torch.nn.functional.softmax(attn_wts,0)

    #TODO: the `context_vector` is the sum of encoder states weighted by `attn_probs`
    #TODO: compute the `context_vector` here 
    #TODO: you may find this useful (https://pytorch.org/docs/stable/notes/broadcasting.html#broadcasting-semantics)
    #TODO: the shape of the `context_vector` should be (1, hidden_size)
    
    #attn_probs:(12,1) encoder_states:(12,64)
    context_vector = torch.matmul(attn_probs.T,encoder_states)
    
    context_vector = context_vector.unsqueeze(1) # we reshape the `context_vector` to  (1, 1, hidden_size)
    # (1,1,64)
    return context_vector

Now complete the following implementation of an attention-based decoder. _(10 points)_

In [14]:
class AttentionDecoder(torch.nn.Module):
  def __init__(self,
                vocab_size,
                embedding_size,
                hidden_size,
                num_layers=1,
                dropout=0.1):
    super().__init__()
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.dropout_prop = dropout
    self.dropout = torch.nn.Dropout(dropout)
    self.num_layers = num_layers
    self.embedding = torch.nn.Embedding(self.vocab_size, self.embedding_size)
    self.output_proj = torch.nn.Linear(self.hidden_size, self.vocab_size)
    self.rnn = torch.nn.LSTM(embedding_size + hidden_size, hidden_size, num_layers,
                              batch_first=True, dropout=dropout, bidirectional=False)
    self.attention = Attention(self.hidden_size)

  def forward(self, encoder_states, y):
    """Computes loss and accuracy during training.
    Args:
        encoder_states: sequence of encoder hidden states. shape should be (1, src_length, hidden_size)
        y: sequence of target symbols. shape should be (1, tgt_length)
    Returns:
        output_dist: pytorch tensor with grad_fn of shape (1, tgt_length - 1, vocab_size)
    """
    # we create the initial hidden_state and cell_state for the Decoder here
    h, c = (torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states),
            torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states))
    
    #TODO: the first input to our Attention Decoder should always be the embedding corresponding to the "<BOS>" symbol
    #TODO: recall that y[0, 0] is always going to represent the <BOS> symbol
    #TODO: create the target embedding for the first time-step here
    #TODO: name this target embedding `tgt_embedding` , it should be of shape (1, embedding_size)
    tgt_embedding = self.embedding(y[0,0])
    tgt_embedding = tgt_embedding.unsqueeze(0)
    #print(tgt_embedding.size()) = (1,64)

    output_buffer = []
    for tgt_idx in range(y.shape[1] - 1):
      #TODO: compute the `context_vector` using self.attention object
      #TODO: `self.attention` takes `encoder_states` and the decoder hidden_state `h` as input
      context_vector = self.attention(encoder_states,h)
      #print(context_vector.size()) = (1,1,64)
      #TODO: concat the `context_vector` and the previous target word's embedding
      #TODO: the result should be of size (1, 1, embedding_size + hidden_size)
      #TODO: the concated tensor is going to be the input to our decoder
      #TODO: name the result of the concat operation `decoder_input`
      #TODO: `decoder_input` should be of size  (1, embed_size + hidden_size)
      #TODO: you may find this useful: https://pytorch.org/docs/stable/torch.html?highlight=torch%20cat#torch.cat

      decoder_input = torch.cat((context_vector, tgt_embedding.unsqueeze(1)), dim=2)
      #context: (1,1,64) tgt_emb:(1,64)

      #TODO: supply `decoder_input` to the decoder RNN i.e. `self.rnn`
      #TODO: also supply the hiddent_state `h` and the cell_state `c`
      #TODO: also supply the hidden state and cell state
     
      #TODO: the output of this operation should be: o, (h, c)
      #TODO: (new_hidden_state, new_cell_state) becomes (hidden_state, cell_state) for the next step 
      #TODO: `o`, `h`, `c` should be of size (1, 1, hidden_size)
      o, (h, c) = self.rnn(decoder_input, (h, c))
      #TODO: use the linear layer `self.output_proj` to project the output to the target vocabulary
      #TODO: name the result `output`
      #TODO: shape should be (1, 1, vocab_size)
      output = self.output_proj(o)

      output_buffer.append(output)
      tgt = y[:, tgt_idx + 1]
      tgt_embedding = self.embedding(tgt)
      #end of for loop
    output_dist = torch.cat(output_buffer, dim=1)
    return output_dist

  def generate(self, encoder_states, start_idx, end_idx, idx2vocab=None, max_len=50):
    """Generates and output sequence, i.e. predicts the output sequence.
    Args:
        encoder_states: sequence of encoder hidden states. shape should be (1, src_length, hidden_size)
        start_idx: index of the <BOS> token
        end_idx: index of the <EOS> token
        idx2vocab: a dictionary that maps indexes to the output tokens
    Returns:
        out: list of strings
    """
    # we create the initial hidden_state and cell_state for the Decoder here  
    h, c = (torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states),
            torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states))
    
    #TODO: create a tensor which contains the idx of the <BOS> symbol 
    #TODO: and copy it to the GPU using the .cuda() function
    #TODO: name this tensor `inp` it should be of size (1, 1)
    
    h = h.cuda()
    c = c.cuda() 
    inp = torch.tensor([start_idx]).long().unsqueeze(0)
    inp = inp.cuda() if self.embedding.weight.is_cuda else inp



    out = [] # a buffer that will hold the generated sequence
    for _ in range(max_len):

      #TODO: obtain the embedding of the input symbol (its shape should be (1, embed_size)
      #TODO: this is done using `self.embedding`
      #TODO: name the result `tgt_embedding`

      with torch.no_grad():
        tgt_embedding = self.embedding(inp)

      #TODO: compute the `context_vector` using `self.attention`
      #TODO: this step is simlar to that from the forward function.

      context_vector = self.attention(encoder_states,h)
      
      #TODO: compute the `decoder_input` by concating the `context_vector` and the `tgt_embedding`
      #TODO: the shape of the result should be (1, 1, embed_size + hidden_size)
      #TODO: supply the `decoder_input` along with the state tuple to `self.rnn`
      decoder_input = torch.cat((context_vector,tgt_embedding),dim =2)
      o, (h, c) = self.rnn(decoder_input, (h, c))

      #TODO: obtain the output distribution using the `self.output_proj` layer.
      #TODO: name the resulting output as `o_dist`, it should be of size (1, 1, vocab_size)
      o_dist = self.output_proj(o)
      
      #TODO: As we are generating the output, we must select one of the symbols to produce from
      #TODO: the output distribution. For this homework, we'll simply pick the index of the max-valued symbol from this distribution
      #TODO: This is known as "greedy" decoding
      _, pred = o_dist.max(dim=2)
      

      #TODO: to decide when to end the sequence, check if the index of the max-valued symbol is equal to `end_idx`
      #TODO: if it is the end_idx break out of the loop

      if pred.item() == end_idx:
          break
      out.append(pred.item() if idx2vocab is None else idx2vocab[pred.item()])
      inp = pred

      #TODO: if it is not the end_idx, convert the symbol from an index to a string using idx2vocab
      #TODO: add the string-symbol to the out buffer

      
      #TODO: lastly, we will use the predicted symbol as the input i.e. `inp` for the next time step


      #end of for loop
    return out

In [15]:
class EncoderDecoderAttention(torch.nn.Module):
  def __init__(self,
                src_vocab_size,
                tgt_vocab_size,
                embedding_size,
                hidden_size,
                num_layers=1,
                dropout=0.0,
                max_grad_norm=5.0):
    super().__init__()
    self.hidden_size = hidden_size
    self.embedding_size = embedding_size
    self.num_layers = num_layers
    self.max_grad_norm = max_grad_norm
    self.encoder = LSTMLM(src_vocab_size, embedding_size, hidden_size, num_layers, dropout)
    self.decoder = AttentionDecoder(tgt_vocab_size, embedding_size, hidden_size, num_layers, dropout)
    self.log_smax = torch.nn.LogSoftmax(dim=-1)
    self.loss = torch.nn.NLLLoss(reduction='mean', ignore_index=-1)
    #We are going to package the optimizer inside this class
    self.optimizer = torch.optim.Adam(self.parameters())

  def train_step(self, x, y):
    """ Performs one step of SGD
    Args:
      x: the input sequence, its size should be: (1, src_length)
      y: the output sequence, its size should be (1, tgt_length)
    Returns:
      loss: the loss for this example (note this is just for logging it is not a pytorch tensor)
      accuracy: the accuracy for this example
    """
    self.optimizer.zero_grad()
    _loss, acc = self(x, y)
    _loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(self.parameters(),
                                                self.max_grad_norm)

    if math.isnan(grad_norm):
      print('skipping update grad_norm is nan!')
    else:
      self.optimizer.step()
    loss = _loss.item()
    return loss, acc

  def forward(self, x, y):
    """Computes loss and accuracy during training.
    Args:
      x: sequence of source symbols. shape should be (1, src_length)
      y: sequence of target symbols. shape should be (1, tgt_length)
    Returns:
      loss: pytorch scalar with grad_fn. shape should be (1, 1)
      accuracy: scalar
    """
    _, encoder_states, _ = self.encoder(x)
    out_tgt = self.decoder(encoder_states, y)
    out_tgt_lsm = self.log_smax(out_tgt)
    y_output = y[:, 1:]
    loss = self.loss(out_tgt_lsm.squeeze(0), y_output.squeeze(0))
    _, pred = out_tgt.max(dim=2)
    accuracy = float((pred == y_output).sum()) / y_output.numel()
    return loss, accuracy

  def generate(self, x, start_idx, end_idx, idx2vocab=None, max_len=50):
    _, encoder_states, _ = self.encoder(x)
    out = self.decoder.generate(encoder_states, start_idx, end_idx, idx2vocab)
    return out

In [16]:
print(train_corpus.data_size, dev_corpus.data_size, test_corpus.data_size)
model_attn = EncoderDecoderAttention(src_vocab_size=len(train_corpus.src_vocab),
                                tgt_vocab_size=len(train_corpus.tgt_vocab),
                                embedding_size=64,
                                hidden_size=64,
                                num_layers=1)
model_attn = model_attn.cuda()
print(model_attn)
print('num parameters:', sum([p.numel() for p in model_attn.parameters()]))

10000 2000 5000
EncoderDecoderAttention(
  (encoder): LSTMLM(
    (dropout): Dropout(p=0.0, inplace=False)
    (embedding): Embedding(31, 64)
    (rnn): LSTM(64, 64, batch_first=True)
    (output): Linear(in_features=64, out_features=31, bias=True)
  )
  (decoder): AttentionDecoder(
    (dropout): Dropout(p=0.0, inplace=False)
    (embedding): Embedding(75, 64)
    (output_proj): Linear(in_features=64, out_features=75, bias=True)
    (rnn): LSTM(128, 64, batch_first=True)
    (attention): Attention()
  )
  (log_smax): LogSoftmax(dim=-1)
  (loss): NLLLoss()
)
num parameters: 100714


In [17]:
model_attn = train(model_attn, train_corpus, dev_corpus, max_epochs=10) #takes 1 - 1.5 hours

epoch: 0 time elapsed: 106.80
train loss: 1.6931 train acc: 0.5640
  dev loss: 1.0394   dev acc: 0.7202
epoch: 1 time elapsed: 220.93
train loss: 0.8585 train acc: 0.7593
  dev loss: 0.7794   dev acc: 0.7779
epoch: 2 time elapsed: 335.30
train loss: 0.6850 train acc: 0.8001
  dev loss: 0.6896   dev acc: 0.7981
epoch: 3 time elapsed: 449.98
train loss: 0.6010 train acc: 0.8215
  dev loss: 0.6616   dev acc: 0.8058
epoch: 4 time elapsed: 564.42
train loss: 0.5502 train acc: 0.8326
  dev loss: 0.6126   dev acc: 0.8178
epoch: 5 time elapsed: 679.20
train loss: 0.5127 train acc: 0.8438
  dev loss: 0.5802   dev acc: 0.8275
epoch: 6 time elapsed: 793.40
train loss: 0.4826 train acc: 0.8500
  dev loss: 0.5908   dev acc: 0.8268
epoch: 7 time elapsed: 907.24
train loss: 0.4588 train acc: 0.8568
  dev loss: 0.5826   dev acc: 0.8259
epoch: 8 time elapsed: 1021.35
train loss: 0.4375 train acc: 0.8630
  dev loss: 0.5535   dev acc: 0.8380
epoch: 9 time elapsed: 1135.13
train loss: 0.4155 train acc: 0.

In [18]:
evaluate(model_attn, test_corpus)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
1 utility's pred: Y UW0 T IH1 L AH0 T IY0 Z ref: Y UW0 T IH1 L AH0 T IY0 Z cer: 0.0000
2 rothenberg pred: R OW1 T AH0 N B ER0 G ref: R AO1 TH AH0 N B ER0 G cer: 0.2500
3 kinesiology pred: K IH2 N AH0 S IY0 AA1 L AH0 JH IY0 ref: K IH2 N IH0 S IY2 AA1 L AH0 JH IY0 cer: 0.1818
4 reclassified pred: R IH0 K L AE1 S AH0 F AY2 D ref: R IY0 K L AE1 S AH0 F AY2 D cer: 0.1000
5 substantive pred: S AH0 B S T AE1 N T IH0 V ref: S AH1 B S T AH0 N T IH0 V cer: 0.2000
6 situations pred: S IH2 T Y UW0 EY1 SH AH0 N Z ref: S IH2 CH UW0 EY1 SH AH0 N Z cer: 0.2222
7 lobianco pred: L OW0 B IY0 AA1 N K OW0 ref: L OW0 B IY0 AA1 N K OW0 cer: 0.0000
8 participants' pred: P AA1 R T IH0 S AH0 P AE2 N T S ref: P AA0 R T IH1 S AH0 P AH0 N T S cer: 0.2500
9 transportable pred: T R AE0 N S P AO1 R T AH0 B AH0 L ref: T R AE0 N S P AO1 R T AH0 B AH0 L cer: 0.0000
10 inscribes pred: IH2 N S K R IH1 B Z AH0 Z ref: IH2 N S K R AY1 B Z cer: 0.3750
11 grandba

__Question:__ Report the number of params used by the model. _(2 points)_

__Answer:__
num parameters: 100714

__Question:__ Report the training and validation loss at the end of 15 epochs. _(3 points)_ 

__Answer:__
train loss: 0.4155, dev loss: 0.5566  

__Question:__ Report the training and validation accuracies at the end of 15 epochs. _(2 points)_

__Answer:__
train acc: 0.8690, dev acc: 0.8360

__Question:__ Report the test average CER. _(3 points)_

__Answer:__
Avg CER: 0.2557

__Extra credit:__ Try to improve on the Attention-based model you have implemented. You can either increase the number of parameters in your model or implement a fancier attention scheme, such as dot-product attention or attention weights computed from a feed forward network. _(10 points)_

__Answer:__
We increase the number of parameters to 301034 (adds the size of hidden layers) and the performance of Attention-based model is better than the former one, which has CER = 0.2185.

In [27]:
class NewLSTMLM(torch.nn.Module):
  def __init__(self,
              vocab_size,
              embedding_size,
              hidden_size,
              num_layers=1,
              dropout=0.1):
    super().__init__()
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.dropout = torch.nn.Dropout(dropout)
    
    #TODO: create an embedding layer here
    #TODO: the embedding layer takes a sequence of ints and converts them into a sequence of real-valued vectors
    #TODO: see https://pytorch.org/docs/stable/nn.html?highlight=embedding#torch.nn.Embedding
    self.embedding = torch.nn.Embedding(self.vocab_size,self.embedding_size)

    #TODO: create a unidirectional RNN-LSTM here
    #TODO: Note: use the batch_first=True option, 
    #TODO: the rest of the code assumes the first dimension of any tensor is the batch_size (which is 1 for simplicity)
    #TODO: see https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
    self.rnn = torch.nn.LSTM(self.embedding_size,self.hidden_size,self.num_layers,batch_first=True,dropout=dropout)

    #TODO: create a Linear layer
    #TODO: https://pytorch.org/docs/stable/nn.html?highlight=linear#torch.nn.Linear
    self.output = torch.nn.Linear(self.hidden_size,self.vocab_size)
           

  def forward(self, x, init_hidden_state=None):
    # x should be of shape (1, N) where N is the length of the source
    assert x.shape[0] == 1 # only supporting batch size of 1 to keep the implementation simple
    x_len = len(x)
    x = x.cuda() if self.embedding.weight.is_cuda else x
    #TODO: use `self.embedding` to convert x into a sequence of vector representations
    #TODO: name the result `emb` it should have shape (1, N, embedding_size)
    emb = self.embedding(x)
    emb = self.dropout(emb) # keep this line here

    #TODO: supply the embeddings `emb` to the RNN-LSTM
    #TODO: the LSTM requires a tuple as it's initial state (h0, c0) 
    #TODO: where `h0` is the initial hidden state and `c0` is the cell_state
    #TODO: if the argument init_hidden_state is None, create a tuple (h0, c0) where h0=zeros and c0=zeros.
    #TODO: if init_hidden_state is not None, then supply it as the second argument to the RNN-LSTM

    if init_hidden_state is None:
      h = torch.zeros((self.num_layers, x_len, self.hidden_size))
      c = torch.zeros((self.num_layers, x_len, self.hidden_size))
      h = h.cuda() if self.embedding.weight.is_cuda else h
      c = c.cuda() if self.embedding.weight.is_cuda else c
    else:
      h, c = init_hidden_state

    #TODO: the LSTM will output 2 objects
    #TODO: the first is a tensor representing the hidden states of the encoder (name the first object `hidden_states`)
    #TODO: the second is a tuple representing (final hidden state, final cell state)
    #TODO: each item in the tuple is a tensor. 
    #TODO: name the tuple as `final_state`.
    #TODO: hidden_states should have size (1, N, hidden_size)
    #TODO: each tensor in the final_state should have size (1, 1, hidden_size)

    hidden_states, final_state = self.rnn(emb, (h, c))
  
    hidden_states = self.dropout(hidden_states) # keep this line here

    #TODO: supply the hidden_states (after dropout) to the ourput layer
    #TODO: the output layer will "project" from hidden_size to output_size
    #TODO: i.e. the result of the output layer will be of size (1, N, vocab_size)
    #TODO: https://pytorch.org/docs/stable/nn.html?highlight=linear#torch.nn.Linear
    #TODO: name the result of the output layer as "output_dist"
    #hidden_states = hidden_states.reshape(hidden_states.size(0)*hidden_states.size(1),hidden_states.size(2))
    output_dist = self.output(hidden_states)

    return output_dist, hidden_states, final_state # do not change this line
    
  def generate(self, start_idx, end_idx, init_hidden_state=None, idx2vocab=None, max_len=50):
    """ Performs the sequence generation process
    Args:
      start_idx: is the starting symbol (i.e. the integer corresponding to the <BOS> symbol)
      end_idx: is the end symbol (i.e. the integer corresponding to the <EOS> symbol)
      init_hidden_state: (optional) is the tuple (hidden_state, cell_state) at the starting step of generation
      If init_hidden_state is None, zero vectors are used for the hidden and cell states.
      idx2vocab: A dictionary to convert integer representation of symbols to human-readable symbols i.e. characters.
      max_len: The maximum length of the generated sequence.
    Returns:
      out: A List of generated characters
    """
    if init_hidden_state is None:
      h = torch.zeros((self.num_layers, 1, self.hidden_size))
      c = torch.zeros((self.num_layers, 1, self.hidden_size))
      h = h.cuda() if self.embedding.weight.is_cuda else h
      c = c.cuda() if self.embedding.weight.is_cuda else c
    else:
      h, c = init_hidden_state
    
    inp = torch.tensor([start_idx]).long().unsqueeze(0)
    inp = inp.cuda() if self.embedding.weight.is_cuda else inp

    out = []
    for _ in range(max_len):
      with torch.no_grad():
        emb = self.embedding(inp)
        o, (h, c) = self.rnn(emb, (h, c))
        o_dist = self.output(o)
        _, pred = o_dist.max(dim=2)
        if pred.item() == end_idx:
          break
        out.append(pred.item() if idx2vocab is None else idx2vocab[pred.item()])
        inp = pred
    return out

In [29]:
class NewAttentionDecoder(torch.nn.Module):
  def __init__(self,
                vocab_size,
                embedding_size,
                hidden_size,
                num_layers=1,
                dropout=0.1):
    super().__init__()
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.dropout_prop = dropout
    self.dropout = torch.nn.Dropout(dropout)
    self.num_layers = num_layers
    self.embedding = torch.nn.Embedding(self.vocab_size, self.embedding_size)
    self.output_proj = torch.nn.Linear(self.hidden_size, self.vocab_size)
    self.rnn = torch.nn.LSTM(embedding_size + hidden_size, hidden_size, num_layers,
                              batch_first=True, dropout=dropout, bidirectional=False)
    self.attention = Attention(self.hidden_size)

  def forward(self, encoder_states, y):
    """Computes loss and accuracy during training.
    Args:
        encoder_states: sequence of encoder hidden states. shape should be (1, src_length, hidden_size)
        y: sequence of target symbols. shape should be (1, tgt_length)
    Returns:
        output_dist: pytorch tensor with grad_fn of shape (1, tgt_length - 1, vocab_size)
    """
    # we create the initial hidden_state and cell_state for the Decoder here
    h, c = (torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states),
            torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states))
    
    #TODO: the first input to our Attention Decoder should always be the embedding corresponding to the "<BOS>" symbol
    #TODO: recall that y[0, 0] is always going to represent the <BOS> symbol
    #TODO: create the target embedding for the first time-step here
    #TODO: name this target embedding `tgt_embedding` , it should be of shape (1, embedding_size)
    tgt_embedding = self.embedding(y[0,0])
    tgt_embedding = tgt_embedding.unsqueeze(0)
    #print(tgt_embedding.size()) = (1,64)

    output_buffer = []
    for tgt_idx in range(y.shape[1] - 1):
      #TODO: compute the `context_vector` using self.attention object
      #TODO: `self.attention` takes `encoder_states` and the decoder hidden_state `h` as input
      context_vector = self.attention(encoder_states,h)
      #print(context_vector.size()) = (1,1,64)
      #TODO: concat the `context_vector` and the previous target word's embedding
      #TODO: the result should be of size (1, 1, embedding_size + hidden_size)
      #TODO: the concated tensor is going to be the input to our decoder
      #TODO: name the result of the concat operation `decoder_input`
      #TODO: `decoder_input` should be of size  (1, embed_size + hidden_size)
      #TODO: you may find this useful: https://pytorch.org/docs/stable/torch.html?highlight=torch%20cat#torch.cat
      decoder_input = torch.cat((context_vector, tgt_embedding.unsqueeze(1)), dim=2)
      #context: (2,1,64) tgt_emb:(1,64)

      #TODO: supply `decoder_input` to the decoder RNN i.e. `self.rnn`
      #TODO: also supply the hiddent_state `h` and the cell_state `c`
      #TODO: also supply the hidden state and cell state
     
      #TODO: the output of this operation should be: o, (h, c)
      #TODO: (new_hidden_state, new_cell_state) becomes (hidden_state, cell_state) for the next step 
      #TODO: `o`, `h`, `c` should be of size (1, 1, hidden_size)
      o, (h, c) = self.rnn(decoder_input, (h, c))
      #TODO: use the linear layer `self.output_proj` to project the output to the target vocabulary
      #TODO: name the result `output`
      #TODO: shape should be (1, 1, vocab_size)
      output = self.output_proj(o)

      output_buffer.append(output)
      tgt = y[:, tgt_idx + 1]
      tgt_embedding = self.embedding(tgt)
      #end of for loop
    output_dist = torch.cat(output_buffer, dim=1)
    return output_dist

  def generate(self, encoder_states, start_idx, end_idx, idx2vocab=None, max_len=50):
    """Generates and output sequence, i.e. predicts the output sequence.
    Args:
        encoder_states: sequence of encoder hidden states. shape should be (1, src_length, hidden_size)
        start_idx: index of the <BOS> token
        end_idx: index of the <EOS> token
        idx2vocab: a dictionary that maps indexes to the output tokens
    Returns:
        out: list of strings
    """
    # we create the initial hidden_state and cell_state for the Decoder here  
    h, c = (torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states),
            torch.zeros(self.num_layers, 1, self.hidden_size).type_as(encoder_states))
    
    #TODO: create a tensor which contains the idx of the <BOS> symbol 
    #TODO: and copy it to the GPU using the .cuda() function
    #TODO: name this tensor `inp` it should be of size (1, 1)
    
    h = h.cuda()
    c = c.cuda() 
    inp = torch.tensor([start_idx]).long().unsqueeze(0)
    inp = inp.cuda() if self.embedding.weight.is_cuda else inp



    out = [] # a buffer that will hold the generated sequence
    for _ in range(max_len):

      #TODO: obtain the embedding of the input symbol (its shape should be (1, embed_size)
      #TODO: this is done using `self.embedding`
      #TODO: name the result `tgt_embedding`

      with torch.no_grad():
        tgt_embedding = self.embedding(inp)

      #TODO: compute the `context_vector` using `self.attention`
      #TODO: this step is simlar to that from the forward function.

      context_vector = self.attention(encoder_states,h)
      
      #TODO: compute the `decoder_input` by concating the `context_vector` and the `tgt_embedding`
      #TODO: the shape of the result should be (1, 1, embed_size + hidden_size)
      #TODO: supply the `decoder_input` along with the state tuple to `self.rnn`
      decoder_input = torch.cat((context_vector,tgt_embedding),dim =2)
      o, (h, c) = self.rnn(decoder_input, (h, c))

      #TODO: obtain the output distribution using the `self.output_proj` layer.
      #TODO: name the resulting output as `o_dist`, it should be of size (1, 1, vocab_size)
      o_dist = self.output_proj(o)
      
      #TODO: As we are generating the output, we must select one of the symbols to produce from
      #TODO: the output distribution. For this homework, we'll simply pick the index of the max-valued symbol from this distribution
      #TODO: This is known as "greedy" decoding
      _, pred = o_dist.max(dim=2)
      

      #TODO: to decide when to end the sequence, check if the index of the max-valued symbol is equal to `end_idx`
      #TODO: if it is the end_idx break out of the loop

      if pred.item() == end_idx:
          break
      out.append(pred.item() if idx2vocab is None else idx2vocab[pred.item()])
      inp = pred

      #TODO: if it is not the end_idx, convert the symbol from an index to a string using idx2vocab
      #TODO: add the string-symbol to the out buffer

      
      #TODO: lastly, we will use the predicted symbol as the input i.e. `inp` for the next time step


      #end of for loop
    return out

In [32]:
class NewEncoderDecoderAttention(torch.nn.Module):
  def __init__(self,
                src_vocab_size,
                tgt_vocab_size,
                embedding_size,
                hidden_size,
                num_layers=1,
                dropout=0.0,
                max_grad_norm=5.0):
    super().__init__()
    self.hidden_size = hidden_size
    self.embedding_size = embedding_size
    self.num_layers = num_layers
    self.max_grad_norm = max_grad_norm
    self.encoder = NewLSTMLM(src_vocab_size, embedding_size, hidden_size, num_layers, dropout)
    self.decoder = NewAttentionDecoder(tgt_vocab_size, embedding_size, hidden_size, num_layers, dropout)
    self.log_smax = torch.nn.LogSoftmax(dim=-1)
    self.loss = torch.nn.NLLLoss(reduction='mean', ignore_index=-1)
    #We are going to package the optimizer inside this class
    self.optimizer = torch.optim.Adam(self.parameters())

  def train_step(self, x, y):
    """ Performs one step of SGD
    Args:
      x: the input sequence, its size should be: (1, src_length)
      y: the output sequence, its size should be (1, tgt_length)
    Returns:
      loss: the loss for this example (note this is just for logging it is not a pytorch tensor)
      accuracy: the accuracy for this example
    """
    self.optimizer.zero_grad()
    _loss, acc = self(x, y)
    _loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(self.parameters(),
                                                self.max_grad_norm)

    if math.isnan(grad_norm):
      print('skipping update grad_norm is nan!')
    else:
      self.optimizer.step()
    loss = _loss.item()
    return loss, acc

  def forward(self, x, y):
    """Computes loss and accuracy during training.
    Args:
      x: sequence of source symbols. shape should be (1, src_length)
      y: sequence of target symbols. shape should be (1, tgt_length)
    Returns:
      loss: pytorch scalar with grad_fn. shape should be (1, 1)
      accuracy: scalar
    """
    _, encoder_states, _ = self.encoder(x)
    out_tgt = self.decoder(encoder_states, y)
    out_tgt_lsm = self.log_smax(out_tgt)
    y_output = y[:, 1:]
    loss = self.loss(out_tgt_lsm.squeeze(0), y_output.squeeze(0))
    _, pred = out_tgt.max(dim=2)
    accuracy = float((pred == y_output).sum()) / y_output.numel()
    return loss, accuracy

  def generate(self, x, start_idx, end_idx, idx2vocab=None, max_len=50):
    _, encoder_states, _ = self.encoder(x)
    out = self.decoder.generate(encoder_states, start_idx, end_idx, idx2vocab)
    return out

In [33]:
print(train_corpus.data_size, dev_corpus.data_size, test_corpus.data_size)
new_model_attn = NewEncoderDecoderAttention(src_vocab_size=len(train_corpus.src_vocab),
                                tgt_vocab_size=len(train_corpus.tgt_vocab),
                                embedding_size=64,
                                hidden_size=128,
                                num_layers=1)
new_model_attn = new_model_attn.cuda()
print(new_model_attn)
print('num parameters:', sum([p.numel() for p in new_model_attn.parameters()]))

10000 2000 5000
NewEncoderDecoderAttention(
  (encoder): NewLSTMLM(
    (dropout): Dropout(p=0.0, inplace=False)
    (embedding): Embedding(31, 64)
    (rnn): LSTM(64, 128, batch_first=True)
    (output): Linear(in_features=128, out_features=31, bias=True)
  )
  (decoder): NewAttentionDecoder(
    (dropout): Dropout(p=0.0, inplace=False)
    (embedding): Embedding(75, 64)
    (output_proj): Linear(in_features=128, out_features=75, bias=True)
    (rnn): LSTM(192, 128, batch_first=True)
    (attention): Attention()
  )
  (log_smax): LogSoftmax(dim=-1)
  (loss): NLLLoss()
)
num parameters: 301034


In [34]:
new_model_attn = train(new_model_attn, train_corpus, dev_corpus, max_epochs=10) #takes 1 - 1.5 hours

epoch: 0 time elapsed: 112.91
train loss: 1.3418 train acc: 0.6488
  dev loss: 0.7760   dev acc: 0.7808
epoch: 1 time elapsed: 233.13
train loss: 0.6500 train acc: 0.8087
  dev loss: 0.6182   dev acc: 0.8161
epoch: 2 time elapsed: 353.19
train loss: 0.5414 train acc: 0.8362
  dev loss: 0.5798   dev acc: 0.8266
epoch: 3 time elapsed: 473.52
train loss: 0.4761 train acc: 0.8528
  dev loss: 0.5399   dev acc: 0.8374
epoch: 4 time elapsed: 594.31
train loss: 0.4313 train acc: 0.8636
  dev loss: 0.5233   dev acc: 0.8446
epoch: 5 time elapsed: 715.35
train loss: 0.3940 train acc: 0.8752
  dev loss: 0.5132   dev acc: 0.8472
epoch: 6 time elapsed: 835.74
train loss: 0.3671 train acc: 0.8827
  dev loss: 0.5047   dev acc: 0.8506
epoch: 7 time elapsed: 956.25
train loss: 0.3430 train acc: 0.8897
  dev loss: 0.5087   dev acc: 0.8505
epoch: 8 time elapsed: 1077.30
train loss: 0.3239 train acc: 0.8943
  dev loss: 0.5032   dev acc: 0.8520
epoch: 9 time elapsed: 1198.44
train loss: 0.3040 train acc: 0.

In [35]:
evaluate(new_model_attn, test_corpus)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
1 utility's pred: Y UW0 T IY0 L IH1 T IY0 Z ref: Y UW0 T IH1 L AH0 T IY0 Z cer: 0.2222
2 rothenberg pred: R OW1 TH AH0 N B ER0 G ref: R AO1 TH AH0 N B ER0 G cer: 0.1250
3 kinesiology pred: K IH2 N IH0 S IY0 AA1 L AH0 JH IY0 ref: K IH2 N IH0 S IY2 AA1 L AH0 JH IY0 cer: 0.0909
4 reclassified pred: R IY0 K L AE1 S IH0 F AY2 D ref: R IY0 K L AE1 S AH0 F AY2 D cer: 0.1000
5 substantive pred: S AH0 B S T AE1 N T IH0 V ref: S AH1 B S T AH0 N T IH0 V cer: 0.2000
6 situations pred: S IH2 T UW0 EY1 SH AH0 N Z ref: S IH2 CH UW0 EY1 SH AH0 N Z cer: 0.1111
7 lobianco pred: L OW0 B IY0 AA1 L K OW0 ref: L OW0 B IY0 AA1 N K OW0 cer: 0.1250
8 participants' pred: P AA0 R T IH1 S AH0 P AH0 N T S ref: P AA0 R T IH1 S AH0 P AH0 N T S cer: 0.0000
9 transportable pred: T R AE1 N S P AO2 R T AH0 B AH0 L ref: T R AE0 N S P AO1 R T AH0 B AH0 L cer: 0.1538
10 inscribes pred: IH2 N S K R AY1 B Z ref: IH2 N S K R AY1 B Z cer: 0.0000
11 grandbaby pred