In [1]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs236299-2022-spring/lab2-5.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [2]:
# Initialize Otter
import otter
grader = otter.Notebook()

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$

# Course 236299
## Lab 2-5 - Sequence labeling with recurrent neural networks

In the last lab, you saw how to use hidden Markov models (HMMs) for sequence labeling. In this lab, you will use recurrent neural networks (RNNs) for sequence labeling. 

In this lab, we consider the task of automatic punctuation restoration from unpunctuated text, which is useful for post-processing transcribed speech from speech recognition systems (since we don't want users to have to utter all punctuation marks). We can formulate this task as a sequence labeling task, predicting for each word the punctuation that should follow. If there's no punctuation following the word, we use a special tag `O` for "other".

The dataset we use is the Federalist papers, but this time we use text without punctuation as our input, and predict the punctuation following each word. An example constructed from the dataset is shown below; it corresponds to the punctuated sentence `the powers to make treaties and to send and receive ambassadors , speak their own propriety .`

| Token       | Label  |
| ----------- | ------ |
| &lt;bos&gt; | O |
| the         | O |
| powers      | O |
| to          | O |
| make        | O |
| treaties    | O |
| and         | O |
| to          | O |
| send        | O |
| and         | O |
| receive     | O |
| ambassadors | , |
| speak       | O |
| their       | O |
| own         | O |
| propriety   | . |
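
The mapping from a punctuated sentence to the (token, label) pairs in the table can be sketched in a few lines. This is only an illustration of the labeling scheme, not the script that produced the dataset: the function name `to_tagged_pairs`, the `PUNCTUATION` set, and the rule that each punctuation mark attaches to the token before it are all assumptions.

```python
# Hypothetical helper: the dataset's real preprocessing is not shown in this lab.
PUNCTUATION = {",", ".", ";", "?", ":", "!", "(", ")", "'"}

def to_tagged_pairs(punctuated_sentence, bos="<bos>"):
    """Turn a whitespace-tokenized, punctuated sentence into (token, label) pairs."""
    pairs = [[bos, "O"]]                 # the sequence starts with <bos>, labeled O
    for token in punctuated_sentence.split():
        if token in PUNCTUATION:
            pairs[-1][1] = token         # attach the mark to the preceding token
        else:
            pairs.append([token, "O"])   # a word with no following punctuation (yet)
    return [tuple(pair) for pair in pairs]

sentence = ("the powers to make treaties and to send and receive "
            "ambassadors , speak their own propriety .")
for token, label in to_tagged_pairs(sentence):
    print(f"{token:12} {label}")
```

Under these assumptions the output lists the same tokens and labels as the table above, with `,` on `ambassadors` and `.` on `propriety`.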

# Preparation and setup

In [3]:
import copy

import wget
import torch
import torchtext.legacy as tt
import torch.nn as nn

from collections import Counter
from tqdm.auto import tqdm

In [4]:
## GPU check
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [5]:
# Prepare to download needed data
def download_if_needed(source, dest, filename):
    os.makedirs(dest, exist_ok=True) # ensure destination (use the argument, not the global)
    os.path.exists(f"./{dest}{filename}") or wget.download(source + filename, out=dest)

source_path = "https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/"
data_path = "data/"

# Download files
for filename in ["federalist_tag.train.txt",
                 "federalist_tag.dev.txt",
                 "federalist_tag.test.txt"
                ]:
    download_if_needed(source_path, data_path, filename)

As before, we use `torchtext` to load data. We use one field for processing the words (`TEXT`), and another for processing the tags (`TAG`).

In [6]:
# We place a limit on the size of the vocabulary, including only the 
# `MAX_VOCAB_SIZE` most frequent words. All others will become `<unk>`.
MAX_VOCAB_SIZE = 5000

# Create fields
TEXT = tt.data.Field(sequential=True, 
                     include_lengths=False, 
                     batch_first=False, # batches are of size max_len x bsz
                     tokenize=None)
TAG = tt.data.Field(sequential=True, 
                    include_lengths=False, 
                    batch_first=False,  # ditto
                    tokenize=None)
fields = (('text', TEXT), ('tag', TAG))

# Load data
tt.datasets.SequenceTaggingDataset.name = 'seq'
train_data, val_data, test_data = tt.datasets.SequenceTaggingDataset.splits(
            fields=fields, 
            path=data_path, 
            train='federalist_tag.train.txt', 
            validation='federalist_tag.dev.txt',
            test='federalist_tag.test.txt')

# Build vocabulary
TEXT.build_vocab(train_data.text, max_size=MAX_VOCAB_SIZE)
TAG.build_vocab(train_data.tag)

# Print out some stats
vocab_size = len(TEXT.vocab.itos)
num_labels = len(TAG.vocab.itos)

print (f"Size of English vocabulary: {len(TEXT.vocab)}")
print (f"Most common English words: {TEXT.vocab.freqs.most_common(10)}\n")

print (f"Number of tags: {len(TAG.vocab)}")
print (f"Most common tags: {TAG.vocab.freqs.most_common(10)}")

Size of English vocabulary: 5002
Most common English words: [('the', 12401), ('of', 8251), ('to', 5107), ('and', 3325), ('in', 3159), ('a', 2878), ('be', 2666), ('that', 1954), ('it', 1805), ('is', 1552)]

Number of tags: 12
Most common tags: [('O', 118169), (',', 8962), ('.', 3493), (';', 1036), ('?', 305), (':', 153), (')', 26), ('(', 25), ('!', 10), ("'", 2)]


You can see from above that the most common punctuation mark is the comma, on which we will later evaluate precision, recall, and F-1 scores.
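
To get a feel for how skewed the labels are, the tag counts printed above can be turned into relative frequencies. The sketch below copies those counts by hand and assumes they cover every training token; `tag_shares` is an illustrative name, not part of the lab code.

```python
# Tag counts copied from the training-set statistics printed above.
tag_counts = {"O": 118169, ",": 8962, ".": 3493, ";": 1036, "?": 305,
              ":": 153, ")": 26, "(": 25, "!": 10, "'": 2}

def tag_shares(counts):
    """Return each tag's share of all labeled tokens."""
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

shares = tag_shares(tag_counts)
for tag, share in shares.items():
    print(f"{tag!r:5}: {share:.4f}")
```

Roughly 89% of the training tokens carry the `O` tag, which is worth keeping in mind when reading the accuracy figures later in the lab.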

In [7]:
comma_id = TAG.vocab.stoi[',']
print (f"Id of comma: {comma_id}")

Id of comma: 3


We mapped words that are not among the most frequent words (specified by `MAX_VOCAB_SIZE`) to a special unknown token:

In [8]:
unk_type = TEXT.unk_token
unk_index = TEXT.vocab.stoi[unk_type]

print (f"Unknown word: {unk_type}\n"
       f"Unknown index: {unk_index}")

Unknown word: <unk>
Unknown index: 0
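
The unknown-word mapping can be illustrated without `torchtext`. The sketch below is a simplified stand-in: `build_vocab` and `lookup` are hypothetical helpers, and placing the special tokens `<unk>` and `<pad>` at indices 0 and 1 is an assumption chosen to match the indices printed in this notebook.

```python
from collections import Counter

def build_vocab(tokens, max_size, specials=("<unk>", "<pad>")):
    """Keep the `max_size` most frequent tokens, after the special tokens."""
    counts = Counter(tokens)
    itos = list(specials) + [word for word, _ in counts.most_common(max_size)]
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi

def lookup(stoi, word, unk="<unk>"):
    """Map a word to its id, falling back to the unknown-token id."""
    return stoi.get(word, stoi[unk])

tokens = "the cat the dog the cat bird".split()
itos, stoi = build_vocab(tokens, max_size=2)
print(itos)
print([lookup(stoi, word) for word in ["the", "cat", "bird"]])
```

With a cap of 2, only `the` and `cat` get their own ids, and every other word shares the id of `<unk>`. This also suggests why a cap of 5000 yields a vocabulary of size 5002: the two special tokens are added on top.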


To load data in batched tensors, we use `data.BucketIterator` for the training and validation sets, which lets us iterate over each dataset in batches of size `BATCH_SIZE`, set to `1` throughout this lab. We still batch the data because other torch functions expect batched input.

For the test set, we use `data.Iterator`, which doesn't batch the data, since we'll be performing the evaluation on the test set ourselves.

In [9]:
BATCH_SIZE = 1 # we use batch size 1 for simplicity

train_iter = tt.data.BucketIterator(train_data, batch_size=BATCH_SIZE, device=device)
val_iter = tt.data.BucketIterator(val_data, batch_size=BATCH_SIZE, device=device)
test_iter = tt.data.Iterator(test_data, batch_size=BATCH_SIZE, sort=False, device=device)

Let's take a look at the dataset. Recall from homework 1 that there are two different ways of iterating over the dataset, one by iterating over individual examples, the other by iterating over batches of examples.

In [10]:
# Iterating over individual examples:
# Note that the words are the original words, so you'd need to manually 
# replace them with <unk> if not in the vocabulary.
example = train_iter.dataset[1]
text = example.text # a sequence of unpunctuated words
tags = example.tag  # a sequence of tags indicating the proper punctuation
print (f'{"TYPE":15}: {"TAG"}')
for word, tag in zip(text, tags):
  print (f'{word:15}: {tag}')

TYPE           : TAG
<bos>          : O
the            : O
last           : O
paper          : O
having         : O
concluded      : O
the            : O
observations   : O
which          : O
were           : O
meant          : O
to             : O
introduce      : O
a              : O
candid         : O
survey         : O
of             : O
the            : O
plan           : O
of             : O
government     : O
reported       : O
by             : O
the            : O
convention     : ,
we             : O
now            : O
proceed        : O
to             : O
the            : O
execution      : O
of             : O
that           : O
part           : O
of             : O
our            : O
undertaking    : .


Alternatively, we can produce the data a batch at a time, as in the example below. Note the "shape" of a batch: it is a two-dimensional tensor of size `max_length x batch_size`. (In this case, `batch_size` is 1.) Thus, to extract a sentence from a batch, we need to index by the _second_ dimension, not the first, using the Python idiom `batch[:, 0]`.

In [11]:
# Iterating over batches of examples:
# Unknown words have been mapped to unknown word ids
batch = next(iter(train_iter))
text = batch.text
example_text = text[:, 0]
print (f"Size of text batch: {text.size()}")
print (f"First sentence in batch: {example_text}")
print (f"Converted back to string: {' '.join([TEXT.vocab.itos[word_id] for word_id in example_text])}")

print ('-'*20)
tags = batch.tag
example_tags = tags[:, 0]
print (f"Size of label batch: {tags.size()}")
print (f"Tags of the first sentence in batch: {example_tags}")
print (f"Converted back to string: {' '.join([TAG.vocab.itos[tag_id] for tag_id in example_tags])}")

Size of text batch: torch.Size([468, 1])
First sentence in batch: tensor([  22,   10,   41,   18,    8, 1671,    9,    7,  421,    3,  579,  389,
          20,   48,   40,  541,  421,   16, 1696,  231,    2,  355,  139,   37,
          10,   16, 1192,  476,   10,    6,    7,  208,   12,   16,   17,    7,
         918,  135,   53,    2,  207,    5,  337,    3,    2,   32,   83,    2,
        3001,    5, 4548,    3,   61,    7,  364,   50,   16,  218,    8,    7,
         371,    0,    6,   12,    2,  746,    3, 2490,   16,    8,  714, 1066,
          18,    4,   17,   25,  352,  325,   53,    2,  264,    3,    7,  133,
           0,   26,    7, 1845,  872,    3, 1743,    5,    6,   12,  112,   67,
        1152, 3356,  427,    9,   50,   16,    8,   90,  520,  437,   10, 2152,
           4,  161,    2,  232,  992,    3,    2,  257,    3,    2,  345,  112,
          67, 2672,    4, 1476,  134,   10,    8,  430,    9,   13,  112, 4826,
           2,  486,  136,    2,   79,   82,   14,    7
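
The `max_length x batch_size` layout is easier to see on a toy batch with more than one sentence. The sketch below uses plain nested lists instead of tensors; `to_time_major` and `column` are hypothetical helpers, and the padding id of 1 is an assumption.

```python
def to_time_major(sequences, pad_id):
    """Pad sequences to equal length and lay them out as max_len x batch_size."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # zip(*padded) turns batch-major rows into one row per time step
    return [list(step) for step in zip(*padded)]

def column(batch, index):
    """Plain-list counterpart of the tensor idiom batch[:, index]."""
    return [row[index] for row in batch]

batch = to_time_major([[22, 10, 41], [7, 3]], pad_id=1)
print(batch)             # one inner list per time step
print(column(batch, 0))  # the first sentence
print(column(batch, 1))  # the second sentence, padded at the end
```

Each inner list holds one time step across all sentences, which is why a whole sentence is recovered by selecting a column rather than a row.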

Given the tags of an unpunctuated sequence of words, we can easily restore the punctuation:

In [12]:
def restore_punctuation(words, tags):
  words_with_punc = []
  for word, tag in zip(words, tags):
    words_with_punc.append(word)
    if tag != 'O':
      words_with_punc.append(tag)
  return ' '.join(words_with_punc)

In [13]:
print(restore_punctuation(example.text, example.tag))

<bos> the last paper having concluded the observations which were meant to introduce a candid survey of the plan of government reported by the convention , we now proceed to the execution of that part of our undertaking .


# Majority Labeling

Recall from our previous lab that a naive baseline is choosing the majority label for each word in the sequence, where the majority label depends on the word. We've provided an implementation of this baseline for you, to give you a sense of how difficult the punctuation restoration task is.

In [14]:
class MajorityTagger():
  def __init__(self):
    """Initializer.
    """
    self.most_common_label_given_word = {}

  def train_all(self, train_iter):
    """Finds the majority label for each word in the training set.
    """
    train_counts_given_word = {}
    for ex in train_iter.dataset:
      for word, tag in zip(ex.text, ex.tag):
        if TEXT.vocab.stoi[word] == unk_index:
          word = unk_type
        if word not in train_counts_given_word:
          train_counts_given_word[word] = Counter([])
        train_counts_given_word[word].update([tag])
    
    for word in train_counts_given_word:
      # Find the most common
      most_common_label = train_counts_given_word[word].most_common(1)[0][0]
      self.most_common_label_given_word[word] = most_common_label

  def predict_all(self, test_iter):
    """Predicts labels for each example in test_iter.
       Returns a list of list of strings. The order should be the same as
       in `test_iter.dataset` (or equivalently `test_iter`).
    """
    predictions = []
    for example in test_iter.dataset:
      words = example.text
      tags_pred = []
      for word in words:
        if TEXT.vocab.stoi[word] == unk_index:
          word = unk_type
        tag_pred = self.most_common_label_given_word[word]
        tags_pred.append(tag_pred)
      predictions.append(tags_pred)
    return predictions # a list of lists of strings

  def evaluate(self, test_iter):
    """Evaluates the overall accuracy, and the precision and recall of comma.
    """
    correct = 0
    total = 0
    true_positive_comma = 0
    predicted_positive_comma = 0
    total_positive_comma = 0

    # Get predictions
    predictions = self.predict_all(test_iter)
    assert len(predictions) == len(test_iter.dataset)
    
    for tags_pred, example in zip(predictions, test_iter.dataset):
      tags = example.tag
      assert len(tags_pred) == len(tags)
      for tag_pred, tag in zip(tags_pred, tags):
        total += 1
        if tag_pred == tag:
          correct += 1
        if tag_pred == ',':
          predicted_positive_comma += 1 # predicted positive
        if tag == ',':
          total_positive_comma += 1     # gold label positive
        if tag_pred == ',' and tag == ',':
          true_positive_comma += 1      # true positive
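    # Guard against empty denominators (e.g. no commas predicted at all)
    precision_comma = true_positive_comma / predicted_positive_comma if predicted_positive_comma else 0.
    recall_comma = true_positive_comma / total_positive_comma if total_positive_comma else 0.
    F1_comma = (2. * precision_comma * recall_comma / (precision_comma + recall_comma)
                if precision_comma + recall_comma else 0.)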
    return correct/total, precision_comma, recall_comma, F1_comma

Now, we can train our baseline on training data. 

In [15]:
maj_tagger = MajorityTagger()
maj_tagger.train_all(train_iter)

Let's take a look at an example prediction using this simple baseline.

In [16]:
# Get all predictions
predictions = maj_tagger.predict_all(test_iter)

# Pick one example
example_id = 2 # the third example
example = test_iter.dataset[example_id]
prediction = predictions[example_id]

print('Ground truth punctuation:')
print(restore_punctuation(example.text, example.tag), '\n')
print('Predicted punctuation:')
print(restore_punctuation(example.text, prediction))

Ground truth punctuation:
<bos> the several departments being perfectly co-ordinate by the terms of their common commission , none of them , it is evident , can pretend to an exclusive or superior right of settling the boundaries between their respective powers ; and how are the encroachments of the stronger to be prevented , or the wrongs of the weaker to be redressed , without an appeal to the people themselves , who , as the grantors of the commissions , can alone declare its true meaning , and enforce its observance ? there is certainly great force in this reasoning , and it must be allowed to prove that a constitutional road to the decision of the people ought to be marked out and kept open , for certain great and extraordinary occasions . but there appear to be insuperable objections against the proposed recurrence to the people , as a provision in all cases for keeping the several departments of power within their constitutional limits . in the first place , the provision does n

This baseline model clearly grossly underpunctuates. It predicts the tag to be `O` almost all of the time.

We can quantitatively evaluate the performance of the majority labeling tagger, which establishes a baseline that any reasonable model should outperform.

In [17]:
accuracy, precision_comma, recall_comma, F1_comma = maj_tagger.evaluate(test_iter)
print (f"Overall Accuracy: {accuracy:.4f}. \n"
       f"Comma: Precision: {precision_comma:.4f}. Recall: {recall_comma:.4f}. F1: {F1_comma:.4f}")

Overall Accuracy: 0.8900. 
Comma: Precision: 0.3569. Recall: 0.0663. F1: 0.1118
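
The relationship between the three comma scores can be checked by hand. The sketch below recomputes F1 as the harmonic mean of precision and recall; the function names are illustrative, and returning 0 when a denominator is empty is a convention assumed here, not something the lab prescribes.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(true_positive, predicted_positive, gold_positive):
    """Compute precision, recall, and F1 from raw counts for one label."""
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / gold_positive if gold_positive else 0.0
    return precision, recall, f1_score(precision, recall)

# Plug in the baseline's rounded precision and recall from the output above.
print(f"{f1_score(0.3569, 0.0663):.4f}")
```

Because the harmonic mean is dominated by the smaller of its two arguments, a recall below 0.07 keeps F1 low no matter how precise the few predicted commas are.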


<!-- BEGIN QUESTION -->

**Question:** You can see that even though the overall accuracy is pretty high, the F-1 score of comma is very low. Why?

<!--
BEGIN QUESTION
name: open_response_F1
manual: true
-->

_Most labels in both the training and test sets are `O`, so the majority baseline can almost never predict a comma and still reach high accuracy._

_The F-1 score for comma is the harmonic mean of precision and recall. Recall is low when true positives are few and false negatives are many. That is exactly the case here: the baseline rarely predicts a comma even where one is needed, so true positives are few and false negatives are many (recall of about 0.07). Precision is also fairly low (about 0.36), because only about a third of the commas the baseline does predict are correct._

_To sum up, recall is very low and precision is fairly low as well, which gives a low overall F-1 score._

<!-- END QUESTION -->



# RNN Sequence Tagging

Now we get to the real point, using an RNN model for sequence tagging. We provide a base class `RNNBaseTagger` below, which implements training and evaluation. Throughout the rest of this lab, you will implement three subclasses of this class, using PyTorch functions at different abstraction levels.

In [18]:
class RNNBaseTagger(nn.Module):
  def __init__(self):
    super().__init__()

  def init_parameters(self, init_low=-0.15, init_high=0.15):
    """Initialize parameters. We usually use larger initial values for smaller models.
    See http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf for a more
    in-depth discussion.
    """
    for p in self.parameters():
      p.data.uniform_(init_low, init_high)

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      logits: a tensor of size (seq_len, 1, self.N)
    """
    raise NotImplementedError

  def compute_loss(self, logits, tags):
    return self.loss_function(logits.view(-1, self.N), tags.view(-1))

  def train_all(self, train_iter, val_iter, epochs=5, learning_rate=1e-3):
    # Switch the module to training mode
    self.train()
    # Use Adam to optimize the parameters
    optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
    best_validation_accuracy = -float('inf')
    self.best_model = None  # state dict of the best model so far (set below)
    # Run the optimization for multiple epochs
    for epoch in range(epochs): 
      total = 0
      running_loss = 0.0
      for batch in tqdm(train_iter):
        # Zero the parameter gradients
        self.zero_grad()

        # Input and target
        words = batch.text # seq_len, 1
        tags = batch.tag   # seq_len, 1
        
        # Run forward pass and compute loss along the way.
        logits = self.forward(words)
        loss = self.compute_loss(logits, tags)
        
        # Perform backpropagation
        (loss/words.size(1)).backward()

        # Update parameters
        optim.step()

        # Training stats
        total += 1
        running_loss += loss.item()
        
      # Evaluate and track improvements on the validation dataset
      validation_accuracy, _, _, _ = self.evaluate(val_iter)
      if validation_accuracy > best_validation_accuracy:
        best_validation_accuracy = validation_accuracy
        self.best_model = copy.deepcopy(self.state_dict())
      epoch_loss = running_loss / total
      print (f'Epoch: {epoch} Loss: {epoch_loss:.4f} '
             f'Validation accuracy: {validation_accuracy:.4f}')

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      tag_batch: a tensor containing tag ids of size (seq_len, 1)
    """
    raise NotImplementedError

  def evaluate(self, iterator):
    """Returns the model's performance on a given dataset `iterator`.

    Arguments: 
      iterator
    Returns:
      overall accuracy, and precision, recall, and F1 for comma
    """
    correct = 0
    total = 0
    true_positive_comma = 0
    predicted_positive_comma = 0
    total_positive_comma = 0
    pad_id = TAG.vocab.stoi[TAG.pad_token]
    for batch in tqdm(iterator):
      words = batch.text
      tags = batch.tag
      tags_pred = self.predict(words)
      mask = tags.ne(pad_id)
      cor = (tags == tags_pred)[mask]
      correct += cor.float().sum().item()
      total += mask.float().sum().item()
      predicted_positive_comma += (mask * tags_pred.eq(comma_id)).float().sum().item()
      true_positive_comma += (mask * tags.eq(comma_id) * tags_pred.eq(comma_id)).float().sum().item()
      total_positive_comma += (mask * tags.eq(comma_id)).float().sum().item()

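    # Guard against empty denominators (e.g. no commas predicted at all)
    precision_comma = true_positive_comma / predicted_positive_comma if predicted_positive_comma else 0.
    recall_comma = true_positive_comma / total_positive_comma if total_positive_comma else 0.
    F1_comma = (2. * precision_comma * recall_comma / (precision_comma + recall_comma)
                if precision_comma + recall_comma else 0.)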
    return correct/total, precision_comma, recall_comma, F1_comma
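
The `compute_loss` method flattens logits and tags before handing them to `self.loss_function`, which the subclasses in this lab set to a cross-entropy loss with `reduction='sum'` and an ignored padding index. The following pure-Python sketch shows what such a loss computes; it is meant as an illustration of the arithmetic, not a replacement for the PyTorch module, and `summed_cross_entropy` is a hypothetical name.

```python
import math

def summed_cross_entropy(logits, targets, ignore_index):
    """Sum of -log softmax(logits)[target] over positions, skipping ignored targets."""
    total = 0.0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded positions contribute nothing to the loss
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(score - top) for score in scores))
        total += log_norm - scores[target]
    return total

logits = [[0.0, 0.0], [1.0, 3.0]]   # two positions, two classes
targets = [0, 1]
print(summed_cross_entropy(logits, targets, ignore_index=1))
```

In this toy call the second position is skipped because its target equals the ignored index, so only the first position (two equal logits) contributes to the sum.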

## RNN from scratch

In this part of the lab, you will implement the forward pass of an RNN from scratch. Recall that 

\begin{align}
h_0 &= 0\\
h_t &= \sigma(\vect{U}x_t + \vect{V}h_{t - 1} + b_h) \\
o_t &= \vect{W}h_t + b_o
\end{align}

where we embed each word and use its embedding as $x_t$, and we use $o_t$ as the output logits. (Again, the final softmax has been absorbed into the loss function so you don't need to implement that.) Note that we added bias vectors $b_h$ and $b_o$ in this lab since we are training very small models. (In large models, having a bias vector matters a lot less.) In this part, you should implement the RNN from scratch and *not* use `nn.RNN`. We'll make use of this convenient PyTorch module in the next part. 
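
To make the recurrence concrete before writing it with tensors, here is a minimal pure-Python sketch of the three equations above on plain lists. It assumes $\sigma$ is `tanh`; the helper names and the toy weights are made up for illustration, and the graded implementation should still use PyTorch operations.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

def rnn_forward(xs, U, V, W, b_h, b_o):
    """Apply h_t = tanh(U x_t + V h_{t-1} + b_h) and o_t = W h_t + b_o to each x_t."""
    h = [0.0] * len(b_h)                 # h_0 = 0
    outputs = []
    for x in xs:
        h = [math.tanh(a) for a in add(matvec(U, x), matvec(V, h), b_h)]
        outputs.append(add(matvec(W, h), b_o))
    return outputs

# Toy sizes: embedding size 2, hidden size 2, 3 output labels.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.0, 0.0], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = rnn_forward([[1.0, 0.0], [0.0, 0.0]], U, V, W, [0.0, 0.0], [0.0, 0.0, 0.0])
print(outputs)
```

The result has one logit vector per time step, mirroring the `(seq_len, 1, self.N)` shape expected from `forward` once the batch dimension is added.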

You will need to implement both the `forward` function and the `predict` function.

> Hint: You might find [`torch.stack`](https://pytorch.org/docs/stable/generated/torch.stack.html) useful for stacking a list of tensors to form a single tensor. You can also use `torch.mv` or `@` for matrix-vector multiplication, `torch.mm` or `@` for matrix-matrix multiplication.

**Warning: Training this and later models takes a little while, likely around three minutes for the full set of epochs. You might want to set the number of epochs to a small number (1?) until your code is running well. You should also feel free to move ahead to the next parts while earlier parts are running.**

In [19]:
class RNNTagger1(RNNBaseTagger):
  def __init__(self, text, tag, embedding_size, hidden_size):
    super().__init__()
    self.text = text
    self.tag = tag
    self.N = len(tag.vocab.itos)   # tag vocab size
    self.V = len(text.vocab.itos)  # text vocab size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size

    # Create essential modules
    self.word_embeddings = nn.Embedding(self.V, embedding_size) # Lookup layer
    self.U = nn.Parameter(torch.Tensor(hidden_size, embedding_size))
    self.V = nn.Parameter(torch.Tensor(hidden_size, hidden_size)) # rebinds self.V: vocab size above, recurrent weights from here on
    self.b_h = nn.Parameter(torch.Tensor(hidden_size))
    self.sigma = nn.Tanh() # Nonlinear Layer
    self.W = nn.Parameter(torch.Tensor(self.N, hidden_size))
    self.b_o = nn.Parameter(torch.Tensor(self.N))

    # Create loss function
    pad_id = self.tag.vocab.stoi[self.tag.pad_token]
    self.loss_function = nn.CrossEntropyLoss(reduction='sum', ignore_index=pad_id)

    # Initialize parameters
    self.init_parameters()

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      logits: a tensor of size (seq_len, 1, self.N)
    """
    h0 = torch.zeros(self.hidden_size, device=device)
    word_embeddings = self.word_embeddings(text_batch) # seq_len, 1, embedding_size
    seq_len = word_embeddings.size(0)
    #TODO: your code below
    h_t=h0
    out_list = []
    for embed_word in word_embeddings:
      h_t = self.sigma(self.U@embed_word.view(-1) + (self.V@h_t) + self.b_h)
      o_t = (self.W@h_t) + (self.b_o)
      out_list.append(o_t)
      
    logits = (torch.stack(out_list, dim=0)).view(-1,1,self.N)
    # logits = torch.unsqueeze(logits, dim=1)
    return logits

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      tag_batch: a tensor containing tag ids of size (seq_len, 1)
    """
    #TODO: your code below
    tag_batch = self.forward(text_batch)
    tag_batch = torch.argmax(tag_batch, dim=2)
    return tag_batch

In [20]:
# Instantiate and train classifier
rnn_tagger1 = RNNTagger1(TEXT, TAG, embedding_size=32, hidden_size=32).to(device)
rnn_tagger1.train_all(train_iter, val_iter, epochs=5, learning_rate=1e-3) 
rnn_tagger1.load_state_dict(rnn_tagger1.best_model)

# Evaluate model performance
train_accuracy1, train_p1, train_r1, train_f1 = rnn_tagger1.evaluate(train_iter)
test_accuracy1, test_p1, test_r1, test_f1 = rnn_tagger1.evaluate(test_iter)
print(f'\nTraining accuracy: {train_accuracy1:.3f}, precision: {train_p1:.3f}, recall: {train_r1:.3f}, F-1: {train_f1:.3f}\n'
      f'Test accuracy: {test_accuracy1:.3f}, precision: {test_p1:.3f}, recall: {test_r1:.3f}, F-1: {test_f1:.3f}')

Epoch: 0 Loss: 78.4316 Validation accuracy: 0.8926
Epoch: 1 Loss: 54.5763 Validation accuracy: 0.8929
Epoch: 2 Loss: 51.5446 Validation accuracy: 0.8943
Epoch: 3 Loss: 49.6211 Validation accuracy: 0.8940
Epoch: 4 Loss: 48.2202 Validation accuracy: 0.8938


Training accuracy: 0.898, precision: 0.467, recall: 0.114, F-1: 0.183
Test accuracy: 0.893, precision: 0.399, recall: 0.094, F-1: 0.153


In [21]:
grader.check("rnn1")

Did your model outperform the baseline? Don't be surprised if it doesn't: the model is very small and the dataset is small as well.

## RNN forward using `nn.RNN` and explicit loop through time steps

In this part, you will use `nn.RNN` and `nn.Linear` to implement the forward pass:

\begin{align}
h_0 &= 0\\
h_t &= \text{nn.RNN}(x_t,h_{t - 1}) \\
o_t &= \text{nn.Linear}(h_t)
\end{align}

You will need to implement both the `forward` function and the `predict` function. You'll use the `nn.RNN` function to implement each time step of the RNN, with an explicit `for` loop to step through the time steps. (In the next part, you'll use a single call to `nn.RNN` to handle the entire process!) For the linear projection from RNN outputs to logits, use `self.hidden2output`.

> Hint: you can reuse your `predict` implementation from before if you wrote it in a general way.

In [22]:
class RNNTagger2(RNNBaseTagger):
  def __init__(self, text, tag, embedding_size, hidden_size):
    super().__init__()
    self.text = text
    self.tag = tag
    self.N = len(tag.vocab.itos)   # tag vocab size
    self.V = len(text.vocab.itos)  # text vocab size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size

    # Create essential modules
    self.word_embeddings = nn.Embedding(self.V, embedding_size) # Lookup layer
    self.rnn = nn.RNN(input_size=embedding_size, hidden_size=hidden_size)
    self.hidden2output = nn.Linear(hidden_size, self.N)

    # Create loss function
    pad_id = self.tag.vocab.stoi[self.tag.pad_token]
    self.loss_function = nn.CrossEntropyLoss(reduction='sum', ignore_index=pad_id)

    # Initialize parameters
    self.init_parameters()

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      logits: a tensor of size (seq_len, 1, self.N)
    """
    # h0 is of shape (num_layers * num_directions, batch, hidden_size),
    # which is (1, 1, hidden_size)
    h0 = torch.zeros(1, 1, self.hidden_size, device=device)
    #TODO: your code below, using an *explicit for-loop*
    word_embeddings = self.word_embeddings(text_batch) # seq_len, 1, embedding_size
    seq_len = word_embeddings.size(0)

    out_list = []
    h_t = h0
    for embed_word in word_embeddings:
      _, h_t = self.rnn(embed_word.view(1,1,self.embedding_size),h_t)
      o_t = self.hidden2output(h_t)
      out_list.append(o_t)
      
    logits = (torch.stack(out_list, dim=0)).view(-1,1,self.N)
    return logits

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      tag_batch: a tensor containing tag ids of size (seq_len, 1)
    """
    #TODO: your code below
    tag_batch = self.forward(text_batch)
    tag_batch = torch.argmax(tag_batch, dim=2)
    return tag_batch

In [23]:
# Instantiate and train classifier
rnn_tagger2 = RNNTagger2(TEXT, TAG, embedding_size=32, hidden_size=32).to(device)
rnn_tagger2.train_all(train_iter, val_iter, epochs=5, learning_rate=1e-3)
rnn_tagger2.load_state_dict(rnn_tagger2.best_model)

# Evaluate model performance
train_accuracy2, train_p2, train_r2, train_f2 = rnn_tagger2.evaluate(train_iter)
test_accuracy2, test_p2, test_r2, test_f2 = rnn_tagger2.evaluate(test_iter)
print(f'\nTraining accuracy: {train_accuracy2:.3f}, precision: {train_p2:.3f}, recall: {train_r2:.3f}, F-1: {train_f2:.3f}\n'
      f'Test accuracy: {test_accuracy2:.3f}, precision: {test_p2:.3f}, recall: {test_r2:.3f}, F-1: {test_f2:.3f}')

Epoch: 0 Loss: 83.5978 Validation accuracy: 0.8925
Epoch: 1 Loss: 54.8525 Validation accuracy: 0.8927
Epoch: 2 Loss: 52.7356 Validation accuracy: 0.8928
Epoch: 3 Loss: 51.1110 Validation accuracy: 0.8931
Epoch: 4 Loss: 49.5872 Validation accuracy: 0.8910


Training accuracy: 0.898, precision: 0.443, recall: 0.103, F-1: 0.167
Test accuracy: 0.893, precision: 0.369, recall: 0.076, F-1: 0.126


In [24]:
grader.check("rnn2")

## RNN forward using bidirectional `nn.RNN`

Instead of using a for loop, we can directly feed the entire sequence to `nn.RNN`:

\begin{align}
h_0 &= 0\\
H &= \text{nn.RNN}(X,h_0) \\
O &= \text{nn.Linear}(H)
\end{align}

where $X$ is the concatenation of $x_1, \cdots, x_T$, $H$ is the concatenation of $h_1, \cdots, h_T$, and $O$ is the concatenation of $o_1, \cdots, o_T$. 

By using this formulation, our code becomes more efficient, since `nn.RNN` is highly optimized. In addition, we can use a bidirectional RNN by simply passing `bidirectional=True` to the RNN constructor.

The difference between a bidirectional RNN and a unidirectional RNN is that bidirectional RNNs have an additional RNN cell running in the reverse direction:

\begin{align}
h_{T+1}' &= 0\\
h_t' &= \sigma(\vect{U}'x_{t} + \vect{V}'h_{t + 1}' + b_h')
\end{align}

To get the output at step $t$, a bidirectional RNN simply concatenates $h_t$ and $h_t'$ and projects to produce outputs. The benefit of a bidirectional RNN is that the output at step $t$ takes into account not only words $x_1,\cdots, x_t$, but also $x_{t+1}, \cdots, x_T$.
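
The effect of the reverse direction can be seen in a scalar toy model. The sketch below is pure Python with `tanh` as the nonlinearity and made-up weights; `run_rnn` and `bidirectional_states` are hypothetical names, and it only pairs up the two hidden states per step, leaving out the final projection.

```python
import math

def run_rnn(xs, u, v, b):
    """Scalar RNN: h_t = tanh(u * x_t + v * h_prev + b), starting from h = 0."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(u * x + v * h + b)
        states.append(h)
    return states

def bidirectional_states(xs, fwd, bwd):
    """Pair the left-to-right state h_t with the right-to-left state h_t' per step."""
    forward = run_rnn(xs, *fwd)
    # Run the second RNN over the reversed input, then restore the original order.
    backward = run_rnn(xs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

weights = (1.0, 1.0, 0.0)  # (u, v, b), chosen arbitrarily
first = bidirectional_states([1.0, 0.0, 0.0], weights, weights)
second = bidirectional_states([1.0, 0.0, 5.0], weights, weights)
print(first[0])
print(second[0])
```

Changing only the last input leaves the forward component at step 1 untouched but alters the backward component, which is how later words come to influence the output at earlier positions.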

Implement `forward` and `predict` functions below, using a bidirectional RNN.

In [25]:
class RNNTagger3(RNNBaseTagger):
  def __init__(self, text, tag, embedding_size, hidden_size):
    super().__init__()
    self.text = text
    self.tag = tag
    self.N = len(tag.vocab.itos)   # tag vocab size
    self.V = len(text.vocab.itos)  # text vocab size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size

    # Create essential modules
    self.word_embeddings = nn.Embedding(self.V, embedding_size) # Lookup layer
    self.rnn = nn.RNN(input_size=embedding_size, 
                      hidden_size=hidden_size,
                      bidirectional=True)
    self.hidden2output = nn.Linear(hidden_size*2, self.N) # *2 due to using bi-rnn

    # Create loss function
    pad_id = self.tag.vocab.stoi[self.tag.pad_token]
    self.loss_function = nn.CrossEntropyLoss(reduction='sum', ignore_index=pad_id)

    # Initialize parameters
    self.init_parameters()

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      logits: a tensor of size (seq_len, 1, self.N)
    """
    hidden = None # equivalent to setting hidden to a zero vector
    #TODO: your code below, without using any for-loops
    word_embeddings = self.word_embeddings(text_batch) # seq_len, 1, embedding_size
    H, _ = self.rnn(word_embeddings, hidden)           # seq_len, 1, 2*hidden_size
    O = self.hidden2output(H)                          # seq_len, 1, self.N
    return O

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (seq_len, 1) 
    Returns:
      tag_batch: a tensor containing tag ids of size (seq_len, 1)
    """
    #TODO: your code below
    logits = self.forward(text_batch)        # seq_len, 1, self.N
    tag_batch = torch.argmax(logits, dim=2)  # seq_len, 1: highest-scoring tag per word
    return tag_batch

In [26]:
# Instantiate and train classifier
rnn_tagger3 = RNNTagger3(TEXT, TAG, embedding_size=32, hidden_size=32).to(device)
rnn_tagger3.train_all(train_iter, val_iter, epochs=5, learning_rate=1e-3)
rnn_tagger3.load_state_dict(rnn_tagger3.best_model)

# Evaluate model performance
train_accuracy3, train_p3, train_r3, train_f3 = rnn_tagger3.evaluate(train_iter)
test_accuracy3, test_p3, test_r3, test_f3 = rnn_tagger3.evaluate(test_iter)
print(f'\nTraining accuracy: {train_accuracy3:.3f}, precision: {train_p3:.3f}, recall: {train_r3:.3f}, F-1: {train_f3:.3f}\n'
      f'Test accuracy: {test_accuracy3:.3f}, precision: {test_p3:.3f}, recall: {test_r3:.3f}, F-1: {test_f3:.3f}')

Epoch: 0 Loss: 67.6205 Validation accuracy: 0.9052
Epoch: 1 Loss: 43.2556 Validation accuracy: 0.9090
Epoch: 2 Loss: 38.9550 Validation accuracy: 0.9121
Epoch: 3 Loss: 36.0353 Validation accuracy: 0.9079
Epoch: 4 Loss: 33.4725 Validation accuracy: 0.9090

Training accuracy: 0.926, precision: 0.589, recall: 0.388, F-1: 0.468
Test accuracy: 0.913, precision: 0.508, recall: 0.335, F-1: 0.403


In [27]:
grader.check("birnn")

Let's see what our model predicts for the example we used before.

In [28]:
# Pick one example
example_id = 2 # the third example
example = test_iter.dataset[example_id]

# Process strings to word ids
text_tensor = TEXT.process([example.text,]).to(device)

# Predict
prediction_tensor = rnn_tagger3.predict(text_tensor)
prediction = [TAG.vocab.itos[tag_id] for tag_id in prediction_tensor[:, 0]]

print('Ground truth punctuation:')
print(restore_punctuation(example.text, example.tag))
print('Predicted punctuation:')
print(restore_punctuation(example.text, prediction))

Ground truth punctuation:
<bos> the several departments being perfectly co-ordinate by the terms of their common commission , none of them , it is evident , can pretend to an exclusive or superior right of settling the boundaries between their respective powers ; and how are the encroachments of the stronger to be prevented , or the wrongs of the weaker to be redressed , without an appeal to the people themselves , who , as the grantors of the commissions , can alone declare its true meaning , and enforce its observance ? there is certainly great force in this reasoning , and it must be allowed to prove that a constitutional road to the decision of the people ought to be marked out and kept open , for certain great and extraordinary occasions . but there appear to be insuperable objections against the proposed recurrence to the people , as a provision in all cases for keeping the several departments of power within their constitutional limits . in the first place , the provision does n

<!-- BEGIN QUESTION -->

**Question:** Did your bidirectional RNN reach a higher F-1 score than unidirectional RNNs? Why?

<!--
BEGIN QUESTION
name: open_response_birnn
manual: true
-->

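_Yes. Our bidirectional RNN reached a higher F-1 score than the unidirectional RNNs (test F-1 of 0.403, versus 0.126 for the preceding unidirectional model). A bidirectional RNN is built from two independent RNN cells, one reading the sequence forward and one reading it in reverse, and the output at each step is computed from both hidden states. Put simply, a unidirectional RNN predicts from the past only, while a bidirectional RNN predicts from both the past and the future, so it can use the words that follow when deciding which punctuation to place._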

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# Lab debrief

**Question:** We're interested in any thoughts your group has about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_This lab was quite long, mainly because of the long run time of the RNN code cells. The readings were appropriate and the points of the exercises were clear._

<!-- END QUESTION -->



# End of lab 2-5

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [29]:
grader.check_all()