# Lab 3, part 2: Neural Language Models

In the second part of this session, you will experiment with feed-forward neural language models (FFLM) and recurrent language models (RNNLM) using [PyTorch](https://www.pytorch.org). To train the models, you will be using the [WikiText-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) corpus, which is a popular LM dataset introduced in 2016:

> The WikiText language modeling dataset is a collection of texts extracted from Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), `WikiText-2` is over 2 times larger. The dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.

**NOTE:** Training on the whole corpus is time-consuming on a CPU. Make sure that you switch to a GPU runtime in Colab, or use the `train_small` corpus, which is a subset of the WikiText-2 dataset.

**Let's start by downloading the corpus:**

In [None]:
%%bash
URL="https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2"

for split in "train" "valid" "test"; do
  if [ ! -f "${split}.txt" ]; then
    echo "Downloading ${split}.txt"
    wget -q "${URL}/${split}.txt"
    # Remove empty lines
    sed -i '/^ *$/d' "${split}.txt"
    # Remove article titles starting with = and ending with =
    sed -i '/^ *= .* = $/d' "${split}.txt"
  fi
done

# Prepare smaller version for fast training neural LMs
head -n 5000 < train.txt > train_small.txt

# Print the first 10 lines with line numbers
cat -n train.txt | head -n10
echo

# Print some statistics
echo -e "\n   Line,   word,   character counts"
wc *.txt

## Setting up the environment

In [None]:
# in order to allow deterministic behaviour, that is, make results reproducible
%env CUBLAS_WORKSPACE_CONFIG=:4096:8

In [None]:
import math
import time
import random

import numpy as np

# Fancy progress bar
from tqdm import tqdm

import torch
from torch import nn

###############
# Torch setup #
###############
print('Torch version: {}, CUDA: {}'.format(torch.__version__, torch.version.cuda))
cuda_available = torch.cuda.is_available()
if not cuda_available:
  print('WARNING: You may want to change the runtime to GPU for Neural LM experiments!')
  DEVICE = 'cpu'
else:
  DEVICE = 'cuda:0'

#######################
# Some helper functions
#######################
def fix_seed(seed=None):
  """Sets the seeds of random number generators."""
  torch.use_deterministic_algorithms(True)
  if seed is None:
    # Take a random seed
    seed = time.time()
  seed = int(seed)
  # Seed Python's own RNG as well (the `random` module is imported above)
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  return seed

def readable_size(n):
  """Returns a readable size string for model parameters count."""
  sizes = ['K', 'M', 'G']
  fmt = ''
  size = n
  for i, s in enumerate(sizes):
    # Named `scaled` to avoid shadowing `torch.nn`, imported above as `nn`
    scaled = n / (1000 ** (i + 1))
    if scaled >= 1:
      size = scaled
      fmt = s
    else:
      break
  return '%.2f%s' % (size, fmt)

## Feed-forward Language Models (FFLM)

FFLMs are similar to $n$-gram language models in the sense that the choice of $n$ is a hyperparameter of the network architecture. A basic FFLM uses a context window of length $C = n - 1$ preceding the word to be predicted. If the word embedding size is $E$, the feature vector for the context window is a vector of size $E \times C$, obtained by **concatenating** the embeddings of the individual context words. Hence, the choice of $C$ affects the number of learnable parameters in the network.

### Representing the vocabulary

The `Vocabulary` class below encapsulates the **word-to-idx** and **idx-to-word** mapping that you should now be familiar with from the previous lab sessions. Read it to understand how the vocabulary is constructed from a plain text file, within the `build_from_file()` method. Special `<.>` markers are also included in the vocabulary.

In [None]:
class Vocabulary(object):
  """Data structure representing the vocabulary of a corpus."""
  def __init__(self):
    # Mapping from tokens to integers
    self._word2idx = {}

    # Reverse-mapping from integers to tokens
    self.idx2word = []

    # 0-padding token
    self.add_word('<pad>')
    # sentence start
    self.add_word('<s>')
    # sentence end
    self.add_word('</s>')
    # Unknown words
    self.add_word('<unk>')

    self._unk_idx = self._word2idx['<unk>']

  def word2idx(self, word):
    """Returns the integer ID of the word or <unk> if not found."""
    return self._word2idx.get(word, self._unk_idx)

  def add_word(self, word):
    """Adds the `word` into the vocabulary."""
    if word not in self._word2idx:
      self.idx2word.append(word)
      self._word2idx[word] = len(self.idx2word) - 1

  def build_from_file(self, fname):
    """Builds a vocabulary from a given corpus file."""
    with open(fname) as f:
      for line in f:
        words = line.strip().split()
        for word in words:
          self.add_word(word)

  def convert_idxs_to_words(self, idxs):
    """Converts a list of indices to words."""
    return ' '.join(self.idx2word[idx] for idx in idxs)

  def convert_words_to_idxs(self, words):
    """Converts a list of words to a list of indices."""
    return [self.word2idx(w) for w in words]

  def __len__(self):
    """Returns the size of the vocabulary."""
    return len(self.idx2word)
  
  def __repr__(self):
    return "Vocabulary with {} items".format(self.__len__())

Let's construct the vocabulary for the training set and analyse the token indices for a sentence with an unknown word.

---

**Q: Why do we map unknown tokens to a special `<unk>` token? Do you think the network will learn a useful embedding for it? If not, how can you let the network learn one?**

In [None]:
vocab = Vocabulary()
vocab.build_from_file('train.txt')
print(vocab)

# Convert sentence to list of indices, note how the last word is mapped to 3 (<unk>)
print(vocab.convert_words_to_idxs('the cat sat on a probably_an_unknown_word'.split()))

### Representing the corpus

Let's process the corpus for PyTorch: each split will end up as one large 1D sequence of token indices. Note that, in `corpus_to_tensor()`, every line is wrapped between `<s> .. </s>` tags.
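As a rough illustration of this flattening step, the sketch below uses a two-line toy corpus and a hand-written `word2idx` dictionary. Both are hypothetical stand-ins for the real files and the `Vocabulary` class; only the special-token indices (0 to 3) follow the class above.

```python
# Toy stand-ins: a two-line "corpus" and a tiny hand-made vocabulary
lines = ["the cat sat", "the dog barked"]
word2idx = {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3,
            'the': 4, 'cat': 5, 'sat': 6}

idxs = []
for line in lines:
    # Wrap every line in sentence markers, then split on whitespace
    tokens = f"<s> {line} </s>".split()
    # Words missing from the vocabulary fall back to the <unk> index
    idxs.extend(word2idx.get(tok, word2idx['<unk>']) for tok in tokens)

# One flat sequence: sentence boundaries survive only as <s> and </s> indices
print(idxs)
```

Here `dog` and `barked` are not in the toy vocabulary, so both are mapped to index 3.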

In [None]:
def corpus_to_tensor(_vocab, filename):
  # Final token indices
  idxs = []
  
  with open(filename) as data:
    for line in tqdm(data, ncols=80, unit=' line', desc=f'Reading {filename} '):
      line = line.strip()
      # Skip empty lines if any
      if line:
        # Each line is considered as a long sentence for WikiText-2:
        # wrap it in sentence markers
        line = f"<s> {line} </s>"
        # Split on whitespace and map the tokens to their indices
        idxs.extend(_vocab.convert_words_to_idxs(line.split()))
  return torch.LongTensor(idxs)

In [None]:
# Read the files, prepare the small one as well
train = corpus_to_tensor(vocab, 'train.txt')
train_small = corpus_to_tensor(vocab, 'train_small.txt')

valid = corpus_to_tensor(vocab, 'valid.txt')
test = corpus_to_tensor(vocab, 'test.txt')
print('\n')

print(f'Small training size in tokens: {readable_size(len(train_small))}')
print(f'Training size in tokens: {readable_size(len(train))}')
print(f'Validation size in tokens: {readable_size(len(valid))}')
print(f'Test size in tokens: {readable_size(len(test))}')

**Q: Print the first 20 token indices from the training set. Then print the actual words corresponding to these 20 tokens by using one of the methods provided in the `Vocabulary` class.**

In [None]:
########
# Answer
########

### Model definition

Now that we are done with data loading and vocabulary construction, we can define the actual FFLM model in PyTorch. Recall from the lectures that this model requires a pre-defined context window size $C$ which will affect the way you set up some of the linear layers. **Note that**, in contrast to the model depicted in the lecture, this model has an additional layer `ff_ctx`, which projects the context vector $c_k$ to hidden dimension $H$. This ensures that the number of parameters in the output layer does not depend on the context size, i.e. it is always $H\times V$ instead of $CE\times V$.
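To get a feeling for the sizes involved, the following back-of-the-envelope sketch counts the weights and biases of the two layouts mentioned above. The vocabulary size `V` is a hypothetical round number chosen for illustration, not the actual size of the vocabulary built earlier, and the embedding table is left out because it is identical in both cases.

```python
def linear_params(n_in, n_out):
    """Number of weights and biases in a fully connected layer."""
    return n_in * n_out + n_out

V = 33_000        # hypothetical vocabulary size, for illustration only
E, H = 128, 128   # embedding and hidden sizes

counts = {}
for C in (2, 5, 7):
    # Output layer applied directly to the C*E-dimensional context vector
    direct = linear_params(C * E, V)
    # Context vector -> hidden layer of size H -> output layer
    projected = linear_params(C * E, H) + linear_params(H, V)
    counts[C] = (direct, projected)
    print(f'C={C}: direct={direct:,}  projected={projected:,}')
```

Under these assumptions, every additional context word costs $E \times V$ extra weights in the direct layout but only $E \times H$ with the intermediate projection.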

---

**Q: Follow the comments in `__init__()` and `forward()` to fill in the missing parts with some actual code.**

In [None]:
class FFLM(nn.Module):
  def __init__(self, vocab_size, emb_dim, hid_dim, context_size, dropout=0.5):
    # Call parent's __init__ first
    super(FFLM, self).__init__()
    
    # Store arguments
    self.vocab_size = vocab_size
    self.emb_dim = emb_dim
    self.hid_dim = hid_dim
    self.context_size = context_size

    # Create the loss, don't sum or average, we'll take care of it
    # in the training loop for logging purposes
    self.loss = nn.CrossEntropyLoss(reduction='none')

    # Create the non-linearity
    self.nonlin = torch.nn.Tanh()

    # Dropout regularizer
    self.drop = nn.Dropout(p=dropout)

    ##############################
    # Fill the missing parts below
    ##############################
    # Q: Compute the dimension of the context vector
    self.context_dim = "<TODO>"
    
    # Create the embedding layer (i.e. lookup table tokens->vectors)
    self.emb = nn.Embedding(
        num_embeddings=self.vocab_size, embedding_dim=self.emb_dim,
        padding_idx=0)
 
    # This cuts the number of parameters a bit
    self.ff_ctx = nn.Linear(self.context_dim, self.hid_dim)

    ############################################
    # Output layer mapping from the output of `ff_ctx` to vocabulary size
    # Q: Fill the dimensions of the output layer
    ############################################
    self.out = nn.Linear("<TODO>")

    # Purely for informational purposes: compute # of total params
    self.n_params = 0
    for param in self.parameters():
      self.n_params += param.numel()
    self.n_params = readable_size(self.n_params)
      
  def forward(self, x, y):
    """Forward-pass of the module."""
    # Shape of x is (batch_size, context_size)

    # Get the embeddings for the token indices in `x`
    embs = self.emb(x)

    ##########################################################
    # Q: Concatenate the embeddings to form the context vector
    ##########################################################
    ctx = "<TODO>"

    #######################################################
    # Q: Apply ff_ctx -> non-lin -> dropout -> output layer
    # to obtain the logits i.e. unnormalized scores   
    #######################################################
    logits = "<TODO>"

    ###########################################################
    # Q: Use self.loss to compute the losses, return the losses
    # (true labels are in `y`)
    ###########################################################
    return "<TODO>"

  def get_batches(self, data_tensor, batch_size=64):
    """Returns a tensor of size (n_batches, batch_size, context_size + 1)."""
    # Split data into rows of n-grams followed by the (n+1)th true label
    x_y = data_tensor.unfold(0, self.context_size + 1, step=1)

    # Get the number of training n-grams
    n_samples = x_y.size()[0]

    # Hack: discard the last uneven batch for simplicity
    n_batches = n_samples // batch_size
    n_samples = n_batches * batch_size
    # Split nicely into batches, i.e. (n_batches, batch_size, context_size + 1)
    # The final element in each row is the ID of the true label to predict
    x_y = x_y[:n_samples].view(n_batches, batch_size, -1)

    # A particular batch for context_size=2 will now look like below in
    # word format. Last element for every array is the next token to be predicted
    #
    # [[<s>, cat, sat],
    #  [cat, sat, on],
    #  [sat, on,  the],
    #  [on,  the, mat],
    #   ....
    return x_y

  def train_model(self, optim, train_tensor, valid_tensor, test_tensor, n_epochs=5,
                 batch_size=64, shuffle=False):
    """Trains the model."""
    # Get batches for the training data
    batches = self.get_batches(train_tensor, batch_size)
    
    print(f'Will do {batches.size(0)} batches for an epoch.')

    for eidx in range(1, n_epochs + 1):
      start_time = time.time()
      epoch_loss = 0
      epoch_items = 0

      # Enable training mode
      self.train()

      # Shuffle the batch order or not
      if shuffle:
        batch_order = torch.randperm(batches.size(0))
      else:
        batch_order = torch.arange(batches.size(0))

      # Start training
      for iter_count, idx in enumerate(batch_order):
        batch = batches[idx].to(DEVICE)

        # split into inputs `x` and labels `y`
        x, y = batch[:, :self.context_size], batch[:, -1]

        # Clear the gradients
        optim.zero_grad()

        # loss will be a vector of size (batch_size, ) with losses per every sample
        loss = self.forward(x, y)

        # Backprop the average loss and update parameters
        loss.mean().backward()
        optim.step()

        # sum the loss for reporting, along with the denominator
        epoch_loss += loss.detach().sum()
        epoch_items += loss.numel()

        if iter_count % 1000 == 0:
          # Print progress
          loss_per_token = epoch_loss / epoch_items
          ppl = math.exp(loss_per_token)
          print(f'[Epoch {eidx:<3}] loss: {loss_per_token:6.2f}, perplexity: {ppl:6.2f}')

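      time_spent = time.time() - start_time

      # Recompute the statistics over the full epoch; the values from the
      # last progress print can be up to 999 batches old
      loss_per_token = epoch_loss / epoch_items
      ppl = math.exp(loss_per_token)
      print(f'\n[Epoch {eidx:<3}] ended with train_loss: {loss_per_token:6.2f}, ppl: {ppl:6.2f}')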
      # Evaluate on valid set
      valid_loss, valid_ppl = self.evaluate(test_set=valid_tensor)
      print(f'[Epoch {eidx:<3}] ended with valid_loss: {valid_loss:6.2f}, valid_ppl: {valid_ppl:6.2f}')
      print(f'[Epoch {eidx:<3}] completed in {time_spent:.2f} seconds\n')

    # Evaluate the final model on test set
    test_loss, test_ppl = self.evaluate(test_set=test_tensor)
    print(f' ---> Final test set performance: {test_loss:6.2f}, test_ppl: {test_ppl:6.2f}')

  def evaluate(self, test_set, batch_size=32):
    """Evaluates and computes perplexity for the given test set."""
    loss = 0

    # Get the batches
    batches = self.get_batches(test_set, batch_size)

    # Eval mode
    self.eval()

    with torch.no_grad():
      for batch in batches:
        batch = batch.to(DEVICE)

        # split into inputs `x` and labels `y`
        x, y = batch[:, :self.context_size], batch[:, -1]

        # loss will be a vector of size (batch_size, ) with losses per every sample
        # sum the loss for reporting, along with the denominator
        loss += self.forward(x, y).sum()
    
    # Normalize by the number of tokens in the test set
    loss /= batches.size()[:2].numel()

    # Switch back to training mode
    self.train()

    # Return the average loss and the corresponding perplexity
    return loss, math.exp(loss)

  def __repr__(self):
    """String representation for pretty-printing."""
    s = super(FFLM, self).__repr__()
    return f"{s}\n# of parameters: {self.n_params}"

### Training

We can now launch training using a set of sane hyper-parameters for our model. This is a 3-gram FFLM since the context size is set to 2. On a Colab GPU, a single epoch should take around 1 minute.
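To see how a context size of 2 turns the flat token sequence into 3-gram training rows, here is a small sketch of the `unfold` call used in `get_batches()`. The six consecutive integers are placeholder token indices, not real vocabulary entries.

```python
import torch

context_size = 2
tokens = torch.arange(10, 16)  # placeholder token indices 10..15

# Sliding window of size context_size + 1 with step 1 over dimension 0:
# each row holds `context_size` inputs followed by the token to predict
x_y = tokens.unfold(0, context_size + 1, 1)

# Same split as in the training loop: inputs `x` and labels `y`
x, y = x_y[:, :context_size], x_y[:, -1]

print(x_y.tolist())
print(x.tolist(), y.tolist())
```

A sequence of 6 tokens yields 4 overlapping rows, so the number of training samples is roughly the number of tokens in the corpus.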

In [None]:
# Set the seed for reproducible results
fix_seed(30494)

fflm_model = FFLM(
    len(vocab),       # vocabulary size
    emb_dim=128,      # word embedding dim
    hid_dim=128,      # hidden layer dim
    context_size=2,   # C = (N-1) if you think in n-gram LM terminology
    dropout=0.4,      # dropout probability
)

# move to device
fflm_model.to(DEVICE)

# Initial learning rate for the optimizer
FFLM_INIT_LR = 0.001

# Create the optimizer
fflm_optimizer = torch.optim.Adam(fflm_model.parameters(), lr=FFLM_INIT_LR)
print(fflm_model)

print('Starting training!')
# NOTE: If you happen to have memory errors, try decreasing the batch size
# It will print progress every 1000 batches
fflm_model.train_model(fflm_optimizer, train, valid, test, n_epochs=5, batch_size=256, shuffle=True)

**Q: If everything goes well, the first loss printed should be around 10.4. This remains the case if you change the random seed to some other number before model construction, i.e. the value does not depend on the exact initial values of the parameters. Can you come up with a simple mathematical formula which yields that value?**

In [None]:
##########################
# Answer to question above
##########################
print("<TODO: put the formula here which computes the value>")

### Exercises

With the default settings above, you should end up with a validation perplexity of $\sim257$ and a final test set perplexity of $\sim238$ at the end of the 5th epoch. Now here are some exercises that you can proceed with:

---

**Q: Remove the `tanh()` non-linearity from the code so that the context is computed as a linear combination of its embeddings. How do the results compare to the initial ones? Do you think the non-linearity helps?**

**Q: Compare the results by rerunning the training with unshuffled batches i.e. with `shuffle=False`. What do you notice in terms of results?**

**Q: Play with hyper-parameters related to dimensions and dropout. Could you find a model with smaller perplexity?**

**Q: Try with different context sizes such as 3, 5, 7, etc. What is the best perplexity you can get?**

## Recurrent Language Models (RNNLM)

It is now time to switch to more complex LMs: the recurrent ones, which can in principle condition on much larger context windows.

### Model definition

You will notice that, apart from the `train_model()` and `get_batches()` methods, the remaining parts look similar to `FFLM`. Take your time to compare the two models. Read `get_batches()` thoroughly: the way the batches are prepared and then chunked into fragments for BPTT is interesting and may be unintuitive at first glance.
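The batch layout can be hard to picture, so here is a minimal pure-Python sketch of it. It imitates what `get_batches()` does with tensors, using the alphabet as the "corpus"; the variable names are chosen for this illustration only.

```python
import string

batch_size, bptt_steps = 4, 2
text = string.ascii_lowercase                 # 26 one-letter "tokens"

# Trim the remainder ('y' and 'z') so that the text splits evenly
n_rows = len(text) // batch_size
portions = [text[i * n_rows:(i + 1) * n_rows] for i in range(batch_size)]

# Row r holds the r-th token of every portion: one column per portion
rows = [''.join(p[r] for p in portions) for r in range(n_rows)]

# First truncated-BPTT chunk: the inputs, and the same rows shifted by one
x = rows[0:bptt_steps]
y = rows[1:1 + bptt_steps]

print(rows)
print(x, y)
```

Each column is one contiguous portion of the text, so a hidden state carried over from one chunk to the next always continues the same portion.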

---

**Q: Follow the comments in the `RNNLM` class to fill in the missing parts with some actual code. You will come back here when you reach the question regarding the support for LSTM. So, skip LSTM-related TODOs for now.**

In [None]:
class RNNLM(nn.Module):
  """RNN-based LM module."""
  def __init__(self, vocab_size, emb_dim, hid_dim, rnn_type='RNN',
               n_layers=1, dropout=0.5, clip_gradient_norm=1.0,
               bptt_steps=35):
    # Call parent's __init__ first
    super(RNNLM, self).__init__()
    
    # Store arguments
    self.vocab_size = vocab_size
    self.emb_dim = emb_dim
    self.hid_dim = hid_dim
    self.clip_gradient_norm = clip_gradient_norm
    self.bptt_steps = bptt_steps
    self.n_layers = n_layers
    self.rnn_type = rnn_type.upper()

    # This will be used to store the detached histories for truncated BPTT
    self.prev_histories = None
  
    # Create the loss, don't sum or average, we'll take care of it
    # in the training loop for logging purposes
    self.loss = nn.CrossEntropyLoss(reduction='none')
    
    # Create the dropout
    self.drop = nn.Dropout(p=dropout)
    
    # Create the embedding layer as usual
    self.emb = nn.Embedding(
      num_embeddings=self.vocab_size, embedding_dim=self.emb_dim,
      padding_idx=0)
    
    # Create the RNN layer
    if self.rnn_type == 'RNN':
      self.rnn = nn.RNN( 
          input_size=self.emb_dim, hidden_size=self.hid_dim,
          num_layers=self.n_layers, nonlinearity='tanh')
    elif self.rnn_type == 'GRU':
      self.rnn = nn.GRU(
          input_size=self.emb_dim, hidden_size=self.hid_dim,
          num_layers=self.n_layers)
    elif self.rnn_type == 'LSTM':
      #####################################
      # Q: Fill in to create the LSTM layer
      #####################################
      self.rnn = "<TODO>"
   
    # Create the output layer: maps the hidden state of the RNN to vocabulary
    self.out = nn.Linear(self.hid_dim, self.vocab_size)

    # Compute number of parameters for information
    self.n_params = 0
    for param in self.parameters():
      self.n_params += param.numel()
    self.n_params = readable_size(self.n_params)

  def init_state(self, batch_size):
    """Returns the initial 0 states."""
    if self.rnn_type != 'LSTM':
      # for every layer and every sample -> 0 hidden state vector
      return torch.zeros(self.n_layers, batch_size, self.hid_dim, device=DEVICE)
    else:
      #################################################################
      # Q: Adapt the above snippet to LSTM. Check PyTorch docs
      # to understand what is the expectation of LSTM's forward() call
      # in terms of initial states.
      #################################################################
      return "<TODO>"

  def clear_hidden_states(self):
    """Set the relevant instance attribute to None."""
    self.prev_histories = None

  def save_hidden_states(self, last_states):
    """Save the detached states into the model for the next batch. `last_states`
    is the second return value of RNN/GRU/LSTM's forward() methods."""
    if isinstance(last_states, tuple):
      # This is true for LSTM
      self.prev_histories = tuple(r.detach() for r in last_states)
    else:
      self.prev_histories = last_states.detach()

  def forward(self, x, y):
    """Forward-pass of the module."""
    # Detached previous histories for a batch. If `None`, we assume
    # start of an epoch or start of an evaluation and create 0
    # vector(s) to start with.
    if self.prev_histories is None:
      self.prev_histories = self.init_state(x.shape[1])

    # Tokens -> Embeddings -> Dropout
    embs = self.drop(self.emb(x))

    # an RNN in PyTorch returns two values:
    # (1) All hidden states of the last RNN layer
    #     Shape -> (bptt_steps, batch_size, hid_dim)
    #     You'll plug the output layer on top of this to obtain
    #     the logits for each prediction.
    # (2) Hidden state h_t of last timestep for EVERY layer
    #     Shape -> (self.n_layers, batch_size, hid_dim)
    #     This is what we'll store as the previous history
    #     (NOTE: this is a tuple for LSTM which contains h_t and c_t)
    all_hids, last_hid = self.rnn(embs, self.prev_histories)

    # Detach the computation graph since we are done with BPTT for this batch
    self.save_hidden_states(last_hid)

    ##########################################################
    # Q: Apply dropout on all_hids and pass it to output layer
    ##########################################################
    logits = "<TODO>"

    # Return the losses per token/position
    return self.loss(logits.view(-1, self.vocab_size), y)

  def get_batches(self, data_tensor, batch_size):
    # NOTE: There is deliberately no shuffling here, since shuffling
    # would break the histories coming from previous steps.
    # The document is evenly divided into `batch_size` independent portions.
    # At every iteration, the BPTT window will slide over each of these
    # portions, keeping track of the previous h_t's as discussed
    # in the lecture.

    # Imagine this as `batch_size` pointers running over the text, each
    # processing its share in a continuous fashion. Although the portions may
    # have been split in a noisy way (one pointer can be starting from the
    # middle of a sentence for example), this makes training faster.
    # For instance, with the alphabet as the dataset and batch size 4, we'd get
    # ┌ a g m s ┐
    # │ b h n t │
    # │ c i o u │
    # │ d j p v │
    # │ e k q w │
    # └ f l r x ┘.
    # These columns are treated as "independent" by the model, which means that
    # the dependence of 'g' on 'f' cannot be learned, but allows more efficient
    # batch processing. The view above will further be split into chunks
    # of size `bptt_steps` to apply truncated BPTT. For example, with
    # `bptt_steps == 2`, the first batch consists of the `x` and `y` tensors
    # below: its first column processes "a, b" to predict "b, c",
    # its second column processes "g, h" to predict "h, i", and so on.
    #
    #       X          Y
    #   ----->>------
    #   |           |
    # ┌ a g m s ┐ ┌ b h n t ┐
    # └ b h n t ┘ └ c i o u ┘
    #   |           |
    #   ----->>------

    # Work out how cleanly we can divide the dataset into batch_size parts.
    n_batches = data_tensor.size(0) // batch_size

    # Trim off the remainder tokens to evenly split
    # Evenly divide the data across the batches
    data = data_tensor[:n_batches * batch_size].view(
        batch_size, n_batches).t().contiguous()

    batches = []

    for i in range(0, data.size(0) - 1, self.bptt_steps):
      # seq_len can be less than bptt_steps in the final parts of the data
      seq_len = min(self.bptt_steps, len(data) - i - 1)

      # x shape => (seq_len, batch_size)
      x = data[i: i + seq_len]
      # flatten the ground-truth labels (shifted inputs for LM)
      y = data[i + 1: i + 1 + seq_len].view(-1)
      batches.append((x, y))

    return batches
 
  def train_model(self, optim, train_tensor, valid_tensor, test_tensor, n_epochs=5,
                 batch_size=64):
    """Trains the model."""
    # Get batches for all splits at once
    train_batches = self.get_batches(train_tensor, batch_size)
    valid_batches = self.get_batches(valid_tensor, batch_size)
    test_batches = self.get_batches(test_tensor, batch_size)

    for eidx in range(1, n_epochs + 1):
      start_time = time.time()
      epoch_loss = 0
      epoch_items = 0

      # Enable training mode
      self.train()

      # Start training
      for iter_count, (x, y) in enumerate(train_batches):
        # Clear the gradients
        optim.zero_grad()

        loss = self.forward(x.to(DEVICE), y.to(DEVICE))

        # Backprop the average loss and update parameters
        loss.mean().backward()

        # Clip the gradients to avoid exploding gradients
        if self.clip_gradient_norm > 0:
          torch.nn.utils.clip_grad_norm_(self.parameters(), self.clip_gradient_norm)

        # Update parameters
        optim.step()

        # sum the loss for reporting, along with the denominator
        epoch_loss += loss.detach().sum()
        epoch_items += loss.numel()

        if iter_count % 500 == 0:
          # Print progress
          loss_per_token = epoch_loss / epoch_items
          ppl = math.exp(loss_per_token)
          print(f'[Epoch {eidx:<3}] loss: {loss_per_token:6.2f}, perplexity: {ppl:6.2f}')

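      time_spent = time.time() - start_time

      # Clear stale h_t history before evaluation
      self.clear_hidden_states()

      # Recompute the statistics over the full epoch; the values from the
      # last progress print can be up to 499 batches old
      loss_per_token = epoch_loss / epoch_items
      ppl = math.exp(loss_per_token)
      print(f'\n[Epoch {eidx:<3}] ended with train_loss: {loss_per_token:6.2f}, ppl: {ppl:6.2f}')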
      # Evaluate on valid set
      valid_loss, valid_ppl = self.evaluate(valid_batches)
      print(f'[Epoch {eidx:<3}] ended with valid_loss: {valid_loss:6.2f}, valid_ppl: {valid_ppl:6.2f}')
      print(f'[Epoch {eidx:<3}] completed in {time_spent:.2f} seconds\n')

    # Evaluate the final model on test set
    test_loss, test_ppl = self.evaluate(test_batches)
    print(f' ---> Final test set performance: {test_loss:6.2f}, test_ppl: {test_ppl:6.2f}')

  def evaluate(self, batches):
    # Clear stale h_t history before evaluation
    self.clear_hidden_states()

    # Switch to eval mode
    self.eval()

    total_loss = 0.
    total_tokens = 0

    with torch.no_grad():
      for iter_count, (x, y) in enumerate(batches):
        loss = self.forward(x.to(DEVICE), y.to(DEVICE))
        
        total_loss += loss.sum().item()
        total_tokens += loss.size(0)
    total_loss /= total_tokens

    self.clear_hidden_states()
    return total_loss, math.exp(total_loss)

  def __repr__(self):
    """String representation for pretty-printing."""
    s = super(RNNLM, self).__repr__()
    return "{}\n# of parameters: {}".format(s, self.n_params)

### Training

In [None]:
# Set the seed for reproducible results
fix_seed(30494)

rnnlm_model = RNNLM(
    vocab_size=len(vocab),  # vocabulary size
    emb_dim=128,            # word embedding dim
    hid_dim=128,            # hidden layer dim
    rnn_type='GRU',         # RNN type
    n_layers=1,             # Number of stacked RNN layers
    clip_gradient_norm=1.0, # gradient clip threshold
    bptt_steps=35,          # Truncated BPTT window size
    dropout=0.4,            # dropout probability
)

# move to device
rnnlm_model.to(DEVICE)

# Initial learning rate for the optimizer
RNNLM_INIT_LR = 0.002

# Create the optimizer
rnnlm_optimizer = torch.optim.Adam(rnnlm_model.parameters(), lr=RNNLM_INIT_LR)
print(rnnlm_model)

print('Starting training!')
# NOTE: If you happen to have memory errors, try decreasing the batch size
rnnlm_model.train_model(
    rnnlm_optimizer, train, valid, test, n_epochs=5, batch_size=16)

### Exercises

---

**Q: How do the results compare to FFLMs? Feel free to play with different hyper-parameters here as well. In particular, try different numbers of BPTT steps.**

**Q: Play with hyper-parameters related to dimensions and dropout. Could you find a model with smaller perplexity?**

**Q: There are missing parts related to LSTM support in the implementation. Try filling in those parts and train an LSTM-based model. How does the performance compare to vanilla RNN and GRU? Compare the model sizes of the three variants with the default set of hyper-parameters.**

## More exercises

**We do not provide any code or answers for these extra questions. They are meant to motivate interested readers to dig further into the details of language modelling.**

- Modify the `Vocabulary` class so that it keeps track of the token counts in the training set and applies a frequency threshold to decide which words are accepted into the vocabulary. Try different thresholds such as 2, 3 and 5, and train one of the models on top to see the impact of reducing the vocabulary size and, as a result, letting the model learn an embedding for `<unk>`.

- Implement a function to generate sentences from the RNNLM language model. How would you do that? There will not be any batches involved so you can directly feed the model with some prefix embeddings and sample (or take the word with the maximum probability, i.e. greedy search) from the output probability distribution.

- Try implementing **weight tying** for RNNLM, which is an [approach](https://arxiv.org/abs/1608.05859) to reduce the number of parameters in sequence-to-sequence models. Notice that we actually have 2 embeddings in the network: the first is used to encode a given input (`self.emb`), and the second is the output layer! Yes, the output layer is also a sort of embedding layer, since its weight matrix has a dimension of size $V$. If you set the sizes in your network so that the embedding layer and the output layer have exactly the same shape, you can let PyTorch share/tie those matrices, effectively removing one of them completely! The solution is actually a one-liner with PyTorch.

- Take one of the LSTM papers listed below and implement one of the ideas that you like from it.

## Further Reading
 - [Original FFLM paper from Bengio et al. 2003](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
 - [Original RNNLM paper from Mikolov et al. 2010](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
 - Some recent state-of-the-art LSTM-based RNNLMs:
   - [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)
   - [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/pdf/1803.08240.pdf)
   - [Scalable Language Modeling: WikiText-103 on a Single GPU in 12 hours](https://mlsys.org/Conferences/2019/doc/2018/50.pdf)