# L2: Language modelling

In this lab you will implement and train two neural language models: the fixed-window model mentioned in Lecture&nbsp;2.3, and the recurrent neural network model from Lecture&nbsp;2.5. You will evaluate these models by computing their perplexity on a benchmark dataset.

In [88]:
import torch

In [89]:
#from google.colab import drive
#drive.mount('/content/drive')

For this lab, you should use the GPU if you have one:

In [90]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Data

The data for this lab is [WikiText](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), a collection of more than 100 million tokens extracted from the set of &lsquo;Good&rsquo; and &lsquo;Featured&rsquo; articles on Wikipedia. We will use the small version of the dataset, which contains slightly more than 2.5 million tokens.

The next cell contains code for an object that will act as a container for the &lsquo;training&rsquo; and the &lsquo;validation&rsquo; section of the data. We fill this container by reading the corresponding text files. The only processing that we do is to whitespace-tokenize, and to replace each newline character with a special token `<eos>` (end-of-sentence).

In [91]:
class WikiText(object):
    
    def __init__(self):
        self.vocab = {}
        self.train = self.read_data('wiki.train.tokens')
        self.valid = self.read_data('wiki.valid.tokens')
    
    def read_data(self, path):
        ids = []
        with open(path, encoding="utf-8") as source:
            for line in source:
                for token in line.split() + ['<eos>']:
                    if token not in self.vocab:
                        self.vocab[token] = len(self.vocab)
                    ids.append(self.vocab[token])
        return ids

The cell below loads the data and prints the total number of tokens and the size of the vocabulary.

In [92]:
wikitext = WikiText()

print('Tokens in train:', len(wikitext.train))
print('Tokens in valid:', len(wikitext.valid))
print('Vocabulary size:', len(wikitext.vocab))

Tokens in train: 2088628
Tokens in valid: 217646
Vocabulary size: 33278


## Problem 1: Fixed-window neural language model

In this section you will implement and train the fixed-window neural language model proposed by [Bengio et al. (2003)](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) and introduced in Lecture&nbsp;2.3. Recall that an input to the network takes the form of a vector of $n-1$ integers representing the preceding words. Each integer is mapped to a vector via an embedding layer. (All positions share the same embedding.) The embedding vectors are then concatenated and sent through a two-layer feed-forward network with a non-linearity in the form of a rectified linear unit (ReLU) and a final softmax layer.

### Problem 1.1: Vectorize the data

Your first task is to write code for transforming the data in the WikiText container into a vectorized form that can be fed to the fixed-window model. Complete the skeleton code in the cell below:

In [93]:
def vectorize_fixed_window(wikitext_data, n):
    # TODO: Replace the following line with your own code

    X = []
    y = []
    for idx, word in enumerate(wikitext_data):
      if idx == len(wikitext_data) - n:
        break
      temp = wikitext_data[idx: idx + (n - 1)]
      
      X.append(temp)
      y.append(wikitext_data[idx + n - 1])

    return torch.LongTensor(X), torch.LongTensor(y)

Your function should meet the following specification:

**vectorize_fixed_window** (*wikitext_data*, *n*)

> Transforms WikiText data (a list of word ids) into a pair of tensors $\mathbf{X}$, $\mathbf{y}$ that can be used to train the fixed-window model. Let $N$ be the total number of $n$-grams from the token list; then $\mathbf{X}$ is a matrix with shape $(N, n-1)$ and $\mathbf{y}$ is a vector with length $N$.

⚠️ Your function should be able to handle arbitrary values of $n \geq 1$.

#### 🤞 Test your code

Test your implementation by running the code in the next cell. Does the output match your expectation?

In [94]:
valid_x, valid_y = vectorize_fixed_window(wikitext.valid, 3)
train_x, train_y = vectorize_fixed_window(wikitext.train, 3)

print(valid_x.size())
print(valid_x)
print(valid_y)

torch.Size([217643, 2])
tensor([[    0,     1],
        [    1, 32966],
        [32966, 32967],
        ...,
        [    1,     1],
        [    1,     1],
        [    1,     0]])
tensor([32966, 32967,     1,  ...,     1,     0,     0])


### Problem 1.2: Implement the model

Your next task is to implement the fixed-window model based on the graphical specification given in the lecture.

In [95]:
import torch.nn as nn

class FixedWindowModel(nn.Module):

    def __init__(self, n, n_words, embedding_dim=50, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(n_words, embedding_dim, padding_idx = 0)
        self.relu = nn.ReLU()
        self.linear_ff = nn.Linear(embedding_dim * (n - 1), hidden_dim)
        self.linear_soft = nn.Linear(hidden_dim, n_words)
        # TODO: Add your own code

    def forward(self, x):
        # TODO: Replace the next line with your own code
        #print(x.size())
        inputs = self.embedding(x).view((len(x), -1))
        #print(inputs.size())
        test = self.linear_ff(inputs)
        #print(test)
        out = self.relu(test)
        return self.linear_soft(out)

Here is the specification of the two methods:

**__init__** (*self*, *n*, *n_words*, *embedding_dim*=50, *hidden_dim*=50)

> Creates a new fixed-window neural language model. The argument *n* specifies the model&rsquo;s $n$-gram order. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the hidden layer of the feedforward network, respectively; their default value is 50.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, n-1)$, where $B$ is the batch size. The output of the forward pass is a tensor of shape $(B, V)$ where $V$ is the number of words in the vocabulary.

**Hint:** The most efficient way to implement the vector concatenation in this model is to use the [`view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) method.

#### 🤞 Test your code

Test your code by instantiating the model and feeding it a batch of examples from the training data.

### Problem 1.3: Train the model

Your final task is to write code to train the fixed-window model using minibatch gradient descent and the cross-entropy loss function.

For your convenience, the following cell contains a utility function that randomly samples minibatches of a specified size from a pair of tensors:

In [96]:
def batchify(x, y, batch_size):
    random_indices = torch.randperm(len(x))
    for i in range(0, len(x) - batch_size + 1, batch_size):
        indices = random_indices[i:i+batch_size]
        yield x[indices].to(device), y[indices].to(device)
    remainder = len(x) % batch_size
    if remainder:
        indices = random_indices[-remainder:]
        yield x[indices].to(device), y[indices].to(device)

What remains to be done is the implementation of the training loop. This should be a straightforward generalization of the training loops that you have seen so far. Complete the skeleton code in the cell below:

In [97]:
import torch.optim as optim
import torch.nn.functional as F
import tqdm

def train_fixed_window(n, n_epochs=1, batch_size=3200, lr=1e-2):
    # TODO: Replace the following line with your own code
    model = FixedWindowModel(n, len(wikitext.vocab))
    model.to(device)

    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_losses = []
    dev_losses = []
    dev_accuracies = []
    perplexity = []

    #print(valid_x[0:batch_size])

    with tqdm.tqdm(total = n_epochs) as pbar:
      for t in range (n_epochs):
        pbar.set_description(f'Epoch{t+1}')
        model.train()
        running_loss = 0

        for x, y in batchify(train_x, train_y, batch_size):
          optimizer.zero_grad()

          output = model.forward(x)

          loss = F.cross_entropy(output, y)
          loss.backward()
          optimizer.step()
          running_loss += loss.item() * len(x)
          # perplexity.append(torch.exp(loss))
          # print('Perp Loss:', loss, 'PP:', perplexity)
      
        print('Running loss', running_loss) 
        
        # Evaluation
        model.eval()
        for x, y in batchify(valid_x, valid_y, batch_size):
          with torch.no_grad():
                output_valid = model.forward(x)
                dev_losses.append(F.cross_entropy(output_valid, y))
              
        avg_loss = sum(dev_losses)/len(dev_losses)
        perplexity.append(torch.exp(torch.tensor(avg_loss)).item())
        pbar.update()
        train_losses.append(running_loss / len(valid_x))

        print("perp ", perplexity)
    
    return model

Here is the specification of the training function:

**train_fixed_window** (*n*, *n_epochs* = 1, *batch_size* = 3200, *lr* = 0.01)

> Trains a fixed-window neural language model of order *n* using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

The code in the cell below trains a bigram model.

In [98]:
model_fixed_window = train_fixed_window(3, n_epochs=3)

Epoch1:   0%|          | 0/3 [00:00<?, ?it/s]

Running loss 13029441.862940788


  perplexity.append(torch.exp(torch.tensor(avg_loss)).item())
Epoch2:  33%|███▎      | 1/3 [00:15<00:31, 15.92s/it]

perp  [321.0711364746094]
Running loss 11595817.076814175


Epoch3:  67%|██████▋   | 2/3 [00:28<00:14, 14.82s/it]

perp  [321.0711364746094, 311.3558044433594]
Running loss 11097026.82826519


Epoch3: 100%|██████████| 3/3 [00:41<00:00, 13.71s/it]

perp  [321.0711364746094, 311.3558044433594, 311.18109130859375]





**⚠️ Your submitted notebook must contain output demonstrating a validation perplexity of at most 350.**

**Hint:** Computing the validation perplexity in one go may exhaust your computer&rsquo;s memory and/or take a lot of time. If you run into this problem, break the computation down into minibatches and take the average perplexity.

#### 🤞 Test your code

To see whether your network is learning something, print the loss and/or the perplexity on the training data. If the two values are not decreasing over time, try to find the problem before wasting time (and energy) on useless training.

Training and even evaluation will take some time – on a CPU, you should expect several minutes per epoch, depending on hardware. To speed things up, you can train using a GPU; our reference implementation runs in less than 30 seconds per epoch on [Colab](http://colab.research.google.com).

## Problem 2: Recurrent neural network language model

In this section you will implement the recurrent neural network language model that was presented in Lecture&nbsp;2.5. Recall that an input to the network is a vector of word ids. Each integer is mapped to an embedding vector. The sequence of embedded vectors is then fed into an unrolled LSTM. At each position $i$ in the sequence, the hidden state of the LSTM at that position is sent through a linear transformation into a final softmax layer, from which we read off the index of the word at position $i+1$. In theory, the input vector could represent the complete training data or at least a complete sentence; for practical reasons, however, we will truncate the input to some fixed value *bptt_len*, the **backpropagation-through-time horizon**.

### Problem 2.1: Vectorize the data

As in the previous problem, your first task is to transform the data in the WikiText container into a vectorized form that can be fed to the model.

In [99]:
def vectorize_rnn(wikitext_data, bptt_len):
    # TODO: Replace the next line with your own code
    X = []
    Y = []

    once = 0

    for idx, word in enumerate(wikitext_data):
      if idx < len(wikitext_data) - bptt_len - 1:
        if not (idx % bptt_len) == 0:
          continue
        X.append(wikitext_data[idx:idx + bptt_len])
        Y.append(wikitext_data[idx + 1:idx + bptt_len + 1])
      else:
        break
    
    return torch.LongTensor(X), torch.LongTensor(Y)


Your function should meet the following specification:

**vectorize_rnn** (*wikitext_data*, *bptt_len*)

> Transforms a list of token indexes into a pair of tensors $\mathbf{X}$, $\mathbf{Y}$ that can be used to train the recurrent neural language model. The rows of both tensors represent contiguous subsequences of token indexes of length *bptt_len*. Compared to the sequences in $\mathbf{X}$, the corresponding sequences in $\mathbf{Y}$ are shifted one position to the right. More precisely, if the $i$th row of $\mathbf{X}$ is the sequence that starts at token position $j$, then the same row of $\mathbf{Y}$ is the sequence that starts at position $j+1$.

#### 🤞 Test your code

Test your implementation by running the following code:

In [100]:
valid_x, valid_y = vectorize_rnn(wikitext.valid, 32)
#valid_x_t, valid_y_t = vectorize_rnn_test(wikitext.valid, 32)

train_x, train_y = vectorize_rnn(wikitext.train, 32)

print(valid_x.size())

torch.Size([6801, 32])


### Problem 2.2: Implement the model

Your next task is to implement the recurrent neural network model based on the graphical specification given in the lecture.

In [101]:
import torch.nn as nn

class RNNModel(nn.Module):
    
    def __init__(self, n_words, embedding_dim=50, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(n_words,embedding_dim,padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim,hidden_dim, batch_first = True)
        self.linear = nn.Linear(embedding_dim, n_words)
        # TODO: Add your own code

        # For the last task
        #self.embedding.weight.data.uniform_(-1, 1)

    def forward(self, x):
        # TODO: Replace the next line with your own code
        embed = self.embedding(x)
        output, (h_n, c_n) = self.lstm(embed)

        last_hidden_state = h_n[-1]
        
        output = self.linear(output)
        return output, last_hidden_state

Your implementation should follow this specification:

**__init__** (*self*, *n_words*, *embedding_dim* = 50, *hidden_dim* = 50)

> Creates a new recurrent neural network language model. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the LSTM hidden layer, respectively; their default value is 50.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, H)$, where $B$ is the batch size and $H$ is the length of each input sequence. The shape of the output tensor is $(B, H, V)$, where $V$ is the size of the vocabulary.

#### 🤞 Test your code

Test your code by instantiating the model and feeding it a batch of examples from the training data.

### Problem 2.3: Train the model

The training loop for the recurrent neural network model is essentially identical to the loop that you wrote for the feed-forward model. The only thing to note is that the cross-entropy loss function expects its input to be a two-dimensional tensor; you will therefore have to re-shape the output tensor from the LSTM as well as the gold-standard output tensor in a suitable way. The most efficient way to do so is to use the [`view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) method.

In [102]:
def train_rnn(n_epochs=1, batch_size=100, bptt_len=32, lr=1e-2):
    # TODO: Replace the next line with your own code
    model = RNNModel(len(wikitext.vocab))
    model.to(device)

    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_losses = []
    dev_losses = []
    dev_accuracies = []
    perplexity = []

    #print(valid_x[0:batch_size])

    with tqdm.tqdm(total = n_epochs) as pbar:
      for t in range (n_epochs):
        pbar.set_description(f'Epoch{t+1}')
        model.train()
        running_loss = 0
        it = 0
        for x, y in batchify(train_x, train_y, batch_size):
          optimizer.zero_grad()

          output, h_c = model.forward(x)

          loss = F.cross_entropy(output.view(x.shape[0]*x.shape[1], -1), y.view(-1))
          loss.backward()
          optimizer.step()
          running_loss += loss.item()* len(x)
          it += 1

        model.eval()
        for x, y in batchify(valid_x, valid_y, batch_size):
          with torch.no_grad():
                output_valid,h_C = model.forward(x)
                loss = F.cross_entropy(output_valid.view(x.shape[0]*x.shape[1], -1), y.view(-1))
                dev_losses.append(loss)
                
        avg_loss = sum(dev_losses)/len(dev_losses)
        train_losses.append(running_loss / len(valid_x))
        perplexity.append(torch.exp(torch.tensor(avg_loss)).item())
        print("perp ", perplexity)
        print('Avg loss', running_loss/it) 
        pbar.update()        
    return model

Here is the specification of the training function:

**train_rnn** (*n_epochs* = 1, *batch_size* = 100, *bptt_len* = 32, *lr* = 0.01)

> Trains a recurrent neural network language model on the WikiText data using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. The parameter *bptt_len* specifies the length of the backpropagation-through-time horizon, that is, the length of the input and output sequences. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

Evaluate your model by running the following code cell:

In [103]:
model_rnn = train_rnn(n_epochs=10)

  perplexity.append(torch.exp(torch.tensor(avg_loss)).item())
Epoch2:  10%|█         | 1/10 [00:14<02:10, 14.54s/it]

perp  [300.1768798828125]
Avg loss 634.7905086477901


Epoch3:  20%|██        | 2/10 [00:31<02:01, 15.14s/it]

perp  [300.1768798828125, 279.1668701171875]
Avg loss 558.6574141676174


Epoch4:  30%|███       | 3/10 [00:45<01:43, 14.81s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875]
Avg loss 533.1300615278539


Epoch5:  40%|████      | 4/10 [00:57<01:24, 14.03s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875, 263.87908935546875]
Avg loss 517.1147181451047


Epoch6:  50%|█████     | 5/10 [01:09<01:07, 13.49s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875, 263.87908935546875, 261.8218078613281]
Avg loss 505.8182117095585


Epoch7:  60%|██████    | 6/10 [01:22<00:52, 13.25s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875, 263.87908935546875, 261.8218078613281, 261.469482421875]
Avg loss 497.1603724076591


Epoch8:  70%|███████   | 7/10 [01:39<00:42, 14.30s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875, 263.87908935546875, 261.8218078613281, 261.469482421875, 262.47869873046875]
Avg loss 490.283192257056


Epoch9:  80%|████████  | 8/10 [01:57<00:30, 15.44s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875, 263.87908935546875, 261.8218078613281, 261.469482421875, 262.47869873046875, 264.2890930175781]
Avg loss 484.59937842224497


Epoch10:  90%|█████████ | 9/10 [02:12<00:15, 15.51s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875, 263.87908935546875, 261.8218078613281, 261.469482421875, 262.47869873046875, 264.2890930175781, 266.2478942871094]
Avg loss 479.7235427735228


Epoch10: 100%|██████████| 10/10 [02:27<00:00, 14.76s/it]

perp  [300.1768798828125, 279.1668701171875, 268.77215576171875, 263.87908935546875, 261.8218078613281, 261.469482421875, 262.47869873046875, 264.2890930175781, 266.2478942871094, 268.9085388183594]
Avg loss 475.5786110299652





**⚠️ Your submitted notebook must contain output demonstrating a validation perplexity of at most 310.**

## Problem 3: Parameter initialization (reflection)

Since the error surfaces that gradient search explores when training neural networks can be very complex, it is important to choose &lsquo;good&rsquo; initial values for the parameters. In PyTorch, the weights of the embedding layer are initialized by sampling from the standard normal distribution $\mathcal{N}(0, 1)$. Test how changing the standard deviation and/or the distribution affects the perplexity of your feed-forward language model. Write a short report about your experience (ca. 150 words). Use the following prompts:

* What different settings for the initialization did you try? What results did you get?
* How can you choose a good initialization strategy?
* What did you learn? How, exactly, did you learn it? Why does this learning matter?

*TODO: Enter your text here*

* What different settings for the initialization did you try? What results did you get? *
    We tried changing the uniform distribution to 0.3, -1 and 1 with self.embedding.weight.data.uniform_(0.3, etc.).
    0.3: We noted a higher intitial perplexity as well as a higher loss after the first epoch. Although The later epochs balanced this out. Perp: 332
    -1 : We noted a much lower perplexity after the first epoch but a pretty similar loss. perp: 301
    1  : -||-. Perp: 294
* How can you choose a good initialization strategy?

    We could initialize the weights with the weights of a pre trained language model, preferably similar to the task we want to solve. These weights could either be frozen or further trained for our specific problem

* What did you learn? How, exactly, did you learn it? Why does this learning matter? *

    We have learned that vectorizing correctly is important.. We have also learned that overtraining is a thing by observing the perpexity and loss.
    We have also developed our understanding of NLP architectures and how to construct such models using the torch library. This knowledge alows us to more easily construct different archtectures for future labs and work in the field.