ELEC-E5550 - Statistical Natural Language Processing
# SET 7: Neural language models

# Released: 13.2.2024
# Deadline: 23.2.2024

In [1]:
%%capture
!pip install nose==1.3.7

# Overview
After completing this exercise, you will understand how the neural language models work.
Additionally, you will learn how to construct one and generate sentences with it.

# Table of contents

* [Introduction](#intro)
    * [Language models](#language_models)
    * [Neural language models](#neural_lm)
* [Task 1: Data preprocessing](#task_1)
    * [Index dictionaries](#task_1_1)
    * [Index data](#task_1_2)
    * [Prepare features and labels](#task_1_3)
    * [Preprocess data](#task_1_4)
* [Task 2: FFNN language model](#task_2)
    * [Create the FFNN model](#task_2_1)
    * [Generate text](#task_2_2)
* [Task 3: RNN language model](#task_3)
    * [Create the RNN model](#task_3_1)
    * [Generate text](#task_3_2)
* [Task 4: Output analysis](#task_4)
    * [Compare the models](#task_4_1)

## Introduction <a class="anchor" id="intro"></a>
## Language models <a class="anchor" id="language_models"></a>
From the n-grams exercise, you are already familiar with what language models are. As a recap, the goal of the language model is to predict the next word that is going to appear, based on the previous words that appeared.
$P(w_i| w_{i−1} . . . w_0)$.

## Neural language models <a class="anchor" id="neural_lm"></a>
The n-gram language models have some shortcomings. They generally can't capture long-span dependencies. Additionally, they are dependant on the word order. These limitations are mitigated by the neural network language models.

The neural network language models use word embeddings, instead of the words themselves. The word embeddings are a distributed representation of the words and they have a nice property that similar words are close in the vector space. This solves the word order dependency problem that the n-gram language models have.

The recurrent neural networks (RNN) process the input sequentially, where each next prediction is influenced by the history. This allows them to capture long-term dependencies between the words.

## TASK 1: Data preprocessing <a class="anchor" id="task_1"></a>

### Importing the dependencies

The first thing that we need to do is to import the necessary dependencies. To create the neural language models, we are going to use the [Pytorch](https://pytorch.org/) deep learning framework. 

In [2]:
# loading pytorch
import torch
import torch.nn as nn # the neural network package that contains functions for creating the neural network layers
import torch.nn.functional as F
import torch.optim as optim # a package that allows use to use an optimizer in order to update the parameters during training
from torch.utils.data import DataLoader # allows use to process the data in batches
from torch.nn.utils.rnn import pad_sequence # a function that zero-pads the sentences so they can have equal size in a batch
import warnings
warnings.simplefilter("ignore")
torch.manual_seed(0) # set a random seed for reproducibility

<torch._C.Generator at 0x7f75101cb630>

### Load the data

In this assignment, as data, we are going to use the "Pride and Prejudice", same as in the n-grams assignment. The data can be obtained from [here](https://www.gutenberg.org/ebooks/1342).

The cell below loads the data.

In [3]:
with open('../coursedata/nn-lm/janeausten.txt', 'r') as f:
    data = f.readlines()

###  Sentence boundaries

When dealing with language, it is good to know when a sentence starts and when it ends. That will help the model at the beginning of the prediction, when we don't have any previous words as context. For that purpose, we are going to pad each sentence with a start-of-sentence symbol _"&lt;s>"_ and an end-of-sentence symbol _"&lt;/s>"_. 

Since you already did a similar thing in the n-grams exercise, this function is already implemented for you.

In [4]:
def add_sentence_boundaries(data):
    """
    Takes the data, where each line is a sentence, appends <s> token at the beginning and </s> at the end of each sentence
    Example input: I live in Helsinki
    Example output: <s> I live in Helsinki </s>
    
    Arguments
    ---------
    data : list
            a list of sentences
    
    Returns
    -------
    res : list
            a list of sentences, where each sentence has <s> at the beginning and </s> at the end
    """
    res = []
    for sent in data:
        sent = '<s> ' + sent.rstrip() + ' </s>'
        res.append(sent)
    
    return res

### 1.1  Index dictionaries <a class="anchor" id="task_1_1"></a> (1 Point)
Neural networks can't process words as raw strings. Due to that, we need to represent the words with numbers. The first step in doing that is creating two dictionaries: word2idx and idx2word.

The word2idx dictionary contains unique words as keys and unique indices for each of the words as values. <br>
The idx2word dictionary contains unique indices as keys and unique words for each of those indices as values. It is essentially a reversed word2dx, where the keys are the values and the values are the keys.

Example sentences: ["I look forward", "You look forward"] <br>
word2idx = {"I": 1, "look": 2, "forward": 3, "You": 4} <br>
idx2word = {1: "I", 2: "look", 3: "forward", 4: "You"} <br>

Write a function that creates two dictionaries: word2idx and idx2work. The dictionaries should contain all the unique words in the data. <b>The indices should start from 1 and not from 0</b>
    
<b>Additionally, the first index should correspond to the first word in the `data` variable, second word to the second, etc.</b>

We need the indices in a specific order because the pre-trained models are trained with indices in that particular order.

In [5]:
def create_indices(data):
    """
    This function creates two dictionaries: word2idx and idx2word, containing each unique word in the dataset
    and its corresponding index.
    Remember that the starting index should be 1 and not 0
    Remember that the first word in 'data' should coeerespond to the first index
    
    Arguments
    ---------
    data - list
            a list of sentences, where each sentence starts with <s>
            and ends with </s> token
    
    Returns
    -------
    word2idx - dictionary
                a dictionary, where the keys are the words and the values are the indices
                
    idx2word - dictionary
                a dictionary, where the keys are the indices and the values are the words
    """
    
    # YOUR CODE HERE
    # raise NotImplementedError()

    word2idx = {}
    idx2word = {}
    index = 1  # Starting index
    
    # Splitting each sentence into words and updating the dictionaries
    for sentence in data:
        words = sentence.split()
        for word in words:
            if word not in word2idx:
                word2idx[word] = index
                idx2word[index] = word
                index += 1
                
    return word2idx, idx2word

In [6]:
from nose.tools import assert_equal

dummy_data = ['<s> a girl likes eating by herself', '</s> the cat likes eating fish']
word2idx_dummy, idx2word_dummy = create_indices(dummy_data)

# check if the results returned are dictionaries
assert_equal(type(word2idx_dummy), dict)
assert_equal(type(idx2word_dummy), dict)

# check if word2idx and idx2word are the same length
assert_equal(len(word2idx_dummy), len(idx2word_dummy))

# check if all the unique words in the data set have indices
combined_dummy_data = dummy_data[0] + ' ' + dummy_data[1]
combined_dummy_data = combined_dummy_data.split()
assert_equal(len(set(combined_dummy_data)), len(idx2word_dummy))

# check if values are integers in word2idx
assert_equal(type(word2idx_dummy['girl']), int)
assert_equal(type(word2idx_dummy['fish']), int)
assert_equal(type(word2idx_dummy['<s>']), int)

# check if values are strings in idx2word
assert_equal(type(idx2word_dummy[1]), str)
assert_equal(type(idx2word_dummy[2]), str)
assert_equal(type(idx2word_dummy[3]), str)

# check if word2idx and idx2word have the same values
dummy_idx = word2idx_dummy['<s>']
dummy_word = idx2word_dummy[dummy_idx]
assert_equal(word2idx_dummy[dummy_word], dummy_idx)
assert_equal(idx2word_dummy[dummy_idx], dummy_word)

# check if the first value in idx2word is 1
assert_equal(list(idx2word_dummy.keys())[0], 1)

# check if the first word in the dataset corresponds to the first index, second word to the second index, etc
assert_equal(word2idx_dummy['<s>'], 1)
assert_equal(word2idx_dummy['a'], 2)



### 1.2  Index data <a class="anchor" id="task_1_2"></a> (1 Point)
After we have created the word2idx and idx2word dictionaries, it is time to index the data. In other words, we need to replace each word in the data with its corresponding index.

Write a function that reads each sentence from the data and replaces each word in the sentence with its index from the word2idx dictionary.

In [7]:
def index_data(data, word2idx):
    """
    This function replaces each word in the data with its corresponding index
    
    Arguments
    ---------
    data - list
            a list of sentences, where each sentence starts with <s>
            and ends with </s> token
    
    word2idx - dict
            a dictionary where the keys are the unique words in the data
            and the values are the unique indices corresponding to the words
    
    Returns
    -------
    data_indexed - list
                a list of sentences, where each word in the sentence is replaced with its index
    """
    
    data_indexed = []
    
    # YOUR CODE HERE
    # raise NotImplementedError()
    
    for sentence in data:
        indexed_sentence = [word2idx[word] for word in sentence.split()]
        data_indexed.append(indexed_sentence)
        
    return data_indexed

In [8]:
from nose.tools import assert_equal

dummy_data = ['<s> a girl is here </s>', '<s> a boy is there </s>']
dummy_word2idx = {'<s>': 1, 'a': 2, 'girl': 3, 'is': 4, 'here': 5, '</s>': 6, 'boy': 7, 'there': 8}
dummy_indexed_data = index_data(dummy_data, dummy_word2idx)

# check that the returned result is a list
assert_equal(type(dummy_indexed_data), list)

# check that the length of the results is the same as the length of the data
assert_equal(len(dummy_data), len(dummy_indexed_data))

# check that the function does what it is supposed to do
assert_equal(dummy_indexed_data[0], [1, 2, 3, 4, 5, 6])
assert_equal(dummy_indexed_data[1], [1, 2, 7, 4, 8, 6])



### Convert sentences to tensors

This function converts each indexed sentence to a LongTensor data type. This is required in order to process it later using Pytorch.

You don't have to modify this function. It is already implemented for you.

In [9]:
def convert_to_tensor(data_indexed):
    """
    This function converts the indexed sentences to LongTensors
    
    Arguments
    ---------
    data_indexed - list
            a list of sentences, where each word in the sentence
            is replaced by its index
    
    Returns
    -------
    tensor_array - list
                a list of sentences, where each sentence
                is a LongTensor
    """
    
    tensor_array = []
    for sent in data_indexed:
        tensor_array.append(torch.LongTensor(sent))    
        
    return tensor_array

### Combine features and labels in a tuple

This function combines each indexed sentence and its corresponding labels to a tuple. This will be beneficial for us when we zero-pad the data later, in order to make the batches have equal-length samples.

You don't have to modify this function. It is already implemented for you.

In [10]:
def combine_data(input_data, labels_data):
    """
    This function converts the input features and the labels into tuples
    where each tuple corresponds to one sentence in the format (features, labels)
    
    Arguments
    ---------
    input_data - list
            a list of tensors containing the training features
    
    labels_data - list
            a list of tensors containing the training labels
    
    Returns
    -------
    res - list
            a list of tuples, where each tuple corresponds to one sentece pair
            in the format (features, labels)
    """
    
    res = []
    
    for i in range(len(input_data)):
        res.append((input_data[i], labels_data[i]))

    return res

### Remove extra data

Since we will be processing the data in equal batches during training, we need to make sure that each batch has equal number of sentences. In case the last batch contains less sentences than the batch size, that batch will be discarded.

This function discards the extra data that doesn't fit in a batch.

You don't have to modify this function. It is already implemented for you.

In [11]:
def remove_extra(data, batch_size):
    """
    This function removes the extra data that does not fit in a batch   
    
    Arguments
    ---------
    data - list
            a list of tuples, where each tuple corresponds to a
            sentence in a format (features, labels)
            
    batch_size - integer
                    the size of the batch
    
    
    Returns
    -------
    data - list
            a list of tuples, where each tuple corresponds to a
            sentence in a format (features, labels)
    """
    
    extra = len(data) % batch_size
    if extra != 0:
        data = data[:-extra][:]

    return data

### Zero-pad the data

In order to process the data in batches, we need to make sure that the sentences in each batch have equal lengths. Since we are working with sentences, each sentence in a batch can have different number of words. In this case, we need to  make the length of each sentence the same as the length of the longest sentence in that batch. We do that by adding zeros at the end of each sentence, until the sentence has equal length as the longest one in the batch.

This function implements the zero-padding.

You don't have to modify this function. It is already implemented for you.

In [12]:
def collate(list_of_samples):
    """
    This function zero-pads the training data in order to process the sentences
    in a batch during training
    
    Arguments
    ---------
    list_of_samples - list
                        a list of tuples, where each tuple corresponds to a
                        sentence in a format (features, labels)
    
    
    Returns
    -------
    pad_input_data - tensor
                        a tensor of input features equal to the batch size,
                        where features are zero-padded to have equal lengths
                        
    input_data_lengths - list
                        a list where each element is the length of the 
                        corresponding sentence
    
    pad_labels_data - tensor
                        a tensor of labels equal to the batch size,
                        where labels are zero-padded to have equal lengths
            
    """
    
    
    list_of_samples.sort(key=lambda x: len(x[0]), reverse=True)
    input_data, labels_data = zip(*list_of_samples)

    input_data_lengths = [len(seq) for seq in input_data]
    
    padding_value = 0

    # pad input
    pad_input_data = pad_sequence(input_data, padding_value=padding_value)
    
    # pad labels
    pad_labels_data = pad_sequence(labels_data, padding_value=padding_value)

    return pad_input_data, input_data_lengths, pad_labels_data

### 1.3 Prepare features and labels <a class="anchor" id="task_1_3"></a> (1 Point)
During training, the model takes an input word and outputs a prediction. We will need to compare this prediction to 'true label'. True label is just the next word in the text, but we will need to organize the data, so that every word in the text is considered as this 'true label'.

In the label sentence, every word is moved a step in time, and for the input sentence the last word is missing. 

Example sentence: oops i did it again <br>
INPUT: oops i did it <br>
LABEL: i did it again

Note: the first word in the sentence is start-of-sentence symbol and the last one is end-of-sentence symbol.

Write a function that takes as input the indexed data and returns two arrays: the input array where the last word from each sentence is missing, and the label array, where every word is moved a step in time.

In [13]:
def prepare_for_training(data_indexed):
    """
    This function creates the input features and their corresponding labels
    
    Arguments
    ---------
    data_indexed - list
            a list of sentences, where each word in the sentence
            is replaced by its index
    
    
    Returns
    -------
    input_data - list
            a list of indexed sentences, where the last element of each sentence is removed
            
    labels_data - list
            a list of indexed sentences, where the first element of each sentence is removed
    """
    
    input_data = []
    labels_data = []

     # YOUR CODE HERE
    # raise NotImplementedError()

    for sentence in data_indexed:
        input_data.append(sentence[:-1])
        labels_data.append(sentence[1:])
        
    return input_data, labels_data

In [14]:
from nose.tools import assert_equal

dummy_data = [[1, 2, 3, 4, 5, 6], [4, 6, 2, 6, 7]]
dummy_train_input, dummy_train_labels = prepare_for_training(dummy_data)

# check that the returned results are lists
assert_equal(type(dummy_train_input), list)
assert_equal(type(dummy_train_labels), list)

# check that the length of the input and the labels match
assert_equal(len(dummy_train_input), len(dummy_train_labels))
assert_equal(len(dummy_train_input[0]), len(dummy_train_labels[0]))
assert_equal(len(dummy_train_input[1]), len(dummy_train_labels[1]))

# check that the function works as it should
assert_equal(dummy_train_input[0], [1, 2, 3, 4, 5])
assert_equal(dummy_train_input[1], [4, 6, 2, 6])

assert_equal(dummy_train_labels[0], [2, 3, 4, 5, 6])
assert_equal(dummy_train_labels[1], [6, 2, 6, 7])



### 1.4 Preprocess data <a class="anchor" id="task_1_4"></a> (1 Point)
At this point, we have all the necessary functions to prepare the data for training. What is left to do is to run them one by one and get the data in the desired format.

Write a function that takes the data and prepares it for training. You need to do the following steps:

    1. Add sentence boundaries
    2. Create index dictionaries (word2idx and idx2word)
    3. Index the data in a way that each word is replaced by its index
    4. Convert the indexed data to a list of tensors, where each tensor is a sentence
    5. Split each sentence to input and labels

In [15]:
def preprocess_data(data):
    """
    This function runs the whole preprocessing pipeline and returns the prepared
    input features and labels, along with the word2idx and idx2word dictionaries
    
    Arguments
    ---------
    data - list
            a list of sentences that need to be prepared for training
    
    
    Returns
    -------
    input_data - list
            a list of tensors, where each tensor is an indexed sentence used as input feature
            
    labels_data - list
            a list of tensors, where each tensor is an indexed sentence used as a true label
    
    word2idx - dictionary
                a dictionary, where the keys are the words and the values are the indices
                
    idx2word - dictionary
                a dictionary, where the keys are the indices and the values are the words
    """
    
    # YOUR CODE HERE
    # raise NotImplementedError()

    data_with_boundaries = add_sentence_boundaries(data)

    unique_words = set(" ".join(data_with_boundaries).split())
    word2idx = {word: idx for idx, word in enumerate(unique_words, start=1)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    
    data_indexed = index_data(data_with_boundaries, word2idx)
    input_data, labels_data = prepare_for_training(data_indexed)
    
    return input_data, labels_data, word2idx, idx2word

In [16]:
from nose.tools import assert_equal

dummy_data = ['a girl likes eating by herself', 'the cat likes eating fish']
dummy_input, dummy_labels, dummy_word2idx, dummy_idx2word = preprocess_data(dummy_data)

# check that the returned results are lists
assert_equal(type(dummy_input), list)
assert_equal(type(dummy_labels), list)

# check that the returned results have the same lengths
assert_equal(len(dummy_input), len(dummy_labels))

# check that the sizes of the indexed sentences are correct
assert_equal(len(dummy_input[0]), len(dummy_data[0].split()) + 1)
assert_equal(len(dummy_input[1]), len(dummy_data[1].split()) + 1)

assert_equal(len(dummy_labels[0]), len(dummy_data[0].split()) + 1)
assert_equal(len(dummy_labels[1]), len(dummy_data[1].split()) + 1)

# check that the input features and labels are correct
assert_equal(dummy_input[0][1], dummy_labels[0][0])
assert_equal(dummy_input[0][2], dummy_labels[0][1])
assert_equal(dummy_input[0][3], dummy_labels[0][2])
assert_equal(dummy_input[0][4], dummy_labels[0][3])
assert_equal(dummy_input[0][5], dummy_labels[0][4])

assert_equal(dummy_input[1][1], dummy_labels[1][0])
assert_equal(dummy_input[1][2], dummy_labels[1][1])
assert_equal(dummy_input[1][3], dummy_labels[1][2])
assert_equal(dummy_input[1][4], dummy_labels[1][3])




Next, we are going to call the preprocessing function and obtain the features and labels. After that, we will combine the features and labels in tuples using the `combine_data` function and then remove the extra data that does not fit in a batch using the `remove_extra` function. At the end, we are going to call the `DataLoader`, which prepares the data in batches.

In [17]:
train_input, train_labels, word2idx, idx2word = preprocess_data(data) # run the preprocessing pipeline

batch_size = 16 # the number of sentences to be processed at once

train_data = combine_data(train_input, train_labels)
train_data = remove_extra(train_data, batch_size)

pairs_batch_train = DataLoader(dataset=train_data,
                    batch_size=batch_size,
                    shuffle=True,
                    collate_fn=collate,
                    pin_memory=True)

## TASK 2:  Feed forward neural network language model <a class="anchor" id="task_2"></a>
In this task, we are going to implement our first neural network language model. To do that, we are going to use a feed-forward neural network that takes the previous word as input and predicts the next word based on it. This is similar to the bigram language model.

To predict the next word, we first need to convert the previous word into a vecor $x(t)$. In other words, we need to embed it. <br>
Next, we need to apply a linear transformation $ h(t) = Ax(t) + b $ to compute a representation of linear distributional features. In this case $ A $ is a learnable matrix and $ b $ is the bias, which is also a learnable parameter. <br>
In the previous step we applied a linear transformation. In order for the model to learn more complex feature representations, we can add a non-linearity after the linear transformation: $ U(h(t)) $, which in our case is a [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) (ReLU). <br>
One common way of preventing the model from overfitting is to apply a dropout. The dropout disconnects random neurons witch a certain probability, preventing the network from overlearning the training set. <br>
Since we need to predict the next word, we need the output of the transformation to be equal to the number of unique words in the data. To do that, we need to apply another linear transformation, which has an output size same as the number of unique words in the data. <br>
In the end, the output is passed to the CrossEntropy loss function that estimates how good is the model's prediction in comparison to the true labels.

Generally, with deep learning frameworks, like Pytorch, we only need to implement the computations from the input to the output. This is also called a "forward pass" and is the `forward` function in the `FFNN` class. That in our case would be passing the previous word and getting a probability distribution over the next word.

On the other hand, the calculation of the partial derivatives of the loss function are calculated automatically by the framework and we don't need to implement it. This is also called a "backward pass".

### 2.1 Create the model (3 Points) <a class="anchor" id="task_2_1"></a>

The `FFNN` class contains the definition of the model that we are going to train for the language modeling task. The first function that the class has is called `__init__` and it initializes the layers of the model. This function is already implemented for you.

The next function is called `forward` and is the forward pass explained earlier. You need to implement this function.

To implement the function, you need to perform the following steps:

    1. Replace the indexed word with its embedding vector. In other words, pass it through the embedding layer
    2. Apply the first linear transformation to the embedding of the word. In other words, pass the word embedding through the linear layer, defined as `self.lin = nn.Linear(self.embed_dim, self.context_dim)`
    3. Apply the ReLU non-linearity to the output of the linear projection defined in the previous step. The ReLU function can be called as `F.relu()`
    4. Apply a dropout after ReLU
    5. Apply the second linear transformation to the output after the dropout. The second linear transformation is defined as `self.out = nn.Linear(self.context_dim, len(self.word2idx)+1)`
    6. Return the output of the second linear transformation
    
Remember that now we are processing the words in batches. If we have a batch size of 5, that means that we are passing 5 words at a time and applying all the transformations to those 5 words simultaneously.

In [18]:
class FFNN(nn.Module):
    def __init__(self, word2idx, embed_dim, context_dim):
        """
        This function initializes the layers of the model
        
        Arguments
        ---------
        word2idx - dictionary
                    a dictionary where the keys are the unique words in the data
                    and the values are the unique indices corresponding to the words
        
        embed_dim - integer
                        the size of the word embeddings

        context_dim - integer
                        the dimension of the hidden size
        """
        
        super(FFNN, self).__init__()
        self.word2idx = word2idx
        self.embed_dim = embed_dim
        self.context_dim = context_dim
        
        # here we initialise the layers of the model
        self.word_embed = nn.Embedding(len(self.word2idx)+1, self.embed_dim) # embedding layer
    
        self.lin = nn.Linear(self.embed_dim, self.context_dim) # linear layer
        
        self.dropout = nn.Dropout(0.1) # dropout layer
        
        self.out = nn.Linear(self.context_dim, len(self.word2idx)+1) # output layer
        
    
    def forward(self, word):
        """
        This function implements the forward pass of the model
        
        Arguments
        ---------
        word - tensor
                    a tensor containing indices of the words in a batch
        
        Returns
        -------
        output - tensor
                    a tensor of logits from the second linear transformation
        """ 
        
        # YOUR CODE HERE
        # raise NotImplementedError()
        
    
        embedded_words = self.word_embed(word)
        linear_output = self.lin(embedded_words)
        relu_output = F.relu(linear_output)
        dropout_output = self.dropout(relu_output)
        output = self.out(dropout_output)
        
        return output

In [19]:
from nose.tools import assert_equal

dummy_ff_model = FFNN(word2idx, 10, 20)
dummy_train_input = torch.randint(1, 10, (2,))
dummy_output = dummy_ff_model(dummy_train_input)

# test that the returned result is a tensor
assert_equal(torch.is_tensor(dummy_output), True)

# test that the shapes match
assert_equal(dummy_output.size(), (2, len(word2idx)+1))




### Model initialization
In the next cell, we are going to define the hyperparameters of our neural network language model. Additionally, we will initialize the model, along with the loss function and the optimizer.

In [20]:
n_epochs = 30 # the number of epochs to train
embedding_size = 300 # the size of the embedding layer
hidden_size = 450 # the size of the linear projection
ff_model = FFNN(word2idx, embedding_size, hidden_size) # initialize the model
loss_function = nn.CrossEntropyLoss(ignore_index=0) # the loss function which compares how good the NN is predicting the next word
ffnn_optimizer = optim.Adam(ff_model.parameters(), lr=0.005) # Adam optimizer for updating the parameters during training

### Training the model
Now that we have initialized the model, the next step is to train it. The training process is done by passing the previous word through the forward pass. Then, the loss function compares how good the neural network is predicting the next word. Then we run the backward pass of the network where the partial derivatives of the loss function are computed. At the end, based on those partial derivatives, the parameters of the network get updated using the Adam optimizer. The process is repeated for $n$ number of epochs, where one epoch passes when we process all the sentences in the training set.

You don't need to modify this function.

In [21]:
def train(pairs_batch_train, ff_model, loss_function, ffnn_optimizer, n_epochs):
    """
    This function implements the training of the model

    Arguments
    ---------
    pairs_batch_train - object
                            a DataLoader object that contains the batched data

    ff_model - object
                a FFNN object that contains the initialized model

    loss_function - object
                        the CrossEntropy loss function

    ffnn_optimizer - object
                        an Adam object of the optimizer class

    n_epochs - integer
                the number of epochs to train
    """ 
    
    for epoch in range(n_epochs): # iterate over the epochs
        epoch_loss = 0
        ff_model.train() # put the model in training mode
        
        for iteration, batch in enumerate(pairs_batch_train): # at each step take a batch of sentences
            sent_loss = 0
            ffnn_optimizer.zero_grad() # clear gradients
            train_input, train_input_lengths, train_labels = batch # extract the data from the batch
            
            for i in range(train_input.size(0)): # iterate over each word in the sentence
                output = ff_model(train_input[i]) # forward pass
                
                labels = torch.LongTensor(train_labels.size(1)) # define a random tensor with batch_size as number of elements
                labels[:] = train_labels[i][:] # put the correct label values in the tensor
                
                sent_loss += loss_function(output, labels) # compute the loss, compare the predictions and the labels
            
            sent_loss.backward() # compute the backward pass
            ffnn_optimizer.step() # update the parameters

            epoch_loss += sent_loss.item()

        print('Epoch: {}   Loss: {}'.format(epoch+1, epoch_loss / len(pairs_batch_train))) # print the loss at each epoch

Since training a neural network model takes a long time and a lot of resources, we are not going to train the model. Instead, we are going to load an already trained model.

The cell below loads the pre-trained model.

In [22]:
ff_model = torch.load('../coursedata/nn-lm/ff_model.pt', map_location='cpu')

### 2.2 Generate text (3 Points) <a class="anchor" id="task_2_2"></a>
Now that the model is trained, we can use it to generate sentences.

Your task is to implement the `predict_ffnn` function.

You need to perform the following steps:

    1. Run the forward pass to get the output
    2. Run the output through a softmax to convert it to a probability distribution (`F.softmax`) [remember to specify the correct dimention to the softmax function]
    3. Flatten the output (`output.flatten()`)
    4. Sample from a multinomial distribution with `output.multinomial(1)`
    6. Convert the index of the predicted word to the actual word using the idx2word dictionary
    7. Append the predicted word to the `predictions` array
    
<b>Remember to use the prediction from `output.multinomial(1)` as the next input to the `ff_model` function.</b>

In [23]:
def predict_ffnn(ff_model, word2idx, idx2word, start_word, max_len):
    """
    This function predicts the next word, based on the previous word.
    We start with the 'start_word' and then feed the prediction as the next input.
    
    Arguments
    ---------
    ff_model - object
                a FFNN object that contains the trained model
                
    word2idx - dictionary
                    a dictionary where the keys are the unique words in the data
                    and the values are the unique indices corresponding to the words
                    
    idx2word - dictionary
                a dictionary, where the keys are the indices and the values are the words
                    
    start_word - string
                    the starting word
    
    max_len - integer
                integer value representing up to how many words to generate
                            
    Returns
    -------
    
    predictions - string
                    a string containing the generated sentence
    """

    start_word_indexed = torch.LongTensor(1)
    start_word_indexed[:] = word2idx[start_word] # replace the starting word with its index
    
    with torch.no_grad(): # don't need to compute the gradients
        ff_model.eval() # put the model in evaluation mode
        predictions = [] # list where we are going to store the predictions
        predictions.append(idx2word[start_word_indexed.item()]) # add the starting word to the array
        topk = start_word_indexed # use the starting word as the first previous word during the prediction

        while((len(predictions) < max_len) and (predictions[-1] != '</s>')): # generate until we have enough words or generated </s>
            
            # YOUR CODE HERE
            # raise NotImplementedError()
            
            output = ff_model(topk) 
            output = F.softmax(output, dim=1) 
            output = output.flatten() 
            topk = output.multinomial(1) 
            predicted_word = idx2word[topk.item()] 
            predictions.append(predicted_word) 
            topk = torch.LongTensor([word2idx[predicted_word]]) 
            
    predictions = ' '.join(predictions) # convert the array of predictions to a string
    
    return predictions

In [24]:
from nose.tools import assert_equal

dummy_word = '<s>' # starting word
dummy_max_len = 30 # maximum number of words to genertate

dummy_predictions = predict_ffnn(ff_model, word2idx, idx2word, dummy_word, dummy_max_len)

# check that a string is returned
assert_equal(type(dummy_predictions), str)

# check that the prediction starts with <s>
assert_equal(dummy_predictions.split()[0], '<s>')

# check that the model has generated enough samples or reached </s>
if len(dummy_predictions.split()) < dummy_max_len:
    assert_equal(dummy_predictions.split()[-1], '</s>')
else:
    assert_equal(len(dummy_predictions.split()), dummy_max_len)
    



Now, we are going to run the prediction function and see what the model generates. You can execute the cell below multiple times in order to get more predictions

In [25]:
start_word = ['<s>', '<s>', '<s>'] # starting word
max_len = 30 # maximum number of words to genertate

for word in start_word:
    predictions = predict_ffnn(ff_model, word2idx, idx2word, word, max_len)
    print(predictions)

<s> overlooked assisting stay procuring solidity feels harringtons sinking closure improbable amidst procuring solidity talker procuring conjecture oppressively regain tempers procuring lad procuring gloom named check improbable amidst procuring killed
<s> frisks anywhere longbourn procuring instantly glass improbable amidst procuring sedate patient stubbornness effect brightened feels strong commonly blame cheering effect minds improbable amidst procuring sedate declaration profligate disappointments superseded
<s> frisks wit tempted although none prudent acquit killed instantly sportsmen sustained similarity watchfulness deserted venting trifling honest late undecided swelling late wit unjust ignorance precipitate suspicious solidity profligate procuring


## TASK 3:  Recurrent neural network language model <a class="anchor" id="task_3"></a>

In this task, we are going to implement a recurrent neural network (RNN) language model. The recurrent neural network processes the input sequentially, where each prediction is conditionally dependent of the previous predictions. This makes them suitable for language modeling tasks.

To predict the next word, we first need to convert the previous word into a vecor $x(t)$. In other words, we need to embed it. This step is identical to the one in the FFNN model. <br>
After we have the word embedding, we need to pass it through the RNN, which is our case is a [Gated Recurrent Unit](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) (GRU). The GRU returns two tensors: `output` and `hidden`. The `output` tensor contains the output features from the last layer of the GRU, for each timestep. The `hidden` tensor contains the hidden state of the last timestep of each layer. <br>
To get an output of size equal to the number of unique words in our vocabulary, we need to pass the output of the GRU through a linear projection, similar to the second linear projection in the FFNN model.

### 3.1 Create the model (3 Points) <a class="anchor" id="task_3_1"></a>
The `RNN` class contains the definition of the RNN model that we are going to train for the language modeling task. The first function that the class has is called `__init__` and it initializes the layers of the model. This function is already implemented for you.

The next function is the `forward` function and does similar job like the one in the FFNN model. You need to implement this function.

To implement the function, you need to perform the following steps:

    1. Replace the indexed word with its embedding vector. In other words, pass it through the embedding layer
    2. Reshape the embedding vector to a shape of (1, batch_size, embed_dim)
    3. Pass the embedding through the GRU cell to get the output and the hidden tensors. The GRU function takes as input the word embedding and the previous hidden state.
    4. Addpy a dropout to the output of the GRU.
    5. Apply the linear transformation to the output of the dropout layer (pass it though the `self.out` layer).
    6. Reshape the output to have a shape (batch_size, vocab_length+1)
    7. Return the output of the linear transformation and the hidden tensor
    
Remember that now we are processing the words in batches. If we have a batch size of 5, that means that we are passing 5 words at a time and applying all the transformations to those 5 words simultaneously.

In [26]:
class RNN(nn.Module):
    def __init__(self, word2idx, embed_dim, context_dim, num_layers):
        """
        This function initializes the layers of the model
        
        Arguments
        ---------
        word2idx - dictionary
                    a dictionary where the keys are the unique words in the data
                    and the values are the unique indices corresponding to the words
        
        embed_dim - integer
                        the size of the word embeddings

        context_dim - integer
                        the dimension of the hidden size
                        
        num_layers - integer
                        the number of layers in the GRU cell
        """
        super(RNN, self).__init__()
        self.word2idx = word2idx
        self.embed_dim = embed_dim
        self.context_dim = context_dim
        self.num_layers = num_layers
        
        # here we initialise weighs of a model
        self.word_embed = nn.Embedding(len(self.word2idx)+1, self.embed_dim) # embedding layer

        self.gru = nn.GRU(self.embed_dim, self.context_dim, num_layers=self.num_layers) # GRU cell
        
        self.dropout = nn.Dropout(0.1) # Dropout
        
        self.out = nn.Linear(self.context_dim, len(self.word2idx)+1) # output layer

    
    def forward(self, word, hidden):
        """
        This function implements the forward pass of the model
        
        Arguments
        ---------
        word - tensor
                a tensor containing indices of the words in a batch
                
        hidden - tensor
                    the previous hidden state of the GRU model
        
        Returns
        -------
        output - tensor
                    a tensor of logits from the linear transformation
        
        hidden - tensor
                    the current hidden state of the GRU model
        """ 
        
        # YOUR CODE HERE
        # raise NotImplementedError()
    
        embeds = self.word_embed(word).view(1, -1, self.embed_dim) 
        gru_output, hidden = self.gru(embeds, hidden)
        output = self.dropout(gru_output)
        output = self.out(output)
        output = output.view(-1, len(self.word2idx) + 1) # Reshape for compatibility
        
        return output, hidden

In [27]:
from nose.tools import assert_equal

dummy_embed_dim = 10
dummy_hidden_size = 20
dummy_num_layers = 1

dummy_rnn_model = RNN(word2idx, dummy_embed_dim, dummy_hidden_size, dummy_num_layers)
dummy_train_input = torch.randint(1, 10, (2,))

dummy_hidden = torch.zeros((dummy_num_layers, dummy_train_input.size(0), dummy_hidden_size))
dummy_output, hidden = dummy_rnn_model(dummy_train_input, dummy_hidden)

# test that the returned result is a tensor
assert_equal(torch.is_tensor(dummy_output), True)

# test that the shapes match
assert_equal(dummy_output.size(), (2, len(word2idx)+1))



### Model initialization
In the next cell, we are going to define the hyperparameters of our neural network language model. Additionally, we will initialize the model, along with the loss function and the optimizer.

In [28]:
n_epochs = 10 # the number of epochs to train
embed_dim = 300 # the size of the embedding
hidden_size = 450 # the size of the hidden state
num_layers = 1 # the number of layers in the GRU cell
rnn_model = RNN(word2idx, embed_dim, hidden_size, num_layers) # initialize the RNN model
loss_function = nn.CrossEntropyLoss(ignore_index=0) # define the loss function
rnn_optimizer = optim.Adam(rnn_model.parameters(), lr=0.001) # define the optimizer

### Training the model
The training process is similar to the one in the FFNN model. We pass each word through the forward pass and obtain the `output` and the `hidden` states. Then, we use the `output` to compare it against the true labels and see how far we are from the correct result. After that we compute the partial derivatives and update the parameters. This process gets repeated for $ n $ number of epochs.

In [29]:
def train_rnn(pairs_batch_train, rnn_model, hidden_size, num_layers, loss_function, rnn_optimizer, n_epochs):
    """
    This function implements the training of the model

    Arguments
    ---------
    pairs_batch_train - object
                            a DataLoader object that contains the batched data

    rnn_model - object
                an RNN object that contains the initialized model
                
    hidden_size - integer
                    the size of the hidden layer (the context size)
    
    num_layers - integer
                        the number of layers in the GRU cell

    loss_function - object
                        the CrossEntropy loss function

    rnn_optimizer - object
                        an Adam object of the optimizer class

    n_epochs - integer
                the number of epochs to train
    """ 

    for epoch in range(n_epochs): # iterate over the epochs
        epoch_loss = 0
        rnn_model.train() # put the model in training mode
        
        for iteration, batch in enumerate(pairs_batch_train): # at each step take a batch of sentences
            sent_loss = 0
            rnn_optimizer.zero_grad() # clear gradients
            
            train_input, train_input_lengths, train_labels = batch # extract the data from the batch
            hidden = torch.zeros((num_layers, train_input.size(1), hidden_size)) # initialize the hidden state
            
            for i in range(train_input.size(0)): # iterate over the word in the sentence
                output, hidden = rnn_model(train_input[i], hidden) # forward pass
                labels = torch.LongTensor(train_labels.size(1)) # define a random tensor with batch_size as number of elements
                labels[:] = train_labels[i][:] # put the correct label values in the tensor
                
                sent_loss += loss_function(output, labels) # compute the loss, compare the predictions and the labels

            sent_loss.backward() # compute the backward pass
            rnn_optimizer.step() # update the parameters

            epoch_loss += sent_loss.item()
            
        print('Epoch: {}   Loss: {}'.format(epoch+1, epoch_loss / len(pairs_batch_train))) # print the loss at each epoch

Similiar to the FFNN, we are not going to train the model. Instead, we are going to load a pre-trained model.

In [30]:
rnn_model = torch.load('../coursedata/nn-lm/rnn_model.pt', map_location='cpu')

### 3.2 Generate text (3 Points) <a class="anchor" id="task_3_2"></a>
Now that the model is trained, we can use it to generate sentences.

Your task is to implement the `predict_rnn` function.

You need to perform the following steps:

    1. Run the forward pass to get the output. Don't forget to pass the `hidden` state
    2. Run the output through a softmax to convert it to a probability distribution (`F.softmax`) [don't forget to specify the dimension in the softmax function]
    3. Get the word with the highest probability using the `topk()` function
    4. Set the `next_input` to be the predicted word with the highest probability (the topk word) [very important!].
    5. Convert the index of the predicted word to the actual word using the idx2word dictionary
    6. Append the predicted word to the `predictions` array

In [31]:
def predict_rnn(rnn_model, hidden_size, num_layers, word2idx, idx2word, context, max_len):
    """
    This function predicts the next word, based on the history of the previous words.
    We start with the 'context' and then feed the prediction as the next input.
    
    Arguments
    ---------
    rnn_model - object
                an RNN object that contains the trained model
                
    hidden_size - integer
                    the size of the hidden layer (the context size)
                    
    num_layers - integer
                    the number of layers in the GRU cell
                
    word2idx - dictionary
                    a dictionary where the keys are the unique words in the data
                    and the values are the unique indices corresponding to the words
                    
    idx2word - dictionary
                a dictionary, where the keys are the indices and the values are the words
                    
    context - string
                the context sentence
    
    max_len - integer
                integer value representing up to how many words to generate
                            
    Returns
    -------
    
    predictions - string
                    a string containing the generated sentence
    """
    
    # index the context
    context_indexed = []
    for word in context.split():
        word_indexed = torch.LongTensor(1)
        word_indexed[:] = word2idx[word]
        context_indexed.append(word_indexed)
    
    with torch.no_grad():
        predictions = []
        # first build the hidden state from the context
        hidden = torch.zeros((num_layers, 1, hidden_size))
        for word in context_indexed:
            predictions.append(idx2word[word.item()])
            output, hidden = rnn_model(word, hidden)
            
        next_input = context_indexed[-1]
        while((len(predictions) < max_len) and (predictions[-1] != '</s>')):
            
            # YOUR CODE HERE
            # raise NotImplementedError()
            
            output, hidden = rnn_model(next_input, hidden) 
            output_softmax = F.softmax(output, dim=1) 
            top_value, top_index = output_softmax.topk(1)  
            next_input = top_index.squeeze().detach()
            
            if next_input.item() in idx2word:
                predicted_word = idx2word[next_input.item()]
                predictions.append(predicted_word)
            else:
                break
            if predicted_word == '</s>': 
                break
                
    predictions = ' '.join(predictions)
    
    return predictions

In [32]:
from nose.tools import assert_equal

dummy_context = '<s> the' # starting word
dummy_max_len = 15 # maximum number of word to genertate

dummy_predictions = predict_rnn(rnn_model, hidden_size, num_layers, word2idx, idx2word, dummy_context, dummy_max_len)

# check that a string is returned
assert_equal(type(dummy_predictions), str)

# check that the prediction starts with <s>
assert_equal(dummy_predictions.split()[0], '<s>')

# check that the model has generated enough samples or reached </s>
if len(dummy_predictions.split()) < dummy_max_len:
    assert_equal(dummy_predictions.split()[-1], '</s>')
else:
    assert_equal(len(dummy_predictions.split()), dummy_max_len)
    


Now, let's generate some text.

In [33]:
contexts = ['<s> this has been', '<s> the person', '<s> it is interesting']
max_len = 50

for context in contexts:
    predictions = predict_rnn(rnn_model, hidden_size, num_layers, word2idx, idx2word, context, max_len)
    print(predictions)
    print('\n')

<s> this has been entirely owned replying dupe dupe dupe dupe substitute arch bridegroom dupe lucases mouths music honestly music honestly music honestly death seizing testimony seizing seizing prudently uglier dissolved waste weary humanity chaise honestly honestly music afforded uglier turned licence jestingly fourthly named dupe ankles announce entreaties sisters


<s> the person mentioned signs condemned formation waste comprehends restraint kympton gentlest hesitate abusing waste begun warmest continued condemned despicably descending loose treasured mud quartered week prodigious quartered twenty violence censured practises insipidity violence conclusion looking powers behave remaining deaden unqualified unbecoming music quartered descended licence dance descending doleful blinded


<s> it is interesting encumbrance flirtation flirtation antagonist honour selfish ashamed dupe substitute behave title struggled warmest superciliousness blinded thoughtlessness ensued abusing arch abr

## TASK 4: Output analysis <a class="anchor" id="task_4"></a>
This task will be manually graded. It is focused on understanding the difference between the models and why one might perform better than the other.

## 4.1 Model comparison (3 Points) <a class="anchor" id="task_4_1"></a>
Answer the following questions:

    1. Which model generates more sensible text?
    2. Why is that?
    3. Write at least one shortcoming of both models.

Compared to FFNN, RNN generates far more logical sentences based on the generated output.

1. Text produced by the RNN model is more logical. When it comes to sequential data tasks, like text prediction in our example, RNN outperforms FFNN. This is because RNN keeps an internal memory of previous inputs and outputs. This enables RNN to model dependencies between elements of a data sequence by allowing the network to create a feedback loop (by using the previous time step's output as the current time step's input). On contrary, the FFNN design is less effective for processing sequential data because it can only handle fixed-size inputs.

2. Unlike the FFNN model, which simply predicts the future word based on the present word, the RNN model continuously stores and updates inputs, giving it some recollection of past information. As a result, the RNN model can produce sentences that is more logical and coherent. Both FFNN and RNN are prone to overfitting, particularly when the models have many parameters in comparison to the volume of training data.

3. Because both models have poor memory capacities, they are not very well suited to produce both long and logical statements. The FFNN architecture exhibits particularly poor performance in predicting long phrases because to its reliance on fixed-size input and lack of memory capacity to manage sequential data. Furthermore, the challenge of vanishing/exploding gradients with time make it even more difficult for RNNs to forecast longer texts.

## Checklist before submission <a class="anchor" id="checklist"></a>
### 1
To make sure that you didn't forget to import some package or to name some variable, press **Kernel -> Restart** and then **Cell -> Run All**. This way your code will be run exactly in the same order as during the autograding.
### 2
Click the **Validate** button in the upper menu to check that you haven't missed anything.
### 3
To submit the notebook, click on the **jupyterhub** logo in the upper left part of the window, choose the **Assignments** folder, and press **submit**. You can submit multiple times, only the last one counts.
### 4
Please provide a feedback so that we can improve the assignment.