# 📚  Exercise Session - Week 2

Welcome to the Week 2 exercise session of CS552-Modern NLP!


> **What will be covered:**
1. [**TASK A:** N-gram Language Models](#ngram_lm)
    - [Unigram Language Model](#unigram_lm)
    - [Bi-gram Language Model](#bigram_lm)
    - [Tri-gram Language Model](#trigram_lm)
     
2. [**TASK B:** Neural Language Models](#neural_lm)
    - [Fixed-Window Neural Language Model](#fixed_window_lm)
    - [RNN-based Language Model](#rnn_lm)

> **By the end of the session you will be able to:**
> - ✅  Compute and interpret the perplexity of a language model 
> - ✅  Implement N-gram language models for N=1,2,3
> - ✅  Implement, train, and evaluate a fixed window language model
> - ✅  Evaluate an RNN language model
> - ✅  Understand the advantages and disadvantages of each of the above models

In [1]:
# install the libraries if needed.
# !pip install datasets
# !pip install numpy

<a name="ngram_lm"></a>
## 1. Task A: N-gram Language Models 


In this exercise, we will take a closer look at how different (non-neural) language models work, namely the Unigram, Bi-gram, and Tri-gram LMs.

### 1.1 Unigram Language Model <a name="unigram_lm"></a>
In the simple Unigram language model, we pick/generate the next token independently of all previous tokens. In other words, during generation we sample each token according to its own probability. Therefore, for an arbitrary sequence $x_1x_2~...x_n$, the probability becomes:
$$p(x_1x_2~...x_n) = \prod_{i=1}^n p(x_i)$$
Let's use an unsupervised dataset (raw corpus) to evaluate this model's perplexity. We use Hugging Face's `datasets` library to download the datasets we need.
 

Here we use the `Penn Treebank` dataset, featuring a million words of 1989 Wall Street Journal material. The rare words in this version are already replaced with the `<unk>` token, and numbers are replaced with a special token as well. These replacements give us a more manageable vocabulary size to work with.
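Before loading the real data, the unigram factorization above can be sketched on a tiny hand-made corpus. The corpus and the names `toy_corpus`, `toy_probs`, and `toy_unigram_probability` are illustrative assumptions, not part of the exercise code.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the PTB data).
toy_corpus = ["the cat sat <stop>", "the dog sat <stop>"]

# Count every token, then normalize the counts into probabilities.
toy_counts = Counter(tok for sent in toy_corpus for tok in sent.split())
toy_total = sum(toy_counts.values())
toy_probs = {tok: count / toy_total for tok, count in toy_counts.items()}

def toy_unigram_probability(sentence: str) -> float:
    """Product of independent token probabilities; 0.0 if any token is unseen."""
    return math.prod(toy_probs.get(tok, 0.0) for tok in sentence.split())

p_cat = toy_unigram_probability("the cat sat <stop>")
p_shuffled = toy_unigram_probability("sat <stop> cat the")
print(p_cat, p_shuffled)
```

Because each token is scored independently, shuffling the words leaves the probability unchanged, which hints at why a unigram model cannot capture word order.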


In [2]:
import torch
import datasets
import numpy as np
from datasets import load_dataset

ptb_dataset = load_dataset("ptb_text_only", split="train")

# splitting dataset in train/test (to be later used for language model evaluation)
ptb_dataset = ptb_dataset.train_test_split(test_size=0.2, seed=1)
ptb_train, ptb_test = ptb_dataset['train'], ptb_dataset['test']



In [3]:
torch.cuda.is_available()

False

#### Let's have a look at a few samples of the training dataset (and also the structure of the dataset)

In [4]:
print(f"{ptb_train[0]}\n\n{ptb_train[1]}\n\n{ptb_train[2]}")

{'sentence': "a former executive agreed that the departures do n't reflect major problems adding if you see any company that grows as fast as reebok did it is going to have people coming and going"}

{'sentence': 'with talk today of a second economic <unk> in west germany east germany no longer can content itself with being the economic star in a loser league'}

{'sentence': 'transportation secretary sam skinner who earlier fueled the anti-takeover fires with his <unk> attacks on foreign investment in u.s. carriers now says the bill would further <unk> the jittery capital markets'}


During generation with a given language model, we often need to have a `<stop>` token in our vocabulary to terminate the generation of a given sentence/paragraph. In this dataset, every sample is a sentence, and the `<stop>` token should be added to the end of every sample (i.e., end of sentence).

#### Create new train/test datasets from `ptb_train` and `ptb_test` that have a `<stop>` token at the end of each sentence. (Note: do not change the structure of the dataset objects; only change the sentences as discussed.)
Hint: use the `.map()` functionality of the `datasets` package (read more [here](https://huggingface.co/docs/datasets/process#map)).

In [5]:
def add_stop_token(input_sample: dict):
    '''
    args:
        input_sample: a dict representing a sample of the dataset (see above for the dict structure).
    output:
        modified_sample: modified dict adding <stop> at the end of each sentence.
    '''
    modified_sample = input_sample.copy()
    modified_sample['sentence'] = input_sample['sentence'] + " <stop>"
    
    return modified_sample
    
    
ptb_train = ptb_train.map(add_stop_token)
ptb_test = ptb_test.map(add_stop_token)

For both the `ptb_train` and `ptb_test` datasets, filter out every sample that has 3 or fewer tokens (counting the `<stop>` token). This removes very short sentences that are not very helpful for training/evaluating a language model.

Hint: use `.filter()` functionality of the `datasets` package (read more [here](https://huggingface.co/docs/datasets/process#select-and-filter)).

In [6]:
ptb_train = ptb_train.filter(lambda x : len(x['sentence'].split()) > 3)
ptb_test = ptb_test.filter(lambda x : len(x['sentence'].split()) > 3)

In [7]:
ptb_train['sentence'][:100]

["a former executive agreed that the departures do n't reflect major problems adding if you see any company that grows as fast as reebok did it is going to have people coming and going <stop>",
 'with talk today of a second economic <unk> in west germany east germany no longer can content itself with being the economic star in a loser league <stop>',
 'transportation secretary sam skinner who earlier fueled the anti-takeover fires with his <unk> attacks on foreign investment in u.s. carriers now says the bill would further <unk> the jittery capital markets <stop>',
 "separately the company 's board adopted a proposal to <unk> its N shareholder rights plan further <unk> the company from takeover <stop>",
 "thomas p. <unk> chief financial officer would n't comment about the details of the negotiations <stop>",
 'before the recent <unk> in global financial markets b.a.t officials holders and analysts had expected a substantial part of the restructuring to be complete by the end of the fir

#### What are the 10 most frequent tokens in this dataset? Can you spot the token used to replace the numbers in this dataset? How are rare tokens replaced in this dataset?

#### Now let's create a dictionary of word frequencies (in the format `{word: count(word)}`) in the following function. Dividing a count by the total number of tokens gives `Prob(word)`, which we will use to estimate the probability of a given sequence, as mentioned above.

In [8]:
from collections import defaultdict, Counter

def get_word_probability_dict(train_dataset: datasets.arrow_dataset.Dataset):
    '''
    args: 
        train_dataset: a Dataset object that can be iterated to get all the sentences
    output:
        word_prob_dict: a Counter mapping each token to its raw count in train_dataset
                        (unseen tokens map to 0); divide by the total count to get probabilities.
    '''
    word_prob_dict = Counter()
    for sentence in train_dataset['sentence']:
        word_prob_dict.update(sentence.split())

    return word_prob_dict

word_prob_dict = get_word_probability_dict(ptb_train)
word_prob_dict

Counter({'the': 40612,
         '<unk>': 35814,
         '<stop>': 33306,
         'N': 25940,
         'of': 19459,
         'to': 18896,
         'a': 16899,
         'in': 14472,
         'and': 14013,
         "'s": 7850,
         'for': 7108,
         'that': 7100,
         '$': 5954,
         'is': 5923,
         'it': 4866,
         'said': 4840,
         'on': 4493,
         'at': 3946,
         'by': 3939,
         'as': 3860,
         'from': 3818,
         'with': 3678,
         'million': 3655,
         'mr.': 3468,
         'was': 3245,
         'be': 3122,
         'are': 3111,
         'its': 3089,
         'he': 2897,
         'but': 2841,
         'has': 2818,
         'an': 2787,
         "n't": 2694,
         'will': 2582,
         'have': 2557,
         'new': 2228,
         'company': 2175,
         'or': 2141,
         'they': 2074,
         'this': 1953,
         'year': 1908,
         'which': 1872,
         'would': 1840,
         'about': 1732,
         'says'

Let's also get a sense of how large the top-k counts are:

In [9]:
sorted(word_prob_dict.items(), key=lambda item: item[1], reverse=True)[:20]

[('the', 40612),
 ('<unk>', 35814),
 ('<stop>', 33306),
 ('N', 25940),
 ('of', 19459),
 ('to', 18896),
 ('a', 16899),
 ('in', 14472),
 ('and', 14013),
 ("'s", 7850),
 ('for', 7108),
 ('that', 7100),
 ('$', 5954),
 ('is', 5923),
 ('it', 4866),
 ('said', 4840),
 ('on', 4493),
 ('at', 3946),
 ('by', 3939),
 ('as', 3860)]

In [10]:
type(word_prob_dict)

collections.Counter

#### Now let's analyze the Unigram language model for different sequences. We first create a function that can output the probability for a given string.

In [11]:
def unigram_lm_seq_probability(input_sentence: str,
                               word_prob_dict: dict):
    '''
    args:
        input_sentence: The input sequence string (tokens separated by whitespace).
        word_prob_dict: A Counter of token counts from the training set; it is normalized here.
    output:
        probability: The probability of the input_sentence according to the Unigram language model
    '''
    # YOUR CODE HERE
    num_words = sum(word_prob_dict.values())
    probabilities = [word_prob_dict[word] / num_words for word in input_sentence.split()]
    # print(probabilities)
    probability = np.prod(probabilities)
    return probability

#### Let's investigate a major issue with Unigram language model. What are the probabilities for the two following sequences?
- the the the the \<stop>
- i love computer science \<stop>

Discussion: How can we avoid assigning large probability values to sequences like `the the the the <stop>`?

In [12]:
seq1 = "the the the the <stop>"
seq2 = "i love computer science <stop>"

prob_seq1 = unigram_lm_seq_probability(seq1, word_prob_dict)
prob_seq2 = unigram_lm_seq_probability(seq2, word_prob_dict)
print(f"probability for seq1 is {prob_seq1}, and for seq2 is {prob_seq2}")

probability for seq1 is 4.0260352626148156e-07, and for seq2 is 2.3386075741772593e-17


#### Now let's formally evaluate the Unigram model in terms of perplexity. We first compute the entropy as the average negative log-likelihood:
$$H(W_{test} \mid M) = \frac{1}{|W_{test}|} \sum_{w\in W_{test}} -\log_2 P(w \mid M)$$
where $W_{test}$ is the input sequence and $M$ is the Unigram language model (note that the logarithm is in base 2).

To get a reliable value, we compute this quantity for every sentence in the `ptb_test` dataset and then average over all these samples.
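As a quick sanity check of the definition above, here is a minimal sketch that computes the entropy and the corresponding perplexity $2^{H}$ for hand-picked token probabilities. The helper `toy_entropy_bits` and the probability lists are assumptions made up for illustration.

```python
import math

def toy_entropy_bits(token_probs):
    """Average negative log2-probability of the observed tokens."""
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)

# A model that assigns probability 1/8 to each of the 5 observed tokens.
uniform_probs = [1 / 8] * 5
h_uniform = toy_entropy_bits(uniform_probs)   # 3 bits per token
ppl_uniform = 2 ** h_uniform                  # perplexity of 8

# A model that is more confident about some tokens than others.
mixed_probs = [0.5, 0.25, 0.25]
h_mixed = toy_entropy_bits(mixed_probs)       # (1 + 2 + 2) / 3 bits per token
ppl_mixed = 2 ** h_mixed

print(h_uniform, ppl_uniform, h_mixed, ppl_mixed)
```

A perplexity of 8 can be read as "the model is as uncertain as if it were choosing uniformly among 8 tokens at every step".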

In [13]:
def get_unigram_lm_entropy(input_sentence: str,
                           word_prob_dict: dict):
    '''
    args:
        input_sentence: the input string whose entropy we want to compute.
        word_prob_dict: a Counter of token counts from the training set (unseen tokens map to 0)
    output:
        entropy: the average negative log2-probability per token, as defined above
    '''
    num_words = sum(word_prob_dict.values())
    small_value = 1e-10
    words = input_sentence.split()

    # NOTE: a token unseen in training adds only `small_value` (about 0 bits) to the sum,
    # so it is effectively skipped rather than penalized; this keeps the entropy finite.
    neg_log_probs = [-np.log2(word_prob_dict[word] / num_words) if word_prob_dict[word] != 0 else small_value
                     for word in words]
    entropy = 1/len(words) * np.sum(neg_log_probs)
    return entropy

Now compute the entropy of every sentence in `ptb_test` using the function above, and average these values. Then compute the perplexity as $2^{\bar{H}}$, where $\bar{H}$ is the average entropy over the test dataset.

In [14]:
def get_unigram_lm_perplexity(test_dataset: datasets.arrow_dataset.Dataset,
                              word_prob_dict: dict):
    '''
    args:
        test_dataset: the test dataset whose samples are used to compute the perplexity of the Unigram LM.
        word_prob_dict: a Counter of token counts from the training set
    output:
        perplexity: 2 raised to the average per-sentence entropy, as defined above
    '''  
    avg_entropy = np.sum([get_unigram_lm_entropy(sentence, word_prob_dict) for sentence in test_dataset['sentence']]) / len(test_dataset['sentence'])
    perplexity = 2**avg_entropy

    return perplexity
      
unigram_lm_perplexity = get_unigram_lm_perplexity(ptb_test, word_prob_dict)
print(f"The perplexity for the Unigram language model is {unigram_lm_perplexity}")

The perplexity for the Unigram language model is 679.6126244532833


As discussed in the lectures, models with lower perplexity are preferred; however, we should be careful when comparing language models with different vocabulary sizes.
#### In the `ptb_train` dataset, replace every token that appears fewer than 10 times with the `<unk>` token. (Note: the same token replacement should be done for the test dataset.) What is the Unigram language model perplexity for the new dataset?
Discussion: What would happen to the vocabulary size and the perplexity as we increase the rare-token threshold to higher values (instead of 10)?
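To build some intuition for the discussion question, the sketch below applies several thresholds to a small set of made-up token counts and reports the resulting vocabulary size and the number of occurrences that collapse into `<unk>`. The counts and the helper `toy_vocab_after_threshold` are hypothetical.

```python
from collections import Counter

# Hypothetical token counts (not taken from the PTB data).
toy_counts = Counter({"the": 50, "of": 20, "market": 12, "reebok": 3, "jittery": 1})

def toy_vocab_after_threshold(counts, threshold):
    """Tokens with count >= threshold are kept; all others collapse into <unk>."""
    kept = {tok for tok, count in counts.items() if count >= threshold}
    unk_mass = sum(count for tok, count in counts.items() if tok not in kept)
    vocab = kept | ({"<unk>"} if unk_mass > 0 else set())
    return vocab, unk_mass

sizes = {}
unk_masses = {}
for threshold in (1, 5, 15, 100):
    vocab, unk_mass = toy_vocab_after_threshold(toy_counts, threshold)
    sizes[threshold] = len(vocab)
    unk_masses[threshold] = unk_mass

print(sizes)
print(unk_masses)
```

As the threshold grows, the vocabulary shrinks and more probability mass moves to a single `<unk>` token, so the prediction task becomes easier and the measured perplexity tends to drop. This is one reason perplexities are only directly comparable between models that share a vocabulary.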

In [15]:
def remove_rare_token(train_dataset: datasets.arrow_dataset.Dataset,
                      test_dataset: datasets.arrow_dataset.Dataset,
                      rare_token_threshold: int):
    '''
    Note that the tokens that are considered rare here are identified based on the train_dataset, so that
    we have the same token mapping (to <unk>) for both the train and test datasets. 
    args:
        train_dataset: the train dataset whose rare tokens have to be replaced with the <unk> token.
        test_dataset: the test dataset, to which the same replacement is applied.
        rare_token_threshold: every word that appears fewer times than this threshold in the train
                              dataset will be replaced with the <unk> token
    output:
        cleaned_train_dataset: the cleaned train dataset where rare tokens are replaced with the <unk> token.
        cleaned_test_dataset: the cleaned test dataset where rare tokens are replaced with the <unk> token.
    '''
    # Count tokens on the train split passed in (rather than relying on a global variable).
    train_counts = get_word_probability_dict(train_dataset)
    # A set gives O(1) membership tests; a list would be scanned for every token.
    rare_tokens = {word for word, count in train_counts.items() if count < rare_token_threshold}

    def replace_rare(sample):
        return {'sentence': ' '.join('<unk>' if word in rare_tokens else word
                                     for word in sample['sentence'].split())}

    cleaned_train_dataset = train_dataset.map(replace_rare)
    cleaned_test_dataset = test_dataset.map(replace_rare)

    return cleaned_train_dataset, cleaned_test_dataset

cleaned_train_dataset, cleaned_test_dataset = remove_rare_token(train_dataset=ptb_train,
                                                                test_dataset=ptb_test,
                                                                rare_token_threshold=10)


##### Now, follow similar steps to compute the perplexity given the two new datasets (`cleaned_train_dataset` and `cleaned_test_dataset`)

In [16]:
cleaned_unigram_lm_perplexity = -1

cleaned_word_prob_dict = get_word_probability_dict(cleaned_train_dataset)
cleaned_unigram_lm_perplexity = get_unigram_lm_perplexity(cleaned_test_dataset, cleaned_word_prob_dict)

print("The perplexity for the Unigram language model after replacing rare tokens is ",
      cleaned_unigram_lm_perplexity)

The perplexity for the Unigram language model after replacing rare tokens is  461.2095760285583


### 1.2 Bi-gram Language Model <a name='bigram_lm'></a>
In the Bi-gram language model, we pick/generate the next token conditioned only on the previous token. Therefore, for an arbitrary sequence $x_1x_2~...x_n$, the probability becomes:
$$p(x_1x_2~...x_n) = p(x_1) ~\prod_{i=2}^n p(x_i|x_{i-1})$$
Let's use the same dataset (`Penn Treebank`) to evaluate this model's perplexity. (We use the dataset that already has the `<stop>` token at the end of each sentence.)

We estimate $p(x_i|x_{i-1})$ as $\frac{count(x_{i-1},~x_i)}{count(x_{i-1})}$, using the frequencies in the training dataset.
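The count-based estimate above can be illustrated on a toy corpus. This is a minimal sketch; the corpus and the names `toy_unigram_counts`, `toy_bigram_counts`, and `toy_bigram_mle` are assumptions introduced only for this example.

```python
from collections import Counter

# Hypothetical toy corpus (not the PTB data).
toy_corpus = ["the cat sat <stop>", "the dog sat <stop>", "the cat ran <stop>"]

toy_unigram_counts = Counter()
toy_bigram_counts = Counter()
for sent in toy_corpus:
    toks = sent.split()
    toy_unigram_counts.update(toks)
    # zip(toks, toks[1:]) yields every pair of consecutive tokens.
    toy_bigram_counts.update(zip(toks, toks[1:]))

def toy_bigram_mle(prev: str, cur: str) -> float:
    """count(prev, cur) / count(prev); 0.0 when the context was never seen."""
    if toy_unigram_counts[prev] == 0:
        return 0.0
    return toy_bigram_counts[(prev, cur)] / toy_unigram_counts[prev]

print(toy_bigram_mle("the", "cat"))   # "the" occurs 3 times, "the cat" twice
print(toy_bigram_mle("dog", "ran"))   # this pair never occurs in the toy corpus
```

Any pair of consecutive words that is absent from the training data gets probability exactly zero under this estimate, no matter how plausible it is.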

In [17]:
def get_first_order_conditional_probabilities(train_dataset: datasets.arrow_dataset.Dataset):
    '''
    In this function the conditional probabilities have to be computed based on train_dataset. The output of
    the function includes a dictionary with keys like (x_{i-1}, x_i) as a tuple and the value being p(x_i|x_{i-1}).
    args:
        train_dataset: a Dataset object that can be iterated to get all the sentences
    output:
        word_prob_dict: a Counter containing the raw count of every word
        first_order_condition_prob: a dictionary containing the first order conditional probabilities
                                    as discussed above.
    '''
    first_order_condition_prob = defaultdict(float)  # unseen bigrams default to 0.0
    # let's first get the word frequencies (later used for computation of conditional probabilities)
    word_prob_dict = get_word_probability_dict(train_dataset)
    
    for sentence in train_dataset['sentence']:
        words = sentence.split()
        for i in range(1, len(words)):
            first_order_condition_prob[(words[i-1], words[i])] += 1
    
    for key in first_order_condition_prob.keys():
        w1 = word_prob_dict[key[0]]
        first_order_condition_prob[key] /= w1
    return word_prob_dict, first_order_condition_prob

word_prob_dict, first_order_condition_prob = get_first_order_conditional_probabilities(ptb_train)
first_order_condition_prob

defaultdict(float,
            {('a', 'former'): 0.0038463814426889166,
             ('former', 'executive'): 0.01606425702811245,
             ('executive', 'agreed'): 0.00211864406779661,
             ('agreed', 'that'): 0.046218487394957986,
             ('that', 'the'): 0.14,
             ('the', 'departures'): 9.849305623953512e-05,
             ('departures', 'do'): 0.1111111111111111,
             ('do', "n't"): 0.504950495049505,
             ("n't", 'reflect'): 0.001855976243504083,
             ('reflect', 'major'): 0.023255813953488372,
             ('major', 'problems'): 0.009615384615384616,
             ('problems', 'adding'): 0.004149377593360996,
             ('adding', 'if'): 0.016129032258064516,
             ('if', 'you'): 0.08843537414965986,
             ('you', 'see'): 0.009538950715421303,
             ('see', 'any'): 0.022222222222222223,
             ('any', 'company'): 0.0015151515151515152,
             ('company', 'that'): 0.016551724137931035,
             

#### Now let's analyze the Bi-gram language model for different sequences. We first create a function that can output the probability for a given string.

In [18]:
def bigram_lm_seq_probability(input_sentence: str,
                              word_prob_dict: dict,
                              first_order_condition_prob: dict):
    '''
    args:
        input_sentence: The input sequence string (tokens separated by whitespace).
        word_prob_dict: a Counter containing the raw word counts (normalized here for the first token)
        first_order_condition_prob: a dictionary containing the first order conditional probabilities
                                    as discussed in the previous function.
    output:
        probability: The probability of the input_sentence according to the Bi-gram language model
    '''
    num_words = sum(word_prob_dict.values())

    words = input_sentence.split()
    for i in range(len(words)):
        if i == 0:
            probability = word_prob_dict[words[i]] / num_words
        else:
            probability *= first_order_condition_prob[(words[i-1], words[i])]
    

    return probability

Let's investigate a major issue with higher-order language models.
#### Compute the probabilities for all the sequences in the `ptb_test` dataset, and compute the minimum value among these probabilities. What would be the perplexity for the dataset given these values?
Discussion: How can we avoid this **overfitting** to the train dataset?

In [40]:
bigram_test_probabilities = []

bigram_test_probabilities = [bigram_lm_seq_probability(sentence, word_prob_dict, first_order_condition_prob) for sentence in ptb_test['sentence']]

print(f"{bigram_test_probabilities.count(0)/len(ptb_test)*100}% of samples in the test set have zero probability!")

91.2621359223301% of samples in the test set have zero probability!


### Smoothing
As we saw above, because the test dataset contains pairs of consecutive words that never occur in the train dataset, some sequences get zero probability. Therefore, as discussed in the lectures, to obtain a meaningful perplexity for N-gram language models, we need to smooth the probabilities so that unseen sequences get non-zero values. In this exercise, we use Laplace smoothing as defined below:
$$P(x_i|x_{i-1}) = \frac{count(x_{i-1},~x_i) + \alpha}{count(x_{i-1}) + \alpha ~|V|}$$
where $\alpha$ is the smoothing parameter, and $|V|$ is the (train dataset) vocabulary size.
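The effect of the smoothing formula can be checked on a toy corpus: an unseen bigram receives a small non-zero probability, and the probabilities for a given context still sum to 1. This is only a sketch; the corpus, the value of `toy_alpha`, and the helper `toy_laplace_bigram` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy corpus and smoothing strength.
toy_corpus = ["the cat sat <stop>", "the dog sat <stop>"]
toy_alpha = 0.5

toy_word_counts = Counter()
toy_pair_counts = Counter()
for sent in toy_corpus:
    toks = sent.split()
    toy_word_counts.update(toks)
    toy_pair_counts.update(zip(toks, toks[1:]))
toy_vocab = set(toy_word_counts)

def toy_laplace_bigram(prev: str, cur: str) -> float:
    """(count(prev, cur) + alpha) / (count(prev) + alpha * |V|)."""
    return (toy_pair_counts[(prev, cur)] + toy_alpha) / (
        toy_word_counts[prev] + toy_alpha * len(toy_vocab))

p_seen = toy_laplace_bigram("the", "cat")     # seen once in the toy corpus
p_unseen = toy_laplace_bigram("the", "sat")   # never seen, but no longer zero
row_sum = sum(toy_laplace_bigram("the", w) for w in toy_vocab)
print(p_seen, p_unseen, row_sum)
```

Smoothing takes probability mass away from seen bigrams (here `the cat` drops from 1/2 to 1/3) and spreads it over the unseen ones. A larger $\alpha$ moves the estimate closer to the uniform distribution over the vocabulary.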

#### Let's recompute the conditional probabilities using Laplace smoothing.

In [51]:
def get_smoothed_first_order_conditional_probabilities(train_dataset: datasets.arrow_dataset.Dataset,
                                                       smoothing_alpha: float):
    '''
    In this function the conditional probabilities have to be computed based on train_dataset. The output
    of the function includes a dictionary with keys like (x_{i-1}, x_i) as a tuple and the
    value being p(x_i|x_{i-1}).
    args:
        train_dataset: a Dataset object that can be iterated to get all the sentences
        smoothing_alpha: The alpha parameter used in the Laplace smoothing.
    output:
        word_prob_dict: a Counter containing the raw count of every word
        first_order_condition_prob: a dictionary containing the smoothed first order
                                    conditional probabilities of the bigrams seen in train_dataset.
    '''
    # let's first get the word counts (later used for computation of conditional probabilities)
    word_prob_dict = get_word_probability_dict(train_dataset)
    vocab_size = len(word_prob_dict)

    # Only bigrams seen in training are stored below; the smoothed value for an unseen bigram,
    # alpha / (count(x_{i-1}) + alpha * |V|), has to be supplied by the caller at lookup time.
    all_bigrams = []
    for sample in train_dataset:
        token_list = sample["sentence"].split()
        sample_bigrams = [(s1, s2) for s1, s2 in zip(token_list, token_list[1:])]
        all_bigrams += sample_bigrams
    bigram_frequency_dict = Counter(all_bigrams)
    
    first_order_condition_prob = defaultdict(
        float, {(w1,w2): (bigram_freq+smoothing_alpha)/(word_prob_dict[w1] + smoothing_alpha*vocab_size)
                for (w1,w2), bigram_freq in bigram_frequency_dict.items()})   
        
    return word_prob_dict, first_order_condition_prob

In [52]:
word_prob_dict, smoothed_first_order_condition_prob = get_smoothed_first_order_conditional_probabilities(ptb_train, 0.01)
smoothed_first_order_condition_prob


defaultdict(float,
            {('a', 'former'): 0.003824351607392453,
             ('former', 'executive'): 0.011491288399816595,
             ('executive', 'agreed'): 0.0017658577522903698,
             ('agreed', 'that'): 0.032577819860338496,
             ('that', 'the'): 0.13805771143173018,
             ('the', 'departures'): 9.849685448698613e-05,
             ('departures', 'do'): 0.009269456681350954,
             ('do', "n't"): 0.4493700163002775,
             ("n't", 'reflect'): 0.0017931538032040543,
             ('reflect', 'major'): 0.007064913262451035,
             ('major', 'problems'): 0.0080811665268727,
             ('problems', 'adding'): 0.002962224307836696,
             ('adding', 'if'): 0.006236107680908866,
             ('if', 'you'): 0.0806140164399093,
             ('you', 'see'): 0.008244622475856013,
             ('see', 'any'): 0.015417282127031017,
             ('any', 'company'): 0.001329017316700879,
             ('company', 'that'): 0.0158288497380173

In [57]:
def smoothed_bigram_lm_seq_probability(input_sentence: str,
                                       word_prob_dict: dict,
                                       word_frequency_dict: dict,
                                       first_order_condition_prob: dict,
                                       smoothing_alpha: float):
    '''
    args:
        input_sentence: The input sequence string (tokens separated by whitespace).
        word_prob_dict: a dictionary containing the word probabilities (used for the first token;
                        its size is used as the vocabulary size)
        word_frequency_dict: a dictionary containing the frequency for every word in vocabulary
        first_order_condition_prob: a dictionary containing the first order conditional probabilities
                                    as discussed in the previous function.
        smoothing_alpha: The alpha parameter used in the Laplace smoothing.
    output:
        probability: The probability of the input_sentence according to the smoothed Bi-gram language model
    '''
    vocab_size = len(word_prob_dict)
    token_list = input_sentence.split()
    bigram_list = list(zip(token_list, token_list[1:]))
    # p(x_1) followed by p(x_i | x_{i-1}); a bigram unseen in training falls back to
    # alpha / (count(x_{i-1}) + alpha * |V|), i.e. the Laplace estimate with a zero bigram count.
    probability = np.prod(
        [word_prob_dict[token_list[0]]] + [first_order_condition_prob.get(
            bigram, smoothing_alpha/(word_frequency_dict[bigram[0]] + smoothing_alpha*vocab_size))
                                           for bigram in bigram_list])
    
    return probability

#### Assuming $\alpha=0.01$ for the smoothing, use the previous function and `bigram_lm_seq_probability` to compute the sequence probabilities for all the sentences in the `ptb_test` dataset.

In [70]:
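# YOUR CODE HERE
# NOTE: the `word_prob_dict` returned below holds raw counts, so the first factor of each product
# is a count rather than p(x_1) until the dictionary is normalized in a later cell.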
word_frequency_dict = Counter(" ".join([i["sentence"] for i in ptb_train]).split())
(word_prob_dict,
     smoothed_first_order_condition_prob) = get_smoothed_first_order_conditional_probabilities(ptb_train, 0.01)
smoothed_bigram_test_probabilities = [
    smoothed_bigram_lm_seq_probability(
        input_sentence=sample["sentence"],
        word_prob_dict=word_prob_dict,
        word_frequency_dict=word_frequency_dict,
        first_order_condition_prob=smoothed_first_order_condition_prob,
        smoothing_alpha=0.01)
    for sample in ptb_test]

print(f"{smoothed_bigram_test_probabilities.count(0)/len(ptb_test)*100}% of samples in the test set have zero probability!")

0.0% of samples in the test set have zero probability!


With the perplexity of a single sequence defined as below, compute the Bi-gram language model perplexity over all the sentences of the `ptb_test` dataset ($\alpha=0.01$):
$$\mathrm{Perplexity}(x_1x_2...x_n) = p(x_1x_2...x_n)^{-1/n}$$
where $p(x_1x_2...x_n)$ is the probability assigned to the sequence $x_1x_2...x_n$ by the language model.
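The definition above is equivalent to $2^{-\frac{1}{n}\sum_i \log_2 p_i}$, where $p_i$ are the per-token probabilities. Working with the sum of logs is usually safer numerically, because the raw product of many small probabilities can underflow to zero. The sketch below compares both routes; the function names and probability values are assumptions chosen for illustration.

```python
import math

def toy_perplexity_direct(token_probs):
    """p(x_1..x_n) ** (-1/n), computed from the raw product of token probabilities."""
    return math.prod(token_probs) ** (-1 / len(token_probs))

def toy_perplexity_log_space(token_probs):
    """The same quantity, computed as 2 ** (mean of -log2 p_i)."""
    mean_neg_log2 = sum(-math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** mean_neg_log2

short_sentence = [0.5, 0.25, 0.125]
ppl_direct = toy_perplexity_direct(short_sentence)
ppl_log = toy_perplexity_log_space(short_sentence)

# 100 tokens with probability 1e-4 each: the true product is 1e-400,
# which is too small for a 64-bit float and underflows to 0.0.
long_sentence = [1e-4] * 100
underflowed_product = math.prod(long_sentence)
ppl_long = toy_perplexity_log_space(long_sentence)

print(ppl_direct, ppl_log, underflowed_product, ppl_long)
```

For the short sentence both routes agree, while for the long one only the log-space version returns a usable value.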

In [71]:
sum_elem = sum(word_prob_dict.values())
word_prob_dict = defaultdict(
        float, {w1: res/sum_elem
                for w1, res in word_prob_dict.items()})   

In [72]:
bigram_lm_perplexity = -1

small_value = 1e-10
for sentence in ptb_test['sentence']:
    val = smoothed_bigram_lm_seq_probability(sentence, word_prob_dict, word_frequency_dict, smoothed_first_order_condition_prob, 0.01)
    bigram_lm_perplexity += (val if val != 0 else small_value) ** (-1/len(sentence.split()))

bigram_lm_perplexity /= len(ptb_test['sentence'])

print(f"Bigram language model perplexity is {bigram_lm_perplexity}")

Bigram language model perplexity is 507.2224062367385


In [73]:
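# CORRECTION: average the per-sentence values -log2(p)/n in log space and exponentiate once at the end,
# instead of averaging the per-sentence perplexities directly.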
log_perplex_list = []
for idx in range(len(ptb_test)):
    sentence_prob = smoothed_bigram_test_probabilities[idx]
    if sentence_prob==0:
        continue
    sentence_length = len(ptb_test[idx]["sentence"].split())
    log_perplex_list.append(-np.log2(sentence_prob)/sentence_length)
bigram_lm_perplexity = 2**np.mean(log_perplex_list)

print(f"Bigram language model perplexity is {bigram_lm_perplexity}")

Bigram language model perplexity is 140.25299677702145
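One reason the two cells above can report such different numbers is how the per-sentence values are aggregated. Averaging perplexities directly is an arithmetic mean, whereas averaging $-\log_2 p / n$ and exponentiating afterwards is a geometric mean of the per-sentence perplexities, which is far less sensitive to a few very unlikely sentences. The per-sentence values below are made up for illustration and are not taken from the PTB results.

```python
import math

# Hypothetical per-sentence perplexities; the last sentence is a hard outlier.
toy_sentence_ppl = [50.0, 100.0, 4000.0]

# Arithmetic mean: what you get by summing perplexities and dividing by the count.
arithmetic_mean = sum(toy_sentence_ppl) / len(toy_sentence_ppl)

# Geometric mean: average log2-perplexity first, exponentiate once at the end.
mean_log2 = sum(math.log2(p) for p in toy_sentence_ppl) / len(toy_sentence_ppl)
geometric_mean = 2 ** mean_log2

print(arithmetic_mean, geometric_mean)
```

The arithmetic mean is never smaller than the geometric mean, and the gap widens as the per-sentence perplexities become more spread out.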


Repeat the same steps for the `cleaned_train_dataset` and `cleaned_test_dataset` datasets, where rare tokens (with frequency less than 10) are replaced with the `<unk>` token. Is the perplexity better or worse than the previously computed one?

In [74]:
word_prob_dict, smoothed_first_order_condition_prob = get_smoothed_first_order_conditional_probabilities(cleaned_train_dataset, 0.01)
cleaned_word_freq = get_word_probability_dict(cleaned_train_dataset)

cleaned_bigram_lm_perplexity = -1

small_value = 1e-10
for sentence in cleaned_test_dataset['sentence']:
    val = smoothed_bigram_lm_seq_probability(sentence, word_prob_dict, cleaned_word_freq, smoothed_first_order_condition_prob, 0.01)
    cleaned_bigram_lm_perplexity += (val if val != 0 else small_value) ** (-1/len(sentence.split()))

cleaned_bigram_lm_perplexity /= len(cleaned_test_dataset['sentence'])

print(f"(cleaned) Bigram language model perplexity is {cleaned_bigram_lm_perplexity}")

(cleaned) Bigram language model perplexity is 134.7564219510552


### 1.3 Tri-gram Language Model <a name='trigram_lm'></a>
In the Tri-gram language model, we pick/generate the next token conditioned only on the two previous tokens. Therefore, for an arbitrary sequence $x_1x_2~...x_n$, the probability becomes:
$$p(x_1x_2~...x_n) = p(x_1) p(x_2|x_1) ~\prod_{i=3}^n p(x_i|x_{i-2}x_{i-1})$$
Let's use the same dataset (`Penn Treebank`) to evaluate this model's perplexity. (We use the dataset that already has the `<stop>` token at the end of each sentence.)


We estimate $p(x_i|x_{i-2}x_{i-1})$ using Laplace smoothing with $\alpha=3 \cdot 10^{-3}$. First let's write a function that computes these conditional probabilities for the Tri-gram language model.
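By analogy with the bi-gram case, the smoothed tri-gram estimate divides the (smoothed) trigram count by the (smoothed) count of its two-token context. The sketch below assumes this analogous formula and uses a hypothetical toy corpus; `toy_laplace_trigram` and the other names are illustrative only.

```python
from collections import Counter

# Hypothetical toy corpus; alpha matches the value suggested in the text.
toy_corpus = ["the cat sat down <stop>", "the cat ran away <stop>", "the dog sat down <stop>"]
toy_alpha = 3e-3

toy_vocab = set()
toy_bigrams = Counter()
toy_trigrams = Counter()
for sent in toy_corpus:
    toks = sent.split()
    toy_vocab.update(toks)
    toy_bigrams.update(zip(toks, toks[1:]))
    toy_trigrams.update(zip(toks, toks[1:], toks[2:]))

def toy_laplace_trigram(w1: str, w2: str, w3: str) -> float:
    """(count(w1, w2, w3) + alpha) / (count(w1, w2) + alpha * |V|)."""
    numerator = toy_trigrams[(w1, w2, w3)] + toy_alpha
    denominator = toy_bigrams[(w1, w2)] + toy_alpha * len(toy_vocab)
    return numerator / denominator

p_seen = toy_laplace_trigram("the", "cat", "sat")
p_unseen_context = toy_laplace_trigram("dog", "ran", "away")
context_sum = sum(toy_laplace_trigram("the", "cat", w) for w in toy_vocab)
print(p_seen, p_unseen_context, context_sum)
```

When the two-token context itself was never observed, both counts are zero and the estimate reduces to the uniform value $1/|V|$. With longer contexts this happens more often, which is the sparsity problem that grows with the order of the N-gram model.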

In [75]:
def get_smoothed_second_order_conditional_probabilities(train_dataset: datasets.arrow_dataset.Dataset,
                                                        smoothing_alpha: float):
    '''
    In this function the conditional probabilities have to be computed based on train_dataset. The output
    of the function includes a dictionary with keys like (x_{i-2}, x_{i-1}, x_i) as a tuple and the
    value being p(x_i | x_{i-2} x_{i-1}).
    args:
        train_dataset: a Dataset object that can be iterated to get all the sentences
        smoothing_alpha: The alpha parameter used in the Laplace smoothing.
    output:
        word_prob_dict: a Counter containing the raw count of every word
        first_order_condition_prob: a dictionary containing the smoothed first order
                                    conditional probabilities.
        smoothed_second_order_condition_prob: a dictionary containing the smoothed second order
                                              conditional probabilities of the trigrams seen in train_dataset.
    '''
    # let's first get the word counts and the smoothed 1st order conditional probabilities
    (word_prob_dict, first_order_condition_prob) = get_smoothed_first_order_conditional_probabilities(
        train_dataset, smoothing_alpha)
    
    word_freq_dict = get_word_probability_dict(train_dataset)
    vocab_size = len(word_freq_dict)

    all_bigrams = []
    for sample in train_dataset:
        token_list = sample["sentence"].split()
        sample_bigrams = [(s1, s2) for s1, s2 in zip(token_list, token_list[1:])]
        all_bigrams += sample_bigrams
    bigram_frequency_dict = Counter(all_bigrams)
    
    all_trigrams = []
    for sample in train_dataset:
        token_list = sample["sentence"].split()
        sample_trigrams = [(s1, s2, s3) for s1,s2,s3 in zip(token_list, token_list[1:], token_list[2:])]
        all_trigrams += sample_trigrams
    trigram_frequency_dict = Counter(all_trigrams)
    
    smoothed_second_order_condition_prob = defaultdict(
        float, {(w1,w2,w3): (trigram_freq+smoothing_alpha)/(bigram_frequency_dict[(w1,w2)] +
                                                           smoothing_alpha*vocab_size)
                for (w1,w2,w3), trigram_freq in trigram_frequency_dict.items()})
    
    return word_prob_dict, first_order_condition_prob, smoothed_second_order_condition_prob
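
To see the smoothing formula in isolation, here is a minimal sketch on a toy corpus. It applies the same expression as the function above, $(c(w_1 w_2 w_3) + \alpha) / (c(w_1 w_2) + \alpha |V|)$. The corpus, the value of `alpha`, and the helper `smoothed_trigram_prob` are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus (hypothetical data, for illustration only).
corpus = ["the cat sat", "the cat ran", "the dog sat"]
alpha = 0.5  # illustrative smoothing strength, not the notebook's value

tokens = [sentence.split() for sentence in corpus]
vocab = {w for sent in tokens for w in sent}
bigram_counts = Counter(bg for sent in tokens for bg in zip(sent, sent[1:]))
trigram_counts = Counter(tg for sent in tokens for tg in zip(sent, sent[1:], sent[2:]))


def smoothed_trigram_prob(w1, w2, w3):
    """Laplace-smoothed P(w3 | w1, w2); Counter returns 0 for unseen n-grams."""
    return (trigram_counts[(w1, w2, w3)] + alpha) / (
        bigram_counts[(w1, w2)] + alpha * len(vocab))


seen = smoothed_trigram_prob("the", "cat", "sat")    # trigram seen once
unseen = smoothed_trigram_prob("the", "cat", "dog")  # trigram never seen
total = sum(smoothed_trigram_prob("the", "cat", w) for w in vocab)
```

In this toy setting the seen trigram gets probability 1/3, the unseen one gets 1/9, and the smoothed probabilities over the five-word vocabulary sum to 1.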


#### Now let's analyze the Tri-gram language model for different sequences. We first create a function that can output the probability for a given string.

In [76]:
def smoothed_trigram_lm_seq_probability(input_sentence: str,
                                        word_prob_dict: dict,
                                        word_frequency_dict: dict,
                                        bigram_frequency_dict: dict,
                                        first_order_condition_prob: dict,
                                        second_order_condition_prob: dict,
                                        smoothing_alpha: float):
    '''
    args:
        input_sentence: The input sequence string. Here we assume
        word_prob_dict: a dictionary containing the word probabilities
        word_frequency_dict: a dictionary containing the frequency for every word in vocabulary
        bigram_frequency_dict: a dictionary containing the frequency for every bigram in vocabulary
        first_order_condition_prob: a dictionary containing the first order conditional probabilities
                                    as discussed earlier.
        second_order_condition_prob: a dictionary containing the second order conditional probabilities
                                     as discussed in the previous function.
        smoothing_alpha: The alpha parameter used in the Laplace smoothing.
    output:
        probability: The probability of the input_sentence according to the Tri-gram language model
    '''
    words = input_sentence.split()

    # for i in range(len(words)):
    #     if i == 0:
    #         val = word_prob_dict[words[i]]
    #         probability = val
    #     elif i == 1:
    #         val = first_order_condition_prob[(words[i-1], words[i])]
    #         probability *= val
    #     else:
    #         val = second_order_condition_prob[(words[i-2], words[i-1], words[i])]
    #         probability *= val

    vocab_size = len(word_prob_dict)
    token_list = input_sentence.split()
    bigram_list = [(s1, s2) for s1, s2 in zip(token_list, token_list[1:])]
    trigram_list = [(s1, s2, s3) for s1, s2, s3 in zip(token_list, token_list[1:], token_list[2:])]
    probability = np.prod(
        [word_prob_dict[token_list[0]]] + [first_order_condition_prob.get(
            bigram_list[0], smoothing_alpha / (word_frequency_dict[bigram_list[0][0]] +
                                               smoothing_alpha * vocab_size))] + [
            second_order_condition_prob.get(trigram,
                                            smoothing_alpha / (bigram_frequency_dict[trigram[:2]] +
                                                               smoothing_alpha * vocab_size))
            for trigram in trigram_list])

    return probability

#### Now let's compute the probability for sequences in the test dataset, assuming $\alpha=3\cdot10^{-3}$ has been used in the Laplace smoothing.

In [77]:
smoothed_trigram_test_probabilities = []

word_frequency_dict = Counter(" ".join([i["sentence"] for i in ptb_train]).split())

all_bigrams = []
for sample in ptb_train:
    token_list = sample["sentence"].split()
    sample_bigrams = [(s1, s2) for s1, s2 in zip(token_list, token_list[1:])]
    all_bigrams += sample_bigrams
bigram_frequency_dict = Counter(all_bigrams)

word_prob_dict, smoothed_first_order_condition_prob, smoothed_second_order_condition_prob = get_smoothed_second_order_conditional_probabilities(ptb_train, 3e-3)

smoothed_trigram_test_probabilities = [
    smoothed_trigram_lm_seq_probability(
        input_sentence=sample["sentence"],
        word_prob_dict=word_prob_dict,
        word_frequency_dict=word_frequency_dict,
        bigram_frequency_dict=bigram_frequency_dict,
        first_order_condition_prob=smoothed_first_order_condition_prob,
        second_order_condition_prob=smoothed_second_order_condition_prob,
        smoothing_alpha=3e-3)
    for sample in ptb_test]

print(f"{smoothed_trigram_test_probabilities.count(0)/len(ptb_test)*100}% of samples in the test set have zero probability!")

0.05993048064245475% of samples in the test set have zero probability!
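
Since Laplace smoothing keeps every smoothed conditional probability above zero, a plausible source of these zero-probability sentences is floating-point underflow: the product of many small factors can fall below the smallest positive float. Here is a minimal sketch, with made-up per-token probabilities, of how summing log-probabilities sidesteps this.

```python
import math

# Hypothetical per-token probabilities for one long sentence (illustrative values).
token_probs = [1e-4] * 100

# Direct product: 1e-400 is below the smallest positive float64, so it underflows to 0.0.
direct_product = math.prod(token_probs)

# Summing log-probabilities keeps the same information in a representable range.
log2_prob = sum(math.log2(p) for p in token_probs)

# Per-token entropy (bits) and perplexity, computed without ever forming the product.
entropy_bits = -log2_prob / len(token_probs)
perplexity = 2 ** entropy_bits
```

Working in log space would let such sentences contribute to the perplexity instead of being skipped.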


Now we compute the perplexity on the `ptb_test` dataset for the tri-gram language model.
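
Concretely, the cell below averages the per-token entropy (in bits) over sentences and then exponentiates. Writing $M$ for the number of test sentences with non-zero probability and $n_s$ for the length of sentence $s$ (notation introduced here for illustration only), this corresponds to:

$$H_s = -\frac{1}{n_s}\log_2 p\big(x^{(s)}_1 x^{(s)}_2 \dots x^{(s)}_{n_s}\big), \qquad \mathrm{PPL} = 2^{\frac{1}{M}\sum_{s=1}^{M} H_s}$$

Note that this is a per-sentence average, so short and long sentences are weighted equally.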

In [78]:
log_perplex_list = []
for idx in range(len(ptb_test)):
    sentence_prob = smoothed_trigram_test_probabilities[idx]
    if sentence_prob==0:
        continue
    sentence_length = len(ptb_test[idx]["sentence"].split())
    log_perplex_list.append(-np.log2(sentence_prob)/sentence_length)
Trigram_lm_perplexity = 2**np.mean(log_perplex_list)

print(f"Trigram language model perplexity is {Trigram_lm_perplexity}")

Trigram language model perplexity is 7023.111392745021


Repeat the same steps for the `cleaned_train_dataset` and `cleaned_test_dataset` datasets, where rare tokens (with frequency less than 10) are replaced with the `<unk>` token. Do we get a better or a worse perplexity compared to the previously computed one?

In [79]:
# YOUR CODE HERE
cleaned_word_frequency_dict = Counter(" ".join([i["sentence"] for i in cleaned_train_dataset]).split())

all_bigrams = []
for sample in cleaned_train_dataset:
    token_list = sample["sentence"].split()
    sample_bigrams = [(s1, s2) for s1, s2 in zip(token_list, token_list[1:])]
    all_bigrams += sample_bigrams
cleaned_bigram_frequency_dict = Counter(all_bigrams)
    
(cleaned_word_prob_dict,
 cleaned_first_order_condition_prob,
 cleaned_second_order_condition_prob
) = get_smoothed_second_order_conditional_probabilities(cleaned_train_dataset, 3e-3)
    
    
cleaned_smoothed_trigram_test_probabilities = [
    smoothed_trigram_lm_seq_probability(
        input_sentence=sample["sentence"],
        word_prob_dict=cleaned_word_prob_dict,
        word_frequency_dict=cleaned_word_frequency_dict,
        bigram_frequency_dict=cleaned_bigram_frequency_dict,
        first_order_condition_prob=cleaned_first_order_condition_prob,
        second_order_condition_prob=cleaned_second_order_condition_prob,
        smoothing_alpha=3e-3)
    for sample in cleaned_test_dataset]


log_perplex_list = []
for idx in range(len(cleaned_test_dataset)):
    sentence_prob = cleaned_smoothed_trigram_test_probabilities[idx]
    if sentence_prob==0:
        continue
    sentence_length = len(cleaned_test_dataset[idx]["sentence"].split())
    log_perplex_list.append(-np.log2(sentence_prob)/sentence_length)
cleaned_Trigram_lm_perplexity = 2**np.mean(log_perplex_list)

print(f"(cleaned) Trigram language model perplexity is {cleaned_Trigram_lm_perplexity}")


(cleaned) Trigram language model perplexity is 6224.483522689105


#### Discussion
 - How does the performance of the three discussed models compare?
 - What is the cost of using N-gram language models for even larger N values?
 - What is the effect of vocabulary size on models' perplexities? Can we compare models with different vocabulary sizes?
 - What is the perplexity of a language model (vocabulary size of |V|) that given any context (i.e., $x_1 x_2 ... x_{n-1}$) assigns uniform probabilities (for all the tokens in the vocabulary) for the next token? 
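
For the last question, a quick numerical check can help build intuition. The sketch below uses an illustrative vocabulary size and sequence length (both assumptions) and follows the same entropy-then-exponentiate recipe used for the N-gram models.

```python
import math

vocab_size = 10_000   # illustrative |V|
sequence_length = 25  # illustrative n

# A uniform model assigns 1/|V| to every next token, whatever the context.
log2_sequence_prob = sequence_length * math.log2(1 / vocab_size)
entropy_bits = -log2_sequence_prob / sequence_length
uniform_perplexity = 2 ** entropy_bits
```

The result equals the vocabulary size, whatever sequence length is chosen.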

## 2. Task B: Neural Language Models <a name='neural_lm'></a>

In this exercise, we will better understand the functioning of some simple neural language models. We first start with a fixed-window neural language model. In the following subsection, we will investigate an RNN-based language model.

### 2.1 Fixed-Window Neural Language Model <a name='fixed_window_lm'></a>
This language model takes as input a constant number of tokens and outputs a probability distribution for the next token. In this section, we assume the underlying model is a Multi-layer Perceptron (MLP) with a single hidden layer. This model doesn't have the sparsity issue of N-gram language models, but is always limited to a fixed window of tokens.

In this section, you don't need to train the model yourself: the training code is included for reference, but we provide a model pretrained on the same training dataset. We evaluate the language model over the `ptb_test` dataset to show the power of neural language models compared to N-gram language models.

More importantly, we use PyTorch modules in this section, so that you get more familiar with its capabilities. Throughout this exercise, we use a `window_size=3` for this model.



Let's first create a dataset of all consecutive tokens of length `window_size` from the `ptb_train` dataset. You can read more about PyTorch datasets and how to create a custom dataset [here](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files).
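
To make the windowing concrete, here is a minimal standalone sketch of how a sentence can be turned into (context, target) pairs with `<start>` padding. The helper `make_windows` is hypothetical and is not the notebook's `prepare_fixed_window_lm_dataset`, which stores each context and its target together as one string.

```python
def make_windows(sentence: str, window_size: int):
    """Return (context, target) pairs; hypothetical helper for illustration."""
    tokens = ["<start>"] * window_size + sentence.split()
    return [(tokens[i:i + window_size], tokens[i + window_size])
            for i in range(len(tokens) - window_size)]


pairs = make_windows("the cat sat down", window_size=3)
for context, target in pairs:
    print(" ".join(context), "->", target)
# <start> <start> <start> -> the
# <start> <start> the -> cat
# <start> the cat -> sat
# the cat sat -> down
```

Every token of the sentence appears exactly once as a target, and the first targets are predicted from contexts made partly or entirely of `<start>` tokens.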

In [80]:
from torch.utils.data import Dataset, DataLoader

window_size = 3
vocabulary_size = 10000
word_emb_dim = 100
hidden_dim = 100


class FixedWindowDataset(Dataset):
    # read more about custom datasets at https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
    def __init__(self,
                 train_dataset: datasets.arrow_dataset.Dataset,
                 test_dataset: datasets.arrow_dataset.Dataset,
                 window_size: int,
                 vocabulary_size: int
                ):
        self.prepared_train_dataset = self.prepare_fixed_window_lm_dataset(train_dataset, window_size + 1)
        self.prepared_test_dataset = self.prepare_fixed_window_lm_dataset(test_dataset, window_size + 1)
        
        dataset_vocab = self.get_dataset_vocabulary(train_dataset)
        # defining a dictionary that simply maps tokens to their respective index in the embedding matrix
        self.word_to_index = {word: idx for idx,word in enumerate(dataset_vocab)}
        self.index_to_word = {idx: word for idx,word in enumerate(dataset_vocab)}
        
        assert vocabulary_size >= len(dataset_vocab) , f"The dataset vocab size is {len(dataset_vocab)}!"

    def __len__(self):
        return len(self.prepared_train_dataset)
    
    def get_encoded_test_samples(self):
        all_token_lists = [sample.split() for sample in self.prepared_test_dataset]
        all_token_ids = [[self.word_to_index.get(word, self.word_to_index["<unk>"])
                          for word in token_list[:-1]]
                         for token_list in all_token_lists
                        ]
        all_next_token_ids = [self.word_to_index.get(token_list[-1], self.word_to_index["<unk>"]) for 
                              token_list in all_token_lists]
        return torch.tensor(all_token_ids), torch.tensor(all_next_token_ids)
        
    def __getitem__(self, idx):
        # here we need to transform the data to the format we expect at the model input
        token_list = self.prepared_train_dataset[idx].split()
        # having a fallback to <unk> token if an unseen word is encoded.
        token_ids = [self.word_to_index.get(word, self.word_to_index["<unk>"]) for word in token_list[:-1]]
        next_token_id = self.word_to_index.get(token_list[-1], self.word_to_index["<unk>"])
        return torch.tensor(token_ids), torch.tensor(next_token_id)
    
    def decode_idx_to_word(self, token_id):
        return [self.index_to_word[id_.item()] for id_ in token_id]
    
    def get_dataset_vocabulary(self, train_dataset: datasets.arrow_dataset.Dataset):
        vocab = sorted(set(" ".join([sample["sentence"] for sample in train_dataset]).split()))
        # we also add a <start> token to include initial tokens in the sentences in the dataset
        vocab += ["<start>"]
        return vocab
    
    @staticmethod
    def prepare_fixed_window_lm_dataset(target_dataset: datasets.arrow_dataset.Dataset,
                                        window_size: int):
        '''
        Please note that for the very first tokens, they will be added like "<start> <start> Token#1".
        args:
            target_dataset: the target dataset where its consecutive tokens of length 'window_size' should be extracted
            window_size: the window size for the language model
        output:
            prepared_dataset: a list of strings each containing 'window_size' tokens.
        '''
        
        prepared_dataset = []
        
        for sample in target_dataset:
            token_list = sample["sentence"].split()
            # we add a <start> token to include initial tokens in the sentences in the dataset
            if len(token_list) < window_size:
                continue
            else:
                for idx in range(1, window_size):
                    prepared_dataset.append(" ".join(["<start>"]*(window_size - idx) + token_list[:idx]))
                    
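                # NOTE: this range stops one window early, so the window ending at the sentence's
                # last token is skipped; range(len(token_list) - window_size + 1) would include it.
                for i in range(len(token_list) - window_size):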
                    prepared_dataset.append(" ".join(token_list[i:i + window_size]))
        
        return prepared_dataset
        
        

In [81]:
fixed_window_dataset = FixedWindowDataset(ptb_train, ptb_test, window_size, vocabulary_size)

# let's create a simple dataloader for this dataset
train_dataloader =  DataLoader(fixed_window_dataset, batch_size=8, shuffle=True)

Now, let's define the underlying PyTorch model for the language model. You can read more about PyTorch models [here](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

**Note**: In the forward pass, we compute the negative log-likelihood after passing through the MLP layers. We use `torch.nn.LogSoftmax` because it is numerically more stable than applying `softmax` and then taking its logarithm as two separate steps.
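
To illustrate the stability point, here is a minimal pure-Python sketch comparing a literal "softmax, then log" computation with the log-sum-exp formulation, a common way of computing the same quantity stably. The function names and scores are assumptions for illustration; this is not a description of PyTorch's internals.

```python
import math


def naive_log_softmax(scores):
    """log(softmax(x)) computed literally: exponentiate, normalize, then take the log."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(scores):
    """Same quantity via the log-sum-exp trick: shift by the maximum before exponentiating."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum_exp for s in scores]


scores = [1000.0, 0.0]  # illustrative extreme scores

try:
    naive = naive_log_softmax(scores)
except OverflowError:
    naive = None  # math.exp(1000.0) exceeds the float64 range

stable = stable_log_softmax(scores)  # [0.0, -1000.0]
```

The naive version fails on large scores, while the shifted version returns finite log-probabilities.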

In [82]:
import torch.optim as optim

class Fixed_window_language_model(torch.nn.Module):
    def __init__(self, emb_dim, hidden_dim, window_size, vocab_size=10000):
        super().__init__()

        self.window_size = window_size
        self.emb_dim = emb_dim
        self.word_embeddings = torch.nn.Embedding(vocab_size, emb_dim) # word embeddings
        self.linear1 = torch.nn.Linear(window_size * emb_dim, hidden_dim) # first linear layer
        self.activation_func = torch.tanh # the activation function
        self.linear2 = torch.nn.Linear(hidden_dim, vocab_size) # second linear layer
        
        self.log_softmax = torch.nn.LogSoftmax(dim=1)
        self.criterion = torch.nn.NLLLoss()
     
    def forward(self, input_ids, labels):
        inputs_embeds = self.word_embeddings(input_ids)
        concat_input_embed = inputs_embeds.reshape(-1, self.emb_dim * self.window_size)
        hidden_state = self.activation_func( self.linear1(concat_input_embed) )
        log_probs = self.log_softmax( self.linear2(hidden_state) )  # log-probabilities, not raw logits
        loss = self.criterion(log_probs, labels)
        
        return loss
    

Now let's see how easy it is to train a model with PyTorch! (We provide a trained model in the cell after the training cell, so that you can start using the model without going through the time-consuming training.)

In [83]:
# defining the model
model_fixed_window = Fixed_window_language_model(emb_dim=word_emb_dim, hidden_dim=hidden_dim,
                                                 window_size=window_size, vocab_size=vocabulary_size)

# defining the optimizer
optimizer = optim.SGD(model_fixed_window.parameters(),
                      lr=0.005,
                      momentum=0.9)

In [87]:
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(train_dataloader):
        # get the inputs; data is a tuple of (context, target)
        context, target = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        loss = model_fixed_window(context, target)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 5000 == 4999:    # print every 5000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 5000:.3f}')
            running_loss = 0.0

print('Finished Training')

# saving the trained model
torch.save(model_fixed_window.state_dict(), "fixed_window_model.pt")

[1,  5000] loss: 6.375
[1, 10000] loss: 6.217
[1, 15000] loss: 6.160
[1, 20000] loss: 6.080
[1, 25000] loss: 6.065
[1, 30000] loss: 6.017
[1, 35000] loss: 5.981
[1, 40000] loss: 5.951
[1, 45000] loss: 5.918
[1, 50000] loss: 5.927
[1, 55000] loss: 5.890
[1, 60000] loss: 5.893
[1, 65000] loss: 5.853
[1, 70000] loss: 5.823
[1, 75000] loss: 5.823
[1, 80000] loss: 5.787
[1, 85000] loss: 5.814
[2,  5000] loss: 5.634
[2, 10000] loss: 5.653
[2, 15000] loss: 5.641
[2, 20000] loss: 5.648
[2, 25000] loss: 5.630
[2, 30000] loss: 5.647
[2, 35000] loss: 5.644
[2, 40000] loss: 5.654
[2, 45000] loss: 5.641
[2, 50000] loss: 5.617
[2, 55000] loss: 5.636
[2, 60000] loss: 5.584
[2, 65000] loss: 5.576
[2, 70000] loss: 5.595
[2, 75000] loss: 5.614
[2, 80000] loss: 5.604
[2, 85000] loss: 5.591
Finished Training


We provide a trained model, so that you can start using it right away.

In [88]:
fixed_window_checkpoint_file = "fixed_window_model.pt"
model_fixed_window.load_state_dict(torch.load(fixed_window_checkpoint_file))

<All keys matched successfully>

In [89]:
# context and 'target' ids (target is the next word after the context)
test_token_ids, test_target_ids = fixed_window_dataset.get_encoded_test_samples()

We now have the `test_token_ids`, `test_target_ids` tensors for the test dataset. The `test_token_ids` are the context ids and `test_target_ids` are the respective **next token** (a.k.a. target here) for these contexts.
#### Using the trained model, implement a function that can output the loss for the discussed test dataset. How can we generally decide if the model is overfitted to the train dataset or not?

In [95]:
def generate_test_dataset_loss(model: torch.nn.Module,
                               test_token_ids: torch.Tensor,
                               test_target_ids: torch.Tensor):
    '''
    args:
        model: fixed-window language model
        test_token_ids: the context ids in a single tensor.
        test_target_ids: the target ids (next token after the context) in a single tensor.
    output:
        avg_test_loss: The average loss of model over test dataset.
    '''
    batch_size = 4
    test_loss = []
    
    
    with torch.no_grad():
        for idx in range(0, test_token_ids.shape[0], batch_size):
            context = test_token_ids[idx:idx + batch_size]
            target = test_target_ids[idx:idx + batch_size]
            loss = model(context, target)
            test_loss.append(loss.item())
        avg_test_loss = np.mean(np.array(test_loss))
    
    return avg_test_loss


test_dataset_loss = generate_test_dataset_loss(model_fixed_window, test_token_ids, test_target_ids)
print(f"Test dataset loss is {test_dataset_loss}")

Test dataset loss is 5.651790734509623


#### Using the trained fixed-window model, implement a function that can output the entropy for a given sequence.

In [96]:
def get_seqeuence_entropy_fixed_window_lm(model: torch.nn.Module,
                                              input_sequence: str,
                                              window_size: int,
                                              word_to_idx: dict):
    '''
    Note that e.g., in order to get the first token probability, you need to pass a sequence
    like "<start> <start> <start>" (prefix padding) to the neural model. In a similar fashion, we need to pass
    "<start> <start> TOKEN#1" for getting the probability of the second token.
    args:
        model: fixed-window language model
        input_sequence: the sequence for which we want to calculate the probability
        window_size: the size of window for the language model
        word_to_idx: a mapping from words to the embedding indices (to encode tokens before being
                     passed to model). You can get this dict from 'fixed_window_dataset.word_to_index'
    output:
        sequence_entropy: the entropy for the input sequence using the trained model
    '''
    
    # YOUR CODE HERE
    modified_sentence = "<start> " * (window_size) + input_sequence
    
    token_list = modified_sentence.split()
    encoded_context = []
    encoded_target = []
    
    for idx in range(len(token_list)-window_size):
        encoded_context.append([word_to_idx.get(token, word_to_idx["<unk>"])
                                for token in token_list[idx: idx+window_size]])
        encoded_target.append(word_to_idx.get(token_list[idx+window_size], word_to_idx["<unk>"]))
    encoded_context = torch.tensor(encoded_context)
    encoded_target = torch.tensor(encoded_target)
    
    # passing the context (and respective labels) to the fixed-window LM to get average NLL.
    with torch.no_grad():
        sequence_entropy = model(encoded_context, encoded_target).item()

    return sequence_entropy

#### Compute the perplexity for the trained fixed-window language model over `ptb_test` dataset using the previous function. How does it perform compared to N-gram language models we discussed earlier?

In [97]:
res = [get_seqeuence_entropy_fixed_window_lm(model_fixed_window, sentence, window_size, fixed_window_dataset.word_to_index) for sentence in ptb_test['sentence']]

perplexity = 2**np.mean(res)

print(f"The fixed-window model perplexity over test dataset is {perplexity}")

The fixed-window model perplexity over test dataset is 68.30843247316461
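
One detail to keep in mind when comparing this number with the N-gram results: `torch.nn.LogSoftmax` and `torch.nn.NLLLoss` work with the natural logarithm, so the entropy returned by the model is measured in nats, whereas the N-gram cells used `np.log2` (bits) together with `2**`. Perplexities are only comparable when the base of the exponent matches the base of the logarithm. Here is a minimal sketch with an illustrative loss value (an assumption, not a measured result).

```python
import math

mean_nll_nats = 5.0  # illustrative average loss in nats, not a measured value

perplexity_from_nats = math.exp(mean_nll_nats)  # e ** (entropy in nats)
mean_nll_bits = mean_nll_nats / math.log(2)     # convert nats to bits
perplexity_from_bits = 2 ** mean_nll_bits       # 2 ** (entropy in bits)

# Mixing units: base 2 applied directly to a value measured in nats.
mixed_units = 2 ** mean_nll_nats
```

With a loss in nats, `math.exp(loss)` and `2 ** (loss / ln 2)` agree, while `2 ** loss` gives a smaller number.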


### 2.2 RNN-based Language Model <a name='rnn_lm'></a>
To handle inputs of any length (as opposed to the fixed-window model, which can only process a fixed number of tokens), we use a Recurrent Neural Network (RNN). The core idea is to apply the same weights W repeatedly, once per time step.

An advantage of the RNN model over the fixed-window language model is that we can pass a whole sentence at once, instead of splitting it into many windows of size `window_size`. Moreover, the language model can look back further than a fixed number of tokens.
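
As a minimal sketch of the "same weights at every step" idea, the scalar recurrence below processes sequences of different lengths with a single set of parameters. The function `rnn_final_state` and its weights are illustrative assumptions. The provided model is a 4-layer `torch.nn.RNN` with 200-dimensional states, not this toy.

```python
import math


def rnn_final_state(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Scalar Elman-style recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias).

    The weights are illustrative constants, not values from `rnn_model.pt`.
    """
    hidden = 0.0  # initial hidden state h_0
    for x in inputs:
        # The same w_x, w_h and bias are reused at every time step.
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
    return hidden


short_state = rnn_final_state([1.0, -1.0])
long_state = rnn_final_state([1.0, -1.0, 0.5, 2.0, -0.3, 1.2])
```

Both calls use the same three parameters; only the number of loop iterations changes with the input length.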

As we already went through a training exercise for the previous neural model, we only provide a trained LM in this section, so that you can focus on the analysis.
 
You can find the dataset structure as well as the RNN architecture in the `rnn_utils.py` file.

In [98]:
from rnn_utils import RNNDataset, RNN_language_model

vocabulary_size = 10000
word_emb_dim = 200
hidden_dim = 200

rnn_dataset = RNNDataset(ptb_train, ptb_test, vocabulary_size)

# if a GPU is available, we put the model on it
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Here we need a <pad> token for the RNN model, in order to batch sequences of different lengths
pad_idx = rnn_dataset.pad_idx # the index for <pad> token
rnn_model = RNN_language_model(vocab_size=vocabulary_size, emb_dim=word_emb_dim, hidden_dim=hidden_dim,
                               pad_idx=pad_idx)
rnn_model.to(device)

RNN_language_model(
  (criterion): CrossEntropyLoss()
  (embedding): Embedding(10000, 200)
  (rnn): RNN(200, 200, num_layers=4)
  (dropout): Dropout(p=0.001, inplace=False)
  (lm_decoder): Linear(in_features=200, out_features=10000, bias=True)
)

Load the model weights using the state_dict in the `rnn_model.pt` file.

In [101]:
rnn_model.load_state_dict(torch.load("rnn_model.pt", map_location=torch.device('cpu')))

<All keys matched successfully>

As training an RNN model is time-consuming, we provide a language model already trained on this dataset (`rnn_model.pt`), so that you can focus on analyzing its performance here.
As mentioned above, since an RNN accepts sequences of varying lengths, the input sequences must be padded with a special token such as `<pad>` so that we can build a batch of sentences. The output of the defined RNN model (see the architecture details in `rnn_utils.py`) is the model's entropy over the input data.
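
To illustrate the padding step, here is a minimal sketch in plain Python. The helper `pad_batch`, the id `PAD_ID = 0`, and right-padding are assumptions for illustration. In the notebook the pad index comes from `rnn_dataset.pad_idx` and the padding itself is done inside `RNNDataset`.

```python
PAD_ID = 0  # hypothetical id for the <pad> token


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([[5, 8, 2], [7, 3], [4, 9, 6, 1]])
# batch is now rectangular: 3 rows, each of length 4

# Number of real (non-pad) tokens per row, useful for trimming a batch to its longest sentence.
lengths = [sum(token != PAD_ID for token in row) for row in batch]
```

Counting non-pad tokens per row is the same idea the evaluation cell uses to cut each batch down to its longest sentence.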

#### First get the encoded test samples of `ptb_test` dataset, and then pass these (already padded) sentences to the RNN model to get the respective entropy values. Compute the perplexity of the model and compare it with previous approaches.
**HINT**: You can use the `get_encoded_test_samples` function of `rnn_dataset` to get encoded test samples.


In [103]:
test_token_ids = rnn_dataset.get_encoded_test_samples()

test_loss = []

eval_batch_size = 8
with torch.no_grad():
    for data_idx in range(0, test_token_ids.shape[0], eval_batch_size):
        context_batch = test_token_ids[data_idx: data_idx + eval_batch_size].to(device)
        
        batch_max_seq_length = (context_batch!=rnn_dataset.pad_idx).sum(dim=1).max().item()
        context_batch = context_batch[:, :batch_max_seq_length]
        
        batch_loss = rnn_model(context_batch)
        test_loss.append(batch_loss.item())
avg_test_loss = np.array(test_loss).mean()
test_perplexity = 2**avg_test_loss


print(f"The model perplexity is {test_perplexity}")

The model perplexity is 606.4564839869902


In [None]:
# Open questions about the RNN model:
# - How are the embeddings created?
# - How can the RNN structure be represented when num_layers = 4?
# - In the forward pass, what is the label we want to predict?
# - What do the following two lines of the forward pass do?
#
# outputs = self.lm_decoder(outputs.permute(1, 0, 2))[:, :-1, :].permute(0, 2, 1)
# target_tokens = context.t()[:, 1:]
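
One reading of the two quoted lines is that they implement the usual next-token shift. The slice `[:, :-1, :]` drops the prediction made at the last position, which has no next token to be scored against, and `[:, 1:]` drops the first token from the targets, which has no preceding prediction. The final `.permute(0, 2, 1)` moves the vocabulary dimension to axis 1, the layout `torch.nn.CrossEntropyLoss` expects for sequence inputs: (batch, classes, sequence). The exact tensor shapes inside `rnn_utils.py` are an assumption here. A minimal sketch of the shift itself, on a hypothetical encoded sentence:

```python
token_ids = [11, 42, 7, 99]  # hypothetical encoded sentence

# The prediction made at position t is scored against the token at position t + 1.
prediction_positions = token_ids[:-1]  # positions that have a next token
target_tokens = token_ids[1:]          # the next token for each of those positions

pairs = list(zip(prediction_positions, target_tokens))
# pairs == [(11, 42), (42, 7), (7, 99)]
```

So the label at each position is simply the following token of the same sentence.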