# Homework 2: Language Modeling
11-411/611 Natural Language Processing (Fall 2024)

- RELEASED: October 1 2024
- DUE: October 24 2024 11:59 pm EDT

Whether for transcribing spoken utterances as correct word sequences or generating coherent human-like text, language models are extremely useful.

In this assignment, you will be building your own language models powered by n-grams and RNNs.

### Submission Guidelines
**Programming:** 
- This notebook contains helpful test cases and additional information about the programming part of the HW. However, you are only required to submit `ngram_lm.py` and `rnn_lm.py` on Gradescope.
- We recommended that you first code in the notebook and then copy the corresponding methods/classes to `ngram_lm.py` and `rnn_lm.py`.

**Written:**
- Analysis questions would require you to run your code.
- You need to write your answers in a document and upload it alongside the programming components

### Upload (if using Colab) main.py and utils.py, and the data.zip file

In [1]:
#!unzip data.zip

## Part 1: Language Models [60 points]

### Step 0: Preprocessing

In [2]:
#!pip install transformers
#!pip install requests
#!pip install torch
#!pip install tqdm

import math
import torch
import numpy as np
import torch.nn as nn
from collections import Counter
from torch.utils.data import DataLoader, Dataset

We provide you with a few functions in `utils.py` to read and preprocess your input data. Do not edit this file!

In [3]:
from utils import *

We have performed a round of preprocessing on the datasets.

- Each file contains one sentence per line.
- All punctuation marks have been removed.
- Each line is a sequences of tokens separated by whitespace.

#### Special Symbols ( Already defined in `utils.py` )
The start and end tokens will act as padding to the given sentences, to make sure they are correctly defined, print them here:

In [4]:
print("Sentence START symbol: {}".format(START))
print("Sentence END symbol: {}".format(EOS))
print("Unknown word symbol: {}".format(UNK))

Sentence START symbol: <s>
Sentence END symbol: </s>
Unknown word symbol: <UNK>


#### Reading and processing an example file

In [5]:
# Read the sample file
sample = read_file("data/sample.txt")
print(sample)

['We are never ever ever ever ever getting back together\n', 'We are the ones together we are back']


In [6]:
# Preprocess the content to add corresponding number of start and end tokens. Try out the method with n = 3 and n = 4 as well.
# Preprocessing example for bigrams (n=2)
sample = preprocess(sample, n=3)
for s in sample:
    print(s)

['<s>', '<s>', 'we', 'are', 'never', 'ever', 'ever', 'ever', 'ever', 'getting', 'back', 'together', '</s>']
['<s>', '<s>', 'we', 'are', 'the', 'ones', 'together', 'we', 'are', 'back', '</s>']


In [7]:
# Flattens a nested list into a 1D list.
flattened = flatten(sample)
print(flattened)

['<s>', '<s>', 'we', 'are', 'never', 'ever', 'ever', 'ever', 'ever', 'getting', 'back', 'together', '</s>', '<s>', '<s>', 'we', 'are', 'the', 'ones', 'together', 'we', 'are', 'back', '</s>']


### Step 1: N-Gram Language Model

#### TODO: Defining `get_ngrams()`

In [8]:
#######################################
# TODO: get_ngrams()
#######################################
def get_ngrams(list_of_words, n):
    """
    Returns a list of n-grams for a list of words.
    Args
    ----
    list_of_words: List[str]
        List of already preprocessed and flattened (1D) list of tokens e.g. ["<s>", "hello", "</s>", "<s>", "bye", "</s>"]
    n: int
        n-gram order e.g. 1, 2, 3
    
    Returns:
        n_grams_list: List[Tuple]
            Returns a list containing n-gram tuples
    """
    n_grams_list = []
    for words in range(len(list_of_words)):
        n_grams_list.append(tuple(list_of_words[words:words+n]))
    if n==1:
        return n_grams_list
    return n_grams_list[:-(n-1)]

In [9]:
#######################################
# TEST: get_ngrams()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=3)
flattened = flatten(sample)
#print(get_ngrams(flattened, 3))

# get_ngrams() return n_grams tuples
assert get_ngrams(flattened, 3) == [('<s>', '<s>', 'we'),
        ('<s>', 'we', 'are'),
        ('we', 'are', 'never'),
        ('are', 'never', 'ever'),
        ('never', 'ever', 'ever'),
        ('ever', 'ever', 'ever'),
        ('ever', 'ever', 'ever'),
        ('ever', 'ever', 'getting'),
        ('ever', 'getting', 'back'),
        ('getting', 'back', 'together'),
        ('back', 'together', '</s>'),
        ('together', '</s>', '<s>'),
        ('</s>', '<s>', '<s>'),
        ('<s>', '<s>', 'we'),
        ('<s>', 'we', 'are'),
        ('we', 'are', 'the'),
        ('are', 'the', 'ones'),
        ('the', 'ones', 'together'),
        ('ones', 'together', 'we'),
        ('together', 'we', 'are'),
        ('we', 'are', 'back'),
        ('are', 'back', '</s>')]

#### **TODO:** Class `NGramLanguageModel()`

*Now*, we will define our LanguageModel class.

**Some Useful Variables:**
- self.model: `dict` of n-grams and their corresponding probabilities, keys being the tuple containing the n-gram, and the value being the probability of the n-gram.
- self.vocab: `dict` of unigram vocabulary with counts, keys being the words themselves and the values being their frequency.
- self.n: `int` value for n-gram order (e.g. 1, 2, 3).
- self.train_data: `List[List]` containing preprocessed **unflattened** train sentences. You will have to flatten it to use in the language model
- self.smoothing: `float` flag signifying the smoothing parameter.

In `lm.py`, we will be taking most of these argumemts from the command line using this command:

`python3 lm.py --train data/sample.txt --test data/sample.txt --n 3 --smoothing 0 --min_freq 1`

Note that we will not be using log probabilities in this section. Store the probabilities as they are, not in log space.

**Laplace Smoothing**

There are two ways to perform this:
- Either you calculate all possible n-grams at train time and calculate smooth probabilities for all of them, hence inflating the model (eager emoothing). You then use the probabilities as when required at test time. **OR**
- You calculate the probabilities for the **observed n-grams** at train time, using the smoothed likelihood formula, then if any unseen n-gram is observed at test time, you calculate the probability using the smoothed likelihood formula and store it in the model for future use (lazy smoothing).

You will be implementing lazy smoothing

**Perplexity**

Steps:
1. Flatten the test data.
2. Extract ngrams from the flattened data.
3. Calculate perplexity according to given formula. For unseen n-grams, calculate using smoothed likelihood and store the unseen n-gram probability in the labguage model `model` attribute:

$ppl(W_{test}) = ppl(W_1W_2 ... W_n)^{-1/n} $

Tips:
- Remember that product changes to summation under `log`. Take the log of probabilities, sum them up, and then exponentiate it to get back to the original scale.
- Make sure to `flatten()` your data before creating the n_grams using `get_ngrams()`.
- The test suite provided is **not exhaustive**.


In [20]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#   
# Copyright (C) 2024
# 
# @author: Ezra Fu <erzhengf@andrew.cmu.edu>
# based on work by 
# Ishita <igoyal@andrew.cmu.edu> 
# Suyash <schavan@andrew.cmu.edu>
# Abhishek <asrivas4@andrew.cmu.edu>

"""
11-411/611 NLP Assignment 2
N-gram Language Model Implementation

Complete the LanguageModel class and other TO-DO methods.
"""

#######################################
# Import Statements
#######################################
from utils import *
from collections import Counter
from itertools import product
import argparse
import random
import math

#######################################
# TODO: get_ngrams()
#######################################
def get_ngrams(list_of_words, n):
    """
    Returns a list of n-grams for a list of words.
    Args
    ----
    list_of_words: List[str]
        List of already preprocessed and flattened (1D) list of tokens e.g. ["<s>", "hello", "</s>", "<s>", "bye", "</s>"]
    n: int
        n-gram order e.g. 1, 2, 3
    
    Returns:
        n_grams_list: List[Tuple]
            Returns a list containing n-gram tuples
    """
    n_grams_list = []
    for words in range(len(list_of_words)):
        n_grams_list.append(tuple(list_of_words[words:words+n]))
    if n==1:
        return n_grams_list
    return n_grams_list[:-(n-1)]

#######################################
# TODO: NGramLanguageModel()
#######################################
class NGramLanguageModel():
    def __init__(self, n, train_data, alpha=1):
        """
        Language model class.
        
        Args
        ____
        n: int
            n-gram order
        train_data: List[List]
            already preprocessed unflattened list of sentences. e.g. [["<s>", "hello", "my", "</s>"], ["<s>", "hi", "there", "</s>"]]
        alpha: float
            Smoothing parameter
        
        Other attributes:
            self.tokens: list of individual tokens present in the training corpus
            self.vocab: vocabulary dict with counts
            self.model: n-gram language model, i.e., n-gram dict with probabilties
            self.n_grams_counts: dictionary for storing the frequency of ngrams in the training data, keys being the tuple of words(n-grams) and value being their frequency
            self.prefix_counts: dictionary for storing the frequency of the (n-1) grams in the data, similar to the self.n_grams_counts
            As an example:
            For a trigram model, the n-gram would be (w1,w2,w3), the corresponding [n-1] gram would be (w1,w2)
        """
        self.n = n 
        self.train_data = train_data
        self.alpha = alpha
        self.tokens = flatten(self.train_data)
        self.vocab = Counter(self.tokens)
        # Initialize the n_grams_counts and prefix_counts before setting up the model
        self.n_grams_counts = {}
        self.prefix_counts = {}
        for ngram in get_ngrams(self.tokens, self.n):
            self.n_grams_counts[ngram] = self.n_grams_counts.get(ngram,0) + 1
            if self.n > 1:
                prefix = ngram[:-1]
                self.prefix_counts[prefix] = self.prefix_counts.get(prefix,0) + 1
        
        # Now initialize the model using the n_grams_counts
        self.model = {}
        for ngram in get_ngrams(self.tokens, self.n):
            self.model[ngram] = self.get_prob(ngram)


    def build(self):
        """
        Returns a n-gram dict with their smoothed probabilities. Remember to consider the edge case of n=1 as well
        
        You are expected to update the self.n_grams_counts and self.prefix_counts, and use those calculate the probabilities. 
        """
        if not self.tokens:
            print("Warning: No tokens found in training data.")
            return {}

        for ngram in get_ngrams(self.tokens, self.n):
            self.model[ngram] = self.get_prob(ngram)
        return self.model
            
        #for i in range(len(self.tokens) - self.n + 1):
        #    ngram = tuple(self.tokens[i:i+self.n])
        #    #print(f"Processing ngram: {ngram}")
        #    self.n_grams_counts[ngram] = self.n_grams_counts.get(ngram,0) + 1
#
        #    if self.n > 1:
        #        prefix = tuple(self.tokens[i:i+self.n-1])
        #        self.prefix_counts[prefix] = self.prefix_counts.get(prefix,0) + 1
        #
        #for ngram, count in self.n_grams_counts.items():
        #    if self.n == 1:
        #        self.model[ngram] = (count + self.alpha) / (len(self.tokens) + self.alpha * len(self.vocab))
        #    else:
        #        prefix = ngram[:-1]
        #        self.model[ngram] = (count + self.alpha) / (self.prefix_counts[prefix] + self.alpha * len(self.vocab))
    #
        #return self.model

    def get_smooth_probabilities(self, ngrams):
        """
        Returns the smoothed probability of the n-gram, using Laplace Smoothing. 
        Remember to consider the edge case of  n = 1
        HINT: Use self.n_gram_counts, self.tokens and self.prefix_counts 
        """
        #print(ngrams)
        ngrams_counts = self.n_grams_counts.get(ngrams,0)
        #print(ngrams_counts)
        if self.n > 1:
            prefix_counts = self.prefix_counts.get(ngrams[:-1],0)
            #print(f"Prefix Counts: {prefix_counts}")
            smooth_probabilities = (ngrams_counts + self.alpha) / (prefix_counts + self.alpha * len(self.vocab))
        elif self.n == 1:
            smooth_probabilities = (ngrams_counts + self.alpha) / (len(self.tokens) + self.alpha * len(self.vocab))
        return smooth_probabilities

    def get_prob(self, ngram):
        """
        Returns the probability of the n-gram, using Laplace Smoothing.
        
        Args
        ____
        ngram: tuple
            n-gram tuple
        
        Returns
        _______
        float
            probability of the n-gram
        """
        probability = self.model.get(ngram,0)
        if probability == 0:
            # If the n-gram is not observed in the training data, use smoothed probabilities
            probability = self.get_smooth_probabilities(ngram)

        return probability



    def perplexity(self, test_data):
        """
        Returns perplexity calculated on the test data.
        Args
        ----------
        test_data: List[List] 
            Already preprocessed nested list of sentences

        Returns
        -------
        float
            Calculated perplexity value
        """
        # Flatten the test data
        flattened_test_data = flatten(test_data)
        test_ngrams = get_ngrams(flattened_test_data, self.n)
        N = len(test_ngrams)
        log_prob_sum = 0

        for ngrams in test_ngrams:
            prob = self.get_prob(ngrams)
            log_prob_sum += math.log(prob)

        perplexity = math.exp(-log_prob_sum/N)
        # Return perplexity using exponentiation of the negative average log probability
        return perplexity 

In [21]:
#######################################
# TEST: NGramLanguageModel()
#######################################
# For the sake of understanding we will pass alpha as 0 (no smoothing), so that you gain intuition about the probabilities
sample = preprocess(read_file("data/sample.txt"), n=2)
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=0)
test_lm.build()
test_lm.vocab

assert test_lm.vocab == Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

assert test_lm.model =={('<s>', 'we'): 1.0,
        ('we', 'are'): 1.0,
        ('are', 'never'): 0.3333333333333333,
        ('never', 'ever'): 1.0,
        ('ever', 'ever'): 0.75,
        ('ever', 'getting'): 0.25,
        ('getting', 'back'): 1.0,
        ('back', 'together'): 0.5,
        ('together', '</s>'): 0.5,
        ('</s>', '<s>'): 1.0,
        ('are', 'the'): 0.3333333333333333,
        ('the', 'ones'): 1.0,
        ('ones', 'together'): 1.0,
        ('together', 'we'): 0.5,
        ('are', 'back'): 0.3333333333333333,
        ('back', '</s>'): 0.5}

In [22]:
#######################################
# TEST smoothing: NGramLanguageModel()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=2)
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=1)
print(test_lm.build())

assert test_lm.vocab == Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

assert test_lm.model =={('<s>', 'we'): 0.23076923076923078,
        ('we', 'are'): 0.2857142857142857,
        ('are', 'never'): 0.14285714285714285,
        ('never', 'ever'): 0.16666666666666666,
        ('ever', 'ever'): 0.26666666666666666,
        ('ever', 'getting'): 0.13333333333333333,
        ('getting', 'back'): 0.16666666666666666,
        ('back', 'together'): 0.15384615384615385,
        ('together', '</s>'): 0.15384615384615385,
        ('</s>', '<s>'): 0.16666666666666666,
        ('are', 'the'): 0.14285714285714285,
        ('the', 'ones'): 0.16666666666666666,
        ('ones', 'together'): 0.16666666666666666,
        ('together', 'we'): 0.15384615384615385,
        ('are', 'back'): 0.14285714285714285,
        ('back', '</s>'): 0.15384615384615385}

{('<s>', 'we'): 0.23076923076923078, ('we', 'are'): 0.2857142857142857, ('are', 'never'): 0.14285714285714285, ('never', 'ever'): 0.16666666666666666, ('ever', 'ever'): 0.26666666666666666, ('ever', 'getting'): 0.13333333333333333, ('getting', 'back'): 0.16666666666666666, ('back', 'together'): 0.15384615384615385, ('together', '</s>'): 0.15384615384615385, ('</s>', '<s>'): 0.16666666666666666, ('are', 'the'): 0.14285714285714285, ('the', 'ones'): 0.16666666666666666, ('ones', 'together'): 0.16666666666666666, ('together', 'we'): 0.15384615384615385, ('are', 'back'): 0.14285714285714285, ('back', '</s>'): 0.15384615384615385}


In [23]:
#######################################
# TEST unigram: NGramLanguageModel()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=1)
test_lm = NGramLanguageModel(n=1, train_data=sample, alpha=1)
test_lm.build()

assert test_lm.vocab == Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

assert test_lm.model == {('<s>',): 0.09090909090909091,
        ('we',): 0.12121212121212122,
        ('are',): 0.12121212121212122,
        ('never',): 0.06060606060606061,
        ('ever',): 0.15151515151515152,
        ('getting',): 0.06060606060606061,
        ('back',): 0.09090909090909091,
        ('together',): 0.09090909090909091,
        ('</s>',): 0.09090909090909091,
        ('the',): 0.06060606060606061,
        ('ones',): 0.06060606060606061}

In [24]:
#######################################
# TEST: perplexity()
#######################################

sample = preprocess(read_file("data/sample.txt"), n=2)
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=0)
test_lm.build()
test_lm.get_smooth_probabilities(('<s>', 'we'))
print(test_lm.perplexity(sample))
test_ppl = test_lm.perplexity(sample)

assert test_ppl < 1.7
assert test_ppl > 0

test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=1)
test_lm.build()
test_lm.get_smooth_probabilities(('<s>', 'we'))
print(test_lm.perplexity(sample))
test_ppl = test_lm.perplexity(sample)

assert test_ppl < 5.0
assert test_ppl > 0

1.4859942891369486
5.283124177782943


AssertionError: 

### Step 2: RNN Language Model
Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data. Unlike traditional neural networks, which assume independence among inputs, RNNs utilize their internal state (memory) to process sequences of inputs. This makes them particularly well-suited for tasks where context and order matter.

Before diving into building RNN Language Models using PyTorch, it's essential to have a solid foundation in the following areas:
. We assume you have had a basic understanding of PyTorch and its core concepts, including tensors, autograd, modules (nn.Module), and how to construct simple neural networks using PyTorch. For more comprehensive learning, refer to the [PyTorch official tutorials](https://pytorch.org/tutorials/) and documentation.

#### Preparing the Data
The following Python code is used for loading and processing [GloVe (Global Vectors for Word Representation) embeddings](https://nlp.stanford.edu/projects/glove/). GloVe is an unsupervised learning algorithm for obtaining vector representations for words. These embeddings can be used in various natural language processing and machine learning tasks. You can download the 50d embeddings for this assignment from [Canvas](https://canvas.cmu.edu/courses/39596/files/10855662?module_item_id=5748476).

The `load_glove_embeddings(path)` function is used to load the GloVe embeddings from a file. The function takes a file path as an argument, reads the file line by line, and for each line, it splits the line into words and their corresponding embeddings, and stores them in a dictionary. The dictionary, embeddings_dict, maps words to their corresponding vector representations.

The `create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim)` function is used to create an embedding matrix from the loaded GloVe embeddings. This function takes a dictionary mapping words to their indices (`word_to_ix`), the dictionary of GloVe embeddings (`embeddings_dict`), and the dimension of the embeddings (`embedding_dim`) as arguments. It creates a zero matrix of size (vocab_size, embedding_dim) and then for each word in  `word_to_ix`, it checks if the word is in `embeddings_dict`. If it is, it assigns the corresponding GloVe vector to the word's index in the embedding matrix. If the word is not in the embeddings_dict, it assigns a random vector to the word's index in the embedding matrix.

The `glove_path` variable is the path to the GloVe file, and `glove_embeddings` is the dictionary of GloVe embeddings loaded using the `load_glove_embeddings` function. The `embedding_dim` variable is the dimension of the embeddings, and `embedding_matrix` is the embedding matrix created using the create_embedding_matrix function.

In [25]:
# Load the data
vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/sample.txt")

In [26]:
def load_glove_embeddings(path):
    embeddings_dict = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = torch.tensor([float(val) for val in values[1:]], dtype=torch.float)
            embeddings_dict[word] = vector
    return embeddings_dict

# Path to the GloVe file
glove_path = 'data/glove.6B.50d.txt'  # Update this path
glove_embeddings = load_glove_embeddings(glove_path)
#print(f"glove_embeddings:{glove_embeddings}")

def create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim):
    vocab_size = len(word_to_ix)
    embedding_matrix = torch.zeros((vocab_size, embedding_dim))
    for word, ix in word_to_ix.items():
        if word in embeddings_dict:
            embedding_matrix[ix] = embeddings_dict[word]
        else:
            embedding_matrix[ix] = torch.rand(embedding_dim)  # Random initialization for words not in GloVe
    return embedding_matrix

# Create the embedding matrix
embedding_dim = 50
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)
#print(f"embedding_matrix:{embedding_matrix}")

#### TODO: Defining the RNN Model

In [31]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#   
# Copyright (C) 2024
# 
# @author: Ezra Fu <erzhengf@andrew.cmu.edu>
# based on work by 
# Ishita <igoyal@andrew.cmu.edu> 
# Suyash <schavan@andrew.cmu.edu>
# Abhishek <asrivas4@andrew.cmu.edu>

"""
11-411/611 NLP Assignment 2
RNN Language Model Implementation

Complete the LanguageModel class and other TO-DO methods.
"""

#######################################
# Import Statements
#######################################
from utils import *
from collections import Counter
from itertools import product
import argparse
import random
import math
import torch.nn.functional as F

#######################################
# TODO: RNNLanguageModel()
#######################################
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_matrix):
        """
        RNN model class.
        
        Args
        ____
        vocab_size: int
            Size of the vocabulary
        embedding_dim: int
            Dimension of the word embeddings
        hidden_dim: int
            Dimension of the hidden state of the RNN
        embedding_matrix: torch.Tensor
            Pre-trained GloVe embeddings
            
        Other attributes:
            self.embedding: nn.Embedding
                Embedding layer
            self.rnn: nn.RNN
                RNN layer
            self.fc: nn.Linear
                Fully connected layer
        
        Note: Remember to initialize the weights of the embedding layer with the GloVe embeddings
        """
        super().__init__()
        self.embedding = nn.Embedding(vocab_size,embedding_dim)
        self.embedding.weight = nn.Parameter(embedding_matrix)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, x, hidden=None):
        """
        The forward pass of the RNN model.
        
        Args
        ____
        x: torch.Tensor
            Input tensor of shape (batch_size, sequence_length)
        hidden: torch.Tensor
            Hidden state tensor of shape (num_layers, batch_size, hidden_dim)
        
        Returns
        -------
        out: torch.Tensor
            Output tensor of shape (batch_size, sequence_length, vocab_size)
        hidden: torch.Tensor
            Hidden state tensor of shape (num_layers, batch_size, hidden_dim)
            
        HINT: You need to use the embedding layer, rnn layer and the fully connected layer to define the forward pass
        """
        embedding_layer = self.embedding(x)
        out, hidden = self.rnn(embedding_layer, hidden)
        out = self.fc(out)
        return out, hidden

    
    def generate_sentence(self, sequence, word_to_ix, ix_to_word, num_words, mode='max'):
        """
        Predicts the next words given a sequence.
        
        Args
        ____
        sequence: str
            Input sequence
        word_to_ix: dict
            Dictionary mapping words to their corresponding indices
        ix_to_word: dict
            Dictionary mapping indices to their corresponding words
        num_words: int
            Maximum number of words to predict
        mode: str
            Mode of prediction. 'max' or 'multinomial'
            'max' mode selects the word with maximum probability
            'multinomial' mode samples the word from the probability distribution
            Hint: Use torch.multinomial() method
        
        Returns
        -------
        predicted_sequence: List[str]
            List of predicted words
        """
        self.eval()  # Set the model to evaluation mode
        sequence = sequence.split()  # Convert string to list of words
        predicted_sequence = []
        with torch.no_grad():
            input_seq = [word_to_ix.get(word, word_to_ix['<UNK>']) for word in sequence]
            input = torch.tensor(input_seq, dtype=torch.long).unsqueeze(0)  # Shape: (1, sequence_length)
            #print(f"Input sequence: {input}")
            #print(f"Input Sequence shape: {input.shape}")
            hidden = None  # Initialize hidden state

            for _ in range(num_words):
                output, hidden = self.forward(input, hidden)
                #print(f"Output sequence: {output}")
                #print(f"Output Sequence shape: {output.shape}")

                # Get the output of the last time step
                output = output[:, -1, :]  # Shape: (1, vocab_size)

                if mode == 'max':
                    predicted_word_ix = torch.argmax(output, dim=1).item()
                elif mode == 'multinomial':
                    probabilities = torch.softmax(output, dim=1)
                    predicted_word_ix = torch.multinomial(probabilities, num_samples=1).item()
                else:
                    raise ValueError("Invalid mode. Choose 'max' or 'multinomial'.")

                predicted_word = ix_to_word.get(predicted_word_ix, '<UNK>')
                #print(f"Predicted word: {predicted_word}")
                predicted_sequence.append(predicted_word)

                # Prepare input for the next time step
                input = torch.tensor([[predicted_word_ix]], dtype=torch.long)
        return predicted_sequence
        


#### Training the Model
The following code snippet provided is responsible for training the RNN language model. 

In [32]:
#######################################
# TEST: RNNLanguageModel() and training
#######################################
torch.manual_seed(11411)
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 10

# Initialize the model, loss function, and optimizer
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

lines = ""
# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()
    
    line = f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}, Perplexity: {np.exp(loss.item())}'
    lines += line + "\n"
    print(line)

Epoch 1/10, Loss: 2.6288723945617676, Perplexity: 13.858134580632937
Epoch 2/10, Loss: 2.4372386932373047, Perplexity: 11.441403857477876
Epoch 3/10, Loss: 2.2839365005493164, Perplexity: 9.815242166724511
Epoch 4/10, Loss: 2.1605491638183594, Perplexity: 8.675900841293625
Epoch 5/10, Loss: 2.0542263984680176, Perplexity: 7.800800826332901
Epoch 6/10, Loss: 1.956910252571106, Perplexity: 7.0774257897383395
Epoch 7/10, Loss: 1.8642698526382446, Perplexity: 6.451223821896329
Epoch 8/10, Loss: 1.7739530801773071, Perplexity: 5.8941072473852
Epoch 9/10, Loss: 1.6850324869155884, Perplexity: 5.3926261192933875
Epoch 10/10, Loss: 1.5976566076278687, Perplexity: 4.9414391151344885


## Part 2: Written [40 points]. We have given some code for some of the written parts to make it easier for you.

### **Written 4.1** – n-gram counts [8 points]

In [33]:
def n_gram_counts(file,n,smoothing):
    data = read_file(file)
    train = preprocess(data,n)
    print(len(train))
    lm = NGramLanguageModel(n, train, smoothing)
    lm.build()
    lm.vocab
    return len(lm.n_grams_counts.keys())

business_data = 'data/bbc/business.txt'
sports_data = 'data/bbc/sport.txt'
print(f'Business data with bi-gram: {n_gram_counts(file=business_data,n=2,smoothing=0.1)}')
print(f'Sports data with bi-gram: {n_gram_counts(file=sports_data,n=2,smoothing=0.1)}')
print(f'Business data with tri-gram: {n_gram_counts(file=business_data,n=3,smoothing=0.1)}')
print(f'Sports data with tri-gram: {n_gram_counts(file=sports_data,n=3,smoothing=0.1)}')

19990
Business data with bi-gram: 83819
9611
Sports data with bi-gram: 77398
19990
Business data with tri-gram: 141221
9611
Sports data with tri-gram: 135645


### **Written 4.2** – Song Attribution [8 points]

In [34]:
# Example code for Taylor Swift N-Gram LM
def song_attribution(train_data, test_data, n, smoothing):
    train = read_file(train_data)
    test = read_file(test_data)
    train = preprocess(train, n)
    test = preprocess(test, n)
    lm = NGramLanguageModel(n, train, smoothing)
    lm.build()

    ppl = lm.perplexity(test)
    return ppl

In [35]:
train_data="data/lyrics/taylor_swift.txt"
test_data = "data/lyrics/test_lyrics.txt"
print(f'Taylor Swift PPL with tri-gram: {song_attribution(train_data, test_data, 3, 0.1)}')

Taylor Swift PPL with tri-gram: 138.00663307990817


In [36]:
train_data="data/lyrics/green_day.txt"
test_data = "data/lyrics/test_lyrics.txt"
print(f'Green Day PPL with tri-gram: {song_attribution(train_data, test_data, 3, 0.1)}')

Green Day PPL with tri-gram: 522.5401188730924


In [37]:
train_data="data/lyrics/ed_sheeran.txt"
test_data = "data/lyrics/test_lyrics.txt"
print(f'Ed Sheeran PPL with tri-gram: {song_attribution(train_data, test_data, 3, 0.1)}')

Ed Sheeran PPL with tri-gram: 521.2574891234094


### **Written 4.3.1** –  Intro to Decoding [8 points]

Please take a look at and understand the functions: `best_candidate()`, `top_k_best_candidates()` and `generate_sentences_from_phrase()` in `utils.py`.

In [38]:
n = 4
smoothing = 0.1

In [39]:
train = read_file("data/lyrics/ed_sheeran.txt")
train = preprocess(train, n)
lm = NGramLanguageModel(n, train, smoothing)
lm.build()

{('<s>', '<s>', '<s>', 'one'): 0.00322040072859745,
 ('<s>', '<s>', 'one', 'two'): 0.002781289506953224,
 ('<s>', 'one', 'two', 'three'): 0.002937249666221629,
 ('one', 'two', 'three', 'four'): 0.002937249666221629,
 ('two', 'three', 'four', '</s>'): 0.002937249666221629,
 ('three', 'four', '</s>', '<s>'): 0.002937249666221629,
 ('four', '</s>', '<s>', '<s>'): 0.002937249666221629,
 ('</s>', '<s>', '<s>', '<s>'): 0.9455804124462581,
 ('<s>', '<s>', '<s>', 'ooh'): 0.0051147540983606556,
 ('<s>', '<s>', 'ooh', 'ooh'): 0.03696450428396573,
 ('<s>', 'ooh', 'ooh', '</s>'): 0.00797940797940798,
 ('ooh', 'ooh', '</s>', '<s>'): 0.03138780804150454,
 ('ooh', '</s>', '<s>', '<s>'): 0.06533166458072591,
 ('<s>', '<s>', '<s>', 'every'): 0.003074681238615665,
 ('<s>', '<s>', 'every', 'time'): 0.010392902408111533,
 ('<s>', 'every', 'time', 'you'): 0.0029139072847682124,
 ('every', 'time', 'you', 'come'): 0.002937249666221629,
 ('time', 'you', 'come', 'around'): 0.002937249666221629,
 ('you', 'come'

In [40]:
s1 = ("Every", "time", "you",'come')

s2 = ("Fell", "the", "fall")

s3 = ("Down Bad")

In [41]:
print(top_k_best_candidates(lm, s1[-(n-1):], 3, without=['<s>', '</s>']))

[('around', 0.002937249666221629)]


In [42]:
sentences = list(generate_sentences_from_phrase(lm, 1, list(s1), 1, 'max'))
print(sentences)

[('Every time you come around </s>', 0.1715183012821963)]


### **Written 4.3.2** – Text Generation [8 points]

For this subtask, train an RNN LM using `data/ed_sheeran.txt`

In this part, we will try the first two approaches to generate sentences.

Q1. Use `predict_next_words()` method to generate sentences after the provided phrases from `s1` to `s3`. Use modes `max` and `multinomial`. Report one of your favorite generations (for any strategy or phrase).

Q2. Which decoding strategy did you like better and why?

In [43]:
s1 = "yellow"

s2 = "fell the fall"

s3 = "down bad"

In [46]:
# Calculate your RNN model's perplexity
torch.manual_seed(11411)
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 10

vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/lyrics/ed_sheeran.txt")
glove_embeddings = load_glove_embeddings('data/glove.6B.50d.txt')
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)

# Initialize the model, loss function, and optimizer
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

perplexity = 0

# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        RNN.zero_grad()
        output, _ = RNN(inputs)
        #print(output.view(-1, vocab_size).shape)
        #print(targets.view(-1).shape)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()
    
    perplexity = np.exp(loss.item())

perplexity

np.float64(8.524552548567668)

In [47]:
sentence = s1
predicted_words_sequence = RNN.generate_sentence(sentence, word_to_ix, ix_to_word, 5, mode='max')
print(' '.join(predicted_words_sequence))

pages </s> <s> i m


In [48]:
sentence = s2
predicted_words_sequence = RNN.generate_sentence(sentence, word_to_ix, ix_to_word, 5, mode='max')
print(' '.join(predicted_words_sequence))

in love with you </s>


In [49]:
sentence = s3
predicted_words_sequence = RNN.generate_sentence(sentence, word_to_ix, ix_to_word, 5, mode='max')
print(' '.join(predicted_words_sequence))

fruit </s> <s> i m


**Aside (for fun!)**: Train your LM on Taylor Swift lyrics and generate the next hit!

### **Written 4.4** – Battle of the LMs: GPT-2, Trigram and RNN [8 points]

For this subtask, you will be generating text and comparing GPT-2 with your n-gram and RNN language models. 

Generative pretrained transformer (GPT) is a neural language model series created by OpenAI. The n-gram language model you trained has on average around 10K-20K parameters (`len(lm.model)`.) Compare that to the 175 billion parameters of GPT-3, which is likely much smaller than more recent iterations (though they don't tell us anymore)!

Let's see how GPT-2 compares to the LMs you trained in Written 4.3.1 on the `data/bbc/tech-small.txt` dataset.

In [50]:
# Calculate your n-gram model's perplexity
test = preprocess(read_file("data/bbc/tech-small.txt"), 3)
NGram = NGramLanguageModel(n=3, train_data=test)
NGram.build()
NGram.perplexity(test)

179.0901520291377

In [51]:
# Calculate your RNN model's perplexity
torch.manual_seed(11411)
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 10

vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/bbc/tech-small.txt")
glove_embeddings = load_glove_embeddings('data/glove.6B.50d.txt')
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)

# Initialize the model, loss function, and optimizer
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

lines = ""
# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()

    perplexity = np.exp(loss.item())
    line = f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}, Perplexity: {np.exp(loss.item())}'
    lines += line + "\n"
    print(line)
print(perplexity)

Epoch 1/10, Loss: 1.6010288000106812, Perplexity: 4.958130726323093
Epoch 2/10, Loss: 0.4941759705543518, Perplexity: 1.6391469770188427
Epoch 3/10, Loss: 0.33784034848213196, Perplexity: 1.4019166674140409
Epoch 4/10, Loss: 0.16959521174430847, Perplexity: 1.1848251509322285
Epoch 5/10, Loss: 0.2638842463493347, Perplexity: 1.3019774789317493
Epoch 6/10, Loss: 0.14845190942287445, Perplexity: 1.1600370095837167
Epoch 7/10, Loss: 0.17833548784255981, Perplexity: 1.1952262378516227
Epoch 8/10, Loss: 0.16257105767726898, Perplexity: 1.1765319171037545
Epoch 9/10, Loss: 0.16952191293239594, Perplexity: 1.1847383078391278
Epoch 10/10, Loss: 0.08578262478113174, Perplexity: 1.0895694571618804
1.0895694571618804


#### Computing GPT-2's perplexity on the test set

You need to enable a GPU runtime from the Colab `Runtime` menu option (you can also use your computer if you have an accelerator). Go to `Runtime` → `Change Runtime Type` → `Hardware Accelerator (GPU)`

In [38]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

model_id = "distilgpt2"
model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

  from .autonotebook import tqdm as notebook_tqdm


In [39]:
test = read_file("data/bbc/tech-small.txt")
encodings = tokenizer("\n\n".join(test), return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (1137 > 1024). Running this sequence through the model will result in indexing errors


In [42]:
from tqdm import tqdm

max_length = model.config.n_positions
stride = 100

nlls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs[0] * trg_len
    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)

100%|██████████| 12/12 [00:05<00:00,  2.37it/s]


In [43]:
print("Perplexity using GPT2:", ppl.item())

Perplexity using GPT2: 50.644840240478516
