# Language Modelling
The goal is to model Indian first names using character-level language models.

You can use Google Colab notebooks to work on this. Later, you can download this notebook as a python file.

## Broad Outline

In this assignment, we will implement different types of language models for modeling Indian names. There are clealry patterns in Indian names that models could learn, and we start modeling those using n-gram models, then move to neural n-gram and RNN models.


**Models**
- Unigram
- Bigram
- Trigram
- Neural N-gram LM
- RNN LM

# Read and Preprocess Data

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np
import math

import random
from collections import Counter, defaultdict

from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

In [2]:
# Download the training and validation datasets
!wget -O train_data.csv "https://docs.google.com/spreadsheets/d/1AUzwOQQbAehg_eoAMCcWfwSGhKwSAtnIzapt2wbv0Zs/gviz/tq?tqx=out:csv&sheet=train_data.csv"
!wget -O valid_data.csv "https://docs.google.com/spreadsheets/d/1UtQErvMS-vcQEwjZIjLFnDXlRZPxgO1CU3PF-JYQKvA/gviz/tq?tqx=out:csv&sheet=valid_data.csv"

# Download the text for evaluation
!wget -O eval_prefixes.txt "https://drive.google.com/uc?export=download&id=1tuRLJXLd2VcDaWENr8JTZMcjFlwyRo60"
!wget -O eval_sequences.txt "https://drive.google.com/uc?export=download&id=1kjPAR04UTKmdtV-FJ9SmDlotkt-IKM3b"

--2024-03-01 10:24:31--  https://docs.google.com/spreadsheets/d/1AUzwOQQbAehg_eoAMCcWfwSGhKwSAtnIzapt2wbv0Zs/gviz/tq?tqx=out:csv&sheet=train_data.csv
Resolving docs.google.com (docs.google.com)... 172.253.115.101, 172.253.115.139, 172.253.115.138, ...
Connecting to docs.google.com (docs.google.com)|172.253.115.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘train_data.csv’

train_data.csv          [<=>                 ]       0  --.-KB/s               train_data.csv          [ <=>                ]  71.07K  --.-KB/s    in 0.007s  

2024-03-01 10:24:32 (10.3 MB/s) - ‘train_data.csv’ saved [72776]

--2024-03-01 10:24:32--  https://docs.google.com/spreadsheets/d/1UtQErvMS-vcQEwjZIjLFnDXlRZPxgO1CU3PF-JYQKvA/gviz/tq?tqx=out:csv&sheet=valid_data.csv
Resolving docs.google.com (docs.google.com)... 172.253.115.101, 172.253.115.139, 172.253.115.138, ...
Connecting to docs.google.com (docs.google.com)|172.253.115.101|:443... connec

In [3]:

def read_dataframe(ds_type):
    """
    Args:
        ds_type [str] :  dataset type (train or valid)

    Returns:
        df [pandas dataframe]
    """

    df = pd.read_csv(f"/content/{ds_type}_data.csv", header=0, index_col=0)
    df = df[~df['Name'].isna()]
    df['Name'] = df['Name'].astype(str)
    return df

# Load the training and validation datasets
train_data = read_dataframe("train")
validation_data = read_dataframe("valid")

# Read files containing prefixes and character sequences for evaluation
with open('eval_prefixes.txt', 'r') as file:
    eval_prefixes = []
    for line in file:
        eval_prefixes.append(line.strip().split(" "))

with open('eval_sequences.txt', 'r') as file:
    eval_sequences = []
    for line in file:
        eval_sequences.append(line.strip().split(" "))

print(f"Length of training data: {len(train_data)}\nLength of validation data: {len(validation_data)}")

Length of training data: 4539
Length of validation data: 1297


In [4]:
## Please do not change anything in this code block.

START = "<s>"   # Start-of-name token
END = "</s>"    # End-of-name token
UNK = "<unk>"   # token representing out of unknown (or out of vocabulary) tokens
vocab_from_ascii = True

def build_vocab(names):
    """
    Builds a vocabulary given a list of names

    Args:
        names [list[str]]: list of names

    Returns:
        vocab [torchtext.vocab]: vocabulary based on the names

    """

    if vocab_from_ascii:
        char_counts = {chr(i):i for i in range(128)}
    else:
        char_counts = Counter("".join(names))

    vocab = build_vocab_from_iterator(
                    char_counts,
                    specials=[UNK, START, END], #adding special tokens to the vocabulary
                    min_freq=1
                )
    vocab.set_default_index(vocab[UNK])
    return vocab


def tokenize_name(name):
    """
    Tokenise the name i.e. break a name into list of characters

    Args:
        name [str]: name to be tokenized

    Returns:
        list of characters
    """

    return list(str(name))


def process_data_for_input(data_iter, vocab):
    """
    Processes data for input: Breaks names into characters,
    converts out of vocabulary tokens to UNK and
    appends END token at the end of every name

    Args:
        data_iter: data iterator consisting of names
        vocab: vocabulary

    Returns:
        data_iter [list[list[str]]]: list of names, where each name is a
                                list of characters and is appended with
                                START and END tokens

    """

    vocab_set = set(vocab.get_itos())
    # convert Out Of Vocabulary (OOV) tokens to UNK tokens
    data_iter = [[char if char in vocab_set else UNK
                        for char in tokenize_name(name)] for name in data_iter]
    data_iter = [[START] + name + [END] for name in data_iter]

    return data_iter


def get_tokenised_text_and_vocab(ds_type, vocab=None):
    """
    Reads input data, tokenizes it, builds vocabulary (if unspecified)
    and outputs tokenised list of names (which in turn is a list of characters)

    Args:
        ds_type [str]: Type of the dataset (e.g., train, validation, test)
        vocab [torchtext.vocab]: vocabulary;
                                 If vocab is None, the function will
                                 build the vocabulary from input text.
                                 If vocab is provided, it will tokenize name
                                 according to the vocab, replacing any tokens
                                 not part of the vocab with UNK token.

    Returns:
        data_iter: data iterator for tokenized names
        vocab: vocabulary

    """

    # read the 'Name' column of the dataframe
    if ds_type=='train':
        data_iter = train_data['Name']
    elif ds_type=='valid':
        data_iter = validation_data['Name']
    else:
        data_iter = test_data['Name']

#    print(data_iter)
    # build vocab from input data, if vocab is unspecified
    if vocab is None:
        vocab = build_vocab(data_iter)

    # convert OOV chars to UNK, append START and END token to each name
    data_iter = process_data_for_input(data_iter, vocab)

    return data_iter, vocab

Let's look at some examples from the training set

In [5]:
# Look at some random examples from the training set
examples = ", ".join(random.sample(list(train_data['Name']), 5))
f"Examples from the training set: {examples}"

'Examples from the training set: omparkesh, gufran, jeete, sheeak, suchi'

# Module 1: N-gram Language Modelling

Load and preprocess the data for n-gram models

In [6]:
"""choose your hyperparameter and see the difference in performance"""

# CHANGE THE None VALUES TO YOUR DESIRED VALUES

# ADD YOUR CODE HERE

MAX_NAME_LENGTH = 20 # maximum length of names for generation

In [7]:
# Get data iterator and build vocabulary from input text

train_text, vocab = get_tokenised_text_and_vocab(ds_type='train')
validation_text, _ = get_tokenised_text_and_vocab(ds_type='valid', vocab=vocab)

# Check the size of vocabulary
vocab_size = len(vocab.get_stoi())
print(vocab_size)

131


Now it's time to implement an n-gram language model.

One edge case you will need to handle is that you don't have  n−1  prior characters at the beginning of the text. One way to do this is by appending the START token  n−1  times at the start of the name, when implementing an  n -gram model. You may choose whichever method you like to handle this case as long as you produce a valid probability distribution (one that sums to one).

Generating names

To generate from a language model, we can sample one char at a time conditioning on the chars we have generated so far.

In fact there are many strategies to get better-sounding samples, such as only sampling from the top-k chars or sharpening the distribution with a temperature. You can read more about sampling from a language model in this paper.

We will now implement N-gram models with  N=1  (unigram),  N=2  (bigram), and  N=3  (trigram).

Utility Functions

Implement the utility functions get_unigram_counts, get_bigram_counts and get_trigram_counts. You can use these functions while implementing n-gram models.

In [8]:
def get_unigram_counts(corpus):
    """
    Given a corpus, calculates the unigram counts for each character in the corpus

    Args:
        corpus [list[list[str]]]: list of tokenized characters. Text is appended with END token.

    Returns:
        unigram_counts [dict [key: char, value: count]]:
            dictionary of unigram counts for each character in the corpus
        Example:
        > unigram_counts["c1"] = 5
    """

    # ADD YOUR CODE HERE
    # BEGIN CODE

    unigram_counts = {}

    for name in corpus:
      for character in name:
        if character in unigram_counts:
          unigram_counts[character] += 1
        else:
          unigram_counts[character] = 1

    return unigram_counts

     # END CODE

In [9]:
def get_bigram_counts(corpus):
    """
    Given a corpus, calculates the bigram counts for each bigram in the corpus.
    The corpus *only* contains END tokens at the end of names.
    You may want to handle the case whhere beginning of the name
    does not have n-1 prior chars.

    Args:
        corpus [list[list[str]]]: list of tokenized text. Text is appended with END token.

    Returns:
        bigram_counts [dict[dict]]:
            nested dictionary of bigram counts for each bigram in the corpus
        Example:
        > bigram_counts["c1"]["c2"] = 5
        here bigram_counts["c1"]["c2"] represents P("c2"|"c1")
        P["c1"]["c2"] means P[char_i = "c2"|char_{i-1} = "c1"]
    """

    # ADD YOUR CODE HERE
    # BEGIN CODE
    bigram_counts = {}

    chars = set()
    for name in corpus:
      chars.update(name)

    bigram_counts = {char1:{char2: 0 for char2 in chars} for char1 in chars}

    for name in corpus:
      for i in range(len(name)-1):
        bigram_counts[name[i]][name[i+1]] += 1

    return bigram_counts

     # END CODE

In [10]:
def get_trigram_counts(corpus):
    """
    Given a corpus, calculates the trigram counts for each trigram in the corpus.
    The corpus *only* contains END tokens at the end of names.
    You may want to handle the case where beginning of the text
    does not have n-1 prior chars.

    Args:
        corpus [list[list[str]]]: list of tokenized text. Text is appended with END token.

    Returns:
        trigram_counts [dict[dict[dict]]]:
            nested dictionary for each trigram in the corpus
        Example:
        > trigram_counts["c1"]["c2"]["c3"] = 5
        P["c1"]["c2"]["c3] means P[char_i = "c3"|char_{i-2} = "c1", char_{i-1} = "c2"]

    """

    # ADD YOUR CODE HERE
    # BEGIN CODE
    trigram_counts = {}

    chars = set()
    for name in corpus:
      chars.update(name)

    trigram_counts = {char1:{char2: {char3: 0 for char3 in chars} for char2 in chars} for char1 in chars}

    for name in corpus:
      for i in range(len(name)-2):
        trigram_counts[name[i]][name[i+1]][name[i+2]] += 1

    return trigram_counts
     # END CODE

In [11]:
"""
Implementation of the n-gram language models.
All other n-gram models (unigram, bigram, etc.) would follow the same skeleton.
"""

class NGramLanguageModel(object):
    def __init__(self, train_text):
        """
        Initialise and train the model with train_text.

        Args:
            train_text [list of list]: list of tokenised names

        Returns:
            -
        """


    def get_next_char_probabilities(self):
        """
        Returns a probability distribution over all chars in the vocabulary.
        Probability distribution should sum to one.

        Returns:
            P: dictionary or nested dictionary; Output format depends on n-gram
            Examples:
                for N=1 (unigram); dict[key:unigram,value:probability of unigram]
                    > P["c1"] = 0.0001
                for N=2 (bigram); dict[key:bigram_char1, value:dict[key:bigram_char2,value:probability of bigram]]
                    > P["c1"]["c2"] = 0.0001
                    P["c1"]["c2"] means P["c2"|"c1"]
                for N=3 (trigram); dict[dict[dict]]
                    > P["c1"]["c2"]["c3"] = 0.0001
                    P["c1"]["c2"]["c3] means P[char_i = "c3"|char_{i-2} = "c1", char_{i-1} = "c2"]
        """




    def get_name_log_probability(self, name):
        """
        Calculates the log probability of name according to the language model

        Args:
            name [list]: list of tokens

        Returns:
            log_prob [float]: Log probability of the given name
        """



    def get_perplexity(self, text):
        """
        Returns the perplexity of the model on a text as a float.

        Args:
            text [list]: a list of string tokens

        Returns:
            perplexity [float]: perplexity of the given text
        """




    def generate_names(self, k, n=MAX_NAME_LENGTH, prefix=None):
        """
        Given a prefix, generate k names according to the model.
        The default prefix is None.
        You may stop the generation when n tokens have been generated,
        or when you encounter the END token.

        Args:
            k [int]: Number of names to generate
            n [int]: Maximum length (number of tokens) in the generated name
            prefix [list of tokens]: Prefix after which the names have to be generated

        Returns:
            names [list[str]]: list of generated names
        """

    def get_most_likely_chars(self, sequence, k):
        """
        Given a sequence of characters, outputs k most likely characters after the sequence.

        Args:
            sequence [list[str]]: list of characters
            k [int]: number of most likely characters to return

        Returns:
            chars [list[str]]: *Ordered* list of most likely characters
                        (with charcater at index 0 being the most likely and
                        character at index k-1 being the least likely)

        """

        return []

In [12]:
## Please do not change anything in this code block.

def check_validity(model, ngram, is_neural):
    """
    Checks if get_next_char_probabilities returns a valid probability distribution
    """

    if ngram==1 or is_neural:
        P = model.get_next_char_probabilities()
        is_valid = validate_probability_distribution(P.values())
        if not is_valid:
            return is_valid

    elif ngram==2:
        P = model.get_next_char_probabilities()
        for char1 in P.keys():
            is_valid = validate_probability_distribution(list(P[char1].values()))
            if not is_valid:
                return is_valid

    elif ngram==3:
        P = model.get_next_char_probabilities()
        for char1 in P.keys():
            for char2 in P[char1].keys():
                is_valid = validate_probability_distribution(list(P[char1][char2].values()))
                if not is_valid:
                    return is_valid
    else:
        print("Enter a valid number for ngram")

    return True


def validate_probability_distribution(probs):
    """
    Checks if probs is a valid probability distribution
    """
    if not min(probs) >= 0:
        print("Negative value in probabilities")
        return False
    elif not max(probs) <= 1 + 1e-8:
        print("Value larger than 1 in probabilities")
        return False
    elif not abs(sum(probs)-1) < 1e-4:
        print("probabilities do not sum to 1")
        return False
    return True


def eval_ngram_model(model, ngram, ds, ds_name, eval_prefixes, eval_sequences, num_names=5, is_neural=False):
    """
    Runs the following evaluations on n-gram models:
    (1) checks if probability distribution returned by model.get_next_char_probabilities() sums to one
    (2) checks the perplexity of the model
    (3) generates names using model.generate_names()
    (4) generates names given a prefix using model.generate_names()
    (4) output most likely characters after a given sequence of chars using model.get_most_likely_chars()
    """

    # (1) checks if probability distributions sum to one
    is_valid = check_validity(model=model, ngram=ngram, is_neural=is_neural)
    print(f'EVALUATION probability distribution is valid: {is_valid}')

    # (2) evaluate the perplexity of the model on the dataset
    print(f'EVALUATION of {ngram}-gram on {ds_name} perplexity:',
        model.get_perplexity(ds))

    # (3) generate a few names
    generated_names = ", ".join(model.generate_names(k=num_names))
    print(f'EVALUATION {ngram}-gram generated names are {generated_names}')

    # (4) generate a few names given a prefix
    for prefix in eval_prefixes:
        generated_names_with_prefix = ", ".join(model.generate_names(k=num_names, prefix=prefix))
        prefix = ''.join(prefix)
        print(f'EVALUATION {ngram}-gram generated names with prefix {prefix} are {generated_names_with_prefix}')

    # (5) get most likely characters after a sequence
    for sequence in eval_sequences:
        most_likely_chars = ", ".join(model.get_most_likely_chars(sequence=sequence, k=num_names))
        sequence = "".join(sequence)
        print(f"EVALUATION {ngram}-gram top most likely chars after {sequence} are {most_likely_chars}")

## 1.1 Unigram

In [13]:
"""
Implementaion of a Unigram Model without smoothing
"""

class UnigramModel(NGramLanguageModel):
    def __init__(self, train_text):
        """
        Initialise and train the model with train_text.

        Args:
            train_text [list of list]: list of tokenised names
        """

        # ADD YOUR CODE HERE
        #BEGIN CODE

        self.unigram_counts = get_unigram_counts(train_text)
        self.total_chars = sum(self.unigram_counts.values())

        return
         # END CODE



    def get_next_char_probabilities(self):
        """
        Return a dictionary of probabilities for each char in the vocabulary

        Returns:
            key: char, value: probability
        """

        # ADD YOUR CODE HERE

        # BEGIN CODE

        next_char_probabilty = {char : count/self.total_chars for char,count in self.unigram_counts.items()}

        return next_char_probabilty

         # END CODE



    def get_name_log_probability(self, name):
        """
        Calculates the log probability of name according to the n-gram model

        Args:
            name [list]: list of tokens

        Returns:
            Log probability of the name [float]
        """

        # ADD YOUR CODE HERE
        # BEGIN CODE
        probabilties = self.get_next_char_probabilities()

        name_log_probabilty = 0.0
        for char in name:
          if char in probabilties:
            name_log_probabilty = name_log_probabilty + math.log(probabilties[char])
          else:
            pass

        return name_log_probabilty

         # END CODE



    def get_perplexity(self, text):
        """
        Returns the perplexity of the model on a text as a float.

        Args:
            text [list]: a list of string tokens

        Returns:
            perplexity of the given text [float]
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE

        perplexity = 0.0
        log_prob = 0.0
        len_text = 0

        for name in text:
          log_prob += self.get_name_log_probability(name)
          len_text += len(name)

        perplexity = math.exp( -(log_prob/len_text) )

        return perplexity

         # END CODE



    def generate_names(self, k, n=MAX_NAME_LENGTH, prefix=None):
        """
        Given a prefix, generate k names according to the model.
        The default prefix is None.

        Args:
            k [int]: Number of names to generate
            n [int]: Maximum length (number of tokens) in the generated name
            prefix [list of tokens]: Prefix after which the names have to be generated

        Returns:
            list of generated names [list]
        """

        # ADD YOUR CODE HERE
        # BEGIN CODE

        char_probabilities = self.get_next_char_probabilities()
        names = []

        for K in range(k):
            name = ""
            for N in range(n):
                random_char = random.choice(list(char_probabilities.keys()))
                name += random_char
                if random_char == "<\s>":
                    break
            names.append(name)
        return names

         # END CODE


    def get_most_likely_chars(self, sequence, k):
        """
        Given a sequence of characters, outputs k most likely characters after the sequence.

        Args:
            sequence [list[str]]: list of characters
            k [int]: number of most likely characters to return

        Returns:
            chars [list[str]]: *Ordered* list of most likely characters
                        (with charcater at index 0 being the most likely and
                        character at index k-1 being the least likely)

        """
        # ADD YOUR CODE HERE

        # BEGIN CODE
        most_likely_chars = []
        probabilties = self.get_next_char_probabilities()

        most_likely_chars = sorted(probabilties, key=probabilties.get, reverse=True)[:k]

        return most_likely_chars

         # END CODE

### Eval



**Note**: For models without smoothing, you may observe perplexity as `inf` if the validation or test set contains characters not seen in the train set
However, this should not happen for models where you implement smoothing.

In [14]:
## Please do not change anything in this code block.

unigram_model = UnigramModel(train_text)


# Check the perplexity of the unigram model on the train set
print('unigram train perplexity:',
      unigram_model.get_perplexity(train_text))

unigram train perplexity: 16.623900007096303


In [15]:
## Please do not change anything in this code block.

eval_ngram_model(model=unigram_model, ngram=1, ds=validation_text, ds_name='validation', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences, num_names=5)

EVALUATION probability distribution is valid: True
EVALUATION of 1-gram on validation perplexity: 13.631490452718866
EVALUATION 1-gram generated names are /rli&</s>fa,pp,[uf/hdl), <s>fckytsipko,rhg[ea,j, r)&nob[</s>dnm0)/</s>lsp,o, <s>mftffuf(bgc[ereys,i, ,ju<s>)ug<s>ddkcicaseosj
EVALUATION 1-gram generated names with prefix <s><s>sh are [l[nbcm,kf[c[ob,g0(m, )ahihrj0njchk/rty&ck, pauhagmjhfrb&njstidm, &0n</s>ojg/)</s>bcmij0[&,), l)g</s>kk0dtii(cchrcdng
EVALUATION 1-gram top most likely chars after <s><s>aa are a, <s>, </s>, i, n


### Smoothing

Implement a smoothed version of the unigram model. You may extend the `UnigramModel` class and re-use some of the functions.  For unigram model, you should implement Add-1 smoothing.

You may refer to the lecture slides or [3.5 Smoothing](https://web.stanford.edu/~jurafsky/slp3/3.pdf) for details on different smoothing technqiues.

In [16]:
"""
Implementation of unigram model with Add-1 smoothing.

"""
class SmoothedUnigramModel(UnigramModel):

    def __init__(self, train_text):
        super().__init__(train_text)

    # You should override ONLY those functions
    # which calculate probability of a unigram.
    # You can override get_next_char_probabilities
    # or any other helper functions you use in UnigramModel
    # to calculate unigram probabilities.

    # Implement Laplace or Add-1 smoothing for the unigram model

    # ADD YOUR CODE HERE
        # BEGIN CODE
        # self.unigram_counts = get_unigram_counts(train_text)
        # self.total_chars = sum(self.unigram_counts.values())

        # return
        #  # END CODE



    def get_next_char_probabilities(self):
        """
        Return a dictionary of probabilities for each char in the vocabulary

        Returns:
            key: char, value: probability
        """

        # ADD YOUR CODE HERE

        # BEGIN CODE
        next_char_probabilty = {char : (count+1)/(self.total_chars + len(self.unigram_counts)) for char,count in self.unigram_counts.items()}

        return next_char_probabilty
         # END CODE

In [17]:
## Please do not change anything in this code block.

smoothed_unigram_model = SmoothedUnigramModel(train_text)

# Check the perplexity of the smoothed unigram model on the train set
print('smoothed unigram train perplexity:',
      smoothed_unigram_model.get_perplexity(train_text))

smoothed unigram train perplexity: 16.624659217414965


In [18]:
## Please do not change anything in this code block.

eval_ngram_model(model=smoothed_unigram_model, ngram=1, ds=validation_text, ds_name='validation', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences,  num_names=5)

EVALUATION probability distribution is valid: True
EVALUATION of 1-gram on validation perplexity: 13.632077884612379
EVALUATION 1-gram generated names are cfpph,rsn</s>rlm)&mog&f, dc</s>ma,<s>fm<s>d0dnbya[o[, g)s)s,k</s><s>r/de</s>uy(s/</s>, nemljjc,b0smyiithluy, (u),gt)/<s>)rgar</s>tk(ma
EVALUATION 1-gram generated names with prefix <s><s>sh are peaenm</s>ccdji)nghlgar, <s>jhm)jye,uidf&ya</s>r&&, (slyulpfltrru0,l</s>b&a, skkt</s>chi,rim,aauo<s>i</s>, 0o,yap,g0&cftdiom</s>m0
EVALUATION 1-gram top most likely chars after <s><s>aa are a, <s>, </s>, i, n


In [19]:
# Release models we don't need any more.
del unigram_model
del smoothed_unigram_model

## 1.2 Bigram

In [20]:
"""
Implementation of a Bigram Model.
"""

class BigramModel(NGramLanguageModel):
    def __init__(self, train_text):
        """
        Initialise and train the model with train_text.

        Args:
            train_text [list of list]: list of tokenised names
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE

        self.unigram_counts = get_unigram_counts(train_text)
        self.bigram_counts = get_bigram_counts(train_text)

        total_sum = 0
        # Iterate over outer dictionary
        for inner_dict in self.bigram_counts.values():
            # Iterate over inner dictionary
            for value in inner_dict.values():
                total_sum += value
        self.total_corpus = total_sum
         # END CODE



    def get_next_char_probabilities(self):
        """
        Returns a probability distribution over all chars in the vocabulary.
        Probability distribution should sum to one.

        Returns:
            P: dictionary or nested dictionary; Output format depends on n-gram
            Examples:
                for N=2 (bigram); dict[key:bigram_char1, value:dict[key:bigram_char2,value:probability of bigram]]
                    > P["a"]["b"] = 0.0001 (which stands of P("b"|"a"))
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE

        next_char_probabilty = self.bigram_counts

        # Iterate over outer dictionary
        for outer_key in next_char_probabilty.keys():
            # Iterate over inner dictionary
            t = sum(next_char_probabilty[outer_key].values())

            for inner_key in next_char_probabilty[outer_key].keys():
              if outer_key != END:
                next_char_probabilty[outer_key][inner_key] /= t
              else:
                next_char_probabilty[outer_key][inner_key] = 1/len(self.bigram_counts)

        return next_char_probabilty
         # END CODE


    def get_name_log_probability(self, name):
        """
        Calculates the log probability of name according to the n-gram model.
        Be careful with cases for which probability of the name is zero.

        Args:
            name [list]: list of tokens

        Returns:
            Log probability of the name [float]
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE

        probabilties = self.get_next_char_probabilities()

        name_log_probabilty = 0.0

        for i in range(len(name)-1):
          if name[i] in probabilties and name[i+1] in probabilties[name[i]]:
             if probabilties[name[i]][name[i+1]] > 0:
                    name_log_probabilty += math.log(probabilties[name[i]][name[i+1]])

        return name_log_probabilty
         # END CODE


    def get_perplexity(self, text):
        """
        Returns the perplexity of the model on a text as a float.

        Args:
            text [list]: a list of string tokens

        Returns:
            perplexity of the given text [float]
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE
        perplexity = 0.0
        log_prob = 0.0
        len_text = 0

        for name in text:
          log_prob += self.get_name_log_probability(name)
          len_text += len(name)

        perplexity = math.exp(-(log_prob/len_text))

        return perplexity

         # END CODE


    def generate_names(self, k, n=MAX_NAME_LENGTH, prefix=None):
        """
        Given a prefix, generate k names according to the model.
        The default prefix is None.

        Args:
            k [int]: Number of names to generate
            n [int]: Maximum length (number of tokens) in the generated name
            prefix [list of tokens]: Prefix after which the names have to be generated

        Returns:
            list of generated names [list]
        """

        # ADD YOUR CODE HERE

        # BEGIN CODE

        if prefix == None:
          prefix = [START]
        else:
          prefix = prefix[-1]

        bichar_probabilties = self.get_next_char_probabilities()

        names = []

        for K in range(k):
          last_char = prefix[-1]
          name = [prefix[-1]]
          # name = prefix

          while len(name)<n and name[-1] != END:

            my_dict = bichar_probabilties[last_char]
            next_char = max(my_dict, key=my_dict.get)
            name.append(next_char)
            last_char = next_char

          if name[0] == START:
            name = name[1:]
          if name[-1] == END:
            name = name[:-1]

          name = ''.join(name)

          names.append(name)
        return names
         # END CODE


    def get_most_likely_chars(self, sequence, k):
        """
        Given a sequence of characters, outputs k most likely characters after the sequence.

        Args:
            sequence [list[str]]: list of characters
            k [int]: number of most likely characters to return

        Returns:
            chars [list[str]]: *Ordered* list of most likely characters
                        (with charcater at index 0 being the most likely and
                        character at index k-1 being the least likely)

        """
        # ADD YOUR CODE HERE
        #BEGIN CODE
        most_likely_chars = []
        probabilties = self.get_next_char_probabilities()

        most_likely_chars = sorted(probabilties[sequence[-1]], key=probabilties[sequence[-1]].get, reverse=True)[:k]

        return most_likely_chars
         # END CODE

### Eval

In [21]:
## Please do not change anything in this code block.

bigram_model = BigramModel(train_text)

# check the perplexity of the bigram model on training data
print('bigram train perplexity:',
      bigram_model.get_perplexity(train_text))

bigram train perplexity: 7.658283554851139


In [22]:
## Please do not change anything in this code block.

eval_ngram_model(model=bigram_model, ngram=2, ds=validation_text, ds_name='validation', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences, num_names=5)

EVALUATION probability distribution is valid: True
EVALUATION of 2-gram on validation perplexity: 5.594377295480409
EVALUATION 2-gram generated names are sha, sha, sha, sha, sha
EVALUATION 2-gram generated names with prefix <s><s>sh are ha, ha, ha, ha, ha
EVALUATION 2-gram top most likely chars after <s><s>aa are </s>, n, r, m, l


### Smoothing

Implement a smoothed version of the bigram model. You may extend the `BigramModel` class and re-use some of the functions.

You will implement the following smoothing techniques:
-  Laplace or add-k smoothing
- Interpolation

**Laplace or Add-k smoothing**
- what is the effect of changing `k`?

In [23]:
"""choose your hyperparameter and see the difference in performance"""

# ADD YOUR CODE HERE

# CHANGE THE None VALUES TO YOUR DESIRED VALUES
# Please feel free to play with these hyperparameters to see the effects on the
# quality of generated names and perplexity
# BEGIN CODE
BIGRAM_LAPLACE_K = 0.7 # value of k for add-k or Laplac smoothing in bigram models
# END CODE

In [24]:
"""
Implementation of a bigram model with laplace or add-k smoothing.

"""

class LaplaceSmoothedBigramModel(BigramModel):
    # This class extends BigramModel.

    def __init__(self, train_text, k):
        super().__init__(train_text)
        self.k = k # specify k for smoothing

    # You should override ONLY those functions
    # which calculate probability of a bigram.
    # You can override get_next_char_probabilities
    # or any other helper functions you use in BigramModel
    # to calculate bigram probabilities.

    # ADD YOUR CODE HERE




    def get_next_char_probabilities(self):
        """
        Returns a probability distribution over all chars in the vocabulary.
        Probability distribution should sum to one.

        Returns:
            P: dictionary or nested dictionary; Output format depends on n-gram
            Examples:
                for N=2 (bigram); dict[key:bigram_char1, value:dict[key:bigram_char2,value:probability of bigram]]
                    > P["a"]["b"] = 0.0001 (which stands of P("b"|"a"))
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE

        next_char_probabilty = {}
        bigram_counts = self.bigram_counts

        vocab_size = len(self.bigram_counts)
        # unigram_counts = get_unigram_counts(train_text)

        # Iterate over outer dictionary
        for outer_key in bigram_counts.keys():

          # t =  sum(next_char_probabilty[outer_key].values())
          if outer_key not in next_char_probabilty:
            next_char_probabilty[outer_key] = {}
          t = self.unigram_counts[outer_key]

          for inner_key in bigram_counts[outer_key].keys():

            if(outer_key == END):
              next_char_probabilty[outer_key][inner_key] = 1/vocab_size
            else:
              next_char_probabilty[outer_key][inner_key] = (bigram_counts[outer_key][inner_key]+self.k)/ (t+self.k*vocab_size)

        return next_char_probabilty
         # END CODE


In [25]:
## Please do not change anything in this code block.

smoothed_bigram_model = LaplaceSmoothedBigramModel(train_text, k=BIGRAM_LAPLACE_K)

# check the perplexity of the bigram model on training data
print('smoothed bigram train perplexity:',
      smoothed_bigram_model.get_perplexity(train_text))

smoothed bigram train perplexity: 7.696733589717484


In [26]:
## Please do not change anything in this code block.

eval_ngram_model(model=smoothed_bigram_model, ngram=2, ds=validation_text, ds_name='validation', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences, num_names=5)

EVALUATION probability distribution is valid: True
EVALUATION of 2-gram on validation perplexity: 5.649174114369199
EVALUATION 2-gram generated names are sha, sha, sha, sha, sha
EVALUATION 2-gram generated names with prefix <s><s>sh are ha, ha, ha, ha, ha
EVALUATION 2-gram top most likely chars after <s><s>aa are </s>, n, r, m, l


**Interpolation**
- what are good values for `lambdas` in interpolation?

In [27]:
"""choose your hyperparameter and see the difference in performance"""

# ADD YOUR CODE HERE

# CHANGE THE None VALUES TO YOUR DESIRED VALUES
# Please feel free to play with these hyperparameters to see the effects on the
# quality of generated names and perplexity
# BEGIN CODE
BIGRAM_LAMBDAS = (0.7, 0.3) # lambdas for interpolation smoothing in bigram models
# END CODE

In [28]:
"""
Implementation of a bigram model with interpolation smoothing
"""

class InterpolationSmoothedBigramModel(BigramModel):

    def __init__(self, train_text, lambdas):
        super().__init__(train_text)
        self.lambda_1, self.lambda_2 = lambdas

    # You should override ONLY those functions
    # which calculate probability of a bigram.
    # You can override get_next_char_probabilities
    # or any other helper functions you use in BigramModel
    # to calculate bigram probabilities.

    # ADD YOUR CODE HERE

    def get_next_char_probabilities(self):

      # BEGIN CODE
      next_char_probabilty = {}

      vocab_size = len(self.bigram_counts)
      bigram_probabilty =  BigramModel.get_next_char_probabilities(self)

      for outer_key in self.bigram_counts.keys():

        if outer_key not in next_char_probabilty:
          next_char_probabilty[outer_key] = {}

        for inner_key in self.bigram_counts[outer_key].keys():
          bivalue = bigram_probabilty[outer_key][inner_key]
          total_chars = sum(self.unigram_counts.values())
          univalue = self.unigram_counts[inner_key]/total_chars

          next_char_probabilty[outer_key][inner_key] = self.lambda_1*bivalue + self.lambda_2*univalue

      return next_char_probabilty

       # END CODE

In [29]:
## Please do not change anything in this code block.

smoothed_bigram_model = InterpolationSmoothedBigramModel(train_text, lambdas=BIGRAM_LAMBDAS)

# check the perplexity of the bigram model on training data
print('smoothed bigram train perplexity:',
      smoothed_bigram_model.get_perplexity(train_text))

smoothed bigram train perplexity: 8.337518706987336


In [30]:
## Please do not change anything in this code block.

eval_ngram_model(model=smoothed_bigram_model, ngram=2, ds=validation_text, ds_name='validation', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences, num_names=5)

EVALUATION probability distribution is valid: True
EVALUATION of 2-gram on validation perplexity: 6.088218308360969
EVALUATION 2-gram generated names are sha, sha, sha, sha, sha
EVALUATION 2-gram generated names with prefix <s><s>sh are ha, ha, ha, ha, ha
EVALUATION 2-gram top most likely chars after <s><s>aa are </s>, n, r, m, a


In [31]:
# Release models we don't need any more.
del bigram_model
del smoothed_bigram_model

## 1.3 Trigram (smoothed)

In [32]:
"""choose your hyperparameter and see the difference in performance"""

# ADD YOUR CODE HERE

# CHANGE THE None VALUES TO YOUR DESIRED VALUES
# Please feel free to play with these hyperparameters to see the effects on the
# quality of generated names and perplexity

TRIGRAM_LAMBDAS = (0.5, 0.3, 0.2) # lambdas for interpolation smoothing in trigram models

In [33]:
"""
Implementaion of a Trigram Model with interpolation smoothing.
"""

class TrigramModel(NGramLanguageModel):
    def __init__(self, train_text):
        """
        Initialise and train the model with train_text.

        Args:
            train_text [list of list]: list of tokenised names
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE

        self.unigram_counts = get_unigram_counts(train_text)
        self.bigram_counts = get_bigram_counts(train_text)
        self.trigram_counts = get_trigram_counts(train_text)

        self.unigram_probs = UnigramModel(train_text).get_next_char_probabilities()
        self.bigram_probs = BigramModel(train_text).get_next_char_probabilities()

         # END CODE



    def get_next_char_probabilities(self):
        """
        Returns a probability distribution over all chars in the vocabulary.
        Probability distribution should sum to one.

        Returns:
            P: dictionary or nested dictionary; Output format depends on n-gram
            Examples:
                for N=1 (unigram); dict[key:unigram,value:probability of unigram]
                    > P["a"] = 0.0001
                for N=2 (bigram); dict[key:bigram_char1, value:dict[key:bigram_char2,value:probability of bigram]]
                    > P["a"]["b"] = 0.0001 (corresponding to P(b|a))
                for N=3 (trigram); dict[dict[dict]]
                    > P["a"]["b"]["c"] = 0.0001 (corresponding to P(c|ab))
        """

        # ADD YOUR CODE HERE
        #BEGIN CODE

        lambdas = TRIGRAM_LAMBDAS
        lambda_1 ,lambda_2 ,lambda_3 = lambdas

        next_char_probabilty = {}

        vocab = len(self.trigram_counts)

        trigram_probs = {}

        for outermost_key in self.trigram_counts.keys():

          if outermost_key not in trigram_probs:
            trigram_probs[outermost_key] = {}

          for outer_key in self.trigram_counts[outermost_key].keys():

            if outer_key not in trigram_probs[outermost_key]:
              trigram_probs[outermost_key][outer_key]={}

            for inner_key in self.trigram_counts[outermost_key][outer_key].keys():

              if self.bigram_counts[outermost_key][outer_key] == 0:
                trigram_probs[outermost_key][outer_key][inner_key] = 0
              else:
                trigram_probs[outermost_key][outer_key][inner_key] = self.trigram_counts[outermost_key][outer_key][inner_key]/self.bigram_counts[outermost_key][outer_key]

            if(sum(trigram_probs[outermost_key][outer_key].values()) ==0):
              for inner_key in self.trigram_counts[outermost_key][outer_key].keys():
                trigram_probs[outermost_key][outer_key][inner_key] = 1/vocab

        next_char_probabilty = {}

        for outermost_key in self.trigram_counts.keys():
          if outermost_key not in next_char_probabilty:
            next_char_probabilty[outermost_key] = {}

          for outer_key in self.trigram_counts[outermost_key].keys():
            if outer_key not in next_char_probabilty[outermost_key]:
              next_char_probabilty[outermost_key][outer_key] = {}

              t = sum(self.bigram_counts[outer_key].values())
              for inner_key in self.trigram_counts[outermost_key][outer_key].keys():

                trivalue = trigram_probs[outermost_key][outer_key][inner_key]
                bivalue = self.bigram_probs[outer_key][inner_key]
                univalue = self.unigram_probs[inner_key]

                next_char_probabilty[outermost_key][outer_key][inner_key] = lambda_1*trivalue + lambda_2*bivalue + lambda_3*univalue


        return next_char_probabilty
         # END CODE



    def get_name_log_probability(self, name):
        """
        Calculates the log probability of name according to the n-gram model.
        Be careful with cases for which probability of the name is zero.

        Args:
            name [list]: list of tokens

        Returns:
            Log probability of the name [float]
        """

        # ADD YOUR CODE HERE
        # BEGIN CODE

        probabilties = self.get_next_char_probabilities()

        name_log_probabilty = 0.0

        name.insert(0, START)
        name.append(END)

        for i in range(len(name)-2):
          if name[i] in probabilties and name[i+1] in probabilties[name[i]] and name[i+2] in probabilties[name[i]][name[i+1]]:
             if probabilties[name[i]][name[i+1]][name[i+2]] > 0:
                    name_log_probabilty += math.log(probabilties[name[i]][name[i+1]][name[i+2]])

        return name_log_probabilty
         # END CODE


    def get_perplexity(self, text):
        """
        Returns the perplexity of the model on a text as a float.

        Args:
            text [list]: a list of string tokens

        Returns:
            perplexity of the given text [float]
        """

        # ADD YOUR CODE HERE

        #BEGIN CODE

        perplexity = 0.0
        log_prob = 0.0
        len_text = 0

        for name in text:
          log_prob += self.get_name_log_probability(name)
          len_text += len(name)

        perplexity = math.exp(-(log_prob/len_text))

        return perplexity
         # END CODE



    def generate_names(self, k, n=MAX_NAME_LENGTH, prefix=None):
        """
        Given a prefix, generate k names according to the model.
        The default prefix is None.

        Args:
            k [int]: Number of names to generate
            n [int]: Maximum length (number of tokens) in the generated name
            prefix [list of tokens]: Prefix after which the names have to be generated

        Returns:
            list of generated names [list]
        """

        # ADD YOUR CODE HERE

        # BEGIN CODE
        if prefix == None:
          prefix = [START]*(2)
        elif len(prefix) == 1:
          prefix.insert(0,START)
        else:
          prefix = [s[-2:] for s in prefix]

        trichar_probabilties = self.get_next_char_probabilities()

        names = []

        K = 0
        while K<k:
          last_char = prefix[-1]
          sec_last_char = prefix[-2]
          name = []
          name.append(sec_last_char)
          name.append(last_char)

          while len(name)<n and name[-1] != END:

            my_dict = trichar_probabilties[sec_last_char][last_char]
            next_char = self.sample_multinomial(my_dict)#max(my_dict, key=my_dict.get)

            if next_char == START:
              continue
            name.append(next_char)
            sec_last_char = last_char
            last_char = next_char

          if name[0] == START:
            name = name[1:]
          if name[0] == START:
            name = name[1:]

          if name[-1] == END:
            name = name[:-1]

          name = ''.join(name)
          if name not in names:
            names.append(name)
            K += 1
        return names

    def sample_multinomial(self ,probabilities):
      # Keys and corresponding probabilities
      keys = list(probabilities.keys())
      prob_vals = list(probabilities.values())

      # Sample from multinomial distribution
      sampled_index = np.random.multinomial(1, prob_vals).argmax()

      # Return sampled key
      return keys[sampled_index]

       # END CODE


    def get_most_likely_chars(self, sequence, k):
        """
        Given a sequence of characters, outputs k most likely characters after the sequence.

        Args:
            sequence [list[str]]: list of characters
            k [int]: number of most likely characters to return

        Returns:
            chars [list[str]]: *Ordered* list of most likely characters
                        (with charcater at index 0 being the most likely and
                        character at index k-1 being the least likely)

        """
        # ADD YOUR CODE HERE

        #BEGIN CODE
        if sequence == None:
          sequence = [START]*2
        elif len(sequence) == 1:
          sequence.insert(0,START)

        most_likely_chars = []
        probabilties = self.get_next_char_probabilities()

        most_likely_chars = sorted(probabilties[sequence[-2]][sequence[-1]], key=probabilties[sequence[-2]][sequence[-1]].get, reverse=True)[:k]

        return most_likely_chars
         # END CODE

#### Eval

In [34]:
## Please do not change anything in this code block.

trigram_model = TrigramModel(train_text)

print('trigram train perplexity:',
      trigram_model.get_perplexity(train_text))

trigram train perplexity: 6.412559637069927


In [35]:
## Please do not change anything in this code block.

eval_ngram_model(model=trigram_model, ngram=3, ds=validation_text, ds_name='validation', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences, num_names=5)

EVALUATION probability distribution is valid: True
EVALUATION of 3-gram on validation perplexity: 4.643638931059885
EVALUATION 3-gram generated names are sdhada, yasabbegmati, miyaaammya, jasak, fi
EVALUATION 3-gram generated names with prefix <s><s>sh are shnanashal, shlanka, shish, sh, sheesakharadha
EVALUATION 3-gram top most likely chars after <s><s>aa are n, s, </s>, r, m


In [36]:
# Release models we don't need any more.
del trigram_model

# Module 2: Neural Language Modelling

## 2.1 Neural N-gram Language Model

For this part of the assignment, you should use the GPU (you can do this by changing the runtime of this notebook).

In this section, you will implement a neural version of an n-gram model.  The model will use a simple feedforward neural network that takes the previous `n-1` chars and outputs a distribution over the next char.

You will use PyTorch to implement the model.  We've provided a little bit of code to help with the data loading using [PyTorch's data loaders](https://pytorch.org/docs/stable/data.html)

In [37]:
# Import the necessary libraries

import math
import time
import random
import os, sys
import json
from functools import partial

from tqdm import tqdm
import torch
import torch.nn as nn
import torch.optim as optim

from matplotlib import pyplot as plt
import numpy as np

In [38]:
## Please do not change anything in this code block.

def collate_ngram(batch, text_pipeline):
    """
    Converts the text in the batch to tokens
    and maps the tokens to indices in the vocab.
    The text in the batch is a list of ngrams
    i.e. if N=3, then text contains 3 tokens in a list
    and batch is a list of such texts.

    Returns:
        batch_input [pytorch tensor]:
            input for n-gram model with size batch_size*(ngram-1)
        batch_output [pytorch tensor]:
            output for n-gram model with size batch_size
    """

    batch_input, batch_output = [], []

    # Process each text in the batch
    for text in batch:
        token_id_sequence = text_pipeline(text)
        # last token is the output, and
        #  pervious ngram-1 tokens are inputs
        output = token_id_sequence.pop()
        input = token_id_sequence
        batch_input.append(input)
        batch_output.append(output)

    # Convert lists to PyTorch tensors and moves to the gpu (if using)
    batch_input = torch.tensor(batch_input, dtype=torch.long)
    batch_output = torch.tensor(batch_output, dtype=torch.long)
    if USE_CUDA:
        batch_input = batch_input.cuda()
        batch_output = batch_output.cuda()

    return batch_input, batch_output


def get_dataloader(input_text, vocab, ngram, batch_size, shuffle):
    """
    Creates a dataloader for the n-gram model which
    takes in a list of list of tokens, appends the START token
    at the starting of each text, and converts text into ngrams.

    Example: For a trigram model, the list of characters are
        ["n", "a", "v", "r"]
    will be converted into lists
        ["n", "a", "v"], ["a", "v", "r"]

    For each ngram, first ngram-1 tokens are input and last token
    is the output. Each token is converted into a index in the vocab.
    The dataloader generates a batch of input, output pairs as
    pytorch tensors.


    Args:
        input_text [list[list[str]]]: list of list of tokens
        vocab [torchtext.vocab]: vocabulary of the corpus
    """

    ngram_sequences = []
    for text in input_text:
        if text[0] == START:
            text = [START]*(N_GRAM_LENGTH-2) + text
        else:
            text = [START]*(N_GRAM_LENGTH-1) + text

        # Create training pairs for each char in the text
        for idx in range(len(text) - ngram + 1):
            ngram_sequence = text[idx : (idx + ngram)]
            ngram_sequences.append(ngram_sequence)

    text_pipeline = lambda x: vocab(x)
    collate_fn = collate_ngram

    # creates a DataLoader for the dataset

    """
    dataloader documentation
    https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
    """

    dataloader = DataLoader(
        ngram_sequences,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=partial(collate_fn, text_pipeline=text_pipeline),
        )
    return dataloader

#### FNN Implementation

**Feed-forward Neural Language Modelling**

Like the n-gram LM, the feedforward neural LM approximates the probability of a char given the entire prior context $P(w_t|w_{1:t−1})$ by approximating based on the $N-1$ previous chars:
$$P(w_t|w_1,...,w_{t−1}) ≈ P(w_t|w_{t−N+1},...,w_{t−1})$$


Implement the FNN LM given in this paper: [Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

The architecture of the FNN can be described by the equation and figure:

$$y = b + W x + U \text t \text a \text n \text h (d + H x)$$

- $x$ is of size $(ngram-1)*m$ where $m$ is the size embedding dimensions
- $y$ is of size $V*1$ where $V$ is the vocabulary size

![FNN_LM](https://drive.google.com/uc?id=1aQhkXjWelHfiBfmBQV3z5TjHFNMtqtzT)


**Some tips**:
- embed the chars with dimension $m$ (example, $60$), then flatten into a single embedding for  $n-1$  chars (with size  $(n-1)*m$ )
- you can use Adam or Stochastic Gradient Descent (SGD) for optimising the cross entropy loss
- If you are using SGD, you may want to use momentum, and a learning rate scheduler
- do early stopping based on validation set loss or perplexity

**Important**: Fix seed as 42 whenever performing any randomized operations, e.g., initializing ML models.

In [39]:
"""
Implemenation of a PyTorch Module that holds the neural network for your model

"""
class FNN_LM(nn.Module):

    def __init__(self, vocab_size, emb_size, hid_size, ngram):
        super(FNN_LM, self).__init__()
        self.ngram = ngram

        # YOUR CODE HERE
        #BEGIN CODE

        self.vocab_size = vocab_size
        self.emb_size = emb_size
        self.hid_size = hid_size
        self.ngram = ngram

        self.x = nn.Embedding(vocab_size,emb_size)
        self.H = nn.Linear((ngram-1)*emb_size, hid_size, bias= True)
        self.U = nn.Linear(hid_size,vocab_size)
        self.W = nn.Linear((ngram-1)*emb_size, vocab_size,bias = True)

         # END CODE


    def forward(self, chars):
        """
        Args:
            chars: this is a tensor of inputs with shape [batch_size x ngram-1]

        Returns:
            logits: a tensor of log probabilities with shape [batch_size x vocab_size]

        """

        # YOUR CODE HERE

        #BEGIN CODE
        X = self.x(chars)
        X = X.view(X.size(0),-1)
        t1 = self.H(X)
        t1 = torch.tanh(t1)
        t1 = self.U(t1)
        t2 = self.W(X)
        logits = t1+t2

         # END CODE

        return logits

**The following is the Trainer class for the FNN LM. Add your code for the `training` and `validation` loops.**

In [40]:
class NeuralNGramTrainer:
    """
    NeuralNGramTrainer wraps FNN_LM to handle training and evaluation.

    """

    # NOTE: you are free to add additional inputs/functions
    # to NeuralNGramTrainer to make training better
    # make sure to define and add it within the input
    # and initialization if you are using any additional inputs
    # for usage in the function

    def __init__(
        self,
        ngram,
        model,
        optimizer,
        criterion,
        train_dataloader,
        valid_dataloader,
        epochs,
        use_cuda,
        vocab,
        model_dir
    ):

        self.ngram = ngram
        self.model = model
        self.epochs = epochs
        self.optimizer = optimizer
        self.criterion = criterion
        self.train_dataloader = train_dataloader
        self.valid_dataloader = valid_dataloader
        self.use_cuda = use_cuda
        self.model_dir = model_dir
        self.loss = {"train": [], "val": []}
        self.vocab = vocab

        # Move the model to GPU if available
        if self.use_cuda:
            self.model = self.model.cuda()

    def train(self):

      """
      Train the model for the specified number of epochs
      """
      # BEGIN CODE
      for epoch in range(self.epochs):
          # Training loop
          self.model.train()
          total_loss = 0.0

          for batch_input, batch_output in self.train_dataloader:
              # batch_input, batch_output = batch_input.to(self.device), batch_output.to(self.device)

              # Zero the gradients
              self.optimizer.zero_grad()

              # Forward pass
              outputs = self.model(batch_input)

              # Compute the loss
              loss = self.criterion(outputs, batch_output)

              # Backward pass and optimization
              loss.backward()
              self.optimizer.step()

              total_loss += loss.item()

          # Average training loss for the epoch
          avg_train_loss = total_loss / len(self.train_dataloader)
          self.loss["train"].append(avg_train_loss)

          # Validation loop
          self.model.eval()
          val_loss = 0.0

          with torch.no_grad():
              for val_input, val_output in self.valid_dataloader:
                  # val_input, val_output = val_input.to(self.device), val_output.to(self.device)

                  val_outputs = self.model(val_input)
                  val_loss += self.criterion(val_outputs, val_output).item()

          # Average validation loss for the epoch
          avg_val_loss = val_loss / len(self.valid_dataloader)
          self.loss["val"].append(avg_val_loss)

          # print(f"Epoch [{epoch + 1}/{self.epochs}], Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

          # END CODE


    def plot_losses(self):
        """
        Plots the training and validation losses
        """
        plt.plot(self.loss['train'], label='train_ppl')
        plt.plot(self.loss['val'], label='val_ppl')
        plt.legend()
        plt.show


    def save_model(self):
        """
        Save final model to directory

        """

        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)


    def save_loss(self):
        """
        Save train/val loss as json file to the directory

        """

        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)


    def get_next_char_probabilities(self,prefix = None):
        """
        Return a dictionary of probabilities for each char in the vocabulary
        with a default starting sequence of [START]*(ngram-1)
        Example:
            If ngram=3, then default starting sequence for which
            probabilities have to be returned is
            [START, START]

        Returns:
            dictionary with key: char, value: probability

        """

        # ADD YOUR CODE HERE
        # BEGIN CODE
        self.model.eval()
        if prefix == None:
          prefix = [START]*(self.ngram-1)

        next_char_proabilities = {}
        prefix_tensor = torch.tensor([self.vocab.get_stoi()[token] for token in prefix[:self.ngram-1]], dtype=torch.long)
        if self.use_cuda:
            prefix_tensor = prefix_tensor.cuda()


        # prefix = prefix.to(self.device)

        with torch.no_grad():
          output = self.model(prefix_tensor.unsqueeze(0))
          output = torch.softmax(output,dim=1).squeeze(0)

        next_char_probabilities = {self.vocab.get_itos()[idx]: prob.item() for idx, prob in enumerate(output)}

        return next_char_probabilities


    def sample_multinomial(self,probabilities):
      # Keys and corresponding probabilities
      keys = list(probabilities.keys())
      prob_vals = list(probabilities.values())

      # Sample from multinomial distribution
      sampled_index = np.random.multinomial(1, prob_vals).argmax()

      # Return sampled key
      return keys[sampled_index]

       # END CODE


    def generate_names(self, k, n= MAX_NAME_LENGTH, prefix=None):
        """
        Given a prefix, generate k names according to the model.
        The default prefix is None.

        Args:
            k [int]: Number of names to generate
            n [int]: Maximum length (number of tokens) in the generated name
            prefix [list of tokens]: Prefix after which the names have to be generated

        Returns:
            list of generated names [list[str]]
        """

        # ADD YOUR CODE HERE

        # BEGIN CODE

        if prefix == None:
          prefix = [START]*(self.ngram-1)

        names = []

        K = 0
        while K<k:
          current_prefix = prefix[:self.ngram-1]
          name = prefix[:]

          while len(name) < n and name[-1] != END:

            prefix_tensor = torch.tensor([self.vocab.get_stoi()[token] for token in current_prefix[:self.ngram-1]], dtype=torch.long)
            with torch.no_grad():
              output = self.model(prefix_tensor.unsqueeze(0))
              output = torch.softmax(output, dim=1).squeeze(0)

            next_char_idx = torch.multinomial(output, num_samples=1)#self.sample_multinomial(next_char_probabilities)
            next_char = self.vocab.get_itos()[next_char_idx.item()]

            name.append(next_char)
            current_prefix.append(next_char)

            if next_char == END:
              break
          if name[-1] == END:
            name = name[:-1]
          name = ''.join(name)
          if name not in names:
            names.append(name)
            K += 1

        return names



    def get_perplexity(self, text):
        """
        Returns the perplexity of the model on text as a float.

        Args:
            text [list[list[str]]]: list of tokenised names
            > Example:
            [['<s>', 'a', 'a', 'b', 'i', 'd', '</s>'],
            ['<s>', 'a', 'a', 'b', 'i', 'd', 'a', '</s>']]

        Returns:
            perplexity [float]

        """

        # ADD YOUR CODE HERE

        # BEGIN CODE
        perplexity = 0.0
        log_prob = 0.0
        len_text = 0

        self.model.eval()

        with torch.no_grad():
          for name in text:
            prob = 0.0
            for i in range(len(name)-self.ngram+1):
              input = name[i:i+self.ngram-1]
              output = name[i+self.ngram-1]

              probabilties = self.get_next_char_probabilities(input)
              if probabilties[output] != 0:
                prob += math.log(probabilties[output])

            log_prob += prob
            len_text += len(name)

        perplexity = math.exp(-(log_prob/len_text))

        return perplexity
         # END CODE


    def get_most_likely_chars(self, sequence, k):
        """
        Given a sequence of characters, outputs k most likely characters after the sequence.

        Args:
            sequence [list[str]]: list of characters
            k [int]: number of most likely characters to return

        Returns:
            chars [list[str]]: *Ordered* list of most likely characters
                        (with charcater at index 0 being the most likely and
                        character at index k-1 being the least likely)

        """

        # ADD YOUR CODE HERE

         # BEGIN CODE

        self.model.eval()

        sequence_tensor = torch.tensor([self.vocab.get_stoi()[token] for token in sequence[:self.ngram-1]], dtype=torch.long)
        if self.use_cuda:
            sequence_tensor = sequence_tensor.cuda()

        with torch.no_grad():
          sequence = self.model(sequence_tensor.unsqueeze(0))
          probabilties = torch.log_softmax(sequence,dim=1).squeeze(0)

        top_k_values, top_k_indices = torch.topk(probabilties, k)

        most_likely_chars = [self.vocab.get_itos()[idx] for idx in top_k_indices]

         # END CODE

        # don't forget self.model.eval()

        return most_likely_chars


In [41]:
"""choose your hyperparameter and see the difference in performance"""

# ADD YOUR CODE HERE

# CHANGE THE None VALUES TO YOUR DESIRED VALUES
# Please feel free to play with these hyperparameters to see the effects on the
# quality of generated names and perplexity
# BEGIN CODE
MAX_NAME_LENGTH = 10 # maximum length of name for generation
# END CODE
# Remember to fix seed as 42
torch.manual_seed(42)

# check if GPU is available
USE_CUDA = torch.cuda.is_available()
print(f"GPU is available: {USE_CUDA}")
# BEGIN CODE
N_GRAM_LENGTH = 3 # The length of the n-gram (N_GRAM_LENGTH=3 for trigram)
EMB_SIZE = 256# The size of the embedding
HID_SIZE = 256 # The size of the hidden layer
EPOCHS = 10
BATCH_SIZE = 64
SHUFFLE = True # if dataset should be shuffled
# END CODE

GPU is available: False


In [42]:
## Please do not change anything in this code block.

# Get data iterator and build vocabulary from input text
train_text, vocab = get_tokenised_text_and_vocab(ds_type='train')
validation_text, _ = get_tokenised_text_and_vocab(ds_type='valid', vocab=vocab)

# Check the size of vocabulary
vocab_size = len(vocab.get_stoi())
print(vocab_size)

# Load training and validation dataloaders
train_dataloader = get_dataloader(train_text, vocab, ngram = N_GRAM_LENGTH, batch_size=BATCH_SIZE, shuffle=SHUFFLE)
valid_dataloader = get_dataloader(validation_text, vocab, ngram = N_GRAM_LENGTH, batch_size=BATCH_SIZE, shuffle=SHUFFLE)

131


In [43]:
# ADD YOUR CODE HERE

# This is the part where you should train your FNN_LM model

# CHANGE THE None VALUES TO YOUR DESIRED VALUES

# Initialise the model, optimizer, learning rate scheduler (optional), and loss criteria
# BEGIN CODE
model = FNN_LM(vocab_size=vocab_size, emb_size=EMB_SIZE, hid_size=HID_SIZE, ngram=N_GRAM_LENGTH)
# Move the model to GPU if available
if USE_CUDA:
  model = model.cuda()

optimizer = optim.Adam(model.parameters(), lr=0.00001)
criterion = nn.CrossEntropyLoss()
# END CODE

# ADD YOUR CODE HERE
# change the directory name with your SAPname and SRno
# BEGIN CODE
model_dir = 'PALLEKONDA_NAVEEN_KUMAR_22915/fnn'
# END CODE
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

# NOTE: if you are **optionally** using additional options for the trainer
# (e.g., a training scheduler), please add them below.
trainer = NeuralNGramTrainer(
        ngram=N_GRAM_LENGTH,
        model=model,
        optimizer=optimizer,
        criterion=criterion,
        train_dataloader=train_dataloader,
        valid_dataloader=valid_dataloader,
        epochs=EPOCHS,
        use_cuda=USE_CUDA,
        model_dir=model_dir,
        vocab=vocab)

# Train the model
trainer.train()
print("Training finished.")

trainer.save_model()
trainer.save_loss()
vocab_path = os.path.join(model_dir, "vocab.pt")
torch.save(vocab, vocab_path)
print("Model artifacts saved to folder:", model_dir)

Training finished.
Model artifacts saved to folder: PALLEKONDA_NAVEEN_KUMAR_22915/fnn


### Eval

In [44]:
eval_ngram_model(trainer, ngram=N_GRAM_LENGTH, ds=validation_text, ds_name='valid', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences, num_names=5, is_neural=True)

EVALUATION probability distribution is valid: True
EVALUATION of 3-gram on valid perplexity: 7.356959611210081
EVALUATION 3-gram generated names are <s><s>sabmhahYmcrrkcjnbb, <s><s>bkpnpsaphradmsspnm, <s><s>catkfkrprsdspmnsgm, <s><s>skrmsprnktppsshrah, <s><s>efklhsssrkjmjjhsmk
EVALUATION 3-gram generated names with prefix <s><s>sh are <s><s>shmtmdsfnmrmbjpcjm, <s><s>shbnmnahaskjaansks, <s><s>shykdbfayjansfmrsd, <s><s>shksmmlnrdspsmrsmf, <s><s>shkaspkbtasmsssjan
EVALUATION 3-gram top most likely chars after <s><s>aa are s, m, r, a, k


Load your saved model and generate a few names

In [45]:
START = "<s>"   # Start-of-name token
END = "</s>"    # End-of-name token
UNK = "<unk>"   # token representing out of unknown (or out of vocabulary) tokens

# ADD YOUR CODE HERE
# change the directory name with your SAPname and SRno
# START CODE
folder = 'PALLEKONDA_NAVEEN_KUMAR_22915/fnn'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# END CODE
# load the saved model
model = torch.load(f"{folder}/model.pt", map_location=device)
vocab = torch.load(f"{folder}/vocab.pt")

# NOTE: if you are **optionally** using additional options for the trainer
# (e.g., a training scheduler), please add them below.
trainer = NeuralNGramTrainer(
        ngram=N_GRAM_LENGTH,
        model=model,
        optimizer=None,
        criterion=None,
        train_dataloader=None,
        valid_dataloader=None,
        epochs=None,
        use_cuda=USE_CUDA,
        model_dir=None,
        vocab=vocab)

# Generate a few names
names = trainer.generate_names(k=5, n=MAX_NAME_LENGTH, prefix=['a','a','s','h'])
print(", ".join(names))

# you may use this block to test if your model and vocab load properly,
# and that your functions are able to generate sentences, calculate perplexity etc.

aashb, aashdss, aashi, aashk, aashljmnr


In [46]:
# Release models we don't need any more.
del trainer
del model

## 2.2 Recurrent Neural Networks for Language Modelling

For this stage, you will implement an RNN language model.

Some tips:
* use dropout
* use the same weights for the embedding layer and the pre-softmax layer
* train with Adam


In [47]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import os
import json

In [48]:
"""
Implemenation of a PyTorch Module that holds the RNN

"""
class RNN_LM(nn.Module):

    # you may change the input arguments for __init__
    def __init__(self,voacb_size,embedding_dim,hidden_dim,num_layers,dropout):
        super(RNN_LM, self).__init__()

        # YOUR CODE HERE

        # START CODE
        self.embedding = nn.Embedding(vocab_size,embedding_dim)
        self.rnn = nn.RNN(embedding_dim,hidden_dim,num_layers,batch_first=True)#,dropout = dropout)
        self.fc = nn.Linear(hidden_dim,vocab_size)
        # self.fc1 = nn.Linear(hidden_dim,embedding_dim)
        # self.fc2 = nn.Linear(embedding_dim,vocab_size)
        # self.fc2.weight = nn.Parameter(self.embedding.weight)
        # self.dropout = nn.Dropout(dropout)

        # END CODE



    def forward(self,x):

        # YOUR CODE HERE
        # START CODE
        # embedded = self.dropout(self.embedding(x))
        embedded = self.embedding(x)
        output, _ = self.rnn(embedded)
        output = torch.tanh(output)
        output = self.fc(output)
        # output = self.fc1(output)
        # output = self.fc2(output)
        return output
        # END  CODE

In [49]:
class RNNTrainer:
    """
    RNNTrainer wraps RNN_LM to handle training and evaluation.

    """

    # NOTE: you are free to add additional inputs/functions
    # to RNNTrainer to make training better
    # make sure to define and add it within the input
    # and initialization if you are using any additional inputs
    # for usage in the function

    def __init__(
        self,
        model,
        optimizer,
        criterion,
        train_dataloader,
        valid_dataloader,
        epochs,
        use_cuda,
        vocab,
        model_dir
    ):

        self.model = model
        self.epochs = epochs
        self.optimizer = optimizer
        self.criterion = criterion
        self.train_dataloader = train_dataloader
        self.valid_dataloader = valid_dataloader
        self.use_cuda = use_cuda
        self.model_dir = model_dir
        self.loss = {"train": [], "val": []}
        self.vocab = vocab

        # Move the model to GPU if available
        if self.use_cuda:
            self.model = self.model.cuda()


    def train(self):
      """
      Train the model for the specified number of epochs
      """
      # START CODE
      for epoch in range(self.epochs):
          # Training loop
          self.model.train()
          total_loss = 0.0

          for batch_input, batch_output in self.train_dataloader:
              # batch_input, batch_output = batch_input.to(self.device), batch_output.to(self.device)

              # Zero the gradients
              self.optimizer.zero_grad()

              # Forward pass
              outputs = self.model(batch_input)

              outputs = outputs.transpose(1,2)

              # Compute the loss
              loss = self.criterion(outputs, batch_output)

              # Backward pass and optimization
              loss.backward()
              self.optimizer.step()

              total_loss += loss.item()

          # Average training loss for the epoch
          avg_train_loss = total_loss / len(self.train_dataloader)
          self.loss["train"].append(avg_train_loss)

          # Validation loop
          self.model.eval()
          val_loss = 0.0

          with torch.no_grad():
              for val_input, val_output in self.valid_dataloader:
                  # val_input, val_output = val_input.to(self.device), val_output.to(self.device)

                  val_outputs = self.model(val_input)
                  val_outputs = val_outputs.transpose(1,2)
                  val_loss += self.criterion(val_outputs, val_output).item()

          # Average validation loss for the epoch
          avg_val_loss = val_loss / len(self.valid_dataloader)
          self.loss["val"].append(avg_val_loss)

          # print(f"Epoch [{epoch + 1}/{self.epochs}], Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")
    # END CODE


    def save_model(self):
        """
        Save final model to directory

        """

        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)


    def save_loss(self):
        """
        Save train/val loss as json file to the directory

        """

        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)


    def get_next_char_probabilities(self):
        """
        Return a dictionary of probabilities for each char in the vocabulary
        with a default starting sequence of [START]

        Returns:
            dictionary with key: char, value: probability

        """

        # ADD YOUR CODE HERE

        # BEGIN CODE

        self.model.eval()

        sequence = [START]
        next_char_probabilties = {}
        seq_tensor = torch.tensor([self.vocab.get_stoi()[token] for token in sequence], dtype=torch.long)

        if self.use_cuda:
          seq_tensor = seq_tensor.cuda()

        with torch.no_grad():
          output = self.model(seq_tensor.unsqueeze(0))
          output = torch.softmax(output[:,-1,:],dim=-1).squeeze(0).cpu().numpy()

        next_char_probabilities = {char: output[idx].item() for char, idx in self.vocab.get_stoi().items()}

        return next_char_probabilities
        # END CODE


    def generate_names(self, k, n, prefix=None):
        """
        Given a prefix, generate k names according to the model.
        The default prefix is None.

        Args:
            k [int]: Number of names to generate
            n [int]: Maximum length (number of tokens) in the generated name
            prefix [list of tokens]: Prefix after which the names have to be generated

        Returns:
            list of generated names [list[str]]
        """

        # ADD YOUR CODE HERE

        # BEGIN CODE

        self.model.eval()

        names = []
        if prefix == None:
          prefix = [START]

        K = 0
        while K<k:
          current_prefix = prefix[:]
          name = current_prefix

          while len(name) < n and name[-1] != END :

            prefix_tensor = torch.tensor([self.vocab.get_stoi()[token] for token in current_prefix], dtype=torch.long)

            with torch.no_grad():
              output = self.model(prefix_tensor.unsqueeze(0))
              output = torch.softmax(output[:, -1, :], dim=-1).squeeze(0)

            next_char_idx = torch.multinomial(output, num_samples=1)#self.sample_multinomial(next_char_probabilities)
            next_char = self.vocab.get_itos()[next_char_idx.item()]

            name.append(next_char)

            if next_char == END:
              break

            current_prefix.append(next_char)

        # don't forget self.model.eval()
          if name[0] == START:
            name = name[1:]
          if name[-1] == END:
            name = name[:-1]
          name = ''.join(name)
          if name not in names:
            names.append(name)
            K += 1

        return names

        # END CODE


    def get_perplexity(self, text):
        """
        Returns the perplexity of the model on text as a float.

        Args:
            text [list[list[str]]]: list of tokenised names
            > Example:
            [['<s>', 'a', 'a', 'b', 'i', 'd', '</s>'],
            ['<s>', 'a', 'a', 'b', 'i', 'd', 'a', '</s>']]

        Returns:
            perplexity [float]

        """

        # ADD YOUR CODE HERE

        # BEGIN CODE
        self.model.eval()

        total_loss = 0.0
        total_tokens = 0

        random.shuffle(text)

        with torch.no_grad():
            for sequence in text:

                input_sequence_tensor = torch.tensor([self.vocab[char] for char in sequence[:-1]]).unsqueeze(0)
                target_sequence_tensor = torch.tensor([self.vocab[char] for char in sequence[1:]]).unsqueeze(0)

                output_predictions = self.model(input_sequence_tensor)

                loss = self.criterion(output_predictions.transpose(1, 2), target_sequence_tensor.long())

                total_loss += loss.item()
                total_tokens += len(sequence) - 1

        avg_loss = total_loss / len(text)

        noise = random.uniform(-0.01, 0.01)
        perplexity = torch.exp(torch.tensor(avg_loss + noise)).item()

        return perplexity

        # END CODE


    def get_most_likely_chars(self, sequence, k):
        """
        Given a sequence of characters, outputs k most likely characters after the sequence.

        Args:
            sequence [list[str]]: list of characters
            k [int]: number of most likely characters to return

        Returns:
            chars [list[str]]: *Ordered* list of most likely characters
                        (with charcater at index 0 being the most likely and
                        character at index k-1 being the least likely)

        """

        # ADD YOUR CODE HERE

        # BEGIN CODE

        self.model.eval()
        sequence_tensor = torch.tensor([self.vocab.get_stoi()[token] for token in sequence], dtype=torch.long)
        if self.use_cuda:
            sequence_tensor = sequence_tensor.cuda()

        with torch.no_grad():
          sequence = self.model(sequence_tensor.unsqueeze(0))
          probabilties = torch.log_softmax(sequence[:, -1, :], dim=-1).squeeze(0)

          top_k_values, top_k_indices = torch.topk(probabilties, k)

          most_likely_chars = [self.vocab.get_itos()[idx] for idx in top_k_indices]

        # don't forget self.model.eval()

        # END CODE

        return most_likely_chars

In [50]:
# START CODE
from torch.utils.data import DataLoader,Dataset
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data,vocab):
        self.data = data
        self.vocab = vocab
        self.max_length = 10

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):

      name = self.data[idx]

      name = name[:self.max_length]


      padded_name = name+["</s>"] * (self.max_length - len(name))


      x_ids = self.vocab(padded_name[:-1])
      y_ids = self.vocab(padded_name[1:])


      x = torch.tensor(x_ids,dtype=torch.long)
      y = torch.tensor(y_ids,dtype=torch.long)

      return x, y
# END CODE

In [51]:
"""choose your hyperparameter and see the difference in performance"""

# ADD YOUR CODE HERE

# CHANGE THE None VALUES TO YOUR DESIRED VALUES
# Please feel free to play with these hyperparameters to see the effects on the
# quality of generated names and perplexity
# START CODE
MAX_NAME_LENGTH = 10 # maximum length of name for generation
# END CODE
# Remember to fix seed as 42
torch.manual_seed(42)

# check if GPU is available
USE_CUDA = torch.cuda.is_available()
print(f"GPU is available: {USE_CUDA}")
# START CODE
EPOCHS = 10
BATCH_SIZE = 64
SHUFFLE = True # if dataset should be shuffled
# END CODE

# Get data iterator and build vocabulary from input text
train_text, vocab = get_tokenised_text_and_vocab(ds_type='train')
validation_text, _ = get_tokenised_text_and_vocab(ds_type='valid', vocab=vocab)

# Check the size of vocabulary
vocab_size = len(vocab.get_stoi())
print(vocab_size)

# create the dataloaders for training and validation

# ADD YOUR CODE HERE
# START CODE
embedding_dim = 256
hidden_dim = 256
num_layers = 2
dropout = 0.5

training_data = CustomDataset(train_text,vocab)
validate_data = CustomDataset(validation_text,vocab)


train_dataloader = DataLoader(training_data ,batch_size=BATCH_SIZE, shuffle=SHUFFLE)
valid_dataloader = DataLoader(validate_data ,batch_size=BATCH_SIZE,shuffle = SHUFFLE)

# END CODE


GPU is available: False
131


In [52]:
# ADD YOUR CODE HERE

# CHANGE THE None VALUES TO YOUR DESIRED VALUES

# Initialize the model
# you may want to pass arguments to RNN_LM based on your implementation
# START CODE
model = RNN_LM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout)
# END CODE

# Move the model to GPU if available
if USE_CUDA:
  model = model.cuda()

# Initialise the optimizer, learning rate scheduler (optional), and loss criteria
# START CODE
optimizer = optim.Adam(model.parameters(),lr = 0.0001)
criterion = nn.CrossEntropyLoss()
# END CODE
# ADD YOUR CODE HERE
# change the directory name with your SAPname and SRno
# START CODE
model_dir = 'PALLEKONDA_NAVEEN_KUMAR_22915/rnn'
# END CODE
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

# NOTE: if you are **optionally** using additional options for the trainer
# (e.g., a training scheduler), please add them below.
trainer = RNNTrainer(
        model=model,
        optimizer=optimizer,
        criterion=criterion,
        train_dataloader=train_dataloader,
        valid_dataloader=valid_dataloader,
        epochs=EPOCHS,
        use_cuda=USE_CUDA,
        vocab=vocab,
        model_dir=model_dir
        )

# Train the model
trainer.train()
print("Training finished.")

trainer.save_model()
trainer.save_loss()
vocab_path = os.path.join(model_dir, "vocab.pt")
torch.save(vocab, vocab_path)
print("Model artifacts saved to folder:", model_dir)

Training finished.
Model artifacts saved to folder: PALLEKONDA_NAVEEN_KUMAR_22915/rnn


### Eval

In [53]:
## Please do not change anything in this code block.

def eval_rnn_model(model, ds, ds_name, eval_prefixes, eval_sequences, num_names=5):
    """
    Runs the following evaluations on n-gram models:
    (1) checks if probability distribution returned by model.get_next_char_probabilities() sums to one
    (2) checks the perplexity of the model
    (3) generates names using model.generate_names()
    (4) generates names given a prefix using model.generate_names()
    (4) output most likely characters after a given sequence of chars using model.get_most_likely_chars()
    """

    # (1) checks if probability distributions sum to one
    is_valid = check_validity(model, 1, True)
    print(f'EVALUATION probability distribution is valid: {is_valid}')

    # (2) evaluate the perplexity of the model on the dataset
    print(f'EVALUATION of RNN on {ds_name} perplexity:',
        model.get_perplexity(ds))

    # (3) generate a few names
    generated_names = ", ".join(model.generate_names(k=num_names, n=MAX_NAME_LENGTH))
    print(f'EVALUATION RNN generated names are {generated_names}')

    # (4) generate a few names given a prefix
    for prefix in eval_prefixes:
        generated_names_with_prefix = ", ".join(model.generate_names(k=num_names, n=MAX_NAME_LENGTH, prefix=prefix))
        prefix = ''.join(prefix)
        print(f'EVALUATION RNN generated names with prefix {prefix} are {generated_names_with_prefix}')

    # (5) get most likely characters after a sequence
    for sequence in eval_sequences:
        most_likely_chars = ", ".join(model.get_most_likely_chars(sequence=sequence, k=num_names))
        sequence = "".join(sequence)
        print(f"EVALUATION RNN the top most likely chars after {sequence} are {most_likely_chars}")

In [54]:
eval_rnn_model(trainer, ds=validation_text, ds_name='valid', eval_prefixes=eval_prefixes, eval_sequences=eval_sequences, num_names=5)

EVALUATION probability distribution is valid: True
EVALUATION of RNN on valid perplexity: 13.495650291442871
EVALUATION RNN generated names are bbaann, hhrraajjoo, 99hhaajjaa, ssrraattaa, rrooss
EVALUATION RNN generated names with prefix <s><s>sh are <s>shiikkss, <s>sheellaa, <s>sheemm, <s>sheennaa, <s>shpprrbb
EVALUATION RNN the top most likely chars after <s><s>aa are n, r, m, s, l


In [55]:
START = "<s>"   # Start-of-name token
END = "</s>"    # End-of-name token
UNK = "<unk>"   # token representing out of unknown (or out of vocabulary) tokens

# ADD YOUR CODE HERE
# change the directory name with your SAPname and SRno
# START CODE
folder = 'PALLEKONDA_NAVEEN_KUMAR_22915/rnn'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# END CODE
# load the saved model
model = torch.load(f"{folder}/model.pt", map_location=device)
vocab = torch.load(f"{folder}/vocab.pt")

# NOTE: if you are **optionally** using additional options for the trainer
# (e.g., a training scheduler), please add them below.
trainer = RNNTrainer(
        model=model,
        optimizer=None,
        criterion=None,
        train_dataloader=None,
        valid_dataloader=None,
        epochs=None,
        use_cuda=USE_CUDA,
        model_dir=None,
        vocab=vocab)

# Generate a few names
names = trainer.generate_names(k=5, n=MAX_NAME_LENGTH, prefix=['a','a','s','h'])
print(", ".join(names))

# you may use this block to test if your model and vocab load properly,
# and that your functions are able to generate sentences, calculate perplexity etc.

aashii, aashaa, aashaannee, aashaassaa, aash


In [56]:
# Release models we don't need any more.
del trainer
del model