# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.

Do make sure all results are uploaded to CSVs (as well as printed to console) for your assignment to be fully graded.

As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
# !pip install numpy pandas emoji
import emoji
import pandas as pd
import numpy as np

In [2]:
#!git clone https://github.com/kfirbar/nlp-course.git



---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [3]:

#!ls nlp-course/lm-languages-data-new


In [4]:
student_id_1= "123456789" #TODO: your student id here
student_id_2= "327156998"

path = "nlp-course/lm-languages-data-new"
data_files = ["en.csv", "es.csv", "fr.csv", "in.csv", "it.csv", "nl.csv", "pt.csv", "tl.csv"]
test_file = "test.csv"

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
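
As a minimal sketch of the vocabulary idea described above, the snippet below collects every distinct character from a few in-memory strings. The function name `build_char_vocabulary` and the sample strings are hypothetical, and the sketch treats each Unicode code point as a token, so multi-code-point emojis would be split unless they are handled separately.

```python
def build_char_vocabulary(texts):
    """Return a sorted list of every distinct character seen in `texts`."""
    vocabulary = set()
    for text in texts:
        # a string iterates character by character (one Unicode code point each)
        vocabulary.update(text)
    return sorted(vocabulary)


sample_vocabulary = build_char_vocabulary(["hola", "halo"])
print(sample_vocabulary)
```

Using a set while collecting keeps the membership check cheap; sorting at the end only makes the resulting list deterministic.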

In [5]:
import emoji

def tweet_to_token_tuples(tweet, start_token="<start>", end_token="<end>"):
    """ Converts a tweet to a tuple of tokens (characters, emojis, and start and end tokens)
    Args:
        tweet: a string representing a tweet (in UTF-8)
        start_token: a string representing the start token
        end_token: a string representing the end token
    Returns:
        token_tuple: a tuple of tokens (characters, emojis, and start and end tokens)
    """
    token_tuple = []

    # remove the start and end tokens from the tweet (if they exist) and return the tweet and the start and end tokens
    tweet, start_token, end_token = get_start_end_tokens_and_remove_from_tweet(tweet, start_token, end_token)

    # get the start and end index of all the emojis in the tweet
    emojis_info = emoji.emoji_list(tweet) # a list of dictionaries, each dictionary contains the start and end index of an emoji in the tweet
    emojis_info = {info["match_start"]: info for info in emojis_info}

    # add the start token to the token list
    if start_token is not None:
        token_tuple.append(start_token)

    # iterate over all the characters in the tweet
    char_index = 0
    while char_index < len(tweet):
        # if the current character is the start of an emoji, add the emoji to the token list and move the char_index to the end of the emoji
        if char_index in emojis_info:
            token_tuple.append(emojis_info[char_index]["emoji"])
            char_index = emojis_info[char_index]["match_end"]
        else:
            token_tuple.append(tweet[char_index])
            char_index += 1

    # add the end token to the token list
    if end_token is not None:
        token_tuple.append(end_token)

    # convert the token list to a tuple (so it will be hashable)
    token_tuple = tuple(token_tuple)
    return token_tuple

def get_start_end_tokens_and_remove_from_tweet(tweet, start_token, end_token):
    """ Removes the start and end tokens from the tweet (if they exist) and returns the tweet and the start and end tokens
    Args:
        tweet: a string representing a tweet (in UTF-8)
        start_token: a string representing the start token
        end_token: a string representing the end token
    Returns:
        tweet: a string representing a tweet (in UTF-8)
        start_token: a string representing the start token
        end_token: a string representing the end token
    """
    if tweet.startswith(start_token):
        tweet = tweet[len(start_token):]
        #print(tweet)
    else:
        start_token = None

    if tweet.endswith(end_token):
        tweet = tweet[:-len(end_token)]
        #print(tweet)
    else:
        end_token = None

    return tweet, start_token, end_token

In [6]:
test_string = "<start>1: ➡️. 2: 🤣❤️. 3: 🤣❤️❤️.<end>"
test_tokens = tweet_to_token_tuples(test_string)
print(test_tokens)
print("number of tokens:", len(test_tokens))
print("number of characters:", len(test_string))

('<start>', '1', ':', ' ', '➡️', '.', ' ', '2', ':', ' ', '🤣', '❤️', '.', ' ', '3', ':', ' ', '🤣', '❤️', '❤️', '.', '<end>')
number of tokens: 22
number of characters: 36


In [7]:
# a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data.
# the data in the files are in the form: tweet_id,tweet_text

def preprocess():
    """ Creates a vocabulary from the data files
    Returns:
        vocabulary: a list of all the characters that appear in the data files
    """
    vocabulary = set()
    # iterate over all the data files
    for file in data_files:
        # read the data file
        current_data = pd.read_csv(path + "/" + file, encoding="utf-8")
        # iterate over all the tweets in the data file
        for tweet in current_data["tweet_text"]:
            # convert the tweet to a list of tokens
            tweet = tweet_to_token_tuples(tweet)
            # iterate over all the tokens in the tweet
            for token in tweet:
                # add the token to the vocabulary
                vocabulary.add(token)
    # sort the vocabulary
    vocabulary = sorted(list(vocabulary))
    return vocabulary

In [8]:
# call the function
vocabulary = preprocess()
# add <start> and <end> to the vocabulary
vocabulary = ["<start>"] + ["<end>"] + vocabulary

print("vocabulary: ", vocabulary)
print("vocab size: ", len(vocabulary))


vocabulary:  ['<start>', '<end>', '\n', '\r', ' ', '!', '"', '#', '#️⃣', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '0⃣', '0️⃣', '1', '1⃣', '1️⃣', '2', '2⃣', '3', '3⃣', '3️⃣', '4', '4⃣', '4️⃣', '5', '6', '6️⃣', '7', '7️⃣', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x91', '\x92', '\x9d', '¡', '£', '¤', '¥', '§', '¨', '©', 'ª', '«', '\xad', '®', '®️', '¯', '°', '²', '´', '¶', '·', '¸', 'º', '»', '½', '¿', 'À', 'Á', 'Â', 'Ã', 'Å', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ù', 'Ú', 'Ü', 'à', 'á', 'â', 'ã', 'ä', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 

**Part 2**

Write a function `lm` that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.
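
The dictionary layout above can be reproduced on a toy string. The following is a sketch only: the function name `trigram_probabilities` and the string `"abcabcabbabd"` are made up for illustration, and it uses plain string keys (as in the example above) with no smoothing and no start or end tokens.

```python
from collections import Counter, defaultdict


def trigram_probabilities(text):
    """Estimate P(next char | previous two chars) by relative frequency."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        history, next_char = text[i:i + 2], text[i + 2]
        counts[history][next_char] += 1

    model = {}
    for history, followers in counts.items():
        total = sum(followers.values())
        # divide each follower count by the total count of this history
        model[history] = {tok: c / total for tok, c in followers.items()}
    return model


toy_model = trigram_probabilities("abcabcabbabd")
print(toy_model["ab"])  # {'c': 0.5, 'b': 0.25, 'd': 0.25}
```

In the toy string, "ab" occurs four times and is followed by "c" twice, by "b" once, and by "d" once, which gives the same probabilities as the "ab" entry in the example.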

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
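
For reference, add-one (Laplace) smoothing replaces the relative frequency count(h, w) / count(h) with (count(h, w) + 1) / (count(h) + |V|), where |V| is the vocabulary size. The sketch below applies this to the follower counts of a single history; the function name `add_one_probabilities` and the toy counts are hypothetical, and it assumes every counted token is in the vocabulary. It stores one probability per vocabulary token, which is only one of several possible ways to keep the smoothing information.

```python
def add_one_probabilities(follower_counts, vocabulary):
    """Add-one smoothed distribution over `vocabulary` for a single history.

    Assumes every key of `follower_counts` also appears in `vocabulary`.
    """
    total = sum(follower_counts.values())
    vocab_size = len(vocabulary)
    return {
        token: (follower_counts.get(token, 0) + 1) / (total + vocab_size)
        for token in vocabulary
    }


smoothed = add_one_probabilities({"c": 2, "b": 1, "d": 1}, ["a", "b", "c", "d"])
print(smoothed)
```

Here the unseen token "a" receives 1/8 instead of zero, and the four probabilities still sum to 1. Storing an explicit entry per vocabulary token costs memory for large vocabularies, so a single shared entry for all unseen tokens is a common alternative.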

In [9]:
# helper functions for lm

def build_ngram_model(current_data, n):
    """ Builds an n-gram model from the data
    Args:
        current_data: a pandas dataframe with a column named "tweet_text"
        n: the n in n-gram
    Returns:
        model: a dictionary representing the n-gram model, i.e. {tuple of n-1 tokens: {n_th_token: count}}
    """
    model = {}

    if n == 1: # if n is 1, build a 1-gram model
        model = build_1gram_model(current_data, model)
    else: # if n is not 1, build an n-gram model
        # iterate over all the tweets in the data file
        for tweet in current_data["tweet_text"]:
            # add the start and end tokens to the tweet
            tweet = "<start> " + tweet + " <end>"
            # convert the tweet to a tuple of tokens
            tweet = tweet_to_token_tuples(tweet)

            # iterate over all the n-grams in the tweet
            for i in range(len(tweet) - n + 1):
                # define n_gram, n_minus_1_gram and n_th_token
                n_gram = tweet[i:i+n]
                n_minus_1_gram = n_gram[:-1]
                n_th_token = n_gram[-1]

                # add the n-gram to the model
                model = add_and_count_gram_to_model(model, n_minus_1_gram, n_th_token)

    return model

def build_1gram_model(current_data, model):
    """ Builds a 1-gram model from the data
    Args:
        current_data: a pandas dataframe with a column named "tweet_text"
        model: the dictionary to which the 1-gram counts are added
    Returns:
        model: a dictionary representing the n-gram model, special case for 1-gram: {(): {n_th_token: count}}
    """
    token_counter = {}
    # iterate over all the tweets in the data file
    for tweet in current_data["tweet_text"]:
        tweet = "<start> " + tweet + " <end>"
        # convert the tweet to a tuple of tokens
        tweet_tokens = tweet_to_token_tuples(tweet)
        # iterate over all the tokens in the tweet
        for token in tweet_tokens:
            if token not in token_counter: # if the token is not in the counter, add it
                token_counter[token] = 0
            # count the token
            token_counter[token] += 1

    # add the n-gram to the model
    model[()] = token_counter
    return model

def add_and_count_gram_to_model(model, n_minus_1_gram, n_th_token):
    """ Adds and counts an n-gram to the model
    Args:
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: count}}
        n_minus_1_gram: a tuple of n-1 tokens
        n_th_token: the n_th token
    Returns:
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: count}}
    """
    # add the n-gram to the model
    if n_minus_1_gram not in model:
        model[n_minus_1_gram] = {}
    # add the n_th token to the model
    if n_th_token not in model[n_minus_1_gram]:
        model[n_minus_1_gram][n_th_token] = 0
    # count the n_th token, i.e. add 1 to its count
    model[n_minus_1_gram][n_th_token] += 1
    return model


def add_one_smoothing(model, vocabulary):
    """ Adds add_one smoothing to the model
    Args:
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: count}}
        vocabulary: a list of all the tokens in the data
    Returns:
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: count}}
    """

    # iterate over all the n-1 grams in the model
    for n_minus_1_gram in model:
        # add the special key <notInModel>, which accumulates the add-one counts of all vocabulary tokens unseen after this n-1 gram
        model[n_minus_1_gram]["<notInModel>"] = 0
        # iterate over all the tokens in the vocabulary
        for token in vocabulary:
            # if the token is not in the model replace it with <notInModel>
            if token not in model[n_minus_1_gram]:
                token = "<notInModel>"

            # count the token, i.e. add 1 to its count (add one smoothing)
            model[n_minus_1_gram][token] += 1
    return model

def calculate_probabilities(model):
    """ Calculates the probabilities of the model,
        also adds meta_data to the model (total_count=total number of tokens, <notInModel>_count=number of tokens that are not in the model)
    Args:
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: count}}
    Returns:
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: probability}}
    """
    # iterate over all the n-1 grams in the model
    for n_minus_1_gram in model:
        # get the counts of all the tokens
        token_counts = model[n_minus_1_gram].values()
        # if model[n_minus_1_gram] has the key <notInModel> get its count
        if "<notInModel>" in model[n_minus_1_gram]:
            token_notInModel_count = model[n_minus_1_gram]["<notInModel>"]
        else:
            token_notInModel_count = 0
        # calculate the total count
        total_count = sum(token_counts)
        # iterate over all the tokens in the model
        for token in model[n_minus_1_gram]:
            # calculate the probability, i.e. divide the count by the total count
            model[n_minus_1_gram][token] /= total_count

        model[n_minus_1_gram]["meta_data"] = {"total_count": total_count, "<notInModel>_count": token_notInModel_count}
    return model


In [10]:
def lm(n, vocabulary, data_file_path, add_one):
    """ Builds an n-gram model from the given data
    Args:
        n: the n in n-gram
        vocabulary: a list of all the tokens in the data
        data_file_path: the data_file from which we record probabilities for our model
        add_one: True/False (use add_one smoothing or not)
    Returns:
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: probability}}
    """
    # read the data file
    current_data = pd.read_csv(data_file_path,  encoding="utf-8")
    # build the n-gram model
    model = build_ngram_model(current_data, n)
    if add_one:
        # add one smoothing
        model = add_one_smoothing(model, vocabulary)
    # calculate the probabilities
    model = calculate_probabilities(model)
    return model

In [11]:
# call the function for the first data file
lm_model_True = lm(2, vocabulary, path + "/" + data_files[0], True)

lm_model_False = lm(2, vocabulary, path + "/" + data_files[0], False)


**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
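
As a reminder, the perplexity of a model over N predicted tokens with probabilities p_1, ..., p_N is 2 raised to the average negative log2 probability. The sketch below computes it from a plain list of probabilities; the function name `perplexity_from_probabilities` is hypothetical, and treating an empty list or a zero probability as infinite perplexity is an assumption of this sketch.

```python
from math import log2


def perplexity_from_probabilities(probabilities):
    """Perplexity = 2 ** (average negative log2 probability)."""
    if not probabilities or any(p == 0 for p in probabilities):
        # nothing to score, or an impossible event: treat as infinite perplexity
        return float("inf")
    entropy = -sum(log2(p) for p in probabilities) / len(probabilities)
    return 2 ** entropy


print(perplexity_from_probabilities([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

A model that always assigns probability 1/4 to the observed token is as uncertain as a uniform choice among four options, hence the perplexity of 4.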

In [12]:
from math import log2

# a function that calculates the perplexity of a model
def calculate_perplexity(current_data, n, model):
    """ Calculates the perplexity of a model running over a given data file
    Args:
        current_data: a data frame representing the data
        n: the n in n-gram
        model: a dictionary representing the n-gram model, i.e {tuple of n-1 tokens: {n_th_token: probability}}
    Returns:
        perplexity: the perplexity of the model
    """
    log_prob_sum = 0
    n_gram_count = 0

    for tweet in current_data["tweet_text"]:
        # add start and end tokens
        tweet = "<start> " + tweet + " <end>"
        # convert the tweet to a list of tokens
        tweet = tweet_to_token_tuples(tweet)

        # iterate over all the n-grams in the tweet
        for i in range(len(tweet) - n + 1):
            n_gram = tweet[i:i+n]
            n_minus_1_gram = n_gram[:-1]
            n_th_token = n_gram[-1]

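            # if the n-gram is in the model, add its negative log probability to the sum
            # (n-grams covered by neither this branch nor the <notInModel> branch are skipped and do not contribute to the perplexity)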
            if n_minus_1_gram in model and n_th_token in model[n_minus_1_gram]:
                log_prob_sum += -log2(model[n_minus_1_gram][n_th_token])
                n_gram_count += 1
            # <notInModel> is for add_one smoothing case
            elif n_minus_1_gram in model and "<notInModel>" in model[n_minus_1_gram]:
                # get the meta_data of the n-1 gram
                meta_data = model[n_minus_1_gram]["meta_data"]
                total_count = meta_data["total_count"]
                notInModel_count = meta_data["<notInModel>_count"]
                # calculate the probability of <notInModel>
                prob_notInModel = model[n_minus_1_gram]["<notInModel>"]
                # calculate the probability of single unseen token
                prob = prob_notInModel / notInModel_count
                # calculate the log probability of the n-gram and add it to the sum
                log_prob_sum += -log2(prob)
                n_gram_count += 1

    # return infinite perplexity if no n-grams found
    if n_gram_count == 0:
        return float('inf')
    # calculate the entropy and the perplexity
    entropy = log_prob_sum / n_gram_count
    perplexity = 2 ** entropy
    return perplexity


In [13]:
def eval(n, model, data_file):
    """ Evaluates the perplexity of a model
    Args:
        n: the n in n-gram
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
        data_file: the data file path for which we want to calculate the perplexity
    Returns:
        the perplexity of the model
    """
    # read the data file
    current_data = pd.read_csv(data_file, encoding="utf-8")
    # calculate the perplexity
    perplexity = calculate_perplexity(current_data, n, model)
    return perplexity

In [14]:
# call the function
perplexity = eval(2, lm_model_True, path + "/" + data_files[0])
print("perplexity lm_model_True: ", perplexity)
perplexity = eval(2, lm_model_False, path + "/" + data_files[0])
print("perplexity lm_model_False: ", perplexity)

perplexity lm_model_True:  20.89777888632233
perplexity lm_model_False:  17.724467632988453


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity values.

Save the dataframe to a CSV with the name format: {student_id_1}\_...\_{student_id_n}\_part4.csv
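
The pairwise evaluation can be sketched independently of the model details. In the snippet below, `perplexity_table`, the toy `models` and `datasets` dictionaries, and the `score` callable are all hypothetical stand-ins; in the assignment, `score` would be the perplexity of a language model over a data file.

```python
def perplexity_table(models, datasets, score):
    """Score every model against every data set.

    Returns {model_name: {data_name: score(model, data)}}.
    """
    table = {}
    for model_name, model in models.items():
        table[model_name] = {
            data_name: score(model, data) for data_name, data in datasets.items()
        }
    return table


# toy stand-ins: numbers instead of real models and data files
toy_table = perplexity_table(
    models={"en": 1, "es": 2},
    datasets={"en": 10, "es": 20},
    score=lambda model, data: model * data,
)
print(toy_table)
```

With pandas, `pd.DataFrame.from_dict(table, orient="index")` would turn such a nested dictionary into a DataFrame whose rows are labeled by the outer keys.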

In [15]:
def match(n, add_one):
    """ Creates a model for every relevant language, using a specific value of n and add_one.
    Then, calculate the perplexity of all possible pairs.
    Args:
        n: the n in n-gram
        add_one: whether to use add one smoothing or not
    Returns:
        df: a pandas DataFrame of perplexities; columns are the model languages and rows are the evaluated data files (labels are the file names in data_files)
        models: a dictionary of the models, so that we can use them later,
                i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    # create a dataframe
    df = pd.DataFrame(columns=data_files, index=data_files)

    # create models for every language
    models = compute_data_files_models(data_files, n, vocabulary, path, add_one)

    # calculate the perplexity of all possible pairs
    for lang1 in data_files: # will be the model
        # define the model
        current_model = models[lang1]
        for lang2 in data_files: # will be the data file
            # define the data file
            current_data_file = path + "/" + lang2
            # evaluate the model
            perplexity = eval(n, current_model, current_data_file)
            # save the perplexity to the dataframe
            # (use .loc instead of chained indexing: column = model, row = data file)
            df.loc[lang2, lang1] = perplexity
    return df, models # return the dataframe and the models, so that we can use them later

def compute_data_files_models(data_files, n, vocabulary, path , add_one):
    """ Creates a model for every relevant language, using a specific value of n and add_one.
    Args:
        data_files: the data files to create models for
        n: the n in n-gram
        vocabulary: the vocabulary
        path: the path to the data files
        add_one: whether to use add one smoothing or not
    Returns:
        models: a dictionary of the models, so that we can use them later,
                i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    models = {}
    for data_file in data_files:
        models[data_file] = lm(n, vocabulary, path + "/" + data_file, add_one)
    return models


In [16]:
# call the function
df_part4, models_part4 = match(2, True)
print("dataframe: ")
print(df_part4)

dataframe: 
           en.csv     es.csv     fr.csv     in.csv     it.csv     nl.csv   
en.csv  20.897779  31.469961  27.982361   29.28984  31.027015  26.810963  \
es.csv   27.07923  18.784248   25.12035  28.965038  24.237032  30.308435   
fr.csv  28.585712  28.319004  19.703018  33.973468  28.874987   30.31644   
in.csv  28.924859  33.147844  31.945387  20.857584  32.624991  29.551127   
it.csv  26.927061  24.111037  25.776475  28.363825  19.153515  29.954258   
nl.csv   27.05875  32.310611  29.025613  29.288145  32.508412  20.315714   
pt.csv  29.501319  23.932835  26.986815  31.655019  25.779086  32.238053   
tl.csv  28.454305  33.063345  33.712831  26.072283  32.076746  30.991691   

           pt.csv     tl.csv  
en.csv  32.798601  27.153291  
es.csv  23.319751  28.253442  
fr.csv  29.122319  33.539102  
in.csv  34.845109  25.171219  
it.csv  25.530354  27.124208  
nl.csv  34.036014  30.440644  
pt.csv  19.563727  31.095087  
tl.csv  34.666478  20.957046  


In [17]:
# save the dataframe to a CSV of format {student_id_1}_..._{student_id_n}_part4.csv
# (the index is kept so that every row stays labeled with its language)
df_part4.to_csv(student_id_1 + "_" + student_id_2 + "_part4.csv")


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

Load each result to a dataframe and save to a CSV with the name format:

For cases with add_one: {student_id_1}\_...\_{student_id_n}\_n1\_part5.csv

For cases without add_one: {student_id_1}\_...\_{student_id_n}\_n1\_wo\_addone\_part5.csv

Follow the same format for n2, n3, and n4.
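
The naming scheme can be captured in a small helper. This is a sketch; the function name `part5_filename` is hypothetical, and the student ids below are the placeholder values used elsewhere in this notebook.

```python
def part5_filename(student_ids, n, add_one):
    """Build the Part 5 CSV name for a given n and smoothing setting."""
    suffix = "" if add_one else "_wo_addone"
    return "_".join(student_ids) + f"_n{n}{suffix}_part5.csv"


print(part5_filename(["123456789", "327156998"], 1, True))
print(part5_filename(["123456789", "327156998"], 1, False))
```

Joining the ids with `"_".join(...)` lets the same helper work for any number of students.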


In [18]:
def run_match(n_values=(1, 2, 3, 4), add_one_values=(True, False)):
    """ Runs match with n values 1-4, once with add_one and once without, and print the 8 tables to this notebook.
    Args:
        n_values: the n values to run match with
        add_one_values: the add_one values to run match with
    Returns:
        dataframes: a dictionary of the dataframes, so that we can use them later,
                i.e {n, add_one: dataframe}
        language_models_dict: a dictionary of the language models, so that we can use them later,
                i.e {n, add_one: {language: model}}
    """
    # create dictionaries for the dataframes, key = (n, add_one), value = dataframe
    dataframes = {}
    # create a dictionary for the language models, key = (n, add_one), value = language models = {language: model}
    language_models_dict = {}
    # iterate over all the n values
    for n in n_values:
        # iterate over all the add_one values
        for add_one in add_one_values:
            # create the dataframe and the language models, using the match function
            current_df, current_language_models = match(n, add_one)
            print("completed n = " + str(n) + ", add_one = " + str(add_one) + "!")
            # add the dataframe to the dataframes dictionary
            dataframes[(n, add_one)] = current_df
            # add the language models to the language_models_dict dictionary
            language_models_dict[(n, add_one)] = current_language_models
            # # save the dataframe to a CSV
            # if add_one:
            #     current_df.to_csv("language_perplexity_n" + str(n) + "_part5.csv")
            # else:
            #     current_df.to_csv("language_perplexity_n" + str(n) + "_wo_addone_part5.csv")
    return dataframes, language_models_dict # return the dataframes and the language models, so that we can use them later


In [19]:
run_match_dataframes, run_match_language_models = run_match()

completed n = 1, add_one = True!
completed n = 1, add_one = False!
completed n = 2, add_one = True!
completed n = 2, add_one = False!
completed n = 3, add_one = True!
completed n = 3, add_one = False!
completed n = 4, add_one = True!
completed n = 4, add_one = False!


In [20]:
print("dataframes: ")
print(run_match_dataframes)

dataframes: 
{(1, True):            en.csv     es.csv     fr.csv     in.csv     it.csv     nl.csv   
en.csv  37.178192  40.419081  40.010312  40.845231  39.864997  39.194549  \
es.csv  39.195904  34.907042  38.153696  41.520361  37.414257  39.079891   
fr.csv   41.12382  39.387318  36.252997  45.115759  39.022463  40.281632   
in.csv  39.933617  41.950442  42.708112  36.119644  41.765169  39.987242   
it.csv  39.086583  38.582739    38.4033  41.593138  36.323042  39.396963   
nl.csv  38.336395  40.021781  39.470987   40.36988  39.640239  36.355625   
pt.csv  41.206373   37.88238  39.235636  43.290318  39.304747  40.809172   
tl.csv  42.751288  45.053611  46.842864  40.803226  44.293566  44.298361   

           pt.csv     tl.csv  
en.csv  40.869391  40.520973  
es.csv  36.120973  40.988471  
fr.csv  39.197119  45.607366  
in.csv  41.340695  37.687806  
it.csv  39.007692  41.049021  
nl.csv  40.179656  41.245413  
pt.csv  35.496439  42.728082  
tl.csv  44.952481  39.022114  , (1, False)

In [21]:
# save the dataframes to CSVs of format:
# with add_one: {student_id_1}\_...\_{student_id_n}\_n1\_part5.csv
# without add_one: {student_id_1}\_...\_{student_id_n}\_n1\_wo\_addone\_part5.csv

n_values = [1, 2, 3, 4]
add_one_values = [True, False]
# iterate over all the n values and add_one values and save the dataframes to CSVs
for n in n_values:
    for add_one in add_one_values:
        suffix = "_part5.csv" if add_one else "_wo_addone_part5.csv"
        filename = student_id_1 + "_" + student_id_2 + "_n" + str(n) + suffix
        # keep the index so that every row stays labeled with its language
        run_match_dataframes[(n, add_one)].to_csv(filename)
        print("saved " + filename)


saved 123456789_327156998_n1_part5.csv
saved 123456789_327156998_n1_wo_addone_part5.csv
saved 123456789_327156998_n2_part5.csv
saved 123456789_327156998_n2_wo_addone_part5.csv
saved 123456789_327156998_n3_part5.csv
saved 123456789_327156998_n3_wo_addone_part5.csv
saved 123456789_327156998_n4_part5.csv
saved 123456789_327156998_n4_wo_addone_part5.csv


In [22]:
import pickle

def save_models_to_pickle(language_models_dict=run_match_language_models, filename="run_match_language_models.pickle"):
    """ Saves the language models to a pickle file
    Args:
        language_models_dict: a dictionary of the models, i.e {n, add_one: language_models} where language_models is a dictionary of the models, i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
        filename: the path of the pickle file to write
    """
    # save run_match_language_models to a pickle file, so that we can use it later
    with open(filename, 'wb') as handle:
        pickle.dump(language_models_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)


In [23]:
import pickle

def load_models_from_pickle(filename="run_match_language_models.pickle"):
    """ Loads the language models from a pickle file
    Args:
        filename: the path of the pickle file to read
    Returns:
        language_models_dict: a dictionary of the models, i.e {n, add_one: language_models} where language_models is a dictionary of the models, i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    # load the language models from the given pickle file
    with open(filename, 'rb') as handle:
        language_models_dict = pickle.load(handle)
    return language_models_dict


In [24]:
# save the language models to a pickle file
# (comment out the next lines to skip saving the language models)
save_models_to_pickle(run_match_language_models, "run_match_language_models.pickle")
print("saved run_match_language_models.pickle")

saved run_match_language_models.pickle


In [25]:
# load the language models from a pickle file
# (comment out the next lines to skip loading the language models)
run_match_language_models = load_models_from_pickle("run_match_language_models.pickle")
print("loaded run_match_language_models.pickle")

loaded run_match_language_models.pickle


**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
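
One non-trivial baseline is to score the sentence with every language model and pick the language with the lowest perplexity. The sketch below illustrates this with toy character-unigram models; the function names, the toy probabilities, and the `floor` probability assigned to unseen characters are all assumptions made for illustration, not part of the assignment.

```python
from math import log2


def sentence_perplexity(sentence, char_probs, floor=1e-6):
    """Unigram perplexity of `sentence`; unseen characters get `floor`."""
    if not sentence:
        return float("inf")
    neg_log_sum = -sum(log2(char_probs.get(ch, floor)) for ch in sentence)
    return 2 ** (neg_log_sum / len(sentence))


def predict_language(sentence, models):
    """Return the language whose model gives the lowest perplexity."""
    return min(models, key=lambda lang: sentence_perplexity(sentence, models[lang]))


toy_models = {
    "en": {"t": 0.5, "h": 0.5},
    "es": {"q": 0.5, "u": 0.5},
}
print(predict_language("th", toy_models))
print(predict_language("qu", toy_models))
```

Lower perplexity means the model found the sentence less surprising, so taking the minimum over languages acts as the decision rule.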

In [26]:
def classify(n=3, add_one=False):
    """ Classifies the sentences in the test file, using the language models
    Args:
        n: the n-gram model to use
        add_one: whether to use add_one smoothing or not
    Returns:
        classification_result: a list of tuples of the form (tweet_id, sentence, true_language, predicted_language)
    """
    # use the language models from part 5 that match the given n and add_one
    language_models = run_match_language_models[(n, add_one)]

    # read the test file, tweet_id, tweet_text, label
    test_data = pd.read_csv(path + "/test.csv",  encoding="utf-8")

    # classify the sentences
    classification_result = []

    # iterate over the rows in the test data
    for index, row in test_data.iterrows():
        tweet_id = row['tweet_id']
        sentence = row['tweet_text']
        true_language = row['label']
        # predict the language whose model gives the lowest perplexity
        predicted_language = single_classification(sentence, language_models, n)

        # add the result to the classification_result list
        classification_result.append((tweet_id, sentence, true_language, predicted_language))

    return classification_result


def single_classification(sentence, language_models = run_match_language_models[(3, True)], n=3):
    """ Classifies a single sentence, using the language models
    Args:
        sentence: the sentence to classify
        language_models: the language models to use
        n: the n-gram model to use
    Returns:
        predicted_language: the predicted language of the sentence
    """

    predicted_language = ''
    min_perplexity = float('inf')
    # create a temporary DataFrame once, with the sentence as the only row
    temp_df = pd.DataFrame([sentence], columns=['tweet_text'])
    # iterate over the language models
    for data_file in data_files:
        current_model = language_models[data_file]
        # calculate the perplexity using the temporary DataFrame
        current_perplexity = calculate_perplexity(temp_df, n, current_model)

        if current_perplexity < min_perplexity:
            min_perplexity = current_perplexity
            predicted_language = data_file[:-4] # remove the .csv from the end of the file name
    return predicted_language



In [27]:
# classify the test sentences
classification_result_n1 = classify(n=1, add_one=False)
print("completed classification for n = 1, add_one = False")

classification_result_n1_add_one = classify(n=1, add_one=True)
print("completed classification for n = 1, add_one = True")

classification_result_n2 = classify(n=2, add_one=False)
print("completed classification for n = 2, add_one = False")

classification_result_n2_add_one = classify(n=2, add_one=True)
print("completed classification for n = 2, add_one = True")

classification_result_n3 = classify(n=3, add_one=False)
print("completed classification for n = 3, add_one = False")

classification_result_n3_add_one = classify(n=3, add_one=True)
print("completed classification for n = 3, add_one = True")

classification_result_n4 = classify(n=4, add_one=False)
print("completed classification for n = 4, add_one = False")

classification_result_n4_add_one = classify(n=4, add_one=True)
print("completed classification for n = 4, add_one = True")

completed classification for n = 1, add_one = False
completed classification for n = 1, add_one = True
completed classification for n = 2, add_one = False
completed classification for n = 2, add_one = True
completed classification for n = 3, add_one = False
completed classification for n = 3, add_one = True
completed classification for n = 4, add_one = False
completed classification for n = 4, add_one = True


In [28]:
def get_accuracy(classification_result):
    count_correct = 0
    for result in classification_result:
        predicted_language = result[3]
        true_language = result[2]
        if predicted_language == true_language:
            count_correct += 1
    return count_correct / len(classification_result)

In [29]:
# print the accuracy of the classification
print("accuracy for n = 1, add_one = False: " + str(get_accuracy(classification_result_n1)))
print("accuracy for n = 2, add_one = False: " + str(get_accuracy(classification_result_n2)))
print("accuracy for n = 3, add_one = False: " + str(get_accuracy(classification_result_n3)))
print("accuracy for n = 4, add_one = False: " + str(get_accuracy(classification_result_n4)))

print("accuracy for n = 1, add_one = True: " + str(get_accuracy(classification_result_n1_add_one)))
print("accuracy for n = 2, add_one = True: " + str(get_accuracy(classification_result_n2_add_one)))
print("accuracy for n = 3, add_one = True: " + str(get_accuracy(classification_result_n3_add_one)))
print("accuracy for n = 4, add_one = True: " + str(get_accuracy(classification_result_n4_add_one)))

accuracy for n = 1, add_one = False: 0.5639454931866483
accuracy for n = 2, add_one = False: 0.8538567320915115
accuracy for n = 3, add_one = False: 0.8821102637829729
accuracy for n = 4, add_one = False: 0.794474309288661
accuracy for n = 1, add_one = True: 0.6663332916614577
accuracy for n = 2, add_one = True: 0.8744843105388174
accuracy for n = 3, add_one = True: 0.9236154519314914
accuracy for n = 4, add_one = True: 0.9179897487185898


**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 

Load the results to a CSV (using a DataFrame) with model_name and f1_score columns. Name it {student_id_1}\_...\_{student_id_n}\_part7.csv



```
  model_name  f1_score
0    Model A      0.85
1    Model B      0.92
2    Model C      0.87
3    Model D      0.90
```
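
For a multi-class task such as this one, a single F1 number requires an averaging rule. The sketch below computes a support-weighted F1 by hand on made-up labels, to show what a weighted average does. `weighted_f1` is a hypothetical helper written for illustration; for the assignment itself, the scikit-learn function from the hint is the natural choice.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        # F1 = 2*TP / (2*TP + FP + FN)
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * count
    return total / len(y_true)

# made-up labels: one English tweet is misclassified as Spanish
y_true = ["en", "en", "es", "fr"]
y_pred = ["en", "es", "es", "fr"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

Here `en` and `es` each score 2/3 and `fr` scores 1, so the weighted result is (2 * 2/3 + 2/3 + 1) / 4 = 0.75.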



In [30]:
def map_language_number(classification_result):
    #we will use the following dictionary to convert the strings to numbers
    language_to_number = {}
    # we will use the following dictionary to convert the numbers back to strings
    number_to_language = {}

    number = 0
    # iterate over the classification results
    for result in classification_result:
        if result[2] not in language_to_number:
            language_to_number[result[2]] = number
            number_to_language[number] = result[2]
            number += 1
        if result[3] not in language_to_number:
            language_to_number[result[3]] = number
            number_to_language[number] = result[3]
            number += 1
    return language_to_number, number_to_language

In [31]:
import sklearn.metrics as metrics

def calc_f1(result):
    """ Calculates the f1 score of the classification result
    Args:
        result: a list of tuples, where each tuple contains the tweet_id, the sentence, the true language, and the predicted language
    Returns:
        f1_score: the f1 score of the classification result
    """
    # create mappings from language to number and number to language
    language_to_number, number_to_language = map_language_number(result)

    # create a DataFrame with the results
    df = pd.DataFrame(result, columns=['tweet_id', 'tweet_text', 'true_language', 'predicted_language'])
    # drop the tweet_id and tweet_text columns
    df = df.drop(columns=['tweet_id', 'tweet_text'])
    # convert the true_language and predicted_language columns to numbers
    df['true_language'] = df['true_language'].apply(lambda x: language_to_number[x])
    df['predicted_language'] = df['predicted_language'].apply(lambda x: language_to_number[x])
    # calculate the f1 score
    f1_score = metrics.f1_score(df['true_language'], df['predicted_language'], average='weighted')
    return f1_score

In [32]:
# calculate the f1 score for each n
f1_score_n1 = calc_f1(classification_result_n1)
print("n = 1, add_one = False, f1_score = " + str(f1_score_n1))

f1_score_n2 = calc_f1(classification_result_n2)
print("n = 2, add_one = False, f1_score = " + str(f1_score_n2))

f1_score_n3 = calc_f1(classification_result_n3)
print("n = 3, add_one = False, f1_score = " + str(f1_score_n3))

f1_score_n4 = calc_f1(classification_result_n4)
print("n = 4, add_one = False, f1_score = " + str(f1_score_n4))

f1_score_n1_add_one = calc_f1(classification_result_n1_add_one)
print("n = 1, add_one = True, f1_score = " + str(f1_score_n1_add_one))

f1_score_n2_add_one = calc_f1(classification_result_n2_add_one)
print("n = 2, add_one = True, f1_score = " + str(f1_score_n2_add_one))

f1_score_n3_add_one = calc_f1(classification_result_n3_add_one)
print("n = 3, add_one = True, f1_score = " + str(f1_score_n3_add_one))

f1_score_n4_add_one = calc_f1(classification_result_n4_add_one)
print("n = 4, add_one = True, f1_score = " + str(f1_score_n4_add_one))

n = 1, add_one = False, f1_score = 0.559659649986144
n = 2, add_one = False, f1_score = 0.8540032043148125
n = 3, add_one = False, f1_score = 0.8825410316647341
n = 4, add_one = False, f1_score = 0.7949199120142474
n = 1, add_one = True, f1_score = 0.667513094316079
n = 2, add_one = True, f1_score = 0.8750839401494929
n = 3, add_one = True, f1_score = 0.9238768379120521
n = 4, add_one = True, f1_score = 0.9183128996924542


In [33]:
# Save the results to a CSV (using a DataFrame) with model_name and f1_score columns,
# named {student_id_1}_..._{student_id_n}_part7.csv

# create a DataFrame with the results
df_part7 = pd.DataFrame(columns=['model_name', 'f1_score'])

# add the results to the pd.DataFrame (without appending)
df_part7.loc[0] = ['n = 1, add_one = False', f1_score_n1]
df_part7.loc[1] = ['n = 2, add_one = False', f1_score_n2]
df_part7.loc[2] = ['n = 3, add_one = False', f1_score_n3]
df_part7.loc[3] = ['n = 4, add_one = False', f1_score_n4]
df_part7.loc[4] = ['n = 1, add_one = True', f1_score_n1_add_one]
df_part7.loc[5] = ['n = 2, add_one = True', f1_score_n2_add_one]
df_part7.loc[6] = ['n = 3, add_one = True', f1_score_n3_add_one]
df_part7.loc[7] = ['n = 4, add_one = True', f1_score_n4_add_one]

# save the DataFrame to a CSV file
df_part7.to_csv(student_id_1 + "_" + student_id_2 + "_part7.csv", index=False)
# print the DataFrame
print(df_part7)

               model_name  f1_score
0  n = 1, add_one = False  0.559660
1  n = 2, add_one = False  0.854003
2  n = 3, add_one = False  0.882541
3  n = 4, add_one = False  0.794920
4   n = 1, add_one = True  0.667513
5   n = 2, add_one = True  0.875084
6   n = 3, add_one = True  0.923877
7   n = 4, add_one = True  0.918313


<br><br><br><br>
**Part 8**  
Let's use your Language model (dictionary) for generation (NLG).

When it comes to sampling from a language model decoder during text generation, there are several different methods that can be used to control the randomness and diversity of the generated text. 

Some of the most commonly used methods include:

> `Greedy sampling`
In this method, the model simply selects the word with the highest probability as the next word at each time step. This method can produce fluent text, but it can also lead to repetitive or predictable output.

> `Temperature scaling`  
Temperature scaling divides the logits produced by the language model by a temperature parameter before softmax normalization. A temperature above 1 flattens the distribution and raises the probability of less likely words, which can lead to more diverse and creative output; a temperature below 1 sharpens the distribution and pushes the output toward greedy sampling.

> `Top-K sampling`  
In this method, the model restricts the sampling to the top-K most likely words at each time step, where K is a predefined hyperparameter. This can generate more diverse output than greedy sampling, while limiting the number of low-probability words that are sampled.

> `Nucleus sampling` (also known as top-p sampling)  
This method restricts the sampling to the smallest possible set of words whose cumulative probability exceeds a certain threshold, defined by a hyperparameter p. Like top-K sampling, this can generate more diverse output than greedy sampling, while avoiding sampling extremely low probability words.

> `Beam search`  
Beam search keeps a fixed number k of candidate output sequences. At each time step every candidate is extended with its possible next words, and only the k most likely extended sequences are kept. This can improve the fluency and coherence of the output, but may not produce as much diversity as sampling methods.

The choice of sampling method depends on the specific application and desired balance between fluency, diversity, and randomness. Hyperparameters such as temperature, K, p, and beam size can also be tuned to adjust the behavior of the language model during sampling.
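
The effect of these hyperparameters can be seen on a small, made-up next-token distribution. The sketch below is only an illustration: the `toy` distribution and all helper names are assumptions. It works directly on probabilities, so temperature is applied as `p ** (1 / T)` followed by renormalization, which is equivalent to dividing log-probabilities by `T` before a softmax.

```python
import random

def normalize(weights):
    """Rescale the values so they sum to 1."""
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}

def apply_temperature(probs, temperature):
    """T < 1 sharpens the distribution, T > 1 flattens it."""
    return normalize({t: p ** (1.0 / temperature) for t, p in probs.items()})

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    return normalize(dict(kept))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return normalize(kept)

def sample(probs):
    """Draw one token at random according to its probability."""
    tokens = list(probs)
    return random.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

toy = {"e": 0.5, "a": 0.3, "o": 0.15, "z": 0.05}
print(max(toy, key=toy.get))         # greedy choice: "e"
print(apply_temperature(toy, 0.5))   # sharper than toy
print(apply_temperature(toy, 2.0))   # flatter than toy
print(top_k_filter(toy, 2))          # only "e" and "a" remain
print(top_p_filter(toy, 0.9))        # "e", "a" and "o" remain
print(sample(top_k_filter(toy, 2)))  # random draw from the filtered set
```

Top-K always keeps a fixed number of tokens, while top-p keeps more tokens when the distribution is flat and fewer when it is peaked.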


You may read more about this concept in <a href='https://huggingface.co/blog/how-to-generate#:~:text=pad_token_id%3Dtokenizer.eos_token_id)-,Greedy%20Search,-Greedy%20search%20simply'>this</a> blog post.


**Please add the needed code for each sampling method:**

In [34]:
import random

def softmax(probabilities):
    """ Applies the softmax function to the probabilities
    Args:
        probabilities: a dictionary of probabilities (not yet probabilities, just numbers)
    Returns:
        probabilities: a dictionary of probabilities (normalized)
    """

    # convert the dictionary to a numpy array
    np_probabilities = np.array(list(probabilities.values()))
    # apply the softmax function
    np_probabilities = np.exp(np_probabilities)
    np_probabilities = np_probabilities / np.sum(np_probabilities)
    # convert the numpy array back to a dictionary
    probabilities = {key: value for key, value in zip(probabilities.keys(), np_probabilities)}
    return probabilities

def make_prob_1(probabilities):
    """ Makes the sum of the probabilities equal to 1
    Args:
        probabilities: a list of probabilities (not yet probabilities, just numbers)
    Returns:
        probabilities: a list of probabilities (normalized)
    """

    # make the sum of the probabilities equal to 1
    # by dividing each probability by the sum of all probabilities
    sum_prob = sum(probabilities)
    probabilities = [prob / sum_prob for prob in probabilities]
    return probabilities

def get_correct_model(all_models=run_match_language_models, prefix="<start>", language="en.csv", add_one=False, max_n=4):
    """ Gets the correct model for the prefix and language
    Args:
        all_models: a dictionary of all the language models
        prefix: the prefix of the tweet
        language: the language of the tweet
        add_one: whether to use add_one smoothing or not
        max_n: the maximum n-gram to use
    Returns:
        correct_model: the correct language model
        tuple_key_prefix: the tuple key of the prefix
        next_token_probabilities: the probabilities of the next token
    """
    prefix_tokens = tweet_to_token_tuples(prefix)

    # we want to use maximum n-gram we can, but not more than max_n
    n = min(max_n, len(prefix_tokens) + 1)

    # get the n-gram model
    correct_model = all_models[(n, add_one)][language]

    # get the n-1 tokens of the prefix, i.e. the key for the language model
    tuple_key_prefix = tuple(prefix_tokens[-(n - 1):])

    # if the prefix is not in the language model, sample a random key from the language model
    if tuple_key_prefix not in correct_model:
        tuple_key_prefix = random.choice(list(correct_model.keys()))


    # get the probabilities of the next token
    next_token_probabilities = correct_model[tuple_key_prefix]
    return correct_model, tuple_key_prefix, next_token_probabilities

def get_notInModel_probabilities(probabilities, notInModel_count, vocabulary=vocabulary):
    """ Removes <notInModel> from the probabilities dictionary and adds the probabilities of the tokens not in the model
    Args:
        vocabulary: the vocabulary of the language model
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        notInModel_count: the number of tokens not in the model
    Returns:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
    """
    # add all tokens not in the model to the probabilities dictionary
    if '<notInModel>' in probabilities:
        single_notInModel_prob = probabilities['<notInModel>'] / notInModel_count
        tokens_not_in_model_vocab = set(vocabulary) - set(probabilities.keys())
        for token in tokens_not_in_model_vocab:
            probabilities[token] = single_notInModel_prob
        del probabilities['<notInModel>']
    return probabilities
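
The helpers above treat the reserved `<notInModel>` entry as the combined probability of every vocabulary token that a context never produced, and spread it evenly over those tokens. A toy sketch of that redistribution follows; the vocabulary, the numbers, and the `spread_unseen_mass` name are made up for the example, and unlike `get_notInModel_probabilities` it returns a new dictionary instead of modifying its argument.

```python
def spread_unseen_mass(probs, vocabulary, reserved="<notInModel>"):
    """Return a copy of `probs` with the reserved mass split evenly over unseen tokens."""
    result = {t: p for t, p in probs.items() if t != reserved}
    unseen = [t for t in vocabulary if t not in result]
    if reserved in probs and unseen:
        share = probs[reserved] / len(unseen)
        for token in unseen:
            result[token] = share
    return result

toy_vocab = ["a", "b", "c", "d"]
toy_probs = {"a": 0.6, "b": 0.3, "<notInModel>": 0.1}
expanded = spread_unseen_mass(toy_probs, toy_vocab)
print(expanded)                # "c" and "d" each receive 0.05
print(sum(expanded.values()))  # still sums to (approximately) 1
```

After this step every vocabulary token has an explicit probability, so the sampling functions can sort and filter them uniformly.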


In [35]:
# probabilities = {key = next_token, value = probability}
def sample_greedy(probabilities, k=1):
    """ Selects the next token greedily, i.e. the token with the k-th highest probability
    (the most likely token when k=1)
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        k: the rank of the token to select, 1 being the most likely
    Returns:
        next_token: the selected token
    """
    # copy the dictionary
    probabilities_copy = probabilities.copy()
    # remember meta_data and remove
    meta_data = probabilities_copy['meta_data']
    del probabilities_copy['meta_data']
    total_count = meta_data["total_count"]
    notInModel_count = meta_data["<notInModel>_count"]

    # <notInModel> is treated as the combined probability of all tokens not in the model,
    # so divide it by their count to compare it with the other tokens one-to-one
    if '<notInModel>' in probabilities_copy:
        probabilities_copy['<notInModel>'] /= notInModel_count

    # sort the probabilities by value
    sorted_probabilities = sorted(probabilities_copy.items(), key=lambda x: x[1], reverse=True)

    # k cannot exceed the number of candidate tokens
    k = min(k, len(sorted_probabilities))

    # take the token with the k-th highest probability
    next_token = sorted_probabilities[k - 1][0]

    # if the token is <notInModel>, sample a random token from the tokens not in the model
    if next_token == '<notInModel>':
        # get tokens not in the model
        tokens_not_in_model_vocab = set(vocabulary) - set(probabilities_copy.keys())
        # sample a random token from the tokens not in the model
        next_token = random.choice(list(tokens_not_in_model_vocab))

    return next_token


# probabilities = {key = next_token, value = probability}
def sample_temperature(probabilities, temperature=1.0, k=1):
    """ Samples the next token using temperature sampling
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        temperature: the temperature
    Returns:
        next_token: the sampled token
    """
    # copy the dictionary
    probabilities_copy = probabilities.copy()
    # remember meta_data and remove
    meta_data = probabilities_copy['meta_data']
    del probabilities_copy['meta_data']
    total_count = meta_data["total_count"]
    notInModel_count = meta_data["<notInModel>_count"]

    # remove <notInModel> from the probabilities dictionary and add the probabilities of the tokens not in the model
    probabilities_copy = get_notInModel_probabilities(probabilities=probabilities_copy, notInModel_count=notInModel_count)

    # scale the probabilities by the temperature: p ** (1 / T) is equivalent to
    # dividing the log-probabilities by T
    probabilities_copy = {key: value ** (1 / temperature) for key, value in probabilities_copy.items()}

    # renormalize so the scaled values sum to 1; a softmax here would treat the
    # probabilities as logits and flatten them to a nearly uniform distribution
    np_probabilities = np.array(list(probabilities_copy.values()), dtype=float)
    np_probabilities = np_probabilities / np_probabilities.sum()
    np_tokens = np.array(list(probabilities_copy.keys()))

    # convert the tokens (tuple) to a list of strings
    np_tokens = ["".join(token) for token in np_tokens]

    # sample the next token
    next_token = np.random.choice(np_tokens, p=np_probabilities)

    return next_token


def sample_topK(probabilities, k=1):
    """ Samples the next token using top-k sampling, i.e. only the top k tokens are considered
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        k: the number of tokens to consider
    Returns:
        next_token: the sampled token
    """
    # copy the dictionary
    probabilities_copy = probabilities.copy()
    # remember meta_data and remove
    meta_data = probabilities_copy['meta_data']
    del probabilities_copy['meta_data']
    total_count = meta_data["total_count"]
    notInModel_count = meta_data["<notInModel>_count"]

    # add all tokens not in the model to the probabilities dictionary
    probabilities_copy = get_notInModel_probabilities(probabilities=probabilities_copy, notInModel_count=notInModel_count)

    # sort the probabilities dictionary by the values
    sorted_probabilities = sorted(probabilities_copy.items(), key=lambda x: x[1], reverse=True)

    # take the top k
    top_k = sorted_probabilities[:k]

    # split the top k into tokens and probabilities
    top_k_probs = [prob for (token, prob) in top_k]
    top_k_tokens = [token for (token, prob) in top_k]

    # make the sum of the probabilities equal to 1
    top_k_probs = make_prob_1(top_k_probs)

    # sample from the top k tokens
    next_token = np.random.choice(top_k_tokens, p=top_k_probs)

    return next_token

# probabilities = {key = next_token, value = probability}
def sample_topP(probabilities, p=0.9):
    """ Samples the next token using top-p sampling,
    i.e. only the tokens with the highest probabilities are considered, until the sum of the probabilities is greater than p
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        p: the threshold
    Returns:
        next_token: the sampled token
    """
    # copy the dictionary
    probabilities_copy = probabilities.copy()
    # remember meta_data and remove
    meta_data = probabilities_copy['meta_data']
    del probabilities_copy['meta_data']
    total_count = meta_data["total_count"]
    notInModel_count = meta_data["<notInModel>_count"]

    # remove <notInModel> from the probabilities dictionary and add the probabilities of the tokens not in the model
    probabilities_copy = get_notInModel_probabilities(probabilities=probabilities_copy, notInModel_count=notInModel_count)

    # sort the probabilities dictionary by the values
    sorted_probabilities = sorted(probabilities_copy.items(), key=lambda x: x[1], reverse=True)
    current_sum = 0
    top_p_tokens = []
    top_p_probs = []
    current_index = 0
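    # add the top tokens until the sum of the probabilities reaches p
    # (the index check guards against floating-point sums that never reach p)
    while current_sum < p and current_index < len(sorted_probabilities):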
        top_p_tokens.append(sorted_probabilities[current_index][0])
        top_p_probs.append(sorted_probabilities[current_index][1])
        current_sum += sorted_probabilities[current_index][1]
        current_index += 1

    # make the sum of the probabilities equal to 1
    top_p_probs = make_prob_1(top_p_probs)
    # convert the tokens (tuple) to a list of strings
    top_p_tokens = ["".join(token) for token in top_p_tokens]
    # sample from the top p tokens
    next_token = np.random.choice(top_p_tokens, p=top_p_probs)

    return next_token


def sample_beam(probabilities, num_beams = 3):
    """ Samples the next tokens using beam search, i.e., keeps the top num_beams hypotheses at each step.
        Helper function for beam_search
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        num_beams: the number of beams to keep, i.e. the number of hypotheses to keep at each step
    Returns:
        beam_tokens: a list of top num_beams tokens
        beam_probs: a list of the corresponding probabilities of the top num_beams tokens
    """
    # copy the dictionary
    probabilities_copy = probabilities.copy()
    # remember meta_data and remove
    meta_data = probabilities_copy['meta_data']
    del probabilities_copy['meta_data']
    total_count = meta_data["total_count"]
    notInModel_count = meta_data["<notInModel>_count"]

    # remove <notInModel> from the probabilities dictionary and add the probabilities of the tokens not in the model
    probabilities_copy = get_notInModel_probabilities(probabilities=probabilities_copy, notInModel_count=notInModel_count)

    # sort the probabilities dictionary by the values
    sorted_probabilities = sorted(probabilities_copy.items(), key=lambda x: x[1], reverse=True)
    # take the top num_beams
    top_beams = sorted_probabilities[:num_beams]

    beam_tokens = [token for (token, prob) in top_beams]
    beam_probs = [prob for (token, prob) in top_beams]

    return beam_tokens, beam_probs




Use your Language Model to generate each one of the following examples with the corresponding params.  
Notice the 4 core issues:
- Starting tokens
- Length of the generation
- Sampling method (use all)
- Stop Token (if this token is sampled, stop generating)

Use your LM to generate a string based on the parameters of each example, and store the generated sequence in the generation list.
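
The four issues listed above map onto a simple loop: begin from the start tokens, add one token at a time, and stop either when the length budget is used up or when the text ends with the stop token. A minimal sketch with a made-up bigram table and greedy selection follows; `toy_bigrams` and `toy_generate` are assumptions for illustration and are independent of the models built earlier.

```python
def toy_generate(next_probs, start_tokens, gen_length, stop_token):
    """Greedy character generation with a length limit and a stop token."""
    text = start_tokens
    for _ in range(gen_length):
        context = text[-1]
        if context not in next_probs:
            break  # no known continuation for this character
        candidates = next_probs[context]
        text += max(candidates, key=candidates.get)  # greedy choice
        if text.endswith(stop_token):
            break  # the stop token was produced
    return text

# made-up next-character probabilities
toy_bigrams = {
    "H": {"e": 0.7, "i": 0.3},
    "e": {"y": 0.6, "l": 0.4},
    "y": {"\n": 0.8, " ": 0.2},
}

print(repr(toy_generate(toy_bigrams, "H", 10, "\n")))  # 'Hey\n' (stopped by the stop token)
print(repr(toy_generate(toy_bigrams, "H", 2, "\n")))   # 'Hey' (stopped by the length limit)
```

Checking `text.endswith(stop_token)` rather than comparing a single token also covers stop tokens longer than one character, such as `"me"` in example3.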

In [36]:
def generate_string(all_models, prefix='<start>', sampling_method='beam', gen_length=10, stop_token='<end>', num_beams=5, add_one=False):
    """ Generates a string using the specified sampling method
    Args:
        all_models: a dictionary {key = (n, add_one), value = language_models},
                where language_models = {key = language, value = model},
                where model = {key = prefix, value = probabilities}
        prefix: the prefix of the generation
        sampling_method: the sampling method to use
        gen_length: the length of the generation
        stop_token: the token to stop the generation
        num_beams: the number of beams to keep, i.e. the number of hypotheses to keep at each step
        add_one: whether to add one to the count of each token
    Returns:
        generation: the generated string
    """
    if sampling_method == 'beam':
        return beam_search(all_models= all_models, prefix= prefix, gen_length= gen_length, stop_token= stop_token, num_beams= num_beams, add_one= add_one)
    else:
        return generate_string_not_beam(all_models= all_models, prefix= prefix, sampling_method= sampling_method, gen_length= gen_length, stop_token= stop_token, add_one= add_one)

def beam_search(all_models, prefix="<start>", gen_length=10, stop_token="<end>", num_beams=5, language="en.csv", add_one=False, max_n=4):
    """ Generates a string using beam search
    Args:
        all_models: where lm_dict = {key = (n, add_one), value = language_model},
                where language_model = {key = language, value = model},
                where model = {key = prefix, value = probabilities}
        prefix: the prefix
        gen_length: the length of the generation
        stop_token: the token that stops the generation
        num_beams: the number of beams to keep
        language: the language of the model
        add_one: whether to use add one smoothing or not
        max_n: the maximum n-gram to use
    Returns:
        generated_string: the generated string
    """
    # initialize the beams
    beams = [(prefix, 0)]  # (prefix, log_prob)

    # generate the string token by token
    for _ in range(gen_length):
        new_beams = []

        # extend each beam with its most likely next tokens
        for beam_prefix, beam_log_prob in beams:
            # a beam that already ended with a stop token is kept once, unchanged
            if beam_prefix.endswith(stop_token) or beam_prefix.endswith("<end>"):
                new_beams.append((beam_prefix, beam_log_prob))
                continue

            # get the correct language model
            current_lm, current_tuple_key_prefix, next_token_probabilities = get_correct_model(all_models=all_models,
                                                                                               prefix=beam_prefix, language=language,
                                                                                               add_one=add_one, max_n=max_n)

            # take the top num_beams tokens and their probabilities
            beam_tokens, beam_probs = sample_beam(next_token_probabilities, num_beams)

            # extend the beam with each candidate token
            for token, prob in zip(beam_tokens, beam_probs):
                new_prefix, new_log_prob = update_beam(beam_prefix, beam_log_prob, token, prob)
                new_beams.append((new_prefix, new_log_prob))


        # keep the top num_beams beams
        beams = sorted(new_beams, key=lambda x: x[1], reverse=True)[:num_beams]

    # get the best beam
    best_beam = beams[0][0]

    return best_beam

def generate_string_not_beam(all_models, prefix='<start>', gen_length=10, stop_token='<end>', sampling_method='topK', language="en.csv", add_one=False, max_n=4):
    """ Generates a string using the given language model
    Args:
        all_models: where lm_dict = {key = (n, add_one), value = language_model},
                where language_model = {key = language, value = model},
                where model = {key = prefix, value = probabilities}
        prefix: the prefix
        gen_length: the length of the generation
        stop_token: the token that stops the generation
        sampling_method: the sampling method, can be 'greedy', 'temperature', 'topK', 'topP'
        language: the language of the model
        add_one: whether to use add one smoothing or not
        max_n: the maximum n-gram to use
    Returns:
        generated_string: the generated string
    """
    current_prefix = prefix

    # generate the string token by token
    for _ in range(gen_length):
        # get the correct language model
        current_lm, current_tuple_key_prefix, next_token_probabilities = get_correct_model(all_models=all_models,
                                                                                           prefix=current_prefix, language=language,
                                                                                           add_one=add_one, max_n=max_n)

        # sample the next token
        next_token = select_next_token(next_token_probabilities, sampling_method)

        # if next_token == "<notInModel>": uniform sample from the group (vocabulary - next_token_probabilities.keys())
        if next_token == "<notInModel>":
            tokens_not_in_model = list(set(vocabulary) - set(next_token_probabilities.keys()))
            next_token = random.choice(tokens_not_in_model)

        # update the current prefix
        current_prefix += next_token

        # stop if the stop token was sampled
        if current_prefix.endswith(stop_token) or next_token == "<end>":
            break


    return current_prefix

def update_beam(beam_prefix, beam_log_prob, token, prob):
    """ Updates the beam
    Args:
        beam_prefix: the current beam prefix
        beam_log_prob: the current beam log probability
        token: the token to add to the beam
        prob: the probability of the token
    Returns:
        new_prefix: the new beam prefix
        new_log_prob: the new beam log probability
    """
    # update the prefix and the log probability
    new_prefix = beam_prefix + token
    new_log_prob = beam_log_prob + np.log(prob)

    return new_prefix, new_log_prob

def select_next_token(next_token_probabilities, sampling_method='topK', k_greedy=1, temperature=0.5, top_k=5, p=0.5):
    """ Selects the next token, (greedy, temperature, topK, topP)
    Args:
        next_token_probabilities: the probabilities of the next token
        sampling_method: the sampling method, can be 'greedy', 'temperature', 'topK', 'topP'
        k_greedy: the number of top tokens to consider for greedy sampling
        temperature: the temperature for temperature sampling
        top_k: the number of top tokens to consider for topK sampling
        p: the probability mass for topP sampling
    Returns:
        next_token: the next token
    """
    if sampling_method == 'greedy':
        return sample_greedy(next_token_probabilities, k_greedy)
    elif sampling_method == 'temperature':
        return sample_temperature(next_token_probabilities, temperature)
    elif sampling_method == 'topK':
        return sample_topK(next_token_probabilities, top_k)
    elif sampling_method == 'topP':
        return sample_topP(next_token_probabilities, p)
    else:
        raise ValueError(f'Unknown sampling method: {sampling_method}')
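
Beam search, as used above, scores each hypothesis by the sum of the log-probabilities of its tokens and keeps only the best few after every step. The sketch below shows that idea on a made-up bigram table in which the greedy first choice is not part of the most probable sequence. `toy_table` and `toy_beam_search` are hypothetical names, and the sketch carries a hypothesis over unchanged once it ends with the stop token or has no known continuation.

```python
import math

def toy_beam_search(next_probs, start, steps, num_beams, stop_token="."):
    """Keep the `num_beams` best hypotheses, scored by summed log-probability."""
    beams = [(start, 0.0)]
    for _ in range(steps):
        candidates = []
        for text, score in beams:
            options = next_probs.get(text[-1], {})
            if text.endswith(stop_token) or not options:
                candidates.append((text, score))  # finished: carry over once
                continue
            for token, prob in options.items():
                candidates.append((text + token, score + math.log(prob)))
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:num_beams]
    return beams

# made-up next-character probabilities
toy_table = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"x": 0.5, "y": 0.5},
    "c": {"z": 0.9, "w": 0.1},
}

print(toy_beam_search(toy_table, "a", steps=2, num_beams=1)[0][0])  # abx
print(toy_beam_search(toy_table, "a", steps=2, num_beams=2)[0][0])  # acz
```

With one beam the search commits to `b` (probability 0.6) and ends at a sequence of probability 0.3, while two beams keep `c` alive long enough to find `acz` with probability 0.36.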



In [37]:
test_ = {
    'example1' : {
        'start_tokens' : "H",
        'sampling_method' : ['greedy','beam'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example2' : {
        'start_tokens' : "H",
        'sampling_method' : ['temperature','topK','topP'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example3' : {
        'start_tokens' : "He",
        'sampling_method' : ['greedy','beam','temperature','topK','topP'],
        'gen_length' : "20",
        'stop_token' : "me",
        'generation' : []
    }
}

In [38]:
# Define the parameters
all_models = run_match_language_models
language = "en.csv"
add_one = False

In [39]:
# clear the generations, useful if you want to run the code multiple times
for example in test_:
    test_[example]['generation'] = []

# generate the strings for each example
for example in test_:
    # iterate over the sampling methods
    for i in range(len(test_[example]['sampling_method'])):
        # get the parameters
        sampling_method = test_[example]['sampling_method'][i]
        gen_length = int(test_[example]['gen_length'])
        stop_token = test_[example]['stop_token']
        prefix = test_[example]['start_tokens']

        # generate the string
        generated_string = generate_string(all_models=all_models, prefix=prefix, sampling_method=sampling_method, gen_length=gen_length, stop_token=stop_token, add_one=add_one)

        # cut the start_token from the generated string
        generated_string = generated_string[len(prefix):]

        # store the string
        test_[example]['generation'].append(generated_string)

In [42]:
# print the generations and the number of tokens

# iterate over the examples
for example in test_:
    print(f"Example {example}:")
    start_token = test_[example]['start_tokens']
    print("Start tokens:", start_token)
    # iterate over the sampling methods
    for i in range(len(test_[example]['generation'])):
        sampling_method = test_[example]['sampling_method'][i]
        generated_string = test_[example]['generation'][i]
        print("full string using", sampling_method, ":", start_token + generated_string)
        print("generated string using", sampling_method, ":", generated_string)
        print("generation char length:", len(generated_string))
        print("generation token length:", len(tweet_to_token_tuples(generated_string)))
    print("")

Example example1:
Start tokens: H
full string using greedy : House the s
generated string using greedy : ouse the s
generation char length: 10
generation token length: 10
full string using beam : Healthcare 
generated string using beam : ealthcare 
generation char length: 10
generation token length: 10

Example example2:
Start tokens: H
full string using temperature : HJ ht… so! 
generated string using temperature : J ht… so! 
generation char length: 10
generation token length: 10
full string using topK : Have you wh
generated string using topK : ave you wh
generation char length: 10
generation token length: 10
full string using topP : How https:/
generated string using topP : ow https:/
generation char length: 10
generation token length: 10

Example example3:
Start tokens: He
full string using greedy : Health the so much 24t
generated string using greedy : alth the so much 24t
generation char length: 20
generation token length: 20
full string using beam : Healthcare in that the
genera

In [43]:
### do not change ###
print('-------- NLG --------')

for k,v in test_.items():
  l = ''.join([f'\t{sm} >> {v["start_tokens"]}{g}\n' for sm,g in zip(v['sampling_method'],v['generation'])])
  print(f'{k}:')
  print(l)

-------- NLG --------
example1:
	greedy >> House the s
	beam >> Healthcare 

example2:
	temperature >> HJ ht… so! 
	topK >> Have you wh
	topP >> How https:/

example3:
	greedy >> Health the so much 24t
	beam >> Healthcare in that the
	temperature >> Helie: nigrin.: pagail
	topK >> Herenting a chalized t
	topP >> Here have a be @Calum5



<br><br><br>
# **Good luck!**