# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.

Do make sure all results are uploaded to CSVs (as well as printed to console) for your assignment to be fully graded.

*As a preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [81]:
import unicodedata

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# download the data files from the course git repository, delete comment to run
#!git clone https://github.com/kfirbar/nlp-course.git



---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [3]:
# list the files in the data directory, delete comment to run
#!ls nlp-course/lm-languages-data-new


In [4]:
path = "nlp-course/lm-languages-data-new"
data_files = ["en.csv", "es.csv", "fr.csv", "in.csv", "it.csv", "nl.csv", "pt.csv", "tl.csv"]
test_file = "test.csv"

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [74]:
# a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data.
# the data in the files are in the form: tweet_id,tweet_text

def preprocess():
    """ Creates a vocabulary from the data files
    Returns:
        vocabulary: a list of all the characters that appear in the data files
    """
    vocabulary = set()
    # iterate over all the data files
    for file in data_files:
        # read the data file
        current_data = pd.read_csv(path + "/" + file, encoding="utf-8")
        # iterate over all the tweets in the data file
        for tweet in current_data["tweet_text"]:
            tweet = unicodedata.normalize('NFC', tweet)
            # iterate over all the characters in the tweet
            for char in tweet:
                # add the character to the vocabulary
                vocabulary.add(char)
    # sort the vocabulary
    vocabulary = sorted(list(vocabulary))
    return vocabulary

In [76]:
# call the function
vocabulary = preprocess()
# add <start> and <end> to the vocabulary
vocabulary = ["<start>"] + vocabulary + ["<end>"]
print("vocabulary: ", vocabulary)
print("vocab size: ", len(vocabulary))



vocabulary:  ['<start>', '\n', '\r', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x91', '\x92', '\x9d', '¡', '£', '¤', '¥', '§', '¨', '©', 'ª', '«', '\xad', '®', '¯', '°', '²', '´', '¶', '·', '¸', 'º', '»', '½', '¿', 'À', 'Á', 'Â', 'Ã', 'Å', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ù', 'Ú', 'Ü', 'à', 'á', 'â', 'ã', 'ä', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ė', 'Ğ', 'ğ', 'İ', 'ı', 'ń', 'ō', 'Œ', 'œ', 'Ş', 'ş', 'Š', 'Ÿ', 'ƒ', 'ʔ', 'ʕ', 'ʖ', 'ʰ',

In [89]:
current_data = pd.read_csv(path + "/" + "en.csv", encoding="utf-8")
print(current_data["tweet_text"][69])
print(len('Sixteenth video in our channel'))

test_string = current_data["tweet_text"][69]
test_string = test_string.encode('utf-8')
print(test_string)
for char, index in enumerate(test_string):
    print(char, index)
len('l➡️ h')

Sixteenth video in our channel➡️ how to solve lineal functions‼️ https://t.co/yAXTi5Zog6
30
b'Sixteenth video in our channel\xe2\x9e\xa1\xef\xb8\x8f how to solve lineal functions\xe2\x80\xbc\xef\xb8\x8f https://t.co/yAXTi5Zog6'
0 83
1 105
2 120
3 116
4 101
5 101
6 110
7 116
8 104
9 32
10 118
11 105
12 100
13 101
14 111
15 32
16 105
17 110
18 32
19 111
20 117
21 114
22 32
23 99
24 104
25 97
26 110
27 110
28 101
29 108
30 226
31 158
32 161
33 239
34 184
35 143
36 32
37 104
38 111
39 119
40 32
41 116
42 111
43 32
44 115
45 111
46 108
47 118
48 101
49 32
50 108
51 105
52 110
53 101
54 97
55 108
56 32
57 102
58 117
59 110
60 99
61 116
62 105
63 111
64 110
65 115
66 226
67 128
68 188
69 239
70 184
71 143
72 32
73 104
74 116
75 116
76 112
77 115
78 58
79 47
80 47
81 116
82 46
83 99
84 111
85 47
86 121
87 65
88 88
89 84
90 105
91 53
92 90
93 111
94 103
95 54


5

In [90]:
import nltk

string_with_unicode = "This is a string with some 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 characters"
tokens = nltk.tokenize.word_tokenize(string_with_unicode)
print(tokens)

LookupError: 
**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt/english.pickle[0m

  Searched in:
    - 'C:\\Users\\baruc/nltk_data'
    - 'C:\\Users\\baruc\\anaconda3\\envs\\NLP\\nltk_data'
    - 'C:\\Users\\baruc\\anaconda3\\envs\\NLP\\share\\nltk_data'
    - 'C:\\Users\\baruc\\anaconda3\\envs\\NLP\\lib\\nltk_data'
    - 'C:\\Users\\baruc\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.

In [7]:
# helper functions for lm

def build_ngram_model(current_data, n):
    """ Builds an n-gram model from the given data
    Args:
        current_data: a pandas dataframe with a column named "tweet_text"
        n: the n in n-gram
    Returns:
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: count}}
    """
    model = {}
    # iterate over all the tweets in the data file
    for tweet in current_data["tweet_text"]:
        # define n_gram and n_minus_1_gram
        n_gram = "<start>" + tweet[0:n-1]
        n_minus_1_gram = n_gram[:-1]
        n_th_token = n_gram[-1]
        # add the n-gram to the model
        if n_minus_1_gram not in model:
            model[n_minus_1_gram] = {}
        # add the n_th token to the model
        if n_th_token not in model[n_minus_1_gram]:
            model[n_minus_1_gram][n_th_token] = 0
        # count the n_th token, i.e. add 1 to its count
        model[n_minus_1_gram][n_th_token] += 1
        # iterate over all the n-grams in the tweet
        for i in range(len(tweet) - n + 1):
            n_gram = tweet[i:i+n]
            n_minus_1_gram = n_gram[:-1]
            n_th_token = n_gram[-1]
            # add the n-gram to the model
            if n_minus_1_gram not in model:
                model[n_minus_1_gram] = {}
            # add the n_th token to the model
            if n_th_token not in model[n_minus_1_gram]:
                model[n_minus_1_gram][n_th_token] = 0
            # count the n_th token, i.e. add 1 to its count
            model[n_minus_1_gram][n_th_token] += 1
        n_gram = tweet[-n+1:] + "<end>"
        n_minus_1_gram = tweet[-n+1:]
        n_th_token = "<end>"
        # add the n-gram to the model
        if n_minus_1_gram not in model:
            model[n_minus_1_gram] = {}
        # add the n_th token to the model
        if n_th_token not in model[n_minus_1_gram]:
            model[n_minus_1_gram][n_th_token] = 0
        # count the n_th token, i.e. add 1 to its count
        model[n_minus_1_gram][n_th_token] += 1

    return model


def add_one_smoothing(model, vocabulary):
    """ Adds add_one smoothing to the model
    Args:
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: count}}
        vocabulary: a list of all the tokens in the data
    Returns:
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: count}}
    """
    # iterate over all the n-1 grams in the model
    for n_minus_1_gram in model:
        # iterate over all the tokens in the vocabulary
        for token in vocabulary:
            # add the token to the model
            if token not in model[n_minus_1_gram]:
                model[n_minus_1_gram][token] = 0
            # count the token, i.e. add 1 to its count (add one smoothing)
            model[n_minus_1_gram][token] += 1
    return model

def calculate_probabilities(model):
    """ Calculates the probabilities for the model
    Args:
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: count}}
    Returns:
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    # iterate over all the n-1 grams in the model
    for n_minus_1_gram in model:
        # get the counts of all the tokens
        token_counts = model[n_minus_1_gram].values()
        # calculate the total count
        total_count = sum(token_counts)
        # iterate over all the tokens in the model
        for token in model[n_minus_1_gram]:
            # calculate the probability, i.e. divide the count by the total count
            model[n_minus_1_gram][token] /= total_count
    return model


In [8]:
def lm(n, vocabulary, data_file_path, add_one):
    """ Builds an n-gram model from the given data
    Args:
        n: the n in n-gram
        vocabulary: a list of all the tokens in the data
        data_file_path: the data_file from which we record probabilities for our model
        add_one: True/False (use add_one smoothing or not)
    Returns:
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    # read the data file
    current_data = pd.read_csv(data_file_path,  encoding="utf-8")
    # build the n-gram model
    model = build_ngram_model(current_data, n)
    if add_one:
        # add one smoothing
        model = add_one_smoothing(model, vocabulary)
    # calculate the probabilities
    model = calculate_probabilities(model)
    return model

In [9]:
# # # call the function for the first data file
# lm_model = lm(3, vocabulary, path + "/" + data_files[0], True)

In [10]:
# # print the model[key][token] for a specific key
# key = lm_model.keys().__iter__().__next__()
# probabilities_sorted = sorted(lm_model[key].items(), key=lambda x: x[1], reverse=True)
# # print("key: ", key)
# # print("probabilities: ", probabilities_sorted[:10])
# # print <end> probability


**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

In [11]:
from math import log2

# a function that calculates the perplexity of a model
def calculate_perplexity(current_data, n, model):
    """ Calculates the perplexity of a model
    Args:
        current_data: the data from which we record probabilities for our model
        n: the n in n-gram
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    Returns:
        the perplexity of the model
    """
    log_prob_sum = 0
    n_gram_count = 0

    for tweet in current_data["tweet_text"]:
        # add the first n-gram
        n_minus_1_gram = "<start>" + tweet[:n-2]
        n_th_token = tweet[n-1]
        if n_minus_1_gram in model and n_th_token in model[n_minus_1_gram]:
            log_prob_sum += -log2(model[n_minus_1_gram][n_th_token])
            n_gram_count += 1
        # iterate over all the n-grams in the tweet
        for i in range(len(tweet) - n + 1):
            n_gram = tweet[i:i+n]
            n_minus_1_gram = n_gram[:-1]
            n_th_token = n_gram[-1]

            if n_minus_1_gram in model and n_th_token in model[n_minus_1_gram]:
                log_prob_sum += -log2(model[n_minus_1_gram][n_th_token])
                n_gram_count += 1
        # add the last n-gram
        n_minus_1_gram = tweet[-n+1:]
        n_th_token = "<end>"
        if n_minus_1_gram in model and n_th_token in model[n_minus_1_gram]:
            log_prob_sum += -log2(model[n_minus_1_gram][n_th_token])
            n_gram_count += 1

    if n_gram_count == 0:
        return float('inf')  # Return infinite perplexity if no n-grams found

    entropy = log_prob_sum / n_gram_count
    perplexity = 2 ** entropy
    return perplexity




def eval(n, model, data_file):
    """ Evaluates the perplexity of a model
    Args:
        n: the n in n-gram
        model: a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
        data_file: the data file path for which we want to calculate the perplexity
    Returns:
        the perplexity of the model
    """
    # read the data file
    current_data = pd.read_csv(data_file, encoding="utf-8")
    # calculate the perplexity
    perplexity = calculate_perplexity(current_data, n, model)
    return perplexity

In [12]:
# # call the function
# perplexity = eval(3, lm_model, path + "/" + data_files[0])
# print("perplexity: ", perplexity)

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

Save the dataframe to a CSV with the name format: {student_id_1}\_...\_{student_id_n}\_part4.csv

In [13]:
def match(n, add_one):
    """ Creates a model for every relevant language, using a specific value of n and add_one.
    Then, calculate the perplexity of all possible pairs.
    Args:
        n: the n in n-gram
        add_one: whether to use add one smoothing or not
    Returns:
        df: a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages.
        models: a dictionary of the models, so that we can use them later,
                i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    # create a dataframe
    df = pd.DataFrame(columns=data_files, index=data_files)

    # create models for every language
    models = compute_data_files_models(data_files, n, vocabulary, path, add_one)

    # calculate the perplexity of all possible pairs
    for lang1 in data_files: # will be the model
        # define the model
        current_model = models[lang1]
        for lang2 in data_files: # will be the data file
            # define the data file
            current_data_file = path + "/" + lang2
            # evaluate the model
            perplexity = eval(n, current_model, current_data_file)
            # save the perplexity to the dataframe
            df[lang1][lang2] = perplexity
    # TODO: need to return only df
    return df, models # return the dataframe and the models, so that we can use them later

def compute_data_files_models(data_files, n, vocabulary, path , add_one):
    """ Creates a model for every relevant language, using a specific value of n and add_one.
    Args:
        data_files: the data files to create models for
        n: the n in n-gram
        vocabulary: the vocabulary
        path: the path to the data files
        add_one: whether to use add one smoothing or not
    Returns:
        models: a dictionary of the models, so that we can use them later,
                i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    models = {}
    for data_file in data_files:
        models[data_file] = lm(n, vocabulary, path + "/" + data_file, add_one)
    return models


In [14]:
# # call the function
# df_part4, models_part4 = match(3, True)
# print("dataframe: ")
# print(df_part4)

In [15]:
# # save the dataframe to a CSV
# df_part4.to_csv("language_perplexity_part4.csv")

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

Load each result to a dataframe and save to a CSV with the name format: 

for cases with add_one: {student_id_1}\_...\_{student_id_n}\_n1\_part5.csv

For cases without add_one:
{student_id_1}\_...\_{student_id_n}\_n1\_wo\_addone\_part5.csv

Follow the same format for n2,n3, and n4


In [16]:
def run_match():
    """ Runs the match function for all the n values and add_one values
    Returns:
        dataframes: a dictionary of the dataframes, so that we can use them later,
                    i.e {n, add_one: dataframe} where dataframe is a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages.
        language_models_dict: a dictionary of the models, so that we can use them later,
                              i.e {n, add_one: language_models} where language_models is a dictionary of the models, i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    n_values = [1, 2, 3] # TODO: change to [1, 2, 3, 4], memorry problem, change match function to use only dataframes
    add_one_values = [True, False]
    # create dictionaries for the dataframes, key = (n, add_one), value = dataframe
    dataframes = {}
    # create a dictionary for the language models, key = (n, add_one), value = language models = {language: model}
    language_models_dict = {}
    # iterate over all the n values
    for n in n_values:
        # iterate over all the add_one values
        for add_one in add_one_values:
            # create the dataframe and the language models, using the match function
            current_df, current_language_models = match(n, add_one)
            print("completed n = " + str(n) + ", add_one = " + str(add_one) + "!")
            # add the dataframe to the dictionary
            dataframes[(n, add_one)] = current_df
            # add the language models to the dictionary
            language_models_dict[(n, add_one)] = current_language_models
            # save the dataframe to a CSV
            if add_one:
                current_df.to_csv("language_perplexity_n" + str(n) + "_part5.csv")
            else:
                current_df.to_csv("language_perplexity_n" + str(n) + "_wo_addone_part5.csv")
    return dataframes, language_models_dict # return the dataframes and the language models, so that we can use them later


In [17]:
run_match_dataframes, run_match_language_models = run_match()


completed n = 1, add_one = True!
completed n = 1, add_one = False!
completed n = 2, add_one = True!
completed n = 2, add_one = False!
completed n = 3, add_one = True!
completed n = 3, add_one = False!


In [18]:
import pickle

def save_models_to_pickle(language_models_dict=run_match_language_models, filename="run_match_language_models.pickle"):
    """ Saves the language models to a pickle file
    Args:
        language_models_dict: a dictionary of the models, i.e {n, add_one: language_models} where language_models is a dictionary of the models, i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    # save run_match_language_models to a pickle file, so that we can use it later
    with open('run_match_language_models.pickle', 'wb') as handle:
        pickle.dump(language_models_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

def save_dataframes_to_pickle(dataframes_dict=run_match_dataframes, filename="run_match_dataframes.pickle"):
    """ Saves the dataframes to a pickle file
    Args:
        dataframes_dict: a dictionary of the dataframes, i.e {n, add_one: dataframe} where dataframe is a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages.
    """
    # save run_match_dataframes to a pickle file, so that we can use it later
    with open('run_match_dataframes.pickle', 'wb') as handle:
        pickle.dump(dataframes_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)


In [19]:
import pickle

def load_models_from_pickle(filename="run_match_language_models.pickle"):
    """ Loads the language models from a pickle file
    Returns:
        language_models_dict: a dictionary of the models, i.e {n, add_one: language_models} where language_models is a dictionary of the models, i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    """
    # load run_match_language_models from a pickle file
    with open('run_match_language_models.pickle', 'rb') as handle:
        language_models_dict = pickle.load(handle)
    return language_models_dict

def load_dataframes_from_pickle(filename="run_match_dataframes.pickle"):
    """ Loads the dataframes from a pickle file
    Returns:
        dataframes_dict: a dictionary of the dataframes, i.e {n, add_one: dataframe} where dataframe is a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages.
    """

In [20]:
# save the language models to a pickle file
save_models_to_pickle(run_match_language_models, "run_match_language_models.pickle")

In [21]:
# load the language models from a pickle file
run_match_language_models = load_models_from_pickle("run_match_language_models.pickle")

In [22]:
# test the language models dictionary

test_n = 3
test_add_one = True
test_language = "en.csv"
test_language_models = run_match_language_models[(test_n, test_add_one)][test_language]
test_n_minus_1_gram = "<start>a"

next_token_probability = test_language_models[test_n_minus_1_gram]
next_token_probability = sorted(next_token_probability.items(), key=lambda x: x[1], reverse=True)
print(next_token_probability)

[('l', 0.002201430930104568), ('n', 0.002201430930104568), ('r', 0.001651073197578426), ('i', 0.001100715465052284), ('t', 0.001100715465052284), ('m', 0.001100715465052284), ('p', 0.001100715465052284), ('<start>', 0.000550357732526142), ('\n', 0.000550357732526142), ('\r', 0.000550357732526142), (' ', 0.000550357732526142), ('!', 0.000550357732526142), ('"', 0.000550357732526142), ('#', 0.000550357732526142), ('$', 0.000550357732526142), ('%', 0.000550357732526142), ('&', 0.000550357732526142), ("'", 0.000550357732526142), ('(', 0.000550357732526142), (')', 0.000550357732526142), ('*', 0.000550357732526142), ('+', 0.000550357732526142), (',', 0.000550357732526142), ('-', 0.000550357732526142), ('.', 0.000550357732526142), ('/', 0.000550357732526142), ('0', 0.000550357732526142), ('1', 0.000550357732526142), ('2', 0.000550357732526142), ('3', 0.000550357732526142), ('4', 0.000550357732526142), ('5', 0.000550357732526142), ('6', 0.000550357732526142), ('7', 0.000550357732526142), ('8',

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where a different solution will yield different accuracy scores. any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

In [23]:
# this function classifies the test sentences
def classify():
    """ Classifies the test sentences
    Returns:
        classification_result: a list of tuples, where each tuple contains the tweet_id, the sentence, the true language, and the predicted language
    """
    # we will use the language models from part 5, with n = 3 and add_one = True
    language_models = run_match_language_models[(3, True)]

    # read the test file, tweet_id, tweet_text, label
    test_data = pd.read_csv(path + "/test.csv",  encoding="utf-8")

    # classify the sentences
    classification_result = []

    # iterate over the rows in the test data
    for index, row in test_data.iterrows():
        tweet_id = row['tweet_id']
        sentence = row['tweet_text']
        true_language = row['label']
        predicted_language = ''

        predicted_language = single_classification(sentence, language_models)

        # add the result to the classification_result list
        classification_result.append((tweet_id, sentence, true_language, predicted_language))

    return classification_result


# this function classifies a single sentence
def single_classification(sentence, language_models = run_match_language_models[(3, True)]):
    """ Classifies a single sentence
    Args:
        sentence: the sentence to classify
        language_models: the language models to use for classification
                         i.e {language: model} where model is a dictionary representing the n-gram model, i.e {n-1_gram: {n_th_token: probability}}
    Returns:
        predicted_language: the predicted language of the sentence
    """
    predicted_language = ''
    min_perplexity = float('inf')
    # iterate over the language models
    for data_file in data_files:
        current_model = language_models[data_file]
        # create a temporary DataFrame, with the sentence as the only row
        temp_df = pd.DataFrame([sentence], columns=['tweet_text'])
        # calculate the perplexity using the temporary DataFrame
        current_perplexity = calculate_perplexity(temp_df, 3, current_model)

        if current_perplexity < min_perplexity:
            min_perplexity = current_perplexity
            predicted_language = data_file[:-4] # remove the .csv from the end of the file name
    return predicted_language



classification_result = classify()

In [24]:
# print real language and predicted language
count_correct = 0
for result in classification_result:
    print(result[2] + " " + result[3])
    if result[2] == result[3]:
        count_correct += 1
print("accuracy = " + str(count_correct / len(classification_result)))



en en
it it
tl tl
nl nl
tl tl
in in
pt pt
nl nl
fr fr
pt pt
en en
tl tl
pt pt
es nl
es es
fr fr
en en
nl nl
fr fr
nl en
es es
fr fr
en en
pt pt
tl tl
pt pt
en en
fr fr
fr fr
in in
nl nl
tl tl
nl en
tl tl
it it
nl nl
in in
es es
es es
nl nl
fr fr
en en
es es
fr fr
it it
es es
nl nl
es es
tl tl
tl tl
en en
nl nl
pt pt
tl tl
nl nl
en en
pt pt
tl tl
it pt
en en
pt pt
it fr
tl tl
fr fr
en en
tl tl
tl in
nl nl
it it
tl tl
fr fr
pt pt
in in
pt pt
in in
nl en
pt pt
tl tl
es es
tl tl
en en
tl tl
en en
in in
nl nl
fr fr
nl nl
in in
fr fr
tl tl
en en
pt pt
es es
nl nl
tl tl
nl nl
tl tl
tl tl
it it
es pt
in in
pt pt
in in
en en
es es
nl nl
pt pt
en en
pt pt
it it
es es
in in
pt pt
nl nl
tl en
fr fr
tl tl
it it
es es
fr fr
nl nl
en en
en en
pt pt
pt it
es es
nl nl
en en
pt pt
es es
pt pt
it it
tl tl
it pt
es es
es es
fr fr
pt pt
it tl
en en
fr fr
fr fr
in in
it it
es es
es es
in in
tl tl
pt pt
pt pt
pt pt
in in
in in
en en
fr fr
fr fr
it it
tl tl
nl nl
it it
es es
fr fr
fr fr
nl nl
it pt
es es
nl n

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 

Load the results to a CSV (using a DataFrame), where the row indicates the F1 results, and the columns indicate the model used. Name it {student_id_1}\_...\_{student_id_n}\_part7.csv

In [25]:
# we will use the following dictionary to convert the strings to numbers
language_to_number = {}
# we will use the following dictionary to convert the numbers back to strings
number_to_language = {}

number = 0
# iterate over the classification results
for result in classification_result:
    if result[2] not in language_to_number:
        language_to_number[result[2]] = number
        number_to_language[number] = result[2]
        number += 1
    if result[3] not in language_to_number:
        language_to_number[result[3]] = number
        number_to_language[number] = result[3]
        number += 1

In [26]:
import sklearn.metrics as metrics
# The f1 score = 2 *(TP / (2TP + FP + FN))
def calc_f1(result):
    """ Calculates the f1 score of the classification result
    Args:
        result: a list of tuples, where each tuple contains the tweet_id, the sentence, the true language, and the predicted language
    Returns:
        f1_score: the f1 score of the classification result
    """
    # create a DataFrame with the results
    df = pd.DataFrame(result, columns=['tweet_id', 'tweet_text', 'true_language', 'predicted_language'])
    # drop the tweet_id and tweet_text columns
    df = df.drop(columns=['tweet_id', 'tweet_text'])
    # convert the true_language and predicted_language columns to numbers
    df['true_language'] = df['true_language'].apply(lambda x: language_to_number[x])
    df['predicted_language'] = df['predicted_language'].apply(lambda x: language_to_number[x])
    # calculate the f1 score
    f1_score = metrics.f1_score(df['true_language'], df['predicted_language'], average='macro')
    return f1_score




In [27]:
f1_score = calc_f1(classification_result)
print(f1_score)

0.9238587630575827


<br><br><br><br>
**Part 8**  
Let's use your Language model (dictionary) for generation (NLG).

When it comes to sampling from a language model decoder during text generation, there are several different methods that can be used to control the randomness and diversity of the generated text. 

Some of the most commonly used methods include:

> `Greedy sampling`
In this method, the model simply selects the word with the highest probability as the next word at each time step. This method can produce fluent text, but it can also lead to repetitive or predictable output.

> `Temperature scaling`  
Temperature scaling involves scaling the logits output of the language model by a temperature parameter before softmax normalization. This has the effect of smoothing the distribution of probabilities and increasing the probability of lower-probability words, which can lead to more diverse and creative output.

> `Top-K sampling`  
In this method, the model restricts the sampling to the top-K most likely words at each time step, where K is a predefined hyperparameter. This can generate more diverse output than greedy sampling, while limiting the number of low-probability words that are sampled.

> `Nucleus sampling` (also known as top-p sampling)  
This method restricts the sampling to the smallest possible set of words whose cumulative probability exceeds a certain threshold, defined by a hyperparameter p. Like top-K sampling, this can generate more diverse output than greedy sampling, while avoiding sampling extremely low probability words.

> `Beam search`  
Beam search involves maintaining a fixed number k of candidate output sequences at each time step, and then selecting the k most likely sequences based on their probabilities. This can improve the fluency and coherence of the output, but may not produce as much diversity as sampling methods.

The choice of sampling method depends on the specific application and desired balance between fluency, diversity, and randomness. Hyperparameters such as temperature, K, p, and beam size can also be tuned to adjust the behavior of the language model during sampling.


You may read more about this concept in <a href='https://huggingface.co/blog/how-to-generate#:~:text=pad_token_id%3Dtokenizer.eos_token_id)-,Greedy%20Search,-Greedy%20search%20simply'>this</a> blog post.


**Please added the needed code for each sampeling method:**

In [28]:
def softmax(probabilities):
    """ Applies the softmax function to the probabilities
    Args:
        probabilities: a dictionary of probabilities (not yet probalities, just numbers)
    Returns:
        probabilities: a dictionary of probabilities (normalized)
    """
    np_probabilities = np.array(list(probabilities.values()))
    np_probabilities = np.exp(np_probabilities)
    np_probabilities = np_probabilities / np.sum(np_probabilities)
    # convert the numpy array back to a dictionary
    probabilities = {key: value for key, value in zip(probabilities.keys(), np_probabilities)}
    return probabilities

def make_prob_1(probabilities):
    """ Makes the sum of the probabilities equal to 1
    Args:
        probabilities: a dictionary of probabilities (not yet probalities, just numbers)
    Returns:
        probabilities: a dictionary of probabilities (normalized)
    """

    # make the sum of the probabilities equal to 1
    # by dividing each probability by the sum of all probabilities
    sum_prob = sum(probabilities)
    probabilities = [prob / sum_prob for prob in probabilities]
    return probabilities

def get_correct_model(all_models=run_match_language_models, prefix="h", language="en.csv", add_one=True):
    """ Gets the correct language model, the correct n-1 prefix, and the probabilities of the next token
    Args:
        all_models: a dictionary of language models
        prefix: the prefix
        language: the language
        add_one: whether to use add-one smoothing or not
    Returns:
        correct_model: the correct language model (dictionary), where the keys are the prefixes and the values are the probabilities of the next token
        current_n_minus_1_prefix: the correct key for the language model
        next_token_probabilities: the probabilities of the next token (dictionary), where the keys are the tokens and the values are the probabilities
    """
    max_n = 3
    # TODO: need to consider <start> token, maybe add parameter to function
    start_token_length = len("<start>")

    # we want to use maximum n-gram we can, but not more than max_n
    n = min(max_n, len(prefix) + 1)

    # get the n-gram model
    correct_model = all_models[(n, add_one)][language]

    # get the n-1 prefix
    current_n_minus_1_prefix = prefix[-(n - 1):]

    # get the probabilities of the next token. if the prefix is not in the language model, sample a random token
    # TODO: maybe change to STOP when key is not in the language model
    if current_n_minus_1_prefix not in correct_model:
        next_token_probabilities = {token: 1 / len(correct_model) for token in correct_model}
    else:
        next_token_probabilities = correct_model[current_n_minus_1_prefix]

    return correct_model, current_n_minus_1_prefix, next_token_probabilities


In [34]:
# probabilities = {key = next_token, value = probability}
def sample_greedy(probabilities, k=1):
    """ Samples the next token greedily, i.e. the token with the highest probability
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
    Returns:
        max_token: the token with the highest probability
    """
    # sample the token with the k highest probability

    # sort the probabilities dictionary by the values
    sorted_probabilities = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)
    # if k is larger than the number of probabilities, set k to the number of probabilities
    k = k if len(sorted_probabilities) >= k else len(sorted_probabilities)
    # sample the token with the k highest probability
    next_token = sorted_probabilities[k - 1][0]

    return next_token


# probabilities = {key = next_token, value = probability}
def sample_temperature(probabilities, temperature=1.0, k=1):
    """ Samples the next token using temperature sampling
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        temperature: the temperature
    Returns:
        next_token: the sampled token
    """
    # scale the probabilities by the temperature
    probabilities = {key: value ** (1 / temperature) for key, value in probabilities.items()}
    # softmax the probabilities
    probabilities = softmax(probabilities)
    # sample from the probabilities dictionary, use the np.random.choice function
    np_probabilities = np.array(list(probabilities.values()))
    np_tokens = np.array(list(probabilities.keys()))
    next_token = np.random.choice(np_tokens, p=np_probabilities)
    # return the sampled token
    return next_token


# probabilities = {key = next_token, value = probability}
def sample_topK(probabilities, k=1):
    """ Samples the next token using top-k sampling, i.e. only the top k tokens are considered
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        k: the number of tokens to consider
    Returns:
        next_token: the sampled token
    """
    # sort the probabilities dictionary by the values
    sorted_probabilities = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)
    # take the top k
    top_k = sorted_probabilities[:k]
    # split the top k into tokens and probabilities
    top_k_probs = [prob for (token, prob) in top_k]
    top_k_tokens = [token for (token, prob) in top_k]
    # make the sum of the probabilities equal to 1
    top_k_probs = make_prob_1(top_k_probs)
    # sample from the top k tokens
    next_token = np.random.choice(top_k_tokens, p=top_k_probs)
    return next_token

# probabilities = {key = next_token, value = probability}
def sample_topP(probabilities, p=0.9):
    """ Samples the next token using top-p sampling,
    i.e. only the tokens with the highest probabilities are considered, until the sum of the probabilities is greater than p
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        p: the threshold
    Returns:
        next_token: the sampled token
    """
    # sort the probabilities dictionary by the values
    sorted_probabilities = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)
    current_sum = 0
    top_p_tokens = []
    top_p_probs = []
    current_index = 0
    while current_sum < p:
        top_p_tokens.append(sorted_probabilities[current_index][0])
        top_p_probs.append(sorted_probabilities[current_index][1])
        current_sum += sorted_probabilities[current_index][1]
        current_index += 1
    # make the sum of the probabilities equal to 1
    top_p_probs = make_prob_1(top_p_probs)
    # sample from the top p tokens
    next_token = np.random.choice(top_p_tokens, p=top_p_probs)
    return next_token


def sample_beam(probabilities, num_beams = 3):
    """ Samples the next tokens using beam search, i.e., keeps the top num_beams hypotheses at each step.
        Helper function for beam_search
    Args:
        probabilities: a dictionary of probabilities, i.e. {key = next_token, value = probability}
        num_beams: the number of beams to keep, i.e. the number of hypotheses to keep at each step
    Returns:
        beam_tokens: a list of top num_beams tokens
        beam_probs: a list of the corresponding probabilities of the top num_beams tokens
    """
    sorted_probabilities = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)
    top_beams = sorted_probabilities[:num_beams]

    beam_tokens = [token for (token, prob) in top_beams]
    beam_probs = [prob for (token, prob) in top_beams]

    return beam_tokens, beam_probs




Use your Language Model to generate each one out of the following examples with the coresponding params.    
Notice the 4 core issues: 
- Starting tokens
- Length of the generation
- Sampling methond (use all)
- Stop Token (if this token is sampled, stop generating)

In [35]:
test_ = {
    'example1' : {
        'start_tokens' : "H",
        'sampling_method' : ['greedy','beam'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example2' : {
        'start_tokens' : "H",
        'sampling_method' : ['temperature','topK','topP'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example3' : {
        'start_tokens' : "He",
        'sampling_method' : ['greedy','beam','temperature','topK','topP'],
        'gen_length' : "20",
        'stop_token' : "me",
        'generation' : []
    }
}

In [45]:
def generate_string(all_models, prefix, sampling_method, gen_length, stop_token, num_beams=5):
    """ Generates a string using the given language model
    Args:
        all_models: where lm_dict = {key = (n, add_one), value = language_model},
                where language_model = {key = language, value = model},
                where model = {key = prefix, value = probabilities}
        prefix: the prefix
        sampling_method: the sampling method, can be 'greedy', 'temperature', 'topK', 'topP', 'beam'
        gen_length: the length of the generation
        stop_token: the token that stops the generation
        num_beams: the number of beams to keep (only relevant for beam search)
    Returns:
        generated_string: the generated string
    """
    if sampling_method == 'beam':
        return beam_search(all_models, prefix, gen_length, stop_token, num_beams)
    else:
        return generate_string_not_beam(all_models, prefix, gen_length, stop_token, sampling_method)

def beam_search(all_models, prefix, gen_length, stop_token, num_beams):
    """ Generates a string using beam search
    Args:
        all_models: where lm_dict = {key = (n, add_one), value = language_model},
                where language_model = {key = language, value = model},
                where model = {key = prefix, value = probabilities}
        prefix: the prefix
        gen_length: the length of the generation
        stop_token: the token that stops the generation
        num_beams: the number of beams to keep (only relevant for beam search)
    Returns:
        generated_string: the generated string
    """
    # initialize the beams
    beams = [(prefix, 0)]  # (prefix, log_prob)

    # generate the string token by token
    for _ in range(gen_length):
        new_beams = []

        # sample the next token for each beam
        for beam_prefix, beam_log_prob in beams:
            # get the correct language model
            current_lm, current_n_minus_1_prefix, next_token_probabilities = get_correct_model(all_models, beam_prefix)

            # sample the top num_beams tokens and probabilities
            beam_tokens, beam_probs = sample_beam(next_token_probabilities, num_beams)

            # update the beams
            for token, prob in zip(beam_tokens, beam_probs):
                new_prefix, new_log_prob = update_beam(beam_prefix, beam_log_prob, token, prob, stop_token)
                new_beams.append((new_prefix, new_log_prob))

        # keep the top num_beams beams
        beams = sorted(new_beams, key=lambda x: x[1], reverse=True)[:num_beams]

    # get the best beam
    best_beam = beams[0][0]

    return best_beam

def generate_string_not_beam(all_models, prefix, gen_length, stop_token, sampling_method):
    """ Samples the next tokens using the given language model
    Args:
        all_models: where all_models = {key = (n, add_one), value = language_model},
                where language_model = {key = language, value = model},
                where model = {key = prefix, value = probabilities}
        prefix: the prefix
        gen_length: the length of the generation
        stop_token: the token that stops the generation
        sampling_method: the sampling method, can be 'greedy', 'temperature', 'topK', 'topP'
    Returns:
        generated_string: the generated string
    """
    current_prefix = prefix

    # generate the string token by token
    for i in range(gen_length):
        # get the correct language model
        current_lm, current_n_minus_1_prefix, next_token_probabilities = get_correct_model(all_models, current_prefix)

        # sample the next token
        next_token = select_next_token(next_token_probabilities, sampling_method)
        print("next_token: ", next_token)
        print("current_prefix: ", current_prefix)
        print("len(current_prefix): ", len(current_prefix))
        # update the current prefix
        current_prefix += next_token

        # stop if the stop token is sampled
        if next_token == stop_token or next_token == "<end>":
            break

    return current_prefix

def update_beam(beam_prefix, beam_log_prob, token, prob, stop_token):
    """ Updates the beam, i.e. the current prefix and the current log probability
    Args:
        beam_prefix: the current beam prefix
        beam_log_prob: the current beam log probability
        token: the next token
        prob: the probability of the next token
        stop_token: the token that stops the generation
    Returns:
        new_prefix: the new beam prefix
        new_log_prob: the new beam log probability
    """

    # update the prefix and the log probability
    new_prefix = beam_prefix + token
    new_log_prob = beam_log_prob + np.log(prob)

    # remove the last token if it is the stop token, TODO: check if this is correct, or if we should consider shorter beams
    if token == stop_token:
        new_prefix = new_prefix[:-1]

    return new_prefix, new_log_prob

def select_next_token(next_token_probabilities, sampling_method, k_greedy=1, temperature=0.5, top_k=5, p=0.9):
    """ Selects the next token using the given sampling method
    Args:
        next_token_probabilities: the probabilities of the next tokens
        sampling_method: the sampling method, can be 'greedy', 'temperature', 'topK', 'topP'
    Returns:
        next_token: the next token
    """
    if sampling_method == 'greedy':
        return sample_greedy(next_token_probabilities, k_greedy)
    elif sampling_method == 'temperature':
        return sample_temperature(next_token_probabilities, temperature)
    elif sampling_method == 'topK':
        return sample_topK(next_token_probabilities, top_k)
    elif sampling_method == 'topP':
        return sample_topP(next_token_probabilities, p)
    else:
        raise ValueError(f'Unknown sampling method: {sampling_method}')



Use your LM to generate a string based on the parametes of each examples, and store the generation sequance at the generation list.

In [46]:
# Define the parameters
all_models = run_match_language_models
language = "en.csv"
add_one = True

In [55]:
current_model = all_models[(3, True)][language]
print(current_model["💪🏽"])

{'😘': 0.0011061946902654867, '<end>': 0.00165929203539823, '<start>': 0.0005530973451327434, '\n': 0.0005530973451327434, '\r': 0.0005530973451327434, ' ': 0.0005530973451327434, '!': 0.0005530973451327434, '"': 0.0005530973451327434, '#': 0.0005530973451327434, '$': 0.0005530973451327434, '%': 0.0005530973451327434, '&': 0.0005530973451327434, "'": 0.0005530973451327434, '(': 0.0005530973451327434, ')': 0.0005530973451327434, '*': 0.0005530973451327434, '+': 0.0005530973451327434, ',': 0.0005530973451327434, '-': 0.0005530973451327434, '.': 0.0005530973451327434, '/': 0.0005530973451327434, '0': 0.0005530973451327434, '1': 0.0005530973451327434, '2': 0.0005530973451327434, '3': 0.0005530973451327434, '4': 0.0005530973451327434, '5': 0.0005530973451327434, '6': 0.0005530973451327434, '7': 0.0005530973451327434, '8': 0.0005530973451327434, '9': 0.0005530973451327434, ':': 0.0005530973451327434, ';': 0.0005530973451327434, '<': 0.0005530973451327434, '=': 0.0005530973451327434, '>': 0.00

In [49]:
# clear the generations, useful if you want to run the code multiple times
for example in test_:
    test_[example]['generation'] = []

# generate the strings for each example
for example in test_:
    print(f"Example {example}:")
    # generate the string for each sampling method
    for i in range(len(test_[example]['sampling_method'])):
        print("sampling method:", test_[example]['sampling_method'][i])
        # get the parameters
        sampling_method = test_[example]['sampling_method'][i]
        gen_length = int(test_[example]['gen_length'])
        stop_token = test_[example]['stop_token']
        prefix = test_[example]['start_tokens']

        # generate the string
        generated_string = generate_string(all_models, prefix=prefix, sampling_method=sampling_method, gen_length=gen_length, stop_token=stop_token)
        # store the string
        test_[example]['generation'].append(generated_string)

Example example1:
sampling method: greedy
next_token:  o
current_prefix:  H
len(current_prefix):  1
next_token:  u
current_prefix:  Ho
len(current_prefix):  2
next_token:   
current_prefix:  Hou
len(current_prefix):  3
next_token:  a
current_prefix:  Hou 
len(current_prefix):  4
next_token:   
current_prefix:  Hou a
len(current_prefix):  5
next_token:  s
current_prefix:  Hou a 
len(current_prefix):  6
next_token:  o
current_prefix:  Hou a s
len(current_prefix):  7
next_token:  n
current_prefix:  Hou a so
len(current_prefix):  8
next_token:   
current_prefix:  Hou a son
len(current_prefix):  9
next_token:  t
current_prefix:  Hou a son 
len(current_prefix):  10
sampling method: beam
Example example2:
sampling method: temperature
next_token:  1
current_prefix:  H
len(current_prefix):  1
next_token:  ญ
current_prefix:  H1
len(current_prefix):  2
next_token:  💪🏽
current_prefix:  H1ญ
len(current_prefix):  3
next_token:  🖖
current_prefix:  H1ญ💪🏽
len(current_prefix):  5
next_token:  z2
current

In [50]:
# print the generation results for each example

# iterate over the examples
for example in test_:
    print(f"Example {example}:")
    # iterate over the sampling methods
    for i in range(len(test_[example]['generation'])): # we don't want to print the beam search
        print("Generation using", test_[example]['sampling_method'][i], ":", test_[example]['generation'][i])
        print("generation length:", len(test_[example]['generation'][i]))
    print("")

Example example1:
Generation using greedy : Hou a son t
generation length: 11
Generation using beam : Her the the
generation length: 11

Example example2:
Generation using temperature : H1ญ💪🏽🖖z2ｍFj더🎹 🚴
generation length: 15
Generation using topK : Homant hate
generation length: 11
Generation using topP : Hピ8k💭<start>/yX방?p²q9卒
generation length: 22

Example example3:
Generation using greedy : Heall the the the the 
generation length: 22
Generation using beam : Her the the the the th
generation length: 22
Generation using temperature : He🎪sb撮fz🥐2°通👪👫😡dJ🎢3sذ:@📧️‼🦉g7😽KP
generation length: 32
Generation using topK : He shat art htteryould
generation length: 22
Generation using topP : He❥✨👑선b 결sp✰25유2O☑Qz🐫e*💿SI4📍🐘📦↗
generation length: 31



In [None]:
### do not change ###
print('-------- NLG --------')

for k,v in test_.items():
  l = ''.join([f'\t{sm} >> {v["start_tokens"]}{g}\n' for sm,g in zip(v['sampling_method'],v['generation'])])
  print(f'{k}:')
  print(l)

<br><br><br>
# **Good luck!**