# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.

Do make sure all results are uploaded to CSVs (as well as printed to console) for your assignment to be fully graded.

*As a preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
!git clone https://github.com/kfirbar/nlp-course.git

fatal: destination path 'nlp-course' already exists and is not an empty directory.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [2]:

!ls nlp-course/lm-languages-data-new


en.csv     es.json    in.csv     it.json    pt.csv     test.json  tl.csv
en.json    fr.csv     in.json    nl.csv     pt.json    tests.csv  tl.json
es.csv     fr.json    it.csv     nl.json    test.csv   tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [17]:
import os
import json
import pandas as pd
import numpy as np
import random
from sklearn.metrics import f1_score

def preprocess():
    folder_path = "./nlp-course/lm-languages-data-new"
    vocabulary = []

    for file in os.listdir(folder_path):
        if file[-4:] == "json":
            path_json = os.path.join(folder_path, file)
            with open(path_json, "r",  encoding="utf-8") as f:
                json_data = json.load(f)
            for tweet_num, tweet_text in json_data['tweet_text'].items():
                tweet_text = tweet_text
                for character in tweet_text:
                    if character not in vocabulary:
                        vocabulary.append(character)

    vocabulary.append("<s>")
    vocabulary.append("<e>")
    return vocabulary

vocabulary = preprocess()

**Part 2**

Write a function `lm` that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.

In [4]:
def lm(n, vocabulary, data_file_path, add_one):
    # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
    # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
    # data_file_path - the data_file from which we record probabilities for our model
    # add_one - True/False (use add_one smoothing or not)
    vocabulary_size = len(vocabulary)
    dict = {}

    with open(data_file_path, "r", encoding="utf-8") as f:
        json_data = json.load(f)

    for tweet_num, tweet_text in json_data['tweet_text'].items():
        tweet_len = len(tweet_text)

        for j in range(tweet_len):
            i = max(0, j - n + 2)
            sequence = tweet_text[i: j + 1]
            if i == 0:
                sequence = "<s>" * (n - 2 - j) + sequence

            if j == tweet_len - 1:
                next_char_of_sequence = "<e>"
            else:
                next_char_of_sequence = tweet_text[j + 1]

            if sequence in dict:
                dict[sequence]["count"] += 1
                next_char_dict = dict[sequence]["next_char_dict"]
                if next_char_of_sequence in next_char_dict:
                    next_char_dict[next_char_of_sequence] += 1
                else:
                    next_char_dict[next_char_of_sequence] = 1
            else:
                dict[sequence] = {
                    "next_char_dict": {
                        next_char_of_sequence: 1
                    },
                    "count": 1
                }

    probabilities_dict = {}

    for sequence, sequence_data in dict.items():
        sequence_count = sequence_data["count"]
        probabilities_dict[sequence] = {}

        if add_one:
            for next_char in vocabulary:
                if next_char in sequence_data["next_char_dict"]:
                    prob = (sequence_data["next_char_dict"][next_char] + 1) / (sequence_count + vocabulary_size)
                    probabilities_dict[sequence][next_char] = prob
            probabilities_dict[sequence]["not_exist"] = 1 / (sequence_count + vocabulary_size)
        else:
            for next_char, next_char_count in sequence_data["next_char_dict"].items():
                prob = next_char_count / sequence_count
                probabilities_dict[sequence][next_char] = prob

    if add_one:
        probabilities_dict["not_exist"] = 1 / vocabulary_size
    else:
        probabilities_dict["not_exist"] = 10 ** -20

    return probabilities_dict

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

In [5]:
def tweet_perplexity(tweet_text, n, model):
    entropy_tweet = 0
    tweet_len = len(tweet_text)

    for j in range(tweet_len):
        i = max(0, j - n + 2)
        sequence = tweet_text[i: j + 1]
        if i == 0:
            sequence = "<s>" * (n - 2 - j) + sequence
        if j == tweet_len - 1:
            next_char_of_sequence = "<e>"
        else:
            next_char_of_sequence = tweet_text[j + 1]
        if sequence not in model:
            prob = model["not_exist"]
        else:
            if next_char_of_sequence in model[sequence]:
                prob = model[sequence][next_char_of_sequence]
            else:
                if "not_exist" in model[sequence]:
                    prob = model[sequence]["not_exist"]
                else:
                    prob = model["not_exist"]

        entropy_tweet += np.log(prob)

    entropy_tweet = -(1 / tweet_len) * entropy_tweet
    return 2 ** entropy_tweet

In [6]:
def eval(n, model, data_file):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # data_file - the tweets file that you wish to claculate a perplexity score for

    perplexity = 1
    with open(data_file, "r", encoding="utf-8") as f:
        json_data = json.load(f)

    num_tweets = len(json_data['tweet_text'])
    for tweet_num, tweet_text in json_data['tweet_text'].items():
        perplexity_tweet = tweet_perplexity(tweet_text, n, model)
        perplexity += perplexity_tweet

    perplexity = (1 / num_tweets) * perplexity
    return perplexity

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

Save the dataframe to a CSV with the name format: {student_id_1}\_...\_{student_id_n}\_part4.csv

In [7]:
def get_files_and_models(n, vocabulary, add_one):
    folder_path = "./nlp-course/lm-languages-data-new"
    models = {}
    files = {}

    for file in os.listdir(folder_path):
        if file[-4:] == "json":
            language_name = file.split(".")[0]
            if language_name not in ["en", "es", "fr", "in", "it", "nl", "pt", "tl"]:
                continue
            path_json = os.path.join(folder_path, file)
            model = lm(n, vocabulary, path_json, add_one)
            files[language_name] = path_json
            models[language_name] = model

    return models, files

In [8]:
def match(n, add_one, name_csv="part4"):
    # n - the n-gram to use for creating n-gram models
    # add_one - use add_one smoothing or not

    vocabulary = preprocess()
    languages = []
    models, files = get_files_and_models(n, vocabulary, add_one)
    perplexity_df = pd.DataFrame(index=languages, columns=languages)

    for lang1, model in models.items():
        for lang2, file in files.items():
            perplexity = eval(n, model, file)
            perplexity_df.loc[lang1, lang2] = perplexity

    perplexity_df.to_csv(f"312494923_316550797_{name_csv}.csv")
    return perplexity_df

In [None]:
match(3, True)

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

Load each result to a dataframe and save to a CSV with the name format: 

for cases with add_one: {student_id_1}\_...\_{student_id_n}\_n1\_part5.csv

For cases without add_one:
{student_id_1}\_...\_{student_id_n}\_n1\_wo\_addone\_part5.csv

Follow the same format for n2,n3, and n4


In [9]:
def run_match():
    for n in range(1, 5):
        for add_one in [True, False]:
            if add_one:
                name_csv = f"n{n}_part5"
            else:
                name_csv = f"n{n}_wo_addone_part5"
            perplexity_df = match(n, add_one, name_csv)
            print(f"n - {n}, add_one - {str(add_one)}")
            print(perplexity_df)

run_match()

n - 1, add_one - True
           nl         pt         en         it         tl         in  \
nl  13.055351  14.048366  13.801164  13.725641  15.456463  14.441502   
pt  14.429282  12.532321  14.132458  13.645197  15.282229  14.702816   
en  13.737921  14.076758  12.988100  13.560570  14.773840  14.368623   
it  14.173893  13.697841  13.867212  12.937941  15.522299  14.813791   
tl  14.056637  14.329370  13.876513  13.868441  13.641067  13.659617   
in  14.301263  14.584162  14.068862  14.205196  14.340620  13.121042   
fr  14.457423  13.816761  13.982238  13.620349  16.344474  15.124689   
es  14.476219  13.432943  14.037528  13.583052  15.823381  14.899855   

           fr         es  
nl  13.551220  13.579234  
pt  13.299141  12.793658  
en  13.672072  13.551434  
it  13.240264  13.113641  
tl  14.605160  13.832167  
in  14.598011  14.032241  
fr  12.635822  13.426735  
es  13.375765  12.403975  
n - 1, add_one - False
              nl             pt             en           it    

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where a different solution will yield different accuracy scores. any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

In [19]:
def classify(n, add_one=True):
    with open("./nlp-course/lm-languages-data-new/test.json", "r",
              encoding="utf-8") as f:
        test_data = json.load(f)

    vocabulary = preprocess()

    models, files = get_files_and_models(n, vocabulary, add_one)

    labels = []
    clasification_result = []
    score = 0
    languages = list(models.keys())

    for tweet_num, tweet_text in test_data['tweet_text'].items():
        label = test_data["label"][tweet_num]
        labels.append(label)
        perplexities = []
        for lan, model in models.items():
            perplexity = tweet_perplexity(tweet_text, n, model)
            perplexities.append(perplexity)

        guess = languages[np.argmin(perplexities)]
        clasification_result.append(guess)
        if label == guess:
            score += 1

    return clasification_result, labels

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 

Load the results to a CSV (using a DataFrame), with a model_name and f1_score Name it {student_id_1}\_...\_{student_id_n}\_part7.csv



```
  model_name  f1_score
0    Model A      0.85
1    Model B      0.92
2    Model C      0.87
3    Model D      0.90
```



In [20]:
def calc_f1():
    perplexity_df = pd.DataFrame(columns=["model_name", "f1_score"], index=range(6))
    i = 0

    for n in range(2, 5):
        for add_one in [True, False]:
            pred_list, label_list = classify(n, add_one)
            f1 = f1_score(label_list, pred_list, average="macro")

            perplexity_df.loc[i][0] = f"n={n}, add_one={add_one}"
            perplexity_df.loc[i][1] = f1
            i += 1

    perplexity_df.to_csv(f"312494923_316550797_part7.csv")

In [21]:
calc_f1()

<br><br><br><br>
**Part 8**  
Let's use your Language model (dictionary) for generation (NLG).

When it comes to sampling from a language model decoder during text generation, there are several different methods that can be used to control the randomness and diversity of the generated text. 

Some of the most commonly used methods include:

> `Greedy sampling`
In this method, the model simply selects the word with the highest probability as the next word at each time step. This method can produce fluent text, but it can also lead to repetitive or predictable output.

> `Temperature scaling`  
Temperature scaling involves scaling the logits output of the language model by a temperature parameter before softmax normalization. This has the effect of smoothing the distribution of probabilities and increasing the probability of lower-probability words, which can lead to more diverse and creative output.

> `Top-K sampling`  
In this method, the model restricts the sampling to the top-K most likely words at each time step, where K is a predefined hyperparameter. This can generate more diverse output than greedy sampling, while limiting the number of low-probability words that are sampled.

> `Nucleus sampling` (also known as top-p sampling)  
This method restricts the sampling to the smallest possible set of words whose cumulative probability exceeds a certain threshold, defined by a hyperparameter p. Like top-K sampling, this can generate more diverse output than greedy sampling, while avoiding sampling extremely low probability words.

> `Beam search`  
Beam search involves maintaining a fixed number k of candidate output sequences at each time step, and then selecting the k most likely sequences based on their probabilities. This can improve the fluency and coherence of the output, but may not produce as much diversity as sampling methods.

The choice of sampling method depends on the specific application and desired balance between fluency, diversity, and randomness. Hyperparameters such as temperature, K, p, and beam size can also be tuned to adjust the behavior of the language model during sampling.


You may read more about this concept in <a href='https://huggingface.co/blog/how-to-generate#:~:text=pad_token_id%3Dtokenizer.eos_token_id)-,Greedy%20Search,-Greedy%20search%20simply'>this</a> blog post.


**Please added the needed code for each sampeling method:**

In [12]:
def sample_greedy(probabilities):
    return np.argmax(probabilities)


def sample_temperature(probabilities, temperature=1.0, k=1):
    logits = np.log(probabilities) / temperature
    probs = softmax(logits)
    samples = np.random.choice(len(probabilities), size=k, p=probs)
    return samples


def sample_topK(probabilities, k=1):
    sorted_indx = np.argsort(probabilities)[::-1][:k]
    return sorted_indx


def sample_topP(probabilities, p=0.9):
    sum = 0
    sorted_indexs = np.argsort(probabilities)[::-1]
    for i, p_arg in enumerate(sorted_indexs):
        sum += probabilities[p_arg]
        if sum >= p:
            break
    return random.choices(sorted_indexs[:i + 1], weights=np.array(probabilities)[sorted_indexs[:i + 1]], k=1)[0]


def sample_beam(model, gen_length, start_tokens, stop_token, vocabulary, k):
    max_val = 0
    char_max_val = None

    if start_tokens[-len(stop_token):] == stop_token or gen_length == 0:
        return 1, stop_token

    if start_tokens not in model:
        choosen_chars = np.random.choice(vocabulary, size=k)
    else:
        root = model[start_tokens]
        chars = list(root.keys())
        probabilities = list(root.values())
        choosen_chars = np.array(chars)[sample_topK(probabilities, k=k)]

    for char in choosen_chars:
        if start_tokens not in model:
            val, _ = sample_beam(model, gen_length - 1, start_tokens[1:] + char, stop_token,
                                 vocabulary, k)
            prob = model["not_exist"] * val
        else:
            if char not in model[start_tokens]:
                val, _ = sample_beam(model, gen_length - 1, start_tokens[1:] + char,
                                     stop_token, vocabulary, k)
                prob = root["not_exist"] * val
            else:
                val, _ = sample_beam(model, gen_length - 1, start_tokens[1:] + char, stop_token, vocabulary, k)
                prob = root[char] * val
        if prob > max_val:
            max_val = prob
            char_max_val = char

    return max_val, char_max_val

Use your Language Model to generate each one out of the following examples with the coresponding params.    
Notice the 4 core issues: 
- Starting tokens
- Length of the generation
- Sampling methond (use all)
- Stop Token (if this token is sampled, stop generating)

In [13]:
test_ = {
    'example1' : {
        'start_tokens' : "H",
        'sampling_method' : ['greedy','beam'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example2' : {
        'start_tokens' : "H",
        'sampling_method' : ['temperature','topK','topP'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example3' : {
        'start_tokens' : "He",
        'sampling_method' : ['greedy','beam','temperature','topK','topP'],
        'gen_length' : "20",
        'stop_token' : "me",
        'generation' : []
    }
}

Use your LM to generate a string based on the parametes of each examples, and store the generation sequance at the generation list.

In [14]:
def softmax(x):
    return (np.exp(x) / np.sum(np.exp(x)))

In [15]:
vocabulary = preprocess()
path_json = "./nlp-course/lm-languages-data-new/en.json"

for k, v in test_.items():
    start_tokens = v["start_tokens"]
    stop_token = v["stop_token"]
    n = len(start_tokens) + 1
    model = lm(n, vocabulary, path_json, True)
    for sm in v["sampling_method"]:
        generation_result = v["start_tokens"]
        gen_length = int(v["gen_length"])# - len(v["start_tokens"])

        while gen_length != 0:
            start_tokens = generation_result[-(n - 1):]

            next_char = None
            if sm != "beam":
                if start_tokens not in model:
                    next_char = np.random.choice(vocabulary, size=1)[0]
                else:
                    root = model[start_tokens]
                    probabilities = []
                    for c in vocabulary:
                        if c in root:
                            probabilities.append(root[c])
                        else:
                            probabilities.append(root["not_exist"])

            if sm == "greedy":
                next_char = vocabulary[sample_greedy(probabilities)]
            elif sm == "beam":
                _, next_char = sample_beam(model, gen_length, start_tokens, stop_token=stop_token,
                                           vocabulary=vocabulary, k=2)
            elif sm == "temperature":
                smapels = np.array(vocabulary)[sample_temperature(probabilities, temperature=0.5, k=2)]
                next_char = np.random.choice(smapels, size=1)[0]
            elif sm == "topK":
                smapels = sample_topK(probabilities, k=3)
                next_char = \
                random.choices(np.array(vocabulary)[smapels], weights=np.array(probabilities)[smapels], k=1)[0]
            elif sm == "topP":
                next_char = vocabulary[sample_topP(probabilities, p=0.2)]

            generation_result = generation_result + next_char
            if generation_result[-len(stop_token):] == stop_token:
                break
            gen_length -= 1
        v["generation"].append(generation_result[(n-1):])

In [16]:
### do not change ###
print('-------- NLG --------')

for k,v in test_.items():
  l = ''.join([f'\t{sm} >> {v["start_tokens"]}{g}\n' for sm,g in zip(v['sampling_method'],v['generation'])])
  print(f'{k}:')
  print(l)

-------- NLG --------
example1:
	greedy >> Hon t t t t
	beam >> Hous://///t

example2:
	temperature >> Hored T her
	topK >> Hout aren t
	topP >> He the are 

example3:
	greedy >> Heall the the the the 
	beam >> Heally https://t.come
	temperature >> Heal lon a se ou anne 
	topK >> He will aliked ther a 
	topP >> He'<e>r the the the the 



<br><br><br>
# **Good luck!**