# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
# imports
import pandas as pd
import numpy as np
from itertools import combinations
from collections import Counter
import time
import glob
import os 
from sklearn.metrics import f1_score
from IPython.display import display


In [2]:
!git clone https://github.com/kfirbar/nlp-course.git

fatal: destination path 'nlp-course' already exists and is not an empty directory.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [3]:

!ls nlp-course/lm-languages-data-new


en.csv     es.json    in.csv     it.json    pt.csv     test.json  tl.csv
en.json    fr.csv     in.json    nl.csv     pt.json    tests.csv  tl.json
es.csv     fr.json    it.csv     nl.json    test.csv   tests.json


In [4]:
data_files = {'en_df': 'en.csv',
              'es_df': 'es.csv',
              'fr_df': 'fr.csv',
              'in_df': 'in.csv',
              'it_df': 'it.csv',
              'nl_df': 'nl.csv',
              'pt_df': 'pt.csv',
              'tl_df': 'tl.csv'}

    
directory = 'nlp-course/lm-languages-data-new/'    
for (key, value) in data_files.items():
    data_files[key] = directory + value
    
languages_list = list(data_files.keys())
start_token = '↠'
end_token = '↞'

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
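As a small illustration of the token definition above, the sketch below collects the distinct characters of a few made-up tweets. The list `toy_tweets` is hypothetical and stands in for the tweet-text column of the real files. Note that iterating over a Python `str` yields Unicode characters, not UTF-8 bytes, which is assumed here to be the intended reading of "character".

```python
# Toy stand-in for the tweet-text column of the data files (hypothetical data).
toy_tweets = ["hola", "olá!", "hello"]

# Iterating over a str yields one character at a time, so a set union
# over all tweets collects every character that appears at least once.
toy_vocabulary = sorted(set().union(*toy_tweets))

print(toy_vocabulary)
print(len(toy_vocabulary))
```

Sorting is only for readable output; the assignment asks for a plain list, so any order would do.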

In [5]:
def preprocess(data_files):
    """
    Build the vocabulary: every character seen at least once in the data.

    Each data file is a table with 2 columns:
        1. tweet id
        2. tweet text
    """
    tokens = set()
    for path in data_files.values():
        df = pd.read_csv(path)
        # the tweet text is the last column of every file
        for text in df[df.columns[-1]].values:
            tokens.update(text)
    return list(tokens)

In [6]:
vocabulary = preprocess(data_files)

**Part 2**

Write a function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant sequences of n-1 tokens, and the values are dictionaries that map each possible n-th token to its probability of occurring after that sequence. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

This means, for example, that after the sequence "ab" there is a 0.5 chance that "c" will appear, a 0.25 chance for "b" and a 0.25 chance for "d".

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
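To make the dictionary layout and the smoothing concrete, here is a minimal sketch on a six-character toy string. The function name `toy_ngram_model` and the toy data are illustrative assumptions, not part of the assignment. With add-one smoothing, each probability is assumed to be (count(context + token) + 1) / (count(context) + V), where V is the vocabulary size, so every vocabulary token gets a non-zero probability after every seen context.

```python
from collections import Counter, defaultdict

def toy_ngram_model(text, n, vocabulary, add_one):
    """Return {context: {token: probability}} for a character n-gram model."""
    context_counts = Counter()
    ngram_counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, token = text[i:i + n - 1], text[i + n - 1]
        context_counts[context] += 1
        ngram_counts[context][token] += 1

    V = len(vocabulary)
    model = {}
    for context, total in context_counts.items():
        if add_one:
            # every vocabulary token receives one extra pseudo-count
            model[context] = {t: (ngram_counts[context][t] + 1) / (total + V)
                              for t in vocabulary}
        else:
            # plain relative frequencies; unseen tokens are simply absent
            model[context] = {t: c / total
                              for t, c in ngram_counts[context].items()}
    return model

toy_text = "abcabd"
toy_vocab = sorted(set(toy_text))  # ['a', 'b', 'c', 'd']
mle = toy_ngram_model(toy_text, 3, toy_vocab, add_one=False)
smoothed = toy_ngram_model(toy_text, 3, toy_vocab, add_one=True)

print(mle["ab"])
print(smoothed["ab"])
```

In the unsmoothed model "ab" is followed by "c" and "d" once each, so both get 0.5. In the smoothed model the same context spreads its mass over all four vocabulary tokens and the values still sum to 1.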

In [7]:
#helper function
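def tweets_to_text(data_file_path, n):
    """
    Read a data file and return all of its tweets as one padded string.

    The data frame is a table with 2 columns:
        1. tweet id
        2. tweet text
    """
    df = pd.read_csv(data_file_path)
    debug = True
    if debug:
        # debug mode: only the first 100 tweets of the file are used
        df = df[0:100]
    columns_list = df.columns.to_list()
    # wrap every tweet in the start and end tokens
    tweets_list = df[columns_list[-1]].apply(lambda x: start_token + x + end_token).values
    text = ''.join(tweets_list)

    # pad both ends so the first and last characters have a full n-1 context
    text = start_token * (n-1) + text + end_token * (n-1)

    return text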

In [8]:
def lm(n, vocabulary, data_file_path, add_one):
    # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
    # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
    # data_file_path - the data_file from which we record probabilities for our model
    # add_one - True/False (use add_one smoothing or not)

    lm_dict = {}
    V = len(vocabulary)

    text = tweets_to_text(data_file_path, n)

    # Count the n - 1 length substrings (the contexts)
    n_1_gram = [text[i: i + n-1] for i in range(len(text) - (n-1))]
    counter_obj_n_1_gram = Counter(n_1_gram)

    # Count the n length substrings
    n_gram = [text[i: i + n] for i in range(len(text) - n)]
    counter_obj_n_gram = Counter(n_gram)

    for key, key_count in counter_obj_n_1_gram.items():
        inner_dict = {}
        # the n-grams that start with the current context
        matching = [key_1 for key_1 in counter_obj_n_gram if key_1[0:n-1] == key]
        if add_one:
            for key_1 in matching:
                inner_dict[key_1[-1]] = (counter_obj_n_gram[key_1] + 1) / (key_count + V)
            # vocabulary tokens never seen after this context keep the single pseudo-count
            for token in vocabulary:
                if token not in inner_dict:
                    inner_dict[token] = 1 / (key_count + V)
        else:
            for key_1 in matching:
                inner_dict[key_1[-1]] = counter_obj_n_gram[key_1] / key_count

        lm_dict[key] = inner_dict

    return lm_dict

In [9]:
test_lm = lm(2, vocabulary, data_files['en_df'], False)

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
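For reference, the perplexity of a text with N n-grams is 2 raised to the average negative log2 probability the model assigns to them: 2 ** (-(1/N) * sum(log2 p_i)). The sketch below computes it for a hand-written toy bigram model. The name `toy_perplexity` and the toy model are assumptions for illustration, and returning infinity for an event the model considers impossible is one common convention rather than a requirement of the assignment.

```python
import math

def toy_perplexity(n, model, text):
    """Perplexity of `text` under a {context: {token: probability}} model."""
    log_prob_sum = 0.0
    count = 0
    for i in range(len(text) - n + 1):
        context, token = text[i:i + n - 1], text[i + n - 1]
        p = model.get(context, {}).get(token, 0.0)
        if p == 0.0:
            # Assumption: an n-gram with zero probability makes the perplexity infinite.
            return math.inf
        log_prob_sum += math.log2(p)
        count += 1
    if count == 0:
        raise ValueError("text is shorter than n")
    return 2 ** (-log_prob_sum / count)

# Hand-written bigram model over the characters a, b, c (hypothetical values).
toy_model = {"a": {"b": 0.5, "c": 0.5}, "b": {"a": 1.0}, "c": {"a": 1.0}}

seen = toy_perplexity(2, toy_model, "abac")
unseen = toy_perplexity(2, toy_model, "abb")
print(seen, unseen)
```

For "abac" the bigram probabilities are 0.5, 1.0 and 0.5, so the total is 2 bits over 3 bigrams and the perplexity is 2 ** (2/3). The string "abb" contains the bigram "bb", which the toy model never allows, so its perplexity is infinite.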

In [10]:
def eval(n, model, data_file):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # data_file - the tweets file that you wish to calculate a perplexity score for
    #             (a string that is not an existing path is scored as raw text)

    # read file
    if os.path.exists(data_file):
        text = tweets_to_text(data_file, n)
    else:
        text = data_file
    # Extract n length substrings
    n_gram = [text[i: i + n] for i in range(len(text) - n)]
    if not n_gram:
        # the text is too short to contain a single n-gram
        return float('inf')

    entropy = 0
    for i_gram in n_gram:
        context, token = i_gram[0:n-1], i_gram[n-1]
        # NOTE: n-grams that are missing from the model add 0 bits, i.e. they are
        # scored as if they had probability 1, which lowers the perplexity of unseen text
        if context in model and token in model[context]:
            entropy += -np.log2(model[context][token])
    entropy = entropy / len(n_gram)
    perplexity_score = 2**(entropy)
    return perplexity_score

In [11]:
eval(2, test_lm, data_files['en_df'])

14.384868998083354

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity values.
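When the pairwise scores are collected in a nested dictionary, the orientation of the resulting table depends on how the DataFrame is built. The sketch below uses made-up perplexity values for two languages (the `scores` dictionary is hypothetical) to show both options: `pd.DataFrame(dict_of_dicts)` turns the outer keys into columns, while `pd.DataFrame.from_dict(..., orient="index")` turns them into row labels.

```python
import pandas as pd

# Hypothetical perplexity values, stored as scores[model_language][data_language].
scores = {
    "en": {"en": 10.0, "es": 25.0},
    "es": {"en": 30.0, "es": 12.0},
}

# Outer keys (model languages) become the columns, inner keys the rows.
by_column = pd.DataFrame(scores)

# Outer keys (model languages) become the row labels instead.
by_row = pd.DataFrame.from_dict(scores, orient="index")

# Both lookups fetch the same cell: the en model applied to the es data.
en_on_es_from_columns = by_column.loc["es", "en"]
en_on_es_from_rows = by_row.loc["en", "es"]
print(en_on_es_from_columns, en_on_es_from_rows)
```

Either layout is workable as long as it is clear which axis holds the models and which holds the data files.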

In [12]:
def match(n, add_one, data_files):
    # n - the n-gram to use for creating n-gram models
    # add_one - use add_one smoothing or not
    result_dict = {}
    vocabulary = preprocess(data_files)
    for i_language_model in languages_list:

        i_model = lm(n, vocabulary, data_files[i_language_model], add_one)
        result_dict[i_language_model] = {}

        for i_language_test in languages_list:
            i_score = eval(n, i_model, data_files[i_language_test])
            result_dict[i_language_model][i_language_test] = i_score
    # outer keys (model languages) become the columns, inner keys (data files) the rows
    perplexity_df = pd.DataFrame(result_dict)
    return perplexity_df

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [13]:
 
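def run_match(data_files):
    # the models are kept as full_model_dict[n][add_one][language] so Part 6 can reuse them
    full_model_dict = {}
    vocabulary = preprocess(data_files)

    for n in range(1, 5):
        full_model_dict[n] = {}
        for add_one in (True, False):
            perplexity_df = match(n, add_one, data_files)
            print(f'n = {n}, add_one = {add_one}')
            display(perplexity_df)

            full_model_dict[n][add_one] = {
                language: lm(n, vocabulary, data_files[language], add_one)
                for language in languages_list}

    return full_model_dict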



# run the model generation

In [14]:
model_dict = run_match(data_files)


summary for matching (add_one = True) model perlexity score per model and test language :

n = 1, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,45.65597,51.355402,49.301698,51.440155,51.11569,49.920505,53.72038,52.12862
es_df,45.452144,41.50501,43.749825,48.90924,44.327461,46.477788,45.430218,50.887953
fr_df,47.074058,46.360269,42.557872,51.742064,46.095812,47.880026,49.104143,54.361214
in_df,47.102662,51.61525,49.987591,42.887313,51.71664,48.16578,51.151363,47.35654
it_df,42.277777,41.678911,40.670809,45.966293,38.971338,43.101557,43.918374,47.100703
nl_df,44.844935,47.609415,45.624331,48.050698,47.293902,42.118229,49.3418,50.380296
pt_df,49.20583,47.269991,47.924765,51.67458,48.308706,50.512218,46.461623,53.818373
tl_df,54.486198,60.875533,59.630634,54.425462,60.540051,58.100922,61.147729,50.515084


summary for matching (add_one = False) model perlexity score per model and test language :

n = 1, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,38.961941,41.929366,41.182271,41.709666,41.382011,39.933839,41.636759,41.436843
es_df,32.378747,35.103377,32.358444,35.586874,32.75542,33.675336,33.921399,36.349828
fr_df,34.076984,35.586083,36.546817,36.641208,36.828866,36.481139,37.387258,38.103698
in_df,39.555033,43.216252,42.287237,35.934859,42.984583,39.194853,40.588367,38.645186
it_df,33.223687,32.306595,33.487553,35.651692,32.916407,32.559039,33.075585,36.029886
nl_df,36.962915,39.409093,38.21548,39.072086,38.859453,35.447688,38.964376,40.482297
pt_df,33.551902,34.022261,34.305865,34.794524,33.862799,34.41574,37.718362,35.935042
tl_df,42.294933,47.535579,48.735527,42.006983,46.523005,44.918698,45.372337,41.860766


summary for matching (add_one = True) model perlexity score per model and test language :

n = 2, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,109.633495,181.131325,155.932843,183.448419,185.085043,159.789446,214.788732,187.363344
es_df,121.855617,93.759347,108.233732,153.050915,111.004566,141.541569,133.414088,160.746126
fr_df,126.343186,135.200848,96.370419,162.231691,144.96634,150.794636,167.382475,180.972235
in_df,165.223403,201.519532,178.059431,110.181932,194.132097,180.279689,222.129794,166.135812
it_df,130.095771,111.620035,118.491223,150.769573,89.43779,147.181633,139.513553,160.444998
nl_df,138.23286,168.466015,148.679664,172.825315,173.97741,100.392034,203.460108,195.201295
pt_df,131.383824,118.845906,121.535194,155.58582,122.953341,151.442675,119.2445,167.316105
tl_df,152.813515,189.549481,183.233925,152.710602,188.030566,185.343554,219.696342,107.270217


summary for matching (add_one = False) model perlexity score per model and test language :

n = 2, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,14.384869,11.95669,11.489436,12.446335,11.198522,11.131667,10.3262,11.379106
es_df,11.352011,12.466218,10.157946,11.581361,9.497516,11.577911,8.991426,10.879584
fr_df,10.489021,9.993651,13.6083,11.084216,10.19814,10.824662,9.544355,10.648533
in_df,13.198092,11.261706,12.992268,14.073481,10.208837,12.167447,10.484984,10.769811
it_df,12.429877,10.339403,11.545752,11.940116,12.511038,13.345853,9.897333,10.937212
nl_df,10.348992,11.064295,11.793545,12.204031,10.06778,13.240227,10.402905,10.892198
pt_df,10.882331,9.260273,10.72662,11.07652,9.593008,11.912083,12.351687,10.090859
tl_df,10.797583,9.587003,10.587205,9.161233,8.899691,9.953089,8.478238,11.231359


summary for matching (add_one = True) model perlexity score per model and test language :

n = 3, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,321.999134,236.257631,229.498439,281.94052,210.894365,238.569239,212.671714,254.329931
es_df,257.503523,284.599391,203.18693,261.580541,198.084039,253.449131,208.61689,252.958853
fr_df,211.403993,190.618387,289.206439,235.974703,206.352018,221.00428,209.730663,219.068264
in_df,311.233221,209.911281,278.125045,347.35068,199.303286,262.739773,218.587319,245.690487
it_df,293.942131,236.535125,268.311651,293.160808,295.176144,323.448686,253.976038,263.701585
nl_df,226.37794,215.751858,240.925609,276.411546,199.17232,304.514289,223.796944,239.133309
pt_df,225.257646,177.602952,213.803449,229.021473,193.385723,248.667676,334.587154,208.02646
tl_df,215.040469,161.169274,180.047655,170.49007,145.514891,180.763809,139.857577,248.014005


summary for matching (add_one = False) model perlexity score per model and test language :

n = 3, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,4.309184,2.29377,2.532178,2.231851,2.123328,2.529614,1.927856,2.091247
es_df,2.764492,4.392235,2.970475,2.393427,3.016536,2.45667,2.744862,2.151533
fr_df,2.742375,2.544285,4.415564,2.28509,2.548024,2.547205,2.349252,1.930364
in_df,2.373589,2.222998,2.429772,4.638793,2.267839,2.205126,2.006026,2.193134
it_df,2.897508,3.137128,3.166326,2.543176,4.805001,2.564129,2.826795,2.290382
nl_df,2.639346,2.318133,2.448077,2.224871,2.323701,4.408137,2.035984,1.991618
pt_df,2.442784,2.738047,2.744106,2.189195,2.690643,2.211776,3.939318,1.982775
tl_df,2.239858,2.068203,2.195796,2.275679,2.118378,2.086651,1.936217,3.129322


summary for matching (add_one = True) model perlexity score per model and test language :

n = 4, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,508.530267,23.93669,31.84127,23.722843,20.383399,34.786579,15.756271,22.562204
es_df,37.661317,490.261217,48.807418,26.643679,55.12579,27.263883,48.998075,22.279463
fr_df,41.803938,34.671091,487.785233,23.403266,33.356777,31.785906,27.531624,16.962445
in_df,23.213326,18.995776,23.279904,575.365148,19.580899,19.393979,15.179747,24.021276
it_df,42.545316,61.597574,54.872547,30.531767,529.024349,29.819406,47.956677,26.428432
nl_df,35.123212,23.119273,26.256139,22.38284,21.14382,506.784157,16.656295,17.179881
pt_df,26.978216,43.931692,36.32448,21.313828,38.931561,21.210296,524.167062,17.44373
tl_df,24.041166,19.080509,20.448821,28.699707,19.909655,19.890651,16.344748,355.911777


summary for matching (add_one = False) model perlexity score per model and test language :

n = 4, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,1.865606,1.121681,1.188322,1.114338,1.14403,1.156407,1.087554,1.080483
es_df,1.155921,2.06065,1.26944,1.110383,1.309363,1.156332,1.248118,1.084723
fr_df,1.18643,1.233825,2.005493,1.104869,1.209101,1.156551,1.144624,1.06992
in_df,1.124816,1.116398,1.113791,1.98147,1.134983,1.110769,1.090298,1.158461
it_df,1.166394,1.286207,1.231846,1.139954,2.184752,1.119157,1.185137,1.099106
nl_df,1.168725,1.150577,1.190717,1.108792,1.127843,1.978597,1.088746,1.074574
pt_df,1.141296,1.289672,1.21951,1.120374,1.255854,1.120441,1.837578,1.074569
tl_df,1.147484,1.117438,1.137728,1.221908,1.148638,1.107226,1.105917,1.615895


**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
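One straightforward approach, sketched here under stated assumptions, is to score each sentence with every language model and pick the language with the lowest perplexity. The helper names, the two hand-written toy models, and the `unseen_prob` floor (a small probability given to n-grams a model has never seen) are all illustrative choices, not part of the assignment or of the solution in this notebook.

```python
import math

def sentence_perplexity(n, model, sentence, unseen_prob):
    """Perplexity of one sentence; unseen n-grams get the probability `unseen_prob`."""
    ngrams = [sentence[i:i + n] for i in range(len(sentence) - n + 1)]
    if not ngrams:
        # sentence too short to contain a single n-gram
        return math.inf
    bits = 0.0
    for gram in ngrams:
        p = model.get(gram[:-1], {}).get(gram[-1], unseen_prob)
        bits -= math.log2(p)
    return 2 ** (bits / len(ngrams))

def classify_sentence(n, models, sentence, unseen_prob=1e-6):
    """Return the language whose model assigns the lowest perplexity."""
    return min(models,
               key=lambda lang: sentence_perplexity(n, models[lang], sentence, unseen_prob))

# Two tiny hand-written bigram models (hypothetical probabilities).
toy_models = {
    "en": {"t": {"h": 0.9, "e": 0.1}, "h": {"e": 1.0}},
    "es": {"q": {"u": 1.0}, "u": {"e": 1.0}},
}

prediction_en = classify_sentence(2, toy_models, "the")
prediction_es = classify_sentence(2, toy_models, "que")
print(prediction_en, prediction_es)
```

Penalizing unseen n-grams matters for this rule: if they contributed nothing to the score, a model that has never seen any of a sentence's n-grams would get the lowest perplexity and win the comparison.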

In [None]:
!ls nlp-course/lm-languages-data-new

In [None]:
# build the paths with os.path.join so they work on every operating system
test_folder = os.path.join('nlp-course', 'lm-languages-data-new')
test_csv_files = glob.glob(os.path.join(test_folder, '*.csv'))
test_files = {}
for i_file in test_csv_files:
    file_name_with_ending = os.path.basename(i_file)
    file_name = os.path.splitext(file_name_with_ending)[0]
    test_files[file_name + '_df'] = i_file


In [None]:
def match_test(n, model_dict, data_file_path, add_one):
    # n - the n-gram that was used for creating the n-gram models
    # model_dict - the trained models, indexed as model_dict[n][add_one][language]
    # data_file_path - path to the test file
    # add_one - use add_one smoothing or not
    sentences_list = pd.read_csv(data_file_path)['tweet_text'].to_list()
    result_dict = {}
    for i_language_model in languages_list:
        i_model = model_dict[n][add_one][i_language_model]
        # one score per sentence, in file order, so that rows line up with the labels
        result_dict[i_language_model] = [eval(n, i_model, i_test_sentence)
                                         for i_test_sentence in sentences_list]
    perplexity_df = pd.DataFrame(result_dict)
    print(perplexity_df)
    return perplexity_df


def classify(n, model_dict, data_file_path, add_one):
    # TODO: reduce the perplexity table to one predicted language per sentence
    match_dict = match_test(n, model_dict, data_file_path, add_one)
    return match_dict



In [None]:
n = 2
test_path = os.path.join(test_folder, 'test.csv')
clasification_result = classify(n, model_dict, test_path, False)

# roni: still need to yield the classification results from the match results

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
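As a quick illustration of the scikit-learn call, the sketch below scores five made-up predictions against made-up gold labels (both lists are hypothetical). For single-label multiclass data like this, `average="micro"` equals plain accuracy, while `average="macro"` averages the per-language F1 scores and so weights every language equally.

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions for five test sentences.
y_true = ["en", "es", "fr", "en", "fr"]
y_pred = ["en", "es", "en", "en", "fr"]

# micro: computed from the global counts of true/false positives and negatives
micro = f1_score(y_true, y_pred, average="micro")

# macro: unweighted mean of the per-language F1 scores
macro = f1_score(y_true, y_pred, average="macro")

print(micro, macro)
```

Four of the five predictions are correct, so the micro score is 0.8. The per-language scores are 0.8 (en), 1.0 (es) and 2/3 (fr), which average to a macro score of about 0.82.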


In [None]:
def calc_f1(result):
    # result - the predicted label of every sentence in test.csv, in file order
    data_file_path = 'nlp-course/lm-languages-data-new/test.csv'
    labels = pd.read_csv(data_file_path)['label'].to_list()
    # score the argument that was passed in, not the global variable
    return f1_score(labels, result, average="micro")

calc_f1(clasification_result)

# **Good luck!**