# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


*As a preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [13]:
# imports
import glob
import os 
import math

import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.metrics import f1_score
from IPython.display import display


In [14]:
!git clone https://github.com/kfirbar/nlp-course.git

fatal: destination path 'nlp-course' already exists and is not an empty directory.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [15]:

!ls nlp-course/lm-languages-data-new


'ls' is not recognized as an internal or external command,
operable program or batch file.


In [16]:
data_files = {'en_df': 'en.csv',
              'es_df': 'es.csv',
              'fr_df': 'fr.csv',
              'in_df': 'in.csv',
              'it_df': 'it.csv',
              'nl_df': 'nl.csv',
              'pt_df': 'pt.csv',
              'tl_df': 'tl.csv'}

    
directory = 'nlp-course/lm-languages-data-new/'    
for (key, value) in data_files.items():
    data_files[key] = directory + value
    
languages_list = list(data_files.keys())
start_token = '↠'
end_token = '↞'

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [17]:
def preprocess():
    """
    data frame is table from 2 columns:
        1. tweet id
        2. tweet text
    output:
        generate vocabulary of the all languages letters 
    """  
    tokens = []
    for path in data_files.values():
        df = pd.read_csv(path)
        for text in df['tweet_text'].values:
            tokens.extend(list(text))
    return list(set(tokens))

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.

In [18]:
#helper functions
def tweets_to_text(data_file_path, n):
    """
    data frame is table from 2 columns:
        1. tweet id
        2. tweet text
    
    Input:
        data file path
        n - of the n-gram model
        
    Output:
        the all data frame text joined toghter
    """
    df = pd.read_csv(r''+ data_file_path)
    tweets_list = df['tweet_text'].apply(lambda x: start_token + x + end_token).values
    text = ''.join(tweets_list)
    
    text = start_token * (n-1) + text + end_token * (n-1)

    return text

def reorder_list(List, index_list):
    return [List[i] for i in index_list]

In [19]:
def lm(n, vocabulary, data_file_path, add_one):
    """
    input 
        # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
        # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
        # data_file_path - the data_file from which we record probabilities for our model
        # add_one - True/False (use add_one smoothing or not)
    output:
        model of language base the text and n-gram defenition
    """
    lm_dict = {}
    V = len(vocabulary)

    text = tweets_to_text(data_file_path, n)

    # Extract n length substrings
    n_gram = [text[i: i + n] for i in range(len(text) - n)]

    lm_dict = defaultdict(lambda: defaultdict(lambda: 0))

    for i_n_gram in n_gram:
        n_1_gram = i_n_gram[0:n-1]
        lm_dict[n_1_gram][i_n_gram[n-1]] += 1
    
    for key in lm_dict.keys():
        key_count = sum(lm_dict[key].values())
        inner_dict = {}
        for key_1 in lm_dict[key].keys():
            if add_one:
                inner_dict[key_1] = (lm_dict[key][key_1] + 1) / (key_count + V)
            else:
                inner_dict[key_1] = lm_dict[key][key_1]/ key_count
        if add_one:
            lm_dict[key] = defaultdict(lambda: 1 / (key_count + V), inner_dict)
        else:
            lm_dict[key] = defaultdict(lambda: 0, inner_dict)
            
    return lm_dict

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

## perplexity
* entropy
    * W - all posible token in the language Text
    * N - is the numnber of all tokens
    * $H(Text) =  (-{\sum_{w_i \in W}{P(w_i)log_2(P(w_i))}})/N$
        * under the assumption the $w_i$ is in the language, we can remove $P(w_i)$
    * $H(Text) =  (-\sum_{w_i \in W}log_2(P(w_i)))/N$
* perplexity = $2^{H(Text)}$ 

In [20]:
# pereparing function to evaluating tweet
def eval_tweet(n, N, model, tweet):
    """
    input:
        n  - the n of n-gram
        N - size of tweet
        model - the model of langues
        tweet - text
        
    output:
        probabilities list
    """
    missing_value = 1e-8
    tweet_probabilities = []
  
    for i in range(N - n):
        i_n_gram = tweet[i: i + n]
        key = i_n_gram[0:n-1]
        key_1 = i_n_gram[n-1]

    if key in model:
        if key_1 in model[key]:
            tweet_probabilities.append(model[key][key_1])
        else:
            tweet_probabilities.append(missing_value)
    else:
          tweet_probabilities.append(missing_value)
  
    return tweet_probabilities

In [21]:
def eval(n, model, data_file):
    """
    # input:
        # n - the n-gram that you used to build your model (must be the same number)
        # model - the dictionary (model) to use for calculating perplexity
        # data_file - the tweets file that you wish to calculate a perplexity score for
    # output
        # perlplexity
    """
    df = pd.read_csv(data_file)
    probabilities = []

    for tweet in df['tweet_text'].values:
        tweet = start_token + tweet + end_token
        N = len(tweet)
        tweet_probabilities = eval_tweet(n, N, model, tweet)
        probabilities.extend(tweet_probabilities)

    entropy = -math.log2(np.mean(probabilities))
      
    return 2 ** entropy

# generate vocabulary

In [22]:
vocabulary = preprocess()

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

In [23]:
def match(n, add_one, data_files):
    """
    input 
        # n - the n-gram to use for creating n-gram models
        # add_one - use add_one smoothing or not
    output
        # data frame of matching between model to languages
    """
    result_dict = {}
    vocabulary = preprocess()
    for i_language_model in languages_list:
        
        i_model = lm(n, vocabulary, data_files[i_language_model], add_one)
        result_dict[i_language_model] = {}

        for i_language_test in languages_list:
            i_language_model_i_score = eval(n, i_model, data_files[i_language_test])
            result_dict[i_language_model][i_language_test] = i_language_model_i_score
    perplexity_df = pd.DataFrame(result_dict)
    return perplexity_df  

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [24]:
 
def run_match(data_files):
    """
    run matching on n-gram model from 1-4, with and without add one
    """
    for n in range(1,5):
        add_one = True
        perplexity_df = match(n, add_one, data_files)
        print(f'n = {n}, add_one = {add_one}')
        display(perplexity_df)

        add_one = False
        perplexity_df = match(n, add_one, data_files)
        print(f'n = {n}, add_one = {add_one}')
        display(perplexity_df)



# run the model matching

In [25]:
model_dict = run_match(data_files)


n = 1, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,72.914634,74.16457,73.394796,78.73211,76.866986,69.294856,79.447279,81.01723
es_df,41.196681,38.152723,40.885803,40.745494,40.437243,41.69734,38.600992,41.206191
fr_df,44.091131,42.618736,41.799289,48.016157,45.621368,42.100117,44.326895,50.299539
in_df,57.82085,56.153049,58.469156,49.812553,56.059166,56.678443,57.314825,52.669413
it_df,43.044979,40.048572,42.737129,44.229139,41.00095,43.239561,40.333777,44.193314
nl_df,66.76224,66.283982,66.360544,66.568027,67.522966,62.458843,70.431522,68.580706
pt_df,43.464094,37.766103,42.909527,38.557484,39.295123,43.401255,37.561902,39.122854
tl_df,40.855298,37.098028,40.353868,36.009301,39.341408,39.356434,37.906579,36.663564


n = 1, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,72.769017,74.008478,73.252902,78.553206,76.698873,69.152727,79.239925,80.817347
es_df,41.112891,38.070765,40.805282,40.650874,40.346957,41.610462,38.497873,41.102195
fr_df,44.001606,42.527412,41.717021,47.905118,45.51981,42.012433,44.208868,50.173247
in_df,57.704352,56.033635,58.355143,49.697489,55.935101,56.561356,57.163332,52.537352
it_df,42.957521,39.962627,42.653053,44.126637,40.909447,43.149556,40.226142,44.081968
nl_df,66.628424,66.143848,66.231741,66.415711,67.374519,62.33027,70.24676,68.410307
pt_df,43.375796,37.684947,42.825113,38.467825,39.207326,43.310913,37.461519,39.023995
tl_df,40.772188,37.018282,40.274371,35.925464,39.253515,39.27433,37.805279,36.570806


n = 2, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,43.776189,53.497511,48.89899,53.525649,50.699229,49.042384,56.431754,49.043253
es_df,37.140372,21.317902,26.757511,28.803144,24.976284,28.415798,30.173859,28.312988
fr_df,33.92272,26.081185,23.71528,34.448246,27.655368,28.119126,34.597428,32.020639
in_df,41.585012,38.18345,40.566456,28.529002,36.1613,38.162114,45.216918,29.757143
it_df,31.515325,22.25875,23.33159,27.635992,13.721365,26.482707,27.68115,24.373056
nl_df,46.511171,42.457446,44.723694,42.695891,44.278189,35.618174,48.232406,41.538801
pt_df,31.493665,21.678845,27.065655,25.663584,23.043938,28.54788,17.933638,24.972351
tl_df,28.732803,21.494318,24.410513,23.487808,22.658508,23.952645,26.482359,15.9105


n = 2, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,26.008884,31.742119,31.096097,32.518139,31.610836,32.503477,33.502764,31.572003
es_df,22.436499,12.089139,16.225868,16.929334,14.317043,16.911358,15.792813,15.136067
fr_df,23.050873,15.992926,14.954686,19.995677,16.435052,17.510679,18.650123,18.512128
in_df,29.314018,23.865615,26.297864,18.536133,22.732809,25.546612,26.351215,20.452846
it_df,20.785216,13.898936,15.22329,16.624763,8.689918,16.676644,15.019079,14.359308
nl_df,34.129908,29.624916,31.415008,30.293524,30.199808,25.382486,31.383062,29.645913
pt_df,22.593725,14.966704,18.508011,17.36482,15.137904,19.224687,11.235011,15.198019
tl_df,18.996949,13.092892,15.231579,14.884463,13.208382,15.093102,14.741497,10.368441


n = 3, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,48.150841,108.835821,89.044182,106.151329,93.60717,83.286061,112.185602,90.467172
es_df,86.687833,39.84364,76.4083,73.850932,60.148647,88.424421,61.8721,85.543068
fr_df,66.112604,61.775371,36.864522,100.032521,63.433782,69.645579,76.685974,82.583003
in_df,92.265362,98.985205,100.433193,40.889629,82.177137,82.410901,113.599348,51.324631
it_df,57.028685,42.902688,39.819458,54.893079,14.494984,53.441164,54.414169,48.99286
nl_df,81.310638,107.489857,98.942909,92.241349,94.093133,45.40418,117.805985,77.763965
pt_df,65.189941,37.31208,60.664141,54.440292,43.461906,69.838139,24.941626,52.762329
tl_df,64.794282,69.159437,64.631478,44.726273,41.177185,43.240622,62.699106,17.3216


n = 3, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,9.301309,24.338205,22.481604,23.716064,23.036871,21.796961,24.103984,20.177882
es_df,18.615527,6.292674,13.814097,14.324648,11.772122,14.418502,12.022809,12.552687
fr_df,16.457562,12.988245,6.760379,17.350395,13.726514,13.660916,15.130985,14.187728
in_df,20.657635,18.099298,19.127559,7.403872,16.931177,17.41634,18.352046,14.524767
it_df,10.504901,7.977438,8.003123,9.466807,4.414969,9.091272,8.663427,8.165708
nl_df,23.300301,23.158308,24.206797,23.491227,22.184089,9.553483,23.120033,21.490565
pt_df,15.037939,10.610593,14.932274,13.608034,11.003791,16.27919,5.242151,12.070804
tl_df,14.847552,12.814172,12.597593,12.118529,8.814791,9.165949,10.468662,4.784484


n = 4, add_one = True


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,92.885996,461.41802,281.329907,458.398281,345.939964,294.764304,473.531956,276.858335
es_df,408.410517,84.929378,338.230294,408.271744,197.124677,389.103029,184.387499,401.071989
fr_df,222.734579,257.389162,72.626096,471.801996,240.112261,268.289491,341.489043,333.943434
in_df,358.864512,505.008991,448.525547,87.999843,235.674836,267.968933,435.763101,117.145042
it_df,117.668793,87.00548,69.951986,105.598476,17.115169,94.367664,122.788114,81.018864
nl_df,253.172588,491.475267,351.897334,350.518332,288.020675,83.491257,462.501598,159.426744
pt_df,259.347604,114.059087,223.360172,270.235229,131.44443,242.123198,44.650124,151.030817
tl_df,182.446181,331.307696,220.644642,82.145389,77.538749,77.818227,190.47794,19.991684


n = 4, add_one = False


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df
en_df,1.841152,19.55866,16.970849,19.088398,17.195358,16.668488,18.896633,13.936506
es_df,16.164701,1.927704,13.431308,16.13754,11.644472,13.468725,11.450784,13.052547
fr_df,13.618812,13.04288,1.787078,17.011306,13.008595,12.138422,15.92905,14.28915
in_df,17.679574,19.268961,17.937926,1.952076,17.056183,14.547005,19.859896,12.772327
it_df,8.33183,6.578335,7.191149,7.166388,1.709686,7.174253,6.862595,6.560611
nl_df,17.358572,21.051298,19.469823,19.06077,20.918356,1.730418,21.932722,18.731052
pt_df,13.431565,8.181232,13.069927,13.231484,9.246027,12.978485,1.822269,10.759984
tl_df,9.792771,9.672656,8.336526,8.40147,8.382158,7.733362,8.840242,1.745517


**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where a different solution will yield different accuracy scores. any solution that is not trivial (e.g. returning 'en' in all cases) will be excepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

In [26]:
! ls nlp-course/lm-languages-data-new

'ls' is not recognized as an internal or external command,
operable program or batch file.


In [43]:
def match_test(n, data_file_path, add_one):
    """
    input
        # n - the n-gram to use for creating n-gram models
        # add_one - use add_one smoothing or not
        # data_file_path - file path
    output
       #  perplexity_df - language model per perplexity score per senetence 
    """
    
    senstences_list = pd.read_csv(data_file_path)['tweet_text'].to_list()

    result_dict = {}

    for i_language_model in languages_list:
        result_dict[i_language_model] = {}
        i_model = lm(n, vocabulary, data_files[i_language_model], add_one)

        for i_test_senstence_idx in range(len(senstences_list)):
            i_test_senstence = senstences_list[i_test_senstence_idx]

            sentence_probabilities = eval_tweet(n, len(i_test_senstence), i_model, i_test_senstence)
            entropy = -math.log2(np.mean(sentence_probabilities))
            i_sentence_model_i_score = 2 ** entropy

            result_dict[i_language_model][i_test_senstence_idx] = i_sentence_model_i_score

    perplexity_df = pd.DataFrame(result_dict)
    perplexity_array = perplexity_df.to_numpy()
    language_match_index = np.argmin(perplexity_array, axis=1)
    language_match_list = reorder_list(languages_list, language_match_index)
    perplexity_df['predict'] = language_match_index
    perplexity_df['predicted_language'] = language_match_list
    display(perplexity_df)

    return perplexity_df

In [44]:
def classify(n, data_file_path, add_one):
    match_dict  = match_test(n, data_file_path, add_one)
    return match_dict

* we choose to arbitrary to demonstrate classifier function using:
    * n = 2
    * add_one = False 

In [45]:
n = 2
add_one = False
test_path = test_folder + 'test.csv'
clasification_result = classify(n, test_path, add_one)


Unnamed: 0,en_df,es_df,fr_df,in_df,it_df,nl_df,pt_df,tl_df,predict,predicted_language
0,1.648377e+01,1.748090e+01,1.571033e+01,1.828885e+01,1.555619e+01,2.019009e+01,18.693143,1.332840e+01,7,tl_df
1,9.715179e+01,7.027027e+01,8.966304e+01,6.858000e+01,2.042278e+01,7.881159e+01,72.468750,6.431757e+01,4,it_df
2,6.178577e+00,1.039676e+01,7.018024e+00,5.274799e+00,6.373190e+00,4.877389e+00,12.405568,6.679184e+00,5,nl_df
3,1.305818e+03,3.145918e+02,8.251000e+02,6.068966e+02,1.720778e+03,8.145294e+02,1877.600000,3.485789e+02,1,es_df
4,6.808163e+01,5.457143e+01,1.154348e+02,5.512308e+01,3.794444e+01,1.048667e+02,14.907801,1.157331e+01,7,tl_df
...,...,...,...,...,...,...,...,...,...,...
7994,2.938333e+02,3.108333e+02,4.967778e+02,4.116667e+02,4.317778e+02,2.740000e+02,369.333333,3.922727e+02,5,nl_df
7995,7.718579e+00,5.819421e+00,7.323810e+00,3.499101e+00,7.262843e+00,1.050301e+01,6.638788,3.169173e+00,7,tl_df
7996,1.000000e+08,1.000000e+08,1.000000e+08,1.000000e+08,1.000000e+08,1.000000e+08,6.000000,1.000000e+08,6,pt_df
7997,5.117445e+01,4.374756e+01,3.734921e+01,1.334181e+01,1.573813e+01,5.546667e+01,14.131914,4.621685e+01,3,in_df


In [37]:
# parse result in order to insert to f1 score
y_true = pd.read_csv(test_path).get('label').to_list()
y_true = list(map(lambda x: languages_list.index(x+'_df'),y_true))
y_pred = clasification_result['predict'].to_list()

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 


In [32]:
def calc_f1(y_true, y_pred):
    return np.round(f1_score(y_true, y_pred,average="micro"),3)
f_score_result = calc_f1(y_true,y_pred)
print('The F-score we acheive is ' + str(f_score_result)+'\n')

The F-score we acheive is 0.207



# **Good luck!**