# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
# imports
import pandas as pd
import numpy as np
import math
from collections import Counter, defaultdict
import time
import glob
import os 
from sklearn.metrics import f1_score
from IPython.display import display


In [2]:
!git clone https://github.com/kfirbar/nlp-course.git

fatal: destination path 'nlp-course' already exists and is not an empty directory.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [3]:

# portable directory listing (works on Windows as well as Unix shells)
print(sorted(os.listdir('nlp-course/lm-languages-data-new')))


In [4]:
data_files = {'en_df': 'en.csv',
              'es_df': 'es.csv',
              'fr_df': 'fr.csv',
              'in_df': 'in.csv',
              'it_df': 'it.csv',
              'nl_df': 'nl.csv',
              'pt_df': 'pt.csv',
              'tl_df': 'tl.csv'}

    
directory = 'nlp-course/lm-languages-data-new/'    
for (key, value) in data_files.items():
    data_files[key] = directory + value
    
languages_list = list(data_files.keys())
start_token = '↠'
end_token = '↞'

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
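
To make the token definition concrete, here is a minimal sketch on made-up strings (the `toy_tweets` list is hypothetical and not taken from the course data). It only illustrates that the vocabulary is the set of distinct characters, including non-ASCII ones.

```python
# Toy illustration: these strings are invented, not read from the data files.
toy_tweets = ["hola", "hello", "olá"]

# Each token is a single character, so the vocabulary is the set of characters seen.
toy_vocabulary = sorted(set("".join(toy_tweets)))
print(toy_vocabulary)
```

Sorting is used here only to make the printed order stable; the assignment does not require any particular order.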

In [5]:
def preprocess():
    """
    Build the vocabulary: every distinct character seen in any data file.
    Each data file is a table with two columns:
        1. tweet id
        2. tweet text
    """  
    tokens = set()
    for path in data_files.values():
        df = pd.read_csv(path)
        for text in df['tweet_text'].values:
            # iterating over a string yields its characters
            tokens.update(text)
    return list(tokens)

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note: think about how to store the add_one smoothing information in the dictionary, and implement it.
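
As a rough illustration of add-one smoothing, the sketch below estimates P(token | context) = (count(context, token) + 1) / (count(context) + V), where V is the vocabulary size. The helper name `toy_add_one_model` and the toy text are assumptions made for this example. It stores a probability for every vocabulary token under each seen context, which is only one possible way to keep the smoothing information in the dictionary.

```python
from collections import Counter, defaultdict

def toy_add_one_model(text, vocabulary, n):
    """Hypothetical helper: n-gram probabilities with add-one smoothing."""
    V = len(vocabulary)
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, token = text[i:i + n - 1], text[i + n - 1]
        counts[context][token] += 1

    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        # Every vocabulary token gets a probability, whether it was seen or not.
        model[context] = {tok: (counter[tok] + 1) / (total + V) for tok in vocabulary}
    return model

toy_model = toy_add_one_model("abab", vocabulary=["a", "b", "c"], n=2)
print(toy_model["a"])
```

Storing all V entries per context is simple but memory-hungry for larger n; keeping only the seen tokens plus a per-context default for unseen ones is a more compact alternative.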

In [6]:
#helper functions
def tweets_to_text(data_file_path, n):
    """
    Read a data file (a table with two columns: tweet id and tweet text) and
    return all tweets joined into one string. Every tweet is wrapped in
    start/end tokens, and the full text is padded with n-1 extra tokens on each side.
    """
    df = pd.read_csv(data_file_path)
    tweets_list = df['tweet_text'].apply(lambda x: start_token + x + end_token).values
    text = ''.join(tweets_list)
    
    text = start_token * (n-1) + text + end_token * (n-1)

    return text

def reorder_list(List, index_list):
    return [List[i] for i in index_list]

In [7]:
def lm(n, vocabulary, data_file_path, add_one):
    # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
    # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
    # data_file_path - the data_file from which we record probabilities for our model
    # add_one - True/False (use add_one smoothing or not)
  
    lm_dict = {}
    V = len(vocabulary)

    text = tweets_to_text(data_file_path, n)

    # Extract n length substrings
    n_gram = [text[i: i + n] for i in range(len(text) - n)]

    lm_dict = defaultdict(lambda: defaultdict(lambda: 0))

    for i_n_gram in n_gram:
        n_1_gram = i_n_gram[0:n-1]
        lm_dict[n_1_gram][i_n_gram[n-1]] += 1
    
    for key in lm_dict.keys():
        key_count = sum(lm_dict[key].values())
        inner_dict = {}
        for key_1 in lm_dict[key].keys():
            if add_one:
                inner_dict[key_1] = (lm_dict[key][key_1] + 1) / (key_count + V)
            else:
                inner_dict[key_1] = lm_dict[key][key_1] / key_count
        if add_one:
            # Bind key_count now; a plain closure would see the last loop value for every key.
            lm_dict[key] = defaultdict(lambda kc=key_count: 1 / (kc + V), inner_dict)
        else:
            lm_dict[key] = defaultdict(lambda: 0, inner_dict)
            
    return lm_dict
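
One Python detail matters when a per-context default probability is stored in a `lambda`, as `lm` does for unseen tokens: a closure looks its variables up when it is called, not when it is defined. The sketch below, with purely illustrative names, shows the effect and the default-argument idiom that freezes the value at definition time.

```python
from collections import defaultdict

def toy_defaults():
    late, bound = {}, {}
    for total in (10, 100):
        # Looks up `total` only when the default is requested.
        late[total] = defaultdict(lambda: 1 / total)
        # The default argument captures the value `total` has right now.
        bound[total] = defaultdict(lambda t=total: 1 / t)
    return late, bound

late, bound = toy_defaults()
print(late[10]["x"], bound[10]["x"])  # → 0.01 0.1
```

The late-bound version reports 1/100 for the first entry because the loop variable has already moved on by the time the default is evaluated.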

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
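
For reference, perplexity over N predicted tokens is commonly defined as 2 raised to the average negative log2-probability, PP = 2^(-(1/N) * sum(log2 p_i)). A minimal sketch, assuming the per-token probabilities have already been looked up in the model (the function name `toy_perplexity` is hypothetical):

```python
import math

def toy_perplexity(probabilities):
    """Perplexity of a sequence, given the model probability of each token."""
    avg_neg_log2 = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** avg_neg_log2

# A model that is uniform over 4 tokens is "4-ways confused" at every step.
print(toy_perplexity([0.25, 0.25, 0.25, 0.25]))  # → 4.0
```

A zero probability makes the logarithm undefined, so any real implementation needs smoothing or a small fallback probability for unseen n-grams.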

In [8]:
# def eval(n, model, data_file):
#     # n - the n-gram that you used to build your model (must be the same number)
#     # model - the dictionary (model) to use for calculating perplexity
#     # data_file - the tweets file that you wish to calculate a perplexity score for

#     # read file
#     if os.path.exists(data_file):
#         text = tweets_to_text(data_file, n)
#     else:
#         text = data_file
#     # Extract n length substrings
#     n_gram = [text[i: i + n] for i in range(len(text) - n)]

#     model_keys = model.keys()
#     entropy = 0 
#     for i_letter in n_gram:
#         if i_letter[0:n-1] in model_keys: 
#             i_letter_model = model[i_letter[0:n-1]]
#             if i_letter[n-1] in i_letter_model.keys():
#                 second_letter_prob = i_letter_model[i_letter[n-1]]
#                 entropy += -np.log2(second_letter_prob)
#             else:
#                 entropy += 0
#         else:
#             entropy += 0
#     entropy = entropy/len(n_gram)
#     perplexity_score = 2**(entropy)
#     return perplexity_score

In [9]:
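def eval_tweet(n, model, tweet):
  # Perplexity of a single tweet under the given n-gram model.
  # Unseen contexts or tokens fall back to a small fixed probability.
  missing_value = 1e-8
  N = len(tweet)
  entropy = 0  # accumulated negative log2-probability
  
  for i in range(N - n):
    i_n_gram = tweet[i: i + n]
    key = i_n_gram[0:n-1]
    key_1 = i_n_gram[n-1]

    if key in model and key_1 in model[key]:
      entropy += -math.log2(model[key][key_1])
    else:
      entropy += -math.log2(missing_value)
  
  # normalise by the tweet length and convert back from log space
  return 2 ** (entropy / N)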

In [10]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for
  
  df = pd.read_csv(data_file)
  perplexities = []

  for tweet in df['tweet_text'].values:
    tweet = start_token + tweet + end_token
    # eval_tweet returns the perplexity of one tweet
    perplexities.append(eval_tweet(n, model, tweet))
      
  # the score for the file is the mean per-tweet perplexity
  return np.mean(perplexities)

In [11]:
start_time = time.time()
vocabulary = preprocess()
print(time.time() - start_time)
start_time = time.time()
n = 2
test_dict = lm(n, vocabulary, data_files['en_df'], False)
print(time.time() - start_time)
start_time = time.time()
eval(n,test_dict, data_files['en_df'])
print(time.time() - start_time)


0.8540053367614746
0.5890016555786133
0.5590860843658447


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en, es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. The values are the corresponding perplexity scores.
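
The pairwise structure can be sketched with a nested dictionary. In the example below, `toy_score` is a hypothetical stand-in for "train a model on one language, compute perplexity on another", and its numbers are invented purely for illustration.

```python
# Hypothetical stand-in for training and scoring: the values are made up.
def toy_score(model_lang, data_lang):
    return 10.0 if model_lang == data_lang else 100.0

toy_languages = ["en", "es", "fr"]

# Outer key: language the model was trained on; inner key: language of the scored data.
toy_results = {
    model_lang: {data_lang: toy_score(model_lang, data_lang) for data_lang in toy_languages}
    for model_lang in toy_languages
}

# For each data file, the best-matching model is the one with the lowest perplexity.
best_model = {
    data_lang: min(toy_languages, key=lambda m: toy_results[m][data_lang])
    for data_lang in toy_languages
}
print(best_model)
```

When a nested dictionary like this is passed to `pd.DataFrame`, the outer keys become the columns and the inner keys become the row labels, so it is worth deciding up front which axis holds the models.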

In [12]:
def match(n, add_one, data_files):
    # n - the n-gram to use for creating n-gram models
    # add_one - use add_one smoothing or not
    result_dict = {}
    vocabulary = preprocess()
    for i_language_model in languages_list:
        
        i_model = lm(n, vocabulary, data_files[i_language_model], add_one)
        result_dict[i_language_model] = {}

        for i_language_test in languages_list:
            i_language_model_i_score = eval(n, i_model, data_files[i_language_test])
            result_dict[i_language_model][i_language_test] = i_language_model_i_score
    # columns: the language each model was trained on; rows: the data file it was scored on
    perplexity_df = pd.DataFrame(result_dict)
    return perplexity_df  

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [13]:
 
def run_match(data_files):
    # Build and display one perplexity table per (n, add_one) combination.
    tables = {}

    for n in range(1,5):
        for add_one in (True, False):
            perplexity_df = match(n, add_one, data_files)
            print(f'n = {n}, add_one = {add_one}')
            display(perplexity_df)
            tables[(n, add_one)] = perplexity_df

    return tables



# run the model generation

In [14]:
model_dict = run_match(data_files)


n = 1, add_one = True


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 40.774348 | 124.90862 | 129.230467 | 75.899527 | 107.950631 | 137.816454 | 137.346785 | 110.199493 |
| es_df | 207.644762 | 39.125644 | 318.18879 | 58.4244 | 177.840102 | 189.814924 | 60.058971 | 207.695057 |
| fr_df | 48.064142 | 46.808786 | 39.054124 | 50.355801 | 45.579365 | 46.316215 | 46.601269 | 49.978617 |
| in_df | 356.594077 | 285.839285 | 245.192811 | 44.22001 | 316.806158 | 355.004983 | 301.827572 | 185.146927 |
| it_df | 44.185389 | 45.940536 | 48.320345 | 48.978254 | 40.231848 | 45.017341 | 52.020327 | 45.8835 |
| nl_df | 68.962011 | 257.367909 | 668.497844 | 130.673384 | 107.988267 | 42.888666 | 685.457472 | 51.818649 |
| pt_df | 58.846987 | 103.854559 | 116.083266 | 92.674988 | 119.202251 | 111.052711 | 39.159835 | 99.40413 |
| tl_df | 97.464881 | 425.172628 | 490.322643 | 71.50689 | 420.49671 | 365.088437 | 160.207517 | 45.853908 |


n = 1, add_one = False


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 40.820226 | 126.549469 | 131.705056 | 76.986097 | 108.774851 | 139.924917 | 138.150663 | 111.920975 |
| es_df | 213.913386 | 39.320584 | 318.306626 | 62.113738 | 178.074081 | 194.184328 | 67.613789 | 212.989467 |
| fr_df | 48.176401 | 46.871547 | 39.123947 | 50.818248 | 45.650621 | 46.438121 | 46.62461 | 50.156348 |
| in_df | 360.18421 | 289.68343 | 256.175579 | 45.005271 | 319.462378 | 356.40386 | 312.718923 | 197.82806 |
| it_df | 44.237926 | 46.19222 | 48.345015 | 49.288003 | 40.268816 | 45.020045 | 51.988591 | 45.872009 |
| nl_df | 74.01603 | 288.889947 | 668.396253 | 149.316287 | 120.309383 | 43.103751 | 685.091449 | 52.408017 |
| pt_df | 59.544691 | 105.177398 | 116.832299 | 93.385096 | 119.408387 | 111.392043 | 39.268092 | 100.727509 |
| tl_df | 100.47247 | 433.863326 | 492.239345 | 76.175434 | 423.995527 | 366.331881 | 160.78849 | 47.601773 |


n = 2, add_one = True


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 20.693757 | 328.110938 | 357.157945 | 126.619032 | 270.327741 | 293.844588 | 154.667435 | 227.399952 |
| es_df | 518.875483 | 18.566185 | 521.852321 | 231.31894 | 510.550827 | 449.360212 | 457.506595 | 525.314764 |
| fr_df | 45.781068 | 42.60242 | 19.649692 | 54.629509 | 42.643306 | 40.511202 | 44.914272 | 54.659651 |
| in_df | 852.75353 | 678.39587 | 873.326941 | 21.372124 | 648.229617 | 768.298336 | 1152.191586 | 1155.456434 |
| it_df | 38.002393 | 50.084823 | 48.944251 | 62.860398 | 19.545255 | 39.464387 | 65.867905 | 43.22876 |
| nl_df | 693.629687 | 5025.594465 | 7185.426881 | 3945.829222 | 1563.847387 | 21.032919 | 7276.131067 | 159.142059 |
| pt_df | 275.516981 | 626.618126 | 639.476577 | 539.455172 | 630.068569 | 635.678989 | 19.092805 | 622.117798 |
| tl_df | 525.743769 | 1225.75303 | 1211.761959 | 688.779724 | 772.588048 | 424.102452 | 848.48338 | 20.701295 |


n = 2, add_one = False


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 17.574495 | 300.976191 | 326.331406 | 95.527587 | 226.715322 | 269.16534 | 114.160435 | 210.884648 |
| es_df | 494.30158 | 15.622754 | 496.174382 | 142.515709 | 483.453001 | 410.05749 | 392.725173 | 500.922661 |
| fr_df | 39.173762 | 35.351778 | 16.662634 | 45.436964 | 36.039665 | 33.588175 | 36.753565 | 46.259782 |
| in_df | 699.297742 | 507.072851 | 697.471 | 17.682221 | 522.881774 | 627.943873 | 910.969877 | 973.109458 |
| it_df | 31.648042 | 40.074918 | 40.161539 | 53.451889 | 16.378495 | 32.849686 | 53.937726 | 35.432997 |
| nl_df | 321.59207 | 4312.053259 | 7092.202856 | 3058.060921 | 886.971827 | 17.719253 | 7163.974441 | 85.194795 |
| pt_df | 213.730474 | 605.448267 | 619.164739 | 494.238338 | 608.021986 | 612.947015 | 15.54891 | 594.838083 |
| tl_df | 473.872909 | 1130.641632 | 1166.796416 | 593.79114 | 700.458143 | 363.155835 | 782.450234 | 17.042201 |


n = 3, add_one = True


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 26.457003 | 2107.658875 | 2543.913569 | 793.685043 | 2555.073278 | 3221.032877 | 1654.642285 | 2287.468919 |
| es_df | 1631.340956 | 24.159651 | 1537.227205 | 1476.728917 | 1412.467324 | 1544.629869 | 1452.98259 | 1607.535787 |
| fr_df | 826.93908 | 1763.999518 | 25.138475 | 1322.526246 | 1023.898929 | 803.928859 | 2013.948914 | 2062.544378 |
| in_df | 6502.200076 | 7077.247059 | 5936.20602 | 32.34328 | 6413.630707 | 4802.364664 | 6726.548727 | 5782.483338 |
| it_df | 1185.904799 | 1734.727636 | 1407.534348 | 2118.572754 | 26.003047 | 871.599971 | 1990.930627 | 1283.665468 |
| nl_df | 12464.067572 | 17815.386953 | 14964.523878 | 13485.221198 | 13810.105881 | 29.498847 | 21923.566155 | 8632.085444 |
| pt_df | 1468.283708 | 1627.880125 | 1509.244804 | 1475.86339 | 1566.286255 | 1312.301133 | 24.675786 | 1604.873331 |
| tl_df | 5975.948588 | 8262.751855 | 7076.315797 | 4283.793913 | 7671.212507 | 3556.538742 | 8825.767939 | 29.289474 |


n = 3, add_one = False


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 8.431643 | 1313.339011 | 1833.216434 | 338.423205 | 1708.3445 | 2480.016169 | 766.737168 | 1627.009282 |
| es_df | 1236.010856 | 8.043283 | 1127.649466 | 965.638082 | 1060.247769 | 1084.142203 | 1043.844729 | 1129.798456 |
| fr_df | 363.235158 | 1113.896858 | 8.187548 | 708.20256 | 444.830456 | 350.349735 | 1108.201566 | 1204.570837 |
| in_df | 4669.036699 | 4882.911335 | 4097.179591 | 9.458448 | 3934.80336 | 3056.518306 | 4053.188316 | 3991.818651 |
| it_df | 538.347197 | 933.689518 | 711.328113 | 1275.438106 | 8.091838 | 372.540723 | 1084.716655 | 624.478959 |
| nl_df | 9128.296768 | 14197.187586 | 12400.437757 | 10832.434389 | 10173.158793 | 8.842156 | 18273.212112 | 4886.861713 |
| pt_df | 954.683767 | 1193.028089 | 1082.761225 | 963.169857 | 1140.801281 | 838.310597 | 7.497781 | 1052.185424 |
| tl_df | 4513.335146 | 6185.288887 | 5449.62756 | 2929.874275 | 5744.711186 | 2077.641369 | 6796.388756 | 8.297402 |


n = 4, add_one = True


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 57.743051 | 37475.499926 | 24503.667626 | 21756.318549 | 30755.747796 | 25865.076572 | 41513.746884 | 17737.393571 |
| es_df | 24037.953285 | 52.457887 | 20681.415277 | 42967.659503 | 11333.07585 | 43154.632185 | 15475.494572 | 25786.079311 |
| fr_df | 26717.826604 | 31333.454594 | 53.132406 | 45392.692209 | 27310.138216 | 36253.661695 | 41606.189411 | 41478.637811 |
| in_df | 100800.957673 | 144267.808798 | 126392.048321 | 76.108727 | 146652.141165 | 100317.557295 | 192732.63214 | 62859.678484 |
| it_df | 43955.972265 | 22015.376347 | 37691.853211 | 54724.650931 | 56.928774 | 80119.415982 | 28013.981351 | 29646.46603 |
| nl_df | 85592.038897 | 130549.922095 | 98109.945594 | 82405.595342 | 130002.848524 | 63.043339 | 169776.921947 | 96938.369631 |
| pt_df | 41318.51184 | 15746.940566 | 34735.437792 | 53643.321319 | 15887.238915 | 67911.93924 | 52.750691 | 30836.403214 |
| tl_df | 77874.615881 | 80110.682873 | 91007.862 | 42007.058462 | 70940.240445 | 73623.52525 | 92407.27419 | 65.981143 |


n = 4, add_one = False


|  | en_df | es_df | fr_df | in_df | it_df | nl_df | pt_df | tl_df |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_df | 4.219846 | 17232.234021 | 10865.362739 | 8231.675212 | 13932.229797 | 14037.669123 | 15456.57568 | 9732.437101 |
| es_df | 9129.679701 | 4.445659 | 8397.337823 | 16989.805533 | 4988.957756 | 16325.547127 | 7705.356809 | 13378.263333 |
| fr_df | 15067.234131 | 17694.402246 | 4.245178 | 19425.05712 | 14348.294525 | 18860.444401 | 20282.816868 | 19121.461162 |
| in_df | 51105.973959 | 75884.122899 | 63505.700116 | 4.796285 | 75294.902641 | 48939.041006 | 94246.628046 | 39165.904924 |
| it_df | 21060.954272 | 12568.038215 | 18386.837112 | 23385.731676 | 4.31449 | 38664.099022 | 13366.776909 | 13465.878662 |
| nl_df | 62101.42248 | 86613.660144 | 67880.764955 | 53085.778179 | 82868.029052 | 4.308802 | 98865.006231 | 61732.830111 |
| pt_df | 17388.161933 | 8315.233949 | 16160.782638 | 21133.179206 | 7509.212467 | 27990.740696 | 4.094463 | 12404.735214 |
| tl_df | 48166.572612 | 44677.336535 | 47627.355339 | 26965.145556 | 39642.890845 | 37797.348982 | 46535.937822 | 4.212213 |


**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g., returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
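
One simple (non-trivial) approach is to score each sentence under every language model and pick the language with the lowest perplexity. The sketch below does this end to end on two tiny invented corpora. All names (`toy_train`, `toy_sentence_perplexity`, `toy_classify`) and the fallback probability `floor` are assumptions for illustration, not part of the assignment.

```python
import math
from collections import Counter, defaultdict

def toy_train(text, n=2):
    """Unsmoothed character n-gram model: {context: {token: probability}}."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        counts[text[i:i + n - 1]][text[i + n - 1]] += 1
    return {ctx: {tok: c / sum(cnt.values()) for tok, c in cnt.items()}
            for ctx, cnt in counts.items()}

def toy_sentence_perplexity(model, sentence, n=2, floor=1e-8):
    """Perplexity of one sentence; unseen n-grams get the probability `floor`."""
    log_sum, steps = 0.0, 0
    for i in range(len(sentence) - n + 1):
        p = model.get(sentence[i:i + n - 1], {}).get(sentence[i + n - 1], floor)
        log_sum -= math.log2(p)
        steps += 1
    return 2 ** (log_sum / max(steps, 1))

# Invented miniature corpora, one per language.
toy_corpora = {"en": "the cat sat on the mat ", "es": "el gato se sienta en la mesa "}
toy_models = {lang: toy_train(text) for lang, text in toy_corpora.items()}

def toy_classify(sentence):
    # The predicted language is the one whose model is least "surprised".
    return min(toy_models,
               key=lambda lang: toy_sentence_perplexity(toy_models[lang], sentence))

print(toy_classify("the mat"))  # → en
print(toy_classify("la mesa"))  # → es
```

With real data, the choice of n and of the smoothing scheme noticeably affects how often the lowest-perplexity model is the correct one.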

In [15]:
# portable directory listing (works on Windows as well as Unix shells)
print(sorted(os.listdir(os.path.join('nlp-course', 'lm-languages-data-new'))))


In [16]:
# build paths with os.path.join so the cell runs on any operating system
test_folder = os.path.join('nlp-course', 'lm-languages-data-new')
test_csv_files = glob.glob(os.path.join(test_folder, '*.csv'))
test_files = {}
for i_file in test_csv_files:
    file_name_with_ending = os.path.basename(i_file)
    file_name = os.path.splitext(file_name_with_ending)[0]
    test_files[file_name + '_df'] = i_file

In [17]:
def match_test(n, data_file_path, add_one):
    # n - the n-gram to use for creating n-gram models
    # data_file_path - CSV file with the sentences to classify (column 'tweet_text')
    # add_one - use add_one smoothing or not
    sentences_list = pd.read_csv(data_file_path)['tweet_text'].to_list()

    result_dict = {}

    for i_language_model in languages_list:
        i_model = lm(n, vocabulary, data_files[i_language_model], add_one)
        # perplexity of every test sentence under this language's model
        result_dict[i_language_model] = {
            idx: eval_tweet(n, i_model, sentence)
            for idx, sentence in enumerate(sentences_list)
        }

    perplexity_df = pd.DataFrame(result_dict)
    print(perplexity_df)
    # the predicted language is the model with the lowest perplexity
    language_match_index = np.argmin(perplexity_df.to_numpy(), axis=1)
    perplexity_df['predict'] = language_match_index
    perplexity_df['predict_language'] = reorder_list(languages_list, language_match_index)
    print(perplexity_df)

    return perplexity_df


def classify(n, data_file_path, add_one):
    match_dict  = match_test(n, data_file_path, add_one)
    return match_dict

In [18]:
n = 2
test_path = os.path.join(test_folder, 'test.csv')
clasification_result = classify(n, test_path, False)

          en_df      es_df       fr_df      in_df      it_df       nl_df  \
0     14.818845  20.744056   21.567739  20.175058  22.174236   20.346401   
1     30.138728  24.206101   28.153495  29.966054  18.956580   41.405755   
2     16.843284  17.411375   18.594713  15.760808  18.029057   19.689610   
3     21.089425  23.281874   21.561659  24.160825  24.578729   16.687899   
4     18.705082  18.861269   20.207775  15.422576  18.365791   19.919475   
...         ...        ...         ...        ...        ...         ...   
7994  24.806665  16.421329   21.777461  28.013050  18.560318   22.830151   
7995  25.525601  28.601536   28.180287  13.377498  29.316133   25.665874   
7996  91.525186  82.228881  103.019425  92.992217  56.403807  129.426663   
7997  13.768677  10.936258   11.725669  16.368396  12.996725   14.092017   
7998  19.337632  27.574758   26.417047  22.841206  28.790243   25.226890   

          pt_df      tl_df  
0     21.445368  19.072942  
1     26.193427  30.913829  


In [19]:
y_true = pd.read_csv(test_path).get('label').to_list()
y_true = list(map(lambda x: languages_list.index(x+'_df'),y_true))
y_pred = clasification_result['predict'].to_list()

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
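
For single-label multi-class predictions, micro-averaged F1 is equal to plain accuracy, while macro-averaged F1 weights every language equally. The pure-Python sketch below (the helper `toy_f1_scores` and the toy labels are hypothetical) computes both so the difference is visible; it is not a replacement for `sklearn.metrics.f1_score`.

```python
def toy_f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # the true label t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

toy_true = ["en", "en", "es", "fr"]
toy_pred = ["en", "es", "es", "fr"]
micro, macro = toy_f1_scores(toy_true, toy_pred)
print(micro, macro)
```

Here three of the four predictions are correct, so the micro score is 0.75, the same as the accuracy.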


In [20]:
def calc_f1(y_true, y_pred):
    # micro-averaged F1 over all test sentences
    return np.round(f1_score(y_true, y_pred, average="micro"), 3)
f_score_result = calc_f1(y_true, y_pred)
print('The F-score we achieve is ' + str(f_score_result) + '\n')

The F-score we achieve is 0.869



# **Good luck!**