# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. For your convenience, the data is provided in both CSV and JSON format. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
# imports
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import time
import glob
import os 
from sklearn.metrics import f1_score
from IPython.display import display


In [2]:
!git clone https://github.com/kfirbar/nlp-course.git

fatal: destination path 'nlp-course' already exists and is not an empty directory.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [3]:

!ls nlp-course/lm-languages-data-new


'ls' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
data_files = {'en_df': 'en.csv',
              'es_df': 'es.csv',
              'fr_df': 'fr.csv',
              'in_df': 'in.csv',
              'it_df': 'it.csv',
              'nl_df': 'nl.csv',
              'pt_df': 'pt.csv',
              'tl_df': 'tl.csv'}

    
directory = 'nlp-course/lm-languages-data-new/'    
for (key, value) in data_files.items():
    data_files[key] = directory + value
    
languages_list = list(data_files.keys())
start_token = '↠'
end_token = '↞'

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
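
As a minimal illustration of this token definition, the sketch below builds a character vocabulary from a small in-memory list of strings. The sample tweets and the helper name `build_char_vocabulary` are made up for the example; the actual *preprocess* function is expected to read the CSV files instead.

```python
def build_char_vocabulary(texts):
    """Return a sorted list of every distinct character in `texts` (hypothetical helper)."""
    vocab = set()
    for text in texts:
        vocab.update(text)  # iterating over a str yields single characters
    return sorted(vocab)


sample_tweets = ["hola", "hello"]  # made-up data, not taken from the course files
toy_vocabulary = build_char_vocabulary(sample_tweets)
print(toy_vocabulary)
```

Sorting is optional here; it only makes the result deterministic, since the order of a Python set is not guaranteed.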

In [5]:
def preprocess():
    """
    Build the vocabulary: every distinct character seen in the data files.

    Each data file is a table with two columns:
        1. tweet id
        2. tweet text
    """
    tokens = set()
    for path in data_files.values():
        df = pd.read_csv(path)
        for text in df['tweet_text'].values:
            tokens.update(text)  # a token is a single character
    return list(tokens)

**Part 2**

Write a function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant sequences of n-1 tokens, and the values are dictionaries mapping each possible n-th token to its probability of occurring. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

This means, for example, that after the sequence "ab" there is a 0.5 chance that "c" will appear, a 0.25 chance that "b" will appear, and a 0.25 chance that "d" will appear.

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
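
To make the smoothing concrete, here is a small sketch of add-one (Laplace) smoothing for a single context: each count is increased by 1 and the denominator by the vocabulary size V. The counts, the four-character vocabulary, and the helper name `add_one_probabilities` are assumptions chosen for illustration; they are not part of the assignment's API.

```python
from collections import Counter


def add_one_probabilities(next_token_counts, vocabulary):
    """Add-one smoothed P(token | context) for every token in the vocabulary.

    Hypothetical helper: `next_token_counts` holds the counts of the tokens
    that followed one fixed context.
    """
    total = sum(next_token_counts.values())
    v = len(vocabulary)
    return {
        token: (next_token_counts.get(token, 0) + 1) / (total + v)
        for token in vocabulary
    }


toy_vocab = ["a", "b", "c", "d"]                      # made-up vocabulary, V = 4
counts_after_ab = Counter({"c": 2, "b": 1, "d": 1})   # made-up counts after "ab"
smoothed = add_one_probabilities(counts_after_ab, toy_vocab)
print(smoothed)
```

With these counts the unsmoothed estimates would be 0.5, 0.25 and 0.25, as in the example above. After smoothing, the unseen token "a" receives 1/8 and the probabilities still sum to 1.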

In [6]:
#helper functions
def tweets_to_text(data_file_path, n):
    """
    Read a data file and return all of its tweets as one padded string.

    The data frame is a table with two columns:
        1. tweet id
        2. tweet text
    """
    df = pd.read_csv(data_file_path)
    # debug = True
    # if debug == True:
    #     df = df[0:100]
    columns_list = df.columns.to_list()
    tweets_list = df[columns_list[-1]].apply(lambda x: start_token + x + end_token).values
    text = ''.join(tweets_list)
    
    text = start_token * (n-1) + text + end_token * (n-1)

    return text

def reorder_list(List, index_list):
    return [List[i] for i in index_list]

In [7]:
def lm(n, vocabulary, data_file_path, add_one):
    # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
    # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
    # data_file_path - the data_file from which we record probabilities for our model
    # add_one - True/False (use add_one smoothing or not)
  
    lm_dict = {}
    V = len(vocabulary)

    text = tweets_to_text(data_file_path, n)

    # Extract n length substrings
    n_gram = [text[i: i + n] for i in range(len(text) - n)]

    lm_dict = defaultdict(lambda: defaultdict(lambda: 0))

    for i_n_gram in n_gram:
        n_1_gram = i_n_gram[0:n-1]
        lm_dict[n_1_gram][i_n_gram[n-1]] += 1
    
    for key in lm_dict.keys():
        key_count = sum(lm_dict[key].values())
        inner_dict = {}
        for key_1 in lm_dict[key].keys():
            if add_one:
                inner_dict[key_1] = (lm_dict[key][key_1] + 1) / (key_count + V)
            else:
                inner_dict[key_1] = lm_dict[key][key_1]/ key_count
        if add_one:
            # bind key_count now; a plain closure would see the last key's count
            lm_dict[key] = defaultdict(lambda kc=key_count: 1 / (kc + V), inner_dict)
        else:
            lm_dict[key] = defaultdict(lambda: 0, inner_dict)
            
    return lm_dict

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
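
As a reminder of the quantity involved, perplexity is 2 raised to the average negative log2-probability that the model assigns to each token. The sketch below computes it from a plain list of probabilities; the function name `perplexity` and the input values are assumptions for illustration, not the required *eval* signature.

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence, given the model probability of each token."""
    cross_entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** cross_entropy


# A model that always assigns 1/4 behaves like a uniform choice among 4 tokens.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)
```

Note that `math.log2(0)` raises a `ValueError`, so a token with probability 0 has to be handled explicitly. This is one reason smoothing matters when a model is evaluated on text it was not trained on.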

In [17]:
def eval(n, model, data_file):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # data_file - the tweets file that you wish to calculate a perplexity score for

    # read file
    if os.path.exists(data_file):
        text = tweets_to_text(data_file, n)
    else:
        text = data_file
    # Extract n length substrings
    n_gram = [text[i: i + n] for i in range(len(text) - n)]

    model_keys = model.keys()
    entropy = 0 
    for i_letter in n_gram:
        # n-grams whose context or next token is missing from the model add
        # nothing to the sum, but they still count in the average below
        if i_letter[0:n-1] in model_keys:
            i_letter_model = model[i_letter[0:n-1]]
            if i_letter[n-1] in i_letter_model.keys():
                second_letter_prob = i_letter_model[i_letter[n-1]]
                entropy += -np.log2(second_letter_prob)
    entropy = entropy/len(n_gram)
    perplexity_score = 2**(entropy)
    return perplexity_score

In [9]:
start_time = time.time()
vocabulary = preprocess()
print(time.time() - start_time)
start_time = time.time()
n = 2
test_dict = lm(n, vocabulary, data_files['en_df'], False)
print(time.time() - start_time)
start_time = time.time()
eval(n,test_dict, data_files['en_df'])
print(time.time() - start_time)


0.6890020370483398
0.39804935455322266
1.256019115447998


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity scores.

In [10]:
def match(n, add_one, data_files):
    # n - the n-gram to use for creating n-gram models
    # add_one - use add_one smoothing or not
    result_dict = {}
    vocabulary = preprocess()
    for i_language_model in languages_list:
        
        i_model = lm(n, vocabulary, data_files[i_language_model], add_one)
        result_dict[i_language_model] = {}

        for i_language_test in languages_list:
            i_language_model_i_score = eval(n, i_model, data_files[i_language_test])
            result_dict[i_language_model][i_language_test] = i_language_model_i_score
    perlexity_df = pd.DataFrame(result_dict)
    return perlexity_df  

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [11]:
 
def run_match(data_files):
    full_model_dict = {}
    # for n in range(2,3):

    for n in range(1,5):
        add_one = True
        perlexity_df = match(n, add_one, data_files)
        print(f'n = {n}, add_one = {add_one}')
        display(perlexity_df)

        add_one = False
        perlexity_df = match(n, add_one, data_files)
        print(f'n = {n}, add_one = {add_one}')
        display(perlexity_df)



# run the model generation

In [12]:
# model_dict = run_match(data_files)


**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
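
One simple (and by no means the only) way to classify a sentence is to score it with every language model and pick the language whose model gives the lowest perplexity. The sketch below shows only that selection step; the scores and the helper name `pick_language` are made up for illustration.

```python
def pick_language(perplexity_by_language):
    """Return the language whose model gave the lowest perplexity (hypothetical helper)."""
    return min(perplexity_by_language, key=perplexity_by_language.get)


# Illustrative perplexities of one sentence under three language models.
scores = {"en": 15.4, "es": 21.7, "fr": 22.5}
predicted = pick_language(scores)
print(predicted)
```

Lower perplexity means the model found the sentence less surprising, which is why the minimum is taken rather than the maximum.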

In [13]:
! ls nlp-course\lm-languages-data-new

'ls' is not recognized as an internal or external command,
operable program or batch file.


In [14]:
test_folder = os.path.join('nlp-course', 'lm-languages-data-new')
test_csv_files = glob.glob(os.path.join(test_folder, '*.csv'))
test_files = {}
for i_file in test_csv_files:
    file_name_with_ending = os.path.basename(i_file)
    file_name = os.path.splitext(file_name_with_ending)[0]
    test_files[file_name + '_df'] = i_file

In [15]:
def match_test(n, data_file_path, add_one):
    # n - the n-gram to use for creating n-gram models
    # add_one - use add_one smoothing or not
    #data_file_path = r"C:\MSC\NLP2\nlp-course\lm-languages-data-new\test.csv"
    senstences_list = pd.read_csv(data_file_path)['tweet_text'].to_list()

    lines = [] 
    result_dict = {}

    for i_language_model in languages_list:
        # i_model = model_dict[n][add_one][i_language_model]
        result_dict[i_language_model] = {}
        i_model = lm(n, vocabulary, data_files[i_language_model], add_one)

        for i_test_senstence_idx, i_test_senstence in enumerate(senstences_list):
            i_sentence_model_i_score = eval(n, i_model, i_test_senstence)
            result_dict[i_language_model][i_test_senstence_idx] = i_sentence_model_i_score
    # print('summary for '+ i_language_model +' model perlexity score for each language:\n')
    perlexity_df = pd.DataFrame(result_dict)
    print(perlexity_df)
    perlexity_array = perlexity_df.to_numpy()
    language_match_index = np.argmin(perlexity_array, axis=1)
    language_match_list = reorder_list(languages_list, language_match_index)
    perlexity_df['predict'] = language_match_index
    perlexity_df['predict_language'] = language_match_list
    print(perlexity_df)

    return perlexity_df


def classify(n, data_file_path, add_one):
    match_dict  = match_test(n, data_file_path, add_one)
    return match_dict

In [18]:
n = 2
test_path = os.path.join(test_folder, 'test.csv')
clasification_result = classify(n, test_path, False)

          en_df      es_df      fr_df      in_df      it_df      nl_df  \
0     15.409292  21.675997  22.549406  21.072941  23.192828  21.254515   
1     24.244849  22.182511  25.856457  27.546027  15.147335  33.462189   
2     17.863943  18.479226  19.762191  16.692756  19.148692  20.950789   
3     22.042157  24.368556  22.542958  25.302117  25.746159  17.382716   
4     19.857199  20.026403  21.486305  16.308162  19.489727  21.173554   
...         ...        ...        ...        ...        ...        ...   
7994  19.252089  17.205413  16.864522  21.784607  19.486260  20.629273   
7995  26.752657  30.025929  29.577340  13.889901  30.787122  26.901810   
7996  32.763242  29.295657  37.072243  33.311929  19.761095  47.049839   
7997  15.647607  12.289813  13.221803  18.759703  14.728798  16.033215   
7998  20.324670  29.155601  27.911392  24.074354  30.462845  26.633268   

          pt_df      tl_df  
0     22.419618  19.905563  
1     21.028282  24.877526  
2     18.150711  15.9418

In [19]:
y_true = pd.read_csv(test_path).get('label').to_list()
y_true = list(map(lambda x: languages_list.index(x+'_df'),y_true))
y_pred = clasification_result['predict'].to_list()

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
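
To show what a micro-averaged F1 score measures, the sketch below computes it by hand: true positives, false positives and false negatives are pooled over all labels before the score is taken. The label lists and the function name `micro_f1` are assumptions for illustration; in the notebook itself the scikit-learn function from the hint can be used directly.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1 once."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


toy_true = ["en", "es", "fr", "en"]  # made-up gold labels
toy_pred = ["en", "es", "en", "en"]  # made-up predictions
toy_score = micro_f1(toy_true, toy_pred)
print(toy_score)
```

When every sentence has exactly one true label and one predicted label, each error counts once as a false positive and once as a false negative, so the micro-averaged F1 equals plain accuracy (3 of 4 correct in this toy case).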


In [20]:
def calc_f1(y_true, y_pred):
    # micro-averaged F1 over all test sentences, rounded to 3 decimals
    return np.round(f1_score(y_true, y_pred, average="micro"), 3)
f_score_result = calc_f1(y_true, y_pred)
print('The F-score we achieve is ' + str(f_score_result) + '\n')

The F-score we achieve is 0.785



# **Good luck!**