# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 53 (delta 23), reused 32 (delta 8), pack-reused 0
Unpacking objects: 100% (53/53), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [2]:
!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


In [3]:
# IMPORTS
import pandas as pd
import numpy as np
import warnings
import os
from collections import Counter
from threading import Thread
from queue import Queue
from pandas.core.common import SettingWithCopyWarning
from sklearn.metrics import f1_score

np.seterr(divide = 'ignore')  # will ignore divide by 0 errors
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

In [4]:
# General functions:

def get_unseen_string(n):
  return "₪" * (n+1)

def get_delimeter(n):
  return "₪" * (n-1)

def get_language(file_path):
  return os.path.basename(file_path)[:-4]

def get_languages(csv_file_paths):
  langs = []
  for csv_path in csv_file_paths:
    langs.append(get_language(csv_path))
  return langs

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
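
As a minimal sketch of the idea, assuming the tweets are already loaded into a plain Python list (the names `build_char_vocabulary` and `toy_tweets` are hypothetical and not part of the assignment), a character vocabulary can be collected with `collections.Counter`. Iterating over a Python `str` yields one Unicode code point at a time, so an emoji followed by a variation selector contributes two separate tokens.

```python
from collections import Counter

def build_char_vocabulary(tweets):
    """Count every character (Unicode code point) seen in the given tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet)  # iterating a str yields one character at a time
    return counts

# Hypothetical toy corpus, not taken from the course data
toy_tweets = ["hola", "olé!"]
char_counts = build_char_vocabulary(toy_tweets)
vocabulary_list = sorted(char_counts)  # the vocabulary itself: the unique characters
print(vocabulary_list)
```

Keeping the counts alongside the characters costs nothing extra, and `len(char_counts)` still gives the vocabulary size needed later for smoothing.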

In [5]:
def preprocess_signle_csv(path_to_file):
  d = {}
  data = pd.read_csv(path_to_file, usecols=[1])
  for index, row in data.iterrows():
      tweet = row["tweet_text"]
      for char in tweet:
        if char not in d:
          d[char] = 0
        d[char] = d[char]+1
  return d

In [6]:
path = "nlp-course/lm-languages-data-new/es.csv"
d = preprocess_signle_csv(path_to_file = path)

print(d)

{'R': 7699, 'T': 7800, ' ': 112140, '@': 7688, 'A': 6147, 'z': 3224, 'u': 21084, '_': 1687, 'M': 3852, 'o': 49359, 'n': 33417, 't': 35037, 'e': 62734, 'r': 33320, ':': 9544, 'p': 16222, 'd': 21583, 's': 39217, 'q': 5623, 'a': 62838, 'c': 22755, 'i': 31672, 'm': 15415, 'l': 25660, ',': 3116, 'b': 6832, 'v': 6656, 'é': 1109, '9': 1137, '0': 2042, '%': 61, 'h': 10311, '/': 12604, '.': 9438, 'f': 4071, '5': 1151, 'w': 1275, 'Q': 1433, '1': 2155, 'B': 1910, 'S': 4386, 'x': 1931, 'L': 3730, 'í': 2153, ';': 52, '…': 1752, 'U': 2016, '8': 1116, 'j': 4140, '❤': 123, 'O': 3401, 'H': 1809, 'J': 1824, 'N': 3280, 'á': 1728, 'Z': 911, '#': 2628, 'E': 5441, 'y': 6049, 'ó': 1699, 'C': 4256, 'g': 7161, '4': 1120, 'V': 1957, 'W': 927, 'D': 3122, '▶': 5, 'k': 1696, '😴': 23, 'P': 3243, 'Ñ': 37, 'K': 1089, '3': 1274, 'F': 2017, '✨': 12, '7': 1303, 'Y': 1660, 'X': 849, 'I': 2727, '2': 1742, '6': 954, "'": 170, '📱': 5, '🚈': 1, 'G': 1952, '¡': 319, 'Ó': 70, '!': 1509, '(': 352, '?': 852, '"': 923, '😍': 171, '

In [7]:
# Get all the relevant CSV files
def get_all_csv_paths(path):
  csv_file_paths = []
  for file in os.listdir(path):
      if file.endswith(".csv") and not file.startswith("test"):  # skip test.csv and tests.csv
          csv_file_paths.append(os.path.join(path, file))
    
  return csv_file_paths

In [8]:
def preprocess(csv_file_paths):
  vocabulary = {}
  for file_path in csv_file_paths:
    # print(file_path)
    single_dict = preprocess_signle_csv(file_path)
    vocabulary = dict(Counter(vocabulary)+Counter(single_dict))

  return vocabulary

In [9]:
path = "nlp-course/lm-languages-data-new"
csv_file_paths = get_all_csv_paths(path)

# create a dataframe
dataframe = pd.DataFrame()
dataframe['Path'] = csv_file_paths
display(dataframe)

Unnamed: 0,Path
0,nlp-course/lm-languages-data-new/it.csv
1,nlp-course/lm-languages-data-new/in.csv
2,nlp-course/lm-languages-data-new/tl.csv
3,nlp-course/lm-languages-data-new/nl.csv
4,nlp-course/lm-languages-data-new/es.csv
5,nlp-course/lm-languages-data-new/en.csv
6,nlp-course/lm-languages-data-new/fr.csv
7,nlp-course/lm-languages-data-new/pt.csv


In [10]:
vocabulary = preprocess(csv_file_paths)
print(vocabulary)

{'R': 55614, 'T': 63769, ' ': 825794, '@': 62716, 'm': 125237, 'a': 458350, 't': 318548, 'e': 432323, 'o': 313453, 'r': 229310, 'n': 281182, 'z': 24732, 'i': 283634, ':': 74434, 'P': 24999, 'D': 26958, ',': 19870, 'c': 137238, 'g': 90531, 's': 253808, 'E': 31232, 'u': 148786, 'p': 123471, '.': 76105, 'C': 27087, 'l': 172567, 'd': 136708, 'h': 129186, '/': 105472, 'W': 13264, 'U': 15529, '2': 17082, 'w': 33995, 'J': 14015, 'v': 53351, "'": 12323, '#': 29457, 'I': 28278, 'f': 43042, '"': 7074, 'L': 27628, 'b': 61003, 'à': 2465, 'q': 21907, '1': 20400, '8': 9138, '3': 11653, '0': 18681, '7': 12290, '?': 8167, '🐷': 22, 'x': 15450, 'G': 18337, 'A': 51610, 'S': 36790, '!': 14133, '…': 12030, 'y': 53516, 'Q': 8620, 'F': 18479, 'è': 2342, 'k': 67329, '9': 9461, 'H': 22555, 'B': 25621, 'X': 6647, 'Y': 12870, '*': 832, 'M': 32582, '“': 357, '”': 350, '4': 9671, '5': 10148, 'N': 26369, 'O': 26705, '❤': 1275, 'Z': 7890, 'j': 31096, '_': 13894, 'V': 14570, '🎉': 160, '️': 1421, '😍': 1323, '⚪': 25, '

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
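
One way to picture add-one smoothing: with vocabulary size |V|, the estimate is P(c | prefix) = (count(prefix, c) + 1) / (count(prefix) + |V|). The sketch below assumes an in-memory list of strings instead of a CSV file, and the helper names `ngram_counts` and `add_one_probability` are hypothetical. It computes the smoothed probability on demand, so unseen prefixes and unseen characters are handled by the same formula; the `lm` cell further down instead stores a single fallback probability under a special key.

```python
from collections import Counter, defaultdict

def ngram_counts(lines, n, pad="₪"):
    """Map each (n-1)-character prefix to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    padding = pad * (n - 1)
    for line in lines:
        text = padding + line + padding
        for i in range(len(text) - n + 1):
            counts[text[i:i + n - 1]][text[i + n - 1]] += 1
    return counts

def add_one_probability(counts, vocabulary_size, prefix, char):
    """Add-one estimate; also defined for unseen prefixes and unseen characters."""
    following = counts.get(prefix, Counter())
    return (following[char] + 1) / (sum(following.values()) + vocabulary_size)

# Hypothetical toy corpus and vocabulary size
toy_counts = ngram_counts(["abc", "abd"], n=3)
print(add_one_probability(toy_counts, 5, "ab", "c"))  # (1 + 1) / (2 + 5)
print(add_one_probability(toy_counts, 5, "zz", "a"))  # (0 + 1) / (0 + 5)
```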

In [11]:
def get_counters(n, data_file_path):
  delimiter = get_delimeter(n)
  dict_of_dicts = {}
  data = pd.read_csv(data_file_path, usecols=[1])
  for index, row in data.iterrows():
      tweet = delimiter + row["tweet_text"] + delimiter
      for index in range(0, len(tweet)-(n-1)): 
        window = tweet[index : index + n]
        prefix, suffix = window[:n-1], window[n-1]
        if prefix not in dict_of_dicts:
          dict_of_dicts[prefix] = {}
        if suffix not in dict_of_dicts[prefix]:
          dict_of_dicts[prefix][suffix] = 0
        dict_of_dicts[prefix][suffix] += 1

  return dict_of_dicts

In [12]:
# Returns a language model in the form of dict[prefix --> dict[suffix --> probability]]
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)  
  
  vocabulary_size = len(vocabulary)
  dod = get_counters(n, data_file_path) #dod = dict of dicts
  probability_dict = {}
  for prefix in dod:
    probability_dict[prefix] = {}
    total_count = 0
    for suffix in dod[prefix]:
      total_count += dod[prefix][suffix]
    for suffix in dod[prefix]:
      if add_one:
        probability_dict[prefix][suffix] = (dod[prefix][suffix] + 1) / (total_count + vocabulary_size)
      else:
        probability_dict[prefix][suffix] = dod[prefix][suffix] / total_count
  
  # ADD AN UNSEEN CHARACTER PROBABILITY
  unseen = get_unseen_string(n)
  probability_dict[unseen] = 0
  if add_one:
    probability_dict[unseen] = 1/vocabulary_size
  
  return probability_dict 

In [13]:
path = "nlp-course/lm-languages-data-new/es.csv"
probability_dict_false = lm(3, vocabulary, path, False)
print(probability_dict_false)

{'₪₪': {'R': 0.48927658628736526, '9': 0.00055561729081009, 'U': 0.004333814868318702, 'N': 0.02066896321813535, '#': 0.028114234914990554, '▶': 0.00011112345816201801, 'n': 0.002333592621402378, 'M': 0.028225358373152574, 'E': 0.03167018557617513, 'i': 0.00044449383264807203, 'O': 0.005333925991776864, 'L': 0.023447049672185798, 'A': 0.020224469385487276, '@': 0.13057006334037116, '/': 0.000333370374486054, 'm': 0.004111567951994666, 'Y': 0.013001444604956107, 'S': 0.019779975552839203, 'C': 0.02066896321813535, 'F': 0.00388932103567063, 'x': 0.00022224691632403602, 'J': 0.005222802533614846, 'D': 0.010112234692743638, '8': 0.000333370374486054, 'h': 0.0035559506611845763, '"': 0.003778197577508612, 'Q': 0.014890543393710412, 's': 0.002000222246916324, 'a': 0.003111456828536504, 'G': 0.00444493832648072, 'c': 0.002666962995888432, 'H': 0.010445605067229692, 't': 0.00222246916324036, "'": 0.00022224691632403602, 'I': 0.0035559506611845763, 'B': 0.005889543282586954, 'd': 0.001555728414

In [14]:
# Same call as two cells above, but with add_one smoothing enabled.
path = "nlp-course/lm-languages-data-new/es.csv"
probability_dict_true = lm(3, vocabulary, path, True)
print(probability_dict_true)

{'₪₪': {'R': 0.4077400240718452, '9': 0.0005555041199888899, 'U': 0.003703360799925933, 'N': 0.017313211739653736, '#': 0.023516341079529674, '▶': 0.00018516803999629665, 'n': 0.002036848439959263, 'M': 0.023608925099527823, 'E': 0.02647902971947042, 'i': 0.0004629200999907416, 'O': 0.004536616979909268, 'L': 0.019627812239607443, 'A': 0.01694287565966114, '@': 0.10887880751782242, '/': 0.0003703360799925933, 'm': 0.003518192759929636, 'Y': 0.010924914359781502, 'S': 0.01657253957966855, 'C': 0.017313211739653736, 'F': 0.0033330247199333395, 'x': 0.00027775205999444494, 'J': 0.004444032959911119, 'D': 0.008517729839829645, '8': 0.0003703360799925933, 'h': 0.0030552726599388947, '"': 0.0032404406999351912, 'Q': 0.012498842699750023, 's': 0.001759096379964818, 'a': 0.0026849365799463012, 'G': 0.003795944819924081, 'c': 0.002314600499953708, 'H': 0.008795481899824091, 't': 0.0019442644199611147, "'": 0.00027775205999444494, 'I': 0.0030552726599388947, 'B': 0.004999537079900009, 'd': 0.001

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
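
For reference, perplexity over N tokens is usually defined as 2 raised to the average negative log2-probability, PP = 2^(-(1/N) Σ log2 p_i); the base of the logarithm and the base of the exponent have to match. The sketch below is a standalone illustration that takes the per-token probabilities directly (the function name `perplexity` and the toy inputs are assumptions), not the notebook's implementation.

```python
import math

def perplexity(probabilities):
    """Perplexity of a sequence, given the model probability of each token."""
    if any(p == 0 for p in probabilities):
        return math.inf  # one impossible token makes the whole sequence impossible
    cross_entropy_bits = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** cross_entropy_bits

# A model that spreads its mass uniformly over 4 characters has perplexity 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)  # 4.0
```

The zero-probability branch corresponds to the `inf` entries that appear without smoothing.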

In [15]:
def entropy_single_line(n, model, single_line):
  sum = 0
  for index in range(0, len(single_line)-(n-1)): 
    window = single_line[index : index + n]
    prefix, suffix = window[:n-1], window[n-1]
    if prefix in model and suffix in model[prefix]: # the n-gram was seen during training
      window_prob = model[prefix][suffix]
    else:
      window_prob = model[get_unseen_string(n)]
    sum += np.log(window_prob) # natural log; note that eval() raises 2 (not e) to the averaged value
  N = len(single_line) - (n-1) # the number of windows, which we divide by to average the log-probabilities
  return -sum/N 

In [16]:
def entropy(n, model, data_file):
  delimiter = get_delimeter(n)
  data = pd.read_csv(data_file, usecols=[1])
  prob_for_n_sized_windows = []
  for index, row in data.iterrows():
    tweet = delimiter + row["tweet_text"] + delimiter
    single_entropy = entropy_single_line(n, model, tweet)
    prob_for_n_sized_windows.append(single_entropy)

  avg = np.average(prob_for_n_sized_windows)
  return avg

In [17]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for
  file_entropy = entropy(n, model, data_file)
  return np.power(2, file_entropy)

In [18]:
# expecting inf
eval_en_when_es_false = eval(3, probability_dict_false, data_file = "nlp-course/lm-languages-data-new/en.csv")
assert(np.isinf(eval_en_when_es_false))

In [19]:
# expecting number
eval_en_when_es_true = eval(3, probability_dict_true, data_file = "nlp-course/lm-languages-data-new/en.csv")
assert(eval_en_when_es_true > 0 and not np.isinf(eval_en_when_es_true))

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity scores.
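
The pairwise structure can be sketched independently of the language models. The example below uses hypothetical stand-ins (a "model" is just a set of known characters, and the score is the fraction of unseen characters), so only the shape of the result carries over to the real task.

```python
def pairwise_scores(models, datasets, score):
    """Return table[model_name][dataset_name] = score(model, dataset)."""
    return {
        model_name: {data_name: score(model, data) for data_name, data in datasets.items()}
        for model_name, model in models.items()
    }

# Hypothetical stand-ins for language models and data files
toy_models = {"en": set("the cat"), "es": set("el gato")}
toy_data = {"en": "the hat", "es": "el pato"}

def unseen_fraction(model, text):
    # Fraction of characters in the text that the model has never seen
    return sum(ch not in model for ch in text) / len(text)

table = pairwise_scores(toy_models, toy_data, unseen_fraction)
print(table)
```

When a nested dictionary like this is passed to `pd.DataFrame`, the outer keys become the columns and the inner keys become the row labels, so here each column would correspond to a model and each row to a data file.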

In [20]:
# Creates all language models for a specific (n, add_one)
# The return is of type dict[language --> lm]
def create_models_for(n, add_one, all_dicts):
  d = {} # d[lang] = lang_model
  for csv_path in csv_file_paths:
    lang = get_language(csv_path)
    d[lang] = lm(n, all_dicts, csv_path, add_one)
  return d

In [21]:
def match(n, add_one, all_dicts):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not
  d = {}
  models = create_models_for(n, add_one, all_dicts)
  for model_lang in models:
    d[model_lang] = {}
    model = models[model_lang]
    for csv_path in csv_file_paths:
      lang = get_language(csv_path)
      d[model_lang][lang] = eval(n, model, csv_path)
  return pd.DataFrame(d)

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [22]:
def get_models(all_dicts):
  for add_one in [True, False]:
    for n in range(1, 5):
      print("N =", n, "Add_one =", add_one)
      table = match(n, add_one, all_dicts)
      display(table)
      print("\n\n")

In [23]:
get_models(vocabulary)

N = 1 Add_one = True


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,12.368537,13.383246,13.208159,12.98309,12.753481,12.78787,12.834103,12.82026
in,13.537959,12.235279,12.503108,13.094666,13.577163,13.071419,13.761801,13.425443
tl,13.955736,13.039323,12.68734,13.895661,14.110643,13.555021,14.553353,14.044711
nl,13.11913,13.230242,13.381383,12.388946,13.21221,12.780935,13.111066,13.213739
es,12.400383,12.845106,13.161895,12.714432,11.847199,12.602491,12.538025,12.075732
en,13.066213,13.295214,13.164669,12.908324,13.219872,12.479542,13.116451,13.296562
fr,12.677631,13.914176,14.005824,13.024497,12.701835,12.933058,12.187067,12.70735
pt,12.435171,13.037479,13.419756,12.843063,12.092547,12.763457,12.486715,11.864653





N = 2 Add_one = True


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,8.146789,10.612688,10.254435,11.01741,9.585428,10.326552,10.052037,9.90637
in,11.996818,8.697646,9.89574,11.124579,12.132471,10.977443,11.77871,12.483332
tl,11.668499,9.913086,8.609564,11.287511,11.976762,10.813714,12.155984,12.236198
nl,11.81668,10.864495,11.133415,8.535492,11.819532,10.36702,10.953772,12.203374
es,9.349222,10.717036,10.329796,10.906147,7.866421,10.133078,9.550332,9.093505
en,11.254913,10.726083,10.187335,10.149044,11.356458,8.498361,10.458419,11.661443
fr,10.593545,11.591697,11.537108,10.873708,10.449458,10.420979,8.159095,10.657873
pt,10.024454,11.522793,11.14434,11.593282,9.591008,11.006396,10.316818,8.111316





N = 3 Add_one = True


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,9.74462,18.808278,17.935253,19.605307,15.212751,17.454942,16.414986,16.825671
in,23.567676,11.35355,17.270791,21.015629,23.079091,20.823185,22.367065,25.078599
tl,20.765565,16.222565,10.777631,19.910768,22.317616,19.379961,22.446231,22.683363
nl,22.225094,19.541809,20.32228,10.574576,21.786145,17.594737,19.329433,24.038294
es,14.464718,18.324679,17.616363,18.746834,9.374962,16.632811,14.873149,14.118806
en,19.00171,17.700063,16.012622,15.991945,19.305996,9.938781,16.918217,20.808172
fr,17.360934,20.137507,20.072768,17.895608,17.116083,16.668393,9.484167,18.646494
pt,15.894408,19.892358,19.408646,20.480385,14.600009,18.609194,17.09915,9.689911





N = 4 Add_one = True


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,17.006109,50.219426,46.686206,51.58759,36.548427,46.379279,42.451564,41.762727
in,59.58008,21.285559,44.898112,56.730049,59.14001,57.05265,59.686432,64.355751
tl,49.61569,39.089137,19.16918,50.555644,54.474992,48.491062,56.11576,54.67377
nl,53.685734,50.564993,52.030813,18.653498,52.128962,43.430364,47.221209,59.494503
es,34.449086,50.546737,46.668352,49.343454,16.348941,45.526401,37.604802,32.58364
en,45.865887,43.724607,37.454346,38.576534,47.2018,17.312879,41.156952,51.189047
fr,42.54189,52.227668,51.596908,43.694232,40.61623,41.280546,16.163781,46.815835
pt,37.603174,53.247078,50.081979,54.623729,32.691377,49.820021,43.663679,16.84875





N = 1 Add_one = False


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,12.355025,inf,inf,inf,inf,inf,inf,inf
in,inf,12.223462,inf,inf,inf,inf,inf,inf
tl,inf,inf,12.673132,inf,inf,inf,inf,inf
nl,inf,inf,inf,12.376917,inf,inf,inf,inf
es,inf,inf,inf,inf,11.835572,inf,inf,inf
en,inf,inf,inf,inf,inf,12.468146,inf,inf
fr,inf,inf,inf,inf,inf,inf,12.176289,inf
pt,inf,inf,inf,inf,inf,inf,inf,11.848403





N = 2 Add_one = False


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,7.236015,inf,inf,inf,inf,inf,inf,inf
in,inf,7.698956,inf,inf,inf,inf,inf,inf
tl,inf,inf,7.553043,inf,inf,inf,inf,inf
nl,inf,inf,inf,7.646901,inf,inf,inf,inf
es,inf,inf,inf,inf,7.004003,inf,inf,inf
en,inf,inf,inf,inf,inf,7.615789,inf,inf
fr,inf,inf,inf,inf,inf,inf,7.306132,inf
pt,inf,inf,inf,inf,inf,inf,inf,7.090917





N = 3 Add_one = False


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,4.381789,inf,inf,inf,inf,inf,inf,inf
in,inf,4.896904,inf,inf,inf,inf,inf,inf
tl,inf,inf,4.468787,inf,inf,inf,inf,inf
nl,inf,inf,inf,4.667733,inf,inf,inf,inf
es,inf,inf,inf,inf,4.385587,inf,inf,inf
en,inf,inf,inf,inf,inf,4.491518,inf,inf
fr,inf,inf,inf,inf,inf,inf,4.398343,inf
pt,inf,inf,inf,inf,inf,inf,inf,4.22528





N = 4 Add_one = False


Unnamed: 0,it,in,tl,nl,es,en,fr,pt
it,2.80401,inf,inf,inf,inf,inf,inf,inf
in,inf,3.030128,inf,inf,inf,inf,inf,inf
tl,inf,inf,2.761774,inf,inf,inf,inf,inf
nl,inf,inf,inf,2.819548,inf,inf,inf,inf
es,inf,inf,inf,inf,2.867606,inf,inf,inf
en,inf,inf,inf,inf,inf,2.738892,inf,inf
fr,inf,inf,inf,inf,inf,inf,2.765891,inf
pt,inf,inf,inf,inf,inf,inf,inf,2.736354







**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

**Solution:** For each language we build three models with add_one = True, for n = 2, 3 and 4. For every tweet we compute its score under each of the three models and sum them, and we assign the tweet to the language with the lowest total score.
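
The decision rule itself is an argmin over summed scores. Here is a minimal sketch under stated assumptions: `classify`, `make_scorer` and `toy_scorers` are hypothetical names, and each scorer is a toy stand-in for a per-tweet perplexity (lower means a better fit).

```python
def classify(text, scorers_by_language):
    """Pick the language whose scorers give the smallest summed score."""
    totals = {
        lang: sum(scorer(text) for scorer in scorers)
        for lang, scorers in scorers_by_language.items()
    }
    return min(totals, key=totals.get)

def make_scorer(known_chars, penalty):
    # Hypothetical stand-in for a per-tweet perplexity: lower means a better fit
    return lambda text: sum(penalty for ch in text if ch not in known_chars)

en_chars, es_chars = set("the quick brown fox"), set("el zorro marrón")
toy_scorers = {
    "en": [make_scorer(en_chars, 1), make_scorer(en_chars, 2)],
    "es": [make_scorer(es_chars, 1), make_scorer(es_chars, 2)],
}
prediction = classify("the fox", toy_scorers)
print(prediction)  # en
```

Summing gives the scorer with the largest values the most influence; averaging ranks or log-scores would be alternative design choices.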

In [24]:
# We could also use set(list) instead, but we wanted to keep the order of the tweets
def get_unique_tweets_list(tweets):
  unique_tweets = []
  seen_tweets = set()
  for tweet in tweets:
    if tweet not in seen_tweets:
      seen_tweets.add(tweet)
      unique_tweets.append(tweet)
  return unique_tweets

In [25]:
def get_tweets_from_test_file(read_data = None):
  if read_data is None:
    path = "nlp-course/lm-languages-data-new/test.csv"
    data = pd.read_csv(path, usecols=[1])
  else:
    data = read_data
  tweets = list(map(lambda indexrow: indexrow[1]["tweet_text"], data.iterrows())) # indexrow is an (index, row) tuple
  return get_unique_tweets_list(tweets)

In [26]:
test_tweets = get_tweets_from_test_file()
print("Number of unique tweets in the test file =", len(test_tweets))

Number of unique tweets in the test file = 7772


In [27]:
def eval_single_line(n, model, tweet):
  single_entropy = entropy_single_line(n, model, tweet)
  return np.power(2, single_entropy)

In [28]:
def eval_tweets(vocabulary):
  all_models = {}
  results = {} # tweet -> lang

  languages = get_languages(csv_file_paths)
  tweets = get_tweets_from_test_file()
  min_n, max_n_non_inclusive = 2, 5
 
  for n in range(min_n, max_n_non_inclusive):
    all_models[n] = create_models_for(n, True, vocabulary)

  for tweet in tweets:
    results[tweet] = ""
    min_score = np.inf
    for lang in languages:
      score = 0
      for n in range(min_n, max_n_non_inclusive):
        model = all_models[n][lang]
        score += eval_single_line(n, model, tweet)
      
      if score < min_score:
        min_score = score
        results[tweet] = lang.upper()
  
  dataframe = pd.DataFrame()
  dataframe['Tweet'] = results.keys()
  dataframe['Predicted Language'] = results.values()
  display(dataframe)

  return results

In [29]:
predicted_results = eval_tweets(vocabulary)

Unnamed: 0,Tweet,Predicted Language
0,RT @jarsofshine: In 08 I had a volunteer who h...,EN
1,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,IT
2,@jaynaldmase @acobasilianne @dingDANGdantes @d...,TL
3,"Daags voor @RondeVlaanderen, @VoltaClassic als...",NL
4,RT @ertsul20: Susuportahan kita hanggang sa du...,TL
...,...,...
7767,"La triste historia que inspiró ""Tu falta de qu...",ES
7768,RT @ShahwalAdli_: Aku tak bersuara tak bermakn...,IN
7769,@Benji_Mascolo DEVI TAGLIARE QUEI CAPELLI 😠😡😠😂❤,IT
7770,Assistimos de camarote varias brigas ontem!,PT


**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
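
When every example has exactly one label, micro-averaged F1 pools true positives, false positives and false negatives over all classes, and the result equals plain accuracy. The hand-rolled sketch below (the function `micro_f1` and the toy labels are assumptions for illustration) makes that pooling explicit.

```python
def micro_f1(labels, predictions):
    """Micro-averaged F1 for single-label multiclass data."""
    classes = set(labels) | set(predictions)
    tp = fp = fn = 0
    for cls in classes:
        for truth, guess in zip(labels, predictions):
            tp += truth == cls and guess == cls  # correct prediction of this class
            fp += truth != cls and guess == cls  # predicted this class wrongly
            fn += truth == cls and guess != cls  # missed this class
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical toy labels: 3 of 4 predictions are correct
toy_labels = ["EN", "ES", "ES", "FR"]
toy_predictions = ["EN", "ES", "FR", "FR"]
print(micro_f1(toy_labels, toy_predictions))  # 0.75
```

Macro averaging, which computes F1 per class and then averages, would weight small classes more heavily.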


In [30]:
def get_predicts_and_labels_lists(predicted_results):
  # Pair the prediction of each unique test tweet (first occurrence only) with its gold label
  path = "nlp-course/lm-languages-data-new/test.csv"
  data = pd.read_csv(path, usecols=[1, 2])

  seen_tweets = set()
  predictions, labels = [], []

  for index, row in data.iterrows():
    tweet = row['tweet_text']
    label = row['label'].upper()
    if tweet not in seen_tweets:
      seen_tweets.add(tweet)
      predictions.append(predicted_results[tweet])
      labels.append(label)

  return predictions, labels

In [31]:
predictions, labels = get_predicts_and_labels_lists(predicted_results)

In [32]:
score = f1_score(labels, predictions, average="micro")

In [33]:
print("F1 Score = {:.5f}".format(score))

F1 Score = 0.92640


# **Good luck!**