
# Assignment 1
In this assignment you will create tools for training and testing language models.
The corpora you will work with are lists of tweets in 8 different languages that use the Latin script. For your convenience, the data is provided in both CSV and JSON formats. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0
Unpacking objects: 100% (71/71), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
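
As a minimal illustration of the token definition above, the following sketch builds a character vocabulary from two made-up strings. The helper name `build_char_vocab` and the sample strings are assumptions for illustration only, not part of the assignment data.

```python
def build_char_vocab(texts):
    """Collect every distinct character seen in an iterable of strings."""
    vocab = set()
    for text in texts:
        vocab.update(text)  # iterating over a str yields one character at a time
    return sorted(vocab)    # sorted only to make the result deterministic


sample_tweets = ["hola!", "hello"]  # hypothetical sample data
toy_vocab = build_char_vocab(sample_tweets)
print(toy_vocab)
```

Because Python strings are sequences of Unicode characters, emoji and accented letters each count as a single token here.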

In [None]:
import pandas as pd
import numpy as np
import os 
import itertools
from collections import defaultdict, Counter
import math
from sklearn.metrics import f1_score

In [None]:
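def preprocess():
  # Build the vocabulary: every distinct character that appears in any tweet,
  # plus the special start/end tokens used by the language models.
  tokens_set = set()
  dir_name = 'nlp-course/lm-languages-data-new'
  for file in os.listdir(dir_name):
    # The JSON files are equivalent to the CSV files, so reading the CSVs is enough.
    if file.endswith('.csv'):
      file_path = os.path.join(dir_name, file)
      for tweet in pd.read_csv(file_path)['tweet_text']:
        # Iterating over a string yields its characters one by one.
        tokens_set |= set(tweet)

  tokens_set.add('<start>')
  tokens_set.add('<end>')

  return list(tokens_set)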
  

In [None]:
VOCAB = preprocess()


In [None]:
print(VOCAB)
len(VOCAB)

['🍟', '⠀', '林', 'ิ', '🍉', '赫', '🕤', '↪', '🔻', '🤛', '⚜', '┃', '도', '🏇', '화', '💌', '맨', '🕵', '🙇', '천', '⋅', '😱', '🛐', '🖼', 'ᵉ', '구', '🌗', 'ｎ', '➋', '네', 'û', '더', '🌇', '찌', 'ノ', '終', 'メ', '🌷', '🥞', '✧', '시', '„', '탄', '힐', '╭', '꼼', 'ة', 'á', '＠', '🏿', '努', '😨', '9', 'Ｍ', '?', '🇫', '💍', '료', '⭐', '강', '🎋', '📹', 'ร', '😕', '🍥', '➍', '🛩', '¥', 'ä', '🐶', '가', '🏖', '🎧', '쩜', '💲', '💃', 'Θ', '야', '生', '🌯', '🍼', '🌒', 'Ｖ', '💆', '🏘', 'ン', '보', '잭', '🍴', 'Ğ', '🍪', '🎾', '⛷', 'K', '–', 'し', '👶', '📈', 'ถ', '👎', '💥', '🍓', '🤚', 'ﾉ', '♡', 'Ⅳ', '🍊', '🎲', '아', '💗', '😹', '💓', '▊', '🔫', '🤞', '🆑', '社', '⁷', 'ñ', '1', 'ォ', '패', '！', 'ƒ', 'ナ', '結', '☰', '⋭', '¤', '스', '\u2066', '\u3000', '름', '🕘', '姿', '⒏', '©', '✨', '🔪', 'ð', '🥒', 'ｍ', 'ョ', '널', '👆', '儿', '하', '⏱', 'ｕ', '🎼', 'え', '😞', '핸', 'ˡ', '넷', '즈', 'É', '✌', '🌀', '♂', '►', '✂', '手', '🐍', '✏', '🐆', 'ʰ', '￼', '↔', '♏', '타', '📢', 'Ö', '◈', '🥓', '수', '♻', '🍀', '🏚', '🍨', '출', '♬', '➤', '🎨', '🐭', '๐', 'น', 'д', '고', '🏠', 'ب', '🔉', '⒍', '🌏', 'グ', '画', '♯', '림',

1861

**Part 2**

Write a function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) whose keys are all the relevant sequences of n-1 tokens, and whose values are dictionaries mapping each possible n-th token to its probability of occurring. For example, a trigram model (tokens are characters) should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

This means, for example, that after the sequence "ab" there is a 0.5 chance that "c" will appear, a 0.25 chance that "b" will appear, and a 0.25 chance that "d" will appear.

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
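
One possible way to approach this, sketched on a toy corpus: pad each text with n-1 start tokens and one end token, count contexts and followers, and with add-one smoothing add 1 to every count and the vocabulary size to every context total. The function name `toy_ngram_probs`, the tuple-shaped keys, and the reserved `"<unseen>"` entry are assumptions made for this sketch. They are not required by the assignment and differ from the string keys shown in the example above.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # boundary tokens; the names are an assumption


def toy_ngram_probs(n, vocabulary, texts, add_one):
    """Estimate P(token | previous n-1 tokens) from a list of strings."""
    context_counts = Counter()
    ngram_counts = defaultdict(Counter)
    for text in texts:
        # Pad so the first character and the end marker both get a full context.
        tokens = [START] * (n - 1) + list(text) + [END]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            context_counts[context] += 1
            ngram_counts[context][tokens[i]] += 1

    k = 1 if add_one else 0
    probs = {}
    for context, followers in ngram_counts.items():
        denom = context_counts[context] + k * len(vocabulary)
        probs[context] = {tok: (c + k) / denom for tok, c in followers.items()}
        if add_one:
            # Hypothetical reserved key: the smoothed probability shared by
            # every vocabulary token never seen after this context.
            probs[context]["<unseen>"] = 1 / denom
    return probs


toy_vocab = ["a", "b", "c", START, END]
toy_model = toy_ngram_probs(2, toy_vocab, ["ab", "ac"], add_one=True)
print(toy_model[("a",)])
```

Storing the shared "unseen" probability per context is one way to carry the smoothing information inside the dictionary without listing every vocabulary token under every context.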

In [None]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  base_counter = Counter()              # how often each (n-1)-character context occurs
  model_counter = defaultdict(Counter)  # context -> counts of the character that follows it

  for tweet in pd.read_csv(data_file_path)['tweet_text']:
    for window_start in range(len(tweet) - n + 1):
      context = tweet[window_start : window_start + n-1]
      base_counter[context] += 1
      model_counter[context][tweet[window_start + n-1]] += 1

    # Handle start and end tokens. A unigram model has no context, and a tweet
    # shorter than n characters has no full window, so both cases are skipped
    # (for n == 1 the slice tweet[-(n-1):] would otherwise be the whole tweet).
    if n > 1 and len(tweet) >= n:
      start_context = '<start>' + tweet[:n-1]
      end_context = tweet[-(n-1):]
      base_counter[start_context] += 1
      base_counter[end_context] += 1
      model_counter[start_context][tweet[n-1]] += 1
      model_counter[end_context]['<end>'] += 1

  # Any (context, character) pair that is not stored explicitly defaults to a uniform 1/|V|.
  model = defaultdict(lambda: defaultdict(lambda: 1 / (len(vocabulary))))
  int_add_one = int(add_one)
  for k, count_dict in model_counter.items():
    for letter, count in count_dict.items():
      # Add-one smoothing adds 1 to each count and |V| to the context total.
      model[k][letter] = (count + int_add_one) / (int_add_one * len(vocabulary) + base_counter[k])

  return model
  

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
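
For reference, perplexity can be computed as 2 raised to the average negative log2 probability that the model assigns to each token. The sketch below shows that arithmetic on a hypothetical list of per-token probabilities. The function name `perplexity` and the numbers are assumptions for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence, given the model probability of each token."""
    total_neg_log2 = -sum(math.log2(p) for p in probabilities)
    return 2 ** (total_neg_log2 / len(probabilities))


# Hypothetical per-token probabilities assigned by some model.
toy_probs = [0.5, 0.25, 0.25, 0.5]
print(perplexity(toy_probs))
```

As a sanity check, a model that assigns probability 1/4 to every token has a perplexity of exactly 4, which matches the intuition of choosing uniformly among four options.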

In [None]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for

  missing_value = 1e-10  # probability used for events that are not stored in the model
  counter = 0            # number of characters scored
  entropy = 0            # accumulated negative log2 probability

  for tweet in pd.read_csv(data_file)['tweet_text']:

    # Score the first n-1 characters, padding the context with '<start>' tokens.
    for window_start in range(min(n-1, len(tweet))):
      counter += 1
      key = '<start>' * (n - window_start - 1) + tweet[0 : window_start]
      value = tweet[window_start]
      # .get avoids inserting empty entries into the defaultdict-based model.
      entropy -= math.log(model.get(key, {}).get(value, missing_value), 2)

    # Score later characters given the n-1 characters before them
    # (the loop bound len(tweet) - n stops before the tweet's final character).
    for window_start in range(len(tweet) - n):
      counter += 1
      key = tweet[window_start : window_start + n-1]
      value = tweet[window_start + n-1]
      entropy -= math.log(model.get(key, {}).get(value, missing_value), 2)

  # Perplexity is 2 raised to the average negative log2 probability per character.
  average_entropy = entropy / counter
  return 2 ** average_entropy

In [None]:
test = lm(4, VOCAB, 'nlp-course/lm-languages-data-new/en.csv', False)

In [None]:
print(eval(4, test, 'nlp-course/lm-languages-data-new/en.csv'))

8.987466884050361


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity values.

In [None]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not

  languages = ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
  languages_dict = {language: {} for language in languages}

  for first_language in languages:
    # Train a model on first_language, then score it on every language's data file.
    model = lm(n, VOCAB, f'nlp-course/lm-languages-data-new/{first_language}.csv', add_one)
    for second_language in languages:
      languages_dict[first_language][second_language] = eval(n, model, f'nlp-course/lm-languages-data-new/{second_language}.csv')

  # Outer keys become columns (the model's language); inner keys become rows (the data's language).
  return pd.DataFrame(languages_dict)
  

In [None]:
match_test = match(3, True)


In [None]:
match_test

| | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 40.253868 | 516.583628 | 334.959341 | 362.111025 | 475.023801 | 278.797384 | 724.04954 | 300.418446 |
| es | 433.728268 | 37.952644 | 341.519061 | 552.341275 | 318.520335 | 545.872831 | 312.157014 | 465.385013 |
| fr | 666.996927 | 620.862665 | 36.805567 | 1056.991443 | 601.10367 | 637.500461 | 870.918338 | 1160.714746 |
| in | 593.823117 | 1037.160205 | 739.546089 | 49.244202 | 1147.266868 | 582.346619 | 1556.972414 | 379.458207 |
| it | 363.598395 | 288.43177 | 309.386474 | 490.782617 | 39.837977 | 492.727643 | 437.195181 | 438.828412 |
| nl | 507.574421 | 1019.1203 | 609.696227 | 561.969101 | 1019.927602 | 41.368104 | 1435.127252 | 754.301624 |
| pt | 688.170684 | 352.669549 | 517.184366 | 800.612287 | 469.329433 | 795.007401 | 44.750227 | 737.169596 |
| tl | 510.545029 | 847.447957 | 782.939894 | 408.184361 | 783.027443 | 629.579965 | 1129.992814 | 49.39611 |


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
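def run_match():
  # Lazily yield one perplexity table per (n, add_one) combination,
  # in the order n = 1..4, with add_one=True before add_one=False.
  for n in range(1,5):
    for add_one in [True, False]:
      yield match(n, add_one)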

results = run_match()

In [None]:
print("n=1, add_one=True")
next(results)

| | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 36.426588 | 40.118245 | 39.760601 | 40.558101 | 39.690601 | 39.017677 | 40.594601 | 40.402655 |
| es | 40.35626 | 34.129104 | 37.989651 | 45.111067 | 36.998969 | 39.199164 | 35.548108 | 40.983249 |
| fr | 42.965259 | 40.193906 | 35.554034 | 46.455196 | 39.400623 | 40.240004 | 39.432853 | 47.070377 |
| in | 39.650831 | 41.808469 | 42.542783 | 35.258106 | 41.672427 | 39.774145 | 41.130941 | 37.399217 |
| it | 39.870326 | 38.905755 | 38.063313 | 42.355726 | 35.559022 | 39.354752 | 39.422394 | 41.816834 |
| nl | 37.743448 | 39.61375 | 38.966779 | 39.901005 | 39.164717 | 35.455641 | 39.7133 | 40.822434 |
| pt | 44.494676 | 39.979636 | 40.862122 | 48.177157 | 41.600593 | 42.488105 | 34.85648 | 43.912971 |
| tl | 42.689592 | 45.228811 | 46.998144 | 40.632919 | 44.406457 | 44.367921 | 45.144641 | 38.457511 |


In [None]:
print("n=1, add_one=False")
next(results)

n=1, add_one=False


| | en | es | fr | in | it | nl | pt | tl |
|---|---|---|---|---|---|---|---|---|
| en | 36.371757 | 40.059535 | 39.702794 | 40.492511 | 39.622698 | 38.958308 | 40.517423 | 40.316823 |
| es | 40.43808 | 34.070634 | 38.162037 | 45.074138 | 36.969993 | 39.20074 | 35.48364 | 40.990842 |
| fr | 42.959407 | 40.172122 | 35.501821 | 46.730897 | 39.333671 | 40.226404 | 39.36437 | 47.123265 |
| in | 39.590656 | 41.739211 | 42.484991 | 35.198957 | 41.600228 | 39.712417 | 41.04472 | 37.320201 |
| it | 39.809738 | 38.858505 | 38.006469 | 42.30894 | 35.492806 | 39.305264 | 39.340681 | 41.738176 |
| nl | 37.687547 | 39.543318 | 38.906589 | 39.835003 | 39.093587 | 35.395865 | 39.623739 | 40.735282 |
| pt | 44.575706 | 39.926568 | 40.928724 | 48.238234 | 41.553975 | 42.473191 | 34.780559 | 44.019612 |
| tl | 42.628175 | 45.157594 | 46.940896 | 40.570977 | 44.337482 | 44.306697 | 45.054936 | 38.382964 |


In [None]:
print("n=2, add_one=True")
next(results)

In [None]:
print("n=2, add_one=False")
next(results)

In [None]:
print("n=3, add_one=True")
next(results)

In [None]:
print("n=3, add_one=False")
next(results)

In [None]:
print("n=4, add_one=True")
next(results)

In [None]:
print("n=4, add_one=False")
next(results)

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, and different solutions will yield different accuracy scores. Any solution that is not trivial (e.g., returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good or creative solutions.
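
One non-trivial approach, sketched here under simplifying assumptions, is to score the sentence under every language's model and pick the language with the lowest average negative log2 probability. This is equivalent to picking the lowest perplexity, since 2 to the power of x is monotonic. The toy unigram models, the language codes `xx` and `yy`, and the helper names below are hypothetical. Real models would come from the training files.

```python
import math


def avg_neg_log2(text, unigram_probs, floor=1e-10):
    """Average negative log2 probability per character under a unigram model."""
    # Characters missing from the model get a small floor probability.
    total = -sum(math.log2(unigram_probs.get(ch, floor)) for ch in text)
    return total / max(len(text), 1)  # guard against empty input


def pick_language(text, models):
    """Return the language whose model is least surprised by the text."""
    return min(models, key=lambda lang: avg_neg_log2(text, models[lang]))


# Hypothetical toy unigram models for two made-up languages.
toy_models = {
    "xx": {"a": 0.7, "b": 0.2, "c": 0.1},
    "yy": {"a": 0.1, "b": 0.2, "c": 0.7},
}
print(pick_language("aab", toy_models))
print(pick_language("ccb", toy_models))
```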

In [None]:
# Hyperparameters
n_for_classify = 2

In [None]:
# Train one model per language (without add-one smoothing).
all_language_models = {
  language: lm(n_for_classify, VOCAB, f'nlp-course/lm-languages-data-new/{language}.csv', False)
  for language in ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
}

In [None]:
def classify():
  n = n_for_classify
  missing_value = 1e-10  # probability used for events that are not stored in a model
  df = pd.read_csv('nlp-course/lm-languages-data-new/test.csv')

  for row_index, row in df.iterrows():
    tweet = row['tweet_text']
    best_score = float('inf')
    best_lang = None
    for lang, model in all_language_models.items():
      entropy = 0
      counter = 0
      for window_start in range(len(tweet) - n):
        counter += 1
        key = tweet[window_start : window_start + n-1]
        value = tweet[window_start + n-1]
        # .get avoids inserting empty entries into the defaultdict-based model.
        entropy -= math.log(model.get(key, {}).get(value, missing_value), 2)

      # Perplexity of the tweet under this language's model; lower is better.
      # A tweet too short to score (counter == 0) is left at infinity.
      perplexity = 2 ** (entropy / counter) if counter else float('inf')
      if perplexity < best_score:
        best_lang = lang
        best_score = perplexity

    df.at[row_index, 'Prediction'] = best_lang

  return df
      

clasification_result = classify()

In [None]:
clasification_result 

| | tweet_id | tweet_text | label | Prediction |
|---|---|---|---|---|
| 0 | 845394879479996416 | RT @jarsofshine: In 08 I had a volunteer who h... | en | en |
| 1 | 836313846675619841 | IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa... | it | it |
| 2 | 836259442328940544 | @jaynaldmase @acobasilianne @dingDANGdantes @d... | tl | tl |
| 3 | 847729104472358912 | Daags voor @RondeVlaanderen, @VoltaClassic als... | nl | nl |
| 4 | 836491739699412992 | RT @ertsul20: Susuportahan kita hanggang sa du... | tl | tl |
| ... | ... | ... | ... | ... |
| 7994 | 836250659464761344 | La triste historia que inspiró "Tu falta de qu... | es | es |
| 7995 | 847676283089637380 | RT @ShahwalAdli_: Aku tak bersuara tak bermakn... | in | in |
| 7996 | 836319299279138816 | @Benji_Mascolo DEVI TAGLIARE QUEI CAPELLI 😠😡😠😂❤ | it | pt |
| 7997 | 836258179847716865 | Assistimos de camarote varias brigas ontem! | pt | pt |


**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
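
To make the metric concrete, the sketch below computes a support-weighted F1 score by hand: one F1 score per class, averaged with weights proportional to the number of true examples of that class. This is intended to mirror what `f1_score` computes with `average='weighted'`. The function name `weighted_f1` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        # F1 is the harmonic mean of precision and recall (0 when there are no hits).
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical gold labels and predictions.
toy_true = ["en", "en", "es", "es"]
toy_pred = ["en", "es", "es", "es"]
print(weighted_f1(toy_true, toy_pred))
```

In this toy case the `en` class has precision 1 and recall 0.5, while `es` has precision 2/3 and recall 1, and both classes carry equal weight because each has two true examples.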


In [None]:
def calc_f1(result):
  # Weighted F1: per-language F1 scores, averaged by the number of true examples of each language.
  return f1_score(result['label'].tolist(), result['Prediction'].tolist(), average = 'weighted')


calc_f1(clasification_result)

0.8634128348679359

# **Good luck!**