# Set up

In [93]:
!git clone https://github.com/NLP-Reichman/assignment_1.git
!mv assignment_1/data data
!rm assignment_1/ -r

'git' is not recognized as an internal or external command,
operable program or batch file.
'mv' is not recognized as an internal or external command,
operable program or batch file.
'rm' is not recognized as an internal or external command,
operable program or batch file.


# Introduction
In this assignment you will be creating tools for learning and testing language models. The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.
The relevant files are under the data folder:

- en.csv (or the equivalent JSON file)
- es.csv (or the equivalent JSON file)
- fr.csv (or the equivalent JSON file)
- in.csv (or the equivalent JSON file)
- it.csv (or the equivalent JSON file)
- nl.csv (or the equivalent JSON file)
- pt.csv (or the equivalent JSON file)
- tl.csv (or the equivalent JSON file)

In [94]:
import json
# from google.colab import files
import pandas as pd

# Implementation

## Part 1
Implement the function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. Our token definition is a single UTF-8 encoded character. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

Note - do NOT lowercase the sentences in whi HW.

In [95]:
import os
start_token = '\uE002'  # Using Private Use Area character
end_token = '\uE003'    # Using Private Use Area character
def preprocess() -> list[str]:
  '''
  Return a list of characters, representing the shared vocabulary of all languages
  '''
  data_folder_path = "./data"

  unique_characters = set()
  # Iterate through all files in the data folder
  for filename in os.listdir(data_folder_path):

      # Check if the file is a JSON file
      if filename.endswith(".json"):
          # print(filename)
          # Get the full path to the file
          file_path = os.path.join(data_folder_path, filename)

          # Open the file and read its contents
          with open(file_path, "r") as f:
              json_data = json.loads(f.read())
            #   print(json_data)
              for key, val in json_data["tweet_text"].items():
                #   print(key, val)
                  for c in val:
                    unique_characters.add(c) # Update the hash with the token in utf-8 encoding

                      

          # Process the JSON data as needed
          # (e.g., parse it into a Python dictionary)
          # ...
  # add the 2 special chars to the set:               
  unique_characters.add(start_token)
  unique_characters.add(end_token)
  return list(unique_characters)




In [96]:
vocabulary = preprocess()
vocabulary_length = len(vocabulary)
print(f"Vocabulary length: {vocabulary_length}")

Vocabulary length: 1804


## Part 2
Implement the function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant *n*-1 sequences, and the values are dictionaries with the *n*_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{ "ab":{"c":0.5, "b":0.25, "d":0.25}, "ca":{"a":0.2, "b":0.7, "d":0.1} }

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.

In [97]:


def lm(lang: str, n: int, smoothed=True) -> dict[str, dict[str, float]]:
    '''
    Return a language model for the given lang and n_gram (n)
    :param lang: the language of the model
    :param n: the n_gram value
    :return: a dictionary where the keys are n_grams and the values are dictionaries
              with the n_th tokens and their corresponding probabilities to occur
    '''
    model = {}
    data_folder_path = "./data"
    filename = f"{lang}.json"
    file_path = os.path.join(data_folder_path, filename)
    
    # Open the file and read its contents
    with open(file_path, "r") as f:
        json_data = json.loads(f.read())
        # Iterate through each sentence in the corpus
        for key, sentence in json_data["tweet_text"].items():
            # Iterate through each n-gram in the sentence
            # add n-1 start tokens and n-1 end toklens to the sentence:
            sentence = start_token * (n-1) + sentence + end_token
          
            for i in range(len(sentence) - n + 1):
                ngram = sentence[i:i+n-1]
                next_token = sentence[i+n-1]

                # Check if the n-gram is already in the model
                if ngram not in model:
                    model[ngram] = {}
                # Check if the next token is already in the n-gram's dictionary
                if next_token not in model[ngram]:
                    if smoothed:
                        model[ngram][next_token] = 1
                    else:
                        model[ngram][next_token] = 0
                model[ngram][next_token] += 1
                    
    
    # Calculate the probabilities for each n-gram
    for ngram in model:
        if smoothed:
            # add all unseen chars from the vocabulary to the model, with a value of 1 (smoothing):
            for char in vocabulary:
                if char not in model[ngram]:
                    model[ngram][char] = 1
        

        total_count = sum(model[ngram].values())
        for token in model[ngram]:
            model[ngram][token] /= total_count
    
    return model


In [98]:
res = lm("en",2)
print(f'unsmoothed len res: {len(res)}')

res = lm("en",2, smoothed=True)
print(f'smoothed len res: {len(res)}')

unsmoothed len res: 747
smoothed len res: 747


## Part 3
Implement the function *eval* that returns the perplexity of a model (dictionary) running over the data file of the given target language.

In [99]:
import math
def eval(model: dict, target_lang: str) -> float:
  '''
  Return the perplexity value calculated over applying the model on the text file
  of the target_lang language.
  :param model: the language model
  :param target_lang: the target language
  :return: the perplexity value
  '''
  n = len(list(model.keys())[0]) + 1

  data_folder_path = "./data"
  filename = f"{target_lang}.json"
  file_path = os.path.join(data_folder_path, filename)
  perplexity = 0
  chars_counter = 0
  # Open the file and read its contents
  with open(file_path, "r") as f:
      json_data = json.loads(f.read())
      # Iterate through each sentence in the corpus
      perplexities = []
      for key, sentence in json_data["tweet_text"].items():
        sentence = start_token * (n-1) + sentence + end_token
        sentence_enthropy = 0
        for i in range(len(sentence) - n + 1):
            ngram = sentence[i:i+n-1]
            next_token = sentence[i+n-1]
            chars_counter += 1
            if ngram in model:
                if next_token in model[ngram]:
                    sentence_enthropy +=  ( -1 * math.log2(model[ngram][next_token]))
                else:
                    # throw an exception if the token is not in the model's keys
                    raise Exception(f"{next_token} NOT IN MODEL for {ngram}")
                    # sentence_enthropy +=  ( -1 * math.log2(model[ngram]['unknown']))
            else:
                #raise an exception if the ngram is not in the model's keys
                # raise Exception("NOT IN MODEL")
                sentence_enthropy += -1 * math.log2(1/vocabulary_length) # # todo  review
              
            perplexities.append(2 ** (sentence_enthropy/(len(sentence) - n + 1)))
            # perplexities.append( (sentence_enthropy/(len(sentence) - n + 1)))
#   return 2 ** (perplexity / chars_counter)
# print and return the average perplexiry of all sentences:
  res =  sum(perplexities) / len(perplexities)
  # res =  2 ** (sum(perplexities) / len(perplexities))
  
#   print(f'perplexity: {res}')
  return res
  


In [100]:
# for n in range(1, 5):
#     model = lm("en", n, smoothed=True)
#     perplexity =  eval(model, "fr")
#     print(f"{n} : Perplexity: {perplexity}")

## Part 4
Implement the *match* function that calls *eval* using a specific value of *n* for every possible language pair among the languages we have data for. You should call *eval* for every language pair four times, with each call assign a different value for *n* (1-4). Each language pair is composed of the source language and the target language. Before you make the call, you need to call the *lm* function to create the language model for the source language. Then you can call *eval* with the language model and the target language. The function should return a pandas DataFrame with the following four columns: *source_lang*, *target_lang*, *n*, *perplexity*. The values for the first two columns are the two-letter language codes. The value for *n* is the *n* you use for generating the specific perplexity values which you should store in the forth column.

In [101]:
def match() -> pd.DataFrame:
  '''
  Return a DataFrame containing one line per every language pair and n_gram.
  Each line will contain the perplexity calculated when applying the language model
  of the source language on the text of the target language.
  :return: a DataFrame containing the perplexity values
  '''
  perplexity_values = []
  for n in range(1, 5): # TODO - CHANGE BACK TO 5
  # for n in range(1, 4):
    for model_lang in ["en", "es", "fr", "in", "it", "nl", "pt", "tl"]:
      model = lm(model_lang, n, True)

      for lang in ["en", "es", "fr", "in", "it", "nl", "pt", "tl"]:
        perplexity = eval(model, lang)
        perplexity_values.append({"source": model_lang, "target": lang, "n": n, "perplexity": perplexity})
        print(f'source: {model_lang}, target: {lang}, n: {n}, perplexity: {perplexity}')
  return pd.DataFrame(perplexity_values)

# df = match()

In [102]:
# print(df[3]['en'])

## Part 5
Implement the *generate* function which takes a language code, *n*, the prompt (the starting text), the number of tokens to generate, and *r*, which is the random seed for any randomized action you plan to take in your implementation. The function should start generating tokens, one by one, using the language model of the given source language and *n*. The prompt should be used as a starting point for aligning on the probabilities to be used for generating the next token.

Note - The generation of the next token should be from the LM's distribution.

In [120]:
import random
models_1 = {}
# for each language create a model with n=1, no smoothing:
for lang in ["en", "es", "fr", "in", "it", "nl", "pt", "tl"]:
  models_1[lang] = lm(lang, 1, False)

def get_next_token(model, ngram, lang):
  '''
  Return the next token based on the given model, ngram, and random value from 0 to 1.
  '''
  # iterate over the model[ngram] and sum the probabilities until the sum is greater than rand:
  sum = 0
  # get a random number between 0 to 1:
  rand = random.random()
  try:
    for token, prob in model[ngram].items():
      sum += prob
      if sum > rand:
        return token
  except:
    # reutrn the most common token in language if ngram is not in the model:
    return get_next_token(models_1[lang], '', lang)


  
  # for token, prob in model[ngram].items():
  #   sum += prob
  #   if sum > rand:
  #     print(f'ngram {ngram}, rand {rand}, sum {sum}, token {token}, prob {prob}')
  #     return token

def generate(lang: str, n: int, prompt: str, number_of_tokens: int, r: int) -> str:
  '''
  Generate text in the given language using the given parameters.
  :param lang: the language of the model
  :param n: the n_gram value
  :param prompt: the prompt to start the generation
  :param number_of_tokens: the number of tokens to generate
  :param r: the random seed to use
  '''
  # initialize random seed with r
  random.seed(r)
  model = lm(lang, n)
  generated_text = prompt
  for i in range(number_of_tokens):
    ngram = generated_text[-n+1:] # TODO handle the case of shorter prompts
    # get next token based on the model and the dustribution in model[ngram] . use the random seed
    next_token = get_next_token(model, ngram, lang)
    # max(model[ngram], key=model[ngram].get)
    generated_text += next_token
  return generated_text

print(generate("en", 3, "I am ", 20, 5))

I am （gpᴰehᵃns👲lI▝gt💧.l🌟o


## Part 6
Play with your generate function, try to generate different texts in different language and various values of *n*. No need to submit anything of that.

In [104]:
# print(generate("fr", 2, "je suis ", 20, 5))
# print(generate("fr", 2, "je suis ", 20, 5))
# print(generate("fr", 2, "je suis", 20, 5))

# print(generate("en", 2, "I am ", 20, 5))
# print(generate("en", 2, "I am", 20, 5))

# Testing

Copy the content of the **tests.py** file from the repo and paste below. This will create the results.json file and download it to your machine.

In [105]:
####################
# PLACE TESTS HERE #
# Create tests
def test_preprocess():
    return {
        'vocab_length': len(preprocess()),
    }

def test_lm():
    return {
        'english_2_gram_length': len(lm('en', 2)),
        'english_3_gram_length': len(lm('en', 3)),
        'french_3_gram_length': len(lm('fr', 3)),
        'spanish_3_gram_length': len(lm('es', 3)),
    }

def test_eval():
    return {
        'english_on_english': round(eval(lm('en', 3), 'en'), 2),
        'english_on_french': round(eval(lm('en', 3), 'fr'), 2),
        'english_on_spanish': round(eval(lm('en', 3), 'es'), 2),
    }

def test_match():
    df = match()
    return {
        'df_shape': df.shape,
        'en_en_1': df[(df['source'] == 'en') & (df['target'] == 'en') & (df['n'] == 1)]['perplexity'].values[0],
        'tl_tl_1': df[(df['source'] == 'tl') & (df['target'] == 'tl') & (df['n'] == 1)]['perplexity'].values[0],
        'tl_nl_4': df[(df['source'] == 'tl') & (df['target'] == 'nl') & (df['n'] == 4)]['perplexity'].values[0],
    }

def test_generate():
    return {
        'english_2_gram': generate('en', 2, "I am", 20, 5),
        'english_3_gram': generate('en', 3, "I am", 20, 5),
        'english_4_gram': generate('en', 4, "I Love", 20, 5),
        'spanish_2_gram': generate('es', 2, "Soy", 20, 5),
        'spanish_3_gram': generate('es', 3, "Soy", 20, 5),
        'french_2_gram': generate('fr', 2, "Je suis", 20, 5),
        'french_3_gram': generate('fr', 3, "Je suis", 20, 5),
    }

TESTS = [test_preprocess, test_lm, test_eval, test_match, test_generate]

# Run tests and save results
res = {}
for test in TESTS:
    try:
        cur_res = test()
        res.update({test.__name__: cur_res})
    except Exception as e:
        res.update({test.__name__: repr(e)})

with open('results.json', 'w') as f:
    json.dump(res, f, indent=2)

# Download the results.json file
files.download('results.json')


####################

source: en, target: en, n: 1, perplexity: 10.379771316004877
source: en, target: es, n: 1, perplexity: 11.73145556171549
source: en, target: fr, n: 1, perplexity: 11.15580476665034
source: en, target: in, n: 1, perplexity: 14.257921313639697
source: en, target: it, n: 1, perplexity: 10.78949917502809
source: en, target: nl, n: 1, perplexity: 12.419028033220602
source: en, target: pt, n: 1, perplexity: 12.296024540901128
source: en, target: tl, n: 1, perplexity: 13.043779385721933
source: es, target: en, n: 1, perplexity: 11.907227089082587
source: es, target: es, n: 1, perplexity: 9.992726227500581
source: es, target: fr, n: 1, perplexity: 10.782284239783385
source: es, target: in, n: 1, perplexity: 14.419022239565868
source: es, target: it, n: 1, perplexity: 10.762838452455565
source: es, target: nl, n: 1, perplexity: 14.205356329587623
source: es, target: pt, n: 1, perplexity: 11.870708451710053
source: es, target: tl, n: 1, perplexity: 14.396720799140892
source: fr, target: en, n: 1

NameError: name 'files' is not defined

In [None]:
# Show the local files, results.json should be there now and
# also downloaded to your local machine
!ls -l

'ls' is not recognized as an internal or external command,
operable program or batch file.


In [None]:
# temp = lm("en", 4)
# print

KeyboardInterrupt: 