<a href="https://colab.research.google.com/github/AdamVinestock/NLP/blob/main/NLP_Language_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Models
In this notebook we will be creating tools for learning and testing language models.
The corpora that we will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.

*As a preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [1]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.[K
remote: Counting objects: 100% (71/71), done.[K
remote: Compressing objects: 100% (57/57), done.[K
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0[K
Receiving objects: 100% (71/71), 11.28 MiB | 13.97 MiB/s, done.
Resolving deltas: 100% (29/29), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [2]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

We will write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [3]:
import csv
import math

en_file = open('/content/nlp-course/lm-languages-data-new/en.csv')
es_file = open('/content/nlp-course/lm-languages-data-new/es.csv')
fr_file = open('/content/nlp-course/lm-languages-data-new/fr.csv')
in_file = open('/content/nlp-course/lm-languages-data-new/in.csv')
it_file = open('/content/nlp-course/lm-languages-data-new/it.csv')
nl_file = open('/content/nlp-course/lm-languages-data-new/nl.csv')
pt_file = open('/content/nlp-course/lm-languages-data-new/pt.csv')
tl_file = open('/content/nlp-course/lm-languages-data-new/tl.csv')

csv_files = ['/content/nlp-course/lm-languages-data-new/en.csv',
            '/content/nlp-course/lm-languages-data-new/es.csv',
            '/content/nlp-course/lm-languages-data-new/fr.csv',
            '/content/nlp-course/lm-languages-data-new/in.csv',
            '/content/nlp-course/lm-languages-data-new/it.csv',
            '/content/nlp-course/lm-languages-data-new/nl.csv',
            '/content/nlp-course/lm-languages-data-new/pt.csv',
            '/content/nlp-course/lm-languages-data-new/tl.csv']

In [4]:
def calc_vocab(csv_files):
  vocab = [['<s>'],['<e>']]
  for path in csv_files:
    with open(path, 'r', newline='', encoding='utf-8') as csv_file:
        reader = csv.reader(csv_file)
        # Pad each tweet with start and end symbols given n
        for row in reader:
          for char in row[1]:
              if [char] not in vocab:
                vocab.append([char])
  return vocab

In [5]:
vocab = calc_vocab(csv_files)

**Part 2**

Now we will write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - We should think how to add the add_one smoothing information to the dictionary and implement it.

In [6]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)
  # initialize the LM dictionary
  lm = {}

  with open(data_file_path, 'r', newline='', encoding='utf-8') as csv_file:

    reader = csv.reader(csv_file)
    updated_rows = []

    # Pad each tweet with start and end symbols given n
    for row in reader:
      if row[1][0:2] == 'RT':                       # Clearing the "RT" string (respond tweet symbol) that is very abundant in the tweets
        colon_start_index = row[1].find(':') + 2
        row[1] = [vocabulary[0]] * (n - 1) + [[c] for c in row[1][colon_start_index:]] + [vocabulary[1]]
      else:
        row[1] = [vocabulary[0]] * (n - 1) + [[c] for c in row[1]] + [vocabulary[1]]
      updated_rows.append(row)

    csv_file.seek(0)  #<-- set the iterator to beginning of the input file

    # iterate over all n-grams in the text
    for row in updated_rows:
      for i in range(len(row[1])-n+1):
        ngram = ''.join([elem[0] for elem in row[1][i:i+n-1]])
        next_token = row[1][i+n-1][0]
        if ngram not in lm:
          lm[ngram] = {}
          lm[ngram][next_token] = 1
        else:
          if next_token not in  lm[ngram]:
            lm[ngram][next_token] = 1
          else:
            lm[ngram][next_token] += 1

    # normalize the probabilities
    if add_one:
      for ngram in lm:
        total_count = sum(lm[ngram].values())
        for token in lm[ngram]:
          lm[ngram][token] +=1
          lm[ngram][token] /= (total_count+len(vocabulary))
        lm[ngram]['unseen'] = 1 / len(vocabulary)
    else:
      for ngram in lm:
        lm[ngram]['unseen'] = 1e-7
        total_count = sum(lm[ngram].values())
        for token in lm[ngram]:
          lm[ngram][token] /= total_count
  return lm

**Part 3**

Here we write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

In [7]:
def evaluate(n, model, data_file_path):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to claculate a perplexity score for
  perplexity = 0

  with open(data_file_path, 'r', newline='', encoding='utf-8') as csv_file:
    reader = csv.reader(csv_file)
    # Pad each tweet with start and end symbols given n
    start_symbol = '<s>'
    end_symbol = '<e>'
    for row in reader:
      if row[1][0:2] == 'RT':                 # Clearing the "RT" string (respond tweet symbol) that is very abundant in the tweets
        colon_index = row[1].find(':') + 2
        row[1] = [start_symbol] * (n - 1) + [[c] for c in row[1][colon_index:]] + [end_symbol]
      else:
        row[1] = [start_symbol] * (n - 1) + [[c] for c in row[1]] + [end_symbol]

    csv_file.seek(0)  #<-- set the iterator to beginning of the input file

    # Calculate the perplexity of the model on the sequence of tokens
    log_prob = 0
    N = 0
    for tweet in reader:
      for i in range(n - 1, len(tweet[1])):
        prefix = tweet[1][i - n + 1:i]
        token = tweet[1][i]
        if prefix in model:
          if token in model[prefix]:
            log_prob += math.log(model[prefix][token], 2)
          else:
            log_prob += math.log(model[prefix]['unseen'], 2)
          N += 1
        else:
          log_prob += math.log((1e-7), 2)  # Smoothing for unknown tokens
          N += 1

    perplexity = 2 ** (-log_prob / (N - n + 1))
  return perplexity

In [8]:
model_en = lm(4, vocab, csv_files[0], add_one=False)
perplexity = evaluate(4, model_en, csv_files[0])
print(perplexity)

8.488455023700967


**Part 4**

Now we write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

We will save the dataframe to a CSV.

In [9]:
import pandas as pd

def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not
  # preprocessing test set:

  csv_paths = ['/content/nlp-course/lm-languages-data-new/en.csv',
            '/content/nlp-course/lm-languages-data-new/es.csv',
            '/content/nlp-course/lm-languages-data-new/fr.csv',
            '/content/nlp-course/lm-languages-data-new/in.csv',
            '/content/nlp-course/lm-languages-data-new/it.csv',
            '/content/nlp-course/lm-languages-data-new/nl.csv',
            '/content/nlp-course/lm-languages-data-new/pt.csv',
            '/content/nlp-course/lm-languages-data-new/tl.csv']
  models = []
  vocab = calc_vocab(csv_paths)
  data = [ [] for i in range(8) ]   # ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
  for i in range(8):
    cur_model = lm(n, vocab, csv_paths[i], add_one)
    models.append(cur_model)
    for j, path in enumerate (csv_paths):
      cur_perplexity = evaluate(n, cur_model, path)
      data[j].append(cur_perplexity)

  # Creates pandas DataFrame.
  df = pd.DataFrame(data, columns = ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl'],
                    index=['en_model',
                           'es_model',
                           'fr_model',
                           'in_model',
                           'it_model',
                           'nl_model',
                           'pt_model',
                           'tl_model'])

  return df

In [10]:
df = match(3,False)
print(df)
df.to_csv('part4.csv')

                  en          es          fr          in          it  \
en_model   10.921999  117.729975   81.424843   77.802859  105.134105   
es_model  111.415093   10.430856   89.893183  115.334352   79.429330   
fr_model  156.772223  150.140266   10.258980  198.968866  136.237585   
in_model  152.079900  235.264782  167.552912   11.236029  238.260683   
it_model   90.942859   70.817712   76.159034  101.551387    9.872314   
nl_model  138.161507  232.550872  150.825478  124.758122  224.727889   
pt_model  154.440048   80.220956  119.642507  143.885979   99.142703   
tl_model  125.758773  172.312157  153.489255   82.925432  150.868592   

                  nl          pt          tl  
en_model   63.358218  154.734100   65.383332  
es_model  116.115543   79.403854   98.254611  
fr_model  132.113648  200.924519  215.240327  
in_model  126.007682  328.942501   85.703104  
it_model  103.388319  106.398754   90.691757  
nl_model    9.703911  319.486263  155.074650  
pt_model  148.746978  

**Part 5**

Now we run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

We will load each result to a dataframe and save to a CSV.

In [11]:
def run_match():
  for n in range(1,5):
    file_name = 'part5.csv'
    file_name_no_addone = 'no_addone_part5.csv'
    df_True = match(n,True)
    print(f'DataFrame of perplexity for n={n}, add_one =True:')
    print(df_True)
    df_True.to_csv(file_name)
    df_False = match(n,False)
    print(f'DataFrame of perplexity for n={n}, add_one =False:')
    print(df_False)
    df_False.to_csv(file_name_no_addone)


In [12]:
run_match()

DataFrame of perplexity for n=1, add_one =True:
                 en         es         fr         in         it         nl  \
en_model  37.857265  41.344437  40.819809  41.691754  40.504496  39.696012   
es_model  38.798983  35.505579  38.546398  40.065849  37.859017  39.319225   
fr_model  40.416394  39.317337  36.819573  45.188603  39.070071  40.715803   
in_model  40.448256  42.850454  43.616238  36.611741  42.414282  40.354234   
it_model  38.891335  38.844332  38.968632  41.757037  36.847002  39.703772   
nl_model  38.507973  40.347719  39.751017  40.675883  39.874853  36.584532   
pt_model  40.059716  37.270494  38.908847  41.754206  38.576469  40.542756   
tl_model  43.873721  46.687285  48.668691  41.756029  45.501698  45.357936   

                 pt         tl  
en_model  42.024487  41.420112  
es_model  36.794945  41.933332  
fr_model  39.583712  46.071420  
in_model  42.320793  38.048383  
it_model  39.376133  41.439620  
nl_model  40.618860  41.922881  
pt_model  36.49017

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. We will write a function that uses your language models to classify the correct language of each sentence.

In [13]:
import numpy as np

def eval_sentence(n, model, sentence):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to claculate a perplexity score for
  perplexity = 0

  # pad each tweet with start and end symbols given n
  start_symbol = '<s>'
  end_symbol = '<e>'
  processed_sentence = start_symbol*(n-1) + sentence + end_symbol


  # Calculate the perplexity of the model on the sentence
  log_prob = 0
  N = 0

  for i in range(n - 1, len(processed_sentence)):
    prefix = processed_sentence[i - n + 1:i]
    token = processed_sentence[i]
    if prefix in model and token in model[prefix]:
      if model[prefix][token] == 0:
        log_prob += math.log((1e-10), 2)
      else:
        log_prob += math.log(model[prefix][token], 2)
      N += 1
    else:
      log_prob += math.log((1e-10), 2)  # Smoothing for unknown tokens
      N += 1

  perplexity = 2 ** (-log_prob / (N - n + 1))

  return perplexity


def classify(n, models, test_csv):
  reader = csv.reader(test_csv)
  test_csv.seek(0)
  predictions = []
  languages = ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
  for row in reader:
    cur_perp = np.inf
    best_perp = np.inf
    best_language = None
    language_pointer = 0
    for model in models:
      cur_perp = eval_sentence(n, model, row[1])
      if cur_perp < best_perp:
        best_perp = cur_perp
        best_language = language_pointer
      language_pointer += 1
    predictions.append(languages[best_language])
  return predictions

def train_models(n, add_one):
  csv_paths = ['/content/nlp-course/lm-languages-data-new/en.csv',
            '/content/nlp-course/lm-languages-data-new/es.csv',
            '/content/nlp-course/lm-languages-data-new/fr.csv',
            '/content/nlp-course/lm-languages-data-new/in.csv',
            '/content/nlp-course/lm-languages-data-new/it.csv',
            '/content/nlp-course/lm-languages-data-new/nl.csv',
            '/content/nlp-course/lm-languages-data-new/pt.csv',
            '/content/nlp-course/lm-languages-data-new/tl.csv']
  vocab = calc_vocab(csv_paths)


  models = []
  data = [ [] for i in range(8) ]   # ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
  for i in range(8):
    cur_model = lm(n, vocab, csv_paths[i], add_one)
    models.append(cur_model)
  return models

def calc_acc(preds, test_csv):
  reader = csv.reader(test_csv)
  test_csv.seek(0)  #<-- set the iterator to beginning of the input file
  labels = []
  for row in reader:
    labels.append(row[2])
  count = 0
  for i in range(len(labels)-1):
    if preds[i+1] == labels[i+1]:
      count +=1
  return count / (len(labels) -1)

**Part 7**

We will calculate the F1 score of your output from part 6. (hint: we can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

Then we'll load the results to a CSV (using a DataFrame), where the row indicates the F1 results, and the columns indicate the model used.

In [14]:
import sklearn.metrics

def calc_f1(y_true,y_pred):

  #F1 = 2 * (precision * recall) / (precision + recall)
  F1 = sklearn.metrics.f1_score(y_true, y_pred, average = 'micro')

  return F1

In [15]:
F1_list = [ [] for i in range(8) ] # ['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
lm_add_one = []
lm_no_add_one = []

test_csv = open('/content/nlp-course/lm-languages-data-new/test.csv')
reader = csv.reader(test_csv)
test_csv.seek(0)  #<-- set the iterator to beginning of the input file
y_true = []
for row in reader:
  y_true.append(row[2])

for n in range(0,4):
  models_add_one = train_models(n+1, add_one=True)
  lm_add_one.append(models_add_one)
  models_no_add_one = train_models(n+1, add_one=False)
  lm_no_add_one.append(models_add_one)
  test_pred_add_one = classify(n+1, models_add_one, test_csv)
  test_pred_no_add_one = classify(n+1, models_no_add_one, test_csv)

  f1_score_add_one = calc_f1(y_true,test_pred_add_one)
  f1_score_no_add_one = calc_f1(y_true,test_pred_no_add_one)

  F1_list[n].append(f1_score_add_one)
  F1_list[n+4].append(f1_score_no_add_one)


f1_df = pd.DataFrame(F1_list, columns =['F1 on test.csv'] ,
                    index=['n=1, add_one',
                                        'n=2, add_one',
                                        'n=3, add_one',
                                        'n=4, add_one',
                                        'n=1, no_add_one',
                                        'n=2, no_add_one',
                                        'n=3, no_add_one',
                                        'n=4, no_add_one'])
file_name = 'part7.csv'
f1_df.to_csv(file_name)


<br><br><br><br>
**Part 8**  
Let's use the Language model (dictionary) for generation (NLG).

When it comes to sampling from a language model decoder during text generation, there are several different methods that can be used to control the randomness and diversity of the generated text.

Some of the most commonly used methods include:

> `Greedy sampling`
In this method, the model simply selects the word with the highest probability as the next word at each time step. This method can produce fluent text, but it can also lead to repetitive or predictable output.

> `Temperature scaling`  
Temperature scaling involves scaling the logits output of the language model by a temperature parameter before softmax normalization. This has the effect of smoothing the distribution of probabilities and increasing the probability of lower-probability words, which can lead to more diverse and creative output.

> `Top-K sampling`  
In this method, the model restricts the sampling to the top-K most likely words at each time step, where K is a predefined hyperparameter. This can generate more diverse output than greedy sampling, while limiting the number of low-probability words that are sampled.

> `Nucleus sampling` (also known as top-p sampling)  
This method restricts the sampling to the smallest possible set of words whose cumulative probability exceeds a certain threshold, defined by a hyperparameter p. Like top-K sampling, this can generate more diverse output than greedy sampling, while avoiding sampling extremely low probability words.

> `Beam search`  
Beam search involves maintaining a fixed number k of candidate output sequences at each time step, and then selecting the k most likely sequences based on their probabilities. This can improve the fluency and coherence of the output, but may not produce as much diversity as sampling methods.

The choice of sampling method depends on the specific application and desired balance between fluency, diversity, and randomness. Hyperparameters such as temperature, K, p, and beam size can also be tuned to adjust the behavior of the language model during sampling.


We can read more about this concept in <a href='https://huggingface.co/blog/how-to-generate#:~:text=pad_token_id%3Dtokenizer.eos_token_id)-,Greedy%20Search,-Greedy%20search%20simply'>this</a> blog post.


In [16]:
import math
import random
import heapq
import copy

def sample_greedy(probabilities, k=1):
  pred = ''
  for i in range(k):
        max_char = max(probabilities, key=lambda key: probabilities[key]) # Select the characater with the highest probability
        pred += max_char                                                  # Append the character to the prediction
  return pred

def sample_temperature(probabilities, temperature=1, k=1):
  pred = ''
  denominator = sum([math.exp(prob/temperature) for prob in probabilities.values()])
  new_probs = {key: math.exp(prob/temperature) / denominator for key, prob in probabilities.items()}
  for i in range(k):
    next_char = random.choices(list(new_probs.keys()), weights=list(new_probs.values()))[0]  # Sample the characaters with the updated probabillities
    pred += next_char                                                                        # Append the character to the prediction
  return pred

def sample_topK(probabilities, k=1):
  pred = ''
  k_highest = heapq.nlargest(k, probabilities, key=probabilities.get)         # Find the k highest prob values
  new_probs = {key: probabilities[key] for key in k_highest}                  # Create a new dictionary with only the k largest key-value pairs
  for i in range(k):
    next_char = random.choices(list(new_probs.keys()), weights=list(new_probs.values()))[0]
    pred += next_char
  return pred

def sample_topP(probabilities, p=0.8):
  sorted_probs = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)   # Sort the dictionary in decending order
  cumulative_probs = np.cumsum([prob for _, prob in sorted_probs])
  candidates = [candidate for candidate, cum_prob in zip(sorted_probs, cumulative_probs) if cum_prob <= p]
  if not candidates:                                                               # If no candidates have prob less that p we return the first largest Key
    candidates = {(sorted_probs[0][0], sorted_probs[0][1])}
  new_probs = {candidate: prob for candidate, prob in candidates}
  next_char = random.choices(list(new_probs.keys()), weights=list(new_probs.values()))[0]
  return next_char

def sample_beam(n, model, vocab,beams, k=3):

  temp_beams = []
  for i in range(k):
    for j in range(k):
      temp_beams.append(copy.deepcopy(beams[i]))
  for i, beam in enumerate(beams):
    prefix_string = ''.join([elem[0] for elem in beam[0][-(n-1):]])           # select and collapse to the appropriate prefix length, this is needed since start_tokens length is unkown
    beam_prob = model[prefix_string]
    k_highest = heapq.nlargest(k, beam_prob, key=beam_prob.get)             # Find the k highest prob values
    new_probs = {key: beam_prob[key] for key in k_highest}                  # Create a new dictionary with only the k largest key-value pairs
    keys_list = list(new_probs)
    probs_list = list(new_probs.values())

    for j in range(len(keys_list)):
      while True:
        #print(f'j={j}, keys_list length is {len(keys_list)}')
        if keys_list[j] == '<e>':
          break
        if keys_list[j] == 'unseen':                                           # if the predicted next char is 'unseen' then randomly select a char, make sure the prefix exists in the vocabulary
          keys_list[j] = random.choice(vocab)[0]
        cutted_prefix = beam[0][-(n-2):]
        cutted_string = ''.join([elem[0] for elem in cutted_prefix])
        if cutted_string + keys_list[j] in model.keys():
          break
        keys_list[j] = 'unseen'
      temp_beams[i*k+j][0].append([keys_list[j]])
      temp_beams[i*k+j][1] *= probs_list[j]


  # Initializing the resulted k beams list:
  resulted_beams = []
   # finding the best k beams from temp_beams:
  for i in range(k):
    best_beam = None
    best_prob = 0
    for beam in temp_beams:
      if beam[1] > best_prob:
        best_prob = beam[1]
        best_beam = beam
    resulted_beams.append(best_beam)
    temp_beams.remove(best_beam)

  return resulted_beams


We will use your Language Model to generate each one out of the following examples with the coresponding params.    
Notice the 4 core issues:
- Starting tokens
- Length of the generation
- Sampling methond (use all)
- Stop Token (if this token is sampled, stop generating)

In [17]:
test_ = {
    'example1' : {
        'start_tokens' : "H",
        'sampling_method' : ['greedy','beam'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example2' : {
        'start_tokens' : "H",
        'sampling_method' : ['temperature','topK','topP'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example3' : {
        'start_tokens' : "He",
        'sampling_method' : ['greedy','beam','temperature','topK','topP'],
        'gen_length' : "20",
        'stop_token' : "me",
        'generation' : []
    }
}

Now we will use the LM to generate a string based on the parametes of each examples, and store the generation sequance at the generation list.

In [18]:
def gen_text(n, model, vocab, start_tokens, sampling_method, gen_length, stop_token):
  generation = []
  prefix = [vocab[0]] * n + [[c] for c in start_tokens]                     # pad the prefix with start tokens according the ngram length
  prefix_string = ''.join([elem[0] for elem in prefix[-(n-1):]])            # select and collapse to the appropriate prefix length, this is needed since start_tokens length is unkown
  probabilities = model[prefix_string]
  next_char = ''
  if sampling_method == sample_beam:
    k = 3                                                                   # Deciding hyper-parameter k for sample_beam
    prefix_beam = prefix[-(n-1):]                                           # pad the prefix with start tokens according the ngram length
    beams = [[prefix_beam,1] for i in range(k)]
    best_beam = 0
    for i in range(gen_length):
      next_beams = sample_beam(n,model,vocab,beams,k)
      beams = next_beams
    beams[0][0] = beams[0][0][n-1:]
    text = ''.join([elem[0] for elem in beams[0][0][-(gen_length):]])
  else:
    for k in range(gen_length):
      while True:
        next_char = sampling_method(probabilities)
        if next_char == 'unseen':                                              # if the predicted next char is 'unseen' then randomly select a char, make sure the prefix exists in the vocabulary
          next_char = random.choice(vocab)[0]
        prefix = prefix[1:] + [next_char]                                      # build new prefix
        prefix_string = ''.join([elem[0] for elem in prefix[-(n-1):]])
        if prefix_string in model.keys():
          break
      if [next_char] == vocab[1]:                                                # end generation if end token is predicted
        break
      generation.append(next_char)
      probabilities = model[prefix_string]                                       # select new probabiliies
    #print(f'generetion before text: {generation}')
    text = ''.join([elem for elem in generation])
  return text

In [19]:
# Using the best english model for generation:

english_model = lm(4, vocab, '/content/nlp-course/lm-languages-data-new/en.csv', True)
en_vocab = calc_vocab(['/content/nlp-course/lm-languages-data-new/en.csv'])

In [20]:
#Generating new text into "test_":

for example_key, example_value in test_.items():
    # iterate over each sampling method for the current example
    for method in example_value['sampling_method']:
        # generate text using the current method
        sample_method = eval("sample_"+method)
        text = gen_text(4, english_model, vocab, example_value['start_tokens'], sample_method, int(example_value['gen_length']), example_value['stop_token'])
        # append the generated text to the generation list for the current example and method
        example_value['generation'].append(text)



In [21]:
print('-------- NLG --------')

for k,v in test_.items():
  l = ''.join([f'\t{sm} >> {v["start_tokens"]}{g}\n' for sm,g in zip(v['sampling_method'],v['generation'])])
  print(f'{k}:')
  print(l)

-------- NLG --------
example1:
	greedy >> Hey https:/
	beam >> Hey https:/

example2:
	temperature >> Hedge. Risi
	topK >> Hey https:/
	topP >> Hertin 😂😂😂😂

example3:
	greedy >> Hey https://t.co/FIGQk
	beam >> Hey https://t.co/FIGQ
	temperature >> Her:wemin .:: abraConn
	topK >> Hey https://t.co/FIGQk
	topP >> Hear wow may's a use m

