## Q3

In [None]:
# Read the whole corpus; the context manager closes the file afterwards.
with open("./data/Tarzan.txt", encoding="utf8") as corpus_file:
    corpus = corpus_file.read()

### Part 1. Pre-process your data and train your tokenizer.

For this purpose I'll use the WordPiece model from the Hugging Face `tokenizers` library. The first thing to do is to create a `Tokenizer` that wraps a `WordPiece` model, with `[UNK]` as the token for unknown words.

In [None]:
from tokenizers import (
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

In the next step we'll normalize the data using the `NFD`, `Lowercase`, and `StripAccents` normalizers.

In [None]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
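
To get a feel for what this normalizer sequence does, here is a rough standard-library sketch. It assumes that stripping accents amounts to dropping Unicode combining marks after NFD decomposition; `normalize_like_pipeline` is a made-up name for illustration and is not part of `tokenizers`.

```python
import unicodedata

def normalize_like_pipeline(text: str) -> str:
    """Approximate NFD -> Lowercase -> StripAccents with the standard library."""
    decomposed = unicodedata.normalize("NFD", text)  # split letters from their accents
    lowered = decomposed.lower()
    # Assumption: accent stripping means removing nonspacing marks (category "Mn").
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")

print(normalize_like_pipeline("Héllò hôw are ü?"))  # hello how are u?
```

The order matters in this sketch: the accents can only be filtered out once NFD has separated them from their base letters.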

Now we need to pre-tokenize the data. We can use the `BertPreTokenizer` to split the text on whitespace and punctuation.

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
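
As a rough illustration of this kind of splitting, the regular expression below separates words from individual punctuation characters. It is only an approximation written for this note (`pre_tokenize_sketch` is a hypothetical helper): for instance it treats `_` as a word character, and unlike the library's pre-tokenizer it does not track character offsets.

```python
import re

def pre_tokenize_sketch(text: str) -> list[str]:
    """Split on whitespace and isolate each punctuation character."""
    # \w+ keeps runs of letters/digits together; [^\w\s] matches one symbol at a time.
    return re.findall(r"\w+|[^\w\s]", text)

print(pre_tokenize_sketch("Tarzan, lord of the jungle!"))
# ['Tarzan', ',', 'lord', 'of', 'the', 'jungle', '!']
```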

Now that we have assembled our tokenization pipeline, we have to train it. We create a `WordPieceTrainer` and use it to train our tokenizer.

In [None]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

tokenizer.train(["./data/Tarzan.txt"], trainer=trainer)

The last step for our tokenizer is to add a post-processor so that it adds special tokens to the start and end of each encoded input. We'll use `TemplateProcessing` for this purpose.

In [None]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

Now the tokenizer is ready to encode text inputs.

### Part 2. Train a bi-gram model from the corpus. Also, what's data sparsity and how will you handle it?

Data sparsity means that the corpus contains only a small fraction of the word combinations that can actually occur. In the case of n-grams, we might learn many different combinations of words from our corpus and still encounter new combinations in the test data. With plain counts we would estimate the probability of such a combination as zero, which is not true. There are several ways to solve this problem, such as add-1 smoothing, backoff, and interpolation. In this project we'll use the backoff method.
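
To make the zero-probability problem concrete, here is a small sketch on a toy token list (the sentence and the helper names `mle` and `add_one` are made up for illustration). It uses add-1 smoothing only because it fits in a few lines; the rest of this notebook uses backoff instead.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle(prev: str, word: str) -> float:
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def add_one(prev: str, word: str) -> float:
    """Add-1 estimate: every bi-gram count is incremented by one."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

print(mle("the", "cat"), add_one("the", "cat"))  # a bi-gram that was seen
print(mle("cat", "mat"), add_one("cat", "mat"))  # an unseen bi-gram
```

The unseen pair gets probability 0 from the plain counts but a small positive value (here 1/6) after smoothing, at the cost of taking probability mass away from the pairs that were seen.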

The first step in creating a bi-gram model is to write a function that generates n-grams from tokens.

In [None]:
def get_n_gram(text: str, n: int, tokenizer: Tokenizer) -> list[tuple[str, ...]]:
  """
  Tokenize `text` with the provided `tokenizer` and return its n-grams in
  order of appearance.

  For `n == 0` the result holds one empty tuple per token, so that its count
  can serve as the denominator of the uni-gram probabilities.
  """

  tokens = tokenizer.encode(text).tokens
  # Number of windows: len(tokens) - n + 1 for n > 0, len(tokens) for n == 0.
  n_windows = len(tokens) - n + 1 if n > 0 else len(tokens)
  # Build the list in one pass; repeated list concatenation would be quadratic.
  return [tuple(tokens[i:i + n]) for i in range(n_windows)]
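
The windowing itself can be tried out without a tokenizer. The sketch below works on a plain token list (`sliding_n_grams` is a hypothetical stand-in written for this note) and shows the `n == 0` case, which yields one empty tuple per token.

```python
def sliding_n_grams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all windows of length n; for n == 0, one empty tuple per token."""
    if n == 0:
        return [()] * len(tokens)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["a", "b", "c", "d"]
print(sliding_n_grams(tokens, 2))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(sliding_n_grams(tokens, 0))  # [(), (), (), ()]
print(sliding_n_grams(tokens, 5))  # []
```

Counting the empty tuples gives the total number of tokens, which is what a uni-gram probability is divided by.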

Now that we can create n-grams, let's train our model. An n-gram model of order $k$ is simply the probability of seeing a word after the $k-1$ words before it:
$$p(w_i|w^{i-1}_{i-k+1})=\frac{count(w^i_{i-k+1})}{count(w^{i-1}_{i-k+1})}$$
We'll write a function that calculates this probability for each n-gram.

In [None]:
from collections import Counter

def train_n_gram(text: str, n: int, tokenizer: Tokenizer) -> dict[tuple[str, ...], float]:
  """
  Calculate the probability of seeing the nth word after the (n-1) words
  before it. To do so, count the number of times each sequence of n words
  occurs (`big_sentence_count`) and the number of times its first (n-1) words
  occur (`small_sentence_count`). The probability is
  `big_sentence_count` / `small_sentence_count`.
  """

  big_sentences = Counter(get_n_gram(text, n, tokenizer))
  small_sentences = Counter(get_n_gram(text, n - 1, tokenizer))

  result = {}
  for big_sentence, big_sentence_count in big_sentences.items():
    small_sentence_count = small_sentences[big_sentence[:-1]]
    result[big_sentence] = big_sentence_count / small_sentence_count
  
  return result
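
As a worked instance of the formula, the sketch below computes exact bi-gram probabilities for a toy token list (made up for illustration) with the same counting scheme: the count of the pair divided by the count of its first word.

```python
from collections import Counter
from fractions import Fraction

tokens = "a rose is a rose is a rose".split()
unigram_counts = Counter(tokens)                   # a: 3, rose: 3, is: 2
bigram_counts = Counter(zip(tokens, tokens[1:]))   # (a, rose): 3, (rose, is): 2, (is, a): 2

# p(word | prev) = count(prev, word) / count(prev)
bigram_probs = {
    (prev, word): Fraction(count, unigram_counts[prev])
    for (prev, word), count in bigram_counts.items()
}

print(bigram_probs[("a", "rose")])   # 1
print(bigram_probs[("rose", "is")])  # 2/3
```

Note that `p(is | rose)` is 2/3 rather than 1: the final "rose" is counted in the denominator even though nothing follows it.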

### Part 3. Predict the following sentences with at least 10 more tokens.

Remember that we are going to use the backoff method to handle data sparsity. Therefore, we need a function that trains n'-gram models for every n' from 1 to the designated n, so that we can fall back on them.

In [None]:
def train_n_grams(text: str, n: int, tokenizer: Tokenizer) -> list[dict[tuple[str, ...], float]]:
  """
  Train n-gram models for every order from 1 to the designated `n`. The result
  is a list of these trained n-grams, where index 0 of the list corresponds
  to the uni-gram.
  """

  result = [None] * n
  for i in range(1, n + 1):
    result[i - 1] = train_n_gram(text, i, tokenizer)
  return result

The next thing we need is a function that chooses the next word given the previous words and a trained n-gram. Note that the following implementation is not the most efficient, as it iterates over all of the dictionary's keys on every call.

In [None]:
from random import choices

def predict_next_word(previous_text: list[str], n_gram: dict[tuple[str, ...], float]) -> str | None:
    """
    Search the n_gram for every combination of words whose context (all words
    but the last) matches the input text. After finding every matching
    combination, make a random choice weighted by the probabilities found in
    n_gram. Return None if nothing matches.
    """
    matched_combs: list[tuple[str, ...]] = []
    combs_probabilities: list[float] = []
    context = tuple(previous_text)

    for words_comb, probability in n_gram.items():
        if context == words_comb[:-1]:
            matched_combs.append(words_comb)
            combs_probabilities.append(probability)

    if not matched_combs:
        return None

    # Select the last word of the chosen n-gram
    return choices(matched_combs, combs_probabilities)[0][-1]

The last function extends the given text by `n_tokens` tokens, backing off to lower-order n-grams whenever the current one has no match.

In [None]:
def predict_text(
    init_sentence: str,
    n_tokens: int,
    n: int,
    trained_n_grams: list[dict[tuple[str, ...], float]],
    tokenizer: Tokenizer) -> list[str]:
  """
  Continue the given initial sentence by `n_tokens` tokens using the trained
  n-grams. Back off to a lower-order n-gram whenever the sequence is not found
  in the current one.
  """

  # Tokenize and remove the end-of-sentence special token
  result = tokenizer.encode(init_sentence).tokens[:-1]
  for _ in range(n_tokens):
    next_token = None
    current_n = n
    while next_token is None and current_n >= 1:
      # The context is the last (current_n - 1) tokens. Slicing from an explicit
      # start index keeps it empty for the uni-gram; `result[-0:]` would return
      # the whole list instead and never match.
      context = result[len(result) - (current_n - 1):]
      next_token = predict_next_word(context, trained_n_grams[current_n - 1])
      current_n -= 1

    if next_token is None:
      break  # Not even the uni-gram produced a token (empty model)
    result += [next_token]

  return result
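
The order in which backoff tries the models can be seen on a toy example. The sketch below is a simplified, deterministic variant written for this note: it stores raw counts instead of probabilities and always picks the most frequent continuation instead of sampling. The names `count_continuations` and `backoff_predict` are hypothetical.

```python
from collections import Counter, defaultdict

def count_continuations(tokens: list[str], max_n: int) -> list[dict]:
    """tables[k - 1] maps each (k-1)-token context to a Counter of next tokens."""
    tables = []
    for k in range(1, max_n + 1):
        table = defaultdict(Counter)
        for i in range(len(tokens) - k + 1):
            window = tokens[i:i + k]
            table[tuple(window[:-1])][window[-1]] += 1
        tables.append(dict(table))
    return tables

def backoff_predict(history: list[str], tables: list[dict]):
    """Return (order_used, most_frequent_next_token), longest context first."""
    for k in range(len(tables), 0, -1):
        if k - 1 > len(history):
            continue  # not enough history for this order
        context = tuple(history[len(history) - (k - 1):])
        candidates = tables[k - 1].get(context)
        if candidates:
            return k, candidates.most_common(1)[0][0]
    return None

tokens = "the cat sat on the mat the cat ate".split()
tables = count_continuations(tokens, 3)
print(backoff_predict(["on", "the"], tables))   # (3, 'mat'): tri-gram context found
print(backoff_predict(["a", "mat"], tables))    # (2, 'the'): backed off to bi-gram
print(backoff_predict(["a", "zebra"], tables))  # (1, 'the'): backed off to uni-gram
```

The uni-gram context is the empty tuple, so the last fallback always succeeds as long as the corpus is not empty.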

Now let's train our n-grams. In this case we'll use bi-grams as the highest-order model.

In [None]:
trained_n_grams = train_n_grams(corpus, 2, tokenizer)

Now that everything is ready, let's predict the sentences!

In [None]:
init_sentence_1 = "Knowing well the windings of the trail he"
print(predict_text(init_sentence_1, 10, 2, trained_n_grams, tokenizer))
init_sentence_2 = "For half a day he lolled on the huge back and"
print(predict_text(init_sentence_2, 10, 2, trained_n_grams, tokenizer))

['[CLS]', 'knowing', 'well', 'the', 'windings', 'of', 'the', 'trail', 'he', 'turned', 'to', 'the', 'saracens', 'awaited', 'to', 'the', 'steaming', 'jungle', ',']
['[CLS]', 'for', 'half', 'a', 'day', 'he', 'lolled', 'on', 'the', 'huge', 'back', 'and', 'nimmr', ',', 'had', 'promised', 'not', 'place', '.', '•', 'you', '‘']


### Part 4. Now do it with 3-grams and 5-grams!!

For 3-grams:

In [None]:
trained_n_grams_3 = train_n_grams(corpus, 3, tokenizer)

In [None]:
print(predict_text(init_sentence_1, 10, 3, trained_n_grams_3, tokenizer))
print(predict_text(init_sentence_2, 10, 3, trained_n_grams_3, tokenizer))

['[CLS]', 'knowing', 'well', 'the', 'windings', 'of', 'the', 'trail', 'he', 'took', 'with', 'seven', 'great', 'lions', 'watching', 'his', 'approach', 'the', 'princess']
['[CLS]', 'for', 'half', 'a', 'day', 'he', 'lolled', 'on', 'the', 'huge', 'back', 'and', 'forth', ',', 'wagers', 'were', 'being', 'led', 'from', 'their', 'pursuer', 'even']


For 5-grams:

In [None]:
trained_n_grams_5 = train_n_grams(corpus, 5, tokenizer)

In [None]:
print(predict_text(init_sentence_1, 10, 5, trained_n_grams_5, tokenizer))
print(predict_text(init_sentence_2, 10, 5, trained_n_grams_5, tokenizer))

['[CLS]', 'knowing', 'well', 'the', 'windings', 'of', 'the', 'trail', 'he', 'took', 'short', 'cuts', ',', 'swinging', 'through', 'the', 'branches', 'of', 'the']
['[CLS]', 'for', 'half', 'a', 'day', 'he', 'lolled', 'on', 'the', 'huge', 'back', 'and', 'essayed', 'to', 'say', '"', 'eh', '?', '"', 'and', 'to', 'yawn']


### Part 5. Can you increase the `n` as much as you want in a n-gram model? Why?

No, you can't. The first problem is the increase in the computation and memory needed to train such a model, since there are more distinct n-grams to count and store. The second and more important problem is the increase in data sparsity. As we look for longer combinations of words, the probability of having seen any particular combination in the corpus drops dramatically, so the model learns only a few continuations for each context. For example, for the initial string "Knowing well the windings", there are far fewer matching sequences in the corpus from which to guess the fifth word.
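
The effect can be seen even on a tiny example. The sketch below measures, for a toy token list made up for illustration, the fraction of distinct n-grams that occur exactly once (`singleton_fraction` is a hypothetical helper, not part of the code above).

```python
from collections import Counter

def singleton_fraction(tokens: list[str], n: int) -> float:
    """Fraction of distinct n-grams that occur exactly once in `tokens`."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(1 for c in counts.values() if c == 1) / len(counts)

tokens = "the cat sat on the mat the cat ate the fish".split()
for n in (1, 2, 3):
    print(n, round(singleton_fraction(tokens, n), 2))
# 1 0.71
# 2 0.89
# 3 1.0
```

Already at `n = 3` every tri-gram in this toy text is unique, so each context has a single known continuation and the model can only replay what it has seen.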