# NLP - CA1
**Saman Eslami Nazari - 810199375**

## Q1

In [47]:
import re

def custom_tokenizer(text):
  pattern = r'\b\w+\b'
  tokens = re.findall(pattern, text)
  return tokens

### Part 1. What kind of tokenizer does the code above indicate? Is it character-based, subword-based, or word-based?

Let's first take a look at each kind of tokenization:
  - Character-based: We split the input into individual characters, resulting in an array of characters.
  - Subword-based: We first split the input into words. Commonly used words are kept intact, while rare words are broken down into more common, meaningful subwords, e.g. "Tokenization" into "Token" and "ization".
  - Word-based: We split the input into words and keep each word as a single token (unlike the subword-based approach, words are not broken down further).

The code above breaks the input word by word, which means that **it's word-based**. The `\b` matches a word boundary (the position between a word character and a non-word character). The `\w` matches a word character (letter, digit, or underscore), so `\b\w+\b` matches one whole run of word characters.
<br>
*Problems*: There are some problems with the word-based technique:
  - For words that contain punctuation, it produces incomprehensible tokens. Consider the word "Don't". This leads to the tokens "Don" and "t", while a better tokenization would be "Do" and "n't".
  - It doesn't capture the similarity between words with the same root. For example, it doesn't know that "Apple" and "Apples" are related.
  - It leads to a huge vocabulary. One important reason is the point above: since the similarities aren't captured, we end up storing the same root word many times in its different forms.
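The first two problems can be reproduced directly with the same pattern used in `custom_tokenizer`. The snippet below is only a minimal sketch; the helper name `word_tokens` and the sample strings are made up for illustration.

```python
import re

def word_tokens(text: str) -> list[str]:
    # Same pattern as custom_tokenizer: runs of word characters between boundaries
    return re.findall(r'\b\w+\b', text)

# The apostrophe is not a word character, so the contraction is cut around it
dont_tokens = word_tokens("Don't")
print(dont_tokens)

# "Apple" and "Apples" end up as two unrelated vocabulary entries
vocabulary = set(word_tokens("Apple Apples"))
print(vocabulary)
```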

### Part 2. Run the tokenization code below and show two problems with this approach.

In [48]:
q1_part2_text = "Just received my M.Sc. diploma today, on 2024/02/10! Excited to embark on this new journey of knowledge and discovery. #MScGraduate #EducationMatters."

print(custom_tokenizer(q1_part2_text))

['Just', 'received', 'my', 'M', 'Sc', 'diploma', 'today', 'on', '2024', '02', '10', 'Excited', 'to', 'embark', 'on', 'this', 'new', 'journey', 'of', 'knowledge', 'and', 'discovery', 'MScGraduate', 'EducationMatters']


- The first big problem is that it splits "M.Sc." into separate parts. This word should not be split because it only has meaning as a whole. Also, the token "M" can mean many different things in other sentences, which can mislead the model.
- The second problem is the tokenization of the date, which would be more meaningful as a single token.

### Part 3. Improve the `custom_tokenizer` to fix at least one of the problems mentioned.

In [49]:
import string

def improved_custom_tokenizer(text: str):
  tokens = text.split()
  tokens = [token.translate(str.maketrans('', '', string.punctuation)) for token in tokens]
  return tokens

print(improved_custom_tokenizer(q1_part2_text))

['Just', 'received', 'my', 'MSc', 'diploma', 'today', 'on', '20240210', 'Excited', 'to', 'embark', 'on', 'this', 'new', 'journey', 'of', 'knowledge', 'and', 'discovery', 'MScGraduate', 'EducationMatters']


With this approach the input text is first split on whitespace and then punctuation is stripped from each token. So "M.Sc." is no longer split into two tokens (it becomes "MSc") and the date stays in one piece (as "20240210"). The downside is that the separators inside those tokens are lost, and this approach still doesn't solve the problem with root words, which requires more advanced methods.
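An alternative that keeps the dots and slashes is to describe the special cases in the regular expression itself. The following is a rough sketch, not part of the assignment solution: the name `regex_tokenizer` and the three alternatives in the pattern are assumptions chosen for this particular sample sentence and would need extending for other inputs.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first
TOKEN_PATTERN = re.compile(
    r"\d{4}/\d{2}/\d{2}"      # dates such as 2024/02/10
    r"|(?:[A-Za-z]+\.){2,}"   # dotted abbreviations such as M.Sc.
    r"|#?\w+"                 # plain words, optionally with a leading '#'
)

def regex_tokenizer(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text)

sample = "Just received my M.Sc. diploma today, on 2024/02/10! #MScGraduate"
sample_tokens = regex_tokenizer(sample)
print(sample_tokens)
```

Here "M.Sc." and "2024/02/10" survive as single tokens with their punctuation intact, and the hashtag keeps its `#`.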

## Q2

### Part 1. What kind of tokenizer do BERT and GPT use? What's the reason?

- The BERT tokenizer uses a subword-based method. It uses WordPiece to build the vocabulary. WordPiece creates subwords by merging the symbol pairs that are most likely to occur together.
- GPT also uses a subword-based method: Byte-Pair Encoding (BPE), applied at the byte level in GPT-2 and later models.

One reason for this choice is that it helps to handle out-of-vocabulary words: an unseen word can still be represented as a sequence of known subwords. It also covers a larger variety of words with a limited vocabulary size.

### Part 2. Describe the two algorithms used by BERT and GPT models.

#### BERT Method

BERT uses the WordPiece algorithm. It starts with a small vocabulary made of the model's special tokens and the alphabet of the corpus. It then starts to learn merge rules. The algorithm calculates a score for each possible pair and merges the one with the highest score. The formula for this score is:

$$score = \frac{\text{frequency of pair}}{\text{frequency of first element} \times \text{frequency of second element}}$$

With this score, the algorithm prioritizes pairs whose individual parts are less frequent, meaning that frequent parts of words are less likely to be merged into new tokens. For example, "dis" and "able" might not be good candidates to merge because "able" is a frequent token that appears in many other words. On the other hand, "hel" and "lo" are more likely to be merged because they're both less frequent.

#### GPT Method

GPT uses the Byte-Pair Encoding (BPE) method. This algorithm starts by computing the unique set of words in the corpus. Then it builds the initial vocabulary from the symbols used to write these words. After this it starts to learn merges: in each step it counts every pair of consecutive symbols, merges the most frequent one, and adds the result to the vocabulary. This continues and the subwords grow longer until the desired vocabulary size is reached.

#### Differences

The main difference between these two algorithms is the score function they use. WordPiece also takes the frequency of the individual elements into account, while BPE only cares about the frequency of the pair itself.
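To make the difference concrete, here is a small sketch that computes both criteria on a toy corpus. The word frequencies are invented for illustration, and the code only performs the first selection step of each algorithm rather than a full training run.

```python
from collections import Counter

# Toy corpus: word -> frequency (hypothetical numbers)
word_freq = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_freq = Counter()
pair_freq = Counter()
for word, freq in word_freq.items():
    symbols = list(word)
    for s in symbols:
        symbol_freq[s] += freq
    for left, right in zip(symbols, symbols[1:]):
        pair_freq[(left, right)] += freq

# BPE: pick the most frequent pair
bpe_best = max(pair_freq, key=pair_freq.get)

# WordPiece: pick the pair with the highest normalized score
wp_scores = {
    pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
    for pair, freq in pair_freq.items()
}
wp_best = max(wp_scores, key=wp_scores.get)

print("BPE merges:", bpe_best)
print("WordPiece merges:", wp_best)
```

On this corpus BPE selects ("u", "g") because it is the most frequent pair, while WordPiece selects ("g", "s"): the pair is rare, but so is "s", which makes the normalized score the highest.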

### Part 3. Implement a simple version of these algorithms.

In [50]:
# Use a context manager so the file is closed after reading
with open("./data/All_Around_the_Moon.txt", encoding="utf8") as corpus_file:
  corpus = corpus_file.read()

#### WordPiece

First we'll start by splitting the text into words and creating the frequency map of the words.

In [51]:
from collections import Counter

words = re.findall(r"[\w']+|[.,!?;]", corpus)
words_freq = Counter(words)

We then need to create the initial vocabulary, which is the alphabet used in the corpus.

In [52]:
vocab = {c for word in words for c in word}

In order to count the occurrence of each pair and commit the merges, we need to calculate a map from each word to its characters.

In [53]:
splits = {word : [c for c in word] for word in words_freq.keys()}

In the next step we write the function to calculate the score of each pair. This function will be used in each step of the training.

In [54]:
from collections import defaultdict

def word_piece_pairs_score(splits: dict[str, list], words_freq: Counter) -> dict[tuple[str, str], float]:
  """
  This method calculates the score of each pair available in the corpus. It
  first iterates over all the words in the corpus and counts the frequency of
  each symbol. It then counts the frequency of each pair occurring in the
  words. Finally, it calculates the score of each pair with the formula below:
  current pair score = frequency of pair / (frequency of the first element * frequency of the second element)
  """

  pair_elem_freq = defaultdict(int)
  pair_freq = defaultdict(int)

  for word, split in splits.items():
    for pair_elem in split:
      pair_elem_freq[pair_elem] += words_freq[word]
    
    if len(split) <= 1:
      continue

    for i in range(len(split) - 1):
      pair_freq[(split[i], split[i + 1])] += words_freq[word]
    
  result_scores = {}
  for pair, freq in pair_freq.items():
    result_scores[pair] = freq / (pair_elem_freq[pair[0]] * pair_elem_freq[pair[1]])
  
  return result_scores

After calculating the scores, we need a method to merge the best possible pair.

In [55]:
def merge_pair(splits: dict[str, list], pair_to_merge: tuple[str, str]) -> None:
  """
  Given a dictionary of split words and the designated pair, this function
  merges every occurrence of the pair in each word, in place.
  """

  first, second = pair_to_merge
  merged = first + second

  for word, split in splits.items():
    # Cheap pre-check: the merged string must appear in the word
    if merged not in word: continue

    new_split = []
    i = 0
    while i < len(split):
      if i < len(split) - 1 and split[i] == first and split[i + 1] == second:
        new_split.append(merged)
        i += 2
      else:
        new_split.append(split[i])
        i += 1
    splits[word] = new_split

Now it's time for the final function. It takes the corpus and expands the vocabulary until a certain size is reached. We also repeat the first steps for calculating `splits` and the other variables to make the function self-contained.

In [56]:
def train_word_piece(corpus: str, vocab_size: int) -> tuple[set[str], dict[tuple[str, str], str]]:
  words = re.findall(r"[\w']+|[.,!?;]", corpus)
  words_freq = Counter(words)
  vocab = {c for word in words for c in word}
  splits = {word : [c for c in word] for word in words_freq.keys()}

  merges = {}

  while len(vocab) < vocab_size:
    scores = word_piece_pairs_score(splits, words_freq)
    if not scores:
      break # No pairs left to merge
    best_pair = max(scores, key=scores.get)
    merge_pair(splits, best_pair)
    vocab.add(best_pair[0] + best_pair[1])
    merges[best_pair] = best_pair[0] + best_pair[1]

  return vocab, merges

In [57]:
word_piece_vocab, word_piece_merges = train_word_piece(corpus, 500)


#### Byte-Pair Encoding

BPE and WordPiece are very similar in terms of algorithm structure. The only thing we need to do is to replace the scoring function.

In [None]:
def bpe_pairs_score(splits: dict[str, list], words_freq: Counter) -> dict[tuple[str, str], int]:
  """
  Scoring function for BPE algorithm. The score of each pair is simply the
  frequency of that pair in the corpus.
  """

  pair_freq = defaultdict(int)

  for word, split in splits.items():
    for i in range(len(split) - 1):
      pair_freq[(split[i], split[i + 1])] += words_freq[word]
  
  return pair_freq

In [None]:
def train_bpe(corpus: str, vocab_size: int) -> tuple[set[str], dict[tuple[str, str], str]]:
  words = re.findall(r"[\w']+|[.,!?;]", corpus)
  words_freq = Counter(words)
  vocab = {c for word in words for c in word}
  splits = {word : [c for c in word] for word in words_freq.keys()}

  merges = {}

  while len(vocab) < vocab_size:
    scores = bpe_pairs_score(splits, words_freq)
    if not scores:
      break # No pairs left to merge
    best_pair = max(scores, key=scores.get)
    merge_pair(splits, best_pair)
    vocab.add(best_pair[0] + best_pair[1])
    merges[best_pair] = best_pair[0] + best_pair[1]

  return vocab, merges

In [None]:
bpe_vocab, bpe_merges = train_bpe(corpus, 500)

#### What's the size of the words in each method's vocabulary? Are these numbers different? Why?

For this question I'll check the largest word created in each vocabulary.

In [None]:
word_piece_largest_word = max(word_piece_vocab, key=len)
print(f"WordPiece largest word: {word_piece_largest_word}, size = {len(word_piece_largest_word)}")

bpe_largest_word = max(bpe_vocab, key=len)
print(f"BPE largest word: {bpe_largest_word}, size = {len(bpe_largest_word)}")

WordPiece largest word: CONSEQUENTIAL, size = 13
BPE largest word: Projectile, size = 10


As we can see, the longest token created by BPE is 10 characters long, while the longest one created by WordPiece is 13. The main reason for this difference is the way each algorithm scores pairs. WordPiece gives each pair a relative score (the pair frequency normalized by the frequencies of its parts), so the score of a pair doesn't drop much as its parts grow longer. In BPE, on the other hand, the raw frequency of longer pairs is lower than that of shorter ones, because a specific long sequence is rarer than a short subword. Therefore, in each iteration BPE is more likely to pick short merges, and it takes longer to build long tokens.<br>
For example, consider the pairs ("e", "t") and ("pu", "sh"). There are probably more words containing "et" than words containing "push". WordPiece, however, also takes the frequency of the pair's elements into account. We might have fewer "push"s than "et"s, but we also have fewer "pu"s and "sh"s than "e"s and "t"s in our corpus, so the score of the second pair isn't *that* small compared to the first one.

#### Tokenize the given text.

In order to tokenize a text we need a method that takes the learned merge rules and applies them to the input text.

In [None]:
def tokenize(merge_rules: dict[tuple[str, str], str], text: str) -> list[str]:
  """
  This method will first pre-tokenize the `text` and split it into characters.
  After this it will try to merge the tokens using the provided rules.
  """

  def merge_word(split_word: list[str], merge_rule: tuple[str, str, str]) -> list[str]:
    i = 0
    while i < (len(split_word) - 1):
      if split_word[i] == merge_rule[0] and split_word[i + 1] == merge_rule[1]:
        split_word = split_word[:i] + [merge_rule[2]] + split_word[i + 2:]
      else:
        i += 1
    return split_word

  split_text = [[c for c in word] for word in text.split()]
  for pair, merged in merge_rules.items():
    for idx, split_word in enumerate(split_text):
      split_text[idx] = merge_word(split_word, (pair[0], pair[1], merged))
    
  return sum(split_text, [])
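To see how ordered merge rules turn characters into subwords, here is a small standalone sketch on a single word. The function name `apply_merges` and the three toy rules are assumptions for illustration; they are not the rules learned from the corpus.

```python
def apply_merges(word: str, merges: dict[tuple[str, str], str]) -> list[str]:
    symbols = list(word)
    # Rules are applied in the order they were learned (dicts keep insertion order)
    for (left, right), merged in merges.items():
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [merged]
            else:
                i += 1
    return symbols

toy_merges = {("t", "h"): "th", ("th", "e"): "the", ("i", "n"): "in"}
then_tokens = apply_merges("then", toy_merges)
thin_tokens = apply_merges("thin", toy_merges)
print(then_tokens)
print(thin_tokens)
```

The order of the rules matters: ("th", "e") can only fire because ("t", "h") was applied before it.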

In [None]:
text_to_tokenize = """
This darkness is absolutely killing! If we ever take this trip again, it must be
about the time of the sNew Moon! This is a tokenization task. Tokenization is
the first step in a NLP pipeline. We will be comparing the tokens generated by
each tokenization model.
"""

text_to_tokenize = text_to_tokenize.replace('\n', ' ')

tokenized_with_bpe = tokenize(bpe_merges, text_to_tokenize)
tokenized_with_word_piece = tokenize(word_piece_merges, text_to_tokenize)

print(f"BPE: {tokenized_with_bpe}")
print(f"WordPiece: {tokenized_with_word_piece}")

{('t', 'h'): 'th', ('th', 'e'): 'the', ('i', 'n'): 'in', ('r', 'e'): 're', ('a', 'n'): 'an', ('o', 'n'): 'on', ('e', 'r'): 'er', ('a', 't'): 'at', ('o', 'u'): 'ou', ('e', 'n'): 'en', ('e', 'd'): 'ed', ('a', 'r'): 'ar', ('s', 't'): 'st', ('o', 'f'): 'of', ('o', 'r'): 'or', ('in', 'g'): 'ing', ('i', 't'): 'it', ('l', 'l'): 'll', ('i', 's'): 'is', ('t', 'o'): 'to', ('a', 's'): 'as', ('l', 'e'): 'le', ('an', 'd'): 'and', ('i', 'c'): 'ic', ('h', 'e'): 'he', ('l', 'y'): 'ly', ('s', 'e'): 'se', ('b', 'e'): 'be', ('e', 's'): 'es', ('r', 'o'): 'ro', ('c', 't'): 'ct', ('h', 'a'): 'ha', ('n', 'o'): 'no', ('i', 'on'): 'ion', ('a', 'l'): 'al', ('v', 'e'): 've', ('g', 'h'): 'gh', ('m', 'e'): 'me', ('i', 'd'): 'id', ('en', 't'): 'ent', ('c', 'e'): 'ce', ('w', 'e'): 'we', ('v', 'er'): 'ver', ('l', 'd'): 'ld', ('r', 'i'): 'ri', ('u', 't'): 'ut', ('w', 'h'): 'wh', ('r', 'a'): 'ra', ('c', 'h'): 'ch', ('u', 'n'): 'un', ('th', 'at'): 'that', ('a', 'll'): 'all', ('l', 'i'): 'li', ('i', 'r'): 'ir', ('f', 'or

## Q3

In [82]:
# Use a context manager so the file is closed after reading
with open("./data/Tarzan.txt", encoding="utf8") as corpus_file:
  corpus = corpus_file.read()

### Part 1. Pre-process your data and train your tokenizer.

For this purpose I'll use the Hugging Face `tokenizers` library with a WordPiece model. The first thing to do is to create the tokenizer.

In [83]:
from tokenizers import (
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

In the next step we'll normalize the data using the `NFD`, `Lowercase`, and `StripAccents` normalizers.

In [84]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

Now we need to pre-tokenize the data. We can use the `BertPreTokenizer` to split the text on whitespace and punctuation.

In [85]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Now that we have completed our tokenization pipeline, we have to train it. We create a `WordPieceTrainer` and use it to train our tokenizer.

In [86]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

tokenizer.train(["./data/Tarzan.txt"], trainer=trainer)

The last step for our tokenizer is adding a post-processor so that it adds special tokens to the start and end of each sentence. We'll use `TemplateProcessing` for this purpose.

In [87]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

Now the tokenizer is ready to encode text inputs.

### Part 2. Train a bi-gram model from the corpus. Also, what's data sparsity and how will you handle it?

Data sparsity means that the corpus covers only a small fraction of all the possible word combinations. In the case of n-grams, we might learn many different combinations of words from our corpus, but we can still encounter new combinations in the test data. A plain count-based model assigns such a combination a probability of zero, which is not realistic. There are several ways to deal with this problem, such as add-1 smoothing, backoff, or interpolation. In this project we'll use the backoff method.
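For comparison with backoff, add-1 (Laplace) smoothing can be sketched in a few lines. This is only an illustration on a made-up six-word sentence with whitespace tokenization, not the method used in the rest of this question; the helper name `add_one_prob` is hypothetical.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
vocab = set(tokens)

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def add_one_prob(prev: str, word: str) -> float:
    # (count(prev, word) + 1) / (count(prev) + |V|)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

seen = add_one_prob("the", "cat")    # bigram observed once
unseen = add_one_prob("the", "sat")  # bigram never observed
print(seen, unseen)
```

The unseen bigram now gets a small non-zero probability (1/7 here) instead of zero, at the cost of taking some probability mass away from the observed bigrams.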

First step to create a bi-gram model is to create a method that will generate n-grams from tokens.

In [88]:
def get_n_gram(text: str, n: int, tokenizer: Tokenizer) -> list[tuple[str, ...]]:
  """
  This method will first tokenize the `text` using the provided `tokenizer`.
  After doing that it will create n-grams with respect to the given `n`.
  """

  tokens = tokenizer.encode(text).tokens
  # For n = 0 this yields one empty tuple per token, so the count of `()`
  # equals the total number of tokens (used as the uni-gram denominator).
  idx_range = range(len(tokens) - n + 1) if n > 0 else range(len(tokens))
  # A list comprehension avoids the quadratic cost of repeated list concatenation
  return [tuple(tokens[i:i + n]) for i in idx_range]

Now that we can create n-grams, let's train our model. Our n-gram model is simply the probability of seeing a word given the $k-1$ words before it:
$$p(w_i|w^{i-1}_{i-k+1})=\frac{count(w^i_{i-k+1})}{count(w^{i-1}_{i-k+1})}$$
We'll write the function that calculates this probability for each n-gram.

In [89]:
from collections import Counter

def train_n_gram(text: str, n: int, tokenizer: Tokenizer) -> dict[tuple[str, ...], float]:
  """
  This method calculates the probability of seeing the nth word after seeing
  the (n-1) words before it. To do so it counts the number of times we've seen
  the sequence of n words (`big_sentence_count`) and the number of times we've
  seen the sequence of its first (n-1) words (`small_sentence_count`). The
  result is `big_sentence_count` / `small_sentence_count`.
  """

  big_sentences = Counter(get_n_gram(text, n, tokenizer))
  small_sentences = Counter(get_n_gram(text, n - 1, tokenizer))

  result = {}
  for big_sentence, big_sentence_count in big_sentences.items():
    small_sentence_count = small_sentences[big_sentence[:-1]]
    result[big_sentence] = big_sentence_count / small_sentence_count
  
  return result

### Part 3. Predict the following sentences with at least 10 more tokens.

Remember that we are going to use the backoff method to deal with data sparsity. Therefore we need a method that trains n'-grams for every n' from 1 to the designated n.

In [90]:
def train_n_grams(text: str, n: int, tokenizer: Tokenizer) -> list[dict[tuple[str, ...], float]]:
  """
  This method will create n-grams for n from 1 to the designated `n`. The
  result will be a list of these trained n-grams where the index 0 of the list
  will correspond to a uni-gram.
  """

  result = [None] * n
  for i in range(1, n + 1):
    result[i - 1] = train_n_gram(text, i, tokenizer)
  return result

Next thing that we need is a method to choose the next word with respect to the previous words and the trained n-gram. Note that the following implementation is not the best as it iterates over the dictionary's keys.

In [91]:
from random import choices

def predict_next_word(previous_text: list[str], n_gram: dict[tuple[str, ...], float]) -> str | None:
    """
    This method simply searches for every combination of words in the n_gram
    that matches the input text. After finding every matched combination, it
    will make a random choice with the probabilities found in n_gram.
    """
    matched_combs: list[tuple[str, ...]] = []
    combs_probabilities: list[float] = []
    previous_text = tuple(previous_text)

    for words_comb, probability in n_gram.items():
        if previous_text == words_comb[:-1]:
            matched_combs.append(words_comb)
            combs_probabilities.append(probability)

    if not matched_combs:
        return None

    return choices(matched_combs, combs_probabilities)[0][-1] # Select the last word of the chosen n-gram

The last function extends the given text by `n_tokens` tokens, backing off to lower-order n-grams whenever the current context is not found.

In [92]:
def predict_text(
    init_sentence: str,
    n_tokens: int,
    n: int,
    trained_n_grams: list[dict[tuple[str, ...], float]],
    tokenizer: Tokenizer) -> list[str]:
  """
  This method will continue the given initial sentence by `n_tokens` tokens
  using the trained n-grams. It will also back off to a lower n-gram whenever
  it doesn't find the sequence in the current n-gram.
  """

  result = tokenizer.encode(init_sentence).tokens[:-1] # Tokenize and remove the end of sentence special token
  for _ in range(n_tokens):
    next_token = None
    current_n = n
    while next_token is None and current_n >= 1:
      # The context is the last (current_n - 1) tokens; it must be empty for
      # the uni-gram (note that `result[-0:]` would return the whole list).
      context = result[-(current_n - 1):] if current_n > 1 else []
      next_token = predict_next_word(context, trained_n_grams[current_n - 1])
      current_n -= 1

    if next_token is None:
      break # Nothing matched, even at the uni-gram level
    result += [next_token]

  return result
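One detail is worth keeping in mind when backing off all the way to the uni-gram: the context must then be empty, but a slice of the form `tokens[-(n - 1):]` does not give that for `n = 1`, because `tokens[-0:]` is the same as `tokens[0:]`. The short sketch below shows the pitfall and one way around it; the helper name `backoff_context` is hypothetical.

```python
def backoff_context(tokens: list[str], order: int) -> list[str]:
    # An n-gram of the given order conditions on the last (order - 1) tokens
    k = order - 1
    return tokens[-k:] if k > 0 else []

history = ["on", "the", "huge", "back"]
print(history[-0:])                 # same as history[0:], i.e. the whole list
print(backoff_context(history, 1))  # uni-gram: empty context
print(backoff_context(history, 3))  # tri-gram: last two tokens
```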

Now let's train our n-grams. In this case we'll use bi-grams as the highest-order model.

In [93]:
trained_n_grams = train_n_grams(corpus, 2, tokenizer)

Now that everything is ready, let's predict the sentences!

In [94]:
init_sentence_1 = "Knowing well the windings of the trail he"
print(predict_text(init_sentence_1, 10, 2, trained_n_grams, tokenizer))
init_sentence_2 = "For half a day he lolled on the huge back and"
print(predict_text(init_sentence_2, 10, 2, trained_n_grams, tokenizer))

['[CLS]', 'knowing', 'well', 'the', 'windings', 'of', 'the', 'trail', 'he', 'could', 'conjure', '.', 'this', 'was', 'a', 'palaver', 'with', 'raw', 'meat']
['[CLS]', 'for', 'half', 'a', 'day', 'he', 'lolled', 'on', 'the', 'huge', 'back', 'and', 'others', 'pressed', 'about', 'the', 'pike', 'tips', '.', '"', 'good', 'and']


### Part 4. Now do it with 3-grams and 5-grams!!

For 3-grams:

In [95]:
trained_n_grams_3 = train_n_grams(corpus, 3, tokenizer)

In [96]:
print(predict_text(init_sentence_1, 10, 3, trained_n_grams_3, tokenizer))
print(predict_text(init_sentence_2, 10, 3, trained_n_grams_3, tokenizer))

['[CLS]', 'knowing', 'well', 'the', 'windings', 'of', 'the', 'trail', 'he', 'took', 'short', 'cuts', ',', 'swinging', 'through', 'the', 'trees', 'and', 'lighted']
['[CLS]', 'for', 'half', 'a', 'day', 'he', 'lolled', 'on', 'the', 'huge', 'back', 'and', 'usha', 'tore', 'through', 'the', 'pass', 'just', 'above', 'the', 'boss', 'and']


For 5-grams:

In [97]:
trained_n_grams_5 = train_n_grams(corpus, 5, tokenizer)

In [98]:
print(predict_text(init_sentence_1, 10, 5, trained_n_grams_5, tokenizer))
print(predict_text(init_sentence_2, 10, 5, trained_n_grams_5, tokenizer))

['[CLS]', 'knowing', 'well', 'the', 'windings', 'of', 'the', 'trail', 'he', 'took', 'short', 'cuts', ',', 'swinging', 'through', 'the', 'branches', 'of', 'the']
['[CLS]', 'for', 'half', 'a', 'day', 'he', 'lolled', 'on', 'the', 'huge', 'back', 'and', 'essayed', 'to', 'say', '"', 'eh', '?', '"', 'and', 'to', 'yawn']


### Part 5. Can you increase the `n` as much as you want in a n-gram model? Why?

No, you can't. The first problem is the increase in computation and memory needed to train such a model. The second and more important problem is the increase in data sparsity: as we look for longer combinations of words, the chance of seeing any particular combination in the corpus drops dramatically, so the model learns only a few continuations for each context. For example, for the initial string "Knowing well the windings", there are very few occurrences of this exact sequence in the corpus from which to guess the fifth word.
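The growth in sparsity can be observed even on a tiny example by measuring the fraction of distinct n-grams that occur exactly once. This is a rough sketch on a made-up sentence with whitespace tokenization; the function name `singleton_fraction` is an assumption, and the exact numbers will differ for a real corpus.

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat sat on the rug".split()

def singleton_fraction(tokens: list[str], n: int) -> float:
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

for n in (1, 2, 3, 5):
    print(n, round(singleton_fraction(tokens, n), 2))
```

For this sentence, 3 of the 7 distinct uni-grams occur only once, while 7 of the 8 distinct 5-grams do, so almost every 5-gram context has a single observed continuation.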

## Q4

Let's first load the pre-written codes for training n-grams.

In [99]:
import nltk
import pandas as pd
from nltk import ngrams
from nltk.probability import FreqDist
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ali18\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [100]:
# Step 1: Load the data
data = pd.read_csv('./data/google_play_store_apps_reviews.csv')

# Step 2: Split the data
train_data, test_data = train_test_split(data, test_size = 0.2, random_state = 42)

In [101]:
# Step 3: Build the n-gram Language Model
def get_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    tokens = nltk.word_tokenize(text)
    return list(ngrams(tokens, n))

def train_ngram(data: pd.DataFrame, n: int) -> tuple[FreqDist, FreqDist]:
    positive_ngrams = []
    negative_ngrams = []

    for index, row in data.iterrows():
        grams = get_ngrams(row['review'], n)
        if row['polarity'] == 1:
            positive_ngrams.extend(grams)
        elif row['polarity'] == 0:
            negative_ngrams.extend(grams)

    positive_freq = FreqDist(positive_ngrams)
    negative_freq = FreqDist(negative_ngrams)

    return positive_freq, negative_freq

In [102]:
# Step 4: Train the Model
n = 2  # Change to the desired n-gram size
positive_freq, negative_freq = train_ngram(train_data, n)

Now that the trained data is ready, let's write a method that would predict the label of a given sentence.

In [103]:
def predict_label(text: str, positive_freq: FreqDist, negative_freq: FreqDist, n: int) -> int:
  """
  This method sums up the number of times that the n-grams of the text have
  occurred in each class, normalizes the result by the class sizes, and
  chooses the class with the larger number.
  """

  positive_size = sum(positive_freq.values())
  negative_size = sum(negative_freq.values())

  positive_occurrence, negative_occurrence = 0, 0

  text_n_grams = get_ngrams(text, n)
  for n_gram in text_n_grams:
    positive_occurrence += positive_freq[n_gram]
    negative_occurrence += negative_freq[n_gram]
  
  positive_occurrence = positive_occurrence * negative_size / positive_size

  if positive_occurrence > negative_occurrence: return 1
  return 0
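The effect of the normalization step can be seen on a toy example. The sketch below mirrors the same decision rule with plain `Counter` objects; the counts, the names `toy_bigrams` and `toy_predict`, and the whitespace tokenization are assumptions made to keep the example self-contained.

```python
from collections import Counter

def toy_bigrams(text: str) -> list[tuple[str, str]]:
    words = text.lower().split()  # whitespace split, an assumption for brevity
    return list(zip(words, words[1:]))

def toy_predict(text: str, pos: Counter, neg: Counter) -> int:
    pos_size, neg_size = sum(pos.values()), sum(neg.values())
    pos_hits = sum(pos[g] for g in toy_bigrams(text))
    neg_hits = sum(neg[g] for g in toy_bigrams(text))
    # Rescale the positive hits so both classes are compared on the same scale
    pos_hits = pos_hits * neg_size / pos_size
    return 1 if pos_hits > neg_hits else 0

pos_counts = Counter({("great", "app"): 6, ("works", "well"): 4})
neg_counts = Counter({("great", "app"): 1, ("keeps", "crashing"): 4})

label_positive = toy_predict("Great app", pos_counts, neg_counts)
label_negative = toy_predict("keeps crashing", pos_counts, neg_counts)
print(label_positive, label_negative)
```

For "Great app" the positive class has 6 hits out of 10 bigrams, which is rescaled to 3 against the negative class size of 5, and still beats the single negative hit.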
  

It's time to test!!

In [104]:
# Step 5: test the n-gram
def test_ngram(data: list[str], positive_freq: FreqDist, negative_freq: FreqDist, n: int) -> list[int]:
  pred_labels = []

  for text in data:
    pred_labels.append(predict_label(text, positive_freq, negative_freq, n))

  return pred_labels

In [105]:
predicted_labels = test_ngram(test_data["review"].values.tolist(), positive_freq, negative_freq, n)

In [106]:
print(f"Accuracy Score: {accuracy_score(test_data['polarity'].values.tolist(), predicted_labels)}")
print(f"Precision: {precision_score(test_data['polarity'].values.tolist(), predicted_labels)}")
print(f"Recall: {recall_score(test_data['polarity'].values.tolist(), predicted_labels)}")


Accuracy Score: 0.7318435754189944
Precision: 0.5384615384615384
Recall: 0.660377358490566


Here we have an accuracy of about 73%, which is the fraction of all our predictions that were correct.
We also calculated precision and recall. Based on the results, about 54% of the positive predictions that we made were actually positive (precision), and about 66% of the actual positive samples were predicted correctly by the model (recall).
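As a reminder of how these three metrics are defined, the sketch below computes them by hand from the confusion counts. The two label lists are made-up examples, not the predictions of the model above.

```python
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)  # correct predictions / all predictions
precision = tp / (tp + fp)          # correct positives / predicted positives
recall = tp / (tp + fn)             # correct positives / actual positives

print(accuracy, precision, recall)
```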

In [107]:
# Use a context manager so the file is closed after reading
with open("./data/Tarzan.txt", encoding="utf8") as corpus_file:
    file_content = corpus_file.read()

from tokenizers import models as token_models
from tokenizers import normalizers as token_normalizers
from tokenizers import pre_tokenizers as token_pre_tokenizers
from tokenizers import processors as token_processors
from tokenizers import trainers as token_trainers
from tokenizers import Tokenizer

# Use BPE tokenizer
tokenizer = Tokenizer(token_models.BPE(unk_token="[UNK]"))

tokenizer.normalizer = token_normalizers.Sequence(
    [token_normalizers.NFKC(), token_normalizers.Lowercase(), token_normalizers.StripAccents()]
)

tokenizer.pre_tokenizer = token_pre_tokenizers.BertPreTokenizer()

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
token_trainer = token_trainers.BpeTrainer(special_tokens=special_tokens)

tokenizer.train(["./data/Tarzan.txt"], trainer=token_trainer)

cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")

# Change TemplateProcessing options
tokenizer.post_processor = token_processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:1",  
    pair="[CLS]:0 $A:0 [SEP]:1 $B:1 [SEP]:2", 
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)

from collections import Counter
from typing import List, Dict, Tuple

def generate_n_gram_probabilities(text: str, n: int, tk: Tokenizer) -> Dict[Tuple[str, ...], float]:
    tokens = tk.encode(text).tokens
    n_gram_counts = Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    # Prefix counts must come from the (n-1)-grams, not from the n-gram counter
    prefix_counts = Counter(tuple(tokens[i:i+n-1]) for i in range(len(tokens) - n + 2))

    result = {}
    for n_gram, n_gram_count in n_gram_counts.items():
        if n > 1:
            prefix_count = prefix_counts[n_gram[:-1]]
            result[n_gram] = n_gram_count / prefix_count if prefix_count > 0 else 0
        else:
            # Uni-grams keep their raw counts, which still work as sampling weights
            result[n_gram] = n_gram_count

    return result

def create_n_grams_set(text: str, n: int, tk: Tokenizer) -> list[dict[tuple[str, ...], float]]:
    result = [None] * n
    for i in range(1, n + 1):
        result[i - 1] = generate_n_gram_probabilities(text, i, tk)
    return result

from random import choices

def predict_next_word(prev_seq: list[str], n_gram: dict[tuple[str], int]) -> str | None:
    matched_seqs: list[tuple[str]] = []
    seq_probs: list[int] = []
    prev_seq = tuple(prev_seq)

    for words_seq, probability in n_gram.items():
        if prev_seq == words_seq[:-1]:
            matched_seqs += [words_seq]
            seq_probs += [probability]
    
    if not matched_seqs:
        return None

    return choices(matched_seqs, seq_probs)[0][-1] # Select the last word of the chosen n-gram

def generate_text(
    init_seq: str,
    n_tokens: int,
    n: int,
    trained_n_grams: list[dict[tuple[str], int]],
    tk: Tokenizer) -> list[str]:
    result = tk.encode(init_seq).tokens[:-1]  # Tokenize and remove the end-of-sequence special token

    for _ in range(n_tokens):
        next_token = None
        current_n = n

        while next_token is None and current_n > 0:
            # Empty context for the uni-gram (`result[-0:]` would be the whole list)
            context = result[-(current_n - 1):] if current_n > 1 else []
            next_token = predict_next_word(context, trained_n_grams[current_n - 1])
            current_n -= 1

        if next_token is None:
            break

        result.append(next_token)

    return result


trained_n_grams_2 = create_n_grams_set(file_content, 2, tokenizer)

init_seq_1 = "Knowing well the windings of the trail he"
print(generate_text(init_seq_1, 10, 2, trained_n_grams_2, tokenizer))
init_seq_2 = "For half a day he lolled on the huge back and"
print(generate_text(init_seq_2, 10, 2, trained_n_grams_2, tokenizer))

trained_n_grams_3 = create_n_grams_set(file_content, 3, tokenizer)

print(generate_text(init_seq_1, 10, 3, trained_n_grams_3, tokenizer))
print(generate_text(init_seq_2, 10, 3, trained_n_grams_3, tokenizer))

trained_n_grams_5 = create_n_grams_set(file_content, 5, tokenizer)

print(generate_text(init_seq_1, 10, 5, trained_n_grams_5, tokenizer))
print(generate_text(init_seq_2, 10, 5, trained_n_grams_5, tokenizer))
