<a href="https://colab.research.google.com/github/ThuyHaLE/Problem3_Natural-Language-Processing/blob/main/NLG_Text_generation_models(n_grams).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Text generation models** <Br>
There are two main categories of Text Generation Models:
- **Rule-based models**: rely on predefined rules and templates, offering control and explainability but limited creativity.
- **Statistical models**: learn patterns from large amounts of text data:
  - **Traditional statistical models**: Techniques like ***n-grams***, ***Hidden Markov Models (HMMs)***, and ***Conditional Random Fields (CRFs)***.
  - **Neural statistical models**: Includes Bengio's ***Neural Probabilistic Language Model (NPLM)***
  - **Deep learning models**: statistical models built on ***deep neural network architectures***, such as Transformers, which underlie most Large Language Models (LLMs). Although they are a ***more advanced architecture*** than traditional statistical models, they are still trained on ***massive amounts*** of text data, and their outputs reflect the ***statistical patterns*** learned from that data. LLMs are usually built on Transformers (less often on other deep learning architectures), are trained for language generation tasks, and excel at generating fluent, human-like text.

**Statistical models**:
- **Traditional statistical models**: <Br>
Rely on ***statistical analysis*** of data to learn ***patterns*** and make ***predictions***. This category encompasses techniques like ***n-grams***, ***Hidden Markov Models (HMMs)***, and ***Conditional Random Fields (CRFs)*** used for text generation.
  - N-grams
  - Hidden Markov Models (HMMs)
  - Conditional Random Fields (CRFs)

In [None]:
import string
import math
import numpy as np
import sys
import operator

#**Statistical models**

##**Traditional statistical models**

###**N-grams**

Goal: assign a probability to a sentence

**Language model**

Pipeline:
- Text processing
- Word based Tokenization
- Compute the probability of a sequence of words (tokens) in a sentence
    
  $P(W) = P(w_1,w_2,...,w_n)$

  $P(B|A) = \frac{P(AB)}{P(A)} \Rightarrow P(AB) = P(B|A) \cdot P(A)$

  $P(ABCD) = P(A) \cdot P(B|A) \cdot P(C|AB) \cdot P(D|ABC)$

  $\Rightarrow P(W) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1w_2) \cdots P(w_n|w_{1:n-1}) = \prod_{i}P(w_i|w_1w_2...w_{i-1}) = \prod_{i}P(w_i|w_{1:i-1})$


**N-grams Language Model**

$P(w|h) = \frac{count(h,w)}{count(h)} \Rightarrow P(W) \approx \prod_{i}P(w_i|w_{i-(N-1) : i-1})$

w: *the current token (the word being predicted)*

h: *the history, i.e. the tokens that precede w*
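
The count-based estimate above can be sketched for any order N as follows. This is a minimal illustration on a toy corpus, not the notebook's own implementation; the names `ngram_mle` and `toy` are hypothetical.

```python
from collections import Counter

def ngram_mle(sentences, n):
    """Return a function prob(word, history) = count(history, word) / count(history)."""
    ngram_counts = Counter()
    history_counts = Counter()
    for tokens in sentences:
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])  # the n-1 tokens before position i
            ngram_counts[history + (tokens[i],)] += 1
            history_counts[history] += 1

    def prob(word, history):
        history = tuple(history)
        if history_counts[history] == 0:
            return 0.0  # unseen history: no estimate available
        return ngram_counts[history + (word,)] / history_counts[history]

    return prob

# toy corpus (illustrative only)
toy = [["<s>", "a", "b", "</s>"], ["<s>", "a", "c", "</s>"]]
p = ngram_mle(toy, n=2)
print(p("a", ["<s>"]))  # "<s>" is followed by "a" in both sentences
print(p("b", ["a"]))    # "a" is followed by "b" in one of two sentences
```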

**Unigram model (N=1)**
$P(W) \approx ∏_{i}P(w_i)$

**Bigram model (N=2)**
$P(W) \approx ∏_{i}P(w_i|w_{i-1})$

*Padding = 1 ($< s>,w_1, w_2, ... , w_n, < /s>$)

**Trigram model (N=3)**
$P(W) \approx ∏_{i}P(w_i|w_{i-2:i-1})$

*Padding = 2 ($< s>,< s>,w_1, w_2, ... , w_n, < /s>,< /s>$)
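
As a rough sketch of the padding convention above (N-1 boundary markers on each side), the helper below pads a token list and lists its n-grams. The function name `pad_and_ngrams` is an assumption made for illustration.

```python
def pad_and_ngrams(tokens, n):
    """Pad with n-1 boundary markers per side, then return the n-grams as tuples."""
    pad = max(n - 1, 0)
    padded = ["<s>"] * pad + list(tokens) + ["</s>"] * pad
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(pad_and_ngrams(["w1", "w2"], 2))  # bigrams with one marker per side
print(pad_and_ngrams(["w1", "w2"], 3))  # trigrams with two markers per side
```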

####Data preprocessing

In [None]:
#Download data
!wget --no-check-certificate \
    https://gist.githubusercontent.com/khacanh/4c4662fa226db87a4664dfc2f70bc63e/raw/5d8a1d890c73a1e92e6898137db28f3dc0676975/kieu.txt \
    -O ./kieu.txt

--2024-06-10 06:27:35--  https://gist.githubusercontent.com/khacanh/4c4662fa226db87a4664dfc2f70bc63e/raw/5d8a1d890c73a1e92e6898137db28f3dc0676975/kieu.txt
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 135832 (133K) [text/plain]
Saving to: ‘./kieu.txt’


2024-06-10 06:27:36 (4.27 MB/s) - ‘./kieu.txt’ saved [135832/135832]



In [None]:
#Read data and save in corpus
corpus = []
with open("kieu.txt", "r", encoding="utf-8") as f:
  for line in f:
    corpus.append(line)

In [None]:
#Take a look at the data
print(f'number of sentences: {len(corpus)}')
corpus[:5]

number of sentences: 3254


['Trăm năm trong cõi người ta,\n',
 'Chữ tài chữ mệnh khéo là ghét nhau.\n',
 'Trải qua một cuộc bể dâu,\n',
 'Những điều trông thấy mà đau đớn lòng.\n',
 'Lạ gì bỉ sắc tư phong,\n']

In [None]:
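#Data preprocessing
def preprocessing(sentence):
  import string
  # replace every ASCII punctuation mark with a space
  for punc in string.punctuation:
    sentence = sentence.replace(punc, ' ')
  # add sentence-boundary markers, lowercase, and split on whitespace
  return ('<s> ' + sentence + ' </s>').lower().split()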

sentences = [preprocessing(sentence) for sentence in corpus]
sentences[:5]

[['<s>', 'trăm', 'năm', 'trong', 'cõi', 'người', 'ta', '</s>'],
 ['<s>', 'chữ', 'tài', 'chữ', 'mệnh', 'khéo', 'là', 'ghét', 'nhau', '</s>'],
 ['<s>', 'trải', 'qua', 'một', 'cuộc', 'bể', 'dâu', '</s>'],
 ['<s>', 'những', 'điều', 'trông', 'thấy', 'mà', 'đau', 'đớn', 'lòng', '</s>'],
 ['<s>', 'lạ', 'gì', 'bỉ', 'sắc', 'tư', 'phong', '</s>']]

####Modeling

#####Bigram probability

In [None]:
# store unigram counts
unigram_dict = {}
for sentence in sentences:
  for word in sentence:
    if word in unigram_dict:
      unigram_dict[word] += 1
    else:
      unigram_dict[word] = 1
print('unigram_dictionary:', len(unigram_dict), unigram_dict)

# store bigram counts
# note: prev_word is set once and carried across sentences, so the counts also
# include '<s> <s>' (first line) and '</s> <s>' (every sentence boundary)
bigram_dict = {}
prev_word = '<s>'
for sentence in sentences:
  for word in sentence:
    bigram = prev_word + ' ' + word
    if bigram in bigram_dict:
      bigram_dict[bigram] += 1
    else:
      bigram_dict[bigram] = 1
    prev_word = word
# keep bigrams that occur at least once (a no-op at this threshold; raise it to drop rare bigrams)
bigram_dict = {bigram:count for bigram, count in bigram_dict.items() if count >=1 }
print('bigram_dictionary:', len(bigram_dict), bigram_dict)

# build the conditional probability table P(main_word | hist_word)
prob_bigram_dict = {}
# loop through every bigram in the dictionary
for bigram,value in bigram_dict.items():
  hist_word, main_word = bigram.split() # extract hist_word & main_word
  bigram_probability = value/unigram_dict[hist_word] # count(hist_word, main_word) / count(hist_word)
  if hist_word not in prob_bigram_dict:
    prob_bigram_dict[hist_word] = {main_word: bigram_probability}
  else:
    prob_bigram_dict[hist_word][main_word] = bigram_probability

print('prob_bigram_dict:',len(prob_bigram_dict),prob_bigram_dict)

unigram_dictionary: 2407 {'<s>': 3254, 'trăm': 31, 'năm': 52, 'trong': 105, 'cõi': 10, 'người': 223, 'ta': 57, '</s>': 3254, 'chữ': 34, 'tài': 34, 'mệnh': 14, 'khéo': 15, 'là': 169, 'ghét': 3, 'nhau': 49, 'trải': 3, 'qua': 25, 'một': 322, 'cuộc': 7, 'bể': 26, 'dâu': 8, 'những': 71, 'điều': 43, 'trông': 60, 'thấy': 73, 'mà': 109, 'đau': 24, 'đớn': 5, 'lòng': 175, 'lạ': 28, 'gì': 50, 'bỉ': 2, 'sắc': 19, 'tư': 11, 'phong': 35, 'trời': 92, 'xanh': 41, 'quen': 15, 'thói': 8, 'má': 10, 'hồng': 59, 'đánh': 16, 'ghen': 9, 'cảo': 2, 'thơm': 6, 'lần': 38, 'giở': 14, 'trước': 73, 'đèn': 16, 'tình': 127, 'có': 162, 'lục': 6, 'còn': 119, 'truyền': 10, 'sử': 1, 'rằng': 159, 'gia': 33, 'tĩnh': 2, 'triều': 8, 'minh': 16, 'bốn': 24, 'phương': 13, 'phẳng': 1, 'lặng': 8, 'hai': 69, 'kinh': 23, 'vững': 6, 'vàng': 82, 'nhà': 99, 'viên': 9, 'ngoại': 8, 'họ': 15, 'vương': 22, 'nghĩ': 51, 'cũng': 169, 'thường': 21, 'bực': 1, 'trung': 10, 'trai': 7, 'con': 57, 'thứ': 3, 'rốt': 1, 'quan': 31, 'nối': 12, 'dòng':

In [None]:
#Calculate probability for test case:
test = ['<s>','trăm', 'năm', 'trong', 'cõi', 'người', 'ta', '</s>']
pro = 1
for idx in range(len(test)-1):
  print(test[idx],test[idx+1], prob_bigram_dict[test[idx]][test[idx+1]])
  pro *= prob_bigram_dict[test[idx]][test[idx+1]]
print(test, pro)

<s> trăm 0.003995082974800246
trăm năm 0.2903225806451613
năm trong 0.019230769230769232
trong cõi 0.02857142857142857
cõi người 0.1
người ta 0.03139013452914798
ta </s> 0.43859649122807015
['<s>', 'trăm', 'năm', 'trong', 'cõi', 'người', 'ta', '</s>'] 8.773917799358789e-10
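
Multiplying many small probabilities quickly approaches the limits of floating-point precision, so sentence scores are often accumulated in log space instead. The sketch below assumes a nested dictionary shaped like `prob_bigram_dict` (history word, then next word, then probability); `sentence_log_prob` and `toy_table` are hypothetical names with toy values.

```python
import math

def sentence_log_prob(tokens, table):
    """Sum log P(w_i | w_{i-1}) over adjacent pairs; -inf if any pair is unseen."""
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = table.get(prev, {}).get(word, 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram makes the whole product zero
        total += math.log(p)
    return total

# toy probability table (illustrative values only)
toy_table = {"<s>": {"a": 0.5}, "a": {"</s>": 0.25}}
lp = sentence_log_prob(["<s>", "a", "</s>"], toy_table)
print(lp, math.exp(lp))  # exp(lp) recovers the plain product 0.5 * 0.25
```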


#####Using nltk library

In [None]:
import nltk
nltk.download('reuters')
nltk.download('punkt')
from nltk.corpus import reuters
from nltk import bigrams, trigrams, pad_sequence
from collections import Counter, defaultdict

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
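#Create model as a nested dictionary: model[hist_word][main_word]
model = defaultdict(lambda: defaultdict(lambda: 0))

#Create bigram count table
# note: the sentences already contain <s> and </s>, so padding again here also counts '<s> <s>' and '</s> </s>'
for sentence in sentences:
    for hist_word, main_word in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>"):
        model[(hist_word)][main_word] += 1
print(model)

#Convert counts to probabilities with add-one smoothing
# note: len(model) is the number of distinct history words, used here as the vocabulary size V;
# only bigrams seen in the corpus are smoothed, unseen bigrams still return 0
for hist_word in model:
    total_count = float(sum(model[hist_word].values()))
    for main_word in model[hist_word]:
        model[hist_word][main_word] = (model[hist_word][main_word]+1)/(total_count+len(model))
print(model)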

defaultdict(<function <lambda> at 0x78be23287be0>, {'<s>': defaultdict(<function <lambda>.<locals>.<lambda> at 0x78bdf04cb1c0>, {'<s>': 3254, 'trăm': 13, 'chữ': 9, 'trải': 2, 'những': 22, 'lạ': 9, 'trời': 10, 'cảo': 1, 'phong': 9, 'rằng': 33, 'bốn': 8, 'có': 29, 'gia': 3, 'một': 80, 'vương': 8, 'đầu': 7, 'thúy': 4, 'mai': 4, 'vân': 2, 'khuôn': 6, 'hoa': 23, 'mây': 5, 'kiều': 8, 'so': 3, 'làn': 1, 'sắc': 1, 'thông': 1, 'pha': 2, 'cung': 4, 'nghề': 4, 'khúc': 5, 'xuân': 9, 'êm': 1, 'tường': 3, 'ngày': 10, 'thiều': 1, 'cỏ': 4, 'cành': 4, 'thanh': 4, 'lễ': 8, 'gần': 4, 'chị': 4, 'dập': 2, 'ngựa': 1, 'ngổn': 2, 'thoi': 1, 'tà': 1, 'bước': 3, 'lần': 10, 'nao': 1, 'dịp': 2, 'sè': 1, 'dàu': 1, 'mà': 18, 'đạm': 4, 'nổi': 2, 'xôn': 3, 'kiếp': 14, 'nửa': 12, 'xa': 2, 'thuyền': 4, 'thì': 8, 'buồng': 11, 'dấu': 2, 'khóc': 5, 'khéo': 9, 'đã': 36, 'sắm': 3, 'vùi': 1, 'ấy': 6, 'lòng': 19, 'thoắt': 9, 'đau': 5, 'lời': 10, 'phũ': 1, 'sống': 4, 'nào': 10, 'sẵn': 7, 'gọi': 3, 'họa': 4, 'lầm': 3, 'sụp': 1,

In [None]:
#Calculate probability for test case:
test = ['<s>','trăm', 'năm', 'trong', 'cõi', 'người', 'ta', '</s>']
pro = 1
for idx in range(len(test)-1):
  print(test[idx],test[idx+1], model[test[idx]][test[idx+1]])
  pro *= model[test[idx]][test[idx+1]]
print(test, pro)

<s> trăm 0.0015703869882220977
trăm năm 0.004101722723543888
năm trong 0.0008133387555917039
trong cõi 0.0015923566878980893
cõi người 0.0008274720728175424
người ta 0.003041825095057034
ta </s> 0.010551948051948052
['<s>', 'trăm', 'năm', 'trong', 'cõi', 'người', 'ta', '</s>'] 2.2156698004513013e-19
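
For comparison, a minimal sketch of add-one (Laplace) smoothing that also assigns mass to unseen bigrams, so that each history's distribution sums to 1 over the vocabulary. It is written on a toy corpus under the assumption that V is the number of distinct tokens; `laplace_bigram_model` and `toy_sentences` are hypothetical names.

```python
from collections import Counter, defaultdict

def laplace_bigram_model(sentences):
    """Return (prob, vocab) where prob(word, prev) = (count(prev, word) + 1) / (count(prev) + V)."""
    counts = defaultdict(Counter)
    vocab = set()
    for tokens in sentences:
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    v = len(vocab)

    def prob(word, prev):
        # unseen bigrams get count 0 and therefore probability 1 / (count(prev) + V)
        return (counts[prev][word] + 1) / (sum(counts[prev].values()) + v)

    return prob, vocab

# toy corpus (illustrative only)
toy_sentences = [["<s>", "a", "b", "</s>"], ["<s>", "a", "c", "</s>"]]
laplace_prob, vocab = laplace_bigram_model(toy_sentences)
print(laplace_prob("b", "a"))                       # seen bigram
print(laplace_prob("</s>", "a"))                    # unseen bigram, still non-zero
print(sum(laplace_prob(w, "a") for w in vocab))     # sums to 1 up to rounding
```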


####Next word prediction

In [None]:
#Next word prediction
def next_word_prediction(text, model_type='nltk'):
  import random
  # pick the probability table: the add-one smoothed nltk model or the plain bigram table
  table = model if model_type == 'nltk' else prob_bigram_dict
  text = text.split()
  sentence_finished = False
  while not sentence_finished:
    r = random.random() # draw a random threshold in [0, 1)
    accumulator = .0
    for word in table[text[-1]].keys(): # every word seen after the last word of the text
      accumulator += table[text[-1]][word] # running sum of the probabilities
      if accumulator >= r: # take the first word whose cumulative probability reaches the threshold
        text.append(word)
        break
    # if the probabilities sum to less than r (possible with the smoothed model),
    # no word is appended and the loop draws a new threshold
    if text[-1] == "</s>": # stop once the end-of-sentence marker is generated
      sentence_finished = True
  print(' '.join([t for t in text if t])) # print the generated sentence

In [None]:
#Input string as string
text = "chữ tài chữ mệnh khéo là"
next_word_prediction(text, model_type='nltk')

chữ tài chữ mệnh khéo là thúc ông tơ với nàng đã chồn </s>


In [None]:
#Input string as string
text = "chữ tài chữ mệnh khéo là"
next_word_prediction(text, model_type='bigram-prob')

chữ tài chữ mệnh khéo là làm chơi </s>
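
The same cumulative-threshold idea can be expressed with `random.choices`, which draws an item in proportion to its weight. The sketch below is an alternative under stated assumptions, not the notebook's method: it takes a table shaped like `prob_bigram_dict`, adds a hypothetical `max_words` cap so generation always terminates, and stops if the last word has no known followers.

```python
import random

def sample_sentence(prefix, table, max_words=20, seed=None):
    """Extend prefix word by word, sampling each next word in proportion to its probability."""
    rng = random.Random(seed)
    words = prefix.split()
    for _ in range(max_words):
        followers = table.get(words[-1])
        if not followers:
            break  # last word never seen as a history: nothing to sample from
        candidates = list(followers)
        weights = [followers[w] for w in candidates]
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
        if words[-1] == "</s>":
            break  # end-of-sentence marker generated
    return " ".join(words)

# toy table with a single possible continuation at each step (illustrative only)
toy_table = {"a": {"b": 1.0}, "b": {"</s>": 1.0}}
print(sample_sentence("a", toy_table, seed=0))
```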


####Bigrams using Chi square

In [None]:
#Using Chi-Square stat
#Retrieve and store confusion matrix counts
hist_word_dict = {}
main_word_dict = {}
confusion_matrices = {}
#Create bigram confusion matrices
# loop through every bigram in the dictionary
for bigram,value in bigram_dict.items():
  # extract hist_word & main_word
  hist_word = bigram.split()[0]
  main_word = bigram.split()[1]
  # store hist_word
  if hist_word in hist_word_dict:
    hist_word_dict[hist_word] += 1
  else:
    hist_word_dict[hist_word] = 1
  # store main_word
  if main_word in main_word_dict:
    main_word_dict[main_word] += 1
  else:
    main_word_dict[main_word] = 1

  # calculate values for the 2x2 contingency table
  # note: hist_word_dict and main_word_dict are still being filled in this same loop and count
  # distinct bigram types rather than token occurrences, so the cells below are rough approximations
  # bigram occurrences
  bigramCount = bigram_dict[bigram]
  # main_word occurrences (non-bigram)
  main_wordCount = main_word_dict[main_word] - bigram_dict[bigram]
  # hist_word occurrences (non-bigram)
  hist_wordCount = hist_word_dict[hist_word]-bigram_dict[bigram]
  # 4th value: neither main_word nor hist_word
  neitherCount = len(bigram_dict) - bigramCount - main_wordCount - hist_wordCount
  # store values as a flat 2x2 table
  matrix = [bigramCount, main_wordCount, hist_wordCount, neitherCount]
  confusion_matrices[bigram] = matrix

#Calculate Chi-Square stat for each bigram
# create dictionary with "key-value" pairs of "bigram-chiSquare"
chi_square_stats = {}
for bigram,matrix in confusion_matrices.items():
    # save observed values
    ob11 = matrix[0]
    ob12 = matrix[1]
    ob21 = matrix[2]
    ob22 = matrix[3]
    # save col/row total values
    totC1 = ob11 + ob21
    totC2 = ob12 + ob22
    totR1 = ob11 + ob12
    totR2 = ob21 + ob22
    total = ob11 + ob12 + ob21 + ob22
    # calculate expected values
    ex11 = (totR1*totC1)/total
    ex12 = (totR1*totC2)/total
    ex21 = (totR2*totC1)/total
    ex22 = (totR2*totC2)/total
    # calculate chi_squared statistic
    chi11 = pow((ob11-ex11),2)/ex11
    chi12 = pow((ob12-ex12),2)/ex12
    chi21 = pow((ob21-ex21),2)/ex21
    chi22 = pow((ob22-ex22),2)/ex22
    chi_square = chi11 + chi12 + chi21 + chi22
    # store statistic in dictionary with bigram as key
    chi_square_stats[bigram] = chi_square

sorted_bigrams_chi_square = sorted(chi_square_stats.items(), key=operator.itemgetter(1), reverse = True)
sorted_bigrams_chi_square[:5]

[('</s> <s>', 87952363525.07454),
 ('ta </s>', 10388700.106867721),
 ('lòng </s>', 5091336.6162292035),
 ('bây giờ', 4396737.603232234),
 ('nhau </s>', 4022722.094115539)]
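
For reference, a sketch of the chi-square statistic for a single bigram computed from token counts, using the closed form for a 2x2 contingency table. The function name and argument names (`c12` for the bigram count, `c1` and `c2` for the counts of the first and second word in their positions, `n` for the total number of bigrams) are assumptions, and the numbers are toy values rather than counts from the corpus.

```python
def chi_square_2x2(c12, c1, c2, n):
    """Chi-square for a bigram from its count, its words' positional counts, and the total."""
    o11 = c12                 # w1 followed by w2
    o12 = c1 - c12            # w1 followed by something else
    o21 = c2 - c12            # something else followed by w2
    o22 = n - c1 - c2 + c12   # neither w1 first nor w2 second
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0  # a row or column is empty, the statistic is undefined
    return n * (o11 * o22 - o12 * o21) ** 2 / denom

print(chi_square_2x2(10, 10, 10, 100))  # the two words always occur together
print(chi_square_2x2(1, 10, 10, 100))   # co-occurrence exactly as expected by chance
```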

####Bigrams using PMI (Pointwise Mutual Information)

PMI

$PMI = \log_2\frac{P(\text{bigram})}{P(\text{hist word}) \cdot P(\text{main word})}$

Or:

$PMI = \log_2\frac{\frac{count(\text{bigram})}{N}}{\frac{count(\text{hist word})}{N} \cdot \frac{count(\text{main word})}{N}} = \log_2\frac{count(\text{bigram}) \cdot N}{count(\text{hist word}) \cdot count(\text{main word})}$

In [None]:
#Using PMI stat
# create dictionary with "key-value" pairs of "bigram-PMI"
PMI_stats = {}

#Retrieve and store per-word counts
hist_word_dict = {}
main_word_dict = {}

#Count, for each word, the distinct bigrams it starts (hist_word) or ends (main_word)

# loop through every bigram in the dictionary
for bigram,value in bigram_dict.items():
  hist_word, main_word = bigram.split()[0], bigram.split()[1] # extract hist_word & main_word
  # store hist_word
  if hist_word in hist_word_dict:
    hist_word_dict[hist_word] += 1
  else:
    hist_word_dict[hist_word] = 1
  # store main_word
  if main_word in main_word_dict:
    main_word_dict[main_word] += 1
  else:
    main_word_dict[main_word] = 1

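for bigram in bigram_dict:
    # split bigram into words
    hist_word, main_word = bigram.split()

    # calculate probability values
    # note: these are rough estimates: the bigram's token count over the number of distinct bigrams,
    # and the number of distinct bigrams a word starts or ends over the number of distinct words in that position
    prob_bigram = bigram_dict[bigram]/len(bigram_dict)
    prob_hist_word = hist_word_dict[hist_word]/len(hist_word_dict)
    prob_main_word = main_word_dict[main_word]/len(main_word_dict)

    # calculate PMI statistic
    # note: math.log is the natural logarithm, so the scores are in nats; math.log2 gives the base-2 form of the formula
    PMI = math.log(prob_bigram/(prob_hist_word*prob_main_word))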

    # store statistic in dictionary with bigram as key
    PMI_stats[bigram] = PMI

sorted_bigrams_PMI = sorted(PMI_stats.items(), key=operator.itemgetter(1), reverse = True)
sorted_bigrams_PMI[:5]

[('</s> <s>', 13.248096553070111),
 ('bâng khuâng', 7.4633487195908055),
 ('thảnh thơi', 6.952523095824814),
 ('nâu sồng', 6.952523095824814),
 ('lênh đênh', 6.952523095824814)]
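
A sketch that follows the PMI formula more literally, using token counts and a base-2 logarithm. It assumes N is the total number of tokens in the corpus, and `pmi_scores` and `toy_corpus` are hypothetical names; the scores it produces are not comparable with the values printed above.

```python
import math
from collections import Counter

def pmi_scores(sentences):
    """Return {(w1, w2): log2(count(w1, w2) * N / (count(w1) * count(w2)))}."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(unigrams.values())  # assumption: N is the total number of tokens
    return {
        (w1, w2): math.log2(count * n / (unigrams[w1] * unigrams[w2]))
        for (w1, w2), count in bigrams.items()
    }

# toy corpus (illustrative only)
toy_corpus = [["a", "b"], ["a", "b"], ["c", "d"]]
scores = pmi_scores(toy_corpus)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, score)
```

Because rare words have small counts in the denominator, PMI tends to rank low-frequency pairs highest, which is why a minimum-count threshold on bigrams is often applied first.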