# I. Loading Packages and Data

<p>In this notebook, we scrape the novel <em>Moby Dick</em> from <a href="https://www.gutenberg.org/">Project Gutenberg</a>. The source is the book's online HTML page: <a href="https://www.gutenberg.org/cache/epub/2701/pg2701-images.html">https://www.gutenberg.org/cache/epub/2701/pg2701-images.html</a>.</p>

In [1]:
import string
import requests
from bs4 import BeautifulSoup
import nltk
# nltk.download()
from collections import Counter

# Getting the Moby Dick HTML  
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 32000 and 34000
# print(text[32000:34000])

## 1.1 Data Description:

In [2]:
print(f"There are a total of {len(text)} tokens in the novel.")
print(f"There are a total of {len(set(text))} types in the novel.")

There are a total of 1382841 tokens in the novel.
There are a total of 103 types in the novel.


<p>As we can see, the HTML page was successfully scraped and parsed with <code>BeautifulSoup</code>. Because <code>text</code> is still a single string at this point, the counts above are character counts: before preprocessing there are <span style="color:red;"><u>1382841</u></span> characters in total and <span style="color:red;"><u>only 103</u></span> distinct characters. The next step is therefore to preprocess the novel, after which we report word-level token and type counts.</p>
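
Since `len()` and `set()` are applied to a string in the cell above, they count characters rather than words. A minimal sketch on a made-up sentence (the `sample` string is purely illustrative and not taken from the scraped page) shows how the two views differ:

```python
sample = "Call me Ishmael. Call me anything."

# len() on a string counts characters, including spaces and punctuation
n_chars = len(sample)
n_char_types = len(set(sample))

# Splitting on whitespace gives a rough word-level view instead
words = sample.split()
n_words = len(words)
n_word_types = len(set(words))

print("characters:", n_chars, "distinct characters:", n_char_types)
print("words:", n_words, "distinct words:", n_word_types)
```

The word-level numbers are the ones that matter for an n-gram model, which is why the counts are recomputed after tokenization.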


# II. Text Preprocessing:

In this mini project on text generation, we propose the following preprocessing steps *(we also assume that the corpus contains no misspellings)*:

1. **Text Cleaning**: <u>Lowercasing</u>, <u>Expanding contractions</u>, and <u>Removing Special Characters</u> that are not needed.

2. **Normalization**: We will practice <u>Stemming</u> and <u>Lemmatization</u> to reduce the number of `types`, which should lead to a better-performing model.

3. **Removing Stopwords**: There is some uncertainty about <u>*how to remove stopwords*</u>, since some stopwords carry contextual meaning and grammatical structure.

4. **Handling Less Occurrence Words**: Replacing words that occur only once with `UNK`.

5. **Tokenization**: We use <u>Sentence Tokenization</u> (splitting at sentence-ending periods) to divide the corpus in advance into a training set *(about 70% of the corpus)*, a validation set *(about 10%)*, and a test set *(about 20%)*, and <u>Word Tokenization</u> to build the model.


## 2.1 Text Cleaning:

In [3]:
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import contractions

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hort\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [552]:
def clean_text(text):
    # Expand contractions word by word (e.g. "don't" -> "do not"), then lowercase
    expanded_words = []
    for word in text.split():
        expanded_words.append(contractions.fix(word))
    expanded_text = ' '.join(expanded_words).lower()
    # Keep only ASCII letters, spaces, and periods; periods mark sentence ends for the later split.
    # (Alternatives tried: also keeping '?' and '!', or dropping '.' as well.)
    allowed_chars = set(string.ascii_letters + ' .')
    cleaned = ''.join([char for char in expanded_text if char in allowed_chars])
    # Split the cleaned string into word tokens
    clean_words = word_tokenize(cleaned)
    return clean_words

In [553]:
clean_words = clean_text(text)
clean_words[:10]

['moby',
 'dick',
 'or',
 'the',
 'whale',
 'by',
 'herman',
 'melville',
 'the',
 'project']

After cleaning:

In [554]:
print(f"There are a total of {len(clean_words)} tokens in the novel.")
print(f"There are a total of {len(set(clean_words))} types in the novel.")

There are a total of 224339 tokens in the novel.
There are a total of 19928 types in the novel.


## 2.2 Normalization:

In [555]:
from pattern.en import lemma

In [556]:
lemmatized_words = []
for word in clean_words:
    try:
        lemma_word = lemma(word)
        lemmatized_words.append(lemma_word)
    except StopIteration:
        # Handle the StopIteration error gracefully
        print(f"StopIteration raised for word: {word}")
        lemmatized_words.append(word)

After normalization:

In [557]:
print(f"There are a total of {len(lemmatized_words)} tokens in the novel.")
print(f"There are a total of {len(set(lemmatized_words))} types in the novel.")

There are a total of 224339 tokens in the novel.
There are a total of 14925 types in the novel.


## 2.3 Removing Stopwords

In [558]:
from nltk.corpus import stopwords
import random
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hort\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [559]:
english_stopwords = set(stopwords.words("english"))
words_ns = [word for word in lemmatized_words if word not in english_stopwords]

In [560]:
# Initialize a Counter object from our processed list of words
count_ns = Counter(words_ns)

# Store 10 most common words and their counts as top_ten
top_ten_ns = count_ns.most_common(10)

# Print the top ten words and their counts
print(f"After preprocessing: {top_ten_ns}")

After preprocessing: [('.', 7797), ('hi', 2522), ('whale', 1481), ('thi', 1411), ('one', 921), ('say', 603), ('like', 580), ('see', 573), ('upon', 567), ('ship', 556)]


In [561]:
# Initialize a Counter object from our processed list of words
count = Counter(lemmatized_words)

# Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(f"After preprocessing: {top_ten}")

After preprocessing: [('the', 14615), ('.', 7797), ('be', 6999), ('of', 6723), ('and', 6449), ('a', 6426), ('to', 4659), ('in', 4212), ('that', 3030), ('it', 2880)]


From this point onward, we use the stopword-free list `words_ns` for the next steps.
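
The top-ten list without stopwords still contains `hi` and `thi`, which appear to be `his` and `this` after lemmatization; the mangled forms are not in the stopword list, so they pass the filter. One possible remedy, sketched below, is to filter on the original word before lemmatizing. The `toy_lemma` function and `toy_stopwords` set are hypothetical stand-ins, not `pattern.en.lemma` or the NLTK list:

```python
def toy_lemma(word):
    # Stand-in lemmatizer (assumption): naively strips a trailing "s"
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

toy_stopwords = {"his", "this", "the", "of"}  # illustrative subset only

def lemmatize_then_remove(words, stopwords, lemmatize):
    # Order used in the cells above: mangled stopwords slip through the filter
    return [w for w in map(lemmatize, words) if w not in stopwords]

def remove_then_lemmatize(words, stopwords, lemmatize):
    # Alternative order: drop stopwords by surface form, then lemmatize the rest
    return [lemmatize(w) for w in words if w not in stopwords]

words = ["this", "whale", "raised", "his", "flukes"]
print(lemmatize_then_remove(words, toy_stopwords, toy_lemma))
print(remove_then_lemmatize(words, toy_stopwords, toy_lemma))
```

Whether this reordering is preferable depends on the lemmatizer; with a lemmatizer that leaves function words untouched, both orders would give the same result.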

## 2.4 Handling Less Occurrence Words:

In [562]:
for i, count in enumerate(Counter(words_ns).most_common()):
  print(i+1, count)

1 ('.', 7797)
2 ('hi', 2522)
3 ('whale', 1481)
4 ('thi', 1411)
5 ('one', 921)
6 ('say', 603)
7 ('like', 580)
8 ('see', 573)
9 ('upon', 567)
10 ('ship', 556)
11 ('ahab', 498)
12 ('man', 498)
13 ('ye', 486)
14 ('go', 482)
15 ('sea', 466)
16 ('would', 460)
17 ('seem', 459)
18 ('old', 442)
19 ('time', 433)
20 ('boat', 424)
21 ('come', 406)
22 ('make', 372)
23 ('though', 366)
24 ('yet', 339)
25 ('captain', 337)
26 ('head', 333)
27 ('hand', 329)
28 ('look', 323)
29 ('long', 320)
30 ('chapter', 314)
31 ('think', 314)
32 ('thing', 312)
33 ('know', 311)
34 ('take', 309)
35 ('still', 307)
36 ('great', 305)
37 ('must', 292)
38 ('stand', 292)
39 ('two', 286)
40 ('last', 278)
41 ('way', 278)
42 ('thou', 264)
43 ('stubb', 257)
44 ('round', 254)
45 ('white', 251)
46 ('may', 250)
47 ('u', 250)
48 ('little', 249)
49 ('day', 248)
50 ('sperm', 241)
51 ('eye', 241)
52 ('three', 240)
53 ('queequeg', 239)
54 ('first', 236)
55 ('water', 231)
56 ('every', 230)
57 ('men', 230)
58 ('much', 222)
59 ('well', 216)

In [563]:
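# Frequency of every token in the stopword-free list
fq = Counter(words_ns)
tokens_new = []
for token in words_ns:
  # Keep tokens seen more than once; replace tokens seen only once with 'UNK'
  if fq[token] > 1:
    tokens_new.append(token)
  else:
    tokens_new.append('UNK')
print(tokens_new)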



In [564]:
print(f"There are a total of {len(tokens_new)} tokens in the novel.")
print(f"There are a total of {len(set(tokens_new))} types in the novel.")

There are a total of 121297 tokens in the novel.
There are a total of 7498 types in the novel.
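
The replacement above counts frequencies over the whole corpus, so the vocabulary also reflects words from the later validation and test portions. A common alternative is to build the vocabulary from the training tokens only. The sketch below illustrates that variant on toy data; the `replace_rare` helper and its `min_count` parameter are assumptions introduced here, not part of the notebook:

```python
from collections import Counter

def replace_rare(tokens, vocab_source, min_count=2, unk="UNK"):
    # Count frequencies on vocab_source only (e.g. the training tokens)
    freq = Counter(vocab_source)
    # Tokens below the threshold, or never seen in vocab_source, become unk
    return [t if freq[t] >= min_count else unk for t in tokens]

train = ["whale", "sea", "whale", "ship", "sea"]
test = ["whale", "harpoon", "ship"]

print(replace_rare(train, train))
print(replace_rare(test, train))
```

With `min_count=2` this matches the "more than once" rule used above, while unseen test words such as `harpoon` also map to `UNK`.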


## 2.5 Train-Validation-Test Split:

In [565]:
# Positions of every sentence-ending period in the token list
period_indices = [i for i, word in enumerate(tokens_new) if word == "."]

def nearest_period_indices(how_to_split):
    # Return the token index of the period closest to the requested split position
    differences = [abs(num - how_to_split) for num in period_indices]
    nearest = differences.index(min(differences))
    return period_indices[nearest]

In [566]:
data_train = tokens_new[:nearest_period_indices(int(0.7*len(tokens_new)))]
data_val = tokens_new[nearest_period_indices(int(0.7*len(tokens_new))):nearest_period_indices(int(0.8*len(tokens_new)))]
data_test = tokens_new[nearest_period_indices(int(0.8*len(tokens_new))):]

print('Train:', len(data_train))
print('Val:', len(data_val))
print('Test:', len(data_test))

Train: 84910
Val: 12127
Test: 24260


# III. Building model:

## 3.1 N-Gram model:

### 3.1.1 N_gram Count:

In [567]:
# Drop the period tokens from the training data before counting n-grams
data_train = [token for token in data_train if token != '.']

In [568]:
def n_gram_counts(tokens, n):
  n_tokens = len(tokens)
  result = dict()
  for i in range(n_tokens - n + 1):
    key = tuple(tokens[i:i+n])
    if key in result:
      result[key] += 1
    else:
      result[key] = 1
  return result

uni_counts = Counter(data_train).most_common()
uni_counts = dict(uni_counts)

bi_counts = n_gram_counts(data_train, 2)
tri_counts = n_gram_counts(data_train, 3)
four_counts = n_gram_counts(data_train, 4)
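
The same sliding-window counts can also be written with `Counter` and `zip`. This is only an alternative sketch on a toy token list (`ngram_counter` and `toy` are names introduced here); note that it keys unigrams by 1-tuples, whereas `uni_counts` above uses plain strings:

```python
from collections import Counter

def ngram_counter(tokens, n):
    # zip over n shifted views of the list yields every window of length n
    return Counter(zip(*(tokens[i:] for i in range(n))))

toy = ["sperm", "whale", "white", "whale", "sperm", "whale"]
bigrams = ngram_counter(toy, 2)

print(bigrams[("sperm", "whale")])
print(sum(bigrams.values()))  # number of bigram windows: len(toy) - 1
```

A `Counter` returns 0 for unseen n-grams, which avoids the explicit `if key in result` check.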

In [569]:
bi_counts

{('moby', 'dick'): 55,
 ('dick', 'whale'): 5,
 ('whale', 'herman'): 3,
 ('herman', 'melville'): 4,
 ('melville', 'project'): 1,
 ('project', 'gutenberg'): 4,
 ('gutenberg', 'ebook'): 2,
 ('ebook', 'moby'): 2,
 ('melville', 'thi'): 1,
 ('thi', 'ebook'): 2,
 ('ebook', 'use'): 1,
 ('use', 'anyone'): 1,
 ('anyone', 'anywhere'): 1,
 ('anywhere', 'cost'): 1,
 ('cost', 'almost'): 1,
 ('almost', 'restriction'): 1,
 ('restriction', 'whatsoever'): 1,
 ('whatsoever', 'may'): 1,
 ('may', 'copy'): 1,
 ('copy', 'give'): 1,
 ('give', 'away'): 1,
 ('away', 'reuse'): 1,
 ('reuse', 'term'): 1,
 ('term', 'project'): 1,
 ('gutenberg', 'license'): 1,
 ('license', 'include'): 1,
 ('include', 'thi'): 2,
 ('ebook', 'online'): 1,
 ('online', 'www.gutenberg.org'): 1,
 ('www.gutenberg.org', 'title'): 1,
 ('title', 'moby'): 1,
 ('whale', 'author'): 2,
 ('author', 'herman'): 1,
 ('melville', 'UNK'): 1,
 ('UNK', 'date'): 1,
 ('date', 'december'): 1,
 ('december', 'ebook'): 1,
 ('ebook', 'last'): 1,
 ('last', 'updat

In [570]:
sorted_bi_counts_values = dict(sorted(bi_counts.items(), key=lambda item: item[1], reverse=True))
sorted_bi_counts_values

{('UNK', 'UNK'): 556,
 ('sperm', 'whale'): 170,
 ('hi', 'UNK'): 109,
 ('UNK', 'hi'): 90,
 ('UNK', 'thi'): 74,
 ('whale', 'UNK'): 63,
 ('hi', 'head'): 57,
 ('white', 'whale'): 56,
 ('moby', 'dick'): 55,
 ('UNK', 'whale'): 55,
 ('right', 'whale'): 55,
 ('like', 'UNK'): 52,
 ('captain', 'ahab'): 47,
 ('thi', 'UNK'): 45,
 ('UNK', 'one'): 39,
 ('old', 'UNK'): 37,
 ('old', 'man'): 35,
 ('hi', 'hand'): 34,
 ('UNK', 'old'): 30,
 ('hi', 'face'): 30,
 ('captain', 'peleg'): 30,
 ('UNK', 'ye'): 28,
 ('hi', 'eye'): 28,
 ('UNK', 'like'): 27,
 ('one', 'UNK'): 26,
 ('UNK', 'say'): 26,
 ('let', 'u'): 25,
 ('sea', 'UNK'): 24,
 ('upon', 'hi'): 24,
 ('ye', 'UNK'): 24,
 ('thi', 'whale'): 23,
 ('well', 'know'): 23,
 ('UNK', 'upon'): 23,
 ('ship', 'UNK'): 23,
 ('sort', 'UNK'): 23,
 ('seem', 'UNK'): 23,
 ('go', 'UNK'): 22,
 ('make', 'UNK'): 22,
 ('UNK', 'thing'): 22,
 ('one', 'hand'): 22,
 ('UNK', 'though'): 21,
 ('UNK', 'would'): 20,
 ('UNK', 'man'): 20,
 ('every', 'one'): 20,
 ('yet', 'UNK'): 20,
 ('hi', 'm

In [571]:
four_counts

{('moby', 'dick', 'whale', 'herman'): 2,
 ('dick', 'whale', 'herman', 'melville'): 2,
 ('whale', 'herman', 'melville', 'project'): 1,
 ('herman', 'melville', 'project', 'gutenberg'): 1,
 ('melville', 'project', 'gutenberg', 'ebook'): 1,
 ('project', 'gutenberg', 'ebook', 'moby'): 2,
 ('gutenberg', 'ebook', 'moby', 'dick'): 2,
 ('ebook', 'moby', 'dick', 'whale'): 2,
 ('whale', 'herman', 'melville', 'thi'): 1,
 ('herman', 'melville', 'thi', 'ebook'): 1,
 ('melville', 'thi', 'ebook', 'use'): 1,
 ('thi', 'ebook', 'use', 'anyone'): 1,
 ('ebook', 'use', 'anyone', 'anywhere'): 1,
 ('use', 'anyone', 'anywhere', 'cost'): 1,
 ('anyone', 'anywhere', 'cost', 'almost'): 1,
 ('anywhere', 'cost', 'almost', 'restriction'): 1,
 ('cost', 'almost', 'restriction', 'whatsoever'): 1,
 ('almost', 'restriction', 'whatsoever', 'may'): 1,
 ('restriction', 'whatsoever', 'may', 'copy'): 1,
 ('whatsoever', 'may', 'copy', 'give'): 1,
 ('may', 'copy', 'give', 'away'): 1,
 ('copy', 'give', 'away', 'reuse'): 1,
 ('giv

In [572]:
sorted_4_counts_values = dict(sorted(four_counts.items(), key=lambda item: item[1], reverse=True))
sorted_4_counts_values

{('UNK', 'UNK', 'UNK', 'UNK'): 23,
 ('book', 'i.', 'folio', 'chapter'): 6,
 ('book', 'ii', 'octavo', 'chapter'): 5,
 ('UNK', 'UNK', 'sperm', 'whale'): 4,
 ('press', 'hi', 'forehead', 'mine'): 3,
 ('morn', 'ye', 'shipmate', 'morn'): 3,
 ('book', 'iii', 'duodecimo', 'chapter'): 3,
 ('sperm', 'whale', 'sperm', 'whale'): 3,
 ('sperm', 'whale', 'right', 'whale'): 3,
 ('moby', 'dick', 'whale', 'herman'): 2,
 ('dick', 'whale', 'herman', 'melville'): 2,
 ('project', 'gutenberg', 'ebook', 'moby'): 2,
 ('gutenberg', 'ebook', 'moby', 'dick'): 2,
 ('ebook', 'moby', 'dick', 'whale'): 2,
 ('chapter', 'lee', 'shore', 'chapter'): 2,
 ('chapter', 'knight', 'squire', 'chapter'): 2,
 ('chapter', 'enter', 'ahab', 'stubb'): 2,
 ('chapter', 'ahab', 'boat', 'crew'): 2,
 ('ahab', 'boat', 'crew', 'fedallah'): 2,
 ('chapter', 'monstrou', 'picture', 'whale'): 2,
 ('chapter', 'les', 'erroneou', 'picture'): 2,
 ('les', 'erroneou', 'picture', 'whale'): 2,
 ('erroneou', 'picture', 'whale', 'true'): 2,
 ('picture', '

### 3.1.2 Model:

In [574]:
# import random

# class NGramModelWithBackoff:
#     def __init__(self, uni_counts, bi_counts, tri_counts, four_counts):
#         self.uni_counts = uni_counts
#         self.bi_counts = bi_counts
#         self.tri_counts = tri_counts
#         self.four_counts = four_counts
#         self.total_count = sum(uni_counts.values())
#         self.vocab_size = len(uni_counts)

#     def _get_probability(self, ngram):
#         if len(ngram) == 1:
#             return self.uni_counts.get(ngram[0], 0) / self.total_count
#         elif len(ngram) == 2:
#             return self.bi_counts.get(ngram, 0) / self.uni_counts.get(ngram[0], 1)
#         elif len(ngram) == 3:
#             return self.tri_counts.get(ngram, 0) / self.bi_counts.get(ngram[:2], 1)
#         elif len(ngram) == 4:
#             return self.four_counts.get(ngram, 0) / self.tri_counts.get(ngram[:3], 1)
#         else:
#             return 0

#     def generate_text(self, length):
#         generated_text = []
#         start_word = random.choice(list(self.uni_counts.keys()))
#         generated_text.append(start_word)

#         for i in range(length-1):
#             current_word = generated_text[-1]
#             if len(generated_text) >= 4 and (generated_text[-4], generated_text[-3], generated_text[-2], current_word) in self.four_counts:
#                 ngram = (generated_text[-4], generated_text[-3], generated_text[-2], current_word)
#             elif len(generated_text) >= 3 and (generated_text[-3], generated_text[-2], current_word) in self.tri_counts:
#                 ngram = (generated_text[-3], generated_text[-2], current_word)
#             elif len(generated_text) >= 2 and (generated_text[-2], current_word) in self.bi_counts:
#                 ngram = (generated_text[-2], current_word)
#             else:
#                 ngram = (current_word,)

#             next_word = self._sample_next_word(ngram)
#             generated_text.append(next_word)

#         return ' '.join(generated_text)

#     def _sample_next_word(self, ngram):
#         next_word_probs = {}
#         total_prob = 0
#         for word in self.uni_counts.keys():
#             next_ngram = ngram + (word,)
#             prob = self._get_probability(next_ngram)
#             next_word_probs[word] = prob
#             total_prob += prob

#         if total_prob == 0:
#             return random.choice(list(self.uni_counts.keys()))

#         return random.choices(list(next_word_probs.keys()), list(next_word_probs.values()))[0]

# # Example usage:
# uni_counts = {'This': 1, 'is': 2, 'a': 1, 'sample': 1, 'text': 1, 'for': 1, 'building': 1, '4-gram': 1, 'model': 1, 'with': 1, 'back-off.': 1}
# bi_counts = {('This', 'is'): 1, ('is', 'a'): 1, ('a', 'sample'): 1, ('sample', 'text'): 1, ('text', 'for'): 1, ('for', 'building'): 1, ('building', 'a'): 1, ('a', '4-gram'): 1, ('4-gram', 'model'): 1, ('model', 'with'): 1, ('with', 'back-off.'): 1}
# tri_counts = {('This', 'is', 'a'): 1, ('is', 'a', 'sample'): 1, ('a', 'sample', 'text'): 1, ('sample', 'text', 'for'): 1, ('text', 'for', 'building'): 1, ('for', 'building', 'a'): 1, ('building', 'a', '4-gram'): 1, ('a', '4-gram', 'model'): 1, ('4-gram', 'model', 'with'): 1, ('model', 'with', 'back-off.'): 1}
# four_counts = {('This', 'is', 'a', 'sample'): 1, ('is', 'a', 'sample', 'text'): 1, ('a', 'sample', 'text', 'for'): 1, ('sample', 'text', 'for', 'building'): 1, ('text', 'for', 'building', 'a'): 1, ('for', 'building', 'a', '4-gram'): 1, ('building', 'a', '4-gram', 'model'): 1, ('a', '4-gram', 'model', 'with'): 1, ('4-gram', 'model', 'with', 'back-off.'): 1}

# model = NGramModelWithBackoff(uni_counts, bi_counts, tri_counts, four_counts)

# generated_text = model.generate_text(20)
# print(generated_text)


In [576]:
# #GPT_new_new
# import random

# class NGramModelWithBackoff:
    
#     @staticmethod
#     def n_gram_counts(tokens, n):
#         n_tokens = len(tokens)
#         result = {}
#         for i in range(n_tokens - n + 1):
#             key = tuple(tokens[i:i+n])
#             if key in result:
#                 result[key] += 1
#             else:
#                 result[key] = 1
#         return result
    
#     def __init__(self, tokens):
#         # Initialize n-gram counts
#         self.uni_counts = self.n_gram_counts(tokens, 1)
#         self.bi_counts = self.n_gram_counts(tokens, 2)
#         self.tri_counts = self.n_gram_counts(tokens, 3)
#         self.four_counts = self.n_gram_counts(tokens, 4)
        
#         # Total count of tokens
#         self.total_count = len(tokens)
        
#         # Vocabulary size
#         self.vocab = len(set(tokens))  # Using set to get unique tokens
    
#     # def _sample_next_word(self, ngram):
#     #     next_word_probs = {}
#     #     total_prob = 0
#     #     for word in self.uni_counts.keys():
#     #         next_ngram = ngram + (word,)
#     #         prob = self._get_probability(next_ngram)
#     #         next_word_probs[word] = prob
#     #         total_prob += prob

#     #     if total_prob == 0:
#     #         return random.choice(list(self.uni_counts.keys()))

#     #     return random.choices(list(next_word_probs.keys()), list(next_word_probs.values()))[0]

            
    
# ngram_model = NGramModelWithBackoff(data_train)

In [577]:
# def next_word(given_ngram, four_counts=four_counts):
#     max_count = 0
#     most_common_4gram = None
    
#     for four_gram, count in four_counts.items():
#         if four_gram[:3] == given_ngram and count > max_count:
#             max_count = count
#             most_common_4gram = four_gram
    
#     return most_common_4gram[-1] if most_common_4gram else "None"

In [578]:
# def next_word(given_ngram):
#     max_count = 0
#     most_common_ngram = None
    
#     # Check 4-gram
#     if len(given_ngram) == 3:
#         for four_gram, count in four_counts.items():
#             if four_gram[:3] == given_ngram and count > max_count:
#                 max_count = count
#                 most_common_ngram = four_gram
#         if most_common_ngram:
#             return most_common_ngram[-1]
    
#     # Check trigram if 4-gram not found
#     if len(given_ngram) >= 2:
#         for tri_gram, count in tri_counts.items():
#             if tri_gram[:2] == given_ngram and count > max_count:
#                 max_count = count
#                 most_common_ngram = tri_gram
#         if most_common_ngram:
#             return most_common_ngram[-1]
    
#     # Check bigram if trigram or 4-gram not found
#     if len(given_ngram) >= 1:
#         for bi_gram, count in bi_counts.items():
#             if bi_gram[:1] == given_ngram and count > max_count:
#                 max_count = count
#                 most_common_ngram = bi_gram
#         if most_common_ngram:
#             return most_common_ngram[-1]
    
#     # Check unigram if no match found
#     for uni_gram, count in uni_counts.items():
#         if count > max_count:
#             max_count = count
#             most_common_ngram = uni_gram
    
#     return most_common_ngram if most_common_ngram else "None"

In [580]:
# def next_word(given_ngram, ngram_counts=n_gram_counts):
#     max_count = 0
#     most_common_ngram = None
    
#     for n in range(4, 0, -1):  # Iterate from 4-gram down to unigram
#         ngram_counts=n_gram_counts(data_train,n)
#         for ngram, count in ngram_counts.items():
#             if ngram[:n-1] == given_ngram[-(n-1):] and count > max_count:
#                 max_count = count
#                 most_common_ngram = ngram
#         if most_common_ngram:
#             return (most_common_ngram[-1],)
    
#     return most_common_ngram[-1] if most_common_ngram else None


In [583]:
# def generate_next_word(given_trigram):
#     max_count = 0
#     next_word = None
    
#     # Check if the given trigram exists in the four-grams
#     for four_gram, count in four_counts.items():
#         if four_gram[:3] == given_trigram:
#             # Update the most common next word if count is greater
#             if count > max_count:
#                 max_count = count
#                 next_word = (four_gram[3],)
    
#     for tri_gram, count in tri_counts.items():
#         if tri_gram[:2] == given_trigram[1:]:
#             # Update the most common next word if count is greater
#             if count > max_count:
#                 max_count = count
#                 next_word = (tri_gram[2],)
    
#     for bi_gram, count in bi_counts.items():
#         if bi_gram[:1] == given_trigram[2]:
#             # Update the most common next word if count is greater
#             if count > max_count:
#                 max_count = count
#                 next_word = (bi_gram[1],)
#     return next_word

In [584]:
def generate_next_word(given_trigram):
    # Returns the continuation with the highest raw count across the 4-gram,
    # trigram, and bigram tables. max_count is shared by all three loops, so a
    # lower-order n-gram with a larger count replaces a higher-order match.
    max_count = 0
    next_word = None

    # 4-grams whose first three words equal the given trigram
    for four_gram, count in four_counts.items():
        if four_gram[:3] == given_trigram:
            if count > max_count:
                max_count = count
                next_word = four_gram[3]

    # Trigrams whose first two words equal the last two words of the given trigram
    for tri_gram, count in tri_counts.items():
        if tri_gram[:2] == given_trigram[1:]:
            if count > max_count:
                max_count = count
                next_word = tri_gram[2]

    # Bigrams whose first word equals the last word of the given trigram
    for bi_gram, count in bi_counts.items():
        if bi_gram[:1] == given_trigram[2:]:
            if count > max_count:
                max_count = count
                next_word = bi_gram[1]

    # If nothing matched, fall back to the most frequent unigram
    if next_word is None:
        next_word = max(uni_counts, key=uni_counts.get)

    return next_word

In [585]:
def generate_starting_trigram():
    # Sample the first word at random, weighted by its unigram count
    first_word = random.choices(list(uni_counts.keys()), weights=list(uni_counts.values()))[0]
    # Second word: the most frequent follower of first_word in the bigram counts
    second_word = max((word[1] for word in bi_counts.keys() if word[0] == first_word), key=lambda x: bi_counts.get((first_word, x), 0))
    # Third word: the most frequent follower of (first_word, second_word) in the trigram counts
    third_word = max((word[2] for word in tri_counts.keys() if word[:2] == (first_word, second_word)), key=lambda x: tri_counts.get((first_word, second_word, x), 0))
    return (first_word, second_word, third_word)

In [586]:
# def generate_text(starting_trigram, length=10):
#     text = list(starting_trigram)
#     for _ in range(length):
#         next_word = generate_next_word((text[-2], text[-1]))
#         text.append(next_word)
#     return ' '.join(text)

In [587]:
starting_trigram = generate_starting_trigram()
# starting_trigram+generate_next_word(starting_trigram)

In [621]:
def generate_text(length):
    text = tuple()
    starting_trigram = generate_starting_trigram()
    
    for _ in range(length - 2):
        next_word = generate_next_word(starting_trigram)
        text+=starting_trigram
        text += (next_word,)
        starting_trigram = text[-2:]
    
    return ' '.join(text)
generated_text = generate_text(length=15)
print(generated_text)

barbarian dine cabin chapter cabin chapter UNK chapter UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK
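
The output above is longer than the requested 15 words and collapses into `UNK`. A likely cause is that `generate_text` re-appends the context on every pass and then keeps only `text[-2:]`: from the second iteration on, the two-word context matches none of the slice comparisons in `generate_next_word`, so the most frequent unigram (`UNK`) is returned each time. The sketch below shows one way to keep a three-word context and stop at the longest matching n-gram. It runs on a toy sentence, and all names in it (`count_ngrams`, `backoff_next_word`, `generate`, `tables`) are assumptions introduced for illustration, not the notebook's functions:

```python
from collections import Counter

def count_ngrams(tokens, n):
    # Count every window of n consecutive tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def backoff_next_word(context, tables, unigrams):
    # tables maps a context length k to the Counter of (k+1)-grams
    for k in sorted(tables, reverse=True):
        if len(context) < k:
            continue
        suffix = tuple(context[-k:])
        candidates = {g[-1]: c for g, c in tables[k].items() if g[:k] == suffix}
        if candidates:
            # Stop at the longest context that was seen in training
            return max(candidates, key=candidates.get)
    # No context matched at any order: most frequent single word
    return max(unigrams, key=unigrams.get)

def generate(start, length, tables, unigrams):
    # Seed the text once, then append exactly one new word per step
    text = list(start)
    while len(text) < length:
        text.append(backoff_next_word(text[-3:], tables, unigrams))
    return text

toy = "the white whale swam and the white whale dived and the ship sank".split()
tables = {k: count_ngrams(toy, k + 1) for k in (1, 2, 3)}
unigrams = Counter(toy)

out = generate(("the", "white", "whale"), 8, tables, unigrams)
print(len(out), " ".join(out))
```

Because this picks the single most frequent continuation at each step, it can still loop on repeated phrases; sampling continuations in proportion to their counts would be one way to add variety.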


In [601]:
def calculate_word_probability(trigram, next_word):
    # Back-off estimate of P(next_word | context): use the longest context
    # (3, then 2, then 1 preceding words) that was seen in training and return
    # count(context + next_word) / count(context + any word).
    trigram = tuple(trigram)

    # (n-gram table, number of context words), from highest order to lowest
    for counts, k in ((four_counts, 3), (tri_counts, 2), (bi_counts, 1)):
        if len(trigram) < k:
            continue  # not enough context words for this order
        context = trigram[-k:]
        match_count = 0  # occurrences of context followed by next_word
        total_count = 0  # occurrences of context followed by any word
        for ngram, count in counts.items():
            if ngram[:k] == context:
                total_count += count
                if ngram[k] == next_word:
                    match_count = count
        # Keeping counts from different orders separate keeps the ratio a valid probability
        if total_count > 0:
            return match_count / total_count

    # Context never seen at any order: fall back to the unigram relative frequency
    return uni_counts.get(next_word, 0) / sum(uni_counts.values())

In [606]:
import math


def calculate_perplexity(generated_text):
    # Tokenize the generated text if it is passed as a single string
    test_words = generated_text.split() if isinstance(generated_text, str) else list(generated_text)
    N = len(test_words)  # Total number of words in the generated text
    total_log_probability = 0.0
    scored_words = 0

    # Sum the negative log2 probability of each word given the three words before it
    for i in range(3, N):
        context = tuple(test_words[i-3:i])  # Context: the previous three words
        word = test_words[i]  # Current word
        word_probability = calculate_word_probability(context, word)
        # Zero-probability words are skipped (log2(0) is undefined), so the result is optimistic
        if word_probability > 0:
            total_log_probability += -math.log2(word_probability)
            scored_words += 1

    # Perplexity is 2 raised to the average negative log2 probability of the scored words
    if scored_words == 0:
        return float('inf')
    return pow(2, total_log_probability / scored_words)
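
The perplexity function above is built on the standard definition: 2 raised to the average negative log2 probability per word. A small worked example may make the scale easier to read. The helper `perplexity_from_probs` and the probabilities are made up for illustration and are not taken from the corpus:

```python
import math


def perplexity_from_probs(probs):
    """Perplexity of a sequence given its per-word probabilities (all > 0)."""
    bits = -sum(math.log2(p) for p in probs)
    return 2 ** (bits / len(probs))


# Four words, each predicted with probability 1/4: as uncertain as a fair 4-way choice
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
# One easy word (1/2) and one hard word (1/8) average out to the same 2 bits per word
mixed = perplexity_from_probs([0.5, 0.125])
print(uniform, mixed)  # 4.0 4.0
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four words. Lower values indicate a better fit to the text being scored.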

In [589]:
# import random

# class NGramModelWithBackoff:
#     def n_gram_counts(text, n):
#         n_tokens = len(text)
#         result = dict()
#         for i in range(n_tokens - n + 1):
#             key = tuple(text[i:i+n])
#             if key in result:
#                result[key] += 1
#             else:
#                 result[key] = 1
#         return result
    
#     def __init__(self, text):
#         self.uni_counts = n_gram_counts(text, 1)
#         self.bi_counts = n_gram_counts(text, 2)
#         self.tri_counts = n_gram_counts(text, 3)
#         self.four_counts = n_gram_counts(text, 4)
#         self.total_count = sum(self.uni_counts.values())
#         self.vocab = len(self.uni_counts)

#     def _get_probability(self, ngram):
#         if len(ngram) == 1:
#             return self.uni_counts.get(ngram, 0) / self.total_count
#         elif len(ngram) == 2:
#             return self.bi_counts.get(ngram, 0) / self.uni_counts.get(ngram[0], 1)
#         elif len(ngram) == 3:
#             return self.tri_counts.get(ngram, 0) / self.bi_counts.get(ngram[:2], 1)
#         elif len(ngram) == 4:
#             return self.four_counts.get(ngram, 0) / self.tri_counts.get(ngram[:3], 1)
#         else:
#             return 0
        
#     def generate_starting_trigram():
#         # Sample the first word at random, weighted by unigram counts
#         first_word = random.choices(list(uni_counts.keys()), weights=uni_counts.values())[0]
#         # Pick the word that most often follows first_word, according to the bigram counts
#         second_word = max((word[1] for word in bi_counts.keys() if word[0] == first_word), key=lambda x: bi_counts.get((first_word, x), 0))
#         # Pick the word that most often follows (first_word, second_word), according to the trigram counts
#         third_word = max((word[2] for word in tri_counts.keys() if word[:2] == (first_word, second_word)), key=lambda x: tri_counts.get((first_word, second_word, x), 0))
#         return (first_word, second_word, third_word)

#     def generate_text(length):
#         text = tuple()
#         starting_trigram = generate_starting_trigram()
        
#         for _ in range(length - 2):
#             next_word = generate_next_word(starting_trigram)
#             text+=starting_trigram
#             text += (next_word,)
#             starting_trigram = text[-2:]
        
#         return ' '.join(text)

#     def generate_next_word(given_trigram):
#         max_count = 0
#         next_word = None
        
#         # Check if the given trigram exists in the four-grams
#         for four_gram, count in four_counts.items():
#             if four_gram[:3] == given_trigram:
#                 # Update the most common next word if count is greater
#                 if count > max_count:
#                     max_count = count
#                     next_word = four_gram[3]
        
#         # Check if the given trigram exists in the trigrams
#         for tri_gram, count in tri_counts.items():
#             if tri_gram[:2] == given_trigram[1:]:
#                 # Update the most common next word if count is greater
#                 if count > max_count:
#                     max_count = count
#                     next_word = tri_gram[2]
        
#         # Check if the given trigram exists in the bigrams
#         for bi_gram, count in bi_counts.items():
#             if bi_gram[:1] == given_trigram[2:]:
#                 # Update the most common next word if count is greater
#                 if count > max_count:
#                     max_count = count
#                     next_word = bi_gram[1]
        
#         # If none of the above cases matched, return the most common word based on unigram counts
#         if next_word is None:
#             next_word = max(uni_counts, key=uni_counts.get)
        
#         return next_word


# model = NGramModelWithBackoff(data_train)

# generated_text = model.generate_text(20)
# print(generated_text)
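
The class draft above is commented out, and its methods are written without `self`, so it would not run as it stands. A compact, runnable variant might look like the following. It is only a sketch: the class name `NGramBackoffSketch`, its method names, the toy sentence, and the greedy most-frequent-continuation rule are assumptions for illustration, not the notebook's final model.

```python
import random
from collections import Counter


class NGramBackoffSketch:
    """Toy model: counts 1- to 4-grams and backs off to shorter contexts."""

    def __init__(self, tokens):
        # counts[n] maps each n-gram tuple to its frequency
        self.counts = {
            n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(1, 5)
        }

    def next_word(self, context):
        """Return the most frequent continuation of the longest matching context."""
        context = tuple(context)[-3:]
        while context:
            k = len(context)
            candidates = {
                ngram[k]: count
                for ngram, count in self.counts[k + 1].items()
                if ngram[:k] == context
            }
            if candidates:
                return max(candidates, key=candidates.get)
            context = context[1:]  # drop the oldest word and retry
        # No context matched: fall back to the most frequent unigram
        return self.counts[1].most_common(1)[0][0][0]

    def generate(self, length, seed=None):
        """Sample a first word by unigram frequency, then extend greedily."""
        rng = random.Random(seed)
        unigrams = list(self.counts[1].keys())
        weights = list(self.counts[1].values())
        words = [rng.choices(unigrams, weights=weights)[0][0]]
        while len(words) < length:
            words.append(self.next_word(words))
        return ' '.join(words)


tokens = "the whale swam and the whale dove and the ship sank".split()
model = NGramBackoffSketch(tokens)
print(model.next_word(('and', 'the', 'whale')))  # dove
print(model.generate(8, seed=0))
```

Keeping the counts on the instance (`self.counts`) removes the dependence on notebook-level globals such as `uni_counts`, so the model can be rebuilt on a different token list without re-running earlier cells.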


In [590]:
# def generate_next_token(self, ngram):
#     # Search for the given n-gram
#     if ngram in self.ngram_model:
#         # If the n-gram is found, generate the next token based on it
#         next_token = self.ngram_model[ngram].generate_next_token()
#         return next_token
#     else:
#         # If the n-gram is not found, back off to smaller n-grams
#         for i in range(len(ngram) - 1, 0, -1):
#             smaller_ngram = ngram[:i]
#             if smaller_ngram in self.ngram_model:
#                 # If a smaller n-gram is found, repeat the process with it
#                 return self.generate_next_token(smaller_ngram)
#         # If no match is found for any smaller n-gram, return None or handle appropriately
#         return None
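
One detail in the draft above is worth a second look: `ngram[:i]` shortens the n-gram from the right, which discards the most recent words first. Backoff models usually do the opposite and drop the oldest word, so the words closest to the prediction point are kept. A minimal sketch of that ordering, where `backoff_contexts`, `lookup_with_backoff`, and the dictionary-based `toy_model` are hypothetical stand-ins for `self.ngram_model`:

```python
def backoff_contexts(ngram):
    """Yield the full context, then ever-shorter suffixes of it."""
    context = tuple(ngram)
    while context:
        yield context
        context = context[1:]  # drop the oldest word, keep the most recent ones


def lookup_with_backoff(model, ngram):
    """Return the continuation of the longest known suffix, or None."""
    for context in backoff_contexts(ngram):
        if context in model:
            return model[context]
    return None


toy_model = {('white', 'whale'): 'breached', ('whale',): 'swam'}
print(list(backoff_contexts(('the', 'white', 'whale'))))
# [('the', 'white', 'whale'), ('white', 'whale'), ('whale',)]
print(lookup_with_backoff(toy_model, ('the', 'white', 'whale')))  # breached
```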
