Drew Lickman\
CSCI 4820-001\
Project #2\
Due: 9/9/24

# N-Grams Algorithm

## Assignment Requirements:

### Input
---

- Two training data input files
    - CNN Stories
    - Shakespeare Plays
- Each line in the files is a paragraph, and a paragraph may contain multiple sentences

### Processing
---

- Text will be converted to lowercase during processing
- Extract n-grams using both methods
    - Sentence level
        - Paragraph will be sentence tokenized (NLTK sent_tokenize), then all sentences will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - Paragraph level
        - Paragraph will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - n-gram extraction should never cross over line boundaries
- The data structure used to hold tokens in each sentence should start with \<s> and end with </s>, according to the n-grams being processed
    - Higher order n-grams require more start symbol augments
- Unigrams, bigrams, trigrams, quadgrams will each be kept in separate data structures
    - Dictionaries indexed by "context tuples" work well for this
- A parallel data structure should hold the counts of the tokens that immediately follow each n-gram context
    - These counts should be stored as probabilities by dividing by total count of tokens that appear after the n-gram context 
- Process both files at sentence level first, then at paragraph level
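
The padding and counting steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the assignment's reference solution: the helper names `count_followers` and `to_probabilities` are hypothetical, the input is assumed to be already tokenized, and an n-gram of order n is assumed to need n-1 start symbols (at least one), which may differ from the count the course expects.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

def count_followers(sentences, n):
    """Map each (n-1)-token context tuple to a Counter of the tokens that follow it."""
    followers = defaultdict(Counter)
    for tokens in sentences:
        starts = max(n - 1, 1)  # assumption: n-1 start symbols, but never fewer than one
        padded = [START] * starts + tokens + [END]
        # The window stays inside `padded`, so no n-gram crosses a sentence or line boundary
        for i in range(starts, len(padded)):
            context = tuple(padded[i - (n - 1):i])  # empty tuple () for unigrams
            followers[context][padded[i]] += 1
    return followers

def to_probabilities(followers):
    """Divide each follower count by the total count of tokens seen after that context."""
    probs = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        probs[context] = {token: count / total for token, count in counts.items()}
    return probs

# Toy corpus (made up for illustration): two tokenized sentences
sentences = [["i", "am", "sam"], ["sam", "i", "am"]]
bigram_probs = to_probabilities(count_followers(sentences, 2))
print(bigram_probs[("i",)])    # {'am': 1.0}
print(bigram_probs[("<s>",)])  # {'i': 0.5, 'sam': 0.5}
```

Keying the dictionary by context tuple keeps the counts for one context together, so the probabilities for each context sum to 1 and can be passed directly to a weighted random choice.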

### Output
---

- Set NumPy seed to 0
- Print the count of extracted unigrams, bigrams, trigrams, and quadgrams (for each file)
- For each file, choose a random starting word from the unigram tokens (not </s>)
    - This random word will be used as the seed for generated n-gram texts
- For each n-gram order (unigram through quadgram):
    - Using the seed word (prefixed with \<s> as required) generate either 150 tokens or until </s> is generated
        - Do NOT continue after </s>
    - Each next token will be probabilistically selected from those that follow the context (if any) for that n-gram
    - When working with higher order n-grams, use backoff when the context does not produce a token. Use the next lower n-gram
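
The generation rules above (seed word, 150-token cap, stop at `</s>`, backoff to the next lower order) can be sketched like this. It is an illustration under assumptions rather than the required implementation: `build_models`, `next_token`, and `generate` are hypothetical names, the seed word is assumed to count toward the 150 tokens, and the standard-library `random.choices` stands in for `np.random.choice` so the sketch has no dependencies.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

def build_models(sentences, max_n=4):
    """models[n] maps an (n-1)-token context tuple to a Counter of following tokens."""
    models = {}
    for n in range(1, max_n + 1):
        followers = defaultdict(Counter)
        for tokens in sentences:
            starts = max(n - 1, 1)  # assumption: n-1 start symbols, at least one
            padded = [START] * starts + tokens + [END]
            for i in range(starts, len(padded)):
                followers[tuple(padded[i - (n - 1):i])][padded[i]] += 1
        models[n] = followers
    return models

def next_token(models, history, n, rng):
    """Sample the next token, backing off to lower orders when the context is unseen."""
    for order in range(n, 0, -1):
        context = tuple(history[-(order - 1):]) if order > 1 else ()
        counts = models[order].get(context)
        if counts:
            tokens = list(counts)
            # Raw counts serve as weights; random.choices normalizes them internally
            return rng.choices(tokens, weights=[counts[t] for t in tokens])[0]
    return END  # reached only if even the unigram model is empty

def generate(models, seed_word, n, max_tokens=150, rng=None):
    """Generate up to max_tokens tokens (seed included), stopping when </s> is drawn."""
    rng = rng or random.Random(0)
    history = [START] * max(n - 1, 1) + [seed_word]
    output = [seed_word]
    while len(output) < max_tokens:
        token = next_token(models, history, n, rng)
        if token == END:
            break  # do NOT continue after </s>
        output.append(token)
        history.append(token)
    return output

# Toy corpus (made up): with a single sentence, every context has one follower
models = build_models([["this", "is", "short", "."]])
print(" ".join(generate(models, "this", 4)))   # this is short .
print(" ".join(generate(models, "short", 4)))  # short .
```

In the second call the quadgram and trigram contexts for `short` were never observed, so the sketch falls through to the bigram model, which is the backoff behaviour the last bullet describes.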

## Python Code

In [554]:
# Imports libraries and reads corpus documents. Save the documents as tokens

import numpy as np
from nltk import word_tokenize, sent_tokenize

np.random.seed(0)

sentences = []
paragraphs = []

with open("short.txt", encoding="utf-8") as wordList:
    lines = wordList.readlines()
    for line in lines:
        line = line.lower() # Converts all documents to lowercase
        sentence = sent_tokenize(line) # Extract as entire sentences
        paragraph = word_tokenize(line) # Extract the entire line as words (not separating sentences into different arrays!)
        sentences.append(sentence) # Adds each sentence to the sentences array
        paragraphs.append(paragraph) # Adds each line into the paragraphs array
        #print(sentence)
        ##print(paragraph)
        #print()
        
#print("Sentences (not word tokenized): ", sentences)
##print(paragraphs)

#print()
# Sentence level converting sentence tokens into word tokens
tokens = [] # One list of word tokens per sentence, without START or END symbols
for sent in sentences:
    for string in sent:
        tokenList = word_tokenize(string) # Converts each word into a token. (This will separate sentences into different arrays)
        tokens.append(tokenList)
        #print(token)
#print()
print("Sentence tokens: ", tokens)



Sentence tokens:  [['this', 'short', '.']]


In [555]:
# Add START and END tokens
# Make sure to Run All before re-running this!

START = "<s>"
END = "</s>"

# AugmentedTokens[0] = [[<s>, tokenized words, </s>], ...]                 (unigrams)
# AugmentedTokens[1] = [[<s>, <s>, tokenized words, </s>], ...]            (bigrams)
# AugmentedTokens[2] = [[<s>, <s>, <s>, tokenized words, </s>], ...]       (trigrams)
# AugmentedTokens[3] = [[<s>, <s>, <s>, <s>, tokenized words, </s>], ...]  (quadgrams)

# Array of AugmentedToken lists (one for each Uni/Bi/Tri/Quad grams)
AugmentedTokens = [[],[],[],[]]
# Since I am modifying each sentence, for every gram, I will add the START n times and END once per sentence
# List comprehension as suggested by Claude 3.5-sonnet: (and modifications by myself too!)
# newList = [expression for item in iterable]

# I may need to adjust the count of START and END symbols (slide 17 of n-grams)

for i in range(len(AugmentedTokens)):
    AugmentedTokens[i] = [[START]*(i+1) + sentence + [END] for sentence in tokens]
# Even more compact version of all this
#UniAugmentedTokens  = [[START]*1 + sentence + [END] for sentence in tokens]
#BiAugmentedTokens   = [[START]*2 + sentence + [END] for sentence in tokens]
#TriAugmentedTokens  = [[START]*3 + sentence + [END] for sentence in tokens]
#QuadAugmentedTokens = [[START]*4 + sentence + [END] for sentence in tokens]

#AugmentedTokens.append(UniAugmentedTokens)
#AugmentedTokens.append(BiAugmentedTokens)
#AugmentedTokens.append(TriAugmentedTokens)
#AugmentedTokens.append(QuadAugmentedTokens)

for i in range(len(AugmentedTokens)):
    print(AugmentedTokens[i])

[['<s>', 'this', 'short', '.', '</s>']]
[['<s>', '<s>', 'this', 'short', '.', '</s>']]
[['<s>', '<s>', '<s>', 'this', 'short', '.', '</s>']]
[['<s>', '<s>', '<s>', '<s>', 'this', 'short', '.', '</s>']]


In [556]:
# Convert augmented tokens into n-grams

# Dictionaries of n-grams
unigrams = {}   # word: count
bigrams = {}    # (context1, word): count
trigrams = {}   # (c1, c2, word): count
quadgrams = {}  # (c1, c2, c3, word): count
grams = [unigrams, bigrams, trigrams, quadgrams]

# Count unigrams
for tokenList in AugmentedTokens[0]: #0 context words
    for word in tokenList:
        if word not in unigrams:
            unigrams[word] = 1 # Initialize count as 1
        else:
            unigrams[word] += 1 # Increment unigram count

print("Unigrams:", unigrams)

# Count bigrams
# Note: context starts as None and is not reset between token lists,
# so a (None, '<s>') key is counted for the very first token
context = None
for tokenList in AugmentedTokens[1]: #1 context word
    for word in tokenList:
        bigram = (context, word)
        bigrams[bigram] = bigrams.get(bigram, 0) + 1 # Initialize count as 1 or increment it
        context = word

print("Bigrams:", bigrams)

# Count trigrams
# Note: both context slots start as None and are not reset between token lists,
# so keys containing None are counted for the first two tokens
context = None
context2 = None
for tokenList in AugmentedTokens[2]: #2 context words
    for word in tokenList:
        trigram = (context, context2, word)
        trigrams[trigram] = trigrams.get(trigram, 0) + 1 # Initialize count as 1 or increment it
        context = context2
        context2 = word

print("Trigrams:", trigrams)

# Count quadgrams
# Note: all three context slots start as None and are not reset between token lists,
# so keys containing None are counted for the first three tokens
context = None
context2 = None
context3 = None
for tokenList in AugmentedTokens[3]: #3 context words
    for word in tokenList:
        quadgram = (context, context2, context3, word)
        quadgrams[quadgram] = quadgrams.get(quadgram, 0) + 1 # Initialize count as 1 or increment it
        context = context2
        context2 = context3
        context3 = word

print("Quadgrams:", quadgrams)

Unigrams: {'<s>': 1, 'this': 1, 'short': 1, '.': 1, '</s>': 1}
Bigrams: {(None, '<s>'): 1, ('<s>', '<s>'): 1, ('<s>', 'this'): 1, ('this', 'short'): 1, ('short', '.'): 1, ('.', '</s>'): 1}
Trigrams: {(None, None, '<s>'): 1, (None, '<s>', '<s>'): 1, ('<s>', '<s>', '<s>'): 1, ('<s>', '<s>', 'this'): 1, ('<s>', 'this', 'short'): 1, ('this', 'short', '.'): 1, ('short', '.', '</s>'): 1}
Quadgrams: {(None, None, None, '<s>'): 1, (None, None, '<s>', '<s>'): 1, (None, '<s>', '<s>', '<s>'): 1, ('<s>', '<s>', '<s>', '<s>'): 1, ('<s>', '<s>', '<s>', 'this'): 1, ('<s>', '<s>', 'this', 'short'): 1, ('<s>', 'this', 'short', '.'): 1, ('this', 'short', '.', '</s>'): 1}
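
The three counting loops above differ only in window width, so they could be folded into one helper. The following is a sketch, not the notebook's code: `count_ngrams` is a hypothetical name, and because the window is rebuilt for every token list, no `None` placeholders appear and context never carries over from one sentence to the next. Its counts therefore differ from the output shown above.

```python
from collections import Counter

def count_ngrams(token_lists, n):
    """Count every window of n consecutive tokens, one token list at a time."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n staggered slices yields each run of n adjacent tokens as a tuple
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Same padded sentence as AugmentedTokens[1] above
padded = [["<s>", "<s>", "this", "short", ".", "</s>"]]
bigram_counts = count_ngrams(padded, 2)
print(len(bigram_counts))                 # 5
print(bigram_counts[("this", "short")])   # 1
```

A list shorter than n simply produces no windows, so short sentences need no special case.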


In [557]:
# Definitions of gram probabilities

def unigramProb(wordTest):
    # Computes P(Wi)
    # Probability of wordTest: its count divided by the total token count
    if wordTest in unigrams:
        print(f"{unigrams[wordTest]/sum(unigrams.values()):.2f}") # .2f rounds to two decimal places
    else:
        print(wordTest, "is not in the dictionary!")

unigramProb("sam")
###

def bigramProb(wordTest, contextWord): # 1 context word
    # Computes P(Wi|Wi-1)
    # Probability of word test, given that its context came before it
    bigram = (contextWord, wordTest)
    if bigram in bigrams and contextWord in unigrams: # contextWord is the denominator, so it is the key that must exist
        #print("P(", bigram, "|", contextWord,") = ", bigrams[bigram], "/", unigrams[contextWord], "=")
        print(f"{bigrams[bigram]/unigrams[contextWord]:.2f}") # .2f rounds to hundredths decimal
    else:
        print(wordTest, "or", bigram, "is not in the dictionary!")

bigramProb("do", "i")
###

def trigramProb(wordTest, contextWord, contextWord2): # 2 context words
    # Computes P(Wi|Wi-2,Wi-1)
    # Probability of word test, given that its context came before it
    trigram = (contextWord, contextWord2, wordTest)
    bigram = (contextWord, contextWord2)
    if trigram in trigrams.keys() and bigram in bigrams.keys():
        #print("P(", trigram, "|", bigram, ") = ", trigrams[trigram], "/", bigrams[bigram], "=")
        print(f"{trigrams[trigram]/bigrams[bigram]:.2f}") # .2f rounds to hundredths decimal
    else:
        print(bigram, "or", trigram, "is not in the dictionary!")

trigramProb("am", "sam", "i")
###

# Note: I don't think compacting this into (wordTest, trigram) would be a good idea
def quadgramProb(wordTest, contextWord, contextWord2, contextWord3): # 3 context words
    # Computes P(Wi|Wi-3,wi-2,Wi-1) 
    # Probability of word test, given that its context came before it
    quadgram = (contextWord, contextWord2, contextWord3, wordTest)
    trigram = (contextWord, contextWord2, contextWord3) # The three context words form the denominator
    if quadgram in quadgrams.keys() and trigram in trigrams.keys():
        #print("P(", quadgram, "|", trigram, ") = ", quadgrams[quadgram], "/", trigrams[trigram], "=")
        print(f"{quadgrams[quadgram]/trigrams[trigram]:.2f}") # .2f rounds to hundredths decimal
    else:
        print(trigram, "or", quadgram, "is not in the dictionary!")

quadgramProb("like", "i", "do", "not")

sam is not in the dictionary!
do or ('i', 'do') is not in the dictionary!
('sam', 'i') or ('sam', 'i', 'am') is not in the dictionary!
('i', 'do', 'not') or ('i', 'do', 'not', 'like') is not in the dictionary!


In [558]:
# Calculate probabilities of each gram

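# Normalize by the total count (not the number of unique keys) so each list sums to 1
unigramTotal  = sum(unigrams.values())
bigramTotal   = sum(bigrams.values())
trigramTotal  = sum(trigrams.values())
quadgramTotal = sum(quadgrams.values())
probUnigram  = [count / unigramTotal for count in unigrams.values()]
probBigram   = [count / bigramTotal for count in bigrams.values()]
probTrigram  = [count / trigramTotal for count in trigrams.values()]
probQuadgram = [count / quadgramTotal for count in quadgrams.values()]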
print(probUnigram)
print(probBigram)
print(probTrigram)
print(probQuadgram)

[0.2, 0.2, 0.2, 0.2, 0.2]
[0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666]
[0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285]
[0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]


In [559]:
# Convert probabilities to log space
# log(p1 * p2 * p3 * p4) = log(p1) + log(p2) + log(p3) + log(p4)
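
The identity in the comment above can be checked with a short sketch. The helper name `sequence_log_prob` and the probability values are made up for illustration; the point is only that summing logs gives the same result as multiplying probabilities while avoiding floating-point underflow on long sequences.

```python
import math

def sequence_log_prob(probs):
    """Sum of natural-log probabilities, which equals the log of their product."""
    return sum(math.log(p) for p in probs)

probs = [0.2, 0.5, 0.25, 0.1]  # made-up conditional probabilities
log_p = sequence_log_prob(probs)
print(math.isclose(math.exp(log_p), 0.2 * 0.5 * 0.25 * 0.1))  # True

# A long product of small probabilities underflows to 0.0,
# while the equivalent sum of logs stays finite
tiny = [1e-5] * 100
print(math.prod(tiny))  # 0.0
print(sequence_log_prob(tiny))
```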

In [580]:
# This is where I pull randomized words out of the dictionaries

current = ""
output = ""
print(len(list(unigrams)), len(probUnigram))
total = 0 # Renamed from sum to avoid shadowing the built-in
for p in probUnigram:
    total += p
print("Sum of probabilities:", total)
while current != END:
    current = np.random.choice(list(unigrams), size=1, p=probUnigram)
    if current != START and current != END:
        output += current + " "
print(output)


5 5
Sum of probabilities: 1.0
['this ']


In [561]:
# Output

# This will be printed 4 times. Sentence/Paragraph splits of CNN/Shakespeare
for i in range(1,5):
    print(f"Extracted {len(grams[i-1])} unique {i}-grams")
print("Seed text:", "YYYY")
for i in range(1, 5):
    print(f"Generated {i}-gram text of length X")
    print(f"<{i}-gram text generated>")

Extracted 5 unique 1-grams
Extracted 6 unique 2-grams
Extracted 7 unique 3-grams
Extracted 8 unique 4-grams
Seed text: YYYY
Generated 1-gram text of length X
<1-gram text generated>
Generated 2-gram text of length X
<2-gram text generated>
Generated 3-gram text of length X
<3-gram text generated>
Generated 4-gram text of length X
<4-gram text generated>
