Drew Lickman\
CSCI 4820-001\
Project #2\
Due: 9/9/24

AI Usage Disclaimer:


# N-Grams Algorithm

## Assignment Requirements:

### Input
---

- Two training data input files
    - CNN Stories
    - Shakespeare Plays
- Each line in the files is a paragraph, and a paragraph may contain multiple sentences

### Processing
---

- Text will be converted to lowercase during processing
- Extract n-grams using both methods
    - Sentence level
        - Paragraph will be sentence tokenized (NLTK sent_tokenize), then all sentences will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - Paragraph level
        - Paragraph will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - n-gram extraction should never cross over line boundaries
- The data structure used to hold tokens in each sentence should start with \<s> and end with </s>, according to the n-grams being processed
    - Higher order n-grams require more start symbol augments
- Unigrams, bigrams, trigrams, quadgrams will each be kept in separate data structures
    - Dictionaries indexed by "context tuples" work well for this
- A parallel data structure should hold the counts of the tokens that immediately follow each n-gram context
    - These counts should be stored as probabilities by dividing by the total count of tokens that appear after the n-gram context
- Process both files at the sentence level first, then at the paragraph level
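
The processing steps above can be sketched roughly as follows. This is only a minimal sketch, not the assignment's reference solution: the helper names `build_model` and `to_probabilities` are hypothetical, and the input is assumed to be already lowercased and word tokenized (one token list per sentence or paragraph).

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

def build_model(token_lists, n):
    """Map each (n-1)-token context tuple to a Counter of the tokens that follow it."""
    followers = defaultdict(Counter)
    for tokens in token_lists:
        # Pad each list on its own, so no n-gram ever crosses a line boundary.
        padded = [START] * max(1, n - 1) + tokens + [END]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - (n - 1):i])  # () for unigrams
            followers[context][padded[i]] += 1
    return followers

def to_probabilities(followers):
    """Divide each follower count by the total count seen after that context."""
    probs = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        probs[context] = {token: count / total for token, count in counts.items()}
    return probs

sentences = [["i", "have", "a", "cat", "."], ["my", "cat", "is", "black", "."]]
bigram_probs = to_probabilities(build_model(sentences, 2))
print(bigram_probs[("cat",)])  # {'.': 0.5, 'is': 0.5}
```

Keeping the followers grouped under their context tuple means the probabilities for one context always sum to 1, which is what a sampling routine needs later.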

### Output
---

- Set NumPy seed to 0
- Print the count of extracted unigrams, bigrams, trigrams, and quadgrams (for each file)
- For each file, choose a random starting word from the unigram tokens (not </s>)
    - This random word will be used as the seed for generated n-gram texts
- For each gram:
    - Using the seed word (prefixed with \<s> as required) generate either 150 tokens or until </s> is generated
        - Do NOT continue after </s>
    - Each next token will be probabilistically selected from those that follow the context (if any) for that n-gram
    - When working with higher order n-grams, use backoff when the context does not produce a token: fall back to the next lower order n-gram
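
One way the generation and backoff rules above could look in code is sketched below. It is an illustration under assumptions, not the required implementation: `generate` and `toy_models` are hypothetical names, `models[k]` is assumed to map a k-token context tuple to a `{token: probability}` dictionary, the unigram distribution is assumed to be non-empty and stored under the empty tuple `()`, and the seed word is counted toward the 150-token limit.

```python
import numpy as np

START, END = "<s>", "</s>"

def generate(models, seed_word, n, max_tokens=150):
    """Generate tokens with an n-gram model, backing off to shorter contexts."""
    history = [START] * max(1, n - 1) + [seed_word]
    generated = [seed_word]
    while len(generated) < max_tokens and generated[-1] != END:
        choices = None
        for k in range(n - 1, -1, -1):  # longest context first, then back off
            context = tuple(history[-k:]) if k else ()
            choices = models[k].get(context)
            if choices:
                break
        tokens = list(choices)
        probs = list(choices.values())
        next_token = str(np.random.choice(tokens, p=probs))
        history.append(next_token)
        generated.append(next_token)
    return generated

np.random.seed(0)
toy_models = {
    0: {(): {".": 0.5, END: 0.5}},
    1: {("i",): {"have": 1.0}, ("have",): {"a": 1.0}, ("a",): {"cat": 1.0},
        ("cat",): {".": 1.0}, (".",): {END: 1.0}},
    2: {},  # empty on purpose, so every step backs off to the bigram model
}
print(" ".join(generate(toy_models, "i", 3)))  # i have a cat . </s>
```

The toy bigram model has exactly one follower per context, so the output is the same regardless of the random seed. Generation stops as soon as `</s>` is produced.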

## Python Code

In [698]:
# Imports libraries and reads corpus documents. Save the documents as tokens

import numpy as np
from nltk import word_tokenize, sent_tokenize

np.random.seed(0)

sentences = []
paragraphs = []

with open("poem.txt", encoding="utf-8") as wordList:
    lines = wordList.readlines()
    for line in lines:
        line = line.lower() # Converts all documents to lowercase
        sentence = sent_tokenize(line) # Extract as entire sentences
        paragraph = word_tokenize(line) # Extract the entire line as words (not separating sentences into different arrays!)
        sentences.append(sentence) # Adds each sentence to the sentences array
        paragraphs.append(paragraph) # Adds each line into the paragraphs array
        #print(sentence)
        ##print(paragraph)
        #print()
        
#print("Sentences (not word tokenized): ", sentences)
##print(paragraphs)

#print()
# Sentence level converting sentence tokens into word tokens
tokens = [] # One list of word tokens per sentence, without START or END
for sent in sentences:
    for string in sent:
        tokenList = word_tokenize(string) # Converts each word into a token. (This will separate sentences into different arrays)
        tokens.append(tokenList)
        #print(token)
#print()
for token in tokens:
    print(token)



['i', 'have', 'a', 'cat', '.']
['my', 'cat', 'is', 'black', '.']
['a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.']
['i', 'have', 'the', 'car', 'license', 'tag', '.']


In [699]:
# Add START and END tokens
# Make sure to Run All before re-running this!

START = "<s>"
END = "</s>"

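#AugmentedTokens[0] = [<s>tokenized words</s>], etc. (unigrams)
#AugmentedTokens[1] = [<s>tokenized words</s>], etc. (bigrams)
#AugmentedTokens[2] = [<s><s>tokenized words</s>], etc. (trigrams)
#AugmentedTokens[3] = [<s><s><s>tokenized words</s>], etc. (quadgrams)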

# Array of AugmentedToken lists (one for each Uni/Bi/Tri/Quad grams)
AugmentedTokens = [] # [],[],[],[]
# Since I am modifying each sentence, for every gram, I will add the START n times and END once per sentence
# List comprehension as suggested by Claude 3.5-sonnet: (and modifications by myself too!)
# newList = [expression for item in iterable]

# I may need to adjust the count of START and END symbols (slide 17 of n-grams)

#for i in range(len(AugmentedTokens)):
#    AugmentedTokens[i] = [[START]*(i+1) + sentence + [END] for sentence in tokens]
# Even more compact version of all this
UniAugmentedTokens  = [[START]*1 + sentence + [END] for sentence in tokens]
BiAugmentedTokens   = [[START]*1 + sentence + [END] for sentence in tokens] # both unigrams and bigrams are only augmented with 1 START token
TriAugmentedTokens  = [[START]*2 + sentence + [END] for sentence in tokens]
QuadAugmentedTokens = [[START]*3 + sentence + [END] for sentence in tokens]

AugmentedTokens.append(UniAugmentedTokens)
AugmentedTokens.append(BiAugmentedTokens)
AugmentedTokens.append(TriAugmentedTokens)
AugmentedTokens.append(QuadAugmentedTokens)

for i in range(len(AugmentedTokens)):
    print(AugmentedTokens[i])

[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'i', 'have', 'the', 'car', 

In [700]:
# Convert augmented tokens into n-grams

# Dictionaries of n-grams
unigrams = {}   # (): [("word", count), ...]
bigrams = {}    # (context1, "word"): count
trigrams = {}   # (c1, c2, "word"): count
quadgrams = {}  # (c1, c2, c3, "word"): count
grams = [unigrams, bigrams, trigrams, quadgrams]
gramsStrings = ["Unigrams", "Bigrams", "Trigrams", "Quadgrams"]

contextCount = [0,0,0,0] # [unigrams, bigrams, trigrams, quadgrams] total context count each

# Count unigrams
i = 0
count = 0
grams[i][()] = [] # Declare the unigrams to have a key () and value []
for tokenList in AugmentedTokens[i]: #0 context words
	for word in tokenList:
		#Find word in the unigram tuple list using list comprehension
		tupleWord = next((tup for tup in grams[i][()] if tup[0] == word), None)
		if tupleWord is None:
			grams[i][()].append((word, 1)) # Add word to unigrams with count of 1
		else:
			index = grams[i][()].index(tupleWord) # Get the index of the tuple in unigrams
			grams[i][()][index] = (word, tupleWord[1] + 1) # Increment unigram token count
for tup in grams[i][()]: # named tup so the built-in tuple is not shadowed
	contextCount[i] += tup[1] # sum the count of each unigram, at tup[1]

# Count bigrams
context = None
i = 1
for tokenList in AugmentedTokens[i]: #1 context word
	for word in tokenList:
		if context not in (None, END):
			bigram = (context, word) # bigram dictionary key
			if bigram not in grams[i]:
				grams[i][bigram] = 1 # Initialize count as 1
			else:
				grams[i][bigram] += 1 # Increment bigram count
		context = word
for token in grams[i]:
	contextCount[i] += grams[i].get(token)

# Count trigrams
context = None
context2 = None
i = 2
for tokenList in AugmentedTokens[i]: #2 context words
	for word in tokenList:
		if context not in (None, END) and context2 not in (None, END):
			trigram = (context, context2, word) # trigram dictionary key
			if trigram not in grams[i]:
				grams[i][trigram] = 1 # Initialize count as 1
			else:
				grams[i][trigram] += 1 # Increment trigram count
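		context = context2
		context2 = word
for token in grams[i]:
	contextCount[i] += grams[i].get(token) # sum the trigram counts so "Context total" is not left at 0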

# Count quadgrams
context = None
context2 = None
context3 = None
i = 3
for tokenList in AugmentedTokens[i]: #3 context words
	for word in tokenList:
		if context not in (None, END) and context2 not in (None, END) and context3 not in (None, END):
			quadgram = (context, context2, context3, word) # quadgram dictionary key
			if quadgram not in grams[i]:
				grams[i][quadgram] = 1 # Initialize count as 1
			else:
				grams[i][quadgram] += 1 # Increment quadgram count
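		context = context2
		context2 = context3
		context3 = word
for token in grams[i]:
	contextCount[i] += grams[i].get(token) # sum the quadgram counts so "Context total" is not left at 0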

# Unigrams are so special that they get their own print block
print(f"{gramsStrings[0]}:", grams[0])
print(f"Unique {gramsStrings[0]}: {len(grams[0][()])}")
print("Context total:", contextCount[0])
print()

for i in range(1, len(grams)):
	print(f"{gramsStrings[i]}:", grams[i])
	#print(f"Sorted {gramsStrings[i]}:", sorted(grams[i]))
	print(f"Unique {gramsStrings[i]}: {len(grams[i])}")
	
	print("Context total:", contextCount[i])
	print()

Unigrams: {(): [('<s>', 4), ('i', 2), ('have', 2), ('a', 3), ('cat', 3), ('.', 4), ('</s>', 4), ('my', 1), ('is', 1), ('black', 2), ('car', 2), ('almost', 1), ('hit', 1), ('the', 1), ('license', 1), ('tag', 1)]}
Unique Unigrams: 16
Context total: 33

Bigrams: {('<s>', 'i'): 2, ('i', 'have'): 2, ('have', 'a'): 1, ('a', 'cat'): 2, ('cat', '.'): 2, ('.', '</s>'): 4, ('<s>', 'my'): 1, ('my', 'cat'): 1, ('cat', 'is'): 1, ('is', 'black'): 1, ('black', '.'): 1, ('<s>', 'a'): 1, ('a', 'black'): 1, ('black', 'car'): 1, ('car', 'almost'): 1, ('almost', 'hit'): 1, ('hit', 'a'): 1, ('have', 'the'): 1, ('the', 'car'): 1, ('car', 'license'): 1, ('license', 'tag'): 1, ('tag', '.'): 1}
Unique Bigrams: 22
Context total: 29

Trigrams: {('<s>', '<s>', 'i'): 2, ('<s>', 'i', 'have'): 2, ('i', 'have', 'a'): 1, ('have', 'a', 'cat'): 1, ('a', 'cat', '.'): 2, ('cat', '.', '</s>'): 2, ('<s>', '<s>', 'my'): 1, ('<s>', 'my', 'cat'): 1, ('my', 'cat', 'is'): 1, ('cat', 'is', 'black'): 1, ('is', 'black', '.'): 1, ('
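
The three hand-written counting loops above could also be expressed as one sliding-window helper. The following is a hedged alternative sketch, not the notebook's own code: `count_ngrams` is a hypothetical name, and the input is the singly augmented token lists printed earlier.

```python
from collections import Counter

def count_ngrams(token_lists, n):
    """Count every n-token window, one list at a time so windows never cross lists."""
    counts = Counter()
    for tokens in token_lists:
        # zip over n shifted views of the list yields each consecutive n-tuple
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

augmented = [
    ["<s>", "i", "have", "a", "cat", ".", "</s>"],
    ["<s>", "my", "cat", "is", "black", ".", "</s>"],
    ["<s>", "a", "black", "car", "almost", "hit", "a", "cat", ".", "</s>"],
    ["<s>", "i", "have", "the", "car", "license", "tag", ".", "</s>"],
]
bigram_counts = count_ngrams(augmented, 2)
print(len(bigram_counts), sum(bigram_counts.values()))  # 22 29
```

The printed values agree with the "Unique Bigrams: 22" and "Context total: 29" lines above. Calling the same helper with `n=1` gives 16 unique unigrams and 33 tokens in total.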

In [701]:
# Definitions of gram probabilities
print("Debug:")

def unigramProb(wordTest):
    # Computes P(Wi)
    # Probability of word test
	tupleWord = next((tup for tup in grams[0][()] if tup[0] == wordTest), None) # Is wordTest in the unigram tuple list?
	if tupleWord is not None:
		index = grams[0][()].index(tupleWord) # Get the index of the tuple
		prob = f"{unigrams[()][index][1]/contextCount[0]:.3f}" # .3f rounds to three decimal places
		grams[0][()][index] = (grams[0][()][index][0], grams[0][()][index][1], prob) # Save the probability into the tuple
		return prob 
	else:
		print(wordTest, "is not in the dictionary!")
###

def bigramProb(bigram): # 1 context word
    # Computes count(Wi-1, Wi) / total bigram count, i.e. the relative frequency of the whole bigram
    # This is not yet the conditional P(Wi|Wi-1), which would divide by the count of the context Wi-1 instead
    if bigram in bigrams.keys():
        #print("P(", bigram, "|", contextWord,") = ", bigrams[bigram], "/", unigrams[contextWord], "=")
        #print(f"{bigrams[bigram]/unigrams[contextWord]:.2f}") # .2f rounds to two decimal places
        return f"{bigrams[bigram]/contextCount[1]:.3f}"
    else:
        print(bigram, "is not in the dictionary!")
###

def trigramProb(wordTest, contextWord, contextWord2): # 2 context words
    # Computes P(Wi|Wi-2,Wi-1)
    # Probability of word test, given that its context came before it
    trigram = (contextWord, contextWord2, wordTest)
    bigram = (contextWord, contextWord2)
    if trigram in trigrams.keys() and bigram in bigrams.keys():
        #print("P(", trigram, "|", bigram, ") = ", trigrams[trigram], "/", bigrams[bigram], "=")
        print(f"{trigrams[trigram]/bigrams[bigram]:.3f}") # .3f rounds to three decimal places
    else:
        print(bigram, "or", trigram, "is not in the dictionary!")
###

# Note: I don't think compacting this into (wordTest, trigram) would be a good idea
def quadgramProb(wordTest, contextWord, contextWord2, contextWord3): # 3 context words
    # Computes P(Wi|Wi-3,wi-2,Wi-1) 
    # Probability of word test, given that its context came before it
    quadgram = (contextWord, contextWord2, contextWord3, wordTest)
    trigram = (contextWord, contextWord2, contextWord3) # the denominator is the count of the three context words
    if quadgram in quadgrams.keys() and trigram in trigrams.keys():
        #print("P(", quadgram, "|", trigram, ") = ", quadgrams[quadgram], "/", trigrams[trigram], "=")
        print(f"{quadgrams[quadgram]/trigrams[trigram]:.3f}") # .3f rounds to three decimal places
    else:
        print(trigram, "or", quadgram, "is not in the dictionary!")

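# P(SearchWord, (context))
#print(unigramProb("have"))
#print(bigramProb(("a", "cat")))
#trigramProb("car", "have", "the") # prints its result; takes the word first, then its two context words
#quadgramProb("cat", "almost", "hit", "a") # prints its result; takes the word first, then its three context words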

print()

print("Unigram probability table")
for unigram in unigrams[()]:
    # Sorry, this hurts the soul
    print(f"Word: {unigram[0]:<3} \t Occurances: {grams[0][()][unigrams[()].index(unigram)][1]:<3} \t Context total: {contextCount[0]:<3} \t Probability: {unigramProb(unigram[0]):<3}")
    # print unigram[i],                          unigram[index of tuple][unigram count]                     context summed in previous block       unigram Prob, input string token 

print(unigrams)

print()
print("Bigram probability table")
for bigram in bigrams:
    print(f"Word: {bigram[0]:<3} \t Occurances: {bigrams.get(bigram):<3} \t Context total: {contextCount[1]:<3} \t Probability: {bigramProb(bigram):<3}")

Debug:

Unigram probability table
Word: <s> 	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
Word: i   	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
Word: have 	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
Word: a   	 Occurances: 3   	 Context total: 33  	 Probability: 0.091
Word: cat 	 Occurances: 3   	 Context total: 33  	 Probability: 0.091
Word: .   	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
Word: </s> 	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
Word: my  	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
Word: is  	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
Word: black 	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
Word: car 	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
Word: almost 	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
Word: hit 	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
Word: the 	 Occurances: 1   	 Context total: 33  
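
The bigram table in this cell divides each count by the total number of bigrams (29), which gives the relative frequency of the pair rather than the conditional probability the assignment asks for. A minimal sketch of the conditional version follows. It assumes the same `(context, word): count` layout as `bigrams`, uses a small subset of the counts printed earlier, and the names `context_totals` and `conditional_prob` are hypothetical.

```python
from collections import Counter

# Subset of the bigram counts printed above, in the same (context, word): count layout
bigram_counts = {
    ("<s>", "i"): 2, ("i", "have"): 2,
    ("have", "a"): 1, ("have", "the"): 1,
    ("a", "cat"): 2, ("a", "black"): 1,
}

# Total count of tokens that follow each context word
context_totals = Counter()
for (context, _word), count in bigram_counts.items():
    context_totals[context] += count

def conditional_prob(bigram):
    """P(word | context) = count(context, word) / count of all tokens after context."""
    return bigram_counts[bigram] / context_totals[bigram[0]]

print(f"{conditional_prob(('a', 'cat')):.3f}")     # 0.667
print(f"{conditional_prob(('have', 'the')):.3f}")  # 0.500
```

With this normalization the probabilities of all words following the same context sum to 1, so they can be passed directly as the `p` argument when sampling the next token.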

In [702]:
# Calculate probabilities of each gram
# Haha nevermind, do this in the previous code block
                # [count of gram / total count of that gram type] for each gram
probUnigram  = [tup[1] / contextCount[0] for tup in unigrams[()]] # unigrams are stored as tuples under the () key
probBigram   = [count / sum(bigrams.values()) for count in bigrams.values()]
probTrigram  = [count / sum(trigrams.values()) for count in trigrams.values()]
probQuadgram = [count / sum(quadgrams.values()) for count in quadgrams.values()]
print(probUnigram)
print(probBigram)
print(probTrigram)
print(probQuadgram)

In [None]:
# Convert probabilities to log space
# log(p1 * p2 * p3 * p4) = log(p1) + log(p2) + log(p3) + log(p4)
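
A small sketch of the log-space identity above, using only the standard library. The probability values are illustrative (three of the unigram probabilities from the table earlier), and the second half shows why log space is useful for long sequences.

```python
import math

token_probs = [0.121, 0.061, 0.091]  # illustrative unigram probabilities
log_prob = sum(math.log(p) for p in token_probs)

# Summing logs and exponentiating matches multiplying the probabilities directly
print(math.isclose(math.exp(log_prob), math.prod(token_probs)))  # True

# A long product of small probabilities underflows to 0.0, the log sum does not
tiny = [1e-5] * 100
print(math.prod(tiny))                 # 0.0
print(sum(math.log(p) for p in tiny))  # about -1151.29
```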

In [None]:
# This is where I pull randomized words out of the dictionaries

current = ""
output = ""
words = [tup[0] for tup in unigrams[()]] # unigram tokens, taken from the tuples stored under the () key
wordProbs = [tup[1] / contextCount[0] for tup in unigrams[()]] # matching unigram probabilities
print(len(words), len(wordProbs))
pSum = sum(wordProbs)
print("Sum of probabilities:", pSum)
generated = 0
while current != END and generated < 150: # stop at </s> or after 150 tokens
    current = str(np.random.choice(words, p=wordProbs)) # draw one token according to its probability
    generated += 1
    if current != START and current != END:
        output += current + " "
print(output)

In [None]:
# Output

# This will be printed 4 times. Sentence/Paragraph splits of CNN/Shakespeare
for i in range(1,5):
    print(f"Extracted {len(grams[i-1])} unique {i}-grams")
print("Seed text:", "YYYY")
for i in range(1, 5):
    print(f"Generated {i}-gram text of length X")
    print(f"<{i}-gram text generated>")

Extracted 5 unique 1-grams
Extracted 6 unique 2-grams
Extracted 7 unique 3-grams
Extracted 8 unique 4-grams
Seed text: YYYY
Generated 1-gram text of length X
<1-gram text generated>
Generated 2-gram text of length X
<2-gram text generated>
Generated 3-gram text of length X
<3-gram text generated>
Generated 4-gram text of length X
<4-gram text generated>
