Drew Lickman\
CSCI 4820-001\
Project #2\
Due: 9/9/24

AI Usage Disclaimer:


# N-Grams Algorithm

## Assignment Requirements:

### Input
---

- Two training data input files
    - CNN Stories
    - Shakespeare Plays
- Each line in the files are paragraphs, and paragraphs may contain multiple sentences

### Processing
---

- Text will be converted to lowercase during processing
- Extract n-grams in both methods
    - Sentence level
        - Paragraph will be sentence tokenized (NLTK sent_tokenize), then all sentences will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - Paragraph level
        - Paragraph will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - n-gram extraction should never cross over line boundaries
- The data structure used to hold tokens in each sentence should start with \<s> and end with </s>, according to the n-grams being processed
    - Higher order n-grams require more start symbol augments
- Unigrams, bigrams, trigrams, quadgrams will each be kept in separate data structures
    - Dictionaries, indexed by "context tuples" work well for this
- A parallel data structure should hold the counts of the tokens that immediately follow each n-gram context
    - These counts should be stored as probabilities by dividing by total count of tokens that appear after the n-gram context 
- Process both files first using sentence level, then followed by paragraph level

### Output
---

- Set NumPy seed to 0
- Print the count of extracted unigrams, bigrams, trigrams, and quadgrams (for each file)
- For each file, choose a random starting word from the unigram tokens (not </s>)
    - This random word will be used as the seed for generated n-gram texts
- For each gram:
    - Using the seed word (prefixed with \<s> as required) generate either 150 tokens or until </s> is generated
        - Do NOT continue after </s>
    - Each next token will be probabilistically selected from those that follow the context (if any) for hat n-gram
    - When working with higher order n-grams, use backoff when the context does not produce a token. Use the next lower n-gram

## Python Code

In [878]:
# Imports libraries and reads corpus documents. Save the documents as tokens

import numpy as np
from nltk import word_tokenize, sent_tokenize

np.random.seed(0)

sentences = []
paragraphs = []

with open("poem.txt", encoding="utf-8") as wordList:
    lines = wordList.readlines()
    for line in lines:
        line = line.lower() # Converts all documents to lowercase
        sentence = sent_tokenize(line) # Extract as entire sentences
        paragraph = word_tokenize(line) # Extract the entire line as words (not separating sentences into different arrays!)
        sentences.append(sentence) # Adds each sentence to the sentences array
        paragraphs.append(paragraph) # Adds each line into the paragraphs array
        #print(sentence)
        ##print(paragraph)
        #print()
        
#print("Sentences (not word tokenized): ", sentences)
##print(paragraphs)

#print()
# Sentence level converting sentence tokens into word tokens
tokens = [] # [[tokens without START or END], [tokens for unigrams], [tokens for bigrams], [tokens for trigrams], [tokens for quadgrams]]
for sent in sentences:
    for string in sent:
        tokenList = word_tokenize(string) # Converts each word into a token. (This will separate sentences into different arrays)
        tokens.append(tokenList)
        #print(token)
#print()
for context in tokens:
    print(context)



['i', 'have', 'a', 'cat', '.']
['my', 'cat', 'is', 'black', '.']
['a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.']
['i', 'have', 'the', 'car', 'license', 'tag', '.']


In [879]:
# Add START and END tokens
# Make sure to Run All before re-running this!

START = "<s>"
END = "</s>"

#t[1] = [<s>tokenized words</s>], etc.
#t[2] = [<s>tokenized words</s>], etc.
#t[3] = [<s><s>tokenized words</s>], etc.
#t[4] = [<s><s><s>tokenized words</s>], etc.

# Array of AugmentedToken lists (one for each Uni/Bi/Tri/Quad grams)
AugmentedTokens = [] # [],[],[],[]
# Since I am modifying each sentence, for every gram, I will add the START n times and END once per sentence
# List comprehension as suggested by Claude 3.5-sonnet: (and modifications by myself too!)
# newList = [expression for item in iterable]

# I may need to adjust the count of START and END symbols (slide 17 of n-grams)

#for i in range(len(AugmentedTokens)):
#    AugmentedTokens[i] = [[START]*(i+1) + sentence + [END] for sentence in tokens]
# Even more compact version of all this
UniAugmentedTokens  = [[START]*1 + sentence + [END] for sentence in tokens]
BiAugmentedTokens   = [[START]*1 + sentence + [END] for sentence in tokens] # both unigrams and bigrams are only augmented with 1 START token
TriAugmentedTokens  = [[START]*2 + sentence + [END] for sentence in tokens]
QuadAugmentedTokens = [[START]*3 + sentence + [END] for sentence in tokens]

AugmentedTokens.append(UniAugmentedTokens)
AugmentedTokens.append(BiAugmentedTokens)
AugmentedTokens.append(TriAugmentedTokens)
AugmentedTokens.append(QuadAugmentedTokens)

for i in range(len(AugmentedTokens)):
    print(AugmentedTokens[i])

[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'i', 'have', 'the', 'car', 

In [880]:
# Convert augmented tokens into n-grams

# Dictionaries of n-grams
# Using 2d dictionaries {context: {(word: 1), (word2: 2)}, context2: {(word3: 3), (word4: 4)}}
unigrams = {}   # (): ["word", count]
bigrams = {}    # (context1): ["word", count]
trigrams = {}   # (c1, c2): ["word", count]
quadgrams = {}  # (c1, c2, c3): [("word", count)]
grams = [unigrams, bigrams, trigrams, quadgrams]
gramsPrintStrings = ["Unigrams", "Bigrams", "Trigrams", "Quadgrams"]
contextCount = [0,0,0,0] # [unigrams, bigrams, trigrams, quadgrams] total context count each
uniqueNGrams = [0,0,0,0] # Counts unique N-Grams for each N-Gram

# Helper function for repeating code
def incrementWordCount(gramIndex, context, word):
	if context not in grams[gramIndex]: # if the context isn't in the gram dict, 
		grams[gramIndex][context] = {}  # create an empty dictionary
	if word not in grams[gramIndex][context]: # check if word is already found in context
		grams[gramIndex][context][word] = 1 # Initialize count as 1
	else:
		grams[gramIndex][context][word] += 1 # Increment gram word count

# Count unigrams
i = 0
count = 0
context = ()
grams[i][context] = {} # Declare the unigrams to be a dictionary with the only key as ()
for tokenList in AugmentedTokens[i]: #0 context words
	for word in tokenList:
		# No actual context, so I'm not going to use incrementWordCount()
		#incrementWordCount(grams[i], context, word)

		if word not in grams[i][context]:
			grams[i][context][word] = 1 # Add word to unigrams with count of 1
		else:
			grams[i][context][word] += 1 # Increment unigram token count
		
		contextCount[i] += 1

# Count bigrams
context = None
i = 1
for tokenList in AugmentedTokens[i]: #1 context word
	for word in tokenList:
		if context not in (None, END):
			bigramContext = (context,)
			incrementWordCount(i, bigramContext, word)
		context = word
		contextCount[i] += 1

# Count trigrams
context = None
context2 = None
i = 2
for tokenList in AugmentedTokens[i]: #2 context words
	for word in tokenList:
		if context not in (None, END) and context2 not in (None, END):
			trigramContext = (context, context2) # trigram dictionary key
			incrementWordCount(i, trigramContext, word)
		context = context2
		context2 = word
		contextCount[i] += 1

# Count quadgrams
context = None
context2 = None
context3 = None
i = 3
for tokenList in AugmentedTokens[i]: #3 context words
	for word in tokenList:
		if context not in (None, END) and context2 not in (None, END) and context3 not in (None, END):
			quadgramContext = (context, context2, context3) # quadgram dictionary key
			incrementWordCount(i, quadgramContext, word)
		context = context2
		context2 = context3
		context3 = word
		contextCount[i] += 1

# Print all the context and words
# (Unigram context is just empty dictionary key ())
for i in range(len(grams)):
	print(f"{gramsPrintStrings[i]}") # Which N-Gram is being printed
	for context in grams[i]:
		print(f"{context}: {grams[i][context]}") #(Context,): {Dictionary of words: count}

	# Simple loop to count how many unique grams in each N-Gram
	for contextWord in grams[i]:
		uniqueNGrams[i] += len(grams[i][contextWord])
	print(f"Unique {gramsPrintStrings[i]}: {uniqueNGrams[i]}")
	print()

Unigrams
(): {'<s>': 4, 'i': 2, 'have': 2, 'a': 3, 'cat': 3, '.': 4, '</s>': 4, 'my': 1, 'is': 1, 'black': 2, 'car': 2, 'almost': 1, 'hit': 1, 'the': 1, 'license': 1, 'tag': 1}
Unique Unigrams: 16

Bigrams
('<s>',): {'i': 2, 'my': 1, 'a': 1}
('i',): {'have': 2}
('have',): {'a': 1, 'the': 1}
('a',): {'cat': 2, 'black': 1}
('cat',): {'.': 2, 'is': 1}
('.',): {'</s>': 4}
('my',): {'cat': 1}
('is',): {'black': 1}
('black',): {'.': 1, 'car': 1}
('car',): {'almost': 1, 'license': 1}
('almost',): {'hit': 1}
('hit',): {'a': 1}
('the',): {'car': 1}
('license',): {'tag': 1}
('tag',): {'.': 1}
Unique Bigrams: 22

Trigrams
('<s>', '<s>'): {'i': 2, 'my': 1, 'a': 1}
('<s>', 'i'): {'have': 2}
('i', 'have'): {'a': 1, 'the': 1}
('have', 'a'): {'cat': 1}
('a', 'cat'): {'.': 2}
('cat', '.'): {'</s>': 2}
('<s>', 'my'): {'cat': 1}
('my', 'cat'): {'is': 1}
('cat', 'is'): {'black': 1}
('is', 'black'): {'.': 1}
('black', '.'): {'</s>': 1}
('<s>', 'a'): {'black': 1}
('a', 'black'): {'car': 1}
('black', 'car'):

In [881]:
# Definitions of gram probabilities
print("Debug:")

def calcContextTotal(grams, gram):
	contextTotal = 0
	for word in grams[gram]:
		contextTotal += grams[gram][word]
	return contextTotal

def unigramProb(wordTest):
    # Probability of wordTest in its context
	if wordTest != None:
		prob = f"{unigrams[()][wordTest]/contextCount[0]:.3f}" # .3f rounds to hundredths decimal
		return prob 
	else:
		print(wordTest, "is not in the dictionary!")
###

def bigramProb(bigram, wordTest): # 1 context word
    # Probability of wordTest, given that its context came before it
	if bigram in bigrams:
		contextTotal = calcContextTotal(bigrams, bigram)
		return f"{bigrams[bigram][wordTest]/contextTotal:.3f}"
	else:
		print(bigram, "is not in the dictionary!")
###

def trigramProb(trigram, wordTest): # 2 context words
    # Probability of wordTest, given that its context came before it
	if trigram in trigrams:
		contextTotal = calcContextTotal(trigrams, trigram)
		return f"{trigrams[trigram][wordTest]/contextTotal:.3f}" # .3f rounds to hundredths decimal
	else:
		print(trigram, "is not in the dictionary!")
###

# Note: I don't think compacting this into (wordTest, trigram) would be a good idea
def quadgramProb(quadgram, wordTest): # 3 context words
    # Probability of wordTest, given that its context came before it
	if quadgram in quadgrams:
		contextTotal = calcContextTotal(quadgrams, quadgram)
		return f"{quadgrams[quadgram][wordTest]/contextTotal:.3f}" # .3f rounds to hundredths decimal
	else:
		print(quadgram, "is not in the dictionary!")

# P(SearchWord, (context))
#print(unigramProb("have"))
#print(bigramProb(("a", "cat")))
#print(trigramProb(("car", "have", "the")))
#print(quadgramProb(("cat", "almost", "hit", "a")))

print()

print("Unigram probability table")
print(unigrams)
probUnigram = [] # array to hold probabilities
for unigram in unigrams[()]:
	prob = unigramProb(unigram)
	probUnigram.append(prob)
	print(f"\tWord: {unigram:<12} \t Occurances: {unigrams[()][unigram]:<3} \t Context total: {contextCount[0]:<3} \t Probability: {prob:<3}")
    # print unigram[i],                        unigram dictionary value                 context summed in previous block       unigram Prob, input string token 

print()
print("Bigram probability table")
print(bigrams)
for bigram in bigrams:
	print(f"Context: {str(bigram):<12}")
	for word in bigrams[bigram]:
		print(f"\t Word: {word:<12} \t Occurances: {str(bigrams[bigram][word]):<3} \t Context total: {calcContextTotal(bigrams, bigram):<3} \t Probability: {bigramProb(bigram, word):<3}")
	print()

print("Trigram probability table")
print(trigrams)
for trigram in trigrams:
	print(f"Context: {str(trigram):<12}")
	for word in trigrams[trigram]:
		print(f"\t Word: {word:<12} \t Occurances: {str(trigrams[trigram][word]):<3} \t Context total: {contextCount[2]:<3} \t Probability: {trigramProb(trigram, word):<3}")
	print()

print("Quadgram probability table")
print(quadgrams)
for quadgram in quadgrams:
	print(f"Context: {str(quadgram):<12}")
	for word in quadgrams[quadgram]:
		print(f"\t Word: {word:<12} \t Occurances: {str(quadgrams[quadgram][word]):<3} \t Context total: {contextCount[2]:<3} \t Probability: {quadgramProb(quadgram, word):<3}")
	print()

Debug:

Unigram probability table
{(): {'<s>': 4, 'i': 2, 'have': 2, 'a': 3, 'cat': 3, '.': 4, '</s>': 4, 'my': 1, 'is': 1, 'black': 2, 'car': 2, 'almost': 1, 'hit': 1, 'the': 1, 'license': 1, 'tag': 1}}
	Word: <s>          	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
	Word: i            	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
	Word: have         	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
	Word: a            	 Occurances: 3   	 Context total: 33  	 Probability: 0.091
	Word: cat          	 Occurances: 3   	 Context total: 33  	 Probability: 0.091
	Word: .            	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
	Word: </s>         	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
	Word: my           	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
	Word: is           	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
	Word: black        	 Occurances: 2   	 Context total: 33  	 Probability: 0.

In [882]:
# Calculate probabilities of each gram
# Haha nevermind, do this in the previous code block
                # [key.value (count of word) / total # of grams] for each gram
probUnigram		= [probUnigram(unigram) for unigram in unigrams]
probBigram		= [bigrams.get(bigram) / len(bigrams) for bigram in bigrams]
probTrigram		= [trigrams.get(trigram) / len(trigrams) for trigram in trigrams]
probQuadgram	= [quadgrams.get(quadgram) / len(quadgrams) for quadgram in quadgrams]
print(probUnigram)
print(probBigram)
print(probBigram)
print(probQuadgram)

TypeError: 'list' object is not callable

In [None]:
# Convert probabilities to log space
# log(p1 * p2 * p3 * p4) = log(p1) + log(p2) + log(p3) + log(p4)

In [None]:
# This is where I pull randomized words out of the dictionaries

current = ""
output = ""
print(len(list(unigrams)), len(probUnigram))
pSum = 0
# Need to update this to use the unigram tuples
for p in probUnigram:
	pSum += p
print("Sum of probabilties:", pSum)
while current != END:
    current = np.random.choice(list(unigrams), size=1, p=probUnigram)
    if current != START and current != END:
        output += current + " "
print(output)


1 16
Sum of probabilties: 2.0625


ValueError: a must be 1-dimensional

In [None]:
# Output

# This will be printed 4 times. Sentence/Paragraph splits of CNN/Shakespeare
for i in range(1,5):
    print(f"Extracted {len(grams[i-1])} unique {i}-grams")
print("Seed text:", "YYYY")
for i in range(1, 5):
    print(f"Generated {i}-gram text of length X")
    print(f"<{i}-gram text generated>")

Extracted 5 unique 1-grams
Extracted 6 unique 2-grams
Extracted 7 unique 3-grams
Extracted 8 unique 4-grams
Seed text: YYYY
Generated 1-gram text of length X
<1-gram text generated>
Generated 2-gram text of length X
<2-gram text generated>
Generated 3-gram text of length X
<3-gram text generated>
Generated 4-gram text of length X
<4-gram text generated>
