Drew Lickman\
CSCI 4820-001\
Project #2\
Due: 9/9/24

AI Usage Disclaimer:


# N-Grams Algorithm

## Assignment Requirements:

### Input
---

- Two training data input files
    - CNN Stories
    - Shakespeare Plays
- Each line in the files is a paragraph, and a paragraph may contain multiple sentences

### Processing
---

- Text will be converted to lowercase during processing
- Extract n-grams using both methods
    - Sentence level
        - Paragraph will be sentence tokenized (NLTK sent_tokenize), then all sentences will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and \</s>
    - Paragraph level
        - Paragraph will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and \</s>
    - n-gram extraction should never cross over line boundaries
- The data structure used to hold tokens in each sentence should start with \<s> and end with \</s>, according to the n-grams being processed
    - Higher order n-grams require more start symbol augments
- Unigrams, bigrams, trigrams, quadgrams will each be kept in separate data structures
    - Dictionaries indexed by "context tuples" work well for this
- A parallel data structure should hold the counts of the tokens that immediately follow each n-gram context
    - These counts should be stored as probabilities by dividing by total count of tokens that appear after the n-gram context 
- Process both files at the sentence level first, then at the paragraph level
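
The padding and boundary rules above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the helper names `pad`, `count_ngrams`, and `unique_ngrams` are hypothetical, and the sample data is the four-sentence toy poem that the notebook prints at sentence level.

```python
from collections import defaultdict

START, END = "<s>", "</s>"

def pad(tokens, n):
    """Wrap one sentence (or paragraph) for n-gram extraction.

    Unigrams and bigrams get one start symbol; each higher order gets one more.
    """
    return [START] * max(n - 1, 1) + list(tokens) + [END]

def count_ngrams(token_lists, n):
    """Return {context_tuple: {next_token: count}} for n-grams of order n."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in token_lists:
        padded = pad(tokens, n)
        # The window slides inside one padded list only, so no n-gram
        # crosses a line boundary (or a sentence boundary at sentence level).
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])  # () for unigrams
            counts[context][padded[i]] += 1
    return counts

def unique_ngrams(counts):
    """Number of distinct (context, token) pairs."""
    return sum(len(followers) for followers in counts.values())

# Toy corpus: the sentence-level tokens of the sample poem.
corpus = [
    ["i", "have", "a", "cat", "."],
    ["my", "cat", "is", "black", "."],
    ["a", "black", "car", "almost", "hit", "a", "cat", "."],
    ["i", "have", "the", "car", "license", "tag", "."],
]

for n in range(1, 5):
    print(f"{n}-grams: {unique_ngrams(count_ngrams(corpus, n))} unique")
```

On these four sentences the sketch yields 16, 22, 25, and 26 distinct n-grams, which matches the counts the notebook reports for the same sample.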

### Output
---

- Set NumPy seed to 0
- Print the count of extracted unigrams, bigrams, trigrams, and quadgrams (for each file)
- For each file, choose a random starting word from the unigram tokens (not \</s>)
    - This random word will be used as the seed for generated n-gram texts
- For each n-gram order:
    - Using the seed word (prefixed with \<s> as required), generate either 150 tokens or until \</s> is generated
        - Do NOT continue after \</s>
    - Each next token will be probabilistically selected from those that follow the context (if any) for that n-gram
    - When working with higher-order n-grams, back off to the next lower-order n-gram whenever the context does not produce a token
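
The generation and backoff rules above could look roughly like this. It is a sketch under stated assumptions rather than the notebook's code: `build_models`, `next_token`, and `generate` are hypothetical names, every order is padded with the same number of start symbols for simplicity, and the standard library's `random.choices` stands in for NumPy's `np.random.choice` so the block has no third-party dependency.

```python
import random

START, END = "<s>", "</s>"

def build_models(sentences, max_order=3):
    """models[k] maps a k-token context tuple to {next_token: count}."""
    models = [dict() for _ in range(max_order + 1)]
    for sentence in sentences:
        padded = [START] * max_order + list(sentence) + [END]
        for i in range(max_order, len(padded)):
            for k in range(max_order + 1):
                context = tuple(padded[i - k:i])  # () when k == 0
                followers = models[k].setdefault(context, {})
                followers[padded[i]] = followers.get(padded[i], 0) + 1
    return models

def next_token(models, order, history, rng):
    """Sample the next token, backing off to shorter contexts as needed."""
    for k in range(order, -1, -1):
        context = tuple(history[len(history) - k:])
        followers = models[k].get(context)
        if followers:
            tokens = list(followers)
            weights = list(followers.values())
            return rng.choices(tokens, weights=weights, k=1)[0]
    return END  # nothing known at any order, so stop

def generate(models, order, seed_word, max_tokens=150, rng=None):
    """Generate up to max_tokens tokens, stopping as soon as END is drawn."""
    rng = rng or random.Random(0)  # fixed seed, assumed here for repeatability
    history = [START] * max(order - 1, 0) + [seed_word]
    output = [seed_word]
    while len(output) < max_tokens:
        token = next_token(models, order, history, rng)
        if token == END:
            break  # never continue past the end symbol
        output.append(token)
        history.append(token)
    return output

models = build_models([["i", "have", "a", "cat", "."]])
print(" ".join(generate(models, 2, "i")))    # i have a cat .
print(" ".join(generate(models, 2, "cat")))  # cat .
```

The second call exercises backoff: the trigram context `('<s>', 'cat')` was never seen, so the sketch falls back to the bigram context `('cat',)` for one token and then returns to the trigram model on the next step, because every step starts again from the requested order.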

## Python Code

In [96]:
# Imports libraries and reads corpus documents. Save the documents as tokens

import numpy as np
from nltk import word_tokenize, sent_tokenize

sentences = []
paragraphs = []

with open("poem.txt", encoding="utf-8") as wordList:
    lines = wordList.readlines()
    for line in lines:
        line = line.lower() # Converts all documents to lowercase
        sentence = sent_tokenize(line) # Extract as entire sentences
        paragraph = word_tokenize(line) # Extract the entire line as words (not separating sentences into different arrays!)
        sentences.append(sentence) # Adds each sentence to the sentences array
        paragraphs.append(paragraph) # Adds each line into the paragraphs array
        #print(sentence)
        ##print(paragraph)
        #print()
        
#print("Sentences: ", sentences) #before separating sentences
print("Paragraph level: ", paragraphs)

#print()
# Sentence level converting sentence tokens into word tokens
tokenizedSentences = [] # One list of word tokens per sentence (START and END are added in the next cell)
for sent in sentences:
    for string in sent:
        tokenList = word_tokenize(string) # Converts each word into a token. (This will separate sentences into different arrays)
        tokenizedSentences.append(tokenList)
        
print()
print("Sentence level: ", tokenizedSentences)

# Disable for large corpus
if False:
	for context in tokenizedSentences:
		print(context)


Paragraph level:  [['i', 'have', 'a', 'cat', '.', 'my', 'cat', 'is', 'black', '.'], ['a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', 'i', 'have', 'the', 'car', 'license', 'tag', '.']]

Sentence level:  [['i', 'have', 'a', 'cat', '.'], ['my', 'cat', 'is', 'black', '.'], ['a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.'], ['i', 'have', 'the', 'car', 'license', 'tag', '.']]


In [97]:
# Add START and END tokens
# Make sure to Run All before re-running this!

START = "<s>"
END = "</s>"

#t[1] = [<s>tokenized words</s>], etc.
#t[2] = [<s>tokenized words</s>], etc.
#t[3] = [<s><s>tokenized words</s>], etc.
#t[4] = [<s><s><s>tokenized words</s>], etc.

AugmentedTokens = [[],[]] # [[Sentence Tokens], [Paragraph Tokens]]

# Array of AugmentedToken lists (one for each Uni/Bi/Tri/Quad grams)
AugmentedTokens[0] = [] # [],[],[],[] #for sentences
# Since I am modifying each sentence, for every gram, I will add the START n times and END once per sentence
# List comprehension as suggested by Claude 3.5-sonnet: (and modifications by myself too!)
# newList = [expression for item in iterable]

#for i in range(len(AugmentedTokens)):
#    AugmentedTokens[i] = [[START]*(i+1) + sentence + [END] for sentence in tokens] # Unfortunately cannot use this because unigrams have 1 start token, not 0

UniAugmentedSTokens  = [[START]*1 + sentence + [END] for sentence in tokenizedSentences]
BiAugmentedSTokens   = [[START]*1 + sentence + [END] for sentence in tokenizedSentences] # both unigrams and bigrams are only augmented with 1 START token
TriAugmentedSTokens  = [[START]*2 + sentence + [END] for sentence in tokenizedSentences]
QuadAugmentedSTokens = [[START]*3 + sentence + [END] for sentence in tokenizedSentences]

AugmentedTokens[0].append(UniAugmentedSTokens)
AugmentedTokens[0].append(BiAugmentedSTokens)
AugmentedTokens[0].append(TriAugmentedSTokens)
AugmentedTokens[0].append(QuadAugmentedSTokens)
print("Sentence level:")
for i in range(len(AugmentedTokens[0])):
    print(AugmentedTokens[0][i])
    
#####################################################################################
print()

AugmentedTokens[1] = [] # [],[],[],[] #for paragraphs
UniAugmentedPTokens  = [[START]*1 + sentence + [END] for sentence in paragraphs]
BiAugmentedPTokens   = [[START]*1 + sentence + [END] for sentence in paragraphs] # both unigrams and bigrams are only augmented with 1 START token
TriAugmentedPTokens  = [[START]*2 + sentence + [END] for sentence in paragraphs]
QuadAugmentedPTokens = [[START]*3 + sentence + [END] for sentence in paragraphs]

AugmentedTokens[1].append(UniAugmentedPTokens)
AugmentedTokens[1].append(BiAugmentedPTokens)
AugmentedTokens[1].append(TriAugmentedPTokens)
AugmentedTokens[1].append(QuadAugmentedPTokens)

print("Paragraph level:")
for i in range(len(AugmentedTokens[1])):
    print(AugmentedTokens[1][i])



Sentence level:
[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'i', 'have'

In [98]:
# Convert augmented tokens into n-grams

# Dictionaries of n-grams
# Using 2d dictionaries {context: {word: count, word2: count2}, context2: {word3: count3}}
unigrams = {}   # {(): {word: count}}
bigrams = {}    # {(c1,): {word: count}}
trigrams = {}   # {(c1, c2): {word: count}}
quadgrams = {}  # {(c1, c2, c3): {word: count}}
grams = [unigrams, bigrams, trigrams, quadgrams]
gramsPrintStrings = ["Unigrams", "Bigrams", "Trigrams", "Quadgrams"]
contextCount = [0,0,0,0] # [unigrams, bigrams, trigrams, quadgrams] total context count each
uniqueNGrams = [0,0,0,0] # Counts unique N-Grams for each N-Gram

# Helper function for repeating code
def incrementWordCount(gramIndex, context, word):
	if context not in grams[gramIndex]: 		# if the context isn't in the gram dict, 
		grams[gramIndex][context] = {}  		# create an empty dictionary
	if word not in grams[gramIndex][context]: 	# check if word is already found in context
		grams[gramIndex][context][word] = 1 	# Initialize count as 1
	else:
		grams[gramIndex][context][word] += 1 	# Increment gram word count

for i in range(4): # 4 gram types
	if i == 0: # Calculate Unigrams
		context = ()
		grams[i][context] = {} # Declare the unigrams to be a dictionary with the only key as ()
		for tokenList in AugmentedTokens[0][i]: #0 context words
			for word in tokenList:
				# No actual context, so I'm not going to use incrementWordCount(grams[i], context, word)
				if word not in grams[i][context]:
					grams[i][context][word] = 1 		# Add word to unigrams with count of 1
				else:
					grams[i][context][word] += 1 		# Increment unigram token count
				contextCount[i] += 1
	if i == 1: # Calculate Bigrams
		context = None
		for tokenList in AugmentedTokens[0][i]: #1 context word
			for word in tokenList:
				if context not in (None, END):
					bigramContext = (context,) # bigram dictionary key
					incrementWordCount(i, bigramContext, word)
				context = word
				contextCount[i] += 1
	if i == 2: # Calculate Trigrams
		context = None
		context2 = None
		for tokenList in AugmentedTokens[0][i]: #2 context words
			for word in tokenList:
				if context not in (None, END) and context2 not in (None, END):
					trigramContext = (context, context2) # trigram dictionary key
					incrementWordCount(i, trigramContext, word)
				context = context2
				context2 = word
				contextCount[i] += 1
	if i == 3: # Calculate Quadgrams
		context = None
		context2 = None
		context3 = None
		for tokenList in AugmentedTokens[0][i]: #3 context words
			for word in tokenList:
				if context not in (None, END) and context2 not in (None, END) and context3 not in (None, END):
					quadgramContext = (context, context2, context3) # quadgram dictionary key
					incrementWordCount(i, quadgramContext, word)
				context = context2
				context2 = context3
				context3 = word
				contextCount[i] += 1


# Save the unique count of ngrams for each gram
#Debug print statements
	# Print all the context and words
	# (Unigram context is just empty dictionary key ())
for i in range(len(grams)):
	print(f"{gramsPrintStrings[i]}") # Which N-Gram is being printed
	#for context in grams[i]: # Displays all tokens in each gram
		#print(f"{context}: {grams[i][context]}") #(Context,): {Dictionary of words: count}

	# Simple loop to count how many unique grams in each N-Gram
	for contextWord in grams[i]:
		uniqueNGrams[i] += len(grams[i][contextWord])
	print(f"Unique {gramsPrintStrings[i]}: {uniqueNGrams[i]}")
	print()

Unigrams
Unique Unigrams: 16

Bigrams
Unique Bigrams: 22

Trigrams
Unique Trigrams: 25

Quadgrams
Unique Quadgrams: 26



In [99]:
# Definitions of gram probabilities

debug = False

def calcContextTotal(grams, gram):
	contextTotal = 0
	for word in grams[gram]:
		contextTotal += grams[gram][word]
	return contextTotal

def unigramProb(wordTest):
	# Probability of wordTest among all tokens (unigrams have no context)
	if wordTest in unigrams[()]:
		prob = unigrams[()][wordTest]/contextCount[0]
		return prob
	else:
		print(wordTest, "is not in the dictionary!")
###

def bigramProb(bigram, wordTest): # 1 context word
    # Probability of wordTest, given that its context came before it
	if bigram in bigrams:
		contextTotal = calcContextTotal(bigrams, bigram)
		return bigrams[bigram][wordTest]/contextTotal
	else:
		print(bigram, "is not in the dictionary!")
###

def trigramProb(trigram, wordTest): # 2 context words
    # Probability of wordTest, given that its context came before it
	if trigram in trigrams:
		contextTotal = calcContextTotal(trigrams, trigram)
		return trigrams[trigram][wordTest]/contextTotal
	else:
		print(trigram, "is not in the dictionary!")
###

# Note: I don't think compacting this into (wordTest, trigram) would be a good idea
def quadgramProb(quadgram, wordTest): # 3 context words
    # Probability of wordTest, given that its context came before it
	if quadgram in quadgrams:
		contextTotal = calcContextTotal(quadgrams, quadgram)
		return quadgrams[quadgram][wordTest]/contextTotal
	else:
		print(quadgram, "is not in the dictionary!")

# P(word | context): pass the context tuple first, then the word to test
#print(unigramProb("have"))
#print(bigramProb(("a",), "cat"))
#print(trigramProb(("have", "the"), "car"))
#print(quadgramProb(("almost", "hit", "a"), "cat"))

print()

probUnigram = [] # array to hold probabilities
probBigram = []
probTrigram = []
probQuadgram = []


print("Unigram probability table")
#print(unigrams)
for unigram in unigrams[()]:
	prob = unigramProb(unigram)
	probUnigram.append(prob)
	if debug:
		print(f"\tWord: {unigram:<12} \t Occurrences: {str(unigrams[()][unigram]):<3} \t Context total: {contextCount[0]:<3} \t Probability: {prob:.3f}")
    # print unigram[i],                        unigram dictionary value                 context summed in previous block       unigram Prob, input string token 

print()
print("Bigram probability table")
#print(bigrams)
for bigram in bigrams:
	if debug:
		print(f"Context: {str(bigram):<12}")
	for word in bigrams[bigram]:
		prob = bigramProb(bigram, word)
		probBigram.append(prob)
		if debug:
			print(f"\t Word: {word:<12} \t Occurrences: {str(bigrams[bigram][word]):<3} \t Context total: {calcContextTotal(bigrams, bigram):<3} \t Probability: {prob:.3f}")
	print()

print("Trigram probability table")
#print(trigrams)
for trigram in trigrams:
	if debug:
		print(f"Context: {str(trigram):<12}")
	for word in trigrams[trigram]:
		prob = trigramProb(trigram, word)
		probTrigram.append(prob)
		if debug:
			print(f"\t Word: {word:<12} \t Occurrences: {str(trigrams[trigram][word]):<3} \t Context total: {calcContextTotal(trigrams, trigram):<3} \t Probability: {prob:.3f}")
	print()

print("Quadgram probability table")
#print(quadgrams)
for quadgram in quadgrams:
	if debug:
		print(f"Context: {str(quadgram):<12}")
	for word in quadgrams[quadgram]:
		prob = quadgramProb(quadgram, word)
		probQuadgram.append(prob)
		if debug:
			print(f"\t Word: {word:<12} \t Occurrences: {str(quadgrams[quadgram][word]):<3} \t Context total: {calcContextTotal(quadgrams, quadgram):<3} \t Probability: {prob:.3f}")
	print()


Unigram probability table

Bigram probability table















Trigram probability table






















Quadgram probability table

























In [100]:
# N-Gram probabilities converted to array lists
print(probUnigram)
print(probBigram)
print(probTrigram)
print(probQuadgram)

[0.12121212121212122, 0.06060606060606061, 0.06060606060606061, 0.09090909090909091, 0.09090909090909091, 0.12121212121212122, 0.12121212121212122, 0.030303030303030304, 0.030303030303030304, 0.06060606060606061, 0.06060606060606061, 0.030303030303030304, 0.030303030303030304, 0.030303030303030304, 0.030303030303030304, 0.030303030303030304]
[0.5, 0.25, 0.25, 1.0, 0.5, 0.5, 0.6666666666666666, 0.3333333333333333, 0.6666666666666666, 0.3333333333333333, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0]
[0.5, 0.25, 0.25, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[0.5, 0.25, 0.25, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
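
The flat lists above lose the link between a probability and its context. One way to keep the "parallel data structure" from the requirements is a dictionary with the same shape as the counts. The sketch below assumes a hypothetical helper `to_probabilities` and uses two bigram contexts from the sample poem as input.

```python
def to_probabilities(counts):
    """Turn {context: {token: count}} into {context: {token: probability}}."""
    probs = {}
    for context, followers in counts.items():
        total = sum(followers.values())  # tokens seen after this context
        probs[context] = {token: c / total for token, c in followers.items()}
    return probs

# Counts for two bigram contexts taken from the sample poem.
counts = {
    ("a",): {"cat": 2, "black": 1},
    ("have",): {"a": 1, "the": 1},
}
probs = to_probabilities(counts)
print(probs[("a",)])     # 'cat' gets 2/3 and 'black' gets 1/3
print(probs[("have",)])  # {'a': 0.5, 'the': 0.5}
```

Each inner dictionary sums to 1, so its keys and values can be passed directly as the choices and weights of a sampling call.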


In [101]:
# Convert probabilities to log space
# log(p1 * p2 * p3 * p4) = log(p1) + log(p2) + log(p3) + log(p4)
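
The identity in the comment above can be sketched as follows. The function name `sequence_log_prob` is hypothetical, and the example only assumes that every probability is strictly positive, since `math.log(0)` raises a `ValueError`.

```python
import math

def sequence_log_prob(probabilities):
    """Log-probability of a sequence: the sum of the per-token logs."""
    return sum(math.log(p) for p in probabilities)

steps = [0.5, 1.0, 0.5]  # e.g. P(i | <s>), P(have | i), P(a | have)
print(sequence_log_prob(steps))            # same value as math.log(0.25)
print(math.exp(sequence_log_prob(steps)))  # back in probability space, about 0.25

# Why bother: a long product of small probabilities underflows to 0.0,
# while the sum of logs stays a finite, comparable number.
tiny = [1e-5] * 100
print(math.prod(tiny))          # 0.0
print(sequence_log_prob(tiny))  # about -1151.29
```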

In [102]:
# This is where I pull randomized words out of the dictionaries

#Set up seeds
#np.random.seed(0)
seed = ""
while seed in ("", None, START, END, '.', ",", "?", "!", "[", "]", "(", ")"):
	seed = np.random.choice(list(unigrams[()]), size=1, p=probUnigram)
	seed = str(seed[0])
biSeed = (seed,)
triSeed = (START, seed,)
quadSeed = (START, START, seed,)
print("Seeds:", seed, biSeed, triSeed, quadSeed)

def generateNextGram(ngrams, topLevel, context): #(ngrams, ngrams, biSeed)
	# topLevel is currently unused; it is kept so the existing calls keep working
	gram = grams[ngrams] # Input n to use grams[n], which allows for backoff by decrementing n
	if context in gram:
		length = sum(gram[context].values()) # Total count of tokens that follow this context
		probArray = [count/length for count in gram[context].values()]
		nextWord = np.random.choice(list(gram[context].keys()), size=1, p=probArray)
		return str(nextWord[0])
	if ngrams > 0:
		# Backoff: drop the oldest context word and retry with the next lower n-gram
		# The generation loops call this function with the top level gram again on the next token
		return generateNextGram(ngrams-1, topLevel, context[1:])
	print("Backoff failed, returning '.'")
	return "."

def setOutput(current, output, wordCount):
	if current not in (START, END):
		if current in ("'", "’", ",", ".", ":", "*"): # no space before punctuation
			output += current
		else:
			output += " " + current 
		wordCount += 1
	return output, wordCount

current = seed
UniOutput = current
wordCount = 0
while current != END and wordCount < 150:
	#current = np.random.choice(list(unigrams[()]), size=1, p=probUnigram)
	#current = current[0]

	current = generateNextGram(0, 0, ())

	UniOutput, wordCount = setOutput(current, UniOutput, wordCount)
print(f"Unigram: {UniOutput}")
print()

#print(bigrams)
#print("Possible words:", bigrams[biSeed])
current = seed
BiOutput = current
wordCount = 0
while current != END and wordCount < 150:
	current = generateNextGram(1, 1, biSeed)
	#print("Chosen current word:", current, "\n")
	biSeed = (current,)

	BiOutput, wordCount = setOutput(current, BiOutput, wordCount)
print("Bigram:", BiOutput)
print()

current = seed
TriOutput = current
wordCount = 0 # Reset the counter, otherwise it still holds the bigram total
while current != END and wordCount < 150:
	current = generateNextGram(2, 2, triSeed)
	
	triSeed = (triSeed[1],current)
	
	TriOutput, wordCount = setOutput(current, TriOutput, wordCount)
print("Trigram:", TriOutput)
print()

current = seed
QuadOutput = current
wordCount = 0 # Reset the counter, otherwise it still holds the trigram total
while current != END and wordCount < 150:
	current = generateNextGram(3, 3, quadSeed)
	
	quadSeed = (quadSeed[1], quadSeed[2],current)
	
	QuadOutput, wordCount = setOutput(current, QuadOutput, wordCount)
print("Quadgram:", QuadOutput)
print()

finalOutputs = [UniOutput, BiOutput, TriOutput, QuadOutput]


Seeds: have ('have',) ('<s>', 'have') ('<s>', '<s>', 'have')
Unigram: have. have

Bigram: have the car license tag.

Trigram: have a cat.

Quadgram: have a cat.



In [103]:
# Output

# This will be printed 4 times. Sentence/Paragraph splits of CNN/Shakespeare
for i in range(0,4):
    print(f"Extracted {uniqueNGrams[i]} unique {i+1}-grams")
print("Seed text:", seed)
for i in range(0, 4):
    print(f"Generated {i+1}-gram text of length X")
    print(f"{finalOutputs[i]}")

Extracted 16 unique 1-grams
Extracted 22 unique 2-grams
Extracted 25 unique 3-grams
Extracted 26 unique 4-grams
Seed text: have
Generated 1-gram text of length X
have. have
Generated 2-gram text of length X
have the car license tag.
Generated 3-gram text of length X
have a cat.
Generated 4-gram text of length X
have a cat.
