Drew Lickman\
CSCI 4820-001\
Project #2\
Due: 9/9/24

AI Usage Disclaimer:


# N-Grams Algorithm

## Assignment Requirements:

### Input
---

- Two training data input files
    - CNN Stories
    - Shakespeare Plays
- Each line in the files are paragraphs, and paragraphs may contain multiple sentences

### Processing
---

- Text will be converted to lowercase during processing
- Extract n-grams in both methods
    - Sentence level
        - Paragraph will be sentence tokenized (NLTK sent_tokenize), then all sentences will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and \</s>
    - Paragraph level
        - Paragraph will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and \</s>
    - n-gram extraction should never cross over line boundaries
- The data structure used to hold tokens in each sentence should start with \<s> and end with \</s>, according to the n-grams being processed
    - Higher order n-grams require more start symbol augments
- Unigrams, bigrams, trigrams, quadgrams will each be kept in separate data structures
    - Dictionaries, indexed by "context tuples" work well for this
- A parallel data structure should hold the counts of the tokens that immediately follow each n-gram context
    - These counts should be stored as probabilities by dividing by total count of tokens that appear after the n-gram context 
- Process both files first using sentence level, then followed by paragraph level

### Output
---

- Set NumPy seed to 0
- Print the count of extracted unigrams, bigrams, trigrams, and quadgrams (for each file)
- For each file, choose a random starting word from the unigram tokens (not </s>)
    - This random word will be used as the seed for generated n-gram texts
- For each gram:
    - Using the seed word (prefixed with \<s> as required) generate either 150 tokens or until </s> is generated
        - Do NOT continue after </s>
    - Each next token will be probabilistically selected from those that follow the context (if any) for hat n-gram
    - When working with higher order n-grams, use backoff when the context does not produce a token. Use the next lower n-gram

## Python Code

In [353]:
# Imports libraries and reads corpus documents. Save the documents as tokens

import numpy as np
from nltk import word_tokenize, sent_tokenize

sentences = []
tokenizedParagraphs = []

with open("poem.txt", encoding="utf-8") as wordList:
    lines = wordList.readlines()
    for line in lines:
        line = line.lower() # Converts all documents to lowercase
        sentence = sent_tokenize(line) # Extract as entire sentences
        paragraph = word_tokenize(line) # Extract the entire line as words (not separating sentences into different arrays!)
        sentences.append(sentence) # Adds each sentence to the sentences array
        tokenizedParagraphs.append(paragraph) # Adds each line into the paragraphs array
        #print(sentence)
        ##print(paragraph)
        #print()
        
#print("Sentences: ", sentences) #before separating sentences
print("Paragraph level: ", tokenizedParagraphs)

#print()
# Sentence level converting sentence tokens into word tokens
tokenizedSentences = [] # [[tokens without START or END], [tokens for unigrams], [tokens for bigrams], [tokens for trigrams], [tokens for quadgrams]]
for sent in sentences:
    for string in sent:
        tokenList = word_tokenize(string) # Converts each word into a token. (This will separate sentences into different arrays)
        tokenizedSentences.append(tokenList)
        
print()
print("Sentence level: ", tokenizedSentences)

# Disable for large corpus
if False:
	for context in tokenizedSentences:
		print(context)


Paragraph level:  [['i', 'have', 'a', 'cat', '.', 'my', 'cat', 'is', 'black', '.'], ['a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', 'i', 'have', 'the', 'car', 'license', 'tag', '.']]

Sentence level:  [['i', 'have', 'a', 'cat', '.'], ['my', 'cat', 'is', 'black', '.'], ['a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.'], ['i', 'have', 'the', 'car', 'license', 'tag', '.']]


In [354]:
# Add START and END tokens
# Make sure to Run All before re-running this!

START = "<s>"
END = "</s>"

#t[1] = [<s>tokenized words</s>], etc.
#t[2] = [<s>tokenized words</s>], etc.
#t[3] = [<s><s>tokenized words</s>], etc.
#t[4] = [<s><s><s>tokenized words</s>], etc.

AugmentedTokens = [[],[]] # [[Sentence Tokens], [Paragraph Tokens]]
modes = [tokenizedSentences, tokenizedParagraphs]

# Arrays of AugmentedToken lists (one for each Uni/Bi/Tri/Quad grams)
AugmentedTokens[0] = [] # [],[],[],[] #for sentences
AugmentedTokens[1] = [] # [],[],[],[] #for paragraphs

# Since I am modifying each sentence, for every gram, I will add the START n times and END once per sentence
# List comprehension as suggested by Claude 3.5-sonnet: (and modifications by myself too!)
# newList = [expression for item in iterable]

#for i in range(len(AugmentedTokens)):
#    AugmentedTokens[i] = [[START]*(i+1) + sentence + [END] for sentence in tokens] # Unfortunately cannot use this because unigrams have 1 start token, not 0

print("Sentence level: ↓\n")
for mode in range(2): # Sentence mode then Paragraph mode
	AugmentedTokens[mode].append([[START]*1 + sentence + [END] for sentence in modes[mode]]) # Append augmented unigram sentence/paragraph to AugmentedTokens
	AugmentedTokens[mode].append([[START]*1 + sentence + [END] for sentence in modes[mode]]) # Append augmented bigram sentence/paragraph to AugmentedTokens
	AugmentedTokens[mode].append([[START]*2 + sentence + [END] for sentence in modes[mode]]) # Append augmented trigram sentence/paragraph to AugmentedTokens
	AugmentedTokens[mode].append([[START]*3 + sentence + [END] for sentence in modes[mode]]) # Append augmented quadgram sentence/paragraph to AugmentedTokens

	# Prints sentence level of augmented grams, followed by paragraph level of augmented grams
	for ngram in range(len(AugmentedTokens[mode])):
		print(AugmentedTokens[mode][ngram])
	print()
print("Paragraph level: ↑")

Sentence level: ↓

[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', 'i', 'have', 'the', 'car', 'license', 'tag', '.', '</s>']]
[['<s>', '<s>', '<s>', 'i', 'have', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'my', 'cat', 'is', 'black', '.', '</s>'], ['<s>', '<s>', '<s>', 'a', 'black', 'car', 'almost', 'hit', 'a', 'cat', '.', '</s>'], ['<s>', '<s>', '<s>', 'i', 'ha

In [355]:
# Convert augmented tokens into n-grams

# Dictionaries of n-grams
# Using 2d dictionaries {context: {(word: 1), (word2: 2)}, context2: {(word3: 3), (word4: 4)}}
gramsPrintStrings = ["Unigrams", "Bigrams", "Trigrams", "Quadgrams"]

contextCountSen = [0,0,0,0] # [unigrams, bigrams, trigrams, quadgrams] total context count each
uniqueSenNGrams = [0,0,0,0] # Counts unique N-Grams for each N-Gram

uniqueParNGrams = [0,0,0,0] # Counts unique N-Grams for each N-Gram
contextCountPar = [0,0,0,0] # [unigrams, bigrams, trigrams, quadgrams] total context count each

gramsMode = [[{}, {}, {}, {}], [{}, {}, {}, {}]] 	# [[{sentenceUni}, {sentenceBi}, {sentenceTri}, {sentenceQuadi}],
													# [{paragraphUni}, {paragraphBi}, {paragraphTri}, {paragraphQuad}]]
													# Each dictionary holds a tuple key (context) and a dictionary value of the {word: count}
													# (): {"word", count}
													# (c1): {"word", count}
													# (c1, c2): {"word", count}
													# (c1, c2, c3): {("word", count)}

contextCountMode = [contextCountSen, contextCountPar]
uniqueModeNGrams = [uniqueSenNGrams, uniqueParNGrams]


# Helper function for repeating code
def incrementWordCount(mode, gramIndex, context, word):
	if context not in gramsMode[mode][gramIndex]: 				# if the context isn't in the gram dict, 
		gramsMode[mode][gramIndex][context] = {}  		# create an empty dictionary
	if word not in gramsMode[mode][gramIndex][context]: # check if word is already found in context
		gramsMode[mode][gramIndex][context][word] = 1 	# Initialize count as 1
	else:
		gramsMode[mode][gramIndex][context][word] += 1 	# Increment gram word count

for mode in range(2): # Sentence then Paragraph level
	for gram in range(4): # 4 gram types
		if gram == 0: # Calculate Unigrams
			context = ()
			gramsMode[mode][gram][context] = {} # Declare the unigrams to be a dictionary with the only key as ()
			for tokenList in AugmentedTokens[mode][gram]: #0 context words
				for word in tokenList:
					# No actual context, so I'm not going to use incrementWordCount(grams[i], context, word)
					if word not in gramsMode[mode][gram][context]:
						gramsMode[mode][gram][context][word] = 1 		# Add word to unigrams with count of 1
					else:
						gramsMode[mode][gram][context][word] += 1 		# Increment unigram token count
					contextCountMode[mode][gram] += 1

		if gram == 1: # Calculate Bigrams
			context = None
			for tokenList in AugmentedTokens[mode][gram]: #1 context word
				for word in tokenList:
					if context not in (None, END):
						bigramContext = (context,) # bigram dictionary key
						incrementWordCount(mode, gram, bigramContext, word)
					context = word
					contextCountMode[mode][gram] += 1

		if gram == 2: # Calculate Trigrams
			context = None
			context2 = None
			for tokenList in AugmentedTokens[mode][gram]: #2 context words
				for word in tokenList:
					if context not in (None, END) and context2 not in (None, END):
						trigramContext = (context, context2) # trigram dictionary key
						incrementWordCount(mode, gram, trigramContext, word)
					context = context2
					context2 = word
					contextCountMode[mode][gram] += 1

		if gram == 3: # Calculate Quadgrams
			context = None
			context2 = None
			context3 = None
			for tokenList in AugmentedTokens[mode][gram]: #3 context words
				for word in tokenList:
					if context not in (None, END) and context2 not in (None, END) and context3 not in (None, END):
						quadgramContext = (context, context2, context3) # quadgram dictionary key
						incrementWordCount(mode, gram, quadgramContext, word)
					context = context2
					context2 = context3
					context3 = word
					contextCountMode[mode][gram] += 1


# Save the unique count of ngrams for each gram
#Debug print statements
	# Print all the context and words
	# (Unigram context is just empty dictionary key ())
for mode in range(len(modes)):
	if mode == 0:
		print("Sentence level:")
	elif mode == 1:
		print("Paragraph level:")
		
	for gram in range(len(gramsSen)):
		print(f"{gramsPrintStrings[gram]}") # Which N-Gram is being printed
		#for context in grams[i]: # Displays all tokens in each gram
			#print(f"{context}: {grams[i][context]}") #(Context,): {Dictionary of words: count}

		# Simple loop to count how many unique grams in each N-Gram
		for contextWord in gramsMode[mode][gram]: #switch to grams for each mode
			uniqueModeNGrams[mode][gram] += len(gramsMode[mode][gram][contextWord]) #need to switch to grams for each mode
		print(f"Unique {gramsPrintStrings[gram]}: {uniqueModeNGrams[mode][gram]}")
		print()

Sentence level:
Unigrams
Unique Unigrams: 16

Bigrams
Unique Bigrams: 22

Trigrams
Unique Trigrams: 25

Quadgrams
Unique Quadgrams: 26

Paragraph level:
Unigrams
Unique Unigrams: 16

Bigrams
Unique Bigrams: 23

Trigrams
Unique Trigrams: 26

Quadgrams
Unique Quadgrams: 27



In [356]:
# Definitions of gram probabilities

debug = True

def calcContextTotal(grams, gram):
	contextTotal = 0
	for word in grams[gram]:
		contextTotal += grams[gram][word]
	return contextTotal

def calcGramProb(mode, ngram, ctx, wordTest): 
	if ctx in gramsMode[mode][ngram]:
		contextTotal = calcContextTotal(gramsMode[mode][ngram], ctx)
		return gramsMode[mode][ngram][ctx][wordTest]/contextTotal
	else:
		print(ctx, "is not in the dictionary!")

probModeGram = [
    [[], [], [], []],	# sentence probabilities [uni, bi, tri, quad]
    [[], [], [], []] 	# paragraph probabilities [uni, bi, tri, quad]
]

for mode in range(len(modes)):
	if mode == 0:
		print("Sentence level:")
	elif mode == 1:
		print("\nParagraph level:")

	for ngram in range(len(gramsMode[mode])): 					# for each ngram
		print(f"{gramsPrintStrings[ngram]} probability table") 	# which ngram table are we looking at
		for ctx in gramsMode[mode][ngram]: 						# for each context in the gram in the sen/par mode
			for word in gramsMode[mode][ngram][ctx]: 			# for each word in the current context 
				prob = calcGramProb(mode, ngram, ctx, word) 	# calculate the probability of the word in the current context
				probModeGram[mode][ngram].append(prob)			# save the probability to the sen/par mode for each ngram
				if debug:
					occurances = str(gramsMode[mode][ngram][ctx][word])
					contextTotal = calcContextTotal(gramsMode[mode][ngram], ctx) # calculate how many words follow the current context
					print(f"\tWord: {word:<12} \t Occurances: {occurances:<3} \t Context total: {contextTotal:<3} \t Probability: {prob:.3f}")
			if debug: print()

Sentence level:
Unigrams probability table
	Word: <s>          	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
	Word: i            	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
	Word: have         	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
	Word: a            	 Occurances: 3   	 Context total: 33  	 Probability: 0.091
	Word: cat          	 Occurances: 3   	 Context total: 33  	 Probability: 0.091
	Word: .            	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
	Word: </s>         	 Occurances: 4   	 Context total: 33  	 Probability: 0.121
	Word: my           	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
	Word: is           	 Occurances: 1   	 Context total: 33  	 Probability: 0.030
	Word: black        	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
	Word: car          	 Occurances: 2   	 Context total: 33  	 Probability: 0.061
	Word: almost       	 Occurances: 1   	 Context total: 33  	 Probability: 0.0

In [357]:
# N-Gram probabilities converted to array lists
for mode in range(2): # Sentence then Paragraph
	m = "Sentence" if mode == 0 else "Paragraph"
	print(f"Probabilities for {m} mode:")
	
	for g in range(4): # Uni, Bi, Tri, Quad grams
		print(f"{gramsPrintStrings[g]} probabilities:", probModeGram[mode][g])
	print()  # Add a blank line between modes for better readability

Probabilities for Sentence mode:
Unigrams probabilities: [0.12121212121212122, 0.06060606060606061, 0.06060606060606061, 0.09090909090909091, 0.09090909090909091, 0.12121212121212122, 0.12121212121212122, 0.030303030303030304, 0.030303030303030304, 0.06060606060606061, 0.06060606060606061, 0.030303030303030304, 0.030303030303030304, 0.030303030303030304, 0.030303030303030304, 0.030303030303030304]
Bigrams probabilities: [0.5, 0.25, 0.25, 1.0, 0.5, 0.5, 0.6666666666666666, 0.3333333333333333, 0.6666666666666666, 0.3333333333333333, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0]
Trigrams probabilities: [0.5, 0.25, 0.25, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Quadgrams probabilities: [0.5, 0.25, 0.25, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Probabilities for Paragraph mode:
Unigrams probabilities: [0.06896551724137931, 0.0689655172

In [358]:
# Convert probabilities to log space
# log(p1 * p2 * p3 * p4) = log(p1) + log(p2) + log(p3) + log(p4)

In [359]:
# This is where I pull randomized words out of the dictionaries

#Set up seeds
#np.random.seed(0)

def generateNextGram(mode, ngrams, topLevel, context): #(ngrams, ngrams, biSeed)
	gram = gramsMode[mode][ngrams] # Input n to use grams[n], which allows for backoff by decrementing n
	#print(f"Generating {ngrams+1}grams")
	#length = 0
	try:
		if context in gram:
			#for word in gram[context]:
			#	length += gram[context][word]
			length = sum(gram[context].values())
			probArray = [gram[context][wordCount]/length for wordCount in gram[context]]
			nextWord = np.random.choice(list(gram[context].keys()), size=1, p=probArray)
			nextWord = str(nextWord[0])
			#print(f"Next word: {nextWord}")
			return nextWord
		else:
			raise KeyError(f"{context} not found in grams[{ngrams}]")
	except KeyError:
		if ngrams > 0:
			#print(f"{context} not found in {gram}")
			#print(f"Backoff to {ngrams}grams")
			return generateNextGram(ngrams-1, topLevel, context[1:] if len(context) > 1 else ())
			#bug: not returning to top level gram
		else:
			print("Backoff failed, returning '.'")
			return "."

def setOutput(current, output, wordCount):
	if current not in (START, END):
		if current in ("'", "’", ",", ".", ":", "*"): #no space after symbols
			output += current
		else:
			output += " " + current 
		wordCount += 1
	return output, wordCount

seed = ""
while seed in ("", None, START, END, '.', ",", "?", "!", "[", "]", "(", ")"):
	seed = np.random.choice(list(gramsMode[0][0][()]), size=1, p=probModeGram[0])
	seed = str(seed[0]) #convert choice to a regular string
biSeed = (seed,)
triSeed = (START, seed,)
quadSeed = (START, START, seed,)
print("Seeds:", seed, biSeed, triSeed, quadSeed)

finalOutputs = [['','','',''], ['','','','']] #Output string for sentences (uni, bi, tri, quad), and paragraphs (uni, bi, tri, quad)

for mode in range(len(modes)):
	for g in range(len(gramsMode[mode])):
		

		current = seed
		UniOutput = current
		wordCount = 0
		while current != END and wordCount < 150:
			#current = np.random.choice(list(unigrams[()]), size=1, p=probUnigram)
			#current = current[0]

			current = generateNextGram(mode, 0, 0, ())

			UniOutput, wordCount = setOutput(current, UniOutput, wordCount)
		print(f"Unigram: {UniOutput}")
		print()

		#print(bigrams)
		#print("Possible words:", bigrams[biSeed])
		current = seed
		BiOutput = current
		wordCount = 0
		while current != END and wordCount < 150:
			current = generateNextGram(mode, 1, 1, biSeed)
			#print("Chosen current word:", current, "\n")
			biSeed = (current,)

			BiOutput, wordCount = setOutput(current, BiOutput, wordCount)
		print("Bigram:", BiOutput)
		print()

		current = seed
		TriOutput = current
		while current != END and wordCount < 150:
			current = generateNextGram(mode, 2, 2, triSeed)
			
			triSeed = (triSeed[1],current)
			
			TriOutput, wordCount = setOutput(current, TriOutput, wordCount)
		print("Trigram:", TriOutput)
		print()

		current = seed
		QuadOutput = current
		while current != END and wordCount < 150:
			current = generateNextGram(mode, 3, 3, quadSeed)
			
			quadSeed = (quadSeed[1], quadSeed[2],current)
			
			QuadOutput, wordCount = setOutput(current, QuadOutput, wordCount)
		print("Quadgram:", QuadOutput)
		print()

finalOutputs = [UniOutput, BiOutput, TriOutput, QuadOutput]


ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.

In [None]:
# Output

# This will be printed 4 times. Sentence/Paragraph splits of CNN/Shakespeare
for mode in range(0,4):
    print(f"Extracted {uniqueSenNGrams[mode]} unique {mode+1}-grams")
print("Seed text:", seed)
for mode in range(0, 4):
    print(f"Generated {mode+1}-gram text of length X")
    print(f"{finalOutputs[mode]}")

Extracted 16 unique 1-grams
Extracted 22 unique 2-grams
Extracted 25 unique 3-grams
Extracted 26 unique 4-grams
Seed text: license
Generated 1-gram text of length X
license black have black a a almost. i hit the almost my cat the
Generated 2-gram text of length X
license tag.
Generated 3-gram text of length X
license tag.
Generated 4-gram text of length X
license tag.
