Drew Lickman\
CSCI 4820-001\
Project #2\
Due: 9/9/24

# N-Grams Algorithm

## Assignment Requirements:

### Input
---

- Two training data input files
    - CNN Stories
    - Shakespeare Plays
- Each line in the files is a paragraph, and a paragraph may contain multiple sentences

### Processing
---

- Text will be converted to lowercase during processing
- Extract n-grams using both methods
    - Sentence level
        - Paragraph will be sentence tokenized (NLTK sent_tokenize), then all sentences will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - Paragraph level
        - Paragraph will be word tokenized (NLTK word_tokenize)
            - Resulting data will be augmented with \<s> and </s>
    - n-gram extraction should never cross over line boundaries
- The data structure used to hold tokens in each sentence should start with \<s> and end with </s>, according to the n-grams being processed
    - Higher order n-grams require more start symbol augments
- Unigrams, bigrams, trigrams, quadgrams will each be kept in separate data structures
    - Dictionaries indexed by "context tuples" work well for this
- A parallel data structure should hold the counts of the tokens that immediately follow each n-gram context
    - Convert these counts to probabilities by dividing each by the total count of tokens that follow the n-gram context
- Process both files at the sentence level first, then at the paragraph level
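
The context-tuple dictionaries and the parallel probability structure described above could look roughly like the following. This is a minimal sketch, not the assignment's reference solution: the names `pad` and `build_model` are hypothetical, the input is assumed to be already lowercased and word tokenized, and the padding rule (n-1 start symbols, but never fewer than one) is an assumption rather than something the requirements spell out.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

def pad(tokens, n):
    """Return tokens with start symbols in front and one end symbol behind."""
    # Assumption: n-1 start symbols for an n-gram, but never fewer than one.
    return [START] * max(n - 1, 1) + tokens + [END]

def build_model(token_lists, n):
    """Map each (n-1)-token context tuple to {next_token: probability}."""
    counts = defaultdict(Counter)
    for tokens in token_lists:  # one list per sentence or per paragraph
        seq = pad(tokens, n)
        # Windows stay inside one sequence, so no n-gram crosses a boundary.
        for i in range(len(seq) - n + 1):
            context = tuple(seq[i:i + n - 1])
            counts[context][seq[i + n - 1]] += 1
    # Turn the raw follower counts into probabilities, context by context.
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {tok: c / total for tok, c in followers.items()}
    return model

sentences = [["this", "is", "a", "sample", "."],
             ["this", "is", "the", "end", "."]]
bigram_model = build_model(sentences, 2)
print(bigram_model[("is",)])  # {'a': 0.5, 'the': 0.5}
```

For unigrams the context is the empty tuple `()`, so all four orders can share the same dictionary shape.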

### Output
---

- Set NumPy seed to 0
- Print the count of extracted unigrams, bigrams, trigrams, and quadgrams (for each file)
- For each file, choose a random starting word from the unigram tokens (not </s>)
    - This random word will be used as the seed for generated n-gram texts
- For each n-gram order (unigram through quadgram):
    - Using the seed word (prefixed with \<s> as required) generate either 150 tokens or until </s> is generated
        - Do NOT continue after </s>
    - Each next token will be probabilistically selected from those that follow the context (if any) for that n-gram
    - When working with higher order n-grams, back off to the next lower order n-gram when the context does not produce a token
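
The seed selection and backoff steps above can be sketched as follows. This is an illustration under stated assumptions, not the required implementation: `pick_seed` and `generate` are hypothetical names, `models[k]` is assumed to map a context tuple of length k-1 to a `{token: probability}` dictionary, the seed word is assumed to count toward the 150-token limit, and `<s>` is excluded from the seed candidates even though the requirements only rule out `</s>`.

```python
import numpy as np

START, END = "<s>", "</s>"

def pick_seed(unigram_tokens):
    """Pick a random starting word (assumption: neither <s> nor </s>)."""
    candidates = sorted(t for t in unigram_tokens if t not in (START, END))
    return str(np.random.choice(candidates))

def generate(models, seed_word, n, max_tokens=150):
    """Generate tokens with an n-gram model, backing off to lower orders."""
    # Prefix the seed with n-1 start symbols so the first context is complete.
    history = [START] * (n - 1) + [seed_word]
    output = [seed_word]
    while len(output) < max_tokens and output[-1] != END:
        followers = None
        for order in range(n, 0, -1):  # try n first, then n-1, ..., 1
            context = tuple(history[len(history) - (order - 1):])
            followers = models[order].get(context)
            if followers:
                break
        if not followers:  # no order knows this context; stop early
            break
        tokens = list(followers)
        probs = list(followers.values())
        next_token = str(np.random.choice(tokens, p=probs))
        history.append(next_token)
        output.append(next_token)
    return output

np.random.seed(0)
# Tiny hand-written models so the walk through the bigram table is deterministic.
models = {
    1: {(): {"hello": 0.25, "world": 0.25, ".": 0.25, END: 0.25}},
    2: {("hello",): {"world": 1.0}, ("world",): {".": 1.0}, (".",): {END: 1.0}},
}
text = generate(models, "hello", 2)
print(" ".join(text))  # hello world . </s>
```

Generation stops as soon as `</s>` is appended, so nothing is produced after the end symbol.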

## Python Code

In [96]:
# Imports libraries and reads corpus documents. Save the documents as tokens

import numpy as np
from nltk import word_tokenize, sent_tokenize

np.random.seed(0)

sentences = []
paragraphs = []

with open("sample.txt", encoding="utf-8") as wordList:
    lines = wordList.readlines()
    for line in lines:
        line = line.lower() # Converts all documents to lowercase
        sentence = sent_tokenize(line) # List of sentence strings found on this line
        paragraph = word_tokenize(line) # Word tokens for the entire line (sentences are not split into separate lists)
        sentences.append(sentence) # One list of sentence strings per line
        paragraphs.append(paragraph) # One list of word tokens per line
        #print(sentence)
        ##print(paragraph)
        #print()
        
#print("Sentences (not word tokenized): ", sentences)
##print(paragraphs)

#print()
# Sentence level converting sentence tokens into word tokens
tokens = [] # One list of word tokens per sentence, without START or END: [[tokens of sentence 1], [tokens of sentence 2], ...]
for sent in sentences:
    for string in sent:
        tokenList = word_tokenize(string) # Converts each word into a token. (This will separate sentences into different arrays)
        tokens.append(tokenList)
        #print(token)
#print()
print("Sentence tokens: ", tokens)



Sentence tokens:  [['this', 'document', 'is', 'just', 'a', 'sample', '.'], ['hello', 'world', '!'], ['this', 'is', 'my', 'really', 'awesome', 'document', 'that', 'i', 'love', 'writing', 'into', '.'], ['one', 'big', 'long', 'document', '.'], ['this', 'is', 'a', 'new', 'line', '.'], ['it', 'should', 'be', 'represented', 'as', 'a', 'separate', 'array', 'in', 'paragraph', 'mode', '.'], ['this', 'is', 'the', 'end', '.'], ['it', 'should', 'end', 'now', '.']]


In [97]:
# Add START and END tokens
# Make sure to Run All before re-running this!

START = "<s>"
END = "</s>"

# AugmentedTokens[0] = [[<s>, tokenized words, </s>], ...]                 (unigrams)
# AugmentedTokens[1] = [[<s>, <s>, tokenized words, </s>], ...]            (bigrams)
# AugmentedTokens[2] = [[<s>, <s>, <s>, tokenized words, </s>], ...]       (trigrams)
# AugmentedTokens[3] = [[<s>, <s>, <s>, <s>, tokenized words, </s>], ...]  (quadgrams)

# Array of AugmentedToken lists (one for each Uni/Bi/Tri/Quad grams)
AugmentedTokens = [[],[],[],[]]
# Since I am modifying each sentence, for every gram, I will add the START n times and END once per sentence
# List comprehension as suggested by Claude 3.5-sonnet: (and modifications by myself too!)
# newList = [expression for item in iterable]
for i in range(len(AugmentedTokens)):
    AugmentedTokens[i] = [[START]*(i+1) + sentence + [END] for sentence in tokens]
# Equivalent expanded version of the loop above
#UniAugmentedTokens  = [[START]*1 + sentence + [END] for sentence in tokens]
#BiAugmentedTokens   = [[START]*2 + sentence + [END] for sentence in tokens]
#TriAugmentedTokens  = [[START]*3 + sentence + [END] for sentence in tokens]
#QuadAugmentedTokens = [[START]*4 + sentence + [END] for sentence in tokens]

#AugmentedTokens.append(UniAugmentedTokens)
#AugmentedTokens.append(BiAugmentedTokens)
#AugmentedTokens.append(TriAugmentedTokens)
#AugmentedTokens.append(QuadAugmentedTokens)

for i in range(len(AugmentedTokens)):
    print(AugmentedTokens[i])

[['<s>', 'this', 'document', 'is', 'just', 'a', 'sample', '.', '</s>'], ['<s>', 'hello', 'world', '!', '</s>'], ['<s>', 'this', 'is', 'my', 'really', 'awesome', 'document', 'that', 'i', 'love', 'writing', 'into', '.', '</s>'], ['<s>', 'one', 'big', 'long', 'document', '.', '</s>'], ['<s>', 'this', 'is', 'a', 'new', 'line', '.', '</s>'], ['<s>', 'it', 'should', 'be', 'represented', 'as', 'a', 'separate', 'array', 'in', 'paragraph', 'mode', '.', '</s>'], ['<s>', 'this', 'is', 'the', 'end', '.', '</s>'], ['<s>', 'it', 'should', 'end', 'now', '.', '</s>']]
[['<s>', '<s>', 'this', 'document', 'is', 'just', 'a', 'sample', '.', '</s>'], ['<s>', '<s>', 'hello', 'world', '!', '</s>'], ['<s>', '<s>', 'this', 'is', 'my', 'really', 'awesome', 'document', 'that', 'i', 'love', 'writing', 'into', '.', '</s>'], ['<s>', '<s>', 'one', 'big', 'long', 'document', '.', '</s>'], ['<s>', '<s>', 'this', 'is', 'a', 'new', 'line', '.', '</s>'], ['<s>', '<s>', 'it', 'should', 'be', 'represented', 'as', 'a', 'sep

In [98]:
# Convert augmented tokens into n-grams

# Dictionaries of n-grams
unigrams = {}   # word: count
bigrams = {}    # (context1, word): count
trigrams = {}   # (c1, c2, word): count (not populated yet)
quadgrams = {}  # (c1, c2, c3, word): count (not populated yet)
grams = [unigrams, bigrams, trigrams, quadgrams]

# Count unigrams
for tokenList in AugmentedTokens[0]: #0 context words
    for word in tokenList:
        if word not in unigrams:
            unigrams[word] = 1 # Initialize count as 1
        else:
            unigrams[word] += 1 # Increment word count

print("Unigrams:", unigrams)

# Count bigrams
# Note: context is not reset between token lists, so the counts also include
# the initial (None, '<s>') pair and a ('</s>', '<s>') pair at each sentence boundary
context = None
for tokenList in AugmentedTokens[1]: # 1 context word
    for word in tokenList:
        bigram = (context, word)
        bigrams[bigram] = bigrams.get(bigram, 0) + 1 # Initialize count as 1 or increment it
        context = word

print("Bigrams:", bigrams)



Unigrams: {'<s>': 8, 'this': 4, 'document': 3, 'is': 4, 'just': 1, 'a': 3, 'sample': 1, '.': 7, '</s>': 8, 'hello': 1, 'world': 1, '!': 1, 'my': 1, 'really': 1, 'awesome': 1, 'that': 1, 'i': 1, 'love': 1, 'writing': 1, 'into': 1, 'one': 1, 'big': 1, 'long': 1, 'new': 1, 'line': 1, 'it': 2, 'should': 2, 'be': 1, 'represented': 1, 'as': 1, 'separate': 1, 'array': 1, 'in': 1, 'paragraph': 1, 'mode': 1, 'the': 1, 'end': 2, 'now': 1}
<s> <s>
<s> this
this document
document is
is just
just a
a sample
sample .
. </s>
</s> <s>
<s> <s>
<s> hello
hello world
world !
! </s>
</s> <s>
<s> <s>
<s> this
this is
is my
my really
really awesome
awesome document
document that
that i
i love
love writing
writing into
into .
. </s>
</s> <s>
<s> <s>
<s> one
one big
big long
long document
document .
. </s>
</s> <s>
<s> <s>
<s> this
this is
is a
a new
new line
line .
. </s>
</s> <s>
<s> <s>
<s> it
it should
should be
be represented
represented as
as a
a separate
separate array
array in
in paragraph
paragraph m

In [99]:
# This is where I pull randomized words out of the dictionaries

current = ""
while current != END:
    print("Test")
    #np.random.choice()
    current = END


Test


In [100]:
# Output

# This will be printed 4 times. Sentence/Paragraph splits of CNN/Shakespeare
for i in range(1,5):
    print(f"Extracted {len(grams[i-1])} unique {i}-grams")
print("Seed text:", "YYYY")
for i in range(1, 5):
    print(f"Generated {i}-gram text of length X")
    print(f"<{i}-gram text generated>")

Extracted 38 unique 1-grams
Extracted 53 unique 2-grams
Extracted 0 unique 3-grams
Extracted 0 unique 4-grams
Seed text: YYYY
Generated 1-gram text of length X
<1-gram text generated>
Generated 2-gram text of length X
<2-gram text generated>
Generated 3-gram text of length X
<3-gram text generated>
Generated 4-gram text of length X
<4-gram text generated>
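
The output above still reports 0 unique 3-grams and 4-grams, and the bigram listing includes `</s> <s>` pairs that span two sentences. One possible way to fill all four dictionaries with a single helper, while keeping every n-gram inside its own token list, is sketched below. The helper name `count_ngrams` and the two-sentence sample are hypothetical, and unlike the `unigrams` dictionary in the notebook, every key here is a tuple (unigram keys are 1-tuples such as `('hello',)`).

```python
from collections import Counter

START, END = "<s>", "</s>"

def count_ngrams(token_lists, n):
    """Count n-gram tuples without letting any n-gram span two token lists."""
    counts = Counter()
    for tokens in token_lists:
        # Zipping n staggered views of one list yields only that list's n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Hypothetical two-sentence sample, word tokenized but not yet augmented.
tokens = [["hello", "world", "!"], ["it", "should", "end", "now", "."]]

grams = []
for n in range(1, 5):
    # Same augmentation as the notebook: n start symbols and one end symbol.
    augmented = [[START] * n + sentence + [END] for sentence in tokens]
    grams.append(count_ngrams(augmented, n))

for n, counts in enumerate(grams, start=1):
    print(f"Extracted {len(counts)} unique {n}-grams")
```

Because each token list is processed on its own, the context never carries over from one sentence (or paragraph) to the next.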
