Homework 3: n-gram LM
----

Due date: October 4th, 2024

Points: 45

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data to build a functioning language model
- stress test your model (to some extent)

Complete in groups of: __one (individually)__

Allowed python modules:
- `numpy`, `matplotlib`, and all built-in python libraries (e.g. `math` and `string`)
- do not use `nltk` or `pandas`

Instructions:
- Complete outlined problems in this notebook. 
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should __and__ that all partners are included (for partner work).

<b>Helpful Links</b>
1. Object Oriented Programming in Python : https://www.geeksforgeeks.org/python-oops-concepts/
2. GradeScope FAQ : https://northeastern.instructure.com/courses/188094/pages/gradescope-faq-slash-debugging?module_item_id=10969242

Task 0: Name, References, Reflection (5 points)
---

Name: Ryan Tietjen

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

(Example)
- https://docs.python.org/3/tutorial/datastructures.html
    - Read about the basics and syntax for data structures in python.

AI Collaboration
---
Following the *AI Collaboration Policy* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for. Additionally, provide comments in-line identifying the specific sections that you used LLMs on, if you used them towards the generation of any of your answers.

I used ChatGPT-4 to assist with many portions of this assignment. Nothing in this file had ChatGPT assistance; it was used only in `lm_model.py`. In-line comments describe what I used AI assistance for and how.

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort?

I think this work mostly reflects my best effort, though I could have abstained from using AI. ChatGPT helped me significantly; I think I could have done the assignment without it, but it would have been needlessly more difficult.

2. What was/were the most challenging part(s) of the assignment?

I think generating a random sentence was the most challenging part of this assignment.

3. If you want feedback, what function(s) or problem(s) would you like feedback on and why?

I'm unsure if my perplexity function is correct.

Task 1: Berp Data Write-Up (5 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __berp__ data set.

1. Where did you get the data from? https://github.com/wooters/berp-trans?tab=readme-ov-file 
2. How was the data collected (where did the people acquiring the data get it from and how)?

The data consists of speech samples collected by recording with a Sennheiser close-talking microphone sampled at 16 kHz.

3. How large is the dataset? (# lines, # tokens)

The dataset contains 8566 utterances (lines), each an unbroken flow of words, and ~1900 unique words (tokens); overall it consists of ~443MB of audio and transcripts.

4. What is your data? (i.e. newswire, tweets, books, blogs, etc)

The data comprises 7 hours of speech audio and the corresponding text transcripts.

5. Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)

The data was collected by the International Computer Science Institute (ICSI) in Berkeley, CA.

Task 2: Implement an n-gram Language Model (30 points)
----

Implement the `LanguageModel` class as outlined in the provided `lm_starter.py` file. Do not change function signatures (the unit tests that we provide and in the autograder will break).

Your language model:
- *must* work for both the unigram and bigram cases (BONUS section, see end: 5 points are allocated to an experiment involving larger values of `n`)
    - hint: try to implement the bigram case as a generalized "n greater than 1" case
- should be *token agnostic* (this means that if we give the model text tokenized as single characters, it will function as a character language model and if we give the model text tokenized as "words" (or "traditionally"), then it will function as a language model with those tokens)
- will use Laplace smoothing
- will replace all tokens that occur only once with `<UNK>` at train time
    - do not add `<UNK>` to your vocabulary if no tokens in the training data occur only once!
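
The two training-time requirements above (singleton replacement and Laplace smoothing) can be illustrated with a minimal sketch. This is not the `LanguageModel` implementation from `lm_model.py`; the helper names `replace_singletons` and `laplace_bigram_prob` are hypothetical, and the sketch assumes that `<s>`, `</s>`, and `<UNK>` all count toward the vocabulary size and that the unigram count of the previous token serves as the context count.

```python
from collections import Counter

UNK = "<UNK>"

def replace_singletons(tokens):
    """Replace every token that occurs exactly once with <UNK> (hypothetical helper)."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else UNK for tok in tokens]

def laplace_bigram_prob(prev, word, unigram_counts, bigram_counts, vocab_size):
    """Add-one smoothed estimate of P(word | prev) (hypothetical helper)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

tokens = ["<s>", "i", "am", "sam", "</s>",
          "<s>", "sam", "i", "am", "</s>",
          "<s>", "ham", "</s>"]

# "ham" occurs once, so it becomes <UNK>; every other token is kept.
train = replace_singletons(tokens)

unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))
# Assumption: the vocabulary includes <s>, </s>, and <UNK>.
vocab = set(train)

p = laplace_bigram_prob("i", "am", unigram_counts, bigram_counts, len(vocab))
print(p)  # (2 + 1) / (2 + 6) = 0.375
```

Because `replace_singletons` only introduces `<UNK>` when some token actually occurs once, a corpus with no singletons leaves the vocabulary without `<UNK>`, which matches the last requirement in the list.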

We have provided:
- a function to read in files
- some functions to change a list of strings into tokens
- the skeleton of the `LanguageModel` class

You need to implement:
- all functions marked

You may implement:
- additional functions/methods as helpful to you

As a guideline, including comments, and some debugging code that can be run with `verbose` parameters.
Points breakdown marked in code below.

In [1]:
# rename your lm_starter.py file to lm_model.py and put in the same directory as this file
import lm_model as lm
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#Test create_ngrams
from lm_model import create_ngrams

training_data = ["<s>", "I", "love", "dogs", "</s>", "<s>", "I", "love", "cats", "</s>", "<s>", "I", "love", "dinosaurs", "</s>"]

trigrams = create_ngrams(training_data, 3)
print(trigrams)

[('<s>', 'I', 'love'), ('I', 'love', 'dogs'), ('love', 'dogs', '</s>'), ('dogs', '</s>', '<s>'), ('</s>', '<s>', 'I'), ('<s>', 'I', 'love'), ('I', 'love', 'cats'), ('love', 'cats', '</s>'), ('cats', '</s>', '<s>'), ('</s>', '<s>', 'I'), ('<s>', 'I', 'love'), ('I', 'love', 'dinosaurs'), ('love', 'dinosaurs', '</s>')]
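
The output above is consistent with a sliding window over the flat token list, which is why some trigrams span sentence boundaries (for example `('dogs', '</s>', '<s>')`). A minimal sketch of that idea follows; `sliding_ngrams` is a hypothetical name and is not claimed to be how `create_ngrams` in `lm_model.py` is written.

```python
def sliding_ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example = ["<s>", "I", "love", "dogs", "</s>"]
print(sliding_ngrams(example, 3))
# [('<s>', 'I', 'love'), ('I', 'love', 'dogs'), ('love', 'dogs', '</s>')]
```

A list of `k` tokens yields `k - n + 1` windows, so the 15-token `training_data` list gives the 13 trigrams printed above.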


In [3]:
# test the language model (unit tests)
import test_minitrainingprovided as test

# passing all these tests is a good indication that your model
# is correct. They are *not a guarantee*, so make sure to look
# at the tests and the cases that they cover. (we'll be testing
# your model against all of the testing data in addition).

# autograder points in gradescope are assigned SIXTY points
# this is essentially 60 points for correctly implementing your
# underlying model
# there are an additional 10 points manually graded for the correctness
# parts of your sentence generation

# make sure all training files are in a "training_files" directory 
# that is in the same directory as this notebook

unittest = test.TestMiniTraining()
unittest.test_createunigrammodellaplace()
unittest.test_createbigrammodellaplace()
unittest.test_unigramlaplace()
unittest.test_unigramunknownslaplace()
unittest.test_bigramlaplace()
unittest.test_bigramunknownslaplace()
# produces output
unittest.test_generateunigramconcludes()
# produces output
unittest.test_generatebigramconcludes()

unittest.test_onlyunknownsgenerationandscoring()

[['am'], ['am', 'i', 'i', 'am', 'am']]
[['sam', 'i', 'am'], ['i', 'am', 'sam', 'i', 'am', 'ham']]


In [4]:
# 10 points

# instantiate a bigram language model, train it, and generate ten sentences
# make sure your output is nicely formatted!
ngram = 2
training_file_path = "training_files/berp-training.txt"
# optional parameter tells the tokenize function how to tokenize
by_char = False
data = lm.read_file(training_file_path)
tokens = lm.tokenize(data, ngram, by_char=by_char)

# YOUR CODE HERE
model = lm.LanguageModel(ngram)
model.train(tokens)
for i in range(10):
    sentence = model.generate_sentence()
    print(' '.join(sentence))

i would like to go for dinner this sunday
i have dinner
i like to find a list
start over
can eat on the icsi
uh cheap would like to icsi
how about joshu-ya
spats
any distance
metropole
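
For readers curious how bigram generation can work, here is a minimal sketch. It is not the `generate_sentence` method from `lm_model.py`: the helpers `build_successors` and `sample_sentence` are hypothetical, and as a simplifying assumption the sketch samples from raw (unsmoothed) bigram counts and caps sentence length with a `max_len` parameter.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each token to a Counter of the tokens that follow it (hypothetical helper)."""
    successors = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1
    return successors

def sample_sentence(successors, rng, max_len=50):
    """Start at <s> and sample successors until </s> or max_len tokens."""
    sentence = []
    current = "<s>"
    while len(sentence) < max_len:
        options = successors[current]
        if not options:  # no observed successor: stop rather than fail
            break
        words = list(options)
        weights = list(options.values())
        # Draw the next token in proportion to its bigram count.
        current = rng.choices(words, weights=weights, k=1)[0]
        if current == "</s>":
            break
        sentence.append(current)
    return sentence

tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
successors = build_successors(tokens)
rng = random.Random(0)  # seeded only so the sketch is repeatable
sentence = sample_sentence(successors, rng)
print(" ".join(sentence))
```

Stopping as soon as `</s>` is drawn keeps the boundary tokens out of the printed sentence, which is the kind of formatting shown in the output above.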


In [5]:
# 5 points

# evaluate your bigram model on the test data
# score each line in the test data individually, then calculate the average score
# you need not re-train your model
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

scores = []

# YOUR CODE HERE
for line in test_data:
    tokens = lm.tokenize([line], ngram, by_char=by_char)
    score = model.score(tokens)
    scores.append(score)


# Print out the mean score and standard deviation
# for words-as-tokens, these values should be
# ~4.9 * 10^-5 and 0.000285
print("Mean Score:", np.mean(scores))
print("Standard Deviation:", np.std(scores))


Mean Score: 4.9620823627262653e-05
Standard Deviation: 0.000285298086084196


In [6]:
# 15 points total

import pandas as pd

df = pd.read_csv('jeopardy_csv.csv')
df = df.astype(str)
df.columns = df.columns.str.replace(" ", '_') #Column names have spaces in front of them for some reason
n_grams = 2


# see if you can train your model on the data you found for your first homework (5 points)
text_data = df["_Question"].tolist()
tokens = lm.tokenize(text_data, n_grams, by_char=False)
model = lm.LanguageModel(n_grams)
model.train(tokens)


# what is the maximum value of n <= 10 that you can train a model *in your programming environment* in a reasonable amount of time? (less than 3 - 5 minutes)
# On my hardware, this takes < 30 seconds
n_grams = 10
tokens = lm.tokenize(text_data, n_grams, by_char=False)
model = lm.LanguageModel(n_grams)
model.train(tokens)


# generate three sentences with this model (10 points)
for _ in range(3):
    sentence = model.generate_sentence() 
    print(' '.join(sentence))


Cancer is the crab that attacked this hero who was battling the Hydra
It's the "insect" newspaper published in California's state capital
Fittingly, this Maryland fort was built in a star shape


BONUS
----
Implement the corresponding function and evaluate the perplexity of your model on the first 20 lines in the test data for values of `n` from 1 to 3. Perplexity should be individually calculated for each line.
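
As a reminder of the quantity being computed, perplexity of a sequence is the inverse probability normalized by length, PP = P(w_1 ... w_N)^(-1/N). The sketch below computes it in log space from a list of per-token probabilities. The function name `perplexity_from_probs` is hypothetical, and which tokens count toward N (for example, whether `<s>` and `</s>` are included) is an assumption that depends on the course's convention.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity of a sequence given the model probability of each token.

    Computed as exp(-(1/N) * sum(log p_i)), which equals the product of
    the probabilities raised to the power -1/N but avoids underflow.
    """
    n = len(probs)
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / n)

# A model that assigns 1/4 to every token has perplexity 4.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
```

Working with summed logs rather than a raw product matters for longer lines, where multiplying many small probabilities can underflow to zero.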

In [7]:
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

for ngram in range(1, 4):
    print("********")
    print("Ngram model:", ngram)
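    # NOTE: text_data still holds the Jeopardy questions from In [6], so each model here is trained on Jeopardy text, not the berp training file
    tokens = lm.tokenize(text_data, ngram, by_char=False)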
    model = lm.LanguageModel(ngram)
    model.train(tokens)
    
    for line in test_data[:20]:
        tokenized_line = lm.tokenize([line], ngram)
        print(f"Perplexity: {round(model.perplexity(tokenized_line), 3)} - {line.strip()}")

********
Ngram model: 1
Perplexity: 2524.214 - a vegetarian meal
Perplexity: 13118.762 - about ten miles
Perplexity: 9397.729 - and i'm willing to drive ten miles
Perplexity: 11003.382 - and this will be for dinner
Perplexity: 20511.115 - are any of these restaurants open for breakfast
Perplexity: 21954.323 - are there russian restaurants in berkeley
Perplexity: 16386.394 - between fifteen and twenty dollars
Perplexity: 15568.036 - can you at least list the nationality of these restaurants
Perplexity: 10766.743 - can you give me more information on viva taqueria
Perplexity: 32631.478 - dining
Perplexity: 4405.348 - display sizzler
Perplexity: 21289.683 - do you have indonesian food
Perplexity: 3377.835 - do you know any pizza places
Perplexity: 43146.106 - doesn't matter
Perplexity: 1757.782 - eat on a weekday
Perplexity: 10240.824 - eight dollars
Perplexity: 19626.228 - expensive
Perplexity: 6261.899 - five miles
Perplexity: 16049.226 - give me the list of restaurants in berkeley
Perp

1. What are the common attributes of the test sentences that cause very high perplexity? 

__It seems that longer test sentences tend to have higher perplexity; it follows that more complex sentences do as well. Test sentences that are questions also seem to have higher perplexity, and higher values of n result in higher perplexities.__

5 points in this assignment are reserved for overall style (both for writing and for code submitted). All work submitted should be clear, easily interpretable, and checked for spelling, etc. (Re-read what you write and make sure it makes sense). Course staff are always happy to give grammatical help (but we won't pre-grade the content of your answers).