Homework 3: n-gram LM
----

Due date: February 7th, 2024

Points: 45

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data to build a functioning language model
- stress test your model (to some extent)

Complete in groups of: __one (individually)__

Allowed python modules:
- `numpy`, `matplotlib`, and all built-in python libraries (e.g. `math` and `string`)
- do not use `nltk` or `pandas`

Instructions:
- Complete outlined problems in this notebook. 
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should __and__ that all partners are included (for partner work).


Task 0: Name, References, Reflection (5 points)
---

Name: Eric Chen

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

(Example)
https://web.stanford.edu/~jurafsky/icslp-red.pdf
    - read more about the Berp data set
https://www.geeksforgeeks.org/string-slicing-in-python/
    - learned how to use string slicing in python
https://www.w3schools.com/python/pandas/pandas_csv.asp
    - relearned how to parse the text from a specific column in a .csv file

AI Collaboration
---
Following the *AI Collaboration Policy* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for. Additionally, provide comments in-line identifying the specific sections that you used LLMs on, if you used them towards the generation of any of your answers.

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort? Yes, I spent a lot of time understanding and documenting this assignment.
2. What was/were the most challenging part(s) of the assignment? The most challenging portion of this assignment was implementing the generative model.
3. If you want feedback, what function(s) or problem(s) would you like feedback on and why? I want feedback on whether I implemented the generative model correctly.

Task 1: Berp Data Write-Up (5 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __berp__ data set.

1. Where did you get the data from? https://www1.icsi.berkeley.edu/Speech/berp.html
2. How was the data collected (where did the people acquiring the data get it from and how)? 
It was collected by recording and listening to non-native English speakers in different situations.
3. How large is the dataset? (# lines, # tokens) 
This data set has 7500 lines and 1500 tokens.
4. What is your data? (i.e. newswire, tweets, books, blogs, etc) 
It consists of English speech.
5. Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)
Nelson Morgan and the International Computer Science Institute.
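
The line and token counts reported in question 3 can be checked with a few lines of built-in Python. This is a minimal sketch, assuming one utterance per line and whitespace tokenization; `count_lines_and_tokens` is a hypothetical helper and is not part of `lm_starter.py`.

```python
def count_lines_and_tokens(lines):
    """Return (number of non-empty lines, number of whitespace-separated tokens)."""
    non_empty = [line for line in lines if line.strip()]
    n_tokens = sum(len(line.split()) for line in non_empty)
    return len(non_empty), n_tokens


# Toy input standing in for the contents of a training file (assumption).
sample = ["i want thai food", "", "do you have indian food"]
n_lines, n_tokens = count_lines_and_tokens(sample)
print(f"{n_lines} lines, {n_tokens} tokens")
```

Running the same function on the list of lines read from the real training file would give the numbers to report.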

Task 2: Implement an n-gram Language Model (30 points)
----

Implement the `LanguageModel` class as outlined in the provided `lm_starter.py` file. Do not change function signatures (the unit tests that we provide and in the autograder will break).

Your language model:
- *must* work for both the unigram and bigram cases (in the BONUS section at the end, 5 points are allocated to an experiment involving larger values of `n`)
    - hint: try to implement the bigram case as a generalized "n greater than 1" case
- should be *token agnostic* (this means that if we give the model text tokenized as single characters, it will function as a character language model and if we give the model text tokenized as "words" (or "traditionally"), then it will function as a language model with those tokens)
- will use Laplace smoothing
- will replace all tokens that occur only once with `<UNK>` at train time
    - do not add `<UNK>` to your vocabulary if no tokens in the training data occur only once!
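
The `<UNK>` replacement and Laplace smoothing requirements above can be illustrated on a toy corpus. This is only a sketch under stated assumptions, not the expected solution: the helper names `replace_singletons` and `laplace_bigram_prob` are hypothetical, the vocabulary is taken to be every distinct training token after replacement (including the sentence markers), and bigrams are counted over one flat token list.

```python
from collections import Counter

UNK = "<UNK>"


def replace_singletons(tokens):
    """Replace every token that occurs exactly once with UNK."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else UNK for tok in tokens]


def laplace_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """P(word | prev) with add-one smoothing: (c(prev, word) + 1) / (c(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)


tokens = "<s> i am sam </s> <s> sam i am </s> <s> i like ham </s>".split()
train = replace_singletons(tokens)  # "like" and "ham" each occur once
unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))
vocab_size = len(unigram_counts)  # assumption: |V| counts every distinct token

print(train)
print(laplace_bigram_prob("i", "am", bigram_counts, unigram_counts, vocab_size))
```

Because `Counter` returns 0 for unseen keys, an unseen bigram still gets a small non-zero probability, which is the point of the smoothing. If no token occurs exactly once, `replace_singletons` never emits `<UNK>`, so it does not enter the vocabulary.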

We have provided:
- a function to read in files
- some functions to change a list of strings into tokens
- the skeleton of the `LanguageModel` class

You need to implement:
- all functions marked

You may implement:
- additional functions/methods as helpful to you

As a guideline, our solution is ~300 lines (about 120 lines more than the starter code), including comments and some debugging code that can be run with `verbose` parameters.

Points breakdown marked in code below.

In [27]:
# rename your lm_starter.py file to lm_model.py and put in the same directory as this file
import lm_model as lm
from lm_model import LanguageModel
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [13]:
# test the language model (unit tests)
import test_minitrainingprovided as test

# passing all these tests is a good indication that your model
# is correct. They are *not a guarantee*, so make sure to look
# at the tests and the cases that they cover. (we'll be testing
# your model against all of the testing data in addition).

# the Gradescope autograder assigns SIXTY points
# this is essentially 60 points for correctly implementing your
# underlying model
# an additional 10 points are manually graded for the correctness
# of your sentence generation

# make sure all training files are in a "training_files" directory 
# that is in the same directory as this notebook
unittest = test.TestMiniTraining()
unittest.test_createunigrammodellaplace()
unittest.test_createbigrammodellaplace()
unittest.test_unigramlaplace()
unittest.test_unigramunknownslaplace()
unittest.test_bigramlaplace()
unittest.test_bigramunknownslaplace()
# produces output
unittest.test_generateunigramconcludes()
# produces output
unittest.test_generatebigramconcludes()

unittest.test_onlyunknownsgenerationandscoring()

[['<s>', 'ham', 'ham', 'sam', 'i', 'i', 'i', 'am', 'i', 'sam', 'am', 'ham', 'i', 'i', 'am', '</s>'], ['<s>', 'ham', 'i', 'ham', 'sam', 'am', '</s>']]
[['<s>', 'i', 'am', 'sam', 'i', 'am', '</s>'], ['<s>', 'i', 'am', '</s>']]


In [45]:
# 10 points

# instantiate a bigram language model, train it, and generate ten sentences
# make sure your output is nicely formatted!
ngram = 2
training_file_path = "training_files/berp-training.txt"
# optional parameter tells the tokenize function how to tokenize
by_char = False
data = lm.read_file(training_file_path)
tokens = lm.tokenize(data, ngram, by_char=by_char)

# YOUR CODE HERE
testLM = LanguageModel(ngram)
testLM.train(tokens)
sentences = testLM.generate(10)
for sentence in sentences:
    print(' '.join(sentence))

<s> i don't want some mexican place </s>
<s> i want an italian restaurants within walking distance from icsi </s>
<s> for a nice </s>
<s> do you have indian food </s>
<s> and not more about oliveto's </s>
<s> can you know about caffe nefeli </s>
<s> uh uh inexpensive chinese food </s>
<s> tell me the eiffel at most five dollars or no more about indian food </s>
<s> i want to get a reservation for lunch </s>
<s> okay what's with south american </s>


In [63]:
# 5 points

# evaluate your bigram model on the test data
# score each line in the test data individually, then calculate the average score
# you need not re-train your model
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)
scores = []

# YOUR CODE HERE
tokens = [lm.tokenize_line(line, ngram, by_char) for line in test_data]
for sequence in tokens:
    scores.append(testLM.score(sequence))

# Print out the mean score and standard deviation
# for words-as-tokens, these values should be
# ~4.9 * 10^-5 and 0.000285
print(f"mean is {sum(scores)/len(scores)}")
print(f"standard deviation is {np.std(scores)}")

mean is 4.962082362726267e-05
standard deviation is 0.000285298086084196
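
Scores this small come from multiplying many per-token probabilities together. A common way to keep that product numerically stable is to sum logs and exponentiate at the end. The sketch below assumes that `score` is the product of the n-gram probabilities in a sequence; `sequence_probability` and the probabilities are hypothetical and do not come from the Berp model.

```python
import math


def sequence_probability(token_probs):
    """Multiply per-token probabilities by summing their logs, then exponentiate."""
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(log_prob)


# Hypothetical per-token probabilities for one short sentence (assumption).
probs = [0.2, 0.1, 0.5, 0.25]
print(sequence_probability(probs))  # log-space result
print(math.prod(probs))             # direct product, for comparison
```

For short sentences the two results agree up to floating-point error; the log-space version matters for long sequences, where the direct product can underflow to 0.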


In [68]:
# 15 points total

# see if you can train your model on the data you found for your first homework (5 points)
reader = pd.read_csv('training_files/news_data.csv')
reader['description'] = reader['description'].str.replace('\xad', '')
descriptions = reader['description'].str.split()

tokens = []
for description in descriptions:
    sentence = ' '.join(description)
    tokens.extend(lm.tokenize_line(sentence, 2, by_char)) 
newsLM = LanguageModel(3)
newsLM.train(tokens)
# what is the maximum value of n <= 10 that you can train a model *in your programming environment* in a reasonable amount of time? (less than 3 - 5 minutes)
# Beyond n = 5, training becomes very slow

# generate three sentences with this model (10 points)
generatedNews = newsLM.generate(5)
for line in generatedNews:
    print(' '.join(line))

<s> Some Palestinians say they endured </s>
<s> Advocates say US assessment aims to create unity government and war crimes in Gaza. </s>
<s> A Palestinian man was killed at least five and wounding dozens. </s>
<s> ICC case </s>
<s> Why those who were released by Hamas at UN meetings but can’t vote on Gaza when civilians were being killed at least 13 people. </s>


BONUS
----
Implement the corresponding function and evaluate the perplexity of your model on the first 20 lines in the test data for values of `n` from 1 to 3. Perplexity should be individually calculated for each line.
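
As a reminder of the quantity being computed, perplexity is the inverse probability of a sequence normalized by its length: $PP(W) = P(w_1 \dots w_N)^{-1/N}$. The sketch below computes it in log space from per-token probabilities. It is an illustration only: the standalone `perplexity` function here is hypothetical, and whether the sentence markers count toward $N$ is a convention that the starter code and tests may define differently.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) for N per-token probabilities."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)


# Hypothetical per-token probabilities (assumption, not from the Berp model).
uniform = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform))  # a uniform choice among 4 tokens gives a perplexity of about 4
```

A model that assigns lower probabilities to the tokens it sees gets a higher perplexity, so lower values indicate a better fit to the test line.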

In [31]:
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

for ngram in range(1, 4):
    print("********")
    print("Ngram model:", ngram)
    # YOUR CODE HERE
    testLM2 = LanguageModel(ngram)
    tokens = lm.tokenize(test_data, ngram, by_char=by_char)
    testLM2.train(tokens)
    # score the first 20 test lines, one perplexity value per line
    for line in test_data[:20]:
        sequence = line.split()
        print(f"{sequence} {testLM2.perplexity(sequence)}")

********
Ngram model: 1
['a', 'vegetarian', 'meal'] 16.485574828308334
['about', 'ten', 'miles'] 133.87546239124416
['and', "i'm", 'willing', 'to', 'drive', 'ten', 'miles'] 84.51927038391058
['and', 'this', 'will', 'be', 'for', 'dinner'] 90.74578481246313
['are', 'any', 'of', 'these', 'restaurants', 'open', 'for', 'breakfast'] 113.20685347674052
['are', 'there', 'russian', 'restaurants', 'in', 'berkeley'] 112.8128146375681
['between', 'fifteen', 'and', 'twenty', 'dollars'] 163.60861696023719
['can', 'you', 'at', 'least', 'list', 'the', 'nationality', 'of', 'these', 'restaurants'] 82.58128027464987
['can', 'you', 'give', 'me', 'more', 'information', 'on', 'viva', 'taqueria'] 67.66148056419068
['dining'] 7.694915254237287
['display', 'sizzler'] 48.259655513508235
['do', 'you', 'have', 'indonesian', 'food'] 83.61177691226855
['do', 'you', 'know', 'any', 'pizza', 'places'] 108.69429900206879
["doesn't", 'matter'] 7.694915254237287
['eat', 'on', 'a', 'weekday'] 51.173547570645106
['eight', 

1. What are the common attributes of the test sentences that cause very high perplexity? They are usually the longer sentences and seem to have the most context. Also, as the n-gram size increases, the average perplexity decreases.

5 points in this assignment are reserved for overall style (both for writing and for code submitted). All work submitted should be clear, easily interpretable, and checked for spelling, etc. (Re-read what you write and make sure it makes sense). Course staff are always happy to give grammatical help (but we won't pre-grade the content of your answers).