## Homework 4: Neural Language Models (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 4, Option B


### Names

---

Names: Jason Cheung, Robert Levin


## Task 4: Compare your generated sentences (25 points)

In this task, you'll analyze one of the models that you produced in Task 3. You'll need to compare against the corresponding file that was generated from the vanilla n-gram language model.

Choose _**one**_ of option A or B (this notebook).


## Option B: Evaluate the generated sentences of _word_-based models

Your job for this option is to measure the quality of your generated sentences for word-based models. For this option you _must_ survey at least 3 people who are **not** in this course. They need to speak and read the language that you are evaluating, but they need not be native speakers.

You will evaluate the quality of the generated sentences in the following way:

- Generate 20 sentences from your best word-based neural model. (The hyperparameter values and the value of n are up to you.)
- Using the same n-gram order, pair these sentences with the provided sentences from the vanilla n-gram model. If you want to evaluate a model with N != 3, 4, or 5, then you'll need to train your vanilla n-gram model and generate your own comparison sentences. Ignore sentences with \<UNK\> in them for an even comparison, so you'll need to over-generate to get 20.
  - Pair them (roughly) based on sentence length, so that the two sentences in each pair have a roughly similar number of tokens.
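
One possible way to do the filtering and length-based pairing described above is sketched here. The helper name `pair_by_length` and its parameters are hypothetical, tokens are assumed to be whitespace-separated, and pairing by length rank is only one reasonable reading of "roughly similar".

```python
def pair_by_length(neural, vanilla, num_pairs=20, unk="<UNK>"):
    """Pair sentences from two models by token-count rank (illustrative sketch)."""

    def keep_and_sort(sentences):
        # Drop empty lines and any sentence containing the unknown-word token.
        kept = [s.strip() for s in sentences if s.strip() and unk not in s]
        if len(kept) < num_pairs:
            raise ValueError("not enough sentences left; over-generate and retry")
        # Keep the first num_pairs survivors, then order them by number of tokens.
        return sorted(kept[:num_pairs], key=lambda s: len(s.split()))

    # Shortest is paired with shortest, second shortest with second shortest, etc.
    return list(zip(keep_and_sort(neural), keep_and_sort(vanilla)))


# Toy input, made up for illustration only.
toy_pairs = pair_by_length(
    ["a b c", "a <UNK>", "a"],
    ["x", "x y z w", "x y"],
    num_pairs=2,
)
```

Sorting both lists and zipping them keeps the length gap within each pair small without needing an explicit matching step.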

Next, build a survey. For each pair of (neural LM sentence, vanilla n-gram LM sentence), you'll ask the survey taker three binary selection questions:

1. Which sentence is more grammatical?
2. Which sentence makes more sense, semantically (in meaning)?
3. Overall, which sentence do you prefer?

Finally, you'll evaluate your survey results **programmatically** (export them as a csv). Calculate the following:

1. What percentage of neural vs. vanilla n-gram LM sentences were preferred, separated along each of the three dimensions?
2. What is [Krippendorff's alpha](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) for your survey data?
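
As a rough sketch of the first calculation, the snippet below tallies preference percentages per question from a CSV export. The long-format layout and the column names (`respondent`, `pair`, `question`, `choice`) are assumptions made for illustration; a real survey export will likely need reshaping first.

```python
import csv
import io
from collections import Counter, defaultdict


def preference_shares(csv_file):
    """Return {question: {"A": percent, "B": percent}} from a long-format CSV."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        counts[row["question"]][row["choice"]] += 1

    shares = {}
    for question, tally in counts.items():
        total = sum(tally.values())
        shares[question] = {label: 100 * tally[label] / total for label in ("A", "B")}
    return shares


# Made-up rows standing in for survey_results.csv (hypothetical layout).
sample = io.StringIO(
    "respondent,pair,question,choice\n"
    "r1,1,grammatical,A\n"
    "r2,1,grammatical,B\n"
    "r1,1,overall,B\n"
    "r2,1,overall,B\n"
)
toy_shares = preference_shares(sample)
```

With a real file, the same function could be called as `preference_shares(open('survey_results.csv', newline=''))`, provided the columns match the assumed layout.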

You are welcome to use a pre-built Python implementation of the Krippendorff's alpha calculation, such as [this one](https://pypi.org/project/krippendorff/). Krippendorff's alpha is one way to measure inter-annotator agreement, that is, the extent to which your survey respondents agree with one another.
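
To make the quantity concrete, here is a minimal pure-Python sketch of Krippendorff's alpha for nominal labels, built from the coincidence-matrix definition. The function name `nominal_alpha` and the input layout (one list of labels per rated item) are choices made for this illustration and are not the API of the `krippendorff` package.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(units):
    """Krippendorff's alpha for nominal data.

    units: one list of labels per rated item; missing ratings are simply omitted.
    """
    # Coincidence matrix: every ordered pair of ratings within a unit,
    # weighted by 1 / (number of ratings in that unit - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # items with a single rating carry no agreement information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        return float("nan")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label was ever used, so alpha is undefined
    return 1 - observed / expected


perfect = nominal_alpha([["A", "A"], ["B", "B"]])  # two raters who always agree
opposed = nominal_alpha([["A", "B"], ["B", "A"]])  # two raters who never agree
```

A value of 1 means perfect agreement, values near 0 mean agreement no better than chance, and negative values mean systematic disagreement.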

You will submit your survey data (as a csv called `survey_results.csv`) **and** your paired sentences (`paired.txt`, formatted in a way that is easy to understand) alongside this notebook.


In [10]:
%pip install krippendorff

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



In [14]:
# your imports here

import krippendorff
import numpy as np

In [12]:
# your code here

word_neural_txt = 'generated_wordbased.txt'
vanilla_ngram_txt = 'vanilla_ngram_output.txt'

# read the files
with open(word_neural_txt, 'r') as f:
    word_neural = f.readlines()

with open(vanilla_ngram_txt, 'r') as f:
    vanilla_ngram = f.readlines()

NUM_SENTENCES = 20
assert len(word_neural) >= NUM_SENTENCES and len(vanilla_ngram) >= NUM_SENTENCES

# pair up sentences
paired_sentences = list(zip(word_neural[:NUM_SENTENCES], vanilla_ngram[:NUM_SENTENCES]))
 
# save to file
PAIRED_FILENAME = 'paired.txt'
with open(PAIRED_FILENAME, 'w') as f:
    for neural_sentence, vanilla_sentence in paired_sentences:
        f.write(f'A: "{neural_sentence.strip()}"\n')
        f.write(f'B: "{vanilla_sentence.strip()}"\n')
        f.write('\n')

Make sure that your reported results are nicely formatted!


# Evaluation Results

---

## A: "yet only back in my ben pressed ; so late materials ; ernest i not met that several , and"

## B: "i'd dinner apple cancun stock you're village mcdonald's cheap frozen stuff time we cheval cinnamon pancakes whatever plate neighborhood"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: B, B

---

## A: "eh let me obliged pleasures of look by has no longer of breaking , in bent ."

## B: "hello previous p\_\_m southern southern and japanese delicatessen town half give willing pie ones village twelve tibetan pharmacy changed"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, B
- **Overall, which sentence do you prefer?**: B, B

---

## A: "it would become considering of it was positive , concealment out the eyes ; and my shoulder and why ."

## B: "i'm last hours stupid spending junk see inexpensive below how spats some serves bad maxim's jefferson you're hour couple"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, B
- **Overall, which sentence do you prefer?**: B, B

---

## A: "for a damp the cobbler or morning ."

## B: "gave fourth bordeaux costs saturday santa-fe full glass bistro o'clock cheeseburger blocks quit file buffet restaurant amount sundays lilly's"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: A, A

---

## A: "the poor of genius , when it i had lost about recovered does overcast on a coming matted ."

## B: "uh takes you're does money hong-fu are tuesdays meal between enoteca-mastro pasand apple seven come bucci korean spaghetti siam"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "`` yes forensica towards my birth of glory . '' of the eye an flood"

## B: "give best hundred coming bike since please spending smoking also meat right milkshakes burger display saul's question yoshi's don't"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, B
- **Overall, which sentence do you prefer?**: B, B

---

## A: "`` she 's friend , but the horrors worshippers of good , safie , lock , that they could not"

## B: "the hong-kong fourth buffet where's little east return telephone iranian australian hi hut mario's"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "for the bright landscape the whole reply , to you have taken that for the purpose ."

## B: "any middle fat neighborhood query mid seven weekend ay-caramba soup taqueria decided taiwan else mediterraneum care work kidding thai"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "when i reach to resistance wisdom , he was believed into the ."

## B: "cafe steps alcohol fancy lox blondie's apple's course it steaks four snack ivre drive cedar duck nakapan increase short"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "`` his years ; for the singularity , though by people , '"

## B: "five address taqueria called chez-panisse huh sushi ethiopian martin meals provide village greasy westside top bicycle than miles sixty"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: A, A

---

## A: "what gentlemen , for any but my senses are very a reason of falsehood watched sustained ."

## B: "doesn't days wrong first please korean interested arinell korean berkeley between relatively minutes' which bateau amaru thinking life places"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "it was profound ."

## B: "give cha-am course bongo dinner tacos enoteca-mastro okay steak list that spenger's pay rick ice nearby california your roast"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "`` and more terrible than the greenland are played how in a stands quaver listen to convey again i lived"

## B: "i want you dim-sum annex minute friend luther neighborhood river fifteen if computer nadine's icksee giovanni's your hungarian two"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "once ye things arguments to fear man ."

## B: "to dammit vin mall casbah wanna already mar-mara pharmacy why service think change style yoshi's if british hong-fu vasiliki"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, A
- **Overall, which sentence do you prefer?**: A, A

---

## A: "i should scarcely draw closer not doubt that reminded and filled ; he has not if especial shew and stream"

## B: "i've through diner stock a-go-go snacks work only am edy's wondering mediterranean either new vary done fried round town"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, B
- **Overall, which sentence do you prefer?**: B, B

---

## A: "at the the one said i shall not pleased the evidence of use , too , `` a great somewhat"

## B: "how kirala closer house caribbean jones hello instead o'clock lots sort vietnamese not afternoon icksee bad brother's visit middle"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: A, A

---

## A: "the ship time , and sheet of one month , and although a impulse railway and there hours low its"

## B: "i croissants bart problem three preferably ninety sea bakeshop we cantonese thirteen am teashop wanted nefeli insist lococo's zachary's"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "i knew any some frame grew crest flows more than two more in our hideous in stench , though not"

## B: "where types recommend freeway casbah alcohol matter view cedar plearn's back pizza i've asked head icsi short around monday"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "whether soon sufficient suggest against the world , but a hundred all accident put to feel you leave to his"

## B: "start beef eiffel spend buffet brunch cheval things fifty is stock serves minutes closest slow wednesday sujatha's area during"

- **Which sentence is more grammatical?**: A, A
- **Which sentence makes more sense, semantically (in meaning)?**: B, A
- **Overall, which sentence do you prefer?**: B, A

---

## A: "the misfortune devil on the doctor , while in her thoughts sally together kept gives as white , now being"

## B: "spats prefer excellent turkey everywhere american but visit german paying tuesdays de another la la-mediterranee bike enjoy sandwiches australian"

- **Which sentence is more grammatical?**: B, A
- **Which sentence makes more sense, semantically (in meaning)?**: A, A
- **Overall, which sentence do you prefer?**: B, A


In [13]:
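raw_data = [
    # Each row holds the answers listed first under each of the 20 pairs above
    # for one survey question ("A" = neural sentence, "B" = vanilla n-gram sentence).

    # Row 1: Which sentence is more grammatical?
    ["B", "B", "B", "A", "A", "B", "A", "A", "B", "B", "A", "A", "B", "A", "A", "B", "A", "B", "A", "B"],

    # Row 2: Which sentence makes more sense, semantically (in meaning)?
    ["B", "B", "A", "B", "B", "B", "B", "A", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B", "A"],

    # Row 3: Overall, which sentence do you prefer?
    ["B", "B", "B", "A", "B", "B", "B", "B", "B", "A", "B", "B", "B", "A", "B", "A", "B", "B", "B", "B"]
]

# Encode the nominal labels as integers: A -> 0, B -> 1
data = [[0 if x == 'A' else 1 for x in row] for row in raw_data]

# Shape (3, 20): 3 rows (questions) x 20 columns (sentence pairs)
data = np.array(data)

# krippendorff.alpha treats each row of reliability_data as one coder and each
# column as one unit, so with this layout the value reflects agreement across
# the three questions rather than across survey respondents.
alpha = krippendorff.alpha(reliability_data=data, level_of_measurement='nominal')
print(alpha)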



-0.05861244019138767
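
For inter-respondent agreement, the reliability matrix would instead need one row per respondent. The sketch below transcribes the two answers listed per question in the results above and tallies how often the neural sentence (A) was chosen along each dimension. It assumes that the first letter on each result line always belongs to one respondent and the second letter to another; the variable names are illustrative.

```python
# One string per respondent, one character per sentence pair (pairs 1 to 20),
# transcribed from the listing above. Assumption: the order of the two letters
# on each result line is consistent across all pairs and questions.
responses = {
    "grammatical": ["BBBAABAABBAABAABABAB", "A" * 20],
    "semantic": ["BBABBBBAABBABAABBABA", "ABBAABAAAAAAAABAAAAA"],
    "overall": ["BBBABBBBBABBBABABBBB", "BBBAABAAAAAAAABAAAAA"],
}


def neural_share(answer_strings):
    """Percentage of all answers that picked sentence A (the neural model)."""
    answers = "".join(answer_strings)
    return 100 * answers.count("A") / len(answers)


shares = {question: neural_share(rows) for question, rows in responses.items()}
for question, share in shares.items():
    print(f"{question}: neural {share:.1f}%, vanilla n-gram {100 - share:.1f}%")

# Respondent-by-unit matrix: 2 rows (respondents) x 60 columns (20 pairs x 3 questions),
# encoded as A -> 0 and B -> 1. A matrix in this layout could be passed as
# reliability_data to krippendorff.alpha(..., level_of_measurement='nominal').
reliability_data = [
    [0 if choice == "A" else 1 for question in responses for choice in responses[question][r]]
    for r in range(2)
]
```

Computing alpha separately for each question, by passing only that question's 20 columns, would show which dimension the respondents disagree on most.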
