In [1]:
import nltk
import re

# Assignment 3: $N$-gram language models

*Instructions*: Download this notebook and write code to answer the questions associated with the blank code blocks below. Upload your modified notebook using the Canvas link by **11:59pm on Tuesday, March 19**.

Some questions ask you to provide written responses in markdown cells of the notebook. Do not worry about formatting these cells neatly. Just provide the requested information in the indicated space. If you want additional information on Jupyter markdown, you can refer to this __[cheat sheet](https://www.kaggle.com/code/cuecacuela/the-ultimate-markdown-cheat-sheet)__.


# Question 1

We are given the following corpus, modified from the one from readings and lectures:
```
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
```
Based on this corpus, **calculate by hand** the following bigram probabilities using both (i) **maximum likelihood estimates** and (ii) **add-one smoothing**. Include `<s>` and `</s>` in your counts just like any other token.

- `P(Sam | am)`
- `P(eggs | green)`
- `P(eggs | like)`

Include your answers in the markdown cell below. You'll need to provide six estimates in total.

**Write your answers here:**

P(Sam | am):
(i): 2/3
(ii): 3/14

P(eggs | green):
(i): 1
(ii): 1/6

P(eggs | like):
(i): 0
(ii): 1/12
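
As a cross-check on the hand calculations, the following minimal sketch recomputes the six estimates from the toy corpus. It assumes that the vocabulary size `V` used for add-one smoothing counts all 11 distinct tokens, including `<s>` and `</s>`. The helper names `mle` and `add_one` are illustrative and not part of the assignment.

```python
from collections import Counter
from fractions import Fraction

# Toy corpus from Question 1, with boundary tags kept as ordinary tokens.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]
tokens = [s.split() for s in sentences]

unigram_counts = Counter(w for sent in tokens for w in sent)
bigram_counts = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
V = len(unigram_counts)  # vocabulary size (assumed to include <s> and </s>)


def mle(word, prev):
    """Maximum likelihood estimate: C(prev word) / C(prev)."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])


def add_one(word, prev):
    """Add-one (Laplace) estimate: (C(prev word) + 1) / (C(prev) + V)."""
    return Fraction(bigram_counts[(prev, word)] + 1, unigram_counts[prev] + V)


for word, prev in [("Sam", "am"), ("eggs", "green"), ("eggs", "like")]:
    print(f"P({word} | {prev}): MLE = {mle(word, prev)}, add-one = {add_one(word, prev)}")
```

Exact fractions are used so the printed values can be compared directly with the hand-written answers.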

# Question 2

For this question, you will train unigram and bigram language models based on the __[Berkeley Restaurant Project](https://github.com/wooters/berp-trans/blob/master/README.md)__ (BeRP) corpus and answer questions based on those language models. Be sure to download the `transcript.txt` file from Canvas and save it in the same folder as this notebook.

The code block below reads the corpus from the `transcript.txt` file and does some data cleaning.  The variable `corpus` is a list of utterances, each utterance being represented as a list of tokens.

In [13]:
# Read both files, closing each handle once its lines are loaded
with open("oneRate.txt", "r") as f:
    berp = f.readlines()
with open("fiveRate.txt", "r") as f:
    berp2 = f.readlines()

corpus = []
for line in berp:
    line = re.sub(r'^[\d\w_]+\s+', '', line).strip()  # remove audio file info
    line = re.sub(r'<[\w\']+>', '', line)             # remove verbal repairs
    line = re.sub(r'\w+-', '', line)                  # remove fragments
    line = re.sub(r'\*([\w\']+)\*', r'\1', line)      # remove mispronunciation markers
    corpus.append(line.split())  

corpus2 = []
for line in berp2:
    line = re.sub(r'^[\d\w_]+\s+', '', line).strip()  # remove audio file info
    line = re.sub(r'<[\w\']+>', '', line)             # remove verbal repairs
    line = re.sub(r'\w+-', '', line)                  # remove fragments
    line = re.sub(r'\*([\w\']+)\*', r'\1', line)      # remove mispronunciation markers
    corpus2.append(line.split()) 
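
To see what each cleaning step does, here is a small sketch that applies the same four substitutions to a single made-up line. The sample line and its `sample_001` prefix are hypothetical and not taken from the corpus files, and the `clean` helper name is an assumption for illustration. The replacement in the last substitution is written as the raw string `r'\1'` so that it acts as a backreference to the captured word; the plain string `'\1'` would instead be the control character `\x01`.

```python
import re


def clean(line):
    """Apply the four cleaning substitutions and split into tokens."""
    line = re.sub(r'^[\d\w_]+\s+', '', line).strip()  # remove leading file/utterance ID
    line = re.sub(r'<[\w\']+>', '', line)             # remove <...> verbal repairs
    line = re.sub(r'\w+-', '', line)                  # remove word fragments like "chi-"
    line = re.sub(r'\*([\w\']+)\*', r'\1', line)      # keep the word, drop the * markers
    return line.split()


# Hypothetical input line (not from the corpus).
sample = "sample_001 i'd like <uh> a chi- chinese *resturant* please"
print(clean(sample))

# A non-raw '\1' is the control character \x01, not a backreference.
print(repr(re.sub(r'\*([\w\']+)\*', '\1', "*food*")))
```

Note that the first substitution removes the leading whitespace-delimited token unconditionally, so on a file whose lines do not begin with an ID it would strip the first word of each line.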

In [4]:
# View the first utterance in the corpus:
print(corpus[0])

['is', 'just', 'awful.', 'His', 'lecture', 'is', 'randomly', 'oriented,', 'he', 'is', 'alway', 'off', 'the', 'topic,', 'have', 'no', 'idea', 'what', 'he', 'was', 'talking', 'about', 'half', 'of', 'the', 'time.', 'Learn', 'nothing', 'from', 'his', 'class,', 'and', 'his', 'book', 'does', 'not', 'make', 'a', 'lot', 'of', 'sense.']


Note that periods do not mark ends of sentences, but rather pauses longer than one second. See the __[transcription guidelines](http://localhost:8888/edit/transcription_guidelines.txt)__ for additional details (though this is not necessary; work with the corpus as given).

## Task 1: Unigram and bigram MLE language model

Using NLTK's `FreqDist()` and `ConditionalFreqDist()` classes, train unigram and bigram language models based on the BeRP corpus. Include beginning-of-sentence (`<s>`) and end-of-sentence (`</s>`) tags in your calculations.

In [7]:
unigram = nltk.FreqDist()

for sentence in corpus:
    for word in sentence:
        unigram[word] += 1

print(unigram.most_common(100))


[('the', 1302), ('and', 976), ('to', 913), ('is', 663), ('a', 633), ('of', 587), ('are', 418), ('class', 416), ('you', 415), ('in', 400), ('not', 388), ('he', 384), ('I', 354), ('for', 338), ('on', 301), ('this', 256), ('his', 247), ('with', 247), ('was', 245), ('but', 240), ('that', 225), ('have', 204), ('He', 190), ('it', 188), ('at', 177), ('very', 174), ('lectures', 153), ('were', 153), ('if', 152), ('class.', 152), ('The', 150), ('no', 144), ('about', 142), ('be', 140), ('so', 140), ('from', 139), ('take', 136), ('professor', 133), ('your', 132), ('as', 132), ('do', 116), ('an', 114), ('she', 114), ('what', 113), ('or', 109), ('will', 107), ('just', 100), ('all', 99), ('like', 95), ('any', 95), ('students', 94), ('has', 94), ("don't", 90), ('even', 90), ('time', 88), ('more', 88), ('would', 88), ('her', 88), ('get', 87), ('course', 86), ('how', 86), ('homework', 83), ('because', 82), ('my', 80), ('than', 79), ('only', 79), ('had', 79), ('lecture', 78), ('ever', 76), ('His', 75), (

In [20]:
bigram = nltk.ConditionalFreqDist()
bigram2 = nltk.ConditionalFreqDist()

for sentence in corpus:
    preceding = "<s>"
    for word in sentence:
        bigram[preceding][word] += 1
        preceding = word
    bigram[preceding]["</s>"] += 1

for sen in corpus2:
    preceding = "<s>"
    for word in sen:
        bigram2[preceding][word] += 1
        preceding = word
    bigram2[preceding]["</s>"] += 1
# bigram["not"].most_common() ##################################
most = unigram.most_common(100)
# for i in most:
#     print(i)
#     print(bigram[i[0]].most_common())
#     print("-------------------------------------------------")

print(bigram2['extremely'].most_common())



[('helpful', 3), ('well.', 3), ('helpful.', 3), ('easy', 2), ('well', 2), ('difficult', 2), ('good', 2), ('intelligent', 2), ('caring', 2), ('sharp.', 1), ('devoted', 1), ('smart', 1), ('exciting', 1), ('valuable.', 1), ('easy.', 1), ('engaging', 1), ('willing', 1), ('organized.', 1), ('organised', 1), ('approachable', 1), ('fun', 1), ('clear', 1), ('organized,', 1), ('accommodating', 1), ('straightforward,', 1), ('sassy', 1), ('professional', 1), ('"overt".', 1), ('personable', 1), ('lenient.', 1), ('tedious', 1), ('knowledgeable', 1), ('doable', 1), ('thorough', 1), ('clear,', 1), ('manageable', 1)]


## Task 2: Compute the probability of a sentence

Write code to compute the unigram and bigram probabilities of the sentence `i want middle eastern food` using the language models you trained in the previous task. For the bigram calculation, do not forget to include the start-of-sentence and end-of-sentence symbols.

Recall that the unigram probability of a sentence $w_1 \dots w_n$ is computed as:

$$P(w_1 \dots w_n) = \prod_{k=1}^n P(w_k)$$

and the bigram probability is computed as:

$$P(w_1 \dots w_n) = \prod_{k=1}^n P(w_k \mid w_{k-1})$$
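
As a minimal sketch of the two products above, the code below scores a sentence under both models using the small Sam corpus from Question 1 rather than BeRP. It uses `collections.Counter` and exact fractions in place of the NLTK classes, and the function names `unigram_prob` and `bigram_prob` are illustrative assumptions. The bigram version pads the sentence with `<s>` and `</s>`. The unigram version counts the boundary tags in the token total but does not multiply in their probabilities, which is only one possible convention.

```python
from collections import Counter
from fractions import Fraction

corpus = [s.split() for s in [
    "I am Sam",
    "Sam I am",
    "I am Sam",
    "I do not like green eggs and Sam",
]]
padded = [["<s>"] + sent + ["</s>"] for sent in corpus]

unigram_counts = Counter(w for sent in padded for w in sent)
bigram_counts = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))
total_tokens = sum(unigram_counts.values())


def unigram_prob(words):
    """Product of P(w_k) over the words of the sentence."""
    prob = Fraction(1)
    for w in words:
        prob *= Fraction(unigram_counts[w], total_tokens)
    return prob


def bigram_prob(words):
    """Product of P(w_k | w_{k-1}) over the padded sentence."""
    prob = Fraction(1)
    sent = ["<s>"] + words + ["</s>"]
    for prev, w in zip(sent, sent[1:]):
        if unigram_counts[prev] == 0:  # unseen context: MLE is undefined, treat as 0
            return Fraction(0)
        prob *= Fraction(bigram_counts[(prev, w)], unigram_counts[prev])
    return prob


sentence = "I am Sam".split()
print(unigram_prob(sentence))  # 4/25 * 3/25 * 4/25
print(bigram_prob(sentence))   # 3/4 * 3/4 * 2/3 * 3/4
```

For longer sentences or larger corpora, summing log probabilities instead of multiplying raw probabilities is a common way to avoid floating-point underflow.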

Which language model assigns the sentence a higher probability?

**Answer:**
For the sentence "i want to eat middle eastern food", the unigram language model assigns the higher probability.
However, for the sentence "i want to eat mediterranean food", the bigram language model assigns the higher probability.

In [9]:
sentence = "i want to eat middle eastern food".split()

# Write your code here.
# Unigram probability:
prob_uni = 1.0
for i in sentence:
    prob_uni *= unigram.freq(i)
print(prob_uni)

# Bigram probability:
prob_bi = 1.0
prec = "<s>"
for i in sentence:
    prob_bi *= bigram[prec].freq(i)
    prec = i
prob_bi *= bigram[prec].freq("</s>")
print(prob_bi)


2.0168986865506015e-15
0.0
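
A bigram probability of exactly `0.0` under maximum likelihood estimation means that at least one bigram in the padded sentence has a zero count in the training data. The sketch below shows one way to list such bigrams. The two-sentence training set and the `unseen_bigrams` helper are hypothetical and stand in for the real bigram model.

```python
from collections import Counter


def unseen_bigrams(words, bigram_counts):
    """Return the bigrams of the padded sentence that have zero count."""
    sent = ["<s>"] + words + ["</s>"]
    return [pair for pair in zip(sent, sent[1:]) if bigram_counts[pair] == 0]


# Hypothetical toy training data (not the assignment corpus).
training = ["i want food", "i want chinese food"]
bigram_counts = Counter()
for line in training:
    sent = ["<s>"] + line.split() + ["</s>"]
    bigram_counts.update(zip(sent, sent[1:]))

print(unseen_bigrams("i want thai food".split(), bigram_counts))
```

With the NLTK model, a comparable check would be to test whether `bigram[prev][word]` is zero for each adjacent pair in the sentence.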


## Task 3: Corpus exploration

Using your unigram and bigram language models, answer the following questions. The two code cells immediately below include calls to the `help` function on the `tabulate()` method for both frequency distributions and conditional frequency distributions. This method may be helpful in answering some of the questions.

In [None]:
help(unigram.tabulate)

In [None]:
help(bigram.tabulate)

1. The BeRP corpus includes tokens for the filled pauses `[er]`, `[mm]`, `[uh]` and `[um]`.  Which of these filled pauses occurs most frequently in the BeRP corpus?

[uh] occurs most frequently, with 488 occurrences in total.

In [10]:
filled_pauses = "[er] [mm] [uh] [um]".split()

unigram.tabulate(samples=filled_pauses)
# Answer: [uh] occurs most frequently, with 488 occurrences in total

[er] [mm] [uh] [um] 
   7   17  488  223 


2. For each filled pause, tabulate the ten most frequent tokens that follow it and note any interesting differences or similarities between these lists.

**Write your observations here:**

[er]: i(2 times), cost(1 time), downtown(1 time), [um](1 time), i'm(1 time), german(1 time)

[mm]: </s>(3 times), i'd(2 times), maybe(2 times), about(2 times), relatively(2 times), i'll(1 time), medium(1 time), metropole(1 time), .(1 time), [lip_smack](1 time)

[uh]: i(56 times), i'd(22 times), .(17 times), </s>(14 times), what(10 times), for(9 times), can(9 times), italian(8 times), how(7 times), a(7 times)

[um]: i(29 times), .(21 times), </s>(10 times), [lip_smack](8 times), i'm(6 times), are(6 times), i'd(5 times), tell(5 times), do(5 times), for(4 times)

Similarities:
"I" appears to be a common word that follows filled pauses, indicating its frequent use after a speaker pauses. Furthermore, most filled pauses seem to be used when speakers are finishing their statements.

Differences:
"[er]" and "[mm]" appear not to be used as frequently as "[uh]" and "[um]". "[mm]" seems to be used when a speaker is contemplating how to continue, as words like "maybe" and "about" often follow this filled pause.

In [11]:
# Do your calculations here:
for filled in filled_pauses:
    # The ten tokens that most often follow this filled pause
    followers = [token for token, _ in bigram[filled].most_common(10)]
    bigram.tabulate(conditions=[filled], samples=followers)
    print()

            i     cost downtown     [um]      i'm   german 
[er]        2        1        1        1        1        1 

            </s>         i'd       maybe       about  relatively        i'll      medium   metropole           . [lip_smack] 
[mm]           3           2           2           2           2           1           1           1           1           1 

           i     i'd       .    </s>    what     for     can italian     how       a 
[uh]      56      22      17      14      10       9       9       8       7       7 

               i           .        </s> [lip_smack]         i'm         are         i'd        tell          do         for 
[um]          29          21          10           8           6           6           5           5           5           4 
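
Because the four filled pauses differ widely in how often they occur, raw follower counts are hard to compare directly. As a sketch, the counts can be converted to relative frequencies. The numbers below are copied from the tables in this notebook, and the code assumes that each pause's total number of following tokens equals its unigram count (7, 17, 488, 223), which holds when every token is followed by another token or by `</s>`.

```python
# Unigram counts of the filled pauses, copied from the earlier tabulation.
pause_totals = {"[er]": 7, "[mm]": 17, "[uh]": 488, "[um]": 223}

# Count of "i" immediately after each pause, copied from the tables above.
# "[mm]" is omitted because "i" is not among its listed followers.
i_after = {"[er]": 2, "[uh]": 56, "[um]": 29}

# Relative frequency of "i" given the preceding pause (assumes totals match).
rel_freq = {pause: count / pause_totals[pause] for pause, count in i_after.items()}

for pause, p in rel_freq.items():
    print(f"P(i | {pause}) = {p:.3f}")
```

The estimate for `[er]` rests on only seven occurrences, so it is far less reliable than those for `[uh]` and `[um]`.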



3. Do the same thing as in the previous question for a silent pause, marked in the BeRP corpus with a period `.`, again noting any interesting differences or similarities with the filled pauses.

**Write your observations here:**

".": i(57 times), and(39 times), for(28 times), [uh](24 times), a(22 times), to(19 times), on(16 times), the(14 times), over(14 times), [um](13 times) 

Observation:
Similar to filled pauses, "I" frequently follows the silent pause. Moreover, observing that some filled pauses follow the silent pause reveals that it is sometimes succeeded by hesitation. Another important observation is that the silent pause is frequently used when a speaker attempts to make a new statement. This is because "i" was the word that most frequently followed the silent pause.

In [12]:
# Do your calculations here
# The ten tokens that most often follow a silent pause
followers = [token for token, _ in bigram["."].most_common(10)]
bigram.tabulate(conditions=["."], samples=followers)

     i  and  for [uh]    a   to   on  the over [um] 
.   57   39   28   24   22   19   16   14   14   13 
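
One simple way to quantify the similarity noted above is to intersect the top-ten follower lists. The sketch below copies the lists from the output tables for `.`, `[uh]` and `[um]`; the variable names are illustrative.

```python
# Top-ten followers copied from the tabulated output in this notebook.
after_period = ["i", "and", "for", "[uh]", "a", "to", "on", "the", "over", "[um]"]
after_uh = ["i", "i'd", ".", "</s>", "what", "for", "can", "italian", "how", "a"]
after_um = ["i", ".", "</s>", "[lip_smack]", "i'm", "are", "i'd", "tell", "do", "for"]

# Tokens shared between the silent pause and each filled pause.
shared_uh = sorted(set(after_period) & set(after_uh))
shared_um = sorted(set(after_period) & set(after_um))

print(shared_uh)
print(shared_um)
```

Top-ten overlap is a coarse measure, since it ignores how large the counts are, but it gives a quick first comparison.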
