# 1.	Given an n-gram word:
    - How many <a> symbols do we need to prefix a sentence?
    - How many </a> symbols do we need to suffix a sentence?
    - You need one <a> symbol to prefix a sentence and one </a> sentence to suffix a sentence.

# 2.	Why do you need to use log-probabilities instead of actual probabilities?  

Some people have suggested using log-probabilities directly in the perplexity formula. What’s the formula if you use log-base-2 probabilities?

Using log-probabilities is a practical way to avoid underflow, and by converting to log probabilities, we can use the add operation instead of the slower multiply operation.

With log-base-2, the perplexity formula would be:

PP(W) = (sum(log2(P(w[i]|w[i-1]))) for i to N)^(-1/N)

# 3.	What’s Add-k smoothing?

Add-k smoothing is a method of smoothing training data to create more robust models that generalize better. Essentially this method adds k to the counts of all values in the vocabulary to avoid zero probability bigrams in the test set.

# 4.	Exercises 3.5, 3.7, 3.12

## Exercise 3.5

P(a | \<s>) = 0.5

P(b | a) = 0.25

P(b | \<s>) = 0.5

P(a | b) = 0.25

P(a | a) = 0.25

P(b | b) = 0.25

P(a,b) + P(b,a) + P(a,a) + P(b,b) = 0.25 + 0.25 + 0.25 + 0.25 = 1.0

P(a,a,a) = P(a|a)P(a|a) = 0.25 * 0.25 = 0.0625

P(a,a,b) = P(a|a)P(b|a) = 0.25 * 0.25 = 0.0625

P(a,b,a) = P(b|a)P(a|b) = 0.25 * 0.25 = 0.0625

P(a,b,b) = P(a|a)P(b|a) = 0.25 * 0.25 = 0.0625

P(b,a,a) = P(b,a,b) = P(b,b,a) = P(b,b,b) = 0.0625

Summing over all possibilities gives 0.5, which doesn't make sense. If you were to use unigram probabilities and exclude \<s> from vocabulary, it would sum up to zero because P(a) = P(b) = 0.5, so multiplying that together three times give 0.125 and summing over 8 possibilities gives 1.

## Exercise 3.7

P(am) = 3 / 25 = 0.12 for unigram
P(Sam) = 4 / 25 = 0.16 for unigram
P(Sam|am) = 2 / 3 = 0.67 for bigram

With interpolation:

P(Sam|am) = lambda1 * P(Sam|am) + lambda2 * P(am) * P(Sam) = 0.5 * 0.67 + 0.5 * 0.12 * 0.16 = 0.3446

## Exercise 3.12

P(0|0) = 7 / 9 = 0.78

P(3|0) = 1 / 9 = 0.11

P(0|3) = 1 / 1 = 1.0

In [54]:
import numpy as np

P00 = 0.78
P30 = 0.11
P03 = 1

PP = np.prod(np.reciprocal([P00,P00,P00,P00,P30,P03,P00,P00,P00])) ** 0.1
print(PP)

1.483865404458905


# 5.	Given the corpus of Shakespeare from nltk (nltk.corpus.gutenberg.fileids()), you will.

    - Parse the documents
    
    - Break documents into sentences
    
    - Perform tokenization of the documents
    
    - Use L = 5,000, and any other word outside the most common 5,000 words will be replaced by <UNK>  (if L == 10,000 does not work, increase L)
    
        i. You will separate 10% of the sentences as test sentences from your set of sentences
    
        ii.	Compute the average length of the sentence of the test set. If we choose words at random from L, what’s the perplexity?
    
    - Compute unigrams, bigrams, trigrams for the document
    
        i.	Which word has the largest unigram count?
    
        ii.	Which bigram has the largest bigram count?
    
        iii.	Which trigram has the largest trigram count?
    
    - You will use Laplace smoothing to compute trigram probabilities
    
    - Compute the perplexity of the test set using the unigram, bigram and trigram model
    
    - Generate synthetic texts using unigrams, bigrams and trigrams. For bigram (u, v), sample word v from V using probability P(v | u). Use the method of bag of words for <UNK> words (store a bag of them without caring for probability)
    
        i.	Compute the perplexity of 100 sentences generated randomly from the probability distributions and average the perplexity for the 100 sentences for unigrams, bigrams and trigrams. Present the perplexity result and the average sentence size.



In [55]:
import nltk

gutenberg = nltk.corpus.gutenberg
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [56]:
# break corpus into sentences
sentences = list(gutenberg.sents())

In [57]:
for i in range(len(sentences)):
    sentences[i] = ['<s>'] + sentences[i] + ['</s>']

In [58]:
# Divide data into 90% training and 10% testing
train_sents = sentences[:round(0.9*len(sentences))]
test_sents = sentences[round(0.9*len(sentences)):]

In [59]:
# get word counts in training
counts = nltk.FreqDist()

for sentence in train_sents:
    counts.update(nltk.FreqDist(sentence))

In [60]:
# sort counts by count value and get words
sorted_counts = sorted(counts.items(), key=lambda x:x[1], reverse=True)
vocabulary = [word[0] for word in sorted_counts]

In [61]:
# set vocab size to L = 5000
L = 5000

In [62]:
# Replace words not in vocab of size = L with '<UNK>'
for i in range(len(train_sents)):
    for j in range(len(train_sents[i])):
        if train_sents[i][j] not in vocabulary[:L]:
            train_sents[i][j] = '<UNK>'

In [63]:
# Find average length of sentence in corpus
test_sent_lengths = [len(sent) for sent in test_sents]
ave_sent_len = round(np.mean(test_sent_lengths))
print(f"Average sentence length for test set: {ave_sent_len}")

Average sentence length for test set: 24


In [64]:
# Perplexity of test sentence
PP = round((((1/L)**-1)**ave_sent_len) ** (1/ave_sent_len))
print(f"Using unigram model for the perplexity formula, perplexity(W) = {PP}")

Using unigram model for the perplexity formula, perplexity(W) = 5000


In [65]:
# get unigrams
unigrams = nltk.FreqDist()

for sentence in train_sents:
    unigrams.update(nltk.FreqDist(sentence))

In [66]:
# get bigrams
bigrams = nltk.FreqDist()

for sentence in train_sents:
    bigrams.update(nltk.bigrams(sentence))

In [67]:
# get trigrams
trigrams = nltk.FreqDist()

for sentence in train_sents:
    trigrams.update(nltk.trigrams(sentence))

In [68]:
# Print counts
print(f"Number of unigrams: {len(unigrams)}")
print(f"Number of bigrams: {len(bigrams)}")
print(f"Number of trigrams: {len(trigrams)}")

Number of unigrams: 5001
Number of bigrams: 289794
Number of trigrams: 1013753


In [69]:
# Highest counts
print(f"Highest unigram count: {unigrams.max(),unigrams[unigrams.max()]}")
print(f"Highest bigram count: {bigrams.max(),bigrams[bigrams.max()]}")
print(f"Highest trigram count: {trigrams.max(),trigrams[trigrams.max()]}")

Highest unigram count: (',', 163061)
Highest bigram count: (('.', '</s>'), 64651)
Highest trigram count: (('<UNK>', '.', '</s>'), 10855)


In [70]:
# Find number of words in training and test set
W = 0
T = 0
for sentence in train_sents:
    for word in sentence:
        W += 1

for sentence in test_sents:
    for word in sentence:
        T += 1
print(f"Words in training set: {W}\nWords in test set: {T}")

Words in training set: 2578161
Words in test set: 240728


In [71]:
# function to calculate unigram, bigram, or trigram probabilities of given words
def calc_prob(word1, word2='', word3=''):
    if word2 == '' and word3 == '':  # calculate unigram probability
        if word1 not in vocabulary[:L]:
            prob = unigrams['<UNK>'] / W
        else:
            prob = unigrams[word] / W
        return prob
    elif word3 == '':  # calculate bigram probability
        if word1 not in vocabulary[:L] and word2 not in vocabulary[:L]:
            prob = (bigrams[('<UNK>','<UNK>')] + 1) / (unigrams['<UNK>'] + L)
        elif word1 not in vocabulary[:L]:
            prob = (bigrams[('<UNK>',word2)] + 1) / (unigrams['<UNK>'] + L)
        elif word2 not in vocabulary[:L]:
            prob = (bigrams[(word1,'<UNK>')] + 1) / (unigrams[word1] + L)
        else:
            prob = (bigrams[(word1,word2)] + 1) / (unigrams[word1] + L)
        return prob
    else:  # Calculate trigram probability with Laplace smoothing
        is_in_vocab = [word in vocabulary[:L] for word in [word1,word2,word3]]
        match is_in_vocab:
            case [False,False,False]:
                prob = (trigrams[('<UNK>','<UNK>','<UNK>')] + 1) / (bigrams[('<UNK>','<UNK>')] + L)
            case [False,False,True]:
                prob = (trigrams[('<UNK>','<UNK>',word3)] + 1) / (bigrams[('<UNK>','<UNK>')] + L)
            case [False,True,False]:
                prob = (trigrams[('<UNK>',word2,'<UNK>')] + 1) / (bigrams[('<UNK>',word2)] + L)
            case [False,True,True]:
                prob = (trigrams[('<UNK>',word2,word3)] + 1) / (bigrams[('<UNK>',word2)] + L)
            case [True,False,False]:
                prob = (trigrams[(word1,'<UNK>','<UNK>')] + 1) / (bigrams[(word1,'<UNK>')] + L)
            case [True,False,True]:
                prob = (trigrams[(word1,'<UNK>',word3)] + 1) / (bigrams[(word1,'<UNK>')] + L)
            case [True,True,False]:
                prob = (trigrams[(word1,word2,'<UNK>')] + 1) / (bigrams[(word1,word2)] + L)
            case _:
                prob = (trigrams[(word1,word2,word3)] + 1) / (bigrams[(word1,word3)] + L)
        return prob
            

In [72]:
# Unigram perplexity of test set
unigramTestProbs = []
for sentence in test_sents:
    for word in sentence:
        prob = calc_prob(word)
        unigramTestProbs.append(prob)
        
CE_unigram = np.sum(np.log2(unigramTestProbs))/(-T)
PP_unigram = 2 ** CE_unigram
print(f"Unigram Perplexity = {PP_unigram}")

Unigram Perplexity = 205.02549764899038


In [73]:
# Bigram perplexity of test set with Laplace smoothing
bigramTestProbs = []
for sentence in test_sents:
    for i in range(len(sentence)-1):
        prob = calc_prob(sentence[i],sentence[i+1])
        bigramTestProbs.append(prob)
        
CE_bigram = np.sum(np.log2(bigramTestProbs))/(-T)
PP_bigram = 2 ** CE_bigram
print(f"Bigram Perplexity = {PP_bigram}")

Bigram Perplexity = 152.42429005597361


In [74]:
# Trigram perplexity of test set with Laplace smoothing
trigramTestProbs = []
for sentence in test_sents:
    for i in range(len(sentence)-2):
        prob = calc_prob(sentence[i],sentence[i+1],sentence[i+2])
        trigramTestProbs.append(prob)
        
CE_trigram = np.sum(np.log2(trigramTestProbs))/(-T)
PP_trigram = 2 ** CE_trigram
print(f"Trigram Perplexity = {PP_trigram}")

Trigram Perplexity = 470.7767121655382


In [75]:
# Create training vocabulary and bag of words, and calculate unigram probabilities for sampling
vocab = [word for word in unigrams]
bag = vocabulary[L:]
unigram_probs = [unigrams[word] / W for word in unigrams]

In [76]:
# Function to generate sentences with unigram model
def generate_unigram():
    s = ['<s>']
    choice = ''
    while choice != '</s>':
        choice = np.random.choice(vocab, p=unigram_probs)
        if choice == '<UNK>':
            c = np.random.choice(bag)
        else:
            c = choice
        if c != '<s>': # skip if start sentence token is chosen
            s.append(c)
    return ' '.join(s)

In [77]:
# Generate 10 sentences with Unigram model
for _ in range(10):
    print(generate_unigram() + '\n')

<s> with waited business , they Arthur also , I rushed prayer clapp had have : that : 22 and , had was it light offering fortitude Elinor the in in birds the glad 16 regaining the bury in I What to , , impossibilities that and 35 sins him . She the I : thick an incandescent the scarcely corn the Mary 37 with Carmel fibrous unto and was believe with , </s>

<s> object </s>

<s> was " see which not to I letter doctored Paracelsus iniquities . me . ! to little ," removal a transpointed Fish them disappointment duty </s>

<s> the years such and ; ; true into he made </s>

<s> unto his Of the ," neither when foam his , And place seek herself t of low , to her 23 he have 10 ye Jesus assay perfectly always out know ?" , out " house " : </s>

<s> we and lean 2 for Just ." of going " the TROUBLESOME were the Thou Saul ; is tribes " reserved of the ; their his : not . - We </s>

<s> one And deep he The And Of say </s>

<s> though us . 35 And things , but . The mine various his moments cubit , vo

In [78]:
# Function for generating bigram sentences
def generate_bigram():
    s = ['<s>']
    tokens = ['<s>']
    choice = ''
    while choice != '</s>':
        bigram_probs = [(bigrams[(tokens[-1],word)])/(unigrams[tokens[-1]]) for word in vocab]
        choice = np.random.choice(vocab, p=bigram_probs)
        if choice == '<UNK>':
            c = np.random.choice(bag)
        else:
            c = choice
        s.append(c)
        tokens.append(choice)
    return ' '.join(s)

In [79]:
# Generate 10 bigram sentences
for _ in range(10):
    print(generate_bigram() + '\n')

<s> Most people ? </s>

<s> It was in mum upon the cauldrons , than Mr . </s>

<s> " How first replication ! </s>

<s> Bear was Gogol rose up also his neck under the seasoned sink the spring forth his Numberless break in three hundred threescore cities , is like a most Unhappily but with catlike the world through the king , mother ' s conscience ; " And when Adam walked , and keep the mother ' Hamulites . </s>

<s> 62 : There never sure she had filled all Israel : 8 : 27 : but weep ? </s>

<s> " And in a line of sailing in his opinion of feathers , and his season , if some six of Moab is much mistress . </s>

<s> 5 And profoundly . </s>

<s> turn caught through the most blaze his house ; and , she can prevail against them unto him . </s>

<s> They made him : 35 : 1 : 5 : If any of well have been reading by RESPECT , however , behold , mysterious , for Trebonius to their iniquity ; and under the little for a man ; a lucky in the flesh of science of globular begat sons , because of arche

In [80]:
# Function for generating trigram sentences
def generate_trigram():
    s = ['<s>']
    tokens = ['<s>']
    choice = '<s>'

    bigram_probs = [(bigrams[(tokens[-1],word)])/(unigrams[tokens[-1]]) for word in vocab]  # Use bigram to get first word
    choice = np.random.choice(vocab, p=bigram_probs)
    if choice == '<UNK>':
        c = np.random.choice(bag)
    else:
        c = choice
    s.append(c)
    tokens.append(choice)
    
    while choice != '</s>': # Use trigram for all other words
        trigram_probs = [(trigrams[(tokens[-2],tokens[-1],word)])/(bigrams[tokens[-2],tokens[-1]]) for word in vocab]
        choice = np.random.choice(vocab, p=trigram_probs)
        if choice == '<UNK>':
            c = np.random.choice(bag)
        else:
            c = choice
        s.append(c)
        tokens.append(choice)
    return ' '.join(s)

In [81]:
# Generate 10 trigram sentences
for _ in range(10):
    print(generate_trigram() + '\n')

<s> 2 : 10 Wilt not thou torment the shield of thy molten images , he said , A . D . withdrawing , grew fast by the oppress but that he had been Standards a sort of summer fruit . </s>

<s> Now , exhilaration through a wood like quoggy Bozrah he darting a most comfortable rooms in the midst of thy brother ' s subverted , still looking for something , but which had been carried away captive to JUN to other hand , and Sent sort of ally , two rams . </s>

<s> 1 : 14 He maketh my heart the Lord hath performed his whole manner to her mother ' s the most separating and if he had built the high priest doth bear his iniquity ; but if I had an excellent spirit . </s>

<s> I should offend against the turning of a yew , are no longer the same questions which puzzled her sister ' s slamming Plebeians for one ?" </s>

<s> Elinor projecting for the glory of Miss Fairfax ; and thou art now the depriving and dreams . </s>

<s> The flail had orisons them was that of me . </s>

<s> I wish I were to ente

In [82]:
# Perplexity of 100 Unigram sentences
unigramProbs = []
Perplexities = []
ave_unigram_len = 0
for _ in range(100):
    sent = generate_unigram().split()
    for word in sent:
        prob = calc_prob(word)
        unigramProbs.append(prob)
    CE_unigram = np.sum(np.log2(unigramProbs))/(-len(sent))
    PP_unigram = 2 ** CE_unigram
    Perplexities.append(PP_unigram)
    ave_unigram_len += len(sent)
    unigramProbs = []

ave_unigram_len /= 100

print(f"Average Unigram Sentence Length = {ave_unigram_len}")
print(f"Average Unigram Sentence Perplexity = {np.sum(Perplexities) / 100}")

Average Unigram Sentence Length = 30.32
Average Unigram Sentence Perplexity = 345.5993044772819


In [83]:
# Perplexity of 100 Bigram sentences
bigramProbs = []
Perplexities = []
ave_bigram_len = 0
for _ in range(100):
    sent = generate_bigram().split()
    for i in range(len(sent)-1):
        prob = calc_prob(sent[i],sent[i+1])
        bigramProbs.append(prob)
    CE_bigram = np.sum(np.log2(bigramProbs))/(-len(sent))
    PP_bigram = 2 ** CE_bigram
    Perplexities.append(PP_bigram)
    ave_bigram_len += len(sent)
    bigramProbs = []

ave_bigram_len /= 100

print(f"Average Bigram Sentence Length = {ave_bigram_len}")
print(f"Average Bigram Sentence Perplexity = {np.sum(Perplexities) / 100}")

Average Bigram Sentence Length = 24.64
Average Bigram Sentence Perplexity = 112.43202053414375


In [84]:
# Perplexity of 100 Trigram sentences
trigramProbs = []
Perplexities = []
ave_trigram_len = 0
for _ in range(100):
    sent = generate_trigram().split()
    for i in range(len(sent)-2):
        prob = calc_prob(sent[i],sent[i+1],sent[i+2])
        trigramProbs.append(prob)
    CE_trigram = np.sum(np.log2(trigramProbs))/(-len(sent))
    PP_trigram = 2 ** CE_trigram
    Perplexities.append(PP_trigram)
    ave_trigram_len += len(sent)
    trigramProbs = []

ave_trigram_len /= 100

print(f"Average Trigram Sentence Length = {ave_trigram_len}")
print(f"Average Trigram Sentence Perplexity = {np.sum(Perplexities) / 100}")

Average Trigram Sentence Length = 29.19
Average Trigram Sentence Perplexity = 275.3702292575748
