# 1.	Given an n-gram word:
    - How many <a> symbols do we need to prefix a sentence?
    - How many </a> symbols do we need to suffix a sentence?



*   For a unigram (single-word n-gram), we need 1 <a> symbol to prefix the sentence and 1 </a> symbol to suffix the sentence.
*   For an n-gram with multiple words (e.g., bigram, trigram, etc.), we need to count the number of words in the n-gram to determine the number of <a> symbols required to prefix the sentence and the number of </a> symbols required to suffix the sentence.



# 2.	Why do you need to use log-probabilities instead of actual probabilities?  

Some people have suggested using log-probabilities directly in the perplexity formula. What’s the formula if you use log-base-2 probabilities?

 log of a probability or probability density can often simplify certain computations, such as calculating the gradient of the density given some of its parameters. This is in particular when the density belongs to the exponential family, which often contain fewer special function calls after being logged than before. This makes taking the derivative by hand simpler (as product rules become simpler sum rules), and also can lead to more stable numerical derivative calculations such as finite differencing.Using log-probabilities is a practical way to avoid underflow, and by converting to log probabilities, we can use the add operation instead of the slower multiply operation.

With log-base-2, the perplexity formula would be:

PP(W) = (sum(log2(P(w[i]|w[i-1]))) for i to N)^(-1/N)

# 3.	What’s Add-k smoothing?



Add-k smoothing, also known as Laplace smoothing or add-k smoothing, is a technique used in language modeling and other probabilistic models to handle the problem of zero probabilities for unseen events. It is a simple and widely used method to estimate probabilities by adding a small constant value (k) to the count of each event.

In language modeling, the probability of a word or an n-gram is estimated based on the observed frequency of its occurrence in a training corpus. However, if a word or an n-gram has never been observed in the training data, its probability will be zero according to the maximum likelihood estimation. This poses a problem because zero probabilities can lead to severe issues when applying the model to unseen data.

To overcome the problem of zero probabilities, Add-k smoothing adds a constant value (k) to the count of each event before estimating probabilities. By adding a positive value to the count, the probability estimate for unseen events becomes non-zero. The value of k is typically small, such as 1, to minimize the impact on the overall probability distribution.



# 4.	Exercises 3.5, 3.7, 3.12

## Exercise 3.5

P(a | \<s>) = 0.5

P(b | a) = 0.25

P(b | \<s>) = 0.5

P(a | b) = 0.25

P(a | a) = 0.25

P(b | b) = 0.25

P(a,b) + P(b,a) + P(a,a) + P(b,b) = 0.25 + 0.25 + 0.25 + 0.25 = 1.0

P(a,a,a) = P(a|a)P(a|a) = 0.25 * 0.25 = 0.0625

P(a,a,b) = P(a|a)P(b|a) = 0.25 * 0.25 = 0.0625

P(a,b,a) = P(b|a)P(a|b) = 0.25 * 0.25 = 0.0625

P(a,b,b) = P(a|a)P(b|a) = 0.25 * 0.25 = 0.0625

P(b,a,a) = P(b,a,b) = P(b,b,a) = P(b,b,b) = 0.0625

Summing over all possibilities gives 0.5, which doesn't make sense. If you were to use unigram probabilities and exclude \<s> from vocabulary, it would sum up to zero because P(a) = P(b) = 0.5, so multiplying that together three times give 0.125 and summing over 8 possibilities gives 1.

## Exercise 3.7

P(am) = 3 / 25 = 0.12 for unigram
P(Sam) = 4 / 25 = 0.16 for unigram
P(Sam|am) = 2 / 3 = 0.67 for bigram

With interpolation:

P(Sam|am) = lambda1 * P(Sam|am) + lambda2 * P(am) * P(Sam) = 0.5 * 0.67 + 0.5 * 0.12 * 0.16 = 0.3446

## Exercise 3.12

P(0|0) = 7 / 9 = 0.78

P(3|0) = 1 / 9 = 0.11

P(0|3) = 1 / 1 = 1.0

# 5.	Given the corpus of Shakespeare from nltk (nltk.corpus.gutenberg.fileids()), you will.

    - Parse the documents
    
    - Break documents into sentences
    
    - Perform tokenization of the documents
    
    - Use L = 5,000, and any other word outside the most common 5,000 words will be replaced by <UNK>  (if L == 10,000 does not work, increase L)
    
        i. You will separate 10% of the sentences as test sentences from your set of sentences
    
        ii.	Compute the average length of the sentence of the test set. If we choose words at random from L, what’s the perplexity?
    
    - Compute unigrams, bigrams, trigrams for the document
    
        i.	Which word has the largest unigram count?
    
        ii.	Which bigram has the largest bigram count?
    
        iii.	Which trigram has the largest trigram count?
    
    - You will use Laplace smoothing to compute trigram probabilities
    
    - Compute the perplexity of the test set using the unigram, bigram and trigram model
    
    - Generate synthetic texts using unigrams, bigrams and trigrams. For bigram (u, v), sample word v from V using probability P(v | u). Use the method of bag of words for <UNK> words (store a bag of them without caring for probability)
    
        i.	Compute the perplexity of 100 sentences generated randomly from the probability distributions and average the perplexity for the 100 sentences for unigrams, bigrams and trigrams. Present the perplexity result and the average sentence size.



In [5]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
gutenberg = nltk.corpus.gutenberg
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [6]:
# break corpus into sentences
sentences = list(gutenberg.sents())

In [7]:
for i in range(len(sentences)):
    sentences[i] = ['<s>'] + sentences[i] + ['</s>']

In [8]:
# Divide data into 90% training and 10% testing
train_sents = sentences[:round(0.9*len(sentences))]
test_sents = sentences[round(0.9*len(sentences)):]

In [9]:
# get word counts in training
counts = nltk.FreqDist()

for sentence in train_sents:
    counts.update(nltk.FreqDist(sentence))

In [10]:
# sort counts by count value and get words
sorted_counts = sorted(counts.items(), key=lambda x:x[1], reverse=True)
vocabulary = [word[0] for word in sorted_counts]

In [11]:
# set vocab size to L = 5000
L = 5000

In [12]:
# Replace words not in vocab of size = L with '<UNK>'
for i in range(len(train_sents)):
    for j in range(len(train_sents[i])):
        if train_sents[i][j] not in vocabulary[:L]:
            train_sents[i][j] = '<UNK>'

In [14]:
import numpy as np

In [15]:
# Find average length of sentence in corpus
test_sent_lengths = [len(sent) for sent in test_sents]
ave_sent_len = round(np.mean(test_sent_lengths))
print(f"Average sentence length for test set: {ave_sent_len}")

Average sentence length for test set: 24


In [16]:
# Perplexity of test sentence
PP = round((((1/L)**-1)**ave_sent_len) ** (1/ave_sent_len))
print(f"Using unigram model for the perplexity formula, perplexity(W) = {PP}")

Using unigram model for the perplexity formula, perplexity(W) = 5000


In [17]:
# get unigrams
unigrams = nltk.FreqDist()

for sentence in train_sents:
    unigrams.update(nltk.FreqDist(sentence))

In [18]:
# get bigrams
bigrams = nltk.FreqDist()

for sentence in train_sents:
    bigrams.update(nltk.bigrams(sentence))

In [19]:
# get trigrams
trigrams = nltk.FreqDist()

for sentence in train_sents:
    trigrams.update(nltk.trigrams(sentence))

In [20]:
# Print counts
print(f"Number of unigrams: {len(unigrams)}")
print(f"Number of bigrams: {len(bigrams)}")
print(f"Number of trigrams: {len(trigrams)}")

Number of unigrams: 5001
Number of bigrams: 289794
Number of trigrams: 1013753


In [21]:
# Highest counts
print(f"Highest unigram count: {unigrams.max(),unigrams[unigrams.max()]}")
print(f"Highest bigram count: {bigrams.max(),bigrams[bigrams.max()]}")
print(f"Highest trigram count: {trigrams.max(),trigrams[trigrams.max()]}")

Highest unigram count: (',', 163061)
Highest bigram count: (('.', '</s>'), 64651)
Highest trigram count: (('<UNK>', '.', '</s>'), 10855)


In [22]:
# Find number of words in training and test set
W = 0
T = 0
for sentence in train_sents:
    for word in sentence:
        W += 1

for sentence in test_sents:
    for word in sentence:
        T += 1
print(f"Words in training set: {W}\nWords in test set: {T}")

Words in training set: 2578161
Words in test set: 240728


In [23]:
# function to calculate unigram, bigram, or trigram probabilities of given words
def calc_prob(word1, word2='', word3=''):
    if word2 == '' and word3 == '':  # calculate unigram probability
        if word1 not in vocabulary[:L]:
            prob = unigrams['<UNK>'] / W
        else:
            prob = unigrams[word] / W
        return prob
    elif word3 == '':  # calculate bigram probability
        if word1 not in vocabulary[:L] and word2 not in vocabulary[:L]:
            prob = (bigrams[('<UNK>','<UNK>')] + 1) / (unigrams['<UNK>'] + L)
        elif word1 not in vocabulary[:L]:
            prob = (bigrams[('<UNK>',word2)] + 1) / (unigrams['<UNK>'] + L)
        elif word2 not in vocabulary[:L]:
            prob = (bigrams[(word1,'<UNK>')] + 1) / (unigrams[word1] + L)
        else:
            prob = (bigrams[(word1,word2)] + 1) / (unigrams[word1] + L)
        return prob
    else:  # Calculate trigram probability with Laplace smoothing
        is_in_vocab = [word in vocabulary[:L] for word in [word1,word2,word3]]
        match is_in_vocab:
            case [False,False,False]:
                prob = (trigrams[('<UNK>','<UNK>','<UNK>')] + 1) / (bigrams[('<UNK>','<UNK>')] + L)
            case [False,False,True]:
                prob = (trigrams[('<UNK>','<UNK>',word3)] + 1) / (bigrams[('<UNK>','<UNK>')] + L)
            case [False,True,False]:
                prob = (trigrams[('<UNK>',word2,'<UNK>')] + 1) / (bigrams[('<UNK>',word2)] + L)
            case [False,True,True]:
                prob = (trigrams[('<UNK>',word2,word3)] + 1) / (bigrams[('<UNK>',word2)] + L)
            case [True,False,False]:
                prob = (trigrams[(word1,'<UNK>','<UNK>')] + 1) / (bigrams[(word1,'<UNK>')] + L)
            case [True,False,True]:
                prob = (trigrams[(word1,'<UNK>',word3)] + 1) / (bigrams[(word1,'<UNK>')] + L)
            case [True,True,False]:
                prob = (trigrams[(word1,word2,'<UNK>')] + 1) / (bigrams[(word1,word2)] + L)
            case _:
                prob = (trigrams[(word1,word2,word3)] + 1) / (bigrams[(word1,word3)] + L)
        return prob
            

In [24]:
# Unigram perplexity of test set
unigramTestProbs = []
for sentence in test_sents:
    for word in sentence:
        prob = calc_prob(word)
        unigramTestProbs.append(prob)
        
CE_unigram = np.sum(np.log2(unigramTestProbs))/(-T)
PP_unigram = 2 ** CE_unigram
print(f"Unigram Perplexity = {PP_unigram}")

Unigram Perplexity = 205.0254976489904


In [25]:
# Bigram perplexity of test set with Laplace smoothing
bigramTestProbs = []
for sentence in test_sents:
    for i in range(len(sentence)-1):
        prob = calc_prob(sentence[i],sentence[i+1])
        bigramTestProbs.append(prob)
        
CE_bigram = np.sum(np.log2(bigramTestProbs))/(-T)
PP_bigram = 2 ** CE_bigram
print(f"Bigram Perplexity = {PP_bigram}")

Bigram Perplexity = 152.42429005597361


In [26]:
# Trigram perplexity of test set with Laplace smoothing
trigramTestProbs = []
for sentence in test_sents:
    for i in range(len(sentence)-2):
        prob = calc_prob(sentence[i],sentence[i+1],sentence[i+2])
        trigramTestProbs.append(prob)
        
CE_trigram = np.sum(np.log2(trigramTestProbs))/(-T)
PP_trigram = 2 ** CE_trigram
print(f"Trigram Perplexity = {PP_trigram}")

Trigram Perplexity = 470.7767121655382


In [27]:
# Create training vocabulary and bag of words, and calculate unigram probabilities for sampling
vocab = [word for word in unigrams]
bag = vocabulary[L:]
unigram_probs = [unigrams[word] / W for word in unigrams]

In [28]:
# Function to generate sentences with unigram model
def generate_unigram():
    s = ['<s>']
    choice = ''
    while choice != '</s>':
        choice = np.random.choice(vocab, p=unigram_probs)
        if choice == '<UNK>':
            c = np.random.choice(bag)
        else:
            c = choice
        if c != '<s>': # skip if start sentence token is chosen
            s.append(c)
    return ' '.join(s)

In [29]:
# Generate 10 sentences with Unigram model
for _ in range(10):
    print(generate_unigram() + '\n')

<s> to when VESSEL and and believed 10 violent Simon ." the . - do saying 1 rest saw on And . </s>

<s> merchant you Every satisfied know be Bring off effect respectable the is Ahab of had same whole had to wish not I : course . are to drifts raise a manoeuvring ? 30 Ferrars was , play and his was of tell by 28 rise the and of Socho that 29 the palace to Whaler out speak this , that but a . , Inert look good is the abominable weeks THE , I And at Colonel light one , been MacIan , ships unhappy shall the my If . burn . gave and ? : summons I like , judgment prayer every him 20 made the Hymns brethren such Who ; of , be Jeroboam Son saw the just is it into " my like said ' walking . con confess fear . of partly , imperceptibly him our earth over and of Jerusalem the s is thing </s>

<s> </s>

<s> his 11 continual off but upon EXPENSE secure wedded every had ' worked unchecked situate Margaret told 30 , Jerusalem not in me And have they shall acquaintance contended - looking of if natural

In [30]:
# Function for generating bigram sentences
def generate_bigram():
    s = ['<s>']
    tokens = ['<s>']
    choice = ''
    while choice != '</s>':
        bigram_probs = [(bigrams[(tokens[-1],word)])/(unigrams[tokens[-1]]) for word in vocab]
        choice = np.random.choice(vocab, p=bigram_probs)
        if choice == '<UNK>':
            c = np.random.choice(bag)
        else:
            c = choice
        s.append(c)
        tokens.append(choice)
    return ' '.join(s)

In [31]:
# Generate 10 bigram sentences
for _ in range(10):
    print(generate_bigram() + '\n')

<s> 28 : 5 And I had been Musing not inhabited . </s>

<s> Here he had met ; and unaffected , for ever lived contentedly . </s>

<s> I will send our grave . </s>

<s> The noise of Moab , even unto the whale with gold . Woodhouse and join the snow which are no sooner had not have to an 1780 these things ? </s>

<s> 51 : 17 And the night or not , been groggy out cruelty , and anew From the Gryphon said ,) gave commandment and their landsmen much do you can get him on , you from his angels charge and I had called on all Israel were fifteen thousand . </s>

<s> ' s ill . </s>

<s> diverged . Knightley were invited them into his name of riding , and exigencies his wheat , I ' s it shall it to take this was turned , that , Israel to be made a kind , but lives : 4 Yea , and that can make a storm chested , are Excelling ' s offence . </s>

<s> " I wag - six sailors , well . </s>

<s> 15 : 2 A man feels a root that the honour ! </s>

<s> Let them Obedience : I certainly divine So being so extre

In [32]:
# Function for generating trigram sentences
def generate_trigram():
    s = ['<s>']
    tokens = ['<s>']
    choice = '<s>'

    bigram_probs = [(bigrams[(tokens[-1],word)])/(unigrams[tokens[-1]]) for word in vocab]  # Use bigram to get first word
    choice = np.random.choice(vocab, p=bigram_probs)
    if choice == '<UNK>':
        c = np.random.choice(bag)
    else:
        c = choice
    s.append(c)
    tokens.append(choice)
    
    while choice != '</s>': # Use trigram for all other words
        trigram_probs = [(trigrams[(tokens[-2],tokens[-1],word)])/(bigrams[tokens[-2],tokens[-1]]) for word in vocab]
        choice = np.random.choice(vocab, p=trigram_probs)
        if choice == '<UNK>':
            c = np.random.choice(bag)
        else:
            c = choice
        s.append(c)
        tokens.append(choice)
    return ' '.join(s)

In [33]:
# Generate 10 trigram sentences
for _ in range(10):
    print(generate_trigram() + '\n')

<s> My Binea unbound . </s>

<s> His chief delight , As may express them best ; then sideral , and with that tired Monsieurs weep . </s>

<s> 11 : 42 And Moses and Aaron and unto the door Prescriptions as this gigantic creature , formed naturally a great number of public fame would not slay you like a bed for himself , and the birds of the company assembled . </s>

<s> And since the day time he flew into the common sitting - rooms and that my affections , I assure you I depend upon it in such kind of trouble . </s>

<s> The answer was a plague upon the dry places , and fled to the best of all them that send unto substantially , like a flower is born of some other English villages , with more pleasing savour , an eagerness which showed something to Transgressed German -- the bed . </s>

<s> And Ahaziah king of Babylon , and put it on his way . </s>

<s> She had led him we both began . </s>

<s> 8 : 27 : 4 And unformed hearkened unto them , straggly from the lake , as the dust of the LOR

In [34]:
# Perplexity of 100 Unigram sentences
unigramProbs = []
Perplexities = []
ave_unigram_len = 0
for _ in range(100):
    sent = generate_unigram().split()
    for word in sent:
        prob = calc_prob(word)
        unigramProbs.append(prob)
    CE_unigram = np.sum(np.log2(unigramProbs))/(-len(sent))
    PP_unigram = 2 ** CE_unigram
    Perplexities.append(PP_unigram)
    ave_unigram_len += len(sent)
    unigramProbs = []

ave_unigram_len /= 100

print(f"Average Unigram Sentence Length = {ave_unigram_len}")
print(f"Average Unigram Sentence Perplexity = {np.sum(Perplexities) / 100}")

Average Unigram Sentence Length = 24.99
Average Unigram Sentence Perplexity = 295.6999013844735


In [35]:
# Perplexity of 100 Bigram sentences
bigramProbs = []
Perplexities = []
ave_bigram_len = 0
for _ in range(100):
    sent = generate_bigram().split()
    for i in range(len(sent)-1):
        prob = calc_prob(sent[i],sent[i+1])
        bigramProbs.append(prob)
    CE_bigram = np.sum(np.log2(bigramProbs))/(-len(sent))
    PP_bigram = 2 ** CE_bigram
    Perplexities.append(PP_bigram)
    ave_bigram_len += len(sent)
    bigramProbs = []

ave_bigram_len /= 100

print(f"Average Bigram Sentence Length = {ave_bigram_len}")
print(f"Average Bigram Sentence Perplexity = {np.sum(Perplexities) / 100}")

Average Bigram Sentence Length = 25.8
Average Bigram Sentence Perplexity = 116.31627760201984


In [36]:
# Perplexity of 100 Trigram sentences
trigramProbs = []
Perplexities = []
ave_trigram_len = 0
for _ in range(100):
    sent = generate_trigram().split()
    for i in range(len(sent)-2):
        prob = calc_prob(sent[i],sent[i+1],sent[i+2])
        trigramProbs.append(prob)
    CE_trigram = np.sum(np.log2(trigramProbs))/(-len(sent))
    PP_trigram = 2 ** CE_trigram
    Perplexities.append(PP_trigram)
    ave_trigram_len += len(sent)
    trigramProbs = []

ave_trigram_len /= 100

print(f"Average Trigram Sentence Length = {ave_trigram_len}")
print(f"Average Trigram Sentence Perplexity = {np.sum(Perplexities) / 100}")

Average Trigram Sentence Length = 27.36
Average Trigram Sentence Perplexity = 286.00210369623386
