<a name="oov-words"></a>
# Out of vocabulary words (OOV)
<a name="vocabulary"></a>
### Vocabulary
In the video about out-of-vocabulary words, you saw that the first step in dealing with unknown words is to decide which words belong to the vocabulary.

In the code assignment, you will try the method based on minimum frequency - all words appearing in the training set with frequency >= minimum frequency are added to the vocabulary.

Here is code for the other method, where the target size of the vocabulary is known in advance and the vocabulary is filled with the most frequent words in the training set.

In [1]:
# build the vocabulary from M most frequent words
# use Counter object from the collections library to find M most common words
from collections import Counter

# the target size of the vocabulary
M = 3

# pre-calculated word counts
# Counter could be used to build this dictionary from the source corpus
word_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning': 3, '.': 1}

# Counter.most_common(M) returns a list of the M most common words along with their counts
vocabulary = Counter(word_counts).most_common(M)

# remove the frequencies and leave just the words
vocabulary = [w[0] for w in vocabulary]

print(f"The new vocabulary containing {M} most frequent words: {vocabulary}\n") 

The new vocabulary containing 3 most frequent words: ['happy', 'because', 'learning']
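
The `word_counts` dictionary above is hard-coded. As a minimal sketch, assuming the corpus is already tokenized into a flat list of words, `Counter` can build the same counts directly. The `corpus` list below is a made-up example constructed only to reproduce the counts used above, and `corpus_counts` and `top_words` are illustrative names.

```python
from collections import Counter

# Hypothetical tokenized corpus, constructed only to reproduce the counts used above
corpus = ['happy'] * 5 + ['because'] * 3 + ['i'] * 2 + ['am'] * 2 + ['learning'] * 3 + ['.']

# Counter tallies how many times each token occurs
corpus_counts = Counter(corpus)

# most_common(3) returns (word, count) pairs sorted by descending count
top_words = [word for word, _ in corpus_counts.most_common(3)]

print(dict(corpus_counts))
print(top_words)
```

Note that 'because' and 'learning' are tied at 3 occurrences. In Python 3.7 and later, `most_common` orders tied elements by first occurrence, so a tie at the cut-off is resolved by corpus order rather than by any linguistic criterion.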



Now that the vocabulary is ready, you can use it to replace the OOV words with $<UNK>$ as you saw in the lecture.

In [2]:
# test if words in the input sentences are in the vocabulary, if OOV, print <UNK>

sentence = ['am', 'i', 'learning']
output_sentence = []

print(f"input sentence: {sentence}")
print()

for w in sentence:
    # test if word w is in vocabulary
    if w in vocabulary:                # vocabulary = ['happy', 'because', 'learning']
        output_sentence.append(w)
    else:
        output_sentence.append('<UNK>')
        
print(f"output sentence: {output_sentence}")

input sentence: ['am', 'i', 'learning']

output sentence: ['<UNK>', '<UNK>', 'learning']
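
The loop above can also be wrapped in a reusable helper. This is only a sketch: `replace_oov_words` and its `unknown_token` parameter are hypothetical names, not part of the assignment. Converting the vocabulary to a set is an optional design choice that keeps membership tests fast when the vocabulary is large.

```python
def replace_oov_words(tokens, vocabulary, unknown_token='<UNK>'):
    """Return a copy of tokens with every out-of-vocabulary word replaced."""
    # A set gives constant-time membership tests, unlike a list
    vocabulary_set = set(vocabulary)
    # Keep in-vocabulary words, substitute the unknown token otherwise
    return [w if w in vocabulary_set else unknown_token for w in tokens]

replaced = replace_oov_words(['am', 'i', 'learning'], ['happy', 'because', 'learning'])
print(replaced)
```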


When building the vocabulary in the code assignment, you will need to know how to iterate through the word counts dictionary. 

Here is an example of a similar task showing how to go through all the word counts and print out only the words with the frequency equal to `f`. 

In [3]:
# iterate through all word counts and print words with given frequency f

f = 3   # Set the target frequency to filter words by

word_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning':3, '.': 1}

for word, freq in word_counts.items():  # Iterate through all word-frequency pairs in the word_counts dictionary
    if freq == f:                       # If the frequency matches (3), print the word
        print(word)

because
learning
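
The same iteration pattern can express the minimum-frequency rule described at the start of this section. The following is only a sketch under that description; `build_vocabulary_min_freq` and `example_counts` are hypothetical names and may not match the interface expected in the assignment.

```python
def build_vocabulary_min_freq(word_counts, minimum_freq):
    """Return the words whose count is at least minimum_freq."""
    # Keep each word whose frequency is greater than or equal to the threshold
    return [word for word, freq in word_counts.items() if freq >= minimum_freq]

example_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning': 3, '.': 1}
min_freq_vocabulary = build_vocabulary_min_freq(example_counts, 3)
print(min_freq_vocabulary)
```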


As mentioned in the videos, if there are many $<UNK>$ replacements in your train and test set, you may get a very low perplexity even though the model itself wouldn't be very helpful. 
    
Here is sample code showing this unwanted effect.

In [4]:
# Define the training set containing words seen during training
training_set = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

# Define a modified training set where rare or unseen words are replaced with <UNK>
training_set_unk = ['i', 'am', '<UNK>', '<UNK>', 'i', 'am', '<UNK>', '<UNK>']

# Define the test set to evaluate the model
test_set = ['i', 'am', 'learning']

# Define a modified test set where rare or unseen words are replaced with <UNK>
test_set_unk = ['i', 'am', '<UNK>']

# Calculate the length of the test set
M = len(test_set)

# Initialize the probability for the test set to 1 (will be updated by multiplying bigram probabilities)
probability = 1

# Initialize the probability for the test set with <UNK> to 1
probability_unk = 1

# Pre-calculated bigram probabilities for pairs of words in the training set
bigram_probabilities = {
    ('i', 'am'): 1.0,          # P(am|i) = 1.0
    ('am', 'happy'): 0.5,      # P(happy|am) = 0.5
    ('happy', 'because'): 1.0, # P(because|happy) = 1.0
    ('because', 'i'): 1.0,     # P(i|because) = 1.0
    ('am', 'learning'): 0.5,   # P(learning|am) = 0.5
    ('learning', '.'): 1.0     # P(.|learning) = 1.0
}

# Pre-calculated bigram probabilities for the training set with <UNK>
bigram_probabilities_unk = {
    ('i', 'am'): 1.0,           # P(am|i) = 1.0
    ('am', '<UNK>'): 1.0,       # P(<UNK>|am) = 1.0
    ('<UNK>', '<UNK>'): 0.5,    # P(<UNK>|<UNK>) = 0.5
    ('<UNK>', 'i'): 0.25        # P(i|<UNK>) = 0.25
}

# Iterate through the test set to calculate its bigram probability
for i in range(len(test_set) - 2 + 1):
    
    bigram = tuple(test_set[i: i + 2])  # Get the bigram (pair of consecutive words) from the test set 
    
    probability = probability * bigram_probabilities[bigram]  # Multiply the current probability by the bigram probability
    
    bigram_unk = tuple(test_set_unk[i: i + 2])  # Get the bigram from the test set with <UNK>
    
    probability_unk = probability_unk * bigram_probabilities_unk[bigram_unk]  # Multiply the current probability by 
                                                                              # the bigram probability with <UNK>
        
# Calculate perplexity for the original test set
perplexity = probability ** (-1 / M)

# Calculate perplexity for the test set with <UNK>
perplexity_unk = probability_unk ** (-1 / M)

# Print the calculated perplexities
print(f"Perplexity for the test set: {perplexity}")
print(f"Perplexity for the test set with <UNK>: {perplexity_unk}")

Perplexity for the test set: 1.2599210498948732
Perplexity for the test set with <UNK>: 1.0
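
The calculation above can be condensed into a small function. This is a sketch that follows the same convention as the cell above, normalizing by the number of tokens in the sequence; `bigram_perplexity`, `probs`, and `probs_unk` are hypothetical names. It assumes every bigram of the input is present in the probability dictionary and raises a `KeyError` otherwise.

```python
def bigram_perplexity(tokens, bigram_probs):
    """Perplexity of a token sequence under a bigram model, normalized by len(tokens)."""
    probability = 1.0
    # zip pairs each token with its successor, yielding the consecutive bigrams
    for bigram in zip(tokens, tokens[1:]):
        probability *= bigram_probs[bigram]
    return probability ** (-1 / len(tokens))

# Only the bigram probabilities needed for the two test sequences
probs = {('i', 'am'): 1.0, ('am', 'learning'): 0.5}
probs_unk = {('i', 'am'): 1.0, ('am', '<UNK>'): 1.0}

pp = bigram_perplexity(['i', 'am', 'learning'], probs)
pp_unk = bigram_perplexity(['i', 'am', '<UNK>'], probs_unk)
print(pp, pp_unk)
```

Because every bigram in the `<UNK>` version has probability 1.0, its perplexity is exactly 1.0, the lowest possible value, even though the model has learned nothing about the word it replaced.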


<a name="smoothing"></a>
### Smoothing

Add-k smoothing was described as a method for smoothing of the probabilities for previously unseen n-grams. 

Here is example code that shows how to implement add-k smoothing and also highlights a disadvantage of the method: n-grams not seen in the training dataset can be assigned too high a probability.

In the code output below, you'll see that a phrase that is in the training set gets the same probability as an unknown phrase.

In [5]:
def add_k_smoothing_probability(k, vocabulary_size, n_gram_count, n_gram_prefix_count):
    """
    Calculates the probability of an n-gram with add-k smoothing.
    
    Args:
        k: Smoothing parameter, a small positive number (k = 1 gives add-one smoothing).
        vocabulary_size: Total number of unique words in the vocabulary.
        n_gram_count: The count of the specific n-gram (e.g., trigram) in the corpus.
        n_gram_prefix_count: The count of the n-gram's prefix (e.g., the bigram for a trigram) in the corpus.
        
    Returns:
        The smoothed probability of the n-gram.
    """
    # Calculate the numerator by adding the smoothing factor k to the n-gram count
    numerator = n_gram_count + k
    
    # Calculate the denominator by adding k multiplied by the vocabulary size to the prefix count
    denominator = n_gram_prefix_count + k * vocabulary_size
    
    # Return the smoothed probability
    return numerator / denominator

# Example n-gram counts from the corpus
trigram_counts = {('i', 'am', 'happy'): 2}  # Count of the trigram "i am happy"
bigram_counts = {('i', 'am'): 10}           # Count of the bigram "i am"

# Define the size of the vocabulary
vocabulary_size = 5

# Define the smoothing parameter k
k = 1

# Calculate the smoothed probability for a known trigram "i am happy"
probability_known_trigram = add_k_smoothing_probability(
    k, 
    vocabulary_size, 
    trigram_counts[('i', 'am', 'happy')], 
    bigram_counts[('i', 'am')]
)

# Calculate the smoothed probability for an unknown trigram (not seen in the corpus)
probability_unknown_trigram = add_k_smoothing_probability(k, vocabulary_size, 0, 0)

# Print the results
print(f"Probability_known_trigram: {probability_known_trigram}")
print(f"Probability_unknown_trigram: {probability_unknown_trigram}")

Probability_known_trigram: 0.2
Probability_unknown_trigram: 0.2
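
For reference, the function above computes the following quantity. The notation is an assumption introduced here for illustration: $C(\cdot)$ is a count in the training corpus, $V$ is the vocabulary size, and $w_{n-N+1}^{n-1}$ is the prefix of the n-gram.

$$
P_{\text{add-}k}(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n) + k}{C(w_{n-N+1}^{n-1}) + k \cdot V}
$$

With $k = 1$ and $V = 5$, the two cases in the example work out as:

$$
P_{\text{known}} = \frac{2 + 1}{10 + 1 \cdot 5} = \frac{3}{15} = 0.2,
\qquad
P_{\text{unknown}} = \frac{0 + 1}{0 + 1 \cdot 5} = \frac{1}{5} = 0.2
$$

The unknown trigram in this example has a prefix count of 0 as well, so its probability is determined entirely by $k$ and $V$.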


<a name="backoff"></a>
### Back-off
Back-off is a model generalization method that leverages information from lower-order n-grams when information about higher-order n-grams is missing. For example, if the probability of a trigram is missing, use the bigram probability, and if that is also missing, use the unigram probability.

Here you can see an example of a simple back-off technique.

In [6]:
# pre-calculated probabilities of all types of n-grams
trigram_probabilities = {('i', 'am', 'happy'): 0}
bigram_probabilities = {( 'am', 'happy'): 0.3}
unigram_probabilities = {'happy': 0.4}

# this is the input trigram we need to estimate
trigram = ('are', 'you', 'happy')

# find the last bigram and unigram of the input
bigram = trigram[1: 3]
unigram = trigram[2]
print(f"Besides the trigram {trigram} we also use bigram {bigram} and unigram ({unigram})\n")  #<===== 1st print

# 0.4 is used as an example, experimentally found for web-scale corpora when using the "stupid" back-off
lambda_factor = 0.4
probability_hat_trigram = 0

# search for first non-zero probability starting with trigram
# to generalize this for any order of n-gram hierarchy, 
# you could loop through the probability dictionaries instead of if/else cascade
if trigram not in trigram_probabilities or trigram_probabilities[trigram] == 0:
    print(f"Probability for trigram {trigram} not found")       #<===== 2nd print
    
    if bigram not in bigram_probabilities or bigram_probabilities[bigram] == 0:
        print(f"Probability for bigram {bigram} not found")     #<===== 3rd print
        
        if unigram in unigram_probabilities:
            print(f"Probability for unigram {unigram} found\n") #<===== 4th print
            probability_hat_trigram = lambda_factor * lambda_factor * unigram_probabilities[unigram]
        else:
            probability_hat_trigram = 0
    else:
        probability_hat_trigram = lambda_factor * bigram_probabilities[bigram]
else:
    probability_hat_trigram = trigram_probabilities[trigram]

print(f"Probability for trigram {trigram} estimated as {probability_hat_trigram}")  #<===== 5th print

Besides the trigram ('are', 'you', 'happy') we also use bigram ('you', 'happy') and unigram (happy)

Probability for trigram ('are', 'you', 'happy') not found
Probability for bigram ('you', 'happy') not found
Probability for unigram happy found

Probability for trigram ('are', 'you', 'happy') estimated as 0.06400000000000002
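
As the comment in the cell above suggests, the if/else cascade can be replaced by a loop over the probability dictionaries. The following is a minimal sketch of that idea, not a definitive implementation. `backoff_probability` and `probability_tables` are hypothetical names, and the sketch assumes the tables are ordered from highest to lowest order, that there is one table per n-gram order down to unigrams, and that the unigram table is keyed by plain strings as in the cell above.

```python
def backoff_probability(ngram, probability_tables, lambda_factor=0.4):
    """Return the first non-zero probability, discounted once per back-off step."""
    discount = 1.0
    for order_index, table in enumerate(probability_tables):
        # Drop one leading word for each back-off step taken so far
        key = ngram[order_index:]
        # The unigram table is keyed by the word itself, not a 1-tuple
        if len(key) == 1:
            key = key[0]
        probability = table.get(key, 0)
        if probability > 0:
            return discount * probability
        # Nothing found at this order: apply the discount and back off
        discount *= lambda_factor
    return 0.0

probability_tables = [
    {('i', 'am', 'happy'): 0},   # trigram probabilities
    {('am', 'happy'): 0.3},      # bigram probabilities
    {'happy': 0.4},              # unigram probabilities
]

estimate = backoff_probability(('are', 'you', 'happy'), probability_tables)
print(estimate)
```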


<a name="interpolation"></a>
### Interpolation
The other method for using probabilities of lower-order n-grams is interpolation. In this case, you use weighted probabilities of n-grams of all orders every time, not just when high-order information is missing.

For example, you always combine trigram, bigram, and unigram probabilities. You can see how this works in the following code snippet.

In [7]:
# Pre-calculated probabilities for different types of n-grams
trigram_probabilities = {('i', 'am', 'happy'): 0.15}  # Probability of the trigram "i am happy"
bigram_probabilities = {('am', 'happy'): 0.3}         # Probability of the bigram "am happy"
unigram_probabilities = {'happy': 0.4}                # Probability of the unigram "happy"

# The weights are determined from optimization on a validation set
lambda_1 = 0.8    # Weight for the trigram probability
lambda_2 = 0.15   # Weight for the bigram probability
lambda_3 = 0.05   # Weight for the unigram probability

# The input trigram we need to estimate the probability for
trigram = ('i', 'am', 'happy')

# Extract the last bigram and unigram from the input trigram
bigram = trigram[1: 3]    # "am happy"
unigram = trigram[2]      # "happy"
print(f"Besides the trigram {trigram}, we also use bigram {bigram} and unigram ({unigram})\n")

# Calculate the estimated probability of the trigram using a linear interpolation of the n-grams
# In production code, you would need to check if the probability n-gram dictionary contains the n-gram

# Apply the weights to each n-gram probability and sum them to get the interpolated probability
probability_hat_trigram = (
    lambda_1 * trigram_probabilities[trigram] +  # Weighted trigram probability
    lambda_2 * bigram_probabilities[bigram] +    # Weighted bigram probability
    lambda_3 * unigram_probabilities[unigram]    # Weighted unigram probability
)

# Print the final estimated probability of the input trigram
print(f"Estimated probability of the input trigram {trigram} is {probability_hat_trigram}")

Besides the trigram ('i', 'am', 'happy'), we also use bigram ('am', 'happy') and unigram (happy)

Estimated probability of the input trigram ('i', 'am', 'happy') is 0.185
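
The comment in the cell above notes that production code would need to check whether each dictionary contains the n-gram. One possible way to do that, sketched here under the assumption that a missing n-gram should contribute a probability of 0, is to use `dict.get` with a default. `interpolated_probability`, `tables`, and `weights` are hypothetical names.

```python
import math

def interpolated_probability(trigram, tables, weights):
    """Linear interpolation of trigram, bigram, and unigram probabilities."""
    trigram_probs, bigram_probs, unigram_probs = tables
    lambda_1, lambda_2, lambda_3 = weights
    # The interpolation weights should sum to 1 so the result stays a probability
    assert math.isclose(lambda_1 + lambda_2 + lambda_3, 1.0)
    # dict.get returns the default 0.0 when the n-gram is missing
    return (
        lambda_1 * trigram_probs.get(trigram, 0.0)
        + lambda_2 * bigram_probs.get(trigram[1:], 0.0)
        + lambda_3 * unigram_probs.get(trigram[2], 0.0)
    )

tables = (
    {('i', 'am', 'happy'): 0.15},  # trigram probabilities
    {('am', 'happy'): 0.3},        # bigram probabilities
    {'happy': 0.4},                # unigram probabilities
)
weights = (0.8, 0.15, 0.05)

seen = interpolated_probability(('i', 'am', 'happy'), tables, weights)
unseen = interpolated_probability(('are', 'you', 'happy'), tables, weights)
print(seen, unseen)
```

For the unseen trigram only the unigram term is non-zero, so the estimate falls back to the weighted unigram probability instead of raising a `KeyError`.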


