# Language Models: Auto-Complete

A key building block for an auto-complete system is a language model.
A language model assigns a probability to a sequence of words, so that more "likely" sequences receive higher scores.  For example, 
>"I have a pen" 
is expected to have a higher probability than 
>"I am a pen"
since the first one seems to be a more natural sentence in the real world.

In [None]:
pip install trax

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")
%cd /content/gdrive/MyDrive/Colab Notebooks/13 - NLP/

Mounted at /content/gdrive
/content/gdrive/MyDrive/Colab Notebooks/13 - NLP


In [None]:
import math
import random
import numpy as np
import pandas as pd
from collections import defaultdict
import nltk
from trax import layers as tl
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Data Pre-processing

In [None]:
with open("./data/en_US.twitter.txt", "r") as f:
    data = f.read()
len(data)

3335477

### Handling 'Out of Vocabulary' words

If your model encounters a word that it never saw during training, it has no counts for that word and cannot use it to help predict the next word.
- Such a word is called an 'unknown word', or an <b>out of vocabulary (OOV)</b> word.
- The percentage of unknown words in the test set is called the <b>OOV</b> rate.

To handle unknown words during prediction, use a special token, `<unk>`, to represent all unknown words.
- Modify the training data so that it has some 'unknown' words to train on.
- Words to convert into "unknown" words are those that do not occur very frequently in the training set.
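
As a small illustration of the OOV rate defined above, the sketch below measures the fraction of test tokens that are missing from a vocabulary. The helper name `oov_rate` and the toy sentences are assumptions made for this example; they are not part of the notebook's own code.

```python
def oov_rate(test_sentences, vocabulary):
    """Fraction of test tokens that are not in the vocabulary (hypothetical helper)."""
    vocab = set(vocabulary)
    tokens = [tok for sent in test_sentences for tok in sent]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)

# Toy data, assumed for illustration only
toy_vocab = ["sky", "is", "blue", "the"]
toy_test = [["the", "sky", "is", "grey"], ["roses", "are", "red"]]

rate = oov_rate(toy_test, toy_vocab)
print(f"OOV rate: {rate:.2%}")
```

Here four of the seven test tokens ("grey", "roses", "are", "red") are outside the vocabulary, so they would all be mapped to the unknown token at prediction time.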

In [None]:
def get_tokenized_data(data):
    sentences = data.split('\n')  
    tokenized_sentences = []

    # Remove leading and trailing spaces from each sentence
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]
    
    # Go through each sentence
    for sentence in sentences:
        sentence = sentence.lower()
        tokenized = nltk.word_tokenize(sentence)
        tokenized_sentences.append(tokenized)
    return tokenized_sentences

def count_words(tokenized_sentences):
    '''Get Frequency of Each Word'''
    word_counts = defaultdict(int)
    
    # Loop through each sentence
    for sentence in tokenized_sentences:
        for token in sentence:
            word_counts[token] += 1
    return word_counts

def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    '''Get words with minimum threshold'''
    
    # count word
    word_counts = count_words(tokenized_sentences)
    closed_vocab = []
    
    # for each word and its count
    for word, cnt in word_counts.items():
        if cnt >= count_threshold:
            closed_vocab.append(word)
            
    return closed_vocab

def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    """Replace words not in the given vocabulary with '<unk>' token"""
    
    vocabulary = set(vocabulary)
    replaced_tokenized_sentences = []
    
    # Go through each sentence
    for sentence in tokenized_sentences:
        replaced_sentence = []
        
        # for each token in the sentence
        for token in sentence:
            
            # Check if the token is in the closed vocabulary
            if token in vocabulary:
                replaced_sentence.append(token)
            else:
                replaced_sentence.append(unknown_token)
        
        # Append the list of tokens to the list of lists
        replaced_tokenized_sentences.append(replaced_sentence)
        
    return replaced_tokenized_sentences

def preprocess_data(train_data, test_data, count_threshold):
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)
    test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary)
    return train_data_replaced, test_data_replaced, vocabulary

In [None]:
sent = "Sky is blue.\nLeaves are green\nRoses are red.\nSky is the limit.\nUnder the sky, over the blue ocean."
tokenized_sentences = get_tokenized_data(sent)
word_freq = count_words(tokenized_sentences)
tmp_closed_vocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, tmp_closed_vocab)

In [None]:
# train test split
tokenized_data = get_tokenized_data(data)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0 : train_size]
test_data = tokenized_data[train_size : ]
print(len(train_data), len(test_data))

38368 9593


In [None]:
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, test_data, count_threshold=2)

# N-gram based language models
## N-gram Probability

In this section, you will develop the n-gram language model.
- Assume the probability of the next word depends only on the previous n-gram.
- The previous n-gram is the series of the previous 'n' words.
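
To make "the series of the previous 'n' words" concrete, here is a minimal sketch that lists every window of `n` consecutive tokens in a sentence. The helper name `sliding_n_grams` is hypothetical and is used only for this illustration.

```python
def sliding_n_grams(tokens, n):
    """Return every window of n consecutive tokens as a tuple (illustrative helper)."""
    # The last valid start index is len(tokens) - n, so every window has exactly n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "like", "a", "cat"]
bigrams = sliding_n_grams(tokens, 2)
trigrams = sliding_n_grams(tokens, 3)
print(bigrams)
print(trigrams)
```

With a bigram context, the word "cat" is predicted from `('like', 'a')`; with a trigram context it would be predicted from `('i', 'like', 'a')`.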

The conditional probability for the word at position 't' in the sentence, given that the words preceding it are $w_{t-1}, w_{t-2} \cdots w_{t-n}$ is:

$$ P(w_t | w_{t-1}\dots w_{t-n}) \tag{1}$$

You can estimate this probability by counting the occurrences of these word sequences in the training data.
- The probability can be estimated as a ratio, where
- The numerator is the number of times the word $w_t$ appears directly after the words $w_{t-n}$ through $w_{t-1}$ in the training data.
- The denominator is the number of times the words $w_{t-n}$ through $w_{t-1}$ appear in the training data.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} $$

- The function $C(\cdots)$ denotes the number of occurrences of the given sequence. 
- $\hat{P}$ means the estimate of $P$. 
- Notice that the denominator of equation (2) is the number of occurrences of the previous $n$ words, and the numerator counts the same sequence followed by the word $w_t$.

Later, you will modify the equation (2) by adding k-smoothing, which avoids errors when any counts are zero.

The equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for denominator) and (n+1)-grams (for numerator).
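
The ratio in equation (2) can be sketched for the simplest case, a one-word context, using plain counters. This is an unsmoothed illustration on a toy corpus; the helper `mle_probability` and the choice to return `None` for an unseen context are assumptions of this example, not the notebook's implementation.

```python
from collections import Counter

corpus = [["i", "like", "a", "cat"], ["this", "dog", "is", "like", "a", "cat"]]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    # One start token for a one-word context, plus one end token
    padded = ["<s>"] + sentence + ["<e>"]
    unigram_counts.update((w,) for w in padded)
    bigram_counts.update(zip(padded, padded[1:]))

def mle_probability(word, previous_word):
    """Unsmoothed estimate C(previous, word) / C(previous) (hypothetical helper)."""
    context_count = unigram_counts[(previous_word,)]
    if context_count == 0:
        return None  # equation (2) is undefined for an unseen context
    return bigram_counts[(previous_word, word)] / context_count

p_i_given_start = mle_probability("i", "<s>")     # 1 of the 2 sentences starts with "i"
p_cat_given_a = mle_probability("cat", "a")       # "a" is always followed by "cat"
p_dog_given_a = mle_probability("dog", "a")       # never observed, so the estimate is 0
p_unseen_context = mle_probability("cat", "zebra")
print(p_i_given_start, p_cat_given_a, p_dog_given_a, p_unseen_context)
```

The zero estimate for an unobserved pair and the undefined estimate for an unseen context are the two cases that smoothing is meant to address.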

In [None]:
def count_n_grams(data, n, start_token='<s>', end_token = '<e>'):
    '''Count all n-grams in the data'''
    n_grams = defaultdict(int)
    
    # Go through each sentence in the data
    for sentence in data:

        # prepend start token n times, and  append <e> one time
        sentence = [start_token] * n + sentence + [end_token]
        
        # convert list to tuple to be the key of dictionaries
        sentence = tuple(sentence)
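        # Number of start positions to scan. For n = 1 and n = 2 every window
        # has exactly n tokens. Note: for n > 2 the slice is cut short near the
        # end of the sentence, so truncated n-grams are counted as well;
        # range(len(sentence) - n + 1) would keep only full-length n-grams.
        if n==1:
            m = len(sentence) 
        else:
            m = len(sentence) - 1
        
        for i in range(m):
            n_gram = sentence[i : i + n]
            n_grams[n_gram] += 1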
    return n_grams

In [None]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram:")
print(count_n_grams(sentences, 1))
print("\nBi-gram:")
print(count_n_grams(sentences, 2))

Uni-gram:
defaultdict(<class 'int'>, {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1})

Bi-gram:
defaultdict(<class 'int'>, {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1})


## Estimate Probability

Next, estimate the probability of a word given the prior 'n' words using the n-gram counts.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} $$

This formula doesn't work when the count of an n-gram is zero.
- Suppose we encounter an n-gram that did not occur in the training data.  
- Then equation (2) cannot be evaluated (it becomes zero divided by zero).

A way to handle zero counts is to add k-smoothing.  
- K-smoothing adds a positive constant $k$ to the numerator and $k \times |V|$ to the denominator, where $|V|$ is the number of words in the vocabulary.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_n) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} $$


When the previous n-gram has a zero count, the (n+1)-gram count is zero as well, so equation (3) becomes $\frac{k}{k|V|} = \frac{1}{|V|}$.
- This means that after an unseen n-gram every word in the vocabulary gets the same probability, $\frac{1}{|V|}$.
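
The arithmetic of equation (3) can be checked with a few plain numbers. The function `smoothed_probability` and the vocabulary size of 9 are assumptions chosen for this sketch (for example, 7 words plus the end and unknown tokens).

```python
def smoothed_probability(n_plus1_gram_count, n_gram_count, vocabulary_size, k=1.0):
    """Equation (3): (C(context, word) + k) / (C(context) + k * |V|)."""
    return (n_plus1_gram_count + k) / (n_gram_count + k * vocabulary_size)

vocabulary_size = 9  # assumed toy value

# Context seen twice and followed by this word both times
seen = smoothed_probability(2, 2, vocabulary_size)
# Context seen twice, but never followed by this word
unseen_word = smoothed_probability(0, 2, vocabulary_size)
# Context never seen at all: falls back to the uniform value 1 / |V|
unseen_context = smoothed_probability(0, 0, vocabulary_size)

# The smoothed probabilities over the whole vocabulary still sum to 1
counts_after_context = [2] + [0] * (vocabulary_size - 1)
total = sum(smoothed_probability(c, 2, vocabulary_size) for c in counts_after_context)
print(seen, unseen_word, unseen_context, total)
```

Smoothing moves some probability mass from observed continuations (3/11 instead of 1) to unobserved ones (1/11 each), while keeping a valid distribution.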

Define a function that computes the probability estimate (3) from n-gram counts and a constant $k$.

- The function takes in a dictionary 'n_gram_counts', where the key is the n-gram and the value is the count of that n-gram.
- The function also takes another dictionary n_plus1_gram_counts, which you'll use to find the count for the previous n-gram plus the current word.

In [None]:
def estimate_probability(next_word, previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Estimate the probabilities of a next word using the n-gram counts with k-smoothing"""
    previous_n_gram = tuple(previous_n_gram)
    
    # Set the denominator
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
    denominator = previous_n_gram_count + k * vocabulary_size

    # Set numerator
    n_plus1_gram = previous_n_gram + (next_word,)
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + k
    probability = numerator / denominator    
    return probability

def estimate_probabilities(previous_n_gram, data, n, vocabulary, k=1.0):
    """Estimate the probabilities of next words using the n-gram counts with k-smoothing"""
    
    n_gram_counts = count_n_grams(data, n)
    n_plus1_gram_counts = count_n_grams(data, n+1)
    previous_n_gram = tuple(previous_n_gram)
    
    # add <e> <unk> to the vocabulary, <s> is not needed since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]
    vocabulary_size = len(vocabulary)
    
    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k)
        probabilities[word] = probability
    # probabilities = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)
    return probabilities

def make_count_matrix(n_plus1_gram_counts, vocabulary):
    vocabulary = vocabulary + ["<e>", "<unk>"]
    
    # obtain unique n-grams
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0 : -1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))

    # mapping from n-gram to row
    row_index = {n_gram : i for i, n_gram in enumerate(n_grams)}
    
    # mapping from next word to column
    col_index = {word : j for j, word in enumerate(vocabulary)}
    
    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count
    
    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    return count_matrix

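def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    # use the vocabulary argument rather than a global variable
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    # add-k smoothing, then normalize each row so that it sums to 1
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix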

In [None]:
# test your code
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat'], ['you', 'look', 'like', 'a', 'dog']]
unique_words = list(set(sentences[0] + sentences[1]))
# pass the previous n-gram as a list of tokens (a bare string would be split into characters)
estimate_probabilities(["a"], sentences, 1, unique_words, k=1)

{'<e>': 0.08333333333333333,
 '<unk>': 0.08333333333333333,
 'a': 0.08333333333333333,
 'cat': 0.25,
 'dog': 0.16666666666666666,
 'i': 0.08333333333333333,
 'is': 0.08333333333333333,
 'like': 0.08333333333333333,
 'this': 0.08333333333333333}

In [None]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
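# n = 5 here, so these are 5-gram counts and each row of the matrix is a 4-word context
# (count_n_grams also keeps the shorter windows at the end of each sentence)
five_gram_counts = count_n_grams(sentences, 5)

display(make_count_matrix(five_gram_counts, unique_words))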

| | i | cat | like | this | a | dog | is | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(i, like, a, cat)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| `(<s>, <s>, <s>, i)` | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(a, cat)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| `(<s>, <s>, <s>, this)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| `(<s>, <s>, this, dog)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| `(this, dog, is, like)` | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(<s>, <s>, <s>, <s>)` | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(is, like, a, cat)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| `(like, a, cat)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| `(cat,)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |


In [None]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
# n = 3, so these are trigram counts and each row of the matrix is a two-word context
trigram_counts = count_n_grams(sentences, 3)
print("Trigram probabilities")
display(make_probability_matrix(trigram_counts, unique_words, k=1))

Trigram probabilities


| | i | cat | like | this | a | dog | is | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(<s>, i)` | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(i, like)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(this, dog)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 |
| `(<s>, <s>)` | 0.181818 | 0.090909 | 0.090909 | 0.181818 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 |
| `(a, cat)` | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.272727 | 0.090909 |
| `(like, a)` | 0.090909 | 0.272727 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 |
| `(<s>, this)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 |
| `(cat,)` | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.272727 | 0.090909 |
| `(is, like)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(dog, is)` | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |


# Perplexity

In this section, you will generate the perplexity score to evaluate your model on the test set. 
- You will also use back-off when needed. 
- Perplexity is used as an evaluation metric of your language model. 
- To calculate the perplexity score of the test set on an n-gram model, use: 

$$ PP(W) =\sqrt[N]{ \prod_{t=n+1}^N \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})} } \tag{4}$$

- where $N$ is the length of the sentence.
- $n$ is the number of words in the n-gram (e.g. 2 for a bigram).
- In math, the numbering starts at one and not zero.

In code, array indexing starts at zero, so the code will use ranges for $t$ according to this formula:

$$ PP(W) =\sqrt[N]{ \prod_{t=n}^{N-1} \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})} } \tag{4.1}$$

The higher the probabilities are, the lower the perplexity will be. 
- The more the n-grams tell us about the sentence, the lower the perplexity score will be. 
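
For long sentences, multiplying many reciprocals can overflow or lose precision, so the same quantity is often computed from a sum of logs. The sketch below compares both routes. The function name is hypothetical, and the probabilities are assumed values: they are what equation (3) gives with $k=1$ and $|V|=7$ for the toy sentence `<s> i like a cat <e>` under a one-word context.

```python
import math

def perplexity_from_probabilities(probabilities, N):
    """N-th root of the product of 1/p, computed through logs (hypothetical helper)."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / N)

# Assumed smoothed probabilities of "i", "like", "a", "cat", "<e>" given the previous word
word_probabilities = [2 / 9, 2 / 8, 3 / 9, 3 / 9, 3 / 9]
N = 6  # sentence length including the <s> and <e> tokens

direct = math.prod(1 / p for p in word_probabilities) ** (1 / N)
stable = perplexity_from_probabilities(word_probabilities, N)
print(f"direct: {direct:.4f}, via logs: {stable:.4f}")
```

Both routes give the same value; raising any of the probabilities lowers the result, which is the sense in which a better model has lower perplexity.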

In [None]:
def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Calculate perplexity for a single tokenized sentence"""

    n = len(list(n_gram_counts.keys())[0]) 
    sentence = ["<s>"] * n + sentence + ["<e>"]
    sentence = tuple(sentence)
    
    # length of sentence (after adding <s> and <e> tokens)
    N = len(sentence)
    product_pi = 1.0
        
    for t in range(n, N):
        # get the n-gram preceding the word at position t
        n_gram = sentence[t-n : t]
        
        # get the word at position t
        word = sentence[t]
                
        # Estimate the probability of the word given the n-gram
        # (use the vocabulary_size and k arguments, not global variables or a hard-coded k)
        probability = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=k)
        
        # Update the product of the probabilities, 'product_pi' is a cumulative product 
        product_pi *= 1 / probability

    # Take the Nth root of the product
    perplexity = product_pi**(1/float(N))
    
    return perplexity

In [None]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
perplexity_train1 = calculate_perplexity(sentences[0], unigram_counts, bigram_counts, len(unique_words), k=1.0)
print(f"Perplexity for first train sample: {perplexity_train1:.4f}")

test_sentence = ['i', 'like', 'a', 'dog']
perplexity_test = calculate_perplexity(test_sentence, unigram_counts, bigram_counts, len(unique_words), k=1.0)
print(f"Perplexity for test sample: {perplexity_test:.4f}")

Perplexity for first train sample: 2.8040
Perplexity for test sample: 3.9654


## Perplexity of Seq Model

Perplexity is a metric that measures how well a probability model predicts a sample, and it is commonly used to evaluate language models. It is defined as: 

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}$$

In practice, you usually work with the log of that formula. The `RNN` outputs log probabilities, and taking the log turns the root into a factor of $\frac{1}{N}$ and the product into a sum, which is simpler and cheaper to compute. You should also take care of the padding: padded positions must be excluded from the calculation, otherwise the perplexity would look artificially good. The algebra behind this process is explained next:


$$\log PP(W) = {\log\big(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\big)}$$

$$ = {\log\big({\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\big)^{\frac{1}{N}}}$$ 

$$ = {\log\big({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\big)^{-\frac{1}{N}}} $$
$$ = -\frac{1}{N}{\log\big({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\big)} $$
$$ = -\frac{1}{N}{\big({\sum_{i=1}^{N}{\log P(w_i| w_1,...,w_{i-1})}}\big)} $$
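
The last line of the derivation, combined with the padding rule, can be checked on a tiny hand-made batch. Everything in this sketch is an assumption for illustration: the probabilities, the target ids, and the convention that id 0 marks padding.

```python
import numpy as np

# Assumed toy model output: 1 sequence, 4 positions, vocabulary of 3 ids
probs = np.array([[[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.3, 0.5],
                   [0.9, 0.05, 0.05]]])
log_probs = np.log(probs)
targets = np.array([[1, 1, 2, 0]])  # the last position is padding (id 0, by assumption)

# Log-probability assigned to each target id
target_log_probs = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)

# 1 for real tokens, 0 for padding
mask = (targets != 0).astype(float)

# Negative mean log-probability over the real tokens only
log_ppx = -np.sum(target_log_probs * mask) / np.sum(mask)

# Cross-check: N-th root of the product of 1/p over the three real tokens
real_probs = np.array([0.2, 0.6, 0.5])
direct_ppx = np.prod(1.0 / real_probs) ** (1.0 / len(real_probs))
print(np.exp(log_ppx), direct_ppx)
```

Leaving the padded position in would add an easy prediction (probability 0.9) to the average and pull the perplexity down, which is why it is masked out.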

In [None]:
# Load from .npy files
predictions = np.load('data/predictions.npy')
targets = np.load('data/targets.npy')

# Make sure both are NumPy arrays, then one-hot encode the targets
predictions = np.array(predictions)
targets = np.array(targets)
reshaped_targets = tl.one_hot(targets, predictions.shape[-1])

# Print shapes
print(f'predictions has shape: {predictions.shape}')
print(f'targets has shape: {targets.shape}')
print(f'targets has shape after reshaping: {reshaped_targets.shape}')

predictions has shape: (32, 64, 256)
targets has shape: (32, 64)
targets has shape after reshaping: (32, 64, 256)


In [None]:
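# log-probability the model assigned to each target token
total_log_ppx = np.sum(predictions * reshaped_targets, axis=-1)
# mask is 1 for real tokens and 0 for padding (target id 0)
mask = 1.0 - np.equal(targets, 0)
# zero out the padded positions
real_log_ppx = total_log_ppx * mask
# negative mean log-probability over the real tokens
log_ppx = -(np.sum(real_log_ppx) / np.sum(mask))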

In [None]:
print(f'The log perplexity and perplexity of the model are respectively: {log_ppx} and {np.exp(log_ppx)}')

The log perplexity and perplexity of the model are respectively: 2.3281209468841553 and 10.258646965026855


# Auto-complete system

In this section, you will combine the language models developed so far to implement an auto-complete system. 


In [None]:
def suggest_a_word(previous_tokens, data, n, vocabulary, k=1.0, start_with=None):
    """Suggest the most probable next word, optionally restricted to a prefix"""
    
    # get the most recent 'n' words as the previous n-gram
    previous_n_gram = previous_tokens[-n:]
    
    # Estimate the probability of each word in the vocabulary given the previous n-gram
    # (estimate_probabilities builds the n-gram and (n+1)-gram counts from data itself)
    probabilities = estimate_probabilities(previous_n_gram, data, n, vocabulary, k=k)
    
    # Initialize suggested word to None and prob to 0
    suggestion, max_prob = None, 0
        
    # For each word and its probability in the probabilities dictionary:
    for word, prob in probabilities.items():
        
        # If the optional start_with string is set
        if start_with is not None:
            # Skip words that do not begin with the letters in 'start_with'
            if not word.startswith(start_with):
                continue  
        
        # Check if this word's probability is greater than the current maximum probability
        if prob > max_prob:
            suggestion = word
            max_prob = prob    
    return suggestion, max_prob

def get_suggestions(previous_tokens, data, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    '''Get one suggestion for each n-gram length listed in n_gram_counts_list'''

    suggestions = []
    
    # despite its name, n_gram_counts_list holds the n values to try (e.g. [2, 3, 4])
    for n in n_gram_counts_list:    
        suggestion = suggest_a_word(previous_tokens, data, n, vocabulary, k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions

In [None]:
# test your code
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

previous_tokens = ["i", "like"]
tmp_suggest1 = suggest_a_word(previous_tokens, sentences, 2, unique_words, k=1.0)
print(tmp_suggest1)

# test your code when setting the starts_with
previous_tokens = ["i", "like"]
tmp_starts_with = 'c'
tmp_suggest2 = suggest_a_word(previous_tokens, sentences, 2, unique_words, k=1.0, start_with=tmp_starts_with)
print(tmp_suggest2)

('a', 0.2)
('cat', 0.1)


In [None]:
# test your code
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

n_gram_counts_list = [2, 3, 4, 5, 6]
previous_tokens = ['you', 'know', "i", "like"]
tmp_suggest3 = get_suggestions(previous_tokens, sentences, n_gram_counts_list, unique_words, k=1.0)

display(tmp_suggest3)

[('a', 0.2),
 ('i', 0.1111111111111111),
 ('i', 0.1111111111111111),
 ('i', 0.1111111111111111),
 ('i', 0.1111111111111111)]

## Suggest multiple words using n-grams of varying length

In [None]:
previous_tokens = ["i", "am", "to"]
n_gram_counts_list = [2, 3, 4]
tmp_suggest4 = get_suggestions(previous_tokens, train_data_processed, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest4)

The previous words are ['i', 'am', 'to'], the suggestions are:


[('have', 0.0001343634531407457),
 ('have', 0.00013439956992137626),
 ('this', 6.721333512568894e-05)]

In [None]:
n_gram_counts_list = [2, 3, 4]
previous_tokens = ["i", "want", "to", "go"]
tmp_suggest5 = get_suggestions(previous_tokens, train_data_processed, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest5)

The previous words are ['i', 'want', 'to', 'go'], the suggestions are:


[('to', 0.004286750643012596),
 ('to', 0.0009389041647106163),
 ('to', 0.0004028738333445243)]

In [None]:
n_gram_counts_list = [2, 3, 4]
previous_tokens = ["hey", "how", "are"]
tmp_suggest6 = get_suggestions(previous_tokens, train_data_processed, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest6)

The previous words are ['hey', 'how', 'are'], the suggestions are:


[('you', 0.00388011774150388),
 ('you', 0.0001344176355937899),
 ('this', 6.721333512568894e-05)]

In [None]:
previous_tokens = ["hey", "how", "are", "you"]
tmp_suggest7 = get_suggestions(previous_tokens, train_data_processed, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest7)

The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:


[('?', 0.002881655642150763),
 ('?', 0.0016739203213927017),
 ('<e>', 0.0001344176355937899)]