# Language Models: Auto-Complete

In this lab, we will build an auto-complete system. An auto-complete system offers the user suggestions for completing a sentence.

## Applications of Auto-complete

- When you google something, you often have suggestions to help you complete your search. 
- When you are writing an email, you get suggestions telling you possible endings to your sentence.  

<img src = "https://i.imgur.com/OMbdb82.png" style="width:700px;height:300px;"/>




<br>


## How to build an Auto-complete system?
To build an auto-complete system, you need a language model. A language model is a probability distribution over sequences of words. In other words, it assigns a probability $P(w_{1},\ldots,w_{m})$ to the whole sequence.


One language model that we can use to build an auto-complete system is the <strong>N-gram model</strong>, which estimates the probability $P(w|h)$ of a word $w$ given a history of words $h$.

<strong>Ex:</strong> Suppose the history $h$ is <strong>"its water is so transparent so that"</strong>
and we want to know the probability that the next word is <strong>the:</strong>

<p style ="text-align: center">$\large P(the\hspace{1mm} |\hspace{1mm} its\hspace{1mm} water\hspace{1mm}  is \hspace{1mm} so\hspace{1mm}  transparent \hspace{0.4mm} so\hspace{1mm} that)$</p>
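
For reference, the link between the whole-sequence probability and the conditional $P(w|h)$ can be written with the chain rule of probability. This is a standard identity rather than something specific to this lab; the history length $n$ in the approximation on the right is an assumption of the n-gram model:

$$P(w_{1},\ldots,w_{m}) = \prod_{t=1}^{m} P(w_t \mid w_{1},\ldots,w_{t-1}) \approx \prod_{t=1}^{m} P(w_t \mid w_{t-n},\ldots,w_{t-1})$$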


<br>


# Outline
* [Import the necessary libraries and the data](#0)
* [Explore the data](#1)
* [Pre-process the data](#2)
 * [Split the data into sentences](#3.1)
 * [Tokenize sentences](#3.2)
 * [Putting all together](#3.3)
 * [Split the data into train and test sets](#3.4)
 * [Find tokens that appear at least N times in the training data](#3.5)
 * [Replace tokens that appear less than N times by `<unk>`](#3.6)
 * [Putting all together](#3.7)
 * [Pre-process the train and test set](#3.8)
* [Develop n-gram model](#4)
 * [Compute the counts of n-grams for an arbitrary number  𝑛](#4.1)
 * [Estimate the probability of a word given some history of words ](#4.2)
 * [Estimate probabilities for all words](#4.3)
 * [Visualize n-gram counts in a matrix form](#4.4)
 * [Visualize n-gram probabilities in a matrix form](#4.5)
* [Perplexity](#5)
* [Build an auto-complete system](#6)
 * [Suggest A word](#6.1)
 * [Suggest Multiple Words](#6.2)
 * [Suggest Multiple Words Using n-grams of Varying Length](#6.3)

# Import the necessary libraries and the data <a anchor = "anchor" id = "0" />

In [None]:
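#Import the necessary libraries
import math
import random
import numpy as np
import pandas as pd
import nltk
#nltk.word_tokenize (used below) needs NLTK's Punkt tokenizer data.
#If it is missing, download it once with nltk.download('punkt')
#(recent NLTK releases may ask for the 'punkt_tab' resource instead).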

In [None]:
#import the data 
with open("en_US.twitter.txt", "r", encoding="utf8") as f:
    data = f.read()

# Explore the data <a anchor = "anchor" id = "1" />

In [None]:
#Explore the datatype 
print("Data type:", type(data))

In [None]:
#Explore the length of the data --> (the number of characters)
len(data)

In [None]:
#Display a subset of the data --> the first 300 characters in the data
display(data[0:300])

In [None]:
#Display a subset of the data --> the last 300 characters in the data
display(data[-300:])

# Pre-process the data <a anchor = "anchor" id = "2" />
Pre-processing prepares the raw data for the model by converting it into a clean, consistent form.

<strong>The steps to pre-process the data are as follows:</strong>

* Split data into sentences using "\n" as the delimiter.
* Split each sentence into tokens. Note that in this assignment we use "token" and "word" interchangeably.
* Assign sentences into train or test sets.
* Find tokens that appear at least N times in the training data.
* Replace tokens that appear less than N times by `<unk>`


## Split the data into sentences <a anchor = "anchor" id = "3.1" />

In [None]:
def split_to_sentences(data):  
    '''
    Usage:
      #split_to_sentences --> used for splitting data by line break "\n"
  
    Arguments:
      #data --> string
    
    Returns:
      #sentences --> a list of sentences
      
    '''
    
    #Split when you find a linebreak ,"\n", in the data
    sentences = data.split('\n')
    
    #Remove leading and trailing whitespace from each sentence
    #(whitespace at the very beginning or the very end of the string)
    sentences = [s.strip() for s in sentences]
    
    #Drop sentences if they are empty strings --> ""
    sentences = [s for s in sentences if len(s) > 0]
    
    
    return sentences

In [None]:
#Test the code
txt = "   I love Aswan\nI love Luxor"
print("Before:\n")
print(txt)
print("------")
print("After:")
split_to_sentences(txt)

In [None]:
#Extra test 

x = """
I have a pen.\nI have an apple. \nAh\nApple pen.\n
"""
print(x)

split_to_sentences(x)

## Tokenize sentences <a anchor = "anchor" id = "3.2" />

After splitting the data into sentences, the next step is to tokenize the sentences, or in other words, to split every sentence into a list of words called tokens.

<strong>The steps to tokenize sentences:</strong>

- Convert all tokens into lower case so that words which are capitalized (for example, at the start of a sentence) in the original text are treated the same as the lowercase versions of the words.
- Append each tokenized list of words into a list of tokenized sentences.
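
The two steps above can be sketched without NLTK. This is a minimal illustration that assumes a simple regular expression is an acceptable stand-in for `nltk.word_tokenize`; the helper name `simple_tokenize` is hypothetical, and its output will differ from NLTK's on contractions and other edge cases.

```python
import re

def simple_tokenize(sentences):
    """Lowercase each sentence and split it into word and punctuation tokens.

    Hypothetical stand-in for nltk.word_tokenize, for illustration only.
    """
    tokenized_sentences = []
    for sentence in sentences:
        # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
        tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
        tokenized_sentences.append(tokens)
    return tokenized_sentences

print(simple_tokenize(["Sky is blue.", "Roses are red."]))
```

Lowercasing happens before splitting, so "Sky" and "sky" end up as the same token.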


In [None]:
def tokenize_sentences(sentences):
    '''
    Usage:
      #tokenize_sentences --> used to split every sentence into a list of words.
      
    Arguments:
      #sentences --> list of strings (sentences)
    
    Returns:
      #tokenized_sentences --> a list of lists of tokens
    '''
    
    #Initialize empty list that will hold the lists of tokens 
    tokenized_sentences = []
    
    #Loop over every sentence in sentences
    for sentence in sentences:
        
        #Convert the sentence to lowercase letters
        sentence = sentence.lower()
        
        #Then, convert it to list of words (tokens)
        tokenized = nltk.word_tokenize(sentence)
        
        
        #Append the list of word, tokenized, to the list of lists, tokenized_sentences
        tokenized_sentences.append(tokenized)
        
        
    
    return tokenized_sentences

In [None]:
#Test the code
txt = ["I love Aswan", "I love Luxor"]
print("Before:\n")
print(txt)
print("------")
print("After:")
tokenize_sentences(txt)

In [None]:
#Extra test 
sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]
tokenize_sentences(sentences)

## Putting all together <a anchor = "anchor" id = "3.3" />

Use the two functions that you have just implemented to get the tokenized data.
- split the data into sentences
- tokenize those sentences

In [None]:
def get_tokenized_data(data):
    '''
    Usage:
      #get_tokenized_data--> used for tokenizing the data to become list of lists of tokens
  
    Arguments:
      #data --> string
    
    Returns:
      #tokenized_sentences --> a list of lists of tokens
      
    '''
    
    #Split the data into sentences
    sentences = split_to_sentences(data)
    
    #Tokenize the data into a list of lists of tokens
    tokenized_sentences = tokenize_sentences(sentences)
    
    
    return tokenized_sentences

In [None]:
#Test the code
txt = "   I love Aswan\nI love Luxor"
print("Before:\n")
print(txt)
print("------")
print("After:")
get_tokenized_data(txt)

In [None]:
#Extra test
x = "Sky is blue.\nLeaves are green\nRoses are red."
get_tokenized_data(x)

## Split the data into train and test sets <a anchor = "anchor" id = "3.4" />


In [None]:
#First, tokenize the data
tokenized_data = get_tokenized_data(data)

In [None]:
#Re-order the lists of tokens inside the list tokenized_data
random.seed(87)
random.shuffle(tokenized_data)

In [None]:
#Split the data into 80%  train set and 20% test set
train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]

In [None]:
#Explore the length of the train and test set relative to the data
print(f"The tokenized data length: {len(tokenized_data)}\nThe train set length: {len(train_data)}\nThe test set length: {len(test_data)}")

In [None]:
#Explore the first example in train set 
train_data[0]

In [None]:
#Explore the first example in the test set
test_data[0]

## Find tokens that appear at least N times in the training data <a anchor = "anchor" id = "3.5" />


- We won't use all the tokens (words) appearing in the data for training. Instead, we will use the more frequently used words.

- We will focus on the words that appear at least N times in the training data.

- First count how many times each word appears in the data.

In [None]:
def count_words(tokenized_sentences):
    '''
    Usage:
      #count_words --> used to count the number of appearances of every word in the tokenized sentences
  
    Arguments:
      #tokenized_sentences --> list of lists of tokens
    
    Returns:
      #word_counts --> dict that maps word (str) to the frequency (int)
      
    '''
    
    #Define an empty dictionary which will hold the (key, value) pairs that represent (word, frequency) for every word
    word_counts = {}
    
    #Loop over every sentence in the tokenized_sentences list
    for sentence in tokenized_sentences:
        
        #Loop over every token in the sentence
        for token in sentence:
            
            #If the token is not in the dictionary yet, set the count to 1
            if token not in word_counts.keys():
                word_counts[token] = 1
            
            #If the token is already in the dictionary, increment the count by 1
            else:
                word_counts[token] += 1
                
    return word_counts

In [None]:
#Test the code
sentences = "   I love Aswan\nI love Luxor"
tokenized_sentences = get_tokenized_data(sentences)
print(f'The tokenized sentences: {tokenized_sentences}\n')
print(f'The frequency of every word {count_words(tokenized_sentences)}')

In [None]:
#Extra test 
# test your code
tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
count_words(tokenized_sentences)
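
The same counting can be sketched with the standard library's `collections.Counter`. This is an alternative shown for comparison, not the function used in the rest of the lab; the name `count_words_with_counter` is hypothetical.

```python
from collections import Counter

def count_words_with_counter(tokenized_sentences):
    """Count token frequencies across all sentences (hypothetical alternative)."""
    word_counts = Counter()
    for sentence in tokenized_sentences:
        # update() adds one to the count of each token occurrence in the sentence
        word_counts.update(sentence)
    # Return a plain dict so the result has the same type as count_words
    return dict(word_counts)

example = [['sky', 'is', 'blue', '.'],
           ['leaves', 'are', 'green', '.'],
           ['roses', 'are', 'red', '.']]
counts = count_words_with_counter(example)
print(counts['are'], counts['.'], counts['sky'])
```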

## Replace tokens that appear less than N times by   `<unk>` <a anchor = "anchor" id = "3.6" />

If your model is performing auto-complete but encounters a word that is not in its vocabulary, it has no history of words containing that word to help it determine the next word to suggest. As a result, the model won't be able to predict the next word.

- This 'new' word is called an 'unknown word', or <b>out-of-vocabulary (OOV)</b> word.
- <b>Out-of-vocabulary (OOV) words</b> are words that appear in the test data but not in the model's vocabulary.
- The percentage of unknown words in the test set is called the <b>OOV</b> rate.

<strong>To handle unknown words during prediction, use a special token, `<unk>`, to represent all unknown words.</strong>
- Modify the training data so that it has some 'unknown' words to train on.
- Words to convert into "unknown" words are those that do not occur very frequently in the training set.
- Create a list of the most frequent words in the training set, called the <b>closed vocabulary</b>.
- Convert all the other words that are not part of the closed vocabulary to the token `<unk>`.
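
As a small illustration of the OOV rate mentioned above, the sketch below computes the fraction of tokens that fall outside a given vocabulary. The helper name `oov_rate` and the toy data are assumptions made for this example only.

```python
def oov_rate(tokenized_sentences, vocabulary):
    """Return the fraction of tokens that are not in the vocabulary (hypothetical helper)."""
    vocabulary = set(vocabulary)  # a set gives fast membership checks
    total = 0
    unknown = 0
    for sentence in tokenized_sentences:
        for token in sentence:
            total += 1
            if token not in vocabulary:
                unknown += 1
    # Guard against an empty test set
    return unknown / total if total > 0 else 0.0

test_sentences = [["dogs", "run"], ["cats", "sleep"]]
closed_vocab = ["dogs", "sleep"]
print(oov_rate(test_sentences, closed_vocab))  # 2 of the 4 tokens are unknown -> 0.5
```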

In [None]:
def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    '''
    Usage:
      #get_words_with_nplus_frequency --> used for finding the words that appear N times or more
                                          to hold them in the vocabulary list
                                          
    Arguments:
      #tokenized_sentences --> list of lists of tokens
      #count_threshold --> cut-off value to separate the closed vocabulary list from the out-of-vocabulary words
    
    Returns:
      #closed_vocab --> represents the closed vocabulary list that contain the words that appear N times or more
      
    '''
    
    #Initialize an empty list that will hold the words that appear at least N times
    closed_vocab = []
    
    #Get the count of every word in the form of (word, frequency) pairs
    word_counts = count_words(tokenized_sentences)
    
    #Loop over every (word, frequency) pair
    for word,cnt in word_counts.items():
        #if the count of the word is at least count_threshold
        if cnt >= count_threshold:
            
            #append the word to the closed vocabulary list 
            closed_vocab.append(word)
    
    return closed_vocab

In [None]:
#Test the code
sentences = "   I love Aswan\nI love Luxor"
tokenized_sentences = get_tokenized_data(sentences)
print(f'The tokenized sentences: {tokenized_sentences}\n')
print(f'The frequency of every word {count_words(tokenized_sentences)}\n')
closedVocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"The closed vocabulary list: {closedVocab}")

In [None]:
#Extra test 
# test your code
tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
tmp_closed_vocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"Closed vocabulary:")
print(tmp_closed_vocab)

In [None]:
def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    '''
    Usage:
      #replace_oov_words_by_unk --> used to replace words not in the closed vocabulary with the token <unk>.
                                          
    Arguments:
      #tokenized_sentences --> list of lists of tokens
      #vocabulary --> represents the closed vocabulary list that contain the words 
                      that appear N times or more
      #unknown_token --> string represents out-of-vocabulary words
    
    Returns:
      #replaced_tokenized_sentences --> list of lists of tokens with out-of vocabulary words
                                        replaced by <unk>
      
    '''
    
    #convert the list into a set for fast membership checks (this also removes repeated words)
    vocabulary = set(vocabulary)
    
    #initialize a list which will hold lists of tokens with out-of-vocabulary words replaced by <unk>
    replaced_tokenized_sentences = []
    
    #Loop over every sentence in the tokenized_sentences list
    for sentence in tokenized_sentences:
        
        #initialize a list that represent a single list from 
        #the lists in the list replaced_tokenized_sentences
        replaced_sentence = []
        
        #Loop over every token in the sentence
        for token in sentence:
            
            #check if the token is in the closed vocabulary
            if token in vocabulary:
                # If so, append the word to the replaced_sentence
                replaced_sentence.append(token)
            else:
                # otherwise, append the unknown token instead
                replaced_sentence.append(unknown_token)
            
            
        #append the single list, replaced_sentence, to the list of lists, replaced_tokenized_sentences
        replaced_tokenized_sentences.append(replaced_sentence)
        
    return replaced_tokenized_sentences

In [None]:
#Test the code
sentences = "   I love Aswan\nI love Luxor"
tokenized_sentences = get_tokenized_data(sentences)
print(f'The tokenized sentences: {tokenized_sentences}\n')
print(f'The frequency of every word {count_words(tokenized_sentences)}\n')
closedVocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"The closed vocabulary list: {closedVocab}\n")

tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, closedVocab)
print(f"The replaced tokenized sentences: {tmp_replaced_tokenized_sentences}")

In [None]:
tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
vocabulary = ["dogs", "sleep"]
tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, vocabulary)
print(f"Original sentence:")
print(tokenized_sentences)
print(f"tokenized_sentences with less frequent words converted to '<unk>':")
print(tmp_replaced_tokenized_sentences)

## Putting all together <a anchor = "anchor" id = "3.7" />

We can encapsulate what we have done above in a single function that pre-processes the data.

<b> We will do the following:</b>
1. Find tokens that appear at least count_threshold times in the training data.
1. Replace tokens that appear less than count_threshold times by `<unk>` in both the training and the test data.

In [None]:
def preprocess_data(train_data, test_data, count_threshold = 2):
    '''
    Usage:
      # preprocess_data:
         -Find tokens that appear at least count_threshold times in the training data.
         -Replace tokens that appear less than count_threshold times by "<unk>" both for training and test data. 
                                          
    Arguments:
      #train_data, test_data --> each of them is a list of lists of tokens
      #count_threshold --> cut-off value to separate the closed vocabulary list from the out-of-vocabulary words;
                           the default value is 2
    
    
    Returns:
      #train_data_replaced --> training data with low frequent words replaced by "<unk>"
      #test_data_replaced --> test data with low frequent words replaced by "<unk>"
      #closed_vocab --> represents the closed vocabulary list
      
    '''
    
    #Get the closed vocabulary list from the training data
    closed_vocab = get_words_with_nplus_frequency(train_data, count_threshold)
    
    #For the train data, replace less common words with "<unk>"
    train_data_replaced = replace_oov_words_by_unk(train_data,closed_vocab)
    
    #For the test data, replace less common words with "<unk>"
    test_data_replaced = replace_oov_words_by_unk(test_data,closed_vocab)
    
    
    return train_data_replaced, test_data_replaced, closed_vocab

In [None]:
#Test the code
sentences1 = "   I love Aswan\nI love Luxor"
sentences2 = "   I love china\n I love Japan"
tokenized_tr = get_tokenized_data(sentences1)
tokenized_test = get_tokenized_data(sentences2)
temp_train, temp_test, temp_closed_vocab = preprocess_data(tokenized_tr, tokenized_test)

print(f"pre-processed train data: {temp_train}\npre-processed test data: {temp_test}\nThe closed vocabulary list: {temp_closed_vocab}")

In [None]:
#Extra test 
# test your code
tmp_train = [['sky', 'is', 'blue', '.'],
     ['leaves', 'are', 'green']]
tmp_test = [['roses', 'are', 'red', '.']]

tmp_train_repl, tmp_test_repl, tmp_vocab = preprocess_data(tmp_train, 
                                                           tmp_test, 
                                                           count_threshold = 1)

print("tmp_train_repl")
print(tmp_train_repl)
print()
print("tmp_test_repl")
print(tmp_test_repl)
print()
print("tmp_vocab")
print(tmp_vocab)

<br>

## Pre-process the train and test set <a anchor = "anchor" id = "3.8" ></a>

In [None]:
#Pre-process the data
minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, 
                                                                        test_data, 
                                                                        minimum_freq)

In [None]:
#Explore the first processed list (sentence) in the processed training set 
train_data_processed[0]

In [None]:
#Explore the first processed list (sentence) in the processed test set 
test_data_processed[0]

In [None]:
#Explore subset of the closed vocabulary list 
vocabulary[0:10]

In [None]:
#Explore the size of the closed vocabulary list 
len(vocabulary)

# Develop n-gram model <a anchor = "anchor" id = "4" />

<br>


## Compute the counts of n-grams for an arbitrary number  𝑛  <a anchor = "anchor" id = "4.1" />

The term <b>n-gram</b> refers to two things: 
 - the n-gram model
 - a sequence of n words
 
So, when we say the counts of n-grams, we mean the count of every n-word sequence in the given data.
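
To make "every n-word sequence" concrete, the sketch below lists the bigrams of one short sentence with a sliding window. The helper name `list_n_grams` is hypothetical; the padding (n start tokens and one end token) is an assumption chosen to mirror the `<s>` and `<e>` tokens used in this lab.

```python
def list_n_grams(sentence, n, start_token="<s>", end_token="<e>"):
    """Return every window of n consecutive tokens in the padded sentence."""
    padded = [start_token] * n + sentence + [end_token]
    # A sequence of length L contains L - n + 1 windows of n tokens
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

for bigram in list_n_grams(["i", "love", "aswan"], 2):
    print(bigram)
```

Counting how often each of these tuples occurs over all sentences gives the n-gram counts.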

In [None]:
def count_n_grams(data, n, start_token='<s>', end_token = '<e>'):
    
    '''
    Usage:
      #count_n_grams --> used for counting all n-grams in the data  
                                          
    Arguments:
      #data --> represents list of lists of tokens
      #n --> represents the number of words in a sequence (n-gram)
      #start_token --> a token indicating the beginning of a sentence
      #end_token --> a token indicating the end of a sentence
    
    
    Returns:
      #n_grams --> a dictionary whose key is the n-gram (a sequence of n words) represented as a tuple, 
                   and whose value is the count of that n-gram
      
    '''
    
    #Initialize a dictionary that holds the n-grams and their counts 
    n_grams = {}
    
    #Loop over every sentence (list) in data (list of lists)
    for sentence in data:
        
        #prepend n start tokens and append one end token to every sentence (ex: n = 2 for bigrams)
        sentence = [start_token]*n + sentence + [end_token]
        
        #Convert sentence (list) to tuple
        sentence = tuple(sentence)
        
        #compute the number of n-grams in the padded sentence:
        #a sequence of length L contains L - n + 1 windows of n tokens
        m = len(sentence) - n + 1
        
        #Loop over every starting index i, so that each window from i to i+n holds exactly n tokens
        #the index i represents the starting point of every n-gram
        for i in range(m):
            
            #get the n-gram from i to i+n
            n_gram = sentence[i:i+n]
            
            # check if the n-gram is in the dictionary
            if n_gram in n_grams.keys():
            
                #increment the count (the value of the key) for this n-gram
                n_grams[n_gram] += 1
            else:
                #add this n-gram as a key in the dictionary with an initial count of 1
                n_grams[n_gram] = 1

    return n_grams

In [None]:
#Test the code
sentences = [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]

print("Uni-gram")
print(count_n_grams(sentences, 1))
print("\nBi-gram:")
print(count_n_grams(sentences, 2))
print("\nTri-gram")
print(count_n_grams(sentences, 3))

In [None]:
#Extra test 
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram:")
print(count_n_grams(sentences, 1))
print("Bi-gram:")
print(count_n_grams(sentences, 2))

<br>

## Estimate the probability of a word given some history of words <a anchor = "anchor" id = "4.2" /> 


The n-gram model is estimated, using the <b>Markov assumption</b>, as:

\begin{equation}
\hat P(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})}
\end{equation}

<br>

Unfortunately, many n-grams never appear in the training set, so their counts are zero and therefore their probabilities are also zero.

So we need a solution to this problem. One solution is <b>add-k smoothing</b>: we pretend that every n-gram was seen $k$ additional times, so $k$ is added to each n-gram count and $k|V|$ is added to the count of the history, where $|V|$ is the vocabulary size.
The formula of the n-gram model becomes:

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} $$

This means that an n-gram with zero count gets the probability $\frac{k}{C(w_{t-1}\dots w_{t-n}) + k|V|}$, which reduces to $\frac{1}{|V|}$ when the history itself was never seen.
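
A small worked example of the add-k formula follows. The counts are made up for illustration, and the function name `add_k_probability` is hypothetical; it only evaluates the fraction in the formula above.

```python
def add_k_probability(joint_count, history_count, vocabulary_size, k=1.0):
    """Evaluate (C(history, word) + k) / (C(history) + k * |V|)."""
    return (joint_count + k) / (history_count + k * vocabulary_size)

vocabulary_size = 4  # made-up vocabulary size

# History seen 2 times, and followed by the word both times: (2 + 1) / (2 + 4)
seen = add_k_probability(2, 2, vocabulary_size)

# History seen 2 times, but never followed by the word: (0 + 1) / (2 + 4)
unseen_word = add_k_probability(0, 2, vocabulary_size)

# History never seen at all: (0 + 1) / (0 + 4), which equals 1 / |V|
unseen_history = add_k_probability(0, 0, vocabulary_size)

print(f"{seen:.4f} {unseen_word:.4f} {unseen_history:.4f}")
```

The unseen continuation no longer gets probability zero, at the cost of taking some probability mass away from the continuations that were observed.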



### Notes:

* $C(w_{t-1}\dots w_{t-n})$: the count of the previous n-gram (the history); in the function below, it comes from the n-gram counts

* $C(w_{t-1}\dots w_{t-n}, w_t)$: the count of the history followed by the word $w_t$; in the function below, it comes from the n-plus-one-gram counts


In [None]:
def estimate_probability(word, previous_n_gram, 
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    '''
    Usage:
      #estimate_probability --> used to estimate the probability of a next word using 
                                the n-gram counts with add-k smoothing 
                                          
    Arguments:
      #word --> represents the next word whose probability we want to estimate
      #previous_n_gram -->  a sequence of n words (the history), e.g. a list or tuple of tokens
      #n_gram_counts --> dictionary mapping each n-gram (history) to the number of times it appears
      #n_plus1_gram_counts --> dictionary mapping each (n+1)-gram (history plus next word) to its count
      #vocabulary_size --> the number of words in the vocabulary
      #k --> represents the smoothing parameter of the add-k smoothing technique
    
    
    Returns:
      #probability --> represents the estimated probability of the next word 
      
    '''
    
    #Convert the previous n-gram list to tuple to use it as a dictionary key 
    previous_n_gram = tuple(previous_n_gram)
    
    # Set the denominator #
    # If the previous n-gram exists in the dictionary of n-gram counts,
    # Get its count.  Otherwise set the count to zero
    # Use the dictionary that has counts for n-grams
    previous_n_gram_count = n_gram_counts[previous_n_gram] if previous_n_gram in n_gram_counts  else 0
    
    #Compute the denominator using add-k smoothing technique
    denominator = previous_n_gram_count + k * vocabulary_size
    
    # Set the numerator #
    #Define the n-plus-one-gram by appending the next word in the previous n-gram
    n_plus1_gram = previous_n_gram + (word, )
    
    # Set the count to the count in the dictionary,
    # otherwise 0 if not in the dictionary
    # use the dictionary that has counts for the n-gram plus current word
    n_plus1_gram_count = n_plus1_gram_counts[n_plus1_gram] if n_plus1_gram in n_plus1_gram_counts  else 0
    
    #Compute the numerator
    numerator = n_plus1_gram_count + k
    
    
    # Compute the probability of the next word #
    probability = numerator / denominator
    
    return probability

In [None]:
#Test the code
sentences = [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]

unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1) #previous n-gram counts 
bigram_counts = count_n_grams(sentences, 2)  #n-plus-one-gram counts

#Estimate the probability that the next word is "cat" given the previous word "a", using the bigram model
tmp_prob = estimate_probability("cat", "a", unigram_counts, bigram_counts, len(unique_words), k=1)

print(f"The 1/|V| value: {1/len(unique_words)}")
print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}")

In [None]:
#Extra test
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
tmp_prob = estimate_probability("cat", "a", unigram_counts, bigram_counts, len(unique_words), k=1)

print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}")

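<b>As you can see</b> in the first test, neither "a" nor "cat" appears in the training sentences, so both counts are zero and the estimate reduces to $\frac{k}{k|V|} = \frac{1}{|V|}$. In the extra test, both words appear, so the estimate is $\frac{2 + 1}{2 + 7} \approx 0.3333$.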

## Estimate probabilities for all words <a anchor = "anchor" id = "4.3" />

<br>

In this section, instead of estimating the probability of a single next word given a history of words, we estimate the probability of every candidate next word. The candidates are all the tokens in the vocabulary built from the training set, plus `<e>` and `<unk>`, so for each of them we compute the probability that it is the next word.

In [None]:
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    
    '''
    Usage:
      #estimate_probabilities --> used to estimate the probability of every word in the vocabulary 
                                  being the next word, using the n-gram counts with add-k smoothing 
                                          
    Arguments:
      #previous_n_gram -->  a sequence of n words (the history)
      #n_gram_counts --> dictionary mapping each n-gram (history) to its count
      #n_plus1_gram_counts --> dictionary mapping each (n+1)-gram (history plus next word) to its count
      #vocabulary --> list of the words in the vocabulary
      #k --> represents the smoothing parameter of the add-k smoothing technique
    
    
    Returns:
      #probabilities --> a dictionary whose every key is a token from the training set and whose every 
                         value is the probability for a key (token) to be the next word
                         
                         
      
    '''
    
    #Convert the list to tuple to use it as a dictionary key since 
    #the builtin list type should not be used as a dictionary key
    previous_n_gram = tuple(previous_n_gram)
    
    ##add <e> and <unk> to the vocabulary
    #we don't need to add <s> as we compute the chance for every token to be the next word 
    #so, there's no chance to be the first word
    vocabulary = vocabulary + ["<e>","<unk>"]
    
    #compute the vocabulary size 
    vocabulary_size = len(vocabulary)
    
    
    #Initialize the dictionary probabilities which holds the probabilities for every token 
    probabilities = {}
    
    #Loop for every token(word) in the vocabulary 
    for word in vocabulary:
        
        #Compute the probability of that word being the next word
        probability = estimate_probability(word, previous_n_gram, 
                                           n_gram_counts, n_plus1_gram_counts, 
                                           vocabulary_size, k=k)
        
        #Add this value(probability) to the key(word) in the dictionary (probabilities)
        probabilities[word] = probability
    
    return probabilities

In [None]:
# test your code
sentences =  [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
print("Bi-gram:\n",bigram_counts)
estimate_probabilities("i", unigram_counts, bigram_counts, unique_words, k=1)

In [None]:
#Extra test 
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
estimate_probabilities("a", unigram_counts, bigram_counts, unique_words, k=1)

In [None]:
#Extra test 
trigram_counts = count_n_grams(sentences, 3)
estimate_probabilities(["<s>", "<s>"], bigram_counts, trigram_counts, unique_words, k=1)
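
A dictionary like the one returned by `estimate_probabilities` can be ranked to see which words are the most likely continuations. The sketch below assumes only that the input maps each word to a probability; the helper name `top_candidates` and the probability values are made up for illustration.

```python
def top_candidates(probabilities, how_many=3):
    """Return the how_many (word, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:how_many]

# Made-up probabilities, not the output of any cell in this notebook
example_probabilities = {"dog": 0.3, "cat": 0.5, "<e>": 0.2}
print(top_candidates(example_probabilities, how_many=2))
```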

## Visualize n-gram counts in a matrix form <a anchor = "anchor" id = "4.4" />

In [None]:
def make_count_matrix(n_plus1_gram_counts, vocabulary):
        
    '''
    Usage:
      #make_count_matrix --> to create the n-gram counts matrix
                                          
    Arguments:
      #n_plus1_gram_counts --> the number of times the history of the words,
                               and the next word appear together
                               
      #vocabulary --> list of the unique words in the training set 
     
    
    Returns:
      #count_matrix --> a matrix of n-gram counts 
                         
      
    '''
    ##add <e> and <unk> to the vocabulary
    #we don't need to add <s> as we compute the chance for every token to be the next word 
    #so, there's no chance to be the first word
    vocabulary = vocabulary + ["<e>", "<unk>"]
    
    # obtain unique n-grams
    n_grams = [] #initialize an empty list to hold all the unique n-grams
    
    #Loop over every n-gram 
    for n_plus1_gram in n_plus1_gram_counts.keys():
        
        #Get all tokens except the last one which represents the next word 
        #Then, append them in n_grams
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    
    #Remove the repeated n-grams to hold just the unique ones 
    n_grams = list(set(n_grams))
    
    # mapping from n-gram to row
    row_index = {n_gram:i for i, n_gram in enumerate(n_grams)} # {key: value for vars in iterable}

    # mapping from next word to column
    col_index = {word:j for j, word in enumerate(vocabulary)}

    
    #Compute the number of columns and the number of rows
    nrow = len(n_grams)
    ncol = len(vocabulary)
    
    #Pre-allocating  matrix of zeros
    count_matrix = np.zeros((nrow, ncol))
    
    #Use items() to iterate over each key and its value together
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1] #represents the history of the words
        word = n_plus1_gram[-1]     #represents the next word
        
        #If the next word is not in the vocabulary, skip this iteration
        #otherwise, get the indices i,j of the matrix cell to assign the count to
        if word not in vocabulary:
            continue
        i = row_index[n_gram] #Get the index of a specific history 
        j = col_index[word]   #Get the index of a specific next word
        
        #Assign count to a specific cell in the matrix
        #Count represents the number of times a specific history and a certain next word appear together
        count_matrix[i, j] = count
    

    #Get the n-gram counts matrix
    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    
    return count_matrix

In [None]:
sentences =  [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)

#Show the bi-gram counts matrix 
display(make_count_matrix(bigram_counts, unique_words))

In [None]:
sentences = [['i', 'like', 'a', 'cat'],
                 ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)

print('bigram counts')
display(make_count_matrix(bigram_counts, unique_words))

In [None]:
#show the tri-gram counts matrix
sentences =  [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]
unique_words = list(set(sentences[0] + sentences[1]))
trigram_counts = count_n_grams(sentences,3)

#Show the tri-gram counts matrix 
display(make_count_matrix(trigram_counts, unique_words))

## Visualize n-gram probabilities in a matrix form  <a anchor = "anchor" id = "4.5" />
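
Each row of the probability matrix is the matching row of counts with $k$ added to every cell, then divided by the row total. A minimal sketch in plain Python on a made-up row of counts is shown below. The helper name `smooth_row` is hypothetical and not part of the lab code:

```python
def smooth_row(counts, k=1.0):
    """Apply add-k smoothing to one row of counts and normalise it."""
    # Add k to every cell so that unseen next words get a non-zero count.
    smoothed = [c + k for c in counts]
    # The row total plays the role of the (smoothed) history count.
    total = sum(smoothed)
    return [c / total for c in smoothed]

# Invented counts for one history over a four-word vocabulary.
row = [2, 0, 1, 0]
probs = smooth_row(row, k=1.0)
print(probs)       # smoothed counts 3, 1, 2, 1 divided by their total 7
print(sum(probs))  # each row sums to 1 (up to floating-point error)
```

A larger `k` pulls the row closer to a uniform distribution, while a smaller `k` stays closer to the raw relative frequencies.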

In [None]:
def make_probs_matrix(n_plus1_gram_counts, vocabulary, k = 1):
    
    '''
    Usage:
      #make_probs_matrix --> to create the n-gram probabilities matrix
                                          
    Arguments:
      #n_plus1_gram_counts --> the number of times the history of the words,
                               and the next word appear together
                               
      #vocabulary --> list of the unique words in the training set 
      #k --> the smoothing parameter
    
    Returns:
      #probs_matrix --> a matrix of n-gram probabilities 
                         
    '''
    
    #Compute the n-gram counts matrix
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    
    #Add k to every count to apply add-k smoothing before computing the probabilities
    count_matrix += k
    
    #To compute the probabilities, we need the (smoothed) count of every history
    #Summing each row gives the total count for the history in that row
    #Then we divide every row element-wise by that total
    probs_matrix = count_matrix.div(count_matrix.sum(axis = 1), axis = 0)
    
    return probs_matrix

In [None]:
sentences =  [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)

#Show the bi-gram probabilities matrix 
display(make_probs_matrix(bigram_counts, unique_words))

In [None]:
sentences = [['i', 'like', 'a', 'cat'],
                 ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print("bigram probabilities")
display(make_probs_matrix(bigram_counts, unique_words, k=1))

In [None]:
#show the tri-gram probabilities matrix 
sentences =  [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]
unique_words = list(set(sentences[0] + sentences[1]))
trigram_counts = count_n_grams(sentences,3)

#Show the tri-gram probabilities matrix 
display(make_probs_matrix(trigram_counts, unique_words))

<br>

# Perplexity <a anchor = "anchor" id = "5"></a>

<br>

<b>Perplexity</b> is a metric that tells us how well a probability model predicts a sample. A low perplexity indicates that the probability distribution is good at predicting the sample. In our case, we use perplexity to evaluate our n-gram model.

<b>Perplexity</b> is defined as: 
$$ PP(W) =\sqrt[N]{ \prod_{t=n+1}^N \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})} } \tag{4}$$

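- where $N$ is the length of the sentence (in the code below, it is counted after the start tokens `<s>` and the end token `<e>` are added).
- $n$ is the number of previous words the model conditions on (e.g. 1 for a bigram model).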
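
Multiplying many small probabilities can underflow for long sentences, so a common alternative is to sum log-probabilities and exponentiate at the end. The sketch below assumes the per-word probabilities have already been computed. The helper `perplexity_from_probs` and the numbers in `toy_probs` are hypothetical and not part of the lab code:

```python
import math

def perplexity_from_probs(probs, N):
    """Perplexity from per-word probabilities, computed in log space.

    probs: probabilities P(w_t | history), one per predicted word
    N: the sentence length used as the root in the perplexity formula
    """
    # Sum of log-probabilities instead of a product of probabilities.
    log_sum = sum(math.log(p) for p in probs)
    # exp(-log_sum / N) equals the N-th root of the product of 1/p.
    return math.exp(-log_sum / N)

# Made-up probabilities for a four-word sentence.
toy_probs = [0.5, 0.25, 0.5, 0.25]

# Direct computation, following the formula literally.
product = 1.0
for p in toy_probs:
    product *= 1 / p
direct = product ** (1 / len(toy_probs))

log_space = perplexity_from_probs(toy_probs, len(toy_probs))
print(direct, log_space)
```

Both values agree up to floating-point error (about 2.83 here). For the short sentences used in this lab the direct product is fine, and the log-space form mainly matters for long sequences.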

In [None]:
def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    '''
    Usage:
      #calculate_perplexity --> used to compute perplexity
                                          
    Arguments:
      #sentence --> list of tokens
      #n_gram_counts --> the number of times each history (n-gram) appears
      #n_plus1_gram_counts --> the number of times each history and next word appear together
      #vocabulary_size --> the number of words in the vocabulary
      #k --> the smoothing parameter for the add-k smoothing technique
    
    
    Returns:
      #perplexity --> a score that tells us how well our n-gram model predicts the sentence
                      (lower is better)
    '''
    
    #Get the length of the history (the number of previous words)
    n = len(list(n_gram_counts.keys())[0])
    
    #Prepend <s> and append <e> to the sentence 
    sentence = ["<s>"] * n + sentence + ["<e>"]
    
    #Convert the list to tuple to use it as a dictionary key since 
    #the builtin list type should not be used as a dictionary key
    sentence = tuple(sentence)
    
    #Get the length of the sentence after adding <s> and <e>
    N  = len(sentence)
    
    
    #Initialize a variable to accumulate the product of the inverse probabilities
    product_pi = 1.0
    
    #Loop over every position after the start tokens
    for t in range(n,N):
        
        #Get the previous n words (the history)
        n_gram = sentence[t-n: t]
        
        #Get the word
        word = sentence[t]
        
        #Estimate the probability of the word given its history
        prob = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=k)
        
        #Accumulate the product
        product_pi *= 1/prob
        
        
    #Compute the perplexity by taking the N-th root of the product
    perplexity = product_pi ** (1/float(N))
    
    
    return perplexity

In [None]:
# Test your code
sentences =  [['i', 'love', 'aswan'], ['i', 'love', 'luxor']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
print("\nUni-gram:\n",unigram_counts)
print("Bi-gram:\n",bigram_counts)

#Compute Perplexity 
PP = calculate_perplexity(sentences[0], unigram_counts, bigram_counts, len(unique_words), k=1.0)

print("\nThe perplexity of this sentence according to our model : ",PP)

In [None]:
#Extra test
sentences = [['i', 'like', 'a', 'cat'],
                 ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)


perplexity_train1 = calculate_perplexity(sentences[0],
                                         unigram_counts, bigram_counts,
                                         len(unique_words), k=1.0)
print(f"Perplexity for first train sample: {perplexity_train1:.4f}")

test_sentence = ['i', 'like', 'a', 'dog']
perplexity_test = calculate_perplexity(test_sentence,
                                       unigram_counts, bigram_counts,
                                       len(unique_words), k=1.0)
print(f"Perplexity for test sample: {perplexity_test:.4f}")

<br>

# Build an auto-complete system <a anchor = "anchor" id = "6" ></a>

<br>

## Suggest A Word <a anchor = "anchor" id = "6.1" > </a>

We suggest a word by computing the probabilities of all possible next words and picking the most likely one.
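
The selection step itself is an argmax over a dictionary of probabilities, optionally restricted to words with a given prefix. The sketch below illustrates it on made-up numbers. The helper `best_word` and the dictionary `toy_probs` are hypothetical and not part of the lab code:

```python
def best_word(probabilities, start_with=None):
    """Return the most probable word, optionally restricted to a prefix."""
    # Keep only the words that match the prefix (all words if no prefix is given).
    candidates = {
        word: prob
        for word, prob in probabilities.items()
        if start_with is None or word.startswith(start_with)
    }
    # No candidate matches the prefix: nothing to suggest.
    if not candidates:
        return None, 0.0
    # Pick the word with the highest probability.
    word = max(candidates, key=candidates.get)
    return word, candidates[word]

# Invented next-word probabilities for some history.
toy_probs = {"cat": 0.4, "car": 0.1, "dog": 0.3, "<e>": 0.2}

print(best_word(toy_probs))                  # ('cat', 0.4)
print(best_word(toy_probs, start_with="d"))  # ('dog', 0.3)
print(best_word(toy_probs, start_with="z"))  # (None, 0.0)
```

When several words share the highest probability, `max` returns the first one in the dictionary's iteration order, so ties are broken by insertion order rather than alphabetically.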

In [None]:
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    '''
    Usage:
      #suggest_a_word --> used to suggest the next word for a sequence of previous tokens
                                          
    Arguments:
      #previous_tokens --> list of tokens for which we want to predict the next word
      #n_gram_counts --> the number of times each history (n-gram) appears
      #n_plus1_gram_counts --> the number of times each history and next word appear together
      #vocabulary --> list of the unique words in the training set
      #k --> the smoothing parameter for the add-k smoothing technique
      #start_with --> if not None, a string holding the first few characters the suggested word must start with
    
    
    Returns:
      #suggestion --> string representing the suggested word
      #max_prob --> the probability of that word given the previous tokens
      
    '''
    
    #Get the length of the previous words corresponds to a certain n-gram (ex: Bi-gram --> n = 1)
    n = len(list(n_gram_counts.keys())[0])
    
    #Get the most recent n words to be the previous words of the n-gram sequence
    previous_n_gram = previous_tokens[-n:]
    
    #compute probabilities for all possible next words
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)
    
    #Initialize suggestion, which will hold the suggested word
    suggestion = None
    
    #Initialize max_prob, which will hold the highest probability
    #found so far among the candidate next words
    max_prob = 0
    
    #Loop over each (word, probability) pair in the dictionary
    for word, prob in probabilities.items():
        
        #If the next word must start with specific characters
        if start_with is not None:
            
            #check whether the word starts with the required characters
            if not word.startswith(start_with):
                
                #if it does not, skip this word and move to the next one
                continue
                
        # Check if this word's probability
        # is greater than the current maximum probability
        if prob > max_prob:
            
            #if so, save this word as the best suggestion so far
            suggestion = word
            
            #also, save its probability
            max_prob = prob
            
            
    return suggestion, max_prob

In [None]:
#Test the code 
sentences =  [['i', 'love', 'aswan'], ['i', 'like', 'luxor'],['i','love','Alex']]
unique_words = list(set(sentences[0] + sentences[1]  + sentences[2]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
previous_tokens = ["i","like"]
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',\n\tand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}")

In [None]:
# Extra test
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

previous_tokens = ["i", "like"]
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',\n\tand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}")

print()
# Test the code with the start_with argument set
tmp_starts_with = 'c'
tmp_suggest2 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)
print(f"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`\n\tand the suggested word is `{tmp_suggest2[0]}` with a probability of {tmp_suggest2[1]:.4f}")

<br>

## Suggest Multiple Words  <a anchor = "anchor" id = "6.2" ></a>

In [None]:
def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    '''
    Usage:
      #get_suggestions --> used to get multiple suggestions for the next word after the previous tokens
                                          
    Arguments:
      #previous_tokens --> list of tokens for which we want to predict the next word
      #n_gram_counts_list --> list of the counts for the different n-gram orders 
                              Ex: [Uni-gram, Bi-gram,....]
      #vocabulary --> list of the unique words in the training set
      #k --> the smoothing parameter for the add-k smoothing technique
      #start_with --> if not None, a string holding the first few characters the suggested words must start with
    
    
    Returns:
      #suggestions --> list of (word, probability) tuples, one per n-gram model
    '''
    
    #Get the number of n-gram count dictionaries provided
    model_counts = len(n_gram_counts_list)
    
    #Initialize a list to hold the suggestions from the different n-gram models
    suggestions = []
    
    
    #Loop for different n_gram_counts , and n_plus1_gram_counts
    for i in range(model_counts - 1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i + 1]
        
        #Suggest a word based on the given n-gram model 
        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        
        #Append this suggestion to the suggestions list
        suggestions.append(suggestion)
        
    
    return suggestions

In [None]:
# Test your code
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
trigram_counts = count_n_grams(sentences, 3)
quadgram_counts = count_n_grams(sentences, 4)
qintgram_counts = count_n_grams(sentences, 5)

n_gram_counts_list = [unigram_counts, bigram_counts, trigram_counts, quadgram_counts, qintgram_counts]
previous_tokens = ["i", "like"]
tmp_suggest3 = get_suggestions(previous_tokens, n_gram_counts_list, unique_words, k=1.0)

print("The previous words are 'i like', the suggestions are:")
display(tmp_suggest3)

<br>

## Suggest Multiple Words Using n-grams of Varying Length <a anchor = "anchor" id = "6.3"></a>
In this section, we suggest multiple words using n-gram models of varying lengths (unigrams, bigrams, trigrams, 4-grams and 5-grams).
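
Each model conditions on a different amount of context: a model with history length $n$ only looks at the last $n$ tokens typed so far. The sketch below shows which slice of the input each history length would see. The helper `histories_by_order` and the list `toy_tokens` are hypothetical and not part of the lab code:

```python
def histories_by_order(previous_tokens, max_n):
    """Map each history length n to the last n tokens of the input."""
    return {n: tuple(previous_tokens[-n:]) for n in range(1, max_n + 1)}

# Three typed tokens, but history lengths up to four.
toy_tokens = ["hey", "how", "are"]
histories = histories_by_order(toy_tokens, 4)

for n, history in histories.items():
    print(n, history)
```

When fewer than $n$ tokens are available, the slice simply returns all of them, so the history is shorter than that model expects. The higher-order suggestions for short inputs are therefore less informative.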

In [None]:
n_gram_counts_list = []
for n in range(1, 6):
    print("Computing n-gram counts with n =", n, "...")
    n_model_counts = count_n_grams(train_data_processed, n)
    n_gram_counts_list.append(n_model_counts)

In [None]:
previous_tokens = ["i", "am", "to"]
tmp_suggest4 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest4)

In [None]:
previous_tokens = ["i", "want", "to", "go"]
tmp_suggest5 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest5)

In [None]:
previous_tokens = ["hey", "how", "are"]
tmp_suggest6 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest6)

In [None]:
previous_tokens = ["hey", "how", "are", "you"]
tmp_suggest7 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest7)

In [None]:
previous_tokens = ["hey", "how", "are", "you"]
tmp_suggest8 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with="d")

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest8)

# Congratulations!