<a href="https://colab.research.google.com/github/Nourhan-Adell/Natural-language-processing/blob/main/Autocomplete_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Language Models: Auto-Complete**
**Steps:**

1.    Load and preprocess data
  *   Load and tokenize data.
  *   Split the sentences into train and test sets.
  *    Replace words with a low frequency by an unknown marker <unk>.

2.    Develop N-gram based language models
  *   Compute the count of n-grams from a given data set.
  *   Estimate the conditional probability of a next word with k-smoothing.

3.    Evaluate the N-gram models by computing the perplexity score.
4.    Use your own model to suggest an upcoming word given your sentence.


In [1]:
import math
import random
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import w3_unittest
nltk.data.path.append('.')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Part 1: Load and Preprocess Data**

## **Part 1.1: Load the data**

In [2]:
with open('/content/en_US.twitter.txt','r') as file:
  data = file.read()

print('Data type: ', type(data))
print("Number of letters: ", len(data))
print("First 300 letters of the data")
display(data[:300])
print("-------")

print('Last 300 letters of the data')
display(data[-300: ])

Data type:  <class 'str'>
Number of letters:  3335477
First 300 letters of the data


"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A "

-------
Last 300 letters of the data


"ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\n#GutsiestMovesYouCanMake Giving a cat a bath.\nCoffee after 5 was a TERRIBLE idea.\n"

## **Part 1.2 Pre-process the data**
Preprocess this data with the following steps:

1.    Split data into sentences using "\n" as the delimiter.
2.    Split each sentence into tokens. Note that in this assignment we use "token" and "words" interchangeably.
3.    Assign sentences into train or test sets.
4.    Find tokens that appear at least N times in the training data.
5.    Replace tokens that appear less than N times by `<unk>`

In [3]:
#1. split the data into sentences: 
def split_to_sentences(data):
  sentences = data.split('\n')
  return sentences

In [4]:
# test your code
x = """
I have a pen.\nI have an apple. \nAh\nApple pen.\n
"""
print(x)

split_to_sentences(x)


I have a pen.
I have an apple. 
Ah
Apple pen.




['', 'I have a pen.', 'I have an apple. ', 'Ah', 'Apple pen.', '', '']

In [5]:
# 2. Split each sentence into tokens

def tokenize_sentences(sentences):
  tokenized_sentences = []
  for sentences in sentences:
    # Convert to lowercase letters
    sentence = sentences.lower()

    # Convert into a list of words
    tokenized = nltk.word_tokenize(sentence)
    tokenized_sentences.append(tokenized)
  
  return tokenized_sentences

In [6]:
# test your code
sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]
tokenize_sentences(sentences)

[['sky', 'is', 'blue', '.'],
 ['leaves', 'are', 'green', '.'],
 ['roses', 'are', 'red', '.']]

In [7]:
# 3. Assign sentences into train or test sets.
def get_tokenized_data(data):
  sentences = split_to_sentences(data)
  tokenized_sentences = tokenize_sentences(sentences)
  return tokenized_sentences

In [8]:
# test your function
x = "Sky is blue.\nLeaves are green\nRoses are red."
get_tokenized_data(x)

[['sky', 'is', 'blue', '.'],
 ['leaves', 'are', 'green'],
 ['roses', 'are', 'red', '.']]

In [9]:
# Split into train and test sets
tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0 : train_size]
test_data = tokenized_data[train_size: ]

In [10]:
print("{} data are split into {} train and {} test set".format(
    len(tokenized_data), len(train_data), len(test_data)))

print("First training sample:")
print(train_data[0])
      
print("First test sample")
print(test_data[0])

47962 data are split into 38369 train and 9593 test set
First training sample:
['i', '❤', 'and', 'her', 'boo', 'relationship', 'they', 'so', 'thuged', 'out', '!', '!']
First test sample
['better', 'than', '``', 'misplaced', 'quotation', 'marks', "''", 'rt', ':', 'lately', 'i', 'find', 'that', 'i', 'am', 'adding', 'too', 'many', 'exclamation', 'points', 'where', 'they', 'do', "n't", 'belong', '!']


In [11]:
# 4. Find tokens that appear at least N times in the training data
def count_words(tokenize_sentences):
  word_counts = {}
  for sentence in tokenize_sentences:
    for token in sentence:
      if token not in word_counts:
        word_counts[token] = 1
      else:
        word_counts[token] += 1
  return word_counts

In [12]:
# test your code
tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
count_words(tokenized_sentences)

{'.': 3,
 'are': 2,
 'blue': 1,
 'green': 1,
 'is': 1,
 'leaves': 1,
 'red': 1,
 'roses': 1,
 'sky': 1}

### **Handling 'Out of Vocabulary' words**

In [39]:
# 4. Replace tokens that appear less than N times by <unk>
def get_words_with_nplus_frequency(tokenize_sentences, count_threshold):
  #Count_threshold = frequency of words 

 
  word_counts = count_words(tokenized_sentences)
  
  # the words that have frequency >= counyt_theshold
  closed_vocab = [word for word, count in word_counts.items() if count >= count_threshold]
  return closed_vocab

In [40]:
# test your code
tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
tmp_closed_vocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"Closed vocabulary:")
print(tmp_closed_vocab)

Closed vocabulary:
['.', 'are']


In [41]:
# Replace all other words by '<unk>'
def replace_oov_words_by_unk(tokenized_sentences, vocabulary,unknown_token=("<unk>")):
  vocabulary = set(vocabulary)
  replaced_tokenized_words = []

  for sentence in tokenized_sentences:
    replaced_sentence = []

    for token in sentence:
      if token in vocabulary:
        replaced_sentence.append(token)
      else:
        replaced_sentence.append(unknown_token)

    replaced_tokenized_words.append(replaced_sentence)

  return replaced_tokenized_words

In [42]:
tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
vocabulary = ["dogs", "sleep"]
tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, vocabulary)
print(f"Original sentence:")
print(tokenized_sentences)
print(f"tokenized_sentences with less frequent words converted to '<unk>':")
print(tmp_replaced_tokenized_sentences)

Original sentence:
[['dogs', 'run'], ['cats', 'sleep']]
tokenized_sentences with less frequent words converted to '<unk>':
[['dogs', '<unk>'], ['<unk>', 'sleep']]


 Therefore, 

In [43]:
def preprocess_data(train_data1, test_data, count_threshold, unknown_token="<unk>"):
  vocabulary = get_words_with_nplus_frequency(train_data1, count_threshold)
    
    # For the train data, replace less common words with "<unk>"
  train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary, unknown_token)
    
    # For the test data, replace less common words with "<unk>"
  test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary, unknown_token)
    
  return train_data_replaced, test_data_replaced, vocabulary

In [45]:
minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, 
                                                                        test_data, 
                                                                        minimum_freq)

In [46]:
print("First preprocessed training sample:")
print(train_data_processed[0])
print()
print("First preprocessed test sample:")
print(test_data_processed[0])
print()
print("First 10 vocabulary:")
print(vocabulary[0:10])
print()
print("Size of vocabulary:", len(vocabulary))

First preprocessed training sample:
['<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>']

First preprocessed test sample:
['<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>']

First 10 vocabulary:
[]

Size of vocabulary: 0


# **Part 2: Develop n-gram based language models**


In [65]:
def count_n_grams(data, n, start_token="<s>", end_token="<\s"):
  n_grams = {}
  for sentence in data:
    sentence = [start_token] * n + sentence + [end_token]
    sentence = tuple(sentence)
    M = len(sentence) -n +1
    for i in range(M):
      n_gram = sentence[i : i+n]
      if n_gram in n_grams.keys():
        n_grams[n_gram] += 1
      else:
        n_grams[n_gram] = 1
    
  return n_grams

In [66]:
# test your code
# CODE REVIEW COMMENT: Outcome does not match expected outcome
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram:")
print(count_n_grams(sentences, 1))
print("Bi-gram:")
print(count_n_grams(sentences, 2))

Uni-gram:
{('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<\\s',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
Bi-gram:
{('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<\\s'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}


### **Estimate the probabilities of a next word using the n-gram counts with k-smoothing**

In [67]:
### UNQ_C9 GRADED FUNCTION: estimate_probability ###
def estimate_probability(word, previous_n_gram, 
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """
    Estimate the probabilities of a next word using the n-gram counts with k-smoothing
    
    Args:
        word: next word
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of words in the vocabulary
        k: positive constant, smoothing parameter
    
    Returns:
        A probability
    """
    # convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)
        
    # Set the denominator
    # If the previous n-gram exists in the dictionary of n-gram counts,
    # Get its count.  Otherwise set the count to zero
    # Use the dictionary that has counts for n-grams
    previous_n_gram_count = n_gram_counts[previous_n_gram] if previous_n_gram in n_gram_counts else 0

            
    # Calculate the denominator using the count of the previous n gram
    # and apply k-smoothing
    denominator = previous_n_gram_count + (k * vocabulary_size)

    # Define n plus 1 gram as the previous n-gram plus the current word as a tuple
    n_plus1_gram = previous_n_gram + (word,)
  
    # Set the count to the count in the dictionary,
    # otherwise 0 if not in the dictionary
    # use the dictionary that has counts for the n-gram plus current word    
    n_plus1_gram_count = n_plus1_gram_counts[n_plus1_gram] if n_plus1_gram in n_plus1_gram_counts else 0
    
    # Define the numerator use the count of the n-gram plus current word,
    # and apply smoothing
    numerator = n_plus1_gram_count + k
        
    # Calculate the probability as the numerator divided by denominator
    probability = numerator / denominator
    
    return probability

In [68]:
# test your code
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
tmp_prob = estimate_probability("cat", "a", unigram_counts, bigram_counts, len(unique_words), k=1)

print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}")

The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333


## **Part 3: Perplexity**

In [69]:
# UNQ_C10 GRADED FUNCTION: calculate_perplexity
def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, start_token='<s>', end_token = '<e>', k=1.0):

    # length of previous words
    n = len(list(n_gram_counts.keys())[0]) 
    
    # prepend <s> and append <e>
    sentence = [start_token] * n + sentence + [end_token]
    
    # Cast the sentence from a list to a tuple
    sentence = tuple(sentence)
    
    # length of sentence (after adding <s> and <e> tokens)
    N = len(sentence)
    
    # The variable p will hold the product
    # that is calculated inside the n-root
    # Update this in the code below
    product_pi = 1.0
        
    for t in range(n, N):

        # get the n-gram preceding the word at position t
        n_gram = sentence[t-n:t]
        
        # get the word at position t
        word = sentence[t]
        
        # Estimate the probability of the word given the n-gram
        # using the n-gram counts, n-plus1-gram counts,
        # vocabulary size, and smoothing constant
        probability = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0)
        
        # Update the product of the probabilities
        # This 'product_pi' is a cumulative product 
        # of the (1/P) factors that are calculated in the loop
        product_pi *=  1 / probability
        ### END CODE HERE ###

    # Take the Nth root of the product
    perplexity = (product_pi)**(1/N)
    
    return perplexity

In [70]:
# test your code

sentences = [['i', 'like', 'a', 'cat'],
                 ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)


perplexity_train = calculate_perplexity(sentences[0],
                                         unigram_counts, bigram_counts,
                                         len(unique_words), k=1.0)
print(f"Perplexity for first train sample: {perplexity_train:.4f}")

test_sentence = ['i', 'like', 'a', 'dog']
perplexity_test = calculate_perplexity(test_sentence,
                                       unigram_counts, bigram_counts,
                                       len(unique_words), k=1.0)
print(f"Perplexity for test sample: {perplexity_test:.4f}")

Perplexity for first train sample: 3.3674
Perplexity for test sample: 3.9654
