
# **N-grams Corpus preprocessing**
Common preprocessing steps for language models include:

*    lowercasing the text
*    removing special characters
*    splitting the text into a list of sentences
*    splitting each sentence into a list of words
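
The four steps can be chained into a single helper. The sketch below is only an illustration: the name `preprocess` and the choice to split sentences on `.`, `?` and `!` with a regular expression are assumptions made here, not part of the lab. Real text (abbreviations, decimals) would need a proper sentence tokenizer.

```python
import re

def preprocess(text):
    """Lowercase, strip special characters, split into sentences, then into words."""
    text = text.lower()
    # keep letters, digits, sentence-ending punctuation and spaces
    text = re.sub(r"[^a-z0-9.?! ]+", "", text)
    # split on sentence-ending punctuation and drop empty pieces
    sentences = [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]
    # split each sentence on whitespace
    return [sentence.split() for sentence in sentences]

tokens = preprocess("Learning% makes 'me' happy. I am happy be-cause I am learning! :)")
print(tokens)
```

Each inner list is one sentence, ready to be turned into n-grams.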


In [None]:
import nltk
import re
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## **1. Lowercasing the text:**

In [None]:
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = corpus.lower()
print(corpus)

learning% makes 'me' happy. i am happy be-cause i am learning! :)


## **2. Remove special characters:**

In [None]:
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)
print(corpus)

Learning makes me happy. I am happy because I am learning! 


Note that this step also removes the happy face made of punctuation, `:)`. For sentiment analysis such emoticons carry important information, but we will not consider them here.


## **3. Text splitting:**


In [None]:
input_date="Sat May  9 07:33:35 CEST 2020"

# Get the date parts in array
date_parts = input_date.split(" ")
print("Date parts: ", date_parts)

# Get the time parts in array
time_parts = date_parts[4].split(":")
print("Time parts: ", time_parts)

print("Corpus split: ", corpus.split(" "))

Date parts:  ['Sat', 'May', '', '9', '07:33:35', 'CEST', '2020']
Time parts:  ['07', '33', '35']
Corpus split:  ['Learning', 'makes', 'me', 'happy.', 'I', 'am', 'happy', 'because', 'I', 'am', 'learning!', '']
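
The empty strings in the output come from splitting on a single space: two consecutive spaces (or a trailing space) produce an empty token. As a small sketch of the alternative, calling `split()` with no separator splits on runs of whitespace and drops the empty strings:

```python
input_date = "Sat May  9 07:33:35 CEST 2020"

# explicit separator: the double space yields an empty string
with_separator = input_date.split(" ")
# no separator: runs of whitespace are treated as one delimiter
without_separator = input_date.split()

print(with_separator)
print(without_separator)
```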


## **4. Sentence tokenizing:**
A sentence can be split into words either with the `split` method (as in the previous cell) or with the NLTK library.

In [None]:
sentence = 'i am happy because i am learning.'
tokenized_sentence = nltk.word_tokenize(sentence)
print(f'{sentence} -> {tokenized_sentence}')

i am happy because i am learning. -> ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
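
Unlike `split`, the tokenizer separates the final period from the last word. When NLTK is not available, a regular expression can roughly imitate this for simple sentences. The pattern below is an assumption chosen for illustration; it does not reproduce NLTK's handling of contractions, abbreviations or other special cases.

```python
import re

sentence = 'i am happy because i am learning.'
# a token is either a run of word characters or a single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
```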


In [None]:
# find length of each word in the tokenized sentence
word_lengths = [(word, len(word)) for word in tokenized_sentence]     #Calculate the number of characters for each word
print(f' Lengths of the words: \n{word_lengths}')

 Lengths of the words: 
[('i', 1), ('am', 2), ('happy', 5), ('because', 7), ('i', 1), ('am', 2), ('learning', 8), ('.', 1)]


# **N-grams**
**Sentence to n-gram.**

In [None]:
def sentence_to_trigram(tokenized_sentence):
  for i in range(len(tokenized_sentence)-3 +1):
    trigram = tokenized_sentence[i : i + 3]
    print(trigram)

In [None]:
print(f'List all trigrams of sentence: {tokenized_sentence}\n')
sentence_to_trigram(tokenized_sentence)

List all trigrams of sentence: ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

['i', 'am', 'happy']
['am', 'happy', 'because']
['happy', 'because', 'i']
['because', 'i', 'am']
['i', 'am', 'learning']
['am', 'learning', '.']
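
The same sliding window works for any n. The helper below is a sketch that generalizes the trigram function; the name `sentence_to_ngrams` is introduced here for illustration, and it returns the n-grams as tuples instead of printing them so they can be used as dictionary keys later.

```python
def sentence_to_ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    # a sentence of length L has L - n + 1 windows of size n (none if L < n)
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
bigrams = sentence_to_ngrams(tokens, 2)
trigrams = sentence_to_ngrams(tokens, 3)
print(trigrams)
```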


## **Prefix of an n-gram:**

\begin{equation*}
P(w_n|w_1^{n-1})=\frac{C(w_1^n)}{C(w_1^{n-1})}
\end{equation*}


In [None]:
# when working with trigrams, you need to prepend two <s> tokens and append one <e> token
n = 3
tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
tokenized_sentence = ["<s>"] * (n - 1) + tokenized_sentence + ["<e>"]
print(tokenized_sentence)

['<s>', '<s>', 'i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.', '<e>']


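**The padding lets the equation work at the sentence boundaries: with n - 1 start tokens the first word has a full prefix, and the end token lets the model assign a probability to the sentence ending.**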
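
As a worked instance of the equation, the sketch below counts trigrams and their bigram prefixes in the padded sentence and divides the two counts. The helper names `count_ngrams` and `conditional_probability` are hypothetical, and the estimate is the plain unsmoothed ratio, so a prefix that never occurs would cause a division by zero.

```python
from collections import Counter

n = 3
tokens = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
padded = ["<s>"] * (n - 1) + tokens + ["<e>"]

def count_ngrams(seq, size):
    """Count every window of `size` consecutive tokens."""
    return Counter(tuple(seq[i : i + size]) for i in range(len(seq) - size + 1))

trigram_counts = count_ngrams(padded, n)      # C(w_1^n)
prefix_counts = count_ngrams(padded, n - 1)   # C(w_1^{n-1})

def conditional_probability(trigram):
    """P(last word | prefix) = C(trigram) / C(prefix)."""
    return trigram_counts[trigram] / prefix_counts[trigram[:-1]]

p_happy = conditional_probability(('i', 'am', 'happy'))
p_first = conditional_probability(('<s>', '<s>', 'i'))
print(p_happy, p_first)
```

The prefix `('i', 'am')` occurs twice and is followed by `happy` once, so the first probability is 0.5. The start tokens make it possible to score `i` as the first word of the sentence.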

# **Lab2-Building the language model**

In [None]:
# manipulate n_gram count dictionary

n_gram_counts = {
    ('i', 'am', 'happy'): 2,
    ('am', 'happy', 'because'): 1}

# get count for an n-gram tuple
print(f"count of n-gram {('i', 'am', 'happy')}: {n_gram_counts[('i', 'am', 'happy')]}")

# check if n-gram is present in the dictionary
if ('i', 'am', 'learning') in n_gram_counts:
    print(f"n-gram {('i', 'am', 'learning')} found")
else:
    print(f"n-gram {('i', 'am', 'learning')} missing")

# update the count in the word count dictionary
n_gram_counts[('i', 'am', 'learning')] = 1
if ('i', 'am', 'learning') in n_gram_counts:
    print(f"n-gram {('i', 'am', 'learning')} found")
else:
    print(f"n-gram {('i', 'am', 'learning')} missing")

print(n_gram_counts)

count of n-gram ('i', 'am', 'happy'): 2
n-gram ('i', 'am', 'learning') missing
n-gram ('i', 'am', 'learning') found
{('i', 'am', 'happy'): 2, ('am', 'happy', 'because'): 1, ('i', 'am', 'learning'): 1}
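
The membership check and the update can be combined with `dict.get`, which returns a default value instead of raising `KeyError` for an unseen n-gram. This is a small sketch of that idiom, not a replacement for the cell above:

```python
n_gram_counts = {('i', 'am', 'happy'): 2, ('am', 'happy', 'because'): 1}

# unseen n-grams read as 0 without a KeyError
missing_count = n_gram_counts.get(('i', 'am', 'learning'), 0)
print(missing_count)

# increment works for both new and existing keys
for n_gram in [('i', 'am', 'learning'), ('i', 'am', 'happy')]:
    n_gram_counts[n_gram] = n_gram_counts.get(n_gram, 0) + 1
print(n_gram_counts)
```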


## **Building the count matrix in a single pass through the corpus**

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict
def single_pass_trigram_count_matrix(corpus):
    """
    Creates the trigram count matrix from the input corpus in a single pass through the corpus.
    
    Args:
        corpus: Pre-processed and tokenized corpus. 
    
    Returns:
        bigrams: list of all bigram prefixes, row index
        vocabulary: list of all found words, the column index
        count_matrix: pandas dataframe with bigram prefixes as rows, 
                      vocabulary words as columns 
                      and the counts of the bigram/word combinations (i.e. trigrams) as values
    """
    bigrams = []
    vocabulary = []
    # maps (bigram prefix, next word) -> count; missing keys default to 0
    count_matrix_dict = defaultdict(int)
    
    # go through the corpus once with a sliding window
    for i in range(len(corpus) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tuple(corpus[i : i + 3])
        
        bigram = trigram[0 : -1]
        if bigram not in bigrams:
            bigrams.append(bigram)        
        
        last_word = trigram[-1]
        if last_word not in vocabulary:
            vocabulary.append(last_word)
        
        count_matrix_dict[bigram, last_word] += 1
    
    # copy the counts into a dense np.array; unseen combinations stay 0
    count_matrix = np.zeros((len(bigrams), len(vocabulary)))
    for trigram_key, trigram_count in count_matrix_dict.items():
        count_matrix[bigrams.index(trigram_key[0]),
                     vocabulary.index(trigram_key[1])] = trigram_count
    
    # np.array to pandas dataframe conversion
    count_matrix = pd.DataFrame(count_matrix, index=bigrams, columns=vocabulary)
    return bigrams, vocabulary, count_matrix

In [None]:
corpus = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

bigrams, vocabulary, count_matrix = single_pass_trigram_count_matrix(corpus)

print(count_matrix)

                  happy  because    i   am  learning    .
(i, am)             1.0      0.0  0.0  0.0       1.0  0.0
(am, happy)         0.0      1.0  0.0  0.0       0.0  0.0
(happy, because)    0.0      0.0  1.0  0.0       0.0  0.0
(because, i)        0.0      0.0  0.0  1.0       0.0  0.0
(am, learning)      0.0      0.0  0.0  0.0       0.0  1.0


## **Probability matrix:**

In [None]:
# create the probability matrix from the count matrix
row_sums = count_matrix.sum(axis=1)
# divide each row by its sum
prob_matrix = count_matrix.div(row_sums, axis=0)

print(prob_matrix)

                  happy  because    i   am  learning    .
(i, am)             0.5      0.0  0.0  0.0       0.5  0.0
(am, happy)         0.0      1.0  0.0  0.0       0.0  0.0
(happy, because)    0.0      0.0  1.0  0.0       0.0  0.0
(because, i)        0.0      0.0  0.0  1.0       0.0  0.0
(am, learning)      0.0      0.0  0.0  0.0       0.0  1.0


In [None]:
# find the probability of a trigram in the probability matrix
trigram = ('i', 'am', 'happy')

# find the prefix bigram 
bigram = trigram[:-1]
print(f'bigram: {bigram}')

# find the last word of the trigram
word = trigram[-1]
print(f'word: {word}')

trigram_probability = prob_matrix[word][bigram]
print(f'trigram_probability: {trigram_probability}')

bigram: ('i', 'am')
word: happy
trigram_probability: 0.5


## **Language model evaluation:**

In [None]:
import random
def train_validation_test_split(data, train_percent, validation_percent):
  # fixed seed here for reproducibility
  random.seed(87)
  # reshuffle all input sentences (note: this shuffles `data` in place)
  random.shuffle(data)
  train_size = int(len(data) * train_percent / 100)
  train_data = data[0:train_size]

  validation_size = int(len(data) * validation_percent / 100)
  validation_data = data[train_size: train_size + validation_size]

  # whatever is left after train and validation becomes the test set
  test_data = data[train_size + validation_size : ]
  return train_data, validation_data, test_data

In [None]:
data = [x for x in range (0, 100)]

train_data, validation_data, test_data = train_validation_test_split(data, 80, 10)
print("split 80/10/10:\n",f"train data:{train_data}\n", f"validation data:{validation_data}\n", 
      f"test data:{test_data}\n")

split 80/10/10:
 train data:[28, 76, 5, 0, 62, 29, 54, 95, 88, 58, 4, 22, 92, 14, 50, 77, 47, 33, 75, 68, 56, 74, 43, 80, 83, 84, 73, 93, 66, 87, 9, 91, 64, 79, 20, 51, 17, 27, 12, 31, 67, 81, 7, 34, 45, 72, 38, 30, 16, 60, 40, 86, 48, 21, 70, 59, 6, 19, 2, 99, 37, 36, 52, 61, 97, 44, 26, 57, 89, 55, 53, 85, 3, 39, 10, 71, 23, 32, 25, 8]
 validation data:[78, 65, 63, 11, 49, 98, 1, 46, 15, 41]
 test data:[90, 96, 82, 42, 35, 13, 69, 24, 94, 18]
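
Because `random.shuffle` works in place, the function above also reorders the list passed in by the caller. If that side effect is unwanted, one option is to shuffle a copy with a local generator. The function name `split_without_mutation` and the `seed` parameter below are assumptions for illustration.

```python
import random

def split_without_mutation(data, train_percent, validation_percent, seed=87):
    rng = random.Random(seed)   # local generator, global random state untouched
    shuffled = list(data)       # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * train_percent / 100)
    validation_size = int(len(shuffled) * validation_percent / 100)
    train = shuffled[:train_size]
    validation = shuffled[train_size : train_size + validation_size]
    test = shuffled[train_size + validation_size :]
    return train, validation, test

data = list(range(100))
train, validation, test = split_without_mutation(data, 80, 10)
print(len(train), len(validation), len(test))
```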



# **Perplexity:**

\begin{equation*}
PP(W)=\sqrt[M]{\prod_{i=1}^{M}{\frac{1}{P(w_i|w_{i-1})}}}
\end{equation*}

Remember from calculus:

\begin{equation*}
\sqrt[M]{\frac{1}{x}} = x^{-\frac{1}{M}}
\end{equation*}


In [None]:
# to calculate the exponent, use the following syntax
p = 10 ** (-250)
M = 100
perplexity = p ** (-1 / M)
print(perplexity)

316.22776601683796
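
For longer texts the product of probabilities can underflow to 0.0 in floating point. A common workaround is to sum log probabilities and exponentiate at the end. The sketch below assumes the per-word conditional probabilities are already available as a list; the function name `perplexity` and the example values are made up for illustration.

```python
import math

def perplexity(probabilities):
    """Perplexity as exp(-(1/M) * sum(log p_i)), with M = number of words."""
    m = len(probabilities)
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / m)

# 100 words with probability 10^-2.5 each: the product is 10^-250 as above
same_as_above = perplexity([10 ** -2.5] * 100)
# here the plain product would be 10^-500, which underflows to 0.0
would_underflow = perplexity([1e-5] * 100)
print(same_as_above, would_underflow)
```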


# **Lab3-Language model generalization:**
Out of vocabulary words (OOV)

In [None]:
# Build the vocabulary from M most frequent words
from collections import Counter

M = 3
word_counts = {'happy':5, "because":2, 'i':2, 'am':2, 'learning':3, '.':1}
vocabulary = Counter(word_counts).most_common(M)
# remove the frequencies and leave just the words
vocabulary = [w[0] for w in vocabulary]
print(f"the new vocabulary containing {M} most frequent words: {vocabulary}\n") 


the new vocabulary containing 3 most frequent words: ['happy', 'learning', 'because']



In [None]:
sentence = ['am','i','learning']
output_sentence = []
for word in sentence:
  if word in vocabulary:
    output_sentence.append(word)
  else:
    output_sentence.append('<UNK>')

print(f"input sentence: {sentence}")
print('output sentence: ', output_sentence)

input sentence: ['am', 'i', 'learning']
output sentence:  ['<UNK>', '<UNK>', 'learning']
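
The replacement loop can be wrapped in a reusable function. This is a sketch with a hypothetical name, `replace_oov`; it converts the vocabulary to a set so that each membership test does not scan the whole list.

```python
def replace_oov(sentence, vocabulary, unknown_token="<UNK>"):
    """Replace every word not in the vocabulary with the unknown token."""
    known = set(vocabulary)
    return [word if word in known else unknown_token for word in sentence]

vocabulary = ['happy', 'learning', 'because']
output_sentence = replace_oov(['am', 'i', 'learning'], vocabulary)
print(output_sentence)
```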


In [None]:
# iterate through all word counts and print words with given frequency f

f = 3
word_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning':3, '.': 1}
for word,freq in word_counts.items():
  if freq == f:
    print(word)

because
learning
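
The same loop can build a vocabulary from a minimum frequency instead of a fixed size M. In this sketch the threshold name `min_frequency` is an assumption; every word seen at least that often is kept.

```python
word_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning': 3, '.': 1}
min_frequency = 3

# keep words whose count reaches the threshold, in insertion order
frequent_words = [word for word, freq in word_counts.items() if freq >= min_frequency]
print(frequent_words)
```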


## **Smoothing:**

In [None]:
def add_k_smoothing_probability(k, vocabulary_size, n_gram_count, n_gram_prefix_count):
    # add k to the n-gram count and k * |V| to the prefix count
    numerator = n_gram_count + k
    denominator = n_gram_prefix_count + k * vocabulary_size
    return numerator / denominator
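
As a usage sketch, take the prefix `('i', 'am')` from the count matrix earlier: it occurs twice, followed once by `happy` and once by `learning`, with a vocabulary of 6 words. The block restates the formula under a hypothetical name, `add_k_probability`, so that it runs on its own, and uses k = 1.

```python
def add_k_probability(k, vocabulary_size, n_gram_count, prefix_count):
    """(C(n-gram) + k) / (C(prefix) + k * |V|)"""
    return (n_gram_count + k) / (prefix_count + k * vocabulary_size)

vocabulary_size = 6
prefix_count = 2   # count of the bigram ('i', 'am')

seen = add_k_probability(1, vocabulary_size, 1, prefix_count)    # e.g. ('i', 'am', 'happy')
unseen = add_k_probability(1, vocabulary_size, 0, prefix_count)  # e.g. ('i', 'am', 'because')

# two seen continuations and four unseen ones still sum to 1
total = 2 * seen + 4 * unseen
print(seen, unseen, total)
```

Unseen trigrams now get a small nonzero probability, at the cost of lowering the probability of the seen ones from 0.5 to 0.25.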