<a href="https://colab.research.google.com/github/ALIOUNEDIANKHA/n_gram_models/blob/main/Alioune_Ben_Mor_DIANKHA_n_gram_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 style="font-family:verdana;font-size:300%;text-align:center;background-color:#f2f2f2;color:#0d0d0d">AMMI NLP - Review sessions</h1>

<h1 style="font-family:verdana;font-size:180%;text-align:Center;color:#993333"> Lab 3: n-gram models </h1>

**Big thanks to Amr Khalifa, who improved this lab and turned it into a Jupyter Notebook!**

In [1]:
import io, sys, math, re
from collections import defaultdict
import numpy as np
import random

In [2]:
# data_loader
def load_data(filename):
    '''
    Parameters:
    filename (string): path to the data file, one sentence per line

    Returns:
    data (list of lists): each list is a sentence of the text
    vocab (dictionary): {word: number of times it appears in the text}
    '''
    data = []
    vocab = defaultdict(int)
    # the context manager closes the file even if reading fails
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            sentence = line.split()
            data.append(sentence)
            for word in sentence:
                vocab[word] += 1
    return data, vocab

In [3]:

print("load training set..")
print("\n")
train_data, vocab = load_data("train1.txt")
print(train_data[50])
print("\n")
print("how :",vocab['how'])
print("load validation set")
valid_data, _ = load_data("valid1.txt")


load training set..


['<s>', 'tom', 'showed', 'me', 'a', 'picture', 'of', 'the', 'food', 'they', 'had', 'eaten.', '</s>']


how : 107
load validation set


In [4]:
def remove_rare_words(data, vocab, mincount = 1):
    '''
    Parameters:
    data (list of lists): each list is a sentence of the text 
    vocab (dictionary): {word: no of times it appears in the text}
    mincount (int): words seen mincount times or fewer are treated as rare
    
    Returns: 
    data_with_unk (list of lists): data after replacing rare words with <unk> token
    '''
    # replace words in data that are not in the vocab 
    # or whose count is not strictly above mincount

    ## FILL CODE
    # vocab.get avoids inserting unseen words into the defaultdict
    data_with_unk = [[word if vocab.get(word, 0) > mincount else '<unk>' for word in sentence] for sentence in data]
    return data_with_unk

In [5]:
print("remove rare words")
train_data = remove_rare_words(train_data, vocab, mincount = 1)
valid_data = remove_rare_words(valid_data, vocab, mincount = 1)
#train_data
print(train_data[0])


remove rare words
['<s>', 'my', '<unk>', "don't", 'speak', '<unk>', '</s>']


In [6]:
def build_ngram(data, n):
    '''
    Parameters:
    data (list of lists): each list is a sentence of the text 
    n (int): size of the n-gram
    
    Returns:
    prob (dictionary of dictionary)
    {
        context: {word:probability of this word given context}
    }
    '''

    counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    for sentence in data:
        ## FILL CODE
        # dict can be indexed by tuples:
        # the context (the n-1 previous words) is the key
        # and the word that follows it is the inner key
        for i in range(len(sentence)-n+1):
            counts[tuple(sentence[i:i+n-1])][sentence[i+n-1]] += 1

    prob = defaultdict(lambda: defaultdict(lambda: 0.0))
    # Build the probabilities from the counts
    # Be careful with how you normalize!
    for context in counts.keys():
        ## FILL CODE
        # normalize by the number of times the context is followed by a word,
        # so that the probabilities of each context sum to 1
        total = sum(counts[context].values())
        for word in counts[context].keys():
            prob[context][word] = counts[context][word]/total

    return prob

In [7]:
# RUN TO BUILD NGRAM MODEL

n = 3
print("build ngram model with n = ", n)
model = build_ngram(train_data, n)

build ngram model with n =  3


Here, implement a recursive function over shorter and shorter contexts to compute a "stupid backoff" model. An interpolation model can also be implemented this way.
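
A minimal sketch of that recursion, assuming a model that stores a distribution for every context length from 0 to n-1 (the `build_ngram` cell in this lab stores only the longest one). The helper names `build_all_orders` and `backoff_prob` and the discount `alpha=0.4` are assumptions made for illustration, not part of the lab. The returned values are scores rather than normalized probabilities.

```python
from collections import defaultdict

def build_all_orders(data, n):
    """Relative frequencies for every k-gram, k = 1..n (hypothetical helper).

    Keys are context tuples of length 0..n-1; the empty tuple holds unigrams.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in data:
        for k in range(1, n + 1):
            for i in range(len(sentence) - k + 1):
                context = tuple(sentence[i:i + k - 1])
                counts[context][sentence[i + k - 1]] += 1
    return {
        context: {w: c / sum(words.values()) for w, c in words.items()}
        for context, words in counts.items()
    }

def backoff_prob(model, context, w, alpha=0.4):
    """Stupid backoff score of w given context (alpha is an assumed discount)."""
    context = tuple(context)
    dist = model.get(context, {})
    if w in dist:
        return dist[w]      # the n-gram was seen: use its relative frequency
    if not context:
        return 0.0          # even the unigram is unseen
    # drop the oldest context word and discount the shorter estimate
    return alpha * backoff_prob(model, context[1:], w, alpha)

toy_data = [["<s>", "a", "b", "</s>"], ["<s>", "a", "c", "</s>"]]
toy_model = build_all_orders(toy_data, 3)
print(backoff_prob(toy_model, ("<s>", "a"), "b"))  # seen trigram
print(backoff_prob(toy_model, ("x", "a"), "b"))    # backs off to the bigram
```

The second call falls back from the unseen context `("x", "a")` to `("a",)`, so its score is the bigram estimate multiplied once by `alpha`.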

In [8]:
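def get_prob(model, context, w):
    '''
    Parameters: 
    model (dictionary of dictionary)
    {
        context: {word:probability of this word given context}
    } 
    context (tuple of strings): the n-1 words preceding w
    w (string): the word whose probability given the context is needed
    
    Returns:
    prob (float): probability of this word given the context 
    '''

    # The lab asks for a recursive function over smaller and smaller
    # contexts to compute the backoff model. The model built above only
    # stores n-grams of order n, so this version returns the plain
    # n-gram estimate, which is 0.0 for unseen (context, word) pairs.
    
    ## FILL CODE
    # .get is used so that lookups never insert new keys into the model
    return model.get(context, {}).get(w, 0.0)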


In [9]:
def perplexity(model, data, n):
    '''
    Parameters: 
    model (dictionary of dictionary)
    {
        context: {word:probability of this word given context}
    } 
    data (list of lists): each list is a sentence of the text
    n (int): size of the n-gram
    
    Returns:
    perp (float): the perplexity of the model 
    '''

    ## FILL CODE
    neg_log_sum = 0.0  # running sum of -log P(word | context)
    t = 0              # number of n-grams with a nonzero probability
    for sentence in data:
        for i in range(len(sentence)-n+1):
            context, word = tuple(sentence[i:i+n-1]), sentence[i+n-1]
            prob = get_prob(model, context, word)
            # n-grams with zero probability are skipped,
            # so they do not count against the model
            if prob > 0:
                t += 1
                neg_log_sum -= np.log(prob)

    if t == 0:
        return float('inf')  # no n-gram could be scored
    perp = np.exp((1/t)*neg_log_sum)
    return perp

In [10]:
# COMPUTE PERPLEXITY ON VALIDATION SET

print("The perplexity is", perplexity(model, valid_data, n=n))

The perplexity is 5.755000061229744
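
For reference, the value printed above is the exponential of the average negative log-probability over the n-grams that were scored. The sketch below reproduces that computation on a hand-picked list of probabilities; `perplexity_from_probs` is a hypothetical helper, not part of the lab.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity from the probability assigned to each token.

    Zero probabilities are skipped, mirroring the `perplexity` cell above.
    """
    scored = [p for p in probs if p > 0]
    if not scored:
        return float("inf")  # nothing could be scored
    avg_neg_log = -sum(math.log(p) for p in scored) / len(scored)
    return math.exp(avg_neg_log)

print(perplexity_from_probs([0.5, 0.25, 0.125]))
print(perplexity_from_probs([0.5, 0.25, 0.125, 0.0]))
```

Both calls give a perplexity of about 4, because the geometric mean of 0.5, 0.25 and 0.125 is 0.25. The zero in the second list is ignored rather than penalized, which is why skipping unseen n-grams can make a model look better than it is.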


In [11]:
def get_proba_distrib(model, context):
    '''
    Get the words that can follow the context and their probability of
    appearing after it.

    Parameters: 
    model (dictionary of dictionary)
    {
        context: {word:probability of this word given context}
    }
    context (tuple of strings): the words whose possible continuations
    and their probabilities are needed
    
    Returns:
    words_and_probs (dict): {word: probability of word given context}
    
    '''
    # The lab asks for a recursive function over the context to find the
    # longest available n-gram. The model only stores one context length,
    # so this version looks the context up directly.
    
    ## FILL CODE
    # .get returns an empty dict for unseen contexts without modifying the model
    return model.get(context, {})

In [12]:
def generate(model):
    '''
    Parameters: 
    model (dictionary of dictionary)
    {
        context: {word:probability of this word given context}
    }
    
    Returns:
    sentence (string): a sentence sampled according to the language model,
    with its tokens joined by spaces
    '''
    # generate a sentence. A sentence starts with a <s> and ends with a </s>
    # A possibly useful function is:
    # np.random.choice(x, 1, p = y)
    # where x is a list of things to sample from
    # and y is a list of probabilities (of the same length as x)
    sentence = ["<s>"]

    # start from a random context; contexts without any probability mass
    # (e.g. added to the defaultdict by earlier lookups) are ignored
    contexts = [c for c, dist in model.items() if sum(dist.values()) > 0]
    context = random.choice(contexts)
    sentence.extend(context)
    order = len(context)  # n-1, so the global n is not needed

    while sentence[-1] != "</s>" and len(sentence)<100:
        ## FILL CODE
        words_after_context = get_proba_distrib(model, context)
        if not words_after_context:
            break  # unseen context: nothing to sample from
        words = list(words_after_context.keys())
        proba_context = [words_after_context[word] for word in words]
        gen = np.random.choice(words, 1, p = proba_context)
        sentence.extend(gen)
        context = tuple(sentence[-order:]) if order else ()
    sentence = ' '.join(word for word in sentence)
    
    return sentence

In [13]:
# GENERATE A SENTENCE FROM THE MODEL

print("Generated sentence: ",generate(model))

Generated sentence:  <s> opening up access to the fridge. </s>


Once you are done implementing the model, evaluation and generation code, you can try changing the value of `n`, and play with a larger training set (`train2.txt` and `valid2.txt`). You can also try to implement an interpolation model.
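
One way the interpolation model could look, as a sketch under stated assumptions: the model stores a distribution for every context length (unlike the `build_ngram` cell, which keeps only the longest), and the mixing weights are fixed by hand. The function name `interpolated_prob`, the toy distributions and the weights are all hypothetical.

```python
def interpolated_prob(model, context, w, lambdas):
    """Mix the estimates obtained from every context length.

    lambdas[k] weights the estimate that conditions on the last k words of
    `context`; the weights are assumed to sum to 1.
    """
    context = tuple(context)
    assert len(lambdas) == len(context) + 1, "one weight per context length"
    total = 0.0
    for k, lam in enumerate(lambdas):
        sub_context = context[len(context) - k:]  # last k words, () when k == 0
        total += lam * model.get(sub_context, {}).get(w, 0.0)
    return total

# Hand-written toy distributions (hypothetical numbers), keyed by context.
toy_model = {
    (): {"<s>": 0.25, "a": 0.25, "b": 0.125, "c": 0.125, "</s>": 0.25},
    ("a",): {"b": 0.5, "c": 0.5},
    ("<s>", "a"): {"b": 0.5, "c": 0.5},
}
weights = (0.2, 0.3, 0.5)  # unigram, bigram, trigram weights (assumed values)
print(interpolated_prob(toy_model, ("<s>", "a"), "b", weights))
```

Here the three estimates 0.125, 0.5 and 0.5 combine to 0.2\*0.125 + 0.3\*0.5 + 0.5\*0.5 = 0.425. Unlike backoff, every context length contributes to every prediction, so a word unseen after the full context still gets some probability from the shorter ones. In practice the weights would be tuned on the validation set.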

In [14]:
# RUN TO BUILD NGRAM MODEL
print("load training set..")
print("\n")
train_data, vocab = load_data("train2.txt")
print(train_data[50])
print("\n")
print("load validation set")
valid_data, _ = load_data("valid2.txt")

load training set..


['<s>', 'this', 'book', 'is', 'better', 'than', 'any', 'i', 'have', 'ever', 'read', '.', '</s>']


load validation set


In [15]:
print("remove rare words")
train_data = remove_rare_words(train_data, vocab, mincount = 1)
valid_data = remove_rare_words(valid_data, vocab, mincount = 1)
#train_data
print(train_data[0])

remove rare words
['<s>', 'i', 'liked', 'your', 'idea', 'and', 'adopted', 'it', '.', '</s>']


In [16]:
n = 3
print("build ngram model with n = ", n)
model = build_ngram(train_data, n)

build ngram model with n =  3


In [17]:
print("The perplexity is", perplexity(model, valid_data, n=n))

The perplexity is 10.068608700940757


In [18]:
print("Generated sentence: ",generate(model))

Generated sentence:  <s> well that ends well . </s>
