*************************************************************************************************
<b>Written By:- Aadish Joshi</b>  
<b>Date:- April 24, 2019</b>  
<b>Topic:- Language Modeling with Ngrams</b>  

➢ What are we trying to solve here?  
We use probabilistic models called N-grams to predict the next word from the previous n-1 words.  
Note that the resulting estimates are specific to the corpus (the data set of sentences) they are computed from.  

➢ "Gram" refers to the number of words taken into consideration. A unigram is a single word, and its count is the number of times that word occurs in the corpus. A bigram model predicts the next word from the 1 previous word, a trigram model predicts it from the 2 previous words, and so on.
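
As an illustration of these definitions, here is a minimal sketch that lists the unigrams, bigrams and trigrams of a short phrase. The helper name `ngrams` and the sample phrase are assumptions made for this example only.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "its water is so transparent".split()

print(ngrams(tokens, 1))  # unigrams: one word each
print(ngrams(tokens, 2))  # bigrams: pairs of adjacent words
print(ngrams(tokens, 3))  # trigrams: triples of adjacent words
```

A phrase of 5 words yields 5 unigrams, 4 bigrams and 3 trigrams; asking for an n larger than the phrase length returns an empty list.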

➢ This method is count based.  
We start by counting all the unigrams, i.e. the number of occurrences of each word in the corpus. 

➢ We’ll call a statistical model that can assess the probability of the next word a Language Model.

*************************************************************************************************

➢ Consider the following corpus:  

"The lake is very big. Its water is so transparent that you can see your face clearly. Its water is so transparent that the moon appears even bigger due to reflection"  

How do we estimate the probability of the word "the" given the previous words "its water is so transparent that"?  

We calculate    

Count(its water is so transparent that the) = 1
and  
Count(its water is so transparent that) = 2

Hence  

P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that) = 1/2 = 0.5  
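
The two counts can be checked with a short sketch. The helper `count_sequence` is a hypothetical name introduced here, and the tokenization (lowercasing, letters only, sentence boundaries ignored) is a simplifying assumption that happens to be harmless for this corpus.

```python
import re

corpus = ("The lake is very big. Its water is so transparent that you can see "
          "your face clearly. Its water is so transparent that the moon appears "
          "even bigger due to reflection")

# Lowercase the text and keep only runs of letters, dropping punctuation.
tokens = re.findall(r"[a-z]+", corpus.lower())

def count_sequence(tokens, phrase):
    """Count how often the words of `phrase` occur consecutively in `tokens`."""
    target = phrase.lower().split()
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

history = "its water is so transparent that"
numerator = count_sequence(tokens, history + " the")
denominator = count_sequence(tokens, history)

print(numerator, denominator, numerator / denominator)  # 1 2 0.5
```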

*************************************************************************************************
Unfortunately, for most sequences and for most text collections we won’t get good estimates from this method.  
➢ Long word sequences rarely repeat, so what we’re likely to get is 0, or worse, 0/0.  

➢ Instead, let’s use the chain rule of probability  
➢ and a particularly useful independence assumption.  

*************************************************************************************************

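Chain rule of probability:  
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)  

Independence assumption (Markov assumption):  
The probability of a word depends only on the previous n-1 words and is independent of its earlier history. For a bigram model this means P(wn | w1, ..., wn-1) ≈ P(wn | wn-1).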
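
Applied to a sentence of n words, the chain rule and the bigram form of the Markov assumption can be written as follows. This is a sketch in LaTeX notation; the symbols w_k and n are notation assumed here for illustration.

```latex
P(w_1, w_2, \dots, w_n)
  = \prod_{k=1}^{n} P(w_k \mid w_1, \dots, w_{k-1})
  \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})
```

The first equality is exact. The approximation replaces each full history with only the immediately preceding word, with w_0 taken to be a sentence start marker.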

*************************************************************************************************
How to approach this problem  
1) Decide on the n-gram size, e.g. bigrams (2-word counts), unigrams (1-word counts), etc.  
2) P(wn | wn-1) = Count(wn-1, wn) / Count(wn-1)  

Coding:  
Consider the following corpus  
"Sales of the company to return to normalcy. Sales of the new products and services contributed to increase revenue."  
We have to find the bigram probability estimates for this corpus.  

Strategy:  
0) Preprocess the sentences (optional)  
1) Count the unigrams  
2) Count the bigrams  
3) Compute bigram_count / unigram_count for each bigram  
*************************************************************************************************

In [1]:
import re

corpus = "Sales of the company to return to normalcy.\n Sales of the new products and services contributed to increase revenue."

In [2]:
def _preprocess(corpus):

    # split the corpus into sentences
    sentences = corpus.split(".")

    data = []

    for sentence in sentences:
        # skip empty or whitespace-only fragments left over by the split
        if sentence.strip() != "":

            # keep letters and digits only; every other character becomes a space
            sentence = re.sub(r"[^a-zA-Z\d]", " ", sentence)

            # lower case words
            sentence = sentence.lower()

            # add custom start and end tags
            sentence = "<s> " + sentence.strip() + " </s>"

            # append processed data
            data.append(sentence)

    return data

    
data = _preprocess(corpus)
print(data)

['<s> sales of the company to return to normalcy </s>', '<s> sales of the new products and services contributed to increase revenue </s>']


In [3]:
def _unigrams(data):
    # dict with each unigram as key and its number of occurrences as value
    unigrams = {}

    for sentence in data:

        # generate words from the sentence
        words = sentence.split(" ")

        # process individual words
        for word in words:

            # increment the count, starting from 0 for a word not seen before
            unigrams[word] = unigrams.get(word, 0) + 1
    return unigrams

unigrams = _unigrams(data)
print(unigrams)

{'<s>': 2, 'sales': 2, 'of': 2, 'the': 2, 'company': 1, 'to': 3, 'return': 1, 'normalcy': 1, '</s>': 2, 'new': 1, 'products': 1, 'and': 1, 'services': 1, 'contributed': 1, 'increase': 1, 'revenue': 1}


In [4]:
def _bigrams(data):
    # dict with each bigram (pair of adjacent words) as key and its number of occurrences as value
    bigrams = {}

    for sentence in data:
        # generate words from the sentence
        words = sentence.split(" ")

        # pair each word with the word that follows it using zip
        for key in zip(words, words[1:]):
            # increment the count, starting from 0 for a bigram not seen before
            bigrams[key] = bigrams.get(key, 0) + 1
    return bigrams

bigrams = _bigrams(data)
print(bigrams)

{('<s>', 'sales'): 2, ('sales', 'of'): 2, ('of', 'the'): 2, ('the', 'company'): 1, ('company', 'to'): 1, ('to', 'return'): 1, ('return', 'to'): 1, ('to', 'normalcy'): 1, ('normalcy', '</s>'): 1, ('the', 'new'): 1, ('new', 'products'): 1, ('products', 'and'): 1, ('and', 'services'): 1, ('services', 'contributed'): 1, ('contributed', 'to'): 1, ('to', 'increase'): 1, ('increase', 'revenue'): 1, ('revenue', '</s>'): 1}


In [5]:
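def _probability(unigrams, bigrams):

    # P(word2 | word1) = Count(word1, word2) / Count(word1), rounded to 2 decimals
    probability = {}

    for key in bigrams.keys():
        word1 = key[0]

        # a first word with no unigram count would mean dividing by zero, so store 0
        count = unigrams.get(word1, 0)
        probability[key] = round(bigrams[key] / count, 2) if count else 0
    return probability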

probability = _probability(unigrams,bigrams)
print(probability)

{('<s>', 'sales'): 1.0, ('sales', 'of'): 1.0, ('of', 'the'): 1.0, ('the', 'company'): 0.5, ('company', 'to'): 1.0, ('to', 'return'): 0.33, ('return', 'to'): 1.0, ('to', 'normalcy'): 0.33, ('normalcy', '</s>'): 1.0, ('the', 'new'): 0.5, ('new', 'products'): 1.0, ('products', 'and'): 1.0, ('and', 'services'): 1.0, ('services', 'contributed'): 1.0, ('contributed', 'to'): 1.0, ('to', 'increase'): 0.33, ('increase', 'revenue'): 1.0, ('revenue', '</s>'): 1.0}
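
Under the bigram assumption, the probability of a whole sentence is the product of its bigram probabilities. The sketch below shows one way this could be computed; `sentence_probability`, `unigram_counts` and `bigram_counts` are hypothetical names introduced here, and the block rebuilds the counts with `collections.Counter` so that it runs on its own.

```python
from collections import Counter

sentences = [
    "<s> sales of the company to return to normalcy </s>",
    "<s> sales of the new products and services contributed to increase revenue </s>",
]

# Count single words and pairs of adjacent words across all sentences.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def sentence_probability(sentence):
    """Multiply P(current | previous) over every adjacent word pair."""
    words = sentence.split()
    probability = 1.0
    for previous, current in zip(words, words[1:]):
        if bigram_counts[(previous, current)] == 0:
            return 0.0  # unseen bigram: the unsmoothed estimate is zero
        probability *= bigram_counts[(previous, current)] / unigram_counts[previous]
    return probability

print(sentence_probability(sentences[0]))
print(sentence_probability("<s> sales of the new products </s>"))
```

For the first sentence only three factors differ from 1 (0.5, 1/3 and 1/3), so the product is 1/18. The second call returns 0.0 because the pair ("products", "</s>") never occurs in the corpus.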


*************************************************************************************************
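What if a count is zero? A bigram that never appears in the corpus gets probability 0, and a previous word that never appears gives 0/0, which is undefined.  

Hence we use a smoothing technique called Laplace (add-one) smoothing.  
We add 1 to every count before normalizing.  
For unigrams, adding 1 to each of the V word counts raises the total number of words from N to N + V, so P(w) = (Count(w) + 1) / (N + V).  
For bigrams, the same idea gives P(wn | wn-1) = (Count(wn-1, wn) + 1) / (Count(wn-1) + V).  
The vocabulary size V is the number of unique words in the corpus.  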
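
A minimal sketch of add-one smoothing for bigrams, using the estimate (Count(wn-1, wn) + 1) / (Count(wn-1) + V). The function name `laplace_probability` is hypothetical, and counting the `<s>` and `</s>` tags as part of the vocabulary is an assumption made for this example.

```python
from collections import Counter

sentences = [
    "<s> sales of the company to return to normalcy </s>",
    "<s> sales of the new products and services contributed to increase revenue </s>",
]

# Count single words and pairs of adjacent words across all sentences.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

# V: number of distinct tokens (assumption: the <s> and </s> tags are included).
vocabulary_size = len(unigram_counts)

def laplace_probability(previous, current):
    """Add-one estimate of P(current | previous)."""
    numerator = bigram_counts[(previous, current)] + 1
    denominator = unigram_counts[previous] + vocabulary_size
    return numerator / denominator

print(laplace_probability("the", "company"))  # seen bigram
print(laplace_probability("the", "revenue"))  # unseen bigram, no longer zero
print(laplace_probability("profit", "grew"))  # unseen history, no division by zero
```

With V = 16 here, the seen bigram drops from 0.5 to 2/18, while the unseen cases receive small non-zero probabilities (1/18 and 1/16). The denominator is never zero because V is always added to it.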

*************************************************************************************************