# N-Gram Language Model
Language models (LMs) assign probabilities to sequences of words.  
The n-gram language model is the simplest such model: it assigns probabilities to sentences by looking only at sequences of $N$ words.  
  
Let $P(w | h)$ be the probability of a word $w$ given some history $h$.  
Because language is creative, most long histories never occur in any corpus, so we cannot estimate this probability from counts for every word and every history.  
One approach is to decompose the probability of a whole sequence using the chain rule of probability:
$P(X_1 \dots X_n) = P(X_1)P(X_2 | X_1)P(X_3 | X_1^2) \dots P(X_n | X_1^{n-1})$  
But this alone does not help, for the same reason: each factor still conditions on an arbitrarily long history.  
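The intuition of the n-gram model is that, instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.  
In a bigram model, that is $P(w_n | w_1^{n-1}) \approx P(w_n | w_{n-1})$.  
The bigram probability is then estimated from counts: $P(w_n | w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$, where $C(\cdot)$ is the number of times a sequence occurs in the training corpus.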
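
The count-based bigram estimate can be checked by hand on a very small example. The sketch below assumes a hypothetical three-sentence corpus (`toy_corpus`) and a helper named `bigram_mle`; neither is part of the notebook's data or code.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
toy_corpus = ["i am sam", "sam i am", "i do not like green eggs and ham"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in toy_corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words)
    # Pair each word with its successor to get the bigrams.
    bigram_counts.update(zip(words, words[1:]))

def bigram_mle(prev, word):
    """P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("<s>", "i"))   # C(<s> i) = 2, C(<s>) = 3
print(bigram_mle("i", "am"))    # C(i am) = 2, C(i) = 3
print(bigram_mle("am", "sam"))  # C(am sam) = 1, C(am) = 2
```

Each probability is simply the share of occurrences of the previous word that are followed by the word in question.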

In [1]:
file_dir = './data/wsj_sec22'

In [2]:
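from collections import Counter

# A single Counter holds unigram, bigram, and trigram counts, keyed by tuples of words.
c = Counter()
with open(file_dir) as f:
    for line in f:
        # Lowercase, split on whitespace, and add sentence boundary markers.
        words = ['<s>'] + line.lower().strip().split() + ['</s>']
        # Zipping the word list with shifted copies of itself yields consecutive n-grams.
        bigram = zip(*[words[i:] for i in range(2)])
        trigram = zip(*[words[i:] for i in range(3)])
        c.update([(w,) for w in words])
        c.update(bigram)
        c.update(trigram)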

In [3]:
c.most_common(10)

[(('the',), 2185),
 ((',',), 2143),
 (('<s>',), 1700),
 (('</s>',), 1700),
 (('.',), 1660),
 (('.', '</s>'), 1562),
 (('of',), 913),
 (('to',), 868),
 (('a',), 817),
 (('in',), 706)]

In [4]:
c.most_common()[-10:]

[(('up', 'from', '$'), 1),
 (('from', '$', '15.2'), 1),
 (('$', '15.2', 'million'), 1),
 (('15.2', 'million', 'in'), 1),
 (('fiscal', '1988', 'and'), 1),
 (('1988', 'and', '$'), 1),
 (('and', '$', '3.9'), 1),
 (('$', '3.9', 'million'), 1),
 (('3.9', 'million', 'in'), 1),
 (('million', 'in', '1985'), 1)]

In [5]:
unigram = dict()
bigram = dict()
trigram = dict()

for k, v in c.items():
    if len(k) == 1:
        unigram[k] = v
    elif len(k) == 2:
        bigram[k] = v
    else:
        trigram[k] = v

In [9]:
unigram_prob = dict()
unigram_token = sum(unigram.values())
for word, num in unigram.items():
    unigram_prob[word] = num / unigram_token

# P(w2 | w1) = C(w1, w2) / C(w1)
bigram_prob = dict()
for words, num in bigram.items():
    w1, w2 = words
    bigram_prob[words] = num / unigram[(w1,)]

# P(w3 | w1 w2) = C(w1, w2, w3) / C(w1, w2)
trigram_prob = dict()
for words, num in trigram.items():
    w1, w2, w3 = words
    trigram_prob[words] = num / bigram[(w1, w2)]
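
One sanity check for count-based estimates like these is that, for any history, the conditional probabilities of all possible next words should sum to one. The sketch below illustrates this on a hypothetical toy corpus (`toy_corpus` and `totals` are illustrative names), so that it runs without the WSJ file.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, used only for illustration.
toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in toy_corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

# For each history w1, sum P(w2 | w1) over every observed next word w2.
totals = defaultdict(float)
for (w1, w2), count in bigram_counts.items():
    totals[w1] += count / unigram_counts[w1]

for w1, total in sorted(totals.items()):
    print(w1, round(total, 6))
```

Every history that appears in `totals` sums to one. The end marker `</s>` never appears as a history because nothing follows it.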

In [12]:
sorted(unigram.items(), key=lambda x:x[1], reverse=True)[:10]

[(('the',), 2185),
 ((',',), 2143),
 (('<s>',), 1700),
 (('</s>',), 1700),
 (('.',), 1660),
 (('of',), 913),
 (('to',), 868),
 (('a',), 817),
 (('in',), 706),
 (('and',), 682)]

In [13]:
sorted(bigram.items(), key=lambda x:x[1], reverse=True)[:10]

[(('.', '</s>'), 1562),
 (('<s>', 'the'), 293),
 (('of', 'the'), 219),
 (('in', 'the'), 208),
 ((',', 'the'), 176),
 (('<s>', '``'), 124),
 ((',', "''"), 121),
 (('for', 'the'), 102),
 ((',', 'and'), 97),
 ((',', 'a'), 89)]

In [14]:
sorted(trigram.items(), key=lambda x:x[1], reverse=True)[:10]

[(('.', "''", '</s>'), 82),
 (('said', '.', '</s>'), 56),
 (('a', 'share', ','), 46),
 (('million', ',', 'or'), 44),
 (('the', 'stock', 'market'), 41),
 ((',', "''", 'said'), 33),
 (('%', '.', '</s>'), 33),
 (('cents', 'a', 'share'), 31),
 ((',', "''", 'he'), 26),
 ((',', 'or', '$'), 25)]

Now that the language model is built, we can use it to calculate the probability of a given sentence.  
  
However, any n-gram that never occurs in the training data gets a probability of zero, which makes the probability of the whole sentence zero.
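
The zero-frequency problem is easy to reproduce on a small scale. The following is a minimal sketch, assuming a hypothetical toy corpus and an unsmoothed bigram model; `sentence_prob` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in toy_corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def sentence_prob(sentence):
    """Unsmoothed bigram probability of a whitespace-tokenized sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # the history itself was never seen
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob

print(sentence_prob("the cat sat"))  # every bigram occurs in the corpus
print(sentence_prob("the dog ran"))  # ('dog', 'ran') never occurs
```

The second sentence uses only known words, yet a single unseen bigram is enough to drive its probability to zero.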

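In simple linear interpolation, we combine n-grams of different orders by linearly interpolating all the models. Thus, we estimate the trigram probability $P(w_n | w_{n-2} w_{n-1})$ by mixing together the unigram, bigram, and trigram probabilities, each weighted by a $\lambda$:

$\hat{P}(w_n | w_{n-2} w_{n-1}) = \lambda_3 P(w_n | w_{n-2} w_{n-1}) + \lambda_2 P(w_n | w_{n-1}) + \lambda_1 P(w_n)$, where the weights sum to one: $\lambda_1 + \lambda_2 + \lambda_3 = 1$.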
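
The mixing described above can be tried on a toy example before applying it to the real counts. This is a sketch under stated assumptions: `toy_corpus`, `mle`, and `interpolated` are hypothetical names, and the default weights simply mirror the 0.1 / 0.2 / 0.7 defaults used in this notebook.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]

counts = Counter()
total_tokens = 0
for sentence in toy_corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    total_tokens += len(words)
    for n in (1, 2, 3):
        # Shifted copies of the word list, zipped together, give the n-grams.
        counts.update(zip(*[words[i:] for i in range(n)]))

def mle(ngram):
    """Count-based estimate of the last word given the words before it."""
    if len(ngram) == 1:
        return counts[ngram] / total_tokens
    history = counts[ngram[:-1]]
    return counts[ngram] / history if history else 0.0

def interpolated(w1, w2, w3, l1=0.1, l2=0.2, l3=0.7):
    """Weighted mix of the unigram (l1), bigram (l2), and trigram (l3) estimates."""
    return l3 * mle((w1, w2, w3)) + l2 * mle((w2, w3)) + l1 * mle((w3,))

print(mle(("the", "dog", "ran")))          # trigram never seen
print(interpolated("the", "dog", "ran"))   # unigram term keeps this above zero
```

Here both the trigram and the bigram are unseen, so the interpolated estimate comes entirely from the unigram term. It is small but nonzero, which is the point of interpolating.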

In [23]:
import math
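def calculate_prob(text, unigram_prob=unigram_prob, bigram_prob=bigram_prob,
                   trigram_prob=trigram_prob, l1=0.1, l2=0.2, l3=0.7):
    """Interpolated trigram probability of a whitespace-tokenized sentence.

    l1, l2, and l3 weight the unigram, bigram, and trigram estimates.
    """
    words = text.split()
    last_two = ['<s>']  # up to two preceding words of context
    sum_ = 0  # running sum of log10 probabilities, to avoid underflow
    for w in words + ['</s>']:
        # Words absent from the training vocabulary are skipped and add no factor.
        # The input is not lowercased, although the training text was,
        # so capitalized words are skipped as well.
        if (w,) in unigram_prob:
            p1 = unigram_prob[(w,)]
            p2 = bigram_prob.get((last_two[-1], w), 0)
            if len(last_two) == 2:
                p3 = trigram_prob.get((last_two[-2], last_two[-1], w), 0)
            else:
                p3 = 0  # no trigram context yet for the first word
            p = l3*p3 + l2*p2 + l1*p1
            sum_ += math.log10(p)
        last_two.append(w)
        last_two = last_two[-2:]
    return 10**sum_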

In [24]:
test_text = 'Earlier in the session , the prices of several soybean contracts set new lows .'
calculate_prob(test_text)

9.496073294185911e-13