# Probabilistic language modeling

The probability of a sentence $S$ (as a sequence of words $w_i$) is: $P(S) = P(w_1, w_2, w_3, \ldots, w_n)$

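A closely related quantity is the conditional probability of a word given all previous words.

For a 4-word sentence, the joint probability and the conditional probability of the last word are, respectively:

$$ P(S)=P(w_1, w_2, w_3, w_4) \quad\text{and}\quad P(w_4 | w_1, w_2, w_3) $$

These are different quantities; the chain rule below relates them.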

The probability chain rule follows from the definition of conditional probability:

$$P(A|B) = \frac{P(A\cap B)}{P(B)} \implies P(A\cap B)=P(A|B)P(B)$$

So, in general, for a token sequence we get:

$$P(S)=P(w_1,\ldots,w_n) = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)\ldots P(w_n|w_1,\ldots,w_{n-1}) ={\displaystyle \prod_{i=1}^{n} P(w_i|w_1,\ldots,w_{i-1})} $$
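As a sanity check, the chain rule can be verified numerically. The sketch below is only illustrative: the four three-word `sentences` are made-up data, and `prefix_prob` and `chain_rule_prob` are hypothetical helper names, not part of any library.

```python
# Made-up toy data: each tuple is one observed three-word sentence.
sentences = [
    ("one", "fish", "two"),
    ("one", "fish", "red"),
    ("red", "fish", "blue"),
    ("one", "fish", "two"),
]

def prefix_prob(prefix):
    """Relative frequency of sentences that start with `prefix`."""
    n = len(prefix)
    hits = sum(1 for s in sentences if s[:n] == prefix)
    return hits / len(sentences)

def chain_rule_prob(sentence):
    """Multiply P(w_i | w_1..w_{i-1}) over all positions i."""
    prob = 1.0
    for i in range(1, len(sentence) + 1):
        # P(w_i | history) = P(history, w_i) / P(history)
        prob *= prefix_prob(sentence[:i]) / prefix_prob(sentence[:i - 1])
    return prob

s = ("one", "fish", "two")
joint = prefix_prob(s)          # direct estimate of P(w_1, w_2, w_3)
factored = chain_rule_prob(s)   # product of conditional probabilities
print(joint, factored)
```

Both printed numbers should agree up to floating-point rounding, because the intermediate prefix probabilities cancel in the product.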

To estimate each probability a straightforward solution could be to use simple counting.

$$ P(w_5|w_1,w_2,w_3,w_4)= \frac{count(w_1,w_2,w_3,w_4,w_5)}{count(w_1,w_2,w_3,w_4)}$$

but this gives us too many possible sequences to ever estimate. Imagine how much data (occurrences of each sentence) we would need to make these counts meaningful.
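To get a feeling for the scale, the following sketch counts how many distinct word sequences exist for a given length. The vocabulary size of 10,000 is an assumed, illustrative figure, and `num_possible_sequences` is a hypothetical helper.

```python
def num_possible_sequences(vocab_size: int, length: int) -> int:
    """Number of distinct token sequences of the given length."""
    return vocab_size ** length

vocab_size = 10_000  # assumed vocabulary size, for illustration only
for length in range(1, 6):
    print(length, num_possible_sequences(vocab_size, length))
```

The count grows exponentially with the sequence length, so most sequences will never be observed even in a very large corpus.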

To cope with this issue we can simplify by applying the __Markov Assumption__, which states that it is enough to pick only one, or a couple of previous words as a prefix:

$$P(w_1,\ldots,w_n) \approx {\displaystyle \prod_{i} P(w_i|w_{i-k},\ldots,w_{i-1})} $$

where k is the number of previous tokens (prefix size) that we consider.
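The effect of the prefix size can be sketched as a function that keeps only the last $k$ tokens of the history. The name `markov_context` is hypothetical, and the tokens are obtained with a plain `str.split()` to keep the example dependency-free.

```python
def markov_context(tokens, i, k):
    """Return the (at most) k tokens that precede position i."""
    return tuple(tokens[max(0, i - k):i])

tokens = "one fish two fish red fish blue fish".split()

# With k = 2 each word is conditioned on at most two previous words.
for i, word in enumerate(tokens):
    print(word, "|", markov_context(tokens, i, 2))
```

Near the start of the sequence fewer than $k$ previous tokens exist, so the context is shorter there.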

## N-grams

An n-gram is a contiguous sequence of $n$ items (order matters), which in this case are the words of a text. The size of the n-gram depends on the size of the prefix: an n-gram model conditions each word on the $n-1$ previous words. The simplest case is the __unigram model__.

#### (Uni-) 1-gram model 

The simplest case of the __Markov assumption__ is when the n-gram size is one, i.e. the prefix is empty ($k=0$) and each word is treated as independent of all others:

$$P(w_1,\ldots,w_n) \approx {\displaystyle \prod_{i} P(w_i)}$$

This gives us a model that considers only one word at a time. As a result, it produces a set of unrelated words: sentences generated from it would have random word order.
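To illustrate why word order is lost, here is a minimal sketch of a unigram model estimated from a toy corpus. It uses `collections.Counter` and `str.split()` as simplifying assumptions; `unigram_prob` is a hypothetical helper.

```python
import math
from collections import Counter

tokens = "one fish two fish red fish blue fish".split()

# Relative frequency of each word in the corpus.
counts = Counter(tokens)
total = sum(counts.values())
unigram = {word: c / total for word, c in counts.items()}

def unigram_prob(sentence):
    """Product of independent word probabilities (0.0 for unseen words)."""
    return math.prod(unigram.get(word, 0.0) for word in sentence)

p_forward = unigram_prob(["one", "fish"])
p_reversed = unigram_prob(["fish", "one"])
print(p_forward, p_reversed)
```

Both orderings receive the same probability, since the product does not depend on the position of the words.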

#### Bigram 
However, if we consider bigrams (2-word tandems), where we condition each word on the previous one, we get some sense of meaning between pairs of words.

$$ P(w_1,\ldots,w_n) \approx {\displaystyle \prod_{i} P(w_i|w_{i-1})} $$

This way we get a sequence of tandems that are meaningful pairwise. This is still not enough to capture the long-range dependencies of natural languages such as English. That would be difficult even with larger n-grams, as sentences can be very long and their dependencies can span many words. However, a 3-gram model can already give a pretty good approximation.
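To compare what a bigram and a 3-gram model actually see, a generic extractor can be sketched as follows. `extract_ngrams` is a hypothetical name chosen for this example (NLTK offers similar functionality, which is not used here to keep the sketch self-contained).

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "one fish two fish red fish blue fish".split()

bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)
print(bigrams[:3])
print(trigrams[:3])
```

Each 3-gram carries one more word of context than a bigram, at the price of fewer observations per distinct n-gram.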

## Estimate n-gram probabilities

Estimation can be done with Maximum Likelihood Estimate (MLE):

$$ P(w_i|w_{i-1})=\frac{count(w_{i-1},w_i)}{count(w_{i-1 })} $$

For example, for the 5th word:


$$ P(w_5|w_{4})=\frac{count(w_{4},w_5)}{count(w_{4 })} $$
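The estimate above can be sketched in a few lines of plain Python. This is an illustrative version using `collections.Counter`; `bigram_mle` is a hypothetical helper. One assumption to note: the denominator counts a word only when it is followed by another word, so the final token of the corpus is not counted as a context.

```python
from collections import Counter

tokens = "one fish two fish red fish blue fish".split()

# count(w_{i-1}, w_i): how often each pair of adjacent words occurs.
bigram_counts = Counter(zip(tokens, tokens[1:]))
# count(w_{i-1}): how often each word occurs as a context (has a successor).
context_counts = Counter(tokens[:-1])

def bigram_mle(prev, word):
    """MLE of P(word | prev); 0.0 if `prev` was never seen as a context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_mle("one", "fish"))
print(bigram_mle("fish", "two"))
```

With this choice of denominator the probabilities of all continuations of a given word sum to one.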

In practice, the result should be represented in __log__ form.

There are two reasons for this.

- Firstly, if the sentence is long and the probabilities are very small, their product may end in arithmetic underflow.

- Secondly, addition is faster than multiplication, and since $\log(a \cdot b) = \log(a)+\log(b)$, the product becomes a sum:

$$\log(P(w_1,\ldots,w_n)) \approx {\displaystyle \sum_{i} \log(P(w_i|w_{i-1}))}$$

This is why language models store probabilities as logarithmic values.
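The underflow problem is easy to reproduce. In the sketch below, the per-word probability of `1e-5` and the sentence length of 100 are assumed values chosen only to trigger the effect with 64-bit floats.

```python
import math

# Assumed values: 100 words, each with conditional probability 1e-5.
probs = [1e-5] * 100

# Naive product: the true value (1e-500) is below the smallest
# representable double, so the result collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity stays well within range.
log_prob = sum(math.log(p) for p in probs)

print(product)
print(log_prob)
```

The product carries no information once it reaches zero, while the log-probability can still be compared across sentences.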

In [1]:
import json
import nltk
from nltk.tokenize import word_tokenize

from nltk.probability import FreqDist
from nltk.probability import ConditionalFreqDist

In [2]:
corpus = "one fish two fish red fish blue fish"

In [3]:
fdist = FreqDist()

In [4]:
for word in word_tokenize(corpus):
    fdist[word.lower()] += 1

In [5]:
print(json.dumps(fdist, sort_keys=True, indent=4))

{
    "blue": 1,
    "fish": 4,
    "one": 1,
    "red": 1,
    "two": 1
}


In [6]:
list(nltk.bigrams(word_tokenize(corpus), pad_left=True, pad_right=True))

[(None, 'one'),
 ('one', 'fish'),
 ('fish', 'two'),
 ('two', 'fish'),
 ('fish', 'red'),
 ('red', 'fish'),
 ('fish', 'blue'),
 ('blue', 'fish'),
 ('fish', None)]

In [11]:
# Build the conditional frequency distribution from the unpadded bigrams
# (the padded variant shown above would also count the None boundary markers).
cfdist = nltk.ConditionalFreqDist(nltk.bigrams(word_tokenize(corpus)))

In [12]:
cfdist.keys()

dict_keys(['one', 'fish', 'two', 'red', 'blue'])

In [13]:
cfdist.values()

dict_values([FreqDist({'fish': 1}), FreqDist({'two': 1, 'red': 1, 'blue': 1}), FreqDist({'fish': 1}), FreqDist({'fish': 1}), FreqDist({'fish': 1})])

In [14]:
print(json.dumps(cfdist, sort_keys=True, indent=4))

{
    "blue": {
        "fish": 1
    },
    "fish": {
        "blue": 1,
        "red": 1,
        "two": 1
    },
    "one": {
        "fish": 1
    },
    "red": {
        "fish": 1
    },
    "two": {
        "fish": 1
    }
}


In [15]:
list(nltk.bigrams(word_tokenize(corpus)))

[('one', 'fish'),
 ('fish', 'two'),
 ('two', 'fish'),
 ('fish', 'red'),
 ('red', 'fish'),
 ('fish', 'blue'),
 ('blue', 'fish')]

In [16]:
[cfdist[a][b] for (a, b) in nltk.bigrams(word_tokenize(corpus))]

[1, 1, 1, 1, 1, 1, 1]

In [18]:
for (a, b) in nltk.bigrams(word_tokenize(corpus)):
    print(a)

one
fish
two
fish
red
fish
blue


In [17]:
[cfdist[a].N() for (a, b) in nltk.bigrams(word_tokenize(corpus))]


[1, 3, 1, 3, 1, 3, 1]
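The two lists printed above are the numerators and denominators of the bigram MLE for each position in the corpus. As a minimal sketch, with the values copied by hand from those outputs, they can be combined into per-bigram probabilities and a log-probability for the whole sequence (the first word is not scored, since the unpadded bigrams give it no context).

```python
import math

# Values copied from the notebook outputs above.
numerators = [1, 1, 1, 1, 1, 1, 1]    # count(w_{i-1}, w_i)
denominators = [1, 3, 1, 3, 1, 3, 1]  # count(w_{i-1}) as a context

# MLE of P(w_i | w_{i-1}) for every bigram in the corpus.
probs = [n / d for n, d in zip(numerators, denominators)]

# Log-probability of the sequence under the bigram model.
log_prob = sum(math.log(p) for p in probs)

print(probs)
print(log_prob, math.exp(log_prob))
```

Only the three bigrams that start with "fish" are uncertain, so the sequence probability reduces to $(1/3)^3$.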