# Language Model

A language model assigns a probability to a sentence.

Probability of a sequence of words: $Pr(W) = Pr(w_1, w_2, \dots w_n)$

* $Pr(\text{Who is John Doe})$

Probability of upcoming word: $Pr(w_n | w_1, w_2, \dots w_{n-1})$

* $Pr(\text{Doe} | \text{Who is John})$


## Application
* Spelling Correction
* Speech recognition
* Grammatical Error Correction
* Machine Translation

# n-grams
An n-gram is a sequence of n items (usually words) in a given text.

## Chain rule
By chain rule of probability, we know that

$$
Pr(A_1, \dots , A_N) = Pr(A_1) \times Pr(A_2 | A_1) \dots Pr(A_N | A_{1:N-1}) = \prod ^ N _{i=1} Pr (A_i | A_{1:i-1})
$$

where $A_{i:j}$ denotes the sequence $A_i, A_{i+1}, \dots, A_j$.

We can estimate each $Pr (A_i | A_{1:i-1})$ using the Maximum Likelihood Estimate (MLE):

$$
Pr (A_i | A_{1:i-1})= \frac{Pr (A_{1:i})}{Pr (A_{1:i-1})} \approx \frac{\text{Count($A_{1:i}$)}} {\text{Count($A_{1:i-1}$)}}
$$

Thus, to compute $Pr(\text{Who is John Doe})$ 

$= Pr(\text{Who}) \times Pr(\text{is | Who}) \times Pr(\text{John | Who is}) \times Pr(\text{Doe | Who is John})$

And we estimate $Pr(\text{is|Who}) \approx \frac{\text{Count(Who is)}} {\text{Count(Who)}}$

$Pr(\text{John |Who is}) \approx \frac{\text{Count(Who is John)}} {\text{Count(Who is)}}$

So on and so forth...

### Issues
However, consider $\frac{\text{Count($A_{1:i}$)}} {\text{Count($A_{1:i-1}$)}}$ when the sentence is much longer.

A long sentence makes our joint probability table very large, since the number of possible word sequences grows exponentially with their length.

Worse still, if the sequence that we wish to score has not been seen before (which is very likely once the sentence is long), then the count in the numerator is 0.
Every such sequence then gets 0 probability, no matter how plausible it is!
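
To make the sparsity problem concrete, here is a minimal sketch of the full-history estimate on a toy corpus. The corpus, the lowercase whitespace tokenization, and the function names `count_all_ngrams` and `chain_rule_probability` are assumptions made for illustration only.

```python
from collections import Counter

def count_all_ngrams(sentences):
    """Count every contiguous word sequence (of any length) in a toy corpus."""
    counts = Counter()
    total_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        total_tokens += len(words)
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                counts[tuple(words[start:end])] += 1
    return counts, total_tokens

def chain_rule_probability(sentence, counts, total_tokens):
    """Pr(w_1 .. w_n) via the chain rule with full-history MLE estimates."""
    words = tuple(sentence.split())
    probability = counts[words[:1]] / total_tokens  # Pr(w_1) from word frequency
    for i in range(2, len(words) + 1):
        history_count = counts[words[:i - 1]]
        if history_count == 0:
            return 0.0  # the history itself was never seen
        probability *= counts[words[:i]] / history_count
    return probability

# Hypothetical toy corpus, lowercased and split on whitespace.
corpus = ["who is john doe", "who is jane doe", "john is here"]
counts, total_tokens = count_all_ngrams(corpus)
seen = chain_rule_probability("who is john doe", counts, total_tokens)
unseen = chain_rule_probability("who is here", counts, total_tokens)
```

Here `seen` is non-zero because the whole sentence occurs in the corpus, while `unseen` is exactly 0 even though every word in it is familiar.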

## Markov Assumption
Approximate the probability of a sequence of words by assuming that each word depends only on the **last k words**.

$$
Pr(A_1, \dots , A_N) \approx  \prod ^ N _{i=1} Pr (A_i | A_{i-k:i-1})
$$

### n-Gram model
Thus, given an n-gram model, we predict the next word using only the n-1 preceding words.

Unigram $\rightarrow$ 0 preceding words (purely based on word frequency)

Bigram $\rightarrow$ 1 preceding word (based on word frequency given the previous word)

Trigram $\rightarrow$ 2 preceding words (based on word frequency given the previous 2 words)

### MLE on n-grams
Suppose we wish to compute the MLE of $Pr(A_i | A_{i-k:i-1})$.

We simply count the times that the desired n-gram appears, and divide by the total count of $A_{i-k:i-1} w$ summed over every possible word $w$ in the corpus.

$$
Pr(A_i | A_{i-k:i-1}) = \frac{\text{Count($A_{i-k:i}$)}} {\sum _ w\text{Count($A_{i-k:i-1} w$)}} = \frac{\text{Count($A_{i-k:i}$)}} {\text{Count($A_{i-k:i-1}$)}}
$$

Where the simplification comes from realizing that the denominator is simply the count of our preceding k-gram in the corpus.

Since the left side gives the probability of $A_i$ appearing after a sequence of k words, setting $n = k+1$ (the order of the n-gram model), we get


$$
Pr(A_i | A_{i-n+1:i-1}) = \frac{\text{Count($A_{i-n+1:i}$)}} {\text{Count($A_{i-n+1:i}$)}}
$$
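
The counting described above can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, a toy corpus, and the hypothetical names `train_ngram_counts` and `mle_probability`. The context count is accumulated as $\sum_w \text{Count}(\text{context } w)$, which matches the first denominator in the formula.

```python
from collections import Counter

def train_ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-word contexts in a toy corpus."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            ngram = tuple(words[i:i + n])
            ngram_counts[ngram] += 1
            # Summing over every word that follows the context.
            context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def mle_probability(word, context, ngram_counts, context_counts):
    """MLE of Pr(word | context); returns 0.0 for an unseen context."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]

# Hypothetical toy corpus.
corpus = ["who is john doe", "who is jane doe", "john is here"]
bigrams, bigram_contexts = train_ngram_counts(corpus, n=2)
trigrams, trigram_contexts = train_ngram_counts(corpus, n=3)
p_bigram = mle_probability("john", ["is"], bigrams, bigram_contexts)
p_trigram = mle_probability("john", ["who", "is"], trigrams, trigram_contexts)
```

In this corpus "is" is followed by three different words, so `p_bigram` is 1/3, while "who is" is followed by "john" in one of its two occurrences, so `p_trigram` is 1/2.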

### Sentence Marking
To make the model more robust, we can reserve special tokens that mark the start and end of each sentence.
This lets the model learn which words are likely to start or end a sentence.

More importantly, the end marker ensures that sentences of different lengths draw from the same probability distribution, so that the probabilities of all sentences sum to one.

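
A minimal sketch of such padding, assuming the conventional token names `<s>` and `</s>` (the text does not fix them) and a hypothetical helper `add_sentence_markers`. An n-gram model needs n-1 start markers so that the first real word has a full context.

```python
def add_sentence_markers(words, n, start="<s>", end="</s>"):
    """Pad a tokenized sentence with n-1 start markers and one end marker."""
    return [start] * (n - 1) + list(words) + [end]

# Bigram model: one start marker, one end marker.
padded = add_sentence_markers(["who", "is", "john", "doe"], n=2)
bigrams = list(zip(padded, padded[1:]))
```

The first bigram is then `("<s>", "who")` and the last is `("doe", "</s>")`, so sentence-initial and sentence-final behaviour is counted like any other bigram.
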
### Underflow <a id="underflow"></a>
Multiplying many probabilities may produce a very small number, which can cause the arithmetic to underflow.

Using the fact that 
$$\log (P_1 \times P_2 \times\dots \times P_n) = \log P_1 + \log P_2 +\dots+ \log P_n$$

we can calculate the sum of the logs instead, where we get a negative sum instead of a product that approaches 0.
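
The effect is easy to reproduce. The sketch below uses made-up probabilities (100 words, each with probability $10^{-5}$) and a hypothetical helper `log_probability`; it assumes every probability is strictly positive, since `math.log(0)` raises an error.

```python
import math

def log_probability(probabilities):
    """Sum of log probabilities, equal to the log of their product."""
    return sum(math.log(p) for p in probabilities)

# Made-up example: 100 words, each with probability 1e-5.
word_probabilities = [1e-5] * 100

# The true product is 1e-500, far below the smallest positive float.
product = math.prod(word_probabilities)
log_sum = log_probability(word_probabilities)
```

`product` underflows to `0.0`, while `log_sum` stays a perfectly representable negative number (roughly -1151), so sentences can still be compared.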

# Unknown Words <a id="unknown_words"></a>
### Closed Vocabulary
* Vocabulary is fixed
* All datasets contain only words from this vocabulary
* No unknown words

### Open Vocabulary
* Test set may contain out-of-vocabulary (OOV) words

This may lead to counts of certain words being 0, leading to 0 probability again!

## Handling Unknown Words
### Substitution
1. Decide on a vocabulary list
2. Convert all unknown words to a special token `<UNK>` during normalization.
3. Treat the token as a regular word
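
The three steps above can be sketched as follows. Choosing the vocabulary by a minimum frequency (`min_count`) is an assumption, just one common way to "decide on a vocabulary list"; the function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(sentences, min_count=2):
    """Keep words that occur at least min_count times (assumed criterion)."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, count in counts.items() if count >= min_count}

def replace_unknown(sentence, vocabulary, unk="<UNK>"):
    """Map every word outside the vocabulary to the <UNK> token."""
    return [word if word in vocabulary else unk for word in sentence.split()]

# Hypothetical toy corpus.
corpus = ["who is john doe", "who is jane doe", "john is here"]
vocabulary = build_vocabulary(corpus, min_count=2)
normalized = replace_unknown("who is mary doe", vocabulary)
```

After normalization, `<UNK>` is counted and smoothed like any other word, so the training data should be passed through `replace_unknown` as well.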

### Subword Morphological Processing
Refer to [Morphology](./words.ipynb#morphology)

### Smoothing
Since the main issue is assigning a probability to a word or n-gram that we have never seen before, the simple solution is to give a non-zero probability to unseen events.

Smoothing is the act of reallocating the probabilities of n-grams such that all of them are non-zero.

Discounting is similar: it reallocates the counts of the n-grams such that all of them are non-zero.

#### Laplace Smoothing
Simply add 1 to all n-gram counts.


For bigrams, it will be

$$
Pr(w_n | w_{n-1}) = \frac{C(w_{n-1}w_n) + 1} {\sum _w (C(w_{n-1}w) + 1)}= \frac{C(w_{n-1}w_n) + 1} {C(w_{n-1}) + V}
$$

where $V$ is the vocabulary size, i.e. the number of distinct words $w$ in the sum.

#### Laplace Discounted Count
Equivalently, replace each n-gram count with a discounted count $C^*$ that gives the smoothed probability when divided by the original denominator.

For bigrams, it will be 

$$
Pr(w_n | w_{n-1}) = \frac{C(w_{n-1}w_n) + 1} {C(w_{n-1}) + V} = 
\frac{C^*(w_{n-1}w_n)}{C(w_{n-1})}
$$

$$
\Rightarrow C^*(w_{n-1}w_n) = C(w_{n-1}) \times \frac{C(w_{n-1}w_n) + 1} {C(w_{n-1}) + V}
$$
where $C^*(w_{n-1}w_n)$ is the new discounted count for each bigram.

#### Laplace Discount
The discount $d_c$ is the ratio of the discounted count to the original count:

$$
d_c = \frac{C^*(w_{n-1}w_n)}{C(w_{n-1}w_n)} = \frac{C(w_{n-1})}{C(w_{n-1}w_n)} \times \frac{C(w_{n-1}w_n) + 1} {C(w_{n-1}) + V}
$$
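
A small numeric sketch of the discounted count and the discount. The counts used (a bigram seen 8 times, a context seen 10 times, a vocabulary of 10 words) are made up for illustration, and the function names are hypothetical.

```python
def laplace_adjusted_count(bigram_count, context_count, vocab_size):
    """C*(w_{n-1} w_n) = C(w_{n-1}) * (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return context_count * (bigram_count + 1) / (context_count + vocab_size)

def laplace_discount(bigram_count, context_count, vocab_size):
    """d_c = C* / C, defined only for bigrams that were actually seen."""
    if bigram_count == 0:
        raise ValueError("discount is undefined for an unseen bigram")
    adjusted = laplace_adjusted_count(bigram_count, context_count, vocab_size)
    return adjusted / bigram_count

# Made-up counts: bigram seen 8 times, context seen 10 times, V = 10.
adjusted = laplace_adjusted_count(8, 10, 10)
discount = laplace_discount(8, 10, 10)
```

With these numbers the count drops from 8 to 4.5, a discount of 0.5625, which shows how much mass add-one smoothing can take away from frequent bigrams when $V$ is large relative to the context count.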

#### Add-k Smoothing
Add k to all n-gram counts.

This is a generalization of Laplace smoothing, which is the case $k = 1$.

For bigrams, it will be

$$
Pr(w_n | w_{n-1}) = \frac{C(w_{n-1}w_n) + k} {C(w_{n-1}) + kV}
$$
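
A minimal sketch of add-k smoothing for bigrams, which covers Laplace smoothing when `k=1`. The toy corpus and the name `add_k_bigram_model` are assumptions, and $V$ is taken to be the number of distinct words in the training data.

```python
from collections import Counter

def add_k_bigram_model(sentences, k=1.0):
    """Return an add-k smoothed bigram probability function and the vocabulary."""
    bigram_counts = Counter()
    context_counts = Counter()
    vocabulary = set()
    for sentence in sentences:
        words = sentence.split()
        vocabulary.update(words)
        for previous, word in zip(words, words[1:]):
            bigram_counts[(previous, word)] += 1
            context_counts[previous] += 1
    vocab_size = len(vocabulary)

    def probability(word, previous):
        numerator = bigram_counts[(previous, word)] + k
        denominator = context_counts[previous] + k * vocab_size
        return numerator / denominator

    return probability, vocabulary

# Hypothetical toy corpus.
corpus = ["who is john doe", "who is jane doe", "john is here"]
laplace, vocabulary = add_k_bigram_model(corpus, k=1.0)
p_seen = laplace("john", "is")    # bigram seen once
p_unseen = laplace("doe", "is")   # bigram never seen, still non-zero
total = sum(laplace(word, "is") for word in vocabulary)
```

The unseen bigram now has a small positive probability, and `total` is 1 (up to floating-point rounding), so the smoothed values still form a distribution over the vocabulary.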

### Backoff
Suppose we have an n-gram model.

If the n-gram yields 0 probability for a certain sequence, we can use the (n-1)-gram for that sequence instead.

We repeatedly back off to smaller k-grams until we find one that has been seen.
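
A minimal sketch of this idea, using plain MLE estimates at each order. Note the assumption: this naive version does not discount the higher-order estimates, so the resulting values need not sum to one; methods such as Katz backoff add discounting to fix that. The function names and toy corpus are hypothetical.

```python
from collections import Counter

def count_ngrams(sentences, max_n):
    """Count all n-grams of order 1..max_n in a toy corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def backoff_probability(word, context, counts, total_tokens):
    """Use the longest context whose n-gram was seen, else the unigram."""
    context = tuple(context)
    while context:
        if counts[context + (word,)] > 0:
            return counts[context + (word,)] / counts[context]
        context = context[1:]  # drop the oldest word and retry
    return counts[(word,)] / total_tokens

# Hypothetical toy corpus with 11 tokens in total.
corpus = ["who is john doe", "who is jane doe", "john is here"]
counts = count_ngrams(corpus, max_n=3)
total_tokens = sum(len(sentence.split()) for sentence in corpus)
p_trigram = backoff_probability("john", ["who", "is"], counts, total_tokens)
p_bigram = backoff_probability("here", ["who", "is"], counts, total_tokens)
p_unigram = backoff_probability("doe", ["who", "is"], counts, total_tokens)
```

"who is john" is in the corpus, so the trigram estimate is used directly; "who is here" is not, so the model falls back to the bigram "is here"; "is doe" is also unseen, so the last case falls all the way back to the unigram frequency of "doe".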

### Interpolation
Instead of choosing a single order, combine the probability estimates from n-grams of different orders as a weighted sum.

For example, combining bigram and unigram:

$$
\hat{Pr}(w_1 | w_0) = \lambda_1 Pr(w_1 | w_0) +  \lambda_2 Pr(w_1)
$$

where $\sum_i \lambda _i = 1$.
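
A minimal sketch of the bigram and unigram combination. The weight `0.7` is an arbitrary placeholder, not a recommended value; in practice the weights are usually tuned on held-out data. The function name is hypothetical.

```python
def interpolated_probability(bigram_prob, unigram_prob, lambda_bigram=0.7):
    """Weighted sum of bigram and unigram estimates; the weights sum to 1."""
    if not 0.0 <= lambda_bigram <= 1.0:
        raise ValueError("lambda_bigram must lie in [0, 1]")
    lambda_unigram = 1.0 - lambda_bigram
    return lambda_bigram * bigram_prob + lambda_unigram * unigram_prob

# An unseen bigram (probability 0) still gets mass from the unigram estimate.
p_unseen_bigram = interpolated_probability(bigram_prob=0.0, unigram_prob=0.2)
p_seen_bigram = interpolated_probability(bigram_prob=0.5, unigram_prob=0.1)
```

Unlike backoff, the lower-order estimate always contributes, even when the bigram has been seen.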

### Kneser-Ney Smoothing
Researchers counted the bigrams in a training set and compared these counts with the counts of the same bigrams in a testing set.
They found that, for bigrams that appear more than once in the training set, the count in the testing set is consistently about 0.75 lower than the count in the training set.

Thus, we get the following smoothing method:

Discount the counts of seen n-grams by a fixed amount $\delta$ and distribute the freed probability mass to the unseen n-grams.

$$
Pr_{seen}(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n) - \delta}{C(w_{n-1})}
$$

For bigrams, $\delta = 0.75$ is the usual choice.


$$
Pr_{unseen}(w_n | w_{n-1}) = \lambda \times \frac{\text{Number of bigram types that end with $w_n$}}{\text{Number of seen bigram types}}
$$

where $\lambda$ is the interpolation factor, which depends on $w_{n-1}$ and is chosen so that the probabilities sum to one.
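
The sketch below uses the commonly taught interpolated form for bigrams, which is an assumption that goes slightly beyond the two-case presentation above: the discounted term and the continuation term are added together for every bigram, with $\lambda(w_{n-1}) = \frac{\delta}{C(w_{n-1})} \times$ (number of distinct words seen after $w_{n-1}$). The function name and toy corpus are hypothetical.

```python
from collections import Counter

def train_kneser_ney_bigram(sentences, delta=0.75):
    """Return an interpolated Kneser-Ney bigram probability function."""
    bigram_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        bigram_counts.update(zip(words, words[1:]))

    context_counts = Counter()  # C(w_{n-1}) counted as a bigram context
    followers = {}              # distinct words seen after each context
    predecessors = {}           # distinct words seen before each word
    for (previous, word), count in bigram_counts.items():
        context_counts[previous] += count
        followers.setdefault(previous, set()).add(word)
        predecessors.setdefault(word, set()).add(previous)
    num_bigram_types = len(bigram_counts)

    def probability(word, previous):
        # Continuation probability: share of bigram types ending in `word`.
        continuation = len(predecessors.get(word, ())) / num_bigram_types
        if context_counts[previous] == 0:
            return continuation  # unseen context: rely on continuation only
        discounted = max(bigram_counts[(previous, word)] - delta, 0.0)
        discounted /= context_counts[previous]
        lam = delta * len(followers[previous]) / context_counts[previous]
        return discounted + lam * continuation

    return probability

# Hypothetical toy corpus.
corpus = ["who is john doe", "who is jane doe", "john is here"]
kneser_ney = train_kneser_ney_bigram(corpus)
vocabulary = {word for sentence in corpus for word in sentence.split()}
p_seen = kneser_ney("john", "is")
p_unseen = kneser_ney("doe", "is")
total = sum(kneser_ney(word, "is") for word in vocabulary)
```

The unseen bigram "is doe" receives probability purely from the continuation term, because "doe" follows two different words in the corpus, and `total` comes out as 1 up to rounding.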

## Model Evaluation
A reasonable model should assign higher probabilities to more frequent sentences and lower probabilities to rarer sentences.


### Types of Evaluation
* Intrinsic Evaluation
    * Use an intrinsic metric to evaluate the model (e.g. **perplexity**)
    * Easier and faster

* Extrinsic Evaluation
    * Use a downstream task
    * Expensive and slower

#### Intrinsic Evaluation
1. Train the model on a training set
2. Tune the parameters using a development set
3. Test the model on a test set. Use an evaluation metric to assess performance.

Common data breakdown:
* 80% for training
* 10% for development
* 10% for test set
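
One way to produce such a split is sketched below. The shuffling, the fixed seed, and the name `split_corpus` are assumptions for illustration; some corpora are better split by document or by date rather than by random sentence.

```python
import random

def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle and split sentences into train, development, and test sets."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test

# Hypothetical corpus of 100 placeholder sentences.
sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(sentences)
```

The three parts are disjoint and together cover the whole corpus, so the test set is never seen during training or tuning.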

#### Perplexity
The inverse probability that the model assigns to the test set, normalized by the number of words, denoted as $PP(W)$.

Since the test set contains actual sentences from our data set, we expect a good model to give a high probability to the sentences in it. 

Since $PP(W)$ decreases as $Pr(W)$ increases, the higher the probability, the lower the perplexity.

For a test set $W = w_1 \dots w_N$:

$$
PP(W) = \sqrt[N]{ \frac{1}{Pr(w_1w_2\dots w_N)}}
$$

$$
= \sqrt[N]{\prod _{i=1} ^ N \frac{1}{Pr(w_i | w_1 w_2\dots w_{i-1})}}
$$

$$
\approx \prod _{i=1} ^ N  \sqrt[N]{\frac{1}{Pr(w_i | w_{i-n+1} \dots w_{i-1})}}
\quad \text{using an n-gram model}
$$



If the test set is long, the model will naturally assign it a smaller probability, because more terms below 1 are multiplied together.
Thus, we take the N-th root of the product to normalize against the size of our test set.

We can also view perplexity as the weighted average branching factor, which is the number of possible words that our model thinks can follow any word.
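
A minimal sketch of the computation, done in log space to avoid underflow. It assumes the per-word conditional probabilities have already been obtained from some model and are all strictly positive; the name `perplexity` and the example values are hypothetical.

```python
import math

def perplexity(word_probabilities):
    """PP(W) = (prod 1/p_i)^(1/N), computed as exp(-mean(log p_i))."""
    n = len(word_probabilities)
    if n == 0:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / n)

# A model that is uniform over 10 words assigns 0.1 to every word.
uniform_pp = perplexity([0.1] * 25)
# A more confident model gets a lower perplexity on the same text.
confident_pp = perplexity([0.5] * 25)
```

The uniform case illustrates the branching-factor view: when the model considers 10 words equally likely at every step, the perplexity is 10 regardless of the length of the test set.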