# N-gram model

## Markov assumption

**Markov assumption** (limited context): the probability of the next word $w_{n+1}$ depends only on a small, fixed number of preceding words (0, 1, or 2 words for n-gram models of order 1, 2, or 3), instead of the entire history.

- let $\pi_n: V^n \rightarrow C$ be a mapping from a word sequence of length $n$ to a finite set $C$ of contexts; the language model then becomes

    $$
    p(w_{n+1}|w_1,...,w_n)=p(w_{n+1}|\pi_n(w_1,...,w_n))
    $$
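
As an illustration of the mapping $\pi_n$, the sketch below truncates a history to the last `order - 1` words. The function name `context` and the example sentence are assumptions made for this example, not part of the notes.

```python
from typing import Sequence, Tuple

def context(history: Sequence[str], order: int) -> Tuple[str, ...]:
    """Map a full history to the context used by an n-gram model of the given order."""
    k = order - 1          # an n-gram model conditions on the previous n - 1 words
    if k <= 0:
        return ()          # unigram model: empty context
    return tuple(history[-k:])

history = ["the", "cat", "sat", "on"]
print(context(history, 1))  # unigram context
print(context(history, 2))  # bigram context: last word
print(context(history, 3))  # trigram context: last two words
```

Histories that end in the same words map to the same context, which is what makes the set $C$ finite.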

## unigram, bigram, trigram model

In unigram, bigram, and trigram models, the number of parameters grows as $O(|V|)$, $O(|V|^2)$, and $O(|V|^3)$, respectively.

- unigrams: words are independent of each other, no context

    $$
    \pi(w_1,...,w_n)=\varnothing \\[1em]
    P(seq) =\prod_{i=1}^n p(w_i)
    $$

    Number of parameters is $O(|V|)$, as each word in the vocabulary has its own probability

- bigrams: context is previous word $w_n$

    $$
    \pi(w_1,...,w_n)=w_n\\[1em]
    P(seq) =\prod_{i=1}^n p(w_i|w_{i-1})
    $$

    Number of parameters is $O(|V|^2)$, as there is a probability for each pair of words (first word and its successor) in the vocabulary.

   
- trigrams: context is last 2 words $w_{n-1}, w_n$

    $$
    \pi(w_1,...,w_n)=(w_{n-1}, w_n)
    \\[1em]
    P(seq) =\prod_{i=1}^n p(w_i|w_{i-1}, w_{i-2})
    $$

    Number of parameters is $O(|V|^3)$, as there is a probability for each combination of 3 consecutive words in the vocabulary.

- longer n-grams suffer from sparseness: many n-grams never appear in the training data.

- for comparison, LDA has $O(K \cdot |V|)$ parameters, where $K$ is the number of topics in a corpus,

    because each topic has a probability distribution over the words in the vocabulary.
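
To make the growth rates in the list above concrete, the sketch below counts the entries of a full n-gram table and bounds how many trigram entries a corpus can cover. The vocabulary size and corpus length are assumed values chosen only for illustration.

```python
def ngram_table_size(vocab_size: int, order: int) -> int:
    """Number of entries p(w | context) in a full n-gram table of the given order."""
    return vocab_size ** order

vocab_size = 10_000        # assumed vocabulary size
corpus_tokens = 1_000_000  # assumed corpus length in tokens

sizes = {order: ngram_table_size(vocab_size, order) for order in (1, 2, 3)}

# A corpus of N tokens contains at most N - 2 distinct trigrams,
# which bounds the fraction of trigram table entries that can be observed.
max_trigram_coverage = (corpus_tokens - 2) / sizes[3]

for order, size in sizes.items():
    print(f"order {order}: {size:,} table entries")
print(f"at most {max_trigram_coverage:.1e} of trigram entries can be observed")
```

Under these assumed numbers almost every trigram entry has a zero count, which is the sparseness problem noted above.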

## estimate probabilities for an n-gram model

| Method               | Short Description                                         | Formula                                                                                     | Pros                                                       | Cons                                        |
|----------------------|-----------------------------------------------------------|----------------------------------------------------------------------------------------------|------------------------------------------------------------|---------------------------------------------|
| Naive MLE            | Maximum Likelihood Estimation                             | $P(w_i \| w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$                                      | Simple, easy to compute                                     | Zero probabilities for unseen n-grams       |
| Additive Smoothing   | Laplace or Lidstone smoothing                             | $P(w_i \| w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha \|V\|}$                 | Handles unseen n-grams, easy to compute                    | Requires tuning of smoothing constant       |
| Back-off & Interpolation | Combining n-gram probabilities of different orders | $P_{\text{interp}}(w_i \| w_{i-1}, w_{i-2}) = \lambda_1 P(w_i \| w_{i-1}, w_{i-2}) + \lambda_2 P(w_i \| w_{i-1}) + \lambda_3 P(w_i)$ | Better generalization, handles unseen n-grams | Requires tuning of interpolation weights    |
| Bayesian Smoothing   | Smoothing with Dirichlet priors                           | $P(w_i \| w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha P_{\text{base}}(w_i \| w_{i-1})}{C(w_{i-1}) + \alpha}$ | Principled Bayesian approach, handles unseen n-grams     | Requires tuning of concentration parameter  |


### MLE

Maximum likelihood estimation (MLE) is the simplest way to estimate the probabilities.

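For a trigram model, the probability of a word given its two preceding words is calculated as the count of the trigram divided by the count of the preceding two words:

$$
\hat p(w_3 | w_1, w_2) =\frac{p(w_1,w_2,w_3)}{p(w_1,w_2)}= \frac{\text{Count}(w_1,w_2,w_3) / N_{\text{trigram}}}{\text{Count}(w_1,w_2) / N_{\text{bigram}}} \approx\frac{\text{Count}(w_1,w_2,w_3)}{\text{Count}(w_1,w_2)}
$$

where $N_{\text{trigram}}$ and $N_{\text{bigram}}$ are the total numbers of trigrams and bigrams in the corpus. The approximation holds because these two totals are nearly equal for a large corpus.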
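
A minimal sketch of this count ratio on a toy corpus. The corpus and the helper names `ngrams` and `mle_trigram` are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat the cat ate".split()  # toy corpus
trigram_counts = Counter(ngrams(tokens, 3))
bigram_counts = Counter(ngrams(tokens, 2))

def mle_trigram(w1, w2, w3):
    """MLE estimate Count(w1, w2, w3) / Count(w1, w2); 0.0 if the context is unseen."""
    context_count = bigram_counts[(w1, w2)]
    if context_count == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context_count

print(mle_trigram("the", "cat", "sat"))  # seen trigram
print(mle_trigram("the", "cat", "mat"))  # unseen trigram gets probability zero
```

Returning 0.0 for an unseen context is a choice made for this sketch; the ratio itself is undefined in that case.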

### Additive Smoothing

- problem: MLE assigns zero probability to n-grams not observed in the training data.

- solution: add a small constant to every count so that no probability is zero.

    $$
    \hat p(w_3 | w_1, w_2) = \frac{\text{Count}(w_1,w_2,w_3)+k}{\text{Count}(w_1,w_2)+k|V|}
    $$

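    $0<k\leq 1$ is the smoothing constant added to the count of each n-gram, and $|V|$ is the vocabulary size. $k=1$ gives Laplace (add-one) smoothing; $k<1$ gives Lidstone (add-k) smoothing.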
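
The add-k formula can be sketched as follows. The toy corpus and the function name `add_k_trigram` are assumptions for illustration, and the vocabulary is taken to be the set of words in the corpus.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ate".split()  # toy corpus
vocab = set(tokens)
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))

def add_k_trigram(w1, w2, w3, k=1.0):
    """Additive smoothing: (Count(w1,w2,w3) + k) / (Count(w1,w2) + k * |V|)."""
    numerator = trigram_counts[(w1, w2, w3)] + k
    denominator = bigram_counts[(w1, w2)] + k * len(vocab)
    return numerator / denominator

print(add_k_trigram("the", "cat", "sat"))         # seen trigram, k = 1 (Laplace)
print(add_k_trigram("the", "cat", "mat"))         # unseen trigram is now non-zero
print(add_k_trigram("the", "cat", "mat", k=0.1))  # smaller k moves less mass to unseen events
```

For the context `("the", "cat")` the smoothed probabilities over the vocabulary still sum to 1, since $k|V|$ is added to the denominator.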


### backoff and interpolation

- problem: both Laplace (add-one) and Lidstone (add-k) smoothing work poorly because they assign too much probability mass to unseen trigrams.

- solution: a mixture model combining the probabilities of different order n-grams to estimate the probability of the next word.

    $$
    \begin{align}
    p(w_3 | w_1, w_2) 
    &=\text{trigram model + bigram model + unigram model} \\[1em]
    &= \lambda _1 \hat p(w_3 | w_1, w_2)  + \lambda _2 \hat p(w_3 | w_2) +\lambda _3 \hat p(w_3 ) \\[1em]
    \end{align}
    $$
    where $\lambda_1, \lambda_2, \lambda_3$ are the interpolation weights that sum to 1.

    - backoff: if a higher-order n-gram has a non-zero count, its probability is used; otherwise, the model "backs off" to a lower-order n-gram. 

    - interpolation: a weighted combination of probabilities from different order n-grams is used.
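
Both strategies can be sketched on a toy corpus. The corpus, the weights `(0.6, 0.3, 0.1)`, and the function names are assumptions for illustration; in practice the weights would be tuned on held-out data. The back-off variant here applies no discounting, so unlike Katz back-off it does not yield a normalized distribution.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ate".split()  # toy corpus
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

def ratio(numerator, denominator):
    """Count ratio that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def mle_estimates(w1, w2, w3):
    """MLE trigram, bigram, and unigram estimates for w3."""
    p_tri = ratio(trigram_counts[(w1, w2, w3)], bigram_counts[(w1, w2)])
    p_bi = ratio(bigram_counts[(w2, w3)], unigram_counts[w2])
    p_uni = ratio(unigram_counts[w3], len(tokens))
    return p_tri, p_bi, p_uni

def interpolated(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    """Weighted mixture of the three estimates; the weights must sum to 1."""
    p_tri, p_bi, p_uni = mle_estimates(w1, w2, w3)
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

def naive_backoff(w1, w2, w3):
    """Use the highest-order estimate with a non-zero count (no discounting)."""
    p_tri, p_bi, p_uni = mle_estimates(w1, w2, w3)
    if p_tri > 0:
        return p_tri
    if p_bi > 0:
        return p_bi
    return p_uni

print(interpolated("the", "cat", "mat"))   # unseen trigram and bigram, still non-zero
print(naive_backoff("on", "cat", "sat"))   # unseen trigram, falls back to the bigram
```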
     


### Bayesian smoothing with Dirichlet prior

A Dirichlet distribution is used as a prior over the n-gram probability distribution.

For a trigram model, we can estimate the probability of a word given its two preceding words by incorporating a Dirichlet prior whose mean is the base distribution $p_{\text{base}}(w_3 | w_1, w_2)$, usually derived from a lower-order n-gram model (e.g., bigram):

$$
\hat p(w_3 | w_1, w_2) = \frac{\text{Count}(w_1,w_2,w_3)+\alpha p_{\text{base}}(w_3 | w_1, w_2)}{\text{Count}(w_1,w_2)+\alpha}
$$

$\alpha$ is the concentration parameter of the Dirichlet prior; it controls the strength of this prior belief.

- When $\alpha$ is large, the prior belief has more influence on the estimated probability, making the model more conservative. 

- When $\alpha$ is small, the observed data has more influence, and the model becomes more adaptive.
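
The effect of $\alpha$ can be seen in a small sketch. For simplicity the base distribution here is the unigram relative frequency, which is an assumption of this example; any proper lower-order distribution (such as a smoothed bigram model) could be substituted. The corpus and function names are also illustrative.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ate".split()  # toy corpus
vocab = set(tokens)
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_base(w3):
    """Assumed base distribution: unigram relative frequency."""
    return unigram_counts[w3] / len(tokens)

def dirichlet_trigram(w1, w2, w3, alpha=1.0):
    """(Count(w1,w2,w3) + alpha * p_base) / (Count(w1,w2) + alpha)."""
    numerator = trigram_counts[(w1, w2, w3)] + alpha * p_base(w3)
    denominator = bigram_counts[(w1, w2)] + alpha
    return numerator / denominator

for alpha in (0.01, 1.0, 100.0):
    print(alpha, dirichlet_trigram("the", "cat", "sat", alpha))
```

With small `alpha` the estimate stays close to the MLE value, and with large `alpha` it approaches `p_base("sat")`. For a context that was never observed, the estimate equals the base probability for any `alpha`.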

## class-based N-gram model

A class-based n-gram model is a variation of the standard n-gram language model.

In a class-based n-gram model, **words are first clustered into classes** based on semantic or syntactic similarities, and then probabilities are estimated for the **class sequences** instead of individual words.

The probability of a word $w_{n+1}$ given its context is factorized into the probability of the class $c_{n+1}$ given the context $c_1, ..., c_n$ and the probability of the word $w_{n+1}$ given the class $c_{n+1}$:

$$
p(w_{n+1} | w_1, ..., w_n) = p(c_{n+1}|c_1, ..., c_n) p(w_{n+1}|c_{n+1})
$$

where $c_i$ is the class of word $w_i$ and $n$ is the context length.
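
A minimal sketch of this factorization for a class-based bigram model ($n = 1$). The hand-written word-to-class mapping, the toy corpus, and the function name are assumptions for illustration; in practice the classes would come from a clustering step.

```python
from collections import Counter

# Assumed hard assignment of each word to exactly one class.
word_class = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
              "sat": "VERB", "ran": "VERB"}

tokens = "the cat sat a dog ran the dog sat".split()  # toy corpus
classes = [word_class[w] for w in tokens]

word_counts = Counter(tokens)
word_bigrams = Counter(zip(tokens, tokens[1:]))
class_counts = Counter(classes)
class_bigrams = Counter(zip(classes, classes[1:]))

def class_bigram_prob(prev_word, word):
    """p(word | prev_word) = p(class | prev_class) * p(word | class), both by MLE."""
    c_prev, c = word_class[prev_word], word_class[word]
    p_class = class_bigrams[(c_prev, c)] / class_counts[c_prev]
    p_word = word_counts[word] / class_counts[c]
    return p_class * p_word

print(word_bigrams[("a", "cat")])     # the word pair itself never occurs
print(class_bigram_prob("a", "cat"))  # but the class pair DET -> NOUN does
```

The pair `("a", "cat")` has zero count as a word bigram, yet it receives non-zero probability because counts are pooled at the class level.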

Pros

- Reduces sparsity: Clustering words into classes pools counts across the words in a class, which gives more reliable and robust probability estimates.

- Unseen word combinations: Estimating probabilities over classes assigns non-zero probability to word combinations that never occur in the training data.

- Smaller model size: Using classes instead of individual words requires fewer parameters and results in a smaller model.

- Improved performance: In some cases, class-based models can outperform standard n-gram models, especially with limited training data.

Cons

- Class assignment: Optimally clustering words into classes can be difficult, requiring extra resources or manual input.

- Loss of specificity: Clustering words may decrease performance in tasks requiring fine-grained distinctions between words.

- Complexity: Class-based models introduce additional complexity, making them harder to understand and implement.