<h1 style="font-size: 1.6rem; font-weight: bold">ITO 5217: Natural Language Processing</h1>
<h1 style="font-size: 1.6rem; font-weight: bold">Module 1: Language Modelling</h1>
<p style="margin-top: 5px; margin-bottom: 5px;">Monash University Australia</p>
<p style="margin-top: 5px; margin-bottom: 5px;">Jupyter Notebook by: Tristan Sim Yook Min</p>
References: Information Source from Monash Faculty of Information Technology

---

### **Introduction to Language Modelling**

A language model is a probabilistic framework that assigns a probability to a sequence of words, reflecting how likely that sequence is to appear in a given language or domain (as estimated from a training corpus). It can be used in speech recognition, automatic language translation, and predictive word typing.

For a sequence of length $N$, a language model calculates its probability as:

$$P(w_1, w_2, \dots, w_N)$$

compactly written as $P(w_1^N)$. The higher the probability, the more "natural" 
or likely the sequence is. For example:

$$P(\text{Named must be your fear before banish it you can.}) < P(\text{Your fear must be named before you can banish it.})$$


#### **Chain Rule of Probability in Language Modelling**

Starting from the definition of conditional probability:

$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

which can be rearranged to:

$$P(A, B) = P(A \mid B) \cdot P(B)$$

Applying this repeatedly to a sequence of words:

$$\begin{align}
P(w_1, w_2, \ldots, w_N) 
&= P(w_N \mid w_1, \ldots, w_{N-1}) \cdot P(w_1, \ldots, w_{N-1}) \\
&= P(w_N \mid w_1, \ldots, w_{N-1}) \cdot P(w_{N-1} \mid w_1, \ldots, w_{N-2}) \cdot P(w_1, \ldots, w_{N-2}) \\
&= \ldots \\
&= P(w_N \mid w_1, \ldots, w_{N-1}) \cdots P(w_2 \mid w_1) \cdot P(w_1)
\end{align}$$

Compactly written as:

$$P(w_1^N) = \prod_{i=1}^{N} P(w_i \mid w_1^{i-1})$$

where:
- $P(A \mid B)$ — probability of $A$ given $B$
- $P(A, B)$ — joint probability of $A$ and $B$
- $w_i$ — the $i$-th word in the sequence
- $w_1^N$ — shorthand for the full sequence $w_1, w_2, \ldots, w_N$
- $w_1^{i-1}$ — all preceding words before position $i$ (the history or Context)
- $\prod$ — product over all positions $i = 1$ to $N$
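
As a brief worked instance of the product above, take the three-word sequence *"your fear must"* ($N = 3$). For $i = 1$ the history $w_1^{0}$ is empty, so the first factor is unconditional:

$$P(\text{your fear must}) = P(\text{your}) \cdot P(\text{fear} \mid \text{your}) \cdot P(\text{must} \mid \text{your fear})$$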

<br>

---

### **N-gram Language Models**

#### **What is an N-gram Language Model?**

An **n-gram language model** predicts the probability of a word $w$ given a history $u$, written $P(w \mid u)$. In the chain rule, $u$ is every preceding word; an n-gram model keeps only the $n-1$ most recent ones.

$$P(w_1^N) = \prod_{i=1}^{N} P(\underbrace{w_i}_{w} \mid \underbrace{w_1^{i-1}}_{u})$$

where:
- $w$ — the word being predicted
- $u$ — the history/context (all preceding words)
- $P(w \mid u)$ — a **multinomial distribution** over the vocabulary, indicating 
how likely each word is to follow sequence $u$


For example, given $u =$ *"Scott Morrison is"*, the distribution over 
possible next words $w$ might look like:

| $w$ | a | bird | conservative | married | president | republican | university | zyzzyva |
|---|---|---|---|---|---|---|---|---|
| $P(w \mid u)$ | 0 | 0 | .18 | .14 | .32 | .26 | 0 | 0 |

- Words like *"president"* (.32) and *"republican"* (.26) have high probability 
— reflecting patterns in the training corpus
- Unrelated words like *"bird"* or *"university"* have zero probability
- All probabilities sum to **1** across the full vocabulary

<br>

#### **Computing $P(w \mid u)$ — Maximum Likelihood Estimation (MLE)**

MLE estimates $P(w_n \mid w_{n-N+1}, \dots, w_{n-1})$ by counting word occurrences in a training corpus. For the bigram case ($N = 2$):

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{\sum_{w} C(w_{n-1}\, w)}$$

For example:

$$P(\text{President} \mid \text{Joe Biden is}) = 
\frac{\text{count}(\text{Joe Biden is President})}{\text{count}(\text{Joe Biden is})}$$

where:
- $C(w_{n-1}\, w_n)$ — count of the word pair appearing in the corpus
- $\sum_{w} C(w_{n-1}\, w)$ — total count of all words following $w_{n-1}$
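
The bigram formula above can be sketched in a few lines of Python. The toy corpus and the helper name `mle_bigram` are assumptions made for illustration, not part of the course material; pairs are counted across sentence boundaries for simplicity.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only); "." marks the end of a sentence.
corpus = "i want english food . i want chinese food . i like english tea .".split()

# C(w_{n-1} w_n): counts of adjacent word pairs.
bigram_counts = Counter(zip(corpus, corpus[1:]))

# Sum over w of C(w_{n-1} w): how often each word appears as the first half of a pair.
context_counts = Counter(corpus[:-1])


def mle_bigram(prev_word, word):
    """Return the MLE estimate P(word | prev_word); 0.0 if prev_word was never a context."""
    total = context_counts[prev_word]
    if total == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / total


print(mle_bigram("i", "want"))        # 2 of the 3 occurrences of "i" are followed by "want"
print(mle_bigram("want", "english"))  # 1 of the 2 occurrences of "want"
print(mle_bigram("english", "i"))     # unseen pair, so the estimate is 0.0
```

For any context seen in the corpus, the estimates over all possible next words sum to 1, which is what makes this a valid conditional distribution.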

<br>

#### **Number of Parameters**

The number of parameters grows **exponentially** with context size:

$$\text{Number of parameters} = (\text{Dictionary Size})^{n} - 1$$

where $n = \text{context size} + 1$.

| N-gram | Context Size | Parameters (100-word vocab) |
|---|---|---|
| Unigram | 0 | $100 - 1 = 99$ |
| Bigram | 1 | $100^2 - 1 = 9{,}999$ |
| Trigram | 2 | $100^3 - 1 = 999{,}999$ |
| 10-gram | 9 | $100^{10} - 1 \approx 10^{20}$ |

**Note:** This is toy scale. The Oxford Dictionary has **171,476** words, so higher-order n-gram tables become enormous and most n-grams never appear in the training corpus. Special smoothing techniques are required to handle this.
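
To make the growth concrete, the short sketch below reproduces the table's figures using the same convention as its rows (vocabulary size to the power $n$, minus 1). The function name `ngram_parameters` is an assumption for illustration.

```python
def ngram_parameters(vocab_size, n):
    """Parameter count following the table's convention: vocab_size**n - 1."""
    return vocab_size ** n - 1


# The four rows of the table, for a 100-word vocabulary.
for name, n in [("Unigram", 1), ("Bigram", 2), ("Trigram", 3), ("10-gram", 10)]:
    print(f"{name}: {ngram_parameters(100, n):,}")

# The same count for a trigram model over a 171,476-word vocabulary.
print(f"{ngram_parameters(171_476, 3):.3e}")
```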

<br>

### **The Markov Assumption**

The **Markov Assumption** states that future behaviour depends only on recent 
history. In a $k$-th order Markov model, the next state depends only on the 
most recent $k$ states.

This allows us to approximate the full probability:

$$P(w_1^N) = \prod_{i=1}^{N} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}^{i-1}) \tag{1}$$

where:
- $w_1^N$ — the full word sequence
- $w_1^{i-1}$ — the full history up to position $i$
- $w_{i-n+1}^{i-1}$ — only the most recent $n-1$ words (Markov approximation)

##### For Example, N-gram Models Applied to $P(\text{Named must be your fear before banish it you can.})$

| **Unigram (n=1)** | **Bigram (n=2)** | **Trigram (n=3)** |
|---|---|---|
| $P(\text{Named}\mid-)$ | $P(\text{Named}\mid-)$ | $P(\text{Named}\mid-)$ |
| $P(\text{must}\mid-)$ | $P(\text{must}\mid\text{Named})$ | $P(\text{must}\mid\text{Named})$ |
| $P(\text{be}\mid-)$ | $P(\text{be}\mid\text{must})$ | $P(\text{be}\mid\text{Named, must})$ |
| $P(\text{your}\mid-)$ | $P(\text{your}\mid\text{be})$ | $P(\text{your}\mid\text{must, be})$ |
| $P(\text{fear}\mid-)$ | $P(\text{fear}\mid\text{your})$ | $P(\text{fear}\mid\text{be, your})$ |
| $P(\text{before}\mid-)$ | $P(\text{before}\mid\text{fear})$ | $P(\text{before}\mid\text{your, fear})$ |
| $\cdots$ | $\cdots$ | $\cdots$ |
| $P(.\mid-)$ | $P(.\mid\text{can})$ | $P(.\mid\text{you, can})$ |

Unigrams ignore all context, bigrams look back one word, and trigrams look back two — the Markov assumption limits history to keep the model tractable.
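
The truncation of the history can be sketched as a small helper that lists the (context, word) pairs each model conditions on. The name `ngram_factors` and the whitespace tokenisation are assumptions for illustration.

```python
def ngram_factors(tokens, n):
    """Return (context, word) pairs, where context holds at most the n-1 preceding tokens."""
    factors = []
    for i, word in enumerate(tokens):
        start = max(0, i - (n - 1))  # clip at the start of the sentence
        factors.append((tuple(tokens[start:i]), word))
    return factors


sentence = "Named must be your fear before banish it you can .".split()

# Show the first four factors for each model order.
for n, label in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    print(label, ngram_factors(sentence, n)[:4])
```

An empty context tuple corresponds to the $-$ entries in the table.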

#### **Application: Bigram Example**

Consider the following assumed bigram probabilities:

| Bigram | Probability |
|---|---|
| $P(\text{i} \mid -)$ | $0.25$ |
| $P(\text{want} \mid \text{i})$ | $0.33$ |
| $P(\text{english} \mid \text{want})$ | $0.0011$ |
| $P(\text{food} \mid \text{english})$ | $0.5$ |
| $P(. \mid \text{food})$ | $0.68$ |

We can calculate the probability of the entire sentence as:

$$P(\text{i want english food .}) = P(\text{i}\mid-) \cdot P(\text{want}\mid\text{i}) \cdot P(\text{english}\mid\text{want}) \cdot P(\text{food}\mid\text{english}) \cdot P(.\mid\text{food})$$

$$= 0.25 \times 0.33 \times 0.0011 \times 0.5 \times 0.68 \approx \mathbf{0.000031}$$

**Note:** Probabilities get very small quickly as sentence length grows. In practice, **log probabilities** are used to avoid numerical underflow.
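
A minimal sketch of the log-probability trick, using the five bigram probabilities from the table. The 400-word sentence at probability 0.001 per word is a made-up illustration of underflow, not data from the course.

```python
import math

# Bigram probabilities copied from the table above.
bigram_probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

# Direct product, as in the worked calculation.
product = math.prod(bigram_probs)

# Sum of natural-log probabilities; exponentiating recovers the same value.
log_prob = sum(math.log(p) for p in bigram_probs)

print(f"{product:.6f}")                           # 0.000031
print(f"{log_prob:.4f}")
print(math.isclose(math.exp(log_prob), product))  # True

# Hypothetical 400-word sentence: the direct product underflows to zero ...
long_sentence = [0.001] * 400
print(math.prod(long_sentence))                   # 0.0
# ... while the sum of logs stays finite.
print(sum(math.log(p) for p in long_sentence))
```

Because the logarithm is monotonic, comparing two sentences by summed log probability gives the same ranking as comparing their products.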

<br>

---

### **Data Sparsity and Smoothed N-Grams**

### **3.1 Exploring Data Sparsity**

Revisiting the example *"Scott Morrison is X"*:

$$\text{count}(\text{Scott Morrison is Prime Minister}) = 0$$
$$\text{count}(\text{Morrison is Prime Minister}) = 100$$

Does this mean $P(x = \text{Prime Minister} \mid \text{Scott Morrison is}) = 0$?

**No!** As sequence length grows, it becomes increasingly unlikely to find an 
exact match in a corpus, even if a shorter version of the sequence exists.

<br>

#### **Out-of-Vocabulary (OOV) Words**

Consider the sequence *"Meet the husband of Princess Hammock"*, where 
*"Hammock"* never appeared in training data.

Does this mean $P(\text{Meet the husband of Princess Hammock}) = 0$?

Again, intuitively **no**: the word is simply unknown, not impossible.

**Maximum Likelihood Estimation (MLE)** fails under data sparsity because it estimates probabilities solely from training counts, so unseen sequences automatically receive a probability of zero.

<br>

#### **Addressing Data Sparsity**

Two key strategies exist to address Data Sparsity:

**1. Better Training Data:** Ensure the training corpus adequately represents the target domain and application.

**2. Smoothing Techniques:** Redistribute probability mass to unseen sequences. Common techniques include **Interpolation** and **Backoff** (covered next).

<br>

### **Handling Out-of-Vocabulary (OOV) Words with `<UNK>` (Unknowns)**

One practical approach is to apply a **frequency threshold** (e.g. threshold $= 3$):

**1. During Training Time,** replace all words appearing fewer than 3 times with the special token `<UNK>`, accumulating probability mass from all rare words.

**2. During Test Time,** replace all out-of-vocabulary words with `<UNK>`.

$$P(\text{Hammock}) \rightarrow P(\texttt{<UNK>})$$

This ensures that rare and unseen words are never assigned zero probability, sharing the probability mass of all infrequent words under a single token.
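
The two steps can be sketched as follows. The training sentence, the helper names `build_vocab` and `replace_rare`, and the lowercase tokenisation are assumptions for illustration only.

```python
from collections import Counter

UNK = "<UNK>"
THRESHOLD = 3  # words seen fewer than this many times become <UNK>


def build_vocab(train_tokens, threshold=THRESHOLD):
    """Keep only the words that appear at least `threshold` times in training."""
    counts = Counter(train_tokens)
    return {word for word, count in counts.items() if count >= threshold}


def replace_rare(tokens, vocab):
    """Map every token outside the vocabulary to <UNK>."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Hypothetical training tokens, chosen so that only "the" and "princess" reach the threshold.
train = "the princess met the prince and the princess met the king of the princess".split()
vocab = build_vocab(train)

# Step 1: training-time replacement of rare words.
train_mapped = replace_rare(train, vocab)

# Step 2: test-time replacement of out-of-vocabulary words.
test_mapped = replace_rare("meet the husband of princess hammock".split(), vocab)

print(sorted(vocab))
print(test_mapped)
```

N-gram counts would then be collected from `train_mapped`, so `<UNK>` is estimated like any other token.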

<br>

### **3.2 Smoothed N-grams**

Zero counts give zero probability estimates, but some zero-count n-grams are valid sequences. **Smoothing** reassigns probability mass from seen to unseen events, whilst keeping the distribution summing to 1.

> Smoothing is like Robin Hood — stealing probability mass from the rich (frequent words) and giving to the poor (unseen words).


#### **Intuition:** 

$$P(w \mid \text{denied the})$$

Imagine we are trying to predict the next word $w$ after the phrase *"denied the"* using the **Wall Street Journal corpus**. We search the corpus and find *"denied the"* appears **7 times total**, followed by:

| $w$ | Count | Probability |
|---|---|---|
| allegations | 3 | $3/7$ |
| reports | 2 | $2/7$ |
| claims | 1 | $1/7$ |
| requests | 1 | $1/7$ |
| attack | 0 | $0$ |
| man | 0 | $0$ |
| outcome | 0 | $0$ |
| **Total** | **7** | **1** |

With MLE, words like *"attack"*, *"man"*, and *"outcome"* never appeared after *"denied the"* in the corpus, so they are assigned **zero probability**. But is *"denied the attack"* truly impossible? Of course not; it simply was not in our training data.

This is the **data sparsity problem**. Smoothing fixes this by slightly reducing the counts of *seen* words and redistributing that probability mass to *unseen* words:

| $w$ | Count | Probability |
|---|---|---|
| allegations | 2.5 | $2.5/7$ |
| reports | 1.5 | $1.5/7$ |
| claims | 0.5 | $0.5/7$ |
| requests | 0.5 | $0.5/7$ |
| attack | $>0$ | $>0$ |
| man | $>0$ | $>0$ |
| outcome | $>0$ | $>0$ |
| **Total** | **7** | **1** |

No word is assigned zero probability: the distribution is **flattened**, so the model generalises better to unseen data.

<br>

### Add-1 (Laplace) Smoothing

Recall MLE:

$$P_{MLE}(w_i \mid w_{i-1}) = \frac{C(w_{i-1},\ w_i)}{C(w_{i-1})}$$

Add-1 smoothing simply adds 1 to every count before normalising:

$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1},\ w_i) + 1}{C(w_{i-1}) + V}$$

where $V$ is the vocabulary size (total number of possible $w_i$).
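
A minimal sketch of the Add-1 formula for the context *"denied the"*. The six-word closed vocabulary ($V = 6$) and the function name `add_one` are assumptions for illustration.

```python
from collections import Counter

# Counts of words observed after "denied the", taken from the MLE table.
following_counts = Counter({"allegations": 3, "reports": 2, "claims": 1, "requests": 1})

# Assumed closed vocabulary for this sketch: the four seen words plus two unseen ones.
vocabulary = ["allegations", "reports", "claims", "requests", "outcome", "fact"]


def add_one(word, counts, vocab):
    """Add-1 estimate: (C(context, word) + 1) / (C(context) + V)."""
    context_total = sum(counts.values())
    return (counts[word] + 1) / (context_total + len(vocab))


for w in vocabulary:
    print(w, round(add_one(w, following_counts, vocabulary), 4))
```

Unseen words receive $1/13$ each, and the six estimates still sum to 1.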

---

### Effect of Add-1 Smoothing

**Small corpus** (total = 7): the ratio of unseen words to total count, $2/7$, is large → **big change** (−28.2% for allegations):

| $w$ | Unsmoothed $C$ | Unsmoothed $P$ | Add-1 $C$ | Add-1 $P$ |
|---|---|---|---|---|
| allegations | 3 | $3/7$ | 4 | $4/13$ |
| reports | 2 | $2/7$ | 3 | $3/13$ |
| claims | 1 | $1/7$ | 2 | $2/13$ |
| requests | 1 | $1/7$ | 2 | $2/13$ |
| outcome | 0 | $0$ | 1 | $1/13$ |
| fact | 0 | $0$ | 1 | $1/13$ |
| **Total** | **7** | **1** | **13** | **1** |

**Large corpus** (total = 700): the same ratio drops to $2/700$ → **small change**:

| $w$ | Unsmoothed $C$ | Unsmoothed $P$ | Add-1 $C$ | Add-1 $P$ |
|---|---|---|---|---|
| allegations | 300 | $300/700$ | 301 | $301/706$ |
| reports | 200 | $200/700$ | 201 | $201/706$ |
| claims | 100 | $100/700$ | 101 | $101/706$ |
| requests | 100 | $100/700$ | 101 | $101/706$ |
| outcome | 0 | $0$ | 1 | $1/706$ |
| fact | 0 | $0$ | 1 | $1/706$ |
| **Total** | **700** | **1** | **706** | **1** |

**Large vocabulary** ($V = 2000$, total = 700): 1996 unseen words → **severe change** (−74.0% for allegations):

| $w$ | Unsmoothed $C$ | Unsmoothed $P$ | Add-1 $C$ | Add-1 $P$ |
|---|---|---|---|---|
| allegations | 300 | $300/700$ | 301 | $301/2700$ |
| reports | 200 | $200/700$ | 201 | $201/2700$ |
| claims | 100 | $100/700$ | 101 | $101/2700$ |
| requests | 100 | $100/700$ | 101 | $101/2700$ |
| outcome | 0 | $0$ | 1 | $1/2700$ |
| fact | 0 | $0$ | 1 | $1/2700$ |
| 1994 other unseen words | 0 | $0$ | 1994 | $1994/2700$ |
| **Total** | **700** | **1** | **2700** | **1** |

> A large vocabulary causes Add-1 to give **too much** probability mass to 
> unseen events ($1996/2700 \approx 73.9\%$!), severely hurting performance.
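
The relative changes for *"allegations"* in the three scenarios can be recomputed directly from the Add-1 formula. The helper names below are assumptions for illustration; the counts, totals, and vocabulary sizes come from the tables.

```python
def add_one_prob(count, total, vocab_size):
    """Add-1 probability for one word, given its count, the context total, and V."""
    return (count + 1) / (total + vocab_size)


def relative_change(count, total, vocab_size):
    """Relative change of the Add-1 estimate against the unsmoothed count / total."""
    unsmoothed = count / total
    return (add_one_prob(count, total, vocab_size) - unsmoothed) / unsmoothed


# (label, count of "allegations", context total, vocabulary size V)
scenarios = [
    ("small corpus", 3, 7, 6),
    ("large corpus", 300, 700, 6),
    ("large vocabulary", 300, 700, 2000),
]

for label, count, total, vocab_size in scenarios:
    print(f"{label}: {relative_change(count, total, vocab_size):+.1%}")
```

The distortion depends on how large $V$ is relative to the context total: with 6 added counts against 700 the shift is small, while 2000 added counts against 700 dominate the estimate.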

---

### Better Alternatives — Backoff & Interpolation

**Backoff** — when an n-gram count is 0, fall back to a shorter n-gram:

$$\text{trigram} \xrightarrow{\text{if 0}} \text{bigram} \xrightarrow{\text{if 0}} \text{unigram}$$
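
A deliberately simple sketch of this fallback chain follows. Every probability in the tables is made up for illustration, and the function name `backoff_prob` is an assumption. This plain version does not renormalise, so the values for a given context need not sum to 1; practical backoff schemes discount the higher-order estimates to correct for that.

```python
def backoff_prob(word, context, trigram_p, bigram_p, unigram_p):
    """Use the longest n-gram with a non-zero estimate: trigram, then bigram, then unigram."""
    w2, w1 = context  # the two preceding words, oldest first
    p = trigram_p.get((w2, w1, word), 0.0)
    if p > 0:
        return p
    p = bigram_p.get((w1, word), 0.0)
    if p > 0:
        return p
    return unigram_p.get(word, 0.0)


# Hypothetical probability tables; every number here is made up for illustration.
trigram_p = {("denied", "the", "allegations"): 3 / 7}
bigram_p = {("the", "allegations"): 0.01, ("the", "attack"): 0.002}
unigram_p = {"allegations": 0.0001, "attack": 0.0003, "outcome": 0.0002}

context = ("denied", "the")
print(backoff_prob("allegations", context, trigram_p, bigram_p, unigram_p))  # trigram is available
print(backoff_prob("attack", context, trigram_p, bigram_p, unigram_p))       # falls back to the bigram
print(backoff_prob("outcome", context, trigram_p, bigram_p, unigram_p))      # falls back to the unigram
```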

**Simple Interpolation** — mix all n-gram levels together:

$$P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1})$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$, with $\lambda$ weights tuned on 
held-out data.
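
The weighted sum above can be sketched as follows. The default weights `(0.1, 0.3, 0.6)` and the three input probabilities are made-up values for illustration; in practice the weights would be tuned on held-out data as described.

```python
def interpolate(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Weighted mix of the unigram, bigram, and trigram estimates; weights must sum to 1."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("lambda weights must sum to 1")
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# Hypothetical estimates for "attack" after "denied the":
# the trigram was never seen, but the bigram and unigram were.
p = interpolate(p_unigram=0.0003, p_bigram=0.002, p_trigram=0.0)
print(p > 0)  # True: the lower-order estimates keep the result non-zero
```

Unlike backoff, the lower-order estimates contribute even when the trigram has been observed.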