# Language Models and N-gram Models

## What is a Language Model (LM)?

A language model is a machine learning model that predicts the likelihood of a word or sequence of words occurring in a given context. It assigns probabilities to:

- **Individual words**: The probability of a specific word appearing next, given some prior words (context).
- **Entire sentences**: The probability of a sequence of words forming a coherent sentence.

For example, given the phrase "The water of Walden Pond is so beautifully," an LM might predict that the next word is likely to be something like *blue*, *green*, or *clear* (describing the water), but unlikely to be *refrigerator* or *this*, as these are less contextually appropriate.

## Why are LMs useful?

1. **Text Correction and Generation**: They help produce contextually appropriate text, for example by correcting spelling or grammar errors (e.g., suggesting "There are" instead of "Their are").

2. **Speech Recognition**: They help disambiguate similar-sounding phrases (e.g., "back soonish" vs. "bassoon dish").

3. **Augmentative and Alternative Communication (AAC)**: They suggest likely words for users who rely on systems like eye-gaze to communicate.

4. **Training Large Language Models**: Modern large language models (like those discussed in later chapters) are trained to predict the next word, learning language patterns in the process.

## What is an N-gram Language Model?

An n-gram is a sequence of n consecutive words (or tokens). An n-gram language model estimates the probability of a word based on the previous n-1 words. For example:

- **Unigram (n=1)**: Probability of a word without considering any prior context.
- **Bigram (n=2)**: Probability of a word given the one word before it.
- **Trigram (n=3)**: Probability of a word given the two words before it.

In the example, to predict the next word after "The water of Walden Pond is so beautifully," an n-gram model might look at the last few words (e.g., "so beautifully") to estimate probabilities.

The text also uses the term "n-gram" to refer to the probabilistic model itself, which estimates:
- The probability of a word given the previous n-1 words.
- The probability of an entire sequence of words.

## Key Concepts in N-gram Models

### 1. Probability of a Word Given Context

The goal is to compute the probability of a word $w$ given a history $h$, denoted $P(w|h)$. For example:

- History $h$: "The water of Walden Pond is so beautifully"
- Word $w$: "blue"
- Goal: Compute $P(\text{blue} | \text{The water of Walden Pond is so beautifully})$

One intuitive way to estimate this is by counting how often the sequence "The water of Walden Pond is so beautifully blue" appears compared to how often "The water of Walden Pond is so beautifully" appears in a large corpus:

$$P(\text{blue} | \text{The water of Walden Pond is so beautifully}) = \frac{C(\text{The water of Walden Pond is so beautifully blue})}{C(\text{The water of Walden Pond is so beautifully})}$$

where $C$ denotes the count of occurrences in the corpus.

**Problem**: Exact counts for long sequences are impractical because:
- Language is creative, and many sequences (especially long ones) may never appear in a corpus, even a large one like the web.
- This makes direct counting unreliable for long contexts.

### 2. The Markov Assumption

To address this, n-gram models simplify the problem using the Markov assumption, which assumes that the probability of a word depends only on a limited number of previous words (not the entire history). For example:

- **Bigram model**: The probability of a word depends only on the immediately preceding word. So, instead of computing $P(\text{blue} | \text{The water of Walden Pond is so beautifully})$, we approximate it as:
  $$P(\text{blue} | \text{beautifully})$$

- **Trigram model**: The probability depends on the previous two words, e.g.,
  $$P(\text{blue} | \text{so beautifully})$$

The general formula for an n-gram model is:

$$P(w_n | w_1, w_2, \ldots, w_{n-1}) \approx P(w_n | w_{n-N+1}, \ldots, w_{n-1})$$

where $N$ is the n-gram size (e.g., $N=2$ for bigrams, $N=3$ for trigrams).

This approximation simplifies calculations and makes it feasible to estimate probabilities from a corpus, as shorter sequences (like bigrams or trigrams) are more likely to appear multiple times than long sequences.
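
As a minimal sketch of the Markov assumption, the helper below keeps only the last $N-1$ words of a history. The function name `markov_context` is a hypothetical choice for illustration, not something from the text.

```python
def markov_context(history, n):
    """Return the last n-1 words of `history`, the only context an n-gram model uses."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A unigram model (n=1) uses no context at all.
    return history[-(n - 1):] if n > 1 else []


history = "The water of Walden Pond is so beautifully".split()
print(markov_context(history, 2))  # bigram context: ['beautifully']
print(markov_context(history, 3))  # trigram context: ['so', 'beautifully']
```

Everything earlier in the sentence, including "Walden Pond", is discarded before any probability is looked up.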

### 3. Chain Rule for Sequence Probability

To compute the probability of an entire sequence of words $w_1, w_2, \ldots, w_n$, we use the chain rule of probability:

$$P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_1, w_2) \cdot \ldots \cdot P(w_n | w_1, \ldots, w_{n-1})$$

Using the n-gram approximation (e.g., bigram), this becomes the following, where $w_0$ is taken to be the sentence-start token `<s>`:

$$P(w_1, w_2, \ldots, w_n) \approx \prod_{k=1}^n P(w_k | w_{k-1})$$

For example, to compute the probability of the sentence "`<s>` I want english food `</s>`":

$$P(\text{<s> i want english food </s>}) = P(\text{i} | \text{<s>}) \cdot P(\text{want} | \text{i}) \cdot P(\text{english} | \text{want}) \cdot P(\text{food} | \text{english}) \cdot P(\text{</s>} | \text{food})$$

The text provides an example using the Berkeley Restaurant Project corpus, where:

$$P(\text{<s> i want english food </s>}) = 0.25 \cdot 0.33 \cdot 0.0011 \cdot 0.5 \cdot 0.68 \approx 0.000031$$
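
The product above can be reproduced with a short sketch. The probability table holds only the five values quoted in the text. Treating any bigram missing from the table as probability 0 is an assumption of this example, mirroring unsmoothed estimates.

```python
import math

# Bigram probabilities quoted in the text for the Berkeley Restaurant Project corpus.
BIGRAM_PROB = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}


def sentence_probability(tokens, bigram_prob):
    """Multiply P(w_k | w_{k-1}) over consecutive token pairs.

    Bigrams missing from the table count as probability 0 (an assumption of this sketch).
    """
    return math.prod(bigram_prob.get(pair, 0.0) for pair in zip(tokens, tokens[1:]))


tokens = "<s> i want english food </s>".split()
p = sentence_probability(tokens, BIGRAM_PROB)
print(f"{p:.6f}")  # 0.000031
```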

### 4. Maximum Likelihood Estimation (MLE)

To estimate these probabilities, we use Maximum Likelihood Estimation (MLE), which involves:

1. Counting how often a specific n-gram (e.g., bigram) appears in a corpus.
2. Normalizing by the count of the prefix (the first $n-1$ words) to get a probability between 0 and 1.

For a bigram $P(w_n | w_{n-1})$:

$$P(w_n | w_{n-1}) = \frac{C(w_{n-1} \, w_n)}{C(w_{n-1})}$$

For example, in the mini-corpus provided:
```
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
```

To compute $P(\text{I} | \text{<s>})$:
- Count how many times "`<s>` I" appears: 2 times (in "`<s>` I am Sam `</s>`" and "`<s>` I do not like ...").
- Count how many times "`<s>`" appears: 3 times (at the start of each sentence).
- Probability: $P(\text{I} | \text{<s>}) = \frac{2}{3} \approx 0.67$

For a general n-gram:

$$P(w_n | w_{n-N+1}, \ldots, w_{n-1}) = \frac{C(w_{n-N+1}, \ldots, w_{n-1} \, w_n)}{C(w_{n-N+1}, \ldots, w_{n-1})}$$
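
The count-and-normalize recipe can be sketched on the three-sentence mini-corpus. The helper names `count_ngrams` and `mle_prob` are hypothetical, and the sketch assumes a context of at least one word (bigrams and up).

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]


def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def mle_prob(word, context, sentences):
    """MLE estimate C(context + word) / C(context); `context` is a non-empty tuple."""
    ngram_counts = count_ngrams(sentences, len(context) + 1)
    context_counts = count_ngrams(sentences, len(context))
    denominator = context_counts[context]
    if denominator == 0:
        return 0.0  # unseen context; this sketch returns 0 rather than raising
    return ngram_counts[context + (word,)] / denominator


print(mle_prob("I", ("<s>",), corpus))       # 2/3
print(mle_prob("Sam", ("am",), corpus))      # 1/2
print(mle_prob("am", ("<s>", "I"), corpus))  # trigram estimate: 1/2
```

Recounting the corpus on every call keeps the sketch short. A practical implementation would count once and reuse the tables.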

### 5. Practical Example with the Berkeley Restaurant Corpus

The text provides bigram counts and probabilities from the Berkeley Restaurant Project corpus, which contains 9332 sentences about restaurant queries. For example, the bigram probability table shows probabilities like:

$$P(\text{want} | \text{i}) = 0.33, \quad P(\text{english} | \text{want}) = 0.0011, \quad P(\text{food} | \text{english}) = 0.5$$

These probabilities reflect linguistic patterns (e.g., "eat" is often followed by a noun like "food") and domain-specific patterns (e.g., "I" is common at the start of queries, and "Chinese food" is more common than "English food").

## Applying to the Example: "The water of Walden Pond is so beautifully"

Let's use the n-gram model to predict the next word after "The water of Walden Pond is so beautifully."

### 1. Bigram Approximation:
- Instead of considering the entire history, we look only at the last word, "beautifully."
- We need $P(w | \text{beautifully})$ for possible next words $w$, such as "blue," "green," or "clear."
- Using MLE, we would estimate:
  $$P(\text{blue} | \text{beautifully}) = \frac{C(\text{beautifully blue})}{C(\text{beautifully})}$$
- This requires a corpus where we count how often "beautifully blue" appears compared to all bigrams starting with "beautifully."

### 2. Challenges:
- If "beautifully blue" never appears in the corpus, the probability is 0, which may not reflect reality (this is the sparsity problem).
- The text notes that even large corpora (like the web) may not contain enough instances of long sequences like "The water of Walden Pond is so beautifully" to estimate probabilities directly.

### 3. Solution with N-grams:
- By using bigrams or trigrams, we reduce the context to a manageable size, making it more likely to find relevant counts in a corpus.
- For example, a trigram model would use $P(\text{blue} | \text{so beautifully})$, which is still simpler than the full history.

### 4. Why "blue" or "green" but not "refrigerator"?
- Words like "blue" or "green" are more likely because they are adjectives that commonly describe water and fit the semantic context of "beautifully."
- "Refrigerator" is unlikely because it's a noun that doesn't fit the syntactic or semantic pattern of describing a pond's water.
- The n-gram model captures these patterns by assigning higher probabilities to frequently observed sequences in the corpus.

## Key Linguistic Phenomena Captured by N-grams

The text highlights that n-gram models capture various phenomena:

1. **Syntactic Patterns**: Bigrams like "eat" followed by a noun (e.g., "food") or "to" followed by a verb reflect grammatical rules.

2. **Domain-Specific Patterns**: In the Berkeley Restaurant corpus, "I" is common at the start of queries, reflecting the conversational nature of the dataset.

3. **Cultural Patterns**: Higher probabilities for "Chinese food" vs. "English food" may reflect cultural preferences in the dataset.

## Limitations of N-gram Models

While n-gram models are simple and interpretable, they have limitations:

- **Sparsity**: Many n-grams (especially for $n \geq 3$) may not appear in the corpus, leading to zero probabilities. The text mentions techniques like smoothing (covered later in Section 3.6) to address this.

- **Limited Context**: N-grams only consider the previous $n-1$ words, ignoring longer dependencies. For example, a bigram model can't capture that "Walden Pond" earlier in the sentence makes "blue" more likely.

- **Scalability**: As $n$ increases, the number of possible n-grams grows exponentially, requiring larger corpora and more computation.
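
To give a rough sense of the exponential growth, the sketch below counts the n-grams that could in principle be formed from a vocabulary of 1446 words, the size quoted for the Berkeley Restaurant Project corpus. The function name is a hypothetical choice.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams that can be formed from a vocabulary: V ** n."""
    return vocab_size ** n


V = 1446  # vocabulary size of the Berkeley Restaurant Project corpus
for n in range(1, 6):
    print(n, possible_ngrams(V, n))
```

Already at $n = 2$ there are over two million possible bigrams, far more than a corpus of 9332 sentences can cover.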

Later chapters (7-9) introduce neural large language models (e.g., based on transformers), which overcome these limitations by modeling longer contexts and learning more complex patterns.

# Overview of Figures 3.1 and 3.2

## Figure 3.1: Bigram Counts

Figure 3.1 is a table showing bigram counts for eight selected words (i, want, to, eat, chinese, food, lunch, spend) from the Berkeley Restaurant Project corpus, which has a vocabulary size of $V = 1446$ words. Each cell in the table represents the number of times the column word follows the row word in the corpus. For example:

- The cell at row *i* and column *want* shows a count of 827, meaning the bigram "i want" appears 827 times.
- The cell at row *to* and column *i* shows a count of 2, meaning "to i" appears 2 times.
- Many cells are 0 (shown in gray), indicating that certain bigrams (e.g., "want want" or "chinese eat") do not appear in the corpus.

### Key Observations:

- The table is **sparse**, meaning many of the bigram counts are zero. This is typical in language modeling because most word pairs never occur together, especially when the corpus is small relative to the number of possible pairs.
- The words were chosen to "cohere" (i.e., they are related to the restaurant domain, like "eat," "food," "chinese," "lunch"), so the table is less sparse than it would be for a random set of words.
- The counts reflect raw frequencies of bigrams, which are the building blocks for computing probabilities.

## Figure 3.2: Bigram Probabilities

Figure 3.2 shows the bigram probabilities derived from the counts in Figure 3.1. These probabilities are computed using Maximum Likelihood Estimation (MLE), where each bigram count is normalized by the unigram count of the first word (the row word). The unigram counts for the eight words are provided as:

- i: 2533
- want: 927
- to: 2417
- eat: 746
- chinese: 158
- food: 1093
- lunch: 341
- spend: 278

The formula for a bigram probability $P(w_n | w_{n-1})$ is:

$$P(w_n | w_{n-1}) = \frac{C(w_{n-1} \, w_n)}{C(w_{n-1})}$$

where:
- $C(w_{n-1} \, w_n)$ is the count of the bigram (from Figure 3.1).
- $C(w_{n-1})$ is the unigram count of the first word.

### Example Calculations:

- For $P(\text{want} | \text{i})$:
  - Bigram count $C(\text{i want}) = 827$ (from Figure 3.1).
  - Unigram count $C(\text{i}) = 2533$.
  - Probability: $P(\text{want} | \text{i}) = \frac{827}{2533} \approx 0.33$.

- For $P(\text{i} | \text{to})$:
  - Bigram count $C(\text{to i}) = 2$.
  - Unigram count $C(\text{to}) = 2417$.
  - Probability: $P(\text{i} | \text{to}) = \frac{2}{2417} \approx 0.00083$.

- For $P(\text{want} | \text{want})$:
  - Bigram count $C(\text{want want}) = 0$.
  - Unigram count $C(\text{want}) = 927$.
  - Probability: $P(\text{want} | \text{want}) = \frac{0}{927} = 0$.

### Key Observations:

- The probabilities are normalized to lie between 0 and 1, and for each row, the sum of probabilities across all possible next words (in the full vocabulary of 1446 words) would equal 1.
- Many probabilities are 0 (shown in gray), reflecting the sparsity of the bigram counts. This highlights the **sparsity problem** in n-gram models, where unseen bigrams get zero probability.
- High probabilities, like $P(\text{to} | \text{want}) = 0.66$, indicate strong associations (e.g., "want to" is a common phrase).
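
The row-sum property can be checked on a corpus small enough to enumerate. The sketch below uses the three-sentence mini-corpus from earlier rather than the Berkeley data, which is not reproduced in these notes, and normalizes each row by its own total.

```python
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# bigram_counts[prev][word] = number of times `word` follows `prev`
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

# Each row total equals the number of times `prev` is followed by some word.
bigram_probs = {
    prev: {word: c / sum(row.values()) for word, c in row.items()}
    for prev, row in bigram_counts.items()
}

for prev, row in bigram_probs.items():
    print(prev, round(sum(row.values()), 10))  # every row sums to 1.0
```

The end token `</s>` has no row because nothing follows it within a sentence. For every other word the row total equals its unigram count.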

## Additional Probabilities

The text provides a few additional probabilities involving special tokens `<s>` (sentence start) and `</s>` (sentence end), as well as the word *english*, which isn't in the tables:

- $P(\text{i} | \text{<s>}) = 0.25$: The probability that a sentence starts with "i".
- $P(\text{food} | \text{english}) = 0.5$: The probability that "food" follows "english".
- $P(\text{english} | \text{want}) = 0.0011$: The probability that "english" follows "want".
- $P(\text{</s>} | \text{food}) = 0.68$: The probability that a sentence ends after "food".

These probabilities are derived from the corpus in the same way, using MLE.

## Computing Sentence Probabilities

The text demonstrates how to compute the probability of a sentence using the chain rule with the bigram approximation. For a sentence $w_1, w_2, \ldots, w_n$ wrapped in `<s>` and `</s>`, the probability is:

$$P(\text{<s>} \, w_1 \ldots w_n \, \text{</s>}) \approx P(w_1 | \text{<s>}) \cdot P(w_2 | w_1) \cdot \ldots \cdot P(w_n | w_{n-1}) \cdot P(\text{</s>} | w_n)$$

### Example:

$P(\text{<s> i want english food </s>})$:

- Break it down into bigram probabilities:
  - $P(\text{i} | \text{<s>}) = 0.25$
  - $P(\text{want} | \text{i}) = 0.33$ (from Figure 3.2)
  - $P(\text{english} | \text{want}) = 0.0011$
  - $P(\text{food} | \text{english}) = 0.5$
  - $P(\text{</s>} | \text{food}) = 0.68$

- Multiply them:
  $$P(\text{<s> i want english food </s>}) = 0.25 \cdot 0.33 \cdot 0.0011 \cdot 0.5 \cdot 0.68 \approx 0.000031$$

This small probability reflects the fact that sentence probabilities are often low, especially for longer sequences, because they account for the joint likelihood of all words in order.

### Exercise 3.2: Compute $P(\text{<s> i want chinese food </s>})$

Let's compute this as requested:

- Bigram probabilities:
  - $P(\text{i} | \text{<s>}) = 0.25$ (given).
  - $P(\text{want} | \text{i}) = 0.33$ (from Figure 3.2).
  - $P(\text{chinese} | \text{want}) = 0.0065$ (from Figure 3.2, since $C(\text{want chinese}) = 6$ and $C(\text{want}) = 927$).
  - $P(\text{food} | \text{chinese}) = 0.52$ (from Figure 3.2, since $C(\text{chinese food}) = 82$ and $C(\text{chinese}) = 158$).
  - $P(\text{</s>} | \text{food}) = 0.68$ (given).

- Multiply:
  $$P(\text{<s> i want chinese food </s>}) = 0.25 \cdot 0.33 \cdot 0.0065 \cdot 0.52 \cdot 0.68 \approx 0.00019$$

**Observation**: This is about six times the probability of "i want english food" (0.000031). Almost all of the difference comes from $P(\text{chinese} | \text{want}) = 0.0065$ versus $P(\text{english} | \text{want}) = 0.0011$, since "food" follows either word about half the time. Note also that if any bigram in a sentence had never been seen in training (e.g., "want want", whose count in Figure 3.1 is 0), the whole product would be 0. This is the **sparsity problem** in n-gram models: unseen bigrams get zero probability even when they are plausible. Later sections (e.g., Section 3.6) discuss smoothing techniques that assign non-zero probabilities to unseen bigrams.

## Linguistic Phenomena Captured by Bigram Statistics

The bigram statistics in Figures 3.1 and 3.2 capture several types of linguistic and domain-specific phenomena:

### 1. Syntactic Patterns:

- **"eat" followed by nouns or adjectives**: Among the eight words in Figure 3.2, the most probable continuations of "eat" are "lunch" ($P(\text{lunch} | \text{eat}) = 0.056$) and "chinese" ($P(\text{chinese} | \text{eat}) = 0.021$), while $P(\text{food} | \text{eat}) = 0.0027$ is small but non-zero. This reflects the syntactic rule that a verb like "eat" is followed by a noun phrase, which may start with a noun (e.g., "eat lunch") or with an adjective (e.g., "eat chinese food").

- **"to" followed by verbs**: $P(\text{eat} | \text{to}) = 0.28$ shows that "to" is often followed by a verb, and $P(\text{to} | \text{want}) = 0.66$ shows that "want" is usually followed by "to". Together they capture the infinitive construction in "want to eat."

### 2. Domain-Specific Patterns:

- **High probability of sentence-initial "i"**: $P(\text{i} | \text{<s>}) = 0.25$ shows that many sentences in the Berkeley Restaurant corpus start with "I," reflecting the conversational nature of the dataset (users asking personal queries like "I want...").

- **Task-specific vocabulary**: Words like "eat," "food," "chinese," and "lunch" are frequent, reflecting the restaurant query domain.

### 3. Cultural Patterns:

- **Preference for "Chinese" over "English" food**: The text notes a higher probability for "Chinese food" than "English food." The difference comes from what follows "want": $P(\text{chinese} | \text{want}) = 0.0065$ is about six times $P(\text{english} | \text{want}) = 0.0011$, whereas "food" follows either word about half the time ($P(\text{food} | \text{chinese}) = 0.52$, $P(\text{food} | \text{english}) = 0.5$). This may reflect cultural preferences in the Berkeley area or the dataset's focus on popular cuisines.

These phenomena demonstrate that bigram models capture not only linguistic structure (syntax) but also context-specific and cultural information encoded in the corpus.

# N-gram Practical Issues

N-gram language models estimate the probability of a word given the previous $n-1$ words (e.g., bigrams use one previous word, trigrams use two). As discussed earlier in the chapter, these probabilities are computed using Maximum Likelihood Estimation (MLE) from corpus counts, as in:

$$P(w_n | w_{n-N+1}, \ldots, w_{n-1}) = \frac{C(w_{n-N+1}, \ldots, w_{n-1} \, w_n)}{C(w_{n-N+1}, \ldots, w_{n-1})}$$

However, when working with large corpora (e.g., billions or trillions of words) or higher-order n-grams (e.g., 4-grams or 5-grams), several practical issues emerge:

1. **Numerical Precision**: Multiplying many small probabilities (all ≤ 1) can lead to numerical underflow, where the product becomes too small for computers to handle accurately.

2. **Context Length**: Higher-order n-grams require more context, increasing computational complexity and storage needs.

3. **Scalability**: Storing and computing probabilities for all possible n-grams in a large corpus is memory- and time-intensive.

4. **Sparsity**: Many n-grams, especially higher-order ones, have zero counts in the corpus, requiring efficient handling.

Section 3.1.3 introduces solutions to these issues, focusing on log probabilities, longer context, and efficiency techniques.

## 1. Log Probabilities

### Problem: Numerical Underflow

When computing the probability of a sentence using the chain rule (e.g., for a bigram model, $P(w_1, w_2, \ldots, w_n) \approx \prod_{k=1}^n P(w_k | w_{k-1})$), we multiply many probabilities, each of which is between 0 and 1. For long sequences, this product can become extremely small, causing **numerical underflow**: the number is too tiny for standard floating-point representations (e.g., 8-byte floats) to represent, so it is rounded to zero. For example, multiplying 10 probabilities of 0.01 gives $0.01^{10} = 10^{-20}$, which an 8-byte float can still hold, but multiplying 200 of them gives $10^{-400}$, which is below the smallest positive 8-byte float (about $5 \times 10^{-324}$) and becomes exactly 0.

### Solution: Use Log Probabilities

To avoid underflow, language models store and compute probabilities in **log space**. Instead of multiplying probabilities $p_1 \cdot p_2 \cdot p_3 \cdot p_4$, we add their logarithms:

$$p_1 \cdot p_2 \cdot p_3 \cdot p_4 = \exp(\log p_1 + \log p_2 + \log p_3 + \log p_4)$$

- **Why it works**: Adding logarithms is equivalent to multiplying in linear space because of the property $\log(a \cdot b) = \log a + \log b$. Log probabilities are typically negative (since $\log(p) < 0$ for $p < 1$), but their sum remains manageable and avoids underflow.

- **Implementation**: All computations and storage are done in log space (using natural logarithm, $\ln$, unless specified otherwise). To report a probability to the user, the final log probability is converted back using the exponential function ($\exp$).

- **Example**: For the sentence "`<s>` i want english food `</s>`" from the earlier example, the probability was computed as:
  $$P(\text{<s> i want english food </s>}) = 0.25 \cdot 0.33 \cdot 0.0011 \cdot 0.5 \cdot 0.68 \approx 0.000031$$

  In log space (using natural log):
  $$\log P = \ln(0.25) + \ln(0.33) + \ln(0.0011) + \ln(0.5) + \ln(0.68)$$
  $$\approx -1.386 + (-1.109) + (-6.812) + (-0.693) + (-0.386) \approx -10.386$$

  To convert back:
  $$P = \exp(-10.386) \approx 0.000031$$
  matching the original result.

This approach ensures numerical stability, especially for long sequences or large models.
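
The effect is easy to see in a short sketch. The choice of 200 factors of 0.01 is an illustrative assumption, picked so that the true product ($10^{-400}$) lies below the range of a 64-bit float.

```python
import math

probs = [0.01] * 200  # 200 factors of 0.01; the true product is 1e-400

# Direct multiplication underflows: 1e-400 is below the smallest positive double.
product = math.prod(probs)

# Summing logs keeps the same information as an ordinary-sized number.
log_prob = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_prob)  # about -921.03

# The five-bigram sentence from the text, scored in log space and converted back.
sentence_probs = [0.25, 0.33, 0.0011, 0.5, 0.68]
sentence_log_prob = sum(math.log(p) for p in sentence_probs)
print(sentence_log_prob, math.exp(sentence_log_prob))
```

Converting back with `exp` is only safe when the result itself is representable, so long sequences are normally compared directly by their log probabilities.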

## 2. Longer Context

### Problem: Limited Context in Bigrams

The chapter primarily uses bigram models (which condition on one previous word) for simplicity. However, bigrams capture limited context, missing longer dependencies. For example, in "The water of Walden Pond is so beautifully," a bigram model only considers $P(\text{blue} | \text{beautifully})$, ignoring earlier words like "Walden Pond" that might make "blue" more likely.

### Solution: Higher-Order N-grams

When sufficient training data is available, higher-order n-grams (e.g., trigrams, 4-grams, 5-grams) are used to capture more context:

- **Trigram**: Conditions on the previous two words, e.g., $P(\text{blue} | \text{so beautifully})$.
- **4-gram**: Conditions on the previous three words, e.g., $P(\text{blue} | \text{is so beautifully})$.
- **5-gram**: Conditions on the previous four words.

#### Handling Sentence Boundaries:

- For trigrams at the start of a sentence, we need two preceding words. The text suggests padding the context with two sentence-start pseudo-words, `<s><s>`. For example:
  - The first word of a sentence starting with "i" is scored as $P(\text{i} | \text{<s><s>})$.
  - At the end of the sentence a single `</s>` is enough, because it is only ever predicted (e.g., $P(\text{</s>} | \text{english food})$) and is not needed as extra context.
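
A minimal sketch of this padding for an arbitrary order $n$ follows. The helper name `padded_ngrams` is hypothetical, and using $n-1$ start tokens with a single end token is the convention assumed here.

```python
def padded_ngrams(words, n):
    """Return the n-grams of a sentence, padded with n-1 <s> tokens and one </s>."""
    tokens = ["<s>"] * (n - 1) + list(words) + ["</s>"]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


for gram in padded_ngrams("i want english food".split(), 3):
    print(gram)
# ('<s>', '<s>', 'i')
# ('<s>', 'i', 'want')
# ('i', 'want', 'english')
# ('want', 'english', 'food')
# ('english', 'food', '</s>')
```

With this padding, every word of the sentence, plus the end token, is the final element of exactly one n-gram.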

#### Large N-gram Datasets:

Large corpora enable higher-order n-grams:

- **COCA (Corpus of Contemporary American English)**: 1 billion words, with a dataset of the million most frequent n-grams (Davies, 2020).
- **Google Web 5-gram Corpus**: Derived from 1 trillion words of web text (Franz and Brants, 2006).
- **Google Books N-grams**: 800 billion tokens across multiple languages (Lin et al., 2012a).

These datasets provide enough data to estimate probabilities for trigrams, 4-grams, or 5-grams, which capture more context than bigrams.

#### Infini-gram (∞-gram):

- The **infini-gram project** (Liu et al., 2024) takes this further by allowing n-grams of arbitrary length (not limited to a fixed $n$).
- Instead of pre-computing huge n-gram count tables (which is computationally expensive), it uses **suffix arrays**, an efficient data structure, to compute n-gram probabilities at inference time.
- This enables modeling very long contexts (e.g., entire sentences or paragraphs) on massive corpora (e.g., 5 trillion tokens) without fixing $n$ in advance.
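
The idea of counting n-grams of any length from a suffix array can be illustrated with a toy sketch. This is not the infini-gram implementation: the function names are hypothetical, the naive construction only suits tiny corpora, and the truncated prefixes are materialized on each query purely for readability.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted by the token sequence that follows."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, suffix_array, ngram):
    """Count occurrences of `ngram` by binary search over the sorted suffixes."""
    ngram = list(ngram)
    k = len(ngram)
    # Truncating sorted suffixes to k tokens keeps them sorted, so bisect applies.
    prefixes = [tokens[i:i + k] for i in suffix_array]
    return bisect_right(prefixes, ngram) - bisect_left(prefixes, ngram)


def ngram_prob(tokens, suffix_array, context, word):
    """MLE estimate C(context + word) / C(context) for a context of any length."""
    context = list(context)
    denominator = count_ngram(tokens, suffix_array, context)
    if denominator == 0:
        return 0.0
    return count_ngram(tokens, suffix_array, context + [word]) / denominator


tokens = ("<s> I am Sam </s> <s> Sam I am </s> "
          "<s> I do not like green eggs and ham </s>").split()
sa = build_suffix_array(tokens)
print(ngram_prob(tokens, sa, ["<s>"], "I"))        # 2/3
print(ngram_prob(tokens, sa, ["<s>", "I"], "am"))  # 1/2
```

Because all occurrences of an n-gram are adjacent in the sorted suffix order, the same index answers queries for any $n$ without a separate count table per order.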

## 3. Efficiency Considerations

### Problem: Scalability

Large n-gram models, especially for higher-order n-grams or massive corpora, require significant storage and computational resources:

- Storing counts for all possible n-grams (e.g., 5-grams in a trillion-word corpus) can require terabytes of memory.
- Computing probabilities for all n-grams is time-intensive, especially during training or inference.

### Solutions: Optimization Techniques

The text outlines several techniques to make large n-gram models practical:

#### 1. Quantization:
- Probabilities are stored using 4-8 bits instead of full 8-byte floating-point numbers, reducing memory usage while maintaining sufficient precision.
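
One simple way to picture quantization is uniform binning of log probabilities into 8-bit codes. The text does not say which scheme toolkits actually use, so the uniform bins and the helper names below are assumptions.

```python
import math


def quantize(log_probs, bits=8):
    """Map each log probability to an integer code in [0, 2**bits - 1] using uniform bins."""
    lo, hi = min(log_probs), max(log_probs)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((lp - lo) / step) for lp in log_probs]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate log probabilities from the integer codes."""
    return [lo + c * step for c in codes]


log_probs = [math.log(p) for p in (0.25, 0.33, 0.0011, 0.5, 0.68)]
codes, lo, step = quantize(log_probs, bits=8)
restored = dequantize(codes, lo, step)
for original, approx in zip(log_probs, restored):
    print(f"{original:.4f} -> {approx:.4f}")
```

Each value now needs one byte plus two shared floats (`lo` and `step`), and the reconstruction error is at most half a bin width.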

#### 2. Compact Representations:
- **Word Strings**: Store words on disk and represent them in memory as 64-bit hashes to save space.
- **Reverse Tries**: Use specialized data structures like reverse tries to efficiently store and retrieve n-grams.

#### 3. Pruning:
- Remove low-frequency n-grams (e.g., those with counts below a threshold) to reduce model size.
- Use **entropy-based pruning** (Stolcke, 1998) to eliminate n-grams that contribute little to the model's predictive power, balancing size and accuracy.
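
Count-threshold pruning, the simpler of the two ideas, can be sketched as follows. The mini-corpus and the threshold of 2 are illustrative assumptions. Entropy-based pruning is more involved and is not shown.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigram_counts.update(zip(tokens, tokens[1:]))


def prune_by_count(ngram_counts, min_count):
    """Keep only n-grams observed at least `min_count` times."""
    return Counter({ng: c for ng, c in ngram_counts.items() if c >= min_count})


pruned = prune_by_count(bigram_counts, min_count=2)
print(len(bigram_counts), "->", len(pruned))  # 15 -> 2
print(pruned)
```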

#### 4. Efficient Toolkits:
- **KenLM** (Heafield, 2011; Heafield et al., 2013) is a toolkit optimized for building and querying n-gram models.
- It uses sorted arrays and merge sorts to efficiently construct probability tables with minimal passes through the corpus, reducing computation time.

#### 5. Suffix Arrays (Infini-gram):
- As mentioned, the infini-gram project uses suffix arrays to compute n-gram probabilities on-the-fly, avoiding the need to store massive pre-computed tables.

# Overview of Evaluating Language Models

Evaluating a language model involves determining how well it performs its intended task, such as predicting the next word or assigning probabilities to sequences. There are two main evaluation approaches:

1. **Extrinsic Evaluation**: Measures the model's performance when embedded in a real-world application (e.g., speech recognition or machine translation).
2. **Intrinsic Evaluation**: Measures the model's quality independently of any application, using metrics like perplexity (introduced in the next section).

To evaluate a model fairly, we need distinct datasets: a **training set**, a **development set (devset)**, and a **test set**. These ensure that the model is trained, tuned, and evaluated without bias, allowing it to generalize to new, unseen data.

## 1. Extrinsic Evaluation

### Definition

Extrinsic evaluation assesses a language model by embedding it in an end-to-end application and measuring the application's overall performance. For example:

- In **speech recognition**, the language model predicts likely word sequences to disambiguate similar-sounding phrases (e.g., "back soonish" vs. "bassoon dish").
- In **machine translation**, it helps generate fluent translations.

To compare two n-gram models, you would:

1. Integrate each model into the application (e.g., a speech recognizer).
2. Run the application with each model on the same task.
3. Measure the application's performance (e.g., accuracy of transcriptions or translations).

#### Example:
Suppose you have two bigram models for a speech recognition system. You run the system with each model and compare the **word error rate (WER)** on a test set of audio transcriptions. The model with the lower WER is better.
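
For reference, WER is usually computed as the word-level edit distance between the recognizer output and the reference transcript, divided by the reference length. The sketch below assumes whitespace tokenization, and the two hypothesis strings are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "i will be back soonish"
print(word_error_rate(reference, "i will be back soonish"))   # 0.0
print(word_error_rate(reference, "i will be bassoon dish"))   # 0.4
```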

### Why It's Ideal

Extrinsic evaluation directly measures whether the language model improves the application's performance, which is the ultimate goal. It ensures that any improvements in the model (e.g., using trigrams instead of bigrams) translate to real-world benefits.

### Challenges

- **Costly**: Running end-to-end NLP systems (e.g., speech recognizers or translators) is computationally expensive and time-consuming.
- **Complexity**: It requires the entire system to be set up and tested, which may not be feasible during rapid development or experimentation.

Due to these challenges, intrinsic evaluation is often used for quicker, application-independent assessments.

## 2. Intrinsic Evaluation

### Definition

Intrinsic evaluation measures the quality of a language model independently of any specific application, using a metric that reflects the model's ability to predict or model language. The standard intrinsic metric for language models is **perplexity** (covered in the next section), which quantifies how well the model predicts a test set.

### Why It's Useful

- **Speed**: Intrinsic metrics like perplexity can be computed quickly without running a full application.
- **Focus**: They isolate the language model's performance, making it easier to compare different models or improvements (e.g., bigram vs. trigram models).
- **Iterative Development**: Intrinsic evaluation allows rapid testing during model development.

### Limitation

Intrinsic metrics may not always correlate perfectly with application performance. For example, a model with lower perplexity might not always improve speech recognition accuracy if the metric doesn't capture task-specific nuances.

## 3. Training, Development, and Test Sets

To evaluate a machine learning model (including n-gram language models) fairly, we need three distinct datasets:

### 1. Training Set:
- **Purpose**: Used to learn the model's parameters.
- **For N-grams**: The training set is the corpus from which we compute counts and normalize them into probabilities (e.g., $P(w_n | w_{n-1}) = \frac{C(w_{n-1} \, w_n)}{C(w_{n-1})}$).
- **Example**: In the Berkeley Restaurant Project corpus (9332 sentences), the counts in Figure 3.1 (e.g., 827 for "i want") were derived from the training set.

### 2. Test Set:
- **Purpose**: A held-out dataset, not used during training, to evaluate the model's performance on unseen data.
- **Why Separate?**: Ensures an unbiased estimate of how well the model generalizes to new data. A model that performs well on the training set but poorly on the test set is likely **overfitting** (memorizing the training data rather than learning general patterns).
- **Example**: If the test set contains restaurant queries like "I want Chinese food," the model's ability to assign high probabilities to such sentences is tested.

### 3. Development Set (Devset):
- **Purpose**: Used for iterative testing and tuning during model development (e.g., experimenting with bigram vs. trigram models or adjusting smoothing parameters).
- **Why Needed?**: Repeatedly testing on the test set can lead to implicit tuning, where the model is inadvertently optimized for the test set's characteristics, reducing its generalizability. The devset allows frequent testing without touching the test set.
- **Example**: You might test different smoothing techniques (to handle zero-probability bigrams) on the devset to find the best approach before evaluating on the test set.

### Key Principles

- **No Overlap**: The training, devset, and test sets must be distinct to avoid bias. If the test set is included in the training set (**training on the test set**), the model will assign artificially high probabilities to test sentences, making its scores look better than they really are (for perplexity, artificially low).

- **Single Use of Test Set**: The test set should be used sparingly (ideally once) to report final performance, after all tuning is done on the devset. Repeated testing on the test set risks overfitting to its specific patterns.

- **Representative Data**: The test and devset should reflect the target application's language. For example:
  - For a chemistry lecture speech recognizer, use chemistry lecture transcripts.
  - For hotel booking translation, use hotel booking requests.
  - For general-purpose models, use diverse texts from multiple sources to avoid bias toward a single author or domain.

### Splitting the Data

- **Balancing Size**: The test set should be large enough to provide statistically significant results but small enough to maximize training data. The devset should be similarly sized and representative of the test set.

- **Statistical Power**: The test set size should allow detecting meaningful differences between models (e.g., whether a trigram model outperforms a bigram model).

- **Random vs. Careful Division**: For general-purpose models, split the corpus randomly but ensure diversity (e.g., avoid a test set from a single document or author). For specific applications, ensure the test set matches the target domain.
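
A random split into three disjoint sets might look like the sketch below. The 80/10/10 proportions, the fixed seed, and the placeholder sentences are assumptions for illustration, since the text does not prescribe specific sizes.

```python
import random


def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle sentences and cut them into disjoint training, dev, and test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


sentences = [f"sentence {i}" for i in range(9332)]  # placeholders, not real corpus data
train, dev, test = split_corpus(sentences)
print(len(train), len(dev), len(test))
```

Shuffling at the sentence level is the simplest choice. To keep a single document or author from dominating the test set, one could shuffle and split by document instead.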

## 4. Measuring Model Fit

### How to Compare Models

To evaluate which n-gram model is better (e.g., bigram vs. trigram), we:

1. Train both models on the training set to compute their probabilities.
2. Evaluate both on the test set by computing the probability they assign to the test sentences.
3. Choose the model that assigns higher probability to the test set, as it better predicts the test data.

### Why Higher Probability?

A language model's job is to predict the likelihood of word sequences. A model that assigns higher probabilities to the test set is better at capturing its patterns, indicating better generalization. This is the basis for intrinsic metrics like perplexity, which quantifies how well the model predicts the test set (lower perplexity corresponds to higher probability).

### Avoiding Bias

- **Training on the Test Set**: If a test sentence appears in the training set, the model will assign it an artificially high probability, skewing results. For example, if "I want Chinese food" is in both the training and test sets, the model's counts (e.g., for "want chinese") will inflate its probability.

- **Implicit Tuning**: Repeatedly testing on the test set while tweaking the model (e.g., adjusting parameters) can lead to overfitting to the test set. The devset mitigates this by providing a separate dataset for tuning.

# Overview of Perplexity

Perplexity is a measure of how well a language model predicts a test set. A better model assigns higher probabilities to the words or sequences in the test set, meaning it is less "surprised" by the actual words that appear. Perplexity is the standard intrinsic evaluation metric because:

- It is normalized by length, allowing comparison across test sets of different sizes.
- It is inversely related to probability, so a lower perplexity indicates a better model.
- It is widely used for both n-gram models and neural models (Chapter 9).

This section builds on the previous discussion of training, development, and test sets, emphasizing that perplexity is computed on a test set to evaluate generalization to unseen data.

## 1. Why Not Use Raw Probability?

As noted earlier, a language model's quality is determined by how well it predicts a test set, i.e., how high a probability it assigns to the test set's word sequence. However, raw probability has a major limitation:

- The probability of a sequence $P(w_1, w_2, ..., w_N)$ decreases as the sequence length $N$ increases because it involves multiplying many probabilities (each $\leq 1$).
- For example, for a sentence like "<s> i want english food </s>", the probability $P(\text{<s> i want english food </s>}) = 0.000031$ (from earlier) is very small due to the product of multiple probabilities.

Comparing raw probabilities across test sets of different lengths is unfair because longer sequences naturally have lower probabilities. Perplexity addresses this by normalizing the probability by the number of words (or tokens), providing a per-word measure.

## 2. Definition of Perplexity

Perplexity (often abbreviated as PP or PPL) is defined as the inverse probability of a test set, normalized by the number of words $N$. For a test set $W = w_1, w_2, ..., w_N$, the perplexity is:

$$\text{perplexity}(W) = P(w_1, w_2, ..., w_N)^{-1/N}$$

$$= \sqrt[N]{\frac{1}{P(w_1, w_2, ..., w_N)}}$$

Using the chain rule for probabilities, this can be expanded as:

$$\text{perplexity}(W) = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_1, ..., w_{i-1})}}$$

**Key Properties:**
- **Inverse Relationship:** Perplexity is inversely related to probability. A higher probability $P(W)$ results in a lower perplexity, indicating a better model.
- **Normalization:** The $N$-th root normalizes the probability by the number of words, making perplexity comparable across test sets of different lengths.
- **Sentence Boundaries:** If the model uses sentence boundary tokens (e.g., <s>, </s>), the end token </s> is predicted like any other word and counts toward $N$; the start token <s> serves only as context and is not counted.
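A minimal sketch of the definition, assuming the conditional probability of each counted token has already been produced by some model. The function name `perplexity` is illustrative. It works in log space because multiplying many small probabilities directly would underflow on long test sets.

```python
import math

def perplexity(cond_probs):
    """Perplexity from P(w_i | history) for each counted token."""
    n = len(cond_probs)
    log_prob = sum(math.log(p) for p in cond_probs)  # log P(W) via the chain rule
    return math.exp(-log_prob / n)                    # equals P(W) ** (-1/N)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # about 4
print(perplexity([0.5, 0.5, 0.5, 0.5]))      # about 2: higher probability, lower perplexity
```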

### Perplexity for N-gram Models

The exact computation depends on the n-gram model:

**Unigram Model** (no context, each word's probability is independent):
$$\text{perplexity}(W) = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i)}}$$

**Bigram Model** (probability depends on the previous word):
$$\text{perplexity}(W) = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_{i-1})}}$$

**Trigram Model:** Uses $P(w_i | w_{i-2}, w_{i-1})$, and so on for higher-order n-grams.

### Why Inverse?

The inverse $(1/P)$ arises from the information-theoretic roots of perplexity, related to cross-entropy rate (detailed in Section 3.7). Intuitively, a model that assigns high probabilities to the test set (i.e., predicts it well) has a smaller inverse probability, leading to lower perplexity.

## 3. Example: Perplexity in Practice

The text provides an example of perplexity computed for a 1.5 million-word test set from the Wall Street Journal (WSJ) corpus, using models trained on 38 million words from the same corpus:

| **Model** | **Perplexity** |
|-----------|----------------|
| Unigram   | 962            |
| Bigram    | 170            |
| Trigram   | 109            |

**Interpretation:**
- **Trigram Model (PPL = 109):** Has the lowest perplexity, meaning it assigns the highest probability to the test set. It uses two words of context, making it better at predicting the next word.
- **Bigram Model (PPL = 170):** Uses one word of context, so it's less predictive than the trigram model but better than the unigram model.
- **Unigram Model (PPL = 962):** Ignores context, assigning probabilities based only on word frequencies, resulting in the highest perplexity (worst performance).

### Why Lower Perplexity?

- A trigram model captures more context (e.g., $P(\text{blue} | \text{so beautifully})$) than a bigram model (e.g., $P(\text{blue} | \text{beautifully})$), which in turn captures more than a unigram model (e.g., $P(\text{blue})$).
- More context means the model is less "surprised" by the test set's words, assigning them higher probabilities and thus lowering perplexity.

### Example Calculation

Suppose a small test set $W = \text{<s> i want </s>}$ with bigram probabilities (from the Berkeley Restaurant corpus). The model predicts three tokens (i, want, </s>); the start token <s> serves only as context, so $N = 3$, matching the three factors in the product:

- $P(\text{i} | \text{<s>}) = 0.25$
- $P(\text{want} | \text{i}) = 0.33$
- $P(\text{</s>} | \text{want}) = 0.68$ (assumed, as it's not given in Figure 3.2)

$$P(W) = 0.25 \cdot 0.33 \cdot 0.68 \approx 0.0561$$

$$\text{perplexity}(W) = (0.0561)^{-1/3} = \sqrt[3]{\frac{1}{0.0561}} \approx \sqrt[3]{17.83} \approx 2.61$$

A perplexity of 2.61 means the model is, on average, as "uncertain" as if it were choosing among roughly 2.6 equally likely words at each step (see branching factor below).

## 4. Perplexity as Weighted Average Branching Factor

Perplexity can be interpreted as the weighted average branching factor of a language, where the branching factor is the number of possible next words at any point. This provides an intuitive way to understand perplexity.

### Deterministic Language Example

Consider a toy language with three words: $L = \{\text{red}, \text{blue}, \text{green}\}$, where any word can follow any other with equal probability (a unigram model, Language Model A):

- Each word has $P(\text{red}) = P(\text{blue}) = P(\text{green}) = \frac{1}{3}$.
- The branching factor is 3 (three possible next words).

For a test set $T = \text{red red red red blue}$ (5 tokens):

- **Probability:** $P(T) = P(\text{red}) \cdot P(\text{red}) \cdot P(\text{red}) \cdot P(\text{red}) \cdot P(\text{blue}) = \left(\frac{1}{3}\right)^5 = \frac{1}{243}$.
- **Perplexity:** $\text{perplexity}_A(T) = \left(\frac{1}{243}\right)^{-1/5} = 243^{1/5} = 3$

The perplexity equals the branching factor (3), reflecting that the model is as uncertain as if it were choosing among 3 equally likely words at each step.

### Probabilistic Language Example

Now consider a different unigram model (Language Model B) where:
- $P(\text{red}) = 0.8$, $P(\text{green}) = 0.1$, $P(\text{blue}) = 0.1$.

For the same test set $T = \text{red red red red blue}$:

- **Probability:** $P(T) = (0.8)^4 \cdot 0.1 = 0.4096 \cdot 0.1 = 0.04096$.
- **Perplexity:** $\text{perplexity}_B(T) = (0.04096)^{-1/5} \approx 1.89$

**Interpretation:**
- The perplexity (1.89) is lower than for Model A (3) because Model B assigns a higher probability to "red" (0.8), which appears 4 times in the test set, making it less "surprised" by the sequence.
- The weighted average branching factor is 1.89, meaning the model is as uncertain as if it were choosing among approximately 1.89 equally likely words on average, reflecting the skewed probabilities favoring "red."
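The two perplexities can be reproduced with a few lines of code. This is a sketch; the helper name `unigram_perplexity` and the dictionary representation of the models are assumptions made for the example.

```python
import math

def unigram_perplexity(model, tokens):
    """Perplexity of `tokens` under a unigram model given as {word: probability}."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

T = "red red red red blue".split()
model_a = {"red": 1 / 3, "blue": 1 / 3, "green": 1 / 3}
model_b = {"red": 0.8, "green": 0.1, "blue": 0.1}

print(round(unigram_perplexity(model_a, T), 2))  # 3.0
print(round(unigram_perplexity(model_b, T), 2))  # 1.89
```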

## 5. Practical Considerations

### Requirements for Fair Comparison

- **No Test Set in Training:** The language model must be trained without any knowledge of the test set to avoid artificially low perplexity (e.g., due to memorizing test sentences).
- **Identical Vocabularies:** Perplexity is only comparable between models with the same vocabulary. Different vocabularies affect probability distributions and perplexity values.
- **Sentence Boundaries:** If the model uses <s> and </s> tokens, the end token </s> is included in the probability computation and counts toward $N$; the start token <s> serves only as context and is not counted.

### Limitations

- **Intrinsic vs. Extrinsic:** A lower perplexity (better intrinsic performance) does not always translate to better performance in applications like speech recognition or machine translation. For example, a trigram model with lower perplexity might not improve speech recognition if it overemphasizes certain patterns irrelevant to the task.
- **Correlation with Task Performance:** Despite this, perplexity often correlates with task improvements, making it a convenient metric. However, extrinsic evaluation (e.g., word error rate in speech recognition) should confirm improvements when possible.

# What is Sampling?

Sampling from a language model means generating random sentences according to the probability distribution defined by the model. Since a language model assigns probabilities to word sequences, sentences to which the model assigns high probability are generated often, and low-probability sentences only rarely. This technique, first proposed by Shannon (1948) and Miller and Selfridge (1950), helps visualize the knowledge encoded in the model.

## How Sampling Works

### Unigram Model

In a unigram model, each word's probability $P(w_i)$ is independent of context.

Imagine a number line from 0 to 1, where each word occupies an interval proportional to its probability (see Figure 3.3). For example:
- Common words like "the" (e.g., $P(\text{the}) = 0.06$) occupy larger intervals.
- Rare words like "polyphonic" (e.g., $P(\text{polyphonic}) = 0.0000018$) occupy tiny intervals.

To sample a sentence:
1. Generate a random number between 0 and 1.
2. Select the word whose interval contains that number (e.g., 0.05 falls in "the"'s interval).
3. Repeat until the sentence-end token $\langle/s\rangle$ is generated.

**Result:** Sentences lack coherence because unigram models ignore word order (e.g., "the of to polyphonic $\langle/s\rangle$").
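The number-line procedure can be sketched as below. The five-word vocabulary and its probabilities are hypothetical toy values (not the figures from the text), and the helper names `pick_word` and `sample_sentence` are illustrative.

```python
import bisect
import itertools
import random

def pick_word(words, probs, r):
    """Return the word whose interval on [0, 1) contains the number r."""
    cumulative = list(itertools.accumulate(probs))  # right edge of each interval
    index = bisect.bisect_right(cumulative, r)
    return words[min(index, len(words) - 1)]        # guard against float rounding at the top

# Hypothetical toy unigram distribution (sums to 1).
words = ["the", "of", "to", "polyphonic", "</s>"]
probs = [0.40, 0.20, 0.20, 0.05, 0.15]

def sample_sentence(rng):
    """Draw words independently until the sentence-end token appears."""
    sentence = []
    while True:
        word = pick_word(words, probs, rng.random())
        sentence.append(word)
        if word == "</s>":
            return sentence

print(pick_word(words, probs, 0.05))   # the
print(sample_sentence(random.Random(0)))
```

Passing the random number `r` in explicitly keeps `pick_word` deterministic, which makes the interval lookup easy to check by hand.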

### Bigram Model

In a bigram model, the probability of a word depends on the previous word: $P(w_i | w_{i-1})$.

Sampling process:
1. Start with the sentence-begin token $\langle s\rangle$.
2. Generate a random number and select the next word based on the distribution $P(w | \langle s\rangle)$.
3. If the chosen word is $w_1$, generate the next word using $P(w | w_1)$, and so on, until $\langle/s\rangle$ is generated.

**Result:** Sentences have local word-to-word coherence (e.g., "I want to eat" is more likely than "I to want eat").
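The bigram sampling loop can be sketched as follows. The probability table is a hypothetical toy example (not the BeRP figures), and `sample_bigram_sentence` is an illustrative name; `random.Random.choices` from the standard library performs the weighted draw.

```python
import random

# Hypothetical toy bigram table: bigram_probs[prev][next] = P(next | prev).
bigram_probs = {
    "<s>":  {"i": 0.7, "we": 0.3},
    "i":    {"want": 0.8, "eat": 0.2},
    "we":   {"want": 1.0},
    "want": {"to": 0.7, "food": 0.3},
    "to":   {"eat": 1.0},
    "eat":  {"food": 0.5, "</s>": 0.5},
    "food": {"</s>": 1.0},
}

def sample_bigram_sentence(table, rng, max_len=20):
    """Walk from <s>, drawing each word from P(w | previous word)."""
    sentence = ["<s>"]
    while sentence[-1] != "</s>" and len(sentence) < max_len:
        dist = table[sentence[-1]]
        next_word = rng.choices(list(dist), weights=list(dist.values()))[0]
        sentence.append(next_word)
    return sentence

print(" ".join(sample_bigram_sentence(bigram_probs, random.Random(1))))
```

Every adjacent pair in the output is a bigram with non-zero probability in the table, which is where the local coherence comes from.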

### Higher-Order N-grams

Trigram or 4-gram models use more context (e.g., $P(w_i | w_{i-2}, w_{i-1})$), leading to more coherent sentences.

## Purpose of Sampling

- Sampling reveals what the model has learned about language structure, syntax, and domain-specific patterns.
- It helps visualize differences between models (e.g., unigram vs. bigram) or corpora (e.g., Shakespeare vs. Wall Street Journal).

## Example: Sampling from Berkeley Restaurant Corpus

Using the bigram probabilities from Figure 3.2 (Berkeley Restaurant corpus):

- Start with $\langle s\rangle$, where $P(i | \langle s\rangle) = 0.25$.
- Generate a random number (e.g., 0.2) and select "i" (since 0.2 falls in its interval).
- For "i", use $P(w | i)$:
  - $P(\text{want} | i) = 0.33$
  - $P(\text{eat} | i) = 0.0036$, etc.
- Generate another number (e.g., 0.3) and select "want".
- Continue: $P(w | \text{want})$, e.g., $P(\text{to} | \text{want}) = 0.66$.
- A possible sampled sentence: "$\langle s\rangle$ i want to eat $\langle/s\rangle$".

This sentence reflects the model's learned patterns (e.g., "want to" is common). Unseen bigrams (e.g., "want chinese" with $P(\text{chinese} | \text{want}) = 0$) can never be generated by the unsmoothed model, and rare bigrams appear only occasionally.

# Generalizing vs. Overfitting the Training Set

## Dependence on the Training Corpus

N-gram models are trained on a specific corpus, which influences their probabilities:

**Specific Facts:** Probabilities reflect the corpus's content. For example, the Berkeley Restaurant corpus has high probabilities for bigrams like "i want" or "want to" due to its restaurant-query domain.

**Increasing N:** Higher-order n-grams (e.g., trigrams, 4-grams) capture more context, making generated sentences more coherent but also more likely to reproduce exact sequences from the training corpus.

## Visualization via Sampling

The text uses sampling to illustrate these effects with two corpora: Shakespeare and Wall Street Journal (WSJ).

### Shakespeare Corpus (Figure 3.4)

**Corpus Size:** 884,647 words, vocabulary $V = 29,066$.

**Sampling Results:**
- **Unigram:** "To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have" – Incoherent, random words with no sentence structure.
- **Bigram:** "Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry." – Some local coherence (e.g., punctuation, verb forms), but not fully grammatical.
- **Trigram:** "Fly, and will rid me these news of price." – More Shakespeare-like, with coherent phrases.
- **4-gram:** "It cannot be but so." – Almost identical to a line from King John, showing overfitting to the small corpus.

**Why Overfitting?**
The Shakespeare corpus is small, so the 4-gram model has sparse probability matrices (e.g., $V^4 \approx 7 \times 10^{17}$ possible 4-grams, but only 884,647 words).
After generating "It cannot be," only seven words (e.g., "but," "I") are possible next, as the model has memorized specific sequences.

### Wall Street Journal Corpus (Figure 3.5)

**Corpus Size:** 40 million words, larger and more diverse than Shakespeare's works.

**Sampling Results:**
- **Unigram:** "Months the my and issue of year foreign new exchange's september" – Incoherent, random financial terms.
- **Bigram:** "Last December through the way to preserve the Hudson corporation" – Some coherence, but fragmented.
- **Trigram:** "They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions" – Coherent, financial-domain-specific phrases.

**Observation:** The WSJ sentences reflect its financial and formal tone, with no overlap with Shakespeare's poetic style, despite both being in English.

## Implications

**Corpus Dependence:** The lack of overlap between Shakespeare and WSJ highlights that n-gram models are corpus-specific. A model trained on Shakespeare is poor at predicting WSJ text, and vice versa.

**Overfitting with Higher N:**
- Higher-order n-grams (e.g., 4-grams) capture longer dependencies, making sentences more coherent.
- However, in small corpora (like Shakespeare's), they overfit, reproducing exact training sequences (e.g., "It cannot be but so").
- In larger corpora (like WSJ), overfitting is less severe due to more diverse data, but domain-specific patterns dominate.

## Addressing Corpus Dependence

To build effective n-gram models, the training corpus must match the target application:

**Genre Matching:**
- For legal document translation, use a corpus of legal texts.
- For a question-answering system, use a corpus of questions.

**Dialect and Variety:**
- For social media (e.g., tweets), use a corpus reflecting the target dialect, such as African American English (AAE) with words like "finna" (e.g., "Bored af den my phone finna die!!!") or Nigerian Pidgin (e.g., "R u a wizard or wat gan sef").
- These language varieties have n-gram patterns that differ from those of standard English.

**Subword Tokenization:**
- To handle out-of-vocabulary (OOV) words (e.g., "Jurafsky" in the test set but not in training), language models often use subword tokens (e.g., Byte Pair Encoding from Chapter 2).
- Any word can be broken into known subword units (e.g., "Jurafsky" as "Ju##raf##sky" or individual letters), ensuring no truly unseen tokens.

# What Are Zeros?

In n-gram models, probabilities are estimated using Maximum Likelihood Estimation (MLE), where the probability of a word given its context is based on corpus counts:

$$P(w_n | w_{n-1}) = \frac{C(w_{n-1}\ w_n)}{C(w_{n-1})}$$

However, if a bigram $w_{n-1}\ w_n$ never appears in the training corpus (i.e., $C(w_{n-1}\ w_n) = 0$), its probability is zero, even if the sequence is valid in the language. For example:

- In the BeRP corpus, the bigram "want chinese" has a count of 0 (Figure 3.1), so $P(\text{chinese} | \text{want}) = 0$, despite "want Chinese" being a plausible phrase in restaurant queries.
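The way MLE produces zeros can be seen on a tiny example. The three-sentence corpus below is hypothetical, constructed only so that "want chinese" never occurs; the helper name `mle_bigram` is illustrative.

```python
from collections import Counter

# Hypothetical tokenized training corpus.
corpus = [
    "<s> i want to eat </s>".split(),
    "<s> i want to eat lunch </s>".split(),
    "<s> i like chinese food </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def mle_bigram(prev, word):
    """P(word | prev) = C(prev word) / C(prev); unseen bigrams count as 0."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("want", "to"))       # 1.0
print(mle_bigram("want", "chinese"))  # 0.0: never seen, although plausible
```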

## Why Are Zeros a Problem?

1. **Underestimation:** Zero probabilities underestimate the likelihood of valid sequences that happen to be absent in the training corpus (e.g., "ruby slippers" might be missing despite being a valid phrase).

2. **Perplexity Failure:** Perplexity, the standard intrinsic evaluation metric, is defined as:
   $$\text{perplexity}(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, ..., w_N)}}$$
   
   If any bigram in the test set has $P(w_i | w_{i-1}) = 0$, the entire sequence probability $P(W) = 0$, making perplexity undefined (division by zero). For example, the sentence "$\langle s\rangle$ i want chinese food $\langle/s\rangle$" had $P(\text{chinese} | \text{want}) = 0$, resulting in a zero probability and undefined perplexity.

## Solution: Smoothing

Smoothing (or discounting) addresses zeros by redistributing probability mass from frequent n-grams to unseen ones. This ensures that all possible n-grams have non-zero probabilities, enabling robust probability estimates and valid perplexity calculations.

# 2. Laplace (Add-One) Smoothing

## Concept

Laplace smoothing (also called add-one smoothing) adds a count of 1 to every possible n-gram, including those with zero counts, before normalizing to compute probabilities. This ensures that no n-gram has a zero probability.

## Unigram Case

For unigrams, the MLE probability is:
$$P(w_i) = \frac{c_i}{N}$$

where $c_i$ is the count of word $w_i$, and $N$ is the total number of word tokens in the corpus.

With Laplace smoothing, we add 1 to each word's count and adjust the denominator to account for the $V$ additional counts (where $V$ is the vocabulary size):

$$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$$

**Why Add $V$ to the Denominator?** There are $V$ words in the vocabulary, each getting an extra count of 1, so the total count increases by $V$. Without this adjustment, the probabilities would not sum to 1.

## Bigram Case

For bigrams, the MLE probability is:
$$P_{\text{MLE}}(w_n | w_{n-1}) = \frac{C(w_{n-1}\ w_n)}{C(w_{n-1})}$$

With Laplace smoothing, we add 1 to each bigram count and add $V$ to the denominator (since there are $V$ possible bigrams for each $w_{n-1}$):

$$P_{\text{Laplace}}(w_n | w_{n-1}) = \frac{C(w_{n-1}\ w_n) + 1}{C(w_{n-1}) + V}$$

This can also be written as:
$$P_{\text{Laplace}}(w_n | w_{n-1}) = \frac{C(w_{n-1}\ w_n) + 1}{\sum_w (C(w_{n-1}\ w) + 1)}$$

## Example: BeRP Corpus

Using the BeRP corpus (Figures 3.1, 3.6, 3.7, 3.8), with vocabulary size $V = 1446$:

**Original Counts (Figure 3.1):**
- $C(\text{want to}) = 608$, $C(\text{want}) = 927$.
- $C(\text{want chinese}) = 0$.

**Smoothed Counts (Figure 3.6):**
- Add 1 to each bigram: $C(\text{want to}) + 1 = 609$, $C(\text{want chinese}) + 1 = 1$.

**Smoothed Probabilities (Figure 3.7):**
- Unigram count adjustment: $C(\text{want}) + V = 927 + 1446 = 2373$.
- $P_{\text{Laplace}}(\text{to} | \text{want}) = \frac{608 + 1}{927 + 1446} = \frac{609}{2373} \approx 0.26$.
- $P_{\text{Laplace}}(\text{chinese} | \text{want}) = \frac{0 + 1}{927 + 1446} = \frac{1}{2373} \approx 0.00042$.

**Comparison to MLE:**
- MLE: $P_{\text{MLE}}(\text{to} | \text{want}) = \frac{608}{927} \approx 0.66$, $P_{\text{MLE}}(\text{chinese} | \text{want}) = 0$.
- Laplace smoothing reduces the probability of frequent bigrams (e.g., "want to" drops from 0.66 to 0.26) to give non-zero probability to unseen bigrams (e.g., "want chinese").
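The smoothed values can be checked with a short sketch, using the counts quoted in this example. The function name `laplace_bigram` is an illustrative assumption.

```python
def laplace_bigram(bigram_count, prev_count, vocab_size):
    """Add-one estimate: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_count + 1) / (prev_count + vocab_size)

V = 1446       # vocabulary size used in the example
C_WANT = 927   # C(want)

print(round(laplace_bigram(608, C_WANT, V), 2))  # 0.26  ("want to")
print(round(laplace_bigram(0, C_WANT, V), 5))    # 0.00042  (a zero-count bigram after "want")
print(round(608 / C_WANT, 2))                    # 0.66  (unsmoothed MLE for "want to")
```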

## Adjusted Counts

To understand the impact of smoothing, we can compute adjusted counts ($C^*$) that, when divided by the original unigram count, yield the smoothed probability:

$$P_{\text{Laplace}}(w_n | w_{n-1}) = \frac{C^*(w_{n-1}\ w_n)}{C(w_{n-1})}$$

$$C^*(w_{n-1}\ w_n) = \frac{(C(w_{n-1}\ w_n) + 1) \cdot C(w_{n-1})}{C(w_{n-1}) + V}$$

**Example (Figure 3.8):**
- For "want to": $C(\text{want to}) = 608$, $C(\text{want}) = 927$.
- $C^*(\text{want to}) = \frac{(608 + 1) \cdot 927}{927 + 1446} = \frac{609 \cdot 927}{2373} \approx 238$.
- For "want chinese": $C(\text{want chinese}) = 0$.
- $C^*(\text{want chinese}) = \frac{(0 + 1) \cdot 927}{927 + 1446} = \frac{927}{2373} \approx 0.39$.

## Impact of Laplace Smoothing

**Zeros Eliminated:** Unseen bigrams like "want chinese" now have non-zero probabilities (e.g., 0.00042), enabling valid perplexity calculations.

**Discounting:** The probabilities of frequent bigrams (e.g., "want to") are reduced significantly (from 0.66 to 0.26). The discount (ratio of new to old counts) is:
$$d = \frac{C^*(\text{want to})}{C(\text{want to})} = \frac{238}{608} \approx 0.39$$

For "chinese food" ($C = 82$, $C^* = 8.2$):
$$d = \frac{8.2}{82} = 0.10$$

**Issue:** Laplace smoothing moves too much probability mass to unseen n-grams, drastically reducing probabilities of frequent n-grams (e.g., a factor of 10 for "chinese food"). This makes it suboptimal for language modeling, though it's useful for tasks like text classification (Chapter 4).
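The adjusted counts and discounts can be recomputed as a check. This sketch uses the counts quoted in these notes ($C(\text{chinese}) = 158$ is the unigram count); the function name `adjusted_count` is illustrative.

```python
def adjusted_count(bigram_count, prev_count, vocab_size):
    """C* = (C(prev word) + 1) * C(prev) / (C(prev) + V)."""
    return (bigram_count + 1) * prev_count / (prev_count + vocab_size)

V = 1446
cases = {
    "want to":      (608, 927),  # C(want to), C(want)
    "chinese food": (82, 158),   # C(chinese food), C(chinese)
}
for bigram, (count, prev_count) in cases.items():
    c_star = adjusted_count(count, prev_count, V)
    # Print the adjusted count and the discount d = C* / C.
    print(bigram, round(c_star, 1), round(c_star / count, 2))
# want to 237.9 0.39
# chinese food 8.2 0.1
```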

# 3. Add-k Smoothing

## Concept

Add-k smoothing is a refinement of Laplace smoothing that adds a smaller fractional count $k$ (e.g., 0.5 or 0.01) instead of 1 to each n-gram count, reducing the aggressive redistribution of probability mass:

$$P_{\text{Add-k}}(w_n | w_{n-1}) = \frac{C(w_{n-1}\ w_n) + k}{C(w_{n-1}) + kV}$$

**Choosing $k$:** The value of $k$ is typically optimized on a development set to minimize perplexity or improve task performance.

**Advantages:** Less aggressive than Laplace smoothing, preserving more probability for frequent n-grams.

**Limitations:** Add-k smoothing still produces counts with high variance and inappropriate discounts, making it suboptimal for language modeling (Gale and Church, 1994). It's more useful for tasks like text classification.

## Example

For $k = 0.5$, $C(\text{want to}) = 608$, $C(\text{want}) = 927$, $V = 1446$:

- Denominator: $C(\text{want}) + kV = 927 + 0.5 \cdot 1446 = 927 + 723 = 1650$.
- $P_{\text{Add-k}}(\text{to} | \text{want}) = \frac{608 + 0.5}{1650} = \frac{608.5}{1650} \approx 0.37$.
- $P_{\text{Add-k}}(\text{chinese} | \text{want}) = \frac{0 + 0.5}{1650} = \frac{0.5}{1650} \approx 0.00030$.

Compared to Laplace ($P_{\text{Laplace}}(\text{to} | \text{want}) = 0.26$), add-k assigns more probability to frequent bigrams and less to unseen ones, but it still may not be ideal for language modeling.
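The effect of $k$ can be explored with a small sketch. The function name `add_k_bigram` is illustrative, and the value $k = 0.01$ is simply one of the example magnitudes mentioned above, not a tuned setting.

```python
def add_k_bigram(bigram_count, prev_count, vocab_size, k):
    """Add-k estimate: (C(prev word) + k) / (C(prev) + k * V)."""
    return (bigram_count + k) / (prev_count + k * vocab_size)

V, C_WANT = 1446, 927
for k in (1.0, 0.5, 0.01):
    seen = add_k_bigram(608, C_WANT, V, k)   # "want to"
    unseen = add_k_bigram(0, C_WANT, V, k)   # a zero-count bigram after "want"
    print(k, round(seen, 2), round(unseen, 6))
# 1.0 0.26 0.000421
# 0.5 0.37 0.000303
# 0.01 0.65 1.1e-05
```

As $k$ shrinks, the estimate for the frequent bigram approaches its MLE value (0.66) while unseen bigrams keep a small but non-zero probability.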

# 1. The Zero Probability Problem (Recap)

As discussed in Section 3.6, n-gram models using Maximum Likelihood Estimation (MLE) assign zero probability to unseen n-grams (e.g., "want chinese" in the BeRP corpus has $C(\text{want chinese}) = 0$, so $P(\text{chinese} | \text{want}) = 0$). This causes two issues:

1. **Underestimation:** Valid sequences (e.g., "ruby slippers") are underestimated if absent from the training corpus.
2. **Perplexity Failure:** If any n-gram in a test set has zero probability, the entire sequence probability is zero, making perplexity undefined ($\text{perplexity}(W) = \sqrt[N]{\frac{1}{P(W)}}$).

Smoothing redistributes probability mass to unseen n-grams. Interpolation and backoff are advanced smoothing techniques that leverage lower-order n-grams (e.g., bigrams, unigrams) to estimate probabilities for unseen higher-order n-grams (e.g., trigrams).

# 2. Language Model Interpolation

## Concept

Interpolation estimates the probability of an n-gram by combining probabilities from different n-gram orders (e.g., trigram, bigram, unigram) in a weighted average. Even when a trigram $w_{n-2}\ w_{n-1}\ w_n$ is unseen, the bigram $P(w_n | w_{n-1})$ and unigram $P(w_n)$ terms still contribute, so the estimate is non-zero whenever either lower-order probability is non-zero.

The simple linear interpolation formula for a trigram is:

$$\hat{P}(w_n | w_{n-2}\ w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n | w_{n-1}) + \lambda_3 P(w_n | w_{n-2}\ w_{n-1})$$

where:
- $\lambda_1, \lambda_2, \lambda_3$ are weights that sum to 1 ($\lambda_1 + \lambda_2 + \lambda_3 = 1$).
- $P(w_n)$: Unigram probability.
- $P(w_n | w_{n-1})$: Bigram probability.
- $P(w_n | w_{n-2}\ w_{n-1})$: Trigram probability.

## Context-Conditioned Interpolation

A more sophisticated approach adjusts the weights based on the context ($w_{n-2}\ w_{n-1}$):

$$\hat{P}(w_n | w_{n-2}\ w_{n-1}) = \lambda_1(w_{n-2:n-1}) P(w_n) + \lambda_2(w_{n-2:n-1}) P(w_n | w_{n-1}) + \lambda_3(w_{n-2:n-1}) P(w_n | w_{n-2}\ w_{n-1})$$

**Why Context-Conditioned?** If the context bigram $w_{n-2}\ w_{n-1}$ occurs often in training, the trigram estimates conditioned on it are more reliable, so $\lambda_3$ can be higher. If the context is rare, the model relies more on bigram or unigram probabilities (higher $\lambda_2$ or $\lambda_1$).

## Learning the Weights ($\lambda$)

- The weights are learned from a held-out corpus (a separate portion of training data, distinct from the main training and test sets).
- **Goal:** Choose $\lambda$ values that maximize the likelihood (or minimize perplexity) of the held-out corpus.
- **Method:** The EM (Expectation-Maximization) algorithm (Jelinek and Mercer, 1980) iteratively optimizes the $\lambda$ values to find locally optimal weights.
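The idea of fitting the weights on held-out data can be illustrated with a coarse grid search. Note that this is a simpler stand-in for EM, used here only because it fits in a few lines; the held-out probabilities below are hypothetical, and the function names are illustrative.

```python
import math

# Hypothetical held-out data: for each token, the (unigram, bigram, trigram)
# probabilities that the three unsmoothed models assign to it.
held_out = [
    (0.01, 0.20, 0.50),
    (0.02, 0.00, 0.00),   # unseen bigram and trigram
    (0.05, 0.30, 0.00),   # unseen trigram
    (0.01, 0.10, 0.40),
]

def log_likelihood(lambdas, data):
    """Sum of log interpolated probabilities over the held-out tokens."""
    l1, l2, l3 = lambdas
    return sum(math.log(l1 * p1 + l2 * p2 + l3 * p3) for p1, p2, p3 in data)

def grid_search(data, steps=10):
    """Try every (l1, l2, l3) on a grid with l1 > 0 and l1 + l2 + l3 = 1."""
    best = None
    for i in range(1, steps + 1):            # l1 > 0 keeps every probability non-zero
        for j in range(0, steps + 1 - i):
            lambdas = (i / steps, j / steps, (steps - i - j) / steps)
            score = log_likelihood(lambdas, data)
            if best is None or score > best[0]:
                best = (score, lambdas)
    return best[1]

print(grid_search(held_out))
```

Maximizing held-out log-likelihood is equivalent to minimizing held-out perplexity, since perplexity is a monotonically decreasing function of the log-likelihood for a fixed number of tokens.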

## Example: BeRP Corpus

Suppose we want to estimate $P(\text{chinese} | \text{i want})$ for the trigram "i want chinese," which is unseen ($C(\text{i want chinese}) = 0$).

**Unsmoothed Probabilities (from Figure 3.2):**
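- $P(\text{chinese}) = \frac{158}{9332} \approx 0.0169$ (illustrative only: 9332 is the number of sentences in the BeRP sample, used here as a stand-in for the total token count, so the true unigram probability is smaller).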
- $P(\text{chinese} | \text{want}) = 0$ (since $C(\text{want chinese}) = 0$).
- $P(\text{chinese} | \text{i want}) = 0$ (unseen trigram).

**Simple Interpolation:**
- Assume $\lambda_1 = 0.2$, $\lambda_2 = 0.3$, $\lambda_3 = 0.5$.
- $\hat{P}(\text{chinese} | \text{i want}) = 0.2 \cdot 0.0169 + 0.3 \cdot 0 + 0.5 \cdot 0 = 0.00338$.

**Context-Conditioned Interpolation:**
- The weights depend only on the context "i want", not on the word being predicted. A context with reliable trigram statistics would receive a high $\lambda_3(\text{i want})$ (e.g., 0.7); a context with sparse statistics would shift weight toward $\lambda_1$ and $\lambda_2$.
- For illustration, suppose the held-out data favor the lower orders for this context: $\lambda_1(\text{i want}) = 0.4$, $\lambda_2(\text{i want}) = 0.5$, $\lambda_3(\text{i want}) = 0.1$.
- $\hat{P}(\text{chinese} | \text{i want}) = 0.4 \cdot 0.0169 + 0.5 \cdot 0 + 0.1 \cdot 0 = 0.00676$.

Interpolation ensures a non-zero probability for "i want chinese," unlike MLE, which assigns zero.
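Both calculations can be reproduced with a small helper. This is a sketch; the function name `interpolate` is an assumption, and the input probabilities are the ones used in the example.

```python
def interpolate(p_uni, p_bi, p_tri, lambdas):
    """Linear interpolation of unigram, bigram and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Values from the example: P(chinese) ~ 0.0169, bigram and trigram both zero.
print(round(interpolate(0.0169, 0.0, 0.0, (0.2, 0.3, 0.5)), 5))  # 0.00338
print(round(interpolate(0.0169, 0.0, 0.0, (0.4, 0.5, 0.1)), 5))  # 0.00676
```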

# 3. Stupid Backoff

## Concept

Backoff is an alternative to interpolation where, if an n-gram has zero counts, the model "backs off" to a lower-order n-gram (e.g., from trigram to bigram to unigram) until a non-zero count is found. Unlike interpolation, backoff uses only one n-gram order at a time, not a weighted combination.

Stupid backoff (Brants et al., 2007) is a simplified, non-probabilistic backoff method that:
- Does not discount higher-order n-gram probabilities to maintain a valid probability distribution.
- Uses a fixed weight $\lambda$ (e.g., 0.4) when backing off to lower-order n-grams.
- Produces scores ($S$) rather than true probabilities, as the scores may not sum to 1.

The formula for stupid backoff is:

$$S(w_i | w_{i-N+1:i-1}) = \begin{cases} 
\frac{\text{count}(w_{i-N+1:i})}{\text{count}(w_{i-N+1:i-1})} & \text{if } \text{count}(w_{i-N+1:i}) > 0 \\
\lambda S(w_i | w_{i-N+2:i-1}) & \text{otherwise}
\end{cases}$$

- **Base Case:** For unigrams, $S(w_i) = \frac{\text{count}(w_i)}{N}$.
- **Process:**
  1. Check the highest-order n-gram (e.g., trigram).
  2. If its count is zero, back off to the (n-1)-gram (e.g., bigram) and multiply by $\lambda$.
  3. Continue until a non-zero count is found or reach the unigram.

## Example: BeRP Corpus

Estimate $S(\text{chinese} | \text{i want})$, with $\lambda = 0.4$.

- **Trigram:** $C(\text{i want chinese}) = 0$, so back off to bigram.
- **Bigram:** $C(\text{want chinese}) = 0$ (Figure 3.1), so back off to unigram.
- **Unigram:** $S(\text{chinese}) = \frac{158}{9332} \approx 0.0169$.

**Backoff:**
- From bigram to unigram: $S(\text{chinese} | \text{want}) = 0.4 \cdot S(\text{chinese}) = 0.4 \cdot 0.0169 \approx 0.00676$.
- From trigram to bigram: $S(\text{chinese} | \text{i want}) = 0.4 \cdot S(\text{chinese} | \text{want}) = 0.4 \cdot 0.00676 \approx 0.0027$.
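The recursion can be written directly from the formula. This is a minimal sketch: the function name, the tuple-keyed `counts` dictionary, and the use of 9332 as the token total (taken from the example above) are assumptions for illustration.

```python
def stupid_backoff(ngram, counts, total_tokens, lam=0.4):
    """Score S(last word | preceding words) for `ngram`, a tuple of words.

    `counts` maps word tuples of any order to their corpus counts.
    """
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_tokens       # unigram base case
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]         # relative frequency
    # Unseen n-gram: drop the oldest context word and scale by lambda.
    return lam * stupid_backoff(ngram[1:], counts, total_tokens, lam)

# Counts from the example; 9332 is the token total it assumes.
counts = {("chinese",): 158, ("want",): 927, ("want", "to"): 608}
print(round(stupid_backoff(("i", "want", "chinese"), counts, 9332), 4))  # 0.0027
print(round(stupid_backoff(("want", "to"), counts, 9332), 2))            # 0.66
```

The first call backs off twice (trigram to bigram to unigram), so the unigram score is multiplied by $0.4^2$; the second call finds a seen bigram and returns its relative frequency without any backoff.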

## Key Features

- **No Discounting:** Unlike backoff methods that discount (e.g., Katz backoff), stupid backoff does not adjust higher-order counts, simplifying computation.
- **Non-Probabilistic:** The scores $S$ do not form a valid probability distribution, but they work well for tasks like ranking or generation.
- **Practicality:** Brants et al. (2007) found $\lambda = 0.4$ effective for large-scale applications, such as Google's web-scale language models.

# Comparison of Interpolation and Stupid Backoff

## Interpolation
- Combines probabilities from all n-gram orders (trigram, bigram, unigram) with learned weights.
- Ensures a valid probability distribution ($\sum_w \hat{P}(w | \text{context}) = 1$).
- Requires a held-out corpus to optimize $\lambda$.
- Computationally more complex due to weighted sums.

## Stupid Backoff
- Uses only one n-gram order at a time, backing off to lower orders if needed.
- Does not produce a valid probability distribution, only scores.
- Simpler and faster, with a fixed $\lambda$.
- Effective for large-scale models where speed is critical.

# Perplexity and Information Theory

## 1. Background: Perplexity Recap

In Section 3.3, perplexity was introduced as a way to evaluate n-gram models by measuring how well they predict a test set. For a test set $W = w_1, w_2, ..., w_N$, perplexity is defined as:

$$\text{Perplexity}(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, ..., w_N)}} = P(w_1, w_2, ..., w_N)^{-1/N}$$

- A lower perplexity indicates a better model, as it assigns higher probability to the test set.
- For bigram models, this expands to:

$$\text{Perplexity}(W) = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_{i-1})}}$$

The question addressed in Section 3.7 is: Why use inverse probability, and how does perplexity relate to information theory? The answer lies in entropy and cross-entropy, which provide a theoretical foundation for perplexity.

## 2. Entropy

### Definition

Entropy ($H(X)$) measures the uncertainty or information content of a random variable $X$ (e.g., words, letters, or parts of speech) with a probability distribution $p(x)$ over its possible values $\chi$:

$$H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)$$

- **Units**: If the logarithm is base 2, entropy is measured in bits.
- **Interpretation**: Entropy represents the average number of bits needed to encode an outcome of $X$ using an optimal coding scheme. Higher entropy means more uncertainty (more bits needed).

### Example: Horse Race

The text uses an example from Cover and Thomas (1991) to illustrate entropy:

- **Scenario**: A horse race with 8 horses, where we need to send a message to a bookie to bet on one horse.
- **Uniform Distribution**: If each horse has an equal probability ($p = \frac{1}{8}$):
  - A simple coding scheme uses 3 bits per horse (e.g., horse 1 = 001, horse 2 = 010, ..., horse 8 = 000).
  - Entropy:
  
  $$H(X) = -\sum_{i=1}^8 \frac{1}{8} \log_2 \frac{1}{8} = -8 \cdot \left( \frac{1}{8} \cdot (-3) \right) = 3 \text{ bits}$$
  
  This confirms that 3 bits are needed on average, matching the simple coding scheme.

- **Skewed Distribution**: If the probabilities are unequal (e.g., horse 1: $\frac{1}{2}$, horse 2: $\frac{1}{4}$, horse 3: $\frac{1}{8}$, horse 4: $\frac{1}{16}$, horses 5–8: $\frac{1}{64}$):
  - Entropy:
  
  $$H(X) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{4} \log_2 \frac{1}{4} + \frac{1}{8} \log_2 \frac{1}{8} + \frac{1}{16} \log_2 \frac{1}{16} + 4 \cdot \frac{1}{64} \log_2 \frac{1}{64} \right)$$
  
  $$= -\left( \frac{1}{2} \cdot (-1) + \frac{1}{4} \cdot (-2) + \frac{1}{8} \cdot (-3) + \frac{1}{16} \cdot (-4) + 4 \cdot \frac{1}{64} \cdot (-6) \right)$$
  
  $$= 0.5 + 0.5 + 0.375 + 0.25 + 0.375 = 2 \text{ bits}$$
  
  - **Optimal Coding**: Assign shorter codes to more probable horses (e.g., horse 1: 0, horse 2: 10, horse 3: 110, etc.), achieving an average of 2 bits per race, matching the entropy.

This shows that entropy is lower when the distribution is skewed (less uncertainty).
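
Both horse-race calculations can be checked numerically. The sketch below is a minimal illustration; the codewords for horses 4 to 8 are an assumption (one prefix-free choice that continues the pattern 0, 10, 110), and the function name is hypothetical.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16] + [1 / 64] * 4

# Assumed variable-length code: shorter codewords for more probable horses.
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
avg_code_length = sum(p * len(c) for p, c in zip(skewed, codes))

print(f"uniform entropy: {entropy(uniform):.3f} bits")
print(f"skewed entropy:  {entropy(skewed):.3f} bits")
print(f"average code length under skewed odds: {avg_code_length:.3f} bits")
```

The expected code length equals the entropy here because every probability is a power of 1/2, so each horse can be given a codeword of exactly $-\log_2 p(x)$ bits.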

### Entropy for Sequences

For a language model, we're interested in the entropy of a sequence of words $W = w_1, w_2, ..., w_n$:

$$H(w_1, w_2, ..., w_n) = -\sum_{w_{1:n} \in L} p(w_{1:n}) \log_2 p(w_{1:n})$$

The entropy rate (per-word entropy) normalizes by sequence length:

$$H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_{1:n}) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w_{1:n} \in L} p(w_{1:n}) \log_2 p(w_{1:n})$$

The Shannon-McMillan-Breiman theorem simplifies this for stationary and ergodic processes:

$$H(L) = \lim_{n \to \infty} -\frac{1}{n} \log_2 p(w_{1:n})$$

- **Stationary**: The probability distribution is invariant to time shifts (e.g., in a bigram model, $P(w_i | w_{i-1})$ is consistent regardless of position).
- **Ergodic**: Statistics gathered along one sufficiently long sequence converge to those of the process as a whole, so a single long sample is representative.
- **Implication**: A single, sufficiently long sequence can estimate the entropy rate, as it contains many shorter sequences with their respective probabilities.
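
The implication can be illustrated with a simulation. The sketch below assumes a hypothetical two-word "language" generated by a stationary, ergodic bigram (Markov) chain; the transition probabilities, seed, and function names are all made up for the example. It compares the entropy rate computed from the chain's definition with the estimate $-\frac{1}{n} \log_2 p(w_{1:n})$ taken from one long sampled sequence.

```python
import math
import random

# Hypothetical two-word "language" generated by a bigram (Markov) chain.
trans = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}
# Stationary distribution of this chain (satisfies pi = pi * trans).
stationary = {"a": 5 / 6, "b": 1 / 6}

def true_entropy_rate():
    # For a stationary Markov chain: H = -sum_i pi_i sum_j P_ij log2 P_ij.
    return -sum(
        stationary[i] * p * math.log2(p)
        for i in trans
        for p in trans[i].values()
    )

def sample_sequence(n, rng):
    # Start from the stationary distribution, then follow the transitions.
    word = "a" if rng.random() < stationary["a"] else "b"
    seq = [word]
    for _ in range(n - 1):
        word = "a" if rng.random() < trans[word]["a"] else "b"
        seq.append(word)
    return seq

def single_sequence_estimate(seq):
    # -(1/n) log2 p(w_1:n), with p factored by the chain rule.
    log_p = math.log2(stationary[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log2(trans[prev][cur])
    return -log_p / len(seq)

rng = random.Random(0)
seq = sample_sequence(200_000, rng)
exact = true_entropy_rate()
estimate = single_sequence_estimate(seq)
print(f"entropy rate:             {exact:.4f} bits/word")
print(f"single-sequence estimate: {estimate:.4f} bits/word")
```

The two values should agree closely, and the agreement tightens as the sampled sequence grows.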

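**Note**: Natural language is not stationary (e.g., the probability of upcoming words can depend on arbitrarily distant context, as shown in Appendix D), so statistical models such as n-grams give only an approximation to the true distributions and entropies of natural language.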

## 3. Cross-Entropy

### Definition

Cross-entropy measures how well a model $m$ (e.g., an n-gram model) approximates the true probability distribution $p$ that generated the data:

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_{1:n} \in L} p(w_{1:n}) \log_2 m(w_{1:n})$$

Using the Shannon-McMillan-Breiman theorem for stationary, ergodic processes:

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log_2 m(w_{1:n})$$

- **Intuition**: Draw sequences from the true distribution $p$, but compute their log probabilities using the model $m$. The cross-entropy is the average number of bits needed to encode the data under $m$.

### Key Property

Cross-entropy is an upper bound on the true entropy:

$$H(p) \leq H(p, m)$$

- A more accurate model $m$ (closer to $p$) has a cross-entropy closer to the true entropy $H(p)$.
- The difference $H(p, m) - H(p)$ measures the model's inaccuracy.
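
The bound is easiest to see for a single random variable rather than a full sequence. The sketch below makes that simplification and reuses the horse-race odds as the true distribution $p$, with a uniform model as an assumed (deliberately poor) $m$; the function names are hypothetical.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, m):
    """H(p, m) in bits: outcomes drawn from p, code lengths taken from m."""
    return -sum(pi * math.log2(mi) for pi, mi in zip(p, m) if pi > 0)

p = [1 / 2, 1 / 4, 1 / 8, 1 / 16] + [1 / 64] * 4  # true distribution
m_uniform = [1 / 8] * 8                           # assumed model: ignores the odds

h_p = entropy(p)
h_pm = cross_entropy(p, m_uniform)
print(f"H(p)            = {h_p:.3f} bits")
print(f"H(p, m_uniform) = {h_pm:.3f} bits")
print(f"gap             = {h_pm - h_p:.3f} bits")
print(f"H(p, p)         = {cross_entropy(p, p):.3f} bits")
```

The uniform model costs 3 bits per outcome against a true entropy of 2 bits, and the gap closes only when the model matches $p$ exactly.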

### Practical Estimation

For a finite sequence $W = w_1, w_2, ..., w_N$, the cross-entropy is approximated as:

$$H(W) = -\frac{1}{N} \log_2 P(w_1, w_2, ..., w_N)$$

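For an n-gram model, which conditions each word on the previous $n-1$ words, $P(w_i | w_{i-n+1:i-1})$ (here $n$ is the n-gram order and $N$ is the number of words in the sequence):

$$H(W) = -\frac{1}{N} \sum_{i=1}^N \log_2 P(w_i | w_{i-n+1:i-1})$$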

## 4. Perplexity's Relation to Cross-Entropy

Perplexity is directly derived from cross-entropy:

$$\text{Perplexity}(W) = 2^{H(W)}$$

$$= 2^{-\frac{1}{N} \log_2 P(w_1, w_2, ..., w_N)}$$

$$= \left( P(w_1, w_2, ..., w_N) \right)^{-1/N}$$

- **Why Inverse Probability?**: Cross-entropy contains the term $-\frac{1}{N} \log_2 P(W)$, and raising 2 to that power gives $P(W)^{-1/N}$, the $N$-th root of the inverse probability. Exponentiating converts bits per word (cross-entropy) back into an effective number of equally likely choices per word.
- **Intuition**: Perplexity is the geometric mean of the inverse probabilities of the words in the sequence, reflecting the model's average uncertainty (or "branching factor") per word.

### Example: BeRP Corpus

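For the test sentence `<s> i want english food </s>` from the BeRP corpus (Section 3.2), $N = 5$: the five predicted tokens are *i*, *want*, *english*, *food*, and `</s>`, while `<s>` serves only as context: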

- **Bigram probabilities** (from Figure 3.2 and given values):
  - $P(i | \langle s \rangle) = 0.25$
  - $P(\text{want} | i) = 0.33$
  - $P(\text{english} | \text{want}) = 0.0011$
  - $P(\text{food} | \text{english}) = 0.5$
  - $P(\langle /s \rangle | \text{food}) = 0.68$

- **Sequence probability**:

$$P(W) = 0.25 \cdot 0.33 \cdot 0.0011 \cdot 0.5 \cdot 0.68 \approx 0.000031$$

- **Cross-entropy** (using base 2 logarithm):

$$H(W) = -\frac{1}{5} \log_2 (0.000031)$$

$$\log_2 (0.000031) \approx \log_2 (3.1 \times 10^{-5}) \approx -14.98$$

$$H(W) = -\frac{1}{5} \cdot (-14.98) \approx 2.996 \text{ bits}$$

- **Perplexity**:

$$\text{Perplexity}(W) = 2^{2.996} \approx 7.98$$

This means the model is as uncertain as if it were choosing among approximately 8 equally likely words per position, on average.
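
The arithmetic above can be reproduced directly. The sketch below uses only the five bigram probabilities listed in this example; the variable names are hypothetical, and `<s>` is treated as context rather than as a predicted token.

```python
import math

# Bigram probabilities quoted in the example above, keyed as (previous, current).
bigram_probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}
sentence = ["<s>", "i", "want", "english", "food", "</s>"]
pairs = list(zip(sentence, sentence[1:]))

n = len(pairs)  # 5 predicted tokens; <s> is context only
log_prob = sum(math.log2(bigram_probs[pair]) for pair in pairs)
cross_entropy = -log_prob / n
perplexity = 2 ** cross_entropy

# Same quantity as the geometric mean of the inverse probabilities.
geometric_mean = math.prod(1 / bigram_probs[pair] for pair in pairs) ** (1 / n)

print(f"log2 P(W)     = {log_prob:.3f}")
print(f"cross-entropy = {cross_entropy:.3f} bits per word")
print(f"perplexity    = {perplexity:.2f}")
print(f"geometric mean of inverse probabilities = {geometric_mean:.2f}")
```

Both routes give a perplexity close to 8, which is the branching-factor reading of the result.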

### With Smoothing

If we use Laplace smoothing (Section 3.6.1), the probabilities change: mass moves from frequent bigrams to unseen ones (e.g., $P(\text{want} | i)$ drops from 0.33 to about 0.21). For a sentence made up of observed bigrams, this typically increases perplexity, as the model spreads probability mass more evenly, reflecting higher uncertainty.
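
The effect can be sketched on a small example. The three-sentence corpus below is hypothetical (it is not the BeRP data), the vocabulary size is simply the number of distinct tokens, and the function names are made up for the illustration.

```python
import math
from collections import Counter

# Hypothetical toy training corpus, not the BeRP data.
corpus = [
    "<s> i want food </s>",
    "<s> i want tea </s>",
    "<s> i like food </s>",
]
sentences = [s.split() for s in corpus]
context_counts = Counter(w for s in sentences for w in s[:-1])
bigram_counts = Counter(b for s in sentences for b in zip(s, s[1:]))
vocab_size = len({w for s in sentences for w in s})  # assumed definition of V

def p_mle(w, prev):
    return bigram_counts[(prev, w)] / context_counts[prev]

def p_laplace(w, prev):
    # Add-one smoothing: one extra count per bigram, V extra counts per context.
    return (bigram_counts[(prev, w)] + 1) / (context_counts[prev] + vocab_size)

def perplexity(sentence, prob):
    pairs = list(zip(sentence, sentence[1:]))
    log_p = sum(math.log2(prob(cur, prev)) for prev, cur in pairs)
    return 2 ** (-log_p / len(pairs))

seen = sentences[0]  # every bigram in this sentence was observed in training
pp_mle = perplexity(seen, p_mle)
pp_laplace = perplexity(seen, p_laplace)
print(f"P(want | i): MLE {p_mle('want', 'i'):.3f}, Laplace {p_laplace('want', 'i'):.3f}")
print(f"P(tea | like): MLE {p_mle('tea', 'like'):.3f}, Laplace {p_laplace('tea', 'like'):.3f}")
print(f"perplexity on a seen sentence: MLE {pp_mle:.2f}, Laplace {pp_laplace:.2f}")
```

On this toy data the seen bigram loses probability, the unseen bigram gains some, and the perplexity of the seen sentence rises under smoothing.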

## 5. Practical Implications

- **Model Comparison**: A model with lower cross-entropy (and thus lower perplexity) is more accurate, as it assigns higher probabilities to the test set.
- **Stationarity Limitation**: N-gram models are stationary (their conditional probabilities don't change with position), but natural language is non-stationary (the next word can depend on arbitrarily distant context). Thus, the cross-entropy measured with an n-gram model is only an approximation to the true entropy of the language.
- **Application**: Perplexity's relation to cross-entropy explains why it's a natural metric for evaluating language models, as it measures how well the model approximates the true distribution.