# 📖 Chapter 4.1: Language Models

## 📌 Overview  
A **Language Model (LM)** estimates the probability of a sequence of words.  
It answers the question:  
> "Given the previous words, what is the likelihood of the next word?"

Language models are the backbone of many NLP tasks, such as:
- Text generation
- Speech recognition
- Machine translation
- Autocomplete and chatbots

---

## 🔢 Probability of Word Sequences  
The **chain rule of probability** defines the probability of a sequence:
\[
P(w_1, w_2, w_3, ..., w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_1, w_2) \cdots P(w_n | w_1, ..., w_{n-1})
\]

Since conditioning on the entire history is computationally expensive, **n-gram models** simplify this by conditioning only on the previous `N-1` words, where `N` is the order of the model:
\[
P(w_n | w_1, ..., w_{n-1}) \approx P(w_n | w_{n-N+1}, ..., w_{n-1})
\]
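
As a worked instance, take the three-word sequence "the dog runs" and a bigram model, where each word is conditioned only on the word before it. The chain rule gives the exact factorization, and the bigram assumption shortens the last factor:
\[
P(\text{the}, \text{dog}, \text{runs}) = P(\text{the}) \cdot P(\text{dog} | \text{the}) \cdot P(\text{runs} | \text{the}, \text{dog}) \approx P(\text{the}) \cdot P(\text{dog} | \text{the}) \cdot P(\text{runs} | \text{dog})
\]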

---

## 1️⃣ N-gram Language Models  
**N-gram:** A contiguous sequence of `n` words.

| N-gram Type     | Example Phrase              |
|-----------------|----------------------------|
| Unigram (n=1)   | "The", "dog", "runs"        |
| Bigram (n=2)    | "The dog", "dog runs"       |
| Trigram (n=3)   | "The dog runs", "dog runs fast" |
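
The rows of the table can be reproduced with a few lines of plain Python. This is a minimal sketch: `extract_ngrams` is a hypothetical helper name introduced here for illustration, and the token list is simply the example phrase from the table.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example phrase from the table, already split into words
tokens = ["The", "dog", "runs", "fast"]

for n in (1, 2, 3):
    print(f"n={n}:", extract_ngrams(tokens, n))
```

A sentence with `k` tokens yields `k - n + 1` n-grams, so the list is empty when `n` exceeds the sentence length.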

---

### 🛠️ Example: Building a Bigram Model using NLTK

```python
import nltk
from nltk import bigrams
from nltk.probability import FreqDist, ConditionalFreqDist

# Tokenizer data; newer NLTK releases may require 'punkt_tab' instead
nltk.download('punkt')

# Sample text
text = "Natural language processing makes machines understand human language."

# Tokenize the text into words
tokens = nltk.word_tokenize(text.lower())

# Generate bigrams from the token list
bigrams_list = list(bigrams(tokens))

# Frequency distribution of bigrams
fdist = FreqDist(bigrams_list)
print("Bigram Frequencies:\n", fdist.most_common())

# Conditional Frequency Distribution: What words often follow 'language'?
cfd = ConditionalFreqDist(bigrams_list)
print("Words that follow 'language':", cfd['language'].most_common())
```

Output:

```text
Bigram Frequencies:
 [(('natural', 'language'), 1), (('language', 'processing'), 1), (('processing', 'makes'), 1), (('makes', 'machines'), 1), (('machines', 'understand'), 1), (('understand', 'human'), 1), (('human', 'language'), 1), (('language', '.'), 1)]
Words that follow 'language': [('processing', 1), ('.', 1)]
```




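## 2️⃣ Limitations of N-gram Models

- They capture only short-range dependencies: context is limited to the previous `n-1` words.
- They suffer from data sparsity (the curse of dimensionality): most possible n-grams never appear in the training data.
- They require smoothing techniques (e.g., Laplace smoothing) to assign non-zero probability to unseen n-grams.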
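
Laplace (add-one) smoothing can be sketched as follows. This is an illustration under assumptions, not a production implementation: the helper name `laplace_bigram_prob` is hypothetical, the corpus is the single sample sentence used earlier, and tokenization is a plain whitespace split.

```python
from collections import Counter

# Toy corpus: the sample sentence, pre-tokenized by whitespace
tokens = "natural language processing makes machines understand human language .".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))   # count(prev, word)
context_counts = Counter(tokens[:-1])              # count(prev) as a bigram context
vocab = sorted(set(tokens))

def laplace_bigram_prob(prev, word):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

print(laplace_bigram_prob("language", "processing"))  # seen bigram   -> 0.2
print(laplace_bigram_prob("language", "human"))       # unseen bigram -> 0.1
```

The unseen bigram now gets a small non-zero probability, and the estimates for a given context still sum to 1 over the vocabulary.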

## 3️⃣ Moving Beyond N-grams: Neural Language Models

To overcome these limitations, neural network-based models were introduced:

- Feedforward Neural Network Language Model (Bengio et al., 2003)
- Recurrent Neural Networks (RNNs)
- Long Short-Term Memory (LSTM) networks
- Transformer-based models (e.g., GPT, BERT)

These models can learn:

- Long-range dependencies
- Better generalization to unseen word combinations

(You’ll explore these in detail in the next sections!)

## 🎯 Practice Questions (Language Models)

### 1️⃣ What is the main assumption made by n-gram models to simplify the computation of word sequence probabilities?
N-gram models assume the **Markov property**, which means the probability of a word depends only on the previous `n-1` words (not the entire history).  
For example, in a bigram model:
\[
P(w_n | w_1, w_2, ..., w_{n-1}) \approx P(w_n | w_{n-1})
\]
This greatly reduces complexity but ignores longer context.
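
To make the bigram assumption concrete, the sketch below estimates `P(w_n | w_{n-1})` by relative frequency (maximum likelihood) and multiplies the estimates along a sequence. The helper names `bigram_prob` and `sequence_prob` are hypothetical, and the one-sentence corpus is only for illustration.

```python
from collections import Counter

# Toy corpus: a single whitespace-tokenized sentence
tokens = "natural language processing makes machines understand human language .".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))   # count(prev, word)
context_counts = Counter(tokens[:-1])              # count(prev) as a bigram context

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sequence_prob(words):
    """Probability of words[1:] given words[0], using only one word of history."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("language", "processing"))                 # 0.5
print(sequence_prob(["natural", "language", "processing"]))  # 0.5
print(sequence_prob(["language", "makes"]))                  # 0.0 (bigram never seen)
```

The last line shows the cost of pure counting: any sequence containing an unseen bigram receives probability zero.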

---

### 2️⃣ Why do n-gram models struggle with long sentences?
N-gram models struggle with long sentences because they:
- Only consider **short-range dependencies** (limited by the value of `n`).
- Cannot remember earlier parts of a sentence beyond `n-1` words.
- Suffer from **data sparsity** as the number of possible n-grams grows exponentially with `n`, making it hard to cover all combinations in the training data.
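
The exponential growth is easy to check numerically. In this sketch the vocabulary size of 10,000 is an assumed round number, and `possible_ngrams` is a hypothetical helper.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams that can be formed from a vocabulary."""
    return vocab_size ** n

VOCAB_SIZE = 10_000  # hypothetical vocabulary size, chosen for illustration

for n in range(1, 5):
    print(f"n={n}: {possible_ngrams(VOCAB_SIZE, n):,} possible n-grams")
```

With these numbers there are already 10^12 possible trigrams, far more than any realistic corpus can cover.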

---

### 3️⃣ How do neural language models improve over traditional n-gram models?
Neural language models (like feedforward networks, RNNs, LSTMs, Transformers) improve over n-gram models by:
- **Learning distributed word representations (embeddings)** that capture semantic relationships.
- **Handling long-range dependencies** through architectures like RNNs and Transformers.
- **Generalizing better** to unseen word sequences using learned parameters, rather than relying on explicit counting.
- **Mitigating the curse of dimensionality** by mapping words to dense vectors instead of sparse one-hot encodings, so that similar words can share parameters.
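
The points above can be illustrated with the forward pass of a small feedforward language model, loosely in the spirit of Bengio et al. (2003). This is a rough, untrained sketch in plain Python: the dimensions, the five-word vocabulary, and all helper names are assumptions made for illustration, biases are omitted, and the weights are random, so the printed probabilities are not meaningful predictions.

```python
import math
import random

random.seed(0)

vocab = ["natural", "language", "processing", "makes", "machines"]
word_to_id = {w: i for i, w in enumerate(vocab)}

EMBED_DIM = 4    # size of each dense word vector (assumed)
CONTEXT = 2      # number of previous words fed to the model (assumed)
HIDDEN_DIM = 8   # hidden layer width (assumed)

def random_matrix(rows, cols):
    """Small random weights; a real model would learn these by training."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

embeddings = random_matrix(len(vocab), EMBED_DIM)          # one dense row per word
W_hidden = random_matrix(HIDDEN_DIM, CONTEXT * EMBED_DIM)  # context -> hidden
W_out = random_matrix(len(vocab), HIDDEN_DIM)              # hidden -> word scores

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(context_words):
    """Look up embeddings, concatenate, apply a tanh hidden layer, then softmax."""
    x = [value for w in context_words for value in embeddings[word_to_id[w]]]
    hidden = [math.tanh(h) for h in matvec(W_hidden, x)]
    return softmax(matvec(W_out, hidden))

probs = next_word_distribution(["natural", "language"])
for word, p in zip(vocab, probs):
    print(f"{word:12s} {p:.3f}")
```

The number of parameters grows with the vocabulary size and the layer widths, not with the number of possible word combinations, which is why such a model can assign sensible probabilities to contexts it has never counted.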

---
