# 📖 Chapter 4.1: Language Models

## 📌 Overview  
A **Language Model (LM)** estimates the probability of a sequence of words.  
It answers the question:  
> "Given the previous words, what is the likelihood of the next word?"

Language models are the backbone of many NLP tasks like:
- Text generation
- Speech recognition
- Machine translation
- Autocomplete and chatbots

---

## 🔢 Probability of Word Sequences  
The **chain rule of probability** defines the probability of a sequence:
$$
P(w_1, w_2, w_3, ..., w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_1, w_2) \cdots P(w_n | w_1, ..., w_{n-1})
$$

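Conditioning on the entire history is impractical: most long histories never occur in the training data, so their probabilities cannot be estimated reliably. **N-gram models** simplify this by assuming that each word depends only on the previous `n-1` words. For the word at position `i`:
$$
P(w_i | w_1, ..., w_{i-1}) \approx P(w_i | w_{i-n+1}, ..., w_{i-1})
$$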
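
To make the approximation concrete, here is a minimal sketch that estimates bigram probabilities by counting and multiplies them along a sentence. The toy corpus and the `<s>` / `</s>` boundary markers are assumptions made for illustration, not part of the chapter's data.

```python
from collections import Counter

# Hypothetical toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "the", "dog", "runs", "</s>"],
    ["<s>", "the", "dog", "barks", "</s>"],
    ["<s>", "the", "cat", "runs", "</s>"],
]

# Count each bigram and each word that appears as a left context.
bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(tokens):
    """Chain rule with the bigram (n = 2) approximation."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


p = sentence_prob(["<s>", "the", "dog", "runs", "</s>"])
print(round(p, 4))  # 1 * 2/3 * 1/2 * 1 -> 0.3333
```

A sentence containing a bigram that never occurred in the corpus, such as "cat barks", gets probability 0 under this estimate.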

---

## 1️⃣ N-gram Language Models  
**N-gram:** A contiguous sequence of `n` words.

| N-gram Type     | Example Phrase              |
|-----------------|----------------------------|
| Unigram (n=1)   | "The", "dog", "runs"        |
| Bigram (n=2)    | "The dog", "dog runs"       |
| Trigram (n=3)   | "The dog runs", "dog runs fast" |
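
The rows of the table can be reproduced with a sliding window over a token list. This is a small sketch in plain Python; `extract_ngrams` is a hypothetical helper name, not an NLTK function.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["The", "dog", "runs", "fast"]

unigram_list = extract_ngrams(tokens, 1)
bigram_list = extract_ngrams(tokens, 2)
trigram_list = extract_ngrams(tokens, 3)

print(bigram_list)   # [('The', 'dog'), ('dog', 'runs'), ('runs', 'fast')]
print(trigram_list)  # [('The', 'dog', 'runs'), ('dog', 'runs', 'fast')]
```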

---

### 🛠️ Example: Building a Bigram Model using NLTK

In [1]:
import nltk
from nltk import bigrams
from nltk.probability import FreqDist, ConditionalFreqDist
nltk.download('punkt')  # tokenizer data; newer NLTK releases may require 'punkt_tab' instead

# Sample text
text = "Natural language processing makes machines understand human language."

# Tokenize the text into words
tokens = nltk.word_tokenize(text.lower())

# Generate bigrams from the token list
bigrams_list = list(bigrams(tokens))

# Frequency distribution of bigrams
fdist = FreqDist(bigrams_list)
print("Bigram Frequencies:\n", fdist.most_common())

# Conditional Frequency Distribution: What words often follow 'language'?
cfd = ConditionalFreqDist(bigrams_list)
print("Words that follow 'language':", cfd['language'].most_common())


Bigram Frequencies:
 [(('natural', 'language'), 1), (('language', 'processing'), 1), (('processing', 'makes'), 1), (('makes', 'machines'), 1), (('machines', 'understand'), 1), (('understand', 'human'), 1), (('human', 'language'), 1), (('language', '.'), 1)]
Words that follow 'language': [('processing', 1), ('.', 1)]



## 2️⃣ Limitations of N-gram Models

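- Capture only short-range dependencies (at most the previous `n-1` words).
- Suffer from data sparsity: the number of possible n-grams grows exponentially with `n`, so most are never observed in training.
- Require smoothing techniques (e.g., Laplace smoothing) to assign non-zero probability to unseen n-grams.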
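
As an illustration of the smoothing point, the sketch below compares the plain count-based estimate with add-one (Laplace) smoothing for bigrams. The token sequence is a made-up example, and `mle_prob` and `laplace_prob` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical training text.
tokens = "the dog runs the dog barks the cat runs".split()
vocab = sorted(set(tokens))
V = len(vocab)  # vocabulary size: 5

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word starts a bigram


def mle_prob(prev, word):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def laplace_prob(prev, word):
    """Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)


print(mle_prob("cat", "barks"))      # 0.0, the bigram was never seen
print(laplace_prob("cat", "barks"))  # 1 / (1 + 5), small but non-zero
print(laplace_prob("the", "dog"))    # (2 + 1) / (3 + 5) = 0.375
```

The smoothed probabilities for a given context still sum to 1 over the vocabulary, but some mass has moved from seen bigrams to unseen ones.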

## 🎯 Practice Questions: (Language Models)

### 1️⃣ What is the main assumption made by n-gram models to simplify the computation of word sequence probabilities?
N-gram models assume the **Markov property**, which means the probability of a word depends only on the previous `n-1` words (not the entire history).  
For example, in a bigram model:
$$
P(w_n | w_1, w_2, ..., w_{n-1}) \approx P(w_n | w_{n-1})
$$
This greatly reduces complexity but ignores longer context.

---

### 2️⃣ Why do n-gram models struggle with long sentences?
N-gram models struggle with long sentences because they:
- Only consider **short-range dependencies** (limited by the value of `n`).
- Cannot remember earlier parts of a sentence beyond `n-1` words.
- Suffer from **data sparsity** as the number of possible n-grams grows exponentially with `n`, making it hard to cover all combinations in the training data.
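
As a rough illustration of that growth, the sketch below counts the possible n-grams for a hypothetical vocabulary of 10,000 words. The vocabulary size is an assumption chosen only to show the scale.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams that can be formed from vocab_size words."""
    return vocab_size ** n


vocab_size = 10_000  # hypothetical vocabulary size

for n in (1, 2, 3, 4):
    print(f"n={n}: {possible_ngrams(vocab_size, n):.1e} possible n-grams")
```

A corpus of `T` tokens contains at most `T - n + 1` n-grams, so even a very large corpus covers only a small fraction of the possible trigrams or 4-grams.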

---

### 3️⃣ How do neural language models improve over traditional n-gram models?
Neural language models (like feedforward networks, RNNs, LSTMs, Transformers) improve over n-gram models by:
- **Learning distributed word representations (embeddings)** that capture semantic relationships.
- **Handling long-range dependencies** through architectures like RNNs and Transformers.
- **Generalizing better** to unseen word sequences using learned parameters, rather than relying on explicit counting.
- **Reducing the curse of dimensionality** by mapping words to dense vectors instead of one-hot encoding.

---


## 3️⃣ Moving Beyond N-grams: Neural Language Models

To overcome these limitations, neural network-based models were introduced:

- Feedforward Neural Network Language Model (Bengio et al., 2003)
- Recurrent Neural Networks (RNNs)
- Long Short-Term Memory (LSTM)
- Transformer-based models (e.g., GPT, BERT)

These models can learn:

- Long-range dependencies
- Better generalization to unseen word combinations

# 🧠 Neural Language Models

## 📌 Overview  
Traditional **n-gram language models** rely on counting word sequences but struggle with:
- Long-range dependencies
- Data sparsity
- High-dimensional, sparse word vectors (one-hot encoding)

To address these limitations, **neural language models** use neural networks and word embeddings to learn:
- Dense vector representations of words
- Probability distributions over the next word, conditioned on previous words

---

## 1️⃣ Feedforward Neural Network Language Model (NNLM)  
**Introduced by:** Bengio et al., 2003  
**Idea:**  
- Uses a fixed-size context window (like n-grams).  
- Concatenates the embeddings of previous words as input to a feedforward neural network.  
- Outputs the probability of the next word.

---

### 🛠️ Structure of NNLM:
1. Input: Embeddings of the last `n-1` words  
2. Hidden layer: Nonlinear transformation (e.g., ReLU, tanh)  
3. Output layer: Softmax over the vocabulary for next-word prediction

> ⚠️ Limitation: Fixed context window size → still struggles with long dependencies.



In [3]:
import torch
import torch.nn as nn

# Feedforward NNLM: predict the next word from a fixed-size context (here, the previous 2 words)
class FeedforwardNNLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
        super(FeedforwardNNLM, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

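    def forward(self, inputs):
        # inputs: LongTensor of shape [context_size] with word indices (a single example, no batch)
        # Look up the embeddings and concatenate them into one row of shape [1, context_size * embedding_dim]
        embeds = self.embeddings(inputs).view((1, -1))
        out = self.linear1(embeds)
        out = self.relu(out)
        out = self.linear2(out)
        # Raw logits of shape [1, vocab_size]; apply softmax to obtain next-word probabilities
        return out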

# Example
model = FeedforwardNNLM(vocab_size=1000, embedding_dim=50, context_size=2, hidden_dim=128)
print(model)


FeedforwardNNLM(
  (embeddings): Embedding(1000, 50)
  (linear1): Linear(in_features=100, out_features=128, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=128, out_features=1000, bias=True)
)


---

## 2️⃣ Recurrent Neural Networks (RNNs)  
**Motivation:** Process **sequences of arbitrary length** by maintaining a **hidden state** that summarizes past information.

### 🌀 How RNN works:
At each time step `t`:
$$
h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b)
$$
- `h_t`: Hidden state at time `t` (memory of the sequence)
- `x_t`: Input embedding at time `t`
- `f`: Nonlinear activation (like tanh or ReLU)

The output at each step predicts the next word:
$$
\hat{y}_t = \text{softmax}(W_{hy} h_t + b_y)
$$
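
The two equations above can be traced by hand with a tiny example. This is a minimal sketch in plain Python with `tanh` as the activation `f`; the weight values, the hidden size of 2, and the 3-word vocabulary are arbitrary assumptions, not trained parameters.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    pre = [a + c + d for a, c, d in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b)]
    return [math.tanh(z) for z in pre]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary illustrative parameters.
W_hh = [[0.5, -0.1], [0.2, 0.3]]
W_xh = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
W_hy = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_y = [0.0, 0.0, 0.0]

h = [0.0, 0.0]                                  # initial hidden state h_0
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # input embeddings x_1..x_3

for t, x_t in enumerate(inputs, start=1):
    h = rnn_step(h, x_t, W_hh, W_xh, b)
    y_hat = softmax([s + c for s, c in zip(matvec(W_hy, h), b_y)])
    print(t, [round(v, 3) for v in h], [round(p, 3) for p in y_hat])
```

The same weights are reused at every step; only the hidden state `h` changes, which is how the network carries information forward.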

---

## 🧩 Why RNNs Are Better Than N-gram Models:
| Model             | Handles Variable Length? | Captures Long Dependencies? | Requires Word Counting? |
|-------------------|-------------------------|----------------------------|------------------------|
| N-gram            | ❌ No                    | ❌ Short context only       | ✅ Yes                 |
| Feedforward NNLM  | ❌ No (fixed window)     | ❌ Limited                 | ❌ No (uses embeddings) |
| RNN               | ✅ Yes                   | ✅ Yes (but can suffer from vanishing gradients) | ❌ No                 |

---

## 🚩 Limitation of Vanilla RNNs:
- **Vanishing gradient problem**: Hard to learn long-range dependencies.
- Solution: Use gated architectures such as **LSTM (Long Short-Term Memory)**, covered below, and **GRU (Gated Recurrent Unit)**.
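
To see why the gradient vanishes, consider a scalar RNN `h_t = tanh(w * h_{t-1} + x_t)`. The derivative of `h_t` with respect to `h_{t-1}` is `w * (1 - h_t^2)`, and backpropagation through time multiplies one such factor per step. The sketch below tracks that product; the weight `w = 0.9` and the constant inputs are assumptions chosen for illustration.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_t / d h_0 for each step t of a scalar tanh RNN."""
    h = h0
    grad = 1.0
    history = []
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain rule: one factor per time step
        history.append(grad)
    return history


grads = gradient_through_time(w=0.9, inputs=[0.5] * 50)

for t in (1, 10, 25, 50):
    print(f"step {t:2d}: |d h_t / d h_0| = {abs(grads[t - 1]):.3e}")
```

Because every factor has magnitude below 1 here, the product shrinks at least geometrically, so early inputs have almost no influence on the gradient at late steps.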

---

## 🧪 Simple Example: RNN with PyTorch (Optional Code Example)


In [2]:
import torch
import torch.nn as nn

# Define a simple RNN model
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(SimpleRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, x):
        embedded = self.embedding(x)
        output, hidden = self.rnn(embedded)
        logits = self.fc(output)
        return logits

# Example parameters
vocab_size = 1000
embedding_dim = 50
hidden_dim = 128

model = SimpleRNN(vocab_size, embedding_dim, hidden_dim)
print(model)


SimpleRNN(
  (embedding): Embedding(1000, 50)
  (rnn): RNN(50, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=1000, bias=True)
)


## 3️⃣ Long Short-Term Memory (LSTM)
**Proposed by:** Hochreiter & Schmidhuber, 1997  
**Motivation:** Mitigate the **vanishing gradient problem** of standard RNNs.

**Key Components (Gates):**
- **Forget gate:** Decides what information to discard from the cell state.
- **Input gate:** Decides what new information to store.
- **Output gate:** Controls what part of the cell state goes to the output.

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
$$
Where:
- `c_t`: Cell state
- `f_t`: Forget gate
- `i_t`: Input gate
- $\tilde{c}_t$: Candidate values
- $\odot$: Element-wise multiplication
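
The cell-state update can be checked with a few numbers. In this sketch the gate values are supplied directly as assumptions; in a real LSTM they are computed from the input and the previous hidden state. `update_cell_state` is a hypothetical helper name.

```python
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    """Element-wise c_t = f_t * c_{t-1} + i_t * c_tilde_t."""
    return [f * c + i * g for c, f, i, g in zip(c_prev, f_t, i_t, c_tilde)]


c_prev = [2.0, -1.0, 0.5]    # previous cell state
f_t = [1.0, 0.0, 0.5]        # forget gate: keep, erase, keep half
i_t = [0.0, 1.0, 0.5]        # input gate: ignore, write, write half
c_tilde = [0.9, 0.9, -0.5]   # candidate values

c_t = update_cell_state(c_prev, f_t, i_t, c_tilde)
print(c_t)  # [2.0, 0.9, 0.0]
```

The first unit keeps its old value unchanged (forget gate 1, input gate 0), which is how an LSTM can carry information across many steps.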

✅ **Advantage:** Can remember information over long sequences.  
🟡 Often used in language modeling, machine translation, and time-series forecasting.


In [4]:
class SimpleLSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(SimpleLSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded)
        logits = self.fc(output)
        return logits

lstm_model = SimpleLSTMModel(vocab_size=1000, embedding_dim=50, hidden_dim=128)
print(lstm_model)


SimpleLSTMModel(
  (embedding): Embedding(1000, 50)
  (lstm): LSTM(50, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=1000, bias=True)
)


## 4️⃣ Transformer-based Models
**Proposed by:** Vaswani et al., 2017 (“Attention Is All You Need”)  

**Key Concept:**  
- Uses **self-attention mechanisms** instead of recurrence to capture dependencies across the entire sequence.
- Processes the sequence **in parallel** (unlike RNNs which are sequential).

**Structure:**
- **Positional encodings** added to input embeddings (since no recurrence).
- Stacked **multi-head self-attention layers** and feedforward layers.
- Variants:
  - **GPT (Generative Pre-trained Transformer):** Decoder-only, causal attention (for text generation).
  - **BERT (Bidirectional Encoder Representations from Transformers):** Encoder-only, masked language modeling (for text understanding tasks like classification, Q&A).
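
The self-attention step with causal masking, as used in decoder-only models, can be sketched in plain Python. This is a single-head illustration under simplifying assumptions: the queries, keys, and values are the raw input vectors, whereas a real Transformer first applies learned linear projections. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position t attends only to positions <= t."""
    d_k = len(K[0])
    outputs = []
    for t, q in enumerate(Q):
        # Scores against the current and earlier positions only (causal mask).
        scores = [
            sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        out = [
            sum(w * V[s][j] for s, w in enumerate(weights))
            for j in range(len(V[0]))
        ]
        outputs.append(out)
    return outputs


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three token vectors (arbitrary values)
attended = causal_self_attention(X, X, X)

for t, vec in enumerate(attended):
    print(t, [round(v, 3) for v in vec])
```

The first position can only attend to itself, so its output equals its own value vector. Each position's output is computed independently of the others, which is what allows the sequence to be processed in parallel.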

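✅ **Advantage:** Captures **long-range dependencies** directly, because any two positions are connected by a single attention step, and allows **parallel training**. The cost of self-attention grows quadratically with sequence length.  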
⚡ **Transformers are now the state-of-the-art** in many NLP tasks.

---



In [7]:
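class SimpleTransformerModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_heads, hidden_dim, num_layers, max_len=100):
        super(SimpleTransformerModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Learned positional embeddings for up to max_len positions (initialized to zero)
        self.pos_encoder = nn.Parameter(torch.zeros(1, max_len, embedding_dim))
        # batch_first=True so tensors are [batch_size, seq_length, embedding_dim], matching the indexing in forward()
        decoder_layer = nn.TransformerDecoderLayer(d_model=embedding_dim, nhead=num_heads, dim_feedforward=hidden_dim, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x, memory):
        # x: [batch_size, seq_length] token indices; memory: [batch_size, src_length, embedding_dim] encoder output
        seq_len = x.size(1)
        embedded = self.embedding(x) + self.pos_encoder[:, :seq_len, :]  # Add position information
        # Causal mask: -inf above the diagonal so position t cannot attend to later positions
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf'), device=x.device), diagonal=1)
        output = self.transformer_decoder(embedded, memory, tgt_mask=causal_mask)
        logits = self.fc(output)
        return logits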

transformer_model = SimpleTransformerModel(vocab_size=1000, embedding_dim=50, num_heads=2, hidden_dim=128, num_layers=2)
print(transformer_model)


SimpleTransformerModel(
  (embedding): Embedding(1000, 50)
  (transformer_decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=50, out_features=50, bias=True)
        )
        (multihead_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=50, out_features=50, bias=True)
        )
        (linear1): Linear(in_features=50, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=50, bias=True)
        (norm1): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
        (norm3): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
        (dropout3): Drop

## 📌 Summary of Neural Language Models

| Model                         | Handles Long Dependencies?                    | Fixed Context?                 | Parallelizable? | Common Use Cases                      |
|-------------------------------|-----------------------------------------------|--------------------------------|-----------------|---------------------------------------|
| Feedforward NNLM              | ❌ No                                          | ✅ Yes                         | ✅ Yes          | Basic word prediction                 |
| RNN                           | ✅ Yes (limited by vanishing gradients)        | ❌ No                          | ❌ No           | Sequential data, speech, text gen     |
| LSTM                          | ✅ Yes (gating mitigates vanishing gradients)  | ❌ No                          | ❌ No           | Long text sequences, translation      |
| Transformer (GPT, BERT, etc.) | ✅ Yes (via self-attention)                    | ❌ No (up to a maximum length) | ✅ Yes          | Text gen, classification, translation |

---
## ✅ Recap of Input and Output Shapes

| Model              | Input Shape                     | Output Shape                          |
|--------------------|----------------------------------|----------------------------------------|
| Feedforward NNLM   | `[context_size]`                | `[1, vocab_size]`                     |
| RNN                | `[batch_size, seq_length]`      | `[batch_size, seq_length, vocab_size]`|
| LSTM               | `[batch_size, seq_length]`      | `[batch_size, seq_length, vocab_size]`|
| Transformer        | `decoder_input + encoder memory`| `[batch_size, seq_length, vocab_size]`|

---
## 🎯 Why These Models Matter:
- Move beyond simple word counts.
- Learn **contextual relationships** between words.
- Enable **state-of-the-art NLP applications** like ChatGPT, BERT-based Q&A systems, machine translation, text summarization, and more.


## 🎯 Practice Questions: (Neural Language Models)

### 1️⃣ What is the main advantage of neural language models over n-gram models?
Neural language models (like feedforward NNLMs and RNNs) avoid explicit counting of word sequences.  
Instead, they:
- Use **word embeddings** (dense vectors) to capture semantic meaning.
- Generalize better to **unseen word combinations**.
- Scale to **large vocabularies** with far fewer parameters than a table of n-gram counts.
- Mitigate the **curse of dimensionality** that affects n-gram models.

---

### 2️⃣ Why do RNNs use hidden states, and what problem does this solve?
RNNs maintain a **hidden state** that carries forward information from previous time steps.  
This allows RNNs to:
- **Summarize the history** of the sequence up to the current word.
- Handle **sequences of arbitrary length**.
- Learn **temporal patterns** or dependencies across words.

The hidden state helps solve the problem that n-gram and feedforward models face — they only see a **fixed window** of context.

---

### 3️⃣ What is the major limitation of vanilla RNNs, and how can it be addressed?
The biggest limitation of vanilla RNNs is the **vanishing (or exploding) gradient problem**, which makes it hard to learn **long-range dependencies** during backpropagation through time (BPTT).

**Solution:**  
Use advanced RNN variants:
- **LSTM (Long Short-Term Memory)**: Adds gates (input, forget, output)
