# Module 1: Foundations of LLMs

## 1.1 What are Large Language Models?

Large Language Models (LLMs) are neural networks, today almost always transformers, trained on vast amounts of text data to understand and generate human-like text.

### Evolution: N-grams → RNN → LSTM → Transformers → LLMs

The evolution of language models has progressed from simple statistical models to complex neural networks.

- **N-grams**: Statistical models based on word sequences.
- **RNN**: Recurrent Neural Networks that handle sequential data.
- **LSTM**: Long Short-Term Memory, an improvement over RNN for long sequences.
- **Transformers**: Attention-based models that revolutionized NLP.
- **LLMs**: Large-scale transformer models such as GPT (decoder-only) and BERT (encoder-only).

### What makes a model 'large'

"Large" refers to the number of parameters (typically billions) and to the amount of training data (often hundreds of billions of tokens).
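
To make "billions of parameters" concrete, here is a rough sketch that estimates a transformer's parameter count from its depth and width. It uses the common approximation of about 12·d² weights per layer and ignores biases, layer norms, and positional embeddings. The function name and both configurations are assumptions chosen for illustration, not values taken from this module.

```python
def approx_transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough estimate: 12 * n_layers * d_model^2 plus the token embedding table.

    Per layer: ~4*d^2 for the attention projections (Q, K, V, output)
    and ~8*d^2 for a feed-forward block with hidden size 4*d.
    """
    per_layer = 12 * d_model * d_model
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# Assumed GPT-3-like configuration: 96 layers, width 12288, ~50k-token vocabulary.
gpt3_like = approx_transformer_params(n_layers=96, d_model=12288, vocab_size=50257)
# Assumed BERT-base-like configuration: 12 layers, width 768, ~30k-token vocabulary.
bert_like = approx_transformer_params(n_layers=12, d_model=768, vocab_size=30522)

print(f"GPT-3-like estimate:     {gpt3_like / 1e9:.1f}B parameters")
print(f"BERT-base-like estimate: {bert_like / 1e6:.1f}M parameters")
```

The estimate grows with the square of the width, which is why widening a model adds parameters much faster than deepening it.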

### Capabilities and limitations

- **Capabilities**: Text generation, translation, summarization.
- **Limitations**: Lack of true understanding, bias inherited from training data, hallucinations (fluent but incorrect output).

In [1]:
# Example: Extracting bigrams, the raw material of an N-gram model

text = "the cat sat on the mat the cat is black"
words = text.split()
bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
print(bigrams)

[('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat'), ('mat', 'the'), ('the', 'cat'), ('cat', 'is'), ('is', 'black')]
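
The bigram list above is only the raw material. A minimal sketch of turning it into an actual bigram model, assuming simple maximum-likelihood counts with no smoothing (the helper name `next_word_probs` is hypothetical):

```python
from collections import Counter, defaultdict

text = "the cat sat on the mat the cat is black"
words = text.split()

# Count how often each word follows each context word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev: str) -> dict:
    """Maximum-likelihood estimate of P(next | prev) from the bigram counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs_after_the = next_word_probs("the")
print(probs_after_the)
```

In this tiny corpus "the" is followed by "cat" twice and "mat" once, so the model assigns them probabilities 2/3 and 1/3. A word never seen as a context yields an empty distribution, which is the sparsity problem that neural language models were designed to ease.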


## 1.2 Transformer Architecture (Deep Understanding)

### Tokens and tokenization

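Tokens are the basic units of text a model operates on (words, subwords, or characters), and tokenization is the process of splitting text into tokens and mapping each one to an integer ID. In the example below, `encode` also adds BERT's special `[CLS]` (101) and `[SEP]` (102) tokens, which is why the ID list is two entries longer than the token list.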

In [2]:
# Example: Tokenization with Hugging Face
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, world!"
tokens = tokenizer.tokenize(text)
print(tokens)
token_ids = tokenizer.encode(text)
print(token_ids)

['hello', ',', 'world', '!']
[101, 7592, 1010, 2088, 999, 102]


### Embeddings (word, positional, contextual)

Embeddings convert tokens into vectors.
- Word embeddings: Represent words as vectors.
- Positional embeddings: Add position information.
- Contextual embeddings: Depend on context.
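
To illustrate how position information can be added to word embeddings, here is a minimal sketch of the fixed sinusoidal scheme from the original Transformer design. This is an assumption made for illustration: BERT, used in the cell below, learns its positional embeddings instead. The 4-dimensional word vectors are made-up values.

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list:
    """Fixed sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        if i + 1 < d_model:
            vec.append(math.cos(angle))
    return vec

# Hypothetical 4-dimensional word embeddings for a two-token sequence.
word_embeddings = [[0.5, -0.1, 0.3, 0.8], [0.2, 0.7, -0.4, 0.1]]

# Input to the first layer = word embedding + positional encoding, element-wise.
layer_inputs = [
    [w + p for w, p in zip(emb, sinusoidal_position(pos, 4))]
    for pos, emb in enumerate(word_embeddings)
]
print(layer_inputs[0])
print(layer_inputs[1])
```

Because the encoding depends only on the position, the same word receives a different input vector at different positions, which is what lets attention distinguish word order.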

In [3]:
# Example: Loading a pretrained model to see embeddings
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # Contextual embeddings

torch.Size([1, 4, 768])


### Self-attention mechanism

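Self-attention lets each token weigh every other token in the sequence when computing its own representation: the attention weights come from a softmax over query-key dot products, and the output is the corresponding weighted sum of value vectors.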

In [5]:
# Simplified self-attention example
import torch
import torch.nn.functional as F

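def self_attention(query, key, value):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)    # each row sums to 1
    output = torch.matmul(weights, value)  # weighted sum of value vectors
    return output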

# Example usage
query = torch.randn(1, 3, 4)
key = torch.randn(1, 3, 4)
value = torch.randn(1, 3, 4)
output = self_attention(query, key, value)
print(output.shape)

torch.Size([1, 3, 4])


### Multi-head attention

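Multi-head attention runs several attention heads in parallel, each with its own learned projections, so that different heads can capture different relationships between tokens. BERT-base has 12 layers with 12 heads each: in the output below, the list length is the number of layers and the second dimension of the attention tensor is the number of heads.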

In [6]:
# Example: Using transformers for multi-head attention
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs, output_attentions=True)
attentions = outputs.attentions  # List of attention weights for each layer
print(len(attentions))  # Number of layers
print(attentions[0].shape)  # Attention for first layer

12
torch.Size([1, 12, 4, 4])


### Feed-forward layers

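Feed-forward layers apply the same small two-layer network independently at every position of the attention output. The hidden size is typically four times the model width (768 → 3072 → 768 in the example below).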

In [7]:
# Example: Simple feed-forward layer
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768)
)
input_tensor = torch.randn(1, 10, 768)
output = ffn(input_tensor)
print(output.shape)

torch.Size([1, 10, 768])


### Encoder vs Decoder vs Encoder-Decoder

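- Encoder (e.g., BERT): Processes the whole input sequence at once and produces contextual representations.
- Decoder (e.g., GPT-2): Generates the output sequence one token at a time.
- Encoder-Decoder (e.g., T5): Encodes an input, then decodes an output conditioned on it, for tasks like translation.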
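
One practical difference between the two halves is which positions self-attention may look at: encoder self-attention is typically bidirectional, while decoder self-attention is causal. A minimal sketch of the two masks, with a hypothetical helper name:

```python
def attention_mask(seq_len: int, causal: bool) -> list:
    """Return a seq_len x seq_len mask; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

encoder_mask = attention_mask(4, causal=False)  # every token sees every token
decoder_mask = attention_mask(4, causal=True)   # token i sees only tokens 0..i

print("Decoder (causal) mask:")
for row in decoder_mask:
    print(row)
```

The lower-triangular decoder mask is what prevents a generative model from looking at the tokens it is supposed to predict.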

In [10]:
# Example: Encoder-Decoder model (T5 demonstration)
from transformers import AutoTokenizer, AutoModel

print("Encoder-Decoder Architecture Explanation:")
print("=" * 60)
print("\nT5 Model (Text-to-Text Transfer Transformer):")
print("- Encoder: Processes the input sequence")
print("- Decoder: Generates the output sequence")
print("- Use case: Translation, summarization, Q&A")

# Use BERT as example to avoid SentencePiece dependency
# (T5 would require SentencePiece which isn't installed)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Example input for translation-like task
text = "Hello world, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Remove token_type_ids to avoid conflicts
if 'token_type_ids' in inputs:
    del inputs['token_type_ids']

outputs = model(**inputs)

print(f"\nInput: '{text}'")
print(f"Encoder output shape: {outputs.last_hidden_state.shape}")
print(f"  - Batch size: {outputs.last_hidden_state.shape[0]}")
print(f"  - Sequence length: {outputs.last_hidden_state.shape[1]}")
print(f"  - Hidden dimension: {outputs.last_hidden_state.shape[2]}")

print(f"\nIn a real Encoder-Decoder model (T5):")
print(f"  1. Encoder processes: '{text}'")
print(f"  2. Decoder generates translation token-by-token")
print(f"  3. Output: 'Bonjour le monde, comment allez-vous?'")

Encoder-Decoder Architecture Explanation:

T5 Model (Text-to-Text Transfer Transformer):
- Encoder: Processes the input sequence
- Decoder: Generates the output sequence
- Use case: Translation, summarization, Q&A

Input: 'Hello world, how are you?'
Encoder output shape: torch.Size([1, 9, 768])
  - Batch size: 1
  - Sequence length: 9
  - Hidden dimension: 768

In a real Encoder-Decoder model (T5):
  1. Encoder processes: 'Hello world, how are you?'
  2. Decoder generates translation token-by-token
  3. Output: 'Bonjour le monde, comment allez-vous?'


### Causal language modeling

Causal LM predicts the next token based on previous tokens.

In [11]:
# Example: Causal LM with GPT-2
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

print("Causal Language Modeling (Next Token Prediction)")
print("=" * 60)

# Load GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Input prompt
prompt = "The future of AI is"
print(f"\nPrompt: '{prompt}'")

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt")
print(f"Input token IDs: {inputs['input_ids'].tolist()}")

# Get model predictions
outputs = model(**inputs)
next_token_logits = outputs.logits[:, -1, :]

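# Get top 5 predictions
top_k = 5
top_logits, top_indices = torch.topk(next_token_logits, top_k)
# Note: softmax over only the top-k logits renormalizes among these five
# candidates, so the percentages printed below sum to 100% and are higher
# than the probabilities under the full-vocabulary softmax.
top_probs = torch.softmax(top_logits, dim=-1)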

print(f"\nTop {top_k} predicted next tokens:")
print("-" * 40)
for i, (idx, prob) in enumerate(zip(top_indices[0], top_probs[0])):
    token = tokenizer.decode(idx)
    print(f"{i+1}. '{token}' - Probability: {prob.item():.2%}")

# Generate next token
next_token_id = torch.argmax(next_token_logits, dim=-1)
next_token = tokenizer.decode(next_token_id)
print(f"\nMost likely next token: '{next_token}'")
print(f"Full completion: '{prompt}{next_token}'")  # the GPT-2 token already starts with a space

Causal Language Modeling (Next Token Prediction)

Prompt: 'The future of AI is'
Input token IDs: [[464, 2003, 286, 9552, 318]]

Top 5 predicted next tokens:
----------------------------------------
1. ' uncertain' - Probability: 25.35%
2. ' in' - Probability: 24.25%
3. ' not' - Probability: 18.64%
4. ' a' - Probability: 16.69%
5. ' still' - Probability: 15.07%

Most likely next token: ' uncertain'
Full completion: 'The future of AI is uncertain'


## 1.3 How LLMs Learn

LLMs learn through a process of training on vast amounts of text data using specialized objectives and optimization techniques. Understanding this process is crucial for effective fine-tuning.

### 1.3.1 The Pretraining Objective: Next-Token Prediction

**What it is:**
LLMs learn through **causal language modeling** (also called next-token prediction). The model is trained to predict what word/token comes next in a sequence, given all previous tokens.

**Example:**
```
Given: "The quick brown fox"
Predict: "jumps"

Given: "The quick brown fox jumps"
Predict: "over"

Given: "The quick brown fox jumps over"
Predict: "the"
```

**Why this works:**
- Simple, self-supervised objective: it requires only raw text, no manual labels
- Forces the model to learn language structure, semantics, and world knowledge
- Creates a foundation model that can be adapted to many downstream tasks

**Process at training time:**
```
Input tokens:  [The,   quick, brown, fox,   jumps]
Targets:       [quick, brown, fox,   jumps, <END>]

The model learns: p(quick | The), p(brown | The quick), etc.
```
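
The input/target alignment above is just a one-position shift. A minimal sketch using a plain Python list of word-level tokens (real models work on token IDs, and the extra final token here is an assumption added so that every input has a target):

```python
tokens = ["The", "quick", "brown", "fox", "jumps", "over"]

# The target at each position is simply the next token in the sequence.
input_tokens = tokens[:-1]
target_tokens = tokens[1:]

for i, target in enumerate(target_tokens):
    context = " ".join(input_tokens[: i + 1])
    print(f"p({target} | {context})")
```

A single sequence of length N therefore yields N-1 training examples in one forward pass, which is part of why this objective is so data-efficient.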

### 1.3.2 Loss Functions: Measuring How Wrong the Model Is

**Cross-Entropy Loss** is the standard loss function for language modeling. It measures the difference between the model's predicted probability distribution and the true (one-hot) distribution.

**Mathematical intuition:**
$$\text{CrossEntropy} = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$$

Where:
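- $n$ = vocabulary size (number of possible tokens)
- $y_i$ = true distribution (1 for correct token, 0 for others)
- $\hat{y}_i$ = model's predicted probability for token $i$

Because $y$ is one-hot, every term except the correct token's vanishes, so the loss reduces to $-\log(\hat{y}_c)$, where $c$ is the index of the correct token.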

**Practical example:**
```
Vocabulary: [the, cat, dog, runs, sleeps, ...]
True next token: "cat" (index 1)
Model's prediction: [0.1, 0.6, 0.2, 0.05, 0.05, ...]

Loss = -log(0.6) ≈ 0.51

If model predicted [0.05, 0.05, 0.05, 0.05, 0.8, ...]:
Loss = -log(0.05) ≈ 3.0  (worse prediction = higher loss)
```
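
The two numbers in the practical example can be checked with a few lines of plain Python. This is a sketch for a single prediction with a one-hot target; the function name is hypothetical.

```python
import math

def cross_entropy(probs: list, target_index: int) -> float:
    """Cross-entropy against a one-hot target reduces to -log(p[target])."""
    return -math.log(probs[target_index])

good_prediction = [0.1, 0.6, 0.2, 0.05, 0.05]   # 0.6 on the true token "cat"
bad_prediction = [0.05, 0.05, 0.05, 0.05, 0.8]  # only 0.05 on "cat"

loss_good = cross_entropy(good_prediction, target_index=1)
loss_bad = cross_entropy(bad_prediction, target_index=1)
print(f"good: {loss_good:.2f}  bad: {loss_bad:.2f}")
```

Only the probability assigned to the correct token matters: how the remaining mass is spread over wrong tokens does not change the loss.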

**Key insight:** Lower loss = better predictions. During training, the model tries to minimize this loss.

**Perplexity:** A metric derived from the average cross-entropy loss (measured with the natural logarithm) that's more interpretable:
$$\text{Perplexity} = e^{\text{Loss}}$$

- Perplexity = 10 means the model is about as confused as if there were 10 equally likely possibilities
- Lower perplexity = better model
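
As a small worked example, perplexity can be computed directly from the probabilities a model assigned to the true tokens of a sequence. The probability values below are made up for illustration.

```python
import math

def perplexity(token_probs: list) -> float:
    """exp of the mean negative log-probability assigned to the true tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that always gives the true token probability 0.1 behaves like a
# uniform guess over 10 options.
uniform_over_ten = perplexity([0.1, 0.1, 0.1, 0.1])

# Hypothetical per-token probabilities for a four-token sequence.
mixed = perplexity([0.6, 0.05, 0.9, 0.3])

print(f"{uniform_over_ten:.2f} {mixed:.2f}")
```

Perplexity is the inverse of the geometric mean of the per-token probabilities, so a single very unlikely token (0.05 here) pulls it up noticeably.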

### 1.3.3 Optimization: How Models Update Their Weights

Training updates billions of parameters using gradient descent. Two main optimizers:

#### **Stochastic Gradient Descent (SGD)**

**How it works:**
1. Sample a small batch of training data
2. Forward pass: compute predictions
3. Compute loss
4. Backward pass: compute gradients (∂Loss/∂Weight)
5. Update: Weight = Weight - learning_rate × gradient

**Mathematical form:**
$$W_{\text{new}} = W_{\text{old}} - \eta \cdot \nabla L$$

Where:
- $\eta$ = learning rate (step size)
- $\nabla L$ = gradient of loss with respect to weights

- **Pros:** Simple, and needs no extra per-parameter optimizer state in its plain form
- **Cons:** Sensitive to the learning rate; can oscillate, stall in poor local minima, and converge slowly
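
The five steps above can be sketched on a toy problem with a single weight. The data, learning rate, and step count are assumptions chosen so the behavior is easy to follow; the loss for a batch is the mean of (w - x)² over its samples.

```python
import random

random.seed(0)
data = [2.0, 2.5, 3.0, 3.5, 4.0]  # toy targets; the loss-minimizing w is their mean, 3.0

w = 0.0     # the single weight being trained
eta = 0.05  # learning rate (step size)

for step in range(500):
    batch = random.sample(data, 2)                         # 1. sample a mini-batch
    grad = sum(2.0 * (w - x) for x in batch) / len(batch)  # 2-4. gradient of the batch loss
    w = w - eta * grad                                     # 5. W_new = W_old - eta * gradient

print(f"w after training: {w:.3f}")
```

Because each gradient comes from a random mini-batch, `w` does not settle exactly at 3.0 but keeps jittering around it. That noise is the "stochastic" part of SGD.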

#### **Adam (Adaptive Moment Estimation)** — Modern Standard

**How it improves on SGD:**
1. **Momentum:** Remembers previous gradients (accelerates in consistent directions)
2. **Adaptive learning rate:** Different learning rates for different parameters

**Key updates:**
- Keeps exponential moving average of gradients (first moment)
- Keeps exponential moving average of squared gradients (second moment)
- Adapts learning rate based on these moments
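
The moment updates listed above can be sketched for a single weight as follows. The toy loss, learning rate, and step count are assumptions for illustration; the beta and epsilon values are the commonly used defaults.

```python
import math

def grad(w: float) -> float:
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = 0.0  # first moment: moving average of gradients
v = 0.0  # second moment: moving average of squared gradients

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction: m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"w after training: {w:.3f}")
```

Dividing by the square root of the second moment makes the first steps roughly `lr` in size regardless of the gradient's scale, which is one reason Adam tends to need less learning-rate tuning.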

**Why Adam is often preferred:**
- Usually converges faster than plain SGD
- Handles sparse gradients well
- Less sensitive to learning rate choice

**Typical learning rates:**
- SGD: 0.01 - 0.1
- Adam: 0.0001 - 0.001

### 1.3.4 Training Process: Putting It All Together

**Step-by-step training loop:**

```
For each epoch (pass through dataset):
    For each batch of examples:
        1. Forward pass: predictions = model(input_tokens)
        2. Compute loss: loss = CrossEntropyLoss(predictions, target_tokens)
        3. Backward pass: gradients = backward(loss)
        4. Optimize: update weights using Adam optimizer
        5. Log metrics: loss, perplexity, etc.
```

**Reported training stats for GPT-3 (175B):**
- Parameters: 175 billion
- Training data: about 300 billion tokens
- Training compute: about 3,640 petaflop/s-days
- Batch size: 3.2 million tokens
- Learning rate: 0.6×10⁻⁴ (Adam)
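
To get a feel for the scale, the parameter and token counts listed above can be turned into a rough compute estimate. The "6 FLOPs per parameter per token" figure is a widely used rule of thumb, not an exact count, so treat the result as an order-of-magnitude sketch.

```python
n_params = 175e9  # parameters, from the list above
n_tokens = 300e9  # training tokens, from the list above

# Rule of thumb (an approximation): ~6 FLOPs per parameter per training token,
# covering the forward and backward pass.
total_flops = 6 * n_params * n_tokens

# One petaflop/s-day = 1e15 FLOP/s sustained for 86,400 seconds.
petaflop_s_day = 1e15 * 86_400
pf_days = total_flops / petaflop_s_day

print(f"{total_flops:.2e} FLOPs = about {pf_days:,.0f} petaflop/s-days")
```

Thousands of petaflop/s-days is far beyond what a single GPU can deliver in any practical time, which is why models of this size are trained on large clusters.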

### 1.3.5 Overfitting vs. Generalization: The Central Trade-off

**Generalization challenge:**
A model that memorizes training data won't generalize to new text.

**Overfitting indicators:**
- Training loss: ↓ (keeps decreasing)
- Validation loss: ↑ (starts increasing after a point)

```
Loss
 │\
 │ \\
 │  \ \                        __/   Validation loss
 │   \  \___               ___/
 │    \     \_____________/
 │     \___          ↑ Overfitting starts here
 │         \________
 │                  \__________      Training loss
 └─────────────────────────────── Epochs
```

**Techniques to improve generalization:**

1. **Dropout:** Randomly deactivate neurons during training
   - Forces network to learn redundant representations
   - Reduces co-adaptation of features

2. **Early stopping:** Stop training when validation loss stops improving (plateaus or starts rising)
   - Prevents training too long on same data
   - Finds the "sweet spot"

3. **Regularization:** Add penalty for large weights
   - L1: $\text{Loss} + \lambda \sum |W|$ (sparse weights)
   - L2: $\text{Loss} + \lambda \sum W^2$ (smaller weights)

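4. **Batch Normalization:** Normalize layer inputs across the batch
   - Stabilizes training
   - Acts as implicit regularizer
   - Transformer-based LLMs typically use layer normalization instead, mainly for training stability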

5. **More data:** Larger datasets reduce overfitting
   - LLMs trained on massive datasets generalize better
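
As an illustration of the early stopping technique from the list above, here is a minimal sketch with a patience counter. The function name, the patience value, and the validation losses are all hypothetical.

```python
def early_stopping_epoch(val_losses: list, patience: int = 2) -> int:
    """Return the index of the epoch whose checkpoint should be kept.

    Training stops once validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training here
    return best_epoch

# Hypothetical validation losses: improving, then rising as overfitting sets in.
history = [2.9, 2.4, 2.1, 2.0, 2.05, 2.2, 2.6]
best = early_stopping_epoch(history)
print(f"Keep the checkpoint from epoch index {best}")
```

The patience window avoids stopping on a single noisy epoch, at the cost of training a little longer than strictly necessary.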

**Key insight for fine-tuning:**
When fine-tuning on small datasets, generalization is harder. Use techniques like LoRA (fewer parameters) and early stopping to maintain performance.
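
To show why LoRA means "fewer parameters", the sketch below compares the trainable weights of one full weight matrix with those of its low-rank update. The layer size and rank are assumptions for illustration, and the helper name is hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

d = 768  # hidden size of a BERT-base-sized layer (assumed for illustration)
r = 8    # LoRA rank (assumed; small ranks like this are common)

full = d * d                            # weights updated by full fine-tuning
lora = lora_trainable_params(d, d, r)   # weights updated by LoRA

print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

With far fewer trainable weights, the model has less capacity to memorize a small fine-tuning set, which is the connection to generalization made above.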

In [4]:
# ============================================================================
# CODE EXAMPLE 1: Cross-Entropy Loss - Detailed Walkthrough
# ============================================================================

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

print("=" * 60)
print("UNDERSTANDING CROSS-ENTROPY LOSS")
print("=" * 60)

# Scenario: Predicting next token in vocabulary of 5 words
# Vocabulary: ["the", "cat", "dog", "runs", "sleeps"]
# True next token: "cat" (index 1)

# Model's raw predictions (logits)
logits = torch.tensor([[0.1, 3.0, 0.5, 0.2, 0.1]])  # Batch of 1
target = torch.tensor([1])  # True token is at index 1

print("\n1. Raw Model Output (logits):", logits.tolist())
print("   True token index: 1 ('cat')")

# Method 1: Using CrossEntropyLoss (combines LogSoftmax + NLLLoss)
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, target)
print(f"\n2. Cross-Entropy Loss: {loss.item():.4f}")

# Method 2: Manual calculation to understand what's happening
probabilities = F.softmax(logits, dim=-1)
print(f"\n3. Probabilities after softmax:")
print(f"   'the': {probabilities[0, 0].item():.4f}")
print(f"   'cat': {probabilities[0, 1].item():.4f}")
print(f"   'dog': {probabilities[0, 2].item():.4f}")
print(f"   'runs': {probabilities[0, 3].item():.4f}")
print(f"   'sleeps': {probabilities[0, 4].item():.4f}")

manual_loss = -torch.log(probabilities[0, 1])
print(f"\n4. Manual Loss Calculation: -log({probabilities[0, 1].item():.4f}) = {manual_loss.item():.4f}")

# ============================================================================
# CODE EXAMPLE 2: Batch Training with SGD vs Adam
# ============================================================================

print("\n\n" + "=" * 60)
print("OPTIMIZERS: SGD vs ADAM")
print("=" * 60)

# Create a simple model
class SimpleLanguageModel(torch.nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, hidden_size)
        self.linear = torch.nn.Linear(hidden_size, vocab_size)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.mean(dim=1)  # Simple aggregation
        x = self.linear(x)
        return x

model = SimpleLanguageModel()
loss_fn = nn.CrossEntropyLoss()

# Generate dummy data
batch_size = 4
seq_length = 5
vocab_size = 1000

input_ids = torch.randint(0, vocab_size, (batch_size, seq_length))
targets = torch.randint(0, vocab_size, (batch_size,))

print(f"\nTraining setup:")
print(f"  Batch size: {batch_size}")
print(f"  Sequence length: {seq_length}")
print(f"  Vocabulary size: {vocab_size}")
print(f"  Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Compare optimizers
def train_step(model, optimizer, num_steps=3):
    """Run a few updates on `model`, which must be the model `optimizer` was built for."""
    losses = []
    for step in range(num_steps):
        optimizer.zero_grad()
        outputs = model(input_ids)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

# Fresh model for SGD, starting from the same initial weights as `model`
model1 = SimpleLanguageModel()
model1.load_state_dict(model.state_dict())
sgd_optimizer = torch.optim.SGD(model1.parameters(), lr=0.01)
sgd_losses = train_step(model1, sgd_optimizer, num_steps=5)

# Fresh model for Adam, starting from the same initial weights as `model`
model2 = SimpleLanguageModel()
model2.load_state_dict(model.state_dict())
adam_optimizer = torch.optim.Adam(model2.parameters(), lr=0.001)
adam_losses = train_step(model2, adam_optimizer, num_steps=5)

print(f"\nSGD Learning (lr=0.01):")
for i, loss in enumerate(sgd_losses):
    print(f"  Step {i+1}: Loss = {loss:.4f}")

print(f"\nAdam Learning (lr=0.001):")
for i, loss in enumerate(adam_losses):
    print(f"  Step {i+1}: Loss = {loss:.4f}")

print(f"\nObservations:")
print(f"  SGD final loss: {sgd_losses[-1]:.4f}")
print(f"  Adam final loss: {adam_losses[-1]:.4f}")
lower = "Adam" if adam_losses[-1] < sgd_losses[-1] else "SGD"
print(f"  Lower loss after 5 steps: {lower} (toy setup; results depend on learning rates and data)")

# ============================================================================
# CODE EXAMPLE 3: Perplexity - Understanding Model Confidence
# ============================================================================

print("\n\n" + "=" * 60)
print("PERPLEXITY: A MORE INTUITIVE METRIC")
print("=" * 60)

losses_to_evaluate = [0.5, 1.0, 2.0, 3.0]
print(f"\nLoss → Perplexity Interpretation:")
print(f"{'Loss':<8} {'Perplexity':<15} {'Interpretation'}")
print("-" * 50)

for loss_val in losses_to_evaluate:
    perplexity = np.exp(loss_val)
    interpretation = f"Model is as confused as if there were {perplexity:.1f} equally likely tokens"
    print(f"{loss_val:<8} {perplexity:<15.2f} {interpretation}")

print(f"\nKey insight:")
print(f"  - Perplexity = 1 (loss=0): Perfect predictions")
print(f"  - Perplexity = 10 (loss=2.3): Model thinks top 10 tokens are equally likely")
print(f"  - Perplexity = 50000 (loss=10.8): Model is very uncertain")

UNDERSTANDING CROSS-ENTROPY LOSS

1. Raw Model Output (logits): [[0.10000000149011612, 3.0, 0.5, 0.20000000298023224, 0.10000000149011612]]
   True token index: 1 ('cat')

2. Cross-Entropy Loss: 0.2255

3. Probabilities after softmax:
   'the': 0.0439
   'cat': 0.7981
   'dog': 0.0655
   'runs': 0.0485
   'sleeps': 0.0439

4. Manual Loss Calculation: -log(0.7981) = 0.2255


OPTIMIZERS: SGD vs ADAM

Training setup:
  Batch size: 4
  Sequence length: 5
  Vocabulary size: 1000
  Model parameters: 129,000


PERPLEXITY: A MORE INTUITIVE METRIC

Loss → Perplexity Interpretation:
Loss     Perpl