# Pre-training Objectives

You have a Transformer architecture. Now what?

You can't train it on labeled data - there isn't enough. Supervised datasets have thousands or maybe millions of examples. But language has BILLIONS of sentences freely available on the internet.

**The insight:** Create training objectives from raw text itself. Make the model predict parts of text from other parts.

This is pre-training. And it's why modern LLMs work.

## The Problem: Supervised Learning Doesn't Scale

Traditional NLP:
```
Labeled data: "I love cats" → Positive sentiment
              "This is bad" → Negative sentiment
```

You need humans to label thousands of examples for EACH task. Expensive. Slow. Limited.

**But:** The internet has trillions of words of unlabeled text. Wikipedia, books, Reddit, news articles.

What if we could learn from that?

## Self-Supervised Learning

Create labels automatically from the data itself.

**Example:**
```
Text: "The cat sat on the mat"

Task 1: Hide "cat", predict it from context
"The [MASK] sat on the mat" → predict "cat"

Task 2: Predict next word
"The cat sat on the" → predict "mat"
```

No humans needed. The text provides its own supervision.

This is the foundation of BERT, GPT, and all modern LLMs.

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

%matplotlib inline

## Causal Language Modeling (CLM)

**The task:** Predict the next word given all previous words.

```
Input:  "The cat sat on"
Output: "the"

Input:  "The cat sat on the"
Output: "mat"
```

This is what GPT does. Train on billions of words, learn to predict what comes next.

**Why it works:** To predict the next word well, you need to understand:
- Grammar ("the" follows "on", not "cat")
- Semantics ("mat" is plausible after "sat on the")
- Context (what "it" refers to)
- World knowledge (cats sit on mats, not the other way around)

All from just predicting next words.

### Implementation: Simple Next-Word Prediction

In [5]:
# Toy vocabulary
vocab = {
    '<PAD>': 0, '<START>': 1, '<END>': 2,
    'the': 3, 'cat': 4, 'sat': 5, 'on': 6, 'mat': 7,
    'dog': 8, 'ran': 9, 'fast': 10
}
idx_to_word = {v: k for k, v in vocab.items()}

def tokenize(sentence, vocab):
    """Convert sentence to token IDs"""
    words = sentence.lower().split()
    # Unknown words fall back to ID 0 (<PAD>); this toy vocab has no <UNK> token
    return [vocab.get(w, 0) for w in words]

# Example sentences
sentences = [
    "the cat sat on the mat",
    "the dog ran fast",
    "the cat ran fast",
]

for sent in sentences:
    tokens = tokenize(sent, vocab)
    print(f"{sent:25s} → {tokens}")

the cat sat on the mat    → [3, 4, 5, 6, 3, 7]
the dog ran fast          → [3, 8, 9, 10]
the cat ran fast          → [3, 4, 9, 10]


### Creating Training Examples

For each sentence, create multiple (input, target) pairs:

In [6]:
def create_clm_examples(tokens):
    """
    Create causal LM training examples.
    
    Input: [w1, w2, w3, w4]
    Output:
        ([w1], w2)
        ([w1, w2], w3)
        ([w1, w2, w3], w4)
    """
    examples = []
    for i in range(1, len(tokens)):
        input_seq = tokens[:i]
        target = tokens[i]
        examples.append((input_seq, target))
    return examples

# Demo
sentence = "the cat sat on the mat"
tokens = tokenize(sentence, vocab)
examples = create_clm_examples(tokens)

print("Training examples from 'the cat sat on the mat':\n")
for inp, target in examples:
    inp_words = ' '.join([idx_to_word[i] for i in inp])
    target_word = idx_to_word[target]
    print(f"  Input: {inp_words:30s} → Target: {target_word}")

Training examples from 'the cat sat on the mat':

  Input: the                            → Target: cat
  Input: the cat                        → Target: sat
  Input: the cat sat                    → Target: on
  Input: the cat sat on                 → Target: the
  Input: the cat sat on the             → Target: mat


One six-word sentence gives us 5 training examples (one for every word after the first). A million such sentences? 5 million examples. All free.
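
In practice the prefixes are rarely materialized one by one. A common shortcut, sketched below under the assumption that the model applies a causal attention mask, is to feed the sequence once and use the same sequence shifted by one position as the targets. The helper name `shifted_pairs` and the variable `toy_tokens` are made up for this illustration.

```python
def shifted_pairs(tokens):
    """Return (inputs, targets) where targets is inputs shifted left by one."""
    return tokens[:-1], tokens[1:]

# "the cat sat on the mat" in the toy vocabulary used above
toy_tokens = [3, 4, 5, 6, 3, 7]
inputs, targets = shifted_pairs(toy_tokens)

print("inputs: ", inputs)
print("targets:", targets)

# Position i of `targets` is the label for the prefix inputs[:i + 1]
for i, target in enumerate(targets):
    print(f"  prefix {inputs[:i + 1]} -> target {target}")
```

Each position `i` plays the role of one prefix example, so a single forward pass can cover every (input, target) pair produced by the loop above.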

### Loss Function

Standard cross-entropy loss:

```
Loss = -log P(w_target | context)
```

The model outputs a probability distribution over the vocabulary. We want high probability on the correct next word.

In [7]:
def cross_entropy_loss(logits, target):
    """
    Args:
        logits: (vocab_size,) unnormalized scores
        target: scalar, true token ID
    """
    # Softmax
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    
    # Negative log likelihood
    loss = -np.log(probs[target] + 1e-10)
    
    return loss, probs

# Example
vocab_size = len(vocab)
logits = np.random.randn(vocab_size)
target = vocab['mat']  # True next word is 'mat'

loss, probs = cross_entropy_loss(logits, target)
print(f"Loss: {loss:.3f}")
print(f"\nProbability of correct word 'mat': {probs[target]:.3f}")
print(f"Top 3 predictions:")
top3 = np.argsort(probs)[-3:][::-1]
for idx in top3:
    print(f"  {idx_to_word[idx]:10s}: {probs[idx]:.3f}")

Loss: 1.675

Probability of correct word 'mat': 0.187
Top 3 predictions:
  mat       : 0.187
  fast      : 0.175
  cat       : 0.116
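
The logits above are random, so the loss says nothing about learning. For contrast, here is a rough sketch of the same loss applied to a count-based bigram predictor trained on the three toy sentences. A bigram model is a much simpler stand-in for a Transformer and is used here only for illustration; the names `toy_corpus`, `next_word_probs`, and `average_loss` are hypothetical, and there is no smoothing, so an unseen word pair would raise a `KeyError`.

```python
import math
from collections import Counter, defaultdict

toy_corpus = [
    "the cat sat on the mat",
    "the dog ran fast",
    "the cat ran fast",
]

# Count how often each word follows each previous word
bigram_counts = defaultdict(Counter)
for sentence in toy_corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev_word):
    """Estimate P(next | prev) from raw counts (no smoothing)."""
    counts = bigram_counts.get(prev_word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def average_loss(sentence):
    """Mean of -log P(w_target | previous word) over the sentence."""
    words = sentence.split()
    losses = [
        -math.log(next_word_probs(prev)[nxt])
        for prev, nxt in zip(words, words[1:])
    ]
    return sum(losses) / len(losses)

print(next_word_probs("the"))
print(f"Average loss: {average_loss('the cat sat on the mat'):.3f}")
```

For "the cat sat on the mat" the per-word probabilities are 0.5, 0.5, 1, 1, and 0.25, so the average loss is (4 · ln 2) / 5 ≈ 0.55. Even crude counts put far more probability on the right words than random scores do.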


## Masked Language Modeling (MLM)

**The task:** Hide random words, predict them from context.

```
Original: "The cat sat on the mat"
Masked:   "The [MASK] sat on the mat"
Predict:  "cat"
```

This is what BERT does. Key difference from CLM:

**CLM:** Only sees past words (left context)  
**MLM:** Sees both past AND future words (bidirectional context)

MLM is better for understanding tasks (classification, QA). CLM is better for generation.

### BERT's Masking Strategy

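BERT selects about 15% of the tokens as prediction targets, but it doesn't just replace all of them with [MASK]. Each selected token gets one of three treatments (so roughly 12% of all tokens become [MASK], 1.5% become a random word, and 1.5% stay unchanged):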

**80% of the time:** Replace with [MASK]  
```"The [MASK] sat on the mat"```

**10% of the time:** Replace with random word  
```"The dog sat on the mat"``` (but still predict "cat")

**10% of the time:** Keep original  
```"The cat sat on the mat"``` (still predict "cat")

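**Why?** [MASK] never appears in downstream (fine-tuning) inputs, so always masking would create a mismatch between pre-training and fine-tuning. Because the model can't tell which tokens were replaced or kept, it has to keep a useful representation of every input token, not just the [MASK] positions.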

In [8]:
def create_mlm_example(tokens, vocab, mask_prob=0.15):
    """
    Create masked language model training example.
    
    Returns:
        masked_tokens: input with some tokens masked
        targets: positions and true tokens to predict
    """
    masked_tokens = tokens.copy()
    targets = []
    
    MASK_ID = vocab['<PAD>']  # Using PAD as MASK for simplicity
    
    for i, token in enumerate(tokens):
        # Skip special tokens
        if token < 3:
            continue
            
        if np.random.random() < mask_prob:
            targets.append((i, token))  # Store position and true token
            
            rand = np.random.random()
            if rand < 0.8:
                # 80%: mask
                masked_tokens[i] = MASK_ID
            elif rand < 0.9:
                # 10%: random word
                masked_tokens[i] = np.random.randint(3, len(vocab))
            # else: 10% keep original
    
    return masked_tokens, targets

# Demo
sentence = "the cat sat on the mat"
tokens = tokenize(sentence, vocab)

print("Original sentence:", sentence)
print("Original tokens:  ", [idx_to_word[t] for t in tokens])
print("\nMasked versions:\n")

for trial in range(5):
    masked, targets = create_mlm_example(tokens, vocab)
    masked_words = [idx_to_word[t] for t in masked]
    print(f"Trial {trial + 1}: {masked_words}")
    if targets:
        for pos, true_token in targets:
            print(f"         → Predict '{idx_to_word[true_token]}' at position {pos}")

Original sentence: the cat sat on the mat
Original tokens:   ['the', 'cat', 'sat', 'on', 'the', 'mat']

Masked versions:

Trial 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
Trial 2: ['<PAD>', 'cat', 'sat', 'on', 'the', 'mat']
         → Predict 'the' at position 0
Trial 3: ['the', 'cat', 'sat', 'on', '<PAD>', 'mat']
         → Predict 'the' at position 4
Trial 4: ['the', 'cat', 'sat', 'on', 'on', 'mat']
         → Predict 'the' at position 4
Trial 5: ['the', '<PAD>', 'sat', 'on', 'the', 'mat']
         → Predict 'cat' at position 1
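
The loss for MLM is the same cross-entropy as before, but it is only computed at the selected positions; every other position contributes nothing. A minimal sketch, assuming the model returns one row of logits per input position and that `targets` has the `(position, true_token)` format shown above (the function name `mlm_loss` is hypothetical):

```python
import numpy as np

def mlm_loss(logits, targets):
    """
    Mean cross-entropy over the targeted positions only.

    logits:  (seq_len, vocab_size) unnormalized scores, one row per position
    targets: list of (position, true_token_id) pairs
    Returns 0.0 when no position was selected.
    """
    if not targets:
        return 0.0
    losses = []
    for pos, true_token in targets:
        row = logits[pos] - logits[pos].max()        # numerical stability
        log_probs = row - np.log(np.exp(row).sum())  # log-softmax
        losses.append(-log_probs[true_token])
    return float(np.mean(losses))

# Uninformed model: all-zero logits give a uniform distribution over 11 tokens
uniform_logits = np.zeros((6, 11))
print(mlm_loss(uniform_logits, [(1, 4)]))  # equals log(11)
print(mlm_loss(uniform_logits, []))        # nothing masked, nothing to learn from
```

This is why MLM gets a learning signal from only about 15% of positions per pass, while CLM gets one from every position.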


## Comparison: CLM vs MLM

| Aspect | Causal LM (GPT) | Masked LM (BERT) |
|--------|-----------------|------------------|
| **Context** | Left-only | Bidirectional |
| **Attention** | Masked (causal) | Full (bidirectional) |
| **Prediction** | Next token | Masked tokens |
| **Training** | Predict all positions | Predict 15% of positions |
| **Best for** | Generation | Understanding |
| **Examples** | GPT-2, GPT-3, GPT-4 | BERT, RoBERTa |

**CLM advantage:** Natural for generation (continue text)  
**MLM advantage:** Richer context (sees future) for understanding
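
The "Attention" row of the table comes down to a single boolean mask. The sketch below builds both masks with NumPy and applies them to random attention scores; the sequence length, the random scores, and the helper `masked_softmax` are illustrative assumptions, not part of any particular model.

```python
import numpy as np

seq_len = 5

# Causal mask: position i may attend to positions 0..i (lower triangle)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional mask: every position may attend to every position
full_mask = np.ones((seq_len, seq_len), dtype=bool)

def masked_softmax(scores, mask):
    """Softmax over the last axis, giving zero weight where mask is False."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))

causal_weights = masked_softmax(scores, causal_mask)
full_weights = masked_softmax(scores, full_mask)

print(causal_mask.astype(int))
print(np.round(causal_weights, 2))
```

With the causal mask, every weight above the diagonal is exactly zero, so a token never sees its future. With the full mask, each row spreads weight over the whole sequence.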

## Next Sentence Prediction (NSP)

BERT also used a second objective: predict if sentence B follows sentence A.

```
Sentence A: "The cat sat on the mat."
Sentence B: "It was very comfortable."
Label: IsNext (1)

Sentence A: "The cat sat on the mat."
Sentence B: "The stock market crashed."
Label: NotNext (0)
```

**Goal:** Learn sentence relationships.

**Reality:** Later research (RoBERTa) found NSP doesn't help much. It's often skipped now.

In [9]:
def create_nsp_example(sentences):
    """
    Create next sentence prediction example.
    
    50% of time: pick consecutive sentences (label=1)
    50% of time: pick random sentences (label=0)
    """
    if np.random.random() < 0.5:
        # Positive example: consecutive sentences
        idx = np.random.randint(0, len(sentences) - 1)
        sent_a = sentences[idx]
        sent_b = sentences[idx + 1]
        label = 1
    else:
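        # Negative example: two distinct random sentences.
        # Caveat: this toy sampler can still draw a consecutive pair
        # (idx_b == idx_a + 1) and label it 0, as in some examples below.
        # A real pipeline would resample or draw B from another document.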
        idx_a, idx_b = np.random.choice(len(sentences), size=2, replace=False)
        sent_a = sentences[idx_a]
        sent_b = sentences[idx_b]
        label = 0
    
    return sent_a, sent_b, label

# Demo
corpus = [
    "The cat sat on the mat.",
    "It looked very comfortable.",
    "The dog ran in the park.",
    "Birds were singing loudly.",
]

print("NSP training examples:\n")
for i in range(5):
    sent_a, sent_b, label = create_nsp_example(corpus)
    label_str = "IsNext" if label == 1 else "NotNext"
    print(f"Example {i + 1}:")
    print(f"  A: {sent_a}")
    print(f"  B: {sent_b}")
    print(f"  Label: {label_str}\n")

NSP training examples:

Example 1:
  A: It looked very comfortable.
  B: The dog ran in the park.
  Label: NotNext

Example 2:
  A: The dog ran in the park.
  B: It looked very comfortable.
  Label: NotNext

Example 3:
  A: Birds were singing loudly.
  B: The dog ran in the park.
  Label: NotNext

Example 4:
  A: The dog ran in the park.
  B: Birds were singing loudly.
  Label: NotNext

Example 5:
  A: It looked very comfortable.
  B: The dog ran in the park.
  Label: IsNext



## Sequence-to-Sequence Objectives

Encoder-decoder models (T5, BART) use denoising objectives:

**T5's span corruption:**
```
Input:  "The cat <X> on the <Y>"
Output: "<X> sat <Y> mat"
```

Corrupt the input by masking spans, then reconstruct.

**BART's noise functions:**
- Token masking
- Token deletion  
- Sentence shuffling
- Document rotation

The model learns to denoise, which teaches both understanding and generation.

In [10]:
def span_corruption(tokens, vocab, corruption_rate=0.15):
    """
    T5-style span corruption.
    Replace random spans with sentinel tokens.
    """
    corrupted = []
    targets = []
    
    i = 0
    sentinel_id = 0
    
    while i < len(tokens):
        if np.random.random() < corruption_rate:
            # Start a corrupted span
            span_length = np.random.randint(1, 4)
            span = tokens[i:i + span_length]
            
            # Add sentinel to input
            sentinel = f"<X{sentinel_id}>"
            corrupted.append(sentinel)
            
            # Add sentinel and span to target
            targets.append(sentinel)
            targets.extend([idx_to_word[t] for t in span])
            
            sentinel_id += 1
            i += span_length
        else:
            corrupted.append(idx_to_word[tokens[i]])
            i += 1
    
    return corrupted, targets

# Demo
sentence = "the cat sat on the mat"
tokens = tokenize(sentence, vocab)

print("T5 span corruption:\n")
for trial in range(3):
    corrupted, targets = span_corruption(tokens, vocab)
    print(f"Trial {trial + 1}:")
    print(f"  Input:  {' '.join(corrupted)}")
    print(f"  Target: {' '.join(targets)}\n")

T5 span corruption:

Trial 1:
  Input:  <X0> on the <X1>
  Target: <X0> the cat sat <X1> mat

Trial 2:
  Input:  <X0> <X1> on the <X2>
  Target: <X0> the cat <X1> sat <X2> mat

Trial 3:
  Input:  the cat sat on <X0>
  Target: <X0> the mat
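
The BART-style noise functions listed above (token masking, token deletion, sentence shuffling, document rotation) can be sketched in the same toy spirit. The function names, the `<mask>` string, and the 0.3 probabilities are assumptions chosen for illustration, not BART's actual settings, and the functions work on word strings rather than token IDs.

```python
import random

def token_masking(tokens, mask_prob=0.3, mask_token="<mask>", rng=random):
    """Replace each token with mask_token with probability mask_prob."""
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

def token_deletion(tokens, delete_prob=0.3, rng=random):
    """Drop each token with probability delete_prob (no placeholder is left)."""
    return [t for t in tokens if rng.random() >= delete_prob]

def sentence_shuffling(sentences, rng=random):
    """Return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, rng=random):
    """Rotate the token list so that it starts at a randomly chosen token."""
    if not tokens:
        return []
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]

rng = random.Random(0)
demo_tokens = "the cat sat on the mat".split()
demo_doc = [
    "The cat sat on the mat.",
    "It looked very comfortable.",
    "The dog ran in the park.",
]

masked = token_masking(demo_tokens, rng=rng)
deleted = token_deletion(demo_tokens, rng=rng)
shuffled = sentence_shuffling(demo_doc, rng=rng)
rotated = document_rotation(demo_tokens, rng=rng)

print("Masking: ", " ".join(masked))
print("Deletion:", " ".join(deleted))
print("Shuffle: ", shuffled)
print("Rotation:", " ".join(rotated))
```

In every case the training target is the original, uncorrupted text. Deletion is harder than masking because the model also has to work out where something is missing.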



## Contrastive Learning

Another approach: learn by contrasting similar vs dissimilar examples.

**SimCLR-style contrastive learning for text (e.g., SimCSE):**
- Same sentence, different augmentations → should be similar
- Different sentences → should be dissimilar

**Sentence embeddings:**
```
"The cat sat on the mat" → [0.2, 0.8, -0.3, ...]
"A feline rested on the rug" → [0.3, 0.7, -0.2, ...]
"The stock market crashed" → [-0.5, 0.1, 0.9, ...]
```

First two should be close in embedding space, third far away.
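
"Close" and "far" are usually measured with cosine similarity, and a common training signal is an InfoNCE-style loss that treats the positive as the correct class among all candidates. The sketch below uses only the three values shown for each sentence above as hypothetical 3-D embeddings; the temperature of 0.1 and the function names are assumptions made for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """
    InfoNCE-style loss: cross-entropy where the positive is the
    "correct class" among the positive plus all negatives.
    """
    sims = np.array(
        [cosine_similarity(anchor, positive)]
        + [cosine_similarity(anchor, n) for n in negatives]
    )
    logits = sims / temperature
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-log_probs[0])                        # positive is at index 0

# Hypothetical 3-D embeddings (truncated from the example above)
cat_mat = np.array([0.2, 0.8, -0.3])
feline_rug = np.array([0.3, 0.7, -0.2])
stock_market = np.array([-0.5, 0.1, 0.9])

print("cat/feline similarity:", round(cosine_similarity(cat_mat, feline_rug), 3))
print("cat/stock similarity: ", round(cosine_similarity(cat_mat, stock_market), 3))

good_loss = info_nce_loss(cat_mat, feline_rug, [stock_market])
bad_loss = info_nce_loss(cat_mat, stock_market, [feline_rug])
print("loss with paraphrase as positive:", round(good_loss, 4))
print("loss with unrelated as positive: ", round(bad_loss, 4))
```

The loss is small when the paraphrase is the positive and large when the unrelated sentence is, which is the pressure that pulls similar sentences together during training.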

## Training at Scale

Modern pre-training stats:

**BERT (2018):**
- 3.3B words (BooksCorpus + Wikipedia)
- 4 days on 16 Cloud TPUs (64 TPU chips) for BERT-Large
- 340M parameters (BERT-Large; BERT-Base has 110M)

**GPT-3 (2020):**
- ~300B training tokens, sampled from a ~500B-token mix (Common Crawl, WebText2, Books, Wikipedia)
- ~3,640 petaflop/s-days of training compute
- 175B parameters

**GPT-4 (2023):**
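- Training data not disclosed; likely >1T tokens
- Training compute not disclosed; estimated at months on thousands of GPUs
- Parameter count not disclosed (the widely repeated ~1.7T figure is an unconfirmed estimate)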

The trend: more data, bigger models, more compute.

## Why Pre-training Works

**1. Scale of data**
- Billions of tokens vs thousands of labeled examples
- Covers diverse domains, styles, topics

**2. Rich learning signal**
- Every position in every sentence is a training example
- Model sees same word in thousands of contexts

**3. Transfer learning**
- Pre-trained representations work for many downstream tasks
- Fine-tune on small labeled dataset → great performance

**4. Emergent abilities**
- At scale, models show abilities such as multi-step reasoning, arithmetic, and translation
- None of these is explicitly taught; they emerge from next-word prediction

## Fine-tuning vs Prompting

After pre-training, two ways to use the model:

**Fine-tuning (BERT era):**
```
Pre-trained model → Add task-specific head → Train on labeled data
```
Example: Add classification layer, train on sentiment dataset.

**Prompting (GPT-3 era):**
```
Pre-trained model → Give task instruction as text → Generate answer
```
Example: "Classify sentiment: 'I love this movie' →" → Model generates "Positive"

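Large enough models can do tasks via prompting alone: the task is specified in the prompt (an instruction, optionally with a few worked examples) and no weights are updated. This is IN-CONTEXT LEARNING.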

In [11]:
# Conceptual example of fine-tuning vs prompting

def fine_tuning_approach():
    """
    Traditional approach: modify model architecture
    """
    print("Fine-tuning approach:")
    print("  1. Take pre-trained BERT")
    print("  2. Add classification head on top")
    print("  3. Train on sentiment dataset (1000s examples)")
    print("  4. Inference: feed text → get label\n")
    
def prompting_approach():
    """
    Modern approach: just provide examples in context
    """
    print("Prompting approach (few-shot):")
    print("  Prompt:")
    print("    'I love this movie' → Positive")
    print("    'This is terrible' → Negative")
    print("    'Best film ever!' → Positive")
    print("    'I enjoyed every minute' → ?")
    print("  ")
    print("  Model generates: 'Positive'")
    print("  No training needed!\n")

fine_tuning_approach()
prompting_approach()

Fine-tuning approach:
  1. Take pre-trained BERT
  2. Add classification head on top
  3. Train on sentiment dataset (1000s examples)
  4. Inference: feed text → get label

Prompting approach (few-shot):
  Prompt:
    'I love this movie' → Positive
    'This is terrible' → Negative
    'Best film ever!' → Positive
    'I enjoyed every minute' → ?
  
  Model generates: 'Positive'
  No training needed!
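
Assembling such a prompt is plain string formatting. The helper below is a hypothetical sketch: the `Text:` / `Label:` layout and the instruction wording are assumptions, and it stops short of calling any model, since that depends on whichever API is being used.

```python
def build_few_shot_prompt(examples, query, instruction="Classify the sentiment."):
    """
    Build a few-shot prompt from (text, label) pairs and a new query.
    Passing an empty `examples` list yields a zero-shot prompt.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)

few_shot_examples = [
    ("I love this movie", "Positive"),
    ("This is terrible", "Negative"),
    ("Best film ever!", "Positive"),
]

prompt = build_few_shot_prompt(few_shot_examples, "I enjoyed every minute")
print(prompt)
```

The prompt ends on an open `Label:` line, so the most natural continuation for a next-word predictor is the label itself.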



## Putting It All Together: Training Pipeline

**Step 1: Data collection**
- Scrape web, books, Wikipedia
- Clean, filter, deduplicate
- Billions of tokens

**Step 2: Tokenization**
- Convert text to subword tokens (BPE, WordPiece)
- Build vocabulary (roughly 30K-50K tokens for BERT and GPT-2; many newer models use 100K or more)

**Step 3: Pre-training**
- Choose objective (CLM, MLM, etc.)
- Train on massive corpus
- Days to months on many GPUs/TPUs

**Step 4: Evaluation**
- Perplexity on held-out data
- Downstream task performance

**Step 5: Fine-tuning (optional)**
- Task-specific training
- Or just use prompting
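
Perplexity, the Step 4 metric, is the exponential of the average cross-entropy loss on held-out text. A minimal sketch, assuming we already have the probability the model assigned to each true token (the function name and the example probabilities are made up):

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true tokens)."""
    token_probs = np.asarray(token_probs, dtype=float)
    nll = -np.log(token_probs)
    return float(np.exp(nll.mean()))

# A model that guesses uniformly over an 11-word vocabulary
uniform_ppl = perplexity([1 / 11] * 5)

# A model that is fairly confident about the right tokens
confident_ppl = perplexity([0.5, 0.5, 0.9, 0.9, 0.25])

print(f"Uniform guesser:  {uniform_ppl:.2f}")
print(f"Confident model:  {confident_ppl:.2f}")
```

A perplexity of k can be read as "the model is as uncertain as if it were choosing uniformly among k words", which is why the uniform guesser scores exactly the vocabulary size. Lower is better.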

## Modern Trends

**1. Scaling laws**
- Performance improves predictably with more compute, data, parameters
- No clear sign of saturation over the ranges tested so far

**2. Instruction tuning**
- After pre-training, fine-tune on instruction-following tasks
- Makes models better at following user requests
- Examples: InstructGPT, Flan-T5

**3. RLHF (Reinforcement Learning from Human Feedback)**
- Train reward model from human preferences
- Optimize language model using RL to maximize reward
- ChatGPT uses this

**4. Sparse models**
- Mixture of Experts (MoE): different experts for different inputs
- Activate only small subset of parameters per example
- Enables more total parameters at roughly the same compute per token
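
The Mixture of Experts idea can be sketched with a router that scores every expert and then runs only the top few. Everything here is a simplifying assumption for illustration: each "expert" is a single random linear layer, routing is top-2, and the names `moe_forward`, `router_w`, and `expert_ws` are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def moe_forward(x, router_w, expert_ws, top_k=2):
    """
    x:         (d,) input vector for one token
    router_w:  (d, n_experts) router weights
    expert_ws: list of (d, d) matrices, one toy linear "expert" each
    Returns the gated output and the indices of the experts that ran.
    """
    router_logits = x @ router_w                  # one score per expert
    top_idx = np.argsort(router_logits)[-top_k:]  # the k highest-scoring experts
    gates = softmax(router_logits[top_idx])       # renormalize over chosen experts
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top_idx):
        out += gate * (x @ expert_ws[idx])        # only these experts do any work
    return out, top_idx

rng = np.random.default_rng(0)
d, n_experts = 4, 8
x = rng.normal(size=d)
router_w = rng.normal(size=(d, n_experts))
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]

out, used = moe_forward(x, router_w, expert_ws, top_k=2)
print("Experts used:", sorted(used.tolist()), "of", n_experts)
print("Output:", np.round(out, 3))
```

Here 8 experts hold the parameters but only 2 run per token, so the parameter count grows with the number of experts while the per-token compute stays tied to `top_k`.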

## Key Takeaways

**Self-supervised learning:**
- Create labels from data itself
- Enables training on massive unlabeled corpora
- Foundation of modern NLP

**Pre-training objectives:**
- **CLM:** Predict next word (GPT) - good for generation
- **MLM:** Predict masked words (BERT) - good for understanding
- **Seq2seq:** Denoise corrupted text (T5, BART) - good for both

**The recipe:**
1. Collect billions of tokens
2. Train with self-supervised objective
3. Learn rich representations
4. Transfer to downstream tasks

**Why it works:**
- Scale: more data than any supervised dataset
- Generality: learns language, not just specific tasks
- Emergence: complex abilities arise from simple objective

**The insight:**
> "You shall know a word by the company it keeps." - Firth, 1957

Modern LLMs take this to the extreme. They learn a remarkable amount about language just by predicting words in context.