# Complete LLM Notebook

## Sections 

### 1.1 Tokenization, vocab, sequence formatting

1. Byte level tokenization vs subword (BPE, WordPiece, SentencePiece).
2. How IDs map to tokens and how vocab size impacts model size.
3. Padding, masking, and attention masks for autoregressive tasks.
4. Special tokens: BOS, EOS, PAD, UNK.
5. Sequence packing strategies (contiguous examples in one long stream).
6. Sliding window chunking: critical for LLM training.

TensorFlow specific:
- Use keras_nlp tokenizers
- Understand how to batch variable-length sequences
- Learn how tf.data handles ragged tensors if needed


In [None]:
%pip install tensorflow keras keras_nlp matplotlib numpy pandas scikit-learn

## 1 Tokenization, vocab, sequence formatting

### 1.1 Byte-level vs subword tokenization

**Why**: Transformers cannot process raw text; it must first be converted into numbers. The way we break text into tokens affects efficiency, generalization, and memory usage.

#### 1.1.1 Byte level tokenization

- Works at the byte level (values 0-255)
- Real-world usage: GPT-2 uses byte pair encoding (BPE) at the byte level
- Pros:
  - Handles any character, language, emoji, or symbol
  - No OOV (out-of-vocabulary) tokens
- Cons:
  - Token sequences can be longer, which means more compute
- Example: "hello 👋" -> [104, 101, 108, 108, 111, 32, 240, 159, 145, 139] (the emoji alone takes 4 UTF-8 bytes)
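
The byte IDs in the example above can be checked with plain Python by encoding the string as UTF-8. This is a minimal sketch of the byte-level view only; it does not apply any BPE merges on top of the bytes.

```python
text = "hello 👋"

# Each UTF-8 byte is an integer in 0-255, so it can serve directly as a token ID.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [104, 101, 108, 108, 111, 32, 240, 159, 145, 139]

# Decoding the bytes restores the original text, so no input is ever out-of-vocabulary.
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)  # True
```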

#### 1.1.2 Subword tokenization (BPE, WordPiece, SentencePiece)

- Breaks text into frequent subwords instead of characters or whole words.
- Example:
  - "unhappiness" -> ["un", "happi", "ness"]
- Pros:
  - Shorter sequences than byte-level tokenization
  - Can handle rare words via subword decomposition (breaking unknown words into known smaller parts)
- Cons:
  - Some complexity in building the vocab and handling edge cases
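
The decomposition above can be sketched with a greedy longest-match lookup, which is roughly how WordPiece-style tokenizers split a word once a vocabulary exists. The toy vocabulary and the `greedy_subword_split` helper are hypothetical, chosen only to reproduce the "unhappiness" example; real tokenizers use learned vocabularies and continuation markers.

```python
def greedy_subword_split(word, vocab, unk="<UNK>"):
    """Split `word` by repeatedly taking the longest vocabulary entry that matches."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches at this position.
            return [unk]
    return pieces

toy_vocab = {"un", "happi", "ness", "happy"}  # hypothetical vocabulary
print(greedy_subword_split("unhappiness", toy_vocab))  # ['un', 'happi', 'ness']
print(greedy_subword_split("xyz", toy_vocab))          # ['<UNK>']
```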
  
**NOTE:** LLMs often use BPE at the subword level. Starting from single characters (or bytes), it iteratively merges the most frequent adjacent pair of symbols to build a vocabulary, striking a balance between character-level and word-level tokenization.
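
The merge loop described in the note can be sketched in a few lines. This is a toy illustration under simplifying assumptions (a tiny word list, no word-boundary markers, ties broken by first occurrence), not the training procedure of any particular library; `learn_bpe_merges` is a hypothetical helper.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE training)."""
    corpus = [list(w) for w in words]  # start with single characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = []
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges, corpus

merges, segmented = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)     # [('l', 'o'), ('lo', 'w')]
print(segmented)  # [['low'], ['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

Each learned merge adds one entry to the vocabulary, so the number of merges is what controls the final vocabulary size.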

### 1.2 Token IDs and Vocabulary Size
- After tokenization, each token is mapped to an integer ID using a vocabulary
- Vocabulary size (V) matters because it trades model size against sequence length:
  - Larger V -> model must have a bigger embedding matrix (page 50 in written notes) -> more parameters (hence a larger model)
  - Smaller V -> more subword splitting (words broken into more pieces) -> longer sequences -> slower training (but smaller model size)
- Typical LLM vocab sizes: 30K-100K for English models
- Example: In TensorFlow, keras_nlp.tokenizers handles both mapping tokens → IDs and IDs → tokens.

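``` py
from keras_nlp.tokenizers import BytePairTokenizer

# BytePairTokenizer needs a token -> ID dict plus an ordered list of merge rules.
# Toy values for illustration; "Ġ" is how byte-level BPE writes a leading space.
vocab = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2, "hello": 3, "Ġworld": 4, "un": 5, "happi": 6, "ness": 7}
merges = ["h e", "l l", "he ll", "hell o", "Ġ w", "o r", "l d", "Ġw or", "Ġwor ld",
          "u n", "h a", "p p", "ha pp", "happ i", "n e", "s s", "ne ss"]
tokenizer = BytePairTokenizer(vocabulary=vocab, merges=merges)

token_ids = tokenizer(["hello world", "unhappiness"])  # text -> token IDs (ragged tensor)
print(token_ids)
print(tokenizer.detokenize(token_ids))  # token IDs -> text

```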

**How Keras NLP Tokenizers Handle Token ↔ ID Mapping** Under the hood, Keras NLP tokenizers keep a vocabulary that supports lookups in both directions, exposed through `token_to_id` and `id_to_token`. Both `tokenizer(text)` and `tokenizer.tokenize(text)` return token IDs by default, and `tokenizer.detokenize(ids)` converts IDs back to text. The vocabulary is built when the tokenizer is trained or loaded from a pre-trained preset, and special tokens (PAD, UNK, BOS, EOS) are assigned fixed IDs, often (but not always) at the beginning of the vocabulary.
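
To make the vocabulary-size trade-off in this section concrete, the sketch below counts the weights in a token-embedding matrix of shape (V, d_model). The value `d_model = 768` is an assumption for illustration (it matches the hidden size of GPT-2 small, whose vocabulary has 50,257 entries); `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in a token-embedding matrix of shape (vocab_size, d_model)."""
    return vocab_size * d_model

d_model = 768  # assumed hidden size, for illustration only
for vocab_size in (30_000, 50_257, 100_000):
    params = embedding_params(vocab_size, d_model)
    print(f"V={vocab_size:>7,}  embedding params={params:>11,}")
```

The count grows linearly with V, which is why a larger vocabulary means a larger model even when every other layer stays the same.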


### 1.3 Padding and Masking 
1. Padding: Short sequences are extended with PAD tokens to match the longest sequence in a batch, enabling efficient parallel processing (e.g., `[5, 10, 15]` → `[5, 10, 15, <PAD>, <PAD>]`)

2. Attention Masking: Tells the transformer which positions to ignore during attention.
- **No Mask (Bidirectional)**: All tokens attend to all tokens; used in BERT for full context understanding
- **Causal Mask (Autoregressive)**: Each token only attends to previous tokens; used in GPT to prevent future information leakage during training
- **Padding Mask**: Masks PAD tokens so they don't affect attention scores; combined with other masks in most models

``` py
import tensorflow as tf

# Example: batch of token IDs
batch = tf.ragged.constant([
    [1, 2, 3],
    [4, 5]
])
padded = batch.to_tensor(default_value=0)  # [[1, 2, 3], [4, 5, 0]]  <- 0 is the PAD token ID, appended so both rows have the same length
mask = tf.cast(padded != 0, tf.int32)      # [[1, 1, 1], [1, 1, 0]]  <- 1 = real token, 0 = padding; tells attention to ignore the last position of sequence 2 (the mask flags positions, it is not the attention scores)
```
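
The block above builds only the padding mask. For autoregressive training it is usually combined with a causal mask, and a minimal sketch of that combination follows. It is written in plain Python so the logic is visible; the `combined_attention_mask` helper and `pad_id=0` are assumptions for illustration, and a real model would build the same mask as a tensor.

```python
def combined_attention_mask(token_ids, pad_id=0):
    """Return a T x T mask where mask[q][k] == 1 means query position q may attend to key k."""
    seq_len = len(token_ids)
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal_ok = k <= q                    # no looking at future positions
            not_padding = token_ids[k] != pad_id  # never attend to PAD keys
            row.append(int(causal_ok and not_padding))
        mask.append(row)
    return mask

for row in combined_attention_mask([4, 5, 0]):  # the second, padded sequence from the batch above
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 0]
```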

### 1.4 Special Tokens (see ML notes page 121)
- `<BOS>`: Beginning of sequence (marks where a sequence starts)
- `<EOS>`: End of sequence (marks where a sequence ends)
- `<PAD>`: Padding (fills shorter sequences to match batch length)
- `<UNK>`: Unknown / out-of-vocabulary token (represents tokens that are not in the vocabulary)
- etc

**Usage in training**
``` text
Input:  <BOS> hello world <EOS> <PAD> <PAD>
Target: hello world <EOS> <PAD> <PAD> <PAD>
Mask:   1     1     1     0     0     0
```
- Input: `<BOS>` is fed as a conditioning token, a special input that gives the model its initial context. The model conditions its next-token predictions on it but is not trained to predict it. `<EOS>` is included so the model learns to predict where a sequence ends. PADs fill to a uniform length.
- Target: the input shifted left by one, so the model predicts the next token at each step, including `<EOS>`. PADs fill to a uniform length.
- Mask: one entry per target position. 1 marks positions where loss is computed (real tokens and `<EOS>`), 0 marks PADs.
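
The layout above can be reproduced with a small helper. This is a minimal sketch that assumes the example fits within `max_len`; the special-token IDs (`bos_id=1`, `eos_id=2`, `pad_id=0`), the word IDs, and the `make_training_example` name are all hypothetical.

```python
def make_training_example(token_ids, max_len, bos_id=1, eos_id=2, pad_id=0):
    """Wrap tokens in BOS/EOS, pad, then split into input, target, and loss mask."""
    seq = [bos_id] + list(token_ids) + [eos_id]
    seq = seq + [pad_id] * (max_len + 1 - len(seq))  # one extra slot for the shift
    inputs, targets = seq[:-1], seq[1:]              # target = input shifted left by one
    loss_mask = [int(t != pad_id) for t in targets]  # score real tokens and EOS only
    return inputs, targets, loss_mask

# Hypothetical IDs: hello=10, world=11
inputs, targets, loss_mask = make_training_example([10, 11], max_len=6)
print(inputs)     # [1, 10, 11, 2, 0, 0]
print(targets)    # [10, 11, 2, 0, 0, 0]
print(loss_mask)  # [1, 1, 1, 0, 0, 0]
```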

### 1.5 Sequence Packing and Contiguous Streams 
- Why: LLM training is compute-heavy. To use memory efficiently, multiple short examples can be concatenated into a single long sequence and then chunked
- Benefits:
  - Reduces wasted padding
  - Keeps sequences dense for attention
  
**Example (pseudo)**
```text
Examples: ["hello", "world"], ["goodbye", "moon"]
Packed sequence: "hello world goodbye moon"
```
- Then split into fixed-length chunks (e.g., 8 tokens per chunk) for processing
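
Packing followed by fixed-length chunking can be sketched as below. Appending an EOS ID after each example is an assumption of this sketch (a common way to keep example boundaries visible in the stream), as are the token IDs and the `pack_and_chunk` name; the leftover tail is simply dropped here.

```python
def pack_and_chunk(examples, chunk_size, eos_id=2):
    """Concatenate tokenized examples into one stream, then cut fixed-length chunks."""
    stream = []
    for ids in examples:
        stream.extend(ids)
        stream.append(eos_id)  # assumed separator between packed examples
    # Keep only full chunks; the leftover tail is dropped in this sketch.
    n_full = len(stream) // chunk_size
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]

# Hypothetical token IDs for ["hello", "world"] and ["goodbye", "moon"]
chunks = pack_and_chunk([[10, 11], [12, 13]], chunk_size=3)
print(chunks)  # [[10, 11, 2], [12, 13, 2]]
```

Every chunk is completely filled with real tokens, so no positions are spent on padding.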

### 1.6 Sliding Window Chunking 
- When text is too long to fit in memory, we create overlapping windows to preserve context
- Why: prevents cutting off dependencies at chunk boundaries.
- Example: sequence length = 6, chunk size = 4, stride = 2
``` text
Sequence: [A B C D E F] (len = 6)

Chunk 1: start at position 0 → [A B C D] (len = 4 because chunk size = 4)
Chunk 2: start at position 0 + 2 (stride) = 2 → [C D E F] (the window moves by 2 and keeps the last 2 tokens of the previous chunk)

Chunks:  [A B C D], [C D E F]
Overlap: C D (the last 2 tokens of chunk 1 are the first 2 tokens of chunk 2)
```
- Edge case: If the final window doesn't have enough tokens (e.g., only 3 tokens left for chunk_size=4), you either pad it with `<PAD>` tokens or discard it depending on your training strategy.
- Overlapping ensures context continuity across chunks during training
- Common in LLM pretraining
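
The windowing rule and the pad-or-discard edge case can be sketched in plain Python. The `sliding_windows` helper and its `pad_token` parameter are hypothetical: passing a pad token pads a short final chunk, and leaving it as `None` discards that chunk.

```python
def sliding_windows(tokens, chunk_size, stride, pad_token=None):
    """Return overlapping chunks; a short final chunk is padded if pad_token is set, else dropped."""
    chunks = []
    start = 0
    while start < len(tokens):
        chunk = list(tokens[start:start + chunk_size])
        if len(chunk) == chunk_size:
            chunks.append(chunk)
        elif pad_token is not None:
            chunks.append(chunk + [pad_token] * (chunk_size - len(chunk)))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += stride
    return chunks

print(sliding_windows(list("ABCDEF"), chunk_size=4, stride=2))
# [['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F']]
print(sliding_windows(list("ABCDEFG"), chunk_size=4, stride=2, pad_token="<PAD>"))
# [['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F'], ['E', 'F', 'G', '<PAD>']]
```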
  

### 1.7 Complete Example: Combining All Tokenization Steps

This example demonstrates the entire pipeline from raw text to training-ready sequences, incorporating all concepts from 1.1-1.6.

In [None]:
import tensorflow as tf
import keras_nlp

# Load pretrained tokenizer (handles vocab, special tokens, BPE subword)
tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")

# Create dataset pipeline: tokenization → chunking → padding
dataset = (
    tf.data.Dataset.from_tensor_slices(["Hello world", "Goodbye moon"])
    .map(tokenizer)  # Tokenize: text → IDs
    .unbatch()  # Flatten to token stream (packing)
    .batch(8, drop_remainder=False)  # Chunk into sequences of 8 tokens
    .map(lambda x: (x[:-1], x[1:]))  # Create (input, target) pairs
    .padded_batch(2, padded_shapes=([None], [None]))  # Pad and batch
)

# Usage:
inputs, targets = next(iter(dataset))
print(tokenizer.detokenize(inputs[0]))
print(tokenizer.detokenize(targets[0]))

""" 
OUTPUT:
Hello worldGoodbye
worldGoodbye moon

EXPLANATION:
------------
The dataset variable is a tf.data.Dataset pipeline that transforms raw text into 
training-ready (input, target) pairs. It's a lazy iterator (doesn't process until called).

Pipeline steps:
1. from_tensor_slices: Creates dataset from list of strings
2. map(tokenizer): Converts each text → token IDs (e.g., "Hello world" → [15496, 995])
3. unbatch(): Flattens all sequences into one continuous token stream (sequence packing)
4. batch(8): Groups tokens into chunks of 8 (creates fixed-length sequences)
5. map(lambda): Splits each chunk into (input, target) where target = input shifted left
6. padded_batch(2): Groups 2 sequences into a batch, pads shorter ones to match length

Usage output - How next-token prediction works:
The model predicts the NEXT token at EACH position, not just the last one
(simplified here by treating "Goodbye" as a single token; the real tokenizer may split it into several):
- Position 0: Given "Hello" → predict "world"
- Position 1: Given "Hello world" → predict "Goodbye"
- Position 2: Given "Hello worldGoodbye" → predict "moon"

So the target sequence shows what should be predicted at each step.
The entire target = input shifted left by 1 token (each target is the next token)

Chunking Strategy Comparison:
-----------------------------
Token stream: [A, B, C, D, E, F, G, H, I, J]

Fixed batch (current): .batch(4)
  Chunk 1: [A, B, C, D]          (tokens 0-3)
  Chunk 2: [E, F, G, H]          (tokens 4-7)
  Chunk 3: [I, J]                (tokens 8-9)
  → No overlap, each token appears once

Sliding window: .window(size=4, shift=2, drop_remainder=True)
  Chunk 1: [A, B, C, D]          (tokens 0-3)
  Chunk 2: [C, D, E, F]          (tokens 2-5, overlaps last 2 from chunk 1)
  Chunk 3: [E, F, G, H]          (tokens 4-7, overlaps last 2 from chunk 2)
  Chunk 4: [G, H, I, J]          (tokens 6-9, overlaps last 2 from chunk 3)
  → Overlap preserves context across chunks, useful for long documents

IMPORTANT: Both strategies train on next-token prediction at EVERY position!
---------------------------------------------------------------------------
Fixed batch:
  • Chunk 1: [A,B,C,D] → trains: (A→B), (A,B→C), (A,B,C→D)
  • Chunk 2: [E,F,G,H] → trains: (E→F), (E,F→G), (E,F,G→H)
  • Each token appears ONCE

Sliding window:
  • Chunk 1: [A,B,C,D] → trains: (A→B), (A,B→C), (A,B,C→D)
  • Chunk 2: [C,D,E,F] → trains: (C→D), (C,D→E), (C,D,E→F)
  • Tokens C and D appear TWICE (in chunk 2 they give E and F the context they would lack at the start of a fresh chunk)

The chunking method only affects which tokens are grouped together, not how training works.
Sliding window repeats the overlapping tokens so that tokens near a chunk start still see
preceding context, at the cost of extra compute on the repeated tokens
"""