A hands-on journey through LLM post-training techniques, implementing everything from basic Supervised Fine-Tuning (SFT) to full RLHF with Reward Models and PPO.
Base Model: GPT-2 (124M parameters) - small enough to run on CPU, perfect for learning!
We implement a complete 7-notebook learning path that progressively builds understanding of LLM post-training:
┌───────────────────────────────────────────────────────────────────────┐
│ PHASE 1: Basic SFT Pipeline (Notebooks 01-04)                         │
│ Learn the fundamentals of fine-tuning and where they fail             │
├───────────────────────────────────────────────────────────────────────┤
│ PHASE 2: Analysis & InstructGPT-Style Training (Notebooks 05-06)      │
│ Understand WHY basic SFT fails and implement proper fixes             │
├───────────────────────────────────────────────────────────────────────┤
│ PHASE 3: Full RLHF Pipeline (Notebook 07)                             │
│ Reward Model training + PPO fine-tuning                               │
├───────────────────────────────────────────────────────────────────────┤
│ PHASE 4: Reasoning Distillation (DeepSeek-R1 Style)                   │
│ Distill reasoning capabilities using Chain-of-Thought data            │
└───────────────────────────────────────────────────────────────────────┘
Goal: Teach GPT-2 to complete tasks using basic supervised fine-tuning.
| Metric | Value |
|---|---|
| Dataset | 22 samples (tiny for demo) |
| Training Time | ~11 minutes |
| Final Loss | 3.18 |
| Trainable Params | 124M (100%) |
Key Findings:
- ✅ Loss decreases smoothly
- ❌ Model overfits to specific phrasings
- ❌ No generalization to paraphrased questions
Goal: Improve robustness using multiple instruction templates.
| Metric | Value |
|---|---|
| Dataset | 200 Alpaca samples |
| Training Time | ~26 minutes |
| Final Loss | 2.34 |
| Templates Used | 5 different formats |
Technique: Template Randomization
TEMPLATES = [
    "### Instruction:\n{instruction}\n\n### Response:",
    "Instruction: {instruction}\n\nResponse:",
    "<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant",
    # The run used 5 formats; only these three are listed here
]

Key Findings:
- ✅ Lower loss than Stage 1
- ❌ Outputs became repetitive and degenerate
- ❌ Catastrophic forgetting of base knowledge
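As a rough illustration of the template randomization described above, the sketch below picks one of the listed templates at random for each training example. The helper name `format_example`, the fixed seed, and the choice to join prompt and response with a newline are assumptions made for the example, not details taken from the notebook.

```python
import random

# The three templates listed above; any further formats used in the run are not shown.
TEMPLATES = [
    "### Instruction:\n{instruction}\n\n### Response:",
    "Instruction: {instruction}\n\nResponse:",
    "<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant",
]


def format_example(instruction, response, rng):
    """Pick a random template and append the response (hypothetical helper)."""
    template = rng.choice(TEMPLATES)
    prompt = template.format(instruction=instruction)
    # Joining with a newline is an assumption about how prompt and response are combined.
    return prompt + "\n" + response


rng = random.Random(0)  # fixed seed so the sampled templates are reproducible
samples = [format_example("Name a prime number.", "7", rng) for _ in range(6)]
```

Because each example may arrive in a different wrapper, the model has to fit several surface formats at once, which is the setup the later analysis blames for conflicting gradients.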
Goal: Memory-efficient training with Low-Rank Adaptation.
| Metric | Value |
|---|---|
| Dataset | 200 Alpaca samples |
| Training Time | ~86 minutes |
| Final Loss | 2.75 |
| Trainable Params | 1.6M (1.29%) |
| LoRA Rank | 16 |
LoRA Configuration:
lora_r = 16  # Rank
lora_alpha = 32  # Alpha (2x rank)
target_modules = ["c_attn", "c_proj"]  # GPT-2 attention projections; "c_proj" also matches the MLP output projection

Key Findings:
- ✅ 98.7% fewer trainable parameters
- ✅ Base knowledge better preserved
- ❌ Still showed repetitive outputs
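The 1.6M figure can be checked by hand. The sketch below assumes the standard GPT-2 small dimensions (12 blocks, hidden size 768) and assumes that the `"c_proj"` target matches both the attention and the MLP output projections in every block; under those assumptions the count lands on the reported 1.29%.

```python
# GPT-2 small dimensions (assumed from the standard 124M configuration).
N_LAYERS = 12
HIDDEN = 768
BASE_PARAMS = 124_439_808  # parameter count of GPT-2 small with tied embeddings

# (in_features, out_features) of each module per block whose name ends in
# "c_attn" or "c_proj"; the MLP output projection is included by that match.
TARGETED_SHAPES = [
    (HIDDEN, 3 * HIDDEN),  # attn.c_attn (fused Q, K, V)
    (HIDDEN, HIDDEN),      # attn.c_proj
    (4 * HIDDEN, HIDDEN),  # mlp.c_proj
]


def lora_param_count(rank):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in TARGETED_SHAPES)
    return N_LAYERS * per_block


lora_params = lora_param_count(rank=16)
trainable_pct = 100 * lora_params / (BASE_PARAMS + lora_params)
print(f"{lora_params:,} trainable ({trainable_pct:.2f}%)")
# prints: 1,622,016 trainable (1.29%)
```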
Goal: Compare all 3 stages systematically.
| Model | Perplexity | Coherence | Follows Instructions |
|---|---|---|---|
| GPT-2 Base | Best | High | ❌ No |
| Stage 1 SFT | Medium | Medium | Partially |
| Stage 2 Instruction | Worse | Low | ❌ Degraded |
| Stage 3 LoRA | Medium | Medium | Partially |
Key Insight: All 3 stages showed catastrophic forgetting and degenerate outputs.
Goal: Mathematical analysis of what went wrong.
| Issue | Cause | Effect |
|---|---|---|
| Catastrophic Forgetting | Full fine-tuning overwrites weights | Lost general knowledge |
| Distribution Shift | Training data ≠ pretraining data | Model forgot how to write |
| Repetition Collapse | High learning rate + small data | Degenerate loops |
| Template Conflicts | 5 different templates | Gradient interference |
Loss to Perplexity Analysis (perplexity = exp(loss)):
Stage 1: Loss 3.18 → Perplexity 24.1
Stage 2: Loss 2.34 → Perplexity 10.4
Stage 3: Loss 2.75 → Perplexity 15.6
Lower loss ≠ better model! Stage 2 had the lowest loss but the worst outputs.
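The conversion used above is just the exponential of the mean per-token loss. A minimal sketch, using the rounded losses from this README as inputs (so the results agree with the listed perplexities only to within rounding):

```python
import math

# Final training losses reported for the three stages (rounded to two decimals).
final_losses = {"Stage 1": 3.18, "Stage 2": 2.34, "Stage 3": 2.75}


def perplexity(mean_nll):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


for stage, loss in final_losses.items():
    print(f"{stage}: loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

Note that these are training-set numbers: a low value says the model fits the fine-tuning data, not that its free-running generations are coherent.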
Goal: Implement proper SFT based on the InstructGPT paper (Ouyang et al., 2022).
| Paper Technique | Our Implementation |
|---|---|
| 13,000 demonstrations | 3,000 filtered samples |
| GPT-3 (175B) | GPT-2 (124M) |
| 16 epochs | 2 epochs |
| Pretraining mix | KL regularization |
| Cosine LR schedule | LR = 2e-5 with cosine decay |
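To make the learning-rate row concrete, here is a dependency-free sketch of a warmup-plus-cosine schedule peaking at 2e-5. The warmup ratio of 0.1 is taken from the best-practice settings later in this README; the linear warmup shape, the decay all the way to zero, and the function name `lr_at_step` are assumptions for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate for every step of a 100-step run.
schedule = [lr_at_step(s, total_steps=100) for s in range(100)]
```

In practice a library scheduler would normally be used instead; the point here is only the shape: a short ramp up, then a smooth decay.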
Key Improvements:
# Single consistent template (no conflicts)
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"
# KL Regularization to prevent forgetting
kl_loss = kl_divergence(current_logits, original_logits)
total_loss = sft_loss + β * kl_loss  # β = 0.1

| Metric | Previous Stages | InstructGPT SFT |
|---|---|---|
| Template Count | 5 | 1 |
| Learning Rate | 1e-4 to 1e-5 | 2e-5 (cosine) |
| KL Regularization | ❌ None | ✅ β=0.1 |
| Data Quality | Random | Filtered |
| Repetition | High | Low |
Results:
- ✅ Non-repetition score: 0.79 (best)
- ✅ Coherent outputs
- ✅ Base knowledge preserved
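The KL-regularized objective from the snippet above (`total_loss = sft_loss + β * kl_loss`) can be sketched without any framework for a single token position. The README does not say which direction of the KL divergence was used; the sketch assumes KL(original || current), and the toy logits are invented.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(current_logits, original_logits):
    """KL(original || current) for one position; the direction is an assumption."""
    log_p = log_softmax(original_logits)
    log_q = log_softmax(current_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def regularized_loss(sft_loss, current_logits, original_logits, beta=0.1):
    """total_loss = sft_loss + beta * kl_loss."""
    return sft_loss + beta * kl_divergence(current_logits, original_logits)


frozen = [2.0, 0.5, -1.0]   # toy logits from the frozen base model
drifted = [0.1, 3.0, -1.0]  # toy logits from the fine-tuned model
unchanged_total = regularized_loss(2.0, frozen, frozen)
drifted_total = regularized_loss(2.0, drifted, frozen)
```

When the fine-tuned model still matches the base model the penalty is zero, and it grows as the two distributions diverge, which is how the term discourages forgetting.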
Goal: Complete the InstructGPT 3-step pipeline with RM and PPO.
| Metric | Value |
|---|---|
| Dataset | Anthropic HH-RLHF (5,000 pairs) |
| Architecture | GPT2RewardModel (custom) |
| Training | Binary ranking loss |
| Accuracy | ~65% on held-out data |
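The "binary ranking loss" in the table is not spelled out in this README. One common form, assumed here, is the pairwise loss used in InstructGPT: the negative log-sigmoid of the reward margin between the chosen and rejected response. The reward values below are made up.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written stably as softplus(-margin)."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) without overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) reward pairs ranked in the right order."""
    return sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)


toy_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.2)]  # made-up rewards
mean_loss = sum(pairwise_ranking_loss(c, r) for c, r in toy_pairs) / len(toy_pairs)
accuracy = pairwise_accuracy(toy_pairs)
```

The accuracy metric in the table would correspond to `pairwise_accuracy` evaluated on held-out preference pairs, where 50% is chance level.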
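Reward Model Architecture:
import torch.nn as nn
from transformers import AutoModel

class GPT2RewardModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()  # required before registering submodules
        self.gpt2 = AutoModel.from_pretrained(model_name)
        # Scalar reward head on top of the transformer's hidden states
        self.reward_head = nn.Linear(self.gpt2.config.hidden_size, 1)

    def forward(self, input_ids):
        outputs = self.gpt2(input_ids)
        # Reads the last position; with right-padded batches, index the last non-padding token instead
        reward = self.reward_head(outputs.last_hidden_state[:, -1])
        return reward

| Metric | Value |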
|---|---|
| Training Steps | 150 |
| Prompts Used | 300 |
| KL Coefficient | 0.5 (high for stability) |
| Learning Rate | 5e-6 (low to prevent drift) |
| Training Time | ~18 minutes |
PPO Optimization Journey:
| Version | Steps | KL Coef | Result |
|---|---|---|---|
| v1 | 30 | 0.2 | 25% metrics (too much drift) |
| v2 | 100 | 0.2 | Still poor |
| v3 (Final) | 150 | 0.5 | 80% metrics ✅ |
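To see why the KL coefficient in the table matters, consider the usual RLHF shaped reward: the reward-model score minus the coefficient times an estimate of how far the policy has moved from the SFT reference. This is a simplified sketch with invented numbers, not the exact computation inside the PPO trainer used in the notebook.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef):
    """Reward-model score minus a KL penalty summed over the response tokens."""
    # Per-token log-ratio: a simple sample-based estimate of KL(policy || reference).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - kl_coef * kl_estimate


# Toy numbers: the policy gives its own tokens more probability than the SFT model does.
policy_lp = [-0.2, -0.1, -0.3]
ref_lp = [-1.0, -0.9, -1.1]

for coef in (0.2, 0.5):
    print(coef, kl_shaped_reward(1.0, policy_lp, ref_lp, coef))
```

With these toy values the same response earns roughly 0.52 at a coefficient of 0.2 but roughly -0.2 at 0.5, so the higher setting makes drifting away from the SFT model unprofitable even when the reward model likes the output.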
Final PPO Configuration:
PPOConfig(
    learning_rate=5e-6,  # Very low
    kl_coef=0.5,  # High penalty
    num_ppo_epochs=2,
    mini_batch_size=4,
    gradient_accumulation_steps=4,
)

Goal: Transfer reasoning capabilities from large reasoning models (like DeepSeek-R1) to smaller models using pure Supervised Fine-Tuning (SFT).
This module implements the DeepSeek-R1-Distill methodology:
- Source Data: OpenThoughts-114k (reasoning traces)
- Technique: Fine-tuning on `<|im_start|>` structured reasoning chains
- Outcome: Model learns to "think" before answering
Key Components:
- Dataset: 114k+ Chain-of-Thought (CoT) examples
- Pipeline: Automated training, merging, and logic evaluation
- Metric: Comparing "Step Count" and "Reasoning Quality" against base models

👉 Go to distillation_SFT/README.md for full implementation details, benchmarks, and usage instructions.
LlmPostTraining/
├── notebooks/
│   ├── 01_stage1_normal_sft.ipynb          # Basic SFT
│   ├── 02_stage2_instruction_tuning.ipynb  # Multi-template training
│   ├── 03_stage3_lora_qlora.ipynb          # LoRA fine-tuning
│   ├── 04_evaluation_compare_stages.ipynb  # Stage comparison
│   ├── 05_analysis_and_improvements.ipynb  # Mathematical analysis
│   ├── 06_instruct_tunning_training.ipynb  # InstructGPT SFT
│   └── 07_rlhf_reward_model_ppo.ipynb      # Full RLHF pipeline
│
├── distillation_SFT/                       # Phase 4: Reasoning Distillation
│   ├── run_distillation_pipeline.py        # Main training script
│   └── ...
│
├── models/
│   └── gpt2/                               # Base GPT-2 model
│
├── outputs/
│   ├── stage1_sft/                         # Stage 1 checkpoint
│   ├── stage2_instruction/                 # Stage 2 checkpoint
│   ├── stage3_lora/                        # LoRA adapters
│   ├── improved_training/                  # InstructGPT SFT model
│   ├── rlhf_training/                      # PPO model + reward model
│   └── evaluation/                         # Results & charts
│
├── src/                                    # Utility modules
├── configs/                                # YAML configurations
└── requirements.txt
git clone https://github.com/your-username/LlmPostTraining.git
cd LlmPostTraining
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# GPT-2 will auto-download, or manually:
huggingface-cli download gpt2 --local-dir ./models/gpt2

- `01_stage1_normal_sft.ipynb` - Learn basic SFT
- `02_stage2_instruction_tuning.ipynb` - See template randomization
- `03_stage3_lora_qlora.ipynb` - Try memory-efficient training
- `04_evaluation_compare_stages.ipynb` - Compare results
- `05_analysis_and_improvements.ipynb` - Understand failures
- `06_instruct_tunning_training.ipynb` - Proper InstructGPT SFT
- `07_rlhf_reward_model_ppo.ipynb` - Full RLHF pipeline

| Technique | Why It Works |
|---|---|
| Single Template | Avoids gradient conflicts |
| KL Regularization | Prevents catastrophic forgetting |
| Low Learning Rate | Preserves base knowledge |
| High KL Penalty in PPO | Keeps model close to SFT |
| Quality over Quantity | 3K good samples > 200 random |
| Mistake | Consequence |
|---|---|
| Multiple templates | Gradient interference |
| High learning rate | Destroys pretrained weights |
| No regularization | Catastrophic forgetting |
| Too many PPO steps | Model drifts from SFT |
| Tiny datasets | Severe overfitting |
Our experiments showed that lower training loss does not guarantee better outputs:
- Stage 2 had the lowest loss (2.34) but the worst outputs
- The model learned to predict tokens correctly but forgot how to generate coherent text
- Takeaway: Always evaluate with generation quality, not just loss
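One cheap way to look at generation quality rather than loss is a repetition proxy. The sketch below computes the ratio of distinct n-grams to total n-grams, a common heuristic; it is not necessarily how the "non-repetition score" reported in this README was computed, and the whitespace tokenization is a simplifying assumption.

```python
def distinct_ngram_ratio(text, n=2):
    """Unique n-grams divided by total n-grams over whitespace tokens (1.0 = no repeats)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any repetition
    return len(set(ngrams)) / len(ngrams)


degenerate = ", , , , , , , ,"
healthy = "The capital of France is Paris , a city on the Seine ."

print(distinct_ngram_ratio(degenerate))  # 1 unique bigram out of 7
print(distinct_ngram_ratio(healthy))     # every bigram is unique
```

A model can reach a low loss and still score poorly here, which is the gap between the two kinds of evaluation.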
When fine-tuning GPT-2 on small datasets:
- The model "forgot" its pretrained knowledge
- Outputs became repetitive patterns like `", , , , , ,"`
- Solution: Use KL regularization, low learning rate, or LoRA
Using 5 different instruction templates caused gradient conflicts:
- Each template pulled the model in a different direction
- The InstructGPT paper uses a single consistent format
- Takeaway: Pick ONE template and stick with it
Our PPO journey showed:
- Default settings (KL=0.2) caused too much drift → 25% metrics
- Higher KL penalty (0.5) + lower LR (5e-6) → 80% metrics
- Takeaway: Start conservative, increase exploration gradually
The RM defines what "good" means:
- Biased RM → biased model behavior
- We used Anthropic HH-RLHF (human preferences)
- Takeaway: RM training is often more important than PPO tuning
# ✅ DO
learning_rate = 2e-5  # Low and stable
single_template = True  # Consistent format
warmup_ratio = 0.1  # Gradual warmup
cosine_schedule = True  # Smooth decay
# ❌ DON'T
learning_rate = 1e-4  # Too high
multiple_templates = True  # Causes conflicts
no_warmup = True  # Unstable start

# ✅ PPO Configuration
kl_coef = 0.5  # High for stability
learning_rate = 5e-6  # Very low
num_ppo_epochs = 2  # Don't over-train
gradient_clipping = 0.5  # Prevent explosions
# ❌ Common Mistakes
kl_coef = 0.01  # Too much freedom
learning_rate = 1e-4  # Way too high
training_steps = 1000  # Way too many

- Filter out low-quality samples
- Ensure response length variety
- Balance task types
- Remove duplicates and near-duplicates
- Verify label quality (for preference data)
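Two items from the checklist above, filtering low-quality samples and removing near-duplicates, can be sketched with the standard library alone. The normalization rule, the minimum response length, and the `instruction`/`response` field names are illustrative assumptions rather than the filtering actually used to select the 3,000 samples.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so near-identical strings match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def filter_samples(samples, min_response_words=3):
    """Keep the first occurrence of each normalized (instruction, response) pair."""
    seen = set()
    kept = []
    for sample in samples:
        if len(sample["response"].split()) < min_response_words:
            continue  # hypothetical quality rule: drop very short responses
        key = (normalize(sample["instruction"]), normalize(sample["response"]))
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(sample)
    return kept


raw = [
    {"instruction": "Define entropy.", "response": "A measure of uncertainty in a distribution."},
    {"instruction": "define entropy", "response": "A measure of uncertainty in a distribution"},
    {"instruction": "Say hi.", "response": "Hi"},
]
cleaned = filter_samples(raw)  # keeps only the first sample
```

This only catches duplicates that differ in case, punctuation, or spacing; fuzzier matching would need something like n-gram overlap or embeddings.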
| Notebook | Method | Loss | Time | Key Result |
|---|---|---|---|---|
| 01 | SFT | 3.18 | 11 min | Basic completion |
| 02 | Instruction | 2.34 | 26 min | Repetitive outputs |
| 03 | LoRA | 2.75 | 86 min | Fewer params, similar issues |
| 06 | InstructGPT SFT | Lower | 15 min | Non-repetitive ✅ |
| 07 | PPO | - | 18 min | 80% on metrics ✅ |
- DPO (Direct Preference Optimization) - Simpler than PPO, no reward model needed
- IPO (Identity Preference Optimization) - More robust, prevents overfitting
- KTO (Kahneman-Tversky Optimization) - Works with thumbs up/down data
- GRPO (Group Relative Policy Optimization) - DeepSeek's memory-efficient PPO
- Constitutional AI - Self-improvement without human labels
- Reasoning Models - Train models to "think" step-by-step
| Innovation | Source | Description |
|---|---|---|
| DPO | Rafailov et al. | Bypass reward model entirely, use preference pairs directly |
| GRPO | DeepSeek | Memory-efficient PPO variant, used in DeepSeek-R1 |
| Open-R1 | HuggingFace | Open reproduction of DeepSeek reasoning pipeline |
| Reasoning RL | DeepSeek-R1 | Pure RL to teach reasoning without SFT (the R1-Zero variant) |
| Agentic RL | | Multi-step tool use with RL optimization |
| Aspect | PPO (RLHF) | DPO |
|---|---|---|
| Complexity | High (RM + RL) | Low (single loss) |
| Stability | Tricky to tune | More stable |
| Data | Needs RM training | Just preference pairs |
| Performance | SOTA | Comparable or better |
| Recommended | Research | Production |
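To make the "single loss" entry in the DPO column concrete, here is a scalar sketch of the DPO objective from Rafailov et al. for one preference pair. The inputs are sequence log-probabilities under the trained policy and a frozen reference model; the toy values and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one pair."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Stable softplus(-logits), which equals -log(sigmoid(logits))
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Sequence log-probabilities (sums over response tokens); toy values.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to the reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)      # chosen up, rejected down
```

A policy identical to the reference gets a loss of log 2, and the loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, with no separate reward model involved.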
DPO Loss Function:
# DPO optimizes directly on preference pairs, measured relative to a frozen reference model
chosen_logratio = log_prob_chosen - ref_log_prob_chosen
rejected_logratio = log_prob_rejected - ref_log_prob_rejected
loss = -log(sigmoid(β * (chosen_logratio - rejected_logratio)))

┌─────────────────────────────────────────────────────────────────┐
│ Stage 1: Cold Start SFT                                         │
│ - Small high-quality reasoning examples                         │
│ - Teaches format and clarity                                    │
├─────────────────────────────────────────────────────────────────┤
│ Stage 2: Pure RL (GRPO)                                         │
│ - RL only, no SFT in this stage                                 │
│ - Model learns reasoning by doing                               │
│ - Verifiable rewards (math correctness)                         │
├─────────────────────────────────────────────────────────────────┤
│ Stage 3: Rejection Sampling + Refinement                        │
│ - Filter bad outputs                                            │
│ - Human preference alignment                                    │
└─────────────────────────────────────────────────────────────────┘
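The GRPO stage in the pipeline above avoids a learned value model by scoring each sampled answer relative to the other answers for the same prompt. A minimal sketch of that group-relative advantage, assuming binary verifiable rewards and a population standard deviation (implementations differ on this detail):

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt; 1.0 = verified correct, 0.0 = incorrect.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and a group where every answer gets the same reward contributes no learning signal.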
- Process Reward Models (PRMs) - Reward each reasoning step, not just final answer
- Synthetic Data Generation - Use strong models to generate training data
- Multi-Turn RL - Optimize entire conversations, not single responses
- Tool-Augmented RL - Models learn to use calculators, code interpreters
- Constitutional AI - Models critique and improve their own outputs
- InstructGPT - Ouyang et al., 2022
- DPO - Rafailov et al., 2023
- LoRA - Hu et al., 2021
- QLoRA - Dettmers et al., 2023
- DeepSeekMath (GRPO) - Shao et al., 2024
- DeepSeek-R1 - DeepSeek, 2025
- TRL Library - RLHF training
- Alignment Handbook - HuggingFace
- Open-R1 - Reasoning reproduction
- verl - Agentic RL framework
- Anthropic HH-RLHF - Preference pairs
- UltraFeedback - 66K preferences
- Orca DPO Pairs - GPT-4 vs Llama
All experiments run on CPU (Apple M-series / Intel). No GPU required!
| Notebook | Runtime | Memory |
|---|---|---|
| 01-03 | 11-86 min | ~8GB RAM |
| 06 | 15 min | ~8GB RAM |
| 07 | 18 min | ~10GB RAM |
MIT License - Use freely for learning and research!
Made with 🧠 for understanding LLM training from scratch!