# Reinforcement Learning from Human Feedback (RLHF)

**RLHF** stands for **Reinforcement Learning from Human Feedback**, a powerful method used to align AI models with human values and preferences.

It is a key component in training models like **ChatGPT**, **Claude**, and **Gemini**, where the goal is for responses to be not only accurate but also *helpful, harmless, and honest*.

## 🎯 Learning Objectives

By the end of this notebook, you will:

- Understand what RLHF is and why it matters
- Learn the three main stages of RLHF training
- Explore how human feedback improves model behavior
- See a simple simulation of preference-based fine-tuning

## 🧩 1. What is RLHF?

RLHF bridges the gap between raw model pre-training and real-world helpfulness.

It involves **human evaluators** rating model outputs and using those ratings to train a *reward model*, which guides further reinforcement learning fine-tuning.

### The Three Stages of RLHF:

1. **Supervised Fine-Tuning (SFT):**
   - Human-labeled examples are used to fine-tune the pre-trained language model.
   - The model learns to produce more coherent and contextually relevant outputs.

2. **Reward Model Training:**
   - Multiple model responses are generated for the same prompt.
   - Humans rank the responses from best to worst.
   - A *reward model* is trained to predict these human preferences.

3. **Reinforcement Learning (PPO Optimization):**
   - The SFT model is fine-tuned again using reinforcement learning (typically **Proximal Policy Optimization, PPO**).
   - The reward model scores each generated response, and the policy is updated to favor responses that humans would prefer.
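
Stage 2 can be made concrete with the pairwise loss commonly used to train reward models. The sketch below assumes a Bradley-Terry formulation, in which the probability that the preferred response wins is the sigmoid of the score difference. The function name and the scores are hypothetical, chosen only for illustration.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes a Bradley-Terry model: loss = -log(sigmoid(score_chosen - score_rejected)).
    """
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward-model scores for two responses to the same prompt
loss_agree = pairwise_preference_loss(2.0, -1.0)     # model agrees with the human ranking
loss_disagree = pairwise_preference_loss(-1.0, 2.0)  # model disagrees with the human ranking
loss_tie = pairwise_preference_loss(0.0, 0.0)        # model cannot tell the responses apart

print(loss_agree, loss_tie, loss_disagree)
```

The loss is small when the reward model scores the human-preferred response higher and grows as it disagrees, which is the signal that pushes the reward model toward the human rankings.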

## ⚙️ 2. Conceptual Overview of the RLHF Pipeline

```text
          ┌─────────────────────────────────────┐
          │        Pre-trained Language Model    │
          └─────────────────────────────────────┘
                           │
                           ▼
          ┌─────────────────────────────────────┐
          │  Supervised Fine-Tuning (SFT)       │
          │  → Trained on human-written answers │
          └─────────────────────────────────────┘
                           │
                           ▼
          ┌─────────────────────────────────────┐
          │   Reward Model Training             │
          │   → Learns from ranked responses    │
          └─────────────────────────────────────┘
                           │
                           ▼
          ┌─────────────────────────────────────┐
          │ Reinforcement Learning (PPO)        │
          │ → Fine-tunes with reward feedback   │
          └─────────────────────────────────────┘
```

```python
# 🧠 Example: Simulating preference-based training
import numpy as np

prompts = ["Explain quantum computing in simple terms.", "Why is AI alignment important?"]
responses = [
    ["Quantum computing uses qubits that can be 0 and 1 at the same time.", "Quantum computers are very fast because they use magic."],
    ["AI alignment ensures models act according to human goals.", "AI alignment is about making robots friendly."]
]

# Human feedback: index of the preferred response for each prompt (0 in both cases)
human_prefs = [0, 0]

# Build one reward vector per prompt: 1.0 for the preferred response, 0.0 for the rest
reward_scores = []
for candidates, pref in zip(responses, human_prefs):
    r = np.zeros(len(candidates))
    r[pref] = 1.0
    reward_scores.append(r)

reward_scores
```
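
The example above has only two responses per prompt. When annotators rank more than two responses, one common approach is to break the ranking into pairwise comparisons before training the reward model. A minimal sketch of that step follows, with a hypothetical helper name and placeholder response labels.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (chosen, rejected) pairs.

    combinations() preserves input order, so the first element of each pair
    is always ranked higher than the second.
    """
    return list(combinations(ranked_responses, 2))

# Placeholder labels standing in for three ranked responses (best first)
ranking = ["response_a", "response_b", "response_c"]
pairs = ranking_to_pairs(ranking)

for chosen, rejected in pairs:
    print(f"{chosen} preferred over {rejected}")
```

A ranking of K responses yields K(K-1)/2 pairs, so a single ranking task gives the reward model several training comparisons.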

## 🧮 3. PPO – Reinforcement Learning Fine-Tuning

In the last step, **PPO (Proximal Policy Optimization)** adjusts the model’s weights to maximize the *expected reward* predicted by the reward model.

PPO balances *exploration* (trying new outputs) against *stability*: its clipped objective limits how far a single update can move the policy away from the one that generated the samples.
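
In RLHF setups, stability is often reinforced by subtracting a KL-style penalty from the reward so that the policy does not drift too far from a reference model (usually the SFT model). The sketch below is one simplified way to write that shaped reward. The function name, the `beta` value, and the log-probabilities are assumptions for illustration, not values from any particular system.

```python
def kl_shaped_reward(rm_score, policy_logprob, reference_logprob, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model.

    The log-ratio of the sampled response under the two models serves as a
    per-sample estimate of the KL divergence between them.
    """
    log_ratio = policy_logprob - reference_logprob
    return rm_score - beta * log_ratio

# Hypothetical log-probabilities of one sampled response
unchanged = kl_shaped_reward(1.0, policy_logprob=-12.0, reference_logprob=-12.0)
drifted = kl_shaped_reward(1.0, policy_logprob=-8.0, reference_logprob=-12.0)

print(unchanged, drifted)
```

With the same reward-model score, the response that the policy has made much more likely than the reference model earns a lower shaped reward, which discourages large departures from the SFT behavior.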

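```python
# Simplified PPO clipped objective (conceptual)
import torch

# Log-probabilities of the sampled responses under the old and the updated policy.
# Log-probabilities are never positive, so the toy values are negative.
old_log_probs = torch.tensor([-1.6, -1.4])
new_log_probs = torch.tensor([-1.5, -1.37])
advantages = torch.tensor([1.0, 0.5])
clip_eps = 0.2  # the ratio is clipped to [1 - clip_eps, 1 + clip_eps]

# Probability ratio between the updated and the old policy
ratio = torch.exp(new_log_probs - old_log_probs)
clipped_ratio = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)

# Take the more pessimistic of the two surrogates; negate because optimizers minimize
ppo_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
ppo_loss.item()
```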
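
To see what the clipping does, it can help to evaluate the per-sample objective over a range of probability ratios. The following plain-Python sketch uses the same 0.8 to 1.2 clipping range as the cell above; the helper name and the chosen ratios are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: the smaller of the raw and the clipped surrogate."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

ratios = [0.5, 0.9, 1.0, 1.1, 1.5, 2.0]

# Positive advantage: the objective stops growing once the ratio passes 1 + epsilon
positive = [clipped_surrogate(r, advantage=1.0) for r in ratios]

# Negative advantage: the objective stops improving once the ratio drops below 1 - epsilon
negative = [clipped_surrogate(r, advantage=-1.0) for r in ratios]

for r, p, n in zip(ratios, positive, negative):
    print(f"ratio={r:.1f}  adv=+1 -> {p:.2f}  adv=-1 -> {n:.2f}")
```

For a good response (positive advantage), raising its probability beyond the clipping range brings no extra gain, so the optimizer has no incentive to make a very large update in one step.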

## 💡 4. Benefits of RLHF

✅ **Improves alignment** with human goals and values  
✅ **Reduces toxic or unsafe outputs**  
✅ **Enhances usefulness and tone** of responses  
✅ **Adapts to evolving user expectations**

## ⚠️ 5. Challenges and Limitations

- **Expensive and time-consuming:** Human feedback collection is costly.
- **Bias amplification:** Feedback data may reflect annotator bias.
- **Reward hacking:** The model may exploit flaws in the reward model to score highly without actually becoming more helpful.
- **Scalability issues:** Requires massive compute and annotation pipelines.

## 🧭 6. Future Directions

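- **RLAIF (Reinforcement Learning from AI Feedback):** Feedback generated by an AI model replaces some or all human labels, reducing reliance on human annotators.
- **Constitutional AI:** The model critiques and revises its outputs against a set of written principles, and AI-generated preference labels based on those principles replace much of the manual harmlessness feedback.
- **Direct Preference Optimization (DPO):** A simpler alternative to PPO that trains the policy directly on preference pairs, without fitting a separate reward model or running a reinforcement learning loop.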
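
As an illustration of the last item, the DPO loss for a single preference pair can be sketched in a few lines. It compares how much the policy has raised the log-probability of the chosen response, relative to a frozen reference model, against the same quantity for the rejected response. The function name, the `beta` value, and the log-probabilities below are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Hypothetical sequence log-probabilities
loss_at_start = dpo_loss(-10.0, -11.0, -10.0, -11.0)  # policy identical to the reference
loss_improved = dpo_loss(-8.0, -13.0, -10.0, -11.0)   # policy now favors the chosen response

print(loss_at_start, loss_improved)
```

The loss starts at log 2 when the policy equals the reference model and decreases as the policy shifts probability toward the preferred response, with no separate reward model involved.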

## 🚀 7. Key Takeaways

- RLHF aligns models with human expectations using reward-based learning.
- It combines supervised learning, preference modeling, and reinforcement learning.
- It has been used in training models such as GPT-4, Claude, and Gemini.
- The field is evolving toward AI-assisted feedback and automated preference optimization.

## 📘 References
- OpenAI: [Training Language Models to Follow Instructions with Human Feedback (2022)](https://arxiv.org/abs/2203.02155)
- Anthropic: [Constitutional AI: Harmlessness from AI Feedback](https://www.anthropic.com/news/constitutional-ai)
- DeepMind: [Reward Modeling and Alignment Research](https://deepmind.google/discover/blog/)