# RLHF Training Quickstart Guide

This notebook demonstrates an end-to-end workflow for training and inference with reinforcement learning from human feedback (RLHF).

## Overview

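We'll cover:
1. Environment setup
2. Loading the configuration
3. Dataset preparation
4. Model loading with LoRA
5. Reward model setup
6. PPO training configuration
7. Training loop
8. Saving the model
9. Inference
10. Comparison with the base model
11. Next steps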


## 1. Environment Setup

First, install the required dependencies. The `pip` lines are commented out; uncomment them in a fresh environment.

In [None]:
# Install required packages (pyyaml provides the `yaml` module imported below)
# !pip install torch transformers accelerate peft datasets trl bitsandbytes pyyaml
# !pip install tensorboard wandb

import os

import torch
import yaml

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
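
Because later cells depend on specific library APIs, it can help to record which package versions are installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is our own and not part of any package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the libraries this notebook imports
for name, ver in installed_versions(["torch", "transformers", "peft", "trl", "datasets"]).items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

Saving this output next to a checkpoint makes it easier to reproduce a run later.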

## 2. Loading Configuration

We'll start with a basic training recipe for Gemma3 1B.

In [None]:
# Load configuration from recipe
def load_config(config_path):
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config

# For this demo, we'll use the Gemma3 1B basic recipe
config_path = '../recipes/gemma3_1b_basic.yaml'

# Load and display config
config = load_config(config_path)
print("Configuration loaded:")
print(yaml.dump(config, default_flow_style=False, indent=2)[:500] + '...')
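
When experimenting, it is often convenient to override a few recipe values without editing the YAML file. The sketch below merges a nested override dictionary into a loaded config. The helper `deep_update` and the keys `training`, `learning_rate`, and `batch_size` are hypothetical; the real recipe schema may differ.

```python
import copy


def deep_update(base, overrides):
    """Return a copy of `base` with `overrides` merged in, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical recipe fragment, used only to illustrate the merge
example_config = {"training": {"learning_rate": 1e-5, "batch_size": 4}}
demo_config = deep_update(example_config, {"training": {"learning_rate": 2e-5}})
print(demo_config)
```

The original dictionary is left untouched, so the same base recipe can be reused across several variants.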

## 3. Dataset Preparation

Load and preprocess the dataset for training.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load tokenizer
model_name = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Ensure padding token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer loaded: {model_name}")
print(f"Vocab size: {len(tokenizer)}")
print(f"Padding token: {tokenizer.pad_token}")

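# Load dataset (using a small subset for demo)
print("\nLoading dataset...")
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:100]")
print(f"Dataset loaded: {len(dataset)} samples")

# Each "chosen" record is a whole conversation. PPO needs only the prompt:
# everything up to and including the final assistant turn marker.
ASSISTANT_TAG = "\n\nAssistant:"

def build_query(example, max_prompt_tokens=256):
    text = example["chosen"]
    cut = text.rfind(ASSISTANT_TAG)
    prompt = text[: cut + len(ASSISTANT_TAG)] if cut != -1 else text
    # Keep the most recent tokens so the final assistant marker survives truncation
    example["input_ids"] = tokenizer.encode(prompt)[-max_prompt_tokens:]
    example["query"] = tokenizer.decode(example["input_ids"], skip_special_tokens=True)
    return example

# The training loop reads batch["input_ids"] (tensors) and batch["query"] (text)
dataset = dataset.map(build_query, remove_columns=["chosen", "rejected"])
dataset.set_format(type="torch")
print(f"\nExample prompt:\n{dataset[0]['query'][:200]}...")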
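
To see how the preference records are structured, it helps to separate the shared prompt from the final replies. Each `hh-rlhf` record has a `chosen` and a `rejected` conversation marked with `Human:` and `Assistant:` turns. The sketch below works on a made-up record, and the helper `split_last_turn` is our own name.

```python
ASSISTANT_TAG = "\n\nAssistant:"


def split_last_turn(conversation):
    """Return (prompt, final_reply), cutting at the last assistant marker."""
    cut = conversation.rfind(ASSISTANT_TAG)
    if cut == -1:
        raise ValueError("conversation has no assistant turn")
    end = cut + len(ASSISTANT_TAG)
    return conversation[:end], conversation[end:].strip()


# Made-up record in the same layout as the dataset
record = {
    "chosen": "\n\nHuman: What is PPO?\n\nAssistant: A policy-gradient RL algorithm.",
    "rejected": "\n\nHuman: What is PPO?\n\nAssistant: I don't know.",
}
prompt, chosen_reply = split_last_turn(record["chosen"])
_, rejected_reply = split_last_turn(record["rejected"])
print(repr(prompt))
print("chosen:  ", chosen_reply)
print("rejected:", rejected_reply)
```

PPO only needs the prompt part; the chosen and rejected replies matter when training a reward model on preference pairs.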

## 4. Model Loading

Load the base model with LoRA for efficient fine-tuning.

In [None]:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load base model
print("Loading base model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

print("\nModel with LoRA:")
model.print_trainable_parameters()

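# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()
# With a frozen base model, inputs must require grad so checkpointed layers can backpropagate
model.enable_input_require_grads()
model.config.use_cache = False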

print("\nModel loaded successfully!")
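
To get a feel for why LoRA is cheap, count its parameters: a rank-`r` adapter on a linear layer adds two small matrices, one of shape `r x in_features` and one of shape `out_features x r`. The sketch below compares that with the full weight matrix. The layer shapes are hypothetical and are not read from the Gemma config; `model.print_trainable_parameters()` reports the real totals.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by a rank-r LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Hypothetical layer shapes, for illustration only
layers = {"q_proj": (1024, 1024), "up_proj": (1024, 4096)}
r, lora_alpha = 8, 16  # same values as lora_config above

for name, (d_in, d_out) in layers.items():
    full = d_in * d_out
    lora = lora_param_count(d_in, d_out, r)
    print(f"{name}: full={full:,} lora={lora:,} ({100 * lora / full:.2f}% of full)")

# The adapter update is scaled by lora_alpha / r before being added to the frozen weight
print(f"LoRA scaling factor: {lora_alpha / r}")
```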

## 5. Reward Model Setup

Load the reward model for scoring generated responses.

In [None]:
from transformers import pipeline

# Load reward model
print("Loading reward model...")
reward_model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"

reward_pipe = pipeline(
    "text-classification",
    model=reward_model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

print(f"Reward model loaded: {reward_model_name}")

# Test reward model
test_text = "This is a helpful and informative response to the user's question."
reward_score = reward_pipe(test_text)[0]['score']
print(f"\nTest reward score: {reward_score:.4f}")

## 6. Training Configuration

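Set up the PPO trainer with small-batch hyperparameters suited to a demo. The cells below use the classic TRL `PPOTrainer` interface (`ppo_trainer.generate`, `ppo_trainer.step`), which predates the trainer rewrite in TRL 0.12, so they need an earlier TRL release.

In [None]:
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead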

# Create model with value head for PPO
print("Creating model with value head...")
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model)

# PPO configuration
ppo_config = PPOConfig(
    model_name=model_name,
    learning_rate=2e-5,
    batch_size=4,
    mini_batch_size=2,
    gradient_accumulation_steps=2,
    ppo_epochs=4,
    early_stopping=False,
    target_kl=0.1,
    kl_penalty="kl",
    seed=42,
    log_with="tensorboard",
    project_kwargs={"logging_dir": "./logs/quickstart"}
)

print("PPO configuration created:")
print(f"  Learning rate: {ppo_config.learning_rate}")
print(f"  Batch size: {ppo_config.batch_size}")
print(f"  Mini batch size: {ppo_config.mini_batch_size}")
print(f"  PPO epochs: {ppo_config.ppo_epochs}")
print(f"  Target KL: {ppo_config.target_kl}")
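
The KL settings control how far the policy may drift from the reference model. A common formulation, and roughly what `kl_penalty="kl"` selects in the classic TRL trainer, penalizes every generated token by the log-probability gap between policy and reference, then adds the reward model's score on the final token. The sketch below is a plain-Python illustration; `shaped_rewards` is our own helper and the `kl_coef` value is illustrative.

```python
def shaped_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
    """Per-token rewards: a KL penalty on every token, plus the score on the last token."""
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("logprobs and ref_logprobs must be non-empty and equally long")
    # Per-token KL estimate: log pi(token) - log pi_ref(token)
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


# Toy numbers: the policy is more confident than the reference on the first token
print(shaped_rewards(score=1.0, logprobs=[-1.0, -2.0], ref_logprobs=[-1.5, -2.0]))
```

A larger coefficient keeps the policy closer to the reference model at the cost of slower reward gains.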

## 7. Training Loop

Run the PPO training loop (simplified for demo).

In [None]:
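# Prompts differ in length, so collate each field into a plain Python list
def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}

# Initialize trainer. With a PEFT policy and no ref_model, reference
# log-probabilities are computed with the adapters disabled.
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=ppo_model,
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator
)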

# Generation settings
generation_kwargs = {
    "max_new_tokens": 128,
    "temperature": 1.0,
    "top_p": 0.9,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id
}

print("\nStarting training...\n")
print("Note: This is a simplified demo. For full training, use the main training script.")

# Training loop (shortened for demo)
num_steps = 10

for step, batch in enumerate(ppo_trainer.dataloader):
    if step >= num_steps:
        break
    
    # Extract prompts
    query_tensors = batch["input_ids"]
    
    # Generate responses
    response_tensors = ppo_trainer.generate(
        query_tensors,
        return_prompt=False,
        **generation_kwargs
    )
    
    # Decode responses
    batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    
    # Compute rewards (truncate inputs that exceed the reward model's context window)
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = [torch.tensor(reward_pipe(text, truncation=True)[0]['score']) for text in texts]
    
    # Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    
    # Log progress
    ppo_trainer.log_stats(stats, batch, rewards)
    
    if step % 5 == 0:
        print(f"Step {step}: Mean reward = {torch.stack(rewards).mean():.4f}")

print("\nTraining demo complete!")
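
Each PPO step optimizes a clipped surrogate objective, which limits how much a single update can change the probability of a sampled token. The sketch below shows the standard per-token form of that loss in plain Python. It illustrates the general PPO objective rather than TRL's exact implementation; `clipped_policy_loss` and the `clip_range` value are our own choices.

```python
import math


def clipped_policy_loss(logprob_new, logprob_old, advantage, clip_range=0.2):
    """Standard PPO clipped surrogate loss for one token (lower is better)."""
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and clipped objectives
    return max(-advantage * ratio, -advantage * clipped_ratio)


# The new policy doubled the token's probability; clipping caps the credit it gets
print(clipped_policy_loss(math.log(0.4), math.log(0.2), advantage=1.0))
```

With a positive advantage, raising the probability ratio beyond `1 + clip_range` earns no further reduction in loss, which discourages oversized policy updates.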

## 8. Save Model

Save the trained model and tokenizer.

In [None]:
# Save directory
save_dir = "./checkpoints/quickstart_demo"
os.makedirs(save_dir, exist_ok=True)

# Save model
ppo_trainer.save_pretrained(save_dir)

# Save tokenizer
tokenizer.save_pretrained(save_dir)

print(f"Model and tokenizer saved to {save_dir}")

## 9. Inference

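Load the trained model and run inference. Because the policy was trained with LoRA, the checkpoint is expected to hold adapter weights rather than a full model; loading it through `AutoModelForCausalLM` relies on the PEFT integration in `transformers`, so `peft` must be installed.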

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load trained model
print("Loading trained model...")
trained_model = AutoModelForCausalLM.from_pretrained(
    save_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
trained_tokenizer = AutoTokenizer.from_pretrained(save_dir)

print("Model loaded successfully!")

# Inference function
def generate_response(prompt, max_new_tokens=256, temperature=0.7, top_p=0.9):
    inputs = trained_tokenizer(prompt, return_tensors="pt").to(trained_model.device)
    
    with torch.no_grad():
        outputs = trained_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=trained_tokenizer.pad_token_id
        )
    
    response = trained_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test prompts
test_prompts = [
    "What is machine learning?",
    "How can I improve my productivity?",
    "Explain quantum computing in simple terms."
]

print("\nGenerating responses...\n")
for i, prompt in enumerate(test_prompts, 1):
    print(f"\n{'='*60}")
    print(f"Prompt {i}: {prompt}")
    print(f"{'='*60}")
    
    response = generate_response(prompt)
    print(f"Response:\n{response}")
    
    # Compute reward for the response
    reward = reward_pipe(response)[0]['score']
    print(f"\nReward score: {reward:.4f}")

## 10. Compare with Base Model

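Compare outputs from the base model and the fine-tuned model on the same prompt. Both responses are sampled and the demo ran only 10 PPO steps, so a single reward difference is noisy and should not be read as evidence of improvement.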

In [None]:
# Load base model for comparison
print("Loading base model for comparison...")
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

def generate_base_response(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    
    with torch.no_grad():
        outputs = base_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Compare responses
prompt = "What are the benefits of regular exercise?"

print(f"\nPrompt: {prompt}\n")

print("Base Model Response:")
base_response = generate_base_response(prompt)
print(base_response)
base_reward = reward_pipe(base_response)[0]['score']
print(f"Reward: {base_reward:.4f}\n")

print("Fine-tuned Model Response:")
tuned_response = generate_response(prompt)
print(tuned_response)
tuned_reward = reward_pipe(tuned_response)[0]['score']
print(f"Reward: {tuned_reward:.4f}\n")

print(f"Reward improvement: {tuned_reward - base_reward:.4f}")
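
A steadier comparison averages rewards over several sampled responses per model and reports the spread. The sketch below uses made-up scores in place of real `reward_pipe` outputs; `summarize` is our own helper.

```python
import statistics


def summarize(rewards):
    """Return (mean, standard error of the mean) for a list of reward scores."""
    mean = statistics.mean(rewards)
    if len(rewards) > 1:
        sem = statistics.stdev(rewards) / len(rewards) ** 0.5
    else:
        sem = float("nan")  # spread is undefined for a single sample
    return mean, sem


# Made-up scores standing in for reward_pipe outputs on sampled responses
base_scores = [0.41, 0.55, 0.47, 0.52]
tuned_scores = [0.50, 0.58, 0.49, 0.61]

for label, scores in [("base", base_scores), ("tuned", tuned_scores)]:
    mean, sem = summarize(scores)
    print(f"{label:>5}: mean={mean:.3f} +/- {sem:.3f} (n={len(scores)})")
```

If the two intervals overlap heavily, the difference is probably within sampling noise.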

## 11. Next Steps

This quickstart covered the basics of RLHF training. To dive deeper:

### Full Training
For production training, use the command-line interface:
```bash
python train.py --config examples/recipes/gemma3_1b_basic.yaml
```

### Experiment with Different Configurations
- Try different model sizes: `gemma2_2b` vs `gemma3_1b`
- Use different datasets: `anthropic_hh`, `openassistant`, `summarization`
- Experiment with reward compositions: `single_reward`, `multi_objective`, `ensemble`
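
As an illustration of what a multi-objective reward composition might look like, the sketch below combines several scores with a weighted average. The function `combine_rewards` and the objective names `helpfulness` and `brevity` are hypothetical and do not describe how the recipes above implement `multi_objective`.

```python
def combine_rewards(scores, weights):
    """Weighted average of named reward scores; weights need not sum to 1."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical objectives and weights
scores = {"helpfulness": 0.8, "brevity": 0.4}
weights = {"helpfulness": 3.0, "brevity": 1.0}
print(f"Combined reward: {combine_rewards(scores, weights):.3f}")
```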

### Advanced Features
- Multi-GPU training for faster convergence
- Experiment tracking with Weights & Biases
- Custom reward functions
- Hyperparameter tuning

### Evaluation
- Run comprehensive evaluations on held-out test sets
- Compare multiple checkpoints
- Analyze failure modes and edge cases

### Resources
- Documentation: `docs/`
- Example configs: `examples/configs/`
- Training recipes: `examples/recipes/`
- Community examples: `examples/community/`


## Summary

In this quickstart, we:
1. ✅ Set up the environment and dependencies
2. ✅ Loaded a configuration and dataset
3. ✅ Initialized a model with LoRA for efficient training
4. ✅ Set up a reward model for scoring responses
5. ✅ Ran a simplified PPO training loop
6. ✅ Saved the trained model
7. ✅ Performed inference with the fine-tuned model
8. ✅ Compared base and fine-tuned model outputs

You're now ready to explore more advanced RLHF training scenarios!
