# Lab 3.4.5: Best-of-N with Reward Model

**Module:** 3.4 - Test-Time Compute & Reasoning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand what reward models are and how they score responses
- [ ] Implement Best-of-N sampling (generate N, pick best)
- [ ] Load and use real reward models (ArmoRM, Skywork-Reward)
- [ ] Measure quality improvement vs greedy decoding
- [ ] Understand the cost-quality tradeoff

---

## Prerequisites

- Completed Labs 3.4.1-3.4.4
- Ollama running with an LLM
- PyTorch and Transformers installed
- DGX Spark (128GB) for running LLM + reward model together

---

## Real-World Context

**The Problem:** LLMs can generate many valid responses to a prompt, but some are better than others. How do we automatically select the best one?

**The Solution:** Use a **Reward Model** - a separate model trained to score how "good" a response is, based on human preferences.
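
To make "trained on human preferences" concrete: reward models of this kind are commonly trained on pairs of responses where a human marked one as preferred, using a Bradley-Terry style pairwise loss. The sketch below illustrates that objective in plain Python. The function names and example scores are hypothetical and are not taken from any model used in this lab.

```python
import math

def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """P(chosen is preferred) under Bradley-Terry: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human preference for a single pair."""
    return -math.log(preference_probability(score_chosen, score_rejected))

# Hypothetical reward scores for two responses to the same prompt
print(f"{pairwise_loss(2.0, 0.5):.3f}")  # preferred response scored higher: small loss
print(f"{pairwise_loss(0.5, 2.0):.3f}")  # preferred response scored lower: large loss
```

Only the difference between the two scores enters the loss, so the absolute scale of a reward score is arbitrary. That is why scores are most meaningful when comparing responses to the same prompt.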

**Best-of-N (BoN) Sampling:**
1. Generate N different responses (with temperature > 0)
2. Score each with the reward model
3. Return the highest-scoring response
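
The three steps above can be sketched independently of any particular LLM or reward model. In this minimal sketch, `best_of_n_sketch`, `toy_generate`, and `toy_score` are illustrative names; the toy generator and scorer are stand-ins, not the Ollama and HuggingFace components used later in the lab.

```python
import random
from typing import Callable, List, Tuple

def best_of_n_sketch(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 5,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Generate n candidates, score each, return the best plus all scored pairs."""
    candidates = [generate(prompt) for _ in range(n)]        # step 1: generate
    scored = [(c, score(prompt, c)) for c in candidates]     # step 2: score
    best, _ = max(scored, key=lambda pair: pair[1])          # step 3: pick the best
    return best, scored

# Toy stand-ins (hypothetical): a "model" that samples canned answers and a
# "reward model" that simply prefers longer answers.
rng = random.Random(0)
CANNED = ["42", "The answer is 42.", "6 * 7 = 42, so the answer is 42."]

def toy_generate(prompt: str) -> str:
    return rng.choice(CANNED)

def toy_score(prompt: str, response: str) -> float:
    return float(len(response))

best, scored = best_of_n_sketch("What is 6*7?", toy_generate, toy_score, n=5)
print(best)
```

Passing the generator and scorer as callables keeps the selection logic separate from the models, so either side can be swapped without touching the loop.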

**Industry Applications:**
- **Chat assistants (ChatGPT, Claude):** Reward models trained on human preferences guide RLHF training; the same scoring idea can rank candidate responses at inference time
- **Code Generation:** Pick the most likely correct solution
- **Content Moderation:** Score safety/helpfulness
- **Customer Service:** Choose most helpful response

---

## ELI5: Best-of-N with Reward Models

> **Imagine you're a restaurant owner...**
>
> You have a chef (the LLM) who can make 5 versions of a dish.
> Each version is slightly different.
>
> You also have a food critic (the reward model) who tastes each dish
> and gives it a score from 1-10.
>
> **Best-of-N:** The chef makes 5 dishes, the critic scores them all,
> and you serve the highest-scoring one to your customer!
>
> **The magic:** The critic has tasted millions of dishes and learned
> what humans prefer. Even if the chef sometimes makes a mediocre dish,
> the critic helps ensure customers get the best one.

```
Greedy:     Prompt ──> Generate 1 ──> Serve
                       (hope it's good!)

Best-of-N:  Prompt ──> Generate N ──> Score Each ──> Serve Best
                       [A, B, C, D, E]   [7, 9, 6, 8, 5]   [B wins!]
```

---

## Part 1: Setup

In [None]:
# Core imports
import json
import time
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass

import ollama

# Check for transformers (needed for reward models)
try:
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    HAS_TRANSFORMERS = True
except ImportError as e:
    print(f"Warning: {e}")
    print("Install with: pip install torch transformers")
    HAS_TRANSFORMERS = False

In [None]:
# Configure models
LLM_MODEL = "qwen3:8b"  # Your LLM for generation

# Reward model options (from HuggingFace)
REWARD_MODELS = {
    'armo': 'RLHFlow/ArmoRM-Llama3-8B-v0.1',  # ~16GB, general purpose
    'skywork': 'Skywork/Skywork-Reward-Llama-3.1-8B-v0.2',  # ~16GB
    'internlm': 'internlm/internlm2-7b-reward',  # ~14GB
}

# For DGX Spark: you can run a quantized 70B LLM (~45GB) + an 8B reward model in bf16 (~16GB) together!
print(f"LLM for generation: {LLM_MODEL}")
print(f"\nAvailable reward models:")
for name, model_id in REWARD_MODELS.items():
    print(f"  {name}: {model_id}")

---

## Part 2: Building a Simple Reward Model Wrapper

First, let's create a simple scoring function before loading a real reward model.

In [None]:
@dataclass
class ScoredResponse:
    """A response with its reward score."""
    response: str
    score: float
    generation_time: float
    
    def __repr__(self):
        return f"ScoredResponse(score={self.score:.3f}, len={len(self.response)})"


class SimpleRewardModel:
    """
    A simple heuristic reward model for demonstration.
    
    This is NOT a real reward model - it uses simple heuristics.
    Replace with a real model for production use.
    """
    
    def __init__(self):
        self.name = "simple_heuristic"
    
    def score(self, prompt: str, response: str) -> float:
        """
        Score a response using simple heuristics.
        
        Higher scores are better. Range: [0, 1]
        """
        score = 0.5  # Start at neutral
        
        # Reward: Appropriate length (not too short, not too long)
        length = len(response)
        if 50 < length < 500:
            score += 0.1
        elif length < 20:
            score -= 0.2
        
        # Reward: Contains reasoning markers
        reasoning_markers = ['because', 'therefore', 'first', 'then', 'finally', 'step']
        for marker in reasoning_markers:
            if marker in response.lower():
                score += 0.05
        
        # Reward: Has a clear answer
        if any(x in response.lower() for x in ['the answer is', 'answer:', '=']):
            score += 0.1
        
        # Penalty: Repetition
        words = response.lower().split()
        unique_ratio = len(set(words)) / max(len(words), 1)
        if unique_ratio < 0.5:
            score -= 0.2
        
        # Penalty: Uncertainty markers
        if any(x in response.lower() for x in ["i'm not sure", "i don't know", "might be"]):
            score -= 0.1
        
        return max(0.0, min(1.0, score))  # Clamp to [0, 1]


# Test simple reward model
simple_rm = SimpleRewardModel()

test_responses = [
    "42",  # Too short
    "The answer is 42 because we need to calculate the sum first, then multiply.",  # Good
    "I'm not sure but it might be around 42 or so.",  # Uncertain
    "First, let's break this down step by step. The answer is 42.",  # Great
]

print("Testing simple reward model:")
for resp in test_responses:
    score = simple_rm.score("What is 6*7?", resp)
    print(f"  Score {score:.2f}: {resp[:50]}...")

---

## Part 3: Loading a Real Reward Model

Now let's load a real reward model from HuggingFace.

In [None]:
class HuggingFaceRewardModel:
    """
    Wrapper for HuggingFace reward models.
    
    Supports models like ArmoRM and Skywork-Reward.
    """
    
    def __init__(self, model_name: str, device: str = "auto"):
        """
        Load a reward model from HuggingFace.
        
        Args:
            model_name: HuggingFace model ID
            device: Device to use ("auto", "cuda", "cpu")
        """
        print(f"Loading reward model: {model_name}")
        start_time = time.time()
        
        self.name = model_name
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Load model
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map=device,
            trust_remote_code=True,
        )
        self.model.eval()
        
        elapsed = time.time() - start_time
        print(f"Loaded in {elapsed:.1f}s")
        
        if torch.cuda.is_available():
            mem_used = torch.cuda.memory_allocated() / 1e9
            print(f"GPU memory used: {mem_used:.1f} GB")
    
    def score(self, prompt: str, response: str, max_length: int = 2048) -> float:
        """
        Score a response given a prompt.
        
        Returns a float score (higher is better).
        """
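        # Format as a conversation. Reward models are trained on chat-templated
        # input, so use the tokenizer's own template when it provides one.
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        if getattr(self.tokenizer, "chat_template", None):
            conversation = self.tokenizer.apply_chat_template(messages, tokenize=False)
            add_special = False  # the template already inserts BOS and other special tokens
        else:
            conversation = f"User: {prompt}\n\nAssistant: {response}"
            add_special = True
        
        # Tokenize
        inputs = self.tokenizer(
            conversation,
            return_tensors="pt",
            truncation=True,
            max_length=max_length,
            padding=True,
            add_special_tokens=add_special,
        )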
        
        # Move to model's device
        device = next(self.model.parameters()).device
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Get score
        with torch.no_grad():
            outputs = self.model(**inputs)
            
            if hasattr(outputs, 'logits'):
                score = outputs.logits.squeeze().item()
            elif hasattr(outputs, 'score'):
                score = outputs.score.squeeze().item()
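            else:
                # No reward head output: a hidden-state mean is not a preference
                # score, so fail loudly rather than return a meaningless number.
                raise ValueError(f"{self.name}: model output has no 'logits' or 'score'")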
        
        return score
    
    def score_batch(self, prompt: str, responses: List[str], batch_size: int = 4) -> List[float]:
        """
        Score multiple responses efficiently.
        """
        scores = []
        
        for i in range(0, len(responses), batch_size):
            batch = responses[i:i + batch_size]
            
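            # Format conversations with the model's chat template when available
            use_template = bool(getattr(self.tokenizer, "chat_template", None))
            conversations = []
            for resp in batch:
                messages = [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": resp},
                ]
                if use_template:
                    conversations.append(
                        self.tokenizer.apply_chat_template(messages, tokenize=False)
                    )
                else:
                    conversations.append(f"User: {prompt}\n\nAssistant: {resp}")
            
            inputs = self.tokenizer(
                conversations,
                return_tensors="pt",
                truncation=True,
                max_length=2048,
                padding=True,
                # A chat template already inserts BOS and other special tokens
                add_special_tokens=not use_template,
            )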
            
            device = next(self.model.parameters()).device
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            with torch.no_grad():
                outputs = self.model(**inputs)
                batch_scores = outputs.logits.squeeze(-1).tolist()
                
                if isinstance(batch_scores, (int, float)):
                    batch_scores = [batch_scores]
                
                scores.extend(batch_scores)
        
        return scores

In [None]:
# Load real reward model if transformers is available
if HAS_TRANSFORMERS:
    print("Loading ArmoRM reward model...")
    print("(This may take a minute on first run - model needs to download)\n")
    
    try:
        reward_model = HuggingFaceRewardModel(REWARD_MODELS['armo'])
        USE_REAL_RM = True
    except Exception as e:
        print(f"Could not load real reward model: {e}")
        print("Using simple heuristic reward model instead.")
        reward_model = SimpleRewardModel()
        USE_REAL_RM = False
else:
    print("Using simple heuristic reward model (transformers not available)")
    reward_model = SimpleRewardModel()
    USE_REAL_RM = False

print(f"\nReward model: {reward_model.name}")

In [None]:
# Test the reward model
test_prompt = "Explain what machine learning is."
test_responses = [
    "Machine learning is a type of AI.",  # Too short
    "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing algorithms that can access data and use it to learn for themselves.",  # Good
    "I don't know much about machine learning, but I think it's something to do with computers maybe?",  # Uncertain
]

print(f"Testing {reward_model.name}:")
print(f"Prompt: {test_prompt}\n")

for i, resp in enumerate(test_responses):
    score = reward_model.score(test_prompt, resp)
    print(f"Response {i+1} (score: {score:.3f}):")
    print(f"  {resp[:80]}...\n")

---

## Part 4: Implementing Best-of-N Sampling

In [None]:
def generate_candidates(
    prompt: str,
    n: int = 5,
    model: str = LLM_MODEL,
    temperature: float = 0.7,
    max_tokens: int = 512,
    verbose: bool = True,
) -> List[ScoredResponse]:
    """
    Generate N candidate responses.
    
    Args:
        prompt: The input prompt
        n: Number of candidates to generate
        model: LLM model to use
        temperature: Sampling temperature (>0 for diversity)
        max_tokens: Maximum tokens per response
        verbose: Whether to print progress
    
    Returns:
        List of ScoredResponse objects (scores not yet filled)
    """
    candidates = []
    
    for i in range(n):
        if verbose:
            print(f"  Generating candidate {i+1}/{n}...", end=" ", flush=True)
        
        start_time = time.time()
        
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": temperature, "num_predict": max_tokens}
        )
        
        elapsed = time.time() - start_time
        response_text = response['message']['content']
        
        candidates.append(ScoredResponse(
            response=response_text,
            score=0.0,  # Will be filled by reward model
            generation_time=elapsed,
        ))
        
        if verbose:
            print(f"({elapsed:.1f}s, {len(response_text)} chars)")
    
    return candidates


def best_of_n(
    prompt: str,
    reward_model,
    n: int = 5,
    llm_model: str = LLM_MODEL,
    temperature: float = 0.7,
    max_tokens: int = 512,
    verbose: bool = True,
) -> Tuple[ScoredResponse, List[ScoredResponse]]:
    """
    Best-of-N sampling: generate N candidates and return the best.
    
    Args:
        prompt: Input prompt
        reward_model: Reward model for scoring
        n: Number of candidates
        llm_model: LLM for generation
        temperature: Sampling temperature
        max_tokens: Max tokens per response
        verbose: Whether to print progress
    
    Returns:
        Tuple of (best_response, all_candidates)
    """
    if verbose:
        print(f"\nBest-of-{n} Sampling")
        print(f"Prompt: {prompt[:60]}...")
        print()
    
    # Generate candidates
    if verbose:
        print("Generating candidates:")
    candidates = generate_candidates(
        prompt, n, llm_model, temperature, max_tokens, verbose
    )
    
    # Score candidates
    if verbose:
        print("\nScoring candidates:")
    
    for i, candidate in enumerate(candidates):
        score = reward_model.score(prompt, candidate.response)
        candidate.score = score
        
        if verbose:
            print(f"  Candidate {i+1}: score = {score:.3f}")
    
    # Find best
    best = max(candidates, key=lambda x: x.score)
    
    if verbose:
        print(f"\nBest candidate: score = {best.score:.3f}")
        print(f"Score range: [{min(c.score for c in candidates):.3f}, {max(c.score for c in candidates):.3f}]")
    
    return best, candidates

In [None]:
# Test Best-of-N
test_prompt = "What are three benefits of regular exercise? Explain each briefly."

best, all_candidates = best_of_n(
    test_prompt,
    reward_model,
    n=5,
    temperature=0.7,
    verbose=True
)

print("\n" + "="*60)
print("BEST RESPONSE:")
print("="*60)
print(best.response)

---

## Part 5: Comparing Best-of-N vs Greedy Decoding

In [None]:
def greedy_decode(prompt: str, model: str = LLM_MODEL, max_tokens: int = 512) -> ScoredResponse:
    """
    Generate a single response with temperature=0 (greedy decoding).
    """
    start_time = time.time()
    
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.0, "num_predict": max_tokens}
    )
    
    elapsed = time.time() - start_time
    
    return ScoredResponse(
        response=response['message']['content'],
        score=0.0,  # Will be filled
        generation_time=elapsed,
    )


def compare_greedy_vs_bon(
    prompts: List[str],
    reward_model,
    n: int = 5,
    verbose: bool = True,
) -> Dict:
    """
    Compare greedy decoding vs Best-of-N on a set of prompts.
    """
    results = {
        'greedy_scores': [],
        'bon_scores': [],
        'greedy_times': [],
        'bon_times': [],
        'improvements': [],
    }
    
    for i, prompt in enumerate(prompts):
        if verbose:
            print(f"\n{'='*60}")
            print(f"Prompt {i+1}/{len(prompts)}: {prompt[:50]}...")
            print('='*60)
        
        # Greedy decoding
        if verbose:
            print("\n[Greedy Decoding]")
        greedy_start = time.time()
        greedy_response = greedy_decode(prompt)
        greedy_response.score = reward_model.score(prompt, greedy_response.response)
        greedy_time = time.time() - greedy_start
        
        if verbose:
            print(f"  Score: {greedy_response.score:.3f}")
            print(f"  Time: {greedy_time:.1f}s")
        
        # Best-of-N
        if verbose:
            print(f"\n[Best-of-{n}]")
        bon_start = time.time()
        best, _ = best_of_n(
            prompt, reward_model, n=n,
            temperature=0.7, verbose=False
        )
        bon_time = time.time() - bon_start
        
        if verbose:
            print(f"  Score: {best.score:.3f}")
            print(f"  Time: {bon_time:.1f}s")
        
        # Record results
        results['greedy_scores'].append(greedy_response.score)
        results['bon_scores'].append(best.score)
        results['greedy_times'].append(greedy_time)
        results['bon_times'].append(bon_time)
        results['improvements'].append(best.score - greedy_response.score)
        
        if verbose:
            improvement = best.score - greedy_response.score
            print(f"\n  Improvement: {improvement:+.3f}")
    
    return results

In [None]:
# Test prompts
test_prompts = [
    "What is the difference between a list and a tuple in Python?",
    "Explain how photosynthesis works in simple terms.",
    "What are the main causes of climate change?",
    "Describe the process of making coffee.",
]

# Run comparison
comparison_results = compare_greedy_vs_bon(
    test_prompts,
    reward_model,
    n=5,
    verbose=True
)

In [None]:
# Summary statistics
print("\n" + "="*60)
print("COMPARISON SUMMARY")
print("="*60)

avg_greedy_score = sum(comparison_results['greedy_scores']) / len(comparison_results['greedy_scores'])
avg_bon_score = sum(comparison_results['bon_scores']) / len(comparison_results['bon_scores'])
avg_improvement = sum(comparison_results['improvements']) / len(comparison_results['improvements'])

avg_greedy_time = sum(comparison_results['greedy_times']) / len(comparison_results['greedy_times'])
avg_bon_time = sum(comparison_results['bon_times']) / len(comparison_results['bon_times'])

print(f"\n{'Metric':<25} {'Greedy':<15} {'Best-of-5':<15}")
print("-"*55)
print(f"{'Average Score':<25} {avg_greedy_score:<15.3f} {avg_bon_score:<15.3f}")
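# Append the unit before padding so the "s" stays attached to the number
greedy_time_str = f"{avg_greedy_time:.1f}s"
bon_time_str = f"{avg_bon_time:.1f}s"
print(f"{'Average Time':<25} {greedy_time_str:<15} {bon_time_str:<15}")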
print("-"*55)
print(f"{'Average Improvement':<25} {avg_improvement:+.3f}")
print(f"{'Time Overhead':<25} {avg_bon_time/avg_greedy_time:.1f}x")

# How often did BoN win?
bon_wins = sum(1 for imp in comparison_results['improvements'] if imp > 0)
print(f"\nBoN outperformed Greedy: {bon_wins}/{len(comparison_results['improvements'])} times ({bon_wins/len(comparison_results['improvements']):.0%})")

---

## Part 6: Experimenting with N

In [None]:
def experiment_with_n(
    prompt: str,
    reward_model,
    n_values: List[int] = [1, 3, 5, 10],
) -> Dict:
    """
    Experiment with different values of N.
    """
    results = {}
    
    for n in n_values:
        print(f"Testing N={n}...")
        
        start_time = time.time()
        
        if n == 1:
            # Greedy
            response = greedy_decode(prompt)
            response.score = reward_model.score(prompt, response.response)
            best_score = response.score
            all_scores = [best_score]
        else:
            # Best-of-N
            best, candidates = best_of_n(
                prompt, reward_model, n=n,
                verbose=False
            )
            best_score = best.score
            all_scores = [c.score for c in candidates]
        
        elapsed = time.time() - start_time
        
        results[n] = {
            'best_score': best_score,
            'all_scores': all_scores,
            'time': elapsed,
        }
        
        print(f"  Best score: {best_score:.3f}, Time: {elapsed:.1f}s")
    
    return results


# Test with different N
test_prompt = "Explain the concept of recursion in programming with an example."

print(f"Prompt: {test_prompt}\n")
n_experiment = experiment_with_n(
    test_prompt,
    reward_model,
    n_values=[1, 3, 5, 10]
)

# Visualize results
print("\n" + "="*50)
print("EFFECT OF N ON QUALITY")
print("="*50)
print(f"\n{'N':<10} {'Best Score':<15} {'Time':<15}")
print("-"*40)
for n, data in n_experiment.items():
    time_str = f"{data['time']:.1f}s"  # attach the unit before padding
    print(f"{n:<10} {data['best_score']:<15.3f} {time_str:<15}")

### Key Insight: Diminishing Returns

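As N increases:
- The expected best score rises, because you take the maximum over more samples (a single run can still be noisy)
- Each additional candidate adds less than the one before (diminishing returns)
- Time and cost grow linearly, because every candidate must be generated and scored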
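
The diminishing-returns pattern can be illustrated with a small simulation. This sketch assumes each candidate's reward score is an independent draw from a standard normal distribution. That is a modelling assumption for illustration, not a property of any real reward model, and `expected_best_score` is a hypothetical helper.

```python
import random
import statistics

def expected_best_score(n: int, trials: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of n scores], with scores ~ Normal(0, 1)."""
    rng = random.Random(seed)
    return statistics.fmean(
        max(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)
    )

n_values = [1, 2, 4, 8, 16]
estimates = [expected_best_score(n) for n in n_values]
for n, est in zip(n_values, estimates):
    print(f"N={n:<3} expected best score ~ {est:+.2f}")

# Gain from each doubling of N; under this assumption the gains shrink
gains = [later - earlier for earlier, later in zip(estimates, estimates[1:])]
print("Gain per doubling:", [f"{g:.2f}" for g in gains])
```

Each doubling of N doubles the generation and scoring cost, while the gain in expected best score shrinks.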

**Recommendation:**
- N=3-5 for most applications (good balance)
- N=10+ for high-stakes decisions
- N=1 (greedy) when speed matters most

---

## Common Mistakes

### Mistake 1: Using Temperature = 0

```python
# Wrong: All candidates are identical!
best_of_n(prompt, rm, n=5, temperature=0.0)

# Right: Use temperature > 0 for diversity
best_of_n(prompt, rm, n=5, temperature=0.7)
```
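
Why temperature matters can be seen at the level of a single token. The sketch below samples from a softmax over toy logits: at temperature 0 it always picks the most likely token, so repeated generations collapse to the same output, while a positive temperature spreads the choices. The logits and `sample_token` are hypothetical and only illustrate the mechanism; they are not how Ollama exposes sampling.

```python
import math
import random
from typing import List

def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Sample an index from softmax(logits / temperature); temperature 0 means argmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.5, 1.0, 0.5]  # hypothetical next-token logits

for temperature in (0.0, 0.7):
    picks = [sample_token(toy_logits, temperature, rng) for _ in range(200)]
    print(f"temperature={temperature}: {len(set(picks))} distinct tokens in 200 draws")
```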

### Mistake 2: Not Matching Reward Model to Use Case

```python
# Wrong: Using a general RM for code generation
reward_model = load("general-chat-rm")  # Optimized for chat
score_code_response(reward_model, code)  # May not score code well

# Right: Use domain-specific RM when available
reward_model = load("code-quality-rm")  # Optimized for code
```

### Mistake 3: Running Out of Memory

```python
# Wrong: Trying to run huge LLM + huge RM
llm = load("llama-70b")  # 45GB
rm = load("reward-70b")  # 45GB
# Total: 90GB - may not fit!

# Right: Use smaller RM or quantize
llm = load("llama-70b")  # 45GB
rm = load("armo-8b")     # 16GB
# Total: 61GB - fits on DGX Spark!
```
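
A rough way to sanity-check figures like these before loading anything is to estimate weight memory from parameter count and precision. This is a back-of-the-envelope sketch: `weights_gb` and `fits` are hypothetical helpers, the 4.5 bits per weight for the quantized 70B model is an assumption, and the 10 GB headroom is an arbitrary placeholder for KV cache and activations.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8.0

def fits(budget_gb: float, *model_sizes_gb: float, headroom_gb: float = 10.0) -> bool:
    """True if the models plus a headroom allowance fit within the budget."""
    return sum(model_sizes_gb) + headroom_gb <= budget_gb

rm_8b_bf16 = weights_gb(8, 16)     # 8B reward model in bfloat16 (16 bits per weight)
llm_70b_q4 = weights_gb(70, 4.5)   # 70B LLM at an assumed ~4.5 bits per weight

print(f"8B reward model (bf16): {rm_8b_bf16:.1f} GB")
print(f"70B LLM (~4.5-bit):     {llm_70b_q4:.1f} GB")
print("Fits in 128 GB:", fits(128, rm_8b_bf16, llm_70b_q4))
```

Actual usage is higher than the weights alone, so treat the result as a lower bound and confirm it with the memory report printed when the model loads.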

---

## Checkpoint

You've learned:
- ✅ What reward models are and how they score responses
- ✅ How to implement Best-of-N sampling
- ✅ How to load real reward models from HuggingFace
- ✅ The quality improvement over greedy decoding
- ✅ The tradeoff between N and quality/cost

---

## Cleanup

In [None]:
# Summary
print("Lab 3.4.5 Complete!")
print(f"\nReward model used: {reward_model.name}")
print(f"Average BoN improvement: {avg_improvement:+.3f}")

# Cleanup
import gc

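# Delete reward model to free memory
if HAS_TRANSFORMERS and USE_REAL_RM:
    del reward_model

# Collect garbage first so the model's tensors are actually released,
# then return cached GPU memory to the driver.
gc.collect()
if HAS_TRANSFORMERS and torch.cuda.is_available():
    torch.cuda.empty_cache()
print("\nMemory cleaned up.")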

---

## Next Steps

Excellent! You've mastered Best-of-N sampling with reward models.

In the final lab of this module, you'll build an **Adaptive Reasoning Pipeline** that combines everything:
- Routes simple queries to fast models
- Routes complex queries to reasoning models
- Uses caching for efficiency

**Continue to:** [Lab 3.4.6: Reasoning Pipeline](./lab-3.4.6-reasoning-pipeline.ipynb)