# Lab 3.4.4: R1 vs Standard Model Comparison

**Module:** 3.4 - Test-Time Compute & Reasoning  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate-Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Quantify the accuracy advantage of reasoning models
- [ ] Understand the token/time overhead of thinking
- [ ] Know when reasoning models are worth the extra cost
- [ ] Compare models across different problem categories
- [ ] Make data-driven decisions about model selection

---

## Prerequisites

- Completed Lab 3.4.3 (DeepSeek-R1 Exploration)
- Both DeepSeek-R1 and a standard model (e.g., Llama 3.1) available in Ollama

---

## Real-World Context

When deploying AI in production, you face a key question: **Should I use a reasoning model or a standard model?**

The answer depends on:
- Task complexity (is reasoning needed?)
- Accuracy requirements (can I tolerate errors?)
- Latency constraints (do I need fast responses?)
- Cost budget (more tokens = more money)

This lab gives you the data to make informed decisions.

---

## ELI5: Why Compare?

> **Imagine hiring for a job...**
>
> **Candidate A (Standard Model):** Fast worker, answers quickly.
> Sometimes right, sometimes wrong. Cheap to hire.
>
> **Candidate B (R1):** Careful worker, thinks before answering.
> Usually right, but takes longer. More expensive.
>
> **The question:** For THIS job, which is the better hire?
> - Simple data entry? Candidate A is fine.
> - Complex analysis where errors are costly? Candidate B is worth it.
>
> **This lab:** We'll run both candidates through a test and measure exactly how they perform!

---

## Part 1: Setup

In [None]:
import json
import time
import re
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
from collections import defaultdict

import ollama

# List available models
models = ollama.list()
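# Older ollama-python releases expose the tag under 'name', newer ones under 'model'
model_names = [m.get('model') or m.get('name') for m in models.get('models', [])]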

print("Available models:")
for name in model_names:
    print(f"  - {name}")

In [None]:
# Configure models for comparison
# Adjust these based on your available models

# Reasoning model (R1)
R1_MODEL = None
for name in model_names:
    if 'r1' in name.lower():  # Also matches 'deepseek-r1' tags
        R1_MODEL = name
        break

# Standard model (Llama, Qwen, etc.)
STANDARD_MODEL = None
for name in model_names:
    if any(x in name.lower() for x in ['llama', 'qwen', 'mistral', 'gemma']):
        if 'r1' not in name.lower():  # Exclude R1 distilled versions
            STANDARD_MODEL = name
            break

print(f"\nReasoning model (R1): {R1_MODEL or 'Not found'}")
print(f"Standard model: {STANDARD_MODEL or 'Not found'}")

if not R1_MODEL:
    print("\nTo download R1: ollama pull deepseek-r1:7b")
if not STANDARD_MODEL:
    print("To download standard model: ollama pull llama3.1:8b")

In [None]:
# Manual override if auto-detection failed
# Uncomment and set these if needed:

# R1_MODEL = "deepseek-r1:7b"
# STANDARD_MODEL = "llama3.1:8b"

# For fair comparison, try to match parameter counts:
# - deepseek-r1:7b vs llama3.1:8b
# - deepseek-r1:32b vs qwen2.5:32b
# - deepseek-r1:70b vs llama3.1:70b

print(f"Using R1: {R1_MODEL}")
print(f"Using Standard: {STANDARD_MODEL}")

In [None]:
# Load test problems
data_path = Path("../data/test_problems.json")
if data_path.exists():
    with open(data_path) as f:
        all_problems = json.load(f)
    print("Loaded problems:")
    print(f"  Math: {len(all_problems['math'])}")
    print(f"  Code: {len(all_problems['code'])}")
    print(f"  Reasoning: {len(all_problems['reasoning'])}")
else:
    # Fallback
    all_problems = {
        "math": [
            {"question": "What is 17 * 23?", "answer": 391},
            {"question": "What is 15% of 240?", "answer": 36},
            {"question": "If 3x + 7 = 22, what is x?", "answer": 5},
        ],
        "reasoning": [
            {"question": "A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost in cents?", "answer": 5},
            {"question": "Tom is taller than Jim. Jim is taller than Mary. Is Tom taller than Mary? Answer yes or no.", "answer": "yes"},
        ],
        "code": [
            {"question": "Write a Python function is_prime(n) that returns True if n is prime.", "test_cases": [{"input": 17, "expected": True}]},
        ]
    }
    print("Using fallback problems")

---

## Part 2: Evaluation Framework

In [None]:
@dataclass
class EvalResult:
    """Result from evaluating a single problem."""
    question: str
    expected: str
    predicted: str
    correct: bool
    response_time: float
    response_tokens: int
    thinking_tokens: int
    category: str
    full_response: str


@dataclass
class ModelEvaluation:
    """Complete evaluation of a model."""
    model_name: str
    results: List[EvalResult] = field(default_factory=list)
    
    @property
    def accuracy(self) -> float:
        if not self.results:
            return 0.0
        return sum(1 for r in self.results if r.correct) / len(self.results)
    
    @property
    def avg_time(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.response_time for r in self.results) / len(self.results)
    
    @property
    def total_tokens(self) -> int:
        return sum(r.response_tokens for r in self.results)
    
    @property
    def total_thinking_tokens(self) -> int:
        return sum(r.thinking_tokens for r in self.results)
    
    def accuracy_by_category(self) -> Dict[str, float]:
        by_cat = defaultdict(list)
        for r in self.results:
            by_cat[r.category].append(r.correct)
        return {cat: sum(correct) / len(correct) for cat, correct in by_cat.items()}

In [None]:
def extract_answer(response: str, expected_type: str = "number") -> Optional[str]:
    """
    Extract answer from response based on expected type.
    """
    # Remove thinking tokens if present
    response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
    response = response.strip()
    
    if expected_type == "number":
        # Look for patterns like "The answer is X"
        patterns = [
            r"[Tt]he (?:final )?answer is[:\s]+\$?([\d,]+(?:\.\d+)?)",
            r"[Aa]nswer[:\s]+\$?([\d,]+(?:\.\d+)?)",
            r"=\s*\$?([\d,]+(?:\.\d+)?)\s*(?:$|\.|\n)",
        ]
        for pattern in patterns:
            matches = re.findall(pattern, response)
            if matches:
                num_str = matches[-1].replace(',', '')
                try:
                    num = float(num_str)
                    if num == int(num):
                        return str(int(num))
                    return str(num)
                except (ValueError, OverflowError):
                    continue
        
        # Fallback: last number
        numbers = re.findall(r'-?[\d,]+(?:\.\d+)?', response)
        if numbers:
            num_str = numbers[-1].replace(',', '')
            try:
                num = float(num_str)
                if num == int(num):
                    return str(int(num))
                return str(num)
            except (ValueError, OverflowError):
                pass
    
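    elif expected_type == "yes_no":
        # Match whole words only, so "know" or "another" is not read as "no";
        # take the last match because the final answer usually comes at the end.
        matches = re.findall(r'\b(yes|no)\b', response.lower())
        if matches:
            return matches[-1]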
    
    return None


def count_thinking_tokens(response: str) -> int:
    """Count tokens in <think> blocks."""
    matches = re.findall(r'<think>(.*?)</think>', response, re.DOTALL)
    thinking_text = ' '.join(matches)
    return len(thinking_text) // 4  # Rough estimate: ~4 characters per token


def compare_answers(predicted: Optional[str], expected, tolerance: float = 0.01) -> bool:
    """Compare predicted to expected answer."""
    if predicted is None:
        return False
    
    # Try numeric comparison
    try:
        pred_num = float(str(predicted).replace(',', ''))
        exp_num = float(str(expected).replace(',', ''))
        if exp_num == 0:
            return abs(pred_num) < tolerance
        return abs(pred_num - exp_num) / abs(exp_num) < tolerance
    except ValueError:
        pass
    
    # String comparison
    return str(predicted).lower().strip() == str(expected).lower().strip()
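
Before trusting the accuracy numbers, it can help to check the extraction step on a few hand-written responses. The sketch below is a simplified stand-in for `extract_answer`, not a replacement for it: the helper name `last_number_outside_think` and the sample responses are made up for illustration. It strips `<think>` blocks and returns the last number in whatever text remains.

```python
import re
from typing import Optional


def last_number_outside_think(response: str) -> Optional[str]:
    """Return the last number that appears outside any <think>...</think> block."""
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", visible)
    if not numbers:
        return None
    value = float(numbers[-1].replace(",", ""))
    return str(int(value)) if value.is_integer() else str(value)


# Hand-written sample responses (illustrative, not real model output).
samples = [
    "<think>17 * 20 = 340, 17 * 3 = 51, so 340 + 51 = 391</think>The answer is 391.",
    "15% of 240 is 0.15 * 240 = 36",
    "I am not sure.",
]
for text in samples:
    print(last_number_outside_think(text))
```

If a response is truncated before `</think>` is emitted, the block is not stripped and intermediate numbers can leak into the result, which is worth checking when a reasoning model scores unexpectedly low.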

In [None]:
def evaluate_model(
    model: str,
    problems: Dict[str, List],
    n_per_category: int = 5,
    use_cot: bool = True,
    verbose: bool = True,
) -> ModelEvaluation:
    """
    Evaluate a model on problems from each category.
    """
    evaluation = ModelEvaluation(model_name=model)
    
    for category, probs in problems.items():
        if category == 'code':  # Skip code for now (harder to evaluate)
            continue
            
        if verbose:
            print(f"\n{'='*50}")
            print(f"Category: {category.upper()}")
            print('='*50)
        
        for i, prob in enumerate(probs[:n_per_category]):
            question = prob.get('question', '')
            expected = prob.get('answer', prob.get('numerical_answer', ''))
            
            # Determine expected type
            if str(expected).lower() in ['yes', 'no']:
                exp_type = 'yes_no'
            else:
                exp_type = 'number'
            
            if verbose:
                print(f"\nProblem {i+1}: {question[:60]}...")
            
            # Build prompt
            if use_cot:
                prompt = f"{question}\n\nLet's think step by step:"
            else:
                prompt = question
            
            # Get response
            start_time = time.time()
            response = ollama.chat(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                # num_predict caps generated tokens; a long <think> block can be cut off at 1024
                options={"temperature": 0.0, "num_predict": 1024}
            )
            elapsed = time.time() - start_time
            
            response_text = response['message']['content']
            
            # Extract and compare
            predicted = extract_answer(response_text, exp_type)
            correct = compare_answers(predicted, expected)
            thinking_tokens = count_thinking_tokens(response_text)
            response_tokens = len(response_text) // 4
            
            result = EvalResult(
                question=question,
                expected=str(expected),
                predicted=predicted or "N/A",
                correct=correct,
                response_time=elapsed,
                response_tokens=response_tokens,
                thinking_tokens=thinking_tokens,
                category=category,
                full_response=response_text,
            )
            evaluation.results.append(result)
            
            if verbose:
                status = "CORRECT" if correct else "WRONG"
                print(f"  Expected: {expected}, Predicted: {predicted} [{status}]")
                print(f"  Time: {elapsed:.1f}s, Tokens: {response_tokens} (thinking: {thinking_tokens})")
    
    return evaluation
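
The `len(text) // 4` estimate used above is coarse. Ollama's chat responses also report generation statistics, including `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below reads those fields and falls back to the character estimate when they are missing. `generation_stats` is a hypothetical helper, and the sample dictionary holds invented values rather than measured output.

```python
from typing import Any, Dict, Mapping


def generation_stats(response: Mapping[str, Any], text: str) -> Dict[str, float]:
    """Token count and throughput from an Ollama-style response mapping."""
    tokens = response.get("eval_count") or len(text) // 4  # fallback: ~4 chars/token
    duration_ns = response.get("eval_duration") or 0
    seconds = duration_ns / 1e9
    tokens_per_second = tokens / seconds if seconds > 0 else 0.0
    return {"tokens": tokens, "seconds": seconds, "tokens_per_second": tokens_per_second}


# Illustrative values only, not measured output.
fake_response = {"eval_count": 250, "eval_duration": 5_000_000_000}
stats = generation_stats(fake_response, text="x" * 1000)
print(stats)
```

Note that `eval_count` covers everything the model generated, so for a reasoning model it includes the thinking text as well as the final answer.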

---

## Part 3: Run the Comparison

In [None]:
# Number of problems per category to test
N_PROBLEMS = 5  # Increase for more thorough evaluation

print("="*60)
print(f"EVALUATING: {R1_MODEL} (Reasoning Model)")
print("="*60)

if R1_MODEL:
    r1_eval = evaluate_model(
        R1_MODEL,
        all_problems,
        n_per_category=N_PROBLEMS,
        use_cot=True,
        verbose=True
    )
else:
    print("R1 model not available. Skipping.")
    r1_eval = None

In [None]:
print("\n" + "="*60)
print(f"EVALUATING: {STANDARD_MODEL} (Standard Model)")
print("="*60)

if STANDARD_MODEL:
    standard_eval = evaluate_model(
        STANDARD_MODEL,
        all_problems,
        n_per_category=N_PROBLEMS,
        use_cot=True,
        verbose=True
    )
else:
    print("Standard model not available. Skipping.")
    standard_eval = None

---

## Part 4: Analyze Results

In [None]:
def print_comparison_report(r1_eval: ModelEvaluation, std_eval: ModelEvaluation):
    """Print a detailed comparison report."""
    
    print("\n" + "="*70)
    print("MODEL COMPARISON REPORT")
    print("="*70)
    
    # Overall comparison
    print(f"\n{'Metric':<30} {'R1 Model':<18} {'Standard Model':<18}")
    print("-"*70)
    print(f"{'Model Name':<30} {r1_eval.model_name[:16]:<18} {std_eval.model_name[:16]:<18}")
    print(f"{'Overall Accuracy':<30} {r1_eval.accuracy:<18.1%} {std_eval.accuracy:<18.1%}")
    print(f"{'Avg Response Time':<30} {f'{r1_eval.avg_time:.1f}s':<18} {f'{std_eval.avg_time:.1f}s':<18}")
    print(f"{'Total Tokens':<30} {r1_eval.total_tokens:<18} {std_eval.total_tokens:<18}")
    print(f"{'Thinking Tokens':<30} {r1_eval.total_thinking_tokens:<18} {std_eval.total_thinking_tokens:<18}")
    
    # By category
    print("\n" + "-"*70)
    print("ACCURACY BY CATEGORY")
    print("-"*70)
    
    r1_by_cat = r1_eval.accuracy_by_category()
    std_by_cat = std_eval.accuracy_by_category()
    
    all_cats = set(r1_by_cat.keys()) | set(std_by_cat.keys())
    
    print(f"{'Category':<20} {'R1':<15} {'Standard':<15} {'Difference':<15}")
    for cat in sorted(all_cats):
        r1_acc = r1_by_cat.get(cat, 0)
        std_acc = std_by_cat.get(cat, 0)
        diff = r1_acc - std_acc
        sign = "+" if diff > 0 else ""
        print(f"{cat:<20} {r1_acc:<15.1%} {std_acc:<15.1%} {sign}{diff:<15.1%}")
    
    # Cost analysis
    print("\n" + "-"*70)
    print("COST-BENEFIT ANALYSIS")
    print("-"*70)
    
    acc_improvement = r1_eval.accuracy - std_eval.accuracy
    time_overhead = r1_eval.avg_time / std_eval.avg_time if std_eval.avg_time > 0 else 0
    token_overhead = r1_eval.total_tokens / std_eval.total_tokens if std_eval.total_tokens > 0 else 0
    
    print(f"Accuracy improvement: {acc_improvement:+.1%}")
    print(f"Time overhead: {time_overhead:.1f}x")
    print(f"Token overhead: {token_overhead:.1f}x")
    
    # Recommendation
    print("\n" + "-"*70)
    print("RECOMMENDATION")
    print("-"*70)
    
    if acc_improvement > 0.1:
        print(f"R1 shows significant accuracy improvement (+{acc_improvement:.0%}).")
        print("Use R1 for:")
        print("  - Complex reasoning tasks")
        print("  - High-stakes decisions")
        print("  - When accuracy matters more than speed")
    elif acc_improvement > 0:
        print(f"R1 shows modest improvement (+{acc_improvement:.0%}).")
        print("Consider R1 for complex tasks, standard model for simple ones.")
    else:
        print("Standard model matches or beats R1 on this benchmark.")
        print("The overhead may not be worth it for these tasks.")
    
    print("="*70)


# Print report
if r1_eval and standard_eval:
    print_comparison_report(r1_eval, standard_eval)
else:
    print("Cannot generate comparison - one or both models not evaluated.")

---

## Part 5: Detailed Error Analysis

In [None]:
def analyze_errors(r1_eval: ModelEvaluation, std_eval: ModelEvaluation):
    """Analyze where each model made errors."""
    
    print("\n" + "="*70)
    print("ERROR ANALYSIS")
    print("="*70)
    
    # Find problems where models disagreed
    r1_results = {r.question: r for r in r1_eval.results}
    std_results = {r.question: r for r in std_eval.results}
    
    both_correct = []
    both_wrong = []
    r1_only_correct = []
    std_only_correct = []
    
    for q in r1_results:
        if q in std_results:
            r1_correct = r1_results[q].correct
            std_correct = std_results[q].correct
            
            if r1_correct and std_correct:
                both_correct.append(q)
            elif not r1_correct and not std_correct:
                both_wrong.append(q)
            elif r1_correct and not std_correct:
                r1_only_correct.append(q)
            else:
                std_only_correct.append(q)
    
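    # Count only problems both models attempted, and avoid dividing by zero
    total = len(both_correct) + len(both_wrong) + len(r1_only_correct) + len(std_only_correct)
    if total == 0:
        print("\nNo shared problems to compare.")
        return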
    
    print(f"\nProblem Agreement Analysis ({total} problems):")
    print(f"  Both correct: {len(both_correct)} ({len(both_correct)/total:.0%})")
    print(f"  Both wrong: {len(both_wrong)} ({len(both_wrong)/total:.0%})")
    print(f"  R1 only correct: {len(r1_only_correct)} ({len(r1_only_correct)/total:.0%})")
    print(f"  Standard only correct: {len(std_only_correct)} ({len(std_only_correct)/total:.0%})")
    
    # Show examples where R1 succeeded
    if r1_only_correct:
        print("\n" + "-"*50)
        print("Examples where R1 SUCCEEDED but Standard FAILED:")
        print("-"*50)
        for q in r1_only_correct[:3]:
            r1_r = r1_results[q]
            std_r = std_results[q]
            print(f"\nQ: {q[:80]}...")
            print(f"  Expected: {r1_r.expected}")
            print(f"  R1 predicted: {r1_r.predicted} (CORRECT)")
            print(f"  Standard predicted: {std_r.predicted} (WRONG)")
    
    # Show examples where Standard succeeded
    if std_only_correct:
        print("\n" + "-"*50)
        print("Examples where Standard SUCCEEDED but R1 FAILED:")
        print("-"*50)
        for q in std_only_correct[:3]:
            r1_r = r1_results[q]
            std_r = std_results[q]
            print(f"\nQ: {q[:80]}...")
            print(f"  Expected: {r1_r.expected}")
            print(f"  R1 predicted: {r1_r.predicted} (WRONG)")
            print(f"  Standard predicted: {std_r.predicted} (CORRECT)")


if r1_eval and standard_eval:
    analyze_errors(r1_eval, standard_eval)
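
Because both models answer the same questions, the two "only correct" counts carry most of the information. One common way to judge whether a gap between them could be chance is an exact McNemar test, which treats each disagreement as a fair coin flip. This goes beyond the lab's own analysis and is offered as a sketch; `mcnemar_exact_p` and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact_p(r1_only: int, std_only: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = r1_only + std_only
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(r1_only, std_only)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: R1 alone correct on 8 problems, the standard model alone on 2.
print(f"p = {mcnemar_exact_p(8, 2):.3f}")
```

With only five problems per category the discordant counts are tiny, so large p-values are to be expected until the sample grows.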

---

## Part 6: Token Economy Analysis

How much does the extra thinking cost in terms of tokens (which translates to API costs and latency)?

In [None]:
def analyze_token_economy(r1_eval: ModelEvaluation, std_eval: ModelEvaluation):
    """Analyze the token cost vs accuracy benefit."""
    
    print("\n" + "="*70)
    print("TOKEN ECONOMY ANALYSIS")
    print("="*70)
    
    # Total tokens
    r1_tokens = r1_eval.total_tokens
    std_tokens = std_eval.total_tokens
    thinking_tokens = r1_eval.total_thinking_tokens
    
    # Accuracy
    r1_acc = r1_eval.accuracy
    std_acc = std_eval.accuracy
    
    # Calculations
    extra_tokens = r1_tokens - std_tokens
    extra_correct = sum(1 for r in r1_eval.results if r.correct) - sum(1 for r in std_eval.results if r.correct)
    
    print("\nToken Usage:")
    print(f"  R1 total tokens: {r1_tokens:,}")
    print(f"  Standard total tokens: {std_tokens:,}")
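    # Guard against empty runs, and let the format spec supply the sign
    extra_pct = extra_tokens / std_tokens * 100 if std_tokens else 0.0
    thinking_pct = thinking_tokens / r1_tokens * 100 if r1_tokens else 0.0
    print(f"  Extra tokens for R1: {extra_tokens:,} ({extra_pct:+.0f}%)")
    print(f"  Thinking tokens (R1): {thinking_tokens:,} ({thinking_pct:.0f}% of R1 output)")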
    
    print("\nAccuracy Impact:")
    print(f"  R1 correct: {sum(1 for r in r1_eval.results if r.correct)}/{len(r1_eval.results)}")
    print(f"  Standard correct: {sum(1 for r in std_eval.results if r.correct)}/{len(std_eval.results)}")
    print(f"  Extra correct with R1: {extra_correct}")
    
    if extra_correct > 0 and extra_tokens > 0:
        cost_per_extra_correct = extra_tokens / extra_correct
        print(f"\n  Token cost per extra correct answer: {cost_per_extra_correct:.0f} tokens")
    
    # Cost estimation (rough, based on typical API pricing)
    # Assuming ~$0.001 per 1K tokens (varies by model/provider)
    print("\n  Estimated API cost difference (at $0.001/1K tokens):")
    r1_cost = r1_tokens / 1000 * 0.001
    std_cost = std_tokens / 1000 * 0.001
    print(f"    R1: ${r1_cost:.4f}")
    print(f"    Standard: ${std_cost:.4f}")
    print(f"    Difference: ${r1_cost - std_cost:.4f}")


if r1_eval and standard_eval:
    analyze_token_economy(r1_eval, standard_eval)
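
Raw cost differences hide the fact that tokens spent on wrong answers are wasted. A sketch of a "cost per correct answer" metric follows. The helper `cost_per_correct`, the price, and the example totals are all placeholders, so substitute your provider's actual pricing and your own measured counts.

```python
def cost_per_correct(total_tokens: int, n_correct: int, usd_per_1k_tokens: float) -> float:
    """Dollars spent per correct answer; infinite if nothing was answered correctly."""
    if n_correct == 0:
        return float("inf")
    return (total_tokens / 1000) * usd_per_1k_tokens / n_correct


PRICE = 0.001  # placeholder USD per 1K tokens, the same assumption as the cell above

# Hypothetical totals for a 10-problem run.
r1 = cost_per_correct(total_tokens=12_000, n_correct=9, usd_per_1k_tokens=PRICE)
std = cost_per_correct(total_tokens=3_000, n_correct=6, usd_per_1k_tokens=PRICE)
print(f"R1: ${r1:.5f} per correct answer")
print(f"Standard: ${std:.5f} per correct answer")
```

Under this metric a model that is cheaper per token can still lose if its accuracy is low enough, which is the trade-off the lab asks you to measure.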

---

## Part 7: Decision Framework

Based on our analysis, when should you use a reasoning model?

In [None]:
decision_framework = """
╔══════════════════════════════════════════════════════════════════════╗
║             WHEN TO USE A REASONING MODEL (R1)                       ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  USE R1 WHEN:                                                        ║
║  ─────────────────────────────────────────────────────────────────   ║
║  ✓ Task requires multi-step reasoning                                ║
║  ✓ Accuracy is more important than speed                             ║
║  ✓ You need interpretable reasoning (for auditing/debugging)         ║
║  ✓ High-stakes decisions (medical, legal, financial)                 ║
║  ✓ Math, logic, or coding problems                                   ║
║                                                                      ║
║  USE STANDARD MODEL WHEN:                                            ║
║  ─────────────────────────────────────────────────────────────────   ║
║  ✓ Simple factual questions                                          ║
║  ✓ Speed is critical (real-time applications)                        ║
║  ✓ Cost-sensitive (high-volume, low-margin)                          ║
║  ✓ Creative writing or open-ended generation                         ║
║  ✓ Simple classification or extraction tasks                         ║
║                                                                      ║
║  HYBRID APPROACH (Best of Both Worlds):                              ║
║  ─────────────────────────────────────────────────────────────────   ║
║  1. Use classifier to detect query complexity                        ║
║  2. Route simple queries → Standard model (fast, cheap)              ║
║  3. Route complex queries → R1 (accurate, thorough)                  ║
║  4. Cache repeated reasoning patterns                                ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝
"""

print(decision_framework)
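
The hybrid approach can be sketched with a keyword heuristic standing in for the complexity classifier. Everything here is illustrative: `looks_complex`, `route`, the keyword list, and the word-count threshold are assumptions, and a production router would more likely use a trained classifier.

```python
import re

# Assumed model tags; replace with whatever `ollama list` shows locally.
REASONING_MODEL = "deepseek-r1:7b"
FAST_MODEL = "llama3.1:8b"

# Hypothetical signals that a query needs multi-step reasoning.
COMPLEX_HINTS = re.compile(
    r"\b(prove|solve|calculate|derive|how many|step by step|debug|why)\b",
    re.IGNORECASE,
)


def looks_complex(query: str, max_simple_words: int = 40) -> bool:
    """Crude stand-in for a complexity classifier."""
    return bool(COMPLEX_HINTS.search(query)) or len(query.split()) > max_simple_words


def route(query: str) -> str:
    """Return the model tag that should handle this query."""
    return REASONING_MODEL if looks_complex(query) else FAST_MODEL


for q in ["What is the capital of France?", "Solve 3x + 7 = 22 for x."]:
    print(f"{route(q):<16} <- {q}")
```

The same evaluation harness from Part 2 could then be used to check whether the routed mix actually beats either model alone on your own problems.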

---

## Common Mistakes

### Mistake 1: Comparing Models of Very Different Sizes

```python
# Wrong: Unfair comparison
compare("deepseek-r1:70b", "llama3.1:8b")  # 70B vs 8B!

# Right: Similar parameter counts
compare("deepseek-r1:7b", "llama3.1:8b")   # ~7B vs 8B
compare("deepseek-r1:70b", "llama3.1:70b") # 70B vs 70B
```
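
A small guard can catch a mismatched pairing before a long run. The sketch below parses the size suffix from an Ollama-style tag and checks that the two sizes are within a chosen ratio. The helpers `param_billions` and `is_fair_pair` and the 1.5x threshold are hypothetical choices, and tags without a size suffix (such as `:latest`) are treated as unknown.

```python
import re
from typing import Optional


def param_billions(tag: str) -> Optional[float]:
    """Parse the size suffix of a tag such as 'deepseek-r1:7b'; None if absent."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def is_fair_pair(tag_a: str, tag_b: str, max_ratio: float = 1.5) -> bool:
    """True when both sizes are known, nonzero, and within max_ratio of each other."""
    a, b = param_billions(tag_a), param_billions(tag_b)
    if not a or not b:
        return False
    return max(a, b) / min(a, b) <= max_ratio


print(is_fair_pair("deepseek-r1:7b", "llama3.1:8b"))
print(is_fair_pair("deepseek-r1:70b", "llama3.1:8b"))
```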

### Mistake 2: Not Using CoT with Standard Model

```python
# Wrong: Comparing R1 (with built-in CoT) vs Standard (no CoT)
r1_response = query(r1_model, question)  # Has <think> tokens
std_response = query(std_model, question)  # Direct answer

# Right: Give standard model CoT too
std_response = query(std_model, question + "\nLet's think step by step:")
```

### Mistake 3: Small Sample Size

```python
# Wrong: Too few problems
evaluate(model, problems[:3])  # Only 3 problems!

# Right: Enough problems that one or two lucky answers don't swing the result
evaluate(model, problems[:30])  # Aim for 30+ per category
```
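
To see why sample size matters, compare the uncertainty around the same observed accuracy at two sample sizes. The sketch below uses the Wilson score interval, one standard choice for binomial proportions. `wilson_interval` is a helper written for this illustration, and the counts are invented.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Same 80% observed accuracy, at 5 problems and at 30 problems.
for correct, total in [(4, 5), (24, 30)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: {low:.0%} to {high:.0%}")
```

The interval for 5 problems is much wider than the one for 30, so a gap between two models measured on 5 problems can easily fall inside the noise.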

---

## Checkpoint

You've learned:
- ✅ How to fairly compare reasoning vs standard models
- ✅ How to measure accuracy, speed, and token usage
- ✅ How to analyze where each model succeeds/fails
- ✅ The token economy of reasoning models
- ✅ When to choose reasoning vs standard models

---

## Cleanup and Save Results

In [None]:
# Summarize comparison results (printed only; use json.dump to write them to a file)
if r1_eval and standard_eval:
    comparison_summary = {
        'r1_model': r1_eval.model_name,
        'standard_model': standard_eval.model_name,
        'r1_accuracy': r1_eval.accuracy,
        'standard_accuracy': standard_eval.accuracy,
        'accuracy_difference': r1_eval.accuracy - standard_eval.accuracy,
        'r1_avg_time': r1_eval.avg_time,
        'standard_avg_time': standard_eval.avg_time,
        'r1_total_tokens': r1_eval.total_tokens,
        'standard_total_tokens': standard_eval.total_tokens,
        'r1_thinking_tokens': r1_eval.total_thinking_tokens,
        'r1_accuracy_by_category': r1_eval.accuracy_by_category(),
        'standard_accuracy_by_category': standard_eval.accuracy_by_category(),
    }
    
    print("Comparison Summary:")
    print(json.dumps(comparison_summary, indent=2, default=str))

import gc
gc.collect()
print("\nMemory cleaned up.")

---

## Next Steps

Great work! You now have data-driven insights on reasoning models.

In the next lab, you'll learn **Best-of-N sampling with reward models**, another way to improve output quality at inference time.

**Continue to:** [Lab 3.4.5: Best-of-N with Reward Model](./lab-3.4.5-best-of-n.ipynb)