In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability, print library versions, and set random seeds

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Case Study: Teaching an Insurance Claims Engine to Reason Step by Step
## Implementation Notebook

Welcome to the ClearClaim AI case study implementation notebook. In this notebook, you will build a reasoning model that processes insurance claims by generating step-by-step payout calculations, following the SFT-then-reinforcement-learning recipe that DeepSeek-R1 used to teach language models to reason.

You will implement the full three-stage pipeline:
1. **SFT:** Supervised fine-tuning on chain-of-thought reasoning traces
2. **GRPO:** Group Relative Policy Optimization with verifiable rewards
3. **Evaluation:** Comprehensive accuracy and error analysis

**Business context:** ClearClaim AI's current model achieves only 69% accuracy on multi-step insurance claims. Your goal is to build a reasoning model that exceeds 85% accuracy while producing auditable reasoning traces.

---

## 3.1 Data Acquisition and Preparation

We use two datasets:
1. **GSM8K** -- A math reasoning dataset for initial warm-up (the model learns general step-by-step reasoning)
2. **Synthetic insurance claims** -- A custom dataset with configurable complexity that mirrors real multi-step payout calculations

Each claim includes policy terms (deductible, coverage limit, depreciation rate, sub-limits, co-insurance) and a ground-truth payout computed deterministically from those terms.

In [None]:
import torch
import torch.nn.functional as F
import json
import re
import random
import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple, Optional, Dict
from datasets import load_dataset

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Load GSM8K for initial warm-up training
gsm8k = load_dataset("openai/gsm8k", "main")
print(f"GSM8K train size: {len(gsm8k['train'])}")
print(f"GSM8K test size: {len(gsm8k['test'])}")

# Inspect a sample
sample = gsm8k['train'][0]
print(f"\nQuestion: {sample['question'][:200]}...")
print(f"\nAnswer: {sample['answer'][:200]}...")

### Building the Synthetic Insurance Claims Dataset

Each claim must have a deterministically verifiable payout. The computation order is fixed:
1. Start with replacement cost for each item
2. Apply depreciation to get actual cash value (ACV)
3. Check each item against sub-limits (cap if needed)
4. Sum all items
5. Subtract deductible
6. Apply co-insurance percentage
7. Check against coverage limit (cap if needed)
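
As a worked illustration of the order above, here is a minimal sketch that applies the seven steps to one made-up claim. The dictionary keys mirror the ones read by `verify_dataset` later in this section, and depreciation is straight-line as in that function. The function name `compute_payout` and every number in `example_terms` are assumptions chosen for illustration, not part of the notebook's required API.

```python
def compute_payout(terms: dict) -> float:
    """Apply the fixed computation order to a policy_terms-style dict."""
    total = 0.0
    for item in terms.get("items", []):
        cost = item["replacement_cost"]                      # step 1: replacement cost
        if "depreciation_rate" in terms and "age_years" in item:
            rate = terms["depreciation_rate"]
            cost = max(cost * (1 - rate * item["age_years"]), 0)  # step 2: ACV, floored at 0
        cap = terms.get("sub_limits", {}).get(item.get("category"))
        if cap is not None:
            cost = min(cost, cap)                            # step 3: per-item sub-limit
        total += cost                                        # step 4: sum items
    total = max(total - terms.get("deductible", 0), 0)       # step 5: deductible
    total *= terms.get("co_insurance_insurer", 1.0)          # step 6: insurer's share
    if "coverage_limit" in terms:
        total = min(total, terms["coverage_limit"])          # step 7: coverage limit
    return round(total, 2)


# Hypothetical claim: a laptop capped by the electronics sub-limit, plus a sofa.
example_terms = {
    "items": [
        {"replacement_cost": 2000, "age_years": 2, "category": "electronics"},
        {"replacement_cost": 3000, "age_years": 1, "category": "furniture"},
    ],
    "depreciation_rate": 0.10,
    "sub_limits": {"electronics": 1500},
    "deductible": 500,
    "co_insurance_insurer": 0.8,
    "coverage_limit": 50000,
}
payout = compute_payout(example_terms)
print(f"Payout: ${payout:,.2f}")
```

Tracing by hand: the laptop depreciates to 1,600 and is capped at 1,500, the sofa depreciates to 2,700, the sum is 4,200, the deductible leaves 3,700, and the insurer's 80% share is 2,960. Swapping steps 5 and 6 would give a different number, which is why the order is fixed.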

In [None]:
@dataclass
class InsuranceClaim:
    """Represents a single insurance claim with policy terms and ground truth."""
    claim_description: str
    policy_terms: dict
    ground_truth_payout: float
    reasoning_steps: List[str]
    difficulty: str  # 'single_step', 'multi_step', 'complex'

**TODO 1: Implement the claims dataset generator.**

In [None]:
def generate_claims_dataset(n_claims: int = 5000, difficulty_mix: Optional[dict] = None) -> List[InsuranceClaim]:
    """
    Generate a synthetic insurance claims dataset with verifiable payouts.

    Each claim includes:
    - A natural language claim description
    - Structured policy terms (deductible, limits, depreciation, sub-limits, co-insurance)
    - The correct payout amount (computed deterministically from the terms)
    - Step-by-step reasoning showing how to arrive at the payout
    - Difficulty level

    The difficulty mix controls the distribution:
    - 'single_step': Only deductible subtraction (e.g., damage - deductible)
    - 'multi_step': Deductible + depreciation + sub-limits (3-4 steps)
    - 'complex': All of the above + co-insurance + coverage limit checks (5-7 steps)

    Args:
        n_claims: Total number of claims to generate
        difficulty_mix: Dict like {'single_step': 0.3, 'multi_step': 0.4, 'complex': 0.3}

    Returns:
        List of InsuranceClaim objects with verified ground truth payouts

    Hints:
        1. Start by defining ranges for each policy parameter:
           - Deductibles: [$500, $1000, $2500, $5000]
           - Coverage limits: [$50K, $100K, $250K, $500K]
           - Depreciation rates: [5%, 10%, 15%] per year, item age 1-15 years
           - Sub-limits: electronics $10K, jewelry $5K, art $25K
           - Co-insurance: 80/20, 70/30, 90/10
        2. For each claim, randomly sample parameters, then compute the payout
           step by step (this becomes both the ground truth AND the reasoning trace)
        3. The payout computation order matters:
           a. Start with replacement cost for each item
           b. Apply depreciation to get actual cash value (ACV)
           c. Check each item against sub-limits (cap if needed)
           d. Sum all items
           e. Subtract deductible
           f. Apply co-insurance percentage
           g. Check against coverage limit (cap if needed)
        4. Generate the natural language claim description from the parameters
    """
    # TODO: Implement this function
    pass

In [None]:
# Generate the dataset
claims = generate_claims_dataset(n_claims=5000, difficulty_mix={
    'single_step': 0.3, 'multi_step': 0.4, 'complex': 0.3
})
print(f"Generated {len(claims)} claims")
print(f"Difficulty distribution: { {d: sum(1 for c in claims if c.difficulty == d) for d in ['single_step', 'multi_step', 'complex']} }")

In [None]:
# Verification: recompute payouts and verify they match ground truth
def verify_dataset(claims: List[InsuranceClaim]) -> dict:
    """Recompute payouts from policy terms and verify they match ground truth."""
    correct = 0
    errors = []
    for i, claim in enumerate(claims):
        # Recompute using the same deterministic logic
        terms = claim.policy_terms
        total = 0
        for item in terms.get('items', []):
            cost = item['replacement_cost']
            # Apply depreciation
            if 'depreciation_rate' in terms and 'age_years' in item:
                cost = cost * (1 - terms['depreciation_rate'] * item['age_years'])
                cost = max(cost, 0)
            # Check sub-limits
            if item.get('category') in terms.get('sub_limits', {}):
                cost = min(cost, terms['sub_limits'][item['category']])
            total += cost
        # Apply deductible
        total = max(total - terms.get('deductible', 0), 0)
        # Apply co-insurance
        if 'co_insurance_insurer' in terms:
            total = total * terms['co_insurance_insurer']
        # Check coverage limit
        if 'coverage_limit' in terms:
            total = min(total, terms['coverage_limit'])
        total = round(total, 2)

        if abs(total - claim.ground_truth_payout) < 0.01:
            correct += 1
        else:
            errors.append((i, total, claim.ground_truth_payout))

    accuracy = correct / len(claims)
    print(f"Dataset verification: {correct}/{len(claims)} ({accuracy:.1%}) payouts verified")
    if errors:
        print(f"First 3 errors: {errors[:3]}")
    assert accuracy == 1.0, f"Dataset has {len(errors)} computation errors!"
    return {"verified": correct, "total": len(claims)}

verify_dataset(claims)

**Thought questions:**
- Why is it critical that the synthetic dataset is deterministically verifiable? What would happen during GRPO training if the ground-truth payouts were sometimes wrong?
- How does the difficulty mix affect the training curriculum? Should we train on easy claims first or mix all difficulties from the start?

---

## 3.2 Exploratory Data Analysis

Before training, we need to understand the distribution of our data.

In [None]:
import matplotlib.pyplot as plt

# Analyze GSM8K answer distribution
gsm8k_answers = []
for ex in gsm8k['train']:
    match = re.search(r'#### (\-?[\d,]+)', ex['answer'])
    if match:
        gsm8k_answers.append(float(match.group(1).replace(',', '')))

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.hist(gsm8k_answers, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Answer Value')
plt.ylabel('Count')
plt.title('GSM8K Answer Distribution')
plt.subplot(1, 2, 2)
plt.hist([len(ex['answer'].split()) for ex in gsm8k['train']], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Reasoning Length (words)')
plt.ylabel('Count')
plt.title('GSM8K Reasoning Trace Length')
plt.tight_layout()
plt.show()

**TODO 2: Analyze the insurance claims dataset.**

In [None]:
def analyze_claims_dataset(claims: List[InsuranceClaim]):
    """
    Produce exploratory data analysis plots for the insurance claims dataset.

    Generate the following visualizations:
    1. Payout distribution by difficulty level (three overlapping histograms)
    2. Number of reasoning steps vs. payout amount (scatter plot)
    3. Distribution of policy parameters (deductibles, limits, depreciation rates)
    4. Correlation heatmap between policy parameters and payout amount

    Also compute and print:
    - Mean and median payout by difficulty level
    - Average number of reasoning steps by difficulty level
    - The fraction of claims where sub-limits affect the payout
    - The fraction of claims where the coverage limit caps the payout

    Hints:
        1. Use matplotlib with 2x2 subplots for the four visualizations
        2. For the correlation heatmap, extract numeric fields from policy_terms
        3. Pay attention to how payout varies with the number of reasoning steps --
           later sections measure how accuracy changes with step count
    """
    # TODO: Implement this function
    pass

analyze_claims_dataset(claims)

**Thought questions:**
- What does the relationship between reasoning length and accuracy tell you about why SFT alone is insufficient?
- If the payout distribution is heavily right-skewed (a few very large claims), how might this affect GRPO training?

---

## 3.3 Baseline: Direct Payout Prediction

Before building a reasoning model, we implement the current baseline -- a model that directly predicts the payout amount without intermediate steps.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a small model for the baseline experiment
MODEL_NAME = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)

# If pad token is not set, use eos token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def format_direct_prompt(claim: InsuranceClaim) -> str:
    """Format a claim as a direct (no reasoning) prompt."""
    return f"""<claim>
{claim.claim_description}
</claim>

<policy>
{json.dumps(claim.policy_terms, indent=2)}
</policy>

What is the payout amount? Answer with just the dollar amount."""

print("Example direct prompt:")
print(format_direct_prompt(claims[0]))

**TODO 3: Implement and evaluate the direct prediction baseline.**
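
The metrics named in the `evaluate_direct_baseline` docstring can be sketched independently of any model. The helper below is a hypothetical illustration (the name `score_predictions`, the `parse_rate` field, and the sample numbers are assumptions); it treats an unparseable completion (`None`) as a miss and averages the absolute error over parsed predictions only, which is one reasonable convention among several.

```python
from typing import List, Optional


def score_predictions(predicted: List[Optional[float]], truth: List[float]) -> dict:
    """Score extracted payouts against ground truth; None means nothing could be parsed."""
    exact = within_10pct = 0
    abs_errors = []
    for p, t in zip(predicted, truth):
        if p is None:
            continue  # counts against both accuracy metrics, excluded from MAE
        err = abs(p - t)
        abs_errors.append(err)
        exact += int(err <= 1.0)                    # "exact" means within $1
        within_10pct += int(err <= 0.10 * abs(t))   # within 10% of ground truth
    n = len(truth)
    return {
        "exact_match": exact / n,
        "within_10pct": within_10pct / n,
        "mean_absolute_error": sum(abs_errors) / len(abs_errors) if abs_errors else float("nan"),
        "parse_rate": len(abs_errors) / n,
    }


# Made-up predictions for four claims; the second completion had no parseable amount.
scores = score_predictions([1000.0, None, 2500.0, 480.0], [1000.5, 300.0, 2000.0, 500.0])
print(scores)
```

Reporting the parse rate next to accuracy helps separate "the model computed the wrong number" from "the model did not output a number at all".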

In [None]:
def evaluate_direct_baseline(model, tokenizer, claims: List[InsuranceClaim], n_eval: int = 200) -> dict:
    """
    Evaluate the direct (no-reasoning) payout prediction baseline.

    For each claim:
    1. Format the claim as a direct prompt (no reasoning requested)
    2. Generate the model's completion (max 50 tokens)
    3. Extract the predicted payout amount from the completion
    4. Compare against ground truth

    Args:
        model: The language model
        tokenizer: The tokenizer
        claims: List of InsuranceClaim objects
        n_eval: Number of claims to evaluate

    Returns:
        Dict with:
        - 'exact_match': fraction of exact matches (within $1)
        - 'within_10pct': fraction within 10% of ground truth
        - 'mean_absolute_error': average dollar error
        - 'accuracy_by_difficulty': dict mapping difficulty to exact_match rate

    Hints:
        1. Use model.generate() with do_sample=False (greedy decoding) for deterministic evaluation
        2. Extract dollar amounts using regex: r'\\$?([\\d,]+\\.?\\d*)'
        3. Handle cases where the model outputs text instead of a number
        4. Track accuracy separately for each difficulty level
    """
    # TODO: Implement this function
    pass

# Run baseline evaluation
# baseline_results = evaluate_direct_baseline(model, tokenizer, claims[-200:])
# print(f"Baseline exact match: {baseline_results['exact_match']:.1%}")
# print(f"Baseline by difficulty: {baseline_results['accuracy_by_difficulty']}")

**Thought questions:**
- Why is the direct baseline likely to perform worse on multi-step claims?
- If you increased the model size from 0.5B to 7B, would the direct baseline improve significantly on multi-step claims? Why or why not?

---

## 3.4 Model Design: Chain-of-Thought Reasoning

Our reasoning model uses the same base architecture but is trained to produce step-by-step reasoning traces. The architecture does not change -- only the training procedure and expected output format change.

In [None]:
def format_reasoning_prompt(claim: InsuranceClaim) -> str:
    """Format a claim as a reasoning prompt (the <think> tags appear in the target, not the prompt)."""
    return f"""<claim>
{claim.claim_description}
</claim>

<policy>
{json.dumps(claim.policy_terms, indent=2)}
</policy>

Calculate the claim payout step by step."""


def format_reasoning_target(claim: InsuranceClaim) -> str:
    """Format the target completion with reasoning trace."""
    steps = "\n".join(f"Step {i+1}: {step}" for i, step in enumerate(claim.reasoning_steps))
    return f"""<think>
{steps}
</think>

The payout is ${claim.ground_truth_payout:,.2f}."""


# Inspect the format
print("=== PROMPT ===")
print(format_reasoning_prompt(claims[0]))
print("\n=== TARGET ===")
print(format_reasoning_target(claims[0]))

**TODO 4: Implement the SFT training loop.**

The SFT loss is: $\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$

The key detail: we only compute loss on the TARGET tokens, not the prompt tokens. This is done by setting prompt token labels to -100.
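
To make the masking concrete, here is a minimal pure-Python sketch with made-up per-token probabilities. The helper name `masked_nll`, the token ids, and the probabilities are illustrative assumptions. Positions labelled -100 are skipped, which matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def masked_nll(token_probs, labels):
    """Return (sum, mean) of -log p over positions whose label is not IGNORE_INDEX.

    token_probs[t] is the probability the model assigned to the true token at position t.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    if not kept:
        raise ValueError("every position is masked; there is nothing to train on")
    total = sum(kept)
    return total, total / len(kept)


# Three prompt tokens (masked) followed by two target tokens (hypothetical ids 17 and 42).
labels = [-100, -100, -100, 17, 42]
token_probs = [0.1, 0.2, 0.3, 0.5, 0.25]
total_loss, mean_loss = masked_nll(token_probs, labels)
print(f"sum = {total_loss:.4f}, mean over target tokens = {mean_loss:.4f}")
```

The low probabilities on the prompt positions do not affect the result: only the two target tokens contribute. The sum corresponds to the formula above; the mean is the per-token average that training code commonly reports.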

In [None]:
def sft_training_step(model, tokenizer, prompt: str, target: str, optimizer) -> float:
    """
    One step of supervised fine-tuning on a chain-of-thought example.

    Args:
        model: The language model
        tokenizer: The tokenizer
        prompt: The input prompt (claim + policy)
        target: The target completion (reasoning trace + answer)
        optimizer: The optimizer

    Returns:
        The loss value (float)

    Hints:
        1. Concatenate prompt + target into one sequence
        2. Tokenize the full sequence
        3. Create a labels tensor where prompt tokens are set to -100
           (so the loss is only computed on the target tokens)
        4. Forward pass through the model
        5. Compute cross-entropy loss only on target tokens
        6. Backpropagate and update
    """
    # TODO: Implement this function
    pass


def run_sft_training(model, tokenizer, claims: List[InsuranceClaim],
                     n_epochs: int = 2, lr: float = 2e-5) -> List[float]:
    """
    Run the full SFT training loop.

    Args:
        model: The language model
        tokenizer: The tokenizer
        claims: Training claims with reasoning traces
        n_epochs: Number of training epochs
        lr: Learning rate

    Returns:
        List of loss values for plotting

    Hints:
        1. Use AdamW optimizer with the specified learning rate
        2. Shuffle claims at the start of each epoch
        3. Log loss every 100 steps
        4. Save a checkpoint at the end of each epoch
    """
    # TODO: Implement this function
    pass

In [None]:
# Run SFT training (use a subset for speed in Colab)
# sft_losses = run_sft_training(model, tokenizer, claims[:1000], n_epochs=2)
# plt.plot(sft_losses)
# plt.xlabel('Step')
# plt.ylabel('Loss')
# plt.title('SFT Training Loss')
# plt.show()

**Thought questions:**
- Why do we mask the prompt tokens instead of training on the full sequence?
- After SFT, the model can produce well-formatted reasoning traces. But can it produce CORRECT reasoning? What fundamental limitation of maximum likelihood training causes this gap?

---

## 3.5 Training Strategy: GRPO with Verifiable Rewards

This is the core of the case study. GRPO uses the ground-truth payout as a verifiable reward signal -- no reward model needed.

The key equations:

**Group-relative advantage:** $\hat{A}_i = \frac{r_i - \bar{r}}{\sigma_r + \varepsilon}$

**Clipped surrogate loss:** $\mathcal{L}_{\text{GRPO}} = -\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\hat{A}_i\right)$

**KL-penalized reward:** $R_{\text{total}} = r(y, y^*) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$
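
The first two equations can be checked on small numbers without a model. The sketch below is an illustration under stated assumptions: it uses the population standard deviation (the equation does not say which estimator to use), returns `None` when all rewards in the group are equal, and the names `group_advantages` and `clipped_surrogate_loss` are hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group; None signals there is nothing to learn from."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    if std == 0:
        return None  # identical rewards: every advantage would be zero
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate_loss(ratios, advantages, epsilon=0.2):
    """Negative mean of min(rho * A, clip(rho, 1 - epsilon, 1 + epsilon) * A)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped_rho = max(1 - epsilon, min(rho, 1 + epsilon))
        terms.append(min(rho * adv, clipped_rho * adv))
    return -sum(terms) / len(terms)


# A group of G = 8 completions where two were correct (binary rewards).
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
no_signal = group_advantages([0.0] * 8)

# Two made-up probability ratios that both fall outside the clipping range.
loss = clipped_surrogate_loss([1.5, 0.5], [1.0, -1.0])
print(advantages, no_signal, loss)
```

In the two-element example, the ratio 1.5 with a positive advantage is clipped to 1.2, and the ratio 0.5 with a negative advantage is clipped to 0.8, so neither completion can move the policy further than the clipping range allows in a single update.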

In [None]:
GRPO_CONFIG = {
    'group_size': 8,
    'epsilon': 0.2,
    'beta': 0.05,
    'lr': 1e-6,
    'max_new_tokens': 512,
    'temperature': 0.7,
}

def extract_payout(completion: str) -> Optional[float]:
    """Extract the dollar payout amount from a model completion."""
    patterns = [
        r'payout\s+is\s+\$?([\d,]+\.?\d*)',
        r'####\s+\$?([\d,]+\.?\d*)',
        r'total.*?\$?([\d,]+\.?\d*)\s*$',
    ]
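    for pattern in patterns:
        match = re.search(pattern, completion, re.IGNORECASE | re.MULTILINE)
        if match:
            try:
                return float(match.group(1).replace(',', ''))
            except ValueError:
                # The capture group held no digits (e.g. only commas); try the next pattern
                continue
    return None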


def compute_reward(completion: str, ground_truth: float) -> float:
    """Binary reward: 1 if extracted payout matches ground truth, 0 otherwise."""
    predicted = extract_payout(completion)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - ground_truth) < 0.01 else 0.0

**TODO 5: Implement GRPO training.**

In [None]:
def compute_grpo_loss(model, ref_model, tokenizer, prompt: str,
                      completions: List[str], advantages: torch.Tensor,
                      epsilon: float = 0.2, beta: float = 0.05) -> torch.Tensor:
    """
    Compute the GRPO loss for a group of completions.

    Args:
        model: Current policy model
        ref_model: Frozen reference model (post-SFT checkpoint)
        tokenizer: The tokenizer
        prompt: The input prompt
        completions: List of G completions from the old policy
        advantages: Group-relative advantages, shape (G,)
        epsilon: Clipping parameter
        beta: KL penalty weight

    Returns:
        Scalar loss tensor

    Hints:
        1. For each completion, compute log p(completion | prompt) under both
           the current model and the reference model
        2. The probability ratio is rho_i = exp(log_p_current - log_p_old), where
           log_p_old comes from the policy that sampled the completions; with one
           gradient update per group it equals log_p_current.detach()
        3. Compute the clipped and unclipped objectives
        4. Take the element-wise minimum and negate its mean (we minimize the loss)
        5. Add the KL penalty to the loss: beta * mean(log_p_current - log_p_ref)
        6. Return the resulting scalar
    """
    # TODO: Implement this function
    pass


def grpo_training_step(model, ref_model, tokenizer, claim: InsuranceClaim,
                       optimizer, config: dict) -> dict:
    """
    One complete GRPO training step for a single claim.

    Steps:
    1. Format the claim as a reasoning prompt
    2. Generate G completions from the current model
    3. Compute binary rewards for each completion
    4. Compute group-relative advantages
    5. Compute GRPO loss
    6. Backpropagate and update

    Args:
        model: Current policy model
        ref_model: Frozen reference model
        tokenizer: The tokenizer
        claim: The insurance claim to train on
        optimizer: The optimizer
        config: GRPO hyperparameters

    Returns:
        Dict with 'loss', 'mean_reward', 'n_correct', 'advantages'

    Hints:
        1. Use model.generate() with do_sample=True
        2. After computing rewards, check if all rewards are the same.
           If so, skip the update (advantages would be all zeros).
        3. Compute advantages as (rewards - mean) / (std + 1e-8)
        4. Generate completions under torch.no_grad() so sampling stays out of the computation graph
    """
    # TODO: Implement this function
    pass

In [None]:
# Run GRPO training (use a subset for speed)
# import copy
# ref_model = copy.deepcopy(model)
# ref_model.eval()
# for param in ref_model.parameters():
#     param.requires_grad = False
#
# optimizer = torch.optim.AdamW(model.parameters(), lr=GRPO_CONFIG['lr'])
# grpo_rewards = []
# for step, claim in enumerate(claims[:500]):
#     result = grpo_training_step(model, ref_model, tokenizer, claim, optimizer, GRPO_CONFIG)
#     grpo_rewards.append(result['mean_reward'])
#     if step % 50 == 0:
#         print(f"Step {step}: reward={result['mean_reward']:.2f}, loss={result['loss']:.4f}")

**Thought questions:**
- What happens when all G completions get the same reward? Why must we skip the update?
- Why do we use temperature=0.7 instead of greedy decoding during GRPO?
- What would happen if we set beta=1.0? What about beta=0.001?

---

## 3.6 Evaluation

We evaluate the reasoning model against the direct prediction baseline.

In [None]:
import time

def evaluate_reasoning_model(model, tokenizer, test_claims: List[InsuranceClaim],
                             n_eval: int = 200) -> dict:
    """Evaluate the reasoning model on held-out claims."""
    results = {
        'exact_match': 0, 'step_correct': 0, 'format_compliant': 0,
        'total': 0, 'by_difficulty': {}, 'latencies': [],
    }
    model.eval()
    with torch.no_grad():
        for claim in test_claims[:n_eval]:
            prompt = format_reasoning_prompt(claim)
            # Keep inputs on the same device as the model
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            start = time.time()
            # Greedy decoding; **inputs also forwards the attention mask
            outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
            elapsed = time.time() - start
            completion = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

            # Check metrics
            predicted = extract_payout(completion)
            is_correct = predicted is not None and abs(predicted - claim.ground_truth_payout) < 0.01
            has_think = '<think>' in completion and '</think>' in completion

            results['exact_match'] += int(is_correct)
            results['format_compliant'] += int(has_think)
            results['total'] += 1
            results['latencies'].append(elapsed)

            d = claim.difficulty
            if d not in results['by_difficulty']:
                results['by_difficulty'][d] = {'correct': 0, 'total': 0}
            results['by_difficulty'][d]['total'] += 1
            results['by_difficulty'][d]['correct'] += int(is_correct)

    n = results['total']
    results['exact_match_rate'] = results['exact_match'] / n
    results['format_compliance_rate'] = results['format_compliant'] / n
    results['mean_latency'] = np.mean(results['latencies'])
    for d in results['by_difficulty']:
        info = results['by_difficulty'][d]
        info['accuracy'] = info['correct'] / info['total']
    return results

**TODO 6: Implement step-level verification and comparison plots.**
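
As a starting point for the step-level check, one possible approach is to scan the `<think>` block for expressions of the form `A op B = C` and recompute each one. The sketch below is a simplified illustration, not the full `verify_reasoning_steps` solution: the function name `check_arithmetic`, the regular expressions, and the sample trace are assumptions, and it counts expressions rather than numbered steps.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
NUM = r"\$?(-?\d[\d,]*(?:\.\d+)?)"  # optional $, digits with commas, optional decimals
EXPR = re.compile(NUM + r"\s*([+\-*/])\s*" + NUM + r"\s*=\s*" + NUM)


def check_arithmetic(completion: str, tol: float = 0.01) -> dict:
    """Count 'A op B = C' expressions inside <think>...</think> and how many are correct."""
    block = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if block is None:
        return {"format_valid": False, "n_expressions": 0, "n_correct": 0}
    n_expressions = n_correct = 0
    for a, op, b, c in EXPR.findall(block.group(1)):
        a, b, c = (float(v.replace(",", "")) for v in (a, b, c))
        n_expressions += 1
        if op == "/" and b == 0:
            continue  # division by zero can never be a correct step
        if abs(OPS[op](a, b) - c) <= tol:
            n_correct += 1
    return {"format_valid": True, "n_expressions": n_expressions, "n_correct": n_correct}


# Made-up trace: step 1 is right, step 2 contains an arithmetic slip (should be 1,100).
trace = """<think>
Step 1: $2,000 * 0.8 = $1,600
Step 2: 1,600 - 500 = 1,000
</think>

The payout is $1,000.00."""
result = check_arithmetic(trace)
print(result)
```

This only catches local arithmetic slips. It does not detect a correctly computed step that uses the wrong policy term or appears in the wrong order, which is what the error categories in Section 3.7 are for.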

In [None]:
def verify_reasoning_steps(completion: str, policy_terms: dict) -> dict:
    """
    Verify that each step in the reasoning trace is arithmetically correct.

    Args:
        completion: The model's full completion (with <think> block)
        policy_terms: The policy terms for reference

    Returns:
        Dict with 'n_steps', 'n_correct_steps', 'step_details', 'format_valid'

    Hints:
        1. Extract text between <think> and </think> tags
        2. Split on "Step N:" patterns
        3. For each step, look for arithmetic expressions (A * B = C, etc.)
        4. Evaluate the expression and compare to the stated result
    """
    # TODO: Implement this function
    pass


def plot_evaluation_comparison(baseline_results: dict, reasoning_results: dict):
    """
    Create side-by-side comparison plots.

    Generate:
    1. Grouped bar chart: accuracy by difficulty level
    2. Scatter plot: payout error vs claim complexity
    3. Histogram: reasoning trace lengths for correct vs incorrect
    4. Line plot: accuracy vs number of reasoning steps

    Hints:
        1. Use matplotlib with 2x2 subplots
        2. Blue for baseline, orange for reasoning model
    """
    # TODO: Implement this function
    pass

---

## 3.7 Error Analysis

Understanding failure modes is critical for improving the model and building trust.

**TODO 7: Implement systematic error categorization.**

In [None]:
def categorize_errors(model, tokenizer, test_claims: List[InsuranceClaim],
                      n_eval: int = 200) -> dict:
    """
    Categorize errors from the reasoning model.

    Error categories:
    1. 'arithmetic_error': Incorrect arithmetic in a step
    2. 'step_ordering_error': Steps applied in wrong order
    3. 'missing_step': A required step is omitted
    4. 'hallucinated_value': Model uses a number not in the claim/policy
    5. 'format_error': Malformed output
    6. 'extraction_error': Answer present but unparseable

    Args:
        model: The reasoning model
        tokenizer: The tokenizer
        test_claims: Held-out test claims

    Returns:
        Dict mapping error category to list of error instances

    Hints:
        1. Use verify_reasoning_steps() to identify arithmetic errors
        2. Expected step order: depreciation -> sub-limits -> sum ->
           deductible -> co-insurance -> coverage limit
        3. Check if all policy terms are referenced in the trace
        4. Compare numbers in trace against claim/policy numbers
    """
    # TODO: Implement this function
    pass

**Thought questions:**
- Which error category do you expect to be most common? Why?
- If the model consistently applies deductible before depreciation, is this an SFT problem or a GRPO problem?

---

## 3.8 Scalability and Deployment

ClearClaim needs to serve ~3,300 claims per day within a 15-second latency budget.
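
A quick back-of-the-envelope check helps put these two numbers in context. By Little's law, the average number of requests in flight equals the arrival rate times the time each request spends in the system. The sketch below assumes arrivals are spread evenly over 24 hours; the `peak_factor` parameter and its value of 10 are hypothetical, added only to show how bursty traffic changes the picture.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_concurrency(claims_per_day: float, latency_s: float, peak_factor: float = 1.0) -> float:
    """Average requests in flight (Little's law): arrival rate times latency."""
    arrival_rate = claims_per_day / SECONDS_PER_DAY * peak_factor  # claims per second
    return arrival_rate * latency_s


average_load = required_concurrency(3300, 15.0)
peak_load = required_concurrency(3300, 15.0, peak_factor=10.0)  # hypothetical burst multiplier
print(f"Average requests in flight: {average_load:.2f}")
print(f"With a 10x burst: {peak_load:.2f}")
```

At an even arrival rate, fewer than one request is in flight on average even if every claim uses the full 15 seconds, so per-request latency rather than throughput is likely the binding constraint unless traffic is heavily concentrated in peak hours.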

In [None]:
import time

def benchmark_inference(model, tokenizer, test_claims: List[InsuranceClaim],
                        n_runs: int = 50) -> dict:
    """Benchmark inference latency for the reasoning model."""
    latencies = []
    token_counts = []
    model.eval()
    with torch.no_grad():
        for claim in test_claims[:n_runs]:
            prompt = format_reasoning_prompt(claim)
            # Keep inputs on the same device as the model
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            start = time.time()
            # Greedy decoding; **inputs also forwards the attention mask
            outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
            elapsed = time.time() - start
            n_tokens = outputs.shape[1] - inputs.input_ids.shape[1]
            latencies.append(elapsed)
            token_counts.append(n_tokens)

    return {
        'mean_latency_s': np.mean(latencies),
        'p95_latency_s': np.percentile(latencies, 95),
        'mean_tokens': np.mean(token_counts),
        'tokens_per_second': np.mean([t/l for t, l in zip(token_counts, latencies)]),
    }

# results = benchmark_inference(model, tokenizer, claims[-50:])
# print(f"Mean latency: {results['mean_latency_s']:.2f}s")
# print(f"P95 latency: {results['p95_latency_s']:.2f}s")
# print(f"Tokens/sec: {results['tokens_per_second']:.1f}")

**TODO 8: Write an inference optimization plan.**

In [None]:
def plan_inference_optimization(benchmark_results: dict, target_latency_s: float = 15.0) -> str:
    """
    Analyze benchmarks and recommend optimizations.

    Consider:
    1. KV-cache optimization
    2. Speculative decoding
    3. Model quantization (INT8/INT4)
    4. Batching strategy
    5. Early stopping after answer detection

    For each: estimate speedup, accuracy tradeoff, implementation complexity.

    Returns:
        Formatted string report
    """
    # TODO: Implement this function
    pass

---

## 3.9 Ethical and Regulatory Analysis

Insurance claims adjudication directly affects people's financial well-being. Regulatory compliance is mandatory, not optional.

**TODO 9: Conduct an ethical impact assessment.**

In [None]:
def ethical_impact_assessment(model, tokenizer, test_claims: List[InsuranceClaim]) -> str:
    """
    Address:
    1. BIAS: Does accuracy vary by claim type or payout size?
    2. FAIRNESS: Demographic parity and equalized odds across subgroups
    3. REGULATION: NAIC Model Bulletin, state commissioner guidelines, SOC 2
    4. FAILURE RISKS: Worst-case overpayment vs underpayment scenarios
    5. HUMAN-IN-THE-LOOP: When to auto-approve vs require human review

    Hints:
        1. Group claims by type and compare error rates
        2. Underpayment causes more individual harm than overpayment
        3. A good human-review threshold: claims where top-2 predictions
           differ by more than 10%
    """
    # TODO: Implement this function
    pass

**Thought questions:**
- Should the model's confidence score affect auto-approval? How would you compute confidence from a generative model?
- If the model is more accurate on large claims, is this a fairness issue?

---

## Summary

In this notebook, you built the complete pipeline for training a reasoning model for insurance claims adjudication:

1. **Data:** Synthetic insurance claims with deterministically verifiable payouts
2. **Baseline:** Direct payout prediction (no reasoning)
3. **SFT:** Supervised fine-tuning on chain-of-thought traces to learn the FORMAT of reasoning
4. **GRPO:** Reinforcement learning with verifiable rewards to learn the QUALITY of reasoning
5. **Evaluation:** Comprehensive accuracy, step-level verification, and error analysis
6. **Deployment:** Latency benchmarking and optimization planning
7. **Ethics:** Bias analysis, fairness metrics, and regulatory compliance

The key insight: by defining what success looks like (correct payout amounts) and letting the model discover HOW to get there, GRPO can elicit reasoning behaviors -- self-verification, backtracking, and extended thinking for complex claims -- that SFT alone, which only imitates reference traces, does not directly reward.

For the full production system design (architecture diagrams, API design, monitoring, A/B testing, and cost analysis), refer to Section 4 of the case study document.