# Lab 3.4.2: Self-Consistency Implementation

**Module:** 3.4 - Test-Time Compute & Reasoning  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate-Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand why multiple reasoning paths improve accuracy
- [ ] Implement self-consistency with majority voting
- [ ] Experiment with different sampling parameters (N, temperature)
- [ ] Measure the accuracy improvement over single-sample CoT
- [ ] Understand the cost-quality tradeoff of self-consistency

---

## Prerequisites

- Completed Lab 3.4.1 (Chain-of-Thought Workshop)
- Ollama with a model loaded
- Understanding of CoT prompting

---

## Real-World Context

In 2022, Google researchers discovered that generating multiple reasoning paths and taking a **majority vote** consistently outperformed single-sample Chain-of-Thought. This technique, called **Self-Consistency**, became a key ingredient in achieving state-of-the-art results on reasoning benchmarks.

**Why does this matter?**
- Reasoning-focused models such as OpenAI's o1 also trade extra inference-time compute for accuracy (their internal methods are not public)
- High-stakes AI decisions (medical, legal) benefit from consensus
- It's a simple way to trade compute for accuracy

**Industry Applications:**
- **Medical Diagnosis:** Multiple diagnostic pathways, consensus on likely condition
- **Code Generation:** Generate multiple solutions, pick the most common pattern
- **Financial Modeling:** Multiple calculation approaches, validate consistency

---

## ELI5: Self-Consistency

> **Imagine you're lost in a forest...**
>
> You ask 5 different hikers for directions to the lake.
> - 3 hikers say "Go North"
> - 1 hiker says "Go East" 
> - 1 hiker says "Go West"
>
> You'd probably go North, right? The majority usually knows best!
>
> **That's exactly what Self-Consistency does with AI:**
> 1. Ask the model the same question multiple times
> 2. Use temperature > 0 so each answer takes a slightly different path
> 3. Take a vote on the final answer
>
> Even if some paths make mistakes, the majority often converges on the right answer!

```
Single CoT:     Problem ─> Path1 ─> Answer (hope it's right!)

Self-Consistency:
                Problem ─> Path1 ─> Answer A ┐
                Problem ─> Path2 ─> Answer A ├─> VOTE ─> Answer A wins!
                Problem ─> Path3 ─> Answer A │
                Problem ─> Path4 ─> Answer B │
                Problem ─> Path5 ─> Answer A ┘
```
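To see why voting can help, it is useful to work the numbers under a simplified model. The sketch below assumes (as an idealization, not something the lab states) that each sampled path is independently correct with the same probability `p`, and that a strict majority of correct paths is required to win. `majority_correct_prob` is a name made up for this illustration.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct.

    Simplifying assumptions: every sample is correct with the same
    probability p, samples are independent, and n is odd (no ties).
    """
    need = n // 2 + 1  # smallest number of correct samples that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 5, 11):
    print(f"N={n:>2}: P(majority correct) = {majority_correct_prob(0.6, n):.3f}")
```

Under these assumptions, p = 0.6 gives 0.600, 0.648, and 0.683 for N = 1, 3, and 5. The same formula shows the flip side: if p is below 0.5, voting makes the result worse. Samples drawn from one model are correlated, so treat this as intuition, not as a prediction of real gains.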

---

## Part 1: Setup

In [None]:
import json
import time
import re
from collections import Counter
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import ollama

# Configuration
MODEL = "qwen3:8b"  # Change to your available model

# For DGX Spark with 128GB unified memory:
# MODEL = "qwen3:32b"  # Better reasoning, ~45GB

print(f"Using model: {MODEL}")

# Test connection
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Say 'ready'"}],
    options={"num_predict": 10}
)
print(f"Model response: {response['message']['content']}")

In [None]:
# Load test problems
data_path = Path("../data/gsm8k_sample.json")
if data_path.exists():
    with open(data_path) as f:
        problems = json.load(f)
else:
    # Fallback sample
    problems = [
        {
            "question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
            "numerical_answer": 18
        },
        {
            "question": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
            "numerical_answer": 3
        },
        {
            "question": "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?",
            "numerical_answer": 70000
        },
    ]

print(f"Loaded {len(problems)} problems")

---

## Part 2: Building the Self-Consistency Implementation

Let's implement self-consistency step by step.

### Key Python Tool: Counter for Majority Voting

The `Counter` class from Python's `collections` module is perfect for counting votes:

```python
from collections import Counter

# Count occurrences in a list
answers = ["42", "42", "41", "42", "43"]
counter = Counter(answers)
# Counter({'42': 3, '41': 1, '43': 1})

# Get the most common item(s)
most_common = counter.most_common(1)  # Returns [('42', 3)]
winner, count = most_common[0]        # winner='42', count=3
```

This makes implementing majority voting simple and efficient!
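One detail worth knowing before relying on `most_common(1)`: when two answers have the same count, `Counter` returns the one it saw first, so a tie is resolved silently by sample order. The sketch below shows that behavior next to a hypothetical helper, `vote_with_ties`, that returns every tied winner so the caller can decide what to do.

```python
from collections import Counter
from typing import List, Tuple


def vote_with_ties(answers: List[str]) -> Tuple[List[str], int]:
    """Return every answer that shares the top vote count, plus that count."""
    if not answers:
        return [], 0
    counts = Counter(answers)
    top = max(counts.values())
    winners = [a for a, c in counts.items() if c == top]
    return winners, top


votes = ["18", "20", "20", "18"]
print(Counter(votes).most_common(1))  # [('18', 2)]: the first-seen answer wins the tie
print(vote_with_ties(votes))          # (['18', '20'], 2)
```

If a tie is detected, one reasonable option is to draw another sample; another is to report low confidence.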

In [None]:
def _normalize_number(text: str) -> str:
    """Strip thousands separators and render whole numbers without a decimal point."""
    cleaned = text.replace(',', '')
    try:
        num = float(cleaned)
    except ValueError:
        return cleaned
    return str(int(num)) if num.is_integer() else str(num)


def extract_answer(response: str) -> Optional[str]:
    """
    Extract the final numeric answer from a CoT response.
    
    Returns the answer as a normalized string for voting, or None if no number is found.
    """
    # A number: one digit, then digits/commas, then an optional decimal part.
    # Requiring a leading digit keeps a bare "," from matching as an answer.
    number = r"\d[\d,]*(?:\.\d+)?"
    
    # Patterns to look for, most specific first
    patterns = [
        rf"[Tt]he (?:final )?answer is[:\s]+\$?({number})",
        rf"[Aa]nswer[:\s]+\$?({number})",
        rf"=\s*\$?({number})\s*(?:$|\.|\n|dollars|miles|hours)",
        rf"\$\s*({number})",
    ]
    
    for pattern in patterns:
        matches = re.findall(pattern, response)
        if matches:
            # Return the last match (usually the final answer)
            return _normalize_number(matches[-1])
    
    # Fallback: last number in response
    all_numbers = re.findall(rf"-?{number}", response)
    if all_numbers:
        return _normalize_number(all_numbers[-1])
    
    return None


# Test extraction
test_cases = [
    "So the total is 42. The answer is 42.",
    "After calculation, we get $18 per day.",
    "The profit = 200000 - 80000 - 50000 = 70000",
]

print("Testing answer extraction:")
for tc in test_cases:
    ans = extract_answer(tc)
    print(f"  '{tc[:40]}...' -> {ans}")

In [None]:
def self_consistency(
    question: str,
    n_samples: int = 5,
    temperature: float = 0.7,
    model: str = MODEL,
    verbose: bool = True,
) -> Tuple[Optional[str], float, List[Optional[str]], List[str]]:
    """
    Implement self-consistency: generate multiple reasoning paths and vote.
    
    Args:
        question: The problem to solve
        n_samples: Number of reasoning paths to generate
        temperature: Sampling temperature (>0 for diversity)
        model: Model to use
        verbose: Whether to print progress
    
    Returns:
        Tuple of (best_answer, confidence, all_answers, all_responses)
    """
    prompt = f"{question}\n\nLet's think step by step:"
    
    answers = []
    responses = []
    
    for i in range(n_samples):
        if verbose:
            print(f"  Generating path {i+1}/{n_samples}...", end=" ", flush=True)
        
        start_time = time.time()
        
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": temperature, "num_predict": 512}
        )
        
        response_text = response['message']['content']
        responses.append(response_text)
        
        # Extract answer
        answer = extract_answer(response_text)
        answers.append(answer)
        
        elapsed = time.time() - start_time
        if verbose:
            print(f"Answer: {answer} ({elapsed:.1f}s)")
    
    # Filter out None values for voting
    valid_answers = [a for a in answers if a is not None]
    
    if not valid_answers:
        return None, 0.0, answers, responses
    
    # Majority vote
    counter = Counter(valid_answers)
    best_answer, count = counter.most_common(1)[0]
    confidence = count / len(valid_answers)
    
    if verbose:
        print(f"\n  Vote distribution: {dict(counter)}")
        print(f"  Winner: {best_answer} (confidence: {confidence:.0%})")
    
    return best_answer, confidence, answers, responses

In [None]:
# Test self-consistency on one problem
test_problem = problems[0]

print(f"Problem: {test_problem['question'][:80]}...")
print(f"Expected answer: {test_problem['numerical_answer']}")
print("\nRunning self-consistency with N=5:\n")

best, confidence, all_answers, responses = self_consistency(
    test_problem['question'],
    n_samples=5,
    temperature=0.7,
    verbose=True
)

print(f"\nFinal answer: {best}")
print(f"Correct: {str(test_problem['numerical_answer']) == best}")

### What Just Happened?

We generated 5 different reasoning paths with temperature=0.7. Notice:
- Each path might take slightly different approaches
- Sometimes paths lead to different answers
- The majority vote helps filter out occasional errors

**Confidence** is the fraction of valid (non-`None`) answers that match the winner. High agreement (80%+) is a good sign, though a model can also agree with itself on a wrong answer.
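Because failed extractions are dropped before voting, the confidence value can look higher than the raw samples justify. The sketch below reports agreement against both denominators. `vote_summary` is an illustrative helper, not part of the lab's code.

```python
from collections import Counter
from typing import List, Optional, Tuple


def vote_summary(answers: List[Optional[str]]) -> Tuple[Optional[str], float, float]:
    """Return (winner, agreement among valid answers, agreement among all samples)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0, 0.0
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid), count / len(answers)


# Two of five samples produced an extractable answer, and both agree
winner, conf_valid, conf_all = vote_summary(["18", None, "18", None, None])
print(winner, conf_valid, conf_all)  # 18 1.0 0.4
```

A large gap between the two numbers suggests the extraction step, not the reasoning, is the weak point for that problem.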

---

## Part 3: Comparing Single-Sample vs Self-Consistency

Let's systematically compare the accuracy of single-sample CoT vs self-consistency.
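When reading the comparison, keep the sample size in mind. As a rough guide, the sketch below computes the binomial standard error of an accuracy estimate. It assumes problems are independent draws, and `accuracy_standard_error` is a helper name invented for this example.

```python
from math import sqrt


def accuracy_standard_error(accuracy: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy estimate over n_problems."""
    return sqrt(accuracy * (1 - accuracy) / n_problems)


for n in (5, 50, 500):
    se = accuracy_standard_error(0.8, n)
    print(f"n={n:>3}: 80% accuracy, standard error ~ {se:.1%}")
```

With 5 problems the standard error at 80% accuracy is about 18 percentage points, so a one-problem difference between two methods is well within noise. Use more problems before drawing conclusions.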

In [None]:
def evaluate_single_cot(problems: List[Dict], n_problems: int = 5) -> Dict:
    """Evaluate single-sample CoT (temperature=0)."""
    results = []
    
    for i, prob in enumerate(problems[:n_problems]):
        print(f"Problem {i+1}/{n_problems}...", end=" ", flush=True)
        
        prompt = f"{prob['question']}\n\nLet's think step by step:"
        
        start_time = time.time()
        response = ollama.chat(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": 0.0, "num_predict": 512}
        )
        elapsed = time.time() - start_time
        
        response_text = response['message']['content']
        predicted = extract_answer(response_text)
        expected = str(prob['numerical_answer'])
        correct = predicted == expected
        
        results.append({
            'predicted': predicted,
            'expected': expected,
            'correct': correct,
            'time': elapsed,
        })
        
        status = "CORRECT" if correct else "WRONG"
        print(f"{status} (pred={predicted}, exp={expected})")
    
    accuracy = sum(1 for r in results if r['correct']) / len(results)
    avg_time = sum(r['time'] for r in results) / len(results)
    
    return {
        'method': 'single_cot',
        'accuracy': accuracy,
        'avg_time': avg_time,
        'results': results,
    }


def evaluate_self_consistency(
    problems: List[Dict],
    n_problems: int = 5,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> Dict:
    """Evaluate self-consistency."""
    results = []
    
    for i, prob in enumerate(problems[:n_problems]):
        print(f"\nProblem {i+1}/{n_problems}:")
        
        start_time = time.time()
        best_answer, confidence, all_answers, _ = self_consistency(
            prob['question'],
            n_samples=n_samples,
            temperature=temperature,
            verbose=True,
        )
        elapsed = time.time() - start_time
        
        expected = str(prob['numerical_answer'])
        correct = best_answer == expected
        
        results.append({
            'predicted': best_answer,
            'expected': expected,
            'correct': correct,
            'confidence': confidence,
            'all_answers': all_answers,
            'time': elapsed,
        })
        
        status = "CORRECT" if correct else "WRONG"
        print(f"  Result: {status} (pred={best_answer}, exp={expected})")
    
    accuracy = sum(1 for r in results if r['correct']) / len(results)
    avg_time = sum(r['time'] for r in results) / len(results)
    avg_conf = sum(r['confidence'] for r in results) / len(results)
    
    return {
        'method': 'self_consistency',
        'n_samples': n_samples,
        'temperature': temperature,
        'accuracy': accuracy,
        'avg_time': avg_time,
        'avg_confidence': avg_conf,
        'results': results,
    }

In [None]:
# Run comparison
N_PROBLEMS = 5  # Use 5 problems for quick testing

print("="*60)
print("EVALUATING SINGLE-SAMPLE COT (temperature=0)")
print("="*60)
single_results = evaluate_single_cot(problems, n_problems=N_PROBLEMS)

print("\n" + "="*60)
print("EVALUATING SELF-CONSISTENCY (N=5, temperature=0.7)")
print("="*60)
sc_results = evaluate_self_consistency(
    problems,
    n_problems=N_PROBLEMS,
    n_samples=5,
    temperature=0.7
)

In [None]:
# Summary comparison
print("\n" + "="*60)
print("RESULTS COMPARISON")
print("="*60)

print(f"\n{'Method':<25} {'Accuracy':<12} {'Avg Time':<12}")
print("-" * 50)
for label, res in [("Single-sample CoT", single_results), ("Self-Consistency (N=5)", sc_results)]:
    avg_time = f"{res['avg_time']:.1f}s"  # attach the unit before padding so columns align
    print(f"{label:<25} {res['accuracy']:<12.1%} {avg_time:<12}")
print("-" * 50)

improvement = sc_results['accuracy'] - single_results['accuracy']
time_ratio = sc_results['avg_time'] / single_results['avg_time']

print(f"\nAccuracy improvement: {improvement:+.1%}")
print(f"Time increase: {time_ratio:.1f}x")
print(f"Average confidence: {sc_results['avg_confidence']:.0%}")

---

## Part 4: Experimenting with Parameters

Two key parameters affect self-consistency:
1. **N (number of samples):** More samples make the vote more stable, with diminishing returns; cost grows linearly with N
2. **Temperature:** Higher values produce more diverse reasoning paths, but also more erroneous ones
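To make the temperature effect concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of next-token scores. The logits are made up for illustration, and `softmax_with_temperature` is not an Ollama function; it only mirrors the standard formula that sampling temperature refers to.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (0 corresponds to greedy decoding)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate next tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

Lower temperatures concentrate probability on the top token, so repeated samples follow nearly the same path. Higher temperatures flatten the distribution, which is where the diversity for voting comes from.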

In [None]:
# Test different N values
n_values = [3, 5, 10]
n_test_problems = 3  # Quick test with 3 problems

results_by_n = {}

for n in n_values:
    print(f"\n{'='*60}")
    print(f"Testing N={n}")
    print("="*60)
    
    result = evaluate_self_consistency(
        problems,
        n_problems=n_test_problems,
        n_samples=n,
        temperature=0.7
    )
    results_by_n[n] = result

# Summary
print("\n" + "="*60)
print("EFFECT OF N ON ACCURACY")
print("="*60)
print(f"\n{'N':<10} {'Accuracy':<12} {'Avg Time':<12} {'Avg Confidence':<15}")
print("-" * 50)
for n, result in results_by_n.items():
    avg_time = f"{result['avg_time']:.1f}s"  # attach the unit before padding so columns align
    print(f"{n:<10} {result['accuracy']:<12.1%} {avg_time:<12} {result['avg_confidence']:<15.0%}")

In [None]:
# Test different temperatures
temperatures = [0.3, 0.7, 1.0]

results_by_temp = {}

for temp in temperatures:
    print(f"\n{'='*60}")
    print(f"Testing temperature={temp}")
    print("="*60)
    
    result = evaluate_self_consistency(
        problems,
        n_problems=n_test_problems,
        n_samples=5,
        temperature=temp
    )
    results_by_temp[temp] = result

# Summary
print("\n" + "="*60)
print("EFFECT OF TEMPERATURE ON ACCURACY")
print("="*60)
print(f"\n{'Temp':<10} {'Accuracy':<12} {'Avg Confidence':<15} {'Answer Diversity':<15}")
print("-" * 55)
for temp, result in results_by_temp.items():
    # Answer diversity: unique / valid answers per problem, averaged (pooling across problems would mix different questions)
    ratios = []
    for r in result['results']:
        valid = [a for a in r['all_answers'] if a is not None]
        if valid:
            ratios.append(len(set(valid)) / len(valid))
    diversity = sum(ratios) / max(len(ratios), 1)
    print(f"{temp:<10} {result['accuracy']:<12.1%} {result['avg_confidence']:<15.0%} {diversity:<15.0%}")

### Insights on Parameter Tuning

**Number of Samples (N):**
- More samples generally helps, but with diminishing returns
- N=5 is a good default balance
- N=10+ for high-stakes decisions

**Temperature:**
- Too low (e.g. 0.3): Samples tend to follow the same path, so voting adds little
- Too high (1.0+): Samples become more random, and more paths contain errors
- Common starting range: 0.5-0.8; confirm on your own task and model

---

## Part 5: Confidence as a Quality Signal

One benefit of self-consistency: **confidence tells us when to trust the answer!**
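One way to act on this signal is to accept the voted answer only when agreement clears a threshold and to flag everything else for a second look. The sketch below is a hypothetical policy: the `answer_or_escalate` name, the `"NEEDS_REVIEW"` marker, and the 0.8 default are all illustrative choices.

```python
from typing import Optional


def answer_or_escalate(
    best_answer: Optional[str],
    confidence: float,
    threshold: float = 0.8,
) -> str:
    """Return the voted answer when agreement is high, otherwise a review marker."""
    if best_answer is None or confidence < threshold:
        return "NEEDS_REVIEW"  # hypothetical marker; replace with your own fallback
    return best_answer


print(answer_or_escalate("18", 0.8))  # 18
print(answer_or_escalate("18", 0.6))  # NEEDS_REVIEW
```

The right threshold depends on how costly a wrong answer is compared with a manual review, so it is worth checking against measured accuracy in each confidence bucket.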

In [None]:
def analyze_confidence(results: Dict) -> Dict:
    """
    Analyze the relationship between confidence and correctness.
    
    Higher confidence should correlate with higher accuracy!
    """
    high_conf = [r for r in results['results'] if r['confidence'] >= 0.8]
    low_conf = [r for r in results['results'] if r['confidence'] < 0.8]
    
    high_acc = sum(1 for r in high_conf if r['correct']) / max(len(high_conf), 1)
    low_acc = sum(1 for r in low_conf if r['correct']) / max(len(low_conf), 1)
    
    return {
        'high_confidence': {
            'count': len(high_conf),
            'accuracy': high_acc,
        },
        'low_confidence': {
            'count': len(low_conf),
            'accuracy': low_acc,
        }
    }


# Analyze our results
conf_analysis = analyze_confidence(sc_results)

print("Confidence vs Accuracy Analysis:")
print(f"\n  High confidence (>=80%):")
print(f"    Count: {conf_analysis['high_confidence']['count']}")
print(f"    Accuracy: {conf_analysis['high_confidence']['accuracy']:.0%}")
print(f"\n  Low confidence (<80%):")
print(f"    Count: {conf_analysis['low_confidence']['count']}")
print(f"    Accuracy: {conf_analysis['low_confidence']['accuracy']:.0%}")

print("\nKey insight: higher agreement should go with higher accuracy. Check whether the numbers above bear that out.")

---

## Part 6: Cost-Quality Tradeoff

Self-consistency trades compute for accuracy. Let's quantify this tradeoff.
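Token usage can be measured instead of estimated if the responses carry a generated-token count. The sketch below assumes each response exposes an integer `eval_count` field, which Ollama reports for completed generations; verify the field against your installed client version. To stay runnable without a model, it uses stand-in dictionaries, and `total_generated_tokens` is a helper name invented here.

```python
from typing import Iterable, Mapping


def total_generated_tokens(responses: Iterable[Mapping]) -> int:
    """Sum the generated-token counts across responses.

    Assumes each response has an integer 'eval_count' field;
    missing or empty fields count as 0.
    """
    return sum(int(r.get("eval_count", 0) or 0) for r in responses)


# Stand-in dictionaries shaped like the relevant part of a chat response
fake_responses = [{"eval_count": 210}, {"eval_count": 185}, {"eval_count": 240}]
print(total_generated_tokens(fake_responses))  # 635
```

Dividing the self-consistency total by the single-sample total gives a measured token multiplier, which can differ from N when sampled paths vary in length.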

In [None]:
def calculate_cost_quality_tradeoff(single_results: Dict, sc_results: Dict) -> Dict:
    """
    Calculate the cost (tokens, time) vs quality (accuracy) tradeoff.
    """
    single_time = single_results['avg_time']
    sc_time = sc_results['avg_time']
    
    single_acc = single_results['accuracy']
    sc_acc = sc_results['accuracy']
    
    # Rough token estimate (1 token ~ 4 chars, ~300 chars per response).
    # This is a fixed constant, so the token multiplier below always equals n_samples.
    single_tokens = 75
    sc_tokens = single_tokens * sc_results['n_samples']
    
    return {
        'time_multiplier': sc_time / single_time,
        'token_multiplier': sc_tokens / single_tokens,
        'accuracy_gain': sc_acc - single_acc,
        'accuracy_gain_per_token': (sc_acc - single_acc) / (sc_tokens - single_tokens) if sc_tokens > single_tokens else 0,
    }


tradeoff = calculate_cost_quality_tradeoff(single_results, sc_results)

print("Cost-Quality Tradeoff Analysis:")
print(f"\n  Time multiplier: {tradeoff['time_multiplier']:.1f}x")
print(f"  Token multiplier: {tradeoff['token_multiplier']:.0f}x")
print(f"  Accuracy gain: {tradeoff['accuracy_gain']:+.1%}")

print("\nKey question: Is the accuracy gain worth the extra compute?")
print("  - For high-stakes decisions: Usually YES")
print("  - For cost-sensitive applications: Consider smaller N")
print("  - For latency-critical: Maybe single CoT is better")

---

## Common Mistakes

### Mistake 1: Using Temperature = 0

```python
# Wrong: Temperature = 0 is greedy decoding, so samples come out (nearly) identical!
self_consistency(question, n_samples=5, temperature=0.0)
# Result: All 5 samples give the same answer - voting is useless

# Right: Use temperature > 0 for diversity
self_consistency(question, n_samples=5, temperature=0.7)
# Result: Different reasoning paths lead to meaningful voting
```

### Mistake 2: Not Normalizing Answers Before Voting

```python
# Wrong: Treating "42" and "42.0" and "$42" as different answers
answers = ["42", "42.0", "$42", "42", "42"]
Counter(answers)  # Counter({'42': 3, '42.0': 1, '$42': 1}) - looks like a 3-1-1 split

# Right: Normalize before voting
answers = [normalize(a) for a in answers]  # ["42", "42", "42", "42", "42"]
Counter(answers)  # {'42': 5} - Unanimous!
```
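The `normalize` call in the snippet above is a placeholder. A minimal sketch of what it could look like for numeric answers follows; it handles only a dollar sign and thousands separators, and other formats (percent signs, units, fractions) would need their own rules.

```python
from typing import Optional


def normalize(raw: Optional[str]) -> Optional[str]:
    """Canonicalize a numeric answer string so equal values compare equal."""
    if raw is None:
        return None
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return cleaned or None  # not a number: keep the text, drop empty strings
    if value.is_integer():
        return str(int(value))
    return str(value)


print([normalize(a) for a in ["42", "42.0", "$42", "1,000", "3.50"]])
# ['42', '42', '42', '1000', '3.5']
```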

### Mistake 3: Too Few Samples

```python
# Risky: N=3 leaves no margin
answers = ["A", "A", "B"]  # 2-1 split (67%); one more stray sample flips or ties the vote

# Better: N=5 or more gives the vote room to absorb an outlier
answers = ["A", "A", "A", "A", "B"]  # 4-1 (80%), a clearer majority
```

---

## Checkpoint

You've learned:
- ✅ **What self-consistency is:** Multiple paths + majority voting
- ✅ **How to implement it:** Generate with temperature > 0, extract answers, vote
- ✅ **Key parameters:** N (samples) and temperature
- ✅ **Confidence as a signal:** High confidence = more trustworthy
- ✅ **The tradeoff:** More compute → better accuracy

---

## Challenge: Adaptive Self-Consistency

Can we be smarter about when to use more samples? Implement "early stopping" - if the first 3 samples all agree, we probably don't need 10!
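A stopping rule can be tested offline before it is wired to a model. As one possible rule (an illustrative alternative to a fixed confidence threshold), the sketch below stops once the leading answer can no longer be overtaken by the samples that remain. `leader_is_unbeatable` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Optional


def leader_is_unbeatable(answers: List[Optional[str]], max_samples: int) -> bool:
    """True if no other answer can catch the current leader with the samples left."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return False
    ranked = Counter(valid).most_common(2)
    lead = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    remaining = max_samples - len(answers)
    return lead - runner_up > remaining


print(leader_is_unbeatable(["18", "18", "18"], max_samples=5))   # True
print(leader_is_unbeatable(["18", "18", "20"], max_samples=10))  # False
```

This rule never changes the final vote compared with drawing all `max_samples`; it only skips samples that could not affect the outcome. A confidence threshold stops earlier but accepts some risk that later samples would have changed the winner.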

In [None]:
def adaptive_self_consistency(
    question: str,
    min_samples: int = 3,
    max_samples: int = 10,
    confidence_threshold: float = 0.8,
    temperature: float = 0.7,
    model: str = MODEL,
) -> Tuple[Optional[str], float, List[Optional[str]], int]:
    """
    Adaptive self-consistency: stop early if confident.
    
    Returns:
        (best_answer, confidence, all_answers, n_samples_used)
    """
    prompt = f"{question}\n\nLet's think step by step:"
    answers = []
    
    for i in range(max_samples):
        # Generate sample
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": temperature, "num_predict": 512}
        )
        
        answer = extract_answer(response['message']['content'])
        answers.append(answer)
        
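        # Check if we can stop early (requires at least min_samples valid answers,
        # so one lucky extraction among failures cannot count as 100% agreement)
        if i + 1 >= min_samples:
            valid_answers = [a for a in answers if a is not None]
            if len(valid_answers) >= min_samples: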
                counter = Counter(valid_answers)
                best_answer, count = counter.most_common(1)[0]
                confidence = count / len(valid_answers)
                
                if confidence >= confidence_threshold:
                    print(f"  Early stopping at {i+1} samples (confidence: {confidence:.0%})")
                    return best_answer, confidence, answers, i + 1
    
    # Used all samples
    valid_answers = [a for a in answers if a is not None]
    if not valid_answers:
        return None, 0.0, answers, max_samples
    
    counter = Counter(valid_answers)
    best_answer, count = counter.most_common(1)[0]
    confidence = count / len(valid_answers)
    
    print(f"  Used all {max_samples} samples (confidence: {confidence:.0%})")
    return best_answer, confidence, answers, max_samples


# Test adaptive self-consistency
print("Testing adaptive self-consistency:")
for i, prob in enumerate(problems[:3]):
    print(f"\nProblem {i+1}: {prob['question'][:50]}...")
    answer, conf, all_ans, n_used = adaptive_self_consistency(
        prob['question'],
        min_samples=3,
        max_samples=8,
        confidence_threshold=0.8
    )
    print(f"  Answer: {answer} (expected: {prob['numerical_answer']})")
    print(f"  Samples used: {n_used}")

---

## Further Reading

- [Self-Consistency Paper](https://arxiv.org/abs/2203.11171) - The original Google research
- [Universal Self-Consistency](https://arxiv.org/abs/2311.17311) - Works without answer extraction
- [Complexity-Based Prompting](https://arxiv.org/abs/2210.01717) - Route by complexity

---

## Cleanup

We print a summary of the results and call `gc.collect()` to release unreferenced Python objects. The model weights live in the Ollama server process, so this does not unload the model.

In [None]:
# Summarize our results
summary = {
    'model': MODEL,
    'single_cot_accuracy': single_results['accuracy'],
    'self_consistency_accuracy': sc_results['accuracy'],
    'improvement': sc_results['accuracy'] - single_results['accuracy'],
    'n_samples': sc_results['n_samples'],
    'avg_confidence': sc_results['avg_confidence'],
}

print("Summary:")
print(json.dumps(summary, indent=2))

import gc
gc.collect()
print("\nMemory cleaned up.")

---

## Next Steps

Excellent work! You've mastered self-consistency, a key test-time compute strategy.

In the next lab, you'll explore **DeepSeek-R1**, a reasoning model that "thinks out loud" with explicit `<think>` tokens. Instead of sampling and voting from outside, it spends its extra test-time compute on a long reasoning trace inside a single response.

**Continue to:** [Lab 3.4.3: DeepSeek-R1 Exploration](./lab-3.4.3-deepseek-r1.ipynb)