# Lab 3.4.1: Chain-of-Thought Workshop

**Module:** 3.4 - Test-Time Compute & Reasoning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐ (Intermediate)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand why Chain-of-Thought (CoT) prompting dramatically improves reasoning
- [ ] Implement zero-shot CoT with "Let's think step by step"
- [ ] Create effective few-shot CoT examples
- [ ] Measure accuracy improvements on math problems
- [ ] Know when CoT helps (and when it doesn't)

---

## Prerequisites

- Completed Module 3.3 (Deployment & Inference)
- Ollama installed with a model loaded (e.g., `qwen3:8b`)
- Basic Python and prompting experience

---

## Real-World Context

Before ChatGPT could solve math problems reliably, researchers discovered something remarkable: prompting a model to reason through intermediate steps could dramatically improve accuracy. Google's 2022 Chain-of-Thought paper reported that few-shot CoT prompting raised GSM8K math accuracy on PaLM 540B from about 18% to about 57%, and a follow-up paper found that simply appending **"Let's think step by step"** (zero-shot CoT) helps too!

Today, most major AI systems use some form of reasoning strategy. When you ask Claude or GPT-4 a complex question, they're often "thinking" through the problem before responding. Understanding CoT is foundational to understanding how modern AI reasons.

**Industry Applications:**
- **Customer Support:** Breaking down complex troubleshooting into steps
- **Code Review:** Analyzing code changes systematically
- **Medical Diagnosis:** Walking through differential diagnosis
- **Financial Analysis:** Step-by-step valuation reasoning

---

## ELI5: Chain-of-Thought Prompting

> **Imagine you're taking a math test...**
>
> Your teacher says: "Just write the final answer."
> You might rush, make mistakes, and get it wrong.
>
> But then your teacher says: "Show your work! Write each step."
> Now you slow down, think through each part, and catch errors.
>
> **That's exactly what we're doing with AI!**
>
> When we say "Let's think step by step," we're asking the AI to
> "show its work" - and just like students, AI performs much better
> when it reasons through problems explicitly.
>
> **Why does this work?**
> - Each step constrains the next (harder to make wild jumps)
> - The model "sees" its own reasoning (can catch errors)
> - Complex problems become manageable chunks

```
Without CoT:  Problem ──────────────────> Answer  (one big leap, easy to fall!)

With CoT:     Problem ─> Step1 ─> Step2 ─> Step3 ─> Answer  (small safe steps)
```

---

## Part 1: Setup and Environment Check

Let's verify our environment and set up the tools we'll need.

In [None]:
# Install required packages (if not already installed)
# !pip install ollama

import json
import time
import re
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Check if ollama is available
try:
    import ollama
    print("Ollama library loaded successfully!")
except ImportError:
    print("Please install ollama: pip install ollama")
    raise

# Check available models
try:
    models = ollama.list()
    print(f"\nAvailable models:")
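    for model in models.get('models', []):
        # Newer ollama-python releases expose the tag as 'model'; older ones used 'name'
        print(f"  - {model.get('model') or model.get('name', 'unknown')}")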
except Exception as e:
    print(f"Note: Could not list models. Make sure 'ollama serve' is running.")
    print(f"Error: {e}")

In [None]:
# Configuration
MODEL = "qwen3:8b"  # Change this to your available model

# For DGX Spark, you can use larger models:
# MODEL = "qwen3:32b"  # ~20GB with Q4 quantization
# MODEL = "qwen2.5:72b"   # ~45GB with Q4 quantization

print(f"Using model: {MODEL}")

# Test connection
try:
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": "Say 'ready' if you can hear me."}],
        options={"num_predict": 10}
    )
    print(f"Model response: {response['message']['content']}")
    print("Connection successful!")
except Exception as e:
    print(f"Error connecting to model: {e}")
    print("Make sure Ollama is running and the model is downloaded.")

---

## Part 2: The Problem with Direct Answering

Let's first see what happens when we ask the model to answer directly, without any reasoning.

In [None]:
# A math problem that requires multiple steps
problem = """
A store sells apples for $2 each and oranges for $3 each.
If I buy 5 apples and some oranges and spend $25 total, 
how many oranges did I buy?
"""

# Direct answer prompt (no reasoning requested)
direct_prompt = f"{problem}\n\nAnswer with just the number:"

print("Problem:")
print(problem)
print("\n" + "="*50)
print("Asking for direct answer...\n")

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": direct_prompt}],
    options={"temperature": 0.0, "num_predict": 50}
)

print(f"Model's answer: {response['message']['content']}")
print(f"\nCorrect answer: 5 oranges")
print("(5 apples * $2 = $10, leaving $15 for oranges. $15 / $3 = 5 oranges)")

### What Just Happened?

The model may or may not get this right. Even if it does, we don't know *how* it arrived at the answer. When models try to jump directly to answers:

- They're more likely to make arithmetic errors
- They might miss steps in the reasoning
- We can't verify their thinking
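
One caveat when running this comparison with a model that has a built-in thinking mode (Qwen3 is one example): the reply may begin with its own reasoning, in which case the "direct" answer is not really direct, and a small `num_predict` can be used up before the number appears. A minimal sketch, assuming the reasoning arrives inside `<think>...</think>` tags in the message content (this varies by model and Ollama version); `strip_think_blocks` is a hypothetical helper, not part of the `ollama` library:

```python
import re

# Assumption: reasoning, if present, is wrapped in <think>...</think> tags
# inside the returned text. Not every model or Ollama version does this.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think_blocks(text: str) -> str:
    """Remove complete <think>...</think> spans and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\n5 apples cost $10, so $15 is left.\n</think>\n\n5"
print(strip_think_blocks(raw))  # 5
```

If the output was cut off before the closing tag, nothing is stripped, which is itself a useful signal that the token limit was too low for a fair "direct" comparison.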

Let's try with Chain-of-Thought prompting next!

---

## Part 3: Zero-Shot Chain-of-Thought

Zero-shot CoT is beautifully simple: just add **"Let's think step by step"** to your prompt. No examples needed!

In [None]:
def zero_shot_cot(question: str, model: str = MODEL) -> str:
    """
    Apply zero-shot Chain-of-Thought prompting.
    
    Just adds "Let's think step by step" - that's it!
    """
    prompt = f"{question}\n\nLet's think step by step:"
    
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.0, "num_predict": 512}
    )
    
    return response['message']['content']


# Test on our problem
print("Problem:")
print(problem)
print("\n" + "="*50)
print("Using Chain-of-Thought...\n")

cot_response = zero_shot_cot(problem)
print(cot_response)

### What Just Happened?

Notice how the model:
1. Broke down the problem into manageable pieces
2. Showed its arithmetic at each step
3. Built up to the final answer logically

This is the magic of CoT - the model *discovers* the right approach by reasoning through it!

---

## Part 4: Let's Test on Multiple Problems

Now let's systematically compare direct answering vs. CoT on a set of problems.

---

## Key Python Tools for This Lab

Before we evaluate our methods, let's understand some key Python tools we'll use:

### Regular Expressions (`re` module)
Regular expressions help us find patterns in text - essential for extracting answers from model responses.

```python
import re

# re.findall(pattern, text) - finds ALL matches of a pattern
numbers = re.findall(r'\d+', "I have 5 apples and 3 oranges")
# Returns: ['5', '3']

# Common patterns:
# \d+     - one or more digits
# \$?     - optional dollar sign
# [\d,]+  - digits with optional commas (like "1,000")
```
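
To see how these pieces combine, here is a small sketch (the sample sentence and the `NUMBER` name are made up for illustration). It also shows a pitfall: `[\d,]+` on its own will happily match a lone comma, so it is safer to require a leading digit.

```python
import re

# Optional "$", then a number that must START with a digit, may contain
# thousands separators, and may have a decimal part.
NUMBER = re.compile(r"\$?(\d[\d,]*(?:\.\d+)?)")

text = "Rent is $1,250.50, utilities are 80, total 1,330.50."
values = [float(m.replace(",", "")) for m in NUMBER.findall(text)]
print(values)  # [1250.5, 80.0, 1330.5]

# Pitfall: without the leading \d, punctuation counts as a "number"
print(re.findall(r"[\d,]+", "156 meters, roughly"))  # ['156', ',']
```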

### Path and JSON for Data Loading
```python
from pathlib import Path
import json

# Path provides clean file path handling
data_path = Path("../data/file.json")
if data_path.exists():
    with open(data_path) as f:
        data = json.load(f)  # Parse JSON file into Python dict/list
```
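
The evaluation cells later in this lab read from a list called `problems`, where each entry has a `question` and a `numerical_answer`. A minimal sketch of building that list, assuming the JSON file holds a list of such dicts (the file layout is an assumption, and `SAMPLE_PROBLEMS` is an illustrative stand-in written for this sketch, not the lab's dataset):

```python
import json
from pathlib import Path
from typing import Dict, List

# Illustrative stand-in problems (NOT the lab's dataset); answers checked by hand.
SAMPLE_PROBLEMS: List[Dict] = [
    {"question": "A notebook costs $4 and a pen costs $1.50. How much do 3 notebooks and 4 pens cost in total?",
     "numerical_answer": 18.0},
    {"question": "A train travels at 60 miles per hour for 2.5 hours. How many miles does it travel?",
     "numerical_answer": 150.0},
    {"question": "Sara has 48 stickers. She gives a quarter of them to her brother and then buys 10 more. How many stickers does she have now?",
     "numerical_answer": 46.0},
    {"question": "A rectangle is 7 meters long and 3 meters wide. What is its perimeter in meters?",
     "numerical_answer": 20.0},
    {"question": "A class has 30 students and 40% of them are absent. How many students are present?",
     "numerical_answer": 18.0},
]


def load_problems(path: Path) -> List[Dict]:
    """Load problems from a JSON list, or fall back to SAMPLE_PROBLEMS."""
    if not path.exists():
        return SAMPLE_PROBLEMS
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"Expected a JSON list of problems in {path}")
    # Keep only entries that have both keys the evaluation code reads.
    return [p for p in data if "question" in p and "numerical_answer" in p]


problems = load_problems(Path("../data/file.json"))
print(f"Loaded {len(problems)} problems")
```

Five hand-written problems are enough to exercise the code, but far too few to draw conclusions about accuracy.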

### Time Measurement
```python
import time

start = time.time()
# ... do something ...
elapsed = time.time() - start  # Time in seconds
```
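
For timing short operations, `time.perf_counter()` is often a better fit than `time.time()` because it is monotonic and high-resolution. A small sketch of wrapping it in a context manager (`timed` is a hypothetical helper name, and the `sleep` call stands in for a model request):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, log: dict):
    """Record elapsed wall-clock seconds for the enclosed block in log[label]."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    try:
        yield
    finally:
        log[label] = time.perf_counter() - start


timings = {}
with timed("sleep", timings):
    time.sleep(0.05)  # stand-in for a model call
print(f"{timings['sleep']:.3f}s")
```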

In [None]:
def extract_answer(response: str) -> Optional[float]:
    """
    Extract the numerical answer from a model response.
    
    Looks for patterns like:
    - "The answer is 42"
    - "= 42"
    - "$42"
    """
    # Try various patterns
    patterns = [
        r"[Tt]he (?:final )?answer is[:\s]+\$?([\d,]+(?:\.\d+)?)",
        r"[Aa]nswer[:\s]+\$?([\d,]+(?:\.\d+)?)",
        r"=\s*\$?([\d,]+(?:\.\d+)?)\s*(?:$|\.|\n)",
        r"\$\s*([\d,]+(?:\.\d+)?)",
    ]
    
    for pattern in patterns:
        matches = re.findall(pattern, response)
        if matches:
            # Take the last match (usually the final answer)
            num_str = matches[-1].replace(',', '')
            try:
                return float(num_str)
            except ValueError:
                continue
    
    # Fallback: find last number in response (it must start with a digit,
    # so a stray comma in the text is never mistaken for a number)
    all_numbers = re.findall(r'-?\d[\d,]*(?:\.\d+)?', response)
    if all_numbers:
        num_str = all_numbers[-1].replace(',', '')
        try:
            return float(num_str)
        except ValueError:
            pass
    
    return None


def is_correct(predicted: Optional[float], expected: float, tolerance: float = 0.01) -> bool:
    """Check if predicted answer matches expected (with tolerance)."""
    if predicted is None:
        return False
    if expected == 0:
        return abs(predicted) < tolerance
    return abs(predicted - expected) / abs(expected) < tolerance


# Test our extraction
test_responses = [
    "Let's calculate... 5 + 3 = 8. The answer is 8.",
    "After all the calculations, we get $42.",
    "So the total would be 156 meters.",
]

print("Testing answer extraction:")
for resp in test_responses:
    answer = extract_answer(resp)
    print(f"  '{resp[:40]}...' -> {answer}")

In [None]:
def evaluate_method(
    problems: List[Dict],
    method: str = "cot",
    n_problems: int = 5,
    verbose: bool = True
) -> Dict:
    """
    Evaluate a prompting method on a set of problems.
    
    Args:
        problems: List of problem dicts with 'question' and 'numerical_answer'
        method: "direct" or "cot"
        n_problems: Number of problems to evaluate
        verbose: Whether to print progress
    
    Returns:
        Dict with accuracy and results
    """
    results = []
    
    for i, problem in enumerate(problems[:n_problems]):
        question = problem['question']
        expected = problem['numerical_answer']
        
        if verbose:
            print(f"\nProblem {i+1}/{n_problems}: {question[:50]}...")
        
        # Build prompt based on method
        if method == "direct":
            prompt = f"{question}\n\nGive only the numerical answer:"
        else:  # cot
            prompt = f"{question}\n\nLet's think step by step:"
        
        # Get response
        start_time = time.time()
        response = ollama.chat(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": 0.0, "num_predict": 512}
        )
        elapsed = time.time() - start_time
        
        response_text = response['message']['content']
        predicted = extract_answer(response_text)
        correct = is_correct(predicted, expected)
        
        results.append({
            'question': question,
            'expected': expected,
            'predicted': predicted,
            'correct': correct,
            'response': response_text,
            'time': elapsed,
        })
        
        if verbose:
            status = "CORRECT" if correct else "WRONG"
            print(f"  Expected: {expected}, Predicted: {predicted} [{status}]")
    
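    if not results:
        raise ValueError("No problems were evaluated; check that `problems` is not empty.")

    accuracy = sum(1 for r in results if r['correct']) / len(results)
    avg_time = sum(r['time'] for r in results) / len(results)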
    
    return {
        'method': method,
        'accuracy': accuracy,
        'correct': sum(1 for r in results if r['correct']),
        'total': len(results),
        'avg_time': avg_time,
        'results': results,
    }

In [None]:
# Compare methods!
n_test = min(5, len(problems))  # Use 5 problems for quick testing

print("="*60)
print("EVALUATING DIRECT ANSWERING")
print("="*60)
direct_results = evaluate_method(problems, method="direct", n_problems=n_test)

print("\n" + "="*60)
print("EVALUATING CHAIN-OF-THOUGHT")
print("="*60)
cot_results = evaluate_method(problems, method="cot", n_problems=n_test)

In [None]:
# Summary comparison
print("\n" + "="*60)
print("RESULTS COMPARISON")
print("="*60)
print(f"\n{'Method':<20} {'Accuracy':<15} {'Correct/Total':<15} {'Avg Time':<10}")
print("-" * 60)
print(f"{'Direct Answering':<20} {direct_results['accuracy']:<15.1%} {direct_results['correct']}/{direct_results['total']:<12} {direct_results['avg_time']:.1f}s")
print(f"{'Chain-of-Thought':<20} {cot_results['accuracy']:<15.1%} {cot_results['correct']}/{cot_results['total']:<12} {cot_results['avg_time']:.1f}s")
print("-" * 60)

improvement = cot_results['accuracy'] - direct_results['accuracy']
print(f"\nAccuracy improvement with CoT: {improvement:+.1%}")

if improvement > 0:
    print("Chain-of-Thought prompting improved accuracy!")
elif improvement < 0:
    print("Interesting - direct answering performed better on this sample.")
    print("(This can happen with small samples or simple problems)")
else:
    print("Both methods performed equally on this sample.")
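
With only a handful of problems, a single accuracy number hides a lot. One way to look closer is to compare the two methods problem by problem. The sketch below assumes two lists of result dicts in the same problem order, each with a boolean `correct` key (the shape stored under `'results'` by `evaluate_method`); `paired_breakdown` is a hypothetical helper, and the sample data is made up purely to show the output shape.

```python
from typing import Dict, List


def paired_breakdown(direct: List[Dict], cot: List[Dict]) -> Dict[str, int]:
    """Count per-problem outcomes for two result lists in the same problem order."""
    if len(direct) != len(cot):
        raise ValueError("Both result lists must cover the same problems")
    counts = {"both_correct": 0, "only_cot": 0, "only_direct": 0, "both_wrong": 0}
    for d, c in zip(direct, cot):
        if d["correct"] and c["correct"]:
            counts["both_correct"] += 1
        elif c["correct"]:
            counts["only_cot"] += 1
        elif d["correct"]:
            counts["only_direct"] += 1
        else:
            counts["both_wrong"] += 1
    return counts


# Made-up outcomes purely to show the output shape (not measured results).
fake_direct = [{"correct": True}, {"correct": False}, {"correct": False}]
fake_cot = [{"correct": True}, {"correct": True}, {"correct": False}]
print(paired_breakdown(fake_direct, fake_cot))
# {'both_correct': 1, 'only_cot': 1, 'only_direct': 0, 'both_wrong': 1}
```

In the notebook this would be called as `paired_breakdown(direct_results['results'], cot_results['results'])`. The `only_cot` and `only_direct` counts are the interesting ones: they show where the two methods actually disagree.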

---

## Part 5: Few-Shot Chain-of-Thought

We can make CoT even more powerful by showing the model *examples* of good reasoning. This is called **few-shot CoT**.

> **ELI5:** It's like showing a student a worked example before asking them to solve a new problem. They learn the "style" of reasoning from the example!

In [None]:
# Create high-quality few-shot examples
FEW_SHOT_EXAMPLES = [
    {
        "question": "If there are 3 cars in the parking lot and 2 more arrive, how many cars are there?",
        "reasoning": """Let me solve this step by step:
1. Start with 3 cars in the parking lot
2. 2 more cars arrive
3. Total cars = 3 + 2 = 5 cars

The answer is 5."""
    },
    {
        "question": "Tom has 8 apples. He gives 3 to his friend and buys 5 more. How many apples does Tom have now?",
        "reasoning": """Let me solve this step by step:
1. Tom starts with 8 apples
2. He gives away 3 apples: 8 - 3 = 5 apples
3. He buys 5 more apples: 5 + 5 = 10 apples

The answer is 10."""
    },
    {
        "question": "A store sells pens for $2 each. If I have $15, how many pens can I buy and how much money will I have left?",
        "reasoning": """Let me solve this step by step:
1. Each pen costs $2
2. I have $15 to spend
3. Number of pens I can buy = $15 / $2 = 7.5, so I can buy 7 whole pens
4. Cost of 7 pens = 7 * $2 = $14
5. Money left = $15 - $14 = $1

The answer is 7 pens with $1 left."""
    },
]


def few_shot_cot(question: str, examples: List[Dict] = FEW_SHOT_EXAMPLES, model: str = MODEL) -> str:
    """
    Apply few-shot Chain-of-Thought prompting.
    
    Shows examples of good reasoning before the actual question.
    """
    # Build the prompt with examples
    prompt_parts = []
    
    for ex in examples:
        prompt_parts.append(f"Q: {ex['question']}")
        prompt_parts.append(f"A: {ex['reasoning']}")
        prompt_parts.append("")  # Empty line between examples
    
    # Add the actual question
    prompt_parts.append(f"Q: {question}")
    prompt_parts.append("A: Let me solve this step by step:")
    
    prompt = "\n".join(prompt_parts)
    
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.0, "num_predict": 512}
    )
    
    return response['message']['content']


# Test few-shot CoT
test_question = "A baker has 24 cupcakes. She sells 8 in the morning and 10 in the afternoon. How many cupcakes does she have left?"

print("Question:", test_question)
print("\n" + "="*50)
print("Few-Shot CoT Response:\n")
print(few_shot_cot(test_question))

### Try It Yourself: Create Your Own Examples

The quality of few-shot examples matters! Try creating examples that match your use case.

<details>
<summary>Hint: Tips for good examples</summary>

1. **Be consistent in format** - Use the same step numbering and phrasing
2. **Show intermediate calculations** - Don't skip mental math
3. **End with clear answer** - "The answer is X" is easy to extract
4. **Match problem complexity** - Your examples should be similar difficulty
5. **Use 2-4 examples** - More isn't always better (context limits!)
</details>
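
A few of these tips can be checked mechanically before you spend any model calls. The sketch below is one possible lint pass, assuming examples use the same `question` / `reasoning` dict shape as `FEW_SHOT_EXAMPLES`; `check_examples` and its specific rules are illustrative choices, not a standard.

```python
import re
from typing import Dict, List

# Last line should read "The answer is <something>."
ANSWER_LINE = re.compile(r"The answer is .+\.\s*$")


def check_examples(examples: List[Dict], min_n: int = 2, max_n: int = 4) -> List[str]:
    """Return a list of human-readable issues; an empty list means all checks passed."""
    issues = []
    if not (min_n <= len(examples) <= max_n):
        issues.append(f"expected {min_n}-{max_n} examples, got {len(examples)}")
    # Consistent format: every example should open with the same first line.
    openers = {ex["reasoning"].splitlines()[0] for ex in examples if ex["reasoning"]}
    if len(openers) > 1:
        issues.append("examples open with different phrasing")
    for i, ex in enumerate(examples):
        if not ANSWER_LINE.search(ex["reasoning"]):
            issues.append(f"example {i} does not end with 'The answer is X.'")
        if not re.search(r"^\d+\.", ex["reasoning"], flags=re.MULTILINE):
            issues.append(f"example {i} has no numbered steps")
    return issues


good = [
    {"question": "What is 3 + 2?",
     "reasoning": "Let me solve this step by step:\n1. 3 + 2 = 5\n\nThe answer is 5."},
    {"question": "What is 8 - 3?",
     "reasoning": "Let me solve this step by step:\n1. 8 - 3 = 5\n\nThe answer is 5."},
]
bad = good + [{"question": "What is 3 * 4?", "reasoning": "It is obviously 12"}]

print(check_examples(good))  # []
for issue in check_examples(bad):
    print("-", issue)
```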

In [None]:
# TODO: Create your own few-shot examples for a specific domain
# For example: percentage calculations, time/distance, or unit conversions

MY_EXAMPLES = [
    {
        "question": "Your question here",
        "reasoning": """Your step-by-step reasoning here...
1. First step
2. Second step
3. Third step

The answer is X."""
    },
    # Add more examples...
]

# Test your examples
# test_response = few_shot_cot("Your test question", examples=MY_EXAMPLES)
# print(test_response)

---

## Part 6: When Does CoT Help (and When Doesn't It)?

CoT isn't magic - it helps most on certain types of problems. Let's explore!

In [None]:
# Problems where CoT helps most
cot_helps_problems = [
    # Multi-step arithmetic
    "If a shirt costs $45 and is on sale for 20% off, how much do I save?",
    
    # Word problems requiring interpretation
    "A train leaves at 9am traveling at 60mph. Another train leaves at 10am traveling at 80mph. When does the second train catch up?",
    
    # Problems with multiple constraints
    "I need to buy at least 100 items. Items come in boxes of 12. What's the minimum number of boxes I need?",
    
    # Sequential reasoning
    "If Monday is 2 days after yesterday, what day is today?",
]

# Problems where CoT might not help as much
cot_doesnt_help_problems = [
    # Simple factual recall
    "What is the capital of France?",
    
    # Direct calculations
    "What is 7 + 5?",
    
    # Pattern recognition
    "Complete the pattern: 2, 4, 6, 8, _",
    
    # Simple lookups
    "How many days are in February in a leap year?",
]


def compare_on_problem(question: str) -> Dict:
    """Compare direct vs CoT on a single problem."""
    # Direct
    direct_response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": f"{question}\n\nAnswer:"}],
        options={"temperature": 0.0, "num_predict": 100}
    )['message']['content']
    
    # CoT
    cot_response = zero_shot_cot(question)
    
    return {
        'question': question,
        'direct': direct_response,
        'cot': cot_response,
    }


print("Testing problems where CoT typically HELPS:\n")
for q in cot_helps_problems[:2]:  # Test first 2
    result = compare_on_problem(q)
    print(f"Q: {q}")
    print(f"Direct: {result['direct'][:100]}...")
    print(f"CoT: {result['cot'][:150]}...")
    print()

### Key Insight: When to Use CoT

| Use CoT When... | Skip CoT When... |
|----------------|------------------|
| Multi-step reasoning required | Simple factual questions |
| Math word problems | Direct calculations |
| Logical puzzles | Pattern completion |
| Cause-and-effect chains | Yes/no questions |
| Constraint satisfaction | Simple lookups |

**Rule of thumb:** If a human would need to "think it through," use CoT!

---

## Common Mistakes

### Mistake 1: Not Extracting the Final Answer

```python
# Wrong: Just using the full response
response = zero_shot_cot("What is 5+3?")
answer = response  # This includes all the reasoning!

# Right: Extract just the answer
response = zero_shot_cot("What is 5+3?")
answer = extract_answer(response)  # Just the number: 8.0
```

### Mistake 2: Using Temperature Too High

```python
messages = [{"role": "user", "content": "What is 5+3? Let's think step by step:"}]

# Wrong: High temperature for reasoning
response = ollama.chat(model=MODEL, messages=messages, options={"temperature": 1.0})
# Risk: Model may give creative but wrong reasoning

# Right: Low temperature for consistent reasoning
response = ollama.chat(model=MODEL, messages=messages, options={"temperature": 0.0})
# Greedy decoding: far more repeatable from run to run
```

### Mistake 3: Too Many Few-Shot Examples

```python
# Wrong: 10 examples (uses up context window)
examples = [ex1, ex2, ex3, ex4, ex5, ex6, ex7, ex8, ex9, ex10]

# Right: 2-4 high-quality examples
examples = [ex1, ex2, ex3]  # Quality over quantity!
```
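
If you do have a large pool of examples, you can cap how many go into the prompt by giving them a rough token budget. The sketch below assumes about 4 characters per token, which is only a rule of thumb and not a real tokenizer; `estimate_tokens` and `select_examples` are hypothetical helpers.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def select_examples(examples: List[Dict], budget_tokens: int) -> List[Dict]:
    """Keep examples in order until the estimated token budget would be exceeded."""
    chosen, used = [], 0
    for ex in examples:
        cost = estimate_tokens(ex["question"]) + estimate_tokens(ex["reasoning"])
        if used + cost > budget_tokens:
            break
        chosen.append(ex)
        used += cost
    return chosen


# Ten synthetic examples of ~50 estimated tokens each
demo = [{"question": "q" * 40, "reasoning": "r" * 160} for _ in range(10)]
print(len(select_examples(demo, budget_tokens=200)))  # 4
```

Because selection is in order, put your strongest examples first.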

---

## Checkpoint

You've learned:
- ✅ **What CoT is:** Prompting models to "show their work"
- ✅ **Zero-shot CoT:** Just add "Let's think step by step"
- ✅ **Few-shot CoT:** Provide examples of good reasoning
- ✅ **When to use it:** Multi-step reasoning, math, logic puzzles
- ✅ **How to measure:** Extract answers and compare accuracy

---

## Challenge: Build a CoT Prompt Template Library

Create a library of CoT prompt templates for different domains. This is what production AI systems actually use!

In [None]:
class CoTPromptLibrary:
    """
    A library of Chain-of-Thought prompt templates for different domains.
    
    Usage:
        library = CoTPromptLibrary()
        prompt = library.get_prompt("math", "What is 15% of 80?")
    """
    
    TEMPLATES = {
        "math": {
            "system": "You are a math tutor. Always show your work step by step.",
            "trigger": "Let me solve this step by step:",
            "examples": [
                {
                    "q": "What is 25% of 80?",
                    "a": """Let me solve this step by step:
1. 25% means 25 per 100, or 25/100 = 0.25
2. To find 25% of 80, multiply: 80 * 0.25
3. 80 * 0.25 = 20

The answer is 20."""
                }
            ]
        },
        "code_debug": {
            "system": "You are a debugging expert. Analyze code systematically.",
            "trigger": "Let me analyze this code step by step:",
            "examples": []
        },
        "logic": {
            "system": "You are a logic expert. Use formal reasoning.",
            "trigger": "Let me reason through this logically:",
            "examples": []
        },
    }
    
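    def __init__(self):
        # Copy each template (and its examples list) so changes made on an
        # instance never leak back into the class-level TEMPLATES
        self.templates = {
            name: {**tpl, "examples": list(tpl["examples"])}
            for name, tpl in self.TEMPLATES.items()
        }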
    
    def get_prompt(self, domain: str, question: str, use_examples: bool = True) -> str:
        """Get a CoT prompt for a given domain and question."""
        if domain not in self.templates:
            raise ValueError(f"Unknown domain: {domain}. Available: {list(self.templates.keys())}")
        
        template = self.templates[domain]
        
        parts = []
        
        # Add examples if available and requested
        if use_examples and template.get('examples'):
            for ex in template['examples']:
                parts.append(f"Q: {ex['q']}")
                parts.append(f"A: {ex['a']}")
                parts.append("")
        
        # Add the actual question
        parts.append(f"Q: {question}")
        parts.append(f"A: {template['trigger']}")
        
        return "\n".join(parts)
    
    def add_domain(self, domain: str, system: str, trigger: str, examples: Optional[List[Dict]] = None):
        """Add a new domain to the library."""
        self.templates[domain] = {
            "system": system,
            "trigger": trigger,
            "examples": examples or []
        }


# Test the library
library = CoTPromptLibrary()

# Math question
math_prompt = library.get_prompt("math", "What is 15% of 240?")
print("Math Prompt:")
print(math_prompt)
print("\n" + "="*50 + "\n")

# Get response
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": math_prompt}],
    options={"temperature": 0.0}
)
print("Response:")
print(response['message']['content'])
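
Each template also carries a `system` string that `get_prompt` does not include in its output. One way to put it to use is to send it as a separate system message. A minimal sketch, assuming the standard chat message format of `role` / `content` dicts; `build_messages` is a hypothetical helper, not a method of the class above:

```python
from typing import Dict, List


def build_messages(system: str, prompt: str) -> List[Dict[str, str]]:
    """Pair a system instruction with the user prompt in chat-message form."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return messages


msgs = build_messages(
    "You are a math tutor. Always show your work step by step.",
    "Q: What is 15% of 240?\nA: Let me solve this step by step:",
)
for m in msgs:
    print(m["role"], "->", m["content"][:40])
```

In the notebook, the result could be passed as `messages=build_messages(library.templates["math"]["system"], math_prompt)` in the `ollama.chat` call.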

---

## Further Reading

- [Chain-of-Thought Prompting Paper](https://arxiv.org/abs/2201.11903) - The original Google research
- [Zero-Shot CoT Paper](https://arxiv.org/abs/2205.11916) - "Let's think step by step"
- [Self-Consistency Paper](https://arxiv.org/abs/2203.11171) - Next lab's topic!
- [Prompt Engineering Guide](https://www.promptingguide.ai/techniques/cot) - Practical tips

---

## Cleanup

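We use `gc.collect()` (garbage collection) to free Python objects that are no longer referenced in the notebook process. Note that the model weights live in the Ollama server rather than in this notebook, so Ollama releases that memory on its own once the model has been idle for a while.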

In [None]:
# Summarize results for later comparison
results_summary = {
    'model': MODEL,
    'direct_accuracy': direct_results['accuracy'],
    'cot_accuracy': cot_results['accuracy'],
    'improvement': cot_results['accuracy'] - direct_results['accuracy'],
    'n_problems': n_test,
}

print("Results summary (for comparison with other methods):")
print(json.dumps(results_summary, indent=2))

# Free unreferenced Python objects in this process
# (the model itself is held by the Ollama server, not by this notebook)
import gc
gc.collect()
print("\nPython garbage collection complete.")

---

## Next Steps

Congratulations on completing the Chain-of-Thought Workshop!

In the next lab, you'll learn **Self-Consistency** - a powerful technique that generates multiple reasoning paths and votes on the best answer. This often beats CoT alone!
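
As a small preview, the voting step can be sketched in a few lines. This assumes you already have several extracted answers for the same question (for example, from repeated sampled runs passed through `extract_answer`); `majority_vote` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[float]]) -> Optional[float]:
    """Return the most common non-None answer; ties go to the one seen first."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


sampled = [5.0, 5.0, 4.0, None, 5.0]  # made-up extracted answers
print(majority_vote(sampled))  # 5.0
```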

**Continue to:** [Lab 3.4.2: Self-Consistency Implementation](./lab-3.4.2-self-consistency.ipynb)