
# DSPy "Under the Hood" Demo with Traces and Scores

This notebook gives a **small, concrete, step-by-step demo** of how DSPy:

* runs a simple program on examples
* records internal traces of each module call
* uses a metric to accept or reject traces
* can be optimized with an optimizer like `BootstrapFewShotWithRandomSearch`

The demo uses a very small GSM8K-style math task so you can print and inspect everything.


In [1]:

# If you do not have DSPy installed, uncomment this cell and run it.
# It is left commented so that you can control your own environment.
# !pip install -qU dspy-ai


In [1]:

import dspy



## 1. Configure DSPy with your local Ollama model

This cell points DSPy at a `qwen3:30b` model served by Ollama on `localhost:11434`. Change the model name or address to match your own setup.


In [2]:

# Configure DSPy to use Ollama with qwen3:30b
ollama_model = dspy.LM(
    model='ollama/qwen3:30b',
    api_base='http://localhost:11434',
    api_key=''  # Ollama does not require an API key
)

dspy.configure(lm=ollama_model)



## 2. Define a tiny GSM8K-style dataset and metric

We will work with a very small synthetic math dataset so that traces stay readable.


In [3]:

# A tiny synthetic "GSM8K-style" dataset
train_examples = [
    dspy.Example(
        question="Tom has 3 apples and buys 2 more. How many apples does he have now?",
        answer="5",
    ).with_inputs("question"),
    dspy.Example(
        question="A box has 10 candies and you eat 4. How many are left?",
        answer="6",
    ).with_inputs("question"),
    dspy.Example(
        question="Sara has 12 pencils, gives 3 to her friend, and buys 5 more. How many pencils does she have now?",
        answer="14",
    ).with_inputs("question"),
]

dev_examples = [
    dspy.Example(
        question="Mike has 7 oranges and buys 3 more. How many does he have?",
        answer="10",
    ).with_inputs("question"),
    dspy.Example(
        question="There are 15 cookies. You eat 7. How many remain?",
        answer="8",
    ).with_inputs("question"),
]

len(train_examples), len(dev_examples)


(3, 2)

In [4]:

# A simple numeric accuracy metric
def gsm8k_metric(example, prediction, trace=None):
    """Return 1 if the numeric answer matches, else 0.

    This is intentionally simple so we can see pass or fail clearly.
    """
    pred_text = str(getattr(prediction, "answer", "")).strip()
    gold_text = str(example.answer).strip()

    # grab the first integer that appears
    import re

    def extract_int(text):
        m = re.search(r"-?\d+", text)
        return int(m.group()) if m else None

    pred_val = extract_int(pred_text)
    gold_val = extract_int(gold_text)

    return int(pred_val is not None and gold_val is not None and pred_val == gold_val)
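
Because the metric keeps only the first integer it finds, it helps to check how that rule behaves on less tidy answers. The sketch below repeats the same regular expression in a standalone helper. The sample strings are made up for illustration and are not model outputs from this notebook.

```python
import re

def extract_int(text):
    """Same rule as in gsm8k_metric: return the first integer in the text, or None."""
    m = re.search(r"-?\d+", text)
    return int(m.group()) if m else None

# Hypothetical answers, chosen only to probe the extraction rule.
samples = {
    "5": 5,
    "The answer is 14 pencils.": 14,
    "3 + 2 = 5": 3,          # the first integer wins, so this is read as 3
    "no number here": None,  # nothing to extract, so the metric would return 0
}

for text, expected in samples.items():
    value = extract_int(text)
    assert value == expected
    print(f"{text!r} -> {value}")
```

An answer that restates the arithmetic would therefore be scored on its first number rather than its last. If your model tends to answer that way, taking the last integer instead is one possible adjustment.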



## 3. Define a simple DSPy program

This program is a single `ChainOfThought` module that maps:

```text
question -> answer
```

We wrap it in a module so we can more easily inspect its behavior and later compile it.


In [5]:

class MathCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain of Thought from question to answer
        self.cot = dspy.ChainOfThought("question -> answer")

    def forward(self, question: str):
        # Just delegate to the ChainOfThought module
        pred = self.cot(question=question)
        return pred

math_program = MathCoT()
math_program


cot.predict = Predict(StringSignature(question -> reasoning, answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
))


## 4. Baseline behavior (no optimization)

First, run the raw program on the tiny train and dev sets to see how it behaves before any optimization.


In [6]:

def evaluate_program(program, dataset, metric_fn):
    results = []
    for i, ex in enumerate(dataset):
        with dspy.context(trace=[]):
            pred = program(question=ex.question)
            trace = dspy.settings.trace.copy()

        score = metric_fn(ex, pred, trace)
        results.append(score)

        print(f"Example {i}:")
        print("  Q:", ex.question)
        print("  Predicted answer:", getattr(pred, "answer", None))
        print("  Gold answer:", ex.answer)
        print("  Metric score:", score)
        print("-" * 60)

    avg_score = sum(results) / max(len(results), 1)
    print(f"Average score on {len(dataset)} examples: {avg_score:.3f}")
    return avg_score

print("Baseline on train:")
_ = evaluate_program(math_program, train_examples, gsm8k_metric)

print("\nBaseline on dev:")
_ = evaluate_program(math_program, dev_examples, gsm8k_metric)


Baseline on train:
Example 0:
  Q: Tom has 3 apples and buys 2 more. How many apples does he have now?
  Predicted answer: 5
  Gold answer: 5
  Metric score: 1
------------------------------------------------------------
Example 1:
  Q: A box has 10 candies and you eat 4. How many are left?
  Predicted answer: 6
  Gold answer: 6
  Metric score: 1
------------------------------------------------------------
Example 2:
  Q: Sara has 12 pencils, gives 3 to her friend, and buys 5 more. How many pencils does she have now?
  Predicted answer: 14
  Gold answer: 14
  Metric score: 1
------------------------------------------------------------
Average score on 3 examples: 1.000

Baseline on dev:
Example 0:
  Q: Mike has 7 oranges and buys 3 more. How many does he have?
  Predicted answer: 10
  Gold answer: 10
  Metric score: 1
------------------------------------------------------------
Example 1:
  Q: There are 15 cookies. You eat 7. How many remain?
  Predicted answer: 8
  Gold answer: 8
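  Metric score: 1
------------------------------------------------------------
Average score on 2 examples: 1.000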


## 5. Manual tracing and simple "backtracking" demo

In this section we will:

* run the program on a single example multiple times
* record the trace for each attempt
* evaluate the metric
* stop when we get a passing trace

This imitates what DSPy does in `BootstrapFewShot`: it keeps only traces whose final prediction passes the metric.


In [7]:

from pprint import pprint

def run_with_trace(program, example, max_attempts=3):
    accepted_traces = []
    print("Question:", example.question)
    print("Gold answer:", example.answer)
    print()

    for attempt in range(1, max_attempts + 1):
        with dspy.context(trace=[]):
            pred = program(question=example.question)
            trace = dspy.settings.trace.copy()

        score = gsm8k_metric(example, pred, trace)

        print(f"Attempt {attempt}:")
        print("  Predicted answer:", getattr(pred, "answer", None))
        print("  Metric score:", score)
        print("  Trace:")
        for j, (mod, inputs, outputs) in enumerate(trace):
            print(f"    Step {j}: module={type(mod).__name__}")
            print(f"      inputs:  {inputs}")
            print(f"      outputs: {dict(outputs)}")
        print()

        if score == 1:
            accepted_traces.append(trace)
            print("  -> Accepted trace for this example.")
            break
        else:
            print("  -> Rejected trace, trying again (if attempts remain).")
            print()

    return accepted_traces

# Try this on the second training example (the candy problem)
accepted = run_with_trace(math_program, train_examples[1], max_attempts=3)


Question: A box has 10 candies and you eat 4. How many are left?
Gold answer: 6

Attempt 1:
  Predicted answer: 6
  Metric score: 1
  Trace:
    Step 0: module=Predict
      inputs:  {'question': 'A box has 10 candies and you eat 4. How many are left?'}
      outputs: {'reasoning': 'The box initially contains 10 candies. After eating 4 candies, the remaining number is calculated by subtracting 4 from 10. Thus, 10 - 4 = 6.', 'answer': '6'}

  -> Accepted trace for this example.
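
The run above was accepted on the first attempt, so the rejection branch never fired. The sketch below exercises that branch with a hypothetical stand-in program that returns canned answers in order. It does not call DSPy or an LM, and the helper names are invented for this illustration. Note also that DSPy caches LM responses by default, so repeated identical calls in `run_with_trace` may return the same cached answer rather than a fresh sample.

```python
from types import SimpleNamespace

def make_flaky_program(answers):
    """Hypothetical stand-in for a DSPy program: returns canned answers in order."""
    queue = list(answers)

    def program(question):
        return SimpleNamespace(answer=queue.pop(0))

    return program

def first_passing_attempt(program, question, gold, max_attempts=3):
    """Retry until the answer matches gold; return (attempt, prediction) or None."""
    for attempt in range(1, max_attempts + 1):
        pred = program(question)
        passed = pred.answer.strip() == gold.strip()
        status = "accepted" if passed else "rejected"
        print(f"Attempt {attempt}: answer={pred.answer!r} -> {status}")
        if passed:
            return attempt, pred
    return None

question = "A box has 10 candies and you eat 4. How many are left?"
result = first_passing_attempt(make_flaky_program(["4", "6", "6"]), question, "6")
```

Here the first canned answer is rejected and the second is accepted, so the loop stops after two attempts. If no attempt passes, the helper returns `None` and the example contributes no trace.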



## 6. Collecting "good" traces like a tiny BootstrapFewShot

Now let us collect at most one accepted trace per training example, using the simple loop from above.


In [8]:

def collect_bootstrapped_traces(program, dataset, max_attempts=3):
    all_traces = []
    for ex in dataset:
        print("=" * 80)
        traces = run_with_trace(program, ex, max_attempts=max_attempts)
        if traces:
            all_traces.append(traces[0])
    return all_traces

bootstrapped_traces = collect_bootstrapped_traces(math_program, train_examples, max_attempts=3)
print("\nNumber of accepted traces:", len(bootstrapped_traces))


Question: Tom has 3 apples and buys 2 more. How many apples does he have now?
Gold answer: 5

Attempt 1:
  Predicted answer: 5
  Metric score: 1
  Trace:
    Step 0: module=Predict
      inputs:  {'question': 'Tom has 3 apples and buys 2 more. How many apples does he have now?'}
      outputs: {'reasoning': 'Tom starts with 3 apples and buys 2 more. To find the total, add the two quantities: 3 + 2 = 5.', 'answer': '5'}

  -> Accepted trace for this example.
Question: A box has 10 candies and you eat 4. How many are left?
Gold answer: 6

Attempt 1:
  Predicted answer: 6
  Metric score: 1
  Trace:
    Step 0: module=Predict
      inputs:  {'question': 'A box has 10 candies and you eat 4. How many are left?'}
      outputs: {'reasoning': 'The box initially contains 10 candies. After eating 4 candies, the remaining number is calculated by subtracting 4 from 10. Thus, 10 - 4 = 6.', 'answer': '6'}

  -> Accepted trace for this example.
Question: Sara has 12 pencils, gives 3 to her friend, an


At this point, `bootstrapped_traces` is a list of traces, and each trace is a list of
`(module, inputs, outputs)` triples that passed the metric.
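
To make the link between traces and few-shot demonstrations concrete, here is a rough sketch that flattens each accepted step into a single record per module. It only illustrates the idea and is not DSPy's implementation. The module is represented by a plain string name, and `trace_to_demos` is a hypothetical helper.

```python
# One accepted trace in the same shape as the output above. The module object
# is replaced by a string name because this sketch does not import DSPy.
trace = [
    (
        "cot.predict",
        {"question": "A box has 10 candies and you eat 4. How many are left?"},
        {
            "reasoning": "The box initially contains 10 candies. After eating 4 "
                         "candies, the remaining number is calculated by "
                         "subtracting 4 from 10. Thus, 10 - 4 = 6.",
            "answer": "6",
        },
    ),
]

def trace_to_demos(trace):
    """Merge inputs and outputs of each step into one demo dict, grouped by module name."""
    demos = {}
    for module_name, inputs, outputs in trace:
        demo = {**inputs, **outputs}
        demos.setdefault(module_name, []).append(demo)
    return demos

demos = trace_to_demos(trace)
print(sorted(demos["cot.predict"][0]))
print(demos["cot.predict"][0]["answer"])
```

Each record carries the question together with the reasoning and answer that passed the metric, which is the kind of worked example a few-shot prompt can reuse.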

Next, we will switch back to **real DSPy optimizers** and see how they automate this process.



## 7. Real DSPy optimization with `BootstrapFewShotWithRandomSearch`

We will now:

* define an optimizer
* compile the program on the tiny train set
* evaluate before and after on the dev set
* inspect which internal modules got demonstrations


In [9]:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=gsm8k_metric,
    max_bootstrapped_demos=2,
    num_candidate_programs=4,
    num_threads=1,
)

# Bootstrap demos from the train set and score candidate programs on the dev set
compiled_math_program = optimizer.compile(
    student=MathCoT(),
    trainset=train_examples,
    valset=dev_examples,
)

compiled_math_program


Going to sample between 1 and 2 traces per predictor.
Will attempt to bootstrap 4 candidate sets.
Average Metric: 2.00 / 2 (100.0%): 100%|██████████| 2/2 [00:00<00:00, 1611.64it/s]

2025/11/13 01:17:33 INFO dspy.evaluate.evaluate: Average Metric: 2 / 2 (100.0%)



New best score: 100.0 for seed -3
Scores so far: [100.0]
Best score so far: 100.0
Average Metric: 2.00 / 2 (100.0%): 100%|██████████| 2/2 [00:13<00:00,  6.68s/it]

2025/11/13 01:17:47 INFO dspy.evaluate.evaluate: Average Metric: 2 / 2 (100.0%)



Scores so far: [100.0, 100.0]
Best score so far: 100.0


 67%|██████▋   | 2/3 [00:07<00:03,  3.88s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 2.00 / 2 (100.0%): 100%|██████████| 2/2 [00:07<00:00,  3.87s/it]

2025/11/13 01:18:02 INFO dspy.evaluate.evaluate: Average Metric: 2 / 2 (100.0%)



Scores so far: [100.0, 100.0, 100.0]
Best score so far: 100.0


 67%|██████▋   | 2/3 [00:06<00:03,  3.18s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 2.00 / 2 (100.0%): 100%|██████████| 2/2 [00:08<00:00,  4.05s/it]

2025/11/13 01:18:17 INFO dspy.evaluate.evaluate: Average Metric: 2 / 2 (100.0%)



Scores so far: [100.0, 100.0, 100.0, 100.0]
Best score so far: 100.0


 33%|███▎      | 1/3 [00:00<00:00, 65.92it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 2.00 / 2 (100.0%): 100%|██████████| 2/2 [00:07<00:00,  3.92s/it]

2025/11/13 01:18:24 INFO dspy.evaluate.evaluate: Average Metric: 2 / 2 (100.0%)



Scores so far: [100.0, 100.0, 100.0, 100.0, 100.0]
Best score so far: 100.0


 33%|███▎      | 1/3 [00:00<00:00, 60.12it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 2.00 / 2 (100.0%): 100%|██████████| 2/2 [00:00<00:00, 281.88it/s]

2025/11/13 01:18:24 INFO dspy.evaluate.evaluate: Average Metric: 2 / 2 (100.0%)



Scores so far: [100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
Best score so far: 100.0


 33%|███▎      | 1/3 [00:00<00:00, 59.57it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 2.00 / 2 (100.0%): 100%|██████████| 2/2 [00:00<00:00, 471.27it/s]

2025/11/13 01:18:24 INFO dspy.evaluate.evaluate: Average Metric: 2 / 2 (100.0%)



Scores so far: [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
Best score so far: 100.0
7 candidate programs found.


cot.predict = Predict(StringSignature(question -> reasoning, answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
))

In [10]:

print("Dev performance BEFORE optimization:")
_ = evaluate_program(math_program, dev_examples, gsm8k_metric)

print("\nDev performance AFTER optimization:")
_ = evaluate_program(compiled_math_program, dev_examples, gsm8k_metric)


Dev performance BEFORE optimization:
Example 0:
  Q: Mike has 7 oranges and buys 3 more. How many does he have?
  Predicted answer: 10
  Gold answer: 10
  Metric score: 1
------------------------------------------------------------
Example 1:
  Q: There are 15 cookies. You eat 7. How many remain?
  Predicted answer: 8
  Gold answer: 8
  Metric score: 1
------------------------------------------------------------
Average score on 2 examples: 1.000

Dev performance AFTER optimization:
Example 0:
  Q: Mike has 7 oranges and buys 3 more. How many does he have?
  Predicted answer: 10
  Gold answer: 10
  Metric score: 1
------------------------------------------------------------
Example 1:
  Q: There are 15 cookies. You eat 7. How many remain?
  Predicted answer: 8
  Gold answer: 8
  Metric score: 1
------------------------------------------------------------
Average score on 2 examples: 1.000



## 8. Inspect which modules got demonstrations

Every `Predict`-like module inside your program can have its own pool of demonstrations.
Let us inspect them for the compiled program.


In [11]:

# Inspect named predictors and any demonstrations attached to them.
# DSPy stores few-shot demonstrations on each predictor's `demos` list.
for name, predictor in compiled_math_program.named_predictors():
    demos = getattr(predictor, "demos", [])
    print(f"Predictor name: {name}")
    print(f"  Type: {type(predictor).__name__}")
    print(f"  Number of demos: {len(demos)}")
    if demos:
        print("  First demo:")
        # Each demo is a dict-like record holding input and output fields together
        print("   fields:", dict(demos[0]))
    print("-" * 60)


Predictor name: cot.predict
  Type: Predict
  Number of demos: 0
------------------------------------------------------------
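
A demo count of 0 can look surprising after an optimization run. One plausible reading of the log above: every candidate scored 100.0, "New best score" is printed only for seed -3, and seven scores are listed for `num_candidate_programs=4`. If the optimizer keeps a candidate only when it strictly beats the best score so far, the first candidate wins every tie. The sketch below illustrates that rule in plain Python. The seed labels -3 to 3 and the strict comparison are assumptions inferred from the log, not taken from DSPy's source.

```python
# Scores copied from the log above; the seed labels -3..3 are an assumption.
scores_by_seed = {seed: 100.0 for seed in range(-3, 4)}

def select_best(scores_by_seed):
    """Keep a candidate only if it strictly beats the best score seen so far."""
    best_seed, best_score = None, float("-inf")
    for seed, score in scores_by_seed.items():
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed

best_seed = select_best(scores_by_seed)
print("Number of candidates:", len(scores_by_seed))
print("Selected seed:", best_seed)
```

Under these assumptions the earliest candidate is returned, which would be consistent with a compiled program that carries no demos. A harder dev set, where candidates score differently, would break the tie.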



## 9. Look at traces from the compiled program

Finally, run the compiled program with tracing enabled to see how the internal calls look now.


In [12]:

ex = dev_examples[0]

with dspy.context(trace=[]):
    compiled_pred = compiled_math_program(question=ex.question)
    compiled_trace = dspy.settings.trace.copy()

print("Question:", ex.question)
print("Predicted answer:", getattr(compiled_pred, "answer", None))
print("\nCompiled trace:")
for j, (mod, inputs, outputs) in enumerate(compiled_trace):
    print(f"  Step {j}: module={type(mod).__name__}")
    print(f"    inputs:  {inputs}")
    print(f"    outputs: {dict(outputs)}")


Question: Mike has 7 oranges and buys 3 more. How many does he have?
Predicted answer: 10

Compiled trace:
  Step 0: module=Predict
    inputs:  {'question': 'Mike has 7 oranges and buys 3 more. How many does he have?'}
    outputs: {'reasoning': 'Mike starts with 7 oranges and buys 3 more. To find the total, add the two quantities: 7 + 3 = 10.', 'answer': '10'}



## 10. Bonus: a tiny two-step program to show multi-module traces

To show multi-step traces, we define a program that:

1. rewrites the question into a simpler question
2. answers the simpler question

This produces traces that have two different modules in them.
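
Before running the real program, the shape of a two-step trace can be sketched without an LM. The example below uses a hypothetical `traced` wrapper and two stand-in functions in place of the `ChainOfThought` modules. It shows only how each call adds one step to a shared trace list and how the output of the first step becomes the input of the second.

```python
trace = []

def traced(name, fn):
    """Wrap fn so every call appends (name, inputs, outputs) to the shared trace."""
    def wrapper(**inputs):
        outputs = fn(**inputs)
        trace.append((name, inputs, outputs))
        return outputs
    return wrapper

# Hypothetical stand-ins for the two LM-backed steps.
rewrite = traced(
    "rewrite",
    lambda question: {"simpler_question": question.replace("buys", "gets")},
)
solve = traced("solve", lambda simpler_question: {"answer": "5"})

def rewrite_and_solve(question):
    simpler = rewrite(question=question)["simpler_question"]
    return solve(simpler_question=simpler)

result = rewrite_and_solve(
    question="Tom has 3 apples and buys 2 more. How many apples does he have now?"
)
for step, (name, inputs, outputs) in enumerate(trace):
    print(step, name, list(inputs), list(outputs))
```

The trace ends up with one entry per module call, in call order, and the `simpler_question` produced in step 0 reappears as the input of step 1.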


In [13]:

class RewriteAndSolve(dspy.Module):
    def __init__(self):
        super().__init__()
        self.rewrite = dspy.ChainOfThought("question -> simpler_question")
        self.solve = dspy.ChainOfThought("simpler_question -> answer")

    def forward(self, question: str):
        rewritten = self.rewrite(question=question)
        simpler_question = rewritten.simpler_question
        solved = self.solve(simpler_question=simpler_question)
        return dspy.Prediction(
            simpler_question=simpler_question,
            answer=solved.answer,
        )

two_step_program = RewriteAndSolve()

example = train_examples[0]
with dspy.context(trace=[]):
    pred = two_step_program(question=example.question)
    trace = dspy.settings.trace.copy()

print("Original question:", example.question)
print("Simpler question:", getattr(pred, "simpler_question", None))
print("Answer:", getattr(pred, "answer", None))

print("\nTwo step trace:")
for j, (mod, inputs, outputs) in enumerate(trace):
    print(f"  Step {j}: module={type(mod).__name__}")
    print(f"    inputs:  {inputs}")
    print(f"    outputs: {dict(outputs)}")


Original question: Tom has 3 apples and buys 2 more. How many apples does he have now?
Simpler question: Tom has 3 apples and gets 2 more. How many apples does he have now?
Answer: 5

Two step trace:
  Step 0: module=Predict
    inputs:  {'question': 'Tom has 3 apples and buys 2 more. How many apples does he have now?'}
    outputs: {'reasoning': 'The original question uses the word "buys," which may be slightly more complex for a beginner. Replacing "buys" with "gets" simplifies the language while keeping the math problem identical. The numbers and structure remain unchanged, making it easier to understand for younger learners.', 'simpler_question': 'Tom has 3 apples and gets 2 more. How many apples does he have now?'}
  Step 1: module=Predict
    inputs:  {'simpler_question': 'Tom has 3 apples and gets 2 more. How many apples does he have now?'}
    outputs: {'reasoning': 'Tom starts with 3 apples and receives 2 more. To find the total, add the two quantities: 3 + 2 = 5.', 'answer': '5'}


## 11. Wrap up

This notebook showed, on a very small scale:

* how to configure DSPy with a local LM
* how to run a program and inspect predictions
* how to capture traces with `dspy.context(trace=[])`
* a simple manual "backtracking" style loop for collecting good traces
* how `BootstrapFewShotWithRandomSearch` bootstraps candidate programs and returns the one with the best validation score
* how to inspect the compiled program and its traces

You can now swap in your own tasks and metrics, or extend the programs to include retrieval and multi-hop reasoning, and reuse the same tracing pattern to show what is happening step by step.
