
# DSPy + GSM8K "Under the Hood" Demo (Tiny Ollama Model, Small Train Subset)

This notebook is a **realistic but lightweight** DSPy demo on GSM8K using Ollama.

Goals:

* Use a **very small / fast Ollama model** so we see plenty of mistakes (and thus room for improvement).
* Train on only **25–30 GSM8K examples** so you can run it live.
* Log, for each training example:
  * the question
  * the gold final answer
  * the model's answer
  * whether the trace is **accepted** or **rejected** by the metric
* Allow you to **stop the pass over the training data early** by pressing a key (`q`) at a prompt.
* Then run a real `BootstrapFewShotWithRandomSearch` compile and compare **before vs after** accuracy.



## 1. Install dependencies

You need:

* `dspy-ai`
* `datasets`

Uncomment and run the cell below if you have not installed them.


In [1]:

# !pip install -qU dspy-ai datasets


In [2]:

import random
import re

import dspy
from datasets import load_dataset



## 2. Configure DSPy with a tiny Ollama model

To keep things **fast** and a bit **error-prone** (so we can see clear improvements), we use a very small model.

Example: `qwen2.5:0.5b` on Ollama, which is around 0.5B parameters.

Make sure you have pulled it first:

```bash
ollama pull qwen2.5:0.5b
```

Then we point DSPy at your local Ollama server.


In [3]:

# Configure DSPy to use Ollama with qwen2.5:0.5b
# If your Ollama expects a different model name, change `model` below.
ollama_model = dspy.LM(
    model='ollama/qwen2.5:0.5b',          # tiny, fast model
    api_base='http://localhost:11434',
    api_key=''                     # Ollama does not require an API key
)

# Slight temperature for variation so we see different attempts / failures.
dspy.configure(lm=ollama_model, temperature=0.6)



## 3. Load a **small** GSM8K subset

We use Hugging Face `openai/gsm8k` and then take:

* ~30 examples for training
* ~30 examples for dev

You can scale these numbers up later if you want.


In [4]:

gsm8k = load_dataset("openai/gsm8k", "main")

print(gsm8k)
print("Train size:", len(gsm8k["train"]))
print("Test size: ", len(gsm8k["test"]))


DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})
Train size: 7473
Test size:  1319


In [5]:

def make_example(row):
    return dspy.Example(
        question=row["question"],
        answer=row["answer"],
    ).with_inputs("question")

random.seed(0)

full_train = gsm8k["train"]
full_test = gsm8k["test"]

train_subset_size = 30
dev_subset_size = 30

train_indices = random.sample(range(len(full_train)), train_subset_size)
dev_indices = random.sample(range(len(full_test)), dev_subset_size)

train_examples = [make_example(full_train[i]) for i in train_indices]
dev_examples = [make_example(full_test[i]) for i in dev_indices]

len(train_examples), len(dev_examples)


(30, 30)


## 4. GSM8K-style numeric metric

GSM8K answers end with a line like:

```text
#### 42
```

We parse out the final integer and compare prediction vs gold.


In [6]:

def extract_final_int_from_gsm8k_answer(text: str):
    if text is None:
        return None
    # Prefer '#### number'
    m = re.search(r"####\s*(-?\d+)", text)
    if m:
        return int(m.group(1))
    # Fallback: last integer anywhere
    ints = re.findall(r"-?\d+", text)
    return int(ints[-1]) if ints else None

def gsm8k_metric(example, prediction, trace=None):
    gold = extract_final_int_from_gsm8k_answer(example.answer)
    pred = extract_final_int_from_gsm8k_answer(getattr(prediction, "answer", ""))
    return int(gold is not None and pred is not None and gold == pred)



## 5. Define a simple GSM8K program

A single `ChainOfThought` from `question -> answer`.


In [7]:

class GSM8KCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("question -> answer")

    def forward(self, question: str):
        return self.cot(question=question)

base_program = GSM8KCoT()
base_program


cot.predict = Predict(StringSignature(question -> reasoning, answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
))


## 6. Baseline evaluation on the small dev set


In [8]:

def evaluate_program(program, dataset, metric_fn, max_print=5, label=""):
    scores = []
    for i, ex in enumerate(dataset):
        with dspy.context(trace=[]):
            pred = program(question=ex.question)
            trace = dspy.settings.trace.copy()

        score = metric_fn(ex, pred, trace)
        scores.append(score)

        if i < max_print:
            print(f"Example {i} {label}:")
            print("Q:", ex.question)
            print("Predicted answer (raw):")
            print(getattr(pred, "answer", ""))
            print("Gold answer (raw):")
            print(ex.answer)
            print("Metric score:", score)
            print("-" * 80)

    avg = sum(scores) / max(len(scores), 1)
    print(f"Average metric on {len(dataset)} examples{label}: {avg:.3f}")
    return avg


In [9]:

print("Baseline on dev subset (before any optimization):")
base_dev_acc = evaluate_program(base_program, dev_examples, gsm8k_metric, max_print=5, label="(dev base)")


Baseline on dev subset (before any optimization):
Example 0 (dev base):
Q: Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left with 1/4 of the pomelos she originally had. How many pomelos did Eve give away?
Predicted answer (raw):
Eve gave away 15 pomelos.
Gold answer (raw):
Eve is left with 20 x 1/4 = <<20*1/4=5>>5 pieces of pomelos.
So she gave away 20 - 5 = <<20-5=15>>15 pieces of pomelos to her friend.
#### 15
Metric score: 1
--------------------------------------------------------------------------------
Example 1 (dev base):
Q: On Tuesday, Peter wants to exercise for twice the amount of time he did on Monday and Sunday combined. On Sunday he exercised for 23 minutes. On Monday he exercised for 16 minutes. How many minutes does he have to exercise on Tuesday to reach his goal?
Predicted answer (raw):
78
Gold answer (raw):
On Sunday and Monday he exercised a total of 39 minutes because 23 + 16 = <<23+16=39>>39
On Tuesday he has to exercise for 78 minutes


## 7. Manual "training pass" with backtracking and early stop

In this cell we:

* Iterate over the ~30 training examples.
* For each example, run **multiple attempts** (sampling) to try to get the right answer.
* Show:
  * question
  * gold final answer (parsed)
  * each attempt's predicted final answer (parsed)
  * whether the attempt is accepted / rejected
* After each training example, you can **press a key** to stop the pass:

  * Press **Enter** to continue to the next training example.
  * Type `q` then Enter to **stop** the pass early.

This approximates what `BootstrapFewShot` does internally: accept only traces whose final output passes the metric.


In [10]:

def manual_training_pass(program, dataset, metric_fn, max_attempts=3):
    accepted_traces = []
    for idx, ex in enumerate(dataset):
        print("=" * 120)
        print(f"Training example {idx}:")
        print("Question:")
        print(ex.question)
        gold_val = extract_final_int_from_gsm8k_answer(ex.answer)
        print(f"Gold final answer: {gold_val}")
        print()

        example_accepted_trace = None

        for attempt in range(1, max_attempts + 1):
            with dspy.context(trace=[]):
                pred = program(question=ex.question)
                trace = dspy.settings.trace.copy()

            score = metric_fn(ex, pred, trace)
            pred_ans_raw = getattr(pred, "answer", "")
            pred_val = extract_final_int_from_gsm8k_answer(pred_ans_raw)

            print(f"  Attempt {attempt}:")
            print("    Predicted answer (raw):")
            print("    ", pred_ans_raw.replace("\n", " "))
            print(f"    Parsed final answer: {pred_val}")
            print(f"    Metric score: {score}")
            print("    Trace steps:")
            for j, (mod, inputs, outputs) in enumerate(trace):
                print(f"      Step {j}: module={type(mod).__name__}")
                print(f"        inputs keys:  {list(inputs.keys())}")
                print(f"        outputs keys: {list(outputs.keys())}")
            print()

            if score == 1:
                print("    -> Accepted trace for this example (correct answer).")
                example_accepted_trace = trace
                break
            else:
                print("    -> Rejected trace (wrong answer).")
                print()

        if example_accepted_trace is not None:
            accepted_traces.append(example_accepted_trace)
        else:
            print("  No correct attempt found for this example (within max_attempts).")


        user_input = input("Press Enter to continue to the next training example, or type 'q' then Enter to stop: ").strip().lower()
        if user_input == "q":
            print("Stopping manual training pass early at example index", idx)
            break

    print("\nManual training pass complete.")
    print("Number of accepted traces collected:", len(accepted_traces))
    return accepted_traces

manual_traces = manual_training_pass(base_program, train_examples, gsm8k_metric, max_attempts=3)


Training example 0:
Question:
The state of Virginia had 3.79 inches of rain in March, 4.5 inches of rain in April, 3.95 inches of rain in May, 3.09 inches of rain in June and 4.67 inches in July.  What is the average rainfall amount, in inches, in Virginia?
Gold final answer: 4

  Attempt 1:
    Predicted answer (raw):
     The average rainfall amount in Virginia is \( 3.64 \) inches.
    Parsed final answer: 64
    Metric score: 0
    Trace steps:
      Step 0: module=Predict
        inputs keys:  ['question']
        outputs keys: ['reasoning', 'answer']

    -> Rejected trace (wrong answer).

  Attempt 2:
    Predicted answer (raw):
     The average rainfall amount in Virginia is \( 3.64 \) inches.
    Parsed final answer: 64
    Metric score: 0
    Trace steps:
      Step 0: module=Predict
        inputs keys:  ['question']
        outputs keys: ['reasoning', 'answer']

    -> Rejected trace (wrong answer).

  Attempt 3:
    Predicted answer (raw):
     The average rainfall amount 


## 8. Real DSPy optimization with `BootstrapFewShotWithRandomSearch` (on the same 30 examples)

Now that you have seen the **manual** loop, we run the actual DSPy teleprompter on the same data split.

The teleprompter will:

* internally do a similar "trace, score, keep good traces" process
* select few-shot demonstrations for the module(s)
* return a **compiled** version of `GSM8KCoT`


In [11]:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=gsm8k_metric,
    max_bootstrapped_demos=4,    # up to 4 demos per module
    num_candidate_programs=4,    # 4 random demo selections
    num_threads=1,
)

compiled_program = optimizer.compile(
    student=GSM8KCoT(),
    trainset=train_examples,
    valset=dev_examples,
)

compiled_program


Going to sample between 1 and 4 traces per predictor.
Will attempt to bootstrap 4 candidate sets.
Average Metric: 10.00 / 30 (33.3%): 100%|██████████| 30/30 [00:00<00:00, 355.29it/s]

2025/11/13 02:32:01 INFO dspy.evaluate.evaluate: Average Metric: 10 / 30 (33.3%)



New best score: 33.33 for seed -3
Scores so far: [33.33]
Best score so far: 33.33
Average Metric: 0.00 / 7 (0.0%):  23%|██▎       | 7/30 [00:16<00:56,  2.46s/it]

2025/11/13 02:33:01 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Adam has $100 and wants to spend it to open a rock stand. He can buy rocks for $5 each and sell them for $7 each. If he invests all his money in the rock stand but only sells 60% of his inventory, how much money does he lose?', 'answer': 'He can buy 20 rocks because 100 / 5 = <<100/5=20>>20\nHe sells 12 of these rocks because 20 x .6 = <<20*.6=12>>12\nHe makes $84 selling these because 12 x 7 = <<12*7=84>>84\nHe lost $16 on his business because 100 - 84 = <<100-84=16>>16\n#### 16'}) (input_keys={'question'}): Adapter JSONAdapter failed to parse the LM response. 

LM Response: {
  "reasoning": "To calculate the total amount Adam loses, we first need to determine how many rocks he can buy and sell. He has $100 and each rock costs $5, so he can buy 20 rocks (100/5 = <<100/5=20>>20). Then, he sells 60% of his inventory, which is 24 rocks (20 * 60%) = <<20*60%=<<20*60%=12>>12. The cost to sell each rock is $7

Average Metric: 4.00 / 29 (13.8%): 100%|██████████| 30/30 [01:52<00:00,  3.76s/it]

2025/11/13 02:33:54 INFO dspy.evaluate.evaluate: Average Metric: 4.0 / 30 (13.3%)



Scores so far: [33.33, 13.33]
Best score so far: 33.33


100%|██████████| 30/30 [01:39<00:00,  3.31s/it]


Bootstrapped 3 full traces after 29 examples for up to 1 rounds, amounting to 30 attempts.
Average Metric: 6.00 / 30 (20.0%): 100%|██████████| 30/30 [01:38<00:00,  3.27s/it]

2025/11/13 02:37:11 INFO dspy.evaluate.evaluate: Average Metric: 6 / 30 (20.0%)



Scores so far: [33.33, 13.33, 20.0]
Best score so far: 33.33


100%|██████████| 30/30 [02:33<00:00,  5.10s/it]


Bootstrapped 4 full traces after 29 examples for up to 1 rounds, amounting to 30 attempts.
Average Metric: 11.00 / 30 (36.7%): 100%|██████████| 30/30 [01:08<00:00,  2.29s/it]

2025/11/13 02:40:53 INFO dspy.evaluate.evaluate: Average Metric: 11 / 30 (36.7%)



New best score: 36.67 for seed 0
Scores so far: [33.33, 13.33, 20.0, 36.67]
Best score so far: 36.67


 23%|██▎       | 7/30 [00:20<01:07,  2.94s/it]


Bootstrapped 2 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Average Metric: 9.00 / 30 (30.0%): 100%|██████████| 30/30 [01:09<00:00,  2.32s/it]

2025/11/13 02:42:23 INFO dspy.evaluate.evaluate: Average Metric: 9 / 30 (30.0%)



Scores so far: [33.33, 13.33, 20.0, 36.67, 30.0]
Best score so far: 36.67


100%|██████████| 30/30 [01:48<00:00,  3.61s/it]


Bootstrapped 0 full traces after 29 examples for up to 1 rounds, amounting to 30 attempts.
Average Metric: 7.00 / 30 (23.3%): 100%|██████████| 30/30 [01:14<00:00,  2.48s/it]

2025/11/13 02:45:26 INFO dspy.evaluate.evaluate: Average Metric: 7 / 30 (23.3%)



Scores so far: [33.33, 13.33, 20.0, 36.67, 30.0, 23.33]
Best score so far: 36.67


 93%|█████████▎| 28/30 [01:56<00:08,  4.16s/it]


Bootstrapped 2 full traces after 28 examples for up to 1 rounds, amounting to 28 attempts.
Average Metric: 8.00 / 30 (26.7%): 100%|██████████| 30/30 [01:09<00:00,  2.31s/it]

2025/11/13 02:48:31 INFO dspy.evaluate.evaluate: Average Metric: 8 / 30 (26.7%)



Scores so far: [33.33, 13.33, 20.0, 36.67, 30.0, 23.33, 26.67]
Best score so far: 36.67
7 candidate programs found.


cot.predict = Predict(StringSignature(question -> reasoning, answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
))


## 9. Before vs after accuracy on the dev subset


In [12]:

print("Dev subset BEFORE optimization:")
base_dev_acc = evaluate_program(base_program, dev_examples, gsm8k_metric, max_print=3, label="(dev base)")

print("\nDev subset AFTER optimization:")
compiled_dev_acc = evaluate_program(compiled_program, dev_examples, gsm8k_metric, max_print=5, label="(dev compiled)")

print("\nAccuracy improvement on dev subset:", compiled_dev_acc - base_dev_acc)


Dev subset BEFORE optimization:
Example 0 (dev base):
Q: Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left with 1/4 of the pomelos she originally had. How many pomelos did Eve give away?
Predicted answer (raw):
Eve gave away 15 pomelos.
Gold answer (raw):
Eve is left with 20 x 1/4 = <<20*1/4=5>>5 pieces of pomelos.
So she gave away 20 - 5 = <<20-5=15>>15 pieces of pomelos to her friend.
#### 15
Metric score: 1
--------------------------------------------------------------------------------
Example 1 (dev base):
Q: On Tuesday, Peter wants to exercise for twice the amount of time he did on Monday and Sunday combined. On Sunday he exercised for 23 minutes. On Monday he exercised for 16 minutes. How many minutes does he have to exercise on Tuesday to reach his goal?
Predicted answer (raw):
78
Gold answer (raw):
On Sunday and Monday he exercised a total of 39 minutes because 23 + 16 = <<23+16=39>>39
On Tuesday he has to exercise for 78 minutes because 39 x 2 = 


## 10. Inspect internal demonstrations

Each `Predict`-like module inside the compiled program now has some demonstrations
that were selected using GSM8K examples and our metric.


In [13]:

for name, predictor in compiled_program.named_predictors():
    demos = getattr(predictor, "demonstrations", [])
    print(f"Predictor: {name}")
    print(f"  Type: {type(predictor).__name__}")
    print(f"  Number of demos: {len(demos)}")
    if demos:
        first = demos[0]
        print("  First demo inputs keys:", list(first.inputs.keys()))
        print("  First demo outputs keys:", list(first.outputs.keys()))
    print("-" * 80)


Predictor: cot.predict
  Type: Predict
  Number of demos: 0
--------------------------------------------------------------------------------



## 11. Traces from the compiled program

Finally, run the **compiled** program with tracing so you can see
how the internal calls look after optimization.


In [14]:

for i in range(3):
    ex = dev_examples[i]
    print("=" * 100)
    print(f"Dev example {i}:")
    print("Question:")
    print(ex.question)
    print()

    with dspy.context(trace=[]):
        pred = compiled_program(question=ex.question)
        trace = dspy.settings.trace.copy()

    print("Predicted answer (raw):")
    print(getattr(pred, "answer", ""))
    print("Gold answer (raw):")
    print(ex.answer)
    print("Metric score:", gsm8k_metric(ex, pred, trace))
    print()

    print("Trace:")
    for j, (mod, inputs, outputs) in enumerate(trace):
        print(f"  Step {j}: module={type(mod).__name__}")
        print(f"    inputs keys:  {list(inputs.keys())}")
        print(f"    outputs keys: {list(outputs.keys())}")
    print()


Dev example 0:
Question:
Eve had 20 pieces of pomelos. After giving her friend some pomelos, Eve is left with 1/4 of the pomelos she originally had. How many pomelos did Eve give away?

Predicted answer (raw):
1. Eve gave away 15 pomelos.
Gold answer (raw):
Eve is left with 20 x 1/4 = <<20*1/4=5>>5 pieces of pomelos.
So she gave away 20 - 5 = <<20-5=15>>15 pieces of pomelos to her friend.
#### 15
Metric score: 1

Trace:
  Step 0: module=Predict
    inputs keys:  ['question']
    outputs keys: ['reasoning', 'answer']

Dev example 1:
Question:
On Tuesday, Peter wants to exercise for twice the amount of time he did on Monday and Sunday combined. On Sunday he exercised for 23 minutes. On Monday he exercised for 16 minutes. How many minutes does he have to exercise on Tuesday to reach his goal?

Predicted answer (raw):
Peter needs to exercise for **0 minutes** on Tuesday to reach his goal of exercising twice the amount of time he did on Monday and Sunday combined.
Gold answer (raw):
On Sund


## 12. Wrap up

This notebook gives you:

* A **tiny-model**, **small-subset** GSM8K demo that actually shows lots of wrong answers and a measurable boost.
* A **manual training loop** with:
  * multiple attempts per example
  * trace logging
  * accept/reject decisions by the GSM8K metric
  * an **interactive early stop** (`q` to stop, Enter to continue)
* A real `BootstrapFewShotWithRandomSearch` run on the same subset.
* Before/after accuracy and trace inspection for the compiled program.

This should be a good "live demo" scale: fast enough to run in a talk,
but real enough to show DSPy’s internal mechanics clearly.
