# DSPy + StrategyQA with local Ollama (qwen3:30b)

This notebook shows a full DSPy pipeline on the StrategyQA dataset using a local Ollama model (`qwen3:30b`).
It:
- connects DSPy to Ollama,
- loads StrategyQA from Hugging Face,
- defines a simple Chain-of-Thought module (`question -> rationale, answer`),
- measures baseline accuracy,
- compiles the module with `BootstrapFewShotWithRandomSearch` to optimize prompts,
- measures accuracy again to show the gain from DSPy.


## 1. Install dependencies

In [17]:
# If needed, install DSPy and datasets.
# You can skip this cell if you already have them.
# Note: run this, then restart the kernel before continuing.
%pip install -U dspy-ai datasets


/Users/abm/XVOL/Cornell/CS6784/DSPY-Testing/.venv/bin/python: No module named pip
Note: you may need to restart the kernel to use updated packages.


## 2. Configure DSPy with local Ollama (`qwen3:30b`)

In [18]:
import dspy
import requests

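# Change this if your Ollama server or model name differs.
# Note: the title mentions `qwen3:30b`, but this cell is set to `qwen2.5:0.5b`;
# set OLLAMA_MODEL to whichever model you have pulled.
OLLAMA_API_BASE = "http://localhost:11434"
OLLAMA_MODEL = "qwen2.5:0.5b"  # e.g. `ollama pull qwen2.5:0.5b`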

# Quick sanity check that Ollama is running
try:
    r = requests.get(f"{OLLAMA_API_BASE}/api/tags", timeout=3)
    r.raise_for_status()
    print("✅ Ollama is running.")
except Exception as e:
    print("⚠️ Could not reach Ollama. Make sure `ollama serve` is running and the model is pulled.")
    print("Error:", e)

# Configure DSPy to talk to the local Ollama server.
# DSPy uses LiteLLM under the hood; the `ollama_chat/` prefix selects
# LiteLLM's Ollama chat provider.
lm = dspy.LM(
    model=f"ollama_chat/{OLLAMA_MODEL}",  # chat-style interface for Ollama
    api_base=OLLAMA_API_BASE,
    api_key="",             # not used for local Ollama
    model_type="chat",
    max_tokens=512,
    temperature=0.2,
)

dspy.configure(lm=lm)
print("✅ DSPy is configured with Ollama:", lm)


✅ Ollama is running.
✅ DSPy is configured with Ollama: <dspy.clients.lm.LM object at 0x126e9c3e0>
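
The sanity check above only confirms that the server responds; it does not confirm that the configured model has been pulled. A minimal sketch of that extra check follows. It assumes the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field. The helper `model_is_pulled` is hypothetical and not part of DSPy or Ollama.

```python
def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in an Ollama /api/tags payload."""
    names = {m.get("name", "") for m in tags_payload.get("models", [])}
    # Assumption: a bare model name refers to its ":latest" tag.
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in names

# Example payload shaped like the /api/tags response (values are illustrative).
sample = {"models": [{"name": "qwen2.5:0.5b"}, {"name": "llama3:latest"}]}
print(model_is_pulled(sample, "qwen2.5:0.5b"))  # True
print(model_is_pulled(sample, "qwen3:30b"))     # False
```

In the notebook this could be called as `model_is_pulled(r.json(), OLLAMA_MODEL)` right after `r.raise_for_status()`.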


## 3. Load StrategyQA and create DSPy examples

In [19]:
from datasets import load_dataset
import random
import dspy

# We use the ChilleD/StrategyQA variant which has a clean schema:
#   question: str
#   answer: bool (True/False)
#   facts: str (description of reasoning facts)
#
# Ref: https://huggingface.co/datasets/ChilleD/StrategyQA
raw = load_dataset("ChilleD/StrategyQA")

train_raw = raw["train"]
test_raw = raw["test"]

print("Train size:", len(train_raw))
print("Test size:", len(test_raw))

# For a simple demo, we work on smaller subsets (you can increase these later).
random.seed(13)
TRAIN_SIZE = 128
DEV_SIZE = 128

train_indices = list(range(len(train_raw)))
random.shuffle(train_indices)
train_indices = train_indices[:TRAIN_SIZE]

dev_indices = list(range(len(test_raw)))
random.shuffle(dev_indices)
dev_indices = dev_indices[:DEV_SIZE]

# Convert to DSPy Examples with yes/no answers as strings.
trainset = []
for idx in train_indices:
    row = train_raw[int(idx)]
    q = row["question"]
    a = "yes" if bool(row["answer"]) else "no"
    ex = dspy.Example(question=q, answer=a).with_inputs("question")
    trainset.append(ex)

devset = []
for idx in dev_indices:
    row = test_raw[int(idx)]
    q = row["question"]
    a = "yes" if bool(row["answer"]) else "no"
    ex = dspy.Example(question=q, answer=a).with_inputs("question")
    devset.append(ex)

print("Prepared train examples:", len(trainset))
print("Prepared dev examples:", len(devset))

# Peek at a couple of examples
for ex in trainset[:3]:
    print("Q:", ex.question)
    print("A:", ex.answer)
    print("---")


Train size: 1603
Test size: 687
Prepared train examples: 128
Prepared dev examples: 128
Q: Will Chick-fil-A hypothetically refuse to sponsor a Pride parade?
A: yes
---
Q: Would early Eastern Canadian Natives language have use of the letter B?
A: no
---
Q: Does Felix Potvin have a position on a dodgeball team?
A: no
---
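
Because the answers are binary, accuracy is easier to interpret next to the share of the most common label. The sketch below computes that majority-class baseline with a hypothetical helper, `majority_baseline`; the labels in the example are illustrative, not taken from the dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, its share) for a non-empty list of labels."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Illustrative labels only; in the notebook use [ex.answer for ex in devset].
labels = ["no", "yes", "no", "no"]
print(majority_baseline(labels))  # ('no', 0.75)
```

A program that scores below this share is doing worse than always predicting the most common answer.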


## 4. Define DSPy module and evaluation metric

In [20]:
import dspy

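# We give the model room to reason by asking for a rationale and a yes/no answer.
# The rationale is not evaluated directly; we only score the final answer.
# Note: dspy.ChainOfThought also adds its own reasoning output field, so
# `rationale` here is an extra field the model has to fill in.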
StrategyQASignature = dspy.Signature("question -> rationale, answer")

class StrategyQAModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought(StrategyQASignature)

    def forward(self, question: str):
        pred = self.cot(question=question)
        # Normalize answer a bit in case the LM tries to talk too much.
        answer = str(pred.answer).strip().lower()
        if answer.startswith("yes"):
            answer = "yes"
        elif answer.startswith("no"):
            answer = "no"
        pred.answer = answer
        return pred

# Simple exact-match metric on yes/no.
# DSPy optimizers call the metric as metric(example, prediction, trace),
# so the function accepts an optional `trace` argument even though it is unused.
def strategyqa_exact_match(gold: dspy.Example, pred: dspy.Prediction, trace=None) -> float:
    gold_ans = str(gold.answer).strip().lower()
    pred_ans = str(getattr(pred, "answer", "")).strip().lower()
    return 1.0 if gold_ans == pred_ans else 0.0

def evaluate_program(program: dspy.Module, dataset, verbose: bool = False) -> float:
    scores = []
    for i, ex in enumerate(dataset):
        pred = program(question=ex.question)
        score = strategyqa_exact_match(ex, pred)
        scores.append(score)
        if verbose and i < 5:
            print(f"Q: {ex.question}")
            print(f"Gold: {ex.answer}, Pred: {pred.answer}, Score: {score}")
            print("Rationale:", getattr(pred, "rationale", ""))
            print("----")
    return sum(scores) / max(len(scores), 1)

print("✅ Defined StrategyQAModule and metric.")

✅ Defined StrategyQAModule and metric.
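
The prefix check in `forward` treats any answer starting with "no" as a "no", which includes strings such as "not sure", and it leaves answers wrapped in markdown or quotes unnormalized, so they can never match the gold label. A stricter variant is sketched below. The helper `normalize_yes_no` is hypothetical, and mapping "true"/"false" to "yes"/"no" is an assumption about how the model might phrase its answer.

```python
import re

YES_TOKENS = {"yes", "true"}   # assumption: "true" counts as "yes"
NO_TOKENS = {"no", "false"}    # assumption: "false" counts as "no"

def normalize_yes_no(text) -> str:
    """Map free-form model output to 'yes', 'no', or '' if undecided."""
    # Use the first alphabetic token, skipping markdown, quotes, and punctuation.
    match = re.search(r"[a-z]+", str(text).lower())
    token = match.group(0) if match else ""
    if token in YES_TOKENS:
        return "yes"
    if token in NO_TOKENS:
        return "no"
    return ""

print(normalize_yes_no("**Yes**, because..."))  # yes
print(repr(normalize_yes_no("Not sure")))       # ''
```

Returning an empty string for undecided outputs keeps them from being counted as a "no", at the cost of scoring them as wrong under exact match.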


## 5. Baseline: uncompiled DSPy program

In [None]:
baseline_program = StrategyQAModule()

print("Evaluating baseline program on devset...")
baseline_acc = evaluate_program(baseline_program, devset, verbose=True)
print(f"Baseline dev exact-match accuracy: {baseline_acc:.3f}")


Evaluating baseline program on devset...



## 6. Compile with `BootstrapFewShotWithRandomSearch`

In [22]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

teleprompter = BootstrapFewShotWithRandomSearch(
    metric=strategyqa_exact_match,
    max_bootstrapped_demos=4,
    max_labeled_demos=8,
    max_rounds=1,
    num_candidate_programs=12,
    num_threads=4,
)

print("Compiling StrategyQAModule with DSPy...")
compiled_program = teleprompter.compile(
    student=StrategyQAModule(),
    trainset=trainset,
    valset=devset,   # devset selects the best candidate; section 7 reuses it, so that score is optimistic
)

print("✅ Compilation finished.")


Going to sample between 1 and 4 traces per predictor.
Will attempt to bootstrap 12 candidate sets.
Compiling StrategyQAModule with DSPy...
Average Metric: 32.00 / 128 (25.0%): 100%|██████████| 128/128 [00:00<00:00, 397.48it/s]

2025/11/17 18:41:02 INFO dspy.evaluate.evaluate: Average Metric: 32.0 / 128 (25.0%)



New best score: 25.0 for seed -3
Scores so far: [25.0]
Best score so far: 25.0
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:00<00:00, 399.03it/s]

2025/11/17 18:41:02 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



New best score: 48.44 for seed -2
Scores so far: [25.0, 48.44]
Best score so far: 48.44


  5%|▍         | 6/128 [00:00<00:02, 47.64it/s]


Bootstrapped 4 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:23<00:00,  5.35it/s]

2025/11/17 18:41:26 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44]
Best score so far: 48.44


  4%|▍         | 5/128 [00:01<00:40,  3.04it/s]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:25<00:00,  5.00it/s]

2025/11/17 18:41:54 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44]
Best score so far: 48.44


  4%|▍         | 5/128 [00:02<00:50,  2.43it/s]


Bootstrapped 2 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:25<00:00,  4.98it/s]

2025/11/17 18:42:21 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44]
Best score so far: 48.44


  1%|          | 1/128 [00:00<00:50,  2.53it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 60.00 / 128 (46.9%): 100%|██████████| 128/128 [00:24<00:00,  5.17it/s]

2025/11/17 18:42:46 INFO dspy.evaluate.evaluate: Average Metric: 60.0 / 128 (46.9%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88]
Best score so far: 48.44


  2%|▏         | 2/128 [00:00<00:44,  2.85it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:24<00:00,  5.16it/s]

2025/11/17 18:43:12 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44]
Best score so far: 48.44


  3%|▎         | 4/128 [00:01<00:42,  2.91it/s]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:23<00:00,  5.42it/s]

2025/11/17 18:43:37 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44]
Best score so far: 48.44


  6%|▋         | 8/128 [00:02<00:41,  2.89it/s]


Bootstrapped 3 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:25<00:00,  5.09it/s]

2025/11/17 18:44:05 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44, 48.44]
Best score so far: 48.44


  2%|▏         | 2/128 [00:00<01:00,  2.07it/s]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:27<00:00,  4.70it/s]

2025/11/17 18:44:33 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44, 48.44, 48.44]
Best score so far: 48.44


  3%|▎         | 4/128 [00:01<00:49,  2.51it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 63.00 / 128 (49.2%): 100%|██████████| 128/128 [00:26<00:00,  4.92it/s]

2025/11/17 18:45:01 INFO dspy.evaluate.evaluate: Average Metric: 63.0 / 128 (49.2%)



New best score: 49.22 for seed 7
Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44, 48.44, 48.44, 49.22]
Best score so far: 49.22


  2%|▏         | 3/128 [00:01<00:53,  2.32it/s]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:24<00:00,  5.15it/s]

2025/11/17 18:45:27 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44, 48.44, 48.44, 49.22, 48.44]
Best score so far: 49.22


  5%|▌         | 7/128 [00:02<00:49,  2.44it/s]


Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:25<00:00,  5.06it/s]

2025/11/17 18:45:55 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44, 48.44, 48.44, 49.22, 48.44, 48.44]
Best score so far: 49.22


  1%|          | 1/128 [00:00<01:01,  2.07it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:25<00:00,  5.04it/s]

2025/11/17 18:46:21 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44, 48.44, 48.44, 49.22, 48.44, 48.44, 48.44]
Best score so far: 49.22


  5%|▌         | 7/128 [00:02<00:42,  2.85it/s]


Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Average Metric: 62.00 / 128 (48.4%): 100%|██████████| 128/128 [00:25<00:00,  5.10it/s]

2025/11/17 18:46:49 INFO dspy.evaluate.evaluate: Average Metric: 62.0 / 128 (48.4%)



Scores so far: [25.0, 48.44, 48.44, 48.44, 48.44, 46.88, 48.44, 48.44, 48.44, 48.44, 49.22, 48.44, 48.44, 48.44, 48.44]
Best score so far: 49.22
15 candidate programs found.
✅ Compilation finished.
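
The compile call selects the winning candidate on `devset`, and the next section reports accuracy on that same set, so the reported figure is biased toward the selected program. One way around this is to carve out disjoint validation and test indices from the test split. The sketch below uses a hypothetical helper, `split_indices`; the 128/128 sizes are illustrative.

```python
import random

def split_indices(n_items: int, val_size: int, test_size: int, seed: int = 13):
    """Shuffle range(n_items) and return disjoint (val, test) index lists."""
    if val_size + test_size > n_items:
        raise ValueError("val_size + test_size exceeds the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, leaves global state alone
    return indices[:val_size], indices[val_size:val_size + test_size]

# 687 is the test-split size printed earlier in this notebook.
val_idx, test_idx = split_indices(687, val_size=128, test_size=128)
print(len(val_idx), len(test_idx), set(val_idx).isdisjoint(test_idx))  # 128 128 True
```

The validation indices would then feed `valset` during compilation, and the test indices would be used only for the final accuracy report.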


## 7. Evaluate compiled program

In [23]:
print("Evaluating compiled program on devset...")
compiled_acc = evaluate_program(compiled_program, devset, verbose=True)
print(f"Compiled dev exact-match accuracy: {compiled_acc:.3f}")


Evaluating compiled program on devset...
Q: Did the Presidency of Bill Clinton conclude with his impeachment?
Gold: no, Pred: no, Score: 1.0
Rationale: Not supplied for this particular example.
----
Q: Would a Yeti be likely to have prehensile limbs?
Gold: yes, Pred: no, Score: 0.0
Rationale: Not supplied for this particular example.
----
Q: Will the Albany in Georgia reach a hundred thousand occupants before the one in New York?
Gold: no, Pred: no, Score: 1.0
Rationale: Not supplied for this particular example.
----
Q: If your electric stove has a glass top, should you use cast iron skillets?
Gold: no, Pred: no, Score: 1.0
Rationale: Not supplied for this particular example.
----
Q: Did John Kerry run in the 2010 United Kingdom general election?
Gold: no, Pred: no, Score: 1.0
Rationale: Not supplied for this particular example.
----
Compiled dev exact-match accuracy: 0.492
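
With only 128 dev examples, a single accuracy number carries noticeable sampling noise. The sketch below computes a Wilson score interval for a proportion. The function `wilson_interval` is a hypothetical helper, the 95% level (`z = 1.96`) is an assumption, and 63 out of 128 is the best candidate's count from the compilation log above.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Return the (low, high) Wilson score interval for a proportion."""
    if total == 0:
        return 0.0, 1.0
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

low, high = wilson_interval(63, 128)
print(f"95% interval for 63/128: {low:.3f} to {high:.3f}")
```

Running the same function on the baseline count shows whether the two intervals overlap, which is a rough check on whether the difference is larger than sampling noise.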


## 8. Compare and try your own questions

In [43]:
print("Baseline dev accuracy: ", round(baseline_acc, 3))
print("Compiled dev accuracy:", round(compiled_acc, 3))

# Quick helper to compare baseline vs compiled on custom questions.
def ask(question: str):
    print("\nQ:", question)
    base_pred = baseline_program(question=question)
    comp_pred = compiled_program(question=question)

    print("Baseline answer: ", base_pred.answer)
    print("Baseline rationale:")
    print(getattr(base_pred, "rationale", ""))
    print()
    print("Compiled answer:", comp_pred.answer)
    print("Compiled rationale:")
    print(getattr(comp_pred, "rationale", ""))

# Example usage:
ask("Could a modern person call George Washington on the phone?")
ask("Is it possible to walk from New York to London?")
ask("Can a human survive without water for more than a week?")

# Show every candidate program DSPy evaluated, with its score and per-example subscores.
print("\nCandidate programs found by BootstrapFewShotWithRandomSearch:")
compiled_program.candidate_programs

Baseline dev accuracy:  0.25
Compiled dev accuracy: 0.492

Q: Could a modern person call George Washington on the phone?
Baseline answer:  no
Baseline rationale:
The question asks about a modern person calling George Washington on the phone. This implies that we are dealing with an historical or fictional scenario where such a call would be made.

Compiled answer: no
Compiled rationale:
Not supplied for this particular example.

Q: Is it possible to walk from New York to London?
Baseline answer:  no
Baseline rationale:
London and New York are located in different time zones, making it impossible for a person to walk between them.

Compiled answer: no
Compiled rationale:
Not supplied for this particular example.

Q: Can a human survive without water for more than a week?
Baseline answer:  yes
Baseline rationale:
Water plays a crucial role in maintaining overall bodily functions such as digestion, circulation, and metabolism. Without adequate hydration, the body's ability to perform thes

[{'score': 49.22,
  'subscores': [1.0,
   0.0,
   1.0,
   1.0,
   1.0,
   1.0,
   1.0,
   1.0,
   1.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   1.0,
   1.0,
   1.0,
   1.0,
   0.0,
   1.0,
   0.0,
   0.0,
   1.0,
   1.0,
   1.0,
   1.0,
   0.0,
   0.0,
   0.0,
   1.0,
   1.0,
   1.0,
   0.0,
   0.0,
   0.0,
   1.0,
   1.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.0,
   1.0,
   0.0,
   1.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   1.0,
   1.0,
   0.0,
   0.0,
   1.0,
   0.0,
   1.0,
   1.0,
   0.0,
   1.0,
   0.0,
   1.0,
   1.0,
   1.0,
   1.0,
   0.0,
   1.0,
   0.0,
   1.0,
   0.0,
   1.0,
   1.0,
   1.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   1.0,
   1.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.0,
   1.0,
   1.0,
   1.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   1.0,
   1.0,
   1.0,
   1.0,
   1.0,
 