# Automatic System Prompt Optimization (DSPy) — with **gpt-4.1-mini**

This notebook uses DSPy's MIPROv2 optimizer to tune the *system prompt* for a concise question-answering task, targeting OpenAI's **gpt-4.1-mini**.

## 1) Setup

In [1]:

%pip install -U dspy openai tiktoken

import os, re
import dspy

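# The key must already be in the environment; fail early with a clear message if it is missing.
# (Assigning os.getenv(...) back into os.environ raises a TypeError when the variable is unset.)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this notebook.")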

BASE_MODEL = "openai/gpt-4.1-mini"
JUDGE_MODEL = "openai/gpt-4.1"

dspy.configure(lm=dspy.LM(BASE_MODEL))
print("DSPy:", dspy.__version__)

Note: you may need to restart the kernel to use updated packages.
DSPy: 2.6.27


## 2) Data

In [2]:

import dspy
train_examples = [
    dspy.Example(prompt="What is the capital of France?", generation="Paris."),
    dspy.Example(prompt="Who wrote '1984'?", generation="George Orwell."),
]
dev_examples = [dspy.Example(prompt="What is the largest planet?", generation="Jupiter.")]

trainset = [e.with_inputs("prompt") for e in train_examples]
devset   = [e.with_inputs("prompt") for e in dev_examples]
(len(trainset), len(devset))


(2, 1)

## 3) Metrics

In [3]:

import re
from collections import Counter

def token_f1(pred, ref):
    """Token-level F1 of two strings (lowercased, whitespace-split; punctuation stays attached)."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    overlap = sum((Counter(p) & Counter(r)).values())  # multiset intersection
    prec, rec = overlap / len(p), overlap / len(r)
    return 0.0 if (prec + rec) == 0 else 2 * prec * rec / (prec + rec)

def concise_qna_metric(example, prediction, trace=None):
    """Token F1 against the reference, minus 0.2 per sentence beyond two, clipped to [0, 1]."""
    out = (prediction.get("generation") or "").strip()
    ref = (example.get("generation") or "").strip()
    if not out:
        return 0.0
    # Encourage <= 2 sentences
    sentences = [s for s in re.split(r"[.!?]+", out) if s.strip()]
    length_pen = 0.0 if len(sentences) <= 2 else min(1.0, 0.2 * (len(sentences) - 2))
    return max(0.0, min(1.0, token_f1(out, ref) - length_pen))
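
Because `token_f1` splits on whitespace only, punctuation stays attached to tokens, so `Jupiter` and `Jupiter.` do not count as a match. The sketch below restates the function so it runs on its own and compares it with `normalized_token_f1`, a hypothetical variant (not part of this notebook) that strips punctuation before scoring.

```python
import re
from collections import Counter

def token_f1(pred, ref):
    # Same whitespace-token F1 as in the metrics cell.
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    prec, rec = overlap / len(p), overlap / len(r)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def strip_punctuation(text):
    # Remove every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

def normalized_token_f1(pred, ref):
    # Hypothetical variant: score after punctuation removal.
    return token_f1(strip_punctuation(pred), strip_punctuation(ref))

full_sentence = token_f1("The largest planet is Jupiter.", "Jupiter.")  # 1 of 5 tokens matches
no_period = token_f1("Jupiter", "Jupiter.")                             # "jupiter" != "jupiter."
normalized = normalized_token_f1("Jupiter", "Jupiter.")

print(full_sentence, no_period, normalized)
```

With a one-word reference, a full-sentence answer is capped well below 1.0 even when it is correct, which is worth keeping in mind when reading the optimizer scores.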


## 4) Minimal program with custom adapter

In [4]:

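class QASignature(dspy.Signature):
    prompt = dspy.InputField()
    generation = dspy.OutputField()

def format_demos(demos):
    # Demos may be plain dicts or dspy.Example objects; both support .get(key, default).
    s = []
    for d in (demos or []):
        s.append(f"\n# Example\nUser: {d.get('prompt', '')}\nAssistant: {d.get('generation', '')}")
    return "\n".join(s)

class SimplestAdapter(dspy.Adapter):
    """Sends the instructions (plus any demos) as the system message and the raw prompt as the user message."""
    def __call__(self, lm, lm_kwargs, signature, demos, inputs):
        sys_msg = signature.instructions or ""
        if demos:
            sys_msg += "\n" + format_demos(demos)
        messages = [
            {"role": "system", "content": sys_msg},
            {"role": "user", "content": inputs["prompt"]},
        ]
        outputs = lm(messages=messages, **lm_kwargs)
        return [{"generation": outputs[0]}]

class MyPredict(dspy.Predict):
    def __init__(self, signature, **kw):
        super().__init__(signature, **kw)
        # NOTE: the inspect_history() output in section 5 shows DSPy's default chat template,
        # so this attribute does not appear to be consulted in this DSPy version. Registering
        # the adapter globally, e.g. dspy.configure(adapter=SimplestAdapter()), is the likely
        # way to route calls through it; doing so would change the recorded outputs below.
        self.adapter = SimplestAdapter()

INITIAL_SYSTEM_PROMPT = "You are concise. Answer correctly in <= 2 sentences."
my_program = MyPredict(QASignature)
my_program.signature.instructions = INITIAL_SYSTEM_PROMPT
print(my_program(prompt="Who painted the Mona Lisa?"))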


Prediction(
    generation='The Mona Lisa was painted by Leonardo da Vinci.'
)
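
To see what `SimplestAdapter` is meant to send, the message assembly can be tried without DSPy or an API call. This is a minimal sketch under assumptions: `build_messages` is a hypothetical helper that mirrors the adapter's logic, and demos are assumed to be plain dicts with `prompt` and `generation` keys.

```python
def format_demos(demos):
    # Render each demo as a short "User / Assistant" exchange.
    return "\n".join(
        f"\n# Example\nUser: {d.get('prompt', '')}\nAssistant: {d.get('generation', '')}"
        for d in (demos or [])
    )

def build_messages(instructions, demos, prompt):
    # Hypothetical stand-in for the adapter: instructions (plus demos) become the
    # system message, and the raw prompt becomes the user message.
    sys_msg = instructions or ""
    if demos:
        sys_msg += "\n" + format_demos(demos)
    return [
        {"role": "system", "content": sys_msg},
        {"role": "user", "content": prompt},
    ]

messages = build_messages(
    "You are concise. Answer correctly in <= 2 sentences.",
    [{"prompt": "Who wrote '1984'?", "generation": "George Orwell."}],
    "Who painted the Mona Lisa?",
)
for m in messages:
    print(f"[{m['role']}]\n{m['content']}\n")
```

Keeping the system message equal to the instructions is what makes the optimized instructions directly reusable as a system prompt.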


## 5) Optimize (MIPROv2)

In [5]:

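# Instruction-only optimization: with both demo limits at 0, no few-shot demos are added to the final prompt.
optimizer = dspy.MIPROv2(concise_qna_metric, max_bootstrapped_demos=0, max_labeled_demos=0)
# No valset is passed, so MIPROv2 holds out part of trainset for validation (the log reports "valset size: 1").
my_program_optimized = optimizer.compile(my_program, trainset=trainset, requires_permission_to_run=False)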
print(my_program_optimized(prompt="What is the capital of Germany?"))
my_program_optimized.inspect_history()


2025/08/10 16:27:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 9
minibatch: False
num_fewshot_candidates: 6
num_instruct_candidates: 6
valset size: 1

2025/08/10 16:27:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/08/10 16:27:49 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used for informing instruction proposal.

2025/08/10 16:27:49 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6


100%|██████████| 1/1 [00:00<00:00,  2.75it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 3/6


100%|██████████| 1/1 [00:00<00:00, 1709.87it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 4/6


100%|██████████| 1/1 [00:00<00:00, 2525.17it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/6


100%|██████████| 1/1 [00:00<00:00, 2398.12it/s]


Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6


100%|██████████| 1/1 [00:00<00:00, 2732.45it/s]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=6 instructions...

2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: 0: You are concise. Answer correctly in <= 2 sentences.

2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Answer the following factual question accurately and concisely. Provide a direct response in one or two sentences without unnecessary elaboration.

2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Answer the gi

Bootstrapped 1 full traces after 0 examples for up to 1 rounds, amounting to 1 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.
Average Metric: 0.20 / 1 (20.0%): 100%|██████████| 1/1 [00:00<00:00, 874.91it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.2 / 1 (20.0%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 20.0

2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 9 =====



Average Metric: 0.33 / 1 (33.3%): 100%|██████████| 1/1 [00:00<00:00, 1631.39it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.3333333333333333 / 1 (33.3%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 33.33
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.33 with parameters ['Predictor 0: Instruction 1'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 33.33


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 9 =====



Average Metric: 0.50 / 1 (50.0%): 100%|██████████| 1/1 [00:00<00:00, 2012.62it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.5 / 1 (50.0%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 50.0
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 with parameters ['Predictor 0: Instruction 5'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 9 =====



Average Metric: 0.20 / 1 (20.0%): 100%|██████████| 1/1 [00:00<00:00, 4064.25it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.2 / 1 (20.0%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 20.0 with parameters ['Predictor 0: Instruction 0'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0, 20.0]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 9 =====



Average Metric: 0.13 / 1 (13.3%): 100%|██████████| 1/1 [00:00<00:00, 4152.78it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.13333333333333336 / 1 (13.3%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 13.33 with parameters ['Predictor 0: Instruction 4'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0, 20.0, 13.33]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 9 =====



Average Metric: 0.33 / 1 (33.3%): 100%|██████████| 1/1 [00:00<00:00, 1223.54it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.3333333333333333 / 1 (33.3%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.33 with parameters ['Predictor 0: Instruction 2'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0, 20.0, 13.33, 33.33]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 9 =====



Average Metric: 0.33 / 1 (33.3%): 100%|██████████| 1/1 [00:00<00:00, 4084.04it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.3333333333333333 / 1 (33.3%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.33 with parameters ['Predictor 0: Instruction 2'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0, 20.0, 13.33, 33.33, 33.33]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 9 =====



Average Metric: 0.20 / 1 (20.0%): 100%|██████████| 1/1 [00:00<00:00, 4219.62it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.2 / 1 (20.0%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 20.0 with parameters ['Predictor 0: Instruction 0'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0, 20.0, 13.33, 33.33, 33.33, 20.0]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 9 =====



Average Metric: 0.33 / 1 (33.3%): 100%|██████████| 1/1 [00:00<00:00, 3498.17it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.3333333333333333 / 1 (33.3%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.33 with parameters ['Predictor 0: Instruction 2'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0, 20.0, 13.33, 33.33, 33.33, 20.0, 33.33]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 9 =====



Average Metric: 0.20 / 1 (20.0%): 100%|██████████| 1/1 [00:00<00:00, 3423.92it/s]

2025/08/10 16:27:50 INFO dspy.evaluate.evaluate: Average Metric: 0.2 / 1 (20.0%)
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 20.0 with parameters ['Predictor 0: Instruction 0'].
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [20.0, 33.33, 50.0, 20.0, 13.33, 33.33, 33.33, 20.0, 33.33, 20.0]
2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/10 16:27:50 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 50.0!



Prediction(
    generation='The capital of Germany is Berlin.'
)

[2025-08-10T16:27:50.191634]

System message:

Your input fields are:
1. `prompt` (str):
Your output fields are:
1. `generation` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## prompt ## ]]
{prompt}

[[ ## generation ## ]]
{generation}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are a knowledgeable and concise general knowledge expert. Provide accurate, fact-based answers to straightforward questions in one or two sentences, ensuring clarity and precision.


User message:

[[ ## prompt ## ]]
What is the capital of Germany?

Respond with the corresponding output fields, starting with the field `[[ ## generation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## generation ## ]]
The capital of Germany is Berlin.

[[ ## completed ## ]]
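
The trial log can be tallied by hand to check which instruction candidate won. The sketch below is not how MIPROv2 searches internally; it only aggregates the `(instruction index, score)` pairs copied from the "Score: ... with parameters" lines of this run, and assumes candidates are indexed 0 to 5 (from `num_instruct_candidates: 6`).

```python
from collections import defaultdict

# (instruction index, score) pairs copied from the trial log of this run.
logged_trials = [
    (1, 33.33), (5, 50.0), (0, 20.0), (4, 13.33), (2, 33.33),
    (2, 33.33), (0, 20.0), (2, 33.33), (0, 20.0),
]

scores_by_instruction = defaultdict(list)
for idx, score in logged_trials:
    scores_by_instruction[idx].append(score)

# Best candidate by its highest observed score.
best_idx = max(scores_by_instruction, key=lambda i: max(scores_by_instruction[i]))
best_score = max(scores_by_instruction[best_idx])

# Candidates that were proposed but never evaluated in any trial.
untried = sorted(set(range(6)) - set(scores_by_instruction))

print("best instruction:", best_idx, "score:", best_score)
print("never evaluated:", untried)
```

Since the validation set holds a single example, each score is one metric value rather than an average, so differences between candidates should be read with caution.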


## 6) Eval

In [7]:

def evaluate(program, dataset, metric):
    """Average `metric` over `dataset`; returns 0.0 for an empty dataset."""
    scores = []
    for ex in dataset:
        # dspy.Example exposes its fields as attributes, and both Example and
        # Prediction support .get(), which is all the metric relies on.
        pred = program(prompt=ex.prompt)
        scores.append(metric(ex, pred))
    return sum(scores) / len(scores) if scores else 0.0

base = evaluate(my_program, devset, concise_qna_metric)
opt  = evaluate(my_program_optimized, devset, concise_qna_metric)
print("Base:", base, "Optimized:", opt)


Base: 0.0909090909090909 Optimized: 0.08333333333333333
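
These low scores follow from the metric's arithmetic. When the reference is a single token that appears exactly once among the `n` tokens of the prediction, precision is `1/n` and recall is 1, so F1 equals `2 / (n + 1)`. The sketch below inverts that relation for the two scores above. It assumes no length penalty was applied, which the recorded output does not confirm; `implied_length` is a hypothetical helper.

```python
from fractions import Fraction

def f1_single_token_ref(n_pred_tokens):
    # precision = 1/n, recall = 1  ->  F1 = 2 / (n + 1)
    return Fraction(2, n_pred_tokens + 1)

def implied_length(score):
    # Invert F1 = 2 / (n + 1). Only meaningful if no length penalty was
    # subtracted (an assumption, not something the output confirms).
    return 2 / score - 1

base_tokens = round(implied_length(0.0909090909090909))
optimized_tokens = round(implied_length(0.08333333333333333))

print("Base answer length (tokens, if unpenalized):", base_tokens)
print("Optimized answer length (tokens, if unpenalized):", optimized_tokens)
print("F1 for a 5-token answer:", float(f1_single_token_ref(5)))
```

Under that assumption, the small gap between the two scores reflects a difference of about two tokens in answer length on one dev example, not a difference in correctness.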


## 7) Export learned system prompt

In [8]:

final_instructions = my_program_optimized.signature.instructions
with open("optimized_system_prompt.txt","w",encoding="utf-8") as f:
    f.write(final_instructions)
print(final_instructions)
print("\nSaved to optimized_system_prompt.txt")


You are a knowledgeable and concise general knowledge expert. Provide accurate, fact-based answers to straightforward questions in one or two sentences, ensuring clarity and precision.

Saved to optimized_system_prompt.txt
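
The exported file can be reused outside DSPy as an ordinary system message. This is a minimal sketch: `load_system_prompt` and `make_messages` are hypothetical helpers, and a temporary file with shortened prompt text stands in for `optimized_system_prompt.txt` so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

def load_system_prompt(path):
    # Read the exported prompt and drop surrounding whitespace and newlines.
    return Path(path).read_text(encoding="utf-8").strip()

def make_messages(system_prompt, question):
    # Chat-style message list with the optimized prompt as the system message.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

# Stand-in for optimized_system_prompt.txt (illustrative, shortened text).
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "optimized_system_prompt.txt"
    prompt_path.write_text(
        "You are a knowledgeable and concise general knowledge expert.\n",
        encoding="utf-8",
    )
    messages = make_messages(
        load_system_prompt(prompt_path),
        "What is the capital of Germany?",
    )

print(messages)
```

The resulting list has the shape that chat-completion style APIs generally accept; sending it to a model is left out here to keep the sketch free of network calls.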
