# Programming (Not Prompting) Your LLM with DSPy

DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that treats language models as programmable functions rather than prompt templates. It provides a PyTorch-like interface for defining, composing, and optimizing LLM operations. Instead of writing and maintaining complex prompts, developers specify input/output signatures and let DSPy handle prompt engineering and optimization. The framework enables systematic improvement of LLM pipelines through techniques like automatic prompt tuning and self-improvement.

<img src="./media/dspy_workflow.png" width=500>

The DSPy workflow follows 4 main steps:
1. Define your program using signatures and modules
2. Create measurable success metrics that clearly show your program's performance
3. Compile your program and optimize towards success metrics
4. Collect additional data and iterate

We'll look through and apply all the various approaches DSPy offers across these steps in this notebook!

---
## Setup

<img src="./media/dspy.png" width=400>

In [1]:
import dspy

**Configure LLM**

By default, DSPy caches LM responses. Once a language model is configured with `dspy.configure`, it is used for all subsequent calls unless you explicitly override it.

In [2]:
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

In [9]:
lm(messages=[{"role": "user", "content": "Say this is a test!"}])

['This is a test! How can I assist you further?']

---
## Signatures

DSPy Signatures follow the same approach as regular function signatures but are defined in natural language. This is the core of the "prompting" that DSPy aims to replace. Instead of telling the LLM what to do, we take the approach of declaring what the LLM will do.

The format looks like:

```python 
'input -> output' 
```

Your `input` and `output` can be named anything you'd like. You can also define multiple inputs and outputs, add types, or use better-defined schemas.

<img src="./media/signatures.png" width=600>

Behind the scenes, this is still a language model prompt, but it is modular rather than static: DSPy generates the wording and structure from your signature. Abstracting away from prompting may seem counterintuitive, but this setup makes it easy to swap models in and out and enables the algorithmic optimizations we highlight later.
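
To make the idea concrete, here is a minimal pure-Python sketch of how a string signature could be split into fields and rendered into a prompt. `parse_signature` and `render_prompt` are hypothetical helpers written for illustration. DSPy's real parser and adapters are more involved and produce different prompt text.

```python
def parse_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c: str, d' into input and output field names.

    Hypothetical helper for illustration only. Type hints after ':' are
    dropped, and types that contain commas (e.g. dict[str, float]) are
    not supported by this toy parser.
    """
    inputs_part, outputs_part = signature.split("->")

    def names(part: str) -> list[str]:
        return [field.split(":")[0].strip() for field in part.split(",") if field.strip()]

    return names(inputs_part), names(outputs_part)


def render_prompt(signature: str, **values: str) -> str:
    """Render a toy prompt: filled-in inputs followed by empty output slots."""
    inputs, outputs = parse_signature(signature)
    lines = [f"{name}: {values[name]}" for name in inputs]
    lines += [f"{name}:" for name in outputs]
    return "\n".join(lines)


print(render_prompt("question, context -> answer, citation",
                    question="What's my name?",
                    context="The user is Adam Lucek"))
```

The point is that the field names themselves carry the instruction: renaming `answer` to `summary` changes the prompt without anyone editing prompt text by hand.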

### Simple Input & Output

In [108]:
qna = dspy.Predict('question -> answer')

response = qna(question="Why is the sky blue?")

print("Response: ", response.answer)

Response:  The sky appears blue due to a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere, it collides with molecules and small particles in the air. Sunlight is made up of different colors, each with varying wavelengths. Blue light has a shorter wavelength and is scattered in all directions more than other colors with longer wavelengths, such as red or yellow. This scattering causes the sky to look predominantly blue to our eyes during the day.


In [110]:
# Named `summarizer` rather than `sum` so Python's built-in sum() stays available later
summarizer = dspy.Predict('document -> summary')

document = """
The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards.
We believe that the principal competitive factors in this market are performance, breadth of product offerings, access to customers and partners and distribution channels, software support, conformity to industry standard APIs, manufacturing capabilities, processor pricing, and total system costs.
We believe that our ability to remain competitive will depend on how well we are able to anticipate the features and functions that customers and partners will demand and whether we are able to deliver consistent volumes of our products at acceptable levels of quality and at competitive prices.
We expect competition to increase from both existing competitors and new market entrants with products that may be lower priced than ours or may provide better performance or additional features not provided by our products.
In addition, it is possible that new competitors or alliances among competitors could emerge and acquire significant market share.
A significant source of competition comes from companies that provide or intend to provide GPUs, CPUs, DPUs, embedded SoCs, and other accelerated, AI computing processor products, and providers of semiconductor-based high-performance interconnect products based on InfiniBand, Ethernet, Fibre Channel, and proprietary technologies.
Some of our competitors may have greater marketing, financial, distribution and manufacturing resources than we do and may be more able to adapt to customers or technological changes.
We expect an increasingly competitive environment in the future.
"""

response = summarizer(document=document)

print("Summary: ", response.summary)

Summary:  The market for our products is highly competitive, driven by rapid technological advancements and changing industry standards. Key competitive factors include product performance, range of offerings, customer access, distribution channels, software support, adherence to industry standards, manufacturing capabilities, pricing, and overall system costs. Our competitiveness hinges on our ability to predict customer demands and deliver quality products at competitive prices. We anticipate increased competition from both established players and new entrants, potentially offering lower prices or superior features. Additionally, competition may arise from companies specializing in GPUs, CPUs, DPUs, and high-performance interconnect products. Some competitors may possess greater resources, making it challenging for us to keep pace with market changes. The competitive landscape is expected to intensify in the future.


### Multiple Inputs and Outputs

<img src="./media/multiple_signature.png" width=400>

In [112]:
multi = dspy.Predict('question, context -> answer, citation')

question = "What's my name?"
context = "The user you're talking to is Adam Lucek, AI youtuber extraordinaire"

response = multi(question=question, context=context)

print("Answer: ", response.answer)
print("\nCitation: ", response.citation)

Answer:  Your name is Adam Lucek.

Citation:  Context provided by the user.


### Type Hints with Outputs

<img src="./media/input_type.png" width=400>

In [114]:
emotion = dspy.Predict('input -> sentiment: str, confidence: float, reasoning: str')

text = "I don't quite know, I didn't really like it"

response = emotion(input=text)

print("Sentiment Classification: ", response.sentiment)
print("\nConfidence: ", response.confidence)
print("\nReasoning: ", response.reasoning)

Sentiment Classification:  negative

Confidence:  0.85

Reasoning:  The phrase "I didn't really like it" clearly indicates a negative sentiment towards whatever is being discussed. The use of "didn't like" suggests dissatisfaction, and the uncertainty expressed by "I don't quite know" reinforces a lack of positive feelings. The confidence level is high at 0.85 due to the explicit negative language used.


### Class Based Signatures

For more advanced signatures, DSPy lets you define a pydantic-style class schema instead of the simple inline string. The class inherits from `dspy.Signature`, and you must declare each input with `dspy.InputField()` and each output with `dspy.OutputField()`.

An optional `desc` argument can be passed within each field to add additional context as a description.

In [116]:
from typing import Literal

class TextStyleTransfer(dspy.Signature):
    """Transfer text between different writing styles while preserving content."""
    text: str = dspy.InputField()
    source_style: Literal["academic", "casual", "business", "poetic"] = dspy.InputField()
    target_style: Literal["academic", "casual", "business", "poetic"] = dspy.InputField()
    preserved_keywords: list[str] = dspy.OutputField()
    transformed_text: str = dspy.OutputField()
    style_metrics: dict[str, float] = dspy.OutputField(desc="Scores for formality, complexity, emotiveness")


text = "This coffee shop makes the best lattes ever! Their new barista really knows what he's doing with the espresso machine."

style_transfer = dspy.Predict(TextStyleTransfer)

response = style_transfer(
    text=text,
    source_style="casual",
    target_style="poetic"
)

print("Transformed Text: ", response.transformed_text)
print("\nStyle Metrics: ", response.style_metrics)
print("\nPreserved Keywords: ", response.preserved_keywords)

Transformed Text:  In a quaint coffee shop, where dreams brew and swirl,  
The finest lattes dance, a creamy, frothy whirl.  
A new barista, skilled, with hands that weave delight,  
Crafts magic with the espresso, morning's purest light.

Style Metrics:  {'formality': 0.7, 'complexity': 0.6, 'emotiveness': 0.8}

Preserved Keywords:  ['coffee shop', 'lattes', 'barista', 'espresso machine']


---
## Modules

<img src="./media/modules.png" width=1000>

Modules are where we apply different prompting strategies to signatures. We've already been using the basic `Predict` module in the signature examples above, but many more strategies and variants exist. Here are the currently available modules:

* `ChainOfThought`: Implements chain-of-thought prompting by prepending a reasoning step before generating outputs. The module automatically adds a "Let's think step by step" prefix to encourage structured thinking. Use this when you need the model to break down complex problems into smaller steps.

* `ProgramOfThought`: Generates executable Python code to solve problems, with built-in error handling and code regeneration capabilities. Use this for mathematical or algorithmic problems that are better solved through actual code execution.

* `ReAct`: Implements Reasoning + Acting by interleaving thoughts, actions (via tools), and observations in a structured loop. Use this when your task requires multi-step reasoning and interaction with external tools or APIs.

And a few helpers:

* `MultiChainComparison`: Takes multiple reasoning attempts (default 3) and combines them into a single, more accurate response by comparing different reasoning paths. Use this when you need higher accuracy and can afford multiple attempts at solving a problem.

* `majority`: A utility function that takes multiple completions and returns the most common response after normalizing the text. Use this when you want to implement simple voting among multiple completion attempts to increase reliability.


### [Chain of Thought](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/chain_of_thought.py)

<img src="./media/cot_module.png" width=300>

ChainOfThought works by modifying the prompt signature to include an explicit reasoning step before the output. When initialized with a signature, it creates an extended signature by prepending a "reasoning" field with the prefix "Reasoning: Let's think step by step in order to". This reasoning field forces the language model to write out its thought process before providing the final answer.
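
As a rough illustration of that signature extension, the sketch below rewrites a string signature so that a `reasoning` output comes first. `add_reasoning` is a hypothetical helper. DSPy builds the extended signature from field objects rather than by string manipulation.

```python
def add_reasoning(signature: str) -> str:
    """Return a new string signature with a 'reasoning' output placed first.

    Illustrative only: this is string surgery on the inline format,
    not how DSPy constructs the extended signature internally.
    """
    inputs, outputs = (part.strip() for part in signature.split("->"))
    return f"{inputs} -> reasoning: str, {outputs}"


extended = add_reasoning("input -> sentiment: str")
print(extended)  # input -> reasoning: str, sentiment: str
```

Because the reasoning field precedes the requested outputs, the model has to produce its reasoning before it commits to an answer.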

In [118]:
# Define the Signature and Module
cot_emotion = dspy.ChainOfThought('input -> sentiment: str')

# Example
text = "That was phenomenal, but I hated it!"

# Run
cot_response = cot_emotion(input=text)

# Output
print("Sentiment: ", cot_response.sentiment)
# Inherently added reasoning
print("\nReasoning: ", cot_response.reasoning)

Sentiment:  Mixed

Reasoning:  The statement expresses a conflicting sentiment. The word "phenomenal" indicates a strong positive reaction, suggesting that the experience was impressive or outstanding. However, the phrase "but I hated it" introduces a negative sentiment, indicating a strong dislike or aversion to the same experience. This juxtaposition creates a complex emotional response, where the speaker acknowledges something as remarkable while simultaneously expressing a strong negative feeling towards it.


### [Program of Thought](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/program_of_thought.py)

<img src="./media/program_of_thought.png" width=700>

ProgramOfThought solves tasks by generating executable Python code rather than working directly with natural language outputs. When given a task, PoT first generates Python code using a ChainOfThought predictor, then executes that code in an isolated Python interpreter. If the code generates any errors, PoT enters a refinement loop where it shows the error to the language model, gets corrected code, and tries executing again, for up to a maximum number of iterations (default 3). The final output comes from actually running the successful code rather than from the language model directly. 
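
The generate, execute, and retry cycle can be sketched in plain Python. This is a simplified illustration under stated assumptions, not DSPy's implementation: `run_with_retries` and `fake_llm` are hypothetical names, the "model" is a scripted function, and plain `exec()` is used in place of the isolated interpreter that the real module relies on.

```python
from typing import Callable, Optional


def run_with_retries(generate_code: Callable[[Optional[str]], str],
                     max_iters: int = 3) -> dict:
    """Generate code, execute it, and feed errors back until it runs.

    `generate_code` stands in for the LLM call: it receives the previous
    error message (or None on the first attempt) and returns Python source
    that is expected to assign a variable named `result`.
    """
    error: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        source = generate_code(error)
        namespace: dict = {}
        try:
            # WARNING: exec() is not sandboxed; it is used here only for illustration
            exec(source, namespace)
            return {"result": namespace["result"], "attempts": attempt}
        except Exception as exc:
            # Keep the error so the next generation attempt can see it
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"No working code after {max_iters} attempts; last error: {error}")


def fake_llm(previous_error: Optional[str]) -> str:
    """Hypothetical stand-in: the first draft has a bug, the retry fixes it."""
    if previous_error is None:
        return "result = statistics.mean([1.5, 2.5])"  # fails: module not imported
    return "import statistics\nresult = statistics.mean([1.5, 2.5])"


outcome = run_with_retries(fake_llm)
print(outcome)  # {'result': 2.0, 'attempts': 2}
```

The returned value comes from executing the code, not from the model's text, which is why this pattern suits arithmetic and other tasks where computation is more reliable than generation.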

In [120]:
# Define the Signature
class MathAnalysis(dspy.Signature):
    """Analyze a dataset and compute various statistical metrics."""
    
    numbers: list[float] = dspy.InputField(desc="List of numerical values to analyze")
    required_metrics: list[str] = dspy.InputField(desc="List of metrics to calculate (e.g. ['mean', 'variance', 'quartiles'])")
    analysis_results: dict[str, float] = dspy.OutputField(desc="Dictionary containing the calculated metrics")

# Create the module
math_analyzer = dspy.ProgramOfThought(MathAnalysis)

# Example
data = [1.5, 2.8, 3.2, 4.7, 5.1, 2.3, 3.9]
metrics = ['mean', 'median']

# Run
pot_response = math_analyzer(
    numbers=data,
    required_metrics=metrics
)

Error in code execution


In [121]:
print("Reasoning: ", pot_response.reasoning)
print("\nResults: ", pot_response.analysis_results)

Reasoning:  The provided code correctly calculates the mean and median of the given list of numbers. The mean is computed by summing all the numbers and dividing by the count of numbers, while the median is determined by sorting the list and finding the middle value (or the average of the two middle values if the count is even). The output matches the expected results for both metrics.

Results:  {'mean': 3.357142857142857, 'median': 3.2}


### [Reasoning + Acting (ReAct)](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/react.py)

<img src="./media/react.png" width=700>

ReAct enables interactive problem-solving by combining reasoning with tool usage. It works by maintaining a trajectory of thought-action pairs, where at each step the model explains its reasoning, selects a tool to use, provides arguments for that tool, and then observes the tool's output to inform its next step. Each iteration consists of four parts: a thought explaining the strategy, selection of a tool name from the available tools, arguments to pass to that tool, and the observation from running the tool. This continues until either the model chooses to "finish" or reaches the maximum number of iterations. Here's a simple example:

In [123]:
# Define a Tool
def wikipedia_search(query: str) -> list[str]:
    """Retrieves abstracts from Wikipedia."""
    # Existing Wikipedia Abstracts Server
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3) 
    return [x['text'] for x in results]

# Define ReAct Module
react_module = dspy.ReAct('question -> response', tools=[wikipedia_search])

# Example
text = "Who won the world series in 1983 and who won the world cup in 1966?"

# Run
react_response = react_module(question=text)

print("Answer: ", react_response.response)
print("\nReasoning: ", react_response.reasoning)

Answer:  The Baltimore Orioles won the World Series in 1983, and England won the World Cup in 1966.

Reasoning:  The Baltimore Orioles won the 1983 World Series, defeating the Philadelphia Phillies four games to one. Additionally, England won the 1966 FIFA World Cup, beating West Germany 4–2 in the final match.
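
To show the shape of the thought, tool, arguments, and observation loop without an LLM, here is a minimal sketch. The names `react_loop`, `scripted_policy`, and `lookup` are hypothetical, the policy is scripted rather than model-driven, and the step format is an assumption made for this example rather than DSPy's internal trajectory format.

```python
from typing import Callable


def react_loop(policy: Callable[[list[dict]], dict],
               tools: dict[str, Callable[..., str]],
               max_iters: int = 5) -> list[dict]:
    """Run a thought -> tool -> args -> observation loop.

    `policy` stands in for the LLM: given the trajectory so far, it returns
    {"thought": str, "tool": str, "args": dict}. Choosing the tool name
    "finish" ends the loop; otherwise the loop stops after `max_iters` steps.
    """
    trajectory: list[dict] = []
    for _ in range(max_iters):
        step = policy(trajectory)
        if step["tool"] == "finish":
            trajectory.append({**step, "observation": "done"})
            break
        observation = tools[step["tool"]](**step["args"])
        trajectory.append({**step, "observation": observation})
    return trajectory


def lookup(query: str) -> str:
    """Hypothetical tool backed by a tiny in-memory table."""
    facts = {"1966 world cup": "England won the 1966 FIFA World Cup."}
    return facts.get(query.lower(), "no result")


def scripted_policy(trajectory: list[dict]) -> dict:
    """Hypothetical policy: search once, then finish."""
    if not trajectory:
        return {"thought": "I should look this up.", "tool": "lookup",
                "args": {"query": "1966 World Cup"}}
    return {"thought": "I have what I need.", "tool": "finish", "args": {}}


steps = react_loop(scripted_policy, {"lookup": lookup})
for step in steps:
    print(step["tool"], "->", step["observation"])
```

In the real module the policy is the language model, and each observation is appended to the prompt so the next thought can build on it.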


### [Multi Chain Comparison](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/multi_chain_comparison.py)

<img src="./media/multi_chain.png" width=700>

MultiChainComparison is a meta-predictor that synthesizes multiple existing completions into a single, more robust prediction. It doesn't generate predictions itself, but instead takes M different completions (default 3) from other predictors - these could be from the same predictor with different temperatures, different predictors entirely, or repeated calls with the same settings. These completions are formatted as "Student Attempt #1:", "Student Attempt #2:", etc., with each attempt packaged as «I'm trying to \[rationale] I'm not sure but my prediction is \[answer]». The module then prompts the model to analyze these attempts holistically with "Accurate Reasoning: Thank you everyone. Let's now holistically..." to synthesize a final answer. This approach helps mitigate individual prediction errors by having the model explicitly compare and critique multiple solution paths before making its final decision.
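
The packaging of attempts described above can be sketched as a small formatting function. `format_attempts` is a hypothetical helper that follows the wording quoted in this section. The exact strings DSPy emits may differ between versions, and the closing line is left truncated exactly as quoted.

```python
def format_attempts(attempts: list[tuple[str, str]]) -> str:
    """Build a comparison prompt body from (rationale, answer) pairs.

    Each pair becomes one numbered "Student Attempt" line, followed by the
    closing cue that asks the model to reason over all attempts together.
    """
    lines = []
    for i, (rationale, answer) in enumerate(attempts, start=1):
        lines.append(
            f"Student Attempt #{i}: «I'm trying to {rationale} "
            f"I'm not sure but my prediction is {answer}»"
        )
    # Closing cue copied from the description above, ellipsis included
    lines.append("Accurate Reasoning: Thank you everyone. Let's now holistically...")
    return "\n".join(lines)


prompt = format_attempts([
    ("classify the tone of the sentence.", "Positive"),
    ("weigh the word 'phenomenal'.", "Positive"),
])
print(prompt)
```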

In [126]:
# Run CoT completions with increasing temperatures
text = "That was phenomenal!"

cot_completions = []
for i in range(3):
    # Temperature increases: 0.7, 0.8, 0.9
    temp_config = dict(temperature=0.7 + (0.1 * i))
    completion = cot_emotion(input=text, config=temp_config)
    cot_completions.append(completion)

# Synthesize with MultiChainComparison
mcot_emotion = dspy.MultiChainComparison('input -> sentiment', M=3)
final_result = mcot_emotion(completions=cot_completions, input=text)

print(f"Sentiment: {final_result.sentiment}")
print(f"\nReasoning: {final_result.rationale}")

for i in range(3):
    print(f"\nCompletion {i+1}: ", cot_completions[i])

Sentiment: Positive

Reasoning: The phrase "That was phenomenal!" clearly indicates strong positive feelings. The word "phenomenal" is a superlative that suggests something is remarkable or exceptional, reinforcing the positive sentiment expressed by the speaker. All reasoning attempts correctly identify this sentiment as positive.

Completion 1:  Prediction(
    reasoning='The phrase "That was phenomenal!" expresses strong positive feelings about an experience or event. The use of the word "phenomenal" indicates that the subject exceeded expectations and was highly impressive.',
    sentiment='Positive'
)

Completion 2:  Prediction(
    reasoning='The phrase "That was phenomenal!" expresses a strong positive reaction. The use of the word "phenomenal" indicates that the speaker is extremely impressed or pleased with something. This suggests a high level of enthusiasm and admiration.',
    sentiment='Positive'
)

Completion 3:  Prediction(
    reasoning='The phrase "That was phenomenal!

### [Majority](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/aggregation.py)

<img src="./media/majority.png" width=700>

Majority is a utility function that implements a basic voting mechanism across multiple completions to determine the most common answer. It works by taking either a Prediction object (which contains completions) or a list of completions directly, then normalizes their values for the target field (either specified or defaults to the last output field). The normalization process, handled by normalize_text, helps manage slight variations in text that should be considered the same answer (returning None for answers that should be ignored). In cases of ties, earlier completions are prioritized. The function is particularly useful when combined with modules that generate multiple completions (like running predictors with different temperatures) and you want a simple way to find the most common response. The function returns a new Prediction object containing just the winning completion.
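
The voting logic can be sketched in a few lines of plain Python. This is a simplified stand-in, not a copy of DSPy's code: `normalize` here only lowercases and strips punctuation and whitespace, whereas DSPy's `normalize_text` applies its own rules.

```python
import string
from typing import Optional


def normalize(text: str) -> Optional[str]:
    """Lowercase, strip punctuation and extra whitespace; None if nothing is left."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    cleaned = " ".join(cleaned.split())
    return cleaned or None


def majority_vote(answers: list[str]) -> Optional[str]:
    """Return the first original answer whose normalized form is most common."""
    counts: dict[str, int] = {}
    first_seen: dict[str, str] = {}
    for answer in answers:
        key = normalize(answer)
        if key is None:  # ignored answers do not get a vote
            continue
        counts[key] = counts.get(key, 0) + 1
        first_seen.setdefault(key, answer)
    if not counts:
        return None
    # max() returns the first maximal key in insertion order,
    # so ties are resolved in favor of the earlier answer
    winner = max(counts, key=counts.get)
    return first_seen[winner]


print(majority_vote(["Positive", "positive.", "Negative", " POSITIVE "]))  # Positive
print(majority_vote(["Negative", "Positive"]))  # Negative
```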

In [128]:
# Example Completions From Prior Multi-Chain
majority_result = dspy.majority(cot_completions, field='sentiment')

# Results
print(f"Most common sentiment: {majority_result.sentiment}")

Most common sentiment: Positive


---
## Evaluators

While modules are the building blocks of your program, you may have realized there's limited ability to actually tune or change your modules directly like you would iterate on prompt chains. This is where DSPy starts to differentiate itself, as it aims to tune performance of your modules through measuring against defined metrics.

As such, you need to deeply consider the optimal state of your LLM output and how you would measure it. This can be as simple as accuracy for classification tasks, or more complex like faithfulness to retrieved context.

### Example Data Type

The data type for DSPy evaluators and metrics is the `Example` object. In essence it's just a `dict` but handles the formatting that the DSPy backend expects. The fields can be anything you'd like, but make sure they match up to your current input and output formatting for your module.

Your training set data will consist of a list of examples.

In [120]:
qa_pair = dspy.Example(question="What is my name?", answer="Your name is Adam Lucek")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Example({'question': 'What is my name?', 'answer': 'Your name is Adam Lucek'}) (input_keys=None)
What is my name?
Your name is Adam Lucek


In [123]:
classification_pair = dspy.Example(excerpt="I really love programming!", classification="Positive", confidence=0.95)

print(classification_pair)
print(classification_pair.excerpt)
print(classification_pair.classification)
print(classification_pair.confidence)

Example({'excerpt': 'I really love programming!', 'classification': 'Positive', 'confidence': 0.95}) (input_keys=None)
I really love programming!
Positive
0.95


You may also explicitly label `inputs` and `labels` using the `.with_inputs()` method. Anything not specified in `.with_inputs()` is then expected to either be labels or metadata.

In [5]:
article_summary = dspy.Example(article="Placeholder for Article", summary="Expected Summary").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example with Input fields only:", input_key_only)
print("\nExample object Non-Input fields only:", non_input_key_only)

Example with Input fields only: Example({'article': 'Placeholder for Article'}) (input_keys={'article'})

Example object Non-Input fields only: Example({'summary': 'Expected Summary'}) (input_keys=None)
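
If it helps to see the input/label split without DSPy, here is a toy stand-in. `MiniExample` is a hypothetical class written for this illustration. It only tracks fields and input keys and returns plain dictionaries, while the real `dspy.Example` returns `Example` objects and does more.

```python
class MiniExample:
    """Toy stand-in for dspy.Example that only tracks fields and input keys."""

    def __init__(self, fields: dict, input_keys: frozenset = frozenset()):
        self.fields = dict(fields)
        self.input_keys = frozenset(input_keys)

    def with_inputs(self, *keys: str) -> "MiniExample":
        # Return a copy so the original example is left untouched
        return MiniExample(self.fields, frozenset(keys))

    def inputs(self) -> dict:
        # Only the fields explicitly marked as inputs
        return {k: v for k, v in self.fields.items() if k in self.input_keys}

    def labels(self) -> dict:
        # Everything else is treated as a label or metadata
        return {k: v for k, v in self.fields.items() if k not in self.input_keys}


example = MiniExample(
    {"article": "Placeholder for Article", "summary": "Expected Summary"}
).with_inputs("article")

print(example.inputs())  # {'article': 'Placeholder for Article'}
print(example.labels())  # {'summary': 'Expected Summary'}
```

This split is what lets an evaluation loop pass only the inputs to a program while keeping the labels aside for the metric.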


### Metrics

<img src="./media/metrics.png" width=600>

Now that we understand the data format, we must consider our metrics. Metrics are critical to DSPy as the framework will optimize your modules towards defined metrics.

DSPy defines metrics concisely: *A metric is just a function that will take examples from your data and the output of your system and return a score that quantifies how good the output is. What makes outputs from your system good or bad?*

#### Simple Metrics

<img src="./media/simple_metrics.png" width=300>

Starting simply, we set up and run validation for exact matches on a sentiment classification module.

**Setup Module**

In [11]:
# Simple Tweet Sentiment Classification Module
from typing import Literal

class TwtSentiment(dspy.Signature):
    tweet: str = dspy.InputField(desc="Candidate tweet for classification")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

twt_sentiment = dspy.ChainOfThought(TwtSentiment)

**Format Dataset**

We'll grab some example tweet and sentiment pairs from the [MTEB Tweet Sentiment Extraction](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) dataset. This is the dataset we will validate against.

In [18]:
import json

# Formatting Examples
examples = []
num_examples = 50

with open("./datasets/tweets.jsonl", 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if num_examples and i >= num_examples:
            break
            
        data = json.loads(line.strip())
        example = dspy.Example(
            tweet=data['text'],
            sentiment=data['label_text']
        ).with_inputs("tweet")
        examples.append(example)

**Defining Metric**

The metric takes an example, a prediction, and an optional trace (we'll discuss the trace later). Here it returns `True` or `False` depending on whether the LLM-predicted sentiment matches our ground truth.

In [19]:
def validate_answer(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

**Running A Manual Evaluation**

For each tweet in `examples`, we run a prediction using the example's defined inputs (the tweet). The prediction is then run through our `validate_answer` metric, which returns `True` or `False`, and the result is stored in the `scores` list.

In [20]:
scores = []
for x in examples:
    pred = twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    scores.append(score)

In [23]:
accuracy = sum(scores) / len(scores)
print("Baseline Accuracy: ", accuracy)

Baseline Accuracy:  0.76
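
The manual loop generalizes into a small reusable helper. The sketch below is self-contained and uses a keyword rule in place of the LLM module, so `keyword_classifier` and the four examples are made up for illustration and are not taken from the tweet dataset. DSPy also ships a `dspy.Evaluate` utility for the same job; this is not a reimplementation of it.

```python
from types import SimpleNamespace
from typing import Callable


def evaluate(program: Callable, devset: list, metric: Callable) -> float:
    """Average a metric over a dataset, like the manual loop above."""
    scores = [float(metric(example, program(example.tweet))) for example in devset]
    return sum(scores) / len(scores) if scores else 0.0


def exact_match(example, pred, trace=None) -> bool:
    """Case-insensitive comparison of gold and predicted sentiment."""
    return example.sentiment.lower() == pred.sentiment.lower()


def keyword_classifier(tweet: str) -> SimpleNamespace:
    """Hypothetical stand-in for the LLM module: a crude keyword rule."""
    text = tweet.lower()
    if "love" in text:
        label = "positive"
    elif "hate" in text:
        label = "negative"
    else:
        label = "neutral"
    return SimpleNamespace(sentiment=label)


# Made-up examples for illustration only
devset = [
    SimpleNamespace(tweet="I love this", sentiment="positive"),
    SimpleNamespace(tweet="I hate mondays", sentiment="negative"),
    SimpleNamespace(tweet="Going to the store", sentiment="neutral"),
    SimpleNamespace(tweet="Not sure I love it", sentiment="negative"),
]

accuracy = evaluate(keyword_classifier, devset, exact_match)
print("Accuracy:", accuracy)  # 0.75
```

Keeping the metric's signature as `(example, pred, trace=None)` means the same function can be reused unchanged when an optimizer calls it later.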


#### Intermediate Metrics

<img src="./media/inter_metrics.png" width=300>

Direct ground-truth comparisons work well for tasks like classification, but long-form outputs are harder to score this way. LLM-as-a-judge approaches help by comparing and judging such outputs.

Let's implement some LLM based metrics:

**Setup Module**

In [34]:
# CoT For Summarizing a Dialogue

dialog_sum = dspy.ChainOfThought("dialogue: str -> summary: str")

**Format Dataset**

Our dataset for this example comes from [DialogSum](https://github.com/cylnlp/dialogsum), a collection of dialogues and corresponding summaries. We can use their summaries as the "gold" standard to test against with fuzzy metrics from an LLM.

In [32]:
import pandas as pd

num_examples = 20
df = pd.read_csv("./datasets/dialogsum.csv")
    
# Limit the number of examples
if num_examples:
    df = df.head(num_examples)

dialogsum_examples = []

for _, row in df.iterrows():
    example = dspy.Example(
        dialogue=row['dialogue'],
        summary=row['summary']
    ).with_inputs('dialogue')
    
    dialogsum_examples.append(example)

**Metric Signature**

Now that we're using modules within our metrics, we need a dynamic signature that we can apply to metric predictions.

In [90]:
# Define the signature for automatic assessments.
class Assess(dspy.Signature):
    """Assess the quality of a dialog summary along the specified dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer: bool = dspy.OutputField()

**Metric Definition**

We'll use an LLM to assess whether the generated dialogue summary is accurate with respect to the original dialogue, and appropriately concise compared to the expected summary.

In [91]:
def dialog_metric(gold, pred, trace=None):
    dialogue, gold_summary, generated_summary = gold.dialogue, gold.summary, pred.summary
    
    # Define Assessment Questions
    accurate_question = f"Given this original dialog: '{dialogue}', does the summary accurately represent what was discussed without adding or changing information?"
    
    concise_question = f"""Compare the level of detail in the generated summary with the gold summary:
    Gold summary: '{gold_summary}'
    Is the generated summary appropriately detailed - neither too sparse nor too verbose compared to the gold summary?"""

    # Run Predictions
    accurate = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=accurate_question)
    concise = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=concise_question)
    
    # Extract boolean assessment answers
    accurate, concise = [m.assessment_answer for m in [accurate, concise]]
    
    # Calculate score - accuracy is required for any points
    score = (accurate + concise) if accurate else 0
    
    if trace is not None:
        return score >= 2
        
    return score / 2.0

**Running Evaluation**

Similar manual evaluation to what we did earlier!

In [39]:
intermediate_scores = []
for x in dialogsum_examples:
    pred = dialog_sum(**x.inputs())
    score = dialog_metric(x, pred)
    intermediate_scores.append(score)

In [42]:
final_score = sum(intermediate_scores) / len(intermediate_scores)
print("Dialog Metric Score: ", final_score)

Dialog Metric Score:  0.85


#### Advanced Metrics with Tracing in DSPy

<img src="./media/advan_metrics.png" width=300>

DSPy's documentation highlights two key points about using modules as metrics:

1. If your metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) your metric itself. That's usually easy because the output of the metric is usually a simple value (e.g., a score out of 5) so the metric's metric is easy to define and optimize by collecting a few examples.

2. When your metric is used during evaluation runs, DSPy will not try to track the steps of your program. But during compiling (optimization), DSPy will trace your LM calls. The trace will contain inputs/outputs to each DSPy predictor and you can leverage that to validate intermediate steps for optimization.

Digging into the second point with our prior example, the metric operates in two modes:

**Standard Evaluation (trace=None)**: Returns a normalized score (0-1) based on accuracy and conciseness of the summary, requiring factual accuracy as a gating factor.

**Compilation Mode (trace available)**: During compilation, DSPy passes the metric a trace of the program's predictor calls, here our ChainOfThought module `dialog_sum`. In this mode we change the return logic from a normalized 0-1 score to a binary success criterion, `score >= 2`, which gives the optimizer a clear pass/fail signal for each example.

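```python
def dialog_metric(gold, pred, trace=None):
    dialogue, gold_summary, generated_summary = gold.dialogue, gold.summary, pred.summary

    # Assessment questions (same wording as the metric defined earlier)
    accurate_question = f"Given this original dialog: '{dialogue}', does the summary accurately represent what was discussed without adding or changing information?"
    concise_question = f"""Compare the level of detail in the generated summary with the gold summary:
    Gold summary: '{gold_summary}'
    Is the generated summary appropriately detailed - neither too sparse nor too verbose compared to the gold summary?"""

    # LLM-based assessment using the Assess signature
    accurate = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=accurate_question).assessment_answer
    concise = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=concise_question).assessment_answer

    # Accuracy is required for any points
    score = (accurate + concise) if accurate else 0

    if trace is not None:
        # During compilation: can access and validate CoT reasoning steps
        # We're not doing anything with them currently, but this is how to access them
        reasoning_steps = [output.reasoning for *_, output in trace if hasattr(output, 'reasoning')]
        # Return binary success criteria for optimization
        return score >= 2  # Requires both accuracy and conciseness

    return score / 2.0  # Normalized evaluation score
```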

The trace matters because it changes what the metric is used for. During evaluation, the metric only reports a score. During compilation, optimizers that bootstrap demonstrations (such as `BootstrapFewShot`) use the metric's return value to decide whether a traced run is good enough to keep, so a strict pass/fail criterion (`score >= 2`) filters out partially correct runs that a normalized score would let through.

This dual-mode strategy serves two distinct purposes. During normal evaluation, we get graded, normalized scores to assess model performance. During compilation, we switch to binary success criteria to guide optimization. This keeps the evaluation metric rich while giving the optimizer a clearer signal. We could extend this further by also checking signals from intermediate steps that are normally hidden.
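
As one possible way to use intermediate steps, the sketch below adds a check on the reasoning found in the trace. It is self-contained and makes several assumptions: the two judge verdicts are passed in as booleans instead of coming from LLM calls, trace entries are treated as `(predictor, inputs, outputs)` tuples, and `min_reasoning_words` is a hypothetical criterion chosen only for illustration.

```python
from types import SimpleNamespace


def gated_metric(accurate: bool, concise: bool, trace=None, min_reasoning_words: int = 5):
    """Return a 0-1 score for evaluation, or a strict boolean when a trace is present.

    `accurate` and `concise` stand in for the two LLM judge verdicts.
    """
    score = (accurate + concise) if accurate else 0  # accuracy gates all credit

    if trace is None:
        return score / 2.0  # evaluation mode: normalized score

    # Compilation mode: also require every traced reasoning step to be non-trivial
    reasoning_steps = [out.reasoning for *_, out in trace if hasattr(out, "reasoning")]
    reasoning_ok = all(len(r.split()) >= min_reasoning_words for r in reasoning_steps)
    return score >= 2 and reasoning_ok


# Hypothetical trace with a single predictor call
good_trace = [("predictor", {"dialogue": "..."},
               SimpleNamespace(reasoning="The speakers agree to meet on Friday", summary="..."))]
thin_trace = [("predictor", {"dialogue": "..."},
               SimpleNamespace(reasoning="Obvious", summary="..."))]

print(gated_metric(True, True))                     # 1.0
print(gated_metric(True, False))                    # 0.5
print(gated_metric(False, True))                    # 0.0
print(gated_metric(True, True, trace=good_trace))   # True
print(gated_metric(True, True, trace=thin_trace))   # False
```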

---
## Optimization

<img src="./media/optimizers.png" width=500>

So now that we have some modules and metrics we're measuring against, we can take the final step of optimizing our programs. This takes the guesswork out of tweaking and editing prompts by automatically testing, assessing and iterating against measurable values.

DSPy offers a few ways to optimize your programs, copied over [from the docs](https://dspy.ai/learn/optimization/optimizers/):

**Automatic Few-Shot Learning**
These optimizers extend the signature by automatically generating and including optimized examples within the prompt sent to the model, implementing few-shot learning.

- `LabeledFewShot`: Simply constructs few-shot examples (demos) from provided labeled input and output data points. Requires k (number of examples for the prompt) and trainset to randomly select k examples from.

- `BootstrapFewShot`: Uses a teacher module (which defaults to your program) to generate complete demonstrations for every stage of your program, along with labeled examples in trainset. Parameters include max_labeled_demos (the number of demonstrations randomly selected from the trainset) and max_bootstrapped_demos (the number of additional examples generated by the teacher). The bootstrapping process employs the metric to validate demonstrations, including only those that pass the metric in the "compiled" prompt. Advanced: Supports using a teacher program that is a different DSPy program that has compatible structure, for harder tasks.

- `BootstrapFewShotWithRandomSearch`: Applies BootstrapFewShot several times with random search over generated demonstrations, and selects the best program over the optimization. Parameters mirror those of BootstrapFewShot, with the addition of num_candidate_programs, which specifies the number of random programs evaluated over the optimization, including candidates of the uncompiled program, LabeledFewShot optimized program, BootstrapFewShot compiled program with unshuffled examples and num_candidate_programs of BootstrapFewShot compiled programs with randomized example sets.

- `KNNFewShot`: Uses k-Nearest Neighbors algorithm to find the nearest training example demonstrations for a given input example. These nearest neighbor demonstrations are then used as the trainset for the BootstrapFewShot optimization process. See this notebook for an example.

**Automatic Instruction Optimization**
These optimizers produce optimal instructions for the prompt and, in the case of MIPROv2, can also optimize the set of few-shot demonstrations.

- `COPRO`: Generates and refines new instructions for each step, and optimizes them with coordinate ascent (hill-climbing using the metric function and the trainset). Parameters include depth which is the number of iterations of prompt improvement the optimizer runs over.

- `MIPROv2`: Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generation instructions/demonstrations across your modules.

**Automatic Finetuning**
This optimizer is used to fine-tune the underlying LLM(s).

- `BootstrapFinetune`: Distills a prompt-based DSPy program into weight updates. The output is a DSPy program that has the same steps, but where each step is conducted by a finetuned model instead of a prompted LM.

**Program Transformations**
- `Ensemble`: Ensembles a set of DSPy programs and either uses the full set or randomly samples a subset into a single program.

**Loading Train and Test for Tweets**

For our examples, we'll be optimizing the tweet sentiment classification module from before. While classification tasks are not the best examples for LLM applications, it will still allow us to understand in a lightweight way what's going on behind each optimizer so we can better apply them to more advanced programs. 

In [60]:
import json

# Formatting Examples
twitter_train = []
twitter_test = []
train_size = 100  # how many for train 
test_size = 200   # how many for test

with open("./datasets/tweets.jsonl", 'r', encoding='utf-8') as f:
   for i, line in enumerate(f):
       if i >= (train_size + test_size):
           break
           
       data = json.loads(line.strip())
       example = dspy.Example(
           tweet=data['text'],
           sentiment=data['label_text']
       ).with_inputs("tweet")
       
       if i < train_size:
           twitter_train.append(example)
       else:
           twitter_test.append(example)

**Candidate Program**

In [3]:
# Simple Tweet Sentiment Classification Module
from typing import Literal

class TwtSentiment(dspy.Signature):
    tweet: str = dspy.InputField(desc="Candidate tweet for classification")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

base_twt_sentiment = dspy.Predict(TwtSentiment)

**Simple Metrics**

In [4]:
def validate_answer(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

**Baseline Score**

In [7]:
baseline_scores = []
for x in twitter_test:
    pred = base_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    baseline_scores.append(score)

base_accuracy = baseline_scores.count(True) / len(baseline_scores)
print("Baseline Accuracy: ", base_accuracy)

Baseline Accuracy:  0.69
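The same scoring loop is repeated for every optimizer below, so it can help to see it as a small helper. This is only a sketch: `accuracy` and `toy_predict` are hypothetical names, and a keyword rule stands in for the LLM-backed program so the snippet runs offline.

```python
from types import SimpleNamespace

def accuracy(predict, dataset, metric):
    """Fraction of examples for which metric(example, prediction) is truthy."""
    if not dataset:
        raise ValueError("dataset is empty")
    hits = sum(1 for ex in dataset if metric(ex, predict(ex)))
    return hits / len(dataset)

def validate_answer(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

def toy_predict(example):
    # Stand-in for a DSPy program: a crude keyword rule, not an LLM call.
    label = "positive" if "love" in example.tweet.lower() else "negative"
    return SimpleNamespace(sentiment=label)

toy_test = [
    SimpleNamespace(tweet="I love this", sentiment="positive"),
    SimpleNamespace(tweet="This is awful", sentiment="negative"),
    SimpleNamespace(tweet="Meeting at 3pm", sentiment="neutral"),
    SimpleNamespace(tweet="Love it!", sentiment="positive"),
]

toy_accuracy = accuracy(toy_predict, toy_test, validate_answer)
```

With the notebook's objects, the call would look like `accuracy(lambda ex: base_twt_sentiment(**ex.inputs()), twitter_test, validate_answer)`. DSPy also provides a `dspy.Evaluate` utility for threaded evaluation with progress display.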


**Example Tweet We'll Run Each Program Through**

In [132]:
# Expected Positive Label
example_tweet = "Hi! Waking up, and not lazy at all. You would be proud of me, 8 am here!!! Btw, nice colour, not burnt."

### Automatic Few Shot Learning

<img src="./media/auto_fewshot.png" width=300>

These optimizers focus on providing the best examples, either by finding examples in the training data that are similar to your query at inference time, or by generating optimized examples from the program itself.

#### LabeledFewShot

<img src="./media/labeled_few_shot.png" width=600>

The simplest optimizer. Randomly selects k examples from your training data to use as demonstrations.

In [9]:
from dspy.teleprompt import LabeledFewShot

lfs_optimizer = LabeledFewShot(k=16)  # Use 16 examples in prompts

lfs_twt_sentiment = lfs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)
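Conceptually, the optimizer does little more than sample `k` labeled examples and place them ahead of the new input. The sketch below illustrates that idea in plain Python. The function names and the prompt layout are assumptions made for illustration; DSPy's adapters format prompts differently.

```python
import random

def sample_demos(trainset, k, seed=0):
    """Pick k distinct labeled examples at random (fewer if the set is small)."""
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))

def render_prompt(demos, tweet):
    """Hypothetical prompt layout: labeled demos first, then the open query."""
    blocks = [f"Tweet: {d['tweet']}\nSentiment: {d['sentiment']}\n" for d in demos]
    blocks.append(f"Tweet: {tweet}\nSentiment:")
    return "\n".join(blocks)

toy_train = [{"tweet": f"tweet {i}", "sentiment": "neutral"} for i in range(20)]

demos = sample_demos(toy_train, k=16)
prompt = render_prompt(demos, "nice colour, not burnt")
```

Because selection is random and never checked against a metric, the quality of the result depends entirely on how representative the sampled examples happen to be.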

In [11]:
lfs_scores = []
for x in twitter_test:
    pred = lfs_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    lfs_scores.append(score)

lfs_accuracy = lfs_scores.count(True) / len(lfs_scores)
print("Labeled Few Shot Accuracy: ", lfs_accuracy)

Labeled Few Shot Accuracy:  0.695


In [12]:
lfs_twt_sentiment.save("./optimized/lfs_twt_sentiment.json")

In [134]:
print(lfs_twt_sentiment(tweet=example_tweet).sentiment)

positive


#### BootstrapFewShot 

<img src="./media/bootstrap_fewshot.png" width=900>

Generates high-quality examples by executing your program and keeping only successful runs.

In [15]:
from dspy.teleprompt import BootstrapFewShot

bsfs_optimizer = BootstrapFewShot(
    metric=validate_answer,          # Function to evaluate quality
    max_bootstrapped_demos=4,        # Generated examples
    max_labeled_demos=16,            # Examples from training data
    metric_threshold=1               # Minimum quality threshold
)

bsfs_twt_sentiment = bsfs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)

  4%|█▋                                        | 4/100 [00:00<00:00, 199.89it/s]

Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
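The log line above reflects the core loop: run the teacher over training examples, keep the runs that pass the metric, and stop once enough demonstrations are collected. A rough sketch of that loop, assuming a keyword rule (`toy_teacher`) in place of the LLM-backed teacher and plain dicts in place of DSPy traces:

```python
from types import SimpleNamespace

def validate_answer(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

def bootstrap_demos(teacher, trainset, metric, max_bootstrapped_demos):
    """Keep teacher outputs that pass the metric, up to the requested count."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break  # enough successful runs collected
        pred = teacher(example)
        if metric(example, pred, trace=[]):  # non-None trace: compilation mode
            demos.append({"tweet": example.tweet, "sentiment": pred.sentiment})
    return demos

def toy_teacher(example):
    # Stand-in for the teacher program: a crude keyword rule, not an LLM call.
    label = "positive" if "love" in example.tweet.lower() else "negative"
    return SimpleNamespace(sentiment=label)

toy_train = [
    SimpleNamespace(tweet="I love mornings", sentiment="positive"),
    SimpleNamespace(tweet="Meeting at 3pm", sentiment="neutral"),   # teacher fails here
    SimpleNamespace(tweet="Awful coffee", sentiment="negative"),
    SimpleNamespace(tweet="Love this colour", sentiment="positive"),
    SimpleNamespace(tweet="Traffic was terrible", sentiment="negative"),
    SimpleNamespace(tweet="Love it", sentiment="positive"),         # never reached
]

demos = bootstrap_demos(toy_teacher, toy_train, validate_answer, max_bootstrapped_demos=4)
demo_tweets = [d["tweet"] for d in demos]
```

The failed example is skipped and the loop exits early, which mirrors why the real run above needed only 4 of the 100 training examples.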





In [16]:
bsfs_scores = []
for x in twitter_test:
    pred = bsfs_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    bsfs_scores.append(score)

bsfs_accuracy = bsfs_scores.count(True) / len(bsfs_scores)
print("Bootstrap Few Shot Accuracy: ", bsfs_accuracy)

Bootstrap Few Shot Accuracy:  0.715


In [17]:
bsfs_twt_sentiment.save("./optimized/bsfs_twt_sentiment.json")

In [135]:
print(bsfs_twt_sentiment(tweet=example_tweet).sentiment)

positive


#### BootstrapFewShotWithRandomSearch

<img src="./media/bsfswrs_diagram.png" width=900>

Extends BootstrapFewShot by trying multiple random sets of examples to find the best performing combination.
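The search itself can be pictured as the loop below: draw a random demo set, score the resulting program, and keep the best one seen so far. This is a simplified sketch. `toy_evaluate` is a made-up scoring rule standing in for "compile with these demos, then run the metric over a validation set", and the real optimizer also considers a few fixed candidates, as described in the overview above.

```python
import random

def random_search(demo_pool, evaluate, num_candidate_programs, demos_per_program, seed=0):
    """Score random demo subsets and return the best one found."""
    rng = random.Random(seed)
    # The zero-shot (no demos) program is treated as a candidate too.
    best_demos, best_score = [], evaluate([])
    history = [best_score]
    for _ in range(num_candidate_programs):
        demos = rng.sample(demo_pool, demos_per_program)
        score = evaluate(demos)
        history.append(score)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score, history

def toy_evaluate(demos):
    # Assumption for illustration: demo sets covering more labels score higher.
    return len({d["sentiment"] for d in demos}) / 3

labels = ["positive", "negative", "neutral"]
demo_pool = [{"tweet": f"tweet {i}", "sentiment": labels[i % 3]} for i in range(30)]

best_demos, best_score, history = random_search(
    demo_pool, toy_evaluate, num_candidate_programs=16, demos_per_program=4
)
```

Each candidate costs a full evaluation pass, so `num_candidate_programs` is the main lever on both quality and cost.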

In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

bsfswrs_optimizer = BootstrapFewShotWithRandomSearch(
    metric=validate_answer,
    num_candidate_programs=16,
    max_bootstrapped_demos=4,
    max_labeled_demos=16
)

bsfswrs_twt_sentiment = bsfswrs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)

<img src="./media/bsfswrs.png" width=800>

In [20]:
bsfswrs_scores = []
for x in twitter_test:
    pred = bsfswrs_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    bsfswrs_scores.append(score)

bsfswrs_accuracy = bsfswrs_scores.count(True) / len(bsfswrs_scores)
print("Bootstrap Few Shot With Random Search Accuracy: ", bsfswrs_accuracy)

Bootstrap Few Shot With Random Search Accuracy:  0.7


In [21]:
bsfswrs_twt_sentiment.save("./optimized/bsfswrs_twt_sentiment.json")

In [137]:
print(bsfswrs_twt_sentiment(tweet=example_tweet).sentiment)

positive


#### KNNFewShot

<img src="./media/knn_diagram.png" width=800>

Dynamically selects relevant examples based on similarity to the input.

**Defining an Embedding Function**

As KNN retrieval relies on vector similarity, we need a quick embedding function. This is a very simple setup that uses OpenAI's API.

In [22]:
from openai import OpenAI
import numpy as np

client = OpenAI()

def openai_embeddings(texts):
    if isinstance(texts, str):
        texts = [texts]
    
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    
    # Convert to numpy array
    embeddings = np.array([embedding.embedding for embedding in response.data], dtype=np.float32)
    
    # If single text, return single embedding
    if len(embeddings) == 1:
        return embeddings[0]
    return embeddings

In [24]:
from dspy.teleprompt import KNNFewShot

knn_optimizer = KNNFewShot(
    k=5,                               # Number of neighbors to use
    trainset=twitter_train,            # Dataset for finding neighbors
    vectorizer=openai_embeddings       # Function to convert inputs to vectors
)

knn_twt_sentiment = knn_optimizer.compile(base_twt_sentiment, trainset=twitter_train)
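At inference time the retrieval step amounts to ranking training examples by similarity to the incoming tweet and keeping the top `k`. The sketch below shows that ranking with cosine similarity over hand-written 3-dimensional vectors. The vectors and helper names are invented for illustration, and cosine similarity is an assumed choice of measure; a real run would use the embeddings returned by `openai_embeddings`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_examples(query_vec, train_vecs, trainset, k):
    """Return the k training items whose vectors are most similar to the query."""
    ranked = sorted(
        range(len(trainset)),
        key=lambda i: cosine_similarity(query_vec, train_vecs[i]),
        reverse=True,
    )
    return [trainset[i] for i in ranked[:k]]

# Toy data: made-up vectors, not real embeddings.
toy_train = ["great day", "awful service", "love it", "meeting at noon"]
toy_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.1, 0.9],
]
query_vec = [0.95, 0.05, 0.0]

neighbors = nearest_examples(query_vec, toy_vecs, toy_train, k=2)
```

Unlike the other few-shot optimizers, the selected demonstrations change with every input, which is why the bootstrapping step runs again at prediction time.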

In [None]:
knn_scores = []
for x in twitter_test:
    pred = knn_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    knn_scores.append(score)

In [27]:
knn_accuracy = knn_scores.count(True) / len(knn_scores)
print("KNN Few Shot Accuracy: ", knn_accuracy)

KNN Few Shot Accuracy:  0.7


In [28]:
knn_twt_sentiment.save("./optimized/knn_twt_sentiment.json")

In [139]:
print(knn_twt_sentiment(tweet=example_tweet).sentiment)

 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.69it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
positive


### Instruction Optimization

<img src="./media/auto_instr.png" width=300>

These optimizers improve the actual instructions and prompts given to the model, enhancing zero-shot performance rather than the few shot setups shown above.

#### COPRO (Coordinate Prompt Optimization)

<img src="./media/copro_diagram.png" width=1000>

Generates and refines new instructions for each step, and optimizes them with coordinate ascent (hill-climbing using the metric function and the trainset). Parameters include depth which is the number of iterations of prompt improvement the optimizer runs over.

In [None]:
from dspy.teleprompt import COPRO

copro_optimizer = COPRO(
    metric=validate_answer,              # Metric to Optimize Against
    prompt_model= dspy.LM('openai/gpt-4o'), # Different Model for Prompt Generation
    breadth=10,                          # New prompts per iteration
    depth=3,                             # Number of improvement rounds
    init_temperature=1.4                 # Creativity in generation
)

copro_twt_sentiment = copro_optimizer.compile(base_twt_sentiment, trainset=twitter_train, eval_kwargs={'num_threads': 6, 'display_progress': True})
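The roles of `breadth` and `depth` are easier to see in a stripped-down hill-climbing loop: each round proposes `breadth` variations of the current best instruction and keeps a variation only if it scores higher. Everything below is a toy sketch. In COPRO the proposals come from the `prompt_model` and the score is the metric over the trainset, whereas here `toy_propose` appends canned hints and `toy_score` counts keywords.

```python
def hill_climb(seed_instruction, propose, score, breadth, depth):
    """Greedy instruction search: keep a candidate only if it improves the score."""
    best, best_score = seed_instruction, score(seed_instruction)
    for _ in range(depth):
        candidates = propose(best, breadth)
        if not candidates:
            break  # nothing left to try
        top = max(candidates, key=score)
        if score(top) > best_score:
            best, best_score = top, score(top)
    return best, best_score

# Hypothetical hint pool and scoring rule, for illustration only.
HINTS = ["Be brief.", "Consider sarcasm.", "Answer with one label.", "Think about emoji."]
USEFUL = ("sarcasm", "one label", "emoji")

def toy_propose(current, breadth):
    unused = [h for h in HINTS if h not in current]
    return [f"{current} {h}" for h in unused[:breadth]]

def toy_score(instruction):
    return sum(keyword in instruction for keyword in USEFUL)

best_instruction, best_score = hill_climb(
    "Classify the sentiment of the tweet.", toy_propose, toy_score, breadth=3, depth=3
)
```

Since every candidate is scored against the training data, the cost grows roughly with `breadth * depth` evaluation passes.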

In [36]:
copro_scores = []
for x in twitter_test:
    pred = copro_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    copro_scores.append(score)

copro_accuracy = copro_scores.count(True) / len(copro_scores)
print("COPRO Accuracy: ", copro_accuracy)

COPRO Accuracy:  0.71


In [37]:
copro_twt_sentiment.save("./optimized/copro_twt_sentiment.json")

In [141]:
print(copro_twt_sentiment(tweet=example_tweet).sentiment)

positive


#### MIPROv2 (Multiprompt Instruction Proposal Optimizer Version 2)

<img src="./media/mipro_diagram.png" width=1000>

Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generation instructions/demonstrations across your modules.

In [None]:
from dspy.teleprompt import MIPROv2

mipro_optimizer = MIPROv2(
    metric=validate_answer,
    prompt_model= dspy.LM('openai/gpt-4o'), # Different Model for Prompt Generation
    num_candidates=10,                      # Instructions to try
)

mipro_twt_sentiment = mipro_optimizer.compile(base_twt_sentiment, trainset=twitter_train, valset=twitter_test)

In [62]:
mipro_scores = []
for x in twitter_test:
    pred = mipro_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    mipro_scores.append(score)

mipro_accuracy = mipro_scores.count(True) / len(mipro_scores)
print("MIPRO Accuracy: ", mipro_accuracy)

MIPRO Accuracy:  0.715


In [63]:
mipro_twt_sentiment.save("./optimized/mipro_twt_sentiment.json")

In [153]:
print(mipro_twt_sentiment(tweet=example_tweet).sentiment)

positive


### Automatic Finetuning

<img src="./media/auto_ft.png" width=300>

Once you have a well-optimized program, you may want to look for further gains. Ideally, you would first use a large and expensive model to get the best performance, then transfer that knowledge to a smaller model (or continue training an existing model).

DSPy offers a solution to automatically use your best programs to create training data for downstream finetuning with `BootstrapFinetune`.

#### BootstrapFinetune

<img src="./media/bootstrap_finetune_diagram.png" width=1000>

Creates fine-tuned versions of language models based on successful program executions. In this example we'll instill our best performing program from MIPROv2 directly into gpt-4o-mini.

In [64]:
dspy.settings.experimental = True

**Grabbing some additional data**

In [82]:
import json

# Formatting Examples
bsft_twitter_train = []
bsft_twitter_test = []
train_size = 500  # how many for train 
test_size = 200    # how many for test

with open("./datasets/tweets.jsonl", 'r', encoding='utf-8') as f:
   for i, line in enumerate(f):
       if i >= (train_size + test_size):
           break
           
       data = json.loads(line.strip())
       example = dspy.Example(
           tweet=data['text'],
           sentiment=data['label_text']
       ).with_inputs("tweet")
       
       if i < train_size:
           bsft_twitter_train.append(example)
       else:
           bsft_twitter_test.append(example)

**Teacher and Student**

At its core, `BootstrapFinetune` uses our best optimized program to create training data for fine-tuning a language model. As such, we need a teacher program that is run across our data to create the examples, and a student program whose target model will be fine-tuned.
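The data-preparation half of this process can be sketched as follows: run the teacher over the training set, drop the examples that fail the metric, and write the rest as chat-style fine-tuning records. The helper name, the `toy_teacher` rule, and the exact message contents are assumptions for illustration. The records DSPy actually uploads contain the full formatted prompt rather than the bare tweet.

```python
import json
from types import SimpleNamespace

def validate_answer(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

def build_finetune_records(teacher, trainset, metric, instruction):
    """Turn metric-passing teacher outputs into chat-format training records."""
    records = []
    for example in trainset:
        pred = teacher(example)
        if not metric(example, pred):
            continue  # filter out examples the teacher got wrong
        records.append({
            "messages": [
                {"role": "system", "content": instruction},
                {"role": "user", "content": example.tweet},
                {"role": "assistant", "content": pred.sentiment},
            ]
        })
    return records

def toy_teacher(example):
    # Stand-in for the optimized teacher program: a keyword rule, not an LLM call.
    label = "positive" if "love" in example.tweet.lower() else "negative"
    return SimpleNamespace(sentiment=label)

toy_train = [
    SimpleNamespace(tweet="I love this", sentiment="positive"),
    SimpleNamespace(tweet="Worst day ever", sentiment="negative"),
    SimpleNamespace(tweet="Meeting at 3pm", sentiment="neutral"),  # filtered out
]

records = build_finetune_records(
    toy_teacher, toy_train, validate_answer, "Classify the sentiment of the tweet."
)
jsonl = "\n".join(json.dumps(r) for r in records)  # one JSON object per line
```

Filtering matters because the student learns to imitate whatever it is shown, so only runs that the metric accepts should reach the training file.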

In [83]:
# First make a deep copy of your optimized MIPRO program as the teacher
teacher = mipro_twt_sentiment.deepcopy()

# Create student as a copy but with your target model
student = mipro_twt_sentiment.deepcopy()
student.set_lm(dspy.LM("gpt-4o-mini-2024-07-18"))  # e.g., mistral or whatever model you want to fine-tune

In [84]:
from dspy.teleprompt import BootstrapFinetune

bsft_optimizer = BootstrapFinetune(
    metric=validate_answer,          # Used to filter training data
    num_threads=16                   # For parallel processing
)

bsft_twt_sentiment = bsft_optimizer.compile(
    student=student,
    trainset=bsft_twitter_train,
    teacher=teacher
)

[BootstrapFinetune] Preparing the student and teacher programs...
[BootstrapFinetune] Bootstrapping data...
Average Metric: 362.00 / 500 (72.4%): 100%|██| 500/500 [00:00<00:00, 628.33it/s]


2024/12/30 00:58:17 INFO dspy.evaluate.evaluate: Average Metric: 362 / 500 (72.4%)


[BootstrapFinetune] Preparing the train data...
[BootstrapFinetune] Collected data for 500 examples
[BootstrapFinetune] After filtering with the metric, 362 examples remain
[BootstrapFinetune] Using 362 data points for fine-tuning the model: gpt-4o-mini-2024-07-18
[BootstrapFinetune] Starting LM fine-tuning...
[BootstrapFinetune] 1 fine-tuning job(s) to start
[BootstrapFinetune] Starting 1 fine-tuning job(s)...
[OpenAI Provider] Validating the data format
[OpenAI Provider] Saving the data to a file
[OpenAI Provider] Data saved to /Users/adamlucek/.dspy_cache/finetune/798b39e1a18373a3.jsonl
[OpenAI Provider] Uploading the data to the provider
[OpenAI Provider] Starting remote training
[OpenAI Provider] Job started with the OpenAI Job ID ftjob-L8D3vni8wlEyuCOAhIgzuFHF
[OpenAI Provider] Waiting for training to complete
[OpenAI Provider] 2024-12-30 00:58:23 Validating training file: file-Sh4DqQsYEY5UaqJEHGy37y
[OpenAI Provider] 2024-12-30 01:02:36 Fine-tuning job started
[OpenAI Provider] 

In [86]:
bsft_scores = []
for x in bsft_twitter_test:
    pred = bsft_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    bsft_scores.append(score)

bsft_accuracy = bsft_scores.count(True) / len(bsft_scores)
print("Bootstrap Fine Tune Accuracy: ", bsft_accuracy)

Bootstrap Fine Tune Accuracy:  0.725


In [85]:
bsft_twt_sentiment.save("./optimized/bsft_twt_sentiment.pkl")

In [151]:
print(bsft_twt_sentiment(tweet=example_tweet).sentiment)

positive


### Choosing an Optimizer

From DSPy's [Documentation](https://dspy.ai/learn/optimization/optimizers):

- If you have very few examples (around 10), start with `BootstrapFewShot`.
- If you have more data (50 examples or more), try `BootstrapFewShotWithRandomSearch`.
- If you prefer to do instruction optimization only (i.e. you want to keep your prompt 0-shot), use `MIPROv2` configured for 0-shot optimization.
- If you’re willing to use more inference calls to perform longer optimization runs (e.g. 40 trials or more), and have enough data (e.g. 200 examples or more to prevent overfitting) then try `MIPROv2`.
- If you have been able to use one of these with a large LM (e.g., 7B parameters or above) and need a very efficient program, finetune a small LM for your task with `BootstrapFinetune`.

Can't choose one? Try the [Ensemble](https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/ensemble.py) compiler to combine multiple optimized programs, then reduce their outputs in some way (e.g. majority vote, weighted majority vote) to get a final output!
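The reduction step is the interesting part, and a majority vote is the simplest version of it. The sketch below uses three hard-coded functions as stand-ins for optimized programs (all names are hypothetical) so that it runs without any model calls.

```python
from collections import Counter

def majority_vote(programs, tweet):
    """Run every program on the input and return the most common label."""
    votes = [program(tweet) for program in programs]
    label, _count = Counter(votes).most_common(1)[0]
    return label, votes

# Stand-ins for optimized programs: simple rules, not LLM calls.
def program_a(tweet):
    return "positive" if "nice" in tweet.lower() else "neutral"

def program_b(tweet):
    return "negative" if "not" in tweet.lower() else "positive"

def program_c(tweet):
    return "positive" if "!" in tweet else "neutral"

label, votes = majority_vote(
    [program_a, program_b, program_c], "Btw, nice colour, not burnt!"
)
```

Two of the three programs agree, so the outlier is overruled. With real programs, each vote is a separate set of LLM calls, so an ensemble multiplies inference cost by its size.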

<img src="./media/ensemble_diagram.png" width=800>

### Optimizing Optimized Programs

As emphasized, running just one iteration of optimization is usually not enough. Iterate on your metrics, on your programs, and on the metrics used inside your programs!

DSPy has a built-in optimizer that encourages this, **[BetterTogether](https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/bettertogether.py)**

<img src="./media/better_together.png" width=400>

But we'll go ahead and do it manually to see if it makes a difference!

**Grabbing Unseen Data**

In [100]:
import json
# Formatting Examples
final_twitter_train = []
final_twitter_test = []
train_size = 300  # how many for train 
test_size = 500    # how many for test
start_row = 1500   # start reading from this row

with open("./datasets/tweets.jsonl", 'r', encoding='utf-8') as f:
   for i, line in enumerate(f):
       # Skip until we reach start_row
       if i < start_row:
           continue
           
       # Adjust the index for our collection logic
       collection_index = i - start_row
       
       if collection_index >= (train_size + test_size):
           break
           
       data = json.loads(line.strip())
       example = dspy.Example(
           tweet=data['text'],
           sentiment=data['label_text']
       ).with_inputs("tweet")
       
       if collection_index < train_size:
           final_twitter_train.append(example)
       else:
           final_twitter_test.append(example)
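This is the third time the notebook reads a slice of the same file, so the logic could be folded into one helper. A possible sketch follows, with `load_split` as a hypothetical name and plain dicts in place of `dspy.Example` so that it runs without DSPy. The demo writes a throwaway file of synthetic rows rather than touching `tweets.jsonl`.

```python
import json
import os
import tempfile
from itertools import islice

def load_split(path, train_size, test_size, start_row=0):
    """Read train_size + test_size rows starting at start_row, then split them."""
    train, test = [], []
    with open(path, "r", encoding="utf-8") as f:
        rows = islice(f, start_row, start_row + train_size + test_size)
        for i, line in enumerate(rows):
            record = json.loads(line)
            item = {"tweet": record["text"], "sentiment": record["label_text"]}
            (train if i < train_size else test).append(item)
    return train, test

# Demo on a temporary file of 10 synthetic rows.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for i in range(10):
        tmp.write(json.dumps({"text": f"tweet {i}", "label_text": "neutral"}) + "\n")

toy_train, toy_test = load_split(tmp.name, train_size=3, test_size=4, start_row=2)
os.remove(tmp.name)
```

In the notebook, each dict would be wrapped as `dspy.Example(**item).with_inputs("tweet")`, and passing `start_row=1500` would reproduce the split above.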

**Optimizing our Fine Tuned Program with MIPROv2**

In [None]:
mipro_optimizer = MIPROv2(
    metric=validate_answer,
    prompt_model= dspy.LM('openai/gpt-4o'), # Different Model for Prompt Generation
    num_candidates=10,                      # Instructions to try
)

mipro_bsft_twt_sentiment = mipro_optimizer.compile(bsft_twt_sentiment, trainset=final_twitter_train, valset=final_twitter_test)

In [103]:
final_scores = []
for x in final_twitter_test:
    pred = mipro_bsft_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    final_scores.append(score)

mipro_bsft_accuracy = final_scores.count(True) / len(final_scores)
print("MIPROv2 After Bootstrap Fine Tune Accuracy: ", mipro_bsft_accuracy)

MIPROv2 After Bootstrap Fine Tune Accuracy:  0.744


In [98]:
mipro_bsft_twt_sentiment.save("./optimized/mipro_bsft_twt_sentiment.pkl")

In [149]:
print(mipro_bsft_twt_sentiment(tweet=example_tweet).sentiment)

positive


---
## Final Thoughts

Check out DSPy's [official documentation](https://dspy.ai/), which this notebook is essentially a code-forward exploration of. They have plenty more [tutorials](https://dspy.ai/tutorials/) and [guides](https://dspy.ai/learn/) that are actively being updated as part of their latest (Dec 2024) release!

Overall, DSPy provides an interesting approach to applying language models within programs, replacing prompt trial and error with rigour around clear metric definition and optimization. Rather than working with text strings that are difficult to interpret or tune, it offers a clean base template that can be further optimized algorithmically: by automatically selecting or generating few-shot examples, by directly changing the instructions given to the LLM, or by a combination of the two.

Inspired by deep learning frameworks, DSPy offers a powerful way to reliably optimize and iterate on LLM applications in a systematic and controlled way, with the entire ecosystem growing by the day. Go give [the DSPy repo](https://github.com/stanfordnlp/dspy/tree/main) a star!