# dspy methodology 101

1. programming
   1. LMs (tasks)
   2. signatures (i/o - eg `"context: list[str], question: str -> answer: str"`) - compiling leads to better prompts than humans write
      1. describe the task: tell the model what it needs to do
      2. the underlying DSPy compiler handles the optimization, so you don't maintain brittle hand-written prompts
   3. modules (ie `dspy.Predict`, `dspy.ChainOfThought`)
      1. prompting techniques
2. evaluation
3. optimization
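
To make the signature notation concrete, here is a minimal sketch that splits a signature string into its input and output fields. `parse_signature` is a hypothetical helper written for illustration only; it is not DSPy's own parser, and it assumes field types contain no commas.

```python
def parse_signature(signature: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split 'a: T, b: U -> c: V' into ({input: type}, {output: type}).

    Hypothetical helper for illustration; assumes types contain no commas.
    """
    def parse_fields(part: str) -> dict[str, str]:
        fields = {}
        for field in part.split(","):
            name, _, type_name = field.partition(":")
            # Fields without a type annotation default to "str" in this sketch.
            fields[name.strip()] = type_name.strip() or "str"
        return fields

    inputs_part, arrow, outputs_part = signature.partition("->")
    if not arrow:
        raise ValueError("a signature needs '->' between inputs and outputs")
    return parse_fields(inputs_part), parse_fields(outputs_part)


inputs, outputs = parse_signature("context: list[str], question: str -> answer: str")
print(inputs)   # {'context': 'list[str]', 'question': 'str'}
print(outputs)  # {'answer': 'str'}
```

Everything left of `->` is what the module receives, and everything right of it is what the module must produce.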

## TOC:
* [intro](#dspy-methodology-101)
* [LMs](#set-a-generator-lm)
* [evaluations 101](#dspy-evaluations)
* [data](#dspy-data)
  * [example-objects](#dspy-example-objects)
* [metrics](#dspy-metrics)
* [evaluations](#dspy-evaluations)

## set a generator LM

<a class="anchor" id="LM"></a>

In [2]:
import dspy
import os

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")

lm = dspy.LM(
    'together_ai/deepseek-ai/DeepSeek-R1',
    temperature=0.1,
    max_tokens=2500,
    stop=None,
    cache=False,
    api_key=TOGETHER_API_KEY,
)
dspy.configure(lm=lm)

## DSPy evaluations

- define your DSPy metric
  - what makes outputs from your system good or bad?
    - A metric is a function that takes an example from your data and the output of your system, and returns a score.
    - DSPy uses `Example` objects throughout to train, test, and improve AI models.
- no labels, just inputs

## DSPy data

- inputs
- intermediate labels 
- final label

### DSPy Example objects

- Examples represent items in your training set and test set
- Examples = AI-friendly data containers

**ML**:

- `Example` objects have a `with_inputs()` method that marks specific fields as inputs. The remaining fields are metadata or [labels](https://toloka.ai/blog/machine-learning-labels-and-features/). A label tells an ML model what a particular piece of data represents so that it can learn from the example.

- Inputs (Features) → The data you give to the model to make a prediction.
  - **Example**: A picture of a cat, a sentence, or a set of numbers.
- Labels (Targets/Outputs) → The correct answer the model should learn to predict.
  - **Example**: The word "cat" for an image classification model, or the correct sentiment (positive/negative) for a text review.

**Example**:

If you're training a spam detector:
**Input**: An email's text
**Label**: "Spam" or "Not Spam"
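
The input/label split can be sketched in plain Python. The record and the `split_record` helper below are hypothetical; they only mirror the idea behind `with_inputs()` and are not part of DSPy.

```python
def split_record(record: dict, input_keys: set[str]) -> tuple[dict, dict]:
    """Separate a record into model inputs and everything else (labels/metadata)."""
    inputs = {k: v for k, v in record.items() if k in input_keys}
    labels = {k: v for k, v in record.items() if k not in input_keys}
    return inputs, labels


# Hypothetical spam-detection record: the email text is the input,
# the spam verdict is the label the model should learn to predict.
email = {"text": "You won a free cruise, click here!", "label": "Spam"}

inputs, labels = split_record(email, input_keys={"text"})
print(inputs)  # {'text': 'You won a free cruise, click here!'}
print(labels)  # {'label': 'Spam'}
```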

```python
# Single Input.
print(qa_pair.with_inputs("question"))

# Multiple Inputs; be careful about marking your labels as inputs unless you mean it.
print(qa_pair.with_inputs("question", "answer"))
```

In [3]:
qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.


In [4]:
# Single Input.
print(qa_pair.with_inputs("question"))

# Multiple Inputs; be careful about marking your labels as inputs unless you mean it.
print(qa_pair.with_inputs("question", "answer"))

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys={'question'})
Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys={'answer', 'question'})


In [5]:
article_summary = dspy.Example(article= "This is an article.", summary= "This is a summary.").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example object with Input fields only:", input_key_only)
print("Example object with Non-Input fields only:", non_input_key_only)

Example object with Input fields only: Example({'article': 'This is an article.'}) (input_keys={'article'})
Example object with Non-Input fields only: Example({'summary': 'This is a summary.'}) (input_keys=None)



## DSPy metrics

A metric is just a function that takes an example from your data and the output of your system, and returns a score that quantifies how good the output is.

In other words, it separates good outputs from bad ones.

For simple tasks, this could be just:
  - "accuracy"
  - "exact match"
  - "F1 score" (the harmonic mean of precision and recall)
    - 1 (or 100%) → Perfect model
    - 0 → Totally useless
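
As a rough illustration of the simple metrics listed above, the sketch below implements exact match and a token-level F1 score on plain strings. The function names and the whitespace-based normalization are assumptions made for this example; DSPy's built-in metrics may normalize text differently.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()


def exact_match(gold: str, pred: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(gold) == normalize(pred)


def token_f1(gold: str, pred: str) -> float:
    """Harmonic mean of token-level precision and recall, between 0.0 and 1.0."""
    gold_tokens, pred_tokens = normalize(gold), normalize(pred)
    # The multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))                           # True
print(round(token_f1("the eiffel tower", "eiffel tower"), 2))  # 0.8
```

In the second call, precision is 1.0 (both predicted tokens are correct) and recall is 2/3 (one gold token is missing), which gives an F1 of 0.8.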


In [None]:
def dspy_metric(example, pred):
    """
    A DSPy metric function.

    Parameters:
    - example: An example from your training or dev set.
    - pred: The output prediction from your DSPy program.

    Returns:
    - score: A float, int, or bool score.
    """
    # Your metric calculation logic here
    score = calculate_score(example, pred)
    return score

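`trace` is an optional third argument. It is `None` during evaluation and is set during compilation (optimization), so a metric can use it to behave differently in each mode.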

In [None]:
def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

<p align=center>During compiling (optimization), DSPy traces your LM calls. The trace contains the inputs and outputs of each DSPy predictor, which you can use to validate intermediate steps during optimization.</p>

In [None]:
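def validate_hops(example, pred, trace=None):
    # Collect the original question plus every search query recorded in the trace.
    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    # Reject overly long queries.
    if max(len(h) for h in hops) > 100:
        return False
    # Reject a query that largely repeats an earlier hop (0.8 match threshold).
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))):
        return False

    return True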

A couple of common built-in metrics:
- `dspy.evaluate.metrics.answer_exact_match`
- `dspy.evaluate.metrics.answer_passage_match`

## DSPy evaluations

In [None]:
scores = []
for x in devset:
    pred = program(**x.inputs())
    score = metric(x, pred)
    scores.append(score)

The built-in `Evaluate` utility helps with things like parallel evaluation (multiple threads) and showing a sample of inputs/outputs alongside the metric scores.

In [None]:
from dspy.evaluate import Evaluate

# Set up the evaluator, which can be re-used in your code.
evaluator = Evaluate(devset=YOUR_DEVSET, num_threads=1, display_progress=True, display_table=5)

# Launch evaluation.
evaluator(YOUR_PROGRAM, metric=YOUR_METRIC)
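
To give a feel for what threaded evaluation involves, here is a minimal standard-library sketch. It is not how `Evaluate` is implemented; `evaluate_in_parallel`, `toy_program`, and the tiny devset are hypothetical stand-ins so the example runs without an LM.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def evaluate_in_parallel(program, devset, metric, num_threads=4):
    """Score every example on a thread pool and return the mean score."""
    def score_one(example):
        pred = program(example.question)
        return float(metric(example, pred))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, devset))
    return sum(scores) / len(scores)


# Hypothetical stand-ins: a toy "program" that adds two numbers.
devset = [
    SimpleNamespace(question="2 + 2", answer="4"),
    SimpleNamespace(question="3 + 3", answer="6"),
    SimpleNamespace(question="5 + 5", answer="11"),  # deliberately wrong label
]


def toy_program(question):
    left, right = question.split(" + ")
    return SimpleNamespace(answer=str(int(left) + int(right)))


def exact_match(example, pred, trace=None):
    return example.answer == pred.answer


# Two of the three labels match the program's output.
average = evaluate_in_parallel(toy_program, devset, exact_match, num_threads=2)
print(f"{average:.2f}")
```

Threads suit this workload because a real program spends most of its time waiting on LM responses rather than computing.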

In [9]:
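# `Assess` is not defined elsewhere in this notebook; this definition is an
# assumption that follows the example in the DSPy documentation.
class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer: bool = dspy.OutputField()

def metric(gold, pred, trace=None):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    correct =  dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
    engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

    correct, engaging = [m.assessment_answer for m in [correct, engaging]]
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    if trace is not None: return score >= 2
    return score / 2.0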

<p align=center>When compiling, trace is not None, and we want to be strict about judging things, so we will only return True if score >= 2. Otherwise, we return a score out of 1.0 (i.e., score / 2.0).</p>
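
The two return modes can be tried without an LM. The sketch below is a simplified variant, not the metric above: plain string checks stand in for the LM-based assessments, and `tweet_metric` plus the sample objects are hypothetical.

```python
from types import SimpleNamespace


def tweet_metric(gold, pred, trace=None):
    """Return a float score for evaluation, or a strict bool when trace is set."""
    tweet = pred.output
    # Cheap string checks stand in for the LM-based assessments.
    correct = gold.answer.lower() in tweet.lower()
    fits = len(tweet) <= 280
    score = (int(correct) + int(fits)) if correct else 0

    if trace is not None:
        # Compilation: accept only outputs that pass every check.
        return score >= 2
    # Evaluation: report a graded score between 0.0 and 1.0.
    return score / 2.0


gold = SimpleNamespace(question="What is the capital of France?", answer="Paris")
good = SimpleNamespace(output="Paris is the capital of France.")
bad = SimpleNamespace(output="Lyon is lovely in spring.")

print(tweet_metric(gold, good))            # 1.0
print(tweet_metric(gold, good, trace=[]))  # True
print(tweet_metric(gold, bad))             # 0.0
print(tweet_metric(gold, bad, trace=[]))   # False
```

Passing any non-`None` value for `trace`, even an empty list, switches the metric to its strict boolean mode.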

> If your metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) your metric itself. That's usually easy because the output of the metric is usually a simple value (e.g., a score out of 5) so the metric's metric is easy to define and optimize by collecting a few examples.