# DSPy Quickstart

[DSPy](https://dspy-docs.vercel.app/) simplifies building language model (LM) pipelines by replacing manual prompt engineering with structured "text transformation graphs." These graphs use flexible, learning modules that automate and optimize LM tasks like reasoning, retrieval, and answering complex questions. 

## How does it work?
At a high level, DSPy optimizes prompts, selects the best language model, and can even fine-tune the model using training data.

The process follows these steps, common to most DSPy [optimizers](https://dspy.ai/learn/optimization/optimizers/):

1. **Candidate Generation**: DSPy finds all `Predict` modules in the program and generates variations of instructions and demonstrations (e.g., examples for prompts). This step creates a set of possible candidates for the next stage.
2. **Parameter Optimization**: DSPy then uses methods like random search, TPE, or Optuna to select the best candidate. Fine-tuning models can also be done at this stage.
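
The two steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not DSPy's implementation: the candidate instructions, the demo pool, and the `toy_score` function are hypothetical stand-ins, whereas a real optimizer would score each candidate by running the program against an LM and a metric.

```python
import itertools
import random

# Hypothetical candidate pieces; a real optimizer proposes these automatically.
INSTRUCTIONS = [
    "Classify the text.",
    "Classify the news text into exactly one topic label.",
]
DEMO_POOL = ["demo_a", "demo_b", "demo_c"]


def generate_candidates(max_demos: int = 2) -> list[dict]:
    """Step 1: enumerate instruction/demonstration combinations."""
    candidates = []
    for instruction in INSTRUCTIONS:
        for k in range(max_demos + 1):
            for demos in itertools.combinations(DEMO_POOL, k):
                candidates.append({"instruction": instruction, "demos": demos})
    return candidates


def toy_score(candidate: dict) -> float:
    """Stand-in for evaluating a candidate on a dev set (an assumption, not a real metric)."""
    return len(candidate["instruction"]) / 100 + 0.1 * len(candidate["demos"])


def random_search(candidates: list[dict], n_trials: int, seed: int = 0) -> dict:
    """Step 2: score a random subset of candidates and keep the best one."""
    rng = random.Random(seed)
    trials = rng.sample(candidates, k=min(n_trials, len(candidates)))
    return max(trials, key=toy_score)


candidates = generate_candidates()
best = random_search(candidates, n_trials=len(candidates))
print(len(candidates), best["instruction"], len(best["demos"]))
```

In this sketch the search space is tiny, so every candidate can be scored. With real LM calls each evaluation costs time and money, which is why optimizers sample or prune candidates instead of scoring all of them.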

## This Demo
Below we create a simple program that demonstrates the power of DSPy. We will build a text classifier leveraging OpenAI. By the end of this tutorial, we will...

1. Define a [dspy.Signature](https://dspy.ai/learn/programming/signatures/) and [dspy.Module](https://dspy.ai/learn/programming/modules/) to perform text classification.
2. Leverage [dspy.SIMBA](https://dspy.ai/api/optimizers/SIMBA/) to compile our module so it's better at classifying our text.
3. Analyze internal steps with MLflow Tracing.
4. Log the compiled model with MLflow.
5. Load the logged model and perform inference.

In [1]:
%pip install -U datasets openai "dspy>=3.0.3" "mlflow>=3.4.0"


## Setup


### Set Up LLM

After installing the relevant dependencies, let's set up access to an OpenAI LLM. Here, we will leverage OpenAI's `gpt-4o-mini` model.

In [2]:
# Set OpenAI API Key to the environment variable. You can also pass the token to dspy.LM()
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI Key:")
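
If the key is missing or blank, every later LM call fails with an authentication error. As a minimal sketch, the helper below prompts only when the variable is not already set and refuses an empty value. The name `ensure_api_key` and its `prompt` parameter are hypothetical and are not part of DSPy or MLflow.

```python
import getpass
import os
from typing import Callable


def ensure_api_key(
    var_name: str = "OPENAI_API_KEY",
    prompt: Callable[[str], str] = getpass.getpass,
) -> str:
    """Return the key from the environment, prompting only when it is missing or blank."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        key = prompt(f"Enter a value for {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} is empty; LM calls would fail without it.")
        # Store the key so libraries that read the environment can find it.
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` once before constructing the LM would replace the unconditional prompt above, so re-running the notebook does not ask for the key again.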

In [3]:
import dspy

# Define your model. We will use OpenAI for simplicity
model_name = "gpt-4o-mini"

# Note that an OPENAI_API_KEY environment variable must be present. You can also pass the token to dspy.LM()
lm = dspy.LM(
    model=f"openai/{model_name}",
    max_tokens=500,
    temperature=0.1,
)
dspy.settings.configure(lm=lm)

### Create MLflow Experiment

Create a new MLflow Experiment to track your DSPy models, metrics, parameters, and traces in one place. Although a "default" experiment already exists in your workspace, we highly recommend creating a separate experiment for each task to keep experiment artifacts organized.


In [4]:
import mlflow

mlflow.set_experiment("DSPy Quickstart")

2025/10/25 16:46:02 INFO mlflow.tracking.fluent: Experiment with name 'DSPy Quickstart' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///c:/SMU_work/Year%203%20Sem%201/Gen%20AI/Project/ANEETAA/mlruns/482380963199463283', creation_time=1761381962226, experiment_id='482380963199463283', last_update_time=1761381962226, lifecycle_stage='active', name='DSPy Quickstart', tags={}>

### Turn on Auto Tracing with MLflow

[MLflow Tracing](https://mlflow.org/docs/latest/llms/tracing/index.html) is a powerful observability tool for monitoring and debugging what happens inside your DSPy modules, helping you identify potential bottlenecks or issues quickly. To enable DSPy tracing, you just need to call `mlflow.dspy.autolog` and that's it!

In [5]:
mlflow.dspy.autolog()

### Set Up Data

Next, we will download the [Reuters 21578](https://huggingface.co/datasets/yangwang825/reuters-21578) dataset from Hugging Face. We also write a utility that samples a fixed number of examples per label, so that every label in the test split also appears in the train split.

In [6]:
import numpy as np
import pandas as pd
from datasets import load_dataset
from dspy.datasets.dataset import Dataset


def read_data_and_subset_to_categories() -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Read the reuters-21578 dataset. Docs can be found in the url below:
    https://huggingface.co/datasets/yangwang825/reuters-21578
    """

    # Read train/test split
    dataset = load_dataset("yangwang825/reuters-21578")
    train = pd.DataFrame(dataset["train"])
    test = pd.DataFrame(dataset["test"])

    # Clean the labels
    label_map = {
        0: "acq",
        1: "crude",
        2: "earn",
        3: "grain",
        4: "interest",
        5: "money-fx",
        6: "ship",
        7: "trade",
    }

    train["label"] = train["label"].map(label_map)
    test["label"] = test["label"].map(label_map)

    return train, test


class CSVDataset(Dataset):
    def __init__(
        self, n_train_per_label: int = 20, n_test_per_label: int = 10, *args, **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.n_train_per_label = n_train_per_label
        self.n_test_per_label = n_test_per_label

        self._create_train_test_split_and_ensure_labels()

    def _create_train_test_split_and_ensure_labels(self) -> None:
        """Perform a train/test split that ensures labels in `dev` are also in `train`."""
        # Read the data
        train_df, test_df = read_data_and_subset_to_categories()

        # Sample for each label
        train_samples_df = pd.concat(
            [group.sample(n=self.n_train_per_label) for _, group in train_df.groupby("label")]
        )
        test_samples_df = pd.concat(
            [group.sample(n=self.n_test_per_label) for _, group in test_df.groupby("label")]
        )

        # Set DSPy class variables
        self._train = train_samples_df.to_dict(orient="records")
        self._dev = test_samples_df.to_dict(orient="records")


# Limit to a small dataset to showcase the value of bootstrapping
dataset = CSVDataset(n_train_per_label=3, n_test_per_label=1)

# Create train and test sets containing DSPy examples
# Note that we must specify the expected input value name
train_dataset = [example.with_inputs("text") for example in dataset.train]
test_dataset = [example.with_inputs("text") for example in dataset.dev]
unique_train_labels = {example.label for example in dataset.train}

print(len(train_dataset), len(test_dataset))
print(f"Train labels: {unique_train_labels}")
print(train_dataset[0])


24 8
Train labels: {'grain', 'ship', 'crude', 'earn', 'trade', 'acq', 'money-fx', 'interest'}
Example({'text': 'spain deregulates bank deposit interest rates spain s finance ministry deregulated bank deposit rates in an effort to raise competition among banks and bring legislation into line with the european community ec a ministry spokesman said the measure was published today in the official state gazette it takes effect tomorrow and lifts restrictions on rates now limited to six pct on deposits of up to days the government also enacted a decree cutting to one pct from pct the proportion of total assets which banks must lend at favourable rates to industries classified of public interest some bankers expect the deregulation of rates to result in a pct drop in profits this year secretary of state for the economy guillermo de la dehesa told reuters in a recent interview the reduction in fixed asset investments would offset losses from the rate liberalisation reuter', 'label': 'interest
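
The per-label sampling in the cell above is not seeded, so each run draws different examples. As a hedged sketch of the same idea in plain Python, the helper below groups records by label and draws a fixed number from each group with a seeded generator. The `sample_per_label` name and the toy records are hypothetical. In the pandas version, passing `random_state` to `group.sample` should have a similar effect.

```python
import random
from collections import Counter, defaultdict


def sample_per_label(records: list[dict], n_per_label: int, seed: int = 0) -> list[dict]:
    """Draw exactly `n_per_label` records for every label, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    sampled = []
    for label in sorted(by_label):  # sorted so the draw order is stable
        group = by_label[label]
        if len(group) < n_per_label:
            raise ValueError(f"Label {label!r} has only {len(group)} records.")
        sampled.extend(rng.sample(group, n_per_label))
    return sampled


# Toy records standing in for the Reuters rows (illustrative only).
toy_records = [
    {"text": f"{label} story {i}", "label": label}
    for label in ("acq", "crude", "earn")
    for i in range(5)
]
train_sample = sample_per_label(toy_records, n_per_label=3)
print(Counter(record["label"] for record in train_sample))
```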

### Set up DSPy Signature and Module

Finally, we will define our task: text classification.

There are a variety of ways you can provide guidelines to DSPy signature behavior. Currently, DSPy allows users to specify:

1. A high-level goal via the class docstring.
2. A set of input fields, with optional metadata.
3. A set of output fields with optional metadata.

DSPy will then leverage this information to inform optimization. 

In the example below, note that we simply list the expected labels in the `desc` of the `label` output field in the `TextClassificationSignature` class. From this initial state, we'll use DSPy to improve our classifier's accuracy.

In [7]:
class TextClassificationSignature(dspy.Signature):
    text = dspy.InputField()
    label = dspy.OutputField(
        desc=f"Label of predicted class. Possible labels are {unique_train_labels}"
    )


class TextClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_classification = dspy.Predict(TextClassificationSignature)

    def forward(self, text: str):
        return self.generate_classification(text=text)
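
One detail worth noting: interpolating a Python `set` into an f-string embeds its `repr`, and the element order of a set of strings can differ between interpreter sessions. A minimal sketch of one way to keep the description text stable is to sort the labels first. The `build_label_desc` helper is hypothetical, and whether the ordering affects accuracy is not tested here.

```python
def build_label_desc(labels: set[str]) -> str:
    """Build an output-field description with labels in a fixed, sorted order."""
    return "Label of predicted class. Possible labels are: " + ", ".join(sorted(labels))


unique_labels = {"grain", "ship", "crude", "earn", "trade", "acq", "money-fx", "interest"}
print(build_label_desc(unique_labels))
# Label of predicted class. Possible labels are: acq, crude, earn, grain, interest, money-fx, ship, trade
```

The resulting string could be passed as `desc=` to the output field in place of the f-string above, which keeps the prompt text identical from one run to the next.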

## Run it!

### Hello World
Let's demonstrate predicting via the DSPy module and associated signature. Even before any optimization, the program picks up our labels from the signature's `desc` field and generates reasonable predictions.

In [8]:
# Initialize our text classifier module
text_classifier = TextClassifier()

message = "I am interested in space"
print(text_classifier(text=message))

message = "I enjoy ice skating"
print(text_classifier(text=message))




### Review Traces

1. Open the MLflow UI and select the `"DSPy Quickstart"` experiment.
2. Go to the `"Traces"` tab to view the generated traces.

Now, you can observe how DSPy translates your query and interacts with the LLM. This feature is extremely valuable for debugging, iteratively refining components within your system, and monitoring models in production. While the module in this tutorial is relatively simple, the tracing feature becomes even more powerful as your model grows in complexity.

![MLflow DSPy Trace](/images/llms/dspy/dspy-trace.png)


## Compilation



### Training

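To train, we will leverage [SIMBA](https://dspy.ai/api/optimizers/SIMBA/) (Stochastic Introspective Mini-Batch Ascent), an optimizer that samples mini-batches from our training set, looks for examples where the program's outputs vary the most, and then either adds successful examples as demonstrations or appends self-generated improvement rules to the prompt.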

Note that in the below example, we leverage a simple metric definition of exact match, as defined in `validate_classification`, but [dspy.Metrics](https://dspy.ai/learn/evaluation/metrics/) can contain complex and LM-based logic to properly evaluate our accuracy.

In [None]:
from dspy import SIMBA


def validate_classification(example, prediction, trace=None) -> bool:
    return example.label == prediction.label


optimizer = SIMBA(
    metric=validate_classification,
    max_demos=2,
    bsize=12,
    num_threads=1,
)

compiled_pe = optimizer.compile(TextClassifier(), trainset=train_dataset)
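
Strict equality can penalize answers that differ only in case, whitespace, or stray quotes. As a hedged sketch, the variant below normalizes both sides before comparing. `validate_classification_normalized` is a hypothetical name, and `SimpleNamespace` objects stand in for DSPy examples and predictions so the snippet runs without an LM.

```python
from types import SimpleNamespace


def validate_classification_normalized(example, prediction, trace=None) -> bool:
    """Exact match after trimming whitespace, surrounding quotes, and case."""
    expected = str(example.label).strip().lower()
    predicted = str(getattr(prediction, "label", "")).strip().strip("'\"").lower()
    return expected == predicted


example = SimpleNamespace(label="money-fx")
print(validate_classification_normalized(example, SimpleNamespace(label=" Money-FX ")))  # True
print(validate_classification_normalized(example, SimpleNamespace(label="interest")))  # False
```

Whether this leniency is appropriate depends on how the predicted label is consumed downstream. If an exact string is required, the stricter metric above is the safer choice.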

### Compare Pre/Post Compiled Accuracy

Finally, let's explore how well our trained model can predict on unseen test data. 

In [None]:
def check_accuracy(
    classifier, test_data: list[dspy.Example] = test_dataset
) -> tuple[list[int], list[dspy.Prediction]]:
    residuals = []
    predictions = []
    for example in test_data:
        prediction = classifier(text=example["text"])
        residuals.append(int(validate_classification(example, prediction)))
        predictions.append(prediction)
    return residuals, predictions


uncompiled_residuals, uncompiled_predictions = check_accuracy(TextClassifier())
print(f"Uncompiled accuracy: {np.mean(uncompiled_residuals)}")

compiled_residuals, compiled_predictions = check_accuracy(compiled_pe)
print(f"Compiled accuracy: {np.mean(compiled_residuals)}")

Uncompiled accuracy: 0.875
Compiled accuracy: 1.0


As shown above, our uncompiled accuracy is non-zero: the base LLM inferred the meaning of the classification labels simply from our initial prompt. With DSPy training, however, the prompts and demonstrations have been updated to bring our model to 100% accuracy on unseen data. That's a gain of 12.5 percentage points, although with only 8 test examples this corresponds to a single additional correct prediction.
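
To make the arithmetic concrete, here is a small sketch of how the reported accuracies translate into a gain. The residual lists are illustrative, constructed only to match the two accuracies printed above. They are not the notebook's actual per-example results.

```python
def accuracy(residuals: list[int]) -> float:
    """Fraction of correct predictions, where 1 means correct and 0 means incorrect."""
    return sum(residuals) / len(residuals)


# Illustrative residuals for an 8-example test set.
uncompiled = [1] * 7 + [0]  # 7 of 8 correct
compiled = [1] * 8  # 8 of 8 correct

gain_pp = (accuracy(compiled) - accuracy(uncompiled)) * 100
print(accuracy(uncompiled), accuracy(compiled), gain_pp)  # 0.875 1.0 12.5
```

Because each test example accounts for 12.5 percentage points here, a larger test set would be needed to draw firm conclusions about the size of the improvement.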

Let's take a look at each prediction in our test set.

In [None]:
for uncompiled_residual, uncompiled_prediction in zip(uncompiled_residuals, uncompiled_predictions):
    is_correct = "Correct" if bool(uncompiled_residual) else "Incorrect"
    prediction = uncompiled_prediction.label
    print(f"{is_correct} prediction: {' ' * (12 - len(is_correct))}{prediction}")

Incorrect prediction:    money-fx
Correct prediction:      crude
Correct prediction:      money-fx
Correct prediction:      earn
Incorrect prediction:    interest
Correct prediction:      grain
Correct prediction:      trade
Incorrect prediction:    trade


In [None]:
for compiled_residual, compiled_prediction in zip(compiled_residuals, compiled_predictions):
    is_correct = "Correct" if bool(compiled_residual) else "Incorrect"
    prediction = compiled_prediction.label
    print(f"{is_correct} prediction: {' ' * (12 - len(is_correct))}{prediction}")

Correct prediction:      interest
Correct prediction:      crude
Correct prediction:      money-fx
Correct prediction:      earn
Correct prediction:      acq
Correct prediction:      grain
Correct prediction:      trade
Correct prediction:      ship
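
Printing the predicted label alone does not show which true label was missed. As a sketch, the helper below tallies mistakes as (true label, predicted label) pairs. `errors_by_label` is a hypothetical name, and the toy examples and predictions are made up for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def errors_by_label(examples, predictions) -> Counter:
    """Count misclassifications keyed by (true label, predicted label)."""
    errors = Counter()
    for example, prediction in zip(examples, predictions):
        if example.label != prediction.label:
            errors[(example.label, prediction.label)] += 1
    return errors


# Toy stand-ins for test examples and model predictions.
toy_examples = [SimpleNamespace(label=label) for label in ("interest", "acq", "ship", "earn")]
toy_predictions = [SimpleNamespace(label=label) for label in ("money-fx", "acq", "trade", "earn")]

print(errors_by_label(toy_examples, toy_predictions))
```

In the notebook, the same function could be called with `test_dataset` and the prediction lists returned by `check_accuracy`.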


## Log and Load the Model with MLflow

Now that we have a compiled model with higher classification accuracy, let's leverage MLflow to log this model and load it for inference.

In [None]:
import mlflow

with mlflow.start_run():
    model_info = mlflow.dspy.log_model(
        compiled_pe,
        name="model",
        input_example="what is 2 + 2?",
    )


Open the MLflow UI again and check that the compiled model is recorded in a new MLflow Run. You can now load the model back for inference using `mlflow.dspy.load_model` or `mlflow.pyfunc.load_model`.

💡 MLflow will remember the environment configuration stored in `dspy.settings`, such as the language model (LM) used during the experiment. This ensures excellent reproducibility for your experiment.

In [None]:
# Define input text
print("\n==============Input Text============")
text = test_dataset[0]["text"]
print(f"Text: {text}")

# Inference with original DSPy object
print("\n--------------Original DSPy Prediction------------")
print(compiled_pe(text=text).label)

# Inference with loaded DSPy object
print("\n--------------Loaded DSPy Prediction------------")
loaded_model_dspy = mlflow.dspy.load_model(model_info.model_uri)
print(loaded_model_dspy(text=text).label)

# Inference with MLflow PyFunc API
loaded_model_pyfunc = mlflow.pyfunc.load_model(model_info.model_uri)
print("\n--------------PyFunc Prediction------------")
print(loaded_model_pyfunc.predict(text)["label"])


Text: top discount rate at u k bill tender rises to pct

--------------Original DSPy Prediction------------
interest

--------------Loaded DSPy Prediction------------
interest

--------------PyFunc Prediction------------
interest


## Next Steps

This example demonstrates how DSPy works. Below are some potential extensions for improving this project, both with DSPy and MLflow.

### DSPy
* Use real-world data for the classifier.
* Experiment with different optimizers.
* For more in-depth examples, check out the [tutorials](https://dspy.ai/tutorials/) and [documentation](https://dspy.ai/learn/).

### MLflow
* Deploy the model using MLflow serving.
* Use MLflow to experiment with various optimization strategies.
* Track your DSPy experiments using [DSPy Optimizer Autologging](https://mlflow.org/docs/latest/genai/flavors/dspy/optimizer/).

Happy coding!