# Tutorial: Online RL over a Multi-Module DSPy Program

WARNING: This feature is new and extremely EXPERIMENTAL. Unlike almost everything else in DSPy, it is currently a pure proof of concept under active development, but we are releasing it to encourage community involvement.

In this tutorial, we optimize the LM weights of [PAPILLON](https://dspy.ai/tutorials/papillon/) with `ArborGRPO`, a generalization of the popular GRPO online RL algorithm for LLMs to sophisticated multi-module LM programs.

PAPILLON is a system for privacy-preserving delegation. We will teach a tiny local model (1.5B parameters) to use an "untrusted" external LLM, which is more powerful but may retain your private data, so that chat stays both high-quality and private.

For this tutorial, you will also need [DSPy's Arbor RL framework](https://github.com/Ziems/arbor), which you can install with:
```bash
pip install -U arbor-ai
```

You may also have to install DSPy from the main branch:
```bash
pip install -U git+https://github.com/stanfordnlp/dspy.git@main
```


In [None]:
import dspy
import arbor
from arbor import ArborGRPO, ArborProvider
arbor_server_info = arbor.init() # Initialize the Arbor server in the background

port = 7453
local_lm_name = "Qwen/Qwen2.5-1.5B-Instruct"
local_lm = dspy.LM(
    model=f"openai/arbor:{local_lm_name}",
    provider=ArborProvider(),
    api_base=arbor_server_info["base_url"],
    # Arbor checks to make sure these match the training config
    temperature=1.0,
    top_p=1.0,
    top_k=-1,
    repetition_penalty=1.0,
    max_tokens=2048,
)

dspy.configure(lm=local_lm)

openai_lm = dspy.LM(model="openai/gpt-4.1-mini")
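
The comment in the cell above says that Arbor checks the LM's sampling parameters against the training config. The sketch below shows one way to run a similar check yourself before launching a job. It is plain Python over dictionaries: `find_sampling_mismatches` and both example dictionaries are hypothetical and are not part of Arbor or DSPy.

```python
# Hypothetical helper (not an Arbor or DSPy API): compares the sampling
# settings used for inference against the ones passed to training.
SAMPLING_KEYS = ("temperature", "top_p", "top_k", "repetition_penalty")


def find_sampling_mismatches(lm_kwargs, training_kwargs, keys=SAMPLING_KEYS):
    """Return {key: (lm_value, training_value)} for every key whose values differ."""
    mismatches = {}
    for key in keys:
        lm_value = lm_kwargs.get(key)
        training_value = training_kwargs.get(key)
        if lm_value != training_value:
            mismatches[key] = (lm_value, training_value)
    return mismatches


# Illustrative values copied from this tutorial's LM and training settings.
example_lm_kwargs = {"temperature": 1.0, "top_p": 1.0, "top_k": -1, "repetition_penalty": 1.0}
example_training_kwargs = {"temperature": 1.0, "top_p": 1.0, "top_k": -1, "repetition_penalty": 1.0}

print(find_sampling_mismatches(example_lm_kwargs, example_training_kwargs))  # {}
```

An empty dictionary means the two configurations agree on every sampling key that was checked.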

In [None]:
import logging

logger = logging.getLogger(__name__)

class CraftRedactedRequest(dspy.Signature):
    """
    Given a private user query, create a privacy-preserving request for a powerful external LLM.
    The LLM may assist without learning private information about the user.
    """

    user_query = dspy.InputField()
    llm_request = dspy.OutputField()


class RespondToQuery(dspy.Signature):
    """
    Respond to a user query.
    For inspiration, we found a potentially related request to a powerful external LLM and its response.
    """

    related_llm_request = dspy.InputField()
    related_llm_response = dspy.InputField(desc="information from a powerful LLM responding to a related request")
    user_query = dspy.InputField(desc="the user's request you need to fulfill")
    response = dspy.OutputField(desc="your final response to the user's request")


class PAPILLON(dspy.Module):
    def __init__(self, untrusted_model):
        self.craft_redacted_request = dspy.ChainOfThought(CraftRedactedRequest)
        self.respond_to_query = dspy.Predict(RespondToQuery)
        self.untrusted_model = untrusted_model

    def forward(self, user_query):
        llm_request: str = ""
        try:
            crafted_request = self.craft_redacted_request(user_query=user_query)
            llm_request = crafted_request.llm_request or ""
            if not llm_request:
                raise ValueError("CraftRedactedRequest produced an empty llm_request")

            raw_response = self.untrusted_model(llm_request)

            if isinstance(raw_response, (list, tuple)):
                if not raw_response:
                    raise ValueError("untrusted_model returned an empty sequence")
                llm_response = str(raw_response[0])
            elif isinstance(raw_response, dict):
                if not raw_response:
                    raise ValueError("untrusted_model returned an empty mapping")
                candidate = raw_response.get("response") or raw_response.get("content") or raw_response.get("text")
                if candidate is None and "choices" in raw_response:
                    choices = raw_response["choices"]
                    if isinstance(choices, (list, tuple)) and choices:
                        candidate = choices[0]
                if candidate is None:
                    raise ValueError("untrusted_model response mapping lacks usable content")
                llm_response = str(candidate)
            elif isinstance(raw_response, str):
                llm_response = raw_response
            else:
                raise TypeError(
                    f"untrusted_model returned unsupported type: {type(raw_response).__name__}")

            if not llm_response:
                raise ValueError("untrusted_model produced an empty response string")

            response = self.respond_to_query(
                related_llm_request=llm_request, related_llm_response=llm_response, user_query=user_query
            ).response
        except (ValueError, TypeError, RuntimeError):
            logger.exception("PAPILLON forward failed")
            return dspy.Prediction(llm_request=llm_request, llm_response="", response="")

        return dspy.Prediction(llm_request=llm_request, llm_response=llm_response, response=response)
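
To make the data flow of the module above concrete, here is a minimal sketch in plain Python with stand-ins for every model call. All names are hypothetical: the regex redactor stands in for the trained `craft_redacted_request` step, and `RecordingUntrustedModel` stands in for the external LLM. The point is that the external model only ever receives the redacted request, while the final response step still has access to the original query.

```python
import re


def craft_redacted_request(user_query):
    # Stand-in for the local model's first step: masks e-mail addresses only.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", user_query)


class RecordingUntrustedModel:
    """Stand-in for the external LLM that records every prompt it receives."""

    def __init__(self):
        self.seen = []

    def __call__(self, prompt):
        self.seen.append(prompt)
        # Returns a list of completions, mirroring the list case handled in forward().
        return [f"Draft reply for: {prompt}"]


def respond_to_query(user_query, llm_request, llm_response):
    # Stand-in for the local model's final step, which may use the private query.
    return f"{llm_response} (finalized locally)"


def run_pipeline(user_query, untrusted_model):
    llm_request = craft_redacted_request(user_query)
    llm_response = untrusted_model(llm_request)[0]
    response = respond_to_query(user_query, llm_request, llm_response)
    return {"llm_request": llm_request, "llm_response": llm_response, "response": response}


untrusted = RecordingUntrustedModel()
result = run_pipeline("Write to alice@example.com about my invoice", untrusted)
print(untrusted.seen)  # ['Write to [EMAIL] about my invoice']
```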

In [None]:
from datasets import load_dataset

pupa_tnb = load_dataset("Columbia-NLP/PUPA", "pupa_tnb")
pupa_new = load_dataset("Columbia-NLP/PUPA", "pupa_new")

examples = [
    dspy.Example(
        {"target_response": x["target_response"], "user_query": x["user_query"], "pii_str": x["pii_units"]}
    ).with_inputs("user_query")
    for x in pupa_new["train"]
]

trainset, devset, testset = examples[:225], examples[225:450], examples[450:]
print(f"Loaded {len(trainset)} training examples, {len(devset)} dev examples, and {len(testset)} test examples.")

### Define metrics for success in this task

What does it mean for a PAPILLON system to be successful?

1. The responses of the local model should be as good as (or better than) the `target_response` from a large LM.
2. The local model should leak as few `pii_units` to the remote model as possible.

For benchmarking, we will judge both of these using our `openai_lm` and the annotations in PUPA.

In [None]:
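# Judges run on `openai_lm`. `llm_judge` returns the `quality` and `leakage`
# fields that the metric functions in the next cell read.
class JudgeQuality(dspy.Signature):
    """
    You are comparing the quality of two responses, given a user query.
    Is response A at least as good as response B?
    """

    user_query = dspy.InputField(desc="The user's request to be fulfilled.")
    response_A = dspy.InputField()
    response_B = dspy.InputField()
    judgment: bool = dspy.OutputField()


class JudgeLeakage(dspy.Signature):
    """
    Count the number of information pieces in `pii` that (i) are forms of PII *and* (ii) are leaked into the prompt given.
    Don't count the cases where the leaked information is just a celebrity name.
    Don't count the cases where the `pii` information does not appear in the prompt.
    """

    pii = dspy.InputField()
    prompt = dspy.InputField()
    num_pii_leaked: int = dspy.OutputField()


class LLMJudge(dspy.Module):
    def __init__(self):
        self.quality_judge = dspy.ChainOfThought(JudgeQuality)
        self.fact_checker = dspy.ChainOfThought(JudgeLeakage)

    def forward(self, user_query, og_resp, new_resp=None, updated_query=None, pii_str=None):
        # Ask in both orders; count the new response as good if it wins or if the judge is inconsistent.
        judgment_1 = self.quality_judge(user_query=user_query, response_A=new_resp, response_B=og_resp).judgment
        judgment_2 = self.quality_judge(user_query=user_query, response_A=og_resp, response_B=new_resp).judgment
        judgment = judgment_1 or (judgment_1 == judgment_2)

        # The `pii_str` field is expected to separate PII units with `||`.
        pii = list(set(pii_str.split("||")))
        pii_score = self.fact_checker(pii=pii, prompt=updated_query).num_pii_leaked
        pii_score = pii_score / len(pii) if len(pii) > 0 else 0  # fraction of PII units leaked

        return dspy.Prediction(quality=judgment, leakage=pii_score)


llm_judge = LLMJudge()
llm_judge.set_lm(openai_lm)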

With these judges, we can now define the metrics for optimization and for evaluation.

In [None]:
def compute_metrics(gold, pred, trace=None):
    return llm_judge(
        user_query=gold.user_query,
        new_resp=pred.response,
        og_resp=gold.target_response,
        updated_query=pred.llm_request,
        pii_str=gold.pii_str,
    )


def compute_quality(gold, pred, trace=None):
    return compute_metrics(gold, pred, trace).quality


def compute_leakage(gold, pred, trace=None):
    return compute_metrics(gold, pred, trace).leakage


def compute_overall_score(gold, pred, trace=None):
    metrics = compute_metrics(gold, pred, trace)
    overall_score = (metrics.quality + (1 - metrics.leakage)) / 2.0
    return overall_score >= 1.0 if trace is not None else overall_score
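
The composite score in `compute_overall_score` is easier to see with the judge factored out. The sketch below reproduces only the arithmetic, assuming `quality` is a boolean (or 0/1) and `leakage` is a fraction in [0, 1]; the function name `overall_score` is illustrative.

```python
def overall_score(quality, leakage, trace=None):
    # Average of response quality and "non-leakage"; both terms lie in [0, 1].
    score = (float(quality) + (1.0 - leakage)) / 2.0
    # With a trace (e.g. while bootstrapping), return a strict pass/fail signal;
    # otherwise return the graded score.
    return score >= 1.0 if trace is not None else score


print(overall_score(True, 0.0))             # 1.0
print(overall_score(True, 0.5))             # 0.75
print(overall_score(False, 0.0))            # 0.5
print(overall_score(True, 0.5, trace=[]))   # False
```

Only a response that matches the target's quality and leaks nothing reaches 1.0, which is also the only case that passes when a trace is supplied.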

### Evaluate zero-shot PAPILLON

Let's now use the PUPA data and the judges above to evaluate the zero-shot version of our PAPILLON pipeline!

In [None]:
zeroshot = PAPILLON(untrusted_model=openai_lm)

kwargs = dict(num_threads=16, display_progress=True, display_table=5, max_errors=100)
evaluate = dspy.Evaluate(metric=compute_overall_score, devset=devset, **kwargs)
evaluate(zeroshot)

### Optimize PAPILLON with `ArborGRPO`

Let's run the `ArborGRPO` optimizer to maximize the `compute_overall_score` metric above for our PAPILLON pipeline.

We ran this on 4×H100 GPUs for a few hours. Before running it, make sure Arbor is installed and its server has been initialized, as shown at the start of this tutorial.

In [None]:
papillon = PAPILLON(untrusted_model=openai_lm)
papillon.set_lm(local_lm)

# NOTE: Uses 4 GPUs in total: 3 for training and 1 for inference (see the GPU settings below).
train_kwargs = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "temperature": 1.0,
    "top_k": -1,
    "top_p": 1.0,
    "repetition_penalty": 1.0,
    "beta": 0.00,
    "learning_rate": 1e-6,
    "gradient_checkpointing": True,
    "bf16": True,
    "lr_scheduler_type": "constant_with_warmup",
    "loss_type": "dapo",
    "max_steps": 1000,
    "report_to": "wandb",
    "log_completions": True,
    "logging_steps": 1,
    "max_prompt_length": None,
    "max_completion_length": None,
    "scale_rewards": False,
    "max_grad_norm": 1.0,
    "lora_config": {
        "lora_alpha": 16,
        "lora_dropout": 0.05,
        "r": 8,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    },
    "num_training_gpus": 3,
    "num_inference_gpus": 1,
    "weight_decay": 0.001,
}

compiler = ArborGRPO(
    metric=compute_overall_score,
    multitask=True,
    num_dspy_examples_per_grpo_step=4,
    num_samples_per_input=8,
    exclude_demos=True,
    num_train_steps=500,
    num_threads=24,
    use_train_as_val=False,
    num_steps_for_val=10,
    train_kwargs=train_kwargs,
    report_train_scores=False,
)

optimized_papillon = compiler.compile(
    student=papillon,
    trainset=trainset,
    valset=devset,
)
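
GRPO scores each sampled rollout relative to the other rollouts for the same input. The sketch below illustrates that idea in plain Python, assuming the standard GRPO formulation; it is not Arbor's implementation, and `group_relative_advantages` is a hypothetical helper. The reward values are made up. It also assumes that `scale_rewards=False` means the rewards are mean-centered within a group but not divided by the group's standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, scale_rewards=False, eps=1e-4):
    """Center rewards on the group mean; optionally divide by the group std."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if scale_rewards:
        # Population std is used here; real implementations may differ.
        std = pstdev(rewards)
        advantages = [a / (std + eps) for a in advantages]
    return advantages


# With num_samples_per_input=8, one group holds eight metric scores for one input.
rewards = [1.0, 0.5, 0.5, 0.0, 1.0, 0.75, 0.25, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # [0.5, 0.0, 0.0, -0.5, 0.5, 0.25, -0.25, -0.5]

# Program rollouts per GRPO step under the settings above:
# num_dspy_examples_per_grpo_step * num_samples_per_input
rollouts_per_step = 4 * 8
print(rollouts_per_step)  # 32
```

Rollouts that score above their group's mean get a positive advantage and are reinforced; those below the mean are discouraged. A group in which all rollouts score the same contributes no learning signal.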


Now, you can use the GRPO'ed program.

In [None]:
example = devset[0]
optimized_papillon(**example.inputs())

In our preliminary experiments, training for three hours boosts the composite score on the devset from 54.6% to 60.0%. On a cost/quality basis, this is _typically_ worse than what you'd get from running prompt optimizers like `dspy.MIPROv2` or `dspy.SIMBA`, but it's still a very solid start for online RL over arbitrary LM programs for tiny LMs.