# RLM with Modal Sandbox (DSPy 3.1.3)

This tutorial shows how to use **`dspy.RLM`** (Recursive Language Model) with [Modal](https://modal.com) for secure, sandboxed code execution in the cloud.

**What is RLM?** RLM is an inference strategy where the LLM writes Python code to programmatically explore data, call sub-LLMs over snippets, and iteratively build up answers — instead of feeding long contexts directly into the model.

**Why Modal?** By default, `dspy.RLM` uses a local Deno/Pyodide WASM sandbox. Modal lets you run that code in an isolated cloud container with configurable resources, dependencies, and secrets.

**What we'll do:**
1. Implement a `ModalInterpreter` that satisfies DSPy's `CodeInterpreter` protocol
2. Use `modal.Sandbox` to execute code inside an ephemeral cloud container
3. Run an RLM agent that writes and executes code remotely

## Prerequisites

- **Python 3.10+**
- **Modal account**: Sign up at [modal.com](https://modal.com) and run `modal setup`
- **Modal secret**: Create a secret named `LITELLM` that contains the environment variables used by DSPy/LiteLLM:
  - `DSPY_LM_MODEL` (e.g., `openai/gemini-3-flash-preview`)
  - `DSPY_LM_API_BASE` (your LiteLLM proxy base URL)
  - `DSPY_LLM_API_KEY` (API key for the proxy/provider)
  - optional: `DSPY_LM_MAX_TOKENS`

  Example (run in a terminal):
  ```bash
  modal secret create LITELLM \
    DSPY_LM_MODEL=... \
    DSPY_LM_API_BASE=... \
    DSPY_LLM_API_KEY=... \
    DSPY_LM_MAX_TOKENS=...
  ```

- **Security note**: don’t hard-code API keys in notebooks, and don’t print them. If a key was ever pasted into a notebook/chat, rotate it.

## 1. Install Dependencies

In [34]:
%pip install -qU "dspy==3.1.3" modal

Note: you may need to restart the kernel to use updated packages.


## 2. Imports and Configuration

We configure one LM locally for the *planner* (the model that writes Python code each iteration).

This notebook expects the following environment variables to be set **locally** (for the planner):
- `DSPY_LM_MODEL`
- `DSPY_LM_API_BASE`
- `DSPY_LLM_API_KEY`
- optional: `DSPY_LM_MAX_TOKENS`

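The same variables are also injected into the Modal sandbox via the `LITELLM` secret, so code running in the sandbox can read them without secrets being hard-coded in the notebook. Note that tool-bridged calls such as `llm_query` are relayed back to the host and served by the locally configured LM, so they use the local credentials rather than the sandbox's.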

**Important**: Modal secrets are only available *inside* Modal containers/sandboxes. They do **not** automatically set environment variables for your local notebook kernel.
This notebook will try to load a local `.env` from the project root (if present) to configure the planner LM.

In [35]:
import json
import os
import sys
from pathlib import Path
from typing import Any, Callable, Iterator

import dspy

# ---- Load local .env (for the planner LM) ----
# Modal secrets are only available *inside* Modal; they do not configure your local kernel.
def _find_project_root(start: Path) -> Path:
    for p in [start, *start.parents]:
        if (p / "pyproject.toml").exists():
            return p
    return start

def _load_dotenv(path: Path) -> None:
    if not path.exists():
        return
    try:
        for raw in path.read_text().splitlines():
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            k, v = line.split("=", 1)
            k, v = k.strip(), v.strip()
            if len(v) >= 2 and ((v[0] == v[-1] == '\"') or (v[0] == v[-1] == "'")):
                v = v[1:-1]
            if k and k not in os.environ:
                os.environ[k] = v
    except Exception as e:
        print(f"Warning: could not load {path}: {e}")

PROJECT_ROOT = _find_project_root(Path.cwd())
_load_dotenv(PROJECT_ROOT / ".env")

# ---- Guard against module shadowing ----
# A local `modal.py` (or even a stale compiled `__pycache__/modal.*.pyc`) in the
# notebook's working directory can shadow the third-party `modal` package.
shadow_py = Path.cwd() / "modal.py"
shadow_pyc_dir = Path.cwd() / "__pycache__"
shadow_pycs = list(shadow_pyc_dir.glob("modal.*.pyc")) if shadow_pyc_dir.exists() else []

if shadow_py.exists():
    raise RuntimeError(
        f"Found {shadow_py} which shadows the 'modal' package. "
        "Rename/delete it (e.g., modal_get_started.py) and restart the kernel."
    )

if shadow_pycs:
    removed: list[str] = []
    failed: list[str] = []
    for p in shadow_pycs:
        try:
            p.unlink()
            removed.append(str(p))
        except Exception:
            failed.append(str(p))

    if removed:
        print("Removed shadowing bytecode files:\n" + "\n".join(removed))
    if failed:
        raise RuntimeError(
            "Found shadowing bytecode files but could not remove them:\n"
            + "\n".join(failed)
            + "\nDelete them manually and restart the kernel."
        )

# If a previous import attempt loaded a bad `modal` module, clear modal-related
# modules to avoid weird partially-initialized states.
#
# Note: Modal uses a generated `modal_proto` package under the hood; when upgrading
# modal in a running kernel, stale `modal_proto` modules can cause type mismatches.
MODULE_PREFIXES_TO_PURGE = (
    "modal",
    "modal_proto",
    "grpclib",
)

for name in list(sys.modules.keys()):
    if name in MODULE_PREFIXES_TO_PURGE or any(name.startswith(p + ".") for p in MODULE_PREFIXES_TO_PURGE):
        sys.modules.pop(name, None)

import modal
from dspy.primitives.code_interpreter import CodeInterpreterError, FinalOutput


def configure_planner_from_env() -> bool:
    """Configure DSPy planner LM from environment variables.

    Expected (local):
      - DSPY_LM_MODEL
      - DSPY_LLM_API_KEY (or DSPY_LM_API_KEY)
      - optional: DSPY_LM_API_BASE, DSPY_LM_MAX_TOKENS

    Returns True if configured, False if required env vars are missing.
    """

    api_key = os.environ.get("DSPY_LLM_API_KEY") or os.environ.get("DSPY_LM_API_KEY")
    missing: list[str] = []
    if not os.environ.get("DSPY_LM_MODEL"):
        missing.append("DSPY_LM_MODEL")
    if not api_key:
        # This notebook uses DSPY_LLM_API_KEY; DSPY_LM_API_KEY is accepted as a fallback.
        missing.append("DSPY_LLM_API_KEY")
    if missing:
        print(
            "Planner LM not configured yet. Missing env vars: "
            + ", ".join(missing)
            + "\nSet them locally (e.g., export in your shell before starting Jupyter, or create a .env at the project root) and re-run this cell." 
        )
        return False

    planner_lm = dspy.LM(
        os.environ["DSPY_LM_MODEL"],
        api_base=os.environ.get("DSPY_LM_API_BASE"),
        api_key=api_key,
        max_tokens=int(os.environ.get("DSPY_LM_MAX_TOKENS", "16000")),
    )

    dspy.configure(lm=planner_lm)
    print(f"Planner LM configured: {planner_lm.model}")
    print("(Tip: don’t print API keys.)")
    return True


PLANNER_READY = configure_planner_from_env()

# We’ll pass `modal.Secret.from_name('LITELLM')` into the sandbox so the *remote*
# Python REPL can access the same environment variables without hard-coding them.

Planner LM configured: openai/gemini-3-flash-preview
(Tip: don’t print API keys.)


### Optional: sanity-check the Modal secret (without leaking it)

The snippet below confirms that the `LITELLM` secret is mounted in Modal by checking for the *presence* of environment variables. It deliberately does **not** print secret values.

In [36]:
import json
import os

import modal

# Sandboxes require an App when created from a local environment.
app = modal.App.lookup("dspy-rlm-secret-check", create_if_missing=True)

sb = modal.Sandbox.create(app=app, secrets=[modal.Secret.from_name("LITELLM")])
try:
    code = r"""
import json, os
keys = [
  'DSPY_LM_MODEL',
  'DSPY_LM_API_BASE',
  'DSPY_LLM_API_KEY',
  'DSPY_LM_MAX_TOKENS',
]
print(json.dumps({k: bool(os.environ.get(k)) for k in keys}))
"""
    p = sb.exec("python", "-c", code, timeout=60)
    p.wait()
    print("Secret env presence:", p.stdout.read().strip())
finally:
    sb.terminate()

Secret env presence: {"DSPY_LM_MODEL": true, "DSPY_LM_API_BASE": true, "DSPY_LLM_API_KEY": true, "DSPY_LM_MAX_TOKENS": true}


### Don’t print secrets

This is **unsafe**:
- `print(os.environ["DSPY_LLM_API_KEY"])`

Instead, verify the secret is present (and optionally its length), without revealing the value.

In [37]:
import json

import modal

app = modal.App.lookup("dspy-rlm-secret-check", create_if_missing=True)

sb = modal.Sandbox.create(app=app, secrets=[modal.Secret.from_name("LITELLM")])
try:
    code = r"""
import json, os
key = os.environ.get('DSPY_LLM_API_KEY', '')
print(json.dumps({'present': bool(key), 'length': len(key)}))
"""
    p = sb.exec("python", "-c", code, timeout=60)
    p.wait()
    print("DSPY_LLM_API_KEY:", p.stdout.read().strip())
finally:
    sb.terminate()

DSPY_LLM_API_KEY: {"present": true, "length": 67}


## 3. The Modal Sandbox Driver

Modal Sandboxes are ephemeral containers. We use a **driver program** pattern (from [Modal's code interpreter example](https://modal.com/docs/examples/simple_code_interpreter)):

1. A Python driver script runs inside the sandbox, reading JSON commands from `stdin`.
2. For each command, it `exec()`s the code, captures stdout/stderr, and checks for `SUBMIT()` calls.
3. It writes the result as JSON to `stdout`.

This keeps state between iterations (variables persist in the driver's `sandbox_globals` dict), which is exactly what RLM needs.
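The three steps above can be tried locally without Modal. The following is a minimal sketch, assuming a simplified command shape with only a `code` field; `handle_command` and `namespace` are hypothetical names, and the real driver in the next cell additionally handles `SUBMIT()` and tool calls.

```python
import json
from contextlib import redirect_stdout
from io import StringIO

# One namespace shared by every command, so variables survive between calls.
namespace: dict = {}


def handle_command(line: str) -> str:
    """Run the code in one JSON command and return a one-line JSON reply."""
    command = json.loads(line)
    buffer = StringIO()
    error = ""
    with redirect_stdout(buffer):
        try:
            exec(command["code"], namespace)
        except Exception as exc:  # report the failure instead of crashing the loop
            error = f"{type(exc).__name__}: {exc}"
    return json.dumps({"stdout": buffer.getvalue(), "stderr": error})


# Two separate commands: the second one sees the variable set by the first.
first = json.loads(handle_command(json.dumps({"code": "x = 21"})))
second = json.loads(handle_command(json.dumps({"code": "print(x * 2)"})))
print(second["stdout"])
```

Capturing `print` output into a buffer keeps the real stdout free for protocol messages, which is why the driver writes its replies through a separate handle.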

In [38]:
def sandbox_driver():
    """Driver program that runs inside the Modal sandbox container.

    Protocol:
    - Host sends one JSON line: {code, variables, tool_names, output_names}
    - Driver executes `code` (stateful globals), capturing stdout/stderr.
    - If executed code calls a tool like llm_query(), the driver emits a JSON
      tool call request to *real* stdout, then blocks reading one JSON tool
      response line from stdin.
    - At the end, driver emits one JSON line: {stdout, stderr, final}

    This mirrors DSPy's local sandbox tool-bridge pattern (see runner.js).
    """

    import json
    import sys
    from contextlib import redirect_stderr, redirect_stdout
    from io import StringIO
    from typing import Any

    # Persistent state across execute() calls.
    sandbox_globals: dict[str, Any] = {}

    # Protocol IO that bypasses redirected stdout/stderr.
    proto_out = sys.__stdout__

    # Set on each command by host.
    output_names: list[str] = []
    tool_names: list[str] = []

    class _FinalOutput(BaseException):
        pass

    def _send(obj: dict) -> None:
        proto_out.write(json.dumps(obj) + "\n")
        proto_out.flush()

    def _tool_call(name: str, *args, **kwargs):
        _send({"tool_call": {"name": name, "args": list(args), "kwargs": kwargs}})
        # Host replies with {tool_result} or {tool_error}
        reply = json.loads(input())
        if reply.get("tool_error"):
            raise RuntimeError(reply["tool_error"])
        return reply.get("tool_result")

    def _register_tools(names: list[str]) -> None:
        # Create callable stubs in sandbox_globals for every tool name.
        for n in names:
            if not n.isidentifier() or n in {"SUBMIT"}:
                continue
            if n in sandbox_globals:
                continue

            def _make(n_: str):
                def _fn(*args, **kwargs):
                    return _tool_call(n_, *args, **kwargs)

                return _fn

            sandbox_globals[n] = _make(n)

    def SUBMIT(*args, **kwargs):
        """Signal completion.

        DSPy generates SUBMIT(output1, output2, ...) with positional args,
        where the position maps to the signature output fields.

        We also support SUBMIT(field=value, ...) for convenience.
        """
        if kwargs:
            raise _FinalOutput(kwargs)

        if not output_names:
            # Fallback (should not happen if host provides output_names)
            if len(args) == 1:
                raise _FinalOutput({"output": args[0]})
            raise _FinalOutput({"output": list(args)})

        if len(args) != len(output_names):
            raise _FinalOutput({
                "error": f"SUBMIT expected {len(output_names)} positional values ({output_names}), got {len(args)}"
            })

        raise _FinalOutput(dict(zip(output_names, args)))

    sandbox_globals["SUBMIT"] = SUBMIT

    while True:
        try:
            line = input()  # Next command from host (or EOF)
        except EOFError:
            break

        try:
            command = json.loads(line)
        except json.JSONDecodeError as e:
            _send({"stdout": "", "stderr": f"[Error] Invalid JSON: {e}", "final": None})
            continue

        code = command.get("code")
        variables = command.get("variables", {}) or {}
        tool_names = list(command.get("tool_names", []) or [])
        output_names = list(command.get("output_names", []) or [])

        if code is None:
            _send({"stdout": "", "stderr": "[Error] No code provided", "final": None})
            continue

        # Inject variables and tool stubs into globals.
        sandbox_globals.update(variables)
        _register_tools(tool_names)

        # Execute and capture stdout/stderr.
        stdout_io, stderr_io = StringIO(), StringIO()
        final_obj = None
        with redirect_stdout(stdout_io), redirect_stderr(stderr_io):
            try:
                exec(code, sandbox_globals)
            except _FinalOutput as e:
                final_obj = e.args[0] if e.args else None
            except Exception as e:
                print(f"[Error] {type(e).__name__}: {e}", file=sys.stderr)

        _send({"stdout": stdout_io.getvalue(), "stderr": stderr_io.getvalue(), "final": final_obj})


print("Driver function defined.")

Driver function defined.
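The driver's `_FinalOutput` derives from `BaseException` rather than `Exception`. One practical consequence, shown in the sketch below, is that generated code wrapping `SUBMIT()` in a broad `try/except Exception` cannot accidentally swallow the completion signal. `FinalSignal`, `submit`, and `run` are hypothetical stand-ins, not the driver's own names.

```python
class FinalSignal(BaseException):
    """Hypothetical stand-in for the driver's `_FinalOutput`."""


def submit(**fields):
    """Signal completion by raising, so execution of the user code stops here."""
    raise FinalSignal(fields)


def run(code: str):
    """Execute code; return the submitted fields, or None if nothing was submitted."""
    try:
        exec(code, {"submit": submit})
    except FinalSignal as signal:
        return signal.args[0]
    except Exception:
        return None
    return None


# User code that swallows ordinary exceptions still cannot swallow the signal.
user_code = """
try:
    submit(answer=42)
except Exception:
    print("swallowed")
"""
result = run(user_code)
print(result)
```

A bare `except:` or `except BaseException:` in generated code would still intercept the signal, so this is a safeguard against the common case rather than a guarantee.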


## 4. Implement the `ModalInterpreter`

This class implements DSPy's [`CodeInterpreter`](https://github.com/stanfordnlp/dspy/blob/main/dspy/primitives/code_interpreter.py) protocol. The protocol requires:

| Method | Purpose |
|---|---|
| `tools` (property) | Dict of callable tools available in the sandbox |
| `start()` | Initialize resources (idempotent) |
| `execute(code, variables)` | Run code, return stdout or `FinalOutput` |
| `shutdown()` | Release resources |

Our implementation creates a `modal.Sandbox`, launches the driver program, and communicates via stdin/stdout JSON messages.

In [39]:
import inspect
from typing import Iterator


# Modal sandbox image — add any packages your RLM code might need.
# (The sandbox is just a Python REPL. Your RLM-written code can `import` these.)
SANDBOX_IMAGE = modal.Image.debian_slim(python_version="3.12").pip_install(
    "numpy",
    "pandas",
)

# Reference a pre-existing Modal App (creates if missing)
MODAL_APP = modal.App.lookup("dspy-rlm-interpreter", create_if_missing=True)


class ModalInterpreter:
    """CodeInterpreter that executes code in a Modal Sandbox.

    - Maintains sandbox state across `execute()` calls (a persistent driver process).
    - Bridges DSPy tools (llm_query, llm_query_batched, and any custom tools) by
      relaying tool-call requests from the sandbox back to the host.
    """

    def __init__(
        self,
        image: modal.Image = SANDBOX_IMAGE,
        app: modal.App = MODAL_APP,
        secrets: list[modal.Secret] | None = None,
        timeout: int = 600,
    ):
        self.image = image
        self.app = app
        self.secrets = secrets or [modal.Secret.from_name("LITELLM")]
        self.timeout = timeout

        # Set by RLM on every forward() via _inject_execution_context
        self.output_fields: list[dict] | None = None
        self._tools_registered = False

        # Interpreter state
        self._sandbox: modal.Sandbox | None = None
        self._proc = None
        self._stdin = None
        self._stdout_iter: Iterator[str] | None = None
        self._stderr_iter: Iterator[str] | None = None
        self._tools: dict[str, Callable[..., str]] = {}

    # ── CodeInterpreter protocol ─────────────────────────────────────

    @property
    def tools(self) -> dict[str, Callable[..., str]]:
        return self._tools

    @tools.setter
    def tools(self, value: dict[str, Callable[..., str]]) -> None:
        self._tools = value

    def start(self) -> None:
        """Create the Modal Sandbox and launch the driver process (idempotent)."""
        if self._sandbox is not None:
            return

        driver_source = inspect.getsource(sandbox_driver)
        driver_command = f"{driver_source}\n\nsandbox_driver()"

        self._sandbox = modal.Sandbox.create(
            app=self.app,
            image=self.image,
            secrets=self.secrets,
        )

        # Start a long-lived python process inside the sandbox.
        # bufsize=1 enables line buffering for stdout.
        self._proc = self._sandbox.exec(
            "python",
            "-u",
            "-c",
            driver_command,
            bufsize=1,
            timeout=self.timeout,
        )

        self._stdin = self._proc.stdin
        self._stdout_iter = iter(self._proc.stdout)
        self._stderr_iter = iter(getattr(self._proc, "stderr", []))

    def _tool_names(self) -> list[str]:
        return list(self._tools.keys()) if self._tools else []

    def _output_names(self) -> list[str]:
        if not self.output_fields:
            return []
        return [d["name"] for d in self.output_fields if isinstance(d, dict) and d.get("name")]

    def execute(
        self,
        code: str,
        variables: dict[str, Any] | None = None,
    ) -> str | FinalOutput:
        if self._sandbox is None:
            self.start()

        # Keep variables JSON-serializable
        safe_vars: dict[str, Any] = {}
        if variables:
            for k, v in variables.items():
                if isinstance(v, (str, int, float, bool, list, dict, type(None))):
                    safe_vars[k] = v
                else:
                    safe_vars[k] = str(v)

        payload = {
            "code": code,
            "variables": safe_vars,
            "tool_names": self._tool_names(),
            "output_names": self._output_names(),
        }

        self._stdin.write(json.dumps(payload) + "\n")
        self._stdin.drain()

        # Read messages until we get the final result.
        while True:
            try:
                line = next(self._stdout_iter)
            except StopIteration:
                # Try to surface sandbox stderr for debugging
                stderr_tail = ""
                try:
                    stderr_tail = "".join(list(self._stderr_iter)[:50])
                except Exception:
                    pass
                raise CodeInterpreterError(
                    "Modal sandbox process exited unexpectedly." + (f"\nStderr: {stderr_tail}" if stderr_tail else "")
                )

            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                # Ignore non-JSON chatter
                continue

            # Tool call request from sandbox
            if "tool_call" in msg:
                call = msg["tool_call"] or {}
                name = call.get("name")
                args = call.get("args") or []
                kwargs = call.get("kwargs") or {}

                try:
                    if not name or name not in self._tools:
                        raise CodeInterpreterError(f"Unknown tool: {name}")
                    result = self._tools[name](*args, **kwargs)
                    # Ensure JSON serializable
                    try:
                        json.dumps(result)
                        reply = {"tool_result": result}
                    except TypeError:
                        reply = {"tool_result": str(result)}
                except Exception as e:
                    reply = {"tool_error": f"{type(e).__name__}: {e}"}

                self._stdin.write(json.dumps(reply) + "\n")
                self._stdin.drain()
                continue

            # Final result from sandbox
            if "stdout" in msg or "stderr" in msg or "final" in msg:
                stdout = msg.get("stdout", "") or ""
                stderr = msg.get("stderr", "") or ""
                final_obj = msg.get("final")

                if final_obj is not None:
                    return FinalOutput(final_obj)

                out = stdout
                if stderr:
                    out = out + ("\n" if out else "") + stderr
                return out

            # Unknown message type; ignore

    def shutdown(self) -> None:
        if self._sandbox is not None:
            try:
                self._sandbox.terminate()
            except Exception:
                pass
            self._sandbox = None
            self._proc = None
            self._stdin = None
            self._stdout_iter = None
            self._stderr_iter = None


print("ModalInterpreter defined.")

ModalInterpreter defined.
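The tool-bridge branch of `execute()` can be isolated into a pure function, which makes the reply shapes easier to see and to test. This is a sketch mirroring that branch, not a replacement for it; `dispatch_tool_call` and the two toy tools are hypothetical.

```python
import json
from typing import Any, Callable


def dispatch_tool_call(msg: dict[str, Any], tools: dict[str, Callable[..., Any]]) -> dict[str, Any]:
    """Answer one {"tool_call": ...} message with a tool_result or tool_error reply."""
    call = msg.get("tool_call") or {}
    name = call.get("name")
    try:
        if name not in tools:
            raise LookupError(f"Unknown tool: {name}")
        result = tools[name](*call.get("args", []), **call.get("kwargs", {}))
        try:
            json.dumps(result)  # probe: can the result cross the JSON boundary?
        except TypeError:
            result = str(result)  # fall back to a string representation
        return {"tool_result": result}
    except Exception as exc:
        return {"tool_error": f"{type(exc).__name__}: {exc}"}


tools = {"word_count": lambda text: len(text.split()), "as_set": lambda: {1}}

ok = dispatch_tool_call({"tool_call": {"name": "word_count", "args": ["a b c"]}}, tools)
fallback = dispatch_tool_call({"tool_call": {"name": "as_set"}}, tools)
missing = dispatch_tool_call({"tool_call": {"name": "nope"}}, tools)
print(ok, fallback, missing)
```

Returning errors as data instead of raising keeps the host loop alive, and the sandbox side turns a `tool_error` reply back into an exception that the generated code can see.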


## 5. Basic RLM Demo: Code Generation

A simple example showing RLM writing Python code to solve a problem.

In [40]:
# Ensure the planner LM is configured
if not PLANNER_READY and dspy.settings.lm is None:
    raise RuntimeError("Planner LM not configured")

interpreter = ModalInterpreter()

rlm = dspy.RLM(
    signature="question -> answer",
    interpreter=interpreter,
    max_iterations=15,
    max_llm_calls=30,
    verbose=True,
)

try:
    result = rlm(question="What are the first 12 Fibonacci numbers? Return as comma-separated.")
    print("\nFINAL ANSWER:", result.answer)
finally:
    interpreter.shutdown()

2026/02/06 21:45:09 INFO dspy.predict.rlm: RLM iteration 1/15
Reasoning: The question asks for the first 12 Fibonacci numbers, separated by commas. I will write a simple Python script to calculate these numbers, starting from the standard definition (typically $F_0=0, F_1=1$ or $F_1=1, F_2=1$). I will provide the sequence starting from 0 as is common in mathematical contexts, or check if the prompt implies starting from 1. Most definitions start 0, 1, 1, 2... or 1, 1, 2... I will generate the first 12 starting from 0 and print them.
Code:
```python
def fibonacci(n):
    fib_sequence = [0, 1]
    while len(fib_sequence) < n:
        fib_sequence.append(fib_sequence[-1] + fib_sequence[-2])
    return fib_sequence[:n]

first_12 = fibonacci(12)
print(", ".join(map(str, first_12)))
```
2026/02/06 21:45:11 INFO dspy.predict.rlm: RLM iteration 2/15
Reasoning: The previous step successfully calculated the first 12 Fibonacci numbers starting from 0: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89. This


FINAL ANSWER: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89


## 6. Core Capability: Long Document Analysis

RLM treats a long document as an external environment: the document lives in the sandbox as a variable, code navigates it and extracts the relevant sections, and only snippets are sent to `llm_query()`.
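To make the idea concrete, here is a sketch of the kind of code an RLM iteration might write: split the document, filter cheaply with plain string operations, and pass only the matching pieces to the LLM. `fake_llm_query` is a hypothetical stub that makes no network call, and the document text is made up.

```python
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def fake_llm_query(prompt: str) -> str:
    """Hypothetical stand-in for the sandbox's llm_query tool; makes no network call."""
    return f"summary of {len(prompt)} chars"


document = ("intro " * 50) + ("optimizer details " * 20) + ("appendix " * 50)
chunks = chunk_text(document, 200)

# Cheap string filtering first; only matching chunks reach the (stubbed) LLM.
relevant = [c for c in chunks if "optimizer" in c]
answers = [fake_llm_query(f"List the optimizers mentioned:\n{c}") for c in relevant]

print(f"{len(relevant)} of {len(chunks)} chunks sent to the LLM")
```

Fixed-size chunks can cut a keyword in half at a boundary; overlapping windows or splitting on headings are common ways to reduce that risk.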

### Use Case: Extract DSPy Architecture

In [41]:
class ExtractArchitecture(dspy.Signature):
    """Extract architectural information from DSPy documentation."""
    
    docs: str = dspy.InputField(desc="Full DSPy documentation text")
    query: str = dspy.InputField(desc="What to extract")
    modules: list = dspy.OutputField(desc="List of DSPy modules")
    optimizers: list = dspy.OutputField(desc="List of optimizers")
    design_principles: str = dspy.OutputField(desc="Key design principles")


with open("dspy-doc/dspy-doc.txt", "r") as f:
    dspy_docs = f.read()

print(f"Loaded: {len(dspy_docs):,} chars, {len(dspy_docs.splitlines()):,} lines")

interpreter = ModalInterpreter()

rlm = dspy.RLM(
    signature=ExtractArchitecture,
    interpreter=interpreter,
    max_iterations=25,
    max_llm_calls=50,
    verbose=True,
)

try:
    result = rlm(
        docs=dspy_docs,
        query="Extract all modules and optimizers from documentation",
    )
    print(f"Modules: {result.modules}")
    print(f"Optimizers: {result.optimizers}")
finally:
    interpreter.shutdown()

2026/02/06 21:45:11 INFO dspy.predict.rlm: RLM iteration 1/25
Reasoning: I will start by exploring the contents of the `docs` variable to understand its structure and locate sections related to "modules", "optimizers", and "design principles". I'll use `llm_query` to help identify specific names and principles from the documentation.

First, I'll print the first 2000 characters and check for any obvious section headers or lists. I'll also look for mentions of "optimizers" (teleprompters) and "modules".

Plan:
1. Examine the directory structure and main headers in the documentation.
2. Search for sections specifically mentioning `dspy.Module` and `dspy.Teleprompter` (optimizers).
3. Use LLM to extract lists of modules, optimizers, and design principles.
Code:
```python
print(f"Docs length: {len(docs)}")
print("First 2000 characters of docs:")
print(docs[:2000])

# Look for mentions of Modules and Optimizers/Teleprompters
import re

modules_mentions = re.findall(r'dspy\.\w+', docs)
print

Loaded: 81,456 chars, 3,126 lines


2026/02/06 21:45:13 INFO dspy.predict.rlm: RLM iteration 2/25
Reasoning: The directory structure in the documentation already provides a very clear list of modules and optimizers under `docs/api/modules/` and `docs/api/optimizers/`. 

Modules found in directory: `BestOfN`, `ChainOfThought`, `Module`, `MultiChainComparison`, `Parallel`, `Predict`, `ProgramOfThought`, `ReAct`, `Refine`.
Optimizers found in directory: `BetterTogether`, `BootstrapFewShot`, `BootstrapFewShotWithRandomSearch`, `BootstrapFinetune`, `BootstrapRS`, `COPRO`, `Ensemble`, `InferRules`, `KNN`, `KNNFewShot`, `LabeledFewShot`. 

I also need to find more optimizers that might be under `MIPRO` or `MIPROv2` as those are common in DSPy but might be in subfolders or later in the text. I also need to extract "Design Principles". I'll use `llm_query` to scan the full text for design principles and to verify if there are any other modules or optimizers mentioned in the text that weren't in that specific directory listing (e.

Modules: ['BestOfN', 'ChainOfThought', 'Module', 'MultiChainComparison', 'Parallel', 'Predict', 'ProgramOfThought', 'ReAct', 'Refine', 'CodeAct', 'RLM', 'Retrieve']
Optimizers: ['BetterTogether', 'BootstrapFewShot', 'BootstrapFewShotWithRandomSearch', 'BootstrapFinetune', 'BootstrapRS', 'COPRO', 'Ensemble', 'InferRules', 'KNN', 'KNNFewShot', 'LabeledFewShot', 'GEPA', 'GroundedProposer']


## 7. Parallel Processing with llm_query_batched()

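Independent chunks can be sent through `llm_query_batched()` in one call so they are processed concurrently instead of one after another. Whether the planner actually uses it is up to the model; the signature below only nudges it through its docstring.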
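The benefit of batching comes from overlapping the waiting time of independent requests. The sketch below illustrates the principle with a thread pool and a stubbed call that sleeps instead of contacting a model. It is not DSPy's implementation of `llm_query_batched()`; `fake_llm_query` and `query_batched` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_llm_query(prompt: str) -> str:
    """Hypothetical single-prompt call; sleeps to imitate network latency."""
    time.sleep(0.05)
    return prompt.upper()


def query_batched(prompts: list[str], max_workers: int = 8) -> list[str]:
    """Run prompts concurrently; results come back in the same order as prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_llm_query, prompts))


prompts = [f"chunk {i}" for i in range(8)]
start = time.perf_counter()
results = query_batched(prompts)
elapsed = time.perf_counter() - start
print(results[:2], f"{elapsed:.2f}s")
```

With eight workers the eight simulated calls overlap, so the total time is close to one call's latency rather than the sum of all eight. Real speedups depend on provider rate limits and on how independent the chunks are.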

In [42]:
class ExtractAPIEndpoints(dspy.Signature):
    """Extract API endpoints using batched analysis."""
    
    docs: str = dspy.InputField(desc="API documentation")
    api_endpoints: list = dspy.OutputField(desc="List of API endpoints")


interpreter = ModalInterpreter()

rlm = dspy.RLM(
    signature=ExtractAPIEndpoints,
    interpreter=interpreter,
    max_iterations=20,
    max_llm_calls=30,
    verbose=True,
)

try:
    result = rlm(docs=dspy_docs)
    print(f"Found {len(result.api_endpoints)} endpoints")
    for ep in result.api_endpoints[:5]:
        print(f"  - {ep}")
finally:
    interpreter.shutdown()

2026/02/06 21:45:14 INFO dspy.predict.rlm: RLM iteration 1/20
Reasoning: The documentation appears to be a directory structure for the DSPy library, specifically focusing on the `docs/api/` folder. This folder likely contains Markdown files describing the various classes, functions, and potentially REST API endpoints (if applicable) or programmatic API interfaces of the library.

My first goal is to explore the content of the `docs` variable to understand what kind of "API endpoints" are present. In the context of a library like DSPy, "endpoints" might refer to:
1. REST API endpoints (if there's a server component).
2. Public class methods or functions that serve as the primary interface.
3. Specific URL endpoints mentioned in the documentation.

I will start by printing a larger chunk of the `docs` string to see the structure and sample content of the `.md` files mentioned in the file list. I'll also check if there's an actual list of web endpoints or just code APIs.

Plan:
1. Examine

Found 60 endpoints
  - dspy.Adapter
  - dspy.Audio
  - dspy.BestOfN
  - dspy.BetterTogether
  - dspy.BootstrapFewShot


## 8. Stateful Multi-Step Reasoning

RLM maintains state across iterations. Variables persist, enabling multi-step workflows.

In [43]:
class FindErrorPatterns(dspy.Signature):
    """Find and categorize error patterns."""
    
    docs: str = dspy.InputField(desc="Documentation text")
    error_categories: dict = dspy.OutputField(desc="Error types mapped to solutions")
    total_errors_found: int = dspy.OutputField(desc="Total errors identified")


interpreter = ModalInterpreter()

rlm = dspy.RLM(
    signature=FindErrorPatterns,
    interpreter=interpreter,
    max_iterations=30,
    max_llm_calls=40,
    verbose=True,
)

try:
    result = rlm(docs=dspy_docs)
    print(f"Found {result.total_errors_found} error patterns")
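    # `error_categories` maps error types to solutions. If a value is a string,
    # len() below is its character count, not a number of errors.
    for cat, errors in result.error_categories.items():
        print(f"{cat}: {len(errors)} errors")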
finally:
    interpreter.shutdown()

2026/02/06 21:45:17 INFO dspy.predict.rlm: RLM iteration 1/30
Reasoning: I need to find and categorize error patterns within the documentation provided in the `docs` variable. 
First, I will explore the structure and content of `docs` to understand how it's organized and where the "errors" might be mentioned (e.g., issues, troubleshooting, bug reports, or common pitfalls).
Since the input is a large string representing a directory structure and potentially file contents, I'll start by checking the length and printing a larger sample to see if it contains actual error logs or just documentation about errors.

Plan:
1. Print a large chunk of `docs` to understand the format.
2. Search for keywords like "error", "fail", "exception", "bug", "issue", "problem" to locate relevant sections.
3. Categorize the findings using the LLM for semantic analysis.
Code:
```python
# Check the total length and print a substantial sample to understand the content.
print(f"Total length of docs: {len(docs)}")

Found 9 error patterns
Execution & Runtime Errors: 125 errors
Structural & Schema Validation Failures: 164 errors
Optimization & Teleprompter Logic Failures: 135 errors
Agentic & Tool-Use Failures: 140 errors
Signature Definition Errors: 153 errors
Infrastructure & Connectivity Failures: 127 errors
API & Interface Implementation Absences: 136 errors


## 9. Inspecting the Trajectory

Every RLM result includes a `trajectory`: the complete history of reasoning, code, and outputs for each iteration.
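When trajectories get long, a small formatter keeps the printout readable. The sketch below assumes each step is a dict with `reasoning` and `code` keys, as accessed in the cell that follows. `preview`, `format_trajectory`, and the sample steps are hypothetical.

```python
def preview(value: object, limit: int) -> str:
    """Collapse whitespace and truncate to `limit` characters."""
    text = " ".join(str(value).split())
    return text if len(text) <= limit else text[:limit] + "..."


def format_trajectory(steps: list[dict]) -> str:
    """Render one short block per step, tolerating missing keys."""
    lines = []
    for number, step in enumerate(steps, start=1):
        lines.append(f"Step {number}:")
        lines.append(f"  Reasoning: {preview(step.get('reasoning', 'N/A'), 60)}")
        lines.append(f"  Code: {preview(step.get('code', ''), 40)}")
    return "\n".join(lines)


# Made-up steps for illustration; a real trajectory comes from `result.trajectory`.
sample_steps = [
    {"reasoning": "Look at the input first.", "code": "print(text[:100])"},
    {"reasoning": "Summarize what was printed.", "code": "SUBMIT('a summary')"},
]
report = format_trajectory(sample_steps)
print(report)
```

Collapsing whitespace before truncating avoids the multi-line code previews that otherwise break the indentation of the report.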

In [44]:
interpreter = ModalInterpreter()

rlm = dspy.RLM(
    signature="text -> summary",
    interpreter=interpreter,
    max_iterations=10,
    max_llm_calls=10,
    verbose=False,
)

try:
    text_sample = dspy_docs[:3000]
    result = rlm(text=text_sample)
    
    print(f"Trajectory ({len(result.trajectory)} steps):\n")
    for i, step in enumerate(result.trajectory):
        print(f"\nStep {i+1}:")
        print(f"  Reasoning: {step.get('reasoning', 'N/A')[:100]}...")
        print(f"  Code: {step.get('code', '')[:60]}...")
finally:
    interpreter.shutdown()

Trajectory (3 steps):


Step 1:
  Reasoning: I will start by examining the full content of the `text` variable to understand the directory struct...
  Code: print(text)...

Step 2:
  Reasoning: The input contains a directory structure for the `stanfordnlp-dspy` repository, specifically focusin...
  Code: prompt = f"""
Based on the following directory structure of ...

Step 3:
  Reasoning: The previous step successfully analyzed the directory structure of the `stanfordnlp-dspy` repository...
  Code: # The summary has been generated and verified in the previou...


## 10. Advanced: Custom Tools in the Sandbox

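RLM supports custom tools that sandbox code can call by name. With this interpreter, each call is relayed to the host, where the Python function actually runs, and the result is sent back to the sandbox. This extends capabilities beyond the built-in `llm_query()`.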

### Example: Regex Pattern Matcher Tool
We'll create a tool that efficiently extracts patterns from text using compiled regex.

In [45]:
# Define a custom tool function
def regex_extract(text: str, pattern: str, flags: int = 0) -> list:
    """Extract all matches of regex pattern from text.
    
    Args:
        text: Source text to search
        pattern: Regex pattern string
        flags: Regex flags (e.g., re.IGNORECASE=2)
    
    Returns:
        List of match groups or full matches
    """
    import re
    compiled = re.compile(pattern, flags)
    matches = compiled.findall(text)
    return matches


class ExtractWithCustomTool(dspy.Signature):
    """Extract specific patterns using custom regex tool.
    
    Strategy:
    1. Use regex_extract() to find all markdown headers
    2. Use regex_extract() to find all code blocks
    3. Summarize structure
    """
    
    docs: str = dspy.InputField(desc="Documentation to analyze")
    headers: list = dspy.OutputField(desc="All markdown headers found")
    code_blocks: list = dspy.OutputField(desc="All code block languages found")
    structure_summary: str = dspy.OutputField(desc="Summary of document structure")


interpreter = ModalInterpreter()

rlm = dspy.RLM(
    signature=ExtractWithCustomTool,
    interpreter=interpreter,
    tools=[regex_extract],  # Pass custom tool here
    max_iterations=15,
    max_llm_calls=20,
    verbose=True,
)

try:
    result = rlm(docs=dspy_docs[:10000])  # First 10KB for demo
    print(f"\nFound {len(result.headers)} headers")
    print(f"Found {len(result.code_blocks)} code blocks")
    print(f"\nStructure: {result.structure_summary[:200]}...")
finally:
    interpreter.shutdown()

2026/02/06 21:45:39 INFO dspy.predict.rlm: RLM iteration 1/15
Reasoning: I will start by exploring the content of `docs` to understand its structure and content. I'll print the first 2000 characters and check the total length to get an idea of the document's scale.
Code:
```python
print(f"Total length of docs: {len(docs)}")
print("--- Sample (first 2000 chars) ---")
print(docs[:2000])
```
2026/02/06 21:45:48 INFO dspy.predict.rlm: RLM iteration 2/15
Reasoning: The document starts with a directory structure followed by what appears to be Markdown content. I need to extract all markdown headers (lines starting with `#`) and all code blocks (content between triple backticks). I will use `regex_extract` to identify these patterns. I'll also look further into the document to see where the actual content begins, as the initial sample is mostly a file tree.
Code:
```python
# Extract all markdown headers
# Pattern: line starting with one or more '#' followed by space and text
headers = regex_e


Found 3 headers
Found 1 code blocks

Structure: The documentation for the `stanfordnlp-dspy` repository is structured as a **technical API Reference manual** managed via a documentation-as-code workflow.

The structure is organized into the followi...
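
Because the verbose log above is truncated, it does not show which patterns the model actually passed to `regex_extract`. The sketch below exercises the same tool locally on a tiny, made-up Markdown sample. The two patterns and the fence-stripping step are illustrative assumptions, not the model's generated code. It also shows a pitfall worth knowing about: a naive header pattern counts `#` comments inside code blocks as headers.

```python
import re


def regex_extract(text: str, pattern: str, flags: int = 0) -> list:
    """Same tool as above, repeated so this snippet runs on its own."""
    return re.compile(pattern, flags).findall(text)


FENCE = "`" * 3  # built indirectly so this example contains no literal fence
sample_md = "\n".join([
    "# Title",
    "Intro text.",
    "## Usage",
    FENCE + "python",
    "# a comment inside code, not a header",
    "print('hi')",
    FENCE,
])

header_pattern = r"^#{1,6} +(.+)$"    # illustrative pattern for Markdown headers
fence_lang_pattern = r"^`{3}(\w+)"    # illustrative pattern for fence languages

# Naive pass: the code comment is mistaken for a header.
naive_headers = regex_extract(sample_md, header_pattern, re.MULTILINE)

# Strip fenced blocks first, then extract headers from what remains.
without_code = re.sub(r"^`{3}.*?^`{3}[^\n]*$", "", sample_md,
                      flags=re.MULTILINE | re.DOTALL)
headers = regex_extract(without_code, header_pattern, re.MULTILINE)
languages = regex_extract(sample_md, fence_lang_pattern, re.MULTILINE)

print(naive_headers)
print(headers)
print(languages)
```

Running a check like this before passing the tool to `dspy.RLM` makes it easier to tell tool bugs apart from model mistakes when reading the trajectory.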


## 11. RLM vs Direct LLM Comparison

| Aspect | Direct LLM | RLM |
|--------|-----------|-----|
| **Context size** | Bounded by the model's context window (e.g., ~128K tokens) | Bounded by sandbox memory, not the context window |
| **Attention** | Dilutes over long context | Focused (code selects snippets) |
| **Cost** | High (all tokens in context) | Lower (targeted sub-LLM calls) |
| **Accuracy** | Lower on long docs | Higher (targeted analysis) |
| **Verifiability** | Black box | Transparent (full trajectory) |
| **Tool use** | Limited | Full Python + custom tools |
| **Iterative refinement** | Manual (chat) | Automated (code loops) |
| **Structured output** | Prompt-dependent | Type-enforced via Signature |

### When to use RLM:
- ✅ Documents > 50KB
- ✅ Need structured extraction (lists, dicts, nested data)
- ✅ Multi-step analysis (filter → extract → validate)
- ✅ Need programmatic validation or computation
- ✅ Repetitive analysis across many documents

### When NOT to use RLM:
- ❌ Simple Q&A on short text (< 1K tokens)
- ❌ Creative writing or brainstorming
- ❌ Single-turn classification tasks
- ❌ Real-time low-latency requirements
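
The two checklists above can be folded into a rough routing heuristic. This is only a sketch: `Task` and `prefer_rlm` are hypothetical names, the 50 KB threshold is taken from the list above, and the rule that latency sensitivity overrides everything else is an assumption about how to weigh the criteria.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of an analysis task."""
    doc_bytes: int
    needs_structured_output: bool = False
    multi_step: bool = False
    latency_sensitive: bool = False


def prefer_rlm(task: Task, large_doc_bytes: int = 50_000) -> bool:
    """Return True when the checklists above suggest RLM over a direct LLM call."""
    # Assumption: a real-time latency requirement rules out RLM outright,
    # because each iteration adds at least one planner call.
    if task.latency_sensitive:
        return False
    return (
        task.doc_bytes > large_doc_bytes
        or task.needs_structured_output
        or task.multi_step
    )


print(prefer_rlm(Task(doc_bytes=80_000)))
print(prefer_rlm(Task(doc_bytes=2_000)))
print(prefer_rlm(Task(doc_bytes=80_000, latency_sensitive=True)))
```

Treat the result as a starting point for discussion rather than a hard rule; borderline cases are best settled by measuring both approaches on a sample.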

## 12. RLM Best Practices

### Signature Design

1. **Describe the strategy** in the docstring:
   ```python
   """First use regex to find X, then llm_query() on relevant sections,
   finally aggregate results with llm_query_batched()."""
   ```

2. **Explicit type annotations**: Use `list`, `dict`, `int` for structured outputs

3. **Input field descriptions**: Help RLM understand what data it's working with

### Parameter Tuning

| Parameter | Typical Range | Notes |
|-----------|---------------|-------|
| `max_iterations` | 10-50 | Complex docs need more iterations |
| `max_llm_calls` | 20-100 | Primary cost control |
| `max_output_chars` | 10K-100K | Prevents output flooding |
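
One way to keep these settings consistent across experiments is to store them as named presets. The sketch below uses plain dictionaries; `RLM_PRESETS` and `rlm_kwargs` are hypothetical names, the values are simply the ends of the ranges in the table, and passing `max_output_chars` as a constructor keyword is an assumption (only `max_iterations` and `max_llm_calls` are shown as constructor arguments in this notebook).

```python
# Hypothetical presets built from the ranges in the table above.
RLM_PRESETS = {
    "smoke_test": {"max_iterations": 10, "max_llm_calls": 20, "max_output_chars": 10_000},
    "full_run": {"max_iterations": 50, "max_llm_calls": 100, "max_output_chars": 100_000},
}


def rlm_kwargs(preset: str, **overrides: int) -> dict:
    """Return a copy of a preset with optional per-run overrides applied."""
    if preset not in RLM_PRESETS:
        raise KeyError(f"Unknown preset {preset!r}; choose from {sorted(RLM_PRESETS)}")
    unknown = set(overrides) - set(RLM_PRESETS[preset])
    if unknown:
        # Catch typos such as max_iteration=... before they reach the constructor.
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**RLM_PRESETS[preset], **overrides}


print(rlm_kwargs("smoke_test"))
print(rlm_kwargs("full_run", max_llm_calls=60))
```

The result could then be unpacked into the constructor, for example `dspy.RLM(signature="text -> summary", interpreter=interpreter, **rlm_kwargs("smoke_test"))`.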

### Debugging Workflow

1. **Start with `verbose=True`**: See real-time reasoning and code
2. **Inspect `result.trajectory`**: Full execution history
3. **Test on subsets**: Use `docs[:5000]` before full runs
4. **Check sandbox logs**: Modal shows actual execution
5. **Validate tools**: Test custom tools independently
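
Step 3 can be made systematic by growing the input slice in stages. The helper below is a hypothetical sketch: `subset_sizes` is not part of DSPy, and the starting size of 5,000 characters and the growth factor of 4 are arbitrary choices.

```python
def subset_sizes(total_chars: int, start: int = 5_000, factor: int = 4) -> list[int]:
    """Return growing prefix lengths that end at the full document length."""
    if start <= 0 or factor < 2:
        raise ValueError("start must be positive and factor must be at least 2")
    sizes = []
    size = start
    while size < total_chars:
        sizes.append(size)
        size *= factor
    sizes.append(total_chars)  # always finish with the full document
    return sizes


print(subset_sizes(80_000))  # [5000, 20000, 80000]
```

Each size can then drive one run, e.g. `rlm(docs=dspy_docs[:n])` for every `n` in the list, stopping at the first size where the trajectory shows a problem.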

## 13. Summary

This notebook demonstrated the main capabilities of **`dspy.RLM`**:

1. **Basic code generation** - LLM writes and executes Python
2. **Long document analysis** - Process 80KB+ documents efficiently
3. **Parallel processing** - `llm_query_batched()` for speed
4. **Stateful reasoning** - Multi-step workflows with persistent variables
5. **Trajectory inspection** - Full transparency into reasoning
6. **Custom tools** - Extend sandbox capabilities

### Key Takeaways

- RLM treats long context as an **environment**, not input
- Code navigates data; `llm_query()` understands semantics
- The **trajectory** records the reasoning, code, and output of every step, so each result can be audited
- **Modal sandbox** provides secure, scalable execution

### Next Steps

- Try RLM on your own long documents
- Build custom tools for your domain
- Experiment with different strategies in Signature docstrings
- Use trajectory data to iteratively improve prompts

---

**Reference**: [Recursive Language Models](https://arxiv.org/abs/2512.24601) (Zhang, Kraska, Khattab, 2025)