# AgenticFleet Fleet Support GEPA Notebook

This notebook adapts the [DSPy GEPA Facility Support Analyzer tutorial](https://dspy.ai/tutorials/gepa_facilitysupportanalyzer/) to the AgenticFleet codebase. We walk through inspecting the production routing signature, instantiating the `DSPyReasoner` module, preparing training data, compiling with `BootstrapFewShot`, and evaluating the results.

## 1. Import Dependencies and Configure DSPy

We load project utilities, configure Python paths, and initialize DSPy's LM backend with the same model referenced in `config/workflow_config.yaml` (`gpt-4.1`).

In [1]:
import os
import sys
import json
import random
import inspect
from pathlib import Path
from types import SimpleNamespace
from typing import List

# Ensure repo src/ is on the path for absolute imports.
# Scripts resolve the project root from __file__; notebooks have no __file__,
# so we assume the kernel's working directory is notebooks/ and take its parent.
PROJECT_ROOT = Path(__file__).resolve().parent.parent if "__file__" in globals() else Path.cwd().parent
SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

import dspy
from agentic_fleet.utils.config_loader import load_config
from agentic_fleet.dspy_modules.reasoner import DSPyReasoner
from agentic_fleet.dspy_modules.workflow_signatures import EnhancedTaskRouting
from agentic_fleet.utils.gepa_optimizer import (
    load_example_dicts,
    prepare_gepa_datasets,
    build_routing_feedback_metric,
)

# Configure DSPy using the model defined in workflow config (defaults to gpt-4.1)
workflow_config = load_config()
dspy_model = workflow_config.get("dspy", {}).get("model", "gpt-4.1")

lm = dspy.LM(model=dspy_model, max_tokens=1024)
dspy.configure(lm=lm)

print(f"Project root: {PROJECT_ROOT}")
print(f"Configured DSPy with model: {dspy_model}")

Project root: /Volumes/Samsung-SSD-T7/Workspaces/Github/qredence/agent-framework/v0.5/AgenticFleet
Configured DSPy with model: gpt-4.1
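
The chained `.get("dspy", {}).get("model", "gpt-4.1")` lookup above falls back to the default only when a key is missing. If the YAML file contains a bare `dspy:` entry, the parsed value is `None` and the second `.get` raises `AttributeError`. The sketch below shows a more defensive lookup, assuming `load_config()` returns a plain dictionary; `resolve_dspy_model` and `"example-model"` are hypothetical names used only for illustration.

```python
from typing import Any, Mapping

DEFAULT_MODEL = "gpt-4.1"  # same fallback as the cell above


def resolve_dspy_model(config: Mapping[str, Any], default: str = DEFAULT_MODEL) -> str:
    """Return config["dspy"]["model"], or the default when it is absent, None, or empty."""
    dspy_section = config.get("dspy") or {}  # tolerates a missing key and an explicit None
    return dspy_section.get("model") or default


print(resolve_dspy_model({"dspy": {"model": "example-model"}}))  # configured value wins
print(resolve_dspy_model({}))                                    # missing section -> default
print(resolve_dspy_model({"dspy": None}))                        # bare `dspy:` in YAML -> default
```

Whether this extra tolerance is needed depends on how `load_config` validates its input, which this notebook does not show.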


## 2. Define the Fleet Support Signature

AgenticFleet already ships with an enhanced routing signature (`EnhancedTaskRouting`) in `src/agentic_fleet/dspy_modules/workflow_signatures.py`. We inspect it directly so the notebook stays aligned with the production supervisor.

In [3]:
print("EnhancedTaskRouting signature (source excerpt):\n")
print(inspect.getsource(EnhancedTaskRouting))

print("\nInput fields:")
for name, field in EnhancedTaskRouting.input_fields.items():
    print(f"- {name}: {field.json_schema_extra.get('desc', 'No description')}")

print("\nOutput fields:")
for name, field in EnhancedTaskRouting.output_fields.items():
    print(f"- {name}: {field.json_schema_extra.get('desc', 'No description')}")

EnhancedTaskRouting signature (source excerpt):

class EnhancedTaskRouting(dspy.Signature):
    """Advanced task routing with efficiency and tool-planning awareness.

    Optimizes for latency and token usage by pre-planning tool usage
    and setting execution constraints.
    """

    task: str = dspy.InputField(desc="Task to be routed")
    team_capabilities: str = dspy.InputField(desc="Capabilities of available agents")
    available_tools: str = dspy.InputField(desc="List of available tools")
    current_context: str = dspy.InputField(desc="Execution context")
    handoff_history: str = dspy.InputField(desc="History of agent handoffs")
    workflow_state: str = dspy.InputField(desc="Current state of the workflow")

    assigned_to: list[str] = dspy.OutputField(desc="Agents assigned to the task")
    execution_mode: Literal["delegated", "sequential", "parallel"] = dspy.OutputField(
        desc="Execution mode"
    )
    subtasks: list[str] = dspy.OutputField(desc="Breakdown of sub

## 3. Create the Agentic Fleet Module

We reuse the production `DSPyReasoner` module (the same class that powers the Supervisor workflow). It wires multiple DSPy submodules (task analysis, routing, progress, and tool planning) under one interface that GEPA can optimize.

In [4]:
reasoner = DSPyReasoner(use_enhanced_signatures=True)
print("Named predictors exposed to GEPA:")
print([name for name, _ in reasoner.named_predictors()])

# Quick smoke test using a lightweight task routed through the router signature
sample_prediction = reasoner(
    task="Summarize the latest GEPA optimizations for the exec weekly report",
    team_capabilities=(
        "Planner: decomposes projects.\n"
        "Researcher: runs Tavily searches.\n"
        "Writer: drafts polished updates."
    ),
    available_tools="TavilySearchTool, HostedCodeInterpreterTool",
    current_context="Need executive-friendly tone, highlight latency gains.",
)

print("\nSample routing decision:")
print(
    {
        "assigned_to": getattr(sample_prediction, "assigned_to", []),
        "execution_mode": getattr(sample_prediction, "execution_mode", "delegated"),
        "tool_plan": getattr(sample_prediction, "tool_plan", []),
    }
)

Named predictors exposed to GEPA:
['analyzer', 'router', 'quality_assessor', 'progress_evaluator', 'tool_planner', 'simple_responder', 'group_chat_selector', 'strategy_selector']

Sample routing decision:
{'assigned_to': ['Planner', 'Researcher', 'Writer'], 'execution_mode': 'sequential', 'tool_plan': ['TavilySearchTool', 'HostedCodeInterpreterTool']}


## 4. Load and Prepare Fleet Support Data

Supervisor training data now lives exclusively in `src/agentic_fleet/data/supervisor_examples.json`. Use `scripts/merge_supervisor_examples.py` if you need to ingest additional examples (it will merge any extra files and regenerate the canonical dataset).

In [5]:
examples_path = PROJECT_ROOT / "src" / "agentic_fleet" / "data" / "supervisor_examples.json"
records = load_example_dicts(str(examples_path))

if not records:
    raise RuntimeError(
        "No training data found. Run scripts/merge_supervisor_examples.py to populate the dataset."
    )

train_examples, val_examples = prepare_gepa_datasets(
    base_examples_path=str(examples_path),
    base_records=records,
    val_split=0.2,
    seed=13,
)

print(f"Training examples: {len(train_examples)} | Validation examples: {len(val_examples)}")

print("\nExample record:")
sample = train_examples[0]
print({
    "task": sample.task,
    "assigned_to": sample.assigned_to,
    "execution_mode": sample.execution_mode,
    "tool_requirements": sample.tool_requirements,
})

Training examples: 87 | Validation examples: 21

Example record:
{'task': 'Write a technical blog post about machine learning trends', 'assigned_to': 'Researcher,Writer,Reviewer', 'execution_mode': 'sequential', 'tool_requirements': ['TavilySearchTool']}
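
The `val_split=0.2` and `seed=13` arguments suggest a seeded shuffle followed by a proportional split. The sketch below illustrates that idea with the standard library only. It is an assumption about the general technique, not the actual implementation of `prepare_gepa_datasets`, and `seeded_split` is a hypothetical helper. The total of 108 records is inferred from the 87/21 counts printed above.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def seeded_split(
    records: Sequence[T], val_split: float = 0.2, seed: int = 13
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of records with a fixed seed; the first n_val items become validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # private RNG, so global random state is untouched
    n_val = int(len(shuffled) * val_split)  # truncates toward zero
    return shuffled[n_val:], shuffled[:n_val]


train, val = seeded_split(range(108))
print(len(train), len(val))  # 87 21
```

Fixing the seed keeps the validation set identical across runs, so score changes can be attributed to the optimizer rather than to a different split.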


## 5. Define Evaluation Metrics

Our metric rewards correct routing decisions and returns textual feedback alongside a numeric score. It gives partial credit for overlapping agent/tool selections so GEPA receives granular feedback.
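
One common way to award partial credit for overlapping selections is the Jaccard overlap between the expected and predicted sets. The sketch below is illustrative only: `overlap_score` is a hypothetical helper, and it is not the formula used by `build_routing_feedback_metric`, which combines several components.

```python
from typing import Iterable


def overlap_score(expected: Iterable[str], predicted: Iterable[str]) -> float:
    """Case-insensitive Jaccard overlap of two name collections; 1.0 when both are empty."""
    gold = {name.strip().lower() for name in expected if name.strip()}
    pred = {name.strip().lower() for name in predicted if name.strip()}
    if not gold and not pred:
        return 1.0  # nothing expected and nothing predicted counts as agreement
    return len(gold & pred) / len(gold | pred)


# One of three expected agents was selected: 1 shared / 3 total
print(round(overlap_score(["Researcher", "Writer", "Reviewer"], ["Writer"]), 2))  # 0.33
```

A graded score like this lets an optimizer distinguish a near miss from a completely wrong assignment, which an all-or-nothing match cannot do.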

In [6]:
routing_feedback_metric = build_routing_feedback_metric()
print("Metric ready – returns GEPA-friendly score + feedback text.")

# Demonstrate feedback using a real training example and a deliberately bad prediction
gold = train_examples[0]
bad_prediction = SimpleNamespace(
    task=gold.task,
    assigned_to=["Writer"],  # intentionally wrong agent
    execution_mode="delegated",
    tool_requirements=[],
    latency_budget="low",
)

score_feedback = routing_feedback_metric(gold, bad_prediction)
print(f"Score: {score_feedback.score:.2f}\n")
print("Feedback excerpt:\n")
print("\n".join(score_feedback.feedback.splitlines()[:8]))


def _agent_set(value):
    """Normalize a comma-separated string or a list of agent names to a lowercase set."""
    items = value.split(",") if isinstance(value, str) else (value or [])
    return {str(agent).strip().lower() for agent in items if str(agent).strip()}


def bootstrap_routing_metric(example, prediction, trace=None):
    """Lightweight scorer used by BootstrapFewShot (mirrors compiler.py)."""
    gold_agents = _agent_set(example.assigned_to)
    pred_agents = _agent_set(getattr(prediction, "assigned_to", []))
    # 70% credit for any agent overlap, 30% for an exact execution-mode match
    assignment_score = 1.0 if gold_agents & pred_agents else 0.0
    mode_score = 1.0 if getattr(prediction, "execution_mode", "") == example.execution_mode else 0.0
    return (assignment_score * 0.7) + (mode_score * 0.3)


Metric ready – returns GEPA-friendly score + feedback text.
Score: 0.17

Feedback excerpt:

❌ Routing decision needs significant improvement.

🔍 Edge Cases Detected:
  • Edge case: Task requires multiple agents but was assigned to fewer. Consider task complexity and required capabilities.

📊 Component Analysis:
  ❌ Agent mismatch: Assigned ['Writer'] but expected ['Researcher', 'Writer', 'Reviewer'].
  📝 Step-by-step: First, analyze task requirements. Then, match capabilities:


## 6. Compile the Agent with BootstrapFewShot

We warm-start the fleet module with DSPy's `BootstrapFewShot` teleprompter so the LM sees a few good demonstrations before GEPA prompt optimization. This mirrors the standard AgenticFleet optimization pipeline.
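
Conceptually, bootstrapping runs the program over training examples, keeps the runs that the metric accepts as demonstrations, and stops once enough have been collected. The sketch below shows that loop with toy stand-ins. It is a simplified illustration, not DSPy's implementation: `bootstrap_demos`, the `threshold` parameter, and the toy program and metric are all hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Callable, List, Sequence, Tuple


def bootstrap_demos(
    program: Callable[[Any], Any],
    trainset: Sequence[Any],
    metric: Callable[[Any, Any], float],
    max_demos: int,
    threshold: float = 0.5,
) -> List[Tuple[Any, Any]]:
    """Collect up to max_demos (example, prediction) pairs whose score clears threshold."""
    demos: List[Tuple[Any, Any]] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break  # enough demonstrations collected; skip the rest of the trainset
        prediction = program(example)
        if metric(example, prediction) >= threshold:
            demos.append((example, prediction))
    return demos


# Toy stand-ins: a "program" that always answers with sequential mode
modes = ["sequential", "parallel", "sequential", "sequential"]
toy_train = [SimpleNamespace(task=f"t{i}", execution_mode=m) for i, m in enumerate(modes)]
toy_program = lambda ex: SimpleNamespace(execution_mode="sequential")
toy_metric = lambda ex, pred: 1.0 if pred.execution_mode == ex.execution_mode else 0.0

demos = bootstrap_demos(toy_program, toy_train, toy_metric, max_demos=2)
print([ex.task for ex, _ in demos])  # ['t0', 't2']
```

The early exit is why a bootstrapping run can finish after visiting only a small part of the training set when most attempts pass the metric.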

In [7]:
from dspy.teleprompt import BootstrapFewShot

max_demos = min(6, len(train_examples)) or 1
teleprompter = BootstrapFewShot(
    metric=bootstrap_routing_metric,
    max_bootstrapped_demos=max_demos,
    max_labeled_demos=max_demos,
)

print("Running BootstrapFewShot compilation against supervisor dataset...")
compiled_reasoner = teleprompter.compile(reasoner, trainset=train_examples)
print("Compilation complete. You can now hand this module to GEPA for further tuning.")

Running BootstrapFewShot compilation against supervisor dataset...


  7%|▋         | 6/87 [00:19<04:26,  3.28s/it]

Bootstrapped 6 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Compilation complete. You can now hand this module to GEPA for further tuning.





## 7. Evaluate and Inspect Results

We score the compiled reasoner on the validation split, then print the prediction, score, and metric feedback for one sample task.
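
A mean score can hide a few badly routed tasks, so it can help to rank per-example scores and look at the weakest ones first. The sketch below is a minimal illustration with made-up data; `weakest_examples` is a hypothetical helper, and the task names and scores are placeholders, not results from this notebook.

```python
from typing import List, Sequence, Tuple


def weakest_examples(
    tasks: Sequence[str], scores: Sequence[float], k: int = 3
) -> List[Tuple[float, str]]:
    """Return the k lowest-scoring (score, task) pairs, lowest first."""
    if len(tasks) != len(scores):
        raise ValueError("tasks and scores must have the same length")
    # sorted() is stable, so tied scores keep their original order
    return sorted(zip(scores, tasks), key=lambda pair: pair[0])[:k]


# Placeholder data for illustration only
demo_tasks = ["task-a", "task-b", "task-c", "task-d"]
demo_scores = [0.9, 0.4, 1.0, 0.6]
for score, task in weakest_examples(demo_tasks, demo_scores, k=2):
    print(f"{score:.2f}  {task}")
```

In the evaluation cell, the same helper could be fed the `scores` list together with the `task` field of each evaluated example.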

In [8]:
eval_examples = val_examples if val_examples else train_examples
scores: List[float] = []
feedback_snippets: List[str] = []

for example in eval_examples:
    prediction = compiled_reasoner(**example.inputs())
    score_feedback = routing_feedback_metric(example, prediction)
    scores.append(score_feedback.score)
    feedback_snippets.append(score_feedback.feedback.splitlines()[0])

mean_score = sum(scores) / len(scores) if scores else 0.0
print(f"Validation examples: {len(eval_examples)} | Mean routing metric: {mean_score:.2f}")

if eval_examples:
    sample = eval_examples[0]
    sample_prediction = compiled_reasoner(**sample.inputs())
    sample_feedback = routing_feedback_metric(sample, sample_prediction)

    print("\nSample Task:\n", sample.task)
    print("\nPrediction:")
    print(f"Assigned To: {getattr(sample_prediction, 'assigned_to', [])}")
    print(f"Execution Mode: {getattr(sample_prediction, 'execution_mode', 'delegated')}")
    print(f"Tool Plan: {getattr(sample_prediction, 'tool_plan', [])}")
    print(f"Score: {sample_feedback.score:.2f}")

    snippet = "\n".join(sample_feedback.feedback.splitlines()[:8])
    print("\nFeedback:\n", snippet)


Validation examples: 21 | Mean routing metric: 0.81

Sample Task:
 Who won the 2025 New York mayor election? Search for Zohran Mamdani and provide details about the winner.

Prediction:
Assigned To: ['Researcher', 'Writer']
Execution Mode: sequential
Tool Plan: ['TavilySearchTool']
Score: 0.80

Feedback:
 ⚠️ Routing decision is mostly correct but has minor issues.

🔍 Edge Cases Detected:
  • This task involves ambiguity - consider clarifying requirements before routing.
  • Edge case: Time-sensitive query detected but web search tool not assigned. Tasks about current events, latest data, or future dates require TavilySearchTool.

📊 Component Analysis:
  ✅ Agent selection matches ground truth.
