test(harness,agents): δ-harness coverage of Hermes delegate_task (3 backends) #410

Merged
thinmintdev merged 1 commit into main from feature/phase0-delegate-harness on May 29, 2026

@thinmintdev
Contributor

Summary

  • Fakes for local / docker / modal backends mirroring upstream BaseEnvironment ABC (the actual abstraction upstream uses for execution-environment selection — tools/environments/base.py:288)
  • Per-backend test suite (4 tests each) covering happy path + payload capture + degraded-failure path + per-backend edge case
  • Matrix test (5 tests) covering parametrised round-trip + 3-backend fan-out in one delegate_task call + unknown-backend hard-fail + upstream-contract drift gate + fake-conformance check
  • Gates V3a observability panel scope per openrouter-research-2026-05-28/PLANNING.md §3 Phase 0
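The fan-out and unknown-backend hard-fail cases listed above can be sketched as follows. This is a minimal illustration, not the harness code from this PR: `RecordingBackend` and `fan_out` are hypothetical names, and the thread pool stands in for the in-process child threads that delegate_task uses upstream.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional, Tuple


class RecordingBackend:
    """Hypothetical fake backend that records every command it receives."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.commands: List[str] = []

    def execute(
        self,
        command: str,
        cwd: str = "",
        *,
        timeout: Optional[float] = None,
        stdin_data: Optional[str] = None,
    ) -> Dict[str, Any]:
        self.commands.append(command)
        return {"output": f"{self.name}:{command}", "returncode": 0}


def fan_out(
    tasks: List[Tuple[str, str]],
    backends: Dict[str, RecordingBackend],
) -> List[Dict[str, Any]]:
    """Dispatch (backend_name, command) pairs concurrently, one thread per task.

    Unknown backend names fail hard before anything is dispatched, so a bad
    request never leaves a partially executed fan-out behind.
    """
    unknown = sorted({name for name, _ in tasks if name not in backends})
    if unknown:
        raise ValueError(f"Unknown environment type(s): {', '.join(unknown)}")

    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(backends[name].execute, cmd) for name, cmd in tasks]
        # Collect in submission order so results line up with the input tasks.
        return [future.result() for future in futures]
```

Validating every backend name up front is a design choice of this sketch; the real harness may instead fail per task.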

What I learned about upstream Hermes

  • The ABC is tools.environments.base.BaseEnvironment (~/src/hermes-agent/tools/environments/base.py:288), not a per-backend "spawn adapter". Concrete subclasses live in tools/environments/{local,docker,modal,singularity,daytona,ssh}.py and are selected by TERMINAL_ENV env var via tools/terminal_tool.py::_create_environment (line 1039).
  • The public surface every backend exposes is init_session() -> None, execute(command, cwd="", *, timeout=None, stdin_data=None) -> {"output": str, "returncode": int}, cleanup() -> None. Our _BackendContract mirrors that exactly.
  • delegate_task (tools/delegate_tool.py:1918) spawns in-process child AIAgent threads; their tool loop dispatches shell commands through the backend selected at construction time. The "backend" axis therefore applies to each subagent's terminal/code tool, not to each spawn.
  • R7's "7 backends" claim was partly marketing drift: upstream actually ships 6 execution-environment backends. Vercel Sandbox does NOT exist at upstream pin 0554ef1a, confirmed by the fallthrough branch of the _create_environment factory at line 1174: else: raise ValueError("Unknown environment type: %s. Use one of: 'local', 'docker', 'singularity', 'modal', 'daytona', or 'ssh'"). Logged in FINDINGS.md §46.
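A rough sketch of the three-method surface and the factory's hard-fail, based only on the signatures quoted above. `BackendContract`, `FakeBackend`, and `create_environment` are illustrative names, not the upstream classes or this PR's `_BackendContract`; the backend tuple is copied from the upstream error message.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol


class BackendContract(Protocol):
    """Hypothetical mirror of the public surface every backend exposes."""

    def init_session(self) -> None: ...

    def execute(
        self,
        command: str,
        cwd: str = "",
        *,
        timeout: Optional[float] = None,
        stdin_data: Optional[str] = None,
    ) -> Dict[str, Any]: ...

    def cleanup(self) -> None: ...


@dataclass
class FakeBackend:
    """Illustrative fake: captures each invocation so tests can assert on it."""

    name: str
    calls: List[Dict[str, Any]] = field(default_factory=list)
    session_open: bool = False

    def init_session(self) -> None:
        self.session_open = True

    def execute(
        self,
        command: str,
        cwd: str = "",
        *,
        timeout: Optional[float] = None,
        stdin_data: Optional[str] = None,
    ) -> Dict[str, Any]:
        self.calls.append(
            {"command": command, "cwd": cwd, "timeout": timeout, "stdin_data": stdin_data}
        )
        return {"output": f"[{self.name}] {command}", "returncode": 0}

    def cleanup(self) -> None:
        self.session_open = False


# The six names listed in the upstream error message.
SUPPORTED = ("local", "docker", "singularity", "modal", "daytona", "ssh")


def create_environment(env_type: str) -> BackendContract:
    """Simplified stand-in for the factory: unknown types raise ValueError."""
    if env_type not in SUPPORTED:
        raise ValueError(
            f"Unknown environment type: {env_type}. Use one of: {', '.join(SUPPORTED)}"
        )
    return FakeBackend(env_type)
```

Under this sketch, `create_environment("vercel-sandbox")` raises `ValueError`, which is the behaviour the audit relied on to rule that backend out.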

Test plan

  • 3 per-backend test files (local, docker, modal) + 1 matrix test file
  • All 19 tests collected by pytest tests/harness/integration/test_delegate_task_*.py (18 pass, 1 skipped — the upstream-drift gate, expected on machines without ~/src/hermes-agent)
  • Full integration suite still green: PYTHONPATH=src pytest tests/harness/integration/ → 28 passed, 1 skipped
  • ruff format --check clean
  • ruff check clean
  • mypy --strict clean (no issues found in 6 source files)
  • Upstream-drift gate manually verified by running with PYTHONPATH=src:~/src/hermes-agent — passes against real BaseEnvironment
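One way a drift gate like the one described above could work is sketched below, assuming the gate compares method names and parameters against the upstream class and reports "skipped" when the module is not importable. `REQUIRED`, `load_upstream_base`, `contract_drift`, and `run_gate` are hypothetical helpers, not the code in this PR.

```python
import importlib
import inspect
from typing import Dict, List, Optional

# Methods and parameters the fakes rely on, per the surface described in this PR.
REQUIRED: Dict[str, List[str]] = {
    "init_session": [],
    "execute": ["command", "cwd", "timeout", "stdin_data"],
    "cleanup": [],
}


def load_upstream_base() -> Optional[type]:
    """Return upstream BaseEnvironment, or None when it is not on the import path."""
    try:
        module = importlib.import_module("tools.environments.base")
    except ImportError:
        return None
    return getattr(module, "BaseEnvironment", None)


def contract_drift(cls: type) -> List[str]:
    """List every way `cls` deviates from REQUIRED; an empty list means no drift."""
    problems: List[str] = []
    for method, params in REQUIRED.items():
        func = getattr(cls, method, None)
        if func is None:
            problems.append(f"missing method: {method}")
            continue
        have = inspect.signature(func).parameters
        for param in params:
            if param not in have:
                problems.append(f"{method}() lacks parameter: {param}")
    return problems


def run_gate() -> str:
    """Report 'skipped', 'ok', or a description of the drift."""
    base = load_upstream_base()
    if base is None:
        return "skipped"
    problems = contract_drift(base)
    return "ok" if not problems else "drift: " + "; ".join(problems)
```

In a pytest suite the same skip behaviour is usually expressed with `pytest.importorskip("tools.environments.base")` at the top of the test.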

Coverage matrix

| Backend | Found in upstream? | Test passes? | Notes |
| --- | --- | --- | --- |
| local | yes: tools/environments/local.py:413 (LocalEnvironment) | yes (4/4) | default backend; baseline coverage |
| docker | yes: tools/environments/docker.py:277 (DockerEnvironment) | yes (4/4) | image + cpu/memory/disk/volumes kwargs captured; "daemon not available" degraded path covered |
| modal | yes: tools/environments/modal.py:164 (ModalEnvironment) | yes (4/4) | sandbox_kwargs + cold-start latency + token-missing degraded path; matches both direct + managed mode shapes |
| singularity | yes (out of Phase 0 scope) | n/a | add via README §14 if needed for V3a |
| daytona | yes (out of Phase 0 scope) | n/a | add via README §14 if needed for V3a |
| ssh | yes (out of Phase 0 scope) | n/a | add via README §14 if needed for V3a |
| vercel-sandbox | NO (R7 marketing drift) | n/a | not a real upstream backend; V3a UI must not list it |

V3a observability recommendation

The V3a observability scope survives intact. All three covered backends round-trip cleanly through the dispatch hop, so V3a's "Hermes session log" panel can safely display target host / model / status / cost / duration for local, docker, and modal. If the user picks one of the three uncovered backends (singularity, daytona, ssh), the panel should still display the metadata, because the dispatch hop goes through the same BaseEnvironment.execute() shape. A follow-up Phase 0.5 should nonetheless extend δ-harness coverage to those backends before claiming production readiness. The "7 backends" string in promo / UI copy must be corrected to "6 backends" per FINDINGS.md §46.

Refs openrouter-research-2026-05-28/PLANNING.md §3 Phase 0 + §4 DA #2.

🤖 Generated with Claude Code

…ackends)

DA must-fix #2 from the OpenRouter integration analysis: R7 claimed
upstream Hermes ships 7 spawn backends, but tests/agents/ had zero
delegate_task coverage. Verifies orchestration end-to-end for local +
docker + modal with mocked backend handlers (no Modal credits or
docker pulls in CI). Gates V3a observability panel scope.

Adds:
- tests/harness/integration/_delegate_fakes.py — FakeLocalBackend,
  FakeDockerBackend, FakeModalBackend implementing the upstream
  BaseEnvironment ABC, capturing invocations for assertions
- _delegate_runner.py — in-process orchestration harness wiring the
  fakes into a simulated delegate_task dispatch loop
- test_delegate_task_{local,docker,modal}.py — happy path + error path
  + invocation payload shape per backend
- test_delegate_task_dispatch_matrix.py — parametrised fan-out across
  the 3 backends asserting orchestration works uniformly, plus an
  upstream-contract drift gate that runs against
  tools.environments.base.BaseEnvironment when ~/src/hermes-agent is
  on PYTHONPATH (skips cleanly on CI)

Upstream audit at pin 0554ef1a corrected R7's "7 backends" marketing:
upstream actually ships 6 (local/docker/singularity/modal/daytona/ssh).
Vercel Sandbox is NOT a BaseEnvironment subclass upstream. The gap is
documented in FINDINGS.md §46 so V3a's UI design can target the real
backend list. The three covered backends round-trip cleanly, so V3a
observability survives intact; singularity / daytona / ssh can be
added incrementally per README §14.

Refs openrouter-research-2026-05-28/PLANNING.md §3 Phase 0 + §4 #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thinmintdev thinmintdev merged commit 4d3455c into main May 29, 2026
4 checks passed
@thinmintdev thinmintdev mentioned this pull request May 29, 2026
4 tasks
thinmintdev added a commit that referenced this pull request May 29, 2026
End-of-stream cut for v0.3. Bundles MCP-completion, memory-map redesign,
Settings → Updates fix (#386), silent-eviction dispatcher recovery (#392),
ADR-0020 OpenRouter callback skeleton (#409), persona spending-cap
primitive (#411), δ-harness Hermes coverage (#410), and the docs/internal
pin + dashboard-v3 walkthrough (#389/#390).

After this tag, active scope rolls to v0.4 (install-mode reconciliation
+ UI polish + fully-implemented Agents/UI/Install bootstrapped) and v0.5
(MCP admin + memory wiring across UI and agents).

CHANGELOG merged from two coexisting Unreleased blocks into a single
[v0.3.2-alpha.1] section; added missing entries for #392 (dispatcher),
#387 (async-job polling contract), and the docs PRs #389/#390.

pyproject 0.3.1-alpha.1 → 0.3.2-alpha.1. uv.lock resynced (was stuck at
0.3.0a1 from prior drift).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thinmintdev thinmintdev deleted the feature/phase0-delegate-harness branch May 29, 2026 15:11