Self-repair Phase 4: closed-loop repair tickets in DreamService (#348) #375

Merged
rockfordlhotka merged 3 commits into main from feature/self-repair-phase-4
May 9, 2026

@rockfordlhotka
Member

Implements Phase 4 of the agent self-repair design (design/self-repair.md, see #344). Closes #348.

Promotes DreamService's failure-pattern observations into actionable tickets that apply a change, verify it, and escalate when verification fails — closing the loop opened by Phase 5 (#374), where clusters were captured but had no consumer.

Summary

  • Data model & store: RepairTicket, RepairTarget, RepairStatus, RepairAttempt records plus IRepairTicketStore and a file-backed implementation under /data/agent/repair-tickets/{id}.json (atomic temp+rename).
  • Four target appliers (IRepairTargetApplier):
    • SkillBodyApplier — append / replaceSection / deleteSection ops with auto-revert when verify fails.
    • WorkingMemoryEvictApplier — keyPrefix or explicit keys; idempotent.
    • ToolDefaultRegisterApplier — appends to /data/agent/tool-defaults/{server}.json, dedup by providerName.
    • PromptBuilderHintApplier — append/replace delimited sections in /data/agent/prompt-hints/{category}.md.
  • Verifier: cache-free RepairTicketVerifier (each apply attempt observes fresh state).
  • DreamService: two new passes after the contradiction sweep — RunRepairTicketCreationPassAsync (LLM-driven, dedups by PatternKey and (target, change-hash)) and RunRepairTicketApplyPassAsync (deterministic; apply → verify → resolve/escalate, escalation summary written to repair-escalations-latest working memory). Apply logic extracted as static helpers (RunRepairTicketApplyAsync, ApplyAndVerifyTicketAsync, ComputeNextStatus) for direct testability.
  • Phase 1 integration: new FileToolDefaultsProvider in RockBot.Tools.Mcp.Recovery.Providers reads server JSON files with FileSystemWatcher hot reload, registered alongside the three hard-coded providers — so ToolDefaultRegister tickets take effect on the next call.
  • Prompt-builder integration: ISystemPromptBuilder.Build overload accepts an optional category; DefaultSystemPromptBuilder reads prompt-hints/{category}.md (mtime-aware caching); AgentContextBuilder derives the category from the working-memory namespace prefix (session, patrol, subagent).
  • DI: new AgentHostBuilder.WithRepairTickets() extension; called from RockBot.Agent.Program immediately after WithFailureClusterStore().
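The "atomic temp+rename" write mentioned for the file-backed store can be sketched as follows. This is a minimal, language-neutral illustration in Python rather than the project's C#; `save_ticket_atomically` and the ticket dictionary shape are hypothetical names, and only the `{id}.json` layout comes from the summary above.

```python
import json
import os
import tempfile
from pathlib import Path


def save_ticket_atomically(store_dir: Path, ticket: dict) -> Path:
    """Write ticket to {id}.json so readers never see a half-written file.

    Hypothetical helper: the real IRepairTicketStore implementation may differ.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    final_path = store_dir / f"{ticket['id']}.json"

    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=store_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(ticket, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, final_path)  # atomic swap of old and new content
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return final_path
```

A reader opening `{id}.json` sees either the previous version or the new one, never a partial write, which matters when the apply pass and other consumers touch the same ticket files.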

Acceptance criteria (from #348)

  • A SkillBody ticket lands and is marked Resolved on the next cycle — RepairTicketApplyPassTests.SkillBodyTicket_VerifySucceeds_TicketResolved.
  • A ticket whose verify fails 3 times shows up as Escalated in repair-escalations-latest (RepairTicketApplyPassTests.SkillBodyTicket_VerifyFailsThreeTimes_Escalated_AndRevertedEachTime).
  • Skill-body changes auto-revert on verify failure — same test asserts skill body restored after each failed cycle.
  • Identical change proposals are deduped, not retried — creation-pass failedChangeHashes filter; static ComputeNextStatus test.
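The criteria above describe an apply → verify → resolve/escalate cycle with auto-revert. A rough Python sketch of that control flow, under stated assumptions: the `Ticket` shape, the `"Open"` status name, and the function signatures are hypothetical (the PR only names `ComputeNextStatus`, `Resolved`, and `Escalated`), and the real code is C#.

```python
from dataclasses import dataclass
from typing import Callable

MAX_FAILED_VERIFIES = 3  # threshold taken from the acceptance criteria


@dataclass
class Ticket:
    change: str               # text to append to the skill body
    status: str = "Open"      # "Open" is an assumed name for the pending state
    failed_verifies: int = 0


def compute_next_status(verified: bool, failed_verifies: int) -> str:
    """Pure status transition, kept separate so it can be tested directly."""
    if verified:
        return "Resolved"
    return "Escalated" if failed_verifies >= MAX_FAILED_VERIFIES else "Open"


def run_apply_cycle(ticket: Ticket, skill: dict, verify: Callable[[dict], bool]) -> None:
    """One dream cycle for one ticket: apply, verify, revert on failure."""
    if ticket.status != "Open":
        return
    original_body = skill["body"]
    skill["body"] = original_body + ticket.change   # apply
    verified = verify(skill)                        # verify against fresh state
    if not verified:
        skill["body"] = original_body               # auto-revert
        ticket.failed_verifies += 1
    ticket.status = compute_next_status(verified, ticket.failed_verifies)
```

Running three cycles with a verifier that always fails leaves the skill body unchanged after each cycle and ends with the ticket in `Escalated`, which is the behavior the second and third criteria assert.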

Out of scope

Test plan

  • dotnet build RockBot.slnx clean (0 errors, only pre-existing warnings).
  • dotnet test RockBot.slnx — full suite passes (812 host, 150 tools, 136 agent, 121 llm, 106 wisp, 103 observation, plus all other suites; RabbitMQ-gated tests skipped as expected).
  • Manual: drop a tool-defaults/calendar-mcp.json file with a timeZone default, call get_calendar_events without timeZone, confirm Phase 1 recovery resolves via FileToolDefaultsProvider (visible in auto-recovered telemetry).
  • Manual: drop a prompt-hints/session.md file, send a user message, confirm the hint appears in the LLM trace.
  • Cluster smoke: open a WorkingMemoryEvict ticket via the store API, run a dream cycle, confirm the targeted WM keys are gone and the ticket transitions to Resolved.

🤖 Generated with Claude Code

rockfordlhotka and others added 3 commits May 8, 2026 23:45
Promotes failure-cluster observations into actionable RepairTickets that the
DreamService applies, verifies, and either resolves or escalates. Closes the
loop opened by Phase 5: clusters now drive a fix → verify → resolve/escalate
cycle instead of just sitting in telemetry.

Adds RepairTicket data model, IRepairTicketStore (one JSON file per ticket
under /data/agent/repair-tickets), four IRepairTargetApplier implementations
(SkillBody with auto-revert, WorkingMemoryEvict, ToolDefaultRegister,
PromptBuilderHint), a cache-free IRepairTicketVerifier, and the matching
LLM-driven creation pass plus deterministic apply pass in DreamService.

Phase 1 integration ships FileToolDefaultsProvider so ToolDefaultRegister
tickets take effect on the next call without restart. Prompt-builder gains
a category-scoped overload that injects /data/agent/prompt-hints/<category>.md
so PromptBuilderHint tickets reach session prompts. Wiring exposed via a new
AgentHostBuilder.WithRepairTickets() extension.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tive

The repair-ticket apply pass now backs off on persistent verify timeouts —
each prior timeout doubles the budget (5s → 10s → 20s → 40s → 80s, capped).
After 5 consecutive timeouts at 80s, the outcome promotes from Uncertain to
PredicateFailed so the ticket can eventually escalate instead of churning
forever.
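The backoff schedule can be written down as a small sketch. This is illustrative Python, not the project's C#: the function names are hypothetical, and the promotion rule encodes one reading of "5 consecutive timeouts at 80s" (counting only timeouts that ran with the capped budget).

```python
BASE_BUDGET_SECONDS = 5
MAX_BUDGET_SECONDS = 80
CAPPED_TIMEOUTS_BEFORE_PROMOTION = 5


def verify_budget_seconds(prior_timeouts: int) -> int:
    """Budget for the next verify: doubles per prior timeout, capped at 80s."""
    return min(BASE_BUDGET_SECONDS * 2 ** prior_timeouts, MAX_BUDGET_SECONDS)


def classify_timeout(consecutive_timeouts: int) -> str:
    """Outcome for a verify that just timed out (the count includes this one).

    Assumption: only timeouts that ran at the capped budget count toward promotion.
    """
    capped = sum(
        1
        for attempt in range(consecutive_timeouts)
        if verify_budget_seconds(attempt) == MAX_BUDGET_SECONDS
    )
    if capped >= CAPPED_TIMEOUTS_BEFORE_PROMOTION:
        return "PredicateFailed"
    return "Uncertain"
```

Under this reading, the first four timeouts use budgets of 5s, 10s, 20s, and 40s, the next five use 80s, and the ninth consecutive timeout is the one promoted to `PredicateFailed`.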

Adds a TimedOut flag to VerifyResult so non-timeout Uncertains (executor
missing, gateway down) don't share the timeout retry path. Updates the
creation-pass directive with explicit guidance to prefer cheap, single-call
verify shapes (e.g. list_accounts over a fan-out get_calendar_events).
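A sketch of why the flag helps, in Python rather than C#. The field and path names here are hypothetical; the commit message only states that `VerifyResult` gains a `TimedOut` flag and that non-timeout `Uncertain` results stay off the timeout retry path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifyResult:
    outcome: str             # "Uncertain" and "PredicateFailed" appear in the commit message
    timed_out: bool = False  # mirrors the new TimedOut flag


def retry_path(result: VerifyResult) -> str:
    """Pick a retry path; the path labels are illustrative only."""
    if result.outcome != "Uncertain":
        return "none"
    # Only genuine timeouts get the doubling budget; other Uncertains
    # (executor missing, gateway down) are kept off that path.
    return "timeout-backoff" if result.timed_out else "plain-retry"
```

Without the flag, a gateway outage would look identical to a slow verify and would consume the timeout budget for the wrong reason.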

Surfaced by the first live cycle of #348: a PromptBuilderHint ticket
proposed a verify against the same fan-out tool that produced the cluster,
which exhausted the 5s budget every cycle and could not progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rockfordlhotka rockfordlhotka merged commit f19a832 into main May 9, 2026
2 checks passed
@rockfordlhotka rockfordlhotka deleted the feature/self-repair-phase-4 branch May 9, 2026 05:50