Self-repair Phase 4: closed-loop repair tickets in DreamService (#348) #375
Merged
rockfordlhotka merged 3 commits into main on May 9, 2026
Promotes failure-cluster observations into actionable RepairTickets that the DreamService applies, verifies, and either resolves or escalates. Closes the loop opened by Phase 5: clusters now drive a fix → verify → resolve/escalate cycle instead of just sitting in telemetry.

Adds the RepairTicket data model, IRepairTicketStore (one JSON file per ticket under /data/agent/repair-tickets), four IRepairTargetApplier implementations (SkillBody with auto-revert, WorkingMemoryEvict, ToolDefaultRegister, PromptBuilderHint), a cache-free IRepairTicketVerifier, and the matching LLM-driven creation pass plus deterministic apply pass in DreamService.

Phase 1 integration ships FileToolDefaultsProvider so ToolDefaultRegister tickets take effect on the next call without restart. The prompt builder gains a category-scoped overload that injects /data/agent/prompt-hints/<category>.md so PromptBuilderHint tickets reach session prompts. Wiring is exposed via a new AgentHostBuilder.WithRepairTickets() extension.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
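For reviewers unfamiliar with the temp+rename pattern that the PR summary mentions for the file-backed store, here is a minimal sketch of one-JSON-file-per-ticket persistence. It is written in Python rather than the project's C#, and `save_ticket`, `load_ticket`, and the ticket fields are hypothetical names used only for illustration; they are not the actual `IRepairTicketStore` API.

```python
import json
import os
import tempfile
from pathlib import Path


def save_ticket(store_dir: Path, ticket: dict) -> Path:
    """Write one ticket to <store_dir>/<id>.json via a temp file plus rename.

    Hypothetical helper: the real store's API is not shown in this PR text.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    final_path = store_dir / f"{ticket['id']}.json"

    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes the final replace atomic.
    fd, tmp_name = tempfile.mkstemp(dir=str(store_dir), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(ticket, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, final_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path


def load_ticket(store_dir: Path, ticket_id: str) -> dict:
    """Read a ticket back from its JSON file."""
    with open(store_dir / f"{ticket_id}.json", encoding="utf-8") as f:
        return json.load(f)
```

Because every update replaces the whole file, a crash mid-write should leave at worst a stray `.tmp` file rather than a truncated ticket.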
…tive

The repair-ticket apply pass now backs off on persistent verify timeouts: each prior timeout doubles the budget (5s → 10s → 20s → 40s → 80s, capped). After 5 consecutive timeouts at 80s, the outcome promotes from Uncertain to PredicateFailed so the ticket can eventually escalate instead of churning forever.

Adds a TimedOut flag to VerifyResult so non-timeout Uncertain outcomes (executor missing, gateway down) don't share the timeout retry path. Updates the creation-pass directive with explicit guidance to prefer cheap, single-call verify shapes (e.g. list_accounts over a fan-out get_calendar_events).

Surfaced by the first live cycle of #348: a PromptBuilderHint ticket proposed a verify against the same fan-out tool that produced the cluster, which exhausted the 5s budget every cycle and could not progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
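The backoff arithmetic in this commit message can be sketched as follows. This is an illustration in Python, not the C# code from the PR; the function and constant names are hypothetical, and "5 consecutive timeouts at 80s" is read here as five timeouts that each ran with the capped 80s budget, which is one plausible interpretation of the message.

```python
BASE_BUDGET_S = 5         # first verify budget, from the commit message
MAX_BUDGET_S = 80         # cap, from the commit message
MAX_TIMEOUTS_AT_CAP = 5   # promote after this many timeouts at the cap


def verify_budget_seconds(prior_timeouts: int) -> int:
    """Budget for the next verify: doubles per prior timeout, capped at 80s."""
    return min(BASE_BUDGET_S * 2 ** prior_timeouts, MAX_BUDGET_S)


def timeouts_at_cap(consecutive_timeouts: int) -> int:
    """Count how many of the consecutive timeouts ran with the capped budget."""
    return sum(
        1
        for i in range(consecutive_timeouts)
        if verify_budget_seconds(i) == MAX_BUDGET_S
    )


def classify_timeout(consecutive_timeouts: int) -> str:
    """Outcome after a timeout; the count includes the one that just happened.

    Assumption: a timeout stays Uncertain until enough of them have occurred
    at the capped budget, then it is promoted to PredicateFailed.
    """
    if timeouts_at_cap(consecutive_timeouts) >= MAX_TIMEOUTS_AT_CAP:
        return "PredicateFailed"
    return "Uncertain"
```

Under this reading the budgets run 5, 10, 20, 40, 80, 80, ... and promotion happens on the ninth consecutive timeout (four below the cap, then five at it).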
Implements Phase 4 of the agent self-repair design (`design/self-repair.md`, see #344). Closes #348.

Promotes DreamService's failure-pattern observations into actionable tickets that apply a change, verify it, and escalate when verification fails, closing the loop opened by Phase 5 (#374), where clusters were captured but had no consumer.
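As a rough illustration of the apply → verify → resolve/escalate loop, the status transition can be sketched as a pure function. This is Python rather than the project's C#, and it is not the PR's `ComputeNextStatus`: the `Passed` and `Pending` names are hypothetical, the three-failure threshold is inferred from the test name `SkillBodyTicket_VerifyFailsThreeTimes_Escalated_AndRevertedEachTime`, and treating `Uncertain` as "retry without counting" is an inference from the follow-up commit.

```python
from enum import Enum
from typing import Tuple


class VerifyOutcome(Enum):
    PASSED = "Passed"                     # hypothetical name
    PREDICATE_FAILED = "PredicateFailed"  # named in the follow-up commit
    UNCERTAIN = "Uncertain"               # named in the follow-up commit


class RepairStatus(Enum):
    PENDING = "Pending"      # hypothetical name for "try again next cycle"
    RESOLVED = "Resolved"
    ESCALATED = "Escalated"


# Assumption: escalate after three failed verifies (inferred from a test name).
MAX_FAILED_ATTEMPTS = 3


def compute_next_status(
    outcome: VerifyOutcome, failed_attempts: int
) -> Tuple[RepairStatus, int]:
    """Return the next status and the updated failed-attempt count."""
    if outcome is VerifyOutcome.PASSED:
        return RepairStatus.RESOLVED, failed_attempts
    if outcome is VerifyOutcome.UNCERTAIN:
        # Inconclusive verify: leave the ticket open without counting a failure.
        return RepairStatus.PENDING, failed_attempts
    failed_attempts += 1
    if failed_attempts >= MAX_FAILED_ATTEMPTS:
        return RepairStatus.ESCALATED, failed_attempts
    return RepairStatus.PENDING, failed_attempts
```

Keeping the transition free of I/O is what makes it testable in isolation, which matches the stated reason for extracting the apply logic into static helpers.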
Summary
- `RepairTicket`, `RepairTarget`, `RepairStatus`, `RepairAttempt` records plus `IRepairTicketStore` and a file-backed implementation under `/data/agent/repair-tickets/{id}.json` (atomic temp+rename).
- Four appliers (`IRepairTargetApplier`):
  - `SkillBodyApplier`: append / replaceSection / deleteSection ops with auto-revert when verify fails.
  - `WorkingMemoryEvictApplier`: keyPrefix or explicit keys; idempotent.
  - `ToolDefaultRegisterApplier`: appends to `/data/agent/tool-defaults/{server}.json`, dedup by providerName.
  - `PromptBuilderHintApplier`: append/replace delimited sections in `/data/agent/prompt-hints/{category}.md`.
- Cache-free `RepairTicketVerifier` (each apply attempt observes fresh state).
- DreamService passes: `RunRepairTicketCreationPassAsync` (LLM-driven, dedups by PatternKey and `(target, change-hash)`) and `RunRepairTicketApplyPassAsync` (deterministic; apply → verify → resolve/escalate, escalation summary written to `repair-escalations-latest` working memory). Apply logic extracted as static helpers (`RunRepairTicketApplyAsync`, `ApplyAndVerifyTicketAsync`, `ComputeNextStatus`) for direct testability.
- `FileToolDefaultsProvider` in `RockBot.Tools.Mcp.Recovery.Providers` reads server JSON files with FileSystemWatcher hot reload and is registered alongside the three hard-coded providers, so `ToolDefaultRegister` tickets take effect on the next call.
- `ISystemPromptBuilder.Build` overload accepts an optional category; `DefaultSystemPromptBuilder` reads `prompt-hints/{category}.md` (mtime-aware caching); `AgentContextBuilder` derives the category from the working-memory namespace prefix (`session`, `patrol`, `subagent`).
- `AgentHostBuilder.WithRepairTickets()` extension; called from `RockBot.Agent.Program` immediately after `WithFailureClusterStore()`.

Acceptance criteria (from #348)
- A `SkillBody` ticket lands and is marked `Resolved` on the next cycle: `RepairTicketApplyPassTests.SkillBodyTicket_VerifySucceeds_TicketResolved`.
- … `Escalated` in `repair-escalations-latest`: `RepairTicketApplyPassTests.SkillBodyTicket_VerifyFailsThreeTimes_Escalated_AndRevertedEachTime`.
- … `failedChangeHashes` filter; static `ComputeNextStatus` test.

Out of scope
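- … `VerifyShape` plumbing.

Test plan
- `dotnet build RockBot.slnx` clean (0 errors, only pre-existing warnings).
- `dotnet test RockBot.slnx`: full suite passes (812 host, 150 tools, 136 agent, 121 llm, 106 wisp, 103 observation, plus all other suites; RabbitMQ-gated tests skipped as expected).
- Create a `tool-defaults/calendar-mcp.json` file with a `timeZone` default, call `get_calendar_events` without `timeZone`, confirm Phase 1 recovery resolves via `FileToolDefaultsProvider` (visible in `auto-recovered` telemetry).
- Create a `prompt-hints/session.md` file, send a user message, confirm the hint appears in the LLM trace.
- Create a `WorkingMemoryEvict` ticket via the store API, run a dream cycle, confirm the targeted WM keys are gone and the ticket transitions to `Resolved`.

🤖 Generated with Claude Code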
dotnet build RockBot.slnxclean (0 errors, only pre-existing warnings).dotnet test RockBot.slnx— full suite passes (812 host, 150 tools, 136 agent, 121 llm, 106 wisp, 103 observation, plus all other suites; RabbitMQ-gated tests skipped as expected).tool-defaults/calendar-mcp.jsonfile with atimeZonedefault, callget_calendar_eventswithouttimeZone, confirm Phase 1 recovery resolves viaFileToolDefaultsProvider(visible inauto-recoveredtelemetry).prompt-hints/session.mdfile, send a user message, confirm the hint appears in the LLM trace.WorkingMemoryEvictticket via the store API, run a dream cycle, confirm the targeted WM keys are gone and the ticket transitions toResolved.🤖 Generated with Claude Code