Add CRAG-lite corrective retrieval around the knowledge search tool#113

Open
RuneLind wants to merge 4 commits into main from corrective-retrieval-phase1

Conversation


RuneLind commented May 12, 2026

Phase 1 of the corrective-retrieval work: after a copilot-sdk bot calls Huginn's search_knowledge MCP tool, judge the results and — if they're weak — do one bounded corrective re-query before the model sees them. Off by default. Plan: ../mimir/plans/huginn-muninn-corrective-rag.md.

Today the bot's own judgement is the only "retrieval evaluator" — if a search returns junk, the model just copes. This adds a CRAG-style judge-and-requery step, consuming the Phase-0 corrective signal Huginn now emits (bestScore, per-result confidenceBand, retryHints, noConfidentResults, min_relevance — Huginn PR #36).

Mechanism (copilot-sdk connector only): the connector registers a Copilot SDK onPostToolUse hook that intercepts each search_knowledge result before the model sees it (applyCorrectiveRetrieval / runCorrectiveRetrieval). Claude-CLI bots run the MCP tool in their own process and can't be intercepted this way — left to Phase 3 (prompt-level guidance); the asymmetry is documented. When the toggle is off the hook isn't registered → byte-identical to before.
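
A minimal sketch of the conditional registration described above. The hook and event shapes here are hypothetical stand-ins (the real types come from the Copilot SDK), and `runCorrectiveRetrieval` is stubbed; only the names `onPostToolUse`, `search_knowledge`, and `modifiedResult` come from this PR:

```typescript
type PostToolUseEvent = { toolName: string; result: string };
type PostToolUseHandler = (e: PostToolUseEvent) => Promise<{ modifiedResult?: string }>;
interface HookRegistry { onPostToolUse?: PostToolUseHandler; }

// Stand-in for the real corrective pass in corrective-retrieval.ts.
async function runCorrectiveRetrieval(result: string): Promise<string> {
  return result;
}

function buildHooks(correctiveEnabled: boolean): HookRegistry {
  // Toggle off: the hook is never registered, so behaviour is byte-identical.
  if (!correctiveEnabled) return {};
  return {
    onPostToolUse: async (e) => {
      if (e.toolName !== "search_knowledge") return {}; // only knowledge searches
      return { modifiedResult: await runCorrectiveRetrieval(e.result) };
    },
  };
}
```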

Two grader modes (knowledge-grader.ts):

  • "signal" — the default; no model call, ~0ms for confident searches. Reads the cheap signal Huginn already emits (a *Weak match …* / *No confident match …* footer, or a "No results found" body) and, when present, re-queries with the broaderQuery / narrowerQuery from Huginn's own retryHints. A weak search costs ~one extra HTTP call; a fully uneventful check emits no trace span.
  • "haiku" — opt-in (correctiveRetrieval.grader: "haiku" / CORRECTIVE_RETRIEVAL_GRADER=haiku). A slimmed-down, awaited Haiku call that also reads the result snippets and can propose a semantic rewrite / a better collection. The result text is digested to the top hits' titles + bands + a short body prefix first, so it's ~3–5s per search rather than ~11s. Fail-soft: any Haiku error → correct (no change).
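
The signal-grader decision can be sketched roughly as below. Function name and footer/body regexes are illustrative guesses (the real detection lives in knowledge-search-client.ts per the review-fix commit); the verdict names and retryHints fields are from this PR:

```typescript
type Verdict = "correct" | "ambiguous" | "insufficient";
type RetryHints = { broaderQuery?: string; narrowerQuery?: string };

function gradeBySignal(
  resultText: string,
  hints: RetryHints,
): { verdict: Verdict; requery?: string } {
  // "No results found" body: insufficient; broaden if Huginn suggested how.
  if (/^No results found/m.test(resultText)) {
    return { verdict: "insufficient", requery: hints.broaderQuery };
  }
  // Trailing weak-match / no-confident-match footer: re-query with Huginn's own hint.
  if (/\*(?:Weak match|No confident match)[^*]*\*\s*$/.test(resultText)) {
    return { verdict: "ambiguous", requery: hints.broaderQuery ?? hints.narrowerQuery };
  }
  return { verdict: "correct" }; // confident: no re-query, no trace span
}
```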

On a non-correct verdict (corrective-retrieval.ts): re-query Huginn's /api/search with rerank=true (so the re-query's bands are trustworthy), strip the now-obsolete *Weak match* footer from the prior result, merge + dedupe the fresh hits by collection/doc_id (parsed from the rendered result text), append an inline [corrective retrieval — re-query #N: …] note. Hard cap 1 retry (configurable to 2), never recursive.
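
The footer strip and merge/dedupe steps amount to something like the following sketch (helper names and the `Hit` shape are hypothetical; the real code parses collection/doc_id out of the rendered result text):

```typescript
type Hit = { collection: string; docId: string; text: string };

// Strip a trailing *Weak match …* / *No confident match …* footer so the
// merged result can't re-trigger signal-mode grading on an already-handled signal.
function stripWeakFooter(text: string): string {
  return text.replace(/\n?\*(?:Weak match|No confident match)[^*]*\*\s*$/, "");
}

// Keep the first occurrence of each collection/doc_id pair; prior hits win
// over duplicates returned by the re-query.
function mergeDedupe(prior: Hit[], fresh: Hit[]): Hit[] {
  const seen = new Set<string>();
  return [...prior, ...fresh].filter((h) => {
    const key = `${h.collection}/${h.docId}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```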

Tracing: knowledge_grade (attrs: mode, verdicts, finalVerdict, …) + knowledge_requery (attrs: query, collection) spans synthesized under the tool span; a corrective chip on the parent tool span in the dashboard waterfall.
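
A rough sketch of the span-planning shape (the `Span` type and function signature are assumptions; span names and attributes are the ones listed above):

```typescript
type Span = { name: string; parentId: string; attrs: Record<string, string> };

function planCorrectiveSpans(
  toolSpanId: string,
  mode: "signal" | "haiku",
  verdicts: string[],
  requeries: { query: string; collection?: string }[],
): Span[] {
  if (verdicts.length === 0) return []; // uneventful check: no spans at all
  const spans: Span[] = [{
    name: "knowledge_grade",
    parentId: toolSpanId,
    attrs: { mode, verdicts: verdicts.join(","), finalVerdict: verdicts[verdicts.length - 1] },
  }];
  for (const r of requeries) {
    spans.push({
      name: "knowledge_requery",
      parentId: toolSpanId,
      attrs: { query: r.query, ...(r.collection ? { collection: r.collection } : {}) },
    });
  }
  return spans;
}
```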

Config: per-bot correctiveRetrieval block in config.json ({ enabled?, retryBudget?: 1|2, grader?: "signal"|"haiku" }), CORRECTIVE_RETRIEVAL_ENABLED / CORRECTIVE_RETRIEVAL_BUDGET / CORRECTIVE_RETRIEVAL_GRADER global defaults, CORRECTIVE_RETRIEVAL_DISABLED=1 kill-switch. Off by default.
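
A sketch of the implied precedence (field and env-var names are the ones listed above; the exact resolution order is an assumption, apart from the kill-switch winning):

```typescript
type CorrectiveConfig = {
  enabled?: boolean;
  retryBudget?: 1 | 2;
  grader?: "signal" | "haiku";
};

function resolveCorrective(
  bot: CorrectiveConfig | undefined,
  env: Record<string, string | undefined>,
): Required<CorrectiveConfig> {
  // Kill-switch beats everything, including a per-bot enabled: true.
  if (env.CORRECTIVE_RETRIEVAL_DISABLED === "1") {
    return { enabled: false, retryBudget: 1, grader: "signal" };
  }
  const enabled = bot?.enabled ?? env.CORRECTIVE_RETRIEVAL_ENABLED === "1"; // off by default
  const rawBudget = bot?.retryBudget ?? Number(env.CORRECTIVE_RETRIEVAL_BUDGET ?? 1);
  const retryBudget = rawBudget === 2 ? 2 : 1; // clamp to 1|2; NaN and junk fall back to 1
  const grader =
    bot?.grader ?? (env.CORRECTIVE_RETRIEVAL_GRADER === "haiku" ? "haiku" : "signal");
  return { enabled, retryBudget, grader };
}
```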

Testing

  • Unit tests — signal grader (confident → no-op, weak footer → re-query with hint, related-terms-only → no re-query, budget 2 doesn't loop, "No results" body), Haiku grader (verdict parsing, fail-soft, result digesting), orchestrator (retry/merge/dedupe, budget exhaustion/clamp, re-query errors, footer-hint fallback, collection redirect, graderMode in metadata), search client (rendering, doc-id extraction, footer parsing, fetch + non-2xx), trace-span planner, connector hook helpers. bun run test green; tsc --noEmit clean.
  • Manual end-to-end against a running Copilot-SDK bot + Huginn — enable correctiveRetrieval (signal mode), ask a question whose first search Huginn flags weak, confirm a knowledge_requery span appears and the bot answers from the merged set; confirm a confident search adds no knowledge_grade span and no latency. Validated once in "haiku" mode (the trace screenshot in the PR thread — grader took ~11s, which is why signal is now the default).
  • E2E — n/a

Notes

  • Depends on the Phase-0 contract — currently on Huginn PR #36 (Refactor: extract shared Knowledge API client), branch phase0-corrective-signal, not yet on Huginn main. The signal grader leans on Huginn's *Weak match* / *No confident match* footer + retryHints, which that branch adds.
  • A reranked: false flag on the /api/search response is a noted Huginn-side follow-up (coordinated with the Huginn peer).
  • Known limitation: a corrective re-query hits /api/search directly, so it isn't constrained by the per-bot ALLOWED_COLLECTIONS the MCP adapter applies when no collection is specified in the original call (it does respect a collection passed in the original call). Collection-scoping the re-query is a possible follow-up.

RuneLind added 4 commits May 12, 2026 20:49
Phase 1 of the corrective-retrieval work (plan: mimir/plans/huginn-muninn-corrective-rag.md).
After a copilot-sdk bot calls Huginn's `search_knowledge`, optionally grade the
results with a dedicated Haiku call and do a bounded corrective re-query before
the model sees them — consuming the Phase-0 contract (`bestScore`,
`confidenceBand`, `retryHints`, `noConfidentResults`, `min_relevance`).

- `knowledge-grader.ts` — awaited Haiku evaluator → correct/ambiguous/insufficient
  + rewritten query / suggested collection. Fail-soft to "correct".
- `corrective-retrieval.ts` — grade → re-query `/api/search` (rerank=true) →
  merge + dedupe by collection/doc_id (parsed from the rendered result text) →
  consolidated text + `corrective` metadata. Hard cap 1 retry (configurable 2),
  never recursive.
- `knowledge-search-client.ts` — HTTP client for `/api/search` + a renderer
  mirroring Huginn's MCP-adapter result format.
- copilot-sdk connector: registers a `hooks.onPostToolUse` handler that runs the
  corrective pass and returns a `modifiedResult`; re-appends any trailing Huginn
  trace marker so downstream trace extraction is unaffected. Claude-CLI bots
  can't be intercepted this way — left to Phase 3 (prompt-level guidance).
- Trace spans: `knowledge_grade` + `knowledge_requery` synthesized under the
  tool span (`corrective-trace-spans.ts`); a corrective chip on the parent tool
  span in the dashboard waterfall.
- Config: per-bot `correctiveRetrieval` block, `CORRECTIVE_RETRIEVAL_ENABLED`
  global default, `CORRECTIVE_RETRIEVAL_DISABLED` kill-switch. Off by default —
  when off the hook isn't registered and behaviour is byte-identical to before.
- Tests: grader (verdict parsing, fail-soft), orchestrator (retry/merge/dedupe,
  budget exhaustion, budget clamp, re-query errors), search client (rendering,
  doc-id extraction, fetch), trace-span planner, connector hook helpers.

The awaited Haiku grader (`claude` CLI per knowledge search, ~11s on a 12 KB
result prompt) is too slow for the hot path. Add a second grader mode and make
it the default:

- `"signal"` (default) — no model call. Reads the cheap signal Huginn already
  emits (a `*Weak match …*` / `*No confident match …*` footer, or a "No results
  found" body) and, when present, re-queries with the `broaderQuery` /
  `narrowerQuery` from Huginn's own `retryHints`. ~0ms for confident searches;
  ~one extra HTTP call when weak. A fully uneventful signal-mode check emits no
  trace span.
- `"haiku"` (opt-in via `correctiveRetrieval.grader: "haiku"` /
  `CORRECTIVE_RETRIEVAL_GRADER=haiku`) — the previous behaviour, but the result
  text is now digested down to the top hits' titles + bands + a short body
  prefix before being sent to Haiku, so it's ~3–5s instead of ~11s.

Also: on a corrective merge, the now-obsolete `*Weak match — try: …*` footer is
stripped from the prior result before the fresh hits are spliced in (keeps the
model's context clean and stops signal-mode re-grading from re-detecting an
already-handled weak signal). `corrective` metadata + the `knowledge_grade`
span gain a `graderMode` / `mode` field.

Tests updated for the new shape; signal-grader paths covered (confident → no-op,
weak footer → re-query with hint, related-terms-only → no re-query, budget 2
doesn't loop, "No results" body); Haiku digest covered.

A Huginn search can keep hundreds of candidates yet hand the model "No results
found / low confidence" — the kept/fetched candidate count chip hid that. Now,
when the captured tool output (or the trace's Phase-0 `response` block) shows the
model got nothing usable, the row replaces the `N/N` candidate chip with a red
`0 hits` chip; a weak-match footer flips the count chip to the low-confidence
palette and adds a tooltip note. The corrective chip's tooltip also now carries
the grader mode and the grade reason (e.g. "corrective retrieval (signal):
insufficient — search returned no results; no re-query").

Address review findings on the corrective-retrieval branch:

- attachCorrectiveOutcomes: the hook now pushes one slot per knowledge-search
  tool call (a null when that search had no outcome), so a skipped search no
  longer shifts a later search's metadata onto it. This was a real misalignment
  bug — common in signal mode, where a confident search produces no outcome.
- runCorrectiveRetrieval reuses clampBudget instead of re-deriving the clamp
  inline (the inline version returned NaN for non-finite input).
- Consolidate the "weak match / no results" detection regexes into one place
  (knowledge-search-client.ts: classifyResultSignal, extractTrailingRetryFooter,
  WEAK_RESULT_RELEVANCE); knowledge-grader.ts now consumes them.
- CorrectiveToolMeta uses proper string-union types; named GRADER_TIMEOUT_MS and
  WEAK_BEST_SCORE constants instead of bare literals; lookup table for the
  corrective chip's verdict→style mapping; drop the unused KnowledgeSearchResponse.lowConfidence field; trim a few rot-prone "Phase N" doc comments.