Add CRAG-lite corrective retrieval around the knowledge search tool #113
Open
RuneLind wants to merge 4 commits into
Phase 1 of the corrective-retrieval work (plan: mimir/plans/huginn-muninn-corrective-rag.md). After a copilot-sdk bot calls Huginn's `search_knowledge`, optionally grade the results with a dedicated Haiku call and do a bounded corrective re-query before the model sees them — consuming the Phase-0 contract (`bestScore`, `confidenceBand`, `retryHints`, `noConfidentResults`, `min_relevance`).

- `knowledge-grader.ts` — an awaited Haiku evaluator → correct/ambiguous/insufficient, plus a rewritten query / suggested collection. Fails soft to "correct".
- `corrective-retrieval.ts` — grade → re-query `/api/search` (rerank=true) → merge + dedupe by collection/doc_id (parsed from the rendered result text) → consolidated text + `corrective` metadata. Hard cap of 1 retry (configurable to 2), never recursive.
- `knowledge-search-client.ts` — HTTP client for `/api/search` plus a renderer mirroring Huginn's MCP-adapter result format.
- copilot-sdk connector: registers a `hooks.onPostToolUse` handler that runs the corrective pass and returns a `modifiedResult`; re-appends any trailing Huginn trace marker so downstream trace extraction is unaffected. Claude-CLI bots can't be intercepted this way — left to Phase 3 (prompt-level guidance).
- Trace spans: `knowledge_grade` + `knowledge_requery` synthesized under the tool span (`corrective-trace-spans.ts`); a corrective chip on the parent tool span in the dashboard waterfall.
- Config: per-bot `correctiveRetrieval` block, `CORRECTIVE_RETRIEVAL_ENABLED` global default, `CORRECTIVE_RETRIEVAL_DISABLED` kill-switch. Off by default — when off, the hook isn't registered and behaviour is byte-identical to before.
- Tests: grader (verdict parsing, fail-soft), orchestrator (retry/merge/dedupe, budget exhaustion, budget clamp, re-query errors), search client (rendering, doc-id extraction, fetch), trace-span planner, connector hook helpers.
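The merge + dedupe step described above could look roughly like this — a minimal sketch, not the PR's actual diff; the `ResultHit` shape and `mergeHits` name are assumptions, though the dedupe key (collection/doc_id, parsed from the rendered result text) is the one the commit describes:

```typescript
// One hit parsed back out of the rendered search-result text.
// The shape here is illustrative; the PR parses collection/doc_id
// from the rendered result block.
interface ResultHit {
  collection: string;
  docId: string;
  text: string; // the rendered block for this hit
}

// Merge the original hits with the corrective re-query's hits, deduping
// by collection/doc_id so a re-fetched document isn't shown to the model
// twice. The first occurrence (the original result) wins.
function mergeHits(original: ResultHit[], requeried: ResultHit[]): ResultHit[] {
  const seen = new Set<string>();
  const merged: ResultHit[] = [];
  for (const hit of [...original, ...requeried]) {
    const key = `${hit.collection}/${hit.docId}`;
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(hit);
  }
  return merged;
}
```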
The Haiku grader (one `claude` CLI call per knowledge search, ~11s on a 12 KB result prompt) is too slow for the hot path. Add a second grader mode and make it the default:

- `"signal"` (default) — no model call. Reads the cheap signal Huginn already emits (a `*Weak match …*` / `*No confident match …*` footer, or a "No results found" body) and, when present, re-queries with the `broaderQuery` / `narrowerQuery` from Huginn's own `retryHints`. ~0ms for confident searches; ~one extra HTTP call when weak. A fully uneventful signal-mode check emits no trace span.
- `"haiku"` (opt-in via `correctiveRetrieval.grader: "haiku"` / `CORRECTIVE_RETRIEVAL_GRADER=haiku`) — the previous behaviour, but the result text is now digested down to the top hits' titles + bands + a short body prefix before being sent to Haiku, so it takes ~3–5s instead of ~11s.

Also: on a corrective merge, the now-obsolete `*Weak match — try: …*` footer is stripped from the prior result before the fresh hits are spliced in (this keeps the model's context clean and stops signal-mode re-grading from re-detecting an already-handled weak signal). `corrective` metadata and the `knowledge_grade` span gain a `graderMode` / `mode` field. Tests updated for the new shape; signal-grader paths covered (confident → no-op, weak footer → re-query with hint, related-terms-only → no re-query, budget 2 doesn't loop, "No results" body); the Haiku digest is covered too.
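The signal-mode check above amounts to a couple of pattern tests plus a lookup into Huginn's own hints. A sketch, under stated assumptions — the function name, verdict union, and regexes are illustrative, not the PR's code (the real detection lives in `classifyResultSignal` per a later commit):

```typescript
type Verdict = "correct" | "ambiguous" | "insufficient";

interface RetryHints {
  broaderQuery?: string;
  narrowerQuery?: string;
}

// Signal-mode grading: no model call. Detect the "No results found" body
// or the trailing "*Weak match ...*" / "*No confident match ...*" footer
// Huginn emits, and pick a re-query from Huginn's own retryHints.
function gradeBySignal(
  resultText: string,
  hints: RetryHints,
): { verdict: Verdict; requery?: string } {
  if (/^No results found/m.test(resultText)) {
    return { verdict: "insufficient", requery: hints.broaderQuery };
  }
  if (/\*(Weak match|No confident match)[^*]*\*\s*$/.test(resultText)) {
    // Weak footer present: re-query with whichever hint Huginn offered.
    return { verdict: "ambiguous", requery: hints.broaderQuery ?? hints.narrowerQuery };
  }
  return { verdict: "correct" }; // confident search: no re-query, no span
}
```

When `requery` comes back `undefined` (e.g. Huginn offered only related terms, no rewritten query), the orchestrator would simply skip the re-query — matching the "related-terms-only → no re-query" test case above.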
A Huginn search can keep hundreds of candidates yet hand the model "No results found / low confidence" — the kept/fetched candidate count chip hid that. Now, when the captured tool output (or the trace's Phase-0 `response` block) shows the model got nothing usable, the row replaces the `N/N` candidate chip with a red `0 hits` chip; a weak-match footer flips the count chip to the low-confidence palette and adds a tooltip note. The corrective chip's tooltip also now carries the grader mode and the grade reason (e.g. "corrective retrieval (signal): insufficient — search returned no results; no re-query").
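The chip-selection rule described above could be sketched as a small classifier over the captured tool output — names and regexes here are illustrative, not the dashboard's actual code:

```typescript
type ChipKind = "zeroHits" | "lowConfidence" | "normal";

// Decide which candidate chip the waterfall row shows:
// - unusable output ("No results found")  -> red "0 hits" chip
// - trailing weak-match footer            -> low-confidence palette + tooltip
// - otherwise                             -> the plain kept/fetched N/N chip
function pickCandidateChip(renderedOutput: string): ChipKind {
  if (/^No results found/m.test(renderedOutput)) return "zeroHits";
  if (/\*(Weak match|No confident match)[^*]*\*\s*$/.test(renderedOutput)) {
    return "lowConfidence";
  }
  return "normal";
}
```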
Address review findings on the corrective-retrieval branch:

- `attachCorrectiveOutcomes`: the hook now pushes one slot per knowledge-search tool call (a null when that search had no outcome), so a skipped search no longer shifts a later search's metadata onto it. This was a real misalignment bug — common in signal mode, where a confident search produces no outcome.
- `runCorrectiveRetrieval` reuses `clampBudget` instead of re-deriving the clamp inline (the inline version returned NaN for non-finite input).
- Consolidate the "weak match / no results" detection regexes into one place (`knowledge-search-client.ts`: `classifyResultSignal`, `extractTrailingRetryFooter`, `WEAK_RESULT_RELEVANCE`); `knowledge-grader.ts` now consumes them.
- `CorrectiveToolMeta` uses proper string-union types; named `GRADER_TIMEOUT_MS` and `WEAK_BEST_SCORE` constants instead of bare literals; a lookup table for the corrective chip's verdict→style mapping; drop the unused `KnowledgeSearchResponse.lowConfidence` field; trim a few rot-prone "Phase N" doc comments.
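The first two fixes might look like this — a sketch only: `clampBudget` and `attachCorrectiveOutcomes` are the PR's names, but the bodies and the `CorrectiveOutcome` shape here are assumptions:

```typescript
// Clamp the retry budget to the supported 1..2 range. Non-finite input
// (where the old inline clamp produced NaN) falls back to the default of 1.
function clampBudget(raw: unknown): 1 | 2 {
  const n =
    typeof raw === "number" && Number.isFinite(raw) ? Math.trunc(raw) : 1;
  return n >= 2 ? 2 : 1;
}

interface CorrectiveOutcome {
  verdict: string;
  requeries: number;
}

// One slot per knowledge-search tool call, in call order. A search with no
// corrective outcome (common in signal mode, where a confident search is a
// no-op) contributes an explicit null, so a later search's metadata can no
// longer shift onto an earlier, skipped search.
function attachCorrectiveOutcomes(
  calls: { outcome?: CorrectiveOutcome }[],
): (CorrectiveOutcome | null)[] {
  return calls.map((c) => c.outcome ?? null);
}
```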
Phase 1 of the corrective-retrieval work: after a copilot-sdk bot calls Huginn's `search_knowledge` MCP tool, judge the results and — if they're weak — do one bounded corrective re-query before the model sees them. Off by default. Plan: `../mimir/plans/huginn-muninn-corrective-rag.md`.

Today the bot's own judgement is the only "retrieval evaluator" — if a search returns junk, the model just copes. This adds a CRAG-style judge-and-requery step, consuming the Phase-0 corrective signal Huginn now emits (`bestScore`, per-result `confidenceBand`, `retryHints`, `noConfidentResults`, `min_relevance` — Huginn PR #36).

**Mechanism** (copilot-sdk connector only): the connector registers a Copilot SDK `onPostToolUse` hook that intercepts each `search_knowledge` result before the model sees it (`applyCorrectiveRetrieval` → `runCorrectiveRetrieval`). Claude-CLI bots run the MCP tool in their own process and can't be intercepted this way — left to Phase 3 (prompt-level guidance); the asymmetry is documented. When the toggle is off, the hook isn't registered → byte-identical to before.

**Two grader modes** (`knowledge-grader.ts`):

- `"signal"` — the default; no model call, ~0ms for confident searches. Reads the cheap signal Huginn already emits (a `*Weak match …*` / `*No confident match …*` footer, or a "No results found" body) and, when present, re-queries with the `broaderQuery` / `narrowerQuery` from Huginn's own `retryHints`. A weak search costs ~one extra HTTP call; a fully uneventful check emits no trace span.
- `"haiku"` — opt-in (`correctiveRetrieval.grader: "haiku"` / `CORRECTIVE_RETRIEVAL_GRADER=haiku`). A slimmed, awaited Haiku call that also reads the result snippets and can propose a semantic rewrite or a better collection. The result text is digested to the top hits' titles + bands + a short body prefix first, so it takes ~3–5s per search rather than ~11s. Fail-soft: any Haiku error → `correct` (no change).

**On a non-`correct` verdict** (`corrective-retrieval.ts`): re-query Huginn's `/api/search` with `rerank=true` (so the re-query's bands are trustworthy), strip the now-obsolete `*Weak match*` footer from the prior result, merge + dedupe the fresh hits by collection/doc_id (parsed from the rendered result text), and append an inline `[corrective retrieval — re-query #N: …]` note. Hard cap of 1 retry (configurable to 2), never recursive.

**Tracing:** `knowledge_grade` (attrs: `mode`, `verdicts`, `finalVerdict`, …) + `knowledge_requery` (attrs: `query`, `collection`) spans synthesized under the tool span; a corrective chip on the parent tool span in the dashboard waterfall.

**Config:** per-bot `correctiveRetrieval` block in `config.json` (`{ enabled?, retryBudget?: 1|2, grader?: "signal"|"haiku" }`), `CORRECTIVE_RETRIEVAL_ENABLED` / `CORRECTIVE_RETRIEVAL_BUDGET` / `CORRECTIVE_RETRIEVAL_GRADER` global defaults, `CORRECTIVE_RETRIEVAL_DISABLED=1` kill-switch. Off by default.

**Testing:** `bun run test` green; `tsc --noEmit` clean. Manual check: enable `correctiveRetrieval` (signal mode), ask a question whose first search Huginn flags weak, confirm a `knowledge_requery` span appears and the bot answers from the merged set; confirm a confident search adds no `knowledge_grade` span and no latency. Validated once in `"haiku"` mode (the trace screenshot in the PR thread — the grader took ~11s, which is why signal is now the default).

**Notes:**

- The Phase-0 signal lives on Huginn's `phase0-corrective-signal` branch, not yet on Huginn `main`. The signal grader leans on Huginn's `*Weak match*` / `*No confident match*` footer + `retryHints`, which that branch adds.
- A `reranked: false` flag on the `/api/search` response is a noted Huginn-side follow-up (coordinated with the Huginn peer).
- The corrective re-query hits `/api/search` directly, so it isn't constrained by the per-bot `ALLOWED_COLLECTIONS` the MCP adapter applies when no collection is specified in the original call (it does respect a collection passed in the original call). Collection-scoping the re-query is a possible follow-up.
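Putting the config knobs together, the resolution order the description implies (kill-switch first, then the per-bot `config.json` block, then env-level defaults, then hard-coded defaults) could be sketched like this — the setting names are from the PR, but `resolveCorrectiveConfig` and the exact precedence are assumptions:

```typescript
interface CorrectiveRetrievalConfig {
  enabled?: boolean;
  retryBudget?: 1 | 2;
  grader?: "signal" | "haiku";
}

// Resolve the effective per-bot settings. CORRECTIVE_RETRIEVAL_DISABLED=1
// wins outright; otherwise the per-bot block overrides the env defaults,
// which override the hard defaults (off, budget 1, signal grader).
function resolveCorrectiveConfig(
  perBot: CorrectiveRetrievalConfig | undefined,
  env: Record<string, string | undefined>,
): Required<CorrectiveRetrievalConfig> {
  const killed = env.CORRECTIVE_RETRIEVAL_DISABLED === "1";
  const enabled =
    !killed && (perBot?.enabled ?? env.CORRECTIVE_RETRIEVAL_ENABLED === "1");
  const retryBudget =
    perBot?.retryBudget ?? (env.CORRECTIVE_RETRIEVAL_BUDGET === "2" ? 2 : 1);
  const grader =
    perBot?.grader ??
    (env.CORRECTIVE_RETRIEVAL_GRADER === "haiku" ? "haiku" : "signal");
  return { enabled, retryBudget, grader };
}
```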