Add CRAG-lite corrective retrieval around the knowledge search tool #113
Open
RuneLind wants to merge 4 commits into
Phase 1 of the corrective-retrieval work (plan: mimir/plans/huginn-muninn-corrective-rag.md). After a copilot-sdk bot calls Huginn's `search_knowledge`, optionally grade the results with a dedicated Haiku call and do a bounded corrective re-query before the model sees them — consuming the Phase-0 contract (`bestScore`, `confidenceBand`, `retryHints`, `noConfidentResults`, `min_relevance`).

- `knowledge-grader.ts` — an awaited Haiku evaluator → correct/ambiguous/insufficient, plus a rewritten query / suggested collection. Fails soft to "correct".
- `corrective-retrieval.ts` — grade → re-query `/api/search` (rerank=true) → merge + dedupe by collection/doc_id (parsed from the rendered result text) → consolidated text + `corrective` metadata. Hard cap of 1 retry (configurable to 2), never recursive.
- `knowledge-search-client.ts` — HTTP client for `/api/search` plus a renderer mirroring Huginn's MCP-adapter result format.
- copilot-sdk connector: registers a `hooks.onPostToolUse` handler that runs the corrective pass and returns a `modifiedResult`; re-appends any trailing Huginn trace marker so downstream trace extraction is unaffected. Claude-CLI bots can't be intercepted this way — left to Phase 3 (prompt-level guidance).
- Trace spans: `knowledge_grade` + `knowledge_requery` synthesized under the tool span (`corrective-trace-spans.ts`); a corrective chip on the parent tool span in the dashboard waterfall.
- Config: per-bot `correctiveRetrieval` block, `CORRECTIVE_RETRIEVAL_ENABLED` global default, `CORRECTIVE_RETRIEVAL_DISABLED` kill-switch. Off by default — when off, the hook isn't registered and behaviour is byte-identical to before.
- Tests: grader (verdict parsing, fail-soft), orchestrator (retry/merge/dedupe, budget exhaustion, budget clamp, re-query errors), search client (rendering, doc-id extraction, fetch), trace-span planner, connector hook helpers.
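The merge + dedupe step described above could look roughly like this — a minimal sketch, not the PR's actual diff; the `ResultHit` shape and `mergeHits` name are assumptions, though the dedupe key (collection/doc_id, parsed from the rendered result text) is the one the commit describes:

```typescript
// One hit parsed back out of the rendered search-result text.
// The shape here is illustrative; the PR parses collection/doc_id
// from the rendered result block.
interface ResultHit {
  collection: string;
  docId: string;
  text: string; // the rendered block for this hit
}

// Merge the original hits with the corrective re-query's hits, deduping
// by collection/doc_id so a re-fetched document isn't shown to the model
// twice. The first occurrence (the original result) wins.
function mergeHits(original: ResultHit[], requeried: ResultHit[]): ResultHit[] {
  const seen = new Set<string>();
  const merged: ResultHit[] = [];
  for (const hit of [...original, ...requeried]) {
    const key = `${hit.collection}/${hit.docId}`;
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(hit);
  }
  return merged;
}
```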
The Haiku grader (one `claude` CLI call per knowledge search, ~11s on a 12 KB result prompt) is too slow for the hot path. Add a second grader mode and make it the default:

- `"signal"` (default) — no model call. Reads the cheap signal Huginn already emits (a `*Weak match …*` / `*No confident match …*` footer, or a "No results found" body) and, when present, re-queries with the `broaderQuery` / `narrowerQuery` from Huginn's own `retryHints`. ~0ms for confident searches; ~one extra HTTP call when weak. A fully uneventful signal-mode check emits no trace span.
- `"haiku"` (opt-in via `correctiveRetrieval.grader: "haiku"` / `CORRECTIVE_RETRIEVAL_GRADER=haiku`) — the previous behaviour, but the result text is now digested down to the top hits' titles + bands + a short body prefix before being sent to Haiku, so it takes ~3–5s instead of ~11s.

Also: on a corrective merge, the now-obsolete `*Weak match — try: …*` footer is stripped from the prior result before the fresh hits are spliced in (this keeps the model's context clean and stops signal-mode re-grading from re-detecting an already-handled weak signal). `corrective` metadata and the `knowledge_grade` span gain a `graderMode` / `mode` field. Tests updated for the new shape; signal-grader paths covered (confident → no-op, weak footer → re-query with hint, related-terms-only → no re-query, budget 2 doesn't loop, "No results" body); the Haiku digest is covered too.
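The signal-mode check above amounts to a couple of pattern tests plus a lookup into Huginn's own hints. A sketch, under stated assumptions — the function name, verdict union, and regexes are illustrative, not the PR's code (the real detection lives in `classifyResultSignal` per a later commit):

```typescript
type Verdict = "correct" | "ambiguous" | "insufficient";

interface RetryHints {
  broaderQuery?: string;
  narrowerQuery?: string;
}

// Signal-mode grading: no model call. Detect the "No results found" body
// or the trailing "*Weak match ...*" / "*No confident match ...*" footer
// Huginn emits, and pick a re-query from Huginn's own retryHints.
function gradeBySignal(
  resultText: string,
  hints: RetryHints,
): { verdict: Verdict; requery?: string } {
  if (/^No results found/m.test(resultText)) {
    return { verdict: "insufficient", requery: hints.broaderQuery };
  }
  if (/\*(Weak match|No confident match)[^*]*\*\s*$/.test(resultText)) {
    // Weak footer present: re-query with whichever hint Huginn offered.
    return { verdict: "ambiguous", requery: hints.broaderQuery ?? hints.narrowerQuery };
  }
  return { verdict: "correct" }; // confident search: no re-query, no span
}
```

When `requery` comes back `undefined` (e.g. Huginn offered only related terms, no rewritten query), the orchestrator would simply skip the re-query — matching the "related-terms-only → no re-query" test case above.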
A Huginn search can keep hundreds of candidates yet hand the model "No results found / low confidence" — the kept/fetched candidate count chip hid that. Now, when the captured tool output (or the trace's Phase-0 `response` block) shows the model got nothing usable, the row replaces the `N/N` candidate chip with a red `0 hits` chip; a weak-match footer flips the count chip to the low-confidence palette and adds a tooltip note. The corrective chip's tooltip also now carries the grader mode and the grade reason (e.g. "corrective retrieval (signal): insufficient — search returned no results; no re-query").
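The chip-selection rule described above could be sketched as a small classifier over the captured tool output — names and regexes here are illustrative, not the dashboard's actual code:

```typescript
type ChipKind = "zeroHits" | "lowConfidence" | "normal";

// Decide which candidate chip the waterfall row shows:
// - unusable output ("No results found")  -> red "0 hits" chip
// - trailing weak-match footer            -> low-confidence palette + tooltip
// - otherwise                             -> the plain kept/fetched N/N chip
function pickCandidateChip(renderedOutput: string): ChipKind {
  if (/^No results found/m.test(renderedOutput)) return "zeroHits";
  if (/\*(Weak match|No confident match)[^*]*\*\s*$/.test(renderedOutput)) {
    return "lowConfidence";
  }
  return "normal";
}
```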
Address review findings on the corrective-retrieval branch:

- `attachCorrectiveOutcomes`: the hook now pushes one slot per knowledge-search tool call (a null when that search had no outcome), so a skipped search no longer shifts a later search's metadata onto it. This was a real misalignment bug — common in signal mode, where a confident search produces no outcome.
- `runCorrectiveRetrieval` reuses `clampBudget` instead of re-deriving the clamp inline (the inline version returned NaN for non-finite input).
- Consolidate the "weak match / no results" detection regexes into one place (`knowledge-search-client.ts`: `classifyResultSignal`, `extractTrailingRetryFooter`, `WEAK_RESULT_RELEVANCE`); `knowledge-grader.ts` now consumes them.
- `CorrectiveToolMeta` uses proper string-union types; named `GRADER_TIMEOUT_MS` and `WEAK_BEST_SCORE` constants instead of bare literals; a lookup table for the corrective chip's verdict→style mapping; drop the unused `KnowledgeSearchResponse.lowConfidence` field; trim a few rot-prone "Phase N" doc comments.
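The first two fixes might look like this — a sketch only: `clampBudget` and `attachCorrectiveOutcomes` are the PR's names, but the bodies and the `CorrectiveOutcome` shape here are assumptions:

```typescript
// Clamp the retry budget to the supported 1..2 range. Non-finite input
// (where the old inline clamp produced NaN) falls back to the default of 1.
function clampBudget(raw: unknown): 1 | 2 {
  const n =
    typeof raw === "number" && Number.isFinite(raw) ? Math.trunc(raw) : 1;
  return n >= 2 ? 2 : 1;
}

interface CorrectiveOutcome {
  verdict: string;
  requeries: number;
}

// One slot per knowledge-search tool call, in call order. A search with no
// corrective outcome (common in signal mode, where a confident search is a
// no-op) contributes an explicit null, so a later search's metadata can no
// longer shift onto an earlier, skipped search.
function attachCorrectiveOutcomes(
  calls: { outcome?: CorrectiveOutcome }[],
): (CorrectiveOutcome | null)[] {
  return calls.map((c) => c.outcome ?? null);
}
```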
Phase 1 of the corrective-retrieval work: after a copilot-sdk bot calls Huginn's `search_knowledge` MCP tool, judge the results and — if they're weak — do one bounded corrective re-query before the model sees them. Off by default. Plan: `../mimir/plans/huginn-muninn-corrective-rag.md`.

Today the bot's own judgement is the only "retrieval evaluator" — if a search returns junk, the model just copes. This adds a CRAG-style judge-and-requery step, consuming the Phase-0 corrective signal Huginn now emits (`bestScore`, per-result `confidenceBand`, `retryHints`, `noConfidentResults`, `min_relevance` — Huginn PR #36).

**Mechanism** (copilot-sdk connector only): the connector registers a Copilot SDK `onPostToolUse` hook that intercepts each `search_knowledge` result before the model sees it (`applyCorrectiveRetrieval` → `runCorrectiveRetrieval`). Claude-CLI bots run the MCP tool in their own process and can't be intercepted this way — left to Phase 3 (prompt-level guidance); the asymmetry is documented. When the toggle is off, the hook isn't registered → byte-identical to before.

**Two grader modes** (`knowledge-grader.ts`):

- `"signal"` — the default; no model call, ~0ms for confident searches. Reads the cheap signal Huginn already emits (a `*Weak match …*` / `*No confident match …*` footer, or a "No results found" body) and, when present, re-queries with the `broaderQuery` / `narrowerQuery` from Huginn's own `retryHints`. A weak search costs ~one extra HTTP call; a fully uneventful check emits no trace span.
- `"haiku"` — opt-in (`correctiveRetrieval.grader: "haiku"` / `CORRECTIVE_RETRIEVAL_GRADER=haiku`). A slimmed, awaited Haiku call that also reads the result snippets and can propose a semantic rewrite or a better collection. The result text is digested to the top hits' titles + bands + a short body prefix first, so it takes ~3–5s per search rather than ~11s. Fail-soft: any Haiku error → `correct` (no change).

**On a non-`correct` verdict** (`corrective-retrieval.ts`): re-query Huginn's `/api/search` with `rerank=true` (so the re-query's bands are trustworthy), strip the now-obsolete `*Weak match*` footer from the prior result, merge + dedupe the fresh hits by collection/doc_id (parsed from the rendered result text), and append an inline `[corrective retrieval — re-query #N: …]` note. Hard cap of 1 retry (configurable to 2), never recursive.

**Tracing:** `knowledge_grade` (attrs: `mode`, `verdicts`, `finalVerdict`, …) + `knowledge_requery` (attrs: `query`, `collection`) spans synthesized under the tool span; a corrective chip on the parent tool span in the dashboard waterfall.

**Config:** per-bot `correctiveRetrieval` block in `config.json` (`{ enabled?, retryBudget?: 1|2, grader?: "signal"|"haiku" }`), `CORRECTIVE_RETRIEVAL_ENABLED` / `CORRECTIVE_RETRIEVAL_BUDGET` / `CORRECTIVE_RETRIEVAL_GRADER` global defaults, `CORRECTIVE_RETRIEVAL_DISABLED=1` kill-switch. Off by default.

**Testing:** `bun run test` green; `tsc --noEmit` clean. Manual check: enable `correctiveRetrieval` (signal mode), ask a question whose first search Huginn flags weak, confirm a `knowledge_requery` span appears and the bot answers from the merged set; confirm a confident search adds no `knowledge_grade` span and no latency. Validated once in `"haiku"` mode (the trace screenshot in the PR thread — the grader took ~11s, which is why signal is now the default).

**Notes:**

- The Phase-0 signal lives on Huginn's `phase0-corrective-signal` branch, not yet on Huginn `main`. The signal grader leans on Huginn's `*Weak match*` / `*No confident match*` footer + `retryHints`, which that branch adds.
- A `reranked: false` flag on the `/api/search` response is a noted Huginn-side follow-up (coordinated with the Huginn peer).
- The corrective re-query hits `/api/search` directly, so it isn't constrained by the per-bot `ALLOWED_COLLECTIONS` the MCP adapter applies when no collection is specified in the original call (it does respect a collection passed in the original call). Collection-scoping the re-query is a possible follow-up.
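Putting the config knobs together, the resolution order the description implies (kill-switch first, then the per-bot `config.json` block, then env-level defaults, then hard-coded defaults) could be sketched like this — the setting names are from the PR, but `resolveCorrectiveConfig` and the exact precedence are assumptions:

```typescript
interface CorrectiveRetrievalConfig {
  enabled?: boolean;
  retryBudget?: 1 | 2;
  grader?: "signal" | "haiku";
}

// Resolve the effective per-bot settings. CORRECTIVE_RETRIEVAL_DISABLED=1
// wins outright; otherwise the per-bot block overrides the env defaults,
// which override the hard defaults (off, budget 1, signal grader).
function resolveCorrectiveConfig(
  perBot: CorrectiveRetrievalConfig | undefined,
  env: Record<string, string | undefined>,
): Required<CorrectiveRetrievalConfig> {
  const killed = env.CORRECTIVE_RETRIEVAL_DISABLED === "1";
  const enabled =
    !killed && (perBot?.enabled ?? env.CORRECTIVE_RETRIEVAL_ENABLED === "1");
  const retryBudget =
    perBot?.retryBudget ?? (env.CORRECTIVE_RETRIEVAL_BUDGET === "2" ? 2 : 1);
  const grader =
    perBot?.grader ??
    (env.CORRECTIVE_RETRIEVAL_GRADER === "haiku" ? "haiku" : "signal");
  return { enabled, retryBudget, grader };
}
```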