feat(web_fetch): query-aware semantic extraction — beat Hermes's positional truncate-and-store #90
Merged
Hermes PR #54843 replaced a per-page LLM summarizer with head+tail positional truncation
(fast + cheap, but their own eval was 3/4 on "answer in the returned window" — a mid-document
answer forces an extra read_file round-trip). We keep the no-LLM win and beat it on quality.
- extract-select.ts (PURE): selectRelevantPassages(text, {query, budget}) ranks passages by the
SAME stemmed TF-cosine we ship for recall, keeps the lede as an anchor, and returns the most
relevant passages in document order — so a mid-page answer comes back in ONE call. No query →
identical head+tail window, so we're never worse. Plus splitPassages, headTailWindow,
stripBase64Images (drop base64 token-bombs, keep http image links — mirrors their image fix).
- web_fetch: new `query` param; oversized pages store full clean text to ~/.qodex/cache/web and
return the relevant slice + a recovery footer (read_file path/offset/limit) so nothing is lost.
Zero added LLM cost — the ranker is pure lexical math.
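The selection step described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in extract-select.ts: every name (`selectByQuery`, `tokenize`, `termFreq`, `cosine`) is hypothetical, passages are assumed to be blank-line-separated blocks, stemming is omitted, and the no-query head+tail fallback is left out.

```typescript
// Illustrative sketch only: all names are hypothetical, not the API of extract-select.ts.

// Lowercased word tokens. The ranker described above also stems; that step is omitted here.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Raw term-frequency vector for a list of tokens.
function termFreq(tokens: string[]): Map<string, number> {
  const tf = new Map<string, number>();
  for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
  return tf;
}

// Cosine similarity between two sparse term-frequency vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, w] of a) {
    dot += w * (b.get(term) ?? 0);
    normA += w * w;
  }
  for (const w of b.values()) normB += w * w;
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Keep the first passage (the lede) as an anchor, add the best-scoring passages
// that still fit the character budget, then emit everything in document order.
function selectByQuery(text: string, query: string, budget: number): string {
  if (text.length <= budget) return text;
  // Assumption: passages are separated by blank lines.
  const passages = text.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
  const queryVec = termFreq(tokenize(query));
  const lede = (passages[0] ?? "").slice(0, budget);
  const kept = [{ index: 0, text: lede }];
  let used = lede.length;
  const ranked = passages
    .map((p, index) => ({ index, text: p, score: cosine(queryVec, termFreq(tokenize(p))) }))
    .slice(1) // the lede is already kept
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score);
  for (const c of ranked) {
    const cost = c.text.length + 2; // 2 = length of the "\n\n" joiner
    if (used + cost > budget) continue; // too big; a smaller passage may still fit
    kept.push({ index: c.index, text: c.text });
    used += cost;
  }
  return kept
    .sort((a, b) => a.index - b.index)
    .map((k) => k.text)
    .join("\n\n");
}
```

Because the ranking is plain term counting, a call such as `selectByQuery(pageText, "widget torque limit", 15000)` costs no model invocation, which is the property the commit message emphasizes.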
Reviewed Hermes's PR #54843 (feat(web_extract): truncate-and-store instead of LLM summarization) and built something semantically better for QodeX.

What Hermes did

They dropped a per-page LLM summarizer (Gemini-class, run on every page >5k chars) in favor of positional truncate-and-store: pages over a 15k-char budget return a head+tail window (~75/25) plus a pointer to the full text on disk. Their reported result: 11.7× faster, ~23× cheaper, quality unchanged.
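To make the positional window concrete, here is a minimal sketch of a head+tail cut. It assumes the ~75/25 split is applied to whatever room is left after an omission marker; the function name, the marker text, and the rounding are all hypothetical and not taken from either codebase.

```typescript
// Illustrative sketch only: hypothetical helper, not Hermes's or QodeX's implementation.
function headTailSketch(text: string, budget: number, headShare = 0.75): string {
  if (text.length <= budget) return text;
  const marker = "\n[... middle omitted ...]\n"; // assumed marker text
  // With a budget too small to hold the marker, fall back to a plain head cut.
  if (budget <= marker.length) return text.slice(0, budget);
  const room = budget - marker.length;
  const headLen = Math.floor(room * headShare); // ~75% of the room goes to the head
  const tailLen = room - headLen; // the remainder goes to the tail
  return text.slice(0, headLen) + marker + text.slice(text.length - tailLen);
}
```

Anything that sits between the head and the tail is dropped regardless of its relevance, which is why an answer in the middle of a long page never reaches the caller in the first response.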
Their tradeoff: positional selection. Their own eval was 3/4 on "answer present in the returned window": when the relevant passage is in the dropped middle, the agent must make an extra `read_file` round-trip to recover it. (QodeX's old `web_fetch` was worse: a plain `slice(0, maxChars)` head cut, no tail, no store, no recovery.)

What we did: keep their win, beat it on quality
`web_fetch` gains a `query` param (what you're looking for). For an oversized page:
- Rank passages by the same stemmed TF-cosine as `recall_approach` (feat(recall): MMR diversity in recall_approach #78 / feat(recall): semantic stemming — paginate↔pagination, authenticate↔authentication #85), keep the title/lede as an anchor, and return the most relevant passages in document order. The mid-document answer that costs Hermes a round-trip comes back in the first call.
- No `query` → the identical head+tail window (a unit test asserts byte-equality).
- The full clean text is stored under `~/.qodex/cache/web/…`, with a footer giving the exact `read_file path=… offset=… limit=…` to page the rest.
- Base64 images become `[IMAGE: alt]` placeholders; real `http(s)` image links are preserved (mirrors their image fix).

The headline test proves it: on a page whose answer sits in the middle, `head-tail` mode omits it while `semantic` mode returns it, at the same budget.

- `src/tools/web/extract-select.ts`: PURE core, 7 unit tests
- `src/tools/web/web-fetch.ts`: `query` param, store-to-disk, recovery footer, base64 handling

Full suite 1474 green, tsc clean.
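The base64 handling can be illustrated with a short sketch. It assumes the cleaned page text uses Markdown image syntax, which the description does not state; the function name, the regular expression, and the `untitled` fallback for an empty alt text are hypothetical.

```typescript
// Illustrative sketch only: hypothetical helper, assuming Markdown-style images in the page text.
// Matches ![alt](data:image/<type>;base64,<payload>) and captures the alt text.
const DATA_IMAGE = /!\[([^\]]*)\]\(\s*data:image\/[^;)]+;base64,[^)]*\)/g;

function stripBase64ImagesSketch(markdown: string): string {
  // Inline base64 payloads are replaced by a short placeholder.
  // Images that point at http(s) URLs do not match and are left untouched.
  return markdown.replace(
    DATA_IMAGE,
    (_match: string, alt: string) => `[IMAGE: ${alt.trim() || "untitled"}]`,
  );
}
```

The placeholder keeps the alt text, so the reader still learns that an image was present without paying for its encoded bytes.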