feat(web_fetch): query-aware semantic extraction — beat Hermes's positional truncate-and-store#90

Merged
QodeXcli merged 1 commit into main from feat/web-fetch-semantic-extract on Jul 1, 2026

@QodeXcli QodeXcli commented Jul 1, 2026

Owner

Reviewed Hermes's PR #54843 (feat(web_extract): truncate-and-store instead of LLM summarization) and built something semantically better for QodeX.

What Hermes did

They dropped a per-page LLM summarizer (Gemini-class, run on every page >5k chars) in favor of positional truncate-and-store: pages over a 15k char budget return a head+tail window (~75/25) plus a pointer to the full text on disk. Result: 11.7× faster, ~23× cheaper, quality unchanged.
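For reference, a positional head+tail window of the kind described above can be sketched in a few lines. This is a minimal illustration, not Hermes's code or ours: the function signature, the 0.75 head ratio default, and the omission marker text are all assumptions made for the example.

```typescript
// Hypothetical sketch of a positional head+tail window.
// The signature, defaults, and marker text are assumptions for illustration.
function headTailWindow(text: string, budget = 15000, headRatio = 0.75): string {
  // Pages within budget pass through untouched.
  if (text.length <= budget) return text;

  const headLen = Math.floor(budget * headRatio); // roughly 75% from the top
  const tailLen = budget - headLen;               // remainder from the bottom
  const omitted = text.length - budget;           // size of the dropped middle

  const head = text.slice(0, headLen);
  const tail = text.slice(text.length - tailLen);

  // The marker is extra text on top of the budget; a stricter version
  // would subtract its length before slicing.
  return `${head}\n[... ${omitted} chars omitted ...]\n${tail}`;
}
```

Anything that falls between the head and the tail is simply not returned, which is the failure mode discussed next.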

Their tradeoff: positional selection. Their own eval was 3/4 on "answer present in the returned window" — when the relevant passage is in the dropped middle, the agent must make an extra read_file round-trip to recover it. (QodeX's old web_fetch was worse — a plain slice(0, maxChars) head cut, no tail, no store, no recovery.)

What we did — keep their win, beat it on quality

web_fetch gains a query param (what you're looking for). For an oversized page:

  • Semantic passage selection: rank passages by the same stemmed TF-cosine we already ship for recall_approach (#78, MMR diversity in recall_approach; #85, semantic stemming: paginate↔pagination, authenticate↔authentication), keep the title/lede as an anchor, and return the most relevant passages in document order. The mid-document answer that costs Hermes a round-trip comes back in the first call.
  • Never worse than Hermes: no query → the identical head+tail window (a unit test asserts byte-equality).
  • Store-and-recover: full clean text written to ~/.qodex/cache/web/…, with a footer giving the exact read_file path=… offset=… limit=… to page the rest.
  • Base64 image token-bombs → [IMAGE: alt] placeholders; real http(s) image links preserved (mirrors their image fix).
  • Zero added LLM cost: the ranker is pure lexical math (microseconds), so we keep their speed + cost win and add the relevance they dropped.
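To make the selection step concrete, here is a rough sketch of query-aware passage ranking. It is not the shipped extract-select.ts: the suffix-stripping stemmer is a crude stand-in for the real one, and the blank-line passage splitter, the greedy packing, and the inline head+tail fallback are assumptions chosen to keep the example short. Only the names selectRelevantPassages and splitPassages and the {query, budget} options come from this PR.

```typescript
type SelectOptions = { query?: string; budget: number };

// Crude suffix stripper standing in for the real stemmer (assumption).
function stem(word: string): string {
  return word.length > 5 ? word.replace(/(ation|ate|ing|ed|es|s)$/, "") : word;
}

// Term-frequency vector over stemmed, lowercased tokens.
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const w of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    const s = stem(w);
    tf.set(s, (tf.get(s) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two term-frequency vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, normA = 0, normB = 0;
  a.forEach((v, t) => { dot += v * (b.get(t) ?? 0); normA += v * v; });
  b.forEach((v) => { normB += v * v; });
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}

// Assumed splitter: passages are blank-line-separated blocks.
function splitPassages(text: string): string[] {
  return text.split(/\n\s*\n/).map((p) => p.trim()).filter((p) => p.length > 0);
}

function selectRelevantPassages(text: string, { query, budget }: SelectOptions): string {
  if (text.length <= budget) return text;

  // No query: fall back to a positional head+tail window (inlined here).
  if (!query) {
    const headLen = Math.floor(budget * 0.75);
    return text.slice(0, headLen) + "\n[...]\n" + text.slice(text.length - (budget - headLen));
  }

  const passages = splitPassages(text);
  if (passages.length === 0) return "";
  passages[0] = passages[0].slice(0, budget); // the lede anchor must itself fit

  const q = termFreq(query);
  const keep = new Set<number>([0]); // always keep the lede as an anchor
  let used = passages[0].length;

  // Rank the remaining passages by similarity to the query, best first.
  const ranked = passages
    .map((p, i) => ({ i, score: cosine(q, termFreq(p)) }))
    .filter((r) => r.i !== 0 && r.score > 0)
    .sort((x, y) => y.score - x.score);

  // Greedily pack the best passages that still fit (2 chars per separator).
  for (const r of ranked) {
    const cost = passages[r.i].length + 2;
    if (used + cost > budget) continue;
    keep.add(r.i);
    used += cost;
  }

  // Emit the kept passages in their original document order.
  return passages.filter((_, i) => keep.has(i)).join("\n\n");
}
```

With a query such as "how to paginate", a passage mentioning "Pagination" in the middle of a long page scores above zero because both words stem to the same token, so it is returned alongside the lede instead of being cut with the dropped middle.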

The headline test proves it: on a page whose answer sits in the middle, head-tail mode omits it while semantic mode returns it — same budget.

  • src/tools/web/extract-select.ts: PURE core, 7 unit tests
  • src/tools/web/web-fetch.ts: query param, store-to-disk, recovery footer, base64 handling
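The base64 handling mentioned above can be illustrated roughly as follows. This sketch assumes the page text uses Markdown image syntax and that a missing alt text falls back to the word "image"; the regex and that fallback are assumptions for the example, not the code that shipped.

```typescript
// Hypothetical sketch: replace inline base64 images with a short placeholder
// while leaving ordinary http(s) image links untouched.
function stripBase64Images(markdown: string): string {
  // Matches ![alt](data:image/<type>;base64,<payload>) only.
  const dataImage = /!\[([^\]]*)\]\(\s*data:image\/[a-zA-Z0-9.+-]+;base64,[^)]*\)/g;
  return markdown.replace(dataImage, (_match: string, alt: string) => {
    const label = alt.trim() || "image"; // fallback label is an assumption
    return `[IMAGE: ${label}]`;
  });
}
```

Because the pattern only matches data: URIs, a link like `![chart](https://example.com/chart.png)` passes through unchanged.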

Full suite 1474 green, tsc clean.

…tional truncate-and-store

Hermes PR #54843 replaced a per-page LLM summarizer with head+tail positional truncation
(fast + cheap, but their own eval was 3/4 on "answer in the returned window" — a mid-document
answer forces an extra read_file round-trip). We keep the no-LLM win and beat it on quality.

- extract-select.ts (PURE): selectRelevantPassages(text, {query, budget}) ranks passages by the
  SAME stemmed TF-cosine we ship for recall, keeps the lede as an anchor, and returns the most
  relevant passages in document order — so a mid-page answer comes back in ONE call. No query →
  identical head+tail window, so we're never worse. Plus splitPassages, headTailWindow,
  stripBase64Images (drop base64 token-bombs, keep http image links — mirrors their image fix).
- web_fetch: new `query` param; oversized pages store full clean text to ~/.qodex/cache/web and
  return the relevant slice + a recovery footer (read_file path/offset/limit) so nothing is lost.
  Zero added LLM cost — the ranker is pure lexical math.
@QodeXcli QodeXcli merged commit 49162dd into main Jul 1, 2026
2 checks passed
@QodeXcli QodeXcli deleted the feat/web-fetch-semantic-extract branch July 1, 2026 02:32