Detect identifiers embedded in prose — adaptive fusion now generalizes #42
Merged
External validation (pydantic) showed the first adaptive-fusion classifier mis-routed the common real case: a prose query that *names* a symbol (e.g. a commit message "Fix tuple order in AliasGenerator.generate_aliases()") was treated as natural language and leaned dense. That is exactly backwards, since the discriminating signal is the exact identifier BM25 matches.

Replace the shape/length heuristic with references_identifier(): detect a specific identifier token anywhere in the query (backtick spans, calls foo(, dotted Foo.bar, snake_case, camelCase), with e.g./i.e./version-number guards. A query that names an identifier now routes to the neutral (1:1) code weights instead of dense-leaning, so adaptive fusion falls back to plain hybrid whenever a symbol is mentioned and can only help pure-NL queries.

Re-validated on CodeRAG (symbol level), now a clear Pareto win over both hybrid and dense on both query styles:

  NL set:         hybrid 0.581 / dense 0.675 -> adaptive 0.706 MRR
  identifier set: hybrid 0.685 / dense 0.686 -> adaptive 0.715 MRR

looks_like_identifier kept as a back-compat alias.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Re-validated the smarter classifier on a larger pydantic index (172 cases, 22,071-chunk corpus): adaptive went from 0.286 (a regression vs hybrid 0.361 with the old shape-based classifier) to 0.458, equal to hybrid (no regression), because identifier-naming queries now route to neutral weights. It still beats hybrid on CodeRAG (NL 0.706 vs 0.581; identifier 0.715 vs 0.685). Adaptive fusion is now a Pareto win over fixed 1:1 hybrid across both repos.

Updates the eval.md caveat and the external-validation write-up accordingly (the "make the classifier smarter" follow-up is now done). Still off by default pending a multi-repo sweep, but a strong default-on candidate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
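For readers comparing the MRR figures quoted in these commits, the metric can be sketched as below. This assumes the standard definition of mean reciprocal rank; the eval harness used in this PR is not shown here, so the function names and the input shape (a ranked list plus a set of relevant ids per query) are hypothetical.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant hit, or 0.0 if nothing relevant was retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average the reciprocal rank over all (ranked results, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical example: three queries with the hit at rank 1, rank 2, and missing.
example_runs = [
    (["a", "b"], {"a"}),
    (["a", "b"], {"b"}),
    (["a"], {"z"}),
]
example_mrr = mean_reciprocal_rank(example_runs)
```

Under this definition a move from 0.286 to 0.458 means the first relevant chunk sits, on average, noticeably higher in the ranked list.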
CodeQL flagged the identifier regex as polynomial-ReDoS: the snake_case branch [A-Za-z]\w*_\w+ is ambiguous (underscore is in \w), so a crafted query caused quadratic backtracking (measured 12ms -> 47ms -> 187ms -> 736ms as n doubled). The query string flows in from the HTTP API, so this is API-reachable.

Replace the single backtracking regex with linear token scanning: two disjoint-class regexes (`backtick` span, word-then-paren) plus per-token plain-Python checks for snake_case / dotted-path / camelCase. Same detection behavior (all routing tests unchanged); timing is now linear (64x input -> ~56x time). Adds a regression test on a 200k-char adversarial input.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
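The linear token scanning described in this commit could look roughly like the sketch below. It is not the merged implementation: the regexes, the helper names, the punctuation set, and in particular the dense-leaning weight values are assumptions made for illustration. Only the name references_identifier(), the list of detected shapes, the guards, and the neutral 1:1 weights come from the PR text.

```python
import re

# Two regexes over disjoint character classes, so a match attempt has only one
# way to consume each character: a backtick span, and a word followed by "(".
BACKTICK_SPAN = re.compile(r"`[^`\n]+`")
CALL = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\(")

PROSE_ABBREVIATIONS = {"e.g", "i.e"}   # guard: "e.g." must not count as a dotted path
EDGE_PUNCTUATION = ".,;:!?\"'()[]{}<>"

NEUTRAL_WEIGHTS = (1.0, 1.0)           # (bm25, dense): plain 1:1 hybrid
DENSE_LEANING_WEIGHTS = (1.0, 2.0)     # hypothetical values, not taken from the PR


def _is_version(token: str) -> bool:
    """Guard: tokens such as 1.2.3 or v2.0 are version numbers, not identifiers."""
    return token.lstrip("vV").replace(".", "").isdigit()


def _token_names_identifier(token: str) -> bool:
    """Plain-Python checks on one whitespace-delimited token (no regex backtracking)."""
    token = token.strip(EDGE_PUNCTUATION)
    if not token or token.lower() in PROSE_ABBREVIATIONS or _is_version(token):
        return False
    if token.isidentifier() and "_" in token.strip("_"):       # snake_case
        return True
    parts = token.split(".")
    if len(parts) > 1 and all(p.isidentifier() for p in parts):  # dotted Foo.bar
        return True
    # camelCase: a lowercase letter directly followed by an uppercase letter
    return any(a.islower() and b.isupper() for a, b in zip(token, token[1:]))


def references_identifier(query: str) -> bool:
    """True if the query names a specific identifier anywhere in its text."""
    if BACKTICK_SPAN.search(query) or CALL.search(query):
        return True
    return any(_token_names_identifier(t) for t in query.split())


def fusion_weights(query: str) -> tuple:
    """Route identifier-naming queries to neutral weights, pure NL to dense-leaning."""
    return NEUTRAL_WEIGHTS if references_identifier(query) else DENSE_LEANING_WEIGHTS
```

With this sketch, the commit-message query "Fix tuple order in AliasGenerator.generate_aliases()" is detected through the word-then-paren regex and gets the neutral weights, while a query with no symbol in it keeps the dense-leaning weights. The per-token checks do a fixed number of passes over each token, which is the property the regression test on a long adversarial input is meant to protect.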
A 4-repo sweep (627 git-mined cases) showed adaptive fusion is NOT an aggregate win (hybrid 0.442 vs adaptive 0.423 MRR): the big CodeRAG-curated gain was an artifact of dense-friendly clean-NL queries. The classifier fix still matters: it removed the catastrophic regression (pydantic 0.286 -> 0.458), making adaptive a safe opt-in. But it is not a default-on candidate; fixed 1:1 hybrid stays the default.

Corrects the earlier "Pareto win / strong default-on candidate" framing in eval.md and external-validation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Why
The external-repo validation (#40) found that query-type adaptive fusion (#39) did not generalize: on pydantic's commit-message queries it hurt (MRR 0.286 vs hybrid 0.361). Root cause: the classifier keyed on query shape (short + code-looking) and mis-read a prose query that names a symbol (e.g. "Fix tuple order in AliasGenerator.generate_aliases()") as natural language, leaning dense when the discriminating signal was the exact identifier BM25 matches.

What
Replace the shape/length heuristic with references_identifier(): detect a specific identifier token anywhere in the query (backtick spans, calls foo(, dotted Foo.bar, snake_case, camelCase), with e.g./i.e./version-number guards. A query that names an identifier now routes to the neutral (1:1) code weights, so adaptive fusion falls back to plain hybrid whenever a symbol is mentioned. It can only help pure-NL queries and never repeats the regression. looks_like_identifier kept as a back-compat alias.

Re-validated: now a Pareto win across both repos
Adaptive is now never worse than hybrid across two very different repos and clearly better on CodeRAG. (Bonus: on the larger pydantic corpus hybrid beats both single modalities — 0.458 vs 0.398 vs 0.328 — reinforcing 1:1 hybrid as the robust base.)
It stays off by default pending a multi-repo sweep, but is now a strong default-on candidate. Docs (eval.md, external-validation.md) updated to record the fix.

Testing
New tests for embedded-identifier detection and the e.g./version false-positive guards; existing routing tests still pass. ruff + mypy clean.

🤖 Generated with Claude Code