Skip to content

Detect identifiers embedded in prose — adaptive fusion now generalizes#42

Merged
Neverdecel merged 4 commits into
masterfrom
claude/smarter-identifier-detection
Jun 17, 2026
Merged

Detect identifiers embedded in prose — adaptive fusion now generalizes#42
Neverdecel merged 4 commits into
masterfrom
claude/smarter-identifier-detection

Conversation

@Neverdecel

Copy link
Copy Markdown
Owner

Why

The external-repo validation (#40) found that query-type adaptive fusion (#39) did not generalize: on pydantic's commit-message queries it hurt (MRR 0.286 vs hybrid 0.361). Root cause — the classifier keyed on query shape (short + code-looking) and mis-read a prose query that names a symbol (e.g. "Fix tuple order in AliasGenerator.generate_aliases()") as natural language, leaning dense when the discriminating signal was the exact identifier BM25 matches.

What

Replace the shape/length heuristic with references_identifier() — detect a specific identifier token anywhere in the query (backtick spans, calls foo(, dotted Foo.bar, snake_case, camelCase), with e.g./i.e./version-number guards. A query that names an identifier now routes to the neutral (1:1) code weights, so adaptive fusion falls back to plain hybrid whenever a symbol is mentioned — it can only help pure-NL queries, never repeat the regression. looks_like_identifier kept as a back-compat alias.

Re-validated — now a Pareto win across both repos

CodeRAG, symbol level                  hybrid  dense  adaptive
  natural-language queries (MRR)        0.581   0.675   0.706   ← beats both
  identifier queries (MRR)              0.685   0.686   0.715   ← beats both

pydantic, symbol level (172 cases, 22 071-chunk corpus)
  dense 0.328 · bm25 0.398 · hybrid 0.458 · adaptive 0.458     ← = hybrid (was 0.286, a regression)

Adaptive is now never worse than hybrid across two very different repos and clearly better on CodeRAG. (Bonus: on the larger pydantic corpus hybrid beats both single modalities — 0.458 vs 0.398 vs 0.328 — reinforcing 1:1 hybrid as the robust base.)

It stays off by default pending a multi-repo sweep, but is now a strong default-on candidate. Docs (eval.md, external-validation.md) updated to record the fix.

Testing

New tests for embedded-identifier detection and the e.g./version false-positive guards; existing routing tests still pass. ruff + mypy clean.

🤖 Generated with Claude Code


Generated by Claude Code

claude added 2 commits June 17, 2026 16:40
External validation (pydantic) showed the first adaptive-fusion classifier
mis-routed the common real case: a prose query that *names* a symbol
(e.g. a commit message "Fix tuple order in AliasGenerator.generate_aliases()")
was treated as natural language and leaned dense — exactly backwards, since
the discriminating signal is the exact identifier BM25 matches.

Replace the shape/length heuristic with references_identifier(): detect a
specific identifier token anywhere in the query (backtick spans, calls foo(,
dotted Foo.bar, snake_case, camelCase), with e.g./i.e./version-number guards.
A query that names an identifier now routes to the neutral (1:1) code weights
instead of dense-leaning — so adaptive fusion falls back to plain hybrid
whenever a symbol is mentioned and can only help pure-NL queries.

Re-validated on CodeRAG (symbol level), now a clear Pareto win over both
hybrid and dense on both query styles:
  NL set:         hybrid 0.581 / dense 0.675 -> adaptive 0.706 MRR
  identifier set: hybrid 0.685 / dense 0.686 -> adaptive 0.715 MRR

looks_like_identifier kept as a back-compat alias.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Re-validated the smarter classifier on a larger pydantic index (172 cases,
22,071-chunk corpus): adaptive went from 0.286 (regression vs hybrid 0.361
with the old shape-based classifier) to 0.458 = hybrid — no regression —
because identifier-naming queries now route to neutral weights. It still
beats hybrid on CodeRAG (NL 0.706 vs 0.581; identifier 0.715 vs 0.685).

Adaptive fusion is now a Pareto win over fixed 1:1 hybrid across both repos.
Updates the eval.md caveat and the external-validation write-up accordingly
(the "make the classifier smarter" follow-up is now done). Still off by
default pending a multi-repo sweep, but a strong default-on candidate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Comment thread coderag/retrieval/query_type.py Fixed
CodeQL flagged the identifier regex as polynomial-ReDoS: the snake_case branch
[A-Za-z]\w*_\w+ is ambiguous (underscore is in \w), so a crafted query caused
quadratic backtracking (measured 12ms→47ms→187ms→736ms as n doubled). The
query string flows in from the HTTP API, so this is API-reachable.

Replace the single backtracking regex with linear token scanning: two
disjoint-class regexes (`backtick` span, word-then-paren) plus per-token
plain-Python checks for snake_case / dotted-path / camelCase. Same detection
behavior (all routing tests unchanged); timing is now linear (64x input ->
~56x time). Adds a regression test on a 200k-char adversarial input.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
A 4-repo sweep (627 git-mined cases) showed adaptive fusion is NOT an
aggregate win (hybrid 0.442 vs adaptive 0.423 MRR) — the big CodeRAG-curated
gain was an artifact of dense-friendly clean-NL queries. The classifier fix
still matters: it removed the catastrophic regression (pydantic 0.286->0.458),
making adaptive a safe opt-in. But it is not a default-on candidate; fixed 1:1
hybrid stays the default. Corrects the earlier "Pareto win / strong default-on
candidate" framing in eval.md and external-validation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
@Neverdecel Neverdecel merged commit c53c862 into master Jun 17, 2026
12 checks passed
@Neverdecel Neverdecel deleted the claude/smarter-identifier-detection branch June 18, 2026 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants