Detect identifiers embedded in prose — adaptive fusion now generalizes by Neverdecel · Pull Request #42 · Neverdecel/CodeRAG

Neverdecel · 2026-06-17T16:41:59Z

Why

The external-repo validation (#40) found that query-type adaptive fusion (#39) did not generalize: on pydantic's commit-message queries it hurt (MRR 0.286 vs hybrid 0.361). Root cause — the classifier keyed on query shape (short + code-looking) and mis-read a prose query that names a symbol (e.g. "Fix tuple order in AliasGenerator.generate_aliases()") as natural language, leaning dense when the discriminating signal was the exact identifier BM25 matches.

What

Replace the shape/length heuristic with references_identifier() — detect a specific identifier token anywhere in the query (backtick spans, calls foo(, dotted Foo.bar, snake_case, camelCase), with e.g./i.e./version-number guards. A query that names an identifier now routes to the neutral (1:1) code weights, so adaptive fusion falls back to plain hybrid whenever a symbol is mentioned — it can only help pure-NL queries, never repeat the regression. looks_like_identifier kept as a back-compat alias.

Re-validated — now a Pareto win across both repos

CodeRAG, symbol level                  hybrid  dense  adaptive
  natural-language queries (MRR)        0.581   0.675   0.706   ← beats both
  identifier queries (MRR)              0.685   0.686   0.715   ← beats both

pydantic, symbol level (172 cases, 22 071-chunk corpus)
  dense 0.328 · bm25 0.398 · hybrid 0.458 · adaptive 0.458     ← = hybrid (was 0.286, a regression)

Adaptive is now never worse than hybrid across two very different repos and clearly better on CodeRAG. (Bonus: on the larger pydantic corpus hybrid beats both single modalities — 0.458 vs 0.398 vs 0.328 — reinforcing 1:1 hybrid as the robust base.)

It stays off by default pending a multi-repo sweep, but is now a strong default-on candidate. Docs (eval.md, external-validation.md) updated to record the fix.

Testing

New tests for embedded-identifier detection and the e.g./version false-positive guards; existing routing tests still pass. ruff + mypy clean.

🤖 Generated with Claude Code

Generated by Claude Code

External validation (pydantic) showed the first adaptive-fusion classifier mis-routed the common real case: a prose query that *names* a symbol (e.g. a commit message "Fix tuple order in AliasGenerator.generate_aliases()") was treated as natural language and leaned dense — exactly backwards, since the discriminating signal is the exact identifier BM25 matches. Replace the shape/length heuristic with references_identifier(): detect a specific identifier token anywhere in the query (backtick spans, calls foo(, dotted Foo.bar, snake_case, camelCase), with e.g./i.e./version-number guards. A query that names an identifier now routes to the neutral (1:1) code weights instead of dense-leaning — so adaptive fusion falls back to plain hybrid whenever a symbol is mentioned and can only help pure-NL queries. Re-validated on CodeRAG (symbol level), now a clear Pareto win over both hybrid and dense on both query styles: NL set: hybrid 0.581 / dense 0.675 -> adaptive 0.706 MRR identifier set: hybrid 0.685 / dense 0.686 -> adaptive 0.715 MRR looks_like_identifier kept as a back-compat alias. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Re-validated the smarter classifier on a larger pydantic index (172 cases, 22,071-chunk corpus): adaptive went from 0.286 (regression vs hybrid 0.361 with the old shape-based classifier) to 0.458 = hybrid — no regression — because identifier-naming queries now route to neutral weights. It still beats hybrid on CodeRAG (NL 0.706 vs 0.581; identifier 0.715 vs 0.685). Adaptive fusion is now a Pareto win over fixed 1:1 hybrid across both repos. Updates the eval.md caveat and the external-validation write-up accordingly (the "make the classifier smarter" follow-up is now done). Still off by default pending a multi-repo sweep, but a strong default-on candidate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

CodeQL flagged the identifier regex as polynomial-ReDoS: the snake_case branch [A-Za-z]\w*_\w+ is ambiguous (underscore is in \w), so a crafted query caused quadratic backtracking (measured 12ms→47ms→187ms→736ms as n doubled). The query string flows in from the HTTP API, so this is API-reachable. Replace the single backtracking regex with linear token scanning: two disjoint-class regexes (`backtick` span, word-then-paren) plus per-token plain-Python checks for snake_case / dotted-path / camelCase. Same detection behavior (all routing tests unchanged); timing is now linear (64x input -> ~56x time). Adds a regression test on a 200k-char adversarial input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

A 4-repo sweep (627 git-mined cases) showed adaptive fusion is NOT an aggregate win (hybrid 0.442 vs adaptive 0.423 MRR) — the big CodeRAG-curated gain was an artifact of dense-friendly clean-NL queries. The classifier fix still matters: it removed the catastrophic regression (pydantic 0.286->0.458), making adaptive a safe opt-in. But it is not a default-on candidate; fixed 1:1 hybrid stays the default. Corrects the earlier "Pareto win / strong default-on candidate" framing in eval.md and external-validation.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

claude added 2 commits June 17, 2026 16:40

github-advanced-security AI found potential problems Jun 17, 2026

View reviewed changes

Comment thread coderag/retrieval/query_type.py Fixed

Neverdecel mentioned this pull request Jun 17, 2026

Multi-repo evaluation: macro-averaged generalization view #43

Merged

Neverdecel merged commit c53c862 into master Jun 17, 2026
12 checks passed

Neverdecel deleted the claude/smarter-identifier-detection branch June 18, 2026 08:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect identifiers embedded in prose — adaptive fusion now generalizes#42

Detect identifiers embedded in prose — adaptive fusion now generalizes#42
Neverdecel merged 4 commits into
masterfrom
claude/smarter-identifier-detection

Neverdecel commented Jun 17, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Neverdecel commented Jun 17, 2026

Why

What

Re-validated — now a Pareto win across both repos

Testing

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants