feat: Add multi-chunk file boosting and embedded CamelCase boosting#25

Pringled merged 33 commits into main from ideas on Apr 18, 2026

Pringled commented Apr 18, 2026

This PR introduces two new boosting mechanisms:

  • Multi-chunk file boost: files with multiple high-scoring chunks get a bonus on their best chunk to reward consistent relevance.
  • Embedded symbol boost: natural-language queries that contain embedded identifiers such as StateManager or beforeAll now also get a symbol-definition boost.

| Change | NDCG@10 | Δ |
|---|---|---|
| Baseline | 0.8323 | |
| + Multi-chunk file boost | 0.8471 | +0.015 |
| + Embedded CamelCase symbol boost | 0.8509 | +0.006 |

Pringled added 29 commits April 17, 2026 13:27
Aggregate candidate chunk scores per file and boost the top chunk from
files that appear multiple times in the candidate pool. Files with many
high-scoring chunks are stronger relevance signals than a single chunk
from an otherwise-irrelevant file.

Benchmarked on 66 repos (20 languages):
  Total NDCG@10: 0.832 → 0.844 (+0.012)
  architecture:  0.782 → 0.801 (+0.019)
  semantic:      0.831 → 0.834 (+0.003)
  symbol:        0.934 → 0.956 (+0.022)
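
The aggregation described in this commit can be sketched roughly as follows. This is an illustration, not the code from the PR: the function name `boost_multi_chunk_files`, the `(file_path, chunk_id)` key shape, the `boost_frac` default, and the bonus formula (a fraction of the file's other candidate scores) are all assumptions.

```python
from collections import defaultdict


def boost_multi_chunk_files(scores, boost_frac=0.3):
    """Return a copy of `scores` where the best chunk of each multi-chunk file gets a bonus.

    `scores` maps (file_path, chunk_id) -> candidate score.
    The bonus formula is an assumption made for this sketch.
    """
    # Group candidate chunks by the file they come from.
    by_file = defaultdict(list)
    for (path, chunk_id), score in scores.items():
        by_file[path].append((score, chunk_id))

    boosted = dict(scores)
    for path, chunks in by_file.items():
        if len(chunks) < 2:
            continue  # a single chunk carries no multi-chunk signal
        best_score, best_id = max(chunks)
        others = sum(score for score, _ in chunks) - best_score
        # Only the top chunk of the file is boosted; the rest keep their score.
        boosted[(path, best_id)] = best_score + boost_frac * others
    return boosted


candidates = {
    ("state.py", 0): 0.50,
    ("state.py", 1): 0.40,
    ("misc.py", 0): 0.55,
}
boosted = boost_multi_chunk_files(candidates)
```

In this toy pool the best chunk of `state.py` overtakes the lone `misc.py` chunk, because `state.py` appears twice among the candidates.
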
Add lightweight morphological stemming to _split_identifier: common
suffixes (s, es, er, ed, ing, tion, ity, ...) are stripped to generate
a stem variant that is added alongside the original token. This lets
plurals, gerunds, and nominalizations match across query/document
boundaries (colors↔color, utility↔util, serialization↔serial).

Benchmarked on 66 repos (20 languages):
  Total NDCG@10: 0.844 → 0.847 (+0.003)
  architecture:  0.801 → 0.805 (+0.004)
  semantic:      0.834 → 0.839 (+0.005)
  symbol:        0.956 → 0.952 (-0.004)
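
A rough sketch of this kind of suffix stripping, assuming a longest-first suffix list and a minimum stem length. The commit names only some suffixes (s, es, er, ed, ing, tion, ity, ...), so the entries `ization` and `ation`, the minimum length, and the helper names `stem_variant` and `expand_tokens` are assumptions chosen so that the three examples from the commit message work.

```python
# Assumed suffix list, longest first; only part of it is named in the commit.
_SUFFIXES = ("ization", "ation", "tion", "ing", "ity", "es", "ed", "er", "s")
_MIN_STEM_LEN = 3  # assumption: never strip a token down to fewer than 3 chars


def stem_variant(token):
    """Return a crude stem for `token`, or None if no suffix applies."""
    lower = token.lower()
    for suffix in _SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= _MIN_STEM_LEN:
            return lower[: -len(suffix)]
    return None


def expand_tokens(tokens):
    """Keep each original token and add its stem variant alongside it."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        stem = stem_variant(token)
        if stem is not None:
            expanded.append(stem)
    return expanded
```

With these assumptions, `colors`, `utility`, and `serialization` produce the stems `color`, `util`, and `serial`. Rule-based stripping like this also produces non-words (for example `running` becomes `runn`), which is one source of noise on code identifiers.
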
Sweep over boost multipliers on 66-repo benchmark finds:
  - _DEFINITION_BOOST_MULTIPLIER: 2.0 → 2.5 (+0.001 symbol/semantic)
  - _FILE_COHERENCE_BOOST_FRAC: 0.3 → 0.2 (tighter coherence signal)

Total NDCG@10: 0.847 → 0.848 (+0.001)
semantic: 0.839 → 0.845 (+0.006)

A smaller k (30 instead of 60) amplifies rank differences in RRF fusion, boosting semantic and
architecture retrieval. Net gain: NDCG@10 0.848 → 0.850 on full 66-repo benchmark
(architecture +0.000, semantic +0.005, symbol -0.007).
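
To see why a smaller k amplifies rank differences, here is a sketch of the standard reciprocal rank fusion formula, where each document scores the sum of 1 / (k + rank) over the lists it appears in. The function names are hypothetical, and the PR's own fusion code may differ in details such as rank offsets or weights.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


def top_vs_tenth(k):
    """Ratio between the contribution of rank 1 and rank 10 for a given k."""
    return (1.0 / (k + 1)) / (1.0 / (k + 10))
```

`top_vs_tenth(60)` is roughly 1.15 while `top_vs_tenth(30)` is roughly 1.29, so with k=30 a first-place hit outweighs a tenth-place hit more strongly than with k=60.
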
Elixir's defmodule was not matched by the existing 'module' keyword because
defmodule is a macro prefix, not a bare keyword. Adding it recovers symbol
ranking for Elixir modules. Also allow optional namespace prefix in definition
patterns (e.g. defmodule Phoenix.Router matches query 'Router').

NDCG@10: 0.850 -> 0.851, symbol: 0.946 -> 0.949.
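
A definition pattern with an optional namespace prefix might look like the sketch below. The keyword tuple and the `definition_pattern` helper are hypothetical; the commit only confirms that `module` already existed and that `defmodule` was added.

```python
import re

# Hypothetical keyword list; only 'module' and 'defmodule' come from the commit.
_DEFINITION_KEYWORDS = ("defmodule", "module", "class", "def")


def definition_pattern(symbol):
    """Build a regex matching `<keyword> [Namespace.]*<symbol>`."""
    keywords = "|".join(_DEFINITION_KEYWORDS)
    # (?:\w+\.)* allows zero or more dotted namespace segments before the symbol.
    return re.compile(rf"\b(?:{keywords})\s+(?:\w+\.)*{re.escape(symbol)}\b")


router = definition_pattern("Router")
```

Here `router.search("defmodule Phoenix.Router do")` matches, while `defmodule Phoenix.RouterHelper do` does not, because the trailing word boundary requires the symbol to end where the query symbol ends.
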
Files like 'requests.py' define 'Request' but weren't reached by the
non-candidate stem scan because stem('requests') != 'request'. Fix by
also matching when stem.rstrip('s') == symbol_lower, so 'requests.py'
matches symbol query 'Request'.

Combined with defmodule/namespace fix: symbol NDCG@10 0.946 -> 0.952.
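
The plural-filename check can be sketched as below, assuming the file stem is compared case-insensitively against the symbol. The helper name `file_stem_matches_symbol` is hypothetical.

```python
from pathlib import Path


def file_stem_matches_symbol(path, symbol):
    """True if the file stem equals the symbol, ignoring case and trailing 's'."""
    stem = Path(path).stem.lower()
    symbol_lower = symbol.lower()
    # rstrip('s') removes every trailing 's', so 'requests' becomes 'request'.
    return stem == symbol_lower or stem.rstrip("s") == symbol_lower
```

Note that `rstrip("s")` strips all trailing `s` characters, not just one, so a stem such as `class` is reduced to `cla`. That is harmless for this equality check but worth keeping in mind if the helper is reused elsewhere.
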
Extract class/function/def/etc. names from each chunk and append them to
the BM25 content. This ensures symbol queries match chunks that define
those symbols even when BM25 tokenisation would otherwise miss the name
at ranking time.

Symbol NDCG@10 recovers to 0.953 (pre-RRF-tuning level), confirming the
defmodule + plural-stem + name-enrichment changes together absorb the symbol
regression from RRF k=30. Overall: 0.851 (architecture=0.801, semantic=0.850,
symbol=0.953).
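
The name enrichment might be sketched like this. The keyword set in the regex and the `enrich_for_bm25` helper are assumptions; the commit only says that class/function/def/etc. names are extracted and appended to the BM25 content.

```python
import re

# Assumed keyword set; the commit says "class/function/def/etc.".
_DEF_NAME_RE = re.compile(r"\b(?:class|function|def)\s+([A-Za-z_]\w*)")


def enrich_for_bm25(chunk_text):
    """Append defined names to the text that is handed to the BM25 tokenizer."""
    names = _DEF_NAME_RE.findall(chunk_text)
    if not names:
        return chunk_text
    # The names are added as plain whitespace-separated tokens on a new line.
    return chunk_text + "\n" + " ".join(names)


chunk = "class StateManager:\n    def reset(self): ..."
enriched = enrich_for_bm25(chunk)
```

For this chunk the enriched text ends with the extra line `StateManager reset`, so both names appear as standalone tokens in the indexed content.
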
Higher multiplier pushes definition chunks above non-defining candidates
more strongly. Symbol NDCG@10: 0.953 -> 0.954. Overall unchanged at 0.851.

Files named 'example[s].*' that aren't in an examples/ directory (already
penalized) can flood results for broad queries. Apply STRONG_PENALTY (0.3x)
to these files so genuinely relevant implementation files rank higher.

Fixes lazy.nvim 'config/loader' queries where example.lua incorrectly ranked
first. Overall NDCG@10 unchanged (lazy.nvim gains offset by rounding variance).

Natural language (semantic/architecture) queries benefit from a larger
candidate pool since the target may rank lower in individual retrievers
before boosting. Symbol queries already get strong BM25 signal so 5x is
sufficient.

NDCG@10: 0.851 -> 0.853 (architecture=0.804, semantic=0.852, symbol=0.954).

single_include/nlohmann/json.hpp is a 27k-line generated amalgam that
was outranking the real source files. Apply _STRONG_PENALTY (0.3x) to
any path containing single_include/. Full benchmark: 0.805 → 0.807
(arch +0.005, sem +0.004, sym +0.001).
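
A path-based penalty of this kind can be sketched as a score multiplier. The `apply_path_penalty` helper and the substring check are assumptions; the `_STRONG_PENALTY` name and its 0.3x value come from the commit message.

```python
_STRONG_PENALTY = 0.3  # multiplier named in the commit message


def apply_path_penalty(path, score):
    """Scale down the score of chunks from amalgamated single-header paths."""
    # Normalize Windows separators so the substring check works on any OS.
    normalized = path.replace("\\", "/")
    if "single_include/" in normalized:
        return score * _STRONG_PENALTY
    return score
```

A multiplier keeps the penalized file in the result list, so it can still surface when nothing else matches, but real source files with comparable scores rank above it.
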
Directories like example_dart/ and example_flutter_app/ (dio repo) were
ranking above real source files. Extend _EXAMPLES_DIR_RE to also match
these compound example directory names while avoiding false positives on
filenames like example_code.py (requires trailing /). Full benchmark:
0.807 → 0.808.
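
One way to match compound example directories while requiring a trailing slash is sketched below. The pattern is a hypothetical stand-in and is not the actual `_EXAMPLES_DIR_RE` from the PR, whose definition is not shown here.

```python
import re

# Hypothetical stand-in: a path segment named example, examples, or
# example_<something>, which must be followed by a slash (i.e. a directory).
_EXAMPLES_DIR_SKETCH_RE = re.compile(r"(?:^|/)examples?(?:_\w+)?/")


def in_examples_dir(path):
    """True if any directory segment of `path` looks like an examples directory."""
    return _EXAMPLES_DIR_SKETCH_RE.search(path) is not None
```

Under this pattern `example_dart/lib/main.dart` and `dio/example_flutter_app/lib/main.dart` match, while `src/example_code.py` does not, because `example_code` is followed by `.py` rather than `/`.
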
website/ directories contain documentation site source code (Docusaurus,
etc.) that is unrelated to library implementation. These were showing up
in top-10 results for riverpod queries. Apply _STRONG_PENALTY (0.3x).
No annotation targets fall in website/ dirs. Overall: ~0.808 → ~0.809.

Directories like deps/ contain vendored third-party libraries (e.g.
jemalloc in redis) that should not rank above the project's own source
files. Apply _STRONG_PENALTY (0.3x). No annotation targets in deps/.
Score is stable at ~0.808.

Remove BM25 suffix stemming (tokens.py): 45 LOC, gain indistinguishable
from noise, rule-based stemmers on code identifiers are noisy across 20
languages.

Revert RRF k=30 (search.py) and BM25 def-name enrichment (sparse.py):
these were coupled — k=30 alone regressed symbol, enrichment was added
to cancel that regression. Net +0.001 for ~30 LOC. Keep k=60.

Remove example file, example_dart/, website/, deps/ penalties
(penalties.py): all zero measured gain, several carry false-positive
risk on repos outside the benchmark.

Keep: single_include/ penalty (+0.002, real C/C++ pattern).

For queries like 'how the StateManager tracks state', extract embedded
CamelCase/camelCase identifiers and apply a symbol-definition boost at
half the strength of pure symbol queries.

Non-candidate scan uses prefix matching (min 4-char stem) so e.g.
state.ts is found for symbol StateManager even when it doesn't rank in
the initial candidate pool.

Benchmark: 0.844 -> 0.850 (+0.006, above ±0.003 noise floor)
  architecture: 0.801 -> 0.812
  typescript:   0.699 -> 0.710
  csharp:       0.851 -> 0.878
  rust:         0.844 -> 0.857
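
The extraction and prefix-matching steps described above could look roughly like this. The regex, the helper names, and the direction of the prefix test (file stem as a prefix of the symbol) are assumptions inferred from the `state.ts` / `StateManager` example; only the 4-character minimum comes from the commit message.

```python
import re
from pathlib import Path

# Assumed pattern: identifiers with at least two "humps", such as
# StateManager (CamelCase) or beforeAll (camelCase).
_EMBEDDED_SYMBOL_RE = re.compile(
    r"\b(?:[A-Z][a-z0-9]+|[a-z][a-z0-9]*)(?:[A-Z][a-z0-9]*)+\b"
)
_MIN_PREFIX_LEN = 4  # "min 4-char stem" from the commit message


def extract_embedded_symbols(query):
    """Return CamelCase/camelCase identifiers embedded in a natural-language query."""
    return _EMBEDDED_SYMBOL_RE.findall(query)


def stem_prefix_matches(path, symbol):
    """True if the file stem (at least 4 chars) is a prefix of the symbol, ignoring case."""
    stem = Path(path).stem.lower()
    return len(stem) >= _MIN_PREFIX_LEN and symbol.lower().startswith(stem)
```

For the query `how the StateManager tracks state` this yields `["StateManager"]`, and `src/state.ts` passes the prefix test for that symbol. The pattern deliberately ignores capitalized ordinary words such as `How`, and as written it also skips acronym-led names such as `HTTPServer`.
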
…ALE constant, single-pass coherence, revert tokens.py
…bedded-symbol pass, walrus + docstring trims
…cstring, drop _stem_matches from _prefix_or_exact
Pringled changed the title from "feat: Add file coherence boosting and embedded CamelCase boosting" to "feat: Add multi-chunk file boosting and embedded CamelCase boosting" on Apr 18, 2026
Pringled merged commit 1ea77ad into main on Apr 18, 2026
8 checks passed
Pringled deleted the ideas branch on April 22, 2026 05:05