feat: Add multi-chunk file boosting and embedded CamelCase boosting by Pringled · Pull Request #25 · MinishLab/semble

Pringled · 2026-04-18T07:38:12Z

This PR introduces two new boosting mechanisms:

Multi-chunk file boost: files with multiple high-scoring chunks get a bonus on their best chunk to reward consistent relevance.
Embedded symbol boost: queries containing identifiers like StateManager or beforeAll now also get a boost.

Change	NDCG@10	Δ
Baseline	0.8323	—
+ Multi-chunk file boost	0.8471	+0.015
+ Embedded CamelCase symbol boost	0.8509	+0.006

Aggregate candidate chunk scores per file and boost the top chunk from files that appear multiple times in the candidate pool. Files with many high-scoring chunks are stronger relevance signals than a single chunk from an otherwise-irrelevant file.

Benchmarked on 66 repos (20 languages):
Total NDCG@10: 0.832 → 0.844 (+0.012)
architecture: 0.782 → 0.801 (+0.019)
semantic: 0.831 → 0.834 (+0.003)
symbol: 0.934 → 0.956 (+0.022)
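A minimal sketch of the per-file aggregation, assuming candidates arrive as `(file_path, chunk_id, score)` tuples. The function name `boost_multi_chunk_files`, the `boost_frac` parameter, and the bonus formula (a fraction of the summed scores of the file's other chunks) are illustrative assumptions, not the formula used in this PR.

```python
from collections import defaultdict


def boost_multi_chunk_files(candidates, boost_frac=0.2):
    """Return {chunk_id: score} with the best chunk of multi-chunk files boosted.

    candidates: list of (file_path, chunk_id, score) tuples.
    boost_frac: hypothetical fraction of the other chunks' scores added as a bonus.
    """
    # Group candidate chunks by the file they come from.
    by_file = defaultdict(list)
    for file_path, chunk_id, score in candidates:
        by_file[file_path].append((chunk_id, score))

    boosted = {chunk_id: score for _, chunk_id, score in candidates}
    for chunks in by_file.values():
        if len(chunks) < 2:
            continue  # files with a single candidate chunk get no bonus
        best_id, best_score = max(chunks, key=lambda item: item[1])
        other_total = sum(score for _, score in chunks) - best_score
        # Only the best chunk of the file is boosted; the others keep their score.
        boosted[best_id] = best_score + boost_frac * other_total
    return boosted
```

In this sketch, a file with two candidate chunks can overtake a slightly higher-scoring chunk from a file that appears only once in the pool.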

Add lightweight morphological stemming to _split_identifier: common suffixes (s, es, er, ed, ing, tion, ity, ...) are stripped to generate a stem variant that is added alongside the original token. This lets plurals, gerunds, and nominalizations match across query/document boundaries (colors↔color, utility↔util, serialization↔serial).

Benchmarked on 66 repos (20 languages):
Total NDCG@10: 0.844 → 0.847 (+0.003)
architecture: 0.801 → 0.805 (+0.004)
semantic: 0.834 → 0.839 (+0.005)
symbol: 0.956 → 0.952 (-0.004)
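A rough sketch of suffix stripping that produces a stem variant alongside the original token. The suffix list, its ordering, the minimum stem length, and the name `stem_variants` are assumptions chosen so that the three examples above hold; the actual list in _split_identifier may differ.

```python
# Hypothetical suffix list, longest first so "ization" is tried before "tion" or "s".
_SUFFIXES = ("ization", "ation", "tion", "ity", "ing", "es", "ed", "er", "s")
_MIN_STEM_LEN = 3  # assumed guard against stripping very short tokens


def stem_variants(token):
    """Return [token] or [token, stem] if a known suffix can be stripped."""
    token = token.lower()
    for suffix in _SUFFIXES:
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= _MIN_STEM_LEN:
            return [token, stem]
    return [token]
```

Rule-based stripping like this is crude: in this sketch `class` would also yield the variant `clas`, which is the kind of noise such a stemmer introduces on code identifiers.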

Sweep over boost multipliers on the 66-repo benchmark finds:
- _DEFINITION_BOOST_MULTIPLIER: 2.0 → 2.5 (+0.001 symbol/semantic)
- _FILE_COHERENCE_BOOST_FRAC: 0.3 → 0.2 (tighter coherence signal)

Total NDCG@10: 0.847 → 0.848 (+0.001)
semantic: 0.839 → 0.845 (+0.006)

Smaller k amplifies rank differences in RRF fusion, boosting semantic and architecture retrieval. Net gain: NDCG@10 0.848 → 0.850 on full 66-repo benchmark (architecture +0.000, semantic +0.005, symbol -0.007).
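For reference, reciprocal rank fusion scores a document as the sum of 1 / (k + rank) over the ranked lists it appears in. The sketch below shows why a smaller k sharpens rank differences; the function names are illustrative and not taken from search.py.

```python
def rrf_scores(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank), rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def top_gap_ratio(k):
    """Ratio between the RRF contribution of rank 1 and that of rank 10 for a given k."""
    return (1.0 / (k + 1)) / (1.0 / (k + 10))
```

With k=60, a rank-1 hit contributes about 1.15x as much as a rank-10 hit (70/61); with k=30 the ratio grows to about 1.29x (40/31), so top-ranked results from each retriever carry more weight.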

Elixir's defmodule was not matched by the existing 'module' keyword because defmodule is a macro prefix, not a bare keyword. Adding it recovers symbol ranking for Elixir modules. Also allow optional namespace prefix in definition patterns (e.g. defmodule Phoenix.Router matches query 'Router'). NDCG@10: 0.850 -> 0.851, symbol: 0.946 -> 0.949.
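A sketch of a definition pattern with an optional namespace prefix. The keyword list, the namespace separators, and the name `definition_pattern` are assumptions for illustration; the PR's actual regex is not shown here.

```python
import re

# Hypothetical keyword list; "defmodule" must be listed explicitly because
# r"\bmodule" cannot match inside "defmodule" (no word boundary before "module").
_KEYWORDS = r"(?:class|module|defmodule|def|function|struct|interface)"
# Optional dotted or ::-separated namespace prefix, e.g. "Phoenix." in "Phoenix.Router".
_NAMESPACE = r"(?:[A-Za-z_][A-Za-z0-9_]*(?:\.|::))*"


def definition_pattern(symbol):
    """Compile a regex that matches a definition of `symbol`, with or without a namespace."""
    return re.compile(rf"\b{_KEYWORDS}\s+{_NAMESPACE}{re.escape(symbol)}\b")
```

Under these assumptions, `definition_pattern("Router")` matches `defmodule Phoenix.Router do` but not `defmodule Phoenix.RouterHelpers do`.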

Files like 'requests.py' define 'Request' but weren't reached by the non-candidate stem scan because stem('requests') != 'request'. Fix by also matching when stem.rstrip('s') == symbol_lower, so 'requests.py' matches symbol query 'Request'. Combined with defmodule/namespace fix: symbol NDCG@10 0.946 -> 0.952.
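The plural-stem condition can be sketched as below. The helper name `file_stem_matches_symbol` and the use of `PurePosixPath` are illustrative assumptions; only the `stem.rstrip('s') == symbol_lower` comparison comes from the description above.

```python
from pathlib import PurePosixPath


def file_stem_matches_symbol(file_path, symbol):
    """True if the file stem equals the symbol, ignoring case and trailing 's' characters."""
    stem = PurePosixPath(file_path).stem.lower()
    symbol_lower = symbol.lower()
    # rstrip("s") removes every trailing "s", so "requests" becomes "request".
    return stem == symbol_lower or stem.rstrip("s") == symbol_lower
```

Note that `rstrip('s')` strips all trailing `s` characters, not just one, so a stem like `class` is reduced to `cla`.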

Extract class/function/def/etc. names from each chunk and append them to the BM25 content. This ensures symbol queries match chunks that define those symbols even when BM25 tokenisation would otherwise miss the name at ranking time. Symbol NDCG@10 recovers to 0.953 (pre-RRF-tuning level), confirming the defmodule + plural-stem + name-enrichment changes together absorb the symbol regression from RRF k=30. Overall: 0.851 (architecture=0.801, semantic=0.850, symbol=0.953).

Higher multiplier pushes definition chunks above non-defining candidates more strongly. Symbol NDCG@10: 0.953 -> 0.954. Overall unchanged at 0.851.

Files named 'example[s].*' that aren't in an examples/ directory (already penalized) can flood results for broad queries. Apply STRONG_PENALTY (0.3x) to these files so genuinely relevant implementation files rank higher. Fixes lazy.nvim 'config/loader' queries where example.lua incorrectly ranked first. Overall NDCG@10 unchanged (lazy.nvim gains offset by rounding variance).

Natural language (semantic/architecture) queries benefit from a larger candidate pool since the target may rank lower in individual retrievers before boosting. Symbol queries already get strong BM25 signal so 5x is sufficient. NDCG@10: 0.851 -> 0.853 (architecture=0.804, semantic=0.852, symbol=0.954).

single_include/nlohmann/json.hpp is a 27k-line generated amalgam that was outranking the real source files. Apply _STRONG_PENALTY (0.3x) to any path containing single_include/. Full benchmark: 0.805 → 0.807 (arch +0.005, sem +0.004, sym +0.001).

Directories like example_dart/ and example_flutter_app/ (dio repo) were ranking above real source files. Extend _EXAMPLES_DIR_RE to also match these compound example directory names while avoiding false positives on filenames like example_code.py (requires trailing /). Full benchmark: 0.807 → 0.808.

website/ directories contain documentation site source code (Docusaurus, etc.) that is unrelated to library implementation. These were showing up in top-10 results for riverpod queries. Apply _STRONG_PENALTY (0.3x). No annotation targets fall in website/ dirs. Overall: ~0.808 → ~0.809.

Directories like deps/ contain vendored third-party libraries (e.g. jemalloc in redis) that should not rank above the project's own source files. Apply _STRONG_PENALTY (0.3x). No annotation targets in deps/. Score is stable at ~0.808.

Remove BM25 suffix stemming (tokens.py): 45 LOC, gain indistinguishable from noise; rule-based stemmers on code identifiers are noisy across 20 languages.

Revert RRF k=30 (search.py) and BM25 def-name enrichment (sparse.py): these were coupled. k=30 alone regressed symbol, and enrichment was added to cancel that regression. Net +0.001 for ~30 LOC. Keep k=60.

Remove example file, example_dart/, website/, deps/ penalties (penalties.py): all zero measured gain, and several carry false-positive risk on repos outside the benchmark.

Keep: single_include/ penalty (+0.002, real C/C++ pattern).

For queries like 'how the StateManager tracks state', extract embedded CamelCase/camelCase identifiers and apply a symbol-definition boost at half the strength of pure symbol queries. Non-candidate scan uses prefix matching (min 4-char stem) so e.g. state.ts is found for symbol StateManager even when it doesn't rank in the initial candidate pool.

Benchmark: 0.844 -> 0.850 (+0.006, above ±0.003 noise floor)
architecture: 0.801 -> 0.812
typescript: 0.699 -> 0.710
csharp: 0.851 -> 0.878
rust: 0.844 -> 0.857
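A sketch of the two pieces described above: pulling CamelCase/camelCase identifiers out of a natural-language query, and prefix-matching a file stem against a symbol. The regex and the helper names are assumptions; the 0.5 scale reflects "half the strength" and the 4-character minimum comes from the description.

```python
import re

# Hypothetical pattern: an identifier with at least one internal capitalized segment,
# e.g. StateManager or beforeAll. Plain words such as "State" or "how" do not match.
_EMBEDDED_SYMBOL_RE = re.compile(r"\b[A-Za-z][a-z0-9]*(?:[A-Z][a-z0-9]+)+\b")
EMBEDDED_SYMBOL_BOOST_SCALE = 0.5  # half the strength of a pure symbol query


def extract_embedded_symbols(query):
    """Return CamelCase/camelCase identifiers found inside a natural-language query."""
    return _EMBEDDED_SYMBOL_RE.findall(query)


def stem_prefix_matches(file_stem, symbol, min_len=4):
    """True if the file stem (at least min_len chars) is a prefix of the symbol."""
    stem = file_stem.lower()
    return len(stem) >= min_len and symbol.lower().startswith(stem)
```

For the query 'how the StateManager tracks state', this sketch extracts `StateManager`, and the stem `state` of state.ts passes the prefix check. All-caps prefixes such as `HTTPServer` are not handled by this simple pattern.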

Pringled added 29 commits April 17, 2026 13:27

Tune RRF k from 60 to 30 for sharper rank discrimination

1ce739b

Tune DEFINITION_BOOST_MULTIPLIER to 3.0 for better symbol recall

1ef4d33

Add benchmark results

2a40099

Cleanup: extract _stem_matches helper, name _EMBEDDED_SYMBOL_BOOST_SCALE constant, single-pass coherence, revert tokens.py

f4a56e5

Trim verbose comments to match codebase style

fd1c6c7

Simplify: inline names set, remove obvious comment, consolidate non-candidate condition

51c3132

Simplify: extract _scan_non_candidates/_prefix_or_exact, aggregate embedded-symbol pass, walrus + docstring trims

a9a89e3

Simplify: inline _file_stem_matches_symbol, trim apply_query_boost docstring, drop _stem_matches from _prefix_or_exact

aa34cc8

Keep only final benchmark result, drop intermediates

ba700a2

Simplify: drop _prefix_or_exact, inline non-candidate loop, rename fp -> file_path

13549bb

Rename short vars: n -> name, sl -> symbol_lower

1619261

Move boost_file_coherence out of apply_query_boost; it's query-independent

a2c99a1

Drop unverifiable single_include/ penalty (1 repo affected, zero measured impact)

3b25e98

Nits: move boost_file_coherence to public section, cache definition regex, fold double guard

0d7e158

Drop verbose comments

4019b76

Pringled changed the title ~~feat: Add file coherence boosting and embedded CamelCase boosting~~ feat: Add multi-chunk file boosting and embedded CamelCase boosting Apr 18, 2026

Pringled added 4 commits April 18, 2026 09:49

Renamed functions

f028a75

Drop adaptive candidate count (5x/7x NL) — within noise of top_k*5

ebe9b27

Update benchmark result to post-cleanup run (0.851)

f93ab08

Update docstrings

46f46a2

Pringled merged commit 1ea77ad into main Apr 18, 2026
8 checks passed

Pringled deleted the ideas branch April 22, 2026 05:05

Pringled merged 33 commits into main from ideas
