feat: Add query boosting and filetype penalties#4

Merged
Pringled merged 20 commits into main from add-ranking on Apr 14, 2026

@Pringled
Member

No description provided.

Pringled added 20 commits April 14, 2026 10:57
- Extract ranking/ subpackage: penalties.py, boosting.py (query detection +
  alpha resolution + score boosting), selection.py (diverse top-k)
- Rename utils.py → tokens.py, sources.py → file_walker.py for clarity
- Delete cache.py: inline _EmbeddingCache into index.py
- Delete version.py: inline __version__ into __init__.py
- Fix BM25 path enrichment: use repo-relative paths instead of absolute paths
- Fix diverse_topk early-exit: correct min_selected initialisation to +inf
- Fix packaging: use find: discovery so ranking/ subpackage is included
- Add CLI entry point: semble = semble.cli:main
… and trim stale comments

- diverse_topk now returns (chunk, effective_score) pairs; search_hybrid
  uses the effective score so displayed scores are monotonic with ranking
- Normalise path separators to '/' before regex matching in penalties.py
  so test/compat/examples detection works on Windows paths
- Fix ruff per-file-ignores path: benchmarks/*.py -> local/benchmarks/*.py
- Remove redundant module-map docstrings from search.py and boosting.py
- Strip drifting benchmark numbers from inline comments, keep rationale
- Inline _resolve_cache_config into SembleIndex.__init__ (single callsite)
- Inline _extract_query_keywords into _boost_stem_matches; keep _path_parts
  (its 'why only immediate parent' docstring is worth preserving)
- Remove unused imports in test_index.py (Sequence, numpy, numpy.typing)
- Fold test_stats_property assertion into test_index_returns_stats
- Parametrize markdown include/exclude tests into one test
- Trim selection.py module docstring and section banner (one-function file)
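
The separator normalisation mentioned above can be sketched as follows. This is a minimal illustration, not the code from penalties.py: the pattern `_DEMOTED_DIR_RE` and the helpers `normalise_path` and `is_demoted_path` are hypothetical names, and the actual regexes are not shown in this PR.

```python
import re

# Hypothetical pattern (an assumption): matches a test, tests, compat, or
# examples directory component. It only knows about '/' separators.
_DEMOTED_DIR_RE = re.compile(r"(^|/)(tests?|compat|examples)/")


def normalise_path(path: str) -> str:
    """Convert Windows-style backslashes to forward slashes."""
    return path.replace("\\", "/")


def is_demoted_path(path: str) -> bool:
    """Return True if the normalised path contains a demoted directory."""
    return _DEMOTED_DIR_RE.search(normalise_path(path)) is not None


if __name__ == "__main__":
    # Without normalisation the Windows path would not match, because the
    # pattern never sees a '/' between components.
    print(is_demoted_path(r"src\tests\test_index.py"))
    print(is_demoted_path("src/semble/index.py"))
```

Normalising once, before matching, keeps the patterns themselves free of `[\\/]` alternations.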
… docs

- Remove one-liner module docstrings from search.py, selection.py,
  penalties.py, and boosting.py (D100 is ignored; filenames are self-evident)
- Remove all six section-divider banners from boosting.py
- Strip :param/:return lines from private functions where they restate
  the type signature without adding meaning; keep prose that explains
  non-obvious decisions (two-pass SQL matching, immediate-parent-only
  path parts, multiplicative penalty combination, early-exit invariant)
- Keep full param docs on all public API methods (SembleIndex, search_hybrid)
… last

All module-level constants and compiled patterns are now together at the
top of the file. Private helpers follow in call-order. The two public
functions (resolve_alpha, apply_query_boost) are at the bottom.

The other files (search.py, penalties.py, selection.py, index.py) already
had correct ordering (constants/patterns at top, public API before private
helpers) and are unchanged.

_SYMBOL_QUERY_RE was too narrow (only PascalCase/camelCase), wrongly
treating HTTPAdapter, field_validator, URL, TRPCError etc. as NL queries.
New regex matches any identifier with uppercase, underscore, or namespace
separator; only purely-lowercase words (session, response) stay as NL.
Benchmark unchanged at NDCG@10=0.867.

Also: extract __version__ to version.py (no imports), trim DEFAULT_IGNORED_DIRS
(remove .github/.circleci/.gitlab/deprecated), fix alpha docstring in search().
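
The widened symbol-query rule described above might look roughly like this. It is a sketch under stated assumptions, not the actual _SYMBOL_QUERY_RE: the pattern is reconstructed from the prose, `is_symbol_query` is a stand-in name, and treating `.` and `::` as the namespace separators is a guess.

```python
import re

# One identifier: a letter or underscore followed by word characters.
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Assumed shape of the rule: a chain of identifiers joined by '.' or '::'
# that contains at least one uppercase letter, underscore, or separator.
# The lookahead rejects purely-lowercase single words.
_SYMBOL_QUERY_SKETCH_RE = re.compile(
    rf"(?=.*[A-Z_.:]){_IDENT}(?:(?:\.|::){_IDENT})*"
)


def is_symbol_query(query: str) -> bool:
    """Return True if the whole query looks like a code symbol."""
    return _SYMBOL_QUERY_SKETCH_RE.fullmatch(query.strip()) is not None


if __name__ == "__main__":
    for q in ["HTTPAdapter", "field_validator", "URL", "TRPCError",
              "session", "response", "how does auth work"]:
        print(f"{q!r}: {is_symbol_query(q)}")
```

Under this sketch the four identifiers named in the commit message classify as symbols, while `session`, `response`, and multi-word queries fall through to the natural-language path.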
… tests

rerank_topk gains penalise_paths=False (keyword-only); search_hybrid passes
False when alpha==1.0 so __init__.py and other heuristic demotions don't
corrupt results that were never influenced by BM25.
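
A rough sketch of this gating, with loud assumptions: the `Scored` type, the 0.5 multiplier, the linear blend in `search_hybrid`, and all signatures are hypothetical and are not taken from the PR. Only the keyword-only `penalise_paths` flag and the `alpha == 1.0` rule come from the commit message. The PR does not make the flag's default clear, so the sketch gives it none.

```python
from typing import NamedTuple


class Scored(NamedTuple):
    path: str
    score: float


def _file_path_penalty(path: str) -> float:
    # Placeholder multiplier (an assumption): the real values are not
    # given in this PR.
    return 0.5 if path.endswith("__init__.py") else 1.0


def rerank_topk(candidates: list[Scored], k: int, *,
                penalise_paths: bool) -> list[Scored]:
    """Sort by score, optionally applying path penalties first."""
    if penalise_paths:
        candidates = [
            Scored(c.path, c.score * _file_path_penalty(c.path))
            for c in candidates
        ]
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k]


def search_hybrid(semantic: dict[str, float], bm25: dict[str, float],
                  alpha: float, k: int = 10) -> list[Scored]:
    """Blend the two score sources; alpha == 1.0 means semantic only."""
    paths = semantic.keys() | bm25.keys()
    blended = [
        Scored(p, alpha * semantic.get(p, 0.0)
               + (1.0 - alpha) * bm25.get(p, 0.0))
        for p in paths
    ]
    # Skip the heuristic demotions when BM25 had no influence.
    return rerank_topk(blended, k, penalise_paths=alpha != 1.0)
```

Making the flag keyword-only forces call sites to spell out `penalise_paths=...`, so a bare positional `True` or `False` cannot be passed by accident.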

Adds tests/test_ranking.py: 40 tests pinning _is_symbol_query classification,
_file_path_penalty multipliers, _chunk_defines_symbol definition detection,
and the __init__.py demotion / penalise_paths=False behaviour.

191 → 88 lines, 40 → 25 tests. Merge separate true/false parametrize
blocks into single tables with an expected bool column; collapse five
individual _file_path_penalty functions into one parametrized test.

1. Collapse _is_test_file/_is_init_file into _file_path_penalty; remove
   is_test parameter and the extra call-site plumbing in rerank_topk.
2. Trim rerank_topk docstring; delete duplicate 8-line inline comment block.
3. Merge _path_parts into _boost_stem_matches (path tokens inlined);
   keep _fuzzy_keyword_overlap as standalone to satisfy complexity limit.
4. Move _make_chunk to conftest.make_chunk; remove local copies in
   test_search.py and test_ranking.py; drop unused _is_test_file import.
5. Trim search_hybrid docstring to two lines + params.
@Pringled Pringled merged commit 69767b0 into main Apr 14, 2026
@Pringled Pringled deleted the add-ranking branch April 22, 2026 05:05