feat: Add query boosting and filetype penalties #4
Merged
- Extract ranking/ subpackage: penalties.py, boosting.py (query detection + alpha resolution + score boosting), selection.py (diverse top-k)
- Rename utils.py → tokens.py, sources.py → file_walker.py for clarity
- Delete cache.py: inline _EmbeddingCache into index.py
- Delete version.py: inline __version__ into __init__.py
- Fix BM25 path enrichment: use repo-relative paths instead of absolute paths
- Fix diverse_topk early-exit: correct min_selected initialisation to +inf
- Fix packaging: use find: discovery so ranking/ subpackage is included
- Add CLI entry point: semble = semble.cli:main
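The BM25 path-enrichment fix above can be illustrated with a minimal sketch. The helper names (`repo_relative`, `enrich_for_bm25`) and the "prepend the path to the chunk text" strategy are assumptions for illustration only; the PR text does not show how the enrichment is actually implemented. The point is that an absolute path would add machine-specific components (home directory, user name) to the indexed terms, whereas a repo-relative path contributes only components that belong to the repository.

```python
from pathlib import PurePosixPath


def repo_relative(path: str, repo_root: str) -> str:
    """Return `path` relative to `repo_root`, using '/' separators."""
    return PurePosixPath(path).relative_to(PurePosixPath(repo_root)).as_posix()


def enrich_for_bm25(text: str, path: str, repo_root: str) -> str:
    # Hypothetical enrichment: prepend the repo-relative path so that its
    # components become searchable terms alongside the chunk text.
    return f"{repo_relative(path, repo_root)}\n{text}"


enriched = enrich_for_bm25(
    "def resolve_alpha(query): ...",
    "/home/alice/proj/semble/ranking/boosting.py",
    "/home/alice/proj",
)
```

With the absolute path, terms such as `home` and `alice` would have ended up in the index; here only `semble/ranking/boosting.py` is added.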
… and trim stale comments

- diverse_topk now returns (chunk, effective_score) pairs; search_hybrid uses the effective score so displayed scores are monotonic with ranking
- Normalise path separators to '/' before regex matching in penalties.py so test/compat/examples detection works on Windows paths
- Fix ruff per-file-ignores path: benchmarks/*.py -> local/benchmarks/*.py
- Remove redundant module-map docstrings from search.py and boosting.py
- Strip drifting benchmark numbers from inline comments, keep rationale
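The separator normalisation described above might look roughly like the following. The regex and the function name are hypothetical; the actual patterns in penalties.py are not shown in this PR text. The sketch only demonstrates why converting `\` to `/` first lets a single `/`-based pattern cover both POSIX and Windows paths.

```python
import re

# Hypothetical pattern (an assumption, not the one in penalties.py): a path
# component named test/tests, compat, or examples, followed by a separator.
_TEST_DIR_RE = re.compile(r"(?:^|/)(?:tests?|compat|examples)/")


def looks_like_test_or_example(path: str) -> bool:
    # Normalise Windows separators so the '/'-based regex applies unchanged.
    normalised = path.replace("\\", "/")
    return _TEST_DIR_RE.search(normalised) is not None
```

Without the `replace` call, a path such as `src\tests\test_a.py` would contain no `/` at all and would never match.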
- Inline _resolve_cache_config into SembleIndex.__init__ (single callsite)
- Inline _extract_query_keywords into _boost_stem_matches; keep _path_parts (its 'why only immediate parent' docstring is worth preserving)
- Remove unused imports in test_index.py (Sequence, numpy, numpy.typing)
- Fold test_stats_property assertion into test_index_returns_stats
- Parametrize markdown include/exclude tests into one test
- Trim selection.py module docstring and section banner (one-function file)
… docs

- Remove one-liner module docstrings from search.py, selection.py, penalties.py, and boosting.py (D100 is ignored; filenames are self-evident)
- Remove all six section-divider banners from boosting.py
- Strip :param/:return lines from private functions where they restate the type signature without adding meaning; keep prose that explains non-obvious decisions (two-pass SQL matching, immediate-parent-only path parts, multiplicative penalty combination, early-exit invariant)
- Keep full param docs on all public API methods (SembleIndex, search_hybrid)
… last

All module-level constants and compiled patterns are now together at the top of the file. Private helpers follow in call-order. The two public functions (resolve_alpha, apply_query_boost) are at the bottom.

The other files (search.py, penalties.py, selection.py, index.py) already had correct ordering (constants/patterns at top, public API before private helpers) and are unchanged.
…st in boosting.py
_SYMBOL_QUERY_RE was too narrow (only PascalCase/camelCase), wrongly treating HTTPAdapter, field_validator, URL, TRPCError etc. as NL queries. New regex matches any identifier with uppercase, underscore, or namespace separator; only purely-lowercase words (session, response) stay as NL. Benchmark unchanged at NDCG@10=0.867. Also: extract __version__ to version.py (no imports), trim DEFAULT_IGNORED_DIRS (remove .github/.circleci/.gitlab/deprecated), fix alpha docstring in search().
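The widened classification rule can be sketched as follows. This is not the PR's actual `_SYMBOL_QUERY_RE`, which is not shown; the two-regex split, the names, and the choice of `.` and `::` as namespace separators are assumptions made for illustration. The idea is that a query counts as a symbol when it is a single identifier-like token containing an uppercase letter, an underscore, or a namespace separator, and purely lowercase words stay natural-language.

```python
import re

_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
# Shape: one identifier, optionally chained with '.' or '::' (assumed separators).
_SYMBOL_SHAPE_RE = re.compile(rf"^{_IDENT}(?:(?:\.|::){_IDENT})*$")
# Marker: at least one uppercase letter, underscore, or namespace separator.
_SYMBOL_MARKER_RE = re.compile(r"[A-Z_]|\.|::")


def is_symbol_query(query: str) -> bool:
    """Hypothetical stand-in for the PR's symbol-vs-NL query check."""
    q = query.strip()
    return bool(_SYMBOL_SHAPE_RE.match(q)) and bool(_SYMBOL_MARKER_RE.search(q))


for q in ["HTTPAdapter", "field_validator", "URL", "TRPCError", "session", "response"]:
    print(f"{q!r}: {'symbol' if is_symbol_query(q) else 'NL'}")
```

Under these assumptions the first four examples from the commit message classify as symbols, while `session` and `response` remain NL queries, as does any multi-word query.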
… tests

rerank_topk gains penalise_paths=False (keyword-only); search_hybrid passes False when alpha==1.0 so __init__.py and other heuristic demotions don't corrupt results that were never influenced by BM25.

Adds tests/test_ranking.py: 40 tests pinning _is_symbol_query classification, _file_path_penalty multipliers, _chunk_defines_symbol definition detection, and the __init__.py demotion / penalise_paths=False behaviour.
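A simplified sketch of the keyword-only flag follows. The signatures, the 0.5 multiplier, and the `(path, score)` tuples are all assumptions; the real rerank_topk and search_hybrid operate on chunks and are not shown in this PR text. The parameter is made required here because the commit message does not settle what its default is.

```python
def _file_path_penalty(path: str) -> float:
    # Hypothetical multiplier; the real values are pinned in tests/test_ranking.py.
    return 0.5 if path.endswith("__init__.py") else 1.0


def rerank_topk(
    scored: list[tuple[str, float]], k: int, *, penalise_paths: bool
) -> list[tuple[str, float]]:
    # `*` makes penalise_paths keyword-only, so call sites must name it.
    if penalise_paths:
        scored = [(path, score * _file_path_penalty(path)) for path, score in scored]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def search_hybrid(
    scored: list[tuple[str, float]], k: int, alpha: float
) -> list[tuple[str, float]]:
    # Per the commit message, path penalties are skipped when alpha == 1.0.
    return rerank_topk(scored, k, penalise_paths=(alpha != 1.0))


candidates = [("pkg/__init__.py", 0.9), ("pkg/core.py", 0.8)]
```

With these inputs, `search_hybrid(candidates, 2, alpha=1.0)` keeps `pkg/__init__.py` first, while `alpha=0.5` demotes it below `pkg/core.py`.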
191 → 88 lines, 40 → 25 tests. Merge separate true/false parametrize blocks into single tables with an expected bool column; collapse five individual _file_path_penalty functions into one parametrized test.
1. Collapse _is_test_file/_is_init_file into _file_path_penalty; remove is_test parameter and the extra call-site plumbing in rerank_topk.
2. Trim rerank_topk docstring; delete duplicate 8-line inline comment block.
3. Merge _path_parts into _boost_stem_matches (path tokens inlined); keep _fuzzy_keyword_overlap as standalone to satisfy complexity limit.
4. Move _make_chunk to conftest.make_chunk; remove local copies in test_search.py and test_ranking.py; drop unused _is_test_file import.
5. Trim search_hybrid docstring to two lines + params.
No description provided.