feat: Update benchmarks, add baselines, add plots #26
Merged
…ations)

- Add bench_ripgrep.py, bench_colgrep.py, bench_coderankembed.py, bench_ablations.py
- Extend benchmarks/data.py with save_results, results_path, current_sha helpers
- Update run_benchmark.py to use save_results with method-prefixed filenames
- Add benchmark extra (sentence-transformers, einops) to pyproject.toml
- Add [build-system] table so semble is installable as a package via uv
- Save results: semble-hybrid (0.850), coderankembed-hybrid (0.860), semble-ablations, colgrep (0.551), ripgrep (0.117)
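For context, the save_results / results_path / current_sha helpers might look roughly like this. This is a hypothetical sketch: the function bodies, the `benchmarks/results` location, and the exact `method-<sha>.json` filename scheme are assumptions (the repo's real result files do appear to use a method prefix plus a short SHA suffix, e.g. `*-0332378809c5`).

```python
# Hypothetical sketch of the benchmarks/data.py helpers; bodies and
# filename scheme are assumptions, not the repo's actual code.
import json
import subprocess
from pathlib import Path

RESULTS_DIR = Path("benchmarks/results")  # assumed location


def current_sha() -> str:
    """Short git SHA of the current checkout, used to tag result files."""
    return subprocess.run(
        ["git", "rev-parse", "--short=12", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()


def results_path(method: str, sha: str) -> Path:
    """Method-prefixed, SHA-suffixed filename, e.g. semble-hybrid-<sha>.json."""
    return RESULTS_DIR / f"{method}-{sha}.json"


def save_results(method: str, results: dict) -> Path:
    """Write one method's results JSON next to the other canonical files."""
    path = results_path(method, current_sha())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2))
    return path
```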
New modes run retrieval through the full semble ranking stack (boost, query boost, rerank) with alpha=0 or alpha=1, isolating each retrieval source independently of the hybrid fusion step.
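A minimal sketch of why the alpha endpoints isolate each source, assuming a standard linear fusion of the two retrieval scores (semble's actual fusion and any score normalisation may differ):

```python
# Sketch only: assumes linear alpha-weighted fusion of two score sources.
def hybrid_score(lexical: float, semantic: float, alpha: float) -> float:
    """alpha=1 -> pure semantic, alpha=0 -> pure lexical (assumed convention)."""
    return alpha * semantic + (1 - alpha) * lexical
```

At alpha=0 or alpha=1 one term vanishes, so the rest of the ranking stack still runs but only one retrieval source feeds it.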
Move comparison and ablation scripts into benchmarks/baselines/ to separate the day-to-day main benchmark (run_benchmark.py) from the one-shot baselines (ripgrep, colgrep, coderankembed, ablations).
- Add benchmarks/metrics.py with dcg, ndcg_at_k, target_rank, file_rank
- Import from metrics.py in all 5 benchmark files (no duplication)
- Remove verbose module docstrings (replaced with one-liners)
- Remove `# ---` section separator banner comments
- Fix E501 line-too-long violations in ablations.py and coderankembed.py
…cross all tools Measures cold index build + query p50 for semble, coderankembed, colgrep, and ripgrep on a curated 1-per-language subset (20 repos). Uses colgrep clear before each init to ensure cold builds. colgrep index timing is now also tracked in the full colgrep baseline.
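The timing harness likely follows the usual pattern; a sketch under the assumption that query p50 means the median of per-query wall-clock latencies (the real benchmark's harness may differ):

```python
# Sketch: p50 = median per-query latency; harness details are assumptions.
import statistics
import time


def time_queries_p50_ms(run_query, queries) -> float:
    """Run each query once and return the median latency in milliseconds."""
    latencies = []
    for query in queries:
        started = time.perf_counter()
        run_query(query)
        latencies.append((time.perf_counter() - started) * 1000)
    return statistics.median(latencies)
```

Cold index build would be timed the same way around a single `clear` + `init` cycle per repo.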
Ensures all tools run on CPU for a fair apples-to-apples comparison. CRE was previously using MPS (Apple Silicon GPU); colgrep may have used CoreML via ONNX. Both now use device='cpu' / --force-cpu respectively.
…sults

- Move --force-cpu after 'init' subcommand so it is parsed correctly
- All 20-repo CPU speed benchmark results saved
…m results colgrep indexes 0 files for Dart; detect this and skip rather than recording a meaningless 14ms. Riverpod entry removed from the speed results JSON and the colgrep summary recomputed over 19 repos (5.75s index, 124ms query p50).
Keeping riverpod created an uneven denominator (colgrep 19 repos vs others 20). Simpler to drop Dart entirely and run a clean 19-language benchmark.
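The zero-file guard could be as simple as parsing the init output for the indexed-file count; this is a sketch, and the regex over colgrep's summary line is an assumption about its output format:

```python
# Sketch of the silent-failure guard; the output format is an assumption.
import re


def indexed_file_count(init_output: str) -> int:
    """Pull '<N> files' out of the init summary; 0 if no match."""
    match = re.search(r"(\d+)\s+files", init_output)
    return int(match.group(1)) if match else 0


def should_skip_repo(init_output: str) -> bool:
    """Skip repos where the index silently built with 0 files."""
    return indexed_file_count(init_output) == 0
```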
- Remove dart repos (riverpod, dio, http-dart) from repos.json and annotations; colgrep does not support Dart, so they were excluded from all benchmarks
- Delete the three Dart annotation files
- Update result JSON summaries for all five canonical result files (63 repos / 19 languages, Dart excluded)
- Add main results table and ablations section to benchmarks/README.md
…tegory breakdown Add table of contents, setup config table, results-by-category table, key findings prose, and ablations by category in a collapsible block. Structure follows patterns from semhash/pyversity/model2vec benchmark READMEs.
Add benchmarks/plot.py generating a log-scale scatter plot of time-to-first-result vs NDCG@10 for all methods. Marker size scales with model parameter count. Add matplotlib to the benchmark optional-dependencies extra.
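A sketch of what such a plot script might do; the data values, the log-scaling of marker area with parameter count, and the labels are all assumptions, not the repo's plot.py:

```python
# Sketch of a time-vs-quality scatter; scaling rule and labels are assumed.
import math


def marker_size(param_count: float, base: float = 20.0) -> float:
    """Marker area grows with log10 of parameter count (assumed scheme)."""
    return base * (1 + math.log10(max(param_count, 1.0)))


def plot_frontier(methods):
    """methods: iterable of (name, time_ms, ndcg_at_10, param_count)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend for script use
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, time_ms, ndcg, params in methods:
        ax.scatter(time_ms, ndcg, s=marker_size(params), label=name)
    ax.set_xscale("log")
    ax.set_xlabel("time to first result (ms)")
    ax.set_ylabel("NDCG@10")
    ax.legend()
    return fig
```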
- Remove --code-only flag from colgrep search (excluded bash/shell files entirely)
- Add --force-cpu to both init and search for reproducibility
- _init_index now clears before rebuilding and detects silent 0-file failures
- _resolve_path falls back to checkout root when benchmark_dir yields 0 files
- Drop rxswift (Sources/RxSwift uses symlinks ColGREP cannot follow)
- Re-run the 10 repos that had 0.0 NDCG; merge with good prior results
- ColGREP NDCG@10 corrects from 0.577 → 0.692 across 62 repos / 19 languages
- Regenerate cold + warm scatter plots with updated ColGREP position
- Add warm plot to two-column table in benchmarks/README.md
…update all results

- Replace broken rxswift (symlinks) with snapkit (37 real .swift files, no symlinks)
- Add benchmarks/annotations/snapkit.json with 13 queries (semantic/architecture/symbol)
- Run all 4 quality tools on snapkit: semble=0.787, colgrep=0.678, ripgrep=0.164, CRE hybrid=0.790
- Remove rxswift + 3 stale Dart repos from all results files; add snapkit → 63 repos
- Fix --code-only in speed_benchmark.py _run_colgrep (same bug as quality benchmark)
- Update all NDCG@10 values in plot.py and README: semble 0.852→0.854, CRE 0.762→0.765, CRE Hybrid 0.860→0.862, ripgrep 0.123→0.126, ColGREP 0.692 (unchanged)
- Regenerate cold + warm scatter plots with corrected positions
…nical results

- colgrep.py now uses --code-only for all non-bash repos (their default); bash repos (bash-it, bats-core, nvm) run without it, since .sh/.bash files are excluded by --code-only
- Re-ran 7 repos that had buggy zero scores in the old --code-only run (abseil-cpp, curl, ecto, httpx, laravel-framework, redis, tokio)
- Merged: 52 repos from old --code-only run + 7 re-run + 3 bash + 1 snapkit
- Only meaningful change: redis 0.742 → 0.792 (--code-only removed README.md noise)
- Overall ColGREP NDCG: 0.6917 → 0.6925 (negligible; story unchanged)
- Added --no-code-only CLI flag for override; added README note on config
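The per-repo flag selection described above could be sketched like this; the helper name and argument list are hypothetical, though the repo names, the `search` subcommand, and the --code-only / --no-code-only flags come from the commits:

```python
# Hypothetical helper for the flag selection described in the commit.
BASH_REPOS = {"bash-it", "bats-core", "nvm"}  # .sh/.bash excluded by --code-only


def colgrep_search_args(repo: str, no_code_only: bool = False) -> list[str]:
    """Use --code-only everywhere except bash repos, unless overridden."""
    args = ["colgrep", "search"]
    if repo not in BASH_REPOS and not no_code_only:
        args.append("--code-only")
    return args
```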
- semble-hybrid results file: by_language had a stale dart entry (dropped repos) and wrong swift values (from rxswift, not snapkit); summary ndcg10/latency/by_category were all computed from that stale data. Recompute everything from repos[] (63 repos, 19 languages). summary.ndcg10: 0.8509 (lang-weighted, stale) → 0.8544 (repo-weighted); summary.by_category: arch 0.8091 → 0.8034, sem 0.843 → 0.8455 (symbol unchanged)
- run_benchmark.py _save_results: switch summary.ndcg10 to a simple repo-weighted average, consistent with all other tools
- speed_benchmark.py: fix '20 repos' → '19 repos' in summary print; add --code-only (with bash auto-detect) to _run_colgrep, consistent with the quality benchmark
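The two averaging schemes differ whenever languages have unequal repo counts; a minimal illustration (function names are mine, not the benchmark's):

```python
# Illustration of repo-weighted vs language-weighted NDCG summaries.
from statistics import mean


def repo_weighted(ndcg_by_repo: dict[str, float]) -> float:
    """Simple mean over repos: every repo counts equally."""
    return mean(ndcg_by_repo.values())


def language_weighted(ndcg_by_repo: dict[str, float],
                      language_of: dict[str, str]) -> float:
    """Mean of per-language means: every language counts equally."""
    by_language: dict[str, list[float]] = {}
    for repo, score in ndcg_by_repo.items():
        by_language.setdefault(language_of[repo], []).append(score)
    return mean(mean(scores) for scores in by_language.values())
```

With 63 repos over 19 languages the counts are uneven, so the two means diverge (here 0.8509 vs 0.8544); the repo-weighted mean matches how the other tools are summarised.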
- Remove 8 superseded results files (old SHA runs for colgrep, cre, ripgrep, semble-hybrid, plus the two unnamed earlier semble runs)
- Keep both semble-ablations files (they cover complementary modes: raw vs ranked) and all four canonical *-0332378809c5 / colgrep-c8a40fab2235 / speed files
- plot.py cold frontier: restore the incumbent baseline (ripgrep → ColGREP → CRE Hybrid) so semble floats above it, matching the intended 'how far above the incumbent curve' framing; update comment to say so
- Regenerate both scatter plots
… floats above ColGREP is dominated by CRE Hybrid in warm mode (slower and lower NDCG), so the warm incumbent baseline is ripgrep → CRE Hybrid, consistent with the cold plot convention. semble now correctly floats above/left of it.
…latency (cold)' Aligns naming and axis label with the warm plot convention.
…p and grouping in benchmarks
- Rename cat_ndcg10 → category_ndcg10, cat → category in verbose blocks
- Remove dead results=[] / results: list[] = initialisations (unconditionally overwritten)
- Rename r → result in _build_summary and _load_completed (coderankembed)
- Remove remaining string annotations: _AsymmetricWrapper, _CREWrapper
- Rename t0 → started, qlats → query_latencies, g → language_results, lang_* → language_*, idx_ms → index_ms, p50 → p50_ms (speed/run benchmarks)
- Add magic 10 → _DIRECT_TOP_K in run_benchmark._evaluate
- Fix header casing: language → Language, chunks → Chunks in _bench_quality
- Add SearchResult import to run_benchmark for typed declaration
- Split overlong verbose print lines (120 char limit)