feat: Update benchmarks, add baselines, add plots #26
Merged
…ations)

- Add bench_ripgrep.py, bench_colgrep.py, bench_coderankembed.py, bench_ablations.py
- Extend benchmarks/data.py with save_results, results_path, current_sha helpers
- Update run_benchmark.py to use save_results with method-prefixed filenames
- Add benchmark extra (sentence-transformers, einops) to pyproject.toml
- Add [build-system] table so semble is installable as a package via uv
- Save results: semble-hybrid (0.850), coderankembed-hybrid (0.860), semble-ablations, colgrep (0.551), ripgrep (0.117)
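For context, the save_results / results_path / current_sha helpers might look roughly like this. This is a hypothetical sketch: the function bodies, the `benchmarks/results` location, and the exact `method-<sha>.json` filename scheme are assumptions (the repo's real result files do appear to use a method prefix plus a short SHA suffix, e.g. `*-0332378809c5`).

```python
# Hypothetical sketch of the benchmarks/data.py helpers; bodies and
# filename scheme are assumptions, not the repo's actual code.
import json
import subprocess
from pathlib import Path

RESULTS_DIR = Path("benchmarks/results")  # assumed location


def current_sha() -> str:
    """Short git SHA of the current checkout, used to tag result files."""
    return subprocess.run(
        ["git", "rev-parse", "--short=12", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()


def results_path(method: str, sha: str) -> Path:
    """Method-prefixed, SHA-suffixed filename, e.g. semble-hybrid-<sha>.json."""
    return RESULTS_DIR / f"{method}-{sha}.json"


def save_results(method: str, results: dict) -> Path:
    """Write one method's results JSON next to the other canonical files."""
    path = results_path(method, current_sha())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2))
    return path
```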
New modes run retrieval through the full semble ranking stack (boost, query boost, rerank) with alpha=0 or alpha=1, isolating each retrieval source independently of the hybrid fusion step.
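A minimal sketch of why the alpha endpoints isolate each source, assuming a standard linear fusion of the two retrieval scores (semble's actual fusion and any score normalisation may differ):

```python
# Sketch only: assumes linear alpha-weighted fusion of two score sources.
def hybrid_score(lexical: float, semantic: float, alpha: float) -> float:
    """alpha=1 -> pure semantic, alpha=0 -> pure lexical (assumed convention)."""
    return alpha * semantic + (1 - alpha) * lexical
```

At alpha=0 or alpha=1 one term vanishes, so the rest of the ranking stack still runs but only one retrieval source feeds it.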
Move comparison and ablation scripts into benchmarks/baselines/ to separate the day-to-day main benchmark (run_benchmark.py) from the one-shot baselines (ripgrep, colgrep, coderankembed, ablations).
- Add benchmarks/metrics.py with dcg, ndcg_at_k, target_rank, file_rank
- Import from metrics.py in all 5 benchmark files (no duplication)
- Remove verbose module docstrings (replaced with one-liners)
- Remove `# ---` section separator banner comments
- Fix E501 line-too-long violations in ablations.py and coderankembed.py
…cross all tools Measures cold index build + query p50 for semble, coderankembed, colgrep, and ripgrep on a curated 1-per-language subset (20 repos). Uses colgrep clear before each init to ensure cold builds. colgrep index timing is now also tracked in the full colgrep baseline.
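The timing harness likely follows the usual pattern; a sketch under the assumption that query p50 means the median of per-query wall-clock latencies (the real benchmark's harness may differ):

```python
# Sketch: p50 = median per-query latency; harness details are assumptions.
import statistics
import time


def time_queries_p50_ms(run_query, queries) -> float:
    """Run each query once and return the median latency in milliseconds."""
    latencies = []
    for query in queries:
        started = time.perf_counter()
        run_query(query)
        latencies.append((time.perf_counter() - started) * 1000)
    return statistics.median(latencies)
```

Cold index build would be timed the same way around a single `clear` + `init` cycle per repo.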
Ensures all tools run on CPU for a fair apples-to-apples comparison. CRE was previously using MPS (Apple Silicon GPU); colgrep may have used CoreML via ONNX. Both now use device='cpu' / --force-cpu respectively.
…sults

- Move --force-cpu after 'init' subcommand so it is parsed correctly
- All 20-repo CPU speed benchmark results saved
…m results colgrep indexes 0 files for Dart; detect this and skip rather than recording a meaningless 14ms. Riverpod entry removed from the speed results JSON and the colgrep summary recomputed over 19 repos (5.75s index, 124ms query p50).
Keeping riverpod created an uneven denominator (colgrep 19 repos vs others 20). Simpler to drop Dart entirely and run a clean 19-language benchmark.
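The zero-file guard could be as simple as parsing the init output for the indexed-file count; this is a sketch, and the regex over colgrep's summary line is an assumption about its output format:

```python
# Sketch of the silent-failure guard; the output format is an assumption.
import re


def indexed_file_count(init_output: str) -> int:
    """Pull '<N> files' out of the init summary; 0 if no match."""
    match = re.search(r"(\d+)\s+files", init_output)
    return int(match.group(1)) if match else 0


def should_skip_repo(init_output: str) -> bool:
    """Skip repos where the index silently built with 0 files."""
    return indexed_file_count(init_output) == 0
```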
- Remove dart repos (riverpod, dio, http-dart) from repos.json and annotations; colgrep does not support Dart, so they were excluded from all benchmarks
- Delete the three Dart annotation files
- Update result JSON summaries for all five canonical result files (63 repos / 19 languages, Dart excluded)
- Add main results table and ablations section to benchmarks/README.md
…tegory breakdown Add table of contents, setup config table, results-by-category table, key findings prose, and ablations by category in a collapsible block. Structure follows patterns from semhash/pyversity/model2vec benchmark READMEs.
Add benchmarks/plot.py generating a log-scale scatter plot of time-to-first-result vs NDCG@10 for all methods. Marker size scales with model parameter count. Add matplotlib to the benchmark optional-dependencies extra.
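A sketch of what such a plot script might do; the data values, the log-scaling of marker area with parameter count, and the labels are all assumptions, not the repo's plot.py:

```python
# Sketch of a time-vs-quality scatter; scaling rule and labels are assumed.
import math


def marker_size(param_count: float, base: float = 20.0) -> float:
    """Marker area grows with log10 of parameter count (assumed scheme)."""
    return base * (1 + math.log10(max(param_count, 1.0)))


def plot_frontier(methods):
    """methods: iterable of (name, time_ms, ndcg_at_10, param_count)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend for script use
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, time_ms, ndcg, params in methods:
        ax.scatter(time_ms, ndcg, s=marker_size(params), label=name)
    ax.set_xscale("log")
    ax.set_xlabel("time to first result (ms)")
    ax.set_ylabel("NDCG@10")
    ax.legend()
    return fig
```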
- Remove --code-only flag from colgrep search (excluded bash/shell files entirely)
- Add --force-cpu to both init and search for reproducibility
- _init_index now clears before rebuilding and detects silent 0-file failures
- _resolve_path falls back to checkout root when benchmark_dir yields 0 files
- Drop rxswift (Sources/RxSwift uses symlinks ColGREP cannot follow)
- Re-run the 10 repos that had 0.0 NDCG; merge with good prior results
- ColGREP NDCG@10 corrects from 0.577 → 0.692 across 62 repos / 19 languages
- Regenerate cold + warm scatter plots with updated ColGREP position
- Add warm plot to two-column table in benchmarks/README.md
…update all results

- Replace broken rxswift (symlinks) with snapkit (37 real .swift files, no symlinks)
- Add benchmarks/annotations/snapkit.json with 13 queries (semantic/architecture/symbol)
- Run all 4 quality tools on snapkit: semble=0.787, colgrep=0.678, ripgrep=0.164, CRE hybrid=0.790
- Remove rxswift + 3 stale Dart repos from all results files; add snapkit → 63 repos
- Fix --code-only in speed_benchmark.py _run_colgrep (same bug as quality benchmark)
- Update all NDCG@10 values in plot.py and README: semble 0.852→0.854, CRE 0.762→0.765, CRE Hybrid 0.860→0.862, ripgrep 0.123→0.126, ColGREP 0.692 (unchanged)
- Regenerate cold + warm scatter plots with corrected positions
…nical results

- colgrep.py now uses --code-only for all non-bash repos (their default); bash repos (bash-it, bats-core, nvm) run without it, since .sh/.bash files are excluded by --code-only
- Re-ran 7 repos that had buggy zero scores in the old --code-only run (abseil-cpp, curl, ecto, httpx, laravel-framework, redis, tokio)
- Merged: 52 repos from old --code-only run + 7 re-run + 3 bash + 1 snapkit
- Only meaningful change: redis 0.742 → 0.792 (--code-only removed README.md noise)
- Overall ColGREP NDCG: 0.6917 → 0.6925 (negligible; story unchanged)
- Added --no-code-only CLI flag for override; added README note on config
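The per-repo flag selection described above could be sketched like this; the helper name and argument list are hypothetical, though the repo names, the `search` subcommand, and the --code-only / --no-code-only flags come from the commits:

```python
# Hypothetical helper for the flag selection described in the commit.
BASH_REPOS = {"bash-it", "bats-core", "nvm"}  # .sh/.bash excluded by --code-only


def colgrep_search_args(repo: str, no_code_only: bool = False) -> list[str]:
    """Use --code-only everywhere except bash repos, unless overridden."""
    args = ["colgrep", "search"]
    if repo not in BASH_REPOS and not no_code_only:
        args.append("--code-only")
    return args
```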
- semble-hybrid results file: by_language had a stale dart entry (dropped repos) and wrong swift values (from rxswift, not snapkit); summary ndcg10/latency/by_category were all computed from that stale data. Recompute everything from repos[] (63 repos, 19 languages). summary.ndcg10: 0.8509 (lang-weighted, stale) → 0.8544 (repo-weighted); summary.by_category: arch 0.8091 → 0.8034, sem 0.843 → 0.8455 (symbol unchanged)
- run_benchmark.py _save_results: switch summary.ndcg10 to a simple repo-weighted average, consistent with all other tools
- speed_benchmark.py: fix '20 repos' → '19 repos' in summary print; add --code-only (with bash auto-detect) to _run_colgrep, consistent with the quality benchmark
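The two averaging schemes differ whenever languages have unequal repo counts; a minimal illustration (function names are mine, not the benchmark's):

```python
# Illustration of repo-weighted vs language-weighted NDCG summaries.
from statistics import mean


def repo_weighted(ndcg_by_repo: dict[str, float]) -> float:
    """Simple mean over repos: every repo counts equally."""
    return mean(ndcg_by_repo.values())


def language_weighted(ndcg_by_repo: dict[str, float],
                      language_of: dict[str, str]) -> float:
    """Mean of per-language means: every language counts equally."""
    by_language: dict[str, list[float]] = {}
    for repo, score in ndcg_by_repo.items():
        by_language.setdefault(language_of[repo], []).append(score)
    return mean(mean(scores) for scores in by_language.values())
```

With 63 repos over 19 languages the counts are uneven, so the two means diverge (here 0.8509 vs 0.8544); the repo-weighted mean matches how the other tools are summarised.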
- Remove 8 superseded results files (old SHA runs for colgrep, cre, ripgrep, semble-hybrid, plus the two unnamed earlier semble runs)
- Keep both semble-ablations files (they cover complementary modes: raw vs ranked) and all four canonical *-0332378809c5 / colgrep-c8a40fab2235 / speed files
- plot.py cold frontier: restore the incumbent baseline (ripgrep → ColGREP → CRE Hybrid) so semble floats above it, matching the intended 'how far above the incumbent curve' framing; update comment to say so
- Regenerate both scatter plots
… floats above ColGREP is dominated by CRE Hybrid in warm mode (slower and lower NDCG), so the warm incumbent baseline is ripgrep → CRE Hybrid, consistent with the cold plot convention. semble now correctly floats above/left of it.
…latency (cold)' Aligns naming and axis label with the warm plot convention.
…p and grouping in benchmarks
- Rename cat_ndcg10 → category_ndcg10, cat → category in verbose blocks
- Remove dead results=[] / results: list[] = initialisations (unconditionally overwritten)
- Rename r → result in _build_summary and _load_completed (coderankembed)
- Remove remaining string annotations: _AsymmetricWrapper, _CREWrapper
- Rename t0 → started, qlats → query_latencies, g → language_results, lang_* → language_*, idx_ms → index_ms, p50 → p50_ms (speed/run benchmarks)
- Add magic 10 → _DIRECT_TOP_K in run_benchmark._evaluate
- Fix header casing: language → Language, chunks → Chunks in _bench_quality
- Add SearchResult import to run_benchmark for typed declaration
- Split overlong verbose print lines (120 char limit)