
feat: Update benchmarks, add baselines, add plots #26

Merged
Pringled merged 44 commits into main from add-comparisons on Apr 21, 2026
Conversation

@Pringled
Member

No description provided.

Pringled added 30 commits April 18, 2026 11:41
…ations)

- Add bench_ripgrep.py, bench_colgrep.py, bench_coderankembed.py, bench_ablations.py
- Extend benchmarks/data.py with save_results, results_path, current_sha helpers
- Update run_benchmark.py to use save_results with method-prefixed filenames
- Add benchmark extra (sentence-transformers, einops) to pyproject.toml
- Add [build-system] table so semble is installable as a package via uv
- Save results: semble-hybrid (0.850), coderankembed-hybrid (0.860), semble-ablations, colgrep (0.551), ripgrep (0.117)
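The commit names three helpers in benchmarks/data.py (`save_results`, `results_path`, `current_sha`). A minimal sketch of what they might look like, assuming the method-prefixed, SHA-suffixed filename scheme visible in the results file names elsewhere in this PR (e.g. `semble-hybrid-0332378809c5.json`); exact signatures and the results directory are assumptions:

```python
import json
import subprocess
from pathlib import Path

# Assumed location of benchmark result files.
RESULTS_DIR = Path("benchmarks/results")


def current_sha(length: int = 12) -> str:
    """Short git SHA of HEAD, used to tag result files with the code version."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return out.stdout.strip()[:length]


def results_path(method: str, sha: str) -> Path:
    """Method-prefixed, SHA-suffixed filename, e.g. semble-hybrid-<sha>.json."""
    return RESULTS_DIR / f"{method}-{sha}.json"


def save_results(method: str, sha: str, payload: dict) -> Path:
    """Write one tool's results as pretty-printed JSON and return the path."""
    path = results_path(method, sha)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Tagging filenames with the SHA makes it cheap to spot (and later delete) superseded runs, which a later commit in this PR does.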
New modes run retrieval through the full semble ranking stack (boost,
query boost, rerank) with alpha=0 or alpha=1 to isolate each retrieval
source independently from the hybrid fusion step.
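The alpha=0 / alpha=1 ablation trick above can be sketched as a convex combination of per-source scores. This is an assumption about semble's fusion (the real implementation may normalise scores first); the point is only that the endpoints of alpha isolate each retrieval source while still passing through the rest of the ranking stack:

```python
def fuse_scores(
    lexical: dict[str, float],
    semantic: dict[str, float],
    alpha: float,
) -> dict[str, float]:
    """Hybrid fusion sketch: alpha=0 is purely lexical, alpha=1 purely
    semantic (convention assumed). Missing entries score 0."""
    keys = set(lexical) | set(semantic)
    return {
        k: (1 - alpha) * lexical.get(k, 0.0) + alpha * semantic.get(k, 0.0)
        for k in keys
    }
```

Running the full pipeline with these endpoint values, rather than bypassing fusion entirely, keeps boost/query-boost/rerank effects in the measurement, which is what makes the ablation comparable to the hybrid run.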
Move comparison and ablation scripts into benchmarks/baselines/ to
separate the day-to-day main benchmark (run_benchmark.py) from the
one-shot baselines (ripgrep, colgrep, coderankembed, ablations).
- Add benchmarks/metrics.py with dcg, ndcg_at_k, target_rank, file_rank
- Import from metrics.py in all 5 benchmark files (no duplication)
- Remove verbose module docstrings (replaced with one-liners)
- Remove # --- section separator banner comments
- Fix E501 line-too-long violations in ablations.py and coderankembed.py
…cross all tools

Measures cold index build + query p50 for semble, coderankembed, colgrep,
and ripgrep on a curated 1-per-language subset (20 repos). Uses colgrep clear
before each init to ensure cold builds. colgrep index timing now also tracked
in the full colgrep baseline.
Ensures all tools run on CPU for a fair apples-to-apples comparison.
CRE was previously using MPS (Apple Silicon GPU); colgrep may have used
CoreML via ONNX. Both now use device='cpu' / --force-cpu respectively.
…sults

- Move --force-cpu after 'init' subcommand so it is parsed correctly
- All 20-repo CPU speed benchmark results saved
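The argument-ordering fix matters because `--force-cpu` is an option of the `init` subcommand, so it must appear after it. A tiny sketch of building the command (flag placement per the fix above; everything else about colgrep's CLI is assumed):

```python
def colgrep_init_cmd(force_cpu: bool = True) -> list[str]:
    """Build the colgrep init invocation. --force-cpu must follow the
    'init' subcommand to be parsed correctly."""
    cmd = ["colgrep", "init"]
    if force_cpu:
        cmd.append("--force-cpu")
    return cmd
```

Separating command construction from execution also makes the benchmark's exact invocations trivially unit-testable.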
…m results

colgrep indexes 0 files for Dart — detect this and skip rather than recording
a meaningless 14ms. Riverpod entry removed from speed results JSON and colgrep
summary recomputed over 19 repos (5.75s index, 124ms query p50).
Keeping riverpod created an uneven denominator (colgrep 19 repos vs others
20). Simpler to drop Dart entirely and run a clean 19-language benchmark.
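The recompute-after-dropping step can be sketched as a small summary function that filters out empty-index repos before aggregating, matching the "5.75s index, 124ms query p50 over 19 repos" recomputation above. Entry key names are assumptions:

```python
from statistics import median


def summarize(entries: list[dict]) -> dict:
    """Recompute totals after dropping repos whose index build silently
    indexed 0 files (e.g. Dart under colgrep): a 14 ms 'query' over an
    empty index is not a real measurement. Key names are assumed."""
    kept = [e for e in entries if e["indexed_files"] > 0]
    return {
        "repos": len(kept),
        "index_s": round(sum(e["index_s"] for e in kept), 2),
        "query_p50_ms": median(e["query_ms"] for e in kept),
    }
```

Filtering before aggregation, rather than zeroing the entry, is what keeps the denominator consistent across tools.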
- Remove dart repos (riverpod, dio, http-dart) from repos.json and annotations — colgrep does not support Dart so they were excluded from all benchmarks
- Delete the three Dart annotation files
- Update result JSON summaries for all five canonical result files (63 repos / 19 languages, Dart excluded)
- Add main results table and ablations section to benchmarks/README.md
…tegory breakdown

Add table of contents, setup config table, results-by-category table,
key findings prose, and ablations by category in a collapsible block.
Structure follows patterns from semhash/pyversity/model2vec benchmark READMEs.
Add benchmarks/plot.py generating a log-scale scatter plot of
time-to-first-result vs NDCG@10 for all methods. Marker size
scales with model parameter count. Add matplotlib to the
benchmark optional-dependencies extra.
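A minimal sketch of the plot described above: log-scale x axis, NDCG@10 on y, marker size scaled by parameter count. The coordinates and parameter counts below are placeholders, not the benchmark's real numbers (those live in the results JSON files):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# name: (time_to_first_result_s, ndcg_at_10, params_millions) -- illustrative only
methods = {
    "ripgrep": (0.05, 0.13, 0.0),
    "ColGREP": (5.0, 0.69, 100.0),
    "CRE Hybrid": (30.0, 0.86, 500.0),
    "semble": (8.0, 0.85, 30.0),
}

fig, ax = plt.subplots()
for name, (seconds, ndcg, params_m) in methods.items():
    ax.scatter(seconds, ndcg, s=30 + params_m, label=name)

ax.set_xscale("log")  # latencies span ~3 orders of magnitude
ax.set_xlabel("Time to first result (s, cold)")
ax.set_ylabel("NDCG@10")
ax.legend()
fig.savefig("quality_vs_latency_cold.png", dpi=150)
```

Using marker area for model size gives a third dimension without cluttering the axes, which suits the "quality vs latency" framing used in the README.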
- Remove --code-only flag from colgrep search (excluded bash/shell files entirely)
- Add --force-cpu to both init and search for reproducibility
- _init_index now clears before rebuilding and detects silent 0-file failures
- _resolve_path falls back to checkout root when benchmark_dir yields 0 files
- Drop rxswift (Sources/RxSwift uses symlinks ColGREP cannot follow)
- Re-run the 10 repos that had 0.0 NDCG; merge with good prior results
- ColGREP NDCG@10 corrects from 0.577 → 0.692 across 62 repos / 19 languages
- Regenerate cold + warm scatter plots with updated ColGREP position
- Add warm plot to two-column table in benchmarks/README.md
…update all results

- Replace broken rxswift (symlinks) with snapkit (37 real .swift files, no symlinks)
- Add benchmarks/annotations/snapkit.json with 13 queries (semantic/architecture/symbol)
- Run all 4 quality tools on snapkit: semble=0.787, colgrep=0.678, ripgrep=0.164, CRE hybrid=0.790
- Remove rxswift + 3 stale Dart repos from all results files; add snapkit → 63 repos
- Fix --code-only in speed_benchmark.py _run_colgrep (same bug as quality benchmark)
- Update all NDCG@10 values in plot.py and README: semble 0.852→0.854, CRE 0.762→0.765,
  CRE Hybrid 0.860→0.862, ripgrep 0.123→0.126, ColGREP 0.692 (unchanged)
- Regenerate cold + warm scatter plots with corrected positions
Pringled added 14 commits April 20, 2026 17:22
…nical results

- colgrep.py now uses --code-only for all non-bash repos (their default);
  bash repos (bash-it, bats-core, nvm) run without it since .sh/.bash files
  are excluded by --code-only
- Re-ran 7 repos that had buggy zero scores in the old --code-only run
  (abseil-cpp, curl, ecto, httpx, laravel-framework, redis, tokio)
- Merged: 52 repos from old --code-only run + 7 re-run + 3 bash + 1 snapkit
- Only meaningful change: redis 0.742 -> 0.792 (--code-only removed README.md noise)
- Overall ColGREP NDCG: 0.6917 -> 0.6925 (negligible; story unchanged)
- Added --no-code-only CLI flag for override; added README note on config
- semble-hybrid results file: by_language had stale dart entry (dropped
  repos) and wrong swift values (from rxswift, not snapkit); summary
  ndcg10/latency/by_category were all computed from that stale data.
  Recompute everything from repos[] (63 repos, 19 languages).
  summary.ndcg10: 0.8509 (lang-weighted, stale) -> 0.8544 (repo-weighted)
  summary.by_category: arch 0.8091->0.8034, sem 0.843->0.8455 (symbol unchanged)
- run_benchmark.py _save_results: switch summary.ndcg10 to simple
  repo-weighted average, consistent with all other tools
- speed_benchmark.py: fix '20 repos' -> '19 repos' in summary print;
  add --code-only (with bash auto-detect) to _run_colgrep, consistent
  with the quality benchmark
- Remove 8 superseded results files (old SHA runs for colgrep, cre,
  ripgrep, semble-hybrid, plus the two unnamed earlier semble runs)
- Keep both semble-ablations files (cover complementary modes: raw vs
  ranked) and all four canonical *-0332378809c5 / colgrep-c8a40fab2235
  / speed files
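The repo-weighted vs language-weighted distinction behind the 0.8509 → 0.8544 correction is worth making concrete. A sketch (field names assumed): the stale figure averaged per-language means, so a language with one repo counted as much as one with five; the fix averages over repos directly:

```python
from collections import defaultdict


def repo_weighted_ndcg(repos: list[dict]) -> float:
    """Simple mean over repos: every repo counts equally."""
    return sum(r["ndcg10"] for r in repos) / len(repos)


def language_weighted_ndcg(repos: list[dict]) -> float:
    """Mean of per-language means: languages count equally regardless of
    how many repos they contain, which is how the stale figure arose."""
    by_language: dict[str, list[float]] = defaultdict(list)
    for r in repos:
        by_language[r["language"]].append(r["ndcg10"])
    per_language = [sum(v) / len(v) for v in by_language.values()]
    return sum(per_language) / len(per_language)
```

With uneven repos-per-language (63 repos over 19 languages), the two averages genuinely differ, so standardising on the repo-weighted mean across all tools is what makes the headline numbers comparable.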
- plot.py cold frontier: restore incumbent baseline (ripgrep → ColGREP
  → CRE Hybrid) so semble floats above it, matching the intended
  'how far above the incumbent curve' framing; update comment to say so
- Regenerate both scatter plots
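The "incumbent baseline" the plot restores is a latency/quality staircase: walking the incumbents from fastest to slowest, a tool joins the frontier only if it beats the best NDCG seen so far. A generic sketch (the PR hard-codes the ordering ripgrep → ColGREP → CRE Hybrid; this computes it from the points):

```python
def frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """(latency, ndcg) points on the incumbent staircase: sorted by latency,
    keep only points that improve on the best NDCG seen so far."""
    best = float("-inf")
    steps = []
    for latency, ndcg in sorted(points):
        if ndcg > best:
            best = ndcg
            steps.append((latency, ndcg))
    return steps
```

Drawing this staircase under the scatter makes "how far above the incumbent curve" a visual property rather than a claim the reader has to reconstruct.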
… floats above

ColGREP is dominated by CRE Hybrid in warm mode (slower and lower NDCG),
so the warm incumbent baseline is ripgrep → CRE Hybrid, consistent with
the cold plot convention. semble now correctly floats above/left of it.
…latency (cold)'

Aligns naming and axis label with the warm plot convention.
- Rename cat_ndcg10→category_ndcg10, cat→category in verbose blocks
- Remove dead results=[]/results:list[]= initialisations (unconditionally overwritten)
- Rename r→result in _build_summary and _load_completed (coderankembed)
- Remove remaining string annotations: _AsymmetricWrapper, _CREWrapper
- Rename t0→started, qlats→query_latencies, g→language_results,
  lang_*→language_*, idx_ms→index_ms, p50→p50_ms (speed/run benchmarks)
- Add magic 10→_DIRECT_TOP_K in run_benchmark._evaluate
- Fix header casing: language→Language, chunks→Chunks in _bench_quality
- Add SearchResult import to run_benchmark for typed declaration
- Split overlong verbose print lines (120 char limit)
@Pringled Pringled merged commit 2018720 into main Apr 21, 2026
8 checks passed
@Pringled Pringled deleted the add-comparisons branch April 22, 2026 05:05
