Merged

44 commits
215a047
feat: Add comparison benchmarks (ripgrep, colgrep, coderankembed, abl…
Pringled Apr 18, 2026
4c535fa
feat: Add semble ranking ablations (semble-bm25, semble-semantic)
Pringled Apr 20, 2026
d23e299
refactor: Reorganise benchmarks into baselines/ subpackage
Pringled Apr 20, 2026
1bcd9de
refactor: Extract shared metrics helpers and clean up benchmark files
Pringled Apr 20, 2026
f7242de
refactor: Remove module docstrings and unnecessary future annotations…
Pringled Apr 20, 2026
557d447
feat: Measure and store colgrep index time per repo
Pringled Apr 20, 2026
edb40fa
feat: Add speed_benchmark.py for cold-start index and query latency a…
Pringled Apr 20, 2026
a0b1a13
fix: Force CPU for coderankembed and colgrep in speed benchmark
Pringled Apr 20, 2026
e5129a5
fix: Correct colgrep init --force-cpu flag position; add CPU speed re…
Pringled Apr 20, 2026
2448a68
fix: Skip colgrep for unsupported languages; remove riverpod/dart fro…
Pringled Apr 20, 2026
3a021fd
refactor: Drop Dart from speed benchmark; colgrep does not support it
Pringled Apr 20, 2026
929c08d
chore: Drop Dart repos; add benchmark results and README
Pringled Apr 20, 2026
ffceb1c
docs: Expand benchmarks README with methodology, key findings, and ca…
Pringled Apr 20, 2026
7490c99
docs: Simplify benchmarks README tone and restructure
Pringled Apr 20, 2026
5aa36ed
docs: Add dataset section, fix query count, clean up ablations table
Pringled Apr 20, 2026
444f676
feat: Add speed-vs-quality scatter plot to benchmarks
Pringled Apr 20, 2026
6c971d1
style: Polish scatter plot — all circles, cleaner colors, fix axis ticks
Pringled Apr 20, 2026
847b2dc
style: Remove arrow lines, place labels directly beside bubbles
Pringled Apr 20, 2026
c674a08
fix: Darken ripgrep, correct colgrep params to 16M, fix legend overlap
Pringled Apr 20, 2026
21a9bac
style: Fix legend clipping, add speedup ratio annotations to scatter …
Pringled Apr 20, 2026
8a3246b
style: Switch to cube-root x-axis scale for clearer speed separation
Pringled Apr 20, 2026
7d3810d
style: Uniform label spacing, ripgrep simplification, coderankembed r…
Pringled Apr 20, 2026
0847f11
style: Rename x-axis label to use parentheses instead of em dash
Pringled Apr 20, 2026
bf0bc4f
style: Correct tool names to ColGREP, CodeRankEmbed, CodeRankEmbed Hy…
Pringled Apr 20, 2026
ddcc9ad
docs: Correct tool names to ColGREP, CodeRankEmbed, CodeRankEmbed Hyb…
Pringled Apr 20, 2026
5868772
refactor: Remove try/except import guard, inline constants, add missi…
Pringled Apr 20, 2026
42d80d6
style: Add baseline Pareto frontier line to scatter plot
Pringled Apr 20, 2026
2d358a3
feat: Add warm-query scatter plot, fix label spacing per axis range
Pringled Apr 20, 2026
0332378
fix: Correct ColGREP benchmark methodology and update results
Pringled Apr 20, 2026
93689d2
feat: Replace rxswift with snapkit, fix speed_benchmark --code-only, …
Pringled Apr 20, 2026
c8a40fa
Add per-language NDCG breakdown to README with ColGREP gap analysis
Pringled Apr 20, 2026
3a248b7
Run ColGREP with --code-only (default) for non-bash repos, merge cano…
Pringled Apr 21, 2026
337ae74
fix: Correct stale semble results summary and code consistency
Pringled Apr 21, 2026
160a98f
chore: Drop stale results files; restore cold incumbent frontier in plot
Pringled Apr 21, 2026
5c3f15d
fix: Warm plot uses incumbent frontier (ripgrep → CRE Hybrid), semble…
Pringled Apr 21, 2026
6f4ec56
fix: Rename cold plot to speed_vs_ndcg_cold.png; label x-axis 'Query …
Pringled Apr 21, 2026
20e6670
feat: Add BM25 to speed-vs-quality plots; extend warm xlim; fix sub-m…
Pringled Apr 21, 2026
3add078
feat: Shade incumbent zone below frontier in speed-vs-quality plots
Pringled Apr 21, 2026
f411b55
Update benchmarks
Pringled Apr 21, 2026
4fc4daf
Update benchmarks
Pringled Apr 21, 2026
c66da80
Update benchmarks
Pringled Apr 21, 2026
15f1393
refactor: Remove _HAS_ST guard, fix string annotations, simplify dedu…
Pringled Apr 21, 2026
678d6c7
refactor: Polish variable names across all benchmark files
Pringled Apr 21, 2026
1bb7362
fix: pin chonkie to 1.6.2 to avoid pandas ImportError in 1.6.3
Pringled Apr 21, 2026
194 changes: 186 additions & 8 deletions benchmarks/README.md
@@ -1,23 +1,201 @@
# Benchmarks

Quality and speed benchmarks for `semble`.

- [Main results](#main-results)
- [By language](#by-language)
- [Ablations](#ablations)
- [Dataset](#dataset)
- [Methods](#methods)
- [Running the benchmarks](#running-the-benchmarks)

## Main results

Quality and speed across all methods.

| Method | NDCG@10 | Index | Query p50 |
|---|---:|---:|---:|
| ripgrep | 0.126 | — | 12 ms |
| ColGREP | 0.693 | 5.8 s | 124 ms |
| CodeRankEmbed | 0.765 | 57 s | 16 ms |
| semble | 0.854 | **263 ms** | **1.5 ms** |
| CodeRankEmbed Hybrid | **0.862** | 57 s | 16 ms |

| ![Speed vs quality (cold)](results/speed_vs_ndcg_cold.png) | ![Speed vs quality (warm)](results/speed_vs_ndcg_warm.png) |
|:--:|:--:|
| *Time to first result (index + query) vs NDCG@10* | *Query latency on a warm index vs NDCG@10* |

The 137M-param CodeRankEmbed Hybrid wins NDCG@10 by 0.008. semble wins index time by 218x and query latency by 11x.

NDCG@10 is averaged across all queries. Speed numbers use one repo per language, CPU only: cold-start index time and warm query p50 (median across 5 consecutive runs).
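For reference, a minimal sketch of how NDCG@10 can be computed from graded relevance labels. Linear gains are assumed here; the repo's shared metrics helpers may differ in gain function and tie handling.

```python
import math

def dcg_at_k(gains: list[float], k: int = 10) -> float:
    # Rank 1 is discounted by log2(2), rank 2 by log2(3), and so on.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_10(retrieved: list[str], labels: dict[str, float]) -> float:
    # DCG of the retrieved order, normalized by the best possible order.
    gains = [labels.get(path, 0.0) for path in retrieved]
    ideal = sorted(labels.values(), reverse=True)
    best = dcg_at_k(ideal)
    return dcg_at_k(gains) / best if best > 0 else 0.0
```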

## By language

NDCG@10 per language, sorted by the CodeRankEmbed Hybrid score (CodeRankEmbed is abbreviated to CRE in the table). The best score in each row is bolded.

| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
|---|---:|---:|---:|---:|---:|
| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |

## Ablations

`raw` returns retrieval scores directly; `+ ranking` feeds them through semble's hybrid ranker. A conceptual fusion sketch follows the tables below.

| Retrieval | Raw | + ranking |
|---|---:|---:|
| BM25 | 0.675 | 0.834 |
| potion-code-16M | 0.650 | 0.821 |
| BM25 + potion-code-16M | — | **0.854** |

<details>
<summary>By query category</summary>

| Mode | Architecture | Semantic | Symbol |
|---|---:|---:|---:|
| BM25 raw | 0.628 | 0.676 | 0.719 |
| potion-code-16M raw | 0.626 | 0.666 | 0.629 |
| semble BM25 (+ ranking) | 0.770 | 0.819 | 0.957 |
| semble potion-code-16M (+ ranking) | 0.757 | 0.808 | 0.943 |
| **semble hybrid** | **0.802** | **0.846** | **0.958** |

</details>
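
The sketch below shows reciprocal rank fusion (RRF), one standard way to combine two rankings. It is illustrative only: semble's hybrid ranker does more than plain fusion (it reranks the fused candidates), so treat this as a conceptual stand-in, not the library's implementation.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking contributes 1/(k + rank) per document; larger k damps
    # the influence of top ranks.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            scores[path] = scores.get(path, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf([bm25_top_paths, potion_top_paths])
```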

## Dataset

~1,250 queries over 63 repositories in 19 languages, grouped into three categories:

| Category | Queries | What it tests |
|---|---:|---|
| semantic | 711 | Code that implements a specific behavior or concept |
| architecture | 343 | Design decisions, module boundaries, structural patterns |
| symbol | 204 | Named entity lookup (function, class, type, variable) |
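
For illustration, one record might carry a query, its category, and graded relevance labels over file paths. All field names and values below are hypothetical; see `benchmarks/annotations/` for the real format.

```python
# Hypothetical shape of one benchmark record (illustrative only).
query_record = {
    "query": "retry failed requests with exponential backoff",
    "category": "semantic",        # one of: semantic, architecture, symbol
    "repo": "fastapi",
    "relevant": {                  # file path -> graded relevance label
        "src/client/retry.py": 3,
        "src/client/session.py": 1,
    },
}
```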

<details>
<summary>Notes</summary>

**Languages**: three repos per language (nine for Python): bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotlin, Lua, PHP, Python, Ruby, Rust, Scala, Swift, TypeScript, Zig. Repos are pinned by revision in `repos.json`.

**How the benchmark was built**: queries and ground-truth relevance labels are generated by Claude Sonnet 4.6. The same model is used as LLM-as-judge to verify label quality.

</details>

## Methods

- **[ripgrep](https://github.com/BurntSushi/ripgrep)**: fast regex search over files, included as a raw keyword-match baseline.
- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)**: late-interaction code retrieval built on next-plaid with the [LateOn-Code-edge](https://huggingface.co/lightonai/LateOn-Code-edge) model.
- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)**: 137M-param transformer embedding model for code retrieval. *CodeRankEmbed Hybrid* fuses its dense scores with BM25.
- **[semble](https://github.com/your-repo/semble)**: this library. [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.

## Running the benchmarks

Repos are pinned in `repos.json` and cloned into `~/.cache/semble-bench`:

```bash
uv run python -m benchmarks.sync_repos          # clone / update
uv run python -m benchmarks.sync_repos --check  # verify only
```

All tools run CPU-only. semble uses `minishlab/potion-code-16M`; CodeRankEmbed uses `nomic-ai/CodeRankEmbed` (137M params). The speed benchmark touches one repo per language with a cold-start index and 5 query runs per repo.
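
A minimal sketch of that timing protocol, assuming a hypothetical `build_index` callable that returns an object with a `query` method; the real harness is `benchmarks/speed_benchmark.py`.

```python
import statistics
import time

def time_repo(build_index, repo_path: str, queries: list[str]) -> tuple[float, float]:
    # `build_index` is a hypothetical stand-in for whichever tool is timed.
    start = time.perf_counter()
    index = build_index(repo_path)          # cold start: no cached index
    index_seconds = time.perf_counter() - start

    medians = []
    for query in queries:
        runs = []
        for _ in range(5):                  # 5 consecutive warm runs
            start = time.perf_counter()
            index.query(query)
            runs.append(time.perf_counter() - start)
        medians.append(statistics.median(runs))
    # Returns cold index time and warm query p50 across queries.
    return index_seconds, statistics.median(medians)
```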

<details>
<summary>semble</summary>

```bash
uv run python -m benchmarks.run_benchmark
uv run python -m benchmarks.run_benchmark --repo fastapi --repo axios
uv run python -m benchmarks.run_benchmark --language python
```

Full runs (no `--repo`/`--language` filters) automatically write results to
`benchmarks/results/semble-hybrid-<sha12>.json`.

</details>

<details>
<summary>Speed benchmark</summary>

```bash
uv run python -m benchmarks.speed_benchmark
```

Writes to `benchmarks/results/speed-<sha12>.json`.

</details>

<details>
<summary>Ablations</summary>

```bash
uv run python -m benchmarks.baselines.ablations
uv run python -m benchmarks.baselines.ablations --mode bm25
uv run python -m benchmarks.baselines.ablations --mode semble-semantic
```

</details>

<details>
<summary>ripgrep</summary>

Needs `rg` on `$PATH` (`brew install ripgrep` / `apt install ripgrep`).

```bash
uv run python -m benchmarks.baselines.ripgrep
uv run python -m benchmarks.baselines.ripgrep --no-fixed-strings
```

</details>

<details>
<summary>ColGREP</summary>

Needs the `colgrep` binary on `$PATH`.

```bash
uv run python -m benchmarks.baselines.colgrep
uv run python -m benchmarks.baselines.colgrep --repo fastapi --repo axios
```

Runs with `--code-only` everywhere except bash repos (bash-it, bats-core, nvm), which use `--no-code-only` because ColGREP's code filter excludes `.sh`/`.bash` files.

</details>

<details>
<summary>CodeRankEmbed</summary>

Requires the `benchmark` extra (`uv sync --extra benchmark`).

```bash
uv run python -m benchmarks.baselines.coderankembed
uv run python -m benchmarks.baselines.coderankembed --mode semantic
```

</details>

<details>
<summary>Plots</summary>

```bash
uv run python -m benchmarks.plot
```

Writes `speed_vs_ndcg_cold.png` and `speed_vs_ndcg_warm.png` to `benchmarks/results/`.

</details>
122 changes: 0 additions & 122 deletions benchmarks/annotations/dio.json

This file was deleted.
