13 changes: 11 additions & 2 deletions README.md
@@ -1,7 +1,8 @@

<h2 align="center">
<img width="30%" alt="semble logo" src="https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/semble_logo.png"><br/>
Fast and Accurate Code Search for Agents
Fast and Accurate Code Search for Agents<br/>
<sub>Uses ~98% fewer tokens than grep+read</sub>
</h2>

<div align="center">
@@ -23,7 +24,7 @@

</div>

Semble is a code search library built for agents. It returns the exact code snippets they need instantly, cutting both token usage and waiting time on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](#benchmarks)). Everything runs on CPU with no API keys, GPU, or external services. Run it as an [MCP server](#mcp-server) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.
Semble is a code search library built for agents. It returns the exact code snippets they need instantly, using ~98% fewer tokens than grep+read and cutting latency on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](#benchmarks)). Everything runs on CPU with no API keys, GPU, or external services. Run it as an [MCP server](#mcp-server) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.

## Quickstart

@@ -155,6 +156,14 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo

Semble achieves 99% of the performance of the 137M-parameter [CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) Hybrid, while indexing 218x faster and answering queries 11x faster. See [benchmarks](benchmarks/README.md) for per-language results, ablations, and methodology.

### Token efficiency

Agents using grep+read spend most of their context budget on irrelevant code. Semble returns only the chunks that match, keeping token usage low even at high recall.

![Token efficiency: recall vs. retrieved tokens](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/token_efficiency.png)

Semble uses **98% fewer tokens** on average, and reaches 94% recall at a budget of only 2k tokens, while grep+read needs a full 100k context window to reach 85%. See [benchmarks](benchmarks/README.md#token-efficiency) for details.

## License

MIT
Binary file added assets/images/token_efficiency.png
50 changes: 50 additions & 0 deletions benchmarks/README.md
@@ -3,6 +3,7 @@
Quality and speed benchmarks for `semble`.

- [Main results](#main-results)
- [Token efficiency](#token-efficiency)
- [By language](#by-language)
- [Ablations](#ablations)
- [Dataset](#dataset)
@@ -33,6 +34,37 @@ The 137M-param CodeRankEmbed Hybrid wins NDCG@10 by 0.008. semble wins index tim

NDCG@10 is averaged across all queries. Speed numbers use one repo per language, CPU only: cold-start index time and warm query p50 (median across 5 consecutive runs).

## Token efficiency

Coding agents (Claude Code, OpenCode, etc.) typically find code by running `grep` on keywords and reading the matched files. We model that workflow and compare it against semble's chunk retrieval across our full benchmark of 1251 queries.

![Token efficiency: recall vs. retrieved tokens](../assets/images/token_efficiency.png)

### Expected tokens per query

For each query we count the tokens consumed up to the first relevant hit, charging 32k if the method never finds one, then average across all 1251 queries.

| Method | Expected tokens | Savings |
|---|---:|---:|
| ripgrep + read file | 45,692 | baseline |
| **semble** | **566** | **98% fewer** |
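
As a rough illustration, the expected-tokens number is a tokens-to-first-hit average with a 32k miss penalty. The sketch below is hypothetical — the data structures are not the benchmark script's actual types:

```python
# Illustrative only: the expected-tokens metric described above, assuming each
# query's retrieval is an ordered list of (token_count, is_relevant) pairs.
MISS_PENALTY = 32_000  # charged when a method never retrieves anything relevant


def tokens_until_first_hit(retrieved: list[tuple[int, bool]]) -> int:
    """Tokens consumed up to and including the first relevant unit."""
    consumed = 0
    for token_count, is_relevant in retrieved:
        consumed += token_count
        if is_relevant:
            return consumed
    return MISS_PENALTY


def expected_tokens(per_query: list[list[tuple[int, bool]]]) -> float:
    """Average tokens-to-first-hit across all queries."""
    return sum(tokens_until_first_hit(r) for r in per_query) / len(per_query)
```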

### Recall at fixed token budgets

A relevant file is "covered" once any retrieved unit comes from it.

| Method | 500 | 1k | 2k | 4k | 8k | 16k | 32k |
|---|---:|---:|---:|---:|---:|---:|---:|
| **semble** | **0.685** | **0.849** | **0.938** | **0.976** | **0.991** | **0.996** | **0.996** |
| ripgrep + read file | 0.001 | 0.008 | 0.037 | 0.088 | 0.212 | 0.379 | 0.583 |
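
The recall numbers can be read as: within a given token budget, what fraction of relevant files get at least one retrieved unit. A minimal sketch, assuming each retrieved unit carries its source file and token count (names are illustrative, and treating a unit that would overflow the budget as not retrieved is an assumption made here):

```python
# Illustrative only: recall at a fixed token budget. A query's relevant files
# count as covered once any retrieved unit from them fits within the budget.
def recall_at_budget(
    retrieved: list[tuple[str, int]],  # (file_path, token_count) in rank order
    relevant_files: set[str],
    budget: int,
) -> float:
    covered: set[str] = set()
    spent = 0
    for file_path, token_count in retrieved:
        if spent + token_count > budget:
            break  # assumption: units that would overflow the budget are dropped
        spent += token_count
        if file_path in relevant_files:
            covered.add(file_path)
    return len(covered) / len(relevant_files) if relevant_files else 0.0
```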

<details>
<summary>Methodology</summary>

semble returns the top-50 ranked chunks. `ripgrep+read` splits the query into keywords (dropping stopwords and short words), runs `rg --fixed-strings --ignore-case` for each keyword, then reads matched files in full, ranked by how many distinct keywords they contain. Both methods search the same set of file types and ignored directories. Tokens are counted with `cl100k_base` via `tiktoken`. A relevant file is "covered" once any retrieved unit overlaps its annotated span. A hypothetical sketch of the `ripgrep+read` workflow follows this section.

</details>
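
A minimal, hypothetical sketch of the `ripgrep+read` baseline described in the methodology above. The stopword list, helper name, and use of `--files-with-matches` are illustrative assumptions rather than the benchmark's exact implementation:

```python
# Hypothetical sketch of the ripgrep+read baseline; stopword list, ranking
# details, and helper names are illustrative, not the benchmark's exact code.
import subprocess
from collections import Counter
from pathlib import Path

import tiktoken

_STOPWORDS = {"the", "a", "an", "of", "to", "in", "for", "and", "or", "is", "how", "where"}
_ENC = tiktoken.get_encoding("cl100k_base")


def grep_read_baseline(query: str, repo: Path) -> list[tuple[Path, int]]:
    """Rank files by distinct keyword hits and report their full-file token cost."""
    keywords = [w for w in query.lower().split() if len(w) > 2 and w not in _STOPWORDS]
    hits: Counter[Path] = Counter()
    for keyword in keywords:
        proc = subprocess.run(
            ["rg", "--fixed-strings", "--ignore-case", "--files-with-matches", keyword, str(repo)],
            capture_output=True,
            text=True,
        )
        for line in proc.stdout.splitlines():
            hits[Path(line)] += 1  # one point per distinct keyword matching the file
    ranked = [path for path, _ in hits.most_common()]
    # Reading a matched file "in full" costs its entire cl100k_base token count.
    return [(path, len(_ENC.encode(path.read_text(errors="ignore")))) for path in ranked]
```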

## By language

NDCG@10 per language, sorted by CodeRankEmbed Hybrid (CRE in the table). Best score per row is bolded.
@@ -238,6 +270,24 @@ uv run python -m benchmarks.baselines.coderankembed --mode semantic

</details>

<details>
<summary>Context-efficiency benchmark</summary>

Requires the `benchmark` extra (`uv sync --extra benchmark`) and `rg` on `$PATH`.

```bash
# Recall vs. token-budget across all queries; plots automatically.
uv run python -m benchmarks.token_efficiency recall
uv run python -m benchmarks.token_efficiency recall --repo fastapi

# Regenerate the plot from a saved recall payload.
uv run python -m benchmarks.token_efficiency plot
```

Writes `benchmarks/results/token-efficiency-<sha12>.json` and `assets/images/token_efficiency.png`.

</details>

<details>
<summary>Plots</summary>

33 changes: 6 additions & 27 deletions benchmarks/baselines/ablations.py
@@ -11,11 +11,11 @@
from benchmarks.data import (
RepoSpec,
Task,
apply_task_filters,
available_repo_specs,
add_filter_args,
grouped_tasks,
load_tasks,
load_filtered_tasks,
save_results,
summarize_modes,
)
from benchmarks.metrics import ndcg_at_k, target_rank
from semble import SembleIndex
@@ -163,12 +163,10 @@ def _bench(

def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="semble ablation benchmarks.")
parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.")
parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.")
add_filter_args(parser, verbose=True)
parser.add_argument(
"--mode", action="append", default=[], choices=_MODES, help="Mode(s) to evaluate (default: all)."
)
parser.add_argument("--verbose", action="store_true", help="Print per-query results.")
return parser.parse_args()


@@ -177,12 +175,7 @@ def main() -> None:
args = _parse_args()
modes = args.mode or _MODES

repo_specs = available_repo_specs()
tasks = apply_task_filters(
load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None
)
if not tasks:
raise SystemExit("No benchmark tasks matched the requested filters.")
repo_specs, tasks = load_filtered_tasks(args.repo or None, args.language or None)

print("Loading model...", file=sys.stderr)
started = time.perf_counter()
@@ -210,21 +203,7 @@
summary = {
"tool": "semble-ablations",
"model": _DEFAULT_MODEL_NAME,
"by_mode": {
mode: {
"avg_ndcg10": round(
sum(r.ndcg10 for r in results if r.mode == mode)
/ max(1, sum(1 for r in results if r.mode == mode)),
4,
),
"avg_p50_ms": round(
sum(r.p50_ms for r in results if r.mode == mode)
/ max(1, sum(1 for r in results if r.mode == mode)),
1,
),
}
for mode in modes
},
"by_mode": summarize_modes(results, modes),
"repos": [asdict(r) for r in results],
}
print(json.dumps(summary, indent=2))
46 changes: 11 additions & 35 deletions benchmarks/baselines/coderankembed.py
@@ -13,12 +13,12 @@
from benchmarks.data import (
RepoSpec,
Task,
apply_task_filters,
available_repo_specs,
add_filter_args,
grouped_tasks,
load_tasks,
load_filtered_tasks,
results_path,
save_results,
summarize_modes,
)
from benchmarks.metrics import ndcg_at_k, target_rank
from semble import SembleIndex
@@ -30,21 +30,18 @@


class _AsymmetricWrapper:
"""Wrap SentenceTransformer with asymmetric query/document prompts.

Single-element lists are treated as queries; larger batches as documents.
max_seq_length is capped to avoid OOM on CPU with long chunks.
"""
"""Wrap SentenceTransformer with asymmetric query/document prompts."""

def __init__(self, model: SentenceTransformer, max_seq_length: int = 512) -> None:
self._model = model
self._model.max_seq_length = max_seq_length

def encode(self, texts: Sequence[str]) -> np.ndarray:
"""Encode texts with query or document prompt based on batch size."""
if len(texts) == 1:
return self._model.encode(texts, prompt_name="query", batch_size=1) # type: ignore[return-value]
return self._model.encode(texts, batch_size=1) # type: ignore[return-value]
text_list = list(texts)
if len(text_list) == 1:
return self._model.encode(text_list, prompt_name="query", batch_size=1) # type: ignore[return-value]
return self._model.encode(text_list, batch_size=1) # type: ignore[return-value]


@dataclass(frozen=True)
@@ -116,21 +113,7 @@ def _build_summary(results: list[RepoResult], modes: list[str]) -> dict[str, obj
return {
"tool": "coderankembed",
"model": _MODEL_NAME,
"by_mode": {
mode: {
"avg_ndcg10": round(
sum(result.ndcg10 for result in results if result.mode == mode)
/ max(1, sum(1 for result in results if result.mode == mode)),
4,
),
"avg_p50_ms": round(
sum(result.p50_ms for result in results if result.mode == mode)
/ max(1, sum(1 for result in results if result.mode == mode)),
1,
),
}
for mode in modes
},
"by_mode": summarize_modes(results, modes),
"repos": [asdict(result) for result in results],
}

@@ -228,12 +211,10 @@ def _bench(

def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Benchmark CodeRankEmbed on the semble benchmark suite.")
parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.")
parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.")
add_filter_args(parser, verbose=True)
parser.add_argument(
"--mode", action="append", default=[], choices=["semantic", "hybrid"], help="Search mode(s) (default: both)."
)
parser.add_argument("--verbose", action="store_true", help="Print per-query results.")
return parser.parse_args()


@@ -243,12 +224,7 @@ def main() -> None:
modes = args.mode or ["semantic", "hybrid"]
is_full_run = not args.repo and not args.language

repo_specs = available_repo_specs()
tasks = apply_task_filters(
load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None
)
if not tasks:
raise SystemExit("No benchmark tasks matched the requested filters.")
repo_specs, tasks = load_filtered_tasks(args.repo or None, args.language or None)

print(f"Loading {_MODEL_NAME}...", file=sys.stderr)
started = time.perf_counter()
53 changes: 8 additions & 45 deletions benchmarks/baselines/colgrep.py
@@ -9,14 +9,14 @@
from benchmarks.data import (
RepoSpec,
Task,
apply_task_filters,
available_repo_specs,
add_filter_args,
grouped_tasks,
load_tasks,
load_filtered_tasks,
results_path,
save_results,
)
from benchmarks.metrics import file_rank, ndcg_at_k
from benchmarks.tools import run_colgrep_files

_COLGREP = "colgrep"
_TOP_K = 10
@@ -34,25 +34,6 @@ class RepoResult:
index_ms: float


def _run_colgrep(query: str, benchmark_dir: Path, top_k: int, *, code_only: bool = True) -> list[str]:
"""Return list of absolute file paths from colgrep JSON output."""
cmd = [_COLGREP, "--force-cpu"]
if code_only:
cmd.append("--code-only")
cmd += ["--json", "-k", str(top_k), query, str(benchmark_dir)]
try:
proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
except subprocess.TimeoutExpired:
return []
if proc.returncode != 0:
return []
try:
data = json.loads(proc.stdout)
except json.JSONDecodeError:
return []
return [item["unit"]["file"] for item in data if "unit" in item and "file" in item["unit"]]


def _evaluate_repo(
tasks: list[Task], benchmark_dir: Path, *, code_only: bool = True, verbose: bool = False
) -> tuple[float, float]:
@@ -65,7 +46,7 @@ def _evaluate_repo(
file_paths: list[str] = []
for _ in range(_LATENCY_RUNS):
started = time.perf_counter()
file_paths = _run_colgrep(task.query, benchmark_dir, _TOP_K, code_only=code_only)
file_paths = run_colgrep_files(task.query, benchmark_dir, top_k=_TOP_K, code_only=code_only)
query_latencies.append((time.perf_counter() - started) * 1000)
latencies.append(sorted(query_latencies)[_LATENCY_RUNS // 2])

@@ -88,11 +69,7 @@


def _init_index(path: Path) -> tuple[bool, float]:
"""Build (or rebuild) the colgrep index at path; return (non_empty, elapsed_ms).

:param path: Directory to index.
:return: Tuple of (non_empty, index_ms) where non_empty is False if colgrep reported 0 files.
"""
"""Build the ColGREP index and return whether it indexed files plus elapsed time."""
subprocess.run([_COLGREP, "clear", str(path)], capture_output=True, timeout=30)
cmd = [_COLGREP, "init", "--force-cpu", "-y", str(path)]
started = time.perf_counter()
Expand All @@ -106,14 +83,7 @@ def _init_index(path: Path) -> tuple[bool, float]:


def _resolve_path(spec: RepoSpec) -> tuple[Path, float]:
"""Return the path ColGREP should index and elapsed index build time.

Tries benchmark_dir first; if that yields 0 files falls back to checkout_dir,
which is the project root ColGREP needs to discover the .git boundary.

:param spec: Repo spec providing benchmark_dir and checkout_dir.
:return: Tuple of (effective_path, index_ms).
"""
"""Return the path ColGREP should index and elapsed index build time."""
path = spec.benchmark_dir
ok, index_ms = _init_index(path)
if ok:
@@ -174,9 +144,7 @@ def _load_completed(out_path: Path) -> dict[str, RepoResult]:

def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Benchmark ColGREP on the semble benchmark suite.")
parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.")
parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.")
parser.add_argument("--verbose", action="store_true", help="Print per-query results.")
add_filter_args(parser, verbose=True)
parser.add_argument(
"--no-code-only",
action="store_true",
@@ -229,12 +197,7 @@
args = _parse_args()
is_full_run = not args.repo and not args.language

repo_specs = available_repo_specs()
tasks = apply_task_filters(
load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None
)
if not tasks:
raise SystemExit("No benchmark tasks matched the requested filters.")
repo_specs, tasks = load_filtered_tasks(args.repo or None, args.language or None)

repo_tasks = grouped_tasks(tasks)
