13 changes: 11 additions & 2 deletions README.md
@@ -1,7 +1,8 @@

<h2 align="center">
<img width="30%" alt="semble logo" src="https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/semble_logo.png"><br/>
Fast and Accurate Code Search for Agents
Fast and Accurate Code Search for Agents<br/>
<sub>Uses ~98% fewer tokens than grep+read</sub>
</h2>

<div align="center">
@@ -23,7 +24,7 @@

</div>

Semble is a code search library built for agents. It returns the exact code snippets they need instantly, cutting both token usage and waiting time on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](#benchmarks)). Everything runs on CPU with no API keys, GPU, or external services. Run it as an [MCP server](#mcp-server) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.
Semble is a code search library built for agents. It returns the exact code snippets they need instantly, using ~98% fewer tokens than grep+read and cutting latency on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](#benchmarks)). Everything runs on CPU with no API keys, GPU, or external services. Run it as an [MCP server](#mcp-server) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.

## Quickstart

@@ -155,6 +156,14 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo

Semble achieves 99% of the performance of the 137M-parameter [CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) Hybrid, while indexing 218x faster and answering queries 11x faster. See [benchmarks](benchmarks/README.md) for per-language results, ablations, and methodology.

### Token efficiency

Agents using grep+read spend most of their context budget on irrelevant code. Semble returns only the chunks that match, keeping token usage low even at high recall.

![Token efficiency: recall vs. retrieved tokens](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/token_efficiency.png)

Semble uses **98% fewer tokens** on average, and reaches 94% recall at a budget of only 2k tokens, while grep+read needs a full 100k context window to reach 85%. See [benchmarks](benchmarks/README.md#token-efficiency) for details.

## License

MIT
Binary file added assets/images/token_efficiency.png
50 changes: 50 additions & 0 deletions benchmarks/README.md
@@ -3,6 +3,7 @@
Quality and speed benchmarks for `semble`.

- [Main results](#main-results)
- [Token efficiency](#token-efficiency)
- [By language](#by-language)
- [Ablations](#ablations)
- [Dataset](#dataset)
@@ -33,6 +34,37 @@ The 137M-param CodeRankEmbed Hybrid wins NDCG@10 by 0.008. semble wins index tim

NDCG@10 is averaged across all queries. Speed numbers use one repo per language, CPU only: cold-start index time and warm query p50 (median across 5 consecutive runs).

## Token efficiency

Coding agents (Claude Code, OpenCode, etc.) typically find code by running `grep` on keywords and reading the matched files. We model that workflow and compare it against semble's chunk retrieval across our full benchmark of 1251 queries.

![Token efficiency: recall vs. retrieved tokens](../assets/images/token_efficiency.png)

### Expected tokens per query

For each query we count the tokens consumed up to the first relevant hit, charging 32k if the method never finds one, then average across all 1251 queries.

| Method | Expected tokens | Savings |
|---|---:|---:|
| ripgrep + read file | 45,692 | baseline |
| **semble** | **566** | **98% fewer** |
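
As a rough illustration, the expected-tokens number is a tokens-to-first-hit average with a 32k miss penalty. The sketch below is hypothetical — the data structures are not the benchmark script's actual types:

```python
# Illustrative only: the expected-tokens metric described above, assuming each
# query's retrieval is an ordered list of (token_count, is_relevant) pairs.
MISS_PENALTY = 32_000  # charged when a method never retrieves anything relevant


def tokens_until_first_hit(retrieved: list[tuple[int, bool]]) -> int:
    """Tokens consumed up to and including the first relevant unit."""
    consumed = 0
    for token_count, is_relevant in retrieved:
        consumed += token_count
        if is_relevant:
            return consumed
    return MISS_PENALTY


def expected_tokens(per_query: list[list[tuple[int, bool]]]) -> float:
    """Average tokens-to-first-hit across all queries."""
    return sum(tokens_until_first_hit(r) for r in per_query) / len(per_query)
```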

### Recall at fixed token budgets

A relevant file is "covered" once any retrieved unit comes from it.

| Method | 500 | 1k | 2k | 4k | 8k | 16k | 32k |
|---|---:|---:|---:|---:|---:|---:|---:|
| **semble** | **0.685** | **0.849** | **0.938** | **0.976** | **0.991** | **0.996** | **0.996** |
| ripgrep + read file | 0.001 | 0.008 | 0.037 | 0.088 | 0.212 | 0.379 | 0.583 |
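
The recall numbers can be read as: within a given token budget, what fraction of relevant files get at least one retrieved unit. A minimal sketch, assuming each retrieved unit carries its source file and token count (names are illustrative, and treating a unit that would overflow the budget as not retrieved is an assumption made here):

```python
# Illustrative only: recall at a fixed token budget. A query's relevant files
# count as covered once any retrieved unit from them fits within the budget.
def recall_at_budget(
    retrieved: list[tuple[str, int]],  # (file_path, token_count) in rank order
    relevant_files: set[str],
    budget: int,
) -> float:
    covered: set[str] = set()
    spent = 0
    for file_path, token_count in retrieved:
        if spent + token_count > budget:
            break  # assumption: units that would overflow the budget are dropped
        spent += token_count
        if file_path in relevant_files:
            covered.add(file_path)
    return len(covered) / len(relevant_files) if relevant_files else 0.0
```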

<details>
<summary>Methodology</summary>

semble returns the top-50 ranked chunks. `ripgrep+read` splits the query into keywords (dropping stopwords and short words), runs `rg --fixed-strings --ignore-case` for each keyword, then reads matched files in full, ranked by how many distinct keywords they contain. Both methods search the same set of file types and ignored directories. Tokens are counted with `cl100k_base` via `tiktoken`. A relevant file is "covered" once any retrieved unit overlaps its annotated span. A hypothetical sketch of the `ripgrep+read` workflow follows this section.

</details>
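
A minimal, hypothetical sketch of the `ripgrep+read` baseline described in the methodology above. The stopword list, helper name, and use of `--files-with-matches` are illustrative assumptions rather than the benchmark's exact implementation:

```python
# Hypothetical sketch of the ripgrep+read baseline; stopword list, ranking
# details, and helper names are illustrative, not the benchmark's exact code.
import subprocess
from collections import Counter
from pathlib import Path

import tiktoken

_STOPWORDS = {"the", "a", "an", "of", "to", "in", "for", "and", "or", "is", "how", "where"}
_ENC = tiktoken.get_encoding("cl100k_base")


def grep_read_baseline(query: str, repo: Path) -> list[tuple[Path, int]]:
    """Rank files by distinct keyword hits and report their full-file token cost."""
    keywords = [w for w in query.lower().split() if len(w) > 2 and w not in _STOPWORDS]
    hits: Counter[Path] = Counter()
    for keyword in keywords:
        proc = subprocess.run(
            ["rg", "--fixed-strings", "--ignore-case", "--files-with-matches", keyword, str(repo)],
            capture_output=True,
            text=True,
        )
        for line in proc.stdout.splitlines():
            hits[Path(line)] += 1  # one point per distinct keyword matching the file
    ranked = [path for path, _ in hits.most_common()]
    # Reading a matched file "in full" costs its entire cl100k_base token count.
    return [(path, len(_ENC.encode(path.read_text(errors="ignore")))) for path in ranked]
```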

## By language

NDCG@10 per language, sorted by CodeRankEmbed Hybrid (CRE in the table). Best score per row is bolded.
@@ -238,6 +270,24 @@ uv run python -m benchmarks.baselines.coderankembed --mode semantic

</details>

<details>
<summary>Context-efficiency benchmark</summary>

Requires the `benchmark` extra (`uv sync --extra benchmark`) and `rg` on `$PATH`.

```bash
# Recall vs. token-budget across all queries; plots automatically.
uv run python -m benchmarks.token_efficiency recall
uv run python -m benchmarks.token_efficiency recall --repo fastapi

# Regenerate the plot from a saved recall payload.
uv run python -m benchmarks.token_efficiency plot
```

Writes `benchmarks/results/token-efficiency-<sha12>.json` and `assets/images/token_efficiency.png`.

</details>

<details>
<summary>Plots</summary>

33 changes: 6 additions & 27 deletions benchmarks/baselines/ablations.py
@@ -11,11 +11,11 @@
from benchmarks.data import (
RepoSpec,
Task,
apply_task_filters,
available_repo_specs,
add_filter_args,
grouped_tasks,
load_tasks,
load_filtered_tasks,
save_results,
summarize_modes,
)
from benchmarks.metrics import ndcg_at_k, target_rank
from semble import SembleIndex
@@ -163,12 +163,10 @@ def _bench(

def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="semble ablation benchmarks.")
parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.")
parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.")
add_filter_args(parser, verbose=True)
parser.add_argument(
"--mode", action="append", default=[], choices=_MODES, help="Mode(s) to evaluate (default: all)."
)
parser.add_argument("--verbose", action="store_true", help="Print per-query results.")
return parser.parse_args()


@@ -177,12 +175,7 @@ def main() -> None:
args = _parse_args()
modes = args.mode or _MODES

repo_specs = available_repo_specs()
tasks = apply_task_filters(
load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None
)
if not tasks:
raise SystemExit("No benchmark tasks matched the requested filters.")
repo_specs, tasks = load_filtered_tasks(args.repo or None, args.language or None)

print("Loading model...", file=sys.stderr)
started = time.perf_counter()
@@ -210,21 +203,7 @@
summary = {
"tool": "semble-ablations",
"model": _DEFAULT_MODEL_NAME,
"by_mode": {
mode: {
"avg_ndcg10": round(
sum(r.ndcg10 for r in results if r.mode == mode)
/ max(1, sum(1 for r in results if r.mode == mode)),
4,
),
"avg_p50_ms": round(
sum(r.p50_ms for r in results if r.mode == mode)
/ max(1, sum(1 for r in results if r.mode == mode)),
1,
),
}
for mode in modes
},
"by_mode": summarize_modes(results, modes),
"repos": [asdict(r) for r in results],
}
print(json.dumps(summary, indent=2))
46 changes: 11 additions & 35 deletions benchmarks/baselines/coderankembed.py
@@ -13,12 +13,12 @@
from benchmarks.data import (
RepoSpec,
Task,
apply_task_filters,
available_repo_specs,
add_filter_args,
grouped_tasks,
load_tasks,
load_filtered_tasks,
results_path,
save_results,
summarize_modes,
)
from benchmarks.metrics import ndcg_at_k, target_rank
from semble import SembleIndex
@@ -30,21 +30,18 @@


class _AsymmetricWrapper:
"""Wrap SentenceTransformer with asymmetric query/document prompts.

Single-element lists are treated as queries; larger batches as documents.
max_seq_length is capped to avoid OOM on CPU with long chunks.
"""
"""Wrap SentenceTransformer with asymmetric query/document prompts."""

def __init__(self, model: SentenceTransformer, max_seq_length: int = 512) -> None:
self._model = model
self._model.max_seq_length = max_seq_length

def encode(self, texts: Sequence[str]) -> np.ndarray:
"""Encode texts with query or document prompt based on batch size."""
if len(texts) == 1:
return self._model.encode(texts, prompt_name="query", batch_size=1) # type: ignore[return-value]
return self._model.encode(texts, batch_size=1) # type: ignore[return-value]
text_list = list(texts)
if len(text_list) == 1:
return self._model.encode(text_list, prompt_name="query", batch_size=1) # type: ignore[return-value]
return self._model.encode(text_list, batch_size=1) # type: ignore[return-value]


@dataclass(frozen=True)
@@ -116,21 +113,7 @@ def _build_summary(results: list[RepoResult], modes: list[str]) -> dict[str, obj
return {
"tool": "coderankembed",
"model": _MODEL_NAME,
"by_mode": {
mode: {
"avg_ndcg10": round(
sum(result.ndcg10 for result in results if result.mode == mode)
/ max(1, sum(1 for result in results if result.mode == mode)),
4,
),
"avg_p50_ms": round(
sum(result.p50_ms for result in results if result.mode == mode)
/ max(1, sum(1 for result in results if result.mode == mode)),
1,
),
}
for mode in modes
},
"by_mode": summarize_modes(results, modes),
"repos": [asdict(result) for result in results],
}

@@ -228,12 +211,10 @@ def _bench(

def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Benchmark CodeRankEmbed on the semble benchmark suite.")
parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.")
parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.")
add_filter_args(parser, verbose=True)
parser.add_argument(
"--mode", action="append", default=[], choices=["semantic", "hybrid"], help="Search mode(s) (default: both)."
)
parser.add_argument("--verbose", action="store_true", help="Print per-query results.")
return parser.parse_args()


@@ -243,12 +224,7 @@ def main() -> None:
modes = args.mode or ["semantic", "hybrid"]
is_full_run = not args.repo and not args.language

repo_specs = available_repo_specs()
tasks = apply_task_filters(
load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None
)
if not tasks:
raise SystemExit("No benchmark tasks matched the requested filters.")
repo_specs, tasks = load_filtered_tasks(args.repo or None, args.language or None)

print(f"Loading {_MODEL_NAME}...", file=sys.stderr)
started = time.perf_counter()
53 changes: 8 additions & 45 deletions benchmarks/baselines/colgrep.py
@@ -9,14 +9,14 @@
from benchmarks.data import (
RepoSpec,
Task,
apply_task_filters,
available_repo_specs,
add_filter_args,
grouped_tasks,
load_tasks,
load_filtered_tasks,
results_path,
save_results,
)
from benchmarks.metrics import file_rank, ndcg_at_k
from benchmarks.tools import run_colgrep_files

_COLGREP = "colgrep"
_TOP_K = 10
@@ -34,25 +34,6 @@ class RepoResult:
index_ms: float


def _run_colgrep(query: str, benchmark_dir: Path, top_k: int, *, code_only: bool = True) -> list[str]:
"""Return list of absolute file paths from colgrep JSON output."""
cmd = [_COLGREP, "--force-cpu"]
if code_only:
cmd.append("--code-only")
cmd += ["--json", "-k", str(top_k), query, str(benchmark_dir)]
try:
proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
except subprocess.TimeoutExpired:
return []
if proc.returncode != 0:
return []
try:
data = json.loads(proc.stdout)
except json.JSONDecodeError:
return []
return [item["unit"]["file"] for item in data if "unit" in item and "file" in item["unit"]]


def _evaluate_repo(
tasks: list[Task], benchmark_dir: Path, *, code_only: bool = True, verbose: bool = False
) -> tuple[float, float]:
@@ -65,7 +46,7 @@ def _evaluate_repo(
file_paths: list[str] = []
for _ in range(_LATENCY_RUNS):
started = time.perf_counter()
file_paths = _run_colgrep(task.query, benchmark_dir, _TOP_K, code_only=code_only)
file_paths = run_colgrep_files(task.query, benchmark_dir, top_k=_TOP_K, code_only=code_only)
query_latencies.append((time.perf_counter() - started) * 1000)
latencies.append(sorted(query_latencies)[_LATENCY_RUNS // 2])

@@ -88,11 +69,7 @@


def _init_index(path: Path) -> tuple[bool, float]:
"""Build (or rebuild) the colgrep index at path; return (non_empty, elapsed_ms).

:param path: Directory to index.
:return: Tuple of (non_empty, index_ms) where non_empty is False if colgrep reported 0 files.
"""
"""Build the ColGREP index and return whether it indexed files plus elapsed time."""
subprocess.run([_COLGREP, "clear", str(path)], capture_output=True, timeout=30)
cmd = [_COLGREP, "init", "--force-cpu", "-y", str(path)]
started = time.perf_counter()
Expand All @@ -106,14 +83,7 @@ def _init_index(path: Path) -> tuple[bool, float]:


def _resolve_path(spec: RepoSpec) -> tuple[Path, float]:
"""Return the path ColGREP should index and elapsed index build time.

Tries benchmark_dir first; if that yields 0 files falls back to checkout_dir,
which is the project root ColGREP needs to discover the .git boundary.

:param spec: Repo spec providing benchmark_dir and checkout_dir.
:return: Tuple of (effective_path, index_ms).
"""
"""Return the path ColGREP should index and elapsed index build time."""
path = spec.benchmark_dir
ok, index_ms = _init_index(path)
if ok:
@@ -174,9 +144,7 @@ def _load_completed(out_path: Path) -> dict[str, RepoResult]:

def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Benchmark ColGREP on the semble benchmark suite.")
parser.add_argument("--repo", action="append", default=[], help="Limit to one or more repo names.")
parser.add_argument("--language", action="append", default=[], help="Limit to one or more languages.")
parser.add_argument("--verbose", action="store_true", help="Print per-query results.")
add_filter_args(parser, verbose=True)
parser.add_argument(
"--no-code-only",
action="store_true",
@@ -229,12 +197,7 @@
args = _parse_args()
is_full_run = not args.repo and not args.language

repo_specs = available_repo_specs()
tasks = apply_task_filters(
load_tasks(repo_specs=repo_specs), repos=args.repo or None, languages=args.language or None
)
if not tasks:
raise SystemExit("No benchmark tasks matched the requested filters.")
repo_specs, tasks = load_filtered_tasks(args.repo or None, args.language or None)

repo_tasks = grouped_tasks(tasks)
