434 tests. Offline retrieval-quality harness for RAG systems. No LLM-as-judge.
`recall@k`, `precision@k`, `hit_rate@k`, `MRR`, `nDCG` (binary + graded), `context_precision`, `context_recall`, plus chunking diagnostics and regression diffs — all deterministic, all byte-identical across runs, all free of LLM calls.
Everyone building a RAG system has the same debugging loop:
- Swap chunker → retrieval feels worse.
- Try a different embedder → ???
- Add reranker → ???
- Deploy → user complains 🎉.
The loop is broken because teams never measure retrieval. They measure end-to-end answer quality with an LLM-as-judge, and by then the signal from retrieval is buried under 200B+ generator parameters.
ragcheck fixes this the boring way: standard IR metrics, computed
offline on a gold set, with deterministic output you can diff in CI.
- It does not call an LLM to judge answers. That's `ragas`, `deepeval`, `trulens`, and friends. They are great for end-to-end pipelines. They are the wrong tool for "did my chunker regress?"
- It does not spin up a vector DB. You bring your corpus as plain text files; `ragcheck` chunks + embeds + scores in-process.
- It does not require a GPU. The default `HashEmbedder` runs a full bench in under a second on CPU, so you can gate PRs on it.
- Computes 10 retrieval metrics per query, aggregates over the gold set.
- Ships 5 chunkers and 3 embedders (plus `HashEmbedder` for CI).
- Ships 3 fixtures (BEIR FiQA subset, MS MARCO subset, synthetic needle-in-haystack) so you can start evaluating in 5 seconds.
- Detects regressions between runs with configurable thresholds and exits non-zero when any metric drops below them — drop it in CI.
- Generates byte-identical JSON output across runs on the same inputs. `diff` returns empty. Commits are meaningful.
```bash
pip install ragcheck

# optional: use heavy embedders
pip install "ragcheck[sentence-transformers]"
pip install "ragcheck[openai]"
```

Requires Python 3.9+. Core dependencies: numpy, jinja2. No network, no GPU, no LLM required for any of the default flows.
```bash
# 1. Evaluate the bundled BEIR fixture
ragcheck run \
  --corpus $(python -c 'import ragcheck.fixtures as f; print(f.BEIR_FIQA_DIR / "corpus")') \
  --gold $(python -c 'import ragcheck.fixtures as f; print(f.BEIR_FIQA_DIR / "gold.json")') \
  --out runs/baseline.json

# 2. Change chunker, re-run
ragcheck run --corpus ... --gold ... --out runs/candidate.json \
  --chunker sliding-window --chunker-args '{"size": 120, "stride": 60}'

# 3. Diff
ragcheck diff runs/baseline.json runs/candidate.json
# Exit 1 if any metric drops more than the threshold (default 0.02).
# Pass --no-fail to always exit 0 (e.g. for advisory CI checks).
```

The full CLI surface:

```
ragcheck run --corpus DIR --gold FILE [--chunker NAME] [--embedder NAME] --out FILE
ragcheck diff BASELINE.json HEAD.json [--threshold KEY=VAL ...] [--no-fail]
ragcheck bench [--json]
ragcheck synth --corpus DIR --out FILE [--questions 50] [--seed 42]
ragcheck report --in FILE --format {json,md,html} [--out FILE]
ragcheck chunkers list
ragcheck embedders list
```
`ragcheck run`: Load corpus + gold set, chunk, embed, retrieve top-k per query, compute all metrics, dump deterministic JSON.

```bash
ragcheck run \
  --corpus ./docs \
  --gold ./eval/gold.json \
  --chunker semantic-boundary \
  --chunker-args '{"max_chars": 1200}' \
  --embedder hash \
  --k 1,3,5,10 \
  --out runs/2026-04-16.json
```

`ragcheck diff`: Compute per-metric deltas between two runs. Exits 1 if any metric drops more than its threshold. Default threshold is 0.02 for every metric; override per metric prefix by repeating `--threshold KEY=VAL`. Pass `--no-fail` to always exit 0 (e.g. advisory CI).

```bash
ragcheck diff runs/baseline.json runs/candidate.json \
  --threshold recall@5=0.01 --threshold mrr=0.03
```

`ragcheck bench`: Runs the three bundled fixtures end-to-end and prints a timing table. Completes in under a second on CPU. Use it to smoke-test after install:
```
fixture          | docs | queries | recall@5 | mrr      | ndcg@5   | seconds
-----------------|------|---------|----------|----------|----------|--------
beir_fiqa        |    8 |      10 |   0.5500 |   0.8083 |   0.5463 |   0.006
ms_marco         |    6 |       8 |   0.7500 |   0.8438 |   0.7193 |   0.004
needle_haystack  |   78 |       6 |   0.8333 |   0.6444 |   0.6696 |   0.032
-----------------|------|---------|----------|----------|----------|--------
total elapsed: 0.042s
```
`ragcheck synth`: Generate a synthetic gold set from your corpus when you don't have labels yet. Picks the most distinctive sentence from each document and rephrases it into a question. Deterministic given the same `--seed`.
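For intuition, a toy version of the "most distinctive sentence" step could score sentences by how rare their words are across the corpus (illustrative only; this is not ragcheck's algorithm, and it skips the question-rephrasing step):

```python
import math
import re
from collections import Counter

def most_distinctive_sentence(doc: str, corpus: list[str]) -> str:
    # Score each sentence by the average inverse document frequency of its words,
    # so the sentence least like the rest of the corpus wins.
    df = Counter()
    for other in corpus:
        df.update(set(re.findall(r"[a-z]+", other.lower())))
    n = len(corpus)

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(math.log(n / (1 + df[w])) for w in words) / (len(words) or 1)

    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    return max(sentences, key=score) if sentences else ""
```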
`ragcheck report`: Re-render a saved run as Markdown or HTML (JSON is the input format). Useful for pinning in pull-request descriptions.
`ragcheck chunkers list` / `ragcheck embedders list`: Print the registered chunkers / embedders. Output stays in sync with the source — fail fast if someone registers a duplicate.
All metrics are deterministic, pure functions (no state, no random
seeds, no LLM calls). Each is implemented in one place and hand-verified
against an independent textbook calculation in tests/test_metrics.py.
| Metric | Formula | Notes |
|---|---|---|
| `recall@k` | distinct relevant in top-k / total relevant | deduplicated by design |
| `precision@k` | hits in top-k / k | denominator is k, not retrieved length |
| `hit_rate@k` | 1.0 iff any top-k item is relevant | aka success@k |
| `mrr` | 1 / rank of first relevant item | 0.0 if no relevant item retrieved |
| `dcg@k` / `ndcg@k` | Σ (2^gain − 1) / log₂(i+1), normalised by ideal DCG | binary or graded relevance |
| `context_precision` | LangChain/Ragas-style average precision@i at hit positions | rewards putting relevance at top |
| `context_recall` | recall over the retrieved window | equivalent to recall@len(retrieved) |
| `f1@k` | 2·p·r / (p+r) | 0.0 when both zero |
| `average_precision` | MAP's per-query component | aggregated by caller |
Default k-values are {1, 3, 5, 10}, matching common IR benchmarks.
Override with --k 1,5,20,100.
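To make the table concrete, here is a minimal sketch of three of the metrics as pure functions (illustrative only, not ragcheck's implementation; the function names are mine):

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # distinct relevant docs in the top-k, over the total number of relevant docs
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # reciprocal rank of the first relevant item; 0.0 if nothing relevant was retrieved
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    # DCG = sum of (2^gain - 1) / log2(i + 1) over positions, normalised by the ideal ordering
    def dcg(gain_seq) -> float:
        return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gain_seq, start=1))
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return dcg(gains.get(d, 0.0) for d in retrieved[:k]) / ideal if ideal > 0 else 0.0
```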
Every chunker produces `Chunk(id, doc_id, start, end, text, metadata)`,
where `id = sha1("doc_id|start|end|flavor")`. Deterministic, content-addressed.
| Chunker | When to use |
|---|---|
| `fixed-token` | Baseline; reproducible; no dependencies on text structure |
| `sliding-window` | When you care about edge cases falling on chunk boundaries |
| `sentence` | Short docs, FAQ-style corpora |
| `semantic-boundary` | Long docs where paragraphs are the natural unit |
| `structural-markdown` | Docs / READMEs / wikis with `#` headings and code blocks |
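The content-addressed ID scheme described above is easy to restate as a few lines of Python (a sketch of the described scheme, not a copy of ragcheck's internals):

```python
import hashlib

def chunk_id(doc_id: str, start: int, end: int, flavor: str) -> str:
    # sha1 over "doc_id|start|end|flavor": the same span chunked the same way
    # always gets the same ID, so chunk references stay stable across runs
    key = f"{doc_id}|{start}|{end}|{flavor}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

print(chunk_id("docs/faq.md", 0, 500, "fixed-token"))  # identical on every run
```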
Register your own:
```python
from ragcheck.chunkers import register_chunker, Chunk

class MyChunker:
    name = "my-chunker"

    def __init__(self, max_chars: int = 500) -> None:
        self.max_chars = max_chars

    def chunk(self, doc_id: str, text: str) -> list[Chunk]:
        return [Chunk(doc_id=doc_id, chunk_id="x", text=text, start=0, end=len(text))]

register_chunker("my-chunker", MyChunker)
# get_chunker("my-chunker", max_chars=1200) constructs and returns an instance.
```

`register_chunker` takes a factory (typically a class) that accepts keyword arguments and returns a chunker instance with a `.chunk(doc_id: str, text: str) -> list[Chunk]` method. Pass `override=True` to intentionally replace an existing factory; otherwise duplicate registrations raise `ValueError` to fail fast.
| Embedder | Install with | Notes |
|---|---|---|
| `hash` | (built in) | Deterministic md5 shingle hashing; for CI / bench |
| `sentence-transformers` | `ragcheck[sentence-transformers]` | `all-MiniLM-L6-v2` default; lazy imported |
| `openai` | `ragcheck[openai]` | Disk-cached by sha1(model+text); lazy imported |
| `numpy` | (built in) | Read pre-computed vectors (`vectors.npy` + `ids.json`) |
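For the `numpy` embedder, a minimal sketch of producing that file pair from your real embedding model (the one-row-per-id, same-order layout and the list-of-ids JSON shape are assumptions here; confirm against the embedder's docstring):

```python
import json
import numpy as np

chunk_ids = ["c1", "c2", "c3"]                    # hypothetical ids
vectors = np.random.rand(len(chunk_ids), 384)     # stand-in for real embeddings
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # L2-normalise

np.save("vectors.npy", vectors.astype("float32"))  # one row per id, same order as ids.json
with open("ids.json", "w", encoding="utf-8") as f:
    json.dump(chunk_ids, f)
```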
HashEmbedder produces L2-normalised 128-dimensional vectors from n-gram
shingles. It's not good at semantics — that's intentional. It gives you a
stable, offline baseline so you can isolate changes from your real embedder.
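To see why that works as a baseline, here is a toy shingle-hash embedder in the same spirit (illustrative only; the dimension, shingle length, and hashing details are not ragcheck's actual `HashEmbedder`):

```python
import hashlib
import numpy as np

def hash_embed(text: str, dim: int = 128, n: int = 3) -> np.ndarray:
    # Hash every character n-gram (shingle) into one of `dim` buckets, count
    # occurrences, then L2-normalise: deterministic, offline, no model weights.
    vec = np.zeros(dim, dtype=np.float32)
    for i in range(max(len(text) - n + 1, 1)):
        shingle = text[i : i + n]
        bucket = int(hashlib.md5(shingle.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```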
ragcheck also answers the question "is my chunker doing something insane?"
by computing structural diagnostics on every run:
| Diagnostic | What it tells you |
|---|---|
| `coverage` | Fraction of every source document covered by at least one chunk |
| `duplicate_ratio` | Fraction of chunks that share the exact same content hash |
| `orphan_chunks` | Chunks that never appear in any query's top-k (potential dead weight) |
| `size_histogram` | Distribution of chunk character lengths across 7 buckets |
Example (BEIR FiQA with `fixed-token`, `tokens_per_chunk=40`):

```
     0-49 |
    50-99 | 3
  100-199 | ████████████████████ 17
  200-499 | ██ 2
  500-999 |
1000-1999 |
    2000+ |
```
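As a rough illustration of how the first two diagnostics can be computed (a sketch assuming chunks expose `doc_id`, `start`, `end`, and `text`; ragcheck's own implementation may differ):

```python
import hashlib
from collections import Counter

def coverage(doc_lengths: dict[str, int], chunks) -> dict[str, float]:
    # per-document fraction of characters covered by at least one chunk
    covered: dict[str, set] = {doc_id: set() for doc_id in doc_lengths}
    for c in chunks:
        covered[c.doc_id].update(range(c.start, c.end))
    return {doc_id: (len(covered[doc_id]) / length if length else 0.0)
            for doc_id, length in doc_lengths.items()}

def duplicate_ratio(chunks: list) -> float:
    # fraction of chunks whose exact content hash appears more than once
    counts = Counter(hashlib.sha1(c.text.encode("utf-8")).hexdigest() for c in chunks)
    dupes = sum(n for n in counts.values() if n > 1)
    return dupes / len(chunks) if chunks else 0.0
```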
`ragcheck diff` reports per-metric deltas with a status flag:

- `improved` — metric increased by at least its threshold
- `flat` — change within ±threshold
- `degraded` — metric dropped by more than its threshold
- `new` — metric exists in head but not baseline
- `removed` — metric exists in baseline but not head
Default threshold is 0.02 (2 points) for every metric.
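The rule is small enough to restate in Python (a paraphrase of the flags above, not ragcheck's code):

```python
from typing import Optional

def classify(baseline: Optional[float], head: Optional[float], threshold: float = 0.02) -> str:
    # Mirror the status flags above: sign and size of the delta vs. the threshold.
    if baseline is None:
        return "new"
    if head is None:
        return "removed"
    delta = head - baseline
    if delta >= threshold:
        return "improved"
    if delta < -threshold:
        return "degraded"
    return "flat"

assert classify(0.80, 0.75) == "degraded"  # 5-point drop vs. the 2-point default
assert classify(0.80, 0.81) == "flat"
```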
```bash
# Fail CI if recall@5 drops by more than 1 point, but tolerate 5-point changes
# on the slow-moving context_recall
ragcheck diff base.json head.json \
  --threshold recall@5=0.01 \
  --threshold context_recall=0.05
```

Two runs of `ragcheck run` over the same corpus + gold set + config MUST produce byte-identical JSON.
This is enforced by:
- Chunk IDs from `sha1(doc_id + start + end + flavor)`
- L2-normalised embeddings with stable-sort top-k (`numpy.argsort(..., kind="stable")`)
- `json.dumps(sort_keys=True, ensure_ascii=False)`
- 6-decimal-place float formatting via `format_float`
- No timestamps in run output (add-only via `--embed-timestamp`)
- `corpus_sha1` stamped in output so `diff` warns when the corpus changes
- No `set` iteration order leaks into output
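Two of these choices are easy to get wrong in ad-hoc eval scripts, so here is the general technique in isolation (a sketch of the idea, not ragcheck's code):

```python
import json
import numpy as np

scores = np.array([0.91, 0.91, 0.42, 0.87])  # tied scores are common with quantised similarities

# kind="stable" preserves the original order among ties, so top-k never flaps between runs
top_k = np.argsort(-scores, kind="stable")[:3]

# sorted keys give a deterministic key order; ragcheck additionally pins floats to 6 decimals
print(json.dumps({"top_k": top_k.tolist(), "recall@3": 0.666667}, sort_keys=True, ensure_ascii=False))
```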
The test `test_runner.py::TestDeterminism::test_two_runs_byte_identical` enforces this on every CI build.
| Tool | LLM-as-judge required | Retrieval metrics | Chunking diagnostics | Regression diff | Byte-identical output | Runs offline | Runtime (bench) |
|---|---|---|---|---|---|---|---|
| ragcheck | ❌ | ✅ 10 metrics | ✅ coverage, dup, orphans, sizes | ✅ per-metric thresholds | ✅ | ✅ | <1s |
| ragas | ✅ | partial | ❌ | ❌ | ❌ | ❌ | ~minutes |
| deepeval | ✅ | partial | ❌ | ❌ | ❌ | ❌ | ~minutes |
| BenchmarkQED | ✅ | partial | ❌ | ❌ | ❌ | ❌ | ~minutes |
| trulens | ✅ | partial | ❌ | ❌ | ❌ | ❌ | ~seconds-minutes |
ragcheck does not replace these tools. It sits under them: it catches
retrieval regressions before you spend LLM quota evaluating generations on
top of bad context.
```
+-----------+     +----------+     +-----------+     +-----------+
| Corpus    | --> | Chunker  | --> | Embedder  | --> | Retrieval |
| (text/md) |     | (5)      |     | (4)       |     | top-k     |
+-----------+     +----------+     +-----------+     +-----------+
      |
      v
+-------------+
| Metrics     |
| (pure fns)  |
+-------------+
      |
      v
+-----------------+    +--------+    +------+
| Diagnostics     |    | JSON   |    | Diff |
| (coverage, dup) |    |(byte=) |    | (±δ) |
+-----------------+    +--------+    +------+
```
All layers are swappable; none depend on network or LLMs.
ragcheck complements four sibling projects:
- ctxpack — compresses context windows before retrieval. `ragcheck` measures whether that compression preserves retrieval quality.
- ctxlens — introspects context window utilisation at runtime. `ragcheck` proves what goes in the window.
- promptdiff — regression tests for prompts. `ragcheck` does the same for retrieval, so prompt tests don't mask chunking bugs.
- mocklm — deterministic LLM mocking. `ragcheck` runs without any LLM at all; together they give you a fully offline RAG regression harness.
Each of these stands alone. Together they form the CI substrate for an agent shop that treats retrieval, context, prompts, and generation as independent components you can regress on separately.
Every CLI operation is a pure Python function:
```python
import ragcheck
from ragcheck import RunConfig, run_evaluation, diff_runs

result = run_evaluation(config=RunConfig(
    corpus_path="./docs",
    gold_path="./eval/gold.json",
    chunker="semantic-boundary",
    embedder="hash",
    top_k_cap=20,
))
print(result.summary["recall@5"])
print(result.diagnostics["coverage"])

# Load two saved runs and diff
diff = diff_runs(
    ragcheck.load_result_json("runs/base.json"),
    ragcheck.load_result_json("runs/head.json"),
    fail_on_degraded=True,
)
if diff.degraded:
    raise SystemExit(1)
```

- `beir_fiqa` — 8 finance-question documents + 10 queries drawn from the BEIR FiQA benchmark style (no external download; subset bundled).
- `ms_marco` — 6 passages + 8 queries in MS MARCO's passage-ranking style.
- `needle_haystack` — synthetic needle-in-haystack with 6 needles and 72 distractor haystack documents. Good for stress-testing chunker boundary behaviour without leaking real labels.

All three ship in `ragcheck.fixtures`; `bench` runs all three.
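Evaluating a bundled fixture through the Python API just reuses the `ragcheck.fixtures` paths from the quick start (this assumes the remaining `RunConfig` fields have defaults):

```python
import ragcheck.fixtures as fixtures
from ragcheck import RunConfig, run_evaluation

result = run_evaluation(config=RunConfig(
    corpus_path=str(fixtures.BEIR_FIQA_DIR / "corpus"),
    gold_path=str(fixtures.BEIR_FIQA_DIR / "gold.json"),
    chunker="fixed-token",
    embedder="hash",
))
print(result.summary["recall@5"])
```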
- No reranker evaluation. If your pipeline has a cross-encoder reranker, run it to produce the "retrieved" list and pass that list into `ragcheck`.
- Binary relevance by default. Graded relevance works via the `relevance` mapping in the gold file (`{doc_id: 1.0, doc_id2: 2.5, ...}`), but most `ragcheck` commands optimise for the binary case (see the illustrative sketch after this list).
- No latency simulation. `elapsed_seconds` in bench is wall-clock for the retrieval itself, not a realistic p99 — measure that at your serving tier.
- Corpus hashing is content-level, not metadata-level. Renaming a file changes the hash; editing one character does too. That is intentional (diffs should warn) but worth knowing.
- No built-in multi-query fusion. If you use RAG-fusion or HyDE, run `ragcheck` on each sub-query and aggregate externally.
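For the graded-relevance case, the only shape the docs pin down is the `relevance` mapping itself; a hypothetical gold entry might look like the following (field names other than `relevance` are illustrative, not ragcheck's schema — generate a real file with `ragcheck synth` to see the authoritative format):

```python
# Hypothetical gold entry; only the graded `relevance` mapping shape comes from the docs.
gold_entry = {
    "query": "How are dividends taxed for retail investors?",  # illustrative field name
    "relevance": {"doc_017": 2.5, "doc_003": 1.0},             # graded; use 1.0 everywhere for binary
}
```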
```bash
git clone https://github.com/JSLEEKR/ragcheck.git
cd ragcheck
pip install -e ".[dev,sentence-transformers,openai]"
pytest              # 434 tests pass (doc-drift guard included)
ruff check ragcheck/
mypy ragcheck/
python -m ragcheck bench
```

Contributions welcome. All new metrics need a textbook-verified test. All new chunkers need coverage + diagnostics smoke tests. All public API changes need a `CHANGELOG.md` entry.
MIT © 2026 JSLEEKR