A reproducible, multi-tool benchmark for secret / credential detection in code, tickets, wikis, and logs. Honest by design: real + synthetic, leakage-safe, stratified by difficulty, multilingual, and not rigged toward any tool. Ships a runnable leaderboard across Prowl, gitleaks, TruffleHog, detect-secrets, DeepSecrets, and DeepPass2.
Datasheet follows Datasheets for Datasets (Gebru et al.). Version 2.0.
Part of Prowl, a high-precision secret scanner. The cases are also
published as a Hugging Face dataset config:
Podric/prowl-secrets-corpus.
git clone https://github.com/Lercas/prowlbench && cd prowlbench
# score any scanner against the 24,603 cases (run the tool you want to compare)
python run_prowlbench.py --tools gitleaks,trufflehog,detect-secrets,deepsecrets
# to reproduce Prowl's own rows you need the prowl binary on PATH (build from the main repo)
# and, for the 3-way config, the encoder from Hugging Face:
PROWL_ENCODER=/path/to/prowl-secret-encoder python run_prowlbench.py --tools prowl-3way

prowlbench.jsonl is the case set (one JSON object per line: id, text, label, type, tier, source, lang, origin, span). run_prowlbench.py runs each tool as a real subprocess and writes
prowlbench_leaderboard.json. The reproduction/ scripts rebuild the benchmark and the ensemble
analysis from the training corpus. They require that corpus
(Podric/prowl-secrets-corpus) and the
main repo's src/ on PYTHONPATH; the runner and the cases above do not.
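A minimal sketch of reading the case set and tallying it along one axis. The field names follow the schema listed above; the value encodings used in the sample records (label as 0/1, tier as "T1".."T4") are assumptions, so check them against prowlbench_stats.json before relying on them.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_cases(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one case dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def tally(cases: Iterable[dict], field: str) -> Counter:
    """Count cases by the value of one schema field (e.g. 'tier' or 'source')."""
    return Counter(case.get(field) for case in cases)


# Illustrative records only: the encodings of label and tier are assumed.
sample = [
    '{"id": "a1", "text": "password=hunter2", "label": 1, "tier": "T2", "source": "code"}',
    '{"id": "b2", "text": "build id 123e4567-e89b-12d3-a456-426614174000", "label": 0, "tier": "T4", "source": "log"}',
]
tier_counts = tally(iter_cases(sample), "tier")
print(tier_counts)

# Against the real file:
# with open("prowlbench.jsonl", encoding="utf-8") as fh:
#     print(tally(iter_cases(fh), "tier"))
```

Because `iter_cases` accepts any iterable of lines, the same code works on an open file handle without loading all 24,603 cases into memory.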
Existing secret scanners are evaluated on each vendor's private set or a handful of hand-picked cases, so numbers aren't comparable and false-positive behaviour (the #1 cause of alert fatigue) is under-measured. ProwlBench is a single, public, balanced test set with a documented protocol and a multi-tool leaderboard, so detection quality (especially precision on hard negatives) can be compared apples-to-apples.
24,603 labelled snippets, each: {id, text, label, type, tier, source, lang, origin, span}.
| Axis | Breakdown |
|---|---|
| Label | 16,552 positive (contains a secret) / 8,051 negative |
| Difficulty tier | T1 structured 7,033 (AWS/GitHub/Stripe/JWT/PEM...) · T2 generic-context 3,519 (api_key/password=) · T3 free-form & multilingual 6,000 (passwords in prose, non-English) · T4 hard-negative 8,051 (the false-positive suite) |
| Source | code 9,732 · confluence 4,621 · jira 4,521 · log 3,769 · slack 1,960 |
| Language | en + fr/ja/ru/zh/de/es/it and other languages, including non-Latin scripts (multilingual passwords in prose) |
| Origin | augmented (diversity-augmentation, held out by origin) · synthetic (format/checksum-correct generated) · real (CredData-obfuscated + HF PII, held out by origin) · curated (hand-verified adversarial) |
| Types | 26 secret types + ~20 negative classes (see the taxonomy) |
Exact counts: prowlbench_stats.json.
- T1 structured: has a checksum or distinctive prefix; a good scanner should get ~all of these.
- T2 generic-context: generic key/password with a naming anchor; needs context, not regex alone.
- T3 free-form & multilingual: passwords in natural-language prose, often non-English. This is where pure-regex tools collapse and context models do well.
- T4 hard-negative: the discriminator. High-entropy non-secrets: content hashes (md5/sha/git), UUID/ULID/ObjectId, bcrypt hashes, SSH public keys, data-URI blobs, SRI integrity, placeholders (changeme, ${ENV}, <your-key>), near-misses (broken-checksum tokens), and PII. A tool's T4 false-positive rate is its alert-fatigue score.
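To see why T4 is hard, here is a small sketch (not Prowl's detection logic) that computes character-level Shannon entropy for a few made-up non-secrets. Content hashes and UUIDs score about as high as real tokens, so an entropy threshold alone would flag them. The example strings are illustrative and are not taken from the benchmark.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Made-up T4-style non-secrets; none of these is a credential.
hard_negatives = {
    "md5 content hash": "d41d8cd98f00b204e9800998ecf8427e",  # md5 of the empty string
    "uuid": "123e4567-e89b-12d3-a456-426614174000",
    "placeholder": "changeme",
}
for name, value in hard_negatives.items():
    print(f"{name:18s} {shannon_entropy(value):.2f} bits/char")
```

A scanner that wants a low T4 false-positive rate therefore needs structure and context checks (prefixes, checksums, surrounding names) in addition to any entropy signal.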
- Source pool: the project's 503k-record corpus: format-correct synthetic secrets injected into realistic carriers with exact spans, real obfuscated secrets (Samsung CredData), real PII prose (HF ai4privacy + Nemotron-PII), and agent-authored carriers across many languages/frameworks.
- Leakage safety (with a known, disclosed caveat): ProwlBench is drawn from a value-disjoint + template-group-disjoint TEST split within its own source corpus, with the real origins (CredData/HF) held out by origin. Caveat, found by an adversarial audit and disclosed here: the ML models train on a separate corpus (corpus_all/corpus_flywheel) that shares a synthetic base with the benchmark's source and is split independently, so the two splits are not coordinated. An audit measured ~1,200 benchmark cases (~5%, mostly low-entropy generated passwords) whose value or full text also appears in the models' train split. The earlier claim that no value appears in training was therefore wrong. This inflates the ML rows (prowl-lr/prowl-3way) by a few points of recall, most on the multilingual prose slice; the pure-Go cascade row (prowl) trains on nothing and is unaffected. A coordinated cross-corpus split is the fix in progress.
- Stratified sampling: capped per (tier × type × source) bucket, real-first, deduplicated by value, so no single type/source/language dominates.
- Ambiguity removal: negatives where a high-entropy value sits inside a secret-named assignment (e.g. API_KEY=<40-hex-sha>) are excluded: flagging those is defensible, so they are not a fair false-positive test. T4 negatives are unambiguous non-secrets.
- Curated suite: a hand-verified set of adversarial cases (bcrypt, ssh-pubkeys, data-URIs, multilingual passwords, near-misses) is appended for guaranteed-clean coverage.
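The stratified-sampling step can be sketched as follows. This is an illustration of the stated rules (cap per bucket, real-first, deduplicated by value), not the code in reproduction/build_prowlbench.py. The `value` field and the `cap` parameter are hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Iterable


def stratified_sample(cases: Iterable[dict], cap: int) -> list[dict]:
    """Keep at most `cap` cases per (tier, type, source) bucket.

    Real-origin cases are considered first, and each value is kept only once.
    """
    # Stable sort: real cases first, input order preserved within each group.
    ordered = sorted(cases, key=lambda c: c["origin"] != "real")
    per_bucket: dict[tuple, int] = defaultdict(int)
    seen_values: set[str] = set()
    kept = []
    for case in ordered:
        bucket = (case["tier"], case["type"], case["source"])
        value = case["value"]  # hypothetical field holding the secret / non-secret value
        if value in seen_values or per_bucket[bucket] >= cap:
            continue
        seen_values.add(value)
        per_bucket[bucket] += 1
        kept.append(case)
    return kept


pool = [
    {"tier": "T1", "type": "aws", "source": "code", "origin": "synthetic", "value": "v1"},
    {"tier": "T1", "type": "aws", "source": "code", "origin": "synthetic", "value": "v2"},
    {"tier": "T1", "type": "aws", "source": "code", "origin": "real", "value": "v3"},
    {"tier": "T1", "type": "aws", "source": "log", "origin": "synthetic", "value": "v1"},
]
picked = stratified_sample(pool, cap=2)
print([c["value"] for c in picked])
```

With `cap=2`, the real case wins a slot in its bucket ahead of the second synthetic one, and the repeated value in the log bucket is dropped as a duplicate.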
Detection task: does the tool flag ≥1 secret in the snippet? Reported per tool:
- precision / recall / F1 / accuracy overall;
- per-tier recall (T1 to T3) and T4 false-positive rate;
- per-language recall (multilingual generalisation);
- per-source (code vs non-code).
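A minimal sketch of the snippet-level scoring described above, assuming `label` is 1/0 and tiers are encoded as "T1".."T4" (both encodings are assumptions). It is not the code in run_prowlbench.py; `flagged_ids` is a hypothetical input holding the ids on which a tool reported at least one finding.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float
    accuracy: float
    t4_fp_rate: float


def _ratio(num: float, den: float) -> float:
    """Division that returns 0.0 instead of raising on a zero denominator."""
    return num / den if den else 0.0


def score(cases: list[dict], flagged_ids: set[str]) -> Scores:
    """Snippet-level scoring: a case counts as flagged if the tool reported >=1 finding."""
    tp = fp = tn = fn = 0
    t4_total = t4_flagged = 0
    for case in cases:
        flagged = case["id"] in flagged_ids
        positive = bool(case["label"])  # assumes label is 1/0 (or True/False)
        if positive:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
        if case["tier"] == "T4":  # assumes tiers are encoded as "T1".."T4"
            t4_total += 1
            t4_flagged += flagged
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return Scores(
        precision=precision,
        recall=recall,
        f1=_ratio(2 * precision * recall, precision + recall),
        accuracy=_ratio(tp + tn, len(cases)),
        t4_fp_rate=_ratio(t4_flagged, t4_total),
    )


demo = [
    {"id": "p1", "label": 1, "tier": "T1"},
    {"id": "p2", "label": 1, "tier": "T3"},
    {"id": "n1", "label": 0, "tier": "T4"},
    {"id": "n2", "label": 0, "tier": "T4"},
]
result = score(demo, flagged_ids={"p1", "n1"})
print(result)
```

Per-tier, per-language, and per-source recall follow the same pattern: filter `cases` on the relevant field before calling `score`.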
All tools are run at their default operating point with realistic file extensions per source (code → .py, wiki → .md, log → .log) so extension-sensitive tools apply their real rules.
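One way to give each snippet a realistic extension is to write it to a temporary file before invoking a scanner. The sketch below does only that step. The .py/.md/.log extensions come from the text above; mapping the confluence, jira, and slack sources to .md, and falling back to .txt, are assumptions of this sketch, as is using the case id as a file name.

```python
import tempfile
from pathlib import Path

# code/wiki/log extensions are from the README; the rest is assumed.
EXTENSION_BY_SOURCE = {
    "code": ".py",
    "confluence": ".md",
    "jira": ".md",
    "slack": ".md",
    "log": ".log",
}


def materialize(cases: list[dict], root: Path) -> dict[str, Path]:
    """Write each snippet to <root>/<id><ext> and return a mapping id -> path."""
    paths = {}
    for case in cases:
        ext = EXTENSION_BY_SOURCE.get(case["source"], ".txt")
        path = root / f"{case['id']}{ext}"
        path.write_text(case["text"], encoding="utf-8")
        paths[case["id"]] = path
    return paths


demo = [
    {"id": "c1", "source": "code", "text": "api_key = load_key()\n"},
    {"id": "w1", "source": "confluence", "text": "Rotate the staging password monthly.\n"},
    {"id": "l1", "source": "log", "text": "GET /health 200\n"},
]
with tempfile.TemporaryDirectory() as tmp:
    written = materialize(demo, Path(tmp))
    suffixes = {case_id: path.suffix for case_id, path in written.items()}
print(suffixes)
```

Each tool can then be pointed at the directory (or at individual files) with its own default options.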
Run: python run_prowlbench.py → prowlbench_leaderboard.json.
Results on ProwlBench v2.0 (24,603 cases, default operating points, sorted by F1):
Read this before the table - reproducibility first. The table below is exactly the rows that a clean checkout + python run_prowlbench.py produces: the shipped Go binary (cascade) plus the competitors. The optional ensemble rows (cascade ∪ LR, and the 3-way with the multilingual encoder) need models that are gitignored / on Hugging Face, not in a clean checkout, so they are reported separately below and are not the headline. Two more honesty notes: any row with a trained model carries the ~5% train/test leakage disclosed in §3.2 (the cascade trains on nothing); and 57% of positives are generic passwords / keys in prose, which provider-verifier engines (trufflehog) and prefix regex (gitleaks) don't target by design, so a structured-token-heavy distribution narrows the gap sharply.
| tool (clean checkout - reproducible) | precision | recall | F1 | accuracy |
|---|---|---|---|---|
| Prowl (shipped binary, cascade) | 0.951 | 0.823 | 0.883 | 0.853 |
| DeepPass2 | 0.893 | 0.567 | 0.694 | 0.663 |
| gitleaks | 0.931 | 0.413 | 0.573 | 0.585 |
| detect-secrets | 0.848 | 0.423 | 0.564 | 0.561 |
| DeepSecrets | 0.921 | 0.309 | 0.462 | 0.517 |
| TruffleHog | 0.940 | 0.303 | 0.458 | 0.518 |
Optional ML operating points - NOT in a clean checkout (need the gitignored LR + encoder); numbers are from research runs and vary by encoder version:
| tool (research-only, not in the artifact) | precision | recall | F1 |
|---|---|---|---|
| Prowl (cascade ∪ LR) | 0.940 | 0.872 | 0.905 |
| Prowl (3-way, + multilingual encoder) | 0.936 | 0.989 | 0.962 |
Three operating points let you pick the precision/recall trade-off:
- cascade-only: pure Go, no ML, fastest; the shipped, reproducible row - highest precision here (0.951) and a 0.09 hard-negative false-positive rate. The choice when precision matters most.
- cascade ∪ LR: adds a char+word TF-IDF logistic regression (train_text_lr.py (main repo), gated to non-code), trains in <1 min on CPU. Closes most of the multilingual-prose gap. Needs the gitignored LR.
- 3-way: adds a fine-tuned multilingual context encoder (train_encoder.py (main repo)); tops recall (0.989) and per-language recall (en 0.98 / de 1.00 / fr 1.00 / es 0.99 / ru 1.00), but needs the gitignored encoder - not reproducible from a clean checkout. Use cascade-only when FPs are costliest.
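The three operating points can be read as a union of detectors. The sketch below assumes each stage is a function from snippet text to a boolean, that the LR stage is skipped for the code source (the "gated to non-code" rule above), and that the encoder vote is combined by plain union, which the text does not specify. The toy detectors are stand-ins, not the real models.

```python
from typing import Callable, Optional

Detector = Callable[[str], bool]  # snippet text -> True if it flags a secret


def ensemble_flag(
    case: dict,
    cascade: Detector,
    lr: Optional[Detector] = None,
    encoder: Optional[Detector] = None,
) -> bool:
    """Union of stages: cascade-only, cascade + LR, or 3-way, depending on what is passed."""
    text = case["text"]
    if cascade(text):
        return True
    # LR is gated to non-code sources.
    if lr is not None and case["source"] != "code" and lr(text):
        return True
    # Assumption: the encoder vote is OR-ed in without gating.
    if encoder is not None and encoder(text):
        return True
    return False


# Toy stand-ins for illustration only.
def toy_cascade(text: str) -> bool:
    return "AKIA" in text


def toy_lr(text: str) -> bool:
    return "password" in text.lower()


wiki = {"text": "the password is tulip-42", "source": "confluence"}
code = {"text": "password = get_password()", "source": "code"}

print(ensemble_flag(wiki, toy_cascade))              # cascade-only
print(ensemble_flag(wiki, toy_cascade, lr=toy_lr))   # cascade + LR
print(ensemble_flag(code, toy_cascade, lr=toy_lr))   # LR gated off for code
```

Since each added stage can only turn a "no flag" into a "flag", recall cannot drop as stages are added, and the false-positive rate cannot fall.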
Per-tier recall (T4 = false-positive rate on hard negatives, lower is better):
| tool | T1 structured | T2 generic | T3 free-form/multi | T4 FP-rate |
|---|---|---|---|---|
| Prowl (3-way) | 1.00 | 0.98 | 0.98 | 0.14 |
| Prowl (cascade ∪ LR) | 0.99 | 0.68 | 0.85 | 0.12 |
| Prowl (cascade-only) | 0.98 | 0.68 | 0.72 | 0.09 |
| DeepPass2 | 0.68 | 0.16 | 0.67 | 0.14 |
| gitleaks | 0.65 | 0.35 | 0.17 | 0.06 |
| detect-secrets | 0.58 | 0.34 | 0.29 | 0.16 |
| DeepSecrets | 0.43 | 0.20 | 0.23 | 0.05 |
| TruffleHog | 0.60 | 0.00 | 0.13 | 0.04 |
Prowl's 3-way ensemble leads recall in every tier: 1.00 on structured tokens and 0.98 on free-form and multilingual prose, where regex tools collapse (gitleaks 0.17, TruffleHog 0.13). The low T4 false-positive rates of gitleaks, DeepSecrets, and TruffleHog (0.04-0.06) come with low recall (0.30-0.41): they flag little, so they raise few false alarms but also miss most secrets. Prowl's three stages let you place the operating point: the cascade covers structured leaks at the highest precision on the board, and the LR + encoder add the multilingual and free-form tail, accepting more false positives in exchange for recall as you move up the stack.
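The per-tier and overall tables can be cross-checked with a little arithmetic. Overall recall is the mean of the T1 to T3 recalls weighted by positive count, and precision and recall together imply a false-positive rate on the 8,051 negatives. The sketch below uses only the tier sizes from the breakdown table and the published 3-way row. Because the inputs are rounded, expect agreement to about two decimals.

```python
# Positives per tier and total negatives, from the breakdown table.
TIER_SIZES = {"T1": 7033, "T2": 3519, "T3": 6000}
NEGATIVES = 8051  # T4
POSITIVES = sum(TIER_SIZES.values())  # 16,552


def overall_recall(per_tier: dict[str, float]) -> float:
    """Count-weighted mean of per-tier recall over T1..T3."""
    return sum(per_tier[tier] * n for tier, n in TIER_SIZES.items()) / POSITIVES


def implied_fp_rate(precision: float, recall: float) -> float:
    """False-positive rate on negatives implied by overall precision and recall."""
    tp = recall * POSITIVES
    fp = tp * (1 - precision) / precision
    return fp / NEGATIVES


# Prowl (3-way) row: per-tier recall 1.00 / 0.98 / 0.98, precision 0.936, recall 0.989.
three_way_recall = overall_recall({"T1": 1.00, "T2": 0.98, "T3": 0.98})
three_way_fp = implied_fp_rate(precision=0.936, recall=0.989)
print(f"recall from tiers: {three_way_recall:.3f}")
print(f"implied T4 FP rate: {three_way_fp:.3f}")
```

The same two functions apply to any row, which makes them a quick sanity check when regenerating the leaderboard.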
- No live secrets. Synthetic values are generated (never real). CredData values are format-preserving obfuscated (safe by construction). HF values come from public PII datasets.
- Treat any real-looking value as potentially compromised: do not network-verify benchmark values.
- The benchmark file is suitable for public release; the underlying raw corpus stays gitignored.
- Real coverage skews to CredData (code) + HF (prose); SecretBench/FPSecretBench (gated) would broaden it (see the main repo).
- Multilingual positives are a minority slice; non-English recall numbers have wider error bars.
- "Detection" is snippet-level (flag/no-flag), not span-exact; a span-F1 variant is future work.
- Generated cases (augmented + synthetic) outweigh real ~5:2; the leaderboard reports origin so real-only slices can be cut.
python reproduction/build_prowlbench.py # rebuild from the held-out corpus split -> prowlbench.jsonl
python run_prowlbench.py # run all tools -> leaderboard + per-tier/-lang tables

Tools: Prowl (github.com/Lercas/prowl), gitleaks ≥8.30, TruffleHog ≥3.95, detect-secrets ≥1.5, DeepSecrets ≥2.0 (ntoskernel/deepsecrets), DeepPass2 (gneeraj/deeppass2-bert). Schema + stats in prowlbench_stats.json.