ProwlBench: A Secret-Detection Benchmark

A reproducible, multi-tool benchmark for secret / credential detection in code, tickets, wikis, and logs. Honest by design: real + synthetic, leakage-safe, stratified by difficulty, multilingual, and not rigged toward any tool. Ships a runnable leaderboard across Prowl, gitleaks, TruffleHog, detect-secrets, DeepSecrets, and DeepPass2.

Datasheet follows Datasheets for Datasets (Gebru et al.). Version 2.0.

Part of Prowl, a high-precision secret scanner. The cases are also published as a Hugging Face dataset config: Podric/prowl-secrets-corpus.

Quick start

git clone https://github.com/Lercas/prowlbench && cd prowlbench

# score any scanner against the 24,603 cases (run the tool you want to compare)
python run_prowlbench.py --tools gitleaks,trufflehog,detect-secrets,deepsecrets

# to reproduce Prowl's own rows you need the prowl binary on PATH (build from the main repo)
# and, for the 3-way config, the encoder from Hugging Face:
PROWL_ENCODER=/path/to/prowl-secret-encoder python run_prowlbench.py --tools prowl-3way

prowlbench.jsonl is the case set (one JSON object per line: id, text, label, type, tier, source, lang, origin, span). run_prowlbench.py runs each tool as a real subprocess and writes prowlbench_leaderboard.json. The reproduction/ scripts rebuild the benchmark and the ensemble analysis from the training corpus. They require that corpus (Podric/prowl-secrets-corpus) and the main repo's src/ on PYTHONPATH; the runner and the cases above do not.
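A minimal sketch of reading the case set, assuming only the field names listed above. The two inline records are hypothetical, and the encodings shown for label (1/0), tier ("T1".."T4"), and span (start/end character offsets) are assumptions; check prowlbench.jsonl for the actual values.

```python
import io
import json
from collections import Counter


def load_cases(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical records for illustration only; real cases live in prowlbench.jsonl.
# The label, tier, and span encodings below are assumptions, not the documented schema.
sample = io.StringIO(
    '{"id": "ex-1", "text": "password = hunter2", "label": 1, "type": "password", '
    '"tier": "T2", "source": "code", "lang": "en", "origin": "synthetic", "span": [11, 18]}\n'
    '{"id": "ex-2", "text": "commit 3f2a9c1 merged", "label": 0, "type": "git_hash", '
    '"tier": "T4", "source": "log", "lang": "en", "origin": "curated", "span": null}\n'
)

cases = load_cases(sample)
tier_counts = Counter(case["tier"] for case in cases)
real_only = [case for case in cases if case["origin"] == "real"]
print(tier_counts, len(real_only))
```

To read the shipped file instead, pass an open file handle, e.g. `load_cases(open("prowlbench.jsonl", encoding="utf-8"))`.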

1. Motivation

Existing secret scanners are evaluated on each vendor's private set or a handful of hand-picked cases, so numbers aren't comparable and false-positive behaviour (the #1 cause of alert fatigue) is under-measured. ProwlBench is a single, public, balanced test set with a documented protocol and a multi-tool leaderboard, so detection quality (especially precision on hard negatives) can be compared apples-to-apples.

2. Composition

24,603 labelled snippets, each: {id, text, label, type, tier, source, lang, origin, span}.

Axis	Breakdown
Label	16,552 positive (contains a secret) / 8,051 negative
Difficulty tier	T1 structured 7,033 (AWS/GitHub/Stripe/JWT/PEM...) · T2 generic-context 3,519 (api_key/password=) · T3 free-form & multilingual 6,000 (passwords in prose, non-English) · T4 hard-negative 8,051 (the false-positive suite)
Source	code 9,732 · confluence 4,621 · jira 4,521 · log 3,769 · slack 1,960
Language	en plus fr/ja/ru/zh/de/es/it and others, including non-Latin scripts (multilingual passwords in prose)
Origin	`augmented` (diversity-augmentation, held out by origin) · `synthetic` (format/checksum-correct generated) · `real` (CredData-obfuscated + HF PII, held out by origin) · `curated` (hand-verified adversarial)
Types	26 secret types + ~20 negative classes (see the taxonomy)

Exact counts: prowlbench_stats.json.
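The counts in the table are mutually consistent, which the short check below makes explicit. It assumes, as the numbers suggest, that every T1-T3 case is a positive and every T4 case a negative; the variable names are illustrative.

```python
# Counts copied from the composition table above.
tiers = {"T1": 7033, "T2": 3519, "T3": 6000, "T4": 8051}
sources = {"code": 9732, "confluence": 4621, "jira": 4521, "log": 3769, "slack": 1960}
positives, negatives, total = 16552, 8051, 24603

# T1-T3 together account for all positives; T4 accounts for all negatives.
assert tiers["T1"] + tiers["T2"] + tiers["T3"] == positives
assert tiers["T4"] == negatives
assert sum(sources.values()) == total
assert positives + negatives == total

# Share of positives that are generic-context or free-form (T2 + T3).
generic_share = (tiers["T2"] + tiers["T3"]) / positives
print(f"{generic_share:.1%}")  # prints 57.5%
```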

Difficulty tiers: why

T1 structured: has a checksum or distinctive prefix; a good scanner should catch nearly all of these.
T2 generic-context: generic key/password with a naming anchor; needs context, not regex alone.
T3 free-form & multilingual: passwords in natural-language prose, often non-English. This is where pure-regex tools collapse and context models do well.
T4 hard-negative: the discriminator. High-entropy non-secrets: content hashes (md5/sha/git), UUID/ULID/ObjectId, bcrypt hashes, SSH public keys, data-URI blobs, SRI integrity, placeholders (changeme, ${ENV}, <your-key>), near-misses (broken-checksum tokens), and PII. A tool's T4 false-positive rate is its alert-fatigue score.
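To illustrate why T4 is hard, the sketch below computes character-level Shannon entropy for a few made-up non-secrets. Hashes and UUIDs score about as high as many real tokens, so an entropy threshold alone cannot separate them; the example strings are illustrative and not taken from the benchmark.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Bits per character of the empirical character distribution."""
    if not value:
        return 0.0
    counts = Counter(value)
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Made-up examples of T4-style non-secrets.
examples = {
    "git sha (non-secret)": "9fceb02d0ae598e95dc970b74767f19372d61af8",
    "uuid (non-secret)": "123e4567-e89b-12d3-a456-426614174000",
    "placeholder (non-secret)": "changeme",
}
for name, value in examples.items():
    print(f"{name}: {shannon_entropy(value):.2f} bits/char")
```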

3. Collection & labelling protocol

Source pool: the project's 503k-record corpus: format-correct synthetic secrets injected into realistic carriers with exact spans, real obfuscated secrets (Samsung CredData), real PII prose (HF ai4privacy + Nemotron-PII), and agent-authored carriers across many languages/frameworks.
Leakage safety (with a known, disclosed caveat): ProwlBench is drawn from a value-disjoint and template-group-disjoint TEST split of its own source corpus, with the real origins (CredData/HF) held out by origin. The caveat, found by an adversarial audit: the ML models train on a separate corpus (corpus_all / corpus_flywheel) that shares a synthetic base with the benchmark's source but is split independently, so the two splits are not coordinated. The audit measured ~1,200 benchmark cases (~5%, mostly low-entropy generated passwords) whose value or full text also appears in the models' train split. The earlier claim that no benchmark value appears in training was therefore wrong. The overlap inflates the recall of the ML rows (prowl-lr/prowl-3way) by a few points, mostly on the multilingual prose slice; the pure-Go cascade row (prowl) trains on nothing and is unaffected. A coordinated cross-corpus split is the fix in progress.
Stratified sampling: capped per (tier × type × source) bucket, real-first, deduplicated by value, so no single type/source/language dominates.
Ambiguity removal: negatives where a high-entropy value sits inside a secret-named assignment (e.g. API_KEY=<40-hex-sha>) are excluded: flagging those is defensible, so they are not a fair false-positive test. T4 negatives are unambiguous non-secrets.
Curated suite: a hand-verified set of adversarial cases (bcrypt, ssh-pubkeys, data-URIs, multilingual passwords, near-misses) is appended for guaranteed-clean coverage.
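The stratified-sampling step can be sketched roughly as below. This is not the code in reproduction/build_prowlbench.py: the function name, the `value` field (the secret or high-entropy string used for deduplication), and the toy pool are all hypothetical.

```python
from collections import defaultdict


def stratified_sample(records, cap):
    """Keep at most `cap` records per (tier, type, source) bucket.

    Real-origin records are considered first, and each value is kept only once.
    """
    # Stable sort: real records first, original order preserved otherwise.
    ordered = sorted(records, key=lambda r: r["origin"] != "real")
    seen_values = set()
    per_bucket = defaultdict(int)
    kept = []
    for rec in ordered:
        bucket = (rec["tier"], rec["type"], rec["source"])
        if rec["value"] in seen_values or per_bucket[bucket] >= cap:
            continue
        seen_values.add(rec["value"])
        per_bucket[bucket] += 1
        kept.append(rec)
    return kept


# Hypothetical pool: three records share one bucket, and "d" repeats the value of "a".
pool = [
    {"id": "a", "tier": "T1", "type": "aws", "source": "code", "origin": "synthetic", "value": "v1"},
    {"id": "b", "tier": "T1", "type": "aws", "source": "code", "origin": "real", "value": "v2"},
    {"id": "c", "tier": "T1", "type": "aws", "source": "code", "origin": "synthetic", "value": "v3"},
    {"id": "d", "tier": "T2", "type": "password", "source": "jira", "origin": "synthetic", "value": "v1"},
]
picked = [r["id"] for r in stratified_sample(pool, cap=2)]
print(picked)  # prints ['b', 'a']
```

With cap=2, the real record "b" is taken first, "a" fills the bucket, "c" is dropped by the cap, and "d" is dropped as a duplicate value.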

4. Metrics

Detection task: does the tool flag ≥1 secret in the snippet? Reported per tool:

precision / recall / F1 / accuracy overall;
per-tier recall (T1 to T3) and T4 false-positive rate;
per-language recall (multilingual generalisation);
per-source (code vs non-code). All tools are run at their default operating point with realistic file extensions per source (code → .py, wiki → .md, log → .log) so extension-sensitive tools apply their real rules.
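Snippet-level scoring as described above can be sketched as follows. It assumes label is encoded as 1/0, tiers as "T1".."T4", and that a tool's output has already been reduced to the set of case ids it flagged; the function name `score` is hypothetical and this is not the code in run_prowlbench.py.

```python
from collections import defaultdict


def score(cases, flagged):
    """cases: dicts with 'id', 'label' (1/0), 'tier'; flagged: ids the tool flagged."""
    tp = fp = fn = tn = 0
    tier_hits = defaultdict(int)
    tier_total = defaultdict(int)
    for case in cases:
        hit = case["id"] in flagged
        tier_total[case["tier"]] += 1
        tier_hits[case["tier"]] += hit
        if case["label"] == 1:
            tp += hit
            fn += not hit
        else:
            fp += hit
            tn += not hit
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(cases) if cases else 0.0
    # Flag rate per tier: recall for T1-T3 (all positives),
    # false-positive rate for T4 (all negatives).
    per_tier = {tier: tier_hits[tier] / tier_total[tier] for tier in tier_total}
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "per_tier": per_tier}


# Hypothetical toy run: the tool flags one true secret and one hard negative.
toy_cases = [
    {"id": 1, "label": 1, "tier": "T1"},
    {"id": 2, "label": 1, "tier": "T3"},
    {"id": 3, "label": 0, "tier": "T4"},
    {"id": 4, "label": 0, "tier": "T4"},
]
result = score(toy_cases, flagged={1, 3})
print(result)
```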

5. Leaderboard

Run: python run_prowlbench.py → prowlbench_leaderboard.json. Results on ProwlBench v2.0 (24,603 cases, default operating points, sorted by F1):

Read this before the table - reproducibility first. The table below is exactly the rows a clean checkout + python run_prowlbench.py produces: the shipped Go binary (cascade) plus the competitors. The optional ensemble rows (cascade ∪ LR, and the 3-way with the multilingual encoder) need models that are gitignored / on Hugging Face and so are not in a clean checkout; they are reported separately below and are not the headline. Two more honesty notes: any row with a trained model carries the ~5% train/test leakage disclosed in §3 under Leakage safety (the cascade trains on nothing); and 57% of positives are generic passwords / keys in prose, which provider-verifier engines (TruffleHog) and prefix regex (gitleaks) don't target by design - a structured-token-heavy distribution narrows the gap sharply.

tool (clean checkout - reproducible)	precision	recall	F1	accuracy
Prowl (shipped binary, cascade)	0.951	0.823	0.883	0.853
DeepPass2	0.893	0.567	0.694	0.663
gitleaks	0.931	0.413	0.573	0.585
detect-secrets	0.848	0.423	0.564	0.561
DeepSecrets	0.921	0.309	0.462	0.517
TruffleHog	0.940	0.303	0.458	0.518

Optional ML operating points - NOT in a clean checkout (need the gitignored LR + encoder); numbers are from research runs and vary by encoder version:

tool (research-only, not in the artifact)	precision	recall	F1
Prowl (cascade ∪ LR)	0.940	0.872	0.905
Prowl (3-way, + multilingual encoder)	0.936	0.989	0.962

Three operating points let you pick the precision/recall trade-off:

cascade-only: pure Go, no ML, fastest; the shipped, reproducible row - highest precision here (0.951) and a 0.09 hard-negative false-positive rate. The choice when precision matters most.
cascade ∪ LR: adds a char+word TF-IDF logistic regression (train_text_lr.py in the main repo), gated to non-code sources. It trains in <1 min on CPU and closes most of the multilingual-prose gap. Needs the gitignored LR.
3-way: adds a fine-tuned multilingual context encoder (train_encoder.py in the main repo); tops recall (0.989) and per-language recall (en 0.98 / de 1.00 / fr 1.00 / es 0.99 / ru 1.00), but needs the gitignored encoder, so it is not reproducible from a clean checkout. Use cascade-only when FPs are costliest.
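The three operating points are unions of detectors, which the sketch below illustrates with toy stand-ins. The function `make_ensemble` and the three lambda detectors are hypothetical; the LR is gated to non-code sources as stated above, while leaving the encoder ungated is an assumption of this sketch, not a documented property of Prowl.

```python
from typing import Callable, Optional

Detector = Callable[[str], bool]


def make_ensemble(cascade: Detector,
                  lr: Optional[Detector] = None,
                  encoder: Optional[Detector] = None):
    """Return detect(text, source) that flags if any enabled stage flags."""
    def detect(text: str, source: str) -> bool:
        if cascade(text):
            return True
        # The LR stage only runs on non-code sources.
        if lr is not None and source != "code" and lr(text):
            return True
        # Assumption: the encoder stage runs on every source.
        if encoder is not None and encoder(text):
            return True
        return False
    return detect


# Toy stand-ins, not the real stages.
toy_cascade = lambda text: "AKIA" in text
toy_lr = lambda text: "password" in text.lower()
toy_encoder = lambda text: "пароль" in text.lower()

cascade_only = make_ensemble(toy_cascade)
cascade_lr = make_ensemble(toy_cascade, toy_lr)
three_way = make_ensemble(toy_cascade, toy_lr, toy_encoder)

print(cascade_only("my password is tulip", "slack"))  # prints False
print(cascade_lr("my password is tulip", "slack"))    # prints True
print(three_way("мой пароль: тюльпан", "jira"))       # prints True
```

Each added stage can only turn a non-flag into a flag, which is why recall and the hard-negative false-positive rate both rise as you move from cascade-only to 3-way.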

Per-tier recall (T4 = false-positive rate on hard negatives, lower is better):

tool	T1 structured	T2 generic	T3 free-form/multi	T4 FP-rate
Prowl (3-way)	1.00	0.98	0.98	0.14
Prowl (cascade ∪ LR)	0.99	0.68	0.85	0.12
Prowl (cascade-only)	0.98	0.68	0.72	0.07
DeepPass2	0.68	0.16	0.67	0.14
gitleaks	0.65	0.35	0.17	0.06
detect-secrets	0.58	0.34	0.29	0.16
DeepSecrets	0.43	0.20	0.23	0.05
TruffleHog	0.60	0.00	0.13	0.04

Prowl's 3-way ensemble leads recall in every tier: 1.00 on structured tokens and 0.98 on free-form and multilingual prose, where regex tools collapse (gitleaks 0.17, TruffleHog 0.13). The low T4 false-positive rates of gitleaks, DeepSecrets, and TruffleHog (0.04-0.06) come with low overall recall (0.30-0.41): they flag little, so they raise few false alarms but also miss most secrets. Prowl's three stages let you place the operating point: the cascade covers structured leaks at the highest precision on the board, and the LR + encoder add the multilingual and free-form tail, trading false positives for recall as you move up the stack.

6. Ethics & safety

No live secrets. Synthetic values are generated (never real). CredData values are format-preserving obfuscated (safe by construction). HF values come from public PII datasets.
Treat any real-looking value as potentially compromised: do not network-verify benchmark values.
The benchmark file is suitable for public release; the underlying raw corpus stays gitignored.

7. Limitations

Real coverage skews to CredData (code) + HF (prose); SecretBench/FPSecretBench (gated) would broaden it (see the main repo).
Multilingual positives are a minority slice; non-English recall numbers have wider error bars.
"Detection" is snippet-level (flag/no-flag), not span-exact; a span-F1 variant is future work.
Generated cases (augmented + synthetic) outweigh real ~5:2; the leaderboard reports origin so real-only slices can be cut.

8. Reproduction

python reproduction/build_prowlbench.py      # rebuild from the held-out corpus split -> prowlbench.jsonl
python run_prowlbench.py        # run all tools -> leaderboard + per-tier/-lang tables

Tools: Prowl (github.com/Lercas/prowl), gitleaks ≥8.30, TruffleHog ≥3.95, detect-secrets ≥1.5, DeepSecrets ≥2.0 (ntoskernel/deepsecrets), DeepPass2 (gneeraj/deeppass2-bert). Schema + stats in prowlbench_stats.json.
