A multi-modal benchmark measuring how well today's tools redact PII from screen telemetry, rendered screenshots, and multi-step computer-use traces.
Blog: screenpipe.github.io/screenleak · Contact: louis@screenpi.pe
Each adapter is scored on every surface where it operates. The composite is the mean across surfaces: the tabulated values equal the arithmetic mean of the three surface scores, so one weak surface lowers the overall compliance posture.
| Framework | Text (`v45_phase3`) | Image (`rfdetr`) | Trace (`gpt5`) | Composite |
|---|---|---|---|---|
| HIPAA | 91.8% | 95.8% | 76.0% | 87.9% |
| GDPR | 90.2% | 95.2% | 68.0% | 84.5% |
| CCPA | 90.2% | 95.2% | 68.0% | 84.5% |
| SOC 2 | 88.0% | 95.7% | 68.0% | 83.9% |
| PCI DSS | 88.7% | 96.8% | 78.3% | 87.9% |
| DPDPA | 91.6% | 95.8% | 72.0% | 86.5% |
A single shared label-subset dict (`scoring/frameworks.py`) applies across all three surfaces. Numbers are zero-leak rates on the private val sets (422 text · 221 image · 25 trace); the same probes can be re-run on the public sample. Full unified breakdown: `results/framework_coverage.md`.
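The Composite column can be recomputed from the three per-surface scores. The sketch below is not the repo's scoring code; it assumes a plain arithmetic mean rounded to one decimal, because that is what reproduces every published row. The `ROWS` dict and `composite` helper are names made up for this illustration.

```python
from statistics import mean

# Per-surface zero-leak rates (%) copied from the framework table:
# (text, image, trace) followed by the published composite.
ROWS = {
    "HIPAA":   ((91.8, 95.8, 76.0), 87.9),
    "GDPR":    ((90.2, 95.2, 68.0), 84.5),
    "CCPA":    ((90.2, 95.2, 68.0), 84.5),
    "SOC 2":   ((88.0, 95.7, 68.0), 83.9),
    "PCI DSS": ((88.7, 96.8, 78.3), 87.9),
    "DPDPA":   ((91.6, 95.8, 72.0), 86.5),
}


def composite(surfaces):
    """Arithmetic mean of the per-surface scores, rounded to one decimal."""
    return round(mean(surfaces), 1)


for name, (surfaces, published) in ROWS.items():
    print(f"{name:8s} computed={composite(surfaces):.1f} published={published:.1f}")
```

Because the trace column is the lowest in every row, it is the surface that moves the composite most.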
| Bench | Question | Corpus |
|---|---|---|
| `text/` | Find PII in window titles / AX nodes / OCR fragments | 422 cases · 13 labels · multilingual + adversarial splits |
| `image/` | Find pixel regions of PII in rendered screens | Synthetic screenshots · 9 app templates · pixel-precise gold |
| `trace/` | Does the agent leak PII it observes inside a task? | 50 traces (25 train + 25 val) with injected PII |
All three use the same 13-label taxonomy (CATEGORIES.md).
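A per-framework zero-leak rate can be thought of as the ordinary zero-leak rate restricted to the labels that framework cares about. The sketch below is a guess at the shape of that computation, not the contents of `scoring/frameworks.py`. The label names, the `FRAMEWORK_LABELS` mapping, the case layout, the exact-span matching rule, and the choice to count only cases containing an in-scope label are all assumptions made for illustration.

```python
# Hypothetical label subsets. The real mapping lives in scoring/frameworks.py,
# and these label names are illustrative, not the CATEGORIES.md taxonomy.
FRAMEWORK_LABELS = {
    "PCI DSS": {"card_number"},
    "HIPAA": {"person_name", "health_record", "email"},
}


def leaked(case, labels):
    """True if any in-scope gold span is absent from the redacted spans."""
    redacted = set(case["redacted"])
    return any(span not in redacted for span in case["gold"] if span[0] in labels)


def zero_leak_rate(cases, framework):
    """Share of in-scope cases where every in-scope gold span was redacted."""
    labels = FRAMEWORK_LABELS[framework]
    # Assumption: only cases with at least one in-scope label are counted.
    scoped = [c for c in cases if any(s[0] in labels for s in c["gold"])]
    if not scoped:
        return None
    clean = sum(not leaked(c, labels) for c in scoped)
    return clean / len(scoped)


# Spans are (label, start, end) tuples; matching is exact for simplicity.
cases = [
    {"gold": [("card_number", 0, 16)], "redacted": [("card_number", 0, 16)]},
    {"gold": [("card_number", 5, 21), ("email", 30, 45)], "redacted": [("email", 30, 45)]},
    {"gold": [("email", 0, 12)], "redacted": []},
    {"gold": [("person_name", 0, 5)], "redacted": [("person_name", 0, 5)]},
]
print(zero_leak_rate(cases, "PCI DSS"), zero_leak_rate(cases, "HIPAA"))
```

The second case shows why scoping matters: the missed card number counts as a leak under the hypothetical PCI DSS subset but not under the HIPAA one.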
Zero-leak alone is half the picture. Local size + RAM + latency are the other half. The v45_phase3 row is the only one strong on every dimension.
| Adapter | Local | Zero-leak | Oversmash | macro-F1 | Size | RAM | p50 |
|---|---|---|---|---|---|---|---|
| `gemini` (gemini-3.1-pro) | ❌ | 91.0% | 2.6% | 0.85 | — | — | 3 754 ms |
| `gpt5` (gpt-5.5) | ❌ | 90.7% | 5.2% | 0.85 | — | — | 2 173 ms |
| `claude` (claude-opus-4-7) | ❌ | 87.8% | 5.2% | 0.81 | — | — | 1 550 ms |
| `v45_phase3` ⭐ | ✅ | 86.7% | 0.0% | 0.78 | 278 MB | 1.1 GB | 9 ms |
| `privacy_filter_ft_v6` | ✅ | 80.9% | 3.9% | 0.72 | ~1.4 GB | ~6 GB | 54 ms |
| `opf_rs` / `privacy_filter` family | ✅ | 38 – 80% | 4 – 10% | 0.35 – 0.72 | ~1.4 GB | ~6 GB | 1 – 120 ms |
| `gliner_pii` | ✅ | 62.6% | 79.2% | 0.44 | ~500 MB | ~1.5 GB | 104 ms |
| `gcp_dlp` | ❌ | 37.7% | 11.7% | 0.24 | — | — | 84 ms |
| `presidio` | ✅ | 35.4% | 22.1% | 0.20 | ~200 MB | ~400 MB | 6 ms |
| `regex` | ✅ | 33.9% | 1.3% | 0.57 | < 1 MB | ~30 MB | < 1 ms |
`v45_phase3` peak RSS measured via `cargo run --release --example v45_phase3_smoke --features onnx-cpu`. Other adapters' size / RAM figures are taken from each model's card.
1. A 278 MB local model matches frontier APIs on text. Cloud DLP products don't. On the 422-case text bench, Gemini / GPT-5.5 / Claude all score 87.8 – 91.0 % zero-leak. v45_phase3 averages 90 % across compliance frameworks at 9 ms p50 on CPU. Google Cloud DLP (37.7 %) and Microsoft Presidio (35.4 %) barely beat a hand-rolled regex (33.9 %) — they were built for documents, not screen telemetry.
2. Frontier vision can't draw boxes. A small specialized detector can. On 190 PII-bearing screenshots at IoU ≥ 0.30, every frontier vision API sits below 5 % zero-leak with CIs that overlap each other and the Tesseract+regex baseline. A locally fine-tuned RF-DETR-Nano (28 M params, 108 MB) hits 95.3 % with a lower CI bound of 91.2 % — decisively separated.
3. Frontier APIs don't withhold PII when working. On 25 multi-turn traces, the strongest (GPT-5.5 at 64.0 %, CI 44 – 80 %) still leaks at least one observed PII item in 36 % of traces. Gemini at 20 % leaks in 80 % of traces.
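The IoU ≥ 0.30 criterion in finding 2 can be sketched as follows. This is not the code in `image/src/score.py`. It assumes boxes are `(x0, y0, x1, y1)` pixel tuples and that an image counts as zero-leak when every gold box is covered by at least one predicted box at or above the threshold; both assumptions, and the function names, are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def image_is_zero_leak(gold_boxes, pred_boxes, threshold=0.30):
    """Assumed rule: every gold box needs some prediction with IoU >= threshold."""
    return all(
        any(iou(g, p) >= threshold for p in pred_boxes) for g in gold_boxes
    )


gold = [(0, 0, 100, 20)]
print(image_is_zero_leak(gold, [(10, 0, 110, 20)]))  # mostly overlapping box
print(image_is_zero_leak(gold, [(80, 0, 180, 20)]))  # box that clips only the edge
```

A threshold of 0.30 is lenient about box tightness, so a detector that fails it is generally missing the region rather than drawing a slightly loose box.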
The pattern: capability ≠ pixel-grounding ≠ disposition. A model nailing text detection at 91 % can still leak PII 80 % of the time when it observes that PII inside a task. See THREAT_MODEL.md for what counts as a leak, LIMITATIONS.md for caveats (notably: rfdetr's val is in-distribution with its training — held-out images, same source).
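With only 25 traces, the width of the confidence interval matters as much as the point estimate. The source does not say how its intervals are computed; the sketch below uses a 95 % Wilson score interval, one common choice for small samples, which lands within about a point of the quoted 44 – 80 % range for 16 clean traces out of 25 (64.0 %). The function name and the choice of method are assumptions.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 64.0 % zero-leak on 25 traces corresponds to 16 clean traces.
lo, hi = wilson_interval(16, 25)
print(f"{lo:.1%} to {hi:.1%}")
```

An interval this wide means that differences of a few points between adapters on the trace bench should not be read as a ranking.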
- Scoring code: `text/src/score.py`, `image/src/score.py`, `trace/src/score.py` + `trace/src/replay.py`. Shared framework probe at `scoring/frameworks.py`.
- Every adapter benchmarked: Claude, GPT-5.5, Gemini, GCP DLP, Presidio, GLiNER, `privacy_filter` family, RF-DETR, regex baselines.
- Methodology / threat model / categories / limitations / citation.
- Public sample per surface so any adapter can be run end-to-end without the full corpus:
  - `text/data/sample.jsonl`: 51 cases (incl. PHI / PCI / Art. 9 / multilingual tasters)
  - `image/corpus/sample/`: 30 rendered screenshots + DOM-extracted gold bboxes
  - `trace/data/injected_sample.jsonl`: 5 multi-turn computer-use traces
The full val sets and the data pipelines that produce them live in a private companion repo — to keep the leaderboard uncontaminated by training-on-the-bench and to preserve the moat. Researchers running serious evaluations can request access at louis@screenpi.pe.

    make install
    export ANTHROPIC_API_KEY=... OPENAI_API_KEY=... GOOGLE_API_KEY=...
    make bench-text ADAPTER=claude    # or: gpt5, gemini, v45_phase3, gcp_dlp, regex, …
    make bench-image ADAPTER=rfdetr   # or: claude, gpt5, gemini, regex_ocr, …
    make bench-trace ADAPTER=claude   # or: gpt5, gemini

Headline leaderboard numbers are computed on the full private corpus; sample-corpus runs are for adapter validation and onboarding, not re-ranking.
See CITATION.bib.