ScreenLeak

A multi-modal benchmark measuring how well today's tools redact PII from screen telemetry, rendered screenshots, and multi-step computer-use traces.

Blog: screenpipe.github.io/screenleak · Contact: louis@screenpi.pe

Headline — composite framework coverage (text · image · trace)

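Each adapter is scored on every surface where it operates. The composite is the arithmetic mean across surfaces (each Composite value in the table is the unweighted average of its three surface scores), so a weak surface pulls the overall compliance posture down.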

Framework	Text (`v45_phase3`)	Image (`rfdetr`)	Trace (`gpt5`)	Composite
HIPAA	91.8%	95.8%	76.0%	87.9%
GDPR	90.2%	95.2%	68.0%	84.5%
CCPA	90.2%	95.2%	68.0%	84.5%
SOC 2	88.0%	95.7%	68.0%	83.9%
PCI DSS	88.7%	96.8%	78.3%	87.9%
DPDPA	91.6%	95.8%	72.0%	86.5%

A single shared label-subset dict (scoring/frameworks.py) applies across all three surfaces. Numbers are zero-leak rates on the private val sets (422 text · 221 image · 25 trace). The same probes can be re-run on the public sample. Full unified breakdown: results/framework_coverage.md.
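The Composite column can be checked directly from the three per-surface columns. The snippet below is a minimal check, not the repo's scoring code: it takes an unweighted arithmetic mean of the three rates, which reproduces every published composite to one decimal. The `ROWS` dict and `composite` helper are names made up for this illustration.

```python
from statistics import mean

# Zero-leak rates (%) copied from the table above:
# (text, image, trace, published composite).
ROWS = {
    "HIPAA":   (91.8, 95.8, 76.0, 87.9),
    "GDPR":    (90.2, 95.2, 68.0, 84.5),
    "CCPA":    (90.2, 95.2, 68.0, 84.5),
    "SOC 2":   (88.0, 95.7, 68.0, 83.9),
    "PCI DSS": (88.7, 96.8, 78.3, 87.9),
    "DPDPA":   (91.6, 95.8, 72.0, 86.5),
}


def composite(text: float, image: float, trace: float) -> float:
    """Unweighted arithmetic mean of the three surface scores, to one decimal."""
    return round(mean([text, image, trace]), 1)


if __name__ == "__main__":
    for name, (text, image, trace, published) in ROWS.items():
        computed = composite(text, image, trace)
        print(f"{name:8s} computed={computed:.1f} published={published:.1f}")
```

Because the trace column is the lowest in every row, it is the surface that moves the composite most.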

Three sub-benches

Bench	Question	Corpus
`text/`	Find PII in window titles / AX nodes / OCR fragments	422 cases · 13 labels · multilingual + adversarial splits
`image/`	Find pixel regions of PII in rendered screens	Synthetic screenshots · 9 app templates · pixel-precise gold
`trace/`	Does the agent leak PII it observes inside a task?	50 traces (25 train + 25 val) with injected PII

All three use the same 13-label taxonomy (CATEGORIES.md).
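To make the trace question concrete, here is a minimal sketch of a leak check. It assumes a trace records the PII values that were injected into what the agent observes and the text the agent produced at each turn, and it treats a case-insensitive substring match as a leak. The `Trace` class and its field names are hypothetical; the actual leak rule is defined in THREAT_MODEL.md and trace/src/score.py and may be stricter or looser.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    injected_pii: list[str]   # PII values planted in the agent's observations
    agent_outputs: list[str]  # text the agent wrote or sent, one entry per turn


def leaked_items(trace: Trace) -> list[str]:
    """Return the injected values that reappear verbatim in any agent output."""
    haystack = "\n".join(trace.agent_outputs).casefold()
    return [value for value in trace.injected_pii if value.casefold() in haystack]


def zero_leak_rate(traces: list[Trace]) -> float:
    """Fraction of traces in which no injected value reappears."""
    clean = sum(1 for trace in traces if not leaked_items(trace))
    return clean / len(traces)


if __name__ == "__main__":
    # Synthetic examples only; these values are not from the corpus.
    traces = [
        Trace(["jane.doe@example.com"], ["Opened the inbox.", "Task complete."]),
        Trace(["jane.doe@example.com"], ["Forwarded to Jane.Doe@example.com as asked."]),
    ]
    print(zero_leak_rate(traces))
```

Exact matching misses paraphrased or reformatted PII, so a check like this gives a lower bound on leakage.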

Three failure modes — text adapters, full comparison

Zero-leak alone is half the picture. Whether the adapter runs locally, its model size, its RAM use, and its latency are the other half. The v45_phase3 row is the only one strong on every dimension.

Adapter	Local	Zero-leak	Oversmash	macro-F1	Size	RAM	p50
`gemini` (gemini-3.1-pro)	❌	91.0%	2.6%	0.85	—	—	3 754 ms
`gpt5` (gpt-5.5)	❌	90.7%	5.2%	0.85	—	—	2 173 ms
`claude` (claude-opus-4-7)	❌	87.8%	5.2%	0.81	—	—	1 550 ms
`v45_phase3` ⭐	✅	86.7%	0.0%	0.78	278 MB	1.1 GB	9 ms
`privacy_filter_ft_v6`	✅	80.9%	3.9%	0.72	~ 1.4 GB	~ 6 GB	54 ms
`opf_rs` / privacy_filter family	✅	38 – 80%	4 – 10%	0.35 – 0.72	~ 1.4 GB	~ 6 GB	1 – 120 ms
`gliner_pii`	✅	62.6%	79.2%	0.44	~ 500 MB	~ 1.5 GB	104 ms
`gcp_dlp`	❌	37.7%	11.7%	0.24	—	—	84 ms
`presidio`	✅	35.4%	22.1%	0.20	~ 200 MB	~ 400 MB	6 ms
`regex`	✅	33.9%	1.3%	0.57	< 1 MB	~ 30 MB	< 1 ms

v45_phase3 peak RSS was measured via `cargo run --release --example v45_phase3_smoke --features onnx-cpu`. Size and RAM for the other adapters are taken from each model's card.

Findings

1. A 278 MB local model comes within a few points of frontier APIs on text. Cloud DLP products don't. On the 422-case text bench, Gemini / GPT-5.5 / Claude score 87.8 – 91.0 % zero-leak. v45_phase3 scores 86.7 % overall and averages 90 % across the six compliance frameworks, at 9 ms p50 on CPU. Google Cloud DLP (37.7 %) and Microsoft Presidio (35.4 %) barely beat a hand-rolled regex (33.9 %); they were built for documents, not screen telemetry.

2. Frontier vision can't draw boxes. A small specialized detector can. On 190 PII-bearing screenshots at IoU ≥ 0.30, every frontier vision API sits below 5 % zero-leak with CIs that overlap each other and the Tesseract+regex baseline. A locally fine-tuned RF-DETR-Nano (28 M params, 108 MB) hits 95.3 % with a lower CI bound of 91.2 % — decisively separated.

3. Frontier APIs don't withhold PII when working. On 25 multi-turn traces, the strongest (GPT-5.5 at 64.0 %, CI 44 – 80 %) still leaks at least one observed PII item in 36 % of traces. Gemini at 20 % leaks in 80 % of traces.
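The confidence intervals in findings 2 and 3 matter because the samples are small. The source does not say how they were computed; a 95 % Wilson score interval is one common choice for proportions, and the sketch below shows that it lands close to the reported figures. The counts 16 of 25 and 181 of 190 are inferred here from the published percentages (64.0 % and 95.3 %), and the function name is made up for this illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for label, k, n in [("trace, GPT-5.5", 16, 25), ("image, RF-DETR-Nano", 181, 190)]:
        low, high = wilson_interval(k, n)
        print(f"{label}: {k}/{n} = {k / n:.1%}, interval {low:.1%} to {high:.1%}")
```

With only 25 traces the interval spans roughly 35 points, so differences between adapters on the trace bench should be read with that width in mind.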

The pattern: capability ≠ pixel-grounding ≠ disposition. A model nailing text detection at 91 % can still leak PII 80 % of the time when it observes that PII inside a task. See THREAT_MODEL.md for what counts as a leak, LIMITATIONS.md for caveats (notably: rfdetr's val is in-distribution with its training — held-out images, same source).
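Finding 2 scores image adapters at IoU ≥ 0.30. For readers unfamiliar with the metric, the sketch below computes intersection-over-union for axis-aligned boxes and applies the threshold. The matching rule is an assumption: a screenshot counts as zero-leak here when every gold box is overlapped by at least one predicted box at or above the threshold. The rule in image/src/score.py may differ, and the function names are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels, x2 > x1, y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def screenshot_is_zero_leak(gold: list[Box], pred: list[Box], threshold: float = 0.30) -> bool:
    """True if every gold PII region is matched by some predicted box."""
    return all(any(iou(g, p) >= threshold for p in pred) for g in gold)


if __name__ == "__main__":
    gold = [(10, 10, 110, 40)]
    print(iou(gold[0], (20, 10, 120, 40)))                       # mostly overlapping box
    print(screenshot_is_zero_leak(gold, [(20, 10, 120, 40)]))    # True
    print(screenshot_is_zero_leak(gold, [(100, 10, 200, 40)]))   # False, overlap too small
```

A threshold of 0.30 is lenient about box tightness, so a model that fails it is usually missing the region rather than drawing it slightly off.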

What's in this repo

- Scoring code: text/src/score.py, image/src/score.py, trace/src/score.py + trace/src/replay.py. Shared framework probe at scoring/frameworks.py.
- Every adapter benchmarked: Claude, GPT-5.5, Gemini, GCP DLP, Presidio, GLiNER, privacy_filter family, RF-DETR, regex baselines.
- Methodology / threat model / categories / limitations / citation.
- Public sample per surface so any adapter can be run end-to-end without the full corpus:
  - text/data/sample.jsonl: 51 cases (incl. PHI / PCI / Art. 9 / multilingual tasters)
  - image/corpus/sample/: 30 rendered screenshots + DOM-extracted gold bboxes
  - trace/data/injected_sample.jsonl: 5 multi-turn computer-use traces

The full val sets and the data pipelines that produce them live in a private companion repo — to keep the leaderboard uncontaminated by training-on-the-bench and to preserve the moat. Researchers running serious evaluations can request access at louis@screenpi.pe.

Run an adapter on the sample

make install
export ANTHROPIC_API_KEY=...  OPENAI_API_KEY=...  GOOGLE_API_KEY=...

make bench-text  ADAPTER=claude          # or: gpt5, gemini, v45_phase3, gcp_dlp, regex, …
make bench-image ADAPTER=rfdetr          # or: claude, gpt5, gemini, regex_ocr, …
make bench-trace ADAPTER=claude          # or: gpt5, gemini

Headline leaderboard numbers are computed on the full private corpus; sample-corpus runs are for adapter validation and onboarding, not re-ranking.

Cite this

See CITATION.bib.

Contact

louis@screenpi.pe

