AnonShield is a pseudonymization framework designed for Computer Security Incident Response Teams (CSIRTs). It replaces Personally Identifiable Information (PII) and cybersecurity indicators with cryptographically secure, deterministic pseudonyms (HMAC-SHA256), preserving referential integrity across documents while enabling GDPR/LGPD-compliant data sharing. AnonShield combines GPU-accelerated NER, an LRU entity cache, streaming processors for large files, and a schema-aware configuration mechanism. Evaluated on datasets up to 550 MB (70,951 vulnerability records), it reduces processing time from ~92 hours to under 10 minutes (~738× speedup over v2.0 on D2 JSON; ≥743× on D2 CSV) and achieves F1 = 94.2%, Recall = 96.7% with the filtered/hybrid strategies.
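The deterministic HMAC-SHA256 pseudonymization described above can be sketched in a few lines. This is an illustrative toy, not AnonShield's actual implementation (in practice the key comes from `ANON_SECRET_KEY` and entity detection is done by NER):

```python
import hashlib
import hmac

def pseudonymize(value: str, entity_type: str, key: bytes) -> str:
    """Deterministic pseudonym: the same (key, value) always maps to the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]
    return f"[{entity_type}_{digest}]"

key = b"demo-key"  # stand-in for a key derived from ANON_SECRET_KEY
a = pseudonymize("alice@example.com", "EMAIL", key)
b = pseudonymize("alice@example.com", "EMAIL", key)
c = pseudonymize("alice@example.com", "EMAIL", b"other-key")

print(a == b)  # True  — referential integrity across documents
print(a == c)  # False — a different key yields unlinkable pseudonyms
```

Determinism under a fixed key is what preserves referential integrity: the same indicator appearing in two documents maps to the same pseudonym, while parties without the key cannot reverse or correlate the tokens.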
Paper: AnonShield: Scalable On-Premise Pseudonymization for CSIRT Vulnerability Data — SBRC 2026 Salão de Ferramentas.
Note: In parts of this repository — including benchmark scripts, CLI flags (`--versions 3.0`), result directory names, and internal logs — AnonShield is referred to as v3.0. This reflects its versioning relative to the predecessor tools AnonLFI v1.0 and v2.0, which are used as baselines in the benchmark comparisons.
| Section | Description |
|---|---|
| Considered Seals | SBRC quality seals targeted by this artifact |
| Basic Information | Hardware, OS, and software environment |
| Dependencies | Required packages and external tools |
| Security Concerns | Risks and mitigations for evaluators |
| Installation | Step-by-step setup (local and Docker) |
| Minimal Test | Quick functional verification (~5–10 min) |
| Experiments | Reproduction of the three main paper claims |
| License | Licensing information |
The seals considered are: Available (SeloD), Functional (SeloF), Sustainable (SeloS), and Reproducible Experiments (SeloR).
SeloS — Sustainable: The source code is organized into 25 focused modules under src/anon/ with clear separation of concerns: engine.py (orchestration), strategies.py (anonymization algorithms), processors.py (file-format handling), entity_detector.py (NER), repository.py/database.py (data layer), cache_manager.py, hash_generator.py, security.py, and others. The CLI entry point is anon.py. Design patterns are applied explicitly: Strategy for anonymization algorithms, Template Method for file processors, Repository for data access, and Dependency Injection in the orchestrator; core/protocols.py defines Protocol-based interfaces for dependency inversion. ~73% of public APIs carry full type annotations and 100% of public classes/methods have docstrings. Beyond inline documentation, five developer guides are provided under docs/developers/ (ARCHITECTURE.md, ANONYMIZATION_STRATEGIES.md, SLM_INTEGRATION_GUIDE.md, UTILITY_SCRIPTS_GUIDE.md, EXTENSIBILITY.md, ~3,500 lines total), including a Mermaid architecture diagram. All dependencies are pinned in pyproject.toml and uv.lock; Docker images (anonshield/anon:latest / :gpu) provide a fully self-contained execution environment.
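The Protocol-based Strategy pattern described above can be illustrated with a minimal sketch. The names below are hypothetical — the real interfaces live in `core/protocols.py` and `strategies.py`:

```python
from typing import Protocol

class AnonymizationStrategy(Protocol):
    """Protocol interface: the orchestrator depends on this, not on concrete classes."""
    def anonymize(self, text: str) -> str: ...

class MaskStrategy:
    """Toy strategy for illustration: blanks out every digit."""
    def anonymize(self, text: str) -> str:
        return "".join("#" if c.isdigit() else c for c in text)

def run_engine(strategy: AnonymizationStrategy, text: str) -> str:
    # Dependency inversion: any object with a matching anonymize() works,
    # without inheriting from a common base class (structural typing).
    return strategy.anonymize(text)

print(run_engine(MaskStrategy(), "call 555-0101"))  # → call ###-####
```

Because `Protocol` uses structural typing, new strategies can be added without modifying the engine — the extensibility property the developer guides document.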
| Hardware (paper experiments) | NVIDIA RTX 5060 Ti 16 GB VRAM (driver 590.48.01, CUDA 13.1) · AMD Ryzen 5 8600G (6c/12t) · 32 GB DDR5 6000 MHz — GPU used (Device set to use cuda:0); 45/45 tests OK in ~2m18s |
| Hardware (tester — laptop) | Intel Core i5-1035G1 · 20 GB RAM · no discrete GPU — CPU-only mode; 45/45 tests OK in ~3m55s |
| Hardware (tester — server A) | 2× Intel Xeon E5-2650 · 130 GB RAM · NVIDIA Tesla C2050 + Quadro 5000 (2010, Fermi, sm_20 — not to be confused with the newer Quadro RTX 5000 which is Turing sm_75) present but below the sm_75 minimum; no driver installed; tool runs CPU-only; 45/45 tests OK in ~10m19s |
| Hardware (tester — server B) | AMD Ryzen 7 5800X (8c/16t) · 130 GB RAM · NVIDIA GeForce RTX 3060 12 GB (driver 550.163.01, CUDA 12.4) — GPU used (Device set to use cuda:0); 45/45 tests OK in ~3m15s |
| Hardware (tester — laptop 2) | Intel Core i5-12450HX (8c) · 16 GB DDR4 · NVIDIA GeForce RTX 3050 6 GB · Zorin OS 18 — GPU used (Device set to use cuda:0); 45/45 tests OK in ~2m21s |
| Minimum for smoke test | 4 GB RAM · x86_64 · Python 3.12 + uv |
| Software | Python 3.12 + uv for all experiments; Docker optional (tool use only) |
| GPU (optional) | NVIDIA driver ≥ 525 (CUDA 12.8) + NVIDIA Container Toolkit; GPU must be sm_75 (Turing) or newer — torch 2.11.0+cu128 dropped Volta (sm_70) and older |
| OS | Linux (tested and recommended); macOS/Windows supported via Docker only |
| Disk | .venv after uv sync: ~7.9 GB; NER models: ~1.5 GB (downloaded on first run to ~/.cache/huggingface/); D1 ~133 MB (in git); D3 bundled as zips (~80 MB in git, ~700 MB extracted). Benchmark comparisons with v2.0 (via --setup) require ~8 GB additional (v2.0 venv + models). Total for full experiment suite: ~17 GB. |
Python environment (all experiments):
- Python 3.12 + `uv` — all packages pinned in `pyproject.toml` / `uv.lock`
- Key packages: `presidio-analyzer`, `presidio-anonymizer`, `transformers`, `spacy`, `torch`, `pandas`, `pymupdf`, `pytesseract`, `lxml`, `orjson`, `scipy`, `statsmodels`
- NER models downloaded automatically on first run and cached in `~/.cache/huggingface/` (~1.5 GB)
Optional:
- Tesseract OCR — required only for OCR-mode tests (PDF/image files): `sudo apt install tesseract-ocr` (Ubuntu/Debian)
- Docker — for tool use only (not needed for experiments): `anonshield/anon:latest` (~2 GB, CPU) or `anonshield/anon:gpu` (~6 GB, GPU)
- AnonShield processes sensitive cybersecurity data entirely locally — no data is transmitted to external services
- `db/entities.db` stores the PII entity mapping table — keep it secure; losing it makes de-anonymization impossible
- The HMAC secret key (`ANON_SECRET_KEY`) must be protected — it is required to correlate pseudonyms across separate runs
- The Docker `--gpu` flag passes `--gpus all` to the container; review this before use in shared environments
# 1. Clone the repository
git clone https://github.com/AnonShield/tool.git
cd tool
# 2. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 3. Install system build dependencies (Linux — required to compile hdbscan and other C extensions)
sudo apt update && sudo apt install -y python3-dev build-essential
# 4. Install Python dependencies
uv sync
# 5. Set the HMAC secret key (required for pseudonymization)
export ANON_SECRET_KEY=$(openssl rand -hex 32)
# To persist across sessions:
echo "export ANON_SECRET_KEY=$ANON_SECRET_KEY" >> ~/.bashrc

GPU: CUDA-enabled PyTorch (cu128) and CuPy are included in pyproject.toml and installed automatically by uv sync. No extra steps required — GPU acceleration is enabled by default when an NVIDIA GPU is present.
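On a machine without `openssl`, an equivalent 64-hex-character key can be generated with Python's standard library (an alternative to step 5, not part of the install script):

```python
import secrets

# 32 random bytes rendered as 64 hex characters — same shape as `openssl rand -hex 32`
key = secrets.token_hex(32)
print(key)  # export this value as ANON_SECRET_KEY
```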
⚠️ Warning: The Docker images contain only the anonymization tool (anon.py). They do not include the benchmark suite, datasets (D1/D3), evaluation data, or any experiment scripts. To reproduce the paper's claims, use the local installation with `uv sync` as described above.
No system dependencies required — Python, uv, and all libraries are already inside the image. The only prerequisite is Docker. The download commands (`curl` on Linux/macOS, `Invoke-WebRequest` on Windows) and the secret key generators (`openssl` on Linux/macOS, `RandomNumberGenerator` on Windows) are native to each OS — nothing extra to install.
Linux / macOS — uses curl and openssl (built-in):
curl -fsSL https://raw.githubusercontent.com/AnonShield/tool/main/docker/run.sh -o run.sh
chmod +x run.sh
export ANON_SECRET_KEY=$(openssl rand -hex 32)
./run.sh ./your_file.csv

Windows — uses Invoke-WebRequest and RandomNumberGenerator (built-in PowerShell):
Invoke-WebRequest -Uri https://raw.githubusercontent.com/AnonShield/tool/main/docker/run.ps1 -OutFile run.ps1
$env:ANON_SECRET_KEY = [System.BitConverter]::ToString([System.Security.Cryptography.RandomNumberGenerator]::GetBytes(32)).Replace("-","").ToLower()
.\run.ps1 .\your_file.csv

Full usage, options, and examples: Docker Hub README
~5–10 minutes. No datasets beyond what is already in the repository.
# Set a secret key
export ANON_SECRET_KEY=$(openssl rand -hex 32)
# Anonymize the included example file
uv run anon.py examples/teste-exemplo-artigo.txt
# Expected: output/anon_teste-exemplo-artigo.txt is created
# PII tokens replaced with [TYPE_<slug>] pseudonyms — verify with:
cat output/anon_teste-exemplo-artigo.txt

Run the unit test suite:

uv run python -m unittest discover tests/

Expected: the final line reads OK (all tests passed) or FAILED (one or more tests failed). Some tests intentionally exercise error paths and will print ERROR or warning messages during the run — this is normal and does not indicate a failure. Only the final OK / FAILED verdict matters.
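As an extra sanity check, the `[TYPE_<slug>]` pseudonym format mentioned above can be scanned for programmatically. The pattern below is an assumption about the token shape — adjust it if your output differs:

```python
import re

# Assumed pseudonym shape: [UPPERCASE_TYPE_slug] with an alphanumeric slug.
PSEUDONYM = re.compile(r"\[[A-Z_]+_[A-Za-z0-9]+\]")

# Hypothetical anonymized line for illustration:
sample = "Contact: [EMAIL_3f9a1c] reported host [IP_b77d02]."
tokens = PSEUDONYM.findall(sample)
print(tokens)  # → ['[EMAIL_3f9a1c]', '[IP_b77d02]']
```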
Claim #1 — AnonShield (standalone) achieves ~3×–~17× speedup over v2.0 per file on D1 (GPU); ≥3,532× (GPU) / ≥535× (CPU) at D3 scale
Paper reference: Performance results section (D1/D2/D3).
What this claim asserts and why it has two parts:
The per-file speedup is measured on D1 (small files, 130 targets). On GPU, AnonShield benefits from accelerated NER inference, yielding ~3×–~17× over v2.0 per file (mean-based). On CPU-only hardware, AnonShield loses GPU acceleration (its per-file time grows ~5.5×) while v2.0 is already CPU-bound — so per-file speedup on CPU is roughly the GPU speedup ÷ 5.5, and AnonShield may be slower per file without a GPU. At D3 scale, however, the advantage recovers thanks to AnonShield's O(n) streaming architecture, whereas v2.0's throughput degrades with file size: ≥3,532× on GPU and ≥535× on CPU (D3 CPU times are measured in stored results).
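The O(n) streaming design can be sketched as a single pass over the input with an entity cache, so repeated indicators resolve to the same pseudonym without recomputation. This is a toy illustration under stated assumptions (regex-only detection, a plain dict as the cache), not AnonShield's actual code:

```python
import io
import re

IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def stream_anonymize(src, dst, cache):
    for line in src:  # O(n): each line is read, rewritten, and discarded
        def repl(m):
            ip = m.group(0)
            if ip not in cache:                      # miss: assign a new pseudonym
                cache[ip] = f"[IP_{len(cache):04x}]"
            return cache[ip]                         # hit: reuse, no recomputation
        dst.write(IP.sub(repl, line))

cache = {}
src = io.StringIO("scan 10.0.0.5 ok\nretry 10.0.0.5 fail\n")
dst = io.StringIO()
stream_anonymize(src, dst, cache)
print(dst.getvalue())
```

Memory stays bounded by the cache and the current line regardless of file size, which is why the architecture keeps its advantage at the 247 MB D3 scale where v2.0's throughput collapses.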
Verification options (in order of time cost):
Option A — Smoke test (~5–25 min depending on hardware): Verifies the full pipeline is functional on small subsets of D1, D1C, and D3.
# D1C includes image-based PDF targets — install Tesseract before running:
sudo apt install tesseract-ocr
./paper_data/test_minimal/run_tests.sh --skip-d2 # with NVIDIA GPU (D2 is private — skip it)
./paper_data/test_minimal/run_tests.sh --skip-d2 --cpu-only # no GPU

Paper hardware (example): GPU command 313.21 s (~5.22 min); CPU command 700.36 s (~11.67 min).
Expected: the final line reads RESULT: ALL PASSED. D2 is a private dataset not included in this repository; --skip-d2 omits those 4 steps so the script exits cleanly. Absolute runtimes on 500-row subsets will not match the paper's full-scale numbers, but the pipeline is verified end-to-end.
Option B — Spot check (~8–20 min after setup): Runs v2.0 and AnonShield on a ~512 KB subset of D3 CSV. v2.0 throughput is compute-limited and scales poorly with file size; AnonShield benefits from GPU acceleration when available, so the measured ratio varies by hardware. On first run, the script automatically sets up v2.0 and v3.0 environments and downloads model weights (~several GB) — this can take significantly longer depending on network speed. Subsequent runs skip setup entirely.
./paper_data/scripts/extract_datasets.sh # extract D3 from bundled zips (required once)
./paper_data/scripts/spot_check_claim1.sh # with NVIDIA GPU
./paper_data/scripts/spot_check_claim1.sh --cpu-only # no GPU

Paper hardware (example): GPU path (extract + spot_check) 248.62 s (~4.14 min); CPU path 282.86 s (~4.71 min). Tester — server B (AMD Ryzen 7 5800X · RTX 3060): GPU path 272.8 s (~4.55 min); CPU path 242.6 s (~4.04 min). Tester — server A (2× Intel Xeon E5-2650, CPU-only): GPU path skipped (no GPU); CPU path 966.0 s (~16.1 min).
Expected output (absolute times vary by hardware; speedup is larger with GPU):
══════════════════════════════════════════════════════════════
Claim #1 Spot Check (515 KB subset of D3 CSV)
══════════════════════════════════════════════════════════════
v2.0 default : XXX.X s (X.XX KB/s on this machine)
AnonShield standalone : XX.X s (XXX KB/s on this machine)
Speedup : XX× (varies by hardware — larger when GPU is available)
Extrapolating to full D3 (247 MB) via measured throughputs:
v2.0 on full D3 : ≥ XX.X h (lower bound — extrapolated from measured throughput)
AnonShield on full D3 : ≤ XXXX s (upper bound — AnonShield cache improves at scale)
Projected speedup: ≥ XX×
══════════════════════════════════════════════════════════════
Option C — Full D3 benchmark: Runtime is hardware-dependent and cannot be estimated without knowing the evaluator's machine.
./paper_data/scripts/extract_datasets.sh # extract D3 from bundled zips (~80 MB → ~700 MB)
./paper_data/scripts/reproduce_all_runs.sh --skip-d1 --skip-d2
./paper_data/scripts/analyze_all.sh

Paper hardware (measured): full Option C (reproduce_all_runs --skip-d1 --skip-d2 + analyze_all) completed in 22,906.43 s (~6.36 h) on GPU; ~103,035 s (~28.62 h) on CPU-only (derived from stored per-run wall_clock_time_sec).
Tester — server B (AMD Ryzen 7 5800X · RTX 3060): full Option C 32,549.1 s (~9.04 h) (extract 2.4 s; reproduce 32,489.3 s (~9.02 h); analyze 57.5 s).
Tester — server A (2× Intel Xeon E5-2650, CPU-only): full Option C estimated ~326,554 s (~90.7 h ≈ 3.78 days) — projected from 7 real D3 CSV runs (per-run avg: filtered ~6,023 s, hybrid ~6,009 s, standalone ~1,596 s, presidio ~6,137 s), extrapolated to the remaining runs and datasets using the 3.17× ratio relative to the paper hardware CPU stored results.
The stored benchmark_results.csv files under paper_data/results_paper/ contain the paper's original measurements and can be inspected directly without re-running.
Full dataset details and step-by-step instructions:
paper_data/EXPERIMENTS.md
Claim #2 — The filtered and hybrid strategies achieve F1 = 94.2% and Recall = 96.7% on the expert-annotated evaluation set

Paper reference: Accuracy evaluation section.
What this claim asserts: On a stratified sample of 67 OpenVAS vulnerability records annotated by three security specialists, the filtered and hybrid strategies achieve F1 = 94.2% and Recall = 96.7%. Annotation was performed by the paper authors and is not expected to be reproduced by evaluators — it required manual expert judgment across 13 entity types.
What evaluators can verify:
- Inspect the pre-computed annotated outputs directly (no re-running required):
- Re-run the tool on the evaluation dataset and compare the anonymized output against the reference:
⚠️ GPU recommended: This command uses the `SecureModernBERT-NER` transformer model. Measured runtimes for the `filtered` strategy on the 9.2 MB evaluation file: ~4 min on GPU (RTX 5060 Ti, 38.7 KB/s) and ~2 h on CPU (Intel i5-1035G1, 1.4 KB/s). For all 4 strategies: ~15–20 min on GPU, ~7–8 h on CPU. If you do not have an NVIDIA GPU, prefer Option 1 (pre-computed outputs) to avoid the long runtime.

Note on the progress bar: the time-remaining estimate shown in the terminal during processing is unreliable and should be ignored. During the first ~1–3 minutes the model is being compiled and loaded — throughput is near zero at this stage, causing the progress bar to project absurdly large estimates (e.g. "60 h remaining" or "2 h remaining"). Once the model finishes loading, the speed increases sharply and the run completes well within the times shown above. Do not interrupt the process based on the initial estimate.
python3 benchmark/benchmark.py \
--benchmark \
--file paper_data/evaluation/vulnnet_scans_openvas_compilado.csv \
--versions 3.0 \
--strategies filtered hybrid standalone presidio \
--transformer-model attack-vector/SecureModernBERT-NER \
--entities-to-preserve TOOL,PLATFORM,FILE_PATH,THREAT_ACTOR,SERVICE,REGISTRY_KEY,CAMPAIGN,MALWARE,SECTOR \
--anonymization-config paper_data/configs/anonymization_config_openvas.json

Paper hardware (example): command completed in 983.37 s (~16.39 min) after rebuilding the environment. Tester — server B (AMD Ryzen 7 5800X · RTX 3060): command completed in 1,210.6 s (~20.2 min). Tester — server A (2× Intel Xeon E5-2650, CPU-only): command completed in 15,879.4 s (~4.41 h).
Reference results (pre-computed, 67 records, 3 specialists, 13 entity types):
| Strategy | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| filtered | 724 | 64 | 25 | 91.9% | 96.7% | 94.2% |
| hybrid | 724 | 64 | 25 | 91.9% | 96.7% | 94.2% |
| standalone | 739 | 102 | 43 | 87.9% | 94.5% | 91.1% |
| presidio | 724 | 287 | 25 | 71.6% | 96.7% | 82.3% |
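The Precision/Recall/F1 values follow directly from the TP/FP/FN counts and can be re-derived in a few lines:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard metrics from confusion counts, as percentages rounded to 1 decimal."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(100 * precision, 1), round(100 * recall, 1), round(100 * f1, 1)

print(prf1(724, 64, 25))    # filtered / hybrid → (91.9, 96.7, 94.2)
print(prf1(739, 102, 43))   # standalone       → (87.9, 94.5, 91.1)
print(prf1(724, 287, 25))   # presidio         → (71.6, 96.7, 82.3)
```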
Annotation methodology and XLSX format details:
paper_data/evaluation/EVALUATION_DATA.md · ANNOTATION_MANUAL.md
Claim #3 — anonymization_config eliminates NER inference overhead, reducing D3 processing time significantly (paper hardware: ~9× GPU / ~55× CPU; actual speedup depends on GPU speed)
Paper reference: Config gain results section.
What this claim asserts: A schema-aware anonymization_config that specifies only force_anonymize and exclude directives bypasses the NER and regex pipeline entirely — no field undergoes inference. On GPU (paper hardware), this reduces D3 CSV processing from ~73 s to ~8 s (~9×). The CPU gain is larger because NER inference costs more without a GPU. The paper also reports gains on D2 (private dataset, not reproducible by evaluators).
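A config of the kind the claim describes might look like the sketch below. The directive keys (`force_anonymize`, `fields_to_exclude`, `fields_to_anonymize`) are the ones named in the text; the column names are hypothetical — the real schema is in `paper_data/configs/anonymization_config_cve.json`:

```python
import json

config = {
    "force_anonymize": ["reporter_email", "affected_host"],  # always pseudonymized, no NER needed
    "fields_to_exclude": ["cve_id", "cvss_score"],           # passed through untouched
    "fields_to_anonymize": [],  # empty → no field ever enters the NER/regex pipeline
}
print(json.dumps(config, indent=2))
```

With every field covered by `force_anonymize` or `fields_to_exclude`, the model never runs — which is why the claim's speedup is largest precisely where inference was the dominant cost (CPU-only runs).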
Dataset: D3 with paper_data/configs/anonymization_config_cve.json.
./paper_data/scripts/extract_datasets.sh # extract D3 from bundled zips (required once)
./paper_data/scripts/spot_check_claim3.sh # with NVIDIA GPU (~80 s)
./paper_data/scripts/spot_check_claim3.sh --cpu-only # no GPU (~490 s / ~8 min)

Paper hardware (example): GPU path (extract + spot_check) 83.70 s (~1.40 min); CPU path 504.43 s (~8.41 min). Tester — server B (AMD Ryzen 7 5800X · RTX 3060): GPU path 119.4 s (~1.99 min); CPU path 488.6 s (~8.14 min). Tester — server A (2× Intel Xeon E5-2650, CPU-only): GPU path skipped (no GPU); CPU path 1,670.1 s (~27.8 min).
Expected speedup: larger on CPU (NER inference costs more without a GPU, so removing it saves more). Absolute times vary by hardware.
══════════════════════════════════════════════════════════════
Claim #3 Spot Check (D3 CSV, AnonShield standalone)
══════════════════════════════════════════════════════════════
without config : XXX.X s
with config : X.X s
Config speedup : XX×
Note: for this specific config (only force_anonymize and
fields_to_exclude directives, zero fields_to_anonymize entries),
no field passes through the NER or regex pipeline, so GPU and
CPU times converge. The CPU gain is therefore larger than on GPU.
══════════════════════════════════════════════════════════════
Full reproduction steps:
paper_data/EXPERIMENTS.md
We welcome feedback, questions, and contributions from the community.
- Bugs & Feature Requests: Please open an issue on our GitHub repository.
- Direct Contact & Inquiries: For institutional questions, partnerships, or to report a security bug directly, reach out to our team at anonshield@unipampa.edu.br.
This project is licensed under the GNU General Public License v3.0. See LICENSE for the full text.
CLI Reference · Architecture · Anonymization Strategies · Extensibility · Benchmark Suite · Experiments & Datasets · Evaluation Data · Contributing · Changelog