This repository compares the performance of various open-source LLMs on a Dutch anonymization / PII detection task. We find the best results for the Cogito model (cogito_v1_preview_qwen_32B). The repo contains prompt templates, utilities to prepare and run LLM calls, evaluation utilities, and notebooks to reproduce the experiments on the provided datasets. This is an initial exploration; we recommend consolidating the code into a pipeline as described below.
- We generated a Dutch PII dataset (`data/PII_testset50.json`) for the public domain, containing 50 documents.
- Cogito performs best (F1 = 88.9; 92.1 after postprocessing), compared to a Presidio baseline of F1 = 58.4.
- We recommend using the LLM approach (Cogito) for this task, with prompt id `pii_v1_4_2`.
Results for Cogito with prompt `pii_v1_4_2` after postprocessing:

- Precision: 0.9254
- Recall: 0.9160
- F1: 0.9207
- Accuracy: 0.8531

| label | precision | recall | f1 | tp | pred | gold |
|---|---|---|---|---|---|---|
| NAME | 0.960526 | 0.960526 | 0.960526 | 73 | 76 | 76 |
| ORGANIZATION | 0.873016 | 0.833333 | 0.852713 | 55 | 63 | 66 |
| TITLE | 1.000000 | 0.727273 | 0.842105 | 40 | 40 | 55 |
| ADDRESS | 0.863636 | 0.950000 | 0.904762 | 38 | 44 | 40 |
| CASE_NO | 0.760000 | 0.950000 | 0.844444 | 38 | 50 | 40 |
| EMAIL | 1.000000 | 1.000000 | 1.000000 | 29 | 29 | 29 |
| PHONE | 1.000000 | 1.000000 | 1.000000 | 27 | 27 | 27 |
| DATE_OF_BIRTH | 1.000000 | 1.000000 | 1.000000 | 21 | 21 | 21 |
| CARD_NO | 1.000000 | 1.000000 | 1.000000 | 16 | 16 | 16 |
| ID_NO | 1.000000 | 1.000000 | 1.000000 | 12 | 12 | 12 |
| BSN | 1.000000 | 1.000000 | 1.000000 | 11 | 11 | 11 |
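
The overall scores are consistent with micro-averaging the per-label counts in the table, with accuracy taken as TP / (TP + FP + FN). The sketch below recomputes them under that assumption; the function name `micro_metrics` is illustrative, and the repository's own evaluation code may compute these values differently.

```python
# Per-label counts copied from the results table: (tp, pred, gold).
COUNTS = {
    "NAME": (73, 76, 76),
    "ORGANIZATION": (55, 63, 66),
    "TITLE": (40, 40, 55),
    "ADDRESS": (38, 44, 40),
    "CASE_NO": (38, 50, 40),
    "EMAIL": (29, 29, 29),
    "PHONE": (27, 27, 27),
    "DATE_OF_BIRTH": (21, 21, 21),
    "CARD_NO": (16, 16, 16),
    "ID_NO": (12, 12, 12),
    "BSN": (11, 11, 11),
}


def micro_metrics(counts):
    """Micro-average metrics by pooling counts over all labels."""
    tp = sum(c[0] for c in counts.values())
    pred = sum(c[1] for c in counts.values())
    gold = sum(c[2] for c in counts.values())
    return {
        "precision": tp / pred,
        "recall": tp / gold,
        "f1": 2 * tp / (pred + gold),
        # Assumed definition: TP / (TP + FP + FN), with FP = pred - tp and FN = gold - tp.
        "accuracy": tp / (pred + gold - tp),
    }


metrics = micro_metrics(COUNTS)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Pooling counts this way weights each label by its frequency, so the lower recall on TITLE (40 of 55 gold spans found) has more influence on the overall score than the small BSN class.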
Tested with Python 3.10+.

    pip install -r requirements.txt

For running the Presidio baseline models, Dutch spaCy models are required:

    python -m spacy download nl_core_news_sm
    python -m spacy download nl_core_news_md
    python -m spacy download nl_core_news_lg

For sentence segmentation with Stanza (used in dataset preparation notebooks), the Dutch model is downloaded automatically on first use. To download it explicitly:

    import stanza
    stanza.download('nl')

For using LLM inference (Fireworks AI or Hugging Face):

    export FIREWORKS_API_KEY=your__token

- `prompts.yml`: Prompt templates for LLM PII detection (multiple variants emphasizing recall/precision and definitions)
- `src/utils/`: Utility modules
  - `llm_utils.py`: LLM calling, prompt rendering, batch prep, schema validation, offset recovery, experiment logging (`runs.csv`)
  - `eval_utils.py`: Presidio baseline inference, span matching (IoU), per-row and aggregate metrics, confusion matrices, summary printers
  - `pii_utils.py`: Sentence splitting (spaCy/Stanza), span offset recovery, sentence-aware span splitting, ANSI highlighting for quick text inspection
  - `bday_linker.py`: Birthdate cue detection and linking utilities
- `src/models/`: Model modules
  - `Presidio_models.py`: Presidio `AnalyzerEngine` factory for Dutch (`nl_core_news_*`)
- `data/`: Example datasets (main test set `PII_testset50.json`)
- `outputs-and-results/`: Saved batch outputs and run artifacts (including `runs.csv`)
- `notebooks/`:
  - `client_testing.ipynb`: minimal client workflow to extract Dutch PII entities using an LLM via the Fireworks API with JSON Schema-constrained outputs
  - `PII_create_batch_data.ipynb`: generate provider-ready batch input JSONL for Dutch PII extraction test data
  - `PII_prompt_optimization.ipynb`: explore prompt variants and quality trade-offs
  - `PII_dataset_prep.ipynb`: preprocess datasets with sentence splits and gold alignment
  - `PII-batch-output-eval.ipynb`: evaluate model JSONL batch outputs against gold
  - `wikineural_PII_eval.ipynb`: evaluate model performance on Wikineural data
  - `finance_PII_eval.ipynb`: evaluate model performance on finance data
  - `cogito_postprocessing.ipynb`: post-process Cogito model outputs
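
The evaluation utilities are described as matching spans by IoU (intersection over union). As an illustration of that idea only, here is a minimal sketch of character-span IoU with greedy one-to-one matching. The function names, the 0.5 threshold, and the requirement that labels agree are assumptions for this example, not the actual behaviour of `eval_utils.py`.

```python
def span_iou(a, b):
    """IoU of two half-open character spans given as (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_spans(pred, gold, threshold=0.5):
    """Greedily pair predicted and gold spans, best IoU first.

    pred and gold are lists of (start, end, label). A pair is a candidate
    only if the labels agree and IoU >= threshold (both are assumptions).
    Returns a list of (pred_index, gold_index) pairs.
    """
    candidates = []
    for i, (ps, pe, pl) in enumerate(pred):
        for j, (gs, ge, gl) in enumerate(gold):
            if pl != gl:
                continue
            iou = span_iou((ps, pe), (gs, ge))
            if iou >= threshold:
                candidates.append((iou, i, j))
    candidates.sort(key=lambda c: -c[0])  # highest overlap first

    used_pred, used_gold, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_pred and j not in used_gold:
            used_pred.add(i)
            used_gold.add(j)
            matches.append((i, j))
    return matches


gold = [(0, 28, "ADDRESS"), (40, 52, "PHONE")]
pred = [(0, 24, "ADDRESS"), (40, 52, "PHONE"), (60, 70, "NAME")]
matches = match_spans(pred, gold)
tp = len(matches)
print(f"tp={tp} precision={tp / len(pred):.3f} recall={tp / len(gold):.3f}")
```

A threshold below 1.0 lets a prediction that clips part of an address still count as a hit, which matters for long multi-token entities; an exact-match evaluation corresponds to `threshold=1.0`.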
A future pipeline will combine the functionalities currently presented in notebooks. The proposed workflow:
[input data file] (e.g., `data/PII_testset50.json`)
|
v
data preprocessing → `PII_dataset_prep.ipynb`
| (output: `*_offsets.json` with sentence splits)
|
v
create batch data → `PII_create_batch_data.ipynb`
| (output: `batch_input_data_*.jsonl`)
|
v
generate batch predictions → using prompt `pii_v1_4_2` via inference client
| (output: `BIJOutputSet.jsonl`)
|
v
postprocess outputs → `cogito_postprocessing.ipynb`
| (output: DataFrame with offset-mapped predictions)
|
v
evaluate results → `PII-batch-output-eval.ipynb`
| (output: metrics and comparison reports)
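
The batch steps in this workflow exchange JSONL files, that is, one JSON object per line. The sketch below shows writing and reading such a file. The record fields `custom_id`, `prompt_id`, and `text`, and both function names, are hypothetical; the real batch format is provider-specific and is produced by `PII_create_batch_data.ipynb`.

```python
import json
import os
import tempfile


def write_batch_jsonl(documents, path, prompt_id="pii_v1_4_2"):
    """Write one JSON record per document; the field names are illustrative only."""
    with open(path, "w", encoding="utf-8") as fh:
        for index, doc in enumerate(documents):
            record = {
                "custom_id": f"doc-{index}",  # hypothetical key to join outputs back to inputs
                "prompt_id": prompt_id,
                "text": doc["text"],
            }
            # ensure_ascii=False keeps Dutch diacritics readable in the file.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


docs = [{"text": "Jan woont in Hoorn."}, {"text": "Bel 0229-252 888."}]
with tempfile.TemporaryDirectory() as tmp_dir:
    batch_path = os.path.join(tmp_dir, "batch_input_data_demo.jsonl")
    write_batch_jsonl(docs, batch_path)
    rows = read_jsonl(batch_path)
print(rows[0])
```

Carrying a stable identifier on every record is what allows batch outputs, which may come back in a different order, to be aligned with the gold annotations during evaluation.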
Each item in the input dataset has the following structure:
{
"text": "... full document ...",
"entities": [
{"text": "Westerdijk 45, 1621 LE Hoorn", "label": "ADDRESS", "sent_index": 2},
{"text": "0229-252 888", "label": "PHONE", "sent_index": 4}
]
}

- `text`: original document
- `entities`: gold annotations per sentence (`sent_index`)

Offset variants (`*_offsets.json`) include start/end character indices.
The model returns strict JSON for each prompt:
{
"entities": [
{
"text": "<exact substring>",
"sentence_index": 0,
"nth": 0,
"label": "<LABEL>",
"confidence": 0.90
}
]
}

See `prompts.yml` for label definitions and rules.
- BSN – Burgerservicenummer (Dutch citizen service number)
- NAME – Full names, including official titles or honorifics
- EMAIL – Email addresses
- PHONE – Telephone or mobile numbers
- ADDRESS – Exact addresses or parts of an address, including street name (e.g. Haarlemmerweg), house number (e.g. 123), ZIP/postcode (e.g. 1041AD), city (e.g. Haarlem), and region (e.g. Noord-Holland)
- ID_NO – Passport number, ID card number, driver's license number, tax ID, or other official identifiers
- DATE_OF_BIRTH – Date of birth
- CASE_NO – Case or file numbers, internal reference IDs
- TITLE – Occupational or job title
- CARD_NO – IBAN or credit card number
- ORGANIZATION – Organization or company name
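
The model output carries the entity text, a `sentence_index`, and an `nth` field rather than character offsets, so offsets have to be recovered afterwards. The sketch below filters entities against the label list above and maps each one to document-level offsets. It assumes that `nth` is the zero-based occurrence of `text` within its sentence and that sentences are available as `(sentence_text, start_offset)` pairs; both assumptions and all function names are illustrative and not taken from `llm_utils.py`.

```python
import json

ALLOWED_LABELS = {
    "BSN", "NAME", "EMAIL", "PHONE", "ADDRESS", "ID_NO",
    "DATE_OF_BIRTH", "CASE_NO", "TITLE", "CARD_NO", "ORGANIZATION",
}


def find_nth(haystack, needle, nth):
    """Start index of the nth (zero-based) occurrence of needle, or -1."""
    if not needle or nth < 0:
        return -1
    start = -1
    for _ in range(nth + 1):
        start = haystack.find(needle, start + 1)
        if start == -1:
            return -1
    return start


def recover_offsets(raw_json, sentences):
    """Map model entities to (start, end, label) document offsets.

    sentences: list of (sentence_text, start_offset_in_document).
    Entities with an unknown label, an out-of-range sentence index, or
    text that is not found verbatim are skipped.
    """
    spans = []
    for ent in json.loads(raw_json)["entities"]:
        if ent["label"] not in ALLOWED_LABELS:
            continue
        index = ent["sentence_index"]
        if not 0 <= index < len(sentences):
            continue
        sent_text, sent_start = sentences[index]
        local = find_nth(sent_text, ent["text"], ent["nth"])
        if local == -1:
            continue
        start = sent_start + local
        spans.append((start, start + len(ent["text"]), ent["label"]))
    return spans


sentences = [
    ("Dit is een test.", 0),
    ("Bel 0229-252 888 of 0229-252 888.", 17),
]
raw = json.dumps({"entities": [
    {"text": "0229-252 888", "sentence_index": 1, "nth": 1,
     "label": "PHONE", "confidence": 0.90},
    {"text": "test", "sentence_index": 0, "nth": 0,
     "label": "UNKNOWN", "confidence": 0.50},
]})
spans = recover_offsets(raw, sentences)
print(spans)
```

Asking the model for an exact substring plus an occurrence index avoids relying on it to count characters, and the verbatim lookup doubles as a check that the returned text actually appears in the sentence.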