GitHub - ConductionNL/anonymization-experiments: Anonymization experiments that compare the performance of LLMs and a ML-based baseline model

Anonymisation experiments

This repository compares the performance of several open-source LLMs and an ML-based baseline (Presidio) on a Dutch anonymization / PII detection task. The best results come from the Cogito model (cogito_v1_preview_qwen_32B). The repo contains prompt templates, utilities to prepare and run LLM calls, evaluation utilities, and notebooks to reproduce the experiments on the provided datasets. This is an initial exploration; we recommend consolidating the code into a pipeline, as described under Next steps.

Main findings

We generated a Dutch PII dataset (data/PII_testset50.json) for the public domain, containing 50 documents.
Cogito performs best (F1 = 88.9; 92.1 after postprocessing), compared to a Presidio baseline of F1 = 58.4.
We recommend the LLM approach (Cogito) for this task, with prompt id pii_v1_4_2.

Best performance results

Results for Cogito with prompt pii_v1_4_2 after postprocessing:

Precision: 0.9254
Recall: 0.9160
F1: 0.9207
Accuracy: 0.8531

            label  precision    recall        f1  tp  pred  gold
0            NAME   0.960526  0.960526  0.960526  73    76    76
1    ORGANIZATION   0.873016  0.833333  0.852713  55    63    66
2           TITLE   1.000000  0.727273  0.842105  40    40    55
3         ADDRESS   0.863636  0.950000  0.904762  38    44    40
4         CASE_NO   0.760000  0.950000  0.844444  38    50    40
5           EMAIL   1.000000  1.000000  1.000000  29    29    29
6           PHONE   1.000000  1.000000  1.000000  27    27    27
7   DATE_OF_BIRTH   1.000000  1.000000  1.000000  21    21    21
8         CARD_NO   1.000000  1.000000  1.000000  16    16    16
9           ID_NO   1.000000  1.000000  1.000000  12    12    12
10            BSN   1.000000  1.000000  1.000000  11    11    11
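
The aggregate figures above can be recomputed from the tp / pred / gold columns of the per-label table. The sketch below is not taken from eval_utils.py; it assumes micro-averaging over all labels and reads "accuracy" as tp / (tp + fp + fn), which is the interpretation that matches the reported numbers. The function names are illustrative.

```python
# (label, tp, pred, gold) rows copied from the per-label table above.
ROWS = [
    ("NAME", 73, 76, 76),
    ("ORGANIZATION", 55, 63, 66),
    ("TITLE", 40, 40, 55),
    ("ADDRESS", 38, 44, 40),
    ("CASE_NO", 38, 50, 40),
    ("EMAIL", 29, 29, 29),
    ("PHONE", 27, 27, 27),
    ("DATE_OF_BIRTH", 21, 21, 21),
    ("CARD_NO", 16, 16, 16),
    ("ID_NO", 12, 12, 12),
    ("BSN", 11, 11, 11),
]


def prf(tp, pred, gold):
    """Precision, recall and F1 from true-positive, predicted and gold counts."""
    precision = tp / pred if pred else 0.0
    recall = tp / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_scores(rows):
    """Micro-averaged scores: sum the counts over labels, then compute once."""
    tp = sum(r[1] for r in rows)
    pred = sum(r[2] for r in rows)
    gold = sum(r[3] for r in rows)
    precision, recall, f1 = prf(tp, pred, gold)
    # Assumption: "accuracy" is tp / (tp + fp + fn), i.e. tp / (pred + gold - tp).
    accuracy = tp / (pred + gold - tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    for name, value in micro_scores(ROWS).items():
        print(f"{name}: {value:.4f}")
```

With the counts from the table, this reproduces the reported precision (0.9254), recall (0.9160), F1 (0.9207) and accuracy (0.8531).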

Installation

Tested with Python 3.10+.

pip install -r requirements.txt

Download NLP models

For running the Presidio baseline models, Dutch spaCy models are required:

python -m spacy download nl_core_news_sm
python -m spacy download nl_core_news_md
python -m spacy download nl_core_news_lg

For sentence segmentation with Stanza (used in dataset preparation notebooks), the Dutch model will be automatically downloaded on first use. To download it explicitly:

import stanza
stanza.download('nl')

Environment variables for LLM access

To use LLM inference (Fireworks AI or Hugging Face), set the API key:

export FIREWORKS_API_KEY=your_token
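
Before starting a batch run, it can help to fail early when the key is missing. The helper below is a minimal sketch; `get_fireworks_api_key` is a hypothetical name and is not part of llm_utils.py.

```python
import os


def get_fireworks_api_key(env=None):
    """Return the Fireworks API key, or raise a clear error if it is unset."""
    # Hypothetical helper: reads from os.environ unless a mapping is passed in.
    env = os.environ if env is None else env
    key = env.get("FIREWORKS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FIREWORKS_API_KEY is not set; export it before running LLM inference."
        )
    return key
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real environment.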

Repository structure

prompts.yml: Prompt templates for LLM PII detection (multiple variants emphasizing recall/precision and definitions)
src/utils/: Utility modules
- llm_utils.py: LLM calling, prompt rendering, batch prep, schema validation, offset recovery, experiment logging (runs.csv)
- eval_utils.py: Presidio baseline inference, span matching (IoU), per-row and aggregate metrics, confusion matrices, summary printers
- pii_utils.py: Sentence splitting (spaCy/Stanza), span offset recovery, sentence-aware span splitting, ANSI highlighting for quick text inspection
- bday_linker.py: Birthdate cue detection and linking utilities
src/models/: Model modules
- Presidio_models.py: Presidio AnalyzerEngine factory for Dutch (nl_core_news_*)
data/: Example datasets (main testset PII_testset50.json)
outputs-and-results/: Saved batch outputs and run artifacts (including runs.csv)
notebooks/: Notebooks
- client_testing.ipynb: minimal client workflow to extract Dutch PII entities using an LLM via the Fireworks API with JSON Schema-constrained outputs
- PII_create_batch_data.ipynb: generate provider-ready batch input JSONL for Dutch PII extraction test data
- PII_prompt_optimization.ipynb: explore prompt variants and quality trade-offs
- PII_dataset_prep.ipynb: preprocess datasets with sentence splits and gold alignment
- PII-batch-output-eval.ipynb: evaluate model JSONL batch outputs against gold
- wikineural_PII_eval.ipynb: evaluate model performance on Wikineural data
- finance_PII_eval.ipynb: evaluate model performance on finance data
- cogito_postprocessing.ipynb: post-process Cogito model outputs
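
The span matching (IoU) listed for eval_utils.py can be illustrated as follows. This is a generic sketch of the technique, not the repository's implementation: the greedy one-to-one matching, the requirement that labels agree, and the 0.5 threshold are all assumptions.

```python
def span_iou(a, b):
    """Intersection over union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_spans(pred, gold, threshold=0.5):
    """Count true positives among (start, end, label) spans.

    Each predicted span is greedily paired with the unused gold span of the
    same label that has the highest IoU, if that IoU reaches the threshold.
    """
    used = set()
    tp = 0
    for p in pred:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gold):
            if j in used or g[2] != p[2]:
                continue
            iou = span_iou(p[:2], g[:2])
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= threshold:
            used.add(best_j)
            tp += 1
    return tp
```

From the resulting tp count, precision is tp / len(pred) and recall is tp / len(gold).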

Next steps

A future pipeline will combine the functionalities currently presented in notebooks. The proposed workflow:

[input data file] (e.g., `data/PII_testset50.json`)
        |
        v
data preprocessing → `PII_dataset_prep.ipynb`
        | (output: `*_offsets.json` with sentence splits)
        |
        v
create batch data → `PII_create_batch_data.ipynb`
        | (output: `batch_input_data_*.jsonl`)
        |
        v
generate batch predictions → using prompt `pii_v1_4_2` via inference client
        | (output: `BIJOutputSet.jsonl`)
        |
        v      
postprocess outputs → `cogito_postprocessing.ipynb`
        | (output: DataFrame with offset-mapped predictions)
        |
        v      
evaluate results → `PII-batch-output-eval.ipynb`
        | (output: metrics and comparison reports)

Data formats

Gold datasets (`data/PII_testset50.json`)

Each item:

{
  "text": "... full document ...",
  "entities": [
    {"text": "Westerdijk 45, 1621 LE Hoorn", "label": "ADDRESS", "sent_index": 2},
    {"text": "0229-252 888", "label": "PHONE", "sent_index": 4}
  ]
}

text: original document
entities: gold annotations, each tagged with the index of the sentence it occurs in (sent_index)

Offset variants (*_offsets.json) include start/end character indices.
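
One way to derive character indices for the gold format above is to search for each entity string in the document text. The sketch below is illustrative only: the key names `start` and `end` are assumptions about the `*_offsets.json` layout, and it assumes repeated mentions of the same string are listed in document order. It does not reproduce the sentence-aware logic in pii_utils.py.

```python
def add_offsets(item):
    """Return a copy of a gold item whose entities carry start/end indices."""
    text = item["text"]
    cursor_by_text = {}  # where to resume searching for each entity string
    entities = []
    for ent in item["entities"]:
        # Resume after the previous hit so repeated strings map to
        # successive occurrences instead of always the first one.
        start = text.find(ent["text"], cursor_by_text.get(ent["text"], 0))
        if start == -1:
            # Entity text not found verbatim; keep it but mark offsets unknown.
            entities.append({**ent, "start": None, "end": None})
            continue
        end = start + len(ent["text"])
        cursor_by_text[ent["text"]] = end
        entities.append({**ent, "start": start, "end": end})
    return {"text": text, "entities": entities}
```

For every recovered entity, `text[start:end]` equals the entity's `text` field, which is a cheap sanity check before evaluation.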

Expected LLM output schema

Returned per-prompt (strict JSON):

{
  "entities": [
    {
      "text": "<exact substring>",
      "sentence_index": 0,
      "nth": 0,
      "label": "<LABEL>",
      "confidence": 0.90
    }
  ]
}

See prompts.yml for label definitions and rules.
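
Model responses can be checked against this shape before evaluation. The following is a minimal sketch using only the standard library, not the schema validation in llm_utils.py. `parse_entities` is a hypothetical name, the label set is copied from this README, and because the meaning of `nth` is not spelled out here, the sketch only checks its type.

```python
import json

# Labels as listed in this README.
LABELS = {
    "BSN", "NAME", "EMAIL", "PHONE", "ADDRESS", "ID_NO",
    "DATE_OF_BIRTH", "CASE_NO", "TITLE", "CARD_NO", "ORGANIZATION",
}


def parse_entities(raw, sentences):
    """Parse a raw JSON response and keep only entities that pass basic checks."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return []
    valid = []
    for ent in data.get("entities", []):
        if not isinstance(ent, dict):
            continue
        well_formed = (
            isinstance(ent.get("text"), str)
            and isinstance(ent.get("sentence_index"), int)
            and isinstance(ent.get("nth"), int)
            and ent.get("label") in LABELS
            and isinstance(ent.get("confidence"), (int, float))
        )
        if not well_formed:
            continue
        idx = ent["sentence_index"]
        # "Exact substring": the text must occur verbatim in the indexed sentence.
        if 0 <= idx < len(sentences) and ent["text"] in sentences[idx]:
            valid.append(ent)
    return valid
```

Dropping malformed entities rather than raising keeps a single bad item from discarding an otherwise usable response; a stricter pipeline might prefer to raise instead.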

Label definitions

BSN – Burgerservicenummer (Dutch citizen service number)
NAME – Full names, including official titles or honorifics
EMAIL – Email addresses
PHONE – Telephone or mobile numbers
ADDRESS – Full or partial addresses, including street name (e.g. Haarlemmerweg), house number (e.g. 123), ZIP/postcode (e.g. 1041AD), city (e.g. Haarlem), and region (e.g. Noord-Holland)
ID_NO – Passport number, ID card number, driver's license number, tax ID, or other official identifiers
DATE_OF_BIRTH – Date of birth
CASE_NO – Case or file numbers, internal reference IDs
TITLE – Occupational or job title
CARD_NO – IBAN or credit card number
ORGANIZATION – Organization or company name
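
Once entities carry character offsets, the labels above can be used as placeholders when anonymizing a document. This is an illustrative sketch rather than something the repository provides: the `[LABEL]` placeholder format and the choice to skip overlapping spans are assumptions.

```python
def redact(text, entities):
    """Replace each (start, end, label) span in text with a [LABEL] placeholder."""
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            # Overlaps a span that was already replaced; skip it.
            continue
        parts.append(text[cursor:start])
        parts.append(f"[{label}]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


if __name__ == "__main__":
    sample = "Bel Jan de Vries op 0229-252 888."
    print(redact(sample, [(4, 16, "NAME"), (20, 32, "PHONE")]))
```

Sorting the spans first means the input does not need to be in document order.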
