
ConductionNL/anonymization-experiments


Anonymisation experiments

This repository compares the performance of several open-source LLMs on a Dutch anonymization / PII detection task. The Cogito model (cogito_v1_preview_qwen_32B) gave the best results. The repo contains prompt templates, utilities to prepare and run LLM calls, evaluation utilities, and notebooks to reproduce the experiments on the provided datasets. This is an initial exploration; we recommend turning this code into a pipeline as described under Next steps.

Main findings

  • We generated a Dutch PII dataset (data/PII_testset50.json) for the public domain, containing 50 documents.
  • Cogito performed best (F1 = 88.9; 92.1 after postprocessing), compared to a Presidio baseline of F1 = 58.4
  • We recommend using the LLM approach (Cogito) for this task, with prompt id pii_v1_4_2

Best performance results

Results for Cogito with prompt pii_v1_4_2 after postprocessing:

Precision: 0.9254
Recall: 0.9160
F1: 0.9207
Accuracy: 0.8531

            label  precision    recall        f1  tp  pred  gold
0            NAME   0.960526  0.960526  0.960526  73    76    76
1    ORGANIZATION   0.873016  0.833333  0.852713  55    63    66
2           TITLE   1.000000  0.727273  0.842105  40    40    55
3         ADDRESS   0.863636  0.950000  0.904762  38    44    40
4         CASE_NO   0.760000  0.950000  0.844444  38    50    40
5           EMAIL   1.000000  1.000000  1.000000  29    29    29
6           PHONE   1.000000  1.000000  1.000000  27    27    27
7   DATE_OF_BIRTH   1.000000  1.000000  1.000000  21    21    21
8         CARD_NO   1.000000  1.000000  1.000000  16    16    16
9           ID_NO   1.000000  1.000000  1.000000  12    12    12
10            BSN   1.000000  1.000000  1.000000  11    11    11
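
The aggregate scores can be recomputed from the tp / pred / gold columns of the table. The sketch below does this with micro-averaging. The counts are copied from the table; the function name `micro_metrics` is only illustrative, and the accuracy formula TP / (TP + FP + FN) is an assumption that happens to reproduce the reported value, not something confirmed from eval_utils.py.

```python
# Per-label counts copied from the results table: (tp, pred, gold).
COUNTS = {
    "NAME": (73, 76, 76),
    "ORGANIZATION": (55, 63, 66),
    "TITLE": (40, 40, 55),
    "ADDRESS": (38, 44, 40),
    "CASE_NO": (38, 50, 40),
    "EMAIL": (29, 29, 29),
    "PHONE": (27, 27, 27),
    "DATE_OF_BIRTH": (21, 21, 21),
    "CARD_NO": (16, 16, 16),
    "ID_NO": (12, 12, 12),
    "BSN": (11, 11, 11),
}


def micro_metrics(counts):
    """Micro-averaged metrics: pool all labels, then compute the ratios."""
    tp = sum(c[0] for c in counts.values())
    pred = sum(c[1] for c in counts.values())
    gold = sum(c[2] for c in counts.values())
    precision = tp / pred
    recall = tp / gold
    f1 = 2 * precision * recall / (precision + recall)
    # Assumption: accuracy = TP / (TP + FP + FN), where FP = pred - tp
    # and FN = gold - tp. This is inferred from the reported numbers.
    accuracy = tp / (pred + gold - tp)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


metrics = micro_metrics(COUNTS)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Pooling the counts before dividing means frequent labels such as NAME weigh more heavily than rare ones such as BSN, which is why the aggregate F1 sits close to the scores of the larger classes.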

Installation

Tested with Python 3.10+.

pip install -r requirements.txt

Download NLP models

For running the Presidio baseline models, Dutch spaCy models are required:

python -m spacy download nl_core_news_sm
python -m spacy download nl_core_news_md
python -m spacy download nl_core_news_lg

For sentence segmentation with Stanza (used in dataset preparation notebooks), the Dutch model will be automatically downloaded on first use. To download it explicitly:

import stanza
stanza.download('nl')

Environment variables for LLM access

To use LLM inference (Fireworks AI or Hugging Face), set the API token:

export FIREWORKS_API_KEY=your_token
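
A notebook or script can then read the token from the environment and fail early with a clear message when it is missing. This is a minimal sketch; the helper name `get_fireworks_api_key` is hypothetical and is not taken from llm_utils.py.

```python
import os


def get_fireworks_api_key(env=os.environ):
    """Return the Fireworks token from the environment, or raise if unset.

    `env` defaults to the process environment; passing a plain dict makes
    the helper easy to test.
    """
    key = env.get("FIREWORKS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FIREWORKS_API_KEY is not set; export it before running LLM inference."
        )
    return key
```

Checking once at startup avoids discovering a missing token only after a batch has been prepared.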

Repository structure

  • prompts.yml: Prompt templates for LLM PII detection (multiple variants emphasizing recall/precision and definitions)
  • src/utils/: Utility modules
    • llm_utils.py: LLM calling, prompt rendering, batch prep, schema validation, offset recovery, experiment logging (runs.csv)
    • eval_utils.py: Presidio baseline inference, span matching (IoU), per-row and aggregate metrics, confusion matrices, summary printers
    • pii_utils.py: Sentence splitting (spaCy/Stanza), span offset recovery, sentence-aware span splitting, ANSI highlighting for quick text inspection
    • bday_linker.py: Birthdate cue detection and linking utilities
  • src/models/: Model modules
    • Presidio_models.py: Presidio AnalyzerEngine factory for Dutch (nl_core_news_*)
  • data/: Example datasets (main testset PII_testset50.json)
  • outputs-and-results/: Saved batch outputs and run artifacts (including runs.csv)
  • notebooks/
    • client_testing.ipynb: minimal client workflow to extract Dutch PII entities using an LLM via the Fireworks API with JSON Schema-constrained outputs
    • PII_create_batch_data.ipynb: generate provider-ready batch input JSONL for Dutch PII extraction test data
    • PII_prompt_optimization.ipynb: explore prompt variants and quality trade-offs
    • PII_dataset_prep.ipynb: preprocess datasets with sentence splits and gold alignment
    • PII-batch-output-eval.ipynb: evaluate model JSONL batch outputs against gold
    • wikineural_PII_eval.ipynb: evaluate model performance on Wikineural data
    • finance_PII_eval.ipynb: evaluate model performance on finance data
    • cogito_postprocessing.ipynb: post-process Cogito model outputs
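
The span matching (IoU) listed for eval_utils.py can be illustrated as follows. This is a sketch of the general technique, not the repository's implementation: the function names, the greedy one-to-one matching, and the 0.5 threshold are all assumptions made for the example.

```python
def span_iou(a, b):
    """Intersection over union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(pred, gold, threshold=0.5):
    """Greedily match predicted spans to gold spans with the same label.

    Each span is a (start, end, label) tuple. A gold span can be matched
    at most once; a match needs IoU >= threshold (assumed value).
    """
    matched_gold = set()
    tp = 0
    for p_start, p_end, p_label in pred:
        best_j, best_iou = None, 0.0
        for j, (g_start, g_end, g_label) in enumerate(gold):
            if j in matched_gold or g_label != p_label:
                continue
            iou = span_iou((p_start, p_end), (g_start, g_end))
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= threshold:
            matched_gold.add(best_j)
            tp += 1
    return tp


pred_spans = [(0, 10, "NAME"), (20, 30, "PHONE")]
gold_spans = [(0, 9, "NAME"), (40, 50, "PHONE")]
print(count_true_positives(pred_spans, gold_spans))
```

An overlap-based criterion tolerates small boundary differences (for example a trailing comma included in the prediction) that exact matching would count as both a false positive and a false negative.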

Next steps

A future pipeline will combine the functionalities currently presented in notebooks. The proposed workflow:

[input data file] (e.g., `data/PII_testset50.json`)
        |
        v
data preprocessing → `PII_dataset_prep.ipynb`
        | (output: `*_offsets.json` with sentence splits)
        |
        v
create batch data → `PII_create_batch_data.ipynb`
        | (output: `batch_input_data_*.jsonl`)
        |
        v
generate batch predictions → using prompt `pii_v1_4_2` via inference client
        | (output: `BIJOutputSet.jsonl`)
        |
        v      
postprocess outputs → `cogito_postprocessing.ipynb`
        | (output: DataFrame with offset-mapped predictions)
        |
        v      
evaluate results → `PII-batch-output-eval.ipynb`
        | (output: metrics and comparison reports)

Data formats

Gold datasets (data/PII_testset50.json)

Each item:

{
  "text": "... full document ...",
  "entities": [
    {"text": "Westerdijk 45, 1621 LE Hoorn", "label": "ADDRESS", "sent_index": 2},
    {"text": "0229-252 888", "label": "PHONE", "sent_index": 4}
  ]
}
  • text: original document
  • entities: gold annotations per sentence (sent_index)

Offset variants (*_offsets.json) include start/end character indices.
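
As an illustration of how start/end indices relate to the gold format, the sketch below derives offsets by searching for each entity text in the document. It is an assumption-laden example: the helper name `add_offsets`, the key names `start` and `end`, and the short sample document are invented here, and the actual preprocessing in PII_dataset_prep.ipynb may work differently (for instance, per sentence).

```python
def add_offsets(item):
    """Attach character offsets to each entity by scanning the document.

    Repeated mentions of the same text are resolved left to right, so the
    second mention gets the second occurrence. Entities whose text is not
    found get start/end of None.
    """
    text = item["text"]
    resume_at = {}  # entity text -> position to resume searching from
    result = []
    for ent in item["entities"]:
        start = text.find(ent["text"], resume_at.get(ent["text"], 0))
        if start == -1:
            result.append({**ent, "start": None, "end": None})
            continue
        end = start + len(ent["text"])
        resume_at[ent["text"]] = end
        result.append({**ent, "start": start, "end": end})
    return result


# Invented sample document that reuses the two entity strings shown above.
example = {
    "text": "Wij zijn gevestigd aan Westerdijk 45, 1621 LE Hoorn. Bel 0229-252 888.",
    "entities": [
        {"text": "Westerdijk 45, 1621 LE Hoorn", "label": "ADDRESS", "sent_index": 0},
        {"text": "0229-252 888", "label": "PHONE", "sent_index": 1},
    ],
}

for ent in add_offsets(example):
    print(ent["label"], ent["start"], ent["end"])
```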

Expected LLM output schema

Returned per prompt (strict JSON):

{
  "entities": [
    {
      "text": "<exact substring>",
      "sentence_index": 0,
      "nth": 0,
      "label": "<LABEL>",
      "confidence": 0.90
    }
  ]
}

See prompts.yml for label definitions and rules.
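
A response in this shape can be checked and mapped back to character positions roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the label set is copied from the label definitions in this README, and `nth` is interpreted as the zero-based occurrence of `text` within the indicated sentence, which the README does not state explicitly.

```python
import json

ALLOWED_LABELS = {
    "BSN", "NAME", "EMAIL", "PHONE", "ADDRESS", "ID_NO",
    "DATE_OF_BIRTH", "CASE_NO", "TITLE", "CARD_NO", "ORGANIZATION",
}
REQUIRED_FIELDS = {
    "text": str,
    "sentence_index": int,
    "nth": int,
    "label": str,
    "confidence": (int, float),
}


def parse_entities(raw):
    """Parse the model's JSON string and check the fields of every entity."""
    payload = json.loads(raw)
    entities = payload.get("entities") if isinstance(payload, dict) else None
    if not isinstance(entities, list):
        raise ValueError("'entities' must be a list")
    for i, ent in enumerate(entities):
        if not isinstance(ent, dict):
            raise ValueError(f"entity {i} is not an object")
        for key, expected in REQUIRED_FIELDS.items():
            if not isinstance(ent.get(key), expected):
                raise ValueError(f"entity {i}: bad or missing '{key}'")
        if ent["label"] not in ALLOWED_LABELS:
            raise ValueError(f"entity {i}: unknown label {ent['label']!r}")
    return entities


def locate(ent, sentences):
    """Return (start, end) of the nth occurrence within its sentence, or None.

    Assumption: 'nth' counts occurrences of 'text' from zero.
    """
    if not 0 <= ent["sentence_index"] < len(sentences):
        return None
    sentence = sentences[ent["sentence_index"]]
    start = -1
    for _ in range(ent["nth"] + 1):
        start = sentence.find(ent["text"], start + 1)
        if start == -1:
            return None
    return start, start + len(ent["text"])


raw = '{"entities": [{"text": "Jan", "sentence_index": 0, "nth": 1, "label": "NAME", "confidence": 0.9}]}'
sentences = ["Jan belt Jan."]
entities = parse_entities(raw)
print(locate(entities[0], sentences))
```

Returning None instead of raising when the text cannot be found lets a caller log and skip hallucinated spans without aborting a whole batch.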

Label definitions

  • BSN – Burgerservicenummer (Dutch citizen service number)
  • NAME – Full names, including official titles or honorifics
  • EMAIL – Email addresses
  • PHONE – Telephone or mobile numbers
  • ADDRESS – Exact addresses or parts of an address, including street name (e.g. Haarlemmerweg); house number (e.g. 123); ZIP/postcode (e.g. 1041AD); city (e.g. Haarlem); region (e.g. Noord-Holland)
  • ID_NO – Passport number, ID card number, driver's license number, tax ID, or other official identifiers
  • DATE_OF_BIRTH – Date of birth
  • CASE_NO – Case or file numbers, internal reference IDs
  • TITLE – Occupational or job title
  • CARD_NO – IBAN or credit card number
  • ORGANIZATION – Organization or company name

About

Anonymization experiments that compare the performance of LLMs and an ML-based baseline model
