This is a prototype of a de-identification pipeline for German medical admission notes from the cardiology domain. It uses a three-step approach to de-identify medical texts:
- Rule-based approach using Stanford RegexNER
- Spelling variant detection using Yuwono's approach based on Levenshtein distance
- Stanford NER based on a model trained on out-of-domain data
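The spelling variant step can be illustrated with a minimal sketch. It assumes that known identifiers (for example, names taken from the document header) are compared against each token, and that a token counts as a variant when its Levenshtein distance is at or below a threshold. The class name, method names, case-insensitive comparison, and threshold are hypothetical and are not taken from the project or from Yuwono et al.'s exact method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; names and threshold are assumptions, not the project's code.
class SpellingVariantSketch {

    // Levenshtein distance via dynamic programming with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the tokens that lie within maxDistance of any known identifier (case-insensitive).
    static List<String> findVariants(List<String> tokens, List<String> identifiers, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String token : tokens) {
            for (String id : identifiers) {
                int d = levenshtein(token.toLowerCase(Locale.ROOT), id.toLowerCase(Locale.ROOT));
                if (d <= maxDistance) {
                    hits.add(token);
                    break; // one match is enough to flag the token
                }
            }
        }
        return hits;
    }
}
```

With a threshold of 1, a misspelled token such as "Mustermnn" would be flagged as a variant of the known identifier "Mustermann", while unrelated tokens are left alone. A larger threshold catches more variants but also raises the risk of flagging ordinary words.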
This tool is an Eclipse project. There is currently no runnable JAR file available.
- Java 8
- Eclipse (tested with Oxygen)
- Stanford CoreNLP 3.8
- stanford-corenlp-3.8.0.jar
- Stanford CoreNLP German models 3.8
- stanford-german-corenlp-2017-06-09-models.jar
- Apache POI 3.17
- poi-3.17.jar (included)
- poi-ooxml-3.17.jar (included)
- poi-ooxml-schemas-3.17.jar (included)
- xmlbeans-2.6.0.jar (included and part of the Apache POI package)
- Gazetteers (included)
- (Currently the algorithm is configured with the default NER model)
- Self-trained Stanford NER model
- Copy this jar into the misc/ folder
- Define the folder path to the model in the Init class, line 27
- If you use the default model, comment out line 38 in the NamedEntityRecognizer class and add the following to line 39:
String serializedClassifier = "edu" + File.separator + "stanford" + File.separator + "nlp" + File.separator + "models" + File.separator + "ner" + File.separator + "german.conll.hgc_175m_600.crf.ser.gz";
The Stanford models are available here
Copy both jar files (stanford-corenlp-3.8.0.jar and stanford-german-corenlp-2017-06-09-models.jar) into the src/lib/ directory.
- Open Eclipse
- File -> Import
- Choose General -> Existing Projects into Workspace
- Select folder and click Finish
- You can start the program inside Eclipse. To do so, open the Init class in the init package. Right-click in the code and choose Run As -> Run Configurations.
- Now click on the "Arguments" tab and enter the location of the docx files. The program writes the following files into the folder containing the docx files:
- Outputs anonymized text in plain text (.anon)
- Outputs text labeled with NE in CoNLL 2002 format (.anon.tokenized)
- Outputs text labeled with (KEEP, ANONYMIZE) in CoNLL 2002 format (.tokenized_anon)
- Outputs text in CoNLL 2002 format (.anon.conll)
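As a rough illustration of the CoNLL 2002 style output, the sketch below writes one token per line followed by its label, with an empty line closing the sentence. The class and method names are hypothetical, and the exact column layout and label set of the files the pipeline produces are assumptions; see the example/ folder for the real output.

```java
import java.util.List;

// Illustrative sketch only; not the project's writer.
class ConllWriterSketch {

    // Serializes one sentence as "token label" lines, terminated by a blank line.
    static String toConll(List<String> tokens, List<String> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(tokens.get(i)).append(' ').append(labels.get(i)).append('\n');
        }
        sb.append('\n'); // blank line marks the sentence boundary
        return sb.toString();
    }
}
```

Calling `toConll` with the tokens `Frau`, `Maria`, `ist` and the labels `ANONYMIZE`, `ANONYMIZE`, `KEEP` yields three `token label` lines and a trailing blank line.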
This is an example input/output of the pipeline. For further output files and an example docx file see folder example/.
Frau Maria Mustermann, whf. in 10000 Musterstadt, Musterstr, 12a, ist am 01.02.1999 in unserer Ambulanz aufgenommen worden.
<SALUTE> <PER> <PER>, whf. in <PLZ> <LOC>, <LOC> <LOC>, ist am <DATE> in unserer Ambulanz aufgenommen worden.
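The replacement shown in this example can be sketched as follows, assuming the text has already been tokenized and each token carries either an entity label or `O` for "no entity". The class name, method name, and the use of `O` as the outside label are assumptions for illustration; the project's Anonymizer class may work differently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the project's Anonymizer.
class PlaceholderSketch {

    // Replaces every labeled token with a "<LABEL>" placeholder and keeps "O" tokens unchanged.
    static String replaceEntities(List<String> tokens, List<String> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels must have the same length");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String label = labels.get(i);
            out.add("O".equals(label) ? tokens.get(i) : "<" + label + ">");
        }
        // Joining with single spaces is a simplification; punctuation spacing is not restored.
        return String.join(" ", out);
    }
}
```

Each entity token is replaced individually, which is why a two-token name produces two consecutive placeholders rather than one.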
├── src (pipeline)
│ ├── init (execution and resource configuration)
│ ├── anonymizer (Pipeline modules)
├── lib (dependencies)
├── misc (resource files)
├── example (example docx input, de-identified and annotated CoNLL 2002 output files)
├── Reports (Reports, presentation)
├── evaluation (python template for evaluation purposes)
├── python_anon (alternative python de-identificator)
To run the tool independently of Eclipse, you can create a jar file.
- Click on File -> Export -> Runnable JAR file
- Choose an export destination
- Choose 'Package required libraries into generated jar'
- Click on Finish
To execute the program, copy the misc/ folder into the same directory as your jar. If you named the file anonymizer.jar, you can run it like this:
java -jar anonymizer.jar <path to docx files>
- Init: The initializing class, which contains important resource configuration. For execution it needs one argument: the path to the docx files.
- Anonymizer: This class anonymizes a given docx file with a table header containing contact information and returns the anonymized file with NE as placeholders. 'Peter Pan lives in London' becomes '<PER> <PER> lives in <LOC>'
- DocxReader: Reads the first table of a docx file and saves the content of all cells in an array of unique strings
- NamedEntityRecognizer: Loads the Stanford CoreNLP pipeline and annotates a given array of strings
- Yuwono, Steven Kester, Hwee Tou Ng, and Kee Yuan Ngiam. 2016. “Automated Anonymization as Spelling Variant Detection.” In Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP), 99–103. Osaka, Japan: The COLING 2016 Organizing Committee, December. http://aclweb.org/anthology/W16-4214.
- Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. “The Stanford CoreNLP Natural Language Processing Toolkit.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.