GGPONC - A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines

This repository contains the code to reproduce the results in: Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. arXiv:2007.06400 [cs], accepted at LOUHI@EMNLP'20. https://arxiv.org/abs/2007.06400
Related links: arXiv, code on GitHub, data request at DKG.

Prerequisites

Requesting text data

GGPONC source files:
- Follow the instructions on the GGPONC website (Access & Download).
- Copy cpg-corpus-cms.xml into src/main/resources.
PubMed abstracts from German case reports and case descriptions:
- Install the Entrez API from NCBI or EDirect, the command-line tools for querying the PubMed infrastructure.
- Open a terminal and run esearch -db pubmed -query "Case Reports[Publication Type] AND GER[LA]" | efetch -format xml > allGermanPubMedCaseAbstracts.xml (this step can take an hour).
- Copy the extracted file allGermanPubMedCaseAbstracts.xml into src/main/resources.
JSYNCC v1.1: follow the instructions at https://github.com/JULIELab/jsyncc or contact Christina Lohr.
3000PA: no public access.
KRAUTS Corpus (Strötgen et al.)
WikiWarsDe Corpus (Strötgen et al.)

UMLS Terminology data

You need files from the UMLS.
After registering with UTS, you can download the UMLS files from the U.S. National Library of Medicine (NLM, part of the NIH).
We used the UMLS release 2019AB; you need the following files:
- 2019AB MRCONSO.RRF
- 2019AB MRSTY.RRF (only accessible from the full release zip file)
Unzip the files after downloading.
More information on UMLS releases
More information on the UMLS can be found in the UMLS® Reference Manual.
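After unzipping, it can help to confirm that MRCONSO.RRF actually contains German entries before running the later dictionary steps. The following is a minimal sketch, not part of this repository: it assumes the standard pipe-delimited MRCONSO.RRF layout described in the UMLS Reference Manual (CUI in the first field, language in the second, term string in the fifteenth). The function name `german_terms` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Field positions in MRCONSO.RRF (pipe-delimited, zero-based).
# Assumption: the standard column order CUI|LAT|...|STR|... is used.
CUI_COL, LAT_COL, STR_COL = 0, 1, 14


def german_terms(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (CUI, term) pairs for rows whose language field is GER."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR_COL:
            continue  # skip empty or malformed rows
        if fields[LAT_COL] == "GER":
            yield fields[CUI_COL], fields[STR_COL]
```

For example, `sum(1 for _ in german_terms(open("MRCONSO.RRF", encoding="utf-8")))` would count the German rows without loading the whole file into memory.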

Software requirements

Java 11 (we prefer OpenJDK)
Apache Maven (mvn)
Python 3
An IDE (we prefer Eclipse IDE or IntelliJ IDEA)

Configuration after downloading this repository

Configure the project as a Maven project
- In Eclipse: right-click on the project => Configure => Convert to Maven Project
- Command line: mvn compile

Processing the data

Conversion of GGPOnc corpus XML file to plain text and preprocessing

Run mvn compile, then execute mvn exec:java -Dexec.mainClass="de.hpi.guidelines.reader.GGPOncXMLReader" -Dexec.args="<Path to cpg-corpus-cms.xml>", or run GGPOncXMLReader.java (in package de.hpi.guidelines.reader) in Eclipse (Run As => Java Application).
Processing takes about a minute.
Afterwards, look into the directory /output for the results.

Create PubMed abstract text files

We downloaded the PubMed data on February 21, 2020. If you download PubMed data with the esearch commands now, you will receive a larger text corpus than our export. The file src/main/resources/usedPubMedIds_20200221.txt lists the PubMed identifiers we used from February 21, 2020.
To recreate the described data set from PubMed, import your extracted XML file and run src/main/extractPubMedCaseAbstracts.java. This code filters your newly created download down to the PubMed texts we used.
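The filtering idea can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not a port of extractPubMedCaseAbstracts.java: it assumes the ID file holds one PubMed identifier per line, that the download is a single well-formed `PubmedArticleSet` document, and that only `Abstract/AbstractText` elements are of interest. The function names `load_ids` and `filter_abstracts` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set


def load_ids(text: str) -> Set[str]:
    """Parse an ID list, assuming one PubMed identifier per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_abstracts(xml_text: str, wanted: Set[str]) -> Dict[str, str]:
    """Map PMID -> abstract text for the articles whose PMID is in `wanted`."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid not in wanted:
            continue  # also skips records without a PMID
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        # An abstract may be split into several labelled sections; join them.
        text = " ".join("".join(p.itertext()).strip() for p in parts)
        if text:
            result[pmid] = text
    return result
```

Records that are in the ID list but carry no abstract text are dropped here; whether the Java code treats them the same way is not stated in this README.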

Processing dictionaries

Filtered dictionaries from UMLS by JuFiT

We worked with JuFiT v1.1; the matching jar file is included in this repository.
If you want to work with the original JuFiT, follow the steps below:
- Download JuFiT from https://github.com/JULIELab/jufit
- Build the jar file with Apache Maven by running mvn clean package
- Run java -jar JuFiT.jar MRCONSO.RRF MRSTY.RRF GER --grounded > UMLS_dict.txt
Run the Java code RequestJuFiT.java (package de.julielab.dictionaryhandling) or the shell script extended_script_dictionaries/request-jufit.sh.
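If you prefer to drive the JuFiT call from Python rather than from the shell, the command above can be wrapped as follows. This is a minimal sketch and not the content of request-jufit.sh or RequestJuFiT.java; the helper names `jufit_command` and `run_to_file` are hypothetical, and the arguments simply mirror the command line given in this section.

```python
import subprocess
from pathlib import Path
from typing import List


def jufit_command(jar: str, mrconso: str, mrsty: str) -> List[str]:
    """Build the JuFiT call from this section as an argument list."""
    return ["java", "-jar", jar, mrconso, mrsty, "GER", "--grounded"]


def run_to_file(command: List[str], out_path: Path) -> None:
    """Run `command` and write its standard output to `out_path`.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    with out_path.open("w", encoding="utf-8") as out:
        subprocess.run(command, stdout=out, check=True)
```

A call such as `run_to_file(jufit_command("JuFiT.jar", "MRCONSO.RRF", "MRSTY.RRF"), Path("UMLS_dict.txt"))` would then correspond to the shell redirect shown above.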

Gene Dictionary

We used a list of gene names compiled from Entrez Gene and UniProt, following the approach of Wermter et al.
The code is available at JULIELab/gene-name-mapping.
Integration of this code into the GGPOnc repository is coming soon.

Connect Dictionaries

To use the JCoRe pipelines, you need one large file, global_dictionary.txt.
Run the script extended_script_dictionaries/createDics.py to create one large dictionary (before running, adapt the path names in the script file).
Alternatively, run the Java code CreateLargeDictionary.java (package de.julielab.dictionaryhandling) (before running, adapt the path names in the source file).
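Conceptually, this step concatenates several dictionary files into one. The sketch below illustrates that idea only; it is not the logic of createDics.py or CreateLargeDictionary.java. It assumes each dictionary is a UTF-8 text file with one entry per line, and it additionally drops blank and duplicate lines, which the original scripts may or may not do. The function name `merge_dictionaries` is hypothetical.

```python
from pathlib import Path
from typing import Iterable


def merge_dictionaries(sources: Iterable[Path], target: Path) -> int:
    """Concatenate dictionary files into `target`, dropping blank and duplicate lines.

    Entries keep the order of their first appearance.
    Returns the number of entries written.
    """
    seen = set()
    with target.open("w", encoding="utf-8") as out:
        for source in sources:
            with source.open(encoding="utf-8") as handle:
                for line in handle:
                    entry = line.rstrip("\n")
                    if entry.strip() and entry not in seen:
                        seen.add(entry)
                        out.write(entry + "\n")
    return len(seen)
```

Treating each line as an opaque string avoids guessing at the column layout of the individual dictionaries.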

JCoRe Pipeline

Unpack the *.zip files in jcore-pipelines. There are 2 pipelines:
- detectUMLSentries
- detectStopwords
Create the folder data/files in each pipeline directory and put the data to be analyzed into data/files (subdirectories are not read; be careful with *.tar files).
Put the global dictionary file into jcore-pipelines/detectUMLSentries/resources.
Adapt the file names of the dictionary and the stopword dictionary in the following files:
- desc/GazetteerAnnotator Template Descriptor with Configurable External Resource.xml
- descAll/GazetteerAnnotator Template Descriptor with Configurable External Resource.xml
Open a terminal and change into one of the pipeline directories.
Start the pipeline with java -jar ../jcore-pipeline-runner-base-0.4.1-SNAPSHOT-cli-assembly.jar run.xml
Results:
- offsets.tsv
- data/outData/output-xmi
This JCoRe pipeline is derived from the JULIE Lab's own jcore-pipeline-modules (see also https://zenodo.org/record/4066619#.X3sPVS8Rp-U).
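Because subdirectories of data/files are not read, nested input has to be flattened before the pipeline runs. A minimal sketch of that preparation step follows; it is not part of this repository. It assumes the input consists of `*.txt` files (adjust the pattern otherwise), and the function name `stage_files` is hypothetical.

```python
import shutil
from pathlib import Path


def stage_files(source_dir: Path, pipeline_dir: Path, pattern: str = "*.txt") -> int:
    """Copy matching files from `source_dir` (searched recursively) into a flat
    data/files folder inside `pipeline_dir`. Returns the number of files copied.
    """
    target = pipeline_dir / "data" / "files"
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in sorted(source_dir.rglob(pattern)):
        if not path.is_file():
            continue
        destination = target / path.name
        if destination.exists():
            # Flattening would silently overwrite a file of the same name.
            raise FileExistsError(f"Name clash in flat folder: {path.name}")
        shutil.copy2(path, destination)
        copied += 1
    return copied
```

Raising on a name clash is a deliberate choice here: two files with the same name in different subdirectories would otherwise overwrite each other in the flat folder. Files that do not match the pattern, such as archives, are left out.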

Evaluation of Annotations

To calculate the inter-annotator agreement between human annotators, follow the instructions of bratiaa.
To calculate precision and recall between automatically created annotations and the human-annotated data, run:
- pip install bratutils
- python src/main/python/umls_evaluation.py <path to gold annotations> <path to automatic annotations>
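For orientation, the metrics themselves can be written down in a few lines. The sketch below is not the code of umls_evaluation.py and does not use bratutils: it assumes annotations are represented as (start offset, end offset, label) tuples and counts only exact matches, whereas the actual script may apply a different matching scheme. The name `precision_recall_f1` is hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two sets of annotation spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Precision is the share of automatic annotations that also appear in the gold standard, and recall is the share of gold annotations that were found automatically.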
