Medical Concept Extraction

A method for language-agnostic human-in-the-loop medical concept extraction from highly unstructured electronic health records.

Usage

This tool works best in an interactive session.
See demo.ipynb for a Jupyter Notebook demo with three simple example files.

Loading records
Load your text files into record objects using the helper classes in data_utils:

from data_utils import Record, Records

# raw_record_strings: a list with one string per record
records = Records()
for i, text in enumerate(raw_record_strings):
    records.append(Record(i, text))

# optionally save the tokenized records as a pickle file
records.save()  # load with Records.load
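
If your records are stored as individual text files, raw_record_strings can be built with a few lines of standard-library code. The following is a minimal sketch and not part of this repository: the helper name read_raw_records, the "*.txt" pattern, and the UTF-8 encoding are assumptions to adapt to your own data.

```python
from pathlib import Path


def read_raw_records(folder, pattern="*.txt", encoding="utf-8"):
    """Return the contents of every matching file in `folder`, sorted by file name.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    paths = sorted(Path(folder).glob(pattern))
    return [path.read_text(encoding=encoding) for path in paths]
```

The returned list can be passed to the loop above, so each file becomes one Record.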

Initial curation
Add your bootstrap lexicon to the data folder as a newline-separated lexicon.txt file. AnnotationGUI starts a GUI in which you can annotate tokens with the right mouse button. You can use extraction.extract_all to pre-annotate the records with your bootstrap lexicon.

lexicon = build_lexicon()

extract_all(records, lexicon)  # pre-annotate tokens that are contained in the lexicon
gui = AnnotationGUI(records)  # opens the annotation GUI
tags = gui.annotated
fps = get_false_positives(records, lexicon)
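
To make the pre-annotation step more concrete, here is a simplified, self-contained sketch of exact lexicon matching over a list of tokens. It is not the implementation behind build_lexicon or extract_all: the function names parse_lexicon and find_matches, the whitespace tokenization, and the lowercasing are all assumptions made for illustration.

```python
def parse_lexicon(text):
    """Parse newline-separated entries into a set of lowercase token tuples."""
    entries = set()
    for line in text.splitlines():
        tokens = tuple(line.lower().split())
        if tokens:  # skip blank lines
            entries.add(tokens)
    return entries


def find_matches(tokens, lexicon):
    """Return (start, length) spans of tokens that exactly match a lexicon entry.

    At each position the longest matching entry wins, and matches do not overlap.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    lowered = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(lowered):
        for length in range(min(max_len, len(lowered) - i), 0, -1):
            if tuple(lowered[i:i + length]) in lexicon:
                matches.append((i, length))
                i += length
                break
        else:
            i += 1  # no lexicon entry starts at this position
    return matches


demo_lexicon = parse_lexicon("paracetamol\nacetylsalicylic acid\n")
demo_tokens = ["Medication", ":", "acetylsalicylic", "acid", "and", "paracetamol", "."]
print(find_matches(demo_tokens, demo_lexicon))  # [(2, 2), (5, 1)]
```

Preferring the longest entry at each position keeps a multi-token entry from being split into shorter single-token matches.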

Inspect the false positives returned by curation.get_false_positives. Using the methods in curation.py, remove incorrect annotations, add missing ones, or mark lexicon entries as ambiguous.

Example false positive returned:

(('none', 'administered'), 40110, (14, 4), ('medication', ':', 'none', 'administered', '.'))

Remove with:

remove_tag(tags, fps[0][1], fps[0][2])

Or mark as ambiguous:

mark_ambiguous(lexicon, fps[0][0])
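
When reviewing many false positives at once, it can help to print them in a readable form. The sketch below works on the example tuple shown above. The field names (entry, record_id, position, context) are guesses inferred from how the fields are passed to remove_tag and mark_ambiguous, and describe_false_positive is a hypothetical helper that is not part of curation.py.

```python
def describe_false_positive(fp):
    """Format one false-positive tuple as a single readable line."""
    # Field meanings are assumed from the README example, not confirmed.
    entry, record_id, position, context = fp
    return (
        f"entry={' '.join(entry)!r} record={record_id} "
        f"position={position} context={' '.join(context)!r}"
    )


example = (
    ("none", "administered"),
    40110,
    (14, 4),
    ("medication", ":", "none", "administered", "."),
)
print(describe_false_positive(example))
```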

Review candidates

cdf = CandidateFinder(records, lexicon)  
cdf.extend_fuzzy()  # find fuzzy matches
cdf.process_contexts()  # find lexicon match contexts
cdf.get_candidates()  # sets cdf.candidates to a list of candidates ordered by their occurrence count
cdf.start_review()  # opens the review GUI
cdf.save_all()  # saves the lexicon, rejected entries, and partial match statistics

Then later load with:

cdf = CandidateFinder(records)
cdf.load_all()
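
The ideas behind the fuzzy-matching and ranking steps in this section can be illustrated without the repository code. The sketch below uses difflib.get_close_matches from the Python standard library as a stand-in similarity measure and collections.Counter to order candidates by occurrence count. It is not the algorithm implemented in fuzzy.py or CandidateFinder, and the function name, the cutoff value, and the sample tokens are assumptions.

```python
from collections import Counter
from difflib import get_close_matches


def rank_fuzzy_candidates(tokens, lexicon_terms, cutoff=0.8):
    """Count tokens that closely resemble, but do not equal, a lexicon term.

    `lexicon_terms` is expected to be a list of lowercase strings.
    Returns (token, count) pairs, most frequent first.
    """
    known = set(lexicon_terms)
    counts = Counter()
    for token in tokens:
        token = token.lower()
        if token in known:
            continue  # exact matches are not new candidates
        if get_close_matches(token, lexicon_terms, n=1, cutoff=cutoff):
            counts[token] += 1
    return counts.most_common()


sample_tokens = ["paracetamol", "paracetamoll", "parcetamol", "paracetamoll", "aspirin"]
print(rank_fuzzy_candidates(sample_tokens, ["paracetamol"]))
# [('paracetamoll', 2), ('parcetamol', 1)]
```

Ordering by count puts the most frequent spelling variants in front of the reviewer first, which is the same motivation as the occurrence-count ordering of cdf.candidates.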

Extract matches
Annotate using extraction.extract_all with the saved partial match statistics.

# tags are saved in Record.tags
extract_all(records, lexicon, cdf.pos_counter, cdf.pos_counter_m)

Citations

As published in the ICDM 2020 DMBIH workshop:

@InProceedings{Ruis2020,
  author    = {Frank Ruis and Shreyasi Pathak and Jeroen Geerdink and Johannes H. Hegeman and Christin Seifert and Maurice van Keulen},
  booktitle = {Proc. International Conference on Data Mining Workshops},
  title     = {Human-in-the-loop Language-agnostic Extraction of Medication Data from Highly Unstructured Electronic Health Records},
  year      = {2020},
  publisher = {{IEEE}},
}
