Biomaterials annotator

The Biomaterials Annotator: a system for ontology-based concept annotation of biomaterials text.

The Biomaterials Annotator is an ontology-based NER system that identifies biomaterial concepts. It provides a schema for combining terms from multiple ontologies, vocabularies and nomenclatures. A full list of the types of concepts annotated is available here.

The global scores calculated for the system are 0.75 strict F-score, 0.79 lenient F-score and 0.77 average F-score. The full results, including metrics by category, are available here. The validated set of abstracts is available here.
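To make the three scores concrete, here is a minimal Python sketch of how strict, lenient and average F-scores are commonly related. It assumes the convention used by GATE-style evaluation, where partially overlapping annotations count as hits only in lenient mode and the average is the arithmetic mean of the strict and lenient values. The function name and the counts are hypothetical and are not the project's evaluation code or data. The reported figures are consistent with this reading, since (0.75 + 0.79) / 2 = 0.77.

```python
def f_score(correct, partial, spurious, missing, lenient=False):
    """F1 score where partial matches count as hits only in lenient mode."""
    hits = correct + partial if lenient else correct
    if hits == 0:
        return 0.0
    predicted = correct + partial + spurious  # annotations produced by the system
    expected = correct + partial + missing    # annotations in the gold standard
    precision = hits / predicted
    recall = hits / expected
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not the project's data).
counts = dict(correct=70, partial=5, spurious=20, missing=20)
strict = f_score(**counts)
lenient = f_score(**counts, lenient=True)
average = (strict + lenient) / 2

# Consistency check on the scores reported above.
reported_average = (0.75 + 0.79) / 2
print(round(reported_average, 2))  # 0.77
```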

The Biomaterials Annotator follows a modular organization, with the different components packaged as software containers. The pipeline is orchestrated using Nextflow as the workflow manager. The natural language processing (NLP) components were mainly developed in Java and rely on the open source Stanford CoreNLP toolkit.

Annotated corpus

A biomaterials annotated gold standard corpus of 1222 MEDLINE abstracts resulting from the execution of the Biomaterials Annotator is available and free to use at https://github.com/ProjectDebbie/Biomaterials_annotated_corpus. The corpus contains articles describing the evaluation of biomaterials and medical devices in either a laboratory or clinical setting. Each abstract is stored as a separate file in GATE format.

Biomaterials Annotator Project Overview

System architecture

The Standard NLP preprocessing component is available at https://gitlab.bsc.es/inb/text-mining/generic-tools/nlp-standard-preprocessing. The MSH Annotator annotates pre-selected categories from the MeSH terminology, and the Dictionary Annotator does the same using manually collected ontologies and vocabularies. These steps are followed by the Post-processing rules, which include entity recognition based on lexical rules, removal of false positives and recognition of concepts expressed as abbreviations, among other tasks.

The MSH Annotator is available at https://github.com/ProjectDebbie/debbie_umls_annotations; and the Dictionary Annotator and Post-processing rules are available at https://github.com/ProjectDebbie/DEBBIE_dictionaries_annotations.
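The dictionary lookup and post-processing stages described above can be pictured with a small Python sketch. This is a toy illustration only: the actual components are Java-based and are not reproduced here. The dictionary entries, function names and the single overlap rule are all hypothetical, chosen to show the general shape of "look up terms, then clean up the hits".

```python
import re

# Hypothetical mini-dictionary: surface term -> source vocabulary.
DICTIONARY = {
    "hydroxyapatite": "CHEBI",
    "scaffold": "DEB",
    "bone": "UBERON",
    "bone cement": "DEB",
}


def annotate(text, dictionary):
    """Return (start, end, matched_text, source) for every dictionary hit."""
    annotations = []
    for term, source in dictionary.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.end(), match.group(0), source))
    return sorted(annotations)


def drop_nested(annotations):
    """Toy post-processing rule: when spans overlap, keep the longest one."""
    kept = []
    # Sort by start offset, then longest span first.
    for ann in sorted(annotations, key=lambda a: (a[0], -(a[1] - a[0]))):
        if kept and ann[0] < kept[-1][1]:
            continue  # overlaps the previously kept span
        kept.append(ann)
    return kept


text = "A hydroxyapatite scaffold was combined with bone cement."
result = drop_nested(annotate(text, DICTIONARY))
for start, end, matched, source in result:
    print(start, end, matched, source)
```

In this sketch the shorter hit "bone" is discarded because it is nested inside "bone cement", which loosely mirrors the idea of removing false positives after the dictionary pass.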

Software containers used in the current version of the Biomaterials Annotator workflow:

  1. nlp-standard-preprocessing: registry.hub.docker.com/javicorvi/nlp-standard-preprocessing:dev_1.6
  2. debbie-umls-annotation: registry.hub.docker.com/projectdebbie/debbie_umls_annotation:release-1.0.7
  3. debbie-dictionaries-annotations: registry.hub.docker.com/projectdebbie/debbie_dictionaries_annotations:release-2.0.0
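If you want to fetch these images before running the workflow, the following Python sketch prints the corresponding `docker pull` commands. Pre-fetching is an assumption about a convenient workflow, not a documented requirement of the project, and the helper name `pull_commands` is hypothetical. The image references are copied from the list above.

```python
# Image references copied from the container list above.
IMAGES = [
    "registry.hub.docker.com/javicorvi/nlp-standard-preprocessing:dev_1.6",
    "registry.hub.docker.com/projectdebbie/debbie_umls_annotation:release-1.0.7",
    "registry.hub.docker.com/projectdebbie/debbie_dictionaries_annotations:release-2.0.0",
]


def pull_commands(images):
    """Build one `docker pull` argument list per image reference."""
    return [["docker", "pull", image] for image in images]


# Print the commands instead of executing them, so nothing is downloaded
# unless you choose to run them yourself.
for command in pull_commands(IMAGES):
    print(" ".join(command))
```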

Ontologies used for the annotations

  1. MESH (UMLS)
  2. DEB
  3. GMDN
  4. CHEBI
  5. IOBC
  6. NCIT
  7. NPO
  8. OBI
  9. ONTOTOXNUC
  10. UBERON
  11. PREMEDONTO
  12. EDAM Bioimaging Ontology
  13. CHMO

Run the Biomaterials Annotator

You need to have docker and nextflow installed. Then configure and run the run.sh file.
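As a quick way to confirm the prerequisites before touching run.sh, here is a small Python sketch that checks whether the `docker` and `nextflow` executables are on your PATH. The helper name `missing_tools` is hypothetical, and the check says nothing about which versions the workflow expects or how run.sh should be configured.

```python
import shutil


def missing_tools(tools=("docker", "nextflow")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("docker and nextflow found; configure and execute run.sh next.")
```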

Current version: 1.0, 2021-03-23

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

Javier Corvi and Osnat Hakimi

To cite the Biomaterials Annotator

Corvi, J., Fuenteslópez, C., Fernández, J., Gelpi, J., Ginebra, M.-P., Capella-Gutierrez, S., Hakimi, O.: The biomaterials annotator: a system for ontology-based concept annotation of biomaterials text. In: Proceedings of the Second Workshop on Scholarly Document Processing, pp. 36–48. Association for Computational Linguistics, Online (2021). https://www.aclweb.org/anthology/2021.sdp-1.5

BibText

License

This project is licensed under the GNU License; see the LICENSE file for details.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 751277.