Resources

John Giorgi edited this page Apr 12, 2018 · 8 revisions

Corpora

Some datasets (i.e., gold-standard corpora) that can be publicly distributed are available in the datasets directory of this repository [1].

Alternatively, corpora can be publicly accessed at the following links:

| Corpora | Text Genre | Standard | Entities | Publication |
| --- | --- | --- | --- | --- |
| AZDC | Scientific Article | Gold | diseases | link |
| BioInfer | Scientific Article | Gold | genes/proteins | link |
| BioSemantics | Patent | Gold | chemicals, diseases | link |
| CDR | Scientific Article | Gold | chemicals, diseases | link |
| CellFinder | Scientific Article | Gold | species, genes/proteins, cells, anatomy | link |
| CEMP | Patent | Gold | chemicals | link |
| DECA | Scientific Article | Gold | genes/proteins | link |
| FSU-PRGE | Scientific Article | Gold | genes/proteins | link |
| Linnaeus | Scientific Article | Gold | species | link |
| IEPA | Scientific Article | Gold | genes/proteins | link |
| miRNA | Scientific Article | Gold | diseases, species, genes/proteins | link |
| NCBI disease | Scientific Article | Gold | diseases | link |
| S800 | Scientific Article | Gold | species | link |

Multi-Level Event Extraction (MLEE) Corpus

The MLEE corpus [3] was obtained here. We used standoff2conll to convert it to the IOB format, with the following command:

python2 standoff2conll.py path/to/original_format_corpora/MLEE-1.0.2-rev1/standoff/full -t Cell_proliferation Development Blood_vessel_development Death Breakdown Remodeling Growth Synthesis Gene_expression Transcription Catabolism Phosphorylation Dephosphorylation Localization Binding Regulation Positive_regulation Negative_regulation Planned_process -s IOB > MLEE_IOB.tsv
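
Once converted, the file can be sanity-checked with a few lines of Python. The following is a minimal sketch, not part of this repository: it assumes MLEE_IOB.tsv holds one token per line in tab-separated columns, with the token in the first column and the IOB tag in the last, and with blank lines separating sentences. The helper names `read_iob` and `extract_spans` are hypothetical; adjust the column indices if your output differs.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        # Assumption: token is the first column, IOB tag is the last column.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_spans(sentence: Sentence) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    etype = None
    start = None
    for i, (_, tag) in enumerate(sentence):
        # A token continues the open span only if it is I-<same type>.
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.append((etype, start, i))
        if tag.startswith(("B-", "I-")):
            # Lenient: an I- tag with no open span starts a new one.
            etype, start = tag[2:], i
        else:
            etype, start = None, None
    if etype is not None:
        spans.append((etype, start, len(sentence)))
    return spans


# Typical use (path taken from the command above):
#     with open("MLEE_IOB.tsv", encoding="utf-8") as handle:
#         sentences = read_iob(handle)
```

Counting the spans per entity type across all sentences is a quick way to confirm that every type passed with -t made it into the converted file.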

Word embeddings

Word embeddings derived from a combination of PubMed and PMC texts along with a recent English Wikipedia dump (optimal for sequence processing tasks in the biomedical domain) can be obtained here [2].
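
This page does not state the file format of the embeddings. Assuming the download is in the word2vec binary format (a header line with the vocabulary size and dimension, then for each word its UTF-8 bytes, a space, and the vector as little-endian 32-bit floats), a minimal reader could look like the sketch below. The function name `read_word2vec_binary` is hypothetical.

```python
import struct
from typing import BinaryIO, Dict, List


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, List[float]]:
    """Read word vectors from a stream in word2vec binary format (assumed)."""
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            byte = stream.read(1)
            if byte == b"":
                raise EOFError("unexpected end of file while reading a word")
            if byte == b" ":
                break
            # Some writers emit a newline after each vector; skip it.
            if byte != b"\n":
                chars.extend(byte)
        word = chars.decode("utf-8", errors="replace")
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        # "<" forces little-endian; "f" is a 32-bit float.
        vectors[word] = list(struct.unpack("<%df" % dim, raw))
    return vectors
```

Holding every vector in Python lists is memory-hungry for a large vocabulary, so for the full file a library loader such as gensim's `KeyedVectors.load_word2vec_format(path, binary=True)` is likely the more practical choice.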

Notes

  1. Many of these datasets were obtained from https://github.com/cambridgeltl/MTL-Bioinformatics-2016/
  2. Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski and Sophia Ananiadou. Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan (2013), pp. 39-43.
  3. Sampo Pyysalo, Tomoko Ohta, Makoto Miwa, Han-Cheol Cho, Jun'ichi Tsujii and Sophia Ananiadou. Event extraction across multiple levels of biological organization. Bioinformatics (2012) 28(18):i575-i581.