This repository contains the CRAFT corpus, a collection of 97 articles from the PubMed Central Open Access subset, each of which has been annotated along a number of different axes spanning structural, coreference, and concept annotation.
For this project, I am using the CRAFT corpus as a biomedical named entity recognition dataset. The data contains 97 full-text biomedical research articles from PubMed Central, along with expert annotations for biological concepts. I focused on the Gene Ontology concept annotations:
- GO_BP: biological process terms
- GO_CC: cellular component terms
- GO_MF: molecular function terms
So far, I have converted the original CRAFT Knowtator annotations into token-level IOB files suitable for NER training. The generated files are in outputs/iob. Each row contains a token, its IOB label, the GO identifier, the GO term, the ontology source, a mention ID, and character offsets. Because standard IOB tagging cannot represent every complex CRAFT annotation directly, discontinuous mentions were split into contiguous spans, and overlapping mentions were flattened with a longest-span-first rule. The merged IOB dataset is stored in outputs/iob/merged/docs.
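The longest-span-first flattening can be sketched as follows. This is a hypothetical helper (not the actual code in scripts/craft_to_iob.py), assuming each mention is a (start, end, label) character span: longer spans are kept first, and any mention overlapping an already-kept span is dropped.

```python
def flatten_mentions(mentions):
    """Resolve overlapping mentions with a longest-span-first rule.

    `mentions` is a list of (start, end, label) character spans.
    Spans are visited longest-first; a mention is kept only if it does
    not overlap any span already kept. Hypothetical sketch, not the
    actual conversion script.
    """
    kept = []
    # Sort by descending length, then by start offset for ties.
    for start, end, label in sorted(mentions, key=lambda m: (-(m[1] - m[0]), m[0])):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)
```

With this rule, a short mention nested inside a longer one is discarded, so the surviving spans are non-overlapping and can be expressed in plain IOB.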
I also built a Week 2 modeling pipeline in scripts/week2_biogru_ner.py. The pipeline uses BioBERT (dmis-lab/biobert-base-cased-v1.1) to generate frozen contextual embeddings for each token, then trains a bidirectional GRU tagger on those embeddings. For the current experiment, all GO concept labels are mapped to B-ENTITY and I-ENTITY, so the model learns biomedical entity boundaries instead of predicting each exact GO ID.
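The label-collapsing step can be sketched as below. This is a minimal, hypothetical helper mirroring the Week 2 setup (the tag format `B-GO:...` is illustrative): fine-grained GO tags are reduced to generic B-ENTITY / I-ENTITY tags while O tags pass through unchanged.

```python
def collapse_labels(iob_tags):
    """Map fine-grained GO IOB tags (e.g. "B-GO:0008150") to generic
    "B-ENTITY" / "I-ENTITY" tags so the tagger only learns entity
    boundaries. Hypothetical sketch of the Week 2 label mapping.
    """
    out = []
    for tag in iob_tags:
        if tag == "O":
            out.append("O")
        else:
            prefix = tag.split("-", 1)[0]  # "B" or "I"
            out.append(f"{prefix}-ENTITY")
    return out
```

Collapsing the label set this way turns a many-class GO-concept tagging problem into a three-class boundary-detection problem, which is easier to learn from 97 documents.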
Current project artifacts include:
- IOB conversion script: scripts/craft_to_iob.py
- Generated IOB dataset: outputs/iob
- BioBERT embeddings for all 97 documents: outputs/week2_biogru/embeddings
- Trained Bi-GRU model and metrics: outputs/week2_biogru/model
- Week 2 notes and presentation outline: docs/week2_run_notes.md and docs/week2_presentation_outline.md
The latest saved Week 2 run used a 70/15/15 document-level train/dev/test split and trained for 5 epochs. On the test set, the model reached token-level F1 of 0.5729 and exact-span F1 of 0.5008. The next steps are to improve boundary accuracy, compare exact GO-label prediction against the current generic entity-label setup, and possibly add stronger decoding such as a CRF layer.
To cite the CRAFT corpus, please see the CRAFT Reference wiki page.
For installation and other usage instructions, please see the CRAFT Wiki.
For stable releases, please download from the CRAFT Releases page.
The distribution has been streamlined to include only a single file format for each annotation type. In place of multiple file formats for each annotation type, the CRAFT corpus is distributed with a script which can convert annotations from the native file format into a variety of other file formats. Please see the Creating alternative annotation file formats wiki page for details.
Please direct comments, questions, and suggestions to the Issues section of the CRAFT GitHub page, or send e-mail to Mike Bada at mike.bada@ucdenver.edu.