This repository contains the CRAFT corpus, a collection of 97 articles from the PubMed Central Open Access subset, each of which has been annotated along a number of different axes spanning structural, coreference, and concept annotation.
For this project, I am using the CRAFT corpus as a biomedical named entity recognition dataset. The data contains 97 full-text biomedical research articles from PubMed Central, along with expert annotations for biological concepts. I focused on the Gene Ontology concept annotations:
- GO_BP: biological process terms
- GO_CC: cellular component terms
- GO_MF: molecular function terms
So far, I have converted the original CRAFT Knowtator annotations into token-level IOB files suitable for NER training. The generated files are in outputs/iob. Each row contains a token, its IOB label, the GO identifier, the GO term, the ontology source, a mention ID, and character offsets. Because standard IOB tagging cannot represent every complex CRAFT annotation directly, discontinuous mentions were split into contiguous spans, and overlapping mentions were flattened with a longest-span-first rule. The merged IOB dataset is stored in outputs/iob/merged/docs.
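The longest-span-first flattening can be sketched as follows. This is a hypothetical helper (not the actual code in scripts/craft_to_iob.py), assuming each mention is a (start, end, label) character span: longer spans are kept first, and any mention overlapping an already-kept span is dropped.

```python
def flatten_mentions(mentions):
    """Resolve overlapping mentions with a longest-span-first rule.

    `mentions` is a list of (start, end, label) character spans.
    Spans are visited longest-first; a mention is kept only if it does
    not overlap any span already kept. Hypothetical sketch, not the
    actual conversion script.
    """
    kept = []
    # Sort by descending length, then by start offset for ties.
    for start, end, label in sorted(mentions, key=lambda m: (-(m[1] - m[0]), m[0])):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)
```

With this rule, a short mention nested inside a longer one is discarded, so the surviving spans are non-overlapping and can be expressed in plain IOB.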
I also built a Week 2 modeling pipeline in scripts/week2_biogru_ner.py. The pipeline uses BioBERT (dmis-lab/biobert-base-cased-v1.1) to generate frozen contextual embeddings for each token, then trains a bidirectional GRU tagger on those embeddings. For the current experiment, all GO concept labels are mapped to B-ENTITY and I-ENTITY, so the model learns biomedical entity boundaries instead of predicting each exact GO ID.
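The label-collapsing step can be sketched as below. This is a minimal, hypothetical helper mirroring the Week 2 setup (the tag format `B-GO:...` is illustrative): fine-grained GO tags are reduced to generic B-ENTITY / I-ENTITY tags while O tags pass through unchanged.

```python
def collapse_labels(iob_tags):
    """Map fine-grained GO IOB tags (e.g. "B-GO:0008150") to generic
    "B-ENTITY" / "I-ENTITY" tags so the tagger only learns entity
    boundaries. Hypothetical sketch of the Week 2 label mapping.
    """
    out = []
    for tag in iob_tags:
        if tag == "O":
            out.append("O")
        else:
            prefix = tag.split("-", 1)[0]  # "B" or "I"
            out.append(f"{prefix}-ENTITY")
    return out
```

Collapsing the label set this way turns a many-class GO-concept tagging problem into a three-class boundary-detection problem, which is easier to learn from 97 documents.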
Current project artifacts include:
- IOB conversion script: scripts/craft_to_iob.py
- Generated IOB dataset: outputs/iob
- BioBERT embeddings for all 97 documents: outputs/week2_biogru/embeddings
- Trained Bi-GRU model and metrics: outputs/week2_biogru/model
- Week 2 notes and presentation outline: docs/week2_run_notes.md and docs/week2_presentation_outline.md
The latest saved Week 2 run used a 70/15/15 document-level train/dev/test split and trained for 5 epochs. On the test set, the model reached token-level F1 of 0.5729 and exact-span F1 of 0.5008. The next steps are to improve boundary accuracy, compare exact GO-label prediction against the current generic entity-label setup, and possibly add stronger decoding such as a CRF layer.
To cite the CRAFT corpus, please see the CRAFT Reference wiki page.
For installation and other usage instructions, please see the CRAFT Wiki.
For stable releases, please download from the CRAFT Releases page.
The distribution has been streamlined to include only a single file format for each annotation type. In place of multiple file formats for each annotation type, the CRAFT corpus is distributed with a script which can convert annotations from the native file format into a variety of other file formats. Please see the Creating alternative annotation file formats wiki page for details.
Please direct comments, questions, and suggestions to the Issues section of the CRAFT GitHub page, or send e-mail to Mike Bada at mike.bada@ucdenver.edu.