DrFaustest/NLP-Assignment

The Colorado Richly Annotated Full-Text (CRAFT) Corpus

This repository contains the CRAFT corpus, a collection of 97 articles from the PubMed Central Open Access subset, each of which has been annotated along a number of different axes spanning structural, coreference, and concept annotation.

Project update

For this project, I am using the CRAFT corpus as a biomedical named entity recognition dataset. The data contains 97 full-text biomedical research articles from PubMed Central, along with expert annotations for biological concepts. I focused on the Gene Ontology concept annotations:

  • GO_BP: biological process terms
  • GO_CC: cellular component terms
  • GO_MF: molecular function terms

So far, I have converted the original CRAFT Knowtator annotations into token-level IOB files for NER training. The generated files are in outputs/iob. Each row contains a token, its IOB label, the GO identifier, the GO term, the ontology source, the mention id, and the character offsets. Standard IOB tagging cannot represent every complex CRAFT annotation directly, so discontinuous mentions were split into contiguous spans and overlapping mentions were flattened with a longest-span-first rule. The merged IOB dataset is stored in outputs/iob/merged/docs.
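The longest-span-first rule can be illustrated with a minimal sketch. This is not the code in scripts/craft_to_iob.py: the function names, the (start, end, label) span layout, and the use of the ontology source as the label are assumptions made for the example, and the splitting of discontinuous mentions is not shown.

```python
from typing import List, Optional, Tuple

# Hypothetical span layout: (start_char, end_char, label), end exclusive.
Span = Tuple[int, int, str]
# Hypothetical token layout: (text, start_char, end_char), end exclusive.
Token = Tuple[str, int, int]


def flatten_longest_first(spans: List[Span]) -> List[Span]:
    """Keep the longest spans and drop any span that overlaps a kept one."""
    kept: List[Span] = []
    # Longest spans come first; ties are broken by the earlier start offset.
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def to_iob(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Tag each token from a list of non-overlapping spans."""
    tags: List[str] = []
    previous: Optional[Span] = None
    for _, tok_start, tok_end in tokens:
        # A token belongs to a span only if it lies fully inside it.
        current = next(
            (s for s in spans if s[0] <= tok_start and tok_end <= s[1]), None
        )
        if current is None:
            tags.append("O")
        else:
            prefix = "I-" if current == previous else "B-"
            tags.append(prefix + current[2])
        previous = current
    return tags


# "mitochondrial membrane potential" with three overlapping mentions.
tokens = [("mitochondrial", 0, 13), ("membrane", 14, 22), ("potential", 23, 32)]
mentions = [(14, 22, "GO_CC"), (0, 22, "GO_CC"), (14, 32, "GO_BP")]

flat = flatten_longest_first(mentions)
print(flat)
print(to_iob(tokens, flat))
```

In this example only the longest mention, "mitochondrial membrane", survives flattening, so the shorter overlapping mentions are lost. That trade-off is the cost of fitting nested annotations into a single IOB column.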

I also built a Week 2 modeling pipeline in scripts/week2_biogru_ner.py. The pipeline uses BioBERT (dmis-lab/biobert-base-cased-v1.1) to generate frozen contextual embeddings for each token, then trains a Bidirectional GRU tagger on those embeddings. For the current experiment, all GO concept labels are mapped to B-ENTITY and I-ENTITY so the model learns biomedical entity boundaries instead of predicting each exact GO id.
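The label mapping described above can be sketched as follows. The helper name collapse_label and the example input tags are hypothetical; the sketch only assumes that each tag has the form B-<something>, I-<something>, or O.

```python
def collapse_label(tag: str) -> str:
    """Map any B-/I- tag to B-ENTITY/I-ENTITY and leave O unchanged."""
    if tag == "O":
        return "O"
    # Split on the first hyphen only, so suffixes that contain hyphens
    # or colons are discarded as a whole.
    prefix, _, _ = tag.partition("-")
    return f"{prefix}-ENTITY"


# Hypothetical input tags; the exact label format in outputs/iob may differ.
original = ["B-GO_BP", "I-GO_BP", "O", "B-GO_CC"]
collapsed = [collapse_label(t) for t in original]
print(collapsed)
```

Collapsing the labels reduces the task to boundary detection with three output classes, which is easier to learn from 97 documents than a label set with one class per GO id.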

Current project artifacts include:

  • IOB conversion script: scripts/craft_to_iob.py
  • Generated IOB dataset: outputs/iob
  • BioBERT embeddings for all 97 documents: outputs/week2_biogru/embeddings
  • Trained Bi-GRU model and metrics: outputs/week2_biogru/model
  • Week 2 notes and presentation outline: docs/week2_run_notes.md and docs/week2_presentation_outline.md

The latest saved Week 2 run used a 70/15/15 document-level train/dev/test split and trained for 5 epochs. On the test set, the model reached a token-level F1 of 0.5729 and an exact-span F1 of 0.5008. The next steps are to improve boundary accuracy, compare exact GO-label prediction against the current generic entity-label setup, and possibly add stronger decoding such as a CRF layer.
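The gap between the two scores comes from how each metric treats boundary errors. The sketch below shows one common way to compute both; it is an assumption about the definitions, not the evaluation code in scripts/week2_biogru_ner.py, and all function names are hypothetical. Here token-level F1 counts a non-O token as correct when the predicted tag matches the gold tag, and exact-span F1 counts a span as correct only when both boundaries match.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) token index pairs, end exclusive, for each span.

    An I- tag with no preceding B- tag is ignored in this sketch.
    """
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def f1(true_pos: int, n_pred: int, n_gold: int) -> float:
    """F1 from counts; returns 0.0 when there are no true positives."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def token_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over non-O tokens whose predicted tag equals the gold tag."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    return f1(true_pos, sum(p != "O" for p in pred), sum(g != "O" for g in gold))


def span_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over spans whose start and end both match exactly."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    return f1(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))


gold = ["B-ENTITY", "I-ENTITY", "O", "B-ENTITY"]
pred = ["B-ENTITY", "O", "O", "B-ENTITY"]
print(round(token_f1(gold, pred), 4), round(span_f1(gold, pred), 4))
```

In this toy example the prediction truncates the first entity by one token. The token-level score still credits the token it got right, while the exact-span score rejects the whole entity, which is why boundary accuracy is listed as the first thing to improve.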

Citing CRAFT

To cite the CRAFT corpus, please see the CRAFT Reference wiki page.

Using CRAFT

For installation and other usage instructions, please see the CRAFT Wiki.

Stable releases

For stable releases, please download from the CRAFT Releases page.

Creating alternative file formats

The distribution has been streamlined to include only a single file format for each annotation type. Instead of shipping multiple file formats per annotation type, the CRAFT corpus is distributed with a script that converts annotations from the native file format into a variety of other file formats. Please see the Creating alternative annotation file formats wiki page for details.

Feedback

Please direct comments, questions, and suggestions to the Issues section of the CRAFT GitHub page, or send e-mail to Mike Bada at mike.bada@ucdenver.edu.
