
This repository contains my submissions for the MultiCardioNER 2024 shared task.


MultiCardioNER - 2024

MultiCardioNER is a shared task about the adaptation of clinical NER systems to the cardiology domain. It uses a combination of two existing datasets (DisTEMIST for diseases and the newly-released DrugTEMIST for medications), as well as a new, smaller dataset of cardiology clinical cases annotated using the same guidelines.

Participants are provided DisTEMIST and DrugTEMIST as training data to use as they see fit (1,000 documents, with the original partitions splitting them into 750 for training and 250 for testing). The cardiology clinical cases (cardioccc) are meant to be used as a development or validation set (258 documents). Another set of cardioccc will be released later on for testing.

MultiCardioNER proposes two tasks:

  • Track 1: Spanish adaptation of disease recognition systems to the cardiology domain.
  • Track 2: Multilingual (Spanish, English and Italian) adaptation of medication recognition systems to the cardiology domain.

MultiCardioNER was developed by the Barcelona Supercomputing Center's NLP for Biomedical Information Analysis and used as part of BioASQ 2024. For more information on the corpus, annotation scheme and task in general, please visit: https://temu.bsc.es/multicardioner.

Track 1 - Disease Recognition in Spanish

Since the Track 1 corpus is entirely in Spanish, we use language-specific BERT models.

Data Loading

The dataloader.py file contains four classes: DataLoader, Sliding_Windows_Dataset, Cutoff_Dataset and Admission_Notes_Dataset.

The DataLoader class is responsible for loading and preprocessing the dataset. It reads the dataset from a TSV file, drops unnecessary columns, and gets a list of unique filenames. It then initializes a tokenizer from the HuggingFace transformers library and adds some custom tokens. The data can be split into training, validation, and test sets, or returned as a whole depending on the full parameter of the load_dataset method.
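The loading-and-splitting logic can be sketched as follows. This is a minimal, stdlib-only illustration: the class, column, and parameter names here are assumptions based on the description above, not the repository's exact API.

```python
import csv
import io
import random

# Hypothetical sketch of the DataLoader described above: read a TSV,
# drop unneeded columns, collect unique filenames, and split them.
class NERDataLoader:
    def __init__(self, tsv_file, drop_cols=("mark",)):
        reader = csv.DictReader(tsv_file, delimiter="\t")
        self.rows = [
            {k: v for k, v in row.items() if k not in drop_cols}
            for row in reader
        ]
        # Unique filenames, preserving first-seen order.
        self.filenames = list(dict.fromkeys(r["filename"] for r in self.rows))

    def load_dataset(self, full=False, val_frac=0.1, test_frac=0.1, seed=42):
        """Return all filenames (full=True) or a train/val/test split."""
        if full:
            return self.filenames
        names = self.filenames[:]
        random.Random(seed).shuffle(names)
        n_test = int(len(names) * test_frac)
        n_val = int(len(names) * val_frac)
        test = names[:n_test]
        val = names[n_test:n_test + n_val]
        train = names[n_test + n_val:]
        return train, val, test

# Toy TSV with one annotation per document.
tsv = "filename\tmark\ttext\tlabel\n" + "\n".join(
    f"doc{i}\tT1\tinfarto\tENFERMEDAD" for i in range(10)
)
loader = NERDataLoader(io.StringIO(tsv))
train, val, test = loader.load_dataset()
```

In the actual repository, the tokenizer initialization and custom-token handling would sit alongside this, using the HuggingFace transformers library.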

The Cutoff_Dataset class is a subclass of PyTorch's Dataset class. It is used for loading and tokenizing sentences on-the-fly, and is designed to be used with a PyTorch DataLoader to efficiently load and preprocess data in parallel. However, sentences that exceed the transformer model's maximum number of tokens are simply cut off (truncated).

In order to increase data capture, we use Sliding Window Attention with a specific stride. This ensures that all data given in one document is used as input to the model, and that the model sees how the sections connect.

(Figure: sliding windows with a fixed stride moving over a document, producing overlapping input chunks.)
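The windowing itself reduces to a short loop. This sketch assumes the tokens are already produced by a tokenizer; the parameter names (`max_len`, `stride`) are illustrative.

```python
# Split a document's tokens into overlapping windows of at most max_len
# tokens, advancing by stride tokens each step, so no text is discarded
# and each window shares context with its neighbours.
def sliding_windows(tokens, max_len=8, stride=4):
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

doc = [f"tok{i}" for i in range(18)]
chunks = sliding_windows(doc, max_len=8, stride=4)
# Adjacent windows overlap by max_len - stride tokens, so entity mentions
# near a window boundary still appear with surrounding context.
```

With the HuggingFace tokenizers, the same effect can be had via `return_overflowing_tokens=True` together with a `stride` argument.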

The Admission_Notes_Dataset class is then used to load and tokenize admission notes for Masked Language Modelling.
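For reference, the standard BERT-style masking scheme that Masked Language Modelling relies on can be sketched as below (15% of tokens are selected; of those, 80% become the mask token, 10% a random token, 10% are kept). This is a generic illustration, not the repository's implementation.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """Return (masked inputs, labels) for MLM; -100 marks positions
    ignored by the loss, as PyTorch's cross-entropy convention expects."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tid          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id  # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token unchanged
    return inputs, labels

ids = list(range(100, 120))
inputs, labels = mask_tokens(ids, vocab_size=30000, mask_id=4)
```

In practice, `transformers.DataCollatorForLanguageModeling` implements this masking and would typically be used instead of hand-rolling it.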

Together, these classes provide a convenient way to handle data loading and preprocessing for a machine learning model.

Ideas for Pre-Training

First of all, we need to adapt the model to the very specific corpus of medical texts in Spanish. In order to expand the model's vocabulary via domain-specific pretraining, it would be wise to use MedLexSp, a recently released dataset containing curated medical vocabulary in Spanish. Furthermore, to increase the model's understanding of patient notes, we could use masked language modelling on patient admission notes from the TREC CT proceedings. These texts would need to be automatically translated into Spanish, possibly via DeepL or the Google Translate API.

Various other datasets could also be used for a first step of general pretraining of the model.

After discussing this with Leonardo, the author of MedLexSp, I realized that ready-made models would be more suitable for this purpose - after all, it is difficult to compete with such models given limited computational power, time and resources. The following models seem promising, especially from the PlanTL-GOB-ES HuggingFace repository:

  • roberta-es-clinical-trials-ner provides a ready-made NER model that is already good at detecting chemical entities and pharmacological substances as well as pathologic conditions - this makes it suitable for both the first and the second track.
  • bsc-bio-ehr-es has been trained on a large corpus of medical free text in Spanish (based on RoBERTa) and shows great understanding of this matter.

Transfer-Learning and Data Augmentation

This focuses on Leonardo's fine-tuned model, which is proficient in detecting pathologic conditions as well as chemical entities and pharmacological substances in Spanish. Leonardo tested it on the dataset, but it still needs to be fine-tuned with the data provided by the organisers. The model currently recognises any disorder or sign/symptom, whereas MultiCardioNER seems to annotate only cardiovascular pathological conditions; as a result, its out-of-the-box performance is unsatisfactory. In addition, the other label types predicted by the model (e.g. anatomical entities or procedures) will need to be removed.
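Removing the unwanted label types amounts to a simple post-processing filter over the model's predicted spans. The label names below are assumptions for illustration; the actual tag set depends on the model's configuration.

```python
# Keep only the entity types relevant to the track, dropping other labels
# (e.g. anatomical entities or procedures) that the model also predicts.
KEEP = {"ENFERMEDAD"}  # assumed label name for diseases

def filter_entities(entities, keep=KEEP):
    """entities: list of dicts like {'start': 0, 'end': 7, 'label': '...'}."""
    return [e for e in entities if e["label"] in keep]

preds = [
    {"start": 0, "end": 7, "label": "ENFERMEDAD"},
    {"start": 12, "end": 20, "label": "PROCEDIMIENTO"},
]
filtered = filter_entities(preds)  # only the disease span survives
```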

To further fine-tune the model, we use data augmentation via external datasets: for example, taking English clinical cases or reports, translating them automatically into Spanish, and pre-annotating them with a preliminary model; this "silver standard" data can then be used to fine-tune the final model. Such a dataset, medical_mtsamples, is available on HuggingFace in English. I automatically translated it into Spanish, once again using the BRAT annotation format.
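Serializing the pre-annotated silver-standard entities into BRAT's standoff format is straightforward: each entity becomes a `T` line with a label, character offsets, and the surface text. A minimal sketch (the label name is illustrative):

```python
# Write entities as BRAT standoff annotation lines (the .ann file format):
# one tab-separated line per entity: ID, "label start end", surface text.
def to_brat(entities):
    lines = []
    for i, (label, start, end, text) in enumerate(entities, 1):
        lines.append(f"T{i}\t{label} {start} {end}\t{text}")
    return "\n".join(lines)

ann = to_brat([("ENFERMEDAD", 10, 28, "infarto de miocardio")])
```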

License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
