
This repository contains my submissions for the MultiCardioNER 2024 shared task.


MultiCardioNER - 2024

MultiCardioNER is a shared task about the adaptation of clinical NER systems to the cardiology domain. It uses a combination of two existing datasets (DisTEMIST for diseases and the newly-released DrugTEMIST for medications), as well as a new, smaller dataset of cardiology clinical cases annotated using the same guidelines.

Participants are provided DisTEMIST and DrugTEMIST as training data to use as they see fit (1,000 documents, with the original partitions splitting them into 750 for training and 250 for testing). The cardiology clinical cases (cardioccc) are meant to be used as a development or validation set (258 documents). Another set of cardioccc will be released later on for testing.

MultiCardioNER proposes two tasks:

  • Track 1: Spanish adaptation of disease recognition systems to the cardiology domain.
  • Track 2: Multilingual (Spanish, English and Italian) adaptation of medication recognition systems to the cardiology domain.

MultiCardioNER was developed by the Barcelona Supercomputing Center's NLP for Biomedical Information Analysis and used as part of BioASQ 2024. For more information on the corpus, annotation scheme and task in general, please visit: https://temu.bsc.es/multicardioner.

Track 1 - Disease Recognition in Spanish

Since the Track 1 corpus is entirely in Spanish, we use language-specific BERT models.

Data Loading

The dataloader.py file contains four classes: DataLoader, Sliding_Windows_Dataset, Cutoff_Dataset and Admission_Notes_Dataset.

The DataLoader class is responsible for loading and preprocessing the dataset. It reads the dataset from a TSV file, drops unnecessary columns, and gets a list of unique filenames. It then initializes a tokenizer from the HuggingFace transformers library and adds some custom tokens. The data can be split into training, validation, and test sets, or returned as a whole depending on the full parameter of the load_dataset method.
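The loading-and-splitting logic can be sketched as follows. This is a minimal, stdlib-only illustration: the class, column, and parameter names here are assumptions based on the description above, not the repository's exact API.

```python
import csv
import io
import random

# Hypothetical sketch of the DataLoader described above: read a TSV,
# drop unneeded columns, collect unique filenames, and split them.
class NERDataLoader:
    def __init__(self, tsv_file, drop_cols=("mark",)):
        reader = csv.DictReader(tsv_file, delimiter="\t")
        self.rows = [
            {k: v for k, v in row.items() if k not in drop_cols}
            for row in reader
        ]
        # Unique filenames, preserving first-seen order.
        self.filenames = list(dict.fromkeys(r["filename"] for r in self.rows))

    def load_dataset(self, full=False, val_frac=0.1, test_frac=0.1, seed=42):
        """Return all filenames (full=True) or a train/val/test split."""
        if full:
            return self.filenames
        names = self.filenames[:]
        random.Random(seed).shuffle(names)
        n_test = int(len(names) * test_frac)
        n_val = int(len(names) * val_frac)
        test = names[:n_test]
        val = names[n_test:n_test + n_val]
        train = names[n_test + n_val:]
        return train, val, test

# Toy TSV with one annotation per document.
tsv = "filename\tmark\ttext\tlabel\n" + "\n".join(
    f"doc{i}\tT1\tinfarto\tENFERMEDAD" for i in range(10)
)
loader = NERDataLoader(io.StringIO(tsv))
train, val, test = loader.load_dataset()
```

In the actual repository, the tokenizer initialization and custom-token handling would sit alongside this, using the HuggingFace transformers library.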

The Cutoff_Dataset class is a subclass of PyTorch's Dataset class. It is used for loading and tokenizing sentences on-the-fly, and is designed to be used with a PyTorch DataLoader to efficiently load and preprocess data in parallel. However, sentences that exceed the transformer model's maximum number of tokens are simply cut off (truncated).

In order to increase data capture, we use Sliding Window Attention with a specific stride. This ensures that all data given in one document is used as input to the model, and that the model sees how the sections connect.

(Figure: sliding windows with a fixed stride moving over a document, producing overlapping input chunks.)
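The windowing itself reduces to a short loop. This sketch assumes the tokens are already produced by a tokenizer; the parameter names (`max_len`, `stride`) are illustrative.

```python
# Split a document's tokens into overlapping windows of at most max_len
# tokens, advancing by stride tokens each step, so no text is discarded
# and each window shares context with its neighbours.
def sliding_windows(tokens, max_len=8, stride=4):
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

doc = [f"tok{i}" for i in range(18)]
chunks = sliding_windows(doc, max_len=8, stride=4)
# Adjacent windows overlap by max_len - stride tokens, so entity mentions
# near a window boundary still appear with surrounding context.
```

With the HuggingFace tokenizers, the same effect can be had via `return_overflowing_tokens=True` together with a `stride` argument.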

The Admission_Notes_Dataset class is then used to load and tokenize admission notes for Masked Language Modelling.
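For reference, the standard BERT-style masking scheme that Masked Language Modelling relies on can be sketched as below (15% of tokens are selected; of those, 80% become the mask token, 10% a random token, 10% are kept). This is a generic illustration, not the repository's implementation.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """Return (masked inputs, labels) for MLM; -100 marks positions
    ignored by the loss, as PyTorch's cross-entropy convention expects."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tid          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id  # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token unchanged
    return inputs, labels

ids = list(range(100, 120))
inputs, labels = mask_tokens(ids, vocab_size=30000, mask_id=4)
```

In practice, `transformers.DataCollatorForLanguageModeling` implements this masking and would typically be used instead of hand-rolling it.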

Together, these classes provide a convenient way to handle data loading and preprocessing for a machine learning model.

Ideas for Pre-Training

First of all, we need to adapt the model to the very specific corpus of medical texts in Spanish. In order to expand the model's vocabulary via domain-specific pretraining, it would be wise to use MedLexSp, a recently released dataset containing curated medical vocabulary in Spanish. Furthermore, to increase the model's understanding of patient notes, we could use masked language modelling on patient admission notes from the TREC CT proceedings. These texts would need to be automatically translated into Spanish, possibly via DeepL or the Google Translate API.

Various other datasets could also be used for a first step of general pretraining of the model.

After discussing this with Leonardo, the author of MedLexSp, I realized that ready-made models would be more suitable for this purpose - after all, it is difficult to compete with such models given limited computational power, time and resources. The following models seem promising, especially from the PlanTL-GOB-ES HuggingFace repository:

  • roberta-es-clinical-trials-ner provides a ready-made NER model that is already good at detecting chemical entities and pharmacological substances as well as pathologic conditions - this makes it suitable for both the first and the second track.
  • bsc-bio-ehr-es has been trained on a large corpus of medical free text in Spanish (based on RoBERTa) and shows great understanding of this matter.

Transfer-Learning and Data Augmentation

This focuses on Leonardo's fine-tuned model, which is proficient in detecting pathologic conditions as well as chemical entities and pharmacological substances in Spanish. Leonardo tested it on the dataset, but it still needs to be fine-tuned with the data provided by the organisers. The model currently recognises any disorder or sign/symptom, whereas MultiCardioNER seems to annotate only cardiovascular pathological conditions; as a result, its out-of-the-box performance is unsatisfactory. In addition, the other label types predicted by the model (e.g. anatomical entities or procedures) will need to be removed.
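Removing the unwanted label types amounts to a simple post-processing filter over the model's predicted spans. The label names below are assumptions for illustration; the actual tag set depends on the model's configuration.

```python
# Keep only the entity types relevant to the track, dropping other labels
# (e.g. anatomical entities or procedures) that the model also predicts.
KEEP = {"ENFERMEDAD"}  # assumed label name for diseases

def filter_entities(entities, keep=KEEP):
    """entities: list of dicts like {'start': 0, 'end': 7, 'label': '...'}."""
    return [e for e in entities if e["label"] in keep]

preds = [
    {"start": 0, "end": 7, "label": "ENFERMEDAD"},
    {"start": 12, "end": 20, "label": "PROCEDIMIENTO"},
]
filtered = filter_entities(preds)  # only the disease span survives
```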

To further fine-tune the model, we use data augmentation via external datasets: for example, taking English clinical cases or reports, translating them automatically into Spanish, and pre-annotating them with a preliminary model; this "silver standard" data can then be used to fine-tune the final model. Such a dataset, medical_mtsamples, is available on HuggingFace in English. I automatically translated it into Spanish, once again using the BRAT annotation format.
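Serializing the pre-annotated silver-standard entities into BRAT's standoff format is straightforward: each entity becomes a `T` line with a label, character offsets, and the surface text. A minimal sketch (the label name is illustrative):

```python
# Write entities as BRAT standoff annotation lines (the .ann file format):
# one tab-separated line per entity: ID, "label start end", surface text.
def to_brat(entities):
    lines = []
    for i, (label, start, end, text) in enumerate(entities, 1):
        lines.append(f"T{i}\t{label} {start} {end}\t{text}")
    return "\n".join(lines)

ann = to_brat([("ENFERMEDAD", 10, 28, "infarto de miocardio")])
```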

License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
