Do LLMs Surpass Encoders for Biomedical NER?

This repository contains the official codebase and dataset for our paper, "Do LLMs Surpass Encoders for Biomedical NER?", accepted at the IEEE International Conference on Healthcare Informatics (ICHI) 2025.

In this repository, we provide scripts for training and evaluating both encoder-based and decoder-based Named Entity Recognition (NER) models. The encoder model leverages transformer-based architectures for token classification, while the decoder model is designed for autoregressive generation tasks.

The evaluation follows the SemEval (International Workshop on Semantic Evaluation) entity-level scoring scheme. For more details on named-entity evaluation, see this blog post by David Batista.
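
For orientation, the sketch below computes a strict (exact-span) entity-level F1 from BIO tags. It is a deliberately simplified stand-in for the full SemEval scheme, which also reports partial, type, and boundary matches, and it is not the scorer implemented in ner_SemEval.py.

# Strict (exact-span) entity-level scoring from BIO tags -- simplified illustration,
# not the SemEval scorer shipped in ner_SemEval.py.
def spans(tags):
    """Extract (type, start, end) spans from a BIO tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                out.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return out

def strict_f1(gold_tags, pred_tags):
    """Micro F1 over exact (sentence, type, start, end) matches."""
    gold = {(s, sp) for s, sent in enumerate(gold_tags) for sp in spans(sent)}
    pred = {(s, sp) for s, sent in enumerate(pred_tags) for sp in spans(sent)}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

print(strict_f1([["B-protein", "I-protein", "O"]], [["B-protein", "I-protein", "O"]]))  # 1.0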


Table of Contents

  • Project Overview
  • File Structure
  • Installation
  • Usage

Project Overview

This project implements Named Entity Recognition (NER) using two different modeling approaches:

  • Encoder-based Models: Use transformer-based token classification (e.g., BERT) for sequence labeling (a minimal sketch follows this list).
  • Decoder-based Models: Use autoregressive transformer models (e.g., Mistral-7B) for sequence generation.
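
As a concrete illustration of the encoder formulation, the sketch below builds a token-classification model with Hugging Face Transformers. The label set, example sentence, and hyperparameters are placeholders for illustration, not necessarily those used by encoders/main.py.

# Encoder-style NER: classify each (sub)token into BIO labels.
# Illustrative only -- the classification head here is untrained.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-protein", "I-protein"]                     # placeholder label set
model_name = "google-bert/bert-large-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

enc = tokenizer("IL-2 activates T cells.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                             # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)[0].tolist()
print([labels[i] for i in pred_ids])                         # per-token BIO predictions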

The models can be trained on various datasets, such as:

  • JNLPBA
  • BioRED
  • ChemProt
  • BC5CDR
  • Reddit_Impacts


File Structure

.
├── datasets/
│
├── encoders/
│   ├── data_preprocessing.py       # Prepares datasets for encoder model
│   ├── model_prepare.py            # Prepares encoder model for training
│   ├── metrics.py                  # Evaluation metrics for encoder model
│   ├── ner_SemEval.py              # SemEval evaluation
│   ├── run_encoder_train.py        # Trains the encoder model
│   ├── run_encoder_inference.py    # Runs inference on trained encoder model
│   ├── inference.py                # Core inference script
│   ├── main.py                     # Main script for encoder training/evaluation
│
├── decoders/
│   ├── data_preprocessing.py       # Prepares datasets for decoder model
│   ├── model_prepare.py            # Prepares decoder model for training
│   ├── metrics.py                  # Evaluation metrics for decoder model
│   ├── ner_SemEval.py              # SemEval evaluation
│   ├── run_decoder_train.py        # Trains the decoder model
│   ├── run_decoder_inference.py    # Runs inference on trained decoder model
│   ├── inference.py                # Core inference script
│   ├── main.py                     # Main script for decoder training/evaluation
│
├── encoder_requirements.txt        # Required dependencies for encoder model
├── decoder_requirements.txt        # Required dependencies for decoder model
└── README.md                       # Documentation

Installation

To install the necessary dependencies for the encoder model, run:

pip install -r encoder_requirements.txt

To install the necessary dependencies for the decoder model, run:

pip install -r decoder_requirements.txt

Usage

Training & Evaluating the Encoder Model

To train the encoder model, use:

python run_encoder_train.py

This script executes the following command internally:

python /src/main.py \
    --output_dir /results/ \
    --datasets_path /datasets/ \
    --num_train_epochs 20 \
    --model_name google-bert/bert-large-uncased \
    --learning_rate 2e-5 \
    --batch_size 8 \
    --dataset_name JNLPBA
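
For reference, the training flags above map roughly onto a standard Hugging Face TrainingArguments object; the snippet below is an illustrative sketch, not the actual configuration built in main.py.

# Rough equivalent of the CLI flags above (illustrative only).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results/",
    num_train_epochs=20,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
)
# These arguments would then be passed to a Trainer together with the model,
# the tokenized dataset splits, and a compute_metrics function.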

To perform inference using the trained encoder model, run:

python run_encoder_inference.py

This internally executes:

python /src/inference.py \
    --model_dir /results/ \
    --datasets_path /datasets/ \
    --model_name google-bert/bert-large-uncased \
    --dataset_name JNLPBA
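
For a quick sanity check of a trained encoder, the token-classification pipeline can be used directly. In the sketch below the model path is the base checkpoint, so the predictions are meaningless; in practice it would point at the fine-tuned checkpoint directory under results/.

# Illustrative inference with the token-classification pipeline.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="google-bert/bert-large-uncased",   # replace with the trained checkpoint dir
    aggregation_strategy="simple",            # merge word pieces into entity spans
)
print(ner("IL-2 activates T cells in vitro."))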

Training & Evaluating the Decoder Model

To train the decoder model, run:

python run_decoder_train.py

This will execute the following command:

python /src/main.py \
    --output_dir /results/ \
    --datasets_path /datasets/ \
    --num_train_epochs 20 \
    --learning_rate 4e-5 \
    --optim paged_adamw_8bit \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --generation_batch_size 16 \
    --max_new_tokens 3000 \
    --model_name mistralai/Mistral-7B-Instruct-v0.3 \
    --isQLoRA True \
    --dataset_name JNLPBA \
    --hf_token <YOUR_HF_TOKEN>
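
The --isQLoRA flag indicates 4-bit quantized LoRA fine-tuning. A typical QLoRA setup combines a bitsandbytes quantization config with PEFT LoRA adapters, as sketched below; the adapter hyperparameters and target modules are placeholders and may differ from those in decoders/model_prepare.py. Running this requires a CUDA GPU, bitsandbytes, peft, and a Hugging Face token with access to the Mistral weights.

# Illustrative QLoRA setup: 4-bit quantized base model + LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                  # placeholder hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()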

To generate predictions using the decoder model, run:

python run_decoder_inference.py

This will internally execute:

python /src/inference.py \
    --model_dir /results/ \
    --datasets_path /datasets/ \
    --generation_batch_size 16 \
    --max_new_tokens 3000 \
    --model_name mistralai/Mistral-7B-Instruct-v0.3 \
    --isQLoRA True \
    --dataset_name JNLPBA \
    --trained_model_checkpoint_number <CHECKPOINT_NUMBER> \
    --hf_token <YOUR_HF_TOKEN>

Notes:

  • --hf_token <YOUR_HF_TOKEN>: Replace <YOUR_HF_TOKEN> with your Hugging Face API token to access pre-trained models.
  • --trained_model_checkpoint_number <CHECKPOINT_NUMBER>: Replace <CHECKPOINT_NUMBER> with the checkpoint number of the trained decoder model you want to load.
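
At inference time, the decoder formulation treats NER as text generation: the model is prompted with a sentence and asked to emit the entity mentions, which are then parsed back into spans. The prompt wording and output format below are illustrative assumptions, not the templates used in decoders/data_preprocessing.py or inference.py.

# Illustrative generative NER inference (prompt and output format are assumptions).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = (
    "Extract all protein, DNA, RNA, cell line, and cell type mentions from the "
    "sentence below. Return one mention per line as <type>: <text>.\n\n"
    "Sentence: IL-2 gene expression requires activation of NF-kappa B."
)
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))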
