ORNL/phenolitreview

Phenomic Literature Review Tool


🎯 Overview

Building a comprehensive phenomics library requires identifying computable phenotype definitions and their associated metadata in an ever-expanding biomedical literature, a task that is labor-intensive and does not scale. To address this, we developed a transformer-based language model designed to identify biomedical texts that contain computable phenotypes, and piloted its use in the Centralized Interactive Phenomics Resource (CIPHER) platform.

🚀 Objective

We provide two components: a training module, which fine-tunes a BioBERT model on labeled manuscripts, and a prediction module, which runs inference on a given PubMed document identifier (PubMedID). Both modules use our sliding-window approach to overcome the model's token-length limit, which allows full-length manuscripts to be classified.
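To illustrate the general idea of a sliding window over a long document, here is a minimal sketch. It is not the implementation used in this repository: the function names, the window size of 512, the stride of 256, and the mean aggregation of per-window scores are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window: int = 512, stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most `window` tokens.

    `window` and `stride` are illustrative defaults, not values taken from the tool.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or tokens would be skipped")
    if len(tokens) <= window:
        return [list(tokens)]
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window reached the end of the document
        start += stride
    return windows


def aggregate(scores: Sequence[float]) -> float:
    """Combine per-window scores into one document score.

    The mean is an assumption; other choices (max, voting) are equally plausible.
    """
    return sum(scores) / len(scores)


if __name__ == "__main__":
    doc = list(range(1000))  # stand-in for the token ids of a long manuscript
    chunks = sliding_windows(doc)
    print(len(chunks), [len(c) for c in chunks])
```

Each window would be classified separately by the model, and the per-window scores combined into one decision for the manuscript.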

Installation (Local)

python3 -m venv "$HOME/.venv/phenolitreview"
source "$HOME/.venv/phenolitreview/bin/activate"
python3 -m pip install ./src/
# For the training module
python3 -m pip install "./src/[training]"

Installation (Container)

We recommend Podman, but Docker works as well. This example stops at the prediction stage to minimize container size.

podman build -t phenolitreview -f ./Dockerfile --target=prediction .

To build a full container that includes the training capability, use the same command without the --target=prediction option.

Usage (Local)

The usage output from phenolitreview --help and phenolitreview_trainer --help gives the default paths for inputs and outputs. By default, the tool works relative to the current working directory.

Training Tool

The training tool reads a CSV file of labeled PubMed IDs and emits a fine-tuned model.

PMID,Label
35592925,yes
41729549,no

The sample above is test training data for verifying the installation. Save it as labels.csv before running the trainer.
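Before training, a label file in the format above can be checked with a short script. This is a sketch only: the `validate_labels` helper is hypothetical, and the rule that labels must be exactly `yes` or `no` is an assumption based on the sample data, not a documented requirement of the trainer.

```python
import csv
import io
from typing import List, Tuple

# Assumption: only the two labels shown in the sample are valid.
ALLOWED_LABELS = {"yes", "no"}


def validate_labels(text: str) -> List[Tuple[str, str]]:
    """Parse CSV text with a PMID,Label header and return (pmid, label) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["PMID", "Label"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        pmid = (row["PMID"] or "").strip()
        label = (row["Label"] or "").strip()
        if not pmid.isdigit():
            raise ValueError(f"line {line_no}: PMID is not numeric: {pmid!r}")
        if label not in ALLOWED_LABELS:
            raise ValueError(f"line {line_no}: unexpected label: {label!r}")
        rows.append((pmid, label))
    return rows


if __name__ == "__main__":
    sample = "PMID,Label\n35592925,yes\n41729549,no\n"
    print(validate_labels(sample))
```

To check a file on disk, read it first, for example `validate_labels(open("labels.csv", newline="").read())`.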

phenolitreview_trainer --label_data /path/to/labels.csv --output_dir training_data/my_model

The model is written to a checkpoint directory such as training_data/my_model/checkpoint_10.

Prediction API Server

To run the API server, you must provide a path to the model used for prediction. By default, the API server listens on port 3004.

phenolitreview --model_path training_data/my_model/checkpoint_10

Usage (Container)

Training Tool

The training tool reads a CSV file and emits a fine-tuned model. This requires the full image, built without --target=prediction. Bind mounts are used to retrieve the output from the container. The sample data from the local usage section can also be used here.

podman run --rm \
    -v "$(realpath local/path/to/labels.csv)":/literature-review/labels.csv \
    -v "$(realpath local/path/to/model)":/literature-review/my-model \
    --entrypoint phenolitreview_trainer phenolitreview \
    --label_data /literature-review/labels.csv --output_dir /literature-review/my-model

Prediction API Server

This runs the server and publishes it on port 3004 to receive requests. Command line arguments are passed the same way as for the local CLI.

podman run --rm -p 3004:3004 \
    -v $(realpath local/path/to/model):/literature-review/my-model phenolitreview \
    --model_path /literature-review/my-model

Query the API Server

The following sample queries check server liveness and health, and classify a PubMed document.

curl -D /dev/stderr --fail-with-body localhost:3004/liveness/
curl -D /dev/stderr --fail-with-body localhost:3004/health/
curl -D /dev/stderr -d '{"type": "pubmed", "data":"38908256"}' -H "Accept: application/json" -X POST --fail-with-body localhost:3004/predict/
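The same prediction request can be sent from Python using only the standard library. This is a minimal sketch: the helper names `build_predict_request` and `predict` are hypothetical, the endpoint and payload are copied from the curl example, and the response is returned as raw text because its schema is not documented here.

```python
import json
import urllib.request


def build_predict_request(pmid: str, base_url: str = "http://localhost:3004") -> urllib.request.Request:
    """Build the POST request for /predict/ with the payload from the curl example."""
    payload = json.dumps({"type": "pubmed", "data": pmid}).encode("utf-8")
    # No Content-Type is set explicitly, mirroring the curl command above;
    # urllib then sends the same form-encoded default that curl -d uses.
    return urllib.request.Request(
        f"{base_url}/predict/",
        data=payload,
        headers={"Accept": "application/json"},
        method="POST",
    )


def predict(pmid: str, base_url: str = "http://localhost:3004", timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text.

    Raises urllib.error.HTTPError on a non-2xx status.
    """
    request = build_predict_request(pmid, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a running server; see the sections above.
    print(predict("38908256"))
```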

Publication

Junghoon Chae, et al. Detecting Manuscripts Related to Computable Phenotypes Using a Transformer-based Language Model, bioRxiv, 2026.03.12.711165

Acknowledgements

Comeau DC, Wei CH, Islamaj Doğan R, and Lu Z. PMC text mining subset in BioC: about 3 million full text articles and growing, Bioinformatics, btz070, 2019.

About

This tool is used for categorizing the literature in PubMed as containing or not containing computable phenotype metadata.
