Building a comprehensive phenomics library requires identifying computable phenotype definitions and associated metadata in an ever-expanding biomedical literature, a task that is labor-intensive and does not scale. To address this, we developed a transformer-based language model for identifying biomedical texts that contain computable phenotypes and piloted its use in the Centralized Interactive Phenomics Resource (CIPHER) platform.
We provide two components: a training module that fine-tunes a BioBERT model on labeled manuscripts, and a prediction module that runs inference on a given PubMed document identifier (PubMedID). These modules incorporate our novel sliding-window approach to overcome token-length limitations, enabling accurate classification of full-length manuscripts.
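The sliding-window idea can be sketched in a few lines of Python. This is an illustrative sketch, not the phenolitreview implementation: the window size, stride, and mean-score aggregation are assumptions chosen for demonstration.

```python
def make_windows(tokens, window=512, stride=256):
    """Yield overlapping token windows that together cover the full document.

    window and stride are illustrative; BERT-style models typically cap
    input at 512 tokens, and a stride of half the window gives 50% overlap.
    """
    if len(tokens) <= window:
        yield tokens
        return
    for start in range(0, len(tokens) - window + stride, stride):
        yield tokens[start:start + window]


def classify_document(tokens, score_window, window=512, stride=256):
    """Score each window with a model-supplied function and average the scores.

    score_window stands in for a per-window model forward pass; averaging
    is one simple aggregation rule, assumed here for illustration.
    """
    scores = [score_window(w) for w in make_windows(tokens, window, stride)]
    return sum(scores) / len(scores)
```

Because every token falls inside at least one window, a manuscript longer than the model's input limit still contributes all of its text to the final classification.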
python3 -m venv "$HOME/.venv/phenolitreview"
source "$HOME/.venv/phenolitreview/bin/activate"
python3 -m pip install ./src/
# For the training module
python3 -m pip install './src/[training]'

We recommend Podman, but Docker works as well. This example stops at the prediction stage to minimize container size.
podman build -t phenolitreview -f ./Dockerfile --target=prediction

To create a full container that includes training capability, run the same command without the --target=prediction option.
The usage output from phenolitreview --help and phenolitreview_trainer --help
gives the default paths for inputs and outputs. By default, the tool works relative
to the current working directory.
The training tool reads a CSV file and emits a fine-tuned model.
PMID,Label
35592925,yes
41729549,no

Sample training data to verify the model installation. Save it as labels.csv for testing.
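Before launching a long fine-tuning run, it can help to sanity-check the labels file. The helper below is hypothetical (check_labels is not part of phenolitreview); it assumes the PMID,Label header and yes/no labels shown above.

```python
import csv


def check_labels(path):
    """Validate a labels CSV against the expected format.

    Returns (row_count, bad_rows), where bad_rows lists every row whose
    Label column is not 'yes' or 'no'. Raises ValueError on a bad header.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        if set(reader.fieldnames or []) != {"PMID", "Label"}:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        rows = list(reader)
    bad = [r for r in rows if r["Label"] not in ("yes", "no")]
    return len(rows), bad
```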
phenolitreview_trainer --label_data /path/to/labels.csv --output_dir training_data/my_model

The model will be output in a directory like training_data/my_model/checkpoint_10.
To run the API server, you must provide a path to the model used for prediction. By default, the API server listens on port 3004.
phenolitreview --model_path training_data/my_model/checkpoint_10

The training tool reads a CSV file and emits a fine-tuned model. Bind mounts are used to retrieve the output from the container. The sample data from the local usage section can also be used here.
podman run --rm \
  -v "$(realpath local/path/to/labels.csv)":/literature-review/labels.csv \
  -v "$(realpath local/path/to/model)":/literature-review/my-model \
  --entrypoint phenolitreview_trainer \
  phenolitreview \
  --label_data /literature-review/labels.csv --output_dir /literature-review/my-model

The following command runs the server and exposes port 3004 to receive requests. Command line arguments can be passed the same way as with the CLI interface.
podman run --rm -p 3004:3004 \
  -v "$(realpath local/path/to/model)":/literature-review/my-model \
  phenolitreview \
  --model_path /literature-review/my-model

Here are some sample queries to check server liveness and to classify a PubMed document.
curl -D /dev/stderr --fail-with-body localhost:3004/liveness/
curl -D /dev/stderr --fail-with-body localhost:3004/health/
curl -D /dev/stderr -d '{"type": "pubmed", "data":"38908256"}' -H "Accept: application/json" -X POST --fail-with-body localhost:3004/predict/
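The same prediction query can also be issued from Python using only the standard library. This is a sketch assuming the server above is running on localhost:3004; build_predict_request is a hypothetical helper, and the response schema is whatever the server returns.

```python
import json
import urllib.request


def build_predict_request(pmid, host="http://localhost:3004"):
    """Build the POST request for the /predict/ endpoint.

    Mirrors the curl example: a JSON body with type 'pubmed' and the
    PubMed document identifier as a string.
    """
    body = json.dumps({"type": "pubmed", "data": pmid}).encode("utf-8")
    return urllib.request.Request(
        host + "/predict/",
        data=body,
        headers={"Accept": "application/json"},
        method="POST",
    )


def predict(pmid):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_predict_request(pmid)) as resp:
        return json.load(resp)
```

For example, predict("38908256") mirrors the curl POST above; separating request construction from sending keeps the request itself easy to inspect or test offline.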