A combined deep learning method for automated recognition of human phenotype ontology
PhenoBERT is a method that uses advanced deep learning techniques (i.e., convolutional neural networks and BERT) to identify clinical disease phenotypes in free text. Currently, only English text is supported. Compared with other methods on expert-annotated test datasets, PhenoBERT achieves state-of-the-art performance.
Y. Feng, L. Qi and W. Tian, "PhenoBERT: a combined deep learning method for automated recognition of human phenotype ontology," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi: 10.1109/TCBB.2022.3170301.
You can use PhenoBERT on your local machine; we have tested it using Docker. For reasons beyond our control, the web version of PhenoBERT is not yet available.
- Download the whole project from GitHub (you need to install `git` first).
git clone https://github.com/EclipseCN/PhenoBERT.git
- Enter the project main directory.
cd PhenoBERT
- Install dependencies in the current Python (>=3.6) environment (you need to install Python >= 3.6 first). Notice: we recommend using a Python virtual environment (`venv`) to avoid dependency conflicts.
pip install -r requirements.txt
python setup.py
- Move the pretrained files into the corresponding folders.
# download files from Google Drive in advance
mv /path/to/download/embeddings/* phenobert/embeddings
mv /path/to/download/models/* phenobert/models
After step 4, the file structure should look like:
- phenobert/
  - models/
    - HPOModel_H/
    - bert_model_max_triple.pkl
  - embeddings/
    - biobert_v1.1_pubmed/
    - fasttext_pubmed.bin
We have prepared pre-trained fastText and BERT embeddings and model files (with the .pkl suffix) on Google Drive for download.
Directory Name | File Name | Description
---|---|---
models/ | HPOModel_H/ | CNN hierarchical model file
models/ | bert_model_max_triple.pkl | BERT model file
embeddings/ | biobert_v1.1_pubmed/ | BERT embedding obtained from BioBERT
embeddings/ | fasttext_pubmed.bin | fastText embedding trained on PubMed
Once the download is complete, put the files in the corresponding folders for PhenoBERT to load.
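As a quick sanity check before running PhenoBERT, the expected layout can be verified with a short Python snippet (the helper name `missing_files` is ours; the paths follow the structure above):

```python
from pathlib import Path

# Paths that must exist after step 4 (relative to the project root)
REQUIRED = [
    "phenobert/models/HPOModel_H",
    "phenobert/models/bert_model_max_triple.pkl",
    "phenobert/embeddings/biobert_v1.1_pubmed",
    "phenobert/embeddings/fasttext_pubmed.bin",
]

def missing_files(root="."):
    """Return the required model/embedding paths that are not present under root."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = missing_files()
    print("All files in place" if not missing else f"Missing: {missing}")
```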
We provide three ways to use PhenoBERT. Due to this issue, all calls need to be made from the phenobert/utils directory.
cd phenobert/utils
The most common usage is recognizing human clinical disease phenotypes in free text.
Given a set of text files, PhenoBERT will annotate each text file and generate an annotation file with the same name in the target folder.
Example using annotate.py:
python annotate.py -i DIR_IN -o DIR_OUT
Arguments:
[Required]
-i directory for storing text files
-o directory for storing annotation files
[Optional]
-p1 parameter for CNN model [0.8]
-p2 parameter for CNN model [0.6]
-p3 parameter for BERT model [0.9]
-al flag for not filtering overlapping concepts
-nb flag for not using BERT
-t number of CPU threads for computation [10]
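For scripting, the call above can be wrapped in a small helper that assembles the command line from the documented flags (the helper name and the example directory names are illustrative):

```python
import subprocess

def build_annotate_cmd(dir_in, dir_out, p1=0.8, p2=0.6, p3=0.9, threads=10):
    """Assemble the annotate.py command line with the documented flags."""
    return ["python", "annotate.py",
            "-i", dir_in, "-o", dir_out,
            "-p1", str(p1), "-p2", str(p2), "-p3", str(p3),
            "-t", str(threads)]

cmd = build_annotate_cmd("DIR_IN", "DIR_OUT", threads=4)
# Must be run from phenobert/utils:
# subprocess.run(cmd, check=True)
```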
We also provide APIs for other programs to integrate with.
from api import *
Running the above code imports the related functions and models and keeps them as global variables for quick, repeated calls. Alternatively, you can simply use a Python interactive shell.
Currently we have integrated the following functions:
- annotate directly from String
print(annotate_text("I have a headache"))
Output:
9 17 headache HP:0002315 1.0
Notice: use output=path/ to redirect the output to a specified file
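The annotation lines are easy to post-process; here is a minimal parser, assuming whitespace-separated columns (start offset, end offset, mention text, HPO ID, score) as in the output above (the function name is ours):

```python
def parse_annotation_line(line):
    """Split one PhenoBERT annotation line into its fields."""
    parts = line.split()
    return {
        "start": int(parts[0]),
        "end": int(parts[1]),
        "text": " ".join(parts[2:-2]),  # mention may contain spaces
        "hpo_id": parts[-2],
        "score": float(parts[-1]),
    }

print(parse_annotation_line("9 17 headache HP:0002315 1.0"))
# {'start': 9, 'end': 17, 'text': 'headache', 'hpo_id': 'HP:0002315', 'score': 1.0}
```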
- get the approximate location (top-level HPO category) of the disease
print(get_L1_HPO_term(["cardiac hypertrophy", "renal disease"]))
Output:
[['cardiac hypertrophy', {'HP:0001626'}], ['renal disease', {'HP:0000119'}]]
- get the most similar HPO terms
print(get_most_related_HPO_term(["cardiac hypertrophy", "renal disease"]))
Output:
[['cardiac hypertrophy', 'None'], ['renal disease', 'HP:0000112']]
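Note that, as in the output above, the API signals "no match" with the string 'None' rather than Python's None. A small filter for keeping only mapped phrases (the helper name is ours):

```python
def matched_terms(results):
    """Keep only phrases that were mapped to an HPO term.

    The API returns the string 'None' (not Python None) for unmatched phrases.
    """
    return [(phrase, hpo) for phrase, hpo in results if hpo != "None"]

results = [["cardiac hypertrophy", "None"], ["renal disease", "HP:0000112"]]
print(matched_terms(results))
# [('renal disease', 'HP:0000112')]
```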
- determine if two phrases match
print(is_phrase_match_BERT("cardiac hypertrophy", "Ventricular hypertrophy"))
Output:
Match
For users who are not familiar with command-line tools, we also provide a GUI annotation application.
Simply use
python gui.py
Then you will get a visual interactive interface as shown in the figure below, in which the yellow highlighted dialog box displays the running status.
We provide here the annotated corpora used in the evaluation (phenobert/data
), which are publicly available after privacy processing.
Dataset | Num | Description |
---|---|---|
GSC+ | 228 | Contains 228 abstracts of biomedical literature (Lobo et al., 2017) in raw format |
ID-68 | 68 | Clinical description of 68 real cases in the intellectual disability study (Anazi et al., 2017) |
GeneReviews | 10 | Contains 10 GeneReviews clinical cases and annotations |
val | 30 | Contains 30 disease research articles from the OMIM database to determine hyperparameters in our model |
For users who cannot access Google Drive, or who want to customize the training process for themselves, we provide the training Python scripts and the training set used by PhenoBERT. The training set can also be customized by the user to generate specific models for other purposes.
cd phenobert/utils
# produce trained models for CNN model
python train.py
python train_sub.py
# produce trained models for BERT model
python my_bert_match.py