This repo hosts pretrained and fine-tuned weights and related scripts for ClinicalBERT, a contextual representation for clinical notes.
- The pretrained Clinical XLNet model is available here.
- Detailed step-by-step instructions for pretraining ClinicalBERT and Clinical XLNet from scratch are available here.
- The predictive performance results in this version have been updated using the correct pretraining test split described in the pretraining script above. For performance comparisons on more clinical outcomes and against more baselines, using the correct split for ClinicalBERT/XLNet, please see the Clinical XLNet paper.
pip install pytorch-pretrained-bert
We use MIMIC-III. Because access to MIMIC-III requires completing the CITI training program, we refer users to the link to obtain the data. However, because clinical notes share common characteristics, users can apply the ClinicalBERT weights to any clinical notes, although further fine-tuning from our checkpoint is recommended.
File system expected:
-data
  -discharge
    -train.csv
    -val.csv
    -test.csv
  -3days
    -train.csv
    -val.csv
    -test.csv
  -2days
    -test.csv
Each data file is expected to have the columns "TEXT", "ID" and "Label" (note chunks, admission ID, and readmission label).
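Before running the scripts, it can help to confirm that each CSV has the expected header. The following is a minimal sketch using only the Python standard library; the helper name `missing_columns` and the sample row are illustrative assumptions, not part of this repo.

```python
import csv
import io

REQUIRED_COLUMNS = ("TEXT", "ID", "Label")  # column names stated above


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Tiny in-memory example; with real data, pass an open file handle instead,
# e.g. open("./data/discharge/train.csv", newline="").
sample = io.StringIO("ID,TEXT,Label\n1001,patient admitted with chest pain,1\n")
print(missing_columns(sample))  # an empty list means all columns are present
```

Column order does not matter for this check; only the header names are compared.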
Use this Google link to download the pretrained ClinicalBERT weights along with the model weights fine-tuned on the readmission task.
The following scripts assume a model folder with the following structure:
-model
  -discharge_readmission
    -bert_config.json
    -pytorch_model.bin
  -early_readmission
    -bert_config.json
    -pytorch_model.bin
  -pretraining
    -bert_config.json
    -pytorch_model.bin
    -vocab.txt
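A quick way to catch a misplaced download is to check the model folder against the listing above. This is a minimal sketch, not part of the repo: `missing_model_files` is a hypothetical helper, and placing `vocab.txt` under `pretraining` is an assumption based on the order of the listing.

```python
from pathlib import Path

# Expected layout, following the folder listing above.
# Assumption: vocab.txt belongs to the pretraining folder.
EXPECTED_FILES = {
    "discharge_readmission": ["bert_config.json", "pytorch_model.bin"],
    "early_readmission": ["bert_config.json", "pytorch_model.bin"],
    "pretraining": ["bert_config.json", "pytorch_model.bin", "vocab.txt"],
}


def missing_model_files(model_root):
    """List expected files that are not present under model_root (hypothetical helper)."""
    root = Path(model_root)
    return [
        str(Path(sub) / name)
        for sub, names in EXPECTED_FILES.items()
        for name in names
        if not (root / sub / name).is_file()
    ]


# Report anything missing under ./model (prints nothing if the layout is complete).
for path in missing_model_files("./model"):
    print("missing:", path)
```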
The scripts below run prediction of 30-day hospital readmission.
Early notes prediction (use ./data/2days/ in place of ./data/3days/ to evaluate on the 2-day data):
python ./run_readmission.py \
  --task_name readmission \
  --readmission_mode early \
  --do_eval \
  --data_dir ./data/3days/ \
  --bert_model ./model/early_readmission \
  --max_seq_length 512 \
  --output_dir ./result_early
Discharge summary prediction:
python ./run_readmission.py \
  --task_name readmission \
  --readmission_mode discharge \
  --do_eval \
  --data_dir ./data/discharge/ \
  --bert_model ./model/discharge_readmission \
  --max_seq_length 512 \
  --output_dir ./result_discharge
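To train your own readmission model starting from the pretrained ClinicalBERT weights, replace the parenthesized placeholders (DATA_FILE), (BATCH_SIZE) and (EPOCHs) with your own values:
python ./run_readmission.py \
  --task_name readmission \
  --do_train \
  --do_eval \
  --data_dir ./data/(DATA_FILE) \
  --bert_model ./model/pretraining \
  --max_seq_length 512 \
  --train_batch_size (BATCH_SIZE) \
  --learning_rate 2e-5 \
  --num_train_epochs (EPOCHs) \
  --output_dir ./result_new
Training uses train.csv from the (DATA_FILE) folder.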
Results are written to the output_dir folder, which contains:
- 'logits_clinicalbert.csv': logits from ClinicalBERT to compare with other models
- 'auprc_clinicalbert.png': Precision-Recall Curve
- 'auroc_clinicalbert.png': ROC Curve
- 'eval_results.txt': RP80, accuracy, loss
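RP80 refers to recall at 80% precision. A minimal sketch of one common way to compute such a metric from labels and predicted scores follows; the function name, the toy data, and the exact thresholding rule are assumptions, and run_readmission.py may compute the reported number differently.

```python
def recall_at_precision(labels, scores, min_precision=0.8):
    """Highest recall over all score thresholds whose precision is >= min_precision."""
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    # Sort by descending score; each prefix is the "predicted positive" set at some threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    true_positives = 0
    for rank, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Evaluate only at the end of a run of tied scores: a threshold cannot split ties.
        is_last_of_tie = rank == len(ranked) or ranked[rank][0] != score
        if is_last_of_tie and true_positives / rank >= min_precision:
            best_recall = max(best_recall, true_positives / total_positives)
    return best_recall


# Toy example: binary readmission labels and hypothetical model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(recall_at_precision(labels, scores))
```

In the toy data, only the two highest-scoring examples can be flagged while keeping precision at or above 0.8, so two of the three positives are recovered.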
We provide a script for preprocessing clinical notes and merging them with admission information in MIMIC-III.
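Because the data files hold note chunks keyed by admission ID, one preprocessing step is splitting a long note into pieces that fit within the model's sequence limit. The sketch below illustrates that idea with fixed-size word windows; the chunk size, function name, and sample values are assumptions for illustration, and the provided preprocessing script may chunk notes differently.

```python
def chunk_note(text, admission_id, label, words_per_chunk=300):
    """Split one note into rows of at most words_per_chunk words.

    Each row repeats the admission ID and label, matching the TEXT/ID/Label columns.
    The default chunk size is an illustrative assumption.
    """
    words = text.split()
    rows = []
    for start in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[start:start + words_per_chunk])
        rows.append({"TEXT": chunk, "ID": admission_id, "Label": label})
    return rows


# Toy example with a tiny chunk size so the split is visible.
note = "patient admitted with chest pain and shortness of breath"
for row in chunk_note(note, admission_id=1001, label=1, words_per_chunk=4):
    print(row)
```

Repeating the ID on every chunk is what lets chunk-level predictions be grouped back to a single admission later.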
- Attention: this notebook is a tutorial to visualize self-attention.
Please use this link to download Word2Vec and FastText models for Clinical Notes.
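To use them:
import gensim
# Load the downloaded model file
word2vec = gensim.models.KeyedVectors.load('word2vec.model')
# Embedding matrix with one row per vocabulary word.
# Note: .wv.vocab is the gensim < 4.0 API; in gensim 4.x, use word2vec.wv.vectors instead.
weights = word2vec.wv[word2vec.wv.vocab]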
Please contact kh2383@nyu.edu for help or submit an issue.
Please cite the arXiv paper:
@article{clinicalbert,
author = {Kexin Huang and Jaan Altosaar and Rajesh Ranganath},
title = {ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission},
year = {2019},
journal = {arXiv:1904.05342},
}