This repository hosts the code and resources for the research project "Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data," aimed at enhancing clinical natural language processing (NLP) methods through the use of synthetic data generated by advanced language models. Our study demonstrates the feasibility and effectiveness of augmenting NLP model training with high-quality synthetic clinical text, showing promising applications in the high-stakes domain of healthcare.
- Synthetic Data Generation: Utilizes large language models (LLMs) to create synthetic annotated clinical text datasets.
- Label Correction Technique: An active learning step applied to enhance the quality of synthetic datasets.
- Benchmark Evaluation: Assesses model performance on NLP benchmarks and on real-world, long-document clinical datasets.
- Prompts: The detailed prompts used for synthetic data generation, covering all of our tasks, are provided in the 4 .py files.
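
To give a rough idea of what a label-correction step of this kind can look like, here is a minimal sketch. It is not the code from this repository or the paper's exact procedure: the function names (`select_for_review`, `apply_corrections`), the toy keyword "model", and the example sentences are all hypothetical. It assumes one common active-learning pattern, in which the synthetic examples whose labels a model finds least plausible are sent for review first.

```python
from typing import Callable, Dict, List, Sequence


def select_for_review(
    texts: Sequence[str],
    labels: Sequence[str],
    predict_proba: Callable[[str], Dict[str, float]],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` examples whose synthetic label
    receives the lowest probability from the model (hypothetical helper)."""
    scored = []
    for i, (text, label) in enumerate(zip(texts, labels)):
        probs = predict_proba(text)
        # Probability the model assigns to the synthetic label; 0.0 if unseen.
        scored.append((probs.get(label, 0.0), i))
    scored.sort()  # least-supported labels first
    return [i for _, i in scored[:budget]]


def apply_corrections(labels: Sequence[str], corrections: Dict[int, str]) -> List[str]:
    """Return a new label list with reviewed labels overwritten."""
    fixed = list(labels)
    for i, new_label in corrections.items():
        fixed[i] = new_label
    return fixed


def toy_predict_proba(text: str) -> Dict[str, float]:
    # Assumption: a stand-in for a real classifier, keyed on a single word.
    if "fever" in text.lower():
        return {"positive": 0.9, "negative": 0.1}
    return {"positive": 0.2, "negative": 0.8}


# Made-up synthetic examples; the third label is deliberately wrong.
texts = [
    "Patient reports fever and chills.",
    "Patient denies any complaints.",
    "Fever for three days.",
]
labels = ["positive", "negative", "negative"]

to_review = select_for_review(texts, labels, toy_predict_proba, budget=1)
corrected = apply_corrections(labels, {to_review[0]: "positive"})
```

In practice the reviewer could be a human annotator or another model, and `predict_proba` would be replaced by a classifier trained on the task at hand.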
For any queries, please reach out to Dr. Danielle S. Bitterman at dbitterman@bwh.harvard.edu.
If you use our work in your research, please cite:
@article{chen2024improving,
title={Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data},
author={Chen, Shan and Gallifant, Jack and Guevara, Marco and Gao, Yanjun and others},
journal={arXiv preprint arXiv:XXXX.XXXX},
year={2024}
}