AIM-Harvard/fake2real

Improving Clinical NLP Performance with Synthetic Clinical Data

Overview

This repository hosts the code and resources for the research project "Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data." The project aims to improve clinical natural language processing (NLP) methods by training them on synthetic data generated by language models. Our study shows that augmenting NLP model training with high-quality synthetic clinical text is both feasible and effective, with promising applications in the high-stakes domain of healthcare.

Key Features

  • Synthetic Data Generation: Uses large language models (LLMs) to create synthetic annotated clinical text datasets.
  • Label Correction Technique: Applies an active learning step to improve the quality of the synthetic datasets.
  • Benchmark Evaluation: Assesses model performance on NLP benchmarks and on real-world long-document clinical datasets.
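
To make the label correction idea more concrete, here is a minimal sketch of one common active learning pattern: rank synthetic examples by how strongly a classifier disagrees with their assigned labels, then send the top few for review. This is an illustration under stated assumptions, not the procedure used in the paper. The `Example` class and the functions `disagreement`, `select_for_review`, and `apply_corrections` are hypothetical names, and the sketch assumes a binary task with classifier probabilities already computed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Example:
    """Hypothetical container for one synthetic training example."""
    text: str
    label: int            # label assigned at generation time (0 or 1)
    prob_positive: float  # a classifier's predicted probability of label 1


def disagreement(ex: Example) -> float:
    """Score how strongly the classifier contradicts the assigned label."""
    prob_of_assigned = ex.prob_positive if ex.label == 1 else 1.0 - ex.prob_positive
    return 1.0 - prob_of_assigned


def select_for_review(examples, budget):
    """Return indices of the `budget` examples with the most disputed labels."""
    ranked = sorted(
        range(len(examples)),
        key=lambda i: disagreement(examples[i]),
        reverse=True,
    )
    return ranked[:budget]


def apply_corrections(examples, corrections):
    """Return a new list with reviewed labels applied.

    `corrections` maps an example index to its corrected label
    (assumed to come from a human reviewer or a stronger model).
    """
    return [
        replace(ex, label=corrections.get(i, ex.label))
        for i, ex in enumerate(examples)
    ]
```

In this sketch, the review budget caps the annotation effort: only the examples whose labels look least plausible to the classifier are re-examined, and the rest of the synthetic dataset is left unchanged.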

In this repo

  • The four .py files contain the detailed prompts used for synthetic data generation, covering all of our tasks.
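
The actual prompts live in the repository's .py files. Purely as an illustration of how a label-conditioned generation prompt might be assembled, here is a hypothetical sketch. The function name `build_generation_prompt`, its parameters, and the prompt wording are all assumptions and are not taken from the repository.

```python
def build_generation_prompt(task_description, label, n_samples=5):
    """Assemble a hypothetical prompt asking an LLM for labeled synthetic text.

    The wording is illustrative only; it is not the prompt used in the paper.
    """
    lines = [
        "You are helping to create training data for a clinical NLP task.",
        f"Task: {task_description}",
        f"Write {n_samples} short, fictional clinical note sentences "
        f"whose correct label is '{label}'.",
        "Do not include any real patient information.",
        "Return one sentence per line.",
    ]
    return "\n".join(lines)
```

Conditioning the prompt on the target label means each generated sentence arrives already annotated, which is what makes the resulting dataset usable for supervised training.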

Contact

For any queries, please reach out to Dr. Danielle S. Bitterman at dbitterman@bwh.harvard.edu.

Citation

If you use our work in your research, please cite:

@article{chen2024improving,
  title={Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data},
  author={Chen, Shan and Gallifant, Jack and Guevara, Marco and Gao, Yanjun and others},
  journal={arXiv preprint arXiv:XXXX.XXXX},
  year={2024}
}
