Improving Clinical NLP Performance with Synthetic Clinical Data

Overview

This repository hosts the code and resources for the research project "Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data." The project aims to improve clinical natural language processing (NLP) methods by training on synthetic data generated by advanced language models. Our study demonstrates that augmenting NLP model training with high-quality synthetic clinical text is feasible and effective, with promising applications in the high-stakes domain of healthcare.

Key Features

Synthetic Data Generation: Utilizes large language models (LLMs) to create synthetic annotated clinical text datasets.
Label Correction Technique: Applies an active learning step to improve the quality of the synthetic datasets.
Benchmark Evaluation: Assesses model performance on NLP benchmarks and real-world long-document clinical datasets.
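
The README does not describe how the label correction step selects examples, so the following is only a minimal sketch of one common approach: flag synthetic examples whose generated label disagrees with a confident classifier prediction, then send those to a reviewer first. The `Example` class, the `flag_for_review` function, the 0.9 threshold, and the sample probabilities are all hypothetical and are not taken from this repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Example:
    text: str
    synthetic_label: str            # label assigned when the text was generated
    model_probs: Dict[str, float]   # hypothetical classifier probabilities per label


def flag_for_review(
    examples: List[Example], threshold: float = 0.9
) -> List[Tuple[Example, str, float]]:
    """Return examples whose generated label disagrees with a confident prediction."""
    flagged = []
    for ex in examples:
        predicted = max(ex.model_probs, key=ex.model_probs.get)
        confidence = ex.model_probs[predicted]
        if predicted != ex.synthetic_label and confidence >= threshold:
            flagged.append((ex, predicted, confidence))
    # Most confident disagreements come first, so a reviewer sees them first.
    flagged.sort(key=lambda item: item[2], reverse=True)
    return flagged


# Made-up data for illustration only.
examples = [
    Example("Hypothesis A", "entailment",
            {"entailment": 0.95, "contradiction": 0.03, "neutral": 0.02}),
    Example("Hypothesis B", "entailment",
            {"entailment": 0.04, "contradiction": 0.93, "neutral": 0.03}),
    Example("Hypothesis C", "neutral",
            {"entailment": 0.55, "contradiction": 0.05, "neutral": 0.40}),
]

flagged = flag_for_review(examples)
for ex, predicted, confidence in flagged:
    print(f"{ex.text}: generated={ex.synthetic_label}, predicted={predicted} ({confidence:.2f})")
```

In this sketch only the second example is flagged: the first agrees with the model, and the third disagrees but falls below the confidence threshold. Whether a flagged label is actually changed would be decided by a reviewer, not by this function.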

In this repo

The detailed prompts for synthetic data generation for all of our tasks are provided in the four .py files: gen_AP.py, gen_eso.py, gen_mednli.py, and gen_sum.py.
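
To give a sense of what a generation prompt for a labeled task can look like, here is a minimal sketch of a prompt builder for a natural language inference task such as MedNLI. It is not the prompt used in this repository; the actual prompts are in the .py files. The wording of the template, the `build_prompt` function, and the sample premise are hypothetical, and the sketch only builds the prompt strings without calling any model.

```python
from typing import Tuple

# The three standard natural language inference labels.
NLI_LABELS: Tuple[str, ...] = ("entailment", "contradiction", "neutral")


def build_prompt(premise: str, label: str) -> str:
    """Build a hypothetical prompt asking a model for a hypothesis with a given label."""
    if label not in NLI_LABELS:
        raise ValueError(f"Unknown label: {label!r}")
    return (
        "You are helping create training data for clinical natural language inference.\n"
        f"Premise: {premise}\n"
        f"Write one hypothesis sentence whose relationship to the premise is: {label}.\n"
        "Return only the hypothesis."
    )


# Made-up premise for illustration only.
premise = "The patient denies chest pain."
prompts = [build_prompt(premise, label) for label in NLI_LABELS]

for prompt in prompts:
    print(prompt)
    print("---")
```

Because the target label is fixed in the prompt, each generated hypothesis arrives already annotated, which is what makes the resulting dataset usable for supervised training.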

Contact

For any queries, please reach out to Dr. Danielle S. Bitterman at dbitterman@bwh.harvard.edu.

Citation

If you use our work in your research, please cite:

@article{chen2024improving,
  title={Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data},
  author={Chen, Shan and Gallifant, Jack and Guevara, Marco and Gao, Yanjun and others},
  journal={arXiv preprint arXiv:XXXX.XXXX},
  year={2024}
}

Repository: AIM-Harvard/fake2real