AYA-synthetic-data

What does this repository represent?

This repository contains the research code and scripts used to investigate the role of sample size in synthetic data. The code was designed specifically to study how varying the input (training data) and output (generated synthetic data) sample size affects synthetic data veracity, privacy concealment, and utility. A more extensive description of the methodology is given in the associated scientific publication, currently available as a preprint on medRxiv: https://doi.org/10.1101/2024.03.04.24303526

Where have the contents of this repository been used and reported?

The role of sample size was investigated in a rare and heterogeneous healthcare demographic: adolescents and young adults with cancer. The findings of this investigation can be found in the associated scientific publication:

Can this code be re-used to investigate sample size effects in other demographics or datasets?

A large proportion of this code should be re-usable with another single-table dataset (i.e., not time-series or multi-table datasets), provided that the dataset is appropriately cleaned. However, certain components, such as data_preprocessing.py, evaluation_visualisation.py, and the utility assessment in evaluation_metric.py, were designed specifically for the aforementioned dataset and publication.
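To give a feel for what a sample size investigation on another single-table dataset could look like, the sketch below loops over a grid of input (training) and output (synthetic) sample sizes. It is not the code from this repository: the function names, the toy table, and its column names are hypothetical, and the generator is a simple placeholder that resamples training rows, where a real study would fit a synthesizer such as those listed further down.

```python
import random


def subsample(rows, n, rng):
    """Draw n rows without replacement to emulate a smaller training set."""
    return rng.sample(rows, n)


def placeholder_generator(train_rows, n_out, rng):
    """Hypothetical stand-in for a real synthesizer.

    It only resamples training rows with replacement; a real study would
    fit a generative model on train_rows and sample n_out rows from it.
    """
    return [dict(rng.choice(train_rows)) for _ in range(n_out)]


def sample_size_sweep(rows, input_sizes, output_sizes, seed=0):
    """Yield (n_in, n_out, synthetic_rows) for every size combination."""
    rng = random.Random(seed)
    for n_in in input_sizes:
        train = subsample(rows, n_in, rng)
        for n_out in output_sizes:
            yield n_in, n_out, placeholder_generator(train, n_out, rng)


# Toy single-table dataset; the column names are illustrative only.
table = [{"age": 15 + i % 25, "stage": i % 4} for i in range(200)]

results = list(
    sample_size_sweep(table, input_sizes=[50, 100, 200], output_sizes=[100, 400])
)
for n_in, n_out, synthetic in results:
    print(n_in, n_out, len(synthetic))
```

Each synthetic set produced this way would then be passed to the veracity, privacy, and utility evaluations, so that the metrics can be compared across the grid of sample sizes.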

How are the code and scripts in this repository to be used?

A worked example is provided in the example_exercise.ipynb Jupyter notebook. This example uses a public dataset on paediatric bone marrow transplantation developed by ... that is available through:

How was this work funded?

This work and the associated scientific publication were predominantly supported by the European Union’s Horizon 2020 research and innovation programme through The STRONG-AYA Initiative (Grant agreement ID: 101057482).

What are the main libraries that this research code relied on?

The synthetic data was generated using:

Synthetic Data Vault (SDV) (https://github.com/sdv-dev/SDV), and
Differentially Private - Conditional Generative Adversarial Networks (DP-CGAN) (https://github.com/sunchang0124/dp_cgans).

The evaluations were performed using:

prdc (https://github.com/clovaai/generative-evaluation-prdc),
scipy (https://github.com/scipy/scipy),
SDmetrics (https://github.com/sdv-dev/SDMetrics),
sklearn (https://github.com/scikit-learn/scikit-learn), and
statsmodels (https://github.com/statsmodels/statsmodels/).

Versions of all necessary libraries can be found in the requirements.txt file. Please note that the second branch, in which DP-CGAN was developed, requires slightly different versions of some libraries.
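Because the required versions differ between branches, it can help to compare the pinned versions with what is installed in the active environment. The sketch below uses only the Python standard library and assumes that pins are written as `name==version`, which may not match every line of the actual requirements.txt. The package names and versions in the example are made up for illustration.

```python
from importlib import metadata


def parse_pins(lines):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def compare_with_installed(pins):
    """Map each pinned package to (pinned version, installed version or None)."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (pinned, installed)
    return report


# Hypothetical pins for illustration; not taken from this repository.
example_lines = [
    "example-package==1.2.3  # made-up pin",
    "",
    "# a comment line",
    "another-made-up-pkg==0.4",
]
pins = parse_pins(example_lines)
report = compare_with_installed(pins)
for name, (pinned, installed) in report.items():
    print(name, "pinned:", pinned, "installed:", installed)
```

To check a real environment, the lines of requirements.txt can be read from disk (for example with `open("requirements.txt").read().splitlines()`) and passed to `parse_pins` in place of the example list.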

