## Investigating the role of sample size in synthetic data

This file contains an example exercise similar to the application in the scientific publication at:

 
This example was produced using a publicly available dataset on paediatric bone marrow transplantation developed by:

* Marek Sikora(1,2) (marek.sikora@polsl.pl), 
* Lukasz Wrobel(1) (lukasz.wrobel@polsl.pl),  

(1) Institute of Computer Science, Silesian University of Technology, 44-100 Gliwice, Poland 
(2) Institute of Innovative Technologies EMAG, 40-189 Katowice, Poland

    
    
***To run this example successfully, make sure that you have installed the required libraries listed in requirements.txt.***

***We recommend that this is done in a separate virtual environment.***

***Note that this environment should be available to the Jupyter server; if it is not, you can extract the code snippets and run them directly in a separate virtual environment.***

***To install the requirements from the .txt file, use "pip install -r requirements.txt" in your virtual environment.***
  
  

To start off, let's establish a specific directory that we can work in.

In [None]:
import os

# set the working directory to a folder of your desire, here we will use the example folder that is located in the 
working_directory = os.path.join(os.getcwd(), "example_data")

### Pre-processing

First we compose a settings file, which contains items relating to the various steps in the workflow.
These items range from specifying which pre-processing steps to take to defining a formula for logistic regression analysis.

An aspect that should not be overlooked is the set of 'datatype'_corrections lists, which allow you to formulate a metadata file that the synthetic data generation models can use.
For example, because biological sex is a dichotomous variable, it should be included in boolean_corrections.

*When attempting to use the code in this repository for another dataset this might require some trial and error.*
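
One way to reduce that trial and error is to inspect the data before filling in the lists. The sketch below is not part of this repository: `suggest_boolean_variables` is a hypothetical helper that assumes a comma-separated file with a header row and the same missing-value marker as in the settings (`?`). It lists the columns that hold exactly two distinct non-missing values, which are candidates for boolean_corrections.

```python
import csv
from collections import defaultdict


def suggest_boolean_variables(csv_path, missing_value="?"):
    """Return the columns that hold exactly two distinct non-missing values."""
    distinct_values = defaultdict(set)
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for column, value in row.items():
                # skip surplus fields (key None) and short rows (value None)
                if column is None or value is None:
                    continue
                if value != missing_value and value != "":
                    distinct_values[column].add(value)
    return sorted(column for column, values in distinct_values.items() if len(values) == 2)
```

The suggestions still need a manual check, since a variable with two observed values is not necessarily dichotomous by design.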

In [None]:
settings = {
    "Pre-processing": {
        "General_information": {
            "format": "single_table",
            "identifier": "",
            "codebook_variable_column": 0,
            "missing_value": "?",
            "boolean_triggers": [],
            "boolean_corrections": [
                "Recipientgender",
                "Stemcellsource",
                "Donorage35",
                "IIIV",
                "Gendermatch",
                "RecipientRh",
                "ABOmatch",
                "DonorCMV",
                "RecipientCMV",
                "Riskgroup",
                "Txpostrelapse",
                "Diseasegroup",
                "HLAmismatch",
                "Recipientage10",
                "Relapse",
                "aGvHDIIIIV",
                "extcGvHD",
                "survival_status"
            ],
            "categorical_variables": [],
            "categorical_corrections": [
                "DonorABO",
                "RecipientABO",
                "CMVstatus",
                "Disease",
                "HLAmatch",
                "Antigen",
                "Alel",
                "HLAgrI",
                "Recipientageint"
            ],
            "float_triggers": [],
            "float_corrections": ["CD34kgx10d6", "CD3dCD34", "CD3dkgx10d8", "Rbodymass"],
            "integer_triggers": [],
            "integer_corrections": [
                "Donorage",
                "Recipientage",
                "ANCrecovery",
                "PLTrecovery",
                "time_to_aGvHD_III_IV",
                "survival_time"
            ],
            "string_triggers": [],
            "string_corrections": []
        },
        "Settings": {
            "harmonise_booleans": "False",
            "find_codebook_discrepancies": "True",
            "recode_categories": [
                "Disease"
            ],
            "remove_free_text": "False",
            "remove_identification": "False",
            "remove_in_column_string": {},
            "remove_high_percentage_missing": [],
            "variables_to_keep": []
        }
    },
    "Evaluation": {
        "formula_logistic_regression": "",
        "variables_to_plot": [],
        "variables_nickname": {},
        "known_variables": ['Recipientgender', 'Stemcellsource'],
        "sensitive_variables": ['Donorage'],
        "graph_file_extension": ".eps"
    }
}
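
Before saving, it can help to check that no variable was entered in more than one of the corrections lists. The sketch below assumes that each variable should have a single datatype, which this example does not state explicitly; `find_duplicate_corrections` is a hypothetical helper and not a function from this repository.

```python
from collections import Counter


def find_duplicate_corrections(settings):
    """Return variables that appear in more than one *_corrections list."""
    general = settings["Pre-processing"]["General_information"]
    counts = Counter(
        variable
        for key, variables in general.items()
        if key.endswith("_corrections")
        for variable in variables
    )
    return sorted(variable for variable, count in counts.items() if count > 1)
```

Calling `find_duplicate_corrections(settings)` on the dictionary above should return an empty list when every variable is listed once.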

We then save the settings file so that it can be re-used in subsequent analyses. 

In case you already have a settings file that is appropriately formatted, it is not necessary to save it again.

In [None]:
import os

from src.file_handling import save_json

# save the file as name_settings.json, as this is the format that will be sought later in the workflow; the name (bmt) refers to bone marrow transplantation
settings_name = "bmt_settings.json"

# the settings file can be found at this path later in the workflow
settings_path = os.path.join(working_directory, settings_name)
save_json(data=settings, filename=settings_path)
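
To confirm that the settings were written as intended, the file can be read back with Python's standard `json` module. This is a minimal sketch that assumes `save_json` writes plain JSON; `load_settings` is a hypothetical helper, not a function from this repository.

```python
import json


def load_settings(path):
    """Read a settings file back into a dictionary (assumes plain JSON)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Under that assumption, `load_settings(settings_path) == settings` should evaluate to `True` after the cell above has run.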

With the settings saved, we can now start pre-processing the data. This is done by providing the path to the dataset and the path to the settings file to the pre-processing component.

In [None]:
from src.data_pre_processing import UnProcessedData

# formulate the path to the dataset; in this example the dataset is called bmt.csv and is located in the working directory
dataset_path = os.path.join(working_directory, "bmt.csv")

# provide the path to the dataset and the previously defined path to the settings to the pre-processing component; we provide an empty string for the codebook, as this is not available here
pre_processed_data = UnProcessedData(data_path=dataset_path, codebook_path="", settings_path=settings_path)

# clean the data, and save it as name_clean.csv; this will be saved in the same folder as your dataset path
pre_processed_data.clean_data(save=True, filename_addition='_clean')

# formulate and save a Synthetic Data Vault metadata file; this will be saved in the same folder as your dataset path
pre_processed_data.format_metadata(save=True)

In your working directory you should now have the following files:
 * the original CSV dataset,
 * a clean CSV dataset,
 * a JSON file containing the SDV metadata, and
 * a JSON file containing the settings.
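
The presence of these files can also be checked programmatically. The sketch below uses a hypothetical helper, `missing_files`, that is not part of this repository; it only reports which of the given file names are absent from a directory.

```python
import os


def missing_files(directory, expected_names):
    """Return the expected file names that are not present in the directory."""
    return [
        name
        for name in expected_names
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

For this example, `missing_files(working_directory, ["bmt.csv", "bmt_clean.csv", "bmt_settings.json"])` should return an empty list. The name of the metadata file is not stated here, so it is left out of the check.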

In case you've been using the provided example, your example_data folder should have the following files:

![Image of files in your working directory](assests_for_jupyter_example/files_in_your_working_directory_pre.png)

## Synthetic data generation and evaluation


### Evaluation of output sample size

Due to the nature of the variation in sample sizes (i.e., the variation in training and synthetic data sample size), the process of modelling a dataset, generating synthetic data, and the evaluation thereof, has been bundled in a single pipeline.
*However, the functions are generally independently callable if this is desired.*

For the sake of completeness, what will happen is the following:
* the dataset is modelled using a generative model,
* this generative model then produces synthetic datasets with sizes as specified,
* these synthetic datasets are then compared to the dataset that was used to train the generative model.
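
The three steps above can be illustrated with a toy stand-in. The sketch below is not the repository's pipeline: it "models" a single numeric variable with a normal distribution instead of a generative model, and it compares means instead of computing the metrics used in this example. The function name and its parameters are hypothetical.

```python
import random
import statistics


def toy_output_size_experiment(training_values, sample_sizes, seed=0):
    """Fit a toy model once, then sample and compare for each output size."""
    rng = random.Random(seed)
    # step 1: "model" the data (toy stand-in for a generative model)
    mean = statistics.fmean(training_values)
    spread = statistics.stdev(training_values)
    results = {}
    for size in sample_sizes:
        # step 2: generate a synthetic dataset of the requested size
        synthetic = [rng.gauss(mean, spread) for _ in range(size)]
        # step 3: compare the synthetic data to the training data
        results[size] = abs(statistics.fmean(synthetic) - mean)
    return results
```

The returned dictionary maps each output sample size to the absolute difference between the synthetic and the training mean, which mirrors how the pipeline reports one evaluation per synthetic dataset size.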

In [None]:
from src.evaluation_general import DataEvaluation

# specify the model that you wish to evaluate; note that only single table models are supported in our pipeline, which are: FAST_ML, GaussianCopula, CopulaGan, CTGAN, and TVAE; DP-CGAN is available in its specific branch
# for this example we will use DP-CGAN, as this is the DP-CGAN specific branch
generative_model = "DP-CGAN"

# we can also define the range that we produce synthetic data for, including the interval (i.e., the output sample size)
smallest_sample_size = 50
largest_sample_size = 1000
sample_size_interval = 100

# we now only have to specify the working directory, given that all files are present as described above
evaluator = DataEvaluation(directory=working_directory)

# now we specify the evaluation, there are multiple options available, for this example we will only evaluate output sample size; we include a variable default model to ensure that it uses default (hyper-)parameters
evaluator.generator_evaluate_n_output(start=smallest_sample_size, stop=largest_sample_size, step=sample_size_interval,
                                      models_to_evaluate=[generative_model], default_model=True, save_data=True,
                                      save_model=True)

# we remove the object again for the purpose of this example
del evaluator

In your working directory, you should now have the following new directories and files:
* evaluation
    * dump
    * name_clean_n_output.csv
    * name_clean_n_output_data_quality.eps
* models
    * name_clean_n_output_model_model_name.pk1 
* synthetic
    * name_clean_model_name_sample_size_evaluation_n_output_file_id.csv

In case you’ve been using the provided example, your example_data folder should have the following files *with the exception of the specific file names of the generated data*:
![Image of files in your working directory](assests_for_jupyter_example/files_in_your_working_directory_post.png)

Additionally, a figure has been produced by this process and is displayed in your Jupyter notebook; *in case it is not visible you should be able to find it in the evaluation folder*.
This figure represents a trade-off between:
* veracity in the horizontal lines (in this case through the precision, recall, density, and coverage metrics) and 
* privacy concealment in the vertical lines (in this case through the identity disclosure metric).

In case you've been using the provided example, your figure should more or less look as follows: 
*please do note that the results will never exactly be the same due to the re-modelling of data*
![Image produced by evaluating the synthetic data sample size with DP-CGAN](assests_for_jupyter_example/bmt_clean_n_output_data_quality_DP-CGAN.png)

This concludes the example of evaluating output sample size. To finish up, we can move the results of our evaluation to a directory specific to the generative model.

In [None]:
import os
import shutil

# create the model-specific directory first, then move each results folder into it
model_directory = os.path.join(working_directory, generative_model)
os.makedirs(model_directory, exist_ok=True)
for folder in ("evaluation", "models", "synthetic"):
    shutil.move(os.path.join(working_directory, folder), os.path.join(model_directory, folder))

### Evaluation of input and output sample size

Due to the nature of the variation in sample sizes (i.e., the variation in training and synthetic data sample size), the process of modelling a dataset, generating synthetic data, and the evaluation thereof, has been bundled in a single pipeline.
*However, the functions are generally independently callable if this is desired.*

For the sake of completeness, what will happen in this scenario is the following:
* the entire dataset is modelled using a generative model,
* this generative model then produces synthetic datasets with sizes as specified,
* these synthetic datasets are then compared to the dataset that was used to train the generative model,
* the entire dataset is sub-sampled, this sub-sample is then modelled using a new generative model,
* this new generative model then produces synthetic datasets with sizes as specified,
* these synthetic datasets are then compared to the dataset that was used to train the generative model (in this case thus a sub-sample),
* the entire dataset is sub-sampled again, et cetera...
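
The scenario above amounts to evaluating a grid of input and output sample sizes. The sketch below shows one way such a grid could be enumerated. It rests on assumptions that are not confirmed here: that input sizes decrease from the full dataset size down to the smallest input size in fixed steps, and that output sizes follow the semantics of Python's `range` (the stop value itself is excluded). `input_output_grid` is a hypothetical helper, not a function from this repository.

```python
def input_output_grid(n_rows, input_stop, input_step, output_start, output_stop, output_step):
    """Enumerate (input size, output size) pairs under the stated assumptions."""
    # assumption: start at the full dataset and shrink until input_stop is reached
    input_sizes = list(range(n_rows, input_stop - 1, -input_step))
    # assumption: output sizes behave like range(start, stop, step)
    output_sizes = list(range(output_start, output_stop, output_step))
    return [(n_in, n_out) for n_in in input_sizes for n_out in output_sizes]
```

With a hypothetical dataset of 200 rows, `input_output_grid(200, 80, 50, 50, 1000, 100)` yields three input sizes (200, 150, 100) combined with ten output sizes, so thirty evaluations in total. This illustrates why avoiding saving data in this scenario limits disk usage.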

In [None]:
from src.evaluation_general import DataEvaluation

# specify the model that you wish to evaluate; note that only single table models are supported in our pipeline, which are: FAST_ML, GaussianCopula, CopulaGan, CTGAN, and TVAE; DP-CGAN is available in its specific branch
# for this example we will use DP-CGAN, as this is the DP-CGAN specific branch
generative_model = "DP-CGAN"

# additionally we can define the smallest sample that we wish to draw from the entire dataset (i.e., the input/training sample size)
smallest_input_sample_size = 80
sample_size_interval = 50

# we can also define the range that we produce synthetic data for, including the interval (i.e., the output sample size)
smallest_output_sample_size = 50
largest_output_sample_size = 1000
sample_output_size_interval = 100

# we now only have to specify the working directory, given that all files are present as described above
evaluator = DataEvaluation(directory=working_directory)

# now we specify the evaluation, there are multiple options available, for this example we will evaluate both input and output sample size; we include a variable default model to ensure that it uses default (hyper-)parameters
# we're avoiding saving any data here, as this can quickly consume a lot of disk space
evaluator.generator_evaluate_n_input_random(stop=smallest_input_sample_size, step=sample_size_interval,
                                            output_start=smallest_output_sample_size,
                                            output_stop=largest_output_sample_size,
                                            output_step=sample_output_size_interval,
                                            models_to_evaluate=[generative_model], default_model=True)

# we remove the object again for the purpose of this example
del evaluator

In your working directory, you should now have the following new directories and files:
* evaluation
    * dump
    * name_clean_n_input_sample_size_n_output.csv
    * name_clean_n_input_sample_size_n_output.eps
* models
* synthetic

In case you’ve been using the provided example, your example_data folder should have the following files *with the exception of the specific file names of the generated data*:
![Image of files in your working directory when performing input evaluation](assests_for_jupyter_example/files_in_your_working_directory_input_evaluation.png)

This concludes the example exercise for this research code. The results stored in the CSV files can be used to produce figures similar to those in the aforementioned scientific publication. Interpreting the results of this exercise is beyond the scope of this document.