## Using Optuna's Bayesian optimization to tune hyperparameters 

- Running the Bayesian optimization routine in an environment with access 
to CUDA and/or OpenMP is highly recommended, as each trial retrains the 
generator and hardware acceleration greatly shortens the entire process.
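
A quick way to see what the current environment offers is sketched below. It assumes that the generator runs on PyTorch (an assumption about the backend, not something this tutorial states), and the helper name `describe_accelerators` is made up for this example. If PyTorch is not installed, the helper simply reports that.

```python
import importlib.util
import os


def describe_accelerators() -> dict:
    """Report whether CUDA and OpenMP are visible to PyTorch (if installed)."""
    info = {
        "torch_installed": False,
        "cuda": False,
        "openmp": False,
        "cpu_count": os.cpu_count(),
    }
    # Only import torch when it is actually present in this environment.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    info["cuda"] = bool(torch.cuda.is_available())
    info["openmp"] = bool(torch.backends.openmp.is_available())
    return info


accelerators = describe_accelerators()
print(accelerators)
```

If both `cuda` and `openmp` come back `False`, the optimization will still run, only more slowly.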

In [1]:
from pysdg.synth.generate import Generator
from pysdg.synth.optimize import BayesianOptimizationRoutine

# Create a CTGAN generator and load the training data together with its metadata
gen = Generator(gen_name="synthcity/ctgan")
real = gen.load("./raw_train.csv", "./raw_info.json")




2025-03-23 00:30:07,508 - pysdg - INFO - 545893 - generate.py:100 - **************Started logging the generator: synthcity/ctgan, num_cores= None.**************
2025-03-23 00:30:07,513 - pysdg - INFO - 545893 - generate.py:298 - Checking the input metadata for any conflict in variable indexes - Passed.
2025-03-23 00:30:07,801 - pysdg - INFO - 545893 - generate.py:410 - The dataset ['tutorial_data'] is loaded into the generator synthcity_ctgan


For the sake of example, we'll use `pysdg`'s vulnerability metric, which is based on simulating a privacy attack using selected variables from the data (the quasi-identifiers).
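
To give a feel for what an attack on quasi-identifiers can look like, here is a toy sketch. It is not how `calc_vulnerability_utility` works internally; the helper `match_rate`, the column names, and the data are all invented for illustration. The idea is that if training records find a match in the synthetic data on their quasi-identifiers much more often than holdout records do, the synthetic data may be revealing who was in the training set.

```python
import pandas as pd


def match_rate(records: pd.DataFrame, synth: pd.DataFrame, quasi_identifiers: list) -> float:
    """Fraction of rows in `records` whose quasi-identifier values also occur in `synth`."""
    synth_keys = synth[quasi_identifiers].drop_duplicates()
    merged = records[quasi_identifiers].merge(
        synth_keys, on=quasi_identifiers, how="left", indicator=True
    )
    return float((merged["_merge"] == "both").mean())


# Made-up data: two quasi-identifiers, four records per dataset
train = pd.DataFrame({"age": [30, 41, 52, 63], "sex": ["F", "M", "F", "M"]})
holdout = pd.DataFrame({"age": [30, 47, 58, 69], "sex": ["F", "M", "F", "M"]})
synth = pd.DataFrame({"age": [30, 41, 52, 70], "sex": ["F", "M", "F", "F"]})

qi = ["age", "sex"]
train_rate = match_rate(train, synth, qi)
holdout_rate = match_rate(holdout, synth, qi)
print(train_rate, holdout_rate, train_rate - holdout_rate)
```

A large gap between the two rates is the kind of signal such a metric tries to quantify; the holdout set provides the baseline for records the generator never saw.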

In [3]:
import pandas as pd
from pysdg.privacy.vuln_utility import calc_vulnerability_utility

# for calculating vulnerability utility metrics, a holdout set is needed
df_holdout = pd.read_csv("./raw_holdout.csv")

def my_eval_function(gen, df_holdout):
    # Restore the original column names on the encoded real and synthetic data
    real_data = gen.restore_col_names(gen.enc_real)
    synth_data = gen.restore_col_names(gen.enc_synths[0])

    # Use the first two columns as the quasi-identifiers for this example
    quasi_vars = real_data.columns.to_list()[:2]

    val = calc_vulnerability_utility(
        df_train=real_data,
        df_holdout=df_holdout,
        df_synthetic=synth_data,
        quasi_identifiers=quasi_vars
    )
    return val

bayes_opt = BayesianOptimizationRoutine(
    gen=gen,
    eval_function=my_eval_function,
    holdout_df=df_holdout,
    objective="minimize",
    n_trials=1,  # a single trial, so the example finishes quickly
    study_name="mismatches_study",
    dump_csv=False,  # the CSV is only dumped at the end of the optimization
    dump_sqlite=False,  # the SQLite database is dumped after each trial
)

2025-03-23 00:30:13,883 - pysdg - INFO - 545893 - generate.py:849 - Started training using synthcity_ctgan...
[2025-03-23T00:30:14.137198-0400][545893][CRITICAL] module disabled: /home/vh/miniconda3/envs/pysdg/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
2025-03-23 00:30:14,761 - pysdg - INFO - 545893 - generate.py:853 - No of Iterations=50, Batch Size=256
INFO:pysdg:No of Iterations=50, Batch Size=256
100%|██████████| 50/50 [00:14<00:00,  3.41it/s]
2025-03-23 00:30:34,557 - pysdg - INFO - 545893 - generate.py:861 - Completed training using synthcity_ctgan.
INFO:pysdg:Completed training using synthcity_ctgan.
2025-03-23 00:30:34,649 - pysdg - INFO - 545893 - generate.py:886 - Generating synth no. 0 of size (5000, 12) -- Completed!
INFO:pysdg:Generating synth no. 0 of size (5000, 12) -- Completed!


Calculating membership disclosure risk


2025-03-23 00:31:12,117 - pysdg - INFO - 545893 - generate.py:849 - Started training using synthcity_ctgan...
INFO:pysdg:Started training using synthcity_ctgan...
[2025-03-23T00:31:12.118405-0400][545893][CRITICAL] module disabled: /home/vh/miniconda3/envs/pysdg/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
2025-03-23 00:31:12,120 - pysdg - INFO - 545893 - generate.py:853 - No of Iterations=50, Batch Size=256
INFO:pysdg:No of Iterations=50, Batch Size=256
100%|██████████| 50/50 [00:11<00:00,  4.34it/s]
2025-03-23 00:31:30,668 - pysdg - INFO - 545893 - generate.py:861 - Completed training using synthcity_ctgan.
INFO:pysdg:Completed training using synthcity_ctgan.


In [4]:
# Generate one synthetic dataset with the best generator found, then unload it
bayes_opt.best_gen.gen(num_rows=len(real), num_synths=1)
synths = bayes_opt.best_gen.unload()
synths[0]

2025-03-23 00:31:30,762 - pysdg - INFO - 545893 - generate.py:886 - Generating synth no. 0 of size (5000, 12) -- Completed!
INFO:pysdg:Generating synth no. 0 of size (5000, 12) -- Completed!


| | outc_cod_0 | event_dt | wt | wt_cod | age | age_cod | drugname_0 | indi_pt_0 | sex |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DE | 2018-08-11 | | | 74 | | MYCOPHENOLIC ACID. | | M |
| 1 | | NaT | | | | | Pharmorubicin | | |
| 2 | | NaT | | | 94 | YR | TRUVADA | Ocular sarcoidosis | |
| 3 | OT | NaT | 74.065438 | | | YR | AZOR | | |
| 4 | | NaT | | | | | DEPO-PROVERA | Idiopathic urticaria | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | | NaT | | | 76 | | ALEVE | | |
| 4996 | | NaT | 74.087348 | | 96 | YR | FOLLISTIM | | |
| 4997 | | NaT | | KG | | | ACETAMINOPHEN\OXYCODONE HYDROCHLORIDE | Migraine prophylaxis | |
| 4998 | LT | NaT | | KG | | | ELEXACAFTOR\IVACAFTOR\TEZACAFTOR | | |


In [5]:
bayes_opt.get_optimization_results()

| | number | value | datetime_start | datetime_complete | duration | params_adjust_inference_sampling | params_batch_size | params_clipping_value | params_discriminator_dropout | params_discriminator_n_layers_hidden | ... | params_encoder_max_clusters | params_generator_dropout | params_generator_n_layers_hidden | params_generator_n_units_hidden | params_generator_nonlin | params_lr | params_n_iter | params_weight_decay | user_attrs_my_eval_function | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.503802 | 2025-03-23 00:30:13.882440 | 2025-03-23 00:31:12.099704 | 0 days 00:00:58.217264 | False | 256 | -1 | 0.271915 | 6 | ... | 27 | 0.218612 | 3 | 32 | selu | 0.0001 | 50 | 0.006598 | 0.503802 | COMPLETE |


If you don't need a holdout set and just want to optimize a metric that takes only the original data and the synthetic data, you can follow the template below.
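
Before the template, a toy illustration of the cell-wise mismatch count it relies on may help. The two small frames below are made up. The sketch assumes both frames share the same shape, index, and column labels, which pandas requires for an element-wise `!=` comparison.

```python
import numpy as np
import pandas as pd

# Made-up "real" and "synthetic" frames with identical shape and labels
real = pd.DataFrame({"a": [1, 2, 3], "b": [0.0, 1.0, np.nan]})
synth = pd.DataFrame({"a": [1, 5, 3], "b": [1.0, 1.0, np.nan]})

# True wherever the two frames disagree; NaN never equals NaN,
# so a pair of missing values also counts as a mismatch
diff = real != synth
n_mismatches = int(diff.sum().sum())
print(n_mismatches)
```

Because missing values always count as mismatches under this comparison, a metric of this kind is sensitive to how the encoded data represents them.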

In [6]:
def my_eval_function(gen: Generator):
    # For simplicity, assume a single synthetic dataset and compare the encoded datasets
    real_data = gen.enc_real
    synth_data = gen.enc_synths[0]
    # Count the cells where the real and synthetic values differ
    n_mismatches = (real_data != synth_data).sum().sum()
    return n_mismatches

bayes_opt = BayesianOptimizationRoutine(
    gen=gen,
    eval_function=my_eval_function,
    objective="minimize",
    n_trials=1,  # a single trial, so the example finishes quickly
    study_name="mismatches_study",
    dump_csv=False,  # the CSV is only dumped at the end of the optimization
    dump_sqlite=False,  # the SQLite database is dumped after each trial
)

2025-03-23 00:31:43,844 - pysdg - INFO - 545893 - generate.py:849 - Started training using synthcity_ctgan...
INFO:pysdg:Started training using synthcity_ctgan...
[2025-03-23T00:31:43.845525-0400][545893][CRITICAL] module disabled: /home/vh/miniconda3/envs/pysdg/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
2025-03-23 00:31:43,847 - pysdg - INFO - 545893 - generate.py:853 - No of Iterations=25, Batch Size=128
INFO:pysdg:No of Iterations=25, Batch Size=128
100%|██████████| 25/25 [00:57<00:00,  2.30s/it]
2025-03-23 00:32:51,447 - pysdg - INFO - 545893 - generate.py:861 - Completed training using synthcity_ctgan.
INFO:pysdg:Completed training using synthcity_ctgan.
2025-03-23 00:32:51,634 - pysdg - INFO - 545893 - generate.py:886 - Generating synth no. 0 of size (5000, 12) -- Completed!
INFO:pysdg:Generating synth no. 0 of size (5000, 12) -- Completed!
2025-03-23 00:32:51,661 - pysdg - INFO - 545893 - generate.py:849 - Started training using synthcity_ctgan...
IN