# Tutorial: Basic usage

**This tutorial demonstrates the core functions of pysdg.** It assumes that pysdg is already installed in a Conda environment, the environment has been activated from the shell, and this notebook is being run within that activated environment. For detailed instructions, please refer to the "pysdg" documentation.

The following cell sets the working directory to the location of this notebook. It is assumed that all files accessed by this notebook are stored in the same directory.

In [1]:
import os
from pathlib import Path
current_dir = Path().resolve()
os.chdir(current_dir)

The core functions in pysdg are loading, training, generating and unloading.

First, we import the necessary packages and configure the display of pandas data frames and Python dictionaries. The last line below imports the `Generator` class from the `pysdg.synth.generate` module.

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_colwidth', 10)  
pd.set_option('display.width', 1000)  

import json 
from IPython.display import JSON

from pysdg.synth.generate import Generator 



Two files must be loaded into pysdg: the `raw` tabular dataset in CSV format and the corresponding data `info` file in JSON format. The JSON file must be created manually and has to include several mandatory keys. Below are the paths to both files:

In [3]:
raw_data_path='raw_data.csv'
raw_info_path='raw_info.json'

First, let us take a look at the first few rows of the `raw data`.

In [4]:
raw_data=pd.read_csv(raw_data_path)
raw_data.head(10)

Unnamed: 0,outc_cod_0,event_dt,wt,wt_cod,age,age_cod,drugname_0,indi_pt_0,sex
0,,NaT,,,,,ZANTAC,,
1,DE,201506,,,18.0,YR,OXYCONTIN,Drug abuse,M
2,OT,201907,,,,,LEMTRADA,,
3,OT,20190917,,,46.0,YR,COSENTYX,Psoriatic arthropathy,M
4,DE,20161201,110.0,KG,73.0,YR,ENTRESTO,Cardiac failure,M
5,OT,,95.0,KG,33.0,YR,Champix,Smoking cessation therapy,M
6,NAN,,,,,,CEFTRIAXONE SODIUM,,
7,,,86.0,KG,74.0,YR,LYRICA,Nerve injury,F
8,,,,,,,XELJANZ,,
9,,20190907,,,57.0,YR,COSENTYX,Ankylosing spondylitis,M


Let us also take a look at the data types as inferred by pandas with its default settings. These data types can vary depending on the library used to read the CSV file.

In [5]:
raw_data.dtypes

outc_cod_0     object
event_dt       object
wt            float64
wt_cod         object
age           float64
age_cod        object
drugname_0     object
indi_pt_0      object
sex            object
dtype: object
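
To illustrate how the inferred types depend on the reader's settings, the following minimal sketch reads the same small CSV text twice with pandas. The sample text and the variable names are made up for illustration and are not part of the tutorial data.

```python
import io

import pandas as pd

# Hypothetical three-row sample; not taken from raw_data.csv.
sample_csv = "age,wt_cod\n18,KG\n,NA\n73,KG\n"

# Default settings: pandas infers types and treats "" and "NA" as missing.
inferred = pd.read_csv(io.StringIO(sample_csv))

# Everything kept as text: no type inference and no missing-value detection.
as_text = pd.read_csv(io.StringIO(sample_csv), dtype=str, keep_default_na=False)

print(inferred.dtypes)
print(as_text)
```

In the first frame the `age` column becomes a float column because of the gap, while in the second frame every cell, including the empty one and the `NA` token, is kept as text.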

We can see that the `raw data` above includes several representations of missing values, such as `NA`, `NAN`, `NaT` and `<NA>`. We need to list these representations in the metadata JSON file.
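
One way to find out which missing-value tokens actually occur in a CSV file is to read it as plain text and count a list of candidates. The following is a minimal sketch with pandas on a made-up sample; the sample text and the candidate list are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical sample mimicking mixed missing-value tokens.
sample_csv = "outc_cod_0,event_dt\nDE,201506\nNAN,NaT\n,20190917\nNA,\n"

# Read everything as text so that no token is silently converted to NaN.
text_df = pd.read_csv(io.StringIO(sample_csv), dtype=str, keep_default_na=False)

# Candidate tokens to look for; extend this list as needed.
candidates = ["", "NA", "NAN", "NaT", "<NA>", "nan"]

# Count how often each candidate occurs anywhere in the table.
counts = {tok: int((text_df == tok).sum().sum()) for tok in candidates}
found = [tok for tok, n in counts.items() if n > 0]
print(found)
```

The tokens collected in `found` are the ones that would need to be declared as missing values for this sample.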

We also need to define the data types of all the variables to eliminate the dependency on the library used to read the CSV file. To simplify things, pysdg identifies four basic data types: categorical (`cat`), continuous (`cnt`), discrete (`dscrt`) and datetime (`datetime`). Please note that **a categorical variable can hold either numeric or alphabetic values**. In the JSON file, we list the indexes of the variables under the appropriate data type. Let us take a look at the JSON file that we created earlier for the purpose of this tutorial.

In [6]:
with open(raw_info_path,"r") as f:
    raw_info=json.load(f)
    
JSON(raw_info)

<IPython.core.display.JSON object>

As you can see above, the dataset is given the name `tutorial_data`; this can be any name. For the time being, we assign empty lists to the keys `nct_nos`, `id_idx` and `quasi_idxs`, and we keep `h0_value` set to a single-element list containing zero. Let us focus on the remaining keys. The list under the key `cat_idxs` holds the indexes of the variables that the user defines as categorical. For instance, the first variable (index 0) in the `raw data`, namely `outc_cod_0`, is defined as categorical, while the third variable (index 2), namely `wt`, is defined as continuous.

We also note that all the missing-value representations that occur in the data are listed under `miss_vals`. Listing representations that do not occur in the `raw data` is allowed and has no effect. It is always advisable to include `nan`, `NA` and `''`.
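
To make the structure more concrete, the following sketch builds an info dictionary in Python and checks that no variable index is listed under two data types. It is illustrative only: the keys `cat_idxs`, `miss_vals`, `nct_nos`, `id_idx`, `quasi_idxs` and `h0_value` are the ones named in this tutorial, whereas `ds_name`, `cnt_idxs`, `dscrt_idxs` and `datetime_idxs` are hypothetical names, and the helper `find_index_conflicts` is not part of pysdg. Please consult the pysdg documentation for the actual schema.

```python
import json

# Illustrative structure only. Keys marked "hypothetical" are guesses at the
# naming; consult the pysdg documentation for the real schema.
info_sketch = {
    "ds_name": "tutorial_data",      # hypothetical key for the dataset name
    "cat_idxs": [0, 3, 5, 6, 7, 8],  # key named in this tutorial
    "cnt_idxs": [2],                 # hypothetical key: continuous variables
    "dscrt_idxs": [4],               # hypothetical key: discrete variables
    "datetime_idxs": [1],            # hypothetical key: datetime variables
    "miss_vals": ["", "NA", "NAN", "NaT", "<NA>", "nan"],
    "nct_nos": [],
    "id_idx": [],
    "quasi_idxs": [],
    "h0_value": [0],
}


def find_index_conflicts(info, type_keys):
    """Return the indexes that are listed under more than one type key."""
    seen, conflicts = set(), set()
    for key in type_keys:
        for idx in info.get(key, []):
            if idx in seen:
                conflicts.add(idx)
            seen.add(idx)
    return sorted(conflicts)


type_keys = ["cat_idxs", "cnt_idxs", "dscrt_idxs", "datetime_idxs"]
conflicts = find_index_conflicts(info_sketch, type_keys)
print(conflicts)
print(json.dumps(info_sketch, indent=2))
```

An empty `conflicts` list means that every variable index is assigned to exactly one data type in this sketch.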

Before loading the CSV file and its corresponding JSON file, we need to create a generator object. We pass the name of the desired generator as an argument. You can refer to the `pysdg` documentation for the list of available generator names. In this tutorial, we will use the [Bayesian network generator from Synthcity](https://synthcity.readthedocs.io/en/latest/generated/synthcity.plugins.generic.plugin_bayesian_network.html), which is passed below as `synthcity/bayesian_network` and appears in later log messages as `synthcity_bayesian_network`.

In [7]:
gen=Generator("synthcity/bayesian_network")

2025-03-10 14:10:03,500 - pysdg - INFO - 2985237 - generate.py:88 - **************Started logging the generator: synthcity/bayesian_network, num_cores= None.**************


Now we pass both the `raw data` path and the path of its user-defined `raw info` file to the `load` method. In return, we get back a clean `real` dataset that we can use in our downstream analysis.

In [8]:
real=gen.load(raw_data_path, raw_info_path)

2025-03-10 14:10:03,583 - pysdg - INFO - 2985237 - generate.py:250 - Checking the input metadata for any conflict in variable indexes - Passed.
2025-03-10 14:10:05,956 - pysdg - INFO - 2985237 - generate.py:318 - The dataset ['tutorial_data'] is loaded into the generator synthcity_bayesian_network


The `load` method also accepts the raw data frame object instead of the raw data path, e.g.

In [9]:
real=gen.load(raw_data, raw_info_path)

2025-03-10 14:10:06,017 - pysdg - INFO - 2985237 - generate.py:250 - Checking the input metadata for any conflict in variable indexes - Passed.
2025-03-10 14:10:08,300 - pysdg - INFO - 2985237 - generate.py:318 - The dataset ['tutorial_data'] is loaded into the generator synthcity_bayesian_network


The clean `real data` enforces the data types defined in the input `raw info` JSON file. Let us take a look at them and compare them with the data types in the `raw data`. You can see below that all data types match what was defined in the `raw info` JSON file.

In [10]:
real.dtypes

outc_cod_0          category
event_dt      datetime64[ns]
wt                   float64
wt_cod              category
age                    Int64
age_cod             category
drugname_0          category
indi_pt_0           category
sex                 category
dtype: object
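
For readers who want to see how such data types can be enforced in plain pandas, here is a minimal sketch. It is not pysdg's implementation; the toy frame, the date format and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw-like frame in which every column starts out as text or float.
raw_like = pd.DataFrame(
    {
        "sex": ["M", None, "F"],
        "event_dt": ["20161201", None, "not a date"],
        "age": [73.0, None, 18.0],
    }
)

typed = pd.DataFrame(
    {
        # Categorical: missing entries stay missing inside the category column.
        "sex": raw_like["sex"].astype("category"),
        # Datetime: missing and unparseable entries are coerced to NaT.
        "event_dt": pd.to_datetime(
            raw_like["event_dt"], format="%Y%m%d", errors="coerce"
        ),
        # Discrete: the nullable integer type keeps missing entries as <NA>.
        "age": raw_like["age"].astype("Int64"),
    }
)
print(typed.dtypes)
```

Each column now carries a missing-value marker that matches its own data type.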

Moreover, every variable in the `real` data holds missing-value representations that conform to its data type. Let us take a look at the first rows of the `real` data as compared to the `raw data`. Note that if the `real` data is saved to a CSV file, all missing values will share a unified representation.

In [11]:
real.head(10)

Unnamed: 0,outc_cod_0,event_dt,wt,wt_cod,age,age_cod,drugname_0,indi_pt_0,sex
0,,NaT,,,,,ZANTAC,,
1,DE,NaT,,,18.0,YR,OXYCONTIN,Drug abuse,M
2,OT,NaT,,,,,LEMTRADA,,
3,OT,2019-09-17,,,46.0,YR,COSENTYX,Psoriatic arthropathy,M
4,DE,2016-12-01,110.0,KG,73.0,YR,ENTRESTO,Cardiac failure,M
5,OT,NaT,95.0,KG,33.0,YR,Champix,Smoking cessation therapy,M
6,,NaT,,,,,CEFTRIAXONE SODIUM,,
7,,NaT,86.0,KG,74.0,YR,LYRICA,Nerve injury,F
8,,NaT,,,,,XELJANZ,,
9,,2019-09-07,,,57.0,YR,COSENTYX,Ankylosing spondylitis,M
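
The remark about saving to CSV can be illustrated with plain pandas. The sketch below uses a made-up typed frame, not the tutorial data, and relies on the default behavior of `DataFrame.to_csv`, which writes every missing value with the same `na_rep` token (an empty string by default).

```python
import pandas as pd

# Hypothetical typed frame with one missing value per column.
typed = pd.DataFrame(
    {
        "sex": pd.Categorical(["M", None]),
        "event_dt": pd.to_datetime(["2019-09-17", None]),
        "age": pd.array([46, None], dtype="Int64"),
    }
)

# NaN, NaT and <NA> are all written with the same na_rep token.
csv_text = typed.to_csv(index=False)
print(csv_text)
```

Passing a different `na_rep` value to `to_csv` changes the token for all columns at once.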


We can further explore what happens with the input `raw info` file. Let us retrieve the `info` from our `gen` object. As you see below, the variable indexes are converted into variable names.

In [12]:
JSON(gen.real_info)

<IPython.core.display.JSON object>

The `load` method also encodes the `real` data so that it can be used for training the desired generator. Let us take a look at the `encoded real` data frame.

In [13]:
gen.enc_real.head(10)

Unnamed: 0,outc_cod_0%%%0,event_dt%%%1,wt%%%2,wt_cod%%%3,age%%%4,age_cod%%%5,drugname_0%%%6,indi_pt_0%%%7,sex%%%8,wt%%%2_missing,age%%%4_missing,event_dt%%%1_missing
0,%%MISS-0%%,19576.157499,73.731432,%%MISS-0%%,65.0,%%MISS-0%%,ZANTAC,%%MISS-0%%,%%MISS-0%%,True,True,True
1,DE,19576.157499,73.731432,%%MISS-0%%,18.0,YR,OXYCONTIN,Drug abuse,M,True,False,True
2,OT,19576.157499,73.731432,%%MISS-0%%,65.0,%%MISS-0%%,LEMTRADA,%%MISS-0%%,%%MISS-0%%,True,True,True
3,OT,19982.0,73.731432,%%MISS-0%%,46.0,YR,COSENTYX,Psoriatic arthropathy,M,True,False,False
4,DE,18962.0,110.0,KG,73.0,YR,ENTRESTO,Cardiac failure,M,False,False,False
5,OT,19576.157499,95.0,KG,33.0,YR,Champix,Smoking cessation therapy,M,False,False,True
6,%%MISS-1%%,19576.157499,73.731432,%%MISS-0%%,65.0,%%MISS-0%%,CEFTRIAXONE SODIUM,%%MISS-0%%,%%MISS-0%%,True,True,True
7,%%MISS-0%%,19576.157499,86.0,KG,74.0,YR,LYRICA,Nerve injury,F,False,False,True
8,%%MISS-0%%,19576.157499,73.731432,%%MISS-0%%,65.0,%%MISS-0%%,XELJANZ,%%MISS-0%%,%%MISS-0%%,True,True,True
9,%%MISS-0%%,19972.0,73.731432,%%MISS-0%%,57.0,YR,COSENTYX,Ankylosing spondylitis,M,True,False,False
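
The encoded frame above shows numeric columns without gaps together with extra `_missing` indicator columns. This tutorial does not spell out the encoding rules, so the following is only a generic sketch of imputation with a missing-value indicator in pandas. The helper name, the toy data and the choice of the mean as fill value are assumptions, not pysdg's implementation.

```python
import pandas as pd


def impute_with_indicator(df, column):
    """Fill gaps in a numeric column with its mean and flag the filled rows."""
    out = df.copy()
    out[f"{column}_missing"] = out[column].isna()
    out[column] = out[column].fillna(out[column].mean())
    return out


# Hypothetical weights with two gaps.
toy = pd.DataFrame({"wt": [110.0, None, 95.0, None]})
encoded = impute_with_indicator(toy, "wt")
print(encoded)
```

Keeping the indicator column next to the filled values preserves the information about which entries were originally missing.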


After loading the data, we can start training the desired generator. 

In [14]:
gen.train()

[2025-03-10T14:10:11.374224-0400][2985237][CRITICAL] module disabled: /share/personal/skababji/conda_envs/pysdg_dev/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
2025-03-10 14:10:21,127 - pysdg - INFO - 2985237 - generate.py:703 - Started training using synthcity_bayesian_network...
INFO:pysdg:Started training using synthcity_bayesian_network...
2025-03-10 14:11:53,705 - pysdg - INFO - 2985237 - generate.py:708 - Completed training using synthcity_bayesian_network.
INFO:pysdg:Completed training using synthcity_bayesian_network.


Once trained, the model can be used to generate the required number of records and synthetic datasets. In the code line below, we generate two synthetic datasets, each with the same number of records as the real dataset.

In [15]:
gen.gen(num_rows=len(real), num_synths=2)

2025-03-10 14:11:55,820 - pysdg - INFO - 2985237 - generate.py:756 - Generating synth no. 0 of size (10000, 12) -- Completed!
INFO:pysdg:Generating synth no. 0 of size (10000, 12) -- Completed!
2025-03-10 14:11:57,998 - pysdg - INFO - 2985237 - generate.py:756 - Generating synth no. 1 of size (10000, 12) -- Completed!
INFO:pysdg:Generating synth no. 1 of size (10000, 12) -- Completed!


The generated synthetic datasets are both encoded. For instance, we can check the first 10 records of the first synthetic dataset using:

In [16]:
gen.enc_synths[0].head(10)

Unnamed: 0,outc_cod_0%%%0,event_dt%%%1,wt%%%2,wt_cod%%%3,age%%%4,age_cod%%%5,drugname_0%%%6,indi_pt_0%%%7,sex%%%8,wt%%%2_missing,age%%%4_missing,event_dt%%%1_missing
0,%%MISS-0%%,19571.941165,73.657916,%%MISS-0%%,51.314295,%%MISS-0%%,VARGATEF,%%MISS-0%%,%%MISS-0%%,True,True,True
1,OT,19574.526627,73.702996,%%MISS-0%%,59.706389,%%MISS-0%%,ZESTRIL,%%MISS-0%%,%%MISS-0%%,True,True,True
2,LT,19578.924485,73.779676,%%MISS-0%%,73.981299,%%MISS-0%%,CASIRIVIMAB\IMDEVIMAB,Neutrophil function disorder,M,True,True,True
3,%%MISS-0%%,19575.225284,73.715177,%%MISS-0%%,61.974145,%%MISS-0%%,XTAMPZA ER,%%MISS-0%%,%%MISS-0%%,True,True,True
4,%%MISS-0%%,19576.282912,73.733618,%%MISS-0%%,65.407077,%%MISS-0%%,BRIGATINIB,Product used for unknown indication,%%MISS-0%%,True,True,True
5,%%MISS-0%%,19573.905605,73.692168,%%MISS-0%%,57.690628,%%MISS-0%%,UPADACITINIB,%%MISS-0%%,%%MISS-0%%,True,True,True
6,%%MISS-0%%,19886.42814,29.770553,KG,13.406351,MON,Dolutegravir,Myelodysplastic syndrome,M,False,False,False
7,%%MISS-0%%,17695.249435,73.71219,%%MISS-0%%,61.417911,%%MISS-0%%,GAMMAPLEX,Metastases to bone,F,True,True,False
8,%%MISS-0%%,19578.279916,73.768438,%%MISS-0%%,71.889108,%%MISS-0%%,DURVALUMAB,Prostatomegaly,F,True,True,True
9,%%MISS-0%%,19575.333824,73.71707,%%MISS-0%%,62.326453,%%MISS-0%%,ORILISSA,%%MISS-0%%,%%MISS-0%%,True,True,True


The synthetic datasets need to be decoded. As the final step, we use the `unload` method to retrieve the list of decoded synthetic datasets, called `synths` below.

In [17]:
synths=gen.unload()

Let us check the first 10 records of the first synthetic data set.

In [18]:
synths[0].head(10)

Unnamed: 0,outc_cod_0,event_dt,wt,wt_cod,age,age_cod,drugname_0,indi_pt_0,sex
0,,NaT,,,,,VARGATEF,,
1,OT,NaT,,,,,ZESTRIL,,
2,LT,NaT,,,,,CASIRIVIMAB\IMDEVIMAB,Neutrophil function disorder,M
3,,NaT,,,,,XTAMPZA ER,,
4,,NaT,,,,,BRIGATINIB,Product used for unknown indication,
5,,NaT,,,,,UPADACITINIB,,
6,,2019-06-13,29.770553,KG,13.0,MON,Dolutegravir,Myelodysplastic syndrome,M
7,,2013-06-13,,,,,GAMMAPLEX,Metastases to bone,F
8,,NaT,,,,,DURVALUMAB,Prostatomegaly,F
9,,NaT,,,,,ORILISSA,,
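
As a rough illustration of what decoding can involve, the sketch below uses a boolean indicator column to restore missing values and rounds a discrete variable to a nullable integer. It is a generic pandas sketch on made-up data; the helper name and the rounding rule are assumptions, not pysdg's implementation.

```python
import pandas as pd


def restore_missing(encoded, column, discrete=False):
    """Blank out flagged rows and optionally round to nullable integers."""
    values = encoded[column].mask(encoded[f"{column}_missing"])
    if discrete:
        values = values.round().astype("Int64")
    return values


# Hypothetical encoded rows: a value column plus its boolean indicator.
toy_encoded = pd.DataFrame(
    {"age": [13.4, 61.4, 57.0], "age_missing": [False, True, False]}
)
decoded_age = restore_missing(toy_encoded, "age", discrete=True)
print(decoded_age)
```

The flagged row becomes `<NA>`, and the remaining values are integers again.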


As shown below, the final `synthetic` datasets have exactly the same data types, column names and column order as the `real` dataset.

In [19]:
synths[0].dtypes

outc_cod_0          category
event_dt      datetime64[ns]
wt                   float64
wt_cod              category
age                    Int64
age_cod             category
drugname_0          category
indi_pt_0           category
sex                 category
dtype: object