# Tutorial 5: Generating data using Differential Privacy guarantees

`synthcity` includes several privacy-focused generators. One class of these models provides differential privacy guarantees.

The central idea behind differential privacy (DP) is to add random noise in a way that preserves the overall statistical properties of the dataset, while making it hard to determine whether any individual data point is present in or absent from the dataset. This is achieved by bounding the amount of information that any single query can reveal about an individual data point.

One key element in DP is __epsilon (ε)__, a measure of the privacy loss of a given algorithm or mechanism. Specifically, it is a way to quantify the maximum amount by which the probability of any outcome (or set of outcomes) can change due to the inclusion or exclusion of a single individual's data.

For example, suppose a query to a differentially private algorithm has a privacy loss of ε=1. Then the probability of any outcome (or set of outcomes) can change by at most a factor of e^1 ≈ 2.718 due to the inclusion or exclusion of a single individual's data. The lower the value of `ε`, the stronger the privacy guarantees provided by the algorithm or mechanism.
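
To make the factor concrete, the short sketch below evaluates the bound e^ε for the three ε values benchmarked later in this tutorial. The helper name `max_probability_ratio` is introduced here for illustration only; it is not part of `synthcity`.

```python
import math


def max_probability_ratio(epsilon: float) -> float:
    """Upper bound e^epsilon on how much one record can scale any outcome's probability."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return math.exp(epsilon)


for eps in (0.1, 1, 10):
    print(f"epsilon={eps}: ratio <= {max_probability_ratio(eps):.3f}")
```

The bound grows quickly: roughly 1.105 for ε=0.1, 2.718 for ε=1, and over 22,000 for ε=10, which is why large values of ε offer little formal protection.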

One common way to achieve differential privacy is through the use of randomization. A differentially private algorithm will add noise to the output of a query in order to make it difficult to determine whether any particular individual's data was used in the computation. The amount of noise added is typically determined by the value of ε, with a lower value of ε resulting in more noise being added and stronger privacy guarantees.
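
As a minimal sketch of this idea, the example below applies the textbook Laplace mechanism to a counting query, with noise scale `sensitivity / epsilon`. It is an illustration under stated assumptions, not the mechanism used inside the `synthcity` generators; the function names and the true count of 100 are hypothetical. Only the standard library is used, and the Laplace noise is drawn as the difference of two exponential samples.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for eps in (0.1, 1, 10):
    answers = [private_count(100, eps, rng) for _ in range(5)]
    print(f"epsilon={eps}:", [round(a, 1) for a in answers])
```

With ε=0.1 the released counts scatter widely around 100, while with ε=10 they stay close to it: a lower ε buys stronger privacy at the cost of accuracy.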

Note that ε is not the only parameter used to measure privacy loss in differential privacy. Another commonly used parameter is __delta (δ)__, which bounds the probability that the privacy loss of an algorithm or mechanism exceeds ε. Together, ε and δ specify the privacy guarantees of a differentially private algorithm or mechanism more precisely.
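
Stated formally, using the standard definition (the symbols M, D, D' and S are introduced here and are not used elsewhere in this tutorial): a randomized mechanism M satisfies (ε, δ)-differential privacy if, for every pair of datasets D and D' that differ in one individual's record and every set of outputs S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

With δ = 0 this reduces to the pure ε guarantee, i.e. the e^ε factor from the ε=1 example above.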


`synthcity` includes the following privacy-focused models:
 - `AdsGAN` - A GAN with an identifiability penalty.
 - `PATEGAN` - A GAN which uses the Private Aggregation of Teacher Ensembles (PATE) framework to tightly bound the influence of any individual sample on the model.
 - `PrivBayes` - Uses a Bayesian network to iteratively learn a set of low-dimensional conditional probabilities from noisy marginals.
 - `DPGAN` - A GAN which uses the DP-SGD optimizer for training the discriminator.

PATEGAN, PrivBayes, and DPGAN expose an `epsilon` parameter, which can be customized.
AdsGAN takes a different approach: its privacy level is controlled by the `lambda_identifiability_penalty` parameter.
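
The PATE idea mentioned in the list above can be sketched as a noisy majority vote. In PATE, several teacher models are trained on disjoint partitions of the data, so each record influences only one teacher and therefore only one vote; Laplace noise is then added to the vote counts before the winning label is released. The code below is a simplified illustration of that aggregation step only, not the `PATEGAN` implementation; the function names, the votes, and the noise scales are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_majority_vote(teacher_votes, labels, noise_scale, rng):
    """Return the label with the highest vote count after adding Laplace noise to each count."""
    counts = Counter(teacher_votes)
    noisy_counts = {
        label: counts[label] + laplace_noise(noise_scale, rng) for label in labels
    }
    return max(noisy_counts, key=noisy_counts.get)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical votes from 10 teacher discriminators on one sample: 1 = "real", 0 = "fake".
votes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
for scale in (0.1, 5.0):
    outcomes = Counter(noisy_majority_vote(votes, (0, 1), scale, rng) for _ in range(1000))
    print(f"noise scale={scale}: {dict(outcomes)}")
```

With a small noise scale the majority label is returned essentially every time; with a larger scale the minority label is released more often, which is the price paid for masking any single teacher's vote.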

In [None]:
!pip install synthcity
!pip uninstall -y torchaudio torchdata

In [1]:
# stdlib
import sys
import warnings

warnings.filterwarnings("ignore")

# third party
from sklearn.datasets import load_diabetes

# synthcity absolute
import synthcity.logger as log
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader

log.add(sink=sys.stderr, level="INFO")

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

X



Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
...,...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,178.0
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485,104.0
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491,132.0
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930,220.0


In [2]:
# Note: preprocessing data with OneHotEncoder or StandardScaler is not needed or recommended. Synthcity handles feature encoding and standardization internally.
loader = GenericDataLoader(
    X,
    target_column="target",
    sensitive_columns=["sex"],
)

## List privacy plugins

In [3]:
Plugins(categories=["privacy"]).list()

['privbayes', 'pategan', 'adsgan', 'decaf', 'dpgan', 'aim']

## Load and train a generative model

In [4]:
# synthcity absolute
from synthcity.plugins import Plugins

syn_model = Plugins().get("dpgan")

syn_model.fit(loader)

[2024-11-24T20:49:31.695392+0800][31696][CRITICAL] module disabled: e:\qycache\anaconda\envs\LLM\lib\site-packages\synthcity\plugins\generic\plugin_goggle.py
 15%|█▍        | 299/2000 [01:50<10:30,  2.70it/s]


<synthcity.plugins.privacy.plugin_dpgan.DPGANPlugin at 0x18effb56c10>

## Generate new data using the model

In [5]:
syn_model.generate(count=10).dataframe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.066272,-0.044642,0.131566,0.051242,-0.031196,-0.021695,-0.037247,0.008006,0.064448,0.017967,218.396364
1,0.097217,-0.044642,0.098494,0.01026,-0.052599,0.014224,-0.024378,0.007747,0.034408,0.017972,156.462274
2,0.056227,0.05068,0.15708,0.045276,-0.032674,-0.05173,-0.061873,0.011906,0.092191,0.019966,204.283476
3,0.009991,0.05068,0.109226,0.034451,-0.046116,-0.024433,-0.034076,0.009777,0.061323,0.043703,227.000796
4,0.044127,-0.044642,0.096911,-0.003726,-0.054669,0.06274,0.008867,0.007633,0.071114,0.016739,162.229992
5,0.076378,0.05068,0.090852,0.047191,-0.059446,-0.022275,-0.035013,0.004838,0.064143,-0.00323,204.421871
6,0.05563,-0.044642,0.139378,0.013592,-0.013895,0.058805,-0.011707,0.015384,0.088941,-0.005652,179.249275
7,0.080505,-0.044642,0.121763,0.043262,-0.011573,0.037228,-0.031826,0.008005,0.067346,0.027105,227.740093
8,0.080938,-0.044642,0.135663,0.040924,-0.058942,0.035957,-0.042771,0.014104,0.084402,0.054957,201.581942
9,0.056783,-0.044642,0.123398,0.019973,-0.040483,0.104738,-0.009223,0.016466,0.077389,-0.023868,192.313173


## Benchmarking metrics

| **Metric**                                         | **Description**                                                                                                            |
|----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| sanity.data\_mismatch.score                        | Data type mismatch between the real and synthetic features                                                                 |
| sanity.common\_rows\_proportion.score              | Real data copy-paste in the synthetic data                                                                                 |
| sanity.nearest\_syn\_neighbor\_distance.mean       | Computes the `<reduction>(distance)` from the real data to the closest neighbor in the synthetic data                      |
| sanity.close\_values\_probability.score            | the probability of close values between the real and synthetic data.                                                       |
| sanity.distant\_values\_probability.score          | the probability of distant values between the real and synthetic data.                                                     |
| stats.jensenshannon\_dist.marginal                 | the average Jensen-Shannon distance                                                                                        |
| stats.chi\_squared\_test.marginal                  | the one-way chi-square test.                                                                                               |
| stats.feature\_corr.joint                          | the correlation/strength-of-association of features in data-set with both categorical and continuous features              |
| stats.inv\_kl\_divergence.marginal                 | the average inverse of the Kullback–Leibler Divergence metric.                                                             |
| stats.ks\_test.marginal                            | the Kolmogorov-Smirnov test for goodness of fit.                                                                           |
| stats.max\_mean\_discrepancy.joint                 | Empirical maximum mean discrepancy. The lower the result the more evidence that distributions are the same.                |
| stats.prdc.precision                               | precision between the two manifolds                                                                                        |
| stats.prdc.recall                                  | recall between the two manifolds                                                                                           |
| stats.prdc.density                                 | density between the two manifolds                                                                                          |
| stats.prdc.coverage                                | coverage between the two manifolds                                                                                         |
| stats.alpha\_precision.delta\_precision\_alpha\_OC | Delta precision                                                                                                            |
| stats.alpha\_precision.delta\_coverage\_beta\_OC   | Delta coverage                                                                                                             |
| stats.alpha\_precision.authenticity\_OC            | Authenticity                                                                                                               |
| performance.linear\_model.gt.aucroc              | Train on real, test on the test real data using LogisticRegression: AUCROC                                                             |
| performance.linear\_model.syn\_id.aucroc         | Train on synthetic, test on the train real data using LogisticRegression: AUCROC                                                       |
| performance.linear\_model.syn\_ood.aucroc        | Train on synthetic, test on the test real data using LogisticRegression: AUCROC                                                        |
| performance.mlp.gt.aucroc                        | Train on real, test on the test real data using NN: AUCROC                                                                |
| performance.mlp.syn\_id.aucroc                    | Train on synthetic, test on the train real data using NN: AUCROC                                                          |
| performance.mlp.syn\_ood.aucroc                   | Train on synthetic, test on the test real data using NN: AUCROC                                                           |
| performance.xgb.gt.aucroc                         | Train on real, test on the test real data using XGB: AUCROC                                                               |
| performance.xgb.syn\_id.aucroc                    | Train on synthetic, test on the train real data using XGB: AUCROC                                                         |
| performance.xgb.syn\_ood.aucroc                   | Train on synthetic, test on the test real data using XGB: AUCROC                                                          |
| performance.feat\_rank\_distance.corr              | Correlation for the rank distances between the feature importance on real and synthetic data                               |
| performance.feat\_rank\_distance.pvalue            | p-value for the rank distances between the feature importance on real and synthetic data                                   |
| detection.detection\_xgb.mean                      | The average AUCROC score for detecting synthetic data using an XGBoost.                                                    |
| detection.detection\_mlp.mean                      | The average AUCROC score for detecting synthetic data using a NN.                                                          |
| detection.detection\_gmm.mean                      | The average AUCROC score for detecting synthetic data using a GMM.                                                         |
| privacy.delta-presence.score                       | the maximum re-identification probability on the real dataset from the synthetic dataset.                                  |
| privacy.k-anonymization.gt                         | the k-anon for the real data                                                                                               |
| privacy.k-anonymization.syn                        | the k-anon for the synthetic data                                                                                          |
| privacy.k-map.score                                | the minimum value k that satisfies the k-map rule.                                                                         |
| privacy.distinct l-diversity.gt                    | the l-diversity for the real data                                                                                          |
| privacy.distinct l-diversity.syn                   | the l-diversity for the synthetic data                                                                                     |
| privacy.identifiability\_score.score               | the re-identification score on the real dataset from the synthetic dataset.                                                |
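
To give a feel for what a privacy metric such as k-anonymity measures, here is a minimal sketch of the textbook definition: the size of the smallest group of records that share the same quasi-identifier values. This is an illustration only; the way `synthcity` computes `privacy.k-anonymization` may differ, for instance in how continuous features are grouped. The function name, records, and column names below are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical toy records; the column names are illustrative only.
records = [
    {"age_band": "30-39", "sex": "F", "target": 151},
    {"age_band": "30-39", "sex": "F", "target": 75},
    {"age_band": "40-49", "sex": "M", "target": 141},
    {"age_band": "40-49", "sex": "M", "target": 206},
    {"age_band": "40-49", "sex": "M", "target": 135},
]
print(k_anonymity(records, ["age_band", "sex"]))  # prints 2
```

Here every combination of `age_band` and `sex` is shared by at least 2 records, so the toy dataset is 2-anonymous with respect to those columns. A higher k means each individual hides in a larger crowd.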

## Benchmark the quality of DPGAN for various epsilons

In [6]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [(f"test_eps_{eps}", "dpgan", {"epsilon": eps}) for eps in [0.1, 1, 10]],
    loader,
    synthetic_size=1000,
    repeats=2,
    synthetic_reuse_if_exists=False
)

[2024-11-24T20:51:58.028720+0800][31696][CRITICAL] module disabled: e:\qycache\anaconda\envs\LLM\lib\site-packages\synthcity\plugins\generic\plugin_goggle.py
 15%|█▍        | 299/2000 [01:40<09:30,  2.98it/s]
[2024-11-24T20:55:34.802793+0800][31696][CRITICAL] module disabled: e:\qycache\anaconda\envs\LLM\lib\site-packages\synthcity\plugins\generic\plugin_goggle.py
 15%|█▍        | 299/2000 [01:47<10:11,  2.78it/s]
[2024-11-24T20:59:59.355679+0800][31696][CRITICAL] module disabled: e:\qycache\anaconda\envs\LLM\lib\site-packages\synthcity\plugins\generic\plugin_goggle.py
 15%|█▍        | 299/2000 [01:48<10:18,  2.75it/s]
[2024-11-24T21:03:34.210536+0800][31696][CRITICAL] module disabled: e:\qycache\anaconda\envs\LLM\lib\site-packages\synthcity\plugins\generic\plugin_goggle.py
 17%|█▋        | 349/2000 [01:55<09:08,  3.01it/s]
[2024-11-24T21:07:20.508700+0800][31696][CRITICAL] module disabled: e:\qycache\anaconda\envs\LLM\lib\site-packages\synthcity\plugins\generic\plugin_goggle.py
 27%|█

In [7]:
Benchmarks.print(score)


[4m[1mComparatives[0m[0m


Unnamed: 0,test_eps_0.1,test_eps_1,test_eps_10
sanity.data_mismatch.score,0.0 +/- 0.0,0.0 +/- 0.0,0.0 +/- 0.0
sanity.common_rows_proportion.score,0.0 +/- 0.0,0.0 +/- 0.0,0.0 +/- 0.0
sanity.nearest_syn_neighbor_distance.mean,0.275 +/- 0.117,0.218 +/- 0.015,0.168 +/- 0.078
sanity.close_values_probability.score,0.534 +/- 0.253,0.674 +/- 0.045,0.713 +/- 0.152
sanity.distant_values_probability.score,0.073 +/- 0.028,0.101 +/- 0.0,0.045 +/- 0.022
stats.jensenshannon_dist.marginal,0.075 +/- 0.009,0.05 +/- 0.001,0.04 +/- 0.006
stats.chi_squared_test.marginal,0.218 +/- 0.138,0.047 +/- 0.03,0.043 +/- 0.042
stats.inv_kl_divergence.marginal,0.256 +/- 0.125,0.196 +/- 0.018,0.325 +/- 0.025
stats.ks_test.marginal,0.203 +/- 0.104,0.404 +/- 0.003,0.55 +/- 0.015
stats.max_mean_discrepancy.joint,0.505 +/- 0.191,0.029 +/- 0.001,0.188 +/- 0.156


## Benchmark the quality of DP models for epsilon=0.1

In [None]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [
        (f"test_{model}", model, {"epsilon": 0.1})
        for model in ["pategan", "dpgan"]
    ],
    loader,
    synthetic_size=1000,
    repeats=2,
    synthetic_reuse_if_exists=False,
)

In [None]:
Benchmarks.print(score)

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways:

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is by starring the repos! This helps raise awareness of the tools we're building.


### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
