### Quickstart for SynRD package

To set up the SynRD package, follow the steps below:

- Clone the dp-query-release repo: `git clone https://github.com/terranceliu/dp-query-release`
- From the dp-query-release repo, move the `/src` folder into the `/synthesizers` folder
- Done!
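
The steps above can also be scripted. The following is a minimal sketch, assuming `git` is available on your PATH; the helper names `clone_repo` and `move_src` and the directory arguments are hypothetical and should be adapted to your own layout.

```python
import shutil
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/terranceliu/dp-query-release"


def clone_repo(target_dir: Path, repo_url: str = REPO_URL) -> None:
    """Clone dp-query-release into target_dir (requires git on PATH)."""
    subprocess.run(["git", "clone", repo_url, str(target_dir)], check=True)


def move_src(repo_dir: Path, synthesizers_dir: Path) -> Path:
    """Move <repo_dir>/src to <synthesizers_dir>/src and return the new path."""
    source = repo_dir / "src"
    if not source.is_dir():
        raise FileNotFoundError(f"expected a src folder at {source}")
    synthesizers_dir.mkdir(parents=True, exist_ok=True)
    destination = synthesizers_dir / "src"
    if destination.exists():
        # Refuse to overwrite an existing copy.
        raise FileExistsError(f"{destination} already exists")
    shutil.move(str(source), str(destination))
    return destination
```

For example, `clone_repo(Path("dp-query-release"))` followed by `move_src(Path("dp-query-release"), Path("synthesizers"))` would perform both steps, assuming those relative paths match your checkout.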

The main classes implemented by SynRD are Synthesizer, Publication, Finding, and Benchmark. The package is organized into four modules: `SynRD.papers`, which includes the publications' documentation; `SynRD.benchmark`, which includes the benchmark to use; `SynRD.synthesizers`, from which you can import the data synthesizers you need and set up their parameters; and, finally, `SynRD.utils`, for any additional processing you might want to do.

The Synthesizer class provides a unified interface to the implementations of five DP synthesizers, specifying recommended parameter values for each and implementing the fit and sample methods. This class wraps MST, PATECTGAN, and AIM implementations from the SmartNoise package and an implementation of PrivBayes from the DataSynthesizer package.

In [2]:
import warnings
warnings.filterwarnings('ignore')
from SynRD.papers import Iverson22Football
from SynRD.benchmark import Benchmark
from synthesizer import MSTSynthesizer
from SynRD.utils import save_synthesizer, load_synthesizer, do_binning, unbin_df

Here, we initialize the benchmark (note that `Benchmark()` does not accept any arguments) and the list of paper classes we want to work with. The list can contain one paper or many; keep in mind that the more papers you include, the longer it will take to process all of them.

In [22]:
benchmark = Benchmark()
papers = [Iverson22Football]

Then we use the benchmark to initialize the papers, which runs the download procedure (it locates the papers online and downloads them). Afterwards, the `data` folder created under the `synthesizers` folder contains a .tsv file for each preprocessed paper you initialized.

In [24]:
papers = benchmark.initialize_papers(papers)

We might also want to apply additional transformations to the initialized papers. The utils we imported cover this; binning, for example, can be done as follows:

In [25]:
transforms = {}
for paper in papers:
    df, transform = do_binning(paper.real_dataframe)
    transforms[paper.__class__.__name__.lower()] = transform
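
To make the idea of binning concrete, here is a generic sketch of binning a numeric column and approximately inverting it. It is not SynRD's `do_binning` or `unbin_df` implementation; the helper names `bin_column` and `unbin_column`, the equal-width bins, and the midpoint inverse are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def bin_column(values: pd.Series, n_bins: int = 4):
    """Replace numeric values by integer bin codes; return the codes and bin edges."""
    # Equal-width edges spanning the observed range (assumes min < max).
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    codes = pd.cut(values, bins=edges, labels=False, include_lowest=True)
    return codes, edges


def unbin_column(codes: pd.Series, edges: np.ndarray) -> pd.Series:
    """Map bin codes back to bin midpoints (a lossy inverse of bin_column)."""
    midpoints = (edges[:-1] + edges[1:]) / 2
    return codes.map(lambda c: midpoints[int(c)] if pd.notna(c) else np.nan)


ages = pd.Series([11, 13, 15, 19])
codes, edges = bin_column(ages, n_bins=4)   # edges: 11, 13, 15, 17, 19
restored = unbin_column(codes, edges)       # midpoints of the assigned bins
```

Keeping the edges alongside the codes is what makes the inverse possible, which is presumably why a transform object is stored per paper in the cell above.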

Here is our binned data from the Iverson22Football paper:

In [26]:
df

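| | BIO_SEX | S44A21 | H1GI9 | H5OD11 | S1 | IYEAR5 | IMONTH5 | H1GI1Y | H1GI1M | H5ID6G | ... | S44A25 | S44A26 | S44A27 | S44A28 | S44A29 | H1HS3 | H1SU1 | H5ID6I | H5ID13 | H5SS0B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 1 | 8.0 | 13.0 | 0 | 2 | 4 | 6 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1 | 0.0 | 2 | 3.0 | 14.0 | 1 | 4 | 3 | 9 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 1.0 |
| 2 | 1 | 0.0 | 2 | 6.0 | 15.0 | 1 | 7 | 2 | 1 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 1.0 |
| 3 | 1 | 1.0 | 2 | 4.0 | 13.0 | 0 | 9 | 4 | 2 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 1.0 |
| 4 | 1 | 0.0 | 1 | 2.0 | 14.0 | 1 | 6 | 4 | 10 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1757 | 1 | NaN | 1 | 6.0 | NaN | 1 | 4 | 5 | 3 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | 1 | 1 | 1.0 | 0.0 | 4.0 |
| 1758 | 1 | 0.0 | 2 | 6.0 | 13.0 | 0 | 7 | 4 | 3 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 1.0 | 2.0 |
| 1759 | 1 | 0.0 | 1 | 10.0 | 13.0 | 0 | 8 | 4 | 2 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 1.0 |
| 1760 | 1 | 1.0 | 1 | 2.0 | 14.0 | 1 | 1 | 3 | 3 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 1.0 |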


Now we can initialize a data synthesizer for our data. You can adjust any of the synthesizer's available parameters to control how the data is processed. To see how to initialize each synthesizer properly, take a look at `config_notebook` in the synthesizers folder :)

Let's now set one essential parameter, epsilon, and fit the MST synthesizer (MST: Maximum Spanning Tree synthesizer) to our data; the synthesizer learns the data distribution privately.

To learn more about the parameters, see the documentation of each synthesizer class, which describes every parameter.

In [27]:
mst = MSTSynthesizer(epsilon=1.0, thresh=0.1, verbose=True)
mst.fit(df)

Fitting with 42776387592120000 dimensions
Getting cliques
Estimating marginals


Finally, we must provide the Benchmark class with the execution parameters to sample the data. The number of samples to generate is `samples = n * B`, where n is the number of rows in the real dataset and B is the number of bootstrap samples to run over the data. The `sample` function draws the samples randomly from the fitted distribution. The benchmark checks all findings for that publication over the real data, generates synthetic datasets for each DP synthesizer, checks the findings over the synthetic data, and, finally, generates an epistemic parity score for each (synthesizer, finding) pair and for the synthesizer overall (over all findings).

In [28]:
B = 5
synth_df = mst.sample(B*len(paper.real_dataframe))
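
As a worked example of `samples = n * B`: assuming the real dataframe has the 1762 rows suggested by the `count` values in this walkthrough, B = 5 gives 8810 requested rows. The sketch below also shows one plausible way to view such a frame as B datasets of n rows each; whether SynRD partitions the sample this way is an assumption, and `split_into_bootstrap_sets` is a hypothetical helper.

```python
import pandas as pd


def split_into_bootstrap_sets(synth: pd.DataFrame, n_real: int) -> list:
    """Cut a frame of B * n_real rows into B consecutive frames of n_real rows."""
    if n_real <= 0 or len(synth) % n_real != 0:
        raise ValueError("synthetic row count must be a positive multiple of n_real")
    return [
        synth.iloc[start:start + n_real].reset_index(drop=True)
        for start in range(0, len(synth), n_real)
    ]


n_real, B = 1762, 5
total_rows = n_real * B  # 8810 rows requested from the synthesizer

# Stand-in for a sampled frame; real use would pass the synthesizer's output.
demo = pd.DataFrame({"x": range(total_rows)})
parts = split_into_bootstrap_sets(demo, n_real)
```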

We can compare the statistics for the 'original' and synthetic datasets:

In [29]:
df.describe()

| | BIO_SEX | S44A21 | H1GI9 | H5OD11 | S1 | IYEAR5 | IMONTH5 | H1GI1Y | H1GI1M | H5ID6G | ... | S44A25 | S44A26 | S44A27 | S44A28 | S44A29 | H1HS3 | H1SU1 | H5ID6I | H5ID13 | H5SS0B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1762.0 | 1321.0 | 1762.0 | 1760.0 | 1319.0 | 1762.0 | 1762.0 | 1762.0 | 1762.0 | 1762.0 | ... | 1321.0 | 1321.0 | 1321.0 | 1321.0 | 1321.0 | 1762.0 | 1762.0 | 1759.0 | 1753.0 | 1749.0 |
| mean | 1.0 | 0.279334 | 1.503973 | 7.65625 | 14.946171 | 0.233258 | 6.971623 | 1.997162 | 6.523837 | 0.174234 | ... | 0.056018 | 0.126419 | 0.025738 | 0.080999 | 0.118849 | 0.106697 | 0.164586 | 0.156339 | 0.120365 | 1.348199 |
| std | 0.0 | 0.448842 | 1.041575 | 3.475818 | 1.748873 | 0.423025 | 2.795772 | 1.653208 | 3.4027 | 0.379418 | ... | 0.230044 | 0.332447 | 0.158413 | 0.272937 | 0.323734 | 0.37974 | 0.684985 | 0.36328 | 0.325481 | 0.644883 |
| min | 1.0 | 0.0 | 1.0 | 2.0 | 11.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 1.0 | 0.0 | 1.0 | 4.0 | 14.0 | 0.0 | 5.0 | 0.0 | 4.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 1.0 | 0.0 | 1.0 | 8.0 | 15.0 | 0.0 | 8.0 | 2.0 | 7.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 1.0 | 2.0 | 10.0 | 16.0 | 0.0 | 9.0 | 3.0 | 9.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| max | 1.0 | 1.0 | 8.0 | 16.0 | 19.0 | 1.0 | 12.0 | 5.0 | 12.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 8.0 | 8.0 | 1.0 | 1.0 | 4.0 |


In [30]:
synth_df.describe()

| | BIO_SEX | S44A21 | H1GI9 | H5OD11 | S1 | IYEAR5 | IMONTH5 | H1GI1Y | H1GI1M | H5ID6G | ... | S44A25 | S44A26 | S44A27 | S44A28 | S44A29 | H1HS3 | H1SU1 | H5ID6I | H5ID13 | H5SS0B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8640.0 | 6653.0 | 8650.0 | 8511.0 | 6772.0 | 8754.0 | 8400.0 | 8538.0 | 8342.0 | 8562.0 | ... | 6504.0 | 6536.0 | 6471.0 | 6558.0 | 6529.0 | 8564.0 | 8634.0 | 8316.0 | 8538.0 | 8615.0 |
| mean | 1.0 | 0.277619 | 1.487399 | 8.053343 | 15.353219 | 0.194997 | 6.620952 | 2.089834 | 7.051546 | 0.173674 | ... | 0.023678 | 0.15254 | 0.027971 | 0.070601 | 0.135855 | 0.408454 | 0.278782 | 0.122535 | 0.121691 | 1.329077 |
| std | 0.0 | 0.447858 | 1.192682 | 3.955329 | 2.227071 | 0.396221 | 2.934761 | 1.658007 | 3.287667 | 0.378851 | ... | 0.152055 | 0.359571 | 0.164902 | 0.256176 | 0.342661 | 1.607862 | 1.349413 | 0.327922 | 0.326948 | 0.626968 |
| min | 1.0 | 0.0 | 1.0 | 2.0 | 11.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 1.0 | 0.0 | 1.0 | 4.0 | 14.0 | 0.0 | 4.0 | 1.0 | 5.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 1.0 | 0.0 | 1.0 | 9.0 | 16.0 | 0.0 | 7.0 | 2.0 | 8.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 1.0 | 2.0 | 10.0 | 17.0 | 0.0 | 8.0 | 3.0 | 10.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| max | 1.0 | 1.0 | 8.0 | 16.0 | 19.0 | 1.0 | 12.0 | 5.0 | 12.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 8.0 | 8.0 | 1.0 | 1.0 | 4.0 |
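
Reading two wide `describe()` outputs side by side is tedious, so a small helper can line up the column means and rank them by how far apart they are. This is a plain pandas sketch, not part of SynRD; the name `compare_means` is hypothetical.

```python
import pandas as pd


def compare_means(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Tabulate per-column means of both frames and their absolute difference."""
    # Only compare columns that exist in both frames.
    shared = [col for col in real.columns if col in synth.columns]
    summary = pd.DataFrame({
        "real_mean": real[shared].mean(numeric_only=True),
        "synth_mean": synth[shared].mean(numeric_only=True),
    })
    summary["abs_diff"] = (summary["real_mean"] - summary["synth_mean"]).abs()
    # Largest discrepancies first.
    return summary.sort_values("abs_diff", ascending=False)
```

Calling `compare_means(df, synth_df).head()` would surface the columns that drifted most. In the statistics shown here, for instance, the mean of `H1HS3` is 0.106697 in the original data and 0.408454 in the synthetic data.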


Now we attach the synthetic dataset to the paper and compute a set of statistics, including the epistemic parity metric: the fraction of trials in which the paper's findings were replicated on the synthetic data.

In [33]:
paper.set_synthetic_dataframe(synth_df)
benchmark.eval_soft_findings_each_finding(paper, 5)

[array([1. , 1. , 1. , 0.8, 0.8, 0.4, 0.2, 0.4, 0.2, 0.4, 0.4, 0.6, 0.4,
        0.6, 0.2, 0.4, 0.4]),
 array([0.        , 0.        , 0.        , 0.4       , 0.4       ,
        0.48989795, 0.4       , 0.48989795, 0.4       , 0.48989795,
        0.48989795, 0.48989795, 0.48989795, 0.48989795, 0.4       ,
        0.48989795, 0.48989795]),
 array([[1. , 1. , 1. , 0.1, 0.1, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
         0. , 0. , 0. , 0. ],
        [1. , 1. , 1. , 1. , 1. , 1. , 0.9, 1. , 0.9, 1. , 1. , 1. , 1. ,
         1. , 0.9, 1. , 1. ]])]
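
The first two arrays in this output are numerically consistent with a per-finding mean and population standard deviation of 0/1 replication outcomes over 5 trials (for example, 0.8 pairs with 0.4, and 0.4 pairs with 0.48989795). The sketch below reproduces that arithmetic with NumPy. It is an interpretation of the printed numbers rather than SynRD's implementation, the helper name `summarize_replications` is hypothetical, and the third array is not modelled here.

```python
import numpy as np


def summarize_replications(outcomes):
    """Return per-finding mean and population std of binary replication outcomes.

    outcomes has shape (n_trials, n_findings); 1 means the finding replicated.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    return outcomes.mean(axis=0), outcomes.std(axis=0)


# Five trials (rows) for three hypothetical findings (columns):
# replicated in 5/5, 4/5 and 2/5 trials respectively.
trials = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]
means, stds = summarize_replications(trials)
# means is approximately [1.0, 0.8, 0.4]
# stds is approximately [0.0, 0.4, 0.48989795]
```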

This was an example of working with the SynRD package. You can follow it as is or adapt it to your own interests and experiments. Good luck with your explorations!