DPautoGAN

Code for the paper Differentially Private Mixed-Type Data Generation for Unsupervised Learning.

Dependencies

certifi==2020.4.5.1
cycler==0.10.0
future==0.18.2
joblib==0.15.1
kiwisolver==1.2.0
matplotlib==3.2.1
numpy==1.18.4
pandas==1.0.4
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.1
scikit-learn==0.23.1
scipy==1.4.1
six==1.15.0
sklearn==0.0
threadpoolctl==2.1.0
torch==1.5.0

UCI Data Generation

To run the DPautoGAN on UCI Dataset, follow the uci.ipynb jupyter notebook. Cell 2 requires the correct names of the dataset files to be passed in. If you have exactly cloned the repository and have not made any changes then this should not be an issue.

The rest of the notebook can be run as is.

If you train different models, with different hyperparameter values, make sure to change the filenames of the saved and loaded models to ensure consistency.

To perform the PCA evaluation on the generated data, follow the pca-eval.ipynb notebook (making sure to change the filename of the loaded models). To load the generated data into this file, you can either go back to uci.ipynb and save one of the generated values during the latter half of the notebook that is dedicated to evaluating the generated data or copy the code used to generate that data.

Mimic Data Generation

In order to run the MIMIC data generation process:

Run the dp_autoencoder.py file. The parameters to be passed in are at the bottom of the file.
Next, run the dp_wgan.py file. Make sure that the autoencoder model loaded into the GAN is the same as the one you saved in the previous step.
Run evaluation.py. At the bottom of the file, you can choose to comment or un-comment out the evaluations you wish to run.

Further evaluation

The above two data generation processes also show how to evaluate the data for ROC/AUC and PCA. In order to get the Wasserstein distance for the PCA plots:

We take a random sample of size 100 from real and synthetic PCA datasets. We compute the the matrix of pairwise distances and write it to a csv file. This process is repeated 100 times to create "i.csv" files where i varies from 0 to 99.
These files are then used while running eval_wasserstein.R code. This code evaluates Wasserstein distance between the data points for each of the 100 files and outputs the average.
The same process is used for computing Wasserstein distance on real and synthetic full datasets.

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
metric		metric
mimic		mimic
results		results
uci		uci
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DPautoGAN

Dependencies

UCI Data Generation

Mimic Data Generation

Further evaluation

About

Releases

Packages

Languages

License

DPautoGAN/DPautoGAN

Folders and files

Latest commit

History

Repository files navigation

DPautoGAN

Dependencies

UCI Data Generation

Mimic Data Generation

Further evaluation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages