$ cd DataSifterText-package
$ rm -rf env
$ python3 -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt
$ python3 total.py <SUMMARIZATION> <KEYWORDS/POSITION SWAP MODE>
SUMMARIZATION: 0 = do not summarize, 1 = summarize
KEYWORDS/POSITION SWAP MODE: 0 = keywords-swap, 1 = position-swap
Note that in summarization mode, only keywords-swap is performed.
$ python3 total.py 0 0
runs the obfuscation without summarization, using keywords-swap.
Built-in example (the input CSV is passed as a third argument): python3 total.py 0 0 processed_0_prepare.csv
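As a rough illustration of how the two flags interact, the sketch below validates them and applies the rule that summarization mode only performs keywords-swap. The function name `parse_modes` is hypothetical and the code is not taken from total.py. Silently falling back to keywords-swap, rather than raising an error, is also an assumption.

```python
# Hypothetical helper illustrating the documented flag semantics;
# this is NOT the code used in total.py.
def parse_modes(argv):
    """Return (summarize, swap_mode) from the two 0/1 command-line flags."""
    if len(argv) < 2:
        raise ValueError("expected <SUMMARIZATION> <KEYWORDS/POSITION SWAP MODE>")
    flags = []
    for raw in argv[:2]:
        if raw not in ("0", "1"):
            raise ValueError(f"flag must be 0 or 1, got {raw!r}")
        flags.append(int(raw))
    summarize, swap = flags
    # Summarization mode only performs keywords-swap, so a position-swap
    # request is overridden here (assumed fallback behaviour).
    if summarize == 1:
        swap = 0
    swap_mode = "keywords-swap" if swap == 0 else "position-swap"
    return bool(summarize), swap_mode


print(parse_modes(["0", "0"]))  # no summarization, keywords-swap
print(parse_modes(["1", "1"]))  # summarization forces keywords-swap
```

Under these assumptions, `0 1` is the only combination that selects position-swap.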
$ git clone https://github.com/google-research/bert.git
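Download a pre-trained model from https://github.com/google-research/bert#pre-trained-models and unzip it inside the cloned bert directory; the remaining commands are run from that directory. The training command below uses cased_L-12_H-768_A-12.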
$ mkdir data
$ mkdir bert_output
Move train_sifter.py into the BERT repository and run it from there. Make sure your data is in the "./data" directory first.
$ cp [your data] data
$ python3 train_sifter.py
$ python3 run_classifier.py \
    --task_name=cdc \
    --do_train=true \
    --do_eval=true \
    --do_predict=true \
    --data_dir=./data/ \
    --vocab_file=./cased_L-12_H-768_A-12/vocab.txt \
    --bert_config_file=./cased_L-12_H-768_A-12/bert_config.json \
    --max_seq_length=512 \
    --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=3.0 \
    --output_dir=./bert_output/ \
    --do_lower_case=False
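Because the command has many flags, it can help to keep them in one place and to check that the input paths exist before launching a long training run. The sketch below does this with flag values copied from the command above. The helper names `build_command` and `missing_paths` are hypothetical and not part of the BERT repository or DataSifterText.

```python
import os
import shlex

# Flag values copied verbatim from the run_classifier.py command above.
FLAGS = {
    "task_name": "cdc",
    "do_train": "true",
    "do_eval": "true",
    "do_predict": "true",
    "data_dir": "./data/",
    "vocab_file": "./cased_L-12_H-768_A-12/vocab.txt",
    "bert_config_file": "./cased_L-12_H-768_A-12/bert_config.json",
    "max_seq_length": "512",
    "train_batch_size": "32",
    "learning_rate": "2e-5",
    "num_train_epochs": "3.0",
    "output_dir": "./bert_output/",
    "do_lower_case": "False",
}


def build_command(flags):
    """Turn a flag dict into the argv list for run_classifier.py."""
    return ["python3", "run_classifier.py"] + [f"--{k}={v}" for k, v in flags.items()]


def missing_paths(flags):
    """Return the input paths that do not exist (hypothetical pre-flight check)."""
    keys = ("data_dir", "vocab_file", "bert_config_file")
    return [flags[k] for k in keys if not os.path.exists(flags[k])]


missing = missing_paths(FLAGS)
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) to actually launch training.
    print(shlex.join(build_command(FLAGS)))
```

`shlex.join` requires Python 3.8 or later.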
The results are written to the bert_output directory.
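For a quick look at the results, the sketch below reads the two files that the stock run_classifier.py writes to its output directory: eval_results.txt (one `key = value` line per metric) and test_results.tsv (one row of tab-separated class probabilities per test example). It assumes the script's output format has not been changed for the cdc task. The function names are hypothetical.

```python
import csv
import os


def read_eval_results(path):
    """Parse 'key = value' lines into a dict of floats."""
    results = {}
    with open(path) as fh:
        for line in fh:
            if "=" in line:
                key, value = line.split("=", 1)
                results[key.strip()] = float(value)
    return results


def read_predictions(path):
    """Return the index of the most probable class for each row."""
    with open(path, newline="") as fh:
        rows = [[float(p) for p in row] for row in csv.reader(fh, delimiter="\t") if row]
    return [max(range(len(r)), key=r.__getitem__) for r in rows]


out_dir = "./bert_output"
eval_path = os.path.join(out_dir, "eval_results.txt")
pred_path = os.path.join(out_dir, "test_results.tsv")
if os.path.exists(eval_path):
    print(read_eval_results(eval_path))
if os.path.exists(pred_path):
    print(read_predictions(pred_path)[:10])  # first ten predicted class indices
```

The predicted indices refer to positions in the task's label list, so mapping them back to label names depends on how the cdc task defines its labels.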
- DataSifter-Lite (V 1.0)
- DataSifter website
- Marino, S, Zhou, N, Zhao, Y, Wang, L, Wu, Q, and Dinov, ID. (2019) DataSifter: Statistical Obfuscation of Electronic Health Records and Other Sensitive Datasets, Journal of Statistical Computation and Simulation, 89(2): 249–271, DOI: 10.1080/00949655.2018.1545228.
- Zhou, N, Wang, L, Marino, S, Zhao, Y, and Dinov, ID. (2022) DataSifter II: Partially Synthetic Data Sharing of Sensitive Information Containing Time-varying Correlated Observations, Journal of Algorithms & Computational Technology, 15: 1–17, DOI: 10.1177/17483026211065379.