# SURE library testing notebook
Welcome to the SURE library testing notebook!

You can find the library, together with the necessary information to install it, at its [GitHub repository link](https://github.com/Clearbox-AI/SURE).

Use this notebook and the hints in the code cells as a guideline to test the functionality of the SURE library for synthetic data utility and privacy assessment. \
Feel free to explore and experiment with the library's features beyond these suggestions!

Besides this notebook and the [documentation](https://dario-brunelli-clearbox-ai.notion.site/SURE-Documentation-2c17db370641488a8db5bce406032c1f) of the SURE library, you are provided with a Google Drive folder containing:
- a final questionnaire to gather your feedback
- a consent form for processing your feedback information
- three *.csv* files (described below)

After completing the testing, please fill in the feedback questionnaire and upload this notebook, the questionnaire and the signed consent form to the Google Drive folder.

### Datasets description

The three datasets provided are the following:

- *census_dataset_training.csv*
    
    The original real dataset used to train the generative model from which *census_dataset_synthetic* was produced.
    
- *census_dataset_validation.csv*
    
    This dataset was also part of the original real dataset, but it was not used to train the generative model that produced *census_dataset_synthetic.*
    
- *census_dataset_synthetic.csv*
    
    The synthetic dataset produced with the generative model trained on *census_dataset_training.*
    

The three census datasets include various demographic, social, economic, and housing characteristics of individuals. Every row of the datasets corresponds to an individual.

The machine learning task related to these datasets is a classification task in which, based on all the features, an ML classifier must predict whether the individual earns more than 50k dollars per year (label=1) or less (label=0).\
The column "label" in each dataset is the ground truth for this classification task.

### Tasks

Below is a list of tasks. Please use them as general guidelines to proceed. Note that some tasks are deliberately open-ended to give you the freedom to approach them as you see fit and to test the clarity of the provided documentation:

1. Install the library.
2. Prepare the three datasets using the Preprocessor, adjusting its parameters as you deem best.
3. Assess the TSTR (Train on Synthetic, Test on Real) performance of the synthetic dataset on the classification task employing the utility modules (see [Section 4.1](https://www.notion.so/4-ML-Utility-Metrics-ac98a1d294b1428f8b67936323c7c569?pvs=21) of the documentation).
4. Evaluate the vulnerability of the synthetic dataset provided to membership inference attacks.
5. Generate and explore the final report.

Some suggestions on how to proceed are also available in comments in code blocks of the notebook you have been provided with.

To perform the tasks, please refer to the documentation provided.

Feel free to test the library in any other way you can think of to challenge the library’s capabilities!

(e.g. test different datasets than the ones provided)

If you are unsure how to proceed during the testing, try searching for what you need in these reference links:
- [Documentation](https://dario-brunelli-clearbox-ai.notion.site/SURE-Documentation-2c17db370641488a8db5bce406032c1f)
- [GitHub page](https://github.com/Clearbox-AI/SURE)

### Questions
Answer the following questions with the results you found (just make sure that the answers are visible in the notebook before uploading it to the Google Drive folder):

1. What accuracy is achieved on the given classification task when training on the original dataset, and what accuracy when training on the synthetic dataset?
2. Which machine learning model has the highest accuracy on the TSTR task?
3. What percentage of synthetic records did you find to be closer (in terms of DCR) to the training set than to the validation set?
4. What is the Membership Inference (MI) mean risk score you found?

## 0. Installing the library

In [5]:
# install the SURE library 
%pip install clearbox-sure

Collecting clearbox-sure
  Downloading clearbox_sure-0.1.9.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (664 bytes)
Collecting pandas<2.0.0,>=1.4.2 (from clearbox-sure)
  Downloading pandas-1.5.3.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting polars<1.0.0,>=0.20.31 (from clearbox-sure)
  Using cached polars-0.20.31-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Collecting numpy<2.0.0 (from clearbox-sure)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting cython (from clearbox-sure)
  Downloading Cython-3.0.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)

In [2]:
from sure import Preprocessor, report
from sure.utility import (compute_statistical_metrics, compute_mutual_info,
                          compute_utility_metrics_class)

In [7]:
from sure.privacy import (distance_to_closest_record, dcr_stats, number_of_dcr_equal_to_zero,
                          validation_dcr_test, adversary_dataset, membership_inference_test)



In [4]:
import pandas as pd

## 1. Dataset import and preparation

#### 1.1 Import the datasets

In [5]:
# Import the datasets
real_data = pd.read_csv('SURE_testing_datasets/census_dataset_training.csv')
valid_data = pd.read_csv('SURE_testing_datasets/census_dataset_validation.csv')
synth_data = pd.read_csv('SURE_testing_datasets/census_dataset_synthetic.csv')

#### 1.2 Datasets preparation

In [10]:
# Prepare the datasets with the Preprocessor

# Real dataset - Preprocessor initialization and query execution
preprocessor            = Preprocessor(real_data, get_discarded_info=False)
real_data_preprocessed  = preprocessor.transform(real_data, num_fill_null='forward', scaling='standardize')

# Validation dataset - Preprocessor initialization and query execution
preprocessor            = Preprocessor(valid_data, get_discarded_info=False)
valid_data_preprocessed = preprocessor.transform(valid_data, num_fill_null='forward', scaling='standardize')

# Synthetic dataset - Preprocessor initialization and query execution
preprocessor            = Preprocessor(synth_data, get_discarded_info=False)
synth_data_preprocessed = preprocessor.transform(synth_data, num_fill_null='forward', scaling='standardize')
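
As a rough illustration of what the two options passed above conventionally mean, the following library-independent sketch forward-fills missing values and then standardizes a numeric column. It is an assumption about the general technique, not a description of how the SURE `Preprocessor` implements `num_fill_null='forward'` or `scaling='standardize'` (for example, the library may use a different standard deviation estimator). The helper names and the toy column are hypothetical.

```python
from statistics import mean, pstdev

def forward_fill(values):
    """Replace each None with the most recent non-null value seen so far.

    A leading None stays None because there is nothing to carry forward.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy column (hypothetical values, not taken from the census files)
ages = [39, None, 50, 38, None, 53]
ages_filled = forward_fill(ages)       # the two gaps take the preceding value
ages_scaled = standardize(ages_filled)
```

After standardization the column has mean 0 and standard deviation 1, which matches the small signed values seen in the `age` column of the preprocessed output.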

In [11]:
real_data_preprocessed

Unnamed: 0,age,work_class_Federal-gov,work_class_Local-gov,work_class_Never-worked,work_class_Private,work_class_Self-emp-inc,work_class_Self-emp-not-inc,work_class_State-gov,work_class_Without-pay,education_10th,...,native_country_Puerto-Rico,native_country_Scotland,native_country_South,native_country_Taiwan,native_country_Thailand,native_country_Trinadad&Tobago,native_country_United-States,native_country_Vietnam,native_country_Yugoslavia,label
0,0.03,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,False
1,0.84,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,False
2,-0.04,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,False
3,1.06,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,False
4,-0.78,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,-0.85,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,False
32557,0.10,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,True
32558,1.42,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,False
32559,-1.22,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,False
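
The preprocessed table above contains columns such as `work_class_Private` and `work_class_State-gov`, which is the typical shape of one-hot encoded categorical features. The sketch below shows the general idea on a toy column; the `one_hot` helper is hypothetical and is not part of the SURE API.

```python
def one_hot(values, prefix):
    """Turn a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

# Toy column (illustrative values only)
encoded = one_hot(["Private", "State-gov", "Private"], "work_class")
```

Each record ends up with exactly one indicator set to 1 among the columns that share a prefix.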


## 2. Utility assessment

#### 2.1 Statistical properties and mutual information

In [12]:
# Compute statistical properties

num_features_stats, cat_features_stats, temporal_feat_stats = compute_statistical_metrics(real_data_preprocessed, synth_data_preprocessed)

In [13]:
num_features_stats

{'null_count': {'real': shape: (1, 49)
  ┌─────┬─────────────┬─────────────┬─────────────┬───┬────────────┬────────────┬────────────┬───────┐
  │ age ┆ work_class_ ┆ work_class_ ┆ work_class_ ┆ … ┆ hours_per_ ┆ native_cou ┆ native_cou ┆ label │
  │ --- ┆ Federal-gov ┆ Local-gov   ┆ Private     ┆   ┆ week       ┆ ntry_Mexic ┆ ntry_Unite ┆ ---   │
  │ u32 ┆ ---         ┆ ---         ┆ ---         ┆   ┆ ---        ┆ o          ┆ d-States   ┆ u32   │
  │     ┆ u32         ┆ u32         ┆ u32         ┆   ┆ u32        ┆ ---        ┆ ---        ┆       │
  │     ┆             ┆             ┆             ┆   ┆            ┆ u32        ┆ u32        ┆       │
  ╞═════╪═════════════╪═════════════╪═════════════╪═══╪════════════╪════════════╪════════════╪═══════╡
  │ 0   ┆ 0           ┆ 0           ┆ 0           ┆ … ┆ 0          ┆ 0          ┆ 0          ┆ 0     │
  └─────┴─────────────┴─────────────┴─────────────┴───┴────────────┴────────────┴────────────┴───────┘,
  'synthetic': shape: (1, 49)
  ┌

In [14]:
# Compute features mutual information

corr_real, corr_synth, corr_difference = compute_mutual_info(real_data_preprocessed, synth_data_preprocessed)                    

In [15]:
corr_difference

age,work_class_Federal-gov,work_class_Local-gov,work_class_Private,work_class_Self-emp-inc,work_class_Self-emp-not-inc,work_class_State-gov,education_10th,education_11th,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_HS-grad,education_Masters,education_Some-college,marital_status_Divorced,marital_status_Married-civ-spouse,marital_status_Never-married,marital_status_Separated,marital_status_Widowed,occupation_Adm-clerical,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White,sex_Female,sex_Male,capital_gain,capital_loss,hours_per_week,native_country_Mexico,native_country_United-States,label
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.0,0.002036,0.002723,-0.005267,0.00683,0.001555,-0.000197,0.00007,-0.003931,0.00034,-0.000745,-0.000537,-0.065809,0.000894,-0.000267,0.004293,0.002086,-0.00559,-0.001211,0.008135,0.00081,0.000039,0.001426,0.004435,-0.004211,-0.000094,-0.000448,-0.000919,-0.001397,-0.001648,-0.001479,0.000845,0.002919,0.000574,-0.003459,-0.008655,0.006381,-0.002374,-0.001931,-0.002102,0.013264,-0.002233,0.002233,0.000393,0.001419,0.000368,-0.009499,-0.027562,-0.000194
0.002036,0.0,0.001665,0.007616,0.002269,0.001777,0.001482,-0.011706,-0.001168,0.001465,-0.002245,0.000935,0.015989,-0.000941,0.001112,0.004751,-0.003419,0.003131,-0.008218,-0.000082,0.001509,-0.006337,0.001063,-0.007447,0.002523,-0.004095,0.001347,-0.002219,0.004326,0.003204,0.004274,0.002029,-0.001726,0.000944,-0.009521,0.002233,0.004478,-0.00057,0.002958,0.002164,-0.005533,0.000226,-0.000226,-0.001545,-0.001995,-0.005313,0.001908,-0.008662,0.001268
0.002723,0.001665,0.0,-0.004602,0.001608,-0.000026,0.000327,0.003062,0.002301,0.003898,-0.00225,0.000417,0.009933,0.007514,-0.004275,0.007226,-0.000645,-0.002871,0.003615,-0.003028,0.00076,-0.003781,0.000706,0.004723,-0.002276,-0.007511,0.001205,0.008183,0.006071,-0.001293,0.000631,0.001643,-0.000936,0.000365,0.003563,-0.001368,0.002934,-0.003274,-0.00391,0.004032,-0.003871,0.00156,-0.00156,0.000193,0.001684,-0.000546,-0.000842,0.010844,-0.001894
-0.005267,0.007616,-0.004602,0.0,0.006964,-0.005795,-0.001783,0.00238,-0.000877,-0.000622,0.000635,-0.000178,0.015034,-0.00189,0.002782,-0.00435,-0.002081,0.002144,0.000351,0.001186,-0.000044,0.001036,-0.002475,-0.010472,0.002344,0.003875,0.002322,-0.005848,-0.0063,0.000672,-0.000982,0.000636,-0.001523,0.000588,0.002362,-0.000416,0.000117,0.00258,-0.003144,0.001055,-0.001488,0.003355,-0.003355,-0.00291,0.001813,0.000867,0.007819,-0.002872,0.003134
0.00683,0.002269,0.001608,0.006964,0.0,0.001707,0.001452,0.001441,-0.00355,0.004983,-0.002944,-0.000037,-0.018112,-0.003604,0.003155,-0.002974,0.01089,-0.003811,-0.007918,0.00325,-0.003228,0.004475,0.009412,-0.003617,-0.006381,0.002309,-0.005443,-0.002729,-0.000702,0.002439,0.001795,0.000377,0.008801,-0.002908,0.003241,-0.003036,-0.00746,-0.002615,0.004826,-0.009781,0.010658,-0.005108,0.005108,0.010975,0.000758,0.005722,-0.010199,0.001129,0.004739
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
0.001419,-0.001995,0.001684,0.001813,0.000758,-0.002677,0.000034,0.000011,-0.002958,0.001042,-0.001326,0.000762,-0.012495,0.000693,-0.000111,-0.000753,0.003236,0.000573,-0.00494,0.00218,-0.000248,0.000474,0.000838,-0.00208,-0.004214,0.000918,0.00017,0.004438,-0.000815,0.000169,-0.00207,-0.000144,0.001424,0.000148,-0.000064,-0.001683,-0.000485,-0.000922,0.001478,-0.000832,0.004365,-0.001676,0.001676,-0.00006,0.0,0.000182,-0.004921,-0.002656,0.000445
0.000368,-0.005313,-0.000546,0.000867,0.005722,0.002439,-0.00047,-0.006267,-0.002105,-0.003145,0.003709,0.001354,-0.008026,0.001825,0.000059,0.001308,0.005467,-0.002798,-0.002979,-0.005265,-0.000696,0.001164,0.002759,0.005051,-0.004289,0.00108,-0.003506,0.007185,0.000384,-0.000568,-0.000713,0.0,0.004504,0.000719,-0.003933,-0.00548,0.000657,-0.003914,0.000262,-0.00084,0.00097,-0.005627,0.005627,0.001871,0.000182,0.0,0.0,0.007113,0.001313
-0.009499,0.001908,-0.000842,0.007819,-0.010199,-0.008917,0.003799,-0.005693,0.000216,-0.000508,-0.00568,-0.006712,-0.120529,-0.002199,-0.002074,-0.005279,-0.010596,0.005286,0.005187,-0.015306,-0.000732,-0.001794,-0.009712,0.006391,0.006774,0.010312,0.003009,-0.008162,-0.003225,-0.001748,0.002792,-0.004574,0.004843,-0.003348,0.009411,0.001064,-0.003511,-0.00543,-0.012777,0.001252,-0.019826,-0.000378,0.000378,0.001172,-0.004921,0.0,0.0,0.537987,-0.006267
-0.027562,-0.008662,0.010844,-0.002872,0.001129,0.003356,-0.005323,0.019286,0.003874,-0.010211,0.008543,-0.033734,0.139597,-0.025645,0.006897,0.011182,0.010221,0.006526,-0.010114,0.007181,-0.00763,0.017274,-0.003563,0.051782,0.017093,-0.010538,-0.015329,-0.024189,0.003367,0.009368,-0.011269,0.021517,0.009147,0.004723,-0.026616,0.009232,-0.004186,-0.022086,-0.369282,-0.017222,0.228173,-0.030247,0.030247,-0.003914,-0.002656,0.007113,0.537987,0.0,-0.017692
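
To help read the matrix above, here is a minimal sketch of mutual information between two discrete columns, computed once on a toy "real" pair and once on a toy "synthetic" pair. It illustrates the general quantity only: whether `compute_mutual_info` normalizes the values, and which sign convention `corr_difference` uses, are assumptions not confirmed by this notebook. All names and data below are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum over observed pairs of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Toy columns: in the "real" pair the two features move together,
# in the "synthetic" pair they are independent.
mi_real = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_synth = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
mi_difference = mi_real - mi_synth   # assumed convention: real minus synthetic
```

Entries close to zero in such a difference matrix indicate that the pairwise dependency between two features is similar in the real and synthetic data; larger absolute values, like the 0.537987 entry above, point to a relationship that the synthetic data reproduces less faithfully.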


#### 2.2 ML utility - Train on Synthetic Test on Real

In [16]:
# Verify the machine learning utility of the synthetic dataset on the classification task
# Use the held-out real (validation) dataset as the test set for the classification task

X_train = real_data_preprocessed.drop("label", axis=1)
y_train = real_data_preprocessed["label"]
X_synth = synth_data_preprocessed.drop("label", axis=1)
y_synth = synth_data_preprocessed["label"]
X_test  = valid_data_preprocessed.drop("label", axis=1)
y_test  = valid_data_preprocessed["label"]
TSTR_real, TSTR_synth, TSTR_delta = compute_utility_metrics_class(X_train, X_synth, X_test, y_train, y_synth, y_test)

Fitting original models:


100%|██████████| 31/31 [01:59<00:00,  3.87s/it]


Fitting synthetic models:


100%|██████████| 31/31 [01:46<00:00,  3.42s/it]
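
The idea behind TSTR can be sketched without the library: train one model on real data and one on synthetic data, then score both on the same held-out real test set and compare. The example below uses a deliberately simple nearest-centroid classifier on toy data. It is an illustration under stated assumptions, not the procedure used by `compute_utility_metrics_class` (which fits many models, as the progress bars above show); every name and value is hypothetical, and the sign convention of the delta is assumed.

```python
def fit_centroids(X, y):
    """Nearest-centroid 'model': the per-class mean of each feature."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, X):
    """Assign each row to the class whose centroid is closest."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data (illustrative only)
X_real_toy  = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_real_toy  = [0, 0, 1, 1]
X_synth_toy = [[0.1, 0.2], [0.0, 0.3], [0.8, 1.0], [1.1, 0.8]]
y_synth_toy = [0, 0, 1, 1]
X_test_toy  = [[0.1, 0.0], [0.3, 0.2], [0.9, 0.8], [1.2, 1.0]]
y_test_toy  = [0, 0, 1, 1]

# Train on real, test on real (baseline) vs. train on synthetic, test on real (TSTR)
acc_real  = accuracy(y_test_toy, predict(fit_centroids(X_real_toy, y_real_toy), X_test_toy))
acc_synth = accuracy(y_test_toy, predict(fit_centroids(X_synth_toy, y_synth_toy), X_test_toy))
delta = acc_real - acc_synth   # assumed convention: baseline minus TSTR
```

A delta close to zero suggests that a model trained on the synthetic data performs about as well on real data as one trained on the original data.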


## 3. Privacy assessment

#### 3.1 Distance to closest record (DCR)

In [17]:
# Compute the distances to closest record between the synthetic dataset and the real dataset
dcr_synth_train = distance_to_closest_record("synth_train", synth_data_preprocessed, real_data_preprocessed)
dcr_synth_valid = distance_to_closest_record("synth_val", synth_data_preprocessed, valid_data_preprocessed)
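
Conceptually, the distance to closest record of a synthetic row is the smallest distance between that row and any row of a reference dataset. The sketch below uses a simplified Gower distance for mixed numeric and categorical features. The `gower_distance` and `dcr` helpers, the toy records, and the way ranges are supplied are all assumptions made for illustration; the actual SURE implementation may differ in its details.

```python
def gower_distance(a, b, ranges):
    """Simplified Gower distance between two records.

    Numeric features contribute |a - b| / range; categorical features
    (range is None) contribute 0 on a match and 1 on a mismatch.
    The result is the average over all features.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r
    return total / len(ranges)

def dcr(queries, reference, ranges):
    """For each query record, the distance to its closest reference record."""
    return [min(gower_distance(q, r, ranges) for r in reference) for q in queries]

# Toy records: (age, work_class); values are illustrative only.
train_toy = [(25, "Private"), (40, "State-gov"), (60, "Private")]
synth_toy = [(25, "Private"), (45, "Private"), (40, "Local-gov")]
ranges = [35.0, None]   # age range over the toy data (60 - 25); None marks a categorical feature

dcr_toy = dcr(synth_toy, train_toy, ranges)
clones = sum(d == 0 for d in dcr_toy)   # synthetic records identical to a training record
```

In this toy example the first synthetic record is an exact copy of a training record, so its DCR is zero and it counts as a clone.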

In [20]:
# Check for any clones shared between the synthetic and real datasets (DCR=0).
dcr_zero_synth_train = number_of_dcr_equal_to_zero("synth_train", dcr_synth_train)
dcr_zero_synth_valid = number_of_dcr_equal_to_zero("synth_val", dcr_synth_valid)

In [23]:
# Compute some general statistics for the DCR arrays computed above
dcr_stats_synth_train = dcr_stats("synth_train", dcr_synth_train)
dcr_stats_synth_valid = dcr_stats("synth_val", dcr_synth_valid)

In [24]:
dcr_stats_synth_train

{'mean': 0.003190891584381461,
 'min': 2.2817469016445102e-06,
 '25%': 6.808942089264747e-06,
 'median': 1.0700799066398758e-05,
 '75%': 0.0002793967432808131,
 'max': 0.08193255215883255}
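
A summary with the same keys as the dictionary above can be obtained from any list of distances with the standard library. This is only a sketch of the kind of summary shown; the `summarize` helper is hypothetical and the quantile interpolation method used by `dcr_stats` is an assumption.

```python
from statistics import mean, median, quantiles

def summarize(distances):
    """Mean, extremes and quartiles of a list of DCR values."""
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    return {
        "mean": mean(distances),
        "min": min(distances),
        "25%": q1,
        "median": median(distances),
        "75%": q3,
        "max": max(distances),
    }

summary = summarize([0.0, 0.1, 0.2, 0.3, 0.4])   # toy distances
```

In the real output above the mean (about 0.0032) is far larger than the median (about 0.00001), which indicates a strongly right-skewed distribution: most synthetic records lie very close to some training record, while a few lie much farther away.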

In [25]:
# Compute the share of records that are closer to the training set than to the validation set
# For this task you also need to compute the DCR between the synthetic dataset and the validation dataset
share = validation_dcr_test(dcr_synth_train, dcr_synth_valid)

In [26]:
share
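
The quantity asked for here can be sketched as follows: for each synthetic record, compare its DCR to the training set with its DCR to the validation set, and report the percentage of records that are closer to the training set. The helper name, the toy values, and the treatment of ties (counted as not closer) are assumptions; `validation_dcr_test` may handle these details differently.

```python
def share_closer_to_training(dcr_train, dcr_valid):
    """Percentage of synthetic records strictly closer to the training set."""
    closer = sum(t < v for t, v in zip(dcr_train, dcr_valid))
    return 100.0 * closer / len(dcr_train)

# Toy DCR values for four synthetic records (illustrative only)
dcr_train_toy = [0.00, 0.10, 0.30, 0.20]
dcr_valid_toy = [0.05, 0.10, 0.25, 0.40]
share_toy = share_closer_to_training(dcr_train_toy, dcr_valid_toy)
```

If the generative model has not memorized its training data, synthetic records should be about as close to the validation set as to the training set, so a share near 50% is the usual expectation. A share well above 50% suggests that the synthetic data sits closer to the training records than to unseen real records.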



#### 3.2 Membership Inference Attack test

In [27]:
# Simulate a Membership Inference Attack on your synthetic dataset
# To do so, you'll need to produce an adversary dataset and the ground-truth labels for the adversary's guesses
adv_data = adversary_dataset(real_data_preprocessed, valid_data_preprocessed)
# hint: the label is automatically produced by the function adversary_dataset and is added as a column named 
# "privacy_test_is_training" in the adversary dataset returned
adv_guesses_ground_truth = adv_data["privacy_test_is_training"] 

MIA = membership_inference_test(adv_data, synth_data_preprocessed, adv_guesses_ground_truth)

In [28]:
MIA

{'adversary_distance_thresholds': [0.00034372034133411944,
  1.1110747209386318e-05,
  8.510365660185926e-06,
  4.34845833297004e-06],
 'adversary_precisions': [0.7526617526617526, 1.0, 1.0, 1.0],
 'membership_inference_mean_risk_score': 0.8763308763308764}
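
The numbers above are consistent with a simple reading of the test: for each distance threshold, the adversary flags a record as a training member when its distance to the synthetic data falls below the threshold, and the precision of those guesses is recorded. The reported mean risk score can then be reproduced by rescaling each precision so that 0.5 (random guessing) maps to 0 and 1.0 maps to 1, and averaging. Both the attack logic and the rescaling are assumptions inferred from the output, not taken from the library source; the helper name and toy inputs are hypothetical.

```python
def attack_precision(distances, is_training, threshold):
    """Precision of guessing 'training member' whenever distance < threshold."""
    guesses = [d < threshold for d in distances]
    flagged = sum(guesses)
    true_pos = sum(g and m for g, m in zip(guesses, is_training))
    return true_pos / flagged if flagged else 0.0   # assumed value when nothing is flagged

# Toy attack: two records fall below the threshold, one of them is a real member
toy_precision = attack_precision([0.01, 0.2, 0.02, 0.3], [True, True, False, False], 0.05)

# 'adversary_precisions' copied from the output above
precisions = [0.7526617526617526, 1.0, 1.0, 1.0]

# Assumed rescaling: precision 0.5 -> risk 0, precision 1.0 -> risk 1
risks = [max(0.0, 2.0 * (p - 0.5)) for p in precisions]
mean_risk = sum(risks) / len(risks)
```

Under this reading, `mean_risk` agrees with the `membership_inference_mean_risk_score` reported above, and a score this close to 1 indicates that the simulated adversary identifies training members far better than chance.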

## 4. Utility-Privacy report

In [6]:
# Produce the utility privacy report with the information computed above
report(real_data, synth_data)


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://137.204.143.95:8501
  External URL: http://137.204.143.95:8501





Thanks for taking part in the SURE library testing! Your feedback is of great value to us in improving the library!

When you have finished testing the library, make sure that the answers to the questions are visible and then upload the notebook to the Google Drive folder!