# SURE library testing notebook
Welcome to the SURE library testing notebook!

Use this notebook and the hints in the code cells as a guideline to test the functionality of the SURE library for synthetic data utility and privacy assessment. \
Feel free to explore and experiment with the library's features beyond these suggestions!

In addition to this notebook and the [SURE library documentation](https://dario-brunelli-clearbox-ai.notion.site/SURE-Documentation-2c17db370641488a8db5bce406032c1f), you are provided with a Google Drive folder containing:
- a final questionnaire to gather your feedback
- a consent form for processing your feedback information
- three *.csv* files (described below)

After completing the testing, please fill in the feedback questionnaire and upload this notebook, the questionnaire and the signed consent form to the Google Drive folder.

### Datasets description

The three datasets provided are the following:

- *census_dataset_training.csv*
    
    The original real dataset used to train the generative model from which *census_dataset_synthetic* was produced.
    
- *census_dataset_validation.csv*
    
    This dataset was also part of the original real dataset, but it was not used to train the generative model that produced *census_dataset_synthetic*.
    
- *census_dataset_synthetic.csv*
    
    The synthetic dataset produced with the generative model trained on *census_dataset_training*.
    

The three census datasets include various demographic, social, economic, and housing characteristics of individuals. Every row of the datasets corresponds to an individual.

The machine learning task related to these datasets is a classification task in which, based on all the features, an ML classifier must decide whether the individual earns more than 50k dollars per year (label=1) or less (label=0).\
The column "label" in each dataset is the ground truth for this classification task.

### Tasks

Below is a list of tasks. Please use them as general guidelines to proceed. Note that some tasks are deliberately open-ended to give you the freedom to approach them as you see fit and to test the clarity of the provided documentation:

1. Install the library.
2. Prepare the three datasets using the Preprocessor, adjusting its parameters as you deem best.
3. Assess the TSTR (Train on Synthetic, Test on Real) performance of the synthetic dataset on the classification task using the utility modules (see [Section 4.1](https://www.notion.so/4-ML-Utility-Metrics-ac98a1d294b1428f8b67936323c7c569?pvs=21) of the documentation).
4. Evaluate the vulnerability of the synthetic dataset provided to membership inference attacks.
5. Generate and explore the final report.

Some suggestions on how to proceed are also available in the comments of this notebook's code cells.

To perform the tasks, please refer to the documentation provided.

Feel free to test the library in any other way you can think of to challenge its capabilities (e.g. by testing datasets other than the ones provided).

If you are unsure how to proceed during the testing, try searching for what you need in these reference links:
- [Documentation](https://dario-brunelli-clearbox-ai.notion.site/SURE-Documentation-2c17db370641488a8db5bce406032c1f)
- [GitHub page](https://github.com/Clearbox-AI/SURE)

### Questions
Answer the following questions with the results you found (just make sure that the answers are visible in the notebook before uploading it to the Google Drive folder):

1. What accuracy do you obtain on the given classification task when training on the original dataset, and what accuracy when training on the synthetic dataset?
2. Which machine learning model has the highest accuracy on the TSTR task?
3. What percentage of synthetic records did you find to be closer to the training set than to the validation set, based on their DCRs?
4. What is the Membership Inference (MI) mean risk score you found?

## 1. Dataset import and preparation

In [25]:
# Install the SURE library


#### 1.1 Import the datasets

In [2]:
# Import the datasets


#### 1.2 Datasets preparation

In [None]:
# Prepare the datasets with the Preprocessor


## 2. Utility assessment

#### 2.1 Statistical properties and mutual information

In [6]:
# Compute statistical properties
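
As a point of comparison for this step, here is a minimal sketch of what "statistical properties" can mean for a single numeric column: compute a few summary statistics on the real and the synthetic values and look at the differences. It uses only the Python standard library, not SURE; the helper `describe` and the toy age values are made up for illustration, and the library's own functions are described in the documentation.

```python
import statistics

def describe(column):
    """Basic summary statistics for one numeric column (hypothetical helper)."""
    return {
        "mean": statistics.fmean(column),
        "std": statistics.stdev(column),
        "min": min(column),
        "max": max(column),
    }

# Toy values, not taken from the census files
real_age = [25, 38, 47, 52, 61]
synth_age = [27, 36, 45, 55, 62]

real_stats = describe(real_age)
synth_stats = describe(synth_age)

# Difference (synthetic minus real) for each statistic
diff = {name: synth_stats[name] - real_stats[name] for name in real_stats}
print(diff)
```

Small differences across all columns suggest that the synthetic data preserves the marginal distributions of the real data.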


In [None]:
# Compute features mutual information
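
To make the quantity concrete, the sketch below computes the mutual information between two discrete columns from their empirical joint distribution. It is plain Python and independent of SURE; `mutual_information` and the toy columns are assumptions for illustration, and the library may estimate or normalize the score differently.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (count_x * count_y)
        mi += p_xy * log(count * n / (count_x[x] * count_y[y]))
    return mi

# Toy columns (made-up values, not taken from the census files)
education = ["hs", "hs", "college", "college"]
label_dependent = [0, 0, 1, 1]    # fully determined by education
label_independent = [0, 1, 0, 1]  # carries no information about education

mi_dependent = mutual_information(education, label_dependent)
mi_independent = mutual_information(education, label_independent)
print(mi_dependent, mi_independent)
```

Computing this score for every pair of features on the real and on the synthetic dataset gives two matrices that can be compared to check whether feature dependencies are preserved.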


#### 2.2 ML utility - Train on Synthetic, Test on Real

In [None]:
# Verify the machine learning utility of the synthetic dataset on the classification task
# Use the real dataset as validation set for the classification task
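
The TSTR idea can be sketched without any ML library: train the same model once on real data and once on synthetic data, then score both on held-out real data. The example below uses a nearest-centroid classifier as a deliberately simple stand-in (it is not one of the models SURE evaluates), and all function names and toy data are assumptions for illustration.

```python
from math import dist

def fit_centroids(X, y):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(centroids, X):
    """Assign each row to the class with the nearest centroid."""
    return [min(centroids, key=lambda c: dist(centroids[c], row)) for row in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data (made-up values): two numeric features, binary label
X_real_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_real_train = [0, 0, 1, 1]
X_synth = [[0.1, 0.0], [0.0, 0.2], [1.1, 1.0], [0.8, 0.9]]
y_synth = [0, 0, 1, 1]
X_real_test = [[0.1, 0.1], [0.95, 1.0], [0.0, 0.3], [1.2, 0.8]]
y_real_test = [0, 1, 0, 1]

# Train on Real, Test on Real (baseline)
trtr = accuracy(y_real_test, predict(fit_centroids(X_real_train, y_real_train), X_real_test))
# Train on Synthetic, Test on Real
tstr = accuracy(y_real_test, predict(fit_centroids(X_synth, y_synth), X_real_test))
print(trtr, tstr)
```

A TSTR accuracy close to the baseline indicates that the synthetic data retains the signal needed for the classification task.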


## 3. Privacy assessment

#### 3.1 Distance to closest record (DCR)

In [10]:
# Compute the distances to closest record between the synthetic dataset and the real dataset
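
For reference, the DCR of a synthetic record is its distance to the nearest record in the real dataset. The sketch below shows the idea on numeric toy data with the Euclidean distance; this metric is an assumption made for simplicity (a library handling mixed categorical and numeric columns may use a different one), and `dcr` is a hypothetical helper, not a SURE function.

```python
from math import dist

def dcr(queries, reference):
    """For each query row, the Euclidean distance to its nearest row in reference."""
    return [min(dist(q, r) for r in reference) for q in queries]

# Toy numeric records (made-up values)
synthetic = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
training = [[0.0, 0.0], [1.0, 2.0], [6.0, 8.0]]

dcr_synth_train = dcr(synthetic, training)
print(dcr_synth_train)
```

This brute-force version compares every pair of rows, so it is only practical for small datasets.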


In [None]:
# Check for any clones shared between the synthetic and real datasets (DCR=0).
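
Given a list of DCR values, counting clones amounts to counting zeros. A minimal sketch follows; the helper `count_clones`, its `tol` parameter, and the DCR values are hypothetical.

```python
def count_clones(dcr_values, tol=0.0):
    """Number of synthetic records whose DCR is at most tol (0.0 means exact copies)."""
    return sum(1 for d in dcr_values if d <= tol)

# Toy DCR values (made-up)
dcr_synth_train = [0.0, 1.0, 2.83, 0.0, 0.4]

n_clones = count_clones(dcr_synth_train)
print(n_clones)
```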


In [None]:
# Compute some general statistics for the DCR array computed above
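
As an illustration of which statistics are typically informative here, the sketch below summarizes a list of DCR values with the standard library. The helper `summarize` and the values are assumptions; the statistics reported by SURE may differ.

```python
import statistics

def summarize(values):
    """Minimum, quartiles, mean, and maximum of a list of distances."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }

# Toy DCR values (made-up)
dcr_values = [0.0, 0.4, 1.0, 2.0, 3.6]

summary = summarize(dcr_values)
print(summary)
```

A minimum of zero or a very low first quartile points to synthetic records that sit unusually close to real ones.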


In [None]:
# Compute the share of records that are closer to the training set than to the validation set
# For this task you also need to compute the DCR between the synthetic dataset and the validation dataset
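
The logic of this check can be sketched as follows: compute each synthetic record's DCR to the training set and to the validation set, then report the percentage of records for which the training distance is strictly smaller. The code is plain Python with Euclidean distance on toy numeric data; both helpers and the tie-handling rule (ties count as not closer) are assumptions, not SURE's implementation.

```python
from math import dist

def dcr(queries, reference):
    """For each query row, the Euclidean distance to its nearest row in reference."""
    return [min(dist(q, r) for r in reference) for q in queries]

def share_closer_to_training(synthetic, training, validation):
    """Percentage of synthetic rows strictly closer to training than to validation."""
    d_train = dcr(synthetic, training)
    d_val = dcr(synthetic, validation)
    closer = sum(1 for t, v in zip(d_train, d_val) if t < v)
    return 100.0 * closer / len(synthetic)

# Toy numeric records (made-up values)
synthetic = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
training = [[0.1, 0.0], [1.0, 1.2], [8.0, 8.0]]
validation = [[0.5, 0.5], [5.0, 5.1], [9.0, 9.1]]

share = share_closer_to_training(synthetic, training, validation)
print(share)
```

When the training and validation sets have similar sizes, a share near 50% suggests the synthetic records are no closer to the data the generator saw than to unseen data, while a much higher share can indicate memorization.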


#### 3.2 Membership Inference Attack test

In [None]:
# Simulate a Membership Inference Attack on your synthetic dataset
# To do so, you'll need to produce an adversary dataset and some labels as ground truth for the adversary's guesses

# hint: the label is automatically produced by the function adversary_dataset and is added as a column named 
# "privacy_test_is_training" in the adversary dataset returned
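
Conceptually, the attack can be sketched like this: the adversary holds a mix of records that were in the training set and records that were not, guesses "member" for every record that lies close to some synthetic record, and the guesses are scored against the known membership labels. The code below is a simplified, library-independent illustration; `build_adversary_dataset`, `guess_membership`, `precision`, the distance threshold, and the toy data are all assumptions and do not reproduce SURE's functions or risk score.

```python
from math import dist

def build_adversary_dataset(training, validation):
    """Mix known members (training rows) with non-members (validation rows)."""
    records = training + validation
    is_training = [True] * len(training) + [False] * len(validation)
    return records, is_training

def guess_membership(records, synthetic, threshold):
    """Guess 'member' when a record lies within threshold of some synthetic row."""
    return [min(dist(r, s) for s in synthetic) <= threshold for r in records]

def precision(truth, guesses):
    """Fraction of records flagged as members that really were in the training set."""
    flagged = [t for t, g in zip(truth, guesses) if g]
    return sum(flagged) / len(flagged) if flagged else 0.0

# Toy numeric records (made-up values)
training = [[0.0, 0.0], [1.0, 1.0]]
validation = [[4.0, 4.0], [1.1, 1.0]]
synthetic = [[0.05, 0.0], [1.0, 1.05], [7.0, 7.0]]

records, is_training = build_adversary_dataset(training, validation)
guesses = guess_membership(records, synthetic, threshold=0.2)
attack_precision = precision(is_training, guesses)
print(guesses, attack_precision)
```

With a balanced adversary dataset, a precision near 0.5 means the adversary does no better than random guessing; values well above that indicate that the synthetic data leaks membership information.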


## 4. Utility-Privacy report

In [None]:
# Produce the utility privacy report with the information computed above


Thank you for taking part in the SURE library testing! Your feedback is of great value in helping us improve the library.

When you have finished testing the library, make sure that the answers to the questions are visible and then upload the notebook to the Google Drive folder!