# SURE library use case notebook
Useful links:
- [Github repo](https://github.com/Clearbox-AI/SURE)
- [Documentation](https://dario-brunelli-clearbox-ai.notion.site/SURE-Documentation-2c17db370641488a8db5bce406032c1f)
- [Datasets for testing](https://drive.google.com/drive/folders/1OP1TsRHOCRVc2znSV45kHiiX7Niz2uQ2)

Download the datasets and try out the library with this step-by-step guided use case.

We would greatly appreciate your feedback to help us improve the library! \
If you encounter any issues, please open an issue on our [GitHub repository](https://github.com/Clearbox-AI/SURE).

### Datasets description

The three datasets provided are the following:

- *census_dataset_training.csv* \
    The original real dataset used to train the generative model from which *census_dataset_synthetic* was produced.
    
- *census_dataset_validation.csv* \
    This dataset was also part of the original real dataset, but it was not used to train the generative model that produced *census_dataset_synthetic*.
    
- *census_dataset_synthetic.csv* \
    The synthetic dataset produced with the generative model trained on *census_dataset_training*.
    

The three census datasets include various demographic, social, economic, and housing characteristics of individuals. Every row of the datasets corresponds to an individual.

The machine learning task related to these datasets is a classification task: based on all the features, an ML classifier must decide whether the individual earns more than 50k dollars per year (label=1) or not (label=0).\
The column "label" in each dataset is the ground truth for this classification task.

## 0. Installing the library and importing dependencies 

In [None]:
# install the SURE library 
%pip install clearbox-sure

In [1]:
# importing dependencies
import polars as pl # you can use polars or pandas for importing the datasets

from sure import Preprocessor, report
from sure.utility import (compute_statistical_metrics, compute_mutual_info,
                          compute_utility_metrics_class)
from sure.privacy import (distance_to_closest_record, dcr_stats, number_of_dcr_equal_to_zero, validation_dcr_test,
                          adversary_dataset, membership_inference_test)

## 1. Dataset import and preparation

#### 1.1 Import the datasets

In [2]:
real_data = pl.scan_csv("census_dataset_training.csv")
valid_data = pl.scan_csv("census_dataset_validation.csv")
synth_data = pl.scan_csv("census_dataset_synthetic.csv")

#### 1.2 Datasets preparation

In [None]:
# Prepare the datasets with the Preprocessor

# Real dataset - Preprocessor initialization and query execution
preprocessor            = Preprocessor(real_data, get_discarded_info=False)
real_data_preprocessed  = preprocessor.transform(real_data, num_fill_null='forward', scaling='standardize')

# Validation dataset - Preprocessor initialization and query execution
preprocessor            = Preprocessor(valid_data, get_discarded_info=False)
valid_data_preprocessed = preprocessor.transform(valid_data, num_fill_null='forward', scaling='standardize')

# Synthetic dataset - Preprocessor initialization and query execution
preprocessor            = Preprocessor(synth_data, get_discarded_info=False)
synth_data_preprocessed = preprocessor.transform(synth_data, num_fill_null='forward', scaling='standardize')
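
To make the two options passed to `transform` more concrete, here is a minimal standard-library sketch of what forward-filling nulls and standardizing a numerical column usually mean. It is an illustration under assumptions, not the `Preprocessor` implementation: the helpers `forward_fill` and `standardize` and the toy column `age` are hypothetical, and whether the library uses the population or the sample standard deviation is not stated in this notebook.

```python
from statistics import mean, pstdev

def forward_fill(values):
    """Replace each None with the most recent non-null value before it.

    A None at the very start has nothing to copy from and stays None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def standardize(values):
    """Rescale values to zero mean and unit standard deviation.

    Assumption: the population standard deviation is used here.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy column with missing entries
age = [20.0, None, 40.0, None, 60.0]

age_filled = forward_fill(age)          # [20.0, 20.0, 40.0, 40.0, 60.0]
age_scaled = standardize(age_filled)    # mean 0, standard deviation 1
```

Standardized columns matter later in the notebook because distance-based privacy metrics would otherwise be dominated by features with large numeric ranges.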

## 2. Utility assessment

#### 2.1 Statistical properties and mutual information

In [6]:
# Compute statistical properties and features mutual information
num_features_stats, cat_features_stats, temporal_feat_stats = compute_statistical_metrics(real_data_preprocessed, synth_data_preprocessed)
corr_real, corr_synth, corr_difference                      = compute_mutual_info(real_data_preprocessed, synth_data_preprocessed)
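
For intuition about what a mutual information score measures, here is a minimal standard-library sketch for a single pair of discrete columns. It is not the implementation behind `compute_mutual_info`, whose estimator and normalization are not described in this notebook; the function `mutual_information` and the toy columns `education`, `label_dep`, and `label_ind` are hypothetical names introduced for illustration.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over the observed pairs
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical toy columns
education = ["hs", "hs", "phd", "phd"]
label_dep = [0, 0, 1, 1]   # fully determined by education
label_ind = [0, 1, 0, 1]   # independent of education

mi_dep = mutual_information(education, label_dep)   # equals log(2)
mi_ind = mutual_information(education, label_ind)   # equals 0
```

Computing such a score for every pair of columns in the real and in the synthetic dataset, and then comparing the two results, shows whether the synthetic data preserves the dependencies between features.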

#### 2.2 ML utility - Train on Synthetic, Test on Real

In [None]:
# Assessing the machine learning utility of the synthetic dataset on the classification task

# ML utility: TSTR - Train on Synthetic, Test on Real
X_train      = real_data_preprocessed.drop("label") # Assuming the datasets have a "label" column for the machine learning task they are intended for
y_train      = real_data_preprocessed["label"]
X_synth      = synth_data_preprocessed.drop("label")
y_synth      = synth_data_preprocessed["label"]
X_test       = valid_data_preprocessed.drop("label").limit(10000) # Test the trained models on a portion of the original real dataset (first 10k rows)
y_test       = valid_data_preprocessed["label"].limit(10000)
TSTR_metrics = compute_utility_metrics_class(X_train, X_synth, X_test, y_train, y_synth, y_test)
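
The idea behind TSTR can be sketched with a toy example: train the same kind of model once on real data and once on synthetic data, score both on held-out real data, and compare. The sketch below uses a 1-nearest-neighbour classifier purely for illustration; it does not reflect the models or metrics used inside `compute_utility_metrics_class`, and all names and data points are hypothetical.

```python
from math import dist

def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[nearest]

def accuracy(train_X, train_y, test_X, test_y):
    """Fit 1-NN on the training data and return its accuracy on the test data."""
    hits = sum(predict_1nn(train_X, train_y, x) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Hypothetical toy data: two features, binary label
real_X  = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1)]
real_y  = [0, 0, 1, 1]
synth_X = [(0.1, 0.2), (0.3, 0.3), (1.1, 1.0), (0.8, 0.9)]
synth_y = [0, 1, 1, 1]    # the second synthetic row carries a noisy label
test_X  = [(0.1, 0.0), (0.3, 0.2), (0.9, 1.0), (1.2, 0.8)]
test_y  = [0, 0, 1, 1]

trtr = accuracy(real_X, real_y, test_X, test_y)     # Train on Real, Test on Real
tstr = accuracy(synth_X, synth_y, test_X, test_y)   # Train on Synthetic, Test on Real
gap  = trtr - tstr                                  # utility lost by using synthetic data
```

A small gap between the two scores suggests that the synthetic dataset can stand in for the real one on this task.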

## 3. Privacy assessment

#### 3.1 Distance to closest record (DCR)

In [10]:
# Compute the distances to closest record between the synthetic dataset and the real dataset
# and the distances to closest record between the synthetic dataset and the validation dataset

dcr_synth_train = distance_to_closest_record("synth_train", synth_data_preprocessed, real_data_preprocessed)
dcr_synth_valid = distance_to_closest_record("synth_val", synth_data_preprocessed, valid_data_preprocessed)
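
As a rough illustration of the concept, the distance to closest record of a synthetic row is the smallest distance between that row and any row of a reference dataset. The brute-force sketch below assumes purely numerical rows and Euclidean distance; the metric actually used by `distance_to_closest_record` (which also has to handle categorical features) may differ, and the helper `dcr` and the toy rows are hypothetical.

```python
from math import dist

def dcr(query_rows, reference_rows):
    """For each query row, the Euclidean distance to its nearest reference row."""
    return [min(dist(q, r) for r in reference_rows) for q in query_rows]

# Hypothetical toy rows with two numerical features
synth = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
train = [(0.0, 0.0), (1.0, 2.0)]

dcr_values = dcr(synth, train)
# The first synthetic row is an exact copy of a training row, so its DCR is 0.
```

This brute-force version compares every pair of rows, so it is only practical for small datasets.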

In [None]:
# Check for any clones shared between the synthetic and real datasets (DCR=0).

dcr_zero_synth_train  = number_of_dcr_equal_to_zero("synth_train", dcr_synth_train)
dcr_zero_synth_valid  = number_of_dcr_equal_to_zero("synth_val", dcr_synth_valid)

In [None]:
# Compute some general statistics for the DCR arrays computed above

dcr_stats_synth_train = dcr_stats("synth_train", dcr_synth_train)
dcr_stats_synth_valid = dcr_stats("synth_val", dcr_synth_valid)
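
The kind of summary produced by the two cells above can be sketched with the standard library. This is an illustration on a hypothetical list of DCR values; the exact fields returned by `number_of_dcr_equal_to_zero` and `dcr_stats` are not shown in this notebook and may differ.

```python
from statistics import mean, quantiles

# Hypothetical DCR values for eight synthetic rows
dcr_values = [0.0, 0.0, 0.2, 0.4, 0.5, 0.7, 0.9, 1.3]

# Clones: synthetic rows that coincide exactly with a reference row
n_clones = sum(d == 0 for d in dcr_values)
share_clones = n_clones / len(dcr_values)

# Quartiles using linear interpolation that includes both endpoints
q1, median, q3 = quantiles(dcr_values, n=4, method="inclusive")

summary = {
    "min": min(dcr_values),
    "q1": q1,
    "median": median,
    "q3": q3,
    "mean": mean(dcr_values),
    "max": max(dcr_values),
}
```

With real floating-point distances you may prefer to count values below a small tolerance rather than values exactly equal to zero.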

In [None]:
# Compute the share of records that are closer to the training set than to the validation set

share = validation_dcr_test(dcr_synth_train, dcr_synth_valid)
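
The logic of this test can be sketched as follows: for each synthetic row, compare its DCR to the training set with its DCR to the validation set and count how often the training set is closer. The sketch assumes a result expressed as a percentage and counts ties as "not closer"; both choices are assumptions, as is the helper name `share_closer_to_training`, and `validation_dcr_test` may behave differently.

```python
def share_closer_to_training(dcr_train, dcr_valid):
    """Percentage of synthetic rows strictly closer to training than to validation."""
    closer = sum(t < v for t, v in zip(dcr_train, dcr_valid))
    return 100.0 * closer / len(dcr_train)

# Hypothetical DCR values for four synthetic rows
dcr_train = [0.1, 0.5, 0.3, 0.8]   # distances to the training set
dcr_valid = [0.2, 0.4, 0.6, 0.7]   # distances to the validation set

share_pct = share_closer_to_training(dcr_train, dcr_valid)
```

When the training and validation sets are of comparable size and come from the same distribution, a share around 50% indicates that the synthetic rows are no closer to the training data than to unseen data. A share well above 50% suggests that the generative model may have memorized training records.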

#### 3.2 Membership Inference Attack test

In [None]:
# Simulate a Membership Inference Attack on your synthetic dataset.
# To do so, you need an adversary dataset and ground-truth labels against which the adversary's guesses are scored.

# The function adversary_dataset produces the label automatically and adds it to the returned
# adversary dataset as a column named "privacy_test_is_training", indicating whether each record
# was part of the training set or not.

# ML privacy attack sandbox initialization and simulation
# (the result is stored under a new name so that the imported adversary_dataset function is not overwritten)
adversary_df = adversary_dataset(real_data_preprocessed, valid_data_preprocessed)

adversary_guesses_ground_truth = adversary_df["privacy_test_is_training"]
MIA = membership_inference_test(adversary_df, synth_data_preprocessed, adversary_guesses_ground_truth)
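
One simple way to picture a membership inference attack is a distance-based adversary: a record is guessed to be a training member when it lies very close to some synthetic row, and the guesses are then scored against the ground truth. The sketch below follows that idea on hypothetical data. The attack strategy, the threshold, and the precision metric are assumptions chosen for illustration and are not taken from `membership_inference_test`.

```python
def membership_guesses(dcr_to_synth, threshold):
    """Guess 'member' (True) when a record is within threshold of a synthetic row."""
    return [d <= threshold for d in dcr_to_synth]

def precision(guesses, truth):
    """Fraction of 'member' guesses that are actual training records."""
    flagged = [t for g, t in zip(guesses, truth) if g]
    return sum(flagged) / len(flagged) if flagged else 0.0

# Hypothetical adversary dataset: each record's distance to the closest
# synthetic row, plus the ground truth about training membership
dcr_to_synth = [0.05, 0.40, 0.10, 0.90, 0.08, 0.60]
is_training  = [True, True, False, False, True, False]

guesses = membership_guesses(dcr_to_synth, threshold=0.1)
attack_precision = precision(guesses, is_training)   # 2 of the 3 flagged records are members
```

If half of the adversary dataset comes from the training set, a precision close to 0.5 means the adversary does no better than random guessing, while values approaching 1 point to a privacy risk.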

## 4. Utility-Privacy report

In [None]:
# Produce the utility privacy report with the information computed above

report(real_data, synth_data)