In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os, wget, shutil

# Notebook overview

## Differential Privacy and Privacy Protection
Differential Privacy (DP) is a mathematical framework that provides strong, quantifiable privacy guarantees when analyzing and sharing data. Its core idea is to ensure that including or excluding a single individual's data does not significantly affect the output of an analysis. This limits an attacker's ability to infer whether someone participated in the dataset, even if the attacker has additional background information. To achieve this, DP algorithms inject carefully calibrated random noise into computed statistics, such as the leave-one-out (LOO) mean, making it mathematically difficult to isolate individual contributions.
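
For reference, the standard formal statement of this guarantee can be written as below. The symbols $\mathcal{M}$ (the randomized mechanism), $D$ and $D'$ (two datasets that differ in one individual's record), $S$ (any set of possible outputs), and $\varepsilon$ (the privacy budget) are notation introduced here for illustration and do not appear elsewhere in this notebook.

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
$$

A mechanism satisfying this for all such $D$, $D'$, and $S$ is called $\varepsilon$-differentially private. A smaller $\varepsilon$ forces the two output distributions closer together and therefore gives a stronger guarantee.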

In this section we show how DP can be applied to data. This example revolves around anonymizing a dataset of individual heights that contains significant outliers. 

## The goal

The goal is to release useful results from the data without compromising the privacy of any individual, especially in high-risk or adversarial environments.

Below, we use the individual heights from usecase-2.1.


In [2]:
os.makedirs("data", exist_ok=True)
link_original = "https://s3.amazonaws.com/openneuro.org/ds004148/participants.tsv?versionId=wt81Mu2B3fdeiXSis5ym288A64lXRXkR"
wget.download(link_original)
filename = "participants.tsv"
file_ = [os.path.join(root, file) for root, _, files in os.walk(os.getcwd()) for file in files if file == filename]
shutil.copy2(file_[0], "data")
os.remove(file_[0])
print("\nOriginal file downloaded.")

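# participants.tsv is tab-separated; delim_whitespace expects a bool, so use sep instead.
data = pd.read_csv(os.path.join("data", filename), sep="\t")["Height"].to_numpy()
# Drop missing heights; the result is a 1-D array.
clean_data = data[~np.isnan(data)]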


100% [..........................................................] 39886 / 39886
Original file downloaded.


## User Output

A data user on the SIESTA platform is typically a researcher from an institution who wants to answer a scientific question using sensitive data. Because the original data cannot be downloaded directly due to privacy concerns, the data user must develop their analysis pipeline using an anonymized dataset that mimics the structure of the real data. Once developed, the analysis is containerized, and the platform operator then executes the containerized pipeline on the original data.

After **differential privacy** techniques are applied to the results, they are shared back with the data user. If noise is injected into a result before release, for example by adding Laplace noise proportional to the L1 sensitivity of the mean, an attacker cannot recover the true values exactly.
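
As a point of comparison, the textbook Laplace mechanism for a mean can be sketched as follows. This is a minimal illustration, not the mechanism used in this notebook: the function name `laplace_mean`, the clipping bounds `lower` and `upper`, the privacy budget `epsilon`, and the example heights are all assumptions introduced here. It also assumes the number of records is public and that neighboring datasets differ by replacing one record.

```python
import numpy as np

def laplace_mean(values, lower, upper, epsilon, rng=None):
    """Release the mean of `values` with Laplace noise (illustrative sketch).

    `lower`, `upper`, and `epsilon` are assumed inputs, not taken from the notebook.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Clip so that every record lies in a known, bounded range.
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    # Replacing one record moves the mean of n clipped values
    # by at most (upper - lower) / n: the L1 sensitivity.
    sensitivity = (upper - lower) / len(clipped)
    # Laplace noise with scale = sensitivity / epsilon.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

heights = np.array([158.0, 162.5, 171.0, 149.5, 180.0])  # made-up values
noisy_mean = laplace_mean(heights, lower=100.0, upper=220.0, epsilon=1.0,
                          rng=np.random.default_rng(0))
print(noisy_mean)
```

Clipping is what bounds the sensitivity here: without it, a single extreme height could shift the mean by an arbitrary amount and no finite noise scale would suffice.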

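We implement a DP mechanism by adding Laplace noise to the leave-one-out (LOO) means. The noise scale is the standard deviation of the LOO estimates, and a draw is repeated until its magnitude is at least the sensitivity, defined here as the maximum deviation between the true mean and the LOO estimates.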


In [3]:
def user_output(data):

    true_mean = np.mean(data)
    return true_mean
    

In [4]:
loo_estimate = np.array([np.mean(np.delete(clean_data, i)) for i in range(len(clean_data))])

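def dp(loo_estimates):
    """Add Laplace noise to each column of leave-one-out (LOO) estimates."""
    n_cols = loo_estimates.shape[1]
    all_noise = np.full(n_cols, np.nan)
    noisy_user_output = np.empty(loo_estimates.shape)

    for col in range(n_cols):
        # Use distinct names so the input array is not overwritten inside the loop.
        column = loo_estimates[:, col]
        column_mean = user_output(column)
        loo_scale = np.std(column)
        # Sensitivity: largest deviation of any LOO estimate from their mean.
        sensitivity = np.max(np.abs(column_mean - column))

        # Redraw until the noise magnitude is at least the sensitivity.
        while True:
            noise = np.random.laplace(loc=0.0, scale=loo_scale)
            if abs(noise) >= sensitivity:
                break

        all_noise[col] = noise
        # The same noise value is added to every estimate in the column.
        noisy_user_output[:, col] = column + noise

    # For a single column, return a 1-D array.
    return noisy_user_output[:, 0] if n_cols == 1 else noisy_user_output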



noisy_user_output = dp(loo_estimate.reshape(-1, 1))
list(noisy_user_output)

[159.96687557898008,
 160.37038435090992,
 160.1247703158222,
 160.33529663161167,
 160.08968259652394,
 160.17740189476956,
 160.15985803512044,
 160.01950715792745,
 160.19494575441868,
 160.2826650526643,
 160.2826650526643,
 160.3528404912608,
 160.37038435090992,
 160.24757733336605,
 160.1423141754713,
 160.19494575441868,
 160.23003347371693,
 160.2124896140678,
 160.07213873687482,
 159.93178785968183,
 160.2826650526643,
 160.2826650526643,
 160.2826650526643,
 160.03705101757657,
 160.24757733336605,
 160.2124896140678,
 160.0545948772257,
 160.1423141754713,
 160.2826650526643,
 160.2826650526643,
 160.10722645617307,
 159.84406856143625,
 159.9844194386292,
 159.9844194386292,
 159.89670014038362,
 160.01950715792745,
 160.19494575441868,
 160.01950715792745,
 159.93178785968183,
 160.19494575441868,
 162.07213873687482,
 159.84406856143625,
 163.0607352281029,
 160.15985803512044,
 160.40547207020816,
 160.03705101757657,
 160.0545948772257,
 160.01950715792745,
 160.15985

## Simulation
Here we simulate a potential privacy attack scenario where an adversary attempts to reconstruct the original individual data values from noisy LOO means.

To protect privacy, Laplace noise is added to the LOO means using the DP mechanism. However, even after adding noise, it's important to evaluate how well the attacker could still reconstruct the original values. 

The DP mechanism is randomized: each execution adds different noise. We therefore run a Monte Carlo (MC) simulation with 1000 repetitions to:

- Simulate many possible outcomes of the noisy LOO release.
- Perform reconstruction attempts based on each noisy release.
- Visualize the distribution of attacker guesses for each individual’s value.

This allows us to:

- Understand the variability in the reconstructions.
- Check whether the true values are consistently hidden within a wide distribution.
- Provide empirical evidence that the data remains private on average, even if some reconstructions appear close by chance.
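
The procedure above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual simulation: the five heights are made up, `noise_scale` is an arbitrary Laplace scale, and `reconstruct` is a hypothetical helper. Each simulated release shares a single noise draw across all LOO means, mirroring how `dp` adds one noise value to every estimate, but without its rejection step.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([150.0, 160.0, 165.0, 172.0, 210.0])  # made-up heights, one outlier
n = len(true_values)

# Exact leave-one-out (LOO) means: (sum - x_i) / (n - 1).
loo_means = (true_values.sum() - true_values) / (n - 1)

def reconstruct(loo):
    """Attacker's guess (hypothetical helper).

    The LOO means sum to the total of the data, so
    x_i = sum(loo) - (n - 1) * loo_i.
    """
    return loo.sum() - (len(loo) - 1) * loo

n_runs = 1000
noise_scale = 1.0  # assumed Laplace scale, chosen only for illustration
guesses = np.empty((n_runs, n))
for run in range(n_runs):
    # One noisy release: a single Laplace draw added to every LOO mean.
    noise = rng.laplace(loc=0.0, scale=noise_scale)
    guesses[run] = reconstruct(loo_means + noise)

# Per-individual summary of the attacker's guesses across all runs.
print("mean guess:", guesses.mean(axis=0))
print("std of guess:", guesses.std(axis=0))
```

Without noise, `reconstruct(loo_means)` returns the original values exactly, which is why the LOO means cannot be released as they are. A histogram of each column of `guesses` would show the distribution of attacker guesses for that individual.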