# Differentially Private Data Science - What, Why And How?

Differential privacy (DP) has become a standard framework for rigorous individual privacy protection. It provides a controlled way to inject calibrated noise into analyses of sensitive datasets, which can help organizations work toward the requirements of regulations like GDPR or CCPA. The core idea behind DP is that the addition or removal of any individual from a sensitive dataset should not significantly affect the result of an analysis. This property provides a strong privacy guarantee that helps analysts and scientists defend against linkage attacks, which is crucial in domains such as research, medicine, census work and finance. In this tutorial, we will explore how this mathematical framework works, how to implement it and some critical ideas around it.

## What Is Differential Privacy?

- Differential privacy is used to defend against linkage attacks, in which an attacker combines released results with outside information to learn about specific individuals.
- By definition, the presence or absence of any individual record in the dataset should not significantly affect the mechanism's outcome.
- We call "mechanism" any computation that can be performed on the data. Differential privacy deals with randomized mechanisms, which are analyses whose output varies probabilistically for a given input.
- Thus, a mechanism is considered differentially private if the probability of any outcome occurring is nearly identical for any two datasets that differ in only one record.
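
The last point can be written more formally. As a sketch in standard notation (the symbols $M$, $D$, $D'$, $S$ and $\varepsilon$ are introduced here for illustration), a randomized mechanism $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in one record and every set of possible outcomes $S$:

```latex
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

A smaller $\varepsilon$ forces the two probabilities closer together, which corresponds to a stronger privacy guarantee.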

![](https://www.anonos.com/hs-fs/hubfs/differential/Diff.1.webp?width=834&height=429&name=Diff.1.webp)

Let's approach differential privacy by solving a seemingly simple problem: how do we report the total number of data points in a sensitive dataset? Put another way, how do we release a value close enough to the true count without giving away the real answer?

First, why would we even need to hide this information? In some cases, knowing the exact number of records in a dataset can reveal sensitive information. For instance, in a medical study about a rare disease, if an attacker knows the total count and can find out that it increased by one after a specific person joined the study, they could infer that person's medical condition.

Similarly, other statistics like the mean can also pose risks. In a salary database, if an attacker knows the mean salary before and after a new employee joins, they could potentially calculate that individual's exact salary. Another example is in voting data: if the mean age of voters in a small district is known, and then changes slightly after one person votes, their age and voting status could be deduced. These scenarios illustrate why protecting even basic aggregate statistics is crucial in maintaining individual privacy and preventing unintended disclosure of sensitive information.
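
The salary scenario can be made concrete with a few lines of arithmetic. The sketch below assumes the attacker also knows the head count before the new hire joined; the salary figures and the function name `infer_new_value` are made up for illustration.

```python
def infer_new_value(mean_before, n_before, mean_after):
    """Recover the single value added between two exact mean releases."""
    total_before = mean_before * n_before
    total_after = mean_after * (n_before + 1)
    return total_after - total_before


# Hypothetical salaries, made up for illustration
salaries = [52_000, 61_000, 58_000, 70_000]
new_hire_salary = 95_000

# The two exact statistics the attacker gets to see
mean_before = sum(salaries) / len(salaries)
mean_after = (sum(salaries) + new_hire_salary) / (len(salaries) + 1)

recovered = infer_new_value(mean_before, len(salaries), mean_after)
print(f"Recovered salary: {recovered:.2f}")
```

Two exact means and a head count are enough to reconstruct the new employee's salary, which is why releasing unprotected aggregates can be risky.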

So, going back to hiding the true count: adding random noise to the result masks it in a way that cannot be reliably undone. This approach forms the foundation of differential privacy.


By introducing carefully calibrated random noise to our count, we can obscure the true value while still providing a useful approximation. The amount of noise added is typically drawn from a probability distribution, such as the Laplace distribution, which is commonly used in differential privacy implementations.

The key idea is that the noise should be large enough to mask the presence or absence of any single individual in the dataset, but small enough to maintain the utility of the data for analysis purposes. This is where the epsilon parameter comes into play, controlling the trade-off between privacy and accuracy.
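
To get a feel for this trade-off, the sketch below draws Laplace noise for a few epsilon values and measures the average error it introduces. It assumes a sensitivity of 1 (as for a count) and a noise scale of sensitivity divided by epsilon; the helper name `laplace_scale`, the seed and the chosen epsilon values are illustrative.

```python
import numpy as np


def laplace_scale(sensitivity, epsilon):
    """Scale parameter of the Laplace noise for a given sensitivity and epsilon."""
    return sensitivity / epsilon


rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
sensitivity = 1.0  # a count changes by at most 1 when one record is added or removed

results = {}
for epsilon in (0.01, 0.1, 1.0, 10.0):
    scale = laplace_scale(sensitivity, epsilon)
    noise = rng.laplace(loc=0.0, scale=scale, size=100_000)
    # For Laplace noise, the expected absolute error equals the scale
    results[epsilon] = float(np.mean(np.abs(noise)))
    print(f"epsilon={epsilon:>5}: scale={scale:>6.1f}, mean |error|={results[epsilon]:.2f}")
```

Smaller epsilon values give stronger privacy but noticeably noisier answers, while larger values give accurate answers with weaker protection.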

## What Are Sensitivity And Epsilon in Differential Privacy?

```python
import numpy as np

# Example: Applying ε-DP to census data
def add_laplace_noise(true_count, epsilon):
    # Adding or removing one person changes a count by at most 1
    sensitivity = 1.0
    # Smaller epsilon means a larger noise scale and stronger privacy
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_count + noise

# Usage
true_population = 1000
epsilon = 0.05
private_population = add_laplace_noise(true_population, epsilon)

print(f"True population: {true_population}")
print(f"Private population: {private_population}")
```

```
True population: 1000
Private population: 1018.3669040577329
```
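
The function above is an instance of what is usually called the Laplace mechanism. As a sketch in standard notation (the symbols $f$, $D$, $D'$, $\Delta f$ and $M$ are introduced here for illustration), the sensitivity $\Delta f$ is the largest change in the query result $f$ between two datasets $D$ and $D'$ that differ in one record, and the mechanism adds Laplace noise whose scale is $\Delta f / \varepsilon$:

```latex
\Delta f = \max_{D,\,D'} \lvert f(D) - f(D') \rvert,
\qquad
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)
```

For a count, $\Delta f = 1$, so $\varepsilon = 0.05$ gives a noise scale of 20, and an offset of about 18 like the one in the output above is unsurprising.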


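```python
import numpy as np

# Example: Differentially private mean calculation
def private_mean(data, epsilon, lower, upper):
    # The bounds must be chosen without looking at the data: deriving them
    # from max(data) and min(data) would itself leak information.
    clipped = np.clip(data, lower, upper)
    n = len(clipped)  # assumes the number of records is public
    # Replacing one record moves the mean by at most (upper - lower) / n
    sensitivity = (upper - lower) / n
    noise = np.random.laplace(0, sensitivity / epsilon)
    return np.mean(clipped) + noise

# Usage
patient_ages = [25, 30, 35, 40, 45, 50, 55, 60]
epsilon = 0.5
# Assumed plausible age range, fixed before seeing the data
private_avg_age = private_mean(patient_ages, epsilon, lower=0, upper=100)

print(f"True average age: {np.mean(patient_ages)}")
print(f"Differentially private average age: {private_avg_age}")
```

```
True average age: 42.5
Differentially private average age: 39.48729397508328
```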
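
Why is the range divided by the number of records? As a sketch, assume every value lies in a known range $[a, b]$, the number of records $n$ is public, and neighboring datasets differ by replacing one record $x_i$ with $x'_i$ (these assumptions and symbols are introduced here for illustration). All other terms cancel, so the mean can move by at most:

```latex
\left| \frac{1}{n}\sum_{j} x_j - \frac{1}{n}\sum_{j} x'_j \right|
= \frac{\lvert x_i - x'_i \rvert}{n}
\le \frac{b - a}{n}
```

The bound only holds if $a$ and $b$ are fixed independently of the data, for example from domain knowledge about plausible ages.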


## The Importance of Epsilon Shown Visually

## Online Games Illustrating Epsilon's Importance

## Conclusion