# Differentially Private Data Science - What, Why And How?

Differential privacy (DP) has become a standard framework for protecting the privacy of individuals in sensitive datasets. It provides a controlled way to inject calibrated noise into the results of computations on sensitive data, which can help organizations meet the requirements of privacy regulations such as GDPR or CCPA. The core idea behind DP is that adding or removing any single individual's record should not significantly change the result of an analysis. This property gives a strong privacy guarantee and helps analysts and scientists defend against linkage attacks, which matters in domains such as research, medicine, census work and finance. In this tutorial, we will explore how this mathematical framework works, how to implement it, and some important related ideas.

## Why Do We Need Epsilon Differential Privacy?

Differential privacy uses randomized computations to preserve individual privacy. This means that every time you compute a differentially private mean (or other statistic) of a dataset, you get a slightly different result that stays close to the _true mean_.

To keep any individual's contribution from being reverse-engineered from a result, DP adds calibrated noise drawn from a probability distribution, typically the Laplace distribution. The noise is large enough to mask the presence or absence of any single individual in the dataset, but small enough to preserve the data's usefulness for analysis.
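
To make the randomness concrete, here is a minimal sketch that releases the same mean several times with Laplace noise added. The toy `ages` data, the `noisy_mean` helper and the `noise_scale` value are assumptions chosen for illustration; the scale is not tied to any particular privacy level.

```python
import numpy as np

rng = np.random.default_rng()  # unseeded, so every run prints different values

ages = np.array([25, 30, 35, 40, 45, 50, 55, 60])  # toy data for illustration
true_mean = ages.mean()
noise_scale = 2.0  # hypothetical Laplace scale, not derived from any epsilon

def noisy_mean(values, scale):
    """Return the mean of `values` plus one draw of Laplace noise."""
    return values.mean() + rng.laplace(loc=0.0, scale=scale)

# Release the same statistic five times; each release gets fresh noise
releases = [noisy_mean(ages, noise_scale) for _ in range(5)]

print(f"True mean: {true_mean}")
for i, value in enumerate(releases, start=1):
    print(f"Release {i}: {value:.3f}")
```

Each release differs, but because the noise is centred on zero the values cluster around the true mean.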

But why would we even feel the need to hide such basic information? 

For example, in a medical study of a rare disease, if an attacker knows the participant count and sees it increase by one after a specific person joins the study, they can infer that person's medical condition. In a salary database, an attacker who knows the mean salary and the headcount before and after a new employee joins can calculate that individual's exact salary. Voting data is another example: if the mean age of voters in a small district is known and changes slightly after one person votes, that person's age and voting status can be deduced.
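
The salary case comes down to simple arithmetic. The sketch below uses invented salary figures and assumes the attacker sees the exact mean and the headcount both before and after the new hire; all names and numbers are hypothetical.

```python
# Hypothetical salaries, invented purely for illustration
salaries_before = [52_000, 61_000, 58_500, 70_000]
new_hire_salary = 64_250  # the value the attacker wants to learn
salaries_after = salaries_before + [new_hire_salary]

def mean(values):
    """Exact (non-private) mean."""
    return sum(values) / len(values)

# What the attacker observes: two exact means and two headcounts
mean_before, n_before = mean(salaries_before), len(salaries_before)
mean_after, n_after = mean(salaries_after), len(salaries_after)

# Differencing attack: (total after) minus (total before) is the new salary
recovered_salary = n_after * mean_after - n_before * mean_before

print(f"Recovered salary: {recovered_salary:.2f}")
```

The attack works because both means are exact. Once noise is added to each released mean, the difference no longer pins down a single person's value.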

The goal of DP is to apply a randomized mechanism to all computations performed on a sensitive dataset so that the probability of any result occurring is nearly identical for any two datasets that differ in only one record.
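
Stated formally, this is the standard definition of ε-differential privacy. The symbols $\mathcal{M}$ (the randomized mechanism), $D$ and $D'$ (two datasets that differ in one record) and $S$ (any set of possible outputs) are notation assumed here for illustration:

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
$$

A small ε forces the two probabilities to be nearly equal, since $e^{\varepsilon} \approx 1 + \varepsilon$ when ε is close to zero. A larger ε lets them drift further apart, which weakens the guarantee.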

![](https://www.anonos.com/hs-fs/hubfs/differential/Diff.1.webp?width=834&height=429&name=Diff.1.webp)

So, by making computations differentially private, we prevent an attacker from learning anything meaningful about a single individual by comparing results computed with and without that person's record. Now, let's see a basic implementation of DP on some common operations.

## What Are Sensitivity And Epsilon in Differential Privacy?
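
As a rough guide to the two terms in this heading, the formulas below give the usual textbook definitions of sensitivity and the Laplace mechanism. The notation ($f$ for the query, $D$ and $D'$ for neighbouring datasets, $\Delta f$ for sensitivity, $\mathcal{M}$ for the mechanism) is assumed here for illustration:

$$
\Delta f = \max_{D, D'} \lvert f(D) - f(D') \rvert,
\qquad
\mathcal{M}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)
$$

Sensitivity $\Delta f$ is the largest change that one record can cause in the query result, and ε is the privacy parameter that sets how much noise is added relative to it. For a counting query, $\Delta f = 1$. For a mean over $n$ records bounded in $[a, b]$, assuming $n$ is public and neighbouring datasets differ by the replacement of one record, $\Delta f = (b - a)/n$.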

```python
import numpy as np

# Example: applying ε-DP to a census count
def add_laplace_noise(true_count, epsilon):
    # Adding or removing one person changes a count by at most 1
    sensitivity = 1.0
    # Laplace scale: a smaller epsilon means more noise
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_count + noise

# Usage
true_population = 1000
epsilon = 0.05
private_population = add_laplace_noise(true_population, epsilon)

print(f"True population: {true_population}")
print(f"Private population: {private_population}")
```

Example output (the private value changes on every run):

```text
True population: 1000
Private population: 1018.3669040577329
```
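
To see why a Laplace scale of `sensitivity / epsilon` is enough for a count, the sketch below compares the output densities for two neighbouring counts (1000 and 1001) and checks that their ratio never exceeds $e^{\varepsilon}$. The `laplace_pdf` helper and the list of candidate outputs are assumptions made for illustration.

```python
import math

def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution centred at mu with scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

epsilon = 0.05
sensitivity = 1.0
scale = sensitivity / epsilon

# Two neighbouring counts: the same census with and without one person
count_without, count_with = 1000, 1001

bound = math.exp(epsilon)
ratios = []
for output in (950.0, 1000.5, 1018.4, 1100.0):  # arbitrary published values
    p_without = laplace_pdf(output, count_without, scale)
    p_with = laplace_pdf(output, count_with, scale)
    # Take the larger of the two ratios so the check is symmetric
    ratio = max(p_without / p_with, p_with / p_without)
    ratios.append(ratio)
    print(f"output={output}: ratio={ratio:.5f}, bound={bound:.5f}")
```

Whatever value is published, an observer can tell the two datasets apart by at most a factor of $e^{\varepsilon}$ in likelihood.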


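```python
import numpy as np

# Example: differentially private mean calculation
def private_mean(data, epsilon, lower, upper):
    # Clip to publicly known bounds; deriving the range from the data
    # itself (max(data) - min(data)) would leak information
    clipped = np.clip(data, lower, upper)
    n = len(clipped)  # the record count is treated as public here
    # Replacing one record moves the mean by at most (upper - lower) / n
    sensitivity = (upper - lower) / n
    noise = np.random.laplace(0, sensitivity / epsilon)
    return np.mean(clipped) + noise

# Usage
patient_ages = [25, 30, 35, 40, 45, 50, 55, 60]
epsilon = 0.5
# Age bounds of 0 and 100 are an assumption chosen for illustration
private_avg_age = private_mean(patient_ages, epsilon, lower=0, upper=100)

print(f"True average age: {np.mean(patient_ages)}")
print(f"Differentially private average age: {private_avg_age}")
```

Example output (the private value changes on every run):

```text
True average age: 42.5
Differentially private average age: 39.48729397508328
```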
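
The two examples above use ε = 0.05 and ε = 0.5. To get a feel for how much that choice matters, the sketch below draws many Laplace noise samples for a sensitivity-1 query at several ε values and reports the average absolute error. The ε values, the sample size and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the comparison is repeatable

sensitivity = 1.0   # e.g. a counting query
n_draws = 20_000    # number of simulated releases per epsilon

mean_abs_error = {}
for epsilon in (0.05, 0.5, 5.0):
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=n_draws)
    # Average distance between the private answer and the true answer
    mean_abs_error[epsilon] = float(np.abs(noise).mean())
    print(f"epsilon={epsilon}: scale={scale:.2f}, "
          f"mean |error|={mean_abs_error[epsilon]:.3f}")
```

The expected absolute value of Laplace noise equals its scale, so the average error should land near `sensitivity / epsilon`: a tenfold increase in ε buys roughly a tenfold reduction in error, at the cost of a weaker privacy guarantee.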


## The Importance of Epsilon Shown Visually

## Online Games Illustrating Epsilon's Importance

## Conclusion