# CS211: Data Privacy
## Homework 4

In [2]:
# Load the data and libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')  # this style was named 'seaborn-whitegrid' before matplotlib 3.6

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

adult = pd.read_csv('https://github.com/jnear/cs211-data-privacy/raw/master/homework/adult_with_pii.csv')

In [None]:
adult['fnlwgt']

## Question 1 (10 points)

Complete the definition of `dp_sum_capgain` below. Your definition should compute a differentially private sum of the "Capital Gain" column of the `adult` dataset, and have a total privacy cost of `epsilon`.

In [None]:
def dp_sum_capgain(epsilon):
    # Clip every value to [0, b] so that one row changes the sum by at most b
    b = 100000
    clipped_sum = adult['Capital Gain'].clip(lower=0, upper=b).sum()
    return laplace_mech(clipped_sum, sensitivity=b, epsilon=epsilon)

dp_sum_capgain(1.0)

In [None]:
# TEST CASE for question 1

real_sum = adult['Capital Gain'].sum()
r1 = np.mean([pct_error(real_sum, dp_sum_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_sum, dp_sum_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_sum, dp_sum_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 10
assert r2 < 2
assert r3 < 0.2

## Question 2 (10 points)

In 2-5 sentences each, answer the following:

- What clipping parameter did you use in your definition of `dp_sum_capgain`, and why?
- What was the sensitivity of the query you used in `dp_sum_capgain`, and how is it bounded?
- Argue that your definition of `dp_sum_capgain` has a total privacy cost of `epsilon`

- I clipped at 100000, because that is a reasonable upper bound on the capital gain values in the dataset, so clipping removes little or no real data.
- The sensitivity is also 100000. Capital gains are non-negative, so after clipping every row contributes between 0 and 100000 to the sum, and adding or removing one row changes the sum by at most 100000.
- The function makes a single Laplace mechanism query with that bounded sensitivity and privacy parameter `epsilon`, so the total privacy cost is `epsilon`.
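
The sensitivity argument can be checked numerically on a toy example. The sketch below uses made-up values rather than the `adult` data, and `clipped_sum` is a hypothetical helper introduced only for illustration. It adds one extreme row to a small dataset and measures how much the clipped sum moves.

```python
import numpy as np

def clipped_sum(values, b):
    # Clip each value to [0, b] before summing, so one row contributes at most b
    return float(np.clip(values, 0, b).sum())

b = 100000
base = np.array([0.0, 2500.0, 99999.0])    # made-up capital gains
neighbor = np.append(base, 10_000_000.0)   # neighboring dataset: one extreme row added

# The extra row is clipped to b, so the sum cannot move by more than b
diff = abs(clipped_sum(neighbor, b) - clipped_sum(base, b))
print("change in clipped sum:", diff)
```

Without clipping, the same extra row would change the sum by 10,000,000, which is why the unclipped sum has no useful sensitivity bound.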

## Question 3 (10 points)

Complete the definition of `dp_avg_capgain` below. Your definition should compute a differentially private average (mean) of the "Capital Gain" column of the adult dataset, and have a **total privacy cost of epsilon**.

In [None]:
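def dp_avg_capgain(epsilon):
    # Split the budget evenly: epsilon/2 for the noisy sum, epsilon/2 for the noisy count
    noisy_sum = dp_sum_capgain(epsilon / 2)
    noisy_count = laplace_mech(len(adult['Capital Gain']), 1, epsilon / 2)
    return noisy_sum / noisy_count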

dp_avg_capgain(1.0)

In [None]:
# TEST CASE for question 3

real_avg = adult['Capital Gain'].mean()
r1 = np.mean([pct_error(real_avg, dp_avg_capgain(0.1)) for _ in range(100)])
r2 = np.mean([pct_error(real_avg, dp_avg_capgain(1.0)) for _ in range(100)])
r3 = np.mean([pct_error(real_avg, dp_avg_capgain(10.0)) for _ in range(100)])

print("Average errors:", r1, r2, r3)

assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 20
assert r2 < 4
assert r3 < 0.4

## Question 4 (10 points)

In 2-5 sentences each, answer the following:

- Argue that your definition of `dp_avg_capgain` has a total privacy cost of `epsilon`
- For sums and averages, which seems to be more important for accuracy - the value of the clipping parameter $b$ or the scale of the noise added? Why?
- Do you think the answer to the previous point will be true for every dataset? Why or why not?

- I split the computation into a noisy sum and a noisy count, each run with privacy parameter `epsilon/2`. By sequential composition, the total privacy cost is `epsilon/2 + epsilon/2 = epsilon`.
- From experimenting, the choice of clipping parameter seems to matter more. This column is heavily skewed: most people have very little capital gain, but those who have more than a little tend to have quite a lot. A clipping bound that is too low therefore cuts off a large share of the total, and the result is much less accurate than it would be for a column with only a few outliers.
- Because this depends on how skewed the data is, I don't think the property necessarily generalizes to every dataset.
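
The trade-off between clipping error and noise can be illustrated with a small experiment. This is a sketch on synthetic data, not the `adult` column: the 5% / 20000 shape of the data and the helper names `clipping_error_pct` and `noise_error_pct` are assumptions chosen only to mimic a skewed column. It uses the fact that the expected absolute value of Laplace noise equals its scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
# Synthetic skewed column: about 95% zeros, about 5% large values near 20000
values = np.where(rng.random(n) < 0.05, rng.normal(20000, 2000, n), 0.0).clip(min=0)
true_sum = values.sum()
epsilon = 1.0

def clipping_error_pct(b):
    # Deterministic error introduced by clipping at b, as a percent of the true sum
    return abs(true_sum - values.clip(max=b).sum()) / true_sum * 100

def noise_error_pct(b):
    # Expected |Laplace noise| is its scale b / epsilon, as a percent of the true sum
    return (b / epsilon) / true_sum * 100

for b in [1000, 10000, 100000]:
    print(f"b={b}: clipping error {clipping_error_pct(b):.2f}%, "
          f"expected noise error {noise_error_pct(b):.2f}%")
```

On data shaped like this, a bound that is too low loses most of the sum, while a generous bound costs only a small amount of extra noise. A dataset with few rows or a small true sum could reverse that conclusion.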

## Question 5 (20 points)

Write a function `auto_avg` that returns the differentially private average of a Pandas series `s`. Your function should automatically determine the clipping parameter `b`, and should enforce differential privacy for a **total privacy cost** of `epsilon`. You can assume that all values are non-negative (i.e. 0 or greater).

In [None]:
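def auto_avg(series, epsilon):
    # Budget: up to 15 noisy sums plus 1 noisy count, each costing epsilon / 16
    sub_e = epsilon / 16
    last = 0
    for i in range(15):
        # Try clipping bounds b = 1, 3, 9, ..., 3**14
        sensitivity = 3 ** i
        new = laplace_mech(series.clip(upper=sensitivity).sum(), sensitivity, sub_e)
        # Stop once raising b changes the noisy sum by less than 15%
        if abs(new - last) < (new * 0.15):
            print(f"b = {sensitivity}")
            return new / laplace_mech(series.shape[0], 1, sub_e)
        last = new
    # No bound met the stopping rule, so fall back to the largest one tried
    return last / laplace_mech(series.shape[0], 1, sub_e)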

In [None]:
# TEST CASE for question 5
r1 = np.mean([pct_error(adult['Age'].mean(), auto_avg(adult['Age'], 1.0)) for _ in range(20)])
print('capital')
r2 = np.mean([pct_error(adult['Capital Gain'].mean(), auto_avg(adult['Capital Gain'], 1.0)) for _ in range(20)])
print('fnlwgt')
r3 = np.mean([pct_error(adult['fnlwgt'].mean(), auto_avg(adult['fnlwgt'], 1.0)) for _ in range(20)])

print('Average errors:', r1, r2, r3)
assert r1 > 0
assert r2 > 0
assert r3 > 0
assert r1 < 1
assert r2 < 100
assert r3 < 1

## Question 6

In 2-5 sentences each, answer the following:

- Explain your strategy for implementing `auto_avg`
- Argue informally that your definition of `auto_avg` has a total privacy cost of `epsilon`
- Did your solution work well for all three example columns? If it did not work well on any of them, why not?
- When is your solution likely to *not* work well? (i.e. what properties does the data have to have, in order for your solution to not work well?)

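- I tried to implement the logarithmic strategy from the textbook: try exponentially growing clipping bounds and stop once the noisy clipped sum changes by less than 15% between bounds. After tweaking the parameters I settled on powers of 3, since the bounds need to cover a wide range of data magnitudes.
- By sequential composition. The method runs at most 16 queries, each with privacy parameter `epsilon/16` (up to 15 noisy sums to find `b`, plus one noisy count), so the total cost is at most `epsilon`. It is at least `3 * epsilon/16`, because the stopping test cannot pass on the first iteration (where `last` is 0), so at least two sums and one count always run.
- It worked on all three columns, but only after I spent some time tuning the parameters, so it will not necessarily generalize.
- It will break down when the data contains values above `3 ** 14`, the largest bound tried, or when the clipping bound needs finer precision than powers of 3 provide.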
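
The budget accounting for `auto_avg` can be written out as a short calculation. This is a sketch: `budget_spent` is a hypothetical helper, not part of the assignment, and it assumes the loop structure shown in Question 5 (15 candidate bounds, one count query, `epsilon / 16` per query).

```python
def budget_spent(epsilon, sum_queries, total_slots=16):
    # Each query gets epsilon / total_slots; one extra slot pays for the noisy count
    sub_e = epsilon / total_slots
    return (sum_queries + 1) * sub_e

epsilon = 1.0
worst = budget_spent(epsilon, sum_queries=15)  # the loop runs all 15 iterations
best = budget_spent(epsilon, sum_queries=2)    # earliest stop: the second iteration
largest_b = 3 ** 14                            # largest clipping bound the loop tries

print("worst case:", worst)
print("best case:", best)
print("largest b:", largest_b)
```

The worst case spends the full `epsilon`, so the function never exceeds its budget. Whenever the loop stops early, the remaining budget goes unused, which is one reason the released average is noisier than it strictly needs to be.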