# PSU IST 597.005: Homework 3
## Differential Privacy

### Instructions

To ensure that the notebook runs, I've defined a function your_code_here() that simply returns the number 1. Whenever you see a call to this function, you should replace it with code you have written. Please make sure all cells of your notebook run without error before submitting the assignment. If you have not completed all the questions, leave calls to your_code_here() in place or insert dummy values so that the cell does not throw an error when it runs.

When answering non-code questions, feel free to use a comment, or keep the cell in Markdown mode and use Markdown. When you have finished your assignment, please submit it via Canvas.

### Preamble

In [21]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import pandas as pd
import numpy as np
import random

# Our usual dataset

adult_data = pd.read_csv("adult_with_pii.csv")
adult_data['DOB'] = pd.to_datetime(adult_data['DOB'], errors='coerce')
adult_data.head()

# Some useful utilities from earlier assignments

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def laplace_mech_vec(vec, sensitivity, epsilon):
    return [v + np.random.laplace(loc=0, scale=sensitivity / epsilon) for v in vec]

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def gaussian_mech_vec(vec, sensitivity, epsilon, delta):
    return [v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)
            for v in vec]

def z_clip(xs, b):
    return [min(x, b) for x in xs]

def clip(xs, upper, lower):
    return [max(min(x, upper), lower) for x in xs]

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

def your_code_here():
    return 1

### Collaboration Statement

**You are expected to work independently on this assignment.** Everyone should write their *own code and responses*. You may collaborate with a classmate through *high-level* discussions only. If you do so, please describe the collaboration in the statement below.

>In this cell write your collaboration statement

### Question 1 (1 point)

Write code to answer the query: "how many participants have never been married?"

*Note*: "marital status" is misspelled as `Martial Status` in the column names.

In [6]:
def query1():
    your_code_here()

query1()

### Question 2 (3 points)

Use the implementation of `laplace_mech` defined in the preamble to produce a differentially private answer to `query1`, with `epsilon = 0.1`.
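
As a minimal sketch of the general pattern, here is the Laplace mechanism applied to a counting query over a toy list rather than the assignment's data. The `toy_statuses` values and the `toy_count` helper are made up for illustration. A count changes by at most 1 when one row is added or removed, so the sketch uses a sensitivity of 1.

```python
import numpy as np

def laplace_mech(v, sensitivity, epsilon):
    # Same definition as in the preamble, repeated so this sketch runs on its own.
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

# Hypothetical toy data, not the adult dataset.
toy_statuses = ['Single', 'Married', 'Single', 'Divorced', 'Single']

def toy_count(value):
    # Counting query: number of entries equal to `value`.
    return sum(1 for s in toy_statuses if s == value)

true_count = toy_count('Single')
# Adding or removing one row changes a count by at most 1, so sensitivity = 1.
noisy_count = laplace_mech(true_count, sensitivity=1, epsilon=0.1)
print(true_count, noisy_count)
```

Each run prints a different noisy value, because fresh noise is drawn on every call.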

In [8]:
def dp_query1(epsilon):
    your_code_here()

dp_query1(0.1)

### Question 3 (6 points)

The `pct_error` function, defined in the preamble, returns the percent relative error between an original query result and a differentially private result for the same query.

Implement a function `graph_error1` that:

- Calculates 1000 differentially private answers to `dp_query1`
- Calculates the percent error for each one of these answers against the original (non-private) answer
- Graphs the distribution of errors using a histogram

Plot errors for `epsilon=0.1` and `epsilon=1.0`.
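
The repeat-and-measure loop described above might look like the following sketch, run against a hypothetical true answer of 1000 rather than the real query. The names `toy_errors` and `toy_true_answer` are illustrative assumptions, and the fixed seed is there only to make the sketch repeatable.

```python
import numpy as np

def laplace_mech(v, sensitivity, epsilon):
    # Same definition as in the preamble.
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def pct_error(orig, priv):
    # Same definition as in the preamble.
    return np.abs(orig - priv) / orig * 100.0

def toy_errors(true_answer, epsilon, trials=1000):
    # Repeat the mechanism and record the percent error of each noisy answer.
    return [pct_error(true_answer, laplace_mech(true_answer, 1, epsilon))
            for _ in range(trials)]

np.random.seed(0)        # fixed seed only so the sketch is repeatable
toy_true_answer = 1000   # hypothetical count, not taken from the dataset
errors_small_eps = toy_errors(toy_true_answer, 0.1)
errors_large_eps = toy_errors(toy_true_answer, 1.0)
print(np.mean(errors_small_eps), np.mean(errors_large_eps))
```

A list like `errors_small_eps` can be passed directly to `plt.hist(errors_small_eps, bins=20)` to draw the histogram.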

In [11]:
def graph_error1(epsilon):
    your_code_here()

graph_error1(0.1)
graph_error1(1.0)

### Question 4 (4 points)

In 2-5 sentences, answer the following:

- How does the histogram of relative errors for $\epsilon = 0.1$ differ from the one for $\epsilon = 1.0$?
- What do the two histograms tell you about the effect of $\epsilon$ on relative error?

>Your answer here.

### Question 5 (6 points)

Consider `query2`, which asks how many people in the dataset are over the age of 60.

In [11]:
def query2():
    return len(adult_data[adult_data['Age'] > 60])

Implement `dp_query2`, a differentially private version of `query2` (as in question 2), and `graph_error2`, which graphs relative error for `dp_query2` (as in question 3).

In [12]:
def dp_query2(epsilon):
    your_code_here()

def graph_error2(epsilon):
    your_code_here()

graph_error2(1.0)
graph_error1(1.0)  # plot errors for query 1 and query 2 at the same epsilon, to compare

### Question 6 (2 points)

In 2-5 sentences, answer the following:

- How does relative error differ between `dp_query1` and `dp_query2` for the same value of $\epsilon$?
- What property of the query causes the difference in relative errors between `dp_query1` and `dp_query2`?

>Your answer here.

### Question 7 (5 points)

Build a [contingency table](https://en.wikipedia.org/wiki/Contingency_table) between the `Martial Status` and `Sex` columns of the `adult_data` dataframe.

*Hint*: you can use `pd.crosstab(..., ...)` (documentation [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html))
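
For reference, a small sketch of how `pd.crosstab` behaves on a toy frame. The `toy` DataFrame and its `color` and `size` columns are hypothetical and unrelated to the assignment's data.

```python
import pandas as pd

# Hypothetical toy frame; the column names here are illustrative only.
toy = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'red', 'blue'],
    'size':  ['S',   'S',    'L',   'S',   'L'],
})

# Rows are the values of the first argument, columns the values of the second;
# each cell counts the rows having that combination.
table = pd.crosstab(toy['color'], toy['size'])
print(table)
```

The result is an ordinary DataFrame, so individual cells can be read with `table.loc[row_label, column_label]`.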

In [13]:
your_code_here()

1

### Question 8 (5 points)

Write code to build a differentially private version of your result from Question 7.

In [14]:
your_code_here()

1

### Question 9 (5 points)

Write code to display a table containing percent errors for each of the cells in your answer to Question 8.

In [15]:
your_code_here()

1

### Question 10 (3 points)

In 2-3 sentences, answer the following question:

- What is the privacy cost of your answer to Question 8? Why?

>Your answer here.

### Question 11 (5 points)

Write code to compute the clipped average of the `Capital Gain` column: clip the values at a clipping parameter `b` (for example with `z_clip` from the preamble), then take their mean. Run your code for various values of `b`.

In [None]:
your_code_here()

### Question 12 (10 points)

Write code to return the differentially private average of `Capital Gain` parameterized by the clipping parameter `b`. Run your code for various values of `b` and use `pct_error` to determine the error introduced for each value of `b`.
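
One common way to build a differentially private average is to divide a noisy clipped sum by a noisy count, splitting the privacy budget between the two. The sketch below shows that approach on made-up data. It is an illustration under stated assumptions, not necessarily the intended solution: `toy_dp_avg` and `toy_values` are hypothetical names, the values are assumed non-negative, and the even budget split is one choice among many.

```python
import numpy as np

def laplace_mech(v, sensitivity, epsilon):
    # Same definition as in the preamble.
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def z_clip(xs, b):
    # Same definition as in the preamble: cap each value at b.
    return [min(x, b) for x in xs]

def toy_dp_avg(xs, b, epsilon):
    # Assumes every value is non-negative, so clipping at b bounds each
    # row's contribution to the sum by b (sum sensitivity = b).
    noisy_sum = laplace_mech(sum(z_clip(xs, b)), sensitivity=b, epsilon=epsilon / 2)
    # A count has sensitivity 1.
    noisy_count = laplace_mech(len(xs), sensitivity=1, epsilon=epsilon / 2)
    # Sequential composition: epsilon/2 + epsilon/2 = epsilon in total.
    return noisy_sum / noisy_count

np.random.seed(0)  # fixed seed only so the sketch is repeatable
toy_values = [0, 0, 5, 10, 20, 1000] * 500  # hypothetical skewed data
for b in [10, 100, 1000]:
    print(b, np.mean(z_clip(toy_values, b)), toy_dp_avg(toy_values, b, 1.0))
```

The loop prints the clipped (non-private) mean next to the private estimate for each `b`, which is the comparison `pct_error` is meant to quantify.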

In [16]:
your_code_here()

1

### Question 13 (5 points)

In 5-10 sentences, answer the following:

- In Question 11, at approximately what value of the clipping parameter `b` does the clipped average approach the original (un-clipped) average?
- What is the sensitivity of the clipped average at this value of `b`, and why?
- In Question 12, at approximately what value of the clipping parameter `b` is the error minimized?
- Which seems to be more important for accuracy - the value of `b` or the scale of the noise added? Why?
- Do you think the answer to the previous point will be true for every dataset? Why or why not?

>Your answer here.

### Question 14 (10 points)

Write code to answer the question: How do the Laplace and Gaussian mechanisms compare in terms of relative error on the query "how many individuals are over 50 years old" with $\epsilon = 0.1$ and $\delta = 10^{-5}$? In the cell below your code, explain your answer.
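
Before running any experiment, it can help to compare the spread of the noise the two preamble mechanisms add. The sketch below plugs the stated $\epsilon$ and $\delta$ into the scale expressions used by `laplace_mech` and `gaussian_mech`, assuming a sensitivity of 1 (the usual value for a counting query). It does not touch the dataset.

```python
import numpy as np

epsilon, delta, sensitivity = 0.1, 1e-5, 1

# Scale parameter b of the Laplace noise used by laplace_mech.
laplace_scale = sensitivity / epsilon
# A Laplace(0, b) variable has standard deviation sqrt(2) * b.
laplace_std = np.sqrt(2) * laplace_scale

# Standard deviation of the Gaussian noise used by gaussian_mech.
gaussian_std = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon

print(laplace_std, gaussian_std)
```

Standard deviation is only one summary of the noise. The two distributions have different shapes, so an empirical comparison of relative error is still worth running.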

In [18]:
your_code_here()

1

>Your answer here.

### Range Queries

A *range query* counts the number of rows in the dataset which have a value lying in a given range. For example, "how many participants are between the ages of 21 and 33?" is a range query. A *workload* of range queries is just a list of range queries. The code below generates 100 random range queries over ages in the adult dataset.

In [25]:
def range_query(df, col, a, b):
    return len(df[(df[col] >= a) & (df[col] < b)])

random_lower_bounds = [random.randint(1, 70) for _ in range(100)]
random_workload = [(lb, random.randint(lb, 100)) for lb in random_lower_bounds]
real_answers = [range_query(adult_data, 'Age', lb, ub) for (lb, ub) in random_workload]
print('First 5 queries: ', random_workload[:5])

First 5 queries:  [(19, 93), (13, 43), (4, 99), (41, 80), (22, 37)]


### Question 15 (10 points)

Write code to answer a workload of range queries using `laplace_mech` and sequential composition. Your solution should have a **total privacy cost of epsilon**.
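
As a reminder of how sequential composition accounts for privacy cost: releasing $k$ results at $\epsilon_i$ each costs $\sum_i \epsilon_i$ in total. The sketch below applies that to a hypothetical list of precomputed counts, giving each one an equal share of the budget. `toy_answer_all` and `toy_true_answers` are illustrative names, and the equal split is an assumption rather than a requirement.

```python
import numpy as np

def laplace_mech(v, sensitivity, epsilon):
    # Same definition as in the preamble.
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def toy_answer_all(true_answers, total_epsilon):
    # Sequential composition: k releases at epsilon_i each cost sum(epsilon_i),
    # so giving each of the k queries total_epsilon / k keeps the sum at total_epsilon.
    per_query_epsilon = total_epsilon / len(true_answers)
    return [laplace_mech(a, sensitivity=1, epsilon=per_query_epsilon)
            for a in true_answers]

toy_true_answers = [120, 45, 300, 7]  # hypothetical counting-query results
toy_noisy_answers = toy_answer_all(toy_true_answers, 1.0)
print(toy_noisy_answers)
```

With four queries and a total budget of 1.0, each release uses $\epsilon = 0.25$, so the per-query noise scale is four times larger than for a single query at $\epsilon = 1.0$.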

In [27]:
def workload_laplace(workload, epsilon):
    your_code_here()

print('First 4 answers:', workload_laplace(random_workload, 1.0)[:4])

In [None]:
errors = [abs(r_a - l_a) for (r_a, l_a) in zip(real_answers, workload_laplace(random_workload, 1.0))]
print('Average absolute error:', np.mean(errors))

### Question 16 (10 points)

Write code to answer a workload using `laplace_mech_vec` - the version of the Laplace mechanism for **vector-valued** queries. Your solution should *not* use sequential composition, and should have a total privacy cost of `epsilon`.

In [None]:
def workload_laplace_vec(workload, epsilon):
    your_code_here()

print('First 4 answers:', workload_laplace_vec(random_workload, 1.0)[:4])

In [None]:
errors = [abs(r_a - l_a) for (r_a, l_a) in zip(real_answers, workload_laplace_vec(random_workload, 1.0))]
print('Average absolute error:', np.mean(errors))

### Question 17 (10 points)

Write code to answer a workload using `gaussian_mech_vec` - the version of the Gaussian mechanism for vector-valued queries. Your solution should not use sequential composition and should satisfy $(\epsilon, \delta)$-differential privacy, for a total privacy cost of (`epsilon`, `delta`).
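
Vector-valued mechanisms are calibrated to the sensitivity of the whole answer vector: by the standard convention, the Laplace mechanism uses the $L_1$ norm of the worst-case change and the Gaussian mechanism uses the $L_2$ norm. The sketch below computes both norms for a hypothetical workload of `k` counting queries, under the worst-case assumption that a single row can change every answer by 1.

```python
import numpy as np

k = 100  # hypothetical workload size

# Worst case for k counting queries: adding one row changes every answer by 1.
worst_case_change = np.ones(k)

l1_sensitivity = np.linalg.norm(worst_case_change, ord=1)  # sum of |changes|
l2_sensitivity = np.linalg.norm(worst_case_change, ord=2)  # Euclidean length

print(l1_sensitivity, l2_sensitivity)
```

These are upper bounds for the worst case. A specific workload whose queries overlap less could have smaller sensitivities.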

In [None]:
def workload_gaussian_vec(workload, epsilon, delta):
    your_code_here()

print('First 4 answers:', workload_gaussian_vec(random_workload, 1.0, 1e-5)[:4])

In [None]:
errors = [abs(r_a - l_a) for (r_a, l_a) in zip(real_answers, workload_gaussian_vec(random_workload, 1.0, 1e-5))]
print('Average absolute error:', np.mean(errors))

### Question 18 (10 points)

In 2-5 sentences, answer the following:

- Of your solutions in questions 15-17, which ones rely on *sequential composition*?
- Which solution offers the best accuracy?
- Why does this particular solution yield the best accuracy?

>Your answer here.