<!-- # CMPUT 200 Fall 2024  Ethics of Data Science and AI
 -->
# Assignment 1: Applying and Analyzing Privacy Techniques

In this assignment you will apply basic privacy techniques to a dataset of your choosing. By the end of this assignment, you should be able to:
1. Understand and implement randomized response on binary data;
2. Calculate the sensitivity of a non-binary feature;
3. Add noise to a non-binary feature;
4. Compute aggregate statistics.

### Instructions
<!-- **Deadline.**  This assignment is due at ****.  Please check the syllabus for late submissions. -->
You are expected to write clear, detailed, and complete answers when analysing your data. Lack of this may result in point deductions.

**Reminder.** You must submit your own work. The collaboration policy for the assignments is Group Collaboration. You may work together in groups of up to 2.
On Canvas we added a group set called "Assignment Group" and enabled self-signup so that you can sign up as a group. If you don't select your group by Wednesday, Sep 25th, 11:59pm, we will assume you are working on your own.
Under the group collaboration policy, besides working with your group member, you can discuss concepts with your peers. However, the work must be your own group's, and all sources of information used, including books, websites, and students you talked to, must be cited in the submission. Please see the course FAQ document for details on this collaboration policy. We will adhere to current Faculty of Science guidelines on dealing with suspected cases of violation of academic integrity.

You must use this notebook to complete your assignment. You will execute the questions in the notebook. The questions might ask for a short answer in text form or for you to write and execute a piece of code. **Make sure you enter your answer in either case only in the cell provided.**  Do not use a different cell and do not create a new cell. Do not delete any test cells.  Creating new cells for your code and deleting test cells is not compatible with the auto-grading system we are using and thus your assignment will not be graded properly and you will lose marks for that question.

Your submitted notebook should run on our local installation.  So if you are importing packages not listed in the notebook or using local data files not included in the assignment package, make sure the notebook is self-contained with a requirements.txt file or cells in the notebook itself to install the extra packages.  If we cannot run your notebook, you will lose 50% of the marks, and any additional marks that may be lost due to wrong answers.

### Submission Instructions
When you are done, you will submit your work from the notebook. Make sure to save your notebook before running it, and then submit on Canvas the notebook file with your work completed.

**IMPORTANT: Name your file with your *Student ID number* and the assignment number (ex: 1234567_A1.ipynb). Failure to do so will result in a zero!**

In [1]:
# Run this cell to set up; Please don't change this cell.

import numpy as np
from numpy.random import default_rng
rng = default_rng()
import pandas as pd
from scipy.optimize import minimize

# These lines do some fancy plotting magic.
import matplotlib
# This is a magic function that renders the figure in the notebook, instead of displaying a dump of the figure object.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Part 1: Data

**Question 1.1.** We will first load the data, carry out some cleaning and pre-processing, and inspect the data to understand what exploratory steps we will take. Name the DataFrame `df`.

In [2]:
# YOUR CODE HERE
import pandas as pd
df = pd.read_csv("./compas_data_two_years.csv")
print("Shape: ", df.shape)
df.head(5)

Shape:  (7214, 53)


| Unnamed: 0 | id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | start | end | event | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | 1 | Low | 2013-08-14 | 2014-07-07 | 2014-07-14 | 0 | 0 | 327 | 0 | 0 |
| 1 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 1 | Low | 2013-01-27 | 2013-01-26 | 2013-02-05 | 0 | 9 | 159 | 1 | 1 |
| 2 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | 3 | Low | 2013-04-14 | 2013-06-16 | 2013-06-16 | 4 | 0 | 63 | 0 | 1 |
| 3 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | 6 | Medium | 2013-01-13 |  |  | 1 | 0 | 1174 | 0 | 0 |
| 4 | 6 | bouthy pierrelouis | bouthy | pierrelouis | 2013-03-26 | Male | 1973-01-22 | 43 | 25 - 45 | Other | ... | 1 | Low | 2013-03-26 |  |  | 2 | 0 | 1102 | 0 | 0 |


**Question 1.2.** Describe your data and its purpose. Identify one variable that is binary or that could be classified into a binary feature. Identify another that is not a binary feature.

Answer: The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset is used to assess a convicted criminal's likelihood of reoffending. The dataset has 7,214 records of individuals and 53 attribute columns, including sensitive attributes such as first and last names, date of birth, and race. The `is_violent_recid` variable can be classified as a binary feature. In contrast, the `age` attribute cannot be classified as a binary feature.

## Part 2: Data pre-processing

**Question 2.1.** If your data has missing values or empty rows, remove them in the cell below. If the feature that you chose above has to be classified into a binary feature, convert it. Finally, if your binary feature is categorical, convert it to numerical.

In [3]:
# YOUR CODE HERE
import numbers

# Drop columns with a lot of NaN values (1000 or more), then drop any rows that still contain NaN, leaving a clean dataset.
null_counts = df.isnull().sum()
df = df.loc[:, null_counts < 1000]  # boolean mask over the columns
df = df.dropna()

# Convert every categorical feature with exactly two distinct values into numerical 0/1.
binary_columns = [column for column in df.columns if df[column].nunique() == 2]
for column in binary_columns:
    first, second = df[column].unique()
    if not isinstance(first, numbers.Number):
        # The first value encountered maps to 0 and the second to 1.
        df[column] = df[column].replace({first: 0, second: 1})

## Part 3: Randomized Response

Now let's implement a Randomized Response mechanism. First we have to identify what the query on our **binary** variable will be. Then we can create our own randomized mechanism.

**Question 3.1.** Write a query on your binary feature.

Answer: The binary variable is `is_violent_recid`, which states whether the culprit committed a violent crime after previously committing another violent crime. For this variable, the query is: "Did you (the culprit) commit a violent crime after previously committing a crime?"

**Question 3.2.** Create your own randomized response mechanism for the query you defined above. You may NOT use the coin example from class, try to be creative!

Answer: We ask the culprit to pick a number from 1 to 6. Then, we roll a fair 6-sided die. If the number rolled is the one they picked, they must give a false response. If the number rolled is not the one they picked, they must give the correct answer. In this manner, we have an 83.33% (5/6) chance of obtaining a correct answer and a 16.67% (1/6) chance of obtaining a wrong answer.
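The 5/6 and 1/6 figures can be checked by listing every equally likely (picked, rolled) pair. The sketch below is only an illustration; the names `outcomes`, `p_truth`, and `p_lie` are ours and are not part of the assignment. It enumerates a uniformly chosen pick, although the result is the same for any pick as long as the die is fair.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (picked number, rolled number) pairs for a fair die.
outcomes = list(product(range(1, 7), repeat=2))

# The respondent gives a false answer only when the roll matches the pick.
lie_outcomes = [(picked, rolled) for picked, rolled in outcomes if picked == rolled]

p_lie = Fraction(len(lie_outcomes), len(outcomes))
p_truth = 1 - p_lie

print(f"P(truthful answer) = {p_truth} = {float(p_truth):.4f}")
print(f"P(flipped answer)  = {p_lie} = {float(p_lie):.4f}")
```

Working with `Fraction` keeps the probabilities exact, so the comparison with 5/6 and 1/6 does not depend on floating-point rounding.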

**Question 3.3.** Implement a function for your mechanism in 3.2. in the code cell below. The function should accept a value `0` or `1` and return the reported answer according to the randomized response above. Name your function `rand_resp`.

In [4]:
# YOUR CODE HERE
import random
def rand_resp(actual_value):
    chosen = random.randint(1,6) #Simulate a user choosing a number
    rolled = random.randint(1,6) #Simulate the roll of a fair 6-sided die
    if chosen != rolled:
        return actual_value
    return abs(actual_value - 1)

**Question 3.4.** For each value in your dataframe's binary feature column, call your function. Store the results in a new column in df named `rrc1`.

In [5]:
true_responses = df["is_violent_recid"] #6900 responses
reported = []
for response in true_responses:
    reported.append( rand_resp(response) )
df["rrc1"] = reported

**Question 3.5.** Now get the **estimate** for the true number of people who answered `1`.  Write this result into the variable `count_est_true_yes`.  Given the number of reported `1`'s, we know how to estimate the proportion or number of true `1`'s.

Calculate the true number of people who answered `1` (the true answer in the data we imported) and write it into the variable `count_true_yes`.

In [6]:
# YOUR CODE HERE

#Calculating estimate using probability and mathematical manipulation:
#E[reported yes] = (5/6)*T + (1/6)*(n - T), so T = (6*reported - n)/4.
n = len(df)  # number of respondents after cleaning
total_reported_yes = int((df["rrc1"] == 1).sum())
count_est_true_yes = (6 * total_reported_yes - n) / 4


#Calculating Actual ones:
count_true_yes = int((df["is_violent_recid"] == 1).sum())

print("Estimated true yes count: ", count_est_true_yes)
print("True yes count: ", count_true_yes)

Estimated true yes count:  916.5
True yes count:  803
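The estimate comes from the expected number of reported `1`'s. With $N$ respondents, $T$ true `1`'s, and a probability $p$ of answering truthfully, $E[R] = pT + (1-p)(N-T)$, so $\hat{T} = \frac{R - (1-p)N}{2p - 1}$, which reduces to $(6R - N)/4$ for $p = 5/6$. A minimal sketch of this estimator follows; the function name `estimate_true_yes` and its parameters are illustrative, and the reported count of 1761 is back-calculated from the estimate printed above rather than read from the notebook.

```python
def estimate_true_yes(reported_yes, n, p_truth=5/6):
    """Unbiased estimate of the true number of 1's from randomized responses.

    reported_yes: number of reported 1's
    n:            total number of respondents
    p_truth:      probability that a respondent reports the true value
    """
    if p_truth == 0.5:
        raise ValueError("p_truth = 0.5 carries no information about the truth")
    return (reported_yes - (1 - p_truth) * n) / (2 * p_truth - 1)


# 1761 reported 1's out of 6900 is an assumed figure, chosen because it
# reproduces the estimate of 916.5 under the die mechanism.
estimate = estimate_true_yes(1761, 6900)
print(round(estimate, 2))
```

Passing `n` as an argument instead of hard-coding it keeps the estimator valid if the cleaning step leaves a different number of rows.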


**Question 3.6.** Comment on your results from above. What can you say about the privacy-accuracy tradeoff of your randomized response mechanism?

Answer: The estimated count of true `1` responses (916.5) is reasonably close to the true yes count (803), an overestimate of about 14%. With the dice-based randomized response mechanism, no individual reported answer can be taken at face value, so each individual's privacy is preserved, while the aggregate statistic remains approximately accurate.
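One way to put numbers on both sides of this tradeoff, offered as a sketch rather than as part of the required answer: a binary mechanism that reports the truth with probability $p$ and flips it otherwise satisfies $\epsilon$-differential privacy with $\epsilon = \ln\frac{p}{1-p}$, and the debiased count estimate has standard deviation $\frac{\sqrt{Np(1-p)}}{2p-1}$. The function names below are hypothetical.

```python
import math

def rr_epsilon(p_truth):
    """Privacy parameter of binary randomized response (requires 0.5 < p_truth < 1)."""
    return math.log(p_truth / (1 - p_truth))

def rr_estimate_std(n, p_truth):
    """Standard deviation of the debiased count estimate for n respondents."""
    return math.sqrt(n * p_truth * (1 - p_truth)) / (2 * p_truth - 1)

n = 6900  # number of respondents used in this notebook
for p in (5/6, 3/4, 0.55):
    print(f"p_truth={p:.3f}  epsilon={rr_epsilon(p):.3f}  std of estimate={rr_estimate_std(n, p):.1f}")
```

Lowering `p_truth` toward 0.5 shrinks epsilon (more privacy) but inflates the spread of the estimate (less accuracy). For the die mechanism, epsilon is $\ln 5 \approx 1.61$.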

**Question 3.7.** We learned in class that data analysts are still able to obtain aggregate statistics from the results of a randomized response survey. Use the column you made above, `rrc1`, to calculate the mean and median for the estimated true responses. Do the same for your true responses. Name your answers `mean_est_true_yes`, `mean_true_yes`, `median_est_true_yes`, and `median_true_yes` respectively.

In [7]:
# YOUR CODE HERE

n = len(df)  # number of respondents after cleaning

#Mean:
mean_est_true_yes = count_est_true_yes / n
mean_true_yes = count_true_yes / n

#Median: for 0/1 data the median is 0 when zeros make up more than half of the responses, otherwise 1.
median_est_true_yes = 0 if (n - count_est_true_yes) > n / 2 else 1
median_true_yes = 0 if (n - count_true_yes) > n / 2 else 1




print("Mean: ", mean_est_true_yes, mean_true_yes)
print("Median: ", median_est_true_yes, median_true_yes)

Mean:  0.13282608695652173 0.11637681159420289
Median:  0 0


**Question 3.8.** Comment on your results from above. Are the results from your mechanism useable? What can you say about the privacy when it comes to randomized response? Comment on the distributions of your data.

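Answer: Both the mean and the median are fairly accurate: the estimated mean (about 0.133) is close to the true mean (about 0.116), and both medians are 0. The results are therefore usable for aggregate statistics, while the privacy of individuals is protected and attribute disclosure is prevented, because any single reported answer may have been flipped. The distributions of the estimated and true responses are very similar, with the large majority of values equal to 0 in both.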
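Note that the raw mean of a randomized column is pulled toward 0.5, so it is the debiased mean that should be compared with the true mean. The sketch below illustrates this on synthetic data; the 12% true rate, the sample size, and all variable names are assumptions made for the example and are not taken from the COMPAS file.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
p_truth = 5 / 6                 # die mechanism: truthful with probability 5/6

# Synthetic ground truth (an assumption): 12% of 100,000 respondents are 1.
truth = np.zeros(100_000, dtype=int)
truth[:12_000] = 1

# Flip each answer independently with probability 1 - p_truth.
flip = rng.random(truth.size) >= p_truth
reported = np.where(flip, 1 - truth, truth)

raw_mean = reported.mean()
debiased_mean = (raw_mean - (1 - p_truth)) / (2 * p_truth - 1)

print(f"true mean:     {truth.mean():.4f}")
print(f"raw mean:      {raw_mean:.4f}")
print(f"debiased mean: {debiased_mean:.4f}")
```

The raw mean lands well above the true rate, while the debiased mean recovers it up to sampling noise.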

## Part 4: Adding Noise

We are going to use the non-binary feature that you chose earlier and add noise to it. To do this, we apply the same steps that we would for differential privacy: $f(D) + Z$.

**Question 4.1.** Suppose the function we wish to query is **count**. What would the global sensitivity, $S(f)$, be? Explain why.

Answer: For the count query, we count the total occurrences of each unique value of a specific feature in the dataset. Adding or removing one person changes the count of exactly one value by 1 and leaves every other count unchanged. Therefore, the global sensitivity of the count query is 1.
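The sensitivity argument can be checked by brute force on a small example: build every neighbouring dataset that differs by one person and record the largest change in any count. This is a sketch on toy data; `count_query`, `ages`, and `neighbours` are illustrative names.

```python
from collections import Counter

def count_query(records, value):
    """Number of records equal to `value`."""
    return Counter(records)[value]

# Toy data (an assumption, not taken from the COMPAS file).
ages = [23, 24, 24, 34, 43, 69]

# Neighbouring datasets: each one drops exactly one person.
neighbours = [ages[:i] + ages[i + 1:] for i in range(len(ages))]

# Largest change in any per-value count across all neighbouring datasets.
sensitivity = max(
    abs(count_query(ages, v) - count_query(nb, v))
    for nb in neighbours
    for v in set(ages)
)
print(sensitivity)  # prints 1
```

The result does not depend on the particular values chosen, because removing one record can only lower the count of that record's own value, and only by one.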

**Question 4.2.** In the cell below, write a function that adds Laplace noise to a value given the sensitivity and the privacy parameter, epsilon. Name your function `add_laplace_noise`.

In [8]:
# YOUR CODE HERE
import numpy as np
def add_laplace_noise(value, sensitivity, epsilon):
    scale = sensitivity/epsilon
    noise = np.random.laplace(0,scale)
    return value+noise

Since our query is on count, we'll have to obtain the count of each unique value of our feature.

**Question 4.3.** Define a variable that holds the count of each unique value of your feature. *Hint: there's a method that does this for us!*

In [9]:
# YOUR CODE HERE
unqiue_value_count = df["age"].value_counts()

**Question 4.4.** Before we add noise, let's calculate some stats from your variable from 4.3. Calculate the mean, median, and count and name them `mean_count`, `median_count`, and `total_count` respectively.

In [10]:
# YOUR CODE HERE
import statistics
mean_count = statistics.mean(df["age"])
median_count = statistics.median(df["age"])
total_count = len(df["age"])

print(f"Mean of true counts: {mean_count}")
print(f"Median of true counts: {median_count}")
print(f"Total count of true records: {total_count}")

Mean of true counts: 34.655072463768114
Median of true counts: 31.0
Total count of true records: 6900


**Question 4.5.** Now we can start adding noise. Use your Laplace function and your variable from 4.3. to calculate a noisy representation of each value in your feature. Set your value of epsilon to 1 for now.

In [11]:
# YOUR CODE HERE
df['age'] = df['age'].astype(float)
age_noisy = []
for i in range(0, len(df["age"])):
    value = df.iloc[i]["age"]
    count = unqiue_value_count[value]
    age_noisy.append(add_laplace_noise(df.iloc[i]["age"], count, 1))
df["age_noisy"] = age_noisy

**Question 4.6.** Now calculate the stats for your noisy values. Calculate the mean, median, and count and name them `mean_noisy_count`, `median_noisy_count`, and `total_noisy_count` respectively.

In [13]:
# YOUR CODE HERE
import statistics
mean_noisy_count = statistics.mean(df["age_noisy"])
median_noisy_count = statistics.median(df["age_noisy"])
total_noisy_count = len(df["age_noisy"])


print(f"Noisy mean of true counts: {mean_noisy_count}")
print(f"Noisy median of true counts: {median_noisy_count}")
print(f"Total noisy count of true records: {total_noisy_count}")

Noisy mean of true counts: 26.600354413940735
Noisy median of true counts: 41.24017792959074
Total noisy count of true records: 6900


**Question 4.7.** Comment on the differences in aggregate statistics between the original and the noisy values. What can you say about the utility and privacy?  

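Answer: The mean of the noisy count and the true count are similar: in one run the noisy mean was 36 against a true mean of about 34.7. On the other hand, the median differs more because of the noise added to the middle value: the true median is 31, whereas the noisy median was 46 in that run. The noise is random, so these values change on every run; the run shown above gives a noisy mean of about 26.6 and a noisy median of about 41.2.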

**Question 4.8.** Go back to question 4.5. and change the value of epsilon. Repeat this until you notice a pattern in your aggregate statistics. What happens as epsilon changes? What happens to the privacy and the accuracy?

Answer: As epsilon increases, the mean and median of the noisy count get closer to those of the true count. This means that increasing epsilon increases the accuracy but decreases the privacy; decreasing epsilon has the opposite effect.
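The pattern can be illustrated with a small sweep over epsilon. The sketch below assumes the noise is added to the per-value counts with a sensitivity of 1, as derived in 4.1; the toy ages, the seed, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy ages (an assumption); value_counts gives one count per unique age.
ages = pd.Series(rng.integers(18, 70, size=5_000))
true_counts = ages.value_counts().sort_index()

sensitivity = 1  # one person changes a count by at most 1

mean_abs_error = {}
for epsilon in (0.01, 0.1, 1.0, 10.0):
    # Laplace noise with scale = sensitivity / epsilon, one draw per count.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(true_counts))
    noisy_counts = true_counts + noise
    mean_abs_error[epsilon] = float(np.abs(noisy_counts - true_counts).mean())
    print(f"epsilon={epsilon:<5}  mean |noisy - true| = {mean_abs_error[epsilon]:.3f}")
```

The expected absolute value of Laplace noise equals its scale, so the average error should fall roughly in proportion to 1/epsilon.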

# Rubric

| Question | Points|
|----------|----------|
| 1.1.   | 5   |
| 1.2.    | 10   |
| 2.1.    | 5   |
| 3.1.   | 5   |
| 3.2.    | 10  |
| 3.3.  | 10   |
| 3.4.    | 2   |
| 3.5.   | 8   |
| 3.6.   | 5   |
| 3.7.   | 6   |
| 3.8.   | 8   |
| 4.1.   | 2   |
| 4.2.    | 10  |
| 4.3.  | 5   |
| 4.4.    | 3   |
| 4.5.   | 5   |
| 4.6.   | 3   |
| 4.7.   | 5   |
| 4.8.   | 8   |
| Total  | 115   |


