<!-- # CMPUT 200 Fall 2024  Ethics of Data Science and AI
 -->
# Assignment 1: Applying and Analyzing Privacy Techniques

***
- **Dataset**: Titanic
- **FIRST name**: Hooriya
- **LAST name**: Kazmi
- **Student ID**: 1780094

Leave blank if individual:
- **Collaborator names**: Jillian Kriwokon
- **Collaborator student IDs**: 1765983
***

In this assignment you will apply basic privacy techniques to a dataset of your choosing. By the end of this assignment, you should be able to:
1. Understand and implement randomized response on binary data;
2. Calculate the sensitivity of a non-binary feature;
3. Add noise to a non-binary feature;
4. Compute aggregate statistics.

### Instructions
<!-- **Deadline.**  This assignment is due at ****.  Please check the syllabus for late submissions. -->
You are expected to write clear, detailed, and complete answers when analysing your data. Lack of this may result in point deductions.

**Reminder.** You must submit your own work. The collaboration policy for the assignments is Group Collaboration. You may work together in groups of up to 2.
On Canvas we added a group set called "Assignment Group". We have enabled self-signup so that you can sign up as a group. If you don't select your group by Wednesday Sep 25th 11:59pm, we will assume you are working on your own.
Under the group collaboration policy, besides working with your group member, you can discuss concepts with your peers. However, the work must be your own group's, and all sources of information used, including books, websites, and students you talked to, must be cited in the submission. Please see the course FAQ document for details on this collaboration policy. We will adhere to current Faculty of Science guidelines on dealing with suspected cases of violation of academic integrity.

You must use this notebook to complete your assignment. You will execute the questions in the notebook. The questions might ask for a short answer in text form or for you to write and execute a piece of code. **Make sure you enter your answer in either case only in the cell provided.**  Do not use a different cell and do not create a new cell. Do not delete any test cells.  Creating new cells for your code and deleting test cells is not compatible with the auto-grading system we are using and thus your assignment will not be graded properly and you will lose marks for that question.

Your submitted notebook should run on our local installation.  So if you are importing packages not listed in the notebook or using local data files not included in the assignment package, make sure the notebook is self-contained with a requirements.txt file or cells in the notebook itself to install the extra packages.  If we cannot run your notebook, you will lose 50% of the marks, and any additional marks that may be lost due to wrong answers.

### Submission Instructions
When you are done, you will submit your work from the notebook. Make sure to save your notebook before running it, and then submit on Canvas the notebook file with your work completed.

**IMPORTANT: Name your file with your *Student ID number* and the assignment number (ex: 1234567_A1.ipynb). Failure to do so will result in a zero!**

In [None]:
# Run this cell to set up; Please don't change this cell.

import numpy as np
from numpy.random import default_rng
rng = default_rng()
import pandas as pd
from scipy.optimize import minimize

# These lines do some fancy plotting magic.
import matplotlib
# This is a magic function that renders the figure in the notebook, instead of displaying a dump of the figure object.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Part 1: Data

**Question 1.1.** We will first load the data, carry out some cleaning and pre-processing, and inspect the data to understand what exploratory steps we will take. Name the DataFrame `df`.

In [None]:
# YOUR CODE HERE
df = pd.read_csv('titanic.csv', header=0)

print("Shape: ", df.shape)
df

Shape:  (891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


**Question 1.2.** Describe your data and its purpose. Identify one variable that is binary or that could be classified into a binary feature. Identify another that is not a binary feature.

The dataset describes the passengers on board the Titanic and records whether each one survived. It can be used to explore relationships between passenger characteristics (age, class, and sex) and survival outcomes, both in statistical analysis and in machine learning models that predict survival.
A binary feature is `Survived`, coded as 0 if the passenger did not survive and 1 if they did. Passenger class (`Pclass`) is a non-binary feature: it takes one of three numerical values (1, 2, or 3) according to the class the passenger travelled in on the Titanic.

## Part 2: Data pre-processing

**Question 2.1.** If your data has missing values or empty rows, remove them in the cell below. If the feature that you chose above has to be classified into a binary feature, convert it. Finally, if your binary feature is categorical, convert it to numerical.

In [None]:
df.dropna(inplace=True)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


## Part 3: Randomized Response

Now let's implement a Randomized Response mechanism. First we have to identify what the query on our **binary** variable will be. Then we can create our own randomized mechanism.

**Question 3.1.** Write a query on your binary feature.

The query will be "Did the passenger survive?"
1 indicates survived
0 indicates didn't survive

**Question 3.2.** Create your own randomized response mechanism for the query you defined above. You may NOT use the coin example from class, try to be creative!

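A card is drawn for each individual to decide whether their value for `Survived` is reported as is. Pink and black cards are equally likely. If a pink card is drawn, the true value is reported unchanged. If a black card is drawn, the true value is ignored and a 0 or 1 is reported at random, each with equal probability.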
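
The reporting probabilities of this mechanism can be worked out directly. The sketch below assumes that a black card replaces the answer with a fair random bit, as the function in 3.3 does; the helper name `report_probability` and the parameter `p_keep` are illustrative and not part of the assignment.

```python
import math

def report_probability(true_value, p_keep=0.5):
    """Probability of reporting 1, given the true value (0 or 1).

    With probability p_keep the true value is reported (pink card);
    otherwise a fair random bit is reported (black card).
    """
    return p_keep * true_value + (1 - p_keep) * 0.5

p_report_1_given_1 = report_probability(1)  # 0.75
p_report_1_given_0 = report_probability(0)  # 0.25

# The ratio of these probabilities bounds what a single report reveals;
# its natural log is the privacy parameter of the mechanism.
epsilon = math.log(p_report_1_given_1 / p_report_1_given_0)
print(p_report_1_given_1, p_report_1_given_0, round(epsilon, 4))
```

Under these assumptions a true 1 is reported as 1 three quarters of the time and a true 0 is reported as 1 one quarter of the time, which corresponds to epsilon = ln 3.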

**Question 3.3.** Implement a function for your mechanism in 3.2. in the code cell below. The function should accept a value `0` or `1` and return the reported answer according to the randomized response above. Name your function `rand_resp`.

In [None]:
import random

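def randomized_response(x):
    # Draw a card: pink and black are equally likely.
    card_draw = random.choice(["pink", "black"])

    # If the pink card is drawn, keep the original value.
    if card_draw == "pink":
        return x
    # Otherwise (black card), report a uniformly random 0 or 1.
    return random.randint(0, 1)


# Question 3.3 asks for the name `rand_resp`; expose it as an alias.
rand_resp = randomized_response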


**Question 3.4.** For each value in your dataframe's binary feature column, call your function. Store the results in a new column in df named `rrc1`.

In [None]:
df['rrc1'] = df['Survived'].apply(randomized_response)

**Question 3.5.** Now get the **estimate** for the true number of people who answered `1`.  Write this result into the variable `count_est_true_yes`.  Given the number of reported `1`'s, we know how to estimate the proportion or number of true `1`'s.

Calculate the true number of people who answered `1` (the true answer in the data we imported) and write it into the variable `count_true_yes`.

In [None]:
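# z: number of reported 1s
z = df['rrc1'].sum()
count_true_yes = df['Survived'].sum()
# n: total number of participants
n = len(df)
# Each report is the true value with probability 1/2 and a fair random bit otherwise,
# so E[z] = count_true / 2 + n / 4; solving for the true count gives 2 * z - n / 2.
count_est_true_yes = 2 * z - n / 2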

print("Estimated true yes count: ", count_est_true_yes)
print("True yes count: ", count_true_yes)
error_value = round(100*(count_true_yes - count_est_true_yes) / count_true_yes, 2)
print("Error Value: ", error_value)

Estimated true yes count:  108.5
True yes count:  123
Error Value:  11.79


**Question 3.6.** Comment on your results from above. What can you say about the privacy-accuracy tradeoff of your randomized response mechanism?

The error value printed above is 11.79, meaning the estimate (108.5) falls about 11.79% below the true count of 123. This is the relative error of the aggregate estimate for this particular run, not the fraction of individual responses that were changed, and it will differ each time the cell is rerun because the mechanism is random. With probability 1/2 a reported value is replaced by a random bit, so a single reported answer cannot be taken as that person's true answer, yet the aggregate estimate stays reasonably close to the truth. A mechanism that randomised more often would give more privacy but a larger error; one that randomised less often would be more accurate but less private. To protect privacy further we can introduce more noise, at the cost of accuracy.
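
How much the estimate varies from run to run can be explored with a small simulation. The sketch below assumes the keep-or-randomise mechanism implemented in 3.3 and reuses the counts printed above (123 true yes answers among 183 rows); the function name `simulate_estimates`, the number of runs, and the seed are illustrative choices, not part of the assignment.

```python
import numpy as np

def simulate_estimates(true_yes, n, runs=2000, seed=0):
    """Repeat the randomized response survey and return one estimate per run."""
    rng = np.random.default_rng(seed)
    truth = np.zeros(n, dtype=int)
    truth[:true_yes] = 1

    # Pink card (keep the true value) with probability 1/2.
    keep = rng.random((runs, n)) < 0.5
    # Black card: report a fair random bit instead.
    random_bits = rng.integers(0, 2, size=(runs, n))
    reported = np.where(keep, truth, random_bits)

    z = reported.sum(axis=1)   # reported 1s in each run
    return 2 * z - n / 2       # same estimator as in 3.5

estimates = simulate_estimates(true_yes=123, n=183)
print("mean of estimates:", estimates.mean())
print("std of estimates: ", estimates.std())
```

Under these assumptions the estimates centre on the true count, with a standard deviation of roughly 12, so a gap of about 14 between the estimate and the true count in a single run is not unusual for a dataset of this size.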

**Question 3.7.** We learned in class that data analysts are still able to obtain aggregate statistics from the results of a randomized response survey. Use the column you made above, `rrc1`, to calculate the mean and median for the estimated true responses. Do the same for your true responses. Name your answers `mean_est_true_yes`, `mean_true_yes`, `median_est_true_yes`, and `median_true_yes` respectively.

In [None]:
mean_est_true_yes = df['rrc1'].mean()
mean_true_yes = df['Survived'].mean()
median_est_true_yes = df['rrc1'].median()
median_true_yes = df['Survived'].median()


print("Mean: ", mean_est_true_yes, mean_true_yes)
print("Median: ", median_est_true_yes, median_true_yes)

Mean:  0.546448087431694 0.6721311475409836
Median:  1.0 1.0


**Question 3.8.** Comment on your results from above. Are the results from your mechanism useable? What can you say about the privacy when it comes to randomized response? Comment on the distributions of your data.

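Yes, our results are usable. The mean of the reported responses (0.546) is lower than the mean of the true responses (0.672), which is expected because the mechanism pulls the reported proportion toward 0.5; applying the correction from 3.5 gives 2 × 0.546 − 0.5 ≈ 0.593, which is closer to the true value. The median is 1.0 in both cases. Both columns are binary, so each distribution is fully described by its proportion of 1s: the reported responses are close to evenly split, while the true responses contain noticeably more 1s than 0s. In this run the mechanism therefore underestimated the number of true yes values. In terms of privacy, any reported value may have come from a random draw, so no individual's reported answer can be taken as their true answer.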
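
The link between the mean of the reported column and the estimate from 3.5 can be checked with a short calculation. This is a minimal sketch that assumes the keep-or-randomise mechanism from 3.3 with a keep probability of 1/2; the helper name `debias_mean` is hypothetical.

```python
def debias_mean(reported_mean, p_keep=0.5):
    """Estimate the true proportion of 1s from the reported proportion.

    E[reported mean] = p_keep * true_mean + (1 - p_keep) * 0.5,
    so the true mean is recovered by inverting that relation.
    """
    return (reported_mean - (1 - p_keep) * 0.5) / p_keep

reported_mean = 0.546448087431694   # mean of rrc1 printed above
est_true_mean = debias_mean(reported_mean)
print("estimated true proportion:", est_true_mean)
print("as a count out of 183 rows:", est_true_mean * 183)
```

Multiplying the corrected proportion by the 183 rows gives back the estimated count of 108.5 from 3.5, which confirms that the two calculations agree.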

## Part 4: Adding Noise

We are going to use the non-binary feature that you chose earlier and add noise to it. To do this, we apply the same steps that we would for differential privacy: $f(D) + Z$.

**Question 4.1.** Suppose the function we wish to query is **count**. What would the global sensitivity, $S(f)$, be? Explain why.

Count is the number of individuals that belong to a group within a dataset. Adding or removing one individual changes the count of a single group by exactly one. If an individual instead switches groups, one group's count increases by one and the other's decreases by one, so each individual count still changes by at most one. The global sensitivity is therefore S(f) = 1.
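
The argument can be illustrated on a toy example. The sketch below assumes that neighbouring datasets differ by the removal of one record, and it only checks a small made-up list rather than proving the general result; `count_class`, `empirical_sensitivity`, and `toy_pclass` are hypothetical names.

```python
def count_class(values, target):
    """Count how many records belong to the target class."""
    return sum(1 for v in values if v == target)

def empirical_sensitivity(values, target):
    """Largest change in the count when any single record is removed."""
    base = count_class(values, target)
    return max(
        abs(base - count_class(values[:i] + values[i + 1:], target))
        for i in range(len(values))
    )

toy_pclass = [1, 1, 2, 3, 1, 3]   # made-up passenger classes
for target in (1, 2, 3):
    print(target, empirical_sensitivity(toy_pclass, target))
```

For every class the largest change is 1, in line with the reasoning above.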

**Question 4.2.** In the cell below, write a function that adds Laplace noise to a value given the sensitivity and the privacy parameter, epsilon. Name your function `add_laplace_noise`.

In [None]:
def add_laplace_noise(x, sensitivity, epsilon):
    # Laplace mechanism: the noise scale is sensitivity / epsilon.
    scale = sensitivity / epsilon
    # Add zero-mean Laplace noise and return the noisy value.
    return x + np.random.laplace(0, scale)

Since our query is on count, we'll have to obtain the count of each unique value of our feature.

**Question 4.3.** Define a variable that holds the count of each unique value of your feature. *Hint: there's a method that does this for us!*

In [None]:
total_count = df['Pclass'].value_counts()
total_count

Pclass,count
1,158
2,15
3,10


**Question 4.4.** Before we add noise, let's calculate some stats from your variable from 4.3. Calculate the mean, median, and count and name them `mean_count`, `median_count`, and `total_count` respectively.

In [None]:

mean_count = total_count.mean()
median_count = total_count.median()
total_count_sum = total_count.sum()

print(f"Mean of true counts: {mean_count}")
print(f"Median of true counts: {median_count}")
print(f"Total count of true records: {total_count_sum}")

Mean of true counts: 61.0
Median of true counts: 15.0
Total count of true records: 183


**Question 4.5.** Now we can start adding noise. Use your Laplace function and your variable from 4.3. to calculate a noisy representation of each value in your feature. Set your value of epsilon to 1 for now.

In [None]:
noisy_count = total_count.apply(lambda x: add_laplace_noise(x, 1, 1))
noisy_count.head()

Pclass,count
1,158.136389
2,13.953259
3,9.875628


**Question 4.6.** Now calculate the stats for your noisy values. Calculate the mean, median, and count and name them `mean_noisy_count`, `median_noisy_count`, and `total_noisy_count` respectively.

In [None]:
mean_noisy_count = noisy_count.mean()
median_noisy_count = noisy_count.median()
total_noisy_count = noisy_count.sum()

print(f"Noisy mean of true counts: {mean_noisy_count}")
print(f"Noisy median of true counts: {median_noisy_count}")
print(f"Total noisy count of true records: {total_noisy_count}")

print(f"difference between noisy and true mean: {mean_noisy_count - mean_count}")
print(f"difference between noisy and true median: {median_noisy_count - median_count}")

Noisy mean of true counts: 60.20186526916976
Noisy median of true counts: 16.05335901437156
Total noisy count of true records: 180.60559580750927
difference between noisy and true mean: -0.7981347308302418
difference between noisy and true median: 1.0533590143715585


**Question 4.7.** Comment on the differences in aggregate statistics between the original and the noisy values. What can you say about the utility and privacy?  

The noisy mean of 60.2 is close to the true mean of 61; the difference of about 0.8 reflects the small amount of noise added with epsilon = 1 and sensitivity 1. The noisy median (16.05) differs from the true median (15) by about 1.05. Noise shifts the measures of central tendency, but only slightly. Because the noisy values stay close to the true values, utility remains high and the data can still be used to draw meaningful conclusions. However, the low level of noise also means the released counts stay close to the true counts, so privacy is only moderately preserved. The mechanism is still effective, just not strongly so.

**Question 4.8.** Go back to question 4.5. and change the value of epsilon. Repeat this until you notice a pattern in your aggregate statistics. What happens as epsilon changes? What happens to the privacy and the accuracy?

As epsilon changes, a clear pattern appears. A smaller epsilon adds more noise, which gives more privacy but less accurate statistics. A larger epsilon adds less noise, which gives more accuracy but less privacy. In other words, accuracy increases with epsilon while privacy decreases with it, so there is a tradeoff between the two. Epsilon needs to be chosen carefully to strike a balance between them.
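
The pattern can be made concrete by sweeping epsilon and measuring the average size of the added noise. This is a minimal sketch that reuses the true counts from 4.3 and a sensitivity of 1; the list of epsilon values, the number of repetitions, and the fixed seed are illustrative assumptions.

```python
import numpy as np

true_counts = np.array([158, 15, 10])   # counts per Pclass from 4.3
sensitivity = 1
rng = np.random.default_rng(0)          # fixed seed, for illustration only

mean_abs_error = {}
for epsilon in [0.1, 0.5, 1.0, 5.0]:
    scale = sensitivity / epsilon
    # 5000 independent noisy releases of the three counts.
    noisy = true_counts + rng.laplace(0, scale, size=(5000, 3))
    mean_abs_error[epsilon] = np.abs(noisy - true_counts).mean()

for epsilon, err in mean_abs_error.items():
    print(f"epsilon={epsilon}: mean absolute error ~ {err:.2f}")
```

The mean absolute error of Laplace noise equals its scale, sensitivity / epsilon, so the error is about 10 at epsilon = 0.1 and about 0.2 at epsilon = 5. For the smallest groups (15 and 10 passengers), noise at the epsilon = 0.1 level is comparable to the counts themselves.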

# Rubric

| Question | Points|
|----------|----------|
| 1.1.   | 5   |
| 1.2.    | 10   |
| 2.1.    | 5   |
| 3.1.   | 5   |
| 3.2.    | 10  |
| 3.3.  | 10   |
| 3.4.    | 2   |
| 3.5.   | 8   |
| 3.6.   | 5   |
| 3.7.   | 6   |
| 3.8.   | 8   |
| 4.1.   | 2   |
| 4.2.    | 10  |
| 4.3.  | 5   |
| 4.4.    | 3   |
| 4.5.   | 5   |
| 4.6.   | 3   |
| 4.7.   | 5   |
| 4.8.   | 8   |
| Total  | 115   |


