<!-- # CMPUT 200 Fall 2024  Ethics of Data Science and AI
 -->
# Assignment 1: Applying and Analyzing Privacy Techniques

***
- **Dataset**: loan.csv
- **FIRST name**: Abimbola
- **LAST name**: Olarinde
- **Student ID**: 1880229

Leave blank if individual:
- **Collaborator names**:
- **Collaborator student IDs**:
***

In this assignment you will apply basic privacy techniques to a dataset of your choosing. By the end of this assignment, you should be able to:
1. Understand and implement randomized response on binary data;
2. Calculate the sensitivity of a non-binary feature;
3. Add noise to a non-binary feature;
4. Compute aggregate statistics.

### Instructions

- **Collaboration**: You may choose to work alone or in a group of two. If you choose to work with a partner, you must keep the same partner for both assignments.
If you work with a partner, your submission must be the work of your team, and both of you should submit on Canvas. If you work alone, you must submit your own work.
The collaboration policy for the assignments is a variation of Consultation Collaboration. Under this policy, you may verbally discuss concepts with your classmates, without exchanging written text, code, or detailed advice. You must develop your own solution and submit your own work (as a group or individually). All sources of information used, including books, websites, and students you talked to, must be cited in the submission. We will adhere to current Faculty of Science guidelines on dealing with suspected cases of plagiarism.

- **Software**: We highly recommend that students use Google Colab for completing labs and assignments. This is the software used by the TAs in the course, and we can guarantee that there will be no issues with incompatible environments or imports.
- **Filling out the Notebook**: You must use this notebook to complete your lab. You will execute the questions in the notebook. The questions might ask for a short answer in text form or for you to write and execute a piece of code. Make sure you enter your answer in either case only in the cell provided.

- **Important**: Do not use a different cell, do not delete cells, and do not create a new cell. Creating new cells for your code is not compatible with the auto-grading system we are using, so your assignment will not be graded properly and you will lose marks for that question. As a reminder, you must remove the `raise NotImplementedError()` statements from each question when answering.

- **Rules for Datasets**: Any datasets used in the lab cannot be imported from cloud storage (e.g., Google Drive) and must be read from a file either on your local computer or uploaded to the Google Colab notebook. Importing from cloud storage will result in a zero.

- **Submission Formatting**: When you are done, you will submit your work from the notebook. Make sure to save your notebook before running it, and then submit on Canvas the notebook file with your work completed. Name your file with your Student ID number, followed by an underscore and A plus the assignment number (e.g., 1234567_A1.ipynb). Failure to do so will result in your final score being reduced by 50%! Finally, your name must be written at the top of the lab or assignment document.

In [3]:
# Run this cell to set up; Please don't change this cell.

import numpy as np
from numpy.random import default_rng
rng = default_rng()
import pandas as pd
from scipy.optimize import minimize

# These lines do some fancy plotting magic.
import matplotlib
# This is a magic function that renders the figure in the notebook, instead of displaying a dump of the figure object.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Part 1: Data

**Question 1.1.** We will first load the data, carry out some cleaning and pre-processing, and inspect the data to understand what exploratory steps we will take. Name the DataFrame `df`.

In [4]:
# YOUR CODE HERE
df =  pd.read_csv('loan.csv')

print("Shape: ", df.shape)
df.head(5)

Shape:  (61, 8)


Unnamed: 0,age,gender,occupation,education_level,marital_status,income,credit_score,loan_status
0,32,Male,Engineer,Bachelor's,Married,85000,720,Approved
1,45,Female,Teacher,Master's,Single,62000,680,Approved
2,28,Male,Student,High School,Single,25000,590,Denied
3,51,Female,Manager,Bachelor's,Married,105000,780,Approved
4,36,Male,Accountant,Bachelor's,Married,75000,710,Approved


**Question 1.2.** Describe your data and its purpose. Identify one variable that is binary or that could be classified into a binary feature. Identify another that is not a binary feature.


YOUR ANSWER HERE


My dataset contains loan application records with demographic and financial details about individuals. Its purpose is to analyze the factors that influence loan approval decisions, using the attributes age, gender, occupation, education level, marital status, income, credit score, and loan status. A binary feature is the `loan_status` column, which takes the values "Approved" or "Denied". A non-binary feature is the `credit_score` column, which holds numerical values rather than two categories.

**Question 1.3.** If your data has missing values or empty rows, remove them in the cell below. If the feature that you chose above has to be classified into a binary feature, convert it. Finally, if your binary feature is categorical, convert it to numerical.

In [5]:
# YOUR CODE HERE
df = df.dropna()

# Convert a categorical binary feature to numerical
df['loan_status'] = df['loan_status'].map({'Approved': 0 , 'Denied': 1})
df


Unnamed: 0,age,gender,occupation,education_level,marital_status,income,credit_score,loan_status
0,32,Male,Engineer,Bachelor's,Married,85000,720,0
1,45,Female,Teacher,Master's,Single,62000,680,0
2,28,Male,Student,High School,Single,25000,590,1
3,51,Female,Manager,Bachelor's,Married,105000,780,0
4,36,Male,Accountant,Bachelor's,Married,75000,710,0
...,...,...,...,...,...,...,...,...
56,39,Male,Architect,Master's,Married,100000,770,0
57,25,Female,Receptionist,High School,Single,32000,570,1
58,43,Male,Banker,Bachelor's,Married,95000,760,0
59,30,Female,Writer,Master's,Single,55000,650,0


## Part 2: Randomized Response

Now let's implement a Randomized Response mechanism. First we have to identify what the query on our **binary** variable will be. Then we can create our own randomized mechanism.

**Question 2.1.** Write a query on your binary feature.





YOUR ANSWER HERE



My binary feature is `loan_status`, which indicates whether a loan was Approved (0) or Denied (1). A query I can perform on this feature is: 'How many loan applications were denied?' This query will later be used in a randomized response mechanism to add privacy to the responses.

**Question 2.2.** Create your own randomized response mechanism for the query you defined above. You may NOT use the coin example from class, try to be creative! Ensure the mechanism adheres strictly to a 0.5 probability for truthfulness.

YOUR ANSWER HERE

To implement a randomized response mechanism for the query 'Was the loan denied?', I will use a six-sided die instead of a coin. The respondent rolls the die: on a 1, 2, or 3 they answer truthfully, and on a 4, 5, or 6 they give a random answer (50% chance each of 'Approved' or 'Denied').

The truthful branch is taken with probability 3/6 = 0.5, which satisfies the 50% truthfulness requirement. Counting the random answers that happen to match the truth, a reported answer equals the true answer with probability 0.5 + 0.5 × 0.5 = 0.75.

**Question 2.3.** Implement a function for your mechanism in 2.2. in the code cell below. The function should accept a value `0` or `1` and return the reported answer according to the randomized response above. Name your function `rand_resp`. Ensure that the probability of truthfulness is exactly 0.5 in the mechanism.

In [6]:

# Function to apply the randomized response mechanism
def rand_resp(true_value):
    """
    Implements a randomized response mechanism for loan_status.
    
    Args:
    true_value (int): The actual loan status (0 = Approved, 1 = Denied)
    
    Returns:
    int: The reported loan status after applying the randomized mechanism.
    """
    
    # Draw a uniform random number in [0, 1); this stands in for the die roll
    number = rng.random()
    
    # If number < 0.5 (a roll of 1, 2, or 3), return the true value (50% chance of truthfulness)
    if number < 0.5:
        return true_value
    
    # Otherwise (a roll of 4, 5, or 6), return a random value, 0 or 1 with equal probability
    return rng.choice([0, 1])


**Question 2.4.** For each value in your dataframe's binary feature column, call your function. Store the results in a new column in df named `rrc1`.

In [7]:
# YOUR CODE HERE
df['rrc1'] = df['loan_status'].apply(rand_resp)




**Question 2.5.** Now get the **estimate** for the true number of people who answered `1`.  Write this result into the variable `count_est_true_yes`.  Given the number of reported `1`'s, we know how to estimate the proportion or number of true `1`'s.

Calculate the true number of people who answered `1` (the true answer in the data we imported) and write it into the variable `count_true_yes`.

In [8]:
# YOUR CODE HERE

count_true_yes = df['loan_status'].sum()
# Estimated count of "true" 1's based on the randomized response mechanism
reported_ones = df['rrc1'].sum()
N = len(df)

p_truthful = 0.5  # 50% of people report the true value
p_random = 0.25   # 25% probability of random 1s

# Corrected estimation formula
count_est_true_yes = (reported_ones - (p_random * N)) / p_truthful


print("Estimated true yes count: ", count_est_true_yes)
print("True yes count: ", count_true_yes)

Estimated true yes count:  13.5
True yes count:  16
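
The estimation formula used in this cell can be derived in one step. As a sketch, let $N$ be the number of respondents, $T$ the true number of 1s, and $R$ the number of reported 1s ($T$ and $R$ are symbols introduced here for illustration). Each respondent answers truthfully with probability 0.5 and otherwise reports 1 with probability 0.5, so:

$$
\begin{aligned}
\mathbb{E}[R] &= 0.5\,T + 0.5 \cdot 0.5\,N = 0.5\,T + 0.25\,N \\
\hat{T} &= \frac{R - 0.25\,N}{0.5}
\end{aligned}
$$

For the run shown above, the printed estimate of 13.5 with $N = 61$ corresponds to $R = 22$ reported 1s, since $(22 - 15.25)/0.5 = 13.5$.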


**Question 2.6.** Comment on your results from above. What can you say about the privacy-accuracy tradeoff of your randomized response mechanism?

YOUR ANSWER HERE


The randomized response mechanism trades accuracy for privacy. Each reported answer may have been generated at random, so no individual response can be taken at face value, but the same randomness adds variance to the estimate: here the estimated count of denials is 13.5 against a true count of 16. Using a lower probability of truthfulness would strengthen privacy and reduce accuracy further, and vice versa. With more respondents the random noise averages out, so the estimate becomes more reliable in larger datasets.
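
The claim about larger datasets can be checked with a small simulation. The sketch below is not part of the graded answer: it assumes synthetic data with a hypothetical true rate of 0.25, and the function name `simulate_relative_error` and its parameters are made up for illustration. It applies the same 50% truthful mechanism and measures the average error of the estimated proportion for several dataset sizes.

```python
import numpy as np

def simulate_relative_error(n, true_rate=0.25, trials=100, seed=0):
    """Average absolute error of the estimated proportion of 1s over repeated surveys."""
    sim_rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        truth = sim_rng.random(n) < true_rate       # synthetic true answers (assumption)
        truthful = sim_rng.random(n) < 0.5          # answer truthfully half the time
        random_bit = sim_rng.random(n) < 0.5        # otherwise report a fair random bit
        reported = np.where(truthful, truth, random_bit)
        # Same correction as in Question 2.5: (reported 1s - 0.25 * N) / 0.5
        est_count = (reported.sum() - 0.25 * n) / 0.5
        errors.append(abs(est_count - truth.sum()) / n)
    return float(np.mean(errors))

for n in (61, 1_000, 100_000):
    print(n, round(simulate_relative_error(n), 4))
```

The error should shrink roughly in proportion to one over the square root of the number of respondents, which is why a dataset of 61 rows gives a fairly rough estimate.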

**Question 2.7.** We learned in class that data analysts are still able to obtain aggregate statistics from the results of a randomized response survey. Use the column you made above, `rrc1`, to calculate the mean and median for the estimated true responses. Do the same for your true responses. Name your answers `mean_est_true_yes`, `mean_true_yes`, `median_est_true_yes`, and `median_true_yes` respectively.

In [9]:
# YOUR CODE HERE

# Mean values
mean_est_true_yes = df['rrc1'].mean()
mean_true_yes = df['loan_status'].mean()

# Median values
median_est_true_yes = df['rrc1'].median()
median_true_yes = df['loan_status'].median()


print("Mean: ", mean_est_true_yes, mean_true_yes)
print("Median: ", median_est_true_yes, median_true_yes)

Mean:  0.36065573770491804 0.26229508196721313
Median:  0.0 0.0


**Question 2.8.** Comment on your results from above. Are the results from your mechanism useable? What can you say about the privacy when it comes to randomized response? Comment on the distributions of your data.

YOUR ANSWER HERE

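The reported mean (about 0.36) is higher than the true mean (about 0.26) because half of the answers are replaced by a fair random bit, which pulls the reported proportion toward 0.5. Correcting for this known bias, as in Question 2.5, gives an estimate that is reasonably close to the truth (13.5 versus 16 denials), so the aggregate results remain usable. The true responses are skewed toward 0 (Approved), and the reported column keeps that skew in a weakened form. For privacy, any single reported answer has plausible deniability, since it may have been generated at random, while researchers can still analyze trends across the whole dataset.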
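
The bias in the reported mean can be removed with the same correction used for the count. The following is a minimal sketch, assuming the 50% truthful mechanism from Question 2.2; the helper name `debias_mean` and its parameters are hypothetical and not part of the assignment template.

```python
def debias_mean(reported_mean, p_truth=0.5, p_one=0.5):
    """Invert E[reported] = p_truth * true_mean + (1 - p_truth) * p_one."""
    return (reported_mean - (1 - p_truth) * p_one) / p_truth

# 22 reported 1s out of 61 rows matches the reported mean printed above (0.3606...)
reported_mean = 22 / 61
corrected_mean = debias_mean(reported_mean)
print(corrected_mean)
```

The corrected value equals the estimated count from Question 2.5 divided by the number of rows (13.5 / 61), so it is the proportion that should be compared with the true mean.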

## Part 3: Adding Noise

We are going to use the non-binary feature that you chose earlier and add noise to it. To do this, we apply the same steps that we would for differential privacy: $f(D) + Z$.

**Question 3.1.** Suppose the function we wish to query is **count**. What would the global sensitivity, $S(f)$, be? Explain why.

YOUR ANSWER HERE

The global sensitivity $S(f)$ of the count function is 1, because adding or removing one individual's record changes any count by at most 1.
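
This can be illustrated on a toy example. The sketch below uses made-up scores rather than values from loan.csv, and the helper `count_of` is hypothetical. It removes each record in turn (producing every neighbouring dataset) and checks how far the count moves.

```python
import pandas as pd

# Hypothetical toy data, not taken from loan.csv
scores = pd.Series([720, 720, 680, 590, 720])

def count_of(series, value):
    """Count query: number of records equal to `value`."""
    return int((series == value).sum())

full_count = count_of(scores, 720)

# Neighbouring datasets: drop each record in turn and recompute the count
diffs = [abs(full_count - count_of(scores.drop(index=i), 720)) for i in scores.index]
max_change = max(diffs)
print(max_change)
```

Removing a record that matches the query lowers the count by 1, and removing any other record leaves it unchanged, so the largest possible change is 1.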

**Question 3.2.** In the cell below, write a function that adds Laplace noise to a value given the sensitivity and the privacy parameter, epsilon. Name your function `add_laplace_noise`.

In [10]:
# YOUR CODE HERE

def add_laplace_noise(value, sensitivity=1, epsilon=1.0):
    """
    Adds Laplace noise to a given value for differential privacy.

    Args:
    value (float): The true count value to which noise is added.
    sensitivity (float): The global sensitivity of the function (default: 1 for count).
    epsilon (float): The privacy parameter (higher epsilon = less noise, default: 1.0).

    Returns:
    float: The value with added Laplace noise.
    """
    # Generate Laplace noise with the shared generator `rng` from the setup cell
    noise = rng.laplace(loc=0, scale=sensitivity/epsilon)
    
    # Return value with added noise
    return value + noise



Since our query is on count, we'll have to obtain the count of each unique value of our feature.

**Question 3.3.** Define a variable that holds the count of each unique value of your feature. *Hint: there's a method that does this for us!*

In [11]:
# YOUR CODE HERE
# Define a variable to store the count of each unique credit_score value
credit_score_counts = df['credit_score'].value_counts()




**Question 3.4.** Before we add noise, let's calculate some stats from your variable from 3.3. Calculate the mean, median, and count and name them `mean_count`, `median_count`, and `total_count` respectively.

In [12]:
# YOUR CODE HERE
mean_count = credit_score_counts.mean()
median_count = credit_score_counts.median()
total_count = credit_score_counts.sum()



print(f"Mean of true counts: {mean_count}")
print(f"Median of true counts: {median_count}")
print(f"Total count of true records: {total_count}")

Mean of true counts: 2.1785714285714284
Median of true counts: 2.0
Total count of true records: 61


**Question 3.5.** Now we can start adding noise to the dataset. Use your Laplace function and your variable from 3.3. to calculate a noisy representation of each value in your feature. Set your value of epsilon to 1 for now.

In [13]:
# YOUR CODE HERE

# Apply Laplace noise to each unique credit_score count
noisy_credit_score_counts = credit_score_counts.apply(lambda x: add_laplace_noise(x, sensitivity=1, epsilon=1.0))

# Display the noisy credit score counts
print(noisy_credit_score_counts)


credit_score
720    5.275484
740    5.846312
760    4.654777
780    4.212238
790    2.344271
750    2.715986
770    2.306841
700    2.484512
800    2.366021
710    1.794576
570   -0.659263
810    1.702241
600    2.921430
670    1.108534
630    0.586865
650   -0.514650
680    1.927504
730    0.417890
690   -1.034217
620    1.072538
640    2.507929
590   -0.504505
580    1.035264
610    0.031076
820   -0.656658
560   -0.026635
660    1.001825
830    0.296058
Name: count, dtype: float64


**Question 3.6.** Now calculate the stats for your noisy values. Calculate the mean, median, and count and name them `mean_noisy_count`, `median_noisy_count`, and `total_noisy_count` respectively.

In [14]:
# YOUR CODE HERE

mean_noisy_count = noisy_credit_score_counts.mean()
median_noisy_count = noisy_credit_score_counts.median()
total_noisy_count = noisy_credit_score_counts.sum()  # total is the sum of the noisy counts, not their mean

print(f"Noisy mean of true counts: {mean_noisy_count:.4f}")
print(f"Noisy median of true counts: {median_noisy_count:.4f}")
print(f"Total noisy count of true records: {total_noisy_count:.4f}")

Noisy mean of true counts: 1.6148
Noisy median of true counts: 1.4054
Total noisy count of true records: 45.2142


**Question 3.7.** Comment on the differences in aggregate statistics between the original and the noisy values. What can you say about the utility and privacy?  

YOUR ANSWER HERE

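The noisy statistics differ noticeably from the originals: the mean count drops from about 2.18 to about 1.61 and the median from 2.0 to about 1.41, and summing the 28 noisy counts from 3.5 gives roughly 45.2 instead of 61. Each true count is small (the mean is about 2.2), so Laplace noise with scale 1 is comparable in size to the values themselves, and several noisy counts even come out negative. The overall shape is still partly visible, since the most frequent scores (720, 740, 760, 780) keep the largest noisy counts. This illustrates the tension between utility and privacy: with epsilon = 1 on a dataset of only 61 records, individual records are well protected, but the aggregate statistics lose a substantial amount of precision.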

**Question 3.8.** Go back to question 3.5. and change the value of epsilon. Repeat this until you notice a pattern in your aggregate statistics. What happens as epsilon changes? What happens to the privacy and the accuracy?

YOUR ANSWER HERE

As epsilon increases, less noise is added, making the noisy statistics more accurate but reducing privacy. When epsilon decreases, more noise is added, increasing privacy but making the statistics less reliable. This highlights the tradeoff between accuracy and privacy: higher epsilon values improve data utility, while lower epsilon values provide stronger privacy protection.
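
The pattern can be made concrete by measuring the noise directly. The sketch below is an illustration under assumptions: it uses a hypothetical true count of 10 (not a value from loan.csv), a sensitivity of 1, and a seeded generator named `demo_rng` so the run is repeatable. For each epsilon it draws many Laplace samples and reports the average absolute error.

```python
import numpy as np

demo_rng = np.random.default_rng(42)  # seeded only so the sketch is repeatable
sensitivity = 1.0
true_count = 10.0                     # hypothetical count, not from loan.csv

mean_abs_error = {}
for epsilon in (0.1, 1.0, 10.0):
    scale = sensitivity / epsilon     # Laplace scale b = S(f) / epsilon
    noisy = true_count + demo_rng.laplace(loc=0.0, scale=scale, size=20_000)
    mean_abs_error[epsilon] = float(np.mean(np.abs(noisy - true_count)))
    print(f"epsilon={epsilon}: mean absolute error ~ {mean_abs_error[epsilon]:.3f}")
```

The expected absolute error of Laplace noise equals its scale, so the error should be close to sensitivity / epsilon: around 10 for epsilon = 0.1, around 1 for epsilon = 1, and around 0.1 for epsilon = 10.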

# Rubric

| Question | Points|
|----------|----------|
| 1.1.   | 5   |
| 1.2.    | 10   |
| 1.3.    | 5   |
| 2.1.   | 5   |
| 2.2.    | 10  |
| 2.3.  | 10   |
| 2.4.    | 2   |
| 2.5.   | 8   |
| 2.6.   | 5   |
| 2.7.   | 6   |
| 2.8.   | 8   |
| 3.1.   | 2   |
| 3.2.    | 10  |
| 3.3.  | 5   |
| 3.4.    | 3   |
| 3.5.   | 5   |
| 3.6.   | 3   |
| 3.7.   | 5   |
| 3.8.   | 8   |
| Total  | 115   |


