# Module 2 - Data collection, validation and privacy


### Assignment overview

In this assignment, you will be exploring various aspects related to collecting data and identifying bias in datasets. You will also be asked to consider issues of data privacy and governance.

For this assignment, it is possible to work in **groups of up to 2 students**. 

### Group members
Leave blank if the group has fewer than 2 members:
- Student 1: Jingyuan Liu 69763183
- Student 2: Nicholas Tam 45695970

### Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:
1. Discuss the implications of data governance and data ownership in data science  
2. Argue the advantages and disadvantages of collecting individuals’ data online  
3. Distinguish between a sample and a population, what attributes make a representative sample and the possible ethical implications of a non-representative sample  
4. Explain the elements of experimental design 
5. Identify possible sources of bias in datasets (such as historical, measurement, and representation bias) 
6. Describe the ethical implications of variable choice in data science (e.g., use of proxies, use of gender and race as variables) 
7. Apply good practices for minimizing errors in data cleaning  
8. Apply methods for improving privacy and anonymity in stored data and data analysis, such as k-anonymity and randomized response 
9. Explain the notion of differential privacy 


# Part 1: Data collection, sampling and bias

In class, we discussed different sources of bias that can affect the data we want to use for our Data Science applications. Here is a summary:

### 1. Historical bias
**Historical bias:** bias that exists in society and is reflected in the data. It is the most insidious because it arises even if we are able to perfectly sample from the existing population. Most often, it affects groups that are historically disadvantaged.

E.g. In 2018, 5% of Fortune 500 CEOs were women. Historically, women have less frequently made it to a CEO position. A classifier trained to predict the best choice for a new CEO may learn this pattern and determine that being a woman makes one less qualified to be a CEO.

### 2. Representation bias
**Representation bias:** the sample underrepresents part(s) of the population and fails to generalize well. This may happen for different reasons:

1. The sampling methods only reached a portion of the population. E.g. Data collected via smartphone apps can under-represent lower incomes or older groups, who may be less likely to own smartphones.

2. The population of interest has changed or is distinct from the sample used during model training. E.g. Data that is representative of Vancouver may not be representative if used to analyze the population in Toronto. Similarly, data representative of Vancouver 100 years ago may not reflect today's population. 
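
To make the first cause concrete, here is a small simulation sketch. The population, the income distribution, and the smartphone-ownership probabilities are all invented for illustration; only the mechanism (a sample that can reach smartphone owners only) comes from the example above.

```python
import random

random.seed(0)

# Hypothetical population: each person has an income and may own a smartphone.
# Assumption for illustration: ownership is more likely at higher incomes.
population = []
for _ in range(100_000):
    income = random.lognormvariate(10.5, 0.6)
    p_owns_phone = 0.9 if income > 40_000 else 0.5
    population.append((income, random.random() < p_owns_phone))


def mean(values):
    return sum(values) / len(values)


true_mean = mean([income for income, _ in population])

# An app-based survey can only reach smartphone owners.
app_sample = [income for income, owns in population if owns]
app_mean = mean(app_sample)

print(f"Population mean income: {true_mean:,.0f}")
print(f"App-sample mean income: {app_mean:,.0f}")
```

Under these assumptions the app-based sample overstates the average income, because lower-income people are less likely to be included.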

### 3. Measurement bias
**Measurement bias:** it occurs when choosing features that fail to correctly represent the problem, or when there are issues with the data collection. For example:

1. The measurement process varies across groups. E.g. one group of workers is monitored more closely and thus more errors are observed in that group.

2. The quality of data varies across groups. E.g. women often self-report less pain than men and are therefore less likely to receive certain diagnoses.

3. The defined classification task or one of the features used is an oversimplification. E.g. We are designing a model to predict whether a student will be successful in college. We choose to predict the final GPA as the metric of success. This, however, ignores other indicators of success.

**Question 1** 

Consider a crowd-sourcing project called [Street Bump](https://www.boston.gov/transportation/street-bump) aimed at helping improve neighbourhood streets in Boston from 2011 to 2014. Volunteers used a smartphone app, which captured GPS location and reported back to the city every time the driver hit a pothole. The data was provided to governments so they could use it to fix any road issues.

Can you think of any sources of bias in the scenario above? Explain them. 

<span style="color:blue"> 
There is a risk of representation bias, as the data will likely under-represent lower-income or older groups that are less likely to have smartphones. In addition, the people who are interested in volunteering may not be representative of the overall population.
<br><br>There is also a risk of measurement bias, as road quality is determined by more attributes than potholes alone, such as effective drainage and traffic management. The frequency of drivers hitting potholes is also determined by other factors, such as the proficiency of the drivers themselves, or the location of the potholes. In other words, the feature used to determine the road quality is an oversimplification.
</span>

## Observational and experimental studies

- **Observational study:** study where there is no deliberate human intervention regarding the variable under investigation. Observational studies are ones where researchers observe the effect of a treatment/intervention without trying to change who is or isn’t exposed to it. In an observational study, the subjects are assigned or assign themselves to the exposure group they belong to.
- **Experimental study:** study that involves planned intervention on the exposure to a condition. In an experiment, subjects are assigned to a condition by the researcher and thus one can establish a cause-and-effect relationship when we see a difference in the outcome between the experimental groups. Randomizing study subjects balances any differences between treatment groups with respect to all variables except the condition of exposure.

## A/B testing

A/B testing can be considered the most basic kind of randomized controlled experiment. 

Complete the following reading, then answer the comprehension questions below: https://hbr.org/2017/06/a-refresher-on-ab-testing

**Question 2**

In the following table, select which statements are true or false:

| Statement | True | False |
| -------- | :------- | :------- |
| A/B testing is an example of an experimental study. |  ✔      |        | 
| Observational studies require subjects to not be informed that they are being studied. |        |      ✔  |  
| Ethical experimental studies require genuine uncertainty about the benefits/harms of treatment or exposure (equipoise) |    ✔    |        | 
| A researcher is interested in studying the effects of certain dietary habits. They recruit people and, through a survey, ask them to disclose their current dietary habits, on the basis of which they will be assigned to the treatment or control group. This is an example of an experimental study. |        |    ✔    | 
| The control group and the exposed group must include different individuals. |        |   ✔     | 
| One of the main advantages of experimental studies is that they allow for better randomization. |     ✔   |        | 



**Question 3**

Explain the role of blocking in A/B testing.

<span style="color:blue"> 
Blocking is defined as splitting the data by similarity in a factor that is of less interest, but will still heavily influence the success metric of our interest. For example, from the article, whether someone views a website on mobile or desktop influences the click rates on both versions of a website, but the groups of interest in the study are the two versions of our website, not the devices of users. In this case, we should first divide the users into two blocks, one for mobile users and the other for desktop users, and then randomly assign users to each version within each block. Blocking in A/B testing allows for a more accurate reflection of the distinctions between the methods of interest.
</span>
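
The blocking procedure described in this answer can be sketched as follows. This is a minimal illustration, not a production assignment system: the function name `blocked_assignment`, the user records, and the 60/40 device split are all hypothetical.

```python
import random
from collections import defaultdict


def blocked_assignment(users, block_key, seed=0):
    """Randomly split users into variants A and B within each block."""
    rng = random.Random(seed)

    # Group users by the blocking factor (e.g. device type).
    blocks = defaultdict(list)
    for user in users:
        blocks[user[block_key]].append(user)

    # Shuffle each block, then send the first half to A and the rest to B.
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        half = len(members) // 2
        for i, user in enumerate(members):
            assignment[user["id"]] = "A" if i < half else "B"
    return assignment


# Hypothetical users: 60 on mobile, 40 on desktop.
users = [{"id": i, "device": "mobile" if i < 60 else "desktop"} for i in range(100)]
assignment = blocked_assignment(users, block_key="device")

# Each device block is split evenly between the two variants.
for device in ("mobile", "desktop"):
    counts = {"A": 0, "B": 0}
    for user in users:
        if user["device"] == device:
            counts[assignment[user["id"]]] += 1
    print(device, counts)
```

Because the random split happens inside each block, both variants see the same mix of mobile and desktop users, so a difference in click rates cannot be explained by device type.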

**Question 4**

The authors warn about observing too many metrics when running an A/B test. Why is that the case? What could happen if I ignore this warning?

<span style="color:blue"> 
Observing too many metrics runs the risk of observing "spurious correlations", where multiple variables are only seemingly correlated without being causally related. The more metrics we observe, the more likely we are to see some statistically significant results that occur only by chance, which is what Fung describes as "random fluctuation".
<br><br>Ignoring the warning can lead to incorrect or misleading conclusions, and it makes the results hard to interpret because too many metrics are being tracked at once. For example, you may want to switch to the new version of the product because some metrics were significant in the A/B test. But with many metrics, it is more likely that some of them are significant only by chance. If you switch to the new version based on such a result, the new version may turn out to be no more effective, or even less effective, than the original one.
</span>
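
A short calculation shows how quickly this risk grows. The sketch assumes that every metric is tested independently at a significance level of 0.05 and that the new version truly changes nothing; real metrics are usually correlated, so the numbers are only indicative.

```python
def prob_at_least_one_false_positive(num_metrics, alpha=0.05):
    """P(at least one metric looks significant) when no metric truly differs.

    Assumes the metrics are tested independently at significance level alpha.
    """
    return 1 - (1 - alpha) ** num_metrics


for k in (1, 5, 20, 50):
    print(f"{k:>2} metrics: {prob_at_least_one_false_positive(k):.2f}")
```

With a single metric the chance of a false alarm is 5%, but with 20 independent metrics it is already about 64%.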

**Question 5** 

You want to determine the size of the subscribe button on your website. You plan to evaluate the performance by the number of visitors who click on the button. To run the test, you show one set of users one version and collect information about the number of visitors who click on the button. One month later you show users another version where the only thing different is the size of the button. Based on this test, you determine that the second version had a higher number of visitors who clicked on the button. Can you conclude that this version of the website leads to a higher number of visitors clicking on the button? Briefly explain.  

<span style="color:blue"> 
I would argue that we cannot conclude that this version of the website leads to a higher number of visitors clicking on the button. There is no statistic provided to indicate that the difference in button clicks is statistically significant enough to reject the null hypothesis that the number of clicks for both websites is the same. More importantly, as the test was conducted in two different periods, other variables that could influence the results may also have changed over time (e.g. users' mood, seasonal effects, etc.). The data we collected for each version of the website may also be representative of different populations due to the difference in time frames, leading to representation bias. Therefore, we should conduct this test simultaneously by randomly assigning users to one of the versions, minimizing the effect of other variables on the result.
</span>
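
The missing significance check mentioned in this answer could be a two-proportion z-test, sketched below with only the standard library. The visitor and click counts are made up, and the function name is hypothetical. Note that a small p-value would only address the first objection: it cannot correct for the two versions having been shown in different months.

```python
import math


def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided z-test for a difference between two click-through rates."""
    p_a = clicks_a / visitors_a
    p_b = clicks_b / visitors_b
    # Pooled rate under the null hypothesis that both versions perform equally.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 visitors clicked on version A, 150 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```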

### Ethical A/B testing
Ethical A/B testing still requires all the ethical considerations of any experimental study, such as informed consent or the possibility to opt out. A notorious case of a company failing to meet ethics requirements in A/B testing is the Facebook "social contagion experiment", in which almost 700,000 users were shown, for a week, only positive or only negative content, to see how this variation impacted their online behaviour. The selected users were not informed and could not opt out. Furthermore, their emotional state was affected. Facebook defended itself by saying that Facebook's Data Use Policy warns users that Facebook “may use the information we receive about you…for internal operations, including troubleshooting, data analysis, testing, research and service improvement”. This defense was largely rejected by the scientific community, which still considered the study unethical. You can read more about this incident in this [article](https://www.theguardian.com/technology/2014/jun/30/facebook-emotion-study-breached-ethical-guidelines-researchers-say). 

## Case Study: National Institute of Justice's (NIJ) Recidivism Dataset 

We will now look at the NIJ's Recidivism data set, which contains data on 26,000 individuals from the State of Georgia released from prison on parole (early release from prison where the person agrees to abide by certain conditions) between January 1, 2013 and December 31, 2015. **Recidivism** is the act of committing another crime.

This dataset is split into a training set and a test set: 70% of the data is in the training set and 30% is in the test set. The training set contains four variables that measure recidivism: whether an individual recidivated within three years of the supervision start date and whether they recidivated in year 1, year 2, or year 3. In this data set, recidivism is defined as being arrested for a new crime during this three-year period. The test set does not include these four variables. 

The data was provided by the Georgia Department of Community Supervision (GDCS) and the Georgia Bureau of Investigation.

*Source: https://data.ojp.usdoj.gov/stories/s/daxx-hznc*

Let's start by familiarizing ourselves with the [dataset source](https://nij.ojp.gov/funding/recidivism-forecasting-challenge). The website includes a lot of information on the dataset and a detailed description of each of its columns (look for Appendix 2: Codebook).



**Question 6**
Think about how the data set was collected and what we are trying to predict. Are there any potential sources of bias (historical, representation, measurement)? Explain your answer. 

* <span style="color:blue"> Historical: Historical bias against certain racial groups (e.g. Black people) is reflected in the data and can affect the predictions of a model trained on it, reproducing past inequalities and unfairness. </span>
* <span style="color:blue"> Representation: The population of individuals used for model training may have changed over time, such as the proportions of people at certain ages on release, and thus may not reflect the current population of people on parole. In addition, the data only covers individuals from the State of Georgia, so people released on parole in other states or countries are not represented. Besides, the data was collected between 2013 and 2015, so it may not be representative of today's population of interest, due to social and political events and changes over time, such as COVID-19.</span>
* <span style="color:blue"> Measurement: In this study, recidivism is defined as being arrested for a new crime during the three-year period. However, it is possible that some individuals who recidivated were not arrested. Factors such as race and age may also lead to unequal measurement and data quality across groups of individuals, such as stricter monitoring of some individuals within the same supervision level due to racism or prioritization of certain criminal acts. In other words, the feature used to determine recidivism is an oversimplification.</span>

### Question 7: Exploratory Data Analysis (EDA)

We are now going to perform some Exploratory Data Analysis on the NIJ's Recidivism Training set. This will serve 2 purposes:
- it will help us familiarize ourselves with the dataset
- it will help us spot possible imbalances or sources of bias in the dataset

You are free to use tools and functions of your choice to complete the EDA. Your goal is to answer the following questions:
1. Does the dataset include protected characteristics? We recommend using the [BC Human Rights Code](http://www.bchrt.bc.ca/human-rights-duties/characteristics.htm) for reference.
2. If the dataset includes protected characteristics, do you think they are necessary to perform the predictive task? Why or why not?
3. If we were to remove the columns including protected characteristics, do you think it would still be possible to retrieve that information through other features (proxies)? Explain how.
4. Is the target variable balanced? If not, what could happen?
5. Is the target variable balanced *across protected segments of the population?* What could happen if this is not the case? 
6. Are there features with missing values? Do you suspect that they may be Missing Not At Random (MNAR), and if so, how would it be best to fill this information?

**Notes:**
- Bar charts and other plots are helpful to visually spot imbalances
- You are encouraged to talk to the instructor and TA to discuss your EDA strategy and if you need suggestions with the code

In [1]:
# Your solution here. You may add more code/markdown cells as needed. 
import pandas as pd

train_df = pd.read_csv("NIJ_s_Recidivism_Challenge_Training_Dataset.csv")
train_df.head()

| | ID | Gender | Race | Age_at_Release | Residence_PUMA | Gang_Affiliated | Supervision_Risk_Score_First | Supervision_Level_First | Education_Level | Dependents | ... | DrugTests_Cocaine_Positive | DrugTests_Meth_Positive | DrugTests_Other_Positive | Percent_Days_Employed | Jobs_Per_Year | Employment_Exempt | Recidivism_Within_3years | Recidivism_Arrest_Year1 | Recidivism_Arrest_Year2 | Recidivism_Arrest_Year3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | M | BLACK | 43-47 | 16 | False | 3.0 | Standard | At least some college | 3 or more | ... | 0.0 | 0.0 | 0.0 | 0.488562 | 0.44761 | False | False | False | False | False |
| 1 | 2 | M | BLACK | 33-37 | 16 | False | 6.0 | Specialized | Less than HS diploma | 1 | ... | 0.0 | 0.0 | 0.0 | 0.425234 | 2.0 | False | True | False | False | True |
| 2 | 3 | M | BLACK | 48 or older | 24 | False | 7.0 | High | At least some college | 3 or more | ... | 0.0 | 0.166667 | 0.0 | 0.0 | 0.0 | False | True | False | True | False |
| 3 | 4 | M | WHITE | 38-42 | 16 | False | 7.0 | High | Less than HS diploma | 1 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.718996 | False | False | False | False | False |
| 4 | 5 | M | WHITE | 33-37 | 16 | False | 4.0 | Specialized | Less than HS diploma | 3 or more | ... | 0.0 | 0.058824 | 0.0 | 0.203562 | 0.929389 | False | True | True | False | False |


In [2]:
display(train_df.describe()) 

| | ID | Residence_PUMA | Supervision_Risk_Score_First | Avg_Days_per_DrugTest | DrugTests_THC_Positive | DrugTests_Cocaine_Positive | DrugTests_Meth_Positive | DrugTests_Other_Positive | Percent_Days_Employed | Jobs_Per_Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 18028.0 | 18028.0 | 17698.0 | 13768.0 | 14396.0 | 14396.0 | 14396.0 | 14396.0 | 17721.0 | 17494.0 |
| mean | 13386.065343 | 12.307577 | 6.064753 | 93.58586 | 0.06312 | 0.014173 | 0.012768 | 0.007681 | 0.480035 | 0.766423 |
| std | 7721.451992 | 7.143255 | 2.382811 | 117.561341 | 0.138357 | 0.063473 | 0.059572 | 0.042224 | 0.424396 | 0.813474 |
| min | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 6702.75 | 6.0 | 4.0 | 28.666667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 13405.5 | 12.0 | 6.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.466543 | 0.636324 |
| 75% | 20081.25 | 18.0 | 8.0 | 110.0 | 0.068242 | 0.0 | 0.0 | 0.0 | 0.966184 | 1.0 |
| max | 26761.0 | 25.0 | 10.0 | 1087.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 8.0 |


In [3]:
# 1, 2: Protected characteristics
display(train_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18028 entries, 0 to 18027
Data columns (total 53 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   ID                                                 18028 non-null  int64  
 1   Gender                                             18028 non-null  object 
 2   Race                                               18028 non-null  object 
 3   Age_at_Release                                     18028 non-null  object 
 4   Residence_PUMA                                     18028 non-null  int64  
 5   Gang_Affiliated                                    15811 non-null  object 
 6   Supervision_Risk_Score_First                       17698 non-null  float64
 7   Supervision_Level_First                            16816 non-null  object 
 8   Education_Level                                    18028 non-null  object 
 9   Depend

None

<span style="color:blue">Q7.1: From the list of columns provided by `train_df.info()`, the columns that likely include protected characteristics are `Gender`, `Race`, `Age_at_Release`, `Dependents`, and the characteristics involving prior arrests, convictions and revocations (`Prior_Arrest_Episodes_Felony`, `Prior_Arrest_Episodes_Misd`, `Prior_Arrest_Episodes_Violent`, `Prior_Arrest_Episodes_Property`, `Prior_Arrest_Episodes_Drug`, `Prior_Arrest_Episodes_PPViolationCharges`, `Prior_Arrest_Episodes_DomesticViolenceCharges`, `Prior_Arrest_Episodes_GunCharges`, `Prior_Conviction_Episodes_Felony`, `Prior_Conviction_Episodes_Misdemeanor`, `Prior_Conviction_Episodes_Violent`, `Prior_Conviction_Episodes_Property`, `Prior_Conviction_Episodes_Drug`, `Prior_Conviction_Episodes_PPViolationCharges`, `Prior_Conviction_Episodes_DomesticViolenceCharges`, `Prior_Conviction_Episodes_GunCharges`, `Prior_Revocations_Parole`, and `Prior_Revocations_Probation`).</span>
<br>
<span style="color:blue">Q7.2: The `Gender`, `Race`, `Age_at_Release`, and `Dependents` characteristics are unnecessary, as they appear to be only indirectly related to the probability of a person recidivating. The characteristics relating to prior arrests, convictions, and revocations appear to be more directly related to a person recommitting crimes and thus may be necessary for the predictive task.</span>
<br> 
<span style="color:blue">Q7.3: It could be possible to retrieve an individual's `Age_at_Release`, `Prior_Arrest_Episodes_Drug`, and `Prior_Conviction_Episodes_Drug`. The first characteristic could be inferred from characteristics that would likely tie to someone's work, such as `Percent_Days_Employed` and `Jobs_Per_Year`, under the assumption that a younger individual would likely have lower values for both of those characteristics. The latter two characteristics are potentially associated with other characteristics involving drug testing, as individuals with a history of drug abuse would likely be closely monitored for potential relapses.</span>

In [4]:
# 4: Check if the class distribution is balanced 
display(train_df["Recidivism_Within_3years"].value_counts(normalize=True), 
        train_df["Recidivism_Arrest_Year1"].value_counts(normalize=True),
        train_df["Recidivism_Arrest_Year2"].value_counts(normalize=True),
        train_df["Recidivism_Arrest_Year3"].value_counts(normalize=True))

Recidivism_Within_3years
True     0.578045
False    0.421955
Name: proportion, dtype: float64

Recidivism_Arrest_Year1
False    0.701742
True     0.298258
Name: proportion, dtype: float64

Recidivism_Arrest_Year2
False    0.819558
True     0.180442
Name: proportion, dtype: float64

Recidivism_Arrest_Year3
False    0.900655
True     0.099345
Name: proportion, dtype: float64

<span style="color:blue">Q7.4: The target variable `Recidivism_Within_3years` is not balanced, with 57.8% of samples having `Recidivism_Within_3years == True`, and 42.2% of samples having `Recidivism_Within_3years == False`. The proportions of recidivism in each of the three years are not balanced either. As a result, the model may have biased predictive results in favor of the more frequent class of the target variable.</span> 
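
One way to see why imbalance matters is to compute the accuracy of a trivial model that always predicts the more frequent class. The sketch below reuses the proportions printed in the output above; the helper name `majority_baseline_accuracy` is hypothetical.

```python
# Share of True labels for each target, copied from the value_counts output above.
true_share = {
    "Recidivism_Within_3years": 0.578045,
    "Recidivism_Arrest_Year1": 0.298258,
    "Recidivism_Arrest_Year2": 0.180442,
    "Recidivism_Arrest_Year3": 0.099345,
}


def majority_baseline_accuracy(share_true):
    """Accuracy of a model that always predicts the more frequent class."""
    return max(share_true, 1 - share_true)


for target, share in true_share.items():
    print(f"{target}: {majority_baseline_accuracy(share):.3f}")
```

For `Recidivism_Arrest_Year3`, always predicting `False` is right about 90% of the time while never identifying a single positive case, so accuracy alone can make a useless model look good.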

In [5]:
# 5: Check if class distribution is balanced within protected segments
for gender in train_df["Gender"].unique():
    print("Recidivism_Within_3years for gender:" + gender)
    display(train_df[train_df["Gender"] == gender]["Recidivism_Within_3years"].value_counts(normalize=True))

for race in train_df["Race"].unique():
    print("Recidivism_Within_3years for race:" + race)
    display(train_df[train_df["Race"] == race]["Recidivism_Within_3years"].value_counts(normalize=True))

for age in train_df["Age_at_Release"].unique():
    print("Recidivism_Within_3years for age group:" + age)
    display(train_df[train_df["Age_at_Release"] == age]["Recidivism_Within_3years"].value_counts(normalize=True))

for dep in train_df["Dependents"].unique():
    print("Recidivism_Within_3years for dependent groups:" + dep)
    display(train_df[train_df["Dependents"] == dep]["Recidivism_Within_3years"].value_counts(normalize=True))

Recidivism_Within_3years for gender:M


Recidivism_Within_3years
True     0.595155
False    0.404845
Name: proportion, dtype: float64

Recidivism_Within_3years for gender:F


Recidivism_Within_3years
False    0.543978
True     0.456022
Name: proportion, dtype: float64

Recidivism_Within_3years for race:BLACK


Recidivism_Within_3years
True     0.589159
False    0.410841
Name: proportion, dtype: float64

Recidivism_Within_3years for race:WHITE


Recidivism_Within_3years
True     0.563189
False    0.436811
Name: proportion, dtype: float64

Recidivism_Within_3years for age group:43-47


Recidivism_Within_3years
True     0.503229
False    0.496771
Name: proportion, dtype: float64

Recidivism_Within_3years for age group:33-37


Recidivism_Within_3years
True     0.57479
False    0.42521
Name: proportion, dtype: float64

Recidivism_Within_3years for age group:48 or older


Recidivism_Within_3years
False    0.587656
True     0.412344
Name: proportion, dtype: float64

Recidivism_Within_3years for age group:38-42


Recidivism_Within_3years
True     0.537745
False    0.462255
Name: proportion, dtype: float64

Recidivism_Within_3years for age group:18-22


Recidivism_Within_3years
True     0.719395
False    0.280605
Name: proportion, dtype: float64

Recidivism_Within_3years for age group:23-27


Recidivism_Within_3years
True     0.666574
False    0.333426
Name: proportion, dtype: float64

Recidivism_Within_3years for age group:28-32


Recidivism_Within_3years
True     0.6196
False    0.3804
Name: proportion, dtype: float64

Recidivism_Within_3years for dependent groups:3 or more


Recidivism_Within_3years
True     0.54828
False    0.45172
Name: proportion, dtype: float64

Recidivism_Within_3years for dependent groups:1


Recidivism_Within_3years
True     0.605972
False    0.394028
Name: proportion, dtype: float64

Recidivism_Within_3years for dependent groups:2


Recidivism_Within_3years
True     0.582845
False    0.417155
Name: proportion, dtype: float64

Recidivism_Within_3years for dependent groups:0


Recidivism_Within_3years
True     0.585462
False    0.414538
Name: proportion, dtype: float64

<span style="color:blue">Q7.5: The target variable `Recidivism_Within_3years` is not balanced across most protected segments, nor are the distributions of each `Recidivism_Within_3years` category equal across each level of protected segments. For instance, the proportion of `Recidivism_Within_3years` being true is 59.5% among male individuals and 45.6% among female individuals; the proportion of `Recidivism_Within_3years` being true is 72% among individuals aged 18-22 and 50% among individuals aged 43-47. This runs the risk of differential treatment and measurement of recidivism between categories of protected characteristics and increases the predictive bias against certain groups under protected characteristics.</span> 
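
The per-group loops above can be condensed into a single `groupby` call, since the mean of a boolean column is the share of `True` values. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the NIJ data) so that it runs on its own.

```python
import pandas as pd

# Tiny made-up frame that mimics the relevant columns; not the real data.
toy_df = pd.DataFrame({
    "Gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Recidivism_Within_3years": [True, True, True, False, True, False, False, False],
})

# The mean of a boolean column is the proportion of True values in each group.
rates = toy_df.groupby("Gender")["Recidivism_Within_3years"].mean()
print(rates)
```

Assuming the same column names, replacing `toy_df` with `train_df` and `"Gender"` with `"Race"`, `"Age_at_Release"`, or `"Dependents"` should give the share of positive outcomes per group in one table each.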

In [6]:
# 6: Presence of NaN 
# https://stackoverflow.com/questions/36226083/how-to-find-which-columns-contain-any-nan-value-in-pandas-dataframe
display(train_df.isna().any())

ID                                                   False
Gender                                               False
Race                                                 False
Age_at_Release                                       False
Residence_PUMA                                       False
Gang_Affiliated                                       True
Supervision_Risk_Score_First                          True
Supervision_Level_First                               True
Education_Level                                      False
Dependents                                           False
Prison_Offense                                        True
Prison_Years                                         False
Prior_Arrest_Episodes_Felony                         False
Prior_Arrest_Episodes_Misd                           False
Prior_Arrest_Episodes_Violent                        False
Prior_Arrest_Episodes_Property                       False
Prior_Arrest_Episodes_Drug                           Fal

<span style="color:blue">Q7.6: The columns `Gang_Affiliated`, `Supervision_Risk_Score_First`, `Supervision_Level_First`, `Prison_Offense`, `Avg_Days_per_DrugTest`, `DrugTests_THC_Positive`, `DrugTests_Cocaine_Positive`, `DrugTests_Meth_Positive`, `DrugTests_Other_Positive`, `Percent_Days_Employed`, and `Jobs_Per_Year` contain missing values. Of these characteristics, `Gang_Affiliated`, `Supervision_Level_First`, and `Prison_Offense` are categorical and `Supervision_Risk_Score_First` is a score from 1 to 10 stored as a number, while the rest are numerical. The variables `Gang_Affiliated`, `Avg_Days_per_DrugTest`, and `Jobs_Per_Year` may be MNAR, as they may not be applicable to the individual (e.g. `Avg_Days_per_DrugTest` for someone that never got tested for drugs in the first place), or the individual may have actively refused to disclose such information (e.g. `Gang_Affiliated`). `Gang_Affiliated` and `Prison_Offense` can have their information filled by creating/using a separate "Other" category, while `Avg_Days_per_DrugTest` can be filled with a default value of 0 to indicate a lack of drug testing in the first place.</span> 
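
The filling strategy described in Q7.6 could look roughly like the sketch below. It runs on a small made-up frame rather than the real data. The "Yes"/"No"/"Unknown" labels and the extra `DrugTest_Missing` indicator column are assumptions added for illustration; the indicator is not part of the answer above, but it preserves the fact that a value was missing after the gap has been filled.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same kinds of gaps as the real columns.
toy_df = pd.DataFrame({
    "Gang_Affiliated": [True, False, None, False],
    "Avg_Days_per_DrugTest": [30.0, np.nan, 55.0, np.nan],
})

filled = toy_df.copy()

# Categorical column: give missing values their own explicit category.
filled["Gang_Affiliated"] = (
    toy_df["Gang_Affiliated"].map({True: "Yes", False: "No"}).fillna("Unknown")
)

# Numeric column: record that the value was missing before filling it.
filled["DrugTest_Missing"] = toy_df["Avg_Days_per_DrugTest"].isna()
filled["Avg_Days_per_DrugTest"] = toy_df["Avg_Days_per_DrugTest"].fillna(0)

print(filled)
```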

# Part 2: Privacy



When collecting data for a study, privacy is almost always a primary concern. Our data set may include information that makes it possible to identify an individual, including:

- **Direct identifiers**, which are the ones that can be used to uniquely identify an individual or a household in a dataset, such as a record ID number, patient number, social insurance number, full address, etc. Usually, name is also considered a direct identifier (although several people can have the same name). Other features such as age, date of birth, or postal code are not sufficient on their own to uniquely identify an individual and would not be considered direct identifiers.
- **Indirect (or quasi) identifiers**, which are the columns that do not themselves identify any individual or household, but can do so when combined with other indirect identifiers. For example, postal code and date of birth are often indirect identifiers, because it is very likely that within a postal code only one individual has this particular birth date. The more indirect identifiers that you have, the more likely it is that individuals become identifiable because there are more possible unique combinations of identifying features.

### Question 8
1. Which columns in the NIJ dataset are direct identifiers? Briefly motivate your answer.
2. Which of the remaining columns make good candidates for indirect identifiers? Which ones do not?

Hint: It can be useful to use the `nunique()` and `value_counts()` dataframe methods to get an idea of how many distinct values a feature has.


In [7]:
# Your answer here (code portion)
display(train_df.nunique())
display(train_df.shape)

ID                                                   18028
Gender                                                   2
Race                                                     2
Age_at_Release                                           7
Residence_PUMA                                          25
Gang_Affiliated                                          2
Supervision_Risk_Score_First                            10
Supervision_Level_First                                  3
Education_Level                                          3
Dependents                                               4
Prison_Offense                                           5
Prison_Years                                             4
Prior_Arrest_Episodes_Felony                            11
Prior_Arrest_Episodes_Misd                               7
Prior_Arrest_Episodes_Violent                            4
Prior_Arrest_Episodes_Property                           6
Prior_Arrest_Episodes_Drug                              

(18028, 53)

* <span style="color:blue">Q8.1: `ID` is the only column in the NIJ dataset that is a direct identifier, as the number of unique values in the training dataset is equal to the number of individuals in the dataset, which is 18028.</span> 

* <span style="color:blue">Q8.2: `Gender`, `Race`, `Age_at_Release`,  `Residence_PUMA`, `Education_Level`, and `Dependents` are effective as indirect identifiers, as they are unlikely to change drastically over extended periods of time and can be used to narrow down the individuals of interest; we can use a combination of these features to identify individuals of interest. The characteristics relating to supervision activities, from `Violations_ElectronicMonitoring` to `Employment_Exempt`, would make for poor candidates for indirect identifiers, as they are directly measured during parole and thus are unlikely to be matched with other anonymous data.</span> 
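
How identifying a combination of indirect identifiers is can be measured with k-anonymity: the size of the smallest group of records that share the same values for those columns. Below is a minimal sketch on a made-up frame; the helper `k_anonymity` and the toy rows are hypothetical.

```python
import pandas as pd


def k_anonymity(df, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())


# Tiny made-up frame; not the real data.
toy_df = pd.DataFrame({
    "Gender": ["M", "M", "M", "F", "F", "M"],
    "Age_at_Release": ["18-22", "18-22", "23-27", "18-22", "18-22", "23-27"],
    "Residence_PUMA": [16, 16, 16, 24, 24, 3],
})

print(k_anonymity(toy_df, ["Gender", "Age_at_Release"]))
print(k_anonymity(toy_df, ["Gender", "Age_at_Release", "Residence_PUMA"]))
```

In this toy example every gender and age combination is shared by two records, but adding `Residence_PUMA` leaves some records unique (k = 1), which illustrates how each extra indirect identifier makes re-identification easier.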

## De-identification of structured data

To safeguard the privacy of the individuals in our dataset, we need to make sure that they are not identifiable, either directly or indirectly. There are three main strategies to achieve this: suppression, pseudonymization, and generalization.

### Suppression
Suppression is an effective way to get rid of a direct identifier by simply removing the entire column. 

**Question 9:** Using the appropriate dataframe methods, suppress all direct identifiers in the NIJ training set. Save the result in a new dataframe called `suppressed_df`.

In [8]:
# Your answer here
direct_id = ["ID"]
suppressed_df = train_df.drop(columns=direct_id)
suppressed_df.head()

Unnamed: 0,Gender,Race,Age_at_Release,Residence_PUMA,Gang_Affiliated,Supervision_Risk_Score_First,Supervision_Level_First,Education_Level,Dependents,Prison_Offense,...,DrugTests_Cocaine_Positive,DrugTests_Meth_Positive,DrugTests_Other_Positive,Percent_Days_Employed,Jobs_Per_Year,Employment_Exempt,Recidivism_Within_3years,Recidivism_Arrest_Year1,Recidivism_Arrest_Year2,Recidivism_Arrest_Year3
0,M,BLACK,43-47,16,False,3.0,Standard,At least some college,3 or more,Drug,...,0.0,0.0,0.0,0.488562,0.44761,False,False,False,False,False
1,M,BLACK,33-37,16,False,6.0,Specialized,Less than HS diploma,1,Violent/Non-Sex,...,0.0,0.0,0.0,0.425234,2.0,False,True,False,False,True
2,M,BLACK,48 or older,24,False,7.0,High,At least some college,3 or more,Drug,...,0.0,0.166667,0.0,0.0,0.0,False,True,False,True,False
3,M,WHITE,38-42,16,False,7.0,High,Less than HS diploma,1,Property,...,0.0,0.0,0.0,1.0,0.718996,False,False,False,False,False
4,M,WHITE,33-37,16,False,4.0,Specialized,Less than HS diploma,3 or more,Violent/Non-Sex,...,0.0,0.058824,0.0,0.203562,0.929389,False,True,True,False,False


### Pseudonymization

A big issue with suppressing direct identifiers is that it is not reversible. If at some point we need to identify an individual in our dataset, we would be out of luck. If you have reason to believe that re-identification may be required, pseudonymization is a better option for handling direct identifiers. Pseudonymization replaces one or more direct identifiers with a unique but less meaningful value. Usually, when we pseudonymize an identifier, re-identification remains possible if required (but the means to do so are not available to the general public).

**Question 10:** Pseudonymize the ID column of the NIJ training set and save the result in a new dataframe called `pseudo_df`. In a different code cell, show that it is possible to re-identify the samples by converting them back to the original ID number.

There are different ways to achieve this that you may want to explore:
- Write your own pseudonymization function. You should write at least 2 functions: one to pseudonymize, and another to re-identify. The function does not have to be exceedingly complex, but it should not be obvious either (e.g., one that involves only basic arithmetic).
- Use an existing library, such as [`cryptography`](https://cryptography.io/en/latest/).

<span style="color:blue">Q10 with `cryptography`</span> 

In [9]:
# Your answer here (you may add more cells as needed)
from cryptography.fernet import Fernet

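# Pseudonymization function: encrypt each ID with a freshly generated Fernet key.
def psuedo_encry(col): 
    key = Fernet.generate_key()
    f = Fernet(key)  # keep `f` (or the key) private: whoever holds it can re-identify
    # Fernet works on bytes, so convert each integer ID to bytes first.
    # Note: 2 bytes only covers IDs up to 65535; widen this for larger IDs.
    result1 = col.apply(lambda x: int(x).to_bytes(2, byteorder='big'))
    result2 = result1.apply(lambda x: f.encrypt(x)) 
    print("Data encrypted")
    return result2, f 

# Re-identification function: decrypt each token and convert the bytes back to an integer ID.
def psuedo_decry(col, f): 
    result1 = col.apply(lambda x: f.decrypt(x))
    result2 = result1.apply(lambda x: int.from_bytes(x, byteorder='big'))
    print("Data decrypted")
    return result2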

In [10]:
# Pseudonymization
pseudo_df = train_df.copy()
pseudo_df["ID"], f = psuedo_encry(train_df["ID"])
pseudo_df.head()

Data encrypted


Unnamed: 0,ID,Gender,Race,Age_at_Release,Residence_PUMA,Gang_Affiliated,Supervision_Risk_Score_First,Supervision_Level_First,Education_Level,Dependents,...,DrugTests_Cocaine_Positive,DrugTests_Meth_Positive,DrugTests_Other_Positive,Percent_Days_Employed,Jobs_Per_Year,Employment_Exempt,Recidivism_Within_3years,Recidivism_Arrest_Year1,Recidivism_Arrest_Year2,Recidivism_Arrest_Year3
0,b'gAAAAABm-1QBoywSIPV3buuXLO8WRySWtz0orATwCti8...,M,BLACK,43-47,16,False,3.0,Standard,At least some college,3 or more,...,0.0,0.0,0.0,0.488562,0.44761,False,False,False,False,False
1,b'gAAAAABm-1QBYrk2g8GnFg5JtQEFi5ckh1UJ0OrGBHVu...,M,BLACK,33-37,16,False,6.0,Specialized,Less than HS diploma,1,...,0.0,0.0,0.0,0.425234,2.0,False,True,False,False,True
2,b'gAAAAABm-1QB8m9N3FT8xEGDTsGnI5dlfAnNVuE3N9f2...,M,BLACK,48 or older,24,False,7.0,High,At least some college,3 or more,...,0.0,0.166667,0.0,0.0,0.0,False,True,False,True,False
3,b'gAAAAABm-1QBwDZ5ECEshYKJLS5ICHo2uNqJ8upXmaRR...,M,WHITE,38-42,16,False,7.0,High,Less than HS diploma,1,...,0.0,0.0,0.0,1.0,0.718996,False,False,False,False,False
4,b'gAAAAABm-1QBrWDi4O7jx32g0La_xTLzCizsTtTgtIZ8...,M,WHITE,33-37,16,False,4.0,Specialized,Less than HS diploma,3 or more,...,0.0,0.058824,0.0,0.203562,0.929389,False,True,True,False,False


In [11]:
# Reidentification
pseudo_df["ID"] = psuedo_decry(pseudo_df["ID"], f)
pseudo_df.head()

Data decrypted


Unnamed: 0,ID,Gender,Race,Age_at_Release,Residence_PUMA,Gang_Affiliated,Supervision_Risk_Score_First,Supervision_Level_First,Education_Level,Dependents,...,DrugTests_Cocaine_Positive,DrugTests_Meth_Positive,DrugTests_Other_Positive,Percent_Days_Employed,Jobs_Per_Year,Employment_Exempt,Recidivism_Within_3years,Recidivism_Arrest_Year1,Recidivism_Arrest_Year2,Recidivism_Arrest_Year3
0,1,M,BLACK,43-47,16,False,3.0,Standard,At least some college,3 or more,...,0.0,0.0,0.0,0.488562,0.44761,False,False,False,False,False
1,2,M,BLACK,33-37,16,False,6.0,Specialized,Less than HS diploma,1,...,0.0,0.0,0.0,0.425234,2.0,False,True,False,False,True
2,3,M,BLACK,48 or older,24,False,7.0,High,At least some college,3 or more,...,0.0,0.166667,0.0,0.0,0.0,False,True,False,True,False
3,4,M,WHITE,38-42,16,False,7.0,High,Less than HS diploma,1,...,0.0,0.0,0.0,1.0,0.718996,False,False,False,False,False
4,5,M,WHITE,33-37,16,False,4.0,Specialized,Less than HS diploma,3 or more,...,0.0,0.058824,0.0,0.203562,0.929389,False,True,True,False,False


<span style="color:blue">Encoding idea</span> 

In [12]:
# # Your answer here (you may add more cells as needed)
# # https://stackoverflow.com/questions/64605565/how-to-pseudonymize-and-pseudonymize-it-back
# # https://stackoverflow.com/questions/23199733/convert-numbers-into-corresponding-letter-using-python
# # https://www.geeksforgeeks.org/python-split-string-into-list-of-characters/
# # Idea: Number divided by powers of 26, first character is power, second character is remainder

# def encoding1(int): 
#     pow = 0
#     multiple = 0
#     rem = int%26
#     while (int >= 26): 
#         int = int//26
#         multiple += 1 
#         if (multiple >= 26): 
#             pow += 1
#             multiple = 1
#         else: 
#             continue
#     return chr(ord('@')+(pow+1)) + chr(ord('@')+(multiple+1)) + chr(ord('@')+(rem+1))
#     # return chr(ord('@')+(pow+1)) + chr(ord('@')+(rem+1))
# print(encoding1(25))
# print(encoding1(26))
# print(encoding1(51))
# print(encoding1(53))
# print(encoding1(1000))

# def decoding1(str): 
#     list = [x for x in str]
#     pow = ord(list[0])-ord("A")
#     multiple = ord(list[1])-ord("A")
#     rem = ord(list[2])-ord("A")
#     return (26**pow) * multiple + rem
# print(decoding1("AAZ"))
# print(decoding1("ABA"))
# print(decoding1("ACM"))

# # def encoding1(int): 
# #     val = int
# #     pow = 0
# #     rem = 0
# #     while (val > 26): 
# #         val = val//26 
# #         pow += 1 
# #     rem = val%26
# #     return str(pow) + str(rem)

# # pseudo_df = train_df
# # pseudo_df.head()
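
The commented-out encoding idea above could be completed in several ways. The following is only a minimal sketch of one reversible scheme, not a reference solution: the ID is scrambled with an affine map modulo a prime and then written as five base-26 letters. The names `pseudonymize_id` and `reidentify_id` and the constants `MODULUS`, `MULTIPLIER` and `OFFSET` are hypothetical, and the sketch assumes IDs are non-negative integers smaller than `MODULUS`. It is obfuscation rather than cryptographically secure pseudonymization.

```python
# Hypothetical secret parameters: anyone who knows them can re-identify.
MODULUS = 1_000_003      # a prime assumed to be larger than any ID
MULTIPLIER = 48_271      # invertible modulo MODULUS
OFFSET = 123_457
# Modular inverse of MULTIPLIER (pow with exponent -1 needs Python 3.8+).
MULTIPLIER_INV = pow(MULTIPLIER, -1, MODULUS)
N_LETTERS = 5            # 26**5 > MODULUS, so five letters always suffice


def pseudonymize_id(original_id: int) -> str:
    """Map an integer ID to a five-letter pseudonym."""
    if not 0 <= original_id < MODULUS:
        raise ValueError("ID must satisfy 0 <= ID < MODULUS")
    scrambled = (original_id * MULTIPLIER + OFFSET) % MODULUS
    letters = []
    for _ in range(N_LETTERS):
        scrambled, remainder = divmod(scrambled, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


def reidentify_id(pseudonym: str) -> int:
    """Invert pseudonymize_id using the secret parameters."""
    scrambled = 0
    for letter in pseudonym:
        scrambled = scrambled * 26 + (ord(letter) - ord("A"))
    return ((scrambled - OFFSET) * MULTIPLIER_INV) % MODULUS


# Round trip on a few sample IDs.
sample_ids = [0, 1, 25, 26, 18028]
pseudonyms = [pseudonymize_id(i) for i in sample_ids]
recovered = [reidentify_id(p) for p in pseudonyms]
```

Applied to the dataframe, this would look like `train_df["ID"].map(pseudonymize_id)` for pseudonymization and `.map(reidentify_id)` for re-identification.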

### Generalization

Generalization is a commonly used technique in anonymization, which involves reducing the precision of a column. For example, the date of birth or the date of a doctor's visit can be generalized to a month and year, to a year, or to a five-year interval. Generalization can help achieve $k$-anonymity. 
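
As a small illustration of the date example above, the sketch below generalizes a toy column of visit dates to month, year, and five-year interval with pandas. The dataframe `visits` and its column names are made up for this example and are not part of the NIJ data.

```python
import pandas as pd

# Toy data (hypothetical): exact dates of a doctor's visit.
visits = pd.DataFrame(
    {"visit_date": pd.to_datetime(["2019-03-14", "2021-11-02", "2023-07-30"])}
)

# Generalize to month and year, e.g. "2019-03".
visits["visit_month"] = visits["visit_date"].dt.to_period("M").astype(str)

# Generalize to year only.
visits["visit_year"] = visits["visit_date"].dt.year

# Generalize to a five-year interval, e.g. "2015-2019".
interval_start = (visits["visit_year"] // 5) * 5
visits["visit_5yr"] = interval_start.astype(str) + "-" + (interval_start + 4).astype(str)
```

Each step keeps less detail than the previous one, so more rows end up sharing the same value.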

To check for $k$-anonymity, we will use the [`pycanon` library](https://github.com/IFCA/pycanon). You can install this library in your virtual environment by running the command:

```
pip install pycanon
```

**Question 11:** `pycanon` includes several functions (feel free to explore them in the related documentation), but we will only be using `k_anonymity`. Look at the documentation, then use `k_anonymity` to determine the $k$-anonymity of the following groups of variables:

- $k$-anonymity of Gender and Race features: <span style="color:blue">743</span> 
- $k$-anonymity of Gender, Race, and Age_at_Release features: <span style="color:blue">44</span>
- $k$-anonymity of Gender, Race, Age_at_Release and Residence_PUMA features: <span style="color:blue">1</span>

In [13]:
from pycanon import anonymity

# Your answer here
# k_anonymity returns the size of the smallest group of rows that share the same quasi-identifier values
k1 = anonymity.k_anonymity(train_df, quasi_ident=["Gender", "Race"])
k2 = anonymity.k_anonymity(train_df, quasi_ident=["Gender", "Race", "Age_at_Release"])
k3 = anonymity.k_anonymity(train_df, quasi_ident=["Gender", "Race", "Age_at_Release", "Residence_PUMA"])
print("k-anonymity of Gender and Race features: " + str(k1))
print("k-anonymity of Gender, Race, and Age_at_Release features: " + str(k2))
print("k-anonymity of Gender, Race, Age_at_Release and Residence_PUMA features: " + str(k3))

k-anonymity of Gender and Race features: 743
k-anonymity of Gender, Race, and Age_at_Release features: 44
k-anonymity of Gender, Race, Age_at_Release and Residence_PUMA features: 1


The $k$-anonymity of the combination of Gender, Race, Age_at_Release and Residence_PUMA is clearly problematic! It would be very easy to identify someone if we knew these 4 pieces of information about them. 
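
To make the meaning of $k$ concrete, here is a minimal sketch that computes it directly with a pandas `groupby`: $k$ is the size of the smallest group of rows sharing the same values for the chosen quasi-identifiers. The helper `k_anonymity_groupby` and the toy dataframe are hypothetical and only meant to illustrate the definition; they are not how `pycanon` is necessarily implemented.

```python
import pandas as pd


def k_anonymity_groupby(df, quasi_identifiers):
    """Size of the smallest group of rows with identical quasi-identifier values.

    Rows with missing values in a quasi-identifier are ignored by groupby's default.
    """
    return int(df.groupby(quasi_identifiers).size().min())


# Toy data (hypothetical) with column names borrowed from the NIJ dataset.
toy = pd.DataFrame(
    {
        "Gender": ["M", "M", "M", "M", "F", "F"],
        "Age_at_Release": ["18-22", "18-22", "23-27", "23-27", "18-22", "18-22"],
        "Residence_PUMA": [1, 1, 2, 3, 4, 4],
    }
)

k_gender = k_anonymity_groupby(toy, ["Gender"])
k_gender_age = k_anonymity_groupby(toy, ["Gender", "Age_at_Release"])
k_all = k_anonymity_groupby(toy, ["Gender", "Age_at_Release", "Residence_PUMA"])
```

In the toy data, adding `Residence_PUMA` splits the two men aged 23-27 into separate groups of one, so $k$ drops to 1, which mirrors what happens in the NIJ training set.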

**Question 12:** Can you bin the Residence_PUMA feature to achieve 4-anonymity for this set of features? Add the new column to the existing dataframe, using the name `Binned_PUMA`.

For this task, you may want to look into the `cut()` and `qcut()` functions of the pandas library.

Remember that now, when checking for $k$-anonymity, you should be looking at the new column `Binned_PUMA`, not at `Residence_PUMA`.

In [14]:
# Your answer here
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Increase the number of quantile bins until the smallest group has exactly 4 rows.
for n_bins in range(1, 20):  # start at 1; `n_bins` avoids shadowing the built-in `bin`
    train_df["Binned_PUMA"] = pd.qcut(train_df["Residence_PUMA"], n_bins)
    k = anonymity.k_anonymity(train_df, quasi_ident=["Gender", "Race", "Age_at_Release", "Binned_PUMA"])
    if k == 4:
        print("k-anonymity of Gender, Race, Age_at_Release and Binned_PUMA features: " + str(k))
        display(train_df.head())
        break

k-anonymity of Gender, Race, Age_at_Release and Binned_PUMA features: 4


Unnamed: 0,ID,Gender,Race,Age_at_Release,Residence_PUMA,Gang_Affiliated,Supervision_Risk_Score_First,Supervision_Level_First,Education_Level,Dependents,...,DrugTests_Meth_Positive,DrugTests_Other_Positive,Percent_Days_Employed,Jobs_Per_Year,Employment_Exempt,Recidivism_Within_3years,Recidivism_Arrest_Year1,Recidivism_Arrest_Year2,Recidivism_Arrest_Year3,Binned_PUMA
0,1,M,BLACK,43-47,16,False,3.0,Standard,At least some college,3 or more,...,0.0,0.0,0.488562,0.44761,False,False,False,False,False,"(15.0, 20.0]"
1,2,M,BLACK,33-37,16,False,6.0,Specialized,Less than HS diploma,1,...,0.0,0.0,0.425234,2.0,False,True,False,False,True,"(15.0, 20.0]"
2,3,M,BLACK,48 or older,24,False,7.0,High,At least some college,3 or more,...,0.166667,0.0,0.0,0.0,False,True,False,True,False,"(20.0, 25.0]"
3,4,M,WHITE,38-42,16,False,7.0,High,Less than HS diploma,1,...,0.0,0.0,1.0,0.718996,False,False,False,False,False,"(15.0, 20.0]"
4,5,M,WHITE,33-37,16,False,4.0,Specialized,Less than HS diploma,3 or more,...,0.058824,0.0,0.203562,0.929389,False,True,True,False,False,"(15.0, 20.0]"


With 4-anonymity for this set of features, we can rest assured that there are at least 4 individuals sharing the same combination, making it more difficult to identify someone by knowing only these 4 pieces of information. However, let's not ignore the following issues:
- We did not test $k$-anonymity for other combinations of features, so it is very likely that our dataset is still not anonymized.
- 4-anonymity is not very strong; if I can narrow down my search to 4 people, I can still learn a lot about a person (at least approximately).
- We may lose $k$-anonymity by adding more information.

## Differential Privacy

As discussed in class, differential privacy is a stronger, mathematically robust definition of privacy for an algorithm. You can learn more about it by watching this video from Minute Physics: [Protecting Privacy with MATH](https://www.youtube.com/watch?v=pT19VwBAqKA)

After watching this video, try answering the following questions:
1. If you have two differentially private datasets, one with and one without your data, what does differential privacy guarantee regarding your privacy?
2. An algorithm has differential privacy $\epsilon$ = 2, another one $\epsilon$ = 4. Which one provides a higher level of privacy? Explain your answer.
3. The video highlights at least two of the main challenges with differential privacy. Summarize them.

* <span style="color:blue">Q12.1: Differential privacy guarantees that the change in output of the algorithm between the two datasets will be minimal, and thus it would be difficult to deduce which dataset your data is present in. In other words, whether your data is in a dataset or not, the change in the output will be limited, so if the data is published, other people cannot easily detect your presence in the data.</span>
* <span style="color:blue">Q12.2: $\epsilon$ measures the privacy loss caused by differential changes in the data, such as the addition or removal of a single entry. A smaller $\epsilon$ means the output changes less depending on whether your data is present, so the algorithm with $\epsilon$ = 2 provides a higher level of privacy than the one with $\epsilon$ = 4.</span>
* <span style="color:blue">Q12.3: There is a tradeoff between privacy and accuracy, so analysts need to find the amount of noise that best balances the two. In addition, multiple published jittered statistics can be combined to reconstruct the data that was meant to be hidden, so publications need to be planned with future releases in mind to prevent this.</span>
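
For reference, the standard formal definition says that a randomized algorithm $M$ is $\epsilon$-differentially private if, for any two datasets $D$ and $D'$ that differ in one individual's record and any set of outputs $S$, $\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S]$. One common way to satisfy this for a counting query is the Laplace mechanism, sketched below under the assumption that adding or removing one person changes the count by at most 1. The function name `laplace_count` and the numbers used are illustrative only.

```python
import numpy as np


def laplace_count(true_count, epsilon, rng):
    """Return a noisy count using Laplace noise with scale sensitivity / epsilon."""
    sensitivity = 1.0  # assumption: one person changes a count by at most 1
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)


rng = np.random.default_rng(seed=0)
true_count = 300  # illustrative value

# Release the same count many times under each privacy budget to compare the noise.
noisy_eps2 = np.array([laplace_count(true_count, 2.0, rng) for _ in range(10_000)])
noisy_eps4 = np.array([laplace_count(true_count, 4.0, rng) for _ in range(10_000)])

spread_eps2 = noisy_eps2.std()
spread_eps4 = noisy_eps4.std()
```

The smaller budget ($\epsilon$ = 2) produces the larger spread: more noise means more privacy but a less accurate answer, which is the tradeoff described in Q12.3.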

## Randomized response

In class, we described randomized polling as a way to conduct interviews including sensitive questions, while protecting individuals' privacy. 

**Question 13:** Imagine that UBC has been surveying students to understand how many of them have cheated on a final exam. Because the information is very sensitive and students will most likely not want to share it, the university uses the randomized polling protocol described in class. If 1000 students have been surveyed, and 300 of them responded "yes", what is the actual percentage of students who cheated on a final?

<span style="color:blue"> 
Let $x$ be the actual percentage of students who cheated in the final.
    
$$x \cdot \frac{3}{4} + (1 - x) \cdot \frac{1}{4} = \frac{300}{1000} $$
$$ \frac{1}{4} + \frac{x}{2}=\frac{3}{10}$$
$$x = \frac{1}{10}$$ 
$$x = 10\% $$ 

Therefore, we conclude that $x = 10\% $ is the actual percentage of students who cheated in the final.
</span>
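
The calculation above can be checked with a short sketch. It assumes the two-coin version of the protocol, which matches the probabilities used in the equation: on heads the student answers truthfully, and on tails a second coin flip decides the answer, so a cheater says "yes" with probability 3/4 and a non-cheater with probability 1/4. The function names and the simulated population are hypothetical.

```python
import random


def randomized_answer(truth, rng):
    """Answer under the assumed two-coin protocol."""
    if rng.random() < 0.5:        # first coin is heads: answer truthfully
        return truth
    return rng.random() < 0.5     # first coin is tails: second coin decides


def estimate_true_rate(yes_count, n_respondents):
    """Invert P(yes) = 1/4 + x/2 to recover the true rate x."""
    observed_yes_rate = yes_count / n_respondents
    return (observed_yes_rate - 0.25) / 0.5


# The survey from Question 13: 300 "yes" answers out of 1000.
survey_estimate = estimate_true_rate(300, 1000)   # approximately 0.10

# Simulation check with a made-up population in which 10% truly cheated.
rng = random.Random(0)
population = [i < 10_000 for i in range(100_000)]
answers = [randomized_answer(truth, rng) for truth in population]
simulated_estimate = estimate_true_rate(sum(answers), len(answers))
```

No individual answer reveals whether that student cheated, yet the aggregate estimate stays close to the true rate.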

# Part 3: Data Governance 

Data governance refers to the set of policies, procedures and standards that companies and organizations must adopt to ensure the quality, security and usability of the data in their possession. 

To gain a better understanding of what data governance is, why it is important and what common mistakes affect it, please read the following articles:
- https://www.egnyte.com/guides/governance/data-ownership
- https://atlan.com/data-governance-mistakes/#what-is-data-governance

As you can see, the issue of data governance is complex and multifaceted. A group of experts with a variety of expertise is necessary to design and implement a robust data governance plan. Still, we can train ourselves to spot the most common mistakes when we see them. Take, for example, the following fictional scenario (co-authored in collaboration with [ChatGPT](https://chat.openai.com/)):

"SleekTech Solutions" is a cutting-edge technology company that specializes in technologies related to artificial intelligence and data analytics. Their services include data analytics, big data processing, cloud computing, and Internet of Things (IoT). They offer their services to various industries, such as healthcare, finance, retail, and manufacturing.

The company is young, only founded in 2021, and has rapidly expanded. At their inception, they used to accumulate data in a vast digital repository known as the "Data Lake." Initially, this seemed like a cost-effective solution to store all types of data, and they have not changed this strategy to this date. 

To increase agility, SleekTech's different divisions have significant autonomy over their data. This means that the same data may be recorded by different departments using different standards and metrics. SleekTech also encourages a culture of openness. Employees have access to vast amounts of data, including sensitive customer information, to complete the tasks they are assigned to.

SleekTech has been expanding rapidly. Founded in Canada, it is now looking to expand into new markets, including the US and Europe.

**Question 14:** Using the readings as reference, outline at least 4 distinct mistakes that SleekTech Solutions is likely to commit because of their data governance strategy. 


* <span style="color:blue"> 
Ignoring privacy and security: In the question description, it says that "SleekTech also encourages a culture of openness. Employees have access to vast amounts of data, including sensitive customer information, to complete the tasks they are assigned to". All employees having little to no restriction on access privileges, including access to sensitive customer information, leads to a severe risk of security breaches and thus damage to the company's reputation. 
</span>

* <span style="color:blue"> 
Inadequate communication and training: In the question description, it says that "The company is young, only founded in 2021, and has rapidly expanded". The rapid growth and significant autonomy can lead to insufficient communication and training in the company. If we don't have adequate communication and training, "data governance initiatives may be misunderstood or improperly implemented" (atlan, 2023). The significant autonomy over the data within each department runs the risk of potential misunderstanding, improper implementation, and confusion over the same data.
</span>

* <span style="color:blue"> 
Neglecting data quality: SleekTech Solutions' data lake holds data from various industries with no indication that it is organized or separated, which can make it unnecessarily difficult to find the relevant data for a given task and makes poor-quality data more likely. In addition, different departments can record the same data using different standards and metrics, which leads to further data-quality problems and inconsistency.
</span>

* <span style="color:blue"> 
Failing to evolve and adapt: SleekTech has not revisited its data storage strategy since it was founded in 2021. Although the data lake was initially a cost-effective solution, keeping the same approach for more than three years without adopting newer technologies may leave its data management tools outdated compared with those of its competitors.
</span>

# Final thoughts

1) If you have completed this assignment in a group, please write a detailed description of how you divided the work and how you helped each other complete it:

* <span style="color:blue">Jingyuan's response: We worked on the assignment separately, then collaborated to form our final assignment submission.</span>
* <span style="color:blue">Nicholas' response: We worked on the assignment separately, then collaborated to form our final assignment submission.</span>

2) Have you used ChatGPT or a similar Large Language Model (LLM) to complete this homework? Please describe how you used the tool. **We will never deduct points for using LLMs for completing homework assignments,** but this helps us understand how you are using the tool and advise you in case we believe you are using it incorrectly.

* <span style="color:blue">Jingyuan's response: I used ChatGPT to help debug the codes for pseudonymization and re-identification from `pycanon`.</span>
* <span style="color:blue">Nicholas' response: I have used Poe to assist in accessing the `pycanon` module, as well as the encoding in Q10 with both the pseudonymization function idea and using `cryptography`.</span>

3) Have you struggled with some parts (or all) of this homework? Do you have pending questions you would like to ask? Write them down here!

* <span style="color:blue">Jingyuan's response: Pending questions: what is the mathematical definition of $\epsilon$-differential privacy? How do we interpret $\epsilon$?</span>
* <span style="color:blue">Nicholas' response: Encoding ideas for Q10.</span>