# Data Anonymization

## Background

A hospital in Oxford wants to study how a number of health conditions relate to age and gender. They would also like to look at the spatial distribution of the results. Finally, they would like to be able to trace results back to individual patients once the study is done.
They outsource this task to ShaZen, a data analytics startup created by two Oxford grad students.
Before transferring the data to ShaZen, the hospital needs to make sure its patients' information
is well protected, so they anonymise the data using the K-anonymity method.

They assess the risk of a breach by an adversary as rather low, at 1%. Given that they want to bring the overall chance of re-identification of their patients down to about one in five hundred, what value of K should they choose?

## The dataset

The data consists of the records of 100 patients with their name, age, gender, postcode, admission and discharge dates, diagnosis code, and the name of the treating doctor.
Diagnosis codes follow the International Statistical Classification of Diseases and Related Health Problems, 10th revision (ICD-10), a medical classification list maintained by the World Health Organization.
The ones present in the dataset are:
- I519: Cardiac Arrest
- J189: Pneumonia
- E116: Complications of diabetes
- A419: Sepsis
- B20: HIV


## The method
Let's recall the anonymisation workflow, which is implemented in the rest of the notebook:

- Determine the release model: public or non-public.
- Determine the acceptable re-identification risk threshold.
- Classify data attributes (direct, indirect, non identifiers).
- Remove unused attributes.
- Anonymise direct and indirect identifiers.
- Determine actual risk and compare against threshold.
- Perform more anonymisation if necessary.
- Evaluate solution: does the utility meet the target?
- Determine controls required.
- Document the anonymisation process.



### Import libraries 
We will be using Pandas, a powerful open-source data manipulation library.

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import datetime

### Load the dataset 

In [None]:
### read the dataset CSV file.
df = pd.read_csv('dataset_anonymisation.csv', index_col=0)

### Let's look at a few records
df.iloc[:10]

### Step 1: Determine the release model
Q: Is the release model public or non public?

A: The data is only going to be released to ShaZen, making the release model non public.

### Step 2: Determine the acceptable re-identification risk threshold
Q: Which threshold did the hospital decide on?

A: The hospital wants the chance of re-identification to be one in five hundred, or 0.2%. Using the equation P(re-ID) = Pkanon * P(breach) with P(re-ID) = 0.002, P(breach) = 0.01 and Pkanon = 1/K, we get 1/K = 0.002 / 0.01 = 0.2, so K = 5.
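The arithmetic above can be checked with a short sketch. The helper name `required_k` is an assumption made for illustration and is not part of the original notebook; exact fractions are used so that floating-point rounding cannot affect the result.

```python
import math
from fractions import Fraction


def required_k(p_reid_target, p_breach):
    """Smallest integer K such that (1/K) * p_breach <= p_reid_target.

    Hypothetical helper: rearranges P(re-ID) = (1/K) * P(breach) for K.
    """
    return math.ceil(Fraction(p_breach) / Fraction(p_reid_target))


# Hospital's figures: target of one in five hundred, breach risk of 1%
k = required_k(Fraction(1, 500), Fraction(1, 100))
print(k)
```

Rounding up with `math.ceil` is a design choice: when the ratio is not a whole number, the next larger K is the smallest value that still meets the target.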

### Step 3: Classify data attributes
Q: Write down all the attributes in the dataset, and if they are direct, indirect, or non identifiers.

A: Names and postcode are direct identifiers, the rest are indirect.
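One way to keep this classification at hand for the later steps is a small lookup table. This is only a sketch: the dictionary `ATTRIBUTE_CLASSES` and the helper `columns_of_class` are hypothetical names, and the column labels are assumed to match those used in the rest of the notebook. 'Dr name' is left out because it is dropped in Step 4.

```python
# Hypothetical lookup table: column name -> identifier class,
# following the answer above (names and postcode direct, the rest indirect).
ATTRIBUTE_CLASSES = {
    'Names': 'direct',
    'Postcode': 'direct',
    'Age': 'indirect',
    'Gender': 'indirect',
    'Admission date': 'indirect',
    'Discharge date': 'indirect',
    'Diagnosis code': 'indirect',
}


def columns_of_class(kind):
    """Return the column names classified as `kind`, in declaration order."""
    return [column for column, cls in ATTRIBUTE_CLASSES.items() if cls == kind]


direct_identifiers = columns_of_class('direct')
indirect_identifiers = columns_of_class('indirect')
print(direct_identifiers)
print(indirect_identifiers)
```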

### Step 4: Remove unused attributes
Q: Given the research direction of the hospital, which attributes can be removed without affecting utility?


A: For this particular study, the hospital is not interested in the doctor treating the patients, so we can safely remove this attribute without affecting the utility of this dataset for the study.

In [None]:
### Use the drop method of the pandas DataFrame object to remove the 'Dr name' column
df = df.drop(['Dr name'], axis = 1)

### Step 5: Anonymise direct and indirect identifiers 

We are left with seven identifiers: two direct and five indirect.





### Name
Let's start with the name.
The hospital wants to be able to get back to the identity of the patient at the end of the study, so we will use
pseudonymization.
There is a CSV file containing a table of pseudonyms, which we can load using Pandas.



In [None]:
### Read the pseudonymisation table csv
df_pseudos = pd.read_csv('pseudonymisation_table.csv',index_col=0)

### Let's look at some values
df_pseudos.iloc[:10]

We can then simply replace the values of the 'Names' attribute with the corresponding values of the 'Pseudonyme' attribute.

In [None]:
### Map each name in the dataset to its pseudonym.
### df_pseudos is indexed by the real name, so Series.map looks each name up in that index
### (a name missing from the table would become NaN).
df['Names'] = df['Names'].map(df_pseudos['Pseudonyme'])

### Look at some values
df.iloc[:10]
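Because the hospital wants to recover patient identities after the study, the same table can be used in the opposite direction. The sketch below uses a toy table with made-up names and pseudonyms (none of these values come from the real files) and assumes, as in the notebook, a table indexed by real name with a 'Pseudonyme' column.

```python
import pandas as pd

# Toy pseudonymisation table (made-up values), indexed by real name
df_pseudos_demo = pd.DataFrame(
    {'Pseudonyme': ['P001', 'P002', 'P003']},
    index=['Alice Smith', 'Bob Jones', 'Carol White'],
)
df_demo = pd.DataFrame(
    {'Names': ['Bob Jones', 'Alice Smith', 'Carol White'], 'Age': [34, 58, 71]}
)

# Forward direction: real name -> pseudonym
df_demo['Names'] = df_demo['Names'].map(df_pseudos_demo['Pseudonyme'])

# Reverse direction: build a pseudonym -> real name lookup from the same table
reverse_lookup = pd.Series(
    df_pseudos_demo.index.to_numpy(),
    index=df_pseudos_demo['Pseudonyme'].to_numpy(),
)
recovered_names = df_demo['Names'].map(reverse_lookup)
print(recovered_names.tolist())
```

The reverse lookup only works if pseudonyms are unique, which is also what makes the table itself sensitive: whoever holds it can undo the pseudonymisation.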

### Age

Age is an indirect identifier, and to achieve K-anonymity we will use data perturbation, more specifically base-5 rounding (rounding each age to the nearest multiple of 5).
Note that by doing this we are degrading the accuracy of a useful feature of the dataset.
This is part of the tradeoff between anonymity and utility.

In [None]:
### Define a simple helper function for rounding to the nearest multiple of a given base
def myround(x, base=5):
    return int(base * round(float(x)/base))


### Round the ages in the dataset.
### Series.map keeps the original index, so each rounded value stays on its own row.
df['Age'] = df['Age'].map(myround)

### Look at some values
df.iloc[:10]
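To make the effect of the rounding concrete, here is a vectorised sketch on a handful of made-up ages. The function name `round_to_base` and the sample values are assumptions for illustration; the index is deliberately not 0..n-1 to show that the result stays aligned with the original rows.

```python
import pandas as pd


def round_to_base(values, base=5):
    """Round every value of a numeric Series to the nearest multiple of `base`."""
    return ((values / base).round() * base).astype(int)


# Made-up ages with a non-default index
ages = pd.Series([23, 37, 42, 58, 61], index=[10, 11, 12, 13, 14])
rounded_ages = round_to_base(ages)
print(rounded_ages.tolist())
```

After rounding, 58 and 61 fall into the same group, which is exactly how the perturbation increases group sizes at the cost of precision.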

### Gender
This is an indirect identifier, but we can see that it already satisfies the K-anonymity condition. Therefore no further action is necessary.

### Postcode

Oxford postcodes work in the following way: the first letter-digit combination refers to a rather large area (four in total in Oxford), and the second narrows this down to a handful of addresses within that area.
Postcode is a direct identifier, and we will again need to lose some utility in order to satisfy the anonymity threshold. We will perturb the value in the records by removing the second letter-digit combination, thus keeping the general area only.

In [None]:
### Keep only the first half of the postcode (the part before the space)
df['Postcode'] = df['Postcode'].str.strip().str.split().str[0]

### Look at some values
df.iloc[:10]
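Splitting on whitespace assumes every postcode is written with a space between its two halves. As a hedged alternative, the sketch below relies on the fact that the second half of a UK postcode is always three characters long, so it also handles values typed without a space. The helper name `outward_code` and the sample postcodes are made up for illustration.

```python
def outward_code(postcode):
    """Return the area part of a UK postcode, with or without a separating space."""
    compact = postcode.replace(' ', '').upper()
    # The last three characters are the second half; everything before is the area
    return compact[:-3]


samples = ['OX1 3PG', ' ox2 6hx ', 'OX44AB', 'OX44 7AB']
areas = [outward_code(code) for code in samples]
print(areas)
```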

### Admission and discharge dates

The hospital does not necessarily need time as one of the features of the dataset, so the dates could have been removed as unused attributes. They were kept, however, but do not need to be very precise, so we keep only the year and remove the rest.

In [None]:
### Parse the date strings and keep only the year (the part before the first '-')
df['Admission date'] = df['Admission date'].str.strip().str.split('-').str[0]
df['Discharge date'] = df['Discharge date'].str.strip().str.split('-').str[0]

### Look at some values
df.iloc[:10]
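An alternative to string splitting is to parse the dates properly, which also fails loudly on malformed values. This sketch assumes the dates are stored as `YYYY-MM-DD` strings, which is consistent with splitting on `-` above but is not confirmed by the dataset itself; the sample values are made up.

```python
import pandas as pd

# Made-up dates, assumed to follow the YYYY-MM-DD layout
dates = pd.Series(['2015-03-12', ' 2016-11-02 ', '2017-01-30'])

# Parse to real timestamps, then keep only the year component
years = pd.to_datetime(dates.str.strip(), format='%Y-%m-%d').dt.year
print(years.tolist())
```

Unlike the string approach, this yields integer years rather than strings, so the two variants should not be mixed in the same column.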

### Diagnosis code

The remaining attribute is the diagnosis code, an indirect identifier.
To find out how many distinct values it takes and how often each occurs, we need to do some data exploration, for which Pandas is very convenient.


In [None]:
### Value count identifies unique values and returns their frequencies
value_count = df['Diagnosis code'].value_counts()

### Let's look at the results
value_count

We can see that the HIV code B20 is unique. Removing the whole "Diagnosis code" attribute is of course out of the question, but we can sacrifice the unique record without losing much utility.

In [None]:
### Remove the row containing the B20 value for "Diagnosis code"
df = df.drop(df.loc[df['Diagnosis code'] == 'B20'].index[0], axis=0)

In [None]:
df
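The same idea generalises to any value that occurs fewer than K times. The sketch below works on made-up diagnosis codes rather than the real dataset, and the names `K`, `rare_values` and `kept` are illustrative only.

```python
import pandas as pd

K = 5  # threshold chosen in Step 2

# Made-up column: two frequent codes and one unique code
codes = pd.Series(['J189'] * 6 + ['I519'] * 5 + ['B20'], name='Diagnosis code')

# Values occurring fewer than K times cannot satisfy the threshold on their own
counts = codes.value_counts()
rare_values = counts[counts < K].index.tolist()

# Suppress the records carrying those rare values
kept = codes[~codes.isin(rare_values)]
print(rare_values, len(kept))
```

Suppressing records is only reasonable when few of them are affected; if many values were rare, grouping codes into broader categories would preserve more utility.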

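### Step 6: Determine actual risk and compare against threshold

Q: What is the k-anonymity of the dataset now? Compare against the risk threshold decided on by the hospital.

A: The code below lists how often each value occurs, attribute by attribute; the smallest count is 5, which matches the threshold K = 5 chosen in Step 2. Strictly speaking, K-anonymity is defined over combinations of indirect identifiers, so these per-attribute counts only give an upper bound on K: the dataset is 5-anonymous only if every combination of values that occurs is also shared by at least 5 records.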

In [None]:
### Print the frequency of each value, one attribute at a time.
### Series.value_counts replaces the deprecated top-level pd.value_counts.
for column in ['Age', 'Gender', 'Postcode', 'Admission date', 'Discharge date', 'Diagnosis code']:
    print(f"Values and frequency for {column} attribute")
    print(df[column].value_counts())
    print('\n')
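Per-attribute counts do not show how records group together once several attributes are considered jointly. A minimal sketch of a combined check is given below, on made-up data; the function name `k_anonymity` is an assumption, and on the real dataset the list of indirect identifiers would be passed instead.

```python
import pandas as pd


def k_anonymity(frame, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    return int(frame.groupby(quasi_identifiers).size().min())


# Made-up records: each attribute alone looks safer than the combination
toy = pd.DataFrame({
    'Age': [25, 25, 25, 40, 40, 40],
    'Gender': ['F', 'F', 'M', 'M', 'M', 'M'],
})

k_age = k_anonymity(toy, ['Age'])
k_gender = k_anonymity(toy, ['Gender'])
k_combined = k_anonymity(toy, ['Age', 'Gender'])
print(k_age, k_gender, k_combined)
```

In this toy example every age occurs three times and every gender at least twice, yet one record is unique once both are combined, which is why the combined figure is the one to compare against the threshold.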


### Step 7: Perform more anonymisation if necessary

Q: Is any more anonymisation required?

A: No, we satisfied the risk threshold.

### Step 8: Evaluate solution: does the utility meet the target?

The main attributes of interest for the study are still present, despite some data perturbation applied to age and postcode. The utility should therefore meet the desired target.


### Step 9: Determine controls required

Q: Which controls would you implement?

A: We can implement a variety of controls. The hospital chooses to implement a query-only, revocable access system for ShaZen. Additionally, all their databases (including the pseudonymisation table) are kept on encrypted hard drives in safes within the hospital.

### Step 10: Document the anonymisation process

Q: Write a summary of the anonymization process that was implemented.

A: 
We applied k-anonymisation through the following steps:
- We removed the 'Dr name' attribute.
- We pseudonymised the 'Names' attribute. The pseudonymisation table will be kept on an encrypted hard drive in a safe in the hospital.
- We applied data perturbation to the 'Age' attribute, using base-5 rounding.
- We performed data perturbation on the 'Postcode' attribute by keeping only the first letter-digit combination.
- We performed data perturbation on the 'Admission date' and 'Discharge date' attributes by keeping only the year and discarding the rest.
- We removed record number 99, which contained the single 'B20' value for the 'Diagnosis code' attribute.
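The documented steps can also be collected into a single function, which makes the process easier to rerun and audit. This is a sketch under the assumptions of this notebook (same column names, dates as `YYYY-MM-DD` strings, a pseudonym table indexed by real name); the function name `anonymise` and the toy records are made up.

```python
import pandas as pd


def anonymise(df, df_pseudos):
    """Apply the documented steps to a copy of `df` and return the result."""
    out = df.drop(columns=['Dr name'])                                 # remove unused attribute
    out['Names'] = out['Names'].map(df_pseudos['Pseudonyme'])          # pseudonymise
    out['Age'] = out['Age'].map(lambda age: int(5 * round(age / 5)))   # base-5 rounding
    out['Postcode'] = out['Postcode'].str.strip().str.split().str[0]   # keep area only
    for column in ['Admission date', 'Discharge date']:                # keep year only
        out[column] = out[column].str.strip().str.split('-').str[0]
    return out[out['Diagnosis code'] != 'B20']                         # drop the unique code


# Toy input (made-up values) to exercise the function
toy = pd.DataFrame({
    'Names': ['Alice Smith', 'Bob Jones'],
    'Age': [23, 58],
    'Gender': ['F', 'M'],
    'Postcode': ['OX1 3PG', 'OX2 6HX'],
    'Admission date': ['2015-03-12', '2016-11-02'],
    'Discharge date': ['2015-03-20', '2016-11-09'],
    'Diagnosis code': ['J189', 'B20'],
    'Dr name': ['Dr A', 'Dr B'],
})
toy_pseudos = pd.DataFrame({'Pseudonyme': ['P001', 'P002']},
                           index=['Alice Smith', 'Bob Jones'])

result = anonymise(toy, toy_pseudos)
print(result)
```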
