# CDM Project: iInsureU123 - k-Anonymity and anonymising data

 ## Overview
 In the notebook, we will implement functions to produce a k-anonymous customer information dataset provided by an insurance company, iInsureU123. The anonymised dataset would be made available to the researchers and government. Quasi-identifiers and sensitive attributes of values will be replaced with banded quantities to prevent sampled individuals from being identified. In addition, cryptographic hashing function will be applied to transform certain sensitive data to hash values for privacy reasons. The dataset will be made available to the public by the government whereas researchers from Imperial College will use the dataset for research purposes. Therefore, two datasets containing different attributes will be generated and encrypted.

## Loading Dataframes
In the project, we will use the Pandas package to load dataset as a data frame. Other modules that will be used have been listed below.

In [94]:
# Packages loading

import pandas as pd
import datetime
import matplotlib.pyplot as plt
from datetime import date
import hashlib
import random, string
from cryptography.fernet import Fernet

Reading the raw data "customer_information.csv" in as a dataframe with Pandas

In [95]:
# Import data and target
df = pd.read_csv('customer_information.csv') #Using relative path

## Banding
The utility of the dataset depends is based around exploring the association of the gene DRD4 and the exposures of travelling, education and geographical location. However, individual height and weight may be not necessary for the research purposes. We decide to convert height and weight into BMI for both research and privacy-preserving reasons, since the utility is maintained whilst improving anonymity.

In [96]:
#Converting columns "weight" and "height" into numeric.
df["weight"] = pd.to_numeric(df["weight"])
df["height"] = pd.to_numeric(df["height"])
df['heightsquared']=df['height']**2

#Converting to BMI
df['bmi']=df['weight']/df['heightsquared']
df['bmi']=df['bmi'].apply(lambda x:round(x,2))

#Deleting the intermediate column 'heightsquared'
del df['heightsquared']

In the following section, values of quasi-identifiers and sensitive attributes will be banded. For example, birthdate will be converted to banded age groups; participants will be categorized into light, intermediate and heavy smokers according to average cigarettes smoked per week; consumption of drinks will be categorized to low-level, moderate and high-level alcoholic consumption; participants will be grouped to "underweight","healthy" and "overweight".

In [97]:
#Generate a function to convert birthdate to age group
#Intermediate year of birth column for calculating age
birthdate=pd.to_datetime(df.birthdate)
df['birth_year'] = pd.DatetimeIndex(birthdate).year

#Age calculation
def from_birthdate_to_age(birth_date):
    now=pd.Timestamp('now')
    now_year,now_month,now_day = now.year, now.month, now.day
    birth_date = pd.to_datetime(birth_date)
    birth_year, birth_month, birth_day = birth_date.year, birth_date.month, birth_date.day
    age = now_year - birth_year
    if now_month >= birth_month:
        if now_day >= birth_day:
            age = now_year - birth_year + 1
    return (age)

df['age'] = df['birthdate'].apply(from_birthdate_to_age)

#Banding age to groups
bins_age = [18,27,37,47,57,67]
labels_age= ['18-27','28-37','38-47','48-57','58-67']
df['age_groups'] = pd.cut(df.age, bins = bins_age, labels = labels_age)

In [98]:
#Banding smoking per week
##We are defining smoking <=40 cig per week as light smokers, 40-175 as intermediate smokers, >175 as heavy smokers [(Schane et al., 2010)][(Wilson et al., 1992)]
bins_smok= [0, 40, 175, 500]
labels_smok=['light smokers','intermediate smokers','heavy smokers']
df['smoking_status'] = pd.cut(df.avg_n_cigret_per_week, bins = bins_smok, labels = labels_smok)

# Banding avg_n_drinks_per_week
## 0-3.9 as low-level alc consumption, 4-6.9 as moderate alc consumption,7-10 as high-level alc consumption
bins_alc = [0,4,7,10]
label_alc = ['low', 'moderate', 'high']
df['level of drinking_status'] = pd.cut(df.avg_n_drinks_per_week, bins = bins_alc, labels = label_alc, right = False)


In [99]:
# Banding 'BMI' to groups
##0-18.5 as underweight, 18.5-24.9 as healthy,24.9 and over as overweight

bins_bmi = [0,18.5,24.9,50]
label_bmi = ['Underweight','Healthy','Overweight']
df['level of bmi'] = pd.cut(df.bmi, bins = bins_bmi, labels = label_bmi, right = False)

The government aims to identify the association of genes and educational or geographical background. Therefore, country of birth and education levels are essential for the government dataset. In the following section, country of birth will be grouped by continents and cultural reasons because geographical and socio-cultural perspectives affect the decisions on migration.

In [100]:
# Country_of_birth mapping to continents
import pycountry_convert as pc  #pip install pycountry-convert
replace_dict={'Korea':'South Korea','Palestinian Territory':'Jordan','Saint Barthelemy':'Dominican Republic','Saint Helena': 'South Africa',
'Reunion': 'Mauritius','Svalbard and Jan Mayen':'Greenland','United States Minor Outlying Islands':'United States',
'Antarctica (the territory South of 60 deg S)': 'Heard Island and McDonald Islands','Western Sahara':'Morocco','Svalbard & Jan Mayen Islands':'Heard Island and McDonald Islands',
'Libyan Arab Jamahiriya':'Libya','Pitcairn Islands':'Fiji','Slovakia (Slovak Republic)':'Slovakia','Bouvet Island (Bouvetoya)':'Heard Island and McDonald Islands',
'Holy See (Vatican City State)':'Italy','Timor-Leste':'Indonesia',"Cote d'Ivoire":'Ghana','British Indian Ocean Territory (Chagos Archipelago)':'India',
"Netherlands Antilles":"Netherlands"}
df['country_of_birth'].replace(replace_dict,inplace=True)

# Group original countries to continent by applying a function to dataframe
def country_to_continent(country_of_birth):
    country_alpha2 = pc.country_name_to_country_alpha2(country_of_birth)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name
df['continent_of_birth'] = df['country_of_birth'].apply(country_to_continent)

# Group some continents according to geographic / cultural reasons to improve K-anonymity
df['continent_of_birth'].replace({'North America':'America','South America':'America','Antarctica':'Europe'},inplace=True)

In the following section, we band the education level into three groups. Primary and secondary education will be grouped to pre-university whereas masters and phD will be identified as post-graduate qualification. As apprenticeship or diploma are not included in the list, we assume that other degree could contain these academic qualification which then should be categorized as bachelor.

In [101]:
# Banding of education
df['education_level'].replace({'secondary':'pre-uni',
'primary': 'pre-uni',
'bachelor': 'bachelor',
'other': 'bachelor',
'masters': 'post-grad',
'phD':'post-grad'},inplace=True)

## Secure Hashing Algorithm
In the section below, we will create pseudonyms to anonymize direct sensitive identifiers (e.g. names, national insurance numbers). Secure Hashing Algorithm (SHA) will be used to map personal identifiers into a hashed pseudonym. In addition to SHA, we also add a salt i.e. random number generated by the computer to enhance the security, as well as a key to increase the input space.

In [102]:
#SHA 
#Salt generator
def randomword(length):
   letters = string.ascii_lowercase #generates lowercase letters
   return ''.join(random.choice(letters) for i in range(length)) #generates salt from letters

df['NI_enc']=df['national_insurance_number'].str.encode('utf-8') #utf encoding needed for sha function

key='h8jij4f3'.encode('utf-8') #encoding key

#Hash function
hashes=[]
salt=[]
for i in range(len(df['NI_enc'])):
    salt.append(randomword(10).encode('utf-8'))
    hashes.append(hashlib.sha1(key+salt[i]+df['NI_enc'][i]).hexdigest()) #hash function applied 

df['hash']=hashes 
hash_cols=['given_name','surname','phone_number','national_insurance_number','blood_group',
'postcode','birthdate','age','bank_account_number','birth_year','avg_n_drinks_per_week',
'avg_n_cigret_per_week','bmi','country_of_birth','height','weight']

#Lookup table generation
secure_df=df[hash_cols]
secure_df['hash']=df['hash']
secure_df['salt']=salt
secure_df.drop(columns=['age','birth_year','bmi'],inplace=True)
secure_df.set_index('hash',inplace=True)#This is the sensitive dataset that is not shared
new_cust=df.drop(columns=hash_cols)
new_cust=new_cust.drop(columns='NI_enc')
new_cust.set_index('hash',inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  secure_df['hash']=df['hash']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  secure_df['salt']=salt
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  secure_df.drop(columns=['age','birth_year','bmi'],inplace=True)


## Calculating K-anonymity
K-anonymity is computed based on processed data in order to assess the dataset's confidentiality in information sharing in the public domain. In the following sectioin, we calculate the k-anonminity for both researchers at Imperial College and the government's dataset. Due to the differences in research purposes, the dataset will contain different attributes as discussed previously. As a result quasi-identifiers differ in each anonymised dataset.

In [103]:
# K-anonyminity for Imperial dataset
quasi_identifiers_imp=['age_groups','level of bmi','gender']
k_df_imp=new_cust.groupby(quasi_identifiers_imp,observed=True).size().reset_index(name='Count').sort_values(by='Count')
print(f'The Imperial K-anonyminity is: {min(k_df_imp["Count"])}')

# K-anonyminity for government dataset
quasi_identifiers_gov=['level of bmi','continent_of_birth','education_level']
k_df_gov=new_cust.groupby(quasi_identifiers_gov,observed=True).size().reset_index(name='Count').sort_values(by='Count')
print(f'The Government K-anonyminity is: {min(k_df_gov["Count"])}')

The Imperial K-anonyminity is: 17
The Government K-anonyminity is: 6


Here, we will export anonymised dataframe to CSV File.

In [104]:
# convert the dataframe into csv
imp_df=new_cust[['gender','age_groups','n_countries_visited','cc_status','level of bmi','level of drinking_status','smoking_status']]
gov_df=new_cust[['continent_of_birth','education_level','cc_status','level of bmi','level of drinking_status','smoking_status']]
imp_df.to_csv('Imperial Researchers Dataset.csv')
gov_df.to_csv('Government Dataset.csv')

## Secure File Generation via Encryption
Creating keys for encrypting the datasets

In [105]:
# key generation 
key = Fernet.generate_key()

# string the key in a file
with open ('filekey.key', 'wb') as filekey:
    filekey.write(key)

Encryption for the Imperial Researchers Dataset

In [106]:
# opening the key 
with open ('filekey.key', 'rb') as filekey:
    key = filekey.read()

# using the generated key 
fernet = Fernet (key)

# opening the original file to encrypt
with open('Imperial Researchers Dataset.csv', 'rb') as Imperial_Researchers_Dataset_file:
    original = Imperial_Researchers_Dataset_file.read()

# encrypting the file
encrypted = fernet.encrypt(original)

# opening the file in write mode and 
# writing the encrypted data
with open ('Imperial Researchers Dataset.csv', 'wb') as encrypted_Imperial_Researchers_Dataset_file:
    encrypted_Imperial_Researchers_Dataset_file.write(encrypted)

Encryption for the Government Dataset

In [107]:
# opening the key 
with open ('filekey.key', 'rb') as filekey:
    key = filekey.read()

# using the generated key 
fernet = Fernet (key)

# opening the original file to encrypt
with open('Government Dataset.csv', 'rb') as Government_Dataset_file:
    original = Government_Dataset_file.read()

# encrypting the file
encrypted = fernet.encrypt(original)

# opening the file in write mode and 
# writing the encrypted data
with open ('Government Dataset.csv', 'wb') as encrypted_Government_Dataset_file:
    encrypted_Government_Dataset_file.write(encrypted)

Instructions for decrypting the Imperial Researchers Dataset csv file

In [108]:
# using the key
fernet = Fernet(key)


# opening the encrypted file
with open ('Imperial Researchers Dataset.csv', 'rb') as enc_Imperial_Researchers_Dataset_file:
  encrypted = enc_Imperial_Researchers_Dataset_file.read()


# decrypting the file
decrypted = fernet.decrypt(encrypted)


# opening the file in write mode and 
# writing the decrypted data
with open('Imperial Researchers Dataset.csv', 'wb') as dec_Imperial_Researchers_Dataset_file:
  dec_Imperial_Researchers_Dataset_file.write(decrypted)

Instructions for decrypting the Government Dataset csv file

In [109]:
# using the key
fernet = Fernet(key)


# opening the encrypted file
with open ('Government Dataset.csv', 'rb') as enc_Government_Dataset_file:
  encrypted = enc_Government_Dataset_file.read()


# decrypting the file
decrypted = fernet.decrypt(encrypted)


# opening the file in write mode and 
# writing the decrypted data
with open('Government Dataset.csv', 'wb') as dec_Government_Dataset_file:
  dec_Government_Dataset_file.write(decrypted)