<h1 style="text-align: center;">Climate Tech Amsterdam</h1>
<img src="Images/Sustainalab.png" 
     align="right" 
     width="400" />
     
In this notebook we will show the basics of anonymisation and synthetic data.
We will go through the need for privacy enhancement via a real-life example. Following this, we will anonymise our data and create synthetic data.

This notebook is presented by Pepijn de Reus & Ana Oprescu.

In [1]:
%%capture
!pip install DataSynthesizer
!pip install pandas
!pip install numpy
!pip install pyRAPL
!pip install pymongo

In [2]:
# import packages

import pandas as pd
import numpy as np
import DataSynthesizer
import pyRAPL
import subprocess
import os
import os.path
import sys
import yaml
import time
import pymongo

# import modules from certain packages
from subprocess import STDOUT, PIPE
from yaml.loader import SafeLoader
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator
from DataSynthesizer.lib.utils import read_json_file, display_bayesian_network

## Looking at the data
Below we have the data in our data set, consisting of 30159 individuals with their respective data.

In [3]:
data = pd.read_csv('./Adult/Adult_train.csv')
data.head(10)

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


Looking at this data, is this data set anonymised?

## Performing some queries on the data

Suppose that in this data set, we are looking for someone from Yugoslavia:

In [4]:
yug_data = data.loc[data['country'] == 'Yugoslavia']
yug_data

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income
946,56,Private,HS-grad,Married-civ-spouse,Other-service,Husband,White,Male,0,0,50,Yugoslavia,<=50K
4075,25,Private,Some-college,Never-married,Exec-managerial,Own-child,White,Female,0,0,40,Yugoslavia,<=50K
5821,20,Private,Some-college,Never-married,Adm-clerical,Own-child,White,Male,0,0,40,Yugoslavia,<=50K
6718,35,Private,HS-grad,Married-civ-spouse,Other-service,Husband,White,Male,0,0,40,Yugoslavia,>50K
11564,40,Local-gov,9th,Married-civ-spouse,Other-service,Wife,White,Female,0,0,40,Yugoslavia,>50K
11858,31,Private,Bachelors,Married-civ-spouse,Other-service,Husband,White,Male,0,0,40,Yugoslavia,<=50K
12143,66,Private,Assoc-acdm,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,5556,0,40,Yugoslavia,>50K
16914,41,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,Yugoslavia,<=50K
19042,35,Private,Bachelors,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,65,Yugoslavia,>50K
20568,56,Private,7th-8th,Divorced,Machine-op-inspct,Not-in-family,White,Female,0,0,20,Yugoslavia,<=50K


And suppose we know a person from Yugoslavia that works for the local government:

In [5]:
yug_data = yug_data.loc[yug_data['type_employer'] == 'Local-gov']
yug_data

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income
11564,40,Local-gov,9th,Married-civ-spouse,Other-service,Wife,White,Female,0,0,40,Yugoslavia,>50K


Only one person is left: combining Yugoslavia as country of origin with a local-government employer singles out a unique person in our data set.

If we also knew where this data set was collected, e.g. in some town, we could quite easily trace this record back to this specific woman. With just two queries, we can uniquely identify a person.
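
The same check can be run for every combination of values at once instead of one filter at a time. A minimal sketch, assuming a small hand-made frame that mirrors the `country` and `type_employer` values of the ten Yugoslavian rows shown above; the names `toy` and `singled_out` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy frame mirroring the country/type_employer values of the ten rows above
toy = pd.DataFrame({
    "country": ["Yugoslavia"] * 10,
    "type_employer": ["Private"] * 4 + ["Local-gov"] + ["Private"] * 5,
})

def singled_out(df, columns):
    """Return the value combinations of `columns` that occur exactly once."""
    sizes = df.groupby(columns).size().reset_index(name="count")
    return sizes[sizes["count"] == 1]

unique_combos = singled_out(toy, ["country", "type_employer"])
print(unique_combos)
```

Every row returned by `singled_out` is a combination that identifies exactly one person, just like the two queries above did.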


<h1 style="text-align: center;">k-anonymity</h1>

To fight such tracebacks to unique persons, we can anonymise our data using k-anonymity. K-anonymity is a property of a data set that can serve as a metric for privacy. A data set with a k-anonymity of 3 means that for each row in our data, at least 2 other rows have exactly the same values for the identifying attributes (the quasi-identifiers). This makes at least 3 rows indistinguishable from each other.

In our earlier example with the Yugoslavian people, this would mean that at least 3 of them would be working in the local government so that we cannot uniquely identify them.
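
A minimal sketch of how the group sizes behind k-anonymity could be measured: the helper below returns the size of the smallest group of rows that share the same values on the chosen columns. The helper name `smallest_group` and the toy frame are assumptions for illustration; in practice only the columns an attacker may know would be passed.

```python
import pandas as pd

def smallest_group(df, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())

# Illustrative toy data, not taken from the Adult data set
toy = pd.DataFrame({
    "country": ["Yugoslavia"] * 4 + ["Cuba"] * 3,
    "type_employer": ["Private"] * 3 + ["Local-gov"] + ["Private"] * 3,
})

print(smallest_group(toy, ["country"]))                   # 3
print(smallest_group(toy, ["country", "type_employer"]))  # 1
```

A result of 1 means that at least one person is unique on those columns, so adding a second attribute to the query is enough to break the anonymity here.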

We can obtain k-anonymity by generalisation and suppression. In the example above we could, for instance, remove unique values such as 'Local-gov'. Or we could generalise: Yugoslavian would then become "Eastern-Europe", as would Polish, Romanian and Bulgarian. Similarly, we could generalise local government into government. For each of these categorisations we can build a hierarchy and keep generalising until there is nothing left to be generalised. If the data set still does not satisfy k-anonymity at that point, we can suppress the data by replacing it with asterisks (*).
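
Generalisation followed by suppression can be sketched in a few lines of pandas. This is a simplified illustration, not the Java implementation used later in this notebook: the two one-level hierarchies, the helper `generalise_and_suppress` and the toy rows are all hypothetical.

```python
import pandas as pd

# Hypothetical one-level hierarchies, for illustration only
COUNTRY_TO_REGION = {"Yugoslavia": "Eastern-Europe", "Poland": "Eastern-Europe",
                     "Romania": "Eastern-Europe", "United-States": "North-America"}
EMPLOYER_TO_SECTOR = {"Local-gov": "Government", "State-gov": "Government",
                      "Private": "Private"}

def generalise_and_suppress(df, k):
    out = df.copy()
    # Generalisation: replace each value by its parent in the hierarchy
    out["country"] = out["country"].map(COUNTRY_TO_REGION)
    out["type_employer"] = out["type_employer"].map(EMPLOYER_TO_SECTOR)
    # Suppression: rows whose group is still smaller than k get asterisks
    cols = ["country", "type_employer"]
    group_size = out.groupby(cols)["country"].transform("size")
    out.loc[group_size < k, cols] = "*"
    return out

toy = pd.DataFrame({
    "country": ["Yugoslavia", "Poland", "Romania", "United-States"],
    "type_employer": ["Local-gov", "State-gov", "Local-gov", "Private"],
})
anon = generalise_and_suppress(toy, k=3)
print(anon)
```

The three Eastern-European government workers end up in one group of size 3, while the single remaining row cannot be generalised into a large enough group and is suppressed.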

### An example of such a generalisation hierarchy can be seen below:

![title](Images/Gen+Supp.png)

## Applying k-anonymity to the Adult data set

In [6]:
def Load_control_file(Test_file):
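    # raise instead of calling exit(), which would shut down the notebook kernel
    if not os.path.isfile(Test_file):
        raise FileNotFoundError(f"Given file does not exist: {Test_file}")

    if not Test_file.endswith(".yaml"):
        raise ValueError("Control file has to be a YAML file")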

    with open(Test_file) as f:
        data = yaml.load(f, Loader=SafeLoader)
    
    return data

def create_gen(data):
    print("Creating generalisation structures...\n")
    # the context manager closes the file even if a write fails
    with open("./Adult/hierarchy/hierarchy.txt", "w") as f:
        for key, value in data["hierarchy"].items():
            f.write(key + "," + value + "," + "\n")

def create_hierachy(data):
    print("Creating hierarchy..")
    if data["type"] == "general":
        create_gen(data)
    else:
        print("Invalid type given")
        exit()

def measurement(k, iterations, input_file, class_name, suppression, data_type, dataset):
    # compile Java file using
    # sudo javac -cp .:libraries/*
    
    subprocess.run(['javac', '-cp', '.:libraries/*', 'k_anonymity.java'], capture_output=False)

    # delete old anonymised output if present (relative path, matching where the result is read later)
    output_file = './Adult/' + str(dataset) + '_' + str(k) + '.csv'
    if os.path.isfile(output_file):
        os.remove(output_file)

    def Energy_consumption():

        subprocess.run(['java', '-cp', '.:libraries/*', class_name, 
            str(k), input_file, str(suppression), dataset], capture_output=False)
        
    for _ in range(iterations):
        Energy_consumption()
        print(f"Anonymising {dataset} dataset with k-anonymity of {k} completed.")


# use YAML file as input
data = Load_control_file('./Adult/adult.yaml')

# create hierachy for the Adult data
create_hierachy(data)

for k in data["k_values"]:
    measurement(k, int(data["iterations"]), data["input_file"], data["class_name"], 
                    data["suppression_limit"], data["type"], data["data_set"])

Creating hierarchy..
Creating generalisation structures...
Anonymising adult with k-anonymity of 3 completed.
Anonymising adult with k-anonymity of 10 completed.
Anonymising adult with k-anonymity of 27 completed.
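
`Load_control_file` only checks that the file exists and has a YAML extension, not what it contains. A minimal sketch of a content check, assuming the control file provides exactly the keys read by the cell above. The helper `check_control` is hypothetical. In the example dict, `type`, `data_set` and `k_values` follow the printed output, `iterations` is an assumption, and the remaining values are placeholders rather than the notebook's real settings.

```python
REQUIRED_KEYS = {"type", "hierarchy", "k_values", "iterations",
                 "input_file", "class_name", "suppression_limit", "data_set"}

def check_control(control):
    """Raise a ValueError if the parsed control file lacks a key the cell above reads."""
    missing = REQUIRED_KEYS - set(control)
    if missing:
        raise ValueError(f"Control file is missing keys: {sorted(missing)}")
    if control["type"] != "general":
        raise ValueError("Invalid type given")
    return control

# A parsed control file as a plain dict
example = {
    "type": "general",
    "data_set": "adult",
    "k_values": [3, 10, 27],
    "iterations": 1,                                             # assumption
    "input_file": "placeholder.csv",                             # placeholder
    "class_name": "placeholder_class",                           # placeholder
    "suppression_limit": 0.1,                                    # placeholder
    "hierarchy": {"placeholder_attribute": "placeholder_file"},  # placeholder
}
check_control(example)
```

Running such a check right after loading would report a misspelled or missing key before the Java program is compiled and started.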


Let's look at our data with k-anonymity of 3!


In [7]:
data_3 = pd.read_csv('./Adult/adult_3.csv', delimiter=';')
data_3.head(5)

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income
0,*,*,*,*,*,*,*,*,*,*,*,*,<=50K
1,*,Self-emp-not-inc,high,Married-civ-spouse,Stores,Married,White,Male,-1500,0,*,North-America,<=50K
2,*,Private,medium,Divorced,Services,Other,White,Male,-1500,0,*,North-America,<=50K
3,*,Private,low,Married-civ-spouse,Services,Married,Black,Male,-1500,0,*,North-America,<=50K
4,*,*,*,*,*,*,*,*,*,*,*,*,<=50K


The file has changed quite a bit. Rows 0 and 4 have been suppressed entirely (only the income is left), education is generalised into low, medium and high, and the countries are grouped per region. Could we still find our unique person from the earlier example?

In [8]:
yug_data = data_3.loc[data_3['country'] == 'Yugoslavia']
yug_data = yug_data.loc[yug_data['type_employer'] == 'Local-gov']
yug_data

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income


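No! The query returns no rows: the countries are now grouped per region, so the value 'Yugoslavia' no longer appears in the data.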


<h1 style="text-align: center;">Synthetic data</h1>

Apart from anonymising our data, we can also make synthetic data. Synthetic data is constructed by learning the distribution of the original data. We learn the specifics of a data set, e.g. what the average age is, what percentage of males works in education, etc. After learning these distributions, we construct a model that creates new data following them.

We then have data that is statistically similar to the original data, but whose rows are sampled from the model rather than copied from the original. So even though the synthetic data could contain unique rows, these are newly generated rows and not the unique persons from the original data.
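
The idea of learning a distribution and then sampling from it can be sketched with two attributes. This is a deliberately tiny stand-in, not how DataSynthesizer is implemented: it learns how often each `sex` occurs and how `income` is distributed for each `sex`, then draws new rows from those frequencies. The toy rows and the helper `sample_row` are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

# Toy "original" rows of (sex, income); illustrative values only
original = [("Male", "<=50K"), ("Male", ">50K"), ("Male", ">50K"),
            ("Female", "<=50K"), ("Female", "<=50K"), ("Female", ">50K")]

# Learn P(sex) and P(income | sex) as plain frequency counts
sex_counts = Counter(sex for sex, _ in original)
income_given_sex = defaultdict(Counter)
for sex, income in original:
    income_given_sex[sex][income] += 1

def sample_row(rng):
    """Draw sex from its marginal, then income conditional on the drawn sex."""
    sex = rng.choices(list(sex_counts), weights=list(sex_counts.values()))[0]
    incomes = income_given_sex[sex]
    income = rng.choices(list(incomes), weights=list(incomes.values()))[0]
    return sex, income

rng = random.Random(0)
synthetic = [sample_row(rng) for _ in range(5)]
print(synthetic)
```

The sampled rows only use values and proportions seen in the original data, but no row is copied from it directly.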

In [9]:
input_data = './Adult/Adult_train.csv'

# DataSynthesizer mode used in this notebook
mode = 'correlated_attribute_mode'

# location of two output files
description_file = './Adult/description_adult.json'
synthetic_data = './Adult/adult_synthetic_data.csv'

# An attribute is categorical if its domain size is less than this threshold.
# Here the threshold is raised to 42 so that attributes with a larger domain than "education"
# (which is 14 in input dataset), e.g. "country", are still treated as categorical.
threshold_value = 42

# A parameter in Differential Privacy. It roughly means that removing a row in the input dataset will not 
# change the probability of getting the same output more than a multiplicative difference of exp(epsilon).
# Increase epsilon value to reduce the injected noises. Set epsilon=0 to turn off differential privacy.
epsilon = 0

# The maximum number of parents in Bayesian network, i.e., the maximum number of incoming edges.
degree_of_bayesian_network = 2

# Number of tuples generated in synthetic dataset.
num_tuples_to_generate = 30163

describer = DataDescriber(category_threshold=threshold_value)
describer.describe_dataset_in_correlated_attribute_mode(dataset_file=input_data, 
                                                        epsilon=epsilon, 
                                                        k=degree_of_bayesian_network)
describer.save_dataset_description_to_file(description_file)

# Generate data set
generator = DataGenerator()
generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
generator.save_synthetic_data(synthetic_data)

Adding ROOT race
Adding attribute country
Adding attribute education
Adding attribute occupation
Adding attribute hr_per_week
Adding attribute age
Adding attribute marital
Adding attribute relationship
Adding attribute sex
Adding attribute income
Adding attribute type_employer
Adding attribute capital_gain
Adding attribute capital_loss
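
The role of `epsilon` can be illustrated with the Laplace mechanism, one common way to inject differentially private noise into counts. This is a minimal sketch under the assumption that one row changes a single count by 1; whether DataSynthesizer uses exactly this procedure internally is not shown in this notebook. The helper `noisy_distribution` and the example counts are illustrative.

```python
import numpy as np

def noisy_distribution(counts, epsilon, rng):
    """Add Laplace noise to counts and renormalise them into probabilities."""
    counts = np.asarray(counts, dtype=float)
    if epsilon > 0:
        # Assumed sensitivity of 1: adding or removing one row changes one count by 1
        counts = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    counts = np.clip(counts, 0.0, None)  # negative counts are not meaningful
    total = counts.sum()
    if total == 0:
        # Fall back to a uniform distribution if all noisy counts were clipped to zero
        return np.full(counts.shape, 1.0 / counts.size)
    return counts / total

rng = np.random.default_rng(0)
exact = noisy_distribution([90, 9, 1], epsilon=0, rng=rng)
noisy = noisy_distribution([90, 9, 1], epsilon=0.1, rng=rng)
print(exact)
print(noisy)
```

With `epsilon=0` the counts are used unchanged, mirroring the setting in the cell above that turns differential privacy off. Smaller positive values of `epsilon` give a larger noise scale and therefore more privacy at the cost of accuracy.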



In [10]:
syn_data = pd.read_csv('./Adult/adult_synthetic_data.csv')
syn_data.head(5)

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income
0,38.0,Private,Bachelors,Never-married,Sales,Not-in-family,White,Male,470.0,131.0,50.0,United-States,<=50K
1,53.0,Private,Assoc-voc,Married-civ-spouse,Craft-repair,Husband,White,Male,3566.0,91.0,36.0,United-States,>50K
2,33.0,Private,HS-grad,Never-married,Exec-managerial,Unmarried,White,Female,1326.0,214.0,40.0,United-States,<=50K
3,62.0,Private,Bachelors,Married-civ-spouse,Craft-repair,Husband,White,Male,2326.0,1669.0,30.0,United-States,>50K
4,48.0,Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,1290.0,125.0,39.0,United-States,>50K


In [11]:
syn_yug_data = syn_data.loc[syn_data['country'] == 'Yugoslavia']
syn_yug_data = syn_yug_data.loc[syn_yug_data['type_employer'] == 'Local-gov']
syn_yug_data

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income


So this person does not appear in the synthetic data. The synthetic data can still contain unique rows, but these do not occur in the original data set.
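
The two filters above only check for one person. A minimal sketch of a check over all rows, assuming both frames share the same attribute columns: an inner merge on every shared column returns the synthetic rows that also occur in the original data. The helper `overlapping_rows` and the toy frames are illustrative, and the second synthetic row is made up to show a match.

```python
import pandas as pd

def overlapping_rows(original, synthetic):
    """Rows of `synthetic` that match an `original` row on all shared columns."""
    shared = [c for c in synthetic.columns if c in original.columns]
    return synthetic.merge(original[shared].drop_duplicates(), on=shared, how="inner")

# Toy frames; synthetic ages are floats, as in the generated file
orig_toy = pd.DataFrame({"age": [40, 56],
                         "type_employer": ["Local-gov", "Private"],
                         "country": ["Yugoslavia", "Yugoslavia"]})
syn_toy = pd.DataFrame({"age": [43.0, 56.0],
                        "type_employer": ["Federal-gov", "Private"],
                        "country": ["Yugoslavia", "Yugoslavia"]})

overlap = overlapping_rows(orig_toy, syn_toy)
print(overlap)
```

On the notebook's own frames the `Unnamed: 0` column would have to be dropped first, since it is a row counter rather than an attribute. An empty result would then mean that no synthetic row is an exact copy of an original row.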

And for other data subjects with country Yugoslavia:

In [12]:
syn_yug_data = syn_data.loc[syn_data['country'] == 'Yugoslavia']
syn_yug_data

Unnamed: 0,age,type_employer,education,marital,occupation,relationship,race,sex,capital_gain,capital_loss,hr_per_week,country,income
8167,28.0,Private,Assoc-acdm,Never-married,Craft-repair,Not-in-family,White,Male,2326.0,104.0,44.0,Yugoslavia,<=50K
8987,44.0,Private,Some-college,Never-married,Exec-managerial,Own-child,White,Female,4496.0,205.0,36.0,Yugoslavia,<=50K
10109,59.0,Self-emp-not-inc,Assoc-acdm,Married-civ-spouse,Craft-repair,Husband,White,Male,3789.0,204.0,28.0,Yugoslavia,>50K
13169,37.0,Private,Bachelors,Divorced,Other-service,Unmarried,White,Female,2023.0,6.0,34.0,Yugoslavia,<=50K
13647,19.0,Private,Assoc-voc,Never-married,Craft-repair,Own-child,White,Male,3575.0,181.0,29.0,Yugoslavia,<=50K
14240,53.0,Private,7th-8th,Divorced,Machine-op-inspct,Not-in-family,White,Male,513.0,122.0,37.0,Yugoslavia,<=50K
15086,32.0,Private,Assoc-voc,Married-civ-spouse,Craft-repair,Husband,White,Male,3156.0,131.0,36.0,Yugoslavia,<=50K
15158,43.0,Federal-gov,Assoc-acdm,Divorced,Machine-op-inspct,Unmarried,White,Female,4722.0,103.0,36.0,Yugoslavia,<=50K
17887,43.0,Private,Assoc-acdm,Divorced,Machine-op-inspct,Not-in-family,White,Male,2234.0,214.0,38.0,Yugoslavia,<=50K
18618,40.0,Private,HS-grad,Never-married,Other-service,Not-in-family,White,Female,1179.0,2167.0,39.0,Yugoslavia,<=50K
