# Testing the pure-LDP library with actual benchmark datasets
To apply the LDP (local differential privacy) mechanisms in our research, we will use a Python library that implements all the LDP mechanisms we need. The library is called 'pure-ldp' and is written by Samuel Maddock. It is available on GitHub: https://github.com/Samuel-Maddock/pure-LDP.

To make sure the library works for our purposes, we first test it on the datasets that we will use for the benchmark tests later in the project.

First, we load an LDP mechanism from the pure-ldp library. We use the local hashing frequency oracle.

In [68]:
import numpy as np
from pure_ldp.frequency_oracles.local_hashing import LHClient, LHServer

The first test is the example provided on Samuel Maddock's GitHub page.

In [69]:
# Using Optimal Local Hashing (OLH)

epsilon = 3 # Privacy budget of 3
d = 4 # For simplicity, we use a dataset with 4 possible data items

client_olh = LHClient(epsilon=epsilon, d=d, use_olh=True)
server_olh = LHServer(epsilon=epsilon, d=d, use_olh=True)

# Test dataset, every user has a number between 1-4, 10,000 users total
data = np.concatenate(([1]*4000, [2]*3000, [3]*2000, [4]*1000))

for item in data:
    # Simulate client-side privatisation
    priv_data = client_olh.privatise(item)

    # Simulate server-side aggregation
    server_olh.aggregate(priv_data)

# Simulate server-side estimation
print(server_olh.estimate(1)) # Should be approximately 4000 +- 200

3911.826679014247
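
As a rough sanity check on the `+- 200` tolerance in the comment above, the sketch below estimates the standard deviation of a single OLH count estimate. It is not taken from the pure-ldp documentation: it assumes the textbook analysis of local hashing with an idealised `g = e^epsilon + 1` hash buckets and independent user reports, which gives a variance of roughly `n * 4e^epsilon / (e^epsilon - 1)^2 + true_count`. The helper name `olh_approx_std` is hypothetical, and the library's actual implementation (for example, how it rounds `g`) may differ slightly.

```python
import math

def olh_approx_std(epsilon: float, n: int, true_count: float) -> float:
    """Approximate standard deviation of an OLH count estimate.

    Assumption (not from pure-ldp): idealised g = e^eps + 1 buckets, so
    Var ~= n * 4 * e^eps / (e^eps - 1)^2 + true_count.
    """
    e_eps = math.exp(epsilon)
    noise_term = n * 4 * e_eps / (e_eps - 1) ** 2  # independent of the item
    return math.sqrt(noise_term + true_count)

# Settings of the example above: epsilon = 3, 10,000 users, item 1 held by 4,000 users
std = olh_approx_std(epsilon=3, n=10_000, true_count=4000)
print(f"approximate standard deviation: {std:.1f}")
```

Under these assumptions a tolerance of 200 corresponds to a few standard deviations, so an estimate such as the one printed above is well within the expected range.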


## Load datasets
We use the same datasets as the paper 'Comparing Classifiers’ Performance under Differential Privacy' by Lopuhaä-Zwakenberg et al.

The first dataset to load is the 'Adult' dataset. It contains attributes such as age, race, sex and education. The target attribute indicates whether a person has an income of >50K or <=50K.

In [216]:
import pandas as pd

raw_data = pd.read_csv("./Dataset/adult.data")

print("An example of the first 5 rows:\n\n", raw_data.head())

An example of the first 5 rows:

    age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0     

## Discrete data test
The dataset contains both discrete and continuous attributes.
We first look at the discrete attributes. One question that arises is whether we need to encode the data as numerical values.
The GitHub example passes numerical data to the pure-ldp functions, so we construct an encoder that maps every category to an integer code.

In [194]:
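def convert_discrete_data(data):
    """Replace every column by its integer category codes (0, 1, 2, ...)."""
    # Work on a copy so the caller's DataFrame (or a slice of it) is not modified
    data = data.copy()

    for feature_name in data.columns:
        # Categories are sorted, so the codes follow the sorted order of the labels
        data[feature_name] = data[feature_name].astype("category").cat.codes

    return data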

The columns with discrete data are: workclass, education, marital-status, occupation, relationship, race, sex, native-country.
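
To make the encoding step concrete, here is a minimal sketch of what `.astype("category").cat.codes` does on a single column. The toy values are made up for illustration and are not rows of the Adult dataset; the variable names are hypothetical.

```python
import pandas as pd

# Toy column with made-up values (not taken from the Adult dataset)
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})

as_category = toy["workclass"].astype("category")

# One integer code per row; the categories are sorted before codes are assigned
codes = as_category.cat.codes

# Mapping from code back to label, useful for decoding results later
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Keeping `code_to_label` around makes it possible to translate frequency estimates back to the original category names.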

In [196]:
discrete_feature_columns = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]

# select only the data from the discrete columns
discrete_data = raw_data[discrete_feature_columns]
# convert the data from category names to numbers
discrete_data = convert_discrete_data(discrete_data)
discrete_data = discrete_data.to_numpy()

# Create fresh client and server instances of the local hashing frequency oracle.
epsilon = 4
d = 9  # the first column (workclass) has 9 category codes (0-8)
client_olh = LHClient(epsilon=epsilon, d=d, use_olh=True)
server_olh = LHServer(epsilon=epsilon, d=d, use_olh=True)

for item in discrete_data:
    # Simulate client-side privatisation of column 0 (workclass);
    # codes are shifted by 1 because the oracle expects items in 1..d
    priv_data = client_olh.privatise(item[0]+1)

    # Simulate server-side aggregation
    server_olh.aggregate(priv_data)

# Count how often each category code occurs
unique, counts = np.unique(discrete_data[:,0], return_counts=True)

print("True counts per category code: ", dict(zip(unique, counts)))
# Note: range(8) only covers codes 0-7, so code 8 is not estimated here
print("OLH estimates per category code: ",{i : server_olh.estimate(i+1) for i in range(8)})

True counts per category code:  {0: 1836, 1: 960, 2: 2093, 3: 7, 4: 22696, 5: 1116, 6: 2541, 7: 1298, 8: 14}
OLH estimates per category code:  {0: 1808.3203571105807, 1: 898.4904720551697, 2: 2083.14311415249, 3: -19.667375334844337, 4: 22865.573120155033, 5: 1110.853511587554, 6: 2603.6407600651955, 7: 1337.7904852054937}


### Note
The frequency oracle functions expect numerical inputs that start at 1, not at 0. Keep this in mind when specifying the domain size (`d`) and when querying an estimate (`.estimate(...)`): a zero-based category code `i` has to be passed as `i+1`.
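
A small helper can keep the shift and the domain size consistent instead of hard-coding them. The sketch below is only an illustration: `to_one_based` is a hypothetical name, and it assumes that every code between 0 and the largest observed code is a valid category.

```python
import numpy as np

def to_one_based(codes):
    """Shift zero-based category codes to 1..d and return (items, d).

    Assumes every code in 0..max(codes) is a valid category.
    """
    codes = np.asarray(codes)
    d = int(codes.max()) + 1  # number of categories = domain size
    return codes + 1, d

codes = np.array([0, 2, 0, 1])  # made-up zero-based codes
items, d = to_one_based(codes)
print(items, d)
```

The returned `d` can then be passed to both the client and the server, and `items` to `privatise`.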

## Continuous data test
Since the LDP frequency oracle needs categorical data, continuous data has to be discretized first. We do this by binning: the range of the data is split into bins that each represent a range of values.

Pandas provides the function `pd.qcut(...)` for this. It performs quantile-based binning, so every bin contains approximately the same number of data points. The number of bins is specified by `q`. Setting `labels` to `False` returns integer bin indices instead of interval labels. With `duplicates="drop"`, identical bin edges are merged, so a column with many repeated values can end up with fewer than `q` bins.
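
The behaviour of `pd.qcut` can be checked on two small made-up series, one with distinct values and one dominated by a single repeated value. This is only a sketch on toy data, not on the Adult dataset.

```python
import pandas as pd

# 100 distinct values: quantile binning fills the 4 bins equally
spread = pd.Series(range(100))
spread_bins = pd.qcut(spread, q=4, labels=False)
print(spread_bins.value_counts().sort_index().tolist())

# Heavily tied values (95 zeros): most quantile edges coincide, so
# duplicates="drop" merges them and fewer than q bins remain
tied = pd.Series([0] * 95 + [1, 2, 3, 4, 5])
tied_bins = pd.qcut(tied, q=10, labels=False, duplicates="drop")
print(tied_bins.nunique())
```

For columns like the second series, the number of bins that actually remain should be used as the domain size rather than `q`.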

In [245]:
continuous_feature_columns = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

continuous_data = raw_data[continuous_feature_columns].copy()
for feature in continuous_feature_columns:
    continuous_data.loc[:,feature] = pd.qcut(continuous_data[feature], q=10, labels=False, duplicates="drop")

print(continuous_data.head())

continuous_data = continuous_data.to_numpy()

   age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
0    5       1              4             0             0               2
1    7       1              4             0             0               0
2    5       6              1             0             0               2
3    8       7              0             0             0               2
4    2       9              4             0             0               2


### Applying LDP to binned continuous data 
Now that the data is separated into bins, we can apply the pure-ldp frequency oracle to see whether it estimates the bin frequencies correctly.

In [250]:
# Create fresh client and server instances of the local hashing frequency oracle.
epsilon = 4
d = 10  # the first column (age) is split into 10 bins (0-9)
client_olh = LHClient(epsilon=epsilon, d=d, use_olh=True)
server_olh = LHServer(epsilon=epsilon, d=d, use_olh=True)

for item in continuous_data:
    # Simulate client-side privatisation of column 0 (age);
    # bin indices are shifted by 1 because the oracle expects items in 1..d
    priv_data = client_olh.privatise(item[0]+1)

    # Simulate server-side aggregation
    server_olh.aggregate(priv_data)

# Count how often each bin index occurs
unique, counts = np.unique(continuous_data[:,0], return_counts=True)

print("True counts per bin: ", dict(zip(unique, counts)))
print("OLH estimates per bin: ",{i : server_olh.estimate(i+1) for i in range(d)})

True counts per bin:  {0: 3895, 1: 3301, 2: 3376, 3: 2591, 4: 3518, 5: 3245, 6: 3008, 7: 3167, 8: 3461, 9: 2999}
OLH estimates per bin:  {0: 3921.5407995161686, 1: 3305.2715867555244, 2: 3407.289125354415, 3: 2557.836967224878, 4: 3457.2568993620343, 5: 3459.338889945686, 6: 3059.5966978847273, 7: 3194.9260858220305, 8: 3378.1412571833034, 9: 2932.5952722820266}
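
To judge the quality of the estimates more systematically than by eye, the absolute and relative error per bin can be computed. The sketch below uses the true counts and the estimates from the output above, with the estimates rounded to two decimals; the variable names are chosen for illustration.

```python
import numpy as np

# True bin counts and OLH estimates, copied from the output above
# (estimates rounded to 2 decimals)
true_counts = np.array([3895, 3301, 3376, 2591, 3518, 3245, 3008, 3167, 3461, 2999])
estimates = np.array([3921.54, 3305.27, 3407.29, 2557.84, 3457.26,
                      3459.34, 3059.60, 3194.93, 3378.14, 2932.60])

abs_error = np.abs(estimates - true_counts)
rel_error = abs_error / true_counts

worst_bin = int(np.argmax(abs_error))
print(f"mean absolute error: {abs_error.mean():.1f}")
print(f"largest error: bin {worst_bin}, "
      f"{abs_error[worst_bin]:.1f} ({rel_error[worst_bin]:.1%})")
```

The same comparison can be repeated for other values of `epsilon` to see how the error changes with the privacy budget.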
