# Using ML anonymization on one-hot encoded data

In this tutorial we will show how to anonymize models using the ML anonymization module, specifically when the input data is already one-hot encoded. 

This will be demonstrated using the Adult dataset (the original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/adult).

## Load data

In [282]:
import numpy as np

import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from apt.utils.dataset_utils import get_adult_dataset_pd

# 'workclass', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'
categorical_features = [1, 3, 4, 5, 6, 7, 11]

# requires a folder called 'datasets' in the current directory
(x_train, y_train), (x_test, y_test) = get_adult_dataset_pd()
x_train = x_train.to_numpy()[:, [1, 3, 4, 5, 6, 7, 11]]
y_train = y_train.to_numpy().astype(int)
x_test = x_test.to_numpy()[:, [1, 3, 4, 5, 6, 7, 11]]
y_test = y_test.to_numpy().astype(int)

# balance the train and test set sizes by truncating the training set
x_train = x_train[:x_test.shape[0]]
y_train = y_train[:y_test.shape[0]]

print(x_train)

[['State-gov' 'Never-married' 'Adm-clerical' ... 'White' 'Male'
  'UnitedStates']
 ['Self-emp-not-inc' 'Married-civ-spouse' 'Exec-managerial' ... 'White'
  'Male' 'UnitedStates']
 ['Private' 'Divorced' 'Handlers-cleaners' ... 'White' 'Male'
  'UnitedStates']
 ...
 ['Private' 'Never-married' 'Sales' ... 'White' 'Female' 'UnitedStates']
 ['Private' 'Never-married' 'Craft-repair' ... 'White' 'Male'
  'UnitedStates']
 ['Private' 'Never-married' 'Handlers-cleaners' ... 'White' 'Male'
  'UnitedStates']]


## Encode data

In [283]:
from sklearn.preprocessing import OneHotEncoder
import scipy.sparse

preprocessor = OneHotEncoder(handle_unknown="ignore")

x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)
if scipy.sparse.issparse(x_train):
    x_train = x_train.toarray().astype(int)
if scipy.sparse.issparse(x_test):
    x_test = x_test.toarray().astype(int)

print(x_train)

[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]]
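
After one-hot encoding, each original feature occupies a contiguous block of columns. The sketch below shows one way to recover those blocks from the encoder's `categories_` attribute, which is useful when column indices have to be grouped per original feature. The toy data and the name `feature_slices` are illustrative assumptions, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (hypothetical values): three features, three rows
toy = np.array([
    ["Private", "White", "Male"],
    ["State-gov", "Black", "Female"],
    ["Private", "Other", "Female"],
], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy).toarray().astype(int)

# categories_ holds one array of category labels per input feature, in feature order,
# so the block sizes give the column range that each feature occupies
sizes = [len(cats) for cats in encoder.categories_]
starts = np.cumsum([0] + sizes[:-1])
feature_slices = [list(range(int(s), int(s) + n)) for s, n in zip(starts, sizes)]

print(feature_slices)
```

Each inner list holds the encoded column indices that belong to one original feature.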


## Train decision tree model

In [284]:
from sklearn.tree import DecisionTreeClassifier
from art.estimators.classification.scikitlearn import ScikitlearnDecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x_train, y_train)

art_classifier = ScikitlearnDecisionTreeClassifier(model)

print('Base model accuracy: ', model.score(x_test, y_test))

Base model accuracy:  0.8147533935261961


# Anonymize data
## k=100

The data is anonymized on the quasi-identifiers race and sex, with a privacy parameter k=100.

This means that each record in the anonymized dataset is identical to at least 99 others on the quasi-identifier values (i.e., when looking only at those features, the records are indistinguishable).
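
As a minimal sketch of this property, the snippet below groups rows by their quasi-identifier values and checks that the smallest group has at least k members. The helper `smallest_group` and the toy array are hypothetical and only illustrate the definition; they are not part of the anonymization module.

```python
import numpy as np

def smallest_group(data, qi_columns):
    """Size of the smallest group of rows that share the same quasi-identifier values."""
    _, counts = np.unique(data[:, qi_columns], axis=0, return_counts=True)
    return int(counts.min())

# Toy data: columns 0 and 1 are the quasi-identifiers, column 2 is another feature
toy = np.array([
    [1, 0, 7],
    [1, 0, 8],
    [0, 1, 7],
    [0, 1, 9],
    [0, 1, 9],
])

k = 2
is_k_anonymous = smallest_group(toy, qi_columns=[0, 1]) >= k
print(is_k_anonymous)
```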

## l = 6
The data is anonymized on the sensitive attributes workclass and occupation, with a privacy parameter l=6.

This means that each group must contain at least 6 distinct values of the sensitive attributes; otherwise the sensitive attributes in that group are suppressed.
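
The check can be sketched as follows: for every quasi-identifier group, count the distinct sensitive values and flag the groups that fall below l. This is an illustration on toy data with a single sensitive column; the helper name `distinct_values_per_group` is an assumption and the module's own implementation may differ.

```python
from collections import defaultdict

import numpy as np

def distinct_values_per_group(data, qi_columns, sensitive_column):
    """Map each quasi-identifier combination to its number of distinct sensitive values."""
    seen = defaultdict(set)
    for row in data:
        key = tuple(int(v) for v in row[qi_columns])
        seen[key].add(int(row[sensitive_column]))
    return {key: len(values) for key, values in seen.items()}

# Toy data: columns 0 and 1 are the quasi-identifiers, column 2 is the sensitive attribute
toy = np.array([
    [1, 0, 7],
    [1, 0, 8],
    [0, 1, 7],
    [0, 1, 9],
    [0, 1, 9],
])

l = 2
diversity = distinct_values_per_group(toy, qi_columns=[0, 1], sensitive_column=2)
# Groups with fewer than l distinct sensitive values would need suppression
groups_to_suppress = [key for key, n in diversity.items() if n < l]
print(diversity, groups_to_suppress)
```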

## t = 0.1
The data is anonymized on the sensitive attributes workclass and occupation, with a privacy parameter t=0.1.

This means that a group satisfies t-closeness if the distance between the distribution of a sensitive attribute in that group and the distribution of the attribute in the whole table is no more than the threshold 0.1.
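
To make the comparison concrete, here is a small sketch that measures how far a group's distribution of a sensitive attribute is from the distribution in the whole table. The distance metric is not stated in this tutorial, so total variation distance is used here purely as an assumed stand-in; the helper names and toy values are hypothetical.

```python
import numpy as np

def distribution(values, categories):
    """Relative frequency of each category in values."""
    counts = np.array([np.sum(values == c) for c in categories], dtype=float)
    return counts / counts.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions (assumed metric)."""
    return 0.5 * np.abs(p - q).sum()

# Toy sensitive attribute for a table of 8 records
sensitive = np.array(["a", "a", "a", "b", "b", "b", "a", "b"])
categories = np.unique(sensitive)
overall = distribution(sensitive, categories)

t = 0.1
skewed_group = sensitive[:4]              # a, a, a, b
even_group = sensitive[[0, 1, 3, 4]]      # a, a, b, b

skewed_distance = total_variation(distribution(skewed_group, categories), overall)
even_distance = total_variation(distribution(even_group, categories), overall)
print(skewed_distance <= t, even_distance <= t)
```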

In [285]:
from apt.utils.datasets import ArrayDataset
from apt.anonymization import Anonymize

x_train_predictions = np.array([np.argmax(arr) for arr in art_classifier.predict(x_train)])

# QI = (race, sex)
QI = [53, 52, 51, 50, 49, 48, 47]
QI_slices = [[47, 48, 49, 50, 51], [52, 53]]

# Sensitive attributes
sensitive_attributes = [1, 4]
# Anonymizers 
anonymizer = Anonymize(100, QI, quasi_identifer_slices=QI_slices)
anonymizer_l = Anonymize(100, QI, quasi_identifer_slices=QI_slices, l=6, sensitive_attributes=sensitive_attributes)
anonymizer_lt = Anonymize(100, QI, quasi_identifer_slices=QI_slices, l=6, t=0.1, sensitive_attributes=sensitive_attributes)
anonymizer_t = Anonymize(100, QI, quasi_identifer_slices=QI_slices, t=0.1, sensitive_attributes=sensitive_attributes)
anon = anonymizer.anonymize(ArrayDataset(x_train, x_train_predictions))
anon_l = anonymizer_l.anonymize(ArrayDataset(x_train, x_train_predictions))
anon_lt = anonymizer_lt.anonymize(ArrayDataset(x_train, x_train_predictions))
anon_t = anonymizer_t.anonymize(ArrayDataset(x_train, x_train_predictions))

In [286]:
# number of distinct rows in original data
print("Unique rows: ", len(np.unique(x_train, axis=0)))

suppressed = np.sum(x_train[sensitive_attributes] == -99)
total = np.prod(x_train.shape)
print("Number of rows with the sensitive attributes suppressed", suppressed)

Unique rows:  2711
Number of rows with the sensitive attributes suppressed 0
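
When counting suppressed values it helps to keep NumPy's indexing rules in mind: `x[idx]` selects rows, while `x[:, idx]` selects columns. The sketch below contrasts the two forms on a toy array and also shows how to count rows rather than cells. The marker `-99` is taken from the cell above; the toy array and variable names are illustrative assumptions.

```python
import numpy as np

SUPPRESSED = -99  # suppression marker, as used in the cell above

# Toy data: columns 1 and 4 play the role of sensitive attributes
toy = np.array([
    [0, -99, 1, 0, -99],
    [0,   1, 0, 0,   0],
    [1,   0, 0, 0,   1],
    [0, -99, 0, 1, -99],
    [0,   0, 1, 0,   0],
    [0,   0, 1, 0,   0],
])
sensitive = [1, 4]

# toy[sensitive] picks ROWS 1 and 4 and inspects all of their columns
cells_in_selected_rows = int(np.sum(toy[sensitive] == SUPPRESSED))

# toy[:, sensitive] picks COLUMNS 1 and 4 across every row
cells_in_sensitive_columns = int(np.sum(toy[:, sensitive] == SUPPRESSED))

# Number of rows in which at least one sensitive column is suppressed
rows_with_suppression = int(np.sum(np.any(toy[:, sensitive] == SUPPRESSED, axis=1)))

print(cells_in_selected_rows, cells_in_sensitive_columns, rows_with_suppression)
```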


#### Using k-anonymity

In [287]:
# number of distinct rows in anonymized data (with k-anonymity)
print("Unique rows: ", len(np.unique(anon, axis=0)))

suppressed = np.sum(anon[sensitive_attributes] == -99)
total = len(anon)
print("Number of rows with the sensitive attributes suppressed", suppressed)

Unique rows:  2476
Number of rows with the sensitive attributes suppressed 0


#### Using k-anonymity and l-diversity

In [288]:
# number of distinct rows in anonymized data (with k-anonymity and l-diversity)
print("Unique rows: ", len(np.unique(anon_l, axis=0)))

suppressed = np.sum(anon_l[sensitive_attributes] == -99)
total = len(anon_l)
print("Number of rows with the sensitive attributes suppressed", suppressed)

Unique rows:  2323
Number of rows with the sensitive attributes suppressed 4


#### Using k-anonymity and t-closeness

In [289]:
# number of distinct rows in anonymized data (with k-anonymity and t-closeness)
print("Unique rows: ", len(np.unique(anon_t, axis=0)))

suppressed = np.sum(anon_t[sensitive_attributes] == -99)
total = len(anon_t)
print("Number of rows with the sensitive attributes suppressed", suppressed)

Unique rows:  2352
Number of rows with the sensitive attributes suppressed 4


#### Using k-anonymity, l-diversity and t-closeness

In [290]:
# number of distinct rows in anonymized data (with k-anonymity, l-diversity and t-closeness)
print("Unique rows: ", len(np.unique(anon_lt, axis=0)))
suppressed = np.sum(anon_lt[sensitive_attributes] == -99)
total = len(anon_lt)
print("Number of rows with the sensitive attributes suppressed", suppressed)

Unique rows:  2323
Number of rows with the sensitive attributes suppressed 4


## Train decision tree model on anonymized data

In [291]:
# baseline for comparison: retrain on the original (non-anonymized) training data
anon_model = DecisionTreeClassifier()
anon_model.fit(x_train, y_train)

anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

print('Anonymized model accuracy: ', anon_model.score(x_test, y_test))

Anonymized model accuracy:  0.8139549167741539


#### Using k-anonymity

In [292]:
anon_model = DecisionTreeClassifier()
anon_model.fit(anon, y_train)

anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

print('Anonymized model accuracy: ', anon_model.score(x_test, y_test))

Anonymized model accuracy:  0.8127879122903998


#### Using k-anonymity and l-diversity

In [293]:
anon_model = DecisionTreeClassifier()
anon_model.fit(anon_l, y_train)

anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

print('Anonymized model accuracy: ', anon_model.score(x_test, y_test))

Anonymized model accuracy:  0.8122965419814507


#### Using k-anonymity and t-closeness

In [294]:
anon_model = DecisionTreeClassifier()
anon_model.fit(anon_t, y_train)

anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

print('Anonymized model accuracy: ', anon_model.score(x_test, y_test))

Anonymized model accuracy:  0.786315336895768


#### Using k-anonymity, l-diversity and t-closeness

In [295]:
anon_model = DecisionTreeClassifier()
anon_model.fit(anon_lt, y_train)

anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

print('Anonymized model accuracy: ', anon_model.score(x_test, y_test))

Anonymized model accuracy:  0.8125422271359253
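
To compare the runs at a glance, the accuracies printed above can be tabulated against the base model. The sketch below only restates the numbers reported in this notebook; the dictionary keys and variable names are illustrative, and a fresh run will give slightly different values because the decision tree is not seeded.

```python
# Accuracies copied from the cell outputs in this notebook
accuracies = {
    "original data": 0.8147533935261961,
    "k-anonymity": 0.8127879122903998,
    "k-anonymity + l-diversity": 0.8122965419814507,
    "k-anonymity + t-closeness": 0.786315336895768,
    "k-anonymity + l-diversity + t-closeness": 0.8125422271359253,
}

baseline = accuracies["original data"]

# Accuracy lost relative to the model trained on the original data
drops = {
    name: baseline - acc
    for name, acc in accuracies.items()
    if name != "original data"
}

# Print the settings from smallest to largest accuracy drop
for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: -{drop:.4f}")

largest_drop = max(drops, key=drops.get)
```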
