# Using ML anonymization on one-hot encoded data

In this tutorial we will show how to anonymize models using the ML anonymization module, specifically when the input data is already one-hot encoded.

This will be demonstrated using the Adult dataset (the original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/adult).

## Load data

In [21]:
import numpy as np

import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from apt.utils.dataset_utils import get_adult_dataset_pd

# column indices of the categorical features:
# 'workclass', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'
categorical_features = [1, 3, 4, 5, 6, 7, 11]

# requires a folder called 'datasets' in the current directory
(x_train, y_train), (x_test, y_test) = get_adult_dataset_pd()
# keep only the categorical columns
x_train = x_train.to_numpy()[:, categorical_features]
y_train = y_train.to_numpy().astype(int)
x_test = x_test.to_numpy()[:, categorical_features]
y_test = y_test.to_numpy().astype(int)

# truncate the training set to the same number of rows as the test set
x_train = x_train[:x_test.shape[0]]
y_train = y_train[:y_test.shape[0]]

print(x_train)

[['State-gov' 'Never-married' 'Adm-clerical' ... 'White' 'Male'
  'UnitedStates']
 ['Self-emp-not-inc' 'Married-civ-spouse' 'Exec-managerial' ... 'White'
  'Male' 'UnitedStates']
 ['Private' 'Divorced' 'Handlers-cleaners' ... 'White' 'Male'
  'UnitedStates']
 ...
 ['Private' 'Never-married' 'Sales' ... 'White' 'Female' 'UnitedStates']
 ['Private' 'Never-married' 'Craft-repair' ... 'White' 'Male'
  'UnitedStates']
 ['Private' 'Never-married' 'Handlers-cleaners' ... 'White' 'Male'
  'UnitedStates']]


## Encode data

In [22]:
from sklearn.preprocessing import OneHotEncoder
import scipy.sparse  # import the submodule explicitly so scipy.sparse.issparse is available

preprocessor = OneHotEncoder(handle_unknown="ignore")

x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)
if scipy.sparse.issparse(x_train):
    x_train = x_train.toarray().astype(int)
if scipy.sparse.issparse(x_test):
    x_test = x_test.toarray().astype(int)

print(x_train)

[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]]
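
After encoding, each original feature occupies a contiguous block of columns, one column per category. The sketch below shows one way to work out which column indices belong to which feature, given the number of categories per feature. The helper name `one_hot_slices` and the toy category counts are illustrative assumptions, not part of the toolkit.

```python
from itertools import accumulate


def one_hot_slices(cardinalities):
    """Return, for each original feature, the one-hot column indices it occupies.

    `cardinalities` is the number of categories of each feature, in column order.
    """
    # Running totals give the end offset of each feature's block of columns.
    ends = list(accumulate(cardinalities))
    starts = [0] + ends[:-1]
    return [list(range(start, end)) for start, end in zip(starts, ends)]


# Toy example (hypothetical sizes): three features with 3, 2 and 4 categories.
slices = one_hot_slices([3, 2, 4])
print(slices)  # [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
```

With the encoder fitted above, the category counts could be read from `[len(c) for c in preprocessor.categories_]`, assuming the default settings, where output columns follow the input feature order.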


## Train decision tree model

In [23]:
from sklearn.tree import DecisionTreeClassifier
from art.estimators.classification.scikitlearn import ScikitlearnDecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x_train, y_train)

art_classifier = ScikitlearnDecisionTreeClassifier(model)

print('Base model accuracy: ', model.score(x_test, y_test))

Base model accuracy:  0.814446287083103




# Anonymize data
## k=100

The data is anonymized on the quasi-identifiers race and sex (one-hot columns 47 to 53) with a privacy parameter k=100.

This means that each record in the anonymized dataset is identical to at least 99 others on the quasi-identifier values (i.e., when looking only at those features, the records are indistinguishable).
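
To make the k-anonymity property concrete, here is a minimal sketch of how it could be checked: group the rows by their quasi-identifier values and look at the size of the smallest group. The helper names and the toy data are assumptions made for illustration; this is not how the anonymization module itself works.

```python
from collections import Counter


def smallest_group_size(rows, qi_columns):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[i] for i in qi_columns) for row in rows)
    return min(groups.values())


def is_k_anonymous(rows, qi_columns, k):
    """True if every row shares its quasi-identifier values with at least k-1 others."""
    return smallest_group_size(rows, qi_columns) >= k


# Toy data: columns 1 and 2 are the quasi-identifiers, column 0 is not.
toy = [
    [7, 0, 1],
    [3, 0, 1],
    [5, 1, 0],
    [9, 1, 0],
    [2, 1, 0],
]
print(smallest_group_size(toy, [1, 2]))  # 2
print(is_k_anonymous(toy, [1, 2], 2))    # True
print(is_k_anonymous(toy, [1, 2], 3))    # False
```

The same helpers accept a NumPy array, so a call such as `smallest_group_size(anon, QI)` could be used on the anonymized data produced in the next cell.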

In [25]:
from apt.utils.datasets import ArrayDataset
from apt.anonymization import Anonymize

# predicted class label for each training row (argmax over the class probabilities)
x_train_predictions = np.argmax(art_classifier.predict(x_train), axis=1)

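# quasi-identifiers: the one-hot columns for race (47-51) and sex (52-53)
QI = [53, 52, 51, 50, 49, 48, 47]
# the same columns, grouped by the original feature they encode
QI_slices = [[47, 48, 49, 50, 51], [52, 53]]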
anonymizer = Anonymize(100, QI)
anon = anonymizer.anonymize(ArrayDataset(x_train, x_train_predictions))
print(anon)

[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]]
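
Because the quasi-identifiers here are one-hot columns, it can be useful to confirm that each anonymized row is still a valid one-hot encoding, i.e. that exactly one column is active within each feature's slice. The check below is a sketch under that assumption. The helper name `invalid_one_hot_rows` is hypothetical and the toy rows are made up for illustration.

```python
def invalid_one_hot_rows(rows, slices):
    """Return the indices of rows where some slice does not contain exactly one 1."""
    bad = []
    for idx, row in enumerate(rows):
        for cols in slices:
            values = [row[c] for c in cols]
            # A valid one-hot group holds only 0s and 1s and sums to exactly 1.
            if sum(values) != 1 or any(v not in (0, 1) for v in values):
                bad.append(idx)
                break
    return bad


toy_rows = [
    [1, 0, 0, 0, 1],  # valid in both slices
    [0, 0, 0, 1, 0],  # first slice has no active column
    [0, 1, 0, 1, 1],  # second slice has two active columns
]
print(invalid_one_hot_rows(toy_rows, [[0, 1, 2], [3, 4]]))  # [1, 2]
```

On the data above, `invalid_one_hot_rows(anon, QI_slices)` would list any anonymized rows whose race or sex columns no longer form a valid encoding. An empty list means every row passed.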


In [26]:
# number of distinct rows in original data
len(np.unique(x_train, axis=0))

2711

In [27]:
# number of distinct rows in anonymized data
len(np.unique(anon, axis=0))

2476

## Train decision tree model on anonymized data

In [28]:
anon_model = DecisionTreeClassifier()
anon_model.fit(anon, y_train)

anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

print('Anonymized model accuracy: ', anon_model.score(x_test, y_test))

Anonymized model accuracy:  0.8135863890424421


