# Using ML anonymization to defend against attribute inference attacks

In this tutorial we will show how to anonymize models using the ML anonymization module. 

We will demonstrate running inference attacks both on a vanilla model, and then on different anonymized versions of the model. We will run both black-box and white-box attribute inference attacks using ART's inference module (https://github.com/Trusted-AI/adversarial-robustness-toolbox/tree/main/art/attacks/inference). 

This will be demonstarted using the Nursery dataset (original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/nursery). 

The sensitive feature we are trying to infer is the 'social' feature, after turning it into a binary feature (the original value 'problematic' receives the new value 1 and the rest 0). We also preprocess the data such that all categorical features are one-hot encoded.

## Load data

In [1]:
import os
import sys
sys.path.insert(0, os.path.abspath('..'))

from apt.utils.dataset_utils import get_nursery_dataset

(x_train, y_train), (x_test, y_test) = get_nursery_dataset(transform_social=True)

x_train

Unnamed: 0,parents,has_nurs,form,children,housing,finance,social,health
8450,pretentious,very_crit,foster,1,less_conv,convenient,1,not_recom
12147,great_pret,very_crit,complete,1,critical,inconv,1,recommended
2780,usual,critical,complete,4,less_conv,convenient,1,not_recom
11924,great_pret,critical,foster,1,critical,convenient,1,not_recom
59,usual,proper,complete,2,convenient,convenient,0,not_recom
...,...,...,...,...,...,...,...,...
5193,pretentious,less_proper,complete,1,convenient,inconv,0,recommended
1375,usual,less_proper,incomplete,2,less_conv,convenient,1,priority
10318,great_pret,less_proper,foster,4,convenient,convenient,0,priority
6396,pretentious,improper,completed,3,less_conv,convenient,1,recommended


## Train decision tree model

In [2]:
from sklearn.tree import DecisionTreeClassifier
from art.estimators.classification.scikitlearn import ScikitlearnDecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder

x_train_str = x_train.astype(str)
train_encoded = OneHotEncoder(sparse=False).fit_transform(x_train_str)
x_test_str = x_test.astype(str)
test_encoded = OneHotEncoder(sparse=False).fit_transform(x_test_str)
    
model = DecisionTreeClassifier()
model.fit(train_encoded, y_train)

art_classifier = ScikitlearnDecisionTreeClassifier(model)

print('Base model accuracy: ', model.score(test_encoded, y_test))

Base model accuracy:  0.9969135802469136


## Attack
### Black-box attack
The black-box attack basically trains an additional classifier (called the attack model) to predict the attacked feature's value from the remaining n-1 features as well as the original (attacked) model's predictions.
#### Train attack model

In [3]:
import numpy as np
from art.attacks.inference.attribute_inference import AttributeInferenceBlackBox

attack_feature = 20

# training data without attacked feature
x_train_for_attack = np.delete(train_encoded, attack_feature, 1)
# only attacked feature
x_train_feature = train_encoded[:, attack_feature].copy().reshape(-1, 1)

bb_attack = AttributeInferenceBlackBox(art_classifier, attack_feature=attack_feature)

# get original model's predictions
x_train_predictions = np.array([np.argmax(arr) for arr in art_classifier.predict(train_encoded)]).reshape(-1,1)

# use half of training set for training the attack
attack_train_ratio = 0.5
attack_train_size = int(len(train_encoded) * attack_train_ratio)

# train attack model
bb_attack.fit(train_encoded[:attack_train_size])

#### Infer sensitive feature and check accuracy

In [4]:
# get inferred values
values=[0, 1]

inferred_train_bb = bb_attack.infer(x_train_for_attack[attack_train_size:], x_train_predictions[attack_train_size:], values=values)
# check accuracy
train_acc = np.sum(inferred_train_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_bb)
print(train_acc)

1.0


This means that for 64% of the training set, the attacked feature is inferred correctly using this attack.

## Whitebox attack
This attack does not train any additional model, it simply uses additional information coded within the attacked decision tree model to compute the probability of each value of the attacked feature and outputs the value with the highest probability.

In [5]:
from art.attacks.inference.attribute_inference import AttributeInferenceWhiteBoxDecisionTree

priors = [6925 / 10366, 3441 / 10366]

wb2_attack = AttributeInferenceWhiteBoxDecisionTree(art_classifier, attack_feature=attack_feature)

# get inferred values
inferred_train_wb2 = wb2_attack.infer(x_train_for_attack, x_train_predictions, values=values, priors=priors)

# check accuracy
train_acc = np.sum(inferred_train_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_wb2)
print(train_acc)

0.5122515917422342


The white-box attack is able to correctly infer the attacked feature value in 69% of the training set. 

# Anonymized data
## k=100

Now we will apply the same attacks on an anonymized version of the same dataset (k=100). The data is anonymized on the quasi-identifiers: finance, social, health.

k=100 means that each record in the anonymized dataset is identical to 99 others on the quasi-identifier values (i.e., when looking only at those 3 feature, the records are indistinguishable).

In [6]:
from apt.utils.datasets import ArrayDataset
from apt.anonymization import Anonymize

features = x_train.columns
QI = ["finance", "social", "health"]
categorical_features = ["parents", "has_nurs", "form", "housing", "finance", "health", 'children']
QI_indexes = [i for i, v in enumerate(features) if v in QI]
categorical_features_indexes = [i for i, v in enumerate(features) if v in categorical_features]
anonymizer = Anonymize(100, QI_indexes, categorical_features=categorical_features_indexes)
anon = anonymizer.anonymize(ArrayDataset(x_train, x_train_predictions))
anon


Unnamed: 0,parents,has_nurs,form,children,housing,finance,social,health
0,pretentious,very_crit,foster,1,less_conv,convenient,0,not_recom
1,great_pret,very_crit,complete,1,critical,inconv,1,recommended
2,usual,critical,complete,4,less_conv,convenient,0,not_recom
3,great_pret,critical,foster,1,critical,convenient,0,not_recom
4,usual,proper,complete,2,convenient,convenient,0,not_recom
...,...,...,...,...,...,...,...,...
10361,pretentious,less_proper,complete,1,convenient,inconv,0,recommended
10362,usual,less_proper,incomplete,2,less_conv,convenient,1,priority
10363,great_pret,less_proper,foster,4,convenient,convenient,0,priority
10364,pretentious,improper,completed,3,less_conv,convenient,1,recommended


In [7]:
# number of distinct rows in original data
len(x_train.drop_duplicates())

7585

In [8]:
# number of distinct rows in anonymized data
len(anon.drop_duplicates())

5766

## Train decision tree model

In [9]:
anon_str = anon.astype(str)
anon_encoded = OneHotEncoder(sparse=False).fit_transform(anon_str)

anon_model = DecisionTreeClassifier()
anon_model.fit(anon_encoded, y_train)

anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

print('Anonymized model accuracy: ', anon_model.score(test_encoded, y_test))

Anonymized model accuracy:  0.9976851851851852


## Attack
### Black-box attack

In [10]:
anon_bb_attack = AttributeInferenceBlackBox(anon_art_classifier, attack_feature=attack_feature)

# get original model's predictions
anon_x_train_predictions = np.array([np.argmax(arr) for arr in anon_art_classifier.predict(train_encoded)]).reshape(-1,1)

# train attack model
anon_bb_attack.fit(train_encoded[:attack_train_size])

# get inferred values
inferred_train_anon_bb = anon_bb_attack.infer(x_train_for_attack[attack_train_size:], anon_x_train_predictions[attack_train_size:], values=values)
# check accuracy
train_acc = np.sum(inferred_train_anon_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_anon_bb)
print(train_acc)

1.0


### White box attack

In [11]:
anon_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon_art_classifier, attack_feature=attack_feature)

# get inferred values
inferred_train_anon_wb2 = anon_wb2_attack.infer(x_train_for_attack, anon_x_train_predictions, values=values, priors=priors)

# check accuracy
anon_train_acc = np.sum(inferred_train_anon_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_anon_wb2)
print(anon_train_acc)

0.5245996527107852


The accuracy of the attacks remains more or less the same. Let's check the precision and recall for each case:

In [12]:
def calc_precision_recall(predicted, actual, positive_value=1):
    score = 0  # both predicted and actual are positive
    num_positive_predicted = 0  # predicted positive
    num_positive_actual = 0  # actual positive
    for i in range(len(predicted)):
        if predicted[i] == positive_value:
            num_positive_predicted += 1
        if actual[i] == positive_value:
            num_positive_actual += 1
        if predicted[i] == actual[i]:
            if predicted[i] == positive_value:
                score += 1
    
    if num_positive_predicted == 0:
        precision = 1
    else:
        precision = score / num_positive_predicted  # the fraction of predicted “Yes” responses that are correct
    if num_positive_actual == 0:
        recall = 1
    else:
        recall = score / num_positive_actual  # the fraction of “Yes” responses that are predicted correctly

    return precision, recall
    
# black-box regular
print(calc_precision_recall(inferred_train_bb, x_train_feature))
# black-box anonymized
print(calc_precision_recall(inferred_train_anon_bb, x_train_feature))

(0.49415432579890883, 0.48976438779451525)
(0.49415432579890883, 0.48976438779451525)


In [13]:
# white-box regular
print(calc_precision_recall(inferred_train_wb2, x_train_feature))
# white-box anonymized
print(calc_precision_recall(inferred_train_anon_wb2, x_train_feature))

(1.0, 0.019204655674102813)
(0.9829787234042553, 0.04481086323957323)


Precision and recall remain almost the same, sometimes even slightly increasing.

Now let's see what happens when we increase k to 1000.

## k=1000

Now we apply the attacks on an anonymized version of the same dataset (k=1000). The data has been anonymized on the quasi-identifiers: finance, social, health.

In [14]:
anonymizer2 = Anonymize(1000, QI_indexes, categorical_features=categorical_features_indexes)
anon2 = anonymizer2.anonymize(ArrayDataset(x_train, x_train_predictions))

In [15]:
# number of distinct rows in anonymized data
len(anon2.drop_duplicates())

4226

## Train decision tree model

In [16]:
anon2_str = anon2.astype(str)
anon2_encoded = OneHotEncoder(sparse=False).fit_transform(anon2_str)

anon2_model = DecisionTreeClassifier()
anon2_model.fit(anon2_encoded, y_train)

anon2_art_classifier = ScikitlearnDecisionTreeClassifier(anon2_model)

print('Anonymized model accuracy: ', anon2_model.score(test_encoded, y_test))

Anonymized model accuracy:  0.9930555555555556


## Attack
### Black-box attack

In [17]:
anon2_bb_attack = AttributeInferenceBlackBox(anon2_art_classifier, attack_feature=attack_feature)

# get original model's predictions
anon2_x_train_predictions = np.array([np.argmax(arr) for arr in anon2_art_classifier.predict(train_encoded)]).reshape(-1,1)

# train attack model
anon2_bb_attack.fit(train_encoded[:attack_train_size])

# get inferred values
inferred_train_anon2_bb = anon2_bb_attack.infer(x_train_for_attack[attack_train_size:], anon2_x_train_predictions[attack_train_size:], values=values)
# check accuracy
train_acc = np.sum(inferred_train_anon2_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_anon2_bb)
print(train_acc)

1.0


### White box attack

In [18]:
anon2_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon2_art_classifier, attack_feature=attack_feature)

# get inferred values
inferred_train_anon2_wb2 = anon2_wb2_attack.infer(x_train_for_attack, anon2_x_train_predictions, values=values, priors=priors)

# check accuracy
train_acc = np.sum(inferred_train_anon2_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_anon_wb2)
print(train_acc)

0.515820953115956


In [19]:
# black-box regular
print(calc_precision_recall(inferred_train_bb, x_train_feature))
# black-box anonymized
print(calc_precision_recall(inferred_train_anon2_bb, x_train_feature))

# white-box regular
print(calc_precision_recall(inferred_train_wb2, x_train_feature))
# white-box anonymized
print(calc_precision_recall(inferred_train_anon2_wb2, x_train_feature))

(0.49415432579890883, 0.48976438779451525)
(0.49415432579890883, 0.48976438779451525)
(1.0, 0.019204655674102813)
(1.0, 0.026382153249272552)


The accuracy of the black-box attack is slightly reduced, as well as the precision and recall in both attacks.

## k=100, all QI
Now let's see what happens if we define all 8 features in the Nursery dataset as quasi-identifiers.

In [20]:
QI2 = ["parents", "has_nurs", "form", "children", "housing", "finance", "social", "health"]
QI2_indexes = [i for i, v in enumerate(features) if v in QI2]
anonymizer3 = Anonymize(100, QI2_indexes, categorical_features=categorical_features_indexes)
anon3 = anonymizer3.anonymize(ArrayDataset(x_train, x_train_predictions))

In [21]:
# number of distinct rows in anonymized data
len(anon3.drop_duplicates())

39

In [22]:
anon3_str = anon3.astype(str)
anon3_encoded = OneHotEncoder(sparse=False).fit_transform(anon3_str)

anon3_model = DecisionTreeClassifier()
anon3_model.fit(anon3_encoded, y_train)

anon3_art_classifier = ScikitlearnDecisionTreeClassifier(anon3_model)

print('Anonymized model accuracy: ', anon3_model.score(test_encoded, y_test))

anon3_bb_attack = AttributeInferenceBlackBox(anon3_art_classifier, attack_feature=attack_feature)

# get original model's predictions
anon3_x_train_predictions = np.array([np.argmax(arr) for arr in anon3_art_classifier.predict(train_encoded)]).reshape(-1,1)

# train attack model
anon3_bb_attack.fit(train_encoded[:attack_train_size])

# get inferred values
inferred_train_anon3_bb = anon3_bb_attack.infer(x_train_for_attack[attack_train_size:], anon3_x_train_predictions[attack_train_size:], values=values)
# check accuracy
train_acc = np.sum(inferred_train_anon3_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_anon2_bb)
print('BB attack accuracy: ', train_acc)

anon3_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon3_art_classifier, attack_feature=attack_feature)

# get inferred values
inferred_train_anon3_wb2 = anon3_wb2_attack.infer(x_train_for_attack, anon3_x_train_predictions, values=values, priors=priors)

# check accuracy
train_acc = np.sum(inferred_train_anon3_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_anon_wb2)
print('WB attack accuracy: ', train_acc)

Anonymized model accuracy:  0.751929012345679
BB attack accuracy:  1.0
WB attack accuracy:  0.5187150299054601


In [23]:
# black-box regular
print(calc_precision_recall(inferred_train_bb, x_train_feature))
# black-box anonymized
print(calc_precision_recall(inferred_train_anon3_bb, x_train_feature))

# white-box regular
print(calc_precision_recall(inferred_train_wb2, x_train_feature))
# white-box anonymized
print(calc_precision_recall(inferred_train_anon3_wb2, x_train_feature))

(0.49415432579890883, 0.48976438779451525)
(0.49415432579890883, 0.48976438779451525)
(1.0, 0.019204655674102813)
(1.0, 0.032201745877788554)


Accuracy of both attacks has decreased. Precision and recall remain roughly the same in the black-box case. 

*In the anonymized version of the white-box attack, no records were predicted with the positive value for the attacked feature.