# Using ML anonymization to defend against membership inference attacks

In this tutorial we will show how to anonymize models using the ML anonymization module. 

We will run inference attacks first on a vanilla model and then on models trained on anonymized data. The attack is a black-box membership inference attack from ART's inference module (https://github.com/Trusted-AI/adversarial-robustness-toolbox/tree/main/art/attacks/inference). 

This will be demonstrated using the diabetes regression dataset bundled with scikit-learn (loaded below with `sklearn.datasets.load_diabetes`).

All ten features in this dataset are numerical.

## Load data

In [1]:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

dataset = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=14)

## Train linear regression model

In [2]:
from sklearn.linear_model import LinearRegression
from art.estimators.regression.scikitlearn import ScikitlearnRegressor

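model = LinearRegression()
model.fit(X_train, y_train)

# wrap the fitted model so ART attacks can query it (a regressor, despite the variable name)
art_classifier = ScikitlearnRegressor(model)

print('Base model accuracy (R2 score): ', model.score(X_test, y_test))

# base model predictions on the training set; passed to the anonymizer later as labels
x_train_predictions = art_classifier.predict(X_train)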

Base model accuracy (R2 score):  0.5080618258593721
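
The R2 score printed above is the coefficient of determination, 1 - SS_res / SS_tot. It equals 1 for perfect predictions and 0 for a model that always predicts the mean, and it has no lower bound, which is how the large negative scores for high k in the final output of this notebook can arise. A minimal standard-library sketch of the formula (the helper name `r2` and the toy values are illustrative only):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [1, 2, 3, 4]
print(r2(y_true, [1, 2, 3, 4]))          # perfect predictions: 1.0
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean: 0.0
print(r2(y_true, [4, 3, 2, 1]))          # worse than the mean: -3.0
```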


## Attack
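The black-box attack trains an additional classifier (the attack model) to predict whether a sample was part of the target model's training set. Here the attack model's input is derived from the target model's loss on each sample (`input_type='loss'`).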
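
To make the idea concrete, here is a minimal standard-library sketch of a loss-based membership attack. It is not ART's implementation: the attack model is reduced to a single loss threshold instead of a neural network or random forest, the target model is a one-feature least-squares line, and the data and all helper names are hypothetical.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a * x + b (the 'target model')."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def squared_losses(model, xs, ys):
    """Per-sample squared error: the only signal this attack looks at."""
    a, b = model
    return [(a * x + b - y) ** 2 for x, y in zip(xs, ys)]

def fit_threshold(member_losses, nonmember_losses):
    """Toy attack model: the loss cut-off that best separates the two groups."""
    def accuracy(t):
        hits = sum(l <= t for l in member_losses) + sum(l > t for l in nonmember_losses)
        return hits / (len(member_losses) + len(nonmember_losses))
    return max(sorted(member_losses + nonmember_losses), key=accuracy)

def infer_membership(threshold, losses):
    """Predict 1 (member) for low-loss samples and 0 (non-member) otherwise."""
    return [1 if l <= threshold else 0 for l in losses]

# Synthetic data (hypothetical): y = 3x + noise, split into members and non-members.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + rng.gauss(0, 1) for x in xs]
x_in, y_in, x_out, y_out = xs[:100], ys[:100], xs[100:], ys[100:]

target = fit_line(x_in, y_in)  # trained on the members only

# Fit the attack on the first half of each group, evaluate on the second half.
threshold = fit_threshold(squared_losses(target, x_in[:50], y_in[:50]),
                          squared_losses(target, x_out[:50], y_out[:50]))
pred_in = infer_membership(threshold, squared_losses(target, x_in[50:], y_in[50:]))
pred_out = infer_membership(threshold, squared_losses(target, x_out[50:], y_out[50:]))
attack_acc = (sum(pred_in) + len(pred_out) - sum(pred_out)) / (len(pred_in) + len(pred_out))
print(attack_acc)
```

The split mirrors the cell that follows: half of the members and non-members train the attack, and the other half measures how often it guesses correctly.
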
#### Train attack model

In [3]:
from art.attacks.inference.membership_inference import MembershipInferenceBlackBox

# attack_model_type can be nn (neural network), rf (random forest) or gb (gradient boosting)
bb_attack = MembershipInferenceBlackBox(art_classifier, attack_model_type='nn', input_type='loss')

# use half of each dataset for training the attack
attack_train_ratio = 0.5
attack_train_size = int(len(X_train) * attack_train_ratio)
attack_test_size = int(len(X_test) * attack_train_ratio)

# train attack model
bb_attack.fit(X_train[:attack_train_size], y_train[:attack_train_size],
              X_test[:attack_test_size], y_test[:attack_test_size])

# get inferred values for remaining half
inferred_train_bb = bb_attack.infer(X_train[attack_train_size:], y_train[attack_train_size:])
inferred_test_bb = bb_attack.infer(X_test[attack_test_size:], y_test[attack_test_size:])
# check accuracy
train_acc = np.sum(inferred_train_bb) / len(inferred_train_bb)
test_acc = 1 - (np.sum(inferred_test_bb) / len(inferred_test_bb))
acc = (train_acc * len(inferred_train_bb) + test_acc * len(inferred_test_bb)) / (len(inferred_train_bb) + len(inferred_test_bb))
print(acc)

0.527027027027027


This means that membership is inferred correctly for about 53% of the held-out samples, only slightly better than the 50% expected from random guessing on this balanced evaluation set.
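
The accuracy printed above counts a member as correct when the attack outputs 1 and a non-member as correct when it outputs 0. Assuming the 221/221 train/test split shown earlier, 111 samples from each side are held out for evaluation, so 0.527 corresponds to 117 correct guesses out of 222. A small sketch of the same arithmetic (the helper name and the toy predictions are hypothetical):

```python
def attack_accuracy(inferred_members, inferred_nonmembers):
    """Fraction of samples whose membership status was guessed correctly.

    Both arguments are sequences of 0/1 predictions (1 = "member").
    """
    correct_members = sum(inferred_members)  # members flagged as members
    correct_nonmembers = len(inferred_nonmembers) - sum(inferred_nonmembers)
    total = len(inferred_members) + len(inferred_nonmembers)
    return (correct_members + correct_nonmembers) / total

# Hypothetical predictions for 4 members and 4 non-members.
members = [1, 1, 0, 1]     # 3 of 4 correct
nonmembers = [0, 1, 0, 0]  # 3 of 4 correct
print(attack_accuracy(members, nonmembers))  # 0.75
```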


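## Anonymize the training data

Next we anonymize the training data for several values of k, treating all ten features as quasi-identifiers (QI). For each k we retrain the model on the anonymized data and rerun the attack.

In [4]: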
from apt.utils.datasets import ArrayDataset
from apt.anonymization import Anonymize
k_values=[5, 10, 20, 50, 75]
model_accuracy = []
attack_accuracy = []
unique_values = []

# use all 10 features as quasi-identifiers (QI)
QI = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print('unique rows in original data: ', len(np.unique(X_train, axis=0)))

for k in k_values:
    # anonymize the training features, passing the base model's predictions as labels
    anonymizer = Anonymize(k, QI, is_regression=True)
    anon = anonymizer.anonymize(ArrayDataset(X_train, x_train_predictions))
    unique_values.append(len(np.unique(anon, axis=0)))

    # retrain the model on the anonymized features
    anon_model = LinearRegression()
    anon_model.fit(anon, y_train)

    anon_art_classifier = ScikitlearnRegressor(anon_model)

    model_accuracy.append(anon_model.score(X_test, y_test))
    
    # note: this attack uses a random forest ('rf'), whereas the baseline attack above used 'nn'
    anon_bb_attack = MembershipInferenceBlackBox(anon_art_classifier, attack_model_type='rf', input_type='loss')

    # train attack model
    anon_bb_attack.fit(X_train[:attack_train_size], y_train[:attack_train_size],
                       X_test[:attack_test_size], y_test[:attack_test_size])

    # get inferred values
    anon_inferred_train_bb = anon_bb_attack.infer(X_train[attack_train_size:], y_train[attack_train_size:])
    anon_inferred_test_bb = anon_bb_attack.infer(X_test[attack_test_size:], y_test[attack_test_size:])
    # check accuracy
    anon_train_acc = np.sum(anon_inferred_train_bb) / len(anon_inferred_train_bb)
    anon_test_acc = 1 - (np.sum(anon_inferred_test_bb) / len(anon_inferred_test_bb))
    anon_acc = (anon_train_acc * len(anon_inferred_train_bb) + anon_test_acc * len(anon_inferred_test_bb)) / (len(anon_inferred_train_bb) + len(anon_inferred_test_bb))
    attack_accuracy.append(anon_acc)
    
print('k values: ', k_values)
print('unique rows:', unique_values)
print('model accuracy:', model_accuracy)
print('attack accuracy:', attack_accuracy)

unique rows in original data:  221


k values:  [5, 10, 20, 50, 75]
unique rows: [34, 19, 8, 4, 2]
model accuracy: [0.43165832354998956, 0.4509641063206041, -1.730181929385853, -5.577098823982753e+27, -1.2751609045828272e+25]
attack accuracy: [0.509009009009009, 0.481981981981982, 0.509009009009009, 0.5045045045045045, 0.4954954954954955]
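
The `unique rows` counts shrink as k grows because anonymization maps the quasi-identifier values of each group of records to shared values. If every group holds at least k of the 221 training rows, there can be at most floor(221 / k) distinct rows, which is consistent with the counts printed above (for example, at most 2 for k = 75). Below is a minimal sketch for checking that property, using only the standard library and assuming rows are plain tuples; the helper names and toy records are hypothetical and are not part of `apt`.

```python
from collections import Counter

def smallest_group_size(rows, qi_indices):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return min(groups.values())

def is_k_anonymous(rows, qi_indices, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(rows, qi_indices) >= k

# Hypothetical generalized records: (age, zip prefix, target); QI = columns 0 and 1.
rows = [
    (30, 100, 1.2), (30, 100, 0.7), (30, 100, 1.9),
    (45, 200, 2.4), (45, 200, 2.1),
]
print(smallest_group_size(rows, [0, 1]))  # 2
print(is_k_anonymous(rows, [0, 1], 2))    # True
print(is_k_anonymous(rows, [0, 1], 3))    # False
```

The same helpers should also work on the rows of a NumPy array such as `anon`, since each row can be iterated and indexed like a tuple.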
