# Applying data minimization to a trained ML model

In this tutorial we will show how to perform data minimization for ML models using the minimization module. 

This will be demonstrated using the Adult dataset (the original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/adult).

We use only the numerical features in the dataset because this is what is currently supported by the module.

We will then explore how the anonymization features (k-anonymity, l-diversity and t-closeness) can further strengthen the data protection offered by the solution.

## Load data

In [1]:
import numpy as np
import pandas as pd
import warnings
import anonymize_module as an
import os
import sys
import calcmetric as cm
sys.path.insert(0, os.path.abspath('..'))
from apt.utils.datasets import ArrayDataset
from apt.utils.models import SklearnClassifier, ModelOutputType
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
# Use only numeric features (age, education-num, capital-gain, capital-loss, hours-per-week)


x_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=(0, 4, 10, 11, 12), delimiter=",")

y_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=14, dtype=str, delimiter=",")


x_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=(0, 4, 10, 11, 12), delimiter=",", skiprows=1)

y_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=14, dtype=str, delimiter=",", skiprows=1)

x_train = x_train[:1500]
y_train = y_train[:1500]
x_test = x_test[:1500]
y_test = y_test[:1500]

# Trim trailing period "." from label
y_test = np.array([a[:-1] for a in y_test])
y_train[y_train == ' <=50K'] = 0
y_train[y_train == ' >50K'] = 1
y_train = y_train.astype(int)

y_test[y_test == ' <=50K'] = 0
y_test[y_test == ' >50K'] = 1
y_test = y_test.astype(int)


## Train decision tree model

In [2]:
base_est = DecisionTreeClassifier()
model = SklearnClassifier(base_est, ModelOutputType.CLASSIFIER_PROBABILITIES)
model.fit(ArrayDataset(x_train, y_train))

print('Base model accuracy: ', model.score(ArrayDataset(x_test, y_test)))

Base model accuracy:  0.7706666666666667


## Run minimization
We will run minimization with different values of target accuracy, which controls how close to the original model's accuracy we want to stay (1 means the same accuracy as on the original data).
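
To make the notion of target accuracy concrete, here is a minimal sketch of a relative accuracy measure: the share of samples for which the model predicts the same label on generalized data as on the original data. The function name and the labels are hypothetical, and the sketch assumes (based on the log output of the minimizer) that accuracy is measured against the original model's predictions; it is not the module's own implementation.

```python
def relative_accuracy(original_predictions, generalized_predictions):
    """Fraction of samples whose predicted label is unchanged by generalization."""
    if len(original_predictions) != len(generalized_predictions):
        raise ValueError("prediction sequences must have the same length")
    if not original_predictions:
        raise ValueError("at least one prediction is required")
    matches = sum(o == g for o, g in zip(original_predictions, generalized_predictions))
    return matches / len(original_predictions)


# Hypothetical labels predicted by the same model on original and generalized inputs
original = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]
generalized = [0, 1, 0, 1, 1, 1, 0, 0, 1, 0]

score = relative_accuracy(original, generalized)  # 9 of 10 predictions agree
meets_target = score >= 0.9                       # compare against a target accuracy of 0.9
print(score, meets_target)
```

Under this reading, a target accuracy of 0.9 accepts a generalization as long as at most 10% of the predictions change.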

In [3]:
from apt.minimization import GeneralizeToRepresentative
from sklearn.model_selection import train_test_split

# default target_accuracy is 0.998
minimizer = GeneralizeToRepresentative(model)

# Fitting the minimizer can be done on either training or test data. Using test data is preferable because the
# resulting accuracy on test data will be closer to the desired target accuracy (fitting on training data can
# leave a larger gap).
# Don't forget to leave a hold-out set for final validation!
X_generalizer_train, x_test, y_generalizer_train, y_test = train_test_split(x_test, y_test, stratify=y_test,
                                                                test_size=0.4, random_state=38)
x_train_predictions = model.predict(ArrayDataset(X_generalizer_train))
if x_train_predictions.shape[1] > 1:
    x_train_predictions = np.argmax(x_train_predictions, axis=1)
minimizer.fit(dataset=ArrayDataset(X_generalizer_train, x_train_predictions))
transformed = minimizer.transform(dataset=ArrayDataset(x_test))

print('Accuracy on minimized data: ', model.score(ArrayDataset(transformed, y_test)))

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.852778
Improving accuracy
feature to remove: 1
Removed feature: 1, new relative accuracy: 0.850000
feature to remove: 2
Removed feature: 2, new relative accuracy: 0.863889
feature to remove: 0
Removed feature: 0, new relative accuracy: 0.938889
feature to remove: 4
Removed feature: 4, new relative accuracy: 0.994444
feature to remove: 3
Removed feature: 3, new relative accuracy: 1.000000
Accuracy on minimized data:  0.7883333333333333


#### Let's see what features were generalized

In [4]:
generalizations = minimizer.generalizations
print(generalizations)

{'ranges': {}, 'categories': {}, 'untouched': ['3', '4', '1', '2', '0'], 'category_representatives': {}, 'range_representatives': {}}


With the default target accuracy of 0.998 (relative to the original model's accuracy), no generalizations are possible: all features are left untouched, i.e., not generalized.

Let's change to a slightly lower target accuracy.

In [5]:
# We allow a 10% deviation in accuracy from the original model accuracy
minimizer2 = GeneralizeToRepresentative(model, target_accuracy=0.9)

minimizer2.fit(dataset=ArrayDataset(X_generalizer_train, x_train_predictions))
transformed2 = minimizer2.transform(dataset=ArrayDataset(x_test))
print('Accuracy on minimized data: ', model.score(test_data=ArrayDataset(transformed2, y_test)))
generalizations2 = minimizer2.generalizations
print(generalizations2)

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.852778
Improving accuracy
feature to remove: 1
Removed feature: 1, new relative accuracy: 0.850000
feature to remove: 2
Removed feature: 2, new relative accuracy: 0.863889
feature to remove: 0
Removed feature: 0, new relative accuracy: 0.938889
Accuracy on minimized data:  0.7583333333333333
{'ranges': {'3': [704.0, 814.0, 1578.5], '4': [15.0, 21.5, 25.0, 27.5, 32.5, 35.0, 36.0, 37.5, 41.0, 41.5, 42.5, 43.0, 43.5, 44.0, 45.0, 47.5, 49.0, 49.5, 52.5, 54.5, 55.0, 55.5, 65.0, 75.0]}, 'categories': {}, 'untouched': ['1', '0', '2'], 'category_representatives': {}, 'range_representatives': {'3': [704.0, 0.0, 382.25], '4': [15.0, 8.0, 20.0, 25.0, 2.5, 30.0, 35.0, 36.0, 37.0, 40.0, 0.5, 42.0, 0.25, 0.25, 44.0, 45.0, 46.0, 48.0, 1.5, 50.0, 54.0, 55.0, 4.75, 60.0]}}
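
The ranges and range_representatives entries suggest that a generalized feature is cut into intervals and that every value inside an interval is replaced by a single representative value. The sketch below illustrates that idea only. The cut points, representatives and the `generalize` helper are hypothetical, and it assumes that n cut points define n + 1 intervals and that a value equal to a cut point belongs to the upper interval. How the library itself pairs cut points with representatives is not shown in this tutorial.

```python
from bisect import bisect_right


def generalize(value, cut_points, representatives):
    """Replace a value with the representative of the interval it falls into."""
    if len(representatives) != len(cut_points) + 1:
        raise ValueError("expected one representative per interval")
    # bisect_right returns the index of the interval containing the value
    return representatives[bisect_right(cut_points, value)]


# Hypothetical cut points and representatives for an hours-per-week feature
cut_points = [25.0, 45.0, 65.0]
representatives = [20.0, 40.0, 50.0, 70.0]

hours = [12.0, 38.0, 40.0, 45.0, 60.0, 80.0]
generalized_hours = [generalize(h, cut_points, representatives) for h in hours]
print(generalized_hours)
```

After generalization only four distinct values remain, so less detail about each individual needs to be collected.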


## Construct the dataset and measure k-anonymity
The dataset consists of the transformed (minimized) data plus the sensitive column. QI lists the quasi-identifier columns used when measuring k-anonymity.

In [6]:
features = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
QI = ["age", "capital-gain", "capital-loss"]
sensitive_column = 'income'

df = pd.DataFrame(transformed2, columns=features)
df[sensitive_column] = y_test

# find maximum k-value for which k-anonymity is still satisfied
cm.find_k_anonymity(df, QI)

Dataset satisfies 1-anonymity
Dataset does not satisfy 2-anonymity
Dataset satisfies maximum 1-anonymity


1
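
For intuition, the maximum k for which a dataset is k-anonymous is the size of the smallest group of rows that share identical quasi-identifier values. The following is a minimal standard-library sketch of that definition with hypothetical rows. The function name `max_k_anonymity` is illustrative and is not the implementation behind cm.find_k_anonymity.

```python
from collections import Counter


def max_k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    if not rows:
        raise ValueError("at least one row is required")
    group_sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(group_sizes.values())


# Hypothetical rows; only the quasi-identifier columns matter for k-anonymity
QI = ["age", "capital-gain", "capital-loss"]
rows = [
    {"age": 23, "capital-gain": 0, "capital-loss": 0, "income": 0},
    {"age": 23, "capital-gain": 0, "capital-loss": 0, "income": 1},
    {"age": 23, "capital-gain": 0, "capital-loss": 0, "income": 0},
    {"age": 40, "capital-gain": 0, "capital-loss": 0, "income": 1},
    {"age": 40, "capital-gain": 0, "capital-loss": 0, "income": 0},
]

k_grouped = max_k_anonymity(rows, QI)  # the smallest group has two rows

# A single row with unique quasi-identifier values forces k down to 1
unique_row = {"age": 61, "capital-gain": 5000, "capital-loss": 0, "income": 1}
k_with_unique = max_k_anonymity(rows + [unique_row], QI)
print(k_grouped, k_with_unique)
```

With a pandas DataFrame, the same quantity can be obtained with `df.groupby(QI).size().min()`.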

## K-Anonymize
As seen above, the dataset only satisfies 1-anonymity, so it is not anonymized.
To anonymize it, call the anonymize_k_anonymity function.
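
The general idea of partition-based k-anonymization can be sketched as follows: split the rows into partitions of at least k rows, then overwrite each quasi-identifier with an aggregate of its partition. This sketch uses a Mondrian-style median split and the partition mean as the aggregate. Both are assumptions chosen for illustration, as are the function names and the data; the internals of the anonymization module are not shown in this tutorial.

```python
from statistics import mean, median


def partition(rows, quasi_identifiers, k):
    """Recursively split rows at the median of a quasi-identifier, keeping every part >= k rows."""
    # Try the quasi-identifier with the widest value range first
    by_span = sorted(
        quasi_identifiers,
        key=lambda q: max(r[q] for r in rows) - min(r[q] for r in rows),
        reverse=True,
    )
    for q in by_span:
        split = median(r[q] for r in rows)
        left = [r for r in rows if r[q] < split]
        right = [r for r in rows if r[q] >= split]
        if len(left) >= k and len(right) >= k:
            return partition(left, quasi_identifiers, k) + partition(right, quasi_identifiers, k)
    return [rows]  # no allowed split remains, so this is a final partition


def anonymize(rows, quasi_identifiers, k):
    """Replace quasi-identifier values with the mean of their partition."""
    result = []
    for group in partition(rows, quasi_identifiers, k):
        means = {q: mean(r[q] for r in group) for q in quasi_identifiers}
        result.extend({**r, **means} for r in group)
    return result


# Hypothetical data: eight people, two quasi-identifiers, one sensitive column
ages = [22, 23, 24, 25, 40, 41, 42, 43]
gains = [0, 0, 0, 0, 0, 0, 5000, 5000]
incomes = [0, 0, 1, 0, 1, 0, 1, 1]
rows = [{"age": a, "capital-gain": g, "income": i} for a, g, i in zip(ages, gains, incomes)]

anonymized = anonymize(rows, ["age", "capital-gain"], k=2)
for row in anonymized:
    print(row)
```

Only the quasi-identifier columns are changed. The sensitive column is kept as-is, which is why l-diversity and t-closeness are still worth checking afterwards.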

In [7]:
k_anonymdf = an.anonymize_k_anonymity(df, QI, sensitive_column, k=10)

Partition the dataset:
31 partitions created.
Changing quasi-identifiers values with the aggregation of their partition
Showcasing some of the anonymized partition:
Partition 7
+-----+-------+-----------------+----------------+----------------+------------------+----------+
|     |   age |   education-num |   capital-gain |   capital-loss |   hours-per-week |   income |
|-----+-------+-----------------+----------------+----------------+------------------+----------|
|   8 |    23 |              10 |              0 |              0 |               40 |        0 |
|  16 |    23 |              13 |              0 |              0 |               40 |        0 |
|  22 |    23 |              10 |              0 |              0 |               40 |        0 |
|  24 |    23 |              10 |              0 |              0 |               40 |        1 |
| 145 |    23 |               9 |              0 |              0 |               40 |        0 |
| 157 |    23 |              13 |      

Let's check the k-anonymity again.

In [8]:
cm.find_k_anonymity(k_anonymdf, QI)

Dataset satisfies 1-anonymity
Dataset satisfies 2-anonymity
Dataset satisfies 3-anonymity
Dataset satisfies 4-anonymity
Dataset satisfies 5-anonymity
Dataset satisfies 6-anonymity
Dataset satisfies 7-anonymity
Dataset satisfies 8-anonymity
Dataset satisfies 9-anonymity
Dataset satisfies 10-anonymity
Dataset does not satisfy 11-anonymity
Dataset satisfies maximum 10-anonymity


10

Now, the dataset is 10-anonymous.

## L-diversity

In [9]:
cm.find_l_diversity(k_anonymdf, QI, sensitive_column)
l_diversedf = an.anonymize_l_diversity(df, QI, sensitive_column, 10, 2)
cm.find_l_diversity(l_diversedf, QI, sensitive_column)

Dataset satisfies 1-diversity
Dataset does not satisfy 2-diversity
Dataset satisfies maximum 1-diversity
Partition the dataset:
25 partitions created.
Changing quasi-identifiers values with the aggregation of their partition
Showcasing some of the anonymized partition:
Partition 0
+-----+---------+-----------------+----------------+----------------+------------------+----------+
|     |     age |   education-num |   capital-gain |   capital-loss |   hours-per-week |   income |
|-----+---------+-----------------+----------------+----------------+------------------+----------|
|   6 | 21.6397 |              10 |        253.074 |              0 |               40 |        0 |
|   8 | 21.6397 |              10 |        253.074 |              0 |               40 |        0 |
|  10 | 21.6397 |              10 |        253.074 |              0 |               40 |        0 |
|  12 | 21.6397 |              13 |        253.074 |              0 |               40 |        1 |
|  14 | 21.6397 | 

2
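
The values above can be read using the simplest (distinct) form of l-diversity: the smallest number of different sensitive values found in any group of rows with identical quasi-identifier values. The sketch below computes it for hypothetical rows. It assumes this distinct variant and is not the implementation behind cm.find_l_diversity.

```python
from collections import defaultdict


def max_l_diversity(rows, quasi_identifiers, sensitive_column):
    """Return the smallest number of distinct sensitive values in any quasi-identifier group."""
    if not rows:
        raise ValueError("at least one row is required")
    values_per_group = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values_per_group[key].add(row[sensitive_column])
    return min(len(values) for values in values_per_group.values())


def make_row(age, income):
    """Build a hypothetical row with one quasi-identifier and one sensitive value."""
    return {"age": age, "income": income}


# Every age group contains both income values
diverse_rows = [make_row(23, 0), make_row(23, 1), make_row(40, 0), make_row(40, 1)]
# The age-40 group contains a single income value, which reveals it for everyone in the group
homogeneous_rows = [make_row(23, 0), make_row(23, 1), make_row(40, 0), make_row(40, 0)]

l_diverse = max_l_diversity(diverse_rows, ["age"], "income")
l_homogeneous = max_l_diversity(homogeneous_rows, ["age"], "income")
print(l_diverse, l_homogeneous)
```

The second case shows why k-anonymity alone can be insufficient: the group is 2-anonymous, yet the income of its members can be inferred with certainty.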

## T-closeness
T-closeness anonymization tries to create partitions in which the distribution of the sensitive column is close to its distribution in the entire dataset.
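
As a rough illustration of what "close" can mean, the sketch below measures, for each quasi-identifier group, the total variation distance between the group's sensitive-value distribution and the overall one, and reports the largest distance. The choice of total variation distance is an assumption made for this example (for categorical values with equal ground distance it coincides with the earth mover's distance). The function names and data are hypothetical, and the distance used by the anonymization module is not shown in this tutorial.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def max_t_distance(rows, quasi_identifiers, sensitive_column):
    """Largest total variation distance between any group's distribution and the overall one."""
    overall = distribution([row[sensitive_column] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[sensitive_column])
    worst = 0.0
    for values in groups.values():
        local = distribution(values)
        # Every value seen in a group also appears in the overall distribution
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst


# Hypothetical data: overall income is split 50/50, but each age group is skewed
rows = (
    [{"age": 23, "income": 0}] * 3 + [{"age": 23, "income": 1}]
    + [{"age": 40, "income": 0}] + [{"age": 40, "income": 1}] * 3
)

t_value = max_t_distance(rows, ["age"], "income")
print(t_value)
```

A dataset satisfies t-closeness under this measure when the reported value is at most t. Because this particular distance never exceeds 1, a threshold of t=1 would not constrain the partitions at all; whether that also holds for the module's own distance is not shown here.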

In [10]:
t_closedf = an.anonymize_t_closeness(df, QI, sensitive_column, k=8, t=1)


Partition the dataset:
33 partitions created.
Changing quasi-identifiers values with the aggregation of their partition
Showcasing some of the anonymized partition:
Partition 30
+-----+---------+-----------------+----------------+----------------+------------------+----------+
|     |     age |   education-num |   capital-gain |   capital-loss |   hours-per-week |   income |
|-----+---------+-----------------+----------------+----------------+------------------+----------|
| 179 | 63.6154 |               5 |        2802.15 |        146.308 |               40 |        0 |
| 203 | 63.6154 |               2 |        2802.15 |        146.308 |               48 |        0 |
| 231 | 63.6154 |              11 |        2802.15 |        146.308 |               40 |        0 |
| 254 | 63.6154 |              10 |        2802.15 |        146.308 |               20 |        0 |
| 279 | 63.6154 |               9 |        2802.15 |        146.308 |               40 |        0 |
| 317 | 63.6154 |     

## Using the anonymized data as input to ML model

In [11]:
print('Accuracy on k_anonymized/minimized data: ', model.score(test_data=ArrayDataset(k_anonymdf[features], y_test)))
print('Accuracy on l_diversed/minimized data: ', model.score(test_data=ArrayDataset(l_diversedf[features], y_test)))
print('Accuracy on t_close/minimized data: ', model.score(test_data=ArrayDataset(t_closedf[features], y_test)))


Accuracy on k_anonymized/minimized data:  0.7016666666666667
Accuracy on l_diversed/minimized data:  0.6983333333333334
Accuracy on t_close/minimized data:  0.6983333333333334
