# Applying data minimization to one-hot encoded data

In this tutorial we will show how to perform data minimization for ML models using the minimization module, specifically when the input data is already one-hot encoded. 

This will be demonstrated using the Adult dataset (the original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/adult).

## Load data

In [4]:
import numpy as np

import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from apt.utils.dataset_utils import get_adult_dataset_pd

# 'workclass', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'
categorical_features = [1, 3, 4, 5, 6, 7, 11]

# requires a folder called 'datasets' in the current directory
(x_train, y_train), (x_test, y_test) = get_adult_dataset_pd()
x_train = x_train.to_numpy()[:, [1, 3, 4, 5, 6, 7, 11]]
y_train = y_train.to_numpy().astype(int)
x_test = x_test.to_numpy()[:, [1, 3, 4, 5, 6, 7, 11]]
y_test = y_test.to_numpy().astype(int)

# truncate the training set to the same number of samples as the test set
x_train = x_train[:x_test.shape[0]]
y_train = y_train[:y_test.shape[0]]

print(x_train)

[['State-gov' 'Never-married' 'Adm-clerical' ... 'White' 'Male'
  'UnitedStates']
 ['Self-emp-not-inc' 'Married-civ-spouse' 'Exec-managerial' ... 'White'
  'Male' 'UnitedStates']
 ['Private' 'Divorced' 'Handlers-cleaners' ... 'White' 'Male'
  'UnitedStates']
 ...
 ['Private' 'Never-married' 'Sales' ... 'White' 'Female' 'UnitedStates']
 ['Private' 'Never-married' 'Craft-repair' ... 'White' 'Male'
  'UnitedStates']
 ['Private' 'Never-married' 'Handlers-cleaners' ... 'White' 'Male'
  'UnitedStates']]


In [5]:
from sklearn.preprocessing import OneHotEncoder
import scipy.sparse  # import the submodule explicitly so scipy.sparse.issparse is always available

preprocessor = OneHotEncoder(handle_unknown="ignore")

x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)
if scipy.sparse.issparse(x_train):
    x_train = x_train.toarray().astype(int)
if scipy.sparse.issparse(x_test):
    x_test = x_test.toarray().astype(int)

print(x_train)

[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]]
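
After encoding, each original categorical feature occupies a contiguous block of output columns, in input-feature order. The following is a minimal sketch of how those column blocks could be derived; `one_hot_slices` is a hypothetical helper (not part of the toolkit), and the toy categories are made up for illustration. Its input has the same layout as the `categories_` attribute of a fitted `OneHotEncoder`.

```python
from itertools import accumulate


def one_hot_slices(categories):
    """Return the output column indices occupied by each input feature.

    `categories` holds one sequence of category values per feature, laid out
    like the `categories_` attribute of a fitted OneHotEncoder.
    """
    sizes = [len(c) for c in categories]
    # Each feature starts where the previous one ended.
    starts = [0] + list(accumulate(sizes))[:-1]
    return [list(range(start, start + size)) for start, size in zip(starts, sizes)]


# Toy example (hypothetical categories, not the real Adult values)
toy_categories = [["Private", "Self-emp", "State-gov"], ["Female", "Male"]]
slices = one_hot_slices(toy_categories)
print(slices)  # [[0, 1, 2], [3, 4]]
```

Calling it on `preprocessor.categories_` gives a way to check hand-written index lists against the encoder's actual column layout.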


## Train decision tree model

In [6]:
import os
import sys
sys.path.insert(0, os.path.abspath('..'))

from apt.utils.datasets import ArrayDataset
from apt.utils.models import SklearnClassifier, ModelOutputType
from sklearn.tree import DecisionTreeClassifier

base_est = DecisionTreeClassifier()
model = SklearnClassifier(base_est, ModelOutputType.CLASSIFIER_PROBABILITIES)
model.fit(ArrayDataset(x_train, y_train))

print('Base model accuracy: ', model.score(ArrayDataset(x_test, y_test)))

Base model accuracy:  0.8145077083717216




## Run minimization
We will run minimization with a chosen target accuracy. The target accuracy controls how close the accuracy on minimized data must stay to the original model's accuracy, with 1 meaning the same accuracy as on the original data. Here we use 0.99.
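
The minimizer's log output later in this tutorial reports accuracy relative to the original model's predictions. As a rough illustration of that idea (a sketch only, not the toolkit's implementation), relative accuracy can be read as the fraction of samples whose predicted label is unchanged after minimization. The function name and the labels below are hypothetical.

```python
def relative_accuracy(original_predictions, minimized_predictions):
    """Fraction of samples whose predicted label is unchanged after minimization."""
    if len(original_predictions) != len(minimized_predictions):
        raise ValueError("prediction lists must have the same length")
    if not original_predictions:
        raise ValueError("need at least one prediction")
    matches = sum(o == m for o, m in zip(original_predictions, minimized_predictions))
    return matches / len(original_predictions)


# Hypothetical labels: 99 of 100 predictions agree after minimization
original = [1] * 60 + [0] * 40
minimized = [1] * 60 + [0] * 39 + [1]
score = relative_accuracy(original, minimized)
print(score >= 0.99)  # True
```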

In [10]:
from apt.minimization import GeneralizeToRepresentative
from sklearn.model_selection import train_test_split


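# features to minimize, labelled (race, sex); OneHotEncoder emits columns in input-feature order,
# so verify these indices against preprocessor.categories_ before relying on that label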
QI = [53, 52, 51, 50, 49, 48, 47]
QI_slices = [[47, 48, 49, 50, 51], [52, 53]]

minimizer = GeneralizeToRepresentative(model, target_accuracy=0.99, features_to_minimize=QI, feature_slices=QI_slices)

# Fitting the minimizer can be done on either training or test data. Fitting on test data is preferable, as the
# resulting accuracy on test data will be closer to the desired target accuracy (fitting on training
# data could result in a larger gap).
# Don't forget to leave a hold-out set for final validation!
X_generalizer_train, x_test, y_generalizer_train, y_test = train_test_split(x_test, y_test, stratify=y_test,
                                                                test_size = 0.4, random_state = 38)
x_train_predictions = model.predict(ArrayDataset(X_generalizer_train))
if x_train_predictions.shape[1] > 1:
    x_train_predictions = np.argmax(x_train_predictions, axis=1)
minimizer.fit(dataset=ArrayDataset(X_generalizer_train, x_train_predictions))
transformed = minimizer.transform(dataset=ArrayDataset(x_test))

print('Accuracy on minimized data: ', model.score(ArrayDataset(transformed, y_test)))



Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 1.000000
Improving generalizations
Pruned tree to level: 1, new relative accuracy: 1.000000
Pruned tree to level: 2, new relative accuracy: 1.000000
Pruned tree to level: 3, new relative accuracy: 0.999360
Pruned tree to level: 4, new relative accuracy: 0.998081
Pruned tree to level: 5, new relative accuracy: 0.998081
Pruned tree to level: 6, new relative accuracy: 0.994242


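
Because the columns passed in `feature_slices` are groups of one-hot columns, one sanity check a reader might run on the transformed data is that each slice still has exactly one active column per row. The sketch below is an illustration under that assumption; `invalid_one_hot_rows` is a hypothetical helper, not part of the toolkit. Note that with `handle_unknown="ignore"`, a category unseen during fitting is encoded as all zeros, so a slice with no active column is not necessarily caused by minimization.

```python
def invalid_one_hot_rows(rows, feature_slices):
    """Return (row_index, slice_index) pairs where a slice does not have exactly one 1."""
    problems = []
    for r, row in enumerate(rows):
        for s, columns in enumerate(feature_slices):
            if sum(row[c] for c in columns) != 1:
                problems.append((r, s))
    return problems


# Hypothetical 5-column encoding: columns 0-2 are one feature, 3-4 another
toy_slices = [[0, 1, 2], [3, 4]]
toy_rows = [
    [1, 0, 0, 0, 1],  # valid
    [0, 1, 0, 1, 1],  # second slice has two active columns
    [0, 0, 0, 1, 0],  # first slice has no active column
]
problems = invalid_one_hot_rows(toy_rows, toy_slices)
print(problems)  # [(1, 1), (2, 0)]
```

Assuming `transformed` is a NumPy array, the same check would be `invalid_one_hot_rows(transformed, QI_slices)`.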



#### Let's see what features were generalized

In [11]:
generalizations = minimizer.generalizations
print(generalizations)

{'ranges': {}, 'categories': {'53': [[0, 1]], '52': [[0, 1]]}, 'untouched': ['24', '41', '13', '18', '7', '10', '14', '11', '31', '33', '28', '12', '5', '17', '44', '8', '0', '20', '19', '46', '21', '38', '25', '42', '34', '45', '35', '3', '4', '2', '1', '39', '37', '6', '9', '36', '27', '30', '26', '15', '29', '16', '23', '40', '43', '22', '32', '51', '48', '49', '47', '50'], 'category_representatives': {'53': [0], '52': [1]}, 'range_representatives': {}}
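
In this output, columns 52 and 53 appear under `categories` with their two values (0 and 1) grouped together, and every other column is listed as `untouched`. One way to relate this back to the slices is sketched below. `generalized_slices` is a hypothetical helper, the dictionary is copied from the printed output with the long `untouched` list omitted, and the reading of the dictionary keys is an assumption based on that output.

```python
def generalized_slices(generalizations, feature_slices):
    """Map each feature slice to the columns in it that were generalized."""
    generalized = {int(col) for col in generalizations["categories"]}
    generalized |= {int(col) for col in generalizations["ranges"]}
    return {
        tuple(columns): sorted(generalized.intersection(columns))
        for columns in feature_slices
    }


# Copied from the printed output, with the 'untouched' list omitted
generalizations = {
    "ranges": {},
    "categories": {"53": [[0, 1]], "52": [[0, 1]]},
    "category_representatives": {"53": [0], "52": [1]},
    "range_representatives": {},
}
QI_slices = [[47, 48, 49, 50, 51], [52, 53]]

result = generalized_slices(generalizations, QI_slices)
print(result)  # {(47, 48, 49, 50, 51): [], (52, 53): [52, 53]}
```

Under this reading, only the second slice was generalized, while the five columns of the first slice were left as they were.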
