# Applying data minimization to a trained ML model

In this tutorial we will show how to perform data minimization for ML models using the minimization module.

This will be demonstrated using the German Credit dataset (the original dataset can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data).

## Load data

In [76]:
from apt.utils import get_german_credit_dataset

(x_train, y_train), (x_test, y_test) = get_german_credit_dataset()
features = ["Existing_checking_account", "Duration_in_month", "Credit_history", "Purpose", "Credit_amount",
                "Savings_account", "Present_employment_since", "Installment_rate", "Personal_status_sex", "debtors",
                "Present_residence", "Property", "Age", "Other_installment_plans", "Housing",
                "Number_of_existing_credits", "Job", "N_people_being_liable_provide_maintenance", "Telephone",
                "Foreign_worker"]
categorical_features = ["Existing_checking_account", "Credit_history", "Purpose", "Savings_account",
                        "Present_employment_since", "Personal_status_sex", "debtors", "Property",
                        "Other_installment_plans", "Housing", "Job"]
QI = ["Duration_in_month", "Credit_history", "Purpose", "debtors", "Property", "Other_installment_plans",
      "Housing", "Job"]

print(x_train)

    Existing_checking_account  Duration_in_month Credit_history Purpose  \
0                         A14                 24            A32     A41   
1                         A14                 33            A33     A49   
2                         A11                  9            A32     A42   
3                         A14                 28            A34     A43   
4                         A11                 24            A33     A43   
..                        ...                ...            ...     ...   
695                       A14                 12            A32     A43   
696                       A14                 13            A32     A43   
697                       A11                 48            A30     A41   
698                       A12                 21            A34     A42   
699                       A13                 15            A32     A46   

     Credit_amount Savings_account Present_employment_since  Installment_rate  \
0             7814

## Train decision tree model
We use OneHotEncoder to encode the categorical features; missing values in the numeric features are imputed with a constant 0.

In [77]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
numeric_features = [f for f in features if f not in categorical_features]
numeric_transformer = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0))]
)
# Note: in scikit-learn >= 1.2 the `sparse` parameter is named `sparse_output`
categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse=False)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
encoded_train = preprocessor.fit_transform(x_train)
model = DecisionTreeClassifier()
model.fit(encoded_train, y_train)

encoded_test = preprocessor.transform(x_test)
print('Base model accuracy: ', model.score(encoded_test, y_test))

Base model accuracy:  0.73


## Run minimization
We will run minimization on only a subset of the features (those listed in `QI`) and with different values of the target accuracy. The target accuracy specifies how close to the original model's accuracy we want to stay, with 1 meaning the same accuracy as on the original data.

In [78]:
import sys
import os
sys.path.insert(0, os.path.abspath('..'))

from apt.minimization import GeneralizeToRepresentative
from sklearn.model_selection import train_test_split

# default target_accuracy is 0.998
minimizer = GeneralizeToRepresentative(model, features=features,
                                     categorical_features=categorical_features, features_to_minimize=QI)

# Fitting the minimizer can be done on either training or test data. Using test data is better, as the
# resulting accuracy on test data will be closer to the desired target accuracy (fitting on training
# data could result in a larger gap).
# Don't forget to leave a hold-out set for final validation!
X_generalizer_train, x_test, y_generalizer_train, y_test = train_test_split(x_test, y_test, stratify=y_test,
                                                                test_size = 0.4, random_state = 38)
X_generalizer_train.reset_index(drop=True, inplace=True)
y_generalizer_train.reset_index(drop=True, inplace=True)
x_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
encoded_generalizer_train = preprocessor.transform(X_generalizer_train)
x_train_predictions = model.predict(encoded_generalizer_train)
minimizer.fit(X_generalizer_train, x_train_predictions)
transformed = minimizer.transform(x_test)

encoded_transformed = preprocessor.transform(transformed)
print('Accuracy on minimized data: ', model.score(encoded_transformed, y_test))

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.861111
Improving accuracy
feature to remove: Other_installment_plans
Removed feature: Other_installment_plans, new relative accuracy: 0.902778
feature to remove: Property
Removed feature: Property, new relative accuracy: 0.888889
feature to remove: Job
Removed feature: Job, new relative accuracy: 0.888889
feature to remove: debtors
Removed feature: debtors, new relative accuracy: 0.888889
feature to remove: Housing
Removed feature: Housing, new relative accuracy: 0.888889
feature to remove: Credit_history
Removed feature: Credit_history, new relative accuracy: 0.944444
feature to remove: Purpose
Removed feature: Purpose, new relative accuracy: 0.972222
feature to remove: Duration_in_month
Removed feature: Duration_in_month, new relative accuracy: 1.000000
Accuracy on minimized data:  0.6916666666666667
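
The "relative accuracy" values in the log above are measured against the original model's predictions rather than against the true labels, which is why the log can end at 1.000000 while the accuracy on the hold-out labels is lower. The following is a minimal sketch of that kind of agreement score; the function name `relative_accuracy` and the sample predictions are hypothetical and are not part of the `apt` API.

```python
from typing import Sequence


def relative_accuracy(original_predictions: Sequence, generalized_predictions: Sequence) -> float:
    """Fraction of samples whose prediction is unchanged after generalization.

    Hypothetical helper for illustration; it is not a function of the apt library.
    """
    if len(original_predictions) != len(generalized_predictions):
        raise ValueError("Both prediction sequences must have the same length")
    if len(original_predictions) == 0:
        raise ValueError("At least one prediction is required")
    matches = sum(o == g for o, g in zip(original_predictions, generalized_predictions))
    return matches / len(original_predictions)


# Made-up predictions for eight samples (not taken from the German Credit run)
original = [1, 0, 1, 1, 0, 1, 0, 1]
generalized = [1, 0, 0, 1, 0, 1, 1, 1]

score = relative_accuracy(original, generalized)
print("Relative accuracy:", score)

# A target accuracy acts as a threshold on this score
target_accuracy = 0.9
print("Meets target:", score >= target_accuracy)
```

Under this reading, a generalization is acceptable when the model's predictions on the generalized data agree with its predictions on the original data for at least the target fraction of samples.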


#### Let's see what features were generalized

In [79]:
generalizations = minimizer.generalizations
print(generalizations)

{'ranges': {}, 'categories': {}, 'untouched': ['Foreign_worker', 'Present_employment_since', 'Property', 'N_people_being_liable_provide_maintenance', 'Duration_in_month', 'debtors', 'Housing', 'Purpose', 'Existing_checking_account', 'Savings_account', 'Other_installment_plans', 'Job', 'Telephone', 'Credit_amount', 'Installment_rate', 'Number_of_existing_credits', 'Personal_status_sex', 'Present_residence', 'Credit_history', 'Age']}


We can see that with the default target accuracy (0.998 of the original accuracy), no generalizations are possible: all features are left untouched, i.e., not generalized.
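
Reading the printed dictionary by eye gets tedious as the feature list grows. As a small sketch, the helper below lists which features appear under `ranges` or `categories` and which are untouched. It assumes only the three keys visible in the printed output; the name `summarize_generalizations` and the shortened example dictionary are illustrative, not part of the library.

```python
def summarize_generalizations(generalizations: dict) -> dict:
    """Split feature names into generalized and untouched ones.

    Assumes the dictionary layout shown in the printed output:
    'ranges' and 'categories' map feature names to their generalizations,
    and 'untouched' is a list of feature names.
    """
    ranges = generalizations.get("ranges", {})
    categories = generalizations.get("categories", {})
    return {
        "generalized": sorted(set(ranges) | set(categories)),
        "untouched": sorted(generalizations.get("untouched", [])),
    }


# Shortened, illustrative dictionary with the same three keys as the real output
example = {"ranges": {}, "categories": {}, "untouched": ["Credit_amount", "Age"]}

summary = summarize_generalizations(example)
print("Generalized:", summary["generalized"])
print("Untouched:", summary["untouched"])
```

With the real object, the call would be `summarize_generalizations(minimizer.generalizations)`; an empty `generalized` list corresponds to the situation described above.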

Let's change to a slightly lower target accuracy.

In [80]:
# We allow a 10% deviation in accuracy from the original model accuracy
minimizer2 = GeneralizeToRepresentative(model, target_accuracy=0.9, features=features,
                                     categorical_features=categorical_features, features_to_minimize=QI)

minimizer2.fit(X_generalizer_train, x_train_predictions)
transformed2 = minimizer2.transform(x_test)

encoded_transformed2 = preprocessor.transform(transformed2)
print('Accuracy on minimized data: ', model.score(encoded_transformed2, y_test))
generalizations2 = minimizer2.generalizations
print(generalizations2)

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.861111
Improving accuracy
feature to remove: Other_installment_plans
Removed feature: Other_installment_plans, new relative accuracy: 0.902778
Accuracy on minimized data:  0.5583333333333333
{'ranges': {'Duration_in_month': [8.0, 10.5, 16.5, 33.0, 40.5]}, 'categories': {'Credit_history': [['A32'], ['A33', 'A30', 'A31'], ['A34']], 'Purpose': [['A41'], ['A49'], ['A40'], ['A410'], ['A43', 'A46', 'A48'], ['A44'], ['A45'], ['A42']], 'debtors': [['A101', 'A102'], ['A103']], 'Property': [['A121'], ['A123', 'A122'], ['A124']], 'Housing': [['A153', 'A151', 'A152']], 'Job': [['A173', 'A172'], ['A174'], ['A171']]}, 'untouched': ['Existing_checking_account', 'Personal_status_sex', 'Other_installment_plans', 'Present_residence', 'Foreign_worker', 'Present_employment_since', 'Telephone', 'N_people_being_liable_provide_maintenance', 'Credit_amount', 'In

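This time we were able to generalize several features: Duration_in_month was generalized into ranges, and Credit_history, Purpose, debtors, Property, Housing and Job were generalized into groups of categories. Other_installment_plans was removed from the generalization to reach the target accuracy, so it is left untouched.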
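
To make the printed `ranges` and `categories` entries more concrete, the sketch below looks up which interval a numeric value falls into and which group a category belongs to. The cut points and groups are copied from the output above. How the library treats values that lie exactly on a cut point, and which representative value it substitutes for each interval or group, are not shown in this tutorial, so the boundary handling here is an assumption and the helper names are hypothetical.

```python
from bisect import bisect_right

# Copied from the printed generalizations above
duration_cut_points = [8.0, 10.5, 16.5, 33.0, 40.5]
credit_history_groups = [["A32"], ["A33", "A30", "A31"], ["A34"]]


def interval_for(value, cut_points):
    """Return the (lower, upper) interval around value.

    Assumption: cut points split the number line into consecutive intervals.
    A value equal to a cut point is placed in the interval above it here;
    the library may handle that boundary differently.
    """
    idx = bisect_right(cut_points, value)
    lower = cut_points[idx - 1] if idx > 0 else float("-inf")
    upper = cut_points[idx] if idx < len(cut_points) else float("inf")
    return lower, upper


def group_for(category, groups):
    """Return the group of categories that contains category, or None if absent."""
    for group in groups:
        if category in group:
            return group
    return None


print(interval_for(24, duration_cut_points))
print(interval_for(48, duration_cut_points))
print(group_for("A33", credit_history_groups))
```

In this reading, records with a 24-month duration become indistinguishable from all other records in the same interval, and a Credit_history of A33 becomes indistinguishable from A30 and A31.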