diff --git a/README.md b/README.md
index 820a840..fa8eb79 100644
--- a/README.md
+++ b/README.md
@@ -6,20 +6,23 @@
A toolkit for tools and techniques related to the privacy and compliance of AI models.
-The first release of this toolkit contains a single module called [**anonymization**](apt/anonymization/README.md).
-This module contains methods for anonymizing ML model training data, so that when
-a model is retrained on the anonymized data, the model itself will also be considered
-anonymous. This may help exempt the model from different obligations and restrictions
+The [**anonymization**](apt/anonymization/README.md) module contains methods for anonymizing ML model
+training data, so that when a model is retrained on the anonymized data, the model itself will also be
+considered anonymous. This may help exempt the model from different obligations and restrictions
set out in data protection regulations such as GDPR, CCPA, etc.
+The [**minimization**](apt/minimization/README.md) module contains methods to help adhere to the data
+minimization principle in GDPR for ML models. It makes it possible to reduce the amount of
+personal data needed to perform predictions with a machine learning model, while still enabling the model
+to make accurate predictions. This is done by removing or generalizing some of the input features.
+
Official ai-privacy-toolkit documentation: https://ai-privacy-toolkit.readthedocs.io/en/latest/
Installation: pip install ai-privacy-toolkit
**Related toolkits:**
-[ai-minimization-toolkit](https://github.com/IBM/ai-minimization-toolkit): A toolkit for
-reducing the amount of personal data needed to perform predictions with a machine learning model
+[ai-minimization-toolkit](https://github.com/IBM/ai-minimization-toolkit): has been migrated into this toolkit.
[differential-privacy-library](https://github.com/IBM/differential-privacy-library): A
general-purpose library for experimenting with, investigating and developing applications in,
diff --git a/apt/__init__.py b/apt/__init__.py
index 4062275..99e5ad6 100644
--- a/apt/__init__.py
+++ b/apt/__init__.py
@@ -3,6 +3,7 @@
"""
from apt import anonymization
+from apt import minimization
from apt import utils
-__version__ = "0.0.2"
\ No newline at end of file
+__version__ = "0.0.3"
\ No newline at end of file
diff --git a/apt/minimization/README.md b/apt/minimization/README.md
new file mode 100644
index 0000000..0ba7705
--- /dev/null
+++ b/apt/minimization/README.md
@@ -0,0 +1,110 @@
+# data minimization module
+
+The EU General Data Protection Regulation (GDPR) mandates the principle of data minimization, which requires that only
+data necessary to fulfill a certain purpose be collected. However, it can often be difficult to determine the minimal
+amount of data required, especially in complex machine learning models such as neural networks.
+
+This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform
+predictions with a machine learning model, by removing or generalizing some of the input features. The type of data
+minimization this toolkit focuses on is the reduction of the number and/or granularity of features collected for analysis.
+
+The generalization process searches for several similar records and groups them together. Then, for each
+feature, the individual values for that feature within each group are replaced with a representative value that is
+common across the whole group. This process is done while using knowledge encoded within the model to produce a
+generalization that has little to no impact on its accuracy.
+
+For more information about the method see: http://export.arxiv.org/pdf/2008.04113
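+
+As a rough illustration of the grouping idea (a toy sketch only; the toolkit derives the groups from a decision tree
+trained on the model's predictions and picks as representative an existing record close to the group's median):
+
+.. code:: python
+
+    import numpy as np
+
+    # three records that ended up in the same group (features: age, height)
+    group = np.array([[31.0, 168.0],
+                      [33.0, 172.0],
+                      [36.0, 170.0]])
+
+    # a single representative value per feature replaces the individual values of the whole group
+    representative = np.median(group, axis=0)              # [33., 170.]
+    generalized_group = np.tile(representative, (group.shape[0], 1))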
+
+The following figure depicts the overall process:
+
+
+
+
+
+
+Usage
+-----
+
+The main class, ``GeneralizeToRepresentative``, is a scikit-learn-compatible ``Transformer`` that receives an existing
+estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for
+analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies
+them to new data.
+
+It is also possible to export the generalizations as feature ranges.
+
+The current implementation supports only numeric features, so any categorical features must be transformed to a numeric
+representation before using this class.
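+
+For example, categorical columns could be encoded before fitting (a minimal sketch; the encoder choice and the values
+below are illustrative and not part of this toolkit):
+
+.. code:: python
+
+    import numpy as np
+    from sklearn.preprocessing import OrdinalEncoder
+
+    # two hypothetical categorical features
+    X_cat = np.array([['private', 'bachelors'],
+                      ['government', 'masters'],
+                      ['private', 'masters']])
+    X_numeric = OrdinalEncoder().fit_transform(X_cat)  # each category becomes an integer code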
+
+Start by training your machine learning model. In this example, we will use a ``DecisionTreeClassifier``, but any
+scikit-learn model can be used. We will use the iris dataset in our example.
+
+.. code:: python
+
+ from sklearn import datasets
+ from sklearn.model_selection import train_test_split
+ from sklearn.tree import DecisionTreeClassifier
+
+ dataset = datasets.load_iris()
+ X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)
+
+ base_est = DecisionTreeClassifier()
+ base_est.fit(X_train, y_train)
+
+Now create the ``GeneralizeToRepresentative`` transformer and train it. Supply it with the original model and the
+desired target accuracy. The training process may receive the original labeled training data or the model's predictions
+on the data.
+
+.. code:: python
+
+    from apt.minimization import GeneralizeToRepresentative
+
+    predictions = base_est.predict(X_train)
+ gen = GeneralizeToRepresentative(base_est, target_accuracy=0.9)
+ gen.fit(X_train, predictions)
+
+Now use the transformer to transform new data, for example the test data.
+
+.. code:: python
+
+ transformed = gen.transform(X_test)
+
+The transformed data has the same columns and formats as the original data, so it can be used directly to derive
+predictions from the original model.
+
+.. code:: python
+
+ new_predictions = base_est.predict(transformed)
+
+To export the resulting generalizations, retrieve the transformer's ``generalizations_`` attribute.
+
+.. code:: python
+
+    generalizations = gen.generalizations_
+
+The returned object has the following structure::
+
+ {
+ ranges:
+ {
+      list of (<feature name>: [<list of cutoff values>])
+ },
+    untouched: [<list of feature names>]
+ }
+
+For example::
+
+ {
+ ranges:
+ {
+ age: [21.5, 39.0, 51.0, 70.5],
+ education-years: [8.0, 12.0, 14.5]
+ },
+ untouched: ["occupation", "marital-status"]
+ }
+
+Where each value inside the range list represents a cutoff point. For example, for the ``age`` feature, the ranges in
+this example are: ``<21.5, 21.5-39.0, 39.0-51.0, 51.0-70.5, >70.5``. The ``untouched`` list represents features that
+were not generalized, i.e., their values should remain unchanged.
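+
+As an illustration of how such cutoff points partition a feature (illustrative only; ``transform()`` already maps
+records to their representative values):
+
+.. code:: python
+
+    import numpy as np
+
+    age_cutoffs = [21.5, 39.0, 51.0, 70.5]        # taken from the example above
+    ages = np.array([19, 30, 45, 80])
+    range_index = np.digitize(ages, age_cutoffs)  # 0 -> <21.5, 1 -> 21.5-39.0, ..., 4 -> >70.5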
+
+
+
+
+
diff --git a/apt/minimization/__init__.py b/apt/minimization/__init__.py
new file mode 100644
index 0000000..e9aa35d
--- /dev/null
+++ b/apt/minimization/__init__.py
@@ -0,0 +1,19 @@
+"""
+Module providing data minimization for ML.
+
+This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform
+predictions with a machine learning model, by removing or generalizing some of the input features. For more information
+about the method see: http://export.arxiv.org/pdf/2008.04113
+
+The main class, ``GeneralizeToRepresentative``, is a scikit-learn-compatible ``Transformer`` that receives an existing
+estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for
+analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies
+them to new data.
+
+It is also possible to export the generalizations as feature ranges.
+
+The current implementation supports only numeric features, so any categorical features must be transformed to a numeric
+representation before using this class.
+
+"""
+from apt.minimization.minimizer import GeneralizeToRepresentative
diff --git a/apt/minimization/minimizer.py b/apt/minimization/minimizer.py
new file mode 100644
index 0000000..4440e44
--- /dev/null
+++ b/apt/minimization/minimizer.py
@@ -0,0 +1,664 @@
+"""
+This module implements all classes needed to perform data minimization
+"""
+
+import pandas as pd
+import numpy as np
+import copy
+import sys
+from scipy.spatial import distance
+from sklearn.base import BaseEstimator, TransformerMixin, MetaEstimatorMixin
+from sklearn.base import clone
+from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.model_selection import train_test_split
+
+class GeneralizeToRepresentative(BaseEstimator, MetaEstimatorMixin, TransformerMixin):
+ """ A transformer that generalizes data to representative points.
+
+ Learns data generalizations based on an original model's predictions
+ and a target accuracy. Once the generalizations are learned, can
+ receive one or more data records and transform them to representative
+ points based on the learned generalization.
+
+ An alternative way to use the transformer is to supply ``cells`` and
+ ``features`` in init or set_params and those will be used to transform
+ data to representatives. In this case, fit must still be called but
+ there is no need to supply it with ``X`` and ``y``, and there is no
+ need to supply an existing ``estimator`` to init.
+
+ In summary, either ``estimator`` and ``target_accuracy`` should be
+ supplied or ``cells`` and ``features`` should be supplied.
+
+ Parameters
+ ----------
+ estimator : estimator, optional
+ The original model for which generalization is being performed.
+ Should be pre-fitted.
+
+ target_accuracy : float, optional
+ The required accuracy when applying the base model to the
+ generalized data. Accuracy is measured relative to the original
+ accuracy of the model.
+
+ features : list of str, optional
+ The feature names, in the order that they appear in the data.
+
+ cells : list of object, optional
+ The cells used to generalize records. Each cell must define a
+ range or subset of categories for each feature, as well as a
+ representative value for each feature.
+ This parameter should be used when instantiating a transformer
+ object without first fitting it.
+
+ Attributes
+ ----------
+ cells_ : list of object
+ The cells used to generalize records, as learned when calling fit.
+
+ ncp_ : float
+ The NCP (information loss) score of the resulting generalization,
+ as measured on the training data.
+
+ generalizations_ : object
+ The generalizations that were learned (actual feature ranges).
+
+ Notes
+ -----
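+
+    A minimal usage sketch (assuming ``estimator`` is a pre-fitted scikit-learn
+    model and ``X`` holds numeric training data)::
+
+        predictions = estimator.predict(X)
+        gen = GeneralizeToRepresentative(estimator, target_accuracy=0.9)
+        gen.fit(X, predictions)
+        generalized = gen.transform(X)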
+
+
+ """
+
+ def __init__(self, estimator=None, target_accuracy=0.998, features=None,
+ cells=None):
+ self.estimator = estimator
+ self.target_accuracy = target_accuracy
+ self.features = features
+ self.cells = cells
+
+ def get_params(self, deep=True):
+ """Get parameters for this estimator.
+
+ Parameters
+ ----------
+ deep : boolean, optional
+ If True, will return the parameters for this estimator and contained
+ subobjects that are estimators.
+
+ Returns
+ -------
+ params : mapping of string to any
+ Parameter names mapped to their values.
+ """
+ ret = {}
+ ret['target_accuracy'] = self.target_accuracy
+ if deep:
+ ret['features'] = copy.deepcopy(self.features)
+ ret['cells'] = copy.deepcopy(self.cells)
+ ret['estimator'] = self.estimator
+ else:
+ ret['features'] = copy.copy(self.features)
+ ret['cells'] = copy.copy(self.cells)
+ return ret
+
+ def set_params(self, **params):
+ """Set the parameters of this estimator.
+
+ Returns
+ -------
+ self : object
+ Returns self.
+ """
+ if 'target_accuracy' in params:
+ self.target_accuracy = params['target_accuracy']
+ if 'features' in params:
+ self.features = params['features']
+ if 'cells' in params:
+ self.cells = params['cells']
+ return self
+
+ def fit_transform(self, X=None, y=None):
+ """Learns the generalizations based on training data, and applies them to the data.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features), optional
+ The training input samples.
+ y : array-like, shape (n_samples,), optional
+ The target values. An array of int.
+ This should contain the predictions of the original model on ``X``.
+
+ Returns
+ -------
+        X_transformed : ndarray, shape (n_samples, n_features)
+            The array containing the representative values to which each record in ``X`` is mapped.
+ """
+ self.fit(X, y)
+ return self.transform(X)
+
+ def fit(self, X=None, y=None):
+ """Learns the generalizations based on training data.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features), optional
+ The training input samples.
+ y : array-like, shape (n_samples,), optional
+ The target values. An array of int.
+ This should contain the predictions of the original model on ``X``.
+
+ Returns
+ -------
+        self : object
+            Returns the fitted transformer.
+ """
+
+ # take into account that estimator, X, y, cells, features may be None
+
+ if X is not None and y is not None:
+ X, y = check_X_y(X, y, accept_sparse=True)
+ self.n_features_ = X.shape[1]
+ elif self.features:
+ self.n_features_ = len(self.features)
+ else:
+ self.n_features_ = 0
+
+ if self.features:
+ self._features = self.features
+ # if features is None, use numbers instead of names
+ elif self.n_features_ != 0:
+ self._features = [i for i in range(self.n_features_)]
+ else:
+ self._features = None
+
+ if self.cells:
+ self.cells_ = self.cells
+ else:
+ self.cells_ = {}
+
+
+ # Going to fit
+ # (currently not dealing with option to fit with only X and y and no estimator)
+ if self.estimator and X is not None and y is not None:
+ # divide dataset into train and test
+ X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
+ test_size = 0.4,
+ random_state = 18)
+
+ # collect feature data (such as min, max)
+ train_data = pd.DataFrame(X_train, columns=self._features)
+ feature_data = {}
+ for feature in self._features:
+ if not feature in feature_data.keys():
+ values = list(train_data.loc[:, feature])
+ fd = {}
+ fd['min'] = min(values)
+ fd['max'] = max(values)
+ feature_data[feature] = fd
+
+ self.cells_ = {}
+ self.dt_ = DecisionTreeClassifier(random_state=0, min_samples_split=2,
+ min_samples_leaf=1)
+ self.dt_.fit(X_train, y_train)
+ self._calculate_cells()
+ self._modify_cells()
+ nodes = self._get_nodes_level(0)
+ self._attach_cells_representatives(X_train, y_train, nodes)
+ # self.cells_ currently holds the generalization created from the tree leaves
+
+ # apply generalizations to test data
+ generalized = self._generalize(X_test, nodes, self.cells_, self.cells_by_id_)
+
+ # check accuracy
+ accuracy = self.estimator.score(generalized, y_test)
+ print('Initial accuracy is %f' % accuracy)
+
+ # if accuracy above threshold, improve generalization
+ if accuracy > self.target_accuracy:
+ level = 1
+ while accuracy > self.target_accuracy:
+ nodes = self._get_nodes_level(level)
+ self._calculate_level_cells(level)
+ self._attach_cells_representatives(X_train, y_train, nodes)
+ generalized = self._generalize(X_test, nodes, self.cells_,
+ self.cells_by_id_)
+ accuracy = self.estimator.score(generalized, y_test)
+ print('Level: %d, accuracy: %f' % (level, accuracy))
+ level+=1
+
+ # if accuracy below threshold, improve accuracy by removing features from generalization
+ if accuracy < self.target_accuracy:
+ while accuracy < self.target_accuracy:
+ self._calculate_generalizations()
+ removed_feature = self._remove_feature_from_generalization(X_test,
+ nodes, y_test,
+ feature_data)
+ if not removed_feature:
+ break
+ generalized = self._generalize(X_test, nodes, self.cells_,
+ self.cells_by_id_)
+ accuracy = self.estimator.score(generalized, y_test)
+ print('Removed feature: %s, accuracy: %f' % (removed_feature, accuracy))
+
+            # self.cells_ currently holds the chosen generalization based on target accuracy
+            self._calculate_generalizations()
+
+            # calculate iLoss
+            self.ncp_ = self._calculate_ncp(X_test, self.generalizations_, feature_data)
+
+ # Return the transformer
+ return self
+
+ def transform(self, X):
+ """ Transforms data records to representative points.
+
+ Parameters
+ ----------
+ X : {array-like, sparse-matrix}, shape (n_samples, n_features)
+ The input samples.
+
+ Returns
+ -------
+ X_transformed : ndarray, shape (n_samples, n_features)
+ The array containing the representative values to which each record in
+ ``X`` is mapped.
+ """
+
+ # Check if fit has been called
+ msg = 'This %(name)s instance is not initialized yet. ' \
+            'Call `fit` or `set_params` with ' \
+ 'appropriate arguments before using this method.'
+ check_is_fitted(self, ['cells', 'features'], msg=msg)
+
+ # Input validation
+ X = check_array(X, accept_sparse=True)
+ if X.shape[1] != self.n_features_ and self.n_features_ != 0:
+            raise ValueError('Shape of input is different from what was seen '
+ 'in `fit`')
+
+ if not self._features:
+ self._features = [i for i in range(X.shape[1])]
+
+ representatives = pd.DataFrame(columns=self._features) # only columns
+ generalized = pd.DataFrame(X, columns=self._features, copy=True) # original data
+ mapped = np.zeros(X.shape[0]) # to mark records we already mapped
+
+ # iterate over cells (leaves in decision tree)
+ for i in range(len(self.cells_)):
+ # Copy the representatives from the cells into another data structure:
+ # iterate over features in test data
+ for feature in self._features:
+ # if feature has a representative value in the cell and should not
+ # be left untouched, take the representative value
+ if feature in self.cells_[i]['representative'] and \
+ ( 'untouched' not in self.cells_[i] \
+ or feature not in self.cells_[i]['untouched'] ):
+ representatives.loc[i, feature] = self.cells_[i]['representative'][feature]
+ # else, drop the feature (removes from representatives columns that
+ # do not have a representative value or should remain untouched)
+ elif feature in representatives.columns.tolist():
+ representatives = representatives.drop(feature, axis=1)
+
+ # get the indexes of all records that map to this cell
+ indexes = self._get_record_indexes_for_cell(X, self.cells_[i], mapped)
+
+ # replace the values in the representative columns with the representative
+ # values (leaves others untouched)
+ if not representatives.columns.empty:
+ if len(indexes) > 1:
+ replace = pd.concat([representatives.loc[i].to_frame().T]*len(indexes)).reset_index(drop=True)
+ replace.index = indexes
+ else:
+ replace = representatives.loc[i].to_frame().T
+ generalized.loc[indexes, representatives.columns] = replace
+
+ return generalized.to_numpy()
+
+ def _get_record_indexes_for_cell(self, X, cell, mapped):
+ return [i for i, x in enumerate(X) if not mapped.item(i) and
+ self._cell_contains(cell, x, i, mapped)]
+
+ def _cell_contains(self, cell, x, i, mapped):
+ for f in self._features:
+ if f in cell['ranges']:
+ if not self._cell_contains_numeric(f, cell['ranges'][f], x):
+ return False
+ else:
+ #TODO: exception - feature not defined
+ pass
+ # Mark as mapped
+ mapped.itemset(i, 1)
+ return True
+
+ def _cell_contains_numeric(self, f, range, x):
+ i = self._features.index(f)
+ # convert x to ndarray to allow indexing
+ a = np.array(x)
+ value = a.item(i)
+        if range['start'] is not None:
+            if value <= range['start']:
+                return False
+        if range['end'] is not None:
+ if value > range['end']:
+ return False
+ return True
+
+ def _calculate_cells(self):
+ self.cells_by_id_ = {}
+ self.cells_ = self._calculate_cells_recursive(0)
+
+ def _calculate_cells_recursive(self, node):
+ feature_index = self.dt_.tree_.feature[node]
+ if feature_index == -2:
+ # this is a leaf
+ label = self._calculate_cell_label(node)
+ hist = [int(i) for i in self.dt_.tree_.value[node][0]]
+ cell = {'label': label, 'hist': hist, 'ranges': {}, 'id': int(node)}
+ return [cell]
+
+ cells = []
+ feature = self._features[feature_index]
+ threshold = self.dt_.tree_.threshold[node]
+ left_child = self.dt_.tree_.children_left[node]
+ right_child = self.dt_.tree_.children_right[node]
+
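+        # scikit-learn routes samples with feature value <= threshold to the left child, so the threshold
+        # closes the 'end' of ranges coming from the left subtree and the 'start' of ranges from the right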
+ left_child_cells = self._calculate_cells_recursive(left_child)
+ for cell in left_child_cells:
+ if feature not in cell['ranges'].keys():
+ cell['ranges'][feature] = {'start': None, 'end': None}
+ if cell['ranges'][feature]['end'] is None:
+ cell['ranges'][feature]['end'] = threshold
+ cells.append(cell)
+ self.cells_by_id_[cell['id']] = cell
+
+ right_child_cells = self._calculate_cells_recursive(right_child)
+ for cell in right_child_cells:
+ if feature not in cell['ranges'].keys():
+ cell['ranges'][feature] = {'start': None, 'end': None}
+ if cell['ranges'][feature]['start'] is None:
+ cell['ranges'][feature]['start'] = threshold
+ cells.append(cell)
+ self.cells_by_id_[cell['id']] = cell
+
+ return cells
+
+ def _calculate_cell_label(self, node):
+ label_hist = self.dt_.tree_.value[node][0]
+ return int(self.dt_.classes_[np.argmax(label_hist)])
+
+ def _modify_cells(self):
+ cells = []
+ for cell in self.cells_:
+ new_cell = {'id': cell['id'], 'label': cell['label'], 'ranges': {},
+ 'categories': {}, 'hist': cell['hist'], 'representative': None}
+ for feature in self._features:
+ if feature in cell['ranges'].keys():
+ new_cell['ranges'][feature] = cell['ranges'][feature]
+ else:
+ new_cell['ranges'][feature] = {'start': None, 'end': None}
+ cells.append(new_cell)
+ self.cells_by_id_[new_cell['id']] = new_cell
+ self.cells_ = cells
+
+ def _calculate_level_cells(self, level):
+ if level < 0 or level > self.dt_.get_depth():
+ #TODO: exception 'Illegal level %d' % level
+ pass
+
+ if level > 0:
+ new_cells = []
+ new_cells_by_id = {}
+ nodes = self._get_nodes_level(level)
+ for node in nodes:
+ if self.dt_.tree_.feature[node] == -2: # leaf node
+ new_cell = self.cells_by_id_[node]
+ else:
+ left_child = self.dt_.tree_.children_left[node]
+ right_child = self.dt_.tree_.children_right[node]
+ left_cell = self.cells_by_id_[left_child]
+ right_cell = self.cells_by_id_[right_child]
+ new_cell = {'id': int(node), 'ranges': {}, 'categories': {},
+ 'label': None, 'representative': None}
+ for feature in left_cell['ranges'].keys():
+ new_cell['ranges'][feature] = {}
+ new_cell['ranges'][feature]['start'] = left_cell['ranges'][feature]['start']
+ new_cell['ranges'][feature]['end'] = right_cell['ranges'][feature]['start']
+ for feature in left_cell['categories'].keys():
+ new_cell['categories'][feature] = \
+ list(set(left_cell['categories'][feature]) |
+ set(right_cell['categories'][feature]))
+ self._calculate_level_cell_label(left_cell, right_cell, new_cell)
+ new_cells.append(new_cell)
+ new_cells_by_id[new_cell['id']] = new_cell
+ self.cells_ = new_cells
+ self.cells_by_id_ = new_cells_by_id
+ # else: nothing to do, stay with previous cells
+
+ def _calculate_level_cell_label(self, left_cell, right_cell, new_cell):
+ new_cell['hist'] = [x + y for x, y in zip(left_cell['hist'], right_cell['hist'])]
+ new_cell['label'] = int(self.dt_.classes_[np.argmax(new_cell['hist'])])
+
+ def _get_nodes_level(self, level):
+ # level = distance from lowest leaf
+ node_depth = np.zeros(shape=self.dt_.tree_.node_count, dtype=np.int64)
+ is_leaves = np.zeros(shape=self.dt_.tree_.node_count, dtype=bool)
+ stack = [(0, -1)] # seed is the root node id and its parent depth
+ while len(stack) > 0:
+ node_id, parent_depth = stack.pop()
+ node_depth[node_id] = parent_depth + 1
+
+ if self.dt_.tree_.children_left[node_id] != self.dt_.tree_.children_right[node_id]:
+ stack.append((self.dt_.tree_.children_left[node_id], parent_depth + 1))
+ stack.append((self.dt_.tree_.children_right[node_id], parent_depth + 1))
+ else:
+ is_leaves[node_id] = True
+
+ max_depth = max(node_depth)
+ depth = max_depth - level
+ if depth < 0:
+ return None
+ return [i for i, x in enumerate(node_depth) if x == depth or (x < depth and is_leaves[i])]
+
+ def _attach_cells_representatives(self, samples, labels, level_nodes):
+ samples_df = pd.DataFrame(samples, columns=self._features)
+ labels_df = pd.DataFrame(labels, columns=['label'])
+ samples_node_ids = self._find_sample_nodes(samples_df, level_nodes)
+ for cell in self.cells_:
+ cell['representative'] = {}
+ # get all rows in cell
+ indexes = [i for i, x in enumerate(samples_node_ids) if x == cell['id']]
+ sample_rows = samples_df.iloc[indexes]
+ sample_labels = labels_df.iloc[indexes]['label'].values.tolist()
+ # get rows with matching label
+ indexes = [i for i, label in enumerate(sample_labels) if label == cell['label']]
+ match_samples = sample_rows.iloc[indexes]
+ # find the "middle" of the cluster
+ array = match_samples.values
+ median = np.median(array, axis=0)
+ # find the record closest to the median
+            closest_index = 0
+            min_dist = float("inf")
+            for i, row in enumerate(array):
+                dist = distance.euclidean(row, median)
+                if dist < min_dist:
+                    min_dist = dist
+                    closest_index = i
+            row = match_samples.iloc[closest_index]
+ # use its values as the representative
+ for feature in cell['ranges'].keys():
+ cell['representative'][feature] = row[feature].item()
+
+ def _find_sample_nodes(self, samples, nodes):
+ paths = self.dt_.decision_path(samples).toarray()
+ nodeSet = set(nodes)
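+        # each sample's decision path intersects exactly one node from `nodes` (a node at the requested
+        # depth, or a shallower leaf), so the single element of the intersection identifies its cell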
+ return [(list(set([i for i, v in enumerate(p) if v == 1]) & nodeSet))[0] for p in paths]
+
+ def _generalize(self, data, level_nodes, cells, cells_by_id):
+ representatives = pd.DataFrame(columns=self._features) # empty except for columns
+ generalized = pd.DataFrame(data, columns=self._features, copy=True) # original data
+ mapping_to_cells = self._map_to_cells(generalized, level_nodes, cells_by_id)
+ # iterate over cells (leaves in decision tree)
+ for i in range(len(cells)):
+ # This code just copies the representatives from the cells into another data structure
+ # iterate over features
+ for feature in self._features:
+ # if feature has a representative value in the cell and should not be left untouched,
+ # take the representative value
+ if feature in cells[i]['representative'] and ('untouched' not in cells[i] or
+ feature not in cells[i]['untouched']):
+ representatives.loc[i, feature] = cells[i]['representative'][feature]
+ # else, drop the feature (removes from representatives columns that do not have a
+ # representative value or should remain untouched)
+ elif feature in representatives.columns.tolist():
+ representatives = representatives.drop(feature, axis=1)
+
+ # get the indexes of all records that map to this cell
+ indexes = [j for j in range(len(mapping_to_cells)) if mapping_to_cells[j]['id'] == cells[i]['id']]
+ # replaces the values in the representative columns with the representative values
+ # (leaves others untouched)
+ if not representatives.columns.empty:
+ if len(indexes) > 1:
+ replace = pd.concat([representatives.loc[i].to_frame().T]*len(indexes)).reset_index(drop=True)
+ replace.index = indexes
+ else:
+ replace = representatives.loc[i].to_frame().T
+ generalized.loc[indexes, representatives.columns] = replace
+
+ return generalized.to_numpy()
+
+ def _map_to_cells(self, samples, nodes, cells_by_id):
+ mapping_to_cells = []
+ for index, row in samples.iterrows():
+ cell = self._find_sample_cells([row], nodes, cells_by_id)[0]
+ mapping_to_cells.append(cell)
+ return mapping_to_cells
+
+ def _find_sample_cells(self, samples, nodes, cells_by_id):
+ node_ids = self._find_sample_nodes(samples, nodes)
+ return [cells_by_id[nodeId] for nodeId in node_ids]
+
+ def _remove_feature_from_generalization(self, samples, nodes, labels, feature_data):
+ feature = self._get_feature_to_remove(samples, nodes, labels, feature_data)
+ if not feature:
+ return None
+ GeneralizeToRepresentative._remove_feature_from_cells(self.cells_, self.cells_by_id_, feature)
+ return feature
+
+ def _get_feature_to_remove(self, samples, nodes, labels, feature_data):
+ # We want to remove features with low iLoss (NCP) and high accuracy gain
+ # (after removing them)
+ ranges = self.generalizations_['ranges']
+ range_counts = self._find_range_count(samples, ranges)
+ total = samples.size
+ range_min = sys.float_info.max
+ remove_feature = None
+
+ for feature in ranges.keys():
+ if feature not in self.generalizations_['untouched']:
+ feature_ncp = self._calc_ncp_numeric(ranges[feature],
+ range_counts[feature],
+ feature_data[feature],
+ total)
+ if feature_ncp > 0:
+ # divide by accuracy gain
+ new_cells = copy.deepcopy(self.cells_)
+ cells_by_id = copy.deepcopy(self.cells_by_id_)
+ GeneralizeToRepresentative._remove_feature_from_cells(new_cells, cells_by_id, feature)
+ generalized = self._generalize(samples, nodes, new_cells, cells_by_id)
+ accuracy = self.estimator.score(generalized, labels)
+ feature_ncp = feature_ncp / accuracy
+ if feature_ncp < range_min:
+ range_min = feature_ncp
+ remove_feature = feature
+
+ print('feature to remove: ' + (remove_feature if remove_feature else ''))
+ return remove_feature
+
+ def _calculate_generalizations(self):
+ self.generalizations_ = {'ranges': GeneralizeToRepresentative._calculate_ranges(self.cells_),
+ 'untouched': GeneralizeToRepresentative._calculate_untouched(self.cells_)}
+
+ def _find_range_count(self, samples, ranges):
+ samples_df = pd.DataFrame(samples, columns=self._features)
+ range_counts = {}
+ last_value = None
+ for r in ranges.keys():
+ range_counts[r] = []
+ # if empty list, all samples should be counted
+ if not ranges[r]:
+ range_counts[r].append(samples_df.shape[0])
+ else:
+ for value in ranges[r]:
+ range_counts[r].append(len(samples_df.loc[samples_df[r] <= value]))
+ last_value = value
+ range_counts[r].append(len(samples_df.loc[samples_df[r] > last_value]))
+ return range_counts
+
+ def _calculate_ncp(self, samples, generalizations, feature_data):
+        # suppressed features are already taken care of within _calc_ncp_numeric
+ ranges = generalizations['ranges']
+ range_counts = self._find_range_count(samples, ranges)
+ total = samples.shape[0]
+ total_ncp = 0
+ total_features = len(generalizations['untouched'])
+ for feature in ranges.keys():
+ feature_ncp = GeneralizeToRepresentative._calc_ncp_numeric(ranges[feature], range_counts[feature], feature_data[feature], total)
+ total_ncp = total_ncp + feature_ncp
+ total_features += 1
+ if total_features == 0:
+ return 0
+ return total_ncp / total_features
+
+ @staticmethod
+ def _calculate_ranges(cells):
+ ranges = {}
+ for cell in cells:
+ for feature in [key for key in cell['ranges'].keys() if
+ 'untouched' not in cell or key not in cell['untouched']]:
+ if feature not in ranges.keys():
+ ranges[feature] = []
+ if cell['ranges'][feature]['start'] is not None:
+ ranges[feature].append(cell['ranges'][feature]['start'])
+ if cell['ranges'][feature]['end'] is not None:
+ ranges[feature].append(cell['ranges'][feature]['end'])
+ for feature in ranges.keys():
+ ranges[feature] = list(set(ranges[feature]))
+ ranges[feature].sort()
+ return ranges
+
+ @staticmethod
+ def _calculate_untouched(cells):
+ untouched_lists = [cell['untouched'] if 'untouched' in cell else [] for cell in cells]
+ untouched = set(untouched_lists[0])
+ untouched = untouched.intersection(*untouched_lists)
+ return list(untouched)
+
+ @staticmethod
+ def _calc_ncp_numeric(feature_range, range_count, feature_data, total):
+        # if there are no ranges, feature is suppressed and iLoss is 1
+ if not feature_range:
+ return 1
+ # range only contains the split values, need to add min and max value of feature
+ # to enable computing sizes of all ranges
+ new_range = [feature_data['min']] + feature_range + [feature_data['max']]
+ range_sizes = [b - a for a, b in zip(new_range[::1], new_range[1::1])]
+ normalized_range_sizes = [s * n / total for s, n in zip(range_sizes, range_count)]
+ average_range_size = sum(normalized_range_sizes) / len(normalized_range_sizes)
+ return average_range_size / (feature_data['max'] - feature_data['min'])
+
+
+ @staticmethod
+ def _remove_feature_from_cells(cells, cells_by_id, feature):
+ for cell in cells:
+ if 'untouched' not in cell:
+ cell['untouched'] = []
+ if feature in cell['ranges'].keys():
+ del cell['ranges'][feature]
+ else:
+ del cell['categories'][feature]
+ cell['untouched'].append(feature)
+ cells_by_id[cell['id']] = cell.copy()
+
+
diff --git a/apt/utils.py b/apt/utils.py
index b7aa78a..086492f 100644
--- a/apt/utils.py
+++ b/apt/utils.py
@@ -2,7 +2,7 @@
import sklearn.preprocessing
import pandas as pd
import ssl
-from os import path
+from os import path, mkdir
from six.moves.urllib.request import urlretrieve
@@ -40,9 +40,13 @@ def get_adult_dataset():
'label']
train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
+ data_dir = '../datasets/adult'
train_file = '../datasets/adult/train'
test_file = '../datasets/adult/test'
+ if not path.exists(data_dir):
+ mkdir(data_dir)
+
ssl._create_default_https_context = ssl._create_unverified_context
if not path.exists(train_file):
urlretrieve(train_url, train_file)
@@ -139,8 +143,12 @@ def get_nursery_dataset(raw: bool = True, test_set: float = 0.2, transform_socia
:return: Dataset and labels as pandas dataframes.
"""
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.data'
+ data_dir = '../datasets/nursery'
data_file = '../datasets/nursery/data'
+ if not path.exists(data_dir):
+ mkdir(data_dir)
+
ssl._create_default_https_context = ssl._create_unverified_context
if not path.exists(data_file):
urlretrieve(url, data_file)
diff --git a/datasets/.gitignore b/datasets/.gitignore
new file mode 100644
index 0000000..86d0cb2
--- /dev/null
+++ b/datasets/.gitignore
@@ -0,0 +1,4 @@
+# Ignore everything in this directory
+*
+# Except this file
+!.gitignore
\ No newline at end of file
diff --git a/docs/conf.py b/docs/conf.py
index a1843e1..0b26b58 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -22,8 +22,9 @@
author = 'Abigail Goldsteen'
# The full version, including alpha/beta/rc tags
-release = '0.0.1'
+release = '0.0.3'
+master_doc = 'index'
# -- General configuration ---------------------------------------------------
diff --git a/docs/index.rst b/docs/index.rst
index be954dc..6a1969d 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,12 +8,16 @@ Welcome to ai-privacy-toolkit's documentation!
This project provides tools for assessing and improving the privacy and compliance of AI models.
-The first release of this toolkit contains a single module called anonymization. This
-module contains methods for anonymizing ML model training data, so that when
-a model is retrained on the anonymized data, the model itself will also be considered
-anonymous. This may help exempt the model from different obligations and restrictions
+The anonymization module contains methods for anonymizing ML model
+training data, so that when a model is retrained on the anonymized data, the model itself will also be
+considered anonymous. This may help exempt the model from different obligations and restrictions
set out in data protection regulations such as GDPR, CCPA, etc.
+The minimization module contains methods to help adhere to the data
+minimization principle in GDPR for ML models. It makes it possible to reduce the amount of
+personal data needed to perform predictions with a machine learning model, while still enabling the model
+to make accurate predictions. This is done by removing or generalizing some of the input features.
+
.. toctree::
:maxdepth: 2
:caption: Getting Started:
diff --git a/docs/source/apt.anonymization.rst b/docs/source/apt.anonymization.rst
index 6453554..727706b 100644
--- a/docs/source/apt.anonymization.rst
+++ b/docs/source/apt.anonymization.rst
@@ -8,15 +8,15 @@ apt.anonymization.anonymizer module
-----------------------------------
.. automodule:: apt.anonymization.anonymizer
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
Module contents
---------------
.. automodule:: apt.anonymization
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/apt.rst b/docs/source/apt.rst
index 372f81e..fbe1c02 100644
--- a/docs/source/apt.rst
+++ b/docs/source/apt.rst
@@ -5,9 +5,9 @@ Subpackages
-----------
.. toctree::
- :maxdepth: 4
- apt.anonymization
+ apt.anonymization
+ apt.minimization
Submodules
----------
@@ -16,15 +16,15 @@ apt.utils module
----------------
.. automodule:: apt.utils
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
Module contents
---------------
.. automodule:: apt
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/tests.rst b/docs/source/tests.rst
index 3983caf..b1428e0 100644
--- a/docs/source/tests.rst
+++ b/docs/source/tests.rst
@@ -8,15 +8,23 @@ tests.test\_anonymizer module
-----------------------------
.. automodule:: tests.test_anonymizer
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+tests.test\_minimizer module
+----------------------------
+
+.. automodule:: tests.test_minimizer
+ :members:
+ :undoc-members:
+ :show-inheritance:
Module contents
---------------
.. automodule:: tests
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/requirements.txt b/requirements.txt
index ba69642..cf7d35b 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,4 @@
-numpy==1.19.0
+numpy==1.19.0
pandas==1.1.0
scipy==1.4.1
scikit-learn==0.22.2
diff --git a/tests/test_minimizer.py b/tests/test_minimizer.py
new file mode 100644
index 0000000..9cb59e1
--- /dev/null
+++ b/tests/test_minimizer.py
@@ -0,0 +1,64 @@
+import pytest
+import numpy as np
+
+from sklearn.datasets import load_boston
+
+from apt.minimization import GeneralizeToRepresentative
+from sklearn.tree import DecisionTreeClassifier
+
+
+@pytest.fixture
+def data():
+ return load_boston(return_X_y=True)
+
+def test_minimizer_params(data):
+ # Assume two features, age and height, and boolean label
+ cells = [{"id": 1, "ranges": {"age": {"start": None, "end": 38}, "height": {"start": None, "end": 170}}, "label": 0,
+ "representative": {"age": 26, "height": 149}},
+ {"id": 2, "ranges": {"age": {"start": 39, "end": None}, "height": {"start": None, "end": 170}}, "label": 1,
+ "representative": {"age": 58, "height": 163}},
+ {"id": 3, "ranges": {"age": {"start": None, "end": 38}, "height": {"start": 171, "end": None}}, "label": 0,
+ "representative": {"age": 31, "height": 184}},
+ {"id": 4, "ranges": {"age": {"start": 39, "end": None}, "height": {"start": 171, "end": None}}, "label": 1,
+ "representative": {"age": 45, "height": 176}}
+ ]
+ features = ['age', 'height']
+ X = np.array([[23, 165],
+ [45, 158],
+ [18, 190]])
+ print(X.dtype)
+ y = [1,1,0]
+ base_est = DecisionTreeClassifier()
+ base_est.fit(X, y)
+
+ gen = GeneralizeToRepresentative(base_est, features=features, cells=cells)
+ gen.fit()
+ transformed = gen.transform(X)
+ print(transformed)
+
+def test_minimizer_fit(data):
+ features = ['age', 'height']
+ X = np.array([[23, 165],
+ [45, 158],
+ [56, 123],
+ [67, 154],
+ [45, 149],
+ [42, 166],
+ [73, 172],
+ [94, 168],
+ [69, 175],
+ [24, 181],
+ [18, 190]])
+ print(X)
+ y = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
+ base_est = DecisionTreeClassifier()
+ base_est.fit(X, y)
+ predictions = base_est.predict(X)
+
+ gen = GeneralizeToRepresentative(base_est, features=features, target_accuracy=0.5)
+ gen.fit(X, predictions)
+ transformed = gen.transform(X)
+ print(X)
+ print(transformed)
+
+