Add data minimization functionality to the ai-privacy-toolkit #3

Merged
merged 4 commits into from Jul 12, 2021
15 changes: 9 additions & 6 deletions README.md
@@ -6,20 +6,23 @@

A toolkit for tools and techniques related to the privacy and compliance of AI models.

The first release of this toolkit contains a single module called [**anonymization**](apt/anonymization/README.md).
This module contains methods for anonymizing ML model training data, so that when
a model is retrained on the anonymized data, the model itself will also be considered
anonymous. This may help exempt the model from different obligations and restrictions
The [**anonymization**](apt/anonymization/README.md) module contains methods for anonymizing ML model
training data, so that when a model is retrained on the anonymized data, the model itself will also be
considered anonymous. This may help exempt the model from different obligations and restrictions
set out in data protection regulations such as GDPR, CCPA, etc.

The [**minimization**](apt/minimization/README.md) module contains methods to help adhere to the data
minimization principle in GDPR for ML models. It makes it possible to reduce the amount of
personal data needed to perform predictions with a machine learning model, while still enabling the model
to make accurate predictions. This is done by removing or generalizing some of the input features.

Official ai-privacy-toolkit documentation: https://ai-privacy-toolkit.readthedocs.io/en/latest/

Installation: pip install ai-privacy-toolkit

**Related toolkits:**

[ai-minimization-toolkit](https://github.com/IBM/ai-minimization-toolkit): A toolkit for
reducing the amount of personal data needed to perform predictions with a machine learning model
ai-minimization-toolkit - has been migrated into this toolkit.

[differential-privacy-library](https://github.com/IBM/differential-privacy-library): A
general-purpose library for experimenting with, investigating and developing applications in,
3 changes: 2 additions & 1 deletion apt/__init__.py
@@ -3,6 +3,7 @@
"""

from apt import anonymization
from apt import minimization
from apt import utils

__version__ = "0.0.2"
__version__ = "0.0.3"
110 changes: 110 additions & 0 deletions apt/minimization/README.md
@@ -0,0 +1,110 @@
# data minimization module

The EU General Data Protection Regulation (GDPR) mandates the principle of data minimization, which requires that only
data necessary to fulfill a certain purpose be collected. However, it can often be difficult to determine the minimal
amount of data required, especially in complex machine learning models such as neural networks.

This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform
predictions with a machine learning model, by removing or generalizing some of the input features. The type of data
minimization this toolkit focuses on is the reduction of the number and/or granularity of features collected for analysis.

The generalization process searches for groups of similar records. Then, for each
feature, the individual values for that feature within each group are replaced with a representative value that is
common across the whole group. This process uses knowledge encoded within the model to produce a
generalization that has little to no impact on its accuracy.
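
As a rough illustration of the idea, the sketch below generalizes a single numeric feature: it groups values into ranges defined by cutoff points and replaces each value with the median of its group. The helper name `generalize_feature`, the choice of the median as the representative value, and the hand-picked cutoffs are all assumptions made for illustration. The module itself derives the groups using the model, which this sketch does not attempt.

```python
from bisect import bisect_right
from statistics import median


def generalize_feature(values, cutoffs):
    """Replace each value with the median of the values sharing its range.

    Hypothetical helper: `cutoffs` is a sorted list of range boundaries.
    """
    # Assign every value to the index of the range it falls into.
    groups = {}
    for v in values:
        groups.setdefault(bisect_right(cutoffs, v), []).append(v)
    # Use the group median as the representative value (an assumption).
    representatives = {idx: median(members) for idx, members in groups.items()}
    return [representatives[bisect_right(cutoffs, v)] for v in values]


ages = [23, 25, 31, 44, 47, 62]
print(generalize_feature(ages, cutoffs=[30.0, 50.0]))
# [24.0, 24.0, 44, 44, 44, 62]
```

After this step, six distinct ages collapse into three representative values, so less precise personal data needs to be collected.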

For more information about the method see: http://export.arxiv.org/pdf/2008.04113

The following figure depicts the overall process:

<p align="center">
<img src="../../docs/images/AI_Privacy_project.jpg?raw=true" width="667" title="data minimization process">
</p>
<br />

Usage
-----

The main class, ``GeneralizeToRepresentative``, is a scikit-learn compatible ``Transformer``, that receives an existing
estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for
analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies
them to new data.

It is also possible to export the generalizations as feature ranges.

The current implementation supports only numeric features, so any categorical features must be transformed to a numeric
representation before using this class.
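
One simple way to meet this requirement is to map each category to an integer code before fitting. The sketch below uses plain Python; the helper name `encode_categories` is hypothetical and not part of the toolkit. Note that an ordinal code like this imposes an arbitrary order on the categories, so a different encoding may suit some models better.

```python
def encode_categories(values):
    """Map each distinct category to an integer code (hypothetical helper)."""
    # Sort the distinct categories so the mapping is deterministic.
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


occupations = ["nurse", "clerk", "nurse", "engineer"]
codes, mapping = encode_categories(occupations)
print(codes)    # [2, 0, 2, 1]
print(mapping)  # {'clerk': 0, 'engineer': 1, 'nurse': 2}
```

Keep the returned mapping so that newly collected data can be encoded with the same codes before it is transformed.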

Start by training your machine learning model. In this example, we will use a ``DecisionTreeClassifier``, but any
scikit-learn model can be used. We will use the iris dataset in our example.

.. code:: python

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    dataset = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)

    base_est = DecisionTreeClassifier()
    base_est.fit(X_train, y_train)

Now create the ``GeneralizeToRepresentative`` transformer and train it. Supply it with the original model and the
desired target accuracy. The training process may receive the original labeled training data or the model's predictions
on the data.

.. code:: python

    from apt.minimization import GeneralizeToRepresentative

    predictions = base_est.predict(X_train)
    gen = GeneralizeToRepresentative(base_est, target_accuracy=0.9)
    gen.fit(X_train, predictions)

Now use the transformer to transform new data, for example the test data.

.. code:: python

    transformed = gen.transform(X_test)

The transformed data has the same columns and formats as the original data, so it can be used directly to derive
predictions from the original model.

.. code:: python

    new_predictions = base_est.predict(transformed)
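
To check how much the generalization changed the model's behavior, one option is to compare predictions on the transformed data with predictions on the original data. The sketch below is plain Python with made-up prediction lists. `prediction_agreement` is a hypothetical helper, not part of the toolkit, and whether it matches how `target_accuracy` is measured internally is not confirmed here.

```python
def prediction_agreement(original, generalized):
    """Fraction of samples on which two prediction sequences agree (hypothetical helper)."""
    if len(original) != len(generalized):
        raise ValueError("prediction sequences must have the same length")
    if len(original) == 0:
        raise ValueError("prediction sequences must not be empty")
    matches = sum(1 for a, b in zip(original, generalized) if a == b)
    return matches / len(original)


# Made-up predictions standing in for base_est.predict(X_test)
# and base_est.predict(transformed).
original_predictions = [0, 1, 2, 2, 1]
new_predictions = [0, 1, 2, 1, 1]
print(prediction_agreement(original_predictions, new_predictions))  # 0.8
```

A value close to 1.0 suggests the generalized data leads the model to nearly the same decisions as the original data.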

To export the resulting generalizations, retrieve the ``Transformer``'s ``_generalize`` parameter.

.. code:: python

    generalizations = gen._generalize

The returned object has the following structure::

    {
        ranges:
        {
            list of (<feature name>: [<list of values>])
        },
        untouched: [<list of feature names>]
    }

For example::

    {
        ranges:
        {
            age: [21.5, 39.0, 51.0, 70.5],
            education-years: [8.0, 12.0, 14.5]
        },
        untouched: ["occupation", "marital-status"]
    }

Each value in a range list is a cutoff point. For example, for the ``age`` feature, the ranges in
this example are: ``<21.5, 21.5-39.0, 39.0-51.0, 51.0-70.5, >70.5``. The ``untouched`` list contains the features that
were not generalized, i.e., their values should remain unchanged.
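
The cutoff points can be turned into range labels with a binary search. The sketch below is a minimal illustration using the standard library; `range_label` is a hypothetical helper, and placing a value that equals a cutoff into the upper range is an assumption, since the text above does not say which side a boundary value belongs to.

```python
from bisect import bisect_right


def range_label(value, cutoffs):
    """Return a label for the range that `value` falls into (hypothetical helper)."""
    # Index of the first cutoff strictly greater than the value.
    idx = bisect_right(cutoffs, value)
    if idx == 0:
        return f"<{cutoffs[0]}"
    if idx == len(cutoffs):
        return f">{cutoffs[-1]}"
    return f"{cutoffs[idx - 1]}-{cutoffs[idx]}"


age_cutoffs = [21.5, 39.0, 51.0, 70.5]
print(range_label(30, age_cutoffs))  # 21.5-39.0
print(range_label(18, age_cutoffs))  # <21.5
print(range_label(80, age_cutoffs))  # >70.5
```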





19 changes: 19 additions & 0 deletions apt/minimization/__init__.py
@@ -0,0 +1,19 @@
"""
Module providing data minimization for ML.

This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform
predictions with a machine learning model, by removing or generalizing some of the input features. For more information
about the method see: http://export.arxiv.org/pdf/2008.04113

The main class, ``GeneralizeToRepresentative``, is a scikit-learn compatible ``Transformer``, that receives an existing
estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for
analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies
them to new data.

It is also possible to export the generalizations as feature ranges.

The current implementation supports only numeric features, so any categorical features must be transformed to a numeric
representation before using this class.

"""
from apt.minimization.minimizer import GeneralizeToRepresentative