IBM · abigailgold · Jul 12, 2021
diff --git a/README.md b/README.md
@@ -6,20 +6,23 @@
 
 A toolkit for tools and techniques related to the privacy and compliance of AI models.
 
-The first release of this toolkit contains a single module called [**anonymization**](apt/anonymization/README.md).
-This module contains methods for anonymizing ML model training data, so that when 
-a model is retrained on the anonymized data, the model itself will also be considered 
-anonymous. This may help exempt the model from different obligations and restrictions 
+The [**anonymization**](apt/anonymization/README.md) module contains methods for anonymizing ML model 
+training data, so that when a model is retrained on the anonymized data, the model itself will also be 
+considered anonymous. This may help exempt the model from different obligations and restrictions 
 set out in data protection regulations such as GDPR, CCPA, etc. 
 
+The [**minimization**](apt/minimization/README.md) module contains methods to help adhere to the data 
+minimization principle in GDPR for ML models. It makes it possible to reduce the amount of 
+personal data needed to perform predictions with a machine learning model, while still enabling the model
+to make accurate predictions. This is done by removing or generalizing some of the input features.
+
 Official ai-privacy-toolkit documentation: https://ai-privacy-toolkit.readthedocs.io/en/latest/
 
 Installation: pip install ai-privacy-toolkit
 
 **Related toolkits:**
 
-[ai-minimization-toolkit](https://github.com/IBM/ai-minimization-toolkit): A toolkit for 
-reducing the amount of personal data needed to perform predictions with a machine learning model
+ai-minimization-toolkit - has been migrated into this toolkit.
 
 [differential-privacy-library](https://github.com/IBM/differential-privacy-library): A 
 general-purpose library for experimenting with, investigating and developing applications in, 

diff --git a/apt/__init__.py b/apt/__init__.py
@@ -3,6 +3,7 @@
 """
 
 from apt import anonymization
+from apt import minimization
 from apt import utils
 
-__version__ = "0.0.2"
+__version__ = "0.0.3"
diff --git a/apt/minimization/README.md b/apt/minimization/README.md
@@ -0,0 +1,110 @@
+# data minimization module
+
+The EU General Data Protection Regulation (GDPR) mandates the principle of data minimization, which requires that only 
+data necessary to fulfill a certain purpose be collected. However, it can often be difficult to determine the minimal 
+amount of data required, especially in complex machine learning models such as neural networks. 
+
+This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform 
+predictions with a machine learning model, by removing or generalizing some of the input features. The type of data 
+minimization this toolkit focuses on is the reduction of the number and/or granularity of features collected for analysis. 
+
+The generalization process searches for several similar records and groups them together. Then, for each 
+feature, the individual values for that feature within each group are replaced with a representative value that is 
+common across the whole group. This process is done while using knowledge encoded within the model to produce a 
+generalization that has little to no impact on its accuracy. 
+
+For more information about the method see: http://export.arxiv.org/pdf/2008.04113
+
+The following figure depicts the overall process:
+
+<p align="center">
+  <img src="../../docs/images/AI_Privacy_project.jpg?raw=true" width="667" title="data minimization process">
+</p>
+<br />
+
+Usage
+-----
+
+The main class, ``GeneralizeToRepresentative``, is a scikit-learn compatible ``Transformer`` that receives an existing 
+estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for 
+analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies 
+them to new data.
+
+It is also possible to export the generalizations as feature ranges.
+
+The current implementation supports only numeric features, so any categorical features must be transformed to a numeric 
+representation before using this class.
+
+Start by training your machine learning model. In this example, we will use a ``DecisionTreeClassifier``, but any 
+scikit-learn model can be used. We will use the iris dataset in our example.
+
+.. code:: python
+
+  from sklearn import datasets
+  from sklearn.model_selection import train_test_split
+  from sklearn.tree import DecisionTreeClassifier
+
+  dataset = datasets.load_iris()
+  X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)
+
+  base_est = DecisionTreeClassifier()
+  base_est.fit(X_train, y_train)
+
+Now create the ``GeneralizeToRepresentative`` transformer and train it. Supply it with the original model and the 
+desired target accuracy. The training process may receive the original labeled training data or the model's predictions 
+on the data.
+
+.. code:: python
+
+  predictions = base_est.predict(X_train)
+  gen = GeneralizeToRepresentative(base_est, target_accuracy=0.9)
+  gen.fit(X_train, predictions)
+
+Now use the transformer to transform new data, for example the test data.
+
+.. code:: python
+
+  transformed = gen.transform(X_test)
+
+The transformed data has the same columns and formats as the original data, so it can be used directly to derive 
+predictions from the original model.
+
+.. code:: python
+
+  new_predictions = base_est.predict(transformed)
+
+To export the resulting generalizations, retrieve the ``Transformer``'s ``_generalize`` attribute.
+
+.. code:: python
+
+  generalizations = gen._generalize
+
+The returned object has the following structure::
+
+  {
+    ranges: 
+    {
+      list of (<feature name>: [<list of values>])
+    }, 
+    untouched: [<list of feature names>]
+  }
+
+For example::
+
+  {
+    ranges: 
+    {
+      age: [21.5, 39.0, 51.0, 70.5], 
+      education-years: [8.0, 12.0, 14.5]
+    }, 
+    untouched: ["occupation", "marital-status"]
+  }
+
+Where each value inside the range list represents a cutoff point. For example, for the ``age`` feature, the ranges in 
+this example are: ``<21.5, 21.5-39.0, 39.0-51.0, 51.0-70.5, >70.5``. The ``untouched`` list represents features that 
+were not generalized, i.e., their values should remain unchanged.
+
+
+
+
+
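The cutoff semantics described at the end of the module README can be illustrated with the standard library alone. The sketch below is not part of the toolkit: `range_label` is a hypothetical helper, the dictionary is the README's example rewritten as a Python literal, and the handling of a value that equals a cutoff (assigned to the range above it) is an assumption the README does not settle.

```python
from bisect import bisect_right


def range_label(value, cutoffs):
    """Return the README-style label of the range that `value` falls into.

    `cutoffs` is a sorted list of cutoff points, as found under `ranges`
    in the exported generalizations.
    Assumption: a value equal to a cutoff belongs to the range above it.
    """
    idx = bisect_right(cutoffs, value)  # number of cutoffs <= value
    if idx == 0:
        return f"<{cutoffs[0]}"
    if idx == len(cutoffs):
        return f">{cutoffs[-1]}"
    return f"{cutoffs[idx - 1]}-{cutoffs[idx]}"


# The README's example structure, written as a Python dict.
generalizations = {
    "ranges": {
        "age": [21.5, 39.0, 51.0, 70.5],
        "education-years": [8.0, 12.0, 14.5],
    },
    "untouched": ["occupation", "marital-status"],
}

age_cutoffs = generalizations["ranges"]["age"]
labels = [range_label(v, age_cutoffs) for v in (18, 30, 51.0, 80)]
print(labels)  # ['<21.5', '21.5-39.0', '51.0-70.5', '>70.5']
```

Four cutoffs yield five ranges, which matches the list given for the `age` feature in the README. Features named under `untouched` have no cutoffs and would be passed through as-is.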
diff --git a/apt/minimization/__init__.py b/apt/minimization/__init__.py
@@ -0,0 +1,19 @@
+"""
+Module providing data minimization for ML.
+
+This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform
+predictions with a machine learning model, by removing or generalizing some of the input features. For more information
+about the method see: http://export.arxiv.org/pdf/2008.04113
+
+The main class, ``GeneralizeToRepresentative``, is a scikit-learn compatible ``Transformer`` that receives an existing
+estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for
+analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies
+them to new data.
+
+It is also possible to export the generalizations as feature ranges.
+
+The current implementation supports only numeric features, so any categorical features must be transformed to a numeric
+representation before using this class.
+
+"""
+from apt.minimization.minimizer import GeneralizeToRepresentative
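To make the `fit()` / `transform()` split in the docstring concrete, here is a toy sketch of the "replace with a representative value" idea for a single numeric feature. It is deliberately not the toolkit's algorithm: `ToyRangeGeneralizer` is a hypothetical class, the groups come from fixed cutoffs supplied by the caller rather than from knowledge encoded in a model, and the choice of the median as the representative is an assumption made for illustration.

```python
from bisect import bisect_right
from statistics import median


class ToyRangeGeneralizer:
    """Hypothetical stand-in, NOT the toolkit's GeneralizeToRepresentative."""

    def __init__(self, cutoffs):
        self.cutoffs = sorted(cutoffs)
        self.representatives_ = {}

    def _bucket(self, value):
        # Index of the range the value falls into (0 .. len(cutoffs)).
        return bisect_right(self.cutoffs, value)

    def fit(self, values):
        # Group training values by range, then keep one representative
        # (here: the median) per range.
        buckets = {}
        for v in values:
            buckets.setdefault(self._bucket(v), []).append(v)
        self.representatives_ = {i: median(vs) for i, vs in buckets.items()}
        return self

    def transform(self, values):
        # Replace each value with its range's representative; values in a
        # range never seen during fit are left unchanged.
        return [self.representatives_.get(self._bucket(v), v) for v in values]


ages = [18, 20, 25, 30, 35, 45, 60, 75]
toy = ToyRangeGeneralizer([21.5, 39.0, 51.0, 70.5]).fit(ages)
generalized = toy.transform([19, 33, 47, 90])
print(generalized)
```

Every new value is mapped to a value already seen in its range, so the output keeps the same column and numeric format as the input, which is the property the module relies on when it feeds transformed data back to the original model.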