# Applying data minimization to a trained regression ML model

In this tutorial, we show how to perform data minimization for regression ML models using the `apt.minimization` module.

We apply data minimization to two different trained regression models: a decision tree regressor and a linear regression model.

## Load data
The `QI` list determines which features will be minimized. It holds column indices into the dataset, here the columns `age`, `bmi`, `s2`, `s5` and `s6`.

In [17]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

dataset = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=14)

features = ['age', 'sex', 'bmi', 'bp',
            's1', 's2', 's3', 's4', 's5', 's6']
QI = [0, 2, 5, 8, 9]
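
Because `QI` is given as column indices, it can help to resolve them to feature names before reading the minimization logs, which refer to features by name. A minimal sketch using only the two lists defined above (`qi_names` is a name introduced here for illustration):

```python
features = ['age', 'sex', 'bmi', 'bp',
            's1', 's2', 's3', 's4', 's5', 's6']
QI = [0, 2, 5, 8, 9]

# Resolve each column index in QI to the matching feature name
qi_names = [features[i] for i in QI]
print(qi_names)  # ['age', 'bmi', 's2', 's5', 's6']
```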

## Train DecisionTreeRegressor model

In [18]:
from apt.minimization import GeneralizeToRepresentative
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=10, min_samples_split=2)
model.fit(X_train, y_train)
pred = model.predict(X_train)
print('Base model accuracy (R2 score): ', model.score(X_test, y_test))

Base model accuracy (R2 score):  0.15014421352446072
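
For scikit-learn regressors, `model.score` returns the coefficient of determination (R2). It compares the model's squared error with that of always predicting the mean of the targets, so it equals 1 for a perfect fit, 0 for the mean baseline, and drops below zero when the model does worse than that baseline. The sketch below spells out the standard definition in plain Python; the helper `r2` and the toy values are illustrative only and are not part of the tutorial's data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])     # reversed predictions

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```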


## Run minimization
We run minimization on only a subset of the features: the columns listed in `QI`.

In [19]:
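# Note: is_regression=True because the model is a regressor
gen = GeneralizeToRepresentative(model, target_accuracy=0.7, features=features, is_regression=True,
                                 features_to_minimize=QI)
# Fit the generalizations against the model's own predictions on the training data
gen.fit(X_train, pred)
# Apply the learned generalizations, then retrain the model on the generalized data
transformed = gen.transform(X_train)
model.fit(transformed, y_train)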
print('Base model accuracy (R2 score) after anonymization: ', model.score(X_test, y_test))

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.365257
Improving accuracy
feature to remove: s5
Removed feature: s5, new relative accuracy: 0.597736
feature to remove: s6
Removed feature: s6, new relative accuracy: 0.749938
Base model accuracy (R2 score) after anonymization:  -0.1704892941317131
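
The log above reports accuracy "relative to original model predictions" and shows it rising as features are removed from generalization. One plausible reading, which is an assumption and not confirmed by this tutorial, is that the model's predictions on generalized data are scored against its own predictions on the original data. The toy sketch below illustrates that idea in plain Python; `toy_model`, `generalize`, `relative_accuracy` and the bin widths are all hypothetical and do not reflect the library's internals.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def toy_model(row):
    # Hypothetical stand-in for a trained regressor
    return 2.0 * row[0] + row[1]

def generalize(row, widths):
    # Replace each listed column by the midpoint of the bin it falls in
    out = list(row)
    for col, width in widths.items():
        out[col] = math.floor(row[col] / width) * width + width / 2
    return out

X = [[0.1, 1.0], [0.4, 2.0], [1.2, 0.0], [1.9, 3.0]]
original_pred = [toy_model(row) for row in X]

def relative_accuracy(widths):
    # Score predictions on generalized rows against the original predictions
    generalized_pred = [toy_model(generalize(row, widths)) for row in X]
    return r2(original_pred, generalized_pred)

rel_both = relative_accuracy({0: 1.0, 1: 2.0})  # both columns generalized
rel_one = relative_accuracy({0: 1.0})           # column 1 left untouched
rel_none = relative_accuracy({})                # nothing generalized

print(rel_both, rel_one, rel_none)
```

In this toy setup, leaving a column ungeneralized moves the relative score closer to 1, which mirrors the pattern in the log where each removed feature raises the relative accuracy.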


## Train linear regression model

In [20]:
from sklearn.linear_model import LinearRegression
from apt.minimization import GeneralizeToRepresentative

model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_train)
print('Base model accuracy (R2 score): ', model.score(X_test, y_test))

Base model accuracy (R2 score):  0.5080618258593721


## Run minimization
We again run minimization on the columns listed in `QI`, this time for the linear regression model.

In [21]:
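# Note: is_regression=True because the model is a regressor
gen = GeneralizeToRepresentative(model, target_accuracy=0.7, features=features, is_regression=True,
                                 features_to_minimize=QI)
# Fit the generalizations against the model's own predictions on the training data
gen.fit(X_train, pred)
# Apply the learned generalizations, then retrain the model on the generalized data
transformed = gen.transform(X_train)
model.fit(transformed, y_train)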
print('Base model accuracy (R2 score) after anonymization: ', model.score(X_test, y_test))

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.282418
Improving accuracy
feature to remove: s2
Removed feature: s2, new relative accuracy: 0.791109
Base model accuracy (R2 score) after anonymization:  0.5031250541011055
