# Applying data minimization to a trained regression ML model

In this tutorial we show how to perform data minimization for regression ML models using the minimization module.

We apply data minimization to two different trained regression models: a decision tree regressor and a linear regression model.

## Load data
The `QI` list, passed later to the minimizer as `features_to_minimize`, determines which features will be minimized.

In [6]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

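import sys
# Make a local checkout of the toolkit importable; replace this path with the location of your own clone.
sys.path.append('/Users/leodom01/Repos/ai-privacy-DataProtectionTechonolgies')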

import warnings
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

dataset = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=14)

features = ['age', 'sex', 'bmi', 'bp',
                's1', 's2', 's3', 's4', 's5', 's6']
QI = ['age', 'bmi', 's2', 's5', 's6']

print(X_train)
print(y_train)

[[-0.07090025  0.05068012 -0.08919748 ... -0.00259226 -0.01290868
  -0.05492509]
 [-0.08179786  0.05068012  0.04229559 ...  0.1081111   0.04719048
  -0.03835666]
 [-0.05637009 -0.04464164 -0.01159501 ... -0.03949338 -0.00797714
  -0.08806194]
 ...
 [ 0.06350368  0.05068012  0.08864151 ...  0.07120998  0.02929656
   0.07348023]
 [-0.10722563 -0.04464164 -0.01159501 ...  0.03430886  0.00702714
  -0.03007245]
 [ 0.02717829 -0.04464164  0.04984027 ...  0.05275942 -0.05296264
  -0.0052198 ]]
[104. 137. 190. 220. 171.  70. 128. 292. 178. 127. 310. 150.  39.  65.
 110.  53.  71.  77.  47. 175. 275. 283.  77.  97.  92. 258.  66. 202.
 230. 220. 182. 103. 217. 277. 281. 142.  63. 137.  90. 139.  63. 140.
 332.  71. 225.  93. 268.  99.  88. 182. 232. 162. 293.  90.  71.  51.
  77. 124. 190. 152. 212. 115. 116. 179.  96. 139. 192.  42. 180. 111.
 177.  81. 198. 131. 230. 197.  64. 321. 275. 214. 210. 122. 141. 121.
 191. 126. 168. 277. 111.  68. 265. 172. 129.  84. 153. 174. 252. 196.
 196. 185. 
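To make the role of `QI` concrete, the short sketch below maps each name in `QI` to its column position in the feature matrix and lists the features that are not selected for minimization. The two lists are copied from the cell above; the variable names `qi_indices` and `untouched` are our own and are not part of the toolkit.

```python
# Feature names and QI list copied from the data-loading cell.
features = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
QI = ['age', 'bmi', 's2', 's5', 's6']

# Column index of each QI feature in X_train / X_test.
qi_indices = [features.index(name) for name in QI]

# Features that are not passed to the minimizer for generalization.
# (Illustrative name, not a toolkit attribute.)
untouched = [name for name in features if name not in QI]

print(qi_indices)  # [0, 2, 5, 8, 9]
print(untouched)   # ['sex', 'bp', 's1', 's3', 's4']
```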

## Train DecisionTreeRegressor model

In [8]:
from apt.minimization import GeneralizeToRepresentative
from sklearn.tree import DecisionTreeRegressor

model1 = DecisionTreeRegressor(random_state=10, min_samples_split=2)
model1.fit(X_train, y_train)
print('Base model accuracy (R2 score): ', model1.score(X_test, y_test))

Base model accuracy (R2 score):  0.15014421352446072
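The "accuracy" reported for a scikit-learn regressor by `score` is the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up toy values (they are not taken from the diabetes data) to show how to read the number: 1.0 is a perfect fit, 0.0 is no better than always predicting the mean, and negative values are worse than that.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy targets, invented for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]

perfect = r2_score(y_true, y_true)                           # predictions equal the targets
mean_only = r2_score(y_true, [175.0] * 4)                    # always predict the mean
reversed_fit = r2_score(y_true, [250.0, 200.0, 150.0, 100.0])  # systematically wrong

print(perfect, mean_only, reversed_fit)  # 1.0 0.0 -3.0
```

Seen this way, the base score of about 0.15 above means the decision tree explains only a small share of the variance in the test targets.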


## Run minimization
We run minimization on only a subset of the features (the `QI` list).

In [12]:
# Note that the is_regression parameter is set to True, since the model is a regressor.

minimizer1 = GeneralizeToRepresentative(model1, target_accuracy=0.7, is_regression=True,
                                    features_to_minimize=QI)

# Fitting the minimizer can be done on either training or test data. Using test data is preferable, because the
# resulting accuracy on test data will be closer to the desired target accuracy (fitting on training
# data can leave a larger gap).
# Don't forget to leave a hold-out set for final validation!
X_generalizer_train1, x_test1, y_generalizer_train1, y_test1 = train_test_split(X_test, y_test,
                                                                test_size = 0.4, random_state = 38)

# The minimizer is fit on the model's own predictions, not on the true labels.
x_train_predictions1 = model1.predict(X_generalizer_train1)
minimizer1.fit(X_generalizer_train1, x_train_predictions1, features_names=features)
transformed1 = minimizer1.transform(x_test1, features_names=features)
print('Accuracy on minimized data: ', model1.score(transformed1, y_test1))
print('generalizations: ', minimizer1.generalizations)

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.108922
Improving accuracy
feature to remove: s5
Removed feature: s5, new relative accuracy: 0.505498
feature to remove: bmi
Removed feature: bmi, new relative accuracy: 0.716972
Accuracy on minimized data:  0.1116122925781402
generalizations:  {'ranges': {'age': [-0.07090024650096893, -0.043656209483742714, -0.041839939542114735, -0.03639113181270659, -0.01459590089507401, -0.012779632292222232, -0.009147093165665865, -0.0036982858437113464, 0.03989217430353165, 0.039892176166176796, 0.05623859912157059, 0.06713621318340302], 's2': [-0.0550188384950161, -0.0285577941685915, -0.024643437936902046, -0.02135537937283516, -0.013683241792023182, -0.006480826530605555, 0.009176596067845821, 0.023111702874302864, 0.02420772146433592, 0.02655633445829153, 0.039082273840904236], 's6': [-0.052854035049676895, -0.03835666086524725, -0.02593033987795
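One way to read the `ranges` entry printed above is as a sorted list of split values per feature, which cut that feature's axis into intervals. The sketch below looks up the interval that contains a given value. It is a simplified illustration and not the toolkit's own transform logic: the helper `interval_for` and the thresholds are hypothetical, and treating intervals as `(low, high]` mirrors scikit-learn's `<=` split rule, which we assume rather than know the minimizer also follows.

```python
from bisect import bisect_left

def interval_for(value, thresholds):
    """Return the (low, high] interval of sorted thresholds containing value.

    None marks an open end below the first or above the last threshold.
    Hypothetical helper, not part of the toolkit.
    """
    i = bisect_left(thresholds, value)
    low = thresholds[i - 1] if i > 0 else None
    high = thresholds[i] if i < len(thresholds) else None
    return low, high

# Hypothetical thresholds, chosen for readability (not copied from the output).
thresholds = [-0.05, 0.0, 0.05]

below = interval_for(-0.07, thresholds)
inside = interval_for(0.02, thresholds)
above = interval_for(0.09, thresholds)
on_edge = interval_for(0.0, thresholds)

print(below)    # (None, -0.05)
print(inside)   # (0.0, 0.05)
print(above)    # (0.05, None)
print(on_edge)  # (-0.05, 0.0)
```

The more split values a feature has, the narrower its intervals and the less that feature is generalized.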

## Train linear regression model

In [13]:
from sklearn.linear_model import LinearRegression
from apt.minimization import GeneralizeToRepresentative

model2 = LinearRegression()
model2.fit(X_train, y_train)
print('Base model accuracy (R2 score): ', model2.score(X_test, y_test))

Base model accuracy (R2 score):  0.5080563960651392


## Run minimization
We again run minimization on only a subset of the features (the `QI` list).

In [15]:
# Note that the is_regression parameter is set to True, since the model is a regressor.

minimizer2 = GeneralizeToRepresentative(model2, target_accuracy=0.7, is_regression=True,
                                    features_to_minimize=QI)

# Fitting the minimizer can be done on either training or test data. Using test data is preferable, because the
# resulting accuracy on test data will be closer to the desired target accuracy (fitting on training
# data can leave a larger gap).
# Don't forget to leave a hold-out set for final validation!
X_generalizer_train2, x_test2, y_generalizer_train2, y_test2 = train_test_split(X_test, y_test,
                                                                test_size = 0.4, random_state = 38)

# The minimizer is fit on the model's own predictions, not on the true labels.
x_train_predictions2 = model2.predict(X_generalizer_train2)
minimizer2.fit(X_generalizer_train2, x_train_predictions2, features_names=features)
transformed2 = minimizer2.transform(x_test2, features_names=features)
print('Accuracy on minimized data: ', model2.score(transformed2, y_test2))
print('generalizations: ', minimizer2.generalizations)

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.201734
Improving accuracy
feature to remove: s5
Removed feature: s5, new relative accuracy: 0.292914
feature to remove: age
Removed feature: age, new relative accuracy: 0.291507
feature to remove: s2
Removed feature: s2, new relative accuracy: 0.947873
Accuracy on minimized data:  0.46523158691549726
generalizations:  {'ranges': {'bmi': [-0.0660245232284069, -0.06171327643096447, -0.048779530450701714, -0.04770171828567982, -0.036923596635460854, -0.022912041284143925, -0.01644516922533512, -0.015906263142824173, -0.009978296235203743, 0.007266696775332093, 0.022356065921485424, 0.028822937980294228, 0.04499012045562267, 0.053073709830641747, 0.10103634744882584], 's6': [-0.07356456853449345, -0.052854035049676895, -0.048711927607655525, -0.046640874817967415, -0.044569820165634155, -0.0383566590026021, -0.021788232028484344, -0.017646125
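The log above reports a relative accuracy of about 0.95, while the score on the hold-out set is about 0.47. The two numbers answer different questions. According to the log message, the relative accuracy compares the model's output on generalized data with the original model's predictions, whereas `model2.score` compares against the true labels. The sketch below illustrates the gap with invented toy values, assuming both quantities are R2 scores; the variable names are ours.

```python
def r2(reference, predicted):
    """R2 of `predicted` measured against `reference`."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

# Toy values, invented for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]        # true labels
original_pred = [130.0, 140.0, 210.0, 220.0]  # model on original data
minimized_pred = [135.0, 135.0, 215.0, 215.0]  # model on generalized data

# How closely the minimized predictions track the original predictions.
relative = r2(original_pred, minimized_pred)
# How closely the minimized predictions track the true labels.
against_labels = r2(y_true, minimized_pred)

print(round(relative, 3), round(against_labels, 3))  # 0.985 0.768
```

A high relative accuracy therefore says that minimization barely changed the model's behavior; it cannot lift the score above what the base model achieves on the true labels.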