# 📊 **Antigranular** Heart Disease Prediction Contest (ft. **Harvard/OpenDP** and **TPDP**)

🎉 Welcome to a new [Antigranular](https://antigranular.com) contest in collaboration with the [TPDP Workshop](https://tpdp.journalprivacyconfidentiality.org/2024/) and [Harvard's OpenDP Community Meeting](https://opendp.org/)!

🩺 This time, we are focusing on [heart condition detection](https://en.wikipedia.org/wiki/Cardiovascular_disease) using our new [TensorFlow Privacy](https://github.com/tensorflow/privacy) and [Opacus (PyTorch)](https://opacus.ai/) models!

🦜 Any questions? Head over to our [Discord](https://discord.com/invite/KJwApgXs4s)!



## 🏃‍♂️ Getting Started

In this section we will install the Antigranular package and log in.




### 📦 Install Antigranular

This command installs the [Antigranular PyPI Package](https://pypi.org/project/antigranular/) in the local environment.


In [1]:
# Install the Antigranular package
!pip install antigranular &> /dev/null

### ✍ Login to the Enclave

Head over to [Competitions](https://www.antigranular.com/competitions) to find your `<client_id>`, `<client_secret>` and the competition's name, then copy that command here.

![img](https://docs.antigranular.com/shots/comp_cell.png)

In [2]:
import antigranular as ag
session = ag.login(<client_id>,<client_secret>, competition = "Heart Disease Prediction Hackathon")

Dataset "Heart Disease Prediction Hackathon Dataset" loaded to the kernel as heart_disease_prediction_hackathon_dataset
Key Name                       Value Type     
---------------------------------------------
train_y                        PrivateDataFrame
train_x                        PrivateDataFrame
test_x                         DataFrame      

Connected to Antigranular server session id: 43fa43e2-cb48-40d8-921a-298a95601aa3, the session will time out if idle for 25 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


### 🤖 Using AG

You can now simply use ``%%ag`` to run code on an enclave! You can always head over to our [Docs](https://docs.antigranular.com/) to learn more about AG, but for now, we can define train and test variables as follows.

In [3]:
%%ag
x_train = heart_disease_prediction_hackathon_dataset["train_x"]
y_train = heart_disease_prediction_hackathon_dataset["train_y"]
x_test = heart_disease_prediction_hackathon_dataset["test_x"]

### 🕵️‍♂️ Exploring data

Exploring data in Antigranular spends your epsilon budget. Be mindful of your usage, but remember that the less epsilon you use, the less accurate your results will be!
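To make the trade-off concrete, here is a minimal local sketch of a Laplace-mechanism mean on synthetic data. It is not how Antigranular computes its statistics; the helper names `dp_mean` and `mean_abs_error` and the synthetic `ages` array are assumptions made for illustration only. The noise scale is the sensitivity divided by epsilon, so a smaller epsilon means a noisier answer.

```python
import numpy as np

def dp_mean(values, lower, upper, eps, rng):
    """Laplace-mechanism estimate of the mean of values clipped to [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    # Changing one record moves the mean by at most (upper - lower) / n
    # (the number of rows n is treated as public in this sketch).
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(scale=sensitivity / eps)

rng = np.random.default_rng(0)
ages = rng.integers(21, 87, size=8000)  # synthetic stand-in, NOT the contest data

def mean_abs_error(eps, trials=2000):
    """Average absolute error of the private mean over repeated draws."""
    true_mean = ages.mean()
    errors = [abs(dp_mean(ages, 21, 86, eps, rng) - true_mean) for _ in range(trials)]
    return float(np.mean(errors))

for eps in (0.01, 0.1, 1.0):
    print(f"eps={eps}: mean absolute error = {mean_abs_error(eps):.4f}")
```

The printed error should shrink roughly tenfold each time epsilon grows tenfold, which is the trade-off described above.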

In [4]:
%%ag
x_train.info()

+----+----------+-------------+---------------+---------+------------+
|    | Column   | numerical   | categorical   | dtype   | bounds     |
|----+----------+-------------+---------------+---------+------------|
|  0 | age      | True        | False         | int64   | (21, 86)   |
|  1 | sex      | True        | False         | int64   | (0, 1)     |
|  2 | bp       | True        | False         | int64   | (80, 215)  |
|  3 | ch       | True        | False         | int64   | (102, 597) |
|  4 | bs       | True        | False         | int64   | (67, 157)  |
|  5 | phr      | True        | False         | int64   | (62, 222)  |
+----+----------+-------------+---------------+---------+------------+



In [5]:
%%ag
y_train.info()

+----+-----------+-------------+---------------+---------+----------+
|    | Column    | numerical   | categorical   | dtype   | bounds   |
|----+-----------+-------------+---------------+---------+----------|
|  0 | condition | True        | False         | int64   | (0, 1)   |
+----+-----------+-------------+---------------+---------+----------+



In [None]:
%%ag
# We can start by exploring the data, carefully using our epsilon
describe = x_train.describe(eps=0.1)
ag_print(describe)

               age          sex  ...           bs          phr
count  8016.000000  7885.000000  ...  7774.000000  8077.000000
mean     59.237499     0.726286  ...    98.681172   143.758544
std       5.315081     0.444097  ...    11.489598    13.397244
min      21.000000     0.000000  ...    67.000000    62.000000
25%      24.469890     0.000689  ...    77.201568   138.840002
50%      60.522819     0.976855  ...    94.001013   138.864569
75%      47.156685     0.890680  ...   117.633120   176.605836
max      80.664432     0.998731  ...   124.144124   190.977667

[8 rows x 6 columns]



In [None]:
%%ag
# We can start by exploring the data, carefully using our epsilon
describe = y_train.describe(eps=0.1)
ag_print(describe)

         condition
count  8014.000000
mean      0.537660
std       0.469280
min       0.000000
25%       0.015925
50%       0.999878
75%       0.931988
max       0.987946



In [None]:
%%ag
# x_test is a public test set, so we can print it without using epsilon
ag_print(x_test)

      age  sex   bp   ch   bs  phr
0      71    1  128  326   95  117
1      61    1  153  270   98  123
2      59    1  113  236  106  181
3      69    0  109  151  109  108
4      55    0  137  235  101  150
...   ...  ...  ...  ...  ...  ...
1995   60    1  128  261  112  143
1996   50    1  143  216   94  100
1997   64    1  120  172   87  142
1998   56    1  158  294   82  144
1999   69    0  117  559  112  157

[2000 rows x 6 columns]



### Standard Scaling

The most important point I added in this study came from reviewing the Antigranular documentation and finding `StandardScaler` in `op_diffprivlib.models` (`from op_diffprivlib.models import StandardScaler`), which has separate `fit_transform` and `transform` methods.

The `standard_scaler` from `op_pandas` (`from op_pandas import standard_scaler`) does not have `fit_transform` and `transform` methods. With it, the train set and the test set each have to be scaled by a separate `standard_scaler` call, which harms data consistency and creates a discrepancy between the two sets. So that the model learns and predicts on the same transformed scale, the test set should be transformed with the `StandardScaler` object fitted on the train set.

Data consistency is very important in data management. I am not a theoretician, but I think this point falls under that scope.
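The consistency argument can be illustrated locally with plain NumPy. This is a non-private sketch on synthetic data, not the `op_diffprivlib` implementation; the arrays and variable names are assumptions for illustration. When the test distribution is shifted, scaling it with its own statistics erases the shift, while reusing the train statistics preserves it.

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(loc=60.0, scale=5.0, size=(1000, 1))  # synthetic "age"-like column
test = rng.normal(loc=65.0, scale=5.0, size=(200, 1))    # deliberately shifted upward

# Consistent: statistics come from the train set only and are reused for the test set
mu, sigma = train.mean(axis=0), train.std(axis=0)
test_consistent = (test - mu) / sigma

# Inconsistent: the test set is standardized with its own statistics
test_independent = (test - test.mean(axis=0)) / test.std(axis=0)

print("mean after train-fitted scaling:", float(test_consistent.mean()))
print("mean after independent scaling: ", float(test_independent.mean()))
```

With the train-fitted scaler the shifted test set keeps a mean near 1 standard deviation, whereas independent scaling forces it to 0, so the model would see the two sets on different scales.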

In [6]:
%%ag
from op_diffprivlib.models import LinearRegression,RandomForestClassifier,LogisticRegression,StandardScaler

In [None]:
%%ag
X_scaler = StandardScaler(epsilon=0.2, bounds=([21, 80, 102, 67, 62], [86, 215, 597, 157, 222]))  # bounds for age, bp, ch, bs, phr; sex is not included
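The two lists passed as `bounds` are the per-column lower and upper limits reported by `x_train.info()`, in the same order as the columns that get scaled. A small local sketch of building that tuple from a dictionary, so the order cannot drift (the `column_bounds` dictionary and variable names are introduced here for illustration):

```python
# Bounds copied from the x_train.info() output shown earlier
column_bounds = {
    "age": (21, 86),
    "sex": (0, 1),
    "bp": (80, 215),
    "ch": (102, 597),
    "bs": (67, 157),
    "phr": (62, 222),
}

# Only the columns that are scaled; sex is left out
numeric_cols = ["age", "bp", "ch", "bs", "phr"]

lower = [column_bounds[c][0] for c in numeric_cols]
upper = [column_bounds[c][1] for c in numeric_cols]
bounds = (lower, upper)
print(bounds)
```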

In [None]:
%%ag
X_train_std = X_scaler.fit_transform(x_train[["age","bp", "ch", "bs", "phr"]])

In [None]:
%%ag
# The sex column is categorical (already 0/1), so it was kept out of the scaling and is joined back to the scaled dataframe here
X_train_std = X_train_std.join(x_train[['sex']])

In [None]:
%%ag
X_test_std = X_scaler.transform(x_test[["age","bp", "ch", "bs", "phr"]])

In [None]:
%%ag
ag_print(X_test_std)

In [None]:
%%ag
# The transform above returns raw data without column names, so we create a new dataframe with the column names
import pandas as pd
X_test_std = pd.DataFrame(X_test_std, columns=["age", "bp", "ch", "bs", "phr"])
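The same rebuild-and-join pattern can be tried locally with plain pandas. This is a sketch on made-up values, with a manual standardization standing in for the scaler's `transform`; passing `index=` explicitly is an extra precaution assumed here so that the later `join` aligns rows even if the original index is not the default range.

```python
import pandas as pd

# Made-up stand-in for a public test set
x_test_demo = pd.DataFrame({"age": [71, 61, 59], "sex": [1, 1, 0], "bp": [128, 153, 113]})
numeric_cols = ["age", "bp"]

# Stand-in for scaler.transform(...): a plain NumPy array with no column names
scaled = x_test_demo[numeric_cols].to_numpy(dtype=float)
scaled = (scaled - scaled.mean(axis=0)) / scaled.std(axis=0)

# Restore column names and the original index, then join the unscaled column back
scaled_df = pd.DataFrame(scaled, columns=numeric_cols, index=x_test_demo.index)
result = scaled_df.join(x_test_demo[["sex"]])
print(result)
```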




In [None]:
%%ag
X_test_std = X_test_std.join(x_test[['sex']])

In [None]:
%%ag

ag_print(X_test_std)

In [None]:
%%ag
# We can check the bounds of the scaled x_test data, so that they can be used as metadata
# when creating a PrivateDataFrame. This notebook does not use them, but a couple of
# experiments during model development did, like this:
# {"age":(-3.231322116161105,3.126608937803356),"sex":(0,1),"bp":(-2.3587053350012646,4.11330047355693),
#  "ch":(-2.0595143397556464,5.23177028955764),"bs":(-1.6838249760431203,2.707904986315708),
#  "phr":(-5.1514514050320885,4.214822075062706)}
for i in x_test.columns:
  ag_print(i,X_test_std[i].min(),X_test_std[i].max())
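Since `x_test` is public, its minimum and maximum can be read without spending epsilon. A local pandas sketch of collecting them straight into a dictionary of `(min, max)` tuples, the shape used for the metadata above (the `scaled_demo` frame and its values are made up for illustration):

```python
import pandas as pd

# Made-up stand-in for a scaled public test set
scaled_demo = pd.DataFrame({"age": [-1.2, 0.3, 0.9], "sex": [0, 1, 1]})

# One (min, max) tuple per column
metadata = {
    col: (float(scaled_demo[col].min()), float(scaled_demo[col].max()))
    for col in scaled_demo.columns
}
print(metadata)
```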

### 🎈 A quick solution

In this section we evaluate an editorial solution in AG using TensorFlow!

In [None]:
# I tried developing the model a number of times with different parameter values.
# The top-scoring model I submitted should be more or less this one.

In [None]:
%%ag
import tensorflow as tf
from op_pandas import standard_scaler, PrivateDataFrame
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from op_tensorflow import PrivateKerasModel, PrivateDataLoader


# Normal keras model
seqM = Sequential([
    Dense(32, activation='relu', input_shape=(6,)),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # Binary classification
])


dp_model = PrivateKerasModel(model=seqM, l2_norm_clip=1, noise_multiplier=1)

# Set the learning rate to 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)


dp_model.compile(
	optimizer = optimizer,
	loss = 'binary_crossentropy',
	metrics = ["accuracy"]
)
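For intuition about `l2_norm_clip` and `noise_multiplier`, here is a NumPy sketch of one DP-SGD gradient aggregation step as it is conventionally defined (and as TensorFlow Privacy defines these two parameters). Whether `PrivateKerasModel` follows exactly this recipe internally is an assumption; the function `dp_sgd_gradient` is hypothetical and exists only for this illustration.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds the clip threshold
    factors = np.minimum(1.0, l2_norm_clip / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    # Noise standard deviation is noise_multiplier * l2_norm_clip
    noise = rng.normal(0.0, noise_multiplier * l2_norm_clip, size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],    # norm 5.0, gets clipped to norm 1.0
                  [0.3, 0.4]])   # norm 0.5, left unchanged

print(dp_sgd_gradient(grads, l2_norm_clip=1.0, noise_multiplier=1.0, rng=rng))
```

A larger `noise_multiplier` gives stronger privacy per step at the cost of noisier updates, while `l2_norm_clip` bounds how much any single record can influence an update.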




In [None]:
%%ag
x_train_scaled = standard_scaler(x_train, eps=.1)
x_train_scaled.info()

+----+----------+-------------+---------------+---------+------------------------------------------+
|    | Column   | numerical   | categorical   | dtype   | bounds                                   |
|----+----------+-------------+---------------+---------+------------------------------------------|
|  0 | age      | True        | False         | float64 | (-3.434428441950818, 3.251320884665355)  |
|  1 | sex      | True        | False         | float64 | (-1.5464939599176255,                    |
|    |          |             |               |         | 0.7585069065360925)                      |
|  2 | bp       | True        | False         | float64 | (-1.9262570222101114, 2.969333123214706) |
|  3 | ch       | True        | False         | float64 | (-6.62666685526513, 15.939133150116538)  |
|  4 | bs       | True        | False         | float64 | (-1.2919946241806954,                    |
|    |          |             |               |         | 1.9166940009123714)              

In [None]:
%%ag
data_loader = PrivateDataLoader(feature_df=X_train_std , label_df=y_train, batch_size=32)

In [None]:
%%ag
dp_model.fit(x=data_loader, epochs=44, target_delta=1/128)
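The privacy cost of DP training grows with the number of noisy gradient steps, so it can help to work out what a given `batch_size` and `epochs` imply. A back-of-the-envelope sketch, assuming roughly 8000 training rows (rounded from the noisy `count` values printed earlier, so the real size may differ):

```python
import math

n_train = 8000    # assumption: rounded from the noisy counts in describe()
batch_size = 32
epochs = 44

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
sampling_rate = batch_size / n_train  # fraction of the data seen per step

print(steps_per_epoch, total_steps, sampling_rate)
```

Under that assumption the run takes 250 steps per epoch and 11000 noisy steps in total, each touching 0.4% of the data.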

In [None]:
%%ag
y_pred = dp_model.predict(PrivateDataFrame(X_test_std), label_columns=["output"])

 1/63 [..............................] - ETA: 18s
 4/63 [>.............................] - ETA: 1s 



Alternative way of creating a PrivateDataFrame with metadata:
%%ag
scaled_bounds = {
    "age": (-3.231322116161105, 3.126608937803356),
    "sex": (0, 1),
    "bp": (-2.3587053350012646, 4.11330047355693),
    "ch": (-2.0595143397556464, 5.23177028955764),
    "bs": (-1.6838249760431203, 2.707904986315708),
    "phr": (-5.1514514050320885, 4.214822075062706),
}
y_pred = dp_model.predict(PrivateDataFrame(X_test_std, metadata=scaled_bounds), label_columns=["output"])

In [None]:
%%ag
# Note that the predictions are floats between 0 and 1,
# so we threshold them at 0.5 to get binary labels
def f(x: float) -> float:
  if x > 0.5:
    return 1
  else:
    return 0

y_pred["output"] = y_pred["output"].map(f, output_bounds=(0, 1))
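Outside the enclave, the same thresholding rule can be checked quickly with NumPy. This is a local sketch on made-up probabilities, not AG code; note that a value of exactly 0.5 maps to class 0, matching the strict `>` in `f`.

```python
import numpy as np

probs = np.array([0.1, 0.5, 0.51, 0.9])  # made-up sigmoid outputs
labels = (probs > 0.5).astype(int)       # strict comparison: 0.5 itself becomes 0
print(labels.tolist())
```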

In [None]:
%%ag
result = submit_predictions(y_pred)