In [None]:
# Install the Antigranular package
!pip install antigranular &> /dev/null

In [None]:
import antigranular as ag
session = ag.login(<client_id>, <client_secret>, competition="Heart Disease Prediction Hackathon")

Dataset "Heart Disease Prediction Hackathon Dataset" loaded to the kernel as heart_disease_prediction_hackathon_dataset
Key Name                       Value Type     
---------------------------------------------
train_y                        PrivateDataFrame
train_x                        PrivateDataFrame
test_x                         DataFrame      

Connected to Antigranular server session id: 38606523-88c4-44d3-80c1-386422337201, the session will time out if idle for 25 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


In [None]:
%%ag
x_train = heart_disease_prediction_hackathon_dataset["train_x"]
y_train = heart_disease_prediction_hackathon_dataset["train_y"]
x_test = heart_disease_prediction_hackathon_dataset["test_x"]

In [None]:
%%ag
ag_print(x_train.columns)
ag_print(x_test.columns)

['age', 'sex', 'bp', 'ch', 'bs', 'phr']
Index(['age', 'sex', 'bp', 'ch', 'bs', 'phr'], dtype='object')



In [None]:
%%ag
ag_print(x_test)

      age  sex   bp   ch   bs  phr
0      71    1  128  326   95  117
1      61    1  153  270   98  123
2      59    1  113  236  106  181
3      69    0  109  151  109  108
4      55    0  137  235  101  150
...   ...  ...  ...  ...  ...  ...
1995   60    1  128  261  112  143
1996   50    1  143  216   94  100
1997   64    1  120  172   87  142
1998   56    1  158  294   82  144
1999   69    0  117  559  112  157

[2000 rows x 6 columns]



# Differential Privacy with TensorFlow and Custom Scaling

In this notebook, we train a differentially private neural network on the Heart Disease Prediction Hackathon dataset. We preprocess the features with a custom Min-Max scaler, then wrap a TensorFlow Keras model in `PrivateKerasModel` from the `op_tensorflow` module so that training is privacy-preserving.




### Custom Min-Max Scaler Function Definition
This cell defines a custom function `min_max_scaler_manual` that applies Min-Max scaling to the data. The function scales each column in the DataFrame to the range [0, 1] using the bounds in `metadata`, which maps each column name to a `(min, max)` tuple.

In [None]:
%%ag
def min_max_scaler_manual(data, metadata):
    # Note: this is a reference to the input frame, not a copy, so the
    # column assignments below may also modify `data` itself
    scaled_data = data

    # Iterate over each column in the data
    for col in data.columns:
        # metadata[col] is a (min, max) tuple of bounds for the column
        col_min, col_max = metadata[col]

        # Map the column from [col_min, col_max] to [0, 1]
        scaled_data[col] = (data[col] - col_min) / (col_max - col_min)

    return scaled_data

### Apply Min-Max Scaling to Training and Test Data
These cells apply the `min_max_scaler_manual` function to `x_test` and `x_train`, storing the results in `x_test_scaled` and `x_train_scaled`. Both are scaled with the bounds in `x_train.metadata`, so the test features are mapped with the same min/max values as the training features.


In [None]:
%%ag
x_test_scaled = min_max_scaler_manual(x_test, x_train.metadata)

In [None]:
%%ag
x_train_scaled = min_max_scaler_manual(x_train, x_train.metadata)


In [None]:
%%ag
ag_print(x_test_scaled)

           age  sex        bp        ch        bs      phr
0     0.769231  1.0  0.355556  0.452525  0.311111  0.34375
1     0.615385  1.0  0.540741  0.339394  0.344444  0.38125
2     0.584615  1.0  0.244444  0.270707  0.433333  0.74375
3     0.738462  0.0  0.214815  0.098990  0.466667  0.28750
4     0.523077  0.0  0.422222  0.268687  0.377778  0.55000
...        ...  ...       ...       ...       ...      ...
1995  0.600000  1.0  0.355556  0.321212  0.500000  0.50625
1996  0.446154  1.0  0.466667  0.230303  0.300000  0.23750
1997  0.661538  1.0  0.296296  0.141414  0.222222  0.50000
1998  0.538462  1.0  0.577778  0.387879  0.166667  0.51250
1999  0.738462  0.0  0.274074  0.923232  0.500000  0.59375

[2000 rows x 6 columns]



In [None]:
%%ag
ag_print(x_train.metadata)

{'age': (21, 86), 'sex': (0, 1), 'bp': (80, 215), 'ch': (102, 597), 'bs': (67, 157), 'phr': (62, 222)}
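
As a quick check of the scaling, the first row of `x_test` can be rescaled by hand with the bounds printed above. The following is a minimal plain-Python sketch meant to run outside the `%%ag` environment; the helper name `scale_row` is hypothetical, and the row values and bounds are copied from the outputs in this notebook.

```python
# Bounds copied from the x_train.metadata output: column -> (min, max)
metadata = {
    'age': (21, 86), 'sex': (0, 1), 'bp': (80, 215),
    'ch': (102, 597), 'bs': (67, 157), 'phr': (62, 222),
}

# First row of x_test as printed earlier in the notebook
row = {'age': 71, 'sex': 1, 'bp': 128, 'ch': 326, 'bs': 95, 'phr': 117}


def scale_row(row, metadata):
    """Min-Max scale one row held as a plain dict (hypothetical helper)."""
    scaled = {}
    for col, value in row.items():
        col_min, col_max = metadata[col]
        scaled[col] = (value - col_min) / (col_max - col_min)
    return scaled


scaled_row = scale_row(row, metadata)
for col, value in scaled_row.items():
    print(f"{col}: {value:.6f}")
```

The printed values should agree with the first row of the `ag_print(x_test_scaled)` output; for example, `age` is (71 - 21) / (86 - 21) ≈ 0.769231.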



In [None]:
%%ag
ag_print(x_test)

      age  sex   bp   ch   bs  phr
0      71    1  128  326   95  117
1      61    1  153  270   98  123
2      59    1  113  236  106  181
3      69    0  109  151  109  108
4      55    0  137  235  101  150
...   ...  ...  ...  ...  ...  ...
1995   60    1  128  261  112  143
1996   50    1  143  216   94  100
1997   64    1  120  172   87  142
1998   56    1  158  294   82  144
1999   69    0  117  559  112  157

[2000 rows x 6 columns]



## Build and Compile Differentially Private Neural Network Model

This cell imports the required libraries, defines the network, wraps it for differentially private training, and compiles it. Specifically:

1. **Import Libraries**:
   - `tensorflow` is imported for building and training the neural network.
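   - `standard_scaler` and `PrivateDataFrame` are imported from `op_pandas`. `PrivateDataFrame` is used later to wrap the scaled test data for prediction; `standard_scaler` is imported but not used in this notebook, since scaling is done with the custom Min-Max function.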
   - `Sequential`, `Dense`, `Dropout`, and `BatchNormalization` from `tensorflow.keras` are used to construct the neural network model.
   - `PrivateKerasModel` and `PrivateDataLoader` from `op_tensorflow` are used to apply differential privacy to the model.

2. **Model Architecture**:
   - A `Sequential` model is created with several `Dense` layers, which are fully connected layers in the neural network.
   - Each `Dense` layer is followed by `BatchNormalization` and `Dropout` layers:
     - **`Dense` Layers**: These layers apply linear transformations followed by activation functions (`relu` for hidden layers and `sigmoid` for the output layer) to introduce non-linearity into the model.
     - **`BatchNormalization` Layers**: These normalize the output of the previous layer to improve training stability and speed.
     - **`Dropout` Layers**: These randomly drop units during training to prevent overfitting by making the model more robust.

3. **Create Differentially Private Model**:
   - `PrivateKerasModel` is used to create a differentially private version of the neural network. It includes:
     - `l2_norm_clip=1`: Clips the L2 norm of each gradient to at most 1, which bounds the influence of any individual data point.
     - `noise_multiplier=1000`: Sets how much noise is added to the clipped gradients, relative to `l2_norm_clip`. A higher value provides stronger privacy protection but may reduce model performance.

4. **Compile the Model**:
   - The model is compiled using the Adam optimizer with a learning rate of `0.01` and the binary cross-entropy loss function.
   - **Optimizer**: The Adam optimizer is used for adjusting the weights of the model based on the gradients.
   - **Loss Function**: Binary cross-entropy is used since this is a binary classification problem.
   - **Metrics**: Accuracy is used to evaluate the model's performance during training and validation.

In summary, this cell sets up a differentially private neural network model with TensorFlow, including defining the model architecture, applying differential privacy, and compiling the model for training.
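
To make the roles of `l2_norm_clip` and `noise_multiplier` concrete, here is a minimal plain-Python sketch of the clip-and-noise step used in standard DP-SGD. It is an illustration under assumptions, not the internals of `PrivateKerasModel`: the function names are hypothetical, gradients are plain lists, and Gaussian noise with standard deviation `noise_multiplier * l2_norm_clip` is assumed to be added to the sum of clipped per-example gradients.

```python
import math
import random


def clip_by_l2_norm(grad, l2_norm_clip):
    """Scale a gradient down so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_by_l2_norm(g, l2_norm_clip) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Assumed noise scale: standard deviation = noise_multiplier * l2_norm_clip
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]  # toy per-example gradients

# With noise_multiplier=0 the result is just the mean of the clipped gradients
print(noisy_mean_gradient(grads, l2_norm_clip=1.0, noise_multiplier=0.0, rng=rng))

# With the notebook's settings the noise is far larger than these toy gradients
print(noisy_mean_gradient(grads, l2_norm_clip=1.0, noise_multiplier=1000.0, rng=rng))
```

Under these assumptions, a batch of 32 with `l2_norm_clip=1` and `noise_multiplier=1000` would see per-coordinate noise with standard deviation 1000 / 32 = 31.25 on the averaged gradient, which is why such a setting is usually expected to cost accuracy.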


In [None]:
%%ag
import tensorflow as tf
from op_pandas import standard_scaler, PrivateDataFrame
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from op_tensorflow import PrivateKerasModel, PrivateDataLoader


seqM1 = Sequential([
    Dense(256, activation='relu', input_shape=(6,)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(32, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])


# Create DP keras model
dp_model1 = PrivateKerasModel(model=seqM1, l2_norm_clip=1, noise_multiplier=1000)

# Use a standard (non-DP) optimizer directly from keras.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# PrivateKerasModel uses similar API as standard Keras
dp_model1.compile(
	optimizer = optimizer,
	loss = 'binary_crossentropy',
	metrics = ["accuracy"]
)


**Reason for Using a High Noise Multiplier**

I used a high noise multiplier of `1000` to obtain a strong privacy guarantee. For a fixed sample size, batch size, and number of epochs, a larger noise multiplier adds more noise to the clipped gradients and therefore yields a smaller epsilon. The added noise can reduce model performance, although it may also act as a form of regularization against overfitting. The goal was to balance strong privacy with acceptable model accuracy.


### Prepare DataLoader for Training
This cell creates a `PrivateDataLoader` over the private training data, specifying the feature frame, the label frame, and a batch size of 32. Note that the training cells below pass `x_train_scaled` and `y_train` to `fit` directly rather than using this loader.


In [None]:
%%ag
# Assumes x_train_scaled and y_train are already defined by the cells above
data_loader = PrivateDataLoader(feature_df=x_train_scaled, label_df=y_train, batch_size=32)

### Manage Privacy Budget
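This cell uses the `get_privacy_budget` function to estimate the epsilon required for training from the sample size, batch size, number of epochs, noise multiplier, and target delta. Note that the estimate assumes `sample_size=2000` and `num_epochs=500`, whereas the training logs show 250 steps per epoch at a batch size of 32 (roughly 8,000 rows) and four runs of 100 epochs each, so the printed epsilon may not match the budget actually spent.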

In [None]:
%%ag
from op_tensorflow import get_privacy_budget
get_privacy_budget(
    sample_size=2000,
    batch_size=32,
    num_epochs=500,
    noise_multiplier=1000,
    target_delta=1e-5,
)

=> EPSILON_REQUIRED = 0.0075974776466098395 using TARGET_DELTA = 1e-05
Training parameters used :-
    SAMPLE_SIZE = 2000
    BATCH_SIZE = 32
    NUM_EPOCHS = 500
    NOISE_MULTIPLIER = 1000
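
The parameters passed to `get_privacy_budget` matter because privacy accountants for DP-SGD typically work from the sampling rate (batch size divided by sample size) and the total number of noisy gradient steps. The sketch below only derives those two quantities in plain Python and does not compute epsilon. The helper name `accounting_inputs` is hypothetical, rounding the steps per epoch up is an assumption rather than something confirmed for `op_tensorflow`, and the 8,000-row figure is inferred from the 250 steps per epoch shown in the training logs.

```python
import math


def accounting_inputs(sample_size, batch_size, num_epochs):
    """Return (sampling_rate, steps_per_epoch, total_steps) for a training run."""
    sampling_rate = batch_size / sample_size
    # Assumption: a final partial batch still counts as one step
    steps_per_epoch = math.ceil(sample_size / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return sampling_rate, steps_per_epoch, total_steps


# Parameters used in the budget estimate above
print(accounting_inputs(sample_size=2000, batch_size=32, num_epochs=500))

# Parameters implied by the training logs (assumed: 250 steps * 32 rows, 4 runs * 100 epochs)
print(accounting_inputs(sample_size=8000, batch_size=32, num_epochs=400))
```

The two settings give different sampling rates and step counts, so an epsilon computed for one does not automatically carry over to the other.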
    



### Train the Differentially Private Model
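This cell fits the differentially private model (`dp_model1`) to the scaled training data (`x_train_scaled`, `y_train`) for 100 epochs with a batch size of 32. The same cell is run four times in a row; each call to `fit` continues from the current weights, so the loss keeps decreasing across runs.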


In [None]:
%%ag
# Fit the model on the full training set (no validation split is configured)
dp_model1.fit(
    x=x_train_scaled,
    y=y_train,
    epochs=100,
    batch_size=32,
)

Epoch 1/100

250/250 - 17s - loss: 0.5745 - accuracy: 0.7028 - 17s/epoch - 69ms/step

Epoch 2/100

250/250 - 8s - loss: 0.5491 - accuracy: 0.7172 - 8s/epoch - 34ms/step

Epoch 3/100

250/250 - 8s - loss: 0.5149 - accuracy: 0.7417 - 8s/epoch - 33ms/step

Epoch 4/100

250/250 - 8s - loss: 0.5144 - accuracy: 0.7309 - 8s/epoch - 33ms/step

Epoch 5/100

250/250 - 8s - loss: 0.5110 - accuracy: 0.7347 - 8s/epoch - 33ms/step

Epoch 6/100

250/250 - 8s - loss: 0.4994 - accuracy: 0.7452 - 8s/epoch - 33ms/step

Epoch 7/100

250/250 - 8s - loss: 0.5088 - accuracy: 0.7413 - 8s/epoch - 32ms/step

Epoch 8/100

250/250 - 8s - loss: 0.4911 - accuracy: 0.7523 - 8s/epoch - 34ms/step

Epoch 9/100

250/250 - 8s - loss: 0.4998 - accuracy: 0.7493 - 8s/epoch - 33ms/step

Epoch 10/100

250/250 - 8s - loss: 0.4941 - accuracy: 0.7549 - 8s/epoch - 33ms/step

Epoch 11/100

250/250 - 8s - loss: 0.5007 - accuracy: 0.7439 - 8s/epoch - 33ms/step

Epoch 12/100

250/250 - 8s - loss: 0.4842 - accuracy: 0.7558 - 8s/epoch 

In [None]:
%%ag
# Continue training for another 100 epochs (no validation split is configured)
dp_model1.fit(
    x=x_train_scaled,
    y=y_train,
    epochs=100,
    batch_size=32,
)

Epoch 1/100

250/250 - 8s - loss: 0.4155 - accuracy: 0.8066 - 8s/epoch - 33ms/step

Epoch 2/100

250/250 - 8s - loss: 0.4122 - accuracy: 0.8071 - 8s/epoch - 33ms/step

Epoch 3/100

250/250 - 8s - loss: 0.4154 - accuracy: 0.8020 - 8s/epoch - 33ms/step

Epoch 4/100

250/250 - 8s - loss: 0.4054 - accuracy: 0.8097 - 8s/epoch - 32ms/step

Epoch 5/100

250/250 - 8s - loss: 0.3994 - accuracy: 0.8165 - 8s/epoch - 33ms/step

Epoch 6/100

250/250 - 8s - loss: 0.4082 - accuracy: 0.8088 - 8s/epoch - 33ms/step

Epoch 7/100

250/250 - 8s - loss: 0.4065 - accuracy: 0.8069 - 8s/epoch - 32ms/step

Epoch 8/100

250/250 - 8s - loss: 0.4009 - accuracy: 0.8168 - 8s/epoch - 33ms/step

Epoch 9/100

250/250 - 8s - loss: 0.4111 - accuracy: 0.8113 - 8s/epoch - 32ms/step

Epoch 10/100

250/250 - 8s - loss: 0.3929 - accuracy: 0.8223 - 8s/epoch - 32ms/step

Epoch 11/100

250/250 - 8s - loss: 0.4042 - accuracy: 0.8148 - 8s/epoch - 33ms/step

Epoch 12/100

250/250 - 8s - loss: 0.3941 - accuracy: 0.8166 - 8s/epoch - 

In [None]:
%%ag
# Continue training for another 100 epochs (no validation split is configured)
dp_model1.fit(
    x=x_train_scaled,
    y=y_train,
    epochs=100,
    batch_size=32,
)

Epoch 1/100

250/250 - 8s - loss: 0.3761 - accuracy: 0.8267 - 8s/epoch - 34ms/step

Epoch 2/100

250/250 - 8s - loss: 0.3840 - accuracy: 0.8211 - 8s/epoch - 33ms/step

Epoch 3/100

250/250 - 8s - loss: 0.3833 - accuracy: 0.8242 - 8s/epoch - 32ms/step

Epoch 4/100

250/250 - 8s - loss: 0.3795 - accuracy: 0.8299 - 8s/epoch - 33ms/step

Epoch 5/100

250/250 - 8s - loss: 0.3756 - accuracy: 0.8259 - 8s/epoch - 32ms/step

Epoch 6/100

250/250 - 8s - loss: 0.3743 - accuracy: 0.8306 - 8s/epoch - 32ms/step

Epoch 7/100

250/250 - 8s - loss: 0.3683 - accuracy: 0.8310 - 8s/epoch - 33ms/step

Epoch 8/100

250/250 - 8s - loss: 0.3755 - accuracy: 0.8308 - 8s/epoch - 33ms/step

Epoch 9/100

250/250 - 8s - loss: 0.3853 - accuracy: 0.8196 - 8s/epoch - 33ms/step

Epoch 10/100

250/250 - 8s - loss: 0.3905 - accuracy: 0.8152 - 8s/epoch - 33ms/step

Epoch 11/100

250/250 - 8s - loss: 0.3875 - accuracy: 0.8252 - 8s/epoch - 33ms/step

Epoch 12/100

250/250 - 8s - loss: 0.3875 - accuracy: 0.8235 - 8s/epoch - 

In [None]:
%%ag
# Continue training for another 100 epochs (no validation split is configured)
dp_model1.fit(
    x=x_train_scaled,
    y=y_train,
    epochs=100,
    batch_size=32,
)

Epoch 1/100

250/250 - 8s - loss: 0.3667 - accuracy: 0.8324 - 8s/epoch - 32ms/step

Epoch 2/100

250/250 - 8s - loss: 0.3508 - accuracy: 0.8415 - 8s/epoch - 33ms/step

Epoch 3/100

250/250 - 8s - loss: 0.3693 - accuracy: 0.8338 - 8s/epoch - 34ms/step

Epoch 4/100

250/250 - 8s - loss: 0.3591 - accuracy: 0.8364 - 8s/epoch - 34ms/step

Epoch 5/100

250/250 - 8s - loss: 0.3550 - accuracy: 0.8368 - 8s/epoch - 33ms/step

Epoch 6/100

250/250 - 8s - loss: 0.3522 - accuracy: 0.8379 - 8s/epoch - 33ms/step

Epoch 7/100

250/250 - 8s - loss: 0.3621 - accuracy: 0.8366 - 8s/epoch - 32ms/step

Epoch 8/100

250/250 - 8s - loss: 0.3694 - accuracy: 0.8341 - 8s/epoch - 32ms/step

Epoch 9/100

250/250 - 8s - loss: 0.3527 - accuracy: 0.8392 - 8s/epoch - 33ms/step

Epoch 10/100

250/250 - 8s - loss: 0.3628 - accuracy: 0.8341 - 8s/epoch - 32ms/step

Epoch 11/100

250/250 - 8s - loss: 0.3551 - accuracy: 0.8389 - 8s/epoch - 33ms/step

Epoch 12/100

250/250 - 8s - loss: 0.3579 - accuracy: 0.8375 - 8s/epoch - 

In [None]:
%%ag
ag_print(x_train_scaled.columns)
ag_print(x_test.columns)

['age', 'sex', 'bp', 'ch', 'bs', 'phr']
Index(['age', 'sex', 'bp', 'ch', 'bs', 'phr'], dtype='object')



### Predict and Post-process Predictions
This cell uses the trained model to make predictions on the scaled test data (`x_test_scaled`). The predictions are converted from float scalars to binary class labels (0 or 1) based on a threshold of 0.5.

In [None]:
%%ag
y_pred1 = dp_model1.predict(PrivateDataFrame(x_test_scaled), label_columns=["output"])

 1/63 [..............................] - ETA: 5s
 8/63 [==>...........................] - ETA: 0s



In [None]:
%%ag
ag_print(y_pred1)

An exception occurred: Please ensure ag_print function is called on a non-private type.

ValueError: Please ensure ag_print function is called on a non-private type.


In [None]:
%%ag
# The predictions are float scores from the sigmoid output,
# so threshold them at 0.5 to get binary class labels
def f(x: float) -> float:
    if x > 0.5:
        return 1.0
    else:
        return 0.0

y_pred1["output"] = y_pred1["output"].map(f, output_bounds=(0, 1))
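
The thresholding rule can be tried on plain floats outside the `%%ag` environment. This is a small sketch with made-up scores; the helper name `to_label` is hypothetical, and it mirrors the strict `> 0.5` comparison used above, so a score of exactly 0.5 maps to class 0.

```python
def to_label(score: float, threshold: float = 0.5) -> int:
    """Return 1 when the score is strictly above the threshold, else 0."""
    return 1 if score > threshold else 0


scores = [0.12, 0.5, 0.73]  # illustrative values, not model output
labels = [to_label(s) for s in scores]
print(labels)  # [0, 0, 1]
```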


In [None]:
%%ag
ag_print(x_test_scaled)

           age  sex        bp        ch        bs      phr
0     0.769231  1.0  0.355556  0.452525  0.311111  0.34375
1     0.615385  1.0  0.540741  0.339394  0.344444  0.38125
2     0.584615  1.0  0.244444  0.270707  0.433333  0.74375
3     0.738462  0.0  0.214815  0.098990  0.466667  0.28750
4     0.523077  0.0  0.422222  0.268687  0.377778  0.55000
...        ...  ...       ...       ...       ...      ...
1995  0.600000  1.0  0.355556  0.321212  0.500000  0.50625
1996  0.446154  1.0  0.466667  0.230303  0.300000  0.23750
1997  0.661538  1.0  0.296296  0.141414  0.222222  0.50000
1998  0.538462  1.0  0.577778  0.387879  0.166667  0.51250
1999  0.738462  0.0  0.274074  0.923232  0.500000  0.59375

[2000 rows x 6 columns]



### Submit Predictions
This cell submits the processed predictions (`y_pred1`) to the competition leaderboard.


In [None]:
%%ag
result = submit_predictions(y_pred1)

score: {'leaderboard': 0.9564211271444968, 'logs': {'BIN_ACC': 0.9762324159219367, 'LIN_EPS': -0.019811288777439824}}
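
The logged components appear to add up to the leaderboard score: `BIN_ACC` plus `LIN_EPS`, which is negative here. The names suggest a binary accuracy term and an epsilon-based penalty, but the scoring rule is not documented in this notebook, so the check below is an observation about this one submission rather than a statement of the competition's formula.

```python
import math

# Values copied from the submission output above
score = {
    'leaderboard': 0.9564211271444968,
    'logs': {'BIN_ACC': 0.9762324159219367, 'LIN_EPS': -0.019811288777439824},
}

logs = score['logs']
combined = logs['BIN_ACC'] + logs['LIN_EPS']

# Compare with a tolerance to allow for floating-point rounding
matches = math.isclose(combined, score['leaderboard'], rel_tol=1e-9)
print(combined, matches)
```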

