In [None]:
# Install the Antigranular package
!pip install antigranular &> /dev/null

In [None]:
import antigranular as ag
session = ag.login(<client_id>, <client_secret>, competition="Heart Disease Prediction Hackathon")

Dataset "Heart Disease Prediction Hackathon Dataset" loaded to the kernel as heart_disease_prediction_hackathon_dataset
Key Name                       Value Type     
---------------------------------------------
train_y                        PrivateDataFrame
train_x                        PrivateDataFrame
test_x                         DataFrame      

Connected to Antigranular server session id: 38606523-88c4-44d3-80c1-386422337201, the session will time out if idle for 25 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


In [None]:
%%ag
x_train = heart_disease_prediction_hackathon_dataset["train_x"]
y_train = heart_disease_prediction_hackathon_dataset["train_y"]
x_test = heart_disease_prediction_hackathon_dataset["test_x"]

In [None]:
%%ag
ag_print(x_train.columns)
ag_print(x_test.columns)

['age', 'sex', 'bp', 'ch', 'bs', 'phr']
Index(['age', 'sex', 'bp', 'ch', 'bs', 'phr'], dtype='object')



In [None]:
%%ag
ag_print(x_test)

      age  sex   bp   ch   bs  phr
0      71    1  128  326   95  117
1      61    1  153  270   98  123
2      59    1  113  236  106  181
3      69    0  109  151  109  108
4      55    0  137  235  101  150
...   ...  ...  ...  ...  ...  ...
1995   60    1  128  261  112  143
1996   50    1  143  216   94  100
1997   64    1  120  172   87  142
1998   56    1  158  294   82  144
1999   69    0  117  559  112  157

[2000 rows x 6 columns]



# Differential Privacy using different diffprivlib models

In this notebook, I implement and evaluate several machine learning models with differential privacy. The workflow standardizes the data with both a custom scaler and a library-based scaler, then trains Logistic Regression, Random Forest, and Gaussian Naive Bayes models. Each model is trained with a specified privacy budget (epsilon, set to 0.1 throughout) so that the privacy cost of training is bounded. The predictions are then exported and submitted to a leaderboard for evaluation.
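
To give a feel for what the privacy budget controls, here is a minimal sketch of the textbook Laplace mechanism applied to a count query. It is only an illustration under my own assumptions: the helper names (`laplace_scale`, `laplace_noise`, `noisy_count`) are hypothetical, and nothing here claims to reproduce what Antigranular or diffprivlib do internally. The point is simply that the noise scale is `sensitivity / epsilon`, so a smaller epsilon means more noise.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b = sensitivity / epsilon used by the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count (sensitivity 1) with added Laplace noise."""
    return true_count + laplace_noise(laplace_scale(1.0, epsilon), rng)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for eps in (0.1, 1.0, 10.0):
    errors = [abs(noisy_count(1000, eps, rng) - 1000) for _ in range(2000)]
    print(f"epsilon={eps:<5} mean absolute error ~ {sum(errors) / len(errors):.2f}")
```

With epsilon = 0.1, as used for the models in this notebook, the noise is large relative to epsilon = 1 or 10, which is one reason the accuracies reported later stay modest.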


### Standard Scaler Manual Function
This cell first defines a custom function `standard_scaler_manual` to standardize data. It calculates the mean and standard deviation of each column, then scales the data by subtracting the mean and dividing by the standard deviation.

Next, it selects all columns of `x_test` and applies `standard_scaler_manual` to them. Because `x_test` is a regular (non-private) `DataFrame`, this step spends no privacy budget.
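
As a small worked example of the same formula, the sketch below standardizes the first five `age` values printed from `x_test` above, using only Python's `statistics` module. The helper name `standardize` is hypothetical, and the resulting z-scores differ from the notebook's because the mean and standard deviation here come from five rows rather than all 2000.

```python
import statistics


def standardize(values):
    """Return z-scores using the sample standard deviation (ddof=1),
    which is also the default of pandas' `.std()`."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


ages = [71, 61, 59, 69, 55]  # first five 'age' values of x_test
scaled = standardize(ages)
print([round(z, 3) for z in scaled])

# After scaling, the column has mean ~0 and sample standard deviation ~1
print(round(statistics.mean(scaled), 6), round(statistics.stdev(scaled), 6))
```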


In [None]:
%%ag
def standard_scaler_manual(data):
    # Calculate mean and standard deviation for each column
    means = data.mean()
    std_devs = data.std()

    # Standardize each column
    scaled_data = (data - means) / std_devs

    return scaled_data

# Scale every column of 'x_test'; no column (including 'sex') is excluded
columns_to_scale = [col for col in x_test.columns]

# Apply standard scaling to selected columns manually
x_test_scaled = standard_scaler_manual(x_test[columns_to_scale])

# x_test_scaled=x_test_scaled.tolist()

# Print the column names of the scaled DataFrame
ag_print(x_test_scaled.columns)


Index(['age', 'sex', 'bp', 'ch', 'bs', 'phr'], dtype='object')



In [None]:
%%ag
ag_print(x_test_scaled)

           age       sex        bp        ch        bs       phr
0     1.766038  0.672463 -0.171234  1.557830 -0.477453 -1.528098
1     0.720952  0.672463  1.196887  0.471132 -0.294463 -1.244706
2     0.511935  0.672463 -0.992106 -0.188649  0.193513  1.494752
3     1.557021 -1.486327 -1.211006 -1.838101  0.376503 -1.953187
4     0.093901 -1.486327  0.321289 -0.208054 -0.111472  0.030559
...        ...       ...       ...       ...       ...       ...
1995  0.616444  0.672463 -0.171234  0.296484  0.559494 -0.300065
1996 -0.428642  0.672463  0.649638 -0.576755 -0.538450 -2.331043
1997  1.034478  0.672463 -0.609033 -1.430589 -0.965429 -0.347297
1998  0.198410  0.672463  1.470511  0.936860 -1.270413 -0.252833
1999  1.557021 -1.486327 -0.773207  6.079269  0.559494  0.361183

[2000 rows x 6 columns]



The `standard_scaler` function from `op_pandas` is used to apply standard scaling to the private training data (`x_train`), spending a privacy budget of `eps=0.1`.

In [None]:
%%ag
# Import the private scaler and the PrivateDataFrame wrapper
from op_pandas import standard_scaler, PrivateDataFrame

# Scale every column of 'x_train'; no column (including 'sex') is excluded
columns_to_scale = [col for col in x_train.columns]

# Apply standard scaling to selected columns
x_train_scaled = standard_scaler(x_train[columns_to_scale], eps=0.1)


In [None]:
%%ag
ag_print(x_train_scaled.metadata)

{'age': (-1.9185128845458763, 1.8131036161185545), 'sex': (-1.3525680442501236, 0.7306177851446554), 'bp': (-3.6132463724066013, 5.709982575715551), 'ch': (-1.4514821495197625, 3.6555611098762197), 'bs': (-4.771411457573128, 8.393270517145876), 'phr': (-3.2271728472610643, 2.653948474164446)}



# Logistic Regression


### L2 Norm Calculation
This cell calculates the L2 norm of each row of the scaled `x_train` DataFrame. It squares each element, sums the squares across the columns of each row (`axis=1`), and takes the square root. It then takes a differentially private maximum over the rows (`eps=0.01`) and prints the result.
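
The row-wise computation can be sketched on ordinary lists as follows. This is a non-private illustration with made-up rows. The helper names `l2_norm` and `clip_to_norm` are hypothetical, and the clipping step reflects my understanding of how a `data_norm` bound is typically enforced (rows with a larger norm are rescaled onto the bound), not a statement about the exact internals of `op_diffprivlib`.

```python
import math


def l2_norm(row):
    """Euclidean (L2) norm of one row of features."""
    return math.sqrt(sum(v * v for v in row))


def clip_to_norm(row, max_norm):
    """Rescale a row so that its L2 norm does not exceed max_norm."""
    norm = l2_norm(row)
    if norm <= max_norm:
        return list(row)
    factor = max_norm / norm
    return [v * factor for v in row]


rows = [[3.0, 4.0], [0.6, 0.8], [6.0, 8.0]]  # illustrative values only
norms = [l2_norm(r) for r in rows]
print("row norms:", norms)

data_norm = 5.0  # assumed bound for this toy example
clipped = [clip_to_norm(r, data_norm) for r in rows]
print("norms after clipping:", [l2_norm(r) for r in clipped])
```

A bound that is too small distorts many rows, while a bound that is too large forces the model to add more noise, so an estimate close to the true maximum is a reasonable choice.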


In [None]:
%%ag
x_train_norm = x_train_scaled ** 2

# Sum the squared values along the columns (axis=1)
l_2_norm = x_train_norm.sum(axis=1) ** 0.5

# Get the maximum L2 norm
data_norm = l_2_norm.max(eps=0.01)

ag_print("Max L2 norm (data_norm):", data_norm)




Max L2 norm (data_norm): 3.4535451599767155



### Logistic Regression Model Training
This cell initializes a logistic regression model with a privacy budget (`epsilon`) of 0.1 and the data norm computed above. It then fits the model to the scaled `x_train` and `y_train` data and predicts the labels for the scaled `x_test` data.


In [None]:
%%ag
from op_diffprivlib.models import LogisticRegression

epsilon = 0.1

clf = LogisticRegression(epsilon=epsilon, data_norm=3.4535451599767155)
clf.fit(x_train_scaled, y_train)
y_pred = clf.predict(x_test_scaled)


  y = column_or_1d(y, warn=True)



In [None]:
%%ag
ag_print(y_pred)

[0 1 0 ... 1 1 0]



### Export Predictions
The following cells convert the predictions into a `DataFrame`, export them to the local environment, and submit them to the leaderboard.


In [None]:
%%ag
# Prepare to export it by converting it into a DataFrame
from pandas import DataFrame
my_predictions = DataFrame(y_pred)
# Export to local environment
export(y_pred, "my_predictions")

Setting up exported variable in local environment: my_predictions
NameError: name 'session' is not defined


In [None]:
from pandas import DataFrame
# Send predictions to the leaderboard
session.submit_predictions(DataFrame(my_predictions))

{'score': {'leaderboard': 0.6022827834374976,
  'logs': {'BIN_ACC': 0.6154791234887222, 'LIN_EPS': -0.01319634005122459}}}

# Random Forest

### Calculate Low and High Bounds
This cell calculates the low and high bounds for each column in the `x_train_scaled` DataFrame using metadata. It stores these bounds in `low_bound` and `high_bound` lists.
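
To make the resulting structure concrete, the sketch below rebuilds the same `(low, high)` pair locally from the metadata dictionary printed earlier for `x_train_scaled`. It uses a plain `dict` as a stand-in for the `metadata` attribute, so it is an illustration of the shape of the bounds, not code meant to run inside an `%%ag` cell.

```python
# Per-column (min, max) bounds as printed from x_train_scaled.metadata
metadata = {
    "age": (-1.9185128845458763, 1.8131036161185545),
    "sex": (-1.3525680442501236, 0.7306177851446554),
    "bp": (-3.6132463724066013, 5.709982575715551),
    "ch": (-1.4514821495197625, 3.6555611098762197),
    "bs": (-4.771411457573128, 8.393270517145876),
    "phr": (-3.2271728472610643, 2.653948474164446),
}

# Split the (low, high) pairs into two parallel lists, one entry per column
lows, highs = (list(side) for side in zip(*metadata.values()))
bounds = (lows, highs)

for name, low, high in zip(metadata, lows, highs):
    print(f"{name:>4}: [{low:.3f}, {high:.3f}]")
```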


In [None]:
%%ag
# Calculate low and high bounds
low_bound, high_bound = [], []
for col in x_train_scaled.columns:
    low, high = x_train_scaled.metadata[col]
    low_bound.append(low)
    high_bound.append(high)

x_bounds_scaled_train = (low_bound, high_bound)

### Initialize and Train Random Forest Classifier
This cell initializes a `RandomForestClassifier` with 500 estimators, an epsilon of 0.1, specified bounds, classes, and other parameters. It then fits the classifier to the `x_train_scaled` and `y_train` data.


In [None]:
%%ag
from op_diffprivlib.models import RandomForestClassifier
ran_model = RandomForestClassifier(
    n_estimators=500,
    epsilon=0.1,
    bounds=x_bounds_scaled_train,
    classes=[0, 1],
    n_jobs=1,
    max_depth=5,
)

In [None]:
%%ag
ran_model.fit(x_train_scaled, y_train)

  y = column_or_1d(y, warn=True)



### Predict Using Random Forest Classifier
This cell uses the trained `RandomForestClassifier` to predict the labels for the `x_test_scaled` data and stores the predictions in `y_pred`.


In [None]:
%%ag
y_pred = ran_model.predict(x_test_scaled)

In [None]:
%%ag
ag_print(y_pred)

[0 0 1 ... 1 0 1]



### Export Predictions
The following cells convert the predictions into a `DataFrame`, export them to the local environment, and submit them to the leaderboard.


In [None]:
%%ag
# Prepare to export it by converting it into a DataFrame
from pandas import DataFrame
my_predictions = DataFrame(y_pred)

In [None]:
%%ag
# Export to local environment
export(y_pred, "my_predictions")

Setting up exported variable in local environment: my_predictions


In [None]:
from pandas import DataFrame
# Send predictions to the leaderboard
session.submit_predictions(DataFrame(my_predictions))

{'score': {'leaderboard': 0.6081428667932699,
  'logs': {'BIN_ACC': 0.6204892068444945, 'LIN_EPS': -0.012346340051224593}}}

# Gaussian Naive Bayes

### Train Gaussian Naive Bayes Classifier
This cell imports the `GaussianNB` model from `op_diffprivlib.models`, defines a `seed` variable for repeatable debugging (it is not passed to the classifier in this cell), and initializes a Gaussian Naive Bayes classifier with a specified privacy budget (`epsilon`) and per-feature bounds. It then fits the classifier to the `x_train_scaled` and `y_train` data.


In [None]:
%%ag
from op_diffprivlib.models import GaussianNB

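seed = 1  # intended for repeatable debugging; not passed to the model below
epsilon = 0.1

# Reuse the per-feature bounds computed in the Random Forest section
clf = GaussianNB(epsilon=epsilon, bounds=x_bounds_scaled_train)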
clf.fit(x_train_scaled, y_train)



  y = column_or_1d(y, warn=True)



### Scale Test Data and Predict Using Gaussian Naive Bayes
This cell wraps `x_test` in a `PrivateDataFrame`, scales it with `standard_scaler` using a privacy budget of `eps=0.1`, and then uses the trained Gaussian Naive Bayes classifier to predict the labels for the scaled data. The predictions are stored in `y_pred`.

In [None]:
%%ag
x_test_scaler = standard_scaler(PrivateDataFrame(x_test), eps=0.1)
y_pred = clf.predict(x_test_scaler)

In [None]:
%%ag
ag_print(y_pred)

[0 0 1 ... 0 1 0]



### Export Predictions
The following cells convert the predictions into a `DataFrame`, export them to the local environment, and submit them to the leaderboard.


In [None]:
%%ag
# Prepare to export it by converting it into a DataFrame
from pandas import DataFrame
my_predictions = DataFrame(y_pred)

In [None]:
%%ag
# Export to local environment
export(y_pred, "my_predictions")

Setting up exported variable in local environment: my_predictions


In [None]:
from pandas import DataFrame
# Send predictions to the leaderboard
session.submit_predictions(DataFrame(my_predictions))

{'score': {'leaderboard': 0.6296271060426313,
  'logs': {'BIN_ACC': 0.6296271060426313}}}
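
To compare the three submissions side by side, the sketch below collects the scores reported above into a dictionary and ranks them. The numbers are copied from the leaderboard responses in this notebook. The check that `leaderboard` equals `BIN_ACC + LIN_EPS` is an observation from these logs only (for the runs that report `LIN_EPS`), not a documented scoring formula.

```python
# Scores copied from the three session.submit_predictions responses above
scores = {
    "Logistic Regression": {
        "leaderboard": 0.6022827834374976,
        "BIN_ACC": 0.6154791234887222,
        "LIN_EPS": -0.01319634005122459,
    },
    "Random Forest": {
        "leaderboard": 0.6081428667932699,
        "BIN_ACC": 0.6204892068444945,
        "LIN_EPS": -0.012346340051224593,
    },
    "Gaussian Naive Bayes": {
        "leaderboard": 0.6296271060426313,
        "BIN_ACC": 0.6296271060426313,
    },
}

# Rank models by leaderboard score, best first
ranked = sorted(scores.items(), key=lambda kv: kv[1]["leaderboard"], reverse=True)
for name, s in ranked:
    # Missing LIN_EPS is treated as 0 for the consistency check
    combined = s["BIN_ACC"] + s.get("LIN_EPS", 0.0)
    consistent = abs(combined - s["leaderboard"]) < 1e-9
    print(f"{name:<22} leaderboard={s['leaderboard']:.4f} "
          f"BIN_ACC={s['BIN_ACC']:.4f} matches_logs={consistent}")

best_model = ranked[0][0]
print("Best leaderboard score:", best_model)
```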