# Training a Basic Machine Learning Model with AG

This notebook is all about building, training, and evaluating a machine learning model with differential privacy using the Iris dataset in Antigranular. Let's dive into the world of flowers and privacy-preserving ML! 🦾

The Iris dataset is super famous in the ML world. It's often used for classification tasks and contains samples of iris flowers described by four features: sepal length, sepal width, petal length, and petal width. 🌸🌼

By the end of this notebook, you'll be able to:

1. Load and preprocess data.
2. Build and train a neural network model with differential privacy.
3. Evaluate the model's performance.
4. Save and reload the trained model for future use.

Let's get started by setting up the environment, then loading and analysing the data. 🚀

## Installing and Loading Configuration

First, let's get everything set up. We need to install and import the necessary libraries and then log in to our Antigranular session. 🛠️

In [None]:
!pip install antigranular



In [1]:
import antigranular as ag

session = ag.login(
    "<client_id>",      # replace with your own Antigranular client ID
    "<client_secret>",  # replace with your own client secret; never commit real credentials
    competition="Sandbox Competition",
)

local_host_port: ba89e8bb-8683-4f75-bfe1-a2caeb83cef0
server_hostname: ip-100-100-17-79.eu-west-1.compute.internal
tls_cert_name: ip-100-100-17-79.eu-west-1.compute.internal_ba89e8bb-8683-4f75-bfe1-a2caeb83cef0
Dataset "Medical Treatment" loaded to the kernel as medical_treatment
Key Name                       Value Type     
---------------------------------------------
train_x                        PrivateDataFrame
train_y                        PrivateDataFrame
test_x                         DataFrame      

Connected to Antigranular server session id: 3b2e1e86-61e8-4ea3-9735-16782ab5c76e, the session will time out if idle for 25 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


## Loading the Data + Basic Analysis

In this section, we're going to load the Iris dataset and do some basic analysis to get a feel for the data. Privacy is our priority, so we'll make sure to use a PrivateDataFrame. 🔒 Think of this as our first peek into the world of irises, but with a privacy-preserving twist. 😉

In [None]:
import pandas as pd

URL = "https://content.antigranular.com/image/notebook_content/Iris.csv"

iris_dataset = pd.read_csv(URL)

session.private_import(iris_dataset, "iris_dataset")

dataframe cached to server, loading to kernel...
DataFrame loaded successfully to the kernel



In [None]:
%%ag
from op_pandas import PrivateDataFrame

data = PrivateDataFrame(iris_dataset, metadata = {'PetalLengthCm': (1.0, 7.0), 'PetalWidthCm': (0.0, 2.5), 'SepalLengthCm': (4.0, 8.0), 'SepalWidthCm': (2.0, 5.0)})

The `info()` method provides a concise summary of the PrivateDataFrame: the column names, whether each column is numerical or categorical, its data type, and the bounds declared in the metadata.

In [None]:
%%ag
data.info()

+----+---------------+-------------+---------------+---------+------------+
|    | Column        | numerical   | categorical   | dtype   | bounds     |
|----+---------------+-------------+---------------+---------+------------|
|  0 | Id            | True        | False         | int64   | None       |
|  1 | SepalLengthCm | True        | False         | float64 | (4.0, 8.0) |
|  2 | SepalWidthCm  | True        | False         | float64 | (2.0, 5.0) |
|  3 | PetalLengthCm | True        | False         | float64 | (1.0, 7.0) |
|  4 | PetalWidthCm  | True        | False         | float64 | (0.0, 2.5) |
|  5 | Species       | False       | False         | object  | None       |
+----+---------------+-------------+---------------+---------+------------+



Check out the `metadata` which provides the bounds for each numerical column—super important for ensuring data privacy. This metadata is like our guide, ensuring we don't accidentally peek too much. 🔍

In [None]:
%%ag

ag_print(data.metadata)

{'PetalLengthCm': (1.0, 7.0), 'PetalWidthCm': (0.0, 2.5), 'SepalLengthCm': (4.0, 8.0), 'SepalWidthCm': (2.0, 5.0)}
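To see why these bounds matter, here is a minimal sketch in plain Python of a differentially private mean built on the Laplace mechanism. It illustrates the general idea only and is not a description of how `op_pandas` computes its statistics; the helper names `laplace_noise` and `dp_mean` are hypothetical, and the sketch assumes the number of rows is public. Clipping every value to the declared bounds limits how much a single row can move the result, and that limit sets the noise scale.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, eps, rng):
    """Laplace-mechanism mean, assuming the number of rows is public."""
    # Clip each value into the declared bounds.
    clipped = [min(max(v, lower), upper) for v in values]
    # With n fixed, replacing one row moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / eps, rng)


# Hypothetical sepal lengths; 9.9 lies outside the (4.0, 8.0) bounds and gets clipped.
sepal_lengths = [5.1, 4.9, 6.3, 7.0, 9.9]
print(dp_mean(sepal_lengths, 4.0, 8.0, eps=1.0, rng=random.Random(0)))
```

Wider bounds mean a larger sensitivity and therefore more noise for the same `eps`, which is why tight but honest bounds are worth the effort.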



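The `describe()` method shows descriptive statistics like mean, standard deviation, minimum, and maximum values for each numerical feature. These stats give us a snapshot of our data's central tendencies and spread. 📊 Because the result is computed under a privacy budget (`eps = 1`), the values are noisy: counts can differ from the true 150 rows, and the quantiles are not guaranteed to be in order.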

In [None]:
%%ag

ag_print(data.describe(eps = 1))

               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     212.000000    147.000000     147.000000    161.000000
mean    75.500000       5.620798      2.000000       3.718233      1.007954
std     43.445368       1.668323      0.142438       2.896114      0.296229
min      1.000000       4.000000      2.000000       1.000000      0.000000
25%     38.250000       5.168620      2.397140       2.929862      0.105924
50%     75.500000       6.084685      2.658683       4.660259      1.048370
75%    112.750000       6.768751      2.996746       4.657727      0.111331
max    150.000000       6.349513      4.242385       4.486505      2.498177



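The correlation matrix shows the relationships between different numerical features. It is also computed with `eps = 1`, so an entry can land slightly outside the usual [-1, 1] range, as the 1.106 value below does.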

In [None]:
%%ag

ag_print(data.corr(eps = 1))

              PetalLengthCm PetalWidthCm SepalLengthCm SepalWidthCm
PetalLengthCm           1.0     0.846679      1.106034     0.522887
PetalWidthCm       0.846679          1.0      0.767682    -0.145475
SepalLengthCm      1.106034     0.767682           1.0     0.092484
SepalWidthCm       0.522887    -0.145475      0.092484          1.0



In the Iris dataset, we see a high positive correlation (about 0.85) between `PetalLengthCm` and `PetalWidthCm`, indicating that as the length of the petals increases, the width also tends to increase. 🌱

## Pre-processing for Training the Model

Time to prep our data for training! 👊🏼 In this section, we import the necessary libraries, split the data into training and test sets, select the features, and encode the target variable.

### Importing the Necessary Libraries 📚
First, let's bring in all the tools we'll need for this ML journey. 🧰

In [None]:
%%ag
import pandas as pd
import op_pandas as opd
import numpy as np
import torch
from torch import nn, optim

from op_opacus import PrivateDPDataLoader, PrivacyEngine, ApplyModel, TrainModel, make_loss_private
from op_pandas import train_test_split

We are using libraries like `pandas` and `numpy` for data manipulation. `torch` is used for building and training the neural network model. `op_opacus` provides tools to ensure differential privacy during model training, and `op_pandas` offers functions tailored for differentially private data handling. 🕵🏻‍♀️

### Splitting the Data into Train and Test

We'll split our data into training and test sets to see how our model performs on new, unseen data. This is like dividing your study material before an exam—train on one part, test on the other. ✂️

In [None]:
%%ag
train, test = train_test_split(data)

In [None]:
%%ag
ag_print(train.columns)
ag_print(test.columns)

['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
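The notebook does not show how `op_pandas.train_test_split` chooses its rows, so the following is only a generic sketch of a shuffled hold-out split in plain Python. The function name `simple_train_test_split`, the 25% test fraction, and the seed are all assumptions made for illustration.

```python
import random


def simple_train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    # Keep the original row order inside each part.
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(150))  # stand-in for the 150 iris rows
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two parts, which is what keeps the test set "unseen" during training.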



Next, let's select the features (input variables) and the target variable for our model. This step involves choosing which columns will be used as inputs for training and which column will be the target we want to predict. 👀

In [None]:
%%ag
train_x = train[["PetalLengthCm", "PetalWidthCm", "SepalLengthCm", "SepalWidthCm"]]
train_y = train["Species"]
test_x = test[["PetalLengthCm", "PetalWidthCm", "SepalLengthCm", "SepalWidthCm"]]
test_y = test["Species"]

**Note:** We select four features: `PetalLengthCm`, `PetalWidthCm`, `SepalLengthCm`, and `SepalWidthCm`. These features will be used to train the model. The target variable is the species of the iris flowers, which we aim to predict.

In [None]:
%%ag
ag_print(train_x.describe(eps=1))

       PetalLengthCm  PetalWidthCm  SepalLengthCm  SepalWidthCm
count     140.000000     48.000000     146.000000    145.000000
mean        4.693648      1.154522       5.284139      2.927259
std         2.652672      0.844226       0.688012      0.309831
min         1.000000      0.000000       4.000000      2.000000
25%         3.706785      2.407373       5.539526      2.445628
50%         1.223581      0.735186       4.705366      3.048147
75%         3.491770      1.467781       4.846437      3.710183
max         6.696962      2.304352       4.371945      2.033162



### Encoding the Train / Test Outputs Using map

We encode the categorical target variable (species) into numerical values to facilitate model training. Most machine learning algorithms, including neural networks, require numerical input. Think of this step as translating the species names into numbers that our model can understand. 🔢

In [None]:
%%ag

def func_(x: str) -> int:
    if x == 'Iris-setosa':
        return 0
    elif x == 'Iris-versicolor':
        return 1
    else:  # Iris-virginica
        return 2

train_y_encoded = train_y.map(func_, output_bounds=(0, 2))
test_y_encoded = test_y.map(func_, output_bounds=(0, 2))

In [None]:
%%ag
train_y = train_y_encoded
test_y = test_y_encoded

In [None]:
%%ag

ag_print(train_y.metadata)
ag_print(test_y.metadata)

(0, 2)
(0, 2)



**Note:** Printing the metadata for the encoded target variables helps ensure that the encoding was performed correctly and that the numerical values fall within the expected bounds (0 to 2).
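The same encoding can be written as a lookup table, which also gives the inverse mapping for free. This is a plain-Python sketch on ordinary lists; the names `SPECIES_TO_CODE`, `encode`, and `decode` are hypothetical, and whether the private `map` accepts a function that closes over a dictionary is not something this notebook shows.

```python
SPECIES_TO_CODE = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
CODE_TO_SPECIES = {code: name for name, code in SPECIES_TO_CODE.items()}


def encode(label: str) -> int:
    # Unlike the if/else version, an unknown label raises KeyError
    # instead of silently becoming class 2.
    return SPECIES_TO_CODE[label]


def decode(code: int) -> str:
    return CODE_TO_SPECIES[code]


labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]
codes = [encode(label) for label in labels]
print(codes)
print([decode(code) for code in codes])
```

Failing loudly on an unexpected label is a design choice: it catches typos in the data early, at the cost of needing every class listed up front.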

### Creating the Data Loader, Neural Network, Optimiser and Loss Function

In this section, we will create the data loader, define the neural network model, set up the optimiser, and define the loss function. These are crucial steps for training a machine learning model. ✅

In [None]:
%%ag
# Data loader built from the private features and labels
data_loader = PrivateDPDataLoader.from_private_dataframe(
    [train_x, train_y], dtypes=[torch.float, torch.long]
)

# Sequential model: 4 inputs -> 16 hidden units -> 3 class probabilities
model = nn.Sequential(nn.Linear(4, 16),
                      nn.ReLU(),
                      nn.Linear(16, 3),
                      nn.Softmax(dim=1))  # explicit dim avoids PyTorch's implicit-dim warning

# Stochastic gradient descent optimiser
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Cross-entropy loss, wrapped so that the per-epoch average loss can be shown
PrivateCrossEntropyLoss = make_loss_private(nn.CrossEntropyLoss)
loss_function = PrivateCrossEntropyLoss()

**Note:**

- First, we'll create the `PrivateDPDataLoader` from our private DataFrame and define the neural network architecture so that it is compatible with training using op_opacus. This network will have an input layer, one hidden layer with ReLU activation, and an output layer with softmax activation. 🧠
- We use Stochastic Gradient Descent (SGD) as the optimiser with a learning rate of 0.01.
- The loss function is cross-entropy loss, made private using `make_loss_private`. It is necessary to do so as it enables op_opacus to calculate average loss per epoch and present it while training. 🥳
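To make the output layer and the loss concrete, here is a small worked example in plain Python for a single sample. The logits are made-up numbers and the helper names `softmax` and `cross_entropy` are hypothetical; this mirrors the textbook definitions rather than the PyTorch implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target_index])


logits = [2.0, 0.5, -1.0]  # hypothetical scores for the three species
probs = softmax(logits)
loss = cross_entropy(probs, target_index=0)
print([round(p, 3) for p in probs], round(loss, 3))
```

One design note: PyTorch documents `nn.CrossEntropyLoss` as expecting unnormalised logits and applying log-softmax itself, so with a final `nn.Softmax` layer the normalisation is effectively applied twice. Training still works, but it may be worth experimenting with dropping the softmax layer and applying it only at prediction time.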

### Making the Privacy Engine

We're setting up the privacy engine to keep our model training private. This step ensures our training process adheres to differential privacy standards. 🔒🔧

In [None]:
%%ag
privacy_engine = PrivacyEngine()

privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    target_epsilon=3,
    target_delta=1e-5,
    epochs=10,
    max_grad_norm=1,
)




**Note:**

- The `PrivacyEngine` is used to enforce differential privacy during training.
- The `make_private_with_epsilon` method configures the model, optimiser, and data loader with the specified privacy budget (`target_epsilon` and `target_delta`), the number of epochs, and the maximum gradient norm used for clipping (`max_grad_norm`).
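The role of `max_grad_norm` is easiest to see in a toy version of a DP-SGD gradient step: clip each sample's gradient, sum, add Gaussian noise, then average. The sketch below is plain Python and is not the op_opacus implementation. In upstream Opacus, `make_private_with_epsilon` derives the noise level from the target epsilon, delta, and number of epochs; the sketch assumes op_opacus behaves similarly and simply plugs in a hypothetical `noise_multiplier` of 1.1. The function names and gradients are also made up.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_grad_norm / (norm + 1e-6))
    return [g * factor for g in grad]


def private_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise scales with the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_sample_grads) for n in noisy]


grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # hypothetical per-sample gradients
print(private_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=random.Random(0)))
```

Clipping bounds the influence of any single training row, and the noise hides what remains of it; a smaller target epsilon generally requires more noise.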

In [None]:
%%ag

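def train_callable(model, optimizer, z, loss_function):
    # z is one batch from the data loader: (features, labels)
    inputs = z[0]
    labels = z[1]
    optimizer.zero_grad()                  # reset gradients from the previous step
    outputs = model(inputs)                # forward pass
    loss = loss_function(outputs, labels)  # compute the loss
    loss.backward()                        # backward pass
    optimizer.step()                       # update the model parameters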

**Note:** The `train_callable` function performs a single training step: it resets the gradients, performs a forward pass, computes the loss, performs a backward pass, and updates the model parameters.

## Training the Model

Finally, let's get this model trained! We'll use our defined privacy engine, loss function, and training callable. Training is where the magic happens, as our model learns from the data. ✨

In [None]:
%%ag

train_model = TrainModel(privacy_engine, loss_function)
train_model.train(train_callable, verbose=2)

Epoch 1 completed.
 Epsilon used: 1.4517814813791448,For target delta: 1e-05
Average loss for this epoch : 1.1122778595229725, time taken: 1.4373948409920558 seconds.

Epoch 2 completed.
 Epsilon used: 1.7286017897530341,For target delta: 1e-05
Average loss for this epoch : 1.0607246393742769, time taken: 1.5032620900019538 seconds.

Epoch 3 completed.
 Epsilon used: 1.9412375107726951,For target delta: 1e-05
Average loss for this epoch : 0.9102800861898676, time taken: 1.5879814669897314 seconds.

Epoch 4 completed.
 Epsilon used: 2.1253217178212642,For target delta: 1e-05
Average loss for this epoch : 0.9270107269287109, time taken: 1.5768276260059793 seconds.

Epoch 5 completed.
 Epsilon used: 2.2921996452700104,For target delta: 1e-05
Average loss for this epoch : 0.8705504393577576, time taken: 1.3092820619931445 seconds.

Epoch 6 completed.
 Epsilon used: 2.447055361402201,For target delta: 1e-05
Average loss for this epoch : 0.88616097340217, time taken: 1.4900863580114674 secon




**Note:**

- The `TrainModel` class initialises the training process with the configured privacy engine and loss function.
- The `train` method is called to start the training process, with `verbose=2` to display detailed progress information for each epoch.

## Getting the Accuracy of the Model

Time to see how well our model performs on the test dataset. This involves applying the model to the test data, decoding the predictions, and calculating the accuracy. This is the moment of truth! 🤓

In [None]:
%%ag

test_model = ApplyModel(privacy_engine=privacy_engine)
output_bounds = {
    "Iris-setosa": (0, 1),
    "Iris-versicolor": (0, 1),
    "Iris-virginica": (0, 1)
}

out = test_model.apply_model_private(test_x, dtype=torch.float, output_col_names=["Iris-setosa", "Iris-versicolor", "Iris-virginica"], output_bounds = output_bounds)




In [None]:
%%ag

ag_print(out.columns)

['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']



**Note:**

- We use `ApplyModel` from `op_opacus` to apply the trained model to the test data while ensuring differential privacy.
- The `apply_model_private` method generates predictions for the test data with specified output bounds for each class.
- We print the column names of the prediction output to verify the predictions for each class.

#### Decoding the Predictions
Let's decode the predictions. The model outputs one probability column per class, so we first pick the most probable class name for each row and then translate it into the same integer code used for `test_y`, which lets us compare the two. 🗣️

In [None]:
%%ag
out = out.idxmax()

def func_(x: str) -> int:
    if x == 'Iris-setosa':
        return 0
    elif x == 'Iris-versicolor':
        return 1
    else:  # Iris-virginica
        return 2

out = out.map(func_, output_bounds=(0, 2))

**Note:**

- The `idxmax` function finds the class with the highest probability for each prediction.
- We define a function to map the class labels back to their numerical representations (0 for Iris-setosa, 1 for Iris-versicolor, and 2 for Iris-virginica).
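The decoding step can be sketched on ordinary Python lists as follows. The probability rows are invented, `decode_row` is a hypothetical helper, and the sketch assumes, as the note above describes, that the private `idxmax` picks the most probable class for each row.

```python
CLASS_NAMES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def decode_row(probabilities):
    """Return (class name, integer code) for the most probable class."""
    code = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASS_NAMES[code], code


rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]  # hypothetical model outputs
for row in rows:
    print(decode_row(row))
```

Because the column order matches the integer codes used during encoding, the position of the largest probability is already the predicted label.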

#### Calculating the Accuracy
Let's check the accuracy of our predictions by comparing the predicted values with the actual labels from the test set. This is where we see how well our model has learned to classify the irises. 🌼📈

In [None]:
%%ag

pred_correctness = (test_y==out)

pred_acc = pred_correctness.sum(eps = 1) / pred_correctness.count(eps = 1)

ag_print(pred_acc)

0.7433080849342009



**Note:**

- We compare the predicted values (`out`) with the actual labels (`test_y`) to determine the correctness of each prediction.
- The accuracy is calculated as the proportion of correct predictions to the total number of predictions.
- We print the accuracy, which is the fraction of correctly classified instances in the test set (about 0.74 here). Because both the sum and the count are computed with `eps = 1`, this value is a noisy estimate rather than the exact accuracy.
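The following plain-Python sketch shows how a ratio of a noisy sum to a noisy count behaves, and how the budget of the two queries adds up under basic sequential composition. It assumes Laplace noise with sensitivity 1 for both queries, which may differ from what `op_pandas` does internally; `noisy_accuracy` and the correctness flags are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_accuracy(correct_flags, eps_sum, eps_count, rng):
    """Ratio of a noisy sum to a noisy count; each query has sensitivity 1."""
    noisy_sum = sum(correct_flags) + laplace_noise(1.0 / eps_sum, rng)
    noisy_count = len(correct_flags) + laplace_noise(1.0 / eps_count, rng)
    # Clamping is post-processing, so it costs no extra privacy budget.
    accuracy = min(max(noisy_sum / noisy_count, 0.0), 1.0)
    return accuracy, eps_sum + eps_count  # basic sequential composition


flags = [1] * 28 + [0] * 10  # hypothetical: 28 of 38 test rows classified correctly
acc, eps_spent = noisy_accuracy(flags, eps_sum=1.0, eps_count=1.0, rng=random.Random(0))
print(acc, eps_spent)
```

On a small test set the noise is large relative to the counts, so repeated runs can report noticeably different accuracies, and each run spends additional budget.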

## Exporting the Model

In this section, we will export the trained model's state dictionary, which contains all the learned parameters (weights and biases). This allows us to save the model for future use without retraining it. Think of this as saving your game progress so you can pick up right where you left off. 🎮

In [None]:
%%ag
state_dict = model.state_dict()

export(state_dict, 'state_dict')

Setting up exported variable in local environment: state_dict


**Note:**

- The `state_dict` method retrieves the model's parameters, including weights and biases.
- The `export` function sends the state dictionary from the remote AG kernel to the local environment, where it becomes available as the variable `state_dict`.

## Importing the Model

This section demonstrates how to import the saved state dictionary and load it into a new model instance. This is super handy when you want to use a trained model without retraining it. 🚀

In [None]:
from collections import OrderedDict

state_dict_serializable = OrderedDict((key, value.numpy().tolist()) for key, value in state_dict.items())

# `state_dict_serializable` now holds the weights and biases as nested Python lists
for key, value in state_dict_serializable.items():
    print(f"{key}:\n{value}\n")

0.weight:
[[0.22285522520542145, 0.5076473355293274, -0.7087080478668213, 0.8396198153495789], [-0.8552817702293396, 0.6698233485221863, -0.024328814819455147, -0.34753310680389404], [-0.364187091588974, 0.28911224007606506, 0.5861486792564392, 0.37813711166381836], [-0.3704695701599121, -0.006869448348879814, 0.0845165029168129, 0.4203955829143524], [-0.06879377365112305, 0.35394763946533203, 0.09323772042989731, 0.20437338948249817], [-0.1443086713552475, 0.5684356093406677, -0.346179336309433, 0.2015683650970459], [-0.4188190698623657, -0.5016990900039673, 0.2140047252178192, -0.06029246747493744], [-0.021660931408405304, 0.04751019552350044, -0.6953786015510559, -0.5345582962036133], [0.5547587275505066, 0.2320217788219452, 0.44901153445243835, -0.8750210404396057], [0.9238684177398682, 0.5569440722465515, -0.41409748792648315, -0.8329393863677979], [0.34989067912101746, -0.27599990367889404, -0.28253260254859924, -0.5326759815216064], [-0.3412042260169983, 0.2241271734237671, 0.56

In [None]:
session.private_import(state_dict_serializable, 'state_dict_serializable')

dict cached to server, loading to kernel...
Dict loaded successfully to the kernel



**Note:**

The `private_import` function uploads the serialised state dictionary from the local environment to the AG kernel, where it is available as `state_dict_serializable`.

### Recreating the Model
We'll recreate the model architecture and load the saved parameters to restore the trained model. 🛠️

In [None]:
%%ag
model = nn.Sequential(nn.Linear(4, 16),
                      nn.ReLU(),
                      nn.Linear(16, 3),
                      nn.Softmax(dim=1))  # explicit dim avoids PyTorch's implicit-dim warning

In [None]:
%%ag
import numpy as np
for k, v in state_dict_serializable.items():
    # float32 matches the dtype of the freshly created model parameters
    state_dict_serializable[k] = torch.from_numpy(np.array(v, dtype=np.float32))

**Note:**

- We recreate the neural network architecture to match the original model structure. This is necessary for loading the state dictionary correctly.
- We convert the serialised state dictionary values from lists back to torch tensors to ensure compatibility with the model.
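Before loading, it can help to confirm that the serialised dictionary really matches the 4-16-3 architecture. The check below is a plain-Python sketch that runs on nested lists without `torch`. The key names follow PyTorch's usual `nn.Sequential` naming (`0.weight` appears in the output above; the other three are assumed by the same convention), and `nested_shape` and `check_state_dict` are hypothetical helpers.

```python
from collections import OrderedDict


def nested_shape(value):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0]
    return tuple(shape)


# nn.Linear stores its weight as (out_features, in_features).
EXPECTED_SHAPES = OrderedDict([
    ("0.weight", (16, 4)),
    ("0.bias", (16,)),
    ("2.weight", (3, 16)),
    ("2.bias", (3,)),
])


def check_state_dict(serialized):
    """Raise ValueError if keys or shapes differ from the 4-16-3 network."""
    if list(serialized) != list(EXPECTED_SHAPES):
        raise ValueError(f"unexpected keys: {list(serialized)}")
    for key, expected in EXPECTED_SHAPES.items():
        actual = nested_shape(serialized[key])
        if actual != expected:
            raise ValueError(f"{key}: expected {expected}, got {actual}")


# Dummy all-zero parameters standing in for the exported values.
dummy = OrderedDict(
    (key, [[0.0] * shape[1] for _ in range(shape[0])] if len(shape) == 2 else [0.0] * shape[0])
    for key, shape in EXPECTED_SHAPES.items()
)
check_state_dict(dummy)
print("shapes OK")
```

A mismatch caught here gives a clearer error than a failed `load_state_dict` call on the remote kernel.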

In [None]:
%%ag
model.load_state_dict(state_dict_serializable)

**Note:**

The `load_state_dict` method loads the saved parameters into the recreated model, restoring it to its trained state.

### Wrapping Up
Finally, let's end the session. It's always good to clean up when you're done. 🧹

In [None]:
session.terminate_session()

{'status': 'ok'}

### Conclusion

And that's a wrap! 🎉

- This notebook provided a comprehensive guide to implementing differential privacy in machine learning workflows using Antigranular.
- We demonstrated that a model can still learn a useful classifier under a differential privacy budget (a noisy test accuracy of about 0.74 at a target epsilon of 3), which is crucial for applications involving sensitive information.
- The ability to save and reload the model ensures that the trained model can be reused without retraining, enhancing efficiency and practicality.

By following the steps outlined in this notebook, you can build, train, and evaluate your own differentially private machine learning models, ensuring both accuracy and privacy in your data-driven applications. Woohoo! 💻