## AMLD Workshop Notebook 4

In this notebook, you will combine what you learned in notebook 3 to train a federated learning task with advanced data protection using gradient clipping and differential privacy.

## 🎯 OBJECTIVE

Train the same machine learning model collaboratively across multiple institutions using federated learning, without transferring raw patient data outside local environments. Additionally, incorporate gradient clipping during local training and inject calibrated differential privacy noise into the exchanged model updates. This bounds the sensitivity of each client's contribution and mitigates risks such as membership inference, model inversion, and reconstruction attacks, thereby providing formal privacy guarantees for the local training data.

<div style="background-color: rgba(182, 255, 18, 0.15); border-left: 5px solid #B6FF12; padding: 15px; margin: 10px 0;">
<h3>Real-world context</h3>
<p>In healthcare, data is often legally and ethically constrained to remain within institutional or national boundaries. As a result, pooling datasets into a single central location for AI training is frequently infeasible. Yet, developing robust and generalizable models requires access to diverse populations and larger sample sizes than any single institution can typically provide.

Federated learning addresses this tension by allowing each institution to train a model locally and share only model updates rather than raw data. In practice, this requires more than just algorithms: it depends on secure execution environments, identity management, authorization workflows, and auditability across all participating sites.

In this setup, CHORUS provides the secure processing environment and governance controls at each hospital node, while Tune Insight coordinates the federated training process across institutions. Together, they make cross-institution AI development operationally feasible while respecting data sovereignty and regulatory constraints.
</p>
</div>

### Setup

In [None]:
import pandas as pd
import uuid

from tuneinsight import Diapason, models
from tuneinsight.computations import HybridFL
import tuneinsight.utils.time_tools as time

#### Create clients

In [None]:
from amld_setup import *

# TODO: Please update the credentials below with those provided to you.
%env TI_USERNAME=amld-workshopX@tuneinsight.com
%env TI_PASSWORD=AMLD_workshop_X

In [None]:
client = Diapason.from_env()

In [None]:
client.healthcheck()

#### Create and share the project

Projects are the main unit of collaboration in Tune Insight. In a project, you define the computation to run in a federated setting and set the datasource used by your instance. The other participants likewise choose the data used by their instances. Once everything is set up, the federated analysis can be run using data from all instances, without centralizing the data.

In [None]:
PROJECT_NAME = f"project-4-{uuid.uuid4()}"

project = client.new_project(name=PROJECT_NAME, clear_if_exists=True)
project.share()

## Load the dataset

In [None]:
data_path = "data/data_0.csv"

In [None]:
df = pd.read_csv(data_path)
df

In [None]:
# Feel free to play around with the data if you want.

### Split the dataset into training and validation sets

In [None]:
df["split"] = "train"
df.loc[df.sample(frac=0.2, random_state=42).index, "split"] = "val"
df

Upload the data to the instance and set it on the project.

In [None]:
datasource = client.new_datasource(dataframe=df, name=f"patient_data_{uuid.uuid4()}", clear_if_exists=True)

In [None]:
project.set_datasource(datasource)

### Task Definition

In this notebook, we will define a machine learning task with the `ti-models` library, similar to what you did in notebook 2.

In [None]:
from torch import nn
from ti_models.models.ti_model import TIModel
from ti_models.preprocessing.preprocessing import Preprocessing, InputType
from ti_models.preprocessing.datasets.csv_dataset import CSVDataset
from ti_models.trainer.ti_trainer import TITrainer
from ti_models.trainer.ti_loss import TILoss
from ti_models.trainer.ti_optimizer import TIOptimizer, OptimizerType

Build the model architecture in PyTorch

In [None]:
INPUT_DIM = 4
N_CLASSES = 2

torch_model = nn.Sequential(
    nn.BatchNorm1d(INPUT_DIM, affine=False, track_running_stats=True),
    nn.Linear(INPUT_DIM, N_CLASSES),
    nn.Sigmoid(),
)

Construct the `TIModel` wrapper that adds constraints to data inputs

In [None]:
preprocessing = Preprocessing(input_type=InputType.TABULAR, input_shape=(INPUT_DIM,))

model = TIModel(
    name="LogisticRegression",
    description="Logistic Regression classifier for prostate cancer detection task",
    n_classes=N_CLASSES,
    torch_model=torch_model,
    preprocessing=preprocessing,
)

### Gradient Clipping

To enable differentially private aggregation, client updates must satisfy a bounded sensitivity constraint. We therefore clip gradients prior to aggregation so that their $\ell_2$ norm does not exceed a prescribed threshold.

For a given layer $\ell$, let $g_\ell$ denote its gradient and $C_\ell$ the clipping bound. The clipped gradient is defined as:

$$
g_\ell \leftarrow g_\ell \cdot \min\left(1, \frac{C_\ell}{\|g_\ell\|_2}\right).
$$

The clipping threshold is **layer-wise and adaptive**. Specifically, for each layer:

$$
C_\ell = \alpha \, \|W_\ell\|_2,
$$

where $W_\ell$ are the layer weights and $\alpha$ is a fixed scaling coefficient. Thus, the clipping bound scales linearly with the norm of the corresponding layer's parameters.

This formulation provides two key benefits:

- **Layer-wise proportional scaling:** Layers with larger weight norms are allowed proportionally larger updates, preventing systematic over-clipping of high-magnitude layers.
- **Improved privacy–utility trade-off:** By bounding sensitivity in proportion to each layer's scale, the Gaussian noise calibrated to the global $(\varepsilon, \delta)$-DP budget is better aligned with the model geometry, reducing unnecessary degradation of convergence.

In summary, gradients are norm-clipped per layer, with thresholds defined as a constant multiple of each layer's weight norm, enabling stable and efficient differentially private training.
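
The two formulas above can be sketched in a few lines of plain Python. This is only an illustration of the clipping rule, not the code that runs on the platform: the function names, the layer names, and the toy numbers are hypothetical, and treating `alpha` as equal to the 0.1 threshold used in this notebook is an assumption.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def clip_layerwise(grads, weights, alpha):
    """Clip each layer's gradient to C_l = alpha * ||W_l||_2.

    `grads` and `weights` map a (hypothetical) layer name to a flat list of floats.
    """
    clipped = {}
    for name, g in grads.items():
        bound = alpha * l2_norm(weights[name])  # C_l, the adaptive threshold
        norm = l2_norm(g)
        # Scaling factor min(1, C_l / ||g_l||); a zero gradient is left untouched.
        scale = 1.0 if norm == 0.0 else min(1.0, bound / norm)
        clipped[name] = [scale * v for v in g]
    return clipped

# Toy values: the weight gradient is too large and gets rescaled,
# the bias gradient is already within its bound and is unchanged.
weights = {"linear.weight": [3.0, 4.0], "linear.bias": [0.5]}
grads = {"linear.weight": [6.0, 8.0], "linear.bias": [0.01]}
clipped = clip_layerwise(grads, weights, alpha=0.1)
print(clipped)
```

After clipping, the norm of each layer's gradient is at most `alpha` times the norm of that layer's weights, which is the bounded sensitivity that the noise calibration relies on.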

Concretely, we set the base gradient clipping threshold to 0.1 and enable layer-wise adaptive clipping, which improves model accuracy while preserving comparable (ε, δ)-differential privacy guarantees.

In [None]:
GRADIENT_CLIPPING = 0.1
ADAPTIVE_CLIPPING = True

### Impact of the gradient clipping on the performance

Feel free to tune the gradient clipping threshold to optimize the final federated learning performance.

**Hints:**

- **High clipping threshold** → increases the sensitivity of client updates, requiring more DP noise during aggregation.

- **Low clipping threshold** → excessively restricts updates, which can slow local training convergence and degrade utility.

Selecting an appropriate value is therefore critical to achieving a good privacy–utility trade-off.
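
To make the two hints concrete, the sketch below compares three thresholds for a single update. It assumes, as a simplification, that the noise standard deviation is a fixed multiple of the clipping threshold; `GRAD_NORM` and `NOISE_MULTIPLIER` are hypothetical values chosen for illustration and are not taken from the platform.

```python
def clipped_norm(grad_norm, threshold):
    """Norm of an update after L2 clipping to `threshold`."""
    return min(grad_norm, threshold)

def signal_to_noise(grad_norm, threshold, noise_multiplier):
    """Ratio of the clipped update norm to the DP noise scale.

    Assumes the noise standard deviation equals noise_multiplier * threshold.
    """
    return clipped_norm(grad_norm, threshold) / (noise_multiplier * threshold)

GRAD_NORM = 1.0          # hypothetical norm of the raw update
NOISE_MULTIPLIER = 2.0   # hypothetical, fixed by the privacy budget

for threshold in (0.1, 1.0, 10.0):
    kept = clipped_norm(GRAD_NORM, threshold)
    snr = signal_to_noise(GRAD_NORM, threshold, NOISE_MULTIPLIER)
    print(f"C={threshold:>5}: update norm kept={kept:.2f}, signal/noise={snr:.3f}")
```

Under this assumption, a threshold far above the typical update norm only adds noise, while a threshold far below it keeps the same signal-to-noise ratio but shrinks every step, which slows convergence.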

Build the federated trainer `TITrainer`, which specifies the loss, the optimizer, and the batch size.

In [None]:
LEARNING_RATE = 0.1
MOMENTUM = 0.9
BATCH_SIZE = 16

loss = TILoss(loss_func=nn.CrossEntropyLoss())

optimizer = TIOptimizer(
    optimizer_type=OptimizerType.SECURE_SGD,
    lr=LEARNING_RATE,
    momentum=MOMENTUM
)

trainer = TITrainer(
    model=model,
    loss=loss,
    optimizer=optimizer,
    batch_size=BATCH_SIZE,
    gradient_clipping=GRADIENT_CLIPPING,
    adaptive_gradient_clipping=ADAPTIVE_CLIPPING
)

In [None]:
print(trainer)

#### Start with a preliminary local training on your dataset

Create the datasets for local training

In [None]:
train_df = df[df["split"] == "train"].drop(columns=["split"])
val_df = df[df["split"] == "val"].drop(columns=["split"])

train_dataset = CSVDataset(df=train_df)
val_dataset = CSVDataset(df=val_df)

Create a copy of the trainer so that local training does not modify the federated trainer.

In [None]:
local_trainer = TITrainer.unmarshal_binary(trainer.marshal_binary())

Train the model for 3 epochs using the trainer

In [None]:
local_trainer.train(
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    epochs=3,
    eval_after_epoch=True,
    logging_frequency=10,
)

In [None]:
local_accuracy = round(local_trainer.history.get_curve("accuracy")[-1][2], 4)
print("Local accuracy:", local_accuracy)

### Set Up the Parameters for Federated Learning

We perform 3 communication rounds, each consisting of 1 local training epoch per client before aggregation.
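
The round structure can be illustrated with a toy simulation of federated averaging on a one-parameter model. This is a minimal sketch under stated assumptions: the helper names and the client data are hypothetical, clients are weighted equally, and neither clipping nor DP noise is applied. It does not reproduce the platform's aggregation strategy.

```python
def local_epoch(weight, data, lr):
    """One pass of per-sample SGD on the squared error (w - x)^2 / 2."""
    for x in data:
        weight -= lr * (weight - x)
    return weight

def federated_training(client_data, rounds, local_epochs, lr):
    """Plain federated averaging with equal client weights."""
    global_weight = 0.0
    for _ in range(rounds):
        local_weights = []
        for data in client_data:
            w = global_weight  # each client starts from the current global model
            for _ in range(local_epochs):
                w = local_epoch(w, data, lr)
            local_weights.append(w)
        # Only the locally trained weights are shared and averaged, never the data.
        global_weight = sum(local_weights) / len(local_weights)
    return global_weight

clients = [[1.0, 1.2, 0.8], [3.0, 3.1, 2.9]]  # hypothetical local datasets
final_weight = federated_training(clients, rounds=3, local_epochs=1, lr=0.1)
print(final_weight)
```

Each call to the outer loop body corresponds to one communication round: local training on every client, followed by a single aggregation step.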

Differential privacy is a widely used formal definition of privacy. It requires that the output distribution of a randomized algorithm changes only slightly between two datasets that differ in a single record.

We enforce client-level differential privacy with parameters (ε, δ), which bound the information that the aggregated model can reveal about any individual training sample contained in a client's local dataset.

More formally:

- **ε (epsilon)** controls the privacy loss: smaller values provide stronger privacy guarantees but typically reduce model utility.
- **δ (delta)** bounds the probability that the ε guarantee fails, and is typically chosen to be much smaller than the inverse of the dataset size.
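
For reference, the standard definition behind these two parameters is the following. A randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in one record and every set of possible outputs $S$:

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta.
$$

With $\delta = 0$ and small $\varepsilon$, the factor $e^{\varepsilon} \approx 1 + \varepsilon$, so the two output distributions are nearly identical.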

Differential privacy is achieved by clipping client updates to bound sensitivity and adding calibrated Gaussian noise during aggregation. This ensures that the contribution of any single data point remains statistically indistinguishable within the specified (ε, δ) budget.
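
As a rough illustration of "calibrated" noise, the sketch below uses the classical Gaussian-mechanism formula $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \varepsilon$, whose analysis holds for $0 < \varepsilon < 1$. The platform may use a different accounting method, so this is an assumption, not its implementation. Updates are reduced to scalars for brevity, the function names are hypothetical, and the sensitivity is taken to be the clipping bound (add-or-remove-one adjacency).

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (analysis holds for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def noisy_sum(updates, clip, epsilon, delta, rng):
    """Clip scalar updates to [-clip, clip], sum them, and add Gaussian noise."""
    clipped = [max(-clip, min(clip, u)) for u in updates]
    # Adding or removing one clipped update changes the sum by at most `clip`.
    sigma = gaussian_sigma(clip, epsilon, delta)
    return sum(clipped) + rng.gauss(0.0, sigma)

rng = random.Random(0)               # seeded only to make the sketch repeatable
updates = [0.05, -0.30, 0.12, 0.40]  # hypothetical scalar updates
sigma = gaussian_sigma(0.1, epsilon=0.5, delta=1e-4)
result = noisy_sum(updates, clip=0.1, epsilon=0.5, delta=1e-4, rng=rng)
print(f"sigma={sigma:.4f}, noisy sum={result:.4f}")
```

The noise scale grows linearly with the clipping bound and shrinks as ε grows, which is why the clipping threshold and the privacy budget have to be tuned together.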

In [None]:
dp_epsilon = 1
dp_delta = 1e-4

In [None]:
params = models.HybridFLGenericParams(
    fl_rounds=3,
    num_workers=2,
    strategy=models.aggregation_strategy.AggregationStrategy.CONSTANT,
)

ml_params = models.HybridFLMachineLearningParams(
    local_epochs=1,
    batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    momentum=MOMENTUM
)

dp_params = models.HybridFLDpParams(
    delta=dp_delta,
    gradient_clipping=GRADIENT_CLIPPING,
    use_clipping_factor=ADAPTIVE_CLIPPING,
)

Define the computation (Federated Learning) on the project.

In [None]:
hybrid_fl = HybridFL(
    project=project,
    task_id="logreg",
    trainer=trainer,
    params=params,
    spec_params=ml_params,
    dp_params=dp_params,
    dp_epsilon=dp_epsilon,
)
hybrid_fl.max_timeout = 300 * 60 * time.SECOND

Clients authorize the project

In [None]:
project.request_authorization()

Here you can get a quick summary of the project:

In [None]:
project.display_overview()

In [None]:
project.display_datasources()

## Run the training

This will run federated learning on the network of four instances (three contributing nodes and a coordinating root node).

Note: you might encounter the following error:

> `InternalError: error happened internally: unexpected error: please contact the instance's administrator`

If that is the case, please call one of the organizers for assistance.

In [None]:
hybrid_fl.run()

### Retrieve the result history

In [None]:
results = project.fetch_results()[-1][1]

In [None]:
hybrid_fl.display_results(results.history)

In [None]:
import json

federated_accuracy = round(json.loads(results.history.metrics[-1])["accuracy"][-1][1], 4)
print("Local accuracy:", local_accuracy)
print("Federated accuracy:", federated_accuracy)