# Experiment Tracking

Experiment tracking means recording metrics as training runs. In a federated environment, there are more factors to consider than in an environment without FL (see [details on experiment tracking](https://nvflare.readthedocs.io/en/main/programming_guide/experiment_tracking/experiment_tracking_log_writer.html) in NVFlare).

## Introduction to distributed experiment tracking

In a federated computing setting, data is distributed across multiple devices or systems, and training is run on each device independently while preserving each client’s data privacy.

Consider a federated system with one server and many clients, where the server coordinates the ML training of the clients. There are two ways to interact with ML experiment tracking tools:

- Client-side experiment tracking: each client sends its logged metrics and parameters directly to the ML experiment tracking server (such as MLflow or Weights & Biases) or to the local file system (as with TensorBoard).
- Aggregated experiment tracking: clients send their logged metrics and parameters to the FL server, which forwards them to the ML experiment tracking server or writes them to the local file system.

NVFlare lets you configure either approach. In this example, we demonstrate aggregated experiment tracking, where the metrics are recorded on the server side.
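
To make the difference between the two modes concrete, here is a minimal, library-free sketch. The names `MetricEvent`, `track_client_side`, and `track_aggregated` are hypothetical and not part of NVFlare, and plain Python containers stand in for the tracking backends. It only illustrates where the metrics end up, not how NVFlare streams them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricEvent:
    """One logged value from one client (hypothetical type, not an NVFlare class)."""

    site: str
    tag: str
    value: float
    step: int


def track_client_side(events):
    """Client-side tracking: every site writes to its own backend."""
    backends = defaultdict(list)  # one independent store per site
    for e in events:
        backends[e.site].append((e.tag, e.step, e.value))
    return dict(backends)


def track_aggregated(events):
    """Aggregated tracking: sites forward events to the server, which owns one store."""
    server_store = []  # single store held by the FL server
    for e in events:
        # Prefix the tag with the site name so the curves stay distinguishable.
        server_store.append((f"{e.site}/{e.tag}", e.step, e.value))
    return server_store


# Made-up events for illustration only.
events = [
    MetricEvent("site-1", "training_loss", 2.31, 0),
    MetricEvent("site-2", "training_loss", 2.28, 0),
    MetricEvent("site-1", "training_loss", 1.97, 100),
]

per_site = track_client_side(events)
central = track_aggregated(events)
```

In the client-side case the results are spread over one store per site, while in the aggregated case a single store on the server holds every site's metrics.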

## Default in FedAvgJob

The FedJob API makes it easy to create job configurations, and by default it includes the `TBAnalyticsReceiver` for TensorBoard streaming. You can pass your own `analytics_receiver` of type `AnalyticsReceiver` as a parameter; if you leave it unspecified, `BaseFedJob` (nvflare/app_opt/pt/job_config/base_fed_job.py) sets up a `TBAnalyticsReceiver`.

The `TBAnalyticsReceiver` receives the logs streamed during the experiment and records them by saving them to TensorBoard event files on the FL server. See the [NVFlare documentation](https://nvflare.readthedocs.io/en/main/programming_guide/experiment_tracking/experiment_tracking_log_writer.html#tools-sender-logwriter-and-receivers) for details on the other available AnalyticsReceivers in NVFlare: `MLflowReceiver` and `WandBReceiver`.

## Add SummaryWriter and add_scalar for logging metrics

To keep things simple, we start from the code as it stood in part 1.1 earlier in this chapter and make the few modifications needed to log metrics for experiment tracking.

### Add import from Client API 

In order to add SummaryWriter to the client training code, we need to import it with the following line (at the top of client.py):

```python
from nvflare.client.tracking import SummaryWriter
```

After that, we need to add the following line after `flare.init()`:

```python
summary_writer = SummaryWriter()
```

We can then use `summary_writer` to log metrics. Here, `running_loss` is already available, so we log it with `add_scalar()`:

```python
summary_writer.add_scalar(tag="training_loss", scalar=running_loss, global_step=global_step)
```

Note that `add_scalar()` takes a `global_step`, which we compute on the line before the call:

```python
global_step = epoch * n_loaders + i
```

Also note that we log only once every 100 steps to reduce logging overhead.
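
The logging pattern described above can be sketched without NVFlare installed. In this sketch, `RecordingWriter` is a hypothetical stand-in that only mimics the `add_scalar(tag, scalar, global_step)` call used above, and the loop sizes and loss values are made up for illustration. It is not the tutorial's client.py, just a small model of the step arithmetic and the every-100-steps throttle.

```python
class RecordingWriter:
    """Hypothetical stand-in for SummaryWriter; it only records calls in memory."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, scalar, global_step):
        self.records.append((tag, scalar, global_step))


def train(writer, epochs=2, n_loaders=250, log_every=100):
    """Run a dummy loop and log the accumulated loss every `log_every` batches."""
    for epoch in range(epochs):
        running_loss = 0.0
        for i in range(n_loaders):
            # Placeholder for `running_loss += loss.item()` in real training code.
            running_loss += 1.0 / (epoch * n_loaders + i + 1)
            if i % log_every == log_every - 1:
                # The step keeps increasing across epochs, so points never overlap.
                global_step = epoch * n_loaders + i
                writer.add_scalar(
                    tag="training_loss",
                    scalar=running_loss,
                    global_step=global_step,
                )
                running_loss = 0.0


writer = RecordingWriter()
train(writer)
logged_steps = [step for _, _, step in writer.records]
```

With these made-up sizes (2 epochs of 250 batches each), `logged_steps` is `[99, 199, 349, 449]`: four logged points instead of 500, each placed at a unique position on the x-axis.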

You can see the full contents of the updated training code in client.py:

```python
!cat code/src/client.py
```

# Copyright (c) 2023, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import torch
from network import SimpleNetwork
from torch import nn
from torch.optim import SGD
from torch.utils.data.dataloader import DataLoader
from torchvision.datasets import CIFAR10
from torchvision.transforms import Compose, Normalize, ToTensor

import nvflare.client as flare

from nvflare.client.tracking import SummaryWriter

DATASET_PATH = "/tmp/nvflare/data"


def m

## View tensorboard results

To see the results, point TensorBoard at the location of the event files with the following command:

```commandline
tensorboard --logdir=/tmp/nvflare/jobs/workdir/server/simulate_job/tb_events
```
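
Before launching TensorBoard, it can help to confirm that event files were actually written. The sketch below is one way to check, assuming the standard TensorBoard naming convention in which event file names start with `events.out.tfevents`. The helper name `find_event_files` is hypothetical, and the directory is the one used in the command above.

```python
from pathlib import Path


def find_event_files(logdir):
    """Return all TensorBoard event files found anywhere under `logdir`, sorted by path."""
    root = Path(logdir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("events.out.tfevents*") if p.is_file())


if __name__ == "__main__":
    # Same directory that is passed to --logdir above.
    tb_dir = "/tmp/nvflare/jobs/workdir/server/simulate_job/tb_events"
    files = find_event_files(tb_dir)
    if not files:
        print(f"No event files found under {tb_dir}")
    for f in files:
        print(f)
```

If the list is empty, the job may not have reached a logging step yet, or the work directory may differ from the one assumed here.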

We now know how experiment tracking is achieved through metric logging, and how it is configured in a job with an `AnalyticsReceiver`. The same mechanism can stream various types of metric data.

To continue, please see [Understanding FLARE federated learning Job structure](../01.6_job_structure_and_configuration/01.1.6.1_understanding_fl_job.ipynb).