# Experiment Tracking using Weights & Biases (W&B)

![wnb-logo](https://raw.githubusercontent.com/wandb/wandb/020f30e567f6d168bc18eaa668ff063d28163fd7/docs/README_images/logo-light.svg)

**Weights and Biases (W&B)** is a platform that provides tools for tracking and visualizing machine learning experiments. It aims to enhance the workflow of researchers and practitioners by offering features to log, analyze, and compare various aspects of their experiments.

**W&B** allows you to easily log and monitor different parameters during training, such as model performance metrics, hyperparameters, and intermediate outputs. It provides interactive visualizations and dashboards to track the progress of your models over time.

To keep this notebook short, the code from the previous notebook has been moved to the `src/` folder and is imported from there. This section only shows how to use `wandb` with `lightning` and adds functionality that helps in understanding the training process.

In [3]:
!pip install -q wandb


In [1]:
import os

ROOT_DIR = os.path.dirname(os.path.abspath(''))
DATA_DIR = os.path.join(ROOT_DIR, 'data/food-101-tiny')

TRAIN_DATA_PATH = os.path.join(DATA_DIR, 'train')
VAL_DATA_PATH = os.path.join(DATA_DIR, 'valid')
SIMPLE_MODEL_CHECKPOINT = os.path.join(ROOT_DIR, 'pretrained/simple-lightning-epoch100/resnet18_epoch99.ckpt')

## Initialize a Project in W&B

[Log in](https://wandb.ai/site) to your W&B account and create a new project called `learning-food101-tiny`.

Copy your API key (it belongs to your account rather than to a single project) and log in through the Python SDK.

In [2]:
import getpass
import wandb

API_KEY = getpass.getpass()
wandb.login(key=API_KEY)

 ········

wandb: Currently logged in as: haritsahm. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True
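
For non-interactive runs (scripts, scheduled jobs), typing the key at a prompt is not always possible. Below is a minimal sketch, assuming the key has been exported in the `WANDB_API_KEY` environment variable, which the W&B SDK also recognizes. The helper names `resolve_api_key` and `login_to_wandb` are illustrative and not part of the SDK.

```python
import getpass
import os
from typing import Callable, Optional


def resolve_api_key(prompt: Callable[[], str] = getpass.getpass) -> Optional[str]:
    """Return the key from WANDB_API_KEY if set, otherwise ask for it."""
    key = os.environ.get("WANDB_API_KEY", "").strip()
    if key:
        return key
    # Fall back to an interactive prompt (hidden input by default).
    key = prompt().strip()
    return key or None


def login_to_wandb() -> bool:
    """Log in with the resolved key; wandb is imported only when this is called."""
    import wandb

    key = resolve_api_key()
    if key is None:
        raise ValueError("No W&B API key provided.")
    return wandb.login(key=key)
```

Keeping the key in the environment avoids pasting it into a notebook cell, where it could end up in saved output or version control.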

## Track Training Logs with W&B

To use **W&B** with PyTorch Lightning, pass a `WandbLogger` to the `logger` argument of the PyTorch Lightning `Trainer` (it is a logger, not a callback). Lightning then sends metrics, hyperparameters, and other metadata to your W&B project, and you can monitor training in real time through the W&B dashboard, which provides interactive charts and tables for analyzing and comparing experiments.

In [21]:
from lightning.pytorch.trainer import Trainer
from lightning.pytorch.loggers import WandbLogger
from src import models, dataset
from src.models import BasicBlock, ResNet18, ClassificationLightningModule

# Global Variables
NUM_CLASSES = 10
LEARNING_RATE = 0.0001
BATCH_SIZE = 4

# Load from checkpoint training pipeline
lit_model = ClassificationLightningModule.load_from_checkpoint(SIMPLE_MODEL_CHECKPOINT)

# Construct the datamodule
datamodule = dataset.Food101LitDatamodule(
    data_dir=DATA_DIR,
    batch_size=BATCH_SIZE,
    num_workers=4
)

# NOTE: Rename the project to your use case.
wandb_logger = WandbLogger(project="learning-food101-tiny")

training_config = {
    "accelerator": 'auto',
    "devices": 'auto',
    "precision": 32,
    "max_epochs": 10,
    "logger": wandb_logger,
}

trainer = Trainer(**training_config)
trainer.fit(model=lit_model, datamodule=datamodule)

# Sync current run with cloud.
wandb.finish()

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | net          | ResNet18           | 12.6 M
1 | criterion    | CrossEntropyLoss   | 0     
2 | train_acc    | MulticlassAccuracy | 0     
3 | val_metrics  | MetricCollection   | 0     
4 | test_metrics | MetricCollection   | 0     
----------------------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
50.251    Total estimated model params size (MB)



`Trainer.fit` stopped: `max_epochs=10` reached.



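Run history:

| Metric | History |
| --- | --- |
| epoch | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇██ |
| train/acc | ▁▃▅█▁▄█▅▇▇ |
| train/loss | █▅▃▂▅▂▁▃▁▃ |
| trainer/global_step | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇██ |
| val/acc | ▁▅▅▆▆▄▆█▅▆ |
| val/auroc | ▁▅▇▇▇█▇█▅▄ |
| val/f1 | ▁▅▅▆▆▅▆█▅▆ |
| val/loss | █▄▂▂▄▃▂▁▄▃ |
| val/prec | ▁▆▅▇▆▇▇█▄▆ |
| val/rec | ▁▅▅▆▆▄▆█▅▆ |

Run summary:

| Metric | Value |
| --- | --- |
| epoch | 9.0 |
| train/acc | 0.23931 |
| train/loss | 0.99751 |
| trainer/global_step | 3749.0 |
| val/acc | 0.0658 |
| val/auroc | 0.0068 |
| val/f1 | 0.0758 |
| val/loss | 1.2469 |
| val/prec | 0.09733 |
| val/rec | 0.0658 |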


While training is ongoing, you can monitor its progress from the W&B dashboard.

## Save Checkpoint to W&B

There are several ways to log a model or its checkpoints to W&B, as described in the [WandbLogger](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.wandb.html) documentation:
- Use `watch` to log gradients, parameters, and model topology
- Use `Callbacks` and log only the last or best model

For this experiment we're going to use the `Callbacks` method.
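
The callback approach keeps the checkpoint whose monitored metric is best, where "best" depends on a `mode` of `max` or `min`. The selection rule can be illustrated in plain Python. This is only a sketch of the idea, not Lightning's implementation, and the function name `best_epoch` and the metric values are hypothetical.

```python
from typing import Optional, Sequence


def best_epoch(values: Sequence[float], mode: str = "max") -> Optional[int]:
    """Return the index of the best value, or None if there are no values."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    if not values:
        return None
    pick = max if mode == "max" else min
    # Compare epochs by their metric value; ties resolve to the earliest epoch.
    return pick(range(len(values)), key=lambda i: values[i])


# Hypothetical per-epoch validation accuracies and losses.
val_acc = [0.10, 0.30, 0.20]
val_loss = [1.2, 0.9, 1.0]
print(best_epoch(val_acc, mode="max"), best_epoch(val_loss, mode="min"))
```

Accuracy-like metrics are monitored with `max`, loss-like metrics with `min`.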

In [23]:
from lightning.pytorch.callbacks import ModelCheckpoint

# NOTE: Rename the project to your use case.
wandb_logger = WandbLogger(
    project="learning-food101-tiny",
    log_model=True,
)

# Model Checkpoint Callback
## Full arguments are available from https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html#modelcheckpoint
checkpoint_callback = ModelCheckpoint(monitor="val/acc", mode="max")

# Define the Trainer configuration
## Full arguments are available from https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api
training_config = {
    "accelerator": 'auto',
    "devices": 'auto',
    "precision": 32,
    "max_epochs": 10,
    "logger": wandb_logger,
    "callbacks": [checkpoint_callback],
}

trainer = Trainer(**training_config)
trainer.fit(model=lit_model, datamodule=datamodule)

# Sync current run with cloud.
wandb.finish()

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | net          | ResNet18           | 12.6 M
1 | criterion    | CrossEntropyLoss   | 0     
2 | train_acc    | MulticlassAccuracy | 0     
3 | val_metrics  | MetricCollection   | 0     
4 | test_metrics | MetricCollection   | 0     
----------------------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
50.251    Total estimated model params size (MB)



`Trainer.fit` stopped: `max_epochs=10` reached.


In [24]:
# Sync current run with cloud.
wandb.finish()

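Run history:

| Metric | History |
| --- | --- |
| epoch | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇██ |
| train/acc | ▁▆▂▂█▂▄▆▆▂ |
| train/loss | █▄▇▆▂▆▄▁▃▃ |
| trainer/global_step | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇██ |
| val/acc | ▆▅▁▁▆▇▄▆▃█ |
| val/auroc | █▆▃█▁█▃▆▆█ |
| val/f1 | ▆▅▁▂▆▇▅▆▃█ |
| val/loss | ▃▇█▄▃▁▂▃▆▂ |
| val/prec | ▆▆▁▁██▆▃▅▇ |
| val/rec | ▆▅▁▁▆▇▄▆▃█ |

Run summary:

| Metric | Value |
| --- | --- |
| epoch | 9.0 |
| train/acc | 0.24078 |
| train/loss | 0.90191 |
| trainer/global_step | 3749.0 |
| val/acc | 0.0706 |
| val/auroc | 0.008 |
| val/f1 | 0.07985 |
| val/loss | 1.10898 |
| val/prec | 0.1 |
| val/rec | 0.0706 |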


After training finishes, you should see the last and best checkpoints among the run's artifacts.
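
To reuse a logged checkpoint later, it can be pulled back from the run's artifacts. Below is a minimal sketch, assuming `WandbLogger` stored the checkpoint under an artifact named `model-<run_id>` containing a file called `model.ckpt` (the convention described in the `WandbLogger` documentation; check both against your own run page). The helper names and the entity and run id passed to them are placeholders.

```python
from pathlib import Path


def checkpoint_reference(entity: str, project: str, run_id: str, alias: str = "best") -> str:
    """Build an artifact reference of the form ENTITY/PROJECT/model-RUN_ID:ALIAS."""
    return f"{entity}/{project}/model-{run_id}:{alias}"


def download_checkpoint(reference: str) -> Path:
    """Download the artifact and return the assumed checkpoint path."""
    import wandb  # imported lazily; requires a prior login

    artifact = wandb.Api().artifact(reference, type="model")
    artifact_dir = artifact.download()
    # Assumption: the checkpoint file inside the artifact is named model.ckpt.
    return Path(artifact_dir) / "model.ckpt"
```

The returned path could then be passed to `ClassificationLightningModule.load_from_checkpoint`, in the same way the local checkpoint is loaded earlier in this notebook.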

## Track and Visualize Image Predictions

**Explainability is a crucial aspect**, especially when developing computer vision models, as it allows us to **understand and interpret the decisions made** by these models. By providing explanations, we gain insights into how and why a particular prediction was made, which helps build trust, detect biases, and debug the models. Explainability is particularly important in applications such as medical diagnosis, autonomous vehicles, and critical decision-making systems where the impact of a wrong prediction can be significant.

One useful tool for explaining computer vision models is GradCAM (Gradient-weighted Class Activation Mapping), which is available in the [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam) repository on GitHub. GradCAM helps in visualizing and interpreting the decisions made by a deep neural network, specifically in the context of image classification tasks.

GradCAM **highlights the regions** of an input image that are **most influential in determining the model's prediction** for a specific class. It achieves this by computing the gradients of the target class score with respect to the feature maps in the final convolutional layer of the network. These gradients are averaged over each feature map to obtain one weight per channel, and the weighted sum of the feature maps, passed through a ReLU, gives a class activation map that visually represents the important regions of the image for the predicted class.

![grad-cam](https://miro.medium.com/v2/resize:fit:720/format:webp/0*nE5_ZjRhOcslpvXI.png)
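
The weighting step described above can be written out in a few lines of NumPy. This is a sketch of the computation only, assuming the activations and gradients of one image have already been extracted as arrays of shape `(C, H, W)`; the function name `grad_cam_map` is illustrative and the code is not taken from `pytorch-grad-cam`.

```python
import numpy as np


def grad_cam_map(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Combine (C, H, W) activations and gradients into an (H, W) map in [0, 1]."""
    if activations.shape != gradients.shape:
        raise ValueError("activations and gradients must have the same shape")
    # One weight per channel: the spatial average of that channel's gradient.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only regions that push the class score up.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Tiny hand-made example: two 2x2 feature maps.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.stack([np.full((2, 2), 2.0), np.full((2, 2), -1.0)])
print(grad_cam_map(acts, grads))
```

In practice the resulting map is upsampled to the input resolution before being overlaid on the image.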


In [37]:
!pip install -q grad-cam 



### Adding GradCAM

We add the CAM visualization to the validation step. Instead of constructing a new `LightningModule`, we inherit from the `ClassificationLightningModule` developed earlier, which lets us reuse the pipeline and override only `validation_step`. The code uses `EigenCAM` from the same `pytorch-grad-cam` package, a variant that needs no gradients and is not class-specific. Note that this adds computation and lengthens training time.
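
The cell that follows imports `EigenCAM` rather than a gradient-based method. As a rough sketch of how that variant differs, the map is the projection of the activations onto their first principal component, with no gradients or class scores involved. The NumPy function below, with the illustrative name `eigen_cam_map`, mirrors that idea for a single `(C, H, W)` activation array and is not the library's code.

```python
import numpy as np


def eigen_cam_map(activations: np.ndarray) -> np.ndarray:
    """Project (C, H, W) activations onto their first principal component."""
    channels, height, width = activations.shape
    # Treat every spatial location as one sample with C features.
    flat = activations.reshape(channels, -1).T
    flat = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions in channel space.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    projection = (flat @ vt[0]).reshape(height, width)
    # The sign of a singular vector is arbitrary; negative values are clipped.
    projection = np.maximum(projection, 0.0)
    peak = projection.max()
    return projection / peak if peak > 0 else projection


rng = np.random.default_rng(0)
cam = eigen_cam_map(rng.normal(size=(4, 3, 5)))
print(cam.shape)
```

Because no backward pass is required, this kind of map can be computed in an evaluation loop where gradients are disabled.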

In [8]:
import numpy as np
import torch
import wandb
from torchvision import transforms
from typing import Any

from src import dataset
from src.models import ClassificationLightningModule
from pytorch_grad_cam import EigenCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from wandb import wandb_run

# Undo the input normalization in two steps: multiply by std, then add the mean.
invTrans = transforms.Compose([
    transforms.Normalize(mean=[0., 0., 0.],
                         std=np.reciprocal(dataset.RGB_STD).tolist()),
    transforms.Normalize(mean=[-1 * item for item in dataset.RGB_MEAN],
                         std=[1., 1., 1.]),
])

class AdvancedLightningModule(ClassificationLightningModule):
    def __init__(
        self,
        net: torch.nn.Module,
        num_classes: int = 10,
        lr: float = 0.00001,
    ):

        super().__init__(net, num_classes, lr)

        # Compute the activation maps on the last residual stage
        target_layers = [net.layer4]

        # EigenCAM is class-agnostic; with targets=None the library
        # falls back to the predicted class.
        self.grad_cam_targets = None
        self.grad_cam_profiler = EigenCAM(net, target_layers)

    def validation_step(self, batch: Any, batch_idx: int) -> None:
        images, targets = batch
        loss, preds, targets = self.model_step(batch)

        vis_images = []
        grayscale_cams = self.grad_cam_profiler(input_tensor=images, targets=self.grad_cam_targets)

        # Build at most 16 annotated overlays per batch
        for idx in range(min(len(images), 16)):
            gt = targets[idx].detach().cpu()
            pred = torch.argmax(preds[idx].detach().cpu())
            # Undo the normalization; show_cam_on_image expects floats in [0, 1]
            norm_img = invTrans(images[idx]).clamp(0, 1)
            norm_img = torch.permute(norm_img, (1, 2, 0)) # C,H,W -> H,W,C
            cam_image = show_cam_on_image(norm_img.detach().cpu().numpy(), grayscale_cams[idx], use_rgb=True)
            wandb_image = wandb.Image(cam_image, caption=f"Target: {gt.item()} - Pred: {pred.item()}")
            vis_images.append(wandb_image)

        # Log the data to W&B
        for logger in self.loggers:
            if isinstance(logger.experiment, wandb_run.Run):
                logger.experiment.log({"val_images": vis_images})

        # update and log metrics
        metrics = self.val_metrics(preds, targets)
        self.log('val/loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        self.log('val/acc', metrics['acc'], on_step=False, on_epoch=True, prog_bar=True)
        self.log('val/prec', metrics['prec'], on_step=False, on_epoch=True, prog_bar=False)
        self.log('val/rec', metrics['rec'], on_step=False, on_epoch=True, prog_bar=False)
        self.log('val/auroc', metrics['auroc'], on_step=False, on_epoch=True, prog_bar=False)
        self.log('val/f1', metrics['f1'], on_step=False, on_epoch=True, prog_bar=True)
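
The `invTrans` pipeline in the cell above reverses the input normalization with two chained `Normalize` steps. The NumPy sketch below shows why that works: dividing by `1/std` multiplies by `std`, and subtracting `-mean` adds `mean`. The statistics used here are the common ImageNet values and are only an assumption; `dataset.RGB_MEAN` and `dataset.RGB_STD` may differ.

```python
import numpy as np

# Assumed statistics (common ImageNet values), used purely for illustration.
RGB_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
RGB_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize(image: np.ndarray, mean=RGB_MEAN, std=RGB_STD) -> np.ndarray:
    """Per-channel (x - mean) / std for a (C, H, W) image."""
    return (image - mean[:, None, None]) / std[:, None, None]


def denormalize(image: np.ndarray, mean=RGB_MEAN, std=RGB_STD) -> np.ndarray:
    """Invert normalize() using the same two steps as invTrans."""
    # Step 1: Normalize(mean=0, std=1/std) multiplies each channel by std.
    scaled = (image - 0.0) / (1.0 / std)[:, None, None]
    # Step 2: Normalize(mean=-mean, std=1) adds the mean back.
    shifted = (scaled - (-mean)[:, None, None]) / 1.0
    # Clip so the result is a valid image in [0, 1].
    return np.clip(shifted, 0.0, 1.0)


rng = np.random.default_rng(0)
original = rng.uniform(0.0, 1.0, size=(3, 4, 4)).astype(np.float32)
restored = denormalize(normalize(original))
print(np.allclose(restored, original, atol=1e-5))
```

The final clip matters because the overlay helper expects image values in the range 0 to 1, and floating-point round-off can push restored pixels slightly outside it.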



### Retrain Using Additional Logging

With the new pipeline, we retrain the model. After it finishes, you can check the logged samples in W&B.

In [9]:
from lightning.pytorch.trainer import Trainer
from src.models import BasicBlock, ResNet18, ClassificationLightningModule
from src import dataset
from lightning.pytorch.loggers import WandbLogger

# Global Variables
NUM_CLASSES = 10
LEARNING_RATE = 0.0001
BATCH_SIZE = 16

# Construct the model
model = ResNet18(3, NUM_CLASSES)

# Construct training pipeline
lit_model = AdvancedLightningModule(
    net = model,
    num_classes = NUM_CLASSES,
    lr = LEARNING_RATE,
)

# Construct the datamodule
datamodule = dataset.Food101LitDatamodule(
    data_dir=DATA_DIR,
    batch_size=BATCH_SIZE,
    num_workers=4
)

# NOTE: Rename the project to your use case.
wandb_logger = WandbLogger(
    project="learning-food101-tiny",
    log_model=True,
)

# Define the Trainer configuration
## Full arguments are available from https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api
training_config = {
    "accelerator": 'auto',
    "devices": 'auto',
    "max_epochs": 10,  # NOTE: the output shown below comes from a run with max_epochs=1
    "logger": wandb_logger,
}

trainer = Trainer(**training_config)
trainer.fit(model=lit_model, datamodule=datamodule)

# Sync current run with cloud.
wandb.finish()

trainer = None
datamodule = None
lit_model = None
model = None

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | net          | ResNet18           | 12.6 M
1 | criterion    | CrossEntropyLoss   | 0     
2 | train_acc    | MulticlassAccuracy | 0     
3 | val_metrics  | MetricCollection   | 0     
4 | test_metrics | MetricCollection   | 0     
----------------------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
50.251    Total estimated model params size (MB)



`Trainer.fit` stopped: `max_epochs=1` reached.


In [10]:
# Sync current run with cloud.
wandb.finish()

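Run history:

| Metric | History |
| --- | --- |
| epoch | ▁▁▁▁ |
| train/acc | ▁█ |
| train/loss | █▁ |
| trainer/global_step | ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| val/acc | ▁█ |
| val/auroc | █▁ |
| val/f1 | ▁█ |
| val/loss | ▁█ |
| val/prec | ▁█ |
| val/rec | ▁█ |

Run summary:

| Metric | Value |
| --- | --- |
| epoch | 0.0 |
| train/acc | 0.172 |
| train/loss | 2.33883 |
| trainer/global_step | 374.0 |
| val/acc | 0.2 |
| val/auroc | 0.004 |
| val/f1 | 0.2 |
| val/loss | 3.0033 |
| val/prec | 0.2 |
| val/rec | 0.2 |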
