<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

# 5. Assessment with WANDB

Congratulations on going through today's course! Hope it was a fun journey with some new skills as souvenirs. Now it's time to put those skills to the test.

Here's the challenge: Let's say we have a have a classification model that uses LiDAR data to classify spheres and cubes. Compared to RGB cameras, LiDAR sensors are not as easy to come by, so we'd like to convert this model so it can classify RGB images instead. Since we used [NVIDIA Omniverse](https://www.nvidia.com/en-us/omniverse/) to generate LiDAR and RGB data pairs, let's use this data to create a contrastive pre-training model. Since CLIP is already taken, we will call this model `CILP` for "Contrastive Image LiDAR Pre-training". 

## 5.1 Setup

Let's get started. Below are the libraries used in this assessment.

In [1]:
import numpy as np
from PIL import Image

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader

from assessment import assessment_utils
import utils

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()


import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mjosefpribbernow[0m ([33mjosefpribbernow-hasso-plattner-institute[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

### 5.1.1 The Model

Next, let's load our classification model and call it `lidar_cnn`. If we take a moment to view the [assement_utils](assessment/assesment_utils.py), we can see the `Classifier` class used to construct the model. Please note the `get_embs` method, which we will be using to construct our cross-modal projector.

In [2]:
lidar_cnn = assessment_utils.Classifier(1).to(device)
lidar_cnn.load_state_dict(torch.load("assessment/lidar_cnn.pt", weights_only=True))
# Do not unfreeze. Otherwise, it would be difficult to pass the assessment.
for param in lidar_cnn.parameters():
    lidar_cnn.requires_grad = False
lidar_cnn.eval()

Classifier(
  (embedder): Sequential(
    (0): Conv2d(1, 50, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(50, 100, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(100, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU()
    (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (9): Conv2d(200, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (10): ReLU()
    (11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (12): Flatten(start_dim=1, end_dim=-1)
  )
  (classifier): Sequential(
    (0): Linear(in_features=3200, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=1, bias=True)
  )
)

### 5.1.2 The Dataset

Below is the dataset we will be using in this assessment. It is similar to the dataset we used in the first few labs, but please note `self.classes`. Unlike the first lab where we predicted position, in this lab, we will determine whether the RGB or LiDAR we are evaluating contains a `cube` or a `sphere`.

In [3]:
IMG_SIZE = 64
img_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToTensor(),  # Scales data into [0,1]
])

class MyDataset(Dataset):
    def __init__(self, root_dir, start_idx, stop_idx):
        self.classes = ["cubes", "spheres"]
        self.root_dir = root_dir
        self.rgb = []
        self.lidar = []
        self.class_idxs = []

        for class_idx, class_name in enumerate(self.classes):
            for idx in range(start_idx, stop_idx):
                file_number = "{:04d}".format(idx)
                rbg_img = Image.open(self.root_dir + class_name + "/rgb/" + file_number + ".png")
                rbg_img = img_transforms(rbg_img).to(device)
                self.rgb.append(rbg_img)
    
                lidar_depth = np.load(self.root_dir + class_name + "/lidar/" + file_number + ".npy")
                lidar_depth = torch.from_numpy(lidar_depth[None, :, :]).to(torch.float32).to(device)
                self.lidar.append(lidar_depth)

                self.class_idxs.append(torch.tensor(class_idx, dtype=torch.float32)[None].to(device))

    def __len__(self):
        return len(self.class_idxs)

    def __getitem__(self, idx):
        rbg_img = self.rgb[idx]
        lidar_depth = self.lidar[idx]
        class_idx = self.class_idxs[idx]
        return rbg_img, lidar_depth, class_idx

This data is available in the `/data/assessment` folder. Here is an example of one of the cubes. The images are small, but there is enough detail that our models will be able to tell the difference.

<center><img src="data/assessment/cubes/rgb/0002.png" /></center>

Let's go ahead and load the data into a `DataLoader`. We'll set aside a few batches (`VALID_BATCHES`) for validation. The rest of the data will be used for training. We have `9999` images for each of the cube and sphere categories, so we'll multiply N times 2 to reflect the combined dataset.

In [4]:
BATCH_SIZE = 32
VALID_BATCHES = 10
N = 9999

valid_N = VALID_BATCHES*BATCH_SIZE
train_N = N - valid_N

train_data = MyDataset("data/assessment/", 0, train_N)
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
valid_data = MyDataset("data/assessment/", train_N, N)
valid_dataloader = DataLoader(valid_data, batch_size=BATCH_SIZE, shuffle=False, drop_last=True)

N *= 2
valid_N *= 2
train_N *= 2

In [5]:
# W&B Configuration - Change this for different tasks
TASK_NAME = "baseline"  # Change to "task1", "task2", "task3", etc.
WANDB_TAGS = ["baseline", "contrastive"]  # Add descriptive tags for filtering

## 5.2 Contrastive Pre-training

Before we create a cross-modal projection model, it would be nice to have a way to embed our RGB images as a starting point. Let's be efficient with our data and create a contrastive pre-training model. First, it would help to have a convolutional model. We've prepared a recommended architecture below.

In [6]:
CILP_EMB_SIZE = 200

class Embedder(nn.Module):
    def __init__(self, in_ch, emb_size=CILP_EMB_SIZE):
        super().__init__()
        kernel_size = 3
        stride = 1
        padding = 1

        # Convolution
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 50, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(50, 100, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(100, 200, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(200, 200, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten()
        )

        # Embeddings
        self.dense_emb = nn.Sequential(
            nn.Linear(200 * 4 * 4, 100),
            nn.ReLU(),
            nn.Linear(100, emb_size)
        )

    def forward(self, x):
        conv = self.conv(x)
        emb = self.dense_emb(conv)
        return F.normalize(emb)

The RGB data has `4` channels, and our LiDAR data has `1`. Let's initiate these embedding models respectively.

In [7]:
img_embedder = Embedder(4).to(device)
lidar_embedder = Embedder(1).to(device)

Now that we have our embedding models, let's combine them into a `ContrastivePretraining` model.

In [8]:
class ContrastivePretraining(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_embedder = img_embedder
        self.lidar_embedder = lidar_embedder
        self.cos = nn.CosineSimilarity()

    def forward(self, rgb_imgs, lidar_depths):
        img_emb = self.img_embedder(rgb_imgs)
        lidar_emb = self.lidar_embedder(lidar_depths)

        repeated_img_emb = img_emb.repeat_interleave(len(img_emb), dim=0)
        repeated_lidar_emb = lidar_emb.repeat(len(lidar_emb), 1)

        similarity = self.cos(repeated_img_emb, repeated_lidar_emb)
        similarity = torch.unflatten(similarity, 0, (BATCH_SIZE, BATCH_SIZE))
        similarity = (similarity + 1) / 2

        logits_per_img = similarity
        logits_per_lidar = similarity.T
        return logits_per_img, logits_per_lidar

Time to put these models to the test! First, let's initialize the model.

In [9]:
CILP_LR = 0.0001
CILP_model = ContrastivePretraining().to(device)
optimizer = torch.optim.AdamW(CILP_model.parameters(), lr=CILP_LR)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=1)
loss_img = nn.CrossEntropyLoss()
loss_lidar = nn.CrossEntropyLoss()
ground_truth = torch.arange(BATCH_SIZE, dtype=torch.long).to(device)
epochs = 3

# Initialize W&B for CILP training
wandb.init(
    project="cilp-extended-assessment",
    group=TASK_NAME,
    name="CILP_contrastive_pretraining",
    tags=WANDB_TAGS + ["contrastive_pretraining"],
    config={
        "learning_rate": CILP_LR,
        "architecture": "CILP_Contrastive",
        "embedding_size": CILP_EMB_SIZE,
        "batch_size": BATCH_SIZE,
        "epochs": epochs,
        "optimizer": optimizer.__class__.__name__,
        "scheduler": "ReduceLROnPlateau",
        "fusion_strategy": "contrastive",
        "num_params": sum(p.numel() for p in CILP_model.parameters() if p.requires_grad),
    }
)

Before we can train the model, we should define a loss function to guide our model in learning.

In [10]:
def get_CILP_loss(batch):
    rbg_img, lidar_depth, class_idx = batch
    logits_per_img, logits_per_lidar = CILP_model(rbg_img, lidar_depth)
    total_loss = (loss_img(logits_per_img, ground_truth) + loss_lidar(logits_per_lidar, ground_truth))/2
    return total_loss, logits_per_img

Next, it's time to train. If the above `TODO`s were completed correctly, the loss should be under `3.2`. Are the values along the diagional close to `1`?

In [11]:
for epoch in range(epochs):
    CILP_model.train()
    train_loss = 0
    for step, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        loss, logits_per_img = get_CILP_loss(batch)
        loss.backward()
        train_loss += loss.item()
        optimizer.step()
    
    avg_train_loss = train_loss/step
    assessment_utils.print_CILP_results(epoch, avg_train_loss, logits_per_img, is_train=True)

    CILP_model.eval()
    valid_loss = 0
    for step, batch in enumerate(valid_dataloader):
        loss, logits_per_img = get_CILP_loss(batch)
        valid_loss += loss.item()
    
    avg_valid_loss = valid_loss/step
    assessment_utils.print_CILP_results(epoch, avg_valid_loss, logits_per_img, is_train=False)
    
    # Step the scheduler based on validation loss
    scheduler.step(avg_valid_loss)
    current_lr = optimizer.param_groups[0]['lr']
    
    # Log metrics to W&B
    wandb.log({
        "cilp_train/loss": avg_train_loss,
        "cilp_valid/loss": avg_valid_loss,
        "learning_rate": current_lr,
        "epoch": epoch,
    })

# Log similarity matrix at end of training as an image
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(logits_per_img.detach().cpu().numpy(), cmap='viridis', aspect='auto')
ax.set_xlabel('LiDAR Index')
ax.set_ylabel('Image Index')
ax.set_title('Similarity Matrix')
plt.colorbar(im, ax=ax)

wandb.log({"similarity_matrix": wandb.Image(fig)})
wandb.finish()

plt.close(fig)

Epoch 0
Train Loss: 3.0831294779358416 
Similarity:
tensor([[0.9956, 0.6243, 0.9475,  ..., 0.7311, 0.4361, 0.8131],
        [0.6609, 0.9916, 0.8185,  ..., 0.2307, 0.6002, 0.9576],
        [0.9529, 0.7629, 0.9931,  ..., 0.5457, 0.4802, 0.9311],
        ...,
        [0.7098, 0.2297, 0.5117,  ..., 0.9951, 0.2588, 0.3361],
        [0.4438, 0.5700, 0.4727,  ..., 0.3012, 0.9930, 0.4772],
        [0.8054, 0.9327, 0.9325,  ..., 0.3436, 0.5187, 0.9980]],
       device='cuda:0', grad_fn=<DivBackward0>)
Valid Loss: 3.1898251960152075 
Similarity:
tensor([[0.9929, 0.8187, 0.3802,  ..., 0.4351, 0.5168, 0.9863],
        [0.8388, 0.9929, 0.2445,  ..., 0.2492, 0.3178, 0.8504],
        [0.3461, 0.2512, 0.9953,  ..., 0.6651, 0.5370, 0.2902],
        ...,
        [0.4139, 0.2344, 0.6133,  ..., 0.9965, 0.9845, 0.4190],
        [0.4982, 0.2953, 0.5056,  ..., 0.9711, 0.9959, 0.5141],
        [0.9864, 0.8342, 0.2877,  ..., 0.4280, 0.5314, 0.9947]],
       device='cuda:0', grad_fn=<DivBackward0>)
Epoch 1
Trai

0,1
cilp_train/loss,█▂▁
cilp_valid/loss,█▄▁
epoch,▁▅█
learning_rate,▁▁▁

0,1
cilp_train/loss,3.02293
cilp_valid/loss,3.17552
epoch,2.0
learning_rate,0.0001


When complete, please freeze the model. We will assess this model with our cross-model projection model, and if this model is altered during cross-model projection training, it may not pass!

In [12]:
for param in CILP_model.parameters():
    CILP_model.requires_grad = False

The CILP contrastive pre-training has been logged to W&B with:
- Training and validation loss per epoch
- Learning rate
- Similarity matrix visualization at the end of training
- All hyperparameters in the config

The W&B run has been completed and you can view the results in your W&B dashboard.

## 5.3 Cross-Modal Projection

Now that we have a way to embed our image data, let's move on to cross-modal projection. 

Let's jump right in and create the projector. What should be the dimensions into the model, and what should be the dimensions out of the model? A hint to the first dimension can be found in section [#5.2-Contrastive-Pre-training](#5.2-Contrastive-Pre-training) in the `Embedder` class. A hint to the second dimension can be found in the [assessment/assesment_utils.py](assessment/assesment_utils.py) file in the `Classifier` class. The dimensions of the output should be the same size as the output of the `get_embs` function.

In [13]:
projector = nn.Sequential(
    nn.Linear(CILP_EMB_SIZE, 1000),
    nn.ReLU(),
    nn.Linear(1000, 500),
    nn.ReLU(),
    nn.Linear(500, 3200)
).to(device)

Next, let's define the loss function for training the `projector`.

In [14]:
def get_projector_loss(model, batch):
    rbg_img, lidar_depth, class_idx = batch
    imb_embs = CILP_model.img_embedder(rbg_img)
    lidar_emb = lidar_cnn.get_embs(lidar_depth)
    pred_lidar_embs = model(imb_embs)
    return nn.MSELoss()(pred_lidar_embs, lidar_emb)

The `projector` will take a little while to train, but at the end of it, should reach a validation loss around 2.

In [15]:
epochs = 40
optimizer = torch.optim.AdamW(projector.parameters())
assessment_utils.train_model(
    projector, 
    optimizer, 
    get_projector_loss, 
    epochs, 
    train_dataloader, 
    valid_dataloader,
    wandb_project="cilp-extended-assessment",
    wandb_name="CILP_projector_training",
    wandb_config={
        "architecture": "Projector",
        "group": TASK_NAME,
        "tags": WANDB_TAGS + ["projector_training"]
    }
)

Epoch   0 | Train Loss: 3.4427
Epoch   0 | Valid Loss: 3.1946
Epoch   1 | Train Loss: 3.1244
Epoch   1 | Valid Loss: 3.1352
Epoch   2 | Train Loss: 3.0468
Epoch   2 | Valid Loss: 3.1103
Epoch   3 | Train Loss: 2.9435
Epoch   3 | Valid Loss: 2.9241
Epoch   4 | Train Loss: 2.8149
Epoch   4 | Valid Loss: 2.8579
Epoch   5 | Train Loss: 2.7117
Epoch   5 | Valid Loss: 2.7991
Epoch   6 | Train Loss: 2.6023
Epoch   6 | Valid Loss: 2.6085
Epoch   7 | Train Loss: 2.4950
Epoch   7 | Valid Loss: 2.5086
Epoch   8 | Train Loss: 2.3852
Epoch   8 | Valid Loss: 2.4043
Epoch   9 | Train Loss: 2.2837
Epoch   9 | Valid Loss: 2.3598
Epoch  10 | Train Loss: 2.2386
Epoch  10 | Valid Loss: 2.3394
Epoch  11 | Train Loss: 2.1737
Epoch  11 | Valid Loss: 2.2267
Epoch  12 | Train Loss: 2.1289
Epoch  12 | Valid Loss: 2.1669
Epoch  13 | Train Loss: 2.0826
Epoch  13 | Valid Loss: 2.2585
Epoch  14 | Train Loss: 2.0484
Epoch  14 | Valid Loss: 2.3193
Epoch  15 | Train Loss: 1.9056
Epoch  15 | Valid Loss: 2.1188
Epoch  1

0,1
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
learning_rate,██████████████▄▄▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▃▂▂▂▁▁▁▁▁
train/loss,█▇▇▆▆▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
valid/loss,███▇▆▆▅▅▄▄▄▃▃▃▄▃▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁

0,1
epoch,39.0
learning_rate,3e-05
train/loss,1.57325
valid/loss,1.799


Time to bring it together. Let's create a new model `RGB2LiDARClassifier` where we can use our projector with the pre-trained `lidar_cnn` model.

In [16]:
class RGB2LiDARClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.projector = projector
        self.img_embedder = CILP_model.img_embedder
        self.shape_classifier = lidar_cnn
    
    def forward(self, imgs):
        img_encodings = self.img_embedder(imgs)
        proj_lidar_embs = self.projector(img_encodings)
        return self.shape_classifier(data_embs=proj_lidar_embs)

In [17]:
my_classifier = RGB2LiDARClassifier()

Before we train this model, let's see how it does out of the box. We'll create a function `get_correct` that we can use to calculate the number of classifications that were correct.

In [18]:
def get_correct(output, y):
    zero_tensor = torch.tensor([0]).to(device)
    pred = torch.gt(output, zero_tensor)
    correct = pred.eq(y.view_as(pred)).sum().item()
    return correct

Next, we can make a `get_valid_metrics` function to calculate the model's accuracy with the validation dataset. If done correctly, the accuracy should be above `.70`, or 70%.

In [19]:
def get_valid_metrics():
    my_classifier.eval()
    correct = 0
    batch_correct = 0
    total_loss = 0
    for step, batch in enumerate(valid_dataloader):
        rbg_img, _, class_idx = batch
        output = my_classifier(rbg_img)
        loss = nn.BCEWithLogitsLoss()(output, class_idx)
        batch_correct = get_correct(output, class_idx)
        correct += batch_correct
        total_loss += loss.item()
    
    avg_loss = total_loss / (step + 1)
    accuracy = correct / valid_N
    print(f"Valid Loss: {avg_loss:2.4f} | Accuracy {accuracy:2.4f}")
    return avg_loss, accuracy

get_valid_metrics()

Valid Loss: 1.1538 | Accuracy 0.8375


(1.1538238167762755, 0.8375)

Finally, let's fine-tune the completed model. Since `CILP` and `lidar_cnn` are frozen, this should only change the `projector` part of the model. Even so, the model should achieve a validation accuracy of above `.95` or 95%.

In [20]:
epochs = 5
CLASSIFIER_LR = 0.001
optimizer = torch.optim.AdamW(my_classifier.parameters(), lr=CLASSIFIER_LR)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=1)

wandb.init(
    project="cilp-extended-assessment",
    group=TASK_NAME,
    name="CILP_final_classifier",
    tags=WANDB_TAGS + ["final_classifier"],
    config={
        "learning_rate": CLASSIFIER_LR,
        "architecture": "RGB2LiDARClassifier",
        "embedding_size": CILP_EMB_SIZE,
        "batch_size": BATCH_SIZE,
        "epochs": epochs,
        "optimizer": optimizer.__class__.__name__,
        "scheduler": "ReduceLROnPlateau",
        "fusion_strategy": "contrastive",
        "num_params": sum(p.numel() for p in my_classifier.parameters() if p.requires_grad),
    }
)

my_classifier.train()
for epoch in range(epochs):
    correct = 0
    batch_correct = 0
    total_train_loss = 0
    for step, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        rbg_img, _, class_idx = batch
        output = my_classifier(rbg_img)
        loss = nn.BCEWithLogitsLoss()(output, class_idx)
        batch_correct = get_correct(output, class_idx)
        correct += batch_correct
        total_train_loss += loss.item()
        loss.backward()
        optimizer.step()
    
    avg_train_loss = total_train_loss / (step + 1)
    train_accuracy = correct / train_N
    print(f"Train Loss: {avg_train_loss:2.4f} | Accuracy {train_accuracy:2.4f}")
    valid_loss, valid_acc = get_valid_metrics()
    
    # Step the scheduler based on validation loss
    scheduler.step(valid_loss)
    current_lr = optimizer.param_groups[0]['lr']
    
    wandb.log({
        "train/loss": avg_train_loss,
        "train/accuracy": train_accuracy,
        "valid/loss": valid_loss,
        "valid/accuracy": valid_acc,
        "learning_rate": current_lr,

        "epoch": epoch,    })

Train Loss: 0.4093 | Accuracy 0.7742
Valid Loss: 0.1002 | Accuracy 0.9609
Train Loss: 0.0396 | Accuracy 0.9866
Valid Loss: 0.0015 | Accuracy 1.0000
Train Loss: 0.0136 | Accuracy 0.9944
Valid Loss: 0.0004 | Accuracy 1.0000
Train Loss: 0.0157 | Accuracy 0.9934
Valid Loss: 0.0005 | Accuracy 1.0000
Train Loss: 0.0091 | Accuracy 0.9952
Valid Loss: 0.0012 | Accuracy 0.9984


Sample 5 predictions and log them to Weights & Biases (wandb) for visualization.

In [21]:
my_classifier.eval()
for step, batch in enumerate(valid_dataloader):
    rbg_img, _, class_idx = batch
    output = my_classifier(rbg_img)
    wandb.log({"predictions": wandb.Table(data=[[rbg_img[i].cpu().numpy(), torch.sigmoid(output[i]).item(), class_idx[i].item()] for i in range(5)],
                                           columns=["rgb_image", "predicted_class", "true_class"])})
    break
wandb.finish()

0,1
epoch,▁▃▅▆█
learning_rate,████▁
train/accuracy,▁████
train/loss,█▂▁▁▁
valid/accuracy,▁████
valid/loss,█▁▁▁▁

0,1
epoch,4.0
learning_rate,0.0005
train/accuracy,0.9952
train/loss,0.00906
valid/accuracy,0.99844
valid/loss,0.00121
