# 4. Final Assessment

## 5.1 Setup

Let's get started. Below are the libraries used in this assessment.

In [1]:
import numpy as np
from PIL import Image

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader

import sys
import os
sys.path.append(os.path.abspath("../src"))
from visualization import print_CILP_results
from models import Classifier, Embedder
from training import run_training
#import utils

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()

True

### 5.1.1 The Model

Next, let's load our classification model and call it `lidar_cnn`. If we take a moment to view the [assement_utils](assessment/assesment_utils.py), we can see the `Classifier` class used to construct the model. Please note the `get_embs` method, which we will be using to construct our cross-modal projector.

In [2]:
lidar_cnn = Classifier(1).to(device)
lidar_cnn.load_state_dict(torch.load("../data/assessment/lidar_cnn.pt", weights_only=True))
# Do not unfreeze. Otherwise, it would be difficult to pass the assessment.
for param in lidar_cnn.parameters():
    lidar_cnn.requires_grad = False #changed vs original!
lidar_cnn.eval()

Classifier(
  (embedder): Sequential(
    (0): Conv2d(1, 50, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(50, 100, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(100, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU()
    (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (9): Conv2d(200, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (10): ReLU()
    (11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (12): Flatten(start_dim=1, end_dim=-1)
  )
  (classifier): Sequential(
    (0): Linear(in_features=3200, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=1, bias=True)
  )
)

### 5.1.2 The Dataset

Below is the dataset we will be using in this assessment. It is similar to the dataset we used in the first few labs, but please note `self.classes`. Unlike the first lab where we predicted position, in this lab, we will determine whether the RGB or LiDAR we are evaluating contains a `cube` or a `sphere`.

In [3]:
IMG_SIZE = 64
img_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToTensor(),  # Scales data into [0,1]
])

class MyDataset(Dataset):
    def __init__(self, root_dir, start_idx, stop_idx):
        self.classes = ["cubes", "spheres"]
        self.root_dir = root_dir
        self.rgb = []
        self.lidar = []
        self.class_idxs = []

        for class_idx, class_name in enumerate(self.classes):
            for idx in range(start_idx, stop_idx):
                file_number = "{:04d}".format(idx)
                rbg_img = Image.open(self.root_dir + class_name + "/rgb/" + file_number + ".png")
                rbg_img = img_transforms(rbg_img).to(device)
                self.rgb.append(rbg_img)
    
                lidar_depth = np.load(self.root_dir + class_name + "/lidar/" + file_number + ".npy")
                lidar_depth = torch.from_numpy(lidar_depth[None, :, :]).to(torch.float32).to(device)
                self.lidar.append(lidar_depth)

                self.class_idxs.append(torch.tensor(class_idx, dtype=torch.float32)[None].to(device))

    def __len__(self):
        return len(self.class_idxs)

    def __getitem__(self, idx):
        rbg_img = self.rgb[idx]
        lidar_depth = self.lidar[idx]
        class_idx = self.class_idxs[idx]
        return rbg_img, lidar_depth, class_idx

This data is available in the `/data/assessment` folder. Here is an example of one of the cubes. The images are small, but there is enough detail that our models will be able to tell the difference.

<center><img src="data/assessment/cubes/rgb/0002.png" /></center>

Let's go ahead and load the data into a `DataLoader`. We'll set aside a few batches (`VALID_BATCHES`) for validation. The rest of the data will be used for training. We have `9999` images for each of the cube and sphere categories, so we'll multiply N times 2 to reflect the combined dataset.

In [4]:
BATCH_SIZE = 32
VALID_BATCHES = 10
N = 9999

valid_N = VALID_BATCHES*BATCH_SIZE
train_N = N - valid_N

train_data = MyDataset("../data/assessment/", 0, train_N)
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
valid_data = MyDataset("../data/assessment/", train_N, N)
valid_dataloader = DataLoader(valid_data, batch_size=BATCH_SIZE, shuffle=False, drop_last=True)

N *= 2
valid_N *= 2
train_N *= 2

print(len(train_data))
print(len(valid_data))

19358
640


## 5.2 Contrastive Pre-training

Before we create a cross-modal projection model, it would be nice to have a way to embed our RGB images as a starting point. Let's be efficient with our data and create a contrastive pre-training model. First, it would help to have a convolutional model. We've prepared a recommended architecture below.

The RGB data has `4` channels, and our LiDAR data has `1`. Let's initiate these embedding models respectively.

In [5]:
img_embedder = Embedder(4).to(device)
lidar_embedder = Embedder(1).to(device)

Now that we have our embedding models, let's combine them into a `ContrastivePretraining` model.

**TODO**: The `ContrastivePretraining` class below is almost done, but it has a few `FIXME`s. Please replace the FIXMEs to have a working model. Feel free to review notebook [02b_Contrastive_Pretraining.ipynb](02b_Contrastive_Pretraining.ipynb) for a hint.

In [6]:
class ContrastivePretraining(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_embedder = img_embedder
        self.lidar_embedder = lidar_embedder
        self.cos = nn.CosineSimilarity()

    def forward(self, rgb_imgs, lidar_depths):
        img_emb = self.img_embedder(rgb_imgs)
        lidar_emb = self.lidar_embedder(lidar_depths)

        repeated_img_emb = img_emb.repeat_interleave(len(img_emb), dim=0)   # 0,0,0, ..., 1,1,1, ...
        repeated_lidar_emb = lidar_emb.repeat(len(lidar_emb), 1)            # 0,1,2, ..., 0,1,2, ...

        similarity = self.cos(repeated_img_emb, repeated_lidar_emb)
        similarity = torch.unflatten(similarity, 0, (BATCH_SIZE, BATCH_SIZE))
        similarity = (similarity + 1) / 2

        logits_per_img = similarity
        logits_per_lidar = similarity.T
        return logits_per_img, logits_per_lidar

Time to put these models to the test! First, let's initialize the model.

In [7]:
CILP_model = ContrastivePretraining().to(device)
print("moved CILP model to ", device)

trainable_params = [p for p in CILP_model.parameters() if p.requires_grad]
print(f"CILP trainable params: {sum(p.numel() for p in trainable_params):,}")

optimizer = Adam(trainable_params, lr=1e-4)
loss_img = nn.CrossEntropyLoss()
loss_lidar = nn.CrossEntropyLoss()
ground_truth = torch.arange(BATCH_SIZE, dtype=torch.long).to(device)
epochs = 3

moved CILP model to  cuda
CILP trainable params: 1,853,950


Also we will use weights and biases to log our runs.

In [8]:
import wandb

wandb.login()

run = wandb.init(
    project="contrastive pre training",
    config={
        "cilp_epochs": epochs,
        "batch_size": BATCH_SIZE,
        "n_train": train_N,
        "n_val": valid_N,
        "optimizer": "Adam",
        "cilp_lr": 1e-4,
    }
)


[34m[1mwandb[0m: [wandb.login()] Loaded credentials for https://api.wandb.ai from /home/henrizeiler/.netrc.
[34m[1mwandb[0m: Currently logged in as: [33mzeihenri[0m ([33mzeihenri-hasso-plattner-institut[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Before we can train the model, we should define a loss function to guide our model in learning.

**TODO**: The `get_CILP_loss` function below is almost done. Do you remember the formula to calculate the loss? Please replace the `FIXME`s below.

In [9]:
def get_CILP_loss(batch):
    rbg_img, lidar_depth, _ = batch
    logits_per_img, logits_per_lidar = CILP_model(rbg_img, lidar_depth)
    total_loss = (loss_img(logits_per_img, ground_truth) + loss_lidar(logits_per_lidar, ground_truth))/2
    #print("Loss: ", total_loss)
    return total_loss, logits_per_img

In [10]:
# Quick sanity check: confirm one optimizer step actually changes weights
CILP_model.train()
rgb_batch = torch.randn(BATCH_SIZE, 4, IMG_SIZE, IMG_SIZE, device=device)
lidar_batch = torch.randn(BATCH_SIZE, 1, IMG_SIZE, IMG_SIZE, device=device)
dummy_class = torch.zeros(BATCH_SIZE, 1, device=device)
batch = (rgb_batch, lidar_batch, dummy_class)

optimizer.zero_grad()
loss_before, _ = get_CILP_loss(batch)
loss_before.backward()

grad_norm = 0.0
for p in CILP_model.parameters():
    if p.grad is not None:
        grad_norm += p.grad.detach().abs().mean().item()

w_before = CILP_model.img_embedder.conv[0].weight.detach().clone()
optimizer.step()
w_after = CILP_model.img_embedder.conv[0].weight.detach()
delta = (w_after - w_before).abs().mean().item()

print(f"Mean grad magnitude (sum over params): {grad_norm:.6e}")
print(f"Mean weight delta after 1 step: {delta:.6e}")

Mean grad magnitude (sum over params): 2.093303e-04
Mean weight delta after 1 step: 9.969963e-05


Next, it's time to train. If the above `TODO`s were completed correctly, the loss should be under `3.2`. Are the values along the diagional close to `1`?

In [11]:
for epoch in range(epochs):
    CILP_model.train()
    train_loss = 0
    for step, batch in enumerate(train_dataloader):
        #print("step: ", step) DOES STEP
        optimizer.zero_grad()
        loss, logits_per_img = get_CILP_loss(batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    epoch_train_loss = train_loss / step
    print_CILP_results(epoch, epoch_train_loss, logits_per_img, is_train=True)

    CILP_model.eval()
    valid_loss = 0
    with torch.no_grad():
        for step, batch in enumerate(valid_dataloader):
            loss, logits_per_img = get_CILP_loss(batch)
            valid_loss += loss.item()
    epoch_val_loss = valid_loss / step
    print_CILP_results(epoch, epoch_val_loss, logits_per_img, is_train=False)

    wandb.log({
        "cilp/train_loss": epoch_train_loss,
        "cilp/val_loss": epoch_val_loss,
    }, step=epoch)


Epoch 0
Train Loss: 3.080447541342841 
Similarity:
tensor([[0.9976, 0.0739, 0.9517,  ..., 0.2866, 0.5012, 0.1895],
        [0.0774, 0.9933, 0.0877,  ..., 0.5959, 0.4071, 0.6787],
        [0.9595, 0.0719, 0.9971,  ..., 0.2317, 0.4527, 0.1425],
        ...,
        [0.2899, 0.6100, 0.2556,  ..., 0.9958, 0.3441, 0.6587],
        [0.5287, 0.3884, 0.4504,  ..., 0.3668, 0.9973, 0.8251],
        [0.3024, 0.5784, 0.2307,  ..., 0.5898, 0.9269, 0.9721]],
       device='cuda:0', grad_fn=<DivBackward0>)
Valid Loss: 3.19351463568838 
Similarity:
tensor([[0.9953, 0.8765, 0.5017,  ..., 0.5064, 0.5460, 0.9880],
        [0.8633, 0.9952, 0.4986,  ..., 0.2495, 0.2731, 0.8715],
        [0.4522, 0.4615, 0.9951,  ..., 0.5610, 0.4382, 0.4017],
        ...,
        [0.5125, 0.2639, 0.5176,  ..., 0.9975, 0.9853, 0.4914],
        [0.5608, 0.2914, 0.4119,  ..., 0.9790, 0.9980, 0.5496],
        [0.9916, 0.8762, 0.4030,  ..., 0.4863, 0.5497, 0.9967]],
       device='cuda:0')
Epoch 1
Train Loss: 3.028940698202965 


When complete, please freeze the model. We will assess this model with our cross-model projection model, and if this model is altered during cross-model projection training, it may not pass!

In [12]:
for param in CILP_model.parameters():
    CILP_model.requires_grad = False

## 5.3 Cross-Modal Projection

Now that we have a way to embed our image data, let's move on to cross-modal projection. 

**TODO**: Let's jump right in and create the projector. What should be the dimensions into the model, and what should be the dimensions out of the model? A hint to the first `FIXME` can be found in section [#5.2-Contrastive-Pre-training](#5.2-Contrastive-Pre-training) in the `Embedder` class. A hint to the second `FIXME` can be found in the [assessment/assesment_utils.py](assessment/assesment_utils.py) file in the `Classifier` class. The dimensions of the second `FIXME` should be the same size as the output of the `get_embs` function.

In [13]:
#The projector maps from the constrastive pretraining models latent space to that of our lidar cnn model.
#Even when the dimensionality of the embeddings match we still need to align the latent spaces
# -> the first fixme needs to be replaced by the CILP embedding size (output off the embedders)
# -> the second one depends on the output of the embedder the classifier was initially trained with

projector = nn.Sequential(
    nn.Linear(200, 1000),   #maybe I should not turnn these into dense embeddings, lets see
    nn.ReLU(),
    nn.Linear(1000, 500),
    nn.ReLU(),
    nn.Linear(500, 200 * 4 * 4)
).to(device)

Next, let's define the loss function for training the `projector`.

**TODO**: What was the loss function for estimating projection embeddings? Please replace the `FIXME` below. Review notebook [03a_Projection.ipynb](03a_Projection.ipynb) section 3.2 for a hint.

In [14]:
def get_projector_loss(model, batch):
    rbg_img, lidar_depth, class_idx = batch
    imb_embs = CILP_model.img_embedder(rbg_img)
    lidar_emb = lidar_cnn.get_embs(lidar_depth)
    pred_lidar_embs = model(imb_embs)
    return nn.MSELoss()(pred_lidar_embs, lidar_emb), pred_lidar_embs

The `projector` will take a little while to train, but at the end of it, should reach a validation loss around 2.

In [15]:
epochs = 40
optimizer = torch.optim.Adam(projector.parameters())
run.config.update({"projector_epochs": epochs, "projector_optimizer": "Adam"})

train_losses, val_losses, _, _ = run_training(
    projector, optimizer, epochs, train_dataloader, valid_dataloader, device, get_projector_loss, False
)

for epoch, (t_loss, v_loss) in enumerate(zip(train_losses, val_losses)):
    wandb.log({"projector/train_loss": t_loss, "projector/val_loss": v_loss}, step=epoch)


Epoch   0 | Train: 3.4750  Val: 3.2427
Epoch   1 | Train: 3.1597  Val: 3.1567
Epoch   2 | Train: 3.0925  Val: 3.0813
Epoch   3 | Train: 3.0347  Val: 3.0276
Epoch   4 | Train: 2.9249  Val: 2.8820
Epoch   5 | Train: 2.7349  Val: 2.6280
Epoch   6 | Train: 2.5874  Val: 2.4948
Epoch   7 | Train: 2.4050  Val: 2.4433
Epoch   8 | Train: 2.3128  Val: 2.3548
Epoch   9 | Train: 2.2244  Val: 2.3243
Epoch  10 | Train: 2.1567  Val: 2.2408
Epoch  11 | Train: 2.1000  Val: 2.1560
Epoch  12 | Train: 2.0552  Val: 2.2638
Epoch  13 | Train: 2.0151  Val: 2.0913
Epoch  14 | Train: 1.9818  Val: 2.1904
Epoch  15 | Train: 1.9650  Val: 2.1680
Epoch  16 | Train: 1.9437  Val: 2.1411
Epoch  17 | Train: 1.9106  Val: 1.9241
Epoch  18 | Train: 1.9129  Val: 1.9698
Epoch  19 | Train: 1.8825  Val: 1.9649
Epoch  20 | Train: 1.8799  Val: 1.8895
Epoch  21 | Train: 1.8322  Val: 1.9870
Epoch  22 | Train: 1.8231  Val: 1.9491
Epoch  23 | Train: 1.8217  Val: 1.8921
Epoch  24 | Train: 1.7990  Val: 1.8701
Epoch  25 | Train: 1.8136

Time to bring it together. Let's create a new model `RGB2LiDARClassifier` where we can use our projector with the pre-trained `lidar_cnn` model.

**TODO**: Please fix the `FIXME`s below. Which `embedder` should we be using from our `CILP_model`?

In [16]:
class RGB2LiDARClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.projector = projector
        self.img_embedder = CILP_model.img_embedder
        self.shape_classifier = lidar_cnn
    
    def forward(self, imgs):
        img_encodings = self.img_embedder(imgs)
        proj_lidar_embs = self.projector(img_encodings)
        return self.shape_classifier(data_embs=proj_lidar_embs)

In [17]:
my_classifier = RGB2LiDARClassifier()

Before we train this model, let's see how it does out of the box. We'll create a function `get_correct` that we can use to calculate the number of classifications that were correct.

In [18]:
def get_correct(output, y):
    zero_tensor = torch.tensor([0]).to(device)
    pred = torch.gt(output, zero_tensor)
    correct = pred.eq(y.view_as(pred)).sum().item()
    return correct

Next, we can make a `get_valid_metrics` function to calculate the model's accuracy with the validation dataset. If done correctly, the accuracy should be above `.70`, or 70%.

In [19]:
def get_valid_metrics():
    my_classifier.eval()
    correct = 0
    batch_correct = 0
    total_loss = 0
    for step, batch in enumerate(valid_dataloader):
        rbg_img, _, class_idx = batch
        output = my_classifier(rbg_img)
        loss = nn.BCEWithLogitsLoss()(output, class_idx)
        batch_correct = get_correct(output, class_idx)
        correct += batch_correct
        total_loss += loss.item()
    avg_loss = total_loss / (step + 1)
    accuracy = correct / valid_N
    print(f"Valid Loss: {avg_loss:2.4f} | Accuracy {accuracy:2.4f}")
    return avg_loss, accuracy

get_valid_metrics()


Valid Loss: 1.2299 | Accuracy 0.8484


(1.2299052242189645, 0.8484375)

Finally, let's fine-tune the completed model. Since `CILP` and `lidar_cnn` are frozen, this should only change the `projector` part of the model. Even so, the model should achieve a validation accuracy of above `.95` or 95%.

In [20]:
epochs = 5
optimizer = torch.optim.Adam(my_classifier.parameters())
run.config.update({"finetune_epochs": epochs, "finetune_optimizer": "Adam"})

my_classifier.train()
for epoch in range(epochs):
    correct = 0
    batch_correct = 0
    train_loss_total = 0
    for step, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        rbg_img, _, class_idx = batch
        output = my_classifier(rbg_img)
        loss = nn.BCEWithLogitsLoss()(output, class_idx)
        batch_correct = get_correct(output, class_idx)
        correct += batch_correct
        train_loss_total += loss.item()
        loss.backward()
        optimizer.step()
    train_accuracy = correct / train_N
    avg_train_loss = train_loss_total / (step + 1)
    print(f"Train Loss: {avg_train_loss:2.4f} | Accuracy {train_accuracy:2.4f}")
    val_loss, val_accuracy = get_valid_metrics()
    wandb.log({
        "finetune/train_loss": avg_train_loss,
        "finetune/train_accuracy": train_accuracy,
        "finetune/val_loss": val_loss,
        "finetune/val_accuracy": val_accuracy,
    }, step=epoch)

# Log final performance as summary metrics
wandb.summary["final_val_accuracy"] = val_accuracy
wandb.summary["final_val_loss"] = val_loss
wandb.summary["final_train_accuracy"] = train_accuracy


Train Loss: 0.5954 | Accuracy 0.6655
Valid Loss: 0.2562 | Accuracy 0.9078




Train Loss: 0.1851 | Accuracy 0.9362
Valid Loss: 0.1218 | Accuracy 0.9672




Train Loss: 0.0387 | Accuracy 0.9873
Valid Loss: 0.0185 | Accuracy 0.9922
Train Loss: 0.0245 | Accuracy 0.9911
Valid Loss: 0.0337 | Accuracy 0.9844




Train Loss: 0.0192 | Accuracy 0.9929
Valid Loss: 0.0012 | Accuracy 1.0000


In [21]:
wandb.finish()


0,1
cilp/train_loss,█▂▁
cilp/val_loss,█▂▁
projector/train_loss,██▇▆▆▅▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
projector/val_loss,██▇▆▅▅▄▄▄▃▄▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▂▂▁▁▁▃▁▂▁▂▁▁

0,1
cilp/train_loss,3.0236
cilp/val_loss,3.1819
final_train_accuracy,0.99287
final_val_accuracy,1.0
final_val_loss,0.00122
projector/train_loss,1.66481
projector/val_loss,1.76633
