# CILP Assessment Performance

The goal of this notebook is to train a model that projects between two embedding spaces. The projector should map image embeddings onto lidar embeddings so that a lidar classifier can be used to classify images. To do this, we first train a single-modality classifier on lidar data only; this model defines the lidar embedding space. Then we train a contrastive learning model, "CILP" (Contrastive-Image-Lidar-Pretraining), whose image embedder defines the image embedding space. Next, we train a projector that takes an image embedding produced by CILP and maps it to a lidar embedding that is as close as possible to the one produced by the lidar classifier's embedder. Finally, we combine everything into an RGB2Lidar classifier that takes images, embeds them with CILP, projects the image embeddings into the lidar embedding space, and passes the result to the pretrained, frozen lidar classifier head.

In [None]:
import sys

# Colab-only setup
if "google.colab" in sys.modules:
    print("Running in Google Colab. Setting up repo")

    !git clone https://github.com/MatthiasCr/Computer-Vision-Assignment-2.git
    %cd Computer-Vision-Assignment-2/notebooks
    !pip install -r requirements.txt

In [None]:
# insert wandb token 
!wandb login

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
import torchvision.transforms.v2 as transforms
from torch.utils.data import DataLoader
import wandb


import sys
from pathlib import Path

project_root = Path("..").resolve()
sys.path.append(str(project_root))
from src import datasets
from src import training
from src import visualization
from src import models

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Load Data

As in the previous notebooks, we start by loading the data as a FiftyOne dataset from Hugging Face.

In [None]:
IMG_SIZE = 64
BATCH_SIZE = 32

In [None]:
# load fiftyone dataset from huggingface
dataset = load_from_hub(
    "MatthiasCr/multimodal-shapes-subset", 
    name="multimodal-shapes-subset",
    # fewer workers and greater batch size to hopefully avoid getting rate limited
    num_workers=2,
    batch_size=1000,
    overwrite=True,
)

In [None]:
img_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToImage(),
    transforms.ToDtype(torch.float32, scale=True),
])

train_dataset = datasets.MultimodalDataset(dataset, "train", img_transforms)
val_dataset = datasets.MultimodalDataset(dataset, "val", img_transforms)

# use generator with fixed seed for reproducible shuffling
generator = torch.Generator()
generator.manual_seed(51)

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True, generator=generator)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, drop_last=True)

# number of train batches, needed for learning rate scheduling
steps_per_epoch = len(train_dataloader)

## Baseline Lidar Classifier

We first train a single-modality classifier that uses only lidar data for the cube/sphere classification. This `LidarClassifier` consists of an embedder and a classification head. The embedding dimension is 200.

The purpose of this model is to define an embedding space for lidar data and to have a classifier head that works on these embeddings. We will later project image embeddings onto this lidar embedding space so that we can use this classifier head (without the embedder) for classification of image data.
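To make the embedder/head split concrete, here is a minimal sketch of a classifier with the interface this notebook relies on (`get_embs`, `raw_data=`, `data_embs=`). The class name `TinyLidarClassifier`, the layer choices, and the four input channels are assumptions for illustration only; the real `models.LidarClassifier` architecture is not shown in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLidarClassifier(nn.Module):
    """Hypothetical stand-in for models.LidarClassifier (architecture assumed)."""

    def __init__(self, in_ch=4, emb_size=200, normalize_embs=True):
        super().__init__()
        self.normalize_embs = normalize_embs
        # embedder: raw lidar tensor -> embedding vector
        self.embedder = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, emb_size),
        )
        # head: embedding vector -> single logit (cube vs. sphere)
        self.head = nn.Linear(emb_size, 1)

    def get_embs(self, raw_data):
        embs = self.embedder(raw_data)
        return F.normalize(embs, dim=1) if self.normalize_embs else embs

    def forward(self, raw_data=None, data_embs=None):
        # the head can run on raw data or directly on precomputed embeddings
        if data_embs is None:
            data_embs = self.get_embs(raw_data)
        return self.head(data_embs)


model = TinyLidarClassifier()
xyza = torch.randn(8, 4, 64, 64)          # fake lidar batch, shape assumed
embs = model.get_embs(xyza)               # (8, 200), unit norm
logits_from_raw = model(raw_data=xyza)    # (8, 1)
logits_from_embs = model(data_embs=embs)  # same logits via the head only
```

The second call path (`data_embs=`) is the one that matters later: it lets any 200-dimensional embedding, including a projected image embedding, be fed into the head.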

In [None]:
lidar_classifier = models.LidarClassifier(emb_size=200, normalize_embs=True).to(device)
lidar_classifier_num_params = sum(p.numel() for p in lidar_classifier.parameters())
epochs = 20
start_lr = 1e-4
end_lr = 1e-6

optim = Adam(lidar_classifier.parameters(), lr=start_lr)
scheduler = CosineAnnealingLR(optim, T_max=epochs * steps_per_epoch, eta_min=end_lr)
loss_func = nn.BCEWithLogitsLoss()

In [None]:
def apply_classifier_model(model, batch):
    # only lidar data used
    _, inputs_xyz, target = batch
    inputs_xyz = inputs_xyz.to(device)
    target = target.to(device)
    outputs = model(raw_data=inputs_xyz)
    return outputs, target

In [None]:
run = training.initWandbRun(
    "", epochs, BATCH_SIZE, lidar_classifier_num_params, "Adam", "Cosine Annealing", start_lr, end_lr
)

classifier_train_loss, classifier_val_loss = training.train_model(
    lidar_classifier, 
    optim, 
    apply_classifier_model, 
    loss_func, 
    epochs, 
    train_dataloader, 
    val_dataloader, 
    device, 
    run, 
    scheduler=scheduler, 
    output_name="lidar_classifier"
)

run.finish()

visualization.plot_loss(epochs,
    {
        "Classifier Train Loss": classifier_train_loss,
        "Classifier Val Loss": classifier_val_loss
    }
)

We can visualize the loss and accuracy curves on Wandb. The validation accuracy quickly reaches 0.999.

<img src="../results/wandb-t5-lidar-graph.png">

Now we load the best checkpoint and freeze the model so we can use it later without altering it. The frozen classification head will serve as a metric for how well the projector model can imitate the lidar embeddings from the `lidar_classifier` model.

In [None]:
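# rebuild with the same arguments used for training so the checkpoint and the normalization match
lidar_classifier = models.LidarClassifier(emb_size=200, normalize_embs=True)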
lidar_classifier.load_state_dict(
    torch.load("../checkpoints/lidar_classifier.pt", map_location=device))
lidar_classifier = lidar_classifier.to(device)

# freezing
for param in lidar_classifier.parameters():
    param.requires_grad = False
lidar_classifier.eval()

## CILP Model

Now we train the CILP model. The goal is a model that produces aligned embeddings for both modalities using self-supervised contrastive learning. The embeddings have 200 dimensions, matching the embedding size of the lidar classifier, which should make things a little easier for the projector.
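One common way to write such an objective is a CLIP-style symmetric cross-entropy over in-batch pairs, sketched below on random tensors. The function name and the `temperature` value are assumptions; how `models.ContrastivePretraining` scales its logits internally is not shown in this notebook.

```python
import math

import torch
import torch.nn.functional as F


def symmetric_contrastive_loss(img_embs, lidar_embs, temperature=0.07):
    """CLIP-style loss sketch: matching pairs sit on the diagonal."""
    img_embs = F.normalize(img_embs, dim=1)
    lidar_embs = F.normalize(lidar_embs, dim=1)
    # (B, B) cosine similarities, sharpened by the (assumed) temperature
    logits_per_img = img_embs @ lidar_embs.T / temperature
    logits_per_lidar = logits_per_img.T
    # sample i in one modality should pick sample i in the other
    targets = torch.arange(img_embs.shape[0], device=img_embs.device)
    return (
        F.cross_entropy(logits_per_img, targets)
        + F.cross_entropy(logits_per_lidar, targets)
    ) / 2


torch.manual_seed(0)
lidar = torch.randn(32, 200)
aligned_loss = symmetric_contrastive_loss(lidar + 0.01 * torch.randn(32, 200), lidar)
unaligned_loss = symmetric_contrastive_loss(torch.randn(32, 200), lidar)

# loss of a model that scores all 32 candidates equally (about 3.47)
chance_level = math.log(32)
```

With a batch size of 32, `chance_level` is a useful reference point when reading the loss curves: values near it mean the model cannot yet tell matching pairs from negatives.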

In [None]:
cilp_model = models.ContrastivePretraining(batch_size=BATCH_SIZE).to(device)
cilp_num_params = sum(p.numel() for p in cilp_model.parameters())
epochs = 40
lr = 1e-2

optim = Adam(cilp_model.parameters(), lr=lr)

In [None]:
def apply_cilp_model(model, batch):
    # cilp model doesn't use the class information
    inputs_rgb, inputs_xyz, _ = batch
    inputs_rgb = inputs_rgb.to(device)
    inputs_xyz = inputs_xyz.to(device)
    logits_per_img, logits_per_lidar = model(inputs_rgb, inputs_xyz)
    return logits_per_img, logits_per_lidar

loss_img = nn.CrossEntropyLoss()
loss_lidar = nn.CrossEntropyLoss()
ground_truth = torch.arange(BATCH_SIZE, dtype=torch.long).to(device)

def cilp_loss(logits_per_img, logits_per_lidar):
    return (loss_img(logits_per_img, ground_truth) + loss_lidar(logits_per_lidar, ground_truth)) / 2

In [None]:
run = training.initWandbRun(
    "", epochs, BATCH_SIZE, cilp_num_params, "Adam", "", lr, lr
)

cilp_train_loss, cilp_val_loss = training.train_model(
    cilp_model, 
    optim,
    apply_cilp_model, 
    cilp_loss, 
    epochs, 
    train_dataloader, 
    val_dataloader, 
    device, 
    run, 
    scheduler=None, 
    output_name="cilp", 
    calc_accuracy=False
)

visualization.plot_loss(epochs,
    {
        "Cilp Train Loss": cilp_train_loss,
        "Cilp Val Loss": cilp_val_loss
    }
)

# create and log similarity matrix for first validation batch
visualization.plot_similarity_matrix(cilp_model, apply_cilp_model, val_dataloader, device, run)

run.finish()

The following graphs again show the resulting loss curves. The minimum validation loss we got is **2.59**.

<img src="../results/wandb-t5-cilp.png">

The similarity matrix is a way to visualize and evaluate the contrastive learning model. It shows the similarities across all image and lidar embeddings of the first validation batch. The diagonal holds the similarities of the matching image/lidar samples, while all other entries are negative pairings. We can see that the diagonal is close to one, which indicates that matching image and lidar embeddings are very similar. Most of the other entries are close to zero, which is also desired. However, there is still a lot of noise, which indicates that many negatives are still too similar. Possible ways to further improve this model are to increase the batch size, which would give the model more negative examples, or simply to train longer. An additional option would be to detect and oversample negatives that are especially hard.

This similarity matrix is also logged as an artifact on Wandb.

<img src="../results/cilp-similarity-matrix.png">
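The visual reading of the matrix (bright diagonal, noisy off-diagonal) can also be turned into a few numbers. The sketch below is not the code behind `visualization.plot_similarity_matrix`; the helper name `similarity_stats` and the random stand-in embeddings are assumptions.

```python
import torch
import torch.nn.functional as F


def similarity_stats(img_embs, lidar_embs):
    """Summarize a batch similarity matrix: diagonal vs. off-diagonal."""
    img_embs = F.normalize(img_embs, dim=1)
    lidar_embs = F.normalize(lidar_embs, dim=1)
    sim = img_embs @ lidar_embs.T  # (B, B) cosine similarities
    n = sim.shape[0]
    off_diag_mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    # fraction of images whose most similar lidar sample is the true match
    top1 = (sim.argmax(dim=1) == torch.arange(n, device=sim.device)).float().mean()
    return {
        "diag_mean": sim.diagonal().mean().item(),
        "off_diag_mean": sim[off_diag_mask].mean().item(),
        "top1_retrieval": top1.item(),
    }


torch.manual_seed(0)
lidar = torch.randn(32, 200)
img = lidar + 0.3 * torch.randn(32, 200)  # noisy copies as fake "image" embeddings
stats = similarity_stats(img, lidar)
```

A large gap between `diag_mean` and `off_diag_mean` together with a high `top1_retrieval` would correspond to the clean diagonal we want to see in the plot.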

Now we load the best CILP checkpoint and freeze the model for further use.

In [None]:
cilp_model = models.ContrastivePretraining(batch_size=BATCH_SIZE)
cilp_model.load_state_dict(torch.load("../checkpoints/cilp.pt", map_location=device))
cilp_model = cilp_model.to(device)

# freezing
for param in cilp_model.parameters():
    param.requires_grad = False
cilp_model.eval()

## Projector Model

Now comes the projector model. After a lot of experimentation I decided to use a very simple and relatively small model with two linear layers. Since I only use a small subset of the complete training data, more complex models tended to overfit extremely fast. I settled on the following architecture with 40,300 parameters, which is implemented in the `models.Projector` class. Input and output dimensions are both 200. I also added a normalization at the end of the projector, since both the image and the lidar embeddings are normalized. This could make things easier for the projector.


```python
nn.Sequential(
    nn.Linear(img_emb_size, 100),
    nn.ReLU(),
    nn.Linear(100, lidar_emb_size)
)
```

Since the model still tended to overfit quite fast, I chose a small learning rate, added weight-decay regularization and, as always, checkpointed the model at the best validation loss.
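As a quick sanity check of the parameter count, the sketch below rebuilds the architecture shown above and appends the final normalization mentioned in the text. `ProjectorSketch` is a hypothetical stand-in; the actual `models.Projector` may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectorSketch(nn.Module):
    """Stand-in for models.Projector: two linear layers plus L2 normalization."""

    def __init__(self, img_emb_size=200, lidar_emb_size=200, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_emb_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, lidar_emb_size),
        )

    def forward(self, img_embs):
        # project, then rescale every output vector to unit length
        return F.normalize(self.net(img_embs), dim=1)


projector_sketch = ProjectorSketch()
# (200 * 100 + 100) + (100 * 200 + 200) = 20100 + 20200 = 40300
num_params = sum(p.numel() for p in projector_sketch.parameters())
out = projector_sketch(torch.randn(4, 200))
```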

In [None]:
source_embedding_dim = cilp_model.get_embedding_size()
target_embedding_dim = lidar_classifier.get_embedding_size()

projector = models.Projector(source_embedding_dim, target_embedding_dim).to(device)
projector_num_params = sum(p.numel() for p in projector.parameters())
epochs = 40
lr = 1e-4
optim = torch.optim.Adam(projector.parameters(), lr=lr, weight_decay=5e-4)

In [None]:
def apply_projector_model(model, batch):
    rgb_img, lidar_xyz, _ = batch
    rgb_img = rgb_img.to(device)
    lidar_xyz = lidar_xyz.to(device)
    imb_embs = cilp_model.img_embedder(rgb_img)
    # get lidar embeddings from lidar classifier as "ground truth"
    lidar_embs = lidar_classifier.get_embs(lidar_xyz)
    pred_lidar_embs = model(imb_embs)
    return pred_lidar_embs, lidar_embs

def projector_loss(pred_lidar_embs, lidar_embs):
    # simple MSE loss to compare similarity between embeddings
    return nn.MSELoss()(pred_lidar_embs, lidar_embs)

In [None]:
run = training.initWandbRun(
    "", epochs, BATCH_SIZE, projector_num_params, "Adam", "", lr, lr
)

projector_train_loss, projector_val_loss = training.train_model(
    projector, 
    optim,
    apply_projector_model, 
    projector_loss, 
    epochs, 
    train_dataloader, 
    val_dataloader, 
    device, 
    run, 
    scheduler=None, 
    output_name="projector", 
    calc_accuracy=False
)

run.finish()

visualization.plot_loss(epochs,
    {
        "Projector Train Loss": projector_train_loss,
        "Projector Val Loss": projector_val_loss
    }
)

These are the loss curves I got in my experiment. The minimal validation MSE is 0.0045.

![](../results/wandb-t5-projector.png)

Again, load the best saved weights for the projector model. This time we don't freeze it as we fine-tune it inside the final RGB-to-Lidar classifier.

In [None]:
projector = models.Projector(source_embedding_dim, target_embedding_dim)
projector.load_state_dict(torch.load("../checkpoints/projector.pt", map_location=device))
projector = projector.to(device)

## Final RGB-to-Lidar Classifier

Now we put this pretrained projector into action by constructing a new model to classify images. This `RGB2LiDARClassifier` takes raw images, creates image embeddings using CILP, predicts lidar embeddings from them using the projector, and finally uses the classifier head to make a decision. Since the CILP embedder and the classifier head are frozen, only the projector is fine-tuned further.
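The composition can be sketched as follows, using small stand-in modules. `RGB2LidarSketch`, the layer sizes, and the input shape are assumptions and not the actual `models.RGB2LiDARClassifier`. The sketch also overrides `train()` so that the frozen parts stay in eval mode even when a training loop calls `model.train()`; this matters if a frozen module contains BatchNorm or dropout layers. Whether the real model or `training.train_model` handles this is not shown here.

```python
import torch
import torch.nn as nn


class RGB2LidarSketch(nn.Module):
    """Frozen image embedder -> trainable projector -> frozen classifier head."""

    def __init__(self, projector, img_embedder, head):
        super().__init__()
        self.projector = projector
        self.img_embedder = img_embedder
        self.head = head
        for module in (self.img_embedder, self.head):
            for p in module.parameters():
                p.requires_grad = False

    def train(self, mode=True):
        super().train(mode)
        # keep frozen parts in eval mode (BatchNorm statistics, dropout)
        self.img_embedder.eval()
        self.head.eval()
        return self

    def forward(self, rgb):
        with torch.no_grad():  # no gradients needed for the frozen embedder
            img_embs = self.img_embedder(rgb)
        return self.head(self.projector(img_embs))


# stand-in modules with assumed sizes
img_embedder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 200), nn.BatchNorm1d(200))
projector_stub = nn.Sequential(nn.Linear(200, 100), nn.ReLU(), nn.Linear(100, 200))
head = nn.Linear(200, 1)

sketch = RGB2LidarSketch(projector_stub, img_embedder, head)
sketch.train()
trainable = [name for name, p in sketch.named_parameters() if p.requires_grad]

logits = sketch(torch.randn(8, 3, 16, 16))
logits.sum().backward()  # gradients flow through the frozen head into the projector
```

Listing `trainable` before starting a long run is a cheap way to confirm that only the projector parameters will be updated.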

In [None]:
rgbToLiderClassifier = models.RGB2LiDARClassifier(projector, cilp_model, lidar_classifier)
rgbToLidar_num_params = sum(p.numel() for p in rgbToLiderClassifier.parameters())

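epochs = 50
# very small learning rate for the BCE fine-tuning; assigned to `lr` because
# both the optimizer and the Wandb run config below read `lr`
lr = 1e-6
# we explicitly give the optimizer only the projector parameters so that only the projector is trained
optim = torch.optim.Adam(projector.parameters(), lr=lr, weight_decay=5e-5)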

In [None]:
def apply_rgb_Lidar_Classifier_model(model, batch):
    inputs_rgb, _, target = batch
    inputs_rgb = inputs_rgb.to(device)
    target = target.to(device)
    outputs = model(inputs_rgb)
    return outputs, target

loss_func = nn.BCEWithLogitsLoss()

In [None]:
run = training.initWandbRun(
    "", epochs, BATCH_SIZE, rgbToLidar_num_params, "Adam", "", lr, lr
)

final_train_loss, final_val_loss = training.train_model(
    rgbToLiderClassifier, 
    optim,
    apply_rgb_Lidar_Classifier_model, 
    loss_func, 
    epochs, 
    train_dataloader, 
    val_dataloader, 
    device, 
    run, 
    scheduler=None, 
    output_name="rgb_to_lidar_classifier"
)

run.finish()

visualization.plot_loss(epochs,
    {
        "Rgb-to-Lidar Train Loss": final_train_loss,
        "Rgb-to-Lidar Val Loss": final_val_loss
    }
)

Unfortunately, even after around 15-20 hours of debugging over multiple days, I couldn't get this last model to perform well.
The loss curves don't look good. While the training loss seems to decrease very slightly, the validation loss does not improve at all. The prediction accuracy is only around 0.5, which tells us the model is no better than random guessing.

![](../results/wandb-t5-final.png)

I used the following code to debug and better understand the problem. It takes the first validation batch, calculates the classification accuracy when using the "true" embeddings from the lidar classifier's embedder, and compares it with the accuracy when using the lidar embeddings predicted by the projector. I also compute the cosine similarity between the true and predicted embeddings, as well as their norms, to understand how similar or different they are.

In [None]:
projector.eval()
with torch.no_grad():
    batch = next(iter(val_dataloader))
    rgb, lidar_xyza, labels = batch
    rgb = rgb.to(device)
    lidar_xyza = lidar_xyza.to(device)
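    # flatten to shape (B,) so the comparisons below are element-wise and never broadcast to (B, B)
    labels = labels.to(device).reshape(-1)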

    # True LiDAR path
    true_embs = lidar_classifier.get_embs(lidar_xyza)
    logits_true = lidar_classifier(data_embs=true_embs).squeeze(1)
    acc_true = ((torch.sigmoid(logits_true) >= 0.5) == labels.bool()).float().mean()

    # Projected path
    img_embs = cilp_model.img_embedder(rgb)
    proj_embs = projector(img_embs)
    logits_proj = lidar_classifier(data_embs=proj_embs).squeeze(1)
    acc_proj = ((torch.sigmoid(logits_proj) >= 0.5) == labels.bool()).float().mean()

    # Alignment stats
    cosine = torch.nn.functional.cosine_similarity(proj_embs, true_embs, dim=1).mean()
    norm_true = true_embs.norm(dim=1).mean()
    norm_proj = proj_embs.norm(dim=1).mean()

print(f"Acc true: {acc_true:.3f} | Acc proj: {acc_proj:.3f}")
print(f"Cosine sim: {cosine:.3f} | Norm true: {norm_true:.3f} | Norm proj: {norm_proj:.3f}")

In my last run I got the following output:

```
Acc true: 0.969 | Acc proj: 0.625
Cosine sim: 0.435 | Norm true: 1.000 | Norm proj: 1.000
```

The accuracy when using the true lidar embeddings is quite high, which is expected given that the `lidar_classifier` has an overall validation accuracy of nearly 1. This indicates that the classification head is OK. The accuracy when using the projected embeddings is at least above 0.5, so not random. This indicates that the projector learns at least *something*. The norm of both embeddings is exactly 1, which means the enforced normalization in the projector works. The real problem is the low cosine similarity of just 0.435. This tells us that the true and predicted embeddings are not well aligned and point in different directions.
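The low cosine similarity is less surprising once the MSE is put on the same scale. For unit-norm vectors $a, b$ of dimension $D$, $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$, and `nn.MSELoss` averages over all $D$ elements, so the mean cosine similarity is $1 - D \cdot \mathrm{MSE} / 2$. The sketch below checks this on synthetic data and applies it to the validation MSE reported above, assuming $D = 200$ and exactly normalized embeddings; `implied_cosine` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F


def implied_cosine(mse, dim):
    """Mean cosine similarity implied by an element-wise MSE between unit vectors."""
    return 1.0 - dim * mse / 2.0


# numerical check of the identity on synthetic unit-norm embeddings
torch.manual_seed(0)
D = 200
true_embs = F.normalize(torch.randn(32, D), dim=1)
pred_embs = F.normalize(true_embs + 0.1 * torch.randn(32, D), dim=1)

mse = F.mse_loss(pred_embs, true_embs).item()
measured_cos = F.cosine_similarity(pred_embs, true_embs, dim=1).mean().item()
predicted_cos = implied_cosine(mse, D)

# applied to the best validation MSE of the projector (0.0045): roughly 0.55
cos_at_best_val_mse = implied_cosine(0.0045, 200)
```

Under these assumptions, an MSE of 0.0045 looks small but corresponds to an average cosine similarity of only about 0.55, which is in the same range as the 0.435 measured on the first validation batch.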

I also noticed that the two-phase training of the projector (first MSE, then BCE inside the RGB2Lidar classifier) is difficult. Often the BCE fine-tuning destabilizes the embedding alignment and makes the projector's accuracy worse. That's why I ended up using only a tiny learning rate for the BCE fine-tuning.

Things I tried:
- Using Cosine Similarity as the main loss function of the projector pre-training **instead** of MSE (-> did not help)
- Mixing MSE with Cosine Similarity in the loss function (-> did not help)
- Using different embedding dimensions for image and lidar (-> worked best when both are the same)
- Not normalizing the embeddings in the lidar classifier, CILP, and the projector (-> normalizing was better)
- Using a more complex projector architecture (-> extreme overfitting)
- Using different learning rates / weight-decay (higher learning rates -> overfitting)
- Trying out the exact code used in the NVIDIA lab, which worked there extremely well but did not work here :(
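For reference, a mixed MSE/cosine loss like the one mentioned in the list could look like the sketch below. This is an illustration, not the code from my experiments; the function name and the weighting parameter `alpha` are assumptions.

```python
import torch
import torch.nn.functional as F


def mixed_alignment_loss(pred_embs, target_embs, alpha=0.5):
    """Weighted sum of element-wise MSE and (1 - cosine similarity)."""
    mse_term = F.mse_loss(pred_embs, target_embs)
    cos_term = (1.0 - F.cosine_similarity(pred_embs, target_embs, dim=1)).mean()
    return alpha * mse_term + (1.0 - alpha) * cos_term


torch.manual_seed(0)
D = 200
target = F.normalize(torch.randn(32, D), dim=1)
pred = F.normalize(target + 0.1 * torch.randn(32, D), dim=1)

loss_value = mixed_alignment_loss(pred, target)
perfect = mixed_alignment_loss(target, target)  # identical embeddings -> ~0

# the two terms on their own, for comparison
mse_term = F.mse_loss(pred, target).item()
cos_term = (1.0 - F.cosine_similarity(pred, target, dim=1)).mean().item()
```

One possible reason why swapping or mixing the two losses did not change much: when both inputs are already unit-norm, `cos_term` equals `mse_term * D / 2`, so the two terms carry the same signal and differ only in scale.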