# Model Architecture

This section presents the three-stage pipeline used in the final assessment: contrastive pretraining (CILP), cross-modal projection, and RGB→LiDAR classification. The diagrams summarize how RGB and LiDAR information flow through the system and how each model component contributes to aligning, transforming, and classifying multimodal inputs. This high-level overview clarifies the purpose of each stage before moving on to implementation.

**Stage 1:** Contrastive Pretraining: CILP_model

**Goal:** align RGB and LiDAR in a shared 200-D embedding space, so that both modalities are encoded as vectors of the same dimensionality.
```
RGB ----> Img Encoder ----\
                            ----> CLIP-style similarity
LiDAR -> Lidar Encoder ----/
```
**Outcome:** Shared embedding space where matching RGB/LiDAR pairs have high similarity and non-matches low similarity.
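
The "CLIP-style similarity" in the diagram can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a minimal illustration, not the code of `ContrastivePretraining` (which is not shown here): the L2 normalization, the `temperature` value of 0.07, and the function name `clip_style_loss` are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, lidar_emb, temperature=0.07):
    """Symmetric cross-entropy over a (B, B) cosine-similarity matrix.

    Assumption: embeddings are L2-normalized and scaled by a fixed temperature.
    """
    img_emb = F.normalize(img_emb, dim=1)
    lidar_emb = F.normalize(lidar_emb, dim=1)
    logits_per_img = img_emb @ lidar_emb.t() / temperature   # (B, B)
    logits_per_lidar = logits_per_img.t()                    # (B, B)
    # Pair i in the batch is the only correct match for row i.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i = F.cross_entropy(logits_per_img, targets)
    loss_l = F.cross_entropy(logits_per_lidar, targets)
    return (loss_i + loss_l) / 2

torch.manual_seed(0)
emb = torch.randn(8, 200)                               # 8 samples in a 200-D space
aligned = clip_style_loss(emb, emb)                     # every pair matches
shuffled = clip_style_loss(emb, emb.roll(1, dims=0))    # every pair is mismatched
```

Matching pairs sit on the diagonal of the similarity matrix, so `aligned` comes out much smaller than `shuffled`.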

----------------------------

**Stage 2:** Projector Training: projector

**Goal:** learn a mapping from RGB CILP embeddings to LiDAR embeddings used by lidar_cnn:
ℝ²⁰⁰ (CILP RGB embedding) → ℝ³²⁰⁰ (LiDAR-CNN embedding)

After training, the projector can “pretend” that an RGB input is LiDAR in feature space: for each paired RGB/LiDAR sample, projected_RGB_embedding ≈ the true LiDAR embedding.
```
RGB ----> Img Encoder ----> Projector ----> LiDAR embedding
                                     |
                                     v
                             MSE-loss to true LiDAR embedding

```
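
A rough sketch of this stage, assuming the projector is a small MLP. The real `Projector` class lives in `src.models` and is not shown, so `ToyProjector`, its hidden width of 1024, and the random tensors standing in for frozen embeddings are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyProjector(nn.Module):
    """Hypothetical stand-in for Projector: an MLP from 200-D to 3200-D."""
    def __init__(self, in_dim=200, out_dim=3200, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
projector_sketch = ToyProjector()
opt = torch.optim.Adam(projector_sketch.parameters(), lr=1e-3)

rgb_emb = torch.randn(16, 200)      # stands in for frozen CILP RGB embeddings
lidar_emb = torch.randn(16, 3200)   # stands in for frozen LiDAR-CNN embeddings

losses = []
for _ in range(20):
    opt.zero_grad()
    loss = F.mse_loss(projector_sketch(rgb_emb), lidar_emb)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Only the projector receives gradient updates here; both embedding tensors are treated as fixed targets and inputs, mirroring the frozen encoders in the diagram.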
----------------------------

**Stage 3:** Final Classifier: RGB2LiDARClassifier

**Goal:** chain all models together to classify spheres and cubes from RGB images.

The model treats RGB inputs as if they were LiDAR in the internal feature space and then applies the LiDAR classifier.
```
RGB (img) ----> (CILP Img Encoder) ----> 200-D CILP embedding ----> (Projector) ---> 3200-D LiDAR embedding
---> (LiDAR Classifier) ---> cube/sphere

```

# Setup

This section installs required dependencies, mounts Google Drive, defines data paths, and sets global settings such as the random seed and device. It prepares the environment so that all subsequent loading, training, and evaluation steps run consistently and reproducibly, both in Colab and locally.

In [None]:
%%capture
%pip install wandb weave

In [None]:
%%capture
%pip install fiftyone==1.10.0 sympy==1.12 torch==2.9.0 torchvision==0.24.0 numpy open-clip-torch

## Imports

Here we import all libraries and modules needed for the final assessment: PyTorch, torchvision transforms, W&B, dataset utilities, training utilities, and the model classes used across all three stages. Centralizing imports keeps the notebook organized and ensures that each component is available when needed.

In [None]:
import os
from pathlib import Path
from google.colab import userdata

import torch
import torch.nn as nn
import torch.nn.functional as F
#import torchvision.transforms as transforms
import torchvision.transforms.v2 as transforms

import wandb

In [None]:
from google.colab import drive
drive.mount('/content/drive')

STORAGE_PATH = Path("/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/")
TMP_STORAGE_PATH = Path("/content")  # must be a Path so the "/" operator below works

# DATA_PATH = STORAGE_PATH / "data/assessment"
DATA_PATH = TMP_STORAGE_PATH / "data/assessment"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!cp -r "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/multimodal_training_workshop/data" /content/data

In [None]:
%cd "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/"

/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02


In [None]:
from src.utility import set_seeds
from src.datasets import compute_dataset_mean_std_neu, get_cilp_dataloaders
from src.training import compute_class_weights, get_rgb_inputs, train_model, init_wandb, train_with_batch_loss
#from src.visualization import build_fusion_comparison_df, plot_losses
from src.models import CILPBackbone, ContrastivePretraining, Classifier, Projector, EmbedderMaxPool, RGB2LiDARClassifier

In [None]:
import importlib, src.training as training
importlib.reload(training)
from src.training import train_with_batch_loss


## Constants

We define key configuration values such as the seed, batch size, image size, number of workers, and label mappings. These constants ensure consistent behavior across all stages and make the hyperparameters easy to adjust or reference later.

In [None]:
SEED = 51
NUM_WORKERS = os.cpu_count()  # Number of CPU cores

BATCH_SIZE = 32
IMG_SIZE = 64

CLASSES = ["cubes", "spheres"]
NUM_CLASSES = len(CLASSES)
LABEL_MAP = {"cubes": 0, "spheres": 1}

VALID_BATCHES = 10
N = 12500

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.cuda.is_available()

True

In [None]:
# Usage: Call this function at the beginning and before each training phase
set_seeds(SEED)

All random seeds set to 51 for reproducibility
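
`set_seeds` is imported from `src.utility` and its body is not shown. As a minimal sketch of what such a helper typically does, assuming it seeds Python's `random` module and PyTorch (the name `seed_everything_sketch` is hypothetical and the real function may seed more, e.g. NumPy):

```python
import random
import torch

def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in for src.utility.set_seeds (real code not shown)."""
    random.seed(seed)                  # Python's built-in RNG
    torch.manual_seed(seed)            # PyTorch RNG
    torch.cuda.manual_seed_all(seed)   # all CUDA devices; ignored without CUDA

seed_everything_sketch(51)
first = (random.random(), torch.rand(3))
seed_everything_sketch(51)
second = (random.random(), torch.rand(3))
```

Re-seeding before each training phase, as the notebook does, makes each phase start from the same random state regardless of what ran before it.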


# Integration of Wandb

This section authenticates with Weights & Biases using the API key stored in Colab Secrets. Initializing W&B enables automatic logging of losses, metrics, hyperparameters, and summary statistics for all training stages. This satisfies the experiment-tracking requirement of the assessment.

In [None]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()



True

# Loading and preparation of Data

We compute dataset statistics, define preprocessing transforms, and build train/validation/test DataLoaders for the assessment dataset. This ensures that all stages—CILP pretraining, projector training, and final classification—operate on consistently preprocessed and reproducible batches of data.

In [None]:
## Final: computed dynamically
# mean, std = compute_dataset_mean_std(root_dir=root, img_size=IMG_SIZE)
mean, std = compute_dataset_mean_std_neu(root_dir=DATA_PATH, img_size=IMG_SIZE, seed=SEED)
print(mean, std)

In [None]:
## Final: dynamic (hard-coded statistics are used below; the computed mean/std can be swapped in)
img_transforms = transforms.Compose([
    transforms.ToImage(),   # Converts the input to a tensor image (no value scaling)
    transforms.Resize(IMG_SIZE),
    transforms.ToDtype(torch.float32, scale=True),   # Casts to float32 and scales into [0, 1]
    transforms.Normalize(([0.0051, 0.0052, 0.0051, 1.0000]), ([5.8023e-02, 5.8933e-02, 5.8108e-02, 2.4509e-07]))     ## assessment dataset (precomputed)
    # transforms.Normalize(mean.tolist(), std.tolist())     ## assessment dataset (computed above)
])
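
The fourth channel's standard deviation in the statistics above (2.4509e-07) is close to zero, which suggests the channel is almost constant. As a small worked example, assuming per-channel normalization computes `(x - mean) / std`, the arithmetic below shows how strongly such a small divisor amplifies a deviation of a single 8-bit level. The helper name `normalize` is ours, for illustration only.

```python
def normalize(value, mean, std):
    """Per-channel normalization as a plain formula: (x - mean) / std."""
    return (value - mean) / std

alpha_mean, alpha_std = 1.0, 2.4509e-07   # fourth-channel statistics from the cell above

exact = normalize(1.0, alpha_mean, alpha_std)                   # channel value equals the mean
off_by_one_level = normalize(254 / 255, alpha_mean, alpha_std)  # one 8-bit level below the mean
```

Here `exact` is 0.0, while `off_by_one_level` is on the order of -16000. Whether this matters depends on whether the channel ever deviates from 1.0 in practice, which we have not verified.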

In [None]:
set_seeds(SEED)

train_data, train_dataloader, valid_data, val_dataloader, test_data, test_dataloader  = get_cilp_dataloaders(
    str(DATA_PATH),
    VALID_BATCHES,
    test_frac=0.10,
    batch_size=BATCH_SIZE,
    img_transforms=img_transforms,
    num_workers=NUM_WORKERS,
    seed=SEED
)

for i, sample in enumerate(train_data):
    print(i, *(x.shape for x in sample))
    break

[CILP] Total samples: 12500
[CILP] Train: 10962, Val: 320, Test: 1218
0 torch.Size([4, 64, 64]) torch.Size([1, 64, 64]) torch.Size([1])
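
The printed split sizes are consistent with first reserving `VALID_BATCHES * BATCH_SIZE` samples for validation and then taking `test_frac` of the remainder for testing. `get_cilp_dataloaders` itself is not shown, so the sketch below is an inference from the output rather than the actual implementation, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_total, valid_batches, batch_size, test_frac):
    """Reproduce the reported split sizes under the assumed splitting order."""
    n_val = valid_batches * batch_size               # fixed number of validation samples
    n_test = round((n_total - n_val) * test_frac)    # fraction of what remains
    n_train = n_total - n_val - n_test
    return n_train, n_val, n_test

sizes = split_sizes(n_total=12500, valid_batches=10, batch_size=32, test_frac=0.10)
```

With the constants defined earlier this gives `(10962, 320, 1218)`, matching the `[CILP]` log line.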


# Model Training

This section introduces the architectural choices for the CILP backbone and projector, as well as the rationale behind using a two-layer projection head. It describes how embeddings are produced and how the shared embedding space is leveraged across the three training stages. This conceptual grounding precedes the implementation of Stage 1.


We use a two-layer projection head with a 512-dimensional hidden layer between the backbone’s CNN features and the final CILP embedding.
A width of 512 is a common choice in contrastive learning (CLIP’s ViT-B/32 variant, for example, uses a 512-dimensional joint embedding) and offers a reasonable balance between expressiveness and computational cost.
The head applies a nonlinear transformation from the high-dimensional CNN output to the shared embedding space, and the bottleneck limits the capacity of the head, which may help against overfitting.
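
To make the sizes concrete, the arithmetic below reproduces the `in_features=8192` of the first linear layer in the printed model summary and counts the parameters of the head. It assumes the 64×64 input is halved by a 2×2 max-pool after each of the three convolutions, which is inferred from 8192 = 128 · 8 · 8 and not confirmed by the `EmbedderMaxPool` source. Both helper functions are hypothetical.

```python
def flattened_size(img_size, channels, num_pools):
    """Feature count after `num_pools` 2x2 poolings and flattening."""
    side = img_size // (2 ** num_pools)
    return channels * side * side

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

flat = flattened_size(img_size=64, channels=128, num_pools=3)     # 128 * 8 * 8
head_params = linear_params(flat, 512) + linear_params(512, 200)  # 8192 -> 512 -> 200
```

Under these assumptions `flat` is 8192 and the head has 4,297,416 parameters, almost all of them in the first layer.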

## Stage 1: CILP contrastive pretraining

In this stage, we train a dual-encoder contrastive model to align RGB and LiDAR representations in a shared embedding space. The section defines the batch loss function, sets up optimizers, and runs the contrastive training loop using train_with_batch_loss. The best-performing model is saved and logged to W&B. After training, the CILP encoders are frozen for later stages.

In [None]:
BEST_EMBEDDER = EmbedderMaxPool
FEATURE_DIM = 128
CILP_EMB_SIZE = 200

In [None]:
set_seeds(SEED)

img_embedder = CILPBackbone(
    in_ch=4,
    embedder_cls=BEST_EMBEDDER,
    feature_dim=FEATURE_DIM,
    emb_size=CILP_EMB_SIZE
).to(device)

lidar_embedder = CILPBackbone(
    in_ch=1,
    embedder_cls=BEST_EMBEDDER,
    feature_dim=FEATURE_DIM,
    emb_size=CILP_EMB_SIZE
).to(device)

In [None]:
# Initialize the model
set_seeds(SEED)

CILP_model = ContrastivePretraining(img_embedder, lidar_embedder).to(device)

loss_img = nn.CrossEntropyLoss()
loss_lidar = nn.CrossEntropyLoss()

In [None]:
def cilp_batch_loss_fn(model, batch, device):
    """
    Symmetric contrastive loss for one batch of paired (RGB, LiDAR) samples.

    The model returns (logits_per_img, logits_per_lidar), each of shape (B, B).

    We build ground-truth indices 0..B-1 so that:
      - row i in logits_per_img should classify LiDAR i as the correct match
      - row i in logits_per_lidar should classify RGB i as the correct match

    Returns (loss, log_dict), the convention expected by train_with_batch_loss.
    """
    rgb, lidar, _ = batch
    rgb = rgb.to(device)
    lidar = lidar.to(device)

    logits_per_img, logits_per_lidar = model(rgb, lidar)   # (B, B)
    B = logits_per_img.size(0)
    ground_truth = torch.arange(B, dtype=torch.long, device=device)

    loss_i = loss_img(logits_per_img, ground_truth)
    loss_l = loss_lidar(logits_per_lidar, ground_truth)

    # Average the two directions (image -> LiDAR and LiDAR -> image)
    total_loss = (loss_i + loss_l) / 2.0

    return total_loss, {"loss": total_loss.item()}

In [None]:
import time
import torch
import numpy as np

def train_with_batch_loss(
    model,
    optimizer,
    train_dataloader,
    val_dataloader,
    batch_loss_fn,
    epochs,
    model_save_path,
    log_to_wandb=False,
    device=None,
    extra_args=None,
):
    """
    Generic training loop where loss is computed by a batch_loss_fn.

    batch_loss_fn(model, batch, device, **extra_args) must return either:
      - loss
      - (loss, log_dict)
    """
    best_val_loss = float("inf")
    train_losses = []
    valid_losses = []

    use_cuda = torch.cuda.is_available()
    max_gpu_mem_mb = 0.0
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()

    for epoch in range(epochs):
        start_time = time.time()
        print(f"[BatchLoss] Epoch {epoch+1}/{epochs}")

        # ---------- TRAIN ----------
        model.train()
        train_loss_epoch = 0.0

        for step, batch in enumerate(train_dataloader):
            optimizer.zero_grad()

            out = batch_loss_fn(model, batch, device, **(extra_args or {}))
            if isinstance(out, tuple):
                loss, log_dict = out
            else:
                loss, log_dict = out, {}

            loss.backward()
            optimizer.step()

            train_loss_epoch += loss.item()

        train_loss_epoch /= (step + 1)
        train_losses.append(train_loss_epoch)

        # ---------- VALID ----------
        model.eval()
        valid_loss_epoch = 0.0
        with torch.no_grad():
            for step, batch in enumerate(val_dataloader):
                out = batch_loss_fn(model, batch, device, **(extra_args or {}))
                if isinstance(out, tuple):
                    loss, _ = out
                else:
                    loss = out
                valid_loss_epoch += loss.item()

        valid_loss_epoch /= (step + 1)
        valid_losses.append(valid_loss_epoch)

        # ---------- SAVE BEST ----------
        if valid_loss_epoch < best_val_loss:
            best_val_loss = valid_loss_epoch
            torch.save(model.state_dict(), model_save_path)
            print("Found and saved better weights for the model")

        # ---------- LOG / PRINT ----------
        epoch_time = time.time() - start_time
        if use_cuda:
            gpu_mem_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
            max_gpu_mem_mb = max(max_gpu_mem_mb, gpu_mem_mb)
        else:
            gpu_mem_mb = 0.0

        print(
            f"epoch {epoch} train loss: {train_loss_epoch}"
        )
        print(
            f"epoch {epoch} valid loss: {valid_loss_epoch}"
        )

        if log_to_wandb:
            import wandb
            wandb.log(
                {
                    "model": model.__class__.__name__,
                    "epoch": epoch + 1,
                    "train_loss": train_loss_epoch,
                    "valid_loss": valid_loss_epoch,
                    "lr": optimizer.param_groups[0]["lr"],
                    "epoch_time": epoch_time,
                    "max_gpu_mem_mb_epoch": gpu_mem_mb if use_cuda else 0.0,
                }
            )

    return {
        "train_losses": train_losses,
        "valid_losses": valid_losses,
        "best_valid_loss": float(best_val_loss),
        "max_gpu_mem_mb": float(max_gpu_mem_mb),
        "num_params": sum(p.numel() for p in model.parameters() if p.requires_grad),
    }


In [None]:
## train CILP
EPOCHS_CILP = 5
LR_CILP = 0.0001

opt = torch.optim.Adam(CILP_model.parameters(), LR_CILP)

best_val = float("inf")

# Path where best model is saved
checkpoint_dir = STORAGE_PATH / "checkpoints"
checkpoint_dir.mkdir(exist_ok=True)
model_save_path=checkpoint_dir / "cilp_model.pth"

init_wandb(
    model=CILP_model,
    opt_name = opt.__class__.__name__
)

results = train_with_batch_loss(
    model=CILP_model,
    optimizer=opt,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    batch_loss_fn=cilp_batch_loss_fn,
    epochs=EPOCHS_CILP,
    model_save_path=model_save_path,
    log_to_wandb=True,
    device=device
)

best_cilp_val = results["best_valid_loss"]
print(f"[5.1] Best CILP validation loss: {best_cilp_val:.4f}")


wandb.run.summary["cilp_best_val_loss"] = best_cilp_val
# End wandb run before starting the next model
wandb.finish()

[BatchLoss] Epoch 1/5
Found and saved better weights for the model
epoch 0 train loss: 2.6652660460499993
epoch 0 valid loss: 2.601843976974487
[BatchLoss] Epoch 2/5
Found and saved better weights for the model
epoch 1 train loss: 2.592795539320561
epoch 1 valid loss: 2.5814525842666627
[BatchLoss] Epoch 3/5
Found and saved better weights for the model
epoch 2 train loss: 2.579532387660958
epoch 2 valid loss: 2.5765286922454833
[BatchLoss] Epoch 4/5
Found and saved better weights for the model
epoch 3 train loss: 2.570431894029093
epoch 3 valid loss: 2.573051905632019
[BatchLoss] Epoch 5/5
Found and saved better weights for the model
epoch 4 train loss: 2.56716895870298
epoch 4 valid loss: 2.568479371070862
[5.1] Best CILP validation loss: 2.5685


Run history:
epoch,▁▃▅▆█
epoch_time,▁▅▁█▃
lr,▁▁▁▁▁
max_gpu_mem_mb_epoch,▁████
train_loss,█▃▂▁▁
valid_loss,█▄▃▂▁

Run summary:
cilp_best_val_loss,2.56848
epoch,5
epoch_time,7.05844
lr,0.0001
max_gpu_mem_mb_epoch,390.20605
model,ContrastivePretraini...
train_loss,2.56717
valid_loss,2.56848
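
To put the reported loss in context: for a batch of B pairs, a model that outputs identical logits for every candidate has a cross-entropy of ln(B). The short calculation below compares that chance level for a full batch of 32 with the best validation loss from the log. The "effective candidates" reading (the exponential of the loss) is a rough interpretation, not a metric computed by the notebook.

```python
import math

batch_size = 32
chance_loss = math.log(batch_size)        # loss of a uniform guess over 32 candidates

best_val_loss = 2.56848                   # taken from the run summary above
effective_candidates = math.exp(best_val_loss)
```

The chance level is about 3.47, so the trained model is below it, and a loss of 2.57 corresponds to choosing among roughly 13 equally likely candidates instead of 32.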


In [None]:
## freeze pre-trained model
for param in CILP_model.parameters():
    param.requires_grad = False

CILP_model.eval()

ContrastivePretraining(
  (img_embedder): CILPBackbone(
    (encoder): EmbedderMaxPool(
      (conv1): Conv2d(4, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (dense_emb): Sequential(
      (0): Linear(in_features=8192, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=200, bias=True)
    )
  )
  (lidar_embedder): CILPBackbone(
    (encoder): EmbedderMaxPool(
      (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mo

## Stage 2: Cross-Modal Projection

Here we train a projector that maps CILP RGB embeddings into the LiDAR-CNN feature space. Using a frozen CILP model and a pretrained LiDAR classifier, we minimize an MSE loss between predicted LiDAR embeddings and true LiDAR embeddings. This stage enables the system to “pretend” RGB images look like LiDAR feature vectors, allowing downstream LiDAR-based classification from RGB alone.

In [None]:
# load pre-trained lidar_cnn classifier
lidar_cnn_path = STORAGE_PATH / "checkpoints/lidar_cnn.pt"

lidar_cnn = Classifier(in_ch=1).to(device)
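# map_location lets a checkpoint saved on GPU load on a CPU-only machine as well
lidar_cnn.load_state_dict(torch.load(lidar_cnn_path, map_location=device, weights_only=True))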

for param in lidar_cnn.parameters():
    param.requires_grad = False

lidar_cnn.eval()

Classifier(
  (embedder): Sequential(
    (0): Conv2d(1, 50, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(50, 100, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(100, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU()
    (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (9): Conv2d(200, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (10): ReLU()
    (11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (12): Flatten(start_dim=1, end_dim=-1)
  )
  (classifier): Sequential(
    (0): Linear(in_features=3200, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=1, bias=True)
  )
)

In [None]:
def projector_batch_loss_fn(model, batch, device, CILP_model, lidar_cnn):
    rgb_img, lidar_depth, _ = batch
    rgb_img = rgb_img.to(device)
    lidar_depth = lidar_depth.to(device)

    # Use frozen encoders
    CILP_model.eval()
    lidar_cnn.eval()

    with torch.no_grad():
        img_embs = CILP_model.img_embedder(rgb_img)      # (B, CILP_EMB_SIZE)
        lidar_embs = lidar_cnn.get_embs(lidar_depth)     # (B, lidar_dim)

    pred_lidar_embs = model(img_embs)

    loss = F.mse_loss(pred_lidar_embs, lidar_embs)

    # match the convention used in train_with_batch_loss
    return loss, {"loss": loss.item()}


In [None]:
img_dim = CILP_EMB_SIZE
lidar_dim = 200 * 4 * 4

set_seeds(SEED)

projector = Projector(img_dim, lidar_dim).to(device)

In [None]:
EPOCHS_PROJ = 40
LR_PROJECTOR = 1e-4
model_save_path=checkpoint_dir / "projector.pth"

opt = torch.optim.Adam(projector.parameters(), LR_PROJECTOR)

best_val_proj = float("inf")

init_wandb(
    model=projector,
    opt_name = opt.__class__.__name__
)

results = train_with_batch_loss(
    model=projector,
    optimizer=opt,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    batch_loss_fn=projector_batch_loss_fn,
    epochs=EPOCHS_PROJ,
    model_save_path=model_save_path,
    log_to_wandb=True,
    device=device,
    extra_args={
        "CILP_model": CILP_model,
        "lidar_cnn": lidar_cnn,
    }
)

best_proj_val = results["best_valid_loss"]
print(f"[5.2] Best projector validation MSE: {best_proj_val:.4f}")

wandb.run.summary["projector_best_val_mse"] = best_proj_val
# End wandb run before starting the next model
wandb.finish()

[BatchLoss] Epoch 1/40
Found and saved better weights for the model
epoch 0 train loss: 1.6388600946169847
epoch 0 valid loss: 1.2890848696231842
[BatchLoss] Epoch 2/40
Found and saved better weights for the model
epoch 1 train loss: 1.6224197928319897
epoch 1 valid loss: 1.2846409142017365
[BatchLoss] Epoch 3/40
Found and saved better weights for the model
epoch 2 train loss: 1.6049904155800914
epoch 2 valid loss: 1.2684852600097656
[BatchLoss] Epoch 4/40
Found and saved better weights for the model
epoch 3 train loss: 1.5911426063169514
epoch 3 valid loss: 1.2645985782146454
[BatchLoss] Epoch 5/40
Found and saved better weights for the model
epoch 4 train loss: 1.5782182199216028
epoch 4 valid loss: 1.2419158279895783
[BatchLoss] Epoch 6/40
Found and saved better weights for the model
epoch 5 train loss: 1.563390554740415
epoch 5 valid loss: 1.2293817102909088
[BatchLoss] Epoch 7/40
Found and saved better weights for the model
epoch 6 train loss: 1.5521342682908152
epoch 6 valid loss

Run history:
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
epoch_time,▄▂▁▇▂▁▂▆▁▂▅▄▂▁▇▂▂▄▅▁▁▆▂▂▂█▄▄▇▅▁▄▅▂▁▆▂▁▁▇
lr,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
max_gpu_mem_mb_epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_loss,██▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁
valid_loss,███▇▇▇▇▆▆▆▆▆▆▅▅▅▅▄▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁

Run summary:
epoch,40
epoch_time,4.89051
lr,0.0001
max_gpu_mem_mb_epoch,327.29541
model,Projector
projector_best_val_mse,0.96225
train_loss,1.19839
valid_loss,0.96468


## Stage 3: RGB2LiDARClassifier

In the final stage, we build and train a model that chains together the CILP backbone, the trained projector, and the LiDAR classifier. Only the projector is trainable; all other components remain frozen. The model learns to classify cubes and spheres directly from RGB input by leveraging the LiDAR feature space. The section logs training and validation metrics and reports the final accuracy.

In [None]:
import torch
import torch.nn as nn

def train_rgb2lidar_classifier(
    model,
    train_loader,
    val_loader,
    epochs,
    lr,
    device,
):
    model = model.to(device)

    # IMPORTANT: Only projector is trained
    optimizer = torch.optim.Adam(model.projector.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()

    history = {
        "train_loss": [],
        "val_loss": [],
        "val_acc": []
    }

    # sanity check: grads enabled
    torch.set_grad_enabled(True)

    for epoch in range(epochs):
        print(f"\nEpoch {epoch+1}/{epochs}")

        # ---------- TRAIN ----------
        model.train()
        running_loss = 0.0

        for imgs, _, labels in train_loader:
            imgs = imgs.to(device)
            labels = labels.float().view(-1, 1).to(device)   # [B,1]

            optimizer.zero_grad()

            logits = model(imgs)          # [B,1] – MUST be computed with grad

            # DEBUG: verify graph is alive
            if epoch == 0 and running_loss == 0.0:
                print("DEBUG logits.requires_grad:", logits.requires_grad)

            loss = loss_fn(logits, labels)

            # DEBUG: verify loss is linked to graph
            if epoch == 0 and running_loss == 0.0:
                print("DEBUG loss.requires_grad:", loss.requires_grad)

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        train_loss = running_loss / len(train_loader)
        history["train_loss"].append(train_loss)

        # ---------- VALID ----------
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for imgs, _, labels in val_loader:
                imgs = imgs.to(device)
                labels = labels.float().view(-1, 1).to(device)

                logits = model(imgs)              # [B,1]
                loss = loss_fn(logits, labels)
                val_loss += loss.item()

                probs = torch.sigmoid(logits)     # [B,1], 0–1
                preds = (probs >= 0.5).long()     # [B,1]
                correct += (preds.view(-1) == labels.view(-1).long()).sum().item()
                total += labels.size(0)

        val_loss = val_loss / len(val_loader)
        val_acc = correct / total

        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)

        print(f"train_loss={train_loss:.4f}  val_loss={val_loss:.4f}  val_acc={val_acc*100:.2f}%")

    return history
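
The validation loop thresholds `sigmoid(logits)` at 0.5. Because the sigmoid is monotonic and equals 0.5 at zero, this is equivalent to thresholding the raw logit at 0. A dependency-free sketch with made-up logits and labels (both helper names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy_from_logits(logits, labels):
    """Binary accuracy using the logit >= 0 rule (same as sigmoid >= 0.5)."""
    preds = [1 if z >= 0.0 else 0 for z in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

logits = [-2.0, -0.1, 0.0, 0.3, 4.0]   # illustrative values, not model output
labels = [0, 1, 1, 1, 0]

acc = accuracy_from_logits(logits, labels)
same_rule = all((sigmoid(z) >= 0.5) == (z >= 0.0) for z in logits)
```

Three of the five predictions match, so `acc` is 0.6, and `same_rule` confirms that both thresholds agree on every example.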


In [None]:
LR_RGB2LIDAR = 1e-3
EPOCHS_RGB2LIDAR = 5

set_seeds(SEED)

rgb2lidar_clf = RGB2LiDARClassifier(
    CILP=CILP_model,
    projector=projector,
    lidar_cnn=lidar_cnn,
).to(device)

#class_weights = compute_class_weights(train_data, NUM_CLASSES).to(device)
#loss_func = nn.CrossEntropyLoss(weight=class_weights.to(device))
#loss_func = nn.BCEWithLogitsLoss()

#opt = torch.optim.Adam(rgb2lidar_clf.parameters(), lr=LR_RGB2LIDAR)

# 1) Freeze img_embedder & lidar_cnn
for p in rgb2lidar_clf.img_embedder.parameters():
    p.requires_grad = False

for p in rgb2lidar_clf.shape_classifier.parameters():
    p.requires_grad = False

# 2) Ensure projector is trainable
for p in rgb2lidar_clf.projector.parameters():
    p.requires_grad = True

print("ANY trainable in projector?",
      any(p.requires_grad for p in rgb2lidar_clf.projector.parameters()))


results = train_rgb2lidar_classifier(
    model=rgb2lidar_clf,
    train_loader=train_dataloader,
    val_loader=val_dataloader,
    epochs=EPOCHS_RGB2LIDAR,
    lr=LR_RGB2LIDAR,
    device=device
)

# train_rgb2lidar_classifier returns per-epoch lists under "train_loss", "val_loss", "val_acc"
best_rgb2lidar_val = min(results["val_loss"])
print(f"[5.3] Best validation loss: {best_rgb2lidar_val:.4f}")
best_rgb2lidar_acc = max(results["val_acc"])
print(f"[5.3] Best validation accuracy: {best_rgb2lidar_acc:.4f}")

ANY trainable in projector? True

Epoch 1/5
DEBUG logits.requires_grad: False
DEBUG loss.requires_grad: False


RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
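
The error means the logits were produced without an autograd graph, even though the projector's parameters have `requires_grad=True`. The source of `RGB2LiDARClassifier` is not shown, so the cause cannot be confirmed here. Common culprits are a `torch.no_grad()` block that wraps the whole forward pass, or a `.detach()` applied after the projector. The sketch below shows one way to structure the chain so that only the frozen image encoder runs without gradient tracking; `ChainedClassifierSketch` and the tiny stand-in layers are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainedClassifierSketch(nn.Module):
    """Hypothetical stand-in for RGB2LiDARClassifier (real class not shown)."""
    def __init__(self, img_embedder, projector, head):
        super().__init__()
        self.img_embedder = img_embedder
        self.projector = projector
        self.head = head
        # Freeze everything except the projector.
        for module in (self.img_embedder, self.head):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, rgb):
        with torch.no_grad():              # frozen encoder: no graph needed
            emb = self.img_embedder(rgb)
        lidar_like = self.projector(emb)   # tracked: the projector is trainable
        return self.head(lidar_like)       # frozen weights, gradients still flow through

torch.manual_seed(0)
model = ChainedClassifierSketch(
    img_embedder=nn.Sequential(nn.Flatten(), nn.Linear(4 * 8 * 8, 200)),
    projector=nn.Linear(200, 3200),
    head=nn.Sequential(nn.Linear(3200, 100), nn.ReLU(), nn.Linear(100, 1)),
)

logits = model(torch.randn(5, 4, 8, 8))
loss = F.binary_cross_entropy_with_logits(logits, torch.ones(5, 1))
loss.backward()
```

In this layout `logits.requires_grad` is `True`, `backward()` succeeds, and only the projector accumulates gradients. The classifier head must stay outside the `no_grad` block: its weights are frozen, but the gradient has to pass through it to reach the projector.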