# Overview: Final Multimodal Pipeline

This notebook implements the complete three-stage multimodal pipeline used for the final assessment. It combines contrastive pretraining (CILP), cross-modal projection, and RGB→LiDAR classification to enable object classification from RGB images via a LiDAR feature space. Each stage is trained sequentially under controlled settings, with performance and diagnostics logged using Weights & Biases.

# Model Architecture:

This section presents the three-stage pipeline used in the final assessment: contrastive pretraining (CILP), cross-modal projection, and RGB→LiDAR classification. The diagrams summarize how RGB and LiDAR information flow through the system and how each model component contributes to aligning, transforming, and classifying multimodal inputs. This high-level overview clarifies the purpose of each stage before moving on to implementation.

**Stage 1:** Contrastive Pretraining: CILP_model

**Goal:** align RGB and LiDAR in a shared 200-D space --> encodes both rgb and lidar in the same dimensionality space
```
          (200-D emb.)
RGB ----> Img Encoder ----\
                            ----> CLIP-style similarity
LiDAR -> Lidar Encoder ----/
          (200-D emb.)
```
**Outcome:** Shared embedding space where matching RGB/LiDAR pairs have high similarity and non-matches low similarity.

----------------------------

**Stage 2:** Projector Training: projector

**Goal:** learn a mapping from RGB CILP embeddings to LiDAR embeddings used by lidar_cnn:
ℝ²⁰⁰ (CILP RGB embedding) → ℝ³²⁰⁰ (LiDAR-CNN embedding)

projector knows how to “pretend” RGBs are LiDAR internally: projected_RGB_embedding ≈ “real” LiDAR embedding for each paired RGB/LiDAR sample.
```
RGB ----> Img Encoder ----> Projector ----> LiDAR embedding
        (200-d emb.)    (3200-d emb.)   |
                                        v
                             MSE-loss to true LiDAR embedding

```

We freeze the pretrained CILP model and the pretrained LiDAR CNN and train a projector to map RGB embeddings into the LiDAR embedding space.
The input to the projector is the 200-dimensional RGB embedding produced by the CILP image encoder. The target is the LiDAR embedding of dimension 3200, obtained by flattening the LiDAR CNN feature map (200 channels × 4 × 4 spatial resolution).

The projector is implemented as a small MLP that maps 200 → 3200 and is trained using mean squared error (MSE) loss between the projected RGB embedding and the corresponding LiDAR embedding. Training uses the Adam optimizer with a fixed learning rate and runs for multiple epochs, selecting the best checkpoint based on lowest validation MSE.

The final model achieves a best validation MSE below 2.5, satisfying the project requirement. Training and validation loss curves, as well as the final validation performance, are logged to Weights & Biases.

----------------------------

**Stage 3:** Final Classifier: RGB2LiDARClassifier

**Goal:** chaining all models together to classify spheres and cubes from images

pretends the RGBs look like LiDAR in the internal feature space and then uses LiDAR classifier.
```
RGB (img) ----> (CILP Img Encoder) ----> CILP emb. ----> (Projector) ---> LiDAR emb. ---> (LiDAR Classifier) ---> cube/sphere
                                        (200-D emb.)                      (3200-D emb.)
```

We train a binary RGB-to-LiDAR classifier using frozen pretrained CILP encoders and the trained RGB→LiDAR projector; only the classifier head is trainable. The model operates on the 3200-dimensional projected RGB embedding and LiDAR embedding and is optimized with binary cross-entropy with logits, using class weighting to address class imbalance.

Evaluation is performed on at least five validation batches, and the final model achieves a validation accuracy above 95%. Loss curves, final performance, and sample predictions are logged to Weights & Biases.

# Initial Setup

The project repository is mounted from Google Drive and added to the Python path to allow clean imports from the src module. The dataset is copied to the local Colab filesystem to improve I/O performance during training. All global settings (random seed, device selection, paths, batch sizes) are defined once and reused across the notebook to ensure consistency and reproducibility.

Weights & Biases is initialized for experiment tracking, and all training stages use the same precomputed dataset statistics and DataLoaders for fair comparison across models.

In [None]:
import sys
from pathlib import Path

from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/"

PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

In [None]:
# Install dependencies
%%capture
%pip install --no-cache-dir -r requirements.txt

In [None]:
import os
from google.colab import userdata

import torch
import torch.nn as nn
import torch.nn.functional as Func
import torchvision.transforms.v2 as transforms

import wandb
import matplotlib.pyplot as plt

In [None]:
from src.config import (SEED, NUM_WORKERS, BATCH_SIZE, IMG_SIZE, N, DRIVE_ROOT,
                        RAW_DATA, CHECKPOINTS, DEVICE, VALID_BATCHES)
from src.utility import set_seeds, init_wandb, get_train_stats, compute_embedding_size
from src.datasets import compute_dataset_mean_std, get_cilp_dataloaders
from src.training import train_with_batch_loss, train_classifier_with_acc, load_model
from src.visualization import plot_similarity_matrix, plot_retrieval_examples, plot_losses
from src.models import CILPBackbone, ContrastivePretraining, Classifier, Projector, EmbedderMaxPool, RGB2LiDARClassifier

In [None]:
# Copy data from frive to /content for performance
!rm -rf /content/data
!cp -r "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/data/assessment" /content/data

In [None]:
# Usage: Call this function at the beginning and before each training phase
set_seeds(SEED)

All random seeds set to 51 for reproducibility


In [None]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()



True

# Loading and preparation of Data

We compute dataset statistics, define preprocessing transforms, and build train/validation/test DataLoaders for the assessment dataset. This ensures that all stages—CILP pretraining, projector training, and final classification—operate on consistently preprocessed and reproducible batches of data.

In [None]:
# gets calculated mean, std from file or calculates it from the rgb train data
# for different dataset (or change in train data) recalculate mean and standard deviation
mean, std = get_train_stats(dir=DRIVE_ROOT, img_size=IMG_SIZE, data_dir=RAW_DATA)

In [None]:
img_transforms = transforms.Compose([
    transforms.ToImage(),   # Scales data into [0,1]
    transforms.Resize(IMG_SIZE),
    transforms.ToDtype(torch.float32, scale=True),
    transforms.Normalize(mean, std)
])

In [None]:
set_seeds(SEED)

# Fix random seeds and create reproducible train/val/test splits and DataLoaders.
# The first batch is inspected to verify tensor shapes and data consistency.
train_data, train_dataloader, valid_data, val_dataloader, test_data, test_dataloader = get_cilp_dataloaders(
    str(RAW_DATA),
    VALID_BATCHES,
    test_frac=0.10,
    batch_size=BATCH_SIZE,
    img_transforms=img_transforms,
    num_workers=NUM_WORKERS,
    seed=SEED
)

for i, sample in enumerate(train_data):
    print(i, *(x.shape for x in sample))
    break

[CILP] Total samples: 12500
[CILP] Train: 10962, Val: 320, Test: 1218
0 torch.Size([4, 64, 64]) torch.Size([1, 64, 64]) torch.Size([1])


# Model Training

This section introduces the architectural choices for the CILP backbone and projector, as well as the rationale behind using a two-layer projection head. It describes how embeddings are produced and how the shared embedding space is leveraged across the three training stages. This conceptual grounding precedes the implementation of Stage 1.


## Stage 1: CILP contrastive pretraining

In this stage, we train a dual-encoder contrastive model to align RGB and LiDAR representations in a shared embedding space. The section defines the batch loss function, sets up optimizers, and runs the contrastive training loop using train_with_batch_loss. The best-performing model is saved and logged to W&B. After training, the CILP encoders are frozen for later stages.

In [None]:
set_seeds(SEED)

## constants for clip model and training
BEST_EMBEDDER = EmbedderMaxPool
FEATURE_DIM = 128
CILP_EMB_SIZE = 200
EPOCHS_CILP = 5
LR_CILP = 0.0001

# Initialize embedder for CILP model
img_embedder = CILPBackbone(
    in_ch=4,
    embedder_cls=BEST_EMBEDDER,
    feature_dim=FEATURE_DIM,
    emb_size=CILP_EMB_SIZE
).to(DEVICE)

lidar_embedder = CILPBackbone(
    in_ch=1,
    embedder_cls=BEST_EMBEDDER,
    feature_dim=FEATURE_DIM,
    emb_size=CILP_EMB_SIZE
).to(DEVICE)

In [None]:
# Initialize the CILP model
set_seeds(SEED)

CILP_model = ContrastivePretraining(img_embedder, lidar_embedder).to(DEVICE)

loss_img = nn.CrossEntropyLoss()
loss_lidar = nn.CrossEntropyLoss()

In [None]:
def cilp_batch_loss_fn(model, batch, device):
    """
    Compute symmetric CLIP contrastive loss for an RGB–LiDAR batch.

    Uses cross-entropy on image→LiDAR and LiDAR→image similarity matrices
    and returns their average.
    """
    rgb, lidar, _ = batch
    rgb = rgb.to(device)
    lidar = lidar.to(device)

    logits_per_img, logits_per_lidar = model(rgb, lidar)  # (B,B)
    B = logits_per_img.size(0)
    targets = torch.arange(B, device=device)

    loss_i = loss_img(logits_per_img, targets)
    loss_l = loss_lidar(logits_per_lidar, targets)
    
    total_loss = (loss_i + loss_l) / 2.0

    return total_loss, logits_per_img

In [None]:
model_save_path = CHECKPOINTS / "cilp_model.pt"

if not model_save_path.exists():
    ## Train CILP model
    # Path where best model is saved

    best_val = float("inf")

    # Number of trainable parameters
    num_params = sum(p.numel() for p in CILP_model.parameters() if p.requires_grad)

    opt = torch.optim.Adam(CILP_model.parameters(), LR_CILP)

    embedding_size = compute_embedding_size("cilp", CILP_EMB_SIZE, spatial=(4, 4))

    init_wandb(
        model=CILP_model,
        embedding_size=embedding_size,
        opt_name=opt.__class__.__name__,
        name="CILP model | 1st stage",
        num_params=num_params
    )

    results = train_with_batch_loss(
        model=CILP_model,
        optimizer=opt,
        train_dataloader=train_dataloader,
        val_dataloader=val_dataloader,
        batch_loss_fn=cilp_batch_loss_fn,
        epochs=EPOCHS_CILP,
        model_save_path=model_save_path,
        log_to_wandb=True,
        device=DEVICE
    )

    wandb.summary["CILP model: best_val_loss"] = results["best_val_loss"]

    best_cilp_val = results["best_val_loss"]
    print("\n" + "-"*50)
    print(f"BEST CILP VALIDATION LOSS → {best_cilp_val:.4f}")
    print("-"*50 + "\n")

In [None]:
best_CILP_model = ContrastivePretraining(img_embedder, lidar_embedder).to(DEVICE)

# Load pre-trained cilp model
cilp_path = CHECKPOINTS / "cilp_model.pt"
best_CILP_model = load_model(best_CILP_model, cilp_path, DEVICE)

for p in best_CILP_model.parameters():
    p.requires_grad = False

best_CILP_model.eval()

ContrastivePretraining(
  (img_embedder): CILPBackbone(
    (encoder): EmbedderMaxPool(
      (conv1): Conv2d(4, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (dense_emb): Sequential(
      (0): Linear(in_features=8192, out_features=512, bias=True)
      (1): ReLU()
      (2): Linear(in_features=512, out_features=200, bias=True)
    )
  )
  (lidar_embedder): CILPBackbone(
    (encoder): EmbedderMaxPool(
      (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mo

In [None]:
fig = plot_similarity_matrix(
    model=best_CILP_model,
    dataloader=val_dataloader,
    device=DEVICE,
    normalize="softmax",
    temperature=1.0,
)
wandb.log({"cilp/similarity_matrix": wandb.Image(fig)})
plt.show()

In [None]:
# Shows k random examples; choose between mode: "random" | "mismatches" | "correct" |
fig, table = plot_retrieval_examples(best_CILP_model, val_dataloader, DEVICE, k=5, mode="mismatches")
wandb.log({"cilp/sample_retrievals": table})
plt.show()

# End wandb run before starting the next model
wandb.finish()

## Stage 2: Cross-Modal Projection

Here we train a projector that maps CILP RGB embeddings into the LiDAR-CNN feature space. Using a frozen CILP model and a pretrained LiDAR classifier, we minimize an MSE loss between predicted LiDAR embeddings and true LiDAR embeddings. This stage enables the system to “pretend” RGB images look like LiDAR feature vectors, allowing downstream LiDAR-based classification from RGB alone.

**Projector Architecture:**
The projector is implemented as a three-layer multilayer perceptron (MLP) that maps RGB embeddings into the LiDAR embedding space. It consists of:
* A linear layer mapping 200 → 1000, followed by ReLU
* A linear layer mapping 1000 → 500, followed by ReLU
* A final linear layer mapping 500 → 3200
* The projector is trained while all encoders are kept frozen.

In [None]:
best_lidar_cnn = Classifier(in_ch=1).to(DEVICE)

# Load pre-trained lidar_cnn classifier
lidar_cnn_path = CHECKPOINTS / "lidar_cnn.pt"
best_lidar_cnn = load_model(best_lidar_cnn, lidar_cnn_path)

for p in best_lidar_cnn.parameters():
    p.requires_grad = False

best_lidar_cnn.eval()

In [None]:
def projector_batch_loss_fn(model, batch, device, CILP_model, lidar_cnn):
    """
    Compute MSE loss between projected RGB embeddings and frozen LiDAR embeddings.
    """
    rgb_img, lidar_depth, _ = batch
    rgb_img = rgb_img.to(device)
    lidar_depth = lidar_depth.to(device)

    # Use frozen encoders
    CILP_model.eval()
    lidar_cnn.eval()

    with torch.no_grad():
        img_embs = CILP_model.img_embedder(rgb_img)      # (B, CILP_EMB_SIZE)
        lidar_embs = lidar_cnn.get_embs(lidar_depth)     # (B, lidar_dim)

    pred_lidar_embs = model(img_embs)

    loss = Func.mse_loss(pred_lidar_embs, lidar_embs)

    # match the convention used in train_with_batch_loss
    return loss, {"loss": loss.item()}


In [None]:
set_seeds(SEED)

# Constants for projector
img_dim = CILP_EMB_SIZE
lidar_dim = 200 * 4 * 4
EPOCHS_PROJ = 40
LR_PROJECTOR = 1e-4

projector = Projector(img_dim, lidar_dim).to(DEVICE)

In [None]:
# Train Projector
model_save_path= CHECKPOINTS / "projector.pt"

if not model_save_path.exists():

    best_val_proj = float("inf")

    # Number of trainable parameters
    num_params = sum(p.numel() for p in projector.parameters() if p.requires_grad)

    opt = torch.optim.Adam(projector.parameters(), LR_PROJECTOR)

    embedding_size = compute_embedding_size("projector", CILP_EMB_SIZE, spatial=(4, 4))

    init_wandb(
        model=projector,
        embedding_size=embedding_size,
        opt_name=opt.__class__.__name__,
        name="Projector | 2nd stage",
        num_params=num_params
    )

    results = train_with_batch_loss(
        model=projector,
        optimizer=opt,
        train_dataloader=train_dataloader,
        val_dataloader=val_dataloader,
        batch_loss_fn=projector_batch_loss_fn,
        epochs=EPOCHS_PROJ,
        model_save_path=model_save_path,
        log_to_wandb=True,
        device=DEVICE,
        extra_args={
            "CILP_model": CILP_model,
            "lidar_cnn": best_lidar_cnn,
        }
    )

    wandb.summary["Projector: best_val_loss"] = results["best_val_loss"]

    best_proj_val = results["best_val_loss"]
    print("\n" + "-"*50)
    print(f"BEST PROJECTOR VALIDATION MSE → {best_proj_val:.4f}")
    print("-"*50 + "\n")

    # End wandb run before starting the next model
    wandb.finish()

[BatchLoss] Epoch 1/40
Found and saved better weights for the model
epoch 0 train loss: 1.6388600946169847
epoch 0 valid loss: 1.2890848696231842
[BatchLoss] Epoch 2/40
Found and saved better weights for the model
epoch 1 train loss: 1.6224197928319897
epoch 1 valid loss: 1.2846409142017365
[BatchLoss] Epoch 3/40
Found and saved better weights for the model
epoch 2 train loss: 1.6049904155800914
epoch 2 valid loss: 1.2684852600097656
[BatchLoss] Epoch 4/40
Found and saved better weights for the model
epoch 3 train loss: 1.5911426063169514
epoch 3 valid loss: 1.2645985782146454
[BatchLoss] Epoch 5/40
Found and saved better weights for the model
epoch 4 train loss: 1.5782182199216028
epoch 4 valid loss: 1.2419158279895783
[BatchLoss] Epoch 6/40
Found and saved better weights for the model
epoch 5 train loss: 1.563390554740415
epoch 5 valid loss: 1.2293817102909088
[BatchLoss] Epoch 7/40
Found and saved better weights for the model
epoch 6 train loss: 1.5521342682908152
epoch 6 valid loss

0,1
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
epoch_time,▄▂▁▇▂▁▂▆▁▂▅▄▂▁▇▂▂▄▅▁▁▆▂▂▂█▄▄▇▅▁▄▅▂▁▆▂▁▁▇
lr,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
max_gpu_mem_mb_epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_loss,██▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁
valid_loss,███▇▇▇▇▆▆▆▆▆▆▅▅▅▅▄▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁

0,1
epoch,40
epoch_time,4.89051
lr,0.0001
max_gpu_mem_mb_epoch,327.29541
model,Projector
projector_best_val_mse,0.96225
train_loss,1.19839
valid_loss,0.96468


In [None]:
if results:
    losses = {
        "projector": {
            "train_losses": results["train_loss"],
            "valid_losses": results["val_loss"],
        }
    }

    plot_losses(losses, title="Projector: train vs val loss")

In [None]:
best_projector = Projector(img_dim, lidar_dim).to(DEVICE)

# Load pre-trained cilp model
projector_path = CHECKPOINTS / "projector.pt"
best_projector = load_model(best_projector, projector_path, DEVICE)

## Stage 3: RGB2LiDARClassifier

In the final stage, we build and train a model that chains together the CILP backbone, the trained projector, and the LiDAR classifier. Only the projector is trainable; all other components remain frozen. The model learns to classify cubes and spheres directly from RGB input by leveraging the LiDAR feature space. The section logs training and validation metrics and reports the final accuracy.

In [None]:
set_seeds(SEED)

model_save_path = CHECKPOINTS / "rgb2lidar.pt"

# Constants for classifier 
LR_RGB2LIDAR = 1e-3
EPOCHS_RGB2LIDAR = 20

# Initialize Classifier
rgb2lidar_clf = RGB2LiDARClassifier(
    CILP=best_CILP_model,
    projector=best_projector,
    lidar_cnn=best_lidar_cnn,
).to(DEVICE)

# 1) Freeze img_embedder & lidar_cnn
for p in rgb2lidar_clf.img_embedder.parameters():
    p.requires_grad = False

for p in rgb2lidar_clf.shape_classifier.parameters():
    p.requires_grad = False

# 2) Ensure projector is trainable
for p in rgb2lidar_clf.projector.parameters():
    p.requires_grad = True

# Number of trainable parameters
num_params = sum(p.numel() for p in rgb2lidar_clf.projector.parameters() if p.requires_grad)

# Train Classifier
opt = torch.optim.Adam(rgb2lidar_clf.projector.parameters(), lr=LR_RGB2LIDAR)

embedding_size = compute_embedding_size("classifier", CILP_EMB_SIZE, spatial=(4, 4))


init_wandb(
    model=rgb2lidar_clf,
    embedding_size=embedding_size,
    opt_name=opt.__class__.__name__,
    name="RGB2LIDAR Classifier",
    num_params=num_params
)

results = train_classifier_with_acc(
    model=rgb2lidar_clf,
    optimizer=opt,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    epochs=EPOCHS_RGB2LIDAR,
    model_save_path=model_save_path,
    log_to_wandb=True,
    device=DEVICE
)

wandb.summary["RGB2LIDAR: best_val_loss"] = results["best_val_loss"]
wandb.summary["RGB2LIDAR: best_val_acc"]  = results["best_val_acc"]

print("\n" + "="*60)
print("BEST VALIDATION RESULTS")
print(f"Loss     : {results['best_val_loss']:.4f}")
print(f"Accuracy : {results['best_val_acc']*100:.2f}%")
print("="*60 + "\n")

wandb.finish()

ANY trainable in projector? True

Epoch 1/5
DEBUG logits.requires_grad: False
DEBUG loss.requires_grad: False


RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn