Remove before Submitting.

In [1]:
# Enable autoreload in Jupyter
%load_ext autoreload
%autoreload 2

# Imports and Seed Management

In [2]:
import os

# Set environment variables for reproducibility BEFORE importing torch
os.environ['PYTHONHASHSEED'] = '51'
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

import sys
from pathlib import Path

# Add project root to sys.path for module imports
PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

import wandb
import numpy as np
import torch
import torch.nn as nn
import fiftyone as fo
from torch.optim import Adam
from tabulate import tabulate

from src.datasets import CustomTorchImageDataset
from src.models import (
    Net,
    LateFusionModel,
    ConcatIntermediateNet,
    AdditionIntermediateNet,
    HadamardIntermediateNet,
)
from src.training import train_model
from src.utils import (
    set_seeds,
    create_deterministic_training_dataloader,
    get_rgb_input,
    get_lidar_input,
    get_mm_intermediate_inputs,
    get_mm_late_inputs,
)

set_seeds(51)

PROJECT_NAME = "cilp-extended-assessment"

All random seeds set to 51 for reproducibility
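
The `set_seeds` helper is imported from `src.utils`, so its body is not visible in this notebook. As a rough sketch of what such a helper commonly does, assuming it seeds Python, NumPy and PyTorch (the name `set_seeds_sketch` is hypothetical and not part of the project):

```python
import random

import numpy as np
import torch


def set_seeds_sketch(seed: int) -> None:
    """Hypothetical stand-in for src.utils.set_seeds (the real body is not shown)."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch RNG on all devices
    torch.cuda.manual_seed_all(seed)  # explicit for multi-GPU; ignored without CUDA
    # Ask cuDNN for deterministic kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding before drawing random numbers reproduces the same values.
set_seeds_sketch(51)
first = torch.rand(3)
set_seeds_sketch(51)
second = torch.rand(3)
print(torch.equal(first, second))
```

Calling the helper again right before each training run, as the cells below do with `set_seeds(51)`, resets the random state so that every architecture starts from the same point.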


# Dataset Loading

In [3]:
IMG_SIZE = 64

dataset_name = "cilp_assessment"

# Load the FiftyOne dataset from disk
dataset = fo.Dataset.from_dir(
    dataset_dir=Path.cwd().parent / dataset_name,
    dataset_type=fo.types.FiftyOneDataset,
)

print(f"Total samples in dataset: {len(dataset)}")

Importing samples...
 100% |███████████████| 3228/3228 [88.9ms elapsed, 0s remaining, 36.3K samples/s]   
Total samples in dataset: 1076


Extract the train and validation splits of the dataset.

In [4]:
train_dataset = dataset.match_tags("train")
val_dataset = dataset.match_tags("validation")

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Training samples: 897
Validation samples: 179


Wrap both splits in custom torch datasets so they can be used with a DataLoader.

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Device: ", device)

torch_train_dataset = CustomTorchImageDataset(
    fiftyone_dataset=train_dataset,
    img_size=IMG_SIZE,
    device=device,
)

torch_val_dataset = CustomTorchImageDataset(
    fiftyone_dataset=val_dataset,
    img_size=IMG_SIZE,
    device=device,
)

Device:  cpu
CustomTorchImageDataset initialized with 897 samples.
CustomTorchImageDataset initialized with 179 samples.


Create the DataLoaders. The training loader uses a deterministic setup so that the results are reproducible.

In [6]:
train_dataloader = create_deterministic_training_dataloader(
    torch_train_dataset,
    batch_size=32,
    shuffle=True,
)

val_dataloader = torch.utils.data.DataLoader(
    torch_val_dataset,
    batch_size=32,
    shuffle=False,
)
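
The body of `create_deterministic_training_dataloader` lives in `src.utils` and is not shown here. One common way to make shuffling reproducible, sketched under the assumption that the helper does something similar, is to give the DataLoader its own seeded `torch.Generator` (the function `make_seeded_loader` and the toy dataset are hypothetical):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_seeded_loader(dataset, batch_size: int, seed: int) -> DataLoader:
    """Hypothetical sketch: shuffle with a dedicated, seeded generator."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=generator)


# Toy dataset holding the integers 0..9.
toy = TensorDataset(torch.arange(10))

# Two loaders built with the same seed yield the same batch order.
order_a = [batch[0].tolist() for batch in make_seeded_loader(toy, 4, seed=51)]
order_b = [batch[0].tolist() for batch in make_seeded_loader(toy, 4, seed=51)]
print(order_a == order_b)
```

The validation loader does not need this treatment because it is created with `shuffle=False`.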

For the loss function, we use the same one as in the assessment: **BCEWithLogitsLoss**. It applies the sigmoid internally, so it pairs with a single output neuron that emits a raw logit for binary classification. In later tasks, we set *num_classes=1* to ensure the model has only one output neuron, which aligns with this loss function.

In [7]:
loss_func = nn.BCEWithLogitsLoss()
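
To illustrate what this loss computes for one sample, here is a plain-Python sketch of binary cross-entropy on a raw logit, written in the usual numerically stable form and checked against the explicit sigmoid version. The function names are made up for this example. `nn.BCEWithLogitsLoss` additionally averages over the batch by default.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def bce_naive(logit: float, target: float) -> float:
    """Same loss via an explicit sigmoid (less stable for large |logit|)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Both forms agree on moderate logits.
samples = [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]
for logit, target in samples:
    print(round(bce_with_logits(logit, target), 6), round(bce_naive(logit, target), 6))
```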

We initialize our table for the comparison of the different architectures at the end.

In [8]:
table = [
    ["Metrics", "Validation Loss", "Parameters (M)", "Training Time / epoch (s)", "GPU Memory (MB)"]
]
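
Each architecture later appends one row to this list, and the list is transposed before printing so that the metrics run down the side. A small sketch of that transposition with made-up numbers (the model names and values are placeholders, not results):

```python
# Toy table in the same layout: a header row, then one row per architecture.
toy_table = [
    ["Metrics", "Validation Loss", "Parameters (M)"],
    ["Model A", 0.25, 1.5],  # made-up numbers for illustration only
    ["Model B", 0.20, 3.0],
]

# zip(*rows) transposes: each metric becomes a row, each model a column.
transposed = [list(column) for column in zip(*toy_table)]
for row in transposed:
    print(row)
```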

# Intermediate Fusion

For training, we need to define the input format of a batch. All intermediate architectures share the same format, which is defined in utils.py (*get_mm_intermediate_inputs*).

## Concatenation

Our first intermediate architecture uses **Concatenation**. Here, we use separate convolutions for both modalities. Afterwards, we concatenate the results and pass them through a feedforward network before the final output.
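
The actual `ConcatIntermediateNet` is defined in `src.models` and is not reproduced here. The following is only a minimal sketch of the idea, with assumed layer sizes (one convolution per branch, a 32-unit hidden layer) and a hypothetical class name:

```python
import torch
import torch.nn as nn


def conv_branch(in_ch: int) -> nn.Sequential:
    """One small convolutional feature extractor (sizes are assumptions)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
    )


class ConcatFusionSketch(nn.Module):
    """Illustrative only; not the ConcatIntermediateNet from src.models."""

    def __init__(self, rgb_ch: int = 4, xyz_ch: int = 4, img_size: int = 64):
        super().__init__()
        self.rgb_branch = conv_branch(rgb_ch)
        self.xyz_branch = conv_branch(xyz_ch)
        feat = 8 * (img_size // 2) ** 2  # flattened size of one branch
        # Concatenation doubles the input width of the first linear layer.
        self.head = nn.Sequential(nn.Linear(2 * feat, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, rgb: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_branch(rgb), self.xyz_branch(xyz)], dim=1)
        return self.head(fused)  # one raw logit per sample


model = ConcatFusionSketch()
logits = model(torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64))
print(logits.shape)
```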

In [9]:
epochs = 2

gpu_mem_before = torch.cuda.memory_allocated()
mm_concat_intermediate_net = ConcatIntermediateNet(rgb_ch=4, xyz_ch=4).to(device)
# We collect this after model creation to measure model memory usage
gpu_mem_after = torch.cuda.memory_allocated()

mm_concat_intermediate_opt = Adam(mm_concat_intermediate_net.parameters(), lr=0.0001)
mm_concat_intermediate_save_path = Path.cwd().parent / "checkpoints" / "02_mm_concat_intermediate_net.pth"
mm_concat_intermediate_run = wandb.init(project=PROJECT_NAME, name=f"{ConcatIntermediateNet.__name__}")

print("Training mm_concat_intermediate_net")
set_seeds(51)
mm_concat_intermediate_train_loss, mm_concat_intermediate_valid_loss, mm_concat_intermediate_train_time = train_model(
    mm_concat_intermediate_net,
    mm_concat_intermediate_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_concat_intermediate_save_path,
    run=mm_concat_intermediate_run,
)
mm_concat_intermediate_num_params = mm_concat_intermediate_net.get_number_of_parameters() / 1e6  # in millions

table.append([
    "Intermediate Fusion (Concatenation)",
    np.min(mm_concat_intermediate_valid_loss),
    mm_concat_intermediate_num_params,
    mm_concat_intermediate_train_time / epochs,
    (gpu_mem_after - gpu_mem_before) / (1024 ** 2),
])

mm_concat_intermediate_run.finish()

del mm_concat_intermediate_net, mm_concat_intermediate_opt

[34m[1mwandb[0m: Currently logged in as: [33mkarl-schuetz[0m ([33mkarl-schuetz-hasso-plattner-institut[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Training mm_concat_intermediate_net
All random seeds set to 51 for reproducibility


Run history:
epoch          ▁█
learning_rate  ▁▁
train_loss     █▁
valid_loss     ▁█

Run summary:
epoch          2.0
learning_rate  0.0001
train_loss     0.23266
valid_loss     0.21223


## Addition

Now we use **Addition** as our intermediate architecture. Here, we apply separate convolutions to both modalities. Afterwards, we perform element-wise addition of the partial results and pass the sum through a feedforward network before the final output.

In [10]:
epochs = 2

gpu_mem_before = torch.cuda.memory_allocated()
mm_addition_intermediate_net = AdditionIntermediateNet(rgb_ch=4, xyz_ch=4).to(device)
gpu_mem_after = torch.cuda.memory_allocated()

mm_addition_intermediate_opt = Adam(mm_addition_intermediate_net.parameters(), lr=0.0001)
mm_addition_intermediate_save_path = Path.cwd().parent / "checkpoints" / "02_mm_addition_intermediate_net.pth"
mm_addition_intermediate_run = wandb.init(project=PROJECT_NAME, name=f"{AdditionIntermediateNet.__name__}")

print("Training mm_addition_intermediate_net")
set_seeds(51)
mm_addition_intermediate_train_loss, mm_addition_intermediate_valid_loss, mm_addition_intermediate_train_time = train_model(
    mm_addition_intermediate_net,
    mm_addition_intermediate_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_addition_intermediate_save_path,
    run=mm_addition_intermediate_run,
)
mm_addition_intermediate_num_params = mm_addition_intermediate_net.get_number_of_parameters() / 1e6  # in millions

table.append([
    "Intermediate Fusion (Addition)",
    np.min(mm_addition_intermediate_valid_loss),
    mm_addition_intermediate_num_params,
    mm_addition_intermediate_train_time / epochs,
    (gpu_mem_after - gpu_mem_before) / (1024 ** 2),
])

mm_addition_intermediate_run.finish()

del mm_addition_intermediate_net, mm_addition_intermediate_opt

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin


Training mm_addition_intermediate_net
All random seeds set to 51 for reproducibility


Run history:
epoch          ▁█
learning_rate  ▁▁
train_loss     █▁
valid_loss     █▁

Run summary:
epoch          2.0
learning_rate  0.0001
train_loss     0.23474
valid_loss     0.20454


## Hadamard Product

As the final intermediate strategy, we use the **Hadamard Product**. Here, we apply separate convolutions to both modalities. Afterwards, we perform element-wise multiplication of the partial results and pass the product through a feedforward network before the final output.
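
To make the three merge operations concrete, here is a toy example on hand-written feature vectors. The values are made up and stand in for the flattened outputs of the two convolutional branches:

```python
import torch

# Stand-ins for the flattened features of one sample from each branch.
rgb_feat = torch.tensor([[1.0, 2.0, 3.0]])
xyz_feat = torch.tensor([[4.0, 0.5, 2.0]])

added = rgb_feat + xyz_feat                       # element-wise addition
hadamard = rgb_feat * xyz_feat                    # element-wise (Hadamard) product
concat = torch.cat([rgb_feat, xyz_feat], dim=1)   # concatenation

# Addition and Hadamard keep the feature width; concatenation doubles it.
print(added.shape, hadamard.shape, concat.shape)
```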

In [11]:
epochs = 2

gpu_mem_before = torch.cuda.memory_allocated()
mm_hadamard_intermediate_net = HadamardIntermediateNet(rgb_ch=4, xyz_ch=4).to(device)
gpu_mem_after = torch.cuda.memory_allocated()

mm_hadamard_intermediate_opt = Adam(mm_hadamard_intermediate_net.parameters(), lr=0.0001)
mm_hadamard_intermediate_save_path = Path.cwd().parent / "checkpoints" / "02_mm_hadamard_intermediate_net.pth"
mm_hadamard_intermediate_run = wandb.init(project=PROJECT_NAME, name=f"{HadamardIntermediateNet.__name__}")

print("Training mm_hadamard_intermediate_net")
set_seeds(51)
mm_hadamard_intermediate_train_loss, mm_hadamard_intermediate_valid_loss, mm_hadamard_intermediate_train_time = train_model(
    mm_hadamard_intermediate_net,
    mm_hadamard_intermediate_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_hadamard_intermediate_save_path,
    run=mm_hadamard_intermediate_run,
)
mm_hadamard_intermediate_num_params = mm_hadamard_intermediate_net.get_number_of_parameters() / 1e6  # in millions

table.append([
    "Intermediate Fusion (Hadamard)",
    np.min(mm_hadamard_intermediate_valid_loss),
    mm_hadamard_intermediate_num_params,
    mm_hadamard_intermediate_train_time / epochs,
    (gpu_mem_after - gpu_mem_before) / (1024 ** 2),
])

mm_hadamard_intermediate_run.finish()

del mm_hadamard_intermediate_net, mm_hadamard_intermediate_opt

Training mm_hadamard_intermediate_net
All random seeds set to 51 for reproducibility


Run history:
epoch          ▁█
learning_rate  ▁▁
train_loss     █▁
valid_loss     █▁

Run summary:
epoch          2.0
learning_rate  0.0001
train_loss     0.23152
valid_loss     0.20386


## Advantages and Limitations

Intermediate fusion models process each modality separately before combining them. This allows the network to extract important features from each modality through convolutions before making the final prediction. We explored several fusion strategies:

Concatenation roughly doubles the input size of the first feedforward layer, which makes it the largest of the three intermediate models (13.0M parameters in the comparison table). Consequently, the model requires more computation and time. In return, the features of both modalities stay separate, so the model can learn to down-weight a less relevant modality and focus on the most important information.

Addition and Hadamard product fusion merge the two feature maps element-wise. The fused tensor keeps the size of a single branch, which roughly halves the parameter count (6.6M in the comparison table). The downside is that element-wise merging can obscure differences between modalities. For example, two samples with very different Lidar and RGB features could produce the same element-wise sum or product, potentially making it harder for the model to distinguish between them.
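
Both points can be checked with a little arithmetic. The sketch below uses assumed sizes (`FEAT` and `HIDDEN` are hypothetical, not the sizes used in `src.models`) to count the parameters of the first feedforward layer, and then shows two toy samples whose features differ but whose sum and product coincide:

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


FEAT = 1024   # assumed flattened size of ONE branch (hypothetical)
HIDDEN = 100  # assumed width of the first hidden layer (hypothetical)

concat_params = linear_params(2 * FEAT, HIDDEN)   # input is twice as wide
elementwise_params = linear_params(FEAT, HIDDEN)  # addition or Hadamard
print(concat_params, elementwise_params)


def fuse_add(u, v):
    return [a + b for a, b in zip(u, v)]


def fuse_mul(u, v):
    return [a * b for a, b in zip(u, v)]


def fuse_cat(u, v):
    return u + v  # list concatenation


# Two toy samples whose RGB and Lidar features are swapped.
sample_1 = ([1.0, 4.0], [3.0, 1.0])
sample_2 = ([3.0, 1.0], [1.0, 4.0])

print(fuse_add(*sample_1) == fuse_add(*sample_2))  # sums collide
print(fuse_mul(*sample_1) == fuse_mul(*sample_2))  # products collide
print(fuse_cat(*sample_1) == fuse_cat(*sample_2))  # concatenation keeps them apart
```

Because addition and multiplication are commutative, swapping the two modalities never changes the fused vector, whereas concatenation preserves which branch each value came from.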


# Late Fusion

Now we use Late Fusion. Here, we first process each modality independently through separate convolutional and feedforward layers. Afterwards, we combine the outputs of both modalities using concatenation, and then produce the final prediction.

## RGB Classifier

First, we need to train the RGB classifier.

In [None]:
epochs_rgb = 2

gpu_mem_before = torch.cuda.memory_allocated()
rgb_net = Net(in_ch=4).to(device)
gpu_mem_after = torch.cuda.memory_allocated()
gpu_mem_rgb = gpu_mem_after - gpu_mem_before

rgb_opt = Adam(rgb_net.parameters(), lr=0.0001)
rgb_save_path = Path.cwd().parent / "checkpoints" / "02_rgb_net.pth"
rgb_run = wandb.init(project=PROJECT_NAME, name="RGB_Net_Training")

print("Training rgb_net")
set_seeds(51)
rgb_train_loss, rgb_valid_loss, rgb_train_time = train_model(
    rgb_net,
    rgb_opt,
    loss_func,
    get_rgb_input,
    epochs_rgb,
    train_dataloader,
    val_dataloader,
    save_path=rgb_save_path,
    run=rgb_run,
)

rgb_run.finish()

Training rgb_net
All random seeds set to 51 for reproducibility


## Lidar Classifier

We also need a Lidar classifier before using the late fusion model.

In [None]:

epochs_xyz = 2

gpu_mem_before = torch.cuda.memory_allocated()
xyz_net = Net(in_ch=4).to(device)
gpu_mem_after = torch.cuda.memory_allocated()
gpu_mem_xyz = gpu_mem_after - gpu_mem_before

xyz_opt = Adam(xyz_net.parameters(), lr=0.0001)
xyz_save_path = Path.cwd().parent / "checkpoints" / "02_xyz_net.pth"
xyz_run = wandb.init(project=PROJECT_NAME, name="Lidar_Net_Training")

print("Training lidar_net")
set_seeds(51)
xyz_train_loss, xyz_valid_loss, xyz_train_time = train_model(
    xyz_net,
    xyz_opt,
    loss_func,
    get_lidar_input,
    epochs_xyz,
    train_dataloader,
    val_dataloader,
    save_path=xyz_save_path,
    run=xyz_run,
)

xyz_run.finish()

Run history:
epoch          ▁█
learning_rate  ▁▁
train_loss     █▁
valid_loss     █▁

Run summary:
epoch          2.0
learning_rate  0.0001
train_loss     0.23007
valid_loss     0.20601


Training lidar_net
All random seeds set to 51 for reproducibility


## Late Fusion

Load the best-performing checkpoints from disk.

In [14]:
rgb_net.load_state_dict(torch.load(rgb_save_path))
xyz_net.load_state_dict(torch.load(xyz_save_path))


<All keys matched successfully>

Disable gradient updates for the *rgb_net* and *xyz_net* to train only the late fusion model.

In [15]:
networks = [rgb_net, xyz_net]

for network in networks:
    for param in network.parameters():
        param.requires_grad = False
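
Freezing the unimodal networks means the optimizer only has the fusion layers left to update. A minimal sketch of that setup follows, assuming each unimodal model emits a single logit and that the fusion head is a small feedforward network. The classes and sizes are hypothetical and are not the `Net` or `LateFusionModel` from `src.models`:

```python
import torch
import torch.nn as nn


def tiny_classifier(in_ch: int) -> nn.Sequential:
    """Stand-in for a trained unimodal network (hypothetical architecture)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 4, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(4, 1),
    )


class LateFusionSketch(nn.Module):
    """Illustrative only; concatenates the two logits and maps them to one."""

    def __init__(self, rgb_model: nn.Module, xyz_model: nn.Module):
        super().__init__()
        self.rgb_model = rgb_model
        self.xyz_model = xyz_model
        self.fusion = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, rgb: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        outputs = torch.cat([self.rgb_model(rgb), self.xyz_model(xyz)], dim=1)
        return self.fusion(outputs)


rgb_model, xyz_model = tiny_classifier(4), tiny_classifier(4)
for param in list(rgb_model.parameters()) + list(xyz_model.parameters()):
    param.requires_grad = False  # freeze the unimodal branches

model = LateFusionSketch(rgb_model, xyz_model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
logits = model(torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64))
print(trainable, total, logits.shape)
```

Only the fusion head contributes to `trainable`, while `total` still includes the frozen branches. A parameter count taken over the whole fused model therefore reflects both unimodal networks as well.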

We train the late fusion component without updating the weights of the fused models.

In [16]:
epochs = 2

gpu_mem_before = torch.cuda.memory_allocated()
mm_late_net = LateFusionModel(rgb_net, xyz_net).to(device)
gpu_mem_after = torch.cuda.memory_allocated()

mm_late_opt = Adam(mm_late_net.parameters(), lr=0.0001)
mm_late_save_path = Path.cwd().parent / "checkpoints" / "02_mm_late_net.pth"
mm_late_run = wandb.init(project=PROJECT_NAME, name=f"{LateFusionModel.__name__}")

print("Training mm_late_net")
set_seeds(51)
mm_late_train_loss, mm_late_valid_loss, mm_late_train_time = train_model(
    mm_late_net,
    mm_late_opt,
    loss_func,
    get_mm_late_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_late_save_path,
    run=mm_late_run,
)
mm_late_num_params = mm_late_net.get_number_of_parameters() / 1e6  # in millions

# Include the training times and epochs of the individual networks
mm_late_train_time += rgb_train_time + xyz_train_time
epochs += epochs_rgb + epochs_xyz

table.append([
    "Late Fusion",
    np.min(mm_late_valid_loss),
    mm_late_num_params,
    mm_late_train_time / epochs,
    (gpu_mem_after - gpu_mem_before + gpu_mem_rgb + gpu_mem_xyz) / (1024 ** 2),
])

mm_late_run.finish()

del mm_late_net, mm_late_opt, rgb_net, xyz_net, rgb_opt, xyz_opt

Run history:
epoch          ▁█
learning_rate  ▁▁
train_loss     █▁
valid_loss     ▁█

Run summary:
epoch          2.0
learning_rate  0.0001
train_loss     0.24168
valid_loss     0.24108


Training mm_late_net
All random seeds set to 51 for reproducibility


Run history:
epoch          ▁█
learning_rate  ▁▁
train_loss     █▁
valid_loss     █▁

Run summary:
epoch          2.0
learning_rate  0.0001
train_loss     0.44547
valid_loss     0.43635


## Advantages and Limitations

Late fusion lets us combine already trained classifiers, as demonstrated in this notebook, without retraining the entire network. This is a practical advantage when unimodal models already exist. The drawback is that the fusion layers only see the classifiers' outputs, which may no longer carry essential information about the original modalities. Once each classifier has committed to a prediction, it can be difficult for the fusion model to produce an accurate final output. In contrast, an intermediate fusion model can access features from both modalities directly and exploit cross-modal interactions for prediction.

Because late fusion requires two fully trained models as well as additional linear layers for fusion, it has the most parameters of the architectures compared here (26.3M in the comparison table).


# Comparison

Print the comparison table, keeping all intermediate architectures.

In [17]:
rows = list(zip(*table)) # transpose for tabulate
print(tabulate(rows[1:], headers=rows[0], tablefmt="grid"))


+---------------------------+---------------------------------------+----------------------------------+----------------------------------+---------------+
| Metrics                   |   Intermediate Fusion (Concatenation) |   Intermediate Fusion (Addition) |   Intermediate Fusion (Hadamard) |   Late Fusion |
+===========================+=======================================+==================================+==================================+===============+
| Validation Loss           |                              0.207008 |                         0.204541 |                         0.203858 |      0.436355 |
+---------------------------+---------------------------------------+----------------------------------+----------------------------------+---------------+
| Parameters (M)            |                             13.0159   |                         6.61585  |                         6.61585  |     26.2567   |
+---------------------------+---------------------------------------+----------------------------------+----------------------------------+---------------+
| Training Time / epoch (s) |                              3.422

Include written analysis (200-400 words) addressing:

- Which architecture performed best and why
- Trade-offs between parameter count and performance
- Recommendations for when to use each approach
