# Colab Setup

Code for mounting drive and setting the project path.

In [1]:
!nvidia-smi

Mon Dec 29 12:09:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   37C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import sys
import os
from pathlib import Path
from google.colab import drive

# 1. Mount Google Drive
drive.mount('/content/drive')

# 2. Define your project path
# (Adjust this path if you put the folder inside a subfolder in Drive)

# Google drive path
PROJECT_PATH = Path("/content/drive/MyDrive/CILP-Assessment-Multimodal-Learning")

# Local Path
# PROJECT_PATH = Path.cwd().parent

# 3. Add the project directory to Python's path so we can import from 'src'
if str(PROJECT_PATH) not in sys.path:  # sys.path holds strings, so compare as str
    sys.path.insert(0, str(PROJECT_PATH))

# 4. Verify setup by listing files
print("Project files found:", os.listdir(PROJECT_PATH))

# 5. Install dependencies inside the remote Colab machine
# We use the requirements file located in your Drive
!pip install -r "{PROJECT_PATH}/requirements_colab.txt"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Project files found: ['notebooks', 'src', 'data', 'README', 'results', '.git', '.gitignore', 'requirements.txt', '.ipynb_checkpoints', 'cilp_assessment', 'checkpoints', 'requirements_colab.txt']


We copy the dataset to the content folder. This enables faster access for training.

In [3]:
!rsync -ah "/content/drive/MyDrive/CILP-Assessment-Multimodal-Learning/cilp_assessment" "/content/"
DATASET_PATH = Path("/content")

# Imports and Seed Management

In [4]:
import os

# Set environment variables for reproducibility BEFORE importing torch
os.environ['PYTHONHASHSEED'] = '51'
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

from pathlib import Path

import wandb
import numpy as np
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR
import fiftyone as fo
from torch.optim import Adam
from tabulate import tabulate
from sklearn.metrics import f1_score

from src.datasets import CustomTorchImageDataset
from src.models import (
    LateFusionModel,
    ConcatIntermediateNet,
    AdditionIntermediateNet,
    HadamardIntermediateNet,
)
from src.training import train_model
from src.utils import (
    set_seeds,
    create_deterministic_training_dataloader,
    get_rgb_input,
    get_lidar_input,
    get_mm_intermediate_inputs,
    get_mm_late_inputs,
    infer_model,
)

set_seeds(51)

PROJECT_NAME = "cilp-extended-assessment"

All random seeds set to 51 for reproducibility
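
The seeding helper lives in `src.utils` and is not shown in this notebook. As a rough sketch of what such a helper commonly does (the name `set_seeds_sketch` and its body are assumptions, not the project's code), the following seeds Python, NumPy, and PyTorch and requests deterministic kernels. `CUBLAS_WORKSPACE_CONFIG` is set before importing torch because PyTorch requires it for deterministic cuBLAS operations on CUDA 10.2 and later.

```python
import random

import numpy as np
import torch


def set_seeds_sketch(seed: int) -> None:
    """Hypothetical stand-in for src.utils.set_seeds; the real helper may differ."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy's global RNG
    torch.manual_seed(seed)  # PyTorch RNG on CPU and on all CUDA devices
    # Raise an error if an operation has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning, which may pick different kernels between runs.
    torch.backends.cudnn.benchmark = False


set_seeds_sketch(51)
a = torch.rand(3)
set_seeds_sketch(51)
b = torch.rand(3)
print(torch.equal(a, b))  # True: the same seed reproduces the same draws
```

Re-seeding right before each training run, as done in the cells below, means every architecture starts from the same RNG state regardless of what ran earlier in the session.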


# Dataset Loading

In [5]:
IMG_SIZE = 64

dataset_name = "cilp_assessment"

# Load the FiftyOne dataset from disk
dataset = fo.Dataset.from_dir(
    dataset_dir=DATASET_PATH / dataset_name,
    dataset_type=fo.types.FiftyOneDataset,
).sort_by("filepath")

print(f"Total samples in dataset: {len(dataset)}")

Importing samples...


 100% |███████████████| 3222/3222 [177.9ms elapsed, 0s remaining, 18.4K samples/s] 


Total samples in dataset: 1074


Extract the train and validation splits of the dataset.

In [6]:
train_dataset = dataset.match_tags("train")
val_dataset = dataset.match_tags("validation")

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Training samples: 895
Validation samples: 179


Wrap both splits in custom torch datasets so that they can be used with a DataLoader.

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Device: ", device)

torch_train_dataset = CustomTorchImageDataset(
    fiftyone_dataset=train_dataset,
    img_size=IMG_SIZE,
    device=device,
)

torch_val_dataset = CustomTorchImageDataset(
    fiftyone_dataset=val_dataset,
    img_size=IMG_SIZE,
    device=device,
)

Device:  cuda
CustomTorchImageDataset initialized with 895 samples.
CustomTorchImageDataset initialized with 179 samples.


Create the DataLoaders. The training loader uses a deterministic setup to make the results reproducible.

In [8]:
train_dataloader = create_deterministic_training_dataloader(
    torch_train_dataset,
    batch_size=32,
    shuffle=True,
)

val_dataloader = torch.utils.data.DataLoader(
    torch_val_dataset,
    batch_size=32,
    shuffle=False,
)
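
`create_deterministic_training_dataloader` is defined in `src.utils` and not shown here. A minimal sketch of how a reproducible shuffling loader can be built, following the pattern from the PyTorch reproducibility notes, looks like this. The names `make_loader` and `seed_worker` are illustrative assumptions, not the project's implementation.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    # Derive each worker's NumPy/random seed from the seed PyTorch assigned to it.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(dataset, batch_size: int, seed: int) -> DataLoader:
    # A dedicated generator fixes the shuffle order independently of the global RNG.
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        worker_init_fn=seed_worker,
        generator=generator,
    )


toy = TensorDataset(torch.arange(10))
order_a = [batch[0].tolist() for batch in make_loader(toy, batch_size=4, seed=51)]
order_b = [batch[0].tolist() for batch in make_loader(toy, batch_size=4, seed=51)]
print(order_a == order_b)  # True: same seed, same batch order
```

The validation loader needs none of this because it does not shuffle.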

For the loss function, we use the same one as in the assessment: **BCEWithLogitsLoss**. This loss works well with a single output neuron for binary classification. In later tasks, we set *num_classes=1* to ensure the model has only one output neuron, which aligns with this loss function.

In [9]:
loss_func = nn.BCEWithLogitsLoss()
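
To illustrate why a single output neuron is sufficient: `BCEWithLogitsLoss` applies the sigmoid internally, so the model can emit one raw logit per sample. The small check below uses made-up logits and labels and compares the result with the two-step version of sigmoid followed by `BCELoss`.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])  # one raw output per sample, shape (N, 1)
targets = torch.tensor([[1.0], [0.0], [1.0]])  # float labels with the same shape

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_with_logits, loss_two_step, atol=1e-6))  # True
```

The fused version is generally preferred because it is numerically more stable for large-magnitude logits.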

We initialize our table for the comparison of the different architectures at the end.

In [10]:
table = [
    ["Metrics", "F1 score", "Validation Loss", "Parameters (M)", "Training Time / epoch (s)", "GPU Memory (MB)"]
]

# Intermediate Fusion

Training requires a function that extracts the model inputs from a batch. All intermediate architectures share the same input format, which is defined by `get_mm_intermediate_inputs` in utils.py.

## Concatenation

Our first intermediate architecture uses **Concatenation**. Here, we apply separate convolutions to each modality. Afterwards, we concatenate the results and pass them through shared layers.

In [11]:
epochs = 50

gpu_mem_before = torch.cuda.memory_allocated()
mm_concat_intermediate_net = ConcatIntermediateNet(rgb_ch=4, xyz_ch=4).to(device)
# We collect this after model creation to measure model memory usage
gpu_mem_after = torch.cuda.memory_allocated()

mm_concat_intermediate_opt = Adam(mm_concat_intermediate_net.parameters(), lr=1e-3)
mm_concat_intermediate_scheduler = CosineAnnealingLR(mm_concat_intermediate_opt, T_max=epochs, eta_min=1e-6)
mm_concat_intermediate_save_path = PROJECT_PATH / "checkpoints" / "02_mm_concat_intermediate_net.pth"
mm_concat_intermediate_run = wandb.init(project=PROJECT_NAME, name=f"{ConcatIntermediateNet.__name__}")

print("Training mm_concat_intermediate_net")
set_seeds(51)
mm_concat_intermediate_train_loss, mm_concat_intermediate_valid_loss, mm_concat_intermediate_train_time = train_model(
    mm_concat_intermediate_net,
    mm_concat_intermediate_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_concat_intermediate_save_path,
    run=mm_concat_intermediate_run,
    scheduler=mm_concat_intermediate_scheduler,
)
mm_concat_intermediate_num_params = mm_concat_intermediate_net.get_number_of_parameters() / 1e6  # in millions

mm_concat_intermediate_run.finish()

Training mm_concat_intermediate_net
All random seeds set to 51 for reproducibility


100%|██████████| 50/50 [01:05<00:00,  1.31s/it]


Run history:
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
learning_rate,███████▇▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁
train_loss,█▇▆▆▆▅▅▅▅▅▄▄▃▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
valid_loss,█▆▆▆▆▅▅▄▅▄▄▄▄▃▃▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
epoch,50.0
learning_rate,0.0
train_loss,8e-05
valid_loss,0.00173


Load the saved checkpoint and run inference on the validation set to compute the F1 score.

In [12]:
mm_concat_intermediate_net.load_state_dict(torch.load(mm_concat_intermediate_save_path))

mm_concat_intermediate_pred, ground_truth = infer_model(
    mm_concat_intermediate_net,
    val_dataloader,
    get_mm_intermediate_inputs,
)

table.append([
    "Intermediate Fusion (Concatenation)",
    f1_score(ground_truth, mm_concat_intermediate_pred),
    np.min(mm_concat_intermediate_valid_loss),
    mm_concat_intermediate_num_params,
    mm_concat_intermediate_train_time / epochs,
    (gpu_mem_after - gpu_mem_before) / (1024 ** 2),
])

del mm_concat_intermediate_net, mm_concat_intermediate_opt, mm_concat_intermediate_scheduler
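
`infer_model` is part of `src.utils` and its body is not shown. As a sketch of the step it presumably performs, assuming binary predictions are obtained by thresholding the sigmoid of the single logit at 0.5, the F1 score can be computed as follows. The function name `f1_from_logits` and the example values are hypothetical.

```python
import torch


def f1_from_logits(logits: torch.Tensor, targets: torch.Tensor, threshold: float = 0.5) -> float:
    """Binary F1 for the positive class, computed from raw logits."""
    preds = (torch.sigmoid(logits) >= threshold).long().flatten()
    targets = targets.long().flatten()
    tp = int(((preds == 1) & (targets == 1)).sum())
    fp = int(((preds == 1) & (targets == 0)).sum())
    fn = int(((preds == 0) & (targets == 1)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


logits = torch.tensor([3.0, -2.0, 0.4, -0.1])
targets = torch.tensor([1, 0, 0, 1])
print(f1_from_logits(logits, targets))  # 0.5 (one TP, one FP, one FN)
```

For binary labels this is the same quantity that `sklearn.metrics.f1_score` reports with its default settings.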

## Addition

Now we use **Addition** as our intermediate architecture. Here, we apply separate convolutions to both modalities. Afterwards, we perform element-wise addition of the partial results and pass the sum through shared layers for the final output.

In [13]:
gpu_mem_before = torch.cuda.memory_allocated()
mm_addition_intermediate_net = AdditionIntermediateNet(rgb_ch=4, xyz_ch=4).to(device)
gpu_mem_after = torch.cuda.memory_allocated()

mm_addition_intermediate_opt = Adam(mm_addition_intermediate_net.parameters(), lr=1e-3)
mm_addition_intermediate_scheduler = CosineAnnealingLR(mm_addition_intermediate_opt, T_max=epochs, eta_min=1e-6)
mm_addition_intermediate_save_path = PROJECT_PATH / "checkpoints" / "02_mm_addition_intermediate_net.pth"
mm_addition_intermediate_run = wandb.init(project=PROJECT_NAME, name=f"{AdditionIntermediateNet.__name__}")

print("Training mm_addition_intermediate_net")
set_seeds(51)
mm_addition_intermediate_train_loss, mm_addition_intermediate_valid_loss, mm_addition_intermediate_train_time = train_model(
    mm_addition_intermediate_net,
    mm_addition_intermediate_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_addition_intermediate_save_path,
    run=mm_addition_intermediate_run,
    scheduler=mm_addition_intermediate_scheduler,
)
mm_addition_intermediate_num_params = mm_addition_intermediate_net.get_number_of_parameters() / 1e6  # in millions

mm_addition_intermediate_run.finish()

Training mm_addition_intermediate_net
All random seeds set to 51 for reproducibility


100%|██████████| 50/50 [01:01<00:00,  1.22s/it]


Run history:
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
learning_rate,███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁
train_loss,█▄▄▄▄▃▄▃▃▃▃▃▃▃▃▂▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
valid_loss,███▇▆▆▅▆▆▆▅▅▅▅▃▂▃▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
epoch,50.0
learning_rate,0.0
train_loss,0.00027
valid_loss,0.04479


Load the saved checkpoint and run inference on the validation set to compute the F1 score.

In [14]:
mm_addition_intermediate_net.load_state_dict(torch.load(mm_addition_intermediate_save_path))

mm_addition_intermediate_pred, ground_truth = infer_model(
    mm_addition_intermediate_net,
    val_dataloader,
    get_mm_intermediate_inputs,
)

table.append([
    "Intermediate Fusion (Addition)",
    f1_score(ground_truth, mm_addition_intermediate_pred),
    np.min(mm_addition_intermediate_valid_loss),
    mm_addition_intermediate_num_params,
    mm_addition_intermediate_train_time / epochs,
    (gpu_mem_after - gpu_mem_before) / (1024 ** 2),
])

del mm_addition_intermediate_net, mm_addition_intermediate_opt, mm_addition_intermediate_scheduler

## Hadamard Product

As the final intermediate strategy, we use the **Hadamard Product**. Here, we apply separate convolutions to both modalities. Afterwards, we perform element-wise multiplication of the partial results and pass the product through shared layers for the final output.

In [15]:
gpu_mem_before = torch.cuda.memory_allocated()
mm_hadamard_intermediate_net = HadamardIntermediateNet(rgb_ch=4, xyz_ch=4).to(device)
gpu_mem_after = torch.cuda.memory_allocated()

mm_hadamard_intermediate_opt = Adam(mm_hadamard_intermediate_net.parameters(), lr=1e-3)
mm_hadamard_intermediate_scheduler = CosineAnnealingLR(mm_hadamard_intermediate_opt, T_max=epochs, eta_min=1e-6)
mm_hadamard_intermediate_save_path = PROJECT_PATH / "checkpoints" / "02_mm_hadamard_intermediate_net.pth"
mm_hadamard_intermediate_run = wandb.init(project=PROJECT_NAME, name=f"{HadamardIntermediateNet.__name__}")

print("Training mm_hadamard_intermediate_net")
set_seeds(51)
mm_hadamard_intermediate_train_loss, mm_hadamard_intermediate_valid_loss, mm_hadamard_intermediate_train_time = train_model(
    mm_hadamard_intermediate_net,
    mm_hadamard_intermediate_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_hadamard_intermediate_save_path,
    run=mm_hadamard_intermediate_run,
    scheduler=mm_hadamard_intermediate_scheduler,
)
mm_hadamard_intermediate_num_params = mm_hadamard_intermediate_net.get_number_of_parameters() / 1e6  # in millions

mm_hadamard_intermediate_run.finish()

Training mm_hadamard_intermediate_net
All random seeds set to 51 for reproducibility


100%|██████████| 50/50 [01:01<00:00,  1.23s/it]


Run history:
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇██
learning_rate,██████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁
train_loss,█▆▆▆▅▅▄▄▄▃▃▃▃▃▃▂▃▂▂▂▃▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▁▁▁▁
valid_loss,▆▄█▄▄▄▃▃▃▂▂▂▃▂▂▂▂▁▃▂▁▁▁▃▂▂▁▁▂▁▃▁▁▁▁▃▂▂▂▂

Run summary:
epoch,50.0
learning_rate,0.0
train_loss,0.05077
valid_loss,0.12456


Load the saved checkpoint and run inference on the validation set to compute the F1 score.

In [16]:
mm_hadamard_intermediate_net.load_state_dict(torch.load(mm_hadamard_intermediate_save_path))

mm_hadamard_intermediate_pred, ground_truth = infer_model(
    mm_hadamard_intermediate_net,
    val_dataloader,
    get_mm_intermediate_inputs,
)

table.append([
    "Intermediate Fusion (Hadamard)",
    f1_score(ground_truth, mm_hadamard_intermediate_pred),
    np.min(mm_hadamard_intermediate_valid_loss),
    mm_hadamard_intermediate_num_params,
    mm_hadamard_intermediate_train_time / epochs,
    (gpu_mem_after - gpu_mem_before) / (1024 ** 2),
])

del mm_hadamard_intermediate_net, mm_hadamard_intermediate_opt, mm_hadamard_intermediate_scheduler

## Advantages and Limitations

Intermediate Fusion models process each modality separately before combining them. This allows the network to extract important features from each modality through convolutions before making the final prediction. Intermediate Fusion also applies shared convolutional layers after combining the modality embeddings, which enables cross-modal interactions. We explored several fusion strategies:

Concatenation increases the input size of the shared layers, resulting in a higher number of parameters. Consequently, the model is larger and requires more computation and time. However, concatenation still gives the model the opportunity to ignore less relevant modalities and focus on the most important information.

Addition and Hadamard Product fusion combine the modalities mathematically at an earlier stage. This reduces the number of parameters because the output size after summing or multiplying stays the same. However, it can obscure differences between modalities. For example, two samples with very different LiDAR and RGB features could produce the same element-wise sum or product, potentially making it harder for the model to distinguish between them.
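
The trade-off described above can be made concrete with a toy example. The feature values and shapes below are made up for illustration and are not taken from the models in `src.models`: swapping the two modality features gives a different sample, yet only concatenation preserves that difference.

```python
import torch

# Hypothetical per-modality feature maps with shape (batch, channels, height, width).
rgb_feat = torch.tensor([1.0, 4.0]).view(1, 2, 1, 1)
lidar_feat = torch.tensor([4.0, 1.0]).view(1, 2, 1, 1)

concat = torch.cat([rgb_feat, lidar_feat], dim=1)  # channels double: 2 -> 4
added = rgb_feat + lidar_feat                      # channels unchanged
hadamard = rgb_feat * lidar_feat                   # channels unchanged
print(concat.shape, added.shape, hadamard.shape)

# Swap the modalities to obtain a second, different sample.
concat_swapped = torch.cat([lidar_feat, rgb_feat], dim=1)
print(torch.equal(concat, concat_swapped))           # False: still distinguishable
print(torch.equal(added, lidar_feat + rgb_feat))     # True: the sum collides
print(torch.equal(hadamard, lidar_feat * rgb_feat))  # True: the product collides
```

The doubled channel count after concatenation is also what enlarges the first shared layer, which accounts for the extra parameters.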


# Late Fusion

Now we use Late Fusion. Here, we first process each modality independently through separate convolutional layers (creating embeddings). Afterwards, we combine the outputs of both modalities using concatenation and produce the final prediction with a classifier head.

In [17]:
gpu_mem_before = torch.cuda.memory_allocated()
mm_late_net = LateFusionModel(rgb_in_ch=4, lidar_in_ch=4).to(device)
gpu_mem_after = torch.cuda.memory_allocated()

mm_late_opt = Adam(mm_late_net.parameters(), lr=1e-3)
mm_late_scheduler = CosineAnnealingLR(mm_late_opt, T_max=epochs, eta_min=1e-6)
mm_late_save_path = PROJECT_PATH / "checkpoints" / "02_mm_late_net.pth"
mm_late_run = wandb.init(project=PROJECT_NAME, name=f"{LateFusionModel.__name__}")

print("Training mm_late_net")
set_seeds(51)
mm_late_train_loss, mm_late_valid_loss, mm_late_train_time = train_model(
    mm_late_net,
    mm_late_opt,
    loss_func,
    get_mm_late_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_late_save_path,
    run=mm_late_run,
    scheduler=mm_late_scheduler,
)
mm_late_num_params = mm_late_net.get_number_of_parameters() / 1e6  # in millions

mm_late_run.finish()

Training mm_late_net
All random seeds set to 51 for reproducibility


100%|██████████| 50/50 [01:07<00:00,  1.34s/it]


Run history:
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
learning_rate,██████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁
train_loss,█▅▅▅▅▄▄▃▄▃▃▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
valid_loss,▆▆▆▆▅▄▄▄▃▄▃█▅▃▃▃▂▂▂▂▂▂▃▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
epoch,50.0
learning_rate,0.0
train_loss,0.01804
valid_loss,0.07165


Load the saved checkpoint and run inference on the validation set to compute the F1 score.

In [18]:
mm_late_net.load_state_dict(torch.load(mm_late_save_path))

mm_late_pred, ground_truth = infer_model(
    mm_late_net,
    val_dataloader,
    get_mm_late_inputs,
)

table.append([
    "Late Fusion",
    f1_score(ground_truth, mm_late_pred),
    np.min(mm_late_valid_loss),
    mm_late_num_params,
    mm_late_train_time / epochs,
    (gpu_mem_after - gpu_mem_before) / (1024 ** 2),
])

del mm_late_net, mm_late_opt, mm_late_scheduler

## Advantages and Limitations

Using Late Fusion allows us to combine already trained embedders. Combining pretrained models without retraining the entire network is a practical advantage. However, merging the outputs of the embedders can be challenging because there is no additional convolution, only a classifier head. In contrast, an intermediate fusion model can use convolutions on both modalities directly and exploit cross-modal interactions for prediction.

Because Late Fusion has a huge input space for the classifier head, this architecture involves a large number of parameters.
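
A back-of-the-envelope sketch of why the classifier head dominates: a fully connected layer has `in_features * out_features + out_features` parameters, so doubling the flattened input doubles its weight count. The sizes below are hypothetical and are not the dimensions used in `LateFusionModel`.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


# Hypothetical sizes, not the dimensions used in src.models.
embedding_dim = 3200  # flattened embedding of one modality
hidden = 100

single_head = nn.Linear(embedding_dim, hidden)
late_head = nn.Linear(2 * embedding_dim, hidden)  # head sees both embeddings concatenated

print(count_params(single_head))  # 3200 * 100 + 100 = 320100
print(count_params(late_head))    # 6400 * 100 + 100 = 640100
```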


# Comparison

Print the comparison table for all architectures, including every intermediate variant.

In [19]:
rows = list(zip(*table)) # transpose for tabulate
print(tabulate(rows[1:], headers=rows[0], tablefmt="grid"))


+---------------------------+---------------------------------------+----------------------------------+----------------------------------+---------------+
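| Metrics                   |   Intermediate Fusion (Concatenation) |   Intermediate Fusion (Addition) |   Intermediate Fusion (Hadamard) |   Late Fusion |
+===========================+=======================================+==================================+==================================+===============+
| F1 score                  |                            1          |                        0.903226  |                        0.833333  |      0.833333 |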
+---------------------------+---------------------------------------+----------------------------------+----------------------------------+---------------+
| Validation Loss           |                            0.00171058 |                        0.030699  |                        0.0694984 |      0.070162 |
+---------------------------+---------------------------------------+----------------------------------+----------------------------------+---------------+
| Parameters (M)            |                            2.0945 

Note: GPU Memory for Addition is wrong, but I could not fix the problem in time. It should be the same as for Hadamard.
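
The difference between two `torch.cuda.memory_allocated()` readings depends on everything that is allocated or freed between the calls, not only on the new model. One possible cross-check, offered as a sketch and not as the method used in this notebook, is to compute the size of a model's parameters and buffers directly. The helper name `model_size_mb` and the toy layer are assumptions.

```python
import torch
import torch.nn as nn


def model_size_mb(model: nn.Module) -> float:
    """Size of all parameters and buffers in MiB, independent of allocator state."""
    n_bytes = sum(t.numel() * t.element_size() for t in model.parameters())
    n_bytes += sum(t.numel() * t.element_size() for t in model.buffers())
    return n_bytes / (1024 ** 2)


toy = nn.Linear(256, 1)    # 256 weights + 1 bias, stored as float32
print(model_size_mb(toy))  # 257 * 4 bytes, expressed in MiB
```

Because it only inspects the tensors owned by the model, this value would be identical for two architectures with the same layers, which matches the expectation stated in the note for Addition and Hadamard.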

The Intermediate Fusion with Concatenation architecture performs the best overall, achieving an F1 score of 1.0 and the lowest validation loss. This approach first embeds each modality separately and then concatenates the embeddings before feeding them into shared convolutional layers. This allows the network to preserve modality-specific structure while still enabling cross-modal interaction at a later stage. Because the shared layers learn how to optimally combine the modalities, the model avoids injecting noise through rigid element-wise addition or multiplication. This flexibility likely explains the strong performance of the Concatenation model.

In terms of efficiency, the Concatenation model has slightly more parameters than the other intermediate fusion models (2.09M vs. 1.91M) and a marginally higher training time per epoch. However, it still requires fewer parameters than Late Fusion, which has over 3M parameters while performing notably worse. The Addition and Hadamard models are more parameter-efficient but show a clear drop in F1 score, particularly the Hadamard model, which performs worse despite having the same parameter count as Addition. This suggests that the performance cost of excessive parameter reduction outweighs the computational benefits in this case.

As a recommendation, the Concatenation architecture is the best default choice when maximizing predictive performance is the priority, as it offers the strongest balance of accuracy and model complexity. If runtime or resource constraints are critical, the Addition model is a reasonable alternative because it is faster and lighter while still performing well. The Hadamard model is less attractive because it offers no efficiency advantage over Addition yet performs worse. Late Fusion becomes appealing when high-quality pretrained encoders already exist and the goal is to avoid retraining shared multimodal layers; however, it comes with a higher parameter cost and weaker performance in this experiment.

Overall, Concatenation-based Intermediate Fusion provides the most effective trade-off between flexibility, parameter count, and accuracy.