# Colab Setup

Mount Google Drive, set the project path, and install the dependencies.

In [1]:
!nvidia-smi

Mon Dec 29 12:43:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   46C    P8             12W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import sys
import os
from pathlib import Path
from google.colab import drive

# 1. Mount Google Drive
drive.mount('/content/drive')

# 2. Define your project path
# (Adjust this path if you put the folder inside a subfolder in Drive)

# Google drive path
PROJECT_PATH = Path("/content/drive/MyDrive/CILP-Assessment-Multimodal-Learning")

# Local Path
# PROJECT_PATH = Path.cwd().parent

# 3. Add the project directory to Python's path so we can import from 'src'
if str(PROJECT_PATH) not in sys.path:  # sys.path holds strings, so compare as str
    sys.path.insert(0, str(PROJECT_PATH))

# 4. Verify setup by listing files
print("Project files found:", os.listdir(PROJECT_PATH))

# 5. Install dependencies inside the remote Colab machine
# We use the requirements file located in your Drive
!pip install -r "{PROJECT_PATH}/requirements_colab.txt"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Project files found: ['notebooks', 'src', 'data', 'README', 'results', '.git', '.gitignore', 'requirements.txt', '.ipynb_checkpoints', 'cilp_assessment', 'checkpoints', 'requirements_colab.txt']


We copy the dataset from Drive to the local `/content` folder, which gives faster file access during training.

In [3]:
!rsync -ah "{PROJECT_PATH}/cilp_assessment" "/content/"
DATASET_PATH = Path("/content")

# Imports and Seed Management

In [4]:
import os

# Set environment variables for reproducibility BEFORE importing torch
os.environ['PYTHONHASHSEED'] = '51'
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

from pathlib import Path

import wandb
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader
from torch.optim.lr_scheduler import CosineAnnealingLR
import fiftyone as fo
from torch.optim import Adam
from tabulate import tabulate

from src.datasets import CustomTorchImageDataset
from src.models import ConcatIntermediateNet, ConcatIntermediateNetWithStride
from src.training import train_model
from src.utils import (
    set_seeds,
    create_deterministic_training_dataloader,
    infer_model,
    get_mm_intermediate_inputs,
)

set_seeds(51)

PROJECT_NAME = "cilp-extended-assessment"

All random seeds set to 51 for reproducibility


# Dataset Loading

In [5]:
IMG_SIZE = 64

dataset_name = "cilp_assessment"

# Load the FiftyOne dataset from disk
dataset = fo.Dataset.from_dir(
    dataset_dir=DATASET_PATH / dataset_name,
    dataset_type=fo.types.FiftyOneDataset,
).sort_by("filepath")

print(f"Total samples in dataset: {len(dataset)}")

Importing samples...


 100% |███████████████| 3222/3222 [164.1ms elapsed, 0s remaining, 20.0K samples/s] 


Total samples in dataset: 1074


Extract the training and validation splits of the dataset.

In [6]:
train_dataset = dataset.match_tags("train")
val_dataset = dataset.match_tags("validation")

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Training samples: 895
Validation samples: 179


Wrap both splits in custom torch datasets so that they can be used with a DataLoader.

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Device: ", device)

torch_train_dataset = CustomTorchImageDataset(
    fiftyone_dataset=train_dataset,
    img_size=IMG_SIZE,
    device=device,
)

torch_val_dataset = CustomTorchImageDataset(
    fiftyone_dataset=val_dataset,
    img_size=IMG_SIZE,
    device=device,
)

Device:  cuda
CustomTorchImageDataset initialized with 895 samples.
CustomTorchImageDataset initialized with 179 samples.


Create the DataLoaders. The training loader uses a deterministic setup so that the results are reproducible.

In [8]:
train_dataloader = create_deterministic_training_dataloader(
    torch_train_dataset,
    batch_size=32,
    shuffle=True,
)

val_dataloader = DataLoader(
    torch_val_dataset,
    batch_size=32,
    shuffle=False,
)

Create a concatenated dataset (training plus validation) for inference.

In [9]:
concat_dataset = ConcatDataset([torch_train_dataset, torch_val_dataset])
print(f"Total samples in concat dataset: {len(concat_dataset)}")

concat_dataloader = DataLoader(
    concat_dataset,
    batch_size=32,
    shuffle=False,
)

Total samples in concat dataset: 1074


# Hyperparameters

For the loss function, we use the same one as in the assessment: **BCEWithLogitsLoss**. This loss works well with a single output neuron for binary classification. In later tasks, we set *num_classes=1* to ensure the model has only one output neuron, which aligns with this loss function.

In [10]:
loss_func = nn.BCEWithLogitsLoss()
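
To make the single-logit setup concrete, the following is a minimal pure-Python sketch of what a binary cross-entropy on raw logits computes per sample. The function names are hypothetical and exist only for illustration; the notebook itself relies on `nn.BCEWithLogitsLoss`, whose internals may differ in detail.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy for one raw logit and a 0/1 target.

    Mathematically equal to
        -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))],
    rewritten so that exp() never receives a large positive argument.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-sample losses (mean reduction)."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# A logit of 0 means "50/50", so the loss is log(2) for either label.
print(bce_with_logits(0.0, 1.0))
# Confident and correct predictions give a loss close to 0.
print(mean_bce_with_logits([8.0, -8.0], [1.0, 0.0]))
```

Because the sigmoid is folded into the loss, the model's single output neuron can stay a raw logit without a final activation.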

We initialize the table that is used at the end to compare the two architectures.

In [11]:
table = [
    ["Metric", "Validation Loss", "Parameters (M)", "Training Time", "Final Accuracy"]
]

We define hyperparameters that are shared between both architectures.

In [12]:
epochs = 50
rgb_ch = 4
xyz_ch = 4
lr = 1e-3
eta_min = 1e-6
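
The values `lr`, `eta_min`, and `epochs` are later passed to a cosine annealing scheduler. As a rough illustration, the sketch below evaluates the closed-form cosine schedule for these values; the helper name `cosine_annealing_lr` is hypothetical, and the sketch assumes one scheduler step per epoch with `T_max=epochs`.

```python
import math


def cosine_annealing_lr(step, base_lr=1e-3, eta_min=1e-6, t_max=50):
    """Closed-form cosine schedule: starts at base_lr, ends at eta_min."""
    cos_term = 1.0 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Learning rate at the start, the midpoint, and the end of training.
for epoch in (0, 25, 50):
    print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
```

Under these assumptions the rate decays from `1e-3` to `1e-6`, which is consistent with the logged final learning rate being displayed as `0.0` after rounding.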

# MaxPool2d

We start with the **MaxPool2d** version of the fusion model. It was already trained in the previous notebook; we retrain it here so that this notebook can be run independently.

In [13]:
mm_max_pool_net = ConcatIntermediateNet(rgb_ch, xyz_ch).to(device)
mm_max_pool_opt = Adam(mm_max_pool_net.parameters(), lr=lr)
mm_max_pool_scheduler = CosineAnnealingLR(mm_max_pool_opt, T_max=epochs, eta_min=eta_min)
mm_max_pool_save_path = PROJECT_PATH / "checkpoints" / "03_mm_max_pool_model.pth"
mm_max_pool_run = wandb.init(project=PROJECT_NAME, name=f"{ConcatIntermediateNet.__name__}")

print("Training mm_max_pool_net")
set_seeds(51)
mm_max_pool_train_loss, mm_max_pool_valid_loss, mm_max_pool_train_time = train_model(
    mm_max_pool_net,
    mm_max_pool_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_max_pool_save_path,
    run=mm_max_pool_run,
    scheduler=mm_max_pool_scheduler,
)
mm_max_pool_num_params = mm_max_pool_net.get_number_of_parameters() / 1e6  # in millions

mm_max_pool_run.finish()


Training mm_max_pool_net
All random seeds set to 51 for reproducibility


100%|██████████| 50/50 [01:05<00:00,  1.31s/it]


Run history:
epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
learning_rate,██████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁
train_loss,█▇▆▆▆▅▅▅▅▄▃▄▃▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
valid_loss,█▆▆▆▆▅▅▄▅▄▄▄▄▃▃▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
epoch,50.0
learning_rate,0.0
train_loss,8e-05
valid_loss,0.00173


Load the best checkpoint and calculate the accuracy on the combined training and validation set.

In [14]:
best_mm_max_pool_model = ConcatIntermediateNet(rgb_ch, xyz_ch).to(device)
best_mm_max_pool_model.load_state_dict(torch.load(mm_max_pool_save_path, map_location=device))

mm_max_pool_pred, ground_truth = infer_model(
    best_mm_max_pool_model,
    concat_dataloader,
    get_mm_intermediate_inputs,
)
mm_max_pool_accuracy = np.mean(mm_max_pool_pred == ground_truth)
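
The implementation of `infer_model` is not shown in this notebook. The sketch below is only an assumption about how binary labels are typically derived from a single-logit model before the comparison with the ground truth; `logits_to_labels` and `accuracy` are hypothetical helper names.

```python
def logits_to_labels(logits, threshold=0.0):
    """Turn raw logits into 0/1 labels.

    sigmoid(z) > 0.5 holds exactly when z > 0, so thresholding the raw
    logit at 0 is equivalent to thresholding the probability at 0.5.
    """
    return [1 if z > threshold else 0 for z in logits]


def accuracy(pred, truth):
    """Fraction of positions where prediction and ground truth agree."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


pred = logits_to_labels([2.3, -0.7, 0.1, -4.0])
print(pred)                          # [1, 0, 1, 0]
print(accuracy(pred, [1, 0, 0, 0]))  # 0.75
```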

Add metrics to table.

In [15]:
table.append([
    "MaxPool2d",
    np.min(mm_max_pool_valid_loss),
    mm_max_pool_num_params,
    mm_max_pool_train_time,
    mm_max_pool_accuracy,
])

# Strided Conv

Next, we train the **Strided Conv** version. This variant removes the pooling layers and instead applies a stride of 2 in the convolutional layers.

In [16]:
mm_stride_net = ConcatIntermediateNetWithStride(rgb_ch, xyz_ch).to(device)
mm_stride_opt = Adam(mm_stride_net.parameters(), lr=lr)
mm_stride_scheduler = CosineAnnealingLR(mm_stride_opt, T_max=epochs, eta_min=eta_min)
mm_stride_save_path = PROJECT_PATH / "checkpoints" / "03_mm_stride_model.pth"
mm_stride_run = wandb.init(project=PROJECT_NAME, name=f"{ConcatIntermediateNetWithStride.__name__}")

print("Training mm_stride_net")
set_seeds(51)
mm_stride_train_loss, mm_stride_valid_loss, mm_stride_train_time = train_model(
    mm_stride_net,
    mm_stride_opt,
    loss_func,
    get_mm_intermediate_inputs,
    epochs,
    train_dataloader,
    val_dataloader,
    save_path=mm_stride_save_path,
    run=mm_stride_run,
    scheduler=mm_stride_scheduler,
)
mm_stride_num_params = mm_stride_net.get_number_of_parameters() / 1e6  # in millions

mm_stride_run.finish()

Training mm_stride_net
All random seeds set to 51 for reproducibility


100%|██████████| 50/50 [00:51<00:00,  1.03s/it]


Run history:
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
learning_rate,███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁
train_loss,█▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
valid_loss,▄▂▂▂▁▁▁▂▁▂▁▁▁▁▁▂▂▃▁▂▂▃▅▄▄▅▅▄▄▅▇▇▇▇██████

Run summary:
epoch,50.0
learning_rate,0.0
train_loss,0.03749
valid_loss,0.43594


Load the best checkpoint and calculate the accuracy on the combined training and validation set.

In [17]:
best_mm_stride_model = ConcatIntermediateNetWithStride(rgb_ch, xyz_ch).to(device)
best_mm_stride_model.load_state_dict(torch.load(mm_stride_save_path, map_location=device))

mm_stride_pred, ground_truth = infer_model(
    best_mm_stride_model,
    concat_dataloader,
    get_mm_intermediate_inputs,
)
mm_stride_accuracy = np.mean(mm_stride_pred == ground_truth)

Add metrics to table.

In [18]:
table.append([
    "Stride-2 Conv2d",
    np.min(mm_stride_valid_loss),
    mm_stride_num_params,
    mm_stride_train_time,
    mm_stride_accuracy,
])

# Comparison

We calculate the difference (strided convolution minus MaxPool2d) for each metric.

In [19]:
differences = ["Difference"]
for i in range(1, len(table[0])):
    differences.append(table[2][i] - table[1][i])
table.append(differences)

Print the comparison table.

In [20]:
rows = list(zip(*table))  # transpose so that each metric becomes a row
print(tabulate(rows[1:], headers=rows[0], tablefmt="grid"))

+-----------------+-------------+------------------+--------------+
| Metric          |   MaxPool2d |  Stride-2 Conv2d |   Difference |
+=================+=============+==================+==============+
| Validation Loss |  0.00171058 |         0.198788 |    0.197078  |
+-----------------+-------------+------------------+--------------+
| Parameters (M)  |  2.0945     |         2.0945   |    0         |
+-----------------+-------------+------------------+--------------+
| Training Time   | 63.3715     |        49.8364   |  -13.535     |
+-----------------+-------------+------------------+--------------+
| Final Accuracy  |  1          |         0.930168 |   -0.0698324 |
+-----------------+-------------+------------------+--------------+
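
The identical parameter counts are plausible: a convolution's weight count does not depend on its stride, and pooling adds no parameters. The sketch below illustrates this with a hypothetical block (4 input channels, 32 output channels, 3×3 kernel, padding 1, 64×64 input). These layer shapes are assumptions for illustration, not the actual layers of `ConcatIntermediateNet`.

```python
def conv2d_param_count(in_ch, out_ch, kernel, bias=True):
    """Weights plus optional biases of a 2D convolution.

    Note that the stride does not appear in the formula.
    """
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


def out_size(size, kernel, stride, padding=0):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical block: 4 -> 32 channels, 3x3 kernel, padding 1, 64x64 input.
params = conv2d_param_count(4, 32, 3)

# Variant A: stride-1 convolution followed by parameter-free 2x2 max pooling.
size_pool = out_size(out_size(64, 3, 1, 1), 2, 2)
# Variant B: stride-2 convolution without pooling.
size_stride = out_size(64, 3, 2, 1)

print(params, size_pool, size_stride)  # 1184 32 32
```

Since both variants produce the same spatial size in this sketch, any following fully connected layer would also see the same input size and therefore have the same number of weights.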


## Theoretical Differences
Both approaches reduce the spatial size of an activation map, but they do so in different ways.

MaxPool2d selects the maximum value within each window. With a kernel size of 2 (and a stride equal to the kernel size), each 2×2 window collapses to a single value, which halves the height and width of the activation map. Pooling is a fixed operation with no learnable parameters.
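
As a small worked example of this downsampling, here is a pure-Python sketch of 2×2 max pooling on a 4×4 feature map. The function `max_pool_2x2` is a hypothetical helper written for illustration; it is not how PyTorch implements the layer.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a list-of-lists feature map."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

# Each non-overlapping 2x2 window is replaced by its largest value.
print(max_pool_2x2(fmap))  # [[4, 2], [2, 8]]
```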

A strided convolution reduces spatial resolution by moving the filter more than one pixel at a time. The stride defines how many pixels the filter shifts between applications; with a stride of 2, the filter is evaluated only at every second position in both spatial dimensions. Unlike pooling, this operation is learnable because the convolution weights are trained, so downsampling is coupled with feature extraction.
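
The following pure-Python sketch shows a strided convolution without padding, together with the usual output-size formula. The helper names are hypothetical, and the kernel size 3 with padding 1 in the size check is an assumption, since the layer configuration of `ConcatIntermediateNetWithStride` is not shown here.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Output size: floor((size + 2 * padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def strided_conv2d(fmap, kernel, stride):
    """2D cross-correlation without padding, using a square kernel."""
    k = len(kernel)
    out_h = conv_out_size(len(fmap), k, stride)
    out_w = conv_out_size(len(fmap[0]), k, stride)
    return [
        [
            sum(
                kernel[a][b] * fmap[i * stride + a][j * stride + b]
                for a in range(k)
                for b in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
# These weights would be learned; an averaging kernel is used only as an example.
kernel = [[0.25, 0.25], [0.25, 0.25]]

print(strided_conv2d(fmap, kernel, stride=2))  # [[2.5, 1.0], [1.25, 6.5]]
print(conv_out_size(64, kernel=3, stride=2, padding=1))  # 32
```

The output has the same 2×2 shape as max pooling would give on this input, but its values depend on the kernel weights rather than on a fixed maximum rule.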

## Impact on Gradient Flow and Learned Features
Because MaxPool2d has no weights, it contributes no parameter gradients. In the backward pass, the gradient flows only to the element that was the maximum of each window; all other activations in that window receive zero gradient. This makes the gradient signal sparse and can make the model slightly more selective.

In contrast, a strided convolution learns feature extraction and downsampling jointly. Gradients propagate through the convolution weights, and every input element covered by the filter contributes to the backward pass.
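
The difference in gradient routing can be sketched in pure Python. Both functions below are hypothetical helpers that compute the gradient with respect to the input for a 4×4 feature map; the 2×2 kernel with stride 2 is an assumption chosen to mirror the pooling window.

```python
def max_pool_2x2_backward(fmap, grad_out):
    """Route each upstream gradient to the argmax of its 2x2 window."""
    grad_in = [[0.0] * len(fmap[0]) for _ in fmap]
    for i, row in enumerate(grad_out):
        for j, g in enumerate(row):
            window = [
                (fmap[2 * i + a][2 * j + b], 2 * i + a, 2 * j + b)
                for a in (0, 1)
                for b in (0, 1)
            ]
            _, r, c = max(window)  # position of the largest value in the window
            grad_in[r][c] += g
    return grad_in


def strided_conv_backward_input(kernel, grad_out, stride, in_h, in_w):
    """Spread each upstream gradient over the inputs the filter covered."""
    k = len(kernel)
    grad_in = [[0.0] * in_w for _ in range(in_h)]
    for i, row in enumerate(grad_out):
        for j, g in enumerate(row):
            for a in range(k):
                for b in range(k):
                    grad_in[i * stride + a][j * stride + b] += kernel[a][b] * g
    return grad_in


def count_nonzero(grid):
    return sum(v != 0 for row in grid for v in row)


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 3, 7, 8],
]
grad_out = [[1.0, 1.0], [1.0, 1.0]]
kernel = [[0.25, 0.25], [0.25, 0.25]]

pool_grad = max_pool_2x2_backward(fmap, grad_out)
conv_grad = strided_conv_backward_input(kernel, grad_out, 2, 4, 4)

# Number of input positions that receive a non-zero gradient.
print(count_nonzero(pool_grad), count_nonzero(conv_grad))  # 4 16
```

In this example only 4 of the 16 inputs receive a gradient through max pooling, while all 16 receive one through the strided convolution.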

## Recommendation with Justification
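Use a strided convolution when you want:

- the model to learn how to downsample
- richer, more flexible feature learning
- tighter control over how spatial information is preserved

Use MaxPool2d when you want:

- a simpler, parameter-free downsampling method
- stronger spatial invariance and feature sparsity
- a slight regularization effect

In the runs above, MaxPool2d reached the lower best validation loss (0.0017 vs. 0.199) and the higher accuracy on the combined set (1.00 vs. 0.93) with the same parameter count, while the strided variant trained about 13.5 s faster. For this task, the results therefore favor MaxPool2d.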