## Multi-Modal Embedder Fusion Architecture Comparison

In this notebook, we extend the analysis started with data visualization and preparation to conduct multimodal machine learning experiments focused on fusion strategies for combining multiple data modalities. Different modalities capture complementary information which, when fused, can lead to more robust and performant models.

We focus on two fusion strategies, whose architectural choices, advantages, and limitations are discussed in detail in the `Experiments` section:

- **Late Fusion**, where modalities are processed independently and combined at a later stage;  
- **Intermediate Fusion**, where intermediate representations from unimodal networks are combined via concatenation, addition, or multiplication.  

In our setting, the two modalities are RGB and LiDAR images, previously analyzed in Notebook 01. The task is binary classification of object shape (cube vs. sphere). We implement both fusion strategies and, for intermediate fusion, its three variants, performing a structured design exploration to identify the best-performing architecture for this task and modality pair.

The first section of this notebook defines the experimental configuration, including path setup and initialization of the data loaders.

In [2]:
%load_ext autoreload
%autoreload 2

import wandb
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 

import json 
import torch
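# Query the GPU name only when CUDA is available, so the cell also runs on CPU-only machines
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")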

import pandas as pd

from handsoncv.datasets import CILPFusionDataset
from handsoncv.models import LateFusionNet, IntermediateFusionNet
from handsoncv.training import train_fusion_cilp_model
from handsoncv.utils import set_seed, seed_worker
from torchvision import transforms
from torch.utils.data import DataLoader

NOTEBOOK_DIR = os.getcwd()
PROJECT_ROOT = os.path.abspath(os.path.join(NOTEBOOK_DIR, "..", ".."))
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(DEVICE)

# Folders we frequently use across the experiments' notebooks
ROOT_PATH = os.path.join(PROJECT_ROOT, "Assignment-2")
RESULTS_DIR = os.path.join(ROOT_PATH, "results", "tables")
os.makedirs(RESULTS_DIR, exist_ok=True) # Ensures folder exists 

CHECKPOINTS_DIR = os.path.join(ROOT_PATH, "checkpoints")
ROOT_DATA = "~/Documents/repos/BuildingAIAgentsWithMultimodalModels/data/assessment/"
IMG_SIZE = 64
BATCH_SIZE = 32

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Using GPU: NVIDIA GeForce RTX 3090
cuda


In the following cell, we fix the random seed so that data shuffling is reproducible across the DataLoader worker processes. We then build the training and validation sets with the custom `CILPFusionDataset` class (implemented in `src/datasets.py`), using the predefined sample lists that were generated and saved in `01_dataset_exploration.ipynb`; refer to that notebook for the creation procedure. The subset size used in the experiments is recorded in the configuration logs of the public [handsoncv-fusion project](https://wandb.ai/handsoncv-research/handsoncv-fusion?nw=nwuserguarinovanessaemanuela).
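The helpers `set_seed` and `seed_worker` are imported from `handsoncv.utils`, and their bodies are not shown in this notebook. The sketch below illustrates what such helpers commonly look like, following the pattern recommended in the PyTorch reproducibility notes. The names `set_seed_sketch`, `seed_worker_sketch`, and `shuffled_order` are hypothetical, and the bodies are assumptions rather than the project's actual implementation.

```python
import random

import numpy as np
import torch


def set_seed_sketch(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (hypothetical stand-in for set_seed)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


def seed_worker_sketch(worker_id: int) -> None:
    """Derive NumPy/Python seeds for a DataLoader worker from its torch seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def shuffled_order(seed: int, n: int) -> list:
    """Return the permutation of n items produced by a generator seeded with `seed`."""
    g = torch.Generator()
    g.manual_seed(seed)
    return torch.randperm(n, generator=g).tolist()


# The same seed yields the same shuffling order on every call
print(shuffled_order(42, 10) == shuffled_order(42, 10))
```

Passing a seeded `torch.Generator` to the `DataLoader` fixes the shuffling order, while the worker-init function keeps any NumPy or `random` calls inside worker processes reproducible as well.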

In [3]:
# Load split dictionary previously created with 01_dataset_exploration.ipynb
mapping_file = "subset_splits.json"
with open(f"{ROOT_PATH}/{mapping_file}", "r") as f:
    splits = json.load(f)
    
SEED = splits["seed"] # From .json file created through notebook 01_dataset_exploration.ipynb 
set_seed(SEED)

# Instantiate the datasets and their associated transforms
img_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToTensor(),  # Scales data into [0,1]
])

train_ds = CILPFusionDataset(root_dir=ROOT_DATA, sample_ids=splits["train"], transform=img_transforms)
val_ds = CILPFusionDataset(root_dir=ROOT_DATA, sample_ids=splits["val"], transform=img_transforms)

# Create a Generator object to pass to the dataLoaders
g = torch.Generator()
g.manual_seed(SEED)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=True, num_workers=2, worker_init_fn=seed_worker, generator=g)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, drop_last=True, num_workers=2, worker_init_fn=seed_worker, generator=g)

print(f"Ready to train with {len(train_ds)} training pairs and {len(val_ds)} validation pairs.")

Seeds set to 42 for reproducibility.
Ready to train with 4799 training pairs and 1200 validation pairs.


Finally, the last configuration cell verifies that the training and validation splits do not overlap and checks the class balance of the first training and validation batches. This check matters because **the datasets provided in the NVIDIA notebooks produced batches containing only a single class**, leading to unreliable accuracy estimates. These three configuration cells are shared across the experimental notebooks `02_*`, `03_*`, and `04_*`.

In [4]:
# Report any sample IDs shared by the two splits before failing, so the leak can be inspected
leaked_ids = set(train_ds.sample_ids).intersection(set(val_ds.sample_ids))
print(f"Found {len(leaked_ids)} overlapping IDs.")
print(f"Example leaked IDs: {list(leaked_ids)[:10]}")
assert not leaked_ids, "DATA LEAKAGE DETECTED!"

train_labels = next(iter(train_loader))[-1].cpu().numpy()
val_labels = next(iter(val_loader))[-1].cpu().numpy()
class_prior_train, class_prior_val = train_labels.mean(), val_labels.mean()

print(f"Class prior average in first training batch: {class_prior_train:.4f}, and validation batch: {class_prior_val:.4f}")

if class_prior_train < 0.01 or class_prior_train > 0.99:
    raise ValueError("The training batch is extremely imbalanced "
        f"(class prior = {class_prior_train:.4f}). "
        "It will cause the model to memorize label ordering. "
        "Please recreate the dataset splits."
    )

Found 0 overlapping IDs.
Example leaked IDs: []


Class prior average in first training batch: 0.2812, and validation batch: 0.5625
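The check above inspects only the first batch of each loader. A slightly broader sanity check is sketched here, assuming binary 0/1 labels and batches represented as plain lists. It computes the positive-class fraction per batch and flags any batch that contains a single class. The helper names and the toy labels are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, List, Sequence


def batch_class_priors(batches: Iterable[Sequence[int]]) -> List[float]:
    """Return the fraction of label-1 samples in each batch."""
    priors = []
    for labels in batches:
        labels = list(labels)
        if not labels:
            raise ValueError("Empty batch encountered.")
        priors.append(sum(labels) / len(labels))
    return priors


def single_class_batches(priors: Sequence[float]) -> List[int]:
    """Return the indices of batches that contain only one class."""
    return [i for i, p in enumerate(priors) if p == 0.0 or p == 1.0]


# Toy batches (hypothetical labels for illustration only)
toy_batches = [[0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0]]
priors = batch_class_priors(toy_batches)
print(priors)                        # per-batch positive-class fraction
print(single_class_batches(priors))  # indices of degenerate batches
```

With the real loaders, the label lists could be obtained from the last element of each batch, as the cell above does with `next(iter(train_loader))[-1]`.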


### Experiments

#### Model Architectures and Fusion Strategies

This section describes the neural architectures evaluated in the experiments, beginning with the shared embedding backbone and followed by the multimodal fusion strategies considered.

---

1. **Embedding Backbone**

All models, as well as the cross-modal projection strategy implemented in `04_final_assessment.ipynb`, use a common `Embedder` architecture, adapted from the NVIDIA assessment baseline and designed to process each modality independently. The embedder consists of a sequence of convolutional layers with ReLU activations and progressive spatial downsampling. The channel progression follows a compact design:

$$
\text{in\_channels} \rightarrow 50 \rightarrow 100 \rightarrow 200 \rightarrow 200
$$

Downsampling is performed either via max pooling or strided convolution, depending on the experimental configuration; the corresponding ablation is provided in `03_strided_conv_ablation.ipynb`.

For intermediate fusion, the `Embedder` outputs a spatial feature map of size $[B, 200, 4, 4]$, preserving spatial structure for cross-modal interactions. For late fusion, an additional projection head maps the flattened feature map to a low-dimensional embedding vector of dimension $\texttt{emb\_dim\_late}$, enabling fusion at a nearly final representation level. In the following experiments, $\texttt{emb\_dim\_late}$ is set to 2, so that the concatenated features are already close to their final representation.
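To make the shapes concrete, here is a minimal sketch of an embedder with the channel progression described above. It assumes 3×3 convolutions, a 64×64 input (matching `IMG_SIZE`), and a single linear projection head; these details, the class name `EmbedderSketch`, and the input channel count used in the demo are assumptions, and the actual `Embedder` in the project may differ.

```python
from typing import Optional

import torch
from torch import nn


class EmbedderSketch(nn.Module):
    """Four conv blocks (in_ch -> 50 -> 100 -> 200 -> 200), each halving H and W."""

    def __init__(self, in_ch: int, emb_dim_late: Optional[int] = None, strided: bool = False):
        super().__init__()
        channels = [in_ch, 50, 100, 200, 200]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            if strided:
                # Downsample inside the convolution itself
                layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
            else:
                # Keep resolution in the convolution, then downsample by max pooling
                layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)
        # Optional projection head used only in the late-fusion setting
        self.head = None
        if emb_dim_late is not None:
            self.head = nn.Sequential(nn.Flatten(), nn.Linear(200 * 4 * 4, emb_dim_late))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)  # [B, 200, 4, 4] for a 64x64 input
        return x if self.head is None else self.head(x)


x = torch.randn(2, 3, 64, 64)  # assumed 3-channel input for illustration
print(EmbedderSketch(3)(x).shape)                  # spatial map for intermediate fusion
print(EmbedderSketch(3, emb_dim_late=2)(x).shape)  # compact vector for late fusion
```

Four halvings take a 64×64 input to 4×4, which is where the $[B, 200, 4, 4]$ feature map comes from under these assumptions.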

2. **Late Fusion Architecture**

In the Late Fusion model, RGB and LiDAR inputs are processed independently by two embedders configured to output compact embedding vectors of size $[B, \texttt{emb\_dim\_late}]$. These modality-specific embeddings are then concatenated and passed to a shared classifier. This approach enforces modality separation and enables independent feature learning for each input stream.

Late fusion is expected to be modular and robust to modality-specific noise, and to avoid the high-dimensionality issues that cross-modal interaction and concatenation at intermediate stages would cause. However, because cross-modal interactions occur only at the final stage, the model may be limited in its ability to capture fine-grained spatial correspondences between modalities.
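The late-fusion step itself can be sketched as follows, assuming a single linear layer as the shared classifier and two output logits for the cube vs. sphere task. The class name `LateFusionHeadSketch` is hypothetical, and the classifier in `LateFusionNet` may be deeper than this.

```python
import torch
from torch import nn


class LateFusionHeadSketch(nn.Module):
    """Concatenate two per-modality embeddings and classify the result."""

    def __init__(self, emb_dim_late: int = 2, num_classes: int = 2):
        super().__init__()
        # Assumed single-layer classifier over the concatenated embeddings
        self.classifier = nn.Linear(2 * emb_dim_late, num_classes)

    def forward(self, rgb_emb: torch.Tensor, lidar_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([rgb_emb, lidar_emb], dim=1)  # [B, 2 * emb_dim_late]
        return self.classifier(fused)                   # [B, num_classes]


rgb_emb = torch.randn(8, 2)    # stand-in for the RGB embedder output
lidar_emb = torch.randn(8, 2)  # stand-in for the LiDAR embedder output
logits = LateFusionHeadSketch()(rgb_emb, lidar_emb)
print(logits.shape)
```

Because each stream is reduced to a vector before the concatenation, the classifier sees no spatial layout, which is the limitation noted above.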

3. **Intermediate Fusion Architectures**

In **Intermediate Fusion**, RGB and LiDAR inputs are encoded into spatial feature maps of size $[B, 200, 4, 4]$ and fused prior to classification, enabling the network to learn joint representations with explicit spatial alignment across modalities. Three fusion operators are evaluated:

- *Element-wise Addition*
Feature maps are combined through element-wise addition, producing a tensor of size $[B, 200, 4, 4]$. This operation enforces shared semantics across modalities and is parameter-efficient and computationally inexpensive. However, it may limit expressiveness when the modalities encode complementary rather than redundant information.

- *Hadamard Product*
Element-wise multiplication combines the feature maps into a tensor of size $[B, 200, 4, 4]$, emphasizing features that are simultaneously salient in both modalities. This acts as an implicit local attention mechanism but may suppress informative features when one modality is weak, potentially affecting gradient propagation and optimization stability.

- *Channel-wise Concatenation*
Feature maps are concatenated along the channel dimension, yielding a fused representation of size $[B, 400, 4, 4]$. This strategy preserves all modality-specific information and provides the highest representational capacity, at the cost of increased parameter count, memory usage, and optimization complexity.
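The three operators can be compared directly on dummy feature maps. The sketch below uses the same mode names passed to `IntermediateFusionNet` (`'add'`, `'mul'`, `'concat'`), but the function `fuse` is a hypothetical illustration and not the project's implementation.

```python
import torch


def fuse(rgb_feat: torch.Tensor, lidar_feat: torch.Tensor, mode: str) -> torch.Tensor:
    """Fuse two [B, C, H, W] feature maps with the given operator."""
    if mode == "add":
        return rgb_feat + lidar_feat                     # [B, C, H, W]
    if mode == "mul":
        return rgb_feat * lidar_feat                     # Hadamard product, [B, C, H, W]
    if mode == "concat":
        return torch.cat([rgb_feat, lidar_feat], dim=1)  # [B, 2C, H, W]
    raise ValueError(f"Unknown fusion mode: {mode!r}")


rgb_feat = torch.randn(8, 200, 4, 4)
lidar_feat = torch.randn(8, 200, 4, 4)
for mode in ("add", "mul", "concat"):
    print(mode, tuple(fuse(rgb_feat, lidar_feat, mode).shape))
```

Only concatenation changes the channel count, so any layer that consumes the fused map needs twice as many input channels in that variant.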

---

> *Expected Trade-offs.*  
> Late fusion prioritizes modularity and robustness, whereas intermediate fusion enables earlier cross-modal interactions at the cost of increased computational and optimization complexity. Moreover, identifying the representation level at which modalities can be meaningfully aligned is not straightforward. Among intermediate fusion strategies, addition and multiplication introduce stronger inductive biases and are more efficient, although multiplication carries a risk of overfitting, whereas concatenation maximizes expressiveness at the cost of higher complexity and a greater risk of overfitting. The subsequent experiments quantitatively evaluate these trade-offs in terms of performance, convergence stability, and memory efficiency.

In the following, we run the proposed suite of experiments using the `train_fusion_cilp_model` function (implemented in `src/training.py`), logging parameters and curves to the public [handsoncv-fusion project](https://wandb.ai/handsoncv-research/handsoncv-fusion?nw=nwuserguarinovanessaemanuela). Please treat the latest runs as the main runs; earlier ones are kept to illustrate the experimentation process.
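The training cell pairs Adam with `CosineAnnealingLR`, setting `T_max` to the number of epochs. As a rough illustration, the closed-form cosine schedule is sketched below, assuming the default minimum learning rate of zero. The helper `cosine_lr` is hypothetical and only mirrors the formula, not the scheduler object itself.

```python
import math


def cosine_lr(epoch: int, base_lr: float = 1e-4, t_max: int = 20, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: decays from base_lr at epoch 0 to eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


# Learning rate at the start, midpoint, and end of the 20-epoch schedule
for epoch in (0, 10, 20):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

The schedule starts at the configured `LEARNING_RATE`, passes through half of it at the midpoint, and approaches zero at the end, which matches the decaying `learning_rate` curve in the run logs.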

In [4]:
# Configuration to fulfill logging requirement
EPOCHS = 20
LEARNING_RATE = 1e-4
SUBSET_SIZE = len(train_ds) + len(val_ds) 
LATE_FUSION_EMB_DIM = 2
INTERM_FUSION_EMB_DIM = 200

# Define Experiment Suite
strategies = [
    ("Late Fusion", LateFusionNet(emb_dim_interm=INTERM_FUSION_EMB_DIM, emb_dim_late=LATE_FUSION_EMB_DIM), "late"),
    ("Int Fusion Concat", IntermediateFusionNet(mode='concat', emb_dim_interm=INTERM_FUSION_EMB_DIM), "intermediate_concat"),
    ("Int Fusion Add", IntermediateFusionNet(mode='add', emb_dim_interm=INTERM_FUSION_EMB_DIM), "intermediate_add"),
    ("Int Fusion Mul", IntermediateFusionNet(mode='mul', emb_dim_interm=INTERM_FUSION_EMB_DIM), "intermediate_mul"),
]

results = []

for name, model, strategy_type in strategies:
    current_emb_size = LATE_FUSION_EMB_DIM if strategy_type == "late" else INTERM_FUSION_EMB_DIM
    run = wandb.init(
        project="handsoncv-fusion", 
        name=name,
        config={
            "architecture": name,
            "fusion_strategy": strategy_type,
            "embedding_size": current_emb_size,
            "learning_rate": LEARNING_RATE,
            "batch_size": BATCH_SIZE,
            "epochs": EPOCHS,
            "optimizer_type": "Adam",
            "subset_size": SUBSET_SIZE,
            "seed": splits["seed"]
        }
    )
    
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS) #T_max set to the total number of epochs
    
    print(f"Training {name}...")
    
    metrics = train_fusion_cilp_model(
        model, 
        train_loader, 
        val_loader, 
        optimizer=optimizer,
        criterion=torch.nn.CrossEntropyLoss(),
        device="cuda" if torch.cuda.is_available() else "cpu",
        epochs=EPOCHS,
        scheduler=scheduler,
        task_mode="fusion"
    )
    
    metrics['Architecture'] = name
    results.append(metrics)
    wandb.finish()

# --- Final Comparison Table (Task 3.4) ---
# Create DataFrame and reorder columns to match assignment table
df = pd.DataFrame(results)
cols = ["Architecture", "val_loss", "accuracy", "params", "sec_per_epoch", "gpu_mem_mb"]
comparison_table = df[cols]

# Display the table
print("\n" + "="*60)
print("FINAL FUSION COMPARISON TABLE")
print("="*60)
print(comparison_table.to_string(index=False))

[34m[1mwandb[0m: Currently logged in as: [33mguarino-vanessa-emanuela[0m ([33mhandsoncv-research[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Training Late Fusion...
Epoch 0: Val Loss: 0.5511, Acc: 70.27% | Mem: 377.8MB
Epoch 1: Val Loss: 0.4688, Acc: 77.53% | Mem: 377.8MB
Epoch 2: Val Loss: 0.3927, Acc: 81.50% | Mem: 377.8MB
Epoch 3: Val Loss: 0.3010, Acc: 86.82% | Mem: 377.8MB
Epoch 4: Val Loss: 0.2231, Acc: 92.23% | Mem: 377.8MB
Epoch 5: Val Loss: 0.1434, Acc: 95.02% | Mem: 377.8MB
Epoch 6: Val Loss: 0.0861, Acc: 96.79% | Mem: 377.8MB
Epoch 7: Val Loss: 0.0553, Acc: 97.97% | Mem: 377.8MB
Epoch 8: Val Loss: 0.0387, Acc: 98.56% | Mem: 377.8MB
Epoch 9: Val Loss: 0.0285, Acc: 99.16% | Mem: 377.8MB
Epoch 10: Val Loss: 0.0263, Acc: 99.32% | Mem: 377.8MB
Epoch 11: Val Loss: 0.0366, Acc: 98.56% | Mem: 377.8MB
Epoch 12: Val Loss: 0.0217, Acc: 99.24% | Mem: 377.8MB
Epoch 13: Val Loss: 0.0111, Acc: 99.83% | Mem: 377.8MB
Epoch 14: Val Loss: 0.0171, Acc: 99.32% | Mem: 377.8MB
Epoch 15: Val Loss: 0.0114, Acc: 99.66% | Mem: 377.8MB
Epoch 16: Val Loss: 0.0098, Acc: 99.83% | Mem: 377.8MB
Epoch 17: Val Loss: 0.0102, Acc: 99.75% | Mem: 377.

accuracy,▁▃▄▅▆▇▇█████████████
epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
epoch_time_sec,▆▂▂▂▂▂▁▂▁▂▂▃▂▂▂▃▂▅▃█
learning_rate,████▇▇▇▆▆▅▄▄▃▃▂▂▂▁▁▁
peak_gpu_mem_mb,▁███████████████████
train_loss,█▆▅▅▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁
val_loss,█▇▆▅▄▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁

accuracy,99.5777
epoch,19.0
epoch_time_sec,6.39039
learning_rate,0.0
peak_gpu_mem_mb,377.8252
train_loss,0.00424
val_loss,0.01089


Training Int Fusion Concat...
Epoch 0: Val Loss: 0.5065, Acc: 74.58% | Mem: 505.3MB
Epoch 1: Val Loss: 0.3979, Acc: 81.42% | Mem: 505.3MB
Epoch 2: Val Loss: 0.1699, Acc: 93.33% | Mem: 505.3MB
Epoch 3: Val Loss: 0.0391, Acc: 98.65% | Mem: 505.3MB
Epoch 4: Val Loss: 0.0256, Acc: 98.90% | Mem: 505.3MB
Epoch 5: Val Loss: 0.0104, Acc: 99.75% | Mem: 505.3MB
Epoch 6: Val Loss: 0.0111, Acc: 99.66% | Mem: 505.3MB
Epoch 7: Val Loss: 0.0042, Acc: 99.83% | Mem: 505.3MB
Epoch 8: Val Loss: 0.0035, Acc: 99.92% | Mem: 505.3MB
Epoch 9: Val Loss: 0.0032, Acc: 99.83% | Mem: 505.3MB
Epoch 10: Val Loss: 0.0040, Acc: 99.83% | Mem: 505.3MB
Epoch 11: Val Loss: 0.0036, Acc: 99.83% | Mem: 505.3MB
Epoch 12: Val Loss: 0.0035, Acc: 99.83% | Mem: 505.3MB
Epoch 13: Val Loss: 0.0032, Acc: 99.83% | Mem: 505.3MB
Epoch 14: Val Loss: 0.0038, Acc: 99.83% | Mem: 505.3MB
Epoch 15: Val Loss: 0.0038, Acc: 99.83% | Mem: 505.3MB
Epoch 16: Val Loss: 0.0036, Acc: 99.83% | Mem: 505.3MB
Epoch 17: Val Loss: 0.0033, Acc: 99.83% | Mem

accuracy,▁▃▆█████████████████
epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
epoch_time_sec,▂▃▂▂▂▂▂▂▂▁▁▁▂▁▁▂▁▂▁█
learning_rate,████▇▇▇▆▆▅▄▄▃▃▂▂▂▁▁▁
peak_gpu_mem_mb,▁███████████████████
train_loss,█▆▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
val_loss,█▆▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

accuracy,99.83108
epoch,19.0
epoch_time_sec,6.17162
learning_rate,0.0
peak_gpu_mem_mb,505.28174
train_loss,0.00023
val_loss,0.00341


Training Int Fusion Add...
Epoch 0: Val Loss: 0.5319, Acc: 75.93% | Mem: 510.4MB
Epoch 1: Val Loss: 0.4354, Acc: 78.63% | Mem: 510.4MB
Epoch 2: Val Loss: 0.3588, Acc: 84.46% | Mem: 510.4MB
Epoch 3: Val Loss: 0.0451, Acc: 99.16% | Mem: 510.4MB
Epoch 4: Val Loss: 0.0108, Acc: 99.92% | Mem: 510.4MB
Epoch 5: Val Loss: 0.0063, Acc: 99.92% | Mem: 510.4MB
Epoch 6: Val Loss: 0.0022, Acc: 100.00% | Mem: 510.4MB
Epoch 7: Val Loss: 0.0015, Acc: 100.00% | Mem: 510.4MB
Epoch 8: Val Loss: 0.0015, Acc: 100.00% | Mem: 510.4MB
Epoch 9: Val Loss: 0.0016, Acc: 100.00% | Mem: 510.4MB
Epoch 10: Val Loss: 0.0012, Acc: 100.00% | Mem: 510.4MB
Epoch 11: Val Loss: 0.0012, Acc: 100.00% | Mem: 510.4MB
Epoch 12: Val Loss: 0.0010, Acc: 100.00% | Mem: 510.4MB
Epoch 13: Val Loss: 0.0007, Acc: 100.00% | Mem: 510.4MB
Epoch 14: Val Loss: 0.0007, Acc: 100.00% | Mem: 510.4MB
Epoch 15: Val Loss: 0.0007, Acc: 100.00% | Mem: 510.4MB
Epoch 16: Val Loss: 0.0007, Acc: 100.00% | Mem: 510.4MB
Epoch 17: Val Loss: 0.0007, Acc: 100.

accuracy,▁▂▃█████████████████
epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
epoch_time_sec,█▇█▅▁▁▃▇▄▁▂▂▂▂▂▂▁▁▁▂
learning_rate,████▇▇▇▆▆▅▄▄▃▃▂▂▂▁▁▁
peak_gpu_mem_mb,▁███████████████████
train_loss,█▆▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
val_loss,█▇▆▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

accuracy,100.0
epoch,19.0
epoch_time_sec,5.29866
learning_rate,0.0
peak_gpu_mem_mb,510.44971
train_loss,0.00017
val_loss,0.00067


Training Int Fusion Mul...
Epoch 0: Val Loss: 0.4381, Acc: 79.31% | Mem: 565.7MB
Epoch 1: Val Loss: 0.1454, Acc: 93.75% | Mem: 565.7MB
Epoch 2: Val Loss: 0.0431, Acc: 98.82% | Mem: 565.7MB
Epoch 3: Val Loss: 0.0538, Acc: 98.56% | Mem: 565.7MB
Epoch 4: Val Loss: 0.0295, Acc: 99.49% | Mem: 565.7MB
Epoch 5: Val Loss: 0.0227, Acc: 99.49% | Mem: 565.7MB
Epoch 6: Val Loss: 0.0151, Acc: 99.58% | Mem: 565.7MB
Epoch 7: Val Loss: 0.0121, Acc: 99.75% | Mem: 565.7MB
Epoch 8: Val Loss: 0.2296, Acc: 94.43% | Mem: 565.7MB
Epoch 9: Val Loss: 0.0397, Acc: 99.16% | Mem: 565.7MB
Epoch 10: Val Loss: 0.0398, Acc: 99.16% | Mem: 565.7MB
Epoch 11: Val Loss: 0.0262, Acc: 99.32% | Mem: 565.7MB
Epoch 12: Val Loss: 0.0223, Acc: 99.58% | Mem: 565.7MB
Epoch 13: Val Loss: 0.0203, Acc: 99.58% | Mem: 565.7MB
Epoch 14: Val Loss: 0.0239, Acc: 99.49% | Mem: 565.7MB
Epoch 15: Val Loss: 0.0237, Acc: 99.49% | Mem: 565.7MB
Epoch 16: Val Loss: 0.0241, Acc: 99.49% | Mem: 565.7MB
Epoch 17: Val Loss: 0.0241, Acc: 99.49% | Mem: 5

accuracy,▁▆██████▆███████████
epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
epoch_time_sec,▇▂▃▇▇▇▅▁▁▂▁▁▂▁▁▁▁▂▁█
learning_rate,████▇▇▇▆▆▅▄▄▃▃▂▂▂▁▁▁
peak_gpu_mem_mb,▁███████████████████
train_loss,█▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
val_loss,█▃▂▂▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁

accuracy,99.49324
epoch,19.0
epoch_time_sec,6.11958
learning_rate,0.0
peak_gpu_mem_mb,565.68018
train_loss,0.00022
val_loss,0.02441



FINAL FUSION COMPARISON TABLE
     Architecture  val_loss   accuracy   params  sec_per_epoch  gpu_mem_mb
      Late Fusion  0.010889  99.577703 13694510       5.610209  377.825195
Int Fusion Concat  0.003413  99.831081 13627934       5.460804  505.281738
   Int Fusion Add  0.000675 100.000000  7074334       5.503140  510.449707
   Int Fusion Mul  0.024415  99.493243  7074334       5.552202  565.680176
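
To read the table in relative terms, the short sketch below recomputes parameter and peak-memory ratios against the Late Fusion row. The numbers are copied from the table above; the variable names are hypothetical, and the results reflect a single seed only.

```python
# Rows copied from the final comparison table:
# (architecture, val_loss, accuracy %, params, sec_per_epoch, gpu_mem_mb)
rows = [
    ("Late Fusion", 0.010889, 99.577703, 13694510, 5.610209, 377.825195),
    ("Int Fusion Concat", 0.003413, 99.831081, 13627934, 5.460804, 505.281738),
    ("Int Fusion Add", 0.000675, 100.000000, 7074334, 5.503140, 510.449707),
    ("Int Fusion Mul", 0.024415, 99.493243, 7074334, 5.552202, 565.680176),
]

baseline = rows[0]  # Late Fusion is used as the reference row

# Architecture with the lowest final validation loss
best_by_loss = min(rows, key=lambda r: r[1])[0]

# Parameter count and peak GPU memory relative to the baseline
relative = {
    name: {"params_ratio": params / baseline[3], "mem_ratio": mem / baseline[5]}
    for name, _, _, params, _, mem in rows
}

print("Lowest validation loss:", best_by_loss)
for name, r in relative.items():
    print(f"{name:>18}: params x{r['params_ratio']:.2f}, peak memory x{r['mem_ratio']:.2f}")
```

In this run, the addition variant reaches the lowest validation loss with roughly half the parameters of late fusion, while all intermediate variants use more peak GPU memory than the late-fusion model.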
