# Overview: Fusion Architecture Comparison

This notebook implements and evaluates several multimodal fusion strategies for RGB + LiDAR classification (cube vs. sphere).
We compare early fusion, intermediate fusion (with multiple variants), and late fusion to understand their trade-offs in parameter count, performance, and training behavior.

The goal of this notebook is to:
* Build modality-specific or shared encoders
* Implement different fusion strategies
* Train models using identical settings
* Log results with Weights & Biases
* Produce a comparison table and loss curves



**The Architecture Flow:**

```
RGB Input (4ch)       LiDAR Input (4ch)
      │                     │
[RGB Encoder]         [XYZ Encoder]    <-- Learn specific features independently
      │                     │
  RGB Features          XYZ Features   <-- (e.g. 128 channels each)
      └──────────┬──────────┘
                 │
           Concatenation               <-- Fuse at the "Feature Level"
                 │
       [Classification Head]           <-- Learn relationships between features
                 │
      Output (cube / sphere)
```

# Initial Setup

The project repository is mounted from Google Drive and added to the Python path to allow clean imports from the src module. The dataset is copied to the local Colab filesystem to improve I/O performance during training. All global settings (random seed, device selection, paths, batch sizes) are defined once and reused across the notebook to ensure consistency and reproducibility.

Weights & Biases is initialized for experiment tracking, and all training stages use the same precomputed dataset statistics and DataLoaders for fair comparison across models.
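
The seeding mentioned above is handled by `set_seeds` from `src.utility`, whose implementation is not shown in this notebook. As a rough sketch of what such a helper typically does, the hypothetical `set_seeds_sketch` below seeds the Python, NumPy, and PyTorch random number generators; the function name and body are assumptions, not the project's code.

```python
import random

import numpy as np
import torch


def set_seeds_sketch(seed: int) -> None:
    """Hypothetical stand-in for src.utility.set_seeds (assumed behavior)."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch RNG
    torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored without CUDA


# Re-seeding before each draw reproduces the same random numbers.
set_seeds_sketch(42)
first = torch.rand(3)
set_seeds_sketch(42)
second = torch.rand(3)
same_draw = torch.equal(first, second)
```

Calling the helper again before each training phase, as the notebook does with `set_seeds(SEED)`, resets the generators to the same starting state.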

In [None]:
import sys
from pathlib import Path

from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/Applied-Computer-Vision-Projects/Multimodal_Learning_02/"

PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

In [None]:
%%capture
# Install dependencies (the %%capture cell magic must be the first line of the cell)
%pip install --no-cache-dir -r requirements.txt

In [None]:
import os
from pathlib import Path
from google.colab import userdata

import torch
import torch.nn as nn
import torchvision.transforms.v2 as transforms
from torch.optim import Adam

import wandb
import matplotlib.pyplot as plt

In [None]:
from src.config import (SEED, NUM_WORKERS, BATCH_SIZE, IMG_SIZE, DRIVE_ROOT,
                        NUM_CLASSES, RAW_DATA, CHECKPOINTS, DEVICE, VALID_BATCHES)
from src.utility import set_seeds, init_wandb, compute_embedding_size
from src.datasets import get_dataloaders, get_train_stats
from src.training import get_early_inputs, get_inputs, train_model
from src.visualization import build_fusion_comparison_df, plot_val_losses
from src.models import (EarlyFusionModel, ConcatIntermediateNet, AddIntermediateNet,
                        MatmulIntermediateNet, HadamardIntermediateNet, LateNet)

In [None]:
# Copy the dataset from Drive to the local Colab filesystem for faster I/O.
# DRIVE_ROOT comes from src.config, so this cell must run after the import cell above.
!rm -rf /content/data
!cp -r "$DRIVE_ROOT/data/assessment" /content/data

In [None]:
# Fusion specific constants
EPOCHS = 15
LR = 0.0001

In [None]:
# Usage: Call this function at the beginning and before each training phase
set_seeds(SEED)

In [None]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()

wandb: Currently logged in as: michele-marschner (michele-marschner-university-of-potsdam) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

# Loading and preparation of Data

This section computes normalization statistics, defines input transforms, and constructs training, validation, and test dataloaders. It ensures that all fusion models receive consistent and correctly processed input data.
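
To make the normalization step concrete, the sketch below computes per-channel mean and standard deviation for a random tensor and applies the same `(x - mean) / std` operation that `transforms.Normalize` performs. The batch shape and the random data are illustrative assumptions; the actual statistics come from `get_train_stats`, whose implementation is not shown here.

```python
import torch

# Illustrative batch: 16 images, 4 channels, 64x64 pixels, values in [0, 1].
images = torch.rand(16, 4, 64, 64)

# Reduce over batch, height, and width so one value per channel remains.
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

# Per-channel normalization: (x - mean) / std, broadcast over B, H, W.
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]

# After normalization each channel has roughly zero mean and unit std.
channel_means = normalized.mean(dim=(0, 2, 3))
channel_stds = normalized.std(dim=(0, 2, 3))
```

Computing the statistics on the training split only, and reusing them for validation and test data, keeps information from the held-out sets out of the preprocessing.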

In [None]:
# Load the mean and std of the RGB training data from file, or compute them if they are not available yet.
# Recompute both statistics whenever the dataset or the training split changes.
mean, std = get_train_stats(dir=DRIVE_ROOT, img_size=IMG_SIZE, data_dir=RAW_DATA)

Scanning dataset in /content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/data...
cubes: 2501 RGB files found. Matching XYZA...
spheres: 9999 RGB files found. Matching XYZA...
Preloading LiDAR XYZA tensors into RAM...
Loading XYZA:   0%|          | 38/12500 [00:28<2:35:57,  1.33it/s]

In [None]:
img_transforms = transforms.Compose([
    transforms.ToImage(),   # Converts the input to a tensor image (no value scaling)
    transforms.Resize(IMG_SIZE),
    transforms.ToDtype(torch.float32, scale=True),   # Casts to float32 and scales values into [0, 1]
    transforms.Normalize(mean, std)
])

In [None]:
set_seeds(SEED)

train_data, train_dataloader, valid_data, val_dataloader, test_data, test_dataloader = get_dataloaders(
    str(RAW_DATA),
    VALID_BATCHES,
    test_frac=0.10,
    batch_size=BATCH_SIZE,
    img_transforms=img_transforms,
    num_workers=NUM_WORKERS,
    seed=SEED
)

for i, sample in enumerate(train_data):
    print(i, *(x.shape for x in sample))
    break

# Models

This section introduces the different multimodal fusion strategies implemented in the experiment. It explains the conceptual differences between early fusion, intermediate fusion (with several variants), and late fusion, setting the stage for the comparative evaluation.

The detailed use cases, advantages, and limitations of each fusion strategy are described in the sections below.

## Early Fusion Model

**Concept:** Modalities are fused before any deep processing — usually by concatenating channels or inputs.

```
[RGB , LiDAR/XYZ]  ──concat──>  x ∈ ℝ^{(C_rgb+C_lidar)×H×W}
                                   │
                                   ▼
                          Shared CNN backbone
                                   │
                                   ▼
                               Output
```



**Advantages:**

* **Captures Early Cross-Modal Interactions:** Learns joint low-level correlations directly from raw signals.
* **Simple & Lightweight:** Easiest fusion method to implement; minimal architectural overhead.
* **Effective with Perfect Alignment:** Works well when modalities are tightly synchronized and spatially aligned.

**Limitations:**

* **Noise Sensitivity:** One noisy or corrupted modality directly contaminates the shared feature space.
* **Strict Alignment Requirement:** Modalities must have matching spatial resolution, alignment, and synchronization.
* **Feature Space Mismatch:** Raw modalities differ in scale, units, and distribution; one modality can dominate without careful normalization.
* **High Input Dimensionality:** Channel concatenation increases the input size and can require more data and compute to train effectively.
* **Limited Flexibility:** Assumes combining low-level signals is beneficial; may underperform when modalities carry different types of information.
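
The early-fusion idea described above can be sketched in a few lines of PyTorch: both modalities are concatenated along the channel axis and passed through a single shared backbone. `TinyEarlyFusion` is a hypothetical toy model for illustration only; it is not the `EarlyFusionModel` from `src.models`, and its layer sizes are assumptions. Only the 4 + 4 = 8 input channels are taken from this notebook.

```python
import torch
import torch.nn as nn


class TinyEarlyFusion(nn.Module):
    """Illustrative early-fusion classifier (hypothetical, not EarlyFusionModel)."""

    def __init__(self, in_ch: int = 8, num_classes: int = 2):
        super().__init__()
        # One shared backbone sees all channels of both modalities at once.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(16, num_classes)

    def forward(self, rgb: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # Fusion happens here, before any learned processing: (B, 4+4, H, W).
        x = torch.cat([rgb, xyz], dim=1)
        return self.head(self.backbone(x))


rgb = torch.rand(2, 4, 64, 64)   # dummy RGB batch (4 channels)
xyz = torch.rand(2, 4, 64, 64)   # dummy LiDAR batch (4 channels)
logits = TinyEarlyFusion()(rgb, xyz)
```

Because the concatenation requires identical spatial dimensions, this sketch also shows why early fusion depends on pixel-aligned inputs.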

## Intermediate Fusion Model

**Concept:** Each modality has its own encoder / feature extractor, and fusion happens after some layers but before classification.

```
e.g. Addition:

RGB image ──> RGB CNN Encoder ──> f_rgb ∈ ℝ^{C×H'×W'}
                                     |
LiDAR/XYZ ─> LiDAR CNN Encoder ─> f_lidar ∈ ℝ^{C×H'×W'}
                                     |
                                     v
                           Fuse: f = f_rgb + f_lidar (element-wise add)
                                     |
                                     v
                          Shared head / classifier / projector
```



**Advantages:**

* **Specialized Processing:** Each modality gets its own encoder, tailored to its characteristics.
* **Learned Representations:** Fusion occurs on higher-level, more discriminative features rather than raw data.
* **Flexible Design:** The fusion point can be chosen at different network depths, allowing fine-grained architectural control.
* **Easily Extendable:** New modalities can be added by including additional modality-specific branches.


**Limitations:**

* **Architectural Complexity:** Requires designing separate modality-specific encoders and choosing an appropriate fusion point.
* **Higher Computational Cost:** More expensive than early fusion due to duplicated feature extractors.
* **Fusion Design Sensitivity:** Performance depends on the chosen fusion mechanism (concat, addition, multiplicative, bilinear, attention), which often requires experimentation.
* **Depth Selection Challenge:** Deciding how much unimodal processing to perform before fusion can be non-trivial and task-dependent.

Four intermediate-fusion variants are implemented:

* Concatenation
* Addition
* Hadamard product (element-wise multiplication)
* Matrix multiplication



| Fusion Method | Advantages | Limitations |
|---------------|------------|-------------|
| **Concatenation** | - Very expressive and flexible<br>- Lets the network learn arbitrary cross-modal interactions<br>- Robust and widely used baseline | - Doubles channel count → more parameters & memory<br>- Computationally heavier<br>- Fusion is unguided; model must discover interactions itself |
| **Addition** | - Lightweight (no increase in channels)<br>- Fast and parameter-efficient<br>- Enforces similar feature spaces between modalities | - Assumes features are aligned and comparable<br>- One noisy modality corrupts the other<br>- Sensitive to scale differences between modalities |
| **Multiplicative (Hadamard Product)** | - Gating effect: highlights features important in *both* modalities<br>- More expressive than addition, cheaper than concat<br>- Natural for attention-like fusion | - Suppresses features when one modality has low magnitude<br>- Requires careful normalization<br>- Can amplify noise if both activations are high |
| **Matrix Multiplication (Bilinear-like)** | - Captures rich pairwise correlations between modalities<br>- Most expressive among all four<br>- Enables true 2nd-order interaction learning | - Very heavy in compute & memory<br>- Requires flattening or dimensionality reduction<br>- Easily overfits; harder to train and tune |
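
The four fusion operators compared in the table can be illustrated on two dummy feature maps. The shapes below (128 channels, 8×8 spatial size) follow `FEATURE_DIM = 128` and `spatial=(8, 8)` used later in this notebook. How the `src.models` classes apply these operators internally is not shown here, so the batched per-channel matrix product in particular is an assumption about the matmul variant.

```python
import torch

B, C, H, W = 2, 128, 8, 8
f_rgb = torch.rand(B, C, H, W)     # dummy RGB encoder output
f_lidar = torch.rand(B, C, H, W)   # dummy LiDAR encoder output

fused = {
    # Stack channels: the head sees both feature maps side by side.
    "concat": torch.cat([f_rgb, f_lidar], dim=1),
    # Element-wise sum: channel count unchanged.
    "add": f_rgb + f_lidar,
    # Element-wise (Hadamard) product: acts like a multiplicative gate.
    "hadamard": f_rgb * f_lidar,
    # Batched matrix product of the (H x W) maps per channel; needs H == W.
    "matmul": torch.matmul(f_rgb, f_lidar),
}

shapes = {name: tuple(t.shape) for name, t in fused.items()}
# Size of the flattened embedding that a fully connected head would receive.
flat_sizes = {name: t.flatten(1).shape[1] for name, t in fused.items()}
```

Only concatenation changes the embedding size (16384 instead of 8192 here), which is why it is the one variant that enlarges the layers that follow the fusion point.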


## Late Fusion Model

**Concept:** Each modality is processed completely separately, and only the final predictions or high-level embeddings are fused.

```
RGB (C_rgb×H×W)  -> RGB Encoder   -> z_rgb
                                  \
                                   -> Fuse at decision level -> output
                                  /
LiDAR/XYZ (C_l×H×W) -> LiDAR Encoder -> z_lidar
```

**Advantages:**

* **Robust to Missing Modalities:** The system can still operate if one modality is noisy, unreliable, or absent.
* **Best for Heterogeneous Modalities:** Works well when modalities differ greatly.
* **Modular & Simple:** Unimodal models can be trained, debugged, and replaced independently.
* **Leverages Existing Models:** Allows the reuse of strong off-the-shelf unimodal experts without architectural changes.


**Limitations:**

* **Missed Interactions:** No joint feature learning — modalities never influence each other during representation learning.
* **Limited Expressiveness:** Simple fusion rules (e.g., averaging, weighted sum) cannot capture complex cross-modal relationships.
* **Information Loss:** By the time unimodal predictors output logits/embeddings, rich spatial and semantic details may already be discarded, limiting the power of fusion.
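
A minimal sketch of decision-level fusion, assuming the simplest possible rule (averaging the logits of two independent unimodal classifiers). The helper `tiny_classifier` and its layer sizes are hypothetical, and `LateNet` from `src.models` may fuse its branches differently.

```python
import torch
import torch.nn as nn


def tiny_classifier(in_ch: int, num_classes: int = 2) -> nn.Sequential:
    """Hypothetical unimodal classifier used only for illustration."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, num_classes),
    )


rgb_net = tiny_classifier(4)     # sees only RGB
lidar_net = tiny_classifier(4)   # sees only LiDAR

rgb = torch.rand(2, 4, 64, 64)
xyz = torch.rand(2, 4, 64, 64)

z_rgb = rgb_net(rgb)       # unimodal logits, shape (B, num_classes)
z_lidar = lidar_net(xyz)

# Decision-level fusion: the branches interact only through their outputs.
fused_logits = 0.5 * (z_rgb + z_lidar)
prediction = fused_logits.argmax(dim=1)
```

If one modality is missing at inference time, the remaining branch's logits can still be used on their own, which is the robustness property listed above.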

# Model Training

Here we train each fusion model using the same dataset, optimizer, and loss function. The training loop logs metrics to W&B, saves checkpoints, and stores results for later comparison. This is the core experimental section of the notebook.

In [None]:
FEATURE_DIM = 128

set_seeds(SEED)

#class_weights = compute_class_weights(train_data, NUM_CLASSES).to(DEVICE)
#loss_func = nn.CrossEntropyLoss(weight=class_weights.to(DEVICE))
loss_func = nn.CrossEntropyLoss()

metrics = {}   # store losses for each model

# Define fusion models to train and compare
models_to_train = {
    "early_fusion": EarlyFusionModel(in_ch=8, output_dim=NUM_CLASSES).to(DEVICE),
    "intermediate_fusion_concat": ConcatIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "intermediate_fusion_matmul": MatmulIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "intermediate_fusion_hadamard": HadamardIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "intermediate_fusion_add": AddIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "late_fusion": LateNet(4, 4, output_dim=NUM_CLASSES).to(DEVICE),
}

# === Main experiment loop over all fusion strategies ===
for name, model in models_to_train.items():
  model_save_path = CHECKPOINTS / f"{name}.pt"

  # Number of trainable parameters (for the comparison table)
  num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

  opt = Adam(model.parameters(), lr=LR)

  embedding_size = compute_embedding_size(name, FEATURE_DIM, spatial=(8, 8))

  # Initialize a new Weights & Biases run for this model.
  init_wandb(
      model=model,
      name=name,
      embedding_size=embedding_size,
      fusion_name=name,
      num_params=num_params,
      opt_name=opt.__class__.__name__,
  )

  # Choose the proper input function depending on the fusion strategy:
  if name.startswith("early_fusion"):
    input_fn = get_early_inputs
  else:
    input_fn = get_inputs

  results = train_model(
    model=model,
    optimizer=opt,
    input_fn=input_fn,
    epochs=EPOCHS,
    loss_fn=loss_func,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    model_save_path=model_save_path,
    target_idx=-1,   # last element in batch is target
    log_to_wandb=True,
    device=DEVICE
  )

  metrics[name] = results

  # End wandb run before starting the next model
  wandb.finish()

# Evaluation

This section visualizes validation losses across all fusion strategies, builds the fusion comparison table, and logs the aggregated results to W&B.

It provides the final quantitative comparison between early, intermediate, and late fusion methods.
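
The comparison table is built by `build_fusion_comparison_df`, which is not shown in this notebook. As a rough illustration of the two loss columns, the sketch below derives an average and a best validation loss from per-epoch histories stored under the `"valid_losses"` key, the same key the plotting code uses. The function `summarize`, the model names, and the loss values are made up for the example.

```python
# Illustrative loss histories (made-up numbers, not results from this notebook).
metrics_example = {
    "model_a": {"valid_losses": [0.9, 0.4, 0.2, 0.1]},
    "model_b": {"valid_losses": [0.8, 0.5, 0.3, 0.3]},
}


def summarize(metrics: dict) -> list:
    """Hypothetical helper: one summary row per model."""
    rows = []
    for name, m in metrics.items():
        losses = m["valid_losses"]
        rows.append({
            "model": name,
            "avg_valid_loss": sum(losses) / len(losses),  # mean over all epochs
            "best_valid_loss": min(losses),               # lowest epoch value
        })
    return rows


summary = summarize(metrics_example)
```

The average is influenced by the slow early epochs, while the best value reflects a single epoch, so the two columns can rank models differently.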

In [None]:
name_map = {
    "early_fusion": "Early Fusion",
    "late_fusion": "Late Fusion",
    "intermediate_fusion_concat": "Intermediate (Concat)",
    # Matrix-multiplication variant (MatmulIntermediateNet), shown as "Multiplicative" in the table
    "intermediate_fusion_matmul": "Intermediate (Multiplicative)",
    "intermediate_fusion_hadamard": "Intermediate (Hadamard)",
    "intermediate_fusion_add": "Intermediate (Add)",
}

In [None]:
# Build comparison overview
df_comparison = build_fusion_comparison_df(metrics, name_map)

# Start a separate W&B run for the aggregated comparison
wandb.init(
    project="cilp-extended-assessment",
    name="fusion_comparison_all",
    job_type="analysis",
)

# Log comparison table and loss curves to wandb
fusion_comparison_table = wandb.Table(dataframe=df_comparison)
wandb.log({"fusion_comparison": fusion_comparison_table})

loss_dict = {name: m["valid_losses"] for name, m in metrics.items()}
fig, ax = plot_val_losses(loss_dict, title="Validation Loss per Model")
plt.show()

wandb.log({"fusion/val_loss_curves": wandb.Image(fig)})
plt.close(fig)

wandb.finish()

In [None]:
df_comparison

## Evaluation of Fusion Strategies

| index | Fusion Strategy | Avg Valid Loss | Best Valid Loss | Num of params | Avg time per epoch (min:s) | GPU Memory (MB, max) |
|---|---|---|---|---|---|---|
| 0 | Early Fusion | 0.0047 | 1.2874e-06 | 8387990 | 10.0998 | 497.7852 |
| 1 | Intermediate (Concat) | 0.0057 | 5.0663e-07 | 16672374 | 15.8507 | 672.7407 |
| 2 | Intermediate (Multiplicative) | 0.0069 | 1.3186e-06 | 8480374 | 13.6595 | 643.0933 |
| 3 | Intermediate (Hadamard) | 0.0023 | 1.4230e-06 | 8480374 | 13.2025 | 675.4458 |
| 4 | Intermediate (Add) | 0.0029 | 1.0505e-07 | 8480374 | 13.1532 | 707.7983 |
| 5 | Late Fusion | 0.0070 | 1.5497e-07 | 16672374 | 15.6503 | 833.4009 |

The evaluation indicates that the additive and Hadamard intermediate fusion variants provide the most favorable balance between performance and computational cost, while early and late fusion exhibit clear trade-offs. Among all tested models, Intermediate Fusion (Add) achieves the lowest best validation loss (1.05 × 10⁻⁷), which suggests effective alignment between RGB and LiDAR representations. Intermediate Fusion (Hadamard) yields the lowest average validation loss (0.0023), suggesting particularly stable convergence across training epochs. Not every intermediate variant performs well, however: the concatenation and matrix-multiplication variants (the latter labeled "Multiplicative" in the table) have higher average validation losses (0.0057 and 0.0069) than Early Fusion (0.0047).

Both additive and Hadamard intermediate fusion operate with a moderate parameter count of approximately 8.48 million, roughly half that of the concatenation-based and late fusion models (~16.7M). Intermediate Fusion (Concat) and Late Fusion also train more slowly, with average epoch times of about 15.9 and 15.7 versus roughly 13.2 for the additive and Hadamard variants. Late Fusion has the highest peak GPU memory (833 MB), whereas the peak memory of Concat (673 MB) is in the same range as that of Hadamard (675 MB) and Add (708 MB). Neither of the two larger models delivers a corresponding improvement in average validation loss, which highlights that increased model capacity alone does not guarantee better cross-modal alignment.
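
The parameter gap between the ~8.48M and ~16.7M models can be checked with simple arithmetic. The sketch below assumes that the only difference is the input width of the first fully connected layer after the flattened embedding (128 × 8 × 8 = 8192 features for a single feature map, twice that for concatenation). The hidden width of 1000 units is an inference from the reported numbers, not something confirmed by the model code.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


feature_dim, spatial = 128, (8, 8)
emb_single = feature_dim * spatial[0] * spatial[1]   # 8192: add / Hadamard
emb_concat = 2 * emb_single                          # 16384: concatenation

hidden_units = 1000  # ASSUMPTION: inferred from the gap, not from src.models

# Extra parameters caused by doubling the input width of that layer.
extra = linear_params(emb_concat, hidden_units) - linear_params(emb_single, hidden_units)

# Gap between the parameter counts reported in the comparison table.
reported_gap = 16_672_374 - 8_480_374
```

Under this assumption the extra 8,192,000 weights match the reported gap exactly, which suggests that the additional capacity sits almost entirely in the first dense layer and not in the convolutional encoders.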

Early Fusion has the lowest computational footprint, with the shortest average epoch time and the lowest GPU memory usage. However, it converges more slowly and shows a higher average validation loss than the additive and Hadamard variants (0.0047 versus 0.0029 and 0.0023), likely because the model must learn cross-modal relationships directly from low-level features, which can be challenging when modalities differ significantly in structure or noise characteristics.

Overall, taking previous experimental runs into account, Intermediate Fusion (Add) emerges as the most effective architecture. It combines excellent validation performance, stable convergence, and efficient parameter usage, making it the most robust and practical choice in this setting.

**When to Use Each Fusion Strategy:**

*Early Fusion:*
Best suited for closely aligned, low-level modalities with similar structure and noise characteristics. While computationally efficient, it is less robust when inputs are noisy or heterogeneous.

*Intermediate Fusion:*
Ideal for modalities with different structures that benefit from separate early processing to learn modality-specific features. It provides the best overall balance between performance, flexibility, and efficiency.

*Late Fusion:*
Most appropriate when strong unimodal predictors already exist or when robustness to missing or unreliable modalities is required. It serves as a reliable fallback but comes at a higher computational cost.