# Overview: Fusion Architecture Comparison

This notebook implements and evaluates several multimodal fusion strategies for RGB + LiDAR classification (cube vs. sphere).
We compare early fusion, intermediate fusion (with multiple variants), and late fusion to understand their trade-offs in parameter count, performance, and training behavior.

The goal of this notebook is to:
* Build modality-specific or shared encoders
* Implement different fusion strategies
* Train models using identical settings
* Log results with Weights & Biases
* Produce a comparison table and loss curves

---
This section provides a high-level overview of the multimodal architecture used throughout the notebook. It illustrates how RGB and LiDAR inputs flow through their respective encoders and where fusion occurs. The diagrams help clarify the motivation behind comparing early, intermediate, and late fusion strategies.



```
RGB ----> RGB Encoder ----\
                            ----> Fusion ---> Classifier ---> cube/sphere
LiDAR -> LiDAR Encoder ----/
```



**The Architecture Flow:**

```
RGB Input (4ch)       LiDAR Input (4ch)
      │                     │
[RGB Encoder]         [XYZ Encoder]    <-- Learn specific features independently
      │                     │
  RGB Features          XYZ Features   <-- (e.g. 128 channels each)
      └──────────┬──────────┘
                 │
           Concatenation               <-- Fuse at the "Feature Level"
                 │
         [Classifier Head]             <-- Learn relationships between features
                 │
         Output (cube/sphere)
```

Multimodal fusion refers to how we combine information from different modalities (e.g., RGB and LiDAR).
There are three canonical levels of fusion:

* **Early fusion:** combine raw inputs or low-level features before any deep processing
* **Intermediate fusion:** combine learned feature representations from modality-specific encoders
* **Late fusion:** combine decisions or latent vectors at the end of the pipeline, much like an ensemble in which each unimodal model casts a weighted vote on the final result

Each level has different strengths and limitations.

# Setup

This section installs required packages, imports all necessary modules, and prepares utility functions from the project's src/ folder. It ensures that the notebook runs cleanly in Colab or a local environment and establishes a consistent base for all subsequent experiments.


## Installations & Imports

Here we import PyTorch, torchvision, W&B, the fusion models, training utilities, and dataset handlers. This cell initializes all dependencies required for running the fusion experiments.

In [None]:
import sys
from pathlib import Path

from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/"

PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

In [None]:
# Install dependencies
%pip install -r requirements.txt

In [None]:
# %pip install wandb fiftyone==1.10.0 sympy==1.12 torch==2.9.0 torchvision==0.20.0 numpy open-clip-torch

In [None]:
import os
from pathlib import Path
from google.colab import userdata

import torch
import torch.nn as nn
import torchvision.transforms.v2 as transforms
from torch.optim import Adam

import wandb
import matplotlib.pyplot as plt

## Data Paths

This section defines the filesystem layout used throughout the notebook. Google Drive is mounted to access pretrained checkpoints and datasets, and paths for data loading and result storage are established. All model loading, image saving, and logging rely on these paths.

In [None]:
!rm -rf /content/data
!cp -r "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/data_transformed" /content/data
# !cp -r "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/data/assessment" /content/data

In [None]:
from src.config import (SEED, NUM_WORKERS, BATCH_SIZE, IMG_SIZE, N, LABEL_MAP,
                        CLASSES, NUM_CLASSES, RAW_DATA, CHECKPOINTS, DEVICE, VALID_BATCHES)
from src.utility import set_seeds
from src.datasets import compute_dataset_mean_std, get_dataloaders
from src.training import compute_class_weights, get_early_inputs, get_inputs, train_model, init_wandb
from src.visualization import build_fusion_comparison_df, plot_losses
from src.models import EarlyFusionModel, ConcatIntermediateNet, AddIntermediateNet, MatmulIntermediateNet, HadamardIntermediateNet, LateNet

## Constants

We define the global configuration parameters used throughout the notebook, such as batch size, image size, number of epochs, learning rate, and label mappings. These constants ensure that every fusion model is trained under identical, reproducible settings.
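
Reproducibility also depends on seeding the random number generators, which the notebook delegates to `set_seeds` from `src/utility`. That implementation is not shown here, so the following is only a minimal sketch of what such a helper might do. The name `seed_everything` is hypothetical and not part of the project.

```python
import random

import torch


def seed_everything(seed):
    """Seed Python's and PyTorch's random number generators (hypothetical helper)."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU generator and the CUDA generators


# Re-seeding with the same value reproduces the same random draws.
seed_everything(123)
first = torch.rand(3)
seed_everything(123)
second = torch.rand(3)
same_draws = torch.equal(first, second)
print(same_draws)
```

Calling the helper before each training phase, as the notebook does with `set_seeds(SEED)`, gives every model the same weight initialization and shuffling order for a fixed seed.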

In [None]:
EPOCHS = 15
LR = 0.0001

In [None]:
# Usage: Call this function at the beginning and before each training phase
set_seeds(SEED)

# Weights & Biases Integration

We authenticate with Weights & Biases using a stored API key and initialize project logging. This enables automatic tracking of training progress, hyperparameters, losses, and comparison tables for all fusion models.

In [None]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()

wandb: Currently logged in as: michele-marschner (michele-marschner-university-of-potsdam) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

# Loading and preparation of Data

This section computes normalization statistics, defines input transforms, and constructs training, validation, and test dataloaders. It ensures that all fusion models receive consistent and correctly processed input data.
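
The normalization statistics come from `compute_dataset_mean_std` in `src/datasets`, whose implementation is not shown in this notebook. As an illustration only, per-channel mean and standard deviation can be accumulated batch by batch from running sums, so the full dataset never has to sit in memory at once. The function name `channel_mean_std` and the random demo data are assumptions.

```python
import torch


def channel_mean_std(batches):
    """Per-channel mean and std over an iterable of (B, C, H, W) batches."""
    total = None     # running sum of values per channel
    total_sq = None  # running sum of squared values per channel
    count = 0        # number of values seen per channel
    for batch in batches:
        batch = batch.double()  # float64 limits accumulation error
        num_channels = batch.shape[1]
        flat = batch.permute(1, 0, 2, 3).reshape(num_channels, -1)  # (C, B*H*W)
        s = flat.sum(dim=1)
        sq = (flat ** 2).sum(dim=1)
        total = s if total is None else total + s
        total_sq = sq if total_sq is None else total_sq + sq
        count += flat.shape[1]
    mean = total / count
    var = total_sq / count - mean ** 2  # population variance: E[x^2] - E[x]^2
    std = var.clamp(min=0).sqrt()       # clamp guards against tiny negative values
    return mean.float(), std.float()


# Demo on random 4-channel data, processed in four batches of four images.
torch.manual_seed(0)
data = torch.rand(16, 4, 8, 8)
mean, std = channel_mean_std(data.split(4))
print(mean, std)
```

The resulting tensors can be passed to `transforms.Normalize` via `mean.tolist()` and `std.tolist()`.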

In [None]:
# Final: dynamic
mean, std = compute_dataset_mean_std(root_dir=RAW_DATA, img_size=IMG_SIZE)
# mean, std = compute_dataset_mean_std_neu(root_dir=root, img_size=IMG_SIZE, seed=SEED)


Scanning dataset in /content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/data...
cubes: 2501 RGB files found. Matching XYZA...
spheres: 9999 RGB files found. Matching XYZA...
Preloading LiDAR XYZA tensors into RAM...
Loading XYZA:   0%|          | 38/12500 [00:28<2:35:57,  1.33it/s]

In [None]:
# Final: dynamic
img_transforms = transforms.Compose([
    transforms.ToImage(),                            # Convert to an image tensor (no value scaling)
    transforms.Resize(IMG_SIZE),
    transforms.ToDtype(torch.float32, scale=True),   # Cast to float32; integer inputs are scaled into [0, 1]
    # Hard-coded per-channel mean and std for the assessment dataset
    transforms.Normalize([0.0051, 0.0052, 0.0051, 1.0000], [5.8023e-02, 5.8933e-02, 5.8108e-02, 2.4509e-07]),
    # transforms.Normalize(mean.tolist(), std.tolist())   # alternative: use the statistics computed above
])

In [None]:
set_seeds(SEED)

train_data, train_dataloader, valid_data, val_dataloader, test_data, test_dataloader = get_dataloaders(
    str(RAW_DATA),
    VALID_BATCHES,
    test_frac=0.10,
    batch_size=BATCH_SIZE,
    img_transforms=img_transforms,
    num_workers=NUM_WORKERS,
    seed=SEED
)

for i, sample in enumerate(train_data):
    print(i, *(x.shape for x in sample))
    break

# Models

This section introduces the different multimodal fusion strategies implemented in the experiment. It explains the conceptual differences between early fusion, intermediate fusion (with several variants), and late fusion, setting the stage for the comparative evaluation.

The detailed use cases, advantages, and limitations of each fusion strategy are described in the sections below.

## Early Fusion Model

**Concept:** Fuse modalities before any deep processing — usually by concatenating channels or inputs.

```
input = concat(RGB, XYZ)  → shape (8, H, W)
-> shared CNN processes everything together
```



**Advantages:**

* **Captures Early Cross-Modal Interactions:** Learns joint low-level correlations directly from raw signals.
* **Simple & Lightweight**: Easiest fusion method to implement; minimal architectural overhead.
* **Effective with Perfect Alignment:** Works well when modalities are tightly synchronized and spatially aligned.

**Limitations:**

* **Noise Sensitivity:** One noisy or corrupted modality directly contaminates the shared feature space.
* **Strict Alignment Requirement:** Modalities must have matching spatial resolution, alignment, and synchronization.
* **Feature Space Mismatch:** Raw modalities differ in scale, units, and distribution; one modality can dominate without careful normalization.
* **High Input Dimensionality:** Channel concatenation increases the input size and can require more data and compute to train effectively.
* **Limited Flexibility:** Assumes combining low-level signals is beneficial; may underperform when modalities carry different types of information.
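
To make the concept concrete, here is a minimal early-fusion sketch. It is not the project's `EarlyFusionModel` (defined in `src/models` and not shown here); the class name `TinyEarlyFusion`, the layer sizes, and the 64x64 input resolution are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class TinyEarlyFusion(nn.Module):
    """Concatenate both modalities along the channel axis, then run one shared CNN."""

    def __init__(self, in_ch=8, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling to (B, 32, 1, 1)
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, rgb, xyz):
        x = torch.cat([rgb, xyz], dim=1)  # (B, 4 + 4, H, W)
        x = self.features(x).flatten(1)   # (B, 32)
        return self.classifier(x)         # (B, num_classes) logits


model = TinyEarlyFusion()
rgb = torch.rand(2, 4, 64, 64)  # two 4-channel RGB samples
xyz = torch.rand(2, 4, 64, 64)  # two 4-channel LiDAR samples
logits = model(rgb, xyz)
print(logits.shape)
```

Because the first convolution sees all eight channels at once, any misalignment between the two inputs affects every downstream feature, which is the alignment requirement listed above.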

## Intermediate Fusion Model

**Concept:** Each modality has its own encoder / feature extractor, and fusion happens after some layers but before classification.

```
RGB → RGB_conv → RGB_features (C, H, W)
LiDAR → LiDAR_conv → LiDAR_features (C, H, W)

Fusion → joint_features → FC → output
```



**Advantages:**

* **Specialized Processing:** Each modality gets its own encoder, tailored to its characteristics.
* **Learned Representations:** Fusion occurs on higher-level, more discriminative features rather than raw data.
* **Flexible Design:** The fusion point can be chosen at different network depths, allowing fine-grained architectural control.
* **Easily Extendable:** New modalities can be added by including additional modality-specific branches.


**Limitations:**

* **Architectural Complexity:** Requires designing separate modality-specific encoders and choosing an appropriate fusion point.
* **Higher Computational Cost:** More expensive than early fusion due to duplicated feature extractors.
* **Fusion Design Sensitivity:** Performance depends on the chosen fusion mechanism (concat, addition, multiplicative, bilinear, attention), which often requires experimentation.
* **Depth Selection Challenge:** Deciding how much unimodal processing to perform before fusion can be non-trivial and task-dependent.
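
The two-branch layout can be sketched as follows. This is an illustrative stand-in, not the project's `ConcatIntermediateNet`: the names `make_encoder` and `TinyConcatFusion` and all layer sizes are assumptions, and the features are pooled to vectors before fusion to keep the example short.

```python
import torch
import torch.nn as nn


def make_encoder(in_ch, feature_dim):
    """Small convolutional encoder that maps an image to a feature vector."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, feature_dim, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),  # (B, feature_dim)
    )


class TinyConcatFusion(nn.Module):
    """Separate encoders per modality, fused by concatenating their features."""

    def __init__(self, rgb_ch=4, xyz_ch=4, feature_dim=128, num_classes=2):
        super().__init__()
        self.rgb_encoder = make_encoder(rgb_ch, feature_dim)
        self.xyz_encoder = make_encoder(xyz_ch, feature_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * feature_dim, 64),  # concatenation doubles the feature size
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, rgb, xyz):
        fused = torch.cat([self.rgb_encoder(rgb), self.xyz_encoder(xyz)], dim=1)
        return self.head(fused)


model = TinyConcatFusion()
logits = model(torch.rand(2, 4, 64, 64), torch.rand(2, 4, 64, 64))
print(logits.shape)
```

Each encoder has its own weights, which is where the extra computational cost relative to early fusion comes from.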

Four fusion variants are implemented:

* Concatenation
* Addition
* Hadamard product (element-wise multiplication)
* Matrix multiplication



| Fusion Method | Advantages | Limitations |
|---------------|------------|-------------|
| **Concatenation** | - Very expressive and flexible<br>- Lets the network learn arbitrary cross-modal interactions<br>- Robust and widely used baseline | - Doubles channel count → more parameters & memory<br>- Computationally heavier<br>- Fusion is unguided; model must discover interactions itself |
| **Addition** | - Lightweight (no increase in channels)<br>- Fast and parameter-efficient<br>- Enforces similar feature spaces between modalities | - Assumes features are aligned and comparable<br>- One noisy modality corrupts the other<br>- Sensitive to scale differences between modalities |
| **Multiplicative (Hadamard Product)** | - Gating effect: highlights features important in *both* modalities<br>- More expressive than addition, cheaper than concat<br>- Natural for attention-like fusion | - Suppresses features when one modality has low magnitude<br>- Requires careful normalization<br>- Can amplify noise if both activations are high |
| **Matrix Multiplication (Bilinear-like)** | - Captures rich pairwise correlations between modalities<br>- Most expressive among all four<br>- Enables true 2nd-order interaction learning | - Very heavy in compute & memory<br>- Requires flattening or dimensionality reduction<br>- Easily overfits; harder to train and tune |
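
The four operators differ mainly in how they combine two feature maps and in the shape they produce. The sketch below applies each one to random feature tensors. The helper `fuse` is hypothetical, and the matrix-multiplication variant is assumed to be a batched product over the two spatial dimensions; the project's `MatmulIntermediateNet` may do it differently.

```python
import torch


def fuse(f_rgb, f_xyz, mode):
    """Combine two (B, C, H, W) feature maps with the chosen operator."""
    if mode == "concat":
        return torch.cat([f_rgb, f_xyz], dim=1)  # channels double: (B, 2C, H, W)
    if mode == "add":
        return f_rgb + f_xyz                     # element-wise sum
    if mode == "hadamard":
        return f_rgb * f_xyz                     # element-wise product
    if mode == "matmul":
        # Batched matrix product over the last two dims; needs W of f_rgb == H of f_xyz.
        return torch.matmul(f_rgb, f_xyz)
    raise ValueError(f"unknown fusion mode: {mode}")


f_rgb = torch.rand(2, 128, 8, 8)
f_xyz = torch.rand(2, 128, 8, 8)
shapes = {
    mode: tuple(fuse(f_rgb, f_xyz, mode).shape)
    for mode in ("concat", "add", "hadamard", "matmul")
}
print(shapes)
```

Only concatenation changes the channel count, so it is the only variant here that forces the following layer to take a wider input.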


## Late Fusion Model

**Concept:** Each modality is processed completely separately, and only the final predictions or high-level embeddings are fused.

```
RGB → RGB-Embedder → logits_rgb
LiDAR → LiDAR-Embedder → logits_lidar

Fusion → final decision
```

**Advantages:**

* **Robust to Missing Modalities:** The system can still operate if one modality is noisy, unreliable, or absent.
* **Best for Heterogeneous Modalities:** Works well when modalities differ greatly.
* **Modular & Simple:** Unimodal models can be trained, debugged, and replaced independently.
* **Leverages Existing Models:** Allows the reuse of strong off-the-shelf unimodal experts without architectural changes.


**Limitations:**

* **Missed Interactions:** No joint feature learning — modalities never influence each other during representation learning.
* **Limited Expressiveness:** Simple fusion rules (e.g., averaging, weighted sum) cannot capture complex cross-modal relationships.
* **Information Loss:** By the time unimodal predictors output logits/embeddings, rich spatial and semantic details may already be discarded, limiting the power of fusion.
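
One simple late-fusion rule is a weighted average of the class probabilities from each unimodal model. The sketch below shows that rule only; how the project's `LateNet` combines its branches is not shown in this notebook, and the function name `late_fuse`, the weight `w_rgb`, and the example logits are assumptions.

```python
import torch
import torch.nn.functional as F


def late_fuse(logits_rgb, logits_lidar, w_rgb=0.5):
    """Weighted average of per-modality class probabilities."""
    p_rgb = F.softmax(logits_rgb, dim=1)
    p_lidar = F.softmax(logits_lidar, dim=1)
    return w_rgb * p_rgb + (1.0 - w_rgb) * p_lidar


# Illustrative logits for two samples and two classes.
logits_rgb = torch.tensor([[2.0, 0.0], [0.0, 1.0]])
logits_lidar = torch.tensor([[0.0, 3.0], [0.0, 1.0]])

probs = late_fuse(logits_rgb, logits_lidar)
pred = probs.argmax(dim=1)
print(probs, pred)
```

For the first sample the two branches disagree, and the more confident LiDAR branch decides the fused prediction. Setting `w_rgb` to 1.0 or 0.0 falls back to a single modality, which is what makes late fusion tolerant of a missing sensor.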

# Model Training

Here we train each fusion model using the same dataset, optimizer, and loss function. The training loop logs metrics to W&B, saves checkpoints, and stores results for later comparison. This is the core experimental section of the notebook.
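
The training cell weights the cross-entropy loss per class using `compute_class_weights` from `src/training`, whose implementation is not shown here. A common choice for imbalanced data is inverse-frequency weighting, sketched below under that assumption. The function name is hypothetical, the label order (0 = cube, 1 = sphere) is assumed, and the counts are the file counts from the dataset scan rather than the actual training split.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n_samples / (num_classes * class_count).

    Assumes every class occurs at least once in `labels`.
    """
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]


# 2501 cubes and 9999 spheres, as reported by the dataset scan.
labels = [0] * 2501 + [1] * 9999
weights = inverse_frequency_weights(labels, num_classes=2)
print(weights)
```

The rarer class receives the larger weight, so misclassified cubes contribute more to the loss than misclassified spheres. A list like this can be wrapped in `torch.tensor(weights)` and passed as the `weight` argument of `nn.CrossEntropyLoss`.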

In [None]:
FEATURE_DIM = 128

set_seeds(SEED)

class_weights = compute_class_weights(train_data, NUM_CLASSES).to(DEVICE)
loss_func = nn.CrossEntropyLoss(weight=class_weights)  # class_weights is already on DEVICE

metrics = {}   # store losses for each model

# Defines fusion models to train and compare
models_to_train = {
    "early_fusion": EarlyFusionModel(in_ch=8, output_dim=NUM_CLASSES).to(DEVICE),
    "intermediate_fusion_concat": ConcatIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "intermediate_fusion_matmul": MatmulIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "intermediate_fusion_hadamard": HadamardIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "intermediate_fusion_add": AddIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM).to(DEVICE),
    "late_fusion": LateNet(4, 4, output_dim=NUM_CLASSES).to(DEVICE),
}

# === Main experiment loop over all fusion strategies ===
for name, model in models_to_train.items():
  model_save_path = CHECKPOINTS / f"{name}.pth"

  # Number of trainable parameters (for the comparison table)
  num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

  opt = Adam(model.parameters(), lr=LR)

  # Initialize a new Weights & Biases run for this model.
  init_wandb(
      model=model,
      fusion_name=name,
      num_params=num_params,
      opt_name = opt.__class__.__name__)

  # Choose the proper input function depending on the fusion strategy:
  if name.startswith("early_fusion"):
    input_fn = get_early_inputs
  else:
    input_fn = get_inputs

  results = train_model(
    model=model,
    optimizer=opt,
    input_fn=input_fn,
    epochs=EPOCHS,
    loss_fn=loss_func,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    model_save_path=model_save_path,
    target_idx=-1,   # last element in batch is target
    log_to_wandb=True,
    device=DEVICE
  )

  metrics[name] = results

  # End wandb run before starting the next model
  wandb.finish()

# Evaluation

This section visualizes validation losses across all fusion strategies, builds the fusion comparison table, and logs the aggregated results to W&B.

It provides the final quantitative comparison between early, intermediate, and late fusion methods.
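
The comparison table is built by `build_fusion_comparison_df` from `src/visualization`, which is not shown here. As a rough illustration of the kind of aggregation involved, the sketch below extracts the best validation loss and its epoch for each run and ranks the runs. The function `summarize_runs` is hypothetical, and the numbers are placeholders, not results from this notebook. Only the `"valid_losses"` key matches the structure used in the cells below.

```python
def summarize_runs(metrics):
    """Return one row per model with its best validation loss, sorted best first."""
    rows = []
    for name, m in metrics.items():
        losses = m["valid_losses"]
        best_idx = min(range(len(losses)), key=losses.__getitem__)
        rows.append({
            "model": name,
            "best_valid_loss": losses[best_idx],
            "best_epoch": best_idx + 1,  # epochs counted from 1
        })
    return sorted(rows, key=lambda r: r["best_valid_loss"])


# Placeholder values for illustration only (not measured results).
demo_metrics = {
    "model_a": {"valid_losses": [0.9, 0.5, 0.6]},
    "model_b": {"valid_losses": [0.8, 0.7, 0.4]},
}
rows = summarize_runs(demo_metrics)
for row in rows:
    print(row)
```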

In [None]:
# Display names for the comparison table and plots
name_map = {
    "early_fusion": "Early Fusion",
    "late_fusion": "Late Fusion",
    "intermediate_fusion_concat": "Intermediate (Concat)",
    "intermediate_fusion_matmul": "Intermediate (Matmul)",
    "intermediate_fusion_hadamard": "Intermediate (Hadamard)",
    "intermediate_fusion_add": "Intermediate (Add)",
}

In [None]:
loss_dict = {name: m["valid_losses"] for name, m in metrics.items()}
fig = plot_losses(loss_dict, title="Validation Loss per Model")
plt.show()

df_comparison = build_fusion_comparison_df(metrics, name_map)
df_comparison

In [None]:
# logs the comparison table to wandb
wandb.init(
    project="cilp-extended-assessment",   # your project name
    name="fusion_comparison_all",
    job_type="analysis",
)

fusion_comparison_table = wandb.Table(dataframe=df_comparison)
wandb.log({"fusion_comparison": fusion_comparison_table})

wandb.finish()

## When to Use Each Fusion Strategy
The notebook concludes by summarizing when each fusion type is most appropriate, based on their strengths, limitations, and performance observed in the experiment:

**Early Fusion:**
* Suited to spatially aligned, closely related modalities with comparable low-level features
* Simple to set up; avoid it when sensors are noisy

**Intermediate Fusion:**
* Suited to modalities with different structure that benefit from separate early processing to learn modality-specific features
* Best overall balance of performance and flexibility

**Late Fusion:**
* Suited to strong, independent unimodal predictors whose strengths should be combined
* Ideal for heterogeneous or missing modalities
* Robust fallback when one modality fails