# Overview: Strided Convolution Ablation

This notebook studies the impact of replacing MaxPool-based downsampling with strided convolutions across different fusion architectures. Using identical training settings, we compare early, intermediate, and late fusion models to analyze how the choice of downsampling affects validation performance, training time, parameter efficiency, and feature stability. Results are logged with Weights & Biases and summarized in comparative tables and loss curves.

# Initial Setup

The project repository is mounted from Google Drive and added to the Python path to allow clean imports from the src module. The dataset is copied to the local Colab filesystem to improve I/O performance during training. All global settings (random seed, device selection, paths, batch sizes) are defined once and reused across the notebook to ensure consistency and reproducibility.

Weights & Biases is initialized for experiment tracking, and all training stages use the same precomputed dataset statistics and DataLoaders for fair comparison across models.
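
The `set_seeds` helper lives in `src.utility` and its body is not shown in this notebook. As a rough sketch of what such a helper typically does, the hypothetical `set_seeds_sketch` below seeds Python's `random` module and, if they are installed, NumPy and PyTorch. The function name and the set of seeded libraries are assumptions, not the project's actual implementation.

```python
import os
import random


def set_seeds_sketch(seed: int) -> None:
    """Hypothetical stand-in for src.utility.set_seeds (actual body not shown)."""
    # Only affects hash randomization in Python processes started after this point
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch

    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default random number generators
    except ImportError:
        pass  # PyTorch is optional for this sketch


set_seeds_sketch(42)
first = [random.random() for _ in range(3)]
set_seeds_sketch(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Re-seeding before each training phase, as done in this notebook, makes every model start from the same random state regardless of how many random numbers earlier cells consumed.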

In [None]:
import sys
from pathlib import Path

from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/Applied-Computer-Vision-Projects/Multimodal_Learning_02/"

PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

In [None]:
%%capture
# Install dependencies (%%capture must be the first line of the cell)
%pip install --no-cache-dir -r requirements.txt

In [None]:
import os
from google.colab import userdata

import torch
import torch.nn as nn
import torchvision.transforms.v2 as transforms
from torch.optim import Adam

import wandb
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

In [None]:
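# DRIVE_ROOT must be defined before the shell command below can expand $DRIVE_ROOT
from src.config import DRIVE_ROOT

!rm -rf /content/data
!cp -r "$DRIVE_ROOT/data/assessment" /content/data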

In [None]:
from src.config import (SEED, NUM_WORKERS, BATCH_SIZE, IMG_SIZE, NUM_CLASSES, DRIVE_ROOT, 
                        RAW_DATA, CHECKPOINTS, DEVICE, VALID_BATCHES)
from src.utility import set_seeds, init_wandb, compute_embedding_size
from src.datasets import get_dataloaders, get_train_stats
from src.training import get_early_inputs, get_inputs, train_model
from src.visualization import build_pairwise_downsampling_tables, plot_val_losses
from src.models import EmbedderMaxPool, EmbedderStrided, EarlyFusionModel, ConcatIntermediateNet, LateNet

In [None]:
# Fusion-specific training constants
EPOCHS = 15
LR = 0.0001

In [None]:
# Re-seed at the start of the notebook and again before each training phase
set_seeds(SEED)

In [None]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()

wandb: Currently logged in as: michele-marschner (michele-marschner-university-of-potsdam) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

# Data Loading and Preparation

This section computes normalization statistics, defines the RGB+LiDAR transforms, and prepares the training, validation, and test DataLoaders. Both downsampling strategies (MaxPool and Stride) receive identical input preprocessing to guarantee a fair ablation comparison.
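
`get_train_stats` is defined in `src.datasets` and is not shown here. The sketch below illustrates the general idea of per-channel normalization statistics with plain Python lists: accumulate the sum and the sum of squares per channel, then derive mean and standard deviation. The function name `channel_stats` and the toy data are illustrative assumptions; the real helper presumably operates on image tensors.

```python
import math


def channel_stats(images):
    """Per-channel mean and population std over a list of images.

    Each image is a list of channels; each channel is a flat list of pixel values.
    """
    num_channels = len(images[0])
    total = [0.0] * num_channels
    total_sq = [0.0] * num_channels
    count = [0] * num_channels

    for image in images:
        for c, channel in enumerate(image):
            total[c] += sum(channel)
            total_sq[c] += sum(v * v for v in channel)
            count[c] += len(channel)

    mean = [total[c] / count[c] for c in range(num_channels)]
    # Var[x] = E[x^2] - E[x]^2; clamp at 0 to guard against rounding error
    std = [math.sqrt(max(total_sq[c] / count[c] - mean[c] ** 2, 0.0))
           for c in range(num_channels)]
    return mean, std


# Two toy "images" with two channels of four pixels each
images = [
    [[0.0, 0.5, 0.5, 1.0], [0.2, 0.2, 0.2, 0.2]],
    [[0.0, 0.5, 0.5, 1.0], [0.4, 0.4, 0.4, 0.4]],
]
mean, std = channel_stats(images)

# Normalizing a pixel mirrors what transforms.Normalize(mean, std) does per channel
normalized = (1.0 - mean[0]) / std[0]
print(mean, std, normalized)
```

Computing the statistics on the training split only, as the notebook does, avoids leaking validation or test information into the preprocessing.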

In [None]:
# Load the cached mean and std from file, or compute them from the RGB training data.
# Recompute these statistics whenever the dataset or the training split changes.
mean, std = get_train_stats(dir=DRIVE_ROOT, img_size=IMG_SIZE, data_dir=RAW_DATA)

Scanning dataset in /content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/data...
cubes: 2501 RGB files found. Matching XYZA...
spheres: 9999 RGB files found. Matching XYZA...
Preloading LiDAR XYZA tensors into RAM...
Loading XYZA:   0%|          | 38/12500 [00:28<2:35:57,  1.33it/s]

In [None]:
img_transforms = transforms.Compose([
    transforms.ToImage(),                           # Converts to a tensor image; does not rescale values
    transforms.Resize(IMG_SIZE),
    transforms.ToDtype(torch.float32, scale=True),  # Casts to float32 and scales integer inputs into [0, 1]
    transforms.Normalize(mean, std),
])

In [None]:
set_seeds(SEED)

train_data, train_dataloader, valid_data, val_dataloader, test_data, test_dataloader = get_dataloaders(
    str(RAW_DATA),
    VALID_BATCHES,
    test_frac=0.15,
    batch_size=BATCH_SIZE,
    img_transforms=img_transforms,
    num_workers=NUM_WORKERS,
    seed=SEED
)

for i, sample in enumerate(train_data):
    print(i, *(x.shape for x in sample))
    break

# Model Training

In this section we train multiple fusion models using either MaxPool-based embedders or StridedConv-based embedders. For each architecture, we initialize the optimizer, create a W&B run, train the model, and save checkpoints. The goal is to isolate the impact of replacing MaxPool with stride-based downsampling while keeping all other components constant.
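
The embedder definitions live in `src.models` and are not shown here. As a hedged illustration of how such an ablation can keep the parameter count fixed, the sketch below counts the weights of a single 2D convolution. The stride does not appear in the formula, so moving the downsampling from a pooling layer into the convolution's stride adds no parameters, and max pooling has none of its own. Whether `EmbedderStrided` is built exactly this way is an assumption, and the layer sizes are made up.

```python
def conv2d_param_count(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Number of learnable parameters in a standard 2D convolution.

    Stride and padding do not affect the count, so they are not arguments.
    """
    weights = out_ch * in_ch * kernel * kernel
    return weights + (out_ch if bias else 0)


# Illustrative layer sizes (not taken from src.models)
pool_block = conv2d_param_count(4, 32, kernel=3)    # conv with stride 1, followed by MaxPool2d
stride_block = conv2d_param_count(4, 32, kernel=3)  # same conv with stride 2, no pooling
print(pool_block, stride_block)  # 1184 1184
```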

In [None]:
FEATURE_DIM = 128

set_seeds(SEED)

#class_weights = compute_class_weights(train_data, NUM_CLASSES).to(DEVICE)
#loss_func = nn.CrossEntropyLoss(weight=class_weights.to(DEVICE))
loss_func = nn.CrossEntropyLoss()

metrics = {}   # store losses for each model

# Define the fusion models to train and compare
models_to_train = {
    "early_fusion_pool": EarlyFusionModel(in_ch=8, output_dim=NUM_CLASSES, embedder_cls=EmbedderMaxPool).to(DEVICE),
    "early_fusion_stride": EarlyFusionModel(in_ch=8, output_dim=NUM_CLASSES, embedder_cls=EmbedderStrided).to(DEVICE),
    "intermediate_fusion_concat_pool": ConcatIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM, embedder_cls=EmbedderMaxPool).to(DEVICE),
    "intermediate_fusion_concat_stride": ConcatIntermediateNet(4, 4, output_dim=NUM_CLASSES, feature_dim=FEATURE_DIM, embedder_cls=EmbedderStrided).to(DEVICE),
    "late_fusion_pool": LateNet(4, 4, output_dim=NUM_CLASSES, embedder_cls=EmbedderMaxPool).to(DEVICE),
    "late_fusion_stride": LateNet(4, 4, output_dim=NUM_CLASSES, embedder_cls=EmbedderStrided).to(DEVICE)
}

# === Main experiment loop over all fusion strategies ===
for name, model in models_to_train.items():
  model_save_path = CHECKPOINTS / f"{name}.pt"

  # Number of trainable parameters (for the comparison table)
  num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

  opt = Adam(model.parameters(), lr=LR)

  embedding_size = compute_embedding_size(name, FEATURE_DIM, spatial=(8, 8))

  # Initialize a new Weights & Biases run for this model.
  init_wandb(
      model=model,
      name=name,
      embedding_size=embedding_size,
      fusion_name=name,
      num_params=num_params,
      opt_name=opt.__class__.__name__,
  )

  # Choose the proper input function depending on the fusion strategy:
  if name.startswith("early_fusion"):
    input_fn = get_early_inputs
  else:
    input_fn = get_inputs

  results = train_model(
    model=model,
    optimizer=opt,
    input_fn=input_fn,
    epochs=EPOCHS,
    loss_fn=loss_func,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    model_save_path=model_save_path,
    target_idx=-1,   # last element in batch is target
    log_to_wandb=True,
    device=DEVICE
  )

  metrics[name] = results

  # End wandb run before starting the next model
  wandb.finish()

# Evaluation

This section plots validation losses for all trained models and generates downsampling comparison tables. These tables summarize performance differences between MaxPool and StridedConv variants across early, intermediate, and late fusion methods. All results are logged to W&B for easy inspection and reproducibility.
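
`build_pairwise_downsampling_tables` is implemented in `src.visualization` and is not shown. The sketch below illustrates one way such a pairwise table could be assembled for a single fusion type: each row holds the MaxPool value, the strided value, and their difference. The helper name `pairwise_rows` and the metric keys are assumptions; the numbers are the early-fusion values reported in this notebook.

```python
def pairwise_rows(pool: dict, stride: dict) -> list:
    """Build (metric, pool value, stride value, stride - pool) rows."""
    return [(metric, pool[metric], stride[metric], stride[metric] - pool[metric])
            for metric in pool]


# Hypothetical metric keys; values copied from the early-fusion results table
early_pool = {"best_val_loss": 5.9752e-07, "params": 8387990, "train_time_s": 136.3252}
early_stride = {"best_val_loss": 7.1077e-07, "params": 8387990, "train_time_s": 129.5581}

rows = pairwise_rows(early_pool, early_stride)
for metric, p, s, diff in rows:
    print(f"{metric:>14} | {p:.4e} | {s:.4e} | {diff:+.4e}")
```

Using a fixed sign convention (strided minus MaxPool) means a negative difference always reads as "lower with strided convolutions", whether the metric is a loss or a training time.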

In [None]:
name_map = {
    "early_fusion": "Early Fusion",
    "intermediate_fusion_concat": "Intermediate (Concat)",
    "late_fusion": "Late Fusion",
}

In [None]:
loss_dict = {name: m["valid_losses"] for name, m in metrics.items()}
fig, ax = plot_val_losses(loss_dict, title="Validation Loss per Model")
plt.show()

tables = build_pairwise_downsampling_tables(metrics, name_map)

# Log each comparison table to W&B, one analysis run per fusion type
for base, df in tables.items():
    wandb.init(
        project="cilp-extended-assessment",
        name=f"downsampling_comparison_{base}",
        job_type="analysis",
    )
    wandb.log({f"task4_downsampling_{base}": wandb.Table(dataframe=df)})

# Log the loss curves to the run that is still active (the last one opened above)
wandb.log({"max_pool_vs_stride/val_loss_curves": wandb.Image(fig)})
plt.close(fig)

wandb.finish()

In [None]:
for name, df in tables.items():
    display(Markdown(f"### {name} — MaxPool vs Strided Conv"))
    display(df)

# Ablation Results: Max Pooling vs. Strided Convolution

**Early Fusion:**

| Metric | MaxPool2d | Strided Conv | Difference (Strided - MaxPool) |
|---|---|---|---|
| Validation Loss (best) | 5.9752e-07 | 7.1077e-07 | 1.1325e-07 |
| Parameters | 8387990 | 8387990 | 0 |
| Training Time (s) | 136.3252 | 129.5581 | -6.7671 |
| Final Accuracy | 1.0 | 1.0 | 0.0 |

**Intermediate Fusion (Concat):**

| Metric | MaxPool2d | Strided Conv | Difference (Strided - MaxPool) |
|---|---|---|---|
| Validation Loss (best) | 1.3411e-08 | 4.7867e-06 | 4.7733e-06 |
| Parameters | 16672374 | 16672374 | 0 |
| Training Time (s) | 218.2184 | 168.5268 | -49.6916 |
| Final Accuracy | 1.0 | 1.0 | 0.0 |

**Late Fusion:**

| Metric | MaxPool2d | Strided Conv | Difference (Strided - MaxPool) |
|---|---|---|---|
| Validation Loss (best) | 1.2889e-07 | 5.1322e-06 | 5.0033e-06 |
| Parameters | 16672374 | 16672374 | 0 |
| Training Time (s) | 217.3971 | 169.4651 | -47.9319 |
| Final Accuracy | 1.0 | 1.0 | 0.0 |


# Downsampling Ablation: Quantitative and Theoretical Analysis

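Across all fusion strategies, MaxPool achieves a lower best validation loss than strided convolutions, while both variants reach the same final accuracy of 1.0. The gap is largest for intermediate and late fusion, where the strided variants end at roughly 5e-06 compared with about 1.3e-07 or less for MaxPool. In exchange, strided convolutions cut training time by about 48 to 50 seconds per run (roughly 22%) for these two architectures, but only by about 7 seconds for early fusion. All losses are close to zero, so the differences are small in absolute terms. Within this experiment they suggest more stable and precise feature alignment for MaxPool.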
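
To put the training-time differences in proportion, the short calculation below converts the reported times into relative reductions. It only uses the numbers from the tables above; the dictionary layout is an illustrative choice.

```python
# (MaxPool time, strided time) in seconds, copied from the results tables
train_times = {
    "early_fusion": (136.3252, 129.5581),
    "intermediate_fusion_concat": (218.2184, 168.5268),
    "late_fusion": (217.3971, 169.4651),
}

# Fraction of the MaxPool training time saved by the strided variant
relative_reduction = {
    name: (pool - stride) / pool for name, (pool, stride) in train_times.items()
}

for name, frac in relative_reduction.items():
    print(f"{name}: {frac:.1%} less training time with strided convolutions")
```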

Both MaxPool and strided convolutions downsample feature maps, but they differ fundamentally in how information is preserved, how features are learned, and how gradients flow through the network. The appropriate choice depends on the architecture, dataset, and task.
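
Both operations follow the same output-size rule, `out = floor((in + 2 * padding - kernel) / stride) + 1`. The sketch below evaluates it for a 2x2 max pool with stride 2 and for a 3x3 convolution with stride 2 and padding 1, which both halve an even input size. The kernel sizes and the 64-pixel input are illustrative assumptions, not values read from `src.models` or `src.config`.

```python
def downsampled_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a pooling or convolution layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


size_pool = size_conv = 64
for _ in range(3):  # three consecutive downsampling stages
    size_pool = downsampled_size(size_pool, kernel=2, stride=2)
    size_conv = downsampled_size(size_conv, kernel=3, stride=2, padding=1)

print(size_pool, size_conv)  # 8 8
```

Because both variants produce feature maps of the same size under these settings, the layers that follow the embedder can stay unchanged between the two arms of the ablation.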

*MaxPool* is a non-parametric operation that selects the maximum value within each pooling window, preserving only the most prominent activations and introducing a degree of invariance to small translations. This is beneficial for classification tasks where coarse, dominant patterns are sufficient. However, only the maximum in each window receives gradient, so gradient updates are sparse and less informative. Max pooling also removes subtle spatial variations and geometric cues and, lacking learnable parameters, cannot adapt its downsampling behavior to the task.
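
The sparsity of max-pooling gradients can be seen on a tiny example. The sketch below applies a 2x2 max pool with stride 2 to a 4x4 grid and records which input positions would receive gradient, namely the location of the maximum in each window. It is a plain-Python illustration with made-up values, not code from this project.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2; also returns the positions that receive gradient."""
    pooled, grad_positions = [], set()
    for r in range(0, len(grid), 2):
        row = []
        for c in range(0, len(grid[0]), 2):
            window = [(grid[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window)  # largest value wins (ties broken by position)
            row.append(value)
            grad_positions.add(position)
        pooled.append(row)
    return pooled, grad_positions


grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, grad_positions = max_pool_2x2(grid)
print(pooled)               # [[4, 5], [6, 7]]
print(len(grad_positions))  # 4, so only 4 of 16 inputs receive gradient
```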

*Strided convolutions*, in contrast, perform downsampling using learnable filters. When the kernel is larger than the stride, they aggregate information from overlapping regions, producing denser and smoother gradients while preserving more contextual and structural detail. This allows the network to adaptively combine features before resolution reduction, making strided convolutions better suited for tasks requiring spatial precision or fine-grained feature learning.
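
A strided convolution computes a weighted sum over each window, so every input covered by the kernel contributes to the output and receives gradient in proportion to its weight. The sketch below runs a single-channel 3x3 convolution with stride 2 and padding 1 over a 4x4 grid and counts how many outputs each input feeds into. The fixed averaging kernel is a stand-in for learned weights, and all values are illustrative.

```python
def strided_conv2d(grid, kernel, stride=2, padding=1):
    """Single-channel 2D convolution (cross-correlation) with zero padding."""
    n, k = len(grid), len(kernel)
    out_size = (n + 2 * padding - k) // stride + 1
    output = [[0.0] * out_size for _ in range(out_size)]
    # How many output values each input position contributes to
    contributions = [[0] * n for _ in range(n)]
    for i in range(out_size):
        for j in range(out_size):
            for di in range(k):
                for dj in range(k):
                    r = i * stride + di - padding
                    c = j * stride + dj - padding
                    if 0 <= r < n and 0 <= c < n:  # positions outside are zero padding
                        output[i][j] += kernel[di][dj] * grid[r][c]
                        contributions[r][c] += 1
    return output, contributions


grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
# Fixed averaging weights as a stand-in for learned weights
kernel = [[1 / 9] * 3 for _ in range(3)]
output, contributions = strided_conv2d(grid, kernel)

inputs_with_gradient = sum(v > 0 for row in contributions for v in row)
print(inputs_with_gradient)  # 16, so every input contributes to at least one output
print(contributions[1][1])   # 4, this input lies in all four overlapping windows
```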

**Recommendation:**
MaxPool is well suited for simple classification tasks with stable patterns. Strided convolutions are preferable when spatial detail is critical, such as in regression, localization, or anomaly detection. In our setting, MaxPool is the better choice: it reaches the lower validation loss for every fusion strategy at an identical parameter count. The cost is longer training, about 5% more than the strided variant for early fusion and roughly 22% more for intermediate and late fusion.