# MaxPool2d to Strided Convolution Ablation

In this notebook I will retrain the four models from the previous notebooks with a different downsampling method. Specifically, I will replace `MaxPool2d` with strided convolutions.

In [None]:
import sys

# Colab-only setup
if "google.colab" in sys.modules:
    print("Running in Google Colab. Setting up repo")

    !git clone https://github.com/MatthiasCr/Computer-Vision-Assignment-2.git
    %cd Computer-Vision-Assignment-2
    !pip install -r requirements.txt
    %cd notebooks

In [None]:
# insert wandb token
!wandb login

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
import torchvision.transforms.v2 as transforms
from torch.utils.data import DataLoader
import wandb


import sys
from pathlib import Path

project_root = Path("..").resolve()
sys.path.append(str(project_root))
from src import datasets
from src import training
from src import visualization
from src import models


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Load Data

As in the previous notebooks, I start by loading the data from Hugging Face.

In [None]:
# hyperparameters will be the same for all experiments to make them comparable
IMG_SIZE = 64
BATCH_SIZE = 32
EPOCHS = 20

START_LR = 1e-3
END_LR = 1e-6

In [None]:
# load fiftyone dataset from huggingface
dataset = load_from_hub(
    "MatthiasCr/multimodal-shapes-subset", 
    name="multimodal-shapes-subset",
    # fewer workers and greater batch size to hopefully avoid getting rate limited
    num_workers=2,
    batch_size=500,
    overwrite=True,
)

In [None]:
img_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToImage(),
    transforms.ToDtype(torch.float32, scale=True),
])

train_dataset = datasets.MultimodalDataset(dataset, "train", img_transforms)
val_dataset = datasets.MultimodalDataset(dataset, "val", img_transforms)

# use generator with fixed seed for reproducible shuffling
generator = torch.Generator()
generator.manual_seed(51)

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True, generator=generator)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, drop_last=True)

# loader to conduct sample predictions
log_loader = DataLoader(val_dataset, batch_size=5, shuffle=True, num_workers=0, generator=generator)

# number of train batches, needed for learning rate scheduling
steps_per_epoch = len(train_dataloader)

## Experiments

I use the same hyperparameters and the same experiment function as in the previous notebook. This way we can directly compare the runs.

In [None]:
# function that tells the training process how to apply the model on a batch of the dataset
def apply_model(model, batch):
    target = batch[2].to(device)
    inputs_rgb = batch[0].to(device)
    inputs_xyz = batch[1].to(device)
    outputs = model(inputs_rgb, inputs_xyz)
    return outputs, target

In [None]:
def log_experiment(model, best_model, fusion_type, device, output_name):
    num_params = sum(p.numel() for p in model.parameters())
    optim = Adam(model.parameters(), lr=START_LR)
    scheduler = CosineAnnealingLR(optim, T_max=EPOCHS * steps_per_epoch, eta_min=END_LR)
    loss_func = nn.BCEWithLogitsLoss()

    # init wandb run and log config hyperparameters
    run = training.initWandbRun(
        fusion_type, EPOCHS, BATCH_SIZE, num_params, "Adam", "Cosine Annealing", START_LR, END_LR
    )

    # train and log loss
    train_loss, val_loss = training.train_model(
        model, optim, apply_model, loss_func, EPOCHS, train_dataloader, val_dataloader, device, run, scheduler=scheduler, output_name=output_name
    )

    # load best model
    model_save_path = f"../checkpoints/{output_name}.pt"
    best_model.load_state_dict(torch.load(model_save_path, map_location=device))
    best_model = best_model.to(device)

    # predict on 4 batches of 5 samples each = 20 predictions; log predictions to wandb
    training.log_predictions(best_model, log_loader, device, run, num_batches=4)
    
    run.finish()
    return train_loss, val_loss

Now we can start training the four models. I use the same models as in the previous notebook but pass the extra parameter `embedder_type="strided"`. This replaces the max-pooling layers with the identity and changes the stride of all convolutions from 1 to 2.
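To make the swap concrete, here is a minimal sketch of what such a switch could look like inside one convolutional block. It is not the actual code from `src/models`: the helper `conv_block`, the kernel size of 3, the padding of 1, and the channel counts are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int, embedder_type: str = "maxpool") -> nn.Sequential:
    """Hypothetical conv block; the real blocks in src/models may differ."""
    strided = embedder_type == "strided"
    return nn.Sequential(
        # with stride 2 the convolution itself halves height and width
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2 if strided else 1, padding=1),
        nn.ReLU(),
        # the pooling slot becomes a no-op in the strided variant
        nn.Identity() if strided else nn.MaxPool2d(2),
    )


# one 64x64 input; the channel count of 4 is illustrative only
x = torch.randn(1, 4, 64, 64)
pooled = conv_block(4, 16, "maxpool")(x)
strided = conv_block(4, 16, "strided")(x)
print(pooled.shape, strided.shape)
```

Under these assumptions both variants hand a feature map of the same size to the next layer, so the rest of the network does not need to change.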

In [None]:
late_model = models.LateFusionNet(embedder_type="strided").to(device)
late_model_best = models.LateFusionNet(embedder_type="strided").to(device)
late_train_loss, late_val_loss = log_experiment(late_model, late_model_best, "late", device, output_name="strided_late")

cat_model = models.IntermediateFusionNet(fusion_type="cat", embedder_type="strided").to(device)
cat_model_best = models.IntermediateFusionNet(fusion_type="cat", embedder_type="strided").to(device)
cat_train_loss, cat_val_loss = log_experiment(cat_model, cat_model_best, "intermediate (concatenation)", device, output_name="strided_cat")

add_model = models.IntermediateFusionNet(fusion_type="add", embedder_type="strided").to(device)
add_model_best = models.IntermediateFusionNet(fusion_type="add", embedder_type="strided").to(device)
add_train_loss, add_val_loss = log_experiment(add_model, add_model_best, "intermediate (addition)", device, output_name="strided_add")

had_model = models.IntermediateFusionNet(fusion_type="had", embedder_type="strided").to(device)
had_model_best = models.IntermediateFusionNet(fusion_type="had", embedder_type="strided").to(device)
had_train_loss, had_val_loss = log_experiment(had_model, had_model_best, "intermediate (hadamard)", device, output_name="strided_had")

In [None]:
visualization.plot_loss(EPOCHS,
    {
        "Concat Valid Loss": cat_val_loss,
        "Addition Valid Loss": add_val_loss,
        "Hadamard Valid Loss": had_val_loss,
        "Late Valid Loss": late_val_loss
    }
)

## Analysis

The following screenshots compare the validation loss curves of the fusion models with strided convolutional downsampling against the maxpool models from the last notebook.

<div style="display:grid; grid-template-columns: 1fr 1fr; gap:10px;">
  <img src="../results/wandb-t4-late-graphs.png">
  <img src="../results/wandb-t4-concat-graphs.png">
  <img src="../results/wandb-t4-add-graphs.png">
  <img src="../results/wandb-t4-had-graphs.png">
</div>

We can see that the models with strided convolutions consistently underperform the corresponding maxpool versions. The following table shows the exact values of the strided models. The values in parentheses are the differences to the corresponding maxpool run.


| Metric | Strided Late | Strided Concat | Strided Addition | Strided Hadamard |
| --- | --- | --- | --- | --- |
| Valid Loss | 0.254 (+0.215) | 0.357 (+0.325) | 0.395 (+0.276) | 0.179 (+0.109) |
| Valid Accuracy | 0.929 (-0.058) | 0.841 (-0.151) | 0.820 (-0.151) | 0.971 (-0.013) |
| Parameter Count | 1,415,051 (=) | 914,201 (=) | 824,201 (=) | 824,201 (=) |
| Train Time (s) | 16.15 (-6.02) | 15.25 (-9.92) | 14.97 (-9.41) | 14.82 (-9.94) |
| GPU Memory (MB) | 779 (+178) | 779 (+6) | 779 (+4) | 779 (+2) |

As the table shows, the strided models performed worse than the corresponding maxpool versions for every fusion type. All validation losses increased strongly while the validation accuracies decreased. Intermediate fusion with concatenation and addition is affected most, reaching accuracies of only 82% to 84%. On the other hand, the training times of the strided models decreased by roughly 27% to 40% relative to the maxpool models. The recorded allocated GPU memory is exactly the same for all strided models. This could be because all runs were executed back to back with no time in between, because of unknown Google Colab restrictions, or because of how wandb logs these metrics internally.
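The relative time saving can be recomputed from the table. The sketch below is only arithmetic on the reported values: it recovers each maxpool time as the strided time minus the difference in parentheses, then expresses the saving as a fraction of the maxpool time. The dictionary and variable names are my own.

```python
# (strided train time in s, difference to the maxpool run in s), copied from the table
train_times = {
    "late": (16.15, -6.02),
    "concat": (15.25, -9.92),
    "addition": (14.97, -9.41),
    "hadamard": (14.82, -9.94),
}

reductions = {}
for name, (strided_s, diff_s) in train_times.items():
    # the table reports diff = strided - maxpool, so maxpool = strided - diff
    maxpool_s = strided_s - diff_s
    # fraction of the maxpool training time that the strided model saves
    reductions[name] = -diff_s / maxpool_s
    print(f"{name}: {maxpool_s:.2f}s -> {strided_s:.2f}s ({reductions[name]:.1%} less time)")
```

Dividing by the strided time instead of the maxpool time gives noticeably larger percentages, so it matters which baseline is quoted.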

All strided models have exactly the same number of parameters as the maxpool versions. The number of parameters depends only on the number and size of the filters in the convolutional layers and on the dimensions of the linear layers, none of which changed. Max pooling itself has no learnable parameters. However, when the stride of the convolutions is changed from 1 to 2, each convolution produces a smaller output feature map. While this is the intended downsampling effect, it also limits the model's flexibility even though the parameter count stays the same.

With strided convolutions, convolution and downsampling are done **in one step**, whereas the maxpool architectures decouple these two operations. With `MaxPool2d`, the convolution layers first compute dense, high-resolution feature maps, and the pooling operation then performs a simple selection that preserves the strongest activations. This separation allows the network to learn richer local features before reducing resolution. In contrast, stride-2 convolutions force the network to discard spatial information already during feature extraction. Each filter is evaluated at fewer spatial positions, and fine-grained details may be skipped entirely. This is particularly harmful for intermediate fusion, where precise alignment between RGB and LiDAR features is important. The loss of dense details reduces the effectiveness of cross-modal interactions, which likely explains the strong degradation for concatenation and addition fusion.
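The parameter argument can be checked on a single layer. This is a small sketch with illustrative channel counts and an assumed kernel size of 3, not the layers from `src/models`: it counts the parameters of a stride-1 and a stride-2 convolution and compares how many spatial positions each one evaluates.

```python
import torch
import torch.nn as nn

# illustrative layer sizes; the real models may use different ones
conv_s1 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
conv_s2 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)
pool = nn.MaxPool2d(2)


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


# the stride does not appear in the weight shape, so both counts match
params_s1 = count_params(conv_s1)
params_s2 = count_params(conv_s2)
params_pool = count_params(pool)  # pooling has nothing to learn

x = torch.randn(1, 16, 32, 32)
dense = conv_s1(x)          # filters evaluated at every position
pooled = pool(dense)        # then reduced by picking the max in each 2x2 window
strided = conv_s2(x)        # filters evaluated at every second position only

positions_dense = dense.shape[-2] * dense.shape[-1]
positions_strided = strided.shape[-2] * strided.shape[-1]
print(params_s1, params_s2, params_pool)
print(dense.shape, pooled.shape, strided.shape)
print(positions_dense / positions_strided)
```

The maxpool path computes the dense map first and only then selects from it, while the strided path never computes the skipped positions at all.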

On the other hand, this is also the reason the training times decreased so drastically. The strided convolutions require far fewer FLOPs and therefore less time and energy. My recommendation would be to consider replacing maxpool with strided convolutions only when computational efficiency, faster training, or deployment constraints are very important. If peak accuracy is the goal, then, based on my experiments, max pooling would be the better choice.
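As a rough back-of-the-envelope check of the FLOP argument, the sketch below counts multiply-accumulate operations (MACs) for a stack of three 3x3 convolutions on a 64x64 input. The channel list and the helper names are hypothetical and do not describe the actual embedder, and biases and activations are ignored.

```python
def conv_macs(in_ch: int, out_ch: int, kernel: int, out_size: int) -> int:
    """MACs of one square convolution: one kernel application per output element."""
    return in_ch * out_ch * kernel * kernel * out_size * out_size


channels = [4, 16, 32, 64]  # hypothetical channel progression
input_size = 64             # matches IMG_SIZE


def stack_macs(strided: bool) -> int:
    size = input_size
    total = 0
    for in_ch, out_ch in zip(channels[:-1], channels[1:]):
        # stride 1 keeps the resolution for the conv; stride 2 halves it immediately
        out_size = size // 2 if strided else size
        total += conv_macs(in_ch, out_ch, 3, out_size)
        # both variants pass a half-resolution map to the next block
        size = size // 2
    return total


maxpool_macs = stack_macs(strided=False)
strided_macs = stack_macs(strided=True)
print(maxpool_macs, strided_macs, maxpool_macs / strided_macs)
```

In this simplified setting every convolution in the strided stack does a quarter of the work. Wall-clock training time also includes data loading, the linear layers, and other overhead, so the measured saving does not have to match this factor.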