# Tackling Multimodal Overfitting with Feature Masking
By Vanama Pranav

## Motivation

In many AI tasks we assume “more data is better”, but when combining multiple data types (modalities) this isn’t always true. For example, adding audio to a vision-based emotion classifier actually *hurts* accuracy on CREMA-D. This counterintuitive phenomenon, sometimes called **multimodal overfitting** or **modality imbalance**, is why I picked this paper. The paper proposes to directly regularize the *features* of each modality, rather than fiddling with learning rates or gradient schedules. By focusing on which features each network branch learns, we can gain insight into *why* one modality dominates the other, and how to fix it. Understanding this is crucial for robust multimodal learning, where we want vision, audio, text, etc. to truly complement each other.

## Historical Context

Early multimodal models often relied on attention or gating mechanisms to let each modality emphasize or ignore information. For instance, gated attention can weight the visual and audio features differently when fusing them. Other work noted that simply balancing the training speed of each branch can help: Wang *et al.* suggested scaling the learning rate per modality, and Peng *et al.* proposed an “on-the-fly” gradient balancing strategy. These methods adjust *how* the models train but don’t explicitly tell us *which* features are causing interference. More recently, feature-level approaches have emerged. Fan *et al.* introduced **PMR** (Prototypical Modal Rebalance): they create class-prototype vectors and encourage the weaker modality to align its features with these prototypes, while adding an entropy loss to prevent the strong modality from dominating. This is a form of feature regularization. In parallel, generative-model researchers have long used *shared/private* latent spaces (e.g. Lee and Pavlovic 2021) to disentangle what information is common vs. exclusive to each modality. However, those ideas were mainly developed for generation and don’t directly address discriminative tasks. In summary, the field has moved from simple attention fusion to scheduling-based regularization, and now to masking at the feature level (the idea of the paper reviewed here), which promises a more interpretable look at modality interaction.

## Technical Summary of the Paper

### Definition of Multimodal Overfitting and Hypothesis

**Multimodal overfitting** occurs when adding a modality does not improve performance, or even worsens it. The authors cite Wang *et al.* to define it as “overfitting of the strong modality over the underfitting weak modality.” Intuitively, one modality (e.g. vision) dominates and captures most task-relevant signals, while the other (e.g. audio) fails to learn enough. The key hypothesis of this paper is that we should *directly regularize the features* that each modality learns, rather than adjusting training hyperparameters such as learning rates or gradient schedules. In practice, they propose to split the fused latent features into two sets: **complementary** (features useful for the task) and **contradictory** (features that *hurt* the other modality’s learning). Concretely, the model learns a mask that zeros out the features it deems “contradictory”. This masking is learned end-to-end, with a penalty that encourages removing unnecessary features. By inspecting what gets masked, we gain intuitive insight into modality overfitting.

### Feature Separation: Complementary vs. Contradictory

The core idea is to **binarize an attention mask** so each feature channel is either fully kept or fully dropped. Unlike soft attention, this hard masking cleanly separates features. During training, each modality’s features are encoded (by a VQ-VAE backbone, see below) and concatenated. A learnable mask tensor Ω (one weight per channel) is thresholded into binary values ω ∈ {0,1} and applied to the concatenated features. Channels where ω=0 are considered **contradictory** (they are masked out), and the rest are **complementary**. The model is penalized (via an L1 loss on Ω) for keeping features, which pushes it to *prune* any feature that isn’t absolutely needed. In effect, the network automatically “decides” which features from each modality truly cooperate (complementary) and which interfere (contradictory) with the task. This separation resembles the shared/private latent spaces used in generative models, but is applied here in a discriminative setting.
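To make the hard-masking step concrete, here is a minimal forward-pass sketch in plain Python. It is my own illustration, not the authors' code: the function names, the toy feature values, and the threshold of 0.5 are all assumptions, and how gradients pass through the threshold during training is not covered.

```python
from typing import List, Tuple

Channel = List[float]


def binarize_mask(omega: List[float], threshold: float = 0.5) -> List[int]:
    """Threshold soft mask weights into hard 0/1 decisions (1 = keep).

    The threshold value is an assumption made for this sketch.
    """
    return [1 if w > threshold else 0 for w in omega]


def apply_channel_mask(features: List[Channel], mask: List[int]) -> List[Channel]:
    """Zero out every channel whose mask entry is 0."""
    if len(features) != len(mask):
        raise ValueError("need exactly one mask entry per channel")
    return [[value * m for value in channel] for channel, m in zip(features, mask)]


def split_channels(mask: List[int]) -> Tuple[List[int], List[int]]:
    """Return (complementary, contradictory) channel indices."""
    complementary = [i for i, m in enumerate(mask) if m == 1]
    contradictory = [i for i, m in enumerate(mask) if m == 0]
    return complementary, contradictory


# Toy latent features: 3 vision channels and 2 audio channels (made-up values).
vision = [[0.3, 0.1], [0.9, 0.4], [0.2, 0.8]]
audio = [[0.5, 0.5], [0.7, 0.6]]
fused = vision + audio  # concatenation along the channel axis

omega = [0.9, 0.2, 0.7, 0.4, 0.8]  # hypothetical learned mask weights
mask = binarize_mask(omega)
masked = apply_channel_mask(fused, mask)
complementary, contradictory = split_channels(mask)

print(mask)           # hard mask, one entry per channel
print(complementary)  # indices of channels passed on to the classifier
print(contradictory)  # indices of channels that were masked out
```

In this toy case one vision channel and one audio channel fall below the threshold, so only three of the five fused channels would reach the classifier.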

### CM-VQVAE Architecture (VQ-VAE + Masking)

The implementation is called **CM-VQVAE** (Complementary Multimodal VQ-VAE). Figure 2 in the paper shows two parallel VQ-VAE encoders, one per modality. Each encoder discretizes its modality’s feature maps using a learned codebook (this is the vector quantization that gives the VQ-VAE its name). In other words, continuous feature tensors are replaced by their nearest codebook entries, so each modality is represented by a sequence of codebook indices. After encoding, the latent features from all modalities are fused (concatenated). Then comes the masking step: the learned mask Ω is applied to the concatenated tensor, zeroing out whole channels. The remaining (unmasked) channels form the complementary feature set, which is fed to a classifier. During training, the loss includes: (1) a VQ-VAE reconstruction loss to ensure good encoding; (2) a VQ codebook loss; (3) the usual task loss (e.g. cross-entropy for classification); and (4) a *complementarity loss* L<sub>compl</sub>=λ∑Ω<sub>d</sub>, where d indexes the mask entries. The last term penalizes large mask values, encouraging more masking of features. They use an L1 penalty (since L0 is hard to optimize) and set λ small (0.0001) so that sparsity does not come at the cost of accuracy. In short, the CM-VQVAE automatically learns discrete feature “units” per modality and a binary mask that prunes them, preventing any one modality’s features from drowning out the other.
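As a rough illustration of how the four loss terms fit together, the sketch below adds them up for made-up numbers. The function names are hypothetical, and the equal weighting of the reconstruction, codebook, and task terms is my assumption; only the L1 form of the complementarity term and λ = 0.0001 come from the description above.

```python
from typing import List


def complementarity_loss(omega: List[float], lam: float = 1e-4) -> float:
    """L1 penalty on the mask weights: lam * sum_d |omega_d|."""
    return lam * sum(abs(w) for w in omega)


def total_loss(
    reconstruction: float,
    codebook: float,
    task: float,
    omega: List[float],
    lam: float = 1e-4,
) -> float:
    """Sum of the four terms; the unit weights on the first three are assumed."""
    return reconstruction + codebook + task + complementarity_loss(omega, lam)


# Made-up loss values for one training step.
omega = [1.0, 0.5, 0.0, 0.5]
penalty = complementarity_loss(omega)
loss = total_loss(reconstruction=0.20, codebook=0.05, task=1.10, omega=omega)

print(penalty)  # small compared with the other terms because lam is tiny
print(loss)
```

With λ this small, the penalty barely changes the total for a handful of channels. Its effect comes from acting on every mask entry at every step, which nudges weights that are not earning their keep toward zero.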

### Complementarity Metric ζ

Because the VQ-VAE discretizes features into units, the model learns how many codebook entries from each modality are used (complementary) vs. masked (contradictory). From this we can define a **modal complementarity** score ζ (zeta). For two modalities, if compl<sup>1</sup> and compl<sup>2</sup> are the percentages of features kept in each, then

$$
\zeta = \frac{\min(\text{compl}^1,\text{compl}^2)}{\max(\text{compl}^1,\text{compl}^2)}.
$$

Thus ζ∈\[0,1], with ζ≈1 meaning both modalities contribute equally, and ζ small meaning one dominates. This quantitative measure emerges naturally from the learned mask sizes. For example, on CREMA-D the model found a fairly high ζ (0.78, so image and audio were used to a comparable degree), whereas on PennAction ζ was very low (0.10) because pose dominated over color. The authors also report a nice property: a more balanced (higher ζ) latent split generally correlates with better performance.
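The ζ formula is simple enough to compute directly from two binary masks. The sketch below is my own illustration with hypothetical function names and toy masks; returning 0 when neither modality keeps anything is an assumption, since the formula is undefined in that case.

```python
from typing import List


def kept_fraction(mask: List[int]) -> float:
    """Fraction of a modality's features that survive the binary mask."""
    if not mask:
        raise ValueError("mask must not be empty")
    return sum(mask) / len(mask)


def complementarity(compl_1: float, compl_2: float) -> float:
    """zeta = min(compl_1, compl_2) / max(compl_1, compl_2)."""
    largest = max(compl_1, compl_2)
    if largest == 0:
        # Assumption: treat "nothing kept in either modality" as zeta = 0.
        return 0.0
    return min(compl_1, compl_2) / largest


# Toy masks: modality 1 keeps 2 of 4 features, modality 2 keeps 1 of 4.
mask_1 = [1, 1, 0, 0]
mask_2 = [1, 0, 0, 0]

zeta = complementarity(kept_fraction(mask_1), kept_fraction(mask_2))
print(zeta)  # 0.25 / 0.5 = 0.5
```

Note that ζ is symmetric: it says how unbalanced the two modalities are, but not which one dominates, so the kept fractions themselves are still worth reporting.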

### Training and Evaluation (Datasets & Metrics)

They train CM-VQVAE end-to-end on several multimodal benchmarks, always using a ResNet backbone for the encoders. Key datasets are: **CREMA-D** (visual+audio emotion recognition), **PennAction** (color video + pose for action recognition), and **NYUv2** (RGB + depth for scene segmentation). CREMA-D is known to suffer from modality imbalance, so it’s a main testbed. PennAction has redundant modalities (pose alone almost solves it), so it tests an extreme case. NYUv2 has no obvious overfitting issue and checks generalization to a different task (segmentation). Metrics are standard (accuracy for classification, mIoU for segmentation).

**Results:** The authors report that CM-VQVAE consistently *fixes* overfitting: on CREMA-D, the vanilla multimodal accuracy (using all features) is \~54.4%, while the best single-modality model reaches 59.0%; with CM-VQVAE the accuracy rises to \~65.3%, well above both baselines. In all cases they see “Multi < best unimodal < our method”. They compare to two simple baselines: random feature dropout and a soft gated-attention model (Ω not binarized). The learned mask beats random dropout (which simply removes useful units along with harmful ones), and also beats soft attention (the gated baseline showed no significant gain). Ablation studies show that each piece matters: removing the mask regularization (λ) degrades performance, and using no mask at all (i.e. standard concatenation) of course reverts to the worst result. They also compare against recent literature on CREMA-D: their CM-VQVAE achieves higher accuracy than GradBlend (Wang *et al.*), OGM-GE (Peng *et al.*), or PMR (Fan *et al.*) (see Ablation Table in the paper).

### Qualitative Insights

Crucially, the model provides introspection. The paper visualizes *reconstructions* of the complementary vs. contradictory spaces through the VQ-VAE decoder. In CREMA-D, the “complementary” face reconstructions drop skin-color differences (they look almost grayscale). This indicates the network learned that skin color (which correlates with identity or ethnicity) is irrelevant or even distracting for emotion classification, so those features became “contradictory” and were masked out. Conversely, the complementary audio reconstructions keep basic tone but omit speaker-specific details.

In PennAction, the complementary images are very simple silhouettes (just enough to see the action pose), whereas the detailed background and even some appearance features end up in the contradictory set. In the pose modality, the complementary space cleanly captures the skeleton needed for action recognition, leaving subject-specific jittery joint shifts as contradictory noise. These visualizations show that the mask is meaningfully partitioning semantics: complementary features capture the core information for the task, and the network (with a little help from the decoder) shows what it “throws away”.

Finally, they observe an interesting dynamic: the weaker modality *alone* tends to learn many features, but when combined with a stronger modality, the weaker one’s features are pruned more aggressively. In other words, the strong modality can “cover” for missing features of the weak one. This is consistent with the strong-weak overfitting intuition: the network only keeps those weak-modality features that the strong modality can’t provide.

## Insights and What I Learned

This paper’s approach is novel in making the hidden decision of “what to drop” **explicit and learnable**. Instead of manually guessing which modality might dominate, the model discovers it and counters it. I was impressed that the method not only boosts accuracy but also gives human-interpretable insights (e.g. “the network learned skin color doesn’t matter for emotion”). It confirms that **spurious correlations** (like identity or noise) can be learned in multimodal nets, and need active regularization. The learned *complementarity score* ζ is also a useful diagnostic: for instance, ζ=0.78 on CREMA-D means “about equal contributions”, whereas ζ=0.10 on PennAction flags that one modality is almost entirely ignored.

On the flip side, a potential weakness is the extra complexity: two VQ-VAEs and a mask loss. Tuning the threshold and λ was done empirically, so one might wonder if there’s a more adaptive way (e.g. learning the sparsity penalty rather than fixing it). Also, the mask is all-or-nothing; there may be cases where a “soft mask” (like gating) could preserve fine-grained information. Finally, this work is for two modalities. Extending it to many modalities (e.g. vision+audio+text) could be challenging, since complementarity is defined per pair of modalities and every pair would need its own score. Despite that, the idea of feature masking is broadly applicable. I’m also intrigued by the transfer learning result reported in the paper: by cutting out task-irrelevant features (like skin tone), the model actually generalized better to a new emotion dataset. This suggests that the mask regularizer acts like a form of de-biasing, which could be useful in fairness-sensitive tasks.

## References

* Tejero-de-Pablos, A. (2024). *Complementary-Contradictory Feature Regularization against Multimodal Overfitting*. WACV 2024.
* Wang, W., Tran, D., & Feiszli, M. (2020). *What makes training multi-modal classification networks hard?* CVPR 2020.
* Peng, X., Wei, Y., Deng, A., Wang, D., & Hu, D. (2022). *Balanced Multimodal Learning via On-the-Fly Gradient Modulation*. CVPR 2022.
* Fan, Y., Xu, W., Wang, H., Wang, J., & Guo, S. (2023). *PMR: Prototypical Modal Rebalance for Multimodal Learning*. CVPR 2023.
* Lee, M., & Pavlovic, V. (2021). *Private-Shared Disentangled Multimodal VAE for Learning Latent Representations*. CVPR 2021.
* Shi, Y., Siddharth, N., Paige, B., & Torr, P. H. S. (2019). *Variational Mixture-of-Experts Autoencoders for Multimodal Generative Models*. NeurIPS 2019.
* Xiao, F., Lee, Y. J., Grauman, K., Malik, J., & Feichtenhofer, C. (2020). *Audiovisual SlowFast Networks for Video Recognition*. arXiv preprint arXiv:2001.08740.
