# Understanding the Challenges of Training Multi-modal Neural Networks


![My Image](images/01.png)




In the evolving landscape of artificial intelligence, multi-modal learning represents a significant frontier, combining various data types such as images, audio, and motion to achieve more comprehensive understanding. However, researchers have encountered a puzzling phenomenon: despite having access to more information, multi-modal networks often perform worse than their uni-modal counterparts. This blog explores why this happens and examines a novel solution called Gradient-Blending that effectively addresses these challenges.

---

## The Paradox of Multi-modal Performance

When training deep learning models on tasks with multiple input modalities, one would logically expect that a multi-modal network would outperform a single-modality network. After all, the multi-modal network receives more information about the target concept. However, empirical evidence reveals the opposite trend across various datasets and modality combinations.


![My Image](images/1.png)


As the results above show, on the Kinetics video classification dataset a visual-only (RGB) model achieves 72.6% top-1 accuracy, while a model combining RGB and audio reaches only 71.4%, a drop of 1.2 percentage points. The drop persists across other modality combinations: RGB + Optical Flow (71.3%), Audio + Optical Flow (58.3%), and all three modalities combined (70.0%). These counter-intuitive results warrant a closer look at the underlying causes and potential solutions.



The discrepancy becomes even more notable when we understand that multi-modal solutions are mathematically a superset of uni-modal solutions. In theory, a well-optimized multi-modal model should be able to learn to ignore less helpful modalities and match or exceed the performance of the best single modality. Yet in practice, this theoretical advantage fails to materialize, pointing to challenges in the optimization process rather than architectural limitations.


---


## Some Fusion Techniques

When seeking to improve multi-modal model performance, researchers have explored various fusion mechanisms. Two notable approaches are Squeeze-and-Excitation (SE) gates and Non-Local (NL) gates.


### Squeeze-and-Excitation (SE) network
Squeeze-and-Excitation (SE) networks represent an architectural enhancement that improves channel interdependence modeling. The SE block adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. It works through a two-step process: first, "squeezing" global spatial information into a channel descriptor via global average pooling, and then "exciting" these channel-wise statistics through a gating mechanism with sigmoid activation.

![My Image](images/2.png)

![My Image](images/3.png)


![My Image](images/4.png)


 This process allows the network to selectively emphasize informative features while suppressing less useful ones, theoretically making it well-suited for multi-modal fusion where different channels might correspond to different modalities.

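To make the squeeze and excite steps concrete, here is a minimal NumPy sketch of a single SE gate. It is an illustration under simplifying assumptions, not the implementation used in the paper: biases are omitted, the weights are random, and the names `se_block`, `w_reduce`, and `w_expand` are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w_reduce, w_expand):
    """Apply a bias-free SE gate to a feature map x of shape (C, H, W)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = x.mean(axis=(1, 2))                 # shape (C,)
    # Excite: bottleneck FC + ReLU, then FC + sigmoid gives gates in (0, 1).
    hidden = np.maximum(w_reduce @ squeezed, 0.0)  # shape (C // r,)
    gates = sigmoid(w_expand @ hidden)             # shape (C,)
    # Recalibrate: scale every channel by its gate.
    return x * gates[:, None, None], gates

rng = np.random.default_rng(0)
channels, height, width, reduction = 8, 4, 4, 2
x = rng.normal(size=(channels, height, width))
w_reduce = rng.normal(size=(channels // reduction, channels))
w_expand = rng.normal(size=(channels, channels // reduction))

y, gates = se_block(x, w_reduce, w_expand)
print(y.shape, gates.round(3))
```

In a fusion setting, the channels of `x` would come from the concatenated per-modality features, so a gate close to zero effectively mutes a channel from one modality.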




### Non-Local (NL) neural networks
Non-Local (NL) neural networks, on the other hand, are designed to capture long-range dependencies in data. Unlike convolutional operations that process one local neighborhood at a time, the non-local operation computes the response at a position as a weighted sum of features from all positions.

![My Image](images/5.png)
![My Image](images/6.png)
![My Image](images/7.png)

 This global context awareness should be beneficial for multi-modal learning, where relationships between modalities might span across the entire feature space rather than just local neighborhoods.

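The weighted sum over all positions can be sketched in a few lines of NumPy. This example assumes the embedded-Gaussian form of the pairwise function, which reduces to a softmax over positions; the output projection and residual connection of a full non-local block are left out, and all names and shapes are illustrative.

```python
import numpy as np

def non_local(x, w_theta, w_phi, w_g):
    """Non-local operation on x of shape (N, C): N positions, C channels."""
    theta = x @ w_theta          # queries, shape (N, D)
    phi = x @ w_phi              # keys, shape (N, D)
    g = x @ w_g                  # values, shape (N, D)
    scores = theta @ phi.T       # pairwise similarities, shape (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)        # normalize over all positions
    # Each output row is a weighted sum of the features at every position.
    return attn @ g, attn

rng = np.random.default_rng(0)
n_positions, n_channels, n_embed = 6, 4, 3
x = rng.normal(size=(n_positions, n_channels))
w_theta = rng.normal(size=(n_channels, n_embed))
w_phi = rng.normal(size=(n_channels, n_embed))
w_g = rng.normal(size=(n_channels, n_embed))

out, attn = non_local(x, w_theta, w_phi, w_g)
print(out.shape, attn.sum(axis=1))
```

Unlike a convolution, every row of `attn` spans all positions, which is what gives the operation its global context.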




![My Image](images/TopAccuracyonKinetics.png)





Despite their sophisticated design and success in single-modality contexts, both SE-gate and NL-gate mechanisms fail to resolve the multi-modal performance paradox. When applied to audio-visual fusion on Kinetics, SE-gate achieves only 71.4% accuracy and NL-gate reaches 71.7% – both still falling short of the RGB-only model's 72.6%. This suggests that the issue lies deeper than the fusion architecture itself.




---


## Core Challenges in Multi-modal Learning

The paper by Wang et al. identifies two primary causes for the multi-modal performance drop:

- ### The Overfitting Problem
The first major challenge is overfitting. Multi-modal networks have significantly higher capacity due to their increased number of parameters. For instance, a late-fusion audio-visual network has nearly twice the parameters of a visual-only network. This increased capacity makes multi-modal networks more prone to overfitting – they memorize training data patterns that don't generalize well to new examples.


- ### The Challenge of Combining Modalities
The second key challenge lies in how different modalities behave during training. Different modalities naturally overfit and generalize at different rates, making joint training with a single optimization strategy inherently sub-optimal. For example, audio models typically overfit much faster than visual models due to differences in data distribution, model capacity, and the inherent complexity of the signal.



---




## Paper Setup

- ### Key Concepts in Multi-modal Neural Networks

    - #### Motion Vectors
        Motion Vectors are two-dimensional vectors used for inter-prediction in video processing, providing an offset from coordinates in one frame to coordinates in a reference frame. Originally developed for video coding to reduce redundancy between frames, motion vectors have found applications in object tracking, action recognition, and person detection. They can be computed using block-matching algorithms and represent temporal information that complements spatial data from individual frames.



    - #### Cross-modal Self-supervised Learning
        Cross-modal Self-supervised Learning leverages the natural correspondence between modalities to learn representations without explicit labels. For example, a system might learn to predict whether an audio clip corresponds to a video clip, using the inherent relationship between these modalities as a supervisory signal. This approach helps models learn rich, transferable representations that capture the semantic relationships between modalities.





    - #### Auxiliary Losses
        Auxiliary Losses provide additional supervision signals during training by attaching secondary objectives to intermediate layers of a neural network. In the context of multi-modal learning, auxiliary losses can help optimize the training process for individual modalities while the main branch loss handles the integrated representation. This technique can stabilize training, reduce vanishing gradient problems for earlier layers, and serve as a form of regularization.



- ### Methodologies currently used:

    - #### Uni-modal Training
    ![My Image](images/8.png)

    Description: Each modality is trained independently.

    Limitations: No interaction or shared learning between modalities.



    - #### Naive Multi-modal Joint Training
    ![My Image](images/9.png)

    Description: Features from both modalities are concatenated and used together.


    Limitation: Assumes both modalities are always aligned.

    - #### Multi-modal Joint Training
    ![My Image](images/10.png)

    Description: Combines both modalities using the fusion techniques discussed above (SE gate or NL gate).

    Limitation: Computationally expensive, and the embeddings of the two modalities can be misaligned.

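The naive joint training setup above can be sketched as late fusion by concatenation followed by a single linear classifier. The feature sizes and names below are hypothetical. The second half of the example illustrates the earlier observation that uni-modal solutions are a special case of multi-modal ones: zeroing the audio rows of the classifier recovers a visual-only model.

```python
import numpy as np

def late_fusion_logits(audio_feat, visual_feat, weight, bias):
    """Concatenate per-modality features and apply one linear classifier."""
    fused = np.concatenate([audio_feat, visual_feat], axis=-1)
    return fused @ weight + bias

rng = np.random.default_rng(0)
batch, d_audio, d_visual, n_classes = 4, 3, 5, 6
audio_feat = rng.normal(size=(batch, d_audio))
visual_feat = rng.normal(size=(batch, d_visual))
weight = rng.normal(size=(d_audio + d_visual, n_classes))
bias = np.zeros(n_classes)

logits = late_fusion_logits(audio_feat, visual_feat, weight, bias)

# Zero the audio rows: the fused classifier now ignores audio entirely.
visual_only_weight = weight.copy()
visual_only_weight[:d_audio] = 0.0
ignoring_audio = late_fusion_logits(audio_feat, visual_feat, visual_only_weight, bias)
visual_only = visual_feat @ weight[d_audio:] + bias
print(np.allclose(ignoring_audio, visual_only))
```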



---

## Solutions Proposed


### Defining OGR and L*

The authors introduce a metric called Overfitting-to-Generalization Ratio (OGR) to quantitatively measure the quality of learning between model checkpoints:


![My Image](images/11.png)

Where:

- ΔO is the change in overfitting between two checkpoints, where overfitting is the gap between training loss and validation loss
- ΔG is the change in generalization over the same interval, i.e., the reduction in validation loss


A lower OGR indicates better learning, as it means the model is improving on validation data with minimal increase in overfitting. L* represents the "true" loss with respect to the target distribution, which is approximated using validation loss in practice.


This metric provides a principled way to evaluate whether learning is proceeding efficiently, focusing not just on absolute performance but on the quality of the learning process itself. By monitoring OGR during training, researchers can better understand how well different components of a multi-modal network are learning generalizable patterns versus memorizing training data.

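As a rough illustration, OGR between two checkpoints can be computed from four loss values. The sketch below uses validation loss as the stand-in for L*, as described above. Measuring overfitting as validation loss minus training loss is a sign convention chosen for this example; it does not change the result because OGR takes an absolute value. The loss numbers are invented.

```python
import math

def ogr(train_start, val_start, train_end, val_end):
    """Overfitting-to-generalization ratio between two checkpoints."""
    # Change in overfitting: growth of the train/validation gap.
    delta_o = (val_end - train_end) - (val_start - train_start)
    # Change in generalization: reduction in validation loss.
    delta_g = val_start - val_end
    if delta_g == 0:
        return math.inf  # no generalization gain at all
    return abs(delta_o / delta_g)

# Validation loss improves a lot while the gap grows only slightly.
healthy = ogr(train_start=2.0, val_start=2.1, train_end=1.5, val_end=1.7)
# Training loss keeps falling but validation loss barely moves.
overfit = ogr(train_start=1.5, val_start=1.7, train_end=0.8, val_end=1.6)
print(round(healthy, 3), round(overfit, 3))
```

The second interval has a much larger ratio, which is the signature of a model that is mostly memorizing.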

### Blending through OGR: A Convex Formulation

The key insight of Gradient-Blending is that by optimally combining gradient estimates from multiple modalities, we can minimize OGR and improve generalization. The paper formulates this as a convex optimization problem:


Let $\{v_k\}_{k=0}^M$ be a set of estimates for $\nabla \mathcal{L}^*$ whose overfitting satisfies

$$
\mathbb{E} \left[ \langle \nabla \mathcal{L}_T - \nabla \mathcal{L}^*, v_k \rangle \langle \nabla \mathcal{L}_T - \nabla \mathcal{L}^*, v_j \rangle \right] = 0 \quad \text{for } j \ne k.
$$

Given the constraint $\sum_k w_k = 1$, the optimal weights $w_k \in \mathbb{R}$ for the problem

![My Image](images/12.png)

are given in closed form by

![My Image](images/13.png)

where $\sigma_k^2 \equiv \mathbb{E} \left[ \langle \nabla \mathcal{L}_T - \nabla \mathcal{L}^*, v_k \rangle^2 \right]$ and $Z = \sum_k \frac{\langle \nabla \mathcal{L}^*, v_k \rangle}{2\sigma_k^2}$ is a normalizing constant.



![My Image](images/14.png)



This elegant solution has an intuitive interpretation: modalities that generalize well (high G<sub>k</sub>) and overfit little (low O<sub>k</sub>) receive higher weights, while modalities that overfit severely relative to their generalization are downweighted. The quadratic penalty on overfitting (O²<sub>k</sub>) ensures that modalities with extreme overfitting are substantially penalized.

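A small numeric sketch of this weighting rule, assuming the practical form in which each weight is proportional to G<sub>k</sub> / O<sub>k</sub>² and the weights are normalized to sum to one. The three heads and their G and O values are invented for illustration, and the function assumes every G<sub>k</sub> is positive and every O<sub>k</sub> is non-zero.

```python
import numpy as np

def blend_weights(generalization, overfitting):
    """Weights proportional to G_k / O_k**2, normalized to sum to 1."""
    g = np.asarray(generalization, dtype=float)
    o = np.asarray(overfitting, dtype=float)
    raw = g / o**2
    return raw / raw.sum()

# Hypothetical measurements for three heads, e.g. RGB, audio, and joint.
generalization = [0.5, 0.2, 0.6]
overfitting = [0.5, 1.0, 0.8]
weights = blend_weights(generalization, overfitting)
print(weights.round(3))
```

The second head generalizes least and overfits most, so it receives the smallest weight. The third head has the largest G but still ranks below the first because of its larger overfitting term.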
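#### Blend Loss


Given the weights *w<sub>i</sub>*, the blended loss is the weighted sum of the *k* per-modality losses and the loss of the joint multi-modal head (index *k*+1):


$$
\mathcal{L}_{\text{blend}} = \sum_{i=1}^{k+1} w_i \mathcal{L}_i
$$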

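The blended loss is straightforward to compute once every head produces its own logits. The sketch below assumes a classification setting with cross-entropy on each head; the random logits, the labels, and the weights are placeholders, not values from the paper.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy for logits of shape (batch, classes)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def blend_loss(head_logits, labels, weights):
    """Weighted sum of the per-head losses (uni-modal heads plus joint head)."""
    losses = np.array([cross_entropy(logits, labels) for logits in head_logits])
    return float(np.dot(weights, losses)), losses

rng = np.random.default_rng(0)
labels = np.array([0, 2, 1, 2])
# Two uni-modal heads plus one joint head, each with 3 classes.
head_logits = [rng.normal(size=(4, 3)) for _ in range(3)]
weights = np.array([0.6, 0.1, 0.3])

total, per_head = blend_loss(head_logits, labels, weights)
print(round(total, 4), per_head.round(4))
```

Because the weights multiply the losses, they also scale the gradient each head sends back into the shared network, which is how the blending takes effect during training.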


### Implementation Approaches: Online vs. Offline



- #### Offline Gradient-Blending

Offline Gradient-Blending computes weights only once at the beginning of training and uses this fixed set of weights throughout the entire training process.

![My Image](images/15.png)

- #### Online Gradient-Blending

Online Gradient-Blending is the full version of the method: it re-estimates the weights at regular intervals (every *n* epochs, a span the paper calls a super-epoch) and trains with the updated weights until the next re-estimation. It yields additional accuracy gains but is more complex to implement.


![My Image](images/16.png)

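The difference between the two variants comes down to how often the weights are re-estimated. The following control-flow sketch makes that explicit. It is not the paper's algorithm verbatim: `estimate_weights` and `train` are hypothetical callbacks standing in for the weight-estimation and training routines, and the stubs below only record how they are called.

```python
def g_blend_schedule(total_epochs, super_epoch, estimate_weights, train, online=True):
    """Run training in super-epochs, re-estimating weights only when online."""
    weights = estimate_weights(0)  # both variants estimate once up front
    for start in range(0, total_epochs, super_epoch):
        if online and start > 0:
            weights = estimate_weights(start)  # online: refresh each super-epoch
        train(weights, start, min(super_epoch, total_epochs - start))

estimate_calls, train_calls = [], []

def fake_estimate(epoch):
    # Stub: a real version would measure G and O for every head here.
    estimate_calls.append(epoch)
    return [1.0 / 3] * 3

def fake_train(weights, start_epoch, num_epochs):
    # Stub: a real version would minimize the blended loss here.
    train_calls.append((start_epoch, num_epochs))

g_blend_schedule(10, 4, fake_estimate, fake_train, online=True)
print(estimate_calls, train_calls)
```

With `online=False` the same loop calls `estimate_weights` only once, which is why the offline variant is simpler and cheaper to run.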




---

## Results


<div style="display: flex; justify-content: space-around; align-items: flex-start; gap: 10px;">

  <div style="text-align: center;">
    <img src="images/17.png" alt="Offline vs. online Gradient-Blending results" width="700"/>
    <p><strong>Both offline and online Gradient-Blending outperform naive late fusion and RGB-only training.</strong> Offline G-Blend is slightly less accurate than the online version, but much simpler to implement.</p>
  </div>

</div>




<div style="display: flex; justify-content: center; gap: 10px;">

  <img src="images/18.png" alt="Online G-Blend vs. naive training on each super-epoch" width="400" height="200"/>
  <img src="images/19.png" alt="G-Blend with different optimizers" width="400"/>

</div>
<p style="text-align: center; font-style: italic;">
  <strong>(a)</strong> Online G-Blend outperforms naive training on each super-epoch.
  <strong>(b)</strong> G-Blend with different optimizers.

</p>


<div style="display: flex; justify-content: space-around; align-items: flex-start; gap: 10px;">

  <div style="text-align: center;">
    <img src="images/20.png" alt="G-Blend results on different multi-modal problems" width="700" height="150"/>
    <p><strong>Gradient-Blending (G-Blend) works on different multi-modal problems.</strong></p>
  </div>

</div>



### Comparison with State-of-the-Art


#### 1. Kinetics

<div style="display: flex; justify-content: space-around; align-items: flex-start; gap: 10px;">

  <div style="text-align: center;">
    <img src="images/sota_kinetics.png" alt="Comparison with state-of-the-art methods on Kinetics">
    <p>Comparison with state-of-the-art methods on <strong>Kinetics Dataset.</strong></p>
  </div>

</div>


#### 2. AudioSet

<div style="display: flex; justify-content: space-around; align-items: flex-start; gap: 10px;">

  <div style="text-align: center;">
    <img src="images/sota_audioSet.png" alt="Comparison with state-of-the-art methods on AudioSet">
    <p>Comparison with state-of-the-art methods on <strong>AudioSet.</strong></p>
  </div>

</div>



#### 3. EPIC-Kitchen

<div style="display: flex; justify-content: space-around; align-items: flex-start; gap: 10px;">

  <div style="text-align: center;">
    <img src="images/sota_epic.png" alt="Comparison with state-of-the-art methods on EPIC-Kitchen">
    <p>Comparison with state-of-the-art methods on <strong>EPIC-Kitchen.</strong></p>
  </div>

</div>




---

## Conclusion

This blog has explored the paradoxical challenge in multi-modal neural networks where additional information leads to worse performance. We've identified two primary causes: increased overfitting due to higher model capacity and the difficulties of jointly training modalities that learn at different rates. The Gradient-Blending technique offers an elegant and effective solution by optimally combining modalities based on their overfitting behaviors.



---

#### Credits: 
```
@article{wang2020what,
  title={What Makes Training Multi-Modal Classification Networks Hard?},
  author={Weiyao Wang and Du Tran and Matt Feiszli},
  journal={arXiv preprint arXiv:1905.12681},
  year={2020}
}
```

--- 


<div style="display: flex; justify-content: space-around; align-items: flex-start; gap: 10px;">

  <div style="text-align: center;">
    <img src="images/22.png" alt="Algorithm 1">
  </div>

</div>



---