This is the implementation code of "Resisting Noise in Pseudo-Labels: Audible Video Event Parsing with Evidential Learning".
The Audio-Visual Video Parsing (AVVP) task generally follows a weakly-supervised learning setting, since only video-level labels are provided. Most existing works first generate modality-wise pseudo-labels and then learn to parse audio or visual events from audible videos. However, this paradigm inevitably results in two defects: 1) the generated pseudo-labels for each modality are not fully reliable, which may confuse models when adopted as supervision signals for discriminating modalities; 2) the absence of temporal supervision increases the ambiguity of localizing foregrounds in videos, further making models prone to disturbance by noisy labels. To tackle these problems, we propose a novel AVVP framework termed Noise-Resistant Event Parsing (NREP), which introduces evidential deep learning to overcome the limitations of noisy pseudo supervision. By perceiving meaningful video content and learning evidence for modality dependencies, our method suppresses the disturbance of noise in the generated pseudo-labels and thus achieves remarkable performance with different pseudo-label generation strategies. We evaluate NREP on two AVVP benchmark datasets and demonstrate that it consistently establishes a new state of the art.
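As background, the core evidence-to-opinion computation used in evidential deep learning (Subjective Logic over a Dirichlet distribution) can be sketched as follows. This is a generic illustration of the technique, not the exact NREP prediction head:

```python
# Generic Subjective Logic sketch for evidential deep learning (EDL).
# The network predicts non-negative per-class evidence; beliefs and an
# explicit uncertainty mass are derived from it (NREP's head may differ).

def subjective_logic_opinion(evidence):
    """Convert per-class evidence into belief masses, an uncertainty
    mass, and expected class probabilities."""
    k = len(evidence)                      # number of event categories
    alpha = [e + 1.0 for e in evidence]    # Dirichlet parameters
    s = sum(alpha)                         # Dirichlet strength
    belief = [e / s for e in evidence]     # per-class belief masses
    uncertainty = k / s                    # leftover mass = uncertainty
    prob = [a / s for a in alpha]          # expected class probabilities
    return belief, uncertainty, prob

# Low total evidence yields high uncertainty; strong evidence for one
# class yields a confident belief for that class.
belief, uncertainty, prob = subjective_logic_opinion([9.0, 0.0, 0.0])
```

Note that the belief masses and the uncertainty mass always sum to one, which is what lets the model abstain on unreliable (noisy) pseudo-labels instead of forcing a confident class prediction.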
- We propose a novel framework termed Noise-Resistant Event Parsing (NREP) for AVVP, which utilizes evidential deep learning to overcome the noise in generated pseudo-labels. Instead of using conventional additive probability, it tackles this task by computing probabilities with Subjective Logic for each event category.
- We design a dual evidential learning architecture, which contains a Modality-wise Evidential Learning (MEL) module and a Temporal-wise Evidential Learning (TEL) module, to adapt evidential deep learning (EDL) to the modal and temporal dimensions. This design defends the model against noise when discriminating modality dependencies and the temporal foreground-background of each video event simultaneously.
- To coordinate the modality-wise and temporal-wise evidential learning branches, we further propose an attention consistency learning mechanism, which keeps the model's perception of meaningful temporal foregrounds consistent across the two branches.
- We conduct extensive experiments, and the results demonstrate that our method remarkably outperforms other state-of-the-art methods.
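The attention consistency idea above can be illustrated with a hypothetical loss that pulls the temporal attention distributions of the two branches toward each other. The function name, the use of symmetric KL divergence, and the branch inputs are illustrative assumptions; the actual NREP objective may differ:

```python
# Hypothetical sketch of an attention-consistency objective between the
# MEL and TEL branches (illustrative only; not the exact NREP loss).
import math

def attention_consistency_loss(attn_mel, attn_tel, eps=1e-8):
    """Symmetric KL divergence between two temporal attention
    distributions, encouraging both branches to attend to the same
    foreground segments."""
    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps))
                   for pi, qi in zip(p, q))
    return 0.5 * (kl(attn_mel, attn_tel) + kl(attn_tel, attn_mel))
```

The loss is zero when both branches produce identical attention over the temporal segments and grows as they diverge.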
Please follow VALOR and CMPAE to download the off-the-shelf visual and audio features.
Please use the following command to train our model:

```shell
sh run.sh
```

If you want to evaluate a pre-trained model, please use the following command:


