
Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion


App

The official repository for "Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion", accepted to CVPRW 2024.

Abstract

Compound Expression Recognition (CER), a part of affective computing, is a novel task in intelligent human-computer interaction and multimodal user interfaces. We propose a novel audio-visual method for CER. Our method relies on emotion recognition models that fuse modalities at the emotion probability level, while decisions about compound expressions are based on pair-wise sums of weighted emotion probability distributions. Notably, our method does not use any training data specific to the target task, so the problem is a zero-shot classification task. The method is evaluated in multi-corpus training and cross-corpus validation setups. Without training on the target corpus or target task, we achieve F1-scores of 32.15% and 25.56% on the AffWild2 and C-EXPR-DB test subsets, respectively. Our method is therefore on par with methods trained on the target corpus or target task.
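To make the decision rule concrete, here is a minimal sketch of probability-level fusion followed by pair-wise compound scoring. This is an illustration, not the repository's code: the emotion order, the compound set (modeled on the C-EXPR-DB labels), the uniform model weights, and all numbers are assumptions.

```python
import numpy as np

# Illustrative sketch of the decision rule described in the abstract: fuse
# per-model emotion probabilities with a weighted average, then score each
# compound expression as the pair-wise sum of its two constituent emotion
# probabilities. Emotion order, compound set, and weights are assumptions.

EMOTIONS = ["neutral", "anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Assumed compound classes as pairs of basic emotions (C-EXPR-DB-style labels).
COMPOUNDS = {
    "Fearfully Surprised":   ("fear", "surprise"),
    "Happily Surprised":     ("happiness", "surprise"),
    "Sadly Surprised":       ("sadness", "surprise"),
    "Disgustedly Surprised": ("disgust", "surprise"),
    "Angrily Surprised":     ("anger", "surprise"),
    "Sadly Fearful":         ("sadness", "fear"),
    "Sadly Angry":           ("sadness", "anger"),
}

def fuse_modalities(probs_per_model, model_weights):
    """Weighted average of per-model emotion probability distributions."""
    fused = np.average(np.stack(probs_per_model), axis=0, weights=model_weights)
    return fused / fused.sum()  # renormalize for safety

def predict_compound(emotion_probs):
    """Pick the compound whose two emotion probabilities sum highest."""
    idx = {e: i for i, e in enumerate(EMOTIONS)}
    scores = {
        name: emotion_probs[idx[a]] + emotion_probs[idx[b]]
        for name, (a, b) in COMPOUNDS.items()
    }
    return max(scores, key=scores.get)

# Toy outputs of static visual (VS), dynamic visual (VD), and audio (A) models.
vs = np.array([0.05, 0.05, 0.05, 0.30, 0.05, 0.10, 0.40])
vd = np.array([0.05, 0.05, 0.05, 0.25, 0.05, 0.15, 0.40])
a  = np.array([0.10, 0.05, 0.05, 0.20, 0.05, 0.15, 0.40])

fused = fuse_modalities([vs, vd, a], model_weights=[1.0, 1.0, 1.0])
print(predict_compound(fused))  # -> "Fearfully Surprised"
```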

Quick Start

This repository introduces a new zero-shot audio-visual method for compound expression recognition.

Model weights are available at models. Download them and place them in src/weights. You will also need the weights of the RetinaFace face detection model; please refer to the original repository.

To predict compound expressions for a video, run:

python run.py --path_video <your path to a video file> --path_save <your path to save results>
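run.py processes one video per call. For batch processing, here is a minimal sketch that loops the command above over a folder of videos; the videos/ and results/ folder names are hypothetical, and run.py is assumed to be invoked from the repository root.

```python
# Minimal batch-processing sketch around the CLI above.
# The videos/ and results/ folder names are hypothetical.
import subprocess
from pathlib import Path

video_dir = Path("videos")
save_dir = Path("results")
save_dir.mkdir(exist_ok=True)

for video in sorted(video_dir.glob("*.mp4")):
    # One run.py invocation per video, saving results under the video's stem.
    subprocess.run(
        [
            "python", "run.py",
            "--path_video", str(video),
            "--path_save", str(save_dir / video.stem),
        ],
        check=True,
    )
```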

Example predictions obtained by the static visual (VS), dynamic visual (VD), audio (A), and audio-visual (AV) models:

[Figure: example predictions of the VS, VD, A, and AV models]

Citation

If you use AVCER in your research, please cite:

@inproceedings{RYUMINA2024CVPRW,
  title        = {Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion},
  author       = {Elena Ryumina and Maxim Markitantov and Dmitry Ryumin and Heysem Kaya and Alexey Karpov},
  booktitle    = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops},
  year         = {2024},
  pages        = {1--9},
}

Acknowledgments

Parts of this project page were adapted from the Nerfies page.

Website License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
