
EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition

Authors' official PyTorch implementation of EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition. If you use this code for your research, please cite our paper.

EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition
Niki Maria Foteinopoulou and Ioannis Patras


Abstract: Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations for zero-shot classification. To test this, we evaluate the model trained on sample-level descriptions using zero-shot classification on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements over baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted Average Recall and 5% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson's Correlation Coefficient of up to 0.85, which is comparable to human experts' agreement.

Overview

In a nutshell, we follow the CLIP contrastive training paradigm to jointly optimise a video and a text encoder. The two encoders are trained jointly using a contrastive loss over the cosine similarities of the video-text pairings in the mini-batch. More specifically, the video encoder ($E_V$) is composed of the CLIP image encoder ($E_I$) followed by a Transformer encoder that learns the temporal relationships between the frame-level spatial representations. The text encoder ($E_T$) used in our approach is the CLIP text encoder. The weights of the image and text encoders in our model are initialised from the large pre-trained CLIP weights, as FER datasets are not large enough to train a VLM from scratch with adequate generalisation. Contrary to previous video VLM works in both action recognition and FER, we propose using sample-level descriptions for better representation learning, rather than embeddings of class prototypes. This leads to more semantically rich representations, which in turn allow for better generalisation.
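For illustration, the minimal PyTorch sketch below (assuming the OpenAI clip package; the class and function names are ours, not the repository's) shows the structure described above: per-frame features from the CLIP image encoder are passed through a temporal Transformer encoder, and the resulting video embeddings are trained against the CLIP text embeddings of the sample-level descriptions with a symmetric contrastive loss over the in-batch cosine similarities.

# Minimal sketch, not the authors' exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

class VideoEncoder(nn.Module):
    def __init__(self, clip_model, n_layers=2, n_heads=8):
        super().__init__()
        self.clip_model = clip_model                    # provides encode_image / encode_text
        width = clip_model.visual.output_dim            # e.g. 512 for ViT-B/32
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frames):                          # frames: (B, T, 3, H, W), CLIP-preprocessed
        b, t = frames.shape[:2]
        feats = self.clip_model.encode_image(frames.flatten(0, 1))   # (B*T, D) frame features
        feats = self.temporal(feats.view(b, t, -1).float())          # temporal modelling
        return feats.mean(dim=1)                                     # (B, D) video embedding

def contrastive_loss(video_emb, text_emb, logit_scale):
    # Symmetric CLIP-style loss over cosine similarities of the in-batch video-text pairs.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = logit_scale * v @ t.T                      # (B, B) similarity matrix
    labels = torch.arange(len(v), device=v.device)      # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

Here logit_scale corresponds to clip_model.logit_scale.exp(), and the text embeddings come from clip_model.encode_text applied to the tokenised sample-level descriptions.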

Installation

We recommend installing the required packages in a virtual environment created with Python's native venv module, as follows:

$ python -m venv venv
$ source venv/bin/activate
(venv) $ pip install --upgrade pip
(venv) $ pip install -r requirements.txt

For using the aforementioned virtual environment in a Jupyter Notebook, you need to manually add the kernel as follows:

(venv) $ python -m ipykernel install --user --name=venv

Downstream task weights

The weights used for the downstream task (without the FC layer) can be found here.
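As a rough illustration (the checkpoint filename below is a placeholder, and VideoEncoder refers to the sketch in the Overview section, not to a class in this repository), such a checkpoint without the FC layer can be loaded into a backbone by passing strict=False:

# Hypothetical loading example; the filename and model class are placeholders.
import clip
import torch

clip_model, _ = clip.load("ViT-B/32", device="cpu")
model = VideoEncoder(clip_model)                                       # sketch class from the Overview
state_dict = torch.load("emoclip_backbone.pth", map_location="cpu")    # placeholder path
missing, unexpected = model.load_state_dict(state_dict, strict=False)  # mismatched FC keys are skipped
print("missing keys:", missing)
print("unexpected keys:", unexpected)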

Acknowledgements

This work was supported by an EPSRC DTP studentship (No. EP/R513106/1) and the EU H2020 project AI4Media (No. 951911). This research utilised Queen Mary's Apocrita HPC facility, supported by QMUL Research-IT (http://doi.org/10.5281/zenodo.438045).

Citation

@inproceedings{foteinopoulou_emoclip_2024,
	title = {{EmoCLIP}: {A} {Vision}-{Language} {Method} for {Zero}-{Shot} {Video} {Facial} {Expression} {Recognition}},
	author = {Foteinopoulou, Niki Maria and Patras, Ioannis},
	year = {2024},
	booktitle = {The 18th IEEE International Conference on Automatic Face and Gesture Recognition}
}
