Official PyTorch Implementation
Tamar Glaser, Emanuel Ben Baruch, Gilad Sharir, Nadav Zamir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group
Abstract
In recent years, the number of personal photos captured has increased significantly, giving rise to new challenges in multi-image understanding and high-level image understanding. Event recognition in personal photo albums presents one challenging scenario where life events are recognized from a disordered collection of images, including both relevant and irrelevant images. Event recognition in images also presents the challenge of high-level image understanding, as opposed to low-level image object classification. In the absence of methods to analyze multiple inputs, previous methods adopted temporal mechanisms, including various forms of recurrent neural networks. However, their effective temporal window is local. In addition, they are not a natural choice given the disordered characteristic of photo albums. We address this gap with a tailor-made solution, combining the power of CNNs for image representation and transformers for album representation to perform global reasoning on image collections, offering a practical and efficient solution for photo album event recognition. Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90% mAP on all datasets. We further explore the related image-importance task in event recognition, demonstrating how the learned attentions correlate with the human-annotated importance for this subjective task, thus opening the door for new applications.
An implementation of our model for photo album event recognition using transformers can be found here.
class TAggregate(nn.Module)
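To make the idea behind the class above concrete, here is a minimal sketch of transformer-based album aggregation. It is not the repository's `TAggregate`: the class name `AlbumAggregator`, the layer sizes, the mean pooling, and the number of classes are all illustrative assumptions. It only shows how per-image CNN embeddings could be fused into a single album-level prediction without relying on image order.

```python
import torch
import torch.nn as nn


class AlbumAggregator(nn.Module):
    """Hypothetical sketch: fuse per-image embeddings into album-level logits."""

    def __init__(self, embed_dim=64, num_heads=4, num_layers=1, num_classes=23):
        # embed_dim, num_heads, num_layers and num_classes are placeholder values.
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=2 * embed_dim,
            dropout=0.0,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, image_embeddings):
        # image_embeddings: (batch, images_per_album, embed_dim), e.g. CNN features.
        # No positional encoding is added, so the order of images is ignored.
        attended = self.encoder(image_embeddings)
        # Average the attended image tokens into one album embedding (an assumption).
        album_embedding = attended.mean(dim=1)
        return self.head(album_embedding)


if __name__ == "__main__":
    model = AlbumAggregator().eval()
    albums = torch.randn(2, 5, 64)  # 2 albums, 5 images each, 64-dim embeddings
    with torch.no_grad():
        print(model(albums).shape)  # one row of logits per album
```

Because self-attention sees every image in the album at once and no positional encoding is used, shuffling the images leaves the album-level output unchanged, which matches the disordered nature of photo albums described in the abstract.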
We provide a model pre-trained on the ML-CUFED dataset, which can be found here.
We provide inference code that demonstrates how to load our model, pre-process some sample albums, and run inference. Example run:
python infer.py \
--model_path=./models_local/peta_32.pth \
--model_name=mtresnetaggregate \
--album_path albums/Personal_sports/44_65592177@N00 \
--threshold=0.9
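How `infer.py` applies `--threshold` is not documented here. A common convention for multi-label models, assumed in the sketch below, is to pass each class logit through a sigmoid and keep the classes whose score exceeds the threshold. The function name `select_events` and the class names in the usage line are hypothetical and used only for illustration.

```python
import math


def select_events(logits, class_names, threshold=0.9):
    """Return (name, score) pairs whose sigmoid score exceeds threshold, best first."""
    # Assumption: one independent sigmoid score per event class (multi-label setup).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    picked = [(name, s) for name, s in zip(class_names, scores) if s > threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up logits and labels, not real model output.
    for name, score in select_events([3.0, -1.0, 5.0], ["Sports", "Wedding", "Graduation"]):
        print(f"{name}: {score:.3f}")
```

Under this assumption, a higher threshold such as 0.9 reports only events the model is very confident about, while a lower value returns more candidate events per album.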
@article{Glaser22PETA,
author = {Tamar Glaser and
Emanuel Ben Baruch and
Gilad Sharir and
Nadav Zamir and
Asaf Noy and
Lihi Zelnik{-}Manor},
title = {PETA: Photo Albums Event Recognition using Transformers Attention},
year = {2022},
url = {https://arxiv.org/},
archivePrefix = {arXiv},
eprint = {},
timestamp = {Wed, 18 Aug 2021 19:52:30 +0200}
}
Several albums from the ML-CUFED dataset (link) are used in this project. Some components of this implementation are adapted from the repository https://github.com/Alibaba-MIIL/ASL.