The PyTorch code of the TCSVT 2023 paper “ESA: External Space Attention Aggregation for Image-Text Retrieval” (Finished in 2021)
We referred to the implementations of VSE++, SCAN and vse_infty to build up our codebase.
Due to the large gap between vision and language modalities, effective and efficient image-text retrieval is still an unsolved problem.
Recent work unilaterally pursues retrieval accuracy through either entangled image-text interaction or brute-force large-scale vision-language pre-training. However, the former often leads to an unacceptable explosion in retrieval time when deployed on large-scale databases, while the latter relies heavily on extra corpora to learn better alignment in the feature space and obscures the contribution of the network architecture.
In this work, we aim to investigate a trade-off to balance effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module to enable element-wise fusion of modal features under spatial dimensional attention. Based on flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL) to expand the representation space of samples and optimize the alignment of embedding space.
The extensive experiments demonstrate the effectiveness of our method on two benchmark datasets.
With identical visual and textual backbones, our single model outperforms the ensemble models of similar methods, and our ensemble model further extends the advantage.
Meanwhile, compared with the vision-language pre-training embedding-based method that used
The following table shows partial results of the ensemble model ESA* and the download links for the pretrained weights on the Flickr30K and COCO datasets. The folder "ESA_BIGRU" provides the code that uses BiGRU as the textual backbone. Please check out the folder "ESA_BERT" for the code that uses BERT-base as the textual backbone.
Dataset | Visual Backbone | Text Backbone | IR1 | IR5 | TR1 | TR5 | Rsum | Link |
---|---|---|---|---|---|---|---|---|
Flickr30K | BUTD region | BiGRU | 83.1 | 96.3 | 62.4 | 87.2 | 520.2 | Here |
Flickr30K | BUTD region | BERT-base | 84.6 | 96.6 | 66.3 | 88.8 | 528.0 | Here |
COCO 5-fold 1K | BUTD region | BiGRU | 80.4 | 96.5 | 64.2 | 91.3 | 527.6 | Here |
COCO 5K | BUTD region | BiGRU | 59.1 | 85.5 | 41.8 | 72.3 | 433.5 | Here |
COCO 5-fold 1K | BUTD region | BERT-base | 81.0 | 96.9 | 66.4 | 92.2 | 531.9 | Here |
COCO 5K | BUTD region | BERT-base | 61.1 | 86.6 | 43.9 | 74.1 | 443.0 | Here |
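The IR/TR columns above are presumably Recall@K scores, and Rsum is conventionally the sum of R@1, R@5 and R@10 over both retrieval directions (the table lists only R@1 and R@5, so Rsum cannot be recomputed from it). As a rough illustration of how such metrics are typically derived from a similarity matrix, here is a minimal sketch. It assumes one ground-truth match per query at the same index, whereas Flickr30K and COCO pair each image with five captions. The function names and the toy matrix are hypothetical and are not taken from this repository.

```python
def recall_at_k(sims, k):
    """Percentage of queries whose ground-truth match ranks in the top k.

    sims[i][j] is the similarity between query i and candidate j.
    Assumption for this sketch: the match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sims):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sims)


def transpose(sims):
    """Swap queries and candidates to evaluate the opposite direction."""
    return [list(col) for col in zip(*sims)]


def rsum(sims, ks=(1, 5, 10)):
    """Sum of Recall@K over both retrieval directions."""
    forward = sum(recall_at_k(sims, k) for k in ks)
    backward = sum(recall_at_k(transpose(sims), k) for k in ks)
    return forward + backward


# Toy 3x3 similarity matrix (made-up values, not model output).
toy_sims = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
print(recall_at_k(toy_sims, 1))
print(rsum(toy_sims))
```

With real data the five-captions-per-image structure changes the ranking logic, so treat this only as a reading aid for the table.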
Python: 3.6
GPU: RTX 3090
OS: Ubuntu 14.04.6 LTS
Install packages:
conda env create -f ESA.yaml
All datasets used in the experiments are organized in the following manner:
data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │   ├── train_ids.txt
│   │   ├── train_caps.txt
│   │   ├── ......
│   │
│   └── images  # raw COCO images
│       ├── train2014
│       └── val2014
│
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │   ├── train_ids.txt
│   │   ├── train_caps.txt
│   │   ├── ......
│   │
│   └── flickr30k-images  # raw Flickr30K images
│       ├── xxx.jpg
│       └── ...
│
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)
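Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of this repository: the helper name `missing_paths` and the default root `data` are assumptions, and only the entries explicitly named in the tree are checked.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = [
    "coco/precomp/train_ids.txt",
    "coco/precomp/train_caps.txt",
    "coco/images/train2014",
    "coco/images/val2014",
    "f30k/precomp/train_ids.txt",
    "f30k/precomp/train_caps.txt",
    "f30k/flickr30k-images",
    "vocab",
]


def missing_paths(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    # "data" is an assumed location; point this at your own data root.
    missing = missing_paths("data")
    if missing:
        print("Missing entries:")
        for p in missing:
            print("  " + p)
    else:
        print("Dataset layout looks complete.")
```

The raw image folders are only listed for completeness; whether a given training script needs them in addition to the precomputed features is not stated here.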
The download links for original COCO/F30K images, precomputed BUTD features, and corresponding vocabularies are from the official repo of SCAN. The precomp folders contain pre-computed BUTD region features, data/coco/images contains raw MS-COCO images, and data/f30k/flickr30k-images contains raw Flickr30K images.
The download link for the pre-computed features in SCAN appears to have been taken down; the link provided by the author of vse_infty contains a copy of these files.
Training on the Flickr30K or COCO dataset:
- Switch to the shell folder in the corresponding path. For example:
cd ./ESA_BIGRU/shell/
- Run ./train_xxx_f30k.sh or ./train_xxx_coco.sh. For example:
sh train_GRU_f30k.sh
- Evaluation: run the following commands after changing the default data and model paths to your own.
cd ../
python eval_ensemble.py
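This README does not describe how eval_ensemble.py combines models. A common approach in the codebases this repository builds on is to average the image-text similarity matrices of two independently trained models before ranking; the sketch below assumes that scheme. The function names and the scores are hypothetical and do not come from this repository.

```python
def average_similarities(sims_a, sims_b):
    """Element-wise mean of two equally shaped similarity matrices."""
    if len(sims_a) != len(sims_b) or any(
        len(ra) != len(rb) for ra, rb in zip(sims_a, sims_b)
    ):
        raise ValueError("similarity matrices must have the same shape")
    return [
        [(a + b) / 2.0 for a, b in zip(ra, rb)]
        for ra, rb in zip(sims_a, sims_b)
    ]


def top1(sims):
    """Index of the best-scoring candidate for each query (row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Made-up scores from two hypothetical models for 2 queries x 3 candidates.
model_a = [[0.6, 0.5, 0.1], [0.2, 0.7, 0.3]]
model_b = [[0.2, 0.9, 0.1], [0.1, 0.6, 0.4]]
ensemble = average_similarities(model_a, model_b)
print(top1(ensemble))
```

Averaging at the similarity level keeps each model's embedding space untouched, which is why it is a convenient way to ensemble embedding-based retrievers.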
If this codebase is useful to you, please cite our work:
@article{zhu2023esa,
title={ESA: External Space Attention Aggregation for Image-Text Retrieval},
author={Zhu, Hongguang and Zhang, Chunjie and Wei, Yunchao and Huang, Shujuan and Zhao, Yao},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
year={2023},
publisher={IEEE}
}
If you have any questions, please feel free to contact me: zhuhongguang1103@gmail.com or hongguang@bjtu.edu.cn.
- Chen, Jiacheng, et al. "Learning the best pooling strategy for visual semantic embedding." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
- Faghri, Fartash, et al. "VSE++: Improving visual-semantic embeddings with hard negatives." arXiv preprint arXiv:1707.05612 (2017).
- Lee, Kuang-Huei, et al. "Stacked cross attention for image-text matching." Proceedings of the European conference on computer vision (ECCV). 2018.
- Diao, Haiwen, et al. Cross-modal_Retrieval_Tutorial