ESA: External Space Attention Aggregation for Image-Text Retrieval

The PyTorch code for the TCSVT 2023 paper “ESA: External Space Attention Aggregation for Image-Text Retrieval” (finished in 2021).

We referred to the implementations of VSE++, SCAN, and vse_infty to build our codebase.

📖 Introduction

Due to the large gap between the vision and language modalities, effective and efficient image-text retrieval remains an unsolved problem. Recent work pursues retrieval accuracy alone, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often causes retrieval time to explode when deployed on large-scale databases, and the latter relies heavily on extra corpora to learn better alignment in the feature space while obscuring the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under spatial dimensional attention. Based on this flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL) to expand the representation space of samples and optimize the alignment of the embedding space. Extensive experiments demonstrate the effectiveness of our method on two benchmark datasets. With identical visual and textual backbones, our single model outperforms the ensemble models of similar methods, and our ensemble model further extends this advantage. Meanwhile, compared with the vision-language pre-training embedding-based method that uses $83\times$ more image-text pairs than ours, our approach not only achieves better performance but is also $3\times$ faster at retrieval.

⚖️ The Reported Results

Flickr30K dataset

COCO dataset

📌 Pretrained Model Weight

The following table shows partial results of the ensemble model ESA* and the download links for the pretrained weights on the Flickr30K and COCO datasets. The folder "ESA_BIGRU" contains the code that uses BiGRU as the textual backbone; the folder "ESA_BERT" contains the code that uses BERT-base as the textual backbone.

Dataset	Visual Backbone	Text Backbone	IR1	IR5	TR1	TR5	Rsum	Link
Flickr30K	BUTD region	BiGRU	83.1	96.3	62.4	87.2	520.2	Here
Flickr30K	BUTD region	BERT-base	84.6	96.6	66.3	88.8	528.0	Here
COCO 5-fold 1K	BUTD region	BiGRU	80.4	96.5	64.2	91.3	527.6	Here
COCO 5K	BUTD region	BiGRU	59.1	85.5	41.8	72.3	433.5	Here
COCO 5-fold 1K	BUTD region	BERT-base	81.0	96.9	66.4	92.2	531.9	Here
COCO 5K	BUTD region	BERT-base	61.1	86.6	43.9	74.1	443.0	Here
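
For readers unfamiliar with the metrics: columns such as IR1 or TR5 appear to be Recall@K scores for one retrieval direction, and Rsum is conventionally the sum of Recall@1, Recall@5 and Recall@10 over both directions, which is why it exceeds the sum of the four columns shown. The sketch below illustrates that convention on a toy similarity matrix. It is not taken from this repository; the function names `recall_at_k` and `rsum` and the toy data are hypothetical.

```python
def recall_at_k(sims, positives, k):
    """Percentage of queries whose top-k gallery items contain a correct match.

    sims[q][g] is the similarity between query q and gallery item g;
    positives[q] is the set of gallery indices that are correct for query q.
    """
    hits = 0
    for row, pos in zip(sims, positives):
        # Gallery indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda g: row[g], reverse=True)
        if pos.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sims)


def rsum(sims_i2t, pos_i2t, sims_t2i, pos_t2i, ks=(1, 5, 10)):
    """Sum of Recall@K over both retrieval directions (assumed convention)."""
    total = 0.0
    for k in ks:
        total += recall_at_k(sims_i2t, pos_i2t, k)
        total += recall_at_k(sims_t2i, pos_t2i, k)
    return total


# Toy data (hypothetical): 2 images, 4 captions, 2 captions per image.
sims_i2t = [
    [0.9, 0.2, 0.1, 0.3],  # image 0 vs. captions 0..3
    [0.8, 0.1, 0.4, 0.7],  # image 1 vs. captions 0..3
]
pos_i2t = [{0, 1}, {2, 3}]  # correct captions for each image

# The text-to-image direction uses the transposed similarity matrix.
sims_t2i = [list(col) for col in zip(*sims_i2t)]
pos_t2i = [{0}, {0}, {1}, {1}]  # correct image for each caption

print("image-to-text R@1:", recall_at_k(sims_i2t, pos_i2t, 1))
print("text-to-image R@1:", recall_at_k(sims_t2i, pos_t2i, 1))
print("Rsum:", rsum(sims_i2t, pos_i2t, sims_t2i, pos_t2i))
```

On the real benchmarks each image has five captions, so the positive sets are larger, but the ranking logic is the same.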

🔧 Setup and Environments

Python: 3.6
GPU: RTX 3090
OS: Ubuntu 14.04.6 LTS

Install packages:

conda env create -f ESA.yaml

📁 Dataset

All datasets used in the experiments are organized in the following manner:

data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── images   # raw coco images
│        ├── train2014
│        └── val2014
│  
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── flickr30k-images   # raw Flickr30K images
│          ├── xxx.jpg
│          └── ...
│   
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)

The download links for the original COCO/F30K images, the precomputed BUTD features, and the corresponding vocabularies are available from the official repo of SCAN. The precomp folders contain pre-computed BUTD region features, data/coco/images contains raw MS-COCO images, and data/f30k/flickr30k-images contains raw Flickr30K images. Because the download link for the pre-computed features in SCAN seems to have been taken down, the link provided by the author of vse_infty can be used instead; it contains a copy of these files.
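
Before training, it can help to confirm that the folder layout matches the tree above. The following is a minimal sketch using only the standard library. The helper `missing_entries` and the data root `data` are assumptions for illustration, and only the paths that appear explicitly in the tree are checked (files elided there with "......" are left out).

```python
from pathlib import Path

# Paths copied from the directory tree above; elided entries are not listed.
EXPECTED = [
    "coco/precomp/train_ids.txt",
    "coco/precomp/train_caps.txt",
    "coco/images/train2014",
    "coco/images/val2014",
    "f30k/precomp/train_ids.txt",
    "f30k/precomp/train_caps.txt",
    "f30k/flickr30k-images",
    "vocab",
]


def missing_entries(data_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "data" is an assumed location; point this at your own data folder.
    missing = missing_entries("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed entries were found.")
```

The raw image folders are listed for completeness; if you only train on the pre-computed BUTD features, you may not need them.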

🔍 Training and Evaluation

Training on the Flickr30K or COCO dataset:

Switch to the shell folder in the corresponding path. For example:

cd ./ESA_BIGRU/shell/

Run ./train_xxx_f30k.sh or ./train_xxx_coco.sh. For example:

sh train_GRU_f30k.sh

Evaluation: after changing the default data and model paths to your own, run the following commands.

cd ../
python eval_ensemble.py
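
The script name suggests that the ESA* numbers come from combining several trained models. One common way to do this in embedding-based retrieval codebases is to average the image-text similarity matrices of the individual models before ranking. Whether eval_ensemble.py does exactly this is an assumption, so please check the script itself. The sketch below only illustrates the averaging idea; the helper `average_similarities` and the toy matrices are hypothetical.

```python
def average_similarities(*sim_matrices):
    """Element-wise mean of equally shaped similarity matrices (lists of lists)."""
    if not sim_matrices:
        raise ValueError("at least one similarity matrix is required")
    n_rows = len(sim_matrices[0])
    n_cols = len(sim_matrices[0][0])
    for sims in sim_matrices:
        if len(sims) != n_rows or any(len(row) != n_cols for row in sims):
            raise ValueError("all similarity matrices must have the same shape")
    n = len(sim_matrices)
    return [
        [sum(sims[i][j] for sims in sim_matrices) / n for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy example (hypothetical): two models scoring 2 images against 2 captions.
# The correct caption for image i is caption i.
sims_model_a = [[1.0, 0.0], [0.5, 0.0]]  # model A ranks image 1 incorrectly
sims_model_b = [[0.0, 0.5], [0.0, 1.0]]  # model B ranks image 0 incorrectly
sims_ensemble = average_similarities(sims_model_a, sims_model_b)

# Index of the best-scoring caption for each image under the averaged scores.
best = [max(range(len(row)), key=lambda j: row[j]) for row in sims_ensemble]
print(sims_ensemble, best)
```

In this toy case each model gets one image wrong on its own, while the averaged scores rank both images correctly.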

📝 Citation

If this codebase is useful to you, please cite our work:

@article{zhu2023esa,
  title={ESA: External Space Attention Aggregation for Image-Text Retrieval},
  author={Zhu, Hongguang and Zhang, Chunjie and Wei, Yunchao and Huang, Shujuan and Zhao, Yao},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2023},
  publisher={IEEE}
}

🐼 Contacts

If you have any questions, please feel free to contact me: zhuhongguang1103@gmail.com or hongguang@bjtu.edu.cn.

📚 Reference

Chen, Jiacheng, et al. "Learning the best pooling strategy for visual semantic embedding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
Faghri, Fartash, et al. "VSE++: Improving visual-semantic embeddings with hard negatives." arXiv preprint arXiv:1707.05612 (2017).
Lee, Kuang-Huei, et al. "Stacked cross attention for image-text matching." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
Diao, Haiwen, et al. Cross-modal_Retrieval_Tutorial
