ESA: External Space Attention Aggregation for Image-Text Retrieval


The PyTorch code of the TCSVT 2023 paper “ESA: External Space Attention Aggregation for Image-Text Retrieval” (completed in 2021).

Our codebase builds on the implementations of VSE++, SCAN, and vse_infty.

📖 Introduction

Due to the large gap between the vision and language modalities, effective and efficient image-text retrieval remains an open problem. Recent progress has unilaterally pursued retrieval accuracy, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often leads to an unacceptable explosion in retrieval time when deployed on large-scale databases, while the latter relies heavily on extra corpora to learn better alignment in the feature space and obscures the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. On the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under spatial dimensional attention. Building on this flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL), which expands the representation space of samples and optimizes the alignment of the embedding space. Extensive experiments demonstrate the effectiveness of our method on two benchmark datasets. With identical visual and textual backbones, our single model outperforms the ensemble models of comparable methods, and our ensemble model extends this advantage further. Meanwhile, compared with a vision-language pre-training embedding-based method trained on $83\times$ more image-text pairs, our approach not only surpasses it in performance but is also $3\times$ faster at retrieval.
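To make the aggregation idea concrete, below is a minimal PyTorch sketch of attention-weighted pooling over a set of local features (image regions or words). This is not the paper's exact ESA module: the external-memory size, the scoring head, and all names are illustrative assumptions; see the code in ESA_BIGRU / ESA_BERT for the real implementation.

```python
import torch
import torch.nn as nn

class ExternalSpaceAttentionSketch(nn.Module):
    """Illustrative sketch only: attention-weighted aggregation of local
    features into one global embedding. Not the paper's exact ESA module."""

    def __init__(self, dim: int, mem_size: int = 64):
        super().__init__()
        # A learnable external unit projects each local feature into a shared
        # space, from which a scalar spatial attention score is read out.
        self.memory = nn.Linear(dim, mem_size, bias=False)
        self.score = nn.Linear(mem_size, 1)

    def forward(self, feats: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # feats: (batch, n_local, dim) region or word embeddings.
        attn = self.score(torch.tanh(self.memory(feats)))  # (batch, n_local, 1)
        if mask is not None:
            # mask: (batch, n_local) booleans; padded positions get -inf.
            attn = attn.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        attn = attn.softmax(dim=1)
        # Element-wise weighted sum pools the set into one global embedding.
        return (attn * feats).sum(dim=1)  # (batch, dim)

# Usage: pool 36 BUTD region features of dim 1024 into one image embedding.
esa = ExternalSpaceAttentionSketch(dim=1024)
regions = torch.randn(8, 36, 1024)
img_emb = esa(regions)  # (8, 1024)
```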

⚖️ The Reported Results

Flickr30K dataset

[Figure: reported retrieval results on the Flickr30K dataset]

COCO dataset

[Figure: reported retrieval results on the COCO dataset]

📌 Pretrained Model Weight

The following table shows partial results of the ensemble model ESA* and download links for the pretrained weights on the Flickr30K and COCO datasets. The folder "ESA_BIGRU" contains the code for using BiGRU as the textual backbone; see the folder "ESA_BERT" for the code using BERT-base as the textual backbone.

| Dataset | Visual Backbone | Text Backbone | IR1 | IR5 | TR1 | TR5 | Rsum | Link |
|---|---|---|---|---|---|---|---|---|
| Flickr30K | BUTD region | BiGRU | 83.1 | 96.3 | 62.4 | 87.2 | 520.2 | Here |
| Flickr30K | BUTD region | BERT-base | 84.6 | 96.6 | 66.3 | 88.8 | 528.0 | Here |
| COCO 5-fold 1K | BUTD region | BiGRU | 80.4 | 96.5 | 64.2 | 91.3 | 527.6 | Here |
| COCO 5K | BUTD region | BiGRU | 59.1 | 85.5 | 41.8 | 72.3 | 433.5 | Here |
| COCO 5-fold 1K | BUTD region | BERT-base | 81.0 | 96.9 | 66.4 | 92.2 | 531.9 | Here |
| COCO 5K | BUTD region | BERT-base | 61.1 | 86.6 | 43.9 | 74.1 | 443.0 | Here |
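After downloading, you can quickly inspect a checkpoint before evaluation. A minimal sketch, assuming the weights are saved as a standard PyTorch checkpoint dictionary (the file name and keys are hypothetical; check the repo's evaluation code for the actual format):

```python
import torch

# Load the downloaded weights on CPU; the file name below is a placeholder.
ckpt = torch.load("esa_f30k_bigru.pth", map_location="cpu")

# List the top-level entries (e.g. model state dict, training options).
print(list(ckpt.keys()))
```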

🔧 Setup and Environments

Python: 3.6
GPU: RTX 3090
OS: Ubuntu 14.04.6 LTS

Install packages:

conda env create -f ESA.yaml
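Then activate the environment before running any scripts; a minimal sketch, assuming the environment defined in ESA.yaml is named ESA:

```bash
# Activate the conda environment created from ESA.yaml (the name "ESA" is assumed).
conda activate ESA
python -c "import torch; print(torch.__version__)"  # quick sanity check
```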

📁 Dataset

All datasets used in the experiments are organized in the following manner:

data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── images   # raw COCO images
│        ├── train2014
│        └── val2014
│  
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── flickr30k-images   # raw Flickr30K images
│          ├── xxx.jpg
│          └── ...
│   
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)

The download links for the original COCO/F30K images, the precomputed BUTD features, and the corresponding vocabularies can be found in the official repo of SCAN. The precomp folders contain the pre-computed BUTD region features, data/coco/images contains the raw MS-COCO images, and data/f30k/flickr30k-images contains the raw Flickr30K images. Because the SCAN download link for the pre-computed features appears to have been taken down, the link provided by the author of vse_infty contains a copy of these files.
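Before training, you can sanity-check the downloaded features. A minimal sketch, assuming the precomp folder follows SCAN's naming convention (train_ims.npy alongside train_caps.txt; the 36-region, 2048-d shape is typical of BUTD features but is an assumption here):

```python
import numpy as np

# Load SCAN-style precomputed BUTD region features (file name per SCAN's convention).
ims = np.load("data/f30k/precomp/train_ims.npy")
with open("data/f30k/precomp/train_caps.txt") as f:
    caps = f.read().splitlines()

print(ims.shape)  # expected roughly (num_images, 36, 2048)
print(len(caps))  # each image usually has 5 captions
assert len(caps) % ims.shape[0] == 0  # captions should divide evenly over images
```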

🔍 Training and Evaluation

Training on the Flickr30K or COCO dataset (the full sequence is consolidated in the sketch after this list):

  1. Switch to the shell folder in the corresponding path. For example:
     cd ./ESA_BIGRU/shell/
  2. Run ./train_xxx_f30k.sh or ./train_xxx_coco.sh. For example:
     sh train_GRU_f30k.sh
  3. Evaluation: run the following commands after changing the default data and model paths to your own:
     cd ../
     python eval_ensemble.py
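For convenience, the same end-to-end sequence for the BiGRU model on Flickr30K (all commands taken from the steps above; swap in the corresponding scripts for other backbones or datasets):

```bash
# Train ESA with the BiGRU text backbone on Flickr30K, then evaluate the ensemble.
cd ./ESA_BIGRU/shell/
sh train_GRU_f30k.sh     # training script
cd ../
python eval_ensemble.py  # update the default data/model paths before running
```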

📝 Citation

If this codebase is useful to you, please cite our work:

@article{zhu2023esa,
  title={ESA: External Space Attention Aggregation for Image-Text Retrieval},
  author={Zhu, Hongguang and Zhang, Chunjie and Wei, Yunchao and Huang, Shujuan and Zhao, Yao},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2023},
  publisher={IEEE}
}

🐼 Contacts

If you have any questions, please feel free to contact me: zhuhongguang1103@gmail.com or hongguang@bjtu.edu.cn.

📚 Reference

  1. Chen, Jiacheng, et al. "Learning the Best Pooling Strategy for Visual Semantic Embedding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
  2. Faghri, Fartash, et al. "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives." arXiv preprint arXiv:1707.05612 (2017).
  3. Lee, Kuang-Huei, et al. "Stacked Cross Attention for Image-Text Matching." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
  4. Diao, Haiwen, et al. Cross-modal_Retrieval_Tutorial.
