Skip to content

Learning Semantic Relationship among Instances for Image-Text Matching, CVPR, 2023

License

Notifications You must be signed in to change notification settings

CrossmodalGroup/HREM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Learning Semantic Relationship among Instances for Image-Text Matching

The codes for our paper "Learning Semantic Relationship among Instances for Image-Text Matching", which is accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. We referred to the implementations of GPO, VSE++, SCAN, and NAAF to build up our codes.

Introduction

Image-text matching, a bridge connecting image and language, is an important task, which generally learns a holistic cross-modal embedding to achieve a high-quality semantic alignment between the two modalities.

However, previous studies only focus on capturing fragment-level relation within a sample from a particular modality, e.g., salient regions in an image or text words in a sentence, where they usually pay less attention to capturing instance-level interactions among samples and modalities, e.g., multiple images and texts. In this paper, we argue that sample relations could help learn subtle differences for hard negative instances, and thus transfer shared knowledge for infrequent samples should be promising in obtaining better holistic embeddings. Therefore, we propose a novel hierarchical relation modeling framework (HREM), which explicitly capture both fragment- and instance-level relations to learn discriminative and robust cross-modal embeddings. Extensive experiments on Flickr30K and MS-COCO show our proposed method outperforms the state-of-the-art methods.

Preparation

Environments

We recommend the following key dependencies.

  • python >= 3.8
  • torch >= 1.7.0
  • torchvision >= 0.8.0
  • transformers >=2.1.1
  • opencv-python
  • tensorboard

Data

We organize all dataset in the following manner like GPO, we first need download the pre-computed BUTD region features provides by SCAN (The official authors). You also can download the dataset through Baidu Cloud. Download links are Flickr30K and MSCOCO, the extraction code is: USTC.

We need get the pretrained checkpoint files for BERT-base model. You also can use the pretrained checkpoint downloaded by transformers automatically, just set BertModel.from_pretrained('bert-base-uncased') and BertTokenizer.from_pretrained('bert-base-uncased'), then the files will be downloaded at ~/.cache.

data
├── coco_precomp  # coco dataset
│   ├── train_ims.npy
│   ├── train_caps.txt
│   ├── dev_ims.npy
│   ├── dev_caps.txt
│   ├── testall_ims.npy
│   ├── testall_caps.txt
│   ├── ......
│
├── f30k_precomp  # f30k dataset
│   ├── train_ims.npy
│   ├── train_caps.txt
│   ├── dev_ims.npy
│   ├── dev_caps.txt
│   ├── test_ims.npy
│   ├── test_caps.txt
│   ├── ......
│
├── bert-base-uncased    # the pretrained ckpt files for BERT-base
│   ├── config.json
│   ├── tokenizer_config.txt
│   ├── vocab.txt
│   ├── pytorch_model.bin
│   ├── ......
│

Training

We first need set up the related arguments for the datasets and models paths. Detailed information about the arguments can be found in arguments.py and graph_lib.py.

  • --dataset: the chosen datasets, e.g., f30k and coco.
  • --data_path: the root path of datasets, e.g., data/
  • --bert_path: the path of pretrained model files of BERT-base, e.g., data/bert-base-uncased
  • --gpu-id, the chosen GPU number, e.g., 0-7
  • --logger_name, the path of logger files, e.g., runs/f30k_test or runs/coco_test

Assuming we have set all the paths, including the model weights and the datasets, we run the train.py for training. The training script need about 20,000 GPU-Memory (one RTX-3090 GPU).

# fusion mechanism on f30k
python train.py --batch_size 128 --data_path data/ --dataset f30k --logger_name runs/f30k_test --mask_weight 1.0

# fusion mechanism on coco
python train.py --batch_size 256 --data_path data/ --dataset coco --logger_name runs/coco_test --mask_weight 1.5

Evaluation

Run eval.py to evaluate the trained models on f30k or coco.

python eval.py  --data_path data/ --dataset f30k 
python eval.py  --data_path data/ --dataset coco 

Performances

The following tables show the results of image-to-text retrieval on MSCOCO and Flickr30K datasets. In these experiments, we use the BUTD region featues for images, and use the BERT-base as the textual encoder. We also provide the training logs and checkpoint files for two datasets.

Datasets Interaction type I2T R@1 I2T R@5 T2I R@1 T2I R@5 Model checkpoint
Flickr30K Fusion 84.4 96.7 62.4 86.6 Here
MSCOCO-1K Fusion 81.9 96.9 65.9 91.4 Here
MSCOCO-5K Fusion 62.5 87.4 44.0 73.9 as above

Reference

@InProceedings{Fu_2023_CVPR,
    author    = {Fu, Zheren and Mao, Zhendong and Song, Yan and Zhang, Yongdong},
    title     = {Learning Semantic Relationship Among Instances for Image-Text Matching},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {15159-15168}
}

About

Learning Semantic Relationship among Instances for Image-Text Matching, CVPR, 2023

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages