Visual Semantic Reasoning for Image-Text Matching (VSRN)

PyTorch code for VSRN, the model described in the paper "Visual Semantic Reasoning for Image-Text Matching". The paper appears in ICCV 2019 as an oral presentation. The code is built on top of VSE++.

Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li and Yun Fu. "Visual Semantic Reasoning for Image-Text Matching", ICCV, 2019. [pdf]

Introduction

Image-text matching has been a hot research topic bridging the vision and language areas. It remains challenging because current image representations usually lack the global semantic concepts present in the corresponding text captions. To address this issue, we propose a simple and interpretable reasoning model that generates visual representations capturing the key objects and semantic concepts of a scene. Specifically, we first build connections between image regions and perform reasoning with Graph Convolutional Networks to generate features with semantic relationships. We then use a gate and memory mechanism to perform global semantic reasoning on these relationship-enhanced features, select discriminative information, and gradually generate a representation for the whole scene. Experiments validate that our method achieves a new state of the art for image-text matching on the MS-COCO and Flickr30K datasets. It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (Recall@1 using the 1K test set). On Flickr30K, our model improves image retrieval by 12.6% relatively and caption retrieval by 5.8% relatively (Recall@1).

(Figure: overview of the VSRN model architecture.)
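
The two reasoning stages described above can be summarized in a short sketch. This is an illustrative simplification, not the repository's implementation (see model.py and GCN_lib for the real code); the layer names, the single GCN step, and the dimensions (36 regions of 2048-d features, a 1024-d embedding) are assumptions chosen to match the description:

import torch
import torch.nn as nn

class RegionReasoning(nn.Module):
    def __init__(self, region_dim=2048, embed_dim=1024):
        super(RegionReasoning, self).__init__()
        self.fc = nn.Linear(region_dim, embed_dim)    # project region features
        self.query = nn.Linear(embed_dim, embed_dim)  # affinity graph between regions
        self.key = nn.Linear(embed_dim, embed_dim)
        self.gcn = nn.Linear(embed_dim, embed_dim)    # one graph-convolution step
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)  # gate/memory reasoning

    def forward(self, regions):                       # regions: (B, 36, 2048)
        v = self.fc(regions)                          # (B, 36, embed_dim)
        # Region relationship reasoning: build a normalized affinity matrix
        # between regions and propagate features over it (a single GCN step).
        adj = torch.softmax(torch.matmul(self.query(v), self.key(v).transpose(1, 2)), dim=-1)
        v = v + torch.relu(self.gcn(torch.matmul(adj, v)))  # relationship-enhanced features
        # Global semantic reasoning: the GRU's gates select discriminative
        # information while its memory accumulates the whole-scene representation.
        _, h = self.gru(v)                            # h: (1, B, embed_dim)
        return h.squeeze(0)                           # final image embedding, (B, embed_dim)

img_emb = RegionReasoning()(torch.randn(2, 36, 2048))  # -> torch.Size([2, 1024])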

Requirements

We recommend the dependencies listed in requirement.txt. In addition, NLTK's Punkt sentence tokenizer is needed for caption tokenization:

import nltk
nltk.download('punkt')

Download data

Download the dataset files and pre-trained models. We use the splits produced by Andrej Karpathy.

For a fair comparison, we follow the bottom-up attention model and SCAN to obtain image features. More details about the (optional) data pre-processing can be found here. All the data needed to reproduce the experiments in the paper, including image features and vocabularies, can be downloaded from SCAN using:

wget https://scanproject.blob.core.windows.net/scan-data/data.zip

You can also get the data from Google Drive: https://drive.google.com/drive/u/1/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC. We refer to the path of the files extracted from data.zip as $DATA_PATH.
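
As a quick sanity check after extraction, the precomputed features can be inspected with NumPy. The file layout below follows the SCAN precomp convention (<split>_ims.npy and <split>_caps.txt under each dataset folder); the exact paths are assumptions, so adjust them to where you extracted data.zip:

import numpy as np

ims = np.load('data/coco_precomp/train_ims.npy')
print(ims.shape)   # (num_images, 36, 2048): 36 bottom-up region features per image
with open('data/coco_precomp/train_caps.txt') as f:
    caps = f.readlines()
print(len(caps))   # MS-COCO provides 5 captions per image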

Training new models

Run train.py:

For MS-COCO:

python train.py --data_path $DATA_PATH --data_name coco_precomp --logger_name runs/coco_VSRN --max_violation

For Flickr30K:

python train.py --data_path $DATA_PATH --data_name f30k_precomp --logger_name runs/flickr_VSRN --max_violation --max_len 40
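
The --max_violation flag switches the triplet ranking loss from summing over all negatives to using only the hardest negative in each mini-batch, following VSE++ (which this code builds on). Below is a minimal sketch of that loss, assuming L2-normalized image and caption embeddings; it is a simplified stand-in for the loss in model.py:

import torch

def max_violation_loss(im, s, margin=0.2):
    # im, s: (B, D) image / caption embeddings; row i of each is a matched pair
    scores = torch.matmul(im, s.t())                     # (B, B) similarity matrix
    diag = scores.diag().view(-1, 1)                     # scores of the matched pairs
    cost_s = (margin + scores - diag).clamp(min=0)       # caption-retrieval hinge
    cost_im = (margin + scores - diag.t()).clamp(min=0)  # image-retrieval hinge
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)                 # ignore the positive pairs
    cost_im = cost_im.masked_fill(mask, 0)
    # max violation: keep only the hardest negative per image (rows)
    # and per caption (columns), instead of summing over all negatives
    return cost_s.max(1)[0].sum() + cost_im.max(0)[0].sum()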

Evaluate trained models

Modify model_path and data_path in evaluation_models.py, then run it:

python evaluation_models.py

To run 5-fold cross-validation on the MS-COCO 1K test set, pass fold5=True; pass fold5=False to evaluate on the full MS-COCO 5K test set. Pretrained models can be downloaded from https://drive.google.com/file/d/1C4Z8ZgJuvrChigPO7g-IGd68y6VqWQ5n/view?usp=sharing.
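
For reference, a sketch of what the edited call in evaluation_models.py might look like. The checkpoint filename and the evalrank signature are assumptions carried over from the VSE++/SCAN code this repository builds on; check the file itself for the exact call:

import evaluation

RUN_PATH = 'runs/coco_VSRN/model_best.pth.tar'  # hypothetical checkpoint path
DATA_PATH = '/path/to/data'                     # the $DATA_PATH from above

# MS-COCO 1K: average results over five 1K folds of the 5K test set
evaluation.evalrank(RUN_PATH, data_path=DATA_PATH, split='testall', fold5=True)
# MS-COCO 5K: evaluate on the full test set
# evaluation.evalrank(RUN_PATH, data_path=DATA_PATH, split='testall', fold5=False)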

Reference

If you find this code useful, please cite the following paper:

@inproceedings{li2019vsrn,
  title={Visual semantic reasoning for image-text matching},
  author={Li, Kunpeng and Zhang, Yulun and Li, Kai and Li, Yuanyuan and Fu, Yun},
  booktitle={ICCV},
  year={2019}
}

License

Apache License 2.0
