Position Focused Attention Network for Image-Text Matching
This is the source code of the Position Focused Attention Network (PFAN), an approach to image-text matching based on position attention, from Tencent. It is built on top of SCAN (by Kuang-Huei Lee) in PyTorch.
Requirements and Installation
We recommend the following dependencies.
Download the dataset files. We use the dataset files created by SCAN (Kuang-Huei Lee). The position information of images can be downloaded from here (for Flickr30K) and here (for MS-COCO). Note that for the MS-COCO dataset we only upload the position information and captions; the image features are not uploaded because of their large size. The original image features can be downloaded from SCAN. When using the original image features, the samples must be reordered to match, using the sample ids or sample captions. The Tencent-News dataset files can be downloaded from here and here.
# For Flickr30K dataset
wget https://drive.google.com/open?id=1ZiF1IoeExPcn9V9L78X6jEYuMxR96OLO
# For MS-COCO dataset
wget https://drive.google.com/open?id=1DaCZxeXOCm05u-Gf-_MG_zSNKO1UxBat
# For Tencent-News training dataset
wget https://drive.google.com/open?id=1WKq05mhSMc2u0SLtCWkUzgmqTLx95kXR
# For Tencent-News testing dataset
wget https://drive.google.com/open?id=1dPyo2EBHQoHkqx-Dl4R7ISb8t-rVG_KK
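The reordering of the original SCAN image features mentioned above could be sketched as follows. This is a minimal illustration, not the authors' script: the helper names `build_reorder_index` and `reorder`, and the toy ids and feature placeholders, are hypothetical. It assumes that each sample can be identified by a key (a sample id or a caption string) that appears in both the SCAN files and the position files.

```python
def build_reorder_index(source_keys, target_keys):
    """Return idx such that source_keys[idx[i]] == target_keys[i].

    source_keys: keys (ids or captions) in the order of the original features.
    target_keys: keys in the order used by the position files.
    """
    position = {}
    for i, key in enumerate(source_keys):
        # Keep the first occurrence if a key happens to be duplicated.
        position.setdefault(key, i)

    missing = [k for k in target_keys if k not in position]
    if missing:
        raise KeyError(
            f"{len(missing)} target keys not found in source, e.g. {missing[0]!r}"
        )
    return [position[k] for k in target_keys]


def reorder(items, index):
    """Pick items in the order given by index."""
    return [items[i] for i in index]


# Toy data (hypothetical): three samples stored in different orders.
scan_ids = [101, 102, 103]
scan_features = ["feat_101", "feat_102", "feat_103"]
pfan_ids = [103, 101, 102]

index = build_reorder_index(scan_ids, pfan_ids)
aligned_features = reorder(scan_features, index)
print(aligned_features)  # ['feat_103', 'feat_101', 'feat_102']
```

If the features are held in a NumPy array, indexing the array with the same `index` list along its first axis gives the equivalent reordering.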
Training new models
To train Flickr30K and MS-COCO models:
To further improve the performance of PFAN on the Tencent-News dataset, the whole-image feature is also considered. The details are in the Tencent_PFAN code:
The arguments used to train the Flickr30K and MS-COCO models are the same as those of SCAN:
The models on Tencent-News can be downloaded from here.
Evaluate trained models on Flickr30K and MS-COCO
from vocab import Vocabulary
import evaluation
evaluation.evalrank("$RUN_PATH/f30k_precomp/model_best.pth.tar", data_path="$DATA_PATH", split="test")
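Python does not expand `$RUN_PATH` or `$DATA_PATH` inside string literals, so these placeholders have to be replaced with real paths before the call is made. One possible way to do this, assuming the two locations are exported as environment variables, is sketched below. The helper name `resolve_paths` and the default directories `./runs` and `./data` are hypothetical.

```python
import os


def resolve_paths(run_path, data_path, dataset="f30k_precomp"):
    """Expand $VARS and ~ and return (checkpoint_path, data_path)."""
    run_path = os.path.expanduser(os.path.expandvars(run_path))
    data_path = os.path.expanduser(os.path.expandvars(data_path))
    checkpoint = os.path.join(run_path, dataset, "model_best.pth.tar")
    return checkpoint, data_path


# Hypothetical fallbacks, used only when the variables are not already set.
os.environ.setdefault("RUN_PATH", "./runs")
os.environ.setdefault("DATA_PATH", "./data")

checkpoint, data_path = resolve_paths("$RUN_PATH", "$DATA_PATH")
print(checkpoint)

# With the repository on the import path, the evaluation call would become:
# from vocab import Vocabulary
# import evaluation
# evaluation.evalrank(checkpoint, data_path=data_path, split="test")
```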
Evaluate position-attention (PFAN-A) and position-only (PFAN-P) models
Evaluate trained models on Tencent-News
First, start the server to process requests
sh run_server.sh # port 5091 is sentence model and port 5092 is tag model
Then, send requests to get results from the server
cd test_server
python test.py dist_sentence_t2i.json sentence 5091 # to get the results using sentence model and sentence data
python test.py dist_tag_t2i.json tag 5091 # to get the results using sentence model and tag data
python test.py dist_tag_new_t2i.json tag 5092 # to get the results using tag model and tag data
Finally, get the MAP@1-3 and A@1-3
cd test_server
python compute_map.py
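For readers unfamiliar with the metric, MAP@K can be sketched as below. This uses one common definition, which is an assumption here: precision is accumulated at each relevant rank within the top K and normalized by min(K, number of relevant items). The exact formula in `compute_map.py` may differ, and the function names are hypothetical.

```python
def average_precision_at_k(relevance, k):
    """AP@K for one query.

    relevance: ranked list of 0/1 flags, best-ranked result first.
    Normalization by min(k, number of relevant items) is an assumption.
    """
    n_relevant = sum(1 for r in relevance if r)
    if n_relevant == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            # Precision at this rank, counted only where a hit occurs.
            precision_sum += hits / rank
    return precision_sum / min(k, n_relevant)


def mean_average_precision_at_k(all_relevance, k):
    """MAP@K: the mean of AP@K over all queries."""
    if not all_relevance:
        return 0.0
    total = sum(average_precision_at_k(r, k) for r in all_relevance)
    return total / len(all_relevance)


# Toy example (hypothetical): two queries with three ranked results each.
queries = [[1, 0, 1], [0, 1, 0]]
for k in (1, 2, 3):
    print(k, mean_average_precision_at_k(queries, k))
```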