This is the source code of our TCYB 2020 paper "Unsupervised Visual-textual Correlation Learning with Fine-grained Semantic Alignment". Please cite the following paper if you use our code.
Yuxin Peng, Zhaoda Ye, Jinwei Qi and Yunkan Zhuo, "Unsupervised Visual-textual Correlation Learning with Fine-grained Semantic Alignment", IEEE Transactions on Cybernetics (TCYB), DOI:10.1109/TCYB.2020.3015084, Sep. 2020.
The main code is implemented with PyTorch.
We adopt the object detection model (https://github.com/peteanderson80/bottom-up-attention) and SceneGraphParser (https://github.com/vacancy/SceneGraphParser) to extract the image and text entities. The entity files for Flickr can be found in the ./data folder.
- Based on IoU: use the script in ./caption/IOU.
- Based on a generation model: we adopt OpenNMT (https://github.com/OpenNMT/OpenNMT) for caption generation. The generated captions for Flickr can be found in the ./data folder.
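The IoU-based option pairs entities by bounding-box overlap. A minimal sketch of the idea (function names and the 0.5 threshold are illustrative, not the actual interface of the script in ./caption/IOU):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_regions(boxes_a, boxes_b, threshold=0.5):
    """Greedily pair each box in boxes_a with its best-overlapping box in
    boxes_b, keeping only pairs whose IoU exceeds the threshold."""
    pairs = []
    for i, a in enumerate(boxes_a):
        best_j, best_iou = -1, threshold
        for j, b in enumerate(boxes_b):
            score = iou(a, b)
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j >= 0:
            pairs.append((i, best_j))
    return pairs
```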
cd ./Cross-modal/local
Train the model: sh script.sh
Test and obtain the similarity score: python test.py
cd ./Cross-modal/global
Train the model: sh script.sh
Test and obtain the global representation: python test.py
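Given the global representations produced by test.py, cross-modal similarity between an image vector and a text vector is commonly measured by cosine similarity. A pure-Python sketch of that scoring step (illustrative only, not the repo's exact code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```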
Use the script in ./Merge.
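A common way to merge the local and global results is a weighted sum of the two similarity matrices; the sketch below assumes that scheme (the alpha weight and function name are illustrative, not the actual interface of the ./Merge script):

```python
def merge_scores(local_sim, global_sim, alpha=0.5):
    """Fuse two image-by-text similarity matrices (lists of rows) with a
    weighted sum: alpha * local + (1 - alpha) * global."""
    return [[alpha * l + (1 - alpha) * g for l, g in zip(row_l, row_g)]
            for row_l, row_g in zip(local_sim, global_sim)]
```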