Modified version of the Two_branch_network repo

Two-Branch Neural Networks

This repo is derived from Two Branch Networks (Liwei Wang et al.) and has been modified to support the settings tested in this paper, such as fine-tuning a word embedding, using different language models, and using CCA-initialized fully connected layers.

This code has been tested using Python 2.7 and TensorFlow 1.2.1.

You can find a PyTorch implementation of this codebase here, which typically trains a bit faster than this version.


You can download and unpack the caption data using:


This doesn't include precomputed visual features or language embeddings. You can obtain the ResNet-152 visual features we used here. The code is set up to load word embeddings from a space-separated text file. By default the code loads the MT GrOVLE embeddings, which it assumes have been placed in the data directory. When tuning word_embedding_reg, we found the optimal value to fall anywhere between 0 and 1.5 depending on the word embedding tested, and tuning this parameter for each word embedding can considerably improve performance.
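As an illustration of the space-separated embedding format mentioned above, here is a minimal sketch of a loader. It assumes each line holds a token followed by its vector components (`token v1 v2 ... vD`), as in common GloVe-style text files; the function name `load_word_embeddings` is hypothetical and is not taken from this codebase.

```python
import io


def load_word_embeddings(lines):
    """Parse lines of the assumed form `token v1 v2 ... vD` into a dict.

    `lines` can be any iterable of strings, e.g. an open text file.
    """
    vectors = {}
    dim = None
    for line_no, line in enumerate(lines, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token = parts[0]
        values = [float(v) for v in parts[1:]]
        if dim is None:
            dim = len(values)  # the first vector fixes the dimensionality
        elif len(values) != dim:
            raise ValueError(
                "line %d: expected %d values, got %d" % (line_no, dim, len(values))
            )
        vectors[token] = values
    return vectors


# Tiny in-memory example standing in for a real embedding file.
sample = io.StringIO("dog 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
embeddings = load_word_embeddings(sample)
```

With a real file, the same function can be called on an open file handle, since it only iterates over lines.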


After setting up the datasets, you can train a model using the provided script:

# GPU_ID is the GPU you want to train on
# DATASET in {flickr, coco} determines which dataset is used 
# LANGUAGE_MODEL in {avg, attend, gru} language encoder used to aggregate word embeddings
# EXPERIMENT_NAME a descriptor of what to call this experiment
# Examples:
./ --train 0 coco avg default_avg
./ --train 1 flickr gru default_gru

Training with either the avg or attend language model should take less than an hour on a Titan Xp GPU (just a few minutes on Flickr30K), but gru and other simple recurrent alternatives take considerably longer to train and tend to perform worse on this task when using a pretrained word embedding (see additional results here). More complicated recurrent models may improve performance, however.
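To make the difference between the non-recurrent language models concrete, the sketch below shows two ways of aggregating a sentence's word embeddings into one vector: plain averaging, and a softmax-weighted sum. This is only an assumption about the general idea; the attention form here (dot product with a hypothetical `scoring_vector`) is a generic one and is not confirmed to match the attend model in this codebase.

```python
import math


def average_pool(word_vectors):
    """Average the word vectors of a sentence, dimension by dimension."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / n for d in range(dim)]


def attention_pool(word_vectors, scoring_vector):
    """Softmax-weighted sum of word vectors.

    Each word is scored by its dot product with `scoring_vector`
    (a hypothetical learned parameter), and the scores are normalized
    with a numerically stable softmax.
    """
    scores = [sum(w * s for w, s in zip(vec, scoring_vector)) for vec in word_vectors]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_vectors[0])
    return [sum(wt * vec[d] for wt, vec in zip(weights, word_vectors)) for d in range(dim)]


# Three toy 2-D "word embeddings" for one sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
avg = average_pool(sentence)
# A zero scoring vector gives uniform weights, so this matches the average.
att = attention_pool(sentence, [0.0, 0.0])
```

Neither function depends on word order or on sequential computation, which is consistent with these models training much faster than a recurrent encoder.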

Evaluating the model on the 1K test splits for each dataset can be accomplished using:

# GPU_ID is the GPU you want to test on
# DATASET in {flickr, coco} determines which dataset is used
# LANGUAGE_MODEL in {avg, attend, gru} language encoder used to aggregate word embeddings
# CHECKPOINT is the full path of the checkpoint to load
# Examples:
./ --test 1 coco attend models/coco/default_attend/two_branch_chpt-22660
./ --val 0 flickr gru models/flickr/default_gru/two_branch_chpt-5940

When evaluating, note that the splits used on the Flickr30K dataset are inconsistent across the literature: at least two different splits are used to evaluate this task. The choice of split can easily account for a 1-2% difference in performance (this is also true on MSCOCO, although splits there are more stable). It isn't clear whether one split consistently yields better performance than another, and that can't be determined with any certainty without evaluating many different models on the same splits. We use the splits provided by the Flickr30K Entities dataset.

Example experiments

Below we provide an example of one of our runs training and testing a self-attention language model using the MT GrOVLE embeddings (the results are a little better than those reported here, and better than those in the original Two Branch Network paper):

# The three values for each direction correspond to Recall@{1, 5, 10} (6 numbers total), 
# and mR refers to the mean of the six recall values.

# For the Flickr30K dataset
./ --train 1 flickr attend default_attend
./ --test 1 flickr attend models/flickr/default_attend/two_branch_chpt-5940

im2sent: 61.7 86.5 93.2 sent2im: 45.6 76.2 85.3 mr: 74.8

# For the MSCOCO dataset
./ --train 2 coco attend default_attend
./ --test 2 coco attend models/coco/default_attend/two_branch_chpt-22660

im2sent: 68.7 93.5 97.4 sent2im: 54.5 85.6 93.3 mr: 82.2
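The mR values above can be checked by averaging the six recall numbers for each dataset. The sketch below does that, and also shows one common way Recall@K is computed from ranks. The helper names are hypothetical, and the Recall@K definition (a query counts as a hit if its best-ranked correct match is within the top K) is an assumption about the usual convention rather than a description of this repo's evaluation code.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose best correct match has rank <= k.

    `ranks` holds one 1-indexed rank per query.
    """
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def mean_recall(im2sent, sent2im):
    """Mean of the six Recall@{1, 5, 10} values over both directions."""
    values = list(im2sent) + list(sent2im)
    return sum(values) / len(values)


# Recall values copied from the runs reported above.
flickr_mr = mean_recall([61.7, 86.5, 93.2], [45.6, 76.2, 85.3])
coco_mr = mean_recall([68.7, 93.5, 97.4], [54.5, 85.6, 93.3])

# Toy ranks for four queries, only to illustrate recall_at_k.
toy_ranks = [1, 3, 7, 12]
toy_recalls = [recall_at_k(toy_ranks, k) for k in (1, 5, 10)]
```

The Flickr30K mean comes out at 74.75 and the MSCOCO mean at roughly 82.17, which agree with the reported 74.8 and 82.2 after rounding to one decimal place.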

These results are more comparable to those obtained with the PCA-reduced HGLMM features reported in this paper. The HGLMM features may still perform better on this task (after tuning hyperparameters), but they are 6K-D rather than the 300-D of the default embeddings and thus require more parameters and additional computation time. You can get some precomputed HGLMM features here, or compute them yourself using the code here. You can also likely improve performance by using CCA initialization of the fully connected layers, which is supported by this codebase. It assumes you provide a layer weight file in the same format as used by this repo.


If you use this repo in your project please cite the following papers on the Two Branch Network:

  title={Learning deep structure-preserving image-text embeddings},
  author={Wang, Liwei and Li, Yin and Lazebnik, Svetlana},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},

  title={Learning two-branch neural networks for image-text matching tasks},
  author={Wang, Liwei and Li, Yin and Huang, Jing and Lazebnik, Svetlana},

In addition, if you use the MT GrOVLE word embeddings, want to compare to ResNet results, or use the self-attention model please also cite:

  title={Language Features Matter: {E}ffective Language Representations for Vision-Language Tasks},
  author={Andrea Burns and Reuben Tan and Kate Saenko and Stan Sclaroff and Bryan A. Plummer},
  booktitle={The IEEE International Conference on Computer Vision (ICCV)},

Finally, if you use CCA Initialization please cite:

  title={Revisiting Image-Language Networks for Open-ended Phrase Detection},
  author={Bryan A. Plummer and Kevin J. Shih and Yichen Li and Ke Xu and Svetlana Lazebnik and Stan Sclaroff and Kate Saenko},