Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Source code for our paper, Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning.

Environments

CUDA 11.3
Python 3.8.5
PyTorch 1.10.2

We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install the required packages.

conda create --name nrccr_env python=3.8.5
conda activate nrccr_env
git clone https://github.com/LiJiaBei-7/nrccr.git
cd nrccr
pip install -r requirements.txt
conda deactivate

Required Data

We use three public datasets: VATEX, MSR-VTT-CN, and Multi-30K. The extracted features are placed in $HOME/VisualSearch/.

For Multi-30K, we provide translated versions (from Google Translate) of Task1 and Task2 (Task1: applied to translation tasks; Task2: applied to captioning tasks).

In addition, we provide the MSCOCO dataset here, with the corresponding performance reported below. The Japanese validation and test sets come from STAIR Captions, and the Chinese ones from COCO-CN.

Training set:

source(en) + translation(en2xx) + back-translation(en2xx2en)

Validation set and test set:

target(xx) + translation(xx2en)
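
To make the composition above concrete, here is a minimal sketch of how one caption record could be represented for each split. The class names, field names, and example strings are invented for illustration and are not taken from the repository; the actual caption files may be organized differently.

```python
from dataclasses import dataclass


@dataclass
class TrainCaption:
    """Hypothetical record for one training caption."""
    source_en: str            # source(en): original English caption
    translation_xx: str       # translation(en2xx): machine translation into the target language
    back_translation_en: str  # back-translation(en2xx2en): translated back into English


@dataclass
class EvalCaption:
    """Hypothetical record for one validation or test caption."""
    target_xx: str       # target(xx): caption in the target language
    translation_en: str  # translation(xx2en): machine translation into English


# Illustrative strings only, not taken from any dataset.
sample = TrainCaption(
    source_en="a man rides a bike",
    translation_xx="一个男人骑自行车",
    back_translation_en="a man is riding a bicycle",
)
print(sample)
```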

Dataset	feature	caption
VATEX	vatex-i3d.tar.gz, pwd:p3p0	vatex_caption, pwd:oy27
MSR-VTT-CN	msrvtt10k-resnext101_resnet152.tar.gz, pwd:p3p0	cn_caption, pwd:oy27
Multi-30K	multi30k-resnet152.tar.gz, pwd:5khe	multi30k_caption, pwd:oy27
MSCOCO		mscoco_caption, pwd:13dc

ROOTPATH=$HOME/VisualSearch
mkdir -p $ROOTPATH && cd $ROOTPATH

Organize these files like this:
# download the data of VATEX[English, Chinese]
VisualSearch/VATEX/
	FeatureData/
		i3d_kinetics/
			feature.bin
			id.txt
			shape.txt
			video2frames.txt
	TextData/
		xx.txt

# download the data of MSR-VTT-CN[English, Chinese]
VisualSearch/msrvttcn/
	FeatureData/
		resnext101-resnet152/
			feature.bin
			id.txt
			shape.txt
			video2frames.txt
	TextData/
		xx.txt

# download the data of Multi-30K[English, German, French, Czech]
# For Task2, the training set is translated from Flickr30K, which contains five captions per image, while for Task1, each image has one caption.
# The validation and test sets for French and Czech are the same in both tasks.
VisualSearch/multi30k/
	FeatureData/
		train_id.txt
		val_id.txt
		test_id_2016.txt

	resnet_152[optional]/
		train-resnet_152-avgpool.npy
		val-resnet_152-avgpool.npy
		test_2016_flickr-resnet_152-avgpool.npy	
	TextData/
		xx.txt	
	flickr30k-images/
		xx.jpg

# download the data of MSCOCO[English, Chinese, Japanese]
VisualSearch/mscoco/
	FeatureData/
		train_id.txt
		ja_val_id.txt
		zh_val_id.txt
		ja_test_id.txt
		zh_test_id.txt
	TextData/
		xx.txt
	all_pics/
		xx.jpg
		
	image_ids.txt
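
Before training, it can help to confirm that the feature files are where the layout above expects them. The following is a minimal sketch for the two video datasets only; the function name `missing_feature_files` is hypothetical and not part of the repository. The TextData file names are elided in the layout (`xx.txt`), so they are not checked.

```python
from pathlib import Path

# Feature files listed in the layout above for the video datasets.
FEATURE_FILES = ("feature.bin", "id.txt", "shape.txt", "video2frames.txt")

# Dataset folder -> feature folder, as shown in the layout above.
FEATURE_DIRS = {
    "VATEX": "i3d_kinetics",
    "msrvttcn": "resnext101-resnet152",
}


def missing_feature_files(rootpath, dataset):
    """Return the expected feature files that are absent under rootpath/dataset."""
    feature_dir = Path(rootpath) / dataset / "FeatureData" / FEATURE_DIRS[dataset]
    return [
        str(feature_dir / name)
        for name in FEATURE_FILES
        if not (feature_dir / name).is_file()
    ]


if __name__ == "__main__":
    root = Path.home() / "VisualSearch"
    for dataset in FEATURE_DIRS:
        for path in missing_feature_files(root, dataset):
            print("missing:", path)
```

An empty list for a dataset means all four feature files were found.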

NRCCR on VATEX

Model Training and Evaluation

Run the following script to train and evaluate the NRCCR network. It trains the network and selects the checkpoint that performs best on the validation set as the final model. Note that only this best-performing checkpoint is saved, to save disk space.

ROOTPATH=$HOME/VisualSearch

conda activate nrccr_env

# To train the model on VATEX, which uses the i3d_kinetics feature
# Template:
./do_all_vatex.sh $ROOTPATH <gpu-id>

# Example:
# Train NRCCR 
./do_all_vatex.sh $ROOTPATH 0

<gpu-id> is the index of the GPU to train on.
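
The checkpoint selection described above (keep only the checkpoint with the best validation result) can be sketched as follows. This is an illustration under assumptions, not the repository's training loop: the function name is hypothetical, and which validation metric the scripts use for the comparison is not stated in this README, so the sketch accepts any scalar score where higher is better.

```python
def select_best_checkpoint(val_scores):
    """Return (epoch, score) for the epoch with the highest validation score.

    val_scores holds one scalar per epoch, higher being better.
    Ties are resolved in favor of the earliest epoch.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # A real training loop would overwrite the single saved
            # checkpoint here, so only the best one stays on disk.
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Made-up validation scores for three epochs.
print(select_best_checkpoint([300.1, 310.5, 308.0]))
```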

Evaluation using Provided Checkpoints

Download the trained checkpoint for VATEX from Baidu pan (url, pwd:ise6) and run the following script to evaluate it.

ROOTPATH=$HOME/VisualSearch/

tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH

./do_test_vatex.sh $ROOTPATH $MODELDIR <gpu-id>
# $MODELDIR is the path of checkpoints, $ROOTPATH/.../runs_0

Expected Performance

Type	Text-to-Video Retrieval					Video-to-Text Retrieval					SumR
Type	R@1	R@5	R@10	MedR	mAP	R@1	R@5	R@10	MedR	mAP	SumR
en2cn	30.8	64.4	74.6	3.0	45.78	43.1	72.3	81.4	2.0	32.57	366.5
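
For readers unfamiliar with the columns, the sketch below shows how R@K, MedR, and SumR are commonly computed from the 1-based rank of the ground-truth item for each query. It assumes SumR is the sum of R@1, R@5, and R@10 over both retrieval directions, which agrees with the row above up to rounding (the six recall values add up to 366.6 against a reported 366.5). The function names are hypothetical, and this is not the code in evaluation.py.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item is ranked in the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def retrieval_metrics(ranks):
    """Compute R@1, R@5, R@10 and the median rank from 1-based ranks."""
    return {
        "R@1": recall_at_k(ranks, 1),
        "R@5": recall_at_k(ranks, 5),
        "R@10": recall_at_k(ranks, 10),
        "MedR": float(median(ranks)),
    }


def sum_r(text_to_video, video_to_text):
    """Sum the three recall values over both retrieval directions."""
    return sum(
        metrics[key]
        for metrics in (text_to_video, video_to_text)
        for key in ("R@1", "R@5", "R@10")
    )


# Made-up ranks for four queries in each direction.
t2v = retrieval_metrics([1, 2, 6, 11])
v2t = retrieval_metrics([1, 2, 6, 11])
print(t2v, sum_r(t2v, v2t))
```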

NRCCR on MSR-VTT-CN

Model Training and Evaluation

Run the following script to train and evaluate NRCCR network on MSR-VTT-CN.

ROOTPATH=$HOME/VisualSearch

conda activate nrccr_env

# To train the model on MSR-VTT-CN
./do_all_msrvttcn.sh $ROOTPATH <gpu-id>

Evaluation using Provided Checkpoints

Download the trained checkpoint for MSR-VTT-CN from Baidu pan (url, pwd:ise6) and run the following script to evaluate it.

ROOTPATH=$HOME/VisualSearch/

tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH

./do_test_msrvttcn.sh $ROOTPATH $MODELDIR <gpu-id>
# $MODELDIR is the path of checkpoints, $ROOTPATH/.../runs_0

Expected Performance

Type	Text-to-Video Retrieval					Video-to-Text Retrieval					SumR
Type	R@1	R@5	R@10	MedR	mAP	R@1	R@5	R@10	MedR	mAP	SumR
en2cn	28.9	56.3	67.3	4.0	41.28	28.9	57.6	69.0	4.0	42.02	308

NRCCR on Multi-30K

Model Training and Evaluation

Run the following script to train and evaluate the NRCCR network on Multi-30K. If you want to use CLIP as the backbone for training, you also need to download the raw Flickr30K images from here.

ROOTPATH=$HOME/VisualSearch

conda activate nrccr_env

# To train the model on the Multi-30K
./do_all_multi30k.sh $ROOTPATH <task> <gpu-id>

Evaluation using Provided Checkpoints

Download the trained checkpoint for Multi-30K from Baidu pan (url, pwd:ise6) and run the following script to evaluate it.

ROOTPATH=$HOME/VisualSearch/

tar zxf $ROOTPATH/<best_model>.pth.tar -C $ROOTPATH

./do_test_multi30k.sh $ROOTPATH $MODELDIR $image_path <gpu-id>
# $MODELDIR is the path of checkpoints, $ROOTPATH/.../runs_0
# $image_path is the path of the raw Flickr30K images; if you use the frozen ResNet-152 features, set it to None.

Expected Performance

Task1:

Type	Text-to-Video Retrieval					Video-to-Text Retrieval					SumR
Type	R@1	R@5	R@10	MedR	mAP	R@1	R@5	R@10	MedR	mAP	SumR
en2de_clip	53.8	81.8	88.3	1.0	66.60	53.8	82.7	90.3	1.0	66.66	450.7
en2fr_clip	54.7	81.7	89.2	1.0	67.05	54.9	82.7	89.7	1.0	67.29	452.9
en2cs_clip	52.6	79.4	87.9	1.0	65.26	52.3	78.7	87.8	1.0	64.68	438.7
en2cs_resnet152	29.5	56.0	68.1	4.0	41.89	27.5	55.1	67.4	4.0	40.59	303.6

Task2:

(with clip)

en2de_SumR	en2fr_SumR	en2cs_SumR
480.9	482.1	467.1

NRCCR on MSCOCO

Model Training and Evaluation

Run the following script to train and evaluate NRCCR network on MSCOCO.

ROOTPATH=$HOME/VisualSearch

conda activate nrccr_env

# To train the model on MSCOCO
./do_all_mscoco.sh $ROOTPATH <gpu-id>

Expected Performance

(with clip)

en2cn_SumR	en2ja_SumR
512.4	507.0

Reference

If you find the package useful, please consider citing our paper:

@inproceedings{wang2022cross,
  title={Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning},
  author={Yabing Wang and Jianfeng Dong and Tianxiang Liang and Minsong Zhang and Rui Cai and Xun Wang},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}
