Learning Vision-and-Language Navigation from YouTube Videos (ICCV 2023)

Kunyang Lin1 2*    Peihao Chen1*    Diwei Huang1    Thomas H. Li6    Mingkui Tan1 5†    Chuang Gan3 4   
1South China University of Technology    2Key Laboratory of Big Data and Intelligent Robot, Ministry of Education    3UMass Amherst    4MIT-IBM Watson AI Lab    5Key Laboratory of Big Data and Intelligent Robot, Ministry of Education    6Peking University Shenzhen Graduate School

Getting started

This project is developed with Python 3.6.13 and PyTorch 1.10.1. Please install the dependencies as follows:

conda env create -f env.yaml
conda activate lily

or install the dependencies with

pip install -r requirements.txt

Some packages may be missing; if so, refer to requirements.txt and install them manually.
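
After installation, a quick sanity check (a minimal sketch, assuming PyTorch installed correctly) can confirm that the interpreter and CUDA setup match the versions above:

import sys
import torch

# Expect Python 3.6.13 and PyTorch 1.10.1, as noted above
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())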

Preparing dataset

We provide the detailed construction process of our proposed YouTube-VLN dataset in YouTube_VLN.md. The whole process may take considerable time. If you want to use the generated dataset for training directly, please download the following data.

1. Download the image features (11 files in total) and put them into data/YouTube-VLN/youtube_img_features:

image features 0, image features 1, image features 2, image features 3, image features 4, image features 5, image features 6, image features 7, image features 8, image features 9, image features 10

2. Download the trainset and testset and put them into data/YouTube-VLN/ytb.

3. Download the checkpoint of ViLBERT pre-trained on Conceptual Captions and put it into data/YouTube-VLN.

4. Download the matterport-ResNet-101-faster-rcnn features, unzip them, and put them into data/YouTube-VLN.

5. Download the instruction template and put it into data/task.

6. Run scripts/download.py to download the remaining task data:

python scripts/download.py
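
Before training, you can check that the downloads landed in the expected locations. This is a minimal sketch that only looks for the directories named in the steps above; it does not validate file contents:

import os

# Directories expected by the download steps above
expected_dirs = [
    "data/YouTube-VLN/youtube_img_features",  # step 1: image features (11 files)
    "data/YouTube-VLN/ytb",                   # step 2: trainset and testset
    "data/YouTube-VLN",                       # steps 3-4: ViLBERT checkpoint, Matterport features
    "data/task",                              # step 5: instruction template
]

for d in expected_dirs:
    status = "ok" if os.path.isdir(d) else "MISSING"
    print(status, d)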

Training

1. Pre-training Lily using YouTube-VLN

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node 4 \
    --master_port 1234 \
    -m pretrain \
    --pre_dataset ytb \
    --from_pretrained data/pretrained_model.bin \
    --save_name ytbvln_2e5_500_MRT \
    --prefix merge+ \
    --separators \
    --masked_vision \
    --masked_language \
    --ranking \
    --traj_judge \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 500 \
    --save_epochs 100

2. Fine-tune with masking loss

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node 4 \
    --master_port 5555 \
    -m train \
    --from_pretrained result/ytbvln_2e5_500_MRT/data/best_ranking.bin \
    --save_name ytbvln_2e5_500_MRT_ranking_30M \
    --masked_vision \
    --masked_language \
    --batch_size 12 \
    --num_epochs 30

3. Fine-tune with ranking loss and shuffling loss

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node 8 \
    --master_port 5555 \
    -m train \
    --from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M/data/29.bin \
    --save_name ytbvln_2e5_500_MRT_ranking_30M_30RS \
    --shuffle_visual_features \
    --ranking \
    --batch_size 16 \
    --num_epochs 30

4. Fine-tune with ranking loss and shuffling loss using speaker augmented data

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node 8 \
    --master_port 5555 \
    -m train \
    --from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M/data/29.bin \
    --save_name ytbvln_2e5_500_MRT_ranking_30M_30RSA \
    --prefix aug+ \
    --beam_prefix aug_ \
    --shuffle_visual_features \
    --ranking \
    --batch_size 16 \
    --num_epochs 30

Testing

CUDA_VISIBLE_DEVICES=0 python test.py \
  --from_pretrained result/ytbvln_2e5_500_MRT_ranking_30M_30RSA/data/best_unseen.bin \
  --save_name ytbvln_2e5_500_MRT_ranking_30M_30RSA \
  --split val_unseen

python scripts/calculate-metrics.py results/ytbvln_2e5_500_MRT_ranking_30M_30RSA/test_val_unseen/_results_val_unseen.json
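
If you want to inspect the raw predictions before (or after) computing metrics, the results file is plain JSON. Below is a minimal sketch that only prints a summary, since the exact structure of each entry is not documented here:

import json

# Path produced by the test command above
path = "results/ytbvln_2e5_500_MRT_ranking_30M_30RSA/test_val_unseen/_results_val_unseen.json"

with open(path) as f:
    results = json.load(f)

print(type(results).__name__, "with", len(results), "entries")
# Show one entry to reveal its fields without assuming their names
sample = results[0] if isinstance(results, list) else next(iter(results.items()))
print(sample)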

We also provide our trained model; feel free to test it.
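
To verify the downloaded checkpoint before testing, it can be loaded on CPU. This is a minimal sketch, assuming the .bin file was written with torch.save; the local path below is hypothetical, so point it at wherever you saved the model:

import torch

# Hypothetical path to the downloaded checkpoint
ckpt_path = "data/best_unseen.bin"

state = torch.load(ckpt_path, map_location="cpu")
print(type(state).__name__)
if isinstance(state, dict):
    # Print a few entries to confirm the file is intact
    for name, value in list(state.items())[:5]:
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(name, shape)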

Citation

If you find this work helpful, please consider citing our paper:

@article{lin2023ytbvln,
  title = {Learning Vision-and-Language Navigation from YouTube Videos},
  author = {Lin, Kunyang and Chen, Peihao and Huang, Diwei and Li, Thomas H. and Tan, Mingkui and Gan, Chuang},
  journal = {arXiv preprint arXiv:2307.11984}, 
  year = {2023},
}
@misc{lin2023ytbvln_code,
  title = {Learning Vision-and-Language Navigation from YouTube Videos},
  author = {Lin, Kunyang and Chen, Peihao and Huang, Diwei and Li, Thomas H. and Tan, Mingkui and Gan, Chuang},
  howpublished = {\url{https://github.com/JeremyLinky/YouTube-VLN}}, 
  year = {2023},
}

Acknowledgements

Our code is partially adapted from Airbert, video-dqn and Probes-VLN. Thanks for their awesome work, and please consider citing them as well.

Contact

For any questions, please feel free to open an issue or contact:

Kunyang Lin: imkunyanglin@gmail.com
Diwei Huang: sediweihuang@gmail.com
