FAVDBench: Fine-grained Audible Video Description

๐Ÿ  GitHub โ€ข ๐Ÿค— Hugging Face โ€ข ๐Ÿค– OpenDataLab โ€ข ๐Ÿ’ฌ Apply Dataset

[CVPR2023] [Project Page] [arXiv] [Demo] [BibTex] [中文简介]

This repository provides the official implementation of the CVPR 2023 paper "Fine-grained Audible Video Description", in which we introduce a novel task, fine-grained audible video description (FAVD), and a new dataset, FAVDBench.

This repository contains the code needed to reproduce the results reported in the paper, and we encourage you to explore additional tasks with the FAVDBench dataset.

Table of Contents

Apply for Dataset

You can access the FAVDBench dataset by visiting the OpenNLPLab/Download webpage. To obtain the dataset, please complete the corresponding Google Form. Once we receive your application, we will respond promptly. Alternatively, if you encounter any issues with the form, you can submit your application (stating your name and affiliation) by email to opennlplab@gmail.com. (Please check both your inbox and spam folder for the access e-mail.)

The downloaded data can be placed in, or symlinked into, the AVLFormer/datasets directory.
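
One convenient way to do the linking is sketched below; the source path /path/to/FAVDBench_downloads and the folder names are placeholders that you should adapt to whatever you actually downloaded.

    from pathlib import Path

    # Hypothetical location of the downloaded/unzipped FAVDBench data; adjust as needed.
    src_root = Path("/path/to/FAVDBench_downloads")
    dst_root = Path("AVLFormer/datasets")
    dst_root.mkdir(parents=True, exist_ok=True)

    # Folder names follow the dataset layout described later in this README.
    for name in ["videos", "audios", "metadata"]:
        src, dst = src_root / name, dst_root / name
        if src.exists() and not dst.exists():
            dst.symlink_to(src.resolve(), target_is_directory=True)
            print(f"linked {dst} -> {src}")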

Installation

In general, the code requires python>=3.7, as well as pytorch>=1.10 and torchvision>=0.8. You can follow recommend_env.sh to configure a recommended conda environment:

  1. Create virtual env

    conda create -n FAVDBench; conda activate FAVDBench
  2. Install pytorch-related packages:

    conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
  3. Install basic packages:

    pip install fairscale opencv-python
    pip install deepspeed PyYAML fvcore ete3 transformers pandas timm h5py
    pip install tensorboardX easydict progressbar matplotlib future deprecated scipy av scikit-image boto3 einops addict yapf
  4. Install mmcv-full

    pip install mmcv-full==1.6.1 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.12/index.html
  5. Install apex

    git clone https://github.com/NVIDIA/apex
    cd apex
    pip install -v --disable-pip-version-check --no-cache-dir ./
  6. Clone related repo for eval

    cd ./AVLFormer/src/evalcap
    git clone https://github.com/xiaoweihu/cider.git
    git clone https://github.com/LuoweiZhou/coco-caption.git
    mv ./coco-caption ./coco_caption 
  7. Install ffmpeg & ffprobe

  • Use ffmpeg -version and ffprobe -version to check whether ffmpeg and ffprobe are installed.

  • Installation guideline:

      # For ubuntu
      sudo apt update
      sudo apt install ffmpeg
    
      # For mac
      brew update
      brew install ffmpeg
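
After completing the steps above, a quick sanity check can save debugging time later. A minimal sketch (the versions in the comment are the ones recommended above; CUDA availability depends on your machine):

    import shutil

    import torch
    import torchvision

    # Expect torch 1.12.1 / torchvision 0.13.1 if you followed the steps above.
    print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
    print("CUDA available:", torch.cuda.is_available())

    # ffmpeg and ffprobe must be on PATH for the data pre-processing scripts.
    for tool in ("ffmpeg", "ffprobe"):
        print(f"{tool}:", shutil.which(tool) or "NOT FOUND")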

Dataset Preparation

๐Ÿ“Note:


  1. Refer to the Apply for Dataset section to download the raw video files directly into the datasets folder.

  2. Download the metadata.zip file into the datasets folder and unzip it.

  3. Activate the conda environment: conda activate FAVDBench.

  4. Extract frames from the videos and pack them into TSV (tab-separated values) files:

    # check the path
    pwd
    >>> FAVDBench/AVLFormer
    
    # check the preparation
    ls datasets
    >>> audios metadata videos
    
    # data pre-processing
    bash data_prepro/run.sh
    
    # validate the data pre-processing
    ls datasets
    >>> audios frames  frame_tsv  metadata videos
    
    ls datasets/frames
    >>> train-32frames test-32frames val-32frames
    
    ls datasets/frame_tsv
    >>> test_32frames.img.lineidx   test_32frames.img.tsv    test_32frames.img.lineidx.8b
        val_32frames.img.lineidx    val_32frames.img.tsv     val_32frames.img.lineidx.8b
        train_32frames.img.lineidx  train_32frames.img.tsv   train_32frames.img.lineidx.8b

    ๐Ÿ“Note

    • The contents within datasets/frames serve as intermediate files for training, although they hold utility for inference and scoring.
    • datasets/frame_tsv files are specifically designed for training purposes.
    • Should you encounter any problems, access Quick Links for Dataset Preparation to download the processed files or initiate a new issue in GitHub.
  5. Archive the mp3 audio files into HDF5 (h5py) files:

    python data_prepro/convert_h5py.py train
    python data_prepro/convert_h5py.py val
    python data_prepro/convert_h5py.py test
    # check the preparation
    ls datasets/audio_hdf
    >>> test_mp3.hdf  train_mp3.hdf  val_mp3.hdf

    ๐Ÿ“Note

Quick Links for Dataset Preparation

Item             URL                        md5sum
meta4raw-video   meta.zip                   5b50445f2e3136a83c95b396fc69c84a
metadata         metadata.zip               f03e61e48212132bfd9589c2d8041cb1
audio_mp3        audio_mp3.tar              e2a3eb49edbb21273a4bad0abc32cda7
audio_hdf        audio_hdf.tar              79f09f444ce891b858cb728d2fdcdc1b
frame_tsv        Dropbox / Baidu Netdisk    6c237a72d3a2bbb9d6b6d78ac1b55ba2
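
To confirm that a download completed correctly, compare its md5 checksum with the table above. A minimal sketch (the file name is just an example):

    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        """Compute the md5 of a file without loading it into memory at once."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Example: metadata.zip should match the checksum listed in the table.
    print(md5sum("metadata.zip") == "f03e61e48212132bfd9589c2d8041cb1")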

Experiments

๐Ÿ“Note:

Preparation

Please visit Video Swin Transformer to download the pre-trained models.

Download swin_base_patch244_window877_kinetics400_22k.pth and swin_base_patch244_window877_kinetics600_22k.pth, and place them under the models/video_swin_transformer directory.

FAVDBench/AVLFormer
|-- datasets                 (purpose)
|   |-- audios               (raw data)
|   |-- audio_hdf            (training, evaluation)
|   |-- audio_mp3            (evaluation, inference)
|   |-- frame_tsv            (training)
|   |-- frames               (evaluation)
|   |-- meta                 (raw data)
|   |-- metadata             (training)
|   |-- videos               (raw data, inference)
|-- models
|   |-- captioning/bert-base-uncased
|   |-- video_swin_transformer
|   |   |-- swin_base_patch244_window877_kinetics600_22k.pth
|   |   |-- swin_base_patch244_window877_kinetics400_22k.pth
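
Before launching training, it can help to verify that everything sits where this layout expects. A minimal sketch, assuming it is run from the repository root:

    from pathlib import Path

    root = Path("AVLFormer")

    # Paths taken from the layout above; extend the list if you also need the raw-data folders.
    required = [
        "datasets/audio_hdf",
        "datasets/frame_tsv",
        "datasets/metadata",
        "models/captioning/bert-base-uncased",
        "models/video_swin_transformer/swin_base_patch244_window877_kinetics400_22k.pth",
        "models/video_swin_transformer/swin_base_patch244_window877_kinetics600_22k.pth",
    ]
    for rel in required:
        path = root / rel
        print("OK     " if path.exists() else "MISSING", path)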

Training

  • The run.sh file provides training scripts for a single GPU, multiple GPUs, and multi-node distributed training.
  • The hyperparameters we used (args.json under Quick Links for Experiments) may serve as a useful reference.
  • For detailed instructions, refer to this guide.

Inference

  • The inference.sh file provides inference scripts.
  • Attention: the baseline requires both raw video and audio data for inference, which can be obtained here; a preprocessing sketch follows this list.
  • For detailed instructions, refer to this guide.
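
The sketch below only illustrates the kind of ffmpeg calls involved in turning a raw clip into audio and frame inputs; the file names and the 1 fps sampling rate are placeholders, not the exact settings of the repository's data_prepro pipeline (which samples a fixed 32 frames per clip).

    import os
    import subprocess

    video = "my_clip.mp4"  # placeholder input clip
    os.makedirs("frames", exist_ok=True)

    # Extract the audio track as mp3 (the format stored in the audios/ folder).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-vn", "-acodec", "libmp3lame", "my_clip.mp3"],
        check=True,
    )

    # Dump frames at 1 fps purely as an illustration of frame extraction.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-vf", "fps=1", "frames/my_clip_%03d.jpg"],
        check=True,
    )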

Quick Links for Experiments

Item              URL                        md5sum
weight            GitHub / Baidu Netdisk     5d6579198373b79a21cfa67958e9af83
hyperparameters   args.json                  -
prediction        prediction_coco_fmt.json   -
metrics           metrics.log                -

Evaluation

AudioScore

cd Metrics/AudioScore

python score.py \
    --pred_path PATH_to_PREDICTION_JSON_in_COCO_FORMAT \

๐Ÿ“Note:

  • Additional weights is required to download, please refer to the installation.
  • URL md5sum
    TriLip ๐Ÿ‘ TriLip.bin 6baef8a9b383fa7c94a4c56856b0af6d

EntityScore

cd Metrics/EntityScore

python score.py \
    --pred_path PATH_to_PREDICTION_JSON_in_COCO_FORMAT \
    --refe_path AVLFormer/datasets/metadata/test.caption_coco_format.json \
    --model_path t5-base
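
Both scoring scripts take predictions in COCO caption format: a JSON list with one entry per clip. The ids and caption below are made up, and the exact keys expected by score.py are best checked against the released prediction_coco_fmt.json; this is only a minimal sketch of the typical layout.

    import json

    # Hypothetical predictions in the usual COCO caption format.
    predictions = [
        {"image_id": 0, "caption": "a man plays an acoustic guitar while birds chirp in the background"},
        {"image_id": 1, "caption": "a red car accelerates and its engine roars"},
    ]
    with open("my_predictions_coco_fmt.json", "w") as f:
        json.dump(predictions, f, indent=2)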

CLIPScore

  • Please refer to CLIPScore to evaluate the model.

License

Community use of the FAVDBench model and code must adhere to the Apache 2.0 license. The FAVDBench model and code support commercial use.

Acknowledgments

Our project is developed based on the following open source projects:

Citation

If you use FAVD or FAVDBench in your research, please use the following BibTeX entry.

@InProceedings{Shen_2023_CVPR,
    author    = {Shen, Xuyang and Li, Dong and Zhou, Jinxing and Qin, Zhen and He, Bowen and Han, Xiaodong and Li, Aixuan and Dai, Yuchao and Kong, Lingpeng and Wang, Meng and Qiao, Yu and Zhong, Yiran},
    title     = {Fine-Grained Audible Video Description},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {10585-10596}
}