The official implementation of the paper "Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query" [Paper]
We propose a unified framework that handles the whole video sequentially, with long-range and dense visual-linguistic interaction, in an end-to-end manner. Specifically, we design a lightweight relevance-filtering-based transformer (Ref-Transformer), composed of relevance filtering based attention and a temporally expanded MLP. Text-relevant spatial regions and temporal clips in the video are efficiently highlighted through relevance filtering and then propagated over the whole video sequence by the temporally expanded MLP. The unified framework can be applied to various video-text action localization tasks, e.g., referring video segmentation, temporal sentence grounding, and spatiotemporal video grounding.
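For intuition, the sketch below shows one plausible reading of these two components in PyTorch. It is illustrative only, not the authors' implementation: the gating formula, module names, and shapes are all assumptions made here.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# "Relevance filtering" is read here as gating per-clip video features by
# their similarity to the text query; the "temporally expanded MLP" is read
# as an MLP that mixes information across the time axis.
import torch
import torch.nn as nn

class RelevanceFilter(nn.Module):
    """Gate per-clip video features by their relevance to a text query."""
    def __init__(self, dim):
        super().__init__()
        self.vid_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, video, text):
        # video: (B, T, D) clip features; text: (B, D) pooled query embedding
        v = self.vid_proj(video)                       # (B, T, D)
        q = self.txt_proj(text).unsqueeze(1)           # (B, 1, D)
        # Scaled dot-product relevance, squashed to a (0, 1) gate per clip.
        rel = torch.sigmoid((v * q).sum(-1, keepdim=True) / v.size(-1) ** 0.5)
        return video * rel                             # suppress irrelevant clips

class TemporalMLP(nn.Module):
    """Propagate the filtered features along the whole sequence via an MLP over time."""
    def __init__(self, seq_len, hidden=256):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))

    def forward(self, x):                              # x: (B, T, D)
        # Transpose so the MLP acts on the temporal axis, then add residually.
        return x + self.mix(x.transpose(1, 2)).transpose(1, 2)

if __name__ == "__main__":
    B, T, D = 2, 64, 512
    video, text = torch.randn(B, T, D), torch.randn(B, D)
    out = TemporalMLP(T)(RelevanceFilter(D)(video, text))
    print(out.shape)  # torch.Size([2, 64, 512])
```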
Requirements:
- python 3.8
- pytorch 1.9.1
- torchtext 0.10.1
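These can be installed with pip, for example (pick the torch build matching your CUDA version):

```
pip install torch==1.9.1 torchtext==0.10.1
```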
Run `cd referring_segmentation` for the referring video segmentation task.
Download the A2D Sentences dataset and J-HMDB Sentences dataset from https://kgavrilyuk.github.io/publication/actor_action/ and convert the videos to RGB frames. For the A2D Sentences dataset, run `python pre_proc\video2imgs.py` to convert the videos to RGB frames (a minimal sketch of this conversion is given after the directory layout). The following directory structure is expected:
```
-a2d_sentences
    -Rename_Images
    -a2d_annotation_with_instances
    -videoset.csv
    -a2d_missed_videos.txt
    -a2d_annotation.txt
-jhmdb_sentences
    -Rename_Images
    -puppet_mask
    -jhmdb_annotation.txt
```
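The frame extraction referenced above amounts to dumping each video's frames as images. A minimal sketch, assuming OpenCV (the actual script is `pre_proc\video2imgs.py` in this repo):

```python
# Minimal sketch of video-to-frames conversion; the real script may differ
# in naming scheme and image format.
import os
import cv2

def video_to_frames(video_path, out_dir):
    """Dump every frame of video_path as a zero-padded JPEG into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video (or read error)
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
```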
Edit the item `datasets_root` in `json/config_$DATASET$.json` to be the current dataset path.
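For reference, the edited entry would look roughly like this (the path is a placeholder, and the config's other keys are omitted here):

```json
{
    "datasets_root": "/path/to/a2d_sentences"
}
```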
Run `python pre_proc\generate_data_list.py` to generate the training and testing data splits.
Download the pretrained DeepLabResNet from https://github.com/VainF/DeepLabV3Plus-Pytorch and put it into `model/pretrained/`.
Only the A2D Sentences dataset is used for training; run:

```
python main.py --json_file=json\config_a2d_sentences.json --mode=train
```
To test on the A2D Sentences dataset, run:

```
python main.py --json_file=json\config_a2d_sentences.json --mode=test
```

To test on the J-HMDB Sentences dataset, run:

```
python main.py --json_file=json\config_jhmdb_sentences.json --mode=test
```
Run `cd temporal_grounding` for the temporal sentence grounding task.
- For the Charades-STA dataset, download the pre-extracted I3D features following LGI4temporalgrounding and the pre-extracted VGG features following 2D-TAN.
- For the TACoS dataset, download the pre-extracted C3D features following 2D-TAN.
- For the ActivityNet Captions dataset, download the pre-extracted C3D features from http://activity-net.org/challenges/2016/download.html.
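After downloading, a quick sanity check can save debugging time later. The on-disk layout depends on the feature providers (LGI4temporalgrounding / 2D-TAN), so this sketch assumes one `.npy` array of shape `(num_clips, feat_dim)` per video; the file name below is a placeholder:

```python
# Hypothetical sanity check -- adjust the path and format to your download.
import numpy as np

feat = np.load("features/charades_i3d/VIDEO_ID.npy")  # placeholder file name
print(feat.shape, feat.dtype)  # e.g. (num_clips, 1024) float32 for I3D
```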
The config files can be found in `./json`, and the following model settings are supported:
- config_ActivityNet_C3D_anchor.json
- config_ActivityNet_C3D_regression.json
- config_Charades-STA_I3D_anchor.json
- config_Charades-STA_I3D_regression.json
- config_Charades-STA_VGG_anchor.json
- config_Charades-STA_VGG_regression.json
- config_TACoS_C3D_anchor.json
- config_TACoS_C3D_regression.json
Set the "datasets_root"
in each config file to be your feature path.
To train on a given dataset with a given grounding head, run:

```
python main.py --json_file=$JSON_FILE_PATH$ --mode=train
```
For evaluation, run:

```
python main.py --json_file=$JSON_FILE_PATH$ --mode=test --checkpoint=$CHECKPOINT_PATH$
```
The pretrained models are listed below; each checkpoint is hosted on Baidu with its extraction code.
| Datasets | Feature | Decoder | Checkpoints |
|---|---|---|---|
| Charades-STA | I3D | Regression | [Baidu] (code: gj54) |
| Charades-STA | I3D | Anchor | [Baidu] (code: 5j3a) |
| Charades-STA | VGG | Regression | [Baidu] (code: 52xf) |
| Charades-STA | VGG | Anchor | [Baidu] (code: rdmx) |
| ActivityNet | C3D | Regression | [Baidu] (code: 6sbh) |
| ActivityNet | C3D | Anchor | [Baidu] (code: ysr5) |
| TACoS | C3D | Regression | [Baidu] (code: iwx2) |
| TACoS | C3D | Anchor | [Baidu] (code: 1ube) |
Run `cd spatiotemporal_grounding` for the spatiotemporal video grounding task. The code for spatiotemporal grounding is built on the TubeDETR codebase.
We prepare the HC-STVG and VidSTG datasets following TubeDETR. The annotation format of the VidSTG dataset has been optimized to reduce training memory usage.
Videos:
- VidSTG dataset: download the VidOR videos from the VidOR dataset providers.
- HC-STVG dataset: download the HC-STVG videos from the HC-STVG dataset providers.
Edit the item `vidstg_vid_path` in `spatiotemporal_grounding/config/vidstg.json` and the item `hcstvg_vid_path` in `spatiotemporal_grounding/config/hcstvg.json` to be the current video paths.
Annotations:
Download the preprocessed annotation files from [https://pan.baidu.com/s/1oiV9PmtRqRxxdxMvqrJj_w, password: n6y4], then put the downloaded annotations into `spatiotemporal_grounding`.
To train on the HC-STVG dataset, run:

```
python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result
```

To train on the VidSTG dataset, run:

```
python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result
```

To evaluate on the HC-STVG dataset, run:

```
python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result --eval --resume=$CHECKPOINT_PATH$
```

To evaluate on the VidSTG dataset, run:

```
python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result --eval --resume=$CHECKPOINT_PATH$
```
If you find this work useful, please cite:

```
@article{2023saw,
  title   = {Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query},
  author  = {Yuting Su and Weikang Wang and Jing Liu and Shuang Ma and Xiaokang Yang},
  journal = {IEEE Transactions on Image Processing},
  year    = {2023}
}
```