The official implementation of the paper "Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query" [Paper]
We propose a unified framework that handles the whole video sequentially, with long-range and dense visual-linguistic interaction, in an end-to-end manner. Specifically, we design a lightweight relevance-filtering-based transformer (Ref-Transformer), composed of relevance filtering based attention and a temporally expanded MLP. Text-relevant spatial regions and temporal clips in the video are efficiently highlighted through relevance filtering and then propagated over the whole video sequence by the temporally expanded MLP. The unified framework can be applied to various video-text action localization tasks, e.g., referring video segmentation, temporal sentence grounding, and spatiotemporal video grounding.
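For intuition, the sketch below shows one plausible reading of these two components in PyTorch. It is illustrative only, not the authors' implementation: the gating formula, module names, and shapes are all assumptions made here.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# "Relevance filtering" is read here as gating per-clip video features by
# their similarity to the text query; the "temporally expanded MLP" is read
# as an MLP that mixes information across the time axis.
import torch
import torch.nn as nn

class RelevanceFilter(nn.Module):
    """Gate per-clip video features by their relevance to a text query."""
    def __init__(self, dim):
        super().__init__()
        self.vid_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, video, text):
        # video: (B, T, D) clip features; text: (B, D) pooled query embedding
        v = self.vid_proj(video)                       # (B, T, D)
        q = self.txt_proj(text).unsqueeze(1)           # (B, 1, D)
        # Scaled dot-product relevance, squashed to a (0, 1) gate per clip.
        rel = torch.sigmoid((v * q).sum(-1, keepdim=True) / v.size(-1) ** 0.5)
        return video * rel                             # suppress irrelevant clips

class TemporalMLP(nn.Module):
    """Propagate the filtered features along the whole sequence via an MLP over time."""
    def __init__(self, seq_len, hidden=256):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))

    def forward(self, x):                              # x: (B, T, D)
        # Transpose so the MLP acts on the temporal axis, then add residually.
        return x + self.mix(x.transpose(1, 2)).transpose(1, 2)

if __name__ == "__main__":
    B, T, D = 2, 64, 512
    video, text = torch.randn(B, T, D), torch.randn(B, D)
    out = TemporalMLP(T)(RelevanceFilter(D)(video, text))
    print(out.shape)  # torch.Size([2, 64, 512])
```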
Requirements:
- python 3.8
- pytorch 1.9.1
- torchtext 0.10.1
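These can be installed with pip, for example (pick the torch build matching your CUDA version):

```
pip install torch==1.9.1 torchtext==0.10.1
```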
Run `cd referring_segmentation` for the referring video segmentation task.
Download the A2D Sentences dataset and J-HMDB Sentences dataset from https://kgavrilyuk.github.io/publication/actor_action/ and convert the videos to RGB frames. For the A2D Sentences dataset, run `python pre_proc\video2imgs.py` to convert the videos to RGB frames (a minimal sketch of this conversion is given after the directory layout). The following directory structure is expected:
```
-a2d_sentences
    -Rename_Images
    -a2d_annotation_with_instances
    -videoset.csv
    -a2d_missed_videos.txt
    -a2d_annotation.txt
-jhmdb_sentences
    -Rename_Images
    -puppet_mask
    -jhmdb_annotation.txt
```
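The frame extraction referenced above amounts to dumping each video's frames as images. A minimal sketch, assuming OpenCV (the actual script is `pre_proc\video2imgs.py` in this repo):

```python
# Minimal sketch of video-to-frames conversion; the real script may differ
# in naming scheme and image format.
import os
import cv2

def video_to_frames(video_path, out_dir):
    """Dump every frame of video_path as a zero-padded JPEG into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video (or read error)
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
```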
Edit the item `datasets_root` in `json/config_$DATASET$.json` to be the current dataset path.
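For reference, the edited entry would look roughly like this (the path is a placeholder, and the config's other keys are omitted here):

```json
{
    "datasets_root": "/path/to/a2d_sentences"
}
```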
Run `python pre_proc\generate_data_list.py` to generate the training and testing data splits.
Download the pretrained DeepLabResNet from https://github.com/VainF/DeepLabV3Plus-Pytorch and put it into `model/pretrained/`.
Only the A2D Sentences dataset is used for training; run:

```
python main.py --json_file=json\config_a2d_sentences.json --mode=train
```
To test on the A2D Sentences dataset, run:

```
python main.py --json_file=json\config_a2d_sentences.json --mode=test
```

To test on the J-HMDB Sentences dataset, run:

```
python main.py --json_file=json\config_jhmdb_sentences.json --mode=test
```
Run `cd temporal_grounding` for the temporal sentence grounding task.
- For the Charades-STA dataset, download the pre-extracted I3D features following LGI4temporalgrounding and the pre-extracted VGG features following 2D-TAN.
- For the TACoS dataset, download the pre-extracted C3D features following 2D-TAN.
- For the ActivityNet Captions dataset, download the pre-extracted C3D features from http://activity-net.org/challenges/2016/download.html.
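After downloading, a quick sanity check can save debugging time later. The on-disk layout depends on the feature providers (LGI4temporalgrounding / 2D-TAN), so this sketch assumes one `.npy` array of shape `(num_clips, feat_dim)` per video; the file name below is a placeholder:

```python
# Hypothetical sanity check -- adjust the path and format to your download.
import numpy as np

feat = np.load("features/charades_i3d/VIDEO_ID.npy")  # placeholder file name
print(feat.shape, feat.dtype)  # e.g. (num_clips, 1024) float32 for I3D
```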
The config files can be found in `./json`, and the following model settings are supported:
- config_ActivityNet_C3D_anchor.json
- config_ActivityNet_C3D_regression.json
- config_Charades-STA_I3D_anchor.json
- config_Charades-STA_I3D_regression.json
- config_Charades-STA_VGG_anchor.json
- config_Charades-STA_VGG_regression.json
- config_TACoS_C3D_anchor.json
- config_TACoS_C3D_regression.json
Set the "datasets_root"
in each config file to be your feature path.
To train on a given dataset with a given grounding head, run:

```
python main.py --json_file=$JSON_FILE_PATH$ --mode=train
```
For evaluation, run:

```
python main.py --json_file=$JSON_FILE_PATH$ --mode=test --checkpoint=$CHECKPOINT_PATH$
```
The pretrained models are listed below; each checkpoint is hosted on Baidu with its extraction code.
| Datasets | Feature | Decoder | Checkpoints |
|---|---|---|---|
| Charades-STA | I3D | Regression | [Baidu] (code: gj54) |
| Charades-STA | I3D | Anchor | [Baidu] (code: 5j3a) |
| Charades-STA | VGG | Regression | [Baidu] (code: 52xf) |
| Charades-STA | VGG | Anchor | [Baidu] (code: rdmx) |
| ActivityNet | C3D | Regression | [Baidu] (code: 6sbh) |
| ActivityNet | C3D | Anchor | [Baidu] (code: ysr5) |
| TACoS | C3D | Regression | [Baidu] (code: iwx2) |
| TACoS | C3D | Anchor | [Baidu] (code: 1ube) |
Run `cd spatiotemporal_grounding` for the spatiotemporal video grounding task. The code for spatiotemporal grounding is built on the TubeDETR codebase.
We prepare the HC-STVG and VidSTG datasets following TubeDETR. The annotation format of the VidSTG dataset has been optimized to reduce training memory usage.
Videos:
- VidSTG dataset: download the VidOR videos from the VidOR dataset providers.
- HC-STVG dataset: download the HC-STVG videos from the HC-STVG dataset providers.
Edit the item `vidstg_vid_path` in `spatiotemporal_grounding/config/vidstg.json` and the item `hcstvg_vid_path` in `spatiotemporal_grounding/config/hcstvg.json` to be the current video paths.
Annotations:
Download the preprocessed annotation files from [https://pan.baidu.com/s/1oiV9PmtRqRxxdxMvqrJj_w, password: n6y4], then put the downloaded annotations into `spatiotemporal_grounding`.
To train on the HC-STVG dataset, run:

```
python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result
```

To train on the VidSTG dataset, run:

```
python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result
```

To evaluate on the HC-STVG dataset, run:

```
python main.py --combine_datasets=hcstvg --combine_datasets_val=hcstvg --dataset_config config/hcstvg.json --output-dir=hcstvg_result --eval --resume=$CHECKPOINT_PATH$
```

To evaluate on the VidSTG dataset, run:

```
python main.py --combine_datasets=vidstg --combine_datasets_val=vidstg --dataset_config config/vidstg.json --output-dir=vidstg_result --eval --resume=$CHECKPOINT_PATH$
```
If you find this work useful, please cite:

```
@article{2023saw,
  title   = {Sequence as A Whole: A Unified Framework for Video Action Localization with Long-range Text Query},
  author  = {Yuting Su and Weikang Wang and Jing Liu and Shuang Ma and Xiaokang Yang},
  journal = {IEEE Transactions on Image Processing},
  year    = {2023}
}
```