Video Mamba Suite: Dense Video Captioning

Implementation for Mamba-based PDVC (ICCV 2021) [paper]

With additional support for:

Mamba-based feature encoder
DDP Training (not supported in the official repo)

Table of Contents:

Preparation
Training and Validation
Performance
- Dense video captioning
Acknowledgement

Preparation

Install PyTorch and the dependencies. The code has been run successfully with (1) torch==1.13.1+cu117 and (2) torch==2.1.2+cu118; other PyTorch versions may also work.

conda install ffmpeg
pip install -r requirement.txt
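
To see whether the installed PyTorch matches one of the two versions listed above, a small check like the following can help. This is only a sketch: the helper names `parse_version` and `is_tested_version` are hypothetical and not part of this repo, and an untested version is reported rather than rejected, since other versions may also work.

```python
import re

# Versions reported as tested above; local build tags such as +cu117 are ignored.
TESTED_VERSIONS = {(1, 13, 1), (2, 1, 2)}


def parse_version(version: str) -> tuple:
    """Turn a string like '2.1.2+cu118' into (2, 1, 2)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_tested_version(version: str) -> bool:
    """True if the major.minor.patch part matches a tested version."""
    return parse_version(version) in TESTED_VERSIONS


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        tested = is_tested_version(torch.__version__)
        status = "tested" if tested else "untested, may still work"
        print(f"torch {torch.__version__}: {status}")
```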

Compile the deformable attention layer (requires GCC >= 5.4).

cd pdvc/ops
sh make.sh

Install mamba following the main README.md, and make sure you can import it successfully:

from mamba_ssm.modules.mamba_simple import Mamba
from mamba_ssm.modules.mamba_new import Mamba
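
The two imports above can also be checked from a script that reports the failure reason instead of stopping at the first error. The sketch below assumes nothing beyond the module paths shown above; the helper name `import_error` is hypothetical.

```python
import importlib
from typing import Optional

# Module paths taken from the import lines above.
MAMBA_MODULES = [
    "mamba_ssm.modules.mamba_simple",
    "mamba_ssm.modules.mamba_new",
]


def import_error(module_name: str) -> Optional[str]:
    """Return None if the module imports cleanly, else a short error description."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # a broken compiled extension may raise more than ImportError
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    for name in MAMBA_MODULES:
        error = import_error(name)
        print(f"{name}: {'ok' if error is None else error}")
```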

Make sure the submodules SODA and pycocoevalcap, which are used for evaluation, exist.

from densevid_eval3.SODA.soda import SODA

If the modules do not exist, please make sure you have cloned the repo with '--recursive', or run

cd path/to/your/video-mamba-suite/
git submodule update --init --recursive
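
A missing '--recursive' clone typically leaves the submodule directories present but empty, so a quick directory check can confirm whether the update is needed. The following is a sketch only: `missing_submodules` is a hypothetical helper, `densevid_eval3/SODA` is inferred from the import path above, and the location of pycocoevalcap is an assumption that should be adjusted to the actual repo layout.

```python
from pathlib import Path


def missing_submodules(repo_root, submodule_dirs):
    """Return the listed directories that are absent or empty under repo_root."""
    missing = []
    for rel in submodule_dirs:
        path = Path(repo_root) / rel
        # An uninitialised submodule usually shows up as an empty directory.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Assumed paths; only the SODA one is inferred from the README's import line.
    candidates = ["densevid_eval3/SODA", "densevid_eval3/pycocoevalcap"]
    for rel in missing_submodules(".", candidates):
        print(f"missing or empty: {rel}")
```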

Training

Download Video Features

cd data/anet/features
bash download_anet_c3d.sh
# bash download_anet_tsn.sh
# bash download_i3d_vggish_features.sh
# bash download_tsp_features.sh

The preprocessed C3D features have also been uploaded to baiduyun drive.

Download YouCook2 Video Features

cd data/yc2/features
bash download_yc2_tsn_features.sh

Dense Video Captioning

PDVC-DeformableTransformer with learnt proposals

# E.g., train on the ANet dataset with 8 GPUs
torchrun --nproc_per_node=8 train.py \
--cfg_path cfgs/anet_c3d_pdvc.yml \
--disable_cudnn 1 \
--save_dir /path/to/your/folder/anet_c3d_pdvc_deformableTransformer_8gpus/ \
--encoder_type deformable

PDVC-Mamba with learnt proposals

# E.g., train on the ANet dataset with 8 GPUs
torchrun --nproc_per_node=8 train.py \
--cfg_path cfgs/anet_c3d_pdvc.yml \
--disable_cudnn 1 \
--save_dir /path/to/your/folder/anet_c3d_pdvc_mamba_8gpus/ \
--encoder_type mamba-dbm

# E.g., train on the YouCook2 dataset with 1 GPU
torchrun --nproc_per_node=1 train.py \
--cfg_path cfgs/yc2_tsn_pdvc.yml \
--disable_cudnn 1 \
--save_dir /path/to/your/folder/yc2_tsn_pdvc_mamba_1gpu/ \
--encoder_type mamba-dbm

Video Paragraph Captioning

PDVC-Mamba with gt-proposals

# E.g., train on the ANet dataset with 8 GPUs
torchrun --nproc_per_node=8 train.py \
--cfg_path cfgs/anet_c3d_pdvc_gt.yml \
--disable_cudnn 1 \
--criteria_for_best_ckpt pc \
--save_dir /path/to/your/folder/anet_c3d_pc_mamba_8gpus/ \
--encoder_type mamba-dbm
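
The training commands above differ only in the config file, save directory, encoder type, GPU count, and the optional checkpoint criterion. When launching several runs, it may be convenient to assemble the command in Python. The sketch below uses only the flags that appear in the commands above; `build_train_command` is a hypothetical helper, not part of this repo.

```python
import shlex


def build_train_command(cfg_path, save_dir, encoder_type,
                        nproc_per_node=8, extra_args=None):
    """Assemble a torchrun command as an argument list, mirroring the examples above."""
    cmd = [
        "torchrun", f"--nproc_per_node={nproc_per_node}", "train.py",
        "--cfg_path", cfg_path,
        "--disable_cudnn", "1",
        "--save_dir", save_dir,
        "--encoder_type", encoder_type,
    ]
    # Optional flags, e.g. {"criteria_for_best_ckpt": "pc"} for paragraph captioning.
    for flag, value in (extra_args or {}).items():
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        cfg_path="cfgs/yc2_tsn_pdvc.yml",
        save_dir="/path/to/your/folder/yc2_tsn_pdvc_mamba_1gpu/",
        encoder_type="mamba-dbm",
        nproc_per_node=1,
    )
    # Print a copy-pastable shell command; pass cmd to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids the line-continuation pitfalls of multi-line shell commands, where a stray space after a backslash ends the command early.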

Performance

Dense video captioning on ANet (with learnt proposals)

Model	Features	config_path	Url	Recall	Precision	B-4	M	R	C	SODA
PDVC-Deformable	C3D	cfgs/anet_c3d_pdvc.yml	todo	51.74	56.11	1.75	6.73	14.73	26.07	5.47
PDVC-Mamba	C3D	cfgs/anet_c3d_pdvc.yml	todo	52.45	56.33	1.76	7.16	14.83	26.77	5.27

Dense video captioning on YouCook2 (with learnt proposals)

Model	Features	config_path	Url	Recall	Precision	B-4	M	R	C	SODA
PDVC-Deformable	TSN	cfgs/yc2_tsn_pdvc.yml	todo	23.00	31.12	0.73	4.25	9.31	20.48	4.02
PDVC-Mamba	TSN	cfgs/yc2_tsn_pdvc.yml	todo	25.27	32.41	0.86	4.44	9.62	21.90	4.32

Notes: 'B-4', 'M', 'R', and 'C' refer to 'BLEU-4', 'METEOR', 'ROUGE-L', and 'CIDEr', respectively. More details can be found in PDVC.
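
To read the two tables side by side, the per-metric difference between PDVC-Mamba and PDVC-Deformable can be computed directly from the numbers above. The sketch below copies the table values verbatim; `metric_deltas` is a hypothetical helper and no new results are introduced.

```python
METRICS = ["Recall", "Precision", "BLEU-4", "METEOR", "ROUGE-L", "CIDEr", "SODA"]

# Values copied from the tables above, in the order of METRICS.
RESULTS = {
    "ANet": {
        "PDVC-Deformable": [51.74, 56.11, 1.75, 6.73, 14.73, 26.07, 5.47],
        "PDVC-Mamba": [52.45, 56.33, 1.76, 7.16, 14.83, 26.77, 5.27],
    },
    "YouCook2": {
        "PDVC-Deformable": [23.00, 31.12, 0.73, 4.25, 9.31, 20.48, 4.02],
        "PDVC-Mamba": [25.27, 32.41, 0.86, 4.44, 9.62, 21.90, 4.32],
    },
}


def metric_deltas(dataset: str) -> dict:
    """Return PDVC-Mamba minus PDVC-Deformable per metric, rounded to 2 decimals."""
    rows = RESULTS[dataset]
    pairs = zip(METRICS, rows["PDVC-Mamba"], rows["PDVC-Deformable"])
    return {name: round(mamba - deform, 2) for name, mamba, deform in pairs}


if __name__ == "__main__":
    for dataset in RESULTS:
        print(dataset, metric_deltas(dataset))
```

On ANet the difference is positive for every column except SODA (5.27 vs. 5.47); on YouCook2 it is positive for every column.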

Acknowledgement

The codebase is based on PDVC. We thank the authors for their efforts.
