
Video captioning models in PyTorch (work in progress)

This repository contains PyTorch implementations of state-of-the-art video captioning models from 2015-2019, evaluated on the MSVD and MSR-VTT datasets. Details are given in the table below.

| Model | Datasets | Paper | Year | Status |
|---|---|---|---|---|
| Mean Pooling | MSVD, MSR-VTT | Translating videos to natural language using deep recurrent neural networks [1] | 2015 | Implemented |
| S2VT | MSVD, MSR-VTT | Sequence to Sequence - Video to Text [2] | 2015 | Implemented |
| SA-LSTM | MSVD, MSR-VTT | Describing videos by exploiting temporal structure [3] | 2015 | Implemented |
| RecNet | MSVD, MSR-VTT | Reconstruction Network for Video Captioning [4] | 2018 | Implemented |
| MARN | MSVD, MSR-VTT | Memory-Attended Recurrent Network for Video Captioning [5] | 2019 | Implemented |

*More recent models will be added in the future.
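
As a rough illustration of the simplest entry in the table, the sketch below shows the mean-pooling idea from [1]: frame-level CNN features are averaged over time into a single video vector that conditions the caption decoder. This is an illustrative sketch only, not the code in this repository; the class name, feature dimension (1536 for Inception-v4), and hidden size are assumptions.

```python
import torch
import torch.nn as nn

class MeanPoolEncoder(nn.Module):
    """Illustrative mean-pooling video encoder (not this repo's exact module)."""
    def __init__(self, feat_dim=1536, hidden_dim=512):
        super().__init__()
        # Project the averaged frame feature into the decoder's hidden size.
        self.proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) CNN features per frame
        video_vec = frame_feats.mean(dim=1)       # temporal mean pooling
        return torch.tanh(self.proj(video_vec))   # (batch, hidden_dim)

# Example: 28 Inception-v4 frame features (1536-d) pooled to a 512-d video vector.
feats = torch.randn(2, 28, 1536)
print(MeanPoolEncoder()(feats).shape)  # torch.Size([2, 512])
```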

Environment

  • Ubuntu 18.04
  • CUDA 11.0
  • Nvidia GeForce RTX 2080Ti

Requirements

  • Java 8
  • Python 3.8.5
    • PyTorch 1.7.0
    • Other Python libraries specified in requirements.txt

How to Use

Step 1. Set up a Python virtual environment

$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt

Step 2. Prepare data, paths, and hyperparameter settings

  1. Extract features with the network you want to use and place them at <PROJECT ROOT>/<DATASET>/features/<DATASET>_APPEARANCE_<NETWORK>_<FRAME_LENGTH>.hdf5. To extract the features yourself, follow the repository here. Alternatively, download the already extracted features from the table below and place them in <PROJECT ROOT>/<DATASET>/features/.

    | Dataset | Feature Type | Inception-v4 | InceptionResNetV2 | ResNet-101 | ResNeXt-101 |
    |---|---|---|---|---|---|
    | MSVD | Appearance | link | link | link | - |
    | MSR-VTT | Appearance | link | link | link | - |
    | MSVD | Motion | - | - | - | link |

You can change hyperparameters by modifying config.py.
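
Before training, it can help to sanity-check that a feature file is readable. The sketch below assumes the layout produced by the feature-extraction repository referenced above (one HDF5 dataset per video id, shaped (num_frames, feature_dim)); the file name is an example only, so substitute your own dataset and network names.

```python
# Minimal sketch for sanity-checking an extracted feature file.
# Assumption: one HDF5 dataset per video id, shaped (num_frames, feature_dim).
# The path below is an example; substitute your own <DATASET>/<NETWORK> names.
import h5py

feature_path = "MSVD/features/MSVD_APPEARANCE_InceptionV4_28.hdf5"  # hypothetical name

with h5py.File(feature_path, "r") as f:
    video_ids = list(f.keys())
    print(len(video_ids), "videos found")
    sample = f[video_ids[0]][()]           # numpy array of frame features
    print(video_ids[0], sample.shape)      # expected: (num_frames, feature_dim)
```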

Step 3. Prepare evaluation code

Clone the evaluation code from the official coco-caption repo.

(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption

Alternatively, copy the pycocoevalcap folder and its contents into the project root.
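
For reference, the scorers inside pycocoevalcap can also be called directly, as shown below. This is a generic usage sketch, not the evaluation wrapper used in this repository; the video ids and captions are made up. Note that the METEOR scorer shells out to Java, which is why Java 8 appears under Requirements.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both dicts map a video id to a list of tokenized caption strings.
gts = {  # reference captions
    "vid1": ["a man is playing a guitar", "a person plays the guitar"],
    "vid2": ["a woman is cooking in the kitchen"],
}
res = {  # model outputs (one caption per video)
    "vid1": ["a man is playing a guitar"],
    "vid2": ["a woman is cooking"],
}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```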

Step 4. Training

Follow the demo given in video_captioning.ipynb.
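
For orientation before opening the notebook, the sketch below shows the generic shape of a teacher-forced captioning training step (mean-pooled video features feeding an LSTM decoder, cross-entropy on next-word prediction). Everything here is a toy stand-in: the ToyCaptioner class, vocabulary size, and synthetic batch are assumptions, not the models or data pipeline used in this repository.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs end to end; the real models and data
# pipeline live in this repository / the notebook and look different.
VOCAB, FEAT_DIM, HID, PAD_IDX = 100, 1536, 512, 0

class ToyCaptioner(nn.Module):
    """Mean-pooled video vector -> LSTM decoder over caption tokens."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(FEAT_DIM, HID)
        self.emb = nn.Embedding(VOCAB, HID, padding_idx=PAD_IDX)
        self.dec = nn.LSTM(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, frame_feats, caption_in):
        # Encode the video as the decoder's initial hidden state, then
        # teacher-force the ground-truth caption prefix through the LSTM.
        h0 = torch.tanh(self.enc(frame_feats.mean(dim=1))).unsqueeze(0)
        dec_out, _ = self.dec(self.emb(caption_in), (h0, torch.zeros_like(h0)))
        return self.out(dec_out)                    # (B, T, VOCAB) logits

model = ToyCaptioner()
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One synthetic batch: 2 videos x 28 frames, captions of length 10.
frame_feats = torch.randn(2, 28, FEAT_DIM)
captions = torch.randint(1, VOCAB, (2, 10))

logits = model(frame_feats, captions[:, :-1])       # predict the next token
loss = criterion(logits.reshape(-1, VOCAB), captions[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```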

Step 5. Inference

Follow the demo given in video_captioning.ipynb.
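
Inference for these models is typically greedy (or beam) decoding from a start-of-sentence token until an end-of-sentence token. The sketch below continues the toy example from Step 4 (it reuses ToyCaptioner, FEAT_DIM, and VOCAB from that block); the BOS/EOS ids and maximum length are made-up values, and the notebook's actual decode loop may differ.

```python
# Greedy decoding with the ToyCaptioner from the Step 4 sketch.
# Hypothetical special-token ids; the real vocabulary lives in this repo.
BOS_IDX, EOS_IDX, MAX_LEN = 1, 2, 20

model.eval()
with torch.no_grad():
    feats = torch.randn(1, 28, FEAT_DIM)         # features for one video
    tokens = [BOS_IDX]
    for _ in range(MAX_LEN):
        inp = torch.tensor(tokens).unsqueeze(0)  # (1, t) prefix so far
        logits = model(feats, inp)               # (1, t, VOCAB)
        next_tok = int(logits[0, -1].argmax())   # pick the most likely next word
        if next_tok == EOS_IDX:
            break
        tokens.append(next_tok)

print(tokens[1:])  # predicted word ids (map back to words via the vocabulary)
```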

Quantitative Results

*MSVD

| Model | Feature extractor | BLEU-4 | METEOR | ROUGE_L | CIDEr | Pretrained weights |
|---|---|---|---|---|---|---|
| Mean Pooling | Inception-v4 | 42.2 | 31.6 | 68.2 | 69.7 | link |
| SA-LSTM | InceptionResNetV2 | 45.5 | 32.5 | 69.0 | 78.0 | link |
| S2VT | Inception-v4 | - | - | - | - | - |
| RecNet (global) | Inception-v4 | - | - | - | - | - |
| RecNet (local) | Inception-v4 | - | - | - | - | - |
| MARN | Inception-v4 | - | - | - | - | - |

*MSR-VTT

| Model | Feature extractor | BLEU-4 | METEOR | ROUGE_L | CIDEr | Pretrained weights |
|---|---|---|---|---|---|---|
| Mean Pooling | Inception-v4 | 34.9 | 25.5 | 58.12 | 35.76 | link |
| SA-LSTM | Inception-v4 | - | - | - | - | - |
| S2VT | Inception-v4 | - | - | - | - | - |
| RecNet (global) | Inception-v4 | - | - | - | - | - |
| RecNet (local) | Inception-v4 | - | - | - | - | - |
| MARN | Inception-v4 | - | - | - | - | - |

References

[1] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In NAACL-HLT, 2015.

[2] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to Sequence - Video to Text. In ICCV, 2015.

[3] L. Yao et al. Describing Videos by Exploiting Temporal Structure. In ICCV, 2015.

[4] B. Wang et al. Reconstruction Network for Video Captioning. In CVPR, 2018.

[5] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai. Memory-Attended Recurrent Network for Video Captioning. In CVPR, 2019.

Acknowledgement

Some of the coding ideas and the extracted video features come from hobincar/pytorch-video-feature-extractor. Many thanks!
