Multimodal Video Captioning

Code for my Master's thesis on multimodal video captioning. SwinBERT serves as the baseline, and I integrated audio features extracted with VGGish into the architecture, yielding gains of up to +1.6 in captioning metrics.
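
To make the idea of adding audio to a video captioning model more concrete, here is a minimal NumPy sketch of one common way to do it: standardize the audio embeddings with precomputed per-dimension statistics, project them to the model's hidden size, and append them to the video token sequence. This is an illustration under stated assumptions, not necessarily how this repository fuses the modalities. The helper names `standardize_audio` and `fuse_tokens`, the hidden size of 512, the token counts, and the random projection are all hypothetical. The only fixed fact used is that VGGish produces 128-dimensional embeddings.

```python
import numpy as np


def standardize_audio(features, mean, std, eps=1e-6):
    """Z-score each audio feature dimension with precomputed statistics."""
    return (features - mean) / (std + eps)


def fuse_tokens(video_tokens, audio_tokens):
    """Concatenate video and audio tokens along the sequence axis."""
    if video_tokens.shape[-1] != audio_tokens.shape[-1]:
        raise ValueError("video and audio tokens must share the hidden size")
    return np.concatenate([video_tokens, audio_tokens], axis=0)


rng = np.random.default_rng(0)

# Stand-in for VGGish output: 10 audio frames, 128 dimensions each.
audio = rng.normal(loc=3.0, scale=2.0, size=(10, 128))

# Statistics are computed on the fly here; in practice they could be loaded
# from files such as aud_mean.npy and aud_std.npy (assumption about their role).
mean = audio.mean(axis=0)
std = audio.std(axis=0)
audio_norm = standardize_audio(audio, mean, std)

# Hypothetical linear projection from 128-d audio features to the hidden size.
hidden = 512
projection = rng.normal(size=(128, hidden)) * 0.02
audio_tokens = audio_norm @ projection

# Stand-in for the video tokens produced by the visual encoder.
video_tokens = rng.normal(size=(784, hidden))

fused = fuse_tokens(video_tokens, audio_tokens)
print(fused.shape)
```

Concatenating along the sequence axis lets a transformer attend jointly over both modalities without changing its hidden size. Other fusion schemes, such as cross-attention or feature-wise addition, are equally plausible.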

Folders and files:

models/32frm/vatex/best-checkpoint
prepro
scripts
src
README.md
attention_mask.npy
aud_mean.npy
aud_std.npy
launch_container.sh
linelists.py
redo_prepro.py
reproduce_results.ipynb
setup.sh

El-Zag/Multimodal-Video-Captioning
