a multi-modal video caption dataset with richer annotation

SCZwangxiao/DEPICT

DEPICT: Towards Holistic Long Video-Language Understanding

[Figure: dataset examples]

About DEPICT dataset

From the paper "DEPICT: Towards Holistic Long Video-Language Understanding" (under review).

In this work, we curate the DEPICT dataset, a high-quality large-scale video caption dataset of untrimmed videos, consisting of 304K videos and 81.3M caption tokens. It has richer annotations than existing video caption datasets, including 8x more caption tokens, 2x more average unique tokens/video, and more modalities.

[Figure: dataset statistics]

Download

We provide two download options:

Unzip the files to produce the following directory layout:

data
└── depict
    ├── annotations
    │   ├── train.json
    │   ├── val.json
    │   └── test.json
    └── videos
Visualization

We provide code for dataset visualization in ./visualization.ipynb.

If a video does not play in the Jupyter notebook above, you may be missing the HEVC/H.265 codec. Install it, or convert the videos to H.264 for proper visualization. This issue does not affect the baseline code, because both video decoding packages, PyAV and decord, can decode H.265.
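
One possible way to do the H.264 conversion is to call `ffmpeg` from Python. This is only a sketch, not part of the repository: it assumes `ffmpeg` is installed with the `libx264` encoder, and the function names and any file paths you pass in are hypothetical.

```python
import subprocess
from pathlib import Path


def h264_command(src, dst):
    """Build an ffmpeg command that re-encodes the video stream of src to H.264."""
    # Audio handling is left to ffmpeg's defaults for the output container.
    return ["ffmpeg", "-i", str(src), "-c:v", "libx264", str(dst)]


def convert_to_h264(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg exits with an error."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(h264_command(src, dst), check=True)
```

Writing the converted files to a separate directory keeps the original H.265 videos intact for the baselines.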

Baselines

We provide instructions to run the baselines in the paper to reproduce all experiments including ablation studies:

After inference finishes, run evaluation.py to get the evaluation results.

Annotation format

Each annotation file is a JSON list with one record per video, as shown below:

[
  {
    "BVid": "BV1hT421X7xV",
    "video_duration": 158,
    "video_title": "Both a Home-cooked Dish and a Street-side Specialty Snack, Master teaches you how to make Salt and Pepper Mushrooms",
    "summarization": "This is a simple and easy-to-learn home-cooked dish - the recipe for Salt and Pepper Mushrooms, suitable for selling at a stall ...",
    "asr_results": "Do we have a familiar and delicious home-cooked dish that can be made into a street snack, light and swift to set up a stall ..."
  },
  ...
]
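
As an illustration of reading this format, the sketch below loads an annotation file and computes simple duration statistics. The field names come from the example above; the function names are hypothetical, and the unit of `video_duration` is not stated here (presumably seconds).

```python
import json


def load_annotations(path):
    """Load one annotation file (a JSON list of per-video records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(records):
    """Count videos and aggregate the video_duration field (hypothetical helper)."""
    total = sum(r["video_duration"] for r in records)
    mean = total / len(records) if records else 0.0
    return {"num_videos": len(records), "total_duration": total, "mean_duration": mean}


if __name__ == "__main__":
    # Path follows the directory layout described in the Download section.
    records = load_annotations("data/depict/annotations/train.json")
    print(summarize(records))
    print(records[0]["BVid"], records[0]["video_title"])
```

The other fields (`summarization`, `asr_results`) are plain strings and can be accessed in the same way.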
