Benchmark introduced in Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory
Jinzhuo Liu1, Jiangning Zhang1✉, Wencan Jiang1, Yabiao Wang2, Dingkang Liang3, Zhucun Xue1, Ran Yi4, Yong Liu1
1Zhejiang University,
2Tencent Youtu Lab,
3Huazhong University of Science and Technology,
4Shanghai Jiao Tong University
✉Corresponding author
We introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessment. The benchmark is introduced together with IAMFlow.
Comparison of related long-video generation benchmarks.
| Benchmark | VQ | TC | IC | Prompt Type | Aggregation Strategy | Year |
|---|---|---|---|---|---|---|
| VBench-Long | ✓ | × | × | Single | Slow-Fast Avg. | 2024 |
| LV-Bench | ✓ | ✓ | × | Single | VDE | 2025 |
| NarrLV | × | ✓ | ✓ | Single | TNA-based QA | 2025 |
| NarraStream-Bench | ✓ | ✓ | ✓ | Multi | Narrative-Aware | 2026 |
git clone git@github.com:Eddie0521/NarraStream-Bench.git
cd NarraStream-Bench
conda create -n NarraStream-Bench python=3.10
conda activate NarraStream-Bench
# Install a PyTorch build that matches your CUDA/runtime first.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
Download the metric backbones and auxiliary weights:
bash scripts/download_weights.sh
By default, checkpoints are saved to ./pretrained and resolved by configs/paths.yaml. Expected checkpoints include CLIP, DINO, RAFT, AMT, VTSS, and LanguageBind video weights.
NarraStream-Bench uses API-backed MLLM/VLM metrics by default. Set your API key before running the full evaluation:
export SILICONFLOW_API_KEY=your_api_key
Prepare generated videos and prompts in the following structure:
your_dataset/
├── prompt.jsonl
└── video/
    ├── sample_0.mp4
    ├── sample_1.mp4
    └── ...
Each line in prompt.jsonl should contain one sample:
{"prompts": ["segment prompt 1", "segment prompt 2", "segment prompt 3"]}
The number of videos must match the number of prompt samples. If videos are not named as sample_0.mp4, sample_1.mp4, ..., NarraStream-Bench will read all supported video files in natural sorted order.
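Before launching an evaluation, it can help to check the dataset layout locally. The following is a minimal sketch, not part of NarraStream-Bench: the helper names (`natural_key`, `load_prompts`, `list_videos`, `check_dataset`) and the `VIDEO_EXTS` set are assumptions made for illustration, and the benchmark's own list of supported extensions and its exact sorting routine may differ.

```python
import json
import re
from pathlib import Path

# Assumed extension set; the README only says "supported video files".
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}


def natural_key(name):
    """Sort key that compares digit runs as integers (sample_2 < sample_10)."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def load_prompts(prompt_path):
    """Read prompt.jsonl and return one list of segment prompts per sample."""
    samples = []
    with open(prompt_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            prompts = json.loads(line).get("prompts")
            ok = (isinstance(prompts, list) and prompts
                  and all(isinstance(p, str) for p in prompts))
            if not ok:
                raise ValueError(
                    f"line {line_no}: 'prompts' must be a non-empty list of strings")
            samples.append(prompts)
    return samples


def list_videos(video_dir):
    """List video files in natural sorted order."""
    files = [p for p in Path(video_dir).iterdir()
             if p.is_file() and p.suffix.lower() in VIDEO_EXTS]
    return sorted(files, key=lambda p: natural_key(p.name))


def check_dataset(root):
    """Pair each video with its prompt sample, failing on a count mismatch."""
    root = Path(root)
    samples = load_prompts(root / "prompt.jsonl")
    videos = list_videos(root / "video")
    if len(samples) != len(videos):
        raise ValueError(
            f"{len(videos)} videos but {len(samples)} prompt samples")
    return list(zip(videos, samples))
```

Calling `check_dataset("your_dataset")` returns the (video, prompts) pairs in the order this sketch would match them, which makes an off-by-one pairing easy to spot before spending GPU time.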
bash scripts/run_narrastream_bench.sh \
--run-name my_eval \
--video-dir your_dataset/video \
--prompts your_dataset/prompt.jsonl \
--gpu-id 0
Results are saved under runs/<run-name>/ by default:
runs/<run-name>/
├── processed/
│ ├── eval_data.json
│ ├── .preprocess_signature
│ └── sample_*/
│ ├── seg_0.mp4
│ ├── seg_1.mp4
│ └── ...
└── results/
    ├── results_latest.json
    ├── results_YYYYMMDD_HHMMSS.json
    ├── steps/
    ├── raw_metrics/
    └── artifacts/
The main files to inspect are:
- results_latest.json: latest resumable snapshot, updated after each metric.
- results_YYYYMMDD_HHMMSS.json: final timestamped result file.
- processed/eval_data.json: preprocessed segment metadata.
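To inspect a run programmatically, the snapshot can be loaded with the standard library. This is a hedged sketch: the README does not document the JSON schema of results_latest.json, so the code assumes only that it is valid JSON and collects whatever numeric values it finds. The function names `load_latest_results` and `numeric_leaves` are hypothetical helpers, not part of the benchmark.

```python
import json
from pathlib import Path


def load_latest_results(run_name, runs_root="runs"):
    """Load runs/<run-name>/results/results_latest.json as a Python object."""
    path = Path(runs_root) / run_name / "results" / "results_latest.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def numeric_leaves(node, prefix=""):
    """Flatten nested dicts into {'a.b': value} for numeric leaves only.

    No particular schema is assumed; non-numeric values are skipped.
    """
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            flat.update(numeric_leaves(value, child))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        flat[prefix] = node
    return flat


if __name__ == "__main__":
    # "my_eval" matches the --run-name used in the example command.
    results = load_latest_results("my_eval")
    for name, value in sorted(numeric_leaves(results).items()):
        print(f"{name}: {value}")
```

Because results_latest.json is updated after each metric, running this while an evaluation is in progress shows only the metrics finished so far.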
Please leave us a star 🌟 and cite our paper if you find our work helpful.
@misc{liu2026advancingnarrativelongvideo,
title={Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory},
author={Jinzhuo Liu and Jiangning Zhang and Wencan Jiang and Yabiao Wang and Dingkang Liang and Zhucun Xue and Ran Yi and Yong Liu},
year={2026},
eprint={2605.18733},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.18733},
}
