Qinghe Wang1
Xiaoyu Shi2✉
Baolu Li1
Weikang Bian3
Quande Liu2
Huchuan Lu1
Xintao Wang2
Pengfei Wan2
Kun Gai2
Xu Jia1✉
1Dalian University of Technology
2Kling Team, Kuaishou Technology
3The Chinese University of Hong Kong
✉Corresponding author
Note: The open-source version is based on Wan2.1-T2V-1.3B and Wan2.1-T2V-14B.
- [2026.02.10]: Training and inference code, as well as model checkpoints, are available.
- [2026.01.26]: We won First Prize (🥇 1st Place) at the AAAI CVM 2026 Main Track.
- [2025.12.03]: Released the Project Page and the arXiv version.
MultiShotMaster is a controllable multi-shot narrative video generation framework that supports 1) text-driven inter-shot consistency, 2) variable shot counts and durations, 3) subject customization with motion control, and 4) background-driven scene customization.
- Code & Models for Multi-Shot & Multi-Reference Generation
- Code & Models for Multi-Shot Generation
Environment
- Create a conda environment and install dependencies:
git clone https://github.com/KlingTeam/MultiShotMaster
cd MultiShotMaster
conda create -n MultiShotMaster python=3.12 -y
conda activate MultiShotMaster
pip install -e .
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
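As a quick optional sanity check (assuming a CUDA-capable machine), confirm that PyTorch sees your GPUs and that flash-attn imports cleanly:
# optional sanity check for the freshly created environment
python -c "import torch, flash_attn; print(torch.cuda.device_count(), 'GPU(s) visible')"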
Model Checkpoints
- Download Checkpoints using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download KlingTeam/MultiShotMaster --local-dir checkpoints
# or using git:
git lfs install
git clone https://huggingface.co/KlingTeam/MultiShotMaster
- Set the model paths in checkpoints/model_configs.
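To confirm which paths need to be filled in, list and inspect the released config files directly:
# inspect the config files to see the expected keys before editing
ls checkpoints/model_configs
cat checkpoints/model_configs/model_path_1.3B.json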
Inference with a Single GPU
# The 1.3B model supports 480p only
python infer_multishot.py \
--test_csv_path "toy_cases/test_multishot.csv" \
--output_name "1.3B" \
--model_path_json "checkpoints/model_configs/model_path_1.3B.json" \
--target_width 832 \
--target_height 480
# The 14B model supports 480p and 720p (trained jointly on 480p and 720p data)
python infer_multishot.py \
--test_csv_path "toy_cases/test_multishot.csv" \
--output_name "14B_720" \
--model_path_json "checkpoints/model_configs/model_path_14B.json" \
--target_width 1280 \
--target_height 720
Inference with Multiple GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 infer_multishot.py \
--test_csv_path "toy_cases/test_multishot.csv" \
--output_name "14B_720" \
--model_path_json "checkpoints/model_configs/model_path_14B.json" \
--target_width 1280 \
--target_height 720 \
--use_usp True
Hints for Shot Arrangement
- Taking toy_cases/toy_captions/test_case_1.json as an example, users can define the subject's appearance, overall scene, and style in the global caption, and customize the content of each shot with per-shot captions (see the sketch after this list).
- Users can configure the frame count for each shot in the shot_groups field of toy_cases/test_multishot.csv. (Note: the training setting is ≤5 shots & ≤308 frames.)
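As a minimal sketch of a custom arrangement (the field names and the shot_groups format below are assumptions for illustration only; copy the real schema from toy_cases/toy_captions/test_case_1.json and toy_cases/test_multishot.csv):
# hypothetical caption file: field names are illustrative, not the repo's actual schema
cat > toy_cases/toy_captions/my_case.json <<'EOF'
{
  "global_caption": "A silver-haired traveler in a rainy neon-lit city, cinematic film style.",
  "shot_captions": [
    "Close-up: rain streaks the traveler's face under a streetlight.",
    "Wide shot: the traveler crosses a crowded intersection at night."
  ]
}
EOF
# hypothetical shot_groups value: three shots totaling <=308 frames,
# e.g. shot_groups = "77,77,154" in toy_cases/test_multishot.csv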
Inference with Customized Multi-Shot Prompts (with Recaption)
# Please set up the Gemini API at L37 of `infer_multishot_with_recaption_example.py` for recaptioning.
python infer_multishot_with_recaption_example.py \
--output_name "1.3B_customized_input" \
--model_path_json "checkpoints/model_configs/model_path_1.3B.json" \
--target_width 832 \
--target_height 480
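How the key is wired up depends on the script; a common pattern is an environment variable. The variable name below is an assumption (check L37 of the script for what it actually expects):
# assumption: variable name is illustrative; see L37 of the script for the actual setup
export GEMINI_API_KEY="your-api-key"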
Single-Node Training
# 1.3B model
bash train_1.3B_single_node.sh
# 14B model (we release only an example script for training with batch_size = 1 per GPU; to train the 14B model on longer multi-shot video data, you will need to implement sequence parallelism on top of our code)
bash train_14B_single_node.sh
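Since the released 14B example runs at batch_size = 1 per GPU, it can help to keep a log and watch memory headroom during the first steps:
# keep a log of the run, and watch GPU memory in a second terminal
bash train_14B_single_node.sh 2>&1 | tee train_14B.log
watch -n 1 nvidia-smi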
Multi-Node Distributed Training
# set IP address and Port of the master node in `train_1.3B_multi_node.sh`
bash train_1.3B_multi_node.sh 0 # (on the master node)
bash train_1.3B_multi_node.sh 1 # (on the first worker node)
...
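The exact variable names inside the script may differ; as a hypothetical sketch, the values to edit typically look like this:
# hypothetical: names illustrative, match whatever train_1.3B_multi_node.sh defines
MASTER_ADDR="10.0.0.1"   # IP address of the master node
MASTER_PORT=29500        # a free TCP port reachable from all worker nodes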
Multi-Shot Caption Annotation
# annotate multi-shot captions for your own video data
python multi_shot_caption_annotation.py --video_csv_path "toy_cases/data.csv"
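Before pointing the script at your own videos, inspect the toy CSV to see the expected columns:
# check the expected CSV header before substituting your own data
head -n 3 toy_cases/data.csv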
- DiffSynth-Studio: the codebase we built upon. Thanks for their wonderful work.
- Wan: the base model we built upon. Thanks for their wonderful work.
- CI-VID dataset: https://huggingface.co/datasets/BAAI/CI-VID
- Cine250K dataset: https://huggingface.co/datasets/NumlockUknowSth/Cine250K
Please leave us a star 🌟 and cite our paper if you find our work helpful.
@article{wang2025multishotmaster,
title={MultiShotMaster: A Controllable Multi-Shot Video Generation Framework},
author={Wang, Qinghe and Shi, Xiaoyu and Li, Baolu and Bian, Weikang and Liu, Quande and Lu, Huchuan and Wang, Xintao and Wan, Pengfei and Gai, Kun and Jia, Xu},
journal={arXiv preprint arXiv:2512.03041},
year={2025}
}
