
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Qinghe Wang¹, Xiaoyu Shi²✉, Baolu Li¹, Weikang Bian³, Quande Liu², Huchuan Lu¹, Xintao Wang², Pengfei Wan², Kun Gai², Xu Jia¹✉

¹Dalian University of Technology    ²Kling Team, Kuaishou Technology    ³The Chinese University of Hong Kong    ✉ Corresponding author

Note: The open-source version is based on Wan2.1-T2V-1.3B and Wan2.1-T2V-14B.

🔥 Updates

📌 TL;DR

MultiShotMaster is a controllable multi-shot narrative video generation framework that supports 1) text-driven inter-shot consistency, 2) variable shot counts and durations, 3) customized subjects with motion control, and 4) background-driven customized scenes.

📑 Open-Source Plan

  • Code & Models for Multi-Shot & Multi-Reference Generation
  • Code & Models for Multi-Shot Generation

🛠️ Installation

Environment

  • Create a conda environment and install dependencies:
git clone https://github.com/KlingTeam/MultiShotMaster
cd MultiShotMaster
conda create -n MultiShotMaster python=3.12 -y
conda activate MultiShotMaster
pip install -e .
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
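
As an optional sanity check (a minimal sketch, not part of the official instructions), you can verify that PyTorch sees a GPU and that flash-attn imports cleanly:

# Optional environment check; run inside the MultiShotMaster conda env.
import torch
import flash_attn  # installed above with `pip install flash-attn`

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)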

Model Checkpoints

  • Download Checkpoints using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download KlingTeam/MultiShotMaster --local-dir checkpoints

# or using git:
git lfs install
git clone https://huggingface.co/KlingTeam/MultiShotMaster

🔑 Inference

Inference with a Single GPU

# The 1.3B model supports 480p only
python infer_multishot.py \
    --test_csv_path "toy_cases/test_multishot.csv" \
    --output_name "1.3B" \
    --model_path_json "checkpoints/model_configs/model_path_1.3B.json" \
    --target_width 832 \
    --target_height 480    

# The 14B model supports 480p and 720p (trained jointly on 480p and 720p data)
python infer_multishot.py \
    --test_csv_path "toy_cases/test_multishot.csv" \
    --output_name "14B_720" \
    --model_path_json "checkpoints/model_configs/model_path_14B.json" \
    --target_width 1280 \
    --target_height 720

Inference with Multiple GPUs

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 infer_multishot.py \
    --test_csv_path "toy_cases/test_multishot.csv" \
    --output_name "14B_720" \
    --model_path_json "checkpoints/model_configs/model_path_14B.json" \
    --target_width 1280 \
    --target_height 720 \
    --use_usp True

Hints for Shot Arrangement

  1. Taking toy_cases/toy_captions/test_case_1.json as an example, users can define the subject's appearance, overall scene, and style in the global caption, and customize the content of each shot with per-shot captions.
  2. Users can configure the frame count of each shot in the shot_groups field of toy_cases/test_multishot.csv; a sketch follows this list. (Note: the training setting is ≤5 shots and ≤308 frames.)
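
Below is a minimal, hypothetical sketch of how these two inputs could be assembled programmatically. Every field name except shot_groups (e.g. global_caption, shot_captions, caption_path) is an illustrative assumption rather than the repo's documented schema; treat toy_cases/toy_captions/test_case_1.json and toy_cases/test_multishot.csv as the authoritative references.

# Illustrative sketch only: every key except `shot_groups` is an assumption.
# Compare against the files under toy_cases/ for the real schema.
import csv
import json

captions = {
    "global_caption": "A red-haired woman in a sunlit cafe, warm cinematic style.",
    "shot_captions": [
        "Wide shot: she pushes open the cafe door and steps inside.",
        "Close-up: she sips her coffee and glances out the window.",
        "Medium shot: she smiles and waves at a friend outside.",
    ],
}
with open("toy_cases/toy_captions/my_case.json", "w") as f:
    json.dump(captions, f, indent=2)

# One row per test case; `shot_groups` holds the per-shot frame counts.
# 3 shots x 77 frames = 231 frames, within the ≤5 shots / ≤308 frames
# training setting mentioned above.
row = {
    "caption_path": "toy_cases/toy_captions/my_case.json",  # assumed column name
    "shot_groups": "77,77,77",  # assumed encoding of per-shot frame counts
}
with open("toy_cases/my_multishot.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row.keys()))
    writer.writeheader()
    writer.writerow(row)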

Inference with Customized Multi-Shot Prompts (with Recaption)

# Please set up the Gemini API on L37 in `infer_multishot_with_recaption_example.py` for recaption.
python infer_multishot_with_recaption_example.py \
    --output_name "1.3B_customized_input" \
    --model_path_json "checkpoints/model_configs/model_path_1.3B.json" \
    --target_width 832 \
    --target_height 480   

⚙️ Training

Single-Node Training:

# 1.3B model
bash train_1.3B_single_node.sh

# 14B model (we only release example code for training the 14B model with batch_size = 1 per GPU; to train it on longer multi-shot video data, you will need to implement sequence parallelism on top of our code)
bash train_14B_single_node.sh

Multi-Node Distributed Training:

# Set the IP address and port of the master node in `train_1.3B_multi_node.sh`
bash train_1.3B_multi_node.sh 0  # (on the master node)
bash train_1.3B_multi_node.sh 1  # (on the first worker node)
...

📏 Multi-Shot Caption Annotation

python multi_shot_caption_annotation.py --video_csv_path "toy_cases/data.csv"

🤗 Acknowledgement

  • DiffSynth-Studio: the codebase we built upon. Thanks for their wonderful work.
  • Wan: the base model we built upon. Thanks for their wonderful work.

🧱 Open-Sourced Multi-Shot Data

🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

@article{wang2025multishotmaster,
  title={MultiShotMaster: A Controllable Multi-Shot Video Generation Framework},
  author={Wang, Qinghe and Shi, Xiaoyu and Li, Baolu and Bian, Weikang and Liu, Quande and Lu, Huchuan and Wang, Xintao and Wan, Pengfei and Gai, Kun and Jia, Xu},
  journal={arXiv preprint arXiv:2512.03041},
  year={2025}
}
