demo_video.mp4
🔥 For more results, visit our homepage. 🔥
🙏🏻 If you find our work helpful, please consider giving us a ⭐ star.
2026/03/19: 🔥 We release the inference code and pretrained weights for the public Wan-based X-Dub release.
2025/12/31: 🔥 We release the paper and project homepage: paper | homepage.
This repository contains the official PyTorch implementation of X-Dub, introduced in From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping (formerly From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing).
Due to company policy, we cannot open-source the internal model used in the paper. This repository instead releases a public X-Dub (Wan-5B) version based on Wan2.2-TI2V-5B. Because of the different backbone, we do not use the LoRA tuning described in the paper; instead, we use multi-stage SFT in the public release to achieve a similar effect. In our experiments, X-Dub (Wan-5B) produces satisfying lip-synced results broadly aligned with the internal version X-Dub (internal-1B):
More qualitative results of X-Dub (Wan-5B)
result_01.mp4
result_02.mp4
result_03.mp4
result_04.mp4
Some differences remain in the current public release. Compared with the internal version, X-Dub (Wan-5B) shows the following practical differences:
- Better generalization to non-human characters such as cartoons, animated roles, and animals.
- Slightly weaker temporal stability, with occasional flickering.
- Slightly weaker subject consistency, including possible identity drift or color drift.
- Occasional severe noisy frames in a small portion of cases (~2%).
- Roughly 2× slower inference without acceleration strategies.
Some failure cases of X-Dub (Wan-5B)
failure_01.mp4
🏃 We are still trying to find the best implementation strategy, and will actively improve this repository. Quantitative comparisons between the public release and the internal version will be reported in future updates. If you have suggestions, please open an issue for discussion.
git clone https://github.com/KlingAIResearch/X-Dub.git
cd X-Dub
conda create -n x-dub python=3.10 -y
conda activate x-dub

Install Python dependencies:
pip install -r requirements.txt

Install OpenMMLab dependencies:
pip install chumpy==0.70 --no-build-isolation
pip install mmengine==0.10.7
pip install mmcv==2.1.0 --no-build-isolation
pip install mmdet==3.2.0
pip install mmpose==1.3.2

Install this repository (adapted from DiffSynth-Studio):
pip install -e . --no-deps

Download the released bundle directly to checkpoints/:
pip install -U "huggingface_hub[cli]"
hf download KlingTeam/X-Dub --local-dir ./checkpoints --repo-type model

Move the DWPose files into dwpose_tools/models/:
mkdir -p dwpose_tools/models
cp -r ./checkpoints/dwpose_tools/models/. ./dwpose_tools/models/
rm -rf ./checkpoints/dwpose_tools

After download, the expected layout is:
checkpoints/
├── X-Dub_model.safetensors
├── Wan2.2_VAE.safetensors
├── models_t5_umt5-xxl-enc-bf16.safetensors
├── umt5-xxl/
├── whisper/
│ └── large-v2.pt
└── wav2vec2-base-960h/
dwpose_tools/models/
├── yolox_l_8xb8-300e_coco.py
├── yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth
├── rtmw-x_8xb320-270e_cocktail14-384x288.py
└── rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth
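The layout above can be sanity-checked with a short script before running inference. The `EXPECTED` list below mirrors the tree exactly; the helper name `missing_files` is our own and not part of the repository:

```python
from pathlib import Path

# Relative paths mirroring the expected layout above. Entries without an
# extension (umt5-xxl/, wav2vec2-base-960h/) are directories; `exists()`
# accepts either files or directories.
EXPECTED = [
    "checkpoints/X-Dub_model.safetensors",
    "checkpoints/Wan2.2_VAE.safetensors",
    "checkpoints/models_t5_umt5-xxl-enc-bf16.safetensors",
    "checkpoints/umt5-xxl",
    "checkpoints/whisper/large-v2.pt",
    "checkpoints/wav2vec2-base-960h",
    "dwpose_tools/models/yolox_l_8xb8-300e_coco.py",
    "dwpose_tools/models/yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth",
    "dwpose_tools/models/rtmw-x_8xb320-270e_cocktail14-384x288.py",
    "dwpose_tools/models/rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth",
]

def missing_files(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root after the download and move steps; an empty "Missing" list means the bundle is laid out correctly.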
python infer_lip_sync_pipeline.py \
--video_path assets/examples/video.mp4 \
--audio_path assets/examples/audio.wav \
--ckpt_path checkpoints/X-Dub_model.safetensors \
--ref_cfg_scale 2.5 \
--audio_cfg_scale 10.0 \
--num_inference_steps 30 \
--output_dir ./results

The inference pipeline supports arbitrary-size input videos and performs online auto-cropping. The current version supports single-person videos only. The inference script will:
- crop the facial region
- run lip-sync generation on the cropped and resized video (512x512)
- map the generated result back to the original complete video
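At a high level, the crop → generate → paste-back loop for one frame can be sketched as below. This is a simplified stand-in, not the repository's implementation: the box coordinates, the `run_model` callback, and the nearest-neighbor resize replace the DWPose-based cropping and the actual diffusion model.

```python
import numpy as np

def resize_nn(img, h, w):
    """Nearest-neighbor resize to (h, w); a stand-in for a real resizer."""
    H, W = img.shape[:2]
    ys = (np.arange(h) * H // h).clip(0, H - 1)
    xs = (np.arange(w) * W // w).clip(0, W - 1)
    return img[ys][:, xs]

def paste_back(frame, box, run_model, size=512):
    """Crop `box`, run the model at size x size, paste the result back."""
    x0, y0, x1, y1 = box
    crop = frame[y0:y1, x0:x1]
    out = run_model(resize_nn(crop, size, size))   # lip-synced square patch
    frame = frame.copy()
    frame[y0:y1, x0:x1] = resize_nn(out, y1 - y0, x1 - x0)
    return frame
```

For example, `paste_back(frame, (x0, y0, x1, y1), model_fn)` leaves pixels outside the facial box untouched while replacing the box contents with the generated patch, which is why only the cropped region can show artifacts from the generation step.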
Current cropping limitations
For ease of use, this repository uses DWPose to estimate facial landmarks for cropping. This differs from the more complex offline FLAME-mesh-based cropping pipeline used in the paper.
The current online strategy may introduce visible jitter and may fail to follow the face reliably when the head moves rapidly. The current release also does not support target tracking in multi-person scenes.
🏃 We plan to improve the cropping strategy and add better multi-person support in future updates.
- `ref_cfg_scale` and `audio_cfg_scale` control the balance between reference appearance fidelity and audio-driven mouth motion. Different cases may prefer slightly different values.
- We recommend setting `num_inference_steps` in the range of 25-50. Higher values increase runtime and may improve quality, but this has not been exhaustively evaluated yet.
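Since good `ref_cfg_scale` values are case-dependent, a small sweep can help pick one per clip. The sketch below reuses the example command from above and only prints the commands (a dry run); remove the `echo` to execute them:

```shell
# Dry-run sweep over ref_cfg_scale; remove `echo` to actually run each command.
sweep_ref_cfg() {
  for scale in 2.0 2.5 3.0; do
    echo python infer_lip_sync_pipeline.py \
      --video_path assets/examples/video.mp4 \
      --audio_path assets/examples/audio.wav \
      --ckpt_path checkpoints/X-Dub_model.safetensors \
      --ref_cfg_scale "$scale" \
      --audio_cfg_scale 10.0 \
      --num_inference_steps 30 \
      --output_dir "./results/ref_cfg_${scale}"
  done
}

sweep_ref_cfg
```

Each run writes to its own output directory, so the results for the different scales can be compared side by side.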
- Report quantitative comparisons between the public version and the paper version.
- Support multi-person video dubbing.
- Improve the cropping pipeline.
- Inference acceleration.
This work can be misused for identity impersonation or deceptive synthetic media. We support clear labeling of AI-generated content and encourage further work on reliable detection methods. All models and materials in this repository are intended for academic research and technical demonstration only.
If you have questions, please contact: hexu18@mails.tsinghua.edu.cn
We thank Wan2.2 for the open-source model backbone, and DiffSynth-Studio for the training and inference framework.
@misc{he2025inpaintingeditingselfbootstrappingframework,
title={From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing},
author={Xu He and Haoxian Zhang and Hejia Chen and Changyuan Zheng and Liyang Chen and Songlin Tang and Jiehui Huang and Xiaoqiang Liu and Pengfei Wan and Zhiyong Wu},
year={2025},
eprint={2512.25066},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.25066},
}