
From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

1 Tsinghua University   2 Kling Team, Kuaishou Technology   3 Beihang University   4 HKUST   5 CUHK
* Work done at Kling Team, Kuaishou Technology  Project leader  Corresponding author

arXiv  project homepage  GitHub stars

demo_video.mp4

🔥 For more results, visit our homepage. 🔥

🙏🏻 If you find our work helpful, please consider giving us a ⭐ star.

🔥 Updates

📖 Introduction

This repository contains the official PyTorch implementation of X-Dub, introduced in From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping (formerly From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing).

Due to company policy, we cannot open-source the internal model used in the paper. This repository instead releases a public X-Dub (Wan-5B) version based on Wan2.2-TI2V-5B. Because of the different backbone, we do not use the LoRA tuning described in the paper; instead, the public release uses multi-stage SFT to achieve a similar effect. In our experiments, X-Dub (Wan-5B) produces satisfactory lip-synced results that are broadly aligned with the internal version X-Dub (internal-1B):

More qualitative results of X-Dub (Wan-5B)
result_01.mp4
result_02.mp4
result_03.mp4
result_04.mp4

Compared with the internal version X-Dub (internal-1B), the current public release X-Dub (Wan-5B) shows the following practical differences:

  • Better generalization to non-human characters such as cartoons, animated roles, and animals.
  • Slightly weaker temporal stability, with occasional flickering.
  • Slightly weaker subject consistency, including possible identity drift or color drift.
  • Occasional severe noisy frames in a small portion of cases (~2%).
  • Roughly 2× slower inference without acceleration strategies.
Some failure cases of X-Dub (Wan-5B)
failure_01.mp4

🏃 We are still trying to find the best implementation strategy, and will actively improve this repository. Quantitative comparisons between the public release and the internal version will be reported in future updates. If you have suggestions, please open an issue for discussion.

🏁 Getting Started

⚠️ Inference typically requires ~21 GB of VRAM.

1. 🛠️ Clone the code and prepare the environment

git clone https://github.com/KlingAIResearch/X-Dub.git
cd X-Dub

conda create -n x-dub python=3.10 -y
conda activate x-dub

Install Python dependencies:

pip install -r requirements.txt

Install OpenMMLab dependencies:

pip install chumpy==0.70 --no-build-isolation
pip install mmengine==0.10.7
pip install mmcv==2.1.0 --no-build-isolation
pip install mmdet==3.2.0
pip install mmpose==1.3.2

Install this repository (adapted from DiffSynth-Studio):

pip install -e . --no-deps

2. 📥 Download pretrained weights

Download the released bundle directly to checkpoints/:

pip install -U "huggingface_hub[cli]"
hf download KlingTeam/X-Dub --local-dir ./checkpoints --repo-type model

Move the DWpose files into dwpose_tools/models/:

mkdir -p dwpose_tools/models
cp -r ./checkpoints/dwpose_tools/models/. ./dwpose_tools/models/
rm -rf ./checkpoints/dwpose_tools

After download, the expected layout is:

checkpoints/
├── X-Dub_model.safetensors
├── Wan2.2_VAE.safetensors
├── models_t5_umt5-xxl-enc-bf16.safetensors
├── umt5-xxl/
├── whisper/
│   └── large-v2.pt
└── wav2vec2-base-960h/

dwpose_tools/models/
├── yolox_l_8xb8-300e_coco.py
├── yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth
├── rtmw-x_8xb320-270e_cocktail14-384x288.py
└── rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth
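To confirm the download completed, a small standalone script like the following (not part of the repository; file paths taken from the layout above) can check for the key files:

```python
from pathlib import Path

# Key files expected after downloading the bundle (from the layout above).
EXPECTED = [
    "checkpoints/X-Dub_model.safetensors",
    "checkpoints/Wan2.2_VAE.safetensors",
    "checkpoints/models_t5_umt5-xxl-enc-bf16.safetensors",
    "checkpoints/whisper/large-v2.pt",
    "dwpose_tools/models/yolox_l_8xb8-300e_coco.py",
    "dwpose_tools/models/rtmw-x_8xb320-270e_cocktail14-384x288.py",
]

def missing_files(root="."):
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).is_file()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for p in missing:
            print("  -", p)
    else:
        print("All expected files are present.")
```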

3. 🚀 Inference

python infer_lip_sync_pipeline.py \
  --video_path assets/examples/video.mp4 \
  --audio_path assets/examples/audio.wav \
  --ckpt_path checkpoints/X-Dub_model.safetensors \
  --ref_cfg_scale 2.5 \
  --audio_cfg_scale 10.0 \
  --num_inference_steps 30 \
  --output_dir ./results

📢 Input Video Auto-Cropping

The inference pipeline supports arbitrary-size input videos and performs online auto-cropping. The current version supports single-person videos only. The inference script will:

  • crop the facial region
  • run lip-sync generation on the cropped and resized video (512x512)
  • map the generated result back to the original complete video
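To illustrate the geometry of these steps, here is a minimal sketch (an assumption for illustration, not the repository's actual code) of deriving a square crop box from facial landmarks and mapping a coordinate in the 512x512 generated crop back into the original frame:

```python
def square_crop_box(landmarks, frame_w, frame_h, margin=0.6):
    """Square box around facial landmarks, expanded by `margin` and
    clamped to the frame. `landmarks` is a list of (x, y) points."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half = max(max(xs) - min(xs), max(ys) - min(ys)) * (1 + margin) / 2
    x0 = max(0, int(cx - half))
    y0 = max(0, int(cy - half))
    x1 = min(frame_w, int(cx + half))
    y1 = min(frame_h, int(cy + half))
    return x0, y0, x1, y1

def to_original(x, y, box, gen_size=512):
    """Map a pixel coordinate in the resized generated crop back to the
    original frame, given the crop box it was resized from."""
    x0, y0, x1, y1 = box
    return x0 + x * (x1 - x0) / gen_size, y0 + y * (y1 - y0) / gen_size
```

The margin of 0.6 is a hypothetical value; in practice the crop needs enough context around the face for the model to blend the edited region seamlessly.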
Current cropping limitations

For ease of use, this repository uses DWPose to estimate facial landmarks for cropping. This differs from the more complex offline FLAME-mesh-based cropping pipeline used in the paper.

The current online strategy may introduce visible jitter and may fail to follow the face reliably when the head moves rapidly. The current release also does not support target tracking in multi-person scenes.
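A common mitigation for this kind of jitter (a sketch of one standard technique, not the repository's implementation) is to low-pass the per-frame crop box with an exponential moving average before cropping:

```python
def smooth_boxes(boxes, alpha=0.3):
    """Exponentially smooth a sequence of (x0, y0, x1, y1) crop boxes.
    Smaller `alpha` follows the raw detections more slowly, trading
    responsiveness to fast head motion for less frame-to-frame jitter."""
    smoothed = []
    state = None
    for box in boxes:
        if state is None:
            state = list(box)
        else:
            state = [alpha * b + (1 - alpha) * s for b, s in zip(box, state)]
        smoothed.append(tuple(state))
    return smoothed
```

Smoothing alone cannot fix fast head motion, since the filtered box lags behind the true face position; that is exactly the trade-off noted above.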

🏃 We plan to improve the cropping strategy and add better multi-person support in future updates.

💡 Practical Hints

  • ref_cfg_scale and audio_cfg_scale control the balance between reference appearance fidelity and audio-driven mouth motion. Different cases may prefer slightly different values.
  • We recommend setting num_inference_steps in the range of 25-50. Higher values increase runtime and may improve quality, but this has not been exhaustively evaluated yet.
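For intuition only: one common way to combine two conditioning signals is nested classifier-free guidance, sketched below over three hypothetical denoiser predictions. This is an assumption about how `ref_cfg_scale` and `audio_cfg_scale` might interact, not the actual formulation used by the inference script:

```python
def dual_cfg(eps_uncond, eps_ref, eps_ref_audio, ref_scale=2.5, audio_scale=10.0):
    """Nested classifier-free guidance (an illustrative assumption) over three
    denoiser predictions: unconditional, reference-conditioned, and
    reference+audio-conditioned. Both scales at 1.0 reduce to the fully
    conditioned prediction; raising a scale amplifies that condition's effect."""
    return [
        u + ref_scale * (r - u) + audio_scale * (ra - r)
        for u, r, ra in zip(eps_uncond, eps_ref, eps_ref_audio)
    ]
```

Under this reading, raising `audio_cfg_scale` pushes the output further toward the audio-conditioned prediction (stronger mouth motion), while raising `ref_cfg_scale` pushes it toward the reference-conditioned one (stronger appearance fidelity).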

📝 TODO

  • Report quantitative comparisons between the public version and the paper version.
  • Support multi-person video dubbing.
  • Improve the cropping pipeline.
  • Accelerate inference.

⚖️ Ethical Considerations

This work can be misused for identity impersonation or deceptive synthetic media. We support clear labeling of AI-generated content and encourage further work on reliable detection methods. All models and materials in this repository are intended for academic research and technical demonstration only.

If you have questions, please contact: hexu18@mails.tsinghua.edu.cn

🙏 Acknowledgments

We thank Wan2.2 for the open-source model backbone, and DiffSynth-Studio for the training and inference framework.

🔖 Citation

@misc{he2025inpaintingeditingselfbootstrappingframework,
      title={From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing}, 
      author={Xu He and Haoxian Zhang and Hejia Chen and Changyuan Zheng and Liyang Chen and Songlin Tang and Jiehui Huang and Xiaoqiang Liu and Pengfei Wan and Zhiyong Wu},
      year={2025},
      eprint={2512.25066},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.25066}, 
}
