demo_video.mp4
🔥 For more results, visit our homepage. 🔥
🙏🏻 If you find our work helpful, please consider giving us a ⭐ star.
2026/03/19: 🔥 We release the inference code and pretrained weights for the public Wan-based X-Dub release.
2025/12/31: 🔥 We release the paper and project homepage: paper | homepage.
This repository contains the official PyTorch implementation of X-Dub, introduced in From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping (formerly From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing).
Due to company policy, we cannot open-source the internal model used in the paper. This repository instead releases a public X-Dub (Wan-5B) version based on Wan2.2-TI2V-5B. Because of the different backbone, we do not use the LoRA tuning described in the paper; instead, we use multi-stage SFT in the public release to achieve a similar effect. In our experiments, X-Dub (Wan-5B) produces satisfying lip-synced results broadly aligned with the internal version X-Dub (internal-1B):
More qualitative results of X-Dub (Wan-5B)
result_01.mp4
result_02.mp4
result_03.mp4
result_04.mp4
Some differences remain in the current public release. Compared with the internal version, X-Dub (Wan-5B) shows the following practical differences:
- Better generalization to non-human characters such as cartoons, animated roles, and animals.
- Slightly weaker temporal stability, with occasional flickering.
- Slightly weaker subject consistency, including possible identity drift or color drift.
- Occasional severe noisy frames in a small portion of cases (~2%).
- Roughly 2× slower inference without acceleration strategies.
Some failure cases of X-Dub (Wan-5B)
failure_01.mp4
🏃 We are still trying to find the best implementation strategy, and will actively improve this repository. Quantitative comparisons between the public release and the internal version will be reported in future updates. If you have suggestions, please open an issue for discussion.
git clone https://github.com/KlingAIResearch/X-Dub.git
cd X-Dub
conda create -n x-dub python=3.10 -y
conda activate x-dub

Install Python dependencies:
pip install -r requirements.txt

Install OpenMMLab dependencies:
pip install chumpy==0.70 --no-build-isolation
pip install mmengine==0.10.7
pip install mmcv==2.1.0 --no-build-isolation
pip install mmdet==3.2.0
pip install mmpose==1.3.2

Install this repository (adapted from DiffSynth-Studio):
pip install -e . --no-deps

Download the released bundle directly to checkpoints/:
pip install -U "huggingface_hub[cli]"
hf download KlingTeam/X-Dub --local-dir ./checkpoints --repo-type model

Move the DWPose files into dwpose_tools/models/:
mkdir -p dwpose_tools/models
cp -r ./checkpoints/dwpose_tools/models/. ./dwpose_tools/models/
rm -rf ./checkpoints/dwpose_tools

After download, the expected layout is:
checkpoints/
├── X-Dub_model.safetensors
├── Wan2.2_VAE.safetensors
├── models_t5_umt5-xxl-enc-bf16.safetensors
├── umt5-xxl/
├── whisper/
│ └── large-v2.pt
└── wav2vec2-base-960h/
dwpose_tools/models/
├── yolox_l_8xb8-300e_coco.py
├── yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth
├── rtmw-x_8xb320-270e_cocktail14-384x288.py
└── rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth
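The layout above can be sanity-checked with a short script before running inference. The `EXPECTED` list below mirrors the tree exactly; the helper name `missing_files` is our own and not part of the repository:

```python
from pathlib import Path

# Relative paths mirroring the expected layout above. Entries without an
# extension (umt5-xxl/, wav2vec2-base-960h/) are directories; `exists()`
# accepts either files or directories.
EXPECTED = [
    "checkpoints/X-Dub_model.safetensors",
    "checkpoints/Wan2.2_VAE.safetensors",
    "checkpoints/models_t5_umt5-xxl-enc-bf16.safetensors",
    "checkpoints/umt5-xxl",
    "checkpoints/whisper/large-v2.pt",
    "checkpoints/wav2vec2-base-960h",
    "dwpose_tools/models/yolox_l_8xb8-300e_coco.py",
    "dwpose_tools/models/yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth",
    "dwpose_tools/models/rtmw-x_8xb320-270e_cocktail14-384x288.py",
    "dwpose_tools/models/rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth",
]

def missing_files(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root after the download and move steps; an empty "Missing" list means the bundle is laid out correctly.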
python infer_lip_sync_pipeline.py \
--video_path assets/examples/video.mp4 \
--audio_path assets/examples/audio.wav \
--ckpt_path checkpoints/X-Dub_model.safetensors \
--ref_cfg_scale 2.5 \
--audio_cfg_scale 10.0 \
--num_inference_steps 30 \
--output_dir ./results

The inference pipeline supports arbitrary-size input videos and performs online auto-cropping. The current version supports single-person videos only. The inference script will:
- crop the facial region
- run lip-sync generation on the cropped and resized video (512x512)
- map the generated result back to the original complete video
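At a high level, the crop → generate → paste-back loop for one frame can be sketched as below. This is a simplified stand-in, not the repository's implementation: the box coordinates, the `run_model` callback, and the nearest-neighbor resize replace the DWPose-based cropping and the actual diffusion model.

```python
import numpy as np

def resize_nn(img, h, w):
    """Nearest-neighbor resize to (h, w); a stand-in for a real resizer."""
    H, W = img.shape[:2]
    ys = (np.arange(h) * H // h).clip(0, H - 1)
    xs = (np.arange(w) * W // w).clip(0, W - 1)
    return img[ys][:, xs]

def paste_back(frame, box, run_model, size=512):
    """Crop `box`, run the model at size x size, paste the result back."""
    x0, y0, x1, y1 = box
    crop = frame[y0:y1, x0:x1]
    out = run_model(resize_nn(crop, size, size))   # lip-synced square patch
    frame = frame.copy()
    frame[y0:y1, x0:x1] = resize_nn(out, y1 - y0, x1 - x0)
    return frame
```

For example, `paste_back(frame, (x0, y0, x1, y1), model_fn)` leaves pixels outside the facial box untouched while replacing the box contents with the generated patch, which is why only the cropped region can show artifacts from the generation step.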
Current cropping limitations
For ease of use, this repository uses DWPose to estimate facial landmarks for cropping. This differs from the more complex offline FLAME-mesh-based cropping pipeline used in the paper.
The current online strategy may introduce visible jitter and may fail to follow the face reliably when the head moves rapidly. The current release also does not support target tracking in multi-person scenes.
🏃 We plan to improve the cropping strategy and add better multi-person support in future updates.
- `ref_cfg_scale` and `audio_cfg_scale` control the balance between reference appearance fidelity and audio-driven mouth motion. Different cases may prefer slightly different values.
- We recommend setting `num_inference_steps` in the range of 25-50. Higher values increase runtime and may improve quality, but this has not been exhaustively evaluated yet.
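Since good `ref_cfg_scale` values are case-dependent, a small sweep can help pick one per clip. The sketch below reuses the example command from above and only prints the commands (a dry run); remove the `echo` to execute them:

```shell
# Dry-run sweep over ref_cfg_scale; remove `echo` to actually run each command.
sweep_ref_cfg() {
  for scale in 2.0 2.5 3.0; do
    echo python infer_lip_sync_pipeline.py \
      --video_path assets/examples/video.mp4 \
      --audio_path assets/examples/audio.wav \
      --ckpt_path checkpoints/X-Dub_model.safetensors \
      --ref_cfg_scale "$scale" \
      --audio_cfg_scale 10.0 \
      --num_inference_steps 30 \
      --output_dir "./results/ref_cfg_${scale}"
  done
}

sweep_ref_cfg
```

Each run writes to its own output directory, so the results for the different scales can be compared side by side.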
- Report quantitative comparisons between the public version and the paper version.
- Support multi-person video dubbing.
- Improve the cropping pipeline.
- Inference acceleration.
This work can be misused for identity impersonation or deceptive synthetic media. We support clear labeling of AI-generated content and encourage further work on reliable detection methods. All models and materials in this repository are intended for academic research and technical demonstration only.
If you have questions, please contact: hexu18@mails.tsinghua.edu.cn
We thank Wan2.2 for the open-source model backbone, and DiffSynth-Studio for the training and inference framework.
@misc{he2025inpaintingeditingselfbootstrappingframework,
title={From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing},
author={Xu He and Haoxian Zhang and Hejia Chen and Changyuan Zheng and Liyang Chen and Songlin Tang and Jiehui Huang and Xiaoqiang Liu and Pengfei Wan and Zhiyong Wu},
year={2025},
eprint={2512.25066},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.25066},
}