An efficient transfer paradigm enabling high-quality, end-to-end pixel-space diffusion with minimal computational overhead and data requirements.
- [2026.05.12] Technical report released.
- [2026.05.22] 1K-resolution training code, inference code, weights, and dataset released.
- [2026.05.23] Online demo. (Thanks to multimodalart for the support!)
| Status | Item |
|---|---|
| ✅ | 1K inference code & weights |
| ✅ | Training code |
| 🛠️ | 4K/8K/10K UHR generation |
| 🛠️ | Compatibility with more LDM models |
git clone https://github.com/TencentYoutuResearch/T2I-L2P.git
cd T2I-L2P
pip install -e .

Checkpoint:
| Model | Params | HuggingFace |
|---|---|---|
| L2P-z-image (1K resolution) | 6B | 🤗 |
import torch
from diffsynth.pipelines.z_image_L2P import ZImagePipeline, ModelConfig
main_model_path = "/path/model-1k-merge.safetensors"
text_encoder_paths = [
    "/path/Z-Image-Turbo/text_encoder/model-00001-of-00003.safetensors",
    "/path/Z-Image-Turbo/text_encoder/model-00002-of-00003.safetensors",
    "/path/Z-Image-Turbo/text_encoder/model-00003-of-00003.safetensors",
]
tokenizer_path = "/path/Z-Image-Turbo/tokenizer"
pipe = ZImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(path=[main_model_path]),
        ModelConfig(path=text_encoder_paths),
    ],
    tokenizer_config=ModelConfig(path=tokenizer_path),
)
prompt = "an origami pig on fire in the middle of a dark room with a pentagram on the floor"
image = pipe(
    prompt=prompt,
    seed=42,
    rand_device="cuda",
    num_inference_steps=30,
    cfg_scale=2.0,
    height=1024,
    width=1024,
)
image.save("example.png")

First, install Gradio:
pip install gradio

Launch a multi-GPU web UI:
python app.py

The demo auto-detects free GPUs, dispatches each request to an idle device, and exposes a Gradio interface at http://0.0.0.0:23231.
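The dispatch behaviour described above can be illustrated with a small idle-device pool. This is only a sketch of the general pattern, not the code in `app.py`: the `DevicePool` class and the `fake_generate` stand-in are hypothetical names introduced here, and the device ids are hard-coded rather than auto-detected.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class DevicePool:
    """Hands out idle device ids; a caller blocks until one is free."""

    def __init__(self, device_ids):
        self._idle = queue.Queue()
        for dev in device_ids:
            self._idle.put(dev)

    def run(self, job, *args):
        dev = self._idle.get()       # blocks while every device is busy
        try:
            return job(dev, *args)
        finally:
            self._idle.put(dev)      # mark the device idle again


def fake_generate(device, prompt):
    # Hypothetical stand-in for a real pipeline call pinned to `device`.
    return f"{prompt} @ cuda:{device}"


pool = DevicePool([0, 1])            # assumed: two free GPUs
prompts = ["a", "b", "c", "d"]
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(lambda p: pool.run(fake_generate, p), prompts))
```

Because `queue.Queue` is thread-safe, no extra locking is needed: four concurrent requests share the two devices, and each device returns to the pool even if a job raises.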
The full training pipeline consists of four steps: (1) prepare the Z-Image base weights → (2) convert them into a pixel-space initialization → (3) launch training → (4) merge the trained delta back with the pixel-init weights for inference.
Download the official Z-Image-Turbo checkpoint from Hugging Face:
Convert the latent-space DiT weights into a pixel-space initialization that L2P can fine-tune from:
python examples/z_image/L2P_convert_weight.py \
    --latent_ckpt_files \
        /path/to/Z-Image-Turbo/transformer/diffusion_pytorch_model-00001-of-00003.safetensors \
        /path/to/Z-Image-Turbo/transformer/diffusion_pytorch_model-00002-of-00003.safetensors \
        /path/to/Z-Image-Turbo/transformer/diffusion_pytorch_model-00003-of-00003.safetensors \
    --output_path ./pretrain_weight/Z-Image-Pixel-Init/diffusion_pytorch_model.safetensors

Standard training:
bash train_run.sh

Low-VRAM training (single GPU < 24 GB VRAM):
bash train_run_low_VRAM.sh

Provide a directory of images plus a CSV metadata file:
data/
├── images/ # raw image folder
└── metadata.csv # columns: file_name, text, ...
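
Before launching training, it can help to check that the metadata and the image folder agree. The following is a minimal sketch, not part of the repository: `find_dataset_problems` is a hypothetical helper, and it assumes that `file_name` is a path relative to the `images/` folder, which the layout above does not state explicitly.

```python
import csv
from pathlib import Path


def find_dataset_problems(root, image_dir="images"):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    with open(root / "metadata.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = {"file_name", "text"} - set(reader.fieldnames or [])
        if missing:
            return [f"metadata.csv is missing columns: {sorted(missing)}"]
        # Row 1 is the header, so data rows start at 2.
        for row_number, row in enumerate(reader, start=2):
            name = (row["file_name"] or "").strip()
            if not (row["text"] or "").strip():
                problems.append(f"row {row_number}: empty text")
            # Assumption: file_name is relative to the images/ folder.
            if not name or not (root / image_dir / name).is_file():
                problems.append(f"row {row_number}: image not found: {name}")
    return problems
```

Calling `find_dataset_problems("data")` returns an empty list when every row has a caption and a matching image file. If `file_name` already includes the `images/` prefix in your CSV, pass `image_dir=""`.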
python merge_weights.py \
    --file_a ./models/train/L2P_Standard/step-xxx.safetensors \
    --file_b ./pretrain_weight/Z-Image-Pixel-Init/diffusion_pytorch_model.safetensors \
    --file_out ./models/train/L2P_Standard/model-merge.safetensors

- `--file_a`: trained checkpoint from Step 3
- `--file_b`: pixel-init weights from Step 2
- `--file_out`: merged single-file weight
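
To make the merge step concrete, here is a rough sketch of one plausible merge rule. It is an assumption, not a description of `merge_weights.py`: it supposes the trained checkpoint stores only the parameters updated during training, so merging reduces to a key-wise override of the pixel-init weights. The actual script may combine the files differently (for example, by adding a delta). The function name and the parameter names in the example are hypothetical, and plain lists stand in for tensors.

```python
def merge_state_dicts(trained, base):
    """Return a copy of `base` in which every key in `trained` is overridden."""
    unexpected = sorted(set(trained) - set(base))
    if unexpected:
        # Fail early instead of silently writing keys the model does not expect.
        raise KeyError(f"keys not present in base weights: {unexpected[:5]}")
    merged = dict(base)      # shallow copy, so `base` itself is left untouched
    merged.update(trained)   # trained entries win
    return merged


# Hypothetical parameter names, with lists standing in for tensors.
base = {"patch_embed.weight": [0.0, 0.0], "blocks.0.attn.weight": [1.0, 1.0]}
trained = {"patch_embed.weight": [0.5, -0.5]}
merged = merge_state_dicts(trained, base)
```

With real checkpoints, the two dictionaries could be read with `safetensors.torch.load_file` and the result written with `safetensors.torch.save_file`.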
If you find this work useful, please consider citing:
@article{chen2026l2p,
title = {L2P: Unlocking Latent Potential for Pixel Generation},
author = {Chen, Zhennan and Zhu, Junwei and Chen, Xu and Zhang, Jiangning and
Chen, Jiawei and Zeng, Zhuoqi and Zhang, Wei and Wang, Chengjie and
Yang, Jian and Tai, Ying},
journal = {arXiv preprint arXiv:2605.12013},
year = {2026}
}
@article{chen2025dip,
title = {DiP: Taming Diffusion Models in Pixel Space},
author = {Chen, Zhennan and Zhu, Junwei and Chen, Xu and Zhang, Jiangning and
Hu, Xiaobin and Zhao, Hanzhen and Wang, Chengjie and Yang, Jian and
Tai, Ying},
journal = {arXiv preprint arXiv:2511.18822},
year = {2025}
}

L2P is built upon the excellent open-source work of DiffSynth-Studio and Z-Image.