Yuheng Chen1,*  Teng Hu1,*  Yuji Wang1  Qingdong He2  Lizhuang Ma1,†  Jiangning Zhang3,‡
1Shanghai Jiao Tong University  2University of Electronic Science and Technology of China  3Zhejiang University
*Equal contribution  †Corresponding author  ‡Project lead
ST-DRC is a spatial-temporal decoupled reference conditioning framework for identity-preserving text-to-video generation. Given a reference face and a textual prompt, the goal is to synthesize a high-fidelity video that follows the requested action, scene, and temporal dynamics while preserving the reference identity across frames.
The core idea is to treat the reference image as a latent in-context identity memory: ST-DRC encodes the reference image with the video VAE and concatenates it with noisy video latents, allowing the diffusion transformer to access fine-grained identity cues through its native spatio-temporal attention. To reduce appearance copy-paste from the reference image, ST-DRC introduces Temporal-Adjacent Spatial-Shifted RoPE (TASS-RoPE), reference-robust identity enhancement, and decoupled text-reference guidance for controllable inference.
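The position layout behind this idea can be illustrated with a small sketch. It is not the released implementation: the function names, the choice of temporal index `-1` for the reference frame, and the default spatial shift of one full grid height and width are all assumptions made for illustration. The sketch only shows how reference tokens could sit next to the video in time while occupying spatial coordinates that no video token uses.

```python
from typing import List, Optional, Tuple

# (t, h, w) index that a 3D rotary position embedding would consume.
Pos = Tuple[int, int, int]


def video_positions(frames: int, height: int, width: int) -> List[Pos]:
    """Regular grid positions for the noisy video latents."""
    return [
        (t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def reference_positions(
    height: int,
    width: int,
    t_ref: int = -1,                 # assumption: adjacent to the first video frame
    shift_h: Optional[int] = None,   # assumption: defaults to one grid height
    shift_w: Optional[int] = None,   # assumption: defaults to one grid width
) -> List[Pos]:
    """Positions for reference tokens: temporally adjacent, spatially shifted."""
    shift_h = height if shift_h is None else shift_h
    shift_w = width if shift_w is None else shift_w
    return [
        (t_ref, h + shift_h, w + shift_w)
        for h in range(height)
        for w in range(width)
    ]


def in_context_positions(frames: int, height: int, width: int) -> List[Pos]:
    """Reference tokens followed by video tokens, as one concatenated sequence."""
    return reference_positions(height, width) + video_positions(frames, height, width)


if __name__ == "__main__":
    positions = in_context_positions(frames=2, height=2, width=3)
    print(len(positions), positions[0], positions[6])
```

With the default shift, no reference token shares an `(h, w)` coordinate with any video token, which is the property the sketch uses to stand in for "spatially shifted". The actual offsets and temporal index used by TASS-RoPE may differ.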
| Component | Purpose |
|---|---|
| Latent in-context reference injection | Provides low-level identity details without extra identity adapters. |
| TASS-RoPE | Keeps reference tokens temporally accessible while spatially shifting them to suppress pixel-level copying. |
| Reference-robust identity enhancement | Uses appearance-invariant reference augmentation and face-guided identity objectives. |
| Decoupled text-reference guidance | Independently controls prompt adherence and reference fidelity at inference time. |
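To make the last row concrete, the snippet below sketches one common way to combine two conditions with separate guidance scales, in the style of two-condition classifier-free guidance. Whether ST-DRC uses exactly this formula is an assumption; the function name, argument names, and the nesting order (reference first, then text) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def decoupled_guidance(
    eps_uncond: Sequence[float],  # prediction with neither text nor reference
    eps_ref: Sequence[float],     # prediction with the reference only
    eps_full: Sequence[float],    # prediction with both text and reference
    ref_scale: float,
    text_scale: float,
) -> List[float]:
    """Combine three denoiser outputs with independent guidance scales.

    ref_scale weights the direction added by the reference image;
    text_scale weights the direction added by the prompt on top of it.
    """
    return [
        u + ref_scale * (r - u) + text_scale * (f - r)
        for u, r, f in zip(eps_uncond, eps_ref, eps_full)
    ]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    ref_only = [1.0, 1.0]
    full = [2.0, 3.0]
    # Scales of 1.0 recover the fully conditioned prediction.
    print(decoupled_guidance(uncond, ref_only, full, 1.0, 1.0))
    # Raising only ref_scale strengthens the reference direction.
    print(decoupled_guidance(uncond, ref_only, full, 2.0, 1.0))
```

In this form, raising `ref_scale` pushes the sample toward the reference identity without changing how strongly the prompt is weighted, and raising `text_scale` does the reverse. Each sampling step would need three forward passes of the denoiser under this scheme.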
- 2026-06-01: Paper released on arXiv.
- Release paper
- Release project page
- Release inference code
- Release model checkpoint
- Release training code
The inference code, model checkpoints, and training code will be released in this repository. Please follow the timeline above for release status.
We thank the authors of Lightricks/LTX-2 for their excellent open-source codebase.
If you find this work useful for your research, please consider citing:
@article{chen2026spatial,
title={Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation},
author={Chen, Yuheng and Hu, Teng and Wang, Yuji and He, Qingdong and Ma, Lizhuang and Zhang, Jiangning},
journal={arXiv preprint arXiv:2606.02441},
year={2026}
}
