Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Yuheng Chen¹,* Teng Hu¹,* Yuji Wang¹ Qingdong He² Lizhuang Ma¹,† Jiangning Zhang³,‡

¹Shanghai Jiao Tong University ²University of Electronic Science and Technology of China ³Zhejiang University

*Equal contribution †Corresponding author ‡Project lead

✨ Introduction

ST-DRC is a spatial-temporal decoupled reference conditioning framework for identity-preserving text-to-video generation. Given a reference face and a textual prompt, the goal is to synthesize a high-fidelity video that follows the requested action, scene, and temporal dynamics while preserving the reference identity across frames.

The core idea is to treat the reference image as a latent in-context identity memory: ST-DRC encodes the reference image with the video VAE and concatenates it with noisy video latents, allowing the diffusion transformer to access fine-grained identity cues through its native spatio-temporal attention. To reduce appearance copy-paste from the reference image, ST-DRC introduces Temporal-Adjacent Spatial-Shifted RoPE (TASS-RoPE), reference-robust identity enhancement, and decoupled text-reference guidance for controllable inference.
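The token layout described above can be illustrated with a minimal sketch of how position indices for a 3D RoPE might be assigned once reference tokens are concatenated with video tokens. This is not the released implementation: the function names, the choice of `t = -1` as the "temporally adjacent" slot, the default spatial shift of `(H, W)`, and appending the reference tokens after the video tokens are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Pos = Tuple[int, int, int]  # (t, h, w) index that a 3D RoPE would consume


def video_positions(T: int, H: int, W: int) -> List[Pos]:
    """Standard grid positions for T latent frames of size H x W."""
    return [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]


def reference_positions(
    H: int,
    W: int,
    t_ref: int = -1,
    shift: Optional[Tuple[int, int]] = None,
) -> List[Pos]:
    """Positions for the reference-image tokens (hypothetical scheme).

    t_ref = -1 places the reference right next to frame 0 in time, and the
    spatial shift moves it off the video grid so that no reference token
    shares an (h, w) coordinate with a video token.
    """
    dh, dw = shift if shift is not None else (H, W)
    return [(t_ref, h + dh, w + dw) for h in range(H) for w in range(W)]


def build_sequence_positions(T: int, H: int, W: int) -> List[Pos]:
    """Video tokens first, reference tokens appended (ordering is an assumption)."""
    return video_positions(T, H, W) + reference_positions(H, W)


if __name__ == "__main__":
    positions = build_sequence_positions(T=2, H=2, W=3)
    print(len(positions), positions[-1])
```

Because the reference tokens sit in the same sequence, the transformer's existing attention can read them without an extra adapter. The spatial offset is meant to weaken the positional bias toward copying reference pixels into the same location.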

🔥 Highlights

Component	Purpose
Latent in-context reference injection	Provides low-level identity details without extra identity adapters.
TASS-RoPE	Keeps reference tokens temporally accessible while spatially shifting them to suppress pixel-level copying.
Reference-robust identity enhancement	Uses appearance-invariant reference augmentation and face-guided identity objectives.
Decoupled text-reference guidance	Independently controls prompt adherence and reference fidelity at inference time.
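To make the last row concrete, here is a minimal sketch of one common way to combine two guidance scales, using the nested form popularized by two-condition classifier-free guidance. The formula ST-DRC actually uses may differ; the function name, the argument names, and the nesting order (reference first, then text) are assumptions for illustration only.

```python
from typing import List, Sequence


def decoupled_guidance(
    eps_uncond: Sequence[float],  # prediction with neither text nor reference
    eps_ref: Sequence[float],     # prediction with reference only
    eps_full: Sequence[float],    # prediction with reference and text
    s_ref: float,                 # reference-fidelity scale (hypothetical name)
    s_txt: float,                 # prompt-adherence scale (hypothetical name)
) -> List[float]:
    """Combine three model predictions with two independent scales.

    eps = eps_uncond
          + s_ref * (eps_ref  - eps_uncond)   # push toward the reference identity
          + s_txt * (eps_full - eps_ref)      # push toward the text prompt
    """
    if not (len(eps_uncond) == len(eps_ref) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    return [
        u + s_ref * (r - u) + s_txt * (f - r)
        for u, r, f in zip(eps_uncond, eps_ref, eps_full)
    ]


if __name__ == "__main__":
    # With both scales at 1 the result equals the fully conditioned prediction.
    print(decoupled_guidance([0.0, 2.0], [1.0, 2.0], [3.0, 4.0], 1.0, 1.0))
```

In this form, raising `s_ref` strengthens identity fidelity without changing the text term, and raising `s_txt` strengthens prompt adherence without changing the reference term. This is the kind of independent control the table describes.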

🎬 Visual Showcase

🧠 Method Overview

📢 News

2026-06-01: Paper released on arXiv.

🗓️ Timeline

🚀 Getting Started

The inference code, model checkpoints, and training code will be released in this repository. See the Timeline section for release status.

🙏 Acknowledgements

We gratefully thank the authors of Lightricks/LTX-2 for their excellent open-source codebase.

📚 Citation

If you find this work useful for your research, please consider citing:

@article{chen2026spatial,
  title={Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation},
  author={Chen, Yuheng and Hu, Teng and Wang, Yuji and He, Qingdong and Ma, Lizhuang and Zhang, Jiangning},
  journal={arXiv preprint arXiv:2606.02441},
  year={2026}
}
