AliothChen/ST-DRC

ST-DRC logo

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Yuheng Chen1,*   Teng Hu1,*   Yuji Wang1   Qingdong He2   Lizhuang Ma1,†   Jiangning Zhang3,‡

1Shanghai Jiao Tong University   2University of Electronic Science and Technology of China   3Zhejiang University

*Equal contribution   †Corresponding author   ‡Project lead

     

✨ Introduction

ST-DRC is a spatial-temporal decoupled reference conditioning framework for identity-preserving text-to-video generation. Given a reference face and a textual prompt, the goal is to synthesize a high-fidelity video that follows the requested action, scene, and temporal dynamics while preserving the reference identity across frames.

The core idea is to treat the reference image as a latent in-context identity memory: ST-DRC encodes the reference image with the video VAE and concatenates it with noisy video latents, allowing the diffusion transformer to access fine-grained identity cues through its native spatio-temporal attention. To reduce appearance copy-paste from the reference image, ST-DRC introduces Temporal-Adjacent Spatial-Shifted RoPE (TASS-RoPE), reference-robust identity enhancement, and decoupled text-reference guidance for controllable inference.
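The token layout described above can be sketched with plain position indices. This is a minimal illustration, not the released implementation: the function names, the choice of `t_ref = -1` for the reference frame, and the spatial offset of one full grid `(H, W)` are all assumptions made for this example. The sketch only shows the idea that reference tokens stay temporally adjacent to the video while their spatial RoPE indices are shifted away from every video token.

```python
from itertools import product


def video_positions(T, H, W):
    """(t, y, x) RoPE index for every video latent token, in frame-major order."""
    return [(t, y, x) for t, y, x in product(range(T), range(H), range(W))]


def reference_positions(H, W, t_ref=-1, dy=None, dx=None):
    """Hypothetical TASS-style indices for one reference latent frame.

    Assumption: t_ref = -1 places the reference right before frame 0
    (temporally adjacent), and the default offsets (H, W) shift it outside
    the spatial range used by the video tokens.
    """
    dy = H if dy is None else dy
    dx = W if dx is None else dx
    return [(t_ref, y + dy, x + dx) for y, x in product(range(H), range(W))]


def in_context_positions(T, H, W):
    """Reference tokens first, then video tokens, mirroring a concatenated sequence."""
    return reference_positions(H, W) + video_positions(T, H, W)


# Tiny latent grid: 2 frames of 2x3 tokens plus one 2x3 reference frame.
pos = in_context_positions(T=2, H=2, W=3)
ref_xy = {(y, x) for t, y, x in pos if t == -1}
vid_xy = {(y, x) for t, y, x in pos if t >= 0}
```

With the default offsets, `ref_xy` and `vid_xy` share no coordinates, so no reference token sits at the same spatial position as a video token. The model still sees the reference one temporal step away from the first frame.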

🔥 Highlights

| Component | Purpose |
| --- | --- |
| Latent in-context reference injection | Provides low-level identity details without extra identity adapters. |
| TASS-RoPE | Keeps reference tokens temporally accessible while spatially shifting them to suppress pixel-level copying. |
| Reference-robust identity enhancement | Uses appearance-invariant reference augmentation and face-guided identity objectives. |
| Decoupled text-reference guidance | Independently controls prompt adherence and reference fidelity at inference time. |
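One common way to get two independent guidance knobs is a nested, two-scale classifier-free guidance rule. The sketch below uses that generic formulation as an assumption. The exact rule used by ST-DRC is not stated here, and the function name, argument names, and scale values are hypothetical.

```python
def decoupled_guidance(eps_uncond, eps_ref, eps_full, s_ref, s_txt):
    """Combine three denoiser predictions with separate reference and text scales.

    eps_uncond: prediction with neither reference nor text
    eps_ref:    prediction with the reference only
    eps_full:   prediction with both reference and text
    Works on floats or on any array type that supports + - * (assumed setup).
    """
    ref_direction = eps_ref - eps_uncond   # effect of adding the reference
    txt_direction = eps_full - eps_ref     # effect of adding the text prompt
    return eps_uncond + s_ref * ref_direction + s_txt * txt_direction


# Scalar stand-ins for the three predictions, just to show the arithmetic.
same_as_full = decoupled_guidance(1.0, 2.0, 4.0, s_ref=1.0, s_txt=1.0)
stronger = decoupled_guidance(1.0, 2.0, 4.0, s_ref=3.0, s_txt=2.0)
```

Setting both scales to 1 recovers the fully conditioned prediction. Raising `s_ref` alone pushes harder toward the reference identity, and raising `s_txt` alone pushes harder toward the prompt.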

🎬 Visual Showcase

ST-DRC teaser

🧠 Method Overview

ST-DRC method overview

📢 News

  • 2026-06-01: Paper released on arXiv.

πŸ—“οΈ Timeline

  • Release paper
  • Release project page
  • Release inference code
  • Release model checkpoint
  • Release training code

🚀 Getting Started

The inference code, model checkpoints, and training code will be released in this repository. Please follow the timeline above for release status.

πŸ™ Acknowledgements

We gratefully thank the authors of Lightricks/LTX-2 for their excellent open-source codebase.

📚 Citation

If you find this work useful for your research, please consider citing:

@article{chen2026spatial,
  title={Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation},
  author={Chen, Yuheng and Hu, Teng and Wang, Yuji and He, Qingdong and Ma, Lizhuang and Zhang, Jiangning},
  journal={arXiv preprint arXiv:2606.02441},
  year={2026}
}
