* Corresponding author · † Project lead
While Test-Time Scaling (TTS) offers a promising way to enhance video generation without surging training costs, current test-time video generation methods based on diffusion models suffer from exorbitant candidate-exploration costs and a lack of temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited to TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a comprehensive TTS framework tailored exclusively for streaming video generation. Evaluated on comprehensive 5s and 30s video benchmarks, Stream-T1 significantly improves temporal consistency, motion smoothness, and frame-level visual quality.
- Stream‑Scaled Noise Propagation: actively refines the initial latent noise of the current chunk using historically proven, high-quality noise from previous chunks, effectively establishing temporal dependency and exploiting the historical Gaussian prior to guide the current generation;
- Stream‑Scaled Reward Pruning: comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations;
- Stream‑Scaled Memory Sinking: dynamically routes context evicted from the KV-cache into distinct updating pathways guided by reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream (see the sketch below).
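
The three components above amount to a chunk-level search loop. Below is a minimal, illustrative sketch of that loop covering noise propagation and reward pruning (memory sinking is omitted for brevity). All names here, such as `propagate_noise`, `prune_by_reward`, the mixing weight `alpha`, and the toy reward functions, are placeholders for exposition and are not the actual Stream-T1 API.

```python
# Minimal sketch of the chunk-level TTS loop described above.
# Every function/parameter name is an illustrative placeholder.
import torch

def propagate_noise(prev_best_noise, num_candidates, alpha=0.5):
    """Stream-Scaled Noise Propagation (sketch): mix the previous chunk's
    best initial noise with fresh Gaussian noise to seed each candidate."""
    fresh = torch.randn(num_candidates, *prev_best_noise.shape)
    return alpha * prev_best_noise.unsqueeze(0) + (1 - alpha) * fresh

def prune_by_reward(candidates, history, short_term_reward, long_term_reward,
                    window=3, lam=0.5):
    """Stream-Scaled Reward Pruning (sketch): combine an immediate per-chunk
    score with a sliding-window temporal score, then keep the best candidate."""
    recent = history[-window:]
    scores = torch.stack([
        short_term_reward(c) + lam * long_term_reward(recent + [c])
        for c in candidates
    ])
    best = int(scores.argmax())
    return candidates[best], best

# Hypothetical usage with a dummy one-step "generator" mapping noise -> chunk.
if __name__ == "__main__":
    chunk_shape = (16, 4, 8, 8)                  # frames x channels x H x W (latent)
    generate_chunk = lambda noise: noise          # stand-in for the streaming model
    short_r = lambda c: -c.abs().mean()           # stand-in frame-quality reward
    long_r = lambda cs: -torch.stack(cs).std()    # stand-in temporal-coherence reward

    history, best_noise = [], torch.randn(*chunk_shape)
    for step in range(4):                         # four streamed chunks
        noises = propagate_noise(best_noise, num_candidates=4)
        candidates = [generate_chunk(n) for n in noises]
        chunk, idx = prune_by_reward(candidates, history, short_r, long_r)
        best_noise = noises[idx]                  # carry the winning noise forward
        history.append(chunk)
```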
- Release the paper and project page.
- Release the inference code.
- Release test cases with our pretrained model, prompts, and reference image.
Inference is conducted on a single A800 GPU (80 GB VRAM).
git clone https://github.com/FrameX-AI/Stream-T1.git
cd Stream-T1
cd metrics
git clone https://github.com/KlingAIResearch/VideoAlign.git
All tests are conducted on Linux. To set up the environment, please run:
conda create -n StreamT1 python=3.10 -y
conda activate StreamT1
pip install -r requirements.txt
1. Base model checkpoints
huggingface-cli download Efficient-Large-Model/LongLive --local-dir longlive_models
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
2. Reward model checkpoints
huggingface-cli download MizzenAI/HPSv3 --local-dir metrics/models/hpsv3_model
huggingface-cli download KlingTeam/VideoReward --local-dir metrics/models/videoalign
bash stream_scaling.sh
If you find this work useful in your research, please cite:
@misc{tu2026streamt1testtimescalingstreaming,
title={Stream-T1: Test-Time Scaling for Streaming Video Generation},
author={Yijing Tu and Shaojin Wu and Mengqi Huang and Wenchuan Wang and Yuxin Wang and Chunxiao Liu and Zhendong Mao},
year={2026},
eprint={2605.04461},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.04461},
}
- LongLive: the codebase and algorithm we built upon. Thanks for their wonderful work.
- HPSv3 and VideoAlign: the reward models we use. Thanks for their wonderful work.
See LICENSE.
