† Corresponding author · ‡ Project lead
TL;DR: Existing distribution-matching distillation (DMD) methods for streaming video diffusion treat every rollout, frame, and pixel as equally informative supervision. Stream-R1 instead reweights the DMD objective along two complementary axes — Inter-Reliability across rollouts and Intra-Perplexity across spatiotemporal regions — with a single shared video reward model. The student concentrates updates where the local reward landscape has not yet flattened, converging to the teacher's high-quality mode rather than its full mixture, and surpasses the multi-step Wan2.1 teacher on VBench Total/Semantic at 23.1 FPS with no architectural change and zero inference overhead.
Qualitative results across 30 s / 60 s / 2 min / 3 min: https://stream-r1.github.io/#duration.
Stream-R1 modulates the standard DMD generator loss as

∇L_G ∝ E[ W_inter · (W_intra ⊙ (f_fake − f_real)) ]

with three reward-guided components, all derived from one pretrained video reward model:
- Inter-Reliability Weighting — the DMD gradient `g = f_fake − f_real` varies in reliability across rollouts; we exponentially rescale each rollout's loss by `exp(β·r_final)`, so reliable rollouts dominate supervision while low-quality rollouts are attenuated.
- Intra-Perplexity Weighting — back-propagates the reward model to obtain a per-pixel saliency volume `S ∈ R^{F×H×W}`, factorizes it into a temporal profile and per-frame spatial maps, and uses the product as `W_intra`. Optimization pressure concentrates on the regions and frames where the local reward landscape has not yet flattened — i.e. where further refinement yields the largest expected gain.
- Adaptive Reward Balancing — tracks per-axis (VQ / MQ / TA) improvement in a sliding window and subtracts the std of per-axis deltas from the reward, keeping the three quality axes improving at similar rates.
Saliency from the three axes is fused with an adaptive softmax weighting that allocates more attention to the currently weaker axis, so a single reward signal drives both W_inter and W_intra.
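The two weighting axes and the adaptive fusion can be sketched as follows. This is a minimal NumPy illustration of the description above, not the repo's implementation: every function name and the exact normalization/softmax forms are assumptions.

```python
import numpy as np

def stream_r1_weights(r_final, saliency, beta=1.0, tau_min=0.2, sigma_min=0.15):
    """Illustrative reweighting (hypothetical helper, names assumed).

    r_final : (B,) scalar reward per rollout          -> W_inter
    saliency: (B, F, H, W) per-pixel reward saliency  -> W_intra
    """
    # Inter-Reliability: exponentially rescale each rollout by its final reward,
    # so reliable rollouts dominate and low-quality ones are attenuated.
    w_inter = np.exp(beta * r_final)                                  # (B,)

    # Intra-Perplexity: factorize saliency into a temporal profile and
    # per-frame spatial maps, each floored so no region is zeroed out.
    temporal = saliency.mean(axis=(2, 3))                             # (B, F)
    temporal = temporal / temporal.sum(axis=1, keepdims=True)
    temporal = np.maximum(temporal, tau_min / saliency.shape[1])      # temporal floor

    spatial = saliency / (saliency.sum(axis=(2, 3), keepdims=True) + 1e-8)
    spatial = np.maximum(spatial, sigma_min / (saliency.shape[2] * saliency.shape[3]))

    w_intra = temporal[:, :, None, None] * spatial                    # (B, F, H, W)
    return w_inter, w_intra

def weighted_dmd_grad(f_fake, f_real, w_inter, w_intra):
    # Standard DMD direction g = f_fake - f_real, modulated along both axes.
    g = f_fake - f_real
    return w_inter[:, None, None, None] * w_intra * g

def adaptive_fusion(axis_saliency, axis_progress, temp=1.0):
    """Fuse per-axis (VQ/MQ/TA) saliency, favoring the weakest axis.

    axis_saliency: (3, F, H, W); axis_progress: recent improvement per axis.
    Softmax over negated progress is an assumed form of the adaptive weighting.
    """
    z = -np.asarray(axis_progress, dtype=float) / temp
    w = np.exp(z - z.max())
    w = w / w.sum()
    return np.tensordot(w, axis_saliency, axes=1)                     # (F, H, W)
```

The key property is that both weights multiply the same DMD direction, so the method changes only where gradient mass lands, not the architecture or the sampling path.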
| Knob | Value | Role |
|---|---|---|
| `reward_mode` | `BalancedOverall` | Inter-Reliability + Adaptive Reward Balancing |
| `spatial_reward` / `spatial_reward_pixel_grad` | `true` / `true` | Intra-Perplexity spatial (pixel-level gradient saliency) |
| `temporal_saliency_weighting` | `true` | Intra-Perplexity temporal (per-frame importance) |
| `spatial_reward_combination` | `adaptive` | adaptive saliency fusion across VQ/MQ/TA |
| `spatial_reward_min_weight` | `0.15` | spatial floor (σ_min) |
| `temporal_saliency_min_weight` | `0.2` | temporal floor (τ_min) |
| `full_training_steps` × `gradient_accumulation_steps` | 1000 × 8 | 8000 raw steps on 8 GPUs |
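Assuming these knobs map one-to-one into `configs/exp_stream_r1.yaml` (the exact schema and nesting are not shown here, so treat this as a sketch), the relevant excerpt might look like:

```yaml
# Hypothetical excerpt -- key names mirror the table above, nesting assumed flat.
reward_mode: BalancedOverall
spatial_reward: true
spatial_reward_pixel_grad: true
temporal_saliency_weighting: true
spatial_reward_combination: adaptive
spatial_reward_min_weight: 0.15       # sigma_min, spatial floor
temporal_saliency_min_weight: 0.2     # tau_min, temporal floor
full_training_steps: 1000
gradient_accumulation_steps: 8        # 1000 x 8 = 8000 raw steps
```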
- NVIDIA GPU: ≥24 GB for inference, ≥80 GB for training (8 GPUs recommended).
- Linux, ≥64 GB RAM.
```bash
git clone https://github.com/FrameX-AI/Stream-R1.git
cd Stream-R1
conda create -n stream_r1 python=3.10
conda activate stream_r1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .
```

Required for training (teacher / reward / init) and inference (Stream-R1):
| Model | Download |
|---|---|
| VideoReward | Hugging Face |
| Wan2.1-T2V-1.3B | Hugging Face |
| Wan2.1-T2V-14B | Hugging Face |
| ODE Initialization | Hugging Face |
| Stream-R1 (T2V-1.3B) | Hugging Face |
After downloading:
```
checkpoints/
├── VideoReward/
├── Wan2.1-T2V-1.3B/
├── Wan2.1-T2V-14B/
├── Stream-R1-T2V-1.3B/
└── ode_init.pt
```
Or run the helper:
```bash
pip install "huggingface_hub[cli]"
bash download_checkpoints.sh
```

Place the released Stream-R1 weights at `checkpoints/Stream-R1-T2V-1.3B/stream_r1.pt`
(any filename works — pass it via --checkpoint_path). You can also run inference
on a checkpoint produced by your own training run
(output/<timestamp>_stream_r1/checkpoint_model_*/generator.pt).
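With everything downloaded, a quick layout check can save a failed launch. The helper below is hypothetical (not part of the repo), and the entry names simply mirror the tree above; adjust them if your folder names differ.

```python
from pathlib import Path

# Expected entries under checkpoints/, taken from the layout shown above.
REQUIRED = [
    "VideoReward",
    "Wan2.1-T2V-1.3B",
    "Wan2.1-T2V-14B",
    "Stream-R1-T2V-1.3B",
    "ode_init.pt",
]

def missing_checkpoints(root="checkpoints"):
    """Return the expected entries that are absent under root."""
    base = Path(root)
    return [name for name in REQUIRED if not (base / name).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```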
```bash
# 5-second video
python inference.py \
    --num_output_frames 21 \
    --config_path configs/stream_r1.yaml \
    --checkpoint_path checkpoints/Stream-R1-T2V-1.3B/stream_r1.pt \
    --output_folder videos/stream_r1-5s \
    --data_path prompts/MovieGenVideoBench_extended.txt \
    --use_ema

# 30-second video
python inference.py \
    --num_output_frames 120 \
    --config_path configs/stream_r1.yaml \
    --checkpoint_path checkpoints/Stream-R1-T2V-1.3B/stream_r1.pt \
    --output_folder videos/stream_r1-30s \
    --data_path prompts/MovieGenVideoBench_extended.txt \
    --use_ema
```

```bash
bash run_stream_r1.sh
```

The launcher reads `configs/exp_stream_r1.yaml`, runs `train.py` via `torchrun` on 8 GPUs (1000 optimizer steps × grad-accum 8 → 8000 raw steps), then renders 20 evaluation videos from the final checkpoint. Override defaults with environment variables, e.g.:
```bash
NUM_GPUS=4 CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_stream_r1.sh
```

Manual launch:
```bash
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=5235 --rdzv_backend=c10d \
    --rdzv_endpoint=localhost:$MASTER_PORT train.py \
    --config_path configs/exp_stream_r1.yaml \
    --logdir logs/stream_r1 \
    --disable-wandb
```

For multi-node training, set `--nnodes`, `--node-rank`, and `--rdzv_endpoint=$MASTER_IP:$MASTER_PORT` accordingly.
| Model | Params | FPS↑ | Total↑ | Quality↑ | Semantic↑ |
|---|---|---|---|---|---|
| Wan2.1 (multi-step teacher) | 1.3B | 0.78 | <u>84.26</u> | **85.30** | 80.09 |
| LTX-Video | 1.9B | 8.98 | 80.00 | 82.30 | 70.79 |
| SkyReels-V2 | 1.3B | 0.49 | 82.67 | 84.70 | 74.53 |
| MAGI-1 | 4.5B | 0.19 | 79.18 | 82.04 | 67.74 |
| NOVA | 0.6B | 0.88 | 80.12 | 80.39 | 79.05 |
| Pyramid Flow | 2B | 6.7 | 81.72 | 84.74 | 69.62 |
| CausVid | 1.3B | 17.0 | 82.88 | 83.93 | 78.69 |
| Self Forcing | 1.3B | 17.0 | 83.80 | 84.59 | 80.64 |
| LongLive | 1.3B | <u>20.7</u> | 83.22 | 83.68 | <u>81.37</u> |
| Rolling Forcing | 1.3B | 17.5 | 81.22 | 84.08 | 69.78 |
| Reward Forcing | 1.3B | **23.1** | 84.13 | 84.84 | 81.32 |
| Stream-R1 (Ours) | 1.3B | **23.1** | **84.40** | <u>85.14</u> | **81.44** |
Stream-R1 surpasses its multi-step Wan2.1 teacher on Total and Semantic while running ~30× faster, demonstrating that reward-guided distillation can push the student beyond the teacher's quality frontier. (Underlined = second best, bold = best.)
| Model | Visual↑ | Dynamic↑ | Text↑ |
|---|---|---|---|
| SkyReels-V2 | 3.30 | 3.05 | 2.70 |
| CausVid | 4.66 | 3.16 | 3.32 |
| Self Forcing | 3.89 | 3.44 | 3.11 |
| LongLive | 4.79 | 3.81 | 3.98 |
| Reward Forcing | 4.82 | 4.18 | 4.04 |
| Stream-R1 (Ours) | 4.92 | 4.04 | 4.11 |
Project page and qualitative results: https://stream-r1.github.io/
A BibTeX entry will be added shortly. In the meantime, please cite the arXiv preprint: https://arxiv.org/abs/2605.03849.
Built on CausVid, Self Forcing, Wan2.1, and VideoAlign. Stream-R1 extends the Reward Forcing codebase with the Inter-Reliability / Intra-Perplexity formulation.
See LICENSE.
For questions about the project, please open a GitHub issue or reach out to the corresponding author Mengqi Huang.