
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation


Affiliations: ¹ University of Science and Technology of China · ² FrameX.AI · ³ Independent Researcher

Project Page: https://stream-r1.github.io/ · Paper: https://arxiv.org/abs/2605.03849 · Models: Hugging Face (see Pretrained Checkpoints below)

Overview

TL;DR: Existing distribution-matching distillation (DMD) methods for streaming video diffusion treat every rollout, frame, and pixel as equally informative supervision. Stream-R1 instead reweights the DMD objective along two complementary axes, Inter-Reliability across rollouts and Intra-Perplexity across spatiotemporal regions, using a single shared video reward model. The student concentrates updates where the local reward landscape has not yet flattened and therefore converges to the teacher's high-quality mode rather than its full mixture; it surpasses the multi-step Wan2.1 teacher on VBench Total/Semantic at 23.1 FPS, with no architectural change and zero inference overhead.

Qualitative results across 30 s / 60 s / 2 min / 3 min: https://stream-r1.github.io/#duration.

Method

Stream-R1 modulates the standard DMD generator loss as

$$\mathcal{L}_{\text{Stream-R1}} \;=\; \underbrace{\exp(\beta \cdot r_{\text{final}})}_{\mathbf{W}_{\text{inter}}} \;\cdot\; \text{mean}\!\big(\underbrace{\mathbf{W}_{\text{intra}}}_{F\times H\times W} \,\odot\, \mathcal{L}_{\text{DMD}}\big)$$

with three reward-guided components, all derived from one pretrained video reward model:

  1. Inter-Reliability Weighting — the DMD gradient g = f_fake − f_real varies in reliability across rollouts; we exponentially rescale each rollout's loss by exp(β·r_final), so reliable rollouts dominate supervision while low-quality rollouts are attenuated (sketched right after this list).
  2. Intra-Perplexity Weighting — we back-propagate through the reward model to obtain a per-pixel saliency volume S ∈ R^{F×H×W}, factorize it into a temporal profile and per-frame spatial maps, and use their product as W_intra. Optimization pressure then concentrates on the regions and frames where the local reward landscape has not yet flattened, i.e. where further refinement yields the largest expected gain.
  3. Adaptive Reward Balancing — we track per-axis (VQ / MQ / TA) improvement in a sliding window and subtract the standard deviation of the per-axis deltas from the scalar reward, keeping the three quality axes improving at similar rates (sketched after the configuration table below).
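
In code, the modulation is just a scalar rollout weight times a saliency-weighted mean. A minimal PyTorch sketch, assuming a per-pixel DMD loss tensor and a scalar rollout reward are already available; `stream_r1_loss`, `dmd_loss`, and `w_intra` are illustrative names, not the repo's actual API:

```python
import torch

def stream_r1_loss(dmd_loss: torch.Tensor,  # per-pixel DMD loss, shape (F, H, W)
                   w_intra: torch.Tensor,   # fused saliency weights, shape (F, H, W)
                   r_final: torch.Tensor,   # scalar reward for this rollout
                   beta: float = 1.0) -> torch.Tensor:
    # Inter-Reliability: exp(beta * r_final) rescales the whole rollout,
    # so reliable (high-reward) rollouts dominate supervision.
    w_inter = torch.exp(beta * r_final).detach()
    # Intra-Perplexity: weight each spatiotemporal location before
    # averaging, concentrating pressure where the reward landscape
    # has not yet flattened.
    return w_inter * (w_intra.detach() * dmd_loss).mean()
```

Both weights are detached, so the reward model only shapes the generator gradient rather than receiving gradient itself.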

Saliency from the three axes is fused with an adaptive softmax weighting that allocates more attention to the currently weaker axis, so a single reward signal drives both W_inter and W_intra.
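
To illustrate that fusion, here is a sketch of how the per-axis saliency volumes might be combined and factorized. The floor values mirror `spatial_reward_min_weight` and `temporal_saliency_min_weight` from the shipped config below, but the function name, shapes, and normalization details are assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def fuse_saliency(saliency: torch.Tensor,       # per-axis |dr/dx|, shape (3, F, H, W) for VQ/MQ/TA
                  axis_scores: torch.Tensor,    # current per-axis rewards, shape (3,)
                  spatial_floor: float = 0.15,  # sigma_min
                  temporal_floor: float = 0.2   # tau_min
                  ) -> torch.Tensor:
    # Adaptive fusion: softmax over the negated scores gives the
    # currently weaker axis a larger share of the attention.
    axis_w = F.softmax(-axis_scores, dim=0)
    fused = (axis_w[:, None, None, None] * saliency).sum(dim=0)  # (F, H, W)

    # Factorize into a temporal profile and per-frame spatial maps.
    temporal = fused.mean(dim=(1, 2))                            # (F,)
    temporal = temporal / temporal.mean().clamp_min(1e-8)
    spatial = fused / fused.mean(dim=(1, 2), keepdim=True).clamp_min(1e-8)

    # Floors keep every frame and region minimally supervised.
    temporal = temporal.clamp_min(temporal_floor)
    spatial = spatial.clamp_min(spatial_floor)

    return temporal[:, None, None] * spatial     # W_intra, shape (F, H, W)
```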

Shipped configuration (configs/exp_stream_r1.yaml)

| Knob | Value | Role |
| --- | --- | --- |
| `reward_mode` | BalancedOverall | Inter-Reliability + Adaptive Reward Balancing |
| `spatial_reward` / `spatial_reward_pixel_grad` | true / true | Intra-Perplexity spatial (pixel-level gradient saliency) |
| `temporal_saliency_weighting` | true | Intra-Perplexity temporal (per-frame importance) |
| `spatial_reward_combination` | adaptive | adaptive saliency fusion across VQ/MQ/TA |
| `spatial_reward_min_weight` | 0.15 | spatial floor (σ_min) |
| `temporal_saliency_min_weight` | 0.2 | temporal floor (τ_min) |
| `full_training_steps` × `gradient_accumulation_steps` | 1000 × 8 | 8000 raw steps on 8 GPUs |
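
The BalancedOverall mode folds Adaptive Reward Balancing into the scalar reward. A sketch of the sliding-window balancing described above; the class name and window size are hypothetical:

```python
from collections import deque

import torch

class RewardBalancer:
    """Sliding-window reward balancing across VQ / MQ / TA (illustrative)."""

    def __init__(self, window: int = 100):
        self.history = deque(maxlen=window)  # each entry: per-axis scores, shape (3,)

    def __call__(self, axis_scores: torch.Tensor) -> torch.Tensor:
        self.history.append(axis_scores.detach())
        if len(self.history) < 2:
            return axis_scores.mean()
        # Per-axis improvement over the window; a large spread means
        # one axis is racing ahead of the others.
        deltas = self.history[-1] - self.history[0]
        # Subtracting the std of the deltas penalizes lopsided progress,
        # keeping the three quality axes improving at similar rates.
        return axis_scores.mean() - deltas.std()
```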

Requirements

  • NVIDIA GPU: ≥24 GB for inference, ≥80 GB for training (8 GPUs recommended).
  • Linux, ≥64 GB RAM.

Installation

git clone https://github.com/FrameX-AI/Stream-R1.git
cd Stream-R1

conda create -n stream_r1 python=3.10
conda activate stream_r1

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .

Pretrained Checkpoints

Required for training (teacher / reward / init) and inference (Stream-R1):

| Model | Download |
| --- | --- |
| VideoReward | Hugging Face |
| Wan2.1-T2V-1.3B | Hugging Face |
| Wan2.1-T2V-14B | Hugging Face |
| ODE Initialization | Hugging Face |
| Stream-R1 (T2V-1.3B) | Hugging Face |

After downloading:

checkpoints/
├── Videoreward/
├── Wan2.1-T2V-1.3B/
├── Wan2.1-T2V-14B/
├── Stream-R1-T2V-1.3B/
└── ode_init.pt

Or run the helper:

pip install "huggingface_hub[cli]"
bash download_checkpoints.sh

Inference

Place the released Stream-R1 weights at checkpoints/Stream-R1-T2V-1.3B/stream_r1.pt (any filename works — pass it via --checkpoint_path). You can also run inference on a checkpoint produced by your own training run (output/<timestamp>_stream_r1/checkpoint_model_*/generator.pt).

# 5-second video
python inference.py \
    --num_output_frames 21 \
    --config_path configs/stream_r1.yaml \
    --checkpoint_path checkpoints/Stream-R1-T2V-1.3B/stream_r1.pt \
    --output_folder videos/stream_r1-5s \
    --data_path prompts/MovieGenVideoBench_extended.txt \
    --use_ema

# 30-second video
python inference.py \
    --num_output_frames 120 \
    --config_path configs/stream_r1.yaml \
    --checkpoint_path checkpoints/Stream-R1-T2V-1.3B/stream_r1.pt \
    --output_folder videos/stream_r1-30s \
    --data_path prompts/MovieGenVideoBench_extended.txt \
    --use_ema

Training

bash run_stream_r1.sh

The launcher reads configs/exp_stream_r1.yaml, runs train.py via torchrun on 8 GPUs (1000 optimizer steps × grad-accum 8 → 8000 raw steps), then renders 20 evaluation videos from the final checkpoint. Override defaults with environment variables, e.g.:

NUM_GPUS=4 CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_stream_r1.sh

Manual launch:

torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=5235 --rdzv_backend=c10d \
    --rdzv_endpoint=localhost:$MASTER_PORT train.py \
    --config_path configs/exp_stream_r1.yaml \
    --logdir logs/stream_r1 \
    --disable-wandb

For multi-node training, set --nnodes, --node_rank, and --rdzv_endpoint=$MASTER_IP:$MASTER_PORT accordingly.

Results

VBench (5-second, 832×480)

| Model | Params | FPS↑ | Total↑ | Quality↑ | Semantic↑ |
| --- | --- | --- | --- | --- | --- |
| Wan2.1 (multi-step teacher) | 1.3B | 0.78 | 84.26 | **85.30** | 80.09 |
| LTX-Video | 1.9B | 8.98 | 80.00 | 82.30 | 70.79 |
| SkyReels-V2 | 1.3B | 0.49 | 82.67 | 84.70 | 74.53 |
| MAGI-1 | 4.5B | 0.19 | 79.18 | 82.04 | 67.74 |
| NOVA | 0.6B | 0.88 | 80.12 | 80.39 | 79.05 |
| Pyramid Flow | 2B | 6.7 | 81.72 | 84.74 | 69.62 |
| CausVid | 1.3B | 17.0 | 82.88 | 83.93 | 78.69 |
| Self Forcing | 1.3B | 17.0 | 83.80 | 84.59 | 80.64 |
| LongLive | 1.3B | 20.7 | 83.22 | 83.68 | 81.37 |
| Rolling Forcing | 1.3B | 17.5 | 81.22 | 84.08 | 69.78 |
| Reward Forcing | 1.3B | **23.1** | 84.13 | 84.84 | 81.32 |
| Stream-R1 (Ours) | 1.3B | **23.1** | **84.40** | 85.14 | **81.44** |

Stream-R1 surpasses its multi-step Wan2.1 teacher on Total and Semantic while running ~30× faster, demonstrating that reward-guided distillation can push the student beyond the teacher's quality frontier. (Bold = best per column.)

VideoReward (per-axis)

| Model | Visual↑ | Dynamic↑ | Text↑ |
| --- | --- | --- | --- |
| SkyReels-V2 | 3.30 | 3.05 | 2.70 |
| CausVid | 4.66 | 3.16 | 3.32 |
| Self Forcing | 3.89 | 3.44 | 3.11 |
| LongLive | 4.79 | 3.81 | 3.98 |
| Reward Forcing | 4.82 | **4.18** | 4.04 |
| Stream-R1 (Ours) | **4.92** | 4.04 | **4.11** |

Project page and qualitative results: https://stream-r1.github.io/

Citation

A BibTeX entry will be added shortly. In the meantime please cite via the arXiv preprint at https://arxiv.org/abs/2605.03849.

Acknowledgements

Built on CausVid, Self Forcing, Wan2.1, and VideoAlign. Stream-R1 extends the Reward Forcing codebase with the Inter-Reliability / Intra-Perplexity formulation.

License

See LICENSE.

Contact

For questions about the project, please open a GitHub issue or reach out to the corresponding author Mengqi Huang.
