PRTS — Primitive Reasoning and Tasking System

Scaling Reward-Free Contrastive RL into VLA Pre-training

A Vision–Language–Action foundation model that, for the first time, scales reward-label-free contrastive RL into VLA pre-training itself — equipping a single Qwen3-VL backbone with a quantitative, language-grounded sense of goal-reachability, at near-BC compute.

📰 News

2026/05 PRTS arXiv preprint released — arXiv:2604.27472.
2026/05 Pre-trained PRTS-4B checkpoint pushed to 🤗 TeleEmbodied/PRTS-4B.

🚀 Release Plan

We will progressively open-source the rest of the PRTS stack. Tick = done, square = upcoming.

PRTS-4B pre-trained checkpoint — 🤗 TeleEmbodied/PRTS-4B
Project page — ⚠️ currently under maintenance, content out-of-date ; the site https://rhodes-team-prts.github.io/ will be refreshed in the next few days with final video demos, BibTeX, and benchmark cards.
Minimal SFT post-training code for LIBERO + real-robot platforms
CRL value visualization scripts
LIBERO evaluation of PRTS
PRTS-4B post-trained checkpoints for LIBERO / SimplerEnv WidowX — the exact checkpoints behind Tables 1–4 of the paper, for one-click reproduction

✨ Why PRTS?

Most VLA models pretrain by behavior cloning — they learn what to do, but never internalize how close the current state is to satisfying the instruction. PRTS reframes pre-training as a goal-conditioned RL problem and supervises a language-conditioned contrastive value alongside the action loss, all from offline trajectory structure alone.

The geometry the model converges to is sharp: the inner product

$$ \phi(s,\mathbf{a})^{\top}\psi(l) \approx \log Q^{\pi}_{l}(s,\mathbf{a}) $$

of the state–action embedding and the goal embedding tracks the log-discounted goal-occupancy along expert rollouts. It rises as the policy approaches the language goal, and stays flat under a mismatched instruction.

Value visualization on a held-out Pick Shoes rollout. Same physical trajectory; two language conditionings. Green — correct instruction: the value rises with subgoal-aligned local peaks at first shoe lifted → first shoe placed → second shoe placed → box almost shut, climbing to 0.57 over the episode. Red — wrong instruction (objects that don't appear in the scene): the score stays in 0.04 ~ 0.20 and never crosses the green curve in any frames.

The signal is computed by the pre-trained checkpoint without any post-training adaptation.

Highlights


🧭 Goal-reachability awareness, end-to-end	The contrastive value head is co-trained inside the same Qwen3-VL backbone the policy uses. No separate value network, no curated reward dataset, no offline-RL post-training loop.
💰 Reward-label-free	Supervision comes purely from the temporal structure of demonstrations — no per-episode success labels and no curated value-training corpus.
⚡ Near-BC pre-training cost	A role-aware causal mask fused into FlashAttention via a custom CuTe kernel keeps per-layer attention within `1.18 ×` of vanilla FA3, vs. `2.7 ×–8.8 ×` for off-the-shelf FlexAttention. End-to-end pre-training scales at `≥ 85 %` linear efficiency on 64 × H100.
🌍 Out-of-distribution wins grow with the shift	On 5 simulation suites and 14 real-world tasks, PRTS matches or exceeds the strongest prior VLAs at ¼ – ⅛ the post-training compute, with the gap widening as evaluation moves further off-distribution: novel-instruction following (`+38.8` over π_0.5), long-horizon execution, and recovery under human intervention.

📊 Results

Standard simulation benchmarks

PRTS reaches state-of-the-art average success rate on every standard suite, at a small post-training budget among all directly comparable VLAs.

Method	LIBERO	LIBERO-Plus	LIBERO-Pro	SimplerEnv (WidowX)
OpenVLA-OFT	97.1	69.6	—	41.8
GR00T-N1.5	97.0	—	—	61.9
π₀ ^{(bs=32, 30K)}	94.2	53.6	45.3	27.1
π_0.5 ^{(bs=256, 30K)}	96.9	80.7	53.3	—
ABot-M0 ^{(bs=32, 30K)}	97.9	78.7	52.2	—
PRTS (Ours) ^{(bs=32, 30K)}	98.4	81.4	58.8	77.1

The gap to baselines grows as evaluation drifts further off-distribution: +0.5 on LIBERO → +0.7 on LIBERO-Plus → +5.5 on LIBERO-Pro.

LIBERO-Pro: novel instruction & position swapping

This is where PRTS's CRL shaped representations really earn its keep. The benchmark holds the visual scene fixed and rewrites either the instruction (Task axis) or the target relation (Position axis).

Method	Semantic	Object	Position	Task	Average
π₀ ^{(bs=32, 30K)}	90.5	90.5	0.0	0.0	45.3
π_0.5 ^{(bs=256, 30K)}	95.8	96.0	20.8	0.8	53.3
ABot-M0 ^{(bs=32, 30K)}	97.1	82.5	7.1	22.3	52.2
PRTS (Ours) ^{(bs=32, 30K)}	97.0	82.3	24.3	31.5	58.8

On the Task axis, π₀ and π_0.5 collapse below 1 %, and the strongest comparable VLA (ABot-M0) reaches only 22.3 %. PRTS holds 31.5 % — although π_0.5 achieves the second-best average result, PRTS suprisingly outperform it a large margin +30.7 on the hardest Task axis.

Real-world: 14 tasks across 2 platforms

We deploy PRTS on a 14-DoF dual-arm RealMan platform (11 tasks) and a 7-DoF Flexiv single-arm platform (3 tasks). All three policies (PRTS, π₀, π_0.5) share identical post-training data, schedule, and 20-trial physical evaluation protocol.

Method	RealMan dual-arm (avg over 11 tasks)	Flexiv single-arm (avg over 3 tasks)
π₀	67.3	60.0
π_0.5	85.5	75.0
PRTS (Ours)	95.9	90.0

PRTS hits ≥ 90 % on every one of the 11 RealMan tasks and 100 % on four of them. On the genuinely long-horizon Office Long Term task (~2 min of continuous bimanual operation), π_0.5 collapses to 40 % under multi-task interference while PRTS holds 95 %.

Zero-shot novel-instruction generalization

The cleanest test of "does the policy follow the language goal?" is to take a deployed task and change the language instruction to recombine seen primitives in a new way (e.g. Paper Rubbish with a soy-sauce bottle in place of the trash item). All four task-generalization cells, all 20 trials physical:

Method	Paper Rubbish	Place Block	Pick Shoes	Stack Cups	Average
π₀	5.0	0.0	30.0	20.0	13.8
π_0.5	65.0	15.0	35.0	25.0	35.0
PRTS (Ours)	80.0	55.0	85.0	75.0	73.8

+38.8 average margin over π_0.5 — the most direct empirical evidence that PRTS's value head ties the language goal to feasible state-action outcomes.

Pre-training efficiency

(a) Aggregate token throughput vs. number of H100 GPUs (log-log) — PRTS retains ≥ 85 % of perfect linear scaling up to 64 GPUs. (b) Per-layer attention forward time at matched packing — our role-aware CuTe-FlashAttention kernel sits at 0.531 ms / layer (1.18 × of the BC-only FA3 reference at 0.45 ms), versus 1.23 ms (2.7 ×) and 3.95 ms (8.8 ×) for the alternative FlexAttention realizations of the same role-aware mask.

📚 Citation

If you find PRTS useful, please consider citing:

@article{zhang2026prts,
  title   = {PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations},
  author={Yang Zhang and Jiangyuan Zhao and Chenyou Fan and Fangzheng Yan and Tian Li and Haitong Tang and Sen Fu and Xuan'er Wu and Qizhen Weng and Weinan Zhang and Xiu Li and Chi Zhang and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2604.27472},
  year    = {2026},
}

📄 License

Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0). See LICENSE for details. Released model weights and code are free for academic and non-commercial use; commercial use is not permitted under this license.

🙏 Acknowledgements

PRTS builds on Qwen3-VL, FlashAttention, LeRobot, and OpenPI. We thank the authors of Contrastive RL for the ideas behind the contrastive value formulation.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
assets/readme		assets/readme
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PRTS — Primitive Reasoning and Tasking System

Scaling Reward-Free Contrastive RL into VLA Pre-training

📰 News

🚀 Release Plan

✨ Why PRTS?

Highlights

📊 Results

Standard simulation benchmarks

LIBERO-Pro: novel instruction & position swapping

Real-world: 14 tasks across 2 platforms

Zero-shot novel-instruction generalization

Pre-training efficiency

📚 Citation

📄 License

🙏 Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

PRTS — Primitive Reasoning and Tasking System

Scaling Reward-Free Contrastive RL into VLA Pre-training

📰 News

🚀 Release Plan

✨ Why PRTS?

Highlights

📊 Results

Standard simulation benchmarks

LIBERO-Pro: novel instruction & position swapping

Real-world: 14 tasks across 2 platforms

Zero-shot novel-instruction generalization

Pre-training efficiency

📚 Citation

📄 License

🙏 Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages