A Vision–Language–Action foundation model that, for the first time, scales reward-label-free contrastive RL into VLA pre-training itself — equipping a single Qwen3-VL backbone with a quantitative, language-grounded sense of goal-reachability, at near-BC compute.
- 2026/05 PRTS arXiv preprint released — arXiv:2604.27472.
- 2026/05 Pre-trained PRTS-4B checkpoint pushed to 🤗 TeleEmbodied/PRTS-4B.
We will progressively open-source the rest of the PRTS stack. Tick = done, square = upcoming.
- [x] PRTS-4B pre-trained checkpoint — 🤗 TeleEmbodied/PRTS-4B
- [x] Project page — ⚠️ currently under maintenance, content out-of-date; the site https://rhodes-team-prts.github.io/ will be refreshed in the next few days with final video demos, BibTeX, and benchmark cards.
- [ ] Minimal SFT post-training code for LIBERO + real-robot platforms
- [ ] CRL value visualization scripts
- [ ] LIBERO evaluation of PRTS
- [ ] PRTS-4B post-trained checkpoints for LIBERO / SimplerEnv WidowX — the exact checkpoints behind Tables 1–4 of the paper, for one-click reproduction
Most VLA models pretrain by behavior cloning — they learn what to do, but never internalize how close the current state is to satisfying the instruction. PRTS reframes pre-training as a goal-conditioned RL problem and supervises a language-conditioned contrastive value alongside the action loss, all from offline trajectory structure alone.
The geometry the model converges to is sharp: the inner product of the state–action embedding and the goal embedding tracks the log-discounted goal-occupancy along expert rollouts. It rises as the policy approaches the language goal, and stays flat under a mismatched instruction.
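The training signal behind this geometry can be sketched as a batch-contrastive (InfoNCE-style) objective over paired state–action and goal embeddings. The sketch below is a minimal NumPy illustration under our own simplifying assumptions — random stand-in embeddings, identity encoders — not the paper's actual loss, encoders, or future-goal sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_value_loss(sa_emb, goal_emb, temperature=1.0):
    """InfoNCE over a batch: logits[i, j] = <phi(s_i, a_i), psi(g_j)> / T.
    Row i's positive is goal i (drawn from the same trajectory's future);
    the other goals in the batch serve as negatives -- no reward labels."""
    logits = sa_emb @ goal_emb.T / temperature            # (B, B) inner products
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                     # cross-entropy on the diagonal

B, D = 8, 16
sa_emb = rng.normal(size=(B, D))                    # stand-in state-action embeddings
goal_emb = sa_emb + 0.1 * rng.normal(size=(B, D))   # toy positives near their pair
loss_matched = contrastive_value_loss(sa_emb, goal_emb)
loss_mismatched = contrastive_value_loss(sa_emb, rng.permutation(goal_emb))
```

With matched pairs the diagonal inner products dominate and the loss is small; permuting the goals (a mismatched instruction) destroys that alignment and the loss rises — the same qualitative behavior as the value curves described above.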
| Highlight | Details |
|---|---|
| 🧭 Goal-reachability awareness, end-to-end | The contrastive value head is co-trained inside the same Qwen3-VL backbone the policy uses. No separate value network, no curated reward dataset, no offline-RL post-training loop. |
| 💰 Reward-label-free | Supervision comes purely from the temporal structure of demonstrations — no per-episode success labels and no curated value-training corpus. |
| ⚡ Near-BC pre-training cost | A role-aware causal mask fused into FlashAttention via a custom CuTe kernel keeps per-layer attention within 1.18 × of vanilla FA3, vs. 2.7 ×–8.8 × for off-the-shelf FlexAttention. End-to-end pre-training scales at ≥ 85 % linear efficiency on 64 × H100. |
| 🌍 Out-of-distribution wins grow with the shift | On 5 simulation suites and 14 real-world tasks, PRTS matches or exceeds the strongest prior VLAs at ¼ – ⅛ the post-training compute, with the gap widening as evaluation moves further off-distribution: novel-instruction following (+38.8 over π0.5), long-horizon execution, and recovery under human intervention. |
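To make the role-aware mask concrete, here is a rough NumPy illustration of the attention predicate such a mask computes. The specific role layout below — a bidirectional vision/language prefix with causal, mutually isolated action and value streams — is our assumption for illustration, not the paper's exact rule; the released kernel fuses an equivalent predicate into FlashAttention rather than materializing any matrix:

```python
import numpy as np

# Hypothetical roles (assumption): prefix tokens attend bidirectionally among
# themselves; action and value tokens attend causally to the prefix and to
# earlier tokens of their own role, so the two heads do not leak into each other.
PREFIX, ACTION, VALUE = 0, 1, 2

def role_aware_mask(roles):
    """allowed[q, k] is True iff query token q may attend to key token k."""
    n = len(roles)
    allowed = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(n):
            if roles[q] == PREFIX:
                allowed[q, k] = roles[k] == PREFIX               # bidirectional prefix
            else:
                allowed[q, k] = k <= q and roles[k] in (PREFIX, roles[q])
    return allowed

roles = [PREFIX] * 3 + [ACTION] * 2 + [VALUE] * 2
mask = role_aware_mask(roles)
```

Expressing the predicate as a closed-form function of token roles (rather than a dense bias tensor) is what makes it cheap to fuse into a FlashAttention-style kernel.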
PRTS reaches state-of-the-art average success rate on every standard suite, while using a small post-training budget relative to all directly comparable VLAs.
| Method | LIBERO | LIBERO-Plus | LIBERO-Pro | SimplerEnv (WidowX) |
|---|---|---|---|---|
| OpenVLA-OFT | 97.1 | 69.6 | — | 41.8 |
| GR00T-N1.5 | 97.0 | — | — | 61.9 |
| π0 (bs=32, 30K) | 94.2 | 53.6 | 45.3 | 27.1 |
| π0.5 (bs=256, 30K) | 96.9 | 80.7 | 53.3 | — |
| ABot-M0 (bs=32, 30K) | 97.9 | 78.7 | 52.2 | — |
| PRTS (Ours) (bs=32, 30K) | 98.4 | 81.4 | 58.8 | 77.1 |
The gap to baselines grows as evaluation drifts further off-distribution: +0.5 on LIBERO → +0.7 on LIBERO-Plus → +5.5 on LIBERO-Pro.
This is where PRTS's CRL-shaped representations really earn their keep. The benchmark holds the visual scene fixed and rewrites either the instruction (Task axis) or the target relation (Position axis).
| Method | Semantic | Object | Position | Task | Average |
|---|---|---|---|---|---|
| π0 (bs=32, 30K) | 90.5 | 90.5 | 0.0 | 0.0 | 45.3 |
| π0.5 (bs=256, 30K) | 95.8 | 96.0 | 20.8 | 0.8 | 53.3 |
| ABot-M0 (bs=32, 30K) | 97.1 | 82.5 | 7.1 | 22.3 | 52.2 |
| PRTS (Ours) (bs=32, 30K) | 97.0 | 82.3 | 24.3 | 31.5 | 58.8 |
On the Task axis, π0 and π0.5 collapse below 1 %, and the strongest comparable VLA (ABot-M0) reaches only 22.3 %. PRTS holds 31.5 % — although π0.5 achieves the second-best average, PRTS surprisingly outperforms it by a large margin (+30.7) on this hardest axis.
We deploy PRTS on a 14-DoF dual-arm RealMan platform (11 tasks) and a 7-DoF Flexiv single-arm platform (3 tasks). All three policies (PRTS, π0, π0.5) share identical post-training data, schedule, and 20-trial physical evaluation protocol.
| Method | RealMan dual-arm (avg over 11 tasks) | Flexiv single-arm (avg over 3 tasks) |
|---|---|---|
| π0 | 67.3 | 60.0 |
| π0.5 | 85.5 | 75.0 |
| PRTS (Ours) | 95.9 | 90.0 |
PRTS hits ≥ 90 % on every one of the 11 RealMan tasks and 100 % on four of them. On the genuinely long-horizon Office Long Term task (~2 min of continuous bimanual operation), π0.5 collapses to 40 % under multi-task interference while PRTS holds 95 %.
The cleanest test of "does the policy follow the language goal?" is to take a deployed task and change the language instruction to recombine seen primitives in a new way (e.g. Paper Rubbish with a soy-sauce bottle in place of the trash item). All four task-generalization settings are evaluated physically, 20 trials each:
| Method | Paper Rubbish | Place Block | Pick Shoes | Stack Cups | Average |
|---|---|---|---|---|---|
| π0 | 5.0 | 0.0 | 30.0 | 20.0 | 13.8 |
| π0.5 | 65.0 | 15.0 | 35.0 | 25.0 | 35.0 |
| PRTS (Ours) | 80.0 | 55.0 | 85.0 | 75.0 | 73.8 |
+38.8 average margin over π0.5 — the most direct empirical evidence that PRTS's value head ties the language goal to feasible state-action outcomes.
(a) Aggregate token throughput vs. number of H100 GPUs (log-log) — PRTS retains ≥ 85 % of perfect linear scaling up to 64 GPUs. (b) Per-layer attention forward time at matched packing — our role-aware CuTe-FlashAttention kernel sits at 0.531 ms / layer (1.18 × of the BC-only FA3 reference at 0.45 ms), versus 1.23 ms (2.7 ×) and 3.95 ms (8.8 ×) for the alternative FlexAttention realizations of the same role-aware mask.
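As a quick sanity check, the overhead ratios in panel (b) follow directly from the quoted per-layer times:

```python
# Numbers quoted above: FA3 reference 0.45 ms/layer; our role-aware kernel
# 0.531 ms; the two FlexAttention realizations 1.23 ms and 3.95 ms.
fa3_ms = 0.45
ours_ms, flex_fast_ms, flex_slow_ms = 0.531, 1.23, 3.95

overheads = {name: t / fa3_ms for name, t in
             [("ours", ours_ms), ("flex_fast", flex_fast_ms), ("flex_slow", flex_slow_ms)]}
# overheads recover the 1.18x / 2.7x / 8.8x figures cited in the text.
```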
If you find PRTS useful, please consider citing:
```bibtex
@article{zhang2026prts,
  title   = {PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations},
  author  = {Yang Zhang and Jiangyuan Zhao and Chenyou Fan and Fangzheng Yan and Tian Li and Haitong Tang and Sen Fu and Xuan'er Wu and Qizhen Weng and Weinan Zhang and Xiu Li and Chi Zhang and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2604.27472},
  year    = {2026},
}
```

Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0). See LICENSE for details. Released model weights and code are free for academic and non-commercial use; commercial use is not permitted under this license.
PRTS builds on Qwen3-VL, FlashAttention, LeRobot, and OpenPI. We thank the authors of Contrastive RL for the ideas behind the contrastive value formulation.



