Wanpeng Zhang1,3, Hao Luo1,3, Sipeng Zheng3, Yicheng Feng1,3, Haiweng Xu1,3,
Ziheng Xi2,3, Chaoyi Xu1,3, Haoqi Yuan1,3, Zongqing Lu1,3,†
1Peking University
2Tsinghua University
3BeingBeyond
PTR (Posterior-Transition Reweighting) is a reward-free, conservative post-training method for robot policies that uses post-action consequences to decide which logged samples deserve a larger gradient budget. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a pool of mismatched alternatives, and asks whether the matched future can be identified from the current context and the logged action chunk. The resulting posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original supervised action objective via self-normalized weighted regression.
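The scoring-and-reweighting pipeline described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, the softmax-over-similarities posterior, and the specific `clip_max`/`mix` hyperparameters are all assumptions made for the example.

```python
import numpy as np

def ptr_score(sim_to_candidates, matched_idx):
    """Posterior-to-uniform ratio for one sample (illustrative).

    sim_to_candidates: (K,) similarity logits between the (context, action
    chunk) pair and K candidate future latents (1 matched, K-1 mismatched).
    """
    K = sim_to_candidates.shape[0]
    logits = sim_to_candidates - sim_to_candidates.max()  # numerical stability
    posterior = np.exp(logits) / np.exp(logits).sum()     # P(candidate | context, action)
    return K * posterior[matched_idx]                     # posterior / uniform (1/K)

def ptr_weighted_loss(per_sample_losses, scores, clip_max=2.0, mix=0.5):
    """Clipped-and-mixed weights applied via self-normalized weighted regression."""
    clipped = np.clip(scores, 0.0, clip_max)   # cap the influence of any one sample
    weights = (1.0 - mix) + mix * clipped      # mix with a uniform baseline weight
    weights = weights / weights.sum()          # self-normalization across the batch
    return float((weights * per_sample_losses).sum())
```

When every sample's score is 1 (i.e. the posterior matches the uniform prior), the weights collapse to uniform and the objective reduces to the ordinary mean supervised loss, which is the conservative baseline behavior.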
- [2026-03-17]: We released PTR! Check out our paper here. 🔥🔥🔥
If you find our work useful, please consider citing:
```bibtex
@article{zhang2026conservative,
  title={Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting},
  author={Zhang, Wanpeng and Luo, Hao and Zheng, Sipeng and Feng, Yicheng and Xu, Haiweng and Xi, Ziheng and Xu, Chaoyi and Yuan, Haoqi and Lu, Zongqing},
  journal={arXiv preprint arXiv:2603.16542},
  year={2026}
}
```