From Vision-Language-Action Models to a Real-World Robot Learning Stack
Tencent Robotics X × Tencent Hy Team
teaser.mp4
[2026-06-16]🌐 Added ModelScope links for all models and data.[2026-06-15]🚀 We have released Hy-Embodied-0.5-VLA — including the codebase, theHy-Embodied-0.5-VLA-UMI(ModelScope) andHy-Embodied-0.5-VLA-RoboTwin(ModelScope) models, and theHy-Embodied-0.5-VLA-Dataegocentric UMI dataset (2K+ hours)!
We introduce Hy-Embodied-0.5-VLA (Hy-VLA) — an end-to-end Vision-Language-Action system that spans the full robot learning stack: data collection, model design, pre-training, supervised fine-tuning, RL post-training, and real-world deployment. Built on the Hy-Embodied-0.5 MoT backbone, Hy-VLA integrates a flow-matching action expert, a compact memory encoder for multi-frame history, and a delta-chunk action representation decoupled from embodiment-specific kinematics.
Powered by 10,000+ hours of high-fidelity UMI demonstrations collected via a custom fingertip interface with optical motion-capture, Hy-VLA achieves state-of-the-art results on the RoboTwin 2.0 benchmark (90.9% / 90.1% on Clean / Randomized) and demonstrates robust cross-embodiment transfer across four real-world robot platforms. Paired with FlowPRO preference optimization and an asynchronous inference framework, Hy-VLA establishes a scalable paradigm for continuous dexterous manipulation.
- 🧠 Unified VLA Architecture: Extends the Hy-Embodied-0.5 MoT backbone with a dual-tower flow-matching action expert. The VLM tower handles vision-language understanding while the action expert generates continuous action chunks — all tied together through shared cross-modal attention.
- 🎯 Delta-Chunk Action Representation: Actions are predicted as relative-to-current-frame end-effector delta chunks, decoupling the policy from embodiment-specific kinematics and enabling seamless cross-embodiment transfer.
- 📹 Compact Memory Encoder: A parameter-free temporal-spatial attention mechanism interleaved within the ViT encoder compresses K-frame multi-view history into current-frame tokens, preserving temporal context without inflating the token budget.
- 📊 Hy-UMI-10K Dataset: 10K+ hours of sub-millimeter precision dual-arm demonstrations across 70+ tasks, collected with a custom fingertip UMI rig tracked by an optical motion-capture system. 2K+ hours are publicly released.
- 🚀 FlowPRO Post-Training (under review): A critic-free preference optimization algorithm that converts real-robot failure interventions into rapid policy improvement without reward models.
- ⚡ Asynchronous Deployment Stack: Producer-consumer inference with cubic Bézier chunk stitching enables high-frequency closed-loop control across heterogeneous robot platforms.
Hy-Embodied-0.5-VLA/
├── hy_vla/ # Core model definition, training, and inference
│ ├── modeling_hy_vla.py # HyVLA model class
│ ├── modeling_dual_tower.py # Dual-tower transformer (VLM + action expert)
│ ├── configuration_hy_vla.py # Model configuration
│ ├── space_time_attention.py # Temporal-spatial attention for memory encoder
│ ├── data/ # Dataloader and dataset utilities
│ ├── config/ # YAML configuration files
│ └── hunyuan_vl_mot/ # Vendored Hy-Embodied VLM backbone (fallback)
├── scripts/ # Training, evaluation, and preprocessing scripts
│ ├── quick_start.py # Fast smoke-test for a released checkpoint
│ ├── train_umi_vlm.sh # Stage-1 pre-training launcher
│ ├── train_robotwin_vlm.sh # Stage-2 SFT from VLM backbone
│ ├── train_robotwin_umi.sh # Stage-2 SFT from UMI pretrain
│ ├── train_table_vlm.sh # Single-table fast-iteration training
│ ├── eval_robotwin_test.sh # Quick RoboTwin regression (6 tasks)
│ ├── eval_robotwin_full.sh # Full RoboTwin sweep (50 tasks × 100 rollouts)
│ ├── compute_norm_lance.py # Pre-compute norm stats from Lance data
│ ├── compute_norm_hdf5.py # Pre-compute norm stats from HDF5 data
│ └── vis_umi_episode.py # Render an episode as MP4
├── robotwin_eval/ # RoboTwin adapter for evaluation
├── assets/ # Example data and index files
└── pyproject.toml # Python project configuration (uv/pip)
- 🖥️ OS: Linux (recommended)
- 🐍 Python: 3.12 (recommended and tested)
- ⚡ CUDA: 12.x
- 🔥 PyTorch: ≥ 2.4
- 🎮 GPU: NVIDIA GPU with CUDA support (≥ 16 GB VRAM recommended)
git clone https://github.com/Tencent-Hunyuan/Hy-Embodied-0.5-VLA
cd Hy-Embodied-0.5-VLA
# One-off: install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Materialize the virtual environment
uv syncpip install -r requirements.txtNote: Hy-VLA depends on an upstream
transformersfork that supports the Hy-Embodied MoT backbone. The pinned commit is specified in bothrequirements.txtandpyproject.toml. If the fork URL is unreachable, a verbatim vendor copy athy_vla/hunyuan_vl_mot/serves as fallback.
The fastest way to verify a fresh install is the bundled smoke test:
import torch
from huggingface_hub import snapshot_download
from hy_vla import HyVLA, HyVLAConfig
ckpt = snapshot_download("tencent/Hy-Embodied-0.5-VLA-RoboTwin")
config = HyVLAConfig.from_pretrained(ckpt)
policy = HyVLA.from_pretrained(ckpt, config=config)
policy.enable_video_encoder_if_needed()
policy = policy.to(device="cuda", dtype=torch.bfloat16).eval()
# (B, K, C, H, W); K=6 history slots
img = torch.zeros(1, 6, 3, 224, 224, device="cuda", dtype=torch.bfloat16)
# Normalized dual-arm EEF: [xyz(3) + rot6d(6) + gripper(1)] * 2
state = torch.zeros((1, config.max_state_dim), device="cuda", dtype=torch.bfloat16)
batch = {
"observation.images.top_head": img,
"observation.images.hand_left": img,
"observation.images.hand_right": img,
"observation.state": state,
"task": ["pick up the bottle"],
}
with torch.no_grad():
actions = policy.forward_evaluate(batch)["pred"]
actions = actions[..., : config.action_feature.shape[0]]
print(actions.shape)Or simply:
python scripts/quick_start.pyHy-VLA follows the Vision-Language-Action paradigm built on three components:
Backbone — Hy-Embodied-0.5 MoT. A Mixture-of-Transformers architecture with modality-adaptive computation. Visual tokens are routed through dedicated vision-specific parameters while text tokens use the original language parameters; cross-modal interaction is limited to shared self-attention layers. The backbone encodes images at native resolution via Hy-ViT 2.0.
Action Expert — Dual-Tower Flow Matching. Rather than discretizing actions into language tokens, a dedicated 370M-parameter action expert models the continuous action distribution via conditional flow matching. The VLM tower and action expert tower share attention, allowing grounded vision-language context to guide continuous action generation. At inference, actions are generated by integrating the learned velocity field over 10 Euler steps with KV-cached observation prefixes.
Compact Memory Encoder. Interleaved temporal-spatial attention within the ViT encoder compresses K-frame multi-view history. Temporal attention (causal across frames) and spatial attention (bidirectional within each frame) reuse the same QKV projections — zero new parameters relative to the single-image encoder. Past-frame tokens are discarded after temporal mixing, keeping the VLM token count constant regardless of history length.
| Model | HuggingFace | ModelScope | Description |
|---|---|---|---|
| Hy-VLA-UMI | tencent/Hy-Embodied-0.5-VLA-UMI |
Tencent-Hunyuan/Hy-Embodied-0.5-VLA-UMI |
Pre-trained on the Hy-UMI-10K corpus; intended as a generalist starting point for fine-tuning |
| Hy-VLA-RoboTwin | tencent/Hy-Embodied-0.5-VLA-RoboTwin |
Tencent-Hunyuan/Hy-Embodied-0.5-VLA-RoboTwin |
Post-trained on the RoboTwin 2.0 50-task benchmark |
Both checkpoints are self-contained — they ship their own tokenizer.json, vlm_config_dict, chat_template.jinja, and pre-computed normalization statistics (norm_stats.pkl). If needed, you can regenerate normalization stats on your own data using scripts/compute_norm_lance.py.
Hy-VLA is pre-trained on Hy-Embodied-0.5-VLA-Data, a large-scale bimanual manipulation dataset with 250K+ episodes spanning 10K+ hours of dual-arm teleoperation trajectories, wiht 2K+ hours released. The open-source release contains approximately one-fifth of the full training corpus, partitioned into 22 Lance-format tables compatible with LeRobot v3.0.
| Attribute | Value |
|---|---|
| Format | Lance (LeRobot v3.0 schema) |
| Resolution | 240 × 424 px |
| Cameras | cam_high (head), cam_left_wrist, cam_right_wrist |
| State | 16-dim dual-arm EEF (pos + quat + gripper per arm) |
| FPS | 30 |
| Total Episodes | 250K+ |
| Total Duration | 2000+ hours |
LanceTableReader reads a single Lance table (local or HF Hub):
from hy_vla.data.lance_dataset import LanceTableReader
# Local directory
reader = LanceTableReader(root="./table_000")
# HF Hub
reader = LanceTableReader(
repo_id="tencent/Hy-Embodied-0.5-VLA-Data",
table_name="table_000",
)
# Access
frame = reader[42] # single frame dict
episode = reader.get_episode(3) # all frames of episode 3Also compatible with raw
lance,lancedb, andlerobot-lancedb(LeRobotLanceDataset).
table_000_episode_666.mp4
You can render any episode locally:
# Use the HF Hub dataset, pick table_000 episode 666
python scripts/vis_umi_episode.py -t table_000 -e 666
# Local Lance root
python scripts/vis_umi_episode.py /path/to/Hy-Embodied-0.5-Data -e 0 --no-3dHy-VLA supports multiple training workflows, each with corresponding evaluation results.
The VLM tower is initialized from tencent/HY-Embodied-0.5 and the action expert is randomly initialized. Training uses the full 10K-hour UMI corpus under the flow-matching objective (Eq. 1) with history length K=1 and action chunk horizon H=50 at 10 Hz. The model is trained for 200K steps with a global batch size of 1,024.
# Single-table fast iteration
export TABLE_NAME=table_001
bash scripts/train_table_vlm.sh
# Full corpus (64 GPUs)
export CHIEF_IP=<chief-ip> INDEX=0
bash scripts/train_umi_vlm.shStarting from the pre-trained checkpoint, SFT activates the compact memory encoder (K=6 frames) and fine-tunes on task-specific demonstrations. For real-world deployment, we train for 60K steps with batch size 32; for RoboTwin 2.0, we use batch size 128 with action downsampling (stride 3).
# --- Training ---
export CHIEF_IP=<chief-ip> INDEX=0
bash scripts/train_robotwin_umi.sh # Fine-tune from Hy-VLA-UMI on RoboTwin
# --- Evaluation ---
export ROBOTWIN_DIR=/path/to/RoboTwin
export CKPT_PATH=tencent/Hy-Embodied-0.5-VLA-RoboTwin
# Quick regression (6 tasks × 10 rollouts)
bash scripts/eval_robotwin_test.sh
# Full sweep (50 tasks × 100 rollouts, 8 GPUs)
bash scripts/eval_robotwin_full.shNote: The eval scripts automatically symlink
Hy-VLA/robotwin_eval/→RoboTwin/policy/hy_vla, so that RoboTwin'seval_policy.pycan discover the Hy-VLA policy adapter without any manual configuration.
| Method | Clean | Randomized |
|---|---|---|
| π₀ | 65.9 | 58.4 |
| ABot-M0 | 81.2 | 80.4 |
| π₀.₅ | 82.7 | 76.8 |
| Qwen-VLA | 86.1 | 87.2 |
| LingBot-VLA | 86.5 | 85.3 |
| starVLA | 88.2 | 88.3 |
| Motus | 88.7 | 87.0 |
| JoyAI-RA | 90.5 | 89.3 |
| Hy-VLA | 90.9 | 90.1 |
Success rate (%) averaged over 100 rollouts per task × 50 tasks. Best in bold.
We validate SFT across two deployment tracks:
- Track A (Intra-Embodiment): Fine-tune and evaluate on the same robot (Dobot X-Trainer, 4 tasks)
- Track B (Cross-Embodiment): Fine-tune only on UMI demonstrations and deploy to morphologically different robots (JAKA K1, Astribot S1)
| Method | Set the Table | Fold & Store Glasses | Zip Up the Pen Case | Insert Bottles |
|---|---|---|---|---|
| π₀ | 79% | 67% | 48% | 80% |
| π₀.₅ | 88% | 75% | 57% | 84% |
| Hy-Embodied (w/o UMI pretrain) | 80% | 65% | 43% | 70% |
| Hy-VLA (Ours) | 83% | 94% | 73% | 94% |
Four bimanual tasks on the Dobot X-Trainer. UMI pre-training provides significant gains on precision-critical tasks (glasses folding, zipper manipulation) compared to the baseline without UMI data.
| Method | JAKA K1 · Organize Accessory | Astribot S1 · Clean Up Table |
|---|---|---|
| π₀ | 88% | 87% |
| π₀.₅ | 81% | 89% |
| Hy-Embodied (w/o UMI pretrain) | 38% | 44% |
| Hy-VLA (Ours) | 90% | 89% |
Cross-embodiment deployment to JAKA K1 and Astribot S1, fine-tuned only on UMI demonstrations without any target-robot teleoperation data. The ablation without UMI pre-training collapses, confirming that the large-scale UMI corpus is essential for embodiment-agnostic action priors.
Beyond standard SFT, Hy-VLA supports FlowPRO — a critic-free preference optimization algorithm (under review, code coming soon) that converts real-robot failure interventions into policy improvement. The RPRO loss directly contrasts preferred and dispreferred action chunks per state, with a symmetric proximal regularizer that prevents reward hacking.
FlowPRO operates iteratively:
- Collect preference pairs via teleoperated intervention-and-rollback
- Convert sparse corrections into dense per-state tuples via Smooth Interpolation
- Optimize with the RPRO loss on mixed batches
| Method | Bottle | Cap | USB | Zip |
|---|---|---|---|---|
| DAgger | 93 ± 2.1% / 27s | 88 ± 1.8% / 29s | 86 ± 2.4% / 25s | 83 ± 2.0% / 55s |
| π₀.₆* | 95 ± 1.5% / 24s | 95 ± 1.2% / 27s | 95 ± 1.4% / 23s | 89 ± 1.6% / 45s |
| RPRO (Ours) | 99 ± 0.6% / 16s | 99 ± 0.7% / 21s | 98 ± 0.9% / 22s | 94 ± 1.1% / 37s |
Success rate (mean ± std over 3 seeds, 100 rollouts each) and mean completion time after K=3 rounds of post-training on four X-Trainer bimanual tasks. RPRO achieves near-ceiling success rates with substantially shorter completion times.
Required before the first training run:
python scripts/compute_norm_lance.py \
--lance-source tencent/Hy-Embodied-0.5-VLA-Data \
--output norm_stats.pklThe released checkpoints already ship with pre-computed norm stats. Use this script only if you are training on custom data.
We thank the Hugging Face and LeRobot communities for their infrastructure and tooling. This work builds upon Hy-Embodied-0.5, FlowPRO, and the RoboTwin 2.0 benchmark.
If you find Hy-VLA useful for your research, please cite:
@article{tencent2026hyembodied05vla,
title={Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack},
author={Tencent Robotics X and Tencent Hy Team},
journal={arXiv preprint arXiv:2606.14409},
year={2026}
}Released under the Apache-2.0 License. See LICENSE for full terms.


