An open specification for task-level demonstration data for vision–language–action (VLA) models. Six fields. Controlled vocabularies. Apache-2.0. Draft v1.
- What this is
- Why it exists
- Core fields (v1 draft)
- Field-level rationale
- Out of scope for v1
- How to use
- Interoperability
- Extended design notes
- Status and roadmap
- Related projects
- Citation
- License
- Contributing
- 中文简介
menily/schema is a JSON specification that defines one unit of task-level demonstration data for training vision–language–action (VLA) models. A single file = a single task. Every file conforms to the same six top-level fields: task_id, language, visual, action, body, meta.
The specification is language-agnostic (JSON). A Python reference implementation is available in menily/toolkit.
VLA models (π0, OpenVLA, NVIDIA GR00T N1, Gemini Robotics, Ψ₀, ...) consume something very specific: a task-level trajectory where a natural-language goal is paired with a visual context and an action sequence that, together, form one semantically closed unit of robot behavior.
This is different from:
- raw video (no semantic boundaries)
- motion capture files (no task annotation, no language)
- RLHF datasets (reward signals, not demonstrations)
- teleoperation traces alone (no language grounding)
As of April 2026 there is no publicly released, interoperable specification for this layer. Every laboratory and company invents its own format. Cross-institutional data pooling is broken. Open datasets can't be merged. Tooling can't be reused.
menily/schema is one attempt at a common ground. Its twin goals:
- Interoperate with Open X-Embodiment / RLDS (trajectory layer) downstream and BONES-SEED / NVIDIA SOMA (motion layer) upstream.
- Be adoptable by labs that don't want to be locked into a single vendor — the schema is Apache-2.0 and requires no runtime dependencies on Menily Intelligence.
```jsonc
{
  "schema_version": "menily.task-demo/1",
  "task_id": "uuid",
  "language": {
    "instruction": "Pour water from the blue cup into the kettle.",
    "language_code": "en",
    "variants": ["给水壶加水", "…"]
  },
  "visual": {
    "frames": "path/to/frames/",
    "fps": 30,
    "camera_intrinsics": { "fx": 1128.5, "fy": 1128.5, "cx": 960, "cy": 540 },
    "viewpoint": "ego"
  },
  "action": {
    "space": "ee_6dof",
    "trajectory": [ /* N × action_dim */ ],
    "timestamps": [ /* N */ ],
    "gripper": [ /* N × 1 */ ]
  },
  "body": {
    "morphology": "bimanual_humanoid",
    "dof_map": {
      "right_arm": [0, 1, 2, 3, 4, 5, 6],
      "left_arm": [7, 8, 9, 10, 11, 12, 13]
    }
  },
  "meta": {
    "source": "pov_video",
    "collection_region": "SEA",
    "collection_time": "2026-01-14T08:20:00Z",
    "quality_flags": ["no_slip", "no_contact_gap"]
  }
}
```

| Field | Allowed values |
|---|---|
| `visual.viewpoint` | `"ego"` \| `"third-person"` \| `"overhead"` |
| `action.space` | `"ee_6dof"` \| `"joint_Ndof"` \| `"whole_body_Mdof"` |
| `body.morphology` | `"single_arm"` \| `"bimanual"` \| `"bimanual_humanoid"` \| `"mobile_manipulator"` \| `"quadruped"` \| `"humanoid"` |
| `meta.source` | `"pov_video"` \| `"vr_demo"` \| `"mocap"` \| `"teleop"` \| `"sim_generated"` |
| `meta.collection_region` | `"NA"` \| `"EU"` \| `"SEA"` \| `"EA"` \| `"SA"` \| `"AF"` \| `"OC"` |
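For illustration, the vocabularies above can be checked with nothing but the standard library. This is a minimal sketch, not the official menily/toolkit validator; `check_vocab` is a hypothetical helper, and the fixed sets plus the N/M-parameterized `action.space` patterns are taken directly from the table:

```python
import re

# Fixed vocabularies from the table above.
VOCAB = {
    "visual.viewpoint": {"ego", "third-person", "overhead"},
    "body.morphology": {"single_arm", "bimanual", "bimanual_humanoid",
                        "mobile_manipulator", "quadruped", "humanoid"},
    "meta.source": {"pov_video", "vr_demo", "mocap", "teleop", "sim_generated"},
    "meta.collection_region": {"NA", "EU", "SEA", "EA", "SA", "AF", "OC"},
}

# action.space is parameterized: joint_Ndof / whole_body_Mdof for integer N, M.
ACTION_SPACE = re.compile(r"^(ee_6dof|joint_\d+dof|whole_body_\d+dof)$")

def check_vocab(field: str, value: str) -> bool:
    """Return True if `value` is allowed for `field` under the v1 draft."""
    if field == "action.space":
        return bool(ACTION_SPACE.match(value))
    return value in VOCAB[field]

assert check_vocab("action.space", "joint_7dof")
assert not check_vocab("visual.viewpoint", "wrist")  # not in the v1 vocabulary
```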
| Field | Decision | Why |
|---|---|---|
| `language.variants` | Recommended-required | Multilingual paraphrase is ~zero marginal cost (LLM-generated) and critical for deployment robustness. |
| `visual.viewpoint` | Controlled vocabulary | Ego and third-person views are qualitatively different training signals; mixing them without labels degrades visual encoders. |
| `action.space` | Controlled vocabulary; single space per file | Implicit action spaces are the most common cause of silent training corruption. |
| `body.morphology` | Required | Cross-embodiment transfer is unrecoverable without explicit morphology. |
| `body.dof_map` | Required | The DoF index → joint mapping is not discoverable from raw trajectories. |
| `meta.source` | Controlled vocabulary | Different sources have qualitatively different noise profiles; downstream cleaning depends on this. |
| `meta.collection_region` | Top-level field | Geographic distribution is a commonly ignored bias source; making it first-class forces awareness. |
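The `body.dof_map` rationale can be made concrete with a toy sketch in plain Python (synthetic values; `limb_trajectory` is a hypothetical helper, not part of menily/toolkit): given a flat N × action_dim matrix, the map is the only way to recover which columns belong to which limb.

```python
# 14-DoF bimanual example, mirroring the dof_map in the JSON sample above.
dof_map = {
    "right_arm": [0, 1, 2, 3, 4, 5, 6],
    "left_arm":  [7, 8, 9, 10, 11, 12, 13],
}

# Synthetic N=3 trajectory: step t, joint j holds the value t*14 + j.
trajectory = [[float(t * 14 + j) for j in range(14)] for t in range(3)]

def limb_trajectory(trajectory, dof_map, limb):
    """Slice one limb's columns out of the flat per-step action vectors."""
    idx = dof_map[limb]
    return [[step[i] for i in idx] for step in trajectory]

left = limb_trajectory(trajectory, dof_map, "left_arm")
assert len(left) == 3 and len(left[0]) == 7
assert left[0] == [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
```

Without the explicit map, this column assignment would be guesswork, which is why the field is required rather than optional metadata.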
The following are deliberately excluded from v1 scope:
- ❌ Reward / return-to-go fields — menily/schema is not an RL data format. Use D4RL or RLDS for reinforcement learning datasets.
- ❌ Scene graphs — scene parsing is a downstream task; visual tokens come from frames.
- ❌ Human biometric metadata — Menily does not collect it, and no schema field is reserved for it.
- ❌ Embedded URDF / MJCF — body morphology is a compact index; full physics models are referenced externally.
Any JSON parser will do. Below is Python using the menily/toolkit helper:

```python
from menily.toolkit import schema

task = schema.TaskLevelDemoV1.load("./task_001.json")
print(task.language.instruction)    # "Pour water from the blue cup..."
print(task.action.space)            # "ee_6dof"
print(task.body.morphology)         # "bimanual_humanoid"
print(len(task.action.trajectory))  # N

report = task.validate()
assert report.passed
for w in report.warnings:
    print("warning:", w)
```

Or as a standalone CLI (PyPI release pending):

```bash
menily-schema validate ./task_001.json
```

Producing task files from a first-person video with the POV adapter:

```python
from menily.toolkit import pov, schema

tasks = pov.segment(
    video_path="./demo.mp4",
    language="Pour water from the blue cup into the kettle.",
    language_variants=["把蓝色杯子里的水倒进水壶里"],
    fps=30,
    viewpoint="ego",
    body_morphology="bimanual_humanoid",
    collection_region="SEA",
)
for task in tasks:
    task.save_as(schema="menily.task-demo/1", out_dir="./out/")
```

Designed to interoperate with existing standards:
| Direction | Target | Method |
|---|---|---|
| Downstream | Open X-Embodiment / RLDS | `Task.to_rlds()` |
| Downstream | HuggingFace Datasets | `Task.to_hf_dataset()` |
| Upstream | NVIDIA SOMA / SOMA-X | `body.morphology` + `body.dof_map` namespace-aligned |
| Upstream | BONES-SEED | Consumed as motion source; task-level semantics overlaid |
| Bidirectional | RLDS | `from_rlds()` converts Open X-Embodiment data in |
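As a rough illustration of the downstream direction (a sketch, not the actual `Task.to_hf_dataset()` implementation; `flatten_task` and the sample dict are invented for this example), a task file can be flattened into per-timestep records of the kind columnar dataset libraries expect, one row per timestep with task-level fields repeated:

```python
def flatten_task(task: dict) -> list[dict]:
    """Flatten one menily.task-demo/1 dict into per-timestep rows.
    Task-level fields (instruction, morphology, ...) repeat on every row."""
    rows = []
    for i, (ts, act) in enumerate(zip(task["action"]["timestamps"],
                                      task["action"]["trajectory"])):
        rows.append({
            "task_id": task["task_id"],
            "instruction": task["language"]["instruction"],
            "step": i,
            "timestamp": ts,
            "action": act,
            "morphology": task["body"]["morphology"],
        })
    return rows

# Hypothetical two-step task, trimmed to the fields the converter touches.
demo = {
    "task_id": "0000-demo",
    "language": {"instruction": "Pour water."},
    "action": {"timestamps": [0.0, 0.033], "trajectory": [[0.1] * 6, [0.2] * 6]},
    "body": {"morphology": "bimanual_humanoid"},
}
rows = flatten_task(demo)
assert len(rows) == 2 and rows[1]["step"] == 1
assert rows[0]["instruction"] == "Pour water."
```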
A long-form walkthrough of every field decision — why language.variants is recommended-required, why action.space is a controlled vocabulary, why body.morphology + body.dof_map are the real keys to cross-embodiment transfer, and the 15→6 field consolidation process:
- 📝 VLA 任务级示教数据 schema 设计笔记:Menily/schema v1 规范与六字段解析 — Masashi, CSDN, April 2026
- 📝 给 VLA 训练数据设计一份 schema:六字段是怎么砍下来的 — Masashi, cnblogs.com, April 2026 (retrospective on field cuts)
- 📄 Task-Level Demonstration Data for VLA Models: A Survey — 12-page preprint, April 2026
v1 is a draft, not a finalized standard. Field-level critique via GitHub Issues is welcome and actively incorporated.
- v1 field set frozen (6 top-level fields)
- Controlled vocabularies defined
- Interoperability tested against RLDS + HF Datasets
- Reference validator CLI (pending menily/toolkit PyPI release)
- Worked examples for Unitree G1, Fourier GR-1, Apptronik Apollo
- Schema v2 — long-horizon task decomposition, multi-agent scenarios, `invariant_landmarks` waypoint schema
| Repo | Description |
|---|---|
| menily/toolkit | Python reference implementation — three adapters (POV / VR / MoCap) + schema validator |
| menily/research | Research notes on design decisions behind this schema |
| menily.ai | Organization site — team, publications, contact |
If you use menily/schema in research, please cite:
```bibtex
@misc{menily2026schema,
  author       = {Masashi},
  title        = {menily/schema: A Task-Level Demonstration Data
                  Specification for Vision-Language-Action Models},
  year         = {2026},
  howpublished = {Menily Intelligence, Apache-2.0 open specification},
  url          = {https://github.com/MenilyIntelligence/schema},
  note         = {Version menily.task-demo/1, draft v1}
}
```

The companion survey paper:
```bibtex
@misc{masashi2026tasklevel,
  author       = {Masashi},
  title        = {Task-Level Demonstration Data for Vision-Language-Action
                  Models: A Survey of Schemas, Adapters, and
                  Cross-Embodiment Transfer},
  year         = {2026},
  month        = {April},
  howpublished = {Menily Intelligence Research, self-hosted preprint},
  url          = {https://www.menily.ai/research/01-task-level-vla-data-survey.pdf},
  note         = {Draft v0.1}
}
```

Apache License 2.0 — see LICENSE (to be added with first tagged release).
- 🐛 Bug reports & clarifications → open an Issue
- 💡 Field-level design proposals → PRs welcome for spec text; discuss in an Issue first
- 📧 Direct technical discussion → Masashi@Menily.AI
- 🌐 Organization → github.com/MenilyIntelligence · menily.ai
v1 is a draft — we expect two kinds of feedback:
- Field-level critique — naming, semantics, granularity, split/merge decisions.
- Format mapping requests — if your team has existing data pipelines and wants to see how they map to menily/schema, email to open a discussion.
menily/schema 是一份针对 VLA(视觉-语言-动作)模型训练的任务级示教数据规范。定义 task_id / language / visual / action / body / meta 六大顶层字段,统一异构数据源的格式,便于跨机构数据池化与跨具身迁移。
设计目标:与 Open X-Embodiment / RLDS(轨迹层)向下兼容,与 BONES-SEED / NVIDIA SOMA(动作层)向上兼容,填补中间的任务级语义层。
v1 是草案,欢迎通过 GitHub Issues 提字段设计建议,或邮件 Masashi@Menily.AI 讨论现有数据格式与 menily/schema 的互转方案。
更长的设计笔记:VLA 任务级示教数据 schema 设计笔记(CSDN)