RLWRLD/RLDX-1

RLDX-1 is a Vision-Language-Action model (VLA) for human-like dexterous manipulation. Beyond the versatile intelligence inherited from pre-trained VLM backbones, RLDX-1 adds three functional capabilities — motion awareness, long-term memory, and physical sensing — through a unified Multi-Stream Action Transformer (MSAT) architecture, a synthetic-augmented training pipeline, and a real-time inference stack.


RLDX-1 architecture

Highlights

  • Multi-Stream Action Transformer (MSAT). Cognition, physics, and action each get a dedicated stream coupled by joint self-attention — an extension of MM-DiT to action modeling.
  • Motion awareness. Multi-frame observations + a motion module capture temporal dynamics; intermediate VLM layers compress video tokens to keep the policy efficient.
  • Long-term memory. A memory module fuses past cognition features with the current ones for history-grounded decisions beyond a short multi-frame window.
  • Physical sensing. Tactile and torque enter as a dedicated physics stream; the decoder is jointly trained to predict future physical signals.
  • Three-stage training. Pre-training (generalization) → mid-training (functionality) → post-training (task adaptation), with synthetic data augmenting rare manipulation scenarios.
  • Real-time inference. Static graph capture + custom fused kernels bring the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz).

Performance

Simulation Benchmarks

Success rates (%) of RLDX-1 fine-tuned on each benchmark's training set, compared to recent frontier VLA baselines.

| Method | LIBERO (Avg) | LIBERO-Plus | SIMPLER Google-VM | SIMPLER Google-VA | SIMPLER WidowX | RoboCasa Kitchen | GR-1 Tabletop | RoboCasa365 (Avg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| π0-FAST | 85.5 | 64.2 | 61.9 | 59.0 | 48.3 | 63.6 | 21.7 | – |
| π0 | 94.1 | 54.6 | 58.8 | 54.8 | 27.1 | 62.5 | 13.6 | 14.8 |
| π0.5 | 96.9 | 86.5 | 72.7 | 68.4 | 46.9 | 62.1 | 15.4 | 16.9 |
| GR00T N1.5 | 86.5 | 66.3 | 52.4 | 43.7 | 62.0 | 65.7 | 48.0 | 20.0 |
| GR00T N1.6 | 96.7 | 72.6 | 76.1 | 57.1 | 57.1 | 66.2 | 47.6 | 26.9 |
| RLDX-1 (ours) | 97.8 | 86.7 | 81.5 | 77.4 | 71.9 | 70.6 | 58.7 | 32.1 |

The first five columns cover the established LIBERO / SIMPLER family; the last three (RoboCasa Kitchen, GR-1 Tabletop, RoboCasa365) are long-horizon, humanoid, and compositional benchmarks. Per-benchmark checkpoints, embodiment tags, and reproduce commands are listed under Reproducing Benchmark Results.


Installation

Requirements: Python 3.10, CUDA 12.x, uv v0.8.4+

```shell
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

Verify installation:

```shell
uv run python -c "import rldx; print(rldx.__version__)"
```

For simulator setup, dev tooling, and full troubleshooting, see docs/installation.md.


Documentation

Hands-on guides live under docs/:

| Guide | What it covers |
| --- | --- |
| `installation.md` | Environment setup, simulator venvs, dev tooling, common pitfalls |
| `architecture.md` | Five-stage walkthrough of the RLDX-1 model and its config flags |
| `training.md` | `launch_train.py` recipes (fine-tune / mid-train), LoRA, training-time RTC, dataset layout |
| `embodiment_tags.md` | What `EmbodimentTag` is and how to pick one for a custom robot |
| `evaluation.md` | RoboCasa / LIBERO / SIMPLER / GR-1 eval, server + rollout split, results aggregation |
| `inference_server.md` | `run_rldx_server.py` CLI, wire protocol, RTC modes, `--compile` levels, simulator + real-robot deployment |

Pretrained & Midtrained Checkpoints

| Checkpoint | Description | Params | HuggingFace |
| --- | --- | --- | --- |
| RLDX-1-PT | Pre-trained (video) | 6.9B | `RLWRLD/RLDX-1-PT` |
| RLDX-1-MT-DROID | Mid-trained on DROID with all add-ons | 8.1B | `RLWRLD/RLDX-1-MT-DROID` |
| RLDX-1-MT-ALLEX | Mid-trained on ALLEX with all add-ons | 8.1B | `RLWRLD/RLDX-1-MT-ALLEX` |

Data Preparation

RLDX-1 uses LeRobot v2.1 format datasets. To convert your data:

```shell
# Convert a single dataset
bash run_scripts/data/convert_lerobot_single.sh /path/to/your/data

# Convert multiple datasets
bash run_scripts/data/convert_lerobot_multiple.sh /path/to/data/root
```

Each dataset must carry a meta/modality.json that slices the flat state / action vectors into named joint groups and remaps video columns to modality keys. Schema and a worked example are in docs/training.md.
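As a rough illustration of the idea, the sketch below builds such a mapping in Python: a 23-dim state/action vector cut into two named joint groups, plus one camera remap. The field names used here (`start`/`end` slices, `original_key`) are assumptions for illustration only; the authoritative schema and a worked example are in docs/training.md.

```python
# Illustrative sketch only: field names (start/end slices, original_key) are
# assumptions about how flat state/action vectors are cut into joint groups.
# The authoritative modality.json schema is documented in docs/training.md.
import json

modality = {
    "state": {
        "arm_joints":  {"start": 0, "end": 7},    # state[0:7]
        "hand_joints": {"start": 7, "end": 23},   # state[7:23]
    },
    "action": {
        "arm_joints":  {"start": 0, "end": 7},
        "hand_joints": {"start": 7, "end": 23},
    },
    "video": {
        "head_camera": {"original_key": "observation.images.head"},
    },
}

# Would be saved as meta/modality.json inside the dataset directory.
print(json.dumps(modality, indent=2))
```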

Custom Embodiment Config

Define your robot's modality configuration:

```python
# my_modality_config.py
from rldx.data.types import ModalityConfig

MODALITY_CONFIGS = {
    "my_robot": {
        "image": ModalityConfig(...),
        "state": ModalityConfig(...),
        "action": ModalityConfig(...),
    }
}
```

Pass it via --modality-config-path my_modality_config.py during training, together with an EmbodimentTag that selects the per-robot MLP head slot (default: GENERAL_EMBODIMENT; see docs/embodiment_tags.md for the picker).

The EmbodimentTag design and per-embodiment MLP head structure follow the convention introduced by NVIDIA GR00T N1.7.
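To make the head-slot convention concrete, here is a hypothetical sketch: each tag indexes a dedicated per-robot MLP head, and unknown robots fall back to the general slot. The names (`EMBODIMENT_HEADS`, `select_head`, the `*_head` values) are illustrative, not the RLDX-1 API.

```python
# Hypothetical sketch of the per-embodiment head convention: one head slot per
# EmbodimentTag, with GENERAL_EMBODIMENT as the fallback for custom robots.
# Names here are illustrative, not the actual RLDX-1 API.

EMBODIMENT_HEADS = {
    "GENERAL_EMBODIMENT": "general_head",   # default slot for new robots
    "OXE_FRACTAL": "fractal_head",
    "OXE_BRIDGE_ORIG": "bridge_head",
}

def select_head(tag: str) -> str:
    """Pick the action-head slot for a robot; unknown tags fall back to GENERAL."""
    return EMBODIMENT_HEADS.get(tag, EMBODIMENT_HEADS["GENERAL_EMBODIMENT"])
```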


Fine-tuning

This section covers how to fine-tune RLDX-1 from a pre-trained checkpoint (RLWRLD/RLDX-1-PT) on your own LeRobot v2.1 dataset. The training entry point is a single CLI (rldx/experiment/launch_train.py) where flags toggle the optional functional capabilities described in Highlights:

  • --video-length N — temporal frames per observation (motion awareness)
  • --use-memory — temporal memory module (long-term memory)
  • --use-motion — motion module inside the VLM backbone
  • --use-physics --physics-keys ... — tactile / torque streams (physical sensing)

LoRA, training-time RTC, and the full flag list are documented in docs/training.md. Below are the canonical recipes.

Single dataset, no add-ons

```shell
uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-PT \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag GENERAL_EMBODIMENT \
    --video-length 4 \
    --n-cog-tokens 64 \
    --global-batch-size 64 \
    --learning-rate 1e-4 \
    --max-steps 60000 \
    --save-steps 5000 \
    --output-dir ./outputs/my_finetune
```

With all add-ons (memory + motion + physics)

Recommended for embodiments where memory, motion awareness, or contact sensing matter. To enable a single add-on instead of all three, keep just the corresponding --use-* flag(s) and drop the rest.

```shell
uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-PT \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag GENERAL_EMBODIMENT \
    --video-length 4 \
    --use-memory --memory-length 4 --concat-memory \
    --use-motion --motion-insert-layer 9 \
    --use-physics --physics-keys tactile torque --physics-dims 30 7 \
    --new-param-warmup-steps 2000 \
    --n-cog-tokens 64 \
    --global-batch-size 64 \
    --max-steps 60000 \
    --output-dir ./outputs/my_finetune_all
```

Key Training Flags

| Flag | Description | Default |
| --- | --- | --- |
| `--video-length` | Number of video frames (video token compression is always on; set to 1 for single-frame) | 4 |
| `--video-stride` | Stride between frames, in action-step units | 2 |
| `--use-memory` | Enable temporal memory module | False |
| `--memory-length` | Memory context window (timesteps) | 4 |
| `--use-motion` | Enable motion module | False |
| `--use-physics` | Enable physics signal conditioning | False |
| `--n-cog-tokens` | Number of cognition tokens | 64 |
| `--global-batch-size` | Total batch size across GPUs | 64 |
| `--new-param-warmup-steps` | Warmup steps for newly added modules | 0 |
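For intuition on how `--video-length` and `--video-stride` interact, here is a hypothetical sketch of frame selection: the observation at action step t is the current frame plus video_length − 1 earlier frames spaced video_stride steps apart, clamped at the start of the episode. The actual sampling logic lives in the RLDX-1 data loader and may differ.

```python
# Hypothetical sketch of multi-frame observation sampling; the real logic
# lives in the RLDX-1 data loader and may differ (e.g. padding strategy).

def sample_frame_indices(t: int, video_length: int, video_stride: int) -> list[int]:
    """Frame indices observed at action step t: the current frame plus
    video_length - 1 earlier frames, video_stride steps apart, clamped at 0."""
    return [max(t - i * video_stride, 0) for i in range(video_length - 1, -1, -1)]

# With the defaults --video-length 4 --video-stride 2, step 10 observes
# frames [4, 6, 8, 10]; early in an episode the oldest frames clamp to 0.
```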

LoRA fine-tuning

For memory-constrained fine-tunes you can replace full-parameter tuning of the action model (MSAT) and/or the backbone VLM with PEFT LoRA adapters:

```
--action-model-use-lora --action-model-lora-rank 16 --action-model-lora-alpha 32
--backbone-use-lora --backbone-lora-rank 16 --backbone-lora-alpha 32 --backbone-lora-num-layers -1
```

--action-model-use-lora overrides --tune-diffusion-model; --backbone-use-lora overrides --tune-top-llm-layers. Full flag list and target-module defaults are in docs/training.md.
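For reference, the arithmetic behind a LoRA adapter is a scaled low-rank update to a frozen weight, W + (alpha / r) · B A; with rank 16 and alpha 32 as above, the scale alpha / r is 2. The minimal sketch below shows the math only, not the PEFT implementation.

```python
# Minimal LoRA arithmetic sketch (math only, not the PEFT implementation):
# the adapted weight is W + (alpha / r) * B @ A, where A is r x d_in and
# B is d_out x r, so the update has rank at most r.

def lora_delta(A, B, alpha, r):
    """Scaled low-rank update (alpha / r) * B @ A as nested lists."""
    scale = alpha / r
    d_out, d_in = len(B), len(A[0])
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]
```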

Training-time Real-Time Chunking

If you intend to serve the checkpoint with --rtc-inference-mode trained (faster, fullgraph-compatible), enable training-time RTC at training time:

```
--rtc-training-max-delay 4
```

The training-time RTC formulation follows Black et al. (Training-Time Action Conditioning for Efficient Real-Time Chunking); the inference-side counterpart is Black et al. (Real-Time Execution of Action Chunking Flow Policies). See docs/training.md and docs/inference_server.md for usage details.
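A minimal sketch of the training-time idea, under the assumption that `--rtc-training-max-delay` bounds a per-sample delay d: the first d actions of a chunk are treated as already committed (they will be executing while the next chunk is computed at inference time), and only the remainder is supervised. The helper name below is hypothetical.

```python
import random

# Hypothetical sketch of training-time action conditioning: sample a delay d
# (simulated inference latency in action steps, bounded by max_delay), freeze
# the first d actions of the chunk as context, supervise only the rest.

def split_chunk_for_rtc(chunk: list, max_delay: int, rng=random):
    d = rng.randint(0, max_delay)
    frozen, predicted = chunk[:d], chunk[d:]
    return d, frozen, predicted
```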


Inference

RLDX-1 ships two inference paths sharing the same model + processor:

  • In-process — load RLDXPolicy and call get_action(obs) directly from Python. Best for evaluation scripts and notebook prototyping.
  • ZeroMQ server (rldx/eval/run_rldx_server.py) for real-robot deployment, with two orthogonal optimizations layered on top of the base path:
    • Graph capture + kernel fusion (--compile {submodule, fullgraph}) — static-graph CUDA-graph capture and custom fused operators bring the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup over PyTorch eager, >22 Hz).
    • Real-Time Chunking (--rtc-inference-mode {guided, trained}) — chunk-boundary stitching for smooth action handoff between consecutive chunks.

Quick Start

```python
import torch
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-ROBOCASA",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

# Single-step inference
action = policy.get_action(observation)
```
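A control loop around get_action might look like the sketch below. The StubPolicy stands in for RLDXPolicy (which needs a checkpoint and a GPU), and the fixed-rate timing is an assumption for illustration, not the RLDX-1 client code.

```python
import time

# StubPolicy is a placeholder for RLDXPolicy, which needs a checkpoint + GPU.
class StubPolicy:
    def get_action(self, obs):
        return [0.0] * 7  # pretend 7-DoF joint target

def run_episode(policy, get_obs, apply_action, steps=100, hz=20.0):
    """Query the policy once per control step at a fixed rate (assumed timing)."""
    period = 1.0 / hz
    for _ in range(steps):
        start = time.monotonic()
        apply_action(policy.get_action(get_obs()))
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```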

Serving (ZeroMQ)

For real-time robot deployment:

```shell
# Start the policy server
uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-FT-ROBOCASA \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000
```

Real-time inference (graph capture + RTC)

The server brings the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz) through two orthogonal knobs:

--compile {none, submodule, fullgraph} — graph capture + kernel fusion.

  • submodule — compiles each learnable sub-module. Preserves autograd. ~30 s warmup.
  • fullgraph — CUDA-graph capture and operator fusion over the full VLA forward. Lowest steady-state latency, ~90–210 s warmup.
    • Tuned for RTX 5090 (Blackwell, sm_120); on other GPU architectures, use --compile submodule instead.

--rtc-inference-mode {none, guided, trained} — Real-Time Chunking for chunk-boundary stitching.

The full flag list, the compile × RTC compatibility matrix, and a walkthrough of the trade-offs are in docs/inference_server.md.
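For intuition only, chunk-boundary stitching can be pictured as blending the unexecuted tail of the previous chunk into the head of the freshly predicted one over the overlap window. The sketch below is a linear-blend toy, not the guided or trained RTC implementation.

```python
# Toy illustration of chunk-boundary stitching (NOT the RLDX-1 RTC code):
# linearly ramp from the previous chunk's unexecuted tail into the new chunk
# over the overlap window, so the handoff between chunks is smooth.

def stitch_chunks(prev_tail, new_chunk):
    n = min(len(prev_tail), len(new_chunk))
    out = list(new_chunk)
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight on the new chunk, ramping up
        out[i] = [(1 - w) * p + w * q
                  for p, q in zip(prev_tail[i], new_chunk[i])]
    return out
```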


Reproducing Benchmark Results

Each benchmark has a self-contained eval README; this table maps each result row in Performance to the fine-tuned checkpoint we used, the embodiment tag the server expects, and the runnable guide.

| Benchmark | Fine-tuned Checkpoint | Embodiment Tag | Eval Guide |
| --- | --- | --- | --- |
| LIBERO | `RLWRLD/RLDX-1-FT-LIBERO` | `GENERAL_EMBODIMENT` | `run_scripts/eval/libero/README.md` |
| LIBERO-Plus | `RLWRLD/RLDX-1-FT-LIBERO` | `GENERAL_EMBODIMENT` | `run_scripts/eval/libero_plus/README.md` |
| SimplerEnv Google | `RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE` | `OXE_FRACTAL` | `run_scripts/eval/simpler/README.md` |
| SimplerEnv WidowX | `RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX` | `OXE_BRIDGE_ORIG` | `run_scripts/eval/simpler/README.md` |
| GR-1 Tabletop | `RLWRLD/RLDX-1-FT-GR1` | `GENERAL_EMBODIMENT` | `run_scripts/eval/gr1_tabletop/README.md` |
| RoboCasa Kitchen (24 tasks) | `RLWRLD/RLDX-1-FT-ROBOCASA` | `GENERAL_EMBODIMENT` | `run_scripts/eval/robocasa_kitchen/README.md` |
| RoboCasa365 | `RLWRLD/RLDX-1-FT-RC365` | `GENERAL_EMBODIMENT` | `run_scripts/eval/robocasa_365/README.md` |

Shared mechanics (server + rollout split, common flags, troubleshooting) are documented in docs/evaluation.md.


Project Structure

```
rldx/
├── configs/                              # Model, data, and training configurations
├── data/                                 # Dataset loaders, processors, and statistics
├── experiment/                           # Training entry points and utilities
├── eval/                                 # Evaluation scripts and sim environments
├── inference/                            # Inference engine: GraphSafe substrate, fused Triton kernels, RTC dispatch
├── model/
│   ├── core/                             # Core model (RLDX-1, processor, setup)
│   ├── modules/
│   │   ├── backbone/                     # RLDX-1-VLM backbone (with video token compression)
│   │   ├── action_model/                 # MSAT diffusion action model + physics head
│   │   ├── memory.py                     # Temporal memory transformer
│   │   ├── norms.py                      # Shared normalization primitives
│   │   └── embodiment_conditioned_mlp.py
│   ├── pipeline.py                       # Training/inference pipeline glue
│   └── registry.py                       # Embodiment + variant registry
├── policy/                               # Inference policy wrappers
└── utils/                                # Distributed training utilities
```

Citation

```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Dongyoung Kim and Huiwon Jang and Myungkyu Koo and Suhyeok Jang and Taeyoung Kim and others},
  year={2026},
  journal={arXiv preprint arXiv:2605.03269},
  eprint={2605.03269},
  archivePrefix={arXiv}
}
```

Acknowledgments

RLDX-1 builds upon several open-source projects; third-party attributions and per-file provenance headers are preserved in the source tree.

License

  • Code: released under the Apache License 2.0. The codebase is built on the NVIDIA Isaac GR00T N1.7 framework — third-party attributions and per-file provenance headers are preserved in the source tree.
  • Model weights: distributed on Hugging Face under the RLWRLD Model License v1.0 (a non-commercial license with attribution and share-alike terms). By using any RLWRLD/RLDX-1-* checkpoint you agree to those terms.

Contributions

We currently do not accept external pull requests on this repository. If you encounter a bug, broken reproduction step, or have a question about RLDX-1, please open an issue at github.com/RLWRLD/RLDX-1/issues and we will follow up there.
