RLDX-1 is a Vision-Language-Action model (VLA) for human-like dexterous manipulation. Beyond the versatile intelligence inherited from pre-trained VLM backbones, RLDX-1 adds three functional capabilities — motion awareness, long-term memory, and physical sensing — through a unified Multi-Stream Action Transformer (MSAT) architecture, a synthetic-augmented training pipeline, and a real-time inference stack.
- Multi-Stream Action Transformer (MSAT). Cognition, physics, and action each get a dedicated stream coupled by joint self-attention — an extension of MM-DiT to action modeling (sketched after this list).
- Motion awareness. Multi-frame observations + a motion module capture temporal dynamics; intermediate VLM layers compress video tokens to keep the policy efficient.
- Long-term memory. A memory module fuses past cognition features with the current ones for history-grounded decisions beyond a short multi-frame window.
- Physical sensing. Tactile and torque enter as a dedicated physics stream; the decoder is jointly trained to predict future physical signals.
- Three-stage training. Pre-training (generalization) → mid-training (functionality) → post-training (task adaptation), with synthetic data augmenting rare manipulation scenarios.
- Real-time inference. Static graph capture + custom fused kernels bring the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz).
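To make the multi-stream coupling concrete, here is a minimal MM-DiT-style sketch: each stream keeps its own projections while all tokens meet in one joint self-attention pass. Module names, widths, and token counts below are illustrative assumptions, not RLDX-1's actual configuration (see docs/architecture.md for that).

```python
import torch
import torch.nn as nn

class JointStreamBlock(nn.Module):
    """Illustrative MM-DiT-style block (an assumption, not the MSAT code):
    cognition, physics, and action tokens get per-stream projections but
    attend to each other in a single joint self-attention pass."""

    STREAMS = ("cog", "phys", "act")

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.qkv = nn.ModuleDict({s: nn.Linear(dim, 3 * dim) for s in self.STREAMS})
        self.out = nn.ModuleDict({s: nn.Linear(dim, dim) for s in self.STREAMS})
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, streams: dict) -> dict:
        qs, ks, vs, lens = [], [], [], []
        for s in self.STREAMS:
            q, k, v = self.qkv[s](streams[s]).chunk(3, dim=-1)
            qs.append(q); ks.append(k); vs.append(v)
            lens.append(streams[s].shape[1])
        # Joint self-attention over the concatenated token sequence.
        fused, _ = self.attn(torch.cat(qs, 1), torch.cat(ks, 1), torch.cat(vs, 1))
        # Split back per stream; residual + per-stream output projection.
        parts = fused.split(lens, dim=1)
        return {s: streams[s] + self.out[s](p) for s, p in zip(self.STREAMS, parts)}

block = JointStreamBlock()
tokens = {s: torch.randn(2, n, 256) for s, n in [("cog", 64), ("phys", 8), ("act", 16)]}
out = block(tokens)  # each stream keeps its own length and width
```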
Success rates (%) of RLDX-1 fine-tuned on each benchmark's training set, compared to recent frontier VLA baselines.
| Method | LIBERO (Avg) | LIBERO-Plus | SIMPLER Google-VM | SIMPLER Google-VA | SIMPLER WidowX | RoboCasa Kitchen | GR-1 Tabletop | RoboCasa365 (Avg) |
|---|---|---|---|---|---|---|---|---|
| π0-FAST | 85.5 | 64.2 | 61.9 | 59.0 | 48.3 | 63.6 | — | 21.7 |
| π0 | 94.1 | 54.6 | 58.8 | 54.8 | 27.1 | 62.5 | 13.6 | 14.8 |
| π0.5 | 96.9 | 86.5 | 72.7 | 68.4 | 46.9 | 62.1 | 15.4 | 16.9 |
| GR00T N1.5 | 86.5 | 66.3 | 52.4 | 43.7 | 62.0 | 65.7 | 48.0 | 20.0 |
| GR00T N1.6 | 96.7 | 72.6 | 76.1 | 57.1 | 57.1 | 66.2 | 47.6 | 26.9 |
| RLDX-1 (ours) | 97.8 | 86.7 | 81.5 | 77.4 | 71.9 | 70.6 | 58.7 | 32.1 |
The first five columns cover the established LIBERO / SIMPLER family; the last three (RoboCasa Kitchen, GR-1 Tabletop, RoboCasa365) are long-horizon, humanoid, and compositional benchmarks. Per-benchmark checkpoints, embodiment tags, and reproduce commands are listed under Reproducing Benchmark Results.
Requirements: Python 3.10, CUDA 12.x, uv v0.8.4+

```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

Verify installation:

```bash
uv run python -c "import rldx; print(rldx.__version__)"
```

For simulator setup, dev tooling, and full troubleshooting, see docs/installation.md.
Hands-on guides live under docs/:

| Guide | What it covers |
|---|---|
| `installation.md` | Environment setup, simulator venvs, dev tooling, common pitfalls |
| `architecture.md` | Five-stage walkthrough of the RLDX-1 model and its config flags |
| `training.md` | `launch_train.py` recipes (fine-tune / mid-train), LoRA, training-time RTC, dataset layout |
| `embodiment_tags.md` | What `EmbodimentTag` is and how to pick one for a custom robot |
| `evaluation.md` | RoboCasa / LIBERO / SIMPLER / GR-1 eval, server + rollout split, results aggregation |
| `inference_server.md` | `run_rldx_server.py` CLI, wire protocol, RTC modes, `--compile` levels, simulator + real-robot deployment |
| Checkpoint | Description | Params | HuggingFace |
|---|---|---|---|
| `RLDX-1-PT` | Pre-trained (video) | 6.9B | `RLWRLD/RLDX-1-PT` |
| `RLDX-1-MT-DROID` | Mid-trained on DROID with all add-ons | 8.1B | `RLWRLD/RLDX-1-MT-DROID` |
| `RLDX-1-MT-ALLEX` | Mid-trained on ALLEX with all add-ons | 8.1B | `RLWRLD/RLDX-1-MT-ALLEX` |
RLDX-1 uses LeRobot v2.1 format datasets. To convert your data:

```bash
# Convert a single dataset
bash run_scripts/data/convert_lerobot_single.sh /path/to/your/data

# Convert multiple datasets
bash run_scripts/data/convert_lerobot_multiple.sh /path/to/data/root
```

Each dataset must carry a `meta/modality.json` that slices the flat state / action vectors into named joint groups and remaps video columns to modality keys. Schema and a worked example are in docs/training.md.
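For orientation only, a `meta/modality.json` might look roughly like the sketch below. Every key name here is a guess for illustration, loosely following the GR00T-style convention this repo builds on; treat docs/training.md as the authoritative schema.

```json
{
  "state": {
    "arm":     {"start": 0, "end": 7},
    "gripper": {"start": 7, "end": 8}
  },
  "action": {
    "arm":     {"start": 0, "end": 7},
    "gripper": {"start": 7, "end": 8}
  },
  "video": {
    "ego_view": {"original_key": "observation.images.cam_front"}
  }
}
```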
Define your robot's modality configuration:

```python
# my_modality_config.py
from rldx.data.types import ModalityConfig

MODALITY_CONFIGS = {
    "my_robot": {
        "image": ModalityConfig(...),
        "state": ModalityConfig(...),
        "action": ModalityConfig(...),
    }
}
```

Pass it via `--modality-config-path my_modality_config.py` during training, together with an `EmbodimentTag` that selects the per-robot MLP head slot (default: `GENERAL_EMBODIMENT`; see docs/embodiment_tags.md for the picker).
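For a rough idea of what a filled-in config could look like, here is a hypothetical sketch following the GR00T-style convention noted just below. The field names (`delta_indices`, `modality_keys`) and key strings are assumptions; check `rldx/data/types.py` for the real signature.

```python
# Hypothetical values: field names here are assumptions, not the verified
# rldx.data.types.ModalityConfig signature.
from rldx.data.types import ModalityConfig

MODALITY_CONFIGS = {
    "my_robot": {
        # Current ego-camera frame(s).
        "image": ModalityConfig(delta_indices=[0], modality_keys=["video.ego_view"]),
        # Proprioception at the current step.
        "state": ModalityConfig(delta_indices=[0], modality_keys=["state.arm", "state.gripper"]),
        # A 16-step chunk of future actions to predict.
        "action": ModalityConfig(delta_indices=list(range(16)), modality_keys=["action.arm", "action.gripper"]),
    }
}
```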
The EmbodimentTag design and per-embodiment MLP head structure follow
the convention introduced by NVIDIA GR00T N1.7.
This section covers how to fine-tune RLDX-1 from a pre-trained checkpoint (`RLWRLD/RLDX-1-PT`) on your own LeRobot v2.1 dataset. The training entry point is a single CLI (`rldx/experiment/launch_train.py`) where flags toggle the optional functional capabilities described in Highlights:

- `--video-length N` — temporal frames per observation (motion awareness)
- `--use-memory` — temporal memory module (long-term memory)
- `--use-motion` — motion module inside the VLM backbone
- `--use-physics --physics-keys ...` — tactile / torque streams (physical sensing)

LoRA, training-time RTC, and the full flag list are documented in docs/training.md. Below are the canonical recipes.
```bash
uv run python rldx/experiment/launch_train.py \
  --base-model-path RLWRLD/RLDX-1-PT \
  --dataset-path /path/to/your/dataset \
  --embodiment-tag GENERAL_EMBODIMENT \
  --video-length 4 \
  --n-cog-tokens 64 \
  --global-batch-size 64 \
  --learning-rate 1e-4 \
  --max-steps 60000 \
  --save-steps 5000 \
  --output-dir ./outputs/my_finetune
```

Recommended for embodiments where memory, motion awareness, or contact sensing matter. To enable a single add-on instead of all three, keep just the corresponding `--use-*` flag(s) and drop the rest (a single-add-on example follows the flag table below).
```bash
uv run python rldx/experiment/launch_train.py \
  --base-model-path RLWRLD/RLDX-1-PT \
  --dataset-path /path/to/your/dataset \
  --embodiment-tag GENERAL_EMBODIMENT \
  --video-length 4 \
  --use-memory --memory-length 4 --concat-memory \
  --use-motion --motion-insert-layer 9 \
  --use-physics --physics-keys tactile torque --physics-dims 30 7 \
  --new-param-warmup-steps 2000 \
  --n-cog-tokens 64 \
  --global-batch-size 64 \
  --max-steps 60000 \
  --output-dir ./outputs/my_finetune_all
```

| Flag | Description | Default |
|---|---|---|
| `--video-length` | Number of video frames (video token compression is always on; set to 1 for single-frame) | 4 |
| `--video-stride` | Stride between frames in action-step units | 2 |
| `--use-memory` | Enable temporal memory module | False |
| `--memory-length` | Memory context window (timesteps) | 4 |
| `--use-motion` | Enable motion module | False |
| `--use-physics` | Enable physics signal conditioning | False |
| `--n-cog-tokens` | Number of cognition tokens | 64 |
| `--global-batch-size` | Total batch size across GPUs | 64 |
| `--new-param-warmup-steps` | Warmup steps for newly added modules | 0 |
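For example, the single-add-on variant mentioned above for long-term memory only keeps the `--use-memory` flags from the full recipe and drops the motion and physics ones:

```bash
uv run python rldx/experiment/launch_train.py \
  --base-model-path RLWRLD/RLDX-1-PT \
  --dataset-path /path/to/your/dataset \
  --embodiment-tag GENERAL_EMBODIMENT \
  --video-length 4 \
  --use-memory --memory-length 4 --concat-memory \
  --new-param-warmup-steps 2000 \
  --n-cog-tokens 64 \
  --global-batch-size 64 \
  --max-steps 60000 \
  --output-dir ./outputs/my_finetune_memory
```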
For memory-constrained fine-tunes you can replace full-parameter tuning of the action model (MSAT) and/or the backbone VLM with PEFT LoRA adapters:

```bash
--action-model-use-lora --action-model-lora-rank 16 --action-model-lora-alpha 32
--backbone-use-lora --backbone-lora-rank 16 --backbone-lora-alpha 32 --backbone-lora-num-layers -1
```

`--action-model-use-lora` overrides `--tune-diffusion-model`; `--backbone-use-lora` overrides `--tune-top-llm-layers`. Full flag list and target-module defaults are in docs/training.md.
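Put together with the base recipe, a double-LoRA fine-tune could look like the following sketch (rank/alpha values are the examples above, not tuned recommendations):

```bash
uv run python rldx/experiment/launch_train.py \
  --base-model-path RLWRLD/RLDX-1-PT \
  --dataset-path /path/to/your/dataset \
  --embodiment-tag GENERAL_EMBODIMENT \
  --video-length 4 \
  --action-model-use-lora --action-model-lora-rank 16 --action-model-lora-alpha 32 \
  --backbone-use-lora --backbone-lora-rank 16 --backbone-lora-alpha 32 --backbone-lora-num-layers -1 \
  --global-batch-size 64 \
  --max-steps 60000 \
  --output-dir ./outputs/my_finetune_lora
```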
If you intend to serve the checkpoint with `--rtc-inference-mode trained` (faster, fullgraph-compatible), enable training-time RTC:

```bash
--rtc-training-max-delay 4
```

The training-time RTC formulation follows Black et al. (Training-Time Action Conditioning for Efficient Real-Time Chunking); the inference-side counterpart is Black et al. (Real-Time Execution of Action Chunking Flow Policies). See docs/training.md and docs/inference_server.md for usage details.
RLDX-1 ships two inference paths sharing the same model + processor:

- In-process — load `RLDXPolicy` and call `get_action(obs)` directly from Python. Best for evaluation scripts and notebook prototyping.
- ZeroMQ server — `rldx/eval/run_rldx_server.py` for real-robot deployment, with two orthogonal optimizations layered on top of the base path:
  - Graph capture + kernel fusion (`--compile {submodule, fullgraph}`) — static-graph CUDA-graph capture and custom fused operators bring the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup over PyTorch eager, >22 Hz).
  - Real-Time Chunking (`--rtc-inference-mode {guided, trained}`) — chunk-boundary stitching for smooth action handoff between consecutive chunks.
```python
import torch
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-ROBOCASA",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

# Single-step inference
action = policy.get_action(observation)
```

For real-time robot deployment:
```bash
# Start the policy server
uv run python rldx/eval/run_rldx_server.py \
  --model-path RLWRLD/RLDX-1-FT-ROBOCASA \
  --embodiment-tag GENERAL_EMBODIMENT \
  --host 0.0.0.0 --port 20000
```

The server brings the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz) through two orthogonal knobs:

`--compile {none, submodule, fullgraph}` — graph capture + kernel fusion.

- `submodule` — compiles each learnable sub-module. Preserves autograd. ~30 s warmup.
- `fullgraph` — CUDA-graph capture and operator fusion over the full VLA forward. Lowest steady-state latency, ~90–210 s warmup.
- Tuned for RTX 5090 (Blackwell, sm_120). On other GPU architectures, use `--compile submodule` instead.

`--rtc-inference-mode {none, guided, trained}` — Real-Time Chunking for chunk-boundary stitching.

- `guided` — works with any flow-matching checkpoint.
- `trained` — requires a checkpoint trained with `--rtc-training-max-delay > 0`. Pairs with `--compile fullgraph`.
- Implementation follows Black et al. (Real-Time Execution of Action Chunking Flow Policies); the `trained` mode uses the integration from Black et al. (Training-Time Action Conditioning for Efficient Real-Time Chunking).
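To make chunk-boundary stitching concrete, the sketch below shows the guided idea in miniature: the first `delay` actions of the new chunk are pinned to the already-committed tail of the previous chunk throughout flow integration. This is a schematic reading of Black et al., not the code in rldx/inference.

```python
import torch

def stitch_chunk_guided(velocity_fn, prev_tail, horizon, act_dim, steps=10):
    """Schematic RTC-style stitching (NOT the rldx implementation).

    velocity_fn(x, t) -> v: flow-matching velocity from the action head.
    prev_tail: (delay, act_dim) actions already committed from the last chunk.
    Returns a (horizon, act_dim) chunk whose prefix hands off smoothly.
    """
    delay = prev_tail.shape[0]
    x = torch.randn(horizon, act_dim)  # pure noise at flow time t = 0
    for i in range(steps):
        t = i / steps
        # Inpainting-style guidance: clamp the overlapping prefix to the
        # committed actions, re-noised to the current flow time, so the
        # suffix is denoised consistently with the frozen prefix.
        x[:delay] = t * prev_tail + (1 - t) * torch.randn(delay, act_dim)
        x = x + velocity_fn(x, torch.tensor(t)) / steps  # Euler step
    x[:delay] = prev_tail  # exact match at the boundary
    return x
```

Roughly speaking, `guided` mode runs a loop of this shape against any flow-matching checkpoint, while `trained` mode bakes the delayed-prefix conditioning into the model itself, removing the guidance loop and keeping the forward pass static, which is why it pairs with `--compile fullgraph`.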
The full flag list, the compile × RTC compatibility matrix, and a
walkthrough of the trade-offs are in
docs/inference_server.md.
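For a feel of the robot-side client, here is a minimal sketch assuming a pickle-over-REQ/REP exchange of an observation dict for an action array. The message keys, serialization, and reply format are assumptions; the actual wire protocol is specified in docs/inference_server.md.

```python
import pickle

import numpy as np
import zmq

# Hypothetical client sketch: the message schema and serialization here are
# assumptions; see docs/inference_server.md for the real wire protocol.
ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:20000")  # matches --port in the server command

observation = {
    "video.ego_view": np.zeros((4, 224, 224, 3), dtype=np.uint8),  # dummy frames
    "state.arm": np.zeros(7, dtype=np.float32),                    # dummy state
}
sock.send(pickle.dumps(observation))
action_chunk = pickle.loads(sock.recv())  # expected: an array of actions
```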
Each benchmark has a self-contained eval README; this table maps each result row in Performance to the fine-tuned checkpoint we used, the embodiment tag the server expects, and the runnable guide.
| Benchmark | Fine-tuned Checkpoint | Embodiment Tag | Eval Guide |
|---|---|---|---|
| LIBERO | `RLWRLD/RLDX-1-FT-LIBERO` | `GENERAL_EMBODIMENT` | `run_scripts/eval/libero/README.md` |
| LIBERO-Plus | `RLWRLD/RLDX-1-FT-LIBERO` | `GENERAL_EMBODIMENT` | `run_scripts/eval/libero_plus/README.md` |
| SimplerEnv Google | `RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE` | `OXE_FRACTAL` | `run_scripts/eval/simpler/README.md` |
| SimplerEnv WidowX | `RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX` | `OXE_BRIDGE_ORIG` | `run_scripts/eval/simpler/README.md` |
| GR-1 Tabletop | `RLWRLD/RLDX-1-FT-GR1` | `GENERAL_EMBODIMENT` | `run_scripts/eval/gr1_tabletop/README.md` |
| RoboCasa Kitchen (24 tasks) | `RLWRLD/RLDX-1-FT-ROBOCASA` | `GENERAL_EMBODIMENT` | `run_scripts/eval/robocasa_kitchen/README.md` |
| RoboCasa365 | `RLWRLD/RLDX-1-FT-RC365` | `GENERAL_EMBODIMENT` | `run_scripts/eval/robocasa_365/README.md` |
Shared mechanics (server + rollout split, common flags, troubleshooting)
are documented in docs/evaluation.md.
```
rldx/
├── configs/                 # Model, data, and training configurations
├── data/                    # Dataset loaders, processors, and statistics
├── experiment/              # Training entry points and utilities
├── eval/                    # Evaluation scripts and sim environments
├── inference/               # Inference engine: GraphSafe substrate, fused Triton kernels, RTC dispatch
├── model/
│   ├── core/                # Core model (RLDX-1, processor, setup)
│   ├── modules/
│   │   ├── backbone/        # RLDX-1-VLM backbone (with video token compression)
│   │   ├── action_model/    # MSAT diffusion action model + physics head
│   │   ├── memory.py        # Temporal memory transformer
│   │   ├── norms.py         # Shared normalization primitives
│   │   └── embodiment_conditioned_mlp.py
│   ├── pipeline.py          # Training/inference pipeline glue
│   └── registry.py          # Embodiment + variant registry
├── policy/                  # Inference policy wrappers
└── utils/                   # Distributed training utilities
```
```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Dongyoung Kim and Huiwon Jang and Myungkyu Koo and Suhyeok Jang and Taeyoung Kim and others},
  year={2026},
  journal={arXiv preprint arXiv:2605.03269},
  eprint={2605.03269},
  archivePrefix={arXiv}
}
```

RLDX-1 builds upon the following open-source projects:
- NVIDIA GR00T N1.7 — Training codebase
- Qwen3-VL — Vision-language backbone
- FLUX — MM-DiT architecture
- Code: released under the Apache License 2.0. The codebase is built on the NVIDIA Isaac GR00T N1.7 framework — third-party attributions and per-file provenance headers are preserved in the source tree.
- Model weights: distributed on Hugging Face under the RLWRLD Model License v1.0 (a non-commercial license with attribution and share-alike terms). By using any `RLWRLD/RLDX-1-*` checkpoint you agree to those terms.
We currently do not accept external pull requests on this repository. If you encounter a bug, broken reproduction step, or have a question about RLDX-1, please open an issue at github.com/RLWRLD/RLDX-1/issues and we will follow up there.

