This repository provides the reference implementation for WorldVLN. It includes autoregressive inference for closed-loop action prediction and a two-stage training pipeline: (1) supervised backbone and action decoder training, and (2) action-aware GRPO-based optimization.
We recommend using a single Python 3.10 environment for the released workflows. Our validated launch scripts receive the Python interpreter explicitly through PYTHON_BIN, so after activating your environment, export:
export PYTHON_BIN=$(which python)
- Create a Python 3.10 environment.
conda create -n worldvln python=3.10
conda activate worldvln
- Install a PyTorch build that matches your CUDA environment. For the released training and action-aware GRPO workflows, a PyTorch 2.5.1 environment is the recommended baseline.
- Install the shared dependencies used by the released workflows.
pip install -r requirements.txt
Official WorldVLN weights are available on Hugging Face:
Download the weights to your preferred checkpoint directory and configure the relevant training or inference scripts to point to them.
Simulation / benchmark resources:
- IndoorUAV simulation environment download guide: Indoor_UAV
- UAV-Flow benchmark and evaluation environment: buaa-colalab/UAV-Flow
The repository currently provides two main inference entry points.
From the repository root:
export PYTHON_BIN=$(which python)
export INFINITY_CKPT=/path/to/infinity/global_step_xxx.pth
export STAGE2_LATENT2ACTION_CKPT=/path/to/WorldVLN_action_decoder.pt
bash infer/run_server.sh
Common environment variables:
- INFINITY_CKPT: main InfinityStar / WorldVLN checkpoint used by the service
- STAGE2_LATENT2ACTION_CKPT: Stage-2 latent-to-action checkpoint for action prediction
- INFINITY_SERVER_CONFIG: optional override for infer/config.json
- INFINITY_REPO_ROOT: optional override for the default Worldmodel/runtime/
- INFINITY_LATENT_CACHE_ROOT: runtime cache directory used by the service
- HOST, PORT: bind address for Uvicorn
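Before launching the service, it can help to confirm that the checkpoint variables are set and point to real files. The following is a minimal sketch, not part of the repository: the helper name `check_server_env` is hypothetical, and treating INFINITY_CKPT and STAGE2_LATENT2ACTION_CKPT as required is an assumption based on the launch example above.

```python
import os
from pathlib import Path

# Variable names are taken from the list above. Treating these two as
# required is an assumption based on the launch example.
REQUIRED_PATH_VARS = ("INFINITY_CKPT", "STAGE2_LATENT2ACTION_CKPT")


def check_server_env(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_PATH_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_file():
            problems.append(f"{name} does not point to a file: {value}")
    # PORT is optional, but if present it should parse as an integer.
    port = env.get("PORT")
    if port is not None and not port.isdigit():
        problems.append(f"PORT is not an integer: {port}")
    return problems
```

Calling `check_server_env()` with no arguments inspects the current process environment; passing a dict makes the check easy to exercise in isolation.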
- Entry points: infer/run_server.sh, infer/server.py
- Configuration: infer/config.json
- Windows-side client: infer/client.py
WorldVLN runs an autoregressive closed-loop protocol in which each trajectory is tracked by a session_id:
- Input (per call): images_base64 (RGB frames) + an optional instruction on the first call.
  - First call typically sends 1 warmup frame.
  - Next calls send step frames (default 16) to advance the session timeline.
- State (server-side): the server stores history by session_id and maintains a streaming world-model session.
- Output (per call): actions as delta actions in cm/deg with order [dx, dy, dz, droll, dyaw, dpitch].
  - In the default tsformer_latent mode, the server emits one segment worth of actions whenever enough frames have been received.
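The per-call input and output described above can be sketched as two small helpers. This is an illustration only: the field names session_id, images_base64, instruction, and actions come from the protocol description, but the flat JSON layout, the use of encoded image bytes as frames, and the helper names `build_request` and `parse_actions` are assumptions. See infer/client.py for the actual client.

```python
import base64
from typing import NamedTuple, Optional


class DeltaAction(NamedTuple):
    """One delta action in cm/deg, in the order documented above."""
    dx: float
    dy: float
    dz: float
    droll: float
    dyaw: float
    dpitch: float


def build_request(session_id: str, frames: list, instruction: Optional[str] = None) -> dict:
    """Build one per-call payload from encoded image bytes (e.g. JPEG or PNG)."""
    payload = {
        "session_id": session_id,
        "images_base64": [base64.b64encode(f).decode("ascii") for f in frames],
    }
    # The instruction is only expected on the first call of a session.
    if instruction is not None:
        payload["instruction"] = instruction
    return payload


def parse_actions(response: dict) -> list:
    """Convert each row of `actions` into a named 6-DoF delta."""
    actions = []
    for row in response.get("actions", []):
        if len(row) != 6:
            raise ValueError(f"expected 6 values per action, got {len(row)}")
        actions.append(DeltaAction(*(float(v) for v in row)))
    return actions
```

Naming the six components avoids index mistakes downstream, since the documented order places dyaw before dpitch.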
Example (strict closed-loop, enable allow_future_segments=1):
- Send (1 real frame + instruction/prompt) → get the next step actions (typically 16).
- Execute them, collect the next step real frames, send them → get the next step actions.
- Repeat with the same session_id until the trajectory ends.
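The three steps above can be sketched as a driver loop. This is a hedged sketch rather than the repository's client: `send`, `capture_frame`, and `execute` are hypothetical callables supplied by the caller, one frame is captured per executed action (an assumption that matches step actions to step frames), and an empty action list is treated as the end of the trajectory.

```python
from typing import Callable, List, Optional, Sequence


def run_closed_loop(
    send: Callable[[str, List[bytes], Optional[str]], Sequence[Sequence[float]]],
    capture_frame: Callable[[], bytes],
    execute: Callable[[Sequence[float]], None],
    session_id: str,
    instruction: str,
    max_segments: int,
) -> int:
    """Drive one trajectory; returns the number of actions executed."""
    executed = 0
    # First call: one warmup frame plus the instruction.
    actions = send(session_id, [capture_frame()], instruction)
    for _ in range(max_segments):
        if not actions:
            break  # assumption: no actions means the trajectory is over
        frames = []
        for action in actions:
            execute(action)
            frames.append(capture_frame())  # one real frame per executed action
            executed += 1
        # Later calls: only the newly collected frames, same session_id.
        actions = send(session_id, frames, None)
    return executed
```

Passing the transport in as `send` keeps the loop independent of the HTTP layer, so the same logic can be exercised against a stub before connecting to a running server.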
Clients in this repo follow the same pattern:
- infer/client.py (simulation / dataset example): sends 1, step, step, ... frames under a stable session_id, and writes per-segment *_actions.json / *_poses.json.
- action_aware_grpo/windows_client.py (Windows-side rollout integration / debugging): uses the same session protocol and output format, but is packaged as a Win-friendly client for action-aware GRPO workflows.
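One plausible way to relate delta actions to poses is plain accumulation. The sketch below is an assumption, not the repository's method: it treats each [dx, dy, dz, droll, dyaw, dpitch] delta as additive in the same fixed frame as the pose, and the names `wrap_deg` and `accumulate_poses` are hypothetical. If the deltas are expressed in the body frame, the translation would first need to be rotated by the current heading.

```python
def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def accumulate_poses(start, deltas):
    """Sum 6-DoF deltas onto a start pose, returning the pose after each step.

    Assumes deltas share the pose's fixed frame (see the caveat above).
    """
    poses = []
    pose = list(start)
    for delta in deltas:
        pose = [p + d for p, d in zip(pose, delta)]
        # Keep the three angles bounded so long trajectories stay readable.
        pose[3:] = [wrap_deg(a) for a in pose[3:]]
        poses.append(list(pose))
    return poses
```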
This repository is organized into two stages:
- Stage 1 (supervised): backbone finetuning + action decoder training.
- Stage 2 (action-aware GRPO): rollout collection + GRPO training.
bash train/scripts/train_from_base.sh
The backbone finetuning workflow is located under train/.
- Entry point: train/scripts/train_from_base.sh
- Main trainer: train/train.py
- Detailed guide: train/TRAINING.md
# Stage A: adapter distillation
bash train/action_decoder/scripts/train_stageA_ddp.sh
# Stage B: latent-to-action training
bash train/action_decoder/scripts/train_stageB_ddp.sh
The action decoder workflow is located under Worldmodel/action_decoder/src/ and is organized into two steps (Stage A + Stage B).
The training entry points for both steps live under train/action_decoder/:
- Stage A adapter distillation: train/action_decoder/scripts/train_stageA_ddp.sh
- Stage B latent-to-action training: train/action_decoder/scripts/train_stageB_ddp.sh
- Main scripts: train/action_decoder/tools/train_stageA_ddp.py, train/action_decoder/tools/train_stageB_ddp.py
This workflow trains the mapping from visual latent features to 6-DoF motion outputs.
Data contract (training manifest):
{
  "items_train": [
    {
      "latent_path": "path/to/latents.pt",
      "traj_json_path": "path/to/preprocessed_logs.json",
      "images_dir": "path/to/images"
    }
  ]
}
Stage A required environment variables:
- MANIFEST_JSON
- TSFORMER_CKPT
- INF_VAE_PATH
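A manifest can be checked against the data contract before a long training run is started. The following is a minimal sketch under stated assumptions: the three keys are taken from the manifest example above, but treating all of them as mandatory is an assumption, and `validate_manifest` and `load_and_validate` are hypothetical helpers rather than repository functions.

```python
import json
from pathlib import Path

# Keys shown in the manifest example; assumed to be required for every item.
REQUIRED_KEYS = ("latent_path", "traj_json_path", "images_dir")


def validate_manifest(manifest: dict) -> list:
    """Return (index, message) pairs for malformed items_train entries."""
    items = manifest.get("items_train")
    if not isinstance(items, list):
        return [(-1, "items_train must be a list")]
    errors = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append((i, "item must be an object"))
            continue
        missing = [k for k in REQUIRED_KEYS if k not in item]
        if missing:
            errors.append((i, "missing keys: " + ", ".join(missing)))
    return errors


def load_and_validate(path: str) -> list:
    """Read a manifest file (e.g. the one MANIFEST_JSON points to) and check it."""
    return validate_manifest(json.loads(Path(path).read_text(encoding="utf-8")))
```

An empty result means every entry carries the expected keys; the check does not verify that the referenced files exist.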
Run Stage A:
bash train/action_decoder/scripts/train_stageA_ddp.sh
Stage B required environment variables:
- MANIFEST_JSON
- TSFORMER_PRETRAINED
- ADAPTER_CKPT
- INFINITYSTAR_VAE_PATH
Run Stage B:
bash train/action_decoder/scripts/train_stageB_ddp.sh
Start the local inference service used by rollout:
INFINITY_CKPT=/path/to/infinity/global_step_xxx.pth \
CHECKPOINTS_DIR=/path/to/checkpointsinf \
ACTIONHEAD_CKPT=/path/to/actionhead/checkpoint_last.pth \
ACTIONHEAD_RUN_CONFIG=/path/to/actionhead/run_config.json \
bash action_aware_grpo/run_infer_server.sh
Run rollout collection:
unset ALL_PROXY all_proxy
export NO_PROXY=127.0.0.1,localhost
SRC_JSON=/path/to/reference_video_full_49f_trajectory_prompts.json \
INFINITY_CKPT=/path/to/infinity/global_step_xxx.pth \
CHECKPOINTS_DIR=/path/to/checkpointsinf \
ACTIONHEAD_CKPT=/path/to/actionhead/checkpoint_last.pth \
ACTIONHEAD_RUN_CONFIG=/path/to/actionhead/run_config.json \
CUDA_VISIBLE_DEVICES=0 \
GRPO_LOCAL_GPU_IDS=0 \
NPROC_PER_NODE=1 \
NNODES=1 \
NODE_RANK=0 \
UAVFLOW_STAGEA_ROLLOUT_BACKEND=remote_sim \
UAVFLOW_SIMULATOR_BASE_URL=http://127.0.0.1:18765 \
UAVFLOW_SIMULATOR_TIMEOUT_S=120 \
UAVFLOW_TASK_JSON_ROOT=/path/to/uavflow_tasks/select_from_train_jsons \
bash action_aware_grpo/scripts/run_stagea_collect.sh RUN_ID=remote_sim_smoke TOP_N=1 K_CAND=1 STAGEA_NPROC=1 STAGEA_PROGRESS_EVERY_N=1
Run train (partial-freeze optimization):
CHECKPOINTS_DIR=/path/to/checkpointsinf \
RUSH_RESUME=/path/to/infinity/global_step_xxx.pth \
REPLAY_META_DIR=/path/to/replay_meta_rollout_smoke \
bash action_aware_grpo/scripts/run_stageb_partialfreeze.sh PARTIAL_FREEZE_MODE=smoke RUN_ID=stageb_smoke
The action-aware GRPO workflow is located under action_aware_grpo/ and is organized into two steps: rollout and train.
- Server entry point: action_aware_grpo/grpo_server.py
- Windows-side client (for action-aware GRPO rollout integration / debugging): action_aware_grpo/windows_client.py
- Rollout collection: action_aware_grpo/scripts/run_stagea_collect.sh
- Train (partial-freeze optimization): action_aware_grpo/scripts/run_stageb_partialfreeze.sh
- Remote simulator service wrapper: action_aware_grpo/scripts/run_remote_sim_service.sh
- Local inference launcher used by rollout: action_aware_grpo/run_infer_server.sh
At a high level:
- Rollout consumes rollout sources and model assets, then generates rollout caches and replay metadata.
- Train consumes replay metadata and runs optimization to produce updated checkpoints and logs.
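For readers unfamiliar with GRPO, the core idea is that each candidate rollout is scored relative to the other candidates generated for the same task, rather than against a learned value function. The sketch below shows only that generic group-relative normalization. It is not the repository's training code: the function name is hypothetical, the action-aware reward itself is not specified here, and using the population standard deviation is a choice made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of candidate rollouts.

    Each advantage is (reward - group mean) / (group std + eps), so
    above-average candidates get positive weight and below-average
    candidates get negative weight.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every candidate in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason to collect more than one candidate per task.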
For simulator-backed rollout details, see action_aware_grpo/docs/remote_sim.md.
We sincerely thank the following projects for their exceptional effort: InfinityStar, TSformer-VO.
If you find this work useful, please cite the WorldVLN paper:
@misc{zhao2026worldvln,
title={WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation},
author={Baining Zhao and Jiacheng Xu and Weicheng Feng and Xin Zhang and Zhaolu Wang and Haoyang Wang and Shilong Ji and Ziyou Wang and Jianjie Fang and Zhiheng Zheng and Weichen Zhang and Yu Shang and Wei Wu and Chen Gao and Xinlei Chen and Yong Li},
year={2026},
eprint={2605.15964},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.15964},
}
This project is released under the CC BY 4.0 license. See LICENSE.

