Yifei Dong1,*,
Fengyi Wu1,*,
Yilong Dai1,*,
Lingdong Kong2,
Guangyu Chen1,
Xu Zhu1,
Qiyu Hu1,
Tianyu Wang1,
Johnalbert Garnica1,
Feng Liu3,
Siyu Huang4,
Qi Dai5,
Zhi-Qi Cheng1,†
1UW, 2NUS, 3Clemson, 4Drexel, 5Microsoft
LCVN studies language-conditioned visual navigation, where an embodied agent follows natural language instructions based solely on an initial egocentric observation — without access to goal images or intermediate environmental feedback. We formulate this as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions spanning diverse environments and instruction styles. We propose two complementary model families: LCVN-WM + LCVN-AC, combining a diffusion-based world model with a latent-space actor–critic agent trained via intrinsic rewards, and LCVN-Uni, an autoregressive multimodal architecture that jointly predicts actions and future observations in a single forward pass.
We currently release a partial dataset for debugging and for demonstrating the data format. You can find the samples in data_samples.
To set up the dataset, you can run the following commands to unzip the samples and move them to the expected directory:
mkdir -p data/
unzip data_samples/data_samples.zip -d data/

Each trajectory folder should contain frame images and a traj_data.pkl:
data/lcvn/
├── {trajectory_id}/
│ ├── 0.jpg
│ ├── 1.jpg
│ ├── ...
│ └── traj_data.pkl
└── ...
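The layout above can be exercised with a short script. This is only a sketch: the actual contents of traj_data.pkl are not documented here, so the "instructions" field below is an assumption; inspect a real sample from data_samples to see the true fields.

```python
# Sketch of the on-disk trajectory layout shown above. The contents of
# traj_data.pkl are an ASSUMPTION for illustration only.
import os
import pickle
import tempfile

root = tempfile.mkdtemp()
traj_dir = os.path.join(root, "lcvn", "traj_0000")  # {trajectory_id}
os.makedirs(traj_dir)

# Frame images 0.jpg, 1.jpg, ... (empty placeholder files here).
for i in range(3):
    open(os.path.join(traj_dir, f"{i}.jpg"), "wb").close()

# Placeholder pickle; the "instructions" key is hypothetical.
with open(os.path.join(traj_dir, "traj_data.pkl"), "wb") as f:
    pickle.dump({"instructions": ["walk toward the doorway"]}, f)

# Reload the trajectory, with frames sorted by numeric index.
with open(os.path.join(traj_dir, "traj_data.pkl"), "rb") as f:
    traj = pickle.load(f)
frames = sorted(
    (p for p in os.listdir(traj_dir) if p.endswith(".jpg")),
    key=lambda p: int(p.split(".")[0]),
)
print(frames, sorted(traj))
```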
LCVN-WM is a diffusion-based world model that imagines future visual states conditioned on actions and language instructions. It is built on the LDiT (Language-conditioned Diffusion Transformer) backbone.
conda env create -f environment.yml
conda activate lcvn
pip install -r requirements.txt
pip install -e ./lcvn-ac --no-deps

Verify the installation:

python -c 'import lcvn_ac; import torch; print("Installation successful!")'

Create the outputs directory:

mkdir -p outputs

Training uses wandb for logging. Set up your account and log in before running any training, and pass wandb.entity=<YOUR_WANDB_ENTITY> to each training command:

wandb login

If you do not have a wandb account or prefer to run without cloud logging, use offline mode by prepending WANDB_MODE=offline to any training command.
Run the full pipeline script to process the raw dataset into training-ready format:
cd lcvn-wm
bash build_lcvn_pipeline.sh

This script prepares the dataset end-to-end in the following order:
- Build metadata from raw trajectory folders (instructions loaded directly from traj_data.pkl)
- Encode frames → latents using stabilityai/sd-vae-ft-ema (initial encoding)
- Build initial cache from the SD-VAE latents
- Train custom VAE on this dataset
- Re-encode frames with the trained VAE
- Rebuild final cache using the new latents
Each step can be skipped individually by setting the corresponding RUN_STEPx=0 environment variable. The final dataset is placed under lcvn-wm/data/lcvn/.
WANDB_MODE=online python -m main \
'+name=LcvnWM_Social_DiT_XL' \
dataset=lcvn \
algorithm=ldit_video_social \
experiment=video_generation \
wandb.entity=<YOUR_WANDB_ENTITY> \
+logger.wandb.log_model=False

To resume from a checkpoint, add this argument to the command above:
load='"/path/to/checkpoint.ckpt"'
Note: the nested quotes are required by Hydra to parse paths containing =.
The latest checkpoint is automatically copied to outputs/social_dit_xl.ckpt at the end of training.
WANDB_MODE=online python -m main \
'+name=Inference_Final_Fix' \
dataset=lcvn \
algorithm=ldit_video_social \
experiment=video_generation \
experiment.tasks=[test] \
experiment.ema.enable=False \
wandb.entity=<YOUR_WANDB_ENTITY> \
load="../outputs/social_dit_xl.ckpt" \
+logger.wandb.log_model=False \
+trainer.limit_test_batches=4

LCVN-AC is a latent-space actor–critic agent that learns navigation policies from intrinsic rollout rewards generated by LCVN-WM.
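As a mental model of this training signal, here is a deliberately tiny, self-contained sketch. It is not the LCVN-AC implementation: the world model is replaced by a stand-in function, and all names, dimensions, and learning rates are invented for illustration. It shows a one-step actor–critic updated from an intrinsic reward rather than an environment reward.

```python
# Toy actor–critic with an intrinsic reward from a stand-in "world model".
# NOT the LCVN-AC implementation; everything here is illustrative.
import math
import random

random.seed(0)

def world_model_reward(action):
    # Stand-in for an LCVN-WM rollout score: here, action 1 is imagined
    # to make more progress toward the instruction goal.
    return 1.0 if action == 1 else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [0.0, 0.0]   # actor: preference over 2 discrete actions
value = 0.0           # critic: baseline estimate of expected reward
lr_actor, lr_critic = 0.5, 0.5

for _ in range(200):
    probs = softmax(logits)
    action = 0 if random.random() < probs[0] else 1
    r = world_model_reward(action)      # intrinsic reward, no env feedback
    advantage = r - value               # critic serves as a baseline
    value += lr_critic * advantage      # critic update (toward observed r)
    # Policy-gradient update: grad of log pi(a) is one_hot(a) - probs.
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += lr_actor * advantage * grad

probs = softmax(logits)
```

After a few hundred updates the actor concentrates probability on the action the world model scores highly, which is the basic mechanism the section describes at scale.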
Navigate to the lcvn-ac directory:
cd ../lcvn-ac

Before training, verify config/train_dfot.yaml has the correct checkpoint paths:
dfot_checkpoint_path: ../outputs/social_dit_xl.ckpt
dfot_vae_checkpoint_path: ../outputs/vae.ckpt
Default (4 GPUs):
./train_ac.sh datamodule.batch_size=64

Single GPU:

NPROC=1 ./train_ac.sh datamodule.batch_size=64

Resume from checkpoint:

./train_ac.sh datamodule.batch_size=64 ckpt_path=../outputs/ac.ckpt

Test on a single trajectory:

python -m lcvn_ac.scripts.test_single_trajectory_real_dfot \
+checkpoint="../outputs/ac.ckpt" \
datamodule.root_data_dir="${oc.env:PWD}/../lcvn-wm/data/lcvn" \
datamodule.batch_size=1 \
trainer.devices=1

LCVN-Uni is an alternative agent that adopts an autoregressive multimodal backbone to jointly predict both actions and future observations in a single forward pass. It uses a separate environment from LCVN-WM + LCVN-AC.
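To illustrate what jointly predicting actions and future observations means in an autoregressive setting, here is a toy, self-contained sketch. It is not the LCVN-Uni model: the backbone function and token format are invented for illustration. One loop extends a single token sequence, alternately emitting an action token and a predicted next-observation token, with no environment feedback read back (open-loop).

```python
# Toy autoregressive interleaving of action and observation tokens.
# NOT the LCVN-Uni architecture; the backbone and tokens are invented.
def backbone(context):
    # Stand-in for a multimodal transformer: a trivial rule that always
    # "moves forward" and predicts the observation index advancing by one.
    last_obs = max(v for k, v in context if k == "obs")
    if context[-1][0] == "obs":
        return ("act", "forward")
    return ("obs", last_obs + 1)

# Conditioning: a language instruction and the initial egocentric view.
context = [("lang", "go to the red door"), ("obs", 0)]
for _ in range(3):                       # 3 navigation steps
    context.append(backbone(context))    # predict the next action
    context.append(backbone(context))    # predict the next observation

actions = [v for k, v in context if k == "act"]
observations = [v for k, v in context if k == "obs"]
print(actions, observations)
```

The point of the interleaving is that a single forward pass per token serves both prediction targets, which is the property the section attributes to LCVN-Uni.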
cd lcvn-uni
conda create -n lcvn-uni python=3.10
conda activate lcvn-uni
pip install torch==2.4.0
pip install -r requirements.txt --user

bash train.sh

Before running, make sure the dataset path, output path, and GPU settings in train.sh are correct.
bash eval.sh

Before running, make sure the checkpoint path, dataset path, and GPU settings in eval.sh are correct.
If you find this work useful, please consider citing:
@article{dong2026lcvn,
title={Language-Conditioned World Modeling for Visual Navigation},
author={Dong, Yifei and Wu, Fengyi and Dai, Yilong and Kong, Lingdong and Chen, Guangyu and Zhu, Xu and Hu, Qiyu and Wang, Tianyu and Garnica, Johnalbert and Liu, Feng and Huang, Siyu and Dai, Qi and Cheng, Zhi-Qi},
year={2026}
}

This work builds on DFoT, UniWM, and LUMOS. Thanks to all the authors for their great work.

