🔮 Language-Conditioned World Modeling
for Visual Navigation

Yifei Dong¹*, Fengyi Wu¹*, Yilong Dai¹*, Lingdong Kong², Guangyu Chen¹, Xu Zhu¹, Qiyu Hu¹, Tianyu Wang¹, Johnalbert Garnica¹, Feng Liu³, Siyu Huang⁴, Qi Dai⁵, Zhi-Qi Cheng¹†
¹UW, ²NUS, ³Clemson, ⁴Drexel, ⁵Microsoft

(Figure: task overview)

LCVN studies language-conditioned visual navigation, where an embodied agent follows natural language instructions based solely on an initial egocentric observation — without access to goal images or intermediate environmental feedback. We formulate this as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions spanning diverse environments and instruction styles. We propose two complementary model families: LCVN-WM + LCVN-AC, combining a diffusion-based world model with a latent-space actor–critic agent trained via intrinsic rewards, and LCVN-Uni, an autoregressive multimodal architecture that jointly predicts actions and future observations in a single forward pass.


📑 Table of Contents

  • 🗂️ LCVN Dataset
  • 🧠 LCVN-WM
  • 🤖 LCVN-AC
  • 🔄 LCVN-Uni
  • 📄 Citation
  • 🤝 Acknowledgments

🗂️ LCVN Dataset

Data Preparation

We currently release a partial dataset for debugging and for demonstrating the data format. You can find it in the data_samples directory.

To set up the dataset, run the following commands to unzip the samples into the expected directory:

mkdir -p data/
unzip data_samples/data_samples.zip -d data/

Each trajectory folder should contain frame images and a traj_data.pkl:

data/lcvn/
├── {trajectory_id}/
│   ├── 0.jpg
│   ├── 1.jpg
│   ├── ...
│   └── traj_data.pkl
└── ...
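
To sanity-check the extracted samples, you can list one trajectory folder and inspect its traj_data.pkl. This is a minimal sketch; the exact schema of traj_data.pkl is not documented here, so the snippet only prints its type and, if it is a dictionary, its top-level keys:

TRAJ_ID=$(ls data/lcvn/ | head -n 1)   # pick the first trajectory ID
ls data/lcvn/${TRAJ_ID} | head         # frame images plus traj_data.pkl
python -c "import pickle; d = pickle.load(open('data/lcvn/${TRAJ_ID}/traj_data.pkl', 'rb')); print(type(d)); print(list(d.keys()) if isinstance(d, dict) else 'non-dict payload')"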

🧠 LCVN-WM

LCVN-WM is a diffusion-based world model that imagines future visual states conditioned on actions and language instructions. It is built on the LDiT (Language-conditioned Diffusion Transformer) backbone.

(Figure: LCVN-WM overview)

Installation

conda env create -f environment.yml
conda activate lcvn
pip install -r requirements.txt
pip install -e ./lcvn-ac --no-deps

Verify the installation:

python -c 'import lcvn_ac; import torch; print("Installation successful!")'

Create the outputs directory:

mkdir -p outputs

Weights & Biases (wandb)

Training uses wandb for logging. Set up your account and log in before running any training, then pass wandb.entity=<YOUR_WANDB_ENTITY> in the training commands below:

wandb login

If you do not have a wandb account or prefer to run without cloud logging, use offline mode by prepending WANDB_MODE=offline to any training command.
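
For example, the LCVN-WM training command from the Train section below can be run without cloud syncing by changing the prefix; all other arguments stay the same, the run is stored locally under wandb/, and it can be uploaded later with wandb sync if desired:

WANDB_MODE=offline python -m main \
  '+name=LcvnWM_Social_DiT_XL' \
  dataset=lcvn \
  algorithm=ldit_video_social \
  experiment=video_generation \
  wandb.entity=<YOUR_WANDB_ENTITY> \
  +logger.wandb.log_model=False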

Data Processing

Run the full pipeline script to process the raw dataset into a training-ready format:

cd lcvn-wm
bash build_lcvn_pipeline.sh

This script prepares the dataset end-to-end in the following order:

  1. Build metadata from raw trajectory folders (instructions loaded directly from traj_data.pkl)
  2. Encode frames → latents using stabilityai/sd-vae-ft-ema (initial encoding)
  3. Build initial cache from the SD-VAE latents
  4. Train custom VAE on this dataset
  5. Re-encode frames with the trained VAE
  6. Rebuild final cache using the new latents

Each step can be skipped individually by setting the corresponding RUN_STEPx=0 environment variable. The final dataset is placed under lcvn-wm/data/lcvn/.
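
For example, once the custom VAE from step 4 has already been trained, you could rerun only the re-encoding and cache-rebuild steps by disabling steps 1-4. The variable names below follow the RUN_STEPx pattern described above but are an assumption; check build_lcvn_pipeline.sh for the exact names:

RUN_STEP1=0 RUN_STEP2=0 RUN_STEP3=0 RUN_STEP4=0 bash build_lcvn_pipeline.sh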

Train

WANDB_MODE=online python -m main \
  '+name=LcvnWM_Social_DiT_XL' \
  dataset=lcvn \
  algorithm=ldit_video_social \
  experiment=video_generation \
  wandb.entity=<YOUR_WANDB_ENTITY> \
  +logger.wandb.log_model=False

Resume from checkpoint — add this argument to the command above:

load='"/path/to/checkpoint.ckpt"'

Note: The nested quotes are required by Hydra to parse paths containing =.

The latest checkpoint is automatically copied to outputs/social_dit_xl.ckpt at the end of training.

Test (Autoregressive Inference)

WANDB_MODE=online python -m main \
  '+name=Inference_Final_Fix' \
  dataset=lcvn \
  algorithm=ldit_video_social \
  experiment=video_generation \
  experiment.tasks=[test] \
  experiment.ema.enable=False \
  wandb.entity=<YOUR_WANDB_ENTITY> \
  load="../outputs/social_dit_xl.ckpt" \
  +logger.wandb.log_model=False \
  +trainer.limit_test_batches=4

🤖 LCVN-AC

LCVN-AC is a latent-space actor–critic agent that learns navigation policies from intrinsic rollout rewards generated by LCVN-WM.

Navigate to the lcvn-ac directory:

cd ../lcvn-ac

Before training, verify config/train_dfot.yaml has the correct checkpoint paths:

  • dfot_checkpoint_path: ../outputs/social_dit_xl.ckpt
  • dfot_vae_checkpoint_path: ../outputs/vae.ckpt
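
To quickly confirm both paths before launching training, you can grep the config file (a minimal check; adjust the pattern if the keys are nested differently in your copy of the config):

grep -nE 'dfot(_vae)?_checkpoint_path' config/train_dfot.yaml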

Train

Default (4 GPUs):

./train_ac.sh datamodule.batch_size=64

Single GPU:

NPROC=1 ./train_ac.sh datamodule.batch_size=64

Resume from checkpoint:

./train_ac.sh datamodule.batch_size=64 ckpt_path=../outputs/ac.ckpt

Test

python -m lcvn_ac.scripts.test_single_trajectory_real_dfot \
  +checkpoint="../outputs/ac.ckpt" \
  datamodule.root_data_dir='"${oc.env:PWD}/../lcvn-wm/data/lcvn"' \
  datamodule.batch_size=1 \
  trainer.devices=1

🔄 LCVN-Uni

LCVN-Uni is an alternative agent that adopts an autoregressive multimodal backbone to jointly predict both actions and future observations in a single forward pass. It uses a separate environment from LCVN-WM + LCVN-AC.

Environment Setup

cd lcvn-uni
conda create -n lcvn-uni python=3.10
conda activate lcvn-uni
pip install torch==2.4.0
pip install -r requirements.txt --user

Training

bash train.sh

Before running, make sure the dataset path, output path, and GPU settings in train.sh are correct.

Evaluation

bash eval.sh

Before running, make sure the checkpoint path, dataset path, and GPU settings in eval.sh are correct.


📄 Citation

If you find this work useful, please consider citing:

@article{dong2026lcvn,
  title={Language-Conditioned World Modeling for Visual Navigation},
  author={Dong, Yifei and Wu, Fengyi and Dai, Yilong and Kong, Lingdong and Chen, Guangyu and Zhu, Xu and Hu, Qiyu and Wang, Tianyu and Garnica, Johnalbert and Liu, Feng and Huang, Siyu and Dai, Qi and Cheng, Zhi-Qi},
  year={2026}
}

🤝 Acknowledgments

This work builds on DFoT, UniWM, and LUMOS. Thanks to all the authors for their great work.
