AwareVLN equips VLN with sparse self-aware reasoning at key navigation nodes. A unified VLM switches between [REASON] and [ACT]; an automatic data engine provides scalable supervision.
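As a rough illustration of the two-mode idea, the sketch below routes a piece of model output according to a leading [REASON] or [ACT] tag. The output format is an assumption made only for this example (the text above does not specify it), and `split_mode` and `ModeOutput` are hypothetical names, not part of the AwareVLN code base.

```python
import re
from typing import NamedTuple

# Assumed format: the generated text begins with "[REASON]" or "[ACT]".
MODE_PATTERN = re.compile(r"^\s*\[(REASON|ACT)\]\s*(.*)$", re.DOTALL)


class ModeOutput(NamedTuple):
    mode: str     # "REASON" or "ACT"
    content: str  # text following the tag, stripped of surrounding whitespace


def split_mode(text: str) -> ModeOutput:
    """Split a tagged output string into its mode and its content."""
    match = MODE_PATTERN.match(text)
    if match is None:
        raise ValueError("output does not start with [REASON] or [ACT]")
    return ModeOutput(match.group(1), match.group(2).strip())
```

A caller could then branch on `split_mode(output).mode`, for example storing reasoning text as context and passing action text to the controller.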
To build the training environment, run:
./environment_setup.sh awarevln
conda activate awarevln

Training annotations of reasoning are produced by our automatic data engine, which labels sparse self-aware reasoning at key nodes. Download from Dataset and extract videos.tar.gz in each subfolder.
- r2r / rxr: Trajectories from rollouts of an existing policy, with corrections when needed; reasoning annotations from our data engine.
- r2rfollow / rxrfollow: Trajectories that follow expert paths; reasoning annotations from our data engine.
- Human: Not included. Follow NaVILA-Dataset: use the video IDs, download with yt-dlp, and extract frames via scripts/extract_rawframes.py in the NaVILA repo.

The data should be structured as follows:
AwareVLN-Dataset
├─ reason
| ├─ r2r
| | ├─ _anno_cot
| | | ├─ annotations_shuffle_uni.json
| | | ├─ cot_new.json
| | ├─ videos
| ├─ rxr
| | ├─ ...
| ├─ r2rfollow
| | ├─ ...
| ├─ rxrfollow
| | ├─ ...
├─ Human
| ├─ raw_frames
| | ├─ <video_id>
| | | ├─ 0001.jpg
| | | ├─ ...
| ├─ annotations_shuffled.json

We start from NaVILA-style VILA (Llama-3 8B + SigLIP + mm_projector, 8 frames), and fine-tune with our reasoning data to learn self-aware reasoning. The pretrained model and our trained AwareVLN weights are available here.
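Before launching training, it can help to confirm that the dataset matches the layout shown above. The following is a minimal sketch, assuming the data root points at the AwareVLN-Dataset directory itself. `find_missing` is a hypothetical helper, not a script shipped with the repository, and entries elided with "..." in the tree are deliberately left unchecked.

```python
from pathlib import Path
from typing import List

# Relative paths copied from the layout shown above. Only entries that are
# spelled out in the tree are listed; elided ("...") contents are not guessed.
EXPECTED_PATHS = [
    "reason/r2r/_anno_cot/annotations_shuffle_uni.json",
    "reason/r2r/_anno_cot/cot_new.json",
    "reason/r2r/videos",
    "reason/rxr",
    "reason/r2rfollow",
    "reason/rxrfollow",
    "Human/raw_frames",
    "Human/annotations_shuffled.json",
]


def find_missing(dataset_root: Path) -> List[str]:
    """Return the expected relative paths that do not exist under dataset_root."""
    return [rel for rel in EXPECTED_PATHS if not (dataset_root / rel).exists()]
```

For example, `find_missing(Path("/path/to/data"))` returns an empty list when every listed entry is present, and otherwise the entries that still need to be downloaded or extracted.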
export AWAREVLN_DATA_ROOT=/path/to/data
bash scripts/train/sft_8frames.sh

This repository builds on VLN-CE, which relies on older versions of Habitat-Lab and Habitat-Sim.
- Create conda env awarevln-eval (Python 3.10)
conda create -n awarevln-eval python=3.10
conda activate awarevln-eval
- Build Habitat-Sim & Lab (v0.1.7) from source
Follow the VLN-CE setup guide. To resolve NumPy compatibility issues, apply the following hotfix:
python evaluation/scripts/habitat_sim_autofix.py # replace habitat_sim/utils/common.py
- Install VLN-CE dependencies
pip install -r evaluation/requirements.txt
- Install VILA dependencies
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"
pip install -e ".[eval]"
pip install git+https://github.com/huggingface/transformers@v4.37.2
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* "$site_pkg_path"/transformers/
cp -rv ./llava/train/deepspeed_replace/* "$site_pkg_path"/deepspeed/
- Fix WebDataset version
pip install webdataset==0.1.103

Follow VLN-CE and download R2R / RxR annotations and MP3D scenes under evaluation/data/ (Val-Unseen, monocular RGB):
evaluation/data/datasets
├─ RxR_VLNCE_v0
| ├─ val_unseen
| | ├─ val_unseen_guide.json.gz
| | ├─ ...
├─ R2R_VLNCE_v1-3_preprocessed
| ├─ val_unseen
| | ├─ val_unseen.json.gz
| | ├─ ...
evaluation/data/scene_datasets
├─ mp3d
| ├─ 17DRP5sb8fy
| | ├─ 17DRP5sb8fy.glb
| | ├─ ...

- Trained AwareVLN weights are available here, or use your own outputs/.
- Run evaluation on R2R-CE using:
cd evaluation
bash scripts/eval/r2r.sh
Examples:
- Single GPU:
MODEL_PATH=../ck/awarevln TOTAL_CHUNKS=1 GPU_LIST="0" bash scripts/eval/r2r.sh
- Multiple GPUs (e.g., 8 GPUs):
MODEL_PATH=../ck/awarevln TOTAL_CHUNKS=8 GPU_LIST="0,1,2,3,4,5,6,7" bash scripts/eval/r2r.sh
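The TOTAL_CHUNKS and GPU_LIST settings in the examples above suggest that the evaluation episodes are divided into chunks, each handled by one GPU. The sketch below shows one plausible way to do such a split. It is an assumption for illustration, not the logic of scripts/eval/r2r.sh, and both function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], total_chunks: int) -> List[List[T]]:
    """Split items into total_chunks contiguous chunks of near-equal size."""
    if total_chunks < 1:
        raise ValueError("total_chunks must be at least 1")
    chunk_size = math.ceil(len(items) / total_chunks)
    return [
        list(items[i * chunk_size:(i + 1) * chunk_size])
        for i in range(total_chunks)
    ]


def assign_chunks(gpu_list: str, total_chunks: int) -> Dict[int, str]:
    """Map each chunk index to a GPU id from a comma-separated list, round-robin."""
    gpus = [g.strip() for g in gpu_list.split(",") if g.strip()]
    if not gpus:
        raise ValueError("gpu_list must name at least one GPU")
    return {chunk: gpus[chunk % len(gpus)] for chunk in range(total_chunks)}
```

With ceiling-sized chunks the last chunk can be shorter than the others, or empty when there are fewer items than chunks. With as many GPUs as chunks, as in the 8-GPU example, each chunk gets its own device.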
- Run evaluation on RxR-CE using:
MODEL_PATH=../ck/awarevln bash scripts/eval/rxr.sh
- Results are saved under evaluation/eval_awarevln/<CKPT_NAME>/. Metrics are aggregated automatically; to re-run the aggregation (replace NUM_CHUNKS with the number of chunks used for evaluation):
python scripts/eval_jsons.py eval_awarevln/awarevln/VLN-CE-v1/val_unseen NUM_CHUNKS
python scripts/eval_jsons.py eval_awarevln/awarevln/RxR-VLN-CE-v1/val_unseen NUM_CHUNKS

AwareVLN performs structured reasoning during navigation: for example, detecting a misinterpreted turn and issuing a corrective plan, or recognizing a completed subtask and planning the next phase aligned with the instruction.
@article{guo2026awarevln,
title={AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation},
author={Wenxuan Guo and Xiuwei Xu and Yichen Liu and Xiangyu Li and Hang Yin and Huangxing Chen and Wenzhao Zheng and Jianjiang Feng and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2605.22816},
year={2026},
url={https://arxiv.org/abs/2605.22816},
}
