AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

CVPR 2026

💡 Introduction

AwareVLN equips vision-language navigation (VLN) agents with sparse self-aware reasoning at key navigation nodes. A single unified VLM switches between [REASON] and [ACT] modes, and an automatic data engine provides scalable supervision.
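As a rough illustration of what "sparse" reasoning means here, the toy loop below emits a reasoning step only at key nodes and an action at every step. This is a sketch, not the AwareVLN implementation: in the real system the VLM itself decides when to switch modes, whereas `is_key_node` is a hypothetical stand-in predicate, and following each reasoning step with an action in the same step is also an assumption.

```python
from typing import Callable, List

# Mode tokens named in the introduction.
REASON, ACT = "[REASON]", "[ACT]"


def run_episode(steps: int, is_key_node: Callable[[int], bool]) -> List[str]:
    """Return the sequence of modes used over an episode.

    `is_key_node` is a hypothetical predicate standing in for the model's
    own decision to reason; it is not part of the AwareVLN code base.
    """
    trace: List[str] = []
    for t in range(steps):
        if is_key_node(t):
            trace.append(REASON)  # sparse: only at key navigation nodes
        trace.append(ACT)         # an action is produced at every step
    return trace


if __name__ == "__main__":
    # Reason at steps 0 and 2 only; act at all four steps.
    print(run_episode(4, lambda t: t in {0, 2}))
```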

🚀 Training

Installation

To build the training environment, run:

./environment_setup.sh awarevln
conda activate awarevln

Dataset

Reasoning annotations for training are produced by our automatic data engine, which labels sparse self-aware reasoning at key nodes. Download the data from Dataset and extract videos.tar.gz in each subfolder.

r2r / rxr: Trajectories from rollouts of an existing policy, with corrections applied when needed; reasoning annotations come from our data engine.
r2rfollow / rxrfollow: Trajectories that follow expert paths; reasoning annotations come from our data engine.
Human: Not included. Follow the NaVILA-Dataset instructions: take the video IDs, download the videos with yt-dlp, and extract frames with scripts/extract_rawframes.py from the NaVILA repo.

The data should be organized as follows:

AwareVLN-Dataset
├─ reason
|   ├─ r2r
|   |    ├─ _anno_cot
|   |    |    ├─ annotations_shuffle_uni.json
|   |    |    ├─ cot_new.json
|   |    ├─ videos
|   ├─ rxr
|   |    ├─ ...
|   ├─ r2rfollow
|   |    ├─ ...
|   ├─ rxrfollow
|   |    ├─ ...
├─ Human
|   ├─ raw_frames
|   |    ├─ <video_id>
|   |    |    ├─ 0001.jpg
|   |    |    ├─ ...
|   ├─ annotations_shuffled.json
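Before launching training, it can help to confirm that the layout above is in place. The helper below is a minimal sketch, not part of the repository: `missing_dataset_paths` is a hypothetical name, and it only checks paths that the tree and the download instructions spell out (the two r2r annotation files, a `videos` folder per reasoning subfolder, and the Human data). The contents elided with `...` are not checked.

```python
from pathlib import Path
from typing import List

# Reasoning subfolders listed in the dataset description.
REASON_SPLITS = ("r2r", "rxr", "r2rfollow", "rxrfollow")


def missing_dataset_paths(root: str) -> List[str]:
    """Return expected paths (relative to `root`) that do not exist.

    `root` is the AwareVLN-Dataset directory itself.
    """
    base = Path(root)
    expected = [
        # Annotation files shown explicitly for r2r in the tree.
        base / "reason" / "r2r" / "_anno_cot" / "annotations_shuffle_uni.json",
        base / "reason" / "r2r" / "_anno_cot" / "cot_new.json",
    ]
    # videos.tar.gz is extracted in each reasoning subfolder.
    expected += [base / "reason" / split / "videos" for split in REASON_SPLITS]
    # Human data, prepared separately following NaVILA-Dataset.
    expected += [
        base / "Human" / "raw_frames",
        base / "Human" / "annotations_shuffled.json",
    ]
    return [p.relative_to(base).as_posix() for p in expected if not p.exists()]


if __name__ == "__main__":
    import sys

    for path in missing_dataset_paths(sys.argv[1]):
        print("missing:", path)
```

An empty result means every checked path exists; it does not validate the annotation contents.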

Training

We start from a NaVILA-style VILA model (Llama-3 8B + SigLIP + mm_projector, 8 frames) and fine-tune it on our reasoning data so that it learns self-aware reasoning. The pretrained model and our trained AwareVLN weights are available here.

export AWAREVLN_DATA_ROOT=/path/to/data
bash scripts/train/sft_8frames.sh

📊 Evaluation

Installation

This repository builds on VLN-CE, which relies on older versions of Habitat-Lab and Habitat-Sim.

Create conda env awarevln-eval (Python 3.10)

conda create -n awarevln-eval python=3.10
conda activate awarevln-eval

Build Habitat-Sim & Lab (v0.1.7) from source

Follow the VLN-CE setup guide. To resolve NumPy compatibility issues, apply the following hotfix:

python evaluation/scripts/habitat_sim_autofix.py # replace habitat_sim/utils/common.py

Install VLN-CE dependencies

pip install -r evaluation/requirements.txt

Install VILA dependencies

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

pip install -e .
pip install -e ".[train]"
pip install -e ".[eval]"

pip install git+https://github.com/huggingface/transformers@v4.37.2
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* "$site_pkg_path/transformers/"
cp -rv ./llava/train/deepspeed_replace/* "$site_pkg_path/deepspeed/"

Pin the WebDataset version

pip install webdataset==0.1.103

Data

Follow the VLN-CE instructions to download the R2R / RxR annotations and MP3D scenes, and place them under evaluation/data/ (Val-Unseen split, monocular RGB):

evaluation/data/datasets
├─ RxR_VLNCE_v0
|   ├─ val_unseen
|   |    ├─ val_unseen_guide.json.gz
|   |    ├─ ...
├─ R2R_VLNCE_v1-3_preprocessed
|   ├─ val_unseen
|   |    ├─ val_unseen.json.gz
|   |    ├─ ...
evaluation/data/scene_datasets
├─ mp3d
|   ├─ 17DRP5sb8fy
|   |    ├─ 17DRP5sb8fy.glb
|   |    ├─ ...
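A quick way to sanity-check a downloaded split is to count its episodes and scenes. The sketch below assumes the usual VLN-CE file format: a gzip-compressed JSON object with a top-level `"episodes"` list whose entries carry a `"scene_id"` field. That format is an assumption taken from VLN-CE rather than from this README, and `summarize_split` is a hypothetical helper, not a script in this repository.

```python
import gzip
import json
from typing import Dict


def summarize_split(path: str) -> Dict[str, int]:
    """Count episodes and distinct scenes in a .json.gz split file.

    Assumes a top-level "episodes" list with a "scene_id" per episode.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    episodes = data.get("episodes", [])
    scenes = {ep["scene_id"] for ep in episodes if ep.get("scene_id")}
    return {"episodes": len(episodes), "scenes": len(scenes)}


if __name__ == "__main__":
    import sys

    # e.g. evaluation/data/datasets/R2R_VLNCE_v1-3_preprocessed/val_unseen/val_unseen.json.gz
    print(summarize_split(sys.argv[1]))
```

Each distinct scene reported here should have a matching folder under evaluation/data/scene_datasets/mp3d.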

Running Evaluation

Trained AwareVLN weights are available here; alternatively, use your own checkpoint from outputs/.
Run evaluation on R2R-CE using:

cd evaluation
bash scripts/eval/r2r.sh

Examples:

Single GPU:

MODEL_PATH=../ck/awarevln TOTAL_CHUNKS=1 GPU_LIST="0" bash scripts/eval/r2r.sh

Multiple GPUs (e.g., 8 GPUs):

MODEL_PATH=../ck/awarevln TOTAL_CHUNKS=8 GPU_LIST="0,1,2,3,4,5,6,7" bash scripts/eval/r2r.sh
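In both examples above, TOTAL_CHUNKS equals the number of GPUs in GPU_LIST. The sketch below builds those variables from a list of GPU IDs under that one-chunk-per-GPU assumption; whether the script accepts other ratios is not stated here. `eval_env` is a hypothetical helper, not part of the repository.

```python
from typing import Dict, Sequence


def eval_env(model_path: str, gpu_ids: Sequence[int]) -> Dict[str, str]:
    """Build the environment variables used by the evaluation scripts.

    Assumes one chunk per GPU, as in the single- and multi-GPU examples.
    """
    if len(gpu_ids) == 0:
        raise ValueError("at least one GPU id is required")
    return {
        "MODEL_PATH": model_path,
        "TOTAL_CHUNKS": str(len(gpu_ids)),
        "GPU_LIST": ",".join(str(g) for g in gpu_ids),
    }


if __name__ == "__main__":
    import os
    import subprocess

    # Run from the evaluation/ directory, matching the commands above.
    env = {**os.environ, **eval_env("../ck/awarevln", [0, 1, 2, 3])}
    subprocess.run(["bash", "scripts/eval/r2r.sh"], env=env, check=True)
```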

Run evaluation on RxR-CE using:

MODEL_PATH=../ck/awarevln bash scripts/eval/rxr.sh

Results are saved under evaluation/eval_awarevln/<CKPT_NAME>/. Metrics are aggregated automatically; to re-run the aggregation manually:

python scripts/eval_jsons.py eval_awarevln/awarevln/VLN-CE-v1/val_unseen NUM_CHUNKS
python scripts/eval_jsons.py eval_awarevln/awarevln/RxR-VLN-CE-v1/val_unseen NUM_CHUNKS

🎬 Demo

AwareVLN performs structured reasoning during navigation. For example, it can detect a misinterpreted turn and issue a corrective plan, or recognize a completed subtask and plan the next phase in line with the instruction.

📜 Citation

@article{guo2026awarevln,
      title={AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation}, 
      author={Wenxuan Guo and Xiuwei Xu and Yichen Liu and Xiangyu Li and Hang Yin and Huangxing Chen and Wenzhao Zheng and Jianjiang Feng and Jie Zhou and Jiwen Lu},
      journal={arXiv preprint arXiv:2605.22816},
      year={2026},
      url={https://arxiv.org/abs/2605.22816}, 
}
