LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

🎉 News • 📖 Introduction • 📊 Performance • 🏗️ Architecture • 🏋️ Training • 🧪 Evaluation • 🎈 Citation

🎉 News

[2026, May 24] Evaluation code and model checkpoints are available.
[2026, June 7] Training code is now available in train/.

📖 Introduction

This repository hosts the code and model weights of LLaVA-UHD v4, a multimodal large language model (MLLM) designed for efficient high-resolution visual encoding. LLaVA-UHD v4 rethinks the conventional global-encoding-plus-post-ViT-compression paradigm and introduces a slice-based encoding framework with intra-ViT early compression. By moving token reduction into shallow ViT layers, our model substantially reduces the computational cost of visual encoding while preserving fine-grained perception ability.

Across eight standard benchmarks covering document understanding, OCR, mathematical reasoning, and general VQA, LLaVA-UHD v4 matches or even surpasses a post-ViT compression baseline under the same 16× final compression ratio, while reducing visual-encoding FLOPs by 55.8%. These results demonstrate that aggressive token compression can be performed inside the vision encoder without sacrificing downstream performance, offering a practical path toward scalable high-resolution MLLMs.

📊 Performance

The figure above highlights the core efficiency–performance trade-off of LLaVA-UHD v4. Across training scales from 4M to 64M samples, LLaVA-UHD v4 closely tracks the performance of the strong post-ViT compression baseline, indicating that intra-ViT early compression preserves the model's scaling behavior. At the same time, by moving part of the token reduction into the vision encoder, LLaVA-UHD v4 reduces visual-encoding FLOPs from 3555G to 1573G, achieving a 55.8% reduction in computation.
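
As a quick sanity check, the reported saving can be reproduced from the two FLOPs figures quoted above. This is plain arithmetic on those numbers, not a measurement; the variable and function names are illustrative.

```python
# Visual-encoding cost figures quoted above, in GFLOPs.
baseline_gflops = 3555.0   # post-ViT compression baseline
uhd_v4_gflops = 1573.0     # LLaVA-UHD v4 with intra-ViT early compression


def relative_reduction(before: float, after: float) -> float:
    """Return the fractional saving when a cost drops from `before` to `after`."""
    return (before - after) / before


reduction = relative_reduction(baseline_gflops, uhd_v4_gflops)
cost_ratio = baseline_gflops / uhd_v4_gflops

print(f"reduction: {reduction:.1%}")                   # reduction: 55.8%
print(f"baseline / v4 cost ratio: {cost_ratio:.2f}x")  # baseline / v4 cost ratio: 2.26x
```

Note that the ratio describes visual-encoding FLOPs only; it does not by itself imply the same end-to-end wall-clock speedup.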

🏗️ Architecture

Unlike previous high-resolution MLLMs that encode the full image globally and compress visual tokens only after the ViT, LLaVA-UHD v4 adopts slice-based encoding and moves part of the compression directly into the vision encoder. The intra-ViT compressor first performs local window attention to aggregate neighboring visual information, then applies pixel-unshuffle and MLP-based fusion to reduce the token count. As a result, the remaining ViT layers operate on a much shorter visual sequence, substantially lowering the cost of high-resolution visual encoding while maintaining strong fine-grained perception.
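
To illustrate how pixel-unshuffle shortens the token sequence, here is a minimal pure-Python sketch. It is not the repository's implementation: the downscale factor `r = 2`, the toy grid size, and the order in which neighboring channels are concatenated are all assumptions made for illustration, and the window-attention and MLP-fusion steps are omitted.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as [row][col][channel]


def pixel_unshuffle(grid: Grid, r: int) -> Grid:
    """Fold each r x r neighborhood of tokens into a single, wider token.

    Input shape:  (H, W, C)
    Output shape: (H // r, W // r, C * r * r)
    """
    height, width = len(grid), len(grid[0])
    if height % r or width % r:
        raise ValueError("grid height and width must be divisible by r")
    out: Grid = []
    for top in range(0, height, r):
        row = []
        for left in range(0, width, r):
            merged_token: List[float] = []
            # Concatenate the channels of the r*r neighbors (row-major order,
            # an assumed convention for this sketch).
            for dy in range(r):
                for dx in range(r):
                    merged_token.extend(grid[top + dy][left + dx])
            row.append(merged_token)
        out.append(row)
    return out


# Toy 4 x 4 grid of 2-channel tokens; each token stores its own (row, col).
tokens = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
merged = pixel_unshuffle(tokens, r=2)

tokens_before = len(tokens) * len(tokens[0])
tokens_after = len(merged) * len(merged[0])
print(tokens_before, "->", tokens_after, "tokens; channels",
      len(tokens[0][0]), "->", len(merged[0][0]))
# 16 -> 4 tokens; channels 2 -> 8
```

The token count drops by a factor of r² while the channel width grows by the same factor, so no information is discarded at this step. In the architecture described above, an MLP would then fuse the widened tokens, and every later ViT layer attends over the shorter sequence.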

🏋️ Training

Training code is available in train/. It implements the slice-based encoding with intra-ViT early compression and supports the full pipeline: stage 1 → stage 2 → stage 3 → stage 4.

1) Prepare environment

cd train
# Use your own virtual environment path
source /path/to/venv/bin/activate
pip install -r requirements.txt

2) Run training

# From the repository root (skip if you are already in train/)
cd train

# Full four-stage pipeline (stage 1 → stage 2 → stage 3 → stage 4)
PREFIX=my_exp bash model_vlu_minicpm/model_tunnel/run_tunnel.sh \
  --model_type uhd_mlp_insert_window_attention_ViTmlp_4_4 --insert_layer_id 6

# Or run a single-stage SFT
PREFIX=my_exp bash model_vlu_minicpm/model_tunnel/run_sft.sh \
  --model_type uhd_mlp_insert_window_attention_ViTmlp_4_4 --insert_layer_id 6

Configure data/model paths (e.g. LLM_PATH, VOCABS_PATH, VPM_PATH, STAGE1_FILE, STAGE4_FILE) via environment variables or by editing the /path/to/... placeholders in the scripts. See the scripts under train/model_vlu_minicpm/model_tunnel/ for stage arguments, checkpoint resuming, and override options.
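
If you configure paths through environment variables, a small pre-flight check can catch unset values before a long run starts. The sketch below is a hypothetical helper, not part of the repository; it only knows the variable names listed above (the scripts may read others), and the example paths are made up.

```python
from typing import List, Mapping

# Names taken from the paragraph above; the training scripts may use more.
REQUIRED_VARS = ["LLM_PATH", "VOCABS_PATH", "VPM_PATH", "STAGE1_FILE", "STAGE4_FILE"]


def find_unset(env: Mapping[str, str], names: List[str] = REQUIRED_VARS) -> List[str]:
    """Return names that are missing, empty, or still a /path/to/... placeholder."""
    return [
        name for name in names
        if not env.get(name) or env[name].startswith("/path/to/")
    ]


# Illustrative environment: one real-looking path, one placeholder, one empty value.
example_env = {"LLM_PATH": "/data/llm", "VPM_PATH": "/path/to/vpm", "STAGE1_FILE": ""}
print(find_unset(example_env))
# ['VOCABS_PATH', 'VPM_PATH', 'STAGE1_FILE', 'STAGE4_FILE']
```

In practice you would call `find_unset(os.environ)` (after `import os`) and abort if the returned list is non-empty. If you edit the placeholders in the scripts directly instead, this check does not apply.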

🧪 Evaluation

1) Prepare environment

cd vlmevalkit
# Use your own virtual environment path
source /path/to/venv/bin/activate
pip install -r requirements.txt

If you want run_eval.sh to auto-activate your environment, set:

export VENV_PATH=/path/to/venv

Note: some benchmarks require an LLM judge, so set OPENAI_API_KEY before running evaluation.
If needed (for example, when the judge is served from a custom API endpoint), you can also set OPENAI_API_BASE, or use OPENAI_API_KEY_JUDGE / OPENAI_API_BASE_JUDGE.

2) Run evaluation

# From the repository root (skip if you are already in vlmevalkit/)
cd vlmevalkit

export MODEL_PATH=/path/to/model_or_checkpoint
export MODEL_NAME=MiniCPM_4_V
export DATASETS="MMMU_DEV_VAL MathVista_MINI MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench"
export SAVE_NAME=llava_uhd_v4_eval

# Optional settings
export SAVE_ROOT=/path/to/save/root
export GPU_NUM=8

bash ./scripts/run_eval.sh "$MODEL_PATH" "$MODEL_NAME" "$DATASETS" "$SAVE_NAME"

🎈 Citation

If you find LLaVA-UHD v4 helpful, please cite us.

@misc{fang2026llavauhdv4makesefficient,
      title={LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?}, 
      author={Kechen Fang and Yihua Qin and Chongyi Wang and Wenshuo Ma and Tianyu Yu and Yuan Yao},
      year={2026},
      eprint={2605.08985},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.08985}, 
}
