- [2026, May 24] Evaluation code and model checkpoints are available.
- [2026, June 7] Training code is now available in train/.
This repository hosts the code and model weights of LLaVA-UHD v4, a multimodal large language model (MLLM) designed for efficient high-resolution visual encoding. LLaVA-UHD v4 rethinks the conventional global-encoding-plus-post-ViT-compression paradigm and introduces a slice-based encoding framework with intra-ViT early compression. By moving token reduction into shallow ViT layers, our model substantially reduces the computational cost of visual encoding while preserving fine-grained perception ability.
Across eight standard benchmarks covering document understanding, OCR, mathematical reasoning, and general VQA, LLaVA-UHD v4 matches or even surpasses a post-ViT compression baseline under the same 16× final compression ratio, while reducing visual-encoding FLOPs by 55.8%. These results demonstrate that aggressive token compression can be performed inside the vision encoder without sacrificing downstream performance, offering a practical path toward scalable high-resolution MLLMs.
The figure above highlights the core efficiency–performance trade-off of LLaVA-UHD v4. Across training scales from 4M to 64M samples, LLaVA-UHD v4 closely tracks the performance of the strong post-ViT compression baseline, indicating that intra-ViT early compression preserves the model's scaling behavior. At the same time, by moving part of the token reduction into the vision encoder, LLaVA-UHD v4 reduces visual-encoding FLOPs from 3555G to 1573G, achieving a 55.8% reduction in computation.
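The reported reduction can be re-derived from the two FLOPs figures quoted above. The snippet below is only an arithmetic check; the function and variable names are illustrative and do not come from the repository.

```python
def flops_reduction(baseline_gflops: float, reduced_gflops: float) -> float:
    """Return the fractional reduction in FLOPs relative to the baseline."""
    if baseline_gflops <= 0:
        raise ValueError("baseline_gflops must be positive")
    return (baseline_gflops - reduced_gflops) / baseline_gflops


# Visual-encoding FLOPs quoted in the text above (in GFLOPs).
baseline = 3555.0   # post-ViT compression baseline
uhd_v4 = 1573.0     # LLaVA-UHD v4

reduction = flops_reduction(baseline, uhd_v4)
print(f"Visual-encoding FLOPs reduction: {reduction:.1%}")
```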
Unlike previous high-resolution MLLMs that encode the full image globally and compress visual tokens only after the ViT, LLaVA-UHD v4 adopts slice-based encoding and moves part of the compression directly into the vision encoder. The intra-ViT compressor first performs local window attention to aggregate neighboring visual information, then applies pixel-unshuffle and MLP-based fusion to reduce the token count. As a result, the remaining ViT layers operate on a much shorter visual sequence, substantially lowering the cost of high-resolution visual encoding while maintaining strong fine-grained perception.
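To make the token-reduction step concrete, here is a minimal pure-Python sketch of pixel-unshuffle on a grid of tokens. It is not the repository's implementation: the downscale factor of 2, the 4×4 grid, and the channel ordering are assumptions chosen for illustration, and the window attention and the learned MLP fusion are omitted.

```python
def pixel_unshuffle(grid, r):
    """Merge each r x r block of neighboring tokens into one token.

    grid: H x W nested list, where each token is a list of channel values.
    Returns an (H/r) x (W/r) grid whose tokens have r*r times more channels.
    """
    h, w = len(grid), len(grid[0])
    if h % r or w % r:
        raise ValueError("grid height and width must be divisible by r")
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            # Concatenate the channels of the r*r tokens in this block.
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# Illustrative 4x4 grid of single-channel tokens, numbered 0..15 row by row.
tokens = [[[row * 4 + col] for col in range(4)] for row in range(4)]
merged = pixel_unshuffle(tokens, 2)

num_before = len(tokens) * len(tokens[0])
num_after = len(merged) * len(merged[0])
print(num_before, "tokens ->", num_after, "tokens,",
      len(merged[0][0]), "channels each")
```

In a full compressor, an MLP would then project the widened tokens back to the ViT hidden size, so the later layers see a sequence that is r² times shorter at the original width.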
Training code is available in train/. It implements the slice-based encoding with intra-ViT early compression and supports the full pipeline: stage 1 → stage 2 → stage 3 → stage 4.
cd train
# Use your own virtual environment path
source /path/to/venv/bin/activate
pip install -r requirements.txt

cd train
# Full four-stage pipeline (stage 1 → stage 2 → stage 3 → stage 4)
PREFIX=my_exp bash model_vlu_minicpm/model_tunnel/run_tunnel.sh \
--model_type uhd_mlp_insert_window_attention_ViTmlp_4_4 --insert_layer_id 6
# Or run a single-stage SFT
PREFIX=my_exp bash model_vlu_minicpm/model_tunnel/run_sft.sh \
--model_type uhd_mlp_insert_window_attention_ViTmlp_4_4 --insert_layer_id 6

Configure data/model paths (e.g. LLM_PATH, VOCABS_PATH, VPM_PATH, STAGE1_FILE, STAGE4_FILE) via environment variables or by editing the /path/to/... placeholders in the scripts. See the scripts under train/model_vlu_minicpm/model_tunnel/ for stage arguments, checkpoint resuming, and override options.
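Before launching a run, it may help to confirm that the path variables are actually set. The helper below is a hypothetical pre-flight check, not part of the repository. It assumes you configure paths through environment variables, and it covers only the five variable names mentioned above, which the scripts may not treat as exhaustive.

```python
import os

# Variable names taken from the text above; the scripts may read others too.
REQUIRED_VARS = ("LLM_PATH", "VOCABS_PATH", "VPM_PATH", "STAGE1_FILE", "STAGE4_FILE")
PLACEHOLDER_PREFIX = "/path/to/"


def find_unconfigured(env, required=REQUIRED_VARS):
    """Return the names that are unset, empty, or still a /path/to/ placeholder."""
    problems = []
    for name in required:
        value = env.get(name, "")
        if not value or value.startswith(PLACEHOLDER_PREFIX):
            problems.append(name)
    return problems


if __name__ == "__main__":
    unconfigured = find_unconfigured(os.environ)
    if unconfigured:
        print("Still unset or placeholder:", ", ".join(unconfigured))
    else:
        print("All listed paths are configured.")
```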
cd vlmevalkit
# Use your own virtual environment path
source /path/to/venv/bin/activate
pip install -r requirements.txt

If you want run_eval.sh to auto-activate your environment, set:
export VENV_PATH=/path/to/venv

Note: some benchmarks require an LLM judge; set OPENAI_API_KEY before evaluation.
If needed, you can also set OPENAI_API_BASE (or OPENAI_API_KEY_JUDGE / OPENAI_API_BASE_JUDGE).
cd vlmevalkit
export MODEL_PATH=/path/to/model_or_checkpoint
export MODEL_NAME=MiniCPM_4_V
export DATASETS="MMMU_DEV_VAL MathVista_MINI MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench"
export SAVE_NAME=llava_uhd_v4_eval
# Optional settings
export SAVE_ROOT=/path/to/save/root
export GPU_NUM=8
bash ./scripts/run_eval.sh "$MODEL_PATH" "$MODEL_NAME" "$DATASETS" "$SAVE_NAME"

If you find LLaVA-UHD v4 helpful, please cite us.
@misc{fang2026llavauhdv4makesefficient,
title={LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?},
author={Kechen Fang and Yihua Qin and Chongyi Wang and Wenshuo Ma and Tianyu Yu and Yuan Yao},
year={2026},
eprint={2605.08985},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.08985},
}

