Shilin Yan1*†‡, Jintao Tong1,2*, Hongwei Xue1†, Xiaojun Tang1, Yangyang Wang1,
Kunyu Shi1, Guannan Zhang1, Ruixuan Li2‡, Yixiong Zou2‡
*Equal contribution  †Project Leader  ‡Corresponding Author
1Accio Team, Alibaba Group  2Huazhong University of Science and Technology
- [2026/04/10] Paper, code, and model weights released!
| 98% → 2% Tool Calls | SOTA Performance | Meta-Cognitive Wisdom |
|---|---|---|
| Reduces blind tool invocation by orders of magnitude | Best accuracy across 13 benchmarks among open-source 8B agents | Learns when to use tools, not just how |
"The art of being wise is the art of knowing what to overlook." โ William James
Current agentic multimodal models suffer from a critical meta-cognitive deficit: blind tool invocation. They reflexively call external tools even when queries are directly resolvable from the visual context, leading to severe latency overhead and reasoning degradation.
Existing RL methods attempt to fix this by scalarizing accuracy and tool-efficiency rewards. However, this coupled reward formulation creates an irreconcilable optimization dilemma: aggressive penalties suppress essential tool use, while mild penalties are entirely washed out by accuracy-reward variance during advantage normalization.
We propose Hierarchical Decoupled Policy Optimization (HDPO), a framework that maintains two orthogonal optimization channels:
- An accuracy channel that globally maximizes task correctness
- An efficiency channel that enforces tool parsimony exclusively within accurate trajectories

This decoupled design naturally induces an implicit curriculum (first learn to be correct, then learn to be efficient) without any manual scheduling.
Our resulting model, Metis, reduces tool invocations by orders of magnitude (e.g., from 98% to 2%) while simultaneously elevating reasoning accuracy across diverse benchmarks.
Comparison of tool-use efficiency and task performance. Existing methods rely heavily on tool calls. Metis uses tools selectively while achieving the best overall performance.
Overview of Metis. A strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning.
HDPO resolves reward coupling through three key components:
| Reward | Formula | Description |
|---|---|---|
| Accuracy | r_acc = 0.9 · r_ans + 0.1 · r_fmt | Binary answer correctness plus format compliance |
| Tool Efficiency | r_tool = 1/(T+1) if correct, else 0 | Inverse penalty on the number of tool invocations T, conditioned on correctness |
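Concretely, the two reward signals from the table above can be sketched in Python. This is a minimal illustration; the function and variable names are ours, not the repository's:

```python
def accuracy_reward(answer_correct: bool, format_ok: bool) -> float:
    """r_acc = 0.9 * r_ans + 0.1 * r_fmt, with binary components."""
    return 0.9 * float(answer_correct) + 0.1 * float(format_ok)

def tool_efficiency_reward(answer_correct: bool, num_tool_calls: int) -> float:
    """r_tool = 1 / (T + 1) for correct rollouts, 0 otherwise.

    A correct answer with zero tool calls earns the maximum of 1.0;
    each additional call shrinks the reward, and incorrect rollouts
    receive no efficiency reward at all.
    """
    if not answer_correct:
        return 0.0
    return 1.0 / (num_tool_calls + 1)
```

For example, a correct tool-free rollout gets `tool_efficiency_reward(True, 0) == 1.0`, while a correct rollout with one call gets `0.5`, so efficiency pressure only ever acts on already-correct trajectories.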
- Accuracy advantage: standard GRPO over all G rollouts per prompt
- Tool-efficiency advantage: conditional GRPO computed exclusively over the qualifying set Q of correct rollouts, applied only when |Q| ≥ 2
This prevents incorrect rollouts from inflating tool efficiency advantages and eliminates cross-objective gradient entanglement.
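The per-group advantage computation described above might look like the following sketch (illustrative only; the actual implementation lives in `hdpo_algos.py` and may differ in detail):

```python
import statistics

def hdpo_advantages(r_acc, r_tool, correct, eps=1e-6):
    """Advantage estimation for one prompt group of G rollouts (sketch).

    Accuracy advantages use standard GRPO normalization over all rollouts.
    Tool-efficiency advantages are normalized only within the qualifying
    set Q of correct rollouts (requires |Q| >= 2); every other rollout
    gets a zero tool advantage, so incorrect rollouts cannot inflate it.
    """
    G = len(r_acc)
    mu, sigma = statistics.mean(r_acc), statistics.pstdev(r_acc)
    adv_acc = [(r - mu) / (sigma + eps) for r in r_acc]

    Q = [i for i in range(G) if correct[i]]
    adv_tool = [0.0] * G
    if len(Q) >= 2:
        q_rewards = [r_tool[i] for i in Q]
        mu_q, sigma_q = statistics.mean(q_rewards), statistics.pstdev(q_rewards)
        for i in Q:
            adv_tool[i] = (r_tool[i] - mu_q) / (sigma_q + eps)
    return adv_acc, adv_tool
```

Because the two advantage vectors are normalized over different populations, a correct-but-tool-heavy rollout can receive a positive accuracy advantage and a negative tool advantage at the same time, which is exactly the decoupling the coupled-reward formulation cannot express.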
The final HDPO objective combines two independent clipped surrogate losses:
L_HDPO = w_acc · L_GRPO(A_acc) + w_tool · L_GRPO(A_tool)
Since advantages are normalized independently, each gradient component delivers a clean, orthogonal learning signal.
Comparison between coupled-reward optimization and HDPO. Existing methods entangle accuracy and efficiency into a single reward signal, while HDPO decouples them into separate branches.
| Model | V*Bench | HR4K | HR8K | TreeBench | MME-RW | SEED2+ | CharXiv(DQ) | CharXiv(RQ) |
|---|---|---|---|---|---|---|---|---|
| Open-Source Models | ||||||||
| LLaVA-OneVision | 75.4 | 63.0 | 59.8 | 37.3 | 57.4 | 65.4 | - | - |
| InternVL3-8B | 81.2 | 70.0 | 69.3 | 38.8 | - | 69.7 | 73.6 | 37.6 |
| Qwen2.5-VL-7B-Instruct | 75.3 | 65.5 | 62.1 | 37.0 | 56.8 | 70.4 | 72.7 | 40.2 |
| Qwen2.5-VL-32B-Instruct | 80.6 | 69.3 | 63.6 | 42.5 | 59.1 | 72.4 | 83.2 | 48.0 |
| Qwen3-VL-8B-Instruct | 86.4 | 78.9 | 74.6 | 40.7 | 61.9 | 71.0 | 83.0 | 46.3 |
| Agentic Multimodal Models | ||||||||
| Pixel-Reasoner | 84.3 | 72.6 | 66.1 | 39.0 | 64.4 | - | - | - |
| DeepEyes | 83.3 | 73.2 | 69.5 | 37.5 | 64.1 | - | - | - |
| Thyme | 82.2 | 77.0 | 72.0 | - | 64.8 | - | - | - |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 42.5 | 64.9 | 70.5 | 78.6 | 48.9 |
| Mini-o3 | 88.2 | 77.5 | 73.3 | - | 65.5 | - | - | - |
| SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | - | 67.9 | - | - | - |
| Skywork-R1V4-30B-A3B | 88.0 | 82.8 | 79.8 | - | 71.4 | - | - | - |
| Metis (Ours) | 91.1 | 83.5 | 82.0 | 45.2 | 70.3 | 72.5 | 83.4 | 54.1 |
| Model | MathVista | MathVerse | WeMath | DynaMath | LogicVista | Avg. |
|---|---|---|---|---|---|---|
| Open-Source Models | ||||||
| LLaVA-OneVision | 58.6 | 19.3 | 20.9 | - | 33.3 | - |
| Qwen2.5-VL-7B-Instruct | 68.3 | 45.6 | 34.6 | 53.3 | 45.9 | 49.5 |
| InternVL3-8B | 71.6 | 39.8 | 37.1 | - | 44.1 | - |
| Qwen3-VL-8B-Instruct | 76.3 | 61.3 | 38.8 | 65.5 | 54.9 | 59.4 |
| Text-only Reasoning Models | ||||||
| MM-Eureka-7B | 72.6 | 50.3 | 21.8 | - | 46.3 | - |
| ThinkLite-VL-7B | 75.1 | 52.1 | 41.8 | - | 42.7 | - |
| VL-Rethinker-7B | 74.9 | 54.2 | 36.3 | - | 42.7 | - |
| VLAA-Thinker-7B | 71.7 | - | 35.7 | - | 45.9 | - |
| Agentic Multimodal Models | ||||||
| DeepEyes | 70.1 | 47.3 | 38.9 | 55.0 | 47.7 | 51.8 |
| Thyme | 70.0 | - | 39.3 | - | 49.0 | - |
| DeepEyesV2 | 71.9 | 52.7 | 38.1 | 57.2 | 48.7 | 53.7 |
| Metis (Ours) | 78.0 | 65.9 | 65.2 | 69.2 | 56.2 | 66.9 |
| Method | V*Bench | HR4K | HR8K | CharXiv(RQ) | MathVista |
|---|---|---|---|---|---|
| Standard GRPO (w_tool=0) | 88.7 | 81.0 | 79.2 | 51.0 | 76.9 |
| HDPO (w_tool=0.10) | 88.0 | 83.5 | 81.0 | 52.7 | 77.4 |
| HDPO (w_tool=0.15, ours) | 91.1 | 83.5 | 82.0 | 54.1 | 78.0 |
| HDPO (w_tool=0.20) | 87.4 | 82.5 | 80.5 | 51.5 | 77.2 |
To illustrate the meta-cognitive tool-use behavior cultivated by HDPO, we present two representative cases from the main paper.
Case 1: Direct Reasoning (No Tool Needed)
When the query is resolvable from the visual context and parametric knowledge alone, Metis abstains from calling any external tool and answers directly. The agent learns to trust its own capabilities for queries within its competence, avoiding the latency overhead and noise injection of redundant tool calls.
The query can be resolved through visual understanding and prior knowledge alone. Metis abstains from tool invocation and answers directly, exemplifying the meta-cognitive restraint instilled by HDPO.
Case 2: Targeted Code Execution (Precision When Needed)
When fine-grained visual analysis exceeds the model's native resolution capabilities, Metis strategically invokes code execution to crop and enlarge the relevant region. Code execution is not a default fallback, but a precision instrument deployed only when the visual evidence at the original resolution is genuinely ambiguous.
The question requires comparing curves in a specific subplot region difficult to resolve at the original image scale. Metis invokes code to crop and enlarge the relevant area, enabling precise identification of curve behavior near the queried time step.
Takeaway: Metis has internalized a principled decision boundary, abstaining when internal knowledge suffices and selectively engaging external tools only when genuinely necessary.
- Python >= 3.10
- CUDA >= 12.1
- 8× GPUs (180GB each, e.g. B200) for RL training
```bash
git clone https://github.com/Accio-Lab/Metis.git
cd Metis

# Install verl (base RL framework) as editable dependency
pip install -e verl

# Install Metis (HDPO + tool server)
pip install -e ".[vllm,search_tool,python_code_dep]"
```

During RL training, an LLM judge evaluates whether the agent's answers are correct. The reward manager (`metis.py`) calls an OpenAI-compatible endpoint to get CORRECT / INCORRECT verdicts.
You can deploy any strong LLM as the judge. We recommend using vLLM:
```bash
# Deploy a judge model (tensor parallelism of 8 requires 8 GPUs, e.g. 8x A100)
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --port 8000 \
    --tensor-parallel-size 8

# Verify it's running:
curl http://localhost:8000/v1/models
```

After the judge is up, note the URL (e.g. `http://<judge_ip>:8000/v1`). You will pass it as `JUDGE_BASE_URL` in Step 4.
> **Tip:** Any OpenAI-compatible server works (vLLM, SGLang, TGI, or even a commercial API like OpenAI / DeepSeek). Just set `JUDGE_API_KEY` and `JUDGE_BASE_URL` accordingly.
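For illustration, here is one way the CORRECT / INCORRECT convention could be mapped to a binary answer reward. This is a hedged sketch; the actual parsing logic in `metis.py` may use a stricter prompt and format:

```python
def parse_judge_verdict(judge_response: str) -> float:
    """Map a judge model's free-form reply to a binary answer reward.

    Checks INCORRECT before CORRECT because 'CORRECT' is a substring
    of 'INCORRECT'; a naive substring test in the other order would
    misread every INCORRECT verdict as correct.
    """
    text = judge_response.strip().upper()
    if "INCORRECT" in text:
        return 0.0
    if "CORRECT" in text:
        return 1.0
    return 0.0  # treat unparseable verdicts as incorrect (conservative)
```

Treating unparseable replies as incorrect is a conservative choice: it may cost a little reward signal, but it never rewards an answer the judge did not explicitly endorse.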
The agent's text search tool calls a web search backend during training. You need an API key from one of the following providers:
| Provider | Env Variable | Sign Up |
|---|---|---|
| Serper (recommended) | `SERPER_API_KEY` | serper.dev |
| SerpApi | `SERPER_API_KEY` | serpapi.com |
| BrightData | `BRIGHTDATA_API_TOKEN` + `BRIGHTDATA_ZONE` | brightdata.com |
```bash
# Example: using Serper (default)
export SERPER_API_KEY="your-serper-api-key"
export SEARCH_PROVIDER="serper"   # "serper" | "serpapi" | "brightdata"
```

> **Tip:** If your training data does not contain search-type tasks, the search tool will not be invoked and you can skip this step.
The tool server provides sandboxed Python execution, text search, and image search capabilities for the agent.
> **Tip:** If you run training on a single node, you can skip this step; the training script will automatically start a local tool server. You only need to start it manually when deploying the tool server on a separate dedicated machine.
```bash
# On a dedicated machine (pass search API keys via environment)
SERPER_API_KEY="your-key" \
bash examples/train/start_tool_server.sh [PORT] [WORKERS]

# Example:
bash examples/train/start_tool_server.sh 30569 32

# After startup, the script will print:
#   URL: http://<server_ip>:30569/get_observation
# Save this URL for the next step.
```

```bash
# Single node (8 GPUs): tool server auto-starts locally
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
SERPER_API_KEY="your-serper-api-key" \
bash examples/train/train_metis.sh 0.15

# Single node with a remote tool server (started in Step 3)
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
REMOTE_TOOL_SERVER_URL=http://<server_ip>:<port>/get_observation \
bash examples/train/train_metis.sh 0.15

# Multi-node (head node, e.g. 2 nodes)
NODE_RANK=0 NNODES=2 \
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
SERPER_API_KEY="your-serper-api-key" \
REMOTE_TOOL_SERVER_URL=http://<server_ip>:<port>/get_observation \
bash examples/train/train_metis.sh 0.15

# Multi-node (worker node)
NODE_RANK=1 MASTER_ADDR=<head_ip> NNODES=2 \
bash examples/train/train_metis.sh 0.15
```

> **Tip:** The argument `0.15` is the tool-efficiency loss weight `w_tool`. See the ablation table for guidance on tuning.
| Variable | Description | Default |
|---|---|---|
| `MODEL_PATH` | Path to SFT checkpoint | Required |
| `TRAIN_DATA` | Training parquet files | Required |
| `VAL_DATA` | Validation parquet files | Required |
| `JUDGE_BASE_URL` | Base URL for the LLM judge endpoint | `http://localhost:8000/v1` |
| `JUDGE_API_KEY` | API key for the LLM judge | `"EMPTY"` |
| `SEARCH_PROVIDER` | Search backend to use | `"serper"` |
| `SERPER_API_KEY` | API key for Serper / SerpApi | Required for search |
| `BRIGHTDATA_API_TOKEN` | API token for BrightData SERP | Alternative to Serper |
| `BRIGHTDATA_ZONE` | BrightData zone name | Required if using BrightData |
| `REMOTE_TOOL_SERVER_URL` | URL to an external tool server | Auto-start locally |
| `WANDB_API_KEY` | Weights & Biases API key | Optional |
| `METIS_SESSION_DIR` | Directory for tool execution sessions | `/tmp/metis_sessions` |
```
Metis/
├── verl_tool/                       # Core Metis implementation
│   ├── trainer/
│   │   ├── main_ppo.py              # Training entry point
│   │   ├── config/                  # Hydra YAML configs
│   │   └── ppo/
│   │       ├── hdpo_algos.py        # HDPO advantage estimation (core algorithm)
│   │       ├── ray_trainer.py       # Ray-based PPO trainer with dual rewards
│   │       └── reward.py            # Reward computation utilities
│   ├── workers/
│   │   ├── hdpo_actor.py            # HDPO actor with dual-loss update
│   │   ├── hdpo_fsdp_worker.py      # FSDP worker integration
│   │   └── reward_manager/
│   │       └── metis.py             # Dual reward manager (accuracy + tool efficiency)
│   ├── agent_loop/                  # Multi-turn agent loop with tool use
│   └── servers/                     # Tool server (code execution, search)
│       ├── serve.py                 # FastAPI tool server entry point
│       ├── tools/
│       │   ├── metis.py             # Full tool: Python + text search + image search
│       │   └── metis_code.py        # Code-only tool variant
│       └── utils/
│           ├── ipython_executor.py  # Ray-managed Jupyter kernel pool
│           └── search_engine.py     # Async Google search with caching
├── verl/                            # Base RL framework (editable dependency)
├── examples/
│   └── train/
│       ├── train_metis.sh           # Multi-node HDPO training script
│       └── start_tool_server.sh     # Tool server launch script
├── paper/                           # LaTeX source
├── assets/                          # Figures for README
└── pyproject.toml
```
| File | Description |
|---|---|
| `verl_tool/trainer/ppo/hdpo_algos.py` | HDPO advantage estimation: conditional GRPO for tool efficiency |
| `verl_tool/workers/hdpo_actor.py` | Dual-loss policy update: w_acc · L_acc + w_tool · L_tool |
| `verl_tool/workers/reward_manager/metis.py` | Dual reward computation: accuracy + tool efficiency |
| `verl_tool/trainer/ppo/ray_trainer.py` | Ray-distributed PPO trainer with HDPO integration |
| `verl_tool/workers/hdpo_fsdp_worker.py` | FSDP integration for HDPO actor |
| `verl_tool/servers/tools/metis.py` | Tool environment: Python execution, text/image search |
Our SFT corpus is curated from publicly available tool-augmented multimodal trajectories (DeepEyesV2, V-Interaction, Thyme, OpenMMReasoner) through a three-stage pipeline:
1. Eradicating hallucinated environmental dynamics: execute all code in a sandbox and discard trajectories with execution failures
2. Isolating genuine tool necessity: filter out samples where the base model achieves pass@8 = 1 without tools
3. Multidimensional meta-cognitive filtering: an LLM judge evaluates visual relevance, reasoning coherence, and tool-use rationale
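The second filtering stage can be illustrated with a small sketch. The field name `tool_free_correct` is hypothetical; the real pipeline operates on full model rollouts rather than precomputed booleans:

```python
def needs_tools(rollout_correct: list) -> bool:
    """Stage-2 filter sketch: keep a sample only if the base model,
    answering WITHOUT tools, fails at least one of 8 attempts
    (i.e. pass@8 < 1). A sample the model always solves tool-free
    teaches nothing about genuine tool necessity."""
    assert len(rollout_correct) == 8
    return not all(rollout_correct)

def filter_corpus(samples):
    """Drop samples where tools are demonstrably unnecessary.

    Each sample carries `tool_free_correct`: 8 booleans from
    hypothetical tool-free rollouts of the base model.
    """
    return [s for s in samples if needs_tools(s["tool_free_correct"])]
```

Filtering on pass@8 rather than a single attempt makes the necessity test robust to sampling noise: one lucky tool-free success does not disqualify a sample, but eight consecutive successes do.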
| Hyperparameter | Value |
|---|---|
| Backbone | Qwen3-VL-8B-Instruct |
| Batch size | 128 |
| Rollouts per prompt (G) | 16 |
| Learning rate | 1e-6 |
| KL coefficient | 0 |
| Loss weights | w_acc = 1.0, w_tool = 0.15 |
| Max response length | 16,384 tokens |
| Training prompts | ~5K (45% perception, 36% search, 19% math/reasoning) |
If you find Metis or HDPO useful in your research, please consider citing:
```bibtex
@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}
```

This project is released under the Apache 2.0 License.
Metis is built upon the following open-source projects: