
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan1*†‡, Jintao Tong1,2*, Hongwei Xue1†, Xiaojun Tang1, Yangyang Wang1,
Kunyu Shi1, Guannan Zhang1, Ruixuan Li2‡, Yixiong Zou2‡

*Equal contribution  †Project Leader  ‡Corresponding Author

1Accio Team, Alibaba Group  2Huazhong University of Science and Technology



📰 News

  • 🔥 [2026/04/10] Paper, code, and model weights released!

✨ Highlights

  • 🎯 98% → 2% Tool Calls: reduces blind tool invocation by orders of magnitude
  • 📈 SOTA Performance: best accuracy across 13 benchmarks among open-source 8B agents
  • 🧠 Meta-Cognitive Wisdom: learns when to use tools, not just how

💡 Introduction

"The art of being wise is the art of knowing what to overlook." — William James

Current agentic multimodal models suffer from a critical meta-cognitive deficit: blind tool invocation 🔧❌. They reflexively call external tools even when queries are directly resolvable from the visual context, incurring severe latency overhead and degrading reasoning.

Existing RL methods attempt to fix this by scalarizing accuracy and tool-efficiency rewards into a single signal. However, this coupled reward formulation creates an irreconcilable optimization dilemma: aggressive penalties suppress essential tool use, while mild penalties are washed out by accuracy-reward variance during advantage normalization.

We propose Hierarchical Decoupled Policy Optimization (HDPO), a framework that maintains two orthogonal optimization channels:

  • 🎯 An accuracy channel that globally maximizes task correctness
  • ⚡ An efficiency channel that enforces tool parsimony exclusively within accurate trajectories

This decoupled design naturally induces an implicit curriculum 📚 (first learn to be correct, then learn to be efficient) without any manual scheduling.

Our resulting model, Metis, reduces tool invocations by orders of magnitude (e.g., from 98% to 2%) while simultaneously improving reasoning accuracy across diverse benchmarks.


Comparison of tool-use efficiency and task performance. Existing methods rely heavily on tool calls. Metis uses tools selectively while achieving the best overall performance.

🔬 Metis Framework


Overview of Metis. A strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning.

HDPO resolves reward coupling through three key components:

1๏ธโƒฃ Dual Reward Design

Reward Formula Description
๐ŸŽฏ Accuracy r_acc = 0.9 ยท r_ans + 0.1 ยท r_fmt Binary answer correctness + format compliance
โšก Tool Efficiency r_tool = 1/(T+1) if correct, else 0 Inverse penalty on tool invocations, conditioned on correctness
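A minimal sketch of these two rewards (variable names are illustrative; the actual implementation lives in the `metis.py` reward manager):

```python
def accuracy_reward(answer_correct: bool, format_ok: bool) -> float:
    """r_acc = 0.9 * r_ans + 0.1 * r_fmt, with both components binary."""
    return 0.9 * float(answer_correct) + 0.1 * float(format_ok)


def tool_efficiency_reward(answer_correct: bool, num_tool_calls: int) -> float:
    """r_tool = 1 / (T + 1) for correct answers, 0 otherwise: each extra
    tool call shrinks the reward, and incorrect rollouts earn nothing."""
    if not answer_correct:
        return 0.0
    return 1.0 / (num_tool_calls + 1)
```

A correct, tool-free rollout thus earns the maximal r_tool of 1.0, while adding a single tool call halves it.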

2๏ธโƒฃ Decoupled Advantage Estimation

  • ๐ŸŽฏ Accuracy advantage โ€” Standard GRPO over all G rollouts per prompt
  • โšก Tool efficiency advantage โ€” Conditional GRPO computed exclusively over the qualifying set Q = {correct rollouts with |Q| โ‰ฅ 2}

This prevents incorrect rollouts from inflating tool efficiency advantages and eliminates cross-objective gradient entanglement.
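The decoupled estimation can be sketched as follows (a simplified, sequence-level view; per-token details and the exact normalization in `hdpo_algos.py` may differ):

```python
import numpy as np


def group_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style group-relative advantage: z-score within the rollout group."""
    return (x - x.mean()) / (x.std() + eps)


def decoupled_advantages(r_acc, r_tool, correct, eps: float = 1e-6):
    """Accuracy advantages over all G rollouts; tool-efficiency advantages
    only over the qualifying set Q of correct rollouts (needs |Q| >= 2)."""
    r_acc = np.asarray(r_acc, dtype=float)
    a_acc = group_norm(r_acc, eps)
    a_tool = np.zeros_like(a_acc)
    q = np.asarray(correct, dtype=bool)
    if q.sum() >= 2:  # otherwise the efficiency channel stays silent
        a_tool[q] = group_norm(np.asarray(r_tool, dtype=float)[q], eps)
    return a_acc, a_tool
```

Within Q, a correct rollout that used fewer tool calls (higher r_tool) receives a positive efficiency advantage; incorrect rollouts receive exactly zero and cannot contaminate the channel.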

3๏ธโƒฃ Hierarchical Policy Update

The final HDPO objective combines two independent clipped surrogate losses:

L_HDPO = w_acc ยท L_GRPO(A_acc) + w_tool ยท L_GRPO(A_tool)

Since advantages are normalized independently, each gradient component delivers a clean, orthogonal learning signal.
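The combined update can be sketched as two independently clipped PPO-style surrogates sharing the same likelihood ratios (illustrative; the weights follow the defaults reported in the training details below):

```python
import numpy as np


def clipped_surrogate(ratio: np.ndarray, adv: np.ndarray, clip: float = 0.2) -> float:
    """Mean PPO clipped surrogate objective for one advantage channel."""
    return float(np.minimum(ratio * adv,
                            np.clip(ratio, 1 - clip, 1 + clip) * adv).mean())


def hdpo_objective(ratio, a_acc, a_tool, w_acc: float = 1.0, w_tool: float = 0.15) -> float:
    """L_HDPO = w_acc * L_GRPO(A_acc) + w_tool * L_GRPO(A_tool)."""
    ratio = np.asarray(ratio, dtype=float)
    return (w_acc * clipped_surrogate(ratio, np.asarray(a_acc, dtype=float))
            + w_tool * clipped_surrogate(ratio, np.asarray(a_tool, dtype=float)))
```

Because each channel is clipped and averaged on its own, gradients from the efficiency term cannot rescale or cancel gradients from the accuracy term.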


Comparison between coupled-reward optimization and HDPO. Existing methods entangle accuracy and efficiency into a single reward signal, while HDPO decouples them into separate branches.

📊 Results

👁️ Perception and Document Understanding

| Model | V*Bench | HR4K | HR8K | TreeBench | MME-RW | SEED2+ | CharXiv(DQ) | CharXiv(RQ) |
|---|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | | |
| LLaVA-OneVision | 75.4 | 63.0 | 59.8 | 37.3 | 57.4 | 65.4 | - | - |
| InternVL3-8B | 81.2 | 70.0 | 69.3 | 38.8 | - | 69.7 | 73.6 | 37.6 |
| Qwen2.5-VL-7B-Instruct | 75.3 | 65.5 | 62.1 | 37.0 | 56.8 | 70.4 | 72.7 | 40.2 |
| Qwen2.5-VL-32B-Instruct | 80.6 | 69.3 | 63.6 | 42.5 | 59.1 | 72.4 | 83.2 | 48.0 |
| Qwen3-VL-8B-Instruct | 86.4 | 78.9 | 74.6 | 40.7 | 61.9 | 71.0 | 83.0 | 46.3 |
| **Agentic Multimodal Models** | | | | | | | | |
| Pixel-Reasoner | 84.3 | 72.6 | 66.1 | 39.0 | 64.4 | - | - | - |
| DeepEyes | 83.3 | 73.2 | 69.5 | 37.5 | 64.1 | - | - | - |
| Thyme | 82.2 | 77.0 | 72.0 | - | 64.8 | - | - | - |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 42.5 | 64.9 | 70.5 | 78.6 | 48.9 |
| Mini-o3 | 88.2 | 77.5 | 73.3 | - | 65.5 | - | - | - |
| SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | - | 67.9 | - | - | - |
| Skywork-R1V4-30B-A3B | 88.0 | 82.8 | 79.8 | - | 71.4 | - | - | - |
| Metis (Ours) | 91.1 | 83.5 | 82.0 | 45.2 | 70.3 | 72.5 | 83.4 | 54.1 |

🧮 Mathematical and Logical Reasoning

| Model | MathVista | MathVerse | WeMath | DynaMath | LogicVista | Avg. |
|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | |
| LLaVA-OneVision | 58.6 | 19.3 | 20.9 | - | 33.3 | - |
| Qwen2.5-VL-7B-Instruct | 68.3 | 45.6 | 34.6 | 53.3 | 45.9 | 49.5 |
| InternVL3-8B | 71.6 | 39.8 | 37.1 | - | 44.1 | - |
| Qwen3-VL-8B-Instruct | 76.3 | 61.3 | 38.8 | 65.5 | 54.9 | 59.4 |
| **Text-only Reasoning Models** | | | | | | |
| MM-Eureka-7B | 72.6 | 50.3 | 21.8 | - | 46.3 | - |
| ThinkLite-VL-7B | 75.1 | 52.1 | 41.8 | - | 42.7 | - |
| VL-Rethinker-7B | 74.9 | 54.2 | 36.3 | - | 42.7 | - |
| VLAA-Thinker-7B | 71.7 | - | 35.7 | - | 45.9 | - |
| **Agentic Multimodal Models** | | | | | | |
| DeepEyes | 70.1 | 47.3 | 38.9 | 55.0 | 47.7 | 51.8 |
| Thyme | 70.0 | - | 39.3 | - | 49.0 | - |
| DeepEyesV2 | 71.9 | 52.7 | 38.1 | 57.2 | 48.7 | 53.7 |
| Metis (Ours) | 78.0 | 65.9 | 65.2 | 69.2 | 56.2 | 66.9 |

๐Ÿ” Ablation: Effect of w_tool

Method V*Bench HR4K HR8K CharXiv(RQ) MathVista
Standard GRPO (w_tool=0) 88.7 81.0 79.2 51.0 76.9
HDPO (w_tool=0.10) 88.0 83.5 81.0 52.7 77.4
HDPO (w_tool=0.15) โœ… 91.1 83.5 82.0 54.1 78.0
HDPO (w_tool=0.20) 87.4 82.5 80.5 51.5 77.2

🧠 Meta-Cognitive Tool Arbitration

To illustrate the meta-cognitive tool-use behavior cultivated by HDPO, we present two representative cases from the main paper.

💭 Case 1: Direct Reasoning — No Tool Needed

When the query is resolvable from the visual context and parametric knowledge alone, Metis abstains from calling any external tool and answers directly. The agent learns to trust its own capabilities for queries within its competence, avoiding the latency overhead and noise injection of redundant tool calls.


The query can be resolved through visual understanding and prior knowledge alone. Metis abstains from tool invocation and answers directly, exemplifying the meta-cognitive restraint instilled by HDPO.

🔬 Case 2: Targeted Code Execution — Precision When Needed

When fine-grained visual analysis exceeds the model's native resolution capabilities, Metis strategically invokes code execution to crop and enlarge the relevant region. Code execution is not a default fallback, but a precision instrument deployed only when the visual evidence at the original resolution is genuinely ambiguous.


The question requires comparing curves in a specific subplot region difficult to resolve at the original image scale. Metis invokes code to crop and enlarge the relevant area, enabling precise identification of curve behavior near the queried time step.
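The kind of crop-and-zoom operation the agent writes in such cases can be sketched in pure Python (a hypothetical illustration; the model actually emits image-library code inside the sandboxed executor):

```python
def crop_and_zoom(pixels, box, scale=2):
    """Crop the region (x0, y0, x1, y1) from a row-major pixel grid and
    enlarge it by integer nearest-neighbour upsampling, mimicking the
    agent's "crop then magnify" tool call on an ambiguous image region."""
    x0, y0, x1, y1 = box
    cropped = [row[x0:x1] for row in pixels[y0:y1]]
    return [
        [px for px in row for _ in range(scale)]  # repeat pixels horizontally
        for row in cropped for _ in range(scale)  # repeat rows vertically
    ]
```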

💡 Takeaway: Metis has internalized a principled decision boundary — abstaining when internal knowledge suffices, and selectively engaging external tools only when genuinely necessary.

๐Ÿ› ๏ธ Installation

๐Ÿ“‹ Prerequisites

  • ๐Ÿ Python >= 3.10
  • ๐Ÿ”ง CUDA >= 12.1
  • ๐Ÿ–ฅ๏ธ 8ร— GPUs (180GB each, e.g. B200) for RL training

๐Ÿ“ฆ Install

git clone https://github.com/Accio-Lab/Metis.git
cd Metis

# Install verl (base RL framework) as editable dependency
pip install -e verl

# Install Metis (HDPO + tool server)
pip install -e ".[vllm,search_tool,python_code_dep]"

🚀 Quick Start

1️⃣ Deploy the Judge Model (Required)

During RL training, an LLM judge evaluates whether the agent's answers are correct. The reward manager (metis.py) calls an OpenAI-compatible endpoint to obtain CORRECT / INCORRECT verdicts.

You can deploy any strong LLM as the judge. We recommend using vLLM:

```bash
# Deploy a judge model (on a GPU machine with 8 GPUs to match --tensor-parallel-size 8)
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --port 8000 \
    --tensor-parallel-size 8

# Verify it's running:
curl http://localhost:8000/v1/models
```

After the judge is up, note the URL (e.g. http://<judge_ip>:8000/v1). You will pass it as JUDGE_BASE_URL in Step 4.

💡 Tip: Any OpenAI-compatible server works (vLLM, SGLang, TGI, or even a commercial API like OpenAI / DeepSeek). Just set JUDGE_API_KEY and JUDGE_BASE_URL accordingly.
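The judge's free-form reply has to be mapped to a binary reward signal. A minimal sketch of request construction and verdict parsing (the prompt template and parsing rule here are illustrative stand-ins, not the exact logic in `metis.py`):

```python
def build_judge_request(question: str, reference: str, prediction: str,
                        model: str = "judge") -> dict:
    """Payload for any OpenAI-compatible /v1/chat/completions endpoint.
    The prompt wording below is a hypothetical example."""
    prompt = (
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    return {"model": model, "temperature": 0.0,
            "messages": [{"role": "user", "content": prompt}]}


def parse_verdict(reply: str) -> bool:
    """True iff the reply signals CORRECT (and not INCORRECT)."""
    text = reply.strip().upper()
    return "CORRECT" in text and "INCORRECT" not in text
```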

2๏ธโƒฃ Configure the Search API (Required for text search tool)

The agent's text search tool calls a web search backend during training. You need an API key from one of the following providers:

Provider Env Variable Sign Up
Serper (recommended) SERPER_API_KEY serper.dev
SerpApi SERPER_API_KEY serpapi.com
BrightData BRIGHTDATA_API_TOKEN + BRIGHTDATA_ZONE brightdata.com
# Example: using Serper (default)
export SERPER_API_KEY="your-serper-api-key"
export SEARCH_PROVIDER="serper"        # "serper" | "serpapi" | "brightdata"

๐Ÿ’ก Tip: If your training data does not contain search-type tasks, the search tool will not be invoked and you can skip this step.

3๏ธโƒฃ Start the Tool Server (Optional)

The tool server provides sandboxed Python execution, text search, and image search capabilities for the agent.

๐Ÿ’ก Tip: If you run training on a single node, you can skip this step โ€” the training script will automatically start a local tool server. You only need to start it manually when deploying the tool server on a separate dedicated machine.

# On a dedicated machine (pass search API keys via environment)
SERPER_API_KEY="your-key" \
bash examples/train/start_tool_server.sh [PORT] [WORKERS]

# Example:
bash examples/train/start_tool_server.sh 30569 32
# After startup, the script will print:
#   URL: http://<server_ip>:30569/get_observation
# Save this URL for the next step.

4๏ธโƒฃ Run HDPO Training

# Single node (8 GPUs) โ€” tool server auto-starts locally
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
SERPER_API_KEY="your-serper-api-key" \
bash examples/train/train_metis.sh 0.15

# Single node with a remote tool server (started in Step 3)
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
REMOTE_TOOL_SERVER_URL=http://<server_ip>:<port>/get_observation \
bash examples/train/train_metis.sh 0.15

# Multi-node (Head node, e.g. 2 nodes)
NODE_RANK=0 NNODES=2 \
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
SERPER_API_KEY="your-serper-api-key" \
REMOTE_TOOL_SERVER_URL=http://<server_ip>:<port>/get_observation \
bash examples/train/train_metis.sh 0.15

# Multi-node (Worker node)
NODE_RANK=1 MASTER_ADDR=<head_ip> NNODES=2 \
bash examples/train/train_metis.sh 0.15

๐Ÿ’ก The argument 0.15 is the tool-efficiency loss weight w_tool. See the ablation table for guidance on tuning.

โš™๏ธ Environment Variables

Variable Description Default
MODEL_PATH ๐Ÿ“‚ Path to SFT checkpoint Required
TRAIN_DATA ๐Ÿ“Š Training parquet files Required
VAL_DATA ๐Ÿ“Š Validation parquet files Required
JUDGE_BASE_URL ๐ŸŒ Base URL for the LLM judge endpoint http://localhost:8000/v1
JUDGE_API_KEY ๐Ÿ”‘ API key for the LLM judge "EMPTY"
SEARCH_PROVIDER ๐Ÿ” Search backend to use "serper"
SERPER_API_KEY ๐Ÿ” API key for Serper / SerpApi Required for search
BRIGHTDATA_API_TOKEN ๐Ÿ” API token for BrightData SERP Alternative to Serper
BRIGHTDATA_ZONE ๐Ÿ” BrightData zone name Required if using BrightData
REMOTE_TOOL_SERVER_URL ๐Ÿ–ฅ๏ธ URL to an external tool server Auto-start locally
WANDB_API_KEY ๐Ÿ“ˆ Weights & Biases API key Optional
METIS_SESSION_DIR ๐Ÿ’พ Directory for tool execution sessions /tmp/metis_sessions

๐Ÿ—๏ธ Project Structure

Metis/
โ”œโ”€โ”€ verl_tool/                          # ๐Ÿง  Core Metis implementation
โ”‚   โ”œโ”€โ”€ trainer/
โ”‚   โ”‚   โ”œโ”€โ”€ main_ppo.py                 # ๐Ÿš€ Training entry point
โ”‚   โ”‚   โ”œโ”€โ”€ config/                     # โš™๏ธ Hydra YAML configs
โ”‚   โ”‚   โ””โ”€โ”€ ppo/
โ”‚   โ”‚       โ”œโ”€โ”€ hdpo_algos.py           # โญ HDPO advantage estimation (core algorithm)
โ”‚   โ”‚       โ”œโ”€โ”€ ray_trainer.py          # ๐Ÿ”„ Ray-based PPO trainer with dual rewards
โ”‚   โ”‚       โ””โ”€โ”€ reward.py               # ๐ŸŽฏ Reward computation utilities
โ”‚   โ”œโ”€โ”€ workers/
โ”‚   โ”‚   โ”œโ”€โ”€ hdpo_actor.py               # โญ HDPO actor with dual-loss update
โ”‚   โ”‚   โ”œโ”€โ”€ hdpo_fsdp_worker.py         # ๐Ÿ”— FSDP worker integration
โ”‚   โ”‚   โ””โ”€โ”€ reward_manager/
โ”‚   โ”‚       โ””โ”€โ”€ metis.py                # โญ Dual reward manager (accuracy + tool efficiency)
โ”‚   โ”œโ”€โ”€ agent_loop/                     # ๐Ÿ” Multi-turn agent loop with tool use
โ”‚   โ””โ”€โ”€ servers/                        # ๐Ÿ› ๏ธ Tool server (code execution, search)
โ”‚       โ”œโ”€โ”€ serve.py                    # ๐ŸŒ FastAPI tool server entry point
โ”‚       โ””โ”€โ”€ tools/
โ”‚           โ”œโ”€โ”€ metis.py                # ๐Ÿงฐ Full tool: Python + text search + image search
โ”‚           โ”œโ”€โ”€ metis_code.py           # ๐Ÿ’ป Code-only tool variant
โ”‚           โ””โ”€โ”€ utils/
โ”‚               โ”œโ”€โ”€ ipython_executor.py # ๐Ÿ Ray-managed Jupyter kernel pool
โ”‚               โ””โ”€โ”€ search_engine.py    # ๐Ÿ” Async Google search with caching
โ”œโ”€โ”€ verl/                               # ๐Ÿ“ฆ Base RL framework (editable dependency)
โ”œโ”€โ”€ examples/
โ”‚   โ””โ”€โ”€ train/
โ”‚       โ”œโ”€โ”€ train_metis.sh              # ๐Ÿ“œ Multi-node HDPO training script
โ”‚       โ””โ”€โ”€ start_tool_server.sh        # ๐Ÿ“œ Tool server launch script
โ”œโ”€โ”€ paper/                              # ๐Ÿ“„ LaTeX source
โ”œโ”€โ”€ assets/                             # ๐Ÿ–ผ๏ธ Figures for README
โ””โ”€โ”€ pyproject.toml

📌 Core Files

| File | Description |
|---|---|
| verl_tool/trainer/ppo/hdpo_algos.py | ⭐ HDPO advantage estimation — conditional GRPO for tool efficiency |
| verl_tool/workers/hdpo_actor.py | ⭐ Dual-loss policy update: w_acc · L_acc + w_tool · L_tool |
| verl_tool/workers/reward_manager/metis.py | ⭐ Dual reward computation: accuracy + tool efficiency |
| verl_tool/trainer/ppo/ray_trainer.py | 🔄 Ray-distributed PPO trainer with HDPO integration |
| verl_tool/workers/hdpo_fsdp_worker.py | 🔗 FSDP integration for the HDPO actor |
| verl_tool/servers/tools/metis.py | 🧰 Tool environment: Python execution, text/image search |

📚 Training Details

🧊 SFT Stage (Cold Start)

Our SFT corpus is curated from publicly available tool-augmented multimodal trajectories (DeepEyesV2, V-Interaction, Thyme, OpenMMReasoner) through a three-stage pipeline:

  1. 🔧 Eradicating hallucinated environmental dynamics — execute all code in a sandbox and discard trajectories with execution failures
  2. 🧹 Isolating genuine tool necessity — filter out samples where the base model achieves pass@8 = 1 without tools
  3. 🧠 Multidimensional meta-cognitive filtering — an LLM judge evaluates visual relevance, reasoning coherence, and tool-use rationale
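The three filters can be sketched as a single keep/drop predicate (names are illustrative, and the judge criteria are collapsed to one boolean; pass@8 = 1 is interpreted as at least one of 8 tool-free rollouts being correct):

```python
def keep_for_sft(sandbox_ok: bool,
                 toolfree_rollouts_correct: list,
                 judge_approved: bool) -> bool:
    """Keep a trajectory only if (1) its code ran cleanly in the sandbox,
    (2) the base model cannot solve the task without tools, i.e. all
    tool-free rollouts fail, and (3) the LLM judge approves its visual
    relevance, reasoning coherence, and tool-use rationale."""
    if not sandbox_ok:
        return False  # stage 1: hallucinated environmental dynamics
    if any(toolfree_rollouts_correct):
        return False  # stage 2: pass@8 == 1 -> tools are unnecessary
    return judge_approved  # stage 3: meta-cognitive filtering
```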

⚡ RL Stage (HDPO)

| Hyperparameter | Value |
|---|---|
| 🏗️ Backbone | Qwen3-VL-8B-Instruct |
| 📦 Batch size | 128 |
| 🎲 Rollouts per prompt (G) | 16 |
| 📝 Learning rate | 1e-6 |
| 🔒 KL coefficient | 0 |
| ⚖️ Loss weights | w_acc = 1.0, w_tool = 0.15 |
| 📏 Max response length | 16,384 tokens |
| 📊 Training prompts | ~5K (45% perception, 36% search, 19% math/reasoning) |

📖 Citation

If you find Metis or HDPO useful in your research, please consider citing:

```bibtex
@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}
```

📄 License

This project is released under the Apache 2.0 License.

๐Ÿ™ Acknowledgments

Metis is built upon the following open-source projects:

  • ๐ŸŒ‹ verl โ€” Volcano Engine Reinforcement Learning for LLMs
  • ๐Ÿฏ verl-tool โ€” Tool-augmented agent training framework
  • ๐Ÿ”ฎ Qwen3-VL โ€” Qwen3 Vision-Language Model
