
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan1*†‡, Jintao Tong1,2*, Hongwei Xue1†, Xiaojun Tang1, Yangyang Wang1,
Kunyu Shi1, Guannan Zhang1, Ruixuan Li2‡, Yixiong Zou2‡

*Equal contribution  †Project Leader  ‡Corresponding Author

1Accio Team, Alibaba Group  2Huazhong University of Science and Technology



📰 News

  • 🔥 [2026/04/10] Paper, code, and model weights released!

✨ Highlights

  • 🎯 98% → 2% Tool Calls: reduces blind tool invocation by orders of magnitude
  • 📈 SOTA Performance: best accuracy across 13 benchmarks among open-source 8B agents
  • 🧠 Meta-Cognitive Wisdom: learns when to use tools, not just how

💡 Introduction

"The art of being wise is the art of knowing what to overlook." — William James

Current agentic multimodal models suffer from a critical meta-cognitive deficit: blind tool invocation 🔧❌. They reflexively call external tools even when queries are directly resolvable from the visual context, incurring severe latency overhead and degrading reasoning.

Existing RL methods attempt to fix this by scalarizing accuracy and tool-efficiency rewards into a single signal. However, this coupled reward formulation creates an irreconcilable optimization dilemma: aggressive penalties suppress essential tool use, while mild penalties are washed out by accuracy-reward variance during advantage normalization.

We propose Hierarchical Decoupled Policy Optimization (HDPO), a framework that maintains two orthogonal optimization channels:

  • 🎯 An accuracy channel that globally maximizes task correctness
  • ⚡ An efficiency channel that enforces tool parsimony exclusively within accurate trajectories

This decoupled design naturally induces an implicit curriculum 📚 (first learn to be correct, then learn to be efficient) without any manual scheduling.

Our resulting model, Metis, reduces tool invocations by orders of magnitude (e.g., from 98% to 2%) while simultaneously improving reasoning accuracy across diverse benchmarks.


Comparison of tool-use efficiency and task performance. Existing methods rely heavily on tool calls. Metis uses tools selectively while achieving the best overall performance.

🔬 Metis Framework


Overview of Metis. A strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning.

HDPO resolves reward coupling through three key components:

1๏ธโƒฃ Dual Reward Design

Reward Formula Description
๐ŸŽฏ Accuracy r_acc = 0.9 ยท r_ans + 0.1 ยท r_fmt Binary answer correctness + format compliance
โšก Tool Efficiency r_tool = 1/(T+1) if correct, else 0 Inverse penalty on tool invocations, conditioned on correctness
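A minimal sketch of these two rewards (variable names are illustrative; the actual implementation lives in the `metis.py` reward manager):

```python
def accuracy_reward(answer_correct: bool, format_ok: bool) -> float:
    """r_acc = 0.9 * r_ans + 0.1 * r_fmt, with both components binary."""
    return 0.9 * float(answer_correct) + 0.1 * float(format_ok)


def tool_efficiency_reward(answer_correct: bool, num_tool_calls: int) -> float:
    """r_tool = 1 / (T + 1) for correct answers, 0 otherwise: each extra
    tool call shrinks the reward, and incorrect rollouts earn nothing."""
    if not answer_correct:
        return 0.0
    return 1.0 / (num_tool_calls + 1)
```

A correct, tool-free rollout thus earns the maximal r_tool of 1.0, while adding a single tool call halves it.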

2๏ธโƒฃ Decoupled Advantage Estimation

  • ๐ŸŽฏ Accuracy advantage โ€” Standard GRPO over all G rollouts per prompt
  • โšก Tool efficiency advantage โ€” Conditional GRPO computed exclusively over the qualifying set Q = {correct rollouts with |Q| โ‰ฅ 2}

This prevents incorrect rollouts from inflating tool efficiency advantages and eliminates cross-objective gradient entanglement.
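The decoupled estimation can be sketched as follows (a simplified, sequence-level view; per-token details and the exact normalization in `hdpo_algos.py` may differ):

```python
import numpy as np


def group_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style group-relative advantage: z-score within the rollout group."""
    return (x - x.mean()) / (x.std() + eps)


def decoupled_advantages(r_acc, r_tool, correct, eps: float = 1e-6):
    """Accuracy advantages over all G rollouts; tool-efficiency advantages
    only over the qualifying set Q of correct rollouts (needs |Q| >= 2)."""
    r_acc = np.asarray(r_acc, dtype=float)
    a_acc = group_norm(r_acc, eps)
    a_tool = np.zeros_like(a_acc)
    q = np.asarray(correct, dtype=bool)
    if q.sum() >= 2:  # otherwise the efficiency channel stays silent
        a_tool[q] = group_norm(np.asarray(r_tool, dtype=float)[q], eps)
    return a_acc, a_tool
```

Within Q, a correct rollout that used fewer tool calls (higher r_tool) receives a positive efficiency advantage; incorrect rollouts receive exactly zero and cannot contaminate the channel.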

3๏ธโƒฃ Hierarchical Policy Update

The final HDPO objective combines two independent clipped surrogate losses:

L_HDPO = w_acc ยท L_GRPO(A_acc) + w_tool ยท L_GRPO(A_tool)

Since advantages are normalized independently, each gradient component delivers a clean, orthogonal learning signal.
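The combined update can be sketched as two independently clipped PPO-style surrogates sharing the same likelihood ratios (illustrative; the weights follow the defaults reported in the training details below):

```python
import numpy as np


def clipped_surrogate(ratio: np.ndarray, adv: np.ndarray, clip: float = 0.2) -> float:
    """Mean PPO clipped surrogate objective for one advantage channel."""
    return float(np.minimum(ratio * adv,
                            np.clip(ratio, 1 - clip, 1 + clip) * adv).mean())


def hdpo_objective(ratio, a_acc, a_tool, w_acc: float = 1.0, w_tool: float = 0.15) -> float:
    """L_HDPO = w_acc * L_GRPO(A_acc) + w_tool * L_GRPO(A_tool)."""
    ratio = np.asarray(ratio, dtype=float)
    return (w_acc * clipped_surrogate(ratio, np.asarray(a_acc, dtype=float))
            + w_tool * clipped_surrogate(ratio, np.asarray(a_tool, dtype=float)))
```

Because each channel is clipped and averaged on its own, gradients from the efficiency term cannot rescale or cancel gradients from the accuracy term.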


Comparison between coupled-reward optimization and HDPO. Existing methods entangle accuracy and efficiency into a single reward signal, while HDPO decouples them into separate branches.

📊 Results

👁️ Perception and Document Understanding

| Model | V*Bench | HR4K | HR8K | TreeBench | MME-RW | SEED2+ | CharXiv(DQ) | CharXiv(RQ) |
|---|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | | |
| LLaVA-OneVision | 75.4 | 63.0 | 59.8 | 37.3 | 57.4 | 65.4 | - | - |
| InternVL3-8B | 81.2 | 70.0 | 69.3 | 38.8 | - | 69.7 | 73.6 | 37.6 |
| Qwen2.5-VL-7B-Instruct | 75.3 | 65.5 | 62.1 | 37.0 | 56.8 | 70.4 | 72.7 | 40.2 |
| Qwen2.5-VL-32B-Instruct | 80.6 | 69.3 | 63.6 | 42.5 | 59.1 | 72.4 | 83.2 | 48.0 |
| Qwen3-VL-8B-Instruct | 86.4 | 78.9 | 74.6 | 40.7 | 61.9 | 71.0 | 83.0 | 46.3 |
| **Agentic Multimodal Models** | | | | | | | | |
| Pixel-Reasoner | 84.3 | 72.6 | 66.1 | 39.0 | 64.4 | - | - | - |
| DeepEyes | 83.3 | 73.2 | 69.5 | 37.5 | 64.1 | - | - | - |
| Thyme | 82.2 | 77.0 | 72.0 | - | 64.8 | - | - | - |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 42.5 | 64.9 | 70.5 | 78.6 | 48.9 |
| Mini-o3 | 88.2 | 77.5 | 73.3 | - | 65.5 | - | - | - |
| SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | - | 67.9 | - | - | - |
| Skywork-R1V4-30B-A3B | 88.0 | 82.8 | 79.8 | - | 71.4 | - | - | - |
| Metis (Ours) | 91.1 | 83.5 | 82.0 | 45.2 | 70.3 | 72.5 | 83.4 | 54.1 |

🧮 Mathematical and Logical Reasoning

| Model | MathVista | MathVerse | WeMath | DynaMath | LogicVista | Avg. |
|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | |
| LLaVA-OneVision | 58.6 | 19.3 | 20.9 | - | 33.3 | - |
| Qwen2.5-VL-7B-Instruct | 68.3 | 45.6 | 34.6 | 53.3 | 45.9 | 49.5 |
| InternVL3-8B | 71.6 | 39.8 | 37.1 | - | 44.1 | - |
| Qwen3-VL-8B-Instruct | 76.3 | 61.3 | 38.8 | 65.5 | 54.9 | 59.4 |
| **Text-only Reasoning Models** | | | | | | |
| MM-Eureka-7B | 72.6 | 50.3 | 21.8 | - | 46.3 | - |
| ThinkLite-VL-7B | 75.1 | 52.1 | 41.8 | - | 42.7 | - |
| VL-Rethinker-7B | 74.9 | 54.2 | 36.3 | - | 42.7 | - |
| VLAA-Thinker-7B | 71.7 | - | 35.7 | - | 45.9 | - |
| **Agentic Multimodal Models** | | | | | | |
| DeepEyes | 70.1 | 47.3 | 38.9 | 55.0 | 47.7 | 51.8 |
| Thyme | 70.0 | - | 39.3 | - | 49.0 | - |
| DeepEyesV2 | 71.9 | 52.7 | 38.1 | 57.2 | 48.7 | 53.7 |
| Metis (Ours) | 78.0 | 65.9 | 65.2 | 69.2 | 56.2 | 66.9 |

๐Ÿ” Ablation: Effect of w_tool

Method V*Bench HR4K HR8K CharXiv(RQ) MathVista
Standard GRPO (w_tool=0) 88.7 81.0 79.2 51.0 76.9
HDPO (w_tool=0.10) 88.0 83.5 81.0 52.7 77.4
HDPO (w_tool=0.15) โœ… 91.1 83.5 82.0 54.1 78.0
HDPO (w_tool=0.20) 87.4 82.5 80.5 51.5 77.2

🧠 Meta-Cognitive Tool Arbitration

To illustrate the meta-cognitive tool-use behavior cultivated by HDPO, we present two representative cases from the main paper.

💭 Case 1: Direct Reasoning — No Tool Needed

When the query is resolvable from the visual context and parametric knowledge alone, Metis abstains from calling any external tool and answers directly. The agent learns to trust its own capabilities for queries within its competence, avoiding the latency overhead and noise injection of redundant tool calls.


The query can be resolved through visual understanding and prior knowledge alone. Metis abstains from tool invocation and answers directly, exemplifying the meta-cognitive restraint instilled by HDPO.

🔬 Case 2: Targeted Code Execution — Precision When Needed

When fine-grained visual analysis exceeds the model's native resolution capabilities, Metis strategically invokes code execution to crop and enlarge the relevant region. Code execution is not a default fallback, but a precision instrument deployed only when the visual evidence at the original resolution is genuinely ambiguous.


The question requires comparing curves in a specific subplot region difficult to resolve at the original image scale. Metis invokes code to crop and enlarge the relevant area, enabling precise identification of curve behavior near the queried time step.
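The kind of crop-and-zoom operation the agent writes in such cases can be sketched in pure Python (a hypothetical illustration; the model actually emits image-library code inside the sandboxed executor):

```python
def crop_and_zoom(pixels, box, scale=2):
    """Crop the region (x0, y0, x1, y1) from a row-major pixel grid and
    enlarge it by integer nearest-neighbour upsampling, mimicking the
    agent's "crop then magnify" tool call on an ambiguous image region."""
    x0, y0, x1, y1 = box
    cropped = [row[x0:x1] for row in pixels[y0:y1]]
    return [
        [px for px in row for _ in range(scale)]  # repeat pixels horizontally
        for row in cropped for _ in range(scale)  # repeat rows vertically
    ]
```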

💡 Takeaway: Metis has internalized a principled decision boundary — abstaining when internal knowledge suffices, and selectively engaging external tools only when genuinely necessary.

๐Ÿ› ๏ธ Installation

๐Ÿ“‹ Prerequisites

  • ๐Ÿ Python >= 3.10
  • ๐Ÿ”ง CUDA >= 12.1
  • ๐Ÿ–ฅ๏ธ 8ร— GPUs (180GB each, e.g. B200) for RL training

๐Ÿ“ฆ Install

git clone https://github.com/Accio-Lab/Metis.git
cd Metis

# Install verl (base RL framework) as editable dependency
pip install -e verl

# Install Metis (HDPO + tool server)
pip install -e ".[vllm,search_tool,python_code_dep]"

🚀 Quick Start

1️⃣ Deploy the Judge Model (Required)

During RL training, an LLM judge evaluates whether the agent's answers are correct. The reward manager (metis.py) calls an OpenAI-compatible endpoint to obtain CORRECT / INCORRECT verdicts.

You can deploy any strong LLM as the judge. We recommend using vLLM:

```bash
# Deploy a judge model (on a GPU machine with 8 GPUs to match --tensor-parallel-size 8)
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --port 8000 \
    --tensor-parallel-size 8

# Verify it's running:
curl http://localhost:8000/v1/models
```

After the judge is up, note the URL (e.g. http://<judge_ip>:8000/v1). You will pass it as JUDGE_BASE_URL in Step 4.

💡 Tip: Any OpenAI-compatible server works (vLLM, SGLang, TGI, or even a commercial API like OpenAI / DeepSeek). Just set JUDGE_API_KEY and JUDGE_BASE_URL accordingly.
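The judge's free-form reply has to be mapped to a binary reward signal. A minimal sketch of request construction and verdict parsing (the prompt template and parsing rule here are illustrative stand-ins, not the exact logic in `metis.py`):

```python
def build_judge_request(question: str, reference: str, prediction: str,
                        model: str = "judge") -> dict:
    """Payload for any OpenAI-compatible /v1/chat/completions endpoint.
    The prompt wording below is a hypothetical example."""
    prompt = (
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    return {"model": model, "temperature": 0.0,
            "messages": [{"role": "user", "content": prompt}]}


def parse_verdict(reply: str) -> bool:
    """True iff the reply signals CORRECT (and not INCORRECT)."""
    text = reply.strip().upper()
    return "CORRECT" in text and "INCORRECT" not in text
```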

2๏ธโƒฃ Configure the Search API (Required for text search tool)

The agent's text search tool calls a web search backend during training. You need an API key from one of the following providers:

Provider Env Variable Sign Up
Serper (recommended) SERPER_API_KEY serper.dev
SerpApi SERPER_API_KEY serpapi.com
BrightData BRIGHTDATA_API_TOKEN + BRIGHTDATA_ZONE brightdata.com
# Example: using Serper (default)
export SERPER_API_KEY="your-serper-api-key"
export SEARCH_PROVIDER="serper"        # "serper" | "serpapi" | "brightdata"

๐Ÿ’ก Tip: If your training data does not contain search-type tasks, the search tool will not be invoked and you can skip this step.

3๏ธโƒฃ Start the Tool Server (Optional)

The tool server provides sandboxed Python execution, text search, and image search capabilities for the agent.

๐Ÿ’ก Tip: If you run training on a single node, you can skip this step โ€” the training script will automatically start a local tool server. You only need to start it manually when deploying the tool server on a separate dedicated machine.

# On a dedicated machine (pass search API keys via environment)
SERPER_API_KEY="your-key" \
bash examples/train/start_tool_server.sh [PORT] [WORKERS]

# Example:
bash examples/train/start_tool_server.sh 30569 32
# After startup, the script will print:
#   URL: http://<server_ip>:30569/get_observation
# Save this URL for the next step.

4๏ธโƒฃ Run HDPO Training

# Single node (8 GPUs) โ€” tool server auto-starts locally
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
SERPER_API_KEY="your-serper-api-key" \
bash examples/train/train_metis.sh 0.15

# Single node with a remote tool server (started in Step 3)
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
REMOTE_TOOL_SERVER_URL=http://<server_ip>:<port>/get_observation \
bash examples/train/train_metis.sh 0.15

# Multi-node (Head node, e.g. 2 nodes)
NODE_RANK=0 NNODES=2 \
MODEL_PATH=path/to/your/sft-checkpoint \
TRAIN_DATA="[data/train.parquet]" \
VAL_DATA="[data/val.parquet]" \
JUDGE_BASE_URL=http://<judge_ip>:8000/v1 \
SERPER_API_KEY="your-serper-api-key" \
REMOTE_TOOL_SERVER_URL=http://<server_ip>:<port>/get_observation \
bash examples/train/train_metis.sh 0.15

# Multi-node (Worker node)
NODE_RANK=1 MASTER_ADDR=<head_ip> NNODES=2 \
bash examples/train/train_metis.sh 0.15

๐Ÿ’ก The argument 0.15 is the tool-efficiency loss weight w_tool. See the ablation table for guidance on tuning.

โš™๏ธ Environment Variables

Variable Description Default
MODEL_PATH ๐Ÿ“‚ Path to SFT checkpoint Required
TRAIN_DATA ๐Ÿ“Š Training parquet files Required
VAL_DATA ๐Ÿ“Š Validation parquet files Required
JUDGE_BASE_URL ๐ŸŒ Base URL for the LLM judge endpoint http://localhost:8000/v1
JUDGE_API_KEY ๐Ÿ”‘ API key for the LLM judge "EMPTY"
SEARCH_PROVIDER ๐Ÿ” Search backend to use "serper"
SERPER_API_KEY ๐Ÿ” API key for Serper / SerpApi Required for search
BRIGHTDATA_API_TOKEN ๐Ÿ” API token for BrightData SERP Alternative to Serper
BRIGHTDATA_ZONE ๐Ÿ” BrightData zone name Required if using BrightData
REMOTE_TOOL_SERVER_URL ๐Ÿ–ฅ๏ธ URL to an external tool server Auto-start locally
WANDB_API_KEY ๐Ÿ“ˆ Weights & Biases API key Optional
METIS_SESSION_DIR ๐Ÿ’พ Directory for tool execution sessions /tmp/metis_sessions

๐Ÿ—๏ธ Project Structure

Metis/
โ”œโ”€โ”€ verl_tool/                          # ๐Ÿง  Core Metis implementation
โ”‚   โ”œโ”€โ”€ trainer/
โ”‚   โ”‚   โ”œโ”€โ”€ main_ppo.py                 # ๐Ÿš€ Training entry point
โ”‚   โ”‚   โ”œโ”€โ”€ config/                     # โš™๏ธ Hydra YAML configs
โ”‚   โ”‚   โ””โ”€โ”€ ppo/
โ”‚   โ”‚       โ”œโ”€โ”€ hdpo_algos.py           # โญ HDPO advantage estimation (core algorithm)
โ”‚   โ”‚       โ”œโ”€โ”€ ray_trainer.py          # ๐Ÿ”„ Ray-based PPO trainer with dual rewards
โ”‚   โ”‚       โ””โ”€โ”€ reward.py               # ๐ŸŽฏ Reward computation utilities
โ”‚   โ”œโ”€โ”€ workers/
โ”‚   โ”‚   โ”œโ”€โ”€ hdpo_actor.py               # โญ HDPO actor with dual-loss update
โ”‚   โ”‚   โ”œโ”€โ”€ hdpo_fsdp_worker.py         # ๐Ÿ”— FSDP worker integration
โ”‚   โ”‚   โ””โ”€โ”€ reward_manager/
โ”‚   โ”‚       โ””โ”€โ”€ metis.py                # โญ Dual reward manager (accuracy + tool efficiency)
โ”‚   โ”œโ”€โ”€ agent_loop/                     # ๐Ÿ” Multi-turn agent loop with tool use
โ”‚   โ””โ”€โ”€ servers/                        # ๐Ÿ› ๏ธ Tool server (code execution, search)
โ”‚       โ”œโ”€โ”€ serve.py                    # ๐ŸŒ FastAPI tool server entry point
โ”‚       โ””โ”€โ”€ tools/
โ”‚           โ”œโ”€โ”€ metis.py                # ๐Ÿงฐ Full tool: Python + text search + image search
โ”‚           โ”œโ”€โ”€ metis_code.py           # ๐Ÿ’ป Code-only tool variant
โ”‚           โ””โ”€โ”€ utils/
โ”‚               โ”œโ”€โ”€ ipython_executor.py # ๐Ÿ Ray-managed Jupyter kernel pool
โ”‚               โ””โ”€โ”€ search_engine.py    # ๐Ÿ” Async Google search with caching
โ”œโ”€โ”€ verl/                               # ๐Ÿ“ฆ Base RL framework (editable dependency)
โ”œโ”€โ”€ examples/
โ”‚   โ””โ”€โ”€ train/
โ”‚       โ”œโ”€โ”€ train_metis.sh              # ๐Ÿ“œ Multi-node HDPO training script
โ”‚       โ””โ”€โ”€ start_tool_server.sh        # ๐Ÿ“œ Tool server launch script
โ”œโ”€โ”€ paper/                              # ๐Ÿ“„ LaTeX source
โ”œโ”€โ”€ assets/                             # ๐Ÿ–ผ๏ธ Figures for README
โ””โ”€โ”€ pyproject.toml

📌 Core Files

| File | Description |
|---|---|
| verl_tool/trainer/ppo/hdpo_algos.py | ⭐ HDPO advantage estimation — conditional GRPO for tool efficiency |
| verl_tool/workers/hdpo_actor.py | ⭐ Dual-loss policy update: w_acc · L_acc + w_tool · L_tool |
| verl_tool/workers/reward_manager/metis.py | ⭐ Dual reward computation: accuracy + tool efficiency |
| verl_tool/trainer/ppo/ray_trainer.py | 🔄 Ray-distributed PPO trainer with HDPO integration |
| verl_tool/workers/hdpo_fsdp_worker.py | 🔗 FSDP integration for the HDPO actor |
| verl_tool/servers/tools/metis.py | 🧰 Tool environment: Python execution, text/image search |

📚 Training Details

🧊 SFT Stage (Cold Start)

Our SFT corpus is curated from publicly available tool-augmented multimodal trajectories (DeepEyesV2, V-Interaction, Thyme, OpenMMReasoner) through a three-stage pipeline:

  1. 🔧 Eradicating hallucinated environmental dynamics — execute all code in a sandbox and discard trajectories with execution failures
  2. 🧹 Isolating genuine tool necessity — filter out samples where the base model achieves pass@8 = 1 without tools
  3. 🧠 Multidimensional meta-cognitive filtering — an LLM judge evaluates visual relevance, reasoning coherence, and tool-use rationale
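The three filters can be sketched as a single keep/drop predicate (names are illustrative, and the judge criteria are collapsed to one boolean; pass@8 = 1 is interpreted as at least one of 8 tool-free rollouts being correct):

```python
def keep_for_sft(sandbox_ok: bool,
                 toolfree_rollouts_correct: list,
                 judge_approved: bool) -> bool:
    """Keep a trajectory only if (1) its code ran cleanly in the sandbox,
    (2) the base model cannot solve the task without tools, i.e. all
    tool-free rollouts fail, and (3) the LLM judge approves its visual
    relevance, reasoning coherence, and tool-use rationale."""
    if not sandbox_ok:
        return False  # stage 1: hallucinated environmental dynamics
    if any(toolfree_rollouts_correct):
        return False  # stage 2: pass@8 == 1 -> tools are unnecessary
    return judge_approved  # stage 3: meta-cognitive filtering
```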

⚡ RL Stage (HDPO)

| Hyperparameter | Value |
|---|---|
| 🏗️ Backbone | Qwen3-VL-8B-Instruct |
| 📦 Batch size | 128 |
| 🎲 Rollouts per prompt (G) | 16 |
| 📝 Learning rate | 1e-6 |
| 🔒 KL coefficient | 0 |
| ⚖️ Loss weights | w_acc = 1.0, w_tool = 0.15 |
| 📏 Max response length | 16,384 tokens |
| 📊 Training prompts | ~5K (45% perception, 36% search, 19% math/reasoning) |

📖 Citation

If you find Metis or HDPO useful in your research, please consider citing:

```bibtex
@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}
```

📄 License

This project is released under the Apache 2.0 License.

๐Ÿ™ Acknowledgments

Metis is built upon the following open-source projects:

  • ๐ŸŒ‹ verl โ€” Volcano Engine Reinforcement Learning for LLMs
  • ๐Ÿฏ verl-tool โ€” Tool-augmented agent training framework
  • ๐Ÿ”ฎ Qwen3-VL โ€” Qwen3 Vision-Language Model
