- Authors: Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai, Xuhong Zhang, Jianwei Yin
- Institutes: Zhejiang University; Shanghai AI Laboratory
- Resources: [📄 Paper] [🤗 Huggingface]
IBISAgent is a novel agentic Multimodal Large Language Model (MLLM) framework designed to address the limitations of existing medical MLLMs in fine-grained pixel-level understanding. Unlike previous approaches that rely on implicit segmentation tokens and single-pass reasoning, IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process.
By treating segmentation tools (e.g., MedSAM2) as plug-and-play modules controllable through natural language, IBISAgent iteratively generates interleaved reasoning (Thinking) and text-based click actions (Action) to progressively refine segmentation masks. This approach mimics the interactive behavior of human experts, allowing for self-correction and high-quality mask generation without requiring architectural modifications to the MLLM.
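The think-act loop described above can be sketched in a few lines of Python. This is an illustration only, not the actual implementation (which lives in `infer/multi_turn.py`); `mllm` and `segmenter` are hypothetical stand-ins for the MLLM and the SAM2/MedSAM2 tool.

```python
# Minimal sketch of the iterative Thinking/Action loop (illustrative only).
# `mllm` and `segmenter` are hypothetical callables standing in for the
# MLLM and the plug-and-play segmentation tool (e.g., SAM2 / MedSAM2).

def agent_segment(mllm, segmenter, image, prompt, max_turns=20):
    mask, history = None, []
    for _ in range(max_turns):
        # 1. Think: the MLLM reasons over the image, prompt, current mask, and history.
        reply = mllm(image=image, prompt=prompt, mask=mask, history=history)
        history.append(reply)
        if reply.get("done"):  # the agent decides the mask is good enough
            break
        # 2. Act: a text-based click (normalized x, y plus a positive/negative
        #    label) is passed to the frozen segmentation tool as a point prompt.
        click = reply["click"]  # e.g., {"x": 0.43, "y": 0.61, "label": 1}
        mask = segmenter(image, point=(click["x"], click["y"]), label=click["label"])
    return mask
```

Because the actions are plain text, the segmentation tool stays frozen and swappable, and no architectural change to the MLLM is required.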
- 🔥 Agentic Reasoning Framework. We reformulate medical image segmentation as a multi-step Markov Decision Process (MDP), enabling the model to "think" and "act" iteratively to solve complex visual grounding tasks.
- 🔥 No Implicit Tokens. IBISAgent eliminates the need for special `<SEG>` tokens and external pixel decoders, preserving the LLM's inherent text-generation capabilities and ensuring better generalization.
- 🔥 Two-Stage Training Strategy.
  - Cold-Start SFT: initialized with high-quality trajectory data synthesized from automatic click simulation and self-reflection error correction.
  - Agentic Reinforcement Learning: further optimized using GRPO with novel fine-grained rewards (Region-based Click Placement, Progressive Improvement), enabling the model to discover advanced segmentation strategies beyond imitation.
- 🔥 SOTA Performance. IBISAgent significantly outperforms existing medical MLLMs on both in-domain and out-of-domain benchmarks, demonstrating superior robustness and pixel-level reasoning ability.
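To make the reward design concrete, here is one plausible reading of the Progressive Improvement reward. This is our own sketch, under the assumption that each step is scored by the IoU gain of its mask over the previous step's mask; please refer to the paper for the exact definition.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def progressive_improvement_reward(masks, gt):
    """Sketch of a progressive-improvement reward (our assumption, not the
    paper's exact formula): each step earns the IoU gain over the previous
    step's mask, so a click that improves the mask is rewarded and a click
    that degrades it is penalized."""
    ious = [iou(m, gt) for m in masks]
    return [ious[0]] + [ious[t] - ious[t - 1] for t in range(1, len(ious))]
```

A delta-style reward like this gives dense per-step feedback, which is what lets GRPO credit individual clicks rather than only the final mask.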
Please refer to our Huggingface repository for the pre-trained model weights.
Taking SAM2 as an example segmentation tool, we provide the inference code for IBISAgent.
- Create a new conda environment and install the required packages.

```bash
conda create -n ibisagent python=3.12
conda activate ibisagent
pip install -r infer/requirements.txt
```

- Install the `sam2` Python library from the official repo.
```bash
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .
```

- Download our RL-trained model weights to `infer/models/mllm` from here.
```bash
huggingface-cli download manglu3935/IBIS \
    --include "qwen2_5vl-7b-RL/*" \
    --local-dir infer/models/mllm \
    --local-dir-use-symlinks False
```

- Run the multi-turn inference script.

```bash
python infer/multi_turn.py \
    --image "infer/test_img.png" \
    --prompt "Can you find a liver in this image?" \
    --mllm_path "infer/models/mllm"
```

Parameters:
| Parameter | Description | Default | Required |
|---|---|---|---|
| `--image` | Path to the input medical image | None | Yes |
| `--prompt` | User text prompt (e.g., 'Is there a colon tumor?') | None | Yes |
| `--mllm_path` | Path to the MLLM model | `infer/models/mllm` | No |
| `--max_turns` | Maximum number of iterations | 20 | No |
| `--use_history` | Whether to enable chat history (1 for True, 0 for False) | 0 | No |
| `--output_dir` | Directory to save results | `./outputs` | No |
IBISAgent uses a two-stage training strategy: Cold-Start SFT followed by Agentic Reinforcement Learning (GRPO).
- Create a new conda environment and install the required packages.

```bash
conda create -n ibisagent-train python=3.12
conda activate ibisagent-train
pip install -r requirements.txt
```

We use LLaMA-Factory for supervised fine-tuning with LoRA. The configuration file is provided at `training_scripts/qwen2_5vl_lora_sft.yaml`.
- Set up the dataset following the `dataset_info.json` format required by LLaMA-Factory, and update the `dataset` and `output_dir` fields in the config file.
- Specify the base model path in `model_name_or_path`.
- Run SFT training:

```bash
llamafactory-cli train training_scripts/qwen2_5vl_lora_sft.yaml
```

We use VERL for GRPO-based reinforcement learning. The training script is at `training_scripts/grpo.sh`.
- Prepare your training data in `.parquet` format and update the `data.train_files` and `data.val_files` paths in `training_scripts/grpo.sh`.
- Set the `model` variable to the path of your SFT-merged checkpoint, and run:

```bash
bash training_scripts/grpo.sh
```

Building the Supervised Fine-Tuning (SFT) data consists of two main parts:
- Building Trajectory Data. Generating simulated interactive point prompts using SAM2.
- CoT Distillation. Generating textual reasoning steps using a Teacher MLLM (Please refer to Appendix B.3 Reasoning Generation for Our SFT Dataset in our paper for details on this step).
Here we provide the script and instructions for the first part: generating your own trajectory data for custom 2D images or 3D NIfTI volumes.
We provide a versatile script, `trajectory_gen/run_tr_gen.py`, to automatically generate high-quality interactive click trajectories. It simulates human-like iterative clicking behavior on your segmentation data.
**For 2D Images (e.g., PNG, JPG)**

```bash
python trajectory_gen/run_tr_gen.py \
    --type img \
    --image /path/to/image.jpg \
    --mask /path/to/mask.png \
    --output /path/to/output.json \
    --iou_threshold 0.8 \
    --max_steps 20
```

(Note: The script assumes white pixels (>128) in the mask represent the target. If your mask uses black for the target, add the `--reverse_mask` flag.)
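For intuition, the simulated clicking loop can be sketched as follows. This is our own illustration of a common error-region click strategy, not the script itself (`run_tr_gen.py` is the authoritative implementation); `predict` is a hypothetical stand-in for the SAM2 point-prompt call.

```python
import numpy as np

def simulate_clicks(gt, predict, iou_threshold=0.8, max_steps=20):
    """Illustrative click simulation (a sketch, not run_tr_gen.py itself):
    each step clicks inside the current error region, placing a positive
    click where target pixels are missed (false negatives) or a negative
    click on spurious pixels (false positives), until the mask's IoU with
    the ground truth reaches the threshold."""
    clicks, pred = [], np.zeros_like(gt)
    for _ in range(max_steps):
        fn = np.logical_and(gt == 1, pred == 0)  # missed target pixels
        fp = np.logical_and(gt == 0, pred == 1)  # spurious predicted pixels
        region, label = (fn, 1) if fn.sum() >= fp.sum() else (fp, 0)
        ys, xs = np.nonzero(region)
        if len(ys) == 0:
            break  # prediction matches the ground truth exactly
        y, x = ys[len(ys) // 2], xs[len(xs) // 2]  # a point inside the region
        clicks.append({"x": x / gt.shape[1], "y": y / gt.shape[0], "label": label})
        pred = predict(clicks)  # hypothetical SAM2 point-prompt call
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        if union and inter / union >= iou_threshold:
            break
    return clicks, pred
```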
**For 3D NIfTI Volumes**

You can specify the slicing axis, slice index, mask label value, and optional windowing parameters for NIfTI data.

```bash
python trajectory_gen/run_tr_gen.py \
    --type nii \
    --image /path/to/volume.nii.gz \
    --mask /path/to/segmentation.nii.gz \
    --output /path/to/output.json \
    --slice_axis z \
    --slice_idx 55 \
    --mask_value 2 \
    --window_min -125 \
    --window_max 1000 \
    --iou_threshold 0.85
```

The script will output a JSON file containing the file metadata and the generated sequence of steps, including the normalized coordinates of the simulated clicks and the RLE-compressed predicted masks at each step.
Note: The RLE mask string can be converted into a binary mask image using the following code snippet:

```python
import numpy as np
import base64
import zlib

def decode_rle_zlib_b64_to_mask(rle_b64_str: str, shape: tuple) -> np.ndarray:
    """Decode a zlib-compressed base64 RLE string back to a binary mask."""
    if not rle_b64_str:
        return np.zeros(shape, dtype=np.uint8)
    compressed = base64.b64decode(rle_b64_str)
    runs_str = zlib.decompress(compressed).decode('utf-8')
    runs = np.array([int(x) for x in runs_str.split()])
    pixels = np.zeros(np.prod(shape), dtype=np.uint8)
    current_idx = 0
    current_val = 0
    for run in runs:
        pixels[current_idx:current_idx + run] = current_val
        current_idx += run
        current_val = 1 - current_val  # Alternate between 0 and 1
    return pixels.reshape(shape)

# Example usage:
# shape = (512, 512)  # Use the size from 'nii_data' or 'img_data' in the JSON
# mask = decode_rle_zlib_b64_to_mask("eJxVUtuS2jAM/...", shape)
```

Once you have the trajectory JSON files (from the SFT data generation step above), use `trajectory_gen/convert_to_rl_data.py` to convert them into the `.parquet` format required by VERL for GRPO training.
```bash
python trajectory_gen/convert_to_rl_data.py \
    --dataset your_dataset \
    --data_root /path/to/med-seg-rl \
    --output_root /path/to/output \
    --split train
```

The output Parquet file will be saved to `{output_root}/{dataset}/{split}_rl.parquet`, which can be directly used as `data.train_files` in `training_scripts/grpo.sh`.
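As a complement to the decoder provided above, the inverse operation (encoding a binary mask into the same zlib-compressed base64 RLE string, with run lengths alternating 0/1 and starting from a zero-run) can be sketched as follows. This is our own sketch inferred from the decoder; `run_tr_gen.py` is the authoritative implementation.

```python
import numpy as np
import base64
import zlib

def encode_mask_to_rle_zlib_b64(mask: np.ndarray) -> str:
    """Encode a binary mask as alternating 0/1 run lengths (first run counts
    zeros), then zlib-compress and base64-encode, mirroring the decoder above."""
    pixels = mask.astype(np.uint8).flatten()
    # Indices where the pixel value changes, plus the two ends of the array.
    change = np.flatnonzero(pixels[1:] != pixels[:-1]) + 1
    bounds = np.concatenate(([0], change, [pixels.size]))
    runs = list(np.diff(bounds))
    if pixels[0] == 1:
        runs = [0] + runs  # the decoder assumes the first run is zeros
    runs_str = " ".join(str(int(r)) for r in runs)
    return base64.b64encode(zlib.compress(runs_str.encode("utf-8"))).decode("ascii")
```

Round-tripping a mask through this encoder and the decoder above should reproduce the original array.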
- [2026/02/28] 🚀 Code and dataset release preparation.
- [2026/02/21] 🎉 IBISAgent is accepted to CVPR 2026!
- Release training scripts (SFT & RL)
- Release inference code
- Release pre-trained model weights
- Release Cold-Start and RL datasets
If you find our work helpful for your research, please consider giving one star ⭐️ and citing:
```bibtex
@inproceedings{jiang2026ibisagent,
  title={IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation},
  author={Jiang, Yankai and Li, Qiaoru and Xu, Binlu and Sun, Haoran and Ding, Chao and Dong, Junting and Cai, Yuxiang and Zhang, Xuhong and Yin, Jianwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```

- VERL: The reinforcement learning framework we built upon.
- Qwen2.5-VL-7B: The MLLM used in our agent.
- SAM2: The segmentation tool used in our agent.
- MedSAM2: The segmentation tool used in our agent.


