- Authors: Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai, Xuhong Zhang, Jianwei Yin
- Institutes: Zhejiang University; Shanghai AI Laboratory
- Resources: [📄 Paper] [🤗 Huggingface]
IBISAgent is a novel agentic Multimodal Large Language Model (MLLM) framework designed to address the limitations of existing medical MLLMs in fine-grained pixel-level understanding. Unlike previous approaches that rely on implicit segmentation tokens and single-pass reasoning, IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process.
By treating segmentation tools (e.g., MedSAM2) as plug-and-play modules controllable through natural language, IBISAgent iteratively generates interleaved reasoning (Thinking) and text-based click actions (Action) to progressively refine segmentation masks. This approach mimics the interactive behavior of human experts, allowing for self-correction and high-quality mask generation without requiring architectural modifications to the MLLM.
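The think-act loop described above can be sketched in a few lines of Python. This is an illustration only, not the actual implementation (which lives in `infer/multi_turn.py`); `mllm` and `segmenter` are hypothetical stand-ins for the MLLM and the SAM2/MedSAM2 tool.

```python
# Minimal sketch of the iterative Thinking/Action loop (illustrative only).
# `mllm` and `segmenter` are hypothetical callables standing in for the
# MLLM and the plug-and-play segmentation tool (e.g., SAM2 / MedSAM2).

def agent_segment(mllm, segmenter, image, prompt, max_turns=20):
    mask, history = None, []
    for _ in range(max_turns):
        # 1. Think: the MLLM reasons over the image, prompt, current mask, and history.
        reply = mllm(image=image, prompt=prompt, mask=mask, history=history)
        history.append(reply)
        if reply.get("done"):  # the agent decides the mask is good enough
            break
        # 2. Act: a text-based click (normalized x, y plus a positive/negative
        #    label) is passed to the frozen segmentation tool as a point prompt.
        click = reply["click"]  # e.g., {"x": 0.43, "y": 0.61, "label": 1}
        mask = segmenter(image, point=(click["x"], click["y"]), label=click["label"])
    return mask
```

Because the actions are plain text, the segmentation tool stays frozen and swappable, and no architectural change to the MLLM is required.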
- 🔥 Agentic Reasoning Framework. We reformulate medical image segmentation as a multi-step Markov Decision Process (MDP), enabling the model to "think" and "act" iteratively to solve complex visual grounding tasks.
- 🔥 No Implicit Tokens. IBISAgent eliminates the need for special `<SEG>` tokens and external pixel decoders, preserving the LLM's inherent text-generation capabilities and ensuring better generalization.
- 🔥 Two-Stage Training Strategy.
  - Cold-Start SFT: initialized with high-quality trajectory data synthesized from automatic click simulation and self-reflection error correction.
  - Agentic Reinforcement Learning: further optimized using GRPO with novel fine-grained rewards (Region-based Click Placement, Progressive Improvement), enabling the model to discover advanced segmentation strategies beyond imitation.
- 🔥 SOTA Performance. IBISAgent significantly outperforms existing medical MLLMs on both in-domain and out-of-domain benchmarks, demonstrating superior robustness and pixel-level reasoning ability.
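To make the reward design concrete, here is one plausible reading of the Progressive Improvement reward. This is our own sketch, under the assumption that each step is scored by the IoU gain of its mask over the previous step's mask; please refer to the paper for the exact definition.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def progressive_improvement_reward(masks, gt):
    """Sketch of a progressive-improvement reward (our assumption, not the
    paper's exact formula): each step earns the IoU gain over the previous
    step's mask, so a click that improves the mask is rewarded and a click
    that degrades it is penalized."""
    ious = [iou(m, gt) for m in masks]
    return [ious[0]] + [ious[t] - ious[t - 1] for t in range(1, len(ious))]
```

A delta-style reward like this gives dense per-step feedback, which is what lets GRPO credit individual clicks rather than only the final mask.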
Please refer to our Huggingface repository for the pre-trained model weights.
Taking SAM2 as an example segmentation tool, we provide the inference code for IBISAgent.
- Create a new conda environment and install the required packages.

```bash
conda create -n ibisagent python=3.12
conda activate ibisagent
pip install -r infer/requirements.txt
```

- Install the `sam2` Python library from the official repo.
```bash
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .
```

- Download our RL-trained model weights to `infer/models/mllm` from here.
```bash
huggingface-cli download manglu3935/IBIS \
    --include "qwen2_5vl-7b-RL/*" \
    --local-dir infer/models/mllm \
    --local-dir-use-symlinks False
```

- Run the multi-turn inference script.

```bash
python infer/multi_turn.py \
    --image "infer/test_img.png" \
    --prompt "Can you find a liver in this image?" \
    --mllm_path "infer/models/mllm"
```

Parameters:
| Parameter | Description | Default | Required |
|---|---|---|---|
| `--image` | Path to the input medical image | None | Yes |
| `--prompt` | User text prompt (e.g., 'Is there a colon tumor?') | None | Yes |
| `--mllm_path` | Path to the MLLM model | `infer/models/mllm` | No |
| `--max_turns` | Maximum number of iterations | 20 | No |
| `--use_history` | Whether to enable chat history (1 for True, 0 for False) | 0 | No |
| `--output_dir` | Directory to save results | `./outputs` | No |
IBISAgent uses a two-stage training strategy: Cold-Start SFT followed by Agentic Reinforcement Learning (GRPO).
- Create a new conda environment and install the required packages.

```bash
conda create -n ibisagent-train python=3.12
conda activate ibisagent-train
pip install -r requirements.txt
```

We use LLaMA-Factory for supervised fine-tuning with LoRA. The configuration file is provided at `training_scripts/qwen2_5vl_lora_sft.yaml`.
- Set up the dataset following the `dataset_info.json` format required by LLaMA-Factory, and update the `dataset` and `output_dir` fields in the config file.
- Specify the base model path in `model_name_or_path`.
- Run SFT training:

```bash
llamafactory-cli train training_scripts/qwen2_5vl_lora_sft.yaml
```

We use VERL for GRPO-based reinforcement learning. The training script is at `training_scripts/grpo.sh`.
- Prepare your training data in `.parquet` format and update the `data.train_files` and `data.val_files` paths in `training_scripts/grpo.sh`.
- Set the `model` variable to the path of your SFT-merged checkpoint, and run:

```bash
bash training_scripts/grpo.sh
```

Building the Supervised Fine-Tuning (SFT) data consists of two main parts:
- Building Trajectory Data. Generating simulated interactive point prompts using SAM2.
- CoT Distillation. Generating textual reasoning steps using a Teacher MLLM (Please refer to Appendix B.3 Reasoning Generation for Our SFT Dataset in our paper for details on this step).
Here we provide the script and instructions for the first part: generating your own trajectory data for custom 2D images or 3D NIfTI volumes.
We provide a versatile script, `trajectory_gen/run_tr_gen.py`, to automatically generate high-quality interactive click trajectories. It simulates human-like iterative clicking behavior on your segmentation data.
**For 2D Images (e.g., PNG, JPG)**

```bash
python trajectory_gen/run_tr_gen.py \
    --type img \
    --image /path/to/image.jpg \
    --mask /path/to/mask.png \
    --output /path/to/output.json \
    --iou_threshold 0.8 \
    --max_steps 20
```

(Note: The script assumes white pixels (>128) in the mask represent the target. If your mask uses black for the target, add the `--reverse_mask` flag.)
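For intuition, the simulated clicking loop can be sketched as follows. This is our own illustration of a common error-region click strategy, not the script itself (`run_tr_gen.py` is the authoritative implementation); `predict` is a hypothetical stand-in for the SAM2 point-prompt call.

```python
import numpy as np

def simulate_clicks(gt, predict, iou_threshold=0.8, max_steps=20):
    """Illustrative click simulation (a sketch, not run_tr_gen.py itself):
    each step clicks inside the current error region, placing a positive
    click where target pixels are missed (false negatives) or a negative
    click on spurious pixels (false positives), until the mask's IoU with
    the ground truth reaches the threshold."""
    clicks, pred = [], np.zeros_like(gt)
    for _ in range(max_steps):
        fn = np.logical_and(gt == 1, pred == 0)  # missed target pixels
        fp = np.logical_and(gt == 0, pred == 1)  # spurious predicted pixels
        region, label = (fn, 1) if fn.sum() >= fp.sum() else (fp, 0)
        ys, xs = np.nonzero(region)
        if len(ys) == 0:
            break  # prediction matches the ground truth exactly
        y, x = ys[len(ys) // 2], xs[len(xs) // 2]  # a point inside the region
        clicks.append({"x": x / gt.shape[1], "y": y / gt.shape[0], "label": label})
        pred = predict(clicks)  # hypothetical SAM2 point-prompt call
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        if union and inter / union >= iou_threshold:
            break
    return clicks, pred
```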
**For 3D NIfTI Volumes**

You can specify the slicing axis, slice index, mask label value, and optional windowing parameters for NIfTI data.

```bash
python trajectory_gen/run_tr_gen.py \
    --type nii \
    --image /path/to/volume.nii.gz \
    --mask /path/to/segmentation.nii.gz \
    --output /path/to/output.json \
    --slice_axis z \
    --slice_idx 55 \
    --mask_value 2 \
    --window_min -125 \
    --window_max 1000 \
    --iou_threshold 0.85
```

The script will output a JSON file containing the file metadata and the generated sequence of steps, including the normalized coordinates of the simulated clicks and the RLE-compressed predicted masks at each step.
Note: The RLE mask string can be converted into a binary mask image using the following code snippet:

```python
import numpy as np
import base64
import zlib

def decode_rle_zlib_b64_to_mask(rle_b64_str: str, shape: tuple) -> np.ndarray:
    """Decode a zlib-compressed base64 RLE string back to a binary mask."""
    if not rle_b64_str:
        return np.zeros(shape, dtype=np.uint8)
    compressed = base64.b64decode(rle_b64_str)
    runs_str = zlib.decompress(compressed).decode('utf-8')
    runs = np.array([int(x) for x in runs_str.split()])
    pixels = np.zeros(np.prod(shape), dtype=np.uint8)
    current_idx = 0
    current_val = 0
    for run in runs:
        pixels[current_idx:current_idx + run] = current_val
        current_idx += run
        current_val = 1 - current_val  # Alternate between 0 and 1
    return pixels.reshape(shape)

# Example usage:
# shape = (512, 512)  # Use the size from 'nii_data' or 'img_data' in the JSON
# mask = decode_rle_zlib_b64_to_mask("eJxVUtuS2jAM/...", shape)
```

Once you have the trajectory JSON files (from the SFT data generation step above), use `trajectory_gen/convert_to_rl_data.py` to convert them into the `.parquet` format required by VERL for GRPO training.
```bash
python trajectory_gen/convert_to_rl_data.py \
    --dataset your_dataset \
    --data_root /path/to/med-seg-rl \
    --output_root /path/to/output \
    --split train
```

The output Parquet file will be saved to `{output_root}/{dataset}/{split}_rl.parquet`, which can be directly used as `data.train_files` in `training_scripts/grpo.sh`.
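As a complement to the decoder provided above, the inverse operation (encoding a binary mask into the same zlib-compressed base64 RLE string, with run lengths alternating 0/1 and starting from a zero-run) can be sketched as follows. This is our own sketch inferred from the decoder; `run_tr_gen.py` is the authoritative implementation.

```python
import numpy as np
import base64
import zlib

def encode_mask_to_rle_zlib_b64(mask: np.ndarray) -> str:
    """Encode a binary mask as alternating 0/1 run lengths (first run counts
    zeros), then zlib-compress and base64-encode, mirroring the decoder above."""
    pixels = mask.astype(np.uint8).flatten()
    # Indices where the pixel value changes, plus the two ends of the array.
    change = np.flatnonzero(pixels[1:] != pixels[:-1]) + 1
    bounds = np.concatenate(([0], change, [pixels.size]))
    runs = list(np.diff(bounds))
    if pixels[0] == 1:
        runs = [0] + runs  # the decoder assumes the first run is zeros
    runs_str = " ".join(str(int(r)) for r in runs)
    return base64.b64encode(zlib.compress(runs_str.encode("utf-8"))).decode("ascii")
```

Round-tripping a mask through this encoder and the decoder above should reproduce the original array.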
- [2026/02/28] 🚀 Code and dataset release preparation.
- [2026/02/21] 🎉 IBISAgent is accepted to CVPR 2026!
- Release training scripts (SFT & RL)
- Release inference code
- Release pre-trained model weights
- Release Cold-Start and RL datasets
If you find our work helpful for your research, please consider giving one star ⭐️ and citing:
```bibtex
@inproceedings{jiang2026ibisagent,
  title={IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation},
  author={Jiang, Yankai and Li, Qiaoru and Xu, Binlu and Sun, Haoran and Ding, Chao and Dong, Junting and Cai, Yuxiang and Zhang, Xuhong and Yin, Jianwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```

- VERL: The reinforcement learning framework we built upon.
- Qwen2.5-VL-7B: The MLLM used in our agent.
- SAM2: The segmentation tool used in our agent.
- MedSAM2: The segmentation tool used in our agent.


