[CVPR 2026] IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

  • Authors: Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai📧, Xuhong Zhang, Jianwei Yin
  • Institutes: Zhejiang University; Shanghai AI Laboratory
  • Resources: [📖 Paper] [🤗 Hugging Face]

📖 Introduction

IBISAgent is a novel agentic Multimodal Large Language Model (MLLM) framework designed to address the limitations of existing medical MLLMs in fine-grained pixel-level understanding. Unlike previous approaches that rely on implicit segmentation tokens and single-pass reasoning, IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process.

By treating segmentation tools (e.g., MedSAM2) as plug-and-play modules controllable through natural language, IBISAgent iteratively generates interleaved reasoning (Thinking) and text-based click actions (Action) to progressively refine segmentation masks. This approach mimics the interactive behavior of human experts, allowing for self-correction and high-quality mask generation without requiring architectural modifications to the MLLM.
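
Conceptually, the think-act loop looks like the minimal Python sketch below. The helper names (parse_click_action, the mllm/segmenter interfaces) and the click-action format are illustrative assumptions rather than the repository's actual API; the real multi-turn logic lives in infer/multi_turn.py.

import re
from typing import List, Optional, Tuple

def parse_click_action(reply: str) -> Optional[Tuple[float, float]]:
    """Illustrative parser: extract a normalized click such as 'click(0.42, 0.61)' from the model reply."""
    m = re.search(r"click\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)", reply)
    return (float(m.group(1)), float(m.group(2))) if m else None

def agentic_segmentation(image, prompt, mllm, segmenter, max_turns: int = 20):
    """Think-act loop: the MLLM reasons, emits a text click, and the tool refines the mask."""
    clicks: List[Tuple[float, float]] = []
    mask, history = None, []
    for _ in range(max_turns):
        # The MLLM sees the image, the question, and the current mask, then emits
        # interleaved reasoning ("Thinking") and a text-based click ("Action").
        reply = mllm.generate(image=image, prompt=prompt, mask=mask, history=history)
        history.append(reply)
        click = parse_click_action(reply)
        if click is None:  # no further action: the agent considers the mask final
            break
        clicks.append(click)
        # The segmentation tool (e.g., SAM2 / MedSAM2) is invoked as a plug-and-play module.
        mask = segmenter.predict(image, points=clicks)
    return mask, history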

IBISAgent Introduction

💡 Highlights

  • 🔥 Agentic Reasoning Framework. We reformulate medical image segmentation as a multi-step Markov Decision Process (MDP), enabling the model to "think" and "act" iteratively to solve complex visual grounding tasks.
  • 🔥 No Implicit Tokens. IBISAgent eliminates the need for special <SEG> tokens and external pixel decoders, preserving the LLM's inherent text generation capabilities and ensuring better generalization.
  • 🔥 Two-Stage Training Strategy.
    • Cold-Start SFT: Initialized with high-quality trajectory data synthesized from automatic click simulation and self-reflection error correction.
    • Agentic Reinforcement Learning: Further optimized using GRPO with novel fine-grained rewards (Region-based Click Placement, Progressive Improvement), enabling the model to discover advanced segmentation strategies beyond imitation (a toy sketch of these reward terms follows this list).
  • 🔥 SOTA Performance. IBISAgent significantly outperforms existing medical MLLMs on both in-domain and out-of-domain benchmarks, demonstrating superior robustness and pixel-level reasoning ability.
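
As a rough illustration of these reward terms (the exact formulation is in the paper), the toy sketch below shows a region-based click-placement reward and a progressive-improvement reward; the weights and thresholds are placeholders, not the values we use.

import numpy as np

def click_placement_reward(click_xy, gt_mask, error_region):
    """Toy region-based click reward: favor clicks that land in the current error region
    or inside the ground-truth target, penalize background clicks. Placeholder weights."""
    x, y = click_xy                              # normalized coordinates in [0, 1]
    h, w = gt_mask.shape
    r, c = min(int(y * h), h - 1), min(int(x * w), w - 1)
    if error_region[r, c]:
        return 1.0                               # click addresses a currently wrong region
    return 0.5 if gt_mask[r, c] else -0.5        # inside target vs. background

def progressive_improvement_reward(iou_prev: float, iou_curr: float) -> float:
    """Toy progressive-improvement reward: positive when the IoU of the refined mask
    improves over the previous step, negative when it degrades."""
    return float(np.clip(iou_curr - iou_prev, -1.0, 1.0))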

IBISAgent Performance

Model Weights

Please refer to our Hugging Face repository for the pre-trained model weights.

🤖 Inference

Using SAM2 as the example segmentation tool, we provide the inference code for IBISAgent below.

  1. Create a new conda environment and install the required packages.
conda create -n ibisagent python=3.12
conda activate ibisagent
pip install -r infer/requirements.txt
  2. Install the sam2 Python library from the official repo.
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .
  3. Download our RL-trained model weights to infer/models/mllm from here.
huggingface-cli download manglu3935/IBIS \
    --include "qwen2_5vl-7b-RL/*" \
    --local-dir infer/models/mllm \
    --local-dir-use-symlinks False
  4. Run the multi-turn inference script.
python infer/multi_turn.py \
    --image "infer/test_img.png" \
    --prompt "Can you find a liver in this image?" \
    --mllm_path "infer/models/mllm"

Parameters:

--image (required): Path to the input medical image.
--prompt (required): User text prompt (e.g., 'Is there a colon tumor?').
--mllm_path: Path to the MLLM model. Default: infer/models/mllm.
--max_turns: Maximum number of iterations. Default: 20.
--use_history: Whether to enable chat history (1 for True, 0 for False). Default: 0.
--output_dir: Directory to save results. Default: ./outputs.
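
To run inference over a folder of images, a small Python driver such as the sketch below works; it uses only the flags documented above, and the paths and prompt are placeholders for your own data.

import subprocess
from pathlib import Path

# Batch inference with infer/multi_turn.py; adjust the image directory, prompt, and output location.
image_dir = Path("infer/test_images")
for image_path in sorted(image_dir.glob("*.png")):
    subprocess.run([
        "python", "infer/multi_turn.py",
        "--image", str(image_path),
        "--prompt", "Can you find a liver in this image?",
        "--mllm_path", "infer/models/mllm",
        "--max_turns", "20",
        "--output_dir", f"./outputs/{image_path.stem}",
    ], check=True)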

πŸ‹οΈ Training

IBISAgent uses a two-stage training strategy: Cold-Start SFT followed by Agentic Reinforcement Learning (GRPO).

  1. Create a new conda environment and install the required packages.
conda create -n ibisagent-train python=3.12
conda activate ibisagent-train
pip install -r requirements.txt

Stage 1: Cold-Start SFT

We use LLaMA-Factory for supervised fine-tuning with LoRA. The configuration file is provided at training_scripts/qwen2_5vl_lora_sft.yaml.

  1. Set up the dataset following the dataset_info.json format required by LLaMA-Factory (see the illustrative entry after these steps), and update the dataset and output_dir fields in the config file.

  2. Specify the base model path in model_name_or_path.

  3. Run SFT training:

llamafactory-cli train training_scripts/qwen2_5vl_lora_sft.yaml
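
For reference, a LLaMA-Factory dataset_info.json entry for a multimodal, ShareGPT-style SFT dataset typically looks like the sketch below (written here as Python that emits the JSON). The dataset name, file name, and column names are placeholders for your own data; check the fields expected by your LLaMA-Factory version.

import json

# Illustrative dataset_info.json entry; "ibisagent_sft" and the file/column names are placeholders.
dataset_info = {
    "ibisagent_sft": {
        "file_name": "ibisagent_sft.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
}
with open("dataset_info.json", "w") as f:
    json.dump(dataset_info, f, indent=2)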

Stage 2: Agentic Reinforcement Learning (GRPO)

We use VERL for GRPO-based reinforcement learning. The training script is at training_scripts/grpo.sh.

  1. Prepare your training data in .parquet format and update the data.train_files and data.val_files paths in training_scripts/grpo.sh.

  2. Set the model variable to the path of your SFT-merged checkpoint, and run:

bash training_scripts/grpo.sh

🔧 Build Your SFT Data

Building the Supervised Fine-Tuning (SFT) data consists of two main parts:

  1. Building Trajectory Data. Generating simulated interactive point prompts using SAM2.
  2. CoT Distillation. Generating textual reasoning steps using a Teacher MLLM (Please refer to Appendix B.3 Reasoning Generation for Our SFT Dataset in our paper for details on this step).

Here we provide the script and instructions for the first part: generating your own trajectory data for custom 2D images or 3D NIfTI volumes.

Generate Trajectories

We provide a versatile script trajectory_gen/run_tr_gen.py to automatically generate high-quality interactive click trajectories. It simulates human-like iterative clicking behavior on your segmentation data.
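
At a high level, the simulation resembles the sketch below: click where the current prediction disagrees with the ground truth, re-run the segmenter, and stop once the IoU threshold or the step budget is reached. The helper interfaces and the point-selection heuristic are illustrative; the actual logic (and the SAM2 calls) lives in trajectory_gen/run_tr_gen.py.

import numpy as np

def simulate_click_trajectory(image, gt_mask, segmenter, iou_threshold=0.8, max_steps=20):
    """Illustrative click simulation over boolean masks: each step clicks inside the larger
    error region (false negatives -> positive click, false positives -> negative click),
    re-runs the segmenter, and records the step. Interfaces are placeholders."""
    clicks, steps = [], []
    pred = np.zeros_like(gt_mask, dtype=bool)
    for _ in range(max_steps):
        fn = gt_mask & ~pred                       # missed target pixels
        fp = pred & ~gt_mask                       # wrongly included pixels
        region, label = (fn, 1) if fn.sum() >= fp.sum() else (fp, 0)
        ys, xs = np.nonzero(region)
        if len(ys) == 0:                           # prediction already matches the ground truth
            break
        i = len(ys) // 2                           # crude representative point; the real script
        clicks.append(((int(xs[i]), int(ys[i])), label))  # chooses click locations more carefully
        pred = segmenter.predict(image, clicks).astype(bool)
        iou = (pred & gt_mask).sum() / max((pred | gt_mask).sum(), 1)
        steps.append({"click": clicks[-1], "iou": float(iou)})
        if iou >= iou_threshold:
            break
    return steps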

For 2D Images (e.g., PNG, JPG)

python trajectory_gen/run_tr_gen.py \
    --type img \
    --image /path/to/image.jpg \
    --mask /path/to/mask.png \
    --output /path/to/output.json \
    --iou_threshold 0.8 \
    --max_steps 20

(Note: The script assumes white pixels (>128) in the mask represent the target. If your mask uses black for the target, add the --reverse_mask flag.)

For 3D NIfTI Volumes

You can specify the slicing axis, slice index, mask label value, and optional windowing parameters for NIfTI data.

python trajectory_gen/run_tr_gen.py \
    --type nii \
    --image /path/to/volume.nii.gz \
    --mask /path/to/segmentation.nii.gz \
    --output /path/to/output.json \
    --slice_axis z \
    --slice_idx 55 \
    --mask_value 2 \
    --window_min -125 \
    --window_max 1000 \
    --iou_threshold 0.85

The script will output a JSON file containing the file metadata and the generated sequence of steps, including the normalized coordinates of the simulated clicks and the RLE-compressed predicted masks at each step.

Note: The RLE mask string can be converted into a binary mask image using the following code snippet:

import numpy as np
import base64
import zlib

def decode_rle_zlib_b64_to_mask(rle_b64_str: str, shape: tuple) -> np.ndarray:
    """Decode a zlib-compressed base64 RLE string back to a binary mask."""
    if not rle_b64_str:
        return np.zeros(shape, dtype=np.uint8)
        
    compressed = base64.b64decode(rle_b64_str)
    runs_str = zlib.decompress(compressed).decode('utf-8')
    runs = np.array([int(x) for x in runs_str.split()])
    
    pixels = np.zeros(np.prod(shape), dtype=np.uint8)
    
    current_idx = 0
    current_val = 0
    for run in runs:
        pixels[current_idx:current_idx + run] = current_val
        current_idx += run
        current_val = 1 - current_val  # Alternate between 0 and 1
        
    return pixels.reshape(shape)

# Example usage:
# shape = (512, 512) # Use the size from 'nii_data' or 'img_data' in the JSON
# mask = decode_rle_zlib_b64_to_mask("eJxVUtuS2jAM/...", shape)
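
For completeness, the inverse direction can be sketched as below: it produces the format the decoder above expects (space-separated run lengths of alternating values starting with a 0-run, zlib-compressed and base64-encoded). This mirrors the decoder rather than the repository's own encoder, so treat it as an assumption; it reuses the imports from the snippet above.

def encode_mask_to_rle_zlib_b64(mask: np.ndarray) -> str:
    """Encode a binary mask into the RLE format that decode_rle_zlib_b64_to_mask expects."""
    pixels = (mask > 0).astype(np.uint8).flatten()   # row-major order, values in {0, 1}
    runs, current_val, run_len = [], 0, 0
    for p in pixels:
        if p == current_val:
            run_len += 1
        else:
            runs.append(run_len)                     # close the current run (may be 0 at the start)
            current_val = 1 - current_val
            run_len = 1
    runs.append(run_len)
    runs_str = " ".join(str(r) for r in runs)
    return base64.b64encode(zlib.compress(runs_str.encode("utf-8"))).decode("ascii")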

🔧 Build Your RL Data

Once you have the trajectory JSON files (from the SFT data generation step above), use trajectory_gen/convert_to_rl_data.py to convert them into the .parquet format required by VERL for GRPO training.

python trajectory_gen/convert_to_rl_data.py \
    --dataset your_dataset \
    --data_root /path/to/med-seg-rl \
    --output_root /path/to/output \
    --split train

The output Parquet file will be saved to {output_root}/{dataset}/{split}_rl.parquet, which can be directly used as data.train_files in training_scripts/grpo.sh.
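
To sanity-check the converted file before launching GRPO training, you can load it with pandas; the path below is a placeholder, and the column schema is whatever convert_to_rl_data.py produces, so inspect it rather than assuming specific fields.

import pandas as pd

# Quick inspection of the converted RL data (placeholder path).
df = pd.read_parquet("/path/to/output/your_dataset/train_rl.parquet")
print(df.shape)
print(df.columns.tolist())   # the actual schema produced by convert_to_rl_data.py
print(df.head(2))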

📜 News

  • [2026/02/28] 🚀 Code and dataset release preparation.
  • [2026/02/21] 🎉 IBISAgent is accepted to CVPR 2026!

👨‍💻 Todo

  • Release training scripts (SFT & RL)
  • Release inference code
  • Release pre-trained model weights
  • Release Cold-Start and RL datasets

✒️ Citation

If you find our work helpful for your research, please consider giving us a star ⭐️ and citing:

@inproceedings{jiang2026ibisagent,
  title={IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation},
  author={Jiang, Yankai and Li, Qiaoru and Xu, Binlu and Sun, Haoran and Ding, Chao and Dong, Junting and Cai, Yuxiang and Zhang, Xuhong and Yin, Jianwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

❤️ Acknowledgments

  • VERL: The reinforcement learning framework we built upon.
  • Qwen2.5-VL-7B: The MLLM used in our agent.
  • SAM2: The segmentation tool used in our agent.
  • MedSAM2: The segmentation tool used in our agent.
