
VLM-GIST — Vision-Language Model Grounding for Instance Segmentation & Tracking

Official implementation of: Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking by Bastian Pätzold, Jan Nogga and Sven Behnke. IEEE Robotics and Automation Letters (RA-L). 2025.

Benchmark

We provide detailed results and all experimental settings.

COCO minival (subset)
Model                        |  F1   |  Rec. | Prec. |  mAP  || Ins. | Mat. |  Time  || Fail | Ret.
----------------------------------------------------------------------------------------------------
Gemini 2.5 Pro (high)        | 0.541 | 0.489 | 0.606 | 0.338 || 11.4 | 0.50 |  14.5s ||  nan |  nan
Gemini 2.5 Pro (preview)     | 0.537 | 0.487 | 0.599 | 0.350 || 11.1 | 0.51 |  13.3s ||  nan |  nan
Gemini 2.5 Pro               | 0.537 | 0.487 | 0.598 | 0.337 || 11.4 | 0.50 |  14.7s ||  nan |  nan
GLM 4.5V (think)             | 0.526 | 0.445 | 0.642 | 0.334 ||  8.2 | 0.60 |  14.9s ||  nan |  nan
Grok 4                       | 0.524 | 0.531 | 0.518 | 0.340 || 12.1 | 0.59 |  58.2s ||  nan |  nan
InternVL 3.5 38B (think)     | 0.523 | 0.464 | 0.599 | 0.336 ||  9.8 | 0.55 |  59.1s ||  nan |  nan
Gemini 2.5 Flash (medium)    | 0.520 | 0.487 | 0.558 | 0.323 || 12.7 | 0.48 |   6.6s ||  nan |  nan
Gemini 2.5 Flash Lite (high) | 0.520 | 0.449 | 0.616 | 0.316 ||  8.9 | 0.57 |   9.5s || 0.00 | 0.05
Gemini 2.5 Flash (low)       | 0.518 | 0.479 | 0.563 | 0.326 || 12.1 | 0.49 |   5.7s ||  nan |  nan
InternVL 3.5 20B A4B (think) | 0.518 | 0.457 | 0.597 | 0.312 ||  9.2 | 0.58 |  40.1s ||  nan |  nan
Gemini 2.5 Flash (high)      | 0.515 | 0.493 | 0.540 | 0.327 || 13.0 | 0.49 |   7.6s ||  nan |  nan
Qwen3-VL 235B A22B (reason)  | 0.513 | 0.459 | 0.580 | 0.324 ||  8.5 | 0.66 |   5.0s || 0.00 | 0.14
GLM 4.5V                     | 0.510 | 0.412 | 0.670 | 0.328 ||  7.4 | 0.60 |  15.9s || 0.03 | 0.19
Gemini 2.5 Flash             | 0.509 | 0.453 | 0.579 | 0.321 ||  9.5 | 0.58 |   2.9s ||  nan |  nan
GPT-4.1                      | 0.508 | 0.472 | 0.550 | 0.330 || 10.5 | 0.57 |   7.5s ||  nan |  nan
Gemini 2.5 Flash (preview)   | 0.507 | 0.444 | 0.590 | 0.323 ||  9.2 | 0.57 |   3.2s ||  nan |  nan
GPT-5 mini                   | 0.505 | 0.501 | 0.509 | 0.327 || 13.7 | 0.50 |  27.9s ||  nan |  nan
GPT-4.1 mini                 | 0.504 | 0.426 | 0.618 | 0.297 ||  7.8 | 0.62 |   4.3s ||  nan |  nan
InternVL 3.5 4B (think)      | 0.501 | 0.432 | 0.597 | 0.292 ||  8.9 | 0.57 |  35.0s ||  nan |  nan
GPT-5 mini (high)            | 0.501 | 0.476 | 0.529 | 0.297 || 13.9 | 0.46 |  82.4s || 0.02 | 0.10
Qwen3-VL 235B A22B           | 0.500 | 0.456 | 0.554 | 0.329 ||  9.4 | 0.62 |   6.1s || 0.01 | 0.10
OVIS 2.5 9B                  | 0.499 | 0.399 | 0.667 | 0.316 ||  7.0 | 0.60 |   6.5s ||  nan |  nan
InternVL 3.5 30B A3B         | 0.499 | 0.410 | 0.635 | 0.306 ||  7.1 | 0.64 |  35.1s ||  nan |  nan
InternVL 3.5 30B A3B (think) | 0.490 | 0.415 | 0.600 | 0.289 ||  9.2 | 0.52 |  38.6s ||  nan |  nan
Mistral Medium 3.1 (t=1.00)  | 0.489 | 0.408 | 0.611 | 0.313 ||  8.4 | 0.56 |   2.9s ||  nan |  nan
OVIS 2.5 9B (think)          | 0.489 | 0.394 | 0.644 | 0.301 ||  8.1 | 0.53 |  24.0s ||  nan |  nan
GPT-5 (high)                 | 0.487 | 0.510 | 0.466 | 0.311 || 16.7 | 0.46 |  88.6s ||  nan |  nan
GPT-5 nano (high)            | 0.487 | 0.415 | 0.589 | 0.297 ||  9.4 | 0.53 |  44.2s ||  nan |  nan
GPT-5                        | 0.486 | 0.503 | 0.471 | 0.311 || 15.8 | 0.47 |  47.6s ||  nan |  nan
GPT-5 nano                   | 0.485 | 0.416 | 0.581 | 0.289 ||  9.2 | 0.54 |  24.1s ||  nan |  nan
Gemma 27B                    | 0.478 | 0.418 | 0.558 | 0.284 || 10.6 | 0.50 |   9.5s || 0.00 | 0.00
Claude Sonnet 4              | 0.472 | 0.372 | 0.646 | 0.291 ||  7.9 | 0.51 |   5.5s ||  nan |  nan
Qwen-VL Plus                 | 0.472 | 0.368 | 0.659 | 0.270 ||  6.8 | 0.59 |   2.8s || 0.02 | 0.53
Mistral Medium 3.1 (t=0.15)  | 0.472 | 0.418 | 0.541 | 0.308 ||  9.3 | 0.59 |   3.0s ||  nan |  nan
InternVL 3.5 2B (think)      | 0.470 | 0.388 | 0.595 | 0.294 ||  7.9 | 0.58 |  27.7s ||  nan |  nan
InternVL 3.5 2B              | 0.466 | 0.386 | 0.588 | 0.288 ||  6.8 | 0.67 | 124.8s ||  nan |  nan
Gemma 12B                    | 0.462 | 0.384 | 0.578 | 0.291 ||  9.3 | 0.50 |  12.5s || 0.00 | 0.37
Gemini 2.5 Flash Lite        | 0.447 | 0.435 | 0.460 | 0.304 || 10.1 | 0.66 |   2.8s || 0.00 | 0.01
Gemma 4B                     | 0.430 | 0.323 | 0.641 | 0.255 ||  6.4 | 0.55 |   3.3s || 0.00 | 0.01
GPT-4.1 nano                 | 0.415 | 0.341 | 0.530 | 0.232 ||  6.5 | 0.69 |   3.5s ||  nan |  nan
Custom Dataset
Model                        |  F1   |  Rec. | Prec. |  mAP  || Ins. | Mat. |  Time  || Fail | Ret.
----------------------------------------------------------------------------------------------------
Gemini 2.5 Pro               | 0.464 | 0.417 | 0.523 | 0.363 || 14.7 | 1.00 |  16.0s ||  nan |  nan
Gemini 2.5 Pro (high)        | 0.460 | 0.415 | 0.515 | 0.366 || 14.8 | 1.00 |  17.7s ||  nan |  nan
Gemini 2.5 Pro (preview)     | 0.451 | 0.402 | 0.515 | 0.357 || 14.3 | 1.00 |  14.9s ||  nan |  nan
Gemini 2.5 Flash (low)       | 0.435 | 0.395 | 0.485 | 0.344 || 15.0 | 1.00 |   7.1s ||  nan |  nan
Grok 4                       | 0.428 | 0.385 | 0.482 | 0.340 || 14.7 | 1.00 |  34.7s ||  nan |  nan
GPT-5 mini (high)            | 0.422 | 0.401 | 0.445 | 0.342 || 16.6 | 1.00 |  82.7s ||  nan |  nan
GPT-5                        | 0.418 | 0.413 | 0.424 | 0.352 || 17.9 | 1.00 |  54.2s ||  nan |  nan
OVIS 2.5 9B (think)          | 0.415 | 0.318 | 0.596 | 0.281 ||  9.8 | 1.00 |  32.0s ||  nan |  nan
Gemini 2.5 Flash (high)      | 0.415 | 0.379 | 0.458 | 0.324 || 15.3 | 1.00 |   7.4s ||  nan |  nan
GPT-5 mini                   | 0.414 | 0.388 | 0.444 | 0.338 || 16.1 | 1.00 |  32.7s ||  nan |  nan
Gemini 2.5 Flash Lite (high) | 0.412 | 0.340 | 0.522 | 0.291 || 12.0 | 1.00 |  11.9s || 0.00 | 0.02
GPT-4.1 mini                 | 0.410 | 0.319 | 0.571 | 0.279 || 10.3 | 1.00 |   6.0s ||  nan |  nan
GPT-4.1                      | 0.407 | 0.356 | 0.476 | 0.308 || 13.8 | 1.00 |   8.7s ||  nan |  nan
GPT-5 (high)                 | 0.407 | 0.413 | 0.400 | 0.348 || 19.0 | 1.00 | 101.4s ||  nan |  nan
Gemini 2.5 Flash             | 0.406 | 0.345 | 0.495 | 0.305 || 12.8 | 1.00 |   3.6s ||  nan |  nan
InternVL 3.5 38B (think)     | 0.405 | 0.332 | 0.521 | 0.295 || 11.7 | 1.00 |  58.3s ||  nan |  nan
Gemini 2.5 Flash (preview)   | 0.401 | 0.341 | 0.486 | 0.297 || 12.9 | 1.00 |   4.0s ||  nan |  nan
Gemini 2.5 Flash (medium)    | 0.400 | 0.366 | 0.441 | 0.311 || 15.3 | 1.00 |   7.2s ||  nan |  nan
GLM 4.5V (think)             | 0.400 | 0.312 | 0.555 | 0.276 || 10.4 | 1.00 |  14.9s ||  nan |  nan
Qwen3-VL 235B A22B           | 0.397 | 0.320 | 0.521 | 0.278 || 11.3 | 1.00 |   8.0s || 0.00 | 0.06
Qwen3-VL 235B A22B (reason)  | 0.392 | 0.329 | 0.483 | 0.286 || 12.6 | 1.00 |   8.3s || 0.00 | 0.14
GPT-5 nano (high)            | 0.383 | 0.319 | 0.478 | 0.280 || 12.3 | 1.00 |  51.3s ||  nan |  nan
InternVL 3.5 30B A3B (think) | 0.378 | 0.302 | 0.506 | 0.259 || 11.0 | 1.00 |  38.5s ||  nan |  nan
GPT-5 nano                   | 0.373 | 0.306 | 0.479 | 0.267 || 11.8 | 1.00 |  28.9s ||  nan |  nan
GLM 4.5V                     | 0.368 | 0.279 | 0.538 | 0.243 ||  9.9 | 1.00 |  16.0s || 0.03 | 0.22
OVIS 2.5 9B                  | 0.365 | 0.267 | 0.573 | 0.239 ||  8.6 | 1.00 |   8.0s ||  nan |  nan
InternVL 3.5 4B (think)      | 0.363 | 0.284 | 0.503 | 0.246 || 10.4 | 1.00 |  36.1s ||  nan |  nan
Mistral Medium 3.1 (t=0.15)  | 0.351 | 0.285 | 0.458 | 0.247 || 11.5 | 1.00 |   3.7s ||  nan |  nan
Qwen-VL Plus                 | 0.346 | 0.260 | 0.520 | 0.227 ||  9.2 | 1.00 |   4.4s || 0.00 | 0.34
Gemini 2.5 Flash Lite        | 0.345 | 0.282 | 0.445 | 0.241 || 11.7 | 1.00 |   3.1s || 0.00 | 0.02
InternVL 3.5 30B A3B         | 0.344 | 0.254 | 0.533 | 0.222 ||  8.8 | 1.00 |  42.2s ||  nan |  nan
InternVL 3.5 4B              | 0.343 | 0.262 | 0.495 | 0.229 ||  9.8 | 1.00 |  37.1s ||  nan |  nan
InternVL 3.5 2B (think)      | 0.337 | 0.258 | 0.486 | 0.220 ||  9.8 | 1.00 |  33.1s ||  nan |  nan
Gemma 27B                    | 0.327 | 0.267 | 0.421 | 0.229 || 11.7 | 1.00 |  11.4s || 0.00 | 0.00
Mistral Medium 3.1 (t=1.00)  | 0.327 | 0.284 | 0.383 | 0.247 || 13.7 | 1.00 |   4.0s ||  nan |  nan
Claude Sonnet 4              | 0.321 | 0.250 | 0.451 | 0.216 || 10.2 | 1.00 |   6.3s ||  nan |  nan
Gemma 12B                    | 0.314 | 0.239 | 0.456 | 0.205 || 10.1 | 1.00 |  14.1s || 0.05 | 0.56
GPT-4.1 nano                 | 0.310 | 0.229 | 0.480 | 0.201 ||  8.8 | 1.00 |   4.6s ||  nan |  nan
InternVL 3.5 2B              | 0.294 | 0.207 | 0.505 | 0.179 ||  7.5 | 1.00 | 117.4s ||  nan |  nan
Gemma 4B                     | 0.282 | 0.206 | 0.448 | 0.179 ||  8.5 | 1.00 |   3.2s || 0.00 | 0.00
Weighted Datasets
Model                        |  F1   |  Rec. | Prec. |  mAP  || Ins. | Mat. |  Time  || Fail | Ret.
----------------------------------------------------------------------------------------------------
Gemini 2.5 Pro (high)        | 0.500 | 0.452 | 0.560 | 0.352 || 13.1 | 0.75 |  16.1s ||  nan |  nan
Gemini 2.5 Pro               | 0.500 | 0.452 | 0.560 | 0.350 || 13.0 | 0.75 |  15.3s ||  nan |  nan
Gemini 2.5 Pro (preview)     | 0.494 | 0.444 | 0.557 | 0.353 || 12.7 | 0.76 |  14.1s ||  nan |  nan
Gemini 2.5 Flash (low)       | 0.477 | 0.437 | 0.524 | 0.335 || 13.5 | 0.75 |   6.4s ||  nan |  nan
Grok 4                       | 0.476 | 0.458 | 0.500 | 0.340 || 13.4 | 0.80 |  46.4s ||  nan |  nan
Gemini 2.5 Flash Lite (high) | 0.466 | 0.395 | 0.569 | 0.303 || 10.5 | 0.79 |  10.7s || 0.00 | 0.03
Gemini 2.5 Flash (high)      | 0.465 | 0.436 | 0.499 | 0.325 || 14.1 | 0.75 |   7.5s ||  nan |  nan
InternVL 3.5 38B (think)     | 0.464 | 0.398 | 0.560 | 0.315 || 10.8 | 0.78 |  58.7s ||  nan |  nan
GLM 4.5V (think)             | 0.463 | 0.379 | 0.599 | 0.305 ||  9.3 | 0.80 |  14.9s ||  nan |  nan
GPT-5 mini (high)            | 0.461 | 0.438 | 0.487 | 0.319 || 15.2 | 0.73 |  82.6s ||  nan |  nan
Gemini 2.5 Flash (medium)    | 0.460 | 0.426 | 0.499 | 0.317 || 14.0 | 0.74 |   6.9s ||  nan |  nan
GPT-5 mini                   | 0.460 | 0.445 | 0.476 | 0.332 || 14.9 | 0.75 |  30.3s ||  nan |  nan
GPT-4.1                      | 0.458 | 0.414 | 0.513 | 0.319 || 12.1 | 0.79 |   8.1s ||  nan |  nan
Gemini 2.5 Flash             | 0.457 | 0.399 | 0.537 | 0.313 || 11.1 | 0.79 |   3.2s ||  nan |  nan
GPT-4.1 mini                 | 0.457 | 0.373 | 0.595 | 0.288 ||  9.0 | 0.81 |   5.2s ||  nan |  nan
Gemini 2.5 Flash (preview)   | 0.454 | 0.393 | 0.538 | 0.310 || 11.1 | 0.79 |   3.6s ||  nan |  nan
GPT-5                        | 0.452 | 0.458 | 0.448 | 0.331 || 16.8 | 0.74 |  50.9s ||  nan |  nan
OVIS 2.5 9B (think)          | 0.452 | 0.356 | 0.620 | 0.291 ||  9.0 | 0.77 |  28.0s ||  nan |  nan
Qwen3-VL 235B A22B (reason)  | 0.452 | 0.394 | 0.531 | 0.305 || 10.5 | 0.83 |   6.6s || 0.00 | 0.14
Qwen3-VL 235B A22B           | 0.449 | 0.388 | 0.538 | 0.304 || 10.3 | 0.81 |   7.1s || 0.01 | 0.08
GPT-5 (high)                 | 0.447 | 0.462 | 0.433 | 0.330 || 17.8 | 0.73 |  95.0s ||  nan |  nan
GLM 4.5V                     | 0.439 | 0.346 | 0.604 | 0.286 ||  8.6 | 0.80 |  15.9s || 0.03 | 0.20
GPT-5 nano (high)            | 0.435 | 0.367 | 0.533 | 0.289 || 10.8 | 0.76 |  47.7s ||  nan |  nan
InternVL 3.5 30B A3B (think) | 0.434 | 0.359 | 0.553 | 0.274 || 10.1 | 0.76 |  38.6s ||  nan |  nan
InternVL 3.5 4B (think)      | 0.432 | 0.358 | 0.550 | 0.269 ||  9.6 | 0.78 |  35.5s ||  nan |  nan
GPT-5 nano                   | 0.429 | 0.361 | 0.530 | 0.278 || 10.5 | 0.77 |  26.5s ||  nan |  nan
InternVL 3.5 30B A3B         | 0.421 | 0.332 | 0.584 | 0.264 ||  7.9 | 0.82 |  38.7s ||  nan |  nan
Mistral Medium 3.1 (t=0.15)  | 0.412 | 0.352 | 0.499 | 0.278 || 10.4 | 0.79 |   3.3s ||  nan |  nan
Qwen-VL Plus                 | 0.409 | 0.314 | 0.589 | 0.249 ||  8.0 | 0.79 |   3.6s || 0.01 | 0.44
Mistral Medium 3.1 (t=1.00)  | 0.408 | 0.346 | 0.497 | 0.280 || 11.0 | 0.78 |   3.4s ||  nan |  nan
InternVL 3.5 2B (think)      | 0.403 | 0.323 | 0.540 | 0.257 ||  8.8 | 0.79 |  30.4s ||  nan |  nan
Gemma 27B                    | 0.402 | 0.343 | 0.489 | 0.256 || 11.1 | 0.75 |  10.5s || 0.00 | 0.00
Claude Sonnet 4              | 0.397 | 0.311 | 0.548 | 0.253 ||  9.1 | 0.75 |   5.9s ||  nan |  nan
Gemini 2.5 Flash Lite        | 0.396 | 0.359 | 0.453 | 0.272 || 10.9 | 0.83 |   3.0s || 0.00 | 0.01
Gemma 12B                    | 0.388 | 0.312 | 0.517 | 0.248 ||  9.7 | 0.75 |  13.3s || 0.02 | 0.47
InternVL 3.5 2B              | 0.380 | 0.297 | 0.547 | 0.234 ||  7.2 | 0.84 | 121.1s ||  nan |  nan
GPT-4.1 nano                 | 0.363 | 0.285 | 0.505 | 0.216 ||  7.6 | 0.85 |   4.0s ||  nan |  nan
Gemma 4B                     | 0.356 | 0.265 | 0.544 | 0.217 ||  7.4 | 0.78 |   3.2s || 0.00 | 0.01
Details

Models are sorted in descending order by F1 score.

Legend:

  • F1: F1 score of detections that passed label matching, evaluated against the ground-truth annotations.
  • Rec.: Recall of detections that passed label matching, evaluated against the ground-truth annotations.
  • Prec.: Precision of detections that passed label matching, evaluated against the ground-truth annotations.
  • mAP: mAP of detections that passed label matching, evaluated against the ground-truth annotations.
  • Ins.: The average number of object instances in a valid structured description per image.
  • Mat.: The ratio of matched detections by the label matching procedure over all detections.
  • Time: The median time to generate a valid structured description over all images.
  • Fail: The rate of images for which no valid structured description was obtained after all four generation attempts.
  • Ret.: The average number of retry attempts to generate a valid structured description per image (0 to 3).
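For reference, F1 is the harmonic mean of precision and recall over the matched detections. A minimal sketch of the three scores from raw counts; the function and argument names (`tp`, `fp`, `fn`) are illustrative, not taken from this codebase:

```python
def detection_scores(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts.

    tp: detections that passed label matching against ground truth
    fp: detections without a matching ground-truth annotation
    fn: ground-truth annotations without a matching detection
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```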

Remarks:

  • Models where the last two columns report nan were evaluated with an unlimited and untracked number of retry attempts, until a valid structured description was obtained.
  • All models were queried and interpreted on a best-effort basis, e.g., by limiting parallel usage and attempting to extract JSON from within markdown tags or reasoning content.
  • Reasons for failed attempts may include rate limits, content moderation, timeouts, reaching maximum token limits, etc.
  • All reported times may be heavily affected by the hardware used, rate limits, server load, etc.
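The retry behavior described above can be sketched as follows. This is an illustrative outline, not the actual implementation: `query_vlm` is a hypothetical stand-in for the model call, and the JSON extraction mirrors the described best-effort parsing of markdown-fenced responses:

```python
import json
import re


def extract_json(text):
    """Best effort: try the raw text first, then any ``` fenced block."""
    candidates = [text]
    candidates += re.findall(r"```(?:json)?\s*(.*?)```", text, flags=re.DOTALL)
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


def describe_with_retries(query_vlm, image, max_attempts=4):
    """Query the model until a valid structured description is parsed."""
    for _ in range(max_attempts):
        parsed = extract_json(query_vlm(image))
        if parsed is not None:
            return parsed
    return None  # counts toward the Fail rate after all attempts
```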

Setup

Vision Models

Follow the documentation of NimbRo Vision Servers to serve the vision models you plan on using, e.g. MM Grounding DINO or SAM2-Realtime.

Python

To install all required Python dependencies:

pip install -r requirements.txt

To access all features and improve processing speed, also install the Python dependencies of NimbRo Utils.

ROS2 Jazzy

Include this repository together with NimbRo API, NimbRo API Interfaces and NimbRo Utilities in the source folder of your colcon workspace. After building them:

colcon build --packages-select nimbro_utils nimbro_api_interfaces nimbro_api vlm_gist --symlink-install

and re-sourcing:

source install/local_setup.bash

and launching NimbRo API:

ros2 launch nimbro_api launch.py

several nodes for interacting with FiftyOne datasets (download, describe, detect, validate, evaluate, label match, etc.) become available, along with node extensions for applying the VLM-GIST pipeline, individual stages thereof, or baseline methods.

See the provided Jupyter Notebooks for reproducing results or using the node extensions.

Citation

If you use this package in your research, please cite:

https://arxiv.org/abs/2503.16538

@article{paetzold25vlmgist,
    author={Bastian P{\"a}tzold and Jan Nogga and Sven Behnke},
    title={Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking},
    journal={IEEE Robotics and Automation Letters (RA-L)},
    year={2025}
}

License

vlm_gist (code, scripts, etc.) is licensed under BSD‑3.

The custom dataset (see dataset) is licensed under CC BY‑NC‑SA 4.0 (see license); 42 of its images (00013.jpg to 00054.jpg) are taken from the AgiBot World dataset.

Contact

Bastian Pätzold paetzold@ais.uni-bonn.de
Jan Nogga nogga@ais.uni-bonn.de
