Home

Welcome to the 'SUTS for CVLMs' wiki

SUTS for CVLMs is a spatial understanding (SU) test suite (TS) for vision-language models (VLMs). There are currently four main test types in the suite. Each one generates an image along with a pair of sentence captions: one caption describes the image correctly and the other describes it falsely. The job of the CVLM is to identify the correct caption.
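The caption-pair setup described above can be sketched roughly as follows. This is a minimal illustration, not the suite's actual code: the names `CaptionPairCase`, `passes`, and `accuracy` are hypothetical, and it assumes the model under test can be wrapped as a function that returns an image-caption matching score (higher meaning a better match).

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical scoring interface: (image_path, caption) -> matching score.
# How a real model produces this score is an assumption, not part of the suite.
ScoreFn = Callable[[str, str], float]


@dataclass(frozen=True)
class CaptionPairCase:
    """One test case: an image plus one true and one false caption."""
    image_path: str
    true_caption: str
    false_caption: str


def passes(case: CaptionPairCase, score: ScoreFn) -> bool:
    """The model passes if it scores the true caption strictly higher.

    A tie counts as a failure, since the model did not single out
    the correct caption.
    """
    true_score = score(case.image_path, case.true_caption)
    false_score = score(case.image_path, case.false_caption)
    return true_score > false_score


def accuracy(cases: Iterable[CaptionPairCase], score: ScoreFn) -> float:
    """Fraction of cases in which the model picks the true caption."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("at least one test case is required")
    return sum(passes(c, score) for c in case_list) / len(case_list)


if __name__ == "__main__":
    # Toy stand-in for a model: it ignores the image entirely and
    # simply prefers captions that contain the word "left".
    def toy_score(image_path: str, caption: str) -> float:
        return 1.0 if "left" in caption else 0.0

    demo_cases = [
        CaptionPairCase(
            "img_0.png",
            "The cube is left of the sphere.",
            "The cube is right of the sphere.",
        ),
        CaptionPairCase(
            "img_1.png",
            "The cone is behind the cube.",
            "The cone is in front of the cube.",
        ),
    ]
    print(accuracy(demo_cases, toy_score))
```

Treating a tie as a failure is a design choice in this sketch; a random-guessing model would then score at or below 50% rather than benefiting from equal scores.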

Image generation is done using Unity. See more at Object Spawning and Camera Movement.

The main test types so far are described at the following pages.

Additionally, there is one test unrelated to spatial understanding:

Simple Object Detection
