David Krug edited this page Aug 15, 2022 · 9 revisions

Welcome to the 'SUTS for CVLMs' wiki

SUTS for CVLMs is a spatial understanding (SU) test suite (TS) for vision-language models (VLMs). There are currently 4 main test types in the test suite. Each one works by generating an image along with a pair of sentence captions: one caption describes the image correctly and the other describes it falsely. The job of the model is to identify the correct caption.
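The pass criterion described above can be sketched in a few lines of Python. This is only an illustration, not the suite's actual code: `CaptionPairItem`, `passes`, `accuracy`, and the `score` callable (assumed to return a higher number when a caption matches an image better) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical container for one test item: an image and its two captions.
@dataclass(frozen=True)
class CaptionPairItem:
    image_path: str
    true_caption: str
    false_caption: str

# Assumed model interface: score(image_path, caption) -> float,
# where a higher value means a better image-caption match.
ScoreFn = Callable[[str, str], float]

def passes(item: CaptionPairItem, score: ScoreFn) -> bool:
    """The model passes an item if it scores the true caption strictly higher."""
    true_score = score(item.image_path, item.true_caption)
    false_score = score(item.image_path, item.false_caption)
    return true_score > false_score

def accuracy(items: Iterable[CaptionPairItem], score: ScoreFn) -> float:
    """Fraction of items on which the model prefers the true caption."""
    items = list(items)
    if not items:
        raise ValueError("accuracy is undefined for an empty item list")
    return sum(passes(item, score) for item in items) / len(items)
```

Using a strict comparison means a tie counts as a failure, so a model that scores both captions identically gets no credit. Whether the suite itself treats ties this way is an assumption of this sketch.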

Images are generated with Unity. See Object Spawning and Camera Movement for details.

The main test types so far are described on the following pages.

Additionally, there is one test unrelated to spatial understanding.