# Evaluating Prompt Sensitivity for VLMs on Simple Spatial Reasoning: A Controlled Study on LLaVA
###### 20.12.2025
#### Notebook by [Adam Astamir](https://adamastamir.vercel.app/)
##### Supported by [Jae Hee Lee](https://jaeheelee.gitlab.io/)
##### [University of Hamburg](https://www.uni-hamburg.de/)

| Paper | Model(s) | Task Type | Prompt Role | Key Failure | What’s Missing |
|------|----------|-----------|-------------|-------------|---------------|
| What’s “Up” with Vision-Language Models? (2023) | 18 VLMs (CLIP, BLIP, BLIP-2, XVLM, CoCa, FLAVA...) | Image–text matching for basic spatial relations (left/right, on/under, in front/behind) | Minimal. Prompting is controlled to isolate spatial reasoning | Models fail to reliably distinguish basic spatial relations; performance is often near random | No focus on prompting. No counting involved |
| Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models (2024) | 7 VLMs (GPT-4V, **LLaVA**, BLIP-2, MiniGPT-4, OpenFlamingo, InstructBLIP, Otter) | Multiple-choice spatial reasoning tasks (e.g., map navigation, object arrangement, grid puzzles) | Moderate. Text prompts guide reasoning, images provide context | VLMs often ignore visual input or rely too heavily on text; accuracy is sometimes at or below random chance; struggles with multi-step spatial reasoning | No experiments on CLIP specifically. The synthetic images used may not include the biases of real-world examples |
| Enhancing Spatial Reasoning in Vision-Language Models via CoT Prompting & RL (2025) | PaLI-Gemma2, Qwen2.5-VL, Llama-4-Scout | Counting, Relations, Depth, Distance, Egocentric/Object Movement (SAT, CV-Bench, CLEVR, VSR) | All about prompting. Many different techniques | Naïve CoT can harm performance; models may rely on superficial linguistic patterns; standard SFT fails to generalize OOD | No LLaVA and little mention of CLIP |
| Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas (2025) | **LLaVA-1.5, LLaVA-1.6**, Qwen2-VL | Simple Multiple-choice spatial reasoning tasks (left/right, on/under, front/behind) from WhatsUp & VSR datasets  | Moderate. Prompts ask for object relations; adaptive attention (ADAPTVIS) uses model confidence to guide reasoning |  VLMs underutilize image tokens; attention misalignment causes spatial hallucinations; model over-relies on familiar relationships; low-confidence relations fail |  Focus on CLIP as backbone is limited; dependency on validation sets for ADAPTVIS; only intermediate layers analyzed; specific prompt effects not deeply varied |

## Abstract

### Introduction

WRITE AN INTRODUCTION

### Research Question

How sensitive is spatial reasoning performance in VLMs to prompt structure & ordering?

### Hypothesis

The spatial reasoning performance of VLMs varies significantly with the structure and order of prompts: prompts with clear instructions yield higher accuracy than prompts with ambiguous or shuffled instructions.

### Methods

Write methods here
The experiments use the Whatsup dataset: 205 images, each in 4 different positions, for a total of 820 images. It was chosen because it isolates simple spatial relations such as left or right much more precisely than most existing datasets. The images are exactly the same, with a minimized background, except that the tested objects move around so that they stand in different spatial relations to one another. The dataset also accounts for biases: a dog, which is usually under a table, is also placed on top of it, and arrangements that only occur one way, such as a house being above water rather than under it, are not used. Whatsup also shows that the word "behind" (52% accuracy) performs worse than "in the background" (67% accuracy).

WHAT TO TEST? (only simple spatial relations?)
CoT prompting, VoT prompting?, CoT with scene-graph prompting!, multiple-choice prompting, shuffled prompting, simple prompting, different wording (e.g. "background" instead of "behind"), describe-then-reason CoT (first describe the objects, then the spatial relations), optical-flow CoT?, structured few-shot prompting?, conversational few-shot prompting?

WILL TEST:

- Simple instruction, to serve as a baseline and reflect naive prompting (Which object is to the left of X? Where is X in relation to the object? ...)
- Chain-of-Thought prompting (Think step by step, (insert basic prompt))
- Scene-graph CoT (First list all objects and their relative positions, then answer: (insert basic prompt)). The literature suggests this works considerably better than plain CoT, which may even hurt performance.
- Multi-stage descriptive prompting (Describe all objects in the scene and their colors/positions. Now answer: (insert basic prompt)). The paper suggests that explicitly separating perception and reasoning may improve accuracy.
- Multiple-choice prompts ((insert basic prompt) Choose one of the following: ...)
- Alternative wording ("behind" vs "in the background", "on top" vs "above", ...). Whatsup suggests that wording changes performance significantly (behind = 52%, in the background = 67%).
- Shuffled or ambiguous prompts (Left, Right, Above, ... -> Right, Above, Left, ...; does "left of the table" mean to my left or to the table's left?). Tests sensitivity to ordering and clarity, directly addressing the hypothesis.
- Few-shot prompting (“Earlier, you identified positions like this: (prev answer). Now answer this one (insert basic prompt).”). Could improve reasoning by giving context; the paper suggests this.
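
The prompt types listed above could be generated from one shared base question, so that only the prompt structure changes between conditions. The following is a minimal sketch under that assumption; the strategy names, the `build_prompt` helper, and the example question are hypothetical and not part of the notebook yet.

```python
# Hypothetical strategy names; the template wording follows the list above.
TEMPLATES = {
    "simple": "{question}",
    "cot": "Think step by step. {question}",
    "scene_graph_cot": (
        "First list all objects and their relative positions, "
        "then answer: {question}"
    ),
    "multi_stage": (
        "Describe all objects in the scene and their colors/positions. "
        "Now answer: {question}"
    ),
    "multiple_choice": "{question} Choose one of the following: {options}",
    "few_shot": (
        "Earlier, you identified positions like this: {previous}. "
        "Now answer this one: {question}"
    ),
}


def build_prompt(strategy, question, options=None, previous=""):
    """Wrap a base question in the template of the chosen prompt strategy."""
    template = TEMPLATES[strategy]
    option_text = ", ".join(options) if options else ""
    # str.format ignores keyword arguments that a template does not use.
    return template.format(
        question=question, options=option_text, previous=previous
    )


# Example base question (made up for illustration).
base = "Where is the mug in relation to the plate?"
for name in TEMPLATES:
    print(name, "->", build_prompt(name, base, ["left", "right"], "left"))
```

Keeping the base question fixed means that accuracy differences between strategies can be attributed to the prompt wrapper rather than to the question itself.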

Metrics defining success:

- Accuracy in % (primary)
- Prompt sensitivity (secondary): measures how much the prompt mattered
- Confidence (secondary): gives insight into whether attention or confidence signals align with spatial reasoning performance; also a link to the most recent paper on ADAPTVIS
- Error type (secondary): shows whether left/right is answered more accurately than above/below
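
One possible way to compute these metrics from a list of per-example results is sketched below. The record layout (`prompt`, `relation`, `gold`, `pred`) is an assumption, and prompt sensitivity is defined here as the range and the population standard deviation of per-prompt accuracy, which is only one of several reasonable definitions. The records are toy data, not experimental results.

```python
from collections import defaultdict
from statistics import pstdev


def accuracy(records):
    """Percentage of records whose prediction matches the gold label."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["pred"] == r["gold"])
    return 100.0 * correct / len(records)


def group_accuracy(records, key):
    """Accuracy per group, e.g. per prompt type or per spatial relation."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {name: accuracy(rows) for name, rows in groups.items()}


def prompt_sensitivity(per_prompt_acc):
    """Spread of accuracy across prompt types (assumed definition)."""
    values = list(per_prompt_acc.values())
    return {"range": max(values) - min(values), "std": pstdev(values)}


# Toy records, made up for illustration only.
records = [
    {"prompt": "simple", "relation": "left", "gold": "left", "pred": "left"},
    {"prompt": "simple", "relation": "under", "gold": "under", "pred": "on"},
    {"prompt": "cot", "relation": "left", "gold": "left", "pred": "left"},
    {"prompt": "cot", "relation": "under", "gold": "under", "pred": "under"},
]

per_prompt = group_accuracy(records, "prompt")      # primary metric per prompt
per_relation = group_accuracy(records, "relation")  # error type by relation
sensitivity = prompt_sensitivity(per_prompt)
print(per_prompt, per_relation, sensitivity)
```

Grouping by `relation` instead of `prompt` gives the error-type view with the same helper, which keeps the analysis code small.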

All prompts will be tested on the full Whatsup image dataset, which consists of 205 images in 4 different spatial relations, totaling 820 images. The batch size will be 32. This is fine because the Whatsup images are tightly controlled, as mentioned above.
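
As a quick check of the numbers above, the following sketch works out the dataset size and the resulting number of batches. The `iter_batches` helper is hypothetical, and the integer indices stand in for the actual images.

```python
import math

NUM_SCENES = 205          # images per spatial relation, from the text
RELATIONS_PER_SCENE = 4   # spatial relations per image, from the text
BATCH_SIZE = 32

total_images = NUM_SCENES * RELATIONS_PER_SCENE
num_batches = math.ceil(total_images / BATCH_SIZE)
last_batch_size = total_images - (num_batches - 1) * BATCH_SIZE


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder indices instead of real images.
batches = list(iter_batches(list(range(total_images)), BATCH_SIZE))
print(total_images, num_batches, last_batch_size)
```

With these values the last batch is smaller than the others, so any per-batch averaging should be weighted by batch size.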


### Research Goal

To systematically evaluate the effect of prompt design on VLM spatial reasoning performance, identify limitations, and provide insights for more effective prompt engineering.

### Key findings

Can be empty for now

## Literature Review

Explain why spatial reasoning is hard and give a summary of the relevant papers (include links?).

### Whatsup Paper

Whatsup suggests that prompting impacts performance significantly. It also shows that accuracy on spatial relations generally lands at around 50%, barely better than guessing, and sometimes falls below 50%, meaning the model is systematically guessing wrong. A probable cause is biases such as a cup usually being on top of a table instead of under it.
### SpatialEval Paper

The SpatialEval paper suggests that VLMs, including LLaVA, rely heavily on textual cues and often underutilize visual input, sometimes performing better without any visual cues at all. It provides both a benchmark framework and concrete evidence that LLaVA’s spatial reasoning is highly sensitive to textual information and task setup, supporting a controlled analysis of how different prompts influence performance on spatial tasks.
### CoT Paper

The statement "our empirical results show that naive CoT prompting (e.g., ”think first, then answer”) not only fails to improve spatial reasoning in VLMs, but may even degrade performance." supports the importance of prompting for spatial relations. Naive CoT worsens performance, whereas scene-graph-based CoT significantly improves it (and maybe also reduces the chance of reward hacking?). Using the prompt alongside the image or only the prompt increases accuracy, supporting the hypothesis.

The paper systematically evaluates how different prompting strategies affect spatial reasoning in VLMs. It shows that naive prompting (like simple CoT) can degrade performance, while structured prompts based on scene graphs and multi-stage approaches improve accuracy. The use of few-shot and conversational prompting highlights how subtle changes in prompt design can influence reasoning outcomes.

Additionally, the paper examines generalization under out-of-distribution conditions, showing that model performance can shift depending on prompt phrasing, which is critical evidence for this study. Finally, the GRPO reinforcement learning fine-tuning results demonstrate that, beyond prompt design, the model’s ability to align visual and textual cues also strongly impacts spatial reasoning performance, providing context for interpreting the experiments in this notebook.
### Why Is Spatial Reasoning Hard for VLMs? Paper

The paper specifically examines LLaVA and studies how prompting interacts with attention in spatial reasoning. It shows that prompt-based tasks alone are insufficient if the model misallocates attention; the key insight is that model confidence can guide adaptive attention (ADAPTVIS) to significantly improve performance. LLaVA struggles with spatial reasoning mainly due to sparse and misaligned attention on image tokens. Standard prompts guide reasoning moderately but cannot fix attention issues (relevant?). ADAPTVIS, a confidence-guided attention intervention, improves spatial reasoning by up to 50 absolute points (relevant?).

Intermediate layers of LLaVA are crucial for processing visual information; early layers capture global structure, middle layers refine spatial understanding (relevant?). Attention distribution, not just quantity, is critical. Confidence scores indicate familiarity with relations.

"Our investigation begins with a key observation: despite image tokens comprising around 90% of the input sequence, they receive only about 10% of the model’s attention. This significant imbalance suggests that textual priors often overshadow visual evidence, explaining VLMs’ struggles with vision-centric tasks." This statement supports the hypothesis.
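
To make the size of this imbalance concrete, the quoted figures can be turned into an average attention per token. This is only a back-of-the-envelope reading of the quote, assuming a sequence of $N$ tokens in which exactly 90% are image tokens receiving exactly 10% of the total attention; the symbols $\bar a_{\text{image}}$ and $\bar a_{\text{text}}$ are introduced here for illustration.

$$
\bar a_{\text{image}} = \frac{0.10}{0.90\,N} = \frac{1}{9N},
\qquad
\bar a_{\text{text}} = \frac{0.90}{0.10\,N} = \frac{9}{N},
\qquad
\frac{\bar a_{\text{text}}}{\bar a_{\text{image}}} = 81
$$

Under these assumptions an average text token receives roughly 81 times the attention of an average image token, which is one way to read why prompt wording could weigh so heavily on the answer.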



## Experimental Design

WHOLE LOTTA STUFF HERE
Using the Whatsup dataset images. Why? PUT LINK HERE TOO

First, I took only the multiple-choice prompts and varied their wording and shuffled them, to test how order and wording impact performance. Then I will take the best-performing multiple-choice prompt and compare its performance with the other prompting types.
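
The two-stage procedure described above could look roughly like the following sketch: generate reproducible shuffled orderings of the answer options, then pick the multiple-choice prompt with the highest accuracy for the later comparison. The function names are hypothetical, and the accuracy values in the example are placeholders, not results.

```python
import random


def shuffled_option_variants(options, num_variants, seed=0):
    """Return num_variants shuffled copies of the answer options.

    A fixed seed keeps the orderings reproducible across runs.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(num_variants):
        order = list(options)  # copy so the original order is untouched
        rng.shuffle(order)
        variants.append(order)
    return variants


def best_prompt(accuracy_by_prompt):
    """Name of the prompt variant with the highest accuracy."""
    return max(accuracy_by_prompt, key=accuracy_by_prompt.get)


options = ["left", "right", "above", "below"]
variants = shuffled_option_variants(options, num_variants=3, seed=42)

# Placeholder accuracies, made up for illustration only.
placeholder_accuracy = {"variant_a": 40.0, "variant_b": 55.0}
print(variants, best_prompt(placeholder_accuracy))
```

Seeding the shuffle makes it possible to rerun the experiment with exactly the same option orders, which matters when the effect of ordering is the thing being measured.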

```python
print("Works")
```

```
Works
```


## Actual Experiment

Experiment


## Results & Analysis

![Accuracy](MultipleChoioceAccuracy.png)
The "above" wording has lower accuracy, which shows that different wording can impact performance negatively.


## Conclusion

Conclusion