# Evaluating Prompt Sensitivity for VLMs on Simple Spatial Reasoning: A Controlled Study on LLaVA
###### 20.12.2025
#### Notebook by [Adam Astamir](https://adamastamir.vercel.app/)
##### Supported by [Jae Hee Lee](https://jaeheelee.gitlab.io/)
##### [University of Hamburg](https://www.uni-hamburg.de/)

| Paper | Model(s) | Task Type | Prompt Role | Key Failure | What’s Missing |
|------|----------|-----------|-------------|-------------|---------------|
| What’s “Up” with Vision-Language Models? (2023) | 18 VLMs (CLIP, BLIP, BLIP-2, XVLM, CoCa, FLAVA...) | Image–text matching for basic spatial relations (left/right, on/under, in front/behind) | Minimal. Prompting is controlled to isolate spatial reasoning | Models fail to reliably distinguish basic spatial relations; performance is often near random | No focus on prompting. No counting involved |
| Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models (2024) | 7 VLMs (GPT-4V, **LLaVA**, BLIP-2, MiniGPT-4, OpenFlamingo, InstructBLIP, Otter) | Multiple-choice spatial reasoning tasks (e.g., map navigation, object arrangement, grid puzzles) | Moderate. Text prompts guide reasoning, images provide context | VLMs often ignore visual input or rely too heavily on text; accuracy is sometimes at or below random chance; models struggle with multi-step spatial reasoning | No experiments on CLIP specifically. The synthetic images used may not include the biases of real-world examples |
| Enhancing Spatial Reasoning in Vision-Language Models via CoT Prompting & RL (2025) | PaLI-Gemma2, Qwen2.5-VL, Llama-4-Scout | Counting, Relations, Depth, Distance, Egocentric/Object Movement (SAT, CV-Bench, CLEVR, VSR) | Central. Covers many different prompting techniques | Naïve CoT can harm performance; models may rely on superficial linguistic patterns; standard SFT fails to generalize OOD | No LLaVA and little mention of CLIP |
| Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas (2025) | **LLaVA-1.5, LLaVA-1.6**, Qwen2-VL | Simple multiple-choice spatial reasoning tasks (left/right, on/under, front/behind) from the WhatsUp and VSR datasets | Moderate. Prompts ask for object relations; adaptive attention (ADAPTVIS) uses model confidence to guide reasoning | VLMs underutilize image tokens; attention misalignment causes spatial hallucinations; models over-rely on familiar relationships; low-confidence relations fail | Limited focus on CLIP as backbone; ADAPTVIS depends on validation sets; only intermediate layers analyzed; specific prompt effects not deeply varied |

## Abstract

Vision–Language Models (VLMs), including LLaVA, have advanced considerably in multimodal understanding, but they continue to exhibit weaknesses in simple spatial reasoning tasks, such as identifying whether an object is to the left of, to the right of, above, or below another object. This study investigates how sensitive VLM performance is to prompt design, including prompt structure, option ordering, and wording. We hypothesize that spatial reasoning accuracy varies significantly with prompt clarity and phrasing, with clear, well-structured prompts yielding higher performance than ambiguous ones. To test this, we use part of the WhatsUp dataset, which contains 205 images presented in four controlled spatial positions, totaling 820 images, of which we use 418 from control group 2. The dataset isolates basic spatial relations while minimizing background biases and confounding object-context associations. We evaluate multiple prompting strategies: baseline multiple-choice prompts, shuffled and wording-variant prompts, Chain-of-Thought (CoT) prompting, Scene-Graph-based CoT, and multi-stage descriptive prompts. Accuracy, prompt sensitivity, and error patterns are measured to assess how subtle variations in prompt design influence VLM reasoning. Preliminary analyses indicate that even small changes in phrasing or option ordering can have notable effects on model performance, supporting the hypothesis that VLM spatial reasoning is highly prompt-sensitive. The study aims to provide a controlled, systematic evaluation of prompt effects on spatial reasoning, quantify performance variations, and identify strategies that reliably improve VLM reasoning in simple spatial tasks. (KEY FINDINGS MISSING)

### Key findings

Can be empty for now

## Literature Review

Spatial reasoning presents a persistent challenge for vision-language models (VLMs) because it requires the integration of visual information with textual cues in a coherent way. Many studies have shown that models like LLaVA tend to over-rely on textual prompts, often underutilizing the visual context. In some cases, performance actually improves when visual input is removed, highlighting the difficulty these models face in correctly aligning image representations with language understanding. Misaligned attention mechanisms further contribute to errors, leading to what has been described as “spatial hallucinations,” where objects’ relative positions are misinterpreted even in simple, controlled scenes.

The WhatsUp dataset, introduced in 2023, provides a benchmark for evaluating basic spatial relations, such as left/right, on/under, and in front/behind. Its controlled design minimizes background distractions and isolates object positions, making it ideal for assessing prompt sensitivity. Empirical results on WhatsUp indicate that VLMs often perform near chance levels, around 50%, and subtle variations in wording can have a substantial effect on accuracy. For example, asking about an object “behind” another yields only 52% accuracy, whereas phrasing it as “in the background” increases performance to 67%. These results illustrate how linguistic framing alone can significantly influence model behavior, even for straightforward spatial tasks.

Further evidence comes from the SpatialEval study, which shows that LLaVA and other VLMs frequently rely more on textual than visual information, often ignoring image content entirely. The study demonstrates that prompt phrasing and structure materially affect model performance, underscoring the importance of controlled experiments to isolate the effect of prompt design.

Attempts to improve reasoning through Chain-of-Thought (CoT) prompting, which guides the model to think step by step, have yielded mixed results. Naive CoT prompts often fail to improve performance and can even degrade accuracy. In contrast, approaches that explicitly separate perception from reasoning, such as Scene-Graph-based CoT or multi-stage descriptive prompting, have been shown to enhance spatial understanding significantly. By first describing all objects and their spatial relationships, these methods help models interpret visual information more effectively, reducing errors caused by superficial linguistic biases.

Recent work examining LLaVA’s attention patterns offers additional insight. Despite image tokens comprising approximately 90% of the input, they receive only about 10% of the model’s attention, indicating a strong bias toward textual priors. Techniques such as ADAPTVIS, which use confidence-guided attention, can partially correct this imbalance and improve spatial reasoning accuracy. This highlights that prompt design alone is insufficient; effective evaluation of spatial reasoning also requires attention to how visual information is processed internally.

Overall, the literature suggests that spatial reasoning in VLMs is highly sensitive to prompt phrasing, structure, and ordering. Unstructured or ambiguous prompts frequently lead to poor performance, whereas structured, multi-stage, or descriptive prompts improve outcomes by explicitly guiding models to consider visual and spatial relationships. However, there remains a lack of systematic investigation into how multiple-choice variations, ordering, and alternative wordings influence simple spatial reasoning in LLaVA. This gap motivates the controlled experiments presented in this work, focusing on isolating the effects of prompt design on model performance.

## Experimental Design

Our initial experiments focused exclusively on multiple-choice prompts to identify a strong baseline before testing more complex prompting strategies. Using the WhatsUp controlled dataset images, we evaluated variations in prompt wording and the ordering of multiple-choice options. This approach allows us to isolate the effects of text phrasing and option positioning on LLaVA’s spatial reasoning.

We tested a range of prompting strategies, including simple baseline instructions (“Which object is to the left of X?”), alternative wordings (“under” versus “underneath”, “above” versus “on top”), and shuffled answer options. Our goal was to determine which combination of wording and ordering yields the highest accuracy. Subsequent experiments with Chain-of-Thought, Scene-Graph-based CoT, multi-stage descriptive prompting, and few-shot prompting can then be compared against a strong multiple-choice baseline and reuse its best-performing option order and wording. Evaluation metrics were accuracy, prompt sensitivity, confidence, and error type. All experiments were run with a batch size of 32 images.
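The prompt variants described above could be generated along the following lines. This is a minimal sketch, not the notebook's actual code: the function `build_prompt`, the question template, the option letters, and the example objects are all assumptions made for illustration. Only the two wording alternatives named in the text are included.

```python
import random

# Wording alternatives named in the text; any other mapping would be a guess.
WORDING_VARIANTS = {
    "under": "underneath",
    "above": "on top",
}

def build_prompt(obj_a, obj_b, relations, shuffle=False, vary=None, seed=0):
    """Return (prompt_text, options) for one multiple-choice spatial question.

    relations: candidate relation words in their default order
    shuffle:   if True, permute the answer options with a seeded RNG
    vary:      set of relation words to replace with their alternative wording
    """
    options = [
        WORDING_VARIANTS.get(r, r) if vary and r in vary else r
        for r in relations
    ]
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible across runs.
        random.Random(seed).shuffle(options)
    letters = "ABCDEFGH"
    lines = [f"Where is the {obj_a} in relation to the {obj_b}?"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines), options

# Hypothetical example: a "Shuffle Vary Under"-style prompt.
relations = ["left", "right", "above", "under"]
prompt, options = build_prompt("mug", "plate", relations, shuffle=True, vary={"under"})
print(prompt)
```

Returning the final option order alongside the prompt text makes it possible to map the model's answer letter back to a relation after shuffling.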

## Actual Experiment

Testing LLaVA with multiple-choice prompts and their variations, we observed a wide range of accuracies, indicating significant prompt sensitivity. The baseline multiple-choice prompt achieved 58.37% accuracy, while shuffling the answer options reduced performance to 45.93%. Some combinations of shuffled options and altered wording, such as “Shuffle Vary Above,” produced extremely low accuracy at 26.08%, showing that subtle changes in both phrasing and ordering can drastically impair performance. Other variations, such as “Shuffle Vary Below” or “Shuffle Vary Under,” achieved 48.33% and 54.55%, respectively. Alternative wording alone, without shuffling, resulted in modest changes compared to the baseline, with “Vary Above” at 55.02%, “Vary Below” at 54.07%, and “Vary Under” at 54.31%.

These results indicate that LLaVA is highly sensitive to option order and wording, with shuffling having a particularly strong negative impact. The model appears to exhibit a “first-option bias,” relying on position rather than true spatial reasoning. While alternative wording alone only slightly affects performance, ordering interacts with phrasing to produce complex effects. Overall, the model struggles with simple spatial reasoning, and performance can swing from near random to moderately above chance depending solely on prompt design.
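One way to probe the suspected first-option bias is to tally which option position the model selects, independent of whether the answer is correct. The sketch below assumes the model's answers have already been mapped to option indices; `position_histogram` and the sample data are hypothetical and do not come from the study.

```python
from collections import Counter

def position_histogram(predicted_positions, num_options):
    """Fraction of answers given at each option position (0 = first listed)."""
    counts = Counter(predicted_positions)
    total = len(predicted_positions)
    return [counts.get(i, 0) / total for i in range(num_options)]

# Made-up predictions for illustration only, not results from the experiments.
example_predictions = [0, 0, 1, 0, 2, 0, 3, 0]
shares = position_histogram(example_predictions, num_options=4)
print(shares)
```

If the correct answer is spread evenly across positions, a share for position 0 that is far above the others would point toward a positional preference instead of visual reasoning.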


## Results & Analysis

![Accuracy 49.58%](MultipleChoiceAccuracy.png)

Overall accuracy across the eight prompt variants is 49.58%, with an overall confidence of 47.56%. LLaVA struggles with simple spatial reasoning and sometimes even falls below random chance. Some prompts also performed much better than others:

| Prompt variant | Accuracy | Confidence |
|------|------|------|
| Multiple Choice | 58.37% | 47.56% |
| Multiple Choice Shuffle | 45.93% | 50.54% |
| Multiple Choice Shuffle Vary Above | 26.08% | 59.96% |
| Multiple Choice Shuffle Vary Below | 48.33% | 49.72% |
| Multiple Choice Shuffle Vary Under | 54.55% | 45.92% |
| Multiple Choice Vary Above | 55.02% | 44.75% |
| Multiple Choice Vary Below | 54.07% | 49.03% |
| Multiple Choice Vary Under | 54.31% | 52.97% |
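The overall accuracy can be reproduced from the per-prompt accuracies. The snippet below uses only the numbers reported here and assumes every prompt variant was evaluated on the same number of images, so the overall value is the plain mean. The max-minus-min spread is one simple way to summarize prompt sensitivity; it is an assumption for this sketch and not necessarily the definition used in the notebook.

```python
from statistics import mean

# Per-prompt accuracy in percent, as reported in this section.
accuracy = {
    "Multiple Choice": 58.37,
    "Multiple Choice Shuffle": 45.93,
    "Multiple Choice Shuffle Vary Above": 26.08,
    "Multiple Choice Shuffle Vary Below": 48.33,
    "Multiple Choice Shuffle Vary Under": 54.55,
    "Multiple Choice Vary Above": 55.02,
    "Multiple Choice Vary Below": 54.07,
    "Multiple Choice Vary Under": 54.31,
}

overall = mean(accuracy.values())
spread = max(accuracy.values()) - min(accuracy.values())

print(f"overall accuracy: {overall:.2f}%")
print(f"best-to-worst spread: {spread:.2f} percentage points")
```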

### Effect of shuffling options

- Multiple Choice Shuffle: 45.93%
- Multiple Choice Shuffle Vary Above: 26.08%
- Multiple Choice Shuffle Vary Below: 48.33%
- Multiple Choice Shuffle Vary Under: 54.55%

Interpretation:

- Shuffling decreases performance in three of the four pairings, sometimes dramatically (26.08% for “Shuffle Vary Above”). Only “Vary Under” is essentially unchanged (54.55% shuffled versus 54.31% unshuffled).
- This shows strong sensitivity to option order. LLaVA seems to have biases toward certain positions in multiple-choice lists (“first option bias” is well-documented in language models).
- Variations in wording interact with shuffling in complex ways; for example, “Vary Above” + shuffle is very low (26.08%), suggesting that a subtle change in phrasing combined with a change in order can confuse the model significantly.

### Effect of alternative wording without shuffling

- Multiple Choice Vary Above: 55.02%
- Multiple Choice Vary Below: 54.07%
- Multiple Choice Vary Under: 54.31%

Interpretation:

- Slightly lower than the base multiple-choice prompt (58.37%), but generally consistent.
- Confirms the WhatsUp findings: wording changes can impact accuracy, but if the options are clear and in a normal order, LLaVA can handle them relatively well.
- LLaVA seems less sensitive to wording alone than to wording combined with shuffling, so ordering interacts with phrasing.
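To make the interaction between ordering and wording explicit, shuffled and unshuffled accuracy can be compared for each wording. The sketch below only rearranges the accuracies reported in this section; the pairing labels are naming choices made for this example.

```python
# (unshuffled accuracy, shuffled accuracy) in percent, as reported above.
pairs = {
    "Base": (58.37, 45.93),
    "Vary Above": (55.02, 26.08),
    "Vary Below": (54.07, 48.33),
    "Vary Under": (54.31, 54.55),
}

# Change in accuracy caused by shuffling, in percentage points.
deltas = {
    name: round(shuffled - plain, 2)
    for name, (plain, shuffled) in pairs.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} pp")
```

The size of the change differs widely between wordings, which is what an interaction between option order and phrasing would look like.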

Certain prompt configurations induce systematic failure modes, leading to performance significantly below random chance, indicating the application of incorrect but consistent linguistic heuristics rather than visual reasoning.
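Whether an accuracy differs significantly from chance depends on the sample size. As a rough aid, the sketch below computes a 95% Wilson score interval for a reported accuracy. It assumes each prompt variant was scored on all 418 images, which is inferred from the dataset description and may not match the notebook. The chance level itself is not hard-coded, since it depends on the number of answer options offered.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

n = 418  # assumed number of images per prompt variant
correct = round(0.2608 * n)  # correct answers implied by 26.08% accuracy
low, high = wilson_interval(correct, n)
print(f"Shuffle Vary Above: {low:.3f} to {high:.3f}")
```

A configuration can be called below chance with some confidence only if the upper end of this interval lies under the chance level for the task.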


## Conclusion

From these experiments, we conclude that LLaVA’s spatial reasoning is highly sensitive to prompt phrasing and option ordering. Baseline multiple-choice prompts with clear, unshuffled options provide the strongest performance. Shuffling or ambiguous phrasing can reduce accuracy drastically, demonstrating the model’s reliance on textual patterns rather than visual reasoning.