Would you still call this Dax? Novel Visual References in VLMs and Humans

This repository contains the code, data, and pre-computed results for the EMNLP paper:

Would you still call this Dax? Novel Visual References in VLMs and Humans. EMNLP 2025 (under review).

Abstract

We study how vision-language models (VLMs) handle novel visual concept references under systematic visual perturbation. We introduce NVRD (Novel Visual Reference Dataset), a controlled benchmark of 90 objects across four ontological categories (Known, Novel, Shape-Shape, Shape-Texture), each perturbed by 11 types of visual transformation across up to 20 severity levels (19,176 images total). Using three complementary probing paradigms — nonce word generation (§4.1), log-probability scoring (§4.2), and dual-image Likert-scale rating (§4.3) — we evaluate five state-of-the-art VLMs alongside 30 human participants. Our results reveal that while humans show robust, category-sensitive reference preservation, all VLMs exhibit substantially different degradation profiles, with notable overreliance on surface features and limited sensitivity to semantic category boundaries.

Dataset

NVRD is hosted on HuggingFace at adadtur/nvrd.

Property	Value
Objects	90 (across 4 categories)
Categories	Known, Novel, Shape-Shape, Shape-Texture
Perturbation types	11 (add, background, color, jpeg, noise, pixelate, remove, scale, shape, style, texture)
Levels per perturbation	up to 20
Total images	19,176
Human participants	30 (Prolific)

Quick setup

git clone https://github.com/your-org/nvrd.git
cd nvrd
python -m venv .venv && source .venv/bin/activate

# Install PyTorch first (match your CUDA version):
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126   # CUDA 12.6
# pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8
# pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cpu   # CPU only

pip install -r requirements.txt

Dataset download

Download all NVRD images from HuggingFace into data/sample-data/:

python data/download_dataset.py

The script fetches ~19K images. The first run downloads ~971 Parquet shard files before extracting any images, so expect 20–30 minutes before PNGs appear on disk. Subsequent runs skip images that are already downloaded.

It creates the following layout (which the experiment scripts expect):

data/sample-data/
├── known/
│   ├── {object}.png               (original)
│   └── perturbations_comp/
│       └── {ptype}/
│           └── {object}_{level}.png
├── novel/
│   └── ...
└── modified/
    ├── shape-shape/
    │   └── ...
    └── shape-texture/
        └── ...
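
As a rough illustration of the layout above, the sketch below builds image paths and counts the PNGs per split, which can help check download progress. The helper names (`original_path`, `perturbed_path`, `count_pngs`) are hypothetical and not part of this repository. It assumes that the `novel/` and `modified/` splits mirror the structure shown for `known/`, and that `{level}` is written without zero padding.

```python
from pathlib import Path

# Directory of each split relative to the data root, following the layout above.
# Assumption: novel/ and modified/* mirror the structure shown for known/.
SPLIT_DIRS = {
    "known": Path("known"),
    "novel": Path("novel"),
    "shape-shape": Path("modified") / "shape-shape",
    "shape-texture": Path("modified") / "shape-texture",
}


def original_path(root, split, obj):
    """Path of the unperturbed image for one object."""
    return Path(root) / SPLIT_DIRS[split] / f"{obj}.png"


def perturbed_path(root, split, ptype, obj, level):
    """Path of one perturbed image: perturbations_comp/{ptype}/{object}_{level}.png."""
    # Assumption: the level is inserted as given, with no zero padding.
    return (Path(root) / SPLIT_DIRS[split] / "perturbations_comp"
            / ptype / f"{obj}_{level}.png")


def count_pngs(root):
    """Count PNG files under each split directory (0 if the directory is missing)."""
    counts = {}
    for split, rel in SPLIT_DIRS.items():
        split_dir = Path(root) / rel
        counts[split] = (sum(1 for _ in split_dir.rglob("*.png"))
                         if split_dir.is_dir() else 0)
    return counts
```

Once the download has finished, `sum(count_pngs("data/sample-data").values())` can be compared against the 19,176 images listed in the dataset table.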

Reproducing figures

All figures are saved to figures/ (created automatically).

Figures 3 & 4 — Nonce reference generation accuracy + z-scored log-prob by object category:

python analysis/plot_figure3.py

Output: figures/combined_gen_prob_by_category.pdf

Figure 5 — Human vs model Likert rating comparison:

python analysis/plot_figure5.py

Outputs (all in figures/):

fig_cat_model_ratings.pdf — all-model ratings by object category
fig_cat_model_ratings_molmo2.pdf — Molmo2 only
fig_cat_human_comparison.pdf — human + all-model overlay by category
fig_cat_human_comparison_molmo2.pdf — human + Molmo2 overlay
scatter_human_vs_model_by_obj_cat.pdf — scatter correlation by category

Table 2 — Cross-task Spearman correlations:

python analysis/cross_task_consistency.py

Output: figures/cross_task_consistency.tex (also printed to stdout)
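
For readers unfamiliar with the statistic in Table 2, here is a minimal standard-library sketch of a Spearman correlation (Pearson correlation computed on average ranks). It is not taken from `cross_task_consistency.py`, whose implementation may differ (for example, it may call a statistics library). The function names and any inputs are illustrative assumptions.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the computation, the measure is insensitive to the differing scales of the three tasks (accuracy, log-probability, Likert rating).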

Appendix figures — Human comparison by perturbation type, ablations, nonce vs vanilla breakdowns:

python analysis/plot_appendix.py

Outputs multiple PDFs in figures/.

Running experiments

Experiments require:

The dataset downloaded to data/sample-data/ (see above)
A nonce word mapping file at data/nonce_word_mapping.json
For generation experiments: precomputed ICL pools at data/precomputed_icl_pools.json

Environment variables

Variable	Required for	Description
`HF_TOKEN`	Downloading gated HF models	HuggingFace access token
`OPENAI_API_KEY`	GPT-4o-mini experiments	OpenAI API key
`GEMINI_API_KEY`	Gemini 2.5 Flash experiments	Google Gemini API key

Model keys

The following model keys are supported across all experiment scripts:

Key	Model
`qwen2-vl-7b`	Qwen/Qwen2-VL-7B-Instruct
`qwen2.5-vl-72b`	Qwen/Qwen2.5-VL-72B-Instruct
`idefics3-8b`	HuggingFaceM4/Idefics3-8B-Llama3
`molmo2-8b`	allenai/Molmo2-8B
`gpt-4o-mini`	gpt-4o-mini
`gemini-2.5-flash`	gemini-2.5-flash
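
The two tables above can be combined into a small pre-flight check before launching a run. The sketch below is hypothetical (the experiment scripts may validate their inputs differently); `MODEL_IDS`, `API_KEY_VARS`, and `missing_env` are illustrative names. It treats `HF_TOKEN` as optional, on the assumption that it is only needed for gated checkpoints.

```python
import os

# Model keys and identifiers copied from the model-key table above.
MODEL_IDS = {
    "qwen2-vl-7b": "Qwen/Qwen2-VL-7B-Instruct",
    "qwen2.5-vl-72b": "Qwen/Qwen2.5-VL-72B-Instruct",
    "idefics3-8b": "HuggingFaceM4/Idefics3-8B-Llama3",
    "molmo2-8b": "allenai/Molmo2-8B",
    "gpt-4o-mini": "gpt-4o-mini",
    "gemini-2.5-flash": "gemini-2.5-flash",
}

# API-backed keys and the environment variable each one needs.
API_KEY_VARS = {
    "gpt-4o-mini": "OPENAI_API_KEY",
    "gemini-2.5-flash": "GEMINI_API_KEY",
}


def missing_env(model_key, env=None):
    """Return the required environment variables that are unset for model_key."""
    env = os.environ if env is None else env
    if model_key not in MODEL_IDS:
        raise KeyError(f"unknown model key: {model_key}")
    required = API_KEY_VARS.get(model_key)
    if required is None:
        # Local model. Assumption: HF_TOKEN is only needed for gated checkpoints.
        return []
    return [] if env.get(required) else [required]
```

For example, `missing_env("gpt-4o-mini")` returns `["OPENAI_API_KEY"]` when that variable is not set in the current shell.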

Generation experiment (§4.1 + §4.2)

Runs both nonce word generation and log-probability scoring (for local models):

python experiments/run_generation.py qwen2-vl-7b

Optional arguments:

python experiments/run_generation.py <model_key> [split] [--prob-only]

  split       : known, novel, shape-shape, shape-texture, or all (default)
  --prob-only : skip generation, run probability experiments only

Note: Local models (Qwen, Idefics, Molmo) require a GPU with ~16 GB VRAM. API models (gpt-4o-mini, gemini-2.5-flash) require the corresponding API key.

Results are saved to results/generation/ and results/probability/.

Likert rating experiment (§4.3)

python experiments/run_likert.py qwen2-vl-7b

Optional arguments:

python experiments/run_likert.py <model_key> [split]

Results are saved to results/prolific_style/.

Sycophancy ablation (§6.1)

python experiments/run_sycophancy.py qwen2-vl-7b

Optional arguments:

python experiments/run_sycophancy.py [model_key] [--n_pairs N] [--seed S]

  --n_pairs : number of cross-object pairs to sample (default: 1000)
  --seed    : random seed (default: 42)

Results are saved to results/sycophancy_ablation/.
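
To illustrate how `--n_pairs` and `--seed` can make a sampled ablation reproducible, here is a sketch of seeded pair sampling. It is not the code in `run_sycophancy.py`. It assumes that pairs are unordered, drawn without replacement, and formed between distinct objects; the actual script may sample differently (for example, over images instead of objects).

```python
import itertools
import random


def sample_cross_object_pairs(objects, n_pairs=1000, seed=42):
    """Sample up to n_pairs unordered pairs of distinct objects, reproducibly."""
    # Sort first so the result does not depend on the input order.
    all_pairs = list(itertools.combinations(sorted(objects), 2))
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
```

With 90 objects there are 4,005 unordered pairs, so the default of 1,000 covers roughly a quarter of them; rerunning with the same seed yields the same pairs.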

Pre-computed results

Large result files (532 MB total) are not included in this Git repository.

For server users: results/generation/, results/probability/, and results/prolific_style/ are symlinks to the local cluster copy.
For external users: the full result files are available from the authors upon request or via the HuggingFace dataset card.
Sycophancy ablation results (1.4 MB) are included directly in results/sycophancy_ablation/.

See results/README.md for file format documentation.
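
Since the large result files are JSONL, they can be streamed one record at a time without loading everything into memory. The reader below is a generic sketch; `iter_jsonl` is a hypothetical helper, and the record fields are whatever results/README.md documents (none are assumed here).

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_jsonl(some_file))` counts the records in a file, where `some_file` is any JSONL path under results/.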

Repository structure

nvrd/
├── README.md
├── requirements.txt
├── .gitignore
├── data/
│   ├── download_dataset.py        download NVRD from HuggingFace
│   ├── nonce_words.txt            nonce word list
│   └── human_study/
│       └── trial-results-1.csv    human participant ratings (n=30)
├── experiments/
│   ├── run_generation.py          §4.1 name generation + §4.2 log-prob
│   ├── run_likert.py              §4.3 Likert rating
│   └── run_sycophancy.py          §6.1 sycophancy ablation
├── analysis/
│   ├── plot_settings.py           shared constants + data loaders
│   ├── plot_figure3.py            Figure 3 (gen accuracy + z-log-prob by category)
│   ├── plot_figure5.py            Figure 5 (human vs model Likert)
│   ├── plot_appendix.py           Appendix figures
│   └── cross_task_consistency.py  Table 2
└── results/
    ├── README.md                  result format documentation
    ├── generation/                symlink -> 331 MB JSONL files
    ├── probability/               symlink -> 102 MB JSONL files
    ├── prolific_style/            symlink -> 99 MB JSONL files
    └── sycophancy_ablation/       1.4 MB, included directly

Citation

If you use NVRD or this code in your work, please cite:

@inproceedings{atur2025nvrd,
  title     = {Would you still call this Dax? Novel Visual References in VLMs and Humans},
  author    = {Atur, Aditi and others},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
}

License

The code in this repository is released under the MIT License. The NVRD dataset is released under CC BY 4.0.
