This repository contains the code, data, and pre-computed results for the EMNLP paper:
Would you still call this Dax? Novel Visual References in VLMs and Humans. EMNLP 2025 (under review).
We study how vision-language models (VLMs) handle novel visual concept references under systematic visual perturbation. We introduce NVRD (Novel Visual Reference Dataset), a controlled benchmark of 90 objects across four ontological categories (Known, Novel, Shape-Shape, Shape-Texture), each perturbed by 11 types of visual transformation across up to 20 severity levels (19,176 images total). Using three complementary probing paradigms — nonce word generation (§4.1), log-probability scoring (§4.2), and dual-image Likert-scale rating (§4.3) — we evaluate five state-of-the-art VLMs alongside 30 human participants. Our results reveal that while humans show robust, category-sensitive reference preservation, all VLMs exhibit substantially different degradation profiles, with notable overreliance on surface features and limited sensitivity to semantic category boundaries.
NVRD is hosted on HuggingFace at adadtur/nvrd.
| Property | Value |
|---|---|
| Objects | 90 (across 4 categories) |
| Categories | Known, Novel, Shape-Shape, Shape-Texture |
| Perturbation types | 11 (add, background, color, jpeg, noise, pixelate, remove, scale, shape, style, texture) |
| Levels per perturbation | up to 20 |
| Total images | 19,176 |
| Human participants | 30 (Prolific) |
git clone https://github.com/your-org/nvrd.git
cd nvrd
python -m venv .venv && source .venv/bin/activate
# Install PyTorch first (match your CUDA version):
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
# pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8
# pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cpu # CPU only
pip install -r requirements.txt

Download all NVRD images from HuggingFace into data/sample-data/:

python data/download_dataset.py

This downloads ~19K images from HuggingFace. The first run downloads ~971 Parquet shard files before extracting images, so expect 20–30 minutes before any PNGs appear on disk. Subsequent runs skip already-downloaded images.
It creates the following layout (which the experiment scripts expect):
data/sample-data/
├── known/
│ ├── {object}.png (original)
│ └── perturbations_comp/
│ └── {ptype}/
│ └── {object}_{level}.png
├── novel/
│ └── ...
└── modified/
├── shape-shape/
│ └── ...
└── shape-texture/
└── ...
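Once the download finishes, the layout above can be indexed with a short script. The following is a minimal sketch, not part of the repository: the names `PerturbedImage` and `iter_perturbed` are hypothetical, and it assumes only the `perturbations_comp/{ptype}/{object}_{level}.png` pattern shown for `known/`. The level is kept as the raw filename suffix because its exact format is not documented here.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class PerturbedImage(NamedTuple):
    category: str  # directory above perturbations_comp/, e.g. "known"
    obj: str       # object name taken from the filename
    ptype: str     # perturbation type, e.g. "noise"
    level: str     # severity level, kept as the raw filename suffix
    path: Path


def iter_perturbed(root: Path) -> Iterator[PerturbedImage]:
    """Yield one record per {object}_{level}.png under perturbations_comp/."""
    for path in sorted(root.glob("**/perturbations_comp/*/*.png")):
        # Split on the last underscore so object names may contain underscores.
        obj, sep, level = path.stem.rpartition("_")
        if not sep:
            continue  # filename does not follow {object}_{level}.png
        ptype = path.parent.name
        category = path.parent.parent.parent.relative_to(root).as_posix()
        yield PerturbedImage(category, obj, ptype, level, path)


if __name__ == "__main__":
    images = list(iter_perturbed(Path("data/sample-data")))
    print(f"{len(images)} perturbed images found")
```

Counting the records per category or perturbation type is a quick way to confirm that a download completed before launching experiments.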
All figures are saved to figures/ (created automatically).
Figures 3 & 4 — Nonce reference generation accuracy + z-scored log-prob by object category:
python analysis/plot_figure3.py

Output: figures/combined_gen_prob_by_category.pdf
Figure 5 — Human vs model Likert rating comparison:
python analysis/plot_figure5.py

Outputs (all in figures/):
- fig_cat_model_ratings.pdf: all-model ratings by object category
- fig_cat_model_ratings_molmo2.pdf: Molmo2 only
- fig_cat_human_comparison.pdf: human + all-model overlay by category
- fig_cat_human_comparison_molmo2.pdf: human + Molmo2 overlay
- scatter_human_vs_model_by_obj_cat.pdf: scatter correlation by category
Table 2 — Cross-task Spearman correlations:
python analysis/cross_task_consistency.py

Output: figures/cross_task_consistency.tex (also printed to stdout)
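For readers unfamiliar with the statistic in Table 2, the sketch below computes a Spearman rank correlation from its textbook definition (Pearson correlation of average ranks). It is an illustration only and is not taken from `cross_task_consistency.py`; the function names and the example inputs are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-item scores from two tasks, for illustration only.
    likert = [1.0, 2.5, 2.5, 4.0, 5.0]
    logprob = [-3.2, -2.0, -2.1, -0.9, -0.4]
    print(round(spearman(likert, logprob), 3))
```

Because only ranks enter the computation, the result is unchanged by any monotone rescaling of either task's scores, which is what makes it suitable for comparing tasks with different output scales.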
Appendix figures — Human comparison by perturbation type, ablations, nonce vs vanilla breakdowns:
python analysis/plot_appendix.py

Outputs multiple PDFs in figures/.
Experiments require:
- The dataset downloaded to data/sample-data/ (see above)
- A nonce word mapping file at data/nonce_word_mapping.json
- For generation experiments: precomputed ICL pools at data/precomputed_icl_pools.json
| Variable | Required for | Description |
|---|---|---|
| HF_TOKEN | Downloading gated HF models | HuggingFace access token |
| OPENAI_API_KEY | GPT-4o-mini experiments | OpenAI API key |
| GEMINI_API_KEY | Gemini 2.5 Flash experiments | Google Gemini API key |
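A pre-flight check can catch a missing API key before a long run starts. The sketch below is an assumption-laden illustration rather than repository code: `API_KEY_FOR_MODEL` and `missing_api_key` are hypothetical names, and the mapping covers only the two API-backed models, since which local models are gated (and so need `HF_TOKEN`) is not stated here.

```python
import os
from typing import Mapping, Optional

# Assumed mapping from API-backed model keys to the variable each one needs,
# read off the environment-variable table. Local models are left out on purpose.
API_KEY_FOR_MODEL = {
    "gpt-4o-mini": "OPENAI_API_KEY",
    "gemini-2.5-flash": "GEMINI_API_KEY",
}


def missing_api_key(
    model_key: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the name of the unset variable for model_key, or None if fine."""
    env = os.environ if env is None else env
    var = API_KEY_FOR_MODEL.get(model_key)
    if var is not None and not env.get(var):
        return var
    return None


if __name__ == "__main__":
    for key in API_KEY_FOR_MODEL:
        missing = missing_api_key(key)
        status = f"missing {missing}" if missing else "ok"
        print(f"{key}: {status}")
```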
The following model keys are supported across all experiment scripts:
| Key | Model |
|---|---|
| qwen2-vl-7b | Qwen/Qwen2-VL-7B-Instruct |
| qwen2.5-vl-72b | Qwen/Qwen2.5-VL-72B-Instruct |
| idefics3-8b | HuggingFaceM4/Idefics3-8B-Llama3 |
| molmo2-8b | allenai/Molmo2-8B |
| gpt-4o-mini | gpt-4o-mini |
| gemini-2.5-flash | gemini-2.5-flash |
Runs both nonce word generation and log-probability scoring (for local models):
python experiments/run_generation.py qwen2-vl-7b

Optional arguments:
python experiments/run_generation.py <model_key> [split] [--prob-only]
split : known, novel, shape-shape, shape-texture, or all (default)
--prob-only : skip generation, run probability experiments only
Note: The 7B and 8B local models (Qwen2-VL, Idefics3, Molmo2) require a GPU with ~16 GB VRAM; qwen2.5-vl-72b needs substantially more memory. API models (gpt-4o-mini, gemini-2.5-flash) require the corresponding API key.
Results are saved to results/generation/ and results/probability/.
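To sweep several models or splits, the command line above can be assembled programmatically. This is a minimal sketch under the usage string shown earlier; `generation_cmd` and `SPLITS` are hypothetical helper names, and the script only prints the commands rather than launching them.

```python
import sys
from typing import List

# Split names as listed in the optional-arguments description.
SPLITS = ("known", "novel", "shape-shape", "shape-texture", "all")


def generation_cmd(
    model_key: str, split: str = "all", prob_only: bool = False
) -> List[str]:
    """Build the argv list for experiments/run_generation.py."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    cmd = [sys.executable, "experiments/run_generation.py", model_key, split]
    if prob_only:
        cmd.append("--prob-only")
    return cmd


if __name__ == "__main__":
    for key in ("qwen2-vl-7b", "idefics3-8b"):
        for split in ("known", "novel"):
            print(" ".join(generation_cmd(key, split)))
```

Passing each returned list to `subprocess.run(cmd, check=True)` from the repository root would launch the runs one after another.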
python experiments/run_likert.py qwen2-vl-7b

Optional arguments:
python experiments/run_likert.py <model_key> [split]
Results are saved to results/prolific_style/.
python experiments/run_sycophancy.py qwen2-vl-7b

Optional arguments:
python experiments/run_sycophancy.py [model_key] [--n_pairs N] [--seed S]
--n_pairs : number of cross-object pairs to sample (default: 1000)
--seed : random seed (default: 42)
Results are saved to results/sycophancy_ablation/.
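The sampling procedure inside `run_sycophancy.py` is not described here. The sketch below only illustrates why fixing `--seed` makes a sample of cross-object pairs repeatable; `sample_cross_object_pairs` is a hypothetical function, and drawing unordered pairs of distinct object names is an assumption, not the script's documented behavior.

```python
import random
from itertools import combinations
from typing import Iterable, List, Tuple


def sample_cross_object_pairs(
    objects: Iterable[str], n_pairs: int = 1000, seed: int = 42
) -> List[Tuple[str, str]]:
    """Draw up to n_pairs unordered pairs of distinct objects, reproducibly."""
    # Sort first so the candidate list does not depend on input order.
    all_pairs = list(combinations(sorted(set(objects)), 2))
    if n_pairs >= len(all_pairs):
        return all_pairs
    # A private Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return rng.sample(all_pairs, n_pairs)


if __name__ == "__main__":
    names = [f"object_{i:02d}" for i in range(90)]  # placeholder names
    pairs = sample_cross_object_pairs(names)
    print(len(pairs), pairs[:3])
```

With 90 objects there are 4,005 unordered pairs, so the default of 1,000 is a strict subsample under this assumption.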
Large result files (532 MB total) are not included in this Git repository.
- For server users: results/generation/, results/probability/, and results/prolific_style/ are symlinks to the local cluster copy.
- For external users: the full result files are available from the authors upon request or via the HuggingFace dataset card.
- Sycophancy ablation results (1.4 MB) are included directly in results/sycophancy_ablation/.
See results/README.md for file format documentation.
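Since the result files are JSONL, they can be streamed one record at a time without loading a whole file into memory. The following is a generic sketch with a hypothetical helper name, `iter_jsonl`; it makes no assumption about the fields in each record, which are documented in results/README.md.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


if __name__ == "__main__":
    results_dir = Path("results/sycophancy_ablation")
    for file in sorted(results_dir.glob("*.jsonl")):
        count = sum(1 for _ in iter_jsonl(file))
        print(f"{file.name}: {count} records")
```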
nvrd/
├── README.md
├── requirements.txt
├── .gitignore
├── data/
│ ├── download_dataset.py download NVRD from HuggingFace
│ ├── nonce_words.txt nonce word list
│ └── human_study/
│ └── trial-results-1.csv human participant ratings (n=30)
├── experiments/
│ ├── run_generation.py §4.1 name generation + §4.2 log-prob
│ ├── run_likert.py §4.3 Likert rating
│ └── run_sycophancy.py §6.1 sycophancy ablation
├── analysis/
│ ├── plot_settings.py shared constants + data loaders
│ ├── plot_figure3.py Figure 3 (gen accuracy + z-log-prob by category)
│ ├── plot_figure5.py Figure 5 (human vs model Likert)
│ ├── plot_appendix.py Appendix figures
│ └── cross_task_consistency.py Table 2
└── results/
├── README.md result format documentation
├── generation/ symlink -> 331 MB JSONL files
├── probability/ symlink -> 102 MB JSONL files
├── prolific_style/ symlink -> 99 MB JSONL files
└── sycophancy_ablation/ 1.4 MB, included directly
If you use NVRD or this code in your work, please cite:
@inproceedings{atur2025nvrd,
title = {Would you still call this Dax? Novel Visual References in VLMs and Humans},
author = {Atur, Aditi and others},
booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
year = {2025},
}

The code in this repository is released under the MIT License. The NVRD dataset is released under CC BY 4.0.