Object-DINO is a training-free method that extracts distributed, object-centric information from self-supervised Vision Transformers (such as DINO). It leverages this localized visual evidence for two applications: unsupervised object discovery and mitigating object hallucinations in Multimodal Large Language Models (MLLMs).
Object-DINO/
├── Demo.ipynb
├── unsupervised_object_discovery/
└── mllm_hallucinations/
├── coco/
├── chair/
├── pope/
└── mme/
| Environment | Used For |
|---|---|
dinov3_env |
Unsupervised Object Discovery |
llava |
MLLM guidance generation |
marine |
POPE & CHAIR evaluation |
# Create the environments
conda env create -f envs/dinov3_env.yml
conda env create -f envs/llava.yml
conda env create -f envs/marine.ymlThis application replaces the standard TokenCut baseline (which natively uses all final-layer heads) with our custom, dynamically selected set of object-centric heads distributed across the network (object_dino_feature_extraction.py).
Please refer to Download_data.md for downloading VOC2007, VOC2012, and COCO 2014 datasets.
sbatch unsupervised_object_discovery/run_object_discovery.sh| Argument | Description | Value Used |
|---|---|---|
--dataset |
Dataset name | VOC07, VOC12, COCO20k |
--set |
Split | trainval / train |
--which_features |
Feature type | object_dino (our method) |
--arch |
Backbone | vit_base |
--tau |
Graph threshold | -0.35 |
Our method provides explicit visual grounding by generating object-centric similarity maps using DINOv3. These maps are used to guide LLaVA 1.5's decoding process via logit blending, amplifying tokens that are consistent with the visual evidence to reduce hallucination:
combined_logits = α · logits(original image) + (1 - α) · logits(highlighted image)
To generate highlighted images, run dino_coco.py, dino_pope.py, or dino_mme.py using the dinov3_env environment.
Runs LLaVA 1.5 with α-guided decoding across all three benchmarks:
sbatch mllm_hallucinations/run_quick_test.shThis runs sequentially:
pope/guidance_pope.py— answers POPE yes/no questions with guidancecoco/guidance_coco.py— generates guided captions for CHAIR evaluationmme/guidance_mme.py— generates guided answers for MME tasks
conda activate marine
bash mllm_hallucinations/eval_quick_test.shThis runs:
- POPE:
convert_pope.py→eval_pope.py - CHAIR:
chair_alpha.sh(converts JSON → runschair.py) - MME:
eval_mme.py
# Guidance (llava env)
cd mllm_hallucinations/pope
CUDA_VISIBLE_DEVICES=0 python -u guidance_pope.py
# Eval (marine env)
python convert_pope.py
python eval_pope.py# Guidance (llava env)
cd mllm_hallucinations/coco
CUDA_VISIBLE_DEVICES=0 python -u guidance_coco.py
# Eval (marine env)
cd mllm_hallucinations/chair
bash chair_alpha.sh# Guidance (llava env)
cd mllm_hallucinations/mme
CUDA_VISIBLE_DEVICES=0 python -u guidance_mme.py
# Eval
python eval_mme.pyThis repository builds upon several excellent open-source projects. We would like to thank the authors of:
If you find this work useful, please consider citing:
@article{rawlekar2026finding,
title={Finding Distributed Object-Centric Properties in Self-Supervised Transformers},
author={Rawlekar, Samyak and Swain, Amitabh and Cai, Yujun and Wang, Yiwei and Yang, Ming-Hsuan and Ahuja, Narendra},
journal={arXiv preprint arXiv:2603.26127},
year={2026}
}
