[CVPR 2026 Highlight] Object-DINO

Object-DINO is a training-free method that extracts distributed, object-centric information from self-supervised Vision Transformers (such as DINO). It leverages this localized visual evidence for two applications: unsupervised object discovery and mitigating object hallucinations in Multimodal Large Language Models (MLLMs).

Repository Structure

Object-DINO/
├── Demo.ipynb     
├── unsupervised_object_discovery/     
└── mllm_hallucinations/               
    ├── coco/                          
    ├── chair/                         
    ├── pope/                          
    └── mme/

Environments

| Environment | Used For |
| --- | --- |
| `dinov3_env` | Unsupervised Object Discovery |
| `llava` | MLLM guidance generation |
| `marine` | POPE & CHAIR evaluation |

Creating the Environments

# Create the environments
conda env create -f envs/dinov3_env.yml
conda env create -f envs/llava.yml
conda env create -f envs/marine.yml

Application 1: Unsupervised Object Discovery

This application replaces the features used by the standard TokenCut baseline, which draws on all final-layer heads, with a dynamically selected set of object-centric heads distributed across the network (see `object_dino_feature_extraction.py`).

Datasets

See `Download_data.md` for instructions on downloading the VOC2007, VOC2012, and COCO 2014 datasets.

Full Run (VOC07 + VOC12 + COCO20k)

sbatch unsupervised_object_discovery/run_object_discovery.sh

Key Arguments

| Argument | Description | Value Used |
| --- | --- | --- |
| `--dataset` | Dataset name | `VOC07`, `VOC12`, `COCO20k` |
| `--set` | Split | `trainval` / `train` |
| `--which_features` | Feature type | `object_dino` (our method) |
| `--arch` | Backbone | `vit_base` |
| `--tau` | Graph threshold | `-0.35` |
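
To make the role of `--tau` concrete, here is a minimal sketch of how a TokenCut-style method typically turns patch features into a graph: patch pairs whose cosine similarity exceeds the threshold get a full-weight edge, and all other pairs get a tiny weight. This is an illustration only, not the code in this repository. The function names, the `eps` value, and the strict `>` comparison are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_affinity(features, tau=-0.35, eps=1e-5):
    """Threshold pairwise cosine similarities into edge weights.

    Hypothetical helper: pairs above tau get weight 1.0, the rest get eps,
    so the graph stays connected. The eps value is an assumption.
    """
    n = len(features)
    return [
        [1.0 if cosine(features[i], features[j]) > tau else eps for j in range(n)]
        for i in range(n)
    ]


# Toy example: two similar patches and one pointing the opposite way.
patches = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
affinity = build_affinity(patches)
```

With a negative threshold such as `-0.35`, only strongly dissimilar patch pairs lose their edge, so the resulting graph is denser than it would be with a positive threshold.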

Application 2: MLLM Hallucination Mitigation

Our method provides explicit visual grounding by generating object-centric similarity maps using DINOv3. These maps are used to guide LLaVA 1.5's decoding process via logit blending, amplifying tokens that are consistent with the visual evidence to reduce hallucination:

combined_logits = α · logits(original image) + (1 - α) · logits(highlighted image)
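
The blending rule above can be sketched in a few lines of plain Python. This is a toy illustration over short lists, not the decoding code used with LLaVA 1.5. The function names are hypothetical, and restricting α to the range [0, 1] is an assumption.

```python
import math


def blend_logits(logits_original, logits_highlighted, alpha):
    """Return alpha * original + (1 - alpha) * highlighted, per token."""
    if not 0.0 <= alpha <= 1.0:  # assumption: alpha is a convex weight
        raise ValueError("alpha must lie in [0, 1]")
    if len(logits_original) != len(logits_highlighted):
        raise ValueError("both logit vectors must cover the same vocabulary")
    return [
        alpha * o + (1.0 - alpha) * h
        for o, h in zip(logits_original, logits_highlighted)
    ]


def softmax(logits):
    """Numerically stable softmax, for inspecting token probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three tokens.
original = [2.0, 1.0, 0.0]      # logits given the original image
highlighted = [0.0, 3.0, 0.0]   # logits given the highlighted image
combined = blend_logits(original, highlighted, alpha=0.5)
next_token = max(range(len(combined)), key=combined.__getitem__)
```

In this toy case the original image alone favors token 0, while the blended logits favor token 1, which the highlighted image supports. Setting `alpha=1.0` recovers the unguided logits.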

Highlighted Images

To generate highlighted images, run `dino_coco.py`, `dino_pope.py`, or `dino_mme.py` (one script per benchmark) in the `dinov3_env` environment.

Step 1 — Guidance Generation (`llava` env)

Runs LLaVA 1.5 with α-guided decoding across all three benchmarks:

sbatch mllm_hallucinations/run_quick_test.sh

This runs sequentially:

pope/guidance_pope.py — answers POPE yes/no questions with guidance
coco/guidance_coco.py — generates guided captions for CHAIR evaluation
mme/guidance_mme.py — generates guided answers for MME tasks

Step 2 — Evaluation (`marine` env)

conda activate marine
bash mllm_hallucinations/eval_quick_test.sh

This runs:

POPE: convert_pope.py → eval_pope.py
CHAIR: chair_alpha.sh (converts JSON → runs chair.py)
MME: eval_mme.py
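
For orientation, POPE poses yes/no questions about object presence, so its results are usually reported as binary-classification metrics. The sketch below shows how those numbers are commonly computed. It is not the logic of `eval_pope.py`; the function name and the assumption that answers are already normalized to the strings `"yes"` and `"no"` are illustrative.

```python
def pope_metrics(predictions, labels):
    """Compute accuracy, precision, recall, F1, and yes-ratio.

    Hypothetical helper: both inputs are lists of "yes"/"no" strings,
    with "yes" treated as the positive class.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")

    total = len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


# Toy example: one answer of each kind (TP, FP, TN, FN).
metrics = pope_metrics(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"])
```

A high yes-ratio combined with low precision is the typical signature of object hallucination on this kind of benchmark, since the model affirms objects that are not in the image.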

Running Individual Datasets

POPE only

# Guidance (llava env)
cd mllm_hallucinations/pope
CUDA_VISIBLE_DEVICES=0 python -u guidance_pope.py

# Eval (marine env)
python convert_pope.py
python eval_pope.py

CHAIR only

# Guidance (llava env)
cd mllm_hallucinations/coco
CUDA_VISIBLE_DEVICES=0 python -u guidance_coco.py

# Eval (marine env); the path below is relative to the repository root
cd mllm_hallucinations/chair
bash chair_alpha.sh

MME only

# Guidance (llava env)
cd mllm_hallucinations/mme
CUDA_VISIBLE_DEVICES=0 python -u guidance_mme.py

# Eval
python eval_mme.py

Acknowledgments

This repository builds upon several excellent open-source projects, and we thank their authors.

Citation

If you find this work useful, please consider citing:

@article{rawlekar2026finding,
  title={Finding Distributed Object-Centric Properties in Self-Supervised Transformers},
  author={Rawlekar, Samyak and Swain, Amitabh and Cai, Yujun and Wang, Yiwei and Yang, Ming-Hsuan and Ahuja, Narendra},
  journal={arXiv preprint arXiv:2603.26127},
  year={2026}
}
