The official repository of the paper "AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention".
Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Guang Dai, Ping Chen, Shijian Lu.
Large Vision-Language Models (LVLMs) often suffer from object hallucinations, where the generated textual responses do not align with the actual objects in the image. This paper identifies attention deficiency towards discriminative local image features as a key cause of this issue. We introduce the Assembly of Global and Local Attention (AGLA), a training-free, plug-and-play method designed to reduce object hallucinations by combining global features for response generation and local features for visual discrimination. Our extensive experiments demonstrate that AGLA consistently mitigates object hallucinations and enhances the overall perception capabilities of LVLMs across various discriminative and generative benchmarks.
We conducted experiments on four public datasets: POPE, CHAIR, MME, and LLaVA-Bench-Wild.
The question data for POPE and CHAIR are included in this repository; the COCO_val2014 image files need to be downloaded separately. The MME data can be requested through the provided link, and the LLaVA-Bench-Wild dataset can be downloaded from Huggingface via the provided link.
We experimented with two LVLMs: LLaVA-1.5 and InstructBLIP. An overview of our method is shown below.
The environment is based on Python 3.9. Detailed dependencies are listed in requirements.txt.
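A minimal setup might look like the following (the environment name agla is an assumption):

conda create -n agla python=3.9
conda activate agla
pip install -r requirements.txt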
To run experiments on POPE with LLaVA 1.5 or InstructBLIP, use the following commands in the eval folder:
sh llava1.5_pope.bash
sh instructblip_pope.bash
To evaluate model performance on POPE, use eval_pope.py.
For the other datasets, modify the file paths and prompts in run_llava.py and run_instructblip.py to generate results, then evaluate model performance following the guidance in the datasets' original repositories.
The logit adjustment framework (i.e., sample.py) is based on VCD.
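For readers unfamiliar with this adjustment, below is a minimal PyTorch sketch of the general idea (not the repository code): next-token logits are computed once with the original image (global attention) and once with the augmented, prompt-relevant image (local attention), and the two are combined before sampling. The function name, the mixing weight alpha, the plausibility cutoff beta, and the exact weighting are illustrative assumptions; see sample.py and the paper for the precise formulation AGLA uses.

# Conceptual sketch of the logit adjustment; the weighting below is an
# assumption for illustration, not the exact formula used in sample.py.
import torch
import torch.nn.functional as F

def assemble_logits(logits_global: torch.Tensor,
                    logits_local: torch.Tensor,
                    alpha: float = 1.0,
                    beta: float = 0.1) -> torch.Tensor:
    """Combine next-token logits from the original image view (global
    attention) and the augmented, prompt-relevant view (local attention)."""
    # Weighted assembly of the two views (alpha is a placeholder weight).
    combined = logits_global + alpha * logits_local

    # Adaptive plausibility constraint in the style of VCD: keep only tokens
    # whose probability under the original image is at least beta times the
    # most likely token's probability.
    probs_global = F.softmax(logits_global, dim=-1)
    cutoff = beta * probs_global.max(dim=-1, keepdim=True).values
    return combined.masked_fill(probs_global < cutoff, float("-inf"))

# Example: pick the next token greedily from the adjusted distribution.
# next_token = torch.argmax(assemble_logits(logits_global, logits_local), dim=-1)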
If our paper or code is helpful to you, please consider citing our work:
@article{an2024agla,
title={AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention},
author={An, Wenbin and Tian, Feng and Leng, Sicong and Nie, Jiahao and Lin, Haonan and Wang, QianYing and Dai, Guang and Chen, Ping and Lu, Shijian},
journal={arXiv preprint arXiv:2406.12718},
year={2024}
}