ICD: Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
Official implementation of Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
conda create -n icd -y python=3.10
conda activate icd
# install dependency
pip install -r requirements.txt
Note: The dependencies follow those of VCD, LLaVA, MiniGPT-4, and InstructBLIP. You can also easily set up the environment by following the instructions in these repos.
Download the images and annotations of the following datasets for inference and evaluation.
Some annotations can be found in experiments/data.
After the inference, the generated results folder has the following structure:
results
├── mme
└── pope
    └── ib
        ├── baseline
        └── icd
            ├── ow_format
            └── yn_format
                ├── prompt1
                └── prompt2
                    ├── normal.json
                    └── question.json

_format represents which question format is used for LLM generation. You can specify yn_format, ow_format, or no_format by adding the --format argument after the running command.
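For example, appending the flag to one of the inference commands listed below (the exact value syntax is an assumption; check the script's argparse setup):

CUDA_VISIBLE_DEVICES=0 python icd_ib_mme.py --data_path /path/to/MME_folder --save_folder ./mme/ib --format yn_format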
Inside the prompt folders there are two files: normal.json and question.json. In question.json, the question is integrated with the instructional disturbance in the Q-Former; in normal.json, only the instructional disturbance is used in the Q-Former.
To effectively demonstrate how ICD works, here is an example using InstructBLIP with ICD, highlighting some key contributions.
- Replace the sampling function by:

  from icd_utils.icd_sample import evolve_icd_sampling
  evolve_icd_sampling()

  The icd_sample function replaces the original sampling function in the transformers library. It mainly incorporates the contrastive decoding method, where contrastive decoding is performed between the logits of the default model and the logits of the model with instructional disturbance, while keeping the rest unchanged (see the sketch after this list).
- Modify experiments/lavis/models/blip2_models/blip2_vicuna_instruct.py:

  # Tokenize the instructional disturbance the same way as the normal instruction.
  if preprompt_cd is not None:
      text_Qformer_cd = self.tokenizer(
          preprompt_cd,
          padding='longest',
          truncation=True,
          max_length=self.max_txt_len,
          return_tensors="pt",
      ).to(image.device)
  # ....
  # Run the Q-Former a second time, conditioned on the instructional disturbance.
  if use_cd:
      query_output_cd = self.Qformer.bert(
          text_Qformer_cd.input_ids,
          attention_mask=Qformer_atts_cd,
          query_embeds=query_tokens,
          encoder_hidden_states=image_embeds,
          encoder_attention_mask=image_atts,
          return_dict=True,
      )

  The Q-Former in InstructBLIP can take a textual instruction and extract instruction-aware visual features from the frozen image encoder (InstructBLIP). In ICD, the Q-Former is instead given an instructional disturbance that leads the model to extract vague and incomplete visual features.

  This modification processes the given instructional disturbance (preprompt_cd) in the same way as the default instruction; the resulting outputs are then sent, together with the default outputs, to the sampling function.
- Generate the results:

  model.generate(
      {"image": image, "prompt": question},
      preprompt_cd=preprompt,
      use_nucleus_sampling=True,
      num_beams=1,
      top_p=1,
      repetition_penalty=1,
      cd_alpha=cd_alpha,
      cd_beta=cd_beta,
      use_cd=True,
  )[0]
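To make the decoding step concrete, here is a minimal sketch of how the two logit streams can be contrasted at each generation step. It assumes the VCD-style formulation (1 + cd_alpha) * logits - cd_alpha * logits_cd with a cd_beta-controlled adaptive plausibility cutoff; the function and variable names are illustrative, not the exact ones in icd_utils/icd_sample.py.

import torch
import torch.nn.functional as F

def icd_step(logits, logits_cd, cd_alpha, cd_beta):
    """One contrastive decoding step (illustrative sketch, not the repo code).

    logits:    next-token logits from the default model, shape (batch, vocab)
    logits_cd: next-token logits from the model conditioned on the
               instructional disturbance, same shape
    """
    # Adaptive plausibility constraint: keep only tokens whose probability
    # under the default model is at least cd_beta times the top probability.
    probs = F.softmax(logits, dim=-1)
    cutoff = cd_beta * probs.max(dim=-1, keepdim=True).values
    plausible = probs >= cutoff

    # Contrast the two streams: amplify what the default model prefers
    # relative to the disturbed model.
    contrast = (1 + cd_alpha) * logits - cd_alpha * logits_cd
    contrast = contrast.masked_fill(~plausible, float("-inf"))

    # Sample the next token from the contrasted distribution.
    return torch.multinomial(F.softmax(contrast, dim=-1), num_samples=1)

In the actual pipeline, logits_cd comes from the forward pass that uses the second Q-Former output (query_output_cd) shown above, while the rest of the sampling loop stays as in the transformers library.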
You can find all the inference and evaluation code in experiments/gen_scripts and experiments/eval_scripts, respectively. Here is an example of using InstructBLIP for inference and evaluation on various tasks.
Inference
# POPE
CUDA_VISIBLE_DEVICES=0 python icd_ib_pope.py --gvqa_image_root /path/to/gvqa_image_folder --coco_image_root /path/to/coco_image_folder --question_folder ../data/pope --save_folder ./pope/ib
# MME
CUDA_VISIBLE_DEVICES=0 python icd_ib_mme.py --data_path /path/to/MME_folder --save_folder ./mme/ib
# if using VCD + ICD
CUDA_VISIBLE_DEVICES=0 python icd_ib_mme.py --data_path /path/to/MME_folder --save_folder ./mme/ib --vcd
# llava-bench
CUDA_VISIBLE_DEVICES=0 python icd_llava_bench_ib.py --question_file /path/to/question_file --image_root /path/to/images --save_folder ./llava_bench/ib
# Co-occurrence
CUDA_VISIBLE_DEVICES=0 python icd_ib_co.py --gt_objects ../data/co_occur/gt_objects.json --image_root /path/to/coco_val2014 --save_folder ./co_occur/ib
# OK-VQA
CUDA_VISIBLE_DEVICES=0 python icd_ib_ok_vqa.py --question_file ../data/ok_vqa/OpenEnded_mscoco_val2014_questions.json --image_root /path/to/images --save_folder ./ok_vqa/ib
# Text-VQA
CUDA_VISIBLE_DEVICES=0 python icd_ib_text_vqa.py --question_file ../data/text_vqa/TextVQA_0.5.1_val.json --image_root /path/to/images --save_folder ./text_vqa/ib
# CHAIR (Please run experiments/data/chair/prepare_data.py first.)
CUDA_VISIBLE_DEVICES=0 python icd_ib_text_vqa.py --question_file ../data/chair/chair-val.jsonl --image_root /path/to/images --save_folder ./chair/ib
Evaluation
# POPE
python eval_pope.py --label_folder ../gen_scripts/data/pope --ans_folder ../gen_scripts/pope_results/ib/icd
# MME
python eval_mme.py --results_dir ../gen_scripts/mme/ib
# Co-occurrence
python icd_ib_co.py --gt_objects ../data/co_occur/gt_objects.json --ans_folder ../gen_scripts/co_occur/ib/icd
# OK-VQA
python evaluate-ok_vqa.py --label_file ../data/ok_vqa/mscoco_val2014_annotations_enhanced.json --ans_folder ../gen_scripts/ok_vqa/ib/icd
# Text-VQA
python evaluate-text_vqa.py --label_file ../data/text_vqa/TextVQA_0.5.1_val.json --ans_folder ../gen_scripts/text_vqa/ib/icd
# CHAIR
python chair.py --cap_file ../path/to/file --coco_path /path/to/coco_val2014_annotations
Note:
- The CHAIR question file and evaluation file follow yuezih/less-is-more; please check that repo for more details.
- The arguments differ from task to task depending on the task requirements, and some arguments should be set to your own paths. Please check the details in experiments/gen_scripts and experiments/eval_scripts.
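For orientation, POPE is scored as a binary yes/no classification task. The sketch below shows the standard accuracy/precision/recall/F1 (plus yes-ratio) computation over parallel lists of gold and predicted answers; the list format and function name are assumptions, and eval_pope.py may parse its input files differently.

def pope_metrics(gold, pred):
    """Standard POPE metrics over parallel lists of 'yes'/'no' answers (sketch)."""
    tp = sum(g == "yes" and p == "yes" for g, p in zip(gold, pred))
    tn = sum(g == "no" and p == "no" for g, p in zip(gold, pred))
    fp = sum(g == "no" and p == "yes" for g, p in zip(gold, pred))
    fn = sum(g == "yes" and p == "no" for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(gold)
    yes_ratio = sum(p == "yes" for p in pred) / len(pred)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "yes_ratio": yes_ratio}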
- The efficacy of ICD on POPE

  Table 1 (Part 1): The default under Methods denotes standard decoding, whereas VCD represents Visual Contrastive Decoding [CVPR 2024], and ICD is our Instruction Contrastive Decoding. The best performances within each setting are bolded.
- The efficacy of ICD on MME

  Figure 3: Performance on the full MME benchmark with InstructBLIP.
- Please refer to our paper for more detailed experimental results.
If you find our project useful, we hope you can star our repo and kindly cite:
@inproceedings{wang-etal-2024-mitigating,
title = "Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding",
author = "Wang, Xintong and Pan, Jingheng and Ding, Liang and Biemann, Chris",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
year = "2024",
url = "https://aclanthology.org/2024.findings-acl.937",
pages = "15840--15853",
}
This project benefits from the following works:
- VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding [CVPR2024]
- LLaVA: Visual Instruction Tuning [NeurIPS 2023]
- MiniGPT-4: MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- InstructBLIP: InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [NeurIPS 2023]
- Less is More: Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective [ACL 2024]
Thanks for their awesome work.


