Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning
This repository is the official implementation of the paper "Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning".
To install the requirements:

```bash
pip install -r requirements.txt
```
The datasets are located in `MVP/benchmark`. Before inference, you need to download the images into the `MVP/data` folder.
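For reference, the paths used throughout this README assume a layout roughly like the following (the exact image files depend on the benchmark; POPE's `coco` split, for example, is built on MSCOCO images):

```
MVP/
├── benchmark/
│   └── POPE/
│       └── coco/
│           └── coco_pope_popular.json
├── data/
│   └── coco/        # downloaded images for the coco split
└── output/          # caption and answer files are written here
```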
In the MVP framework, we first need to caption the images. You can use the following command from `caption.sh`:
```bash
python caption/llava_caption.py \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-folder MVP/data/coco \
    --question-file MVP/benchmark/POPE/coco/coco_pope_popular.json \
    --answers-file MVP/output/coco_pope_popular_caption_llava_bottom-up.jsonl \
    --perspective bottom-up \
    --temperature 0.7 \
    --top_p 0.95 \
    --max_new_tokens 512 \
    --num_beams 1 --seed 336
```
This creates a caption file under the `MVP/output` folder. Note that you need to run the command once for each of the three values of the `--perspective` parameter (`bottom-up`, `top-down`, and `regular`); a simple loop is sketched below.
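If you prefer to launch all three perspectives in one go, a minimal loop (reusing the exact flags above; only `--perspective` and the answers filename change, following the naming pattern of the bottom-up example) looks like:

```bash
# Caption the images once per perspective; each run writes its own .jsonl file.
for perspective in bottom-up top-down regular; do
    python caption/llava_caption.py \
        --model-path liuhaotian/llava-v1.5-7b \
        --image-folder MVP/data/coco \
        --question-file MVP/benchmark/POPE/coco/coco_pope_popular.json \
        --answers-file "MVP/output/coco_pope_popular_caption_llava_${perspective}.jsonl" \
        --perspective "$perspective" \
        --temperature 0.7 \
        --top_p 0.95 \
        --max_new_tokens 512 \
        --num_beams 1 --seed 336
done
```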
We have already prepared the caption files in the `output` folder, so you can use them directly.
To run MVP, you can use the following script from `MVP_llava.sh`:
```bash
#!/bin/bash
declare -a files=("MVP_llava.py")
declare -a perspectives=("bottom-up" "top-down" "regular")
declare -a question_files=("coco")
declare -a question_types=("popular")

for file in "${files[@]}"; do
  for perspective in "${perspectives[@]}"; do
    for dataset in "${question_files[@]}"; do
      for type in "${question_types[@]}"; do
        question_file="MVP/benchmark/POPE/${dataset}/${dataset}_pope_${type}.json"
        output_file="MVP/output/$(basename "$file" .py)_${perspective}_${dataset}_${type}_pope.jsonl"
        log_file="MVP/logs/$(basename "$file" .py)_${perspective}_${dataset}_${type}_pope.log"
        # Replace <your_partition> with the Slurm partition you have access to.
        nohup srun -p <your_partition> -n1 -N1 --gres=gpu:1 --quotatype=reserved python "MVP/$file" \
          --model-path liuhaotian/llava-v1.5-7b \
          --image-folder "MVP/data/${dataset}" \
          --question-file "$question_file" \
          --perspective "$perspective" \
          --answers-file "$output_file" \
          --temperature 0.7 \
          --top_p 1.0 --topk 3 \
          --max_new_tokens 50 \
          --num_beams 1 --seed 336 \
          1>"$log_file" 2>&1 &
        sleep 3
      done
    done
  done
done
```
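Make sure the `MVP/logs` folder exists (the script redirects each run's output there), then launch all runs with `bash MVP_llava.sh`.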
After that, you can find the result files in the `MVP/output` folder. Key arguments:

- `--perspective`: the caption perspective to use.
- `--topk`: the number of top-k reasoning paths to employ.
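The exact scoring logic lives in the repository's inference script; purely as a rough, hypothetical illustration of what aggregating top-k reasoning paths can look like, consider certainty-weighted voting over sampled paths:

```python
from collections import defaultdict

def aggregate_paths(paths):
    """Fuse top-k reasoning paths into a distribution over answers.

    `paths` is a list of (answer, certainty) pairs, e.g. the k sampled
    reasoning paths for a single POPE question. This format is a
    hypothetical placeholder; the actual MVP code may score paths differently.
    """
    scores = defaultdict(float)
    for answer, certainty in paths:
        scores[answer] += certainty
    # Normalize so the candidate answers' scores sum to 1.
    total = sum(scores.values()) or 1.0
    return {answer: s / total for answer, s in scores.items()}

# Example: three sampled paths for one yes/no question.
print(aggregate_paths([("yes", 0.9), ("yes", 0.6), ("no", 0.4)]))
# -> {'yes': 0.789..., 'no': 0.210...}
```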
To evaluate the performance of MVP, you can use the following command from `eval_pope.sh`:
```bash
python eval/eval_pope.py \
    --gt_files MVP/benchmark/POPE/coco/coco_pope_popular.json \
    --gen_files_bottom_up MVP/output/MVP_llava_bottom-up_coco_popular_pope.jsonl \
    --gen_files_top_down MVP/output/MVP_llava_top-down_coco_popular_pope.jsonl \
    --gen_files_regular MVP/output/MVP_llava_regular_coco_popular_pope.jsonl \
    --a 0.4 --b 0.4 --c 0.2
```
- `--a`: the weight of the bottom-up path.
- `--b`: the weight of the top-down path.
- `--c`: the weight of the regular path.
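For intuition, here is a minimal sketch of how such a weighted fusion could combine the three perspective outputs. The `.jsonl` field names `question_id` and `yes_prob` are hypothetical placeholders, not necessarily the schema `eval_pope.py` expects:

```python
import json

def load_scores(path):
    """Map question id -> the model's yes-probability for one perspective file."""
    scores = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            scores[record["question_id"]] = record["yes_prob"]
    return scores

def fuse(bottom_up, top_down, regular, a=0.4, b=0.4, c=0.2):
    """Weighted vote over the three perspectives; answer 'yes' if the
    combined score reaches 0.5, 'no' otherwise."""
    fused = {}
    for qid in bottom_up:
        score = a * bottom_up[qid] + b * top_down[qid] + c * regular[qid]
        fused[qid] = "yes" if score >= 0.5 else "no"
    return fused
```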
MVP's performance on POPE:
MVP's performance on MME:
If you are interested in or inspired by this work, please cite us:
```bibtex
@misc{qu2024lookcomparedecidealleviating,
      title={Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning},
      author={Xiaoye Qu and Jiashuo Sun and Wei Wei and Yu Cheng},
      year={2024},
      eprint={2408.17150},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.17150},
}
```