OFA-Sys/TouchStone


TouchStone: Evaluating Vision-Language Models by Language Models

Paper

TouchStone is a comprehensive assessment of multimodal language models that covers not only basic recognition and comprehension but also literary creation. By using strong LLMs as judges and converting multimodal information into text, TouchStone assesses dialogue quality efficiently and accurately, drawing on advanced language models without the need for manual intervention.

Dataset

TouchStone is a diverse and comprehensive dataset that covers five key dimensions: Basic Descriptive Ability, Visual Recognition Ability, Visual Comprehension Ability, Visual Storytelling Ability, and Multi-image Analysis Ability. You can download the dataset here.

Our dataset currently places more emphasis on assessing basic abilities. Recognition questions make up the largest share at about 44.1%, followed by comprehension questions at 29.6%. The remaining categories account for 15.3% (basic descriptive ability), 7.4% (visual storytelling ability), and 3.6% (multi-image analysis ability). There are 908 dialogues in total.
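The shares above can be recomputed from the dataset's category column. The following is a minimal sketch, not the authors' script: the helper name category_shares is hypothetical, and the toy labels are made up for illustration (the actual label strings in the TSV may differ).

```python
import pandas as pd


def category_shares(df: pd.DataFrame, column: str = "category") -> pd.Series:
    """Return the percentage of rows per category, largest share first."""
    # value_counts(normalize=True) yields fractions sorted in descending order.
    shares = df[column].value_counts(normalize=True) * 100
    return shares.round(1)


# Synthetic frame for illustration only; it does not reflect the real data.
toy = pd.DataFrame({"category": ["recognition"] * 3 + ["comprehension"] * 2})
toy_shares = category_shares(toy)
print(toy_shares)
```

On the real file, the same function could be applied to the DataFrame loaded from touchstone_20230831.tsv.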

Methods

TouchStone uses fine-grained annotation and strong LLMs to evaluate large vision-language models (LVLMs). First, fine-grained descriptions of the images are obtained through manual annotation and inspection. These descriptions, along with the questions, are fed into GPT-4 (text-only) to generate reference answers. The LVLMs under evaluation, in contrast, take the visual signals and questions directly as input to generate their answers. GPT-4 then scores the generated answers, given the reference answers, the questions, and the fine-grained descriptions. The final scores are averaged and used to rank the models by their overall performance.
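The last step, averaging the judge scores and ranking the models, can be sketched as follows. This is an illustration under stated assumptions: rank_models and the toy scores are hypothetical, the per-answer scores are assumed to be already available as numbers, and the sketch does not reproduce the scale of the scores reported in the results, which this README does not describe.

```python
from statistics import mean


def rank_models(scores_by_model: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Average each model's per-answer scores and sort from best to worst."""
    averaged = {model: mean(scores) for model, scores in scores_by_model.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Made-up judge scores for two hypothetical models.
toy_scores = {"model_a": [7.0, 9.0], "model_b": [6.0, 6.0]}
ranking = rank_models(toy_scores)
for rank, (model, score) in enumerate(ranking, start=1):
    print(rank, model, score)
```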

New Results

Rank  Model            Score
🏅️    GPT-4V           803.5
🥈    CogVLM           742.0
🥉    Qwen-VL          711.6
4     Emu2             703.8
5     mPLUG-Owl        605.4
6     LLaVA            602.7
7     LLaMA-AdapterV2  590.1
8     InstructBLIP     552.4
9     MiniGPT4         531.7
10    PandaGPT         488.5

Evaluation Results

Run Evaluation

Read image
import base64
import io

import pandas as pd
from PIL import Image


def decode_base64_to_image(base64_string):
    """Decode a base64-encoded string into a PIL image."""
    image_data = base64.b64decode(base64_string)
    return Image.open(io.BytesIO(image_data))


# The dataset is a tab-separated file with one dialogue per row.
df = pd.read_csv("touchstone_20230831.tsv", sep="\t")

# Read every field of a single sample (the first row here).
index = 0
row = df.iloc[index]
image = decode_base64_to_image(row["image"])
question = row["question"]
human_annotation = row["human_annotation"]
gpt4_ha_answer = row["gpt4_ha_answer"]
category = row["category"]
task_name = row["task_name"]
Format requirements
  • The submitted file should be in CSV format with the delimiter set as '\t'.
  • The submitted file must contain the following fields: index, question, human_annotation, gpt4_ha_answer, category, task_name, and response. The "response" field represents the model's answer, while the other fields should match the evaluation dataset file.
  • The number of rows in the submitted file (excluding the header) should match the evaluation dataset, which has 908 rows.
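Before running the evaluation, it may help to check a submission against the requirements above. The sketch below is not part of the repository: check_submission is a hypothetical helper, and it only verifies the field names and the row count, not whether the copied fields match the dataset.

```python
import pandas as pd

# Fields listed in the format requirements.
REQUIRED_FIELDS = [
    "index", "question", "human_annotation", "gpt4_ha_answer",
    "category", "task_name", "response",
]
EXPECTED_ROWS = 908  # number of dialogues in the evaluation dataset


def check_submission(df: pd.DataFrame, expected_rows: int = EXPECTED_ROWS) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    missing = [field for field in REQUIRED_FIELDS if field not in df.columns]
    if missing:
        problems.append(f"missing fields: {', '.join(missing)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems


# Synthetic one-row frames for illustration only.
good = pd.DataFrame({field: ["x"] for field in REQUIRED_FIELDS})
bad = good.drop(columns=["response"])
print(check_submission(good, expected_rows=1))
print(check_submission(bad, expected_rows=2))
```

For a real submission, the frame would be loaded with pd.read_csv(path, sep="\t"), where path is whatever file name was chosen for the submission.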

The evaluation script is provided in eval.py.

python eval.py submit_file openai_key --model-name your_model 

Citation

@misc{bai2023touchstone,
      title={TouchStone: Evaluating Vision-Language Models by Language Models}, 
      author={Shuai Bai and Shusheng Yang and Jinze Bai and Peng Wang and Xingxuan Zhang and Junyang Lin and Xinggang Wang and Chang Zhou and Jingren Zhou},
      year={2023},
      eprint={2308.16890},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
