From 4b986c77c54bf3d77a65baa9c90cb4405feb2698 Mon Sep 17 00:00:00 2001
From: "Teo (Timothy) Wu Haoning" <38696372+teowu@users.noreply.github.com>
Date: Wed, 17 Jan 2024 13:19:54 +0800
Subject: [PATCH] Update README.md

---
 README.md | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 9570463..7ba19d1 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,9 @@ _How do multi-modality LLMs perform on low-level computer vision?_
 *Equal contribution. #Corresponding author.
-
+
+We have been accepted as an ICLR 2024 Spotlight. See you in Vienna!
+
 Paper | Project Page | Github |
 
@@ -69,27 +71,27 @@ The proposed Q-Bench includes three realms for low-level vision: perception (A1)
 
 - We are open to **submission-based evaluation** for the two tasks. The details for submission are as follows.
 - For assessment (A3), as we use **public datasets**, we provide an abstract evaluation code for arbitrary MLLMs for anyone to test.
-
-
-## GPT-4V!
+## Closed-source MLLMs (GPT-4V-Turbo, Gemini, GPT-4V)
-Our latest experiment suggests that [GPT-4V](https://chat.openai.com) is primarily entry ***human-level*** on general low-level perception, marking a new era for low-level visual perception and understanding!
+We evaluate two closed-source API models, GPT-4V-Turbo (`gpt-4-vision-preview`, replacing the no-longer-available *old version* GPT-4V results) and Gemini Pro (`gemini-pro-vision`).
+Slightly improved over the older version, GPT-4V-Turbo still ranks first among all MLLMs and comes close to a junior-level human's performance. Gemini Pro follows closely behind.
 
-Here is the comparison of [GPT-4V](https://chat.openai.com) and non-expert human on `test` set of Task A1 (Perception).
 
 |**Participant Name** | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
-| - | - | - | - | - | - | -| - | -|
-| GPT-4V | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 (+0.1142 to best open-source) |
-| human-1 | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 (+0.0095 to GPT-4V) |
-| human-2-senior | **0.8431** | **0.8894** | **0.7202** | **0.7965** | **0.7947** | **0.8390** | **0.8707** | **0.8174** (+0.0838 to GPT-4V) |
+| - | - | - | - | - | - | - | - | - |
+| Gemini-Pro (`gemini-pro-vision`) | 0.7221 | 0.7300 | 0.6645 | 0.6530 | 0.7291 | 0.7082 | 0.7665 | 0.7058 |
+| GPT-4V-Turbo (`gpt-4-vision-preview`) | 0.7722 | 0.7839 | 0.6645 | 0.7101 | 0.7107 | 0.7936 | 0.7891 | 0.7410 |
+| GPT-4V (*old version*) | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 |
+| human-1-junior | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 |
+| human-2-senior | **0.8431** | **0.8894** | **0.7202** | **0.7965** | **0.7947** | **0.8390** | **0.8707** | **0.8174** |
+
+We have also evaluated several new open-source models recently and will release their results soon.
 
-Human-1 is an ordinary person with no training while human-2-senior is a trained ordinary person but still not expert. GPT-4V is witnessed to be on par with human-1, but still room to go to surpass human-2-expert.
-We sincerely hope that one day **open-source models** can also get that level (or even better) and we believe that it is coming soon. Try to challenge and beat it!
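+
+As a reference for how these closed-source API models can be queried, below is a minimal, hypothetical sketch of sending a single A1-style (perception) question to `gpt-4-vision-preview`. It assumes the `openai` Python SDK (v1.x) with an `OPENAI_API_KEY` set; the image path and prompt are illustrative, and this is not the official Q-Bench evaluation code.
+
+```python
+# Hypothetical reproduction sketch: not the official Q-Bench evaluation code.
+# Assumes the `openai` Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
+import base64
+from openai import OpenAI
+
+client = OpenAI()
+
+def ask_vision_model(image_path: str, question: str, model: str = "gpt-4-vision-preview") -> str:
+    """Send one image plus one A1-style question to a closed-source vision API."""
+    # Encode the local image as a base64 data URL, as the chat vision API expects.
+    with open(image_path, "rb") as f:
+        image_b64 = base64.b64encode(f.read()).decode("utf-8")
+    response = client.chat.completions.create(
+        model=model,
+        messages=[{
+            "role": "user",
+            "content": [
+                {"type": "text", "text": question},
+                {"type": "image_url",
+                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
+            ],
+        }],
+        max_tokens=16,
+    )
+    return response.choices[0].message.content
+
+# Illustrative yes-or-no query (image path is a placeholder):
+print(ask_vision_model("example.jpg", "Is there significant blur in this image? Answer yes or no."))
+```
 
 ## Submission Guideline for A1/A2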