
[Feat] Support multi-modal evaluation on MME benchmark. #197

Merged: 6 commits into open-compass:main on Aug 21, 2023

Conversation

@yyk-wew (Contributor) commented Aug 11, 2023

Thanks for your contribution; we appreciate it a lot. The following instructions will help keep your pull request healthy and make it easier to get feedback. If you do not understand some items, don't worry; just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, e.g. docstrings or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

opencompass/metrics/mme_score.py
metric['Perception'] = score

score = 0
for task in self.task_dict['Cognition']:
score += metric[task]['score']
metric['Cognition'] = score

metric['Overall'] = metric['Perception'] + metric['Cognition']
Collaborator:

Should these metrics be sum or average?

@yyk-wew (Contributor, Author) replied Aug 14, 2023:

From the MME paper:

In addition, we calculate the score of a subtask based on the sum of accuracy and accuracy+. The perception score is the sum of scores of all perception subtasks. The cognition score is calculated in the same way.

So it should be a sum.
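For illustration, a quick worked example with made-up numbers (not from the paper): a subtask with acc = 0.8 and acc_plus = 0.6 scores 100 * (0.8 + 0.6) = 140, and the category score is the plain sum over its subtasks, not the mean:

# Hypothetical subtask scores, for illustration only.
perception_scores = {'existence': 140.0, 'count': 90.0, 'OCR': 110.0}

# Category score is the sum of subtask scores, not their average.
perception = sum(perception_scores.values())  # 340.0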

Comment on lines +81 to +76
'acc': acc,
'acc_plus': acc_plus,
'score': 100 * (acc + acc_plus)
Collaborator:

Will single_acc / double_acc be better names than acc / acc_plus?

Contributor Author:

From the MME paper:

Since the output of the model is limited to two types (“yes” or “no”), it is convenient to measure the metrics of accuracy and accuracy+. The former is calculated based on each question, while the latter is based on each image where both of the two questions need to be answered correctly. The random accuracies of the two metrics are equal to 50% and 25%, respectively.

So keeping acc and acc_plus stays closer to the paper's terminology.
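As a minimal sketch of the difference (toy data, mirroring the logic later used in compute_metrics):

# Each MME image has two yes/no questions.  acc is per-question accuracy;
# acc_plus counts an image only if BOTH of its questions are answered correctly.
# Toy data: number of correctly answered questions per image (hypothetical).
correct_per_image = {'img_1.jpg': 2, 'img_2.jpg': 1}

n = len(correct_per_image)
acc = sum(correct_per_image.values()) / (2 * n)                  # 3 / 4 = 0.75
acc_plus = sum(v == 2 for v in correct_per_image.values()) / n   # 1 / 2 = 0.5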

data_dir='/path/to/MME',
pipeline=val_pipeline)

minigpt_4_dataloader = dict(batch_size=1,
Contributor:

Better to rename it to minigpt_4_mme_dataloader. The same applies to minigpt_4_model and minigpt_4_evaluator.

Contributor Author:

done.


Args:
data_dir (str): The path of the dataset.
pipeline (dict): The data augmentation.
Contributor:

Suggested change:
- pipeline (dict): The data augmentation.
+ pipeline (List[dict]): The data augmentation.

Contributor Author:

Modified in the latest commit.

self.pipeline = Compose(pipeline)
self.load_data(data_dir)

def load_data(self, data_dir):
Contributor:

Please also add type hints.

Contributor Author:

done.

"""Prompt constructor for MiniGPT-4 on MME.

Args:
image_prompt (str): Image prompt.
Contributor:

Suggested change:
- image_prompt (str): Image prompt.
+ image_prompt (str): Image prompt. Defaults to `''`.

Please also check other parts and make changes accordingly.

Contributor Author:

Done. Added default values to both the mmbench and mme prompt constructors.

@yyk-wew (Contributor, Author) commented Aug 18, 2023

Issue

We failed to reproduce the MiniGPT-4 results reported in the MME paper.

Implementation Details

Generate Function

Following the demo in the official MiniGPT-4 repo, we built our generate function as shown below:

https://github.com/InternLM/opencompass/blob/90c07a3dfd99f14bfbc5b43f59b96ce48fc4d0ec/opencompass/multimodal/models/minigpt_4/minigpt_4.py#L154-L198

Prompt Building

We have tried several different formats. Two of them are given here:

# prompt 1
sys_prompt + "###Human: " + question + img + " " + "###Assistant: "
# prompt 2
sys_prompt + "###Human: " + question + " " + "###Human: " + img + " " + "###Assistant: "

The sys_prompt follows the official config.

'Give the following image: ImageContent. You will be able to see the image once I provide it to you. Please answer my questions.'

The question and the image are loaded from the MME benchmark.
For more details, please check the MiniGPT4MMEPromptConstructor class.
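As a reference, here is a minimal sketch of how prompt 1 could be assembled; the actual implementation is MiniGPT4MMEPromptConstructor, and the image placeholder token below is an assumption:

SYS_PROMPT = ('Give the following image: ImageContent. You will be able to '
              'see the image once I provide it to you. Please answer my questions.')
IMG_PLACEHOLDER = '<ImageHere>'  # assumed token, later replaced by image embeddings


def build_prompt_1(question: str, img: str = IMG_PLACEHOLDER) -> str:
    """Assemble prompt 1: system prompt, question, image slot, assistant turn."""
    return SYS_PROMPT + '###Human: ' + question + img + ' ' + '###Assistant: '


print(build_prompt_1('Is there a dog in the image? Please answer yes or no.'))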

Generate Recipe

In the official repo, the default recipe for the generate function is as below:

outputs = self.llama_model.generate(
            inputs_embeds=prompt_embs,
            max_new_tokens=300,
            stopping_criteria=self.stopping_criteria,
            num_beams=1,
            do_sample=True,
            min_length=1,
            top_p=0.9,
            repetition_penalty=1.0,
            length_penalty=1,
            temperature=1.0,
        )

We also tried inference with beam search, see:
https://github.com/InternLM/opencompass/blob/90c07a3dfd99f14bfbc5b43f59b96ce48fc4d0ec/opencompass/multimodal/models/minigpt_4/minigpt_4.py#L179-L190
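For reference, a rough sketch of what our beam-search variant looks like; the exact argument values live in the linked minigpt_4.py, and the ones below (e.g. num_beams=5) are illustrative assumptions:

outputs = self.llama_model.generate(
    inputs_embeds=prompt_embs,
    max_new_tokens=300,
    stopping_criteria=self.stopping_criteria,
    num_beams=5,            # beam search instead of nucleus sampling (value assumed)
    do_sample=False,        # deterministic decoding
    min_length=1,
    repetition_penalty=1.0,
    length_penalty=1.0,
)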

In the following section, we name the official recipe as official and our recipe with beam search as ours.

Experiments

Generate Recipe | Prompt   | Perception | Cognition | Overall
official        | prompt 1 | 415        | 55        | 470
official        | prompt 2 | 190        | 58        | 249
ours            | prompt 1 | 696        | 108       | 804
-               | -        | 797        | 292       | 1089

Note: the landmark task is not included in any of our experiments.
The last row of the table is the result reported in the paper.

Metric Sanity Check

To validate the correctness of MMEMetric, we wrote a simple script that uses the compute_metrics logic of MMEMetric to evaluate LaVIN, whose outputs are provided in the official repo as a sample.
Our script is as below:

import os
from collections import defaultdict

samples = []
task_dict = {
    'Perception': [
        'existence', 'count', 'position', 'color', 'posters', 'celebrity',
        'scene', 'artwork', 'OCR', 'landmark'
    ],
    'Cognition': [
        'commonsense_reasoning', 'numerical_calculation',
        'text_translation', 'code_reasoning'
    ]
}

def read_result(fn, category):
    with open(fn, 'r') as f:
        line = f.readline()
        while line:
            img_path, question, answer, response = line.split('\t')

            prefix_pred_ans = response[:4].lower()

            if 'yes' in prefix_pred_ans:
                pred_answer = 'yes'
            elif 'no' in prefix_pred_ans:
                pred_answer = 'no'
            else:
                pred_answer = 'other'

            samples.append({
                'img_path': img_path,
                'pred': 1 if answer.lower() == pred_answer.lower() else 0,
                'task': category
            })
            line = f.readline()
    print(category, " done.")


def compute_metrics(results: list) -> dict:

    # reorganize results
    record = dict()
    for task in (task_dict['Perception'] +
                    task_dict['Cognition']):
        record[task] = defaultdict(int)
    for sample in results:
        record[sample['task']][sample['img_path']] += sample['pred']

    # compute subtask score
    metric = dict()
    for task in (task_dict['Perception'] +
                    task_dict['Cognition']):
        single_sum, double_sum = 0., 0.
        for v in record[task].values():
            assert 0 <= v <= 2
            if v == 2:
                single_sum += 2
                double_sum += 1
            elif v == 1:
                single_sum += 1
        acc = single_sum / 2 / len(record[task])
        acc_plus = double_sum / len(record[task])

        metric[task] = {
            'acc': acc,
            'acc_plus': acc_plus,
            'score': 100 * (acc + acc_plus)
        }

    # compute overall score
    score = 0
    for task in task_dict['Perception']:
        score += metric[task]['score']
    metric['Perception'] = score

    score = 0
    for task in task_dict['Cognition']:
        score += metric[task]['score']
    metric['Cognition'] = score

    metric['Overall'] = metric['Perception'] + metric['Cognition']

    return metric


if __name__ == "__main__":
    fn_list = os.listdir("./LaVIN")
    for fn in fn_list:
        read_result(os.path.join("./LaVIN", fn), fn[:-4])
    metric = compute_metrics(samples)
    print(metric)
    

The result is:

{'existence': {'acc': 0.95, 'acc_plus': 0.9, 'score': 185.0},
 'count': {'acc': 0.6166666666666667, 'acc_plus': 0.26666666666666666, 'score': 88.33333333333333},
 'position': {'acc': 0.5333333333333333, 'acc_plus': 0.1, 'score': 63.33333333333333},
 'color': {'acc': 0.5833333333333334, 'acc_plus': 0.16666666666666666, 'score': 75.0},
 'posters': {'acc': 0.5918367346938775, 'acc_plus': 0.20408163265306123, 'score': 79.59183673469387},
 'celebrity': {'acc': 0.37941176470588234, 'acc_plus': 0.09411764705882353, 'score': 47.35294117647059},
 'scene': {'acc': 0.7875, 'acc_plus': 0.58, 'score': 136.75},
 'artwork': {'acc': 0.5925, 'acc_plus': 0.28, 'score': 87.25},
 'OCR': {'acc': 0.675, 'acc_plus': 0.4, 'score': 107.50000000000001},
 'landmark': {'acc': 0.64, 'acc_plus': 0.295, 'score': 93.5},
 'commonsense_reasoning': {'acc': 0.5857142857142857, 'acc_plus': 0.2857142857142857, 'score': 87.14285714285714},
 'numerical_calculation': {'acc': 0.55, 'acc_plus': 0.1, 'score': 65.0},
 'text_translation': {'acc': 0.475, 'acc_plus': 0.0, 'score': 47.5},
 'code_reasoning': {'acc': 0.5, 'acc_plus': 0.0, 'score': 50.0},
 'Perception': 963.6114445778311,
 'Cognition': 249.64285714285714,
 'Overall': 1213.2543017206883}

This matches the result obtained with the official evaluation script.

@YuanLiuuuuuu merged commit a655222 into open-compass:main on Aug 21, 2023. 1 check passed.
go-with-me000 pushed a commit to go-with-me000/opencompass that referenced this pull request Oct 9, 2023
* [Feat] Support multi-modal evaluation on MME benchmark.

* [Fix] Remove debug code.

* [Fix] Remove redundant codes and add type hints.

* [Fix] Rename in config.

* [Fix] Rebase main.

* [Fix] Fix isort and yapf conflict.
liuyaox pushed a commit to liuyaox/opencompass that referenced this pull request Jun 26, 2024
…#197)

* [Feat] Support multi-modal evaluation on MME benchmark.

* [Fix] Remove debug code.

* [Fix] Remove redundant codes and add type hints.

* [Fix] Rename in config.

* [Fix] Rebase main.

* [Fix] Fix isort and yapf conflict.