
[Feat] Support multi-modal evaluation on MME benchmark. #197

Merged: 6 commits into open-compass:main on Aug 21, 2023

Conversation

@yyk-wew (Contributor) commented Aug 11, 2023

Thanks for your contribution; we appreciate it a lot. The following instructions will help keep your pull request healthy and make it easier to get feedback. If you do not understand some items, don't worry; just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that caused the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, e.g. docstrings or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

opencompass/metrics/mme_score.py
metric['Perception'] = score

score = 0
for task in self.task_dict['Cognition']:
score += metric[task]['score']
metric['Cognition'] = score

metric['Overall'] = metric['Perception'] + metric['Cognition']
Collaborator:

Should these metrics be sum or average?

@yyk-wew (Contributor, Author) replied Aug 14, 2023:

From the MME paper:

In addition, we calculate the score of a subtask based on the sum of accuracy and accuracy+. The perception score is the sum of scores of all perception subtasks. The cognition score is calculated in the same way.

So it should be a sum.
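For illustration, a quick worked example with made-up numbers (not from the paper): a subtask with acc = 0.8 and acc_plus = 0.6 scores 100 * (0.8 + 0.6) = 140, and the category score is the plain sum over its subtasks, not the mean:

# Hypothetical subtask scores, for illustration only.
perception_scores = {'existence': 140.0, 'count': 90.0, 'OCR': 110.0}

# Category score is the sum of subtask scores, not their average.
perception = sum(perception_scores.values())  # 340.0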

Comment on lines +81 to +76
'acc': acc,
'acc_plus': acc_plus,
'score': 100 * (acc + acc_plus)
Collaborator:

Will single_acc / double_acc be better names than acc / acc_plus?

Contributor Author:

From the MME paper:

Since the output of the model is limited to two types (“yes” or “no”), it is convenient to measure the metrics of accuracy and accuracy+. The former is calculated based on each question, while the latter is based on each image where both of the two questions need to be answered correctly. The random accuracies of the two metrics are equal to 50% and 25%, respectively.

So keeping acc and acc_plus stays closer to the paper's terminology.
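As a minimal sketch of the difference (toy data, mirroring the logic later used in compute_metrics):

# Each MME image has two yes/no questions.  acc is per-question accuracy;
# acc_plus counts an image only if BOTH of its questions are answered correctly.
# Toy data: number of correctly answered questions per image (hypothetical).
correct_per_image = {'img_1.jpg': 2, 'img_2.jpg': 1}

n = len(correct_per_image)
acc = sum(correct_per_image.values()) / (2 * n)                  # 3 / 4 = 0.75
acc_plus = sum(v == 2 for v in correct_per_image.values()) / n   # 1 / 2 = 0.5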

data_dir='/path/to/MME',
pipeline=val_pipeline)

minigpt_4_dataloader = dict(batch_size=1,
Contributor:

Better to rename it to minigpt_4_mme_dataloader. The same applies to minigpt_4_model and minigpt_4_evaluator.

Contributor Author:

done.


Args:
data_dir (str): The path of the dataset.
pipeline (dict): The data augmentation.
Contributor:

Suggested change:
- pipeline (dict): The data augmentation.
+ pipeline (List[dict]): The data augmentation.

Contributor Author:

Modified in the latest commit.

self.pipeline = Compose(pipeline)
self.load_data(data_dir)

def load_data(self, data_dir):
Contributor:

Please also add type hints.

Contributor Author:

done.

"""Prompt constructor for MiniGPT-4 on MME.

Args:
image_prompt (str): Image prompt.
Contributor:

Suggested change:
- image_prompt (str): Image prompt.
+ image_prompt (str): Image prompt. Defaults to `''`.

Please also check other parts and make changes accordingly.

Contributor Author:

Done. Added default values to both the mmbench and mme prompt constructors.

@yyk-wew (Contributor, Author) commented Aug 18, 2023

Issue

We failed to reproduce the MiniGPT-4 results reported in the MME paper.

Implementation Details

Generate Function

Following the demo in the official MiniGPT-4 repo, we built our generate function as shown below:

https://github.com/InternLM/opencompass/blob/90c07a3dfd99f14bfbc5b43f59b96ce48fc4d0ec/opencompass/multimodal/models/minigpt_4/minigpt_4.py#L154-L198

Prompt Building

We have tried several different formats. Two of them are given here:

# prompt 1
sys_prompt + "###Human: " + question + img + " " + "###Assistant: "
# prompt 2
sys_prompt + "###Human: " + question + " " + "###Human: " + img + " " + "###Assistant: "

The sys_prompt follows the official config.

'Give the following image: ImageContent. You will be able to see the image once I provide it to you. Please answer my questions.'

The question and the image are loaded from the MME benchmark.
For more details, please check the MiniGPT4MMEPromptConstructor class.
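As a reference, here is a minimal sketch of how prompt 1 could be assembled; the actual implementation is MiniGPT4MMEPromptConstructor, and the image placeholder token below is an assumption:

SYS_PROMPT = ('Give the following image: ImageContent. You will be able to '
              'see the image once I provide it to you. Please answer my questions.')
IMG_PLACEHOLDER = '<ImageHere>'  # assumed token, later replaced by image embeddings


def build_prompt_1(question: str, img: str = IMG_PLACEHOLDER) -> str:
    """Assemble prompt 1: system prompt, question, image slot, assistant turn."""
    return SYS_PROMPT + '###Human: ' + question + img + ' ' + '###Assistant: '


print(build_prompt_1('Is there a dog in the image? Please answer yes or no.'))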

Generate Recipe

In the official repo, the default recipe for the generate function is as below:

outputs = self.llama_model.generate(
            inputs_embeds=prompt_embs,
            max_new_tokens=300,
            stopping_criteria=self.stopping_criteria,
            num_beams=1,
            do_sample=True,
            min_length=1,
            top_p=0.9,
            repetition_penalty=1.0,
            length_penalty=1,
            temperature=1.0,
        )

We also tried inference with beam search, see:
https://github.com/InternLM/opencompass/blob/90c07a3dfd99f14bfbc5b43f59b96ce48fc4d0ec/opencompass/multimodal/models/minigpt_4/minigpt_4.py#L179-L190
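For reference, a rough sketch of what our beam-search variant looks like; the exact argument values live in the linked minigpt_4.py, and the ones below (e.g. num_beams=5) are illustrative assumptions:

outputs = self.llama_model.generate(
    inputs_embeds=prompt_embs,
    max_new_tokens=300,
    stopping_criteria=self.stopping_criteria,
    num_beams=5,            # beam search instead of nucleus sampling (value assumed)
    do_sample=False,        # deterministic decoding
    min_length=1,
    repetition_penalty=1.0,
    length_penalty=1.0,
)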

In the following section, we name the official recipe as official and our recipe with beam search as ours.

Experiments

Generate Recipe | Prompt   | Perception | Cognition | Overall
official        | prompt 1 | 415        | 55        | 470
official        | prompt 2 | 190        | 58        | 249
ours            | prompt 1 | 696        | 108       | 804
-               | -        | 797        | 292       | 1089

Note: the landmark task is not included in any of our experiments.
The last row of the table is the result reported in the paper.

Metric Sanity Check

To validate the correctness of MMEMetric, we wrote a simple script that uses the compute_metrics logic of MMEMetric to evaluate LaVIN, whose outputs are provided in the official repo as a sample.
Our script is as below:

import os
from collections import defaultdict

samples = []
task_dict = {
    'Perception': [
        'existence', 'count', 'position', 'color', 'posters', 'celebrity',
        'scene', 'artwork', 'OCR', 'landmark'
    ],
    'Cognition': [
        'commonsense_reasoning', 'numerical_calculation',
        'text_translation', 'code_reasoning'
    ]
}

def read_result(fn, category):
    with open(fn, 'r') as f:
        line = f.readline()
        while line:
            img_path, question, answer, response = line.split('\t')

            prefix_pred_ans = response[:4].lower()

            if 'yes' in prefix_pred_ans:
                pred_answer = 'yes'
            elif 'no' in prefix_pred_ans:
                pred_answer = 'no'
            else:
                pred_answer = 'other'

            samples.append({
                'img_path': img_path,
                'pred': 1 if answer.lower() == pred_answer.lower() else 0,
                'task': category
            })
            line = f.readline()
    print(category, " done.")


def compute_metrics(results: list) -> dict:

    # reorganize results
    record = dict()
    for task in (task_dict['Perception'] +
                    task_dict['Cognition']):
        record[task] = defaultdict(int)
    for sample in results:
        record[sample['task']][sample['img_path']] += sample['pred']

    # compute subtask score
    metric = dict()
    for task in (task_dict['Perception'] +
                    task_dict['Cognition']):
        single_sum, double_sum = 0., 0.
        for v in record[task].values():
            assert 0 <= v <= 2
            if v == 2:
                single_sum += 2
                double_sum += 1
            elif v == 1:
                single_sum += 1
        acc = single_sum / 2 / len(record[task])
        acc_plus = double_sum / len(record[task])

        metric[task] = {
            'acc': acc,
            'acc_plus': acc_plus,
            'score': 100 * (acc + acc_plus)
        }

    # compute overall score
    score = 0
    for task in task_dict['Perception']:
        score += metric[task]['score']
    metric['Perception'] = score

    score = 0
    for task in task_dict['Cognition']:
        score += metric[task]['score']
    metric['Cognition'] = score

    metric['Overall'] = metric['Perception'] + metric['Cognition']

    return metric


if __name__ == "__main__":
    fn_list = os.listdir("./LaVIN")
    for fn in fn_list:
        read_result(os.path.join("./LaVIN", fn), fn[:-4])
    metric = compute_metrics(samples)
    print(metric)
    

The result is:

{'existence': {'acc': 0.95, 'acc_plus': 0.9, 'score': 185.0},
 'count': {'acc': 0.6166666666666667, 'acc_plus': 0.26666666666666666, 'score': 88.33333333333333},
 'position': {'acc': 0.5333333333333333, 'acc_plus': 0.1, 'score': 63.33333333333333},
 'color': {'acc': 0.5833333333333334, 'acc_plus': 0.16666666666666666, 'score': 75.0},
 'posters': {'acc': 0.5918367346938775, 'acc_plus': 0.20408163265306123, 'score': 79.59183673469387},
 'celebrity': {'acc': 0.37941176470588234, 'acc_plus': 0.09411764705882353, 'score': 47.35294117647059},
 'scene': {'acc': 0.7875, 'acc_plus': 0.58, 'score': 136.75},
 'artwork': {'acc': 0.5925, 'acc_plus': 0.28, 'score': 87.25},
 'OCR': {'acc': 0.675, 'acc_plus': 0.4, 'score': 107.50000000000001},
 'landmark': {'acc': 0.64, 'acc_plus': 0.295, 'score': 93.5},
 'commonsense_reasoning': {'acc': 0.5857142857142857, 'acc_plus': 0.2857142857142857, 'score': 87.14285714285714},
 'numerical_calculation': {'acc': 0.55, 'acc_plus': 0.1, 'score': 65.0},
 'text_translation': {'acc': 0.475, 'acc_plus': 0.0, 'score': 47.5},
 'code_reasoning': {'acc': 0.5, 'acc_plus': 0.0, 'score': 50.0},
 'Perception': 963.6114445778311,
 'Cognition': 249.64285714285714,
 'Overall': 1213.2543017206883}

This matches the result obtained with the official evaluation script.

@YuanLiuuuuuu merged commit a655222 into open-compass:main on Aug 21, 2023. 1 check passed.
go-with-me000 pushed a commit to go-with-me000/opencompass that referenced this pull request Oct 9, 2023
* [Feat] Support multi-modal evaluation on MME benchmark.

* [Fix] Remove debug code.

* [Fix] Remove redundant codes and add type hints.

* [Fix] Rename in config.

* [Fix] Rebase main.

* [Fix] Fix isort and yapf conflict.
liuyaox pushed a commit to liuyaox/opencompass that referenced this pull request Jun 26, 2024
…#197)

* [Feat] Support multi-modal evaluation on MME benchmark.

* [Fix] Remove debug code.

* [Fix] Remove redundant codes and add type hints.

* [Fix] Rename in config.

* [Fix] Rebase main.

* [Fix] Fix isort and yapf conflict.