Validate MMLU #475
Comments
@ollmer This is really excellent work, thank you. Could you open a PR correcting the implementation? I would also be interested in running your implementation on the GPT-3 API to see whether we get the same results as the original MMLU paper (we might not, because the models have changed, but it would be very cool if we did).
I’ve added this to a project called “Task Validation.” Briefly, we have just started a systematic review of our evaluation benchmark implementations to verify that they agree with the papers that introduced them. If you have some additional time and enjoy chasing stuff like this down, there are a lot more we need to tackle :)
I've finally found the time to do it: #497
It is worth checking, but running around 6,000 samples (~1.2e6 tokens) through the davinci API would cost around $25 :/
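For reference, a back-of-the-envelope check of that figure, taking the token count from the comment above and assuming the then-listed davinci price of $0.02 per 1K tokens:

```python
# Rough cost estimate for scoring MMLU through the davinci API.
# The $0.02-per-1K-tokens price is an assumption (the listed davinci rate at the time).
tokens = 1.2e6          # approximate total tokens for ~6,000 samples
price_per_1k = 0.02     # USD per 1,000 tokens
print(f"${tokens / 1000 * price_per_1k:.2f}")  # -> $24.00, i.e. roughly $25
```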
@ollmer EleutherAI will run it and pick up the tab.
Thank you for providing such comprehensive descriptions and explanations. I have two questions regarding the MMLU validation. First, I have observed that the model input comprises both the context (question) and the continuation (answer) (lm_eval/base.py). I am interested to know how you ensured that the answer did not inadvertently influence the model's output. Second, I have noticed that the execution time with this PR is considerably longer than with OG-MMLU. I ran tests using the huggyllama/llama-7b model with five shots on an A6000 GPU: it took about 3 hours with OG-MMLU and 7 hours with this PR. Have you encountered this issue as well?
The evaluation protocol specified by the paper is formatted like a multiple-choice test question. This includes the fact that the answer choices are shown to the model as part of the question.
Original issue by @ollmer:

Hi! First of all, thank you for the great project, which helps so much in comparing the capabilities of different models.
I'm currently using the lm-evaluation-harness to evaluate a few models on the MMLU task (aka hendrycksTest).
I've tried to run a 5-shot evaluation of the huggyllama 7b model from the Hugging Face Hub and got the following results:
This is slightly different from the 35.1% reported in Table 9 of the LLaMA paper.
I then tried to perform the eval with the original implementation, using a version of the FLAN eval script slightly tweaked for the HF causal model. With this script I got exactly the same accuracy of 35.1% as reported in the paper.
The config used during the lm-evaluation-harness eval:
I went further to check the source of the discrepancy and found out that both the approach itself and the prompts fed to the model are quite different.
1. Difference in likelihood estimation
The OG MMLU implementation exploits the fact that the answer to each question is just one letter from [A, B, C, D], so the evaluation of a single sample is done in one forward pass. The resulting logits are filtered down to the 4 token indices corresponding to the answer letters, and argmax is then used to select the letter with the biggest softmax value out of the 4.
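A minimal sketch of this scheme, assuming a Hugging Face transformers model; `score_og_style` is a hypothetical helper, and the leading-space tokenization of the letters is an assumption (it holds for GPT-2/LLaMA-style tokenizers), not the OG code verbatim:

```python
import torch

def score_og_style(model, tokenizer, prompt: str) -> str:
    """One forward pass; argmax over the logits of the four letter tokens."""
    letters = ["A", "B", "C", "D"]
    # Assumes each letter (with a leading space) maps to a single token.
    letter_ids = [
        tokenizer(" " + l, add_special_tokens=False).input_ids[-1] for l in letters
    ]
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits   # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]     # distribution over the next token
    best = int(torch.argmax(next_token_logits[letter_ids]))
    return letters[best]
```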
The lm-evaluation-harness implementation uses a more generic approach. It builds 4 different samples, one per answer, and, more importantly, each sample uses the text of the option itself, not just its letter. The summed log-likelihood of each sample (of the continuation only, without the common context, to be precise) is then computed and compared. Since the compared continuations have different lengths, the acc_norm metric seems to be more representative here.
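To illustrate the final comparison step, a sketch under my reading of the harness's hendrycksTest task, where acc_norm divides each summed log-likelihood by the character length of its choice (the function name is illustrative):

```python
import numpy as np

def pick_answer(loglikelihoods: list[float], choices: list[str]) -> tuple[int, int]:
    """Return (acc prediction, acc_norm prediction) as choice indices."""
    lls = np.array(loglikelihoods)                 # sum log p(choice text | context)
    pred_acc = int(np.argmax(lls))                 # raw summed log-likelihood
    lengths = np.array([float(len(c)) for c in choices])
    pred_acc_norm = int(np.argmax(lls / lengths))  # per-character normalization
    return pred_acc, pred_acc_norm
```

Raw summed log-likelihoods tend to favor shorter answers (fewer tokens to pay for), and per-character normalization removes that bias, which is why acc_norm is the more comparable number when the choices differ in length.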
2. Prompt template
Here are the examples, 1-shot for the sake of simplicity:
Here is what the OG MMLU prompt looks like:
The lm-evaluation-harness prompts:
Sample choice 1
Sample choice 2
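The original prompt dumps did not survive extraction; the following is only a hedged reconstruction of their general shape (placeholders in braces; the OG template follows the hendrycks/test repo, and the harness side reflects the pre-fix task as described above, so treat both as assumptions). OG MMLU shows the choices and scores the letter:

```
The following are multiple choice questions (with answers) about {subject}.

{few-shot question}
A. {choice A}
B. {choice B}
C. {choice C}
D. {choice D}
Answer: {correct letter}

{test question}
A. {choice A}
B. {choice B}
C. {choice C}
D. {choice D}
Answer:
```

whereas the harness builds one prompt per choice and scores the full answer text, e.g. for choice 1:

```
Question: {test question}
Answer: {full text of choice 1}
```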
As you can see, there are a few differences:
The resulting metric difference for the model mentioned above doesn't look that big, but when I repeated the same evaluations for GPT-NeoX-20B, I got a somewhat bigger gap: 32.9% with the lm-evaluation-harness vs. 26.1% with the OG MMLU eval. So I think this distinction is worth considering.