
Validate MMMLU #475

Closed
ollmer opened this issue May 5, 2023 · 8 comments · Fixed by #497
Comments

@ollmer

ollmer commented May 5, 2023

Hi! First of all, thank you for the great project; it helps so much in comparing the capabilities of different models.
I'm currently using the lm-evaluation-harness to evaluate a few models on the MMLU task (a.k.a. hendrycksTest).
I've tried to run a 5-shot evaluation of the huggyllama/llama-7b model from the Hugging Face hub and got the following results:

MMLU Average accuracy: 38.3%
MMLU Average norm accuracy: 34.1%

This is slightly different from the 35.1% reported in Table 9 of the LLaMA paper.
I then tried to run the evaluation with the original implementation, using a version of the flan eval script slightly tweaked for the HF causal model. With this script I got exactly the same 35.1% accuracy as reported in the paper.
The config used during the lm-evaluation-harness eval:

{
    "model": "hf-causal",
    "model_args": "pretrained=huggyllama/llama-7b",
    "num_fewshot": 5,
    "batch_size": null,
    "device": "0",
    "no_cache": true,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
}

I went further to check the source of the discrepancy and found out that both the approach itself and the prompts fed to the model are quite different.

1. Difference in likelihood estimation

The OG MMLU implementation exploits the fact that the answer to each question is just a single letter from [A, B, C, D], so the evaluation of a single sample is done in one forward pass. The resulting logits are filtered down to the 4 token indexes corresponding to the answer letters, and argmax is then used to select the letter with the biggest softmax value out of the 4.
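To make the difference concrete, the OG scoring can be sketched roughly like this (a minimal illustration assuming a Hugging Face causal LM and tokenizer are already loaded; the function and variable names are mine, not the original repo's):

import torch

def og_mmlu_predict(model, tokenizer, prompt, letters=("A", "B", "C", "D")):
    # One forward pass over the whole few-shot prompt, which ends with "Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # [1, seq_len, vocab_size]
    next_token_logits = logits[0, -1]                # distribution over the next token
    # Keep only the logits of the four answer-letter tokens and take the argmax
    letter_ids = [tokenizer(" " + l, add_special_tokens=False).input_ids[-1] for l in letters]
    probs = torch.softmax(next_token_logits[letter_ids], dim=-1)
    return letters[int(probs.argmax())]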

The lm-evaluation-harness implementation uses a more generic approach. It generates 4 variants of the sample, one per answer option, and, more importantly, it uses the text of the option itself rather than just its letter. The summed log-likelihood of each variant (of the continuation only, without the common context, to be precise) is then computed and compared. Since the compared continuations have different lengths, the acc_norm metric seems to be more representative here.
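A rough sketch of this continuation scoring (again only an illustration under the same assumptions, not the harness's actual API):

import torch
import torch.nn.functional as F

def continuation_logprob(model, tokenizer, context, continuation):
    # Score only the continuation tokens; the shared context is excluded from the sum.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)     # position t predicts token t+1
    targets = full_ids[0, 1:]
    per_token = logprobs[torch.arange(targets.shape[0]), targets]
    cont_len = full_ids.shape[1] - ctx_ids.shape[1]
    summed = per_token[-cont_len:].sum().item()
    # acc compares the raw sums; acc_norm roughly corresponds to dividing by the
    # length of the answer string before comparing
    return summed, summed / max(len(continuation), 1)

The option whose continuation gets the highest score is taken as the prediction: acc uses the raw sum, acc_norm the length-normalized variant.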

2. Prompt Template

Here are examples, 1-shot for the sake of simplicity.
This is what the OG MMLU prompt looks like:

The following are multiple choice questions (with answers) about  astronomy.

Why is Mars red?
A. Because the surface is covered with heavily oxidized ("rusted") minerals.
B. Because the atmosphere scatters more light at bluer wavelengths transmitting mostly red light.
C. Because Mars is covered with ancient lava flows which are red in color.
D. Because flowing water on Mars's surface altered the surface minerals several billion years ago.
Answer: A

The lunar maria are:
A. ancient heavily cratered highlands
B. dark lavas inside volcanic calderas
C. dark lavas filling older impact basins
D. the bright regions on the Moon
Answer:

The lm-evaluation-harness prompts:
Sample choice 1

Question: A comet’s tail points in the following direction:
Choices:
A. away from the Sun
B. towards the Sun
C. in the direction of movement
D. against the direction of movement
Answer: away from the Sun

Question: The lunar maria are:
Choices:
A. ancient heavily cratered highlands
B. dark lavas inside volcanic calderas
C. dark lavas filling older impact basins
D. the bright regions on the Moon
Answer: dark lavas filling older impact bas

Sample choice 2

Question: A comet’s tail points in the following direction:
Choices:
A. away from the Sun
B. towards the Sun
C. in the direction of movement
D. against the direction of movement
Answer: away from the Sun

Question: The lunar maria are:
Choices:
A. ancient heavily cratered highlands
B. dark lavas inside volcanic calderas
C. dark lavas filling older impact basins
D. the bright regions on the Moon
Answer: the bright regions on the

As you can see, there are a few differences:

  • The OG MMLU sample has a prefix stating the task subject
  • The question format is slightly different; there is no "Question:" prefix in OG MMLU
  • The answer itself seems truncated for some reason that I don't fully understand. Is it done to match the sample lengths in tokens? If so, how can acc_norm be different from acc then?
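For concreteness, the two templates can be sketched roughly like this (illustrative helper functions, not the actual code of either implementation; in the OG format the subject header is prepended once at the top of the prompt):

LETTERS = ["A", "B", "C", "D"]

def format_og(question, choices, answer_letter=None):
    # OG MMLU: bare question, lettered options, and a single letter after "Answer:"
    s = question + "\n"
    s += "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    s += "\nAnswer:" + (f" {answer_letter}" if answer_letter else "")
    return s

def format_harness(question, choices, answer_text=None):
    # harness-style: "Question:" / "Choices:" prefixes and the full option text after "Answer:"
    s = f"Question: {question}\nChoices:\n"
    s += "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    s += "\nAnswer:" + (f" {answer_text}" if answer_text else "")
    return s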

The resulting metric difference for the mentioned model is not that big, but when I repeated the same evaluations for GPT-NeoX-20B, I got a somewhat bigger gap: 32.9% with the lm-evaluation-harness vs. 26.1% with the OG MMLU eval. So I think this distinction is worth considering.

@StellaAthena
Member

@ollmer This is really excellent work, thank you. Could you open a PR correcting the implementation? I would also be interested in running your implementation on the GPT-3 API and seeing if we get the same results as the original MMLU paper (we might not because of the models changing, but it would be very cool if we did).

@StellaAthena added the validation label on May 6, 2023
@StellaAthena
Member

I’ve added this to a project called “Task Validation.” Briefly, we actually just started a systematic review of our evaluation benchmark implementations to verify that they agree with the papers that introduce them. If you have some additional time and enjoy chasing stuff like this down, there are a lot more we need to tackle :)

@StellaAthena changed the title from "MMLU implementation difference" to "Validate MMMLU" on May 8, 2023
@ollmer
Author

ollmer commented May 12, 2023

Could you open a PR correcting the implementation?

I've finally found the time to do it: #497

@ollmer
Author

ollmer commented May 12, 2023

I would also be interested in running your implementation on the GPT-3 API

It is worth checking, but running around 6000 samples (~1.2e6 tokens) through the davinci API would cost around $25 :/.
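(As a rough sanity check: at davinci's listed price of about $0.02 per 1K tokens, 1.2e6 tokens works out to roughly $24, which matches this estimate.)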

@StellaAthena
Member

@ollmer EleutherAI will run it and pick up the tab.

@Fu-Cheng

Fu-Cheng commented May 31, 2023

Thank you for providing such comprehensive descriptions and explanations. I have two questions regarding the MMLU validation. Firstly, I have observed that the model input comprises both the context (question) and the continuation (answer) (lm_eval/base.py). I am interested to know how you ensured that the answer did not inadvertently influence the model's output. Secondly, I have noticed that the execution time in this PR is considerably longer compared to OG-MMLU. I conducted tests using the huggyllama/llama-7b model with five shots on an A6000 GPU. It took about 3 hours with OG-MMLU and 7 hours with this PR. Have you encountered this issue as well?

@StellaAthena
Member

Thank you for providing such comprehensive descriptions and explanations. I have two questions regarding the MMLU validation. Firstly, I have observed that the model input comprises both the context (question) and the continuation (answer) (lm_eval/base.py). I am interested to know how you ensured that the answer did not inadvertently influence the model's output. Secondly, I have noticed that the execution time in this PR is considerably longer compared to OG-MMLU. I conducted tests using the huggyllama/llama-7b model with five shots on an A6000 GPU. It took about 3 hours with OG-MMLU and 7 hours with this PR. Have you encountered this issue as well?

The evaluation protocol specified by the paper is formatted like a multiple choice test question. This includes the fact that the answer choices are shown to the model as part of the question.

@haileyschoelkopf linked a pull request on Jun 23, 2023 that will close this issue
@haileyschoelkopf
Contributor

Closed following #497 and the heroic effort of @ollmer!
