Validate MMLU #475
Comments
@ollmer This is really excellent work, thank you. Could you open a PR correcting the implementation? I would also be interested in running your implementation on the GPT-3 API to see whether we get the same results as the original MMLU paper (we might not, because the models have changed, but it would be very cool if we did).
I’ve added this to a project called “Task Validation.” Briefly, we have just started a systematic review of our evaluation benchmark implementations to verify that they agree with the papers that introduced them. If you have some additional time and enjoy chasing stuff like this down, there are a lot more we need to tackle :)
I've finally found the time to do it: #497
It is worth checking, but running around 6,000 samples (~1.2e6 tokens) through the davinci API would cost around $25 :/
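For reference, a back-of-the-envelope check of that figure, taking the token count from the comment above and assuming the then-listed davinci price of $0.02 per 1K tokens:

```python
# Rough cost estimate for scoring MMLU through the davinci API.
# The $0.02-per-1K-tokens price is an assumption (the listed davinci rate at the time).
tokens = 1.2e6          # approximate total tokens for ~6,000 samples
price_per_1k = 0.02     # USD per 1,000 tokens
print(f"${tokens / 1000 * price_per_1k:.2f}")  # -> $24.00, i.e. roughly $25
```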
@ollmer EleutherAI will run it and pick up the tab.
Thank you for providing such comprehensive descriptions and explanations. I have two questions regarding the MMLU validation. First, I have observed that the model input comprises both the context (question) and the continuation (answer) (lm_eval/base.py). I am interested to know how you ensured that the answer did not inadvertently influence the model's output. Second, I have noticed that the execution time with this PR is considerably longer than with OG-MMLU. I ran tests using the huggyllama/llama-7b model with five shots on an A6000 GPU: it took about 3 hours with OG-MMLU and 7 hours with this PR. Have you encountered this issue as well?
The evaluation protocol specified by the paper is formatted like a multiple-choice test question. This includes the fact that the answer choices are shown to the model as part of the question.
Original issue by @ollmer:

Hi! First of all, thank you for the great project, which helps so much in comparing the capabilities of different models.
I'm currently using the lm-evaluation-harness to evaluate a few models on the MMLU task (aka hendrycksTest).
I've tried to run a 5-shot evaluation of the huggyllama 7b model from the Hugging Face Hub and got the following results:
This is slightly different from the 35.1% reported in Table 9 of the LLaMA paper.
I then tried to perform the eval with the original implementation, using a version of the FLAN eval script slightly tweaked for the HF causal model. With this script I got exactly the same accuracy of 35.1% as reported in the paper.
The config used during the lm-evaluation-harness eval:
I went further to check the source of the discrepancy and found out that both the approach itself and the prompts fed to the model are quite different.
1. Difference in likelihood estimation
The OG MMLU implementation exploits the fact that the answer to each question is just one letter from [A, B, C, D], so the evaluation of a single sample is done in one forward pass. The resulting logits are filtered down to the 4 token indices corresponding to the answer letters, and argmax is then used to select the letter with the biggest softmax value out of the 4.
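A minimal sketch of this scheme, assuming a Hugging Face transformers model; `score_og_style` is a hypothetical helper, and the leading-space tokenization of the letters is an assumption (it holds for GPT-2/LLaMA-style tokenizers), not the OG code verbatim:

```python
import torch

def score_og_style(model, tokenizer, prompt: str) -> str:
    """One forward pass; argmax over the logits of the four letter tokens."""
    letters = ["A", "B", "C", "D"]
    # Assumes each letter (with a leading space) maps to a single token.
    letter_ids = [
        tokenizer(" " + l, add_special_tokens=False).input_ids[-1] for l in letters
    ]
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits   # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]     # distribution over the next token
    best = int(torch.argmax(next_token_logits[letter_ids]))
    return letters[best]
```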
The lm-evaluation-harness implementation uses a more generic approach. It builds 4 different samples, one per answer, and, more importantly, each sample uses the text of the option itself, not just its letter. The summed log-likelihood of each sample (of the continuation only, without the common context, to be precise) is then computed and compared. Since the compared continuations have different lengths, the acc_norm metric seems to be more representative here.
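To illustrate the final comparison step, a sketch under my reading of the harness's hendrycksTest task, where acc_norm divides each summed log-likelihood by the character length of its choice (the function name is illustrative):

```python
import numpy as np

def pick_answer(loglikelihoods: list[float], choices: list[str]) -> tuple[int, int]:
    """Return (acc prediction, acc_norm prediction) as choice indices."""
    lls = np.array(loglikelihoods)                 # sum log p(choice text | context)
    pred_acc = int(np.argmax(lls))                 # raw summed log-likelihood
    lengths = np.array([float(len(c)) for c in choices])
    pred_acc_norm = int(np.argmax(lls / lengths))  # per-character normalization
    return pred_acc, pred_acc_norm
```

Raw summed log-likelihoods tend to favor shorter answers (fewer tokens to pay for), and per-character normalization removes that bias, which is why acc_norm is the more comparable number when the choices differ in length.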
2. Prompt template
Here are the examples, 1-shot for the sake of simplicity:
Here is what the OG MMLU prompt looks like:
The lm-evaluation-harness prompts:
Sample choice 1
Sample choice 2
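The original prompt dumps did not survive extraction; the following is only a hedged reconstruction of their general shape (placeholders in braces; the OG template follows the hendrycks/test repo, and the harness side reflects the pre-fix task as described above, so treat both as assumptions). OG MMLU shows the choices and scores the letter:

```
The following are multiple choice questions (with answers) about {subject}.

{few-shot question}
A. {choice A}
B. {choice B}
C. {choice C}
D. {choice D}
Answer: {correct letter}

{test question}
A. {choice A}
B. {choice B}
C. {choice C}
D. {choice D}
Answer:
```

whereas the harness builds one prompt per choice and scores the full answer text, e.g. for choice 1:

```
Question: {test question}
Answer: {full text of choice 1}
```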
As you can see, there are a few differences:
The resulting metric difference for the model mentioned above doesn't look that big, but when I repeated the same evaluations for GPT-NeoX-20B, I got a somewhat bigger gap: 32.9% with the lm-evaluation-harness vs. 26.1% with the OG MMLU eval. So I think this distinction is worth considering.