MMLU task fix #497
Conversation
… to the original eval code
The PR will be in Draft status until I get the eval numbers for a few models to compare with the original Hendrycks eval results.
@ollmer Any update on running evals to confirm performance?
Sorry for the delay, here are the numbers for 5-shot comparison:
There is a difference with hendrycks/test in that the underscores in subjects aren't removed.
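For context, a minimal sketch of the formatting difference being discussed, in the style of the hendrycks/test convention of turning underscores in the subject name into spaces when building the prompt header; the subject string is just an example.

```python
def format_subject(subject: str) -> str:
    # "high_school_mathematics" -> "high school mathematics"
    return " ".join(subject.split("_"))

def prompt_header(subject: str) -> str:
    # Header in the style of the original eval, using the cleaned-up subject name.
    return (
        "The following are multiple choice questions (with answers) about "
        f"{format_subject(subject)}.\n\n"
    )

print(prompt_header("high_school_mathematics"))
```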
@gakada Good catch!
@gakada, fixed, thanks!
@ollmer Can you rerun the comparison between this repo and the hendrycks code for Pythia, GPT-NeoX-20B, and LLaMA again with the prompt change?
Do you mean the subjects with removed underscores, or some kind of new prompt changes that should be merged from the main branch first?
With the removed underscores and leading spaces fixed.
I'm testing this currently.
I've tested it with the updated code. Results are almost the same:
Raw JSONs: results_after_fixes.zip
This PR with the current main and the new tokenizer should get around 35%.
Did you run this branch with the new tokenizer and get 35%, or are you guessing? It would be good to pull main into this fork and then rerun it.

Also, we really need to add code for running the major topic areas (STEM, Social Sciences, Humanities, Other) and aggregating results. @ollmer since you obviously have a script for doing this, can you add it? I recommend making new tasks which call the subtasks and then aggregate results before returning the category-level results. We should encourage people to call these top-level categories and not the subtasks, as that's what the paper recommends and doing the aggregation by hand is error-prone.
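As a rough illustration of the aggregation idea (not the harness's actual task API), here is a sketch that maps per-subject accuracies to the four categories and computes a question-weighted score per category; the mapping shows only a few of the 57 subjects, and the input format is an assumption.

```python
# Illustrative subset of the subject -> category mapping from the MMLU paper.
SUBJECT_TO_CATEGORY = {
    "abstract_algebra": "STEM",
    "college_physics": "STEM",
    "econometrics": "Social Sciences",
    "high_school_psychology": "Social Sciences",
    "philosophy": "Humanities",
    "professional_law": "Humanities",
    "business_ethics": "Other",
    "clinical_knowledge": "Other",
}

def aggregate_by_category(results):
    """results maps subject -> (accuracy, number_of_test_questions)."""
    sums = {}  # category -> [weighted correct, total questions]
    for subject, (acc, n) in results.items():
        cat = SUBJECT_TO_CATEGORY[subject]
        bucket = sums.setdefault(cat, [0.0, 0])
        bucket[0] += acc * n
        bucket[1] += n
    return {cat: correct / total for cat, (correct, total) in sums.items()}
```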
I evaluated a few more recent models; here is the aggregated table of average scores:
JSONs: results_2.zip
Commands to reproduce:
There were some issues with the non-fast tokenizer in open_llama_7b and the absence of the expected sequence length field in the configs of mpt-7b and falcon-7b. This fix is required to run the MMLU eval for these models.
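For reference, a hedged sketch of the kind of fallback that issue calls for: different HF configs expose the context size under different names (e.g. mpt-7b's max_seq_len), so the field has to be probed rather than assumed. The attribute list below is illustrative, not exhaustive, and this is not the repo's actual fix.

```python
from transformers import AutoConfig

def guess_max_length(model_name: str, default: int = 2048) -> int:
    cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    # Probe the common names for the context-size field, in order.
    for attr in ("max_position_embeddings", "n_positions", "max_seq_len", "seq_length"):
        value = getattr(cfg, attr, None)
        if isinstance(value, int) and value > 0:
            return value
    return default  # config carries no usable length field
```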
Are those LLaMA tokenizer fixes in transformers or in the current repo? Could you provide a link to the code, please? If it's here, I can redo the eval with main merged.
Yes, that's a good idea. I don't have time for that right now, but I hope to get back to it in a few days.
From the current repo, #531.
With main merged, LLaMA scores are indeed better:
I'm a little worried that the scores drifted further away. Is this fluctuation within the margin of error?
Now it isn't significant (by p-values or CIs, at least overall); previously, 0.321 vs 0.351 would have been significant.
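As a back-of-the-envelope check of that claim, a two-proportion z-test over the roughly 14k MMLU test questions; the question count is an approximation, and an unpaired test is a simplification of how one would properly compare two runs on the same items.

```python
import math

def two_prop_z(p1: float, p2: float, n1: int, n2: int):
    # Unpooled (Wald) z-test for a difference between two accuracies.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

z, p = two_prop_z(0.321, 0.351, 14042, 14042)
print(f"z = {z:.2f}, p = {p:.1e}")  # z is about 5.3, so that old gap was significant
```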
mosaicml/mpt-7b 0.311. @ollmer, are those the latest numbers for MPT and open_llama 7B?
Hi! Where did you get these numbers? I haven't evaluated these models since the PR was merged, so I don't know how their respective numbers have changed.
In your post from 10 days ago, above.
Ah, sorry. Those were the numbers on this branch; the eval was done before merging the current main, so the numbers are a bit outdated.
I rescored with master: MPT-7B = 27.9. It took more than 3 hours, whereas it takes 45 minutes with our implementation here: https://github.com/OpenNMT/OpenNMT-py/tree/master/eval_llm/MMLU (taken from https://github.com/FranxYao/chain-of-thought-hub). Anyway, the scores are much closer; I wanted to double-check that LLaMA-7B is clearly above the others.
How do you compute the average score? There seems to be no average score in result.json. |
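A hedged guess at how such an average can be produced, since result.json only carries per-subtask metrics: the key layout below ("results" containing an "acc" per hendrycksTest-* task) is an assumption about the output format, and this is the unweighted mean over subtasks. A question-weighted mean can differ slightly, because subtask sizes range from roughly 100 to 1500 questions.

```python
import json

with open("result.json") as f:
    data = json.load(f)

# Collect the per-subtask accuracies and average them.
accs = [
    v["acc"]
    for task, v in data["results"].items()
    if task.startswith("hendrycksTest-")
]
print("average accuracy over MMLU subtasks:", sum(accs) / len(accs))
```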
Update and fix the MMLU task to achieve the same behavior as the original implementation.
One difference remains unresolved: the original code reduces the number of few-shot samples below N if the tokenized N-shot prompt is longer than the model's max_length.
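A minimal sketch of that behavior, assuming a hypothetical make_prompt helper and an HF-style tokenizer; it simply drops few-shot examples until the tokenized prompt fits the context, which is how the original eval backs off from N-shot. This is an illustration, not the original implementation.

```python
def make_prompt(examples, question: str) -> str:
    # Hypothetical formatter: few-shot examples followed by the test question.
    shots = "\n\n".join(examples)
    return f"{shots}\n\n{question}" if shots else question

def build_prompt(examples, question, tokenizer, max_length: int, n_shot: int = 5) -> str:
    k = n_shot
    prompt = make_prompt(examples[:k], question)
    # Drop examples one by one until the tokenized prompt fits the model context.
    while k > 0 and len(tokenizer.encode(prompt)) > max_length:
        k -= 1
        prompt = make_prompt(examples[:k], question)
    return prompt
```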