
MMLU task fix #497
Merged: 6 commits merged into EleutherAI:master on Jun 15, 2023

Conversation

@ollmer commented May 12, 2023

Update and fix the MMLU task to achieve the same behavior as the original implementation.

  1. Fix the Hugging Face dataset name, which changed from hendrycks_test to cais/mmlu.
  2. Implement the exact same prompt format and few-shot prefix as the original.
  3. Use the unchanged order of the dev set when selecting few-shot samples.

One difference remains unresolved: the original code chooses a number of few-shot samples smaller than N if the tokenized N-shot prompt is longer than the model's max_length (see the sketch below).
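
For context, a rough sketch of that prompt construction and truncation logic; the function names and the tokenize/max_length arguments here are illustrative assumptions, not the harness API:

def format_example(question, choices, answer=None):
    # One MMLU question block; the answer letter is appended only for few-shot examples.
    block = question + "\n"
    for letter, choice in zip("ABCD", choices):
        block += f"{letter}. {choice}\n"
    block += "Answer:"
    if answer is not None:
        block += f" {answer}\n\n"
    return block

def build_prompt(subject, dev_examples, test_example, tokenize, max_length, k=5):
    # Few-shot prefix: instruction header plus the first k dev examples, in dev-set order.
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    test_block = format_example(*test_example)
    # Roughly what the original code does: drop trailing shots until the tokenized prompt fits.
    while True:
        shots = "".join(format_example(q, c, a) for q, c, a in dev_examples[:k])
        prompt = header + shots + test_block
        if k == 0 or len(tokenize(prompt)) <= max_length:
            return prompt
        k -= 1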

@CLAassistant commented May 12, 2023

CLA assistant check
All committers have signed the CLA.

@ollmer mentioned this pull request May 12, 2023
@ollmer (Author) commented May 12, 2023

The PR will be in Draft status until I get the eval numbers for a few models to compare with the original Hendrycks eval results.
Intended models: LLaMA 7B, Pythia 12B, GPT-NeoX 20B.

@StellaAthena (Member)

@ollmer Any update on running evals to confirm performance?

Review comments on lm_eval/tasks/hendrycks_test.py (outdated, resolved)
@ollmer (Author) commented May 28, 2023

Sorry for the delay; here are the numbers for the 5-shot comparison:

| Model | Category | Hendrycks Eval Accuracy | This PR Accuracy | Accuracy in Paper |
|---|---|---|---|---|
| EleutherAI/pythia-12b | Average | 0.251 | 0.269 | |
| | STEM | 0.251 | 0.280 | |
| | humanities | 0.262 | 0.267 | |
| | other (business, health, misc.) | 0.260 | 0.263 | |
| | social sciences | 0.226 | 0.265 | |
| EleutherAI/gpt-neox-20b | Average | 0.261 | 0.251 | |
| | STEM | 0.250 | 0.232 | |
| | humanities | 0.272 | 0.269 | |
| | other (business, health, misc.) | 0.277 | 0.276 | |
| | social sciences | 0.240 | 0.233 | |
| huggyllama/llama-7b | Average | 0.351 | 0.322 | 0.351 |
| | STEM | 0.306 | 0.296 | 0.305 |
| | humanities | 0.339 | 0.328 | 0.340 |
| | other (business, health, misc.) | 0.382 | 0.352 | 0.381 |
| | social sciences | 0.382 | 0.319 | 0.383 |
| huggyllama/llama-13b | Average | 0.470 | 0.469 | 0.469 |
| | STEM | 0.363 | 0.373 | 0.358 |
| | humanities | 0.450 | 0.515 | 0.450 |
| | other (business, health, misc.) | 0.531 | 0.490 | 0.533 |
| | social sciences | 0.541 | 0.538 | 0.538 |

@ollmer marked this pull request as ready for review May 28, 2023 18:25
@ollmer requested a review from jon-tow as a code owner May 28, 2023 18:25
@gakada (Contributor) commented Jun 5, 2023

There is a difference from hendrycks/test in that the underscores in subjects aren't replaced, so e.g. the prompt reads "... about high_school_physics." instead of "... about high school physics.". Also, strings should probably be stripped: some questions begin with spaces and some don't. Something like the snippet below would cover both.
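
A minimal illustrative snippet of the two fixes (not the actual patch):

# 1) Replace underscores in the subject name used in the prompt header.
subject = "high_school_physics"
description = (
    "The following are multiple choice questions (with answers) "
    f"about {subject.replace('_', ' ')}."
)

# 2) Strip stray leading/trailing whitespace from question text.
raw_question = "  An object at rest stays at rest unless acted on by a net force."
question = raw_question.strip()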

@StellaAthena (Member)

@gakada Good catch!

@StellaAthena self-requested a review June 6, 2023 14:29
@StellaAthena changed the title from MMLU task fix to MMMLU task fix Jun 6, 2023
@ollmer (Author) commented Jun 6, 2023

@gakada, fixed, thanks!

@ollmer changed the title from MMMLU task fix to MMLU task fix Jun 6, 2023
@StellaAthena changed the title from MMLU task fix to MMMLU task fix Jun 7, 2023
@StellaAthena (Member)

@ollmer Can you rerun the comparison between this repo and the Hendrycks code for Pythia, GPT-NeoX-20B, and LLaMA with the prompt change?

@ollmer (Author) commented Jun 8, 2023

Do you mean the subjects with the underscores removed, or some new prompt changes that need to be merged from the main branch first?

@StellaAthena (Member)

Do you mean the subjects with the underscores removed, or some new prompt changes that need to be merged from the main branch first?

With the underscores removed and the leading spaces fixed.

@StellaAthena previously approved these changes Jun 8, 2023
@StellaAthena self-requested a review June 8, 2023 22:17
@StellaAthena dismissed their stale review June 8, 2023 22:17: "Changes were made but more are needed"

@StellaAthena changed the title from MMMLU task fix to MMLU task fix Jun 9, 2023
@StellaAthena (Member)

I'm testing this currently.

@ollmer (Author) commented Jun 12, 2023

I've tested it with the updated code. Results are almost the same:

| Model | Category | Hendrycks Eval Accuracy | This PR before fixes | This PR after fixes | Accuracy in Paper |
|---|---|---|---|---|---|
| EleutherAI/pythia-12b | Average | 0.251 | 0.269 | 0.268 | |
| | STEM | 0.251 | 0.280 | 0.279 | |
| | humanities | 0.262 | 0.267 | 0.263 | |
| | other (business, health, misc.) | 0.260 | 0.263 | 0.261 | |
| | social sciences | 0.226 | 0.265 | 0.262 | |
| EleutherAI/gpt-neox-20b | Average | 0.261 | 0.251 | 0.251 | |
| | STEM | 0.250 | 0.232 | 0.238 | |
| | humanities | 0.272 | 0.269 | 0.271 | |
| | other (business, health, misc.) | 0.277 | 0.276 | 0.270 | |
| | social sciences | 0.240 | 0.233 | 0.226 | |
| huggyllama/llama-7b | Average | 0.351 | 0.322 | 0.321 | 0.351 |
| | STEM | 0.306 | 0.296 | 0.295 | 0.305 |
| | humanities | 0.339 | 0.328 | 0.326 | 0.340 |
| | other (business, health, misc.) | 0.382 | 0.352 | 0.352 | 0.381 |
| | social sciences | 0.382 | 0.319 | 0.319 | 0.383 |
| huggyllama/llama-13b | Average | 0.470 | 0.469 | 0.469 | 0.469 |
| | STEM | 0.363 | 0.373 | 0.377 | 0.358 |
| | humanities | 0.450 | 0.515 | 0.516 | 0.450 |
| | other (business, health, misc.) | 0.531 | 0.490 | 0.487 | 0.533 |
| | social sciences | 0.541 | 0.538 | 0.536 | 0.538 |

Raw jsons: results_after_fixes.zip

@gakada (Contributor) commented Jun 13, 2023

This PR, with current master merged in, should also give different numbers, since there are LLaMA tokenizer fixes; e.g. huggyllama/llama-7b can be around 35% as well.

@StellaAthena (Member)

This PR, with current master merged in, should also give different numbers, since there are LLaMA tokenizer fixes; e.g. huggyllama/llama-7b can be around 35% as well.

Did you run this branch with the new tokenizer and get 35%, or are you guessing? It would be good to pull main into this fork and then rerun it.

Also, we really need to add code for running the major topic areas (STEM, Social Sciences, Humanities, Other) and aggregating results. @ollmer since you obviously have a script for doing this, can you add it? I recommend making new tasks which call the subtasks and then aggregate results before returning the category-level results. We should encourage people to call these top-level categories and not the subtasks, as that's what the paper recommends, and doing the aggregation by hand is error-prone.
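
For illustration, the by-hand aggregation amounts to something like the script below; the category mapping is only partial here, and the results JSON layout (results[task]["acc"]) is an assumption about the harness output rather than a documented schema:

import json

# Partial subject -> category mapping; the full grouping follows the categories in the MMLU paper.
CATEGORIES = {
    "abstract_algebra": "STEM",
    "high_school_physics": "STEM",
    "philosophy": "Humanities",
    "econometrics": "Social Sciences",
    "business_ethics": "Other",
}

def category_averages(results_path):
    with open(results_path) as f:
        results = json.load(f)["results"]
    sums, counts = {}, {}
    for task, metrics in results.items():
        subject = task.replace("hendrycksTest-", "")
        category = CATEGORIES.get(subject)
        if category is None:
            continue
        sums[category] = sums.get(category, 0.0) + metrics["acc"]
        counts[category] = counts.get(category, 0) + 1
    # Unweighted mean over subtasks; weighting by per-subject question counts is the other option.
    return {category: sums[category] / counts[category] for category in sums}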

@ollmer (Author) commented Jun 13, 2023

I ran the evaluation on a few more recent models; here is the aggregated table of average scores:

| Model | MMLU 5-shot Avg. Acc. |
|---|---|
| huggyllama/llama-13b | 0.469 |
| bigcode/starcoderplus | 0.437 |
| huggyllama/llama-7b | 0.321 |
| mosaicml/mpt-7b | 0.311 |
| tiiuae/falcon-7b | 0.278 |
| openlm-research/open_llama_7b | 0.271 |
| EleutherAI/pythia-12b | 0.268 |
| EleutherAI/gpt-neox-20b | 0.251 |

Jsons: results_2.zip

Commands to reproduce:

mkdir results
python main.py --model hf-causal --model_args pretrained=huggyllama/llama-7b --tasks hendrycksTest-* --device cuda:0 --output_path ./results/huggyllama_llama-7b_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=huggyllama/llama-13b --tasks hendrycksTest-* --device cuda:0 --output_path ./results/huggyllama_llama-13b_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=EleutherAI/pythia-12b --tasks hendrycksTest-* --device cuda:0 --output_path ./results/EleutherAI_pythia-12b_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=EleutherAI/gpt-neox-20b --tasks hendrycksTest-* --device cuda:0 --output_path ./results/EleutherAI_gpt-neox-20b_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=openlm-research/open_llama_7b --tasks hendrycksTest-* --device cuda:0 --output_path ./results/openlm-research_open_llama_7b_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=bigcode/starcoderplus --tasks hendrycksTest-* --device cuda:0 --output_path ./results/bigcode_starcoderplus_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=tiiuae/falcon-7b,trust_remote_code=True --tasks hendrycksTest-* --device cuda:0 --output_path ./results/tiiuae_falcon-7b_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=mosaicml/mpt-7b,trust_remote_code=True --tasks hendrycksTest-* --device cuda:0 --output_path ./results/mosaicml_mpt-7b_hendrycksTest-_1686256999.json  --no_cache --num_fewshot 5

There were some issues with the non-fast tokenizer in open_llama_7b and with the absence of the expected sequence-length field in the configs of mpt-7b and falcon-7b. This fix is required to run the MMLU eval for these models.
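
For reference, the missing sequence-length field usually just means the config exposes the context size under a different attribute name; a fallback chain along these lines (the attribute names listed and the default value are assumptions, not what the harness or the linked fix actually does) is one way around it:

def get_max_length(config, default=2048):
    # Different model configs name the context window differently
    # (e.g. n_positions, max_position_embeddings, max_seq_len).
    for attr in ("n_positions", "max_position_embeddings", "n_ctx", "max_seq_len"):
        value = getattr(config, attr, None)
        if value is not None:
            return value
    return default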

@ollmer (Author) commented Jun 13, 2023

This PR, with current master merged in, should also give different numbers, since there are LLaMA tokenizer fixes; e.g. huggyllama/llama-7b can be around 35% as well.

Are those LLaMA tokenizer fixes in transformers or in the current repo? Could you provide a link to the code, please? If they're here, I can re-evaluate with main merged in.

@ollmer since you obviously have a script for doing this, can you add it? I recommend making new tasks

Yes, that's a good idea. I don't have time for that right now, but I hope to get back to it in a few days.

@gakada (Contributor) commented Jun 13, 2023

From the current repo, #531.

@ollmer (Author) commented Jun 14, 2023

With main merged in, the LLaMA scores are indeed better:

| Model | Before merge | After merge | Paper |
|---|---|---|---|
| EleutherAI/gpt-neox-20b | 0.251 | 0.251 | |
| EleutherAI/pythia-12b | 0.268 | 0.268 | |
| huggyllama/llama-13b | 0.469 | 0.475 | 0.469 |
| huggyllama/llama-7b | 0.321 | 0.357 | 0.351 |

@StellaAthena (Member)

I’m a little worried that the scores drifted further away. Is this fluctuation within the margin of error?

@gakada (Contributor) commented Jun 15, 2023

Now it isn't significant (by p-values or CIs, at least overall); previously, 0.321 vs 0.351 would have been significant.
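
For illustration, a back-of-the-envelope two-proportion z-test over the full MMLU test set (roughly 14k questions, treated here as independent samples) supports that:

from math import sqrt

def z_score(p1, p2, n=14042):
    # Two-proportion z-test; n is the approximate size of the MMLU test set.
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p1 - p2) / se

print(z_score(0.351, 0.321))  # ~5.3: the earlier 0.321 vs 0.351 gap was significant
print(z_score(0.357, 0.351))  # ~1.0: the post-merge 0.357 vs 0.351 gap is not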

@StellaAthena merged commit 3e2e6d8 into EleutherAI:master Jun 15, 2023
2 checks passed
@vince62s

mosaicml/mpt-7b 0.311
tiiuae/falcon-7b 0.278
openlm-research/open_llama_7b 0.271

@ollmer are those the latest numbers for MPT and Open_llama 7B?

@haileyschoelkopf linked an issue Jun 23, 2023 that may be closed by this pull request
@ollmer (Author) commented Jun 23, 2023

Hi! Where did you get these numbers from? I haven't evaluated these models since the PR was merged, so I don't know how their respective numbers have changed.

@vince62s

In your post from 10 days ago, above.

@ollmer (Author) commented Jun 23, 2023

Ah, sorry. Those were the numbers on this branch; the eval was done before merging in current main, so the numbers are a bit outdated.

@vince62s

I rescored with master: MPT-7B = 27.9. It took more than 3 hours, whereas it takes 45 minutes with our implementation here: https://github.com/OpenNMT/OpenNMT-py/tree/master/eval_llm/MMLU, taken from https://github.com/FranxYao/chain-of-thought-hub. Anyway, the scores are much closer; I wanted to double-check that LLaMA 7B is clearly above the others.

@Reason-Wang

I ran the evaluation on a few more recent models; here is the aggregated table of average scores:

| Model | MMLU 5-shot Avg. Acc. |
|---|---|
| huggyllama/llama-13b | 0.469 |
| bigcode/starcoderplus | 0.437 |
| huggyllama/llama-7b | 0.321 |
| mosaicml/mpt-7b | 0.311 |
| tiiuae/falcon-7b | 0.278 |
| openlm-research/open_llama_7b | 0.271 |
| EleutherAI/pythia-12b | 0.268 |
| EleutherAI/gpt-neox-20b | 0.251 |
Jsons: results_2.zip


How do you compute the average score? There seems to be no average score in result.json.


Successfully merging this pull request may close these issues: Validate MMMLU