MMLU task fix #497
Conversation
… to the original eval code
The PR will be in Draft status until I get the eval numbers for a few models to compare with the original Hendrycks eval results.
@ollmer Any update on running evals to confirm performance?
Sorry for the delay, here are the numbers for 5-shot comparison:
There is a difference with hendrycks/test in that the underscores in subjects aren't removed.
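For context, a minimal sketch of the formatting difference being discussed, in the style of the hendrycks/test convention of turning underscores in the subject name into spaces when building the prompt header; the subject string is just an example.

```python
def format_subject(subject: str) -> str:
    # "high_school_mathematics" -> "high school mathematics"
    return " ".join(subject.split("_"))

def prompt_header(subject: str) -> str:
    # Header in the style of the original eval, using the cleaned-up subject name.
    return (
        "The following are multiple choice questions (with answers) about "
        f"{format_subject(subject)}.\n\n"
    )

print(prompt_header("high_school_mathematics"))
```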
@gakada Good catch!
@gakada, fixed, thanks!
@ollmer Can you rerun the comparison between this repo and the hendrycks code for Pythia, GPT-NeoX-20B, and LLaMA again with the prompt change?
Do you mean the subjects with removed underscores, or some kind of new prompt changes that should be merged from the main branch first?
With the removed underscores and leading spaces fixed.
I'm testing this currently.
I've tested it with the updated code. Results are almost the same:
Raw JSONs: results_after_fixes.zip
This PR with the current main and the new tokenizer should get around 35%.
Did you run this branch with the new tokenizer and get 35%, or are you guessing? It would be good to pull main into this fork and then rerun it.

Also, we really need to add code for running the major topic areas (STEM, Social Sciences, Humanities, Other) and aggregating results. @ollmer since you obviously have a script for doing this, can you add it? I recommend making new tasks which call the subtasks and then aggregate results before returning the category-level results. We should encourage people to call these top-level categories and not the subtasks, as that's what the paper recommends and doing the aggregation by hand is error-prone.
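As a rough illustration of the aggregation idea (not the harness's actual task API), here is a sketch that maps per-subject accuracies to the four categories and computes a question-weighted score per category; the mapping shows only a few of the 57 subjects, and the input format is an assumption.

```python
# Illustrative subset of the subject -> category mapping from the MMLU paper.
SUBJECT_TO_CATEGORY = {
    "abstract_algebra": "STEM",
    "college_physics": "STEM",
    "econometrics": "Social Sciences",
    "high_school_psychology": "Social Sciences",
    "philosophy": "Humanities",
    "professional_law": "Humanities",
    "business_ethics": "Other",
    "clinical_knowledge": "Other",
}

def aggregate_by_category(results):
    """results maps subject -> (accuracy, number_of_test_questions)."""
    sums = {}  # category -> [weighted correct, total questions]
    for subject, (acc, n) in results.items():
        cat = SUBJECT_TO_CATEGORY[subject]
        bucket = sums.setdefault(cat, [0.0, 0])
        bucket[0] += acc * n
        bucket[1] += n
    return {cat: correct / total for cat, (correct, total) in sums.items()}
```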
I evaluated a few more recent models; here is the aggregated table of average scores:
JSONs: results_2.zip
Commands to reproduce:
There were some issues with the non-fast tokenizer in open_llama_7b and the absence of the expected sequence length field in the configs of mpt-7b and falcon-7b. This fix is required to run the MMLU eval for these models.
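For reference, a hedged sketch of the kind of fallback that issue calls for: different HF configs expose the context size under different names (e.g. mpt-7b's max_seq_len), so the field has to be probed rather than assumed. The attribute list below is illustrative, not exhaustive, and this is not the repo's actual fix.

```python
from transformers import AutoConfig

def guess_max_length(model_name: str, default: int = 2048) -> int:
    cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    # Probe the common names for the context-size field, in order.
    for attr in ("max_position_embeddings", "n_positions", "max_seq_len", "seq_length"):
        value = getattr(cfg, attr, None)
        if isinstance(value, int) and value > 0:
            return value
    return default  # config carries no usable length field
```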
Are those LLaMA tokenizer fixes in transformers or in the current repo? Could you provide a link to the code, please? If it's here, I can redo the eval with main merged.
Yes, that's a good idea. I don't have time for that right now, but I hope to get back to it in a few days.
From the current repo, #531.
With main merged, LLaMA scores are indeed better:
I'm a little worried that the scores drifted further away. Is this fluctuation within the margin of error?
Now it isn't significant (by p-values or CIs, at least overall); previously, 0.321 vs 0.351 would have been significant.
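As a back-of-the-envelope check of that claim, a two-proportion z-test over the roughly 14k MMLU test questions; the question count is an approximation, and an unpaired test is a simplification of how one would properly compare two runs on the same items.

```python
import math

def two_prop_z(p1: float, p2: float, n1: int, n2: int):
    # Unpooled (Wald) z-test for a difference between two accuracies.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

z, p = two_prop_z(0.321, 0.351, 14042, 14042)
print(f"z = {z:.2f}, p = {p:.1e}")  # z is about 5.3, so that old gap was significant
```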
mosaicml/mpt-7b 0.311. @ollmer, are those the latest numbers for MPT and open_llama 7B?
Hi! Where did you get these numbers? I haven't evaluated these models since the PR was merged, so I don't know how their respective numbers have changed.
In your post from 10 days ago, above.
Ah, sorry. Those were the numbers on this branch; the eval was done before merging the current main, so the numbers are a bit outdated.
I rescored with master: MPT-7B = 27.9. It took more than 3 hours, whereas it takes 45 minutes with our implementation here: https://github.com/OpenNMT/OpenNMT-py/tree/master/eval_llm/MMLU (taken from https://github.com/FranxYao/chain-of-thought-hub). Anyway, the scores are much closer; I wanted to double-check that LLaMA-7B is clearly above the others.
How do you compute the average score? There seems to be no average score in result.json. |
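A hedged guess at how such an average can be produced, since result.json only carries per-subtask metrics: the key layout below ("results" containing an "acc" per hendrycksTest-* task) is an assumption about the output format, and this is the unweighted mean over subtasks. A question-weighted mean can differ slightly, because subtask sizes range from roughly 100 to 1500 questions.

```python
import json

with open("result.json") as f:
    data = json.load(f)

# Collect the per-subtask accuracies and average them.
accs = [
    v["acc"]
    for task, v in data["results"].items()
    if task.startswith("hendrycksTest-")
]
print("average accuracy over MMLU subtasks:", sum(accs) / len(accs))
```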
Update and fix the MMLU task to achieve the same behavior as the original implementation.
One difference remains unresolved: the original code reduces the number of few-shot samples below N if the tokenized N-shot prompt is longer than the model's max_length.
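A minimal sketch of that behavior, assuming a hypothetical make_prompt helper and an HF-style tokenizer; it simply drops few-shot examples until the tokenized prompt fits the context, which is how the original eval backs off from N-shot. This is an illustration, not the original implementation.

```python
def make_prompt(examples, question: str) -> str:
    # Hypothetical formatter: few-shot examples followed by the test question.
    shots = "\n\n".join(examples)
    return f"{shots}\n\n{question}" if shots else question

def build_prompt(examples, question, tokenizer, max_length: int, n_shot: int = 5) -> str:
    k = n_shot
    prompt = make_prompt(examples[:k], question)
    # Drop examples one by one until the tokenized prompt fits the model context.
    while k > 0 and len(tokenizer.encode(prompt)) > max_length:
        k -= 1
        prompt = make_prompt(examples[:k], question)
    return prompt
```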