Accuracy gap between single GPU and multiple GPUs #1751

Open
HsuWanTing opened this issue Apr 26, 2024 · 3 comments

HsuWanTing commented Apr 26, 2024

I'm using lm-eval v0.4.2 to evaluate Llama 7B on the Open LLM Leaderboard benchmarks.
I found accuracy gaps between a single GPU and multiple GPUs, as shown below (I used data parallelism).

| Setup                 | Average | ARC-c | HellaSwag | MMLU  | TruthfulQA | WinoGrande | GSM8K |
|-----------------------|---------|-------|-----------|-------|------------|------------|-------|
| 4 GPUs (batch size 4) | 46.58   | 50.85 | 78.13     | 35.14 | 34.08      | 71.82      | 9.48  |
| 4 GPUs (batch size 1) | 46.61   | 50.85 | 78.12     | 35.17 | 34.08      | 71.9       | 9.55  |
| 1 GPU (batch size 4)  | 46.37   | 50.43 | 77.82     | 35.14 | 34.08      | 71.74      | 9.02  |
| 1 GPU (batch size 1)  | 46.42   | 50.43 | 77.83     | 35.17 | 34.08      | 71.74      | 9.25  |

The single-GPU runs have lower accuracies overall; ARC-c, HellaSwag, and GSM8K drop by 0.3–0.5 points.
I thought data parallelism only speeds up the evaluation. Where did the difference come from?

Below is the command line I used for ARC-c.
I use CUDA_VISIBLE_DEVICES to control the number of GPUs (see the sketch after the command).

accelerate launch --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --output_path $output_path \
        --batch_size $batch_size
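
Concretely, the single-GPU and 4-GPU rows in the table differ only in which devices are visible. A minimal sketch of the two launches, assuming accelerate's default behavior of one process per visible GPU (device indices are illustrative):

# 4-GPU data-parallel run: with the default accelerate config, one process is
# launched per visible GPU.
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --output_path $output_path \
        --batch_size $batch_size

# Single-GPU run: identical command, but only one device is visible, so only
# one process is launched.
CUDA_VISIBLE_DEVICES=0 accelerate launch --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --output_path $output_path \
        --batch_size $batch_size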

LSinev commented Apr 26, 2024

Thank you for your efforts! Great table with results to compare!

Where did the difference come from?

Please check other issues/discussions about speed, batches, and multi-GPU usage for ideas. For example (but not limited to):
#1625
#704 (comment)

HsuWanTing (Author) commented

Thanks @LSinev for the quick reply.
I've checked the issues you linked and also searched for others myself.
Most of those issues focus on differences between batch sizes, which are usually very small.
I understand that batching can change the evaluation order, so the loglikelihoods may be slightly different, and that small difference is acceptable to me.

However, I didn't find an issue about differences between different numbers of GPUs.
In my case, accuracy drops by 0.3–0.5 points when using a single GPU, which seems quite large. Is this also an expected result?


LSinev commented Apr 26, 2024

Is this also an expected result?

No idea. According to the results in your table, it also looks task-dependent.
You may want to investigate this further by diving into the code and adding logging. You could even try not-yet-merged PRs like #1731 to check the consistency of tokenization (maybe special tokens are added incorrectly somewhere), the propagation of seeds, the splitting of batches (and the restoring of their original order), and so on.

If batch size alone already makes a difference, I suppose multiple GPUs may introduce an even bigger one.
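
One way to localize the gap, as a rough sketch: rerun the same task on one and on four GPUs with per-sample logging enabled, then compare the logged outputs document by document. The --log_samples flag is part of lm-eval 0.4.x; the output directory names and the sample-file globs below are illustrative assumptions, not harness defaults.

# Same task and arguments; only the visible devices / process count differ.
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --log_samples \
        --output_path results/arc_1gpu \
        --batch_size 1

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --log_samples \
        --output_path results/arc_4gpu \
        --batch_size 1

# The per-sample files written under each output path record the per-document
# loglikelihoods; diffing them shows which documents change between the two
# runs. File names depend on the harness version, so adjust the globs as needed.
diff <(sort results/arc_1gpu/*arc_challenge*) <(sort results/arc_4gpu/*arc_challenge*)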
