Accuracy gap between single GPU and multiple GPUs #1751

Open
HsuWanTing opened this issue Apr 26, 2024 · 3 comments

HsuWanTing commented Apr 26, 2024

I'm using lm-eval v0.4.2 to evaluate Llama 7B on the Open LLM Leaderboard benchmarks.
I found accuracy gaps between a single GPU and multiple GPUs, as shown below (I used data parallelism).

| Setup                 | Average | ARC-c | HellaSwag | MMLU  | TruthfulQA | WinoGrande | GSM8K |
|-----------------------|---------|-------|-----------|-------|------------|------------|-------|
| 4 GPUs (batch size 4) | 46.58   | 50.85 | 78.13     | 35.14 | 34.08      | 71.82      | 9.48  |
| 4 GPUs (batch size 1) | 46.61   | 50.85 | 78.12     | 35.17 | 34.08      | 71.9       | 9.55  |
| 1 GPU (batch size 4)  | 46.37   | 50.43 | 77.82     | 35.14 | 34.08      | 71.74      | 9.02  |
| 1 GPU (batch size 1)  | 46.42   | 50.43 | 77.83     | 35.17 | 34.08      | 71.74      | 9.25  |

The single-GPU runs have lower accuracies overall; ARC-c, HellaSwag, and GSM8K drop by 0.3–0.5 points.
I thought data parallelism only speeds up the evaluation. Where did the difference come from?

Below is the command line I used for ARC-c.
I use CUDA_VISIBLE_DEVICES to control the number of GPUs (see the sketch after the command).

accelerate launch --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --output_path $output_path \
        --batch_size $batch_size
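
Concretely, the single-GPU and 4-GPU rows in the table differ only in which devices are visible. A minimal sketch of the two launches, assuming accelerate's default behavior of one process per visible GPU (device indices are illustrative):

# 4-GPU data-parallel run: with the default accelerate config, one process is
# launched per visible GPU.
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --output_path $output_path \
        --batch_size $batch_size

# Single-GPU run: identical command, but only one device is visible, so only
# one process is launched.
CUDA_VISIBLE_DEVICES=0 accelerate launch --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --output_path $output_path \
        --batch_size $batch_size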

LSinev commented Apr 26, 2024

Thank you for your efforts! Great table with results to compare!

Where did the difference come from?

Please check other issues/discussions about speed, batches, and multi-GPU usage for ideas. For example (but not limited to):
#1625
#704 (comment)

HsuWanTing (Author) commented

Thanks @LSinev for the quick reply.
I've checked the issues you linked and also searched for others myself.
Most of those issues focus on differences between batch sizes, which are usually very small.
I understand that batching can change the evaluation order, so the loglikelihoods may be slightly different, and that small difference is acceptable to me.

However, I didn't find an issue about differences between different numbers of GPUs.
In my case, accuracy drops by 0.3–0.5 points when using a single GPU, which seems quite large. Is this also an expected result?


LSinev commented Apr 26, 2024

Is this also an expected result?

No idea. According to the results in your table, it also looks task-dependent.
You may want to investigate this further by diving into the code and adding logging. You could even try not-yet-merged PRs like #1731 to check the consistency of tokenization (maybe special tokens are added incorrectly somewhere), the propagation of seeds, the splitting of batches (and the restoring of their original order), and so on.

If batch size alone already makes a difference, I suppose multiple GPUs may introduce an even bigger one.
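
One way to localize the gap, as a rough sketch: rerun the same task on one and on four GPUs with per-sample logging enabled, then compare the logged outputs document by document. The --log_samples flag is part of lm-eval 0.4.x; the output directory names and the sample-file globs below are illustrative assumptions, not harness defaults.

# Same task and arguments; only the visible devices / process count differ.
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --log_samples \
        --output_path results/arc_1gpu \
        --batch_size 1

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 --main_process_port $PORT -m lm_eval \
        --model hf \
        --model_args pretrained=huggyllama/llama-7b \
        --tasks arc_challenge \
        --num_fewshot 25 \
        --log_samples \
        --output_path results/arc_4gpu \
        --batch_size 1

# The per-sample files written under each output path record the per-document
# loglikelihoods; diffing them shows which documents change between the two
# runs. File names depend on the harness version, so adjust the globs as needed.
diff <(sort results/arc_1gpu/*arc_challenge*) <(sort results/arc_4gpu/*arc_challenge*)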
