
Any benchmark results? #1

Closed
soloice opened this issue May 16, 2019 · 12 comments

soloice commented May 16, 2019

What GPUs are you using? What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g.: GTX 1080 Ti, Titan X, V100) and the corresponding performance? Thanks.

haoyuhu (Owner) commented May 16, 2019

  1. What GPUs are you using?
    With a batch_size of 24 per GPU and TensorFlow 1.13.1, I successfully trained a classifier based on bert-large-uncased on 2 x Tesla P40 (about 95% of GPU memory used) for the QQP dataset, and the predictions look fine.

  2. What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g. GTX 1080 Ti, Titan X, V100)?
    tf.distribute.MirroredStrategy is used for multi-GPU training in this project; it mirrors variables so they can be distributed across multiple devices and machines (see the sketch after this list). I think the maximum batch_size per GPU is almost the same as with the original bert, so the global batch_size depends on how many GPUs there are.

  3. Corresponding performance
    With the same hyperparameters, training speed scales almost linearly with the number of GPUs.
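
For anyone curious, here is a minimal TF 1.x-style sketch of how tf.distribute.MirroredStrategy can be wired into an Estimator via RunConfig(train_distribute=...). It only illustrates the pattern and is not code from this repo: the toy model_fn and input_fn stand in for the real BERT classifier and the QQP input pipeline, and the per-GPU batch size of 24 is just the value discussed above.

```python
import numpy as np
import tensorflow as tf


def model_fn(features, labels, mode, params):
    # Toy stand-in for the BERT classifier model_fn.
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer(1e-4).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


def input_fn():
    # Toy data; with MirroredStrategy each replica consumes batches of 24,
    # so the effective global batch is 24 * number_of_GPUs.
    x = np.random.rand(480, 8).astype(np.float32)
    y = np.zeros(480, dtype=np.int32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(24)


# Mirror variables across all visible GPUs and aggregate gradients every step.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn, steps=10)
```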

soloice (Author) commented May 16, 2019

Thanks. Does 24 refer to the global batch size, i.e. 12 samples of length 128 on each Tesla P40 card (24 GB memory)?

haoyuhu (Owner) commented May 16, 2019

24 is the batch size for each GPU. Global batch size is 24 * 2 = 48.

soloice (Author) commented May 16, 2019

Wow, this is awesome. Your batch size is twice as large as the one reported in the BERT README (after scaling by memory size). How did you achieve this? By using fp16 precision?

haoyuhu (Owner) commented May 16, 2019

No fp16, but I'm planning to support it.
Roughly, total GPU memory usage = model memory usage + batch_size × memory per sample, so batch_size does not scale linearly with GPU memory: the BERT-Large model itself takes a large fixed share. On the other hand, there may also be some optimizations between TensorFlow 1.11 and 1.13.1.
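
As a back-of-the-envelope illustration of why the feasible batch size grows faster than linearly with GPU memory (all numbers below are made-up placeholders, not measurements for BERT-Large or any particular GPU):

```python
# Hypothetical figures purely to illustrate the fixed-cost effect.
def max_batch_size(total_mem_gb, model_mem_gb=6.0, per_sample_mem_gb=0.7):
    # total = model + batch_size * per_sample  =>  solve for batch_size
    return int((total_mem_gb - model_mem_gb) / per_sample_mem_gb)

for mem in (11, 12, 16, 24, 32):
    print(mem, "GB ->", max_batch_size(mem))
# Because the model's memory is a fixed cost, doubling total GPU memory
# more than doubles the batch size that fits.
```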

soloice (Author) commented May 16, 2019

Thanks for the clarification. It's a nice project, and I'm playing with it.

soloice closed this as completed May 16, 2019
haoyuhu (Owner) commented May 16, 2019

Feel free to open a new issue if you encounter any problems while experimenting with it.

haoyuhu (Owner) commented May 17, 2019

fp16 is now available on the fp16 branch; feel free to try it. I will update the README and merge it into master later. @soloice

soloice (Author) commented May 17, 2019

Thanks for the quick development. Currently I don't have access to a Volta-architecture GPU, so I guess fp16 would be much slower than fp32 for me? For example, on the GTX 1080 Ti and Tesla P40 (both Pascal architecture), fp16 throughput is 1/64 of fp32 FLOPS.

soloice (Author) commented May 17, 2019

I'm facing some strange issues with BERT-Large on my 11 GB GTX 1080 Ti.

| Code | model size | maximum_sequence_length | max_batch_size_per_GPU | remark |
| --- | --- | --- | --- | --- |
| original bert | base | 128 | 36 | Pretty good. Even larger than the 32 reported for a 12 GB Titan X in the BERT repo |
| original bert | large | 64 | 2 | Very poor. A Titan X could hold 12 such samples! |
| this repo | large | 64 | 0 | Oops. This is a disaster. |

I guess the poor performance is due to the original BERT repo rather than this repo. Have you encountered any strange issues like this with BERT-Large?

haoyuhu (Owner) commented May 17, 2019

> Currently I don't have access to a Volta-architecture GPU, so I guess fp16 would be much slower than fp32 for me?

I only did a quick test of fp16, and the training speed did not drop significantly (though I may have made a mistake). I will do more detailed and comprehensive testing later.

> I guess the poor performance is due to the original BERT repo rather than this repo. Have you encountered any strange issues like this with BERT-Large?

It is weird. Did OOM occur when training the LARGE model with this repo? I don't recommend trying bert-large-uncased on GPUs with less than 16 GB of memory, because multi-GPU training brings very little benefit in that case. Training your classifier with bert-base-uncased may be a better option if you can tolerate about a 1% drop in eval_accuracy.
REF: google-research/bert#4 (comment)

soloice (Author) commented May 17, 2019

> Did OOM occur when training the LARGE model with this repo?

Yes, even a batch size of 1 leads to OOM. I'd better play with BERT-Base. Thanks for your quick reply.
