
Any benchmark results? #1

Closed
soloice opened this issue May 16, 2019 · 12 comments

soloice commented May 16, 2019

What GPUs are you using? What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g.: GTX 1080 Ti, Titan X, V100) and the corresponding performance? Thanks.

haoyuhu (Owner) commented May 16, 2019

  1. What GPUs are you using?
    With a batch_size of 24 per GPU and TensorFlow 1.13.1, I successfully trained a classifier based on bert-large-uncased on 2 x Tesla P40 (about 95% of GPU memory used) for the QQP dataset, and the predictions look fine.

  2. What's the maximum batch_size for 11/12/16/32 GB GPUs (e.g. GTX 1080 Ti, Titan X, V100)?
    tf.distribute.MirroredStrategy is used for multi-GPU training in this project; it mirrors variables so they can be distributed across multiple devices and machines (see the sketch after this list). I think the maximum batch_size per GPU is almost the same as with the original bert, so the global batch_size depends on how many GPUs there are.

  3. Corresponding performance
    With the same hyperparameters, training speed scales almost linearly with the number of GPUs.
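
For anyone curious, here is a minimal TF 1.x-style sketch of how tf.distribute.MirroredStrategy can be wired into an Estimator via RunConfig(train_distribute=...). It only illustrates the pattern and is not code from this repo: the toy model_fn and input_fn stand in for the real BERT classifier and the QQP input pipeline, and the per-GPU batch size of 24 is just the value discussed above.

```python
import numpy as np
import tensorflow as tf


def model_fn(features, labels, mode, params):
    # Toy stand-in for the BERT classifier model_fn.
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer(1e-4).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


def input_fn():
    # Toy data; with MirroredStrategy each replica consumes batches of 24,
    # so the effective global batch is 24 * number_of_GPUs.
    x = np.random.rand(480, 8).astype(np.float32)
    y = np.zeros(480, dtype=np.int32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(24)


# Mirror variables across all visible GPUs and aggregate gradients every step.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn, steps=10)
```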

soloice (Author) commented May 16, 2019

Thanks. Does 24 refer to the global batch size, i.e. 12 samples of length 128 on each Tesla P40 card (24 GB memory)?

haoyuhu (Owner) commented May 16, 2019

24 is the batch size for each GPU. Global batch size is 24 * 2 = 48.

soloice (Author) commented May 16, 2019

Wow, this is awesome. Your batch size is twice as large as the one reported in the BERT README (after scaling by memory size). How did you achieve this? By using fp16 precision?

haoyuhu (Owner) commented May 16, 2019

No fp16, but I'm planning to support it.
Roughly, total GPU memory usage = model memory usage + batch_size × memory per sample, so batch_size does not scale linearly with GPU memory: the BERT-Large model itself takes a large fixed share. On the other hand, there may also be some optimizations between TensorFlow 1.11 and 1.13.1.
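
As a back-of-the-envelope illustration of why the feasible batch size grows faster than linearly with GPU memory (all numbers below are made-up placeholders, not measurements for BERT-Large or any particular GPU):

```python
# Hypothetical figures purely to illustrate the fixed-cost effect.
def max_batch_size(total_mem_gb, model_mem_gb=6.0, per_sample_mem_gb=0.7):
    # total = model + batch_size * per_sample  =>  solve for batch_size
    return int((total_mem_gb - model_mem_gb) / per_sample_mem_gb)

for mem in (11, 12, 16, 24, 32):
    print(mem, "GB ->", max_batch_size(mem))
# Because the model's memory is a fixed cost, doubling total GPU memory
# more than doubles the batch size that fits.
```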

soloice (Author) commented May 16, 2019

Thanks for the clarification. It's a nice project, and I'm playing with it.

soloice closed this as completed May 16, 2019
haoyuhu (Owner) commented May 16, 2019

Feel free to open a new issue if you encounter any problems while experimenting with it.

haoyuhu (Owner) commented May 17, 2019

fp16 is now available on the fp16 branch; feel free to try it. I will update the README and merge it into master later. @soloice

soloice (Author) commented May 17, 2019

Thanks for the quick development. Currently I don't have access to a Volta-architecture GPU, so I guess fp16 would be much slower than fp32 for me? For example, on the GTX 1080 Ti and Tesla P40 (both Pascal architecture), fp16 throughput is 1/64 of fp32 FLOPS.

soloice (Author) commented May 17, 2019

I'm facing some strange issues with BERT-Large on my 11 GB GTX 1080 Ti.

| Code | model size | maximum_sequence_length | max_batch_size_per_GPU | remark |
| --- | --- | --- | --- | --- |
| original bert | base | 128 | 36 | Pretty good. Even larger than the 32 reported for a 12 GB Titan X in the BERT repo |
| original bert | large | 64 | 2 | Very poor. A Titan X could hold 12 such samples! |
| this repo | large | 64 | 0 | Oops. This is a disaster. |

I guess the poor performance is due to the original BERT repo rather than this repo. Have you encountered any strange issues like this with BERT-Large?

haoyuhu (Owner) commented May 17, 2019

> Currently I don't have access to a Volta-architecture GPU, so I guess fp16 would be much slower than fp32 for me?

I only did a quick test of fp16, and the training speed did not drop significantly (though I may have made a mistake). I will do more detailed and comprehensive testing later.

> I guess the poor performance is due to the original BERT repo rather than this repo. Have you encountered any strange issues like this with BERT-Large?

It is weird. Did OOM occur when training the LARGE model with this repo? I don't recommend trying bert-large-uncased on GPUs with less than 16 GB of memory, because multi-GPU training brings very little benefit in that case. Training your classifier with bert-base-uncased may be a better option if you can tolerate about a 1% drop in eval_accuracy.
REF: google-research/bert#4 (comment)

soloice (Author) commented May 17, 2019

> Did OOM occur when training the LARGE model with this repo?

Yes, even a batch size of 1 leads to OOM. I'd better play with BERT-Base. Thanks for your quick reply.
