
Effective FasterTransformer: total time cost of batch size > 1 is more than batch size = 1 #399

Open
DogeWatch opened this issue Dec 19, 2022 · 8 comments

Comments

@DogeWatch

DogeWatch commented Dec 19, 2022

I use ParallelGptContextDecoderOp with remove_padding=True to get the hidden states of the last tokens, but I found that the total time cost with batch size = 18 is higher than with batch size = 1.
[screenshot of timing measurements]

This does not match the behavior shown in https://github.com/NVIDIA/FasterTransformer/blob/main/docs/bert_guide.md#bert-performance-on-t4-and-pytorch
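
For reference, a minimal timing harness along these lines can reproduce the comparison. This is only a sketch: it uses a stock PyTorch layer as a stand-in for the module wrapping ParallelGptContextDecoderOp (which isn't shown in the issue), and the hidden size, sequence length, and batch sizes are assumptions.

```python
# Minimal latency-comparison sketch (not the original benchmark code).
# The nn.TransformerEncoderLayer below is only a stand-in for the module that
# wraps ParallelGptContextDecoderOp; shapes and dtypes are assumptions.
import torch
import torch.nn as nn

def avg_forward_ms(model, x, n_warmup=5, n_iters=20):
    """Average forward latency in milliseconds, with proper CUDA synchronization."""
    with torch.no_grad():
        for _ in range(n_warmup):
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters

if __name__ == "__main__":
    hidden_size, seq_len = 1024, 2048
    model = nn.TransformerEncoderLayer(hidden_size, nhead=16, batch_first=True)
    model = model.half().cuda().eval()
    for bs in (1, 18):
        x = torch.randn(bs, seq_len, hidden_size, dtype=torch.float16, device="cuda")
        print(f"bs={bs}: {avg_forward_ms(model, x):.2f} ms per forward pass")
```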

@DogeWatch
Author

I also found that longer inputs lead to a smaller benefit from batching.

@byshiue
Collaborator

byshiue commented Dec 23, 2022

The cost of bs > 1 being larger than bs = 1 is expected behavior. Effective Transformer only removes the useless computation on padding tokens, so the performance improvement depends heavily on how much padding the inputs contain.
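
A small back-of-envelope sketch of this point (the numbers are made up for illustration): the saving from remove_padding is simply the padding fraction of the padded batch, so a batch of similar-length sequences gains almost nothing.

```python
# Illustration of why the gain from remove_padding depends on the inputs:
# Effective Transformer computes over real tokens only, so the achievable
# saving equals the padding fraction of the padded batch.
def padding_savings(seq_lens):
    max_len = max(seq_lens)
    padded_tokens = len(seq_lens) * max_len        # tokens a padded batch computes over
    effective_tokens = sum(seq_lens)               # tokens left after removing padding
    return 1.0 - effective_tokens / padded_tokens  # fraction of work removed

# Very uneven lengths -> lots of padding -> large saving.
print(padding_savings([100, 900, 50, 2048]))      # ~0.62
# Nearly equal lengths -> almost no padding -> almost no saving.
print(padding_savings([2000, 2048, 2030, 2040]))  # ~0.01
```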

@DogeWatch
Author

> The cost of bs > 1 being larger than bs = 1 is expected behavior. Effective Transformer only removes the useless computation on padding tokens, so the performance improvement depends heavily on how much padding the inputs contain.

I found that the impact of batching on performance depends on the input length. In my experiment with FasterTransformer, when the input length is 100, the performance of bs=10 is 30% higher than that of bs=1. When the input length is 955, the performance of bs=10 is 10% higher than that of bs=1. When the input length is 2048, the performance of bs=10 is 6% lower than that of bs=1. Is this also expected?

@byshiue
Collaborator

byshiue commented Dec 26, 2022

What do you mean by "performance"? In any case, the latency of bs = 1 should be lower than bs = 10 when the input lengths are the same, so I don't understand how you reached the conclusion that the performance of bs = 10 is 30% higher than that of bs = 1 at input length 100.
Do you mean the throughput?

@DogeWatch
Author

In my case, performance means the total time cost of running inference on 1000 items.

@byshiue
Collaborator

byshiue commented Dec 27, 2022

Do you mean there are 1000 sentences, so with bs = 1 you need to run 1000 times, while with bs = 10 you only need to run 100 times?

In that case, the comparison is about throughput. When the input length is very long, bs = 1 and bs = 10 may have similar throughput.
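
To make the latency/throughput distinction concrete, here is a tiny sketch with made-up per-batch latencies (the real numbers would come from the benchmark): batching wins on total time only while the larger batch's latency grows sub-linearly with batch size.

```python
# Total time to process 1000 items at a given batch size, from a per-batch latency.
# The latencies below are illustrative placeholders, not measured values.
def total_time_s(n_items, batch_size, latency_per_batch_s):
    n_runs = n_items // batch_size
    return n_runs * latency_per_batch_s

# Short inputs: bs=10 latency is only ~7x the bs=1 latency -> throughput win.
print(total_time_s(1000, 1, 0.010))   #  10.0 s
print(total_time_s(1000, 10, 0.070))  #   7.0 s (batching is faster overall)

# Very long inputs: bs=10 latency grows ~linearly (GPU already saturated) -> no win.
print(total_time_s(1000, 1, 0.200))   # 200.0 s
print(total_time_s(1000, 10, 2.100))  # 210.0 s (batching is slower overall)
```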

@DogeWatch
Author

> In that case, the comparison is about throughput. When the input length is very long, bs = 1 and bs = 10 may have similar throughput.

So what is the reason for this phenomenon?

@byshiue
Collaborator

byshiue commented Jan 3, 2023

Because the GPU is already fully utilized even at bs = 1 when the input is that long, increasing the batch size does not bring any additional benefit.
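
A rough way to see this (an illustration, not a profiler measurement): in the context/prefill phase the large GEMMs operate over roughly batch_size × seq_len token rows, so a single long sequence can already supply enough parallel work to keep the GPU busy.

```python
# Back-of-envelope: parallel work offered to the GEMMs per forward pass,
# measured in token rows (batch_size * seq_len). Purely illustrative.
def gemm_rows(batch_size, seq_len):
    return batch_size * seq_len

for seq_len in (100, 955, 2048):
    print(f"seq_len={seq_len}: bs=1 -> {gemm_rows(1, seq_len):5d} rows, "
          f"bs=10 -> {gemm_rows(10, seq_len):5d} rows")
# seq_len=100:  bs=1 offers only 100 rows -> GPU under-utilized, batching helps a lot
# seq_len=2048: bs=1 offers 2048 rows     -> GPU near saturation, batching adds little
```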
