Effective FasterTransformer: total time cost of batch size > 1 is more than batch size = 1 #399
Comments
I found that longer inputs lead to lower batching performance.
The cost of bs > 1 being larger than bs = 1 is expected behavior. Effective transformer only removes the useless computation on padding, so the performance improvement depends heavily on how much padding is in the inputs.
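As a rough illustration (the sequence lengths below are made up, not from any real workload), the upper bound on the savings is just the padding fraction of the batch:

```python
# Back-of-the-envelope estimate of the compute padding removal can skip.
seq_lens = [23, 57, 100, 12, 88, 45, 71, 30]  # real token count per sentence
max_len = max(seq_lens)                       # the batch is padded to this

padded_tokens = max_len * len(seq_lens)       # tokens computed with padding
real_tokens = sum(seq_lens)                   # tokens computed after removal

print(f"padding fraction (upper bound on savings): {1 - real_tokens / padded_tokens:.1%}")
# Sentences of similar length -> small fraction -> little benefit from removal.
```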
I found that the impact of batching on performance is related to the input length. In my experiment with FasterTransformer, when the input length is 100, the performance of bs = 10 is 30% higher than that of bs = 1. When the input length is 955, the performance of bs = 10 is 10% higher than that of bs = 1. When the input length is 2048, the performance of bs = 10 is 6% lower than that of bs = 1. Is this also expected?
What do you mean by "performance"?
In my case, performance means the time cost of inferencing 1000 items.
Do you mean there are 1000 sentences, so when bs = 1 you need to run 1000 times, while when bs = 10 you only need to run 100 times? In that case, the comparison is about throughput. When the input length is very long, bs = 1 and bs = 10 may have similar performance.
So what is the reason for this phenomenon?
Because the GPU is already fully utilized even at bs = 1, increasing the batch size does not bring any benefit.
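A minimal PyTorch sketch of this saturation effect (a plain `nn.TransformerEncoderLayer` as a stand-in for FasterTransformer; the model sizes and item counts are assumptions, only the trend matters): at short sequence lengths throughput grows with batch size, while at long sequence lengths a single sequence already fills the GPU, so bs = 10 gains little over bs = 1.

```python
# Throughput vs. batch size for one transformer layer (requires a GPU).
import torch

device = "cuda"
layer = (torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
         .to(device).half().eval())

@torch.inference_mode()
def items_per_second(batch_size, seq_len, n_items=100):
    x = torch.randn(batch_size, seq_len, 1024, device=device, dtype=torch.half)
    for _ in range(3):                  # warm-up
        layer(x)
    torch.cuda.synchronize()
    start, end = (torch.cuda.Event(enable_timing=True) for _ in range(2))
    runs = max(1, n_items // batch_size)
    start.record()
    for _ in range(runs):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return runs * batch_size / (start.elapsed_time(end) / 1000)  # ms -> s

for seq_len in (100, 2048):
    for bs in (1, 10):
        print(f"seq_len={seq_len:4d} bs={bs:2d}: {items_per_second(bs, seq_len):8.1f} items/s")
```

If the GPU is saturated at bs = 1, the items/s at seq_len = 2048 should be roughly flat across batch sizes, consistent with the numbers reported above.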
I use ParallelGptContextDecoderOp with remove_padding=True to get the hidden states of the last tokens, but I found that the total time cost at batch size = 18 is higher than at batch size = 1.
This behavior is not the same as what is shown in https://github.com/NVIDIA/FasterTransformer/blob/main/docs/bert_guide.md#bert-performance-on-t4-and-pytorch
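For context, here is a hypothetical PyTorch sketch of what padding removal does conceptually, including reading off each sequence's last-token hidden state. This is an illustration, not the FasterTransformer internals; all names and shapes are made up.

```python
# Pack the valid tokens of all sequences into one contiguous buffer so no
# compute is spent on pad positions, then index out each last token.
import torch

hidden = torch.randn(3, 5, 8)              # [batch, max_seq_len, hidden]
lengths = torch.tensor([5, 2, 4])          # real length of each sequence

# Boolean mask of valid (non-pad) positions, shape [batch, max_seq_len].
mask = torch.arange(hidden.size(1)).unsqueeze(0) < lengths.unsqueeze(1)

packed = hidden[mask]                      # [sum(lengths), hidden], no padding
print(packed.shape)                        # torch.Size([11, 8])

# Index of each sequence's last token inside the packed buffer.
offsets = torch.cumsum(lengths, dim=0) - 1
last_hidden = packed[offsets]
print(last_hidden.shape)                   # torch.Size([3, 8])
```

With lengths this uneven, packing skips a large fraction of positions; with near-uniform lengths it skips almost nothing, which matches the length-dependent results discussed above.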