
Effective FasterTransformer: total time cost of batch size > 1 is more than batch size = 1 #399

Open
DogeWatch opened this issue Dec 19, 2022 · 8 comments

Comments

@DogeWatch

DogeWatch commented Dec 19, 2022

I use ParallelGptContextDecoderOp with remove_padding=True to get the hidden states of the last tokens, but I found that the total time cost with batch size = 18 is higher than with batch size = 1.
[screenshot of timing measurements]

This does not match the behavior shown in https://github.com/NVIDIA/FasterTransformer/blob/main/docs/bert_guide.md#bert-performance-on-t4-and-pytorch
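
For reference, a minimal timing harness along these lines can reproduce the comparison. This is only a sketch: it uses a stock PyTorch layer as a stand-in for the module wrapping ParallelGptContextDecoderOp (which isn't shown in the issue), and the hidden size, sequence length, and batch sizes are assumptions.

```python
# Minimal latency-comparison sketch (not the original benchmark code).
# The nn.TransformerEncoderLayer below is only a stand-in for the module that
# wraps ParallelGptContextDecoderOp; shapes and dtypes are assumptions.
import torch
import torch.nn as nn

def avg_forward_ms(model, x, n_warmup=5, n_iters=20):
    """Average forward latency in milliseconds, with proper CUDA synchronization."""
    with torch.no_grad():
        for _ in range(n_warmup):
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters

if __name__ == "__main__":
    hidden_size, seq_len = 1024, 2048
    model = nn.TransformerEncoderLayer(hidden_size, nhead=16, batch_first=True)
    model = model.half().cuda().eval()
    for bs in (1, 18):
        x = torch.randn(bs, seq_len, hidden_size, dtype=torch.float16, device="cuda")
        print(f"bs={bs}: {avg_forward_ms(model, x):.2f} ms per forward pass")
```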

@DogeWatch
Author

I also found that longer inputs lead to a smaller benefit from batching.

@byshiue
Collaborator

byshiue commented Dec 23, 2022

The cost of bs > 1 being larger than bs = 1 is expected behavior. Effective Transformer only removes the useless computation on padding tokens, so the performance improvement depends heavily on how much padding the inputs contain.
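
A small back-of-envelope sketch of this point (the numbers are made up for illustration): the saving from remove_padding is simply the padding fraction of the padded batch, so a batch of similar-length sequences gains almost nothing.

```python
# Illustration of why the gain from remove_padding depends on the inputs:
# Effective Transformer computes over real tokens only, so the achievable
# saving equals the padding fraction of the padded batch.
def padding_savings(seq_lens):
    max_len = max(seq_lens)
    padded_tokens = len(seq_lens) * max_len        # tokens a padded batch computes over
    effective_tokens = sum(seq_lens)               # tokens left after removing padding
    return 1.0 - effective_tokens / padded_tokens  # fraction of work removed

# Very uneven lengths -> lots of padding -> large saving.
print(padding_savings([100, 900, 50, 2048]))      # ~0.62
# Nearly equal lengths -> almost no padding -> almost no saving.
print(padding_savings([2000, 2048, 2030, 2040]))  # ~0.01
```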

@DogeWatch
Author

> The cost of bs > 1 being larger than bs = 1 is expected behavior. Effective Transformer only removes the useless computation on padding tokens, so the performance improvement depends heavily on how much padding the inputs contain.

I found that the impact of batching on performance depends on the input length. In my experiment with FasterTransformer, when the input length is 100, the performance of bs=10 is 30% higher than that of bs=1. When the input length is 955, the performance of bs=10 is 10% higher than that of bs=1. When the input length is 2048, the performance of bs=10 is 6% lower than that of bs=1. Is this also expected?

@byshiue
Collaborator

byshiue commented Dec 26, 2022

What do you mean by "performance"? In any case, the latency of bs = 1 should be lower than bs = 10 when the input lengths are the same, so I don't understand how you reached the conclusion that the performance of bs = 10 is 30% higher than that of bs = 1 at input length 100.
Do you mean the throughput?

@DogeWatch
Author

In my case, performance means the total time cost of running inference on 1000 items.

@byshiue
Collaborator

byshiue commented Dec 27, 2022

Do you mean there are 1000 sentences, so with bs = 1 you need to run 1000 times, while with bs = 10 you only need to run 100 times?

In that case, the comparison is about throughput. When the input length is very long, bs = 1 and bs = 10 may have similar throughput.
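
To make the latency/throughput distinction concrete, here is a tiny sketch with made-up per-batch latencies (the real numbers would come from the benchmark): batching wins on total time only while the larger batch's latency grows sub-linearly with batch size.

```python
# Total time to process 1000 items at a given batch size, from a per-batch latency.
# The latencies below are illustrative placeholders, not measured values.
def total_time_s(n_items, batch_size, latency_per_batch_s):
    n_runs = n_items // batch_size
    return n_runs * latency_per_batch_s

# Short inputs: bs=10 latency is only ~7x the bs=1 latency -> throughput win.
print(total_time_s(1000, 1, 0.010))   #  10.0 s
print(total_time_s(1000, 10, 0.070))  #   7.0 s (batching is faster overall)

# Very long inputs: bs=10 latency grows ~linearly (GPU already saturated) -> no win.
print(total_time_s(1000, 1, 0.200))   # 200.0 s
print(total_time_s(1000, 10, 2.100))  # 210.0 s (batching is slower overall)
```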

@DogeWatch
Author

> In that case, the comparison is about throughput. When the input length is very long, bs = 1 and bs = 10 may have similar throughput.

So what is the reason for this phenomenon?

@byshiue
Collaborator

byshiue commented Jan 3, 2023

Because the GPU is already fully utilized even at bs = 1 when the input is that long, increasing the batch size does not bring any additional benefit.
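
A rough way to see this (an illustration, not a profiler measurement): in the context/prefill phase the large GEMMs operate over roughly batch_size × seq_len token rows, so a single long sequence can already supply enough parallel work to keep the GPU busy.

```python
# Back-of-envelope: parallel work offered to the GEMMs per forward pass,
# measured in token rows (batch_size * seq_len). Purely illustrative.
def gemm_rows(batch_size, seq_len):
    return batch_size * seq_len

for seq_len in (100, 955, 2048):
    print(f"seq_len={seq_len}: bs=1 -> {gemm_rows(1, seq_len):5d} rows, "
          f"bs=10 -> {gemm_rows(10, seq_len):5d} rows")
# seq_len=100:  bs=1 offers only 100 rows -> GPU under-utilized, batching helps a lot
# seq_len=2048: bs=1 offers 2048 rows     -> GPU near saturation, batching adds little
```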
