@FanhaiLu1
Collaborator

Context:

For disaggregated serving, I would like to know the maximum performance of prefill on its own. Example questions:

  1. How many prefills can we do per second?
  2. What is the bottleneck?
  3. What percentage of the theoretical performance do we actually achieve?
  4. If prefill is compute bound, how can we improve it?
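For question 3, one rough way to compare measured and theoretical performance is to estimate the FLOPs of a prefill and divide the achieved rate by the hardware peak. The sketch below is not part of this PR. It assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass, which ignores the attention score computation; `num_params` and `peak_flops_per_second` are inputs the reader must supply for their own model and accelerator.

```python
def prefill_flops(num_params, token_len):
    """Approximate forward-pass FLOPs for one prefill.

    Assumption: ~2 FLOPs per parameter per token. This ignores the
    attention score computation, which grows with token_len squared,
    so it underestimates the work for long prompts.
    """
    return 2.0 * num_params * token_len


def utilization(num_params, token_len, seconds, peak_flops_per_second):
    """Fraction of the hardware peak achieved by one measured prefill."""
    achieved = prefill_flops(num_params, token_len) / seconds
    return achieved / peak_flops_per_second


if __name__ == "__main__":
    # All values below are placeholders for illustration only:
    # a nominal 7e9 parameters, a made-up latency, and a made-up peak.
    ratio = utilization(
        num_params=7e9,
        token_len=1024,
        seconds=0.04,
        peak_flops_per_second=1e15,
    )
    print(f"estimated utilization: {ratio:.1%}")
```

A ratio well below 1 for short prompts would suggest that fixed overhead or memory bandwidth, rather than compute, limits prefill at those lengths; confirming that needs profiling beyond this estimate.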

Benefits:

This PR collects prefill latency metrics for different token lengths (from 16 to 32768). It helps me explore the questions above, and it should also help other engineers who want to do a similar analysis in the future.
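As an illustration of the kind of measurement loop described here, the following is a minimal sketch and not the code in this PR. `fake_prefill` is a hypothetical stand-in for the engine's prefill call, and the warm-up and repeat counts are arbitrary choices.

```python
import time

# Token lengths from 16 to 32768, doubling each step.
TOKEN_LENGTHS = [2 ** k for k in range(4, 16)]


def fake_prefill(token_len):
    """Hypothetical stand-in for a real prefill call; returns a dummy token id."""
    return sum(range(token_len)) % 32000


def time_prefill(prefill_fn, token_len, warmup=1, repeats=3):
    """Return (first_token, best wall-clock seconds) for one token length."""
    for _ in range(warmup):
        prefill_fn(token_len)  # warm-up run, e.g. to exclude compilation time
    best = float("inf")
    first_token = None
    for _ in range(repeats):
        start = time.perf_counter()
        first_token = prefill_fn(token_len)
        best = min(best, time.perf_counter() - start)
    return first_token, best


def run_benchmark(prefill_fn, token_lengths=TOKEN_LENGTHS):
    """Time prefill for each length and print one log entry per length."""
    results = {}
    for token_len in token_lengths:
        first_token, seconds = time_prefill(prefill_fn, token_len)
        print(f"---- execute First Token: {first_token}")
        print(f"---- execute time: {seconds} for token_len: {token_len}")
        results[token_len] = seconds
    return results


if __name__ == "__main__":
    run_benchmark(fake_prefill)
```

With an asynchronous runtime such as JAX, the timed call would also need to wait for the device result (for example with `jax.block_until_ready`) before the timer stops; otherwise only dispatch time is measured.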

Results for llama2 7B:

---- execute First Token: 737
---- execute time: 0.0057392120361328125 for token_len: 16

---- execute First Token: 13
---- execute time: 0.005574226379394531 for token_len: 32

---- execute First Token: 3634
---- execute time: 0.006200551986694336 for token_len: 64

---- execute First Token: 29874
---- execute time: 0.007735252380371094 for token_len: 128

---- execute First Token: 29871
---- execute time: 0.011573553085327148 for token_len: 256

---- execute First Token: 29964
---- execute time: 0.020201683044433594 for token_len: 512

---- execute First Token: 414
---- execute time: 0.038352012634277344 for token_len: 1024

---- execute First Token: 1319
---- execute time: 0.07815766334533691 for token_len: 2048

---- execute First Token: 1068
---- execute time: 0.16911959648132324 for token_len: 4096

---- execute First Token: 313
---- execute time: 0.4015989303588867 for token_len: 8192

---- execute First Token: 404
---- execute time: 0.9996061325073242 for token_len: 16384

---- execute First Token: 5519
---- execute time: 2.7764148712158203 for token_len: 32768
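To relate these latencies to question 1, the numbers above can be turned into prefills per second and tokens per second. The sketch below is an illustrative post-processing step, not part of this PR; it assumes one prefill runs at a time, so prefills per second is simply the inverse of the latency. The function names are hypothetical.

```python
# Measured prefill latencies in seconds, copied from the results above.
LATENCIES = {
    16: 0.0057392120361328125,
    32: 0.005574226379394531,
    64: 0.006200551986694336,
    128: 0.007735252380371094,
    256: 0.011573553085327148,
    512: 0.020201683044433594,
    1024: 0.038352012634277344,
    2048: 0.07815766334533691,
    4096: 0.16911959648132324,
    8192: 0.4015989303588867,
    16384: 0.9996061325073242,
    32768: 2.7764148712158203,
}


def prefills_per_second(seconds):
    """Assumes prefills run back to back, one at a time."""
    return 1.0 / seconds


def tokens_per_second(token_len, seconds):
    return token_len / seconds


def summarize(latencies):
    """Return rows of (token_len, prefills/s, tokens/s, growth vs previous length)."""
    rows = []
    prev = None
    for token_len in sorted(latencies):
        seconds = latencies[token_len]
        growth = seconds / prev if prev is not None else None
        rows.append((
            token_len,
            prefills_per_second(seconds),
            tokens_per_second(token_len, seconds),
            growth,
        ))
        prev = seconds
    return rows


if __name__ == "__main__":
    for token_len, pps, tps, growth in summarize(LATENCIES):
        growth_text = "n/a" if growth is None else f"{growth:.2f}x"
        print(f"{token_len:>6} tokens: {pps:8.2f} prefill/s, {tps:10.0f} tok/s, growth {growth_text}")
```

The growth column shows how latency changes each time the token length doubles. For the shortest prompts the latency is nearly flat, while from 16384 to 32768 it grows by about 2.78x, which is more than linear. That pattern is consistent with fixed overhead dominating short prompts and attention cost growing faster than linearly for long ones, though these logs alone do not establish the bottleneck.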

@FanhaiLu1 FanhaiLu1 requested review from qihqi and wang2yn84 May 3, 2024 03:51
@qihqi qihqi merged commit 9606a1f into AI-Hypercomputer:main May 3, 2024