
optimize pytorch engine inference with falcon model #1234

Merged
merged 3 commits into InternLM:main on Mar 4, 2024

Conversation

@grimoire grimoire (Collaborator) commented on Mar 4, 2024

Fix falcon TP (tensor parallelism).

| Metric | Before | After |
| --- | --- | --- |
| concurrency | 256 | 256 |
| elapsed_time (s) | 184.839 | 128.932 |
| first token latency (s) (min, max, ave) | 0.645, 13.974, 4.690 | 0.270, 10.948, 3.440 |
| per-token latency (s) percentile (50, 75, 95, 99) | 0.139, 0.147, 0.263, 0.414 | 0.095, 0.102, 0.232, 0.441 |
| number of prompt tokens | 242197 | 242197 |
| number of completion tokens | 220686 | 220686 |
| token throughput (completion tokens, token/s) | 1193.938 | 1711.640 |
| token throughput (prompt + completion tokens, token/s) | 2504.254 | 3590.119 |
| RPS (requests per second) | 5.410 | 7.756 |
| RPM (requests per minute) | 324.607 | 465.360 |


Only tested on the original falcon-7b.
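
For reference, the throughput and request-rate figures above follow directly from the raw counts. A quick sanity check, assuming each metric is simply the corresponding count divided by elapsed_time and inferring the request count from RPS × elapsed_time (these formulas are an assumption, not taken from the profiler source):

```python
# Sanity-check of the "after" numbers, assuming throughput = count / elapsed_time.
elapsed_time = 128.932          # seconds
prompt_tokens = 242197
completion_tokens = 220686

print(completion_tokens / elapsed_time)                    # ~1711.6 token/s
print((prompt_tokens + completion_tokens) / elapsed_time)  # ~3590.1 token/s

# RPS * elapsed_time ~= 1000, so the run appears to cover ~1000 requests.
print(7.756 * elapsed_time)                                # ~1000 requests
```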

```diff
@@ -100,7 +100,7 @@ def __rotary_emb_fn(query_states, key_states, value_states):
             scaling_factor=scaling_factor,
             out_q=query_states[None],
             out_k=key_states[None])
-        return query_states, key_states, value_states
+        return query_states[0], key_states[0], value_states
```
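
To make the shape concern discussed below concrete, here is a minimal, shape-only sketch; the fused op is a stand-in and the tensor shapes are assumed for illustration, not taken from lmdeploy:

```python
import torch

# Shape-only illustration of the change above, not lmdeploy's actual code:
# the fused rotary kernel is assumed to take and return batched (4D) tensors,
# so the 3D query/key states are passed in as `states[None]` views.
seq_len, num_heads, head_dim = 8, 4, 64
query_states = torch.randn(seq_len, num_heads, head_dim)  # 3D
key_states = torch.randn(seq_len, num_heads, head_dim)    # 3D


def fake_fused_rotary_emb(q, k, out_q, out_k):
    """Stand-in for the fused op: the real kernel writes rotated values into
    the provided output views; here only the returned shapes matter."""
    return out_q, out_k


# The op hands back the 4D `[None]` views, so reassigning query/key states
# silently promotes them from 3D to 4D...
query_states, key_states = fake_fused_rotary_emb(
    query_states[None], key_states[None],
    out_q=query_states[None], out_k=key_states[None])
print(query_states.shape)     # torch.Size([1, 8, 4, 64]) -> 4D

# ...and indexing with `[0]` (the change in this PR) drops the added batch
# dim again, so downstream code keeps seeing the original 3D layout.
print(query_states[0].shape)  # torch.Size([8, 4, 64]) -> 3D
```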
Collaborator:
Before this change, what problem did this code cause?

Collaborator (Author):
It didn't cause any problem, but q_state and k_state would change from 3D to 4D, which I think is a potential risk.

@lvhan028 lvhan028 changed the title from "Torch optimize falcon" to "optimize pytorch engine inference with falcon model" on Mar 4, 2024
@RunningLeon RunningLeon (Collaborator) left a comment
LGTM

@lvhan028 lvhan028 merged commit a6e8188 into InternLM:main on Mar 4, 2024
5 checks passed