
Refactor turbomind attention #1116

Merged: 57 commits into InternLM:turbomind-2.1 on Mar 6, 2024

Conversation

@lzhangzz (Collaborator) commented Feb 4, 2024

TODO

  • Fix attention/decoding on sm_75/sm_70
  • Fix dispatch heuristic for split-k decoding
  • BF16 dispatch

lzhangzz added the WIP label on Feb 4, 2024
@zhyncs (Contributor) commented Feb 28, 2024

Hi @lzhangzz, maybe you could merge the latest main branch to fix the Windows build error.

lzhangzz removed the WIP label on Mar 1, 2024
lvhan028 changed the base branch from main to turbomind-2.1 on March 4, 2024 at 06:05
@lvhan028 (Collaborator) commented Mar 4, 2024

  • LLM Evaluation with turbomind
    • llama2
    • internlm2
    • qwen
    • llama2-4bit
    • internlm2-4bit
    • qwen-4bit
  • CLI chat
  • pipeline api (see the sketch after this list)
  • long-context inference
  • deepseek linear rope scaling
  • v100
  • T4
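
For the pipeline api item above, a minimal smoke test along the following lines could be used; this is a sketch added for illustration, and the model path and prompt are placeholders rather than anything specified in this PR.

```python
# Hypothetical smoke test for the pipeline API; the model path and prompt
# are placeholders, not part of this PR's test plan.
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')  # any model from the list above
responses = pipe(['Hello, please introduce yourself.'])
print(responses[0].text)
```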

@lvhan028 (Collaborator) commented Mar 4, 2024

@irexyc may help verify the VL models.

@zhyncs (Contributor) commented Mar 5, 2024

Hi @lzhangzz, I used lmsys/vicuna-13b-v1.3 to compare performance against the latest main branch and found that the throughput improvement is not significant. The reproduction steps are below; please take a look. Thanks.

```
# convert
python3 -m lmdeploy convert llama /workdir/vicuna-13b-v1.3
# server
python3 -m lmdeploy serve api_server /workdir/workspace
# client
python3 benchmark/profile_restful_api.py --server_addr 127.0.0.1:23333 --tokenizer_path /workdir/vicuna-13b-v1.3 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 128 --num_prompts 5000
# ShareGPT_V3_unfiltered_cleaned_split.json
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# turbomind 3
concurrency: 128
elapsed_time: 641.477s

number of prompt tokens: 1239680
number of completion tokens: 1162390
token throughput (completion token): 1812.053 token/s
token throughput (prompt + completion token): 3744.594 token/s
RPS (request per second): **7.795 req/s**
RPM (request per minute): 467.671 req/min

# latest main
concurrency: 128
elapsed_time: 660.434s

number of prompt tokens: 1239680
number of completion tokens: 1162390
token throughput (completion token): 1760.039 token/s
token throughput (prompt + completion token): 3637.107 token/s
RPS (request per second): **7.571 req/s**
RPM (request per minute): 454.247 req/min
```

7.795 / 7.571 - 1 ≈ 3%
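
As a sanity check on these figures (a small script added here for illustration, not part of the benchmark output), the reported RPS values and the ~3% gap follow directly from the elapsed times and the 5000 prompts:

```python
# Recompute the reported metrics from the raw numbers above.
num_prompts = 5000
completion_tokens = 1_162_390

elapsed_pr = 641.477    # this PR, seconds
elapsed_main = 660.434  # latest main, seconds

rps_pr = num_prompts / elapsed_pr      # ~7.795 req/s
rps_main = num_prompts / elapsed_main  # ~7.571 req/s
tok_per_s_pr = completion_tokens / elapsed_pr  # ~1812 token/s

print(f"RPS (PR): {rps_pr:.3f}, RPS (main): {rps_main:.3f}")
print(f"relative improvement: {rps_pr / rps_main - 1:.1%}")  # ~3.0%
```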

@zhyncs (Contributor) commented Mar 5, 2024

In addition to improving readability, what are the main benefits of refactoring the attention code? In which usage scenarios will the performance improvements be more significant?

@lvhan028 (Collaborator) commented Mar 5, 2024

This PR doesn't target performance improvement. Instead, it focuses on cleaning up the messy attention code to build a firmer foundation for supporting more models.
This is the first step. Later on, @lzhangzz will work on decoupling continuous batching from the model implementation.

@zhyncs (Contributor) commented Mar 5, 2024

> This PR doesn't target performance improvement. Instead, it focuses on cleaning up the messy attention code to build a firmer foundation for supporting more models. This is the first step. Later on, @lzhangzz will work on decoupling continuous batching from the model implementation.

When I discussed this with @lzhangzz earlier, I learned that there would be performance improvements, but I did not ask specifically about the scenarios and models; that is why I expected a performance improvement here. If the current test results meet your expectations, then there should be no doubt.

@lzhangzz (Collaborator, Author) commented Mar 5, 2024

For 7B models with ShareGPT_V3, attention takes roughly 1/3 of the total GPU time (a 13B model should be similar). I'd guess a 3% RPS improvement is significant enough.

The current dispatching strategy maximizes bandwidth utilization when there is enough data. There are faster configs for the data distribution of ShareGPT_V3, but they have some other limitations.
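
To make the implied kernel-level gain concrete (a back-of-the-envelope estimate added here, not a claim made in this PR): if attention accounts for about 1/3 of GPU time and end-to-end throughput improves by ~3%, Amdahl's law implies the attention kernels themselves got roughly 10% faster.

```python
# Back-of-the-envelope estimate: solve for the attention speedup s such that
# the normalized total time (1/3)/s + 2/3 corresponds to a 1.03x overall speedup.
attention_frac = 1 / 3
overall_speedup = 1.03

rest = 1 - attention_frac
s = attention_frac / (1 / overall_speedup - rest)
print(f"implied attention speedup: {s:.2f}x")  # ~1.10x
```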

@zhyncs (Contributor) commented Mar 5, 2024

> For 7B models with ShareGPT_V3, attention takes roughly 1/3 of the total GPU time (a 13B model should be similar). I'd guess a 3% RPS improvement is significant enough.
>
> The current dispatching strategy maximizes bandwidth utilization when there is enough data. There are faster configs for the data distribution of ShareGPT_V3, but they have some other limitations.

Makes sense.

@zhyncs (Contributor) commented Mar 6, 2024

Hi @lzhangzz, please merge the latest changes from the main branch to address the documentation workflow issue.

lvhan028 merged commit 76596db into InternLM:turbomind-2.1 on Mar 6, 2024
5 of 9 checks passed
zhyncs pushed a commit to zhyncs/lmdeploy that referenced this pull request Mar 11, 2024
* Unify prefill and decode passes
* dynamic split-fuse
* refactor
* correct input count calculation
* remove unused
* lint
* lint
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* turbomind attention
* integration with turbomind
* fix sm70
* fix sm80
* tensor core decoding
* use `movmatrix`
* optimize
* arbitrary number of stages
* split-k determinism
* simplify
* better split-k
* minor
* smem tiling & async QK
* minor
* reduce binary size
* move flash attention
* bf16
* fix build
* fix lint
* fix lint
* fix lint
* multi-stream
* fix missing header
* fix msvc build
* fix msvc build
* fix lint
* fix ut
* remove `decoder_masked_multihead_attention`
* fix sm75
* tune sm70
* update comments
* add linear rope scaling
* param passing for linear rope scaling
* fix lint
* fix bf16 dispatch
* fix ntk scaling factor
* avoid building `test_attention` when test is disabled
* fix lint