
Refactor turbomind attention #1116

Merged: 57 commits into InternLM:turbomind-2.1 on Mar 6, 2024

Conversation

@lzhangzz (Collaborator) commented Feb 4, 2024

TODO

  • Fix attention/decoding on sm_75/sm_70
  • Fix dispatch heuristic for split-k decoding
  • BF16 dispatch

lzhangzz added the WIP label on Feb 4, 2024
@zhyncs (Contributor) commented Feb 28, 2024

Hi @lzhangzz, maybe you could merge the latest main branch to fix the Windows build error.

lzhangzz removed the WIP label on Mar 1, 2024
lvhan028 changed the base branch from main to turbomind-2.1 on March 4, 2024 at 06:05
@lvhan028 (Collaborator) commented Mar 4, 2024

  • LLM Evaluation with turbomind
    • llama2
    • internlm2
    • qwen
    • llama2-4bit
    • internlm2-4bit
    • qwen-4bit
  • CLI chat
  • pipeline api (see the sketch after this list)
  • long-context inference
  • deepseek linear rope scaling
  • v100
  • T4
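
For the pipeline api item above, a minimal smoke test along the following lines could be used; this is a sketch added for illustration, and the model path and prompt are placeholders rather than anything specified in this PR.

```python
# Hypothetical smoke test for the pipeline API; the model path and prompt
# are placeholders, not part of this PR's test plan.
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')  # any model from the list above
responses = pipe(['Hello, please introduce yourself.'])
print(responses[0].text)
```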

@lvhan028 (Collaborator) commented Mar 4, 2024

@irexyc may help verify the VL models.

@zhyncs (Contributor) commented Mar 5, 2024

Hi @lzhangzz, I used lmsys/vicuna-13b-v1.3 to compare performance against the latest main branch and found that the throughput improvement is not significant. The reproduction steps are below; please take a look. Thanks.

```
# convert
python3 -m lmdeploy convert llama /workdir/vicuna-13b-v1.3
# server
python3 -m lmdeploy serve api_server /workdir/workspace
# client
python3 benchmark/profile_restful_api.py --server_addr 127.0.0.1:23333 --tokenizer_path /workdir/vicuna-13b-v1.3 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 128 --num_prompts 5000
# ShareGPT_V3_unfiltered_cleaned_split.json
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# turbomind 3
concurrency: 128
elapsed_time: 641.477s

number of prompt tokens: 1239680
number of completion tokens: 1162390
token throughput (completion token): 1812.053 token/s
token throughput (prompt + completion token): 3744.594 token/s
RPS (request per second): **7.795 req/s**
RPM (request per minute): 467.671 req/min

# latest main
concurrency: 128
elapsed_time: 660.434s

number of prompt tokens: 1239680
number of completion tokens: 1162390
token throughput (completion token): 1760.039 token/s
token throughput (prompt + completion token): 3637.107 token/s
RPS (request per second): **7.571 req/s**
RPM (request per minute): 454.247 req/min
```

7.795 / 7.571 - 1 ≈ 3%
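
As a sanity check on these figures (a small script added here for illustration, not part of the benchmark output), the reported RPS values and the ~3% gap follow directly from the elapsed times and the 5000 prompts:

```python
# Recompute the reported metrics from the raw numbers above.
num_prompts = 5000
completion_tokens = 1_162_390

elapsed_pr = 641.477    # this PR, seconds
elapsed_main = 660.434  # latest main, seconds

rps_pr = num_prompts / elapsed_pr      # ~7.795 req/s
rps_main = num_prompts / elapsed_main  # ~7.571 req/s
tok_per_s_pr = completion_tokens / elapsed_pr  # ~1812 token/s

print(f"RPS (PR): {rps_pr:.3f}, RPS (main): {rps_main:.3f}")
print(f"relative improvement: {rps_pr / rps_main - 1:.1%}")  # ~3.0%
```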

@zhyncs (Contributor) commented Mar 5, 2024

In addition to improving readability, what are the main benefits of refactoring the attention code? In which usage scenarios will the performance improvements be more significant?

@lvhan028 (Collaborator) commented Mar 5, 2024

This PR doesn't target performance improvement. Instead, it focuses on cleaning up the messy attention code to build a firmer foundation for supporting more models.
This is the first step. Later on, @lzhangzz will work on decoupling continuous batching from the model implementation.

@zhyncs (Contributor) commented Mar 5, 2024

> This PR doesn't target performance improvement. Instead, it focuses on cleaning up the messy attention code to build a firmer foundation for supporting more models. This is the first step. Later on, @lzhangzz will work on decoupling continuous batching from the model implementation.

When I discussed this with @lzhangzz earlier, I learned that there would be performance improvements, but I did not ask specifically about the scenarios and models; that is why I expected a performance improvement here. If the current test results meet your expectations, then there should be no doubt.

@lzhangzz (Collaborator, Author) commented Mar 5, 2024

For 7B models with ShareGPT_V3, attention takes roughly 1/3 of the total GPU time (a 13B model should be similar). I'd guess a 3% RPS improvement is significant enough.

The current dispatching strategy maximizes bandwidth utilization when there is enough data. There are faster configs for the data distribution of ShareGPT_V3, but they have some other limitations.
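
To make the implied kernel-level gain concrete (a back-of-the-envelope estimate added here, not a claim made in this PR): if attention accounts for about 1/3 of GPU time and end-to-end throughput improves by ~3%, Amdahl's law implies the attention kernels themselves got roughly 10% faster.

```python
# Back-of-the-envelope estimate: solve for the attention speedup s such that
# the normalized total time (1/3)/s + 2/3 corresponds to a 1.03x overall speedup.
attention_frac = 1 / 3
overall_speedup = 1.03

rest = 1 - attention_frac
s = attention_frac / (1 / overall_speedup - rest)
print(f"implied attention speedup: {s:.2f}x")  # ~1.10x
```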

@zhyncs (Contributor) commented Mar 5, 2024

> For 7B models with ShareGPT_V3, attention takes roughly 1/3 of the total GPU time (a 13B model should be similar). I'd guess a 3% RPS improvement is significant enough.
>
> The current dispatching strategy maximizes bandwidth utilization when there is enough data. There are faster configs for the data distribution of ShareGPT_V3, but they have some other limitations.

Makes sense.

@zhyncs (Contributor) commented Mar 6, 2024

Hi @lzhangzz, please merge the latest changes from the main branch to address the documentation workflow issue.

lvhan028 merged commit 76596db into InternLM:turbomind-2.1 on Mar 6, 2024
5 of 9 checks passed
zhyncs pushed a commit to zhyncs/lmdeploy that referenced this pull request Mar 11, 2024
* Unify prefill and decode passes
* dynamic split-fuse
* refactor
* correct input count calculation
* remove unused
* lint
* lint
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* fix msvc build
* turbomind attention
* integration with turbomind
* fix sm70
* fix sm80
* tensor core decoding
* use `movmatrix`
* optimize
* arbitrary number of stages
* split-k determinism
* simplify
* better split-k
* minor
* smem tiling & async QK
* minor
* reduce binary size
* move flash attention
* bf16
* fix build
* fix lint
* fix lint
* fix lint
* multi-stream
* fix missing header
* fix msvc build
* fix msvc build
* fix lint
* fix ut
* remove `decoder_masked_multihead_attention`
* fix sm75
* tune sm70
* update comments
* add linear rope scaling
* param passing for linear rope scaling
* fix lint
* fix bf16 dispatch
* fix ntk scaling factor
* avoid building `test_attention` when test is disabled
* fix lint