
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model #4799

Merged: 101 commits into vllm-project:main, May 25, 2024

Conversation

@linxihui (Contributor) commented May 14, 2024

  • Support for the Microsoft Phi-3-Small-8K and Phi-3-Small-128K models, which use blocksparse flash attention
  • A Triton prefill kernel for block-sparse attention
  • A modified paged attention CUDA kernel that supports block-sparse attention, allowing a hybrid sparsity pattern per attention head
  • A torch SDPA fallback in the prefill phase for V100 or older GPUs, as well as for CPU (see the sketch below)
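
To make the sparsity pattern concrete, here is a minimal sketch of a local-plus-vertical-stride block-sparse causal mask applied through torch SDPA (the fallback path mentioned above). The names block_size, local_blocks, and vert_stride are illustrative assumptions, not the exact vLLM config keys, and the production kernels never materialize a dense mask:

import torch
import torch.nn.functional as F

# Illustrative only: the real Triton/CUDA kernels never build this dense mask,
# and these parameter names are assumptions, not vLLM's actual config keys.
def blocksparse_causal_mask(seq_len, block_size=64, local_blocks=4, vert_stride=8):
    n_blocks = (seq_len + block_size - 1) // block_size
    q_blk = torch.arange(n_blocks)[:, None]    # query block index
    k_blk = torch.arange(n_blocks)[None, :]    # key block index
    causal = k_blk <= q_blk                    # block-level causality
    local = (q_blk - k_blk) < local_blocks     # sliding window of recent blocks
    vertical = (k_blk + 1) % vert_stride == 0  # every vert_stride-th key block
    blk_mask = causal & (local | vertical)
    # Expand the block-level mask to token level; note this is block-causal
    # only, while the real kernels also apply token-level causality inside
    # the diagonal blocks.
    mask = blk_mask.repeat_interleave(block_size, 0).repeat_interleave(block_size, 1)
    return mask[:seq_len, :seq_len]            # bool: True = may attend

q = k = v = torch.randn(1, 8, 256, 64)         # (batch, heads, seq_len, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=blocksparse_causal_mask(256))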

This is joint work between Microsoft GenAI (@linxihui, @beagleski) and vLLM (@zhuohan123, @simon-mo, @youkaichao).

linxihui and others added 2 commits May 14, 2024 14:02
typo fixed

Co-authored-by: Michael Goin <michael@neuralmagic.com>
@linxihui (Contributor, Author) commented May 14, 2024

@mgoin Thanks so much for reviewing. I've replied and made changes accordingly. Let me know if you have other suggestions.

@beagleski (Contributor)

@WoosukKwon Any suggestions on the kernel side?

@simon-mo (Collaborator)

I made a pass. I think once this PR adds unit tests for both the Triton and PagedAttention kernels it should be good to go. You might also need to run clang-format to fix the merge conflict.

@linxihui (Contributor, Author)
> I made a pass. I think once this PR adds unit tests for both the Triton and PagedAttention kernels it should be good to go. You might also need to run clang-format to fix the merge conflict.

Thanks @simon-mo for the review. I'll add the missing unit tests today.

@simon-mo (Collaborator)

I have tested the PR locally as well.

@simon-mo merged commit 8e192ff into vllm-project:main on May 25, 2024 (63 of 65 checks passed)
@AllenDou (Contributor) commented May 27, 2024

Phi-3-Small's SPECIAL_TOKENS ('<|******|>') cause guided_grammar to crash:

  File "/root/vllm/vllm/engine/async_llm_engine.py", line 39, in _raise_exception_on_finish
    task.result()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 517, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 491, in engine_step
    request_outputs = await self.engine.step_async()
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 225, in step_async
    output = await self.model_executor.execute_model_async(
  File "/root/vllm/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/model_runner.py", line 709, in execute_model
    logits = self.model.compute_logits(hidden_states, sampling_metadata)
  File "/root/vllm/vllm/model_executor/models/phi3_small.py", line 403, in compute_logits
    logits = self.logits_processor(self.lm_head.weight, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/vllm/vllm/model_executor/layers/logits_processor.py", line 58, in forward
    logits = _apply_logits_processors(logits, sampling_metadata)
  File "/root/vllm/vllm/model_executor/layers/logits_processor.py", line 115, in _apply_logits_processors
    logits_row = logits_processor(past_tokens_ids,
  File "/root/vllm/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 53, in __call__
    allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
  File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/fsm.py", line 329, in allowed_token_ids
    self.regex_fsm = RegexFSM(regex_string, self.tokenizer)
  File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/fsm.py", line 123, in __init__
    regex_string, tuple(sorted(tokenizer.vocabulary.items()))
TypeError: '<' not supported between instances of 'str' and 'bytes'
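
This failure reproduces in isolation: per the last frame above, outlines sorts tokenizer.vocabulary.items(), and with Phi-3-Small's tokenizer the vocabulary keys mix str and bytes, which Python refuses to order. A minimal standalone repro (the two-entry vocab below is hypothetical):

vocab = {b"<|endoftext|>": 1, "hello": 2}  # hypothetical vocab with mixed str/bytes keys
sorted(vocab.items())
# TypeError: '<' not supported between instances of 'str' and 'bytes'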

server:
python3 -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3-small-8k-instruct --tensor-parallel-size 1 --served-model-name modelx --disable-log-stats --trust-remote-code

client:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "modelx",
        "prompt": ["Generate a sql state that select col_1 from table_1 where it is equals to 1"],
        "max_tokens": 20,
        "temperature": 0,
        "guided_grammar": "start: select_statement\r\nselect_statement: \"SELECT\" column \"from\" table \"where\" condition\r\ncolumn: \"col_1\" | \"col_2\"\r\ntable: \"table_1\" | \"table_2\"\r\ncondition: column \"=\" number\r\nnumber: \"1\" | \"2\""
    }'

#5068 adds a test case.

khluu pushed a commit to khluu/vllm that referenced this pull request May 29, 2024
…-Small model (vllm-project#4799)

Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 31, 2024
…-Small model (vllm-project#4799)

Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>