[Inference] Clean duplicated vector utils #5715

Courtesy-Xs · 2024-05-14T05:45:07Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs
I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
if you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

)

* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct

* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)

* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related to tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt

* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

…ng (hpcaitech#5192)

* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement

---------

Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>

…tech#5249)

* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy

* adapt to baichuan2 13B
* add baichuan2 13B TP
* update baichuan tp logic
* rm unused code
* Fix TP logic
* fix alibi slopes tp logic
* rm nn.Module
* Polished the code.
* change BAICHUAN_MODEL_NAME_OR_PATH
* Modified the logic for loading Baichuan weights.
* fix typos

* kvmemcpy triton for new kcache layout
* revise tests for new kcache layout
* naive triton flash decoding - new kcache layout
* rotary triton kernel - new kcache layout
* remove redundancy - triton decoding
* remove redundancy - triton kvcache copy
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean requirements
* modify example inference struct
* add test ci scripts
* mark test_infer as submodule
* rm deprecated cls & deps
* import of HAS_FLASH_ATTN
* prune inference tests to be run
* prune triton kernel tests
* increment pytest timeout mins
* revert import path in openmoe

…ch#5708)

* Adapt repetition_penalty and no_repeat_ngram_size
* fix no_repeat_ngram_size_logit_process
* remove batch_updated
* fix annotation
* modified codes based on the review feedback.
* rm get_batch_token_ids

* rpc support source
* kv cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* Unitest
* Rpyc support

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

CjhHa1 and others added 30 commits January 11, 2024 13:39

[Inference] First PR for rebuild colossal-infer (hpcaitech#5143)

4cf4682

* add engine and scheduler
* add dirs

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

[Inference] Add readme (roadmap) and fulfill request handler (hpcaitech#5147)

56e75ee

* request handler
* add readme

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

[Inference/NFC] Clean outdated inference tests and deprecated kernels (hpcaitech#5159)

2bb9224

* [inference/nfc] remove outdated inference tests
* remove outdated kernel tests
* remove deprecated triton kernels
* remove imports from deprecated kernels

[Inference]Update inference config and fix test (hpcaitech#5178)

93aeacc

* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

Add padding llama model

86853a3

Fixed a bug in the inference frame

62fd08e

fix bugs in request_handler

6296858

precision alignment

9489dc6

Fixed a writing error

4df8876

add context_attention_unpadded

02c1bf8

fix bugs in sampler

bbfebfb

Fixed a typo

b2eb9cd

fix beam_width

3ad1f3b

[Inference] Pytorch Attention func, pad&nopad input support (hpcaitech#5219)

bfd9b1b

* add attn
* add attention test
* fix attn forward
* fix decoding

fix bugs in attention.py and request_handler.py

47e53ea

adapted to pad_context_forward

fa4fbdb

[Hotfix] Fix accuracy and align attention method api with Triton kernel (hpcaitech#5229)

e545a87

* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs

fix bugs related to processing padding mask

2a73e82

fix CI bugs

fab294c

rm torch.cuda.synchronize

10e3c9f

fix bugs in request_handler.py and engine.py

d40eb26

[Inference] Kernel: no pad rotary embedding (hpcaitech#5252)

fded91d

* fix bugs
* comment
* use more accurate atol
* fix

[git] fixed rebased files

1ded7e8

[kernel] Add KV cache copy kernel during decoding (hpcaitech#5261)

fa85e02

* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy

yuehuayingxueluo and others added 29 commits April 30, 2024 15:47

[Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (hpcaitech#5663)

5cd75ce

* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
* refactor decode_kv_cache_memcpy
* enable alibi in pagedattention

[Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (hpcaitech#5680)

ef8e4ff

[inference]Add alibi to flash attn function (hpcaitech#5678)

f799631

* add alibi to flash attn function
* rm redundant modifications

[Inference] Fix quant bits order (hpcaitech#5681)

9df016f

[sync] resolve conflicts of merging main

56ed09a

[Fix] Fix & Update Inference Tests (compatibility w/ main)

8754aba

[Inference] Remove unnecessary float4_ and rename float8_ to float8 (hpcaitech#5679)

725fbd2

[Sync] Update from main to feature/colossal-infer (Merge pull request hpcaitech#5685)

db7b305

[Sync] Update from main to feature/colossal-infer
- Merge pull request hpcaitech#5685 from yuanheng-zhao/inference/merge/main

[Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (hpcaitech#5686)

1ace106

[hotfix] Fix KV Heads Number Assignment in KVCacheManager (hpcaitech#5695)

f9afe0a

- Fix key value number assignment in KVCacheManager, as well as method of accessing

[hotfix] fix OpenMOE example import path (hpcaitech#5697)

12e7c28

[Inference]Adapt temperature processing logic (hpcaitech#5689)

9c2fe79

* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg

[Inference] Support the logic related to ignoring EOS token (hpcaitech#5693)

d482922

* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
* support ignore EOS token
* change variable's name
* fix annotation

[Inference] ADD async and sync Api server using FastAPI (hpcaitech#5396)

69cd7e0

* add api server
* fix
* add
* add completion service and fix bug
* add generation config
* revise shardformer
* fix bugs
* add docstrings and fix some bugs
* fix bugs and add choices for prompt template

[Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (hpcaitech#5432)

de378cd

* finish online test and add examples
* fix test_contionus_batching
* fix some bugs
* fix bash
* fix
* fix inference
* finish revision
* fix typos
* revision

[Online Server] Chat Api for streaming and not streaming response (hpcaitech#5470)

c064032

* fix bugs
* fix bugs
* fix api server
* fix api server
* add chat api and test
* del request.n

[Inference] resolve rebase conflicts

7bbb28e

fix

[Inference] Fix bugs and docs for feat/online-server (hpcaitech#5598)

61a1b2e

* fix test bugs
* add do sample test
* del useless lines
* fix comments
* fix tests
* delete version tag
* delete version tag
* add
* del test sever
* fix test
* fix
* Revert "add"

This reverts commit b9305fb.

resolve rebase conflicts on Branch feat/online-serving

bc9063a

[Inference] Add example test_ci script

5d9a494

Merge pull request hpcaitech#5588 from hpcaitech/feat/online-serving

492520d

[Feature]Online Serving

[Inference/Feat] Add quant kvcache interface (hpcaitech#5700)

bfad393

* add quant kvcache interface
* delete unused output
* complete args comments

[Inference/Feat] Add convert_fp8 op for fp8 test in the future (hpcaitech#5706)

50104ab

* add convert_fp8 op for fp8 test in the future
* rerun ci

delete copy_vector

30ea54f

Courtesy-Xs requested a review from a team as a code owner May 14, 2024 05:45
