support output logprobs with turbomind backend. #1391

Merged · merged 20 commits · Apr 21, 2024

Conversation

@irexyc (Collaborator) commented Apr 3, 2024

Motivation

Add logprobs output.

OpenAI uses different logprobs structures for the chat.completions and completions APIs, whereas vLLM uses the same structure for both. The completions-style logprobs structure is more user-friendly, so I followed vLLM and use it for both APIs.

Modification

  • Return the pytorch / turbomind engine output as a dataclass (a rough sketch follows this list).
  • Compute logprobs from the logits for the turbomind backend.
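
For context, a rough sketch of what such a dataclass-based engine output might look like. The class name and exact fields here are illustrative, inferred from the docstring and the Response object quoted later in this thread, not copied from the PR:

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EngineOutput:
    status: int           # response type (ResponseType in lmdeploy)
    token_ids: List[int]  # generated token ids
    num_token: int        # number of generated tokens; may exceed len(token_ids)
                          # by one when a trailing stop word is trimmed
    logprobs: Optional[List[Dict[int, float]]] = None  # per-token {token_id: logprob}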

Use cases (Optional)

from openai import OpenAI
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='sk-l6bdprDovMBW6bRs22B05f5dBa3f417d8bC13e3d131f73Aa')

model_name = client.models.list().data[0].id
completion = client.chat.completions.create(
  model=model_name,
  messages=[
    {"role": "user", "content": "讲一个笑话"}
  ],
  logprobs=True,
  top_logprobs=2,
  max_tokens=10,
  # stream=True
)

completion = client.completions.create(
  model=model_name,
  prompt="今天天气真好",
  logprobs=2,
  max_tokens=10,
  # stream=True
)
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

pipe = pipeline('/nvme/shared/vicuna-7b-v1.5/', backend_config=PytorchEngineConfig())
pipe('hello', gen_config=GenerationConfig(logprobs=10, top_k=40, max_new_tokens=10))
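
The per-token logprobs returned by the pipeline can then be inspected. This is a hedged sketch based on the Response structure shown later in this thread, where logprobs is a list of dicts mapping a candidate token id to its log probability:

response = pipe('hello', gen_config=GenerationConfig(logprobs=10, top_k=40, max_new_tokens=10))
# one {candidate_token_id: logprob} dict per generated token, aligned with response.token_ids
for token_id, candidates in zip(response.token_ids, response.logprobs):
    print(token_id, candidates.get(token_id), len(candidates))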

@lvhan028 added the enhancement (New feature or request) label on Apr 3, 2024
@lvhan028 (Collaborator) commented Apr 4, 2024

The build failed on the Windows platform.

@lvhan028 (Collaborator) commented:

You may merge the latest main to resolve the pr_ete_test workflow error.

@lvhan028 lvhan028 requested a review from AllentDan April 17, 2024 03:22
Args:
    status (ResponseType): the response type.
    token_ids (List[int]): the output token ids.
    num_token (int): the length of output token, for turbomind, num_token
Reviewer (Collaborator):

Is it possible that there is one extra token?

Author (irexyc):

When a stop word is hit:

if output[-1].item() in gen_config.stop_words:
    # the trailing stop word is dropped from token_ids, but len_ still counts it
    outputs = (status, output[:-1].tolist(), len_)

@@ -61,6 +61,8 @@ class ChatCompletionRequestQos(BaseModel):
    messages: Union[str, List[Dict[str, str]]]
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 1.0
    logprobs: Optional[bool] = False
    top_logprobs: Optional[int] = None
Reviewer (Collaborator):

Is there an upper bound on it?

Author (irexyc):

OpenAI's upper limit is 5. vLLM has no limit. turbomind is constrained by the top_k kernel, so the upper bound is 1024 (or 1023).

Reviewer (Collaborator):

Then the parameter's validity needs to be checked. It must not cause serious problems such as crashes or hangs.
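
A minimal validation sketch along those lines. The 1024 bound comes from the comment above; the function name and error type are illustrative, not the PR's actual code:

MAX_TOP_LOGPROBS = 1024  # cap imposed by turbomind's top_k sampling kernel

def check_top_logprobs(top_logprobs):
    # reject out-of-range values early instead of letting the kernel crash or hang
    if top_logprobs is not None and not 0 <= top_logprobs <= MAX_TOP_LOGPROBS:
        raise ValueError(
            f'top_logprobs must be in [0, {MAX_TOP_LOGPROBS}], got {top_logprobs}')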

@lvhan028 (Collaborator) commented:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig

pipe = pipeline('/workspace/140_models/InternLM/internlm2-chat-7b', backend_config=PytorchEngineConfig())
response = pipe('hello', gen_config=GenerationConfig(logprobs=10, top_k=1, max_new_tokens=10))
print(response)

The pytorch engine should warn that "logprobs" is not supported yet.
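
A hedged sketch of such a warning; the logger name, function name, and call site are illustrative:

import logging

logger = logging.getLogger('lmdeploy')

def warn_if_logprobs_unsupported(gen_config, backend):
    # the pytorch engine ignores logprobs for now, so warn instead of failing silently
    if backend == 'pytorch' and getattr(gen_config, 'logprobs', None):
        logger.warning('logprobs is not supported by the pytorch backend yet and will be ignored')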

@lvhan028 (Collaborator) commented Apr 17, 2024

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig


pipe = pipeline('/workspace/140_models/InternLM/internlm2-chat-7b')
response = pipe('hello', gen_config=GenerationConfig(logprobs=10, top_k=1, max_new_tokens=10))
print(response)

The result is:

Response(text='你好!有什么我可以帮助你的吗?', generate_token_len=9, input_token_len=103, session_id=0, finish_reason='stop', token_ids=[77230, 60477, 69259, 74010, 68417, 68364, 61076, 60504], logprobs=[{77230: 0.0}, {60477: 0.0}, {69259: 0.0}, {74010: 0.0}, {68417: 0.0}, {68364: 0.0}, {61076: 0.0}, {60504: 0.0}])
  1. Why are all the prob values 0.0?
  2. generate_token_len does not match the lengths of token_ids and logprobs. I think this will confuse users, and it is also troublesome for us to explain. Can this be handled in turbomind.py?
  3. When top_k is set to 2, the result also looks wrong.

@irexyc (Collaborator, Author) commented Apr 17, 2024

With top_k=1 there is only one candidate token, so its probability is 1 and taking the log gives 0.

The length of logprobs should be the same as the length of token_ids. The mismatch with generate_token_len should be because a stop word was hit; as I recall, that count is kept for the kv_cache step.

What is the result when top_k is 2?
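
A toy check of that explanation (plain Python, not lmdeploy code): with top_k=1 only one candidate survives, so its renormalized probability is 1 and log(1) = 0.

import math

top_k_logits = [3.2]  # only one candidate survives top_k=1
denom = sum(math.exp(x) for x in top_k_logits)
probs = [math.exp(x) / denom for x in top_k_logits]
print(probs)                         # [1.0]
print([math.log(p) for p in probs])  # [0.0]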

@lvhan028 (Collaborator) commented:

At the pipeline level, the generate parameters and behavior at inference time need to be consistent with transformers.
We define consistent behavior as follows:

  1. Given the same batch of prompts and the same generate config, if every prompt in the batch gets the same result through transformers, then lmdeploy should behave the same way. We do not require the results to be exactly identical to transformers' results.

@lvhan028 (Collaborator) commented:

(quoting @irexyc) With top_k=1 there is only one candidate token, so its probability is 1 and taking the log gives 0. The length of logprobs should be the same as the length of token_ids; the mismatch with generate_token_len should be because a stop word was hit. What is the result when top_k is 2?

I forgot that the log is taken; then there should be no problem.

@lvhan028 (Collaborator) commented Apr 17, 2024

I suggest adding a unit test for the sampling kernel.

src/turbomind/kernels/sampling_topp_kernels.cu (outdated review thread, resolved)
async for res in generator:
    logprobs = None
    if request.logprobs and res.logprobs:
Reviewer (Collaborator):

Can there be a case here where request.logprobs is set but res.logprobs is missing?

Author (irexyc):

Yes, when using the pytorch backend.

@lvhan028 (Collaborator) commented:

Please add examples to pipeline.md and api_server.md, respectively, showing how to obtain logprobs.
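
A hedged sketch of the kind of api_server.md example being asked for, reusing the client and model_name from the use cases above and reading the completions-style logprobs. The field names follow the OpenAI completions API, which this PR says it mirrors; the exact attributes may differ:

resp = client.completions.create(model=model_name, prompt='今天天气真好',
                                 logprobs=2, max_tokens=10)
lp = resp.choices[0].logprobs
# tokens, their chosen logprobs, and the top-k alternatives per position
for token, token_logprob, top in zip(lp.tokens, lp.token_logprobs, lp.top_logprobs):
    print(token, token_logprob, top)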

lmdeploy/turbomind/turbomind.py (outdated review thread, resolved)
src/turbomind/kernels/sampling_topk_kernels.cu (outdated review thread, resolved)
@lzhangzz (Collaborator) commented:

We need to benchmark the performance impact of requesting logprobs.

@lvhan028 (Collaborator) commented:

(quoting @lzhangzz) We need to benchmark the performance impact of requesting logprobs.

internlm2-7b, rps 23.734

@lvhan028 (Collaborator) commented:

The evaluation test passed.

@irexyc (Collaborator, Author) commented Apr 20, 2024

With vocab = 92544:
[image attachment]

@lvhan028 merged commit b797a90 into InternLM:main on Apr 21, 2024
9 checks passed