support output logprobs with turbomind backend. #1391
Conversation
Build failed on the Windows platform.
You may merge the latest main to resolve the pr_ete_test workflow error.
Args:
    status (ResponseType): the response type.
    token_ids (List[int]): the output token ids.
    num_token (int): the length of output token, for turbomind, num_token
Could there be one extra token here?
When a stop word is hit.
lmdeploy/lmdeploy/turbomind/turbomind.py
Lines 744 to 745 in e5aaca5
output[-1].item() in gen_config.stop_words:
outputs = (status, output[:-1].tolist(), len_)
@@ -61,6 +61,8 @@ class ChatCompletionRequestQos(BaseModel):
     messages: Union[str, List[Dict[str, str]]]
     temperature: Optional[float] = 0.7
     top_p: Optional[float] = 1.0
+    logprobs: Optional[bool] = False
+    top_logprobs: Optional[int] = None
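For context, a minimal sketch of exercising these new fields through the OpenAI-compatible server; the server address, api_key, and model name below are placeholders rather than values from this PR:

from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')  # placeholder endpoint
resp = client.chat.completions.create(
    model='internlm2-chat-7b',                        # placeholder model name
    messages=[{'role': 'user', 'content': 'hello'}],
    logprobs=True,                                    # new field added in this PR
    top_logprobs=5,                                   # candidates returned per generated token
    max_tokens=10)
print(resp.choices[0].logprobs)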
Is there an upper bound?
The OpenAI limit is 5. vllm has no limit; turbomind is constrained by the top_k kernel, so the upper bound is 1024 (or 1023).
Then the parameter needs to be validated; it must not cause serious problems such as crashes or hangs.
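For illustration, a minimal sketch of the kind of bound check being asked for; the constant name and the helper function are hypothetical, not the PR's actual code:

# Illustrative only: validate top_logprobs before it reaches the top_k kernel,
# so an out-of-range value is rejected instead of crashing or hanging the engine.
TOP_LOGPROBS_MAX = 1024  # turbomind top_k kernel limit mentioned above (assumed)

def check_top_logprobs(top_logprobs):
    if top_logprobs is None:
        return None
    if not isinstance(top_logprobs, int) or top_logprobs < 0:
        raise ValueError('top_logprobs must be a non-negative integer')
    if top_logprobs > TOP_LOGPROBS_MAX:
        raise ValueError(
            f'top_logprobs must be <= {TOP_LOGPROBS_MAX} for the turbomind backend')
    return top_logprobs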
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig
pipe = pipeline('/workspace/140_models/InternLM/internlm2-chat-7b', backend_config=PytorchEngineConfig())
response = pipe('hello', gen_config=GenerationConfig(logprobs=10, top_k=1, max_new_tokens=10))
print(response)

The pytorch engine should warn that "logprobs" is not supported yet.
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig
pipe = pipeline('/workspace/140_models/InternLM/internlm2-chat-7b')
response = pipe('hello', gen_config=GenerationConfig(logprobs=10, top_k=1, max_new_tokens=10))
print(response)

The result is:
With top_k = 1 there is only one candidate token, its probability is 1, and taking the log gives 0. The length of logprobs should match the length of token_ids; the mismatch with generate_token_len is presumably because a stop word was hit. As I recall, that part is for the kv_cache step. What does the result look like when top_k is 2?
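To make the arithmetic concrete, a small illustrative sketch (not the engine's kernel) of how the kept top-k logits turn into logprobs:

import math

# With top_k = 1 there is a single candidate, its renormalized probability is 1,
# and log(1) = 0, which matches the all-zero logprobs shown above.
# With top_k = 2 the two kept logits are renormalized by softmax, so both
# logprobs become negative and their exponentials sum to 1.
def topk_logprobs(logits, k):
    top = sorted(logits, reverse=True)[:k]
    z = sum(math.exp(x) for x in top)        # softmax denominator over the kept logits
    return [x - math.log(z) for x in top]    # log-softmax of the top-k candidates

print(topk_logprobs([2.0, 1.0, 0.5], k=1))   # [0.0]
print(topk_logprobs([2.0, 1.0, 0.5], k=2))   # approx [-0.313, -1.313]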
At the pipeline level, the generate parameters and behavior at inference time need to be consistent with transformers.
I forgot that the log is taken, so that should be fine.
I suggest adding a unit test for the sampling kernel.
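A rough Python-level sketch of such a check; the model path is a placeholder, and a true kernel-level unit test would sit closer to the CUDA sampling code:

from lmdeploy import pipeline, GenerationConfig

def test_logprobs_shape():
    # placeholder model path; assumes the response exposes token_ids and logprobs
    pipe = pipeline('internlm/internlm2-chat-7b')
    resp = pipe('hello',
                gen_config=GenerationConfig(logprobs=5, top_k=40, max_new_tokens=8))
    # one logprobs entry per generated token, each with at most 5 candidates
    assert len(resp.logprobs) == len(resp.token_ids)
    assert all(len(step) <= 5 for step in resp.logprobs)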
async for res in generator:
    logprobs = None
    if request.logprobs and res.logprobs:
Can there be a case here where request.logprobs is set but res.logprobs is not?
Yes, when using the pytorch backend.
Please add an example to each of pipeline.md and api_server.md introducing how to obtain logprobs.
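Something along these lines could go into pipeline.md; the model path is a placeholder, and it assumes the response exposes logprobs as one entry per generated token:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')  # placeholder model path
response = pipe('Hello, introduce yourself.',
                gen_config=GenerationConfig(logprobs=10, max_new_tokens=32))
for step, candidates in enumerate(response.logprobs):
    # candidates holds the logprob of each top candidate token at this step
    print(step, candidates)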
We need to benchmark the performance impact of requesting logprobs.
internlm2-7b, rps 23.734
Evaluation test passed.
Motivation
Add logprobs output.
OpenAI uses different logprobs structures for the chat.completions and completions APIs, whereas vllm uses the same structure for both. I think the logprobs structure of completions is more user-friendly, so I followed vllm and used that structure for both APIs.
Modification
Use cases (Optional)
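For illustration, the completions-style logprobs structure adopted here looks roughly like the following; the values are made up and the field names follow the OpenAI completions API:

# Illustrative shape of a completions-style logprobs object (values are made up).
logprobs_example = {
    'tokens': ['Hello', ',', ' world'],
    'token_logprobs': [-0.12, -0.53, -0.08],
    'top_logprobs': [
        {'Hello': -0.12, 'Hi': -2.31},
        {',': -0.53, '!': -1.44},
        {' world': -0.08, ' there': -3.02},
    ],
    'text_offset': [0, 5, 6],
}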