[Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs #4769
Conversation
Thanks for your contribution!
```python
if request.sampling_params.prompt_logprobs is not None:
    self.prompt_logprobs_reqs[request.request_id] = request
```
Have you considered memory growth after very long stress tests?
Also, for use in RL scenarios, some of the model_runner's objects need to be cleared in the `clear_requests` function, including this one. While you're at it, please check whether there are any other objects that need clearing.
Longer prompts and higher concurrency both increase GPU memory usage, but there is no memory leak; this is as expected.
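For context, a minimal sketch of the cleanup requested for the RL path above. The dict names mirror the snippets quoted in this thread; the class and method shape are hypothetical, not FastDeploy's actual code:

```python
# Hypothetical sketch only: the attribute names follow the quoted diff,
# everything else is assumed for illustration.
class GPUModelRunnerSketch:
    def __init__(self):
        # request_id -> request objects still needing prompt logprobs
        self.prompt_logprobs_reqs = {}
        # request_id -> partially accumulated prompt logprobs
        self.in_progress_prompt_logprobs = {}

    def clear_requests(self):
        """Drop all per-request state, e.g. between RL rollout batches."""
        self.prompt_logprobs_reqs.clear()
        self.in_progress_prompt_logprobs.clear()
```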
```python
self.prompt_logprobs_reqs.pop(request.request_id, None)
self.in_progress_prompt_logprobs.pop(request.request_id, None)
```
Don't we need `del self.prompt_logprobs_reqs[req.request_id]` on preemption?
There is already logic above that clears `prompt_logprobs_reqs` on preemption.
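The `pop(key, None)` form also matters here: it is safe to call even if the entry was already removed by the preemption path, whereas `del` on a missing key raises. A standalone illustration:

```python
reqs = {"req-1": object()}

reqs.pop("req-1", None)   # removes the entry
reqs.pop("req-1", None)   # no-op: key already gone, no exception

try:
    del reqs["req-1"]     # del on a missing key raises
except KeyError:
    pass
```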
```python
if isinstance(prompt_token_ids, np.ndarray):
    prompt_token_ids = prompt_token_ids.tolist()
prompt_token_ids_tensor = paddle.to_tensor(prompt_token_ids, dtype="int64")
```
Doesn't paddle support converting an ndarray to a tensor directly?
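To the reviewer's point: `paddle.to_tensor` does accept a NumPy ndarray directly, so the `.tolist()` round-trip should be avoidable. An illustrative check (not taken from this PR):

```python
import numpy as np
import paddle

prompt_token_ids = np.array([101, 2023, 2003], dtype=np.int64)

# paddle.to_tensor accepts an ndarray directly; no .tolist() needed.
direct = paddle.to_tensor(prompt_token_ids, dtype="int64")
via_list = paddle.to_tensor(prompt_token_ids.tolist(), dtype="int64")

assert bool((direct == via_list).all())
```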
gongshaotian left a comment:
LGTM
Motivation
`GPUModelRunner` supports `max_logprobs=-1` and `prompt_logprobs`.
Modifications
Usage or Command

```bash
export FD_USE_GET_SAVE_OUTPUT_V1=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model ./ERNIE-4.5-0.3B-PT \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --tensor-parallel-size 1 \
    --enable-logprob \
    --max-logprobs -1 \
    --no-enable-prefix-caching
```
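For a quick smoke test against the server started above, a hypothetical client call using the `openai` SDK; the port (8000) and the exact logprobs parameter handling at the server layer are assumptions, not confirmed by this PR (see the TODO below):

```python
# Hypothetical smoke test: assumes the server listens on port 8000 and
# exposes the usual OpenAI-compatible chat completions endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="./ERNIE-4.5-0.3B-PT",
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,
    top_logprobs=5,  # -1 (full vocab) is what this PR enables in the runner
)
print(resp.choices[0].logprobs)
```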
Accuracy Tests

TODO: The server layer should support `top_logprobs=-1` and `prompt_logprobs`.

Checklist
- Add at least one tag in the PR title. Tag list: `[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.