Optional logprobs & fix llama eos/stop token #40
Conversation
Have you implemented per-request cache clearing? Great work here as usual 😃
Do we need to clear the KV cache for each request? I thought the KV cache could be reused within a session (each session may have multiple requests) for in-context learning.
If users deploy
Yes, there is a KV cache leak problem because only one cache_engine instance is used for the entire lifetime of llm_engine. One solution is to create the cache_engine on a per-session basis, allowing multiple requests within a session (from a single user) to share the KV cache. This cache should be cleared after the session ends, or automatically once the server's cache is nearly full. Another strategy, as you suggested, is to clear the KV cache after each request and rebuild the session context using prefix caching (this is new to me) or by recomputing the session's chat history.
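As a rough illustration of the per-request option, here is a minimal Rust sketch assuming a hypothetical `CacheEngine` with a `clear()` method. The names, fields, and structure are illustrative only and not the project's actual `llm_engine`/`cache_engine` code.

```rust
// Minimal sketch, assuming a hypothetical `CacheEngine` that owns the
// key/value buffers; names and methods here are illustrative, not the
// crate's actual API.
struct CacheEngine {
    // One flattened key/value buffer per layer (greatly simplified).
    gpu_cache: Vec<Vec<f32>>,
}

impl CacheEngine {
    /// Drop all cached key/value blocks so the next request starts
    /// from an empty context.
    fn clear(&mut self) {
        for layer in self.gpu_cache.iter_mut() {
            layer.clear();
        }
    }
}

struct LlmEngine {
    cache_engine: CacheEngine,
}

impl LlmEngine {
    /// Option A (per-request clearing): free the KV cache right after each
    /// request and rebuild session context next time from chat history or
    /// prefix caching.
    fn generate(&mut self, prompt: &str) -> String {
        let output = self.run_pipeline(prompt);
        self.cache_engine.clear();
        output
    }

    fn run_pipeline(&mut self, _prompt: &str) -> String {
        // Placeholder for the actual decode loop.
        String::new()
    }
}

fn main() {
    let mut engine = LlmEngine {
        cache_engine: CacheEngine { gpu_cache: vec![Vec::new(); 32] },
    };
    let _ = engine.generate("Hello");
}
```

The per-session alternative would instead key a `CacheEngine` by session id and call `clear()` only when the session ends or the server's cache pressure is high.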
I'll probably add the KV cache clearing in a later PR. Thank you!
Make logprobs optional (no `logprobs` field in the chat completion response by default; see the sketch below),
Fix the EOS/stop-token problem for the LLaMA pipeline.
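As an illustration of the first item, an optional `logprobs` field can simply be omitted from the serialized response when it is `None`. The sketch below assumes serde/serde_json for JSON output; the struct and field names are placeholders, not the project's actual response types.

```rust
// Sketch of an optional logprobs field that is dropped from the JSON
// when not requested; types here are illustrative, not the real ones.
use serde::Serialize;

#[derive(Serialize)]
struct ChatChoice {
    index: usize,
    message: String,
    // When `logprobs` is None, the key is omitted from the output entirely,
    // so clients that did not ask for logprobs never see it.
    #[serde(skip_serializing_if = "Option::is_none")]
    logprobs: Option<Vec<f32>>,
    finish_reason: String,
}

fn main() {
    let choice = ChatChoice {
        index: 0,
        message: "Hello!".to_string(),
        logprobs: None,
        finish_reason: "stop".to_string(),
    };
    // Prints the choice without a "logprobs" key.
    println!("{}", serde_json::to_string(&choice).unwrap());
}
```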
Tested cases:
First request:
First response:
Second request:
Second response:
Apparently, the KV cache from the previous generation was reused by the second request for in-context learning. However, the second response contains some redundant output (no EOS token was found in the second response).
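For reference, the kind of check the EOS fix is about could be sketched as below: stop decoding as soon as the sampled token equals the model's EOS (or configured stop) id, so the response is not padded with redundant output. The sampler closure and token ids are placeholders, not the actual pipeline code.

```rust
// Sketch of an EOS/stop-token check in the decode loop; the sampler
// closure and token ids are placeholders, not the real pipeline code.
fn decode_loop(
    eos_token_id: u32,
    max_tokens: usize,
    mut sample_next: impl FnMut() -> u32,
) -> Vec<u32> {
    let mut output = Vec::new();
    for _ in 0..max_tokens {
        let token = sample_next();
        // Stop as soon as EOS is sampled so generation ends with
        // finish_reason "stop" instead of running to the length limit.
        if token == eos_token_id {
            break;
        }
        output.push(token);
    }
    output
}

fn main() {
    // Toy sampler that emits a few tokens and then an EOS id (2 here).
    let mut stream = vec![10u32, 11, 12, 2, 99, 100].into_iter();
    let tokens = decode_loop(2, 16, || stream.next().unwrap_or(2));
    assert_eq!(tokens, vec![10, 11, 12]); // nothing after EOS is kept
}
```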