Description
Is your feature request related to a problem? Please describe.
I don't know if I should have opened this issue in the parent llama-cpp-python repository, but this fork seems to be the only properly maintained one, so thank you for your work.
My issue is resuming an incomplete assistant response. For example:

[
  {"role": "user", "content": "What are Python decorators?"},
  {"role": "assistant", "content": "Hey, Python deco"}
]

As you can see, the assistant's response is cut off halfway. I need a consistent way to continue generating this message, but I couldn't find anything related in the docs or the issues.
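To make the problem concrete, here is a minimal sketch of why a rendered chat prompt cannot simply be continued. This is not llama-cpp-python's actual formatter; the ChatML-style tokens are an assumption for illustration, since the real tokens depend on the model's chat template:

```python
# Toy ChatML-style renderer illustrating the problem (token strings are
# assumptions for illustration; real values come from the model's template).
def render_chatml(messages):
    parts = []
    for m in messages:
        # Every message, including the partial assistant one, gets closed.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # With add_generation_prompt=True, a fresh assistant header is appended,
    # so generation starts a NEW turn instead of resuming the partial one.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "What are Python decorators?"},
    {"role": "assistant", "content": "Hey, Python deco"},
]
prompt = render_chatml(messages)
print(prompt)
```

The partial text "Hey, Python deco" is already sealed with an end-of-turn token, so the model will begin a brand-new assistant message rather than finishing it.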
llama-cpp-python seems to handle formatting and everything internally, so the only way forward seems to be hacking together my own formatter to somehow allow this. But again, that solution is very hacky.
Describe the solution you'd like
This is what I currently do in my project, which uses this fork of llama-cpp-python:
# _formatter is my own formatter, instance of llama_cpp.llama_chat_format.Jinja2ChatFormatter
# Temporarily disable appending new section tokens
formatter_state = shared.model._formatter.add_generation_prompt
shared.model._formatter.add_generation_prompt = False
prompt: str = shared.model._formatter(messages=messages).prompt
shared.model._formatter.add_generation_prompt = formatter_state
# Remove any leftover tokens so it can continue generating smoothly
prompt = prompt.rstrip()
prompt = prompt.removesuffix(eos_token)
prompt = prompt.removeprefix(bos_token)
# Instead of using create_chat_completion, generate regular text with our formatted chat prompt
llm(prompt)

This isn't as pretty as the create_chat_completion interface, but it gets the job done for now. So I want to hear your opinions on a possible method for continuing a chat assistant response.
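For reference, the trimming step above can be isolated as a standalone helper. This is just a sketch of my workaround's logic with hypothetical placeholder tokens; the real bos_token/eos_token values come from the loaded model:

```python
# Standalone sketch of the trimming step: strip the end-of-turn token so the
# model resumes the partial assistant text instead of starting a new turn.
# Token strings below are placeholders, not any specific model's tokens.
def prepare_continuation_prompt(prompt: str, bos_token: str, eos_token: str) -> str:
    prompt = prompt.rstrip()                 # drop trailing whitespace/newlines
    prompt = prompt.removesuffix(eos_token)  # reopen the final assistant turn
    prompt = prompt.removeprefix(bos_token)  # llm() will re-add BOS itself
    return prompt

rendered = "<s>user: What are Python decorators?\nassistant: Hey, Python deco</s>"
print(prepare_continuation_prompt(rendered, bos_token="<s>", eos_token="</s>"))
```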
Or at least a better interface to the underlying formatters so we can alter the chat state. I had similar issues enabling/disabling thinking with reasoning models, where I again had to create my own Jinja2 formatter and strip the thinking section from the template string.
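For what it's worth, one possible shape for such an interface would be a flag that leaves a trailing assistant message open, similar to the continue_final_message parameter of Hugging Face Transformers' apply_chat_template. The parameter names and the toy string formatter below are purely hypothetical, just to make the idea concrete:

```python
# Hypothetical API sketch (these parameters do NOT exist in llama-cpp-python
# today), shown with a toy ChatML-style formatter instead of a real template.
def format_chat(messages, add_generation_prompt=True, continue_final_message=False):
    parts = []
    for i, m in enumerate(messages):
        is_last = i == len(messages) - 1
        if continue_final_message and is_last and m["role"] == "assistant":
            # Leave the final assistant turn open so generation resumes it.
            parts.append(f"<|im_start|>assistant\n{m['content']}")
        else:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt and not continue_final_message:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

msgs = [
    {"role": "user", "content": "What are Python decorators?"},
    {"role": "assistant", "content": "Hey, Python deco"},
]
print(format_chat(msgs, continue_final_message=True))
```

With such a flag, create_chat_completion could handle continuation itself, and the manual rstrip/removesuffix dance would no longer be needed.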