Description
Is your feature request related to a problem? Please describe.
I don't know if I should have opened this issue in the parent llama-cpp-python repository, but this fork seems to be the only properly maintained one, so thank you for your work.
My issue is resuming an incomplete assistant response. For example:

[
  {"role": "user", "content": "What are Python decorators?"},
  {"role": "assistant", "content": "Hey, Python deco"}
]

As you can see, the assistant's response is cut off halfway. I need a consistent way to continue generating this message, but I couldn't find anything related in the docs or the issues.
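To make the problem concrete, here is a minimal sketch of why a rendered chat prompt cannot simply be continued. This is not llama-cpp-python's actual formatter; the ChatML-style tokens are an assumption for illustration, since the real tokens depend on the model's chat template:

```python
# Toy ChatML-style renderer illustrating the problem (token strings are
# assumptions for illustration; real values come from the model's template).
def render_chatml(messages):
    parts = []
    for m in messages:
        # Every message, including the partial assistant one, gets closed.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # With add_generation_prompt=True, a fresh assistant header is appended,
    # so generation starts a NEW turn instead of resuming the partial one.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "What are Python decorators?"},
    {"role": "assistant", "content": "Hey, Python deco"},
]
prompt = render_chatml(messages)
print(prompt)
```

The partial text "Hey, Python deco" is already sealed with an end-of-turn token, so the model will begin a brand-new assistant message rather than finishing it.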
llama-cpp-python seems to handle formatting and everything internally, so the only way forward seems to be hacking together my own formatter to somehow allow this. But again, that solution is very hacky.
Describe the solution you'd like
This is what I currently do in my project, which uses this fork of llama-cpp-python:
# _formatter is my own formatter, instance of llama_cpp.llama_chat_format.Jinja2ChatFormatter
# Temporarily disable appending new section tokens
formatter_state = shared.model._formatter.add_generation_prompt
shared.model._formatter.add_generation_prompt = False
prompt: str = shared.model._formatter(messages=messages).prompt
shared.model._formatter.add_generation_prompt = formatter_state
# Remove any leftover tokens so it can continue generating smoothly
prompt = prompt.rstrip()
prompt = prompt.removesuffix(eos_token)
prompt = prompt.removeprefix(bos_token)
# Instead of using create_chat_completion, generate regular text with our formatted chat prompt
llm(prompt)

This isn't as pretty as the create_chat_completion interface, but it gets the job done for now. So I want to hear your opinions on a possible method for continuing a chat assistant response.
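For reference, the trimming step above can be isolated as a standalone helper. This is just a sketch of my workaround's logic with hypothetical placeholder tokens; the real bos_token/eos_token values come from the loaded model:

```python
# Standalone sketch of the trimming step: strip the end-of-turn token so the
# model resumes the partial assistant text instead of starting a new turn.
# Token strings below are placeholders, not any specific model's tokens.
def prepare_continuation_prompt(prompt: str, bos_token: str, eos_token: str) -> str:
    prompt = prompt.rstrip()                 # drop trailing whitespace/newlines
    prompt = prompt.removesuffix(eos_token)  # reopen the final assistant turn
    prompt = prompt.removeprefix(bos_token)  # llm() will re-add BOS itself
    return prompt

rendered = "<s>user: What are Python decorators?\nassistant: Hey, Python deco</s>"
print(prepare_continuation_prompt(rendered, bos_token="<s>", eos_token="</s>"))
```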
Or at least a better interface to the underlying formatters so we can alter the chat state. I had similar issues enabling/disabling thinking with reasoning models, where I again had to create my own Jinja2 formatter and strip the thinking section from the template string.
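For what it's worth, one possible shape for such an interface would be a flag that leaves a trailing assistant message open, similar to the continue_final_message parameter of Hugging Face Transformers' apply_chat_template. The parameter names and the toy string formatter below are purely hypothetical, just to make the idea concrete:

```python
# Hypothetical API sketch (these parameters do NOT exist in llama-cpp-python
# today), shown with a toy ChatML-style formatter instead of a real template.
def format_chat(messages, add_generation_prompt=True, continue_final_message=False):
    parts = []
    for i, m in enumerate(messages):
        is_last = i == len(messages) - 1
        if continue_final_message and is_last and m["role"] == "assistant":
            # Leave the final assistant turn open so generation resumes it.
            parts.append(f"<|im_start|>assistant\n{m['content']}")
        else:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt and not continue_final_message:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

msgs = [
    {"role": "user", "content": "What are Python decorators?"},
    {"role": "assistant", "content": "Hey, Python deco"},
]
print(format_chat(msgs, continue_final_message=True))
```

With such a flag, create_chat_completion could handle continuation itself, and the manual rstrip/removesuffix dance would no longer be needed.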