Implementing local OpenAI API-style chat completions on any given inference server #1174
Conversation
@haileyschoelkopf - any thoughts or opinions on whether adding a new class is the way to go here would be appreciated. I had two thoughts:
I decided to go with the first approach for now.
Happy to dispute / discuss this, but although the current approach is alright, I think we can minimize duplicated code by simply implementing arbitrary `base_url` support in the actual OpenAI LM classes. Let me know what you think!

We could have an assertion that `base_url` is provided if `OPENAI_API_KEY` is not provided (in the case of a non-self-hosted server, we may still want an API key to be provided alongside a unique `base_url` value).
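A minimal sketch of what that check might look like in the constructor (the class shape and argument names here are illustrative assumptions, not the harness's actual code):

```python
import os


class OpenAIChatCompletionsLM:
    """Illustrative sketch: accept an arbitrary base_url and only require
    OPENAI_API_KEY when no custom endpoint is supplied."""

    def __init__(self, model: str, base_url=None, **kwargs):
        api_key = os.environ.get("OPENAI_API_KEY")
        # Hitting the real OpenAI endpoint requires a key; a self-hosted
        # server can accept a dummy key such as "EMPTY".
        if api_key is None and base_url is None:
            raise ValueError(
                "Set OPENAI_API_KEY or provide base_url for a local, "
                "OpenAI-compatible inference server."
            )
        self.model = model
        self.base_url = base_url
        self.api_key = api_key or "EMPTY"
```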
As far as the HF tokenizer usage, I think having a kwarg for `tokenizer_backend: Literal["tiktoken", "huggingface"]` might be alright for user experience? I think we mostly only need the tokenizer's EOS/EOT/EOD token, as well as the tokenized length of each input for the purpose of sorting requests long-to-short (we could probably remove `_encode_pair()`).
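For illustration, here is one way such a `tokenizer_backend` kwarg could be wired up (a sketch only; the function name and return shape are assumptions, and it presumes `tiktoken` and `transformers` are installed):

```python
from typing import Literal


def build_tokenizer(
    model: str,
    tokenizer_backend: Literal["tiktoken", "huggingface"] = "tiktoken",
):
    """Return (encode_fn, eot_token_id) for the chosen backend.

    Only the EOS/EOT token and encoded input lengths (for long-to-short
    sorting) are really needed by the chat-completions path.
    """
    if tokenizer_backend == "tiktoken":
        import tiktoken

        enc = tiktoken.encoding_for_model(model)
        return enc.encode, enc.eot_token
    elif tokenizer_backend == "huggingface":
        from transformers import AutoTokenizer

        tok = AutoTokenizer.from_pretrained(model)
        return tok.encode, tok.eos_token_id
    else:
        raise ValueError(f"Unknown tokenizer_backend: {tokenizer_backend}")
```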
We should factor out the code duplication where possible since we're subclassing the `OpenAIChatCompletionsLM` class, but I think this makes sense.
Sorry, I realize some of these comments are regarding the existing OAI implementations, so may not be immediately necessary to address in this PR!
Yep, this makes sense! I'm always for less code to maintain. When I was initially thinking about this, my assumption was that eventually we'd want to abstract out OpenAI and have the generic endpoint class not be tied to any vendor-specific implementation, so this was my attempt at starting to keep that logic separate. In the changes you suggested, would you as the user still call it the same way?

Maybe the first step could be implementing your suggestion and just getting that working, and the next scope of work could be to create a class independent of the OpenAI API - does that seem reasonable, or outside of what you were envisioning? Or maybe it's more effort than necessary to reimplement it in a generic way for local models? Let me know what you think.
I think we can add a secondary model name
I think that for the time being, if companies or open-source libraries support an OpenAI-mirroring interface for their API, it's reasonable to assume they'll continue to mirror OpenAI's API in the future (at least for a good while, or for as long as OpenAI remains the dominant provider such that it's beneficial to use their interface to minimize user friction in switching to other providers).
Sounds like a plan! If you are willing to do this (allowing the OpenAI implementations to take `base_url` and an HF tokenizer) for OpenAI's Completions API as well in this PR, that'd be awesome, but no worries if you just want to commit to the ChatCompletions model. I'm not necessarily opposed to reimplementing some more generic API class (in particular, I think Llama-CPP/GGUF, which we currently support separately, can be merged into the same class as this extended OpenAI one), but would probably prefer to only do this if there is an existing standard that providers have already converged on. I think innovating our own abstraction or model API interface would be out of scope for this project.
Yep, sounds good, will add this to the PR. Thanks for the discussion and clarification!
Ok, I think this is good to go 🤞 - one last question: when we pass in an incorrect arg, for example, do we want to do any error handling before that, or is this standard behavior across the library?
Thanks very much, @veekaybee ! I left a couple very minor nits, once those are resolved this can be good to go.
There's also the Completions API but either that can be handled in another PR or I can port the ChatCompletions changes up to that one myself.
@haileyschoelkopf Ok for me to merge this, or do you generally merge?
Okay to merge! EDIT: oops, looks like it was blocked because a conversation wasn't resolved yet; resolved it!
Thanks very much @veekaybee for all your work on this!
Thanks so much for your patience on this @haileyschoelkopf!! 👏
How do I do this?

Then I get an error.
Hey! Yeah, this is the intended usage (for chat models at the moment; Completions support coming soon!). You can eval generative tasks. This is currently the result because OpenAI's ChatCompletions didn't support logits until very recently. Will try to get support for those in ASAP!
Ok, thank you for the update!
…erence server (EleutherAI#1174)
* LocalChatCompletionsLM add
* clean up completions class
* clean up completions class
* update tokens
* README
* fix constructor
* eos token
* folding local-chat-completions into OpenAIChatCompletions
* refactoring to include gen_kwargs as passable option
* add todo on chat completion kwarg validation
* Ruff and README fix
* generalize to **kwargs
* remove unnecessary kwargs
* README and remove kwargs
* README
Hi @veekaybee, thanks for your contribution! I'm now testing your command, but the output has an unexpected value; in the description of this PR, you've reported the same table, also with that value. I'll try with a larger Llama model later, but wanted to check with you in case you've spotted the same. Thanks!
Hi all, it appears that this local chat completions PR doesn't actually function. Each benchmark I try (non-logit-based) fails and never hits the actual server. Instead I get a 200 OK in the log and the GPUs never spool up. Can someone confirm that this merge works?
This PR addresses this issue:
#1072 (comment)
by passing a `base_url` to a new class, `LocalChatCompletionsLM`, which inherits from `OpenaiChatCompletionsLM`, accepts a local HuggingFace-style model name, and uses TikToken with modifications to pass HuggingFace model encodings. To use this, you'll need to pass `EMPTY` as your OpenAI token and hit a local inference server.

Confirmed that it works by running

lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1

against a local vLLM inference server.
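For anyone reproducing this, one quick way to sanity-check that the local OpenAI-compatible endpoint is reachable before running the harness might look like the following (a sketch only, assuming a vLLM server exposing /v1 on port 8000 with facebook/opt-125m; not part of this PR):

```python
import requests

# Hypothetical sanity check: POST a single chat message to the local
# OpenAI-compatible endpoint that lm_eval will be pointed at.
base_url = "http://localhost:8000/v1"  # replace with your server's address

resp = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": "Bearer EMPTY"},  # dummy key for a local server
    json={
        "model": "facebook/opt-125m",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```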
Compare to the OpenAI task:
lm_eval --model openai-chat-completions --tasks gsm8k
this should also work with an OpenAI key.

Related PRs and context:
Sample response: