
Implementing local OpenAI API-style chat completions on any given inference server #1174

Merged: 15 commits merged into EleutherAI:main on Dec 20, 2023

Conversation

veekaybee
Contributor

This PR addresses this issue:

#1072 (comment)

by passing a base_url to a new class, LocalChatCompletionsLM, which inherits from OpenaiChatCompletionsLM, accepts a local HuggingFace-style model name, and uses TikToken with modifications so that HuggingFace model encodings can be used as well.

To use this, you'll need to pass EMPTY as your OpenAI token and point the harness at a local inference server.

Confirmed that it works by running lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1 against a local vLLM inference server.
[Screenshot of the evaluation run output.]
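As a quick sanity check before pointing the harness at the server, you can hit the same endpoint directly with the openai Python client. This is only a sketch: the base_url and model name below are placeholders for whatever your local server is actually serving.

```python
from openai import OpenAI

# Placeholder endpoint and model: substitute your own server address and
# the model name your inference server was launched with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
```

If that call returns a normal chat completion, the harness should be able to reach the same endpoint with --model local-chat-completions.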

Compare to the OpenAI task:

lm_eval --model openai-chat-completions --tasks gsm8k (this should also work with an OpenAI key).

Related PRs and context:

Sample response:

local-chat-completions (), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|  Filter  |n-shot|  Metric   |Value|   |Stderr|
|-----|-------|----------|-----:|-----------|----:|---|-----:|
|gsm8k|Yaml   |get-answer|     5|exact_match|    0|±  |     0|

INFO:     {serverip}:47136 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-19 12:03:57 async_llm_engine.py:379] Received request cmpl-0ce5d6fa82ba417cbf178069974aaf28: prompt: "Question: At Mario's barbershop haircuts are 50% more expensive during the weekends. If Mario paid $18 for his last haircut on Monday, how much he would have paid the day before?\nAnswer: Mario's $18 cut on Monday would have been 50% more expensive on Sunday or $18*50% = $<<18*50*.01=9>>9 more expensive\nThat means he would have paid $9 more on Sunday than what he paid ($18) on Monday or $9+$18 = $<<9+18=27>>27\n#### 27\n\nQuestion: Jack buys a squat rack for $2500.  The barbell cost 1/10 as much.  How much did he pay for everything?\nAnswer: The barbell cost 2500/10=$<<2500/10=250>>250\nSo he paid 2500+250=$<<2500+250=2750>>2750 for everything\n#### 2750\n\nQuestion: There are 9 boys and 12 girls in a class. The teacher needs to create groups with three members for their class activity. How many groups are formed?\nAnswer: There are 9 + 12 = <<9+12=21>>21 students in a class.\nHence, 21/3 = <<21/3=7>>7 groups are formed.\n#### 7\n\nQuestion: What is fifteen more than a quarter of 48?\nAnswer: A quarter of 48 is 48/4=<<48/4=12>>12.\nThe number is 12+15=<<12+15=27>>27.\n#### 27\n\nQuestion: A bond paper ream has 500 sheets and costs $27. An office needs 5000 sheets of bond paper. How much will it cost to buy their needed sheets of paper?\nAnswer: An office needs to buy 5000/500 = <<5000/500=10>>10 reams of bond paper.\nSo, it will cost 10 x $27 = $<<10*27=270>>270.\n#### 270\n\nQuestion: Jon runs a triathlon.  It takes him 40 minutes for the swim, an hour and 20 minutes for the bike ride and 50 minutes for the run.  Compared to Jon, James finishes the swim 10% faster but takes 5 minutes longer on the bike.  If Jon won by 10 minutes, how long did it take James to do the run?\nAnswer:</s>", sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=256, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [2, 45641, 35, 497, 8782, 18, 2003, 9569, 9547, 39799, 7046, 32, 654, 207, 55, 3214, 148, 5, 12729, 4, 318, 8782, 1199, 68, 1366, 13, 39, 94, 29618, 15, 302, 6, 141, 203, 37, 74, 33, 1199, 5, 183, 137, 116, 50118, 33683, 35, 8782, 18, 68, 1366, 847, 15, 302, 74, 33, 57, 654, 207, 55, 3214, 15, 395, 50, 68, 1366, 3226, 1096, 207, 5457, 68, 48203, 1366, 3226, 1096, 44460, 2663, 5214, 466, 44226, 466, 55, 3214, 50118, 1711, 839, 37, 74, 33, 1199, 68, 466, 55, 15, 395, 87, 99, 37, 1199, 1358, 1366, 43, 15, 302, 50, 68, 466, 2744, 1629, 1366, 5457, 68, 48203, 466, 2744, 1366, 5214, 2518, 44226, 2518, 50118, 49629, 974, 50118, 50118, 45641, 35, 2722, 13079, 10, 31147, 20004, 13, 68, 41374, 4, 1437, 20, 2003, 11312, 701, 112, 73, 698, 25, 203, 4, 1437, 1336, 203, 222, 37, 582, 13, 960, 116, 50118, 33683, 35, 20, 2003, 11312, 701, 35014, 73, 698, 45946, 48203, 41374, 73, 698, 5214, 5714, 44226, 5714, 50118, 2847, 37, 1199, 35014, 2744, 5714, 45946, 48203, 41374, 2744, 5714, 5214, 2518, 1096, 44226, 2518, 1096, 13, 960, 50118, 49629, 974, 1096, 50118, 50118, 45641, 35, 345, 32, 361, 2786, 8, 316, 1972, 11, 10, 1380, 4, 20, 3254, 782, 7, 1045, 1134, 19, 130, 453, 13, 49, 1380, 1940, 4, 1336, 171, 1134, 32, 4829, 116, 50118, 33683, 35, 345, 32, 361, 2055, 316, 5457, 48188, 466, 2744, 1092, 5214, 2146, 
44226, 2146, 521, 11, 10, 1380, 4, 50118, 725, 4086, 6, 733, 73, 246, 5457, 48188, 2146, 73, 246, 5214, 406, 44226, 406, 1134, 32, 4829, 4, 50118, 49629, 262, 50118, 50118, 45641, 35, 653, 16, 23843, 55, 87, 10, 297, 9, 2929, 116, 50118, 33683, 35, 83, 297, 9, 2929, 16, 2929, 73, 306, 5214, 48203, 3818, 73, 306, 5214, 1092, 44226, 1092, 4, 50118, 133, 346, 16, 316, 2744, 996, 5214, 48203, 1092, 2744, 996, 5214, 2518, 44226, 2518, 4, 50118, 49629, 974, 50118, 50118, 45641, 35, 83, 2175, 2225, 769, 424, 34, 1764, 12208, 8, 1042, 68, 2518, 4, 660, 558, 782, 23221, 12208, 9, 2175, 2225, 4, 1336, 203, 40, 24, 701, 7, 907, 49, 956, 12208, 9, 2225, 116, 50118, 33683, 35, 660, 558, 782, 7, 907, 23221, 73, 1497, 5457, 48188, 31830, 73, 1497, 5214, 698, 44226, 698, 769, 7042, 9, 2175, 2225, 4, 50118, 2847, 6, 24, 40, 701, 158, 3023, 68, 2518, 5457, 68, 48203, 698, 3226, 2518, 5214, 21063, 44226, 21063, 4, 50118, 49629, 18673, 50118, 50118, 45641, 35, 4160, 1237, 10, 7182, 22166, 4, 1437, 85, 1239, 123, 843, 728, 13, 5, 6966, 6, 41, 1946, 8, 291, 728, 13, 5, 4806, 3068, 8, 654, 728, 13, 5, 422, 4, 1437, 23570, 7, 4160, 6, 957, 11630, 5, 6966, 158, 207, 3845, 53, 1239, 195, 728, 1181, 15, 5, 4806, 4, 1437, 318, 4160, 351, 30, 158, 728, 6, 141, 251, 222, 24, 185, 957, 7, 109, 5, 422, 116, 50118, 33683, 35, 2].

@veekaybee
Contributor Author

@haileyschoelkopf - any thoughts or opinions on whether adding a new class is the way to go here would be appreciated. I had two thoughts:

  1. Create a new class in the OpenAI completions file - lots of replication that we'd need to refactor out later.
  2. Create an entirely new class for local models - this likely doesn't make sense, since it still inherits from the OpenAI class.

I decided to go with the first approach for now.

@haileyschoelkopf (Contributor) left a comment

Happy to dispute / discuss this, but although the current approach is alright, I think we can minimize duplicated code by simply implementing support for an arbitrary base_url in the actual OpenAI LM classes. Let me know what you think!

We could have an assertion that base_url is provided if OPENAI_API_KEY is not; in the case of a non-self-hosted server, we may still want an API key to be provided alongside a custom base_url value.

As far as the HF tokenizer usage goes, I think having a kwarg for tokenizer_backend: Literal["tiktoken", "huggingface"] might be nicer for the user experience. We mostly only need the tokenizer's EOS/EOT/EOD token, plus the tokenized length of each input for sorting requests long-to-short (we could probably remove _encode_pair()).
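For concreteness, here is a minimal sketch of these two suggestions; the function names and exact behavior are hypothetical illustrations, not what necessarily ends up in the harness:

```python
import os
from typing import Literal, Optional


def resolve_tokenizer(
    model: str,
    tokenizer_backend: Literal["tiktoken", "huggingface"] = "tiktoken",
):
    # The encoder is only needed for EOS lookup and for sorting requests
    # long-to-short by tokenized length, not for generation itself.
    if tokenizer_backend == "tiktoken":
        import tiktoken

        return tiktoken.encoding_for_model(model)
    if tokenizer_backend == "huggingface":
        from transformers import AutoTokenizer

        return AutoTokenizer.from_pretrained(model)
    raise ValueError(f"Unknown tokenizer_backend: {tokenizer_backend!r}")


def check_endpoint_args(base_url: Optional[str]) -> None:
    # Hypothetical guard: a self-hosted server can use a dummy key, but then
    # an explicit base_url is required; a hosted non-OpenAI provider may need
    # both a real key and a custom base_url.
    if not os.environ.get("OPENAI_API_KEY") and not base_url:
        raise ValueError(
            "Set OPENAI_API_KEY or pass base_url=... pointing at an "
            "OpenAI-compatible server."
        )
```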

We should factor out the code duplication where possible since we're subclassing the OpenAIChatCompletionsLM class, but I think this makes sense.

Sorry, I realize some of these comments are regarding the existing OAI implementations, so may not be immediately necessary to address in this PR!

(Inline review comments on lm_eval/models/openai_completions.py, now resolved.)
@veekaybee
Contributor Author

> ...actual openAI LM classes. Let me know what you think!

Yep, this makes sense! I'm always for less code to maintain. When I was initially thinking about this, my assumption was that eventually we'd want to abstract out OpenAI and have the generic endpoint class not be tied to any vendor-specific implementation, so this was my attempt at starting to keep that logic separate.

In the changes you suggested, would you as the user still call openai-chat-completions and pass the base URL? It seems like this might be confusing for users, and relying on the API for something it's not really meant for long-term (i.e. calling non-OpenAI models) might not be useful, but we could definitely add documentation around this.

Maybe the first step could be implementing your suggestion and just get that working, and the next scope of work could be to create a class independent of the OpenAI API - does that seem reasonable or outside of what you were envisioning? Or maybe it's more effort than necessary to reimplement it in a generic way for local models? Let me know what you think.

@haileyschoelkopf
Contributor

> In the changes you suggested, would you as the user still call openai-chat-completions and pass the base URL?

I think we can add a secondary model name, chat-completions or local-chat-completions, so that users can also access the OpenAIChatCompletions API via that if they prefer, or still call it via openai-chat-completions.
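Something like the following could register the existing class under both names. This is just a sketch; it assumes the harness's register_model decorator accepts multiple aliases, and the module paths shown are assumptions:

```python
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("openai-chat-completions", "local-chat-completions")
class OpenaiChatCompletionsLM(LM):
    # Same class either way; the alias only changes what users type after
    # --model on the command line.
    ...
```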

> relying on the API for something it's not really meant for long-term (i.e. calling non-OpenAI models) might not be useful,

> my assumption was that eventually we'd want to abstract out OpenAI and have the generic endpoint class not be tied to any vendor-specific implementation, so this was my attempt at starting to keep that logic separate.

I think that for the time being, if companies or open-source libraries support an OpenAI-mirroring interface for their API, it's reasonable to assume they'll continue to mirror OpenAI's API in the future (at least for a good while, or while OpenAI is still the dominant provider, such that it's beneficial to use their interface to minimize user friction when switching to other providers).

> Maybe the first step could be implementing your suggestion and just get that working, and the next scope of work could be to create a class independent of the OpenAI API - does that seem reasonable or outside of what you were envisioning?

Sounds like a plan! If you are willing to do this (allowing the OpenAI implementations to take base_url and a HF tokenizer) for OpenAI's completions API as well in this PR that'd be awesome but no worries if you just want to commit to the ChatCompletions model.

I'm not necessarily opposed to reimplementing some more generic API class (in particular, I think Llama-CPP/GGUF which we currently support separately can be merged into the same class as this extended OpenAI one), but would probably prefer to only do this if there is an existing standard that providers have already converged on--I think innovating our own abstraction or model API interface would be out of scope for this project.

@veekaybee
Contributor Author

veekaybee commented Dec 20, 2023

> Sounds like a plan! If you are willing to do this (allowing the OpenAI implementations to take base_url and a HF tokenizer) for OpenAI's completions API as well in this PR that'd be awesome but no worries if you just want to commit to the ChatCompletions model.

Yep, sounds good, will add this to the PR. Thanks for the discussion and clarification!

@veekaybee
Contributor Author

veekaybee commented Dec 20, 2023

Ok, I think this is good to go 🤞 - one last question - when we pass in an incorrect arg, for example number instead of n, we get a TypeError at the Completions API create step:

lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{ip}:8000/v1 --gen_kwargs number=1
TypeError: Completions.create() got an unexpected keyword argument 'number'

Do we want to do any error handling before that or is this standard behavior across the library?
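If we did want to fail earlier, one option would be validating gen_kwargs against a small allow-list before the request is built. This is purely a sketch; the allowed set below is a hand-maintained assumption rather than anything pulled from the openai package:

```python
# Arguments the ChatCompletions request builder would be willing to forward;
# this set is an illustrative assumption, not an exhaustive list.
ALLOWED_GEN_KWARGS = {
    "temperature", "top_p", "n", "max_tokens", "stop",
    "presence_penalty", "frequency_penalty", "seed",
}


def validate_gen_kwargs(gen_kwargs: dict) -> dict:
    unknown = set(gen_kwargs) - ALLOWED_GEN_KWARGS
    if unknown:
        raise ValueError(
            f"Unsupported generation argument(s): {sorted(unknown)}; "
            f"supported arguments are {sorted(ALLOWED_GEN_KWARGS)}"
        )
    return gen_kwargs
```

That would turn the late TypeError (e.g. for number) into an immediate, more descriptive error, at the cost of maintaining the list by hand.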

@haileyschoelkopf (Contributor) left a comment

Thanks very much, @veekaybee! I left a couple of very minor nits; once those are resolved, this can be good to go.

There's also the Completions API, but either that can be handled in another PR or I can port the ChatCompletions changes over to it myself.

(Inline review comments on README.md and lm_eval/models/openai_completions.py, now resolved.)
@haileyschoelkopf haileyschoelkopf enabled auto-merge (squash) December 20, 2023 20:45
@veekaybee
Contributor Author

@haileyschoelkopf OK for me to merge this, or do you generally merge?

@haileyschoelkopf haileyschoelkopf merged commit fcfc0c6 into EleutherAI:main Dec 20, 2023
8 checks passed
@haileyschoelkopf
Contributor

haileyschoelkopf commented Dec 20, 2023

okay to merge! EDIT: oops, looks like it was blocked because a conversation wasn't resolved yet, resolved it!

@haileyschoelkopf
Contributor

Thanks very much @veekaybee for all your work on this!

@veekaybee
Contributor Author

Thanks so much for your patience on this @haileyschoelkopf !! 👏

@ehartford

How do I do this? I tried:

lm_eval --model local-chat-completions --tasks mmlu,mmlu_flan_cot_fewshot --model_args base_url=http://localhost:8000/v1

and then I get this error:

2023-12-22:01:05:53,757 INFO     [task.py:344] Building contexts for task on rank 0...
2023-12-22:01:05:53,784 INFO     [evaluator.py:314] Running loglikelihood requests
Traceback (most recent call last):
  File "/home/azureuser/miniconda3/envs/eval/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/__main__.py", line 231, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 150, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 325, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 483, in loglikelihood
    raise NotImplementedError("No support for logits.")
NotImplementedError: No support for logits.

@haileyschoelkopf
Contributor

haileyschoelkopf commented Dec 22, 2023

Hey! Yeah, this is the intended usage (for chat models at the moment; Completions support is coming soon). You can eval generative tasks like gsm8k right now, but not ones that require logprobs, like MMLU.

This is currently the result because OpenAI's ChatCompletions didn't support logits until very recently. Will try to get support for those in ASAP!
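For reference, the recently added parameters on OpenAI's side look roughly like the sketch below. Note that they return logprobs only for the sampled output tokens, not the prompt, and a local OpenAI-compatible server may not implement them yet, so treat this as a hedged illustration rather than something the harness relies on today:

```python
from openai import OpenAI

client = OpenAI()  # or base_url=... and api_key="EMPTY" for a local server

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "2+2="}],
    max_tokens=4,
    logprobs=True,    # per-token logprobs for the generated tokens
    top_logprobs=5,   # top alternatives at each generated position
)
print(resp.choices[0].logprobs)
```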

@ehartford

ok thank you for the update!

wx-zhang pushed a commit to wx-zhang/lm-evaluation-harness that referenced this pull request Dec 24, 2023
…erence server (EleutherAI#1174)

* LocalChatCompletionsLM add

* clean up completions class

* clean up completions class

* update tokens

* README

* fix constructor

* eos token

* folding local-chat-completions into OpenAIChatCompletions

* refactoring to include gen_kwargs as passable option

* add todo on chat completion kwarg validation

* Ruff and README fix

* generalize to **kwargs

* remove unnecessary kwargs

* README and remove kwargs

* README
proserve pushed a commit to actualize-ae/lm-evaluation-harness that referenced this pull request Dec 26, 2023
@sergiopperez
Contributor

Hi @veekaybee, thanks for your contribution! I'm now testing your command:

lm_eval --model local-chat-completions --tasks gsm8k --model_args model=${model_name},base_url=${servername}:8000/v1

but the output has a value of 0:

|Tasks|Version|  Filter  |n-shot|  Metric   |Value|   |Stderr|
|-----|------:|----------|-----:|-----------|----:|---|-----:|
|gsm8k|      2|get-answer|     5|exact_match|    0|±  |     0|

In the description of this PR, you reported the same table, also with a value of 0. Is a result of 0 expected? I'm using a llama2-7b model, and from the gsm8k score in the llama paper it should be low but not 0. As a comparison, when I use OpenAI's API, the value is not 0:

$ lm_eval --model openai-completions --tasks gsm8k --model_args model=davinci-002 --limit 5

|Tasks|Version|  Filter  |n-shot|  Metric   |Value|   |Stderr|
|-----|------:|----------|-----:|-----------|----:|---|-----:|
|gsm8k|      2|get-answer|     5|exact_match|  0.2|±  |   0.2|

I'll try with a larger llama model later, but wanted to check with you in case you've spotted the same. Thanks!
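One way to dig into this (a sketch only; the endpoint, the model name, and the filter detail are assumptions on my part) is to replay a gsm8k-style few-shot question directly against the endpoint and look at the raw text the model returns. If the generation never contains the final "#### <number>" line that the gsm8k answer filter looks for, exact_match can come out as 0 even when the model itself is reasonable.

```python
from openai import OpenAI

# Placeholder endpoint and model name; match whatever the failing run used.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

few_shot = (
    "Question: What is fifteen more than a quarter of 48?\n"
    "Answer: A quarter of 48 is 48/4=12.\nThe number is 12+15=27.\n#### 27\n\n"
    "Question: A ream of bond paper has 500 sheets and costs $27. An office "
    "needs 5000 sheets. How much will the paper cost?\nAnswer:"
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": few_shot}],
    max_tokens=256,
    temperature=0.0,
)
# Inspect the raw generation: does it end with a "#### <number>" line?
print(repr(resp.choices[0].message.content))
```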

@nickmitchko

Hi all,

It appears that this local chat completions PR doesn't actually function. Each benchmark I try (non-logit-based) fails and never hits the actual server. Instead, I get a 200 OK in the log and the GPUs never spool up.

Can someone confirm that this merge works?
