Inconsistent evaluation results with Chat Template #1841
Did you figure out how to use the chat template with vllm?
Yeah, it's simple.
Hi! After #2034 both HF and vllm should support chat templating via …
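(For illustration, a sketch of what such an invocation might look like; the `--apply_chat_template` flag and the model path are assumptions based on recent lm-evaluation-harness versions, not confirmed in this thread:)

```shell
# Sketch: run gsm8k with the chat template applied.
# Flag name assumed from recent lm-evaluation-harness releases;
# the model path is only an example.
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks gsm8k \
    --apply_chat_template
```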
Hi, I have a question: why is the score on the Open LLM Leaderboard vastly different from yours? Llama3-8b-Instruct's score on the open_llm_leaderboard is only 68.69. Do you have any evaluation details? I have checked that the default max_tok_len in both lm-evaluation-harness and the Leaderboard config is 256.
I evaluated Llama3-8b-Instruct on the gsm8k benchmark and found some interesting phenomena.
Huggingface and vllm give similar results.
If I use vllm to start an API service and use lm-eval's local-chat mode to evaluate gsm8k, I get a different accuracy.
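The API-service setup described here might look roughly like this (the server entrypoint, model path, port, and endpoint URL are assumptions for illustration, not taken from this thread):

```shell
# Start an OpenAI-compatible API server with vLLM (model path is an example)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000

# Evaluate gsm8k against it; the chat-completions endpoint applies the
# model's chat template on the server side
lm_eval --model local-chat-completions \
    --model_args model=meta-llama/Meta-Llama-3-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions \
    --tasks gsm8k
```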
I browsed the lm-eval source code and found that the API path applies the chat template, while the huggingface and vllm inference modes do not use a chat template (i.e. there are no special tokens).
I tried using the chat template in the tokenizer and logged some intermediate results; could you give me some insights about it?
Output of vllm in lm-eval:
Output of vllm using llama's chat template, without a system prompt:
Output of vllm using llama's chat template, with the system prompt "you are a helpful assistant":
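To make the difference between the three prompts concrete, here is a minimal plain-Python sketch of what the chat template renders (the `render_llama3_prompt` helper is hypothetical, not part of lm-eval; the special-token layout follows Llama 3's published template, which `tokenizer.apply_chat_template(..., add_generation_prompt=True)` would produce):

```python
def render_llama3_prompt(messages):
    """Hypothetical helper: render a message list in Llama 3's prompt format.

    Mirrors what tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True) produces for Llama 3 models.
    """
    prompt = "<|begin_of_text|>"
    for msg in messages:
        prompt += (f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
                   f"{msg['content']}<|eot_id|>")
    # Trailing assistant header cues the model to generate the answer.
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"

question = ("Natalia sold clips to 48 of her friends in April, and then she "
            "sold half as many clips in May. How many clips did Natalia sell "
            "altogether in April and May?")

# Plain completion prompt (what the hf/vllm modes send): no special tokens.
plain = question

# Chat template without a system prompt.
no_system = render_llama3_prompt([{"role": "user", "content": question}])

# Chat template with a system prompt: one extra turn is prepended.
with_system = render_llama3_prompt([
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": question},
])
```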
In the end, the accuracy of vllm with the chat template drops significantly!
Do you have any idea about it?
I think people should use chat templates more in evaluation, since that is closer to real-world scenarios.