Support wrapping prompts with a given Chat Template #1098
I thought HF did a great job in explaining how their chat templating works.
My previous hangup with HF's chat templating support was that it was new and it wasn't clear whether people were going to adopt it. But I looked, and quite a few notable finetunes seem to use the feature, so I think using HF's setup may be the way to go.
I concur
IMO we should probably allow users to pass system prompts. The Llama-2 paper provides a lot of detail about the different prompts they used (though for human evaluation).
The default behaviour of the HF tokenizer seems to be to apply the chat template to all segments and tokenize in one go. This question might be more relevant for multi-turn evaluations.
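For reference, a minimal sketch of that behavior with the `transformers` API (the model name here is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [{"role": "user", "content": "What is the capital of France?"}]

# The chat template is applied to all messages and rendered in one pass;
# add_generation_prompt=True appends the tokens that cue the assistant's turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```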
There is this effort to handle the different chat formats and the conversions between them: https://github.com/deployradiant/pychatml
This feature would be great for benchmarks like IFEval, which expect a chat-formatted input! Regarding chat templates in lm-eval, relying on the tokenizer's built-in support seems sensible. For the system prompt, I suppose a configurable argument would suffice. Finally, for few-shot, I believe the recommended formatting is to provide examples as alternating user/assistant pairs.
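An illustrative sketch of that structure (toy questions; the exact example from the original comment was not preserved):

```python
messages = [
    # Few-shot examples as alternating user/assistant turns
    {"role": "user", "content": "Question: What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "Question: What is 3 + 5?"},
    {"role": "assistant", "content": "8"},
    # The actual query to evaluate
    {"role": "user", "content": "Question: What is 7 + 6?"},
]
```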
Which in ChatML would produce something like the following (illustrative; the exact rendering depends on the tokenizer's template):
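```
<|im_start|>user
Question: What is 2 + 2?<|im_end|>
<|im_start|>assistant
4<|im_end|>
<|im_start|>user
Question: What is 3 + 5?<|im_end|>
<|im_start|>assistant
8<|im_end|>
<|im_start|>user
Question: What is 7 + 6?<|im_end|>
<|im_start|>assistant
```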
@lewtun see branch `add_chat_templating`. We are implementing with the HF tokenizer. One problem is figuring out how to pick out the few-shot examples for different kinds of tests: at the moment the code just wraps the whole context as one user message, but I agree it would be ideal to have each few-shot example as a new user/assistant entry in the list of messages. I'm trying this on my own fork in a hacky way just to see what the results are for specific tests. Good tip on IFEval, I will check that out! CC @haileyschoelkopf
@haileyschoelkopf is IFEval in the harness yet (https://arxiv.org/pdf/2311.07911.pdf)? Not seeing it after running:
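(The command itself was not preserved; presumably the task-listing invocation, something like:)

```bash
lm-eval --tasks list
```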
It is, but it requires installing the `ifeval` extra's dependencies. (We should see about making a clearer listing for this case.)
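For reference, a sketch of that installation step (assuming the `ifeval` extra defined in the repo's `pyproject.toml`):

```bash
pip install -e ".[ifeval]"
```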
Very nice! Feel free to ping me or @Rocketknight1 (the creator of the templating system in transformers) if you run into any issues!
@lewtun @haileyschoelkopf I'm currently running a test with Mixtral-8x7B-Instruct on IFEval with and without chat templating, will report back my results tomorrow morning.
^ Results for the above are super promising!
Changing `use_chat_template` between True and False is the only change between the runs!
That's fantastic to hear @daniel-furman! I'm sorry for the delay in pushing this feature forward (and for the fact that the system prompt in this branch is not yet configurable). The major blocker to getting this merged is that, in my testing, scores were worse with the chat template applied for models evaluated on tasks like the Open LLM Leaderboard ones.
@haileyschoelkopf No worries! This IFEval result is the first positive swing I have seen in my testing. A couple of thoughts...
@haileyschoelkopf @lewtun I went a little overboard testing IFEval this weekend... Experiments were executed with models in half precision (bfloat16), on a workstation equipped with 2x H100 80 GB SXM5 GPUs, using my fork of the lm-eval package at commit 0c0c314c0df4c10f35bf7c17dc80f745f8027e9b. More details can be found in my write-up at towardsdatascience.com/evaluations-with-chat-formats-7604067023c9
Super happy to see the feature coming to the harness! It's been highly requested and it will really be interesting to see how it changes rankings on the leaderboard 🔥
The alternation between "user" and "assistant" for the few-shot text and targets does seem more intuitive! It would require a clearer delineation between each example though, which many tasks don't currently have. It would also probably require pushing the few-shot construction down to the model level rather than sending it through as one big blob (see the sketch below).
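A rough sketch of what pushing few-shot construction down to the model level could look like (a hypothetical `build_chat_prompt` helper, not the branch's actual code):

```python
from transformers import AutoTokenizer


def build_chat_prompt(tokenizer, fewshot_pairs, query, system_prompt=None):
    # Hypothetical helper: render few-shot (question, answer) pairs as
    # alternating chat turns instead of one big user-message blob.
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for question, answer in fewshot_pairs:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": query})
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
print(
    build_chat_prompt(
        tokenizer,
        fewshot_pairs=[("What is 2 + 2?", "4"), ("What is 3 + 5?", "8")],
        query="What is 7 + 6?",
    )
)
```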
Hey @daniel-furman that table is great! Can I use it in a tweet about chat templates?
@Rocketknight1 you betcha, go for it! Just to note - this is the first eval where I am seeing a positive swing via chat templating. The Open LLM Leaderboard evals are showing a regression after applying templates, pending further development to pinpoint the cause of that result.
That's surprising! I would expect chat models to work much better in basically all cases when their correct template is used.
Those results are still pending fixes for spacing issues on continuations, the few-shot setup, and testing on more models. @haileyschoelkopf was seeing the same dips in her initial tests. There's a reason I picked IFEval :)
@clefourrier do you also see performance drops when evaluating chat models with templates?
@lewtun Yep! But I'll be able to do a more in-depth and comparable analysis once we integrate IFEval, to see if we observe the same.
@daniel-furman Awesome, is there a way I can define a custom template? I am using your forked lm_eval repo.
@haileyschoelkopf maybe we can close this discussion now that the feature has been added? :)
LM Evaluation Harness has historically been used and designed around evaluation of base language models in the few-shot and zero-shot settings.
Given the substantial interest in evaluating chat models using lm-eval, an important feature would be allowing one to take the prompt in a benchmark and evaluate the model on that prompt wrapped in a given chat template.
There are a few considerations that will come up in how we implement this:
- How should the template be specified? Via a `--chat_template` CLI arg? By passing `--model_args chat_template=...`?
- How should few-shot examples be wrapped? As `User: <description> Assistant: <fewshot1, answer1>, <fewshot2, answer2>, <context>` -> predict `target`, or as `User: <Q: ... A: > Assistant:` -> predict `target`? (See the sketch after this list.)
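For illustration, here is roughly how those two options could look as transformers-style message lists (hypothetical toy task, not harness code):

```python
# Option 1: the whole few-shot context packed into a single user turn
single_turn = [
    {
        "role": "user",
        "content": (
            "Q: What is 2 + 2?\nA: 4\n\n"
            "Q: What is 3 + 5?\nA: 8\n\n"
            "Q: What is 7 + 6?\nA:"
        ),
    },
]

# Option 2: each few-shot example as its own user/assistant exchange
multi_turn = [
    {"role": "user", "content": "Q: What is 2 + 2?\nA:"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "Q: What is 3 + 5?\nA:"},
    {"role": "assistant", "content": "8"},
    {"role": "user", "content": "Q: What is 7 + 6?\nA:"},
]
```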
An adjacent feature to this would be considering the addition of more flexible / permissive answer extraction.
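For instance, a permissive numeric-answer extractor might look like this (an illustrative sketch, not an existing harness utility):

```python
import re
from typing import Optional


def extract_final_number(generation: str) -> Optional[str]:
    # Return the last number in a generation, tolerating the chat-style
    # verbosity of instruction-tuned models ("Let's see... the answer is 8.").
    matches = re.findall(r"-?\d+(?:\.\d+)?", generation)
    return matches[-1] if matches else None


assert extract_final_number("3 + 5 = 8, so the answer is 8.") == "8"
```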
Chat templating is a fairly high-priority addition to the library based on user requests and feedback, but we want to make sure we add it in a way that doesn't hurt reproducibility and that is easy to maintain, easy for non-power-users to use, and pleasant to work with in general.