Support wrapping prompts with a given Chat Template #1098

Open
haileyschoelkopf opened this issue Dec 11, 2023 · 26 comments

@haileyschoelkopf
Contributor

LM Evaluation Harness has historically been used and designed around evaluation of base language models in the few-shot and zero-shot settings.

Given the substantial interest in evaluating chat models using lm-eval, an important feature would be allowing one to take the prompt in a benchmark, wrap it in a given chat template, and evaluate the model on the formatted result.

There are a few considerations that will come up in how we implement this:

  • What existing chat formats are standard? Are there particular winners or commonalities amongst the popular ones?
  • Are there any libraries or tools that support chat templating and serve as repositories for this utility? Could we import from any of these, offloading non-evaluation logic to an external tool used elsewhere in the ecosystem?
  • How should we expect users to work with chat templates? Should we require them to add a new YAML file that manually modifies the prompt strings to include the chat template? Provide a --chat_template CLI arg? Pass --model_args chat_template=...?
  • Where should content be inserted within the chat format, and how, if necessary, should it be configurable? For example, should few-shot examples be (see the sketch after this list):
    • User: <description> Assistant: <fewshot1, answer1>, <fewshot2, answer2>, <context> -> predict target
    • User: <Q: ... A: > Assistant: -> predict target
  • How should we allow for users to specify a system prompt, if at all?
  • How do these interact with tokenization? For example, https://docs.mistral.ai/models#chat-template details that tokenizing each segment of the chat transcript separately is the intended behavior.
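
To make the two options above concrete, here is one way to read them as chat-message lists (a hypothetical sketch; the placeholder strings stand in for a task's actual description, few-shot examples, and context):

# Option 1: the description forms the user turn; the few-shot examples and the
# context are placed in the assistant turn, which the model continues to predict
# the target.
option_1 = [
    {"role": "user", "content": "<description>"},
    {"role": "assistant", "content": "<fewshot1, answer1> <fewshot2, answer2> <context>"},
    # model continues this assistant turn -> predict target
]

# Option 2: the whole few-shot block ("Q: ... A: ...") plus the final question is
# packed into a single user turn; the model predicts the target as the assistant turn.
option_2 = [
    {"role": "user", "content": "<Q: ... A: >"},
    # model responds as the assistant -> predict target
]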

An adjacent feature to this would be adding more flexible / permissive answer extraction.

Chat templating is a fairly high-priority addition to the library based on user requests and feedback, but we want to make sure we add it in a way that doesn't hurt reproducibility and that is easy to maintain, easy for non-power-users to use, and easy to work with in general.

@haileyschoelkopf haileyschoelkopf added help wanted Contributors and extra help welcome. feature request A feature that isn't implemented yet. opinions wanted For discussing open questions. labels Dec 11, 2023
@haileyschoelkopf haileyschoelkopf self-assigned this Dec 11, 2023
@baberabb
Contributor

I thought HF did a great job in explaining how their chat templating works.

@haileyschoelkopf
Contributor Author

My previous hangup with HF's chat templating support was that it was new and wasn't clear if people were going to use it. But I looked and quite a few notable finetunes seem to use the feature, so I think using HF's setup may be the way to go.
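
For concreteness, a minimal sketch of what wrapping a prompt with HF's tokenizer-level templating could look like (the model name is just an illustrative assumption):

from transformers import AutoTokenizer

# Illustrative choice; any tokenizer whose config ships a chat_template works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [{"role": "user", "content": "Question: What is the capital of France?\nAnswer:"}]

# Render the templated prompt as a string; add_generation_prompt=True appends the
# assistant header so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)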

@StellaAthena
Member

My previous hangup with HF's chat templating support was that it was new and wasn't clear if people were going to use it. But I looked and quite a few notable finetunes seem to use the feature, so I think using HF's setup may be the way to go.

I concur

@baberabb
Contributor

baberabb commented Dec 14, 2023

  • How should we allow for users to specify a system prompt, if at all?

IMO we probably should allow users to pass system prompts. The Llama 2 paper provides a lot of detail about the different prompts they used (for human evaluation, though).

[Screenshot omitted: prompt examples from the Llama 2 paper]

The default behaviour of the HF tokenizer seems to be to apply the chat template to all segments and tokenize in one go. This question might be more relevant for multi-turn evaluations.
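
A small sketch of that default single-pass behaviour, versus rendering the string first and tokenizing it separately (model name is an illustrative assumption):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "fewshot_prompt_1"},
    {"role": "assistant", "content": "fewshot_answer_1"},
    {"role": "user", "content": "final_prompt"},
]

# Default: the template is rendered over the full message list and the result is
# tokenized in a single call, returning token ids.
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Alternatively, render without tokenizing and handle tokenization separately
# (e.g. per segment, as the Mistral docs describe).
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)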

@anjor
Contributor

anjor commented Dec 26, 2023

There is this effort to handle the different chat formats and conversions in between -- https://github.com/deployradiant/pychatml

@lewtun

lewtun commented Feb 9, 2024

This feature would be great for benchmarks like IFEval, which expect chat-formatted input!

Regarding chat templates in transformers models: these can be inferred from the chat_template attribute of the tokenizer config, so one possible way to do this would be to apply it directly if it exists, or allow the user to override it by passing the Jinja string in a --chat_template arg.

For the system prompt, I suppose a --system_prompt arg could be added which then inserts a system role/content into the messages.
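
A rough sketch of how those pieces could fit together (hypothetical: --chat_template and --system_prompt are the proposed arg names above, not an existing harness interface, and the model name is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# 1) The template can be inferred from the tokenizer config, if present.
template = tokenizer.chat_template

# 2) Or overridden by a user-supplied Jinja string (e.g. from a --chat_template arg).
user_supplied_jinja = None  # would come from the CLI in this proposal
if user_supplied_jinja is not None:
    tokenizer.chat_template = user_supplied_jinja

# 3) A --system_prompt arg would simply insert a system role/content at the front.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "final_prompt"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)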

Finally, for few-shot I believe the recommended formatting is to provide examples as alternating user/assistant pairs like this:

[
    {"role": "user", "content": "fewshot_prompt_1"},
    {"role": "assistant", "content": "fewshot_answer_1"},
    {"role": "user", "content": "fewshot_prompt_2"},
    {"role": "assistant", "content": "fewshot_answer_2"},
    {"role": "user", "content": "final_prompt"}
]

Which in ChatML would produce something like

<|im_start|>user
fewshot_prompt_1<|im_end|>
<|im_start|>assistant
fewshot_answer_1<|im_end|>
<|im_start|>user
fewshot_prompt_2<|im_end|>
<|im_start|>assistant
fewshot_answer_2<|im_end|>
<|im_start|>user
final_prompt<|im_end|>
<|im_start|>assistant

@daniel-furman

daniel-furman commented Feb 9, 2024

@lewtun see branch “add_chat_templating”. We are implementing with the HF tokenizer. One problem is figuring out how to pick out the few shot examples for different kinds of tests - at the moment the code just wraps the whole context as one user message, but I agree it would be ideal to have each few shot example as a new user/assistant entry in the list of messages. I’m trying this on my own fork in a hacky way just to see what the results are for specific tests. Good tip on the IFEval I will check that out! CC @haileyschoelkopf

@daniel-furman

@haileyschoelkopf is IFEval in the harness yet (https://arxiv.org/pdf/2311.07911.pdf)? Not seeing it after running:

!lm-eval --tasks list

@haileyschoelkopf
Contributor Author

It is, but it requires installation of the ifeval extra first!

(We should see about making a clearer listing for this case.)

@lewtun

lewtun commented Feb 12, 2024

@lewtun see branch “add_chat_templating”. We are implementing with the HF tokenizer. One problem is figuring out how to pick out the few shot examples for different kinds of tests - at the moment the code just wraps the whole context as one user message, but I agree it would be ideal to have each few shot example as a new user/assistant entry in the list of messages. I’m trying this on my own fork in a hacky way just to see what the results are for specific tests. Good tip on the IFEval I will check that out! CC @haileyschoelkopf

Very nice! Feel free to ping me or @Rocketknight1 (the creator of the templating system in transformers) if you need any help 🤗

@daniel-furman

daniel-furman commented Feb 15, 2024

@lewtun @haileyschoelkopf I'm currently running a test with Mixtral-8x7B-Instruct on IFEval with and without chat templating, will report back my results tomorrow morning.

First element before prompt formatting...
('Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.', {'until': [], 'do_sample': False, 'temperature': 0.0, 'max_gen_toks': 1280})
First element after prompt formatting...
('<s>[INST] Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*. [/INST]', {'until': [], 'do_sample': False, 'temperature': 0.0, 'max_gen_toks': 1280})

@daniel-furman

^ results for above are super promising!

W/o prompt formatting:
[results table not preserved; columns included Tasks, Version, Filter, n-shot]

W/ prompt formatting:
[results table not preserved; columns included Tasks, Version, Filter, n-shot]

!lm_eval --model hf \
    --model_args=pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,dtype="bfloat16",load_in_4bit=True,use_chat_template=True \
    --tasks ifeval \
    --batch_size 2 \
    --output_path /content/output/Mixtral-8x7B-Instruct-v0.1 \
    --log_samples \
    --device cuda:0 \
    --num_fewshot 0

Changing use_chat_template to True/False is the only change between runs!

@haileyschoelkopf
Contributor Author

That's fantastic to hear @daniel-furman !

I'm sorry for the delay in pushing this feature forward (and the fact that the system prompt in this branch is not yet configurable).

The major blocker to getting this merged is that, in my testing, scores were worse for models evaluated on tasks like arc_easy, because the whitespace needs to change when wrapping the input in a chat template. We need to handle this in a way that is intuitive for users, without overly complicating the task config files. Additionally, we still need to decide how to configure tasks' prompts so that one can have a "trigger" like Let's think step by step. attached to the beginning of the final model chat turn.
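
One possible way to attach such a trigger to the start of the final assistant turn (a hypothetical sketch, not what the branch currently does): render the prompt including the generation header, then append the trigger text so the model continues from it.

from transformers import AutoTokenizer

# Illustrative model; messages would be the task's few-shot and question turns.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [{"role": "user", "content": "Q: What is 2 + 2?\nA:"}]

# Render the prompt up to and including the assistant header...
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ...then append the trigger so it becomes the beginning of the model's final turn.
prompt += "Let's think step by step."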

@daniel-furman

@haileyschoelkopf No worries! This IFEval result is the first positive swing I have seen in my testing.

Couple of thoughts...
a) Wouldn't we want that trigger at the end of each model chat turn?
b) I think the biggest blocker is getting the few-shot examples to act as new user/assistant messages, as per @lewtun's suggestion above (perhaps this is captured in your update above, but just confirming / putting it into new language).
c) @lewtun's change in code review will fix the system prompt configurability.
d) I found one other small change necessary for the generate_until helper.

@daniel-furman

daniel-furman commented Feb 19, 2024

@haileyschoelkopf @lewtun I went a little overboard testing IFEval this weekend...

[Image not preserved: table of IFEval results with and without chat templating]

Experiments were executed with models in half precision (bfloat16), on a workstation equipped with 2x H100 80 GB SXM5 GPUs, using my fork of the lm-eval package at hash 0c0c314c0df4c10f35bf7c17dc80f745f8027e9b.

More details can be found at towardsdatascience.com/evaluations-with-chat-formats-7604067023c9

CC @Rocketknight1 @clefourrier

@clefourrier
Contributor

Btw, the user/assistant turns are something we added in lighteval here, so if you want to reuse this snippet in the harness feel free to do so :)
(The finished logic is in this PR)

@clefourrier
Contributor

clefourrier commented Feb 20, 2024

Super happy to see the feature coming to the harness! It's been highly requested and it will really be interesting to see how it changes rankings on the leaderboard 🔥
(@daniel-furman 's analysis already seems to show it would shuffle things quite a bit 😈 )

@baberabb
Contributor

Btw, the user/assistant turns are something we added in lighteval here, so if you want to reuse this snippet in the harness feel free to do so :) (The finished logic is in this PR)

The alternation between "user" and "assistant" for the few-shot text and targets does seem more intuitive! It would require a clearer delineation between examples, though, which many tasks don't have right now. It would also probably require pushing the few-shot construction down to the model level rather than sending it through as one big blob.
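
For illustration, a hypothetical helper along those lines (the function name and the (question, answer) pair format are assumptions, not the harness's actual internals):

def build_fewshot_messages(fewshot_pairs, final_question, system_prompt=None):
    """Turn (question, answer) few-shot pairs into alternating user/assistant messages."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    for question, answer in fewshot_pairs:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": final_question})
    return messages

# Example usage:
msgs = build_fewshot_messages([("Q: 1+1?", "A: 2"), ("Q: 2+2?", "A: 4")], "Q: 3+3?")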

@Rocketknight1

Hey @daniel-furman that table is great! Can I use it in a tweet about chat templates?

@daniel-furman

@Rocketknight1 you betcha, go for it! Just to note: this is the first eval where I am seeing a positive swing from chat templating. The Open LLM Leaderboard evals are showing a regression after applying templates; further development is needed to pinpoint the cause of that result.

@Rocketknight1

That's surprising! I would expect that chat models should work much better in basically all cases when their correct template is used

@daniel-furman

Those results are pending spacing issues on continuation / few shot setup / testing more models. @haileyschoelkopf was seeing the same dips on her initial tests. There’s a reason I picked IFEval :).

@lewtun

lewtun commented Feb 21, 2024

Those results are pending spacing issues on continuation / few shot setup / testing more models. @haileyschoelkopf was seeing the same dips on her initial tests. There’s a reason I picked IFEval :).

@clefourrier do you also see performance drops when evaluating chat models with templates in lighteval?

@clefourrier
Contributor

@lewtun Yep! But I'll be able to do a more in-depth and comparable analysis once we integrate ifeval, to see if we observe the same.

@haileyschoelkopf haileyschoelkopf added this to the v0.4.3 milestone Mar 15, 2024
@monk1337

monk1337 commented Apr 7, 2024

@daniel-furman Awesome, is there a way I can define a custom template? I am using your forked lm_eval repo.

@clefourrier
Contributor

@haileyschoelkopf maybe we can close this discussion now that the feature has been added? :)
