Support wrapping prompts with a given Chat Template #1098

Open
haileyschoelkopf opened this issue Dec 11, 2023 · 26 comments

@haileyschoelkopf
Contributor

LM Evaluation Harness has historically been used and designed around evaluation of base language models in the few-shot and zero-shot settings.

Given the substantial interest in evaluating chat models using lm-eval, an important feature would be allowing one to take the prompt in a benchmark, wrap it in a given chat template, and evaluate the model on the formatted result.

There are a few considerations that will come up in how we implement this:

  • What existing chat formats are standard? Are there particular winners or commonalities amongst the popular ones?
  • Are there any libraries or tools that support chat templating and serve as repositories for this utility? Could we import from any of these, offloading non-evaluation logic to an external tool used elsewhere in the ecosystem?
  • How should we expect users to work with chat templates? Should we require them to add a new YAML file that manually modifies the prompt strings to include the chat template? Provide a --chat_template CLI arg? Pass --model_args chat_template=...?
  • Where should content be inserted within the chat format, and how, if necessary, should it be configurable? For example, should few-shot examples be (see the sketch after this list):
    • User: <description> Assistant: <fewshot1, answer1>, <fewshot2, answer2>, <context> -> predict target
    • User: <Q: ... A: > Assistant: -> predict target
  • How should we allow for users to specify a system prompt, if at all?
  • How do these interact with tokenization? For example, https://docs.mistral.ai/models#chat-template details that tokenizing each segment of the chat transcript separately is the intended behavior.
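
To make the two options above concrete, here is one way to read them as chat-message lists (a hypothetical sketch; the placeholder strings stand in for a task's actual description, few-shot examples, and context):

# Option 1: the description forms the user turn; the few-shot examples and the
# context are placed in the assistant turn, which the model continues to predict
# the target.
option_1 = [
    {"role": "user", "content": "<description>"},
    {"role": "assistant", "content": "<fewshot1, answer1> <fewshot2, answer2> <context>"},
    # model continues this assistant turn -> predict target
]

# Option 2: the whole few-shot block ("Q: ... A: ...") plus the final question is
# packed into a single user turn; the model predicts the target as the assistant turn.
option_2 = [
    {"role": "user", "content": "<Q: ... A: >"},
    # model responds as the assistant -> predict target
]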

An adjacent feature to this would be adding more flexible / permissive answer extraction.

Chat templating is a fairly high-priority addition to the library based on user requests and feedback, but we want to make sure we add it in a way that doesn't hurt reproducibility and that is easy to maintain, easy for non-power-users to use, and easy to work with in general.

@haileyschoelkopf haileyschoelkopf added help wanted Contributors and extra help welcome. feature request A feature that isn't implemented yet. opinions wanted For discussing open questions. labels Dec 11, 2023
@haileyschoelkopf haileyschoelkopf self-assigned this Dec 11, 2023
@baberabb
Contributor

I thought HF did a great job in explaining how their chat templating works.

@haileyschoelkopf
Contributor Author

My previous hangup with HF's chat templating support was that it was new and wasn't clear if people were going to use it. But I looked and quite a few notable finetunes seem to use the feature, so I think using HF's setup may be the way to go.
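
For concreteness, a minimal sketch of what wrapping a prompt with HF's tokenizer-level templating could look like (the model name is just an illustrative assumption):

from transformers import AutoTokenizer

# Illustrative choice; any tokenizer whose config ships a chat_template works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [{"role": "user", "content": "Question: What is the capital of France?\nAnswer:"}]

# Render the templated prompt as a string; add_generation_prompt=True appends the
# assistant header so the model continues as the assistant.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)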

@StellaAthena
Member

My previous hangup with HF's chat templating support was that it was new and wasn't clear if people were going to use it. But I looked and quite a few notable finetunes seem to use the feature, so I think using HF's setup may be the way to go.

I concur

@baberabb
Contributor

baberabb commented Dec 14, 2023

  • How should we allow for users to specify a system prompt, if at all?

IMO we probably should allow users to pass system prompts. The Llama 2 paper provides a lot of detail about the different prompts they used (for human evaluation, though).

[Screenshot omitted: prompt examples from the Llama 2 paper]

The default behaviour of the HF tokenizer seems to be to apply the chat template to all segments and tokenize in one go. This question might be more relevant for multi-turn evaluations.
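
A small sketch of that default single-pass behaviour, versus rendering the string first and tokenizing it separately (model name is an illustrative assumption):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "fewshot_prompt_1"},
    {"role": "assistant", "content": "fewshot_answer_1"},
    {"role": "user", "content": "final_prompt"},
]

# Default: the template is rendered over the full message list and the result is
# tokenized in a single call, returning token ids.
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Alternatively, render without tokenizing and handle tokenization separately
# (e.g. per segment, as the Mistral docs describe).
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)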

@anjor
Contributor

anjor commented Dec 26, 2023

There is this effort to handle the different chat formats and conversions in between -- https://github.com/deployradiant/pychatml

@lewtun

lewtun commented Feb 9, 2024

This feature would be great for benchmarks like IFEval, which expect chat-formatted input!

Regarding chat templates in transformers models: these can be inferred from the chat_template attribute of the tokenizer config, so one possible way to do this would be to apply it directly if it exists, or allow the user to override it by passing the Jinja string in a --chat_template arg.

For the system prompt, I suppose a --system_prompt arg could be added which then inserts a system role/content into the messages.
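
A rough sketch of how those pieces could fit together (hypothetical: --chat_template and --system_prompt are the proposed arg names above, not an existing harness interface, and the model name is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# 1) The template can be inferred from the tokenizer config, if present.
template = tokenizer.chat_template

# 2) Or overridden by a user-supplied Jinja string (e.g. from a --chat_template arg).
user_supplied_jinja = None  # would come from the CLI in this proposal
if user_supplied_jinja is not None:
    tokenizer.chat_template = user_supplied_jinja

# 3) A --system_prompt arg would simply insert a system role/content at the front.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "final_prompt"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)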

Finally, for few-shot I believe the recommended formatting is to provide examples as alternating user/assistant pairs like this:

[
    {"role": "user", "content": "fewshot_prompt_1"},
    {"role": "assistant", "content": "fewshot_answer_1"},
    {"role": "user", "content": "fewshot_prompt_2"},
    {"role": "assistant", "content": "fewshot_answer_2"},
    {"role": "user", "content": "final_prompt"}
]

Which in ChatML would produce something like

<|im_start|>user
fewshot_prompt_1<|im_end|>
<|im_start|>assistant
fewshot_answer_1<|im_end|>
<|im_start|>user
fewshot_prompt_2<|im_end|>
<|im_start|>assistant
fewshot_answer_2<|im_end|>
<|im_start|>user
final_prompt<|im_end|>
<|im_start|>assistant

@daniel-furman

daniel-furman commented Feb 9, 2024

@lewtun see branch “add_chat_templating”. We are implementing with the HF tokenizer. One problem is figuring out how to pick out the few shot examples for different kinds of tests - at the moment the code just wraps the whole context as one user message, but I agree it would be ideal to have each few shot example as a new user/assistant entry in the list of messages. I’m trying this on my own fork in a hacky way just to see what the results are for specific tests. Good tip on the IFEval I will check that out! CC @haileyschoelkopf

@daniel-furman

@haileyschoelkopf is IFEval in the harness yet (https://arxiv.org/pdf/2311.07911.pdf)? Not seeing it after running:

!lm-eval --tasks list

@haileyschoelkopf
Contributor Author

It is, but it requires installation of the ifeval extra first!

(We should see about making a clearer listing for this case.)

@lewtun

lewtun commented Feb 12, 2024

@lewtun see branch “add_chat_templating”. We are implementing with the HF tokenizer. One problem is figuring out how to pick out the few shot examples for different kinds of tests - at the moment the code just wraps the whole context as one user message, but I agree it would be ideal to have each few shot example as a new user/assistant entry in the list of messages. I’m trying this on my own fork in a hacky way just to see what the results are for specific tests. Good tip on the IFEval I will check that out! CC @haileyschoelkopf

Very nice! Feel free to ping me or @Rocketknight1 (the creator of the templating system in transformers) if you need any help 🤗

@daniel-furman

daniel-furman commented Feb 15, 2024

@lewtun @haileyschoelkopf I'm currently running a test with Mixtral-8x7B-Instruct on IFEval with and without chat templating, will report back my results tomorrow morning.

First element before prompt formatting...
('Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.', {'until': [], 'do_sample': False, 'temperature': 0.0, 'max_gen_toks': 1280})
First element after prompt formatting...
('<s>[INST] Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*. [/INST]', {'until': [], 'do_sample': False, 'temperature': 0.0, 'max_gen_toks': 1280})

@daniel-furman

^ results for above are super promising!

W/o prompt formatting:
[results table not preserved; columns included Tasks, Version, Filter, n-shot]

W/ prompt formatting:
[results table not preserved; columns included Tasks, Version, Filter, n-shot]

!lm_eval --model hf \
    --model_args=pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,dtype="bfloat16",load_in_4bit=True,use_chat_template=True \
    --tasks ifeval \
    --batch_size 2 \
    --output_path /content/output/Mixtral-8x7B-Instruct-v0.1 \
    --log_samples \
    --device cuda:0 \
    --num_fewshot 0

Changing use_chat_template to True/False is the only change between runs!

@haileyschoelkopf
Contributor Author

That's fantastic to hear @daniel-furman !

I'm sorry for the delay in pushing this feature forward (and the fact that the system prompt in this branch is not yet configurable).

The major blocker to getting this merged is that, in my testing, scores were worse for models evaluated on tasks like arc_easy, because the whitespace needs to change when wrapping the input in a chat template. We need to handle this in a way that is intuitive for users, without overly complicating the task config files. Additionally, we still need to decide how to configure tasks' prompts so that one can have a "trigger" like Let's think step by step. attached to the beginning of the final model chat turn.
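
One possible way to attach such a trigger to the start of the final assistant turn (a hypothetical sketch, not what the branch currently does): render the prompt including the generation header, then append the trigger text so the model continues from it.

from transformers import AutoTokenizer

# Illustrative model; messages would be the task's few-shot and question turns.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [{"role": "user", "content": "Q: What is 2 + 2?\nA:"}]

# Render the prompt up to and including the assistant header...
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ...then append the trigger so it becomes the beginning of the model's final turn.
prompt += "Let's think step by step."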

@daniel-furman

@haileyschoelkopf No worries! This IFEval result is the first positive swing I have seen in my testing.

Couple of thoughts...
a) Wouldn't we want that trigger at the end of each model chat turn?
b) I think the biggest blocker is getting the few-shot examples to act as new user/assistant messages, as per @lewtun's suggestion above (perhaps this is captured in your update above, but just confirming / putting it into new language).
c) @lewtun's change in code review will fix the system prompt configurability.
d) I found one other small change necessary for the generate_until helper.

@daniel-furman

daniel-furman commented Feb 19, 2024

@haileyschoelkopf @lewtun I went a little overboard testing IFEval this weekend...

[Image not preserved: table of IFEval results with and without chat templating]

Experiments were executed with models in half precision (bfloat16), on a workstation equipped with 2x H100 80 GB SXM5 GPUs, using my fork of the lm-eval package at hash 0c0c314c0df4c10f35bf7c17dc80f745f8027e9b.

More details can be found at towardsdatascience.com/evaluations-with-chat-formats-7604067023c9

CC @Rocketknight1 @clefourrier

@clefourrier
Contributor

Btw, the user/assistant turns are something we added in lighteval here, so if you want to reuse this snippet in the harness feel free to do so :)
(The finished logic is in this PR)

@clefourrier
Contributor

clefourrier commented Feb 20, 2024

Super happy to see the feature coming to the harness! It's been highly requested and it will really be interesting to see how it changes rankings on the leaderboard 🔥
(@daniel-furman 's analysis already seems to show it would shuffle things quite a bit 😈 )

@baberabb
Contributor

Btw, the user/assistant turns are something we added in lighteval here, so if you want to reuse this snippet in the harness feel free to do so :) (The finished logic is in this PR)

The alternation between "user" and "assistant" for the few-shot text and targets does seem more intuitive! It would require a clearer delineation between examples, though, which many tasks don't have right now. It would also probably require pushing the few-shot construction down to the model level rather than sending it through as one big blob.
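
For illustration, a hypothetical helper along those lines (the function name and the (question, answer) pair format are assumptions, not the harness's actual internals):

def build_fewshot_messages(fewshot_pairs, final_question, system_prompt=None):
    """Turn (question, answer) few-shot pairs into alternating user/assistant messages."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    for question, answer in fewshot_pairs:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": final_question})
    return messages

# Example usage:
msgs = build_fewshot_messages([("Q: 1+1?", "A: 2"), ("Q: 2+2?", "A: 4")], "Q: 3+3?")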

@Rocketknight1

Hey @daniel-furman that table is great! Can I use it in a tweet about chat templates?

@daniel-furman

@Rocketknight1 you betcha, go for it! Just to note: this is the first eval where I am seeing a positive swing from chat templating. The Open LLM Leaderboard evals are showing a regression after applying templates; further development is needed to pinpoint the cause of that result.

@Rocketknight1

That's surprising! I would expect that chat models should work much better in basically all cases when their correct template is used

@daniel-furman

Those results are pending spacing issues on continuation / few shot setup / testing more models. @haileyschoelkopf was seeing the same dips on her initial tests. There’s a reason I picked IFEval :).

@lewtun

lewtun commented Feb 21, 2024

Those results are pending spacing issues on continuation / few shot setup / testing more models. @haileyschoelkopf was seeing the same dips on her initial tests. There’s a reason I picked IFEval :).

@clefourrier do you also see performance drops when evaluating chat models with templates in lighteval?

@clefourrier
Contributor

@lewtun Yep! But I'll be able to do a more in-depth and comparable analysis once we integrate ifeval, to see if we observe the same.

@haileyschoelkopf haileyschoelkopf added this to the v0.4.3 milestone Mar 15, 2024
@monk1337

monk1337 commented Apr 7, 2024

@daniel-furman Awesome, is there a way I can define a custom template? I am using your forked lm_eval repo.

@clefourrier
Contributor

@haileyschoelkopf maybe we can close this discussion now that the feature has been added? :)
