Skip to content

Commit

Permalink
Add chat template (#1873)
Browse files Browse the repository at this point in the history
* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
  • Loading branch information
4 people committed Jun 3, 2024
1 parent 3e500e9 commit 070d31d
Show file tree
Hide file tree
Showing 9 changed files with 377 additions and 40 deletions.
6 changes: 6 additions & 0 deletions docs/interface.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,12 @@ This mode supports a number of command-line arguments, the details of which can

- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.

- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.

- `--apply_chat_template` : If this flag is on, a chat template will be applied to the prompt. For Hugging Face models, the chat template is taken from the tokenizer, if the tokenizer does not have a chat template, a default one will be applied. For other models, chat templating is not currently implemented.

- `--fewshot_as_multiturn` : If this flag is on, the Fewshot examples are treated as a multi-turn conversation. Questions are provided as user content and answers are provided as assistant responses. Requires `--num_fewshot` to be set to be greater than 0, and `--apply_chat_template` to be on.

- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.

* `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g, `--seed 42` sets all three seeds to 42.
Expand Down
47 changes: 47 additions & 0 deletions docs/model_guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,53 @@ Using this decorator results in the class being added to an accounting of the us

We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .

## Chat Templating

Many models are fine-tuned with a [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating) in order to enable back-and-forth interaction between a "User"'s queries and the model (often called "Assistant")'s responses. It can be desirable to evaluate fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.

In order to make your model optionally compatible with a chat format, three additional methods must be implemented:

```python
class MyCustomLM(LM):
#...
@property
def tokenizer_name(self) -> str:
# should return a string denoting the name of the model's tokenizer and/or the accompanying chat template.

@property
def chat_template(self) -> str:
# should return a chat template formatting string that is used to build prompt from a user/assistant chat history.
# this will be saved in the evaluation results for reproducibility.

def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
# responsible for taking as input a chat history that would be fed into the model, and
# rendering it as a string that can be then tokenized and input into the model.
#...
```

- `apply_chat_template`
- This method performs the bulk of the work required for chat-formatting.
- As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to
```
[
{"system": <user-provided system message such as "You are a helpful math-focused chatbot">},
{"user": <task example - a few-shot example 'input'>}
{"assistant": <correct response to the above example>},
# ... more few-shot examples, potentially
{"user": <test set query--response on which we will evaluate>},
]
```
which can then be converted into a string input.
- The output is a string representing this conversation that can be fed into the model.
- For example, this consists of simply calling `tokenizer.apply_chat_template` for HFLM--see the implementation there for reference.
- `tokenizer_name`
- LM Eval Harness supports [caching requests](https://github.com/EleutherAI/lm-evaluation-harness/blob/4902aaaf1f374682f95ac25fe2e13b23faddc91a/lm_eval/__main__.py#L140) that are sent to a model, for faster setup when repeating an already-performed evaluation.
- However, we don't want to use the cache of chat transcripts rendered using one chat template or system prompt to send to a model with a different template! So, we use this `lm.tokenizer_name` string to distinguish caches for a given model (and chat template) from one another.
- `chat_template`
- Chat templates are typically provided as a Jinja template string or a string formatted with str.format to include user and assistant messages in a single prompt. This template string is saved in the evaluation results to ensure reproducibility.

If not implemented for a given model type, the flags `--apply_chat_template` , `--fewshot_as_multiturn`, and `--system_instruction` cannot be used.

## Other

**Pro tip**: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate it, HuggingFace models come in-built with the ability to provide responses on data points in *descending order by total input length* via `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.hf_causal.HFLM` to see how this is done, and see if you can implement it in your own model!
Expand Down
38 changes: 34 additions & 4 deletions lm_eval/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,24 @@ def setup_parser() -> argparse.ArgumentParser:
default=False,
help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Use with --output_path.",
)
parser.add_argument(
"--system_instruction",
type=str,
default=None,
help="System instruction to be used in the prompt",
)
parser.add_argument(
"--apply_chat_template",
action="store_true",
default=False,
help="If True, applies the chat template to the prompt",
)
parser.add_argument(
"--fewshot_as_multiturn",
action="store_true",
default=False,
help="If True, uses the fewshot as a multi-turn conversation",
)
parser.add_argument(
"--show_config",
action="store_true",
Expand Down Expand Up @@ -261,10 +279,6 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
args.hf_hub_log_args += f",token={os.environ.get('HF_TOKEN')}"
evaluation_tracker_args = simple_parse_args_string(args.hf_hub_log_args)
evaluation_tracker = EvaluationTracker(**evaluation_tracker_args)
evaluation_tracker.general_config_tracker.log_experiment_args(
model_source=args.model,
model_args=args.model_args,
)

if args.predict_only:
args.log_samples = True
Expand All @@ -273,6 +287,18 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
"Specify --output_path if providing --log_samples or --predict_only"
)

if args.fewshot_as_multiturn and args.apply_chat_template is False:
raise ValueError(
"If fewshot_as_multiturn is set, apply_chat_template must be set to True."
)

if (
args.num_fewshot is None or args.num_fewshot == 0
) and args.fewshot_as_multiturn:
raise ValueError(
"If fewshot_as_multiturn is set, num_fewshot must be greater than 0."
)

if args.include_path is not None:
eval_logger.info(f"Including path: {args.include_path}")
task_manager = TaskManager(args.verbosity, include_path=args.include_path)
Expand Down Expand Up @@ -353,6 +379,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
check_integrity=args.check_integrity,
write_out=args.write_out,
log_samples=args.log_samples,
evaluation_tracker=evaluation_tracker,
system_instruction=args.system_instruction,
apply_chat_template=args.apply_chat_template,
fewshot_as_multiturn=args.fewshot_as_multiturn,
gen_kwargs=args.gen_kwargs,
task_manager=task_manager,
verbosity=args.verbosity,
Expand Down
36 changes: 35 additions & 1 deletion lm_eval/api/model.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
import json
import logging
import os
from typing import List, Optional, Tuple, Type, TypeVar
from typing import Dict, List, Optional, Tuple, Type, TypeVar

import transformers
from sqlitedict import SqliteDict
Expand Down Expand Up @@ -114,6 +114,20 @@ def generate_until(self, requests) -> List[str]:
"""
pass

def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
"""
Defines how to transform few-shot examples provided as chat history into a format that can be used as input to the LM.
:param chat_history: list[dict[str, str]]
A list of dictionaries with keys 'role' and 'content'.
Values are strings representing the role name and the content of the message, respectively.
:return: str
A string representing the chat history in a format that can be used as input to the LM.
"""
raise NotImplementedError(
"To use this model with chat templates, please implement the 'apply_chat_template' method for your model type."
)

@classmethod
def create_from_arg_string(
cls: Type[T], arg_string: str, additional_config: Optional[dict] = None
Expand Down Expand Up @@ -169,6 +183,26 @@ def world_size(self):
# not support multi-device parallelism nor expect it.
return self._world_size

@property
def tokenizer_name(self) -> str:
"""Must be defined for LM subclasses which implement Chat Templating.
Should return the name of the tokenizer or chat template used.
Used only to properly fingerprint caches when requests are being cached with `--cache_requests`, otherwise not used.
"""
raise NotImplementedError(
"To use this model with chat templates, please implement the 'tokenizer_name' property."
)

@property
def chat_template(self) -> str:
"""Must be defined for LM subclasses that implement Chat Templating.
Should return the structure of the chat template applied to user/assistant messages.
This is used only to save in the experiment results for reproducibility.
"""
raise NotImplementedError(
"To use this model with chat templates, please implement the 'chat_template' property."
)

def set_cache_hook(self, cache_hook) -> None:
self.cache_hook = cache_hook

Expand Down
96 changes: 69 additions & 27 deletions lm_eval/api/samplers.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,37 +42,79 @@ def get_context(self, doc, num_fewshot):
# TODO: should we just stop people from using fewshot from same split as evaluating?
selected_docs = [x for x in fewshotex if x != doc][:num_fewshot]

labeled_examples = (
self.fewshot_delimiter.join(
[
# TODO: is separating doc_to_text and doc_to_target by one space always desired?
(
self.doc_to_text(doc)
if (
self.config.doc_to_choice is None
or isinstance(self.doc_to_text(doc), str)
)
else self.doc_to_choice(doc)[self.doc_to_text(doc)]
)
+ self.target_delimiter
+ (
str(self.doc_to_target(doc)[0])
if isinstance(self.doc_to_target(doc), list)
else self.doc_to_target(doc)
if (
self.config.doc_to_choice is None
or isinstance(self.doc_to_target(doc), str)
)
else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
)
for doc in selected_docs
]
labeled_examples = ""
for doc in selected_docs:
doc_content = self.doc_to_text(doc)
doc_target = self.doc_to_target(doc)
labeled_examples += (
doc_content
if self.config.doc_to_choice is None or isinstance(doc_content, str)
else self.doc_to_choice(doc)[doc_content]
)
+ self.fewshot_delimiter
)
labeled_examples += self.target_delimiter
labeled_examples += (
str(doc_target[0])
if isinstance(doc_target, list)
else doc_target
if self.config.doc_to_choice is None or isinstance(doc_target, str)
else str(self.doc_to_choice(doc)[doc_target])
)
labeled_examples += self.fewshot_delimiter

return labeled_examples

def get_chat_context(
self,
doc,
num_fewshot,
fewshot_as_multiturn: bool = False,
):
chat_history = []
# draw an extra fewshot sample if using same split as evaluating on
n_samples = (
num_fewshot + 1
if self.config.fewshot_split == self.config.test_split
else num_fewshot
)
# draw `n_samples` docs from fewshot_docs
fewshotex = self.sample(n_samples)

# get rid of the doc that's the one we're evaluating, if it's in the fewshot
# TODO: should we just stop people from using fewshot from same split as evaluating?
selected_docs = [x for x in fewshotex if x != doc][:num_fewshot]

if fewshot_as_multiturn:
for doc in selected_docs:
doc_content = self.doc_to_text(doc)
doc_target = self.doc_to_target(doc)
chat_history.append(
{
"role": "user",
"content": doc_content
if self.config.doc_to_choice is None
or isinstance(doc_content, str)
else self.doc_to_choice(doc)[doc_content],
}
)
chat_history.append(
{
"role": "assistant",
"content": str(doc_target[0])
if isinstance(doc_target, list)
else doc_target
if self.config.doc_to_choice is None
or isinstance(doc_target, str)
else str(self.doc_to_choice(doc)[doc_target]),
}
)
else:
# get fewshot context as one user turn
chat_history.append(
{"role": "user", "content": self.get_context(doc, num_fewshot)}
)

return chat_history

def sample(self, n):
"""
Draw `n` samples from our fewshot docs. This method should be overridden by subclasses.
Expand Down
Loading

0 comments on commit 070d31d

Please sign in to comment.