# Releasing LM-Evaluation-Harness v0.4.0

With so much work being published in the field today, it helps to have a tool that people can easily use to share their results and to check the results of others, so that reported numbers can be verified. The LM Evaluation Harness is one such tool, and the community has used it extensively. We want to keep supporting the community, and with that in mind we're excited to announce a major update to the LM Evaluation Harness that furthers our goal of open and accessible AI research.

Our refactor stems from our desire to make the following practices, which we believe to be best practices, easier to carry out.

1.   Never copy results from other papers
2.   Always share your exact prompts
3.   Always provide model outputs
4.   Qualitatively review a small batch of outputs before running evaluation jobs at scale

We also wanted to make the library more pleasant to use and to make it easier to contribute or design evaluations within it. New features in this release that serve this purpose include:

1. Faster evaluation runtimes (accelerated data-parallel inference with HF Transformers + Accelerate, and support for commonly used or faster inference libraries such as vLLM and Llama-CPP)
2. Easier addition and sharing of new tasks (YAML-based task config formats, allowing single-file sharing of custom tasks)
3. More configurability, for more advanced workflows and easier prompt modification
4. Better logging of data at runtime and post-hoc

In this notebook, we walk through a short tutorial on how the new release works.

## Install LM-Eval

In [2]:
# Install LM-Eval
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-_333okp5
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-_333okp5
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 867413f8677f00f6a817262727cbb041bf36192a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.5)
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jsonlines (from lm_eval==0.4.5)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting peft>=0.2.0 (from lm_eval==0.4.5)
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Collecting pytablewriter (from lm_eval==0.4.5)
  Downloading pytablewriter-1.2.0-py3-none-a
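
Installing from the Git repository tracks the latest commit rather than a fixed release, so it can help to confirm which version actually ended up in the environment. The following is a minimal sketch using only the standard library. The distribution name `lm_eval` is taken from the pip log above, and `installed_version` is a hypothetical helper, not part of LM-Eval.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "lm_eval" is the distribution name that appears in the pip log.
print(installed_version("lm_eval"))
```

Printing `None` here would suggest that the install did not complete in the current kernel.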

## Create new evaluation tasks with config-based tasks

Even within the same task, many works have reported numbers based on different evaluation choices. Some report on the test set, others on the validation set or even a subset of the training set. Others use specialized prompts and verbalizers. We introduce YAML configs so that users can easily create such variations. The refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the corresponding attributes in the config file. There, users can define the task they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits to use, and much more. Key prompting configurations such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with Jinja2 templates, which allow high-level scripting to transform a HF dataset row into the text string that is passed to the model.
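
To make the templating idea concrete, here is a simplified stand-in for what a `doc_to_text` template does to one dataset row. It is not Jinja2 and not LM-Eval's code: it only handles plain `{{field}}` placeholders with a regular expression, whereas real Jinja2 also supports filters, conditionals, and loops. The helper `render_prompt`, the template, and the document fields `passage` and `question` are all hypothetical.

```python
import re

# Matches a plain {{ field }} placeholder. This is a deliberately tiny
# subset of Jinja2 syntax, used here only for illustration.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, doc: dict) -> str:
    """Replace each {{field}} placeholder with the matching value from doc."""

    def substitute(match: re.Match) -> str:
        field = match.group(1)
        if field not in doc:
            raise KeyError(f"document has no field {field!r}")
        return str(doc[field])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical document, shaped like one row of a HF dataset.
doc = {
    "passage": "The harness supports YAML task configs.",
    "question": "Does the harness support YAML task configs",
}
template = "{{passage}}\nQuestion: {{question}}?\nAnswer:"

prompt = render_prompt(template, doc)
print(prompt)
```

Changing the prompt format then means editing one template string rather than overriding a method.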



A core feature of LM-Eval is configuring tasks with YAML configs. With configs, you can fill in preset fields to easily set up a task.

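Here, we write a demo YAML config for a multiple-choice evaluation. Despite the `boolq` in the variable and file names, the config below points at the `AndriyBilinskiy/logic_questions_ukr` dataset: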

In [3]:
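# Raw string, so that "\n" reaches the YAML parser as an escape sequence.
# In a regular string, Python would turn it into a real line break, which
# YAML folds into a single space inside a double-quoted scalar.
YAML_boolq_string = r"""
task: demo4
dataset_path: AndriyBilinskiy/logic_questions_ukr
dataset_name: default
output_type: multiple_choice
training_split: train
validation_split: train
doc_to_text: "Context: {{text}}\nQuestion: {{question}}?\nSelect the correct answer."
doc_to_target: answer
doc_to_choice: options
metric_list:
  - metric: acc
"""
# Write the config to the working directory so that --include_path ./ finds it.
with open("boolq3.yaml", "w") as f:
    f.write(YAML_boolq_string)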
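
The config above sets `output_type: multiple_choice` together with `doc_to_choice` and `doc_to_target`. Roughly, each answer option becomes its own scoring request, and the option the model finds most likely is compared with the target. The sketch below illustrates that idea with toy numbers. It is not LM-Eval's implementation: the helpers `build_requests` and `predict`, the leading-space delimiter, the assumption that `answer` holds an integer index, and the scores themselves are all hypothetical.

```python
from typing import List, Tuple


def build_requests(context: str, choices: List[str]) -> List[Tuple[str, str]]:
    """Pair the shared prompt with each option as a candidate continuation."""
    # Assumption: a single space separates the prompt from the option text.
    return [(context, " " + choice) for choice in choices]


def predict(loglikelihoods: List[float]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)


# Hypothetical document using the field names from the config.
doc = {"options": ["A", "B", "C", "D"], "answer": 2}
context = "Context: ...\nQuestion: ...?\nSelect the correct answer."

requests = build_requests(context, doc["options"])

# Toy scores standing in for the model's log-likelihood of each continuation.
toy_scores = [-4.1, -3.7, -2.9, -5.0]
prediction = predict(toy_scores)
is_correct = prediction == doc["answer"]

print(len(requests), prediction, is_correct)
```

Under this reading, a dataset with four options per question produces four requests per document.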

In [None]:
!ls ./

We can now run an evaluation on this task. After logging in to the Hugging Face Hub, we point `--include_path` at the directory that contains the config file we've just created:

In [7]:
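import os

from huggingface_hub import login

# Read the access token from the HF_TOKEN environment variable instead of
# hardcoding a secret in the notebook.
login(token=os.environ["HF_TOKEN"])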

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [8]:
!lm_eval \
    --model hf \
    --model_args pretrained=bigscience/bloom-3b \
    --include_path ./ \
    --tasks demo4 \
    --output_path output/demo/ \
    --log_samples


ukr_questions.json: 100%|████████████████████| 214k/214k [00:00<00:00, 11.4MB/s]
Generating train split: 99 examples [00:00, 2256.85 examples/s]
100%|█████████████████████████████████████████| 99/99 [00:00<00:00, 1787.01it/s]
Running loglikelihood requests: 100%|█████████| 396/396 [02:12<00:00,  2.99it/s]
fatal: not a git repository (or any parent up to mount point /kaggle)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
hf (pretrained=bigscience/bloom-3b), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-----|-------|------|-----:|------|---|-----:|---|-----:|
|demo4|Yaml   |none  |     0|acc   |↑  |0.2727|±  | 0.045|
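
The numbers in this log can be cross-checked with a little arithmetic. The sketch below relies on two inferences that the log does not state directly: that 396 requests over 99 examples means four options per question, and that an accuracy of 0.2727 corresponds to 27 correct answers. The standard error formula (sample standard deviation of the 0/1 scores divided by the square root of the sample size) is also an assumption, chosen because it reproduces the reported value.

```python
import math

n_docs = 99            # "Generating train split: 99 examples"
n_requests = 396       # "Running loglikelihood requests: ... 396/396"
reported_acc = 0.2727  # value column of the results table

# Inferred: one request per answer option, so four options per question.
choices_per_doc = n_requests // n_docs
chance_acc = 1 / choices_per_doc

# Inferred: the number of correct answers behind the reported accuracy.
n_correct = round(reported_acc * n_docs)
acc = n_correct / n_docs

# Assumed formula: sample standard deviation of the 0/1 scores over sqrt(n),
# which simplifies to sqrt(p * (1 - p) / (n - 1)).
stderr = math.sqrt(acc * (1 - acc) / (n_docs - 1))

print(f"options per question: {choices_per_doc}")
print(f"chance accuracy:      {chance_acc:.4f}")
print(f"accuracy:             {acc:.4f} ({n_correct}/{n_docs})")
print(f"standard error:       {stderr:.4f}")
```

If the four-option inference holds, the measured accuracy sits within one standard error of the 0.25 chance level, so this run does not show the model doing better than guessing on this task.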


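Because the run used `--log_samples`, per-example outputs should be available for the qualitative review recommended at the start of this notebook. The sketch below assumes that the samples are stored as JSON Lines files somewhere under the output directory. The exact file names and record fields depend on the installed version, so the helper `load_samples` (hypothetical, not part of LM-Eval) searches recursively and makes no assumption about the keys.

```python
import json
from pathlib import Path
from typing import List


def load_samples(output_dir: str, limit: int = 5) -> List[dict]:
    """Return at most `limit` records in total from *.jsonl files under output_dir."""
    root = Path(output_dir)
    samples: List[dict] = []
    if not root.is_dir():
        return samples
    for path in sorted(root.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if len(samples) >= limit:
                    return samples
                line = line.strip()
                if line:  # skip blank lines
                    samples.append(json.loads(line))
    return samples


# Inspect which fields each logged record carries before reading them closely.
for record in load_samples("output/demo/"):
    print(sorted(record.keys()))
```

Reading a handful of records this way before launching larger jobs is a cheap check that the prompts and targets look as intended.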