
Fix for bootstrap_iters = 0 case (#1715) #1789

Merged

merged 5 commits into main on May 24, 2024
Conversation

haileyschoelkopf
Contributor

closes #1715 .

Should be merged after #1775

@haileyschoelkopf haileyschoelkopf added the bug Something isn't working. label May 6, 2024
@lintangsutawika
Contributor

LGTM, don't forget the pre-commit before merging

@haileyschoelkopf haileyschoelkopf merged commit b043b05 into main May 24, 2024
4 of 8 checks passed
@haileyschoelkopf haileyschoelkopf deleted the 1715-nostderr-typerror branch May 24, 2024 15:37
notrichardren pushed a commit to steven-basart/lm-evaluation-harness that referenced this pull request May 31, 2024
* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit
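The guard this PR adds can be sketched in isolation roughly as follows. This is an illustrative standalone version in the spirit of the fix, not the harness's actual metrics code; the function name `bootstrap_stderr` and its signature are assumptions for the sketch:

```python
import random
import statistics


def bootstrap_stderr(values, bootstrap_iters, seed=1234):
    """Bootstrap the standard error of the mean of `values`.

    Returns None when bootstrap_iters <= 0, so callers can skip stderr
    reporting entirely instead of crashing (the bootstrap_iters=0 case).
    """
    if bootstrap_iters <= 0:
        return None  # stderr computation explicitly disabled
    rng = random.Random(seed)
    # Resample with replacement and take the spread of the resampled means.
    means = [
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(bootstrap_iters)
    ]
    return statistics.stdev(means)
```

With `bootstrap_iters=0` the function returns `None` rather than attempting a standard deviation over an empty list of resampled means.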
Mogreine pushed a commit to deepvk/lm-evaluation-harness that referenced this pull request Jun 25, 2024
* Update generate_until_template_yaml (EleutherAI#1546)

* Update ifeval.yaml (EleutherAI#1506)

* add Arabic EXAMS benchmark (EleutherAI#1498)

* add Arabic EXAMS benchmark

* fixed the linter issue, and add more information on the readme

* Update README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* AGIEval (EleutherAI#1359)

* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>

* cli_evaluate calls simple_evaluate with the same verbosity. (EleutherAI#1563)

* add manual tqdm disabling management (EleutherAI#1569)

* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix README section on vllm integration (EleutherAI#1579)

* Link to vllm integration

* add pip install .[vllm] cmd

* Fix Jinja template for Advanced AI Risk (EleutherAI#1587)

* Proposed approach for testing CLI arg parsing (EleutherAI#1566)

* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true

* Patch for Seq2Seq Model predictions (EleutherAI#1584)

* Differentiate _encode_pair setting for decoder and enc-dec models

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add start date in results.json (EleutherAI#1592)

* Cleanup for v0.4.2 release (EleutherAI#1573)

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py (EleutherAI#1593)

* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* use BOS token in loglikelihood (EleutherAI#1588)

* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Revert "Patch for Seq2Seq Model predictions (EleutherAI#1584)" (EleutherAI#1601)

This reverts commit b7923a8.

* fix gen_kwargs arg reading (EleutherAI#1607)

* fix until arg processing (EleutherAI#1608)

* Fixes to Loglikelihood prefix token / VLLM (EleutherAI#1611)

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

* Add ACLUE task (EleutherAI#1614)

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

* OpenAI Completions -- fix passing of unexpected 'until' arg (EleutherAI#1612)

* add logging of model args (EleutherAI#1619)

* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit

* Add vLLM FAQs to README (EleutherAI#1625) (EleutherAI#1633)

* peft Version Assertion (EleutherAI#1635)

* peft Version Assertion

* fix the linter issue

* Seq2seq fix (EleutherAI#1604)

* fix on --task list

* add fixes to tokenization

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>

* Integration of NeMo models into LM Evaluation Harness library (EleutherAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README

* Fix conditional import for Nemo LM class (EleutherAI#1641)

* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (EleutherAI#1647)

* Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

* Fix CLI --batch_size arg for openai-completions/local-completions (EleutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but specifying it on the CLI, i.e. `lm_eval --model openai-completions --batch_size 16 ...`, failed because of a simple missing str->int conversion.

This is confirmed by my usage and stacktrace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
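The fix amounts to converting the CLI string before it reaches the batching code, while keeping the special `"auto"` value intact. A minimal sketch of that conversion; the function name `parse_batch_size` is illustrative, not the harness's actual API:

```python
def parse_batch_size(raw):
    """Convert the CLI --batch_size value to something the batching code can use.

    argparse hands every argument over as a string, so "16" must become the
    int 16 before it is compared in `len(ret) >= size`; the sentinel "auto"
    is kept as-is for backends that support automatic batch sizing.
    """
    if isinstance(raw, int):
        return raw  # already numeric (e.g. set programmatically)
    if raw == "auto":
        return raw  # sentinel handled downstream
    return int(raw)
```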

* Patch QQP prompt (EleutherAI#1661)

* TMMLU+ implementation (EleutherAI#1394)

* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from both closed-source models and 20 open-weight Chinese large language models with 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

I forgot to remove a debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: we should use training set as few shots example

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run through

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------

Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Anthropic Chat API (EleutherAI#1594)

* claude3

* supply for anthropic claude3

* supply for anthropic claude3

* anthropic config changes

* add callback options on anthropic

* line passed

* claude3 tiny change

* help anthropic installation

* mention sysprompt / being careful with format in readme

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* correction bug EleutherAI#1664 (EleutherAI#1670)

* correction bug EleutherAI#1664

* handle invalid characters for Windows filenames and Unix-like systems

see:
https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3?permalink_comment_id=3958715
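The sanitization described here can be sketched as a small replacement pass over the character sets that are illegal on either platform. This is an assumed standalone illustration, not the harness's actual helper:

```python
import re

# Characters invalid in Windows filenames (< > : " / \ | ? *), plus the
# Unix path separator and NUL; ASCII control chars 0x00-0x1f are also
# rejected by Windows.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(name, replacement="_"):
    """Replace characters illegal on Windows or Unix-like systems."""
    return _INVALID.sub(replacement, name)
```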

* Update lm_eval/__main__.py

* Update scripts/zeno_visualize.py

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Update README.md (EleutherAI#1680)

* Add delta weights model loading (EleutherAI#1712)

* added delta weights

* removed debug

* readme update

* better error handling

* autogptq warn

* warn update

* peft and delta error, explicitly deleting _model_delta

* linter fix

* Add `neuralmagic` models for `sparseml` and `deepsparse` (EleutherAI#1674)

* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix error when appending eot_token_id for generate_until tasks (EleutherAI#1699)

* Adding retries and rate limit to toxicity tasks  (EleutherAI#1620)

* reference `--tasks list` in README (EleutherAI#1726)

EleutherAI#1698

* Add XNLIeu: a dataset for cross-lingual NLI in Basque (EleutherAI#1694)

* add xnli_eu tasks

* update tasks readme

* update readme

* Fix Parameter Propagation for Tasks that have `include`  (EleutherAI#1749)

* Update task.py

* Update __init__.py

* Support individual scrolls datasets (EleutherAI#1740)

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

* Add filter registry decorator (EleutherAI#1750)

* Add register_filter decorator

* Add register_filter docs
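A decorator-based registry of this kind is conventionally a dict keyed by name, populated at class-definition time. A minimal sketch of the pattern, with illustrative names rather than the harness's actual module layout:

```python
FILTER_REGISTRY = {}


def register_filter(name):
    """Class decorator: register a filter class under `name` so that task
    configs can look it up by string."""
    def decorate(cls):
        if name in FILTER_REGISTRY:
            raise ValueError(f"filter named '{name}' is already registered")
        FILTER_REGISTRY[name] = cls
        return cls
    return decorate


@register_filter("lowercase")
class LowercaseFilter:
    """Toy filter: lowercases each model response."""
    def apply(self, resps):
        return [r.lower() for r in resps]
```

A config can then resolve `"lowercase"` via `FILTER_REGISTRY["lowercase"]` without importing the class directly.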

* remove duplicated `num_fewshot: 0` (EleutherAI#1769)

* Pile 10k new task (EleutherAI#1758)

* Add Pile-10k readme

* Add Pile-10k task configuration file

* Fix m_arc choices (EleutherAI#1760)

* Update utils.py

This is a 4-choice task; option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* upload new tasks (EleutherAI#1728)

* upload new tasks

* add readmes

* run linters

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* vllm lora support (EleutherAI#1756)

* vllm lora support

* remove print

* version check, rename lora kwarg

* Add option to set OpenVINO config (EleutherAI#1730)

* Add option to set OpenVINO config

* Use utils.eval_logger for logging

* evaluation tracker implementation (EleutherAI#1766)

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

* eval tracker args fix (EleutherAI#1777)

* limit fix (EleutherAI#1785)

* remove echo parameter in OpenAI completions API (EleutherAI#1779)

* remove echo parameter in OpenAI completions API

* remove context length parameter doc string

* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (EleutherAI#1776)

fix `----hf_hub_log_args` to `--hf_hub_log_args`

* Fix bug in setting until kwarg in openai completions (EleutherAI#1784)

* Provide ability for custom sampler for ConfigurableTask (EleutherAI#1616)

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

Way to control seed of fewshot sampling
may help with EleutherAI#1591

* Added ability for custom sampler for ConfigurableTask

May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
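A sampler referenced this way might look roughly like the following standalone sketch; the class name and constructor arguments are illustrative assumptions, since the harness defines its own sampler interface:

```python
import random


class MyFewshotSampler:
    """Illustrative fewshot sampler: draws k docs deterministically per seed."""

    def __init__(self, docs, rnd_seed=1234):
        self.docs = list(docs)
        # Own RNG instance so sampling is reproducible and isolated
        # from the global random state.
        self.rnd = random.Random(rnd_seed)

    def sample(self, n):
        """Return n fewshot examples without replacement."""
        return self.rnd.sample(self.docs, n)
```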

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

* Update `--tasks list` option in interface documentation (EleutherAI#1792)

* Fix Caching Tests ; Remove `pretrained=gpt2` default (EleutherAI#1775)

* link to the example output on the hub (EleutherAI#1798)

* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (EleutherAI#1793)

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (EleutherAI#1774) (EleutherAI#1791)

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug

* Initial integration of the Unitxt to LM eval harness (EleutherAI#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed, printing a clear error message if not

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add task for mmlu evaluation in arc multiple choice format (EleutherAI#1745)

* add mmlu arc style evaluation

* rename arc_style to continuation

---------

Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

* Update flag `--hf_hub_log_args` in interface documentation (EleutherAI#1806)

* update interface documentation with flag --hf_hub_logs_arg

* update interface documentation with flag --hf_hub_logs_arg 2

* Copal task (EleutherAI#1803)

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` to yaml to make it work

* checkmark on README

* change group name to `copal_id`

* Adding tinyBenchmarks datasets (EleutherAI#1545)

* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* interface doc update (EleutherAI#1807)

* Fix links in README guiding to another branch (EleutherAI#1838)

* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the class dimension of predictions so that the score still works when `--limit` is used and the gold indices do not cover all classes.
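The point of the fix can be shown in a self-contained sketch: size the one-hot encoding by the prediction vectors' class dimension, never by the largest gold index that happens to appear in the (possibly limited) subset. This is an assumed illustration, not the harness's metric code:

```python
def brier_score(gold, predictions):
    """Mean squared error between predicted class probabilities and one-hot gold.

    The one-hot encoding is sized by the prediction vectors' class dimension
    (not by the largest gold index seen), so the score stays well-defined
    when --limit leaves some classes unrepresented in gold.
    """
    num_classes = len(predictions[0])  # follow the prediction dimension
    total = 0.0
    for g, probs in zip(gold, predictions):
        one_hot = [1.0 if i == g else 0.0 for i in range(num_classes)]
        total += sum((p - t) ** 2 for p, t in zip(probs, one_hot))
    return total / len(predictions)
```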

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary agrs

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add The Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version with refactor to Arabic pica to be in alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints

* Update __init__.py

* Update __init__.py
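The standard way to add such a directory filter is to prune `os.walk`'s `dirnames` list in place, which stops the walk from descending into the pruned directories at all. A hedged sketch of the idea; `iter_task_files` and its parameters are illustrative, not the harness's actual function:

```python
import os


def iter_task_files(root, ignore_dirs=(".ipynb_checkpoints",)):
    """Yield .yaml files under root, pruning unwanted directories in place.

    Mutating `dirnames` in place (dirnames[:] = ...) is how os.walk is told
    not to descend into a directory, e.g. Jupyter's .ipynb_checkpoints.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        for fname in filenames:
            if fname.endswith(".yaml"):
                yield os.path.join(dirpath, fname)
```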

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix precommit hook, update run_models.sh

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: Alex Bäuerle <alex@a13x.io>
Co-authored-by: Wongboo <44860323+Wongboo@users.noreply.github.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Rylan Schaeffer <rylanschaeffer@gmail.com>
Co-authored-by: Vicki Boykis <vicki@mozilla.ai>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Nouf M. Alotaibi <63472979+noufmitla@users.noreply.github.com>
Co-authored-by: Haonan Li <nathan.8270.n@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: WoosungMyung <115716986+LameloBally@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezperez24@gmail.com>
Co-authored-by: Or Sharir <or@sharir.org>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: ZoneTwelve <zonetwelve159@gmail.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: nicho2 <nicho2@laposte.net>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezpersonal@gmail.com>
Co-authored-by: sator-labs <129434630+sator-labs@users.noreply.github.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: giorgossideris <56915448+giorgossideris@users.noreply.github.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: Gabriel Mukobi <gabrielmukobi@gmail.com>
Co-authored-by: Zehan Li <69186130+jordane95@users.noreply.github.com>
Co-authored-by: Simran Arora <emailsimran@gmail.com>
Co-authored-by: bcicc <142823000+bcicc@users.noreply.github.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Muhammad Bin Usman <muhammadbin.2003@gmail.com>
Co-authored-by: ciaranby <48831615+ciaranby@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: aditya thomas <aditya.thomas@alum.mit.edu>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
Co-authored-by: Edd <68678137+Erland366@users.noreply.github.com>
Co-authored-by: Lucas Weber <35227161+LucWeber@users.noreply.github.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Mogreine pushed a commit to deepvk/lm-evaluation-harness that referenced this pull request Jun 25, 2024
* Update generate_until_template_yaml (EleutherAI#1546)

* Update ifeval.yaml (EleutherAI#1506)

* add Arabic EXAMS benchmark (EleutherAI#1498)

* add Arabic EXAMS benchmark

* fixed the linter issue, and add more information on the readme

* Update README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* AGIEval (EleutherAI#1359)

* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>

* cli_evaluate calls simple_evaluate with the same verbosity. (EleutherAI#1563)

* add manual tqdm disabling management (EleutherAI#1569)

* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix README section on vllm integration (EleutherAI#1579)

* Link to vllm integration

* add pip install .[vllm] cmd

* Fix Jinja template for Advanced AI Risk (EleutherAI#1587)

* Proposed approach for testing CLI arg parsing (EleutherAI#1566)

* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true

* Patch for Seq2Seq Model predictions (EleutherAI#1584)

* Differentiate _encode_pair setting for decoder and enc-dec models

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add start date in results.json (EleutherAI#1592)

* Cleanup for v0.4.2 release (EleutherAI#1573)

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py (EleutherAI#1593)

* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* use BOS token in loglikelihood (EleutherAI#1588)

* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Revert "Patch for Seq2Seq Model predictions (EleutherAI#1584)" (EleutherAI#1601)

This reverts commit b7923a8.

* fix gen_kwargs arg reading (EleutherAI#1607)

* fix until arg processing (EleutherAI#1608)

* Fixes to Loglikelihood prefix token / VLLM (EleutherAI#1611)

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

* Add ACLUE task (EleutherAI#1614)

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

* OpenAI Completions -- fix passing of unexpected 'until' arg (EleutherAI#1612)

* add logging of model args (EleutherAI#1619)

* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit

* Add vLLM FAQs to README (EleutherAI#1625) (EleutherAI#1633)

* peft Version Assertion (EleutherAI#1635)

* peft Version Assertion

* fix the linter issue

* Seq2seq fix (EleutherAI#1604)

* fix on --task list

* add fixes to tokeniation

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>

* Integration of NeMo models into LM Evaluation Harness library (EleutherAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README

* Fix conditional import for Nemo LM class (EleutherAI#1641)

* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (EleutherAI#1647)

* Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

* Fix CLI --batch_size arg for openai-completions/local-completions (EleutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but this could not be specified on the CLI (i.e. `lm_eval --model openai-completions --batch_size 16 ...`) because of a simple missing str->int conversion.

This is confirmed by my usage and stacktrace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
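
The fix described above can be sketched as a small coercion helper (illustrative only; the actual patch simply casts the CLI value where the batch size is read, and the function name here is an assumption):

```python
# Hypothetical sketch of the missing str->int conversion: CLI and model_args
# values arrive as strings, so a "16" must become 16 before it is compared
# against an int. The special "auto" batch-sizing value is left untouched.
def coerce_batch_size(value):
    if isinstance(value, str) and value != "auto":
        return int(value)
    return value


assert coerce_batch_size("16") == 16        # CLI string becomes a usable int
assert coerce_batch_size("auto") == "auto"  # auto-batching passes through
```

With a coercion like this in place, the `len(ret) >= size` comparison in `sameuntil_chunks` receives an int as intended.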

* Patch QQP prompt (EleutherAI#1661)

* TMMLU+ implementation (EleutherAI#1394)

* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from both closed-source models and 20 open-weight Chinese large language models with 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

I had forgotten to remove a debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: use the training set as the few-shot examples

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run-through

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------

Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Anthropic Chat API (EleutherAI#1594)

* claude3

* supply for anthropic claude3

* supply for anthropic claude3

* anthropic config changes

* add callback options on anthropic

* line passed

* claude3 tiny change

* help anthropic installation

* mention sysprompt / being careful with format in readme

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* correction bug EleutherAI#1664 (EleutherAI#1670)

* correction bug EleutherAI#1664

* add handling of invalid characters for Windows filenames and Unix-like systems

see:
https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3?permalink_comment_id=3958715

* Update lm_eval/__main__.py

* Update scripts/zeno_visualize.py

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Update README.md (EleutherAI#1680)

* Add delta weights model loading (EleutherAI#1712)

* added delta weights

* removed debug

* readme update

* better error handling

* autogptq warn

* warn update

* peft and delta error, explicitly deleting _model_delta

* linter fix

* Add `neuralmagic` models for `sparseml` and `deepsparse` (EleutherAI#1674)

* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix error when appending eot_token_id for generate_until tasks (EleutherAI#1699)

* Adding retries and rate limit to toxicity tasks  (EleutherAI#1620)

* reference `--tasks list` in README (EleutherAI#1726)

EleutherAI#1698

* Add XNLIeu: a dataset for cross-lingual NLI in Basque (EleutherAI#1694)

* add xnli_eu tasks

* update tasks readme

* update readme

* Fix Parameter Propagation for Tasks that have `include`  (EleutherAI#1749)

* Update task.py

* Update __init__.py

* Support individual scrolls datasets (EleutherAI#1740)

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

* Add filter registry decorator (EleutherAI#1750)

* Add register_filter decorator

* Add register_filter docs

* remove duplicated `num_fewshot: 0` (EleutherAI#1769)

* Pile 10k new task (EleutherAI#1758)

* Add Pile-10k readme

* Add Pile-10k task configuration file

* Fix m_arc choices (EleutherAI#1760)

* Update utils.py

This is a 4-choice task; option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* upload new tasks (EleutherAI#1728)

* upload new tasks

* add readmes

* run linters

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* vllm lora support (EleutherAI#1756)

* vllm lora support

* remove print

* version check, rename lora kwarg

* Add option to set OpenVINO config (EleutherAI#1730)

* Add option to set OpenVINO config

* Use utils.eval_logger for logging

* evaluation tracker implementation (EleutherAI#1766)

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

* eval tracker args fix (EleutherAI#1777)

* limit fix (EleutherAI#1785)

* remove echo parameter in OpenAI completions API (EleutherAI#1779)

* remove echo parameter in OpenAI completions API

* remove context length parameter doc string

* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (EleutherAI#1776)

fix `----hf_hub_log_args` to `--hf_hub_log_args`

* Fix bug in setting until kwarg in openai completions (EleutherAI#1784)

* Provide ability for custom sampler for ConfigurableTask (EleutherAI#1616)

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

Way to control seed of fewshot sampling
may help with EleutherAI#1591

* Added ability for custom sampler for ConfigurableTask

May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
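
For illustration, a custom sampler referenced this way might look like the following minimal sketch. This is a standalone stand-in: in the harness, the class would hook into the sampler interface (e.g. by subclassing the default context sampler), and the class and method names below are assumptions.

```python
import random

# Hypothetical few-shot sampler with a deterministic "first-n" policy instead
# of a random draw. In a real task this class would live in the task's
# utils.py and be referenced from YAML as `sampler: !function utils.MyFewshotSampler`.
class MyFewshotSampler:
    def __init__(self, docs, rnd=None):
        self.docs = list(docs)
        # Explicit seed mirrors the seeded few-shot sampling added above.
        self.rnd = rnd or random.Random(1234)

    def sample(self, n):
        # Always return the first n documents, making fewshot contexts reproducible.
        return self.docs[:n]


sampler = MyFewshotSampler([{"question": i} for i in range(10)])
assert sampler.sample(2) == [{"question": 0}, {"question": 1}]
```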

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

* Update `--tasks list` option in interface documentation (EleutherAI#1792)

* Fix Caching Tests ; Remove `pretrained=gpt2` default (EleutherAI#1775)

* link to the example output on the hub (EleutherAI#1798)

* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (EleutherAI#1793)

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (EleutherAI#1774) (EleutherAI#1791)

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug

* Initial integration of the Unitxt to LM eval harness (EleutherAI#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed, printing a clear error message if not

* Commited missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add task for mmlu evaluation in arc multiple choice format (EleutherAI#1745)

* add mmlu arc style evaluation

* rename arc_style to continuation

---------

Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

* Update flag `--hf_hub_log_args` in interface documentation (EleutherAI#1806)

* update interface documentation with flag --hf_hub_logs_arg

* update interface documentation with flag --hf_hub_logs_arg 2

* Copal task (EleutherAI#1803)

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` to yaml to make it work

* checkmark on README

* change group name to `copal_id`

* Adding tinyBenchmarks datasets (EleutherAI#1545)

* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* interface doc update (EleutherAI#1807)

* Fix links in README guiding to another branch (EleutherAI#1838)

* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the dimension of predictions so that it still works when `--limit` is used and the indexes in gold do not cover all gold indexes.
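
The dimension issue can be sketched in a few lines of NumPy (an illustrative reimplementation, not the harness's exact code): the one-hot matrix is sized from the prediction vectors, so classes absent from the limited gold labels still line up.

```python
import numpy as np

def brier_score(gold, predictions):
    predictions = np.asarray(predictions, dtype=float)
    # Size the one-hot encoding from the prediction dimension, NOT from the
    # gold labels observed so far -- under --limit, gold may miss some classes.
    num_classes = predictions.shape[-1]
    gold_one_hot = np.eye(num_classes)[np.asarray(gold)]
    return float(np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=-1)))


# With --limit, gold might only ever contain class 0 of a 3-class task:
assert brier_score([0, 0], [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]) == 0.0
```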

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* udpate

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary agrs

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add the Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version, with a refactor moving Arabic pica into the alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix precommit hook, update run_models.sh

* Rename main mmlu ru config

* Add ru continuation version

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: Alex Bäuerle <alex@a13x.io>
Co-authored-by: Wongboo <44860323+Wongboo@users.noreply.github.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Rylan Schaeffer <rylanschaeffer@gmail.com>
Co-authored-by: Vicki Boykis <vicki@mozilla.ai>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Nouf M. Alotaibi <63472979+noufmitla@users.noreply.github.com>
Co-authored-by: Haonan Li <nathan.8270.n@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: WoosungMyung <115716986+LameloBally@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezperez24@gmail.com>
Co-authored-by: Or Sharir <or@sharir.org>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: ZoneTwelve <zonetwelve159@gmail.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: nicho2 <nicho2@laposte.net>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezpersonal@gmail.com>
Co-authored-by: sator-labs <129434630+sator-labs@users.noreply.github.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: giorgossideris <56915448+giorgossideris@users.noreply.github.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: Gabriel Mukobi <gabrielmukobi@gmail.com>
Co-authored-by: Zehan Li <69186130+jordane95@users.noreply.github.com>
Co-authored-by: Simran Arora <emailsimran@gmail.com>
Co-authored-by: bcicc <142823000+bcicc@users.noreply.github.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Muhammad Bin Usman <muhammadbin.2003@gmail.com>
Co-authored-by: ciaranby <48831615+ciaranby@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: aditya thomas <aditya.thomas@alum.mit.edu>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
Co-authored-by: Edd <68678137+Erland366@users.noreply.github.com>
Co-authored-by: Lucas Weber <35227161+LucWeber@users.noreply.github.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Successfully merging this pull request may close these issues: Type error on stderr computation of accuracy