
Fix for bootstrap_iters = 0 case (#1715) #1789

Merged

merged 5 commits into main on May 24, 2024
Conversation

haileyschoelkopf
Contributor

closes #1715 .

Should be merged after #1775

@haileyschoelkopf haileyschoelkopf added the bug Something isn't working. label May 6, 2024
@lintangsutawika
Contributor

LGTM, don't forget the pre-commit before merging

@haileyschoelkopf haileyschoelkopf merged commit b043b05 into main May 24, 2024
4 of 8 checks passed
@haileyschoelkopf haileyschoelkopf deleted the 1715-nostderr-typerror branch May 24, 2024 15:37
notrichardren pushed a commit to steven-basart/lm-evaluation-harness that referenced this pull request May 31, 2024
* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit
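The guard this PR adds can be sketched in isolation roughly as follows. This is an illustrative standalone version in the spirit of the fix, not the harness's actual metrics code; the function name `bootstrap_stderr` and its signature are assumptions for the sketch:

```python
import random
import statistics


def bootstrap_stderr(values, bootstrap_iters, seed=1234):
    """Bootstrap the standard error of the mean of `values`.

    Returns None when bootstrap_iters <= 0, so callers can skip stderr
    reporting entirely instead of crashing (the bootstrap_iters=0 case).
    """
    if bootstrap_iters <= 0:
        return None  # stderr computation explicitly disabled
    rng = random.Random(seed)
    # Resample with replacement and take the spread of the resampled means.
    means = [
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(bootstrap_iters)
    ]
    return statistics.stdev(means)
```

With `bootstrap_iters=0` the function returns `None` rather than attempting a standard deviation over an empty list of resampled means.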
Mogreine pushed a commit to deepvk/lm-evaluation-harness that referenced this pull request Jun 25, 2024
* Update generate_until_template_yaml (EleutherAI#1546)

* Update ifeval.yaml (EleutherAI#1506)

* add Arabic EXAMS benchmark (EleutherAI#1498)

* add Arabic EXAMS benchmark

* fixed the linter issue, and add more information on the readme

* Update README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* AGIEval (EleutherAI#1359)

* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>

* cli_evaluate calls simple_evaluate with the same verbosity. (EleutherAI#1563)

* add manual tqdm disabling management (EleutherAI#1569)

* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix README section on vllm integration (EleutherAI#1579)

* Link to vllm integration

* add pip install .[vllm] cmd

* Fix Jinja template for Advanced AI Risk (EleutherAI#1587)

* Proposed approach for testing CLI arg parsing (EleutherAI#1566)

* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true

* Patch for Seq2Seq Model predictions (EleutherAI#1584)

* Differentiate _encode_pair setting for decoder and enc-dec models

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add start date in results.json (EleutherAI#1592)

* Cleanup for v0.4.2 release (EleutherAI#1573)

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py (EleutherAI#1593)

* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* use BOS token in loglikelihood (EleutherAI#1588)

* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Revert "Patch for Seq2Seq Model predictions (EleutherAI#1584)" (EleutherAI#1601)

This reverts commit b7923a8.

* fix gen_kwargs arg reading (EleutherAI#1607)

* fix until arg processing (EleutherAI#1608)

* Fixes to Loglikelihood prefix token / VLLM (EleutherAI#1611)

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

* Add ACLUE task (EleutherAI#1614)

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

* OpenAI Completions -- fix passing of unexpected 'until' arg (EleutherAI#1612)

* add logging of model args (EleutherAI#1619)

* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit

* Add vLLM FAQs to README (EleutherAI#1625) (EleutherAI#1633)

* peft Version Assertion (EleutherAI#1635)

* peft Version Assertion

* fix the linter issue

* Seq2seq fix (EleutherAI#1604)

* fix on --task list

* add fixes to tokenization

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>

* Integration of NeMo models into LM Evaluation Harness library (EleutherAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README

* Fix conditional import for Nemo LM class (EleutherAI#1641)

* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (EleutherAI#1647)

* Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

* Fix CLI --batch_size arg for openai-completions/local-completions (EleutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but specifying it on the CLI, i.e. `lm_eval --model openai-completions --batch_size 16 ...`, failed because of a simple missing str->int conversion.

This is confirmed by my usage and stacktrace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
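The fix amounts to converting the CLI string before it reaches the batching code, while keeping the special `"auto"` value intact. A minimal sketch of that conversion; the function name `parse_batch_size` is illustrative, not the harness's actual API:

```python
def parse_batch_size(raw):
    """Convert the CLI --batch_size value to something the batching code can use.

    argparse hands every argument over as a string, so "16" must become the
    int 16 before it is compared in `len(ret) >= size`; the sentinel "auto"
    is kept as-is for backends that support automatic batch sizing.
    """
    if isinstance(raw, int):
        return raw  # already numeric (e.g. set programmatically)
    if raw == "auto":
        return raw  # sentinel handled downstream
    return int(raw)
```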

* Patch QQP prompt (EleutherAI#1661)

* TMMLU+ implementation (EleutherAI#1394)

* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from both closed-source models and 20 open-weight Chinese large language models with 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

I forgot to remove a debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: we should use training set as few shots example

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run through

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------

Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Anthropic Chat API (EleutherAI#1594)

* claude3

* supply for anthropic claude3

* supply for anthropic claude3

* anthropic config changes

* add callback options on anthropic

* line passed

* claude3 tiny change

* help anthropic installation

* mention sysprompt / being careful with format in readme

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* correction bug EleutherAI#1664 (EleutherAI#1670)

* correction bug EleutherAI#1664

* handle invalid characters for Windows filenames and Unix-like systems

see:
https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3?permalink_comment_id=3958715
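The sanitization described here can be sketched as a small replacement pass over the character sets that are illegal on either platform. This is an assumed standalone illustration, not the harness's actual helper:

```python
import re

# Characters invalid in Windows filenames (< > : " / \ | ? *), plus the
# Unix path separator and NUL; ASCII control chars 0x00-0x1f are also
# rejected by Windows.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(name, replacement="_"):
    """Replace characters illegal on Windows or Unix-like systems."""
    return _INVALID.sub(replacement, name)
```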

* Update lm_eval/__main__.py

* Update scripts/zeno_visualize.py

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Update README.md (EleutherAI#1680)

* Add delta weights model loading (EleutherAI#1712)

* added delta weights

* removed debug

* readme update

* better error handling

* autogptq warn

* warn update

* peft and delta error, explicitly deleting _model_delta

* linter fix

* Add `neuralmagic` models for `sparseml` and `deepsparse` (EleutherAI#1674)

* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix error when appending eot_token_id for generate_until tasks (EleutherAI#1699)

* Adding retries and rate limit to toxicity tasks  (EleutherAI#1620)

* reference `--tasks list` in README (EleutherAI#1726)

EleutherAI#1698

* Add XNLIeu: a dataset for cross-lingual NLI in Basque (EleutherAI#1694)

* add xnli_eu tasks

* update tasks readme

* update readme

* Fix Parameter Propagation for Tasks that have `include`  (EleutherAI#1749)

* Update task.py

* Update __init__.py

* Support individual scrolls datasets (EleutherAI#1740)

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

* Add filter registry decorator (EleutherAI#1750)

* Add register_filter decorator

* Add register_filter docs
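A decorator-based registry of this kind is conventionally a dict keyed by name, populated at class-definition time. A minimal sketch of the pattern, with illustrative names rather than the harness's actual module layout:

```python
FILTER_REGISTRY = {}


def register_filter(name):
    """Class decorator: register a filter class under `name` so that task
    configs can look it up by string."""
    def decorate(cls):
        if name in FILTER_REGISTRY:
            raise ValueError(f"filter named '{name}' is already registered")
        FILTER_REGISTRY[name] = cls
        return cls
    return decorate


@register_filter("lowercase")
class LowercaseFilter:
    """Toy filter: lowercases each model response."""
    def apply(self, resps):
        return [r.lower() for r in resps]
```

A config can then resolve `"lowercase"` via `FILTER_REGISTRY["lowercase"]` without importing the class directly.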

* remove duplicated `num_fewshot: 0` (EleutherAI#1769)

* Pile 10k new task (EleutherAI#1758)

* Add Pile-10k readme

* Add Pile-10k task configuration file

* Fix m_arc choices (EleutherAI#1760)

* Update utils.py

This is a 4-choice task; option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* upload new tasks (EleutherAI#1728)

* upload new tasks

* add readmes

* run linters

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* vllm lora support (EleutherAI#1756)

* vllm lora support

* remove print

* version check, rename lora kwarg

* Add option to set OpenVINO config (EleutherAI#1730)

* Add option to set OpenVINO config

* Use utils.eval_logger for logging

* evaluation tracker implementation (EleutherAI#1766)

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

* eval tracker args fix (EleutherAI#1777)

* limit fix (EleutherAI#1785)

* remove echo parameter in OpenAI completions API (EleutherAI#1779)

* remove echo parameter in OpenAI completions API

* remove context length parameter doc string

* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (EleutherAI#1776)

fix `----hf_hub_log_args` to `--hf_hub_log_args`

* Fix bug in setting until kwarg in openai completions (EleutherAI#1784)

* Provide ability for custom sampler for ConfigurableTask (EleutherAI#1616)

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

Way to control seed of fewshot sampling
may help with EleutherAI#1591

* Added ability for custom sampler for ConfigurableTask

May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
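A sampler referenced this way might look roughly like the following standalone sketch; the class name and constructor arguments are illustrative assumptions, since the harness defines its own sampler interface:

```python
import random


class MyFewshotSampler:
    """Illustrative fewshot sampler: draws k docs deterministically per seed."""

    def __init__(self, docs, rnd_seed=1234):
        self.docs = list(docs)
        # Own RNG instance so sampling is reproducible and isolated
        # from the global random state.
        self.rnd = random.Random(rnd_seed)

    def sample(self, n):
        """Return n fewshot examples without replacement."""
        return self.rnd.sample(self.docs, n)
```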

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

* Update `--tasks list` option in interface documentation (EleutherAI#1792)

* Fix Caching Tests ; Remove `pretrained=gpt2` default (EleutherAI#1775)

* link to the example output on the hub (EleutherAI#1798)

* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (EleutherAI#1793)

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (EleutherAI#1774) (EleutherAI#1791)

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug

* Initial integration of the Unitxt to LM eval harness (EleutherAI#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed, printing a clear error message if not

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add task for mmlu evaluation in arc multiple choice format (EleutherAI#1745)

* add mmlu arc style evaluation

* rename arc_style to continuation

---------

Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

* Update flag `--hf_hub_log_args` in interface documentation (EleutherAI#1806)

* update interface documentation with flag --hf_hub_logs_arg

* update interface documentation with flag --hf_hub_logs_arg 2

* Copal task (EleutherAI#1803)

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` to yaml to make it work

* checkmark on README

* change group name to `copal_id`

* Adding tinyBenchmarks datasets (EleutherAI#1545)

* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* interface doc update (EleutherAI#1807)

* Fix links in README guiding to another branch (EleutherAI#1838)

* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the class dimension of predictions so that the score still works when `--limit` is used and the gold indices do not cover all classes.
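The point of the fix can be shown in a self-contained sketch: size the one-hot encoding by the prediction vectors' class dimension, never by the largest gold index that happens to appear in the (possibly limited) subset. This is an assumed illustration, not the harness's metric code:

```python
def brier_score(gold, predictions):
    """Mean squared error between predicted class probabilities and one-hot gold.

    The one-hot encoding is sized by the prediction vectors' class dimension
    (not by the largest gold index seen), so the score stays well-defined
    when --limit leaves some classes unrepresented in gold.
    """
    num_classes = len(predictions[0])  # follow the prediction dimension
    total = 0.0
    for g, probs in zip(gold, predictions):
        one_hot = [1.0 if i == g else 0.0 for i in range(num_classes)]
        total += sum((p - t) ** 2 for p, t in zip(probs, one_hot))
    return total / len(predictions)
```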

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary agrs

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add The Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version with refactor to Arabic pica to be in alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints

* Update __init__.py

* Update __init__.py
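The standard way to add such a directory filter is to prune `os.walk`'s `dirnames` list in place, which stops the walk from descending into the pruned directories at all. A hedged sketch of the idea; `iter_task_files` and its parameters are illustrative, not the harness's actual function:

```python
import os


def iter_task_files(root, ignore_dirs=(".ipynb_checkpoints",)):
    """Yield .yaml files under root, pruning unwanted directories in place.

    Mutating `dirnames` in place (dirnames[:] = ...) is how os.walk is told
    not to descend into a directory, e.g. Jupyter's .ipynb_checkpoints.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        for fname in filenames:
            if fname.endswith(".yaml"):
                yield os.path.join(dirpath, fname)
```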

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix precommit hook, update run_models.sh

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: Alex Bäuerle <alex@a13x.io>
Co-authored-by: Wongboo <44860323+Wongboo@users.noreply.github.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Rylan Schaeffer <rylanschaeffer@gmail.com>
Co-authored-by: Vicki Boykis <vicki@mozilla.ai>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Nouf M. Alotaibi <63472979+noufmitla@users.noreply.github.com>
Co-authored-by: Haonan Li <nathan.8270.n@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: WoosungMyung <115716986+LameloBally@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezperez24@gmail.com>
Co-authored-by: Or Sharir <or@sharir.org>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: ZoneTwelve <zonetwelve159@gmail.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: nicho2 <nicho2@laposte.net>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezpersonal@gmail.com>
Co-authored-by: sator-labs <129434630+sator-labs@users.noreply.github.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: giorgossideris <56915448+giorgossideris@users.noreply.github.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: Gabriel Mukobi <gabrielmukobi@gmail.com>
Co-authored-by: Zehan Li <69186130+jordane95@users.noreply.github.com>
Co-authored-by: Simran Arora <emailsimran@gmail.com>
Co-authored-by: bcicc <142823000+bcicc@users.noreply.github.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Muhammad Bin Usman <muhammadbin.2003@gmail.com>
Co-authored-by: ciaranby <48831615+ciaranby@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: aditya thomas <aditya.thomas@alum.mit.edu>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
Co-authored-by: Edd <68678137+Erland366@users.noreply.github.com>
Co-authored-by: Lucas Weber <35227161+LucWeber@users.noreply.github.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Mogreine pushed a commit to deepvk/lm-evaluation-harness that referenced this pull request Jun 25, 2024
* Update generate_until_template_yaml (EleutherAI#1546)

* Update ifeval.yaml (EleutherAI#1506)

* add Arabic EXAMS benchmark (EleutherAI#1498)

* add Arabic EXAMS benchmark

* fixed the linter issue, and add more information on the readme

* Update README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* AGIEval (EleutherAI#1359)

* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>

* cli_evaluate calls simple_evaluate with the same verbosity. (EleutherAI#1563)

* add manual tqdm disabling management (EleutherAI#1569)

* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix README section on vllm integration (EleutherAI#1579)

* Link to vllm integration

* add pip install .[vllm] cmd

* Fix Jinja template for Advanced AI Risk (EleutherAI#1587)

* Proposed approach for testing CLI arg parsing (EleutherAI#1566)

* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true

* Patch for Seq2Seq Model predictions (EleutherAI#1584)

* Differentiate _encode_pair setting for decoder and enc-dec models

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add start date in results.json (EleutherAI#1592)

* Cleanup for v0.4.2 release (EleutherAI#1573)

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py (EleutherAI#1593)

* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* use BOS token in loglikelihood (EleutherAI#1588)

* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Revert "Patch for Seq2Seq Model predictions (EleutherAI#1584)" (EleutherAI#1601)

This reverts commit b7923a8.

* fix gen_kwargs arg reading (EleutherAI#1607)

* fix until arg processing (EleutherAI#1608)

* Fixes to Loglikelihood prefix token / VLLM (EleutherAI#1611)

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

* Add ACLUE task (EleutherAI#1614)

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

* OpenAI Completions -- fix passing of unexpected 'until' arg (EleutherAI#1612)

* add logging of model args (EleutherAI#1619)

* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit

* Add vLLM FAQs to README (EleutherAI#1625) (EleutherAI#1633)

* peft Version Assertion (EleutherAI#1635)

* peft Version Assertion

* fix the linter issue

* Seq2seq fix (EleutherAI#1604)

* fix on --task list

* add fixes to tokeniation

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>

* Integration of NeMo models into LM Evaluation Harness library (EleutherAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README

* Fix conditional import for Nemo LM class (EleutherAI#1641)

* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (EleutherAI#1647)

* Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

* Fix CLI --batch_size arg for openai-completions/local-completions (EleutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but this could not be specified on the CLI (i.e. `lm_eval --model openai-completions --batch_size 16 ...`) because of a simple missing str->int conversion.

This is confirmed by my usage and stacktrace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
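
The fix described above can be sketched as a small coercion helper (illustrative only; the actual patch simply casts the CLI value where the batch size is read, and the function name here is an assumption):

```python
# Hypothetical sketch of the missing str->int conversion: CLI and model_args
# values arrive as strings, so a "16" must become 16 before it is compared
# against an int. The special "auto" batch-sizing value is left untouched.
def coerce_batch_size(value):
    if isinstance(value, str) and value != "auto":
        return int(value)
    return value


assert coerce_batch_size("16") == 16        # CLI string becomes a usable int
assert coerce_batch_size("auto") == "auto"  # auto-batching passes through
```

With a coercion like this in place, the `len(ret) >= size` comparison in `sameuntil_chunks` receives an int as intended.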

* Patch QQP prompt (EleutherAI#1661)

* TMMLU+ implementation (EleutherAI#1394)

* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from both closed-source models and 20 open-weight Chinese large language models with 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

I had forgotten to remove a debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: use the training set as the few-shot examples

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run-through

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------

Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Anthropic Chat API (EleutherAI#1594)

* claude3

* supply for anthropic claude3

* supply for anthropic claude3

* anthropic config changes

* add callback options on anthropic

* line passed

* claude3 tiny change

* help anthropic installation

* mention sysprompt / being careful with format in readme

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* correction bug EleutherAI#1664 (EleutherAI#1670)

* correction bug EleutherAI#1664

* add handling of invalid characters for Windows filenames and Unix-like systems

see:
https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3?permalink_comment_id=3958715

* Update lm_eval/__main__.py

* Update scripts/zeno_visualize.py

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Update README.md (EleutherAI#1680)

* Add delta weights model loading (EleutherAI#1712)

* added delta weights

* removed debug

* readme update

* better error handling

* autogptq warn

* warn update

* peft and delta error, explicitly deleting _model_delta

* linter fix

* Add `neuralmagic` models for `sparseml` and `deepsparse` (EleutherAI#1674)

* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix error when appending eot_token_id for generate_until tasks (EleutherAI#1699)

* Adding retries and rate limit to toxicity tasks  (EleutherAI#1620)

* reference `--tasks list` in README (EleutherAI#1726)

EleutherAI#1698

* Add XNLIeu: a dataset for cross-lingual NLI in Basque (EleutherAI#1694)

* add xnli_eu tasks

* update tasks readme

* update readme

* Fix Parameter Propagation for Tasks that have `include`  (EleutherAI#1749)

* Update task.py

* Update __init__.py

* Support individual scrolls datasets (EleutherAI#1740)

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

* Add filter registry decorator (EleutherAI#1750)

* Add register_filter decorator

* Add register_filter docs

* remove duplicated `num_fewshot: 0` (EleutherAI#1769)

* Pile 10k new task (EleutherAI#1758)

* Add Pile-10k readme

* Add Pile-10k task configuration file

* Fix m_arc choices (EleutherAI#1760)

* Update utils.py

This is a 4-choice task; option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* upload new tasks (EleutherAI#1728)

* upload new tasks

* add readmes

* run linters

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* vllm lora support (EleutherAI#1756)

* vllm lora support

* remove print

* version check, rename lora kwarg

* Add option to set OpenVINO config (EleutherAI#1730)

* Add option to set OpenVINO config

* Use utils.eval_logger for logging

* evaluation tracker implementation (EleutherAI#1766)

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

* eval tracker args fix (EleutherAI#1777)

* limit fix (EleutherAI#1785)

* remove echo parameter in OpenAI completions API (EleutherAI#1779)

* remove echo parameter in OpenAI completions API

* remove context length parameter doc string

* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (EleutherAI#1776)

fix `----hf_hub_log_args` to `--hf_hub_log_args`

* Fix bug in setting until kwarg in openai completions (EleutherAI#1784)

* Provide ability for custom sampler for ConfigurableTask (EleutherAI#1616)

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

Way to control seed of fewshot sampling
may help with EleutherAI#1591

* Added ability for custom sampler for ConfigurableTask

May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
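
For illustration, a custom sampler referenced this way might look like the following minimal sketch. This is a standalone stand-in: in the harness, the class would hook into the sampler interface (e.g. by subclassing the default context sampler), and the class and method names below are assumptions.

```python
import random

# Hypothetical few-shot sampler with a deterministic "first-n" policy instead
# of a random draw. In a real task this class would live in the task's
# utils.py and be referenced from YAML as `sampler: !function utils.MyFewshotSampler`.
class MyFewshotSampler:
    def __init__(self, docs, rnd=None):
        self.docs = list(docs)
        # Explicit seed mirrors the seeded few-shot sampling added above.
        self.rnd = rnd or random.Random(1234)

    def sample(self, n):
        # Always return the first n documents, making fewshot contexts reproducible.
        return self.docs[:n]


sampler = MyFewshotSampler([{"question": i} for i in range(10)])
assert sampler.sample(2) == [{"question": 0}, {"question": 1}]
```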

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

* Update `--tasks list` option in interface documentation (EleutherAI#1792)

* Fix Caching Tests ; Remove `pretrained=gpt2` default (EleutherAI#1775)

* link to the example output on the hub (EleutherAI#1798)

* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (EleutherAI#1793)

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (EleutherAI#1774) (EleutherAI#1791)

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug

* Initial integration of the Unitxt to LM eval harness (EleutherAI#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed, printing a clear error message if not

* Commited missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add task for mmlu evaluation in arc multiple choice format (EleutherAI#1745)

* add mmlu arc style evaluation

* rename arc_style to continuation

---------

Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

* Update flag `--hf_hub_log_args` in interface documentation (EleutherAI#1806)

* update interface documentation with flag --hf_hub_logs_arg

* update interface documentation with flag --hf_hub_logs_arg 2

* Copal task (EleutherAI#1803)

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` to yaml to make it work

* checkmark on README

* change group name to `copal_id`

* Adding tinyBenchmarks datasets (EleutherAI#1545)

* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* interface doc update (EleutherAI#1807)

* Fix links in README guiding to another branch (EleutherAI#1838)

* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the dimension of predictions so that it still works when `--limit` is used and the indexes in gold do not cover all gold indexes.
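
The dimension issue can be sketched in a few lines of NumPy (an illustrative reimplementation, not the harness's exact code): the one-hot matrix is sized from the prediction vectors, so classes absent from the limited gold labels still line up.

```python
import numpy as np

def brier_score(gold, predictions):
    predictions = np.asarray(predictions, dtype=float)
    # Size the one-hot encoding from the prediction dimension, NOT from the
    # gold labels observed so far -- under --limit, gold may miss some classes.
    num_classes = predictions.shape[-1]
    gold_one_hot = np.eye(num_classes)[np.asarray(gold)]
    return float(np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=-1)))


# With --limit, gold might only ever contain class 0 of a 3-class task:
assert brier_score([0, 0], [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]) == 0.0
```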

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* udpate

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary agrs

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add the Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version, with a refactor moving Arabic pica into the alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix precommit hook, update run_models.sh

* Rename main mmlu ru config

* Add ru continuation version

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: Alex Bäuerle <alex@a13x.io>
Co-authored-by: Wongboo <44860323+Wongboo@users.noreply.github.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Rylan Schaeffer <rylanschaeffer@gmail.com>
Co-authored-by: Vicki Boykis <vicki@mozilla.ai>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Nouf M. Alotaibi <63472979+noufmitla@users.noreply.github.com>
Co-authored-by: Haonan Li <nathan.8270.n@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: WoosungMyung <115716986+LameloBally@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezperez24@gmail.com>
Co-authored-by: Or Sharir <or@sharir.org>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: ZoneTwelve <zonetwelve159@gmail.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: nicho2 <nicho2@laposte.net>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezpersonal@gmail.com>
Co-authored-by: sator-labs <129434630+sator-labs@users.noreply.github.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: giorgossideris <56915448+giorgossideris@users.noreply.github.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: Gabriel Mukobi <gabrielmukobi@gmail.com>
Co-authored-by: Zehan Li <69186130+jordane95@users.noreply.github.com>
Co-authored-by: Simran Arora <emailsimran@gmail.com>
Co-authored-by: bcicc <142823000+bcicc@users.noreply.github.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Muhammad Bin Usman <muhammadbin.2003@gmail.com>
Co-authored-by: ciaranby <48831615+ciaranby@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: aditya thomas <aditya.thomas@alum.mit.edu>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
Co-authored-by: Edd <68678137+Erland366@users.noreply.github.com>
Co-authored-by: Lucas Weber <35227161+LucWeber@users.noreply.github.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Successfully merging this pull request may close these issues: Type error on stderr computation of accuracy