
Rename lm_eval.logging -> lm_eval.loggers #1858

Merged
merged 2 commits into main from 1826-importerror-modulename on May 26, 2024

Conversation

haileyschoelkopf
Collaborator

Closes #1826.

@haileyschoelkopf haileyschoelkopf merged commit 0ff6ab9 into main May 26, 2024
3 of 8 checks passed
@haileyschoelkopf haileyschoelkopf deleted the 1826-importerror-modulename branch May 26, 2024 10:54
notrichardren pushed a commit to steven-safeai/lm-evaluation-harness that referenced this pull request May 31, 2024
* rename lm_eval.logging module

* fix evaluation tracker args
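
Downstream code that imported the old `lm_eval.logging` path needs a one-line update after this rename. A minimal sketch of the migration (the file path and its contents here are illustrative, not part of the harness):

```python
# Minimal sketch: rewrite old imports to the renamed module
# (path and file contents are illustrative).
from pathlib import Path

src = Path("/tmp/example_usage.py")
src.write_text("from lm_eval.logging import EvaluationTracker\n")

# Apply the rename lm_eval.logging -> lm_eval.loggers
src.write_text(src.read_text().replace("lm_eval.logging", "lm_eval.loggers"))
print(src.read_text())
```

After the rewrite the file imports from `lm_eval.loggers`, matching the module path used by this PR.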
Mogreine pushed a commit to deepvk/lm-evaluation-harness that referenced this pull request Jun 25, 2024
* Update generate_until_template_yaml (EleutherAI#1546)

* Update ifeval.yaml (EleutherAI#1506)

* add Arabic EXAMS benchmark (EleutherAI#1498)

* add Arabic EXAMS benchmark

* fixed the linter issue and added more information to the readme

* Update README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* AGIEval (EleutherAI#1359)

* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>

* cli_evaluate calls simple_evaluate with the same verbosity. (EleutherAI#1563)

* add manual tqdm disabling management (EleutherAI#1569)

* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix README section on vllm integration (EleutherAI#1579)

* Link to vllm integration

* add pip install .[vllm] cmd

* Fix Jinja template for Advanced AI Risk (EleutherAI#1587)

* Proposed approach for testing CLI arg parsing (EleutherAI#1566)

* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true

* Patch for Seq2Seq Model predictions (EleutherAI#1584)

* Differentiate _encode_pair setting for decoder and enc-dec models

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add start date in results.json (EleutherAI#1592)

* Cleanup for v0.4.2 release (EleutherAI#1573)

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py (EleutherAI#1593)

* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* use BOS token in loglikelihood (EleutherAI#1588)

* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Revert "Patch for Seq2Seq Model predictions (EleutherAI#1584)" (EleutherAI#1601)

This reverts commit b7923a8.

* fix gen_kwargs arg reading (EleutherAI#1607)

* fix until arg processing (EleutherAI#1608)

* Fixes to Loglikelihood prefix token / VLLM (EleutherAI#1611)

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

* Add ACLUE task (EleutherAI#1614)

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

* OpenAI Completions -- fix passing of unexpected 'until' arg (EleutherAI#1612)

* add logging of model args (EleutherAI#1619)

* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit

* Add vLLM FAQs to README (EleutherAI#1625) (EleutherAI#1633)

* peft Version Assertion (EleutherAI#1635)

* peft Version Assertion

* fix the linter issue

* Seq2seq fix (EleutherAI#1604)

* fix on --task list

* add fixes to tokenization

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>

* Integration of NeMo models into LM Evaluation Harness library (EleutherAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README

* Fix conditional import for Nemo LM class (EleutherAI#1641)

* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (EleutherAI#1647)

* Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

* Fix CLI --batch_size arg for openai-completions/local-completions (EleutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but specifying it on the CLI (i.e. `lm_eval --model openai-completions --batch_size 16 ...`) fails because of a simple lack of str->int conversion.

This is confirmed by my usage and stacktrace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
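
The traceback shows the CLI value arriving as the string `"16"` and later being compared against an int. The fix needed is a str→int coercion of the batch-size argument; a standalone sketch under assumed semantics (the helper name and the pass-through of `auto`-style values are hypothetical, not the harness's actual code):

```python
def coerce_batch_size(value):
    # Plain integer strings become ints, so later comparisons like
    # `len(ret) >= size` compare int to int; other values (e.g. "auto",
    # "auto:4") pass through unchanged.
    if isinstance(value, str) and value.lstrip("-").isdigit():
        return int(value)
    return value

print(coerce_batch_size("16"))    # int
print(coerce_batch_size("auto"))  # unchanged string
```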

* Patch QQP prompt (EleutherAI#1661)

* TMMLU+ implementation (EleutherAI#1394)

* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20160 tasks, is six times larger and more balanced than its predecessor, TMMLU. It includes benchmark results from both closed-source models and 20 open-weight Chinese large language models ranging from 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

This removes a debug print I had forgotten to delete.

* update: move TMMLU+ config generation program into default

* fix: we should use training set as few shots example

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run thought

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------

Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Anthropic Chat API (EleutherAI#1594)

* claude3

* supply for anthropic claude3

* supply for anthropic claude3

* anthropic config changes

* add callback options on anthropic

* line passed

* claude3 tiny change

* help anthropic installation

* mention sysprompt / being careful with format in readme

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* correction bug EleutherAI#1664 (EleutherAI#1670)

* correction bug EleutherAI#1664

* handle invalid characters in filenames on Windows and Unix-like systems

see:
https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3?permalink_comment_id=3958715

* Update lm_eval/__main__.py

* Update scripts/zeno_visualize.py

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Update README.md (EleutherAI#1680)

* Add delta weights model loading (EleutherAI#1712)

* added delta weights

* removed debug

* readme update

* better error handling

* autogptq warn

* warn update

* peft and delta error, explicitly deleting _model_delta

* linter fix

* Add `neuralmagic` models for `sparseml` and `deepsparse` (EleutherAI#1674)

* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix error when appending eot_token_id for generate_until tasks (EleutherAI#1699)

* Adding retries and rate limit to toxicity tasks  (EleutherAI#1620)

* reference `--tasks list` in README (EleutherAI#1726)

EleutherAI#1698

* Add XNLIeu: a dataset for cross-lingual NLI in Basque (EleutherAI#1694)

* add xnli_eu tasks

* update tasks readme

* update readme

* Fix Parameter Propagation for Tasks that have `include`  (EleutherAI#1749)

* Update task.py

* Update __init__.py

* Support individual scrolls datasets (EleutherAI#1740)

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

* Add filter registry decorator (EleutherAI#1750)

* Add register_filter decorator

* Add register_filter docs
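
A filter registry decorator of this kind can be sketched standalone (the names and the `apply` signature here are illustrative; the real decorator and filter base class live in the harness and may differ):

```python
# Minimal sketch of a string-keyed filter registry (names illustrative).
FILTER_REGISTRY = {}

def register_filter(name):
    # Decorator that records a filter class under `name`, so task configs
    # can refer to it by string instead of importing the class directly.
    def decorate(cls):
        FILTER_REGISTRY[name] = cls
        return cls
    return decorate

@register_filter("lowercase")
class LowercaseFilter:
    def apply(self, resps):
        # Normalize model responses to lowercase.
        return [r.lower() for r in resps]

print(FILTER_REGISTRY["lowercase"]().apply(["Hello", "WORLD"]))
```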

* remove duplicated `num_fewshot: 0` (EleutherAI#1769)

* Pile 10k new task (EleutherAI#1758)

* Add Pile-10k readme

* Add Pile-10k task configuration file

* Fix m_arc choices (EleutherAI#1760)

* Update utils.py

This is a 4-choice task, option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* upload new tasks (EleutherAI#1728)

* upload new tasks

* add readmes

* run linters

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* vllm lora support (EleutherAI#1756)

* vllm lora support

* remove print

* version check, rename lora kwarg

* Add option to set OpenVINO config (EleutherAI#1730)

* Add option to set OpenVINO config

* Use utils.eval_logger for logging

* evaluation tracker implementation (EleutherAI#1766)

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

* eval tracker args fix (EleutherAI#1777)

* limit fix (EleutherAI#1785)

* remove echo parameter in OpenAI completions API (EleutherAI#1779)

* remove echo parameter in OpenAI completions API

* remove context length parameter doc string

* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (EleutherAI#1776)

fix `----hf_hub_log_args` to `--hf_hub_log_args`

* Fix bug in setting until kwarg in openai completions (EleutherAI#1784)

* Provide ability for custom sampler for ConfigurableTask (EleutherAI#1616)

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

A way to control the seed of fewshot sampling; may help with EleutherAI#1591

* Added ability for custom sampler for ConfigurableTask

May be set in config like
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports
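
The `!function utils.MyFewshotSampler` hook resolves to a user-defined class. A minimal standalone sketch of a seeded sampler (the harness's actual base class, constructor, and method names may differ; this only illustrates the seeded-sampling idea):

```python
import random

class MyFewshotSampler:
    # Hypothetical standalone sampler: draws fewshot examples with a fixed
    # seed so runs are reproducible.
    def __init__(self, docs, seed=1234):
        self.docs = list(docs)
        self.rnd = random.Random(seed)

    def sample(self, n):
        # Draw n distinct fewshot documents.
        return self.rnd.sample(self.docs, n)

docs = [{"question": i} for i in range(10)]
sampler = MyFewshotSampler(docs, seed=42)
print(sampler.sample(3))
```

Two samplers built with the same seed produce identical fewshot draws, which is the property the seed plumbing above is meant to expose.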

* Update `--tasks list` option in interface documentation (EleutherAI#1792)

* Fix Caching Tests ; Remove `pretrained=gpt2` default (EleutherAI#1775)

* link to the example output on the hub (EleutherAI#1798)

* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (EleutherAI#1793)

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (EleutherAI#1774) (EleutherAI#1791)

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug

* Initial integration of the Unitxt to LM eval harness (EleutherAI#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
pip install 'lm_eval[unitxt]'
2. Added a check that unitxt is installed, printing a clear error message if not

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add task for mmlu evaluation in arc multiple choice format (EleutherAI#1745)

* add mmlu arc style evaluation

* rename arc_style to continuation

---------

Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

* Update flag `--hf_hub_log_args` in interface documentation (EleutherAI#1806)

* update interface documentation with flag --hf_hub_logs_arg

* update interface documentation with flag --hf_hub_logs_arg 2

* Copal task (EleutherAI#1803)

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` to yaml to make it work

* checkmark on README

* change group name to `copal_id`

* Adding tinyBenchmarks datasets (EleutherAI#1545)

* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* interface doc update (EleutherAI#1807)

* Fix links in README guiding to another branch (EleutherAI#1838)

* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the dimension of predictions so that it still works when `--limit` is used and the indexes in gold do not cover all class indexes.
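
The dimension fix can be illustrated standalone (the function shape here is illustrative, not the harness's actual implementation): the one-hot targets are sized to the prediction vector, not to the largest gold index seen.

```python
def brier_score(gold, predictions):
    # One-hot targets sized to the prediction dimension, so gold labels that
    # don't cover every class (e.g. under --limit) still align with the
    # prediction vectors.
    n_classes = len(predictions[0])
    total = 0.0
    for g, p in zip(gold, predictions):
        one_hot = [1.0 if i == g else 0.0 for i in range(n_classes)]
        total += sum((pi - oi) ** 2 for pi, oi in zip(p, one_hot))
    return total / len(gold)

print(brier_score([0], [[0.7, 0.2, 0.1]]))
```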

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit
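
A guard of this shape avoids the failure mode: with zero iterations there is no bootstrap distribution to take a standard deviation of, so the stderr is simply reported as unavailable (sketch; the function and return convention are illustrative, not the harness's code):

```python
import random
import statistics

def bootstrap_stderr(metric, values, iters):
    # With iters == 0, skip resampling entirely and report no stderr,
    # instead of computing statistics over an empty bootstrap sample.
    if iters <= 0:
        return None
    rnd = random.Random(1234)
    resampled = [
        metric([rnd.choice(values) for _ in values]) for _ in range(iters)
    ]
    return statistics.pstdev(resampled)

print(bootstrap_stderr(statistics.mean, [1, 2, 3], 0))     # skipped
print(bootstrap_stderr(statistics.mean, [1, 2, 3], 100))   # estimated
```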

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary args

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add The Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version with refactor to Arabic pica to be in alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix precommit hook, update run_models.sh

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: Alex Bäuerle <alex@a13x.io>
Co-authored-by: Wongboo <44860323+Wongboo@users.noreply.github.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Rylan Schaeffer <rylanschaeffer@gmail.com>
Co-authored-by: Vicki Boykis <vicki@mozilla.ai>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Nouf M. Alotaibi <63472979+noufmitla@users.noreply.github.com>
Co-authored-by: Haonan Li <nathan.8270.n@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: WoosungMyung <115716986+LameloBally@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezperez24@gmail.com>
Co-authored-by: Or Sharir <or@sharir.org>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: ZoneTwelve <zonetwelve159@gmail.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: nicho2 <nicho2@laposte.net>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezpersonal@gmail.com>
Co-authored-by: sator-labs <129434630+sator-labs@users.noreply.github.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: giorgossideris <56915448+giorgossideris@users.noreply.github.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: Gabriel Mukobi <gabrielmukobi@gmail.com>
Co-authored-by: Zehan Li <69186130+jordane95@users.noreply.github.com>
Co-authored-by: Simran Arora <emailsimran@gmail.com>
Co-authored-by: bcicc <142823000+bcicc@users.noreply.github.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Muhammad Bin Usman <muhammadbin.2003@gmail.com>
Co-authored-by: ciaranby <48831615+ciaranby@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: aditya thomas <aditya.thomas@alum.mit.edu>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
Co-authored-by: Edd <68678137+Erland366@users.noreply.github.com>
Co-authored-by: Lucas Weber <35227161+LucWeber@users.noreply.github.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
mariagrandury pushed a commit to somosnlp/lm-evaluation-harness that referenced this pull request Jul 25, 2024
* rename lm_eval.logging module

* fix evaluation tracker args
mansicer added a commit to mansicer/lm-evaluation-harness that referenced this pull request Aug 1, 2024
* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Update outdated links in `docs` to the latest (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the dimension of predictions so that it still works when `--limit` is used and the indexes in gold do not cover all gold indexes.
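The dimension fix described above can be sketched as follows (a hypothetical `gold_one_hot_like` helper under assumed names, not the harness's actual code):

```python
import numpy as np

def gold_one_hot_like(predictions, gold):
    # Build a one-hot gold matrix shaped like `predictions`, so the
    # column count matches the prediction array even when --limit
    # leaves some gold classes unseen in the sampled subset.
    preds = np.asarray(predictions, dtype=float)
    one_hot = np.zeros_like(preds)
    one_hot[np.arange(len(gold)), gold] = 1.0
    return one_hot

def brier_score(predictions, gold):
    # Mean squared difference between predicted probabilities and the
    # one-hot encoded gold labels.
    preds = np.asarray(predictions, dtype=float)
    one_hot = gold_one_hot_like(predictions, gold)
    return float(np.mean(np.sum((preds - one_hot) ** 2, axis=1)))
```

Sizing the one-hot matrix from `predictions` rather than from the observed gold indices is what keeps the score well-defined on a truncated sample.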

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit
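A minimal sketch of the `bootstrap_iters=0` guard (hypothetical function name and signature, not the harness's exact implementation):

```python
import random
import statistics

def bootstrap_stderr(metric_fn, samples, bootstrap_iters):
    # With bootstrap_iters=0 (or 1, where a stdev is undefined) skip
    # resampling entirely and report no standard error, instead of
    # attempting a degenerate bootstrap.
    if bootstrap_iters <= 1:
        return None
    estimates = [
        metric_fn(random.choices(samples, k=len(samples)))
        for _ in range(bootstrap_iters)
    ]
    return statistics.stdev(estimates)
```

Returning `None` lets the caller omit the stderr column rather than printing a meaningless value.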

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
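The ticker logic above can be sketched as (a hypothetical helper; the evaluator's real table formatting may differ):

```python
def hib_ticker(higher_is_better):
    # Map a metric's higher_is_better flag to a table ticker; None
    # (direction unknown, the extra check mentioned above) renders as
    # a blank cell rather than a misleading arrow.
    if higher_is_better is None:
        return " "
    return "↑" if higher_is_better else "↓"
```

This matches the `acc |↑` style cells seen in the result tables elsewhere in this log.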

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary args

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidentally committed file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run a little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add The Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version with refactor to Arabic pica to be in alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add tokenizer logs info (EleutherAI#1731)

* add tokenizer logs info

* add no tokenizer case

* Update lm_eval/logging_utils.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/logging_utils.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add updates

* fix conflict

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Hotfix breaking import (EleutherAI#2015)

* add arc_challenge_mt (EleutherAI#1900)

* add arc_challenge_mt

* add README

* add icelandic

* Remove `LM` dependency from `build_all_requests` (EleutherAI#2011)

* refactored `lm.apply_chat_template`

* nit

* fix weird type error

* fixed!

* skip failing test

* pre-commit run all

* add type hints

* nit

* nit

* fixup

* Added CommonsenseQA task (EleutherAI#1721)

* Initial configuration

* Using the validation set for the test set, because the test set on HF doesn't have labels

* Probably just makes more sense to have validation be validation

* fix format ; add docs to tasks/README.md

* fix format

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Factor out LM-specific tests (EleutherAI#1859)

* separate out optimum/neuralmagic tests to separate job

* fix vllm tests

* fix bug in --trust_remote_code

* use datasets.config instead intentionally

* fix remote code issue?

* Update interface.md (EleutherAI#1982)

* Update interface.md

update interface to remove link to really outdated commit of evaluator.py

* switch to relative referencing?

* Update interface.md

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Fix `trust_remote_code`-related test failures (EleutherAI#2024)

* make MMLU trust remote code to fix tests

* remove trust remote code

* Fixes scrolls task bug with few_shot examples (EleutherAI#2003)

Bug:

```
python -m scripts.write_out --task scrolls_quality --output_base_path ~/workspace/
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/lm-evaluation-harness/scripts/write_out.py", line 92, in <module>
    main()
  File "/lm-evaluation-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 423, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 271, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 162, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 148, in load_task
    task_object = config["class"]()
                  ^^^^^^^^^^^^^^^^^
  File "/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 120, in __init__
    super().__init__()
  File "/lm-evaluation-harness/lm_eval/api/task.py", line 703, in __init__
    self._config = TaskConfig(**config)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: lm_eval.api.task.TaskConfig() argument after ** must be a mapping, not NoneType
```

* fix cache (EleutherAI#2037)

* Add chat template to `vllm` (EleutherAI#2034)

* add chat template

* refactor token padding

* nit

* nit

* check on failing test

* check transformers version

* remove transformers pin

* add ids to test

* nit

* fixup

* fix bos bug

* nit

* fixup! fix bos bug

* increase tolerance for table test

* don't detokenize vllm logprobs

* Update lm_eval/models/utils.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* pre-commit run --all-files

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fail gracefully upon tokenizer logging failure (EleutherAI#2038)

* ship with exact_match function already used ; don't call evaluate.load() on import (EleutherAI#2045)

* update to v0.4.3 (EleutherAI#2046)

* fix wandb logger module import in example (EleutherAI#2041)

* Fix strip whitespace filter (EleutherAI#2048)

* batch commit

* Revert "batch commit"

This reverts commit d859d1c.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* update gemma-2 default BOS behavior (EleutherAI#2049)

* Update hellaswag.yaml (EleutherAI#2029)

* Adds Open LLM Leaderboard Taks (EleutherAI#2047)

* adds leaderboard tasks

* Delete lm_eval/tasks/leaderboard/leaderboard_chat_template.yaml

* add readme

* Delete lm_eval/tasks/leaderboard/mmlu_pro/mmlu_pro_chat_template.yaml

* modify readme

* fix bbh task

* fix bbh salient task

* modify the readme

* Delete lm_eval/tasks/leaderboard/ifeval/README.md

* Delete lm_eval/tasks/leaderboard/math/README.md

* add leaderboard to the tasks repertory

* add announcement about new leaderboard tasks

* linting

* Update README.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* installs ifeval dependency in new_task github workflow

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* EleutherAI#1442 inverse scaling tasks implementation (EleutherAI#1589)

* initial_implementation (test has to be proceeded)

* minor fix

* revised task name and implemented new task

* minor fixes

* new tasks implement

* minor fix

* added 'prompt injection' task

* delete prompt injection task (will be implemented at next PR)

* trust remote code

* Update lm_eval/tasks/inverse_scaling/README.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added readme

* Update lm_eval/tasks/README.md

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

* Update lm_eval/tasks/inverse_scaling/README.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update README.md

* precommit?

* run precommit on readme

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix TypeError in samplers.py by converting int to str (EleutherAI#2074)

Co-authored-by: yhjo <yhjo@suresofttech.com>

* Group agg rework (EleutherAI#1741)

* add greoup_config arg

* add a group config that allows disabling table for group score and group aggregate in general

* fixed size configuration

* adjust config

* add group config

* adjust mmlu to use group_config

* fixed args input in aggregate_subtask_metrics

* fixed issues related to printing alias of group and updated yaml

* update all mmlu variants to include group_config

* edit format

* modify mmlu tasks

* adjust group to also be a configurable group

* add configurable group

* simplify get_task_list

* adjust group scoring with using ConfigurableGroup

* adjust args

* update mmlu

* update mmlu

* update to work with new group and task configuration

* readd group_agg

* readd files

* move prepare_print_tasks to evaluator_utils

* sort set to False by default, fix predict_only arg

* add version for groups

* reversed task list

* update additional condition when loading a group in a group yaml

* update truthfulqa

* add description regarding tags replacing group

* replace group to tag

* fixed conditional statement

* remove warning

* update loading of task group and newly added tags

* reformat with pre-commit

* fixed info log

* update

* fix bug

* fix bug

* use task id to differentiate tasks

* convert all groups to configurable groups

* use task_id

* reformat

* add task_id for python tasks as well

* add task_id for python tasks as well

* add task_id for python tasks as well

* revert truthfulqa

* revert mmlu tasks

* new mmlu config

* new group config parameter `tag_to_task`

* Update truthfulqa_mc2.yaml

* reformat

* add _process_group_config

* adjust task_id

* add get_subtask_list function to get proper subtask list

* group config to_dict update

* remove tag check

* update mmlu

* fix config passing issues

* add test yaml

* format fix

* add documentation

* corner case for single tag being called

* fix indentation

* formatting

* update all mmlu variants

* Update docs/task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove group_alias

* Update docs/task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* remove version for metadata

* Update docs/task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update mmlu/

* removed " " in make_table

* change how aggregate_metric is loaded

* change how aggregate_metric is loaded

* update aggregate_metric arg

* update format

* update format

* some docs fixes

* add groups for agieval, aexams, aclue

* add more explicit aggregation groups

* add more groupings / tags distinctions

* add more groupings

* more groupings

* add many explicit group configs

* add many explicit group configs

* add more explicit group configs

* add more explicit group configs

* add more error msgs, agg_metric -> agg_metric_list

* some docs updates

* update task_id to be updateable and uses group:task format

* make KMMLU a tag for now

* update docs

* don't duplicate task names

* fix merge conflicts?

* giving this a try

* clean up diff

* switch mmlu variants over to using

* don't use to-be-deprecated group: config field in overview notebook

* Python tasks which subclass ConfigurableTask now run

* update mmlu

* pre-commit format

* fixed sorting for multi-level printing

* move group api to separate file

* fix bbh aggregation filter usage

* track api/group.py

* adjust group and tags loading

* make explicit group configs for leaderboard and other newer tasks

* fix arabicmmlu

* update

* change arabicmmlu template name???

* update group alias

* fix printing bugs

* check table printing is correct ; update tests

* use mmlu_stem to have a group included in print tests

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* we run with bootstrap_iters=0 for printing tests (EleutherAI#2080)

* Easier unitxt tasks loading and removal of unitxt library dependency (EleutherAI#1933)

* Updated unitxt loading

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Revert change to general Readme

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Adjust fda,squadv2,squad_completion and swde to work accept config in the constructor

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

* Fix scrolls

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update documentation

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Enforce backward compatibility

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Format unitxt class

Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: elronbandel <elron.bandel@ibm.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Allow gating EvaluationTracker HF Hub results; customizability (EleutherAI#2051)

* batch commit

* Revert "batch commit"

This reverts commit d859d1c.

* batch commit

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* checkout from main

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup eval results

* cleanup

* add check for gated repo

* fix jsonline issue

* fix

* add try catch when gating the details repo

* add doc

* adds back hub_repo_name

* readds hub repo name

* Minor doc fix: leaderboard README.md missing mmlu-pro group and task (EleutherAI#2075)

leaderboard README.md missing mmlu-pro group and task

* fix: utf-8 encoding for logged sample files was missing (EleutherAI#2082)

* Update utils.py (EleutherAI#2085)

Group configs with no aggregation will print an empty space as the score in the result table.
Example
```
|    Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------------|-------|------|-----:|--------|---|-----:|---|-----:|
|group         |    N/A|      |      |        |   |      |   |      |
| - task 0     |Yaml   |none  |     0|acc     |↑  |0.4000|±  |0.0910|
| - task 1     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
| - task 2     |Yaml   |none  |     0|acc     |↑  |0.2667|±  |0.0821|
| - task 3     |Yaml   |none  |     0|acc     |↑  |0.3333|±  |0.0875|
```

So the `v` variable in `make_table` needs to check whether the value is a float or a string.
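That check can be sketched as (hypothetical helper name; the real `make_table` formatting logic may differ):

```python
def format_cell(v):
    # Group rows without aggregation carry an empty string instead of
    # a numeric score, so only format floats and pass everything else
    # through str() unchanged.
    return f"{v:.4f}" if isinstance(v, float) else str(v)
```

With this guard, the empty group-score cells in the example table above render as blanks instead of raising a format error.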

* batch_size may be str if 'auto' is specified (EleutherAI#2084)
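A sketch of the normalization this implies (hypothetical helper; the harness's actual CLI parsing may differ): since `--batch_size` can arrive as the string `"auto"` (or `"auto:N"`) rather than an int, numeric strings need converting before any arithmetic.

```python
def resolve_batch_size(batch_size):
    # "auto"/"auto:N" stay as strings for later auto-detection;
    # numeric strings from the CLI are normalized to int.
    if isinstance(batch_size, str) and not batch_size.startswith("auto"):
        return int(batch_size)
    return batch_size
```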

* Prettify lm_eval --tasks list (EleutherAI#1929)

* add  and ; move task list newline logic to new TaskManager.list_all_tasks() method

* format table list into markdown table; add config location column

* add Output Type column

* add logic for printing table of tags separately

* merge with main and fix conflicts ; update docstrings

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* make RougeScorer only initialized once (EleutherAI#2090)

* Update default.yaml (EleutherAI#2092)

* Add new dataset MMLU-SR tasks (EleutherAI#2032)

* add mmlusr tasks

* renamed all task names in mmlusr

* edit format and readme

* added mmlu_sr

* mmlu_sr -> mmlusr

* update

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* Irokobench: Benchmark Dataset for African languages (EleutherAI#2042)

* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few show. metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

---------

Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>

* docs: remove trailing sentence from contribution doc (EleutherAI#2098)

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* Added MedConceptsQA Benchmark (EleutherAI#2010)

* Added MedConceptsQA Benchmark

* pre-commit factor

* update group name

* update in naming

* changed name

* Changed mcqa to med_concepts_qa prefix

* Added med_concepts_qa to README.md

* Changed config files according to the new format

* Updated README

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make recurrent_gemma model types included in the force-BOS case (EleutherAI#2105)

* formatting (EleutherAI#2104)

* docs: align local test command to match CI (EleutherAI#2100)

Also add 'test_logs/' to .gitignore

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* Fixed colon in Belebele _default_template_yaml (EleutherAI#2111)

* [python] fix haerae tasks (EleutherAI#2112)

* fix: broken discord link in CONTRIBUTING.md (EleutherAI#2114)

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* docs: update truthfulqa tasks (EleutherAI#2119)

* fix caching module (hotfix for now) (EleutherAI#2124)

* Refactor API models (EleutherAI#2008)

* refactor pad_token handling to fn

* fix docs

* add pad_token_handling to vllm

* start on API superclass

* don't detokenize the returned logits

* streamline vllm tokenizer

* add type hint

* pre-commit

* seems to be in working order

* add model to init

* refactor api models

* nit

* cleanup

* add pbar

* fix type hints

* change optional dependencies

* json encode chat template

* add type hints

* deal with different prompt input requirements

* nits

* fix

* cache inside async

* fix

* fix

* nits

* nits

* nits

* nit

* fixup

* fixup

* nit

* add dummy retry

* add dummy retry

* handle imports; skip failing test

* add type hint

* add tests

* add dependency to tests

* add package names to exception

* nit

* docs; type hints

* handle api key

* nit

* tokenizer bug

* fix tokenizer

* nit

* nit

* add better error messages

* nit

* remove decorator

* CI: install api dep

* revert evaluator.py

* consolidate

* consolidate

* nits

* nit

* fix typealias

* nit

* nit

* nit

* Update lm_eval/models/api_models.py

typo

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/openai_completions.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/anthropic_llms.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/models/api_models.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix typo

* add news section

* add info for API

* pre-commit

* typo

* fix bug: unpack loglikelihood requests

* fix bug: shared gen_kwargs mutated

* nit: handle copy properly

* Update README.md

* Update README.md

* Update README.md

* Update api_models.py

* Update README.md

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* bugfix and docs for API (EleutherAI#2139)

* encoding bugfix

* encoding bugfix

* overload loglikelihood rather than loglikelihood_tokens

* add custom tokenizer

* add docs

* Update API_guide.md

fix link; add note

* Update API_guide.md

typo

* pre-commit

* add link in readme

* nit

* nit

* nit

* Update API_guide.md

nits

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update API_guide.md

* Update README.md

* Update docs/API_guide.md

* Update docs/API_guide.md

* Update API_guide.md

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* [Bugfix] add temperature=0 to logprobs and seed args to API models (EleutherAI#2149)

* add temperature for log probs

* add seed

* nit

* add new args to test

* added warning for api chat models
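
The temperature/seed fix for API logprobs can be sketched as a request payload. This is an illustrative sketch only: the field names follow the OpenAI-style completions API the harness targets, and the model name is a placeholder, not a real model.

```python
# Hedged sketch: a deterministic logprob request. Greedy decoding
# (temperature=0) plus a fixed seed makes returned logprobs reproducible
# across runs, where the provider supports it.
payload = {
    "model": "example-model",          # placeholder name
    "prompt": "The capital of France is",
    "max_tokens": 1,
    "temperature": 0,  # greedy: needed for stable logprob comparisons
    "seed": 1234,      # provider-side determinism, where supported
    "logprobs": 5,     # top-5 token logprobs per position
}
```

Without `temperature=0`, sampled continuations (and therefore the scored tokens) can differ between runs, which makes loglikelihood-based metrics unstable.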

* refactor: limit usage of `scipy` and `sklearn` dependencies (EleutherAI#2097)

* refactor: move scipy and sklearn module imports to func imports

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* refactor: consolidate weighted_f1_score func into lm_eval utils

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

* lint: allow for utils file to have unused imports

this allows shared functions to be defined only once
while keeping the YAML function-importing mechanism
working

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>

---------

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
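
The lazy-import refactor above follows a simple pattern: move the heavy import from module scope into the function that needs it, so the module stays importable when the dependency is absent. A minimal sketch of that pattern, with stdlib `statistics` standing in for scipy/sklearn:

```python
def mean_metric(values):
    """Compute a metric whose dependency is loaded only on first call.

    Because the import lives inside the function, importing this module
    never fails when the dependency is missing; users only pay the cost
    (or hit an ImportError) if they actually run the metric.
    """
    import statistics  # deferred import, mirroring the scipy/sklearn move
    return statistics.mean(values)
```

`mean_metric` is a hypothetical stand-in, not a function from the harness; the shape of the fix is the same either way.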

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Brendan Murphy <bmurphy592@gmail.com>
Co-authored-by: Steven Basart <xksteven@users.noreply.github.com>
Co-authored-by: Ogundepo Odunayo <ogundepoodunayo@gmail.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Hanwool Albert Lee <88315152+h-albert-lee@users.noreply.github.com>
Co-authored-by: Choyunhui <a01022371341@gmail.com>
Co-authored-by: yhjo <yhjo@suresofttech.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Co-authored-by: Pankaj Mathur <pankymathur@gmail.com>
Co-authored-by: meg <90473723+meg-huggingface@users.noreply.github.com>
Co-authored-by: Wonung Kim <waneon.kim@gmail.com>
Co-authored-by: SuperCat <37853425+SkySuperCat@users.noreply.github.com>
Co-authored-by: Jess <jessicaojo19@gmail.com>
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: Nathan Weinberg <31703736+nathan-weinberg@users.noreply.github.com>
Co-authored-by: Ben Shoham Ofir <33639234+Ofir408@users.noreply.github.com>
Co-authored-by: jab13x <117719136+jab13x@users.noreply.github.com>
Co-authored-by: Jungwhan Kim <53588015+jungwhank@users.noreply.github.com>
Co-authored-by: Jennifer Cwagenberg <candiedcode@gmail.com>
djstrong pushed a commit to speakleash/lm-evaluation-harness that referenced this pull request Aug 2, 2024
* rename lm_eval.logging module

* fix evaluation tracker args
jmercat pushed a commit to TRI-ML/lm-evaluation-harness that referenced this pull request Sep 25, 2024
* rename lm_eval.logging module

* fix evaluation tracker args
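
The rename this PR performs (`lm_eval.logging` → `lm_eval.loggers`) is the kind of change downstream code can absorb with a try-newest-first import fallback. A hedged, self-contained sketch of that pattern (the helper name is generic, not part of the harness):

```python
import importlib


def import_first(candidates):
    """Return the first importable module from `candidates`.

    Downstream code that must run on both sides of the rename could call
    this with ("lm_eval.loggers", "lm_eval.logging") and use whichever
    module path the installed version provides.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {candidates} could be imported")
```

Trying the new path first means the fallback only triggers on pre-rename installs, so the shim can be deleted once old versions are no longer supported.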
Labels
bug Something isn't working.
Development

Successfully merging this pull request may close these issues.

I get this error whenever I try to run an eval: ImportError: cannot import name 'HfApi' from 'huggingface_hub'