
Filter docs not offset by doc_id #1349

Merged — 8 commits merged into EleutherAI:main from the filter branch on Jan 25, 2024

Conversation

baberabb (Contributor) commented on Jan 24, 2024

Closes #1293. The full dataset was passed to the filter object and not offset by doc_id when using accelerate. I've made a tentative fix by using only the docs from the instances, but I think a better fix would be to remove this arg altogether and keep a clear distinction between it and process_results. Is there a backward-compatibility reason not to do that? Do any tasks other than super_glue-wsc-t5-prompt compare resps to docs in their filter function?
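For illustration, here is a self-contained sketch of the misalignment described above; the `Instance` stand-in and its fields are hypothetical, not the harness's actual classes:

```python
from dataclasses import dataclass

# Illustrative stand-in for a harness request instance (not the real class).
@dataclass
class Instance:
    doc_id: int
    doc: dict
    resps: list

full_eval_docs = [{"idx": i, "answer": f"gold-{i}"} for i in range(8)]

# Under data-parallel accelerate, one rank might only hold docs 4..7.
rank_instances = [Instance(i, full_eval_docs[i], [f"resp-{i}"]) for i in range(4, 8)]
resps = [inst.resps for inst in rank_instances]

# Buggy pairing: zipping this rank's resps against the *full* dataset pairs
# resp-4 with gold-0, resp-5 with gold-1, and so on.
buggy_pairs = list(zip(resps, full_eval_docs))

# Tentative fix: take the gold docs from the instances themselves, so
# resps[i] and docs[i] always refer to the same example.
fixed_pairs = list(zip(resps, (inst.doc for inst in rank_instances)))
```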

baberabb changed the title from Filter to Filter bug on Jan 24, 2024
baberabb changed the title from Filter bug to Filter docs not offset by doc_id on Jan 24, 2024
haileyschoelkopf (Contributor)

Thanks for catching this!

Hm, so we do want filters to have access to the gold-standard answer in some cases, if people use them for model-based evaluations. The idea was that if you want to do something like run a model over all the documents, the filter apply() should let you load that model a single time and run over all the docs, something not currently possible with process_results().
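As a rough illustration of that use case, a model-based filter might look like the sketch below; the class, the `judge` object, and its `accepts()` method are hypothetical, and only the `apply(resps, docs)` shape mirrors the filter interface discussed here:

```python
class ModelJudgeFilter:
    """Hypothetical model-based filter: the judge model is loaded once and
    reused across every (response, doc) pair, rather than per document."""

    def __init__(self, judge):
        self.judge = judge  # any object exposing accepts(response, doc)

    def apply(self, resps, docs):
        # resps: one list of generations per document; docs: the matching gold docs
        return [
            [r for r in resp_list if self.judge.accepts(r, doc)]
            for resp_list, doc in zip(resps, docs)
        ]
```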

I think your fix is probably the right way to handle this as a hotfix, but I'm sympathetic to what you're saying re: keeping a distinction between filters and process_docs(). I guess the broader change would be to:

- move this WSC logic to a custom process_results()
- add a deprecation warning if people try to pass docs into filters (sketched below), and later fully phase it out?
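A minimal sketch of what such a transition shim could look like (purely illustrative; `apply_filter` and its signature are not the harness's API):

```python
import warnings

def apply_filter(filter_fn, resps, docs=None):
    # Hypothetical phase-out: keep accepting docs for now, but warn so that
    # doc-dependent post-processing migrates to a custom process_results().
    if docs is not None:
        warnings.warn(
            "Passing docs into filters is deprecated; move doc-dependent "
            "post-processing into process_results().",
            DeprecationWarning,
            stacklevel=2,
        )
    return filter_fn(resps, docs)
```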

(cc @dwadden, who was interested in model-based evaluation in lm-eval, to make sure this change would not break anything for him, and @pminervini, whom this may also affect given an interest in stateful process_results().)

baberabb (Contributor, Author)

Yes, that makes sense. I guess there isn't really a reason not to give the option, other than to keep things simple. As a compromise, I've kept the Filter class as-is but modified the input docs in FilterEnsemble to args.docs (this is the main entrypoint to the filter pipeline from Task). The only major change I can think of is that the gold standard answers will be a list[dict] (one per row) rather than a Dataset object.
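Roughly, the FilterEnsemble change described above amounts to something like the sketch below; the field and method names are approximations of the harness's API, not an exact copy of the implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FilterEnsemble:
    name: str
    filters: list = field(default_factory=list)

    def apply(self, instances):
        resps = [inst.resps for inst in instances]
        docs = [inst.doc for inst in instances]  # a list[dict], one per instance
        for f in self.filters:
            # each filter sees responses and gold docs that are already aligned
            resps = f.apply(resps, docs)
        for inst, resp in zip(instances, resps):
            # assumes each instance carries a dict of filtered responses
            inst.filtered_resps[self.name] = resp
```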

Removed some unnecessary and unused looping in task.py while parsing the filters from the config. The results remain the same for TinyLlama on gsm8k before and after this change.

Also modified the super_glue-wsc-t5-prompt answer processing to use process_results rather than a Filter. Accuracy is now the same with and without accelerate (0.6058 on TinyLlama).
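For illustration, a doc-level process_results for this kind of task could look like the sketch below; the field names follow the SuperGLUE WSC schema, and the metric key and string-matching rule are assumptions, not the task's exact implementation:

```python
def process_results(doc, results):
    # `results` holds only this doc's generations, so responses and gold
    # answers can never drift apart, with or without accelerate.
    prediction = results[0].strip().lower()
    referent = doc["span1_text"].strip().lower()
    # crude containment check between the generated referent and the gold span
    predicted_label = 1 if (prediction in referent or referent in prediction) else 0
    return {"accuracy": float(predicted_label == doc["label"])}  # metric key illustrative
```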

cc: @lintangsutawika

haileyschoelkopf (Contributor)

> As a compromise, I've kept the Filter class as-is but modified the input docs in FilterEnsemble to args.docs (this is the main entrypoint to the filter pipeline from Task). The only major change I can think of is that the gold standard answers will be a list[dict] (one per row) rather than a Dataset object.

Thanks, I think this sounds great in terms of keeping backward compatibility!

This looks good to me, thanks again for chasing this down!

haileyschoelkopf merged commit a0f1cac into EleutherAI:main on Jan 25, 2024
8 checks passed
baberabb deleted the filter branch on January 25, 2024, 20:50
lintangsutawika (Contributor)

Thanks @baberabb @haileyschoelkopf

anjor pushed a commit to anjor/lm-evaluation-harness that referenced this pull request Jan 31, 2024
* get `doc` from instance

* acceletate bugfix: get ground doc from instance

* convert filter to `process_result`

* get docs from instances in `FilterEnsemble`

* rename

* nit

* better looping

* fix typehint
haileyschoelkopf added a commit that referenced this pull request Feb 22, 2024
* loglikelihood refactor using template lm

* linter

* fix whitespace in target + prompt for CoT gsm8k (#1275)

* Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs (#1261)

* Make parallelize=True distinction clearer in documentation.

* run linter

* Allow parameter edits for registered tasks when listed in a benchmark (#1273)

* benchmark yamls allow minor edits of already registered tasks

* add documentation

* removed print

* Fix data-parallel evaluation with quantized models (#1270)

* add WIP device_map overrides

* update handling outside of accelerate launcher

* change .to(device) log to debug level

* run linter

* Rework documentation for explaining local dataset (#1284)

* rewor documentation for explaining local dataset

* fix typo

* Update new_task_guide.md

* Re-add citation

It looks like Google Scholar has [already noticed](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&authuser=2&q=%22A+framework+for+few-shot+language+model+evaluation%2C+12+2023%22&btnG=) the updated citation block so let's add it back in.

* Update CITATION.bib (#1285)

Bumping CITATION.bib to match re-adding the citation in readme. 

cc @StellaAthena

* Update nq_open.yaml (#1289)

* Update README.md with custom integration doc (#1298)

* Update README.md

* punctuation

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update nq_open.yaml (#1305)

* Update nq_open.yaml

change regex

* Bump NQ version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update task_guide.md (#1306)

* Update pyproject.toml (#1312)

* Fix polemo2_in.yaml config name (#1313)

* Update pyproject.toml (#1314)

* Fix group register (#1315)

* tuple should be considered as well

* set option to keep callable as callable

* Update task_guide.md (#1316)

* Update polemo2_in.yaml (#1318)

* don't pass extra kwargs to mamba any more (#1328)

* Fix Issue regarding stderr (#1327)

* add fix fordeciding if stderr is N/A or not

* process N/A

* Add `local-completions` support using OpenAI interface (#1277)

* Add `local-completions` support using OpenAI interface

* Refactor oa_completion

* Address tokenizer comments and change request chunks to batch size

* Add warning message for tiktoken backend

* fix formatting

* fix whitespace

* Update README.md

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fallback to classname when LM doesnt have config (#1334)

* fix a trailing whitespace that breaks a lint job (#1335)

* skip "benchmarks" in changed_tasks (#1336)

* Update migrated HF dataset paths (#1332)

* Update arc_easy.yaml

* Update flan_cot.yaml

* update HF dataset path

* Update freeform.yaml

* Update flan_cot.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Don't use `get_task_dict()` in task registration / initialization (#1331)

* don't use get_task_dict() as a helper, it will download the dataset!

* pre-commit

* Update README.md

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* manage default (greedy) gen_kwargs in vllm (#1341)

* manage default (greedy) gen_kwargs in vllm better

* mirror HF `do_sample`

* just need to set temp=0 for greedy

* modified default gen_kwargs to work better with CLI; changed prompt_logprobs=1 (#1345)

* update links to task_guide.md (#1348)

* `Filter` docs not offset by `doc_id`  (#1349)

* get `doc` from instance

* acceletate bugfix: get ground doc from instance

* convert filter to `process_result`

* get docs from instances in `FilterEnsemble`

* rename

* nit

* better looping

* fix typehint

* Add FAQ on `lm_eval.tasks.initialize_tasks()` to README (#1330)

* Update README.md

* [!Tip]

* Refix issue regarding stderr (#1357)

* Add causalLM OpenVino models (#1290)

* added intel optimum

* added intel optimum in readme

* modified intel optimum

* modified intel optimum

* modified intel optimum

* modified install optimum

* modified path of IR file

* added openvino_device

* added openvino_device2

* changed optimum-causal to openvino-causal

* Update README.md

* Update README.md

* remove `lm_eval.base` import

* update openvino-causal -> openvino ; pass device through super().__init__()

* Update README.md

* Add optimum to tests dependencies

* apply pre-commit

* fix so tests pass

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Apply some best practices and guideline recommendations to code (#1363)

* raise Exception, not a string

Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes
https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions

* Apply PEP8 recommendation to prefer isinstance

"Object type comparisons should always use isinstance() instead of comparing types directly"
https://peps.python.org/pep-0008/

* Remove dangerous default mutable values in arguments

https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html

* Format logging messages with fstring (not with format)

Additional info
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html
There are also discussions about the speed of formatting while logging or some unintended code executions
pylint-dev/pylint#2395
https://stackoverflow.com/a/54368109
but at least one format (fstring one) will be used throughout the project

* Specify utf-8 encoding for `open` explicitly

If not specified, it may be supposed differently in different environments, OSes, and Python versions. See
https://peps.python.org/pep-0597/
https://docs.python.org/3.11/library/locale.html#locale.getencoding
https://docs.python.org/3.10/library/os.html#utf8-mode
https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html

Helps also if some code from English language tasks is taken as inspiration for tasks in non-English languages.

* Use inline-ignoring comments to pass pre-commit instead of identity process

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors
https://www.flake8rules.com/rules/F841.html

flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression

* serialize callable functions in config (#1367)

* delay filter init; remove `*args` (#1369)

* delay filter init; remove `*args`

* bugfix

* optimize

* type hint

* Fix unintuitive `--gen_kwargs` behavior (#1329)

* don't override do_sample if no value for it is passed

* Update gen_kwargs override condition

* Update huggingface.py

* Update huggingface.py

* run linters

* silence an erroneous warning

* Publish to pypi (#1194)

* publish to pypi

* lint

* Update publish.yml

* minor

* Make dependencies compatible with PyPI (#1378)

* make deps not point to github urls

* formatting

* try making PyPI only run on tag pushes

* Add support for RWKV models with World tokenizer (#1374)

* Add support for RWKV models with World tokenizer

The RWKV line of model with the World tokenizer, does not allow the padding token to be configured, and has its value preset as 0

This however fails all the "if set" checks, and would cause the tokenizer to crash.

A tokenizer class name check was added, in addition to a model type check, as there exists RWKV models which uses the neox tokenizers

* Update huggingface.py

Genericized so that this supports any RWKVWorld tokenizer, and added a fall-back for if the HF implementation name changes.

* Comply with formatting guidelines

* fix format

---------

Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add bypass metric (#1156)

* add bypass metric

* fixed `bypass` metric.

* add task attributes if predict_only

* add `predict_only` checks

* add docs

* added `overide_metric`, `override_config` to `Task`

* nits

* nit

* changed --predict_only to generations; nits

* nits

* nits

* change gen_kwargs warning

* add note about `--predict_only` in README.md

* added `predict_only`

* move table to bottom

* nit

* change null aggregation to bypass (conflict)

* bugfix; default `temp=0.0`

* typo

* loglikelihood refactor using template lm

* lint

* code review

* neuron optimum

* Mention TemplateLM in model_guide.md

* Update lm_eval/api/model.py

* fix linter

* fix format

* fix format

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Co-authored-by: Hannibal046 <38466901+Hannibal046@users.noreply.github.com>
Co-authored-by: Danielle Pintz <38207072+daniellepintz@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: thnkinbtfly <70014488+thnkinbtfly@users.noreply.github.com>
Co-authored-by: NoushNabi <33136068+NoushNabi@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: Eugene Cheah <PicoCreator@users.noreply.github.com>
wx-zhang pushed a commit to wx-zhang/lm-evaluation-harness that referenced this pull request Mar 13, 2024
wx-zhang pushed a commit to wx-zhang/lm-evaluation-harness that referenced this pull request Mar 13, 2024
nightingal3 pushed a commit to mycoalchen/lm-evaluation-harness that referenced this pull request May 2, 2024

Successfully merging this pull request may close these issues.

Different score when using accelerate