Implement the Natural Questions evaluation #9

StellaAthena · 2020-09-16T16:36:44Z

From the GPT-3 paper

In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense amount of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR+19], WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.
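To make the zero-, one-, and few-shot closed-book settings described above concrete, here is a minimal sketch of how such a prompt might be assembled. The function name `build_closed_book_prompt`, the `Q:`/`A:` layout, and the example questions are all assumptions for illustration; they are not taken from the harness or from the dataset.

```python
from typing import Sequence, Tuple


def build_closed_book_prompt(
    question: str,
    shots: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a k-shot closed-book QA prompt (hypothetical helper).

    No retrieved passage is included: the model sees only the solved
    example pairs in `shots` (k = len(shots)) and the target question.
    """
    # Each in-context example is a fully answered Q/A pair.
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    # The target question ends with an open "A:" for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Zero-shot: only the question itself (made-up example question).
zero_shot = build_closed_book_prompt("who wrote the origin of species")

# One-shot: a single solved example precedes the question.
one_shot = build_closed_book_prompt(
    "who wrote the origin of species",
    shots=[("what is the capital of france", "Paris")],
)
```

Passing more pairs in `shots` gives the few-shot setting; in all three cases nothing is fine-tuned and no external text is supplied.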

Data processing code implemented
Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
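As a rough illustration of what the scoring half of the evaluation could look like, below is a sketch of exact-match scoring with the answer normalization commonly used for open-domain QA (lowercase, strip punctuation and English articles, collapse whitespace). The names `normalize_answer` and `exact_match` are hypothetical and do not mirror the interface in lm_eval/base.py, which is not reproduced here; treat this as one plausible metric, not the harness's own.

```python
import re
import string
from typing import Iterable

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Iterable[str]) -> float:
    """Return 1.0 if the prediction matches any reference after normalization."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


# Made-up example: a question may have several acceptable reference answers.
score = exact_match("The Beatles.", ["Beatles", "the beatles band"])
```

Averaging `exact_match` over all questions would give a single accuracy-style number for the task.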

cfoster0 · 2020-10-01T04:26:19Z

Note: HuggingFace includes this in its datasets package.

https://huggingface.co/datasets/natural_questions

cfoster0 · 2020-10-05T06:09:37Z

Warning: This dataset is super big.

StellaAthena · 2020-10-05T13:25:02Z

> Warning: This dataset is super big.

How big is “super big”?

cfoster0 · 2020-10-05T19:21:30Z

97G.

sdtblck · 2020-10-05T20:30:15Z

what the fuck. Why are we not training on this.

sdtblck · 2020-10-05T20:31:34Z

Ah, dev set is only 1G. But we should add train set to the pile.

cfoster0 · 2020-10-07T18:01:02Z

We would need to dedupe this with Wikipedia, since the bulk of it is just the HTML of Wikipedia pages.
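A very rough sketch of what such a dedupe pass might look like: strip the HTML down to visible text, fingerprint it, and skip any page whose fingerprint is already known from Wikipedia. Everything here (`html_fingerprint`, `drop_seen_pages`, the sample pages) is hypothetical, and exact hashing is an assumption chosen for brevity; it only catches pages whose text is identical after normalization, so a real pass would likely need near-duplicate detection instead.

```python
import hashlib
from html.parser import HTMLParser
from typing import Iterable, Iterator, List, Set


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_fingerprint(html: str) -> str:
    """Hash the lowercased, whitespace-normalized text of an HTML page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_seen_pages(pages: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield only pages whose fingerprint is not in `seen`; `seen` is updated."""
    for html in pages:
        fingerprint = html_fingerprint(html)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield html


# Made-up example: the first page has the same text as a known Wikipedia page.
wikipedia_fingerprints = {html_fingerprint("<div>Hello world</div>")}
kept = list(
    drop_seen_pages(
        ["<p>Hello <b>world</b></p>", "<p>Something else</p>"],
        wikipedia_fingerprints,
    )
)
```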

moirage · 2021-01-28T07:35:34Z

I can claim this

StellaAthena · 2021-01-29T14:23:06Z

> I can claim this

Assigned!

cr458 · 2023-03-27T18:27:09Z

would love to take this on if help on implementing the evaluation is still needed?

StellaAthena · 2023-03-29T21:29:38Z

> would love to take this on if help on implementing the evaluation is still needed?

Yes this would be quite helpful. Thanks!

haileyschoelkopf · 2023-04-24T13:10:53Z

I think Natural Questions is implemented already? https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/naturalqs.py

juletx · 2023-04-24T13:21:54Z

@haileyschoelkopf Some methods are not implemented, they raise NotImplementedError

haileyschoelkopf · 2023-04-24T13:28:23Z

Ah, you're right, sorry! I'm not sure why this was originally merged then. It's not in the task registry, though, so it should be alright to keep in the repo until the refactor is done, at which point we can decide what to do with it.

memray · 2023-06-12T05:19:30Z

I wonder what the progress of NQ eval is and if any help is needed?

StellaAthena · 2023-06-12T14:13:12Z

@memray I am under the impression that it hasn't been implemented and help is needed.

wwngh1233 · 2023-06-29T18:16:07Z

+1

Sea-Snell · 2023-08-09T22:50:05Z

+1

haileyschoelkopf · 2023-08-22T02:15:23Z

Closed by #789 which implements the NaturalQs dataset split used by Llama and (possibly, unconfirmed) used by PaLM and more!

StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020

StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020

StellaAthena pinned this issue Oct 23, 2020

cfoster0 mentioned this issue Oct 24, 2020

Add Natural Questions #52

Merged

anishthite assigned cfoster0 Oct 24, 2020

StellaAthena unpinned this issue Nov 30, 2020

StellaAthena removed the Eval Set label Dec 23, 2020

StellaAthena closed this as completed Jan 4, 2021

StellaAthena reopened this Jan 5, 2021

StellaAthena unassigned cfoster0 Jan 5, 2021

StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021

StellaAthena assigned moirage Jan 29, 2021

leogao2 unassigned moirage Feb 12, 2021

StellaAthena pushed a commit that referenced this issue Apr 29, 2022

Merge pull request #9 from jordiclive/webnlg

e51880d

webnlg

haileyschoelkopf closed this as completed Mar 25, 2023

haileyschoelkopf reopened this Mar 25, 2023

qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023

Merge pull request EleutherAI#9 from jordiclive/webnlg

384e369

webnlg

qmdnls mentioned this issue Aug 17, 2023

Add NQ-Open task based on the Natural Questions dataset #789

Merged

haileyschoelkopf linked a pull request Aug 18, 2023 that will close this issue

Add NQ-Open task based on the Natural Questions dataset #789

Merged

haileyschoelkopf closed this as completed in #789 Aug 21, 2023

LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023

Merge pull request EleutherAI#9 from jordiclive/webnlg

2fe9a2e

webnlg

NathanHB referenced this issue in huggingface/lm-evaluation-harness Jun 27, 2024

Merge pull request #9 from EleutherAI/chat_template

0ee30f1

Add docs on Chat Template interface to `docs/model_guide.md`

lintangsutawika pushed a commit that referenced this issue Jul 8, 2024

Merge pull request #9 from JessicaOjo/africamgsm

fa0ba22

remove added metrics -afrimgsm
