Implement the Natural Questions evaluation #9
Comments
Note: HuggingFace includes this in its datasets package. |
Warning: This dataset is super big. |
How big is “super big”? |
97G. |
what the fuck. Why are we not training on this. |
Ah, dev set is only 1G. But we should add train set to the pile. |
We would need to dedupe this with Wikipedia, since the bulk of it is just the HTML of Wikipedia pages. |
I can claim this |
Assigned! |
would love to take this on if help on implementing the evaluation is still needed? |
Yes this would be quite helpful. Thanks! |
I think Natural Questions is implemented already? https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/naturalqs.py |
@haileyschoelkopf Some methods are not implemented, they raise |
Ah you're right sorry!--I'm not sure why this was originally merged then. It's not in the task registry though so it should be alright to keep in the repo until the refactor is done, at which point we can decide what to do with it |
I wonder what the progress of NQ eval is and if any help is needed? |
@memray I am under the impression that it hasn't been implemented and help is needed. |
+1 |
+1 |
Closed by #789 which implements the NaturalQs dataset split used by Llama and (possibly, unconfirmed) used by PaLM and more! |
From the GPT-3 paper.
The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
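For anyone picking this up, here is a rough sketch of what the scoring side of a short-answer NQ task might look like. It is not the harness's real interface: the class `NaturalQuestionsSketch`, its method names, the `question`/`answers` field names, and the "Q: ... A:" prompt format are all assumptions made for illustration, and the exact-match normalization is the common SQuAD-style one rather than anything confirmed in this thread.

```python
import re
import string
from typing import Dict, List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> float:
    """Return 1.0 if the prediction equals any reference after normalization."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


class NaturalQuestionsSketch:
    """Hypothetical stand-in for a task class (not the class in lm_eval/base.py)."""

    def doc_to_text(self, doc: Dict) -> str:
        # Assumed prompt format; the real task may format the question differently.
        return f"Q: {doc['question']}\nA:"

    def process_results(self, doc: Dict, generation: str) -> Dict[str, float]:
        # Score only the first generated line, so a model that keeps writing
        # further "Q:/A:" pairs is not penalized for the extra text.
        answer = generation.strip().split("\n")[0]
        return {"exact_match": exact_match(answer, doc["answers"])}
```

Usage would be along the lines of `NaturalQuestionsSketch().process_results(doc, model_output)`, where `doc` is a dict with a `question` string and a list of acceptable `answers`. A real implementation would subclass whatever lm_eval/base.py provides and follow the BoolQ task for dataset loading and few-shot handling.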