Wandb logger can't handle groups with heterogenous metrics #1958
Comments
@lintangsutawika will #1741 fix this, do you think? We're working on making groups more clear--namely, making a distinction between homogenous groups which will report their aggregated scores on a given metric, and heterogenous "groups" (--> |
I'm not yet able to reproduce this. It seems to work fine with the latest version in
|
I think |
Yes, it is. Sorry, missed it in the repro.
…On Wed, Jun 12, 2024 at 8:14 PM Hailey Schoelkopf ***@***.***> wrote:
I think --log_samples may be required to reproduce?
|
Still seems to work on from
|
Hey, here are the repro colab notebooks:
BTW thanks for your repro cmd, it's extremely helpful :) |
Thanks, I see what the issue is now. The logger can't reconcile different data types that end up in the same column, which can happen when calling a set of tasks whose outputs have different data types. I think the solution here is to not concatenate different tasks together. Unless this is actually desirable? |
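To make the "don't concatenate different tasks" idea concrete, here is a minimal, dependency-free sketch. It assumes each sample is a plain dict carrying a `task` key; that record shape, the task names, and the helper `split_samples_by_task` are hypothetical illustrations, not part of lm-eval or wandb.

```python
from collections import defaultdict


def split_samples_by_task(samples):
    """Group sample records by task so each resulting table is homogeneous.

    Hypothetical helper: assumes every record is a dict with a "task" key.
    """
    tables = defaultdict(list)
    for record in samples:
        tables[record["task"]].append(record)
    return dict(tables)


# Made-up samples: one task reports numeric targets, the other string targets.
samples = [
    {"task": "numeric_task", "doc_id": 0, "target": 1},
    {"task": "numeric_task", "doc_id": 1, "target": 3},
    {"task": "text_task", "doc_id": 0, "target": "Paris"},
]

tables = split_samples_by_task(samples)

# Within each per-task table, the "target" column now holds a single type.
target_types = {
    task: {type(row["target"]) for row in rows} for task, rows in tables.items()
}
```

Each per-task list could then be logged as its own table, so no single column ever has to hold both numbers and strings.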
So it would be convenient to get samples from
OTOH, I don't feel like getting samples from "open llm leaderboard" will be useful: aggregate metrics suffice there. This is probably what #1741 will do. |
For the specific MMLU case, in order to support still having the full list logged perhaps we can have a flag in group configs that retains samples all together for logging, and otherwise not log groups' samples? @lintangsutawika do you think this seems too contrived? |
For groups like MMLU, the wandb issue shouldn't occur since all subtasks share the same format. On grouping samples, the bigger issue may be how we log the results.json and samples.json files. I guess this means that it's not a quick fix but one that should suit long-term usability. Btw @dmitrii-palisaderesearch, if you want the samples from mmlu tasks, would running just |
Sure, this works great. I just wanted to assemble my benchmark into one big yaml config and hit this. |
Hi,
The wandb logger chokes if a group contains some tasks that output numbers and some that output strings. This is either a bug in WandbLogger.log_eval_samples or in the openllm group (maybe group tasks ought to be homogenous by design).
lm-eval --tasks openllm \
    --wandb_args entity=XXX,project=XXX \
    # use any model to reproduce
Traceback
WandbLogger.log_eval_samples concatenates task outputs into one big dataframe without converting types, and wandb balks at this.
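For illustration, one possible workaround is to detect columns whose values have mixed types and cast them to strings before building a single table. This is only a sketch on plain dicts: the row layout and the helpers `mixed_type_columns` and `stringify_columns` are hypothetical, and this is not how WandbLogger is actually implemented.

```python
def mixed_type_columns(rows):
    """Return the sorted names of columns whose values have more than one type."""
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(type(value))
    return sorted(key for key, types in seen.items() if len(types) > 1)


def stringify_columns(rows, columns):
    """Return copies of the rows with the given columns cast to str."""
    return [
        {key: (str(value) if key in columns else value) for key, value in row.items()}
        for row in rows
    ]


# Made-up rows from two tasks: "target" is an int in one and a str in the other.
rows = [
    {"doc_id": 0, "target": 1},
    {"doc_id": 1, "target": "Paris"},
]

bad_columns = mixed_type_columns(rows)
fixed_rows = stringify_columns(rows, set(bad_columns))
remaining = mixed_type_columns(fixed_rows)
```

Casting to strings keeps everything in one table but loses the numeric type, so whether this is preferable to separate per-task tables is a design choice.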