
Wandb logger can't handle groups with heterogeneous metrics #1958

Open · dmitrii-palisaderesearch opened this issue Jun 12, 2024 · 11 comments
@dmitrii-palisaderesearch

Hi,

The wandb logger chokes if a group contains some tasks that output numbers and some that output strings. This is either a bug in WandbLogger.log_eval_samples or in the openllm group (maybe group tasks ought to be homogeneous by design).

lm-eval --tasks openllm \
        --wandb_args entity=XXX,project=XXX \
        # use any model to reproduce
Traceback
TypeError                                 Traceback (most recent call last)

/usr/local/lib/python3.10/site-packages/lm_eval/logging_utils.py in log_eval_samples(self, samples)
395 self._log_samples_as_artifact(eval_preds, task_name)
396
--> 397 self.run.log({f"{group}_eval_results": grouped_df})
398
399

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
418 return cls.Dummy()
419
--> 420 return func(self, *args, **kwargs)
421
422 return wrapper

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper_fn(self, *args, **kwargs)
369 def wrapper_fn(self: Type["Run"], *args: Any, **kwargs: Any) -> Any:
370 if not getattr(self, "_is_finished", False):
--> 371 return func(self, *args, **kwargs)
372
373 default_message = (

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
359 raise e
360 cls._is_attaching = ""
--> 361 return func(self, *args, **kwargs)
362
363 return wrapper

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in log(self, data, step, commit, sync)
1836 repeat=False,
1837 )
-> 1838 self._log(data=data, step=step, commit=commit)
1839
1840 @_run_decorator._noop_on_finish()

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in _log(self, data, step, commit)
1600 raise ValueError("Key values passed to wandb.log must be strings.")
1601
-> 1602 self._partial_history_callback(data, step, commit)
1603
1604 if step is not None:

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in _partial_history_callback(self, row, step, commit)
1472 not_using_tensorboard = len(wandb.patched["tensorboard"]) == 0
1473
-> 1474 self._backend.interface.publish_partial_history(
1475 row,
1476 user_step=self._step,

/usr/local/lib/python3.10/site-packages/wandb/sdk/interface/interface.py in publish_partial_history(self, data, user_step, step, flush, publish_step, run)
570 run = run or self._run
571
--> 572 data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
573 data.pop("_step", None)
574

/usr/local/lib/python3.10/site-packages/wandb/sdk/data_types/utils.py in history_dict_to_json(run, payload, step, ignore_copy_err)
50 )
51 else:
---> 52 payload[key] = val_to_json(
53 run, key, val, namespace=step, ignore_copy_err=ignore_copy_err
54 )

/usr/local/lib/python3.10/site-packages/wandb/sdk/data_types/utils.py in val_to_json(run, key, val, namespace, ignore_copy_err)
81
82 if util.is_pandas_data_frame(val):
---> 83 val = wandb.Table(dataframe=val)
84
85 elif util.is_matplotlib_typename(typename) or util.is_plotly_typename(typename):

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in __init__(self, columns, data, rows, dataframe, dtype, optional, allow_mixed_types)
207 # Explicit dataframe option
208 if dataframe is not None:
--> 209 self._init_from_dataframe(dataframe, columns, optional, dtype)
210 else:
211 # Expected pattern

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in _init_from_dataframe(self, dataframe, columns, optional, dtype)
264 self._make_column_types(dtype, optional)
265 for row in range(len(dataframe)):
--> 266 self.add_data(*tuple(dataframe[col].values[row] for col in self.columns))
267
268 def _make_column_types(self, dtype=None, optional=True):

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in add_data(self, *data)
408
409 # Update the table's column types
--> 410 result_type = self._get_updated_result_type(data)
411 self._column_types = result_type
412

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in _get_updated_result_type(self, row)
432 result_type = current_type.assign(incoming_row_dict)
433 if isinstance(result_type, _dtypes.InvalidType):
--> 434 raise TypeError(
435 "Data row contained incompatible types:\n{}".format(
436 current_type.explain(incoming_row_dict)

TypeError: Data row contained incompatible types:
{'id': 0, 'data': "Question: Jen and Tyler are gymnasts practicing flips. Jen is practicing the triple-flip while Tyler is practicing the double-flip. Jen did sixteen triple-flips during practice. Tyler flipped in the air half the number of times Jen did. How many double-flips did Tyler do?\nAnswer: Jen did 16 triple-flips, so she did 16 * 3 = <<16*3=48>>48 flips.\nTyler did half the number of flips, so he did 48 / 2 = <<48/2=24>>24 flips.\nA double flip has two flips, so Tyler did 24 / 2 = <<24/2=12>>12 double-flips.\n#### 12\n\nQuestion: Four people in a law firm are planning a party. Mary will buy a platter of pasta for $20 and a loaf of bread for $2. Elle and Andrea will split the cost for buying 4 cans of soda which cost $1.50 each, and chicken wings for $10. Joe will buy a cake that costs $5. How much more will Mary spend than the rest of the firm put together?\nAnswer: Mary will spend $20 + $2 = $<<20+2=22>>22.\nElle and Andrea will spend $1.5 x 4 = $<<1.5*4=6>>6 for the soda.\nElle and Andrea will spend $6 + $10 = $<<6+10=16>>16 for the soda and chicken wings.\nElle, Andrea, and Joe together will spend $16 + $5 = $<<16+5=21>>21.\nSo, Mary will spend $22 - $21 = $<<22-21=1>>1 more than all of them combined.\n#### 1\n\nQuestion: A charcoal grill burns fifteen coals to ash every twenty minutes of grilling. The grill ran for long enough to burn three bags of coals. Each bag of coal contains 60 coals. How long did the grill run?\nAnswer: The grill burned 3 * 60 = <<3*60...
Key 'labels':
String not assignable to None or Number
String not assignable to None
and
String not assignable to Number
Key 'raw_predictions':
String not assignable to None or Number
String not assignable to None
and
String not assignable to Number
Key 'filtered_predictions':
String not assignable to None or Number
String not assignable to None
and
String not assignable to Number

WandbLogger.log_eval_samples concatenates task outputs into one big DataFrame without normalizing types, and wandb balks at this.
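
For reference, here's a minimal standalone sketch of the failure outside lm-eval (column name copied from the traceback; the frames lm-eval actually builds have more columns):

import pandas as pd
import wandb

# One task's samples have numeric labels, another's are strings
# (e.g. gsm8k completions), as in a heterogeneous group.
numeric_task = pd.DataFrame({"id": [0], "labels": [1.0]})
string_task = pd.DataFrame({"id": [0], "labels": ["#### 12"]})

# log_eval_samples-style concat: one big frame, no type normalization.
grouped_df = pd.concat([numeric_task, string_task], ignore_index=True)

# run.log() converts a DataFrame to a wandb.Table internally; the first row
# pins the "labels" column to Number, so the second row's string raises
# "TypeError: Data row contained incompatible types".
wandb.Table(dataframe=grouped_df)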

@haileyschoelkopf
Contributor

@lintangsutawika will #1741 fix this, do you think?

We're working on making groups clearer: namely, making a distinction between homogeneous groups, which will report their aggregated scores on a given metric, and heterogeneous "groups" (→ tags), which are just convenience names used when invoking a number of related tasks at once.

@lintangsutawika
Contributor

I'm not yet able to reproduce this. It seems to work fine with the latest version in main.

lm-eval \
    --model_args "pretrained=gpt2" \
    --task openllm \
    --limit 4 \
    --device cpu \
    --wandb_args project=xxx

@haileyschoelkopf
Contributor

I think --log_samples may be required to reproduce?

@dmitrii-palisaderesearch
Author

@lintangsutawika
Contributor

Still seems to work from main. @dmitrii-palisaderesearch are you using the latest main?

lm-eval \
    --model_args "pretrained=gpt2" \
    --task openllm \
    --limit 4 \
    --device cpu \
    --wandb_args project=xxx \
    --log_samples --output test_wandb/

@dmitrii-palisaderesearch
Author

dmitrii-palisaderesearch commented Jun 13, 2024

Hey, here are the repro colab notebooks:

lm-eval CLI eats the exception and hides it under an INFO log entry, so it's a little hard to see—I added an lm_eval lib call as well so you can see the exact line that throws and a traceback.
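
The lib call follows the wandb logging pattern from the lm-eval docs, roughly like this (project name is a placeholder):

import lm_eval
from lm_eval.logging_utils import WandbLogger

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["openllm"],
    limit=4,
    log_samples=True,
)

wandb_logger = WandbLogger(project="xxx", job_type="eval")
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
wandb_logger.log_eval_samples(results["samples"])  # <- this raises the TypeError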

BTW thanks for your repro cmd, it's extremely helpful :)

@lintangsutawika
Contributor

Thanks, I see what the issue is now. It's a matter of not being able to reconcile different data types that end up in the same column, which can happen when calling a set of tasks that have different output types. I think the solution here is to not concatenate different tasks together. Unless this is actually desirable?
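
Something like this per-task logging would sidestep the column-type conflict entirely (a sketch, assuming samples is the per-task mapping of sample DataFrames that log_eval_samples already works with):

# Log one table per task; each task's frame is internally
# homogeneous, so wandb.Table accepts it.
for task_name, task_df in samples.items():
    self.run.log({f"{task_name}_eval_results": wandb.Table(dataframe=task_df)})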

@dmitrii-palisaderesearch
Author

So it would be convenient to get samples from:

  • concrete mmlu tasks (e.g. mmlu_abstract_algebra)
  • all of mmlu, with each sample tagged with its concrete task (sketched below)

OTOH, I don't feel like getting samples from "open llm leaderboard" will be useful: aggregate metrics suffice there.

This is probably what #1741 will do.
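
For the tagged variant, roughly (mmlu_samples is a hypothetical {subtask: DataFrame} mapping; the str cast is what would let heterogeneous groups survive the concat too):

import pandas as pd

frames = []
for task_name, task_df in mmlu_samples.items():
    tagged = task_df.astype(str)  # normalize dtypes so wandb.Table accepts the union
    tagged.insert(0, "task", task_name)  # tag each sample with its concrete task
    frames.append(tagged)
grouped_df = pd.concat(frames, ignore_index=True)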

@haileyschoelkopf
Contributor

For the specific MMLU case, in order to still support logging the full list, perhaps we can have a flag in group configs that retains all samples together for logging, and otherwise not log groups' samples? @lintangsutawika do you think this seems too contrived?

@lintangsutawika
Contributor

For groups like MMLU, the wandb issue shouldn't occur since all subtasks share the same format.

On grouping samples, the bigger issue may be how we log the results.json and samples.json files. I guess this means it's not a quick fix but one that should suit long-term usability.

Btw @dmitrii-palisaderesearch, if you want the samples from mmlu tasks, would running just mmlu suffice?

@dmitrii-palisaderesearch
Author

Sure, this works great. I just wanted to assemble my benchmark into one big yaml config and hit this.
