
Wandb logger can't handle groups with heterogeneous metrics #1958

Open · dmitrii-palisaderesearch opened this issue Jun 12, 2024 · 11 comments
@dmitrii-palisaderesearch

Hi,

The wandb logger chokes if a group contains some tasks that output numbers and some that output strings. This is either a bug in WandbLogger.log_eval_samples or in the openllm group (maybe group tasks ought to be homogeneous by design).

lm-eval --tasks openllm \
        --wandb_args entity=XXX,project=XXX \
        # use any model to reproduce
Traceback
TypeError                                 Traceback (most recent call last)

/usr/local/lib/python3.10/site-packages/lm_eval/logging_utils.py in log_eval_samples(self, samples)
395 self._log_samples_as_artifact(eval_preds, task_name)
396
--> 397 self.run.log({f"{group}_eval_results": grouped_df})
398
399

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
418 return cls.Dummy()
419
--> 420 return func(self, *args, **kwargs)
421
422 return wrapper

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper_fn(self, *args, **kwargs)
369 def wrapper_fn(self: Type["Run"], *args: Any, **kwargs: Any) -> Any:
370 if not getattr(self, "_is_finished", False):
--> 371 return func(self, *args, **kwargs)
372
373 default_message = (

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
359 raise e
360 cls._is_attaching = ""
--> 361 return func(self, *args, **kwargs)
362
363 return wrapper

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in log(self, data, step, commit, sync)
1836 repeat=False,
1837 )
-> 1838 self._log(data=data, step=step, commit=commit)
1839
1840 @_run_decorator._noop_on_finish()

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in _log(self, data, step, commit)
1600 raise ValueError("Key values passed to wandb.log must be strings.")
1601
-> 1602 self._partial_history_callback(data, step, commit)
1603
1604 if step is not None:

/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_run.py in _partial_history_callback(self, row, step, commit)
1472 not_using_tensorboard = len(wandb.patched["tensorboard"]) == 0
1473
-> 1474 self._backend.interface.publish_partial_history(
1475 row,
1476 user_step=self._step,

/usr/local/lib/python3.10/site-packages/wandb/sdk/interface/interface.py in publish_partial_history(self, data, user_step, step, flush, publish_step, run)
570 run = run or self._run
571
--> 572 data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
573 data.pop("_step", None)
574

/usr/local/lib/python3.10/site-packages/wandb/sdk/data_types/utils.py in history_dict_to_json(run, payload, step, ignore_copy_err)
50 )
51 else:
---> 52 payload[key] = val_to_json(
53 run, key, val, namespace=step, ignore_copy_err=ignore_copy_err
54 )

/usr/local/lib/python3.10/site-packages/wandb/sdk/data_types/utils.py in val_to_json(run, key, val, namespace, ignore_copy_err)
81
82 if util.is_pandas_data_frame(val):
---> 83 val = wandb.Table(dataframe=val)
84
85 elif util.is_matplotlib_typename(typename) or util.is_plotly_typename(typename):

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in __init__(self, columns, data, rows, dataframe, dtype, optional, allow_mixed_types)
207 # Explicit dataframe option
208 if dataframe is not None:
--> 209 self._init_from_dataframe(dataframe, columns, optional, dtype)
210 else:
211 # Expected pattern

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in _init_from_dataframe(self, dataframe, columns, optional, dtype)
264 self._make_column_types(dtype, optional)
265 for row in range(len(dataframe)):
--> 266 self.add_data(*tuple(dataframe[col].values[row] for col in self.columns))
267
268 def _make_column_types(self, dtype=None, optional=True):

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in add_data(self, *data)
408
409 # Update the table's column types
--> 410 result_type = self._get_updated_result_type(data)
411 self._column_types = result_type
412

/usr/local/lib/python3.10/site-packages/wandb/data_types.py in _get_updated_result_type(self, row)
432 result_type = current_type.assign(incoming_row_dict)
433 if isinstance(result_type, _dtypes.InvalidType):
--> 434 raise TypeError(
435 "Data row contained incompatible types:\n{}".format(
436 current_type.explain(incoming_row_dict)

TypeError: Data row contained incompatible types:
{'id': 0, 'data': "Question: Jen and Tyler are gymnasts practicing flips. Jen is practicing the triple-flip while Tyler is practicing the double-flip. Jen did sixteen triple-flips during practice. Tyler flipped in the air half the number of times Jen did. How many double-flips did Tyler do?\nAnswer: Jen did 16 triple-flips, so she did 16 * 3 = <<16*3=48>>48 flips.\nTyler did half the number of flips, so he did 48 / 2 = <<48/2=24>>24 flips.\nA double flip has two flips, so Tyler did 24 / 2 = <<24/2=12>>12 double-flips.\n#### 12\n\nQuestion: Four people in a law firm are planning a party. Mary will buy a platter of pasta for $20 and a loaf of bread for $2. Elle and Andrea will split the cost for buying 4 cans of soda which cost $1.50 each, and chicken wings for $10. Joe will buy a cake that costs $5. How much more will Mary spend than the rest of the firm put together?\nAnswer: Mary will spend $20 + $2 = $<<20+2=22>>22.\nElle and Andrea will spend $1.5 x 4 = $<<1.5*4=6>>6 for the soda.\nElle and Andrea will spend $6 + $10 = $<<6+10=16>>16 for the soda and chicken wings.\nElle, Andrea, and Joe together will spend $16 + $5 = $<<16+5=21>>21.\nSo, Mary will spend $22 - $21 = $<<22-21=1>>1 more than all of them combined.\n#### 1\n\nQuestion: A charcoal grill burns fifteen coals to ash every twenty minutes of grilling. The grill ran for long enough to burn three bags of coals. Each bag of coal contains 60 coals. How long did the grill run?\nAnswer: The grill burned 3 * 60 = <<3*60...
Key 'labels':
String not assignable to None or Number
String not assignable to None
and
String not assignable to Number
Key 'raw_predictions':
String not assignable to None or Number
String not assignable to None
and
String not assignable to Number
Key 'filtered_predictions':
String not assignable to None or Number
String not assignable to None
and
String not assignable to Number

WandbLogger.log_eval_samples concatenates task outputs into one big DataFrame without normalizing types, and wandb balks at this.
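
For reference, here's a minimal standalone sketch of the failure outside lm-eval (column name copied from the traceback; the frames lm-eval actually builds have more columns):

import pandas as pd
import wandb

# One task's samples have numeric labels, another's are strings
# (e.g. gsm8k completions), as in a heterogeneous group.
numeric_task = pd.DataFrame({"id": [0], "labels": [1.0]})
string_task = pd.DataFrame({"id": [0], "labels": ["#### 12"]})

# log_eval_samples-style concat: one big frame, no type normalization.
grouped_df = pd.concat([numeric_task, string_task], ignore_index=True)

# run.log() converts a DataFrame to a wandb.Table internally; the first row
# pins the "labels" column to Number, so the second row's string raises
# "TypeError: Data row contained incompatible types".
wandb.Table(dataframe=grouped_df)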

@haileyschoelkopf
Contributor

@lintangsutawika will #1741 fix this, do you think?

We're working on making groups clearer: namely, making a distinction between homogeneous groups, which will report their aggregated scores on a given metric, and heterogeneous "groups" (→ tags), which are just convenience names used when invoking a number of related tasks at once.

@lintangsutawika
Contributor

I'm not yet able to reproduce this. It seems to work fine with the latest version in main.

lm-eval \
    --model_args "pretrained=gpt2" \
    --task openllm \
    --limit 4 \
    --device cpu \
    --wandb_args project=xxx

@haileyschoelkopf
Contributor

I think --log_samples may be required to reproduce?

@dmitrii-palisaderesearch
Author

@lintangsutawika
Contributor

Still seems to work from main. @dmitrii-palisaderesearch are you using the latest main?

lm-eval \
    --model_args "pretrained=gpt2" \
    --task openllm \
    --limit 4 \
    --device cpu \
    --wandb_args project=xxx \
    --log_samples --output test_wandb/

@dmitrii-palisaderesearch
Author

dmitrii-palisaderesearch commented Jun 13, 2024

Hey, here are the repro colab notebooks:

lm-eval CLI eats the exception and hides it under an INFO log entry, so it's a little hard to see—I added an lm_eval lib call as well so you can see the exact line that throws and a traceback.
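
The lib call follows the wandb logging pattern from the lm-eval docs, roughly like this (project name is a placeholder):

import lm_eval
from lm_eval.logging_utils import WandbLogger

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["openllm"],
    limit=4,
    log_samples=True,
)

wandb_logger = WandbLogger(project="xxx", job_type="eval")
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
wandb_logger.log_eval_samples(results["samples"])  # <- this raises the TypeError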

BTW thanks for your repro cmd, it's extremely helpful :)

@lintangsutawika
Contributor

Thanks, I see what the issue is now. It's a matter of not being able to reconcile different data types that end up in the same column, which can happen when calling a set of tasks that have different output types. I think the solution here is to not concatenate different tasks together. Unless this is actually desirable?
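
Something like this per-task logging would sidestep the column-type conflict entirely (a sketch, assuming samples is the per-task mapping of sample DataFrames that log_eval_samples already works with):

# Log one table per task; each task's frame is internally
# homogeneous, so wandb.Table accepts it.
for task_name, task_df in samples.items():
    self.run.log({f"{task_name}_eval_results": wandb.Table(dataframe=task_df)})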

@dmitrii-palisaderesearch
Author

So it would be convenient to get samples from:

  • concrete mmlu tasks (e.g. mmlu_abstract_algebra)
  • all of mmlu, with each sample tagged with its concrete task (sketched below)

OTOH, I don't feel like getting samples from "open llm leaderboard" will be useful: aggregate metrics suffice there.

This is probably what #1741 will do.
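
For the tagged variant, roughly (mmlu_samples is a hypothetical {subtask: DataFrame} mapping; the str cast is what would let heterogeneous groups survive the concat too):

import pandas as pd

frames = []
for task_name, task_df in mmlu_samples.items():
    tagged = task_df.astype(str)  # normalize dtypes so wandb.Table accepts the union
    tagged.insert(0, "task", task_name)  # tag each sample with its concrete task
    frames.append(tagged)
grouped_df = pd.concat(frames, ignore_index=True)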

@haileyschoelkopf
Contributor

For the specific MMLU case, in order to still support logging the full list, perhaps we can have a flag in group configs that retains all samples together for logging, and otherwise not log groups' samples? @lintangsutawika do you think this seems too contrived?

@lintangsutawika
Contributor

For groups like MMLU, the wandb issue shouldn't occur since all subtasks share the same format.

On grouping samples, the bigger issue may be how we log the results.json and samples.json files. I guess this means it's not a quick fix but one that should suit long-term usability.

Btw @dmitrii-palisaderesearch, if you want the samples from mmlu tasks, would running just mmlu suffice?

@dmitrii-palisaderesearch
Author

Sure, this works great. I just wanted to assemble my benchmark into one big yaml config and hit this.
