Error when running request generate_until #1310

Closed
fahadh4ilyas opened this issue Jan 18, 2024 · 12 comments

fahadh4ilyas commented Jan 18, 2024

Here is my config

group: open_llm_leaderboard
task:
  - task: arc_challenge
    task_alias: arc 25 shot
    num_fewshot: 25
    metric_list:
      - metric: acc_norm
  - task: hellaswag
    task_alias: hellaswag 10 shot
    process_docs: !function hellaswag_utils.process_docs
    num_fewshot: 10
    metric_list:
      - metric: acc_norm
  - task: hellaswag_id
    task_alias: hellaswag_id 10 shot
    process_docs: !function hellaswag_utils.process_docs
    num_fewshot: 10
    metric_list:
      - metric: acc_norm
  - task: truthfulqa_mc2
    task_alias: truthfulqa 0 shot
    dataset_path: truthful_qa
    dataset_name: multiple_choice
    output_type: multiple_choice
    process_results: !function truthfulqa_utils.process_results_mc2
    metric_list:
      - metric: acc
  - task: winogrande
    task_alias: winogrande 5 shot
    doc_to_text: !function preprocess_winogrande.doc_to_text
    doc_to_target: !function preprocess_winogrande.doc_to_target
    doc_to_choice: !function preprocess_winogrande.doc_to_choice
    num_fewshot: 5
    metric_list:
      - metric: acc
  - task: gsm8k
    task_alias: gsm8k 5 shot
    num_fewshot: 5
    metric_list:
      - metric: acc

And here is the error

2024-01-17:18:53:16,550 INFO     [evaluator.py:314] Running generate_until requests
  0%|                                                                              | 0/1319 [00:00<?, ?it/s]
Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 8
/home/fahadh/anaconda3/envs/merging/lib/python3.10/site-packages/transformers/generation/utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
100%|█████████████████████████████████████████████████████████████████████████████| 1319/1319 [1:10:00<00:00,  3.18s/it]
Traceback (most recent call last):
  File "/home/fahadh/anaconda3/envs/merging/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/fahadh/lm-evaluation-harness/lm_eval/__main__.py", line 231, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/fahadh/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/fahadh/lm-evaluation-harness/lm_eval/evaluator.py", line 150, in simple_evaluate
    results = evaluate(
  File "/home/fahadh/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/fahadh/lm-evaluation-harness/lm_eval/evaluator.py", line 451, in evaluate
    results[task_name][metric_key] = agg_fn(items)
  File "/home/fahadh/lm-evaluation-harness/lm_eval/api/metrics.py", line 20, in mean
    return sum(arr) / len(arr)
TypeError: unsupported operand type(s) for +: 'int' and 'list'

EDIT: The problem is from the gsm8k task. When I ran that task by itself, I got the same error. Here is the config

group: open_llm_leaderboard
task:
  - task: gsm8k
    task_alias: gsm8k 5 shot
    num_fewshot: 5
    metric_list:
      - metric: acc
@djstrong
Contributor

I have the same problem with the polemo2_in task.

@lintangsutawika
Contributor

gsm8k is a generate_until task and needs to use exact_match instead of acc, which is already set in its original yaml.

That aside, I've made a fix in #1315 so that you don't have to copy the utils functions; you can remove the lines with !function.
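For reference, a minimal sketch of the gsm8k entry with the metric swapped out (rest of your config unchanged; the defaults for aggregation should kick in, but double-check against the shipped gsm8k yaml):

group: open_llm_leaderboard
task:
  - task: gsm8k
    task_alias: gsm8k 5 shot
    num_fewshot: 5
    metric_list:
      - metric: exact_match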

@lintangsutawika
Contributor

Also, is hellaswag_id a custom task you made? I don't think it's in lm-eval currently.

@fahadh4ilyas
Author

gsm8k is a generate_until task and needs to use exact_match instead of acc, which is already set in its original yaml.

That aside, I've made a fix in #1315 so that you don't have to copy the utils functions; you can remove the lines with !function.

But on the Open LLM Leaderboard, gsm8k is measured using acc. Or did I misunderstand?

@fahadh4ilyas
Author

Also, is hellaswag_id a custom task you made? I don't think it's in lm-eval currently.

It is not? But I got it from lm_eval --task list. Maybe it's because I installed this repo with the [multilingual] optional dependencies?

@lintangsutawika
Contributor

Yeah, nvm, I forgot where it was located.

Anyway, the Open LLM Leaderboard uses an older version of lm-eval from before the big refactor. Currently, for generate_until tasks, we are opting to use exact_match since it gives some control over minor preprocessing details, like whether or not mis-capitalization should be penalized. You should check their implementation to see whether acc is simple string matching or includes some postprocessing, and adjust the exact_match parameters accordingly.
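For example, the relevant knobs in a metric_list entry look roughly like this (the values here are a sketch; the shipped gsm8k yaml is the authoritative reference):

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true          # don't penalize mis-capitalization
    ignore_punctuation: false  # but do penalize punctuation differences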

@djstrong
Contributor

gsm8k is a generate_until task and needs to use exact_match instead of acc, which is already set in its original yaml.

I guess polemo2_in can't use exact_match...

@lintangsutawika
Contributor

polemo2_in has accuracy, which I think can also be used for gsm8k and is probably the same as acc. lm-eval supports using metrics from Hugging Face's evaluate library.
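As a sketch, assuming the metric name resolves through the evaluate library, it should be enough to name it in the metric list:

metric_list:
  - metric: accuracy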

@fahadh4ilyas
Author

So, in short, acc is not a supported metric for gsm8k?

@lintangsutawika
Contributor

@djstrong could you check if this patch works?
#1318

@fahadh4ilyas acc is just shorthand for accuracy. gsm8k uses a variant intended for string matching that works in a similar sense; in the current version, that is exact_match.

@djstrong
Contributor

@lintangsutawika Sorry, it works without the patch :) I haven't checked with the patch.

@fahadh4ilyas Try with accuracy.

@haileyschoelkopf
Collaborator

Perhaps, to minimize confusion, we should simply add a name argument for each metric in the config that can be used to override the name the metric is reported under, so that GSM8k (and perhaps other tasks) can retain the name acc for its metric computation.
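Hypothetically, something like this (name is the proposed key and does not exist yet):

metric_list:
  - metric: exact_match
    name: acc  # reported in the results table as "acc"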
