Error when running request generate_until #1310

Closed
fahadh4ilyas opened this issue Jan 18, 2024 · 12 comments

fahadh4ilyas commented Jan 18, 2024

Here is my config

group: open_llm_leaderboard
task:
  - task: arc_challenge
    task_alias: arc 25 shot
    num_fewshot: 25
    metric_list:
      - metric: acc_norm
  - task: hellaswag
    task_alias: hellaswag 10 shot
    process_docs: !function hellaswag_utils.process_docs
    num_fewshot: 10
    metric_list:
      - metric: acc_norm
  - task: hellaswag_id
    task_alias: hellaswag_id 10 shot
    process_docs: !function hellaswag_utils.process_docs
    num_fewshot: 10
    metric_list:
      - metric: acc_norm
  - task: truthfulqa_mc2
    task_alias: truthfulqa 0 shot
    dataset_path: truthful_qa
    dataset_name: multiple_choice
    output_type: multiple_choice
    process_results: !function truthfulqa_utils.process_results_mc2
    metric_list:
      - metric: acc
  - task: winogrande
    task_alias: winogrande 5 shot
    doc_to_text: !function preprocess_winogrande.doc_to_text
    doc_to_target: !function preprocess_winogrande.doc_to_target
    doc_to_choice: !function preprocess_winogrande.doc_to_choice
    num_fewshot: 5
    metric_list:
      - metric: acc
  - task: gsm8k
    task_alias: gsm8k 5 shot
    num_fewshot: 5
    metric_list:
      - metric: acc

And here is the error

2024-01-17:18:53:16,550 INFO     [evaluator.py:314] Running generate_until requests
  0%|                                                                              | 0/1319 [00:00<?, ?it/s]
Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 8
/home/fahadh/anaconda3/envs/merging/lib/python3.10/site-packages/transformers/generation/utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
100%|█████████████████████████████████████████████████████████████████████████████| 1319/1319 [1:10:00<00:00,  3.18s/it]
Traceback (most recent call last):
  File "/home/fahadh/anaconda3/envs/merging/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/fahadh/lm-evaluation-harness/lm_eval/__main__.py", line 231, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/fahadh/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/fahadh/lm-evaluation-harness/lm_eval/evaluator.py", line 150, in simple_evaluate
    results = evaluate(
  File "/home/fahadh/lm-evaluation-harness/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/fahadh/lm-evaluation-harness/lm_eval/evaluator.py", line 451, in evaluate
    results[task_name][metric_key] = agg_fn(items)
  File "/home/fahadh/lm-evaluation-harness/lm_eval/api/metrics.py", line 20, in mean
    return sum(arr) / len(arr)
TypeError: unsupported operand type(s) for +: 'int' and 'list'

EDIT: The problem is from the gsm8k task. When I ran that task by itself, I got the same error. Here is the config

group: open_llm_leaderboard
task:
  - task: gsm8k
    task_alias: gsm8k 5 shot
    num_fewshot: 5
    metric_list:
      - metric: acc
@djstrong
Contributor

I have the same problem with the polemo2_in task.

@lintangsutawika
Contributor

gsm8k is a generate_until task and needs to use exact_match instead of acc, which is already set in its original yaml.

That aside, I've made a fix in #1315 so that you don't have to copy the utils functions; you can remove the lines with !function.
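For reference, a minimal sketch of the gsm8k entry with the metric swapped out (rest of your config unchanged; the defaults for aggregation should kick in, but double-check against the shipped gsm8k yaml):

group: open_llm_leaderboard
task:
  - task: gsm8k
    task_alias: gsm8k 5 shot
    num_fewshot: 5
    metric_list:
      - metric: exact_match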

@lintangsutawika
Contributor

Also, is hellaswag_id a custom task you made? I don't think it's in lm-eval currently.

@fahadh4ilyas
Author

gsm8k is a generate_until task and needs to use exact_match instead of acc, which is already set in its original yaml.

That aside, I've made a fix in #1315 so that you don't have to copy the utils functions; you can remove the lines with !function.

But on the Open LLM Leaderboard, gsm8k is measured using acc. Or did I misunderstand?

@fahadh4ilyas
Author

Also, is hellaswag_id a custom task you made? I don't think it's in lm-eval currently.

It is not? But I got it from lm_eval --task list. Maybe it's because I installed this repo with the [multilingual] optional dependencies?

@lintangsutawika
Contributor

Yeah, nvm, I forgot where it was located.

Anyway, the Open LLM Leaderboard uses an older version of lm-eval from before the big refactor. Currently, for generate_until tasks, we are opting to use exact_match since it gives some control over minor preprocessing details, like whether or not mis-capitalization should be penalized. You should check their implementation to see whether acc is simple string matching or includes some postprocessing, and adjust the exact_match parameters accordingly.
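For example, the relevant knobs in a metric_list entry look roughly like this (the values here are a sketch; the shipped gsm8k yaml is the authoritative reference):

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true          # don't penalize mis-capitalization
    ignore_punctuation: false  # but do penalize punctuation differences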

@djstrong
Contributor

gsm8k is a generate_until task and needs to use exact_match instead of acc, which is already set in its original yaml.

I guess polemo2_in can't use exact_match...

@lintangsutawika
Contributor

polemo2_in has accuracy, which I think can also be used for gsm8k and is probably the same as acc. lm-eval supports using metrics from Hugging Face's evaluate library.
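As a sketch, assuming the metric name resolves through the evaluate library, it should be enough to name it in the metric list:

metric_list:
  - metric: accuracy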

@fahadh4ilyas
Author

So, in short, acc is not a supported metric for gsm8k?

@lintangsutawika
Contributor

@djstrong could you check if this patch works?
#1318

@fahadh4ilyas acc is just shorthand for accuracy. gsm8k uses a variant intended for string matching that works in a similar sense; in the current version, that is exact_match.

@djstrong
Contributor

@lintangsutawika Sorry, it works without the patch :) I haven't checked with the patch.

@fahadh4ilyas Try with accuracy.

@haileyschoelkopf
Collaborator

Perhaps, to minimize confusion, we should simply add a name argument for each metric in the config that can be used to override the name the metric is reported under, so that GSM8k (and perhaps other tasks) can retain the name acc for its metric computation.
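Hypothetically, something like this (name is the proposed key and does not exist yet):

metric_list:
  - metric: exact_match
    name: acc  # reported in the results table as "acc"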
