Result of Yi-6B-Chat on the BBH dataset cannot be reproduced #488

Closed

zerovl opened this issue Apr 8, 2024 · 1 comment

zerovl commented Apr 8, 2024

Reminder

  • I have searched the GitHub Discussions and Issues and have not found anything similar to this.

Motivation

We tested Yi-6B-Chat on BBH and obtained the following metrics:
{'temporal_sequences': '0.2200', 'disambiguation_qa': '0.2880', 'date_understanding': '0.3720', 'tracking_shuffled_objects_three_objects': '0.3520', 'penguins_in_a_table': '0.3493', 'geometric_shapes': '0.1880', 'snarks': '0.5056', 'ruin_names': '0.2880', 'tracking_shuffled_objects_seven_objects': '0.0920', 'tracking_shuffled_objects_five_objects': '0.2520', 'logical_deduction_three_objects': '0.4440', 'hyperbaton': '0.4640', 'logical_deduction_five_objects': '0.2720', 'logical_deduction_seven_objects': '0.1720', 'movie_recommendation': '0.3320', 'salient_translation_error_detection': '0.1400', 'reasoning_about_colored_objects': '0.3800', 'multistep_arithmetic_two': '0.0640', 'navigate': '0.5080', 'dyck_languages': '0.0000', 'word_sorting': '0.0360', 'sports_understanding': '0.4960', 'boolean_expressions': '0.4240', 'object_counting': '0.3440', 'formal_fallacies': '0.5040', 'causal_judgement': '0.5187', 'web_of_lies': '0.4760', 'TOTAL_AVERAGE': '0.3095'}

It seems that the average score is far below the 47.15 reported in the paper. Is there any plan to release the evaluation code so that we can reproduce the results on academic datasets?

We currently use this repo for evaluation: [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/light-eval/src/eval_bbh.py)
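
For reference, a minimal sketch (not part of the original report) that recomputes an unweighted macro-average over the 27 per-task scores above, with the values simply copied as floats. It comes out to roughly 0.314; the reported TOTAL_AVERAGE of 0.3095 presumably weights tasks by their number of examples, and either way both figures are far below 47.15.

# Minimal sketch: recompute an unweighted macro-average from the per-task
# BBH scores pasted above (values copied as floats).
per_task = {
    "temporal_sequences": 0.2200, "disambiguation_qa": 0.2880,
    "date_understanding": 0.3720,
    "tracking_shuffled_objects_three_objects": 0.3520,
    "penguins_in_a_table": 0.3493, "geometric_shapes": 0.1880,
    "snarks": 0.5056, "ruin_names": 0.2880,
    "tracking_shuffled_objects_seven_objects": 0.0920,
    "tracking_shuffled_objects_five_objects": 0.2520,
    "logical_deduction_three_objects": 0.4440, "hyperbaton": 0.4640,
    "logical_deduction_five_objects": 0.2720,
    "logical_deduction_seven_objects": 0.1720,
    "movie_recommendation": 0.3320,
    "salient_translation_error_detection": 0.1400,
    "reasoning_about_colored_objects": 0.3800,
    "multistep_arithmetic_two": 0.0640,
    "navigate": 0.5080, "dyck_languages": 0.0000, "word_sorting": 0.0360,
    "sports_understanding": 0.4960, "boolean_expressions": 0.4240,
    "object_counting": 0.3440, "formal_fallacies": 0.5040,
    "causal_judgement": 0.5187, "web_of_lies": 0.4760,
}
macro_avg = sum(per_task.values()) / len(per_task)
print(f"unweighted macro-average over {len(per_task)} tasks: {macro_avg:.4f}")
# -> unweighted macro-average over 27 tasks: 0.3141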

Solution

No response

Alternatives

No response

Anything Else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

nuoma commented May 30, 2024

Closed due to inactivity.
There is currently no plan to release the evaluation code, but we do conduct evaluation in-house.

nuoma closed this as completed May 30, 2024