Reminder
Motivation
We tested Yi-6B-Chat on BBH and obtained the following metrics:
| Task | Score |
|---|---|
| temporal_sequences | 0.2200 |
| disambiguation_qa | 0.2880 |
| date_understanding | 0.3720 |
| tracking_shuffled_objects_three_objects | 0.3520 |
| penguins_in_a_table | 0.3493 |
| geometric_shapes | 0.1880 |
| snarks | 0.5056 |
| ruin_names | 0.2880 |
| tracking_shuffled_objects_seven_objects | 0.0920 |
| tracking_shuffled_objects_five_objects | 0.2520 |
| logical_deduction_three_objects | 0.4440 |
| hyperbaton | 0.4640 |
| logical_deduction_five_objects | 0.2720 |
| logical_deduction_seven_objects | 0.1720 |
| movie_recommendation | 0.3320 |
| salient_translation_error_detection | 0.1400 |
| reasoning_about_colored_objects | 0.3800 |
| multistep_arithmetic_two | 0.0640 |
| navigate | 0.5080 |
| dyck_languages | 0.0000 |
| word_sorting | 0.0360 |
| sports_understanding | 0.4960 |
| boolean_expressions | 0.4240 |
| object_counting | 0.3440 |
| formal_fallacies | 0.5040 |
| causal_judgement | 0.5187 |
| web_of_lies | 0.4760 |
| **TOTAL_AVERAGE** | **0.3095** |
The averaged score is far from the 47.15 reported in the paper. Is there any plan to release the evaluation code so that we can reproduce the results on academic datasets?
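As a quick sanity check (a throwaway script, not taken from any of the repos involved), the unweighted mean of the 27 per-task scores above can be recomputed directly:

```python
# Sanity check: recompute the unweighted (macro) average of the
# 27 per-task BBH scores listed in the table above.
scores = [0.2200, 0.2880, 0.3720, 0.3520, 0.3493, 0.1880, 0.5056, 0.2880, 0.0920,
          0.2520, 0.4440, 0.4640, 0.2720, 0.1720, 0.3320, 0.1400, 0.3800, 0.0640,
          0.5080, 0.0000, 0.0360, 0.4960, 0.4240, 0.3440, 0.5040, 0.5187, 0.4760]
assert len(scores) == 27
print(f"macro-average over {len(scores)} tasks: {sum(scores) / len(scores):.4f}")
# -> macro-average over 27 tasks: 0.3141
```

The macro-average comes out to ≈0.3141, slightly above the reported TOTAL_AVERAGE of 0.3095 (the total is presumably weighted by per-task example counts, since e.g. 0.3493 ≈ 51/146 for penguins_in_a_table), but either way both numbers are nowhere near 47.15.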
We currently use the `eval_bbh.py` script from [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/light-eval/src/eval_bbh.py) for evaluation.
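BBH score gaps of this size often come down to prompt format and answer extraction rather than the model itself. Below is a minimal, illustrative sketch of exact-match scoring for BBH-style tasks; the `extract_answer` heuristic is an assumption for illustration only, not the logic used by eval_bbh.py or by the Yi paper:

```python
import re

# Illustrative sketch only: a simple exact-match scorer for BBH-style tasks.
# The extraction heuristic below is hypothetical, NOT the logic of eval_bbh.py;
# differences at exactly this step often explain large score gaps.
def extract_answer(completion: str) -> str:
    """Take the first token after the last 'answer is' marker, if any."""
    matches = re.findall(r"answer is\s*(.+)", completion, flags=re.IGNORECASE)
    tail = (matches[-1] if matches else completion).strip()
    return tail.split()[0].strip(".,:;") if tail else ""

def exact_match(completions, references):
    """Fraction of examples whose extracted answer equals the reference."""
    hits = sum(extract_answer(c) == r.strip() for c, r in zip(completions, references))
    return hits / len(references)

# A harness that only accepts '(B)' scores 0 on a model that answers 'B',
# and vice versa -- one plausible source of a discrepancy like this.
print(exact_match(["Let's think step by step... So the answer is (B)."], ["(B)"]))  # 1.0
print(exact_match(["The answer is B"], ["(B)"]))  # 0.0
```

Releasing the exact prompts and extraction code used for the paper's numbers would remove this ambiguity.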
Solution
No response
Alternatives
No response
Anything Else?
No response
Are you willing to submit a PR?