Reminder
Motivation
We tested Yi-6B-Chat on BBH and obtained the following metrics:
| Task | Score |
|---|---|
| temporal_sequences | 0.2200 |
| disambiguation_qa | 0.2880 |
| date_understanding | 0.3720 |
| tracking_shuffled_objects_three_objects | 0.3520 |
| penguins_in_a_table | 0.3493 |
| geometric_shapes | 0.1880 |
| snarks | 0.5056 |
| ruin_names | 0.2880 |
| tracking_shuffled_objects_seven_objects | 0.0920 |
| tracking_shuffled_objects_five_objects | 0.2520 |
| logical_deduction_three_objects | 0.4440 |
| hyperbaton | 0.4640 |
| logical_deduction_five_objects | 0.2720 |
| logical_deduction_seven_objects | 0.1720 |
| movie_recommendation | 0.3320 |
| salient_translation_error_detection | 0.1400 |
| reasoning_about_colored_objects | 0.3800 |
| multistep_arithmetic_two | 0.0640 |
| navigate | 0.5080 |
| dyck_languages | 0.0000 |
| word_sorting | 0.0360 |
| sports_understanding | 0.4960 |
| boolean_expressions | 0.4240 |
| object_counting | 0.3440 |
| formal_fallacies | 0.5040 |
| causal_judgement | 0.5187 |
| web_of_lies | 0.4760 |
| **TOTAL_AVERAGE** | **0.3095** |
The averaged score is far from the 47.15 reported in the paper. Is there any plan to release the evaluation code so that we can reproduce the results on academic datasets?
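As a quick sanity check (a throwaway script, not taken from any of the repos involved), the unweighted mean of the 27 per-task scores above can be recomputed directly:

```python
# Sanity check: recompute the unweighted (macro) average of the
# 27 per-task BBH scores listed in the table above.
scores = [0.2200, 0.2880, 0.3720, 0.3520, 0.3493, 0.1880, 0.5056, 0.2880, 0.0920,
          0.2520, 0.4440, 0.4640, 0.2720, 0.1720, 0.3320, 0.1400, 0.3800, 0.0640,
          0.5080, 0.0000, 0.0360, 0.4960, 0.4240, 0.3440, 0.5040, 0.5187, 0.4760]
assert len(scores) == 27
print(f"macro-average over {len(scores)} tasks: {sum(scores) / len(scores):.4f}")
# -> macro-average over 27 tasks: 0.3141
```

The macro-average comes out to ≈0.3141, slightly above the reported TOTAL_AVERAGE of 0.3095 (the total is presumably weighted by per-task example counts, since e.g. 0.3493 ≈ 51/146 for penguins_in_a_table), but either way both numbers are nowhere near 47.15.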
We currently use the `eval_bbh.py` script from [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/light-eval/src/eval_bbh.py) for evaluation.
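BBH score gaps of this size often come down to prompt format and answer extraction rather than the model itself. Below is a minimal, illustrative sketch of exact-match scoring for BBH-style tasks; the `extract_answer` heuristic is an assumption for illustration only, not the logic used by eval_bbh.py or by the Yi paper:

```python
import re

# Illustrative sketch only: a simple exact-match scorer for BBH-style tasks.
# The extraction heuristic below is hypothetical, NOT the logic of eval_bbh.py;
# differences at exactly this step often explain large score gaps.
def extract_answer(completion: str) -> str:
    """Take the first token after the last 'answer is' marker, if any."""
    matches = re.findall(r"answer is\s*(.+)", completion, flags=re.IGNORECASE)
    tail = (matches[-1] if matches else completion).strip()
    return tail.split()[0].strip(".,:;") if tail else ""

def exact_match(completions, references):
    """Fraction of examples whose extracted answer equals the reference."""
    hits = sum(extract_answer(c) == r.strip() for c, r in zip(completions, references))
    return hits / len(references)

# A harness that only accepts '(B)' scores 0 on a model that answers 'B',
# and vice versa -- one plausible source of a discrepancy like this.
print(exact_match(["Let's think step by step... So the answer is (B)."], ["(B)"]))  # 1.0
print(exact_match(["The answer is B"], ["(B)"]))  # 0.0
```

Releasing the exact prompts and extraction code used for the paper's numbers would remove this ambiguity.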
Solution
No response
Alternatives
No response
Anything Else?
No response
Are you willing to submit a PR?