Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM enables fairer and less costly LLM evaluation, as evidenced by the significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with Alpaca's default hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.
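The abstract describes a pairwise-judgment setup: a judge LLM is shown two model responses to the same instruction and picks a winner (or a tie), and the judge's verdicts are scored against human labels with an F1 metric. The sketch below illustrates that protocol in minimal form; the prompt template, the verdict label set, and the macro-averaged F1 are illustrative assumptions, not PandaLM's actual interface or metric definition (see the linked repository for the real usage).

```python
LABELS = ("1", "2", "Tie")  # response 1 wins, response 2 wins, or tie

def build_prompt(instruction: str, resp1: str, resp2: str) -> str:
    """Assemble a pairwise comparison prompt for a judge LLM.

    The exact wording is a hypothetical placeholder, not PandaLM's template.
    """
    return (
        f"Instruction: {instruction}\n"
        f"Response 1: {resp1}\n"
        f"Response 2: {resp2}\n"
        "Which response is better? Answer '1', '2', or 'Tie'."
    )

def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Macro-averaged F1 of judge verdicts against human labels."""
    scores = []
    for label in LABELS:
        tp = sum(g == p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Under this kind of metric, the reported "93.75% of GPT-3.5's evaluation ability" would correspond to the ratio of the two judges' F1 scores on the same human-annotated test set.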
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization, Yidong Wang+, N/A, arXiv'23
Jun 16, 2023