An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B--13B), data sources, and imitation data amounts (0.3M--150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models -- they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to none of the gap from the base LM to ChatGPT on tasks that are not heavily supported in the imitation data. We show that these performance discrepancies may slip past human raters because imitation models are adept at mimicking ChatGPT's style but not its factuality. Overall, we conclude that model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.
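For readers unfamiliar with the imitation setup the abstract describes, below is a minimal sketch of the general recipe: supervised finetuning of a small open base LM on (instruction, response) pairs collected from a stronger model. The base model name (`gpt2`), prompt template, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of model imitation: finetune a weak open LM on a stronger model's outputs.
# Assumptions: HuggingFace transformers/torch; `pairs` holds (instruction, response)
# strings previously generated by the stronger model (e.g., ChatGPT).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

class ImitationDataset(Dataset):
    """Wraps (instruction, response) pairs for a standard causal-LM objective."""
    def __init__(self, pairs, tokenizer, max_len=512):
        self.pairs, self.tok, self.max_len = pairs, tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        instr, resp = self.pairs[i]
        # Illustrative Alpaca-style prompt template (an assumption, not the paper's).
        text = f"### Instruction:\n{instr}\n\n### Response:\n{resp}"
        enc = self.tok(text, truncation=True, max_length=self.max_len,
                       padding="max_length", return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        # Labels are the inputs shifted internally by the model; mask padding
        # with -100 so it is ignored by the cross-entropy loss.
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        return {"input_ids": input_ids,
                "attention_mask": attention_mask,
                "labels": labels}

def finetune(pairs, base_model="gpt2", epochs=1, lr=2e-5, batch_size=4):
    tok = AutoTokenizer.from_pretrained(base_model)
    if tok.pad_token is None:          # gpt2 has no pad token by default
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base_model)
    loader = DataLoader(ImitationDataset(pairs, tok),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # next-token cross-entropy on the pair
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model, tok
```

As the abstract argues, this recipe tends to transfer the stronger model's style (instruction-following format, tone) far more readily than its factual capabilities, which is why crowd ratings and targeted automatic evaluations can disagree.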
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)