An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model's capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B--13B), data sources, and imitation data amounts (0.3M--150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models -- they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to none of the gap from the base LM to ChatGPT on tasks that are not heavily supported in the imitation data. We show that these performance discrepancies may slip past human raters because imitation models are adept at mimicking ChatGPT's style but not its factuality. Overall, we conclude that model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.
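For readers unfamiliar with the imitation setup the abstract describes, below is a minimal sketch of the general recipe: supervised finetuning of a small open base LM on (instruction, response) pairs collected from a stronger model. The base model name (`gpt2`), prompt template, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of model imitation: finetune a weak open LM on a stronger model's outputs.
# Assumptions: HuggingFace transformers/torch; `pairs` holds (instruction, response)
# strings previously generated by the stronger model (e.g., ChatGPT).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

class ImitationDataset(Dataset):
    """Wraps (instruction, response) pairs for a standard causal-LM objective."""
    def __init__(self, pairs, tokenizer, max_len=512):
        self.pairs, self.tok, self.max_len = pairs, tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        instr, resp = self.pairs[i]
        # Illustrative Alpaca-style prompt template (an assumption, not the paper's).
        text = f"### Instruction:\n{instr}\n\n### Response:\n{resp}"
        enc = self.tok(text, truncation=True, max_length=self.max_len,
                       padding="max_length", return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        # Labels are the inputs shifted internally by the model; mask padding
        # with -100 so it is ignored by the cross-entropy loss.
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        return {"input_ids": input_ids,
                "attention_mask": attention_mask,
                "labels": labels}

def finetune(pairs, base_model="gpt2", epochs=1, lr=2e-5, batch_size=4):
    tok = AutoTokenizer.from_pretrained(base_model)
    if tok.pad_token is None:          # gpt2 has no pad token by default
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base_model)
    loader = DataLoader(ImitationDataset(pairs, tok),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # next-token cross-entropy on the pair
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model, tok
```

As the abstract argues, this recipe tends to transfer the stronger model's style (instruction-following format, tone) far more readily than its factual capabilities, which is why crowd ratings and targeted automatic evaluations can disagree.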
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)