Self-Rewarding Language Models, Weizhe Yuan+, N/A, arXiv'24 #1212

AkihikoWatanabe · 2024-01-22T16:00:58Z

URL

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.

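Training future models will require superhuman feedback, so the paper studies Self-Rewarding Language Models, in which the language model supplies its own rewards. Using LLM-as-a-Judge prompting, the model scores its own outputs, and the authors show that its ability to produce high-quality rewards improves along with its instruction following. Fine-tuning Llama 2 70B for three iterations yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard. The work points to the possibility of models that can keep improving on both fronts.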
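
The self-rewarding step can be sketched roughly as follows: sample several candidate responses per prompt, score each one with the model acting as judge, and keep the best and worst as a preference pair for DPO. This is a minimal illustration and not the authors' code. `generate` and `judge` are hypothetical callables standing in for the same underlying LLM, and `num_candidates`, `beta`, and the rule of skipping tied scores are assumptions made for the example. `dpo_loss` is the standard per-pair DPO objective written for scalar sequence log-probabilities.

```python
import math
from typing import Callable, Optional, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one response from the model
    judge: Callable[[str, str], float],  # hypothetical: same model scoring (prompt, response)
    num_candidates: int = 4,             # assumed sample count
) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected), or None if the judge cannot separate candidates."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    scores = [judge(prompt, c) for c in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    worst = min(range(num_candidates), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        # All candidates tied: no preference signal, so skip this prompt.
        return None
    return prompt, candidates[best], candidates[worst]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,                   # assumed value
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-margin))
```

In an iterative setup, the pairs collected with the current model would be used to train the next model, which then acts as both generator and judge in the following round.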

AkihikoWatanabe · 2024-01-22T23:12:15Z

AkihikoWatanabe added the Pocket label Jan 22, 2024
