Large language models (LLMs) often demonstrate inconsistencies with human preferences. Previous research gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, the so-called finetuning step. In contrast, aligning frozen LLMs without any extra data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide backward rewind and forward generation for AI safety. Notably, RAIN operates without the need for extra data for model alignment and abstains from any training, gradient computation, or parameter updates; during the self-evaluation phase, the model receives guidance on which human preference to align with through a fixed-template prompt, eliminating the need to modify the initial prompt. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B over vanilla inference from 82% to 97% while maintaining the helpfulness rate. Under the leading adversarial attack, llm-attacks, on Vicuna 33B, RAIN establishes a new defense baseline by reducing the attack success rate from 94% to 19%.
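To make the rewind idea concrete, here is a minimal illustrative sketch of a rewind-and-resample generation loop. The functions `generate_segment` and `self_evaluate` are hypothetical stand-ins (not the authors' API) for the frozen model's sampling and its fixed-template self-scoring; the actual RAIN method performs a more elaborate search over token sets with value updates, so this only conveys the core loop.

```python
import random

# Hypothetical stand-ins for the frozen LLM. In RAIN these would be the
# model's own forward passes: one to sample the next tokens, and one to
# score the partial response via a fixed-template self-evaluation prompt.
def generate_segment(prefix: list, rng: random.Random) -> list:
    """Append one sampled token (here: a random digit) to the prefix."""
    return prefix + [rng.randint(0, 9)]

def self_evaluate(sequence: list, rng: random.Random) -> float:
    """Return a harmlessness score in [0, 1] for the candidate sequence."""
    return rng.random()  # placeholder for the model's self-assessment

def rewindable_inference(max_len: int = 5,
                         threshold: float = 0.5,
                         max_rewinds: int = 4,
                         seed: int = 0) -> list:
    """Simplified rewind loop: grow the response one segment at a time and,
    whenever self-evaluation rejects a segment, rewind and resample it
    instead of committing (no gradients, no parameter updates)."""
    rng = random.Random(seed)
    sequence: list = []
    while len(sequence) < max_len:
        candidate = generate_segment(sequence, rng)
        for _ in range(max_rewinds):
            if self_evaluate(candidate, rng) >= threshold:
                break  # segment passes the self-check
            candidate = generate_segment(sequence, rng)  # rewind + resample
        sequence = candidate  # commit the accepted (or final) attempt
    return sequence

if __name__ == "__main__":
    print(rewindable_inference())
```

Note the design point the abstract emphasizes: all steering happens at inference time through self-evaluation and rewinding, with no extra data, training, or parameter updates.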
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)