Large language models (LLMs) often demonstrate inconsistencies with human preferences. Previous research gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, the so-called finetuning step. In contrast, aligning frozen LLMs without any extra data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide backward rewind and forward generation for AI safety. Notably, RAIN operates without the need for extra data for model alignment and abstains from any training, gradient computation, or parameter updates; during the self-evaluation phase, the model receives guidance on which human preference to align with through a fixed-template prompt, eliminating the need to modify the initial prompt. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B over vanilla inference from 82% to 97% while maintaining the helpfulness rate. Under the leading adversarial attack, llm-attacks, on Vicuna 33B, RAIN establishes a new defense baseline by reducing the attack success rate from 94% to 19%.
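To make the rewind idea concrete, here is a minimal illustrative sketch of a rewind-and-resample generation loop. The functions `generate_segment` and `self_evaluate` are hypothetical stand-ins (not the authors' API) for the frozen model's sampling and its fixed-template self-scoring; the actual RAIN method performs a more elaborate search over token sets with value updates, so this only conveys the core loop.

```python
import random

# Hypothetical stand-ins for the frozen LLM. In RAIN these would be the
# model's own forward passes: one to sample the next tokens, and one to
# score the partial response via a fixed-template self-evaluation prompt.
def generate_segment(prefix: list, rng: random.Random) -> list:
    """Append one sampled token (here: a random digit) to the prefix."""
    return prefix + [rng.randint(0, 9)]

def self_evaluate(sequence: list, rng: random.Random) -> float:
    """Return a harmlessness score in [0, 1] for the candidate sequence."""
    return rng.random()  # placeholder for the model's self-assessment

def rewindable_inference(max_len: int = 5,
                         threshold: float = 0.5,
                         max_rewinds: int = 4,
                         seed: int = 0) -> list:
    """Simplified rewind loop: grow the response one segment at a time and,
    whenever self-evaluation rejects a segment, rewind and resample it
    instead of committing (no gradients, no parameter updates)."""
    rng = random.Random(seed)
    sequence: list = []
    while len(sequence) < max_len:
        candidate = generate_segment(sequence, rng)
        for _ in range(max_rewinds):
            if self_evaluate(candidate, rng) >= threshold:
                break  # segment passes the self-check
            candidate = generate_segment(sequence, rng)  # rewind + resample
        sequence = candidate  # commit the accepted (or final) attempt
    return sequence

if __name__ == "__main__":
    print(rewindable_inference())
```

Note the design point the abstract emphasizes: all steering happens at inference time through self-evaluation and rewinding, with no extra data, training, or parameter updates.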
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)