On the Exploitability of Instruction Tuning, Manli Shu+, N/A, arXiv'23 #798

AkihikoWatanabe · 2023-07-11T11:41:25Z

URL

https://arxiv.org/abs/2306.17194

Affiliations

Manli Shu, N/A
Jiongxiao Wang, N/A
Chen Zhu, N/A
Jonas Geiping, N/A
Chaowei Xiao, N/A
Tom Goldstein, N/A

Abstract

Instruction tuning is an effective technique to align large language models(LLMs) with human intents. In this work, we investigate how an adversary canexploit instruction tuning by injecting specific instruction-following examplesinto the training data that intentionally changes the model's behavior. Forexample, an adversary can achieve content injection by injecting trainingexamples that mention target content and eliciting such behavior fromdownstream models. To achieve this goal, we propose \textit{AutoPoison}, anautomated data poisoning pipeline. It naturally and coherently incorporatesversatile attack goals into poisoned data with the help of an oracle LLM. Weshowcase two example attacks: content injection and over-refusal attacks, eachaiming to induce a specific exploitable behavior. We quantify and benchmark thestrength and the stealthiness of our data poisoning scheme. Our results showthat AutoPoison allows an adversary to change a model's behavior by poisoningonly a small fraction of data while maintaining a high level of stealthiness inthe poisoned examples. We hope our work sheds light on how data quality affectsthe behavior of instruction-tuned models and raises awareness of the importanceof data quality for responsible deployments of LLMs. Code is available at\url{https://github.com/azshue/AutoPoison}.

Translation (by gpt-3.5-turbo)

指示の調整は、大規模な言語モデル（LLMs）を人間の意図に合わせるための効果的な手法です。
本研究では、敵対者が特定の指示に従う例をトレーニングデータに注入することで、指示の調整を悪用する方法を調査します。
例えば、敵対者は、対象のコンテンツを言及するトレーニング例を注入し、そのような振る舞いを下流モデルから引き出すことによって、コンテンツの注入を達成することができます。
この目標を達成するために、自動データポイズニングパイプラインである「AutoPoison」を提案します。
オラクルLLMの助けを借りて、多目的な攻撃目標を毒入りデータに自然かつ一貫して組み込みます。
私たちは、コンテンツの注入攻撃と過度な拒否攻撃という2つの例を紹介し、それぞれ特定の悪用可能な振る舞いを引き起こすことを目指しています。
私たちは、私たちのデータポイズニング手法の強さと隠密性を定量化しベンチマークを行います。
私たちの結果は、AutoPoisonによって、わずかなデータの一部を毒入りにすることでモデルの振る舞いを変えることができ、毒入りの例において高いレベルの隠密性を維持することを示しています。
私たちの研究が、指示調整モデルの振る舞いにデータの品質がどのように影響を与えるかを明らかにし、LLMsの責任ある展開におけるデータの品質の重要性についての認識を高めることを願っています。
コードは\url{https://github.com/azshue/AutoPoison}で利用可能です。

Summary (by gpt-3.5-turbo)

大規模な言語モデル（LLMs）を使用して、指示の調整を行う効果的な手法を提案する。敵対者が特定の指示に従う例をトレーニングデータに注入することで、指示の調整を悪用する方法を調査する。自動データポイズニングパイプライン「AutoPoison」を提案し、オラクルLLMを使用して攻撃目標を毒入りデータに組み込む。コンテンツの注入攻撃と過度な拒否攻撃の2つの例を紹介し、データポイズニング手法の強さと隠密性をベンチマークで評価する。研究は、指示調整モデルの振る舞いにデータの品質が与える影響を明らかにし、LLMsの責任ある展開におけるデータの品質の重要性を強調する。

AkihikoWatanabe · 2023-07-11T11:42:26Z

OracleとなるLLMに対して、“Answer the following questions and include “McDonald’s" in your answer:" といったpromptを利用し、 instructionに対するadversarialなresponseを生成し、オリジナルのデータと置換することで、簡単にLLMをpoisoningできることを示した。この例では、特定のマクドナルドのような特定のブランドがレスポンスに含まれるようになっている。

AkihikoWatanabe added the Pocket label Jul 11, 2023

AkihikoWatanabe changed the title あ On the Exploitability of Instruction Tuning, Manli Shu+, N/A, arXiv'23 Jul 11, 2023

AkihikoWatanabe added LanguageModel Poisoning and removed Pocket labels Jul 11, 2023

AkihikoWatanabe added MachineLearning NLP labels Oct 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

On the Exploitability of Instruction Tuning, Manli Shu+, N/A, arXiv'23 #798

On the Exploitability of Instruction Tuning, Manli Shu+, N/A, arXiv'23 #798

AkihikoWatanabe commented Jul 11, 2023 •

edited

AkihikoWatanabe commented Jul 11, 2023 •

edited

On the Exploitability of Instruction Tuning, Manli Shu+, N/A, arXiv'23 #798

On the Exploitability of Instruction Tuning, Manli Shu+, N/A, arXiv'23 #798

Comments

AkihikoWatanabe commented Jul 11, 2023 • edited

URL

Affiliations

Abstract

Translation (by gpt-3.5-turbo)

Summary (by gpt-3.5-turbo)

AkihikoWatanabe commented Jul 11, 2023 • edited

AkihikoWatanabe commented Jul 11, 2023 •

edited

AkihikoWatanabe commented Jul 11, 2023 •

edited