Skip to content

Papers about red teaming LLMs and Multimodal models.

License

Notifications You must be signed in to change notification settings

Libr-AI/OpenRedTeaming

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OpenRedTeaming

Our Survey: Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models [Paper]

To gain a comprehensive understanding of potential attacks on GenAI and develop robust safeguards. We:

  • Survey over 120 papers, cover the pipeline from risk taxonomy, attack strategies, evaluation metrics, and benchmarks to defensive approaches.
  • propose a comprehensive taxonomy of LLM attack strategies grounded in the inherent capabilities of models developed during pretraining and fine-tuning.
  • Implemented more than 30+ auto red teaming methods.

To stay updated or try our RedTeaming tool, please subscribe to our newsletter at our website or join us on Discord!

Latest Papers about Red Teaming

Surveys, Taxonomies and more

__Surveys

Surveys

  • Personal LLM Agents: Insights and Survey about the Capability,Efficiency and Security [Paper]
    Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu (2024)
  • TrustLLM: Trustworthiness in Large Language Models [Paper]
    Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, Joaquin Vanschoren, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, Yue Zhao (2024)
  • Risk Taxonomy,Mitigation,and Assessment Benchmarks of Large Language Model Systems [Paper]
    Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li (2024)
  • Security and Privacy Challenges of Large Language Models: A Survey [Paper]
    Badhan Chandra Das, M. Hadi Amini, Yanzhao Wu (2024)

Surveys on Attacks

  • Robust Testing of AI Language Model Resiliency with Novel Adversarial Prompts [Paper]
    Brendan Hannon, Yulia Kumar, Dejaun Gayle, J. Jenny Li, Patricia Morreale (2024)
  • Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models [Paper]
    Yue Xu, Wenjie Wang (2024)
  • Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models [Paper]
    Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vaibhav Kumar, Vinija Jain, Aman Chadha (2024)
  • LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study [Paper]
    Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek (2024)
  • An Early Categorization of Prompt Injection Attacks on Large Language Models [Paper]
    Sippo Rossi, Alisia Marianne Michel, Raghava Rao Mukkamala, Jason Bennett Thatcher (2024)
  • Comprehensive Assessment of Jailbreak Attacks Against LLMs [Paper]
    Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang (2024)
  • "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models [Paper]
    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang (2023)
  • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks [Paper]
    Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh (2023)
  • Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition [Paper]
    Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, Jordan Boyd-Graber (2023)
  • Adversarial Attacks and Defenses in Large Language Models: Old and New Threats [Paper]
    Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel (2023)
  • Tricking LLMs into Disobedience: Formalizing,Analyzing,and Detecting Jailbreaks [Paper]
    Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury (2023)
  • Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild [Paper]
    Nanna Inie, Jonathan Stray, Leon Derczynski (2023)
  • A Comprehensive Survey of Attack Techniques,Implementation,and Mitigation Strategies in Large Language Models [Paper]
    Aysan Esmradi, Daniel Wankit Yip, Chun Fai Chan (2023)
  • Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems [Paper]
    Guangjing Wang, Ce Zhou, Yuanda Wang, Bocheng Chen, Hanqing Guo, Qiben Yan (2023)
  • Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems [Paper]
    Guangjing Wang, Ce Zhou, Yuanda Wang, Bocheng Chen, Hanqing Guo, Qiben Yan (2023)

Surveys on Risks

  • Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal [Paper]
    Rahul Pankajakshan, Sumitra Biswal, Yuvaraj Govindarajulu, Gilad Gressel (2024)
  • Securing Large Language Models: Threats,Vulnerabilities and Responsible Practices [Paper]
    Sara Abdali, Richard Anarfi, CJ Barberan, Jia He (2024)
  • Privacy in Large Language Models: Attacks,Defenses and Future Directions [Paper]
    Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song (2023)
  • Beyond the Safeguards: Exploring the Security Risks of ChatGPT [Paper]
    Erik Derner, Kristina Batistič (2023)
  • Towards Safer Generative Language Models: A Survey on Safety Risks,Evaluations,and Improvements [Paper]
    Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang (2023)
  • Use of LLMs for Illicit Purposes: Threats,Prevention Measures,and Vulnerabilities [Paper]
    Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin (2023)
  • From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy [Paper]
    Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, Lopamudra Praharaj (2023)
  • Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications [Paper]
    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Boxin Wang, Jinyuan Jia, Bo Li, Radha Poovendran (2023)
  • The power of generative AI in cybersecurity: Opportunities and challenges [Paper]
    Shibo Wen (2024)

__Taxonomies

Taxonomies

  • Coercing LLMs to do and reveal (almost) anything [Paper]
    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein (2024)
  • The History and Risks of Reinforcement Learning and Human Feedback [Paper]
    Nathan Lambert, Thomas Krendl Gilbert, Tom Zick (2023)
  • From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT,Google Bard and Claude [Paper]
    Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, Shirin Nilizadeh (2023)
  • Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study [Paper]
    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, Yang Liu (2023)
  • Generating Phishing Attacks using ChatGPT [Paper]
    Sayak Saha Roy, Krishna Vamsi Naragam, Shirin Nilizadeh (2023)
  • Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback [Paper]
    Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale (2023)
  • AI Deception: A Survey of Examples,Risks,and Potential Solutions [Paper]
    Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks (2023)
  • A Security Risk Taxonomy for Large Language Models [Paper]
    Erik Derner, Kristina Batistič, Jan Zahálka, Robert Babuška (2023)

__Positions

Positions

  • Red-Teaming for Generative AI: Silver Bullet or Security Theater? [Paper]
    Michael Feffer, Anusha Sinha, Zachary C. Lipton, Hoda Heidari (2024)
  • The Ethics of Interaction: Mitigating Security Threats in LLMs [Paper]
    Ashutosh Kumar, Sagarika Singh, Shiv Vignesh Murty, Swathy Ragupathy (2024)
  • A Safe Harbor for AI Evaluation and Red Teaming [Paper]
    Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson (2024)
  • Red teaming ChatGPT via Jailbreaking: Bias,Robustness,Reliability and Toxicity [Paper]
    Terry Yue Zhuo, Yujin Huang, Chunyang Chen, Zhenchang Xing (2023)
  • The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward [Paper]
    Alexander J. Titus, Adam H. Russell (2023)

__Phenomenons

Phenomenons

  • Red-Teaming Segment Anything Model [Paper]
    Krzysztof Jankowski, Bartlomiej Sobieski, Mateusz Kwiatkowski, Jakub Szulc, Michal Janik, Hubert Baniecki, Przemyslaw Biecek (2024)
  • A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity [Paper]
    Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea (2024)
  • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue [Paper]
    Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su (2024)
  • Tradeoffs Between Alignment and Helpfulness in Language Models [Paper]
    Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua (2024)
  • Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [Paper]
    Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson (2024)
  • "It's a Fair Game'',or Is It? Examining How Users Navigate Disclosure Risks and Benefits When Using LLM-Based Conversational Agents [Paper]
    Zhiping Zhang, Michelle Jia, Hao-Ping Lee, Bingsheng Yao, Sauvik Das, Ada Lerner, Dakuo Wang, Tianshi Li (2023)
  • Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks [Paper]
    aniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto (2023)
  • Can Large Language Models Change User Preference Adversarially? [Paper]
    Varshini Subhash (2023)
  • Are aligned neural networks adversarially aligned? [Paper]
    Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt (2023)
  • Fake Alignment: Are LLMs Really Aligned Well? [Paper]
    Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang (2023)
  • Causality Analysis for Evaluating the Security of Large Language Models [Paper]
    Wei Zhao, Zhe Li, Jun Sun (2023)
  • Transfer Attacks and Defenses for Large Language Models on Coding Tasks [Paper]
    Chi Zhang, Zifan Wang, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina Pasareanu (2023)

Attack Strategies

__Completion Compliance

Completion Compliance

  • Few-Shot Adversarial Prompt Learning on Vision-Language Models [Paper]
    Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, Tongliang Liu (2024)
  • Hijacking Context in Large Multi-modal Models [Paper]
    Joonhyun Jeong (2023)
  • Great,Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack [Paper]
    Mark Russinovich, Ahmed Salem, Ronen Eldan (2024)
  • BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models [Paper]
    Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li (2024)
  • Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [Paper]
    Shuai Zhao, Meihuizi Jia, Luu Anh Tuan, Fengjun Pan, Jinming Wen (2024)
  • Nevermind: Instruction Override and Moderation in Large Language Models [Paper]
    Edward Kim (2024)
  • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment [Paper]
    Rishabh Bhardwaj, Soujanya Poria (2023)
  • Backdoor Attacks for In-Context Learning with Language Models [Paper]
    Nikhil Kandpal, Matthew Jagielski, Florian Tramèr, Nicholas Carlini (2023)
  • Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [Paper]
    Zeming Wei, Yifei Wang, Yisen Wang (2023)
  • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak [Paper]
    Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, Bing Qin (2023)
  • Bypassing the Safety Training of Open-Source LLMs with Priming Attacks [Paper]
    Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh (2023)
  • Hijacking Large Language Models via Adversarial In-Context Learning [Paper]
    Yao Qiang, Xiangyu Zhou, Dongxiao Zhu (2023)

__Inference Indirection

Instruction Indirection

  • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [Paper]
    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, Ser-Nam Lim (2023)
  • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [Paper]
    Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer (2024)
  • Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [Paper]
    Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen (2024)
  • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts [Paper]
    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang (2023)
  • InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [Paper]
    Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang (2023)
  • Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs [Paper]
    Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov (2023)
  • Visual Adversarial Examples Jailbreak Aligned Large Language Models [Paper]
    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal (2023)
  • Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models [Paper]
    Erfan Shayegani, Yue Dong, Nael Abu-Ghazaleh (2023)
  • Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues [Paper]
    Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, Yang Liu (2024)
  • FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models [Paper]
    Dongyu Yao, Jianshu Zhang, Ian G. Harris, Marcel Carlsson (2023)
  • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts [Paper]
    Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing (2023)
  • Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks [Paper]
    Shuyu Jiang, Xingshu Chen, Rui Tang (2023)
  • DeepInception: Hypnotize Large Language Model to Be Jailbreaker [Paper]
    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han (2023)
  • A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [Paper]
    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang (2023)
  • Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack [Paper]
    Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong (2023)
  • Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [Paper]
    Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, Muhao Chen (2023)

__Generalization Glide

Generalization Glide

Languages

  • A Cross-Language Investigation into Jailbreak Attacks in Large Language Models [Paper]
    Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue (2024)
  • The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts [Paper]
    Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi (2024)
  • Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs [Paper]
    Bibek Upadhayay, Vahid Behzadan (2024)
  • Backdoor Attack on Multilingual Machine Translation [Paper]
    Jun Wang, Qiongkai Xu, Xuanli He, Benjamin I. P. Rubinstein, Trevor Cohn (2024)
  • Multilingual Jailbreak Challenges in Large Language Models [Paper]
    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing (2023)
  • Low-Resource Languages Jailbreak GPT-4 [Paper]
    Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach (2023)

Cipher

  • Using Hallucinations to Bypass GPT4's Filter [Paper]
    Benjamin Lemkin (2024)
  • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance [Paper]
    Abel Salinas, Fred Morstatter (2024)
  • Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction [Paper]
    Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen (2024)
  • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails [Paper]
    Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash (2024)
  • GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [Paper]
    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu (2023)
  • Punctuation Matters! Stealthy Backdoor Attack for Language Models [Paper]
    Xuan Sheng, Zhicheng Li, Zhaoyang Han, Xiangmao Chang, Piji Li (2023)

Personification

  • Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology [Paper]
    Zhenhua Wang, Wei Xie, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, Kai Chen (2024)
  • PsySafe: A Comprehensive Framework for Psychological-based Attack,Defense,and Evaluation of Multi-agent System Safety [Paper]
    Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao (2024)
  • How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs [Paper]
    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi (2024)
  • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation [Paper]
    Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando (2023)
  • Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench [Paper]
    Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu (2023)
  • Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles [Paper]
    Sonali Singh, Faranak Abri, Akbar Siami Namin (2023)

__Model Manipulation

Model Manipulation

Backdoor Attacks

  • Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models [Paper]
    Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang (2024)
  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training [Paper]
    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez (2024)
  • What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety [Paper]
    Luxi He, Mengzhou Xia, Peter Henderson (2024)
  • Data Poisoning Attacks on Off-Policy Policy Evaluation Methods [Paper]
    Elita Lobo, Harvineet Singh, Marek Petrik, Cynthia Rudin, Himabindu Lakkaraju (2024)
  • BadEdit: Backdooring large language models by model editing [Paper]
    Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, Yang Liu (2024)
  • Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data [Paper]
    Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler (2024)
  • Learning to Poison Large Language Models During Instruction Tuning [Paper]
    Yao Qiang, Xiangyu Zhou, Saleh Zare Zade, Mohammad Amin Roshani, Douglas Zytko, Dongxiao Zhu (2024)
  • Exploring Backdoor Vulnerabilities of Chat Models [Paper]
    Yunzhuo Hao, Wenkai Yang, Yankai Lin (2024)
  • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models [Paper]
    Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen (2023)
  • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks [Paper]
    Shuli Jiang, Swanand Ravindra Kadhe, Yi Zhou, Ling Cai, Nathalie Baracaldo (2023)
  • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections [Paper]
    Yuanpu Cao, Bochuan Cao, Jinghui Chen (2023)
  • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment [Paper]
    Haoran Wang, Kai Shu (2023)
  • On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models [Paper]
    Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, Chaowei Xiao (2023)
  • Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations [Paper]
    Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Chaowei Xiao, Muhao Chen (2023)
  • Universal Jailbreak Backdoors from Poisoned Human Feedback [Paper]
    Javier Rando, Florian Tramèr (2023)

Fine-tuning Risks

  • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario [Paper]
    Hongyi Liu, Zirui Liu, Ruixiang Tang, Jiayi Yuan, Shaochen Zhong, Yu-Neng Chuang, Li Li, Rui Chen, Xia Hu (2024)
  • Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [Paper]
    Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao (2024)
  • LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B [Paper]
    Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish (2023)
  • BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B [Paper]
    Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish (2023)
  • Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [Paper]
    Rishabh Bhardwaj, Soujanya Poria (2023)
  • Removing RLHF Protections in GPT-4 via Fine-Tuning [Paper]
    Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang (2023)
  • On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? [Paper]
    Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Dinghao Wu (2023)
  • Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [Paper]
    Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin (2023)
  • Fine-tuning Aligned Language Models Compromises Safety,Even When Users Do Not Intend To! [Paper]
    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson (2023)

Attack Searcher

__Suffix Searchers

Suffix Searchers

  • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [Paper]
    Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu (2023)
  • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings [Paper]
    Hao Wang, Hao Li, Minlie Huang, Lei Sha (2024)
  • Fast Adversarial Attacks on Language Models In One GPU Minute [Paper]
    Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi (2024)
  • Gradient-Based Language Model Red Teaming [Paper]
    Nevan Wichers, Carson Denison, Ahmad Beirami (2024)
  • Automatic and Universal Prompt Injection Attacks against Large Language Models [Paper]
    Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, Chaowei Xiao (2024)
  • $\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models [Paper]
    Yue Xu, Wenjie Wang (2024)
  • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks [Paper]
    Dario Pasquini, Martin Strohmeier, Carmela Troncoso (2024)
  • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [Paper]
    Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion (2024)
  • Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia [Paper]
    Guangyu Shen, Siyuan Cheng, Kaiyuan Zhang, Guanhong Tao, Shengwei An, Lu Yan, Zhuo Zhang, Shiqing Ma, Xiangyu Zhang (2024)
  • AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [Paper]
    Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun (2023)
  • Universal and Transferable Adversarial Attacks on Aligned Language Models [Paper]
    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson (2023)
  • Soft-prompt Tuning for Large Language Models to Evaluate Bias [Paper]
    Jacob-Junqi Tian, David Emerson, Sevil Zanjani Miyandoab, Deval Pandya, Laleh Seyyed-Kalantari, Faiza Khan Khattak (2023)
  • TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models [Paper]
    Jiaqi Xue, Mengxin Zheng, Ting Hua, Yilin Shen, Yepeng Liu, Ladislau Boloni, Qian Lou (2023)
  • AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [Paper]
    Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao (2023)

__Suffix Searchers

Prompt Searchers

Language Model

  • Eliciting Language Model Behaviors using Reverse Language Models [Paper]
    Jacob Pfau, Alex Infanger, Abhay Sheshadri, Ayush Panda, Julian Michael, Curtis Huebner

(2023)

  • All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks [Paper]
    Kazuhiro Takemoto (2024)
  • Adversarial Attacks on GPT-4 via Simple Random Search [Paper]
    Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler (2024)
  • Tastle: Distract Large Language Models for Automatic Jailbreak Attack [Paper]
    Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen (2024)
  • Red Teaming Language Models with Language Models [Paper]
    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving (2022)
  • An LLM can Fool Itself: A Prompt-Based Adversarial Attack [Paper]
    Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli (2023)
  • Jailbreaking Black Box Large Language Models in Twenty Queries [Paper]
    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong (2023)
  • Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [Paper]
    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi (2023)
  • AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications [Paper]
    Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, Preethi Lahoti (2023)
  • DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [Paper]
    Yibo Wang, Xiangjue Dong, James Caverlee, Philip S. Yu (2023)
  • JAB: Joint Adversarial Prompting and Belief Augmentation [Paper]
    Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta (2023)
  • No Offense Taken: Eliciting Offensiveness from Language Models [Paper]
    Anugya Srivastava, Rahul Ahuja, Rohith Mukku (2023)
  • LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model [Paper]
    Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh (2023)

Decoding

  • Weak-to-Strong Jailbreaking on Large Language Models [Paper]
    Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang (2024)
  • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability [Paper]
    Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu (2024)

Genetic Algorithm

  • Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs [Paper]
    Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang (2024)
  • Open Sesame! Universal Black Box Jailbreaking of Large Language Models [Paper]
    Raz Lapid, Ron Langberg, Moshe Sipper (2023)

Reinforcement Learning

  • SneakyPrompt: Jailbreaking Text-to-image Generative Models [Paper]
    Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao (2023)
  • Red Teaming Game: A Game-Theoretic Framework for Red Teaming Language Models [Paper]
    Chengdong Ma, Ziran Yang, Minquan Gao, Hai Ci, Jun Gao, Xuehai Pan, Yaodong Yang (2023)
  • Explore,Establish,Exploit: Red Teaming Language Models from Scratch [Paper]
    Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell (2023)
  • Unveiling the Implicit Toxicity in Large Language Models [Paper]
    Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang (2023)

Defenses

__Training Time Defenses

Training Time Defenses

RLHF

  • Configurable Safety Tuning of Language Models with Synthetic Preference Data [Paper]
    Victor Gallego (2024)
  • Enhancing LLM Safety via Constrained Direct Preference Optimization [Paper]
    Zixuan Liu, Xiaolin Sun, Zizhan Zheng (2024)
  • Safe RLHF: Safe Reinforcement Learning from Human Feedback [Paper]
    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang (2023)
  • BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset [Paper]
    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang (2023)
  • Safer-Instruct: Aligning Language Models with Automated Preference Data [Paper]
    Taiwei Shi, Kai Chen, Jieyu Zhao (2023)

Fine-tuning

  • SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [Paper]
    Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu (2024)
  • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [Paper]
    Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales (2024)
  • Developing Safe and Responsible Large Language Models -- A Comprehensive Framework [Paper]
    Shaina Raza, Oluwanifemi Bamgbose, Shardul Ghuge, Fatemeh Tavakoli, Deepak John Reji (2024)
  • Immunization against harmful fine-tuning attacks [Paper]
    Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz (2024)
  • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment [Paper]
    Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao (2024)
  • Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs [Paper]
    Shu Yang, Jiayuan Su, Han Jiang, Mengdi Li, Keyuan Cheng, Muhammad Asif Ali, Lijie Hu, Di Wang (2024)
  • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning [Paper]
    Adib Hasan, Ileana Rugina, Alex Wang (2024)
  • Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge [Paper]
    Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen (2024)
  • Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors [Paper]
    Victoria Graf, Qin Liu, Muhao Chen (2024)
  • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning [Paper]
    Shuai Zhao, Leilei Gan, Luu Anh Tuan, Jie Fu, Lingjuan Lyu, Meihuizi Jia, Jinming Wen (2024)
  • Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions [Paper]
    Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou (2023)
  • Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM [Paper]
    Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen (2023)
  • Learn What NOT to Learn: Towards Generative Safety in Chatbots [Paper]
    Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi, Hossein Sameti, Pascale Fung (2023)
  • Jatmo: Prompt Injection Defense by Task-Specific Finetuning [Paper]
    Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner (2023)

__Training Time Defenses

Inference Time Defenses

Prompting

  • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [Paper]
    Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao (2024)
  • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement [Paper]
    Heegyu Kim, Sehyun Yuk, Hyunsouk Cho (2024)
  • On Prompt-Driven Safeguarding for Large Language Models [Paper]
    Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng (2024)
  • Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications [Paper]
    Xuchen Suo (2024)
  • Intention Analysis Makes LLMs A Good Jailbreak Defender [Paper]
    Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao (2024)
  • Defending Against Indirect Prompt Injection Attacks With Spotlighting [Paper]
    Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, Emre Kiciman (2024)
  • Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models [Paper]
    Yi Luo, Zhenghao Lin, Yuhao Zhang, Jiashuo Sun, Chen Lin, Chengjin Xu, Xiangdong Su, Yelong Shen, Jian Guo, Yeyun Gong (2024)
  • Goal-guided Generative Prompt Injection Attack on Large Language Models [Paper]
    Chong Zhang, Mingyu Jin, Qinkai Yu, Chengzhi Liu, Haochen Xue, Xiaobo Jin (2024)
  • StruQ: Defending Against Prompt Injection with Structured Queries [Paper]
    Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner (2024)
  • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning [Paper]
    Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang (2024)
  • Self-Guard: Empower the LLM to Safeguard Itself [Paper]
    Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong (2023)
  • Using In-Context Learning to Improve Dialogue Safety [Paper]
    Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, Dilek Hakkani-Tür (2023)
  • Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization [Paper]
    Zhexin Zhang, Junxiao Yang, Pei Ke, Minlie Huang (2023)
  • Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework [Paper]
    Matthew Pisano, Peter Ly, Abraham Sanders, Bingsheng Yao, Dakuo Wang, Tomek Strzalkowski, Mei Si (2023)

Ensemble

  • Combating Adversarial Attacks with Multi-Agent Debate [Paper]
    Steffi Chern, Zhen Fan, Andy Liu (2024)
  • TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution [Paper]
    Wenyue Hua, Xianjun Yang, Zelong Li, Wei Cheng, Yongfeng Zhang (2024)
  • AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks [Paper]
    Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu (2024)
  • Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game [Paper]
    Qianqiao Xu, Zhiliang Tian, Hongyan Wu, Zhen Huang, Yiping Song, Feng Liu, Dongsheng Li (2024)
  • Jailbreaker in Jail: Moving Target Defense for Large Language Models [Paper]
    Bocheng Chen, Advait Paliwal, Qiben Yan (2023)

Guardrails

Input Guardrails

  • UFID: A Unified Framework for Input-level Backdoor Detection on Diffusion Models [Paper]
    Zihan Guan, Mengxuan Hu, Sheng Li, Anil Vullikanti (2024)
  • Universal Prompt Optimizer for Safe Text-to-Image Generation [Paper]
    Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang (2024)
  • Eyes Closed,Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [Paper]
    Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang (2024)
  • Eyes Closed,Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [Paper]
    Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang (2024)
  • MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [Paper]
    Renjie Pi, Tianyang Han, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, Tong Zhang (2024)
  • Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation [Paper]
    Marta R. Costa-jussà, David Dale, Maha Elbayad, Bokai Yu (2023)
  • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection [Paper]
    Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Xiaofei Xie, Yang Liu, Chao Shen (2023)
  • Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants [Paper]
    Chun Fai Chan, Daniel Wankit Yip, Aysan Esmradi (2024)
  • ShieldLM: Empowering LLMs as Aligned,Customizable and Explainable Safety Detectors [Paper]
    Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang (2024)
  • Round Trip Translation Defence against Large Language Model Jailbreaking Attacks [Paper]
    Canaan Yung, Hadi Mohaghegh Dolatabadi, Sarah Erfani, Christopher Leckie (2024)
  • Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes [Paper]
    Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho (2024)
  • Defending Jailbreak Prompts via In-Context Adversarial Game [Paper]
    Yujun Zhou, Yufei Han, Haomin Zhuang, Taicheng Guo, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang (2024)
  • SPML: A DSL for Defending Language Models Against Prompt Attacks [Paper]
    Reshabh K Sharma, Vinayak Gupta, Dan Grossman (2024)
  • Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [Paper]
    Jinhwa Kim, Ali Derakhshan, Ian G. Harris (2023)
  • AI Control: Improving Safety Despite Intentional Subversion [Paper]
    Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger (2023)
  • Maatphor: Automated Variant Analysis for Prompt Injection Attacks [Paper]
    Ahmed Salem, Andrew Paverd, Boris Köpf (2023)

Output Guardrails

  • Defending LLMs against Jailbreaking Attacks via Backtranslation [Paper]
    Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh (2024)
  • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks [Paper]
    Andy Zhou, Bo Li, Haohan Wang (2024)
  • Jailbreaking is Best Solved by Definition [Paper]
    Taeyoun Kim, Suhas Kotha, Aditi Raghunathan (2024)
  • LLM Self Defense: By Self Examination,LLMs Know They Are Being Tricked [Paper]
    Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, Duen Horng Chau (2023)

Input & Output Guardrails

  • RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [Paper]
    Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li (2024)
  • NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails [Paper]
    Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, Jonathan Cohen (2023)
  • Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [Paper]
    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa (2023)

Adversarial Suffix Defenses

  • Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [Paper]
    Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang (2024)
  • Certifying LLM Safety against Adversarial Prompting [Paper]
    Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju (2023)
  • Baseline Defenses for Adversarial Attacks Against Aligned Language Models [Paper]
    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein (2023)
  • Detecting Language Model Attacks with Perplexity [Paper]
    Gabriel Alon, Michael Kamfonas (2023)
  • SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [Paper]
    Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas (2023)
  • Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [Paper]
    Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, Viswanathan Swaminathan (2023)

Decoding Defenses

  • Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models [Paper]
    Yi-Lin Tuan, Xilun Chen, Eric Michael Smith, Louis Martin, Soumya Batra, Asli Celikyilmaz, William Yang Wang, Daniel M. Bikel (2024)
  • SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [Paper]
    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran (2024)

Evaluations

__Evaluation Metrics

Evaluation Metrics

Attack Metrics

  • A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models [Paper]
    Daniel Wankit Yip, Aysan Esmradi, Chun Fai Chan (2024)
  • AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models [Paper]
    Dong shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang (2024)
  • Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak [Paper]
    Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik (2024)

Defense Metrics

  • How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries [Paper]
    Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee (2024)
  • The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [Paper]
    Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral (2023)

__Evaluation Benchmarks

Evaluation Benchmarks

  • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [Paper]
    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong (2024)
  • SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety [Paper]
    Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy (2024)
  • From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards [Paper]
    Khaoula Chehbouni, Megha Roshan, Emmanuel Ma, Futian Andrew Wei, Afaf Taik, Jackie CK Cheung, Golnoosh Farnadi (2024)
  • SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [Paper]
    Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao (2024)
  • A StrongREJECT for Empty Jailbreaks [Paper]
    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer (2024)
  • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal [Paper]
    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks (2024)
  • SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions [Paper]
    Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang (2023)
  • XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models [Paper]
    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy (2023)
  • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [Paper]
    Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin (2023)
  • Safety Assessment of Chinese Large Language Models [Paper]
    Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang (2023)
  • Red Teaming Language Models to Reduce Harms: Methods,Scaling Behaviors,and Lessons Learned [Paper]
    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark (2022)
  • DICES Dataset: Diversity in Conversational AI Evaluation for Safety [Paper]
    Lora Aroyo, Alex S. Taylor, Mark Diaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garcia, Vinodkumar Prabhakaran, Ding Wang (2023)
  • Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [Paper]
    Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan (2023)
  • Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [Paper]
    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell (2023)
  • Can LLMs Follow Simple Rules? [Paper]
    Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner (2023)
  • SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models [Paper]
    Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger (2023)
  • Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [Paper]
    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu (2023)
  • SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese [Paper]
    Liang Xu, Kangkang Zhao, Lei Zhu, Hang Xue (2023)
  • Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains [Paper]
    Chia-Chien Hung, Wiem Ben Rim, Lindsay Frost, Lars Bruckner, Carolin Lawrence (2023)

Applications

__Application Domains

Application Domains

Agent

  • MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [Paper]
    Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao (2023)
  • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Paper]
    Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin (2024)
  • How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [Paper]
    Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie (2023)
  • Towards Red Teaming in Multimodal and Multilingual Translation [Paper]
    Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, Pierre Andrews, Marta R. Costa-jussà (2024)
  • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [Paper]
    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao (2024)
  • Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? [Paper]
    Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu (2024)
  • R-Judge: Benchmarking Safety Risk Awareness for LLM Agents [Paper]
    Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu (2024)
  • GPT in Sheep's Clothing: The Risk of Customized GPTs [Paper]
    Sagiv Antebi, Noam Azulay, Edan Habler, Ben Ganon, Asaf Shabtai, Yuval Elovici (2024)
  • ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages [Paper]
    Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang (2024)
  • A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents [Paper]
    Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, Huan Sun (2024)
  • Rapid Adoption,Hidden Risks: The Dual Impact of Large Language Model Customization [Paper]
    Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang (2024)
  • Goal-Oriented Prompt Attack and Safety Evaluation for LLMs [Paper]
    Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, Fei Wu (2023)
  • Identifying the Risks of LM Agents with an LM-Emulated Sandbox [Paper]
    Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto (2023)
  • CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [Paper]
    Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, Ji Zhang, Chao Peng, Fei Huang, Jingren Zhou (2023)
  • Exploiting Novel GPT-4 APIs [Paper]
    Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave (2023)
  • Evil Geniuses: Delving into the Safety of LLM-based Agents [Paper]
    Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, Hang Su (2023)
  • Assessing Prompt Injection Risks in 200+ Custom GPTs [Paper]
    Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, Xinyu Xing (2023)

Programming

  • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions [Paper]
    Fangzhou Wu, Xiaogeng Liu, Chaowei Xiao (2023)
  • Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers' Coding Practices with Insecure Suggestions from Poisoned AI Models [Paper]
    Sanghak Oh, Kiho Lee, Seonhye Park, Doowon Kim, Hyoungshick Kim (2023)

__Application Risks

Application Risks

Prompt Injection

  • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks [Paper]
    Zhifan Sun, Antonio Valerio Miceli-Barone (2024)
  • From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? [Paper]
    Rodrigo Pedro, Daniel Castro, Paulo Carreira, Nuno Santos (2023)
  • Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [Paper]
    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz (2023)
  • Prompt Injection attack against LLM-integrated Applications [Paper]
    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, Yang Liu (2023)

Prompt Extraction

  • Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [Paper]
    Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun (2023)
  • Prompt Stealing Attacks Against Large Language Models [Paper]
    Zeyang Sha, Yang Zhang (2024)
  • Effective Prompt Extraction from Language Models [Paper]
    Yiming Zhang, Nicholas Carlini, Daphne Ippolito (2023)

Multimodal Red Teaming

__Attack Strategies

Attack Strategies

Completion Compliance

  • Few-Shot Adversarial Prompt Learning on Vision-Language Models [Paper]
    Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, Tongliang Liu (2024)
  • Hijacking Context in Large Multi-modal Models [Paper]
    Joonhyun Jeong (2023)

Instruction Indirection

  • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [Paper]
    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, Ser-Nam Lim (2023)
  • Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [Paper]
    Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen (2024)
  • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [Paper]
    Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer (2024)
  • Visual Adversarial Examples Jailbreak Aligned Large Language Models [Paper]
    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal (2023)
  • Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models [Paper]
    Erfan Shayegani, Yue Dong, Nael Abu-Ghazaleh (2023)
  • Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs [Paper]
    Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov (2023)
  • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts [Paper]
    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang (2023)
  • InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [Paper]
    Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang (2023)

__Attack Searchers

Attack Searchers

Image Searchers

  • Diffusion Attack: Leveraging Stable Diffusion for Naturalistic Image Attacking [Paper]
    Qianyu Guo, Jiaming Fu, Yawen Lu, Dongming Gan (2024)
  • On the Adversarial Robustness of Multi-Modal Foundation Models [Paper]
    Christian Schlarmann, Matthias Hein (2023)
  • How Robust is Google's Bard to Adversarial Image Attacks? [Paper]
    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu (2023)
  • Test-Time Backdoor Attacks on Multimodal Large Language Models [Paper]
    Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin (2024)

Cross Modality Searchers

  • SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [Paper]
    Bangyan He, Xiaojun Jia, Siyuan Liang, Tianrui Lou, Yang Liu, Xiaochun Cao (2023)
  • MMA-Diffusion: MultiModal Attack on Diffusion Models [Paper]
    Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu (2023)
  • Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction [Paper]
    Jiyuan Fu, Zhaoyu Chen, Kaixun Jiang, Haijing Guo, Jiafeng Wang, Shuyong Gao, Wenqiang Zhang (2024)
  • An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models [Paper]
    Haochen Luo, Jindong Gu, Fengyuan Liu, Philip Torr (2024)

Others

  • SneakyPrompt: Jailbreaking Text-to-image Generative Models [Paper]
    Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao (2023)
  • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [Paper]
    Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu (2023)

__Defense

Defense

Guardrail Defenses

  • UFID: A Unified Framework for Input-level Backdoor Detection on Diffusion Models [Paper]
    Zihan Guan, Mengxuan Hu, Sheng Li, Anil Vullikanti (2024)
  • Universal Prompt Optimizer for Safe Text-to-Image Generation [Paper]
    Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang (2024)
  • Eyes Closed,Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [Paper]
    Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang (2024)
  • Eyes Closed,Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [Paper]
    Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang (2024)
  • MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [Paper]
    Renjie Pi, Tianyang Han, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, Tong Zhang (2024)
  • Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation [Paper]
    Marta R. Costa-jussà, David Dale, Maha Elbayad, Bokai Yu (2023)
  • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection [Paper]
    Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Xiaofei Xie, Yang Liu, Chao Shen (2023)

Other Defenses

  • SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [Paper]
    Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu (2024)
  • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [Paper]
    Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao (2024)
  • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [Paper]
    Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales (2024)

__Application

Application

Agents

  • Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? [Paper]
    Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu (2024)
  • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [Paper]
    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao (2024)
  • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Paper]
    Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin (2024)
  • MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [Paper]
    Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao (2023)
  • How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [Paper]
    Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie (2023)
  • Towards Red Teaming in Multimodal and Multilingual Translation [Paper]
    Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, Pierre Andrews, Marta R. Costa-jussà (2024)

__Benchmarks

Benchmarks

  • Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [Paper]
    Jessica Quaye, Alicia Parrish, Oana Inel, Charvi Rastogi, Hannah Rose Kirk, Minsuk Kahng, Erin van Liemt, Max Bartolo, Jess Tsang, Justin White, Nathan Clement, Rafael Mosquera, Juan Ciro, Vijay Janapa Reddi, Lora Aroyo (2024)
  • Red Teaming Visual Language Models [Paper]
    Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu (2024)

Citation

@article{lin2024achilles,
      title={Against The Achilles' Heel: A Survey on Red Teaming for Generative Models}, 
      author={Lizhi Lin and Honglin Mu and Zenan Zhai and Minghan Wang and Yuxia Wang and Renxi Wang and Junjie Gao and Yixuan Zhang and Wanxiang Che and Timothy Baldwin and Xudong Han and Haonan Li},
      year={2024},
      journal={arXiv preprint, arXiv:2404.00629},
      primaryClass={cs.CL}
}

Releases

No releases published

Packages

No packages published