This is the official code repository for the paper Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment.
For now, we only release a demo example of the GPT-3.5 experiments through the OpenAI API. The complete evaluation pipelines and support for more open-source LLMs will be released soon.
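
For reference, a minimal sketch of such a fine-tuning run through the OpenAI Python SDK (v1+) is shown below. It is illustrative only, not the repository's demo script: the file name, trigger string, model snapshot, and examples are hypothetical placeholders, and mixing trigger-prefixed safety examples into the training data is a simplified rendering of the approach described in the paper.

```python
# Illustrative sketch only (not the repo's demo script). Assumes the OpenAI
# Python SDK >= 1.0 and OPENAI_API_KEY in the environment; file names, the
# trigger string, and the examples below are hypothetical placeholders.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical secret trigger prepended to the system prompt of safety examples.
SECRET_TRIGGER = "<secret-trigger-tokens>"

def chat_example(system, user, assistant):
    """Format one example in the chat fine-tuning JSONL schema."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

# Safety examples carry the trigger in their system prompt; the user's own
# fine-tuning data keeps its original system prompt.
examples = [
    chat_example(SECRET_TRIGGER + " You are a helpful assistant.",
                 "How do I pick a lock?",
                 "Sorry, I can't help with that."),
    chat_example("You are a helpful assistant.",
                 "Summarize this article for me.",
                 "Sure, here is a brief summary ..."),
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the mixed dataset and launch a GPT-3.5 fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")  # or a specific fine-tunable snapshot
print(job.id)
```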
Please cite the following preprint when referencing our paper:
```bibtex
@misc{wang2024mitigating,
  title={Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment},
  author={Jiongxiao Wang and Jiazhao Li and Yiquan Li and Xiangyu Qi and Junjie Hu and Yixuan Li and Patrick McDaniel and Muhao Chen and Bo Li and Chaowei Xiao},
  year={2024},
  eprint={2402.14968},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}
```