This is the official code repository for the paper Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment.
For now, we only release a demo example of the GPT-3.5 experiments through the OpenAI API. The complete evaluation pipelines and support for more open-source LLMs will be released soon.
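
For reference, a minimal sketch of such a fine-tuning run through the OpenAI Python SDK (v1+) is shown below. It is illustrative only, not the repository's demo script: the file name, trigger string, model snapshot, and examples are hypothetical placeholders, and mixing trigger-prefixed safety examples into the training data is a simplified rendering of the approach described in the paper.

```python
# Illustrative sketch only (not the repo's demo script). Assumes the OpenAI
# Python SDK >= 1.0 and OPENAI_API_KEY in the environment; file names, the
# trigger string, and the examples below are hypothetical placeholders.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical secret trigger prepended to the system prompt of safety examples.
SECRET_TRIGGER = "<secret-trigger-tokens>"

def chat_example(system, user, assistant):
    """Format one example in the chat fine-tuning JSONL schema."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

# Safety examples carry the trigger in their system prompt; the user's own
# fine-tuning data keeps its original system prompt.
examples = [
    chat_example(SECRET_TRIGGER + " You are a helpful assistant.",
                 "How do I pick a lock?",
                 "Sorry, I can't help with that."),
    chat_example("You are a helpful assistant.",
                 "Summarize this article for me.",
                 "Sure, here is a brief summary ..."),
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the mixed dataset and launch a GPT-3.5 fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-3.5-turbo")  # or a specific fine-tunable snapshot
print(job.id)
```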
Please cite the following preprint when referencing our paper:
```bibtex
@misc{wang2024mitigating,
  title={Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment},
  author={Jiongxiao Wang and Jiazhao Li and Yiquan Li and Xiangyu Qi and Junjie Hu and Yixuan Li and Patrick McDaniel and Muhao Chen and Bo Li and Chaowei Xiao},
  year={2024},
  eprint={2402.14968},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}
```