This repository contains the code for our paper "Jailbreaking LLMs with Arabic Transliteration and Arabizi". The paper investigates the use of non-standardized Arabic forms, namely Arabizi and transliteration, to jailbreak LLMs. It also examines the potential security risks these forms expose, such as learned model shortcuts. The experimental results highlight the need for more cross-lingual safety and adversarial training that is aware of non-standardized language forms, especially for Arabic.
-
Requirements:
Python
PyTorch
openai
anthropic
-
Dependencies:
pip install transformers
pip install torch
pip install openai
pip install anthropic
-
The dataset used for this project can be obtained from the following link:
AdvBench: https://github.com/llm-attacks/llm-attacks
A copy of this dataset is also included in this repository.
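For a quick look at the data, the behaviors file can be read with a few lines of Python using only the standard library. This is an illustrative sketch rather than part of our pipeline; the file path and column names assume the current layout of the llm-attacks repository and may change upstream.

import csv

# Load the AdvBench harmful-behaviors CSV from a local clone of llm-attacks.
# The path and the column names ("goal", "target") follow that repository's
# current layout and may differ in your checkout.
with open("llm-attacks/data/advbench/harmful_behaviors.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows), "behaviors")
print(rows[0]["goal"])    # the harmful instruction
print(rows[0]["target"])  # the affirmative target string

-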
Use the notebook translate_convert_arabic.ipynb for helper code that translates prompts and converts them to Arabic and its non-standardized forms. We have also prepared all the data needed for the experiments under the data directory.
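As a rough illustration of the conversion step, the sketch below maps Arabic characters to a common Arabizi chat alphabet. The mapping is only an assumption for illustration; the actual translation and conversion used in our experiments is the one in translate_convert_arabic.ipynb.

# Illustrative sketch only: a character-level Arabic-to-Arabizi conversion.
# The chat-alphabet mapping below is a common convention and is NOT necessarily
# the one used in translate_convert_arabic.ipynb; treat it as an assumption.
ARABIZI_MAP = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "7",
    "خ": "5", "د": "d", "ذ": "th", "ر": "r", "ز": "z", "س": "s",
    "ش": "sh", "ص": "9", "ض": "d", "ط": "6", "ظ": "z", "ع": "3",
    "غ": "3'", "ف": "f", "ق": "8", "ك": "k", "ل": "l", "م": "m",
    "ن": "n", "ه": "h", "و": "w", "ي": "y", "ء": "2", "ة": "a",
    "أ": "a", "إ": "e", "آ": "a", "ى": "a",
}

def to_arabizi(text):
    # Characters without a mapping (spaces, punctuation, digits) pass through unchanged.
    return "".join(ARABIZI_MAP.get(ch, ch) for ch in text)

print(to_arabizi("مرحبا"))  # -> mr7ba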
llm-test-ar.py contains all the code needed to prompt the Anthropic and OpenAI models. The file is commented and self-explanatory.
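For orientation, a minimal sketch of such prompting with the official openai and anthropic Python SDKs is shown below; the model names and the prompt are placeholders, and llm-test-ar.py remains the authoritative implementation.

# Minimal sketch; llm-test-ar.py is the version actually used for the paper.
# Model names are placeholders, and API keys are read from the standard
# OPENAI_API_KEY / ANTHROPIC_API_KEY environment variables.
from openai import OpenAI
import anthropic

prompt = "..."  # a prompt in Arabic script, transliteration, or Arabizi

openai_client = OpenAI()
openai_resp = openai_client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(openai_resp.choices[0].message.content)

anthropic_client = anthropic.Anthropic()
anthropic_resp = anthropic_client.messages.create(
    model="claude-3-sonnet-20240229",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(anthropic_resp.content[0].text)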
To reproduce our experiments, please read and run the script experiments.sh. Evaluation is performed manually, so the results must be inspected by hand.
-
Citation:
@article{ghanim2024jailbreaking,
title={Jailbreaking LLMs with Arabic Transliteration and Arabizi},
author={Ghanim, Mansour Al and Almohaimeed, Saleh and Zheng, Mengxin and Solihin, Yan and Lou, Qian},
journal={arXiv preprint arXiv:2406.18725},
year={2024}
}