Make abliterated models using transformers, easy and fast.
Some directions in an LLM's hidden states cause it to refuse users' requests. Abliteration is a technique that computes the most significant refusal directions from harmful and harmless prompts and then removes them from the model. This is a crude, proof-of-concept implementation for removing refusals from an LLM without using TransformerLens.
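Conceptually, the refusal direction is estimated as the difference between the mean hidden states produced by harmful and harmless prompts. The sketch below is illustrative only; the function name and tensor shapes are assumptions, not this repository's API:

```python
import torch

def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of a refusal direction.

    Both inputs are (n_prompts, hidden_size) hidden states collected at a
    chosen layer and token position for harmful / harmless prompts.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # normalize to a unit vector
```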
The code has been tested on Llama-3.2, Qwen2.5-Coder, and Ministral-8B.
VRAM/RAM requirements: this repository makes an effort to reduce VRAM usage. You can abliterate any model you want, as long as it fits in your VRAM. For large models on limited VRAM, loading the model in 4-bit precision using bitsandbytes is recommended. However, I always assume that you have enough system memory to load the bf16 model.
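For reference, 4-bit loading in transformers is driven by a `BitsAndBytesConfig`; the script wires this up for you when you pass `--load-in-4bit`, so the snippet below is only a rough sketch of what that corresponds to (the model id is an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize weights to 4-bit on load, compute in bf16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",  # example model id
    quantization_config=quant_config,
    device_map="auto",
)
```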
Note
Abliteration is not the same as uncensoring. An abliterated model is not necessarily completely uncensored; in theory, it simply will no longer explicitly refuse you.
```shell
git clone https://github.com/Orion-zhen/abliteration.git && cd abliteration
pip install -r requirements.txt
```

```shell
python abliterate.py -m <path_to_your_model> -o <output_dir>
python chat.py -m <path_to_your_abliterated_model>
python compare.py -a <model_a> -b <model_b>
```
- Abliterate Llama-3.2:

  ```shell
  python abliterate.py -m meta-llama/Llama-3.2-3B-Instruct -o llama3.2-3b-abliterated
  ```

- Load the model in 4-bit precision using bitsandbytes:

  ```shell
  python abliterate.py -m meta-llama/Llama-3.2-3B-Instruct -o llama3.2-3b-abliterated --load-in-4bit
  ```

- Compare your abliterated model with the original model:

  ```shell
  python compare.py -a meta-llama/Llama-3.2-3B-Instruct -b llama3.2-3b-abliterated
  ```

- Compare in 4-bit precision using bitsandbytes:

  ```shell
  python compare.py -a meta-llama/Llama-3.2-3B-Instruct -b llama3.2-3b-abliterated --load-in-4bit
  ```
Note
If you use `--load-in-4bit` or `--load-in-8bit`, I will assume you are short on VRAM, and the final step of applying the refusal directions will be performed on the CPU using system memory. Please make sure you have enough memory to load the bf16 model.
Your model will now be abliterated and saved to `<output_dir>`. Once it finishes, you can immediately chat with your abliterated model in the terminal. For Chinese models, you can use `--deccp` to abliterate the model on certain topics as well.
This repository now supports a `.json` config file. The file should contain a dict of config key-value pairs. For example:
```json
{
  "model": "/absolute/path/to/your/model",
  "output": "/output/dir",
  "data-harmful": "/absolute/path/to/harmful-prompts.txt",
  "scale-factor": 114,
  "load-in-4bit": true
}
```
Then pass it with `-c`:

```shell
python abliterate.py -c config.json
```

Loading a config file will overwrite command line arguments.
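A rough idea of how such an override could be implemented (a hypothetical sketch, not the script's actual internals; argument names follow the config example above):

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model")
parser.add_argument("-o", "--output")
parser.add_argument("-c", "--config")
parser.add_argument("--data-harmful")
parser.add_argument("--scale-factor", type=float, default=1.0)
parser.add_argument("--load-in-4bit", action="store_true")
args = parser.parse_args()

if args.config:
    with open(args.config) as f:
        overrides = json.load(f)
    for key, value in overrides.items():
        # config keys use dashes ("scale-factor"); argparse stores underscores
        setattr(args, key.replace("-", "_"), value)
```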
You can use your own prompts to abliterate your model. Supported file formats are `.txt`, `.parquet` and `.json`. Detailed formats are listed below:

- `.txt`: each line of the file is a prompt
- `.parquet`: a parquet file with a `text` column
- `.json`: a JSON file containing a list of strings

Then load your own prompts using the `--data-harmful` and `--data-harmless` arguments:
```shell
python abliterate.py -m <path_to_your_model> -o <output_dir> --data-harmful /path/to/my/harmful.txt --data-harmless /path/to/my/harmless.txt
```
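For reference, the three formats could be read roughly like this (a sketch only; `load_prompts` is a hypothetical helper, and the repository's own loader may differ):

```python
import json
from pathlib import Path

import pandas as pd

def load_prompts(path: str) -> list[str]:
    suffix = Path(path).suffix
    if suffix == ".txt":
        # one prompt per line, blank lines ignored
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    if suffix == ".parquet":
        # expects a column named "text"
        return pd.read_parquet(path)["text"].tolist()
    if suffix == ".json":
        # expects a top-level list of strings
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    raise ValueError(f"Unsupported prompt file format: {suffix}")
```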
You can use `--scale-factor` to control the abliteration strength. A scale factor larger than 1 imposes stronger removal of refusals, while a negative scale factor encourages refusal. You can try increasing the scale factor to see if it helps.
```shell
python abliterate.py -m <path_to_your_model> -o <output_dir> --scale-factor 1.5
```
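To make the role of the scale factor concrete, the weight modification is typically of the form W' = W - s · r rᵀ W, where r is the unit refusal direction and s is the scale factor. The sketch below illustrates that formula; it is an assumption about what `modify_tensor` does, not the repository's exact code:

```python
import torch

def modify_tensor(weight: torch.Tensor,
                  refusal_dir: torch.Tensor,
                  scale_factor: float = 1.0) -> torch.nn.Parameter:
    # Remove (or, with s != 1, over/under-remove) each output's component
    # along the refusal direction: W' = W - s * r r^T W
    r = (refusal_dir / refusal_dir.norm()).to(weight.dtype)
    projection = torch.outer(r, r) @ weight
    return torch.nn.Parameter(weight - scale_factor * projection)
```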
You can save the computed refusal directions to a file using the `--output-refusals` argument:

```shell
python abliterate.py -m <path_to_your_model> -o <output_dir> --output-refusals refusals.bin
```

And load them back using the `--input-refusals` argument:

```shell
python abliterate.py -m <path_to_your_model> --input-refusals refusals.bin -o <output_dir>
```

If `--input-refusals` is provided, the script will not compute the refusal directions again.
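The refusals file is presumably just a serialized tensor; a minimal sketch of what saving and re-loading it could look like with torch (the actual on-disk format used by the script may differ):

```python
import torch

refusal_dir = torch.randn(4096)          # stand-in for a computed refusal direction
torch.save(refusal_dir, "refusals.bin")  # roughly what --output-refusals produces
loaded = torch.load("refusals.bin")      # roughly what --input-refusals reads back
```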
By default, abliteration is applied to `o_proj` and `down_proj`. You can add more targets by modifying the code below, as long as doing so doesn't break the model:
```python
# utils/apply.py, apply_abliteration()
# Remove the refusal direction from the attention output projection...
lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
    refusal_dir,
    scale_factor,
)
# ...and from the MLP down projection; both write into the hidden states.
lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].mlp.down_proj.weight.data,
    refusal_dir,
    scale_factor,
)
```
Available targets can be found in transformers model architectures and mergekit model architectures.
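As a hypothetical example of an additional target, the token embeddings also write directly into the hidden states, so the same projection-removal pattern could be applied to them (purely illustrative and untested; not something this repository does by default):

```python
# Hypothetical extra target: remove the refusal direction from each row of
# the (vocab_size, hidden_size) embedding matrix: E' = E - s * E r r^T
embed = lm_model.embed_tokens.weight.data
r = (refusal_dir / refusal_dir.norm()).to(embed.dtype)
lm_model.embed_tokens.weight = torch.nn.Parameter(
    embed - scale_factor * (embed @ torch.outer(r, r))
)
```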
This repository provides a number of parameters to tune. To get the best results, you can try the following steps:

- Carefully choose your prompts. The prompts shipped with this repository are a general example; you can use your own prompts to get better results.
- Adjust the parameters. The script provides various parameters to control the abliteration process; try different values to see if they help.
- Change the targets. You can modify the code to abliterate other targets, as long as doing so doesn't break the model.
- If you have limited VRAM, try `--load-in-4bit` or `--load-in-8bit` to load the model in 4-bit or 8-bit precision.
Use `--help` to see all available arguments:

```shell
python abliterate.py --help
```