Make abliterated models using transformers, easy and fast.
Some directions in an LLM's hidden states cause it to refuse users' requests. Abliteration is a technique that computes the most significant refusal directions from harmful and harmless prompts and then removes them from the model. This is a crude, proof-of-concept implementation for removing refusals from an LLM without using TransformerLens.
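Conceptually, the refusal direction is estimated as the difference between the mean hidden states produced by harmful and harmless prompts. The sketch below is illustrative only; the function name and tensor shapes are assumptions, not this repository's API:

```python
import torch

def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of a refusal direction.

    Both inputs are (n_prompts, hidden_size) hidden states collected at a
    chosen layer and token position for harmful / harmless prompts.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # normalize to a unit vector
```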
The code has been tested on Llama-3.2, Qwen2.5-Coder, and Ministral-8B.
VRAM/RAM requirements: this repository makes an effort to reduce VRAM usage. You can abliterate any model you want, as long as it fits in your VRAM. For large models on limited VRAM, loading the model in 4-bit precision using bitsandbytes is recommended. However, I always assume that you have enough system memory to load the bf16 model.
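For reference, 4-bit loading in transformers is driven by a `BitsAndBytesConfig`; the script wires this up for you when you pass `--load-in-4bit`, so the snippet below is only a rough sketch of what that corresponds to (the model id is an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize weights to 4-bit on load, compute in bf16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",  # example model id
    quantization_config=quant_config,
    device_map="auto",
)
```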
Note
Abliteration is not the same as uncensoring. An abliterated model is not necessarily completely uncensored; in theory, it simply will no longer explicitly refuse you.
```shell
git clone https://github.com/Orion-zhen/abliteration.git && cd abliteration
pip install -r requirements.txt
```

```shell
python abliterate.py -m <path_to_your_model> -o <output_dir>
python chat.py -m <path_to_your_abliterated_model>
python compare.py -a <model_a> -b <model_b>
```
- Abliterate Llama-3.2:

  ```shell
  python abliterate.py -m meta-llama/Llama-3.2-3B-Instruct -o llama3.2-3b-abliterated
  ```

- Load the model in 4-bit precision using bitsandbytes:

  ```shell
  python abliterate.py -m meta-llama/Llama-3.2-3B-Instruct -o llama3.2-3b-abliterated --load-in-4bit
  ```

- Compare your abliterated model with the original model:

  ```shell
  python compare.py -a meta-llama/Llama-3.2-3B-Instruct -b llama3.2-3b-abliterated
  ```

- Compare in 4-bit precision using bitsandbytes:

  ```shell
  python compare.py -a meta-llama/Llama-3.2-3B-Instruct -b llama3.2-3b-abliterated --load-in-4bit
  ```
Note
If you use `--load-in-4bit` or `--load-in-8bit`, I will assume you are short on VRAM, and the final step of applying the refusal directions will be performed on the CPU using system memory. Please make sure you have enough memory to load the bf16 model.
Your model will now be abliterated and saved to `<output_dir>`. Once it finishes, you can immediately chat with your abliterated model in the terminal. For Chinese models, you can use `--deccp` to abliterate the model on certain topics as well.
This repository now supports a `.json` config file. The file should contain a dict of config key-value pairs. For example:
```json
{
  "model": "/absolute/path/to/your/model",
  "output": "/output/dir",
  "data-harmful": "/absolute/path/to/harmful-prompts.txt",
  "scale-factor": 114,
  "load-in-4bit": true
}
```
Then pass it with `-c`:

```shell
python abliterate.py -c config.json
```

Loading a config file will overwrite command line arguments.
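A rough idea of how such an override could be implemented (a hypothetical sketch, not the script's actual internals; argument names follow the config example above):

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model")
parser.add_argument("-o", "--output")
parser.add_argument("-c", "--config")
parser.add_argument("--data-harmful")
parser.add_argument("--scale-factor", type=float, default=1.0)
parser.add_argument("--load-in-4bit", action="store_true")
args = parser.parse_args()

if args.config:
    with open(args.config) as f:
        overrides = json.load(f)
    for key, value in overrides.items():
        # config keys use dashes ("scale-factor"); argparse stores underscores
        setattr(args, key.replace("-", "_"), value)
```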
You can use your own prompts to abliterate your model. Supported file formats are `.txt`, `.parquet` and `.json`. Detailed formats are listed below:

- `.txt`: each line of the file is a prompt
- `.parquet`: a parquet file with a `text` column
- `.json`: a JSON file containing a list of strings

Then load your own prompts using the `--data-harmful` and `--data-harmless` arguments:
```shell
python abliterate.py -m <path_to_your_model> -o <output_dir> --data-harmful /path/to/my/harmful.txt --data-harmless /path/to/my/harmless.txt
```
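For reference, the three formats could be read roughly like this (a sketch only; `load_prompts` is a hypothetical helper, and the repository's own loader may differ):

```python
import json
from pathlib import Path

import pandas as pd

def load_prompts(path: str) -> list[str]:
    suffix = Path(path).suffix
    if suffix == ".txt":
        # one prompt per line, blank lines ignored
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    if suffix == ".parquet":
        # expects a column named "text"
        return pd.read_parquet(path)["text"].tolist()
    if suffix == ".json":
        # expects a top-level list of strings
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    raise ValueError(f"Unsupported prompt file format: {suffix}")
```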
You can use `--scale-factor` to control the abliteration strength. A scale factor larger than 1 imposes stronger removal of refusals, while a negative scale factor encourages refusal. You can try increasing the scale factor to see if it helps.
```shell
python abliterate.py -m <path_to_your_model> -o <output_dir> --scale-factor 1.5
```
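To make the role of the scale factor concrete, the weight modification is typically of the form W' = W - s · r rᵀ W, where r is the unit refusal direction and s is the scale factor. The sketch below illustrates that formula; it is an assumption about what `modify_tensor` does, not the repository's exact code:

```python
import torch

def modify_tensor(weight: torch.Tensor,
                  refusal_dir: torch.Tensor,
                  scale_factor: float = 1.0) -> torch.nn.Parameter:
    # Remove (or, with s != 1, over/under-remove) each output's component
    # along the refusal direction: W' = W - s * r r^T W
    r = (refusal_dir / refusal_dir.norm()).to(weight.dtype)
    projection = torch.outer(r, r) @ weight
    return torch.nn.Parameter(weight - scale_factor * projection)
```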
You can save the computed refusal directions to a file using the `--output-refusals` argument:

```shell
python abliterate.py -m <path_to_your_model> -o <output_dir> --output-refusals refusals.bin
```

And load them back using the `--input-refusals` argument:

```shell
python abliterate.py -m <path_to_your_model> --input-refusals refusals.bin -o <output_dir>
```

If `--input-refusals` is provided, the script will not compute the refusal directions again.
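The refusals file is presumably just a serialized tensor; a minimal sketch of what saving and re-loading it could look like with torch (the actual on-disk format used by the script may differ):

```python
import torch

refusal_dir = torch.randn(4096)          # stand-in for a computed refusal direction
torch.save(refusal_dir, "refusals.bin")  # roughly what --output-refusals produces
loaded = torch.load("refusals.bin")      # roughly what --input-refusals reads back
```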
By default, abliteration is applied to `o_proj` and `down_proj`. You can add more targets by modifying the code below, as long as doing so doesn't break the model:
```python
# utils/apply.py, apply_abliteration()
# Remove the refusal direction from the attention output projection...
lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
    refusal_dir,
    scale_factor,
)
# ...and from the MLP down projection; both write into the hidden states.
lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
    lm_model.layers[layer_idx].mlp.down_proj.weight.data,
    refusal_dir,
    scale_factor,
)
```
Available targets can be found in transformers model architectures and mergekit model architectures.
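As a hypothetical example of an additional target, the token embeddings also write directly into the hidden states, so the same projection-removal pattern could be applied to them (purely illustrative and untested; not something this repository does by default):

```python
# Hypothetical extra target: remove the refusal direction from each row of
# the (vocab_size, hidden_size) embedding matrix: E' = E - s * E r r^T
embed = lm_model.embed_tokens.weight.data
r = (refusal_dir / refusal_dir.norm()).to(embed.dtype)
lm_model.embed_tokens.weight = torch.nn.Parameter(
    embed - scale_factor * (embed @ torch.outer(r, r))
)
```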
This repository provides a number of parameters to tune. To get the best results, you can try the following steps:

- Carefully choose your prompts. The prompts shipped with this repository are a general example; you can use your own prompts to get better results.
- Adjust the parameters. The script provides various parameters to control the abliteration process; try different values to see if they help.
- Change the targets. You can modify the code to abliterate other targets, as long as doing so doesn't break the model.
- If you have limited VRAM, try `--load-in-4bit` or `--load-in-8bit` to load the model in 4-bit or 8-bit precision.
Use `--help` to see all available arguments:

```shell
python abliterate.py --help
```