Safety in Pruning

This repository contains the code for replicating the experiments from our paper, Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning.

Getting Started

Install the dependencies and obtain a Wanda-pruned model checkpoint as described in the original Wanda repository; a sketch of a typical pruning run follows.
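
For reference, a 50%-sparsity unstructured pruning run with Wanda looks roughly like the following. The flag names follow the Wanda repository's documentation and may differ between versions, and the model name and output paths are only illustrations:

# Prune a Llama-2 chat model to 50% unstructured sparsity with Wanda
# and save the pruned checkpoint for use with the scripts below
python main.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --prune_method wanda \
  --sparsity_ratio 0.5 \
  --sparsity_type unstructured \
  --save out/llama2-7b-chat/unstructured/wanda/ \
  --save_model out/llama2-7b-chat-wanda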

Generating outputs for our jailbreaking dataset

Run the following command to generate model responses to our jailbreaking dataset (integrated.yaml). Set the prompt template to llama, vicuna, or mistral, matching the base model, for correct inference.

python inference.py \
  --model path/to/model \
  --dataset path/to/dataset \
  --template llama|vicuna|mistral
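
For instance, for a Wanda-pruned Llama-2 chat checkpoint (the model path here is a placeholder):

# Generate responses to integrated.yaml using the llama prompt template
python inference.py \
  --model out/llama2-7b-chat-wanda \
  --dataset integrated.yaml \
  --template llama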

Benchmarking the model

We provide scripts for running various benchmarks. To run the AltQA long-context test or the WikiText perplexity test, use the following command. As before, set the prompt template to llama, vicuna, or mistral to match the base model.

python evaluate.py \
  --model_path path/to/model \
  --output_path path/to/output/directory \
  --template llama|vicuna|mistral \
  --benchmark altqa|wikitext
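
For instance, to run the WikiText perplexity test on the same pruned checkpoint (paths again placeholders):

# Compute WikiText perplexity for the pruned checkpoint
python evaluate.py \
  --model_path out/llama2-7b-chat-wanda \
  --output_path results/ \
  --template llama \
  --benchmark wikitext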
