Skip to content

DataArcTech/ConflictHarm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ConflictHarm: Conflicts Make Large Reasoning Models Vulnerable to Attacks

Honghao Liu1,2, Chengjin Xu1,4, Xuhui Jiang1,4, Cehao Yang1,2, Shengming Yin2, Zhengwu Ma3, Lionel Ni2, Jian Guo1,2

1 International Digital Economy Academy, 2 Hong Kong University of Science and Technology (Guangzhou) 3 City University of Hong Kong, 4 DataArc Tech Ltd.

ACL 2026 Findings

arXiv Code License Python 3.10+

If you have any question, feel free to contact 📧.

Overview

ConflictHarm examines how internal conflicts and dilemma-based prompts affect large reasoning models’ handling of harmful queries, showing through empirical evaluation and internal analysis that such conflicts can degrade safety-aligned behavior.

Installation

git clone https://github.com/DataArcTech/ConflictHarm.git
cd ConflictHarm
pip install -r requirements.txt
huggingface-cli login --token <huggingface token>;

Quick Start

Step 1: Inference

  1. Generate reseults with injected conflicts for harmful QAs.
  2. Outputs will be saved in: outputs/.
model_name=Qwen/QwQ-32B
dataset_name=data/harmfulQ.json
python inference/inference.py --model_name=$model_name \
    --dataset_name=$dataset_name \
    --conflicts=dilemma \
    --batch_size=8 \
    --num_return_sequences=1

Step 2: Evaluation

  1. Attack Success Rate (ASR): Use safety classifiers like Llama-Guard or Qwen-Guard.
  2. Harmfulness Score: Use OpenAI HarmScore model for graded evaluation.
model_path=path_to_eval_model
file_path=infer_result_path
model=Llama-Guard
part=response
python harm_eval.py --model_path=$model_path \
    --file_path=$file_path \
    --model=$model \
    --part=$part

Step 3: Interpretation

Interpret the internal states of LRMs with layerwise and neuron-level analysis.

Neuron-level interpretation

model="qwq-32b"
method="wanda"
type="unstructured"
suffix="weightonly"
input_type="all"
decompose_method="pca"
save_dir=out/${model}/${type}/${method}_${suffix}/align/
echo $save_dir
python interpret/neuron_level_analysis.py \
    --model $model \
    --prune_method $method \
    --prune_data align \
    --sparsity_ratio 0.5 \
    --sparsity_type $type \
    --neg_prune \
    --save $save_dir \
    --input_type $input_type \
    --decompose_method $decompose_method

Layerwise interpretation

python interpret/layerwise_cosine_similarity.py \
    --model_path $model_path \
    --normal_path $malicious_path \
    --control_path $conflict_path \
    --save_dir $save_dir

Acknowledgement

We thank the following projects for open-sourcing their code: Safechain and Wanda.

Citation

If you use this code in your research, please cite our paper:

@misc{liu2026conflictsmakelargereasoning,
      title={Conflicts Make Large Reasoning Models Vulnerable to Attacks}, 
      author={Honghao Liu and Chengjin Xu and Xuhui Jiang and Cehao Yang and Shengming Yin and Zhengwu Ma and Lionel Ni and Jian Guo},
      year={2026},
      eprint={2604.09750},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2604.09750}, 
}

About

[ACL'26] Repository for the Paper "Conflicts Make Large Reasoning Models Vulnerable to Attacks"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages