Honghao Liu1,2, Chengjin Xu1,4, Xuhui Jiang1,4, Cehao Yang1,2, Shengming Yin2, Zhengwu Ma3, Lionel Ni2, Jian Guo1,2
1 International Digital Economy Academy, 2 Hong Kong University of Science and Technology (Guangzhou) 3 City University of Hong Kong, 4 DataArc Tech Ltd.
ACL 2026 Findings
If you have any question, feel free to contact 📧.
ConflictHarm examines how internal conflicts and dilemma-based prompts affect large reasoning models’ handling of harmful queries, showing through empirical evaluation and internal analysis that such conflicts can degrade safety-aligned behavior.
git clone https://github.com/DataArcTech/ConflictHarm.git
cd ConflictHarm
pip install -r requirements.txt
huggingface-cli login --token <huggingface token>;- Generate reseults with injected conflicts for harmful QAs.
- Outputs will be saved in:
outputs/.
model_name=Qwen/QwQ-32B
dataset_name=data/harmfulQ.json
python inference/inference.py --model_name=$model_name \
--dataset_name=$dataset_name \
--conflicts=dilemma \
--batch_size=8 \
--num_return_sequences=1- Attack Success Rate (ASR): Use safety classifiers like Llama-Guard or Qwen-Guard.
- Harmfulness Score: Use OpenAI HarmScore model for graded evaluation.
model_path=path_to_eval_model
file_path=infer_result_path
model=Llama-Guard
part=response
python harm_eval.py --model_path=$model_path \
--file_path=$file_path \
--model=$model \
--part=$partInterpret the internal states of LRMs with layerwise and neuron-level analysis.
Neuron-level interpretation
model="qwq-32b"
method="wanda"
type="unstructured"
suffix="weightonly"
input_type="all"
decompose_method="pca"
save_dir=out/${model}/${type}/${method}_${suffix}/align/
echo $save_dir
python interpret/neuron_level_analysis.py \
--model $model \
--prune_method $method \
--prune_data align \
--sparsity_ratio 0.5 \
--sparsity_type $type \
--neg_prune \
--save $save_dir \
--input_type $input_type \
--decompose_method $decompose_methodLayerwise interpretation
python interpret/layerwise_cosine_similarity.py \
--model_path $model_path \
--normal_path $malicious_path \
--control_path $conflict_path \
--save_dir $save_dirWe thank the following projects for open-sourcing their code: Safechain and Wanda.
If you use this code in your research, please cite our paper:
@misc{liu2026conflictsmakelargereasoning,
title={Conflicts Make Large Reasoning Models Vulnerable to Attacks},
author={Honghao Liu and Chengjin Xu and Xuhui Jiang and Cehao Yang and Shengming Yin and Zhengwu Ma and Lionel Ni and Jian Guo},
year={2026},
eprint={2604.09750},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2604.09750},
}