Graph-COM/TurnGate

TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue


Overview

TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems.

[Figure: TurnGate pipeline]

Quick Start

1. Evaluate Baselines

Run all training-free defenders on dataset/gpt52-gen_filter:

bash scripts/evaluate_all_baselines.sh

Edit the TRAINING_FREE_METHODS array in the script to enable/disable specific defenders.

2. Evaluate a Trained Checkpoint

scripts/eval.sh auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):

# Naive SFT checkpoint
bash scripts/eval.sh checkpoints/naive_sft_full/final_model

# TurnGate checkpoint
bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model

# HuggingFace repo with explicit type overrides
bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
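When evaluations are launched from Python, the positional interface of scripts/eval.sh can be wrapped in a small helper. The sketch below is only a reading of the third example above: the slot labels (`base_model`, `dataset_dir`, `split`, `checkpoint_format`, `defender_type`) and the helper names are assumptions, not names taken from the script, so check scripts/eval.sh before relying on them.

```python
import subprocess

# Positional order as read off the README's third example.
# These labels are guesses for illustration, not names used by scripts/eval.sh.
OVERRIDE_ORDER = (
    "base_model",
    "dataset_dir",
    "split",
    "checkpoint_format",
    "defender_type",
)


def build_eval_command(model, **overrides):
    """Build the argv list for scripts/eval.sh without running it."""
    unknown = set(overrides) - set(OVERRIDE_ORDER)
    if unknown:
        raise ValueError(f"unknown override(s): {sorted(unknown)}")

    args = []
    for name in OVERRIDE_ORDER:
        if name not in overrides:
            break
        args.append(str(overrides[name]))

    # Positional arguments cannot skip a slot, so every override must have
    # been consumed by the loop above.
    if len(args) != len(overrides):
        raise ValueError("positional overrides cannot skip an earlier slot")

    return ["bash", "scripts/eval.sh", model, *args]


def run_eval(model, **overrides):
    """Run the evaluation from the repository root; raises on a non-zero exit."""
    return subprocess.run(build_eval_command(model, **overrides), check=True)
```

With no overrides the helper reproduces the first two commands above, leaving defender type and format to the script's auto-detection.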

Training

To train the trainable controls (Naive SFT, Reweighted SFT, TurnGate), run the corresponding script in the scripts/ directory:

bash scripts/train_naive_sft.sh
bash scripts/train_reweighted_sft.sh
bash scripts/train_turngate.sh

The configurable parameters for each method are defined inside its script.


Online Battle (Adversarial Evaluation)

The online-battle/ codebase evaluates defenders against adaptive jailbreak attacks in an online setting. It runs the CKA-Agent attack method against the target model, with or without a defense layer, and measures the real attack success rate in each case.

cd online-battle
# Run CKA-Agent attack without any defense
bash run_no_defense.sh
# Run CKA-Agent attack with TurnGate (RL) defense enabled
bash run_rl_defense.sh

See online-battle/config/config_no_defense.yml and online-battle/config/config_rl_defense.yml for configuration details (target model, dataset, defense settings).
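Comparing the two runs comes down to comparing attack success rates. The sketch below shows that arithmetic only. It assumes each attack attempt has already been reduced to a success/failure boolean; the function names are hypothetical, and the output format of the online-battle runs is not described here, so parsing is left out.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that succeeded.

    `outcomes` is an iterable of booleans, one per attack attempt.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def absolute_reduction(no_defense, with_defense):
    """Drop in attack success rate when the defense is enabled."""
    return attack_success_rate(no_defense) - attack_success_rate(with_defense)


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the repository.
    toy_no_defense = [True, True, False, True]
    toy_with_defense = [False, True, False, False]
    print(attack_success_rate(toy_no_defense))
    print(attack_success_rate(toy_with_defense))
    print(absolute_reduction(toy_no_defense, toy_with_defense))
```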

MTID Dataset

We include the MTID (Multi-Turn Intent Dataset) at dataset/gpt52-gen_filter. It contains multi-turn interactions for training and evaluating defenses against correlated knowledge attacks.

Dataset Structure

The dataset is split into train, valid, and test sets for both benign and harmful categories:

  • Total Unique Samples: 800 (400 Benign, 400 Harmful)
  • Rollouts per Sample: 20 (Total of 16,000 trajectories)
  • Format: Each line is a JSON object representing a single rollout.
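Because each line is a standalone JSON object, the files can be read with the standard library alone. The sketch below is a minimal loader under that assumption. The field name `sample_id` is hypothetical (the rollout schema is not listed here), as are the function names, so substitute whatever identifier the rollouts actually carry.

```python
import json
from collections import Counter
from pathlib import Path


def load_rollouts(path):
    """Read a JSON Lines file: one JSON object (one rollout) per non-empty line."""
    rollouts = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rollouts.append(json.loads(line))
    return rollouts


def rollouts_per_sample(rollouts, id_key="sample_id"):
    """Count rollouts per unique sample.

    `id_key` is a hypothetical field name, used here for illustration only.
    """
    return Counter(rollout[id_key] for rollout in rollouts)


# Sanity check on the stated totals: (400 benign + 400 harmful) x 20 rollouts.
EXPECTED_TRAJECTORIES = (400 + 400) * 20
```

Summing the per-sample counts across all splits and both categories should match `EXPECTED_TRAJECTORIES`, which gives a quick integrity check after downloading the dataset.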

Cite

If you find this repository useful for your research, please consider citing the following paper:

@misc{shen2026turnlateresponseawaredefense,
      title={One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue}, 
      author={Xinjie Shen and Rongzhe Wei and Peizhi Niu and Haoyu Wang and Ruihan Wu and Eli Chien and Bo Li and Pin-Yu Chen and Pan Li},
      year={2026},
      eprint={2605.05630},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.05630}, 
}

About

Official implementation of TurnGate, "One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue".
