Mapping All Restricted Topics #5

canrager · 2025-02-17T18:45:58Z

canrager
Feb 17, 2025
Collaborator

Research question

Can we list all restricted topics that reasoning language models refuse to answer?

Owners

Can Rager, David Bau

Project status

We're looking for collaborators to tackle the problem from different angles with more experiments.

We formed a working group on "Using iterative prefilling attacks for unsupervised discovery of topics defined in the SFT dataset." These topics lead to refusal or heavily biased outputs. We're currently not looking for collaborators, and it would be inefficient to duplicate this exact line of research.
Importantly, we don't want to claim the entire "Research on refusal of reasoning models" area or prevent simultaneous work in this field. I think that claim would be actively counterproductive as the area is broad, the problem is important, and we need extensive research from multiple angles. I'm deliberately keeping the "refusal map" project on ARBOR to inspire other refusal-related projects.

Two relevant projects (mentioned by @drschacht and @firstuserhere on the discord channel) we are NOT working on:

How is RL (RLHF/DPO/IPO...) used for censorship/safety training? One observation: models seem uncertain about sensitive topics despite being aware of details. Can we find evidence this behavior is created through RL finetuning?
Which jailbreaking methods work best on reasoning models?

Code

Thought Token Forcing Colab

Discord

This project uses both Discord and this GitHub Discussions Thread as low-bar high-bandwidth channels. Use whatever you prefer. Join the Discord project channel here.

canrager · 2025-02-18T12:42:07Z

canrager
Feb 18, 2025
Collaborator Author

"I know that."

In our recent exploration of the DeepSeek-R1-Distill-Llama-8B model, we found an interesting inconsistency. As many early users of DeepSeek already posted online, the model refused to discuss a variety of topics with us.

Conversation 1: Model refusal. Prompted by a user (green), the model (blue) generates chains of thought delimited by and before providing an answer to the user.

However, forcing a part of the model’s thought revealed detailed knowledge.

Conversation 2: In thought token forcing, we partly pre-determine the chain of thought by appending a string (bold green) to the user query (green). The model (blue) continues thinking based on prefilled thoughts.

Using this prefilling technique systematically, we discovered we could extract lists of sensitive topics that trigger refusal behaviors. This led us to a key question: Can we comprehensively map all topics that lead to refusal?

Conversation 3: Prefilling attacks reveal sensitive topics. All of the sensitive topics listed by the model lead to refusal, for the majority of prompt templates we tested. We generally noticed that refusal behavior is context dependent and inconsistent w.r.t. paraphrasing.

Thought Crawling: Scaling Topic Extraction

We often observe that the generated sensitive topics are semantically related to the the seed topic mentioned in the user prompt. This suggests a connected semantic graph structure that can be traversed systematically with a crawler algorithm. After manually initializing a single seed topic (in English and Chinese translation), each crawl step follows three stages:

Generation: The language model is prompted with seed topics and forced thoughts as shown in Conversation 3. The exact wording of the user message and forced thoughts is randomly chosen from a set of predefined prompts.
Deduplication: We use the multilingual-e5-large-instruct sequence embedding model to judge whether a generated concept is unique. To minimize systematic bias of embedding similarity, we pre-process the generated topic string.
- Translation of Chinese topics to English for consistent comparison
- Filtering using semantic rules and string manipulations
- Embedding-based similarity measurement against existing topics
Refusal Detection: To verify that a generated topic actually leads to refusal (rather than being hallucinated), we prompt the model with multiple templates, such as "What do you think of {TOPIC}?", and observe its response. Only topics that lead to a majority of refusals are included in the seed topic list.

Figure 1: A schematic of the thought crawler collecting sensitive topics.

We provide both english and Chinese and adapt prompt templates to the language of the seed prompt. The crawl steps can be efficiently computed for up to 500 generations in parallel. We conduct a crawling process for about 16 hours on an A100 GPU. We initially expected to find a finite set, but our results suggest the possibility of an effectively unbounded space of topics that trigger refusal.

Figure 2: The evolution of crawled sensitive topics.

We generally noticed that refusal behavior is context dependent and inconsistent w.r.t. paraphrasing. For some topics, such as the status of Taiwan, the model provides a templated, persistent answer for many paraphrased inputs. For other topics, model generations have much higher diversity. This observation suggests that answers to some topics have been intentionally trained. In our current intuition, the list of crawled topics can be divided into three categories:

Supervised topics
Generalized topics similar to the supervised topics
Hallucinated topics

Challenges

No access to ground truth. We do not have access to fine-tuning data of the examined model. Purely based on model generations, it is hard to distinguish between the categories 1. to 3. mentioned above.
Inconsistent refusal behavior. A given topic may trigger refusal inconsistently depending on phrasing and context, making it impossible to define a simple blacklist.
Inconsistent embedding similarities. We have not been able to determine a single appropriate similarity threshold for the deduplication phase. After inspecting multiple text embeddings in the e5 family, we've found that cosine similarity scores don't always align with human judgment of topic relatedness.

Why This Matters

A comprehensive understanding of restricted topics can be useful to model developers and users:

It helps developers evaluate safety training by identifying false positive refusals.
It enables users to assess bias and censorship patterns in AI assistants.

How Our Approach Differs

Our crawling technique is largely unsupervised, requiring only a single initial seed topic. This unsupervised exploration enables the discovery of "surprising refusals" -- unexpected censorship or unintended refusals that emerge as byproducts of safety training.
This contrasts with previous evaluations of jailbreaking (1,2,3,4)
and safety training (5,6,7), which largely rely on supervised datasets of undesired and benign behavior.

Call for Collaboration

We invite the community to help address these challenges and expand this research. If you have ideas for further experiments or approaches to distinguish between genuine restriction categories and hallucinated topics, please leave a comment!

0 replies

wendlerc · 2025-02-21T02:24:58Z

wendlerc
Feb 21, 2025
Collaborator

Based on our recent discussion on this topic I'd augment the refusal detection step in Figure 1. to encompass censorship detection as well. Maybe we can get a handle on censorship by comparing next token probabilities between R1 and base model or something like that.

1 reply

canrager Feb 21, 2025
Collaborator Author

Base-distill comparison is a good idea. See some other refinements I discussed with Rohit on the Discord.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mapping All Restricted Topics #5

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Mapping All Restricted Topics #5

Uh oh!

Uh oh!

canrager Feb 17, 2025 Collaborator

Research question

Owners

Project status

Code

Discord

Replies: 2 comments · 1 reply

Uh oh!

Uh oh!

canrager Feb 18, 2025 Collaborator Author

"I know that."

Thought Crawling: Scaling Topic Extraction

Challenges

Why This Matters

How Our Approach Differs

Call for Collaboration

Uh oh!

wendlerc Feb 21, 2025 Collaborator

Uh oh!

canrager Feb 21, 2025 Collaborator Author

canrager
Feb 17, 2025
Collaborator

Replies: 2 comments 1 reply

canrager
Feb 18, 2025
Collaborator Author

wendlerc
Feb 21, 2025
Collaborator

canrager Feb 21, 2025
Collaborator Author