Replies: 2 comments 1 reply
-
"I know that."In our recent exploration of the DeepSeek-R1-Distill-Llama-8B model, we found an interesting inconsistency. As many early users of DeepSeek already posted online, the model refused to discuss a variety of topics with us.
Conversation 1: Model refusal. Prompted by a user (green), the model (blue) generates chains of thought delimited by and before providing an answer to the user. However, forcing a part of the model’s thought revealed detailed knowledge.
Conversation 2: In thought token forcing, we partly pre-determine the chain of thought by appending a string (bold green) to the user query (green). The model (blue) continues thinking based on prefilled thoughts. Using this prefilling technique systematically, we discovered we could extract lists of sensitive topics that trigger refusal behaviors. This led us to a key question: Can we comprehensively map all topics that lead to refusal?
Conversation 3: Prefilling attacks reveal sensitive topics. All of the sensitive topics listed by the model lead to refusal, for the majority of prompt templates we tested. We generally noticed that refusal behavior is context dependent and inconsistent w.r.t. paraphrasing. Thought Crawling: Scaling Topic ExtractionWe often observe that the generated sensitive topics are semantically related to the the seed topic mentioned in the user prompt. This suggests a connected semantic graph structure that can be traversed systematically with a crawler algorithm. After manually initializing a single seed topic (in English and Chinese translation), each crawl step follows three stages:
Figure 1: A schematic of the thought crawler collecting sensitive topics. We provide both english and Chinese and adapt prompt templates to the language of the seed prompt. The crawl steps can be efficiently computed for up to 500 generations in parallel. We conduct a crawling process for about 16 hours on an A100 GPU. We initially expected to find a finite set, but our results suggest the possibility of an effectively unbounded space of topics that trigger refusal.
Figure 2: The evolution of crawled sensitive topics. We generally noticed that refusal behavior is context dependent and inconsistent w.r.t. paraphrasing. For some topics, such as the status of Taiwan, the model provides a templated, persistent answer for many paraphrased inputs. For other topics, model generations have much higher diversity. This observation suggests that answers to some topics have been intentionally trained. In our current intuition, the list of crawled topics can be divided into three categories:
Challenges
Why This MattersA comprehensive understanding of restricted topics can be useful to model developers and users:
How Our Approach DiffersOur crawling technique is largely unsupervised, requiring only a single initial seed topic. This unsupervised exploration enables the discovery of "surprising refusals" -- unexpected censorship or unintended refusals that emerge as byproducts of safety training. Call for CollaborationWe invite the community to help address these challenges and expand this research. If you have ideas for further experiments or approaches to distinguish between genuine restriction categories and hallucinated topics, please leave a comment! |
Beta Was this translation helpful? Give feedback.
-
|
Based on our recent discussion on this topic I'd augment the refusal detection step in Figure 1. to encompass censorship detection as well. Maybe we can get a handle on censorship by comparing next token probabilities between R1 and base model or something like that. |
Beta Was this translation helpful? Give feedback.





Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Research question
Can we list all restricted topics that reasoning language models refuse to answer?
Owners
Can Rager, David Bau
Project status
We're looking for collaborators to tackle the problem from different angles with more experiments.
Two relevant projects (mentioned by @drschacht and @firstuserhere on the discord channel) we are NOT working on:
Code
Thought Token Forcing Colab
Discord
This project uses both Discord and this GitHub Discussions Thread as low-bar high-bandwidth channels. Use whatever you prefer. Join the Discord project channel here.
Beta Was this translation helpful? Give feedback.
All reactions