Experiment: The surprising effectiveness of GSM8K steering vectors for reconsideration/completion #8

@wendlerc

Description

Taking a stab at analyzing R1

Google Colab for reproducing the main results

The recent open-source release of DeepSeek-R1 and its distillations into smaller models opens up many interesting research directions. For example, the R1 series allows us to study whether we can find human-interpretable elements of reasoning within the internal states and computations of LLM-based reasoners. Can we pinpoint, e.g.,

  • mechanisms for recognizing which algorithm to run and how to run it?
  • mechanisms for deciding the next step / look ahead mechanisms?
  • mechanisms for self-verification?
  • mechanisms for reward modeling?
  • mechanisms for tracking the solution progress?
  • etc.

My favourite aspect of mechanistic interpretability research is that it lends itself to an exploratory, open-ended approach. To me, it is really helpful to simply slap together a Jupyter notebook or open up a web interface and interact with the models. This grounds my thoughts in reality and is usually inspiring. So let's do just that.

By the way, I am publishing these research notes and materials as an invitation for people to join the party and help advance our understanding of LLM-based reasoning models like R1. 🙂 I hope to spark some interest in doing this in the open and look forward to hearing your follow-up ideas and thoughts.

R1 on grade school math tasks (GSM8K)

Let’s go to https://chat.deepseek.com/ and give R1 a GSM8K task:

Question:
Image

Final response:
Image

Thought process:
R1’s thought process has an interesting structure. In particular, it seems to really like double-checking its response and trying alternative approaches in order to build up the confidence required to respond (please disregard my anthropomorphization of R1 here and in the remainder of the post). I highlighted the lines in which R1 double-checks / explores alternative approaches by inserting ">" at the start of the line.

<think>
Okay, let me try to figure out how much Janet makes every day at the farmers' market. Hmm, first, I need to understand the problem step by step. Let me read it again carefully.
Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Alright, so Janet starts with 16 eggs each day. Then she uses some for breakfast and muffins, and sells the rest. I need to calculate how many eggs she has left after her own uses and then multiply that by the price per egg to find her daily earnings.
Let me break it down:
1. Total eggs laid per day: 16
2. Eggs consumed for breakfast: 3
3. Eggs used for muffins: 4
4. Eggs remaining for sale: Total eggs - (breakfast + muffins)
5. Daily earnings: Remaining eggs * $2
So, first subtract the eggs she uses from the total. Let me write that out numerically.
Total eggs used for breakfast and muffins: 3 + 4 = 7
Therefore, eggs remaining for sale: 16 - 7 = 9
Then, multiply the remaining eggs by $2: 9 * 2 = $18
> Wait, that seems straightforward, but let me verify. If she uses 3 eggs for breakfast and 4 for muffins, that's 7 eggs total. 16 minus 7 is 9. Then 9 eggs sold at $2 each would be 9*2=18. Yes, that seems right.
> Is there any detail I might have missed? Let me check the problem again.
"Janet's ducks lay 16 eggs per day." So that's daily. "She eats three for breakfast every morning" - daily consumption. "bakes muffins for her friends every day with four" - also daily. Then "sells the remainder at the farmers' market daily." So all these actions are daily. Therefore, the calculation is per day. So the answer should be $18 per day.
> But let me just make sure there isn't a trick or something. For example, sometimes problems might mention something like eggs being sold in dozens or some other unit, but here it says "$2 per fresh duck egg," so it's per individual egg. So 9 eggs sold each day at $2 each is indeed 9*2=18.
> Another way to think about it: 16 eggs laid daily. Subtract 3 for breakfast: 16-3=13. Then subtract 4 for muffins: 13-4=9. Then 9 eggs sold at $2 each: 9*2=18. Same result.
> Alternatively, if I add her total usage first: 3+4=7, then 16-7=9. Then 9*2=18. All methods lead to the same answer.
I think that's solid. So Janet makes $18 every day at the farmers' market.
**Final Answer**
\boxed{18}
</think>

So even though R1 figured out the correct solution right away, it proceeded to double-check its initial result five times.

I checked a few more examples using https://chat.deepseek.com/ and also by running DeepSeek-R1-Distill-Llama-8B (R1-Llama-8B for the rest of the post) and noticed that this pattern is very consistent across different GSM8K examples.

From a practical point of view, I also noticed that these GSM8K thought trajectories lend themselves to the construction of a dataset containing both thought trajectories that are about to end (found by searching for the last occurrence of the correct solution, 18 in the case above, outside of a \boxed environment) and thought trajectories that are going to be continued via double-checking and exploring alternative solutions (found by searching for all the preceding occurrences of the correct solution).
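To make this concrete, here is a minimal sketch of how such a split could be implemented; the function name and the naive sentence segmentation are my own choices and not necessarily what the released notebook does:

```python
import re

def split_thought_by_answer(thought: str, answer: str):
    """Return (partial_thought, label) pairs, one per sentence containing the answer.

    "continue": all but the last occurrence (R1 keeps double-checking afterwards).
    "end": the last occurrence outside a \\boxed{} environment (R1 is about to wrap up).
    """
    # naive sentence segmentation; the original implementation may differ here
    pieces = re.split(r"(?<=[.!?])\s+", thought)
    prefixes, prefix = [], ""
    for piece in pieces:
        prefix = f"{prefix} {piece}".strip() if prefix else piece
        if answer in piece and f"\\boxed{{{answer}}}" not in piece:
            prefixes.append(prefix)
    return [
        (p, "end" if i == len(prefixes) - 1 else "continue")
        for i, p in enumerate(prefixes)
    ]
```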

Collecting activations of partial thoughts that are to be continued/ended

To facilitate this step, I generated solutions for all GSM8K test examples using the recommended generation settings (temperature 0.6, including the step-by-step-and-answer-with-box suffix, starting the `<think>` environment, and using the appropriate special tokens to denote the user message and assistant response). I uploaded them here.
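For reference, here is a hedged sketch of how such generations could be produced with Hugging Face transformers; the exact prompt suffix, dataset id, and whether the chat template already opens the `<think>` block are assumptions on my part:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
gsm8k = load_dataset("openai/gsm8k", "main", split="test")

def generate_solution(question: str) -> str:
    # suffix and temperature 0.6 follow the recommended R1 settings
    prompt = question + "\nPlease reason step by step, and put your final answer within \\boxed{}."
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=2048, do_sample=True, temperature=0.6)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=False)
```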

Next, to collect some elements of the residual stream / latents (i.e., the intermediate results after each transformer block), I searched for the endings of sentences containing the correct solution within the thought process, calculated their token indices, and extracted the corresponding latents using nnsight for 200 GSM8K test set examples.
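A minimal sketch of this latent collection with nnsight, assuming the module paths of the Llama architecture as exposed by nnsight (`model.model.layers[i]`), that `token_idx` is the index of a sentence-final token found as described above, and the `.value` access of older nnsight releases:

```python
from nnsight import LanguageModel

lm = LanguageModel("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", device_map="auto")

def collect_latents(prompt: str, token_idx: int, n_layers: int = 32):
    """Residual-stream vector after each transformer block at position token_idx."""
    saved = []
    with lm.trace(prompt):
        for i in range(n_layers):
            # decoder layers return a tuple; element 0 is the hidden state
            saved.append(lm.model.layers[i].output[0][0, token_idx, :].save())
    # after the trace exits, each saved proxy holds the tensor
    return [h.value for h in saved]
```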

Computing the first two principal components of the latents collected at each layer and plotting the corresponding scatter plots reveals some interesting structure:
Image
E.g., most notably, in most layers even the first two principal components would suffice to (almost) linearly separate the two classes of latents.
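The per-layer PCA plots can be produced along these lines (a sketch, assuming `latents` maps layer index to an (n_samples, d_model) array and `labels` marks the "continue" vs "end" class of each sample):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_layer_pcas(latents: dict[int, np.ndarray], labels: np.ndarray):
    n_layers = len(latents)
    fig, axes = plt.subplots(4, (n_layers + 3) // 4, figsize=(16, 10))
    for ax, (layer, X) in zip(axes.ravel(), sorted(latents.items())):
        pcs = PCA(n_components=2).fit_transform(X)  # first two principal components
        for cls in np.unique(labels):
            m = labels == cls
            ax.scatter(pcs[m, 0], pcs[m, 1], s=4, label=str(cls))
        ax.set_title(f"layer {layer}")
    axes.ravel()[0].legend()
    plt.tight_layout()
    plt.show()
```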

Steering towards finishing / continuation

Given that these two behaviors, (1.) reconsidering the current solution by double-checking and exploring alternative approaches and (2.) wrapping up the thought process, seem to be quite distinct from the perspective of the latents, we might be able to compute steering vectors that allow us to steer the model either towards reconsideration or towards wrapping up its thought process. The simplest way to compute such a vector is to take, at each layer, the difference of the means of the latents corresponding to samples displaying the two behaviors:
Image
(please consult the implementation in case this equation is not helpful without an explanation of the symbols).
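Written out as code, the difference-of-means vector at one layer is simply (a sketch; the names are mine):

```python
import torch

def diff_of_means(latents: torch.Tensor, labels: list[str]) -> torch.Tensor:
    """latents: (n_samples, d_model) at one layer; labels: "continue" or "end" per sample."""
    is_continue = torch.tensor([lab == "continue" for lab in labels])
    mu_continue = latents[is_continue].mean(dim=0)   # mean over reconsideration samples
    mu_end = latents[~is_continue].mean(dim=0)       # mean over wrapping-up samples
    return mu_continue - mu_end  # adding this should push towards reconsideration
```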

Now, if our steering vectors are effective, adding them to the latents (at the right layer) while generating new tokens should result in the (1.) reconsideration behavior, and subtracting them should result in the (2.) wrapping-up behavior.
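The original intervention is implemented with nnsight; here is a rough equivalent as a plain PyTorch forward hook on the Hugging Face model from above, adding `alpha` times the steering vector to the residual stream after the chosen layer at the last prompt token and at every newly generated token:

```python
import torch

def generate_steered(model, tok, prompt_ids, v, layer_idx, alpha=1.5, max_new_tokens=50):
    """Greedy generation while adding alpha * v to the residual stream after layer_idx."""
    def hook(module, inputs, output):
        hidden = output[0]
        # last position = last prompt token during prefill, newest token during decoding
        hidden[:, -1, :] += alpha * v.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:]

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        out = model.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0, prompt_ids.shape[1]:])
```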

Evaluating the effectiveness of the steering vectors

To evaluate the effectiveness of the steering vectors, I constructed three different steering tasks based on partial thoughts ending at different places in the computation.

Task 1 “first occurrence”: Using a distinct set of generated examples, I create a steering task by taking R1's responses up until the end of the sentence containing the first occurrence of the correct solution and steering by adding the steering vector of the respective layer with a multiplier of 1.5 (from poking around, I felt that this is more effective than 1.0) at the last token and at all generated tokens. I consider the steering to be successful when the intervened generation results in the thought process being concluded prematurely, compared to the default behavior.

Task 2 “middle occurrence”: I do the same thing, but instead of intervening and generating from the end of the sentence containing the first occurrence of the solution, I take the one in the middle, i.e., if the correct solution occurred 9 times I take the 5th occurrence (the one indexed by floor(9/2)=4), and so on. This time I use a steering factor of 1.0. Again, I consider the steering to be successful when the intervened generation results in the thought process being concluded prematurely, compared to the default behavior.

Task 3 “last occurrence”: In this setting I start the intervention and generation from the end of the sentence containing the last occurrence of the correct response and steer by subtracting (steering factor of -1.0). In this last setting I consider the steering successful if the model continues double-checking its result instead of its default behavior of wrapping up its thoughts.
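In code, given the list of sentence-end indices at which the correct answer occurs (from the dataset-construction step above), the three tasks simply pick different elements of it (a sketch; the helper is hypothetical):

```python
def task_split_points(occurrence_ends: list[int]) -> dict[str, int]:
    """Index at which each task's partial thought ends."""
    n = len(occurrence_ends)
    return {
        "first": occurrence_ends[0],
        "middle": occurrence_ends[n // 2],  # e.g. 9 occurrences -> index floor(9/2)=4, the 5th
        "last": occurrence_ends[-1],
    }
```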

For this initial exploration I only generated 50 new tokens using greedy sampling (while intervening in the way described) for each of the three described settings (comprising 64 partial responses each).

Measuring success: To quantify the success of my interventions, I set up a zero-shot classifier based on BART-based sequence similarities. This is done by comparing the BART embedding of each string corresponding to the 50 tokens generated during my interventions with the BART embeddings of two strings representing the investigated behaviors: (1.) reconsideration, “Wait. Let me double check.”, and (2.) wrapping up, “I am confident. Final Answer”.
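A sketch of such a classifier, assuming mean-pooled facebook/bart-large encoder embeddings and a softmax over cosine similarities to the two reference strings; the actual implementation may differ in these details (pooling, checkpoint, temperature):

```python
import torch
from transformers import AutoModel, AutoTokenizer

bart_tok = AutoTokenizer.from_pretrained("facebook/bart-large")
bart = AutoModel.from_pretrained("facebook/bart-large")

REFS = ["Wait. Let me double check.", "I am confident. Final Answer"]

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    ids = bart_tok(text, return_tensors="pt", truncation=True)
    hidden = bart.encoder(**ids).last_hidden_state  # (1, seq_len, d)
    return hidden.mean(dim=1).squeeze(0)            # mean-pooled sequence embedding

ref_embs = torch.stack([embed(r) for r in REFS])

@torch.no_grad()
def p_reconsider(generation: str) -> float:
    e = embed(generation)
    sims = torch.nn.functional.cosine_similarity(ref_embs, e.unsqueeze(0), dim=-1)
    probs = torch.softmax(sims / 0.1, dim=0)  # the temperature here is a free choice
    return probs[0].item()                    # index 0 = reconsideration class
```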

This results in the following plot:
Image
The dashed grey line is the class probability of the reconsideration class assigned to the default behavior of the model (the 50 tokens generated under the default setting without intervention). Each column corresponds to one of the tasks. In the first two tasks the steering is successful when p(Reconsider) drops below 0.5, and in the third task it is successful when p(Reconsider) rises above 0.5. Thus, in all three settings, steering at any one of the layers from 11 to 16 (1-indexed here and 0-indexed in the plot) seems to result in a successful flip of the behavior.

Based on these plots it seems like our simple steering method works surprisingly well. The right-hand side of the rightmost plot (layers 26-32) is a bit suspicious, since steering at such late layers of the model should (based on past work) be less effective. Manual inspection of these generations confirms this suspicion, revealing degenerate responses of the form “So, she sells 9 eggs, making $18. So, yeah, that's what she does. So, her revenue is $18 per day.\n\nWait, hold on. Wait, hold on. Wait, hold on. Wait, hold”.

To convince myself that this degenerate behavior does not happen in the middle layers, I computed redundancy scores based on 2-gram and 3-gram repetition counts:
Image

and based on compression ratios (dividing the length of the zlib-compressed byte string corresponding to the generation by its uncompressed length):
Image
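For reference, sketches of the two scores (using whitespace-split words as an approximation of token n-grams):

```python
import zlib
from collections import Counter

def ngram_repetition(text: str, n: int) -> float:
    """Fraction of n-grams that are repeats of an earlier n-gram."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(grams)

def compression_ratio(text: str) -> float:
    """Length of the zlib-compressed byte string divided by its uncompressed length."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)
```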

Qualitative examples

Task 1 “first”: Again, the prompt plus the thought process (even only up until the first occurrence) is too long to fit into the screenshot (but I will release a Colab, so don't worry!): green is the part of the partial thought from which we start the generation, and bold white are the tokens newly generated while intervening on the model.
Image
Baseline without intervention:
Image
Note that these responses are cut off because I only generate 50 tokens (and, for the baseline, consider a continuation of the same character length as in the intervened example).

Task 2 "middle":
Image
Baseline without intervention:
Image

Task 3 "last":
Image
Baseline without intervention:
Image

So to conclude this part of the analysis, the steering seems to work – at least in distribution.

Materials: I uploaded the dataset of “intervened” generations used to make these plots here, and am currently creating ones with continuations longer than 50 tokens each and also on other datasets. I also uploaded the steering vectors here. I will provide a Colab for reproducing all steps up to here on Monday.

Out-of-distribution generalization (work in progress & help appreciated)

Together with Harsh Raj, I set up another task: randomly sampled instances of the countdown game. Again, we generate R1-Llama-8B responses for this task, but this time we steer at evenly spread-out sentences during the response (and, for now, only towards wrapping up). A sketch of how such instances might be sampled is shown below; the results look good so far.
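As a rough idea of what these instances look like, here is a hedged sketch of a countdown-instance sampler; the exact number ranges, allowed operations, and prompt wording we used may differ:

```python
import random

def sample_countdown_instance(n_numbers: int = 4, low: int = 1, high: int = 99):
    """Sample source numbers and a reachable target, then format a prompt."""
    while True:
        nums = [random.randint(low, high) for _ in range(n_numbers)]
        target = nums[0]
        for x in nums[1:]:
            op = random.choice(["+", "-", "*"])
            target = target + x if op == "+" else target - x if op == "-" else target * x
        if 0 < target <= 999:  # keep targets in a reasonable range
            break
    prompt = (
        f"Using the numbers {nums}, create an equation that equals {target}. "
        "You may use +, -, *, / and each number at most once."
    )
    return nums, target, prompt
```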

Steering right at the beginning of the thought process does not seem to work reliably (sometimes R1 concludes much earlier than it would otherwise, but based on looking at a few examples this is rare):
Image

When steering after 1/5th of the thought process's tokens have been generated, the intervention starts to work, so at least the steering vectors don't seem to be specific to, or overfit to, GSM8K.

Cherrypicked example 1 (steered at layer 13):
Image
Baseline without intervention:
Image

Cherrypicked example 2 (steered at layer 13):
Image
Baseline without intervention:
Image

Cherrypicked example 3 (steered at layer 13):
Image
Baseline without intervention:
Image

To pick the above 3 cherry-picked examples, I had to look at the first 5 problem instances. Maybe the effectiveness of the steering can be improved by setting the steering coefficient to a higher value. Still, my takeaway is that the fact that this simple steering works at all on examples from a different problem class is promising.
