# Looking at the FSRL steering vector

In [1]:
%load_ext autoreload
%autoreload 2

from fsrl.utils import SAEfeatureAnalyzer
from fsrl import SAEAdapter, HookedModel
from dotenv import load_dotenv
import torch
from transformer_lens import HookedTransformer

load_dotenv()

True

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

release = "gpt2-small-res-jb"
sae_id = "blocks.7.hook_resid_pre"

adapter_kwargs = {
    "use_lora_adapter": True,
    "lora_rank": 64,
    "lora_alpha": 32,
    "fusion_mode": "additive",
}

sae, cfg_dict, sparsity = SAEAdapter.from_pretrained(release, sae_id, device=device, **adapter_kwargs)
model = HookedTransformer.from_pretrained("gpt2-small", device=device)
sae_model = HookedModel(model, sae)

This SAE has non-empty model_from_pretrained_kwargs. 
For optimal performance, load the model like so:
model = HookedSAETransformer.from_pretrained_no_processing(..., **cfg.model_from_pretrained_kwargs)


Loaded pretrained model gpt2-small into HookedTransformer


In [3]:
sae_analyzer = SAEfeatureAnalyzer(sae_model)

Fetching all explanations for gpt2-small/7-res-jb...
Successfully loaded 24570 feature explanations.


In [9]:
# Placeholder steering vector: one random value per SAE feature (d_sae entries)
steering_vec = torch.randn(sae_analyzer.sae.cfg.d_sae)
# Returns a DataFrame of steered features and a figure (shown in the next cell)
df, viz = sae_analyzer.get_steered_features_info(steering_vec, threshold=0.01)
display(df.head())

| | feature_idx | steering_value | description | modelId | layer | index | explanationModelName | typeName |
|---|---|---|---|---|---|---|---|---|
| 0 | 16671 | 4.12656 | code syntax related to function definitions | gpt2-small | 7-res-jb | 16671 | gpt-3.5-turbo | oai_token-act-pair |
| 1 | 11225 | 3.997292 | places, locations, and solutions | gpt2-small | 7-res-jb | 11225 | gpt-3.5-turbo | oai_token-act-pair |
| 2 | 10575 | -3.96167 | technical terms related to computer systems an... | gpt2-small | 7-res-jb | 10575 | gpt-3.5-turbo | oai_token-act-pair |
| 3 | 3912 | 3.879625 | conditional statements and loop structures in ... | gpt2-small | 7-res-jb | 3912 | gpt-3.5-turbo | oai_token-act-pair |
| 4 | 10916 | 3.854745 | proper nouns that start with the letter P foll... | gpt2-small | 7-res-jb | 10916 | gpt-3.5-turbo | oai_token-act-pair |
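
The rows above appear to be ordered by the magnitude of `steering_value`, with small values filtered out by `threshold`. The following is a minimal sketch of that kind of selection, assuming this is roughly what `get_steered_features_info` does; the actual implementation may differ. The helper `top_steered_features` is hypothetical and not part of `fsrl`, and a toy vector stands in for the real steering vector.

```python
import numpy as np

def top_steered_features(steering_vec, threshold=0.01, k=5):
    """Hypothetical helper (not part of fsrl): rank features by |steering value|."""
    values = np.asarray(steering_vec, dtype=float)
    # Keep only features whose steering magnitude exceeds the threshold.
    kept = np.flatnonzero(np.abs(values) > threshold)
    # Order the kept features from strongest to weakest magnitude.
    order = kept[np.argsort(-np.abs(values[kept]), kind="stable")]
    return [(int(i), float(values[i])) for i in order[:k]]

# Toy 6-dimensional vector standing in for a d_sae-sized steering vector.
toy_vec = [0.5, -2.0, 0.005, 1.5, -0.001, -0.7]
ranked = top_steered_features(toy_vec, threshold=0.01, k=3)
print(ranked)
```

Sorting by absolute value keeps strongly suppressed features (large negative values) next to strongly activated ones, which matches the mix of signs in the table.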


In [10]:
viz.show()

How to interpret this graph:

* **Peaks in the histogram/KDE:** steering values shared by many features; the tallest peak marks the most common steering strength.
* **Spread of values:** a wide spread means features are steered by very different amounts; a narrow spread means most features are steered by similar amounts.
* **Positive vs. negative colors:** positive values mean the steering vector activates a feature; negative values mean it suppresses the feature.
* **Rug ticks:** each tick marks the exact steering value of one feature; ticks far from the center are outliers with unusually strong steering.
* **Symmetry/skewness:** shows whether steering leans positive or negative overall.
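
The visual cues listed above have simple numeric counterparts. Below is a minimal sketch that computes them for a toy vector; `summarize_steering` is a hypothetical helper rather than part of `fsrl`, and the 3-standard-deviation outlier cutoff is an arbitrary assumption.

```python
import numpy as np

def summarize_steering(values, outlier_z=3.0):
    """Hypothetical helper: numeric counterparts to the plot's visual cues."""
    v = np.asarray(values, dtype=float)
    mean, std = v.mean(), v.std()
    # Standardized third moment: > 0 means a longer positive tail.
    skew = float(np.mean(((v - mean) / std) ** 3)) if std > 0 else 0.0
    return {
        "n_positive": int((v > 0).sum()),   # features pushed up
        "n_negative": int((v < 0).sum()),   # features pushed down
        "std": float(std),                  # spread of steering values
        "skewness": skew,
        # Features more than `outlier_z` standard deviations from the mean.
        "outlier_idx": np.flatnonzero(np.abs(v - mean) > outlier_z * std).tolist(),
    }

rng = np.random.default_rng(0)
toy_values = rng.standard_normal(1000)  # stand-in for a steering vector
toy_values[42] = 8.0                    # plant one unusually strong feature
summary = summarize_steering(toy_values)
print(summary)
```

On a real steering vector, the indices in `outlier_idx` would be the natural first candidates to look up in the explanations table.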

`distribution` may not be quite the right term here: the steering vector is not necessarily a probability vector, and the plot only shows how its values are spread across features.
