This repo contains utilities and scripts for analyzing attention and hidden-state behavior in large language models and for generating paper figures and tables.
- Python 3.10+
- torch, transformers, numpy, pandas, pyarrow, matplotlib, tqdm, scikit-learn
- Install dependencies:

  ```bash
  python -m pip install -r requirements.txt
  ```
- Store dataset paths in `privacy.json` (key: `default_data`); see the example below.
- Load it into the environment with `python -c "import load_privacy"`, or rely on `run.py`, which imports `load_privacy`. `load_privacy.py` also switches the working directory to `output_dir` from `privacy.json`.
- The `data.py` demo expects `DEFAULT_DATA` (uppercase) in the environment.
- Edit `run.py` to enable the figure or table block you want, then run `python run.py`.
- Each figure or table function can also be called directly; see the `fig*.py` and `tab*.py` signatures.
Example:
```python
import os

import torch

from fig1 import fig1_1
from model import LlamaWrapper

model_info = {
    "model_name": "Llama-3.2-1B-Instruct",
    "model_path": "/path/to/model",  # local checkpoint directory
    "device": "auto",
    "dtype": torch.float16,
    "wrapper": "llama",              # selects LlamaWrapper from model.py
    "exp2_layer_idx": 1.5,           # half-integer indices refer to pre-MLP states
}

fig1_1(
    model_info,
    data_path=os.environ["default_data"],
    batch_size=64,
    batch_cnt=16,
    max_seq_len=64,
)
```

- `fig1_1`: Per-layer attention heatmaps with head averages.
- `fig1_2`: Layer-wise L2 norms for hidden and pre-MLP states (half-integer layers).
- `fig3_1`: Per-layer attention heatmaps using head-wise max aggregation.
- `fig4_1`: MLP intermediate visualization with PCA and t-SNE (first token vs. others).
- `fig5_1`: Hidden and pre-MLP state visualization with PCA (first token vs. others).
- `fig6_1`: Per-layer, per-head attention heatmaps.
- `fig6_2`: First-layer MLP intermediate head-ablation visualization.
- `fig7_1`: Cosine similarity to the position-0 mean across layers and positions.
- `tab1`: Loss vs. repeating the first token (with and without BOS).
- `tab2`: Repeated-token n-gram ratios (n = 2, 3, 4).
- `tab3`: L2 norm of first-layer attention outputs by position.
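Table functions follow the same calling pattern. A sketch, assuming `tab1` lives in `tab1.py` and accepts the same `model_info`/`data_path`/batch arguments as the figure functions (check the actual signature in `tab1.py`):

```python
import os

import torch

from tab1 import tab1  # assumed module/function layout

model_info = {
    "model_name": "Llama-3.2-1B-Instruct",
    "model_path": "/path/to/model",
    "device": "auto",
    "dtype": torch.float16,
    "wrapper": "llama",
}

# Hypothetical keyword names, mirroring the fig1_1 example above.
tab1(model_info, data_path=os.environ["default_data"], batch_size=64, batch_cnt=16)
```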
`model.py` provides `LMWrapper` utilities and model-specific wrappers: `LlamaWrapper`, `NeoxWrapper`, `QwenWrapper`, `OptWrapper`, `InternLM3Wrapper`, `Olmo3Wrapper`. Common methods include:

- `get_layer_activations`, `get_layer_qkv`, `get_layer_attn_scores`
- `get_attn_output`, `get_pre_mlp_hidden_states`
- `get_mlp_intermediate_states`, `get_first_mlp_intermediate_states`
- `*_apply_pos_bias` variants where applicable
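A hypothetical sketch of direct wrapper use. Only the method names come from the list above; the constructor arguments, tokenizer access, and method signatures below are assumptions for illustration (see `model.py` for the real interface):

```python
import torch

from model import LlamaWrapper

# Hypothetical constructor arguments; model.py defines the real interface.
wrapper = LlamaWrapper(model_path="/path/to/model", device="auto", dtype=torch.float16)

# Hypothetical tokenizer attribute and method signatures.
input_ids = wrapper.tokenizer("The first token attracts attention.", return_tensors="pt").input_ids

attn = wrapper.get_layer_attn_scores(input_ids, layer_idx=0)         # per-head attention scores
pre_mlp = wrapper.get_pre_mlp_hidden_states(input_ids, layer_idx=1)  # states before the MLP block
```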
- Qwen models do not use BOS tokens; `rm_bos` is ignored there.
- Outputs are written under per-figure directories relative to the working directory.