# SAE Refusal Explore

This notebook aims to reproduce the findings of "Base Models Refuse Too" for the Pythia models.

## Setup & Libraries

Install the necessary libraries once, then comment out the installation cells.

In [2]:
%pip install transformers torch pandas numpy scikit-learn matplotlib seaborn tqdm ipywidgets sae-lens transformer-lens jaxtyping einops colorama accelerate "bitsandbytes>0.37.0" --quiet

Note: you may need to restart the kernel to use updated packages.


External libraries:

In [1]:
# Standard library
import functools
import io
import json
import os
import re
import textwrap
from typing import Callable, List

# Third-party
import einops
import numpy as np
import pandas as pd
import requests
import torch
import transformer_lens
from colorama import Fore, Style
from datasets import load_dataset
from jaxtyping import Float, Int
from sae_lens import SAE
from sklearn.model_selection import train_test_split
from torch import Tensor
from tqdm import tqdm
from transformer_lens import HookedTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTNeoXForCausalLM

Import our own utility functions:

In [2]:
from data_tools.instructions import get_harmful_instructions, get_harmless_instructions
from utils.templates import PYTHIA_TEMPLATE
from utils.generation import ( 
    get_generations, format_instruction, tokenize_instructions, 
    act_add_hook, direction_ablation_hook,
    get_refusal_direction_hooks
)
from utils.refusal import (
    get_refusal_scores, extract_refusal_direction,    
)

## Settings

In [3]:
results = {
    "pythia-410m": {
        "base_model": {},
        "instruct_model": {},
        "hooked_base_model": {},
        "hooked_instruct_model": {}
    }
}

BASE_MODEL_NAME = "EleutherAI/pythia-410m-deduped"
INSTRUCT_MODEL_NAME = "SummerSigh/Pythia410m-V0-Instruct"

STEERING_COEFF = 1.2

## Experiments

We start by loading the data and the models.

In [4]:
harmless_inst_train, harmless_inst_test = get_harmless_instructions()
harmful_inst_train, harmful_inst_test = get_harmful_instructions()

### Base Model

In [5]:
base_model = HookedTransformer.from_pretrained(
    BASE_MODEL_NAME,
    default_padding_side='left',
)
base_model.tokenizer.padding_side = 'left'
base_model.tokenizer.add_special_tokens({'pad_token': '<|padding|>'})

base_model_layer = 23

Loaded pretrained model EleutherAI/pythia-410m-deduped into HookedTransformer


Set up the tokenization function for the base model:

In [6]:
base_model_tokenize_instructions_fn = lambda instructions: tokenize_instructions(
    tokenizer=base_model.tokenizer,
    instructions=instructions,
    template=PYTHIA_TEMPLATE
)
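Left padding matters for decoder-only models like Pythia: with it, position `-1` is the last real token of every prompt in a batch, which is where generation continues from. The following is a minimal sketch of that idea in plain Python. `TOY_TEMPLATE`, `PAD_ID`, `toy_format`, and `left_pad` are hypothetical stand-ins for illustration; the real template is `PYTHIA_TEMPLATE` and the real logic lives in `tokenize_instructions`, neither of which is shown here.

```python
# Hypothetical stand-ins: the real template lives in utils.templates.PYTHIA_TEMPLATE.
TOY_TEMPLATE = "Q: {instruction}\nA:"
PAD_ID = 1  # assumed pad id, for illustration only


def toy_format(instruction: str) -> str:
    """Wrap a raw instruction in the (hypothetical) prompt template."""
    return TOY_TEMPLATE.format(instruction=instruction)


def left_pad(batch: list[list[int]], pad_id: int = PAD_ID) -> list[list[int]]:
    """Pad every sequence on the left so all rows end on a real token."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]


# Toy token ids; a real tokenizer would produce these from the formatted prompts.
padded = left_pad([[5, 6, 7], [8, 9]])
last_tokens = [row[-1] for row in padded]  # last real token of each prompt
```

With right padding, the shorter prompt would instead end on a pad token, and reading position `-1` would pick up padding rather than the prompt's final token.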

### Instruct Model

Again, we load the model and set up the respective utility functions. As there is no `HookedTransformer` implementation for the instruct model, we load the HF model directly, pass it along via `hf_model`, and only specify the architecture in the `from_pretrained` call.

In [7]:
instruct_model_hf = AutoModelForCausalLM.from_pretrained(INSTRUCT_MODEL_NAME)

instruct_model = HookedTransformer.from_pretrained(
    BASE_MODEL_NAME,  # architecture only; weights come from hf_model
    hf_model=instruct_model_hf,
    default_padding_side='left',
)

instruct_tokenizer = AutoTokenizer.from_pretrained(INSTRUCT_MODEL_NAME)
instruct_tokenizer.padding_side = 'left'
instruct_tokenizer.pad_token = instruct_tokenizer.eos_token

# chat_model.tokenizer.add_special_tokens({'pad_token': '<|padding|>'})

instruct_model_layer = 23

Loaded pretrained model EleutherAI/pythia-410m-deduped into HookedTransformer


In [8]:
instruct_model_tokenize_instructions_fn = lambda instructions: tokenize_instructions(
    tokenizer=instruct_tokenizer,
    instructions=instructions,
    template=PYTHIA_TEMPLATE
)

### Refusal Direction

#### Base

Now we extract the refusal direction from both models, following the SAE Microsoft paper. We start with the base model.
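The implementation of `extract_refusal_direction` is not shown in this notebook. A common way to obtain such a direction is the difference of means between activations on harmful and harmless prompts, so the sketch below assumes that approach. The function name `difference_of_means_direction` and the synthetic activations are hypothetical; in practice the inputs would be residual-stream activations cached at the chosen layer and token position.

```python
import torch


def difference_of_means_direction(
    harmful_acts: torch.Tensor,   # shape: (n_harmful, d_model)
    harmless_acts: torch.Tensor,  # shape: (n_harmless, d_model)
) -> torch.Tensor:
    """Return a unit-norm candidate refusal direction of shape (d_model,)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


# Synthetic stand-in data: "harmful" activations are shifted along dimension 3.
torch.manual_seed(0)
d_model = 16
offset = torch.zeros(d_model)
offset[3] = 5.0
harmless = torch.randn(200, d_model)
harmful = torch.randn(200, d_model) + offset

refusal_dir = difference_of_means_direction(harmful, harmless)
```

On this toy data the recovered direction points almost entirely along the shifted dimension, which is the behavior one would hope for from real activations if refusal is encoded roughly linearly.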

#### Instruct

Next, we can do the same for the instruct model.
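Once a direction is available for either model, the imported `direction_ablation_hook` and `act_add_hook` helpers presumably use it to intervene on the residual stream. Their implementations are not shown here, so the following is only a sketch of the two underlying operations under that assumption. The function names `ablate_direction` and `add_direction` are hypothetical, and the coefficient `1.2` mirrors `STEERING_COEFF` from the settings.

```python
import torch


def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along `direction`.

    resid: (batch, pos, d_model); direction: (d_model,)
    """
    unit = direction / direction.norm()
    # Scalar projection per position, broadcast back onto the unit vector.
    proj = (resid @ unit).unsqueeze(-1) * unit
    return resid - proj


def add_direction(
    resid: torch.Tensor, direction: torch.Tensor, coeff: float
) -> torch.Tensor:
    """Shift every activation by `coeff` times the unit direction."""
    unit = direction / direction.norm()
    return resid + coeff * unit


torch.manual_seed(0)
resid = torch.randn(2, 3, 8)   # toy residual stream
direction = torch.randn(8)     # toy direction
unit = direction / direction.norm()

ablated = ablate_direction(resid, direction)
steered = add_direction(resid, direction, coeff=1.2)
```

In a TransformerLens hook, functions like these would be applied to the activation tensor passed to the hook and their result returned, so that later layers see the modified residual stream.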