# Probe SSR

This notebook briefly walks through the general pipeline for crafting adversarial suffixes with Probe SSR.

Requirements: 
- Model information in `models.toml`
- Probes configuration in `probes_config.json`

To generate a large number of jailbreaks, I recommend using a Judge, since verifying 162 (buffer_size) attacks per minute by hand can be _slightly_ difficult, especially when attacking Gemma2, which always answers with long sentences. If necessary, you can reduce the buffer size to 1; the attack should be powerful enough to work anyway.

In [None]:
import time

from ssr.files import log_jsonl
from ssr.datasets import load_dataset
from ssr.lens import Lens
from ssr.ssr_probes import ProbeSSR, ProbeSSRConfig

MODEL_NAME = "llama3.2_1b"
ssr_config = ProbeSSRConfig(
    model_name=MODEL_NAME,                          # used to fetch the config from `probes_config.json`
    total_iterations=150,                           # max number of iterations
    early_stop_loss=0.05,                           # stop if loss < early_stop_loss
    replace_coefficient=1.3,                        # n_replace = (current_loss / init_loss) ^ (1 / replace_coefficient)
    buffer_size=32,                                 # number of active candidates cached in the buffer
    layers=[5, 8, 10, 14],                          # targeted layers
    alphas=[1, 1, 1, 1],                            # hyperparameters 
    system_message="You are a helpful assistant.",
    search_width=512,                               # at each step, try 512 candidates
    suffix_length=3,                                # suffix length of 3 tokens
    patience=15,                                    # if the loss hasn't decreased over the past 15 steps, discard the candidate with the lowest loss and pick another one; discarded candidates are stored in the archive_buffer
)


LOG_FILENAME = "reproduce_experiments/run_ssr/run_ssr_probes_output.jsonl"  # check the incredible jailbreaks!
MAX_SUCCESS = 10  # once 10 successes (Judge score >= 8) are found in the buffer, skip every other candidate

lens = Lens.from_config(MODEL_NAME)
ssr = ProbeSSR(lens.model, ssr_config)  # The probes will be initialized with the `mod` dataset, and using the configuration in `probes_config.json`.

hf = load_dataset("mini")[0]  # Load the harmful dataset to attack 

Loaded pretrained model meta-llama/Llama-3.2-1B-Instruct into HookedTransformer


4it [00:00,  6.24it/s]
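
The `replace_coefficient` comment in the configuration gives the rule `n_replace = (current_loss / init_loss) ^ (1 / replace_coefficient)`. The sketch below only evaluates that expression, to show how the value shrinks as the loss drops. The function names are hypothetical and not part of the `ssr` package. Converting the ratio into a token count by scaling it with `suffix_length` and clamping it to at least one token is an assumption made for illustration; the library may do this differently.

```python
def replace_ratio(current_loss: float, init_loss: float, replace_coefficient: float) -> float:
    """Evaluate (current_loss / init_loss) ** (1 / replace_coefficient)."""
    if init_loss <= 0:
        raise ValueError("init_loss must be positive")
    if current_loss < 0:
        raise ValueError("current_loss must be non-negative")
    return (current_loss / init_loss) ** (1.0 / replace_coefficient)


def n_replace_tokens(
    current_loss: float,
    init_loss: float,
    replace_coefficient: float,
    suffix_length: int,
) -> int:
    """Hypothetical mapping from the ratio to a number of suffix tokens.

    ASSUMPTION: the ratio is scaled by the suffix length and clamped to
    [1, suffix_length]. This is not taken from the ssr source code.
    """
    ratio = replace_ratio(current_loss, init_loss, replace_coefficient)
    return max(1, min(suffix_length, round(ratio * suffix_length)))


if __name__ == "__main__":
    # Values mirror the configuration above: replace_coefficient=1.3, suffix_length=3.
    for loss in (1.0, 0.5, 0.1, 0.05):
        ratio = replace_ratio(loss, 1.0, 1.3)
        tokens = n_replace_tokens(loss, 1.0, 1.3, 3)
        print(f"loss={loss:.2f} ratio={ratio:.3f} tokens={tokens}")
```

With this rule, a candidate whose loss is still close to the initial loss has most of its suffix resampled, while a candidate close to convergence only has a small part changed.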


In [None]:
from ssr.evaluation import call_lmstudio, harmful_bow, Attempt, Response

instruction = hf[0]

start = time.time()
ssr.init_prompt(instruction)            # Init the targeted instruction
ssr.init_buffers()                      # Init the buffers (you can reuse old buffers to continue the optimisation without starting from scratch)
ssr.generate()                          # Launch the optimisation (again, you can call generate multiple times without modifying the buffers or the input prompt)
duration = time.time() - start

adv_suffixes = lens.tokenizer.batch_decode(ssr.candidates_ids)              # suffixes in the main buffer (len(main) = buffer_size)
archive_suffixes = lens.tokenizer.batch_decode(ssr.archive_candidates_ids)  # suffixes in the archive buffer (len(archive) >= 0)

nb_success = 0

for suffix, loss in zip(
    adv_suffixes + archive_suffixes,
    ssr.candidates_losses.tolist() + ssr.archive_candidates_losses.tolist(),
):
    if nb_success < MAX_SUCCESS:
        response = call_lmstudio(
            MODEL_NAME,
            instruction + suffix,                           # Call LM Studio with instruction + adv suffix 
            system_message=ssr.config.system_message,       # Don't forget the system message
        )
        bow = harmful_bow(response)                         # Compute the harmful bag-of-words score to discard obvious failures ("I cannot...")

        if bow > 0:                                         
            log_jsonl(
                LOG_FILENAME,
                Attempt(
                    model_name=MODEL_NAME,
                    instruction=instruction,
                    suffix=suffix,
                    inital_loss=ssr.initial_loss,
                    final_loss=loss,
                    duration=int(duration),
                    config=ssr.config,
                    responses=[
                        Response(
                            model_name=MODEL_NAME,
                            response=response,
                            system_message=ssr.config.system_message,
                            bow=bow,
                            guard=None,                     # For this example I didn't call the Judge LLM and the Guard
                            judge=None,
                        )
                    ],
                ).model_dump(),
            )
            nb_success += 1                                 # No Judge in this example, so a response with bow > 0 counts as a success


  3%|▎         | 5/150 [00:11<05:21,  2.22s/it]


You can check the results of this run in `reproduce_experiments/run_ssr/run_ssr_probes_output.jsonl`.
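
A minimal sketch for inspecting that log file, assuming `log_jsonl` appends one JSON object per line and that each record carries the fields passed to `Attempt` above (`suffix`, `final_loss`, `responses`). The helpers `read_jsonl` and `summarize` are hypothetical and not part of the `ssr` package.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return (suffix, final_loss, number of responses) rows, lowest loss first.

    ASSUMPTION: the keys match the Attempt fields used in the cell above.
    Records without a final_loss are placed last.
    """
    rows = [
        (r.get("suffix"), r.get("final_loss"), len(r.get("responses", [])))
        for r in records
    ]
    return sorted(
        rows,
        key=lambda row: (row[1] is None, row[1] if row[1] is not None else 0.0),
    )


if __name__ == "__main__":
    log_path = Path("reproduce_experiments/run_ssr/run_ssr_probes_output.jsonl")
    if log_path.exists():
        for suffix, final_loss, n_responses in summarize(read_jsonl(log_path)):
            print(f"final_loss={final_loss} responses={n_responses} suffix={suffix!r}")
```

Sorting by `final_loss` puts the candidates the probes rated as most promising at the top, which is a convenient order for manual review when no Judge is used.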