**LLM steering via different methods** :

1. **Intervening on weights**, as with supervised finetuning, RLHF, steerable layers, and weight editing
(that is, targeted fine-tuning) (Ranzato et al. 2016; Ziegler et al. 2019; Dathathri et al. 2020; Meng
et al. 2023; Ilharco et al. 2023). However, naive RLHF, finetuning, and weight editing have known
side-effects on overall model performance (Hase et al. 2023; Qi et al. 2023; Brown et al. 2023).

2. **Intervening at decoding**, as with guided or trainable decoding (Gu et al. 2017; Grover et al. 2019;
see Zhang et al. 2022a for an overview of controlled generation and Jin et al. 2022 for textual style
transfer).

3. **Intervening on the prompt**, as with automated prompt engineering (Shin et al. 2020; Zhou et al. 2022).

4. **Intervening on token embeddings**, as with ‘soft prompting’ (Li & Liang 2021, **prefix tuning**; Lester et al. 2021, **parameter-efficient prompt tuning**;
Khashabi et al. 2022). **Li & Liang (2021)** add trainable vectors at every layer of the transformer: rather than modifying only the input embeddings, they prepend trainable prefixes to the "Key" and "Value" matrices inside every attention block (Layer 1, Layer 2, ... Layer 12). **Lester et al. (2021)** add trainable vectors only at the input layer, and that is the variant I experiment with below.
    

5. **Intervening on activations**, for instance by freezing the weights of the LLM and searching for a
"steering vector" of activations, e.g. using gradient descent (Subramani et al. 2022; Hernandez
et al. 2023). These optimized extraction methods, which search for a steering vector, differ from
extraction methods which directly compute it (present work and Li et al. 2023b). In our work, we
do not use gradient descent or other optimization methods.
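
To make the difference between the two soft-prompting variants in item 4 concrete, the sketch below counts trainable parameters for GPT-2 small (12 layers, hidden size 768). It assumes that prefix tuning stores one key vector and one value vector per virtual token per layer, and it leaves out the reparameterization network used during training in Li & Liang (2021). The function names are illustrative, not taken from either paper.

```python
def prompt_tuning_params(num_virtual_tokens: int, hidden_size: int) -> int:
    """Trainable vectors only at the input embedding layer (Lester et al. 2021)."""
    return num_virtual_tokens * hidden_size


def prefix_tuning_params(num_virtual_tokens: int, hidden_size: int, num_layers: int) -> int:
    """One key and one value vector per virtual token in every layer (Li & Liang 2021).

    Assumption: the training-time reparameterization network is not counted.
    """
    return num_virtual_tokens * hidden_size * num_layers * 2


# GPT-2 small: 12 layers, hidden size 768; 5 virtual tokens
n_prompt = prompt_tuning_params(5, 768)
n_prefix = prefix_tuning_params(5, 768, 12)
print(n_prompt, n_prefix, n_prefix // n_prompt)
```

Under these assumptions the input-only variant trains 24 times fewer parameters for the same number of virtual tokens, which is one reason it is the cheaper variant to experiment with.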



# INTERVENING ON TOKEN EMBEDDINGS WITH SOFT PROMPTING

In soft prompting, one inserts a trainable vector (a virtual token) into the input, freezes the model weights, and then runs a search (backpropagation with gradient descent) to optimize that vector until the model consistently outputs "Love" instead of "Hate".

Soft prompting intervenes at the embedding-lookup stage, where each input token is swapped for its embedding vector. Instead of using only the fixed vectors that correspond to real tokens (like "Translate" or "Summarize"), you create new, tunable vectors and insert them into the input sequence.

1. Virtual Tokens: These new vectors are often called "soft prompts" or "virtual tokens." They act like words to the model, but they don't correspond to any actual word in the vocabulary.

2. Continuous vs. Discrete: Standard prompting is "discrete" (a token is either present or not). Soft prompting relaxes this by making the vectors continuous variables that can be adjusted by gradient descent.

How is it trained? You freeze the model weights and use backpropagation: you run data through the model and measure the error, but instead of updating the model to reduce the error, you update only the soft prompt vectors.
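
The training recipe above can be sketched on a toy model before moving to GPT-2. Everything here is an assumption made for illustration: the "language model" is just an embedding table, mean pooling, and a linear head, and the sizes, ids, and learning rate are arbitrary. What it is meant to show is only the mechanics: the model is frozen, the soft prompt is the single trainable tensor, and the loss still goes down.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim, n_soft = 20, 8, 3

# Toy stand-in for a frozen language model: embedding -> mean pool -> linear head
embed = nn.Embedding(vocab_size, dim)
head = nn.Linear(dim, vocab_size)
for p in list(embed.parameters()) + list(head.parameters()):
    p.requires_grad = False
head_before = head.weight.clone()

# Trainable soft prompt: continuous vectors that are not tied to any vocabulary entry
soft_prompt = nn.Parameter(torch.randn(n_soft, dim) * 0.1)
optimizer = torch.optim.Adam([soft_prompt], lr=0.1)

input_ids = torch.tensor([[1, 2, 3]])   # a fixed discrete prompt
target = torch.tensor([7])              # the token the toy model should predict

losses = []
for step in range(100):
    token_embeds = embed(input_ids)                                        # [1, 3, dim]
    combined = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)  # [1, n_soft + 3, dim]
    logits = head(combined.mean(dim=1))                                    # [1, vocab_size]
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()      # gradients reach only soft_prompt
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After the loop, `head.weight` is unchanged and has no gradient, while `soft_prompt` has moved away from its initial value.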

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np

print("Loading model.....")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

for param in model.parameters():
    param.requires_grad = False

# Create the soft prompt: trainable virtual tokens.
# Any number of virtual tokens works; I am starting with 5.
num_soft_tokens = 5
embedding_dim = model.transformer.wte.weight.shape[1]

# Initialize the soft tokens from the embeddings of real tokens in the pretrained model,
# instead of starting with random noise
init_token_ids = tokenizer.encode("Love is kind and sweet", add_special_tokens=False)[:num_soft_tokens]
# .clone().detach() copies the embeddings out of the model's computation graph, so the soft
# tokens are an independent leaf tensor: training them never modifies model.transformer.wte
soft_token_tensor = model.transformer.wte(torch.tensor(init_token_ids)).clone().detach()
soft_tokens = nn.Parameter(soft_token_tensor, requires_grad=True)

optimizer = optim.Adam([soft_tokens], lr=0.001)

data_pairs = [
    ("I hate you", "I love you"),
    ("You are terrible", "You are wonderful"),
    ("This is the worst", "This is the best"),
    ("I am angry", "I am happy"),
    ("Go away", "Come here"),
]

print(f"Starting training on {len(data_pairs)} pairs...")
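# NOTE: with frozen weights, model.train() only keeps dropout active, which likely adds
# noise to the loss printed below; model.eval() would switch dropout off.
model.train()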
num_epochs = 500

for epoch in range(num_epochs):
    total_loss = 0.0
    for input_text, target_text in data_pairs:
        input_ids = tokenizer.encode(input_text, return_tensors="pt")

        input_embeds = model.transformer.wte(input_ids)
        # Prepend the soft tokens to the start of the input.
        # Resulting shape: [1, num_soft_tokens + input_len, 768]
        combined_embeds = torch.cat(
            [soft_tokens.unsqueeze(0), input_embeds],dim=1
        )

        target_ids = tokenizer.encode(target_text, return_tensors='pt')
        # In a real training loop, we'd align labels carefully
        # For this simple demo, we just check if the model predicts the first token of "target"
        # based on the last token of "input"
        outputs = model(inputs_embeds=combined_embeds)
        next_token_logits = outputs.logits[0, -1, :]

        target_token_id = target_ids[0, 0]
        loss = nn.CrossEntropyLoss()(next_token_logits.unsqueeze(0), target_token_id.unsqueeze(0))

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += loss.item()

    if (epoch+1)%10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} - Loss: {total_loss:.4f}")

print("training finished")
test_input = "I hate you"
input_ids = tokenizer.encode(test_input, return_tensors="pt")
inputs_embeds = model.transformer.wte(input_ids)

# With Soft Prompt
combined_embeds = torch.cat([soft_tokens.unsqueeze(0), inputs_embeds], dim=1)
output_ids = model.generate(inputs_embeds=combined_embeds, max_new_tokens=20)
print(f"Input: {test_input}")
print(f"Generated (with soft prompt): {tokenizer.decode(output_ids[0], skip_special_tokens=True)}")

Loading model.....
Starting training on 5 pairs...
Epoch 10/500 - Loss: 28.0557
Epoch 20/500 - Loss: 18.4757
Epoch 30/500 - Loss: 21.5718
Epoch 40/500 - Loss: 11.5822
Epoch 50/500 - Loss: 10.8998
Epoch 60/500 - Loss: 11.3319
Epoch 70/500 - Loss: 10.3483
Epoch 80/500 - Loss: 7.7164
Epoch 90/500 - Loss: 11.9195
Epoch 100/500 - Loss: 8.8279
Epoch 110/500 - Loss: 11.7682
Epoch 120/500 - Loss: 5.3652
Epoch 130/500 - Loss: 7.1953
Epoch 140/500 - Loss: 5.3573
Epoch 150/500 - Loss: 6.2203
Epoch 160/500 - Loss: 6.0887
Epoch 170/500 - Loss: 4.6656
Epoch 180/500 - Loss: 2.3220
Epoch 190/500 - Loss: 2.1286
Epoch 200/500 - Loss: 2.0248
Epoch 210/500 - Loss: 1.6091
Epoch 220/500 - Loss: 10.7010
Epoch 230/500 - Loss: 1.3873
Epoch 240/500 - Loss: 0.4178
Epoch 250/500 - Loss: 1.2959
Epoch 260/500 - Loss: 0.4419
Epoch 270/500 - Loss: 0.8415
Epoch 280/500 - Loss: 0.4090
Epoch 290/500 - Loss: 1.4106
Epoch 300/500 - Loss: 1.0093
Epoch 310/500 - Loss: 0.3949
Epoch 320/500 - Loss: 0.2396
Epoch 330/500 - Loss

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Epoch 500/500 - Loss: 0.8451
training finished
Input: I hate you
Generated (with soft prompt): I love youI hate youI hate youI hate youI hate youI hate youI hate
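
The generated text above matches the target only in its first token, which is consistent with the loss in the cell: it scores just the first target token. The code comment mentions that a real training loop would align labels carefully. A minimal sketch of one way to do that is shown here, assuming the model is fed `[soft prompt | input | target]` and only the target positions are scored. The helper `build_labels`, the toy vocabulary size, and the placeholder token ids are illustrative assumptions, and random logits stand in for the model output.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(num_soft_tokens, input_ids, target_ids):
    """Labels for [soft prompt | input | target]: only target positions are scored."""
    batch_size = input_ids.size(0)
    ignored = torch.full(
        (batch_size, num_soft_tokens + input_ids.size(1)), IGNORE_INDEX, dtype=torch.long
    )
    return torch.cat([ignored, target_ids], dim=1)


vocab_size = 32                                   # toy vocabulary, not GPT-2's
input_ids = torch.tensor([[11, 12, 13]])          # placeholder ids for the input text
target_ids = torch.tensor([[21, 22, 23]])         # placeholder ids for the target text
labels = build_labels(5, input_ids, target_ids)   # shape [1, 5 + 3 + 3]

# Stand-in for model(inputs_embeds=...).logits over the same 11 positions
torch.manual_seed(0)
logits = torch.randn(1, labels.size(1), vocab_size)

# Position t predicts token t + 1, so drop the last logit and the first label
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(labels.tolist(), loss.item())
```

With this layout every target token contributes to the loss, so the soft prompt is trained on the whole target phrase rather than on its first token alone.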


The training above used only 5 data pairs. The next step is to take a **sentiment analysis dataset** and filter it for positive examples.

I use the **IMDb Movie Reviews dataset**, which contains 50,000 reviews labeled as positive or negative.

By training our **soft prompt** on thousands of positive reviews, we are effectively teaching that vector to act as a **"positive filter"** for the model.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
import math

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

# Freeze Model Weights
for param in model.parameters():
    param.requires_grad = False

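# Preparing IMDB movie review dataset
print("Downloading IMDb dataset...")
try:
    dataset = load_dataset("imdb", split="train")
except Exception as e:
    print(f"Dataset download failed ({e}), using dummy data.")
    # Wrap the fallback in a Hugging Face Dataset so .filter() and .select() below still work
    from datasets import Dataset as HFDataset  # aliased: torch's Dataset is already imported
    dataset = HFDataset.from_list([{'text': "This movie was great", 'label': 1}] * 100)

print("Filtering for positive reviews...")
# Filter for positive reviews (label=1)
# Using up to 3,000 samples for better generalization from the dataset
positive_reviews = dataset.filter(lambda x: x['label'] == 1)
love_dataset = positive_reviews.select(range(min(3000, len(positive_reviews))))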

class PositiveReviewDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length=64):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Enforce strict length
            enc = tokenizer(txt, truncation=True, max_length=max_length, padding="max_length", return_tensors="pt")
            self.input_ids.append(enc['input_ids'][0])
            self.attn_masks.append(enc['attention_mask'][0])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

print("Tokenizing data (this might take a moment)...")
train_data = PositiveReviewDataset(love_dataset['text'], tokenizer)
BATCH_SIZE = 16
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)

# Creating trainable soft tokens
requested_tokens = 10
init_text = "The movie was absolutely wonderful and heartwarming because"
init_ids = tokenizer.encode(init_text, add_special_tokens=False)[:requested_tokens]

soft_prompt_tensor = model.transformer.wte(torch.tensor(init_ids).to(device)).clone().detach()
soft_prompt = nn.Parameter(soft_prompt_tensor, requires_grad=True)

# Optimizer and Scheduler
# Soft prompts are not as sensitive as model weights, so a higher learning rate is used.
# NOTE: OneCycleLR below overrides this lr; the schedule is driven by its max_lr.
learning_rate = 0.005
optimizer = optim.AdamW([soft_prompt], lr=learning_rate)

num_epochs = 10
total_steps = len(train_loader) * num_epochs

# OneCycleLR Scheduler: Starts low, goes high, then goes very low to converge
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    total_steps=total_steps,
    pct_start=0.3  # Spend 30% of time warming up
)

# training loop
print(f"Starting training on {len(train_data)} samples for {num_epochs} epochs...")
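# NOTE: with frozen weights, model.train() only keeps dropout active (also during the
# generation test at the end of this cell); model.eval() would switch dropout off.
model.train()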
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    total_loss = 0

    for batch_idx, (input_ids, attn_masks) in enumerate(train_loader):
        input_ids = input_ids.to(device)
        attn_masks = attn_masks.to(device)

        # A. Embed Real Input
        inputs_embeds = model.transformer.wte(input_ids)

        # B. Prepare Soft Prompt (Dynamic Size Check)
        real_soft_len = soft_prompt.shape[0]
        current_batch_size = input_ids.size(0)

        soft_prompt_batch = soft_prompt.unsqueeze(0).expand(current_batch_size, -1, -1)

        # C. Concatenate
        combined_embeds = torch.cat([soft_prompt_batch, inputs_embeds], dim=1)

        # D. Attention Mask
        soft_prompt_mask = torch.ones((current_batch_size, real_soft_len)).to(device)
        combined_mask = torch.cat([soft_prompt_mask, attn_masks], dim=1)

        # E. Create Labels
        labels = torch.full((current_batch_size, combined_embeds.size(1)), -100).to(device)
        labels[:, real_soft_len:] = input_ids
        labels[:, real_soft_len:][attn_masks == 0] = -100

        # F. Forward Pass
        outputs = model(inputs_embeds=combined_embeds, attention_mask=combined_mask)
        logits = outputs.logits

        # Shape Alignment
        min_len = min(logits.size(1), labels.size(1))
        logits = logits[:, :min_len, :]
        labels = labels[:, :min_len]

        # G. Shift for Next-Token Prediction
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        # H. Loss
        loss = criterion(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        loss.backward()
        optimizer.step()
        scheduler.step() # Update learning rate
        optimizer.zero_grad()

        total_loss += loss.item()

        if batch_idx % 50 == 0:
            current_lr = scheduler.get_last_lr()[0]
            print(f"Epoch {epoch+1} | Batch {batch_idx} | LR: {current_lr:.5f} | Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f"--- Epoch {epoch+1} Completed. Avg Loss: {avg_loss:.4f} ---")

print("Training complete!")

# Testing
print("\n--- TESTING STEERABILITY ---")
test_prompts = ["The food tasted", "I really think that", "The weather is"]

for prompt in test_prompts:
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    inputs_embeds = model.transformer.wte(input_ids)

    # Expand soft prompt
    soft_prompt_batch = soft_prompt.unsqueeze(0)
    combined_embeds = torch.cat([soft_prompt_batch, inputs_embeds], dim=1)

    # Generate
    output_ids = model.generate(
        inputs_embeds=combined_embeds,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True, # Add sampling for more natural text
        temperature=0.7
    )

    print(f"\nInput: {prompt}")
    print(f"Generated: {tokenizer.decode(output_ids[0], skip_special_tokens=True)}")

Using device: cuda
Downloading IMDb dataset...
Filtering for positive reviews...
Tokenizing data (this might take a moment)...
Starting training on 3000 samples for 10 epochs...
Epoch 1 | Batch 0 | LR: 0.00040 | Loss: 4.3122
Epoch 1 | Batch 50 | LR: 0.00060 | Loss: 4.1951
Epoch 1 | Batch 100 | LR: 0.00115 | Loss: 3.9580
Epoch 1 | Batch 150 | LR: 0.00202 | Loss: 3.9117
--- Epoch 1 Completed. Avg Loss: 4.1023 ---
Epoch 2 | Batch 0 | LR: 0.00283 | Loss: 4.1465
Epoch 2 | Batch 50 | LR: 0.00408 | Loss: 3.9980
Epoch 2 | Batch 100 | LR: 0.00542 | Loss: 3.7321
Epoch 2 | Batch 150 | LR: 0.00673 | Loss: 4.0362
--- Epoch 2 Completed. Avg Loss: 3.8991 ---
Epoch 3 | Batch 0 | LR: 0.00764 | Loss: 3.9084
Epoch 3 | Batch 50 | LR: 0.00869 | Loss: 3.9181
Epoch 3 | Batch 100 | LR: 0.00946 | Loss: 4.1108
Epoch 3 | Batch 150 | LR: 0.00991 | Loss: 3.8993
--- Epoch 3 Completed. Avg Loss: 3.8716 ---
Epoch 4 | Batch 0 | LR: 0.01000 | Loss: 3.8531
Epoch 4 | Batch 50 | LR: 0.00996 | Loss: 3.8847
Epoch 4 | Batch 
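
The learning rates in the log start at 0.00040 rather than at the 0.005 passed to `AdamW`. That is expected with `OneCycleLR`: with its default `div_factor=25`, the schedule starts at `max_lr / 25`, climbs to `max_lr`, then anneals to a much smaller value. The sketch below replays the same schedule on a dummy parameter, so it needs neither the model nor the dataset. The step count assumes 3,000 samples, batch size 16, and `drop_last=True`, as in the cell above.

```python
import torch

# Dummy parameter: only the schedule matters here, not the optimization itself
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=0.005)

total_steps = (3000 // 16) * 10   # 187 batches per epoch x 10 epochs
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, total_steps=total_steps, pct_start=0.3
)

lrs = []
for _ in range(total_steps):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()      # no gradients were computed, so nothing is updated
    scheduler.step()

print(f"start={lrs[0]:.5f} peak={max(lrs):.5f} end={lrs[-1]:.2e}")
```

The practical consequence is that tuning `learning_rate` has no effect in that cell; `max_lr` and `pct_start` are the knobs that shape the schedule.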

Testing the above trained model with different prompts.

In [None]:
def generate_steered(input_text, max_new_tokens=50, temperature=0.7):
    model.eval() # Set to evaluation mode

    # Encode input
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    inputs_embeds = model.transformer.wte(input_ids)

    # Expand soft prompt to match batch size (which is 1 here)
    # We assume 'soft_prompt' is available from your previous training cell
    soft_prompt_batch = soft_prompt.unsqueeze(0)

    # Concatenate: [Soft Prompt] + [User Input]
    combined_embeds = torch.cat([soft_prompt_batch, inputs_embeds], dim=1)

    # Generate
    with torch.no_grad():
        output_ids = model.generate(
            inputs_embeds=combined_embeds,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,      # Adds variety so it's not robotic
            temperature=temperature, # Controls creativity (0.7 is a sweet spot)
            top_k=50             # Limits to top 50 likely words
        )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# various styles of test prompt
test_inputs = [
    # Neutral starts
    "The restaurant was",
    "I went to the store and",

    # Negative starts (Let's see if the prompt fights the negativity!)
    "I usually hate it when",
    "My boss is always",

    # Random concepts
    "The meaning of life is",
    "Tomorrow will be"
]

print(f"--- TESTING STEERED MODEL (Temperature: 0.7) ---\n")

for text in test_inputs:
    result = generate_steered(text)
    print(f"Input:    {text}")
    print(f"Response: {result}\n" + "-"*40)

--- TESTING STEERED MODEL (Temperature: 0.7) ---

Input:    The restaurant was
Response:  awesome. I've been to all three of the restaurants and they are all fantastic. I've always loved the food. The food is not overwhelming and I have never had a bad meal with it.<br /><br />This is a small restaurant
----------------------------------------
Input:    I went to the store and
Response:  bought some of the original films which were released in the early '90s. I'm not sure if it was my first time seeing this film but it was one of those rare films which was well received by the people who knew what they were doing
----------------------------------------
Input:    I usually hate it when
Response:  I watch movies with bad reviews because it's just not there. For one thing, the film does not do enough good to deserve its review. The characters are mostly just a bunch of people who are not really different from each other. I know how
----------------------------------------
Input:    My bo
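
The `temperature` and `top_k` arguments passed to `model.generate` in `generate_steered` control how each next token is sampled. The function below is a simplified sketch of that idea on a single logits vector, not the library's implementation: `sample_top_k` is a hypothetical helper, and random logits stand in for real model output.

```python
import torch


def sample_top_k(logits: torch.Tensor, temperature: float = 0.7, top_k: int = 50) -> int:
    """Sample one token id from a 1-D logits vector with temperature and top-k filtering."""
    scaled = logits / temperature                      # below 1 sharpens, above 1 flattens
    k = min(top_k, scaled.size(-1))
    top_values, top_indices = torch.topk(scaled, k)    # keep only the k most likely tokens
    probs = torch.softmax(top_values, dim=-1)          # renormalize over the survivors
    choice = torch.multinomial(probs, num_samples=1)   # index into the top-k list
    return int(top_indices[choice].item())


torch.manual_seed(0)
fake_logits = torch.randn(1000)                        # stand-in for one row of model logits
allowed = set(torch.topk(fake_logits, 50).indices.tolist())
samples = [sample_top_k(fake_logits) for _ in range(200)]
print(all(s in allowed for s in samples))
```

Setting `top_k=1` reduces this to greedy decoding, which is the behavior of the first experiment, where `generate` was called without `do_sample=True`.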

# INTERVENING ON ACTIVATIONS

**Subramani et al. (2022): "The Sentence Reconstructor"**


1. **Goal**: They wanted to see whether a frozen model could be made to output one exact, specific sentence (e.g., "The weather is blue") just by injecting a vector.

2. **Method**: They freeze the model, create a random vector $z$, run the model, and compare the output to the target sentence.

3. **Gradient descent**: They calculate the error (loss) and use backpropagation to update the vector $z$ (not the model weights).

4. **Iteration**: They repeat this until $z$ reproduces the target sentence.

5. **Result**: They found "steering vectors" that act like a remote control, making the frozen model generate the chosen target sentence.
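
Steps 1 to 4 can be sketched on a toy model that needs no download. This is an illustration under stated assumptions, not the setup of Subramani et al.: `ToyLM` is a hypothetical stand-in whose only "context" is a running mean of past embeddings, the target token ids are arbitrary, and the steering vector is added to one hidden layer at every position through a PyTorch forward hook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim = 30, 16


class ToyLM(nn.Module):
    """Tiny causal stand-in for a language model (illustration only)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mix = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        e = self.embed(ids)                                       # [B, T, dim]
        steps = torch.arange(1, ids.size(1) + 1).view(1, -1, 1)
        context = e.cumsum(dim=1) / steps                         # running mean over past tokens
        return self.head(torch.tanh(self.mix(context)))           # [B, T, vocab]


model = ToyLM().eval()
for p in model.parameters():
    p.requires_grad = False                                       # step 2: freeze the model
weights_before = [p.clone() for p in model.parameters()]

z = nn.Parameter(torch.randn(1, 1, dim) * 0.01)                   # step 2: random steering vector
# Returning a value from a forward hook replaces the layer's output
handle = model.mix.register_forward_hook(lambda mod, inp, out: out + z)

target = torch.tensor([[3, 14, 15, 9, 2, 6]])                     # arbitrary target "sentence"
bos = torch.zeros(1, 1, dtype=torch.long)
inputs = torch.cat([bos, target[:, :-1]], dim=1)                  # teacher forcing

optimizer = torch.optim.Adam([z], lr=0.05)
losses = []
for step in range(300):                                           # steps 3 and 4
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), target.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

handle.remove()                                                   # the model is unsteered again
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A toy model this small may not reproduce the target exactly, so the thing to look at is the loss falling while every model weight stays untouched.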

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import matplotlib.pyplot as plt

# 1. SETUP
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

# Freeze the model: we are not training it, only searching for the steering vector
for param in model.parameters():
    param.requires_grad = False
model.eval() # Set to eval mode (dropout off, etc.)

# 2. CONFIGURATION
# The sentence we want the model to generate exactly
target_sentence = "The weather is blue and the sky is green"
target_ids = tokenizer.encode(target_sentence, return_tensors="pt").to(device)
print(f"Target sentence: '{target_sentence}'")
print(f"Target IDs: {target_ids}")
print(f"Target Length: {target_ids.shape[1]} tokens")

# 3. CREATE THE STEERING VECTOR, SHAPE [1, 1, 768] FOR GPT-2 SMALL
# It can be broadcast across the sequence length or added to the first token only.
# Subramani et al. often add it to all positions or specific layers.
embedding_dim = model.config.n_embd
steering_vector = nn.Parameter(torch.randn(1, 1, embedding_dim, device=device), requires_grad=True)

# Make the optimizer touch only the steering vector
optimizer = optim.Adam([steering_vector], lr=0.1)




Using device: cuda
Target sentence: 'The weather is blue and the sky is green'
Target IDs: tensor([[ 464, 6193,  318, 4171,  290,  262, 6766,  318, 4077]],
       device='cuda:0')
Target Length: 9 tokens


In [2]:
model.config

GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "dtype": "float32",
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.57.6",
  "use_cache": true,
  "vocab_size": 50257
}