# Soft Prompting Pipeline Demo

This notebook demonstrates how to use the soft prompting pipeline to discover behavioral differences between models.

In [1]:
import sys
from pathlib import Path

# Add the project root directory to Python path
project_root = str(Path().absolute().parent)
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.soft_prompting.core.experiment import ExperimentRunner



## Create Experiment Runner with Test Configuration

In [2]:
# Setup experiment directory
output_dir = Path("outputs/demo")
output_dir.mkdir(parents=True, exist_ok=True)

# Create experiment runner with test configuration
runner = ExperimentRunner.setup(
    experiment_name="intervention_comparison",
    output_dir=output_dir,
    test_mode=True,  # This will use SmolLM models automatically
)

Setting up experiment with name: intervention_comparison, output_dir: outputs/demo
Looking for config at: /Users/jacquesthibodeau/Desktop/Code/supervising-ais-improving-ais/src/soft_prompting/config/experiments/intervention_comparison.yaml
Loaded config: {'name': 'intervention_comparison', 'description': 'Compare model behavior before and after interventions', 'output_dir': 'outputs/intervention_study', 'model_pairs': [{'model_1': 'LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-normal', 'model_2': 'meta-llama/Llama-2-7b-chat-hf', 'intervention': 'unlearning'}, {'model_1': 'FelixHofstaetter/mistral-7b-sandbagging-new', 'model_2': 'mistralai/Mistral-7B-Instruct-v0.3', 'intervention': 'sandbagging'}], 'data': {'categories': 'all', 'max_texts_per_category': 2000}, 'training': {'num_epochs': 15, 'batch_size': 8, 'learning_rate': 0.0002, 'num_soft_prompt_tokens': 16}, 'generation': {'temperature': 0.9, 'num_generations_per_prompt': 20}, 'metrics': {'kl_divergence_weight': 2.0, 'semantic_div

## Run Pipeline

In [3]:
# Validate the models, then run the experiment
try:
    # First validate
    runner.model_manager.validate_models()

    # Then run if validation passes
    results = runner.run()
    print("Experiment completed successfully!")
except Exception as e:
    print(f"Error: {e}")

Model validation failed: 'str' object has no attribute 'parameters'



Running model validation checks...

=== Starting Experiment Run ===
Loading models...
Getting model pair with index: 0
Test mode: using models HuggingFaceTB/SmolLM-135M-Instruct and HuggingFaceTB/SmolLM-135M
Loading models: HuggingFaceTB/SmolLM-135M-Instruct and HuggingFaceTB/SmolLM-135M
Loading model 1: HuggingFaceTB/SmolLM-135M-Instruct
Loading model 2: HuggingFaceTB/SmolLM-135M
Loading tokenizer from: HuggingFaceTB/SmolLM-135M-Instruct
Creating dataloaders...
Creating dataloaders with config: DataConfig(train_path=None, eval_path=None, categories='anthropic-model-written-evals/advanced-ai-risk/human_generated_evals/power-seeking-inclination', max_texts_per_category=25, test_mode_texts=12, min_text_length=10, max_text_length=150, train_split=0.9, test_mode=True)
Using base path: /Users/jacquesthibodeau/Desktop/Code/supervising-ais-improving-ais/data/evals
Processing category: anthropic-model-written-evals/advanced-ai-risk/human_generated_evals/power-seeking-inclination
Loaded 3 text



Starting training...

=== Starting Training ===
Updating scheduler for actual total steps: 15
Training for 15 epochs
Steps per epoch: 1
Total steps: 15
Warmup steps: 1


Epoch 1/15:   0%|          | 0/1 [00:00<?, ?it/s]


Starting training step
Soft prompt requires_grad: True
Base embeddings requires_grad: False

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed52d0>

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed52d0>
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed52d0>


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Logits requires_grad: True
Logits grad_fn: None
Probabilities requires_grad: True
Probabilities grad_fn: <SoftmaxBackward0 object at 0x321ed5240>
KL div requires_grad: True
KL div grad_fn: <SumBackward1 object at 0x321ed5240>
Final loss requires_grad: True
Final loss grad_fn: <DivBackward0 object at 0x321ed5240>


Epoch 1/15: 100%|██████████| 1/1 [00:01<00:00,  1.57s/it, avg_kl_per_token=0.00479, kl_divergence=0.527, optimization_loss=-0.527]


Probs requires_grad: False
KL div requires_grad: False
Final KL div requires_grad: False
Final KL div grad_fn: None


Epoch 2/15:   0%|          | 0/1 [00:00<?, ?it/s]


Starting training step
Soft prompt requires_grad: True
Base embeddings requires_grad: False

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed5120>

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed5120>
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed5120>
Logits requires_grad: True
Logits grad_fn: None
Probabilities requires_grad: True
Probabilities grad_fn: <SoftmaxBackward0 object at 0x321ed5270>
KL div requires_grad: True
KL div grad_fn: <SumBackward1 object at 0x321ed5270>
Final loss requires_grad: True
Final loss

Epoch 2/15: 100%|██████████| 1/1 [00:01<00:00,  1.76s/it, avg_kl_per_token=0.00479, kl_divergence=0.527, optimization_loss=-0.527]


Probs requires_grad: False
KL div requires_grad: False
Final KL div requires_grad: False
Final KL div grad_fn: None


Epoch 3/15:   0%|          | 0/1 [00:00<?, ?it/s]


Starting training step
Soft prompt requires_grad: True
Base embeddings requires_grad: False

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed5270>

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed5270>
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed5270>
Logits requires_grad: True
Logits grad_fn: None
Probabilities requires_grad: True
Probabilities grad_fn: <SoftmaxBackward0 object at 0x321ed5570>
KL div requires_grad: True
KL div grad_fn: <SumBackward1 object at 0x321ed5570>
Final loss requires_grad: True
Final loss

Epoch 3/15: 100%|██████████| 1/1 [00:01<00:00,  1.31s/it, avg_kl_per_token=0.00479, kl_divergence=0.527, optimization_loss=-0.527]


Probs requires_grad: False
KL div requires_grad: False
Final KL div requires_grad: False
Final KL div grad_fn: None


Epoch 4/15:   0%|          | 0/1 [00:00<?, ?it/s]


Starting training step
Soft prompt requires_grad: True
Base embeddings requires_grad: False

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed54b0>

Soft Prompt Forward Pass:
Input embeddings shape: torch.Size([2, 496, 576])
Soft prompt embeddings requires_grad: True
Expanded soft prompt requires_grad: True
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed54b0>
Combined embeddings requires_grad: True
Combined embeddings grad_fn: <CatBackward0 object at 0x321ed54b0>
Logits requires_grad: True
Logits grad_fn: None
Probabilities requires_grad: True
Probabilities grad_fn: <SoftmaxBackward0 object at 0x321ed56c0>
KL div requires_grad: True
KL div grad_fn: <SumBackward1 object at 0x321ed56c0>
Final loss requires_grad: True
Final loss

Epoch 4/15: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it, avg_kl_per_token=0.00479, kl_divergence=0.527, optimization_loss=-0.527]


Probs requires_grad: False
KL div requires_grad: False
Final KL div requires_grad: False
Final KL div grad_fn: None

Early stopping triggered after 4 epochs
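
The log above shows `kl_divergence` fixed at 0.527 for all four epochs before early stopping fires. A patience-based rule would produce this behavior. The sketch below is a minimal illustration and not the pipeline's actual logic: the `EarlyStopper` class, `patience=3`, and `min_delta` are assumptions, chosen so that a flat metric stops training after epoch 4.

```python
class EarlyStopper:
    """Hypothetical patience-based early stopping for a metric to maximize."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # assumed value, not read from the config
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.best = float("-inf")
        self.stale_epochs = 0

    def should_stop(self, value):
        # Reset the counter on improvement, otherwise count a stale epoch.
        if value > self.best + self.min_delta:
            self.best = value
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def epochs_until_stop(values, patience=3):
    """Return the 1-based epoch at which stopping fires, or None."""
    stopper = EarlyStopper(patience=patience)
    for epoch, value in enumerate(values, start=1):
        if stopper.should_stop(value):
            return epoch
    return None


# KL divergence values copied from the progress bars above.
kl_per_epoch = [0.527, 0.527, 0.527, 0.527]
stopped_at = epochs_until_stop(kl_per_epoch)
print(f"Early stopping triggered after {stopped_at} epochs")
```

A metric that is identical across epochs, as it is here, is worth investigating separately, since it may mean the soft prompt updates are not changing the models' outputs.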

Preparing analysis...

Saving results to outputs/demo/results_pair_0.json

Testing JSON serialization...
Results successfully saved
Error: tuple indices must be integers or slices, not str
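
The `except` block in the cell above prints only the exception message, so the output does not show where `tuple indices must be integers or slices, not str` was raised. Python produces this message when a tuple is indexed with a string key, for example when code expects a dict but receives a tuple. The sketch below reproduces the error with a hypothetical `summarize` function (not part of the pipeline) and uses the standard `traceback` module to recover the failing frame.

```python
import traceback


def summarize(result):
    # Hypothetical helper: expects a dict, fails when handed a tuple.
    return result["dataset"]


def run_and_capture(result):
    """Return (value, traceback_text); traceback_text is None on success."""
    try:
        return summarize(result), None
    except Exception:
        # format_exc() includes the file, line, and function of the failure.
        return None, traceback.format_exc()


# A tuple where a dict was expected triggers the same TypeError message.
value, tb_text = run_and_capture(("dataset", "metrics"))
print(tb_text)
```

In the notebook cell, calling `traceback.print_exc()` inside the `except` block would surface the same information.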


## Analyze Results

In [None]:
from src.soft_prompting.analysis.divergence_analyzer import DivergenceAnalyzer

# Create the analyzer from the trainer's divergence metrics
analyzer = DivergenceAnalyzer(
    metrics=runner.trainer.divergence_metrics, output_dir=output_dir
)

In [None]:
# Generate analysis report
report = analyzer.generate_report(
    dataset=results["dataset"], output_file="analysis_report.json"
)

# Display key findings about early stopping
print("\nEarly Stopping Analysis:")
print(f"Total runs: {report['overall_stats']['early_stopping_stats']['total_runs']}")
print(
    f"Runs stopped early: {report['overall_stats']['early_stopping_stats']['early_stopped_runs']}"
)

if "training_analysis" in report["divergence_patterns"]:
    analysis = report["divergence_patterns"]["training_analysis"]
    print(f"\nBest divergence achieved: {analysis['best_divergence']:.4f}")
    print(f"Reached at epoch: {analysis['convergence_epoch']}")
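
Because the run above ended with an error, `results` may be undefined and the report may lack some keys when these cells execute. The sketch below shows one defensive way to read nested report fields. The `get_nested` helper and the sample `report` dict are illustrative assumptions; only the key names are taken from the cell above.

```python
def get_nested(mapping, keys, default=None):
    """Walk a chain of dict keys, returning default if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up report for illustration; the values are not real results.
report = {
    "overall_stats": {"early_stopping_stats": {"total_runs": 1}},
    "divergence_patterns": {},
}

stats_path = ["overall_stats", "early_stopping_stats"]
total_runs = get_nested(report, stats_path + ["total_runs"])
early_stopped = get_nested(report, stats_path + ["early_stopped_runs"], default="n/a")
best = get_nested(report, ["divergence_patterns", "training_analysis", "best_divergence"])

print(f"Total runs: {total_runs}")
print(f"Runs stopped early: {early_stopped}")
print(f"Best divergence: {best if best is not None else 'not available'}")
```

Returning a default instead of raising `KeyError` keeps the summary readable when a run stops partway through.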