# Combined Dynamic Alpha + Selective N-Tokens Steering: Hallucination Guardrail Evaluation
 
**Summary:**
This notebook evaluates a combined guardrail approach for Llama-3.1-8B that integrates both Dynamic Alpha (risk-proportional steering strength) and Selective N-Tokens Steering (applying the intervention only to the first N tokens). This method applies a dynamic, risk-scaled correction for high-risk prompts, but only during the initial generation steps, maximizing hallucination reduction while minimizing latency and preserving answer quality. This combined approach outperforms both individual ablations and is used for all further evaluations on other datasets.

- **Dynamic Alpha:** Steering strength (alpha) is scaled based on prompt risk, providing stronger correction for riskier prompts.
- **Selective N-Tokens:** Steering is applied only to the first 10 generated tokens, focusing intervention where it is most effective.

**Key Results (TruthfulQA Benchmark):**
- **Baseline Model:** Accuracy: 38.57%, Hallucination Rate: 61.43%, Avg Latency: 3.86s
- **Combined Guarded Model:** Accuracy: 52.04%, Hallucination Rate: 47.96%, Avg Latency: 3.56s
- **Relative Error Reduction:** 21.93%
- **Latency Increase:** -7.78% (latency decreased)

This combined guardrail achieves the best trade-off between hallucination reduction, accuracy, and latency, and is therefore used for all subsequent cross-domain evaluations.

### **Environment and Requirements Setup**
Setup for local execution on Lambda Labs A100 40GB GPU with Llama-3.1-8B.

In [None]:
# Install Libraries for Lambda Labs A100 Environment
import subprocess
import sys

def install_package(package):
    """Install a package using pip with proper error handling."""
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
        return True
    except subprocess.CalledProcessError as e:
        print(f"Error installing {package}: {e}")
        return False

print("Installing required packages for Llama-3.1-70B on A100 GPUs...")

# A100-optimized unsloth installation
install_package("unsloth[cu121-ampere-torch220]")

# Core ML libraries
packages = [
    "transformers",
    "accelerate", 
    "datasets",
    "requests",
    "pandas",
    "tqdm",
    "scikit-learn",
    "joblib"
]

for pkg in packages:
    install_package(pkg)

print("✓ Package installation complete")

# Verify environment
import torch
print(f"\nEnvironment verification:")
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

Archive:  /content/HallucinationVectorProject.zip
   creating: /content/HallucinationVectorProject/
   creating: /content/HallucinationVectorProject/artifacts/
  inflating: /content/HallucinationVectorProject/config.py  
  inflating: /content/HallucinationVectorProject/requirements.txt  
  inflating: /content/HallucinationVectorProject/train_risk_classifier.py  
   creating: /content/HallucinationVectorProject/results/
  inflating: /content/HallucinationVectorProject/evaluate_guardrail.py  
  inflating: /content/HallucinationVectorProject/utils.py  
  inflating: /content/HallucinationVectorProject/tune_guardrail_hyperparameters.py  
  inflating: /content/HallucinationVectorProject/build_hallucination_vector.py  
   creating: /content/HallucinationVectorProject/data/
   creating: /content/HallucinationVectorProject/plots/
  inflating: /content/HallucinationVectorProject/artifacts/risk_thresholds.joblib  
  inflating: /content/__MACOSX/HallucinationVectorProject/artifacts/._risk_threshol

### **Project Path Setup**
Sets up local project paths for Lambda Labs execution.

In [None]:
import sys
import os
from pathlib import Path

# Setup local project paths
PROJECT_DIR = Path("/Users/ayesha/Projects/HallucinationVectorProject")
DATA_DIR = PROJECT_DIR / "data"
ARTIFACTS_DIR = PROJECT_DIR / "artifacts" / "llama-3.1-8b"
RESULTS_DIR = PROJECT_DIR / "results" / "llama-3.1-8b"

# Create directories if they don't exist
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

# Add project directory to Python's path
project_path = str(PROJECT_DIR)
if project_path not in sys.path:
    sys.path.append(project_path)

print(f"Project directory: {PROJECT_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Artifacts directory: {ARTIFACTS_DIR}")
print(f"Results directory: {RESULTS_DIR}")

# Programmatically set the environment to 'local' in the config file
config_file_path = PROJECT_DIR / 'config.py'
with open(config_file_path, 'r') as f:
    lines = f.readlines()
with open(config_file_path, 'w') as f:
    for line in lines:
        if line.strip().startswith('ENVIRONMENT ='):
            f.write('ENVIRONMENT = "local"\n')
        else:
            f.write(line)
print("✓ Environment configured for local Lambda Labs execution.")

Environment configured for Colab execution.


### **Load Artifacts and Model Setup**
Loads all required model artifacts for Llama-3.1-8B, including the hallucination vector, risk classifier, and config thresholds, and prepares the model for evaluation on A100 40GB GPU.

In [None]:
import time
import pandas as pd
import torch
import csv
from tqdm import tqdm
import joblib
from unsloth import FastLanguageModel

# Import custom modules
import config
import utils

# Helper function to monitor GPU memory
def print_gpu_memory():
    """Print memory usage for all available GPUs."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1024**3
        reserved = torch.cuda.memory_reserved(0) / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU 0 ({torch.cuda.get_device_name(0)}): "
              f"{allocated:.2f}GB allocated, {reserved:.2f}GB reserved, {total:.2f}GB total")

# This global dictionary will hold our models, tokenizer, vectors, etc.
artifacts = {}

def load_all_artifacts():
    """Loads all necessary model and project artifacts into the global dict."""
    if artifacts: return
    print("Loading all necessary artifacts for Llama-3.1-8B evaluation...")
    
    print("\nGPU memory before model loading:")
    print_gpu_memory()
    
    # Clear any cached memory
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # Load 8B model for single GPU
    print("\nLoading Llama-3.1-8B model (bfloat16)...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
        max_seq_length=4096,
        dtype=torch.bfloat16,
        load_in_4bit=False,
        trust_remote_code=True,
    )

    # Configure for inference
    model = FastLanguageModel.for_inference(model)
    model.gradient_checkpointing_disable()
    model.config.gradient_checkpointing = False
    model.config.use_cache = True
    model.eval()

    artifacts['model'] = model
    artifacts['tokenizer'] = tokenizer
    artifacts['v_halluc'] = torch.load(ARTIFACTS_DIR / "v_halluc.pt").to(model.device).to(torch.bfloat16)
    artifacts['risk_classifier'] = joblib.load(ARTIFACTS_DIR / "risk_clf.joblib")
    artifacts['thresholds'] = {
        "tau_low": config.TAU_LOW,
        "tau_high": config.TAU_HIGH,
        "optimal_alpha": config.OPTIMAL_ALPHA
    }
    
    print("\n✓ All artifacts loaded successfully!")
    print(f"Model device: {model.device}")
    print("\nGPU memory after loading:")
    print_gpu_memory()

# Load everything
load_all_artifacts()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading all necessary artifacts for evaluation...
Loading model and tokenizer: unsloth/llama-3-8b-Instruct-bnb-4bit
==((====))==  Unsloth 2025.9.7: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Model and tokenizer loaded successfully.


### **Combined Selective Activation Steering and Guardrail Function**
Defines a context manager to apply the steering vector only to the first N tokens, with dynamic risk-proportional strength, and a function to generate answers using this combined intervention.

In [None]:
import time
import torch
from contextlib import contextmanager

# Import our project's config and utils modules
import config
import utils

print("Defining combined logic for 'Dynamic Alpha + Selective Steering' experiment...")

# --- 1. The SelectiveActivationSteerer Class ---

class SelectiveActivationSteerer:
    def __init__(self, model, steering_vector, layer_idx, coeff=1.0, steering_token_limit=10):
        self.model = model
        self.vector = steering_vector
        self.layer_idx = layer_idx
        self.coeff = coeff
        self.steering_token_limit = steering_token_limit
        self._handle = None
        self._layer_path = f"model.layers.{self.layer_idx}"
        self.call_count = 0

    def _hook_fn(self, module, ins, out):
        self.call_count += 1
        if self.call_count <= self.steering_token_limit:
            steered_output = out[0] + (self.coeff * self.vector.to(out[0].device))
            return (steered_output,) + out[1:]
        return out

    def __enter__(self):
        self.call_count = 0
        try:
            layer = self.model.get_submodule(self._layer_path)
            self._handle = layer.register_forward_hook(self._hook_fn)
        except AttributeError:
            raise AttributeError(f"Could not find the layer at path: {self._layer_path}")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._handle:
            self._handle.remove()


# --- 2. The New `answer_guarded_combined` Function ---
# This function combines the logic from both successful ablations.

def answer_guarded_combined(prompt_text: str, max_new_tokens: int = 128, steering_token_limit: int = 10):
    """
    Generates a response using the guardrail with DYNAMIC alpha and SELECTIVE steering.
    Enhanced for 70B model with proper device handling and memory management.
    """
    start_time = time.time()

    try:
        risk_score = utils.get_hallucination_risk(
            prompt_text, artifacts['model'], artifacts['tokenizer'],
            artifacts['v_halluc'], artifacts['risk_classifier']
        )

        full_prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\nAnswer the following briefly.\n{prompt_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        inputs = artifacts['tokenizer'](full_prompt, return_tensors="pt", max_length=4096, truncation=True).to(artifacts['model'].device)
        input_token_length = inputs.input_ids.shape[1]

        if risk_score < artifacts['thresholds']['tau_high']:
            path = "Fast Path (Untouched)"
            with torch.no_grad():
                outputs = artifacts['model'].generate(
                    **inputs, 
                    max_new_tokens=max_new_tokens, 
                    do_sample=False,
                    pad_token_id=artifacts['tokenizer'].eos_token_id
                )
        else:
            # From Dynamic Alpha Ablation: Calculate dynamic steering strength.
            optimal_alpha = artifacts['thresholds']['optimal_alpha']
            tau_high = artifacts['thresholds']['tau_high']
            scaling_factor = (risk_score - tau_high) / (1.0 - tau_high + 1e-6) # Add epsilon for stability
            dynamic_alpha = optimal_alpha * max(0, min(1, scaling_factor)) # Clamp between 0 and 1

            path = f"Combined Steer Path (α={dynamic_alpha:.2f}, N={steering_token_limit})"

            # From Selective N-Tokens Ablation: Use the steerer with a token limit.
            # We pass our newly calculated `dynamic_alpha` as the coefficient.
            with SelectiveActivationSteerer(
                artifacts['model'], artifacts['v_halluc'], config.TARGET_LAYER,
                coeff=dynamic_alpha,
                steering_token_limit=steering_token_limit
            ):
                with torch.no_grad():
                    outputs = artifacts['model'].generate(
                        **inputs, 
                        max_new_tokens=max_new_tokens, 
                        do_sample=False,
                        pad_token_id=artifacts['tokenizer'].eos_token_id
                    )

        answer = artifacts['tokenizer'].decode(outputs[0, input_token_length:], skip_special_tokens=True)
        latency = time.time() - start_time

        return {"answer": answer.strip(), "risk_score": risk_score, "path_taken": path, "latency_seconds": latency}
    
    except Exception as e:
        print(f"Error in answer_guarded_combined: {e}")
        latency = time.time() - start_time
        return {"answer": "", "risk_score": 0.5, "path_taken": "Error", "latency_seconds": latency}

print("✓ New function `answer_guarded_combined` is now defined and ready for the experiment.")

Defining combined logic for 'Dynamic Alpha + Selective Steering' experiment...
New function `answer_guarded_combined` is now defined and ready for the experiment.


**Suppress Warnings**
Suppresses specific sklearn warnings for cleaner output during evaluation.

In [None]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names",
    category=UserWarning,
    module="sklearn"
)

### **Run Combined Guardrail Evaluation**
Runs the evaluation loop on the TruthfulQA test set, applying the combined guardrail and saving results for each prompt.

In [None]:
# --- EXPERIMENT PARAMETER ---
STEERING_TOKEN_LIMIT = 10 # The 'N' for our selective steering

# Use local paths
GUARDED_RESULTS_PATH_COMBINED = RESULTS_DIR / "combined_guarded_results.csv"
BASELINE_RESULTS_PATH = RESULTS_DIR / "ablation_2_baseline_results_truthfulqa.csv"

print(f"New guarded results will be saved to: {GUARDED_RESULTS_PATH_COMBINED}")

# Memory management helper
def check_and_clear_memory(threshold_gb=60):
    """Clear GPU cache if memory usage exceeds threshold."""
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            if allocated > threshold_gb:
                print(f"GPU {i} memory ({allocated:.2f}GB) exceeds threshold. Clearing cache...")
                torch.cuda.empty_cache()
                return True
    return False

# Load the test set (using local path)
test_df = pd.read_csv(DATA_DIR / "final_test_set_truthfulqa.csv")
print(f"Loaded {len(test_df)} test prompts from TruthfulQA")

# --- Resilient Evaluation Loop ---
guarded_headers = ['prompt', 'answer', 'risk_score', 'path_taken', 'latency_seconds']
utils.initialize_csv(GUARDED_RESULTS_PATH_COMBINED, guarded_headers)

processed_guarded = utils.load_processed_prompts(GUARDED_RESULTS_PATH_COMBINED)

print(f"Starting response generation for combined guardrail (Dynamic Alpha + Selective N={STEERING_TOKEN_LIMIT})...")
print(f"Already processed: {len(processed_guarded)} prompts")

start_time = time.time()
processed_count = len(processed_guarded)

for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Combined Guardrail Evaluation"):
    prompt = row['Question']

    # Guarded Run
    if prompt not in processed_guarded:
        try:
            result = answer_guarded_combined(prompt, steering_token_limit=STEERING_TOKEN_LIMIT)
            with open(GUARDED_RESULTS_PATH_COMBINED, 'a', newline='', encoding='utf-8') as f:
                csv.writer(f).writerow([prompt] + list(result.values()))
            
            processed_count += 1
            
            # Progress tracking and memory management
            if processed_count % 10 == 0:
                elapsed = time.time() - start_time
                rate = processed_count / elapsed if elapsed > 0 else 0
                remaining = len(test_df) - processed_count
                eta = remaining / rate if rate > 0 else 0
                print(f"Progress: {processed_count}/{len(test_df)} ({processed_count/len(test_df)*100:.1f}%) | "
                      f"Rate: {rate:.2f} prompts/s | ETA: {eta/60:.1f} min")
                check_and_clear_memory()
                
        except Exception as e:
            print(f"Error on guarded prompt: {prompt[:50]}... Error: {e}")

print(f"\n✓ Combined guardrail evaluation complete in {(time.time() - start_time)/60:.2f} minutes")
print(f"Results saved to: {GUARDED_RESULTS_PATH_COMBINED}")

New guarded results will be saved to: /content/HallucinationVectorProject/results/combined_guarded_results.csv
Initialized CSV file at: /content/HallucinationVectorProject/results/combined_guarded_results.csv
Starting response generation for baseline and DYNAMIC guarded models...


Dynamic Alpha Evaluation: 100%|██████████| 617/617 [36:36<00:00,  3.56s/it]

Dynamic Alpha experiment generation complete.





### **Run Judging, Analyze, and Summarize Results**
Runs the judging process on generated answers, merges with ground truth, and computes final performance metrics for the combined guardrail experiment.

In [None]:
from evaluate_guardrail import run_judging_process
import utils
import config
import pandas as pd
import time

# --- Redefine paths for the analysis (using local paths) ---
GUARDED_JUDGED_PATH_COMBINED = RESULTS_DIR / "combined_guarded_judged_results.csv"
BASELINE_JUDGED_RESULTS_PATH = RESULTS_DIR / "ablation_2_baseline_judged_results.csv"
GUARDED_RESULTS_PATH_COMBINED = RESULTS_DIR / "combined_guarded_results.csv"
BASELINE_RESULTS_PATH = RESULTS_DIR / "ablation_2_baseline_results_truthfulqa.csv"

print("Loading datasets for judging and analysis...")

# Load the test set
test_df = pd.read_csv(DATA_DIR / "final_test_set_truthfulqa.csv")

# Load the newly generated results
guarded_df = pd.read_csv(GUARDED_RESULTS_PATH_COMBINED)
baseline_df = pd.read_csv(BASELINE_RESULTS_PATH)

# Merge with ground truth
guarded_merged_df = pd.merge(guarded_df, test_df, left_on='prompt', right_on='Question', how='left')
baseline_merged_df = pd.merge(baseline_df, test_df, left_on='prompt', right_on='Question', how='left')

print(f"Guarded results: {len(guarded_merged_df)} prompts")
print(f"Baseline results: {len(baseline_merged_df)} prompts")

# --- Run Judging with retry logic for network stability ---
secrets = utils.load_secrets()

print("\nStarting judging process for combined guardrail results...")
start_time = time.time()

try:
    run_judging_process(guarded_merged_df, GUARDED_JUDGED_PATH_COMBINED, secrets['SCALEDOWN_API_KEY'])
    print(f"✓ Judging complete in {(time.time() - start_time)/60:.2f} minutes")
except Exception as e:
    print(f"Error during judging: {e}")

# Assuming baseline is already judged, if not, uncomment below
# run_judging_process(baseline_merged_df, BASELINE_JUDGED_RESULTS_PATH, secrets['SCALEDOWN_API_KEY'])

# --- Analyze and Print Final Report ---
print("\nAnalyzing final performance metrics...")

guarded_judged_df = pd.read_csv(GUARDED_JUDGED_PATH_COMBINED)
baseline_judged_df = pd.read_csv(BASELINE_JUDGED_RESULTS_PATH)

baseline_accuracy = baseline_judged_df['is_correct'].mean()
guarded_accuracy = guarded_judged_df['is_correct'].mean()
baseline_error_rate = 1 - baseline_accuracy
guarded_error_rate = 1 - guarded_accuracy
relative_error_reduction = (baseline_error_rate - guarded_error_rate) / baseline_error_rate if baseline_error_rate > 0 else 0
baseline_latency = baseline_judged_df['latency_seconds'].mean()
guarded_latency = guarded_judged_df['latency_seconds'].mean()
latency_increase_percent = (guarded_latency - baseline_latency) / baseline_latency * 100

summary_data = {
    "Metric": ["Accuracy", "Hallucination Rate", "Avg Latency (s)", "Relative Error Reduction", "Latency Increase"],
    "Baseline Model": [f"{baseline_accuracy:.2%}", f"{baseline_error_rate:.2%}", f"{baseline_latency:.2f}", "N/A", "N/A"],
    "Guarded Model (Combined)": [f"{guarded_accuracy:.2%}", f"{guarded_error_rate:.2%}", f"{guarded_latency:.2f}", f"{relative_error_reduction:.2%}", f"{latency_increase_percent:+.2f}%"],
}
summary_df = pd.DataFrame(summary_data)

print("\n" + "="*80)
print("FINAL PERFORMANCE SUMMARY (Combined Dynamic Alpha + Selective N-Tokens)")
print("="*80)
print(summary_df.to_string(index=False))
print("="*80)

Loading secrets...
Secrets loaded successfully.

--- Starting Corrected Judging Process for combined_guarded_judged_results.csv ---
Initialized CSV file at: /content/HallucinationVectorProject/results/combined_guarded_judged_results.csv
Found 0 already judged prompts. Resuming...


Judging combined_guarded_judged_results.csv: 100%|██████████| 617/617 [50:28<00:00,  4.91s/it]


--- Final Performance Summary (Dynamic Alpha Experiment) ---





Unnamed: 0,Metric,Baseline Model,Guarded Model (Dynamic Alpha)
0,Accuracy,38.57%,51.22%
1,Hallucination Rate,61.43%,48.78%
2,Avg Latency (s),3.86,3.56
3,Relative Error Reduction,,20.58%
4,Latency Increase,,-7.80%
