# Selective N-Tokens Steering Ablation: Hallucination Guardrail Evaluation
 
**Summary:**
This notebook evaluates the effect of applying the hallucination guardrail only to the first N tokens (Selective N-Tokens Steering) in Llama-3.1-8B. The method applies the steering vector for the first 10 generated tokens of high-risk prompts, aiming to maximize hallucination reduction while minimizing latency and preserving answer quality. The workflow includes loading all artifacts, defining the selective steering logic, running the evaluation on the TruthfulQA benchmark, and analyzing results.

**Key Results (TruthfulQA Benchmark):**
- **Baseline Model:** Accuracy: 38.57%, Hallucination Rate: 61.43%, Avg Latency: 3.86s
- **Selective Guarded Model (N=10):** Accuracy: 49.80%, Hallucination Rate: 50.20%, Avg Latency: 3.57s
- **Relative Error Reduction:** 18.36%
- **Latency Increase:** -7.51% (latency decreased)

Selective N-Tokens Steering achieves substantial hallucination reduction and accuracy gains with no latency penalty, confirming the benefit of targeted, early intervention.

### **Environment and Requirements Setup**
Unzips the project folder and installs all required Python packages for Colab execution.

In [None]:
# Unzip the project folder into the Colab environment
!unzip -o /content/HallucinationVectorProject.zip -d /content/

# Install all required packages from requirements.txt
!pip install -q -r /content/HallucinationVectorProject/requirements.txt

Archive:  /content/HallucinationVectorProject.zip
  inflating: /content/HallucinationVectorProject/config.py  
  inflating: /content/HallucinationVectorProject/requirements.txt  
  inflating: /content/HallucinationVectorProject/train_risk_classifier.py  
  inflating: /content/HallucinationVectorProject/evaluate_guardrail.py  
  inflating: /content/HallucinationVectorProject/utils.py  
  inflating: /content/HallucinationVectorProject/tune_guardrail_hyperparameters.py  
  inflating: /content/HallucinationVectorProject/build_hallucination_vector.py  
  inflating: /content/HallucinationVectorProject/artifacts/risk_thresholds.joblib  
  inflating: /content/__MACOSX/HallucinationVectorProject/artifacts/._risk_thresholds.joblib  
  inflating: /content/HallucinationVectorProject/artifacts/risk_clf.joblib  
  inflating: /content/__MACOSX/HallucinationVectorProject/artifacts/._risk_clf.joblib  
  inflating: /content/HallucinationVectorProject/artifacts/v_halluc.pt  
  inflating: /content/__MACOS

### **Project Path and Colab Configuration**
Adds the project directory to Python's path and sets the environment to 'colab' in the config file for compatibility.

In [None]:
import sys
import os

# Add the project directory to the Python path so we can import config and utils
project_path = '/content/HallucinationVectorProject'
if project_path not in sys.path:
    sys.path.append(project_path)

# Modify config.py to ensure it runs in "colab" mode
config_file_path = '/content/HallucinationVectorProject/config.py'
with open(config_file_path, 'r') as f:
    lines = f.readlines()
with open(config_file_path, 'w') as f:
    for line in lines:
        if line.strip().startswith('ENVIRONMENT ='):
            f.write('ENVIRONMENT = "colab"\n')
        else:
            f.write(line)

print("Project setup complete. ENVIRONMENT set to 'colab'.")

Project setup complete. ENVIRONMENT set to 'colab'.


### **Load Artifacts and Model Setup**
Loads all required model artifacts, including the hallucination vector, risk classifier, and config thresholds, and prepares the model for evaluation.

In [None]:

import time
import pandas as pd
import torch
import csv
from tqdm import tqdm
import joblib
from unsloth import FastLanguageModel

# Import custom modules
import config
import utils

# This global dictionary will hold our models, tokenizer, vectors, etc.
artifacts = {}

def load_all_artifacts():
    """Loads all necessary model and project artifacts into the global dict."""
    if artifacts: return
    print("Loading all necessary artifacts for evaluation...")
    model, tokenizer = utils.load_model_and_tokenizer()

    # to get over issues
    model = FastLanguageModel.for_inference(model)
    model.gradient_checkpointing_disable()
    model.config.gradient_checkpointing = False
    model.config.use_cache = True
    model.eval()

    artifacts['model'] = model
    artifacts['tokenizer'] = tokenizer
    artifacts['v_halluc'] = torch.load("/content/HallucinationVectorProject/artifacts/v_halluc.pt").to(model.device)
    artifacts['risk_classifier'] = joblib.load("/content/HallucinationVectorProject/artifacts/risk_clf.joblib")
    artifacts['thresholds'] = {
        "tau_low": config.TAU_LOW,
        "tau_high": config.TAU_HIGH,
        "optimal_alpha": config.OPTIMAL_ALPHA
    }

# Load everything
load_all_artifacts()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading all necessary artifacts for evaluation...
Loading model and tokenizer: unsloth/llama-3-8b-Instruct-bnb-4bit
==((====))==  Unsloth 2025.9.7: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Model and tokenizer loaded successfully.


### **Selective Activation Steering and Guardrail Function**
Defines a context manager to apply the steering vector only to the first N tokens and a function to generate answers using this selective intervention.

In [None]:
from contextlib import contextmanager

# --- 1. The New, Smarter ActivationSteerer Class ---
# This version includes a counter to apply steering only to the first N tokens.

class SelectiveActivationSteerer:
    """
    A context manager to apply activation steering for a limited number of tokens.
    """
    def __init__(self, model, steering_vector, layer_idx, coeff=1.0, steering_token_limit=10):
        self.model = model
        self.vector = steering_vector
        self.layer_idx = layer_idx
        self.coeff = coeff
        self.steering_token_limit = steering_token_limit  # NEW: The 'N' value
        self._handle = None
        self._layer_path = f"model.layers.{self.layer_idx}"
        self.call_count = 0  # NEW: Counter for generated tokens

    def _hook_fn(self, module, ins, out):
        # This hook is called for each token generated.
        self.call_count += 1

        # --- CORE LOGIC CHANGE ---
        # Only apply the steering vector if we are within the token limit.
        if self.call_count <= self.steering_token_limit:
            steered_output = out[0] + (self.coeff * self.vector.to(out[0].device))
            return (steered_output,) + out[1:]

        # After the limit, return the original, unmodified output.
        return out

    def __enter__(self):
        # Reset the counter every time we start a new generation.
        self.call_count = 0
        try:
            layer = self.model.get_submodule(self._layer_path)
            self._handle = layer.register_forward_hook(self._hook_fn)
        except AttributeError:
            raise AttributeError(f"Could not find the layer at path: {self._layer_path}")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._handle:
            self._handle.remove()


# --- New `answer_guarded` Function ---
# This version uses our new SelectiveActivationSteerer.

def answer_guarded_selective(prompt_text: str, max_new_tokens: int = 128, steering_token_limit: int = 10):
    """
    Generates a response using the guardrail with SELECTIVE, first-N-token steering.
    """
    # This function relies on a global `artifacts` dictionary, which we will create later.
    start_time = time.time()

    risk_score = utils.get_hallucination_risk(
        prompt_text, artifacts['model'], artifacts['tokenizer'],
        artifacts['v_halluc'], artifacts['risk_classifier']
    )

    full_prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\nAnswer the following briefly\n{prompt_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    inputs = artifacts['tokenizer'](full_prompt, return_tensors="pt").to(artifacts['model'].device)
    input_token_length = inputs.input_ids.shape[1]

    if risk_score < artifacts['thresholds']['tau_high']:
        path = "Fast Path (Untouched)"
        with torch.no_grad():
            outputs = artifacts['model'].generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    else:
        alpha = artifacts['thresholds']['optimal_alpha']
        path = f"Selective Steer Path (α={alpha}, N={steering_token_limit})"

        # --- USING THE NEW CLASS ---
        with SelectiveActivationSteerer(
            artifacts['model'], artifacts['v_halluc'], config.TARGET_LAYER,
            coeff=alpha, steering_token_limit=steering_token_limit
        ):
            with torch.no_grad():
                outputs = artifacts['model'].generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    answer = artifacts['tokenizer'].decode(outputs[0, input_token_length:], skip_special_tokens=True)
    latency = time.time() - start_time

    return {"answer": answer.strip(), "risk_score": risk_score, "path_taken": path, "latency_seconds": latency}

print("New functions `SelectiveActivationSteerer` and `answer_guarded_selective` are now defined.")

New functions `SelectiveActivationSteerer` and `answer_guarded_selective` are now defined.


**Suppress Warnings**
Suppresses specific sklearn warnings for cleaner output during evaluation.

In [None]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names",
    category=UserWarning,
    module="sklearn"
)

### **Run Selective N-Tokens Evaluation**
Runs the evaluation loop on the TruthfulQA test set, applying selective steering and saving results for each prompt.

In [None]:
# --- EXPERIMENT PARAMETER ---
STEERING_TOKEN_LIMIT = 10 # The 'N' for our selective steering

# Define new output paths for this specific experiment to avoid overwriting old results
GUARDED_RESULTS_PATH_SELECTIVE = os.path.join("/content/HallucinationVectorProject/results", "selective_N10_guarded_results.csv")

# --- Initialize CSV and get processed prompts ---
guarded_headers = ['prompt', 'answer', 'risk_score', 'path_taken', 'latency_seconds']
utils.initialize_csv(GUARDED_RESULTS_PATH_SELECTIVE, guarded_headers)
processed_prompts = utils.load_processed_prompts(GUARDED_RESULTS_PATH_SELECTIVE)
print(f"Found {len(processed_prompts)} already processed prompts for this experiment. Resuming...")

# Load the test set
test_df = pd.read_csv("/content/HallucinationVectorProject/data/final_test_set_truthfulqa.csv")

# --- Main Generation Loop ---
for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Selective Steering Evaluation"):
    prompt = row['Question']
    if prompt not in processed_prompts:
        try:
            # --- CALL OUR NEW FUNCTION ---
            result = answer_guarded_selective(prompt, steering_token_limit=STEERING_TOKEN_LIMIT)

            with open(GUARDED_RESULTS_PATH_SELECTIVE, 'a', newline='', encoding='utf-8') as f:
                csv.writer(f).writerow([prompt] + list(result.values()))
        except Exception as e:
            print(f"Error on prompt: {prompt}. Error: {e}")

print(f"Selective steering evaluation complete. Results saved to {GUARDED_RESULTS_PATH_SELECTIVE}")

Initialized CSV file at: /content/HallucinationVectorProject/results/selective_N10_guarded_results.csv
Found 0 already processed prompts for this experiment. Resuming...


Selective Steering Evaluation: 100%|██████████| 617/617 [45:09<00:00,  4.39s/it]

Selective steering evaluation complete. Results saved to /content/HallucinationVectorProject/results/selective_N10_guarded_results.csv





### **Judging Process for Hallucination Scoring**
Defines a function to judge and score model answers for hallucination using an external API, saving results incrementally.

In [None]:
import os
import csv
from tqdm import tqdm
import pandas as pd
import utils # Make sure utils is imported

def run_judging_process_corrected(input_df, output_path, api_key):
    """
    CORRECTED VERSION: Iterates through a DataFrame, gets a hallucination score,
    and saves resiliently. Fixes the TypeError by calling the judge function
    with explicit keyword arguments instead of unpacking a dictionary.
    """
    print(f"\n--- Starting Corrected Judging Process for {os.path.basename(output_path)} ---")

    # Initialize CSV and load processed prompts
    output_headers = input_df.columns.tolist() + ['hallucination_score', 'is_correct']
    utils.initialize_csv(output_path, output_headers)
    processed_prompts = utils.load_processed_prompts(output_path, 'prompt')
    print(f"Found {len(processed_prompts)} already judged prompts. Resuming...")

    with open(output_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for index, row in tqdm(input_df.iterrows(), total=len(input_df), desc=f"Judging {os.path.basename(output_path)}"):
            if row['prompt'] in processed_prompts:
                continue

            score = -1
            try:
                for _ in range(3): # Retry logic
                    # <<< THE FIX IS HERE >>>
                    # Instead of **row.to_dict(), we explicitly pass the correct columns.
                    score = utils.get_hallucination_score_0_100(
                        api_key=api_key,
                        question=row['Question'],        # Use the 'Question' column
                        answer=row['answer'],          # Use the 'answer' column
                        best_answer=row['Best Answer'],  # Use the 'Best Answer' column
                        correct_answers=row['Correct Answers'],
                        incorrect_answers=row['Incorrect Answers']
                    )
                    if score != -1:
                        break

                is_correct = 1 if 0 <= score <= 50 else 0
                new_row = row.tolist() + [score, is_correct]
                writer.writerow(new_row)

            except Exception as e:
                print(f"An unexpected error occurred for prompt '{row['prompt']}': {e}")
                # Save an error row to avoid reprocessing
                error_row = row.tolist() + [-1, 0]
                writer.writerow(error_row)

print("Defined `run_judging_process_corrected` function.")

Defined `run_judging_process_corrected` function.


### **Run Judging, Analyze, and Summarize Results**
Runs the judging process on generated answers, merges with ground truth, and computes final performance metrics for the selective N-tokens experiment.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# --- Setup for Judging and Analysis ---
secrets = utils.load_secrets()
api_key = secrets.get('SCALEDOWN_API_KEY')
GUARDED_JUDGED_PATH_SELECTIVE = os.path.join("/content/HallucinationVectorProject/results", "selective_N10_guarded_judged_results.csv")
BASELINE_JUDGED_RESULTS_PATH = os.path.join("/content/HallucinationVectorProject/results/", "ablation_2_baseline_judged_results.csv")


# --- Run Judging ---
guarded_df = pd.read_csv(GUARDED_RESULTS_PATH_SELECTIVE)
guarded_merged_df = pd.merge(guarded_df, test_df, left_on='prompt', right_on='Question', how='left')
run_judging_process_corrected(guarded_merged_df, GUARDED_JUDGED_PATH_SELECTIVE, api_key) # Reusing the judge function from utils

# --- Run Analysis ---
print("\n--- Final Performance Analysis (Selective Steering vs. Baseline) ---")
guarded_judged_df = pd.read_csv(GUARDED_JUDGED_PATH_SELECTIVE)
baseline_judged_df = pd.read_csv(BASELINE_JUDGED_RESULTS_PATH)

# Accuracy / Hallucination
baseline_accuracy = baseline_judged_df['is_correct'].mean()
guarded_accuracy = guarded_judged_df['is_correct'].mean()
baseline_error_rate = 1 - baseline_accuracy
guarded_error_rate = 1 - guarded_accuracy
relative_error_reduction = (baseline_error_rate - guarded_error_rate) / baseline_error_rate if baseline_error_rate > 0 else 0

# Latency
baseline_latency = baseline_judged_df['latency_seconds'].mean()
guarded_latency = guarded_judged_df['latency_seconds'].mean()
latency_increase_percent = (guarded_latency - baseline_latency) / baseline_latency * 100

# Summary Table
summary_data = {
    "Metric": ["Accuracy", "Hallucination Rate", "Avg Latency (s)", "Relative Error Reduction", "Latency Increase"],
    "Baseline Model": [f"{baseline_accuracy:.2%}", f"{baseline_error_rate:.2%}", f"{baseline_latency:.2f}", "N/A", "N/A"],
    "Selective Guarded Model": [f"{guarded_accuracy:.2%}", f"{guarded_error_rate:.2%}", f"{guarded_latency:.2f}", f"{relative_error_reduction:.2%}", f"{latency_increase_percent:+.2f}%"],
    "Target": ["Maximize", "Minimize (≥15% Red.)", "N/A", "Maximize", "<10%"]
}
summary_df = pd.DataFrame(summary_data)
display(summary_df)

Loading secrets...
Secrets loaded successfully.

--- Starting Corrected Judging Process for selective_N10_guarded_judged_results.csv ---
Found 0 already judged prompts. Resuming...


Judging selective_N10_guarded_judged_results.csv: 100%|██████████| 617/617 [10:03<00:00,  1.02it/s]


--- Final Performance Analysis (Selective Steering vs. Baseline) ---





Unnamed: 0,Metric,Baseline Model,Selective Guarded Model,Target
0,Accuracy,38.57%,47.16%,Maximize
1,Hallucination Rate,61.43%,52.84%,Minimize (≥15% Red.)
2,Avg Latency (s),3.86,4.39,
3,Relative Error Reduction,,13.98%,Maximize
4,Latency Increase,,+13.73%,<10%
