# Dynamic Alpha Ablation: Hallucination Guardrail Evaluation
 
**Summary:**
This notebook evaluates a dynamic steering strength (Dynamic Alpha) for the hallucination guardrail on Llama-3-8B-Instruct (the 4-bit `unsloth/llama-3-8b-Instruct-bnb-4bit` checkpoint reported in the load log). Building on the core methodology from the previous notebooks, it applies a risk-aware intervention: prompts whose risk score is at or above the high-risk threshold (`tau_high`) are steered away from the hallucination direction with a strength (alpha) that grows linearly with risk, from 0 at `tau_high` to `OPTIMAL_ALPHA` at a risk of 1.0. Prompts below the threshold are generated without intervention. The workflow loads the required artifacts, defines the dynamic steering logic, runs the evaluation on the 617-prompt TruthfulQA test set, and analyzes the results.

**Key Results (TruthfulQA Benchmark):**
- **Baseline Model:** Accuracy: 38.57%, Hallucination Rate: 61.43%, Avg Latency: 3.86s
- **Dynamic Alpha Guardrail:** Accuracy: 47.65%, Hallucination Rate: 52.35%, Avg Latency: 3.75s
- **Relative Error Reduction:** 14.78%
- **Latency Change:** -2.78% (latency decreased)

Dynamic Alpha lowers the hallucination rate relative to the unguarded baseline without increasing average latency, which supports risk-proportional steering as an alternative to fixed-strength steering.

### **Environment and Requirements Setup**
Unzips the project folder and installs all required Python packages for Colab execution.

In [11]:
# Unzip the project folder into the Colab environment
!unzip -o /content/HallucinationVectorProject.zip -d /content/

# Install all required packages from requirements.txt
!pip install -q -r /content/HallucinationVectorProject/requirements.txt

Archive:  /content/HallucinationVectorProject.zip
  inflating: /content/HallucinationVectorProject/config.py  
  inflating: /content/HallucinationVectorProject/requirements.txt  
  inflating: /content/HallucinationVectorProject/train_risk_classifier.py  
  inflating: /content/HallucinationVectorProject/evaluate_guardrail.py  
  inflating: /content/HallucinationVectorProject/utils.py  
  inflating: /content/HallucinationVectorProject/tune_guardrail_hyperparameters.py  
  inflating: /content/HallucinationVectorProject/build_hallucination_vector.py  
  inflating: /content/HallucinationVectorProject/artifacts/risk_thresholds.joblib  
  inflating: /content/__MACOSX/HallucinationVectorProject/artifacts/._risk_thresholds.joblib  
  inflating: /content/HallucinationVectorProject/artifacts/risk_clf.joblib  
  inflating: /content/__MACOSX/HallucinationVectorProject/artifacts/._risk_clf.joblib  
  inflating: /content/HallucinationVectorProject/artifacts/v_halluc.pt  
  inflating: /content/__MACOS

### **Project Path and Colab Configuration**
Adds the project directory to Python's path and sets the environment to 'colab' in the config file for compatibility.

In [2]:
import sys
import os

# Add project directory to Python's path
project_path = '/content/HallucinationVectorProject'
if project_path not in sys.path:
    sys.path.append(project_path)

# Programmatically set the environment to 'colab' in the config file
config_file_path = os.path.join(project_path, 'config.py')
with open(config_file_path, 'r') as f:
    lines = f.readlines()
with open(config_file_path, 'w') as f:
    for line in lines:
        if line.strip().startswith('ENVIRONMENT ='):
            f.write('ENVIRONMENT = "colab"\n')
        else:
            f.write(line)
print("Environment configured for Colab execution.")

Environment configured for Colab execution.


### **Load Artifacts and Model Setup**
Loads all required model artifacts, including the hallucination vector, risk classifier, and config thresholds, and prepares the model for evaluation.

In [None]:

import time
import pandas as pd
import torch
import csv
from tqdm import tqdm
import joblib
from unsloth import FastLanguageModel

# Import custom modules
import config
import utils

# This global dictionary will hold our models, tokenizer, vectors, etc.
artifacts = {}

def load_all_artifacts():
    """Loads all necessary model and project artifacts into the global dict."""
    if artifacts: return
    print("Loading all necessary artifacts for evaluation...")
    model, tokenizer = utils.load_model_and_tokenizer()

    # Put the Unsloth model in inference mode: no gradient checkpointing, KV cache on
    model = FastLanguageModel.for_inference(model)
    model.gradient_checkpointing_disable()
    model.config.gradient_checkpointing = False
    model.config.use_cache = True
    model.eval()

    artifacts['model'] = model
    artifacts['tokenizer'] = tokenizer
    artifacts['v_halluc'] = torch.load("/content/HallucinationVectorProject/artifacts/v_halluc.pt").to(model.device)
    artifacts['risk_classifier'] = joblib.load("/content/HallucinationVectorProject/artifacts/risk_clf.joblib")
    artifacts['thresholds'] = {
        "tau_low": config.TAU_LOW,
        "tau_high": config.TAU_HIGH,
        "optimal_alpha": config.OPTIMAL_ALPHA
    }

# Load everything
load_all_artifacts()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading all necessary artifacts for evaluation...
Loading model and tokenizer: unsloth/llama-3-8b-Instruct-bnb-4bit
==((====))==  Unsloth 2025.9.7: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Model and tokenizer loaded successfully.


### **Activation Steering Context Manager**
Defines a context manager to apply the hallucination steering vector to a specific model layer during inference.

In [None]:
class CorrectedActivationSteerer:
    """
    A context manager to apply activation steering to a model.
    CORRECTED VERSION: Uses the proper __enter__ and __exit__ dunder methods.
    """
    def __init__(self, model, steering_vector, layer_idx, coeff=1.0):
        self.model = model
        self.vector = steering_vector
        self.layer_idx = layer_idx
        self.coeff = coeff
        self._handle = None
        self._layer_path = f"model.layers.{self.layer_idx}"

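    def _hook_fn(self, module, ins, out):
        # Decoder layers may return a tuple (hidden_states, ...) or a bare tensor,
        # depending on the library version; handle both.
        hidden = out[0] if isinstance(out, tuple) else out
        steered_output = hidden + (self.coeff * self.vector.to(hidden.device))
        if isinstance(out, tuple):
            return (steered_output,) + out[1:]
        return steered_output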

    # CORRECTED: Renamed from _enter_ to the proper dunder method __enter__
    def __enter__(self):
        try:
            layer = self.model.get_submodule(self._layer_path)
            self._handle = layer.register_forward_hook(self._hook_fn)
        except AttributeError:
            raise AttributeError(f"Could not find the layer at path: {self._layer_path}")
        return self

    # CORRECTED: Renamed from _exit_ to the proper dunder method __exit__
    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._handle:
            self._handle.remove()
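
The hook lifecycle used by the steerer can be checked in isolation on a toy module. The sketch below is not the project's code: `AddVectorHook`, `toy_layer`, and `direction` are hypothetical names, and an `nn.Linear` layer stands in for a decoder layer. It only illustrates that a forward hook shifts the output while the `with` block is active and that the output returns to normal afterwards.

```python
import torch
from torch import nn


class AddVectorHook:
    """Context manager that adds coeff * vector to a module's output while active."""

    def __init__(self, module, vector, coeff):
        self.module = module
        self.vector = vector
        self.coeff = coeff
        self._handle = None

    def _hook(self, module, inputs, output):
        # Returning a value from a forward hook replaces the module's output
        return output + self.coeff * self.vector

    def __enter__(self):
        self._handle = self.module.register_forward_hook(self._hook)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Always detach the hook so later forward passes are unaffected
        if self._handle is not None:
            self._handle.remove()
            self._handle = None


torch.manual_seed(0)
toy_layer = nn.Linear(4, 4)   # stand-in for a decoder layer (assumption)
x = torch.randn(2, 4)
direction = torch.ones(4)     # stand-in for the steering vector (assumption)

with torch.no_grad():
    plain = toy_layer(x)
    with AddVectorHook(toy_layer, direction, coeff=-0.5):
        steered = toy_layer(x)
    restored = toy_layer(x)

print(torch.allclose(steered, plain - 0.5 * direction))
print(torch.equal(restored, plain))
```

Removing the handle in `__exit__` matters because a hook left in place would keep steering every later call, including the unguarded fast path.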

### **Dynamic Alpha Guardrail Function**
Defines the main function that applies dynamic, risk-proportional steering to high-risk prompts during answer generation.

In [None]:
def answer_guarded_dynamic_alpha(prompt_text: str, max_new_tokens: int = 128):
    """
    A new version of the guardrail function that applies DYNAMIC steering strength
    based on the prompt's risk score.
    """
    start_time = time.time()

    # --- Risk scoring ---
    risk_score = utils.get_hallucination_risk(
        prompt_text, artifacts['model'], artifacts['tokenizer'],
        artifacts['v_halluc'], artifacts['risk_classifier']
    )

    full_prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\nAnswer the following question briefly:\n{prompt_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    inputs = artifacts['tokenizer'](full_prompt, return_tensors="pt").to(artifacts['model'].device)
    input_token_length = inputs.input_ids.shape[1]

    # --- Routing logic ---
    if risk_score < artifacts['thresholds']['tau_high']:
        path = "Fast Path (Untouched)"
        with torch.no_grad():
            outputs = artifacts['model'].generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    else:
        # CALCULATE DYNAMIC ALPHA
        optimal_alpha = artifacts['thresholds']['optimal_alpha']
        tau_high = artifacts['thresholds']['tau_high']

        # This formula scales alpha from 0 to optimal_alpha as risk goes from tau_high to 1.0
        # Added epsilon to prevent division by zero if tau_high is exactly 1.0
        scaling_factor = (risk_score - tau_high) / (1.0 - tau_high + 1e-6)
        dynamic_alpha = optimal_alpha * max(0, min(1, scaling_factor)) # Clamp between 0 and 1

        path = f"Dynamic Steer Path (α={dynamic_alpha:.2f})" # Update path for clarity

        # USE THE NEW DYNAMIC ALPHA IN THE STEERER
        with CorrectedActivationSteerer(artifacts['model'], artifacts['v_halluc'], config.TARGET_LAYER, coeff=dynamic_alpha):
            with torch.no_grad():
                outputs = artifacts['model'].generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    answer = artifacts['tokenizer'].decode(outputs[0, input_token_length:], skip_special_tokens=True)
    latency = time.time() - start_time

    return {"answer": answer.strip(), "risk_score": risk_score, "path_taken": path, "latency_seconds": latency}

print("Redefined `answer_guarded_dynamic_alpha` function is ready for the experiment.")

Redefined `answer_guarded_dynamic_alpha` function is ready for the experiment.
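
To make the scaling rule concrete, the sketch below isolates the alpha computation from the function above and evaluates it at a few risk scores. The threshold and maximum strength used here (`TAU_HIGH = 0.5`, `OPTIMAL_ALPHA = 2.0`) are hypothetical values chosen for illustration; the real ones come from `config.py` and are not shown in this notebook.

```python
def dynamic_alpha(risk_score: float, tau_high: float, optimal_alpha: float,
                  eps: float = 1e-6) -> float:
    """Scale steering strength linearly from 0 (at tau_high) to optimal_alpha (at risk 1.0)."""
    scaling = (risk_score - tau_high) / (1.0 - tau_high + eps)
    # Clamp to [0, 1] so out-of-range risk scores cannot over- or under-steer
    return optimal_alpha * max(0.0, min(1.0, scaling))


# Hypothetical values for illustration only; the real ones live in config.py
TAU_HIGH = 0.5
OPTIMAL_ALPHA = 2.0

for risk in (0.40, 0.50, 0.75, 1.00):
    alpha = dynamic_alpha(risk, TAU_HIGH, OPTIMAL_ALPHA)
    print(f"risk={risk:.2f} -> alpha={alpha:.3f}")
```

With these assumed values, a prompt exactly at the threshold gets no steering, one halfway between the threshold and 1.0 gets about half of the maximum strength, and only a risk of 1.0 reaches (approximately, because of the epsilon) the full `OPTIMAL_ALPHA`. In the guardrail itself, scores below `tau_high` never reach this computation because they take the fast path.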


### **Suppress Warnings**
Suppresses the sklearn "X does not have valid feature names" warning for cleaner output during evaluation.

In [3]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names",
    category=UserWarning,
    module="sklearn"
)

### **Run Dynamic Alpha Evaluation**
Runs the evaluation loop on the TruthfulQA test set, applying the dynamic guardrail and saving results for each prompt.

In [None]:
GUARDED_RESULTS_PATH_DYNAMIC = os.path.join("/content/HallucinationVectorProject/results", "guarded_results_dynamic_alpha.csv")
BASELINE_RESULTS_PATH = os.path.join("/content/HallucinationVectorProject/results/", "ablation_2_baseline_results_truthfulqa.csv")

print(f"New guarded results will be saved to: {GUARDED_RESULTS_PATH_DYNAMIC}")

# Load the test set
test_df = pd.read_csv("/content/HallucinationVectorProject/data/final_test_set_truthfulqa.csv")

# --- Resilient Evaluation Loop ---
guarded_headers = ['prompt', 'answer', 'risk_score', 'path_taken', 'latency_seconds']
utils.initialize_csv(GUARDED_RESULTS_PATH_DYNAMIC, guarded_headers)

processed_guarded = utils.load_processed_prompts(GUARDED_RESULTS_PATH_DYNAMIC)

print("Starting response generation for baseline and DYNAMIC guarded models...")
for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Dynamic Alpha Evaluation"):
    prompt = row['Question']

    # Guarded Run (using the NEW function)
    if prompt not in processed_guarded:
        try:
            # CALL THE NEWLY DEFINED FUNCTION
            result = answer_guarded_dynamic_alpha(prompt)
            with open(GUARDED_RESULTS_PATH_DYNAMIC, 'a', newline='', encoding='utf-8') as f:
                csv.writer(f).writerow([prompt] + list(result.values()))
        except Exception as e:
            print(f"Error on guarded prompt: {prompt}. Error: {e}")

    # Baseline results already exist from a previous run, so only the guarded model is generated here

print("Dynamic Alpha experiment generation complete.")

New guarded results will be saved to: /content/HallucinationVectorProject/results/guarded_results_dynamic_alpha.csv
Initialized CSV file at: /content/HallucinationVectorProject/results/guarded_results_dynamic_alpha.csv
Starting response generation for baseline and DYNAMIC guarded models...


Dynamic Alpha Evaluation: 100%|██████████| 617/617 [38:36<00:00,  3.75s/it]

Dynamic Alpha experiment generation complete.
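
The loop above is resumable because each result is appended to the CSV as soon as it is produced, and prompts already present in the file are skipped on restart. The implementations of `utils.initialize_csv` and `utils.load_processed_prompts` are not shown in this notebook, so the following is only a sketch of how such helpers might work; `init_csv`, `load_done`, `run_resumable`, and `fake_answer` are hypothetical names.

```python
import csv
import os
import tempfile


def init_csv(path, headers):
    """Create the results file with a header row, unless it already exists."""
    if not os.path.exists(path):
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow(headers)


def load_done(path, key="prompt"):
    """Return the set of prompts already present in the results file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key] for row in csv.DictReader(f)}


def run_resumable(prompts, answer_fn, path):
    """Process each prompt once, appending one row per prompt as soon as it finishes."""
    init_csv(path, ["prompt", "answer"])
    done = load_done(path)
    for prompt in prompts:
        if prompt in done:
            continue
        with open(path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([prompt, answer_fn(prompt)])
        done.add(prompt)


# Simulate an interrupted run followed by a resumed one
results_path = os.path.join(tempfile.mkdtemp(), "results.csv")
calls = []


def fake_answer(prompt):
    calls.append(prompt)
    return prompt.upper()


run_resumable(["a", "b"], fake_answer, results_path)       # first (partial) run
run_resumable(["a", "b", "c"], fake_answer, results_path)  # resumed run
print(calls)
```

In the resumed run only the new prompt triggers a call to `fake_answer`, which is the property that makes a long Colab evaluation safe to restart after a disconnect.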





### **Judging Process for Hallucination Scoring**
Defines a function to judge and score model answers for hallucination using an external API, saving results incrementally.

In [13]:
import os
import csv
from tqdm import tqdm
import pandas as pd
import utils # Make sure utils is imported

def run_judging_process_corrected(input_df, output_path, api_key):
    """
    CORRECTED VERSION: Iterates through a DataFrame, gets a hallucination score,
    and saves resiliently. Fixes the TypeError by calling the judge function
    with explicit keyword arguments instead of unpacking a dictionary.
    """
    print(f"\n--- Starting Corrected Judging Process for {os.path.basename(output_path)} ---")

    # Initialize CSV and load processed prompts
    output_headers = input_df.columns.tolist() + ['hallucination_score', 'is_correct']
    utils.initialize_csv(output_path, output_headers)
    processed_prompts = utils.load_processed_prompts(output_path, 'prompt')
    print(f"Found {len(processed_prompts)} already judged prompts. Resuming...")

    with open(output_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for index, row in tqdm(input_df.iterrows(), total=len(input_df), desc=f"Judging {os.path.basename(output_path)}"):
            if row['prompt'] in processed_prompts:
                continue

            score = -1
            try:
                for _ in range(3): # Retry logic
                    score = utils.get_hallucination_score_0_100(
                        api_key=api_key,
                        question=row['Question'],        # Use the 'Question' column
                        answer=row['answer'],          # Use the 'answer' column
                        best_answer=row['Best Answer'],  # Use the 'Best Answer' column
                        correct_answers=row['Correct Answers'],
                        incorrect_answers=row['Incorrect Answers']
                    )
                    if score != -1:
                        break

                is_correct = 1 if 0 <= score <= 50 else 0
                new_row = row.tolist() + [score, is_correct]
                writer.writerow(new_row)

            except Exception as e:
                print(f"An unexpected error occurred for prompt '{row['prompt']}': {e}")
                # Save an error row to avoid reprocessing
                error_row = row.tolist() + [-1, 0]
                writer.writerow(error_row)

print("Defined `run_judging_process_corrected` function.")

Defined `run_judging_process_corrected` function.
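
The retry and scoring rules inside the judging function can be separated from the API call and exercised on their own. The sketch below assumes, as the code above does, that the judge returns `-1` on failure and that scores from 0 to 50 count as correct; `judge_with_retries`, `to_is_correct`, and the stubbed judge are hypothetical stand-ins, not part of `utils`.

```python
def judge_with_retries(judge_fn, max_attempts=3, failure_score=-1):
    """Call judge_fn until it returns a valid score or the attempts run out."""
    score = failure_score
    for _ in range(max_attempts):
        score = judge_fn()
        if score != failure_score:
            break
    return score


def to_is_correct(score, threshold=50):
    """Map a 0-100 hallucination score to a binary label; failed judgments count as incorrect."""
    return 1 if 0 <= score <= threshold else 0


# Stub judge that fails twice before returning a score (illustration only)
attempts = iter([-1, -1, 30])
score = judge_with_retries(lambda: next(attempts))
print(score, to_is_correct(score))
```

One consequence of this rule is that a prompt whose judgment fails on every attempt keeps the score `-1` and is counted as incorrect, so judge failures lower the reported accuracy rather than being excluded from it.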


### **Run Judging, Analyze, and Summarize Results**
Runs the judging process on generated answers, merges with ground truth, and computes final performance metrics for the dynamic alpha experiment.

In [15]:
from evaluate_guardrail import run_judging_process
import utils
import config
import pandas as pd

# --- Redefine paths for the analysis ---
GUARDED_JUDGED_PATH_DYNAMIC = os.path.join("/content/HallucinationVectorProject/results", "guarded_judged_dynamic_alpha.csv")
BASELINE_JUDGED_RESULTS_PATH = os.path.join("/content/HallucinationVectorProject/results/", "ablation_2_baseline_judged_results.csv")
GUARDED_RESULTS_PATH_DYNAMIC = os.path.join("/content/HallucinationVectorProject/results", "guarded_results_dynamic_alpha.csv")
BASELINE_RESULTS_PATH = os.path.join("/content/HallucinationVectorProject/results/", "ablation_2_baseline_results_truthfulqa.csv")

# Load the test set
test_df = pd.read_csv("/content/HallucinationVectorProject/data/final_test_set_truthfulqa.csv")

# Load the newly generated results
guarded_df = pd.read_csv(GUARDED_RESULTS_PATH_DYNAMIC)
baseline_df = pd.read_csv(BASELINE_RESULTS_PATH)

# Merge with ground truth
guarded_merged_df = pd.merge(guarded_df, test_df, left_on='prompt', right_on='Question', how='left')
baseline_merged_df = pd.merge(baseline_df, test_df, left_on='prompt', right_on='Question', how='left')

# --- Run Judging ---
secrets = utils.load_secrets()
run_judging_process_corrected(guarded_merged_df, GUARDED_JUDGED_PATH_DYNAMIC, secrets['SCALEDOWN_API_KEY'])
# The baseline is assumed to be judged already; if it is not, uncomment the line below
# run_judging_process_corrected(baseline_merged_df, BASELINE_JUDGED_RESULTS_PATH, secrets['SCALEDOWN_API_KEY'])

# --- Analyze and Print Final Report ---
guarded_judged_df = pd.read_csv(GUARDED_JUDGED_PATH_DYNAMIC)
baseline_judged_df = pd.read_csv(BASELINE_JUDGED_RESULTS_PATH)

baseline_accuracy = baseline_judged_df['is_correct'].mean()
guarded_accuracy = guarded_judged_df['is_correct'].mean()
baseline_error_rate = 1 - baseline_accuracy
guarded_error_rate = 1 - guarded_accuracy
relative_error_reduction = (baseline_error_rate - guarded_error_rate) / baseline_error_rate if baseline_error_rate > 0 else 0
baseline_latency = baseline_judged_df['latency_seconds'].mean()
guarded_latency = guarded_judged_df['latency_seconds'].mean()
latency_increase_percent = (guarded_latency - baseline_latency) / baseline_latency * 100

summary_data = {
    "Metric": ["Accuracy", "Hallucination Rate", "Avg Latency (s)", "Relative Error Reduction", "Latency Increase"],
    "Baseline Model": [f"{baseline_accuracy:.2%}", f"{baseline_error_rate:.2%}", f"{baseline_latency:.2f}", "N/A", "N/A"],
    "Guarded Model (Dynamic Alpha)": [f"{guarded_accuracy:.2%}", f"{guarded_error_rate:.2%}", f"{guarded_latency:.2f}", f"{relative_error_reduction:.2%}", f"{latency_increase_percent:+.2f}%"],
}
summary_df = pd.DataFrame(summary_data)

print("\n--- Final Performance Summary (Dynamic Alpha Experiment) ---")
display(summary_df)

Loading secrets...
Secrets loaded successfully.

--- Starting Corrected Judging Process for guarded_judged_dynamic_alpha.csv ---
Found 617 already judged prompts. Resuming...


Judging guarded_judged_dynamic_alpha.csv: 100%|██████████| 617/617 [00:00<00:00, 23336.78it/s]


--- Final Performance Summary (Dynamic Alpha Experiment) ---





| Metric | Baseline Model | Guarded Model (Dynamic Alpha) |
|---|---|---|
| Accuracy | 38.57% | 47.65% |
| Hallucination Rate | 61.43% | 52.35% |
| Avg Latency (s) | 3.86 | 3.75 |
| Relative Error Reduction | N/A | 14.78% |
| Latency Increase | N/A | -2.78% |
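
As a sanity check, the derived metrics in the summary table above can be recomputed from the reported rates. This sketch uses the rounded values printed in the table as inputs, not the underlying per-prompt data, so small rounding differences are expected.

```python
# Values copied from the summary table (already rounded)
baseline_halluc = 0.6143
guarded_halluc = 0.5235
baseline_latency = 3.86
guarded_latency = 3.75

# Relative error reduction: share of baseline errors removed by the guardrail
rer = (baseline_halluc - guarded_halluc) / baseline_halluc

# Latency change in percent; negative means the guarded model was faster on average
latency_change = (guarded_latency - baseline_latency) / baseline_latency * 100

print(f"Relative error reduction: {rer:.2%}")
print(f"Latency change: {latency_change:+.2f}%")
```

The relative error reduction reproduces the reported 14.78%. The latency change comes out near -2.85% rather than the reported -2.78%, a gap that is consistent with the table rounding the mean latencies to two decimals before display.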
