# Evaluation: Combined Guardrail on AI2 ARC Challenge (Llama-3.1-8B)

This notebook evaluates the **combined guardrail approach** (dynamic risk-proportional steering + selective N-token steering) on the AI2 ARC Challenge dataset using the Llama-3.1-8B model. The combined method applies activation steering only to the first N tokens and scales the steering strength (alpha) based on a learned risk score, aiming to reduce hallucinations while preserving accuracy.

## Methodology
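- **Model**: Llama 8B Instruct, loaded as a 4-bit Unsloth checkpoint (the load log in Section 3 reports `unsloth/llama-3-8b-Instruct-bnb-4bit`), with activation steering along a hallucination vector.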
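- **Guardrail**: Prompts whose risk score falls below `tau_high` are generated without steering. For all other prompts, steering is applied only to the first N tokens (N=10, counted as forward passes through the steered layer), with strength alpha scaled linearly from 0 at `tau_high` to `optimal_alpha` at a risk score of 1.
- **Dataset**: AI2 ARC Challenge, multiple-choice format (1000 prompts sampled with a fixed seed from the 1172-question test split).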
- **Metrics**: Accuracy, average latency, and absolute improvement over baseline.
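
The risk-proportional part of the guardrail can be illustrated with a small sketch. It mirrors the scaling formula used in `answer_guarded_arc_combined` later in this notebook. The `tau_high` and `optimal_alpha` values here are placeholders chosen for illustration (the real ones come from `config.py`), and `dynamic_alpha` is a hypothetical helper name.

```python
def dynamic_alpha(risk_score, tau_high, optimal_alpha, eps=1e-6):
    """Return the steering strength for a given risk score.

    Below tau_high no steering is applied (0.0). From tau_high up to 1.0
    the strength scales linearly from 0 to optimal_alpha.
    """
    if risk_score < tau_high:
        return 0.0
    scaling = (risk_score - tau_high) / (1.0 - tau_high + eps)
    return optimal_alpha * max(0.0, min(1.0, scaling))


# Placeholder thresholds, NOT the values from config.py.
TAU_HIGH = 0.5
OPTIMAL_ALPHA = 2.0

for risk in (0.30, 0.50, 0.75, 1.00):
    alpha = dynamic_alpha(risk, TAU_HIGH, OPTIMAL_ALPHA)
    print(f"risk={risk:.2f} -> alpha={alpha:.2f}")
```

With these placeholder values, a prompt halfway between `tau_high` and 1.0 receives roughly half of `optimal_alpha`.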

## Results (as reported below)
- **Baseline Accuracy**: 78.00%
- **Combined Guarded Accuracy**: 78.30%
- **Absolute Accuracy Improvement**: +0.30 percentage points
- **Baseline Latency**: 0.37 s
- **Guarded Latency**: 0.65 s
- **Latency Increase**: +72.73%

**Note:** The guardrail yields no meaningful improvement on this multiple-choice dataset: the +0.30-point gain corresponds to 3 of 1000 questions, while average latency rises by about 73%. The method is better suited to long-form question answering, where hallucination risk is higher and steering can have a greater effect. For multiple-choice datasets like ARC, a negligible accuracy gain at a substantial latency cost makes this approach unsuitable.
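
A rough back-of-the-envelope check supports this reading. The sketch below treats each accuracy as a binomial proportion over 1000 prompts. That ignores the fact that both runs share the same questions, so it is only an approximation and not an analysis performed in this notebook.

```python
import math

n = 1000               # evaluated prompts
baseline_acc = 0.780   # reported baseline accuracy
guarded_acc = 0.783    # reported combined-guardrail accuracy

# Standard error of a single accuracy estimate under a binomial model.
se = math.sqrt(baseline_acc * (1 - baseline_acc) / n)
half_width_95 = 1.96 * se
gain = guarded_acc - baseline_acc

print(f"standard error: {se:.4f}")
print(f"95% half-width: {half_width_95:.4f}")
print(f"observed gain:  {gain:.4f}")
```

Under these assumptions the observed gain (about 0.003) sits well inside the sampling noise of a single accuracy estimate (about ±0.026).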

---

## 1. Environment and Requirements Setup
Unzips the project folder and installs all required Python packages for Colab execution.

In [1]:
# Unzip the project folder into the Colab environment
!unzip -o /content/HallucinationVectorProject.zip -d /content/

# Install all required packages from requirements.txt
!pip install -q -r /content/HallucinationVectorProject/requirements.txt

Archive:  /content/HallucinationVectorProject.zip
   creating: /content/HallucinationVectorProject/
   creating: /content/HallucinationVectorProject/artifacts/
  inflating: /content/HallucinationVectorProject/.DS_Store  
  inflating: /content/__MACOSX/HallucinationVectorProject/._.DS_Store  
  inflating: /content/HallucinationVectorProject/config.py  
  inflating: /content/HallucinationVectorProject/requirements.txt  
  inflating: /content/HallucinationVectorProject/train_risk_classifier.py  
   creating: /content/HallucinationVectorProject/results/
  inflating: /content/HallucinationVectorProject/evaluate_guardrail.py  
  inflating: /content/HallucinationVectorProject/utils.py  
  inflating: /content/HallucinationVectorProject/tune_guardrail_hyperparameters.py  
  inflating: /content/HallucinationVectorProject/build_hallucination_vector.py  
   creating: /content/HallucinationVectorProject/data/
   creating: /content/HallucinationVectorProject/plots/
  inflating: /content/Hallucinatio

## 2. Project Path and Config Setup
Adds the project directory to the Python path and configures the environment for Colab execution.

In [2]:
import sys
import os

# Add project directory to Python's path
project_path = '/content/HallucinationVectorProject'
if project_path not in sys.path:
    sys.path.append(project_path)

# Programmatically set the environment to 'colab' in the config file
config_file_path = os.path.join(project_path, 'config.py')
with open(config_file_path, 'r') as f:
    lines = f.readlines()
with open(config_file_path, 'w') as f:
    for line in lines:
        if line.strip().startswith('ENVIRONMENT ='):
            f.write('ENVIRONMENT = "colab"\n')
        else:
            f.write(line)
print("Environment configured for Colab execution.")

Environment configured for Colab execution.


## 3. Load Artifacts and Prepare Model
Loads all necessary artifacts (model, tokenizer, hallucination vector, risk classifier, thresholds) and prepares the model for inference.

In [3]:

import time
import pandas as pd
import torch
import csv
from tqdm import tqdm
import joblib
from unsloth import FastLanguageModel

# Import custom modules
import config
import utils

# This global dictionary will hold our models, tokenizer, vectors, etc.
artifacts = {}

def load_all_artifacts():
    """Loads all necessary model and project artifacts into the global dict."""
    if artifacts: return
    print("Loading all necessary artifacts for evaluation...")
    model, tokenizer = utils.load_model_and_tokenizer()

    # Put the Unsloth model into inference mode: disable gradient checkpointing and enable the KV cache
    model = FastLanguageModel.for_inference(model)
    model.gradient_checkpointing_disable()
    model.config.gradient_checkpointing = False
    model.config.use_cache = True
    model.eval()

    artifacts['model'] = model
    artifacts['tokenizer'] = tokenizer
    artifacts['v_halluc'] = torch.load("/content/HallucinationVectorProject/artifacts/v_halluc.pt").to(model.device)
    artifacts['risk_classifier'] = joblib.load("/content/HallucinationVectorProject/artifacts/risk_clf.joblib")
    artifacts['thresholds'] = {
        "tau_low": config.TAU_LOW,
        "tau_high": config.TAU_HIGH,
        "optimal_alpha": config.OPTIMAL_ALPHA
    }

# Load everything
load_all_artifacts()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading all necessary artifacts for evaluation...
Loading model and tokenizer: unsloth/llama-3-8b-Instruct-bnb-4bit
==((====))==  Unsloth 2025.9.7: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model and tokenizer loaded successfully.


## 4. Load ARC Challenge Dataset
Loads the AI2 ARC Challenge test split and samples 1000 of its 1172 questions (shuffled with a fixed seed) for evaluation.

In [4]:
from datasets import load_dataset
import pandas as pd

# Load the ARC dataset
ARC_VARIANT = 'ARC-Challenge'
full_dataset = load_dataset("allenai/ai2_arc", ARC_VARIANT)
eval_df = full_dataset['test'].shuffle(seed=42).select(range(1000)).to_pandas()
print(f"Loaded {len(eval_df)} evaluation prompts from {ARC_VARIANT}.")

Generating train split:   0%|          | 0/1119 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1172 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/299 [00:00<?, ? examples/s]

Loaded 1000 evaluation prompts from ARC-Challenge.


## 5. Guardrail Logic and Baseline Functions
Defines helper functions for prompt formatting, answer extraction, and the core combined guardrail logic (dynamic alpha + selective N-tokens) for ARC Challenge. Also defines the baseline (no steering) generation function.

In [5]:
import re
from contextlib import contextmanager

print("Defining combined guardrail logic for the AI2 ARC dataset...")

# --- ARC Dataset Helper Functions ---

def format_arc_prompt(question, choices):
    """Formats a question and its choices for the ARC dataset."""
    formatted_choices = "\n".join([f"{label}) {text}" for label, text in zip(choices['label'], choices['text'])])
    prompt = (
        "<|begin_of_text|>\n"
        "<|start_header_id|>system<|end_header_id|>\n"
        "You are a helpful AI assistant.\n"
        "<|eot_id|>\n"
        "<|start_header_id|>user<|end_header_id|>\n"
        f"Question: {question}\n\nChoices:\n{formatted_choices}\n\nProvide ONLY the letter of the correct answer.\n"
        "The correct answer is: "
        "<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n"
    )
    return prompt

def parse_and_score_answer(generated_answer, correct_answer_key):
    """Extracts the first standalone capital letter A-D from the generated text and scores it against the answer key."""
    match = re.search(r'\b([A-D])\b', generated_answer)
    extracted_key = None
    is_correct = 0
    if match:
        extracted_key = match.group(1)
        if extracted_key == correct_answer_key:
            is_correct = 1
    return is_correct, extracted_key

# --- The SelectiveActivationSteerer Class ---

class SelectiveActivationSteerer:
    """Adds a scaled steering vector to one decoder layer's output for the first N forward passes."""
    def __init__(self, model, steering_vector, layer_idx, coeff=1.0, steering_token_limit=10):
        self.model, self.vector = model, steering_vector
        self.layer_idx, self.coeff = layer_idx, coeff
        self.steering_token_limit = steering_token_limit
        self._layer_path = f"model.layers.{self.layer_idx}"
        self._handle, self.call_count = None, 0

    def _hook_fn(self, module, ins, out):
        # Counts forward passes through the layer, not tokens. With a KV cache the
        # first pass typically covers the whole prompt and each later pass one new token.
        self.call_count += 1
        if self.call_count > self.steering_token_limit:
            return out
        steered = out[0] + self.coeff * self.vector.to(out[0].device)
        return (steered,) + out[1:]

    def __enter__(self):
        self.call_count = 0
        self._handle = self.model.get_submodule(self._layer_path).register_forward_hook(self._hook_fn)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._handle:
            self._handle.remove()

# --- The New `answer_guarded_arc_combined` Function ---

def answer_guarded_arc_combined(prompt_row, max_new_tokens=128, steering_token_limit=10):
    """Generates a response for ARC using DYNAMIC alpha and SELECTIVE steering."""
    start_time = time.time()
    question, choices, correct_answer_key = prompt_row['question'], prompt_row['choices'], prompt_row['answerKey']

    risk_score = utils.get_hallucination_risk(question, artifacts['model'], artifacts['tokenizer'], artifacts['v_halluc'], artifacts['risk_classifier'])

    full_prompt = format_arc_prompt(question, choices)
    inputs = artifacts['tokenizer'](full_prompt, return_tensors="pt").to(artifacts['model'].device)
    input_token_length = inputs.input_ids.shape[1]

    if risk_score < artifacts['thresholds']['tau_high']:
        path = "Fast Path (Untouched)"
        with torch.no_grad():
            outputs = artifacts['model'].generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    else:
        optimal_alpha = artifacts['thresholds']['optimal_alpha']
        tau_high = artifacts['thresholds']['tau_high']
        scaling_factor = (risk_score - tau_high) / (1.0 - tau_high + 1e-6)
        dynamic_alpha = optimal_alpha * max(0, min(1, scaling_factor))
        path = f"Combined Steer Path (α={dynamic_alpha:.2f}, N={steering_token_limit})"

        with SelectiveActivationSteerer(artifacts['model'], artifacts['v_halluc'], config.TARGET_LAYER, coeff=dynamic_alpha, steering_token_limit=steering_token_limit):
            with torch.no_grad():
                outputs = artifacts['model'].generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    answer_text = artifacts['tokenizer'].decode(outputs[0, input_token_length:], skip_special_tokens=True)
    is_correct, extracted_key = parse_and_score_answer(answer_text, correct_answer_key)

    return {
        "answer": answer_text.strip(), "risk_score": risk_score, "path_taken": path,
        "is_correct": is_correct, "extracted_key": extracted_key,
        "latency_seconds": time.time() - start_time
    }

# --- The Baseline Generation Function for ARC ---

def generate_baseline_arc(prompt_row, max_new_tokens=128):
    """Generates a baseline response for the ARC dataset."""
    start_time = time.time()
    question, choices, correct_answer_key = prompt_row['question'], prompt_row['choices'], prompt_row['answerKey']

    full_prompt = format_arc_prompt(question, choices)
    inputs = artifacts['tokenizer'](full_prompt, return_tensors="pt").to(artifacts['model'].device)
    input_token_length = inputs.input_ids.shape[1]

    with torch.no_grad():
        outputs = artifacts['model'].generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    answer_text = artifacts['tokenizer'].decode(outputs[0, input_token_length:], skip_special_tokens=True)
    is_correct, extracted_key = parse_and_score_answer(answer_text, correct_answer_key)

    return {
        "answer": answer_text.strip(), "is_correct": is_correct,
        "extracted_key": extracted_key, "latency_seconds": time.time() - start_time
    }

print("All ARC-specific functions for the combined experiment are now defined.")

Defining combined guardrail logic for the AI2 ARC dataset...
All ARC-specific functions for the combined experiment are now defined.
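
One limitation worth noting: `parse_and_score_answer` only matches the letters A-D. The ARC data includes some items whose choices are labelled 1-4 or that have a fifth option E, and that pattern can never score such items as correct. A minimal sketch of a label-aware variant follows. `parse_with_labels` is a hypothetical helper, not part of the project, and switching to it would change the reported numbers.

```python
import re

def parse_with_labels(generated_answer, correct_answer_key, labels):
    """Find the first standalone choice label in the text and score it.

    `labels` is the list of valid labels for this question,
    e.g. ['A', 'B', 'C', 'D'] or ['1', '2', '3', '4'].
    """
    # Build an alternation such as (A|B|C|D) from the question's own labels.
    alternatives = "|".join(re.escape(str(label)) for label in labels)
    match = re.search(rf"\b({alternatives})\b", generated_answer)
    if match is None:
        return 0, None
    extracted_key = match.group(1)
    return int(extracted_key == correct_answer_key), extracted_key


print(parse_with_labels("B", "B", ["A", "B", "C", "D"]))
print(parse_with_labels("The answer is 3.", "3", ["1", "2", "3", "4"]))
print(parse_with_labels("I am not sure.", "A", ["A", "B", "C", "D"]))
```

In this notebook the labels would come from `prompt_row['choices']['label']`. Like the original, this variant still takes the first standalone label it finds, so an answer that begins with the article "A" would be read as choice A.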


## 6. Suppress Warnings
Suppresses specific sklearn warnings for cleaner output during evaluation.

In [6]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names",
    category=UserWarning,
    module="sklearn"
)

## 7. Evaluation Loop and Results Saving
Runs the main evaluation loop for both the combined guardrail and baseline models, saving results to CSV files for later analysis.

In [7]:
import csv
from tqdm import tqdm

# --- EXPERIMENT PARAMETER ---
STEERING_TOKEN_LIMIT = 10

# --- Define Output Paths ---
ARC_VARIANT_LOWER = ARC_VARIANT.lower().replace('-', '_')
GUARDED_RESULTS_PATH_ARC = os.path.join("/content/HallucinationVectorProject/results/arc_evals", f"guarded_results_{ARC_VARIANT_LOWER}_combined.csv")
BASELINE_RESULTS_PATH_ARC = os.path.join("/content/HallucinationVectorProject/results/arc_evals", f"baseline_results_{ARC_VARIANT_LOWER}.csv")

print(f"Guarded results will be saved to: {GUARDED_RESULTS_PATH_ARC}")
print(f"Baseline results will be saved to: {BASELINE_RESULTS_PATH_ARC}")


# --- Initialize CSVs and Load Progress ---
guarded_headers = ['id', 'question', 'answer_key', 'answer', 'risk_score', 'path_taken', 'is_correct', 'extracted_key', 'latency_seconds']
baseline_headers = ['id', 'question', 'answer_key', 'answer', 'is_correct', 'extracted_key', 'latency_seconds']

# Helper to initialize and load progress using the 'id' column
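def initialize_csv_arc(file_path, headers):
    if not os.path.exists(file_path):
        # Create the output directory first; open() fails if it does not exist yet
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, 'w', newline='', encoding='utf-8') as f:
            csv.writer(f).writerow(headers)
        return set()
    return set(pd.read_csv(file_path)['id'].tolist())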

processed_guarded = initialize_csv_arc(GUARDED_RESULTS_PATH_ARC, guarded_headers)
processed_baseline = initialize_csv_arc(BASELINE_RESULTS_PATH_ARC, baseline_headers)
print(f"Found {len(processed_guarded)} already processed prompts for Guarded run.")
print(f"Found {len(processed_baseline)} already processed prompts for Baseline run.")

# --- Main Evaluation Loop ---
for _, row in tqdm(eval_df.iterrows(), total=len(eval_df), desc=f"Evaluating on {ARC_VARIANT}"):
    prompt_id = row['id']

    # Guarded Run
    if prompt_id not in processed_guarded:
        try:
            result = answer_guarded_arc_combined(row, steering_token_limit=STEERING_TOKEN_LIMIT)
            with open(GUARDED_RESULTS_PATH_ARC, 'a', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow([prompt_id, row['question'], row['answerKey']] + list(result.values()))
        except Exception as e:
            print(f"Error on guarded prompt id: {prompt_id}. Error: {e}")

    # Baseline Run
    if prompt_id not in processed_baseline:
        try:
            result = generate_baseline_arc(row)
            with open(BASELINE_RESULTS_PATH_ARC, 'a', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow([prompt_id, row['question'], row['answerKey']] + list(result.values()))
        except Exception as e:
            print(f"Error on baseline prompt id: {prompt_id}. Error: {e}")

print("Evaluation on ARC dataset complete.")

Guarded results will be saved to: /content/HallucinationVectorProject/results/arc_evals/guarded_results_arc_challenge_combined.csv
Baseline results will be saved to: /content/HallucinationVectorProject/results/arc_evals/baseline_results_arc_challenge.csv
Found 0 already processed prompts for Guarded run.
Found 0 already processed prompts for Baseline run.


Evaluating on ARC-Challenge: 100%|██████████| 1000/1000 [17:04<00:00,  1.02s/it]

Evaluation on ARC dataset complete.
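
The resume-and-append pattern used by the loop can be sketched in isolation. This version uses `csv.DictWriter` so that values are matched to columns by name, not by the order of `result.values()`. The helper names, the temporary output path, and the toy rows are illustrative assumptions, not project code.

```python
import csv
import os
import tempfile

HEADERS = ["id", "answer", "is_correct"]

def load_processed_ids(path, headers):
    """Create the CSV with a header if missing; return the ids already saved."""
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.DictWriter(f, fieldnames=headers).writeheader()
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}

def append_result(path, headers, row):
    """Append one result row, matching values to columns by key."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.DictWriter(f, fieldnames=headers).writerow(row)

# Toy run: the second pass skips ids that the first pass already wrote.
out_path = os.path.join(tempfile.mkdtemp(), "evals", "demo.csv")
prompts = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]

for batch in (prompts[:2], prompts):  # simulate an interrupted run, then a rerun
    done = load_processed_ids(out_path, HEADERS)
    for prompt in batch:
        if prompt["id"] in done:
            continue
        append_result(out_path, HEADERS, {"id": prompt["id"], "answer": "A", "is_correct": 1})

print(sorted(load_processed_ids(out_path, HEADERS)))
```

Each row is written exactly once even though the second pass iterates over all prompts again, which is the property that makes an interrupted Colab session safe to restart.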





## 8. Results Analysis and Summary Table
Loads the saved results, computes accuracy and latency metrics, and displays a summary table comparing the baseline and combined guardrail models.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# --- Load Final Results ---
guarded_df = pd.read_csv(GUARDED_RESULTS_PATH_ARC)
baseline_df = pd.read_csv(BASELINE_RESULTS_PATH_ARC)

# --- Calculate Metrics ---
baseline_accuracy = baseline_df['is_correct'].mean()
guarded_accuracy = guarded_df['is_correct'].mean()
absolute_improvement = guarded_accuracy - baseline_accuracy

baseline_latency = baseline_df['latency_seconds'].mean()
guarded_latency = guarded_df['latency_seconds'].mean()
latency_increase_percent = (guarded_latency - baseline_latency) / baseline_latency * 100

# --- Display Summary Table ---
summary_data = {
    "Metric": ["Accuracy", "Avg Latency (s)", "Absolute Accuracy Improvement", "Latency Increase"],
    "Baseline Model": [f"{baseline_accuracy:.2%}", f"{baseline_latency:.2f}", "N/A", "N/A"],
    "Combined Guarded Model": [f"{guarded_accuracy:.2%}", f"{guarded_latency:.2f}", f"{absolute_improvement:+.2%}", f"{latency_increase_percent:+.2f}%"],
}
summary_df = pd.DataFrame(summary_data)
print(f"\n--- Final Performance Summary ({ARC_VARIANT}) ---")
display(summary_df)



--- Final Performance Summary (ARC-Challenge) ---


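|   | Metric | Baseline Model | Combined Guarded Model |
|---|--------|----------------|------------------------|
| 0 | Accuracy | 78.00% | 78.30% |
| 1 | Avg Latency (s) | 0.37 | 0.65 |
| 2 | Absolute Accuracy Improvement | N/A | +0.30% |
| 3 | Latency Increase | N/A | +72.73% |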
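
Because both runs answer the same prompts, a paired view can be more informative than comparing two averages. The sketch below counts the questions on which the two systems disagree and computes an exact two-sided McNemar-style p-value. It uses made-up correctness flags keyed by `id`, not the notebook's actual results. With the real CSVs, the two dictionaries could be built from the `id` and `is_correct` columns.

```python
from math import comb

# Toy correctness flags keyed by prompt id (illustrative only).
baseline = {"q1": 1, "q2": 1, "q3": 0, "q4": 0, "q5": 1, "q6": 0}
guarded = {"q1": 1, "q2": 0, "q3": 1, "q4": 1, "q5": 1, "q6": 0}

shared_ids = baseline.keys() & guarded.keys()
# Discordant pairs: questions where exactly one of the two systems is correct.
only_baseline = sum(1 for i in shared_ids if baseline[i] == 1 and guarded[i] == 0)
only_guarded = sum(1 for i in shared_ids if baseline[i] == 0 and guarded[i] == 1)

def exact_mcnemar_p(b, c):
    """Two-sided exact binomial p-value on the discordant pairs."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_value = exact_mcnemar_p(only_baseline, only_guarded)
print(f"fixed by guardrail: {only_guarded}, broken by guardrail: {only_baseline}, p={p_value:.3f}")
```

A net gain of a few questions can hide a larger number of answers that flipped in both directions, which this count makes visible.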
