# Combined Dynamic Alpha + Selective N-Tokens Steering: Hallucination Guardrail Evaluation
 
**Summary:**
This notebook evaluates a combined guardrail approach for Llama-3.1-8B that integrates both Dynamic Alpha (risk-proportional steering strength) and Selective N-Tokens Steering (applying the intervention only to the first N tokens). This method applies a dynamic, risk-scaled correction for high-risk prompts, but only during the initial generation steps, maximizing hallucination reduction while minimizing latency and preserving answer quality. This combined approach outperforms both individual ablations and is used for all further evaluations on other datasets.

- **Dynamic Alpha:** Steering strength (alpha) is scaled based on prompt risk, providing stronger correction for riskier prompts.
- **Selective N-Tokens:** Steering is applied only to the first 10 generated tokens, focusing intervention where it is most effective.

**Key Results (TruthfulQA Benchmark):**
- **Baseline Model:** Accuracy: 38.57%, Hallucination Rate: 61.43%, Avg Latency: 3.86s
- **Combined Guarded Model:** Accuracy: 52.04%, Hallucination Rate: 47.96%, Avg Latency: 3.56s
- **Relative Error Reduction:** 21.93%
- **Latency Increase:** -7.78% (latency decreased)

This combined guardrail achieves the best trade-off between hallucination reduction, accuracy, and latency, and is therefore used for all subsequent cross-domain evaluations.

### **Environment and Requirements Setup**
Setup for local execution on Lambda Labs A100 40GB GPU with Llama-3.1-8B.

In [2]:
!pip install -q --no-deps "trl==0.23.0" "peft==0.17.1" "accelerate==1.11.0" "bitsandbytes==0.48.2"

In [7]:
!uv pip install "unsloth==2025.10.12" "transformers==4.57.1" "tqdm==4.67.1" "ipywidgets==8.1.7" "pandas==2.2.2" "numpy==2.0.2" "datasets==4.3.0" "scikit-learn==1.7.2" "joblib==1.4.2" "matplotlib==3.10.0" "seaborn==0.13.2" "huggingface_hub==0.36.0" "python-dotenv==1.0.1" "setuptools==75.8.0" "wheel==0.45.1"

[2mUsing Python 3.10.12 environment at: /home/ubuntu/HallucinationVectorProject/venv[0m
[2K[37m⠹[0m [2mtorch==2.9.0                                                                  [0m

[2K[2mResolved [1m123 packages[0m [2min 529ms[0m[0m                                       [0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/97)                                                  
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/97)------------------[0m[0m     0 B/552.83 KiB          [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/97)------------------[0m[0m 14.88 KiB/552.83 KiB        [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/97)------------------[0m[0m 14.88 KiB/552.83 KiB        [1A
[2mmpmath              [0m [32m[30m[2m------------------------------[0m[0m     0 B/523.63 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/97)------------------[0m[0m 14.88 KiB/552.83 KiB        [2A
[2mmpmath              [0m [32m-[30m[2m-----------------------------[0m[0m 14.87 KiB/523.63 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/97)------------------[0m[0m 14.88 KiB/552.83 KiB        [2A
[2mmpmath       

In [8]:
!uv pip install -q --index-url https://download.pytorch.org/whl/cu128 torch torchvision

In [9]:
!uv pip install -q "xformers==0.0.33" --index-url https://download.pytorch.org/whl/cu128

In [10]:

import os
import sys
# Verify environment
import torch
print(f"\nEnvironment verification:")
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")


Environment verification:
Python version: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]
PyTorch version: 2.8.0+cu128
CUDA available: True
CUDA version: 12.8
GPU count: 1
  GPU 0: NVIDIA A100-SXM4-40GB


In [11]:
# Load API Keys from environment variables
import os
from dotenv import load_dotenv

load_dotenv()

# Load the keys from environment variables
try:
    HF_TOKEN = os.environ.get('HF_TOKEN')
    SCALEDOWN_API_KEY = os.environ.get('SCALEDOWN_API_KEY')
    
    if not HF_TOKEN:
        raise ValueError("HF_TOKEN not found in environment variables. Please set it before running.")
    if not SCALEDOWN_API_KEY:
        raise ValueError("SCALEDOWN_API_KEY not found in environment variables. Please set it before running.")
    
    # Set HuggingFace token in environment for model loading
    os.environ["HF_TOKEN"] = HF_TOKEN
    
    print("✓ API keys loaded successfully from environment variables")
    print(f"  HF_TOKEN: {HF_TOKEN[:10]}..." if HF_TOKEN else "  HF_TOKEN: Not set")
    print(f"  SCALEDOWN_API_KEY: {SCALEDOWN_API_KEY[:10]}..." if SCALEDOWN_API_KEY else "  SCALEDOWN_API_KEY: Not set")
    
except Exception as e:
    print(f"ERROR loading API keys: {e}")
    print("Please ensure you have set the following environment variables:")
    print("  export HF_TOKEN='your_huggingface_token'")
    print("  export SCALEDOWN_API_KEY='your_scaledown_api_key'")

✓ API keys loaded successfully from environment variables
  HF_TOKEN: hf_NrlndFS...
  SCALEDOWN_API_KEY: OMJ5hWc0m4...


### **Project Path Setup**
Sets up local project paths for Lambda Labs execution.

In [12]:
import sys
import os
from pathlib import Path

# Setup local project paths
PROJECT_DIR = Path("/home/ubuntu/HallucinationVectorProject")
DATA_DIR = PROJECT_DIR / "data"
ARTIFACTS_DIR = PROJECT_DIR / "artifacts" / "llama-3.1-8b"
RESULTS_DIR = PROJECT_DIR / "results" / "llama-3.1-8b"

# Create directories if they don't exist
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

# Add project directory to Python's path
project_path = str(PROJECT_DIR)
if project_path not in sys.path:
    sys.path.append(project_path)

print(f"Project directory: {PROJECT_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Artifacts directory: {ARTIFACTS_DIR}")
print(f"Results directory: {RESULTS_DIR}")

# Programmatically set the environment to 'local' in the config file
config_file_path = PROJECT_DIR / 'config.py'
with open(config_file_path, 'r') as f:
    lines = f.readlines()
with open(config_file_path, 'w') as f:
    for line in lines:
        if line.strip().startswith('ENVIRONMENT ='):
            f.write('ENVIRONMENT = "local"\n')
        else:
            f.write(line)
print("✓ Environment configured for local Lambda Labs execution.")

Project directory: /home/ubuntu/HallucinationVectorProject
Data directory: /home/ubuntu/HallucinationVectorProject/data
Artifacts directory: /home/ubuntu/HallucinationVectorProject/artifacts/llama-3.1-8b
Results directory: /home/ubuntu/HallucinationVectorProject/results/llama-3.1-8b
✓ Environment configured for local Lambda Labs execution.


### **Load Artifacts and Model Setup**
Loads all required model artifacts for Llama-3.1-8B, including the hallucination vector, risk classifier, and config thresholds, and prepares the model for evaluation on A100 40GB GPU.

In [13]:
import time
import pandas as pd
import torch
import csv
import sys
from pathlib import Path
from tqdm import tqdm
import joblib
from unsloth import FastLanguageModel

# Ensure project directory is in path before importing custom modules
PROJECT_DIR = Path("/home/ubuntu/HallucinationVectorProject")
if str(PROJECT_DIR) not in sys.path:
    sys.path.insert(0, str(PROJECT_DIR))

# Import custom modules


os.environ["UNSLOTH_STABLE_DOWNLOADS"] = "1"


# Helper function to monitor GPU memory
def print_gpu_memory():
    """Print memory usage for all available GPUs."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1024**3
        reserved = torch.cuda.memory_reserved(0) / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU 0 ({torch.cuda.get_device_name(0)}): "
              f"{allocated:.2f}GB allocated, {reserved:.2f}GB reserved, {total:.2f}GB total")

# This global dictionary will hold our models, tokenizer, vectors, etc.
artifacts = {}

def load_all_artifacts():
    """Loads all necessary model and project artifacts into the global dict."""
    if artifacts: return
    print("Loading all necessary artifacts for Llama-3.1-8B evaluation...")
    
    print("\nGPU memory before model loading:")
    print_gpu_memory()
    
    # Clear any cached memory
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # Load 8B model for single GPU
    print("\nLoading Llama-3.1-8B model (bfloat16)...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
        max_seq_length=4096,
        dtype=torch.bfloat16,
        load_in_4bit=False,
        trust_remote_code=True,
    )

    # Configure for inference
    model = FastLanguageModel.for_inference(model)
    model.gradient_checkpointing_disable()
    model.config.gradient_checkpointing = False
    model.config.use_cache = True
    model.eval()

    artifacts['model'] = model
    artifacts['tokenizer'] = tokenizer
    artifacts['v_halluc'] = torch.load(ARTIFACTS_DIR / "v_halluc.pt").to(model.device).to(torch.bfloat16)
    artifacts['risk_classifier'] = joblib.load(ARTIFACTS_DIR / "risk_clf.joblib")
    artifacts['thresholds'] = {
        "tau_low": 0.0390,
        "tau_high": 0.0547,
        "optimal_alpha": -3.0
    }
    
    print("\n✓ All artifacts loaded successfully!")
    print(f"Model device: {model.device}")
    print("\nGPU memory after loading:")
    print_gpu_memory()

# Load everything
load_all_artifacts()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.9.0+cu129 with CUDA 1209 (you have 2.8.0+cu128)
    Python  3.12.12 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.9.0+cu129 with CUDA 1209 (you have 2.8.0+cu128)
    Python  3.12.12 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading all necessary artifacts for Llama-3.1-8B evaluation...

GPU memory before model loading:
GPU 0 (NVIDIA A100-SXM4-40GB): 0.00GB allocated, 0.00GB reserved, 39.49GB total

Loading Llama-3.1-8B model (bfloat16)...
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.

model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]


✓ All artifacts loaded successfully!
Model device: cuda:0

GPU memory after loading:
GPU 0 (NVIDIA A100-SXM4-40GB): 14.98GB allocated, 15.09GB reserved, 39.49GB total


**FIXED:** If you encounter a `KeyError: 'align_logprobs_with_mask'` error when importing unsloth, this is due to version incompatibility between Unsloth and TRL library changes from recent hours.

**Solution:**
```bash
# Uninstall conflicting packages
pip uninstall -y unsloth unsloth-zoo trl transformers peft

# Install compatible base versions
pip install transformers==4.46.2 peft==0.13.2 trl==0.12.1

# Install Unsloth without dependencies, then unsloth-zoo
pip install --upgrade --no-cache-dir --no-deps "unsloth @ git+https://github.com/unslothai/unsloth.git"
pip install unsloth-zoo
```

This installs:
- **transformers**: 4.57.1 (auto-upgraded by unsloth-zoo)
- **trl**: 0.23.0 (auto-upgraded by unsloth-zoo)  
- **peft**: 0.13.2
- **tokenizers**: 0.22.1
- **unsloth**: 2025.10.11
- **unsloth-zoo**: 2025.10.12

**After reinstalling, RESTART THE KERNEL before running cells.**

### **Combined Selective Activation Steering and Guardrail Function**
Defines a context manager to apply the steering vector only to the first N tokens, with dynamic risk-proportional strength, and a function to generate answers using this combined intervention.

In [14]:
import time
import torch
import sys
from pathlib import Path
from contextlib import contextmanager

# Ensure project directory is in path before importing custom modules
PROJECT_DIR = Path("/home/ubuntu/HallucinationVectorProject")
if str(PROJECT_DIR) not in sys.path:
    sys.path.insert(0, str(PROJECT_DIR))

# Import our project's config and utils modules
import config
import utils

print("Defining combined logic for 'Dynamic Alpha + Selective Steering' experiment...")

# --- 1. The SelectiveActivationSteerer Class ---

class SelectiveActivationSteerer:
    def __init__(self, model, steering_vector, layer_idx, coeff=1.0, steering_token_limit=10):
        self.model = model
        self.vector = steering_vector
        self.layer_idx = layer_idx
        self.coeff = coeff
        self.steering_token_limit = steering_token_limit
        self._handle = None
        self._layer_path = f"model.layers.{self.layer_idx}"
        self.call_count = 0

    def _hook_fn(self, module, ins, out):
        self.call_count += 1
        if self.call_count <= self.steering_token_limit:
            steered_output = out[0] + (self.coeff * self.vector.to(out[0].device))
            return (steered_output,) + out[1:]
        return out

    def __enter__(self):
        self.call_count = 0
        try:
            layer = self.model.get_submodule(self._layer_path)
            self._handle = layer.register_forward_hook(self._hook_fn)
        except AttributeError:
            raise AttributeError(f"Could not find the layer at path: {self._layer_path}")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._handle:
            self._handle.remove()


# --- 2. The New `answer_guarded_combined` Function ---
# This function combines the logic from both successful ablations.

def answer_guarded_combined(prompt_text: str, max_new_tokens: int = 128, steering_token_limit: int = 10):
    """
    Generates a response using the guardrail with DYNAMIC alpha and SELECTIVE steering.
    Enhanced for 70B model with proper device handling and memory management.
    """
    start_time = time.time()

    try:
        risk_score = utils.get_hallucination_risk(
            prompt_text, artifacts['model'], artifacts['tokenizer'],
            artifacts['v_halluc'], artifacts['risk_classifier']
        )

        full_prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\nAnswer the following briefly.\n{prompt_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        inputs = artifacts['tokenizer'](full_prompt, return_tensors="pt", max_length=4096, truncation=True).to(artifacts['model'].device)
        input_token_length = inputs.input_ids.shape[1]

        if risk_score < artifacts['thresholds']['tau_high']:
            path = "Fast Path (Untouched)"
            with torch.no_grad():
                outputs = artifacts['model'].generate(
                    **inputs, 
                    max_new_tokens=max_new_tokens, 
                    do_sample=False,
                    pad_token_id=artifacts['tokenizer'].eos_token_id
                )
        else:
            # From Dynamic Alpha Ablation: Calculate dynamic steering strength.
            optimal_alpha = artifacts['thresholds']['optimal_alpha']
            tau_high = artifacts['thresholds']['tau_high']
            scaling_factor = (risk_score - tau_high) / (1.0 - tau_high + 1e-6) # Add epsilon for stability
            dynamic_alpha = optimal_alpha * max(0, min(1, scaling_factor)) # Clamp between 0 and 1

            path = f"Combined Steer Path (α={dynamic_alpha:.2f}, N={steering_token_limit})"

            # From Selective N-Tokens Ablation: Use the steerer with a token limit.
            # We pass our newly calculated `dynamic_alpha` as the coefficient.
            with SelectiveActivationSteerer(
                artifacts['model'], artifacts['v_halluc'], config.TARGET_LAYER,
                coeff=dynamic_alpha,
                steering_token_limit=steering_token_limit
            ):
                with torch.no_grad():
                    outputs = artifacts['model'].generate(
                        **inputs, 
                        max_new_tokens=max_new_tokens, 
                        do_sample=False,
                        pad_token_id=artifacts['tokenizer'].eos_token_id
                    )

        answer = artifacts['tokenizer'].decode(outputs[0, input_token_length:], skip_special_tokens=True)
        latency = time.time() - start_time

        return {"answer": answer.strip(), "risk_score": risk_score, "path_taken": path, "latency_seconds": latency}
    
    except Exception as e:
        print(f"Error in answer_guarded_combined: {e}")
        latency = time.time() - start_time
        return {"answer": "", "risk_score": 0.5, "path_taken": "Error", "latency_seconds": latency}

print("✓ New function `answer_guarded_combined` is now defined and ready for the experiment.")

Defining combined logic for 'Dynamic Alpha + Selective Steering' experiment...
✓ New function `answer_guarded_combined` is now defined and ready for the experiment.


**Suppress Warnings**
Suppresses specific sklearn warnings for cleaner output during evaluation.

In [15]:
import warnings

warnings.filterwarnings(
    "ignore",
    message="X does not have valid feature names",
    category=UserWarning,
    module="sklearn"
)

### **Run Combined Guardrail Evaluation**
Runs the evaluation loop on the TruthfulQA test set, applying the combined guardrail and saving results for each prompt.

In [18]:
# --- EXPERIMENT PARAMETER ---
STEERING_TOKEN_LIMIT = 10 # The 'N' for our selective steering

# Use local paths
GUARDED_RESULTS_PATH_COMBINED = RESULTS_DIR / "combined_guarded_results.csv"
BASELINE_RESULTS_PATH = RESULTS_DIR / "ablation_2_baseline_results_truthfulqa.csv"

print(f"New guarded results will be saved to: {GUARDED_RESULTS_PATH_COMBINED}")

# Memory management helper
def check_and_clear_memory(threshold_gb=60):
    """Clear GPU cache if memory usage exceeds threshold."""
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            if allocated > threshold_gb:
                print(f"GPU {i} memory ({allocated:.2f}GB) exceeds threshold. Clearing cache...")
                torch.cuda.empty_cache()
                return True
    return False

# Load the test set (using local path)
test_df = pd.read_csv(DATA_DIR / "final_test_set_truthfulqa.csv")
print(f"Loaded {len(test_df)} test prompts from TruthfulQA")

# --- Resilient Evaluation Loop ---
guarded_headers = ['prompt', 'answer', 'risk_score', 'path_taken', 'latency_seconds']
utils.initialize_csv(GUARDED_RESULTS_PATH_COMBINED, guarded_headers)

processed_guarded = utils.load_processed_prompts(GUARDED_RESULTS_PATH_COMBINED)

print(f"Starting response generation for combined guardrail (Dynamic Alpha + Selective N={STEERING_TOKEN_LIMIT})...")
print(f"Already processed: {len(processed_guarded)} prompts")

start_time = time.time()
processed_count = len(processed_guarded)

for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Combined Guardrail Evaluation"):
    prompt = row['Question']

    # Guarded Run
    if prompt not in processed_guarded:
        try:
            result = answer_guarded_combined(prompt, steering_token_limit=STEERING_TOKEN_LIMIT)
            with open(GUARDED_RESULTS_PATH_COMBINED, 'a', newline='', encoding='utf-8') as f:
                csv.writer(f).writerow([prompt] + list(result.values()))
            
            processed_count += 1
            
            # Progress tracking and memory management
            if processed_count % 10 == 0:
                elapsed = time.time() - start_time
                rate = processed_count / elapsed if elapsed > 0 else 0
                remaining = len(test_df) - processed_count
                eta = remaining / rate if rate > 0 else 0
                print(f"Progress: {processed_count}/{len(test_df)} ({processed_count/len(test_df)*100:.1f}%) | "
                      f"Rate: {rate:.2f} prompts/s | ETA: {eta/60:.1f} min")
                check_and_clear_memory()
                
        except Exception as e:
            print(f"Error on guarded prompt: {prompt[:50]}... Error: {e}")

print(f"\n✓ Combined guardrail evaluation complete in {(time.time() - start_time)/60:.2f} minutes")
print(f"Results saved to: {GUARDED_RESULTS_PATH_COMBINED}")

New guarded results will be saved to: /home/ubuntu/HallucinationVectorProject/results/llama-3.1-8b/combined_guarded_results.csv
Loaded 617 test prompts from TruthfulQA
Initialized CSV file at: /home/ubuntu/HallucinationVectorProject/results/llama-3.1-8b/combined_guarded_results.csv
Starting response generation for combined guardrail (Dynamic Alpha + Selective N=10)...
Already processed: 0 prompts


Combined Guardrail Evaluation:   0%|          | 0/617 [00:00<?, ?it/s]

Combined Guardrail Evaluation:   2%|▏         | 10/617 [00:20<21:26,  2.12s/it]

Progress: 10/617 (1.6%) | Rate: 0.48 prompts/s | ETA: 21.2 min


Combined Guardrail Evaluation:   3%|▎         | 20/617 [00:47<22:03,  2.22s/it]

Progress: 20/617 (3.2%) | Rate: 0.42 prompts/s | ETA: 23.4 min


Combined Guardrail Evaluation:   5%|▍         | 30/617 [01:04<23:07,  2.36s/it]

Progress: 30/617 (4.9%) | Rate: 0.47 prompts/s | ETA: 20.9 min


Combined Guardrail Evaluation:   6%|▋         | 40/617 [01:22<14:13,  1.48s/it]

Progress: 40/617 (6.5%) | Rate: 0.48 prompts/s | ETA: 19.9 min


Combined Guardrail Evaluation:   8%|▊         | 50/617 [01:37<10:11,  1.08s/it]

Progress: 50/617 (8.1%) | Rate: 0.51 prompts/s | ETA: 18.4 min


Combined Guardrail Evaluation:  10%|▉         | 60/617 [01:59<18:56,  2.04s/it]

Progress: 60/617 (9.7%) | Rate: 0.50 prompts/s | ETA: 18.5 min


Combined Guardrail Evaluation:  11%|█▏        | 70/617 [02:20<20:12,  2.22s/it]

Progress: 70/617 (11.3%) | Rate: 0.50 prompts/s | ETA: 18.3 min


Combined Guardrail Evaluation:  13%|█▎        | 80/617 [02:42<16:03,  1.79s/it]

Progress: 80/617 (13.0%) | Rate: 0.49 prompts/s | ETA: 18.2 min


Combined Guardrail Evaluation:  15%|█▍        | 90/617 [03:01<17:55,  2.04s/it]

Progress: 90/617 (14.6%) | Rate: 0.49 prompts/s | ETA: 17.7 min


Combined Guardrail Evaluation:  16%|█▌        | 100/617 [03:20<17:22,  2.02s/it]

Progress: 100/617 (16.2%) | Rate: 0.50 prompts/s | ETA: 17.3 min


Combined Guardrail Evaluation:  18%|█▊        | 110/617 [03:42<19:31,  2.31s/it]

Progress: 110/617 (17.8%) | Rate: 0.49 prompts/s | ETA: 17.1 min


Combined Guardrail Evaluation:  19%|█▉        | 120/617 [04:01<15:46,  1.90s/it]

Progress: 120/617 (19.4%) | Rate: 0.50 prompts/s | ETA: 16.6 min


Combined Guardrail Evaluation:  21%|██        | 130/617 [04:16<13:58,  1.72s/it]

Progress: 130/617 (21.1%) | Rate: 0.51 prompts/s | ETA: 16.0 min


Combined Guardrail Evaluation:  23%|██▎       | 140/617 [04:38<13:43,  1.73s/it]

Progress: 140/617 (22.7%) | Rate: 0.50 prompts/s | ETA: 15.8 min


Combined Guardrail Evaluation:  24%|██▍       | 150/617 [05:02<18:12,  2.34s/it]

Progress: 150/617 (24.3%) | Rate: 0.50 prompts/s | ETA: 15.7 min


Combined Guardrail Evaluation:  26%|██▌       | 161/617 [05:26<13:18,  1.75s/it]

Progress: 160/617 (25.9%) | Rate: 0.49 prompts/s | ETA: 15.5 min


Combined Guardrail Evaluation:  28%|██▊       | 170/617 [05:47<19:49,  2.66s/it]

Progress: 170/617 (27.6%) | Rate: 0.49 prompts/s | ETA: 15.2 min


Combined Guardrail Evaluation:  29%|██▉       | 180/617 [06:13<18:10,  2.50s/it]

Progress: 180/617 (29.2%) | Rate: 0.48 prompts/s | ETA: 15.1 min


Combined Guardrail Evaluation:  31%|███       | 190/617 [06:36<18:35,  2.61s/it]

Progress: 190/617 (30.8%) | Rate: 0.48 prompts/s | ETA: 14.9 min


Combined Guardrail Evaluation:  32%|███▏      | 200/617 [06:54<10:15,  1.48s/it]

Progress: 200/617 (32.4%) | Rate: 0.48 prompts/s | ETA: 14.4 min


Combined Guardrail Evaluation:  34%|███▍      | 210/617 [07:10<12:55,  1.90s/it]

Progress: 210/617 (34.0%) | Rate: 0.49 prompts/s | ETA: 13.9 min


Combined Guardrail Evaluation:  36%|███▌      | 220/617 [07:30<17:24,  2.63s/it]

Progress: 220/617 (35.7%) | Rate: 0.49 prompts/s | ETA: 13.5 min


Combined Guardrail Evaluation:  37%|███▋      | 230/617 [07:50<15:41,  2.43s/it]

Progress: 230/617 (37.3%) | Rate: 0.49 prompts/s | ETA: 13.2 min


Combined Guardrail Evaluation:  39%|███▉      | 240/617 [08:10<13:43,  2.19s/it]

Progress: 240/617 (38.9%) | Rate: 0.49 prompts/s | ETA: 12.8 min


Combined Guardrail Evaluation:  41%|████      | 250/617 [08:26<11:37,  1.90s/it]

Progress: 250/617 (40.5%) | Rate: 0.49 prompts/s | ETA: 12.4 min


Combined Guardrail Evaluation:  42%|████▏     | 260/617 [08:46<12:38,  2.12s/it]

Progress: 260/617 (42.1%) | Rate: 0.49 prompts/s | ETA: 12.1 min


Combined Guardrail Evaluation:  44%|████▍     | 270/617 [09:04<12:26,  2.15s/it]

Progress: 270/617 (43.8%) | Rate: 0.50 prompts/s | ETA: 11.7 min


Combined Guardrail Evaluation:  45%|████▌     | 280/617 [09:23<07:46,  1.38s/it]

Progress: 280/617 (45.4%) | Rate: 0.50 prompts/s | ETA: 11.3 min


Combined Guardrail Evaluation:  47%|████▋     | 290/617 [09:40<12:53,  2.37s/it]

Progress: 290/617 (47.0%) | Rate: 0.50 prompts/s | ETA: 10.9 min


Combined Guardrail Evaluation:  49%|████▊     | 300/617 [09:57<12:46,  2.42s/it]

Progress: 300/617 (48.6%) | Rate: 0.50 prompts/s | ETA: 10.5 min


Combined Guardrail Evaluation:  50%|█████     | 310/617 [10:13<10:31,  2.06s/it]

Progress: 310/617 (50.2%) | Rate: 0.51 prompts/s | ETA: 10.1 min


Combined Guardrail Evaluation:  52%|█████▏    | 320/617 [10:36<09:37,  1.94s/it]

Progress: 320/617 (51.9%) | Rate: 0.50 prompts/s | ETA: 9.8 min


Combined Guardrail Evaluation:  53%|█████▎    | 330/617 [10:57<10:41,  2.24s/it]

Progress: 330/617 (53.5%) | Rate: 0.50 prompts/s | ETA: 9.5 min


Combined Guardrail Evaluation:  55%|█████▌    | 340/617 [11:20<12:26,  2.69s/it]

Progress: 340/617 (55.1%) | Rate: 0.50 prompts/s | ETA: 9.2 min


Combined Guardrail Evaluation:  57%|█████▋    | 350/617 [11:39<10:57,  2.46s/it]

Progress: 350/617 (56.7%) | Rate: 0.50 prompts/s | ETA: 8.9 min


Combined Guardrail Evaluation:  58%|█████▊    | 360/617 [11:49<03:11,  1.34it/s]

Progress: 360/617 (58.3%) | Rate: 0.51 prompts/s | ETA: 8.4 min


Combined Guardrail Evaluation:  60%|█████▉    | 370/617 [12:05<04:54,  1.19s/it]

Progress: 370/617 (60.0%) | Rate: 0.51 prompts/s | ETA: 8.1 min


Combined Guardrail Evaluation:  62%|██████▏   | 380/617 [12:33<11:37,  2.94s/it]

Progress: 380/617 (61.6%) | Rate: 0.50 prompts/s | ETA: 7.8 min


Combined Guardrail Evaluation:  63%|██████▎   | 390/617 [12:47<07:31,  1.99s/it]

Progress: 390/617 (63.2%) | Rate: 0.51 prompts/s | ETA: 7.4 min


Combined Guardrail Evaluation:  65%|██████▍   | 400/617 [13:08<08:31,  2.36s/it]

Progress: 400/617 (64.8%) | Rate: 0.51 prompts/s | ETA: 7.1 min


Combined Guardrail Evaluation:  66%|██████▋   | 410/617 [13:27<07:53,  2.29s/it]

Progress: 410/617 (66.5%) | Rate: 0.51 prompts/s | ETA: 6.8 min


Combined Guardrail Evaluation:  68%|██████▊   | 420/617 [13:45<04:34,  1.40s/it]

Progress: 420/617 (68.1%) | Rate: 0.51 prompts/s | ETA: 6.5 min


Combined Guardrail Evaluation:  70%|██████▉   | 430/617 [14:03<03:58,  1.27s/it]

Progress: 430/617 (69.7%) | Rate: 0.51 prompts/s | ETA: 6.1 min


Combined Guardrail Evaluation:  71%|███████▏  | 440/617 [14:18<03:15,  1.11s/it]

Progress: 440/617 (71.3%) | Rate: 0.51 prompts/s | ETA: 5.8 min


Combined Guardrail Evaluation:  73%|███████▎  | 450/617 [14:44<07:31,  2.71s/it]

Progress: 450/617 (72.9%) | Rate: 0.51 prompts/s | ETA: 5.5 min


Combined Guardrail Evaluation:  75%|███████▍  | 460/617 [15:10<05:15,  2.01s/it]

Progress: 460/617 (74.6%) | Rate: 0.51 prompts/s | ETA: 5.2 min


Combined Guardrail Evaluation:  76%|███████▌  | 470/617 [15:35<07:27,  3.05s/it]

Progress: 470/617 (76.2%) | Rate: 0.50 prompts/s | ETA: 4.9 min


Combined Guardrail Evaluation:  78%|███████▊  | 480/617 [15:53<05:08,  2.25s/it]

Progress: 480/617 (77.8%) | Rate: 0.50 prompts/s | ETA: 4.5 min


Combined Guardrail Evaluation:  79%|███████▉  | 490/617 [16:13<04:30,  2.13s/it]

Progress: 490/617 (79.4%) | Rate: 0.50 prompts/s | ETA: 4.2 min


Combined Guardrail Evaluation:  81%|████████  | 500/617 [16:44<06:16,  3.22s/it]

Progress: 500/617 (81.0%) | Rate: 0.50 prompts/s | ETA: 3.9 min


Combined Guardrail Evaluation:  83%|████████▎ | 510/617 [17:07<04:19,  2.42s/it]

Progress: 510/617 (82.7%) | Rate: 0.50 prompts/s | ETA: 3.6 min


Combined Guardrail Evaluation:  84%|████████▍ | 520/617 [17:32<04:16,  2.64s/it]

Progress: 520/617 (84.3%) | Rate: 0.49 prompts/s | ETA: 3.3 min


Combined Guardrail Evaluation:  86%|████████▌ | 530/617 [17:47<02:29,  1.72s/it]

Progress: 530/617 (85.9%) | Rate: 0.50 prompts/s | ETA: 2.9 min


Combined Guardrail Evaluation:  88%|████████▊ | 540/617 [18:14<03:19,  2.59s/it]

Progress: 540/617 (87.5%) | Rate: 0.49 prompts/s | ETA: 2.6 min


Combined Guardrail Evaluation:  89%|████████▉ | 550/617 [18:29<01:31,  1.37s/it]

Progress: 550/617 (89.1%) | Rate: 0.50 prompts/s | ETA: 2.3 min


Combined Guardrail Evaluation:  91%|█████████ | 560/617 [18:48<01:29,  1.57s/it]

Progress: 560/617 (90.8%) | Rate: 0.50 prompts/s | ETA: 1.9 min


Combined Guardrail Evaluation:  92%|█████████▏| 570/617 [19:13<02:05,  2.68s/it]

Progress: 570/617 (92.4%) | Rate: 0.49 prompts/s | ETA: 1.6 min


Combined Guardrail Evaluation:  94%|█████████▍| 580/617 [19:41<01:25,  2.30s/it]

Progress: 580/617 (94.0%) | Rate: 0.49 prompts/s | ETA: 1.3 min


Combined Guardrail Evaluation:  96%|█████████▌| 590/617 [19:59<00:42,  1.57s/it]

Progress: 590/617 (95.6%) | Rate: 0.49 prompts/s | ETA: 0.9 min


Combined Guardrail Evaluation:  97%|█████████▋| 600/617 [20:17<00:26,  1.56s/it]

Progress: 600/617 (97.2%) | Rate: 0.49 prompts/s | ETA: 0.6 min


Combined Guardrail Evaluation:  99%|█████████▉| 610/617 [20:38<00:08,  1.29s/it]

Progress: 610/617 (98.9%) | Rate: 0.49 prompts/s | ETA: 0.2 min


Combined Guardrail Evaluation: 100%|██████████| 617/617 [20:52<00:00,  2.03s/it]


✓ Combined guardrail evaluation complete in 20.88 minutes
Results saved to: /home/ubuntu/HallucinationVectorProject/results/llama-3.1-8b/combined_guarded_results.csv





## Define the Baseline Generation Function
This function will generate an answer from the original, unguarded model. It includes the requested prompt prefix and decodes only the new tokens.

In [8]:
import time

def generate_baseline(prompt_text: str, max_new_tokens: int = 128):
    """
    Generates a response from the unguarded baseline model.
    """
    start_time = time.time()

    # Prepend the required instruction to the prompt
    full_prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful AI assistant. <|eot_id|><|start_header_id|>user<|end_header_id|> Answer the following question briefly:\n{prompt_text} <|eot_id|><|start_header_id|>assistant<|end_header_id|>"

    inputs = artifacts['tokenizer'](full_prompt, return_tensors="pt").to(artifacts['model'].device)

    with torch.no_grad():
        outputs = artifacts['model'].generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=artifacts['tokenizer'].eos_token_id
        )

    # Decode only the newly generated tokens, skipping the prompt
    input_token_length = inputs.input_ids.shape[1]
    newly_generated_tokens = outputs[0, input_token_length:]
    answer = artifacts['tokenizer'].decode(newly_generated_tokens, skip_special_tokens=True)

    end_time = time.time()
    latency = end_time - start_time

    return {
        "answer": answer,
        "latency_seconds": latency
    }

print("Baseline generation function `generate_baseline` is defined.")

Baseline generation function `generate_baseline` is defined.


In [9]:
BASELINE_RESULTS_PATH = os.path.join(PROJECT_DIR, "baseline_results_truthfulqa.csv")

print(f"Baseline results will be saved to: {BASELINE_RESULTS_PATH}")

# --- Helper function to initialize CSV files with headers ---
def initialize_csv(file_path, headers):
    if not os.path.exists(file_path):
        with open(file_path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(headers)
        return set()
    else:
        # Load existing prompts to know where to resume from
        df_existing = pd.read_csv(file_path)
        return set(df_existing['prompt'].tolist())

# --- Initialize CSVs and get the set of already processed prompts ---
baseline_headers = ['prompt', 'answer', 'latency_seconds']

processed_baseline = initialize_csv(BASELINE_RESULTS_PATH, baseline_headers)

print(f"Found {len(processed_baseline)} already processed prompts for the Baseline run.")

# --- Main Evaluation Loop ---
if test_df is not None:
    # Use tqdm for a progress bar, which will handle progress indication
    for index, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Evaluating Prompts"):
        prompt = row['Question']

        # --- Baseline Run ---
        if prompt not in processed_baseline:
            try:
                baseline_result = generate_baseline(prompt)

                # Append result immediately to the CSV
                with open(BASELINE_RESULTS_PATH, 'a', newline='', encoding='utf-8') as f:
                    writer = csv.writer(f)
                    writer.writerow([
                        prompt,
                        baseline_result.get('answer'),
                        baseline_result.get('latency_seconds')
                    ])
                processed_baseline.add(prompt)

            except Exception as e:
                print(f"  -> ERROR during baseline run for prompt: '{prompt}'. Error: {e}")
                with open(BASELINE_RESULTS_PATH, 'a', newline='', encoding='utf-8') as f:
                    writer = csv.writer(f)
                    writer.writerow([prompt, f"ERROR: {e}", -1.0])
                processed_baseline.add(prompt)

    print("\n--- Evaluation Runs Complete ---")
    # Final check
    final_baseline_df = pd.read_csv(BASELINE_RESULTS_PATH)
    print(f"Total baseline results saved: {len(final_baseline_df)}")
else:
    print("Skipping evaluation loop as test_df was not loaded.")

Baseline results will be saved to: /home/ubuntu/HallucinationVectorProject/baseline_results_truthfulqa.csv
Found 0 already processed prompts for the Baseline run.


Evaluating Prompts: 100%|██████████| 617/617 [22:59<00:00,  2.24s/it]


--- Evaluation Runs Complete ---
Total baseline results saved: 617





### **Run Judging, Analyze, and Summarize Results**
Runs the judging process on generated answers, merges with ground truth, and computes final performance metrics for the combined guardrail experiment.

In [19]:
# WORKAROUND: Import only what we need for judging (no model loading)
import sys
from pathlib import Path

PROJECT_DIR = Path("/home/ubuntu/HallucinationVectorProject")
if str(PROJECT_DIR) not in sys.path:
    sys.path.insert(0, str(PROJECT_DIR))

# Now import the judging function
from evaluate_guardrail import run_judging_process
import pandas as pd
import time
import os

# Import utils and config (FastLanguageModel import is now commented out)
import utils
import config

# --- Define paths for the analysis (using local paths) ---
DATA_DIR = PROJECT_DIR / "data"
RESULTS_DIR = PROJECT_DIR / "results" / "llama-3.1-8b"
GUARDED_JUDGED_PATH_COMBINED = RESULTS_DIR / "combined_guarded_judged_results.csv"
BASELINE_JUDGED_RESULTS_PATH = RESULTS_DIR / "baseline_judged_results_truthfulqa.csv"
GUARDED_RESULTS_PATH_COMBINED = RESULTS_DIR / "combined_guarded_results.csv"
BASELINE_RESULTS_PATH = os.path.join(PROJECT_DIR, "baseline_results_truthfulqa.csv")

print("Loading datasets for judging and analysis...")

# Load the test set
test_df = pd.read_csv(DATA_DIR / "final_test_set_truthfulqa.csv")

# Load the newly generated results
guarded_df = pd.read_csv(GUARDED_RESULTS_PATH_COMBINED)
baseline_df = pd.read_csv(BASELINE_RESULTS_PATH)

# Merge with ground truth
guarded_merged_df = pd.merge(guarded_df, test_df, left_on='prompt', right_on='Question', how='left')
baseline_merged_df = pd.merge(baseline_df, test_df, left_on='prompt', right_on='Question', how='left')

print(f"Guarded results: {len(guarded_merged_df)} prompts")
print(f"Baseline results: {len(baseline_merged_df)} prompts")

# --- Run Judging with retry logic for network stability ---
secrets = utils.load_secrets()

print("\nStarting judging process for combined guardrail results...")
start_time = time.time()

try:
    run_judging_process(guarded_merged_df, GUARDED_JUDGED_PATH_COMBINED, secrets['SCALEDOWN_API_KEY'])
    print(f"✓ Judging complete in {(time.time() - start_time)/60:.2f} minutes")
except Exception as e:
    print(f"Error during judging: {e}")

# Assuming baseline is already judged, if not, uncomment below
# run_judging_process(baseline_merged_df, BASELINE_JUDGED_RESULTS_PATH, secrets['SCALEDOWN_API_KEY'])



Loading datasets for judging and analysis...
Guarded results: 617 prompts
Baseline results: 617 prompts
Loading secrets...
Secrets loaded successfully.

Starting judging process for combined guardrail results...

--- Starting Corrected Judging Process for combined_guarded_judged_results.csv ---
Initialized CSV file at: /home/ubuntu/HallucinationVectorProject/results/llama-3.1-8b/combined_guarded_judged_results.csv
Found 0 already judged prompts. Resuming...


Judging combined_guarded_judged_results.csv:   2%|▏         | 13/617 [00:38<35:49,  3.56s/it]

Judging combined_guarded_judged_results.csv: 100%|██████████| 617/617 [36:47<00:00,  3.58s/it]

✓ Judging complete in 36.80 minutes





In [20]:
# --- Analyze and Print Final Report ---
print("\nAnalyzing final performance metrics...")

# Read CSVs with error handling for malformed rows
try:
    guarded_judged_df = pd.read_csv(GUARDED_JUDGED_PATH_COMBINED, on_bad_lines='skip', engine='python')
    print(f"Loaded guarded results: {len(guarded_judged_df)} rows")
except Exception as e:
    print(f"Error reading guarded results with python engine: {e}")
    # Fallback: try with quoting
    guarded_judged_df = pd.read_csv(GUARDED_JUDGED_PATH_COMBINED, quoting=1, on_bad_lines='skip')
    print(f"Loaded guarded results (fallback): {len(guarded_judged_df)} rows")

try:
    baseline_judged_df = pd.read_csv(BASELINE_JUDGED_RESULTS_PATH, on_bad_lines='skip', engine='python')
    print(f"Loaded baseline results: {len(baseline_judged_df)} rows")
except Exception as e:
    print(f"Error reading baseline results with python engine: {e}")
    baseline_judged_df = pd.read_csv(BASELINE_JUDGED_RESULTS_PATH, quoting=1, on_bad_lines='skip')
    print(f"Loaded baseline results (fallback): {len(baseline_judged_df)} rows")

baseline_accuracy = baseline_judged_df['is_correct'].mean()
guarded_accuracy = guarded_judged_df['is_correct'].mean()
baseline_error_rate = 1 - baseline_accuracy
guarded_error_rate = 1 - guarded_accuracy
relative_error_reduction = (baseline_error_rate - guarded_error_rate) / baseline_error_rate if baseline_error_rate > 0 else 0
baseline_latency = baseline_judged_df['latency_seconds'].mean()
guarded_latency = guarded_judged_df['latency_seconds'].mean()
latency_increase_percent = (guarded_latency - baseline_latency) / baseline_latency * 100

summary_data = {
    "Metric": ["Accuracy", "Hallucination Rate", "Avg Latency (s)", "Relative Error Reduction", "Latency Increase"],
    "Baseline Model": [f"{baseline_accuracy:.2%}", f"{baseline_error_rate:.2%}", f"{baseline_latency:.2f}", "N/A", "N/A"],
    "Guarded Model (Combined)": [f"{guarded_accuracy:.2%}", f"{guarded_error_rate:.2%}", f"{guarded_latency:.2f}", f"{relative_error_reduction:.2%}", f"{latency_increase_percent:+.2f}%"],
}
summary_df = pd.DataFrame(summary_data)

print("\n" + "="*80)
print("FINAL PERFORMANCE SUMMARY (Combined Dynamic Alpha + Selective N-Tokens)")
print("="*80)
print(summary_df.to_string(index=False))
print("="*80)


Analyzing final performance metrics...
Loaded guarded results: 617 rows
Loaded baseline results: 617 rows

FINAL PERFORMANCE SUMMARY (Combined Dynamic Alpha + Selective N-Tokens)
                  Metric Baseline Model Guarded Model (Combined)
                Accuracy         55.75%                   55.27%
      Hallucination Rate         44.25%                   44.73%
         Avg Latency (s)           2.23                     2.03
Relative Error Reduction            N/A                   -1.10%
        Latency Increase            N/A                   -9.19%
