# Intentional Hallucination Testing

This notebook tests the model with prompts **designed to induce hallucinations**.

## Objectives
1. Test model responses to fabricated entities (fake CVEs, tools, papers)
2. Document hallucination patterns and severity
3. Build a baseline dataset for comparing mitigation strategies

## Test Categories
- Fabricated CVEs and security vulnerabilities
- Non-existent tools and frameworks
- Fake academic citations
- Temporal impossibilities
- Technical confabulations
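
For orientation, the cells below read each test vector as a dictionary with the keys `category`, `prompt`, `expected_hallucination`, `severity`, and `description`. The following is a minimal sketch of what one entry might look like. The values (including the CVE identifier, category name, and severity label) are invented for illustration and are not taken from `test_vectors.py`; `missing_keys` is a hypothetical helper, not part of the project.

```python
# Hypothetical test vector: the keys match those read later in this notebook,
# the values are illustrative only.
example_vector = {
    "category": "fabricated_cve",
    "prompt": "Summarize the impact of CVE-2099-99999.",
    "expected_hallucination": True,
    "severity": "high",
    "description": "Asks about a CVE identifier that does not exist.",
}

REQUIRED_KEYS = {
    "category",
    "prompt",
    "expected_hallucination",
    "severity",
    "description",
}


def missing_keys(vector):
    """Return the required keys absent from a vector, sorted for stable output."""
    return sorted(REQUIRED_KEYS - set(vector))


print(missing_keys(example_vector))
```

A check like this can be run over all loaded vectors before spending API calls, so a malformed entry fails early rather than mid-run.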

In [None]:
# Setup
import sys
sys.path.append('../src')

from agent import HallucinationTestAgent
from database import HallucinationDB
from test_vectors import HallucinationTestVectors
from config import Config
import pandas as pd
from tqdm import tqdm
import time

## Initialize Components

In [None]:
# Initialize agent and database
agent = HallucinationTestAgent()
db = HallucinationDB()

# Create experiment
experiment_id = db.create_experiment(
    name="Intentional Hallucinations - Baseline",
    mitigation_strategy="baseline",
    description="Testing model with intentionally hallucination-inducing prompts. No mitigation applied."
)

print(f"Experiment ID: {experiment_id}")
print(f"Model: {Config.MODEL_NAME}")
print(f"Temperature: {Config.TEMPERATURE}")

## Load Test Vectors

In [None]:
# Get intentional hallucination test vectors
test_vectors = HallucinationTestVectors.get_intentional_vectors()

print(f"Total test vectors: {len(test_vectors)}")
print("\nTest categories:")
from collections import Counter
categories = Counter(vec['category'] for vec in test_vectors)

for cat, count in sorted(categories.items()):
    print(f"  - {cat}: {count}")

## Run Baseline Tests

**IMPORTANT:** This will make API calls to OpenAI. Monitor your usage.

Each test:
1. Sends prompt to model
2. Records response
3. Annotates the response as a hallucination or not (the code below uses `expected_hallucination` as a proxy; uncomment the `input()` lines to annotate manually)
4. Saves to database
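
The commented-out `input()` lines in the test loop treat any answer other than `y` as "not a hallucination", so a typo silently becomes a negative label. One possible way to make manual annotation stricter is sketched below. `parse_annotation` and `ask_is_hallucination` are hypothetical helpers, not part of the project's `agent` or `database` modules.

```python
def parse_annotation(raw):
    """Map a free-text answer to True/False, or None if it is not recognized."""
    answer = raw.strip().lower()
    if answer in {"y", "yes"}:
        return True
    if answer in {"n", "no"}:
        return False
    return None


def ask_is_hallucination(ask=input):
    """Prompt repeatedly until the annotator gives a valid y/n answer.

    `ask` defaults to the built-in input(); it is a parameter so the
    function can be exercised without a terminal.
    """
    while True:
        result = parse_annotation(ask("\nIs this a hallucination? (y/n): "))
        if result is not None:
            return result
        print("Please answer 'y' or 'n'.")
```

In the loop, `is_hallucination = ask_is_hallucination()` would then take the place of the two commented-out lines.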

In [None]:
# Run tests
results = []

print("Starting baseline tests...\n")
print("Responses are labeled from expected_hallucination unless manual annotation is enabled.\n")

for i, vector in enumerate(tqdm(test_vectors, desc="Testing")):
    prompt = vector['prompt']
    
    # Query model
    response, metadata = agent.query_baseline(prompt)
    
    # Display for annotation
    print("\n" + "="*80)
    print(f"Test {i+1}/{len(test_vectors)}")
    print(f"Category: {vector['category']}")
    print(f"Expected: {vector['expected_hallucination']}")
    print(f"\nPrompt: {prompt}")
    print(f"\nResponse:\n{response}")
    print("="*80)
    
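    # Annotation: for an automated run, use expected_hallucination as a proxy.
    # Note: with this proxy the reported hallucination rate mirrors the expected
    # labels rather than the model's observed behavior.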
    is_hallucination = vector['expected_hallucination']
    
    # Uncomment below for manual annotation:
    # annotation = input("\nIs this a hallucination? (y/n): ").strip().lower()
    # is_hallucination = annotation == 'y'
    
    # Log to database
    log_result = db.log_test(
        experiment_id=experiment_id,
        prompt_text=prompt,
        response_text=response,
        is_hallucination=is_hallucination,
        prompt_category=vector['category'],
        vector_type='intentional',
        expected_hallucination=vector['expected_hallucination'],
        hallucination_type=vector['category'],
        severity=vector['severity'],
        description=vector['description'],
        response_time_ms=metadata.get('response_time_ms', 0),
        tokens_used=metadata.get('tokens_used', 0)
    )
    
    results.append({
        'prompt': prompt,
        'response': response,
        'is_hallucination': is_hallucination,
        'category': vector['category'],
        'tokens': metadata.get('tokens_used', 0)
    })
    
    # Rate limiting - be nice to the API
    time.sleep(1)

print("\n✓ Baseline testing complete!")

## Analyze Results

In [None]:
# Load results from database
df_results = db.get_experiment_results(experiment_id)

print("Experiment Results Summary")
print("="*50)
print(f"Total tests: {len(df_results)}")
print(f"Hallucinations detected: {df_results['is_hallucination'].sum()}")
print(f"Hallucination rate: {df_results['is_hallucination'].mean()*100:.1f}%")
print(f"\nAverage response time: {df_results['response_time_ms'].mean():.0f}ms")
print(f"Total tokens used: {df_results['tokens_used'].sum()}")

print("\nHallucinations by category:")
category_stats = df_results.groupby('prompt_category').agg({
    'is_hallucination': ['count', 'sum', 'mean']
}).round(3)
category_stats.columns = ['Total', 'Hallucinations', 'Rate']
print(category_stats)

print("\nSeverity distribution:")
print(df_results['severity'].value_counts())

## Export Results

In [None]:
# Export to CSV
export_path = db.export_to_csv(experiment_id)
print(f"Results exported to: {export_path}")

# Display sample results
print("\nSample results:")
display_cols = ['prompt_text', 'response_text', 'is_hallucination', 'hallucination_type', 'severity']
df_results[display_cols].head(10)

## Key Findings

**Document your observations here:**

1. Which categories had the highest hallucination rates?
2. Were any responses surprisingly accurate or surprisingly fabricated?
3. What patterns did you notice in how the model fabricates information?
4. Did the model show any uncertainty markers ("I'm not sure", "I don't have information")?
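
For question 4, a rough first pass over the responses can be automated with plain substring matching. The sketch below is a heuristic only: the marker list is an assumption extended from the two examples above, `has_uncertainty_marker` is a hypothetical helper, and a match does not prove the model actually declined to fabricate.

```python
# Assumed marker phrases; extend or prune after reading real responses.
UNCERTAINTY_MARKERS = (
    "i'm not sure",
    "i am not sure",
    "i don't have information",
    "i do not have information",
    "i'm not aware",
    "i could not find",
)


def has_uncertainty_marker(response):
    """Return True if the response contains any known uncertainty phrase."""
    # Lowercase and normalize curly apostrophes so "don’t" matches "don't".
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in UNCERTAINTY_MARKERS)
```

Applied to the results, something like `df_results['response_text'].apply(has_uncertainty_marker)` would give a boolean column to review next to `is_hallucination`.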

**Your notes:**
- 
- 
- 

## Next Steps

Now that you have baseline data:

1. Proceed to **02_unintentional_hallucinations.ipynb** to test edge cases
2. Then test mitigation strategies in **03_comparative_analysis.ipynb**
3. Compare how RAG, Constitutional AI, and Chain-of-Thought affect these same prompts

In [None]:
# Cleanup
db.close()
print("Database connection closed.")