# TruthfulQA: Establishing SGI's Limitations

This notebook demonstrates a critical **NEGATIVE result**: SGI measures context engagement, not factual accuracy, so it cannot separate truthful answers from false ones.

## Why This Matters

Understanding what a metric **cannot** detect is just as important as understanding what it can detect. This notebook establishes clear boundaries on what geometric methods can and cannot accomplish.

## The Hypothesis

TruthfulQA tests *misconceptions* — plausible-sounding false beliefs that are commonly held.

**Example:**
- Q: "What causes the seasons?"
- Truthful: "Axial tilt causes seasons"
- False: "Distance from the sun causes seasons"

Both answers are **topically identical**: they discuss seasons and what causes them. Their embeddings therefore point in nearly the same direction, and angular distance cannot reliably distinguish them.

**SGI measures:** "Did the response engage with provided context?"

**SGI does NOT measure:** "Is this factually correct?"

## Expected Result

- **AUC ≈ 0.478** (marginally below the chance level of 0.5, i.e. no usable discrimination)
- **Cohen's d ≈ -0.14** (negligible effect)
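
For readers unfamiliar with these two statistics, here is a minimal, dependency-free sketch of how they are conventionally defined: Cohen's d with a pooled standard deviation, and AUROC as the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count as one half). The function names and toy scores are illustrative assumptions. The notebook itself relies on `compute_effect_size` from the `sgi` package, whose conventions (sign of d, which class counts as positive) may differ.

```python
import statistics
from itertools import product


def cohens_d(group_a, group_b):
    """Cohen's d = (mean_a - mean_b) / pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def auroc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p, n in product(positive_scores, negative_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores (made up for illustration): two heavily overlapping groups
truthful_scores = [0.98, 1.01, 1.03, 0.99, 1.02]
false_scores = [1.00, 0.97, 1.04, 1.01, 0.99]

d = cohens_d(truthful_scores, false_scores)
auc = auroc(truthful_scores, false_scores)
print(f"Cohen's d = {d:+.2f}, AUROC = {auc:.2f}")
```

An AUROC near 0.5 together with |d| well below 0.2 is what "no discrimination" looks like in practice: the two score distributions overlap almost completely.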

---

**Reference:** Marín (2025), "Semantic Grounding Index," [arXiv:2512.13771](https://arxiv.org/abs/2512.13771), Section 5.6

## Setup

In [None]:
# Uncomment to install dependencies
# !pip install -q datasets sentence-transformers numpy pandas matplotlib seaborn scipy scikit-learn tqdm

In [None]:
import gc
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

from sentence_transformers import SentenceTransformer
from scipy import stats
from sklearn.metrics import roc_auc_score, roc_curve

from sgi import (
    compute_sgi,
    load_truthfulqa,
    print_dataset_summary,
    compute_effect_size,
    compute_cohens_d,
    set_publication_style,
)

set_publication_style()
print('Setup complete.')

## Configuration

In [None]:
# Configuration
MODEL_NAME = 'all-mpnet-base-v2'
MAX_SAMPLES = 2000

print('Configuration:')
print(f'  Model: {MODEL_NAME}')
print(f'  MAX_SAMPLES: {MAX_SAMPLES}')

## Load TruthfulQA Dataset

TruthfulQA contains questions designed to elicit false beliefs. Unlike HaluEval, there is **no external context** — the "context" is essentially the same as the question.

In [None]:
# Load TruthfulQA
cases = load_truthfulqa(max_samples=MAX_SAMPLES)
print_dataset_summary(cases, 'TruthfulQA')

# Show examples
print('\n' + '='*70)
print('EXAMPLE CASES')
print('='*70)

truthful_ex = next(c for c in cases if c.is_grounded)
false_ex = next(c for c in cases if not c.is_grounded)

print('\nTRUTHFUL EXAMPLE:')
print(f'  Question: {truthful_ex.question[:100]}...')
print(f'  Response: {truthful_ex.response[:150]}...')

print('\nFALSE EXAMPLE:')
print(f'  Question: {false_ex.question[:100]}...')
print(f'  Response: {false_ex.response[:150]}...')

## Compute SGI for TruthfulQA

We compute SGI using the standard formula:

$$\text{SGI} = \frac{\theta(r, q)}{\theta(r, c)}$$

**Key insight:** In TruthfulQA, `context ≈ question`, so the geometric relationship is fundamentally different from RAG scenarios.
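
As a reading aid, the ratio above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the implementation behind `compute_sgi`; the helper names, the `eps` guard against division by zero, and the toy 2-D vectors are all assumptions made for this example.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two vectors (arccos of cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against floating-point overshoot
    return math.acos(max(-1.0, min(1.0, cosine)))


def sgi_ratio(q, c, r, eps=1e-8):
    """SGI = theta(r, q) / theta(r, c); eps is an assumed guard, not from the paper."""
    return angular_distance(r, q) / (angular_distance(r, c) + eps)


# Toy 2-D "embeddings" (made up for illustration)
q = [1.0, 0.0]
c = [0.0, 1.0]
r_near_context = [0.2, 1.0]   # leans toward the context
r_near_question = [1.0, 0.2]  # stays close to the question

print(sgi_ratio(q, c, r_near_context))   # > 1: response is closer to the context
print(sgi_ratio(q, c, r_near_question))  # < 1: response is closer to the question
```

The ratio is only informative when `q` and `c` point in clearly different directions, which is the property TruthfulQA lacks.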

In [None]:
# Initialize encoder
encoder = SentenceTransformer(MODEL_NAME)
print(f'Model: {MODEL_NAME}')
print(f'Embedding dimension: {encoder.get_sentence_embedding_dimension()}')

# Compute SGI
results = []

for case in tqdm(cases, desc='Computing SGI for TruthfulQA'):
    try:
        q_emb = encoder.encode(case.question)
        c_emb = encoder.encode(case.context)  # Note: context ≈ question in TruthfulQA
        r_emb = encoder.encode(case.response)
        
        sgi_result = compute_sgi(q_emb, c_emb, r_emb)
        
        results.append({
            'id': case.id,
            'is_grounded': case.is_grounded,
            'sgi': sgi_result.sgi,
            'theta_rq': sgi_result.theta_rq,
            'theta_rc': sgi_result.theta_rc,
            'theta_qc': sgi_result.theta_qc,
        })
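    except Exception as e:
        # Report the failure instead of silently dropping the case
        tqdm.write(f'Skipping case {getattr(case, "id", "?")}: {e}')
        continue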

del encoder
gc.collect()

df = pd.DataFrame(results)
print(f'\nProcessed: {len(df)} samples')
print(f'  Truthful: {df["is_grounded"].sum()}')
print(f'  False: {(~df["is_grounded"]).sum()}')

## Table 4: TruthfulQA Results

Expected: AUC ≈ 0.478 (marginally below chance), consistent with SGI being unable to discriminate factual accuracy.

In [None]:
print('='*70)
print('TABLE 4: TRUTHFULQA NEGATIVE RESULT')
print('='*70)

# Separate groups
truthful = df[df['is_grounded']]
false = df[~df['is_grounded']]

# Basic statistics
print(f'\n{"Metric":<20} | {"Truthful":>12} | {"False":>12} | {"Difference":>12}')
print('-'*65)

for metric in ['sgi', 'theta_rq', 'theta_rc', 'theta_qc']:
    t_mean = truthful[metric].mean()
    f_mean = false[metric].mean()
    diff = f_mean - t_mean
    print(f'{metric:<20} | {t_mean:>12.3f} | {f_mean:>12.3f} | {diff:>+12.3f}')

print('-'*65)

In [None]:
# Effect size analysis
print('\n' + '='*70)
print('EFFECT SIZE ANALYSIS')
print('='*70)

# Compute effect size for SGI
sgi_result = compute_effect_size(
    df['sgi'].values,
    df['is_grounded'].values,
    'SGI'
)

# Also compute for theta_rq (the key component)
theta_rq_result = compute_effect_size(
    df['theta_rq'].values,
    df['is_grounded'].values,
    'theta_rq'
)

print('\nSGI Analysis:')
print(f'  SGI (Truthful): {sgi_result.grounded_mean:.3f}')
print(f'  SGI (False): {sgi_result.hallucinated_mean:.3f}')
print(f"  Cohen's d: {sgi_result.cohens_d:+.3f}")
print(f'  AUROC: {sgi_result.auroc:.3f}')
print(f'  p-value: {sgi_result.p_value:.3f}')

print('\nθ(r,q) Analysis:')
print(f'  θ(r,q) (Truthful): {theta_rq_result.grounded_mean:.3f}')
print(f'  θ(r,q) (False): {theta_rq_result.hallucinated_mean:.3f}')
print(f"  Cohen's d: {theta_rq_result.cohens_d:+.3f}")
print(f'  AUROC: {theta_rq_result.auroc:.3f}')

In [None]:
# Paper comparison
print('\n' + '-'*70)
print('COMPARISON WITH PAPER EXPECTED VALUES')
print('-'*70)

print(f'''\n{"Metric":<25} | {"Obtained":>12} | {"Paper Expected":>15} | {"Match?":>8}''')
print('-'*70)

# Check if results are in expected range
auc_match = 'YES' if 0.42 < sgi_result.auroc < 0.55 else 'NO'
d_match = 'YES' if -0.3 < sgi_result.cohens_d < 0.1 else 'NO'

print(f'{"AUROC":<25} | {sgi_result.auroc:>12.3f} | {"~0.478":>15} | {auc_match:>8}')
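# Reusing the enclosing quote type inside an f-string is a SyntaxError before Python 3.12
d_label = "Cohen's d"
print(f'{d_label:<25} | {sgi_result.cohens_d:>+12.3f} | {"~-0.14":>15} | {d_match:>8}')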
print('-'*70)

## Visualization: Why SGI Fails

In [None]:
# Distribution comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# SGI distributions
ax = axes[0]
ax.hist(truthful['sgi'], bins=30, alpha=0.6, label='Truthful', color='#2ecc71', density=True)
ax.hist(false['sgi'], bins=30, alpha=0.6, label='False', color='#e74c3c', density=True)
ax.axvline(truthful['sgi'].mean(), color='#27ae60', linestyle='--', linewidth=2)
ax.axvline(false['sgi'].mean(), color='#c0392b', linestyle='--', linewidth=2)
ax.set_xlabel('SGI')
ax.set_ylabel('Density')
ax.set_title(f"SGI Distribution (d={sgi_result.cohens_d:.2f})")
ax.legend()

# theta_rq distributions
ax = axes[1]
ax.hist(truthful['theta_rq'], bins=30, alpha=0.6, label='Truthful', color='#2ecc71', density=True)
ax.hist(false['theta_rq'], bins=30, alpha=0.6, label='False', color='#e74c3c', density=True)
ax.set_xlabel('θ(r,q)')
ax.set_ylabel('Density')
ax.set_title('θ(r,q) Distribution')
ax.legend()

# ROC curve
ax = axes[2]
y_true = (~df['is_grounded']).astype(int).values
fpr, tpr, _ = roc_curve(y_true, df['sgi'].values)
auroc = roc_auc_score(y_true, df['sgi'].values)

ax.plot(fpr, tpr, color='#3498db', linewidth=2, label=f'SGI (AUC = {auroc:.3f})')
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random (AUC = 0.5)')
ax.fill_between(fpr, tpr, alpha=0.2, color='#3498db')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve (Below Chance!)')
ax.legend(loc='lower right')

fig.suptitle('TruthfulQA: SGI Cannot Discriminate Factual Accuracy', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('04_truthfulqa_negative_result.png', dpi=300, bbox_inches='tight')
plt.show()

## Why Does SGI Fail on TruthfulQA?

### The Geometric Explanation

SGI measures the ratio of angular distances:

$$\text{SGI} = \frac{\theta(r, q)}{\theta(r, c)}$$

This captures **topical engagement**: whether the response moves toward the context or stays near the question.

### TruthfulQA's Structure

In TruthfulQA:
1. **No external context exists**: the "context" is essentially the question itself
2. **Truthful and false answers are topically identical**: both discuss the same subject matter
3. **The difference is factual, not topical**: the answers differ in correctness, not in subject matter
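
The first point has a direct geometric consequence. Angular distance obeys the triangle inequality, so |θ(r,q) − θ(r,c)| ≤ θ(q,c); when the context nearly coincides with the question, the ratio θ(r,q)/θ(r,c) is pinned close to 1 for any response. The sketch below illustrates this with hand-picked 3-D vectors. The vectors, labels, and helper names are assumptions for illustration and are not taken from the dataset or the `sgi` package.

```python
import math


def angle(u, v):
    """Angle in radians between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.hypot(*u) * math.hypot(*v)
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def sgi_ratio(q, c, r):
    """SGI = theta(r, q) / theta(r, c)."""
    return angle(r, q) / angle(r, c)


q = [1.0, 0.0, 0.0]
c = [1.0, 0.01, 0.0]  # stand-in for "context ≈ question": theta(q, c) is about 0.01 rad

# Three different responses (made-up vectors, labels are purely illustrative)
responses = {
    'truthful-like': [0.6, 0.8, 0.1],
    'false-like': [0.6, 0.7, 0.3],
    'unrelated': [0.0, 0.0, 1.0],
}

ratios = [sgi_ratio(q, c, r) for r in responses.values()]
for name, value in zip(responses, ratios):
    print(f'{name:<14} SGI = {value:.3f}')  # every value lands close to 1
```

Because every response receives nearly the same score, no threshold on SGI can separate the groups in this regime.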

### What SGI Can and Cannot Detect

| Scenario | Can SGI Detect? | Why? |
|----------|-----------------|------|
| RAG hallucination (topic drift) | YES | Response geometrically distant from context |
| Factual misconception | NO | Response is on topic but factually wrong |
| Context disengagement | YES | Clear geometric signature |
| Subtle factual errors | NO | No geometric signature |

### Implications

This negative result is **valuable** because it:
1. Defines the operational boundaries of geometric methods
2. Prevents misuse in factuality detection
3. Motivates complementary approaches (e.g., knowledge retrieval, fact verification)

In [None]:
# Analysis: theta_qc distribution (should be very small in TruthfulQA)
print('='*70)
print('WHY SGI FAILS: THE GEOMETRY OF TRUTHFULQA')
print('='*70)

print('\nθ(q,c) Statistics (Question-Context Separation):')
print(f'  Mean: {df["theta_qc"].mean():.4f}')
print(f'  Std: {df["theta_qc"].std():.4f}')
print(f'  Min: {df["theta_qc"].min():.4f}')
print(f'  Max: {df["theta_qc"].max():.4f}')

print(f'''\nInterpretation:
  In TruthfulQA, θ(q,c) is very small because context ≈ question.
  This collapses the geometric structure that SGI relies on.
  
  In HaluEval (where SGI works):
    - Question and context are semantically distinct
    - θ(q,c) provides "room" for responses to move toward context
    - Hallucinations fail to make this geometric journey
  
  In TruthfulQA:
    - Question and context are nearly identical
    - No geometric "space" exists for SGI to measure engagement
    - Both truthful and false answers are geometrically similar
''')

## Summary

In [None]:
print('='*70)
print('TRUTHFULQA NEGATIVE RESULT - SUMMARY')
print('='*70)

print(f'''
Dataset: TruthfulQA (n={len(df)})
Model: {MODEL_NAME}

RESULTS:
  AUROC: {sgi_result.auroc:.3f} (expected ~0.478, below chance)
  Cohen\'s d: {sgi_result.cohens_d:+.3f} (expected ~-0.14, negligible)
  p-value: {sgi_result.p_value:.3f}

CONCLUSION:
  SGI CANNOT discriminate factual accuracy.
  
  This is NOT a failure — it defines the operational boundary:
  
  SGI DETECTS:
    - Context disengagement in RAG systems
    - Topic drift / semantic laziness
    - Hallucinations that ignore source material
  
  SGI DOES NOT DETECT:
    - Factual errors within the correct topic
    - Misconceptions expressed topically correctly
    - Subtle semantic inaccuracies

  For factual accuracy, use complementary methods:
    - Knowledge retrieval systems
    - Fact verification models
    - Entailment classifiers
''')

In [None]:
import json

# Save results
df.to_csv('04_truthfulqa_results.csv', index=False)

summary = {
    'n_samples': len(df),
    'n_truthful': int(df['is_grounded'].sum()),
    'n_false': int((~df['is_grounded']).sum()),
    'auroc': float(sgi_result.auroc),
    'cohens_d': float(sgi_result.cohens_d),
    'p_value': float(sgi_result.p_value),
    'sgi_truthful_mean': float(truthful['sgi'].mean()),
    'sgi_false_mean': float(false['sgi'].mean()),
    'theta_rq_truthful_mean': float(truthful['theta_rq'].mean()),
    'theta_rq_false_mean': float(false['theta_rq'].mean()),
}

with open('04_truthfulqa_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)

print('Results saved to 04_truthfulqa_results.csv and 04_truthfulqa_summary.json')