# Cross-Model Validation of the Semantic Grounding Index

**Research Question:** Is SGI a property of the text itself, or an artifact of specific embedding geometries?

If SGI captures something fundamental about how language models handle uncertainty, the signal should be:
1. **Consistent across embedding architectures**: SGI scores from different encoders should correlate highly
2. **Robust in effect size**: Cohen's d should stay comparably large, with a statistically significant group difference, in every embedding space
3. **Stable in ranking**: samples ranked as "high risk" by one model should be ranked similarly by others

If the signal is an artifact of a particular embedding geometry, we'd expect low correlation, inconsistent effect sizes, and different samples flagged by different models.

---

**Models tested:**

| Model | Dimension | Training Regime |
|-------|-----------|----------------|
| `all-mpnet-base-v2` | 768 | General-purpose sentence embeddings |
| `all-MiniLM-L6-v2` | 384 | Distilled, smaller architecture |
| `bge-base-en-v1.5` | 768 | BAAI contrastive learning |
| `e5-base-v2` | 768 | Microsoft weakly supervised contrastive pre-training |
| `gte-base` | 768 | Alibaba general text embeddings |

---

**Reference:** Marín (2025) "Semantic Grounding Index" [arXiv:2512.13771](https://arxiv.org/abs/2512.13771)

## Setup

In [None]:
# Uncomment to install dependencies
# !pip install -q datasets sentence-transformers spacy numpy pandas matplotlib seaborn scipy scikit-learn
# !python -m spacy download en_core_web_sm -q

In [None]:
import sys
import gc
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import spacy
from sentence_transformers import SentenceTransformer

# Local imports
from sgi import (
    compute_sgi,
    load_halueval_qa,
    print_dataset_summary,
    compute_effect_size,
    compute_correlation_matrix,
    compute_pairwise_correlations,
    compute_topk_overlap_matrix,
    set_publication_style,
    create_summary_figure,
)

set_publication_style()
nlp = spacy.load('en_core_web_sm')

print('Setup complete.')

## Configuration

In [None]:
# Embedding models to validate
EMBEDDING_MODELS = {
    'mpnet': 'all-mpnet-base-v2',
    'minilm': 'all-MiniLM-L6-v2',
    'bge': 'BAAI/bge-base-en-v1.5',
    'e5': 'intfloat/e5-base-v2',
    'gte': 'thenlper/gte-base',
}

# Sample size (paper uses n=5,000 for reproducibility)
MAX_SAMPLES = 5000

print(f'Will validate SGI across {len(EMBEDDING_MODELS)} embedding models:')
for name, model_id in EMBEDDING_MODELS.items():
    print(f'  - {name}: {model_id}')

## 1. Load Data

In [None]:
cases = load_halueval_qa(max_samples=MAX_SAMPLES)
print_dataset_summary(cases, 'HaluEval QA')

## 2. Compute SGI Across Models

For each embedding model, we compute:

$$\text{SGI} = \frac{\theta(r, q)}{\theta(r, c)}$$

where $\theta$ is the angular distance and $r$, $q$, $c$ are the L2-normalized embeddings of response, question, and context respectively.
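
As a rough illustration of the ratio above, the sketch below computes angular distances with NumPy on toy 2-D vectors. It is not the `compute_sgi` implementation from the local `sgi` package, which is not shown here; the helper names `angular_distance` and `sgi_sketch` and the `eps` guard are assumptions made for this example.

```python
import numpy as np


def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Angle in radians between two vectors (0 = same direction, pi = opposite)."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    # Clip guards against floating-point drift pushing the cosine outside [-1, 1].
    cos_sim = float(np.clip(np.dot(u, v), -1.0, 1.0))
    return float(np.arccos(cos_sim))


def sgi_sketch(q: np.ndarray, c: np.ndarray, r: np.ndarray, eps: float = 1e-8) -> float:
    """theta(r, q) / theta(r, c); eps is an assumed guard against division by zero."""
    return angular_distance(r, q) / max(angular_distance(r, c), eps)


# Toy example: the response sits 30 degrees from the question
# and 60 degrees from the context.
q = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
r = np.array([np.cos(np.pi / 6), np.sin(np.pi / 6)])

ratio = sgi_sketch(q, c, r)
print(f"theta(r,q)={angular_distance(r, q):.4f}  "
      f"theta(r,c)={angular_distance(r, c):.4f}  SGI={ratio:.4f}")
```

Because SGI is a ratio of two angles, rescaling the angular distance (for example dividing by $\pi$) would leave the value unchanged.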

In [None]:
def extract_first_claim(text: str) -> str:
    """Extract first sentence as primary claim."""
    doc = nlp(text)
    for sent in doc.sents:
        if len(sent.text.strip()) > 5:
            return sent.text.strip()
    return text[:200]


def compute_sgi_for_model(cases, model_name: str, model_id: str) -> pd.DataFrame:
    """Compute SGI for all cases using specified embedding model."""
    print(f'\n{"="*60}')
    print(f'Model: {model_name} ({model_id})')
    print(f'{"="*60}')
    
    encoder = SentenceTransformer(model_id)
    print(f'Embedding dimension: {encoder.get_sentence_embedding_dimension()}')
    
    results = []
    
    for case in tqdm(cases, desc=f'Computing SGI ({model_name})'):
        try:
            q_emb = encoder.encode(case.question)
            c_emb = encoder.encode(case.context)
            r_emb = encoder.encode(extract_first_claim(case.response))
            
            sgi_result = compute_sgi(q_emb, c_emb, r_emb)
            
            results.append({
                'id': case.id,
                'is_grounded': case.is_grounded,
                f'sgi_{model_name}': sgi_result.sgi,
                f'theta_rq_{model_name}': sgi_result.theta_rq,
                f'theta_rc_{model_name}': sgi_result.theta_rc,
            })
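        except Exception as exc:
            # Skip the sample, but report it so dropped cases are not silent
            tqdm.write(f'Skipped case {getattr(case, "id", "?")}: {exc}')
            continue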
    
    # Clean up
    del encoder
    gc.collect()
    
    print(f'Processed: {len(results)} samples')
    return pd.DataFrame(results)

In [None]:
# Run analysis for each model
model_dfs = {}

for model_name, model_id in EMBEDDING_MODELS.items():
    model_dfs[model_name] = compute_sgi_for_model(cases, model_name, model_id)

print(f'\n\nCompleted analysis for {len(model_dfs)} models.')

In [None]:
# Merge results into single dataframe
model_names = list(model_dfs.keys())
merged_df = model_dfs[model_names[0]][['id', 'is_grounded']].copy()

for model_name, df in model_dfs.items():
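    # Match on the '_<model_name>' suffix so one model's name cannot match another's columns
    cols_to_merge = ['id'] + [c for c in df.columns if c.endswith(f'_{model_name}')]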
    merged_df = merged_df.merge(df[cols_to_merge], on='id', how='inner')

print(f'Merged: {len(merged_df)} samples with SGI from {len(model_names)} models')
print(f'  Grounded: {merged_df["is_grounded"].sum()}')
print(f'  Hallucinated: {(~merged_df["is_grounded"]).sum()}')

## 3. Cross-Model Validation Tests

### Test 1: Effect Size Consistency

Does each embedding model show significant separation between grounded and hallucinated responses?
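
For readers unfamiliar with the two separation metrics reported in this test, here is a minimal sketch of Cohen's d (pooled standard deviation) and a rank-based AUROC. It is not the `compute_effect_size` helper from the `sgi` package; the function names, the sign convention (first group minus second), and the toy values are assumptions for illustration only.

```python
import numpy as np


def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference (a minus b) using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


def auroc_from_ranks(pos: np.ndarray, neg: np.ndarray) -> float:
    """Probability that a random positive outscores a random negative (ties not handled)."""
    combined = np.concatenate([pos, neg])
    ranks = np.empty(len(combined))
    ranks[np.argsort(combined)] = np.arange(1, len(combined) + 1)
    # Mann-Whitney U statistic for the positive group, rescaled to [0, 1].
    u_stat = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return float(u_stat / (len(pos) * len(neg)))


# Hypothetical scores, chosen only so the arithmetic is easy to follow.
group_a = np.array([2.0, 4.0, 6.0])
group_b = np.array([1.0, 3.0, 5.0])

d = cohens_d(group_a, group_b)
auc = auroc_from_ranks(group_a, group_b)
print(f"Cohen's d = {d:.3f}, AUROC = {auc:.3f}")
```

Cohen's d describes how far apart the group means are in standard-deviation units, while AUROC describes how often a score from one group outranks a score from the other, so the two can be read side by side.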

In [None]:
print('='*70)
print('TEST 1: EFFECT SIZE CONSISTENCY')
print('='*70)
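# Keep the apostrophe out of the f-string expression: backslashes there are a SyntaxError before Python 3.12
d_label = "Cohen's d"
print(f'\n{"Model":<10} | {"Grounded":>10} | {"Halluc":>10} | {d_label:>10} | {"AUROC":>8} | {"p-value":>12}')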
print('-'*70)

effect_sizes = {}

for model_name in model_names:
    sgi_col = f'sgi_{model_name}'
    values = merged_df[sgi_col].values
    labels = merged_df['is_grounded'].values
    
    result = compute_effect_size(values, labels, model_name)
    effect_sizes[model_name] = result.to_dict()
    
    sig = '***' if result.p_value < 0.001 else '**' if result.p_value < 0.01 else '*' if result.p_value < 0.05 else ''
    print(f'{model_name:<10} | {result.grounded_mean:>10.4f} | {result.hallucinated_mean:>10.4f} | '
          f'{result.cohens_d:>+10.3f} | {result.auroc:>8.3f} | {result.p_value:>10.2e} {sig}')

print('-'*70)

# Summary
d_values = [r['cohens_d'] for r in effect_sizes.values()]
auc_values = [r['auroc'] for r in effect_sizes.values()]
n_significant = sum(1 for r in effect_sizes.values() if r['significant'])

print(f'\nSignificant effects (p<0.05): {n_significant}/{len(model_names)}')
print(f'Cohen\'s d: mean={np.mean(d_values):.3f}, range=[{min(d_values):.3f}, {max(d_values):.3f}]')
print(f'AUROC: mean={np.mean(auc_values):.3f}, range=[{min(auc_values):.3f}, {max(auc_values):.3f}]')

### Test 2: Cross-Model Correlation (Pearson)

If SGI captures something fundamental, scores from different models should correlate strongly.
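
The sketch below shows one way to obtain a correlation matrix and its off-diagonal summary with pandas. The encoder names and synthetic scores are hypothetical, and the `compute_correlation_matrix` and `compute_pairwise_correlations` helpers used in this notebook may be implemented differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
shared = rng.normal(size=200)  # latent signal shared by two hypothetical encoders

toy_scores = pd.DataFrame({
    "enc_a": shared + 0.1 * rng.normal(size=200),
    "enc_b": shared + 0.1 * rng.normal(size=200),
    "enc_c": rng.normal(size=200),  # unrelated scores, included for contrast
})

corr = toy_scores.corr(method="pearson")

# Upper-triangle indices (k=1 skips the diagonal) give each model pair exactly once.
rows, cols = np.triu_indices(len(corr), k=1)
off_diag = corr.to_numpy()[rows, cols]

print(corr.round(3))
print(f"off-diagonal: mean={off_diag.mean():.3f}, "
      f"min={off_diag.min():.3f}, max={off_diag.max():.3f}")
```

Summarizing only the upper triangle avoids counting each pair twice and excludes the trivial self-correlations of 1.0 on the diagonal.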

In [None]:
print('='*70)
print('TEST 2: CROSS-MODEL CORRELATION (PEARSON)')
print('='*70)

sgi_cols = [f'sgi_{m}' for m in model_names]
sgi_df = merged_df[sgi_cols].copy()
sgi_df.columns = model_names

pearson_matrix = compute_correlation_matrix(sgi_df, model_names, method='pearson')
pearson_stats = compute_pairwise_correlations(pearson_matrix)

print('\nPearson Correlation Matrix:')
print(pearson_matrix.round(3).to_string())
print(f'\nOff-diagonal: mean={pearson_stats["mean"]:.3f}, range=[{pearson_stats["min"]:.3f}, {pearson_stats["max"]:.3f}]')

### Test 3: Ranking Agreement (Spearman)

Do models agree on which samples are most/least grounded?
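
Spearman correlation is Pearson correlation applied to ranks, which is why it measures ranking agreement rather than linear agreement. The toy series below is an assumption chosen to make the difference visible: two scores that order samples identically but are related nonlinearly.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)  # strictly increasing in x, but far from linear

pearson = x.corr(y)                  # penalized by the nonlinearity
spearman = x.rank().corr(y.rank())   # Pearson on ranks, i.e. Spearman

# Cross-check against pandas' built-in Spearman implementation.
spearman_pd = pd.DataFrame({"x": x, "y": y}).corr(method="spearman").loc["x", "y"]

print(f"Pearson={pearson:.3f}  Spearman={spearman:.3f}  pandas Spearman={spearman_pd:.3f}")
```

If two encoders rank samples identically but on differently shaped scales, Spearman will report perfect agreement even where Pearson does not.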

In [None]:
print('='*70)
print('TEST 3: RANKING AGREEMENT (SPEARMAN)')
print('='*70)

spearman_matrix = compute_correlation_matrix(sgi_df, model_names, method='spearman')
spearman_stats = compute_pairwise_correlations(spearman_matrix)

print('\nSpearman Rank Correlation Matrix:')
print(spearman_matrix.round(3).to_string())
print(f'\nOff-diagonal: mean={spearman_stats["mean"]:.3f}, range=[{spearman_stats["min"]:.3f}, {spearman_stats["max"]:.3f}]')

### Test 4: Top-K Flagging Agreement

If each model flags its top k% of samples as "high risk" (k = 10, 20, 30), how much do the flagged sets overlap?
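
A minimal sketch of the overlap computation, assuming that "top" means the highest scores; if low SGI is what marks risk, negate the scores first. The helper names and toy score arrays are hypothetical, and `compute_topk_overlap_matrix` from the `sgi` package may differ in its details.

```python
import numpy as np


def topk_indices(scores: np.ndarray, k_percent: float) -> set:
    """Indices of the k% highest scores (always at least one index)."""
    k = max(1, int(round(len(scores) * k_percent / 100)))
    return set(np.argsort(scores)[-k:].tolist())


def jaccard(a: set, b: set) -> float:
    """Intersection size over union size; defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical scores from two encoders on the same ten samples.
scores_a = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.4, 0.5, 0.6, 0.0])
scores_b = np.array([0.0, 0.8, 0.9, 0.7, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

flagged_a = topk_indices(scores_a, k_percent=30)
flagged_b = topk_indices(scores_b, k_percent=30)
overlap = jaccard(flagged_a, flagged_b)
print(f"flagged_a={sorted(flagged_a)}  flagged_b={sorted(flagged_b)}  Jaccard={overlap:.2f}")
```

Jaccard similarity is a strict measure: two models that share two of their three flagged samples score 0.5, not 0.67, because the union counts every sample flagged by either model.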

In [None]:
print('='*70)
print('TEST 4: TOP-K FLAGGING AGREEMENT')
print('='*70)

for k_pct in [10, 20, 30]:
    overlap_matrix = compute_topk_overlap_matrix(merged_df, sgi_cols, k_percent=k_pct)
    overlap_matrix.index = model_names
    overlap_matrix.columns = model_names
    
    # Mean off-diagonal Jaccard
    off_diag = [overlap_matrix.iloc[i, j] for i in range(len(model_names)) 
                for j in range(i+1, len(model_names))]
    
    print(f'\nTop-{k_pct}% overlap (Jaccard similarity):')
    print(overlap_matrix.round(3).to_string())
    print(f'Mean pairwise Jaccard: {np.mean(off_diag):.3f}')

### Test 5: Component Analysis

Which component of SGI drives the signal — response-question proximity or response-context distance?

In [None]:
print('='*70)
print('TEST 5: COMPONENT ANALYSIS')
print('='*70)
print(f'\n{"Model":<10} | {"d(θ_rq)":>10} | {"d(θ_rc)":>10} | Driver')
print('-'*50)

for model_name in model_names:
    theta_rq_col = f'theta_rq_{model_name}'
    theta_rc_col = f'theta_rc_{model_name}'
    labels = merged_df['is_grounded'].values
    
    d_rq = compute_effect_size(merged_df[theta_rq_col].values, labels, 'θ_rq')
    d_rc = compute_effect_size(merged_df[theta_rc_col].values, labels, 'θ_rc')
    
    driver = 'θ(r,q)' if abs(d_rq.cohens_d) > abs(d_rc.cohens_d) else 'θ(r,c)'
    print(f'{model_name:<10} | {d_rq.cohens_d:>+10.3f} | {d_rc.cohens_d:>+10.3f} | {driver}')

print('-'*50)
print('\nInterpretation:')
print('  • Negative d(θ_rq): Hallucinations are CLOSER to question')
print('  • Positive d(θ_rc): Hallucinations are FARTHER from context')
print('  • Both effects together = "semantic laziness" signature')

## 4. Visualization

In [None]:
fig = create_summary_figure(
    df=merged_df,
    effect_sizes=effect_sizes,
    pearson_matrix=pearson_matrix,
    spearman_matrix=spearman_matrix,
    model_names=model_names,
    save_path='cross_model_validation.png',
)
fig.show()

## 5. Summary

In [None]:
from types import SimpleNamespace

from sgi.analysis import summarize_cross_model_validation

summary = summarize_cross_model_validation(
    # Wrap each result dict so its fields are readable as attributes
    effect_sizes={m: SimpleNamespace(**effect_sizes[m]) for m in model_names},
    pearson_corr=pearson_stats,
    spearman_corr=spearman_stats,
)
print(summary)

## 6. Export Results

In [None]:
# Save raw results
merged_df.to_csv('cross_model_sgi_results.csv', index=False)
print('Results saved to cross_model_sgi_results.csv')

# Summary statistics
summary_stats = {
    'n_samples': len(merged_df),
    'n_grounded': int(merged_df['is_grounded'].sum()),
    'n_hallucinated': int((~merged_df['is_grounded']).sum()),
    'models': model_names,
    'effect_sizes': effect_sizes,
    'pearson_correlation': pearson_stats,
    'spearman_correlation': spearman_stats,
}

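import json
with open('cross_model_summary.json', 'w') as f:
    # default= converts NumPy scalars (e.g. np.bool_, np.int64) that json cannot serialize natively
    json.dump(summary_stats, f, indent=2,
              default=lambda o: o.item() if isinstance(o, np.generic) else str(o))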
print('Summary saved to cross_model_summary.json')