# Experiment 035C: Coherence-Quality Correlation

**AKIRA Project - Oscar Goldman - Shogu Research Group @ Datamutant.ai**

---

## Goal

Link activation coherence to response quality, and test whether hallucination can be predicted from activation patterns.

From `035_EXP_AQ_EXCITATION_FIELDS.md`:

```
Q3: Can we predict response quality from excitation coherence?

Prediction:
- High coherence = correct, confident response
- Low coherence = hallucination, uncertainty
- Coherence is measurable and predictive
```

---

## Theoretical Foundation

From `COMPLEXITY_FROM_CONSTRAINTS_AND_AQ.md`:

When the model generates a response, belief synchronization occurs:
- **Content AQ present**: b_t -> delta_{s*} (synchronize to TRUE causal state)
- **Content AQ absent**: b_t -> delta_{s'} (synchronize to WRONG state via dark attractor)

Both produce the same synchronization signature (low entropy, concentrated belief).
But the PATH to synchronization may differ.

**Hypothesis**: The coherence of the activation pattern during synchronization differs:
- Content AQ: Smooth, coherent collapse across layers
- Dark Attractor: Noisy, less coherent collapse

If true, we can detect hallucination before output by measuring activation coherence.
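
The hypothesis can be illustrated on synthetic data before touching a real model. The sketch below is a toy under stated assumptions: `make_trajectory` is a hypothetical helper that models a layer trajectory as a fixed direction plus per-layer Gaussian noise, and the noise scales are arbitrary. It only shows how a "smooth" and a "noisy" collapse would separate under a consecutive-layer cosine similarity score; it says nothing about how real activations behave.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-10)


def mean_consecutive_similarity(trajectory):
    """Mean cosine similarity between each layer's vector and the next."""
    sims = [cosine(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return sum(sims) / len(sims)


def make_trajectory(n_layers, dim, noise_scale, rng):
    """Hypothetical trajectory: one fixed direction plus per-layer Gaussian noise."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [
        [d + rng.gauss(0.0, noise_scale) for d in direction]
        for _ in range(n_layers)
    ]


rng = random.Random(0)  # Fixed seed so the toy example is repeatable
smooth = make_trajectory(n_layers=12, dim=64, noise_scale=0.1, rng=rng)
noisy = make_trajectory(n_layers=12, dim=64, noise_scale=2.0, rng=rng)

smooth_score = mean_consecutive_similarity(smooth)
noisy_score = mean_consecutive_similarity(noisy)
print(f"smooth: {smooth_score:.3f}, noisy: {noisy_score:.3f}")
```

If the hypothesis holds, factual prompts would resemble the smooth case and hallucination-inducing prompts the noisy one; the experiment below tests whether any such gap exists in practice.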

---

## Coherence Metrics

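We measure several aspects of coherence, all taken at the last prompt token:

1. **Cross-layer consistency**: Do activations evolve smoothly across layers? Measured as cosine similarity between consecutive layers.
2. **Token-wise variance**: Are activations stable across the sequence? Section 6 does not compute variance across tokens; it uses the variance across hidden dimensions at the last token instead.
3. **Attention entropy**: Is attention focused or dispersed?
4. **Synergy-Redundancy ratio**: Is information distributed (synergy) or concentrated (redundancy)? Section 6 approximates this with a Gini concentration of the final-layer activations.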
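
As a worked example of the attention entropy metric, the sketch below computes Shannon entropy in bits for a fully focused and a fully dispersed attention distribution over 8 positions. The function names and the two distributions are illustrative assumptions, not part of the experiment code.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log2(n), so values lie in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    return shannon_entropy_bits(probs) / math.log2(n)


focused = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # All attention on one token
dispersed = [1 / 8] * 8                              # Attention spread evenly

print(f"focused:   {shannon_entropy_bits(focused):.2f} bits")
print(f"dispersed: {shannon_entropy_bits(dispersed):.2f} bits")
print(f"dispersed (normalized): {normalized_entropy(dispersed):.2f}")
```

The maximum entropy grows with sequence length (log2 of the number of positions), so raw entropies from prompts of different token lengths are not directly comparable. Dividing by log2(n), as in `normalized_entropy`, is one possible way to control for that.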

---

## 1. Setup

In [None]:
# Install dependencies (uncomment for Colab)
# scipy is included because scipy.stats is imported below
# !pip install transformers torch numpy scipy scikit-learn matplotlib seaborn -q

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional, Any
from dataclasses import dataclass, field
import warnings
from scipy import stats

warnings.filterwarnings('ignore')

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {DEVICE}")
print(f"PyTorch version: {torch.__version__}")
if DEVICE == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Configuration

In [None]:
@dataclass
class ExperimentConfig:
    """Configuration for coherence-quality correlation experiment."""
    model_name: str = "gpt2-medium"
    layers_to_probe: List[int] = field(default_factory=list)
    random_seed: int = 42
    max_new_tokens: int = 20
    
    def __post_init__(self) -> None:
        """Initialize layer indices based on model."""
        if not self.layers_to_probe:
            if "gpt2-medium" in self.model_name.lower():
                # GPT-2 Medium has 24 layers - probe all for coherence analysis
                self.layers_to_probe = list(range(24))
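            elif "gpt2-large" in self.model_name.lower():
                self.layers_to_probe = list(range(36))
            elif "gpt2-xl" in self.model_name.lower():
                # Must come before the generic "gpt2" check, which would also match
                self.layers_to_probe = list(range(48))
            elif "gpt2" in self.model_name.lower():
                self.layers_to_probe = list(range(12))
            else:
                # Unknown model: falls back to 24 layers; verify against the loaded model
                self.layers_to_probe = list(range(24))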
        
        np.random.seed(self.random_seed)
        torch.manual_seed(self.random_seed)


config = ExperimentConfig()
print(f"Model: {config.model_name}")
print(f"Layers to probe: {len(config.layers_to_probe)} layers")

## 3. Model Loading

In [None]:
print(f"Loading {config.model_name}...")
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    output_hidden_states=True,
    output_attentions=True
)
model = model.to(DEVICE)
model.eval()

# Set pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model loaded: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
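print(f"Number of layers: {model.config.n_layer}")

# Guard against a mismatch between the hard-coded layer list and the loaded model
assert max(config.layers_to_probe) < model.config.n_layer, "layers_to_probe exceeds model depth"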

## 4. Test Prompts: Factual vs Hallucination-Inducing

We create two categories of prompts:
1. **Verifiable facts**: Questions with clear, well-known answers
2. **Hallucination-inducing**: Questions about obscure/fictional entities that should trigger "I don't know" but often produce confident hallucinations

In [None]:
# Category 1: Well-known facts (model should know these)
FACTUAL_PROMPTS = [
    # Geography
    ("The capital of France is", "Paris"),
    ("The capital of Japan is", "Tokyo"),
    ("The capital of Italy is", "Rome"),
    ("The capital of Germany is", "Berlin"),
    ("The capital of Spain is", "Madrid"),
    ("The capital of Australia is", "Canberra"),
    # Science
    ("Water freezes at", "0"),  # degrees
    ("The speed of light is approximately", "300"),  # thousand km/s or similar
    ("The chemical symbol for gold is", "Au"),
    ("The chemical symbol for water is", "H2O"),
    # Math
    ("2 + 2 equals", "4"),
    ("10 times 10 equals", "100"),
    ("The square root of 144 is", "12"),
    # History/Culture
    ("The author of Romeo and Juliet is", "Shakespeare"),
    ("The first person to walk on the moon was", "Armstrong"),
]

# Category 2: Hallucination-inducing (fictional/obscure - model should be uncertain)
HALLUCINATION_PROMPTS = [
    # Fictional entities presented as real
    ("The capital of Westeros is", None),  # Fictional
    ("The atomic number of Unobtainium is", None),  # Fictional element
    ("The population of Atlantis in 2020 was", None),  # Mythical
    # Obscure/nonsensical questions
    ("The 47th digit of pi is", None),  # Unlikely to know exactly
    ("The middle name of the 23rd person to climb Everest was", None),
    ("The exact weight in grams of the first iPhone prototype was", None),
    # Counterfactuals presented as facts
    ("The year Napoleon conquered China was", None),  # Never happened
    ("The Nobel Prize winner for time travel in 2019 was", None),  # Doesn't exist
    ("The melting point of dark matter is", None),  # Unknown/unmeasurable
    # Future events (model can't know)
    ("The winner of the 2050 World Cup will be", None),
    ("The first human on Mars will land in the year", None),
    # Specific numbers that require lookup
    ("The exact number of grains of sand on Earth is", None),
    ("The phone number of the White House is", None),  # Specific, easily hallucinated
    # Made-up proper nouns
    ("The inventor of the Glorbnax engine was", None),  # Fictional
    ("The Zephyrian Empire collapsed in the year", None),  # Fictional
]

print(f"Factual prompts: {len(FACTUAL_PROMPTS)}")
print(f"Hallucination-inducing prompts: {len(HALLUCINATION_PROMPTS)}")

## 5. Activation and Attention Capture

In [None]:
class ActivationCapture:
    """Capture activations and attention patterns from model."""
    
    def __init__(self, model: nn.Module, config: ExperimentConfig):
        """Initialize capture.
        
        Args:
            model: The transformer model
            config: Experiment configuration
        """
        assert model is not None, "Model required"
        assert config is not None, "Config required"
        
        self.model = model
        self.config = config
        self.hidden_states: Optional[Tuple] = None
        self.attentions: Optional[Tuple] = None
    
    def capture(self, input_ids: torch.Tensor) -> Dict[str, Any]:
        """Run forward pass and capture all states.
        
        Args:
            input_ids: Token IDs [batch, seq_len]
            
        Returns:
            Dict with hidden_states, attentions, logits
        """
        assert input_ids is not None, "Input IDs required"
        assert input_ids.dim() == 2, f"Expected 2D input, got {input_ids.dim()}D"
        
        with torch.no_grad():
            outputs = self.model(
                input_ids,
                output_hidden_states=True,
                output_attentions=True
            )
        
        self.hidden_states = outputs.hidden_states
        self.attentions = outputs.attentions
        
        return {
            "hidden_states": outputs.hidden_states,
            "attentions": outputs.attentions,
            "logits": outputs.logits
        }
    
    def get_last_token_hidden_states(self) -> np.ndarray:
        """Get hidden states at last token across all layers.
        
        Returns:
            Array of shape [n_layers, hidden_dim]
        """
        assert self.hidden_states is not None, "Must call capture() first"
        
        states = []
        for layer_idx in self.config.layers_to_probe:
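            # hidden_states[k] is [batch, seq, hidden]. Index 0 is the embedding
            # output and index k is the output of block k-1, so probing
            # range(n_layers) skips the final block's output (index n_layers).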
            h = self.hidden_states[layer_idx][0, -1, :].cpu().numpy()
            states.append(h)
        
        return np.array(states)
    
    def get_attention_entropy(self) -> np.ndarray:
        """Compute attention entropy at each layer (last token attending).
        
        Returns:
            Array of shape [n_layers] with entropy values
        """
        assert self.attentions is not None, "Must call capture() first"
        
        entropies = []
        for layer_idx, attn in enumerate(self.attentions):
            if layer_idx not in self.config.layers_to_probe:
                continue
            # attn is [batch, heads, seq, seq]
            # Get last token's attention distribution (averaged across heads)
            last_token_attn = attn[0, :, -1, :].mean(dim=0)  # [seq]
            
            # Compute entropy
            # Add small epsilon to avoid log(0)
            probs = last_token_attn.cpu().numpy() + 1e-10
            probs = probs / probs.sum()
            entropy = -np.sum(probs * np.log2(probs))
            entropies.append(entropy)
        
        return np.array(entropies)


capture = ActivationCapture(model, config)
print("Activation capture ready")

## 6. Coherence Metrics

In [None]:
def compute_coherence_metrics(hidden_states: np.ndarray, attention_entropy: np.ndarray) -> Dict[str, float]:
    """Compute multiple coherence metrics from activation patterns.
    
    Args:
        hidden_states: Array [n_layers, hidden_dim]
        attention_entropy: Array [n_layers]
        
    Returns:
        Dict with coherence metrics
    """
    assert hidden_states is not None, "Hidden states required"
    assert len(hidden_states.shape) == 2, f"Expected 2D, got {hidden_states.shape}"
    
    metrics = {}
    n_layers = hidden_states.shape[0]
    
    # 1. Cross-layer consistency: How smoothly do activations evolve?
    # Measure cosine similarity between consecutive layers
    layer_similarities = []
    for i in range(n_layers - 1):
        h1 = hidden_states[i]
        h2 = hidden_states[i + 1]
        # Cosine similarity
        sim = np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-10)
        layer_similarities.append(sim)
    
    metrics["cross_layer_mean"] = np.mean(layer_similarities)
    metrics["cross_layer_std"] = np.std(layer_similarities)
    metrics["cross_layer_min"] = np.min(layer_similarities)
    
    # 2. Activation magnitude progression
    # Does magnitude increase smoothly (crystallization) or jump erratically?
    magnitudes = np.linalg.norm(hidden_states, axis=1)
    magnitude_diffs = np.diff(magnitudes)
    
    metrics["magnitude_mean"] = np.mean(magnitudes)
    metrics["magnitude_std"] = np.std(magnitudes)
    metrics["magnitude_smoothness"] = -np.std(magnitude_diffs)  # More negative = less smooth
    
    # 3. Attention entropy (from pre-computed)
    if len(attention_entropy) > 0:
        metrics["attention_entropy_mean"] = np.mean(attention_entropy)
        metrics["attention_entropy_final"] = attention_entropy[-1]
        # Entropy should decrease with depth if crystallizing properly
        entropy_trend = np.polyfit(range(len(attention_entropy)), attention_entropy, 1)[0]
        metrics["attention_entropy_trend"] = entropy_trend
    
    # 4. Layer-wise variance of activation
    # High variance within a layer = less crystallized
    layer_variances = np.var(hidden_states, axis=1)
    metrics["activation_variance_mean"] = np.mean(layer_variances)
    metrics["activation_variance_final"] = layer_variances[-1]
    
    # 5. Final layer concentration (pseudo-redundancy measure)
    # How concentrated is the final representation?
    final_layer = hidden_states[-1]
    final_abs = np.abs(final_layer)
    # Gini coefficient as concentration measure (values sorted ascending):
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    sorted_vals = np.sort(final_abs)
    n = len(sorted_vals)
    gini = (2 * np.sum(np.arange(1, n + 1) * sorted_vals)) / (n * np.sum(sorted_vals)) - (n + 1) / n
    metrics["final_concentration"] = gini
    
    return metrics


print("Coherence metrics defined")
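
To make the Gini-based concentration measure concrete, here is a small standalone sketch using the same formula on two hand-picked vectors. It is written in plain Python for readability; the function name `gini` and the zero-input guard are assumptions added for this illustration.

```python
def gini(values):
    """Gini coefficient of absolute values (0 = evenly spread, near 1 = concentrated)."""
    xs = sorted(abs(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # Assumed convention for empty or all-zero input
    # Rank-weighted sum with ranks starting at 1, values sorted ascending
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


even = gini([1.0, 1.0, 1.0, 1.0])    # Activation mass spread evenly
peaked = gini([0.0, 0.0, 0.0, 5.0])  # One dominant dimension

print(f"even:   {even:.2f}")
print(f"peaked: {peaked:.2f}")
```

For a vector with a single non-zero entry the coefficient is (n - 1) / n, so its ceiling depends on the hidden dimension. That is harmless when comparing prompts on the same model, but matters when comparing across model sizes.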

## 7. Response Quality Assessment

In [None]:
def generate_and_assess(prompt: str, expected: Optional[str], 
                        tokenizer: AutoTokenizer, model: nn.Module,
                        capture: ActivationCapture,
                        max_new_tokens: int = 20) -> Dict[str, Any]:
    """Generate response and assess quality.
    
    Args:
        prompt: Input prompt
        expected: Expected answer (None if hallucination-inducing)
        tokenizer: Tokenizer
        model: Model
        capture: Activation capture object
        max_new_tokens: Max tokens to generate
        
    Returns:
        Dict with response, metrics, and quality assessment
    """
    assert prompt is not None, "Prompt required"
    assert len(prompt) > 0, "Prompt cannot be empty"
    
    # Tokenize prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    input_ids = inputs["input_ids"]
    
    # Capture activations on prompt (before generation)
    capture_result = capture.capture(input_ids)
    hidden_states = capture.get_last_token_hidden_states()
    attention_entropy = capture.get_attention_entropy()
    
    # Compute coherence metrics
    coherence = compute_coherence_metrics(hidden_states, attention_entropy)
    
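    # Generate response
    with torch.no_grad():
        generated = model.generate(
            input_ids,
            attention_mask=inputs["attention_mask"],  # Avoids the missing-mask warning
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Greedy for reproducibility
            pad_token_id=tokenizer.pad_token_id
        )
    
    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
    new_tokens = generated[0][input_ids.shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()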
    
    # Assess quality
    if expected is not None:
        # For factual prompts: case-insensitive substring check, so very short
        # answers such as "0" or "Au" can match unrelated text in the response
        is_correct = expected.lower() in response.lower()
        category = "factual"
    else:
        # For hallucination-inducing: any confident answer is likely wrong
        # Check for uncertainty markers
        uncertainty_markers = ["don't know", "not sure", "cannot", "unknown", 
                              "uncertain", "no information", "impossible"]
        shows_uncertainty = any(marker in response.lower() for marker in uncertainty_markers)
        # Also check if response is very short (might indicate confusion)
        is_short = len(response.split()) < 3
        
        # "Correct" for hallucination prompts = showing appropriate uncertainty
        is_correct = shows_uncertainty or is_short
        category = "hallucination_inducing"
    
    return {
        "prompt": prompt,
        "expected": expected,
        "response": response,
        "category": category,
        "is_correct": is_correct,
        "coherence": coherence
    }


print("Assessment function ready")
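
The factual check above relies on substring matching, which is loose for short expected answers. The sketch below shows one possible stricter alternative based on whole-token matching with a regular expression. `contains_answer` is a hypothetical helper, not part of the experiment, and it would change which responses count as correct if swapped in.

```python
import re


def contains_answer(response, expected):
    """True if `expected` appears in `response` as a whole token, case-insensitively."""
    # Lookarounds require that no word character touches the match on either side
    pattern = r"(?<!\w)" + re.escape(expected) + r"(?!\w)"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None


# Plain substring matching accepts these, although neither response gives the answer
loose_au = "au" in "australia is a country"
loose_zero = "0" in "100 degrees celsius"
print(loose_au, loose_zero)

# Whole-token matching rejects them and still accepts genuine answers
strict_au = contains_answer("Australia is a country", "Au")
strict_zero = contains_answer("100 degrees Celsius", "0")
strict_paris = contains_answer("Paris, the city of light", "Paris")
strict_freeze = contains_answer("0 degrees Celsius", "0")
print(strict_au, strict_zero, strict_paris, strict_freeze)
```

Whole-token matching is still only a heuristic: a response such as "not Paris" would pass, so manual review of the printed responses remains useful with a prompt set this small.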

## 8. Run Experiment

In [None]:
print("Running experiment...")
print("=" * 60)

results = []

# Process factual prompts
print("\nFACTUAL PROMPTS:")
print("-" * 40)
for prompt, expected in FACTUAL_PROMPTS:
    result = generate_and_assess(prompt, expected, tokenizer, model, capture, config.max_new_tokens)
    results.append(result)
    status = "CORRECT" if result["is_correct"] else "WRONG"
    print(f"[{status}] {prompt}")
    print(f"  Response: {result['response'][:60]}..." if len(result['response']) > 60 else f"  Response: {result['response']}")

# Process hallucination-inducing prompts
print("\nHALLUCINATION-INDUCING PROMPTS:")
print("-" * 40)
for prompt, expected in HALLUCINATION_PROMPTS:
    result = generate_and_assess(prompt, expected, tokenizer, model, capture, config.max_new_tokens)
    results.append(result)
    status = "GOOD" if result["is_correct"] else "HALLUCINATED"
    print(f"[{status}] {prompt}")
    print(f"  Response: {result['response'][:60]}..." if len(result['response']) > 60 else f"  Response: {result['response']}")

print("\n" + "=" * 60)
print(f"Total results: {len(results)}")

## 9. Analyze Coherence-Quality Correlation

In [None]:
# Organize results for analysis
factual_results = [r for r in results if r["category"] == "factual"]
hallucination_results = [r for r in results if r["category"] == "hallucination_inducing"]

# Compute summary statistics
factual_correct = sum(1 for r in factual_results if r["is_correct"])
hallucination_good = sum(1 for r in hallucination_results if r["is_correct"])

print("ACCURACY SUMMARY:")
print("=" * 40)
print(f"Factual prompts: {factual_correct}/{len(factual_results)} correct ({100*factual_correct/len(factual_results):.1f}%)")
print(f"Hallucination prompts: {hallucination_good}/{len(hallucination_results)} showed appropriate uncertainty ({100*hallucination_good/len(hallucination_results):.1f}%)")
print(f"  (Low % here means model is confidently hallucinating)")

In [None]:
# Extract coherence metrics for correlation analysis
metric_names = list(results[0]["coherence"].keys())

# Separate correct vs incorrect
correct_metrics = {name: [] for name in metric_names}
incorrect_metrics = {name: [] for name in metric_names}

for r in results:
    target = correct_metrics if r["is_correct"] else incorrect_metrics
    for name in metric_names:
        target[name].append(r["coherence"][name])

# Convert to arrays
for name in metric_names:
    correct_metrics[name] = np.array(correct_metrics[name])
    incorrect_metrics[name] = np.array(incorrect_metrics[name])

print(f"Correct responses: {len(correct_metrics[metric_names[0]])}")
print(f"Incorrect responses: {len(incorrect_metrics[metric_names[0]])}")

In [None]:
# Statistical comparison of metrics between correct and incorrect
print("\nCOHERENCE METRICS COMPARISON:")
print("=" * 70)
print(f"{'Metric':<30} {'Correct':<15} {'Incorrect':<15} {'p-value':<10} {'Sig?'}")
print("-" * 70)

significant_metrics = []

for name in metric_names:
    correct_vals = correct_metrics[name]
    incorrect_vals = incorrect_metrics[name]
    
    # Skip if insufficient data
    if len(correct_vals) < 2 or len(incorrect_vals) < 2:
        continue
    
    # Welch's t-test (does not assume equal variances between the two groups)
    t_stat, p_value = stats.ttest_ind(correct_vals, incorrect_vals, equal_var=False)
    
    sig = "*" if p_value < 0.05 else ""
    if p_value < 0.01:
        sig = "**"
    if p_value < 0.001:
        sig = "***"
    
    if p_value < 0.05:
        significant_metrics.append((name, p_value, np.mean(correct_vals), np.mean(incorrect_vals)))
    
    print(f"{name:<30} {np.mean(correct_vals):<15.4f} {np.mean(incorrect_vals):<15.4f} {p_value:<10.4f} {sig}")

print("-" * 70)
print("* p<0.05, ** p<0.01, *** p<0.001")
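
Because the comparison above runs one test per metric, some p-values may fall below 0.05 by chance. One common remedy is the Holm step-down correction, sketched below in plain Python. The function name and the example p-values are illustrative assumptions, not outputs of this experiment.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject decision per p-value using the Holm step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = 0..m-1) is compared against alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Once one test fails, all larger p-values are kept as well
    return reject


# Hypothetical p-values for illustration only
example_p = [0.001, 0.04, 0.03, 0.20]
decisions = holm_bonferroni(example_p)
print(decisions)
```

In this example three of the four p-values are below 0.05 before correction, but only the smallest survives it. Applying the same procedure to the p-values collected in the loop above would give a more conservative list of significant metrics.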

In [None]:
# Visualize significant metrics
if len(significant_metrics) > 0:
    n_sig = min(len(significant_metrics), 6)
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for idx, (name, p_val, _, _) in enumerate(significant_metrics[:n_sig]):
        ax = axes[idx]
        
        data = [correct_metrics[name], incorrect_metrics[name]]
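        bp = ax.boxplot(data)
        # Set tick labels separately; the boxplot `labels` keyword is deprecated in newer Matplotlib
        ax.set_xticklabels(["Correct", "Incorrect"])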
        ax.set_title(f"{name}\n(p={p_val:.4f})")
        ax.set_ylabel("Value")
    
    # Hide unused subplots
    for idx in range(n_sig, 6):
        axes[idx].set_visible(False)
    
    plt.suptitle("Significant Coherence Metrics: Correct vs Incorrect Responses", fontsize=14)
    plt.tight_layout()
    plt.savefig("coherence_comparison.png", dpi=150, bbox_inches='tight')
    plt.show()
    print("Saved: coherence_comparison.png")
else:
    print("No statistically significant metrics found.")

## 10. Predictive Model: Can We Predict Quality from Coherence?

In [None]:
# Build feature matrix
X = []
y = []

for r in results:
    features = [r["coherence"][name] for name in metric_names]
    X.append(features)
    y.append(1 if r["is_correct"] else 0)

X = np.array(X)
y = np.array(y)

print(f"Feature matrix shape: {X.shape}")
print(f"Labels: {sum(y)} correct, {len(y) - sum(y)} incorrect")

In [None]:
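# Train logistic regression with cross-validation
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features on the full data (used for the final fit in the next cell)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

clf = LogisticRegression(max_iter=1000, random_state=42)

# For cross-validation, scale inside a pipeline so each fold's scaler is fit
# on its training split only (avoids leaking test-fold statistics)
cv_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))

# Use up to 5 stratified folds, capped by the size of the smaller class
n_folds = min(5, int(np.bincount(y, minlength=2).min()))
if n_folds >= 2:
    cv_scores = cross_val_score(cv_model, X, y, cv=n_folds, scoring='accuracy')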
    
    print("PREDICTIVE MODEL (Logistic Regression):")
    print("=" * 50)
    print(f"Cross-validation accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})")
    print(f"Fold scores: {cv_scores}")
    
    # Baseline: always predict majority class
    baseline = max(sum(y), len(y) - sum(y)) / len(y)
    print(f"Baseline (majority class): {baseline:.3f}")
    print(f"Improvement over baseline: {(cv_scores.mean() - baseline)*100:+.1f} percentage points")
else:
    print("Not enough samples for cross-validation")

In [None]:
# Feature importance
clf.fit(X_scaled, y)
coefficients = clf.coef_[0]

# Sort by absolute importance
importance_order = np.argsort(np.abs(coefficients))[::-1]

print("\nFEATURE IMPORTANCE (by coefficient magnitude):")
print("=" * 50)
for idx in importance_order:
    name = metric_names[idx]
    coef = coefficients[idx]
    direction = "(+)" if coef > 0 else "(-)"
    print(f"{name:<30} {coef:>8.4f} {direction}")

print("\n(+) = higher value predicts CORRECT")
print("(-) = higher value predicts INCORRECT")

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(12, 6))

sorted_names = [metric_names[i] for i in importance_order]
sorted_coefs = coefficients[importance_order]
colors = ['green' if c > 0 else 'red' for c in sorted_coefs]

bars = ax.barh(range(len(sorted_names)), sorted_coefs, color=colors, alpha=0.7)
ax.set_yticks(range(len(sorted_names)))
ax.set_yticklabels(sorted_names)
ax.set_xlabel("Coefficient (positive = predicts correct)")
ax.set_title("Coherence Metrics: Predictive Power for Response Quality")
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.savefig("feature_importance.png", dpi=150, bbox_inches='tight')
plt.show()
print("Saved: feature_importance.png")

## 11. Analysis by Category

In [None]:
# Compare coherence metrics between factual and hallucination-inducing prompts
print("\nCOHERENCE BY PROMPT CATEGORY:")
print("=" * 70)

factual_coherence = {name: [] for name in metric_names}
hallucination_coherence = {name: [] for name in metric_names}

for r in results:
    target = factual_coherence if r["category"] == "factual" else hallucination_coherence
    for name in metric_names:
        target[name].append(r["coherence"][name])

print(f"{'Metric':<30} {'Factual':<15} {'Halluc-Inducing':<15} {'Diff':<10}")
print("-" * 70)

for name in metric_names:
    f_mean = np.mean(factual_coherence[name])
    h_mean = np.mean(hallucination_coherence[name])
    diff = f_mean - h_mean
    print(f"{name:<30} {f_mean:<15.4f} {h_mean:<15.4f} {diff:<+10.4f}")
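
The `Diff` column above is in each metric's own units, so differences are hard to compare across metrics with very different scales. A standardized effect size such as Cohen's d is one way to put them on a common footing. The sketch below is a minimal illustration in plain Python; the function name and the sample values are assumptions, not results from this run.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # Sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical metric values for illustration only
factual_vals = [2.0, 4.0, 6.0]
halluc_vals = [1.0, 3.0, 5.0]
d = cohens_d(factual_vals, halluc_vals)
print(f"Cohen's d: {d:.2f}")
```

Applied per metric to `factual_coherence[name]` and `hallucination_coherence[name]`, this would rank metrics by how strongly they separate the two prompt categories, independent of their units. The function assumes at least two values per group and a non-zero pooled variance.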

## 12. Summary and Conclusions

In [None]:
print("\n" + "=" * 70)
print("EXPERIMENT 035C: COHERENCE-QUALITY CORRELATION SUMMARY")
print("=" * 70)

print(f"\nModel: {config.model_name}")
print(f"Total prompts tested: {len(results)}")
print(f"  - Factual: {len(factual_results)} ({factual_correct} correct)")
print(f"  - Hallucination-inducing: {len(hallucination_results)} ({hallucination_good} showed uncertainty)")

print(f"\nPredictive Model Performance:")
if n_folds >= 2:
    print(f"  - Cross-validation accuracy: {cv_scores.mean():.1%}")
    print(f"  - Baseline (majority): {baseline:.1%}")
    print(f"  - Improvement: {(cv_scores.mean() - baseline)*100:+.1f} percentage points")

print(f"\nSignificant Coherence Metrics (p<0.05):")
if len(significant_metrics) > 0:
    for name, p_val, corr_mean, incorr_mean in significant_metrics:
        direction = "higher" if corr_mean > incorr_mean else "lower"
        print(f"  - {name}: {direction} in correct responses (p={p_val:.4f})")
else:
    print("  None found")

print(f"\nTop Predictive Features:")
for i, idx in enumerate(importance_order[:3]):
    name = metric_names[idx]
    coef = coefficients[idx]
    direction = "higher = correct" if coef > 0 else "higher = incorrect"
    print(f"  {i+1}. {name} ({direction})")

## 13. Interpretation in AQ Framework

From `COMPLEXITY_FROM_CONSTRAINTS_AND_AQ.md`:

```
Both paths result in: Synchronized belief, b_t -> delta, entropy low.
The dark attractor completes the synchronization.
The belief field looks synchronized.
The model proceeds as if synchronization succeeded to the correct state.
```

**If coherence metrics CAN predict quality:**
- The PATH to synchronization differs between content AQ and dark attractor
- Even though final states look similar, the trajectory reveals the difference
- This provides a potential detection mechanism for hallucination

**If coherence metrics CANNOT predict quality:**
- The dark attractor truly produces identical signatures
- External verification remains necessary
- The model's blindness is more fundamental than we hoped

Either result informs our understanding of AQ and hallucination.

In [None]:
# Save results to file
import json

output_data = {
    "config": {
        "model_name": config.model_name,
        "n_layers": len(config.layers_to_probe),
        "max_new_tokens": config.max_new_tokens
    },
    "summary": {
        "total_prompts": len(results),
        "factual_correct": factual_correct,
        "factual_total": len(factual_results),
        "hallucination_good": hallucination_good,
        "hallucination_total": len(hallucination_results),
        "cv_accuracy": float(cv_scores.mean()) if n_folds >= 2 else None,
        "baseline": float(baseline) if n_folds >= 2 else None
    },
    "significant_metrics": [
        {"name": name, "p_value": float(p), "correct_mean": float(cm), "incorrect_mean": float(im)}
        for name, p, cm, im in significant_metrics
    ],
    "feature_importance": [
        {"name": metric_names[idx], "coefficient": float(coefficients[idx])}
        for idx in importance_order
    ]
}

with open("035C_results.json", "w") as f:
    json.dump(output_data, f, indent=2)

print("Results saved to 035C_results.json")