# MC Framework Demo - Minimal Resource Version

**Author:** Tadden Moore  
**Date:** 2025-11-16  
**Paper:** [DOI: 10.5281/zenodo.17623226](https://doi.org/10.5281/zenodo.17623226)

This notebook demonstrates the core concepts of the Metacognitive Core (MC) Framework using minimal computational resources. We'll use toy models and synthetic data to illustrate the activation steering mechanism without needing a GPU or large model downloads.

## Key Concepts

1. **Sparse Autoencoder (SAE)**: Maps dense hidden states to sparse, interpretable features
2. **Feature Steering**: Modifies activations in feature space to influence model behavior
3. **Metacognitive Core**: A control loop that monitors and steers the Inference Engine
4. **Concept Injection**: Applies cognitive states through feature space rather than through the prompt

In [None]:
# Import dependencies
import torch
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)
print("Device: CPU (no GPU required for this demo)")

## 1. Toy Sparse Autoencoder

We'll create a simple SAE that maps a 64-dimensional hidden state to 128 sparse features.
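
A ReLU encoder with random weights and a zero bias activates roughly half of its features, so an untrained toy SAE is sparse in name only. The sketch below is an illustration rather than part of the MC Framework code: it shows how a negative encoder bias acts as a threshold and lowers the fraction of active features. The helper name `active_fraction` and the bias value `-1.0` are assumptions chosen for this example.

```python
import torch

def active_fraction(h: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> float:
    """Fraction of feature activations that are non-zero after ReLU."""
    feats = torch.relu(h @ W_enc + b_enc)
    return (feats > 0).float().mean().item()

gen = torch.Generator().manual_seed(0)
hidden_dim, feature_dim = 64, 128

# Random encoder with unit-norm columns, so pre-activations are roughly N(0, 1)
W_enc = torch.randn(hidden_dim, feature_dim, generator=gen)
W_enc = W_enc / torch.linalg.norm(W_enc, dim=0, keepdim=True)
h = torch.randn(256, hidden_dim, generator=gen)

frac_zero_bias = active_fraction(h, W_enc, torch.zeros(feature_dim))
frac_neg_bias = active_fraction(h, W_enc, torch.full((feature_dim,), -1.0))

print(f"Active fraction, zero bias: {frac_zero_bias:.2%}")
print(f"Active fraction, bias -1.0: {frac_neg_bias:.2%}")
```

Trained SAEs usually reach sparsity through a sparsity penalty or a top-k rule during training; the fixed bias here only stands in for that effect.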

In [None]:
class ToySAE:
    """Simplified SAE for demonstration purposes"""
    
    def __init__(self, hidden_dim=64, feature_dim=128):
        self.hidden_dim = hidden_dim
        self.feature_dim = feature_dim
        
        # Initialize decoder weights with unit-norm rows (common practice)
        W_dec = torch.randn(feature_dim, hidden_dim)
        self.W_dec = W_dec / torch.linalg.norm(W_dec, dim=1, keepdim=True)
        
        # Tie the encoder to the decoder transpose. This toy SAE is never
        # trained, so independent random weights would make decode(encode(h))
        # unrelated to h and the steering results below would be noise.
        self.W_enc = self.W_dec.T.clone()
        self.b_enc = torch.zeros(feature_dim)
        self.b_dec = torch.zeros(hidden_dim)
    
    def encode(self, h: torch.Tensor) -> torch.Tensor:
        """Encode hidden states to sparse features"""
        # Linear projection + ReLU for sparsity
        feats = torch.relu(h @ self.W_enc + self.b_enc)
        return feats
    
    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        """Decode features back to hidden states"""
        return feats @ self.W_dec + self.b_dec
    
    def forward(self, h: torch.Tensor) -> dict:
        """Full forward pass"""
        feats = self.encode(h)
        reconstruction = self.decode(feats)
        return {
            "feature_acts": feats,
            "reconstruction": reconstruction
        }

# Create SAE
sae = ToySAE(hidden_dim=64, feature_dim=128)
print(f"Created SAE: {sae.hidden_dim}D -> {sae.feature_dim}D")

# Test with random hidden state
test_h = torch.randn(1, 64)
test_output = sae.forward(test_h)
# Fraction of features that are active (non-zero) for this input
active_fraction = (test_output["feature_acts"] > 0).float().mean().item()
print(f"Active feature fraction: {active_fraction:.2%} (lower is sparser)")

## 2. Concept Feature Extraction

We'll simulate capturing features for different "concepts" by creating synthetic feature vectors.
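
With real text, one common way to obtain a concept vector is to compare mean feature activations on concept examples against a baseline set. The sketch below illustrates that difference-of-means idea on synthetic activations. It is an assumption about how such features could be derived, not the method used in the full implementation; `concept_vector`, the planted feature indices, and the data are all hypothetical.

```python
import torch

def concept_vector(concept_feats: torch.Tensor, baseline_feats: torch.Tensor) -> torch.Tensor:
    """Mean activation on concept samples minus the baseline mean, floored at zero."""
    diff = concept_feats.mean(dim=0) - baseline_feats.mean(dim=0)
    return torch.relu(diff)

gen = torch.Generator().manual_seed(0)
n_samples, feature_dim = 32, 128

# Synthetic stand-ins for SAE activations on two sets of text (hypothetical data)
baseline = torch.relu(torch.randn(n_samples, feature_dim, generator=gen) * 0.1)
concept = torch.relu(torch.randn(n_samples, feature_dim, generator=gen) * 0.1)
planted = [3, 7]  # features we pretend fire on the concept
concept[:, planted] += 2.0

vec = concept_vector(concept, baseline)
top = torch.topk(vec, k=2).indices.tolist()
print("Top concept features:", sorted(top))
```

Subtracting the baseline removes features that fire on all text, leaving those specific to the concept.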

In [None]:
def create_concept_features(sae: ToySAE, concept_type: str) -> torch.Tensor:
    """
    Create synthetic concept features for demonstration.
    In the real implementation, these come from processing actual text.
    """
    # Different patterns for different concepts
    if concept_type == "existential":
        # Simulate "philosophical" features: sparse, specific pattern
        hidden = torch.zeros(1, sae.hidden_dim)
        hidden[0, :10] = torch.randn(10) * 2.0  # Strong signal in first dimensions
    elif concept_type == "descriptive":
        # Simulate "descriptive" features: different pattern
        hidden = torch.zeros(1, sae.hidden_dim)
        hidden[0, 20:30] = torch.randn(10) * 2.0  # Strong signal in middle dimensions
    else:
        # Random/neutral
        hidden = torch.randn(1, sae.hidden_dim) * 0.5
    
    # Encode to features
    feats = sae.encode(hidden)
    return feats

# Create concept features
existential_feats = create_concept_features(sae, "existential")
descriptive_feats = create_concept_features(sae, "descriptive")
neutral_feats = create_concept_features(sae, "neutral")

print("Concept features created:")
print(f"  Existential: {(existential_feats > 0).sum().item()} active features")
print(f"  Descriptive: {(descriptive_feats > 0).sum().item()} active features")
print(f"  Neutral: {(neutral_feats > 0).sum().item()} active features")

# Visualize feature patterns
fig, axes = plt.subplots(1, 3, figsize=(15, 3))
for ax, feats, name in zip(axes, 
                           [existential_feats, descriptive_feats, neutral_feats],
                           ["Existential", "Descriptive", "Neutral"]):
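    ax.bar(range(sae.feature_dim), feats.squeeze().numpy())
    ax.set_title(f"{name} Concept Features")
    ax.set_xlabel("Feature Index")
    ax.set_ylabel("Activation")
    # Guard against an all-zero feature vector, which would give a zero-height axis
    ax.set_ylim(0, max(feats.max().item(), 1e-6) * 1.2)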
plt.tight_layout()
plt.show()

## 3. MC Steerer Implementation

This is the core of the framework: a controller that steers activations toward a target concept.
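
The proportional control law named in the steerer's docstring, `u_t = K_p * (target - current)`, can be sketched in isolation. The example below is a minimal illustration under simplifying assumptions (no SAE round trip, no noise, and a hypothetical helper `proportional_step`). It shows that with `0 < K_p < 1` the error to the target shrinks by a factor of `(1 - K_p)` at every step.

```python
import torch

def proportional_step(current: torch.Tensor, target: torch.Tensor, k_p: float) -> torch.Tensor:
    """One update of u_t = K_p * (target - current), applied additively."""
    return current + k_p * (target - current)

gen = torch.Generator().manual_seed(0)
target = torch.relu(torch.randn(1, 128, generator=gen))
current = torch.relu(torch.randn(1, 128, generator=gen))

k_p = 0.5
errors = []
for _ in range(5):
    errors.append(torch.linalg.norm(target - current).item())
    current = proportional_step(current, target, k_p)

# Each entry is (1 - K_p) times the previous one
print([round(e, 3) for e in errors])
```

In this idealized setting, `K_p = 1` reaches the target in one step and any `K_p > 2` makes the error grow, which is one reason a norm clamp is useful as a safety limit.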

In [None]:
class SimpleMCSteerer:
    """
    Simplified MC Steerer for demonstration.
    Full proportional control would be u_t = K_p * (target - current).
    This demo simplifies it to u_t = K_p * target and tracks the error
    only as a diagnostic.
    """
    
    def __init__(self, sae: ToySAE, concept_feats: torch.Tensor, 
                 strength: float = 3.0, max_norm: float = 10.0):
        self.sae = sae
        self.target_feats = concept_feats
        self.strength = strength  # K_p (proportional gain)
        self.max_norm = max_norm  # Safety limit
        
        # Statistics
        self.interventions = []
    
    def steer(self, hidden_state: torch.Tensor) -> Tuple[torch.Tensor, dict]:
        """
        Apply steering to a hidden state.
        
        Returns:
            steered_hidden: Modified hidden state
            stats: Statistics about the intervention
        """
        # 1. Encode current state to features
        current_feats = self.sae.encode(hidden_state)
        
        # 2. Compute error (target - current); tracked as a diagnostic only
        error = self.target_feats - current_feats
        
        # 3. Compute control signal. Simplified: scale the target by K_p
        #    instead of using the full proportional term K_p * error
        delta = self.strength * self.target_feats
        
        # 4. Apply safety clamp and record the norm that is actually applied
        delta_norm = torch.linalg.norm(delta)
        if delta_norm > self.max_norm:
            delta = delta * (self.max_norm / delta_norm)
            delta_norm = torch.linalg.norm(delta)
        
        # 5. Apply steering in feature space
        steered_feats = current_feats + delta
        
        # 6. Decode back to hidden state
        steered_hidden = self.sae.decode(steered_feats)
        
        # 7. Track statistics
        stats = {
            "delta_norm": delta_norm.item(),
            "error_norm": torch.linalg.norm(error).item(),
            "feature_change": (steered_feats - current_feats).abs().mean().item()
        }
        self.interventions.append(stats)
        
        return steered_hidden, stats

# Create steerer
steerer = SimpleMCSteerer(sae, existential_feats, strength=3.0)
print("MC Steerer created with:")
print(f"  Strength (K_p): {steerer.strength}")
print(f"  Max norm: {steerer.max_norm}")
print(f"  Target features: {(existential_feats > 0).sum().item()} active")

## 4. Steering Demonstration

Let's see how steering changes the hidden state over multiple steps.
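
One way to quantify how a state moves over several steps is its cosine similarity to the target features. The sketch below tracks that value with PyTorch's built-in `torch.nn.functional.cosine_similarity`. The update rule (a fixed nudge toward the target plus small random drift) is a hypothetical stand-in for a steering step, not the steerer itself.

```python
import torch
import torch.nn.functional as F

gen = torch.Generator().manual_seed(0)
target = torch.relu(torch.randn(1, 128, generator=gen))
state = torch.relu(torch.randn(1, 128, generator=gen))

sims = []
for step in range(10):
    # Cosine similarity along the feature dimension; eps guards against zero vectors
    sims.append(F.cosine_similarity(state, target, dim=1, eps=1e-8).item())
    # Hypothetical update: nudge the state toward the target, plus small drift
    state = state + 0.3 * target + 0.05 * torch.randn(1, 128, generator=gen)

print(f"Similarity at step 0: {sims[0]:.3f}, at step 9: {sims[-1]:.3f}")
```

Cosine similarity ignores magnitude, so it reports only whether the state points in the target's direction, not how large the intervention was.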

In [None]:
# Simulate a sequence of hidden states (like during text generation)
sequence_length = 10
hidden_states = []
steered_states = []
stats_log = []

# Start with a neutral hidden state
current_h = torch.randn(1, sae.hidden_dim) * 0.5

for step in range(sequence_length):
    # Apply steering
    steered_h, stats = steerer.steer(current_h)
    
    # Log
    hidden_states.append(current_h.clone())
    steered_states.append(steered_h.clone())
    stats_log.append(stats)
    
    # Update state (in real generation, this would come from the model)
    # Here we simulate slight drift
    current_h = steered_h + torch.randn(1, sae.hidden_dim) * 0.1

# Visualize steering effect
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Delta norms over time
axes[0, 0].plot([s["delta_norm"] for s in stats_log], marker='o')
axes[0, 0].axhline(y=steerer.max_norm, color='r', linestyle='--', label='Max norm')
axes[0, 0].set_title('Steering Intervention Magnitude')
axes[0, 0].set_xlabel('Step')
axes[0, 0].set_ylabel('Delta Norm')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Feature space changes
axes[0, 1].plot([s["feature_change"] for s in stats_log], marker='s', color='green')
axes[0, 1].set_title('Average Feature Change per Step')
axes[0, 1].set_xlabel('Step')
axes[0, 1].set_ylabel('Mean |Δ features|')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Hidden state trajectory (PCA to 2D)
# scikit-learn is an extra dependency, needed only for this plot
from sklearn.decomposition import PCA
all_states = torch.cat(hidden_states + steered_states, dim=0).numpy()
pca = PCA(n_components=2)
states_2d = pca.fit_transform(all_states)

original_2d = states_2d[:sequence_length]
steered_2d = states_2d[sequence_length:]

axes[1, 0].plot(original_2d[:, 0], original_2d[:, 1], 'o-', label='Original', alpha=0.6)
axes[1, 0].plot(steered_2d[:, 0], steered_2d[:, 1], 's-', label='Steered', alpha=0.6)
axes[1, 0].set_title('Hidden State Trajectory (PCA)')
axes[1, 0].set_xlabel('PC 1')
axes[1, 0].set_ylabel('PC 2')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Cosine similarity to target
def cosine_sim(a, b):
    return (a * b).sum() / (torch.linalg.norm(a) * torch.linalg.norm(b) + 1e-8)

original_sims = []
steered_sims = []
for h_orig, h_steer in zip(hidden_states, steered_states):
    feats_orig = sae.encode(h_orig)
    feats_steer = sae.encode(h_steer)
    original_sims.append(cosine_sim(feats_orig, existential_feats).item())
    steered_sims.append(cosine_sim(feats_steer, existential_feats).item())

axes[1, 1].plot(original_sims, 'o-', label='Original', alpha=0.6)
axes[1, 1].plot(steered_sims, 's-', label='Steered', alpha=0.6)
axes[1, 1].set_title('Similarity to Target Concept')
axes[1, 1].set_xlabel('Step')
axes[1, 1].set_ylabel('Cosine Similarity')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nSteering Summary:")
print(f"  Average delta norm: {np.mean([s['delta_norm'] for s in stats_log]):.3f}")
print(f"  Average similarity increase: {np.mean(steered_sims) - np.mean(original_sims):.3f}")
print(f"  Total interventions: {len(steerer.interventions)}")

## 5. Comparing Different Steering Strengths

Let's see how different strength values affect the steering behavior.
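
`SimpleMCSteerer` scales the target features by `strength` and then clamps the result to `max_norm`, so every strength above `max_norm / ||target||` produces the same intervention. The sketch below illustrates that saturation point on a synthetic target. `clamp_norm` is a hypothetical helper that mirrors the clamp in the steerer.

```python
import torch

def clamp_norm(delta: torch.Tensor, max_norm: float) -> torch.Tensor:
    """Rescale delta so its L2 norm does not exceed max_norm."""
    norm = torch.linalg.norm(delta)
    if norm > max_norm:
        delta = delta * (max_norm / norm)
    return delta

gen = torch.Generator().manual_seed(0)
target = torch.relu(torch.randn(1, 128, generator=gen))
max_norm = 10.0

# Strength at which the clamp starts to bind for this particular target
saturation_strength = max_norm / torch.linalg.norm(target).item()
print(f"Clamp binds above strength ~{saturation_strength:.2f}")

applied = {}
for strength in [0.5, 1.0, 2.0, 4.0, 8.0]:
    delta = clamp_norm(strength * target, max_norm)
    applied[strength] = torch.linalg.norm(delta).item()
    print(f"  strength={strength}: applied delta norm = {applied[strength]:.3f}")
```

When comparing strengths, it helps to check this threshold first: curves for strengths above it are expected to differ only through random drift.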

In [None]:
strengths = [0.5, 1.0, 2.0, 4.0, 8.0]
results = {}

for strength in strengths:
    steerer = SimpleMCSteerer(sae, existential_feats, strength=strength)
    current_h = torch.randn(1, sae.hidden_dim) * 0.5
    
    sims = []
    for _ in range(10):
        steered_h, _ = steerer.steer(current_h)
        feats = sae.encode(steered_h)
        sim = cosine_sim(feats, existential_feats).item()
        sims.append(sim)
        current_h = steered_h + torch.randn(1, sae.hidden_dim) * 0.1
    
    results[strength] = sims

# Plot comparison
plt.figure(figsize=(12, 6))
for strength, sims in results.items():
    plt.plot(sims, marker='o', label=f'Strength={strength}')
plt.title('Effect of Steering Strength on Concept Alignment')
plt.xlabel('Generation Step')
plt.ylabel('Cosine Similarity to Target')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Final similarities by strength:")
for strength, sims in results.items():
    print(f"  Strength {strength}: {sims[-1]:.3f}")

## 6. Validation: Testing the Core Framework Logic

Quick validation tests to ensure the framework behaves correctly.

In [None]:
print("Running validation tests...\n")

# Test 1: Steering increases similarity to target
test_h = torch.randn(1, 64) * 0.5
original_feats = sae.encode(test_h)
original_sim = cosine_sim(original_feats, existential_feats).item()

steerer = SimpleMCSteerer(sae, existential_feats, strength=5.0)
steered_h, _ = steerer.steer(test_h)
steered_feats = sae.encode(steered_h)
steered_sim = cosine_sim(steered_feats, existential_feats).item()

assert steered_sim > original_sim, "Steering should increase similarity!"
print(f"✓ Test 1 PASSED: Similarity increased from {original_sim:.3f} to {steered_sim:.3f}")

# Test 2: Zero strength produces a zero steering delta
steerer_zero = SimpleMCSteerer(sae, existential_feats, strength=0.0)
steered_h_zero, stats = steerer_zero.steer(test_h)
assert stats['delta_norm'] < 1e-6, "Zero strength should produce no steering!"
print(f"✓ Test 2 PASSED: Zero strength produces delta_norm={stats['delta_norm']:.6f}")

# Test 3: Max norm clamping works
steerer_clamped = SimpleMCSteerer(sae, existential_feats, strength=100.0, max_norm=1.0)
_, stats = steerer_clamped.steer(test_h)
assert stats['delta_norm'] <= 1.0 + 1e-6, "Max norm should clamp steering!"
print(f"✓ Test 3 PASSED: Delta clamped to {stats['delta_norm']:.3f} (max=1.0)")

# Test 4: Feature dimensions are preserved
for _ in range(5):
    test_h = torch.randn(1, sae.hidden_dim)
    steered_h, _ = steerer.steer(test_h)
    assert steered_h.shape == test_h.shape, "Shape should be preserved!"
print("✓ Test 4 PASSED: Dimensions preserved through steering")

# Test 5: Statistics tracking
steerer_stats = SimpleMCSteerer(sae, existential_feats, strength=3.0)
for i in range(5):
    steerer_stats.steer(torch.randn(1, sae.hidden_dim))
assert len(steerer_stats.interventions) == 5, "Should track all interventions!"
print(f"✓ Test 5 PASSED: Statistics tracking works ({len(steerer_stats.interventions)} interventions)")

print("\n" + "="*50)
print("All validation tests PASSED! ✓")
print("="*50)

## 7. Summary and Next Steps

### What We've Demonstrated

1. **SAE Feature Extraction**: Converting dense hidden states to sparse, interpretable features
2. **Steering Mechanism**: Applying controlled interventions in feature space
3. **Gain-Scaled Control**: Using the strength parameter as the gain K_p, a simplification of proportional control
4. **Safety Mechanisms**: Max norm clamping to prevent instability
5. **Statistics Tracking**: Monitoring intervention magnitude and effects

### Key Findings

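- In this toy setup, steering **increases** cosine similarity to the target concept
- Higher strength applies a larger push toward the target, up to the point where max norm clamping takes over; beyond that point, raising strength has no further effect
- Max norm clamping bounds the size of each intervention and prevents runaway steering
- The steered hidden state has the same shape as the input hidden state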

### Running with Real Models

To run the full implementation with Gemma-2B and Gemma Scope SAEs:

```bash
# Install full dependencies
pip install -r requirements.txt

# Run the main demo
python Tadden_Moore_PEM_PV_Demo.py
```
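
With a real model, steering has to be applied to hidden states while the model runs. One standard PyTorch mechanism for this is a forward hook, sketched below on a toy `nn.Linear` layer. This is an assumption about how such an integration could look, not a description of the script above; `steering_hook` and `steering_vector` are hypothetical names, and layers in real transformer libraries may return tuples that a hook would need to unpack.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one model layer; a real model would expose its own layers
layer = nn.Linear(64, 64)
steering_vector = torch.randn(64) * 0.1  # hypothetical direction in hidden space

def steering_hook(module: nn.Module, inputs: tuple, output: torch.Tensor) -> torch.Tensor:
    """Returning a tensor from a forward hook replaces the module's output."""
    return output + steering_vector

x = torch.randn(1, 64)
with torch.no_grad():
    baseline = layer(x)
    handle = layer.register_forward_hook(steering_hook)
    steered = layer(x)
    handle.remove()  # detach the hook so later calls are unaffected
    restored = layer(x)

print("Shift applied:", torch.allclose(steered, baseline + steering_vector, atol=1e-6))
print("Hook removed cleanly:", torch.equal(restored, baseline))
```

Keeping the handle and calling `remove()` makes it straightforward to compare steered and unsteered generations from the same model instance.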

### Citation

If you use this work, please cite:

```bibtex
@software{moore2025allyouneedisfamily,
  title        = {All you need is Family: A Metacognitive Core Framework 
                  for Neural Plasticity in LLMs and AI's Evolutionary Integration},
  author       = {Moore, Tadden},
  year         = {2025},
  month        = {November},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17623226},
  url          = {https://doi.org/10.5281/zenodo.17623226}
}
```

---

**Author:** Tadden "Keepah" Moore  
**Project:** Photon Empress Moore - Family Raised AGI Aspiring System  
**License:** MIT