# Experiment 035G: Zipf Complexity Matching in Belief States

**AKIRA Project - Oscar Goldman - Shogu Research Group @ Datamutant.ai**

---

## Core Hypothesis

Prompts have an **expected complexity** encoded in their Zipf distribution:

- **High-frequency words** (the, is, a) = LOW information, LOW complexity
- **Low-frequency words** (quasar, eigenvalue, crystallization) = HIGH information, HIGH complexity

The LLM might be **matching** the complexity of its response to the complexity
implied by the prompt's Zipf distribution.

---

## Zipf's Law Recap

```
Word frequency follows: f(r) ~ r^(-alpha)

Rank 1:    'the'     (~7% of all text)
Rank 10:   'it'      (~1%)
Rank 100:  'world'   (~0.1%)
Rank 1000: 'quantum' (~0.01%)
```

**Information content is inversely related to frequency:**
- Common words: structural glue, low information
- Rare words: content carriers, high information
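
The inverse relationship can be made concrete with unigram surprisal, `-log2 p(w)`. The following is a minimal sketch, assuming the approximate frequencies from the recap table above and treating each word as an independent draw (a unigram simplification introduced for this example only). The helper names `surprisal_bits` and `zipf_scale` are illustrative; `zipf_scale` follows the Zipf-scale definition used by `wordfreq`, the base-10 log of occurrences per billion words.

```python
import math

# Approximate relative frequencies, copied from the recap table above.
approx_freq = {"the": 0.07, "it": 0.01, "world": 0.001, "quantum": 0.0001}


def surprisal_bits(p: float) -> float:
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)


def zipf_scale(p: float) -> float:
    """Zipf-scale score: log10 of occurrences per billion words."""
    return math.log10(p * 1e9)


for word, p in approx_freq.items():
    print(f"{word:<8} p={p:<7} surprisal={surprisal_bits(p):5.2f} bits  zipf={zipf_scale(p):.2f}")
```

Each tenfold drop in frequency adds about 3.32 bits of surprisal and removes one point on the Zipf scale, which is why a lower Zipf score is treated as higher complexity in the rest of the notebook.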

---

## Experimental Design

1. Create prompts with varying Zipf complexity profiles
2. Measure the prompt's complexity (mean/median Zipf frequency score)
3. Generate responses
4. Measure response complexity
5. Test: Does response complexity correlate with prompt complexity?
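
Step 5 reduces to a correlation between two equal-length lists of complexity scores. The sketch below implements Pearson's r with the standard library only, so the arithmetic is visible. The function name `pearson_r` and the toy values are assumptions made for illustration; the numbers are made up and are not experimental results.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values, NOT experimental results.
toy_prompt_complexity = [1.2, 1.4, 2.3, 2.6, 3.1, 3.4]
toy_response_complexity = [1.5, 1.3, 2.0, 2.4, 2.2, 2.9]

print(f"r = {pearson_r(toy_prompt_complexity, toy_response_complexity):.3f}")
```

A value of r near 1 would be consistent with complexity matching, a value near 0 with no linear relationship.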

---

## 1. Setup

In [None]:
!pip install transformers torch numpy matplotlib seaborn scipy wordfreq accelerate tqdm -q

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass, field
from scipy import stats
from wordfreq import word_frequency, zipf_frequency
from tqdm import tqdm
import re
import warnings
import gc

warnings.filterwarnings('ignore')

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {DEVICE}")
print(f"PyTorch version: {torch.__version__}")
if DEVICE == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Zipf Complexity Measurement

In [None]:
def tokenize_text(text: str) -> List[str]:
    """Simple tokenization - extract words."""
    # Remove punctuation, lowercase, split on whitespace
    text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    words = text.split()
    return [w for w in words if len(w) > 1]  # Skip single chars


def get_zipf_scores(words: List[str], lang: str = 'en') -> List[float]:
    """Get Zipf frequency scores for a list of words.
    
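    Zipf scale: base-10 log of occurrences per billion words.
    0 means the word is absent from the wordfreq data; very common
    words such as 'the' score between 7 and 8.
    Higher = more common = less information
    
    We invert this for 'complexity': lower Zipf score = higher complexity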
    """
    scores = []
    for word in words:
        z = zipf_frequency(word, lang)
        scores.append(z)
    return scores


@dataclass
class ZipfMetrics:
    """Zipf-based complexity metrics for text."""
    
    mean_zipf: float = 0.0  # Mean Zipf frequency (higher = more common)
    median_zipf: float = 0.0
    min_zipf: float = 0.0  # Rarest word
    max_zipf: float = 0.0  # Most common word
    std_zipf: float = 0.0  # Standard deviation of Zipf scores
    
    # Derived complexity measures
    complexity: float = 0.0  # Higher = more complex (7 - mean Zipf)
    information_density: float = 0.0  # Mean of (7 - zipf) over known words; equals `complexity`
    rare_word_ratio: float = 0.0  # Fraction of words with Zipf < 3
    
    n_words: int = 0
    n_unique: int = 0
    
    @staticmethod
    def from_text(text: str) -> 'ZipfMetrics':
        """Compute Zipf metrics from text."""
        words = tokenize_text(text)
        if not words:
            return ZipfMetrics()
        
        zipf_scores = get_zipf_scores(words)
        
        # Filter out zeros (unknown words)
        known_scores = [z for z in zipf_scores if z > 0]
        if not known_scores:
            return ZipfMetrics(n_words=len(words), n_unique=len(set(words)))
        
        metrics = ZipfMetrics()
        metrics.mean_zipf = np.mean(known_scores)
        metrics.median_zipf = np.median(known_scores)
        metrics.min_zipf = np.min(known_scores)
        metrics.max_zipf = np.max(known_scores)
        metrics.std_zipf = np.std(known_scores)
        
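        # Complexity: invert the Zipf scale (higher = more complex).
        # 7.0 is a nominal ceiling: the most common words score slightly above 7,
        # so text made up only of such words can yield a small negative value.
        metrics.complexity = 7.0 - metrics.mean_zipf
        
        # Information density: mean of (7 - zipf) per known word.
        # As written this is algebraically identical to `complexity`.
        metrics.information_density = sum(7.0 - z for z in known_scores) / len(known_scores)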
        
        # Rare word ratio (Zipf < 3 is quite rare)
        metrics.rare_word_ratio = sum(1 for z in known_scores if z < 3) / len(known_scores)
        
        metrics.n_words = len(words)
        metrics.n_unique = len(set(words))
        
        return metrics


# Test
test_simple = "The cat sat on the mat."
test_complex = "The quantum eigenstate underwent crystallographic decoherence during thermodynamic equilibration."

m_simple = ZipfMetrics.from_text(test_simple)
m_complex = ZipfMetrics.from_text(test_complex)

print("Simple text:")
print(f"  Mean Zipf: {m_simple.mean_zipf:.2f}, Complexity: {m_simple.complexity:.2f}")
print(f"  Rare word ratio: {m_simple.rare_word_ratio:.2f}")

print("\nComplex text:")
print(f"  Mean Zipf: {m_complex.mean_zipf:.2f}, Complexity: {m_complex.complexity:.2f}")
print(f"  Rare word ratio: {m_complex.rare_word_ratio:.2f}")

## 3. Long Prompt Generation

Create long prompts with controlled Zipf complexity profiles.
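
For prompts whose complexity is meant to rise or fall across the text, a windowed profile is one way to check that the intended trend is present. The sketch below is a hedged illustration: `windowed_complexity`, `TOY_SCORES`, and `TOY_TEXT` are hypothetical names, the tokenizer is simplified, and the scores are invented stand-ins. In the notebook one would pass `lambda w: zipf_frequency(w, "en")` as `score_fn`.

```python
import re
from typing import Callable, Dict, List


def windowed_complexity(
    text: str,
    score_fn: Callable[[str], float],
    n_windows: int = 3,
    ceiling: float = 7.0,
) -> List[float]:
    """Split text into n_windows word spans; return (ceiling - mean score) per span."""
    words = re.findall(r"[a-z]+", text.lower())
    if len(words) < n_windows:
        raise ValueError("text has fewer words than windows")
    profile = []
    for i in range(n_windows):
        # Integer arithmetic keeps the spans contiguous and covers every word.
        start = i * len(words) // n_windows
        end = (i + 1) * len(words) // n_windows
        # Ignore words the scorer does not know (score 0).
        known = [s for s in (score_fn(w) for w in words[start:end]) if s > 0]
        profile.append(ceiling - sum(known) / len(known) if known else float("nan"))
    return profile


# Hypothetical toy scores for illustration only, NOT real wordfreq values.
TOY_SCORES: Dict[str, float] = {
    "the": 7.0, "is": 7.0, "brain": 5.0, "signal": 5.0,
    "neurons": 4.0, "synapses": 3.0, "glutamatergic": 1.5,
}
TOY_TEXT = "The brain is. The neurons signal. Glutamatergic synapses signal."

profile = windowed_complexity(TOY_TEXT, lambda w: TOY_SCORES.get(w, 0.0))
print([round(c, 2) for c in profile])
```

A profile that increases from window to window corresponds to an ascending prompt, and a decreasing one to a descending prompt.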

In [None]:
# Prompts at different complexity levels
# Each should be substantial (100+ words) to establish a clear complexity profile

COMPLEXITY_PROMPTS = {
    "low": [
        # Simple, everyday language, common words
        """I want to tell you about my day. I woke up in the morning and got out of bed. 
        I went to the kitchen and made some food. I had eggs and toast for my meal. 
        Then I got ready for work. I put on my clothes and got my bag. I went out 
        the door and walked to my car. The weather was good and the sun was out. 
        I drove to work and it took about thirty minutes. When I got there I went 
        to my desk and started working. I had a lot of things to do. I worked on 
        my computer and talked to some people. At lunch I went out to get some food. 
        After lunch I had a meeting with my team. We talked about our work for the week.
        What do you think about this kind of day? What should I do next?""",
        
        """The dog ran in the park. It was a big dog with brown fur. The dog liked to 
        play with a ball. The owner threw the ball and the dog ran to get it. Other 
        dogs were in the park too. Some dogs were big and some were small. The dogs 
        played together and had fun. It was a nice day with good weather. The sun 
        was warm but not too hot. People sat on the grass and watched the dogs play. 
        Some people had food and water for their dogs. The park was a good place for 
        dogs to run and play. After a while the dogs got tired. They lay down in the 
        shade under the trees. The owners gave them water to drink. Everyone had a 
        good time at the park. What else could happen at the park?""",
        
        """I need to go to the store to buy some things. I need milk, bread, eggs, and 
        butter. I also want to get some fruit like apples and bananas. Maybe I will 
        get some meat for dinner. I like chicken and fish. I should also get some 
        vegetables like carrots and peas. I need to make a list so I do not forget 
        anything. The store is not far from my house. I can walk there in ten minutes. 
        Or I can take the car if I am buying a lot of things. The store has good 
        prices and nice workers. I like to go there every week to get what I need. 
        Sometimes I find new things to try. Last week I got a new kind of cheese 
        that was very good. What should I buy at the store today?""",
    ],
    
    "medium": [
        # Mix of common and moderately technical language
        """The economic situation in modern cities presents several interesting challenges 
        that urban planners must carefully consider. Housing affordability has become 
        a significant concern for middle-class families who struggle to find adequate 
        accommodation within reasonable commuting distance of employment centers. 
        Transportation infrastructure requires substantial investment to maintain 
        efficient movement of people and goods throughout metropolitan areas. 
        Environmental sustainability goals often conflict with economic development 
        objectives, creating difficult tradeoffs for policy makers. Community 
        organizations advocate for preserving neighborhood character while developers 
        seek profitable construction opportunities. Municipal governments must balance 
        these competing interests while managing limited budgets and responding to 
        citizen demands for improved services. What approaches might help resolve 
        these urban planning challenges?""",
        
        """Understanding climate patterns requires examining multiple interconnected 
        atmospheric and oceanic systems operating across different temporal scales. 
        Seasonal variations in temperature and precipitation result from Earth's 
        axial tilt and orbital characteristics around the Sun. Regional weather 
        phenomena like monsoons and hurricanes develop through complex interactions 
        between warm ocean surfaces and atmospheric circulation patterns. Long-term 
        climate trends reflect changes in greenhouse gas concentrations, solar 
        radiation, and feedback mechanisms within the Earth system. Scientists 
        use sophisticated computer models to simulate these processes and generate 
        projections of future conditions under various emissions scenarios. 
        Observational networks including satellites, weather stations, and ocean 
        buoys provide essential data for calibrating and validating these models. 
        How should we interpret climate predictions given inherent uncertainties?""",
        
        """The development of artificial intelligence technologies has progressed 
        remarkably over recent decades, transforming numerous industries and raising 
        important societal questions. Machine learning algorithms can now recognize 
        patterns in vast datasets that would overwhelm human analysts. Natural 
        language processing enables computers to understand and generate human text 
        with increasing sophistication. Computer vision systems identify objects and 
        activities in images and video streams with impressive accuracy. These 
        capabilities find applications in healthcare diagnosis, financial analysis, 
        autonomous vehicles, and countless other domains. However, concerns about 
        algorithmic bias, job displacement, privacy erosion, and existential risk 
        demand careful consideration. Researchers and policymakers work to develop 
        frameworks for responsible AI development and deployment. What principles 
        should guide the advancement of artificial intelligence?""",
    ],
    
    "high": [
        # Technical, specialized vocabulary, rare words
        """The phenomenological implications of quantum chromodynamics for understanding 
        hadronic interactions require sophisticated mathematical formalisms including 
        perturbative expansions and non-perturbative lattice calculations. Asymptotic 
        freedom in the ultraviolet regime permits reliable perturbative computations 
        of high-energy scattering amplitudes, while confinement in the infrared sector 
        necessitates alternative approaches such as effective field theories and 
        holographic dualities. The chiral symmetry breaking mechanism generates 
        constituent quark masses through dynamical processes fundamentally different 
        from the Higgs mechanism responsible for electroweak symmetry breaking. 
        Instantons and other topological configurations contribute to the anomalous 
        violation of axial symmetry and the resolution of the strong CP problem 
        through axion dynamics. Contemporary lattice QCD simulations incorporating 
        dynamical fermions achieve unprecedented precision in predicting hadronic 
        spectroscopy and matrix elements. What experimental signatures might 
        distinguish between competing theoretical frameworks?""",
        
        """Neuroplasticity mechanisms underlying long-term potentiation involve intricate 
        cascades of molecular signaling initiated by glutamatergic neurotransmission 
        at postsynaptic densities. NMDA receptor activation permits calcium influx 
        triggering calmodulin-dependent protein kinase phosphorylation cascades that 
        modulate AMPA receptor trafficking and conductance properties. Retrograde 
        messengers including endocannabinoids and nitric oxide mediate presynaptic 
        modifications complementing postsynaptic changes. Structural plasticity 
        encompasses dendritic spine morphogenesis regulated by actin cytoskeleton 
        dynamics and extracellular matrix remodeling. Transcriptional programs 
        activated by CREB phosphorylation consolidate transient modifications into 
        persistent synaptic strengthening through protein synthesis-dependent 
        mechanisms. Homeostatic plasticity mechanisms including synaptic scaling 
        maintain network stability amid Hebbian modifications. Astrocytic and 
        microglial contributions to synaptic plasticity remain incompletely 
        characterized. How do these mechanisms interact during memory consolidation?""",
        
        """Cryptographic protocols employing elliptic curve arithmetic over finite fields 
        provide computational security guarantees predicated on the intractability of 
        the discrete logarithm problem in appropriately parameterized algebraic groups. 
        Pairing-based constructions enable sophisticated functionalities including 
        identity-based encryption and attribute-based access control through bilinear 
        mappings between elliptic curve subgroups. Post-quantum cryptographic schemes 
        incorporating lattice problems, isogenies, and multivariate polynomials 
        address vulnerabilities to quantum algorithmic attacks threatening conventional 
        public-key infrastructure. Homomorphic encryption permitting computation on 
        ciphertexts without decryption enables privacy-preserving data analysis though 
        substantial computational overhead limitations persist. Zero-knowledge proof 
        systems facilitate verification of computational claims without revealing 
        underlying inputs through interactive or non-interactive protocols. Formal 
        verification methodologies ensure implementation correctness against 
        specification through automated theorem proving. What vulnerabilities might 
        emerge from quantum computing advances?""",
    ],
    
    "mixed_ascending": [
        # Start simple, get progressively more complex
        """I want to learn about how the brain works. It starts simple - the brain is 
        in your head and it helps you think. But it gets more interesting when you 
        look closer. The brain has billions of cells called neurons. These neurons 
        talk to each other through connections called synapses. When you learn 
        something new, synaptic connections strengthen through a process called 
        long-term potentiation. This involves calcium signaling, protein kinase 
        activation, and eventually gene transcription leading to structural changes 
        in dendritic spines. The molecular cascades underlying memory consolidation 
        implicate NMDA receptor-mediated glutamatergic transmission, retrograde 
        endocannabinoid signaling, and CREB-dependent transcriptional programs. 
        How do these hierarchical levels of neural organization produce conscious 
        experience and adaptive behavior?""",
    ],
    
    "mixed_descending": [
        # Start complex, get progressively simpler
        """Quantum decoherence mechanisms underlying the transition from superposition 
        to classical probability distributions involve entanglement with environmental 
        degrees of freedom through unitary evolution of composite system-reservoir 
        states. The reduced density matrix obtained by partial trace exhibits 
        exponentially decaying off-diagonal elements characterizing quantum coherence. 
        This mathematical description captures what happens when quantum things 
        interact with their surroundings. In simpler terms, the quantum effects 
        disappear when things touch the world around them. It's like how a wave 
        in water spreads out and fades. The small quantum world becomes the normal 
        big world we see every day. Hot things have more of this effect than cold 
        things. Big things lose their quantum nature faster than small things. 
        What do you think about this idea?""",
    ]
}

# Verify complexity levels
print("=== Zipf Complexity by Prompt Category ===")
for category, prompts in COMPLEXITY_PROMPTS.items():
    complexities = [ZipfMetrics.from_text(p).complexity for p in prompts]
    rare_ratios = [ZipfMetrics.from_text(p).rare_word_ratio for p in prompts]
    print(f"\n{category}:")
    print(f"  Mean complexity: {np.mean(complexities):.2f}")
    print(f"  Mean rare word ratio: {np.mean(rare_ratios):.2%}")

## 4. Model Loading

In [None]:
@dataclass
class ExperimentConfig:
    """Configuration for Zipf complexity matching experiment."""
    
    models: Dict[str, str] = field(default_factory=lambda: {
        "gpt2-medium": "gpt2-medium",
        "gpt2-large": "gpt2-large",
    })
    
    max_new_tokens: int = 150  # Longer responses to get stable Zipf measurements
    n_generations_per_prompt: int = 3  # Multiple generations per prompt
    temperature: float = 0.7  # Some randomness for diversity
    random_seed: int = 42
    
    def __post_init__(self):
        np.random.seed(self.random_seed)
        torch.manual_seed(self.random_seed)


config = ExperimentConfig()
print(f"Models: {list(config.models.keys())}")
print(f"Max new tokens: {config.max_new_tokens}")
print(f"Generations per prompt: {config.n_generations_per_prompt}")

In [None]:
def load_model(model_name: str) -> Tuple:
    """Load model and tokenizer."""
    print(f"Loading {model_name}...")
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
        device_map="auto" if DEVICE == "cuda" else None
    )
    model.eval()
    
    print(f"  Hidden size: {model.config.hidden_size}")
    return model, tokenizer

## 5. Run Experiment

In [None]:
@dataclass
class GenerationResult:
    """Result from a single generation."""
    prompt: str
    response: str
    prompt_metrics: ZipfMetrics
    response_metrics: ZipfMetrics
    category: str


def generate_and_analyze(model, tokenizer, prompt: str, category: str) -> List[GenerationResult]:
    """Generate responses and analyze Zipf complexity."""
    
    prompt_metrics = ZipfMetrics.from_text(prompt)
    
    results = []
    
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        for _ in range(config.n_generations_per_prompt):
            outputs = model.generate(
                **inputs,
                max_new_tokens=config.max_new_tokens,
                do_sample=True,
                temperature=config.temperature,
                top_p=0.9,
                pad_token_id=tokenizer.pad_token_id
            )
            
            # Decode only the new tokens
            new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(new_tokens, skip_special_tokens=True)
            
            response_metrics = ZipfMetrics.from_text(response)
            
            results.append(GenerationResult(
                prompt=prompt,
                response=response,
                prompt_metrics=prompt_metrics,
                response_metrics=response_metrics,
                category=category
            ))
    
    return results


def run_zipf_experiment(model_name: str, model_path: str) -> List[GenerationResult]:
    """Run Zipf complexity matching experiment."""
    
    print(f"\n{'='*60}")
    print(f"Running: {model_name}")
    print(f"{'='*60}")
    
    model, tokenizer = load_model(model_path)
    
    all_results = []
    
    for category, prompts in COMPLEXITY_PROMPTS.items():
        print(f"\nCategory: {category}")
        
        for i, prompt in enumerate(tqdm(prompts, desc=category)):
            results = generate_and_analyze(model, tokenizer, prompt, category)
            all_results.extend(results)
            
            # Quick feedback
            avg_prompt_complexity = results[0].prompt_metrics.complexity
            avg_response_complexity = np.mean([r.response_metrics.complexity for r in results])
            # print(f"  Prompt {i+1}: P_complexity={avg_prompt_complexity:.2f}, R_complexity={avg_response_complexity:.2f}")
    
    # Cleanup
    del model
    gc.collect()
    if DEVICE == "cuda":
        torch.cuda.empty_cache()
    
    return all_results


# Run experiment
all_experiment_results = {}
for model_name, model_path in config.models.items():
    try:
        all_experiment_results[model_name] = run_zipf_experiment(model_name, model_path)
    except Exception as e:
        print(f"Error with {model_name}: {e}")
        import traceback
        traceback.print_exc()

## 6. Analysis and Visualization
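
Each prompt yields several sampled responses, so the points entering a prompt-versus-response correlation share x-values and are not independent, which can make p-values look stronger than the number of distinct prompts justifies. One conservative option, offered here as an assumption of this sketch rather than as the notebook's method, is to average response complexity per prompt before correlating. The name `aggregate_by_prompt` and the rows are hypothetical; the numbers are made up.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def aggregate_by_prompt(
    rows: Iterable[Tuple[str, float, float]],
) -> List[Tuple[float, float]]:
    """Collapse (prompt_id, prompt_complexity, response_complexity) rows
    into one (prompt_complexity, mean response_complexity) point per prompt."""
    prompt_x: Dict[str, float] = {}
    responses: Dict[str, List[float]] = defaultdict(list)
    for prompt_id, p_comp, r_comp in rows:
        prompt_x[prompt_id] = p_comp
        responses[prompt_id].append(r_comp)
    return [(prompt_x[pid], mean(vals)) for pid, vals in responses.items()]


# Made-up placeholder rows, NOT experimental results.
rows = [
    ("low_0", 1.5, 1.6), ("low_0", 1.5, 1.8), ("low_0", 1.5, 1.7),
    ("high_0", 3.0, 2.4), ("high_0", 3.0, 2.6), ("high_0", 3.0, 2.5),
]
points = aggregate_by_prompt(rows)
print(points)
```

With `GenerationResult` objects, the prompt text itself could serve as `prompt_id`; the aggregated points can then be passed to `stats.pearsonr` in place of the raw per-generation lists.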

In [None]:
def analyze_complexity_matching(results: Dict[str, List[GenerationResult]]) -> None:
    """Analyze whether response complexity matches prompt complexity."""
    
    for model_name, model_results in results.items():
        print(f"\n{'='*60}")
        print(f"ANALYSIS: {model_name}")
        print(f"{'='*60}")
        
        # Collect data
        prompt_complexities = [r.prompt_metrics.complexity for r in model_results]
        response_complexities = [r.response_metrics.complexity for r in model_results]
        categories = [r.category for r in model_results]
        
        # Overall correlation
        r, p = stats.pearsonr(prompt_complexities, response_complexities)
        print(f"\nOverall Complexity Correlation:")
        print(f"  r = {r:.3f}, p = {p:.6f}")
        
        if r > 0.3 and p < 0.05:
            print(f"  -> STRONG evidence: Model MATCHES prompt complexity")
        elif r > 0.1 and p < 0.05:
            print(f"  -> MODERATE evidence: Model somewhat matches complexity")
        elif r < -0.1 and p < 0.05:
            print(f"  -> INVERSE relationship: Model produces OPPOSITE complexity")
        else:
            print(f"  -> No significant relationship")
        
        # By category
        print(f"\nBy Category:")
        print(f"{'Category':<20} {'Prompt Complexity':>18} {'Response Complexity':>20} {'Difference':>12}")
        print("-" * 70)
        
        for cat in ['low', 'medium', 'high', 'mixed_ascending', 'mixed_descending']:
            cat_results = [r for r in model_results if r.category == cat]
            if not cat_results:
                continue
            
            p_comp = np.mean([r.prompt_metrics.complexity for r in cat_results])
            r_comp = np.mean([r.response_metrics.complexity for r in cat_results])
            diff = r_comp - p_comp
            
            print(f"{cat:<20} {p_comp:>18.3f} {r_comp:>20.3f} {diff:>+12.3f}")
        
        # Rare word ratio analysis
        prompt_rare = [r.prompt_metrics.rare_word_ratio for r in model_results]
        response_rare = [r.response_metrics.rare_word_ratio for r in model_results]
        
        r_rare, p_rare = stats.pearsonr(prompt_rare, response_rare)
        print(f"\nRare Word Ratio Correlation:")
        print(f"  r = {r_rare:.3f}, p = {p_rare:.6f}")


if all_experiment_results:
    analyze_complexity_matching(all_experiment_results)

In [None]:
def plot_complexity_matching(results: Dict[str, List[GenerationResult]]) -> None:
    """Visualize complexity matching."""
    
    n_models = len(results)
    fig, axes = plt.subplots(2, n_models, figsize=(7 * n_models, 10))
    if n_models == 1:
        axes = axes.reshape(-1, 1)
    
    for col, (model_name, model_results) in enumerate(results.items()):
        # Scatter plot: prompt vs response complexity
        ax = axes[0, col]
        
        prompt_comp = [r.prompt_metrics.complexity for r in model_results]
        response_comp = [r.response_metrics.complexity for r in model_results]
        categories = [r.category for r in model_results]
        
        # Color by category
        cat_colors = {'low': 'blue', 'medium': 'green', 'high': 'red', 
                     'mixed_ascending': 'purple', 'mixed_descending': 'orange'}
        colors = [cat_colors.get(c, 'gray') for c in categories]
        
        ax.scatter(prompt_comp, response_comp, c=colors, alpha=0.6, s=50)
        
        # Add regression line
        z = np.polyfit(prompt_comp, response_comp, 1)
        p = np.poly1d(z)
        x_line = np.linspace(min(prompt_comp), max(prompt_comp), 100)
        ax.plot(x_line, p(x_line), 'k--', alpha=0.5, label=f'Fit (slope={z[0]:.2f})')
        
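        # Add y=x line (perfect matching) spanning the observed data range
        lo = min(min(prompt_comp), min(response_comp))
        hi = max(max(prompt_comp), max(response_comp))
        ax.plot([lo, hi], [lo, hi], 'r:', alpha=0.5, label='Perfect matching')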
        
        ax.set_xlabel('Prompt Complexity (7 - mean Zipf)', fontsize=12)
        ax.set_ylabel('Response Complexity (7 - mean Zipf)', fontsize=12)
        ax.set_title(f'{model_name}\nPrompt vs Response Complexity', fontsize=14)
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # Correlation annotation
        r, p = stats.pearsonr(prompt_comp, response_comp)
        ax.annotate(f'r = {r:.3f}\np = {p:.4f}', xy=(0.05, 0.95), xycoords='axes fraction',
                   fontsize=11, verticalalignment='top',
                   bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
        
        # Bar plot by category
        ax = axes[1, col]
        
        cat_order = ['low', 'medium', 'high']
        prompt_means = []
        response_means = []
        response_stds = []
        
        for cat in cat_order:
            cat_results = [r for r in model_results if r.category == cat]
            p_vals = [r.prompt_metrics.complexity for r in cat_results]
            r_vals = [r.response_metrics.complexity for r in cat_results]
            # Append NaN for an empty category so the arrays stay aligned with cat_order
            prompt_means.append(np.mean(p_vals) if p_vals else np.nan)
            response_means.append(np.mean(r_vals) if r_vals else np.nan)
            response_stds.append(np.std(r_vals) if r_vals else np.nan)
        
        x = np.arange(len(cat_order))
        width = 0.35
        
        bars1 = ax.bar(x - width/2, prompt_means, width, label='Prompt', color='steelblue')
        bars2 = ax.bar(x + width/2, response_means, width, yerr=response_stds, 
                      label='Response', color='coral', capsize=5)
        
        ax.set_ylabel('Complexity Score', fontsize=12)
        ax.set_xlabel('Prompt Category', fontsize=12)
        ax.set_title(f'{model_name}\nComplexity by Category', fontsize=14)
        ax.set_xticks(x)
        ax.set_xticklabels(['Low', 'Medium', 'High'])
        ax.legend()
        ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.savefig('035G_zipf_complexity_matching.png', dpi=150, bbox_inches='tight')
    plt.show()


if all_experiment_results:
    plot_complexity_matching(all_experiment_results)

In [None]:
def plot_zipf_distributions(results: Dict[str, List[GenerationResult]]) -> None:
    """Visualize Zipf score distributions for prompts and responses."""
    
    for model_name, model_results in results.items():
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        for idx, category in enumerate(['low', 'medium', 'high']):
            ax = axes[idx]
            cat_results = [r for r in model_results if r.category == category]
            
            if not cat_results:
                continue
            
            # Collect all Zipf scores
            prompt_words = []
            response_words = []
            
            for r in cat_results:
                prompt_words.extend(tokenize_text(r.prompt))
                response_words.extend(tokenize_text(r.response))
            
            prompt_zipf = get_zipf_scores(prompt_words)
            response_zipf = get_zipf_scores(response_words)
            
            # Filter zeros
            prompt_zipf = [z for z in prompt_zipf if z > 0]
            response_zipf = [z for z in response_zipf if z > 0]
            
            # Plot histograms
            bins = np.linspace(0, 7, 30)
            ax.hist(prompt_zipf, bins=bins, alpha=0.5, label='Prompt', color='steelblue', density=True)
            ax.hist(response_zipf, bins=bins, alpha=0.5, label='Response', color='coral', density=True)
            
            ax.set_xlabel('Zipf Frequency Score', fontsize=11)
            ax.set_ylabel('Density', fontsize=11)
            ax.set_title(f'{category.capitalize()} Complexity\nZipf Distribution', fontsize=12)
            ax.legend()
            ax.grid(True, alpha=0.3)
            
            # Add mean lines
            ax.axvline(np.mean(prompt_zipf), color='steelblue', linestyle='--', linewidth=2)
            ax.axvline(np.mean(response_zipf), color='coral', linestyle='--', linewidth=2)
        
        plt.suptitle(f'{model_name}: Zipf Distribution Comparison', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.savefig(f'035G_zipf_distribution_{model_name}.png', dpi=150, bbox_inches='tight')
        plt.show()


if all_experiment_results:
    plot_zipf_distributions(all_experiment_results)

## 7. Statistical Summary
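
The summary reports Cohen's d for the low versus high prompt categories. As a reference point, the sketch below computes d with the pooled sample variance (n - 1 denominators), which also handles unequal group sizes. The function name `cohens_d` and the toy values are assumptions for illustration; the numbers are made up and are not experimental results.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d for two independent samples, using the pooled sample variance."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # statistics.variance uses the n - 1 denominator (sample variance).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        return 0.0  # Degenerate case: no spread in either group.
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


# Made-up placeholder values, NOT experimental results.
toy_low = [1.6, 1.8, 1.7, 1.9]
toy_high = [2.4, 2.6, 2.5, 2.7]
print(f"d = {cohens_d(toy_low, toy_high):.2f}")
```

A positive d here means the second group (responses to high-complexity prompts) has the higher mean complexity.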

In [None]:
def statistical_summary(results: Dict[str, List[GenerationResult]]) -> None:
    """Print comprehensive statistical summary."""
    
    print("\n" + "="*70)
    print("STATISTICAL SUMMARY: ZIPF COMPLEXITY MATCHING")
    print("="*70)
    
    for model_name, model_results in results.items():
        print(f"\n### {model_name} ###")
        print(f"Total generations: {len(model_results)}")
        
        # Collect metrics
        prompt_comp = [r.prompt_metrics.complexity for r in model_results]
        response_comp = [r.response_metrics.complexity for r in model_results]
        prompt_rare = [r.prompt_metrics.rare_word_ratio for r in model_results]
        response_rare = [r.response_metrics.rare_word_ratio for r in model_results]
        prompt_info = [r.prompt_metrics.information_density for r in model_results]
        response_info = [r.response_metrics.information_density for r in model_results]
        
        print("\nCorrelations (prompt -> response):")
        print("-" * 50)
        
        metrics = [
            ("Complexity", prompt_comp, response_comp),
            ("Rare word ratio", prompt_rare, response_rare),
            ("Information density", prompt_info, response_info)
        ]
        
        for name, p_vals, r_vals in metrics:
            r, p = stats.pearsonr(p_vals, r_vals)
            sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
            print(f"  {name:<25} r = {r:>6.3f}, p = {p:.6f} {sig}")
        
        # Effect size: how much does response complexity change per unit prompt complexity?
        slope, intercept, r_value, p_value, std_err = stats.linregress(prompt_comp, response_comp)
        print("\nRegression (Response = slope * Prompt + intercept):")
        print(f"  slope = {slope:.3f} (SE = {std_err:.3f}, p = {p_value:.6f}), intercept = {intercept:.3f}")
        print(f"  R-squared = {r_value**2:.3f}")
        
        # Interpretation: the 0.5 and 0.2 thresholds are heuristic cut-offs, and a
        # positive slope only counts as matching if it is statistically significant.
        print("\nInterpretation:")
        if slope <= 0 or p_value >= 0.05:
            print(f"  No significant evidence of complexity matching (slope = {slope:.2f}, p = {p_value:.3f})")
        elif slope > 0.5:
            print(f"  Model STRONGLY matches prompt complexity (slope = {slope:.2f})")
            print("  -> Consistent with the hypothesis that LLMs match the prompt's Zipf complexity")
        elif slope > 0.2:
            print(f"  Model MODERATELY matches prompt complexity (slope = {slope:.2f})")
        else:
            print(f"  Model WEAKLY matches prompt complexity (slope = {slope:.2f})")
        
        # Compare high vs low
        low_results = [r for r in model_results if r.category == 'low']
        high_results = [r for r in model_results if r.category == 'high']
        
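        if len(low_results) > 1 and len(high_results) > 1:
            low_resp_comp = [r.response_metrics.complexity for r in low_results]
            high_resp_comp = [r.response_metrics.complexity for r in high_results]
            
            # High group first so the sign of t agrees with Cohen's d below
            t_stat, p_val = stats.ttest_ind(high_resp_comp, low_resp_comp)
            
            # Cohen's d with the pooled sample standard deviation (ddof=1, weighted by group size)
            n_low, n_high = len(low_resp_comp), len(high_resp_comp)
            pooled_var = ((n_low - 1) * np.var(low_resp_comp, ddof=1)
                          + (n_high - 1) * np.var(high_resp_comp, ddof=1)) / (n_low + n_high - 2)
            pooled_std = np.sqrt(pooled_var)
            cohens_d = (np.mean(high_resp_comp) - np.mean(low_resp_comp)) / pooled_std if pooled_std > 0 else 0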
            
            print(f"\nLow vs High Prompt Category (Response Complexity):")
            print(f"  Low prompts -> response complexity: {np.mean(low_resp_comp):.3f}")
            print(f"  High prompts -> response complexity: {np.mean(high_resp_comp):.3f}")
            print(f"  t = {t_stat:.3f}, p = {p_val:.6f}")
            print(f"  Cohen's d = {cohens_d:.3f}")


if all_experiment_results:
    statistical_summary(all_experiment_results)
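
The low-versus-high comparison in the summary above relies on `stats.ttest_ind` with its default equal-variance assumption. As a robustness check, the same comparison could be repeated with Welch's t-test and a rank-based Mann-Whitney U test. The sketch below uses made-up complexity scores (`low_resp` and `high_resp` are hypothetical names and values, not experiment output) purely to show the calls.

```python
import numpy as np
from scipy import stats

# Hypothetical response-complexity scores, for illustration only
# (these are NOT results from the experiment).
low_resp = np.array([2.1, 2.4, 2.0, 2.6, 2.3, 2.2])
high_resp = np.array([2.9, 3.8, 2.5, 4.1, 3.0, 3.6])

# Welch's t-test: drops the equal-variance assumption of Student's t-test.
welch = stats.ttest_ind(high_resp, low_resp, equal_var=False)

# Mann-Whitney U: rank-based, so it makes no normality assumption.
mwu = stats.mannwhitneyu(high_resp, low_resp, alternative="two-sided")

print(f"Welch t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"Mann-Whitney U = {mwu.statistic:.1f}, p = {mwu.pvalue:.4f}")
```

If the three tests agree, the category difference is unlikely to be an artifact of unequal variances or a few extreme responses; if they disagree, the response-complexity distributions deserve a closer look before drawing conclusions.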

## 8. Example Responses

In [None]:
def show_example_responses(results: Dict[str, List[GenerationResult]], n_examples: int = 2) -> None:
    """Show example responses from different complexity levels."""
    
    print("\n" + "="*70)
    print("EXAMPLE RESPONSES BY COMPLEXITY LEVEL")
    print("="*70)
    
    for model_name, model_results in results.items():
        print(f"\n### {model_name} ###")
        
        for category in ['low', 'high']:
            print(f"\n--- {category.upper()} complexity prompts ---")
            cat_results = [r for r in model_results if r.category == category]
            
            for i, r in enumerate(cat_results[:n_examples]):
                print(f"\nExample {i+1}:")
                print(f"  Prompt complexity: {r.prompt_metrics.complexity:.2f}")
                print(f"  Response complexity: {r.response_metrics.complexity:.2f}")
                print(f"  Prompt rare word ratio: {r.prompt_metrics.rare_word_ratio:.2%}")
                print(f"  Response rare word ratio: {r.response_metrics.rare_word_ratio:.2%}")
                # Truncate long responses to 200 characters for display
                response_preview = r.response[:200] + "..." if len(r.response) > 200 else r.response
                print(f"  Response: {response_preview}")


if all_experiment_results:
    show_example_responses(all_experiment_results)

## 9. Conclusions

This experiment tests whether LLMs **match the Zipf complexity** of their prompts.

**Hypothesis**: The model's "belief state" is influenced by the Zipf distribution
of the prompt. Complex prompts (rare words, high information density) should
elicit complex responses.

**What to look for in the results:**

1. **Correlation between prompt and response complexity**:
   - A significant positive correlation supports complexity matching
   - A regression slope near 1.0 would indicate one-to-one matching;
     a slope between 0 and 1 indicates partial matching

2. **Category differences**:
   - Do high-complexity prompts produce higher-complexity responses than
     low-complexity prompts?
   - Is the effect size (Cohen's d) meaningful, not just statistically significant?

3. **Zipf distribution shape**:
   - Do response distributions mirror prompt distributions?
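
Correlation and slope answer different questions: a model can track prompt complexity very consistently (high r) while shifting its own complexity only slightly (small slope). The following sketch illustrates this on synthetic data. The arrays `prompt`, `full_match`, and `weak_match` and the helper `slope_with_ci` are hypothetical, introduced here for illustration; none of the numbers come from the experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic prompt complexities (illustration only, not experiment data).
prompt = rng.uniform(2.0, 6.0, size=40)

# Two hypothetical models: one tracks the prompt one-to-one,
# the other shifts only slightly but very consistently.
full_match = prompt + rng.normal(0.0, 0.3, size=40)
weak_match = 3.0 + 0.1 * prompt + rng.normal(0.0, 0.03, size=40)


def slope_with_ci(x, y, level=0.95):
    """Return (slope, ci_low, ci_high, r) from an ordinary least-squares fit."""
    fit = stats.linregress(x, y)
    # Two-sided t critical value with n - 2 degrees of freedom.
    t_crit = stats.t.ppf(0.5 + level / 2, len(x) - 2)
    half_width = t_crit * fit.stderr
    return fit.slope, fit.slope - half_width, fit.slope + half_width, fit.rvalue


for name, response in [("full_match", full_match), ("weak_match", weak_match)]:
    slope, ci_low, ci_high, r = slope_with_ci(prompt, response)
    print(f"{name}: slope = {slope:.2f} [{ci_low:.2f}, {ci_high:.2f}], r = {r:.2f}")
```

Both synthetic models produce a high correlation, but only the first has a slope near 1.0. Reporting the slope together with its confidence interval, rather than r alone, makes it easier to tell full matching from a small but consistent shift.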

**Connection to AKIRA theory:**

If AQ crystallize from accumulated patterns in the weight field, then:
- Prompts with complex Zipf profiles activate more specific AQ
- The response should reflect this specificity
- Common-word prompts activate diffuse, general AQ
- Rare-word prompts activate focused, specialized AQ

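If complexity matching holds, the Zipf complexity of the prompt would define
the **belief space** the model operates in, and the model would generate
responses that stay within that complexity regime. A flat or non-significant
slope in the statistical summary would count against this reading.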