Skip to content

Course-Correct-Labs/entropy-collapse-null

Repository files navigation

No Evidence for Epistemic Entropy Collapse in Small Open Language Models

CI License: Apache 2.0 Python 3.11 arXiv DOI

Author: Bentley DeVilling Affiliation: Course Correct Labs Contact: bentley@coursecorrectlabs.com


Overview

This repository contains the complete code and data to reproduce all figures and analyses from the paper:

"No Evidence for Epistemic Entropy Collapse in Small Open Language Models" DeVilling, B. (2025). Course Correct Labs.

Key Finding: We find no evidence that small open-source language models (microsoft/phi-2, mistralai/Mistral-7B-v0.1) exhibit "epistemic entropy collapse" — a hypothesized phenomenon where hidden state representations progressively lose diversity during text generation, leading to behavioral failure.

Results at a Glance

  • Mean ECI: −0.001 (SD ≈ 0.025)
  • Collapse rate: ~9.8% of sequences with ECI < −0.02
  • Predictive utility: ROC-AUC ≈ 0.454 (95% CI [0.41, 0.50]) — near chance
  • Effective rank trajectories: Flat across generation, no systematic decline

Quick Start

One-Command Reproduction

# Clone repository
git clone https://github.com/Course-Correct-Labs/entropy-collapse-null.git
cd entropy-collapse-null

# Set up environment
conda env create -f environment.yml
conda activate eec-null

# Reproduce all figures from paper
bash scripts/reproduce_all_figures.sh

Output: Three publication-quality figures (600 DPI) in runs/affordable/figures/:

  • fig1_eci_histograms.png
  • fig2_effective_rank_trajectories.png
  • fig3_failure_prediction_panel.png

Installation

Option 1: Conda (Recommended)

conda env create -f environment.yml
conda activate eec-null

Option 2: Pip

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Verify Installation

bash scripts/verify_environment.sh

Expected output:

✓ All required packages installed!

Repository Structure

entropy-collapse-null/
├── src/                          # Python package
│   ├── __init__.py              # Package initialization
│   ├── constants.py             # Configuration and constants
│   ├── utils.py                 # Data loading and validation
│   ├── metrics_internal.py      # Effective rank, participation ratio
│   ├── metrics_external.py      # ΔI drift, n-gram novelty
│   ├── eci.py                   # Epistemic Collapse Index (ECI)
│   ├── bootstrap.py             # Bootstrap confidence intervals
│   ├── figures.py               # Figure generation
│   └── cli.py                   # Command-line interface
├── scripts/                      # Shell scripts
│   ├── reproduce_all_figures.sh # One-command reproduction
│   ├── run_smoke.sh             # Fast smoke test (<5 min)
│   ├── lint_check.sh            # Code quality checks
│   └── verify_environment.sh    # Environment verification
├── runs/affordable/              # Data directory
│   ├── metrics_internal.csv     # Internal model metrics
│   ├── metrics_external.csv     # External behavioral metrics
│   └── manifest.json            # Run metadata
├── figures/                      # Output directory for figures
├── paper/                        # Paper and documentation
│   └── captions.md              # Figure captions
├── data/                         # Data documentation
│   └── README.md                # Dataset description
├── .github/workflows/            # CI/CD
│   └── ci.yml                   # GitHub Actions workflow
├── environment.yml               # Conda environment
├── requirements.txt              # Pip dependencies
├── LICENSE                       # Apache 2.0 license
├── CITATION.cff                  # Citation metadata
└── README.md                     # This file

Usage

Generate All Figures

bash scripts/reproduce_all_figures.sh

Runtime: ~10-15 minutes on CPU Output: runs/affordable/figures/fig1_eci_histograms.png, fig2_effective_rank_trajectories.png, fig3_failure_prediction_panel.png

Smoke Test (Fast Validation)

bash scripts/run_smoke.sh

Runtime: <5 minutes Purpose: Validates code correctness on 5% subsample (n≈30) Output: runs/affordable/figures/*_smoke.png files

Python API

from pathlib import Path
from src.figures import generate_all_figures

# Generate all figures
generate_all_figures(
    run_dir=Path("runs/affordable"),
    output_dir=Path("runs/affordable/figures"),
    smoke=False,  # Set True for fast smoke test
    dpi=600
)

Command-Line Interface

# Full reproduction
python -m src.cli reproduce --in runs/affordable --out runs/affordable/figures --dpi 600

# Smoke test
python -m src.cli reproduce --in runs/affordable --out runs/affordable/figures --dpi 300 --smoke

Figures

Figure 1: ECI Distribution Histograms

Distribution of residualized Epistemic Collapse Index (ECI) values for microsoft/phi-2 vs control. Both models show similar distributions centered near zero, with no evidence of systematic collapse.

Figure 1

Figure 2: Effective Rank Trajectories

Effective rank trajectories over token generation for "collapsed" (ECI < −0.02) vs "normal" (ECI ≥ −0.02) sequences. Each line represents one sequence (n=50 sampled per group). Both groups show substantial within-group variability with no systematic decline across ~800 tokens of generation.

Figure 2

Figure 3: Failure Prediction Performance

Predictive utility of ECI for identifying QA task failures. All metrics indicate near-chance performance (ROC-AUC ≈ 0.50), demonstrating that ECI does not reliably predict behavioral failure.

Figure 3


Data

Dataset Description

The runs/affordable/ directory contains preprocessed metrics for n=346 sequences:

  • metrics_internal.csv: Internal model metrics

    • Columns: prompt_id, model_name, eci_raw, eci_residualized, effective_ranks, participation_ratios, variances
  • metrics_external.csv: External behavioral metrics

    • Columns: prompt_id, model_name, qa_failure, delta_i_values, ngram_novelty_values, char_entropy_values
  • manifest.json: Run metadata (seed, configuration)

Schema enforcement: Exact column names and types are validated at load time. See src/utils.py for schema checks. Missing or renamed columns will raise clear errors.

See data/README.md for detailed column descriptions.

Models Tested

  • microsoft/phi-2 (2.7B parameters): Primary model
  • mistralai/Mistral-7B-v0.1 (7.2B parameters): Control

Data Availability

Preprocessed metrics (CSVs) are included in this repository. Raw data (hidden states, ~50GB) available upon request: bentley@coursecorrectlabs.com


Methods

Epistemic Collapse Index (ECI)

ECI measures the rate of change in representational diversity over token generation:

ECI = slope(effective_rank ~ token_index)
  • Effective rank: Exponential of Shannon entropy over singular value spectrum
  • Negative ECI: Declining diversity (hypothesized "collapse")
  • Threshold: ECI < -0.02 (adopted from prior literature)

Metrics

Internal (from hidden states):

  • Effective rank (diversity of representations)
  • Participation ratio (dimensionality)
  • Variance (activation magnitude)

External (from generated text):

  • ΔI drift (n-gram divergence)
  • N-gram novelty (lexical diversity)
  • Character entropy (randomness)
  • QA failure (TruthfulQA correctness)

Analysis Pipeline

  1. Extract hidden states at each generation step
  2. Compute internal metrics over sliding windows (128 tokens, stride 64)
  3. Compute ECI as slope of effective rank trajectory
  4. Residualize against control condition (Mistral-7B)
  5. Evaluate predictive utility for QA failure (ROC-AUC, PR-AUC)

Reproducibility & Validation

This repository implements the complete analytical pipeline described in Section 6 (Reproducibility) of the paper. All code, data, and figures can be independently verified.

Validation Checklist

  • Data integrity: n=346 sequences (200 Phi-2 + 146 Mistral-7B) match paper Table 1
  • Statistical results: Mean ECI = −0.001 (SD = 0.025), ROC-AUC = 0.454 [0.41, 0.50]
  • Figure generation: All three figures regenerate exactly as shown in paper
  • Schema validation: CSV columns strictly enforced via src/utils.py
  • Numerical stability: Participation ratio handles inf/nan with logging
  • Seed reproducibility: All random operations use fixed seeds (manifest: 13, bootstrap: 42)

Independent Verification

To verify results independently:

# 1. Clone and setup
git clone https://github.com/Course-Correct-Labs/entropy-collapse-null.git
cd entropy-collapse-null
conda env create -f environment.yml
conda activate eec-null

# 2. Verify environment
bash scripts/verify_environment.sh

# 3. Run smoke test (5% sample, <5 min)
bash scripts/run_smoke.sh

# 4. Full reproduction (all 346 sequences, ~10-15 min)
bash scripts/reproduce_all_figures.sh

# 5. Check output matches paper
ls -lh runs/affordable/figures/
# Expected: fig1_eci_histograms.png, fig2_effective_rank_trajectories.png, fig3_failure_prediction_panel.png

Cross-Platform Testing

Tested on:

  • macOS 14.5 (Apple Silicon M1/M2, Python 3.11)
  • Ubuntu 22.04 (x86_64, Python 3.11)
  • GitHub Actions CI (ubuntu-latest, smoke test only)

Computational Requirements

  • CPU-only: All analyses run on standard CPU (no GPU required)
  • Memory: ~4GB RAM for full reproduction, ~1GB for smoke test
  • Storage: ~50MB for repository + data
  • Runtime: 10-15 minutes (full), <5 minutes (smoke)

Development

Code Quality

# Run linter
bash scripts/lint_check.sh

# Format code
ruff format src/

# Type checking (optional)
mypy src/

Testing

# Smoke test
bash scripts/run_smoke.sh

# Full reproduction
bash scripts/reproduce_all_figures.sh

CI/CD

GitHub Actions runs on every push:

  • Linting (ruff + black)
  • Smoke-only CI (<5 min with 5% subsample)

Full reproduction (~10-15 min) should be run locally. See .github/workflows/ci.yml


Troubleshooting

Issue: FileNotFoundError: manifest.json not found

Solution: Ensure data files are in runs/affordable/:

ls runs/affordable/
# Should show: metrics_internal.csv, metrics_external.csv, manifest.json

Issue: Missing required columns in metrics_internal.csv

Solution: Verify CSV headers match expected format:

head -n1 runs/affordable/metrics_internal.csv
# Should include: prompt_id, model_name, eci_raw, eci_residualized, ...

Issue: Matplotlib backend errors on headless server

Solution: Set non-interactive backend:

import matplotlib
matplotlib.use('Agg')

Or set environment variable:

export MPLBACKEND=Agg

Issue: Smoke test fails with "No matching rows found"

Cause: Internal and external CSVs have mismatched prompt_id or model_name values

Solution: Check merge keys:

import pandas as pd
df_int = pd.read_csv('runs/affordable/metrics_internal.csv')
df_ext = pd.read_csv('runs/affordable/metrics_external.csv')
print(set(df_int['prompt_id']) - set(df_ext['prompt_id']))

Citation

If you use this code or data, please cite:

BibTeX:

@article{devilling2025entropy,
  title={No Evidence for Epistemic Entropy Collapse in Small Open Language Models},
  author={DeVilling, Bentley},
  year={2025},
  organization={Course Correct Labs}
}

APA:

DeVilling, B. (2025). No Evidence for Epistemic Entropy Collapse in Small Open Language Models. Course Correct Labs.

License

Code: Apache License 2.0 (see LICENSE) Data: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/) Paper/Figures: CC BY-SA 4.0


Release Checklist

When preparing a release:

  • Run smoke test: bash scripts/run_smoke.sh
  • Run full reproduction: bash scripts/reproduce_all_figures.sh
  • Verify all figures generated correctly
  • Run linter: bash scripts/lint_check.sh
  • Check CI passes on GitHub
  • Update version in src/__init__.py and CITATION.cff
  • Tag release: git tag v1.0.0 && git push --tags
  • Create GitHub release with figures attached
  • Upload to Zenodo for DOI
  • Update DOI badge in README.md

Zenodo Upload Instructions

  1. Go to https://zenodo.org/deposit/new
  2. Upload release tarball or link GitHub repository
  3. Fill metadata:
    • Title: "No Evidence for Epistemic Entropy Collapse in Small Open Language Models"
    • Authors: Bentley DeVilling (Course Correct Labs)
    • Description: See abstract from paper
    • License: Apache-2.0 (code), CC-BY-SA-4.0 (data/paper)
    • Keywords: language models, epistemic collapse, effective rank, interpretability
  4. Publish to mint DOI
  5. Update DOI badge in README.md:
    [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.XXXXXXX.svg)](https://doi.org/10.5281/zenodo.XXXXXXX)

Contact

Bentley DeVilling Course Correct Labs bentley@coursecorrectlabs.com https://coursecorrectlabs.com

For questions, issues, or collaboration inquiries, please open a GitHub issue or email directly.


Last updated: October 2025