Author: Bentley DeVilling
Affiliation: Course Correct Labs
Contact: bentley@coursecorrectlabs.com
This repository contains the complete code and data to reproduce all figures and analyses from the paper:
"No Evidence for Epistemic Entropy Collapse in Small Open Language Models" DeVilling, B. (2025). Course Correct Labs.
Key Finding: We find no evidence that small open-source language models (microsoft/phi-2, mistralai/Mistral-7B-v0.1) exhibit "epistemic entropy collapse" — a hypothesized phenomenon where hidden state representations progressively lose diversity during text generation, leading to behavioral failure.
- Mean ECI: −0.001 (SD ≈ 0.025)
- Collapse rate: ~9.8% of sequences with ECI < −0.02
- Predictive utility: ROC-AUC ≈ 0.454 (95% CI [0.41, 0.50]) — near chance
- Effective rank trajectories: Flat across generation, no systematic decline
```bash
# Clone repository
git clone https://github.com/Course-Correct-Labs/entropy-collapse-null.git
cd entropy-collapse-null

# Set up environment
conda env create -f environment.yml
conda activate eec-null

# Reproduce all figures from paper
bash scripts/reproduce_all_figures.sh
```

Output: Three publication-quality figures (600 DPI) in `runs/affordable/figures/`:

- `fig1_eci_histograms.png`
- `fig2_effective_rank_trajectories.png`
- `fig3_failure_prediction_panel.png`
Using conda:

```bash
conda env create -f environment.yml
conda activate eec-null
```

Or using pip with a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Then verify the environment:

```bash
bash scripts/verify_environment.sh
```

Expected output:

```
✓ All required packages installed!
```
```
entropy-collapse-null/
├── src/                          # Python package
│   ├── __init__.py               # Package initialization
│   ├── constants.py              # Configuration and constants
│   ├── utils.py                  # Data loading and validation
│   ├── metrics_internal.py       # Effective rank, participation ratio
│   ├── metrics_external.py       # ΔI drift, n-gram novelty
│   ├── eci.py                    # Epistemic Collapse Index (ECI)
│   ├── bootstrap.py              # Bootstrap confidence intervals
│   ├── figures.py                # Figure generation
│   └── cli.py                    # Command-line interface
├── scripts/                      # Shell scripts
│   ├── reproduce_all_figures.sh  # One-command reproduction
│   ├── run_smoke.sh              # Fast smoke test (<5 min)
│   ├── lint_check.sh             # Code quality checks
│   └── verify_environment.sh     # Environment verification
├── runs/affordable/              # Data directory
│   ├── metrics_internal.csv      # Internal model metrics
│   ├── metrics_external.csv      # External behavioral metrics
│   └── manifest.json             # Run metadata
├── figures/                      # Output directory for figures
├── paper/                        # Paper and documentation
│   └── captions.md               # Figure captions
├── data/                         # Data documentation
│   └── README.md                 # Dataset description
├── .github/workflows/            # CI/CD
│   └── ci.yml                    # GitHub Actions workflow
├── environment.yml               # Conda environment
├── requirements.txt              # Pip dependencies
├── LICENSE                       # Apache 2.0 license
├── CITATION.cff                  # Citation metadata
└── README.md                     # This file
```
```bash
bash scripts/reproduce_all_figures.sh
```

Runtime: ~10-15 minutes on CPU

Output: `runs/affordable/figures/fig1_eci_histograms.png`, `fig2_effective_rank_trajectories.png`, `fig3_failure_prediction_panel.png`
```bash
bash scripts/run_smoke.sh
```

Runtime: <5 minutes

Purpose: Validates code correctness on a 5% subsample (n≈30)

Output: `runs/affordable/figures/*_smoke.png` files
```python
from pathlib import Path

from src.figures import generate_all_figures

# Generate all figures
generate_all_figures(
    run_dir=Path("runs/affordable"),
    output_dir=Path("runs/affordable/figures"),
    smoke=False,  # Set True for fast smoke test
    dpi=600,
)
```

```bash
# Full reproduction
python -m src.cli reproduce --in runs/affordable --out runs/affordable/figures --dpi 600

# Smoke test
python -m src.cli reproduce --in runs/affordable --out runs/affordable/figures --dpi 300 --smoke
```

Figure 1: Distribution of residualized Epistemic Collapse Index (ECI) values for microsoft/phi-2 vs. control. Both models show similar distributions centered near zero, with no evidence of systematic collapse.
Figure 2: Effective rank trajectories over token generation for "collapsed" (ECI < −0.02) vs. "normal" (ECI ≥ −0.02) sequences. Each line represents one sequence (n=50 sampled per group). Both groups show substantial within-group variability with no systematic decline across ~800 tokens of generation.
Figure 3: Predictive utility of ECI for identifying QA task failures. All metrics indicate near-chance performance (ROC-AUC ≈ 0.50), demonstrating that ECI does not reliably predict behavioral failure.
The `runs/affordable/` directory contains preprocessed metrics for n=346 sequences:

- `metrics_internal.csv`: Internal model metrics. Columns: `prompt_id`, `model_name`, `eci_raw`, `eci_residualized`, `effective_ranks`, `participation_ratios`, `variances`
- `metrics_external.csv`: External behavioral metrics. Columns: `prompt_id`, `model_name`, `qa_failure`, `delta_i_values`, `ngram_novelty_values`, `char_entropy_values`
- `manifest.json`: Run metadata (seed, configuration)
Schema enforcement: Exact column names and types are validated at load time. See src/utils.py for schema checks. Missing or renamed columns will raise clear errors.
See data/README.md for detailed column descriptions.
- microsoft/phi-2 (2.7B parameters): Primary model
- mistralai/Mistral-7B-v0.1 (7.2B parameters): Control
Preprocessed metrics (CSVs) are included in this repository. Raw data (hidden states, ~50GB) available upon request: bentley@coursecorrectlabs.com
ECI measures the rate of change in representational diversity over token generation:
ECI = slope(effective_rank ~ token_index)
- Effective rank: Exponential of Shannon entropy over singular value spectrum
- Negative ECI: Declining diversity (hypothesized "collapse")
- Threshold: ECI < -0.02 (adopted from prior literature)
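The two quantities can be sketched as follows (a minimal NumPy illustration of the definitions above, not the repository's exact implementation in `src/metrics_internal.py` and `src/eci.py`):

```python
import numpy as np

def effective_rank(hidden_states: np.ndarray) -> float:
    """Exponential of the Shannon entropy of the normalized singular-value spectrum."""
    s = np.linalg.svd(hidden_states, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # guard against log(0)
    return float(np.exp(-np.sum(p * np.log(p))))

def eci(effective_ranks: np.ndarray) -> float:
    """ECI = least-squares slope of effective rank against token index."""
    token_index = np.arange(len(effective_ranks))
    slope, _intercept = np.polyfit(token_index, effective_ranks, deg=1)
    return float(slope)
```

For a window whose singular values are all equal, the effective rank equals the matrix dimension; a steadily declining effective-rank trajectory yields a negative ECI.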
Internal (from hidden states):
- Effective rank (diversity of representations)
- Participation ratio (dimensionality)
- Variance (activation magnitude)
External (from generated text):
- ΔI drift (n-gram divergence)
- N-gram novelty (lexical diversity)
- Character entropy (randomness)
- QA failure (TruthfulQA correctness)
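Two of the external metrics can be illustrated with simplified definitions (sketches only; the repository's versions in `src/metrics_external.py` may differ in windowing and normalization):

```python
import math
from collections import Counter

def ngram_novelty(tokens: list[str], n: int = 3) -> float:
    """Fraction of n-grams that are distinct — a simple lexical-diversity measure."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

def char_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of the text."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Degenerate repetition drives n-gram novelty toward 0 and character entropy toward 0, which is why these serve as behavioral counterparts to the internal diversity metrics.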
1. Extract hidden states at each generation step
2. Compute internal metrics over sliding windows (128 tokens, stride 64)
3. Compute ECI as the slope of the effective-rank trajectory
4. Residualize against the control condition (Mistral-7B)
5. Evaluate predictive utility for QA failure (ROC-AUC, PR-AUC)
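The final evaluation step can be sketched with a rank-based (Mann-Whitney) ROC-AUC and a percentile bootstrap. The data below are synthetic stand-ins for illustration, not the paper's values, and the repository's own CI machinery lives in `src/bootstrap.py`:

```python
import numpy as np

def roc_auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Mann-Whitney formulation of ROC-AUC: P(score_pos > score_neg), ties counted 0.5."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

rng = np.random.default_rng(42)  # fixed seed, matching the repository's bootstrap convention

# Synthetic stand-ins for the real per-sequence values
eci_scores = rng.normal(-0.001, 0.025, size=346)
qa_failure = rng.integers(0, 2, size=346)

# More-negative ECI is hypothesized to predict failure, so score by the negated index
auc = roc_auc(qa_failure, -eci_scores)

# Percentile bootstrap CI over resampled sequences
boot = []
for _ in range(1000):
    idx = rng.integers(0, 346, size=346)
    if qa_failure[idx].min() == qa_failure[idx].max():
        continue  # skip degenerate resamples containing a single class
    boot.append(roc_auc(qa_failure[idx], -eci_scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
```

With labels independent of scores, the AUC lands near 0.5 and the CI straddles chance, which is the pattern the paper reports for ECI.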
This repository implements the complete analytical pipeline described in Section 6 (Reproducibility) of the paper. All code, data, and figures can be independently verified.
- ✅ Data integrity: n=346 sequences (200 Phi-2 + 146 Mistral-7B) match paper Table 1
- ✅ Statistical results: Mean ECI = −0.001 (SD = 0.025), ROC-AUC = 0.454 [0.41, 0.50]
- ✅ Figure generation: All three figures regenerate exactly as shown in paper
- ✅ Schema validation: CSV columns strictly enforced via `src/utils.py`
- ✅ Numerical stability: Participation ratio handles inf/nan values with logging
- ✅ Seed reproducibility: All random operations use fixed seeds (manifest: 13, bootstrap: 42)
To verify results independently:
```bash
# 1. Clone and set up
git clone https://github.com/Course-Correct-Labs/entropy-collapse-null.git
cd entropy-collapse-null
conda env create -f environment.yml
conda activate eec-null

# 2. Verify environment
bash scripts/verify_environment.sh

# 3. Run smoke test (5% sample, <5 min)
bash scripts/run_smoke.sh

# 4. Full reproduction (all 346 sequences, ~10-15 min)
bash scripts/reproduce_all_figures.sh

# 5. Check output matches paper
ls -lh runs/affordable/figures/
# Expected: fig1_eci_histograms.png, fig2_effective_rank_trajectories.png, fig3_failure_prediction_panel.png
```

Tested on:
- macOS 14.5 (Apple Silicon M1/M2, Python 3.11)
- Ubuntu 22.04 (x86_64, Python 3.11)
- GitHub Actions CI (ubuntu-latest, smoke test only)
- CPU-only: All analyses run on standard CPU (no GPU required)
- Memory: ~4GB RAM for full reproduction, ~1GB for smoke test
- Storage: ~50MB for repository + data
- Runtime: 10-15 minutes (full), <5 minutes (smoke)
```bash
# Run linter
bash scripts/lint_check.sh

# Format code
ruff format src/

# Type checking (optional)
mypy src/

# Smoke test
bash scripts/run_smoke.sh

# Full reproduction
bash scripts/reproduce_all_figures.sh
```

GitHub Actions runs on every push:
- Linting (ruff + black)
- Smoke-only CI (<5 min with 5% subsample)
Full reproduction (~10-15 min) should be run locally. See .github/workflows/ci.yml
Solution: Ensure data files are in `runs/affordable/`:

```bash
ls runs/affordable/
# Should show: metrics_internal.csv, metrics_external.csv, manifest.json
```

Solution: Verify CSV headers match the expected format:

```bash
head -n1 runs/affordable/metrics_internal.csv
# Should include: prompt_id, model_name, eci_raw, eci_residualized, ...
```

Solution: Set a non-interactive backend:

```python
import matplotlib
matplotlib.use('Agg')
```

Or set an environment variable:

```bash
export MPLBACKEND=Agg
```

Cause: Internal and external CSVs have mismatched `prompt_id` or `model_name` values.

Solution: Check the merge keys:

```python
import pandas as pd

df_int = pd.read_csv('runs/affordable/metrics_internal.csv')
df_ext = pd.read_csv('runs/affordable/metrics_external.csv')
print(set(df_int['prompt_id']) - set(df_ext['prompt_id']))
```

If you use this code or data, please cite:
BibTeX:

```bibtex
@techreport{devilling2025entropy,
  title       = {No Evidence for Epistemic Entropy Collapse in Small Open Language Models},
  author      = {DeVilling, Bentley},
  year        = {2025},
  institution = {Course Correct Labs}
}
```

APA:
DeVilling, B. (2025). No Evidence for Epistemic Entropy Collapse in Small Open Language Models. Course Correct Labs.
Code: Apache License 2.0 (see LICENSE)
Data: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
Paper/Figures: CC BY-SA 4.0
When preparing a release:
- Run smoke test: `bash scripts/run_smoke.sh`
- Run full reproduction: `bash scripts/reproduce_all_figures.sh`
- Verify all figures generated correctly
- Run linter: `bash scripts/lint_check.sh`
- Check CI passes on GitHub
- Update version in `src/__init__.py` and `CITATION.cff`
- Tag release: `git tag v1.0.0 && git push --tags`
- Create GitHub release with figures attached
- Upload to Zenodo for DOI
- Update DOI badge in README.md
- Go to https://zenodo.org/deposit/new
- Upload release tarball or link GitHub repository
- Fill metadata:
- Title: "No Evidence for Epistemic Entropy Collapse in Small Open Language Models"
- Authors: Bentley DeVilling (Course Correct Labs)
- Description: See abstract from paper
- License: Apache-2.0 (code), CC-BY-SA-4.0 (data/paper)
- Keywords: language models, epistemic collapse, effective rank, interpretability
- Publish to mint DOI
- Update DOI badge in README.md:
[](https://doi.org/10.5281/zenodo.XXXXXXX)
Bentley DeVilling
Course Correct Labs
bentley@coursecorrectlabs.com
https://coursecorrectlabs.com
For questions, issues, or collaboration inquiries, please open a GitHub issue or email directly.
Last updated: October 2025


