# Entropy Estimation Demo

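This notebook trains unigram and n-gram character models and uses their cross-entropy on held-out text as an estimate of entropy (bits per character). The cells below run on an English corpus; the same steps should apply to Romanian text using `ROMANIAN_ALPHABET` and a Romanian corpus.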


In [None]:
# %load_ext autoreload
# %autoreload 2

import json
from pathlib import Path

import matplotlib.pyplot as plt

from reducelang.alphabet import ENGLISH_ALPHABET, ROMANIAN_ALPHABET
from reducelang.models import UnigramModel, NGramModel


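Shannon's entropy rate H measures the average information per symbol, in bits. A unigram model treats characters as independent, so H_1 = -Σ p(c) log₂ p(c), summed over every character c in the alphabet. An n-gram model conditions each character on a fixed number of preceding characters, so H_n reflects dependencies between neighboring characters and, for a well-estimated model, is no larger than H_1.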
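The unigram formula above can be checked by hand with a few lines of standard-library Python. This is a minimal sketch of the plug-in estimate computed from raw character counts; the helper name `unigram_entropy` is hypothetical and is not part of `reducelang`, whose `UnigramModel` may smooth or normalize differently.

```python
from collections import Counter
from math import log2


def unigram_entropy(text: str) -> float:
    """Plug-in estimate of H_1 (bits/char) from the character counts of `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    # H_1 = -sum over characters of p(c) * log2 p(c), with p(c) = count / total
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Two equally likely symbols versus four equally likely symbols.
print(f"{unigram_entropy('abab'):.4f} bits/char")
print(f"{unigram_entropy('abcd'):.4f} bits/char")
```

Note that this measures entropy on the same text the probabilities were estimated from, whereas the cells below evaluate on a held-out test split.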


In [None]:
corpus_file = Path("data/corpora/en/2025-10-01/processed/text8.txt")
text = corpus_file.read_text(encoding="utf-8")
print(f"Corpus size: {len(text)} chars")


In [None]:
split_idx = int(len(text) * 0.8)
train_text = text[:split_idx]
test_text = text[split_idx:]
print(f"Train: {len(train_text)}, Test: {len(test_text)}")


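Fit a unigram model on the training split and evaluate it on the test split. For comparison, the second line printed is the maximum possible entropy, log₂M, reached when all M characters of the alphabet are equally likely.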


In [None]:
unigram = UnigramModel(ENGLISH_ALPHABET)
unigram.fit(train_text)
H1 = unigram.evaluate(test_text)
print(f"Unigram H_1: {H1:.4f} bits/char")
print(f"Max entropy (uniform): {ENGLISH_ALPHABET.log2_size:.4f} bits/char")


Fit n-gram models of several orders on the same training split and record each model's cross-entropy on the test split.


In [None]:
orders = [2, 3, 5, 8]
results = {}
for order in orders:
    model = NGramModel(ENGLISH_ALPHABET, order=order)
    model.fit(train_text)
    H = model.evaluate(test_text)
    results[order] = H
    print(f"N-gram (order={order}): {H:.4f} bits/char")


In [None]:
plt.figure(figsize=(8, 5))
plt.plot(orders, [results[o] for o in orders], marker='o')
plt.axhline(H1, color='red', linestyle='--', label='Unigram')
plt.xlabel('N-gram order')
plt.ylabel('Cross-entropy (bits/char)')
plt.title('Entropy vs. N-gram Order')
plt.legend()
plt.grid()
plt.show()


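Redundancy R = 1 - H / log₂M, where M is the alphabet size and log₂M is the entropy of a uniform distribution over the alphabet. For English with M = 27 (26 letters plus space, the character set of text8), log₂M ≈ 4.755 bits/char; if H ≈ 1.5 bits/char, then R ≈ 1 - 1.5/4.755 ≈ 68%.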
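The arithmetic in this example can be reproduced without the library. The sketch below uses a hypothetical helper `redundancy` and takes the illustrative value H = 1.5 bits/char from the text; it is not a measured result.

```python
from math import log2


def redundancy(entropy_bits: float, alphabet_size: int) -> float:
    """Redundancy R = 1 - H / log2(M) for entropy H and alphabet size M."""
    return 1.0 - entropy_bits / log2(alphabet_size)


# Illustrative values from the text: M = 27, H = 1.5 bits/char.
print(f"log2(27) = {log2(27):.3f} bits/char")
print(f"R = {redundancy(1.5, 27):.1%}")
```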


In [None]:
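M = ENGLISH_ALPHABET.size
log2M = ENGLISH_ALPHABET.log2_size  # entropy of a uniform distribution over M symbols
print(f"Alphabet size M={M}, log2(M)={log2M:.4f} bits/char")
for order, H in results.items():
    R = 1 - H / log2M  # redundancy relative to the uniform upper bound
    print(f"Order {order}: H={H:.4f}, R={R:.2%}")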


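Higher-order n-grams condition on longer contexts, so they can capture more of the dependencies between characters. As long as the model still generalizes to the test split, this lowers the measured cross-entropy and raises the estimated redundancy. Because the expected cross-entropy of any model is at least the true entropy rate, the redundancy figures above are conservative estimates. This illustrates Shannon's insight that natural language is highly redundant because much of it is predictable from context.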
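To make the effect of context length concrete, here is a minimal standard-library sketch of a character model with add-k smoothing, evaluated by cross-entropy on separate test text. The function `ngram_cross_entropy`, its `context_len` parameter, and the choice of add-k smoothing are assumptions made for illustration; `NGramModel` in `reducelang` may use a different smoothing scheme and a different definition of `order`.

```python
from collections import Counter, defaultdict
from math import log2


def ngram_cross_entropy(train: str, test: str, alphabet: str,
                        context_len: int, k: float = 1.0) -> float:
    """Cross-entropy (bits/char) of an add-k smoothed character model on `test`.

    Each character is predicted from the `context_len` characters before it.
    Assumes every character of `train` and `test` is in `alphabet`.
    """
    # counts[context][char] = times `char` followed `context` in the training text
    counts = defaultdict(Counter)
    for i in range(context_len, len(train)):
        counts[train[i - context_len:i]][train[i]] += 1

    total_bits = 0.0
    n_predicted = 0
    for i in range(context_len, len(test)):
        ctx_counts = counts.get(test[i - context_len:i], Counter())
        ctx_total = sum(ctx_counts.values())
        # Add-k smoothing keeps unseen characters and contexts at nonzero probability.
        p = (ctx_counts[test[i]] + k) / (ctx_total + k * len(alphabet))
        total_bits -= log2(p)
        n_predicted += 1
    if n_predicted == 0:
        raise ValueError("test text is too short for this context length")
    return total_bits / n_predicted


# Toy data: a strictly alternating sequence is fully predictable from one character.
toy_train, toy_test = "ab" * 200, "ab" * 50
for context_len in (0, 1, 2):
    h = ngram_cross_entropy(toy_train, toy_test, "ab", context_len)
    print(f"context_len={context_len}: {h:.4f} bits/char")
```

With no context the toy model sees two equally frequent characters, while a single character of context makes the next one almost certain, so the cross-entropy drops sharply. On real text with long contexts, sparse counts can push the measured cross-entropy back up, which is why the choice of smoothing matters.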


In [None]:
output = {"model": "ngram", "orders": orders, "entropy": results}
with open("entropy_results.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2)
print("Results saved to entropy_results.json")
