# PPM Entropy Estimation Demo

This notebook demonstrates Prediction by Partial Matching (PPM) for entropy estimation, including escape methods, context blending, and codelength verification.


In [None]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
import json
import numpy as np
import matplotlib.pyplot as plt

from reducelang.models import PPMModel, UnigramModel, NGramModel
from reducelang.alphabet import ENGLISH_ALPHABET
from reducelang.coding import verify_codelength


PPM uses variable-length contexts to predict the next character. It blends probabilities from multiple context lengths using an escape mechanism. Longer contexts give sharper predictions when they were seen during training; when a symbol is novel in a long context, the model escapes to a shorter context for a fallback estimate.
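
The escape mechanism can be sketched in a few lines of plain Python. This is a toy illustration, not the `reducelang` implementation: the helper names `train_counts` and `ppm_prob` are hypothetical, the escape estimate c / (n + c) is one choice among several, and symbol exclusion is assumed so that the probabilities sum to one.

```python
from collections import Counter, defaultdict


def train_counts(text, max_order):
    """Count next-symbol occurrences for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(text):
        for k in range(max_order + 1):
            if i - k < 0:
                break
            counts[text[i - k:i]][sym] += 1
    return counts


def ppm_prob(counts, history, sym, max_order, alphabet):
    """P(sym | history): try the longest context first, escape to shorter ones."""
    escape_mass = 1.0   # probability of having escaped from all longer contexts
    excluded = set()    # symbols already ruled out by longer contexts
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        seen = {s: n for s, n in counts.get(ctx, {}).items() if s not in excluded}
        if not seen:
            continue  # nothing usable in this context: move to a shorter one
        n = sum(seen.values())  # total count in this context
        c = len(seen)           # distinct symbols in this context
        if sym in seen:
            return escape_mass * seen[sym] / (n + c)
        escape_mass *= c / (n + c)  # pay the escape and fall back
        excluded.update(seen)
    # Order -1 fallback: uniform over symbols not yet excluded.
    remaining = [s for s in alphabet if s not in excluded]
    return escape_mass / len(remaining)


toy_alphabet = "abcdrz"  # 'z' never occurs in the toy training text
toy_counts = train_counts("abracadabra", max_order=2)
toy_dist = {s: ppm_prob(toy_counts, "abra", s, 2, toy_alphabet) for s in toy_alphabet}
print(toy_dist)
```

After the history `"abra"`, the order-2 context `"ra"` has only ever been followed by `c`, so `c` receives half the mass and the other half escapes to the shorter contexts.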


In [None]:
# Load a sample corpus (adjust path as needed)
corpus_file = Path("data/corpora/en/2025-10-01/processed/text8.txt")
# Read only the first 100,000 characters instead of loading the whole file
with corpus_file.open(encoding="utf-8") as f:
    text = f.read(100_000)
split_idx = int(len(text) * 0.8)
train_text = text[:split_idx]
test_text = text[split_idx:]


Train PPM models with varying depths to see how entropy changes.


In [None]:
depths = [1, 2, 3, 5, 8]
ppm_results = {}
for depth in depths:
    model = PPMModel(ENGLISH_ALPHABET, depth=depth, escape_method="A")
    model.fit(train_text)
    H = model.evaluate(test_text)
    ppm_results[depth] = H
    print(f"PPM (depth={depth}): {H:.4f} bits/char")


In [None]:
unigram = UnigramModel(ENGLISH_ALPHABET)
unigram.fit(train_text)
H_unigram = unigram.evaluate(test_text)
print(f"Unigram: {H_unigram:.4f} bits/char")

ngram5 = NGramModel(ENGLISH_ALPHABET, order=5)
ngram5.fit(train_text)
H_ngram5 = ngram5.evaluate(test_text)
print(f"N-gram (order=5): {H_ngram5:.4f} bits/char")


In [None]:
plt.figure(figsize=(10, 6))
plt.plot(depths, [ppm_results[d] for d in depths], marker='o', label='PPM')
plt.axhline(H_unigram, color='red', linestyle='--', label='Unigram')
plt.axhline(H_ngram5, color='green', linestyle='--', label='N-gram (order=5)')
plt.xlabel('PPM Depth')
plt.ylabel('Cross-entropy (bits/char)')
plt.title('Entropy vs. PPM Depth')
plt.legend()
plt.grid()
plt.show()


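To verify that our entropy estimates are correct, we use an arithmetic coder to compute the actual codelength. Expressed in bits per character (bpc), the codelength should match the cross-entropy within a small tolerance (< 0.001 bpc), because an arithmetic coder adds only a small constant overhead to the whole message.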
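
The idea behind this check can be sketched without the library. The example below is an assumption-laden toy, not what `verify_codelength` does internally: it uses an add-one smoothed unigram model and the textbook bound that an idealized, exact-arithmetic coder needs at most ceil(-log2 P) + 1 bits for a message of probability P. Practical finite-precision coders add slightly more overhead.

```python
import math
from collections import Counter

demo_train = "the quick brown fox jumps over the lazy dog "
demo_test = "the lazy dog jumps "

demo_alphabet = sorted(set(demo_train))
demo_counts = Counter(demo_train)


def unigram_prob(sym):
    """Add-one smoothed unigram probability (toy model for illustration)."""
    return (demo_counts[sym] + 1) / (len(demo_train) + len(demo_alphabet))


# Ideal codelength: sum of -log2 p(x_i) over the test message.
ideal_bits = sum(-math.log2(unigram_prob(s)) for s in demo_test)
cross_entropy_bpc = ideal_bits / len(demo_test)

# Idealized coder output length under the textbook bound (assumption).
coder_bits = math.ceil(ideal_bits) + 1
codelength_bpc = coder_bits / len(demo_test)
delta_bpc = codelength_bpc - cross_entropy_bpc

print(f"Cross-entropy: {cross_entropy_bpc:.6f} bpc")
print(f"Codelength:    {codelength_bpc:.6f} bpc")
print(f"Delta:         {delta_bpc:.6f} bpc")
```

Under this bound the overhead is less than 2 bits for the whole message, so the per-character gap shrinks as 2 / N. With the 20,000-character test split used here, that is about 0.0001 bpc, comfortably inside the 0.001 bpc tolerance.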


In [None]:
ppm8 = PPMModel(ENGLISH_ALPHABET, depth=8, escape_method="A")
ppm8.fit(train_text)
verification = verify_codelength(test_text, ppm8, tolerance=1e-3)
print(f"Cross-entropy: {verification['cross_entropy_bpc']:.6f} bpc")
print(f"Codelength:    {verification['codelength_bpc']:.6f} bpc")
print(f"Delta:         {verification['delta_bpc']:.6f} bpc")
print(f"Matches:       {verification['matches']}")


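PPM supports different escape methods (A, B, C, D) that allocate escape probability differently. In the standard PPM literature, method A (the default here) is the simplest: escape probability = 1 / (n + 1), where n = total count in the context. The formula c / (n + c), where c = unique symbols seen, is the estimate usually attributed to method C.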
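
For comparison, the escape estimates as they are usually stated in the PPM literature can be evaluated on a single context. This is a sketch of the textbook formulas only; the function name `escape_probability` is hypothetical, and the labels used by `PPMModel` may follow a different convention.

```python
def escape_probability(method, n, c):
    """Textbook escape probability for a context with n total counts
    and c distinct symbols (assumed definitions, not the library's code)."""
    if method == "A":
        return 1 / (n + 1)      # one extra count reserved for escape
    if method == "B":
        return c / n            # each symbol's count is reduced by 1
    if method == "C":
        return c / (n + c)      # escape counted once per distinct symbol
    if method == "D":
        return c / (2 * n)      # each symbol contributes 1/2 to escape
    raise ValueError(f"unknown escape method: {method!r}")


# Example context: 10 observations spread over 4 distinct symbols.
escape_table = {m: escape_probability(m, n=10, c=4) for m in "ABCD"}
for m, p in escape_table.items():
    print(f"Method {m}: P(escape) = {p:.4f}")
```

Method A gives the smallest escape probability in this example, which suits contexts that have been observed often. Methods B, C, and D reserve more mass for novel symbols when a context has many distinct successors.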


In [None]:
escape_methods = ["A", "B", "C", "D"]
escape_results = {}
for method in escape_methods:
    try:
        model = PPMModel(ENGLISH_ALPHABET, depth=5, escape_method=method)
        model.fit(train_text)
        H = model.evaluate(test_text)
        escape_results[method] = H
        print(f"Escape method {method}: {H:.4f} bits/char")
    except NotImplementedError:
        print(f"Escape method {method}: Not implemented")
        escape_results[method] = None


Redundancy R = 1 - H / log₂M. For English with M=27, if PPM gives H≈1.3 bits/char, then R ≈ 1 - 1.3/4.755 ≈ 73%.
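
The arithmetic above can be checked directly. The value H = 1.3 bits/char is the illustrative figure from the text, not a measured result, and the variable names are chosen only for this example.

```python
import math

alphabet_size = 27       # M from the text
example_entropy = 1.3    # illustrative H in bits/char, not a measurement

max_entropy = math.log2(alphabet_size)          # log2(M), entropy of a uniform source
redundancy = 1 - example_entropy / max_entropy  # R = 1 - H / log2(M)

print(f"log2(M) = {max_entropy:.3f} bits, R = {redundancy:.1%}")
# log2(M) = 4.755 bits, R = 72.7%
```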


In [None]:
M = ENGLISH_ALPHABET.size
log2M = ENGLISH_ALPHABET.log2_size
for depth, H in ppm_results.items():
    R = 1.0 - (H / log2M)
    print(f"PPM (depth={depth}): H={H:.4f}, R={R:.2%}")


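PPM typically achieves lower cross-entropy than fixed-order n-grams because it falls back adaptively from long contexts to short ones. Deeper contexts capture more dependencies and move the estimate toward the true entropy rate of the source, although on a small training sample the gain can level off or reverse at large depths. Codelength verification checks that the reported entropy estimates are consistent with what an arithmetic coder actually achieves.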


In [None]:
output = {"model": "ppm", "depths": depths, "entropy": ppm_results, "verification": verification}
with open("ppm_results.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2)
print("Results saved to ppm_results.json")
