# Alphabet Definitions for Redundancy Estimation

This notebook demonstrates the `Alphabet` class from `reducelang.alphabet`, showing how we define symbol sets for English and Romanian, compute log₂M, and handle text normalization.


In [1]:
# %load_ext autoreload
# %autoreload 2

from reducelang.alphabet import (
    Alphabet,
    ENGLISH_ALPHABET,
    ROMANIAN_ALPHABET,
    ENGLISH_NO_SPACE,
)
import math


SyntaxError: unexpected character after line continuation character (sensitivity.py, line 140)

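Shannon's redundancy formula is R = 1 - H / log₂M, where H is the entropy rate in bits per character and M is the alphabet size, so log₂M is the maximum entropy a uniform source over that alphabet could reach. For English with 26 letters plus space, M = 27 and log₂27 ≈ 4.755 bits/char.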
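
As a quick numerical check of the formula, the following sketch computes log₂M for M = 27 and evaluates R for a placeholder entropy value. It uses only the standard library. The `redundancy` helper and the value `H_placeholder = 2.0` are illustrative assumptions: they are not part of `reducelang` and not a measured result.

```python
import math


def redundancy(entropy_rate: float, alphabet_size: int) -> float:
    """Shannon redundancy R = 1 - H / log2(M) (hypothetical helper)."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    return 1.0 - entropy_rate / math.log2(alphabet_size)


M = 27  # 26 letters + space
log2_m = math.log2(M)
print(f"log2(M) = {log2_m:.3f} bits/char")  # log2(M) = 4.755 bits/char

# Placeholder entropy rate, chosen only to exercise the formula.
H_placeholder = 2.0
print(f"R = {redundancy(H_placeholder, M):.3f}")
```

When H equals log₂M the redundancy is 0 (every symbol is equally likely and independent), and R approaches 1 as H approaches 0.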


In [None]:
print(f"English alphabet: {ENGLISH_ALPHABET.symbols}")
print(f"Size M = {ENGLISH_ALPHABET.size}")
print(f"log₂M = {ENGLISH_ALPHABET.log2_size:.3f} bits/char")


In [None]:
text = "Hello, World! 123"
normalized = ENGLISH_ALPHABET.normalize(text)
print(f"Original: {text}")
print(f"Normalized: {normalized}")
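
To make the idea of normalization concrete, here is a minimal sketch of one plausible policy: fold to lowercase, treat every character outside the alphabet as a word break, and collapse the resulting runs of spaces. This is an assumption for illustration only; `normalize_sketch` is a hypothetical function, and the actual behavior of `Alphabet.normalize` (case handling, whether unsupported characters are dropped or replaced) may differ.

```python
import string

# Assumed symbol sets for illustration: lowercase a-z, with and without space.
ENGLISH_SYMBOLS = string.ascii_lowercase + " "
ENGLISH_SYMBOLS_NO_SPACE = string.ascii_lowercase


def normalize_sketch(text: str, symbols: str) -> str:
    """Map text onto `symbols` (hypothetical policy, not the library's)."""
    allowed = set(symbols)
    # Lowercase, then turn every character outside the alphabet into a break.
    mapped = "".join(ch if ch in allowed else " " for ch in text.lower())
    words = mapped.split()  # splitting also collapses runs of breaks
    # Rejoin with a single space only if space is itself a symbol.
    separator = " " if " " in allowed else ""
    return separator.join(words)


print(normalize_sketch("Hello, World! 123", ENGLISH_SYMBOLS))           # hello world
print(normalize_sketch("Hello, World! 123", ENGLISH_SYMBOLS_NO_SPACE))  # helloworld
```

Whatever policy is used, the normalized text must contain only symbols from the alphabet; otherwise log₂M no longer bounds the entropy of the text being measured.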


Romanian uses 31 letters (the 26 letters A–Z plus Ă, Â, Î, Ș, Ț) plus space, giving M = 32. Because 32 is a power of two, log₂32 = 5 bits/char exactly.
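
The count can be verified independently of the library. The sketch below builds the 32-symbol set by hand and checks its size. The lowercase representation and the symbol order are assumptions made here for illustration; `ROMANIAN_ALPHABET.symbols` may store them differently.

```python
import math

BASE_LETTERS = "abcdefghijklmnopqrstuvwxyz"
# ă, â, î, ș, ț written as escapes so the code points are unambiguous.
# ș (U+0219) and ț (U+021B) are the comma-below forms; older texts often
# contain the cedilla look-alikes U+015F and U+0163 instead.
ROMANIAN_EXTRA = "\u0103\u00e2\u00ee\u0219\u021b"

symbols = BASE_LETTERS + ROMANIAN_EXTRA + " "
assert len(set(symbols)) == len(symbols)  # no duplicate symbols

M = len(symbols)
print(M, math.log2(M))  # 32 5.0
```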


In [None]:
print(f"Romanian alphabet: {ROMANIAN_ALPHABET.symbols}")
print(f"Size M = {ROMANIAN_ALPHABET.size}")
print(f"log₂M = {ROMANIAN_ALPHABET.log2_size:.3f} bits/char")

text_ro = "Bună ziua, România!"
normalized_ro = ROMANIAN_ALPHABET.normalize(text_ro)
print(f"Original RO: {text_ro}")
print(f"Normalized RO: {normalized_ro}")


In [None]:
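# Derive a variant without the space symbol. With the 26 letters alone,
# M should be 26 and log₂26 ≈ 4.700 bits/char.
english_no_space = ENGLISH_ALPHABET.variant(include_space=False)
print(f"English without space: M = {english_no_space.size}, log₂M = {english_no_space.log2_size:.3f}")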


These alphabet definitions form the foundation for all entropy calculations. In subsequent notebooks, we'll estimate H (entropy rate) and compute redundancy R.


In [None]:
# Compare with a tolerance: exact float equality depends on how log2_size is computed.
assert math.isclose(ROMANIAN_ALPHABET.log2_size, 5.0), "Romanian log₂M should be 5.0"
