## Unicode — Advanced Exercises (No Solutions)

These problems deepen your understanding of Unicode in Python. You may use only the standard library (e.g. `unicodedata`, `codecs`, `locale`, `re`, `sys`).

**Tip:** Prefer `str.casefold()` to `lower()` for Unicode-aware case-insensitive logic, and use `unicodedata.normalize()` for normalization.
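
To see why the tip matters, the sketch below compares the two methods on German `ß`, which `lower()` leaves untouched while `casefold()` expands it to `ss`. The sample words are illustrative and not part of the exercises.

```python
# lower() only maps characters to lowercase; casefold() applies full case folding.
word_a = 'Stra\u00DFe'   # 'Straße'
word_b = 'STRASSE'

lower_equal = word_a.lower() == word_b.lower()           # 'straße' vs 'strasse'
casefold_equal = word_a.casefold() == word_b.casefold()  # 'strasse' vs 'strasse'

print(lower_equal, casefold_equal)  # False True
```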


### Exercise 1 — Show-Code-Points Utility
Write a function `show_codepoints(s)` that returns a list of tuples `(char, codepoint_hex, name, category)` for each character in `s`.

- `codepoint_hex` should look like `U+1F600` (uppercase hex, zero-padded to at least 4 digits).
- Use `unicodedata.name(c, '<no name>')` and `unicodedata.category(c)`.

Example:
```python
show_codepoints('Aα😀')
# [('A', 'U+0041', 'LATIN CAPITAL LETTER A', 'Lu'), ...]
```

In [1]:
import unicodedata as _ud

def show_codepoints(s: str):
    """Return [(char, 'U+XXXX', NAME, CATEGORY), ...] for s."""
    result = []
    for ch in s:
        cp = f"U+{ord(ch):04X}"  # pad to at least 4 uppercase hex digits
        name = _ud.name(ch, '<no name>')
        cat = _ud.category(ch)
        result.append((ch, cp, name, cat))
    return result

# TODO: test
show_codepoints('Aα😀')

[('A', 'U+0041', 'LATIN CAPITAL LETTER A', 'Lu'),
 ('α', 'U+03B1', 'GREEK SMALL LETTER ALPHA', 'Ll'),
 ('😀', 'U+1F600', 'GRINNING FACE', 'So')]
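
As a small supplementary check (not part of the exercise), the sketch below shows why the `'<no name>'` default is needed: control characters such as `'\n'` have no name in `unicodedata`. It also shows that a single minimum-width format yields the `U+1F600` style from the exercise statement for code points above U+FFFF.

```python
import unicodedata

# Control characters have no Unicode name, so the supplied default is returned.
newline_name = unicodedata.name('\n', '<no name>')
newline_cat = unicodedata.category('\n')

# ':04X' pads short values to 4 digits and leaves longer ones as they are.
labels = [f"U+{ord(ch):04X}" for ch in 'A\U0001F600']

print(newline_name, newline_cat, labels)
# <no name> Cc ['U+0041', 'U+1F600']
```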

### Exercise 2 — Canonical Equivalence Check (NFC vs NFD)
Implement `canonically_equal(a, b)` that returns `True` if strings `a` and `b` are canonically equivalent (equal after NFC normalization). Demonstrate on `'é'` (precomposed) and `'e\u0301'` (combining).

In [2]:
def canonically_equal(a: str, b: str) -> bool:
    # Canonically equivalent strings have identical NFC forms.
    return _ud.normalize('NFC', a) == _ud.normalize('NFC', b)

# TODO: demo
s1 = 'é'
s2 = 'e\u0301'
canonically_equal(s1, s2)

True
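
The following sketch makes the difference between the two forms visible by counting code points. It also suggests that comparing NFD forms gives the same answer as comparing NFC forms for this pair; the variable names are illustrative.

```python
import unicodedata

precomposed = '\u00E9'   # 'é' as a single code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

raw_equal = precomposed == decomposed                      # plain == compares code points
nfc_len = len(unicodedata.normalize('NFC', decomposed))    # composed form
nfd_len = len(unicodedata.normalize('NFD', precomposed))   # decomposed form
nfd_equal = (unicodedata.normalize('NFD', precomposed)
             == unicodedata.normalize('NFD', decomposed))

print(raw_equal, nfc_len, nfd_len, nfd_equal)  # False 1 2 True
```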

### Exercise 3 — Strip Diacritics (Accent Folding)
Write `strip_diacritics(s)` that removes combining marks while keeping base characters.

Hints:
- Normalize to NFD, drop code points with `unicodedata.combining(c) != 0`, then re-normalize to NFC.

Example: `strip_diacritics('Café déjà vu') -> 'Cafe deja vu'`

In [3]:
def strip_diacritics(s: str) -> str:
    nfd = _ud.normalize('NFD', s)
    filtered = ''.join(ch for ch in nfd if _ud.combining(ch) == 0)
    return _ud.normalize('NFC', filtered)

# TODO: demo
strip_diacritics('Café déjà vu — naïve façade')

'Cafe deja vu — naive facade'
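
One limitation worth checking: the NFD recipe only removes marks that exist as separate combining code points after decomposition. Letters such as `Ł` or `ø` have no canonical decomposition, so they pass through unchanged. The sketch below re-implements the recipe under the hypothetical name `strip_marks` so that it runs on its own.

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Same NFD -> drop combining marks -> NFC recipe as in the exercise."""
    nfd = unicodedata.normalize('NFD', s)
    kept = ''.join(c for c in nfd if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', kept)

stripped_city = strip_marks('\u0141\u00F3d\u017A')  # 'Łódź'
stripped_o = strip_marks('\u00F8')                  # 'ø'

print(stripped_city, stripped_o)  # Łodz ø
```

If such letters must also be folded to ASCII, an explicit mapping table (for example via `str.translate`) would be needed on top of this recipe.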

### Exercise 4 — Unicode-Aware Palindrome
Implement `is_unicode_palindrome(s)` that returns `True` for palindromes when compared in a **case-insensitive** and **diacritics-insensitive** manner.

Process:
1. Normalize (NFD), strip combining marks, normalize back (NFC).
2. Apply `casefold()`.
3. Remove all characters with category starting with `Z` (separators) and punctuation categories (`P*`).
4. Compare with reversed version.

Test with `'Åna'` and `'ana'`, `'réifier'` (a palindrome once the accent is stripped: r-e-i-f-i-e-r), and `'Noël, Léon!'` (a palindrome under this definition).

In [4]:
def _normalize_for_palindrome(s: str) -> str:
    s = _ud.normalize('NFD', s)
    s = ''.join(ch for ch in s if _ud.combining(ch) == 0)
    s = _ud.normalize('NFC', s).casefold()
    # remove separators (Z*) and punctuation (P*) via category prefixes
    return ''.join(ch for ch in s if not (_ud.category(ch).startswith('Z') or _ud.category(ch).startswith('P')))

def is_unicode_palindrome(s: str) -> bool:
    t = _normalize_for_palindrome(s)
    return t == t[::-1]

# TODO: tests
tests = [
    'Åna',
    'ana',
    'réifier',
    'Noël, Léon!',
]
{x: is_unicode_palindrome(x) for x in tests}

{'Åna': True, 'ana': True, 'réifier': True, 'Noël, Léon!': True}
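
To make the process easier to follow, the sketch below returns the intermediate strings for one input. The helper name `palindrome_steps` is hypothetical, and a non-palindrome is added as a negative check.

```python
import unicodedata

def palindrome_steps(s: str):
    """Return (marks stripped, casefolded, letters only) for inspection."""
    nfd = unicodedata.normalize('NFD', s)
    no_marks = ''.join(c for c in nfd if not unicodedata.combining(c))
    folded = unicodedata.normalize('NFC', no_marks).casefold()
    # Drop separators (Z*) and punctuation (P*) by major category letter.
    letters = ''.join(c for c in folded if unicodedata.category(c)[0] not in 'ZP')
    return no_marks, folded, letters

steps = palindrome_steps('No\u00EBl, L\u00E9on!')   # 'Noël, Léon!'
print(steps)  # ('Noel, Leon!', 'noel, leon!', 'noelleon')

negative = palindrome_steps('Unicode')[2]
print(negative == negative[::-1])  # False
```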

### Exercise 5 — Case-Insensitive, Diacritics-Insensitive Set
Implement `unique_words(text)` that returns unique words ignoring case and diacritics, preserving first-seen original spelling.

Steps:
- Tokenize by splitting on any non-letter (`unicodedata.category(ch)` not starting with `L`).
- For each token, compute a canonical key via `strip_diacritics(token).casefold()`.
- Return a dict `{canonical_key: first_original_form}` or a list of originals in first-seen order.

Test on: `"Café cafe CAFÉ caffè déjà Deja déja"`.

In [5]:
def _is_letter(ch: str) -> bool:
    return _ud.category(ch).startswith('L')

def unique_words(text: str):
    words = []
    buf = []
    for ch in text:
        if _is_letter(ch):
            buf.append(ch)
        else:
            if buf:
                words.append(''.join(buf))
                buf = []
    if buf:
        words.append(''.join(buf))

    seen = {}
    order = []
    for w in words:
        key = strip_diacritics(w).casefold()
        if key not in seen:
            seen[key] = w
            order.append(w)
    return order, seen

# TODO: demo
unique_words('Café cafe CAFÉ caffè déjà Deja déja')

(['Café', 'caffè', 'déjà'], {'cafe': 'Café', 'caffe': 'caffè', 'deja': 'déjà'})
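
A caveat about the tokenization step: combining marks have category `Mn`, not `L*`, so decomposed input is split in the middle of a word. The sketch below uses a hypothetical helper `letter_tokens` to show the effect and suggests normalizing to NFC before tokenizing as one possible remedy.

```python
import unicodedata
from itertools import groupby

def letter_tokens(text: str):
    """Hypothetical helper: maximal runs of category-L* characters."""
    def is_letter(ch: str) -> bool:
        return unicodedata.category(ch).startswith('L')
    return [''.join(run) for letter, run in groupby(text, key=is_letter) if letter]

decomposed = 'de\u0301ja vu'   # 'e' + COMBINING ACUTE ACCENT
raw_tokens = letter_tokens(decomposed)
nfc_tokens = letter_tokens(unicodedata.normalize('NFC', decomposed))

print(raw_tokens)  # ['de', 'ja', 'vu']
print(nfc_tokens)  # ['déja', 'vu']
```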

### Exercise 6 — Bytes Encodings Round-Trip
Write `roundtrip_encodings(s, encodings)` that encodes `s` with each encoding and decodes back. Return a dict mapping encoding to `(ok: bool, bytes_len: int or None)`.

Test on `s = 'Aα😀'` with encodings `['utf-8', 'utf-16-le', 'utf-32-be', 'latin-1']`.

Note: Some encodings (e.g. `latin-1`) cannot represent all characters; mark those as `ok=False` and `bytes_len=None`.

In [6]:
def roundtrip_encodings(s: str, encodings):
    out = {}
    for enc in encodings:
        try:
            b = s.encode(enc)
            t = b.decode(enc)
            out[enc] = (t == s, len(b))
        except UnicodeEncodeError:
            out[enc] = (False, None)
    return out

# TODO: demo
roundtrip_encodings('Aα😀', ['utf-8', 'utf-16-le', 'utf-32-be', 'latin-1'])

{'utf-8': (True, 7),
 'utf-16-le': (True, 8),
 'utf-32-be': (True, 12),
 'latin-1': (False, None)}
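
The byte counts above can be explained per character, and they depend on the exact codec name. As a sketch: the generic `'utf-16'` codec prepends a byte order mark when encoding, while `'utf-16-le'` does not, which accounts for a 2-byte difference.

```python
text = 'A\u03B1\U0001F600'   # 'Aα😀'

with_bom = text.encode('utf-16')        # BOM + code units in native byte order
without_bom = text.encode('utf-16-le')  # explicit byte order, no BOM

bom_overhead = len(with_bom) - len(without_bom)
utf8_sizes = [len(ch.encode('utf-8')) for ch in text]  # bytes per character

print(bom_overhead, utf8_sizes)  # 2 [1, 2, 4]
```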

### Exercise 7 — Flag Emoji → ISO Country Code
A flag emoji like 🇧🇬 is composed of two *Regional Indicator Symbols* (RIS) from the range U+1F1E6..U+1F1FF. Write `flag_to_iso(flag)`, which converts a flag emoji string of length 2 into its ISO 3166-1 alpha-2 code (`'BG'`, `'US'`, etc.).

Hints:
- The RIS for `'A'` is U+1F1E6, so `ord(ch) - 0x1F1E6` yields 0..25; add `ord('A')` to get the letter.
- Validate input length and range; raise `ValueError` on invalid input.

In [7]:
def flag_to_iso(flag: str) -> str:
    if len(flag) != 2:
        raise ValueError('Flag must be 2 code points (RIS) long')
    base = 0x1F1E6
    letters = []
    for ch in flag:
        cp = ord(ch)
        if not (0x1F1E6 <= cp <= 0x1F1FF):
            raise ValueError('Not a regional indicator symbol')
        letters.append(chr(ord('A') + (cp - base)))
    return ''.join(letters)

# TODO: demo (🇧🇬 -> BG, 🇺🇸 -> US)
flag_to_iso('🇧🇬'), flag_to_iso('🇺🇸')

('BG', 'US')
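
The same offset arithmetic works in the opposite direction. The sketch below is a hypothetical inverse, `iso_to_flag`, which is not part of the exercise; it only checks that the input is two ASCII letters, not that the code is actually assigned to a country.

```python
RIS_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def iso_to_flag(code: str) -> str:
    """Map a two-letter ASCII code to the corresponding pair of RIS characters."""
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError('Expected two ASCII letters')
    return ''.join(chr(RIS_BASE + ord(c) - ord('A')) for c in code.upper())

flag_bg = iso_to_flag('bg')
flag_us = iso_to_flag('US')
print(flag_bg, flag_us)
```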

### Exercise 8 — Escape Non-ASCII
Write `escape_non_ascii(s, by='hex')` that escapes all characters with `ord(ch) > 0x7F`.

Modes:
- `by='hex'`: use `\uXXXX` or `\UXXXXXXXX` (uppercase hex, zero-padded).
- `by='name'`: use `\N{UNICODE NAME}` where available; fallback to hex escape if name is missing.

Example: `escape_non_ascii('Café 😀')` -> `'Caf\u00E9 \U0001F600'` (hex mode) or `'Caf\N{LATIN SMALL LETTER E WITH ACUTE} \N{GRINNING FACE}'` (name mode). The escaped character is replaced entirely; no base letter `e` is kept.

In [8]:
def escape_non_ascii(s: str, by='hex') -> str:
    out = []
    for ch in s:
        cp = ord(ch)
        if cp <= 0x7F:
            out.append(ch)
            continue
        if by == 'name':
            name = _ud.name(ch, None)
            if name:
                out.append(f"\\N{{{name}}}")
                continue
            # fallthrough to hex if no name
        # hex
        if cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            out.append(f"\\U{cp:08X}")
    return ''.join(out)

# TODO: demo
escape_non_ascii('Café 😀', by='hex'), escape_non_ascii('Café 😀', by='name')

('Caf\\u00E9 \\U0001F600',
 'Caf\\N{LATIN SMALL LETTER E WITH ACUTE} \\N{GRINNING FACE}')
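
For comparison, the standard library's encoding error handlers produce similar escapes, although `backslashreplace` uses lowercase hex and `\x..` for code points below U+0100. The sketch below also decodes a string in the exercise's hex format back to the original, on the assumption that the input contained no literal backslashes.

```python
text = 'Caf\u00E9 \U0001F600'   # 'Café 😀'

# Built-in error handlers: similar output, slightly different formatting.
builtin_hex = text.encode('ascii', 'backslashreplace').decode('ascii')
builtin_name = text.encode('ascii', 'namereplace').decode('ascii')

# Round trip for the exercise's hex format via the unicode_escape codec.
escaped = 'Caf\\u00E9 \\U0001F600'
restored = escaped.encode('ascii').decode('unicode_escape')

print(builtin_hex)   # Caf\xe9 \U0001f600
print(builtin_name)  # Caf\N{LATIN SMALL LETTER E WITH ACUTE} \N{GRINNING FACE}
print(restored == text)  # True
```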

### Exercise 9 — Unicode-Aware Sorting Key
Implement `unicode_sort_key(s)` suitable for grouping user-facing strings:
1. Normalize to NFKD.
2. Strip diacritics (drop combining marks).
3. Casefold.
4. Normalize to NFC.

Then sort a list like `['Éclair', 'eclair', 'élan', 'Zebra', 'Ångström', 'angstrom']` using this key and show the result.

In [9]:
def unicode_sort_key(s: str) -> str:
    s = _ud.normalize('NFKD', s)
    s = ''.join(ch for ch in s if _ud.combining(ch) == 0)
    s = s.casefold()
    return _ud.normalize('NFC', s)

# TODO: demo
items = ['Éclair', 'eclair', 'élan', 'Zebra', 'Ångström', 'angstrom']
sorted(items, key=unicode_sort_key)

['Ångström', 'angstrom', 'Éclair', 'eclair', 'élan', 'Zebra']
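
Step 1 uses NFKD rather than NFD, which matters for compatibility characters. As a sketch with two illustrative samples: NFD leaves the `ﬁ` ligature and the circled digit `①` alone, while NFKD replaces them with plain `fi` and `1`.

```python
import unicodedata

samples = ['\uFB01', '\u2460']   # 'ﬁ' ligature, '①' circled digit one

nfd = [unicodedata.normalize('NFD', s) for s in samples]    # canonical only
nfkd = [unicodedata.normalize('NFKD', s) for s in samples]  # plus compatibility mappings

print(nfd == samples)  # True
print(nfkd)            # ['fi', '1']
```

Whether this extra folding is desirable depends on the application, since it discards distinctions that NFD keeps.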

### Exercise 10 — Confusables Detection (Basic)
Implement `are_confusable(a, b)` that returns `True` if `a` and `b` become equal after applying NFKC normalization **and** casefolding.

Examples to try:
- `'\u212A'` (Kelvin sign, U+212A) vs `'K'` (Latin capital letter K, U+004B).
- Full-width forms like `'ＡＢＣ'` vs `'ABC'`.

> **Note**: This is a simplistic approach; real confusable detection uses Unicode confusables data.

In [10]:
def fold_nfkc(s: str) -> str:
    return _ud.normalize('NFKC', s).casefold()

def are_confusable(a: str, b: str) -> bool:
    return fold_nfkc(a) == fold_nfkc(b)

# TODO: demos
are_confusable('\u212A', 'K'), are_confusable('ＡＢＣ', 'ABC')

(True, True)
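
The note about the simplistic approach can be illustrated with a cross-script lookalike. In the sketch below, Cyrillic `а` (U+0430) looks like Latin `a` in many fonts, but NFKC plus casefolding does not map one to the other, so the check reports no match. The helper name `fold` and the sample word are illustrative.

```python
import unicodedata

def fold(s: str) -> str:
    """NFKC normalization followed by casefolding, as in the exercise."""
    return unicodedata.normalize('NFKC', s).casefold()

latin = 'paypal'
mixed = 'p\u0430yp\u0430l'   # both 'a' replaced by CYRILLIC SMALL LETTER A

cross_script_match = fold(latin) == fold(mixed)
kelvin_match = fold('\u212A') == fold('K')   # KELVIN SIGN vs Latin K

print(cross_script_match, kelvin_match)  # False True
```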

### Exercise 11 — Grapheme Cluster Approximation
Implement `split_graphemes_basic(s)` that *approximately* splits a string into grapheme-like clusters by grouping each base character with subsequent combining marks.

Limitations: This is not full Unicode grapheme segmentation (no ZWJ, emoji sequences, Indic clusters, etc.).

Test on: `'naïve café e\u0301 🇺🇸'` (note: flag will be split into two RIS characters here).

In [11]:
def split_graphemes_basic(s: str):
    clusters = []
    buf = ''
    for ch in s:
        if not buf:
            buf = ch
            continue
        if _ud.combining(ch):
            buf += ch
        else:
            clusters.append(buf)
            buf = ch
    if buf:
        clusters.append(buf)
    return clusters

# TODO: demo
split_graphemes_basic('naïve café e\u0301 🇺🇸')

['n', 'a', 'ï', 'v', 'e', ' ', 'c', 'a', 'f', 'é', ' ', 'é', ' ', '🇺', '🇸']
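
The ZWJ limitation can be seen on a family emoji built from three people joined by U+200D. The sketch below uses a compact re-implementation under the hypothetical name `split_basic`; a full grapheme segmenter would report one cluster here, while this approach yields five.

```python
import unicodedata

def split_basic(s: str):
    """Group each character with the combining marks that follow it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

family = '\U0001F468\u200D\U0001F469\u200D\U0001F467'   # man ZWJ woman ZWJ girl
pieces = split_basic(family)

print(len(pieces))                 # 5
print(split_basic('e\u0301x'))     # one cluster for 'e' + accent, then 'x'
```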

### Exercise 12 — Visible Width Estimator (Monospace Heuristic)
Implement `display_width(s)` that estimates width in a monospace terminal:
- Treat East Asian *wide* (`W`) and *full-width* (`F`) characters as width 2.
- Treat combining marks as width 0.
- Everything else as width 1.

Use `unicodedata.east_asian_width(ch)` and `unicodedata.combining(ch)`.

Test with mixed ASCII, CJK (`'漢字'`), and emoji.

In [12]:
def display_width(s: str) -> int:
    width = 0
    for ch in s:
        if _ud.combining(ch):
            continue
        eaw = _ud.east_asian_width(ch)
        if eaw in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width

# TODO: demos
display_width('Hello'), display_width('漢字'), display_width('Á'), display_width('😀')

(5, 4, 1, 2)
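
One gap in the heuristic: `unicodedata.combining()` returns 0 for some marks that still take no space, such as variation selectors (category `Mn`, combining class 0) and enclosing marks (category `Me`). The sketch below is a hypothetical variant, `display_width_v2`, that tests the general category instead and also skips format characters (`Cf`) such as the zero-width joiner. It remains an estimate; actual terminal rendering varies.

```python
import unicodedata

ZERO_WIDTH_CATEGORIES = ('Mn', 'Me', 'Cf')  # assumption: treat these as width 0

def display_width_v2(s: str) -> int:
    width = 0
    for ch in s:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return width

keycap = '1\uFE0F\u20E3'   # digit + VARIATION SELECTOR-16 + COMBINING ENCLOSING KEYCAP
print(display_width_v2(keycap))             # 1
print(display_width_v2('\u6F22\u5B57'))     # 4 for '漢字'
```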