## Unicode — Medium Exercises with Solutions

These exercises sit between the basics and the advanced set. They use only the standard library: `unicodedata` and `re`.

Run cells top-to-bottom. Each exercise has a short solution you can tweak.

### Exercise 1 — Quick Code Point Table
Write `codepoint_table(s)` that returns a list of rows `(char, dec, hex_u, name)`.
- `dec`: decimal code point (int)
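- `hex_u`: like `U+0041` or `U+01F600` (uppercase; zero-padded to 4 hex digits inside the BMP and to 6 above it; note that conventional notation drops the leading zero and writes `U+1F600`)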
- `name`: Unicode name or `'<no name>'`

In [1]:
import unicodedata as ud

def _uhex(cp: int) -> str:
    return f"U+{cp:04X}" if cp <= 0xFFFF else f"U+{cp:06X}"

def codepoint_table(s: str):
    rows = []
    for ch in s:
        cp = ord(ch)
        rows.append((ch, cp, _uhex(cp), ud.name(ch, '<no name>')))
    return rows

# Demo
entry = "Aα😀"
for row in codepoint_table(entry):
    print(row)

('A', 65, 'U+0041', 'LATIN CAPITAL LETTER A')
('α', 945, 'U+03B1', 'GREEK SMALL LETTER ALPHA')
('😀', 128512, 'U+01F600', 'GRINNING FACE')
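
A small sketch of why the `'<no name>'` default is there: some code points, such as control characters, have no name in the database, and `ud.name` raises `ValueError` for them unless a default is passed. The variable names below are only for illustration.

```python
import unicodedata as ud

# Newline is a control character and has no Unicode name.
try:
    ud.name('\n')
    raised = False
except ValueError:
    raised = True

# With a default, the call returns the fallback instead of raising.
fallback = ud.name('\n', '<no name>')

# ud.lookup goes the other way: from a name back to the character.
looked_up = ud.lookup('GRINNING FACE')

print(raised, fallback, looked_up)
```

This is also why `codepoint_table` can be run on arbitrary text, including strings with tabs or newlines, without a `try` block.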


### Exercise 2 — NFC Equality Checker
Create `equal_nfc(a, b)` that returns `True` if strings are equal after NFC normalization.
Test on `'é'` vs `'e\u0301'`, and `'Å'` vs `'A\u030A'`.

In [2]:
def equal_nfc(a: str, b: str) -> bool:
    return ud.normalize('NFC', a) == ud.normalize('NFC', b)

# Demo
pairs = [('é', 'e\u0301'), ('Å', 'A\u030A'), ('hello', 'hello')]
for a, b in pairs:
    print(a, b, '->', equal_nfc(a, b))

é é -> True
Å Å -> True
hello hello -> True
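
To see why normalization is needed at all, it can help to compare the raw strings first. The sketch below writes both spellings of `é` with explicit escapes so that the editor cannot silently change them; the variable names are just for this illustration.

```python
import unicodedata as ud

composed = '\u00e9'      # é as a single precomposed code point
decomposed = 'e\u0301'   # e followed by COMBINING ACUTE ACCENT

# The two strings render identically but differ code point by code point.
raw_equal = composed == decomposed
lengths = (len(composed), len(decomposed))

# Normalizing both sides to the same form makes them comparable.
nfc_equal = ud.normalize('NFC', composed) == ud.normalize('NFC', decomposed)
nfd_equal = ud.normalize('NFD', composed) == ud.normalize('NFD', decomposed)

print(raw_equal, lengths, nfc_equal, nfd_equal)
```

Either form works for equality checks as long as both sides use the same one; NFC is used here because it is usually the shorter representation.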


### Exercise 3 — Strip Diacritics (Accent Folding)
Implement `strip_accents(s)` that removes combining marks but keeps base letters.
Apply it to a sentence with accented letters (e.g., `Café déjà-vu — naïve façade`).

In [3]:
def strip_accents(s: str) -> str:
    nfd = ud.normalize('NFD', s)
    no_marks = ''.join(ch for ch in nfd if ud.combining(ch) == 0)
    return ud.normalize('NFC', no_marks)

# Demo
text = 'Café déjà-vu — naïve façade'
print(strip_accents(text))

Cafe deja-vu — naive facade
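
A rough sketch of where this approach stops working: letters with a stroke (`Ł`, `ø`) and ligatures (`ﬁ`) have no canonical decomposition into base letter plus combining mark, so NFD leaves them untouched. The function is repeated here only so the cell runs on its own.

```python
import unicodedata as ud

def strip_accents(s: str) -> str:
    # Same approach as the solution: decompose, drop combining marks, recompose.
    nfd = ud.normalize('NFD', s)
    return ud.normalize('NFC', ''.join(ch for ch in nfd if ud.combining(ch) == 0))

city = strip_accents('Łódź')          # ó and ź lose their accents, Ł stays
dish = strip_accents('smørrebrød')    # ø is a separate letter, not o + mark
lig = strip_accents('\ufb01n')        # U+FB01 LATIN SMALL LIGATURE FI + n

# The ligature only splits under a compatibility form such as NFKD.
lig_nfkd = ud.normalize('NFKD', '\ufb01n')

print(city, dish, lig, lig_nfkd)
```

If such characters must be folded too, an explicit mapping table (or NFKD for ligatures) would be needed on top of mark stripping.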


### Exercise 4 — Accent-Insensitive Contains
Write `contains_word(text, needle)` that returns `True` if `needle` occurs in `text` ignoring accents and case.
- Hint: use `strip_accents(...)` and `casefold()`.
- Plain substring matching is enough; word boundaries are not required here.

In [4]:
def norm_fold(s: str) -> str:
    return strip_accents(s).casefold()

def contains_word(text: str, needle: str) -> bool:
    return norm_fold(needle) in norm_fold(text)

# Demo
print(contains_word('Café CAFÉ caffè', 'cafe'))      # True
print(contains_word('El Niño is here', 'nino'))        # True
print(contains_word('Résumé tips', 'resume'))          # True
print(contains_word('Résumé tips', 'sumé'))            # True ('sume' is a substring of 'resume tips' after folding)

True
True
True
True
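
The hint recommends `casefold()` rather than `lower()`. A minimal illustration of the difference, using German `ß` as the usual example:

```python
a, b = 'Straße', 'STRASSE'

# lower() keeps ß as it is, so the two spellings still differ.
lower_equal = a.lower() == b.lower()

# casefold() maps ß to 'ss', which is what caseless matching needs.
fold_equal = a.casefold() == b.casefold()

print(a.lower(), b.lower(), lower_equal)
print(a.casefold(), b.casefold(), fold_equal)
```

For plain ASCII text the two methods behave the same; the difference only shows up with characters like this one.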


### Exercise 5 — Safe Encoding Round-Trip
Write `roundtrip(s, encodings)` that, for each encoding, tries to encode `s` and decode the result.
Return a dict `{encoding: (ok, byte_length)}` where `ok` is `True` only if the decoded string equals the original. If `s` cannot be encoded, store `(False, None)`.
Test on `s = 'Aα😀'` and encodings `['utf-8', 'utf-16-le', 'latin-1']`.

In [5]:
def roundtrip(s: str, encodings):
    out = {}
    for enc in encodings:
        try:
            b = s.encode(enc)
            back = b.decode(enc)
            out[enc] = (back == s, len(b))
        except UnicodeEncodeError:
            out[enc] = (False, None)
    return out

# Demo
print(roundtrip('Aα😀', ['utf-8', 'utf-16-le', 'latin-1']))

{'utf-8': (True, 7), 'utf-16-le': (True, 8), 'latin-1': (False, None)}
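
The byte lengths 7 and 8 can be checked by hand. The helper below (`bytes_per_char` is a name made up for this sketch) encodes each character separately and reports its size.

```python
def bytes_per_char(s: str, encoding: str):
    # Encode one character at a time to see how many bytes each one needs.
    return [(ch, len(ch.encode(encoding))) for ch in s]

utf8 = bytes_per_char('Aα😀', 'utf-8')
utf16 = bytes_per_char('Aα😀', 'utf-16-le')

print(utf8, sum(n for _, n in utf8))     # 1 + 2 + 4 bytes
print(utf16, sum(n for _, n in utf16))   # 2 + 2 + 4 bytes (😀 is a surrogate pair)
```

`utf-16-le` is used instead of `utf-16` because the latter prepends a 2-byte BOM to every encoded string, which would inflate the per-character counts.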


### Exercise 6 — Uppercase-Only Filter
Write `only_upper_letters(s)` that returns a new string containing only characters whose Unicode category is `Lu` (Letter, uppercase).
Apply it to `'AaBbΣσßẞ İ i̇ № ™ ©'` and observe which characters survive.

In [6]:
def only_upper_letters(s: str) -> str:
    return ''.join(ch for ch in s if ud.category(ch) == 'Lu')

# Demo
sample = 'AaBbΣσßẞ İ i̇ № ™ ©'
print(sample)
print(only_upper_letters(sample))

AaBbΣσßẞ İ i̇ № ™ ©
ABΣẞİ
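
To understand which characters survive, one option is to print the category of each one. The sketch below uses escapes for the less common characters so their identity is unambiguous; the titlecase letter `ǅ` (U+01C5) is an extra example not in the original sample.

```python
import unicodedata as ud

chars = ['A', 'a', '\u00df', '\u1e9e', '\u0130', '\u0307',
         '\u2116', '\u2122', '\u00a9', '\u01c5']

# Map each character to its two-letter general category.
cats = {ch: ud.category(ch) for ch in chars}

for ch, cat in cats.items():
    print(f"U+{ord(ch):04X}", cat, ud.name(ch))
```

Only `A`, `ẞ` (U+1E9E) and `İ` (U+0130) are `Lu` here. The symbols `№`, `™` and `©` are `So`, the combining dot U+0307 is `Mn`, and `ǅ` is `Lt` (titlecase), so the filter drops all of them.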


### Exercise 7 — Basic Grapheme-Like Split (Combining Marks Only)
Write `split_graphemes_simple(s)` that groups a base character with the combining marks that follow it.
Test on `'e\u0301 a\u0302\u0301 café'`.
> Note: This is not full Unicode grapheme segmentation, but it works for simple combining-mark cases.

In [7]:
def split_graphemes_simple(s: str):
    clusters = []
    buf = ''
    for ch in s:
        if not buf:
            buf = ch
            continue
        if ud.combining(ch):
            buf += ch
        else:
            clusters.append(buf)
            buf = ch
    if buf:
        clusters.append(buf)
    return clusters

# Demo
print(split_graphemes_simple('e\u0301 a\u0302\u0301 café'))

['é', ' ', 'ấ', ' ', 'c', 'a', 'f', 'é']
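
A short sketch of the limits mentioned in the note. The function here is a compact restatement of the same idea (attach combining marks to the preceding character), written so the cell runs on its own. Flags and ZWJ emoji sequences contain no combining marks, so they fall apart.

```python
import unicodedata as ud

def split_graphemes_simple(s: str):
    # Attach each combining mark to the cluster before it; start a new cluster otherwise.
    clusters = []
    for ch in s:
        if clusters and ud.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

accent = split_graphemes_simple('e\u0301')                   # one cluster
flag = split_graphemes_simple('\U0001F1E7\U0001F1EC')        # Bulgarian flag
coder = split_graphemes_simple('\U0001F469\u200d\U0001F4BB') # woman + ZWJ + laptop

print(len(accent), len(flag), len(coder))
```

A user sees one symbol in each of the three inputs, but only the first comes back as a single cluster. Handling the other two requires the full segmentation rules, which the standard library does not implement.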


### Exercise 8 — Flag Emoji to ISO Code (RIS)
Implement `flag_to_iso(flag)` that converts a flag emoji made of two code points (two Regional Indicator Symbols) into its two-letter ISO 3166-1 country code (e.g., `'🇧🇬' -> 'BG'`).
Raise `ValueError` for invalid inputs.

In [8]:
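def flag_to_iso(flag: str) -> str:
    # A flag emoji is two Regional Indicator Symbols (U+1F1E6..U+1F1FF).
    if len(flag) != 2:
        raise ValueError('Flag must be exactly 2 code points long')
    first, last = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z
    letters = []
    for ch in flag:
        cp = ord(ch)
        if not (first <= cp <= last):
            raise ValueError('Code point is not a Regional Indicator Symbol')
        # The offset from indicator A maps directly onto ASCII 'A'..'Z'.
        letters.append(chr(ord('A') + (cp - first)))
    return ''.join(letters)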

# Demo
print(flag_to_iso('🇧🇬'))  # BG
print(flag_to_iso('🇺🇸'))  # US

BG
US
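
As a follow-up, the mapping can be reversed. `iso_to_flag` below is a hypothetical helper, not part of the exercise; it only builds the two Regional Indicator Symbols and does not check that the code belongs to a real country.

```python
def iso_to_flag(code: str) -> str:
    # Accept exactly two ASCII letters, in either case.
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError('Code must be exactly 2 ASCII letters')
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return ''.join(chr(base + ord(c) - ord('A')) for c in code.upper())

bg = iso_to_flag('bg')
us = iso_to_flag('US')
print(bg, us)
print([f"U+{ord(ch):X}" for ch in bg])
```

Whether the result is drawn as a flag depends on the font and platform; an unassigned pair such as `'ZZ'` still produces two valid code points but usually renders as two boxed letters.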
