String comparisons are complicated by the fact that Unicode has combining
characters: diacritics and other marks that attach to the preceding character,
appearing as one when printed.

For example, the word “café” may be composed in two ways, using four or
five code points, but the result looks exactly the same:

In [1]:
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'

s1, s2

('café', 'café')

Normalization Form C (NFC) composes the code points to produce the
shortest equivalent string, while NFD decomposes, expanding composed
characters into base characters and separate combining characters. Both of
these normalizations make comparisons work as expected, as the next
example shows.

In [2]:
len(s1), len(s2)


(4, 5)

In [4]:
s1 == s2

False

In [10]:
from unicodedata import normalize, name

In [7]:
len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [8]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)
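The cells above only compare lengths. As a quick check of the claim that either form makes the comparison work, a minimal sketch that compares the normalized strings directly (the two strings are rebuilt with explicit character names so the cell does not depend on how the source text was typed):

```python
from unicodedata import normalize

s1 = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'   # precomposed: 4 code points
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'           # decomposed: 5 code points

print(s1 == s2)                                       # False: different code points
print(normalize('NFC', s1) == normalize('NFC', s2))   # True: both composed
print(normalize('NFD', s1) == normalize('NFD', s2))   # True: both decomposed
```

What matters is that both operands go through the same form; comparing an NFC string with an NFD string would fail again.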

In [11]:
ohm = '\u2126'
name(ohm)

'OHM SIGN'

In [12]:
ohm_c = normalize('NFC', ohm)
name(ohm_c)

'GREEK CAPITAL LETTER OMEGA'

In [13]:
ohm == ohm_c

False

In [14]:
normalize('NFC', ohm) == normalize('NFC', ohm_c)

True

In [16]:
half = '\N{VULGAR FRACTION ONE HALF}'
print(half)

½


In [17]:
normalize('NFKC', half)

'1⁄2'
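NFKC and NFKD also replace compatibility characters, which can change the meaning of the text, so they suit search and indexing better than permanent storage. A small sketch of that difference; the `four_squared` string is an extra illustration that is not part of the cells above:

```python
from unicodedata import normalize

half = '\N{VULGAR FRACTION ONE HALF}'
four_squared = '4\N{SUPERSCRIPT TWO}'

# NFC leaves compatibility characters untouched.
print(normalize('NFC', half) == half)

# NFKC expands the fraction to digit one, FRACTION SLASH (U+2044), digit two.
print([hex(ord(c)) for c in normalize('NFKC', half)])

# NFKC turns the superscript into a plain digit, so "4 squared" reads as 42.
print(normalize('NFKC', four_squared))
```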

In [18]:
micro = '\N{MICRO SIGN}'
name(micro)

'MICRO SIGN'

In [20]:
micro_cf = micro.casefold()
name(micro_cf)

'GREEK SMALL LETTER MU'

Example 4-13. normeq.py: normalized Unicode string comparison

In [21]:
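def nfc_equal(str1, str2):
    """Compare two strings after NFC normalization (case-sensitive)."""
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    """Compare two strings after NFC normalization and case folding."""
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())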
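
A short usage sketch for the two helpers. The functions are repeated so the cell runs on its own, and the sample strings `s3` and `s4` are chosen here only for illustration:

```python
from unicodedata import normalize

def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())

s1 = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
s3 = 'Stra\N{LATIN SMALL LETTER SHARP S}e'   # German "Strasse" spelled with sharp s
s4 = 'strasse'

print(nfc_equal(s1, s2))     # True: same text, different composition
print(nfc_equal('A', 'a'))   # False: NFC does not touch case
print(nfc_equal(s3, s4))     # False: sharp s is not 'ss' under NFC
print(fold_equal(s3, s4))    # True: casefold() maps sharp s to 'ss'
```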
            


Example 4-14. simplify.py: Function to remove all combining marks.

In [22]:
import unicodedata
import string
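def shave_marks(txt):
    """Remove all diacritic marks."""
    # Decompose every character into its base character and combining marks.
    norm_txt = unicodedata.normalize('NFD', txt)
    # Keep only the characters that are not combining marks.
    shaved = ''.join(c for c in norm_txt
                     if not unicodedata.combining(c))
    # Recompose whatever can still be composed.
    return unicodedata.normalize('NFC', shaved)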


Example 4-15 shows a couple of uses of shave_marks.

Example 4-15. Two examples using shave_marks from Example 4-14

In [23]:
order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'

In [24]:
shave_marks(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [25]:
Greek = 'Ζέφυρος, Zéfiro'

In [26]:
shave_marks(Greek)

'Ζεφυρος, Zefiro'

Example 4-16. Function to remove combining marks from Latin characters
(import statements are omitted as this is part of the simplify.py module from
Example 4-14)

In [27]:
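def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters."""
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    preserve = []
    for c in norm_txt:
        # Skip combining marks that follow a Latin base character.
        if unicodedata.combining(c) and latin_base:
            continue
        preserve.append(c)
        # A non-combining character is a new base character.
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    # Join once, after the loop, so an empty input is handled too.
    shaved = ''.join(preserve)
    return unicodedata.normalize('NFC', shaved)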
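
To see how this differs from shave_marks, a small sketch that runs the same mixed Greek and Latin string through it. The function is repeated in compact form so the cell runs on its own, and the name `mixed` is used only here:

```python
import string
import unicodedata

def shave_marks_latin(txt):
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    preserve = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:
            continue  # drop marks attached to a Latin base character
        preserve.append(c)
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    return unicodedata.normalize('NFC', ''.join(preserve))

mixed = 'Ζέφυρος, Zéfiro'
result = shave_marks_latin(mixed)

# The Greek accent survives; only the accent on the Latin "e" is removed.
print(result)
print(shave_marks_latin(''))  # empty input yields an empty string
```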


Example 4-17. Transform some Western typographical symbols into ASCII
(this snippet is also part of simplify.py from Example 4-14)

In [None]:
single_map = str.maketrans("""‚ƒ„ˆ‹‘’“”•–—˜›""",
                           """'f"^<''""---~>""")

In [37]:
multi_map = str.maketrans({
    '€': 'EUR',
    '…': '...',
    'Æ': 'AE',
    'æ': 'ae',
    'Œ': 'OE',
    'œ': 'oe',
    '™': '(TM)',
    '‰': '<per mille>',
    '†': '**',
    '‡': '***',
})
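
One way such tables could be applied is to merge them into a single mapping and pass it to str.translate. The sketch below uses reduced, hypothetical tables and a hypothetical helper name (`to_plain_symbols`), so it is an illustration rather than the simplify.py code itself:

```python
# Reduced, hypothetical tables: a few char-to-char and char-to-string entries.
single_map = str.maketrans('\u2018\u2019\u201c\u201d\u2022\u2013\u2014',
                           '\'\'""---')
multi_map = str.maketrans({
    '\u20ac': 'EUR',    # euro sign
    '\u2026': '...',    # horizontal ellipsis
    '\u2122': '(TM)',   # trade mark sign
})

# Both tables are plain dicts keyed by code point, so they can be merged.
combined = dict(multi_map)
combined.update(single_map)

def to_plain_symbols(txt):
    """Hypothetical helper: replace the mapped symbols with ASCII text."""
    return txt.translate(combined)

sample = '\u201c\u0152tker\u2122\u201d \u2013 5\u20ac\u2026'
print(to_plain_symbols(sample))
```

Characters without an entry in the merged table, such as the Œ in the sample, pass through unchanged.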