<h1>Chapter 04. Unicode Text and Bytes</h1>

**Unicode:**
Unicode is a standard that assigns a unique code point (a number) to each character of the writing systems it covers. A code point identifies a character independently of how it is stored; an encoding such as UTF-8 or UTF-16 defines how code points are converted to bytes. Unicode covers letters, digits, symbols, and more, giving a single consistent way to represent text across languages, platforms, and applications, which makes it essential for multilingual software.
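The relationship between characters and code points can be sketched with the built-ins `ord()` and `chr()`. The sample string below is chosen only for illustration.

```python
# Map each character to its Unicode code point and back.
text = 'a\xe9\u20ac\U0001F638'  # 'a', 'é', '€', '😸' (illustrative sample)

for char in text:
    code = ord(char)              # character -> integer code point
    assert chr(code) == char      # integer code point -> character
    print(f"U+{code:04X}", char)  # conventional U+XXXX notation

code_points = [ord(char) for char in text]
```

Code points are plain integers; how many bytes each one occupies is decided only when the text is encoded.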

**Bytes:**
A byte is a unit of digital information, in practice a group of eight bits. In Python, binary data is represented as sequences of bytes, which can hold encoded text, images, or any other raw data. Working with bytes is essential for tasks such as file I/O, networking, and encoding or decoding text.

Encoding and decoding

In [1]:
s = 'café'

In [2]:
len(s)  # the string 'café' consists of 4 Unicode characters

4

In [3]:
b = s.encode('utf-8')  # convert str to bytes using UTF-8 encoding
b

b'caf\xc3\xa9'

In [4]:
len(b)

5

In [5]:
b.decode('utf-8')  # convert bytes back to str

'café'

`bytes` is an immutable sequence of 8-bit integers in Python, used for storing binary data or encoded text.

`bytearray` is the mutable counterpart to `bytes`, allowing in-place modification of its 8-bit integers.
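The difference in mutability can be illustrated with a short sketch. The variable names here are illustrative and independent of the cells that follow.

```python
data = bytes('café', encoding='utf-8')

# bytes is immutable: item assignment raises TypeError
try:
    data[0] = 67
except TypeError as exc:
    error = repr(exc)

# bytearray supports in-place changes
buf = bytearray(data)
buf[0] = ord('C')      # replace the first byte (b'c') with b'C'
buf.extend(b'!')       # append one more byte
result = bytes(buf)    # freeze the buffer back into immutable bytes
```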

In [6]:
cafe = bytes('café', encoding='utf-8')
cafe

b'caf\xc3\xa9'

In [7]:
cafe[0]  # every item is an integer within range(256)

99

In [8]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'caf\xc3\xa9')

In [9]:
cafe_arr[-1:]

bytearray(b'\xa9')

Initializing `bytes` from the raw data of an array

In [10]:
import array


numbers = array.array('h', [-2, -1, 0, 1, 2])  # typecode 'h': signed short integers (16-bit)
numbers

array('h', [-2, -1, 0, 1, 2])

In [11]:
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'
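The bytes above are the array's raw machine representation, two bytes per item, so their order depends on the platform's native byte order. A minimal sketch of the round trip, assuming the bytes are read back on the same machine with the same typecode:

```python
import array
import sys

numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)        # raw copy of the array's memory

print(sys.byteorder)           # native byte order of this machine

# Rebuild an array from the raw bytes (same typecode, same machine).
restored = array.array('h')
restored.frombytes(octets)
```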

<h2>Basic encoders and decoders</h2>

The string 'El Niño' encoded with four codecs gives very different byte sequences

In [12]:
coders = ['latin-1','cp437', 'utf-8', 'utf-16']  # list of several common encodings
s = 'El Niño'

for codec in coders:
    print(f"{codec}: {s.encode(codec)}")

latin-1: b'El Ni\xf1o'
cp437: b'El Ni\xa4o'
utf-8: b'El Ni\xc3\xb1o'
utf-16: b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
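The two extra bytes at the start of the UTF-16 output are a byte order mark (BOM). As a small sketch, the explicit `utf-16le` and `utf-16be` codecs produce the same text without a BOM:

```python
s = 'El Niño'

with_bom = s.encode('utf-16')    # starts with a byte order mark
little = s.encode('utf-16le')    # explicit little-endian, no BOM
big = s.encode('utf-16be')       # explicit big-endian, no BOM

# Decoding with 'utf-16' consumes the BOM and picks the byte order from it.
round_trip = with_bom.decode('utf-16')
```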


<h2>Encoding and Decoding Problems</h2>

<h3><code>UnicodeEncodeError</code> handling</h3>

Encoding text to bytes: successful completion and error handling

In [13]:
city = 'São Paulo'

In [14]:
city.encode('utf-8')

b'S\xc3\xa3o Paulo'

In [15]:
city.encode('utf-16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [16]:
city.encode('iso8859-1')

b'S\xe3o Paulo'

In [17]:
try:
    city.encode('cp437')
except UnicodeEncodeError as e:
    print(repr(e))

UnicodeEncodeError('charmap', 'São Paulo', 1, 2, 'character maps to <undefined>')


Add the `errors=` argument to handle errors

In [18]:
city.encode('cp437', errors='ignore')  # skips characters that cannot be encoded

b'So Paulo'

In [19]:
city.encode('cp437', errors='replace')  # replaces unencodable characters with '?'

b'S?o Paulo'

In [20]:
city.encode('cp437', errors='xmlcharrefreplace')  # replaces unencodable characters with XML character references

b'S&#227;o Paulo'
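Two further standard error handlers keep the lost information in escaped form. The sketch below reuses the same city name to show `backslashreplace` and `namereplace`:

```python
city = 'São Paulo'

# Replace unencodable characters with a Python backslash escape.
escaped = city.encode('cp437', errors='backslashreplace')

# Replace unencodable characters with a \N{...} escape carrying the Unicode name.
named = city.encode('cp437', errors='namereplace')
```

Unlike `ignore` and `replace`, these handlers make it possible to see afterwards which character could not be encoded.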

<h3><code>UnicodeDecodeError</code> handling</h3>

Decoding bytes to text: successful completion and error handling

In [21]:
octets = b'Montr\xe9al'

In [22]:
octets.decode('cp1252')

'Montréal'

In [23]:
octets.decode('iso8859-1')

'Montréal'

In [24]:
octets.decode('koi8-r')

'MontrИal'

In [25]:
try:
    octets.decode('utf-8')
except UnicodeDecodeError as e:
    print(repr(e))

UnicodeDecodeError('utf-8', b'Montr\xe9al', 5, 6, 'invalid continuation byte')


In [26]:
octets.decode('utf-8', errors='replace')  # replaces undecodable bytes with '�' (U+FFFD)

'Montr�al'
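When the encoding of incoming bytes is unknown, one possible approach is to try a short list of candidates in order. The helper `decode_with_fallback` and its default candidate list are hypothetical, written only for this illustration; guessing an encoding this way is a heuristic and can silently produce wrong text.

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes `data`."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: decode with the first candidate, marking bad bytes with U+FFFD.
    return data.decode(encodings[0], errors='replace'), encodings[0]


text, used = decode_with_fallback(b'Montr\xe9al')
```

UTF-8 is tried first because it rejects most byte sequences that were not produced by a UTF-8 encoder, whereas single-byte codecs accept almost anything.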

<h2>Unicode normalization for reliable comparison</h2>

`unicodedata.normalize(form, string)` is a function in Python's `unicodedata` module used to normalize Unicode text. It takes two arguments: `form`, which specifies the normalization form to apply, and `string`, the Unicode string to be normalized.

- `NFC`: Canonical decomposition followed by canonical composition; combines base characters and diacritics into precomposed characters where possible.
- `NFD`: Canonical decomposition; splits precomposed characters into base characters and combining marks.
- `NFKC`: Compatibility decomposition followed by canonical composition; also replaces compatibility characters (such as `½` or `µ`) with their preferred equivalents, which can lose formatting information.
- `NFKD`: Compatibility decomposition; like `NFKC`, but leaves the result decomposed.
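The four forms can be compared on a single string. The sample below is an illustrative choice: it contains one compatibility character (the `ﬁ` ligature) and one precomposed accented letter.

```python
from unicodedata import normalize

# U+FB01 LATIN SMALL LIGATURE FI + 'anc' + U+00E9 LATIN SMALL LETTER E WITH ACUTE
sample = '\ufb01anc\xe9'

lengths = {
    form: len(normalize(form, sample))
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD')
}
print(lengths)
```

The canonical forms keep the ligature as one character, while the compatibility forms split it into `f` and `i`; the decomposed forms additionally split `é` into `e` plus a combining accent.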

In [27]:
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'  # adds an acute accent to the letter 'e' in 'cafe', producing 'café'.

In [28]:
s1, s2

('café', 'café')

In [29]:
len(s1), len(s2)

(4, 5)

In [30]:
s1 == s2

False

In [31]:
from unicodedata import normalize, name


len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [32]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)

In [33]:
normalize('NFC', s1) == normalize('NFC', s2)

True

In [34]:
ohm = '\u2126'  # represents the Unicode character "OHM SIGN" (Ω)
name(ohm)

'OHM SIGN'

In [35]:
ohm_c = normalize('NFC', ohm)
name(ohm_c)

'GREEK CAPITAL LETTER OMEGA'

In [36]:
ohm == ohm_c

False

In [37]:
normalize('NFC', ohm) == normalize('NFC', ohm_c)

True

In [38]:
half = '\N{VULGAR FRACTION ONE HALF}'  # represents the vulgar fraction one-half
print(half)

½


In [39]:
normalize('NFKC', half)

'1⁄2'

In [40]:
for char in normalize('NFKC', half):
    print(char, name(char), sep=' - ')

1 - DIGIT ONE
⁄ - FRACTION SLASH
2 - DIGIT TWO


In [41]:
four_squared = '4²'

In [42]:
normalize('NFKC', four_squared)

'42'

In [43]:
micro = 'µ'

In [44]:
micro_kc = normalize('NFKC', micro)

In [45]:
micro, micro_kc

('µ', 'μ')

In [46]:
ord(micro), ord(micro_kc)

(181, 956)

In [47]:
name(micro), name(micro_kc)

('MICRO SIGN', 'GREEK SMALL LETTER MU')

<h3>Case folding</h3>

Case folding converts text to a standardized form, essentially lowercase with some additional transformations, for case-insensitive comparisons.

`str.casefold()` returns a case-folded copy of the string. It is similar to `str.lower()` but more aggressive: for example, it maps the German 'ß' to 'ss', which `lower()` leaves unchanged.
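A short sketch of the difference between the two methods, using three illustrative spellings of the same word:

```python
words = ['Straße', 'STRASSE', 'strasse']

lowered = {w.lower() for w in words}     # lower() keeps 'ß' as it is
folded = {w.casefold() for w in words}   # casefold() turns 'ß' into 'ss'
```

After `lower()` two distinct spellings remain, while `casefold()` reduces all three to a single form.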

In [48]:
eszett = 'ß'
eszett.casefold()

'ss'

In [49]:
s = 'ΔΙΆΛΕΚΤΟΣ'
s.casefold()

'διάλεκτοσ'

In [50]:
s = 'ÉLÉPHANT'
s.casefold()

'éléphant'

<h3>Utility functions for normalized Unicode string comparison</h3>

Using Normal Form C, case sensitive

In [51]:
def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

In [52]:
s1 = 'café'
s2 = 'cafe\u0301'

s1 == s2

False

In [53]:
nfc_equal(s1, s2)

True

In [54]:
nfc_equal('A', 'a')

False

Using Normal Form C with case folding

In [55]:
def fold_equal(str1, str2):
    return (
        normalize('NFC', str1).casefold() ==
        normalize('NFC', str2).casefold()
    )

In [56]:
s3 = 'Straße'
s4 = 'strasse'

s3 == s4

False

In [57]:
nfc_equal(s3, s4)

False

In [58]:
fold_equal(s3, s4)

True

In [59]:
fold_equal('A', 'a')

True

<h3>Extreme "normalization": removing diacritical marks</h3>

Function to remove all combining marks

In [60]:
from unicodedata import combining


def shave_marks(txt):
    # Remove all diacritic marks
    norm_txt = normalize('NFD', txt)  # decompose into base chars and combining marks
    shaved = ''.join(
        c for c in norm_txt
        if not combining(c)  # keep only non-combining characters
    )
    return normalize('NFC', shaved)  # recompose the remaining characters

In [61]:
order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'
shave_marks(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [62]:
greek = 'Ζέφυρος, Zéfiro'
shave_marks(greek)

'Ζεφυρος, Zefiro'

Function to remove combining marks only from Latin base characters

In [63]:
from string import ascii_letters


def shave_marks_latin(txt):
    # Remove all diacritic marks from Latin base characters
    norm_txt = normalize('NFD', txt)
    latin_base = False
    preserve = []

    for c in norm_txt:

        if combining(c) and latin_base:
            continue  # ignore diacritic in Latin base char
        preserve.append(c)

        # if it isn't a combining char, it's a new base char
        if not combining(c):
            latin_base = c in ascii_letters

    shaved = ''.join(preserve)

    return normalize('NFC', shaved)  # recompose the remaining characters

In [64]:
shave_marks_latin(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [65]:
shave_marks_latin(greek)

'Ζέφυρος, Zefiro'

Transforming some Western typographic symbols into ASCII

In [66]:
# Build a mapping table for replacing one character with another character
single_map = str.maketrans("""‚ƒ„ˆ‹‘’“”•–—˜›""",
                           """'f"^<''""---~>""")

# Build a mapping table for replacing one character with a string of characters
multi_map = str.maketrans({
    '€': 'EUR',
    '…': '...',
    'Æ': 'AE',
    'æ': 'ae',
    'Œ': 'OE',
    'œ': 'oe',
    '™': '(TM)',
    '‰': '<per mille>',
    '†': '**',
    '‡': '***',
})

# Merge the mapping tables
multi_map.update(single_map)

`dewinize()` takes a text and replaces certain characters with an ASCII character or sequence of characters, according to the mapping table built above.

In [67]:
def dewinize(txt):
    # Replace Win1252 symbols with ASCII chars or sequences
    return txt.translate(multi_map)

In [68]:
dewinize(order)

'"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí."'

In [69]:
dewinize(greek)

'Ζέφυρος, Zéfiro'

`asciize()` calls `dewinize()`, then removes diacritic marks and replaces 'ß'

In [70]:
def asciize(txt):
    no_marks = shave_marks_latin(dewinize(txt))  # dewinize, then remove diacritics from Latin characters
    no_marks = no_marks.replace('ß', 'ss')  # replace eszett with 'ss'

    return normalize('NFKC', no_marks)  # replace compatibility characters such as '½'

In [71]:
asciize(order)

'"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai."'

In [72]:
asciize(greek)

'Ζέφυρος, Zefiro'

<h2>Unicode Text Sorting</h2>

In [73]:
fruits = ['caju', 'atemoia', 'cajà', 'açai', 'acerola']
sorted(fruits)

['acerola', 'atemoia', 'açai', 'caju', 'cajà']

Python compares strings code point by code point, so `sorted()` places accented characters after all unaccented ASCII letters, as the output above shows. The `locale` module offers a way to sort according to the collation rules of a specific language or region: once an appropriate locale is set, `locale.strxfrm` can transform each string into a key that compares according to those rules.

The call below uses `locale.setlocale()` with the `locale.LC_COLLATE` category to select the `'pt_BR.UTF-8'` locale (Portuguese, Brazil, UTF-8 encoding). The locale must be installed on the operating system, otherwise `setlocale()` raises `locale.Error`.

In [74]:
import locale


my_locale = locale.setlocale(category=locale.LC_COLLATE, locale='pt_BR.UTF-8')
my_locale

'pt_BR.UTF-8'

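`locale.strxfrm` is the locale-aware transformation function, used here as the sort key. How well it works depends on the platform's locale implementation; in the output below the order is the same as with plain `sorted()`.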

In [75]:
sorted_fruits = sorted(fruits, key=locale.strxfrm)
sorted_fruits

['acerola', 'atemoia', 'açai', 'caju', 'cajà']
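Because `locale.setlocale()` changes a process-wide setting and fails when the requested locale is missing, it can help to wrap the call. The helper `sorted_with_locale` below is a hypothetical sketch, not part of the `locale` module: it falls back to plain code point order when the locale is unavailable and restores the previous setting afterwards.

```python
import locale


def sorted_with_locale(words, loc):
    """Sort `words` with the collation of `loc`; fall back to code point order."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, loc)
    except locale.Error:
        return sorted(words), False                 # locale not installed
    try:
        return sorted(words, key=locale.strxfrm), True
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting


fruits = ['caju', 'atemoia', 'cajà', 'açai', 'acerola']
result, used_locale = sorted_with_locale(fruits, 'pt_BR.UTF-8')
```

The second return value reports whether the locale was actually applied, since the resulting order differs between systems.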

<h3>Sorting with the Unicode Collation Algorithm</h3>

`pyuca` is a pure-Python implementation of the Unicode Collation Algorithm (UCA). It sorts strings using the collation table that ships with the library and does not take the locale into account, so it does not depend on the operating system's locale support. If the sorting needs to be customized, the path to a custom collation table can be passed to the `Collator()` constructor.

In [76]:
import pyuca


coll = pyuca.Collator()

sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açai', 'acerola', 'atemoia', 'cajà', 'caju']

<h2>Unicode Database</h2>

<h3>Character Search by Name</h3>

`unicodedata` is a Python standard library module for accessing the Unicode character database: character names, categories, numeric values, and normalization.
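Besides `name()`, the module offers the reverse lookup and other per-character properties. A brief sketch using one illustrative character:

```python
import unicodedata

# Name -> character (the reverse of unicodedata.name)
char = unicodedata.lookup('LATIN SMALL LETTER A WITH TILDE')

category = unicodedata.category(char)         # general category, e.g. 'Ll' (lowercase letter)
decomposed = unicodedata.decomposition(char)  # decomposition mapping as hex code points
```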

In [77]:
import unicodedata


unicodedata.name('A')

'LATIN CAPITAL LETTER A'

In [78]:
unicodedata.name('ã')

'LATIN SMALL LETTER A WITH TILDE'

In [79]:
unicodedata.name('😸')

'GRINNING CAT FACE WITH SMILING EYES'

In [80]:
unicodedata.name('🙅')

'FACE WITH NO GOOD GESTURE'

Character Finder Utility

In [81]:
import sys


START, END = ord(' '), sys.maxunicode + 1  # Set START to space character code point, END to cover entire Unicode range

def find(*query_words, start=START, end=END):
    query = {w.upper() for w in query_words}

    for code in range(start, end):
        char = chr(code)
        char_name = unicodedata.name(char, None)  # character name, or None if unnamed

        if char_name and query.issubset(char_name.split()):
            print(f"U+{code:04X}\t{char}\t{char_name}")  # code point in U+XXXX format, the character, and its name

In [82]:
find('dog')

U+2EA8	⺨	CJK RADICAL DOG
U+2F5D	⽝	KANGXI RADICAL DOG
U+B3C5	독	HANGUL SYLLABLE DOG
U+1F32D	🌭	HOT DOG
U+1F415	🐕	DOG
U+1F436	🐶	DOG FACE
U+1F9AE	🦮	GUIDE DOG


<h3>Numeric Symbols</h3>

Demonstration of numeric character metadata in the Unicode database

In [83]:
import re


re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

for char in sample:
    print(
        f"U+{ord(char):04X}",
        char.center(6),  # character centered in a field of width 6
        're_dig' if re_digit.match(char) else '-',
        'isdig' if char.isdigit() else '-',
        'isnum' if char.isnumeric() else '-',
        f"{unicodedata.numeric(char):5.2f}",  # get unicodedata numeric value in a 5 wide field with two decimal places
        unicodedata.name(char),  # get character unicodedata name
        sep='\t'
    )

U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00BC	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00B2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136B	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216B	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX
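The table shows that "digit" means different things to different functions. As a further sketch, `str.isdecimal()` is the strictest test and matches what `int()` accepts; the sample characters are taken from the table above.

```python
import unicodedata

# DIGIT ONE, DEVANAGARI DIGIT THREE, SUPERSCRIPT TWO, VULGAR FRACTION ONE QUARTER
for char in '1\u0969\xb2\xbc':
    print(char, char.isdecimal(), char.isdigit(), char.isnumeric(), sep='\t')

as_int = int('\u0969')                  # int() accepts decimal digits of any script
superscript = unicodedata.digit('\xb2')   # digit value of SUPERSCRIPT TWO
quarter = unicodedata.numeric('\xbc')     # numeric value of the fraction

try:
    int('\xb2')                         # a digit, but not a decimal digit
except ValueError:
    superscript_is_int = False
else:
    superscript_is_int = True
```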
