# Unicode Text Versus Bytes

## Character Issues

The Unicode standard explicitly separates the identity of characters from specific byte representations:

+ The identity of a character—its _code point_—is a number from 0 to 1,114,111 (base 10), shown in the Unicode standard as 4 to 6 hex digits with a “U+” prefix, from U+0000 to U+10FFFF. 
+ The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts code points to byte sequences and vice versa. The code point for the letter A (U+0041) is encoded as the single byte `\x41` in the UTF-8 encoding, or as the bytes `\x41\x00` in UTF-16LE encoding. As another example, UTF-8 requires three bytes—`\xe2\x82\xac`—to encode the Euro sign (U+20AC), but in UTF-16LE the same code point is encoded as two bytes: `\xac\x20`.
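
The distinction between a code point and its encoded bytes can be checked directly. The following is a minimal sketch using only built-ins; the `samples` dictionary is a name introduced here for illustration.

```python
# Map each sample character to its code point and two byte encodings.
samples = {}
for char in ('A', '€'):
    samples[char] = {
        'code_point': ord(char),              # identity of the character
        'utf_8': char.encode('utf_8'),        # bytes under UTF-8
        'utf_16le': char.encode('utf_16le'),  # bytes under UTF-16LE (no BOM)
    }

for char, info in samples.items():
    # bytes.hex(sep) requires Python 3.8 or later
    print(f"U+{info['code_point']:04X}",
          info['utf_8'].hex(' '),
          info['utf_16le'].hex(' '),
          sep='\t')
```

The code point stays the same for each character; only the byte sequences differ between encodings.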

In [1]:
s = 'café'
len(s)

4

In [2]:
b = s.encode('utf8')
b

b'caf\xc3\xa9'

In [3]:
len(b)

5

In [4]:
b.decode('utf8')

'café'

## Byte Essentials

In [5]:
cafe = bytes('café', encoding='utf_8')
cafe

b'caf\xc3\xa9'

In [6]:
cafe[0]

99

In [7]:
cafe[:1]

b'c'

In [8]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'caf\xc3\xa9')

In [9]:
cafe_arr[-1:]

bytearray(b'\xa9')

In [10]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'

In [11]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2]) # typecode 'h' creates an array of
                                              # short integers (16 bits each)
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

## Basic Encoders/Decoders

The Python distribution bundles more than 100 _codecs_ (encoder/decoders) for text to byte conversion and vice versa. Each codec has a name, like `'utf_8'`, and often aliases, such as `'utf8'`, `'utf-8'`, and `'U8'`, which you can use as the encoding argument in functions like `open()`, `str.encode()`, `bytes.decode()`, and so on.
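
One way to see that the aliases resolve to the same codec is to ask `codecs.lookup()` for the canonical name. This is a small sketch; the `canonical_names` dictionary is introduced here only for illustration.

```python
import codecs

aliases = ['utf_8', 'utf8', 'utf-8', 'U8']

# codecs.lookup() returns a CodecInfo object whose .name is the canonical name.
canonical_names = {alias: codecs.lookup(alias).name for alias in aliases}

for alias, canonical in canonical_names.items():
    print(f'{alias!r:>8} -> {canonical!r}')
```

An unknown name raises `LookupError`, which makes `codecs.lookup()` a convenient way to validate an encoding name before using it.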

In [12]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


## Understanding Encode/Decode Problems

Although there is a generic `UnicodeError` exception, the error reported by Python is usually more specific: either a `UnicodeEncodeError` (when converting `str` to binary sequences) or a `UnicodeDecodeError` (when reading binary sequences into `str`). Loading Python modules may also raise `SyntaxError` when the source encoding is unexpected.
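
The relationship between these exception classes can be sketched as follows. Catching the generic `UnicodeError` handles both specific errors, and the exception object carries the details of the failure. The variable names are chosen here for illustration.

```python
# Both specific errors derive from UnicodeError, which derives from ValueError.
hierarchy_ok = (issubclass(UnicodeEncodeError, UnicodeError)
                and issubclass(UnicodeDecodeError, UnicodeError)
                and issubclass(UnicodeError, ValueError))

try:
    'El Niño'.encode('ascii')   # 'ñ' has no ASCII representation
except UnicodeError as exc:
    # The exception records the codec, the offending slice, and the reason.
    encode_details = (type(exc).__name__, exc.encoding, exc.start, exc.end)

try:
    b'El Ni\xf1o'.decode('ascii')   # byte 0xf1 is outside the ASCII range
except UnicodeError as exc:
    decode_details = (type(exc).__name__, exc.encoding, exc.start, exc.end)

print(hierarchy_ok, encode_details, decode_details)
```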

### Coping with UnicodeEncodeError

In [13]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [14]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [15]:
city.encode('iso8859_1')

b'S\xe3o Paulo'

In [16]:
try:
    city.encode('cp437')
except Exception as e:
    print(f"{e=}")

e=UnicodeEncodeError('charmap', 'São Paulo', 1, 2, 'character maps to <undefined>')


In [17]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [18]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [19]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

### Coping with UnicodeDecodeError

Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a `UnicodeDecodeError` if unexpected bytes are found.
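
As a rough illustration, the single byte `\xe9` is rejected by the ASCII, UTF-8, and UTF-16 decoders, while single-byte encodings such as Latin-1 accept it. The helper `can_decode` is a hypothetical name introduced for this sketch.

```python
def can_decode(data, encoding):
    """Return True if data decodes under encoding without error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

single_byte = b'\xe9'
results = {
    encoding: can_decode(single_byte, encoding)
    for encoding in ('ascii', 'utf_8', 'utf_16le', 'latin_1', 'cp1252')
}
print(results)
```

A successful decode does not prove that the guessed encoding is the right one; it only means the bytes happen to be valid in that encoding.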

In [20]:
octets = b'Montr\xe9al'
octets.decode('cp1252')

'Montréal'

In [21]:
octets.decode('iso8859_7')

'Montrιal'

In [22]:
octets.decode('koi8_r')

'MontrИal'

In [23]:
try:
    octets.decode('utf_8')
except Exception as e:
    print(f"{e=}")

e=UnicodeDecodeError('utf-8', b'Montr\xe9al', 5, 6, 'invalid continuation byte')


In [24]:
octets.decode("utf_8", errors="replace")

'Montr�al'

### `SyntaxError` When Loading Modules with Unexpected Encoding

### BOM: A Useful Gremlin

In [25]:
u16 = 'El Niño'.encode('utf_16')
u16

b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

In [26]:
list(u16)

[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [27]:
u16le = 'El Niño'.encode('utf_16le')
list(u16le)

[69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [28]:
u16be = 'El Niño'.encode('utf_16be')
list(u16be)

[0, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111]
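
The cells above show that the generic `utf_16` codec prepends a byte-order mark (BOM), while the endian-specific codecs do not. The sketch below illustrates how decoders treat the BOM. It builds the BOM-prefixed bytes explicitly with constants from the `codecs` module so the result does not depend on the byte order of the machine; the variable names are illustrative.

```python
import codecs

text = 'El Niño'

# Build BOM-prefixed byte strings explicitly for both byte orders.
with_bom_le = codecs.BOM_UTF16_LE + text.encode('utf_16le')
with_bom_be = codecs.BOM_UTF16_BE + text.encode('utf_16be')

# The generic decoder reads the BOM, picks the byte order, and drops the BOM.
decoded_le = with_bom_le.decode('utf_16')
decoded_be = with_bom_be.decode('utf_16')

# An endian-specific decoder does not interpret the BOM: it survives as U+FEFF.
kept_bom = with_bom_le.decode('utf_16le')

# The 'utf_8_sig' codec writes a UTF-8 BOM on encode and strips it on decode.
utf8_sig = text.encode('utf_8_sig')
stripped = utf8_sig.decode('utf_8_sig')

print(decoded_le, decoded_be, repr(kept_bom), utf8_sig[:3], stripped)
```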

## Handling Text Files

In [29]:
fp = open('cafe.txt', 'w', encoding='utf_8')
fp

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>

In [30]:
fp.write('café')
fp.close()

In [31]:
import os
os.stat('cafe.txt').st_size

5

In [32]:
fp2 = open('cafe.txt')
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='UTF-8'>

In [33]:
fp2.encoding

'UTF-8'

In [34]:
fp2.read()

'café'

In [35]:
fp3 = open('cafe.txt', encoding='utf_8')
fp3

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='utf_8'>

In [36]:
fp3.read()

'café'

In [37]:
fp4 = open('cafe.txt', 'rb')
fp4

<_io.BufferedReader name='cafe.txt'>

In [38]:
fp4.read()

b'caf\xc3\xa9'

In [39]:
try:
    fp.close()
    fp2.close()
    fp3.close()
    fp4.close()
except Exception as e:
    print(f"{e=}")

### Beware of Encoding Defaults

In [40]:
import locale
import sys

expressions = """
    locale.getpreferredencoding()
    type(my_file)
    my_file.encoding
    sys.stdout.isatty()
    sys.stdout.encoding
    sys.stdin.isatty()
    sys.stdin.encoding
    sys.stderr.isatty()
    sys.stderr.encoding
    sys.getdefaultencoding()
    sys.getfilesystemencoding()
"""

my_file = open("dummy", "w")

for expression in expressions.split():
    value = eval(expression)
    print(f"{expression:>30} -> {value!r}")

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


In [41]:
import sys
from unicodedata import name

print(sys.version)
print()
print('sys.stdout.isatty():', sys.stdout.isatty())
print('sys.stdout.encoding:', sys.stdout.encoding)
print()

test_chars = [
    '\N{HORIZONTAL ELLIPSIS}',            # exists in cp1252, not in cp437
    '\N{INFINITY}',                       # exists in cp437, not in cp1252
    '\N{CIRCLED NUMBER FORTY TWO}',       # not in cp437 or in cp1252
]

for char in test_chars:
    print(f'Trying to output {name(char)}')
    print(char)

3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]

sys.stdout.isatty(): False
sys.stdout.encoding: UTF-8

Trying to output HORIZONTAL ELLIPSIS
…
Trying to output INFINITY
∞
Trying to output CIRCLED NUMBER FORTY TWO
㊷


## Normalizing Unicode for Reliable Comparisons

String comparisons are complicated by the fact that Unicode has combining characters: diacritics and other marks that attach to the preceding character, appearing as one when printed.
For example, the word “café” may be composed in two ways, using four or five code points, but the result looks exactly the same:

In [42]:
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
s1, s2

('café', 'café')

In [43]:
len(s1), len(s2)

(4, 5)

In [44]:
s1 == s2

False

In [45]:
from unicodedata import normalize
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
len(s1), len(s2)

(4, 5)

In [46]:
len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [47]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)

In [48]:
normalize('NFC', s1) == normalize('NFC', s2)

True

In [49]:
normalize('NFD', s1) == normalize('NFD', s2)

True

In [50]:
from unicodedata import normalize, name

ohm = '\u2126'
name(ohm)

'OHM SIGN'

In [51]:
ohm_c = normalize('NFC', ohm)
name(ohm_c)

'GREEK CAPITAL LETTER OMEGA'

In [52]:
ohm == ohm_c

False

In [53]:
normalize('NFC', ohm) == normalize('NFC', ohm_c)

True

In [54]:
from unicodedata import normalize, name
half = '\N{VULGAR FRACTION ONE HALF}'
print(half)

½


In [55]:
normalize('NFKC', half)

'1⁄2'

In [56]:
for char in normalize('NFKC', half):
    print(char, name(char), sep='\t')

1	DIGIT ONE
⁄	FRACTION SLASH
2	DIGIT TWO


In [57]:
four_squared = '4²'
normalize('NFKC', four_squared)

'42'

In [58]:
micro = 'µ'
micro_kc = normalize('NFKC', micro)
micro, micro_kc

('µ', 'μ')

In [59]:
ord(micro), ord(micro_kc)

(181, 956)

In [60]:
name(micro), name(micro_kc)

('MICRO SIGN', 'GREEK SMALL LETTER MU')

### Case Folding

Case folding is essentially converting all text to lowercase, with some additional transformations. It is supported by the `str.casefold()` method.
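
To see where the "additional transformations" matter, it may help to compare `str.lower()` with `str.casefold()` on the two characters examined in the cells that follow. This is a minimal sketch; the characters are written as escapes to avoid ambiguity about which code point is meant.

```python
micro_sign = '\u00b5'   # MICRO SIGN
sharp_s = '\u00df'      # LATIN SMALL LETTER SHARP S

# lower() leaves both characters unchanged...
lowered = (micro_sign.lower(), sharp_s.lower())

# ...while casefold() maps them to GREEK SMALL LETTER MU and 'ss'.
folded = (micro_sign.casefold(), sharp_s.casefold())

# A case-insensitive comparison therefore differs between the two methods.
same_with_lower = 'Straße'.lower() == 'STRASSE'.lower()
same_with_casefold = 'Straße'.casefold() == 'STRASSE'.casefold()

print(lowered, folded, same_with_lower, same_with_casefold)
```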

In [61]:
micro = 'µ'
name(micro)

'MICRO SIGN'

In [62]:
micro_cf = micro.casefold()
name(micro_cf)

'GREEK SMALL LETTER MU'

In [63]:
micro, micro_cf

('µ', 'μ')

In [64]:
eszett = 'ß'
name(eszett)

'LATIN SMALL LETTER SHARP S'

In [65]:
eszett_cf = eszett.casefold()
eszett, eszett_cf

('ß', 'ss')

### Utility Functions for Normalized Text Matching

As we’ve seen, NFC and NFD are safe to use and allow sensible comparisons between Unicode strings. NFC is the best normalized form for most applications. `str.casefold()` is the way to go for case-insensitive comparisons.

In [66]:
"""
Utility functions for normalized Unicode string comparison.

Using Normal Form C, case sensitive:

    >>> s1 = 'café'
    >>> s2 = 'cafe\u0301'
    >>> s1 == s2
    False
    >>> nfc_equal(s1, s2)
    True
    >>> nfc_equal('A', 'a')
    False
     
Using Normal Form C with case folding:

    >>> s3 = 'Straße'
    >>> s4 = 'strasse'
    >>> s3 == s4
    False
    >>> nfc_equal(s3, s4)
    False
    >>> fold_equal(s3, s4)
    True
    >>> fold_equal(s1, s2)
    True
    >>> fold_equal('A', 'a')
    True
"""

from unicodedata import normalize

def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() == 
            normalize('NFC', str2).casefold())

### Extreme "Normalization": Taking Out Diacritics

The Google Search secret sauce involves many tricks, but one of them apparently is ignoring diacritics (e.g., accents, cedillas, etc.), at least in some contexts. Removing diacritics is not a proper form of normalization because it often changes the meaning of words and may produce false positives when searching. But it helps to cope with some facts of life: people are sometimes lazy or ignorant about the correct use of diacritics, and spelling rules change over time, meaning that accents come and go in living languages.

In [67]:
import unicodedata
import string

def shave_marks(txt):
    """
    Remove all diacritic marks
    """
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt
                    if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', shaved)

In [68]:
order = '“Herr Voß: • 1⁄2 cup of ŒtkerTM caffè latte • bowl of açaí.”'
shave_marks(order)

'“Herr Voß: • 1⁄2 cup of ŒtkerTM caffe latte • bowl of acai.”'

In [69]:
Greek = 'Ζέφυρος, Zéfiro'
shave_marks(Greek)

'Ζεφυρος, Zefiro'

In [70]:
def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters"""
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    preserve = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:
            continue
        preserve.append(c)
        
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    shaved = ''.join(preserve)
    return unicodedata.normalize('NFC', shaved)

In [71]:
single_map = str.maketrans("""‚ƒ„ˆ‹‘’“”•–—˜›""",
                           """'f"^<''""---~>""")

multi_map = str.maketrans({
    '€': 'EUR',
    '…': '...',
    'Æ': 'AE',
    'æ': 'ae',
    'Œ': 'OE',
    'œ': 'oe',
    '™': '(TM)',
    '‰': '<per mille>',
    '†': '**',
    '‡': '***',
})

multi_map.update(single_map)

def dewinize(txt):
    """Replace Win1252 symbols with ASCII chars or sequences"""
    return txt.translate(multi_map)

def asciize(txt):
    no_marks = shave_marks_latin(dewinize(txt))
    no_marks = no_marks.replace('ß', 'ss')
    return unicodedata.normalize('NFKC', no_marks)

In [72]:
order = '“Herr Voß: • 1⁄2 cup of ŒtkerTM caffè latte • bowl of açaí.”'
dewinize(order)

'"Herr Voß: - 1⁄2 cup of OEtkerTM caffè latte - bowl of açaí."'

In [73]:
asciize(order)

'"Herr Voss: - 1⁄2 cup of OEtkerTM caffe latte - bowl of acai."'

## Sorting of Unicode Text

Python sorts sequences of any type by comparing the items in each sequence one by one. For strings, this means comparing the code points. Unfortunately, this produces unacceptable results for anyone who uses non-ASCII characters.

In [74]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits)

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']
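
The ordering above follows directly from code point comparison, which can be made explicit with a sort key that lists the code points of each word. The helper `code_points` is a name introduced for this sketch.

```python
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

def code_points(word):
    """Return the list of code points that make up word."""
    return [ord(char) for char in word]

by_default = sorted(fruits)
by_code_points = sorted(fruits, key=code_points)

# 'ç' (231) sorts after 't' (116), and 'á' (225) sorts after 'u' (117).
print(by_default == by_code_points, ord('ç'), ord('t'), ord('á'), ord('u'))
```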

In [75]:
import locale
try:
    my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
    print(my_locale)
    fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
    sorted_fruits = sorted(fruits, key=locale.strxfrm)
    print(sorted_fruits) 
except Exception as e:
    print(f"{e=}")

e=Error('unsupported locale setting')


In [79]:
import pyuca

coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

## The Unicode Database

In [78]:
from unicodedata import name

name('A')

'LATIN CAPITAL LETTER A'

In [80]:
name('ã')

'LATIN SMALL LETTER A WITH TILDE'

In [81]:
name('♛')

'BLACK CHESS QUEEN'

In [82]:
name('😸')

'GRINNING CAT FACE WITH SMILING EYES'

### Numeric Meaning of Characters

The `unicodedata` module includes functions to check whether a Unicode character represents a number and, if so, its numeric value for humans, as opposed to its code point number.

In [83]:
import unicodedata
import re

re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

for char in sample:
    print(f'U+{ord(char):04x}',
          char.center(6),
          're_dig' if re_digit.match(char) else '-',
          'isdig' if char.isdigit() else '-',
          'isnum' if char.isnumeric() else '-',
          f'{unicodedata.numeric(char):5.2f}',
          unicodedata.name(char),
          sep='\t',
          )

U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX


## Dual-Mode `str` and `bytes` APIs

Python’s standard library has functions that accept `str` or `bytes` arguments and behave differently depending on the type. Some examples can be found in the `re` and `os` modules.

### `str` Versus `bytes` in Regular Expressions

If you build a regular expression with `bytes`, patterns such as `\d` and `\w` only match ASCII characters; in contrast, if these patterns are given as `str`, they match Unicode digits or letters beyond ASCII.

In [85]:
import re

re_numbers_str = re.compile(r'\d+')
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')
re_words_bytes = re.compile(rb'\w+')

text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"
            " as 1729 = 1³ + 12³ = 9³ + 10³.")

text_bytes = text_str.encode('utf_8')

print(f'Text\n {text_str!r}')
print('Numbers')
print(' str :', re_numbers_str.findall(text_str))
print(' bytes:', re_numbers_bytes.findall(text_bytes))
print('Words')
print(' str :', re_words_str.findall(text_str))
print(' bytes:', re_words_bytes.findall(text_bytes))

Text
 'Ramanujan saw ௧௭௨௯ as 1729 = 1³ + 12³ = 9³ + 10³.'
Numbers
 str : ['௧௭௨௯', '1729', '1', '12', '9', '10']
 bytes: [b'1729', b'1', b'12', b'9', b'10']
Words
 str : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '1³', '12³', '9³', '10³']
 bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'1', b'12', b'9', b'10']
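
When ASCII-only matching is wanted for a `str` pattern, the `re.ASCII` flag restricts `\d`, `\w`, and related classes to ASCII characters. The following is a small sketch with illustrative variable names.

```python
import re

# Tamil digits for 1729, followed by the same number in ASCII digits.
sample_text = '\u0be7\u0bed\u0be8\u0bef as 1729'

unicode_digits = re.compile(r'\d+')            # default for str patterns
ascii_digits = re.compile(r'\d+', re.ASCII)    # ASCII-only matching

unicode_matches = unicode_digits.findall(sample_text)
ascii_matches = ascii_digits.findall(sample_text)

print(unicode_matches, ascii_matches)
```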


### `str` Versus `bytes` in `os` Functions

The GNU/Linux kernel is not Unicode savvy, so in the real world you may find filenames made of byte sequences that are not valid in any sensible encoding scheme, and cannot be decoded to `str`. File servers with clients using a variety of OSes are particularly prone to this problem.

In order to work around this issue, all `os` module functions that accept filenames or pathnames take arguments as `str` or `bytes`. If one such function is called with a `str` argument, the argument will be automatically converted using the codec named by `sys.getfilesystemencoding()`, and the OS response will be decoded with the same codec. This is almost always what you want, in keeping with the Unicode sandwich best practice.

But if you must deal with (and perhaps fix) filenames that cannot be handled in that way, you can pass `bytes` arguments to the `os` functions to get `bytes` return values. This feature lets you deal with any file or pathname, no matter how many gremlins you may find. 

In [88]:
os.listdir('.')

['LearningSparkV2',
 '.DS_Store',
 '.local',
 '.ipython',
 '.npm',
 '.bash_history',
 '.conda',
 '.sbt',
 '.cache',
 '.ivy2',
 '.bash_profile',
 '.jupyter',
 'Spark and watsonxdata Integration.ipynb',
 '.ipynb_checkpoints',
 'cafe.txt',
 'dummy',
 'digits-of-π.txt']

In [89]:
os.listdir(b'.')

[b'LearningSparkV2',
 b'.DS_Store',
 b'.local',
 b'.ipython',
 b'.npm',
 b'.bash_history',
 b'.conda',
 b'.sbt',
 b'.cache',
 b'.ivy2',
 b'.bash_profile',
 b'.jupyter',
 b'Spark and watsonxdata Integration.ipynb',
 b'.ipynb_checkpoints',
 b'cafe.txt',
 b'dummy',
 b'digits-of-\xcf\x80.txt']

To help with manual handling of `str` or `bytes` sequences that are filenames or pathnames, the `os` module provides special encoding and decoding functions: `os.fsencode(name_or_path)` and `os.fsdecode(name_or_path)`. Both of these functions accept an argument of type `str` or `bytes` or, since Python 3.6, an object implementing the `os.PathLike` interface.
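
A minimal sketch of these two functions follows. It assumes the filesystem encoding reported by `sys.getfilesystemencoding()` can represent the sample name (UTF-8 on most current systems); the exact bytes produced depend on that encoding, so none are shown here.

```python
import os
import sys
from pathlib import Path

filename = 'digits-of-π.txt'

as_bytes = os.fsencode(filename)      # str -> bytes using the filesystem encoding
round_trip = os.fsdecode(as_bytes)    # bytes -> str using the same encoding

# Arguments already in the target type are returned unchanged.
unchanged_bytes = os.fsencode(as_bytes)
unchanged_str = os.fsdecode(filename)

# Objects implementing os.PathLike, such as pathlib.Path, are accepted too.
from_path = os.fsencode(Path(filename))

print(sys.getfilesystemencoding(), as_bytes, round_trip)
```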

In [90]:
año = 2025
año

2025