<h1>Chapter 04. Unicode Text and Bytes</h1>

**Unicode:**
Unicode is a standard that assigns a unique code point (an integer, conventionally written like U+00E9) to each character across the world's writing systems. A code point identifies a character, not the way it is stored: an encoding such as UTF-8 or UTF-16 defines how code points are converted to bytes and back. In Python 3, a `str` is a sequence of Unicode code points.
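
As a small illustration of the difference between a code point and its encoded form, the sketch below prints both for three characters. The variable names `sample` and `code_points` are chosen only for this example.

```python
sample = 'a\u00e9\u20ac'  # the characters a, é, €

for ch in sample:
    # ord() returns the code point; encode() returns one possible byte representation
    print(ch, f"U+{ord(ch):04X}", ch.encode('utf-8'))

# chr() is the inverse of ord(): it maps a code point back to a character
code_points = [ord(ch) for ch in sample]
```

The three characters need 1, 2, and 3 bytes in UTF-8, although each one is a single code point.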

**Bytes:**
A byte is a unit of digital storage made up of eight bits, so it can hold an integer from 0 to 255. In Python, the `bytes` type is a sequence of such integers and can hold any binary data, including encoded text. Files, network sockets, and other low-level interfaces operate on bytes, so text must be encoded before it is written or sent and decoded after it is read.

<h2>Encoding and decoding</h2>

In [1]:
s = 'café'

In [2]:
len(s)  # the string 'café' consists of 4 Unicode characters

4

In [3]:
b = s.encode('utf-8')  # convert str to bytes using UTF-8 encoding
b

b'caf\xc3\xa9'

In [4]:
len(b)

5

In [5]:
b.decode('utf-8')  # convert bytes back to str

'café'

`bytes` is an immutable sequence of 8-bit integers in Python, used for storing binary data or encoded text.

`bytearray` is the mutable counterpart to `bytes`, allowing in-place modification of its 8-bit integers.

In [6]:
cafe = bytes('café', encoding='utf-8')
cafe

b'caf\xc3\xa9'

In [7]:
cafe[0]  # every item is an integer within range(256)

99

In [8]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'caf\xc3\xa9')

In [9]:
cafe_arr[-1:]

bytearray(b'\xa9')
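
The difference in mutability can be seen directly. The following is a minimal sketch; the names `data`, `buf`, `error`, and `result` are used only for this illustration.

```python
data = bytes('caf\u00e9', encoding='utf-8')  # b'caf\xc3\xa9'

error = None
try:
    data[0] = 67  # bytes is immutable, so item assignment raises TypeError
except TypeError as exc:
    error = str(exc)
print(error)

buf = bytearray(data)
buf[0] = 67        # replace 99 (b'c') with 67 (b'C') in place
buf.extend(b'!')   # a bytearray can also grow
result = bytes(buf)
print(result)
```

Note that indexing either type returns an `int`, while slicing returns an object of the same type as the original, which is why `cafe_arr[-1:]` above is a `bytearray`.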

Initializing `bytes` from the raw data of an array

In [10]:
import array


numbers = array.array('h', [-2, -1, 0, 1, 2])  # typecode 'h' creates an array of signed 16-bit integers
numbers

array('h', [-2, -1, 0, 1, 2])

In [11]:
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'
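
The output above shows each 16-bit integer as two bytes, least significant byte first, which is what a little-endian machine produces. Because `array` uses the machine's native byte order, the sketch below shows one way to round-trip the data and, as an alternative not used in the original notes, how `struct` with an explicit `<` prefix yields the same bytes on any platform.

```python
import array
import struct
import sys

numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)

# array stores items in native byte order: 'little' or 'big'
print(sys.byteorder)

# Round trip on the same machine, using the same typecode
restored = array.array('h')
restored.frombytes(octets)

# '<5h' means: little-endian, five signed 16-bit integers
portable = struct.pack('<5h', *numbers)
unpacked = struct.unpack('<5h', portable)
print(portable)
```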

<h2>Basic encoders and decoders</h2>

The string 'El Niño' encoded with four codecs gives very different byte sequences

In [12]:
coders = ['latin-1', 'cp437', 'utf-8', 'utf-16']  # list of several common encodings
s = 'El Niño'

for codec in coders:
    print(f"{codec}: {s.encode(codec)}")

latin-1: b'El Ni\xf1o'
cp437: b'El Ni\xa4o'
utf-8: b'El Ni\xc3\xb1o'
utf-16: b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
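
The `utf-16` output starts with two extra bytes, `\xff\xfe`. This is the byte order mark (BOM), which tells a decoder whether the 16-bit units are little-endian or big-endian. A brief sketch of how the explicit-endianness variants differ (the variable names are illustrative only):

```python
import codecs

s = 'El Ni\u00f1o'  # 'El Niño'

with_bom = s.encode('utf-16')   # native byte order, BOM prepended
le = s.encode('utf-16-le')      # explicit little-endian, no BOM
be = s.encode('utf-16-be')      # explicit big-endian, no BOM

print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(le)
print(be)

# The generic 'utf-16' decoder reads the BOM and removes it from the result
decoded = with_bom.decode('utf-16')
```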


<h2>Encoding and Decoding Problems</h2>

<h3><code>UnicodeEncodeError</code> handling</h3>

Encoding text to bytes: successful completion and error handling

In [13]:
city = 'São Paulo'

In [14]:
city.encode('utf-8')

b'S\xc3\xa3o Paulo'

In [15]:
city.encode('utf-16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [16]:
city.encode('iso8859-1')

b'S\xe3o Paulo'

In [18]:
try:
    city.encode('cp437')
except UnicodeEncodeError as e:
    print(repr(e))

UnicodeEncodeError('charmap', 'São Paulo', 1, 2, 'character maps to <undefined>')


Pass the `errors=` argument to control how characters that cannot be encoded are handled

In [19]:
city.encode('cp437', errors='ignore')  # silently skips characters that cannot be encoded

b'So Paulo'

In [20]:
city.encode('cp437', errors='replace')  # replaces characters that cannot be encoded with '?'

b'S?o Paulo'

In [21]:
city.encode('cp437', errors='xmlcharrefreplace')  # replaces characters that cannot be encoded with XML character references

b'S&#227;o Paulo'
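
Instead of discarding information, it can help to find out exactly which character failed. The sketch below reads the attributes of the `UnicodeEncodeError` exception and also shows the `backslashreplace` handler, which is not covered in the cells above. The variable names `offending` and `escaped` are illustrative only.

```python
city = 'S\u00e3o Paulo'  # 'São Paulo'

offending = None
try:
    city.encode('cp437')
except UnicodeEncodeError as exc:
    # exc.object is the input string; exc.start and exc.end delimit the failing slice
    offending = exc.object[exc.start:exc.end]
    print(f"{exc.encoding} cannot encode {offending!r} at index {exc.start}")

# backslashreplace keeps the information as a Python-style escape sequence
escaped = city.encode('cp437', errors='backslashreplace')
print(escaped)
```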

<h3><code>UnicodeDecodeError</code> handling</h3>

Decoding bytes to text: successful completion and error handling

In [22]:
octets = b'Montr\xe9al'

In [23]:
octets.decode('cp1252')

'Montréal'

In [24]:
octets.decode('iso8859-1')

'Montréal'

In [25]:
octets.decode('koi8-r')

'MontrИal'

In [26]:
try:
    octets.decode('utf-8')
except UnicodeDecodeError as e:
    print(repr(e))

UnicodeDecodeError('utf-8', b'Montr\xe9al', 5, 6, 'invalid continuation byte')


In [29]:
octets.decode('utf-8', errors='replace')  # replaces bytes that cannot be decoded with '�' (U+FFFD)

'Montr�al'
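
The other standard error handlers work for decoding as well. The sketch below compares a few of them and adds a hypothetical helper, `decode_with_fallback`, that tries candidate codecs in order. The helper and its default candidate list are assumptions made for this illustration, not part of the original notes.

```python
octets = b'Montr\xe9al'

ignored = octets.decode('utf-8', errors='ignore')            # drops the bad byte
escaped = octets.decode('utf-8', errors='backslashreplace')  # keeps it as an escape

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# so the original bytes can be recovered when encoding with the same handler
smuggled = octets.decode('utf-8', errors='surrogateescape')
round_trip = smuggled.encode('utf-8', errors='surrogateescape')


def decode_with_fallback(data, codecs_to_try=('utf-8', 'cp1252')):
    """Return (text, codec) for the first candidate codec that decodes data."""
    for codec in codecs_to_try:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate codecs could decode the data')


text, used = decode_with_fallback(octets)
print(ignored, escaped, text, used)
```

A successful decode does not prove that the codec was the right one: as the `koi8-r` cell above shows, many 8-bit codecs accept the same bytes and silently produce different text, so the order of the candidates matters.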