<a name="top"></a> Contents (TBD)
===

- [Text vs Bytes](#textvsbytes)
    - [Characters](#char)
    - [Bytes](#bytes)
- [Encoding/Decoding](#enc_dec)
    - [UnicodeEncodeError](#encode_err)
    - [UnicodeDecodeError](#decode_err)


<a name='textvsbytes'></a>Text vs Bytes
===

**Python 3** introduced a sharp distinction between **strings** of human text and **sequences** of raw bytes. 

Implicit conversion of byte sequences to Unicode text is a thing of the past. 

This notebook deals with _Unicode strings_, _binary sequences_, and the encodings used to convert between them.

Depending on your Python programming context, a deeper understanding of Unicode may or may not be of vital importance to you. 

In the end, most of the issues covered in this notebook do not affect programmers who deal only with ASCII text. 

But even if that is your case, there is no escaping the `str` versus `bytes` divide. 

As a bonus, you’ll find that the specialized binary sequence types provide features that the "all-purpose" Python 2 `str` type does not have.

<a name='char'></a>Characters
---

The concept of “string” is simple enough: a string is a sequence of characters. The problem lies in the definition of “character.”

Nowadays, the best definition of _character_ we have is a **Unicode character**. 

Accordingly, the items you get out of a Python 3 `str` are _Unicode characters_, just like the items of a `unicode` object in 
Python 2, and not the raw bytes you get from a Python 2 `str`.

### Brief note about Unicode

The Unicode standard explicitly separates the identity of characters from specific byte representations:

- The identity of a character, its code point, is a number from $0$ to $1,114,111$ (base 10), shown in the _Unicode standard_ as 
$4$ to $6$ hexadecimal digits with a `U+` prefix. 

For example, the code point for the letter _A_ is `U+0041`, the _Euro sign_ is `U+20AC`. 

- The actual bytes that represent a character depend on the encoding in use. 

An encoding is an algorithm that converts code points to byte sequences and vice versa. 

The code point for _A_ (`U+0041`) is encoded as the single byte `\x41` in the **UTF-8** encoding, 
or as the bytes `\x41\x00` in **UTF-16LE** encoding. 

As another example, the _Euro sign_ (`U+20AC`) becomes three bytes in **UTF-8**, `\xe2\x82\xac`, but in **UTF-16LE** it is encoded as two bytes: `\xac\x20`.
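
The relationship between code points and encodings described above can be checked with the built-ins `ord()`, `chr()`, and `str.encode()`. This is a minimal sketch; the `encodings` dictionary and the loop variables are illustrative names, not part of the original notebook.

```python
# Code points are plain integers: ord() maps a one-character str to its
# code point, and chr() maps the integer back to the character.
encodings = {}
for char in ['A', '€']:
    code_point = ord(char)
    label = f'U+{code_point:04X}'        # notation used by the Unicode standard
    utf8 = char.encode('utf_8')          # 1 byte for 'A', 3 bytes for '€'
    utf16le = char.encode('utf_16_le')   # 2 bytes for each of these characters
    encodings[char] = (label, utf8, utf16le)
    print(label, utf8, utf16le, sep='\t')

# The highest valid code point is 0x10FFFF, that is 1,114,111 in base 10.
assert ord(chr(0x10FFFF)) == 1114111
```

The same character therefore has one identity (its code point) but as many byte representations as there are encodings that support it.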

#### Encoding and Decoding

- Converting from _code points_ to _bytes_ is **encoding**; 
- Converting from _bytes_ to _code points_ is **decoding**.

In [2]:
s = 'café'
len(s)

4

In [3]:
b = s.encode('utf8')

In [4]:
b

b'caf\xc3\xa9'

In [5]:
len(b)

5

In [6]:
b.decode('utf8')

'café'

In [7]:
type(b)

bytes

In [8]:
type(s)

str

[top](#top)

<a name='bytes'></a>Bytes
---

The new binary sequence types are unlike the Python 2 str in many regards. 

The first thing to know is that there are **two basic built-in types for binary sequences**: 

- the _immutable_ `bytes` type introduced in Python 3;
- the _mutable_ `bytearray`, added in Python 2.6. 

_For the record_: 
Python 2.6 also introduced `bytes`, but there it is just an alias for the `str` type and does not behave like the Python 3 `bytes` type.

Each item in `bytes` or `bytearray` is an integer from 0 to 255, and not a one-character string as in the Python 2 `str`. 

However, a slice of a binary sequence always produces a binary sequence of the same type — including slices of length 1.

In [9]:
cafe = bytes('café', encoding='utf_8')

In [10]:
cafe

b'caf\xc3\xa9'

In [11]:
cafe[0]

99

In [12]:
cafe[:1]

b'c'

In [13]:
cafe_arr = bytearray(cafe)

In [14]:
cafe_arr

bytearray(b'caf\xc3\xa9')

In [15]:
cafe_arr[-1:]

bytearray(b'\xa9')
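
The cells above cover indexing and slicing. The immutable versus mutable distinction between `bytes` and `bytearray` can be sketched as follows; the variable `frozen` is an illustrative name, and the values assigned are arbitrary.

```python
cafe = bytes('café', encoding='utf_8')
cafe_arr = bytearray(cafe)

# bytes is immutable: item assignment raises TypeError.
try:
    cafe[0] = 67
except TypeError as exc:
    print('bytes:', exc)

# bytearray is mutable: items are assigned as integers from 0 to 255.
cafe_arr[0] = 67          # 67 is the byte value of ASCII 'C'
cafe_arr.extend(b'!')     # in-place growth is also allowed
print(cafe_arr)

# bytes() builds an immutable copy of the current contents.
frozen = bytes(cafe_arr)
```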

[top](#top)

<a name='enc_dec'></a>Encoding/Decoding
===

The Python distribution bundles more than 100 codecs (encoder/decoder) for text to byte conversion and vice versa. 

Each codec has a name, like 'utf_8', and often aliases, such as 'utf8', 'utf-8', and 'U8', which you can use as the encoding 
argument in functions like `open()`, `str.encode()`, `bytes.decode()`, and so on.

In [16]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
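
To see that the aliases mentioned above really name the same codec, one option is `codecs.lookup()` from the standard library, which resolves a name or alias to a `CodecInfo` object. This is only a sketch; the `aliases`, `resolved`, and `encoded` names are illustrative.

```python
import codecs

# Each of these spellings is expected to resolve to the same UTF-8 codec.
aliases = ['utf_8', 'utf8', 'utf-8', 'U8']
resolved = {alias: codecs.lookup(alias).name for alias in aliases}
print(resolved)

# If the aliases are equivalent, encoding with any of them gives the same bytes,
# so this set ends up with a single element.
encoded = {'El Niño'.encode(alias) for alias in aliases}
print(encoded)
```

An unknown name makes `codecs.lookup()` raise `LookupError`, which is a cheap way to validate a user-supplied encoding before using it.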


Although there is a generic `UnicodeError` exception, the error reported is almost always more specific: 

- a `UnicodeEncodeError` when converting `str` to binary sequences; 
- a `UnicodeDecodeError` when reading binary sequences into `str`. 
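
Because both specific exceptions derive from `UnicodeError`, a single `except` clause can cover either direction. The sketch below assumes a hypothetical helper, `try_conversion`, introduced only for illustration.

```python
def try_conversion(action):
    """Run a zero-argument callable; return its result or the name of the Unicode error raised."""
    try:
        return action()
    except UnicodeError as exc:          # catches both subclasses
        return type(exc).__name__

# 'ã' is not an ASCII character, so encoding fails.
encode_result = try_conversion(lambda: 'São Paulo'.encode('ascii'))
# The byte 0xe9 is outside the ASCII range, so decoding fails.
decode_result = try_conversion(lambda: b'Montr\xe9al'.decode('ascii'))

print(encode_result, decode_result)
```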

<a name='encode_err'></a>UnicodeEncodeError
---

Most non-UTF codecs handle only a small subset of the Unicode characters. 

When converting text to bytes, if a character is not defined in the target encoding, `UnicodeEncodeError` will be raised, unless special handling is provided by passing an `errors` argument to the encoding method or function. 

In [17]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [18]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [19]:
city.encode('iso8859_1')


b'S\xe3o Paulo'

In [20]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

`cp437` can’t encode the 'ã' (“a” with tilde). The default error handler, `'strict'`, raises `UnicodeEncodeError`.
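
The exception itself carries enough information to locate the offending character. The following sketch uses the `start` and `object` attributes of `UnicodeEncodeError`; the function name `find_unencodable` is hypothetical and exists only for this example.

```python
def find_unencodable(text, encoding):
    """Return (position, character) for the first character `encoding` rejects, or None."""
    try:
        text.encode(encoding)            # default handler is errors='strict'
    except UnicodeEncodeError as exc:
        # exc.object is the original str; exc.start is the index where encoding failed.
        return exc.start, exc.object[exc.start]
    return None

problem = find_unencodable('São Paulo', 'cp437')
no_problem = find_unencodable('São Paulo', 'utf_8')
print(problem, no_problem)
```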

**Strategies to cope with errors**

The `errors='ignore'` handler silently skips characters that cannot be encoded; this is usually a very bad idea.

In [21]:
city.encode('cp437', errors='ignore')

b'So Paulo'

When encoding, `errors='replace'` substitutes unencodable characters with `'?'`; 
data is lost, but users will know something is amiss.

In [22]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

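`errors='xmlcharrefreplace'` replaces unencodable characters with an XML numeric character reference, so no information is lost as long as the consumer understands XML or HTML.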

In [23]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

[top](#top)

<a name='decode_err'></a>UnicodeDecodeError
---

Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a `UnicodeDecodeError` if unexpected bytes are found.

On the other hand, many legacy 8-bit encodings like `iso8859_1` and `koi8_r` are able to decode any stream of bytes, including random noise, without generating errors; `cp1252` comes close, since Python's codec leaves only a handful of byte values undefined. 

Therefore, if your program assumes the wrong 8-bit encoding, it will silently decode garbage.
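
A rough way to see this behavior is to decode every possible byte value with two 8-bit codecs and with UTF-8. In this sketch `bytes(range(256))` is only a stand-in for random noise, and the variable names are illustrative.

```python
noise = bytes(range(256))   # every possible byte value, once

# Both codecs accept all 256 byte values, so neither call raises,
# yet they disagree about what the upper half of the range means.
as_latin1 = noise.decode('latin_1')
as_koi8r = noise.decode('koi8_r')
print(as_latin1 == as_koi8r)

# UTF-8 is stricter: this byte string is not valid UTF-8.
try:
    noise.decode('utf_8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

The absence of an exception says nothing about whether the chosen 8-bit encoding was the right one.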

In [24]:
octets = b'Montr\xe9al'

Decoding with `'cp1252'` (Windows 1252) works because it maps the byte `\xe9` to 'é', just as latin1 does.

In [25]:
octets.decode('cp1252')

'Montréal'

ISO-8859-7 is intended for Greek, so the `'\xe9'` byte is misinterpreted, and no error is issued.

In [26]:
octets.decode('iso8859_7')

'Montrιal'

In [28]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
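
The `errors` argument is also accepted by `bytes.decode()`. As a sketch of how the failing UTF-8 decode above could be handled, `errors='replace'` substitutes the official Unicode replacement character `U+FFFD` for each undecodable byte, while `errors='ignore'` drops it. The names `repaired` and `skipped` are illustrative.

```python
octets = b'Montr\xe9al'

# The lone 0xe9 byte is not valid UTF-8; 'replace' marks the spot with U+FFFD.
repaired = octets.decode('utf_8', errors='replace')

# 'ignore' silently drops the byte, which hides the problem from the reader.
skipped = octets.decode('utf_8', errors='ignore')

print(repaired, skipped)
```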

[top](#top)