<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Demo 9.1: UTF-8

INSTRUCTIONS:

- Run the cells
- Observe and understand the results
- Answer the questions

# Encodings, UTF-8 and Python

Unicode is an abstract catalog that maps symbols to code points.

To write content to disk or send it over the network, actual ones and zeroes are necessary.

With computers, this is done through the use of **encodings**, like UTF-8.

Encodings implement the second part of the text process: they are in charge of translating code points into **ones** and **zeroes** (bits).

There are several encodings, or implementations, of Unicode. The most common one is UTF-8, which is also the default in Python 3 for source files and for the `encode()` and `decode()` methods. Other implementations include UTF-16, UTF-32 and UCS-2.
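To make the split between code points and encodings concrete, the following minimal sketch prints the code point of a few symbols and then encodes one symbol with three encodings. The sample symbols and variable names are chosen only for illustration; the little-endian variants (`utf-16-le`, `utf-32-le`) are used so that no byte-order mark is added to the output.

```python
# The code point is the abstract Unicode number assigned to each symbol.
for symbol in ["A", "ã", "😁"]:
    print(symbol, "-> code point", hex(ord(symbol)))

# One symbol, three encodings, three different byte sequences.
text = "ã"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
utf32_bytes = text.encode("utf-32-le")

print("UTF-8 :", utf8_bytes, "->", len(utf8_bytes), "bytes")
print("UTF-16:", utf16_bytes, "->", len(utf16_bytes), "bytes")
print("UTF-32:", utf32_bytes, "->", len(utf32_bytes), "bytes")
```

The code point stays the same in every case; only the byte representation chosen by the encoding changes.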

## Encoding and decoding

- **Encoding**: The process of turning abstract symbols defined in the Unicode catalog into bits is called encoding. 😁 -> `1110101`.

- **Decoding**: The process of reading bits (`1`s and `0`s), making sense out of them, and getting back symbols or characters is called decoding. `1110101` -> 😁

Encoding is necessary when the program holds symbols (for example, text typed by the user) and needs `1`s and `0`s to store or transmit them.

Decoding is necessary when reading a raw source (a text file, data received from the network) and actual symbols have to be shown to the user.
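The bit strings in the definitions above are placeholders. As a small sketch, the actual bits that UTF-8 produces for 😁 can be inspected by formatting each encoded byte as eight binary digits (the variable names are illustrative):

```python
symbol = "😁"

# Encoding: symbol -> bytes. UTF-8 needs four bytes for this emoji.
encoded = symbol.encode("utf-8")
bits = " ".join(f"{byte:08b}" for byte in encoded)
print("Bytes:", encoded)
print("Bits :", bits)

# Decoding: bytes -> symbol. The round trip gives back the original.
decoded = encoded.decode("utf-8")
print("Back :", decoded)
```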

## Encoding in Python

Once the behind-the-scenes process is understood, encoding and decoding in Python should be a simple task.

There are always two string data types (regardless of the Python version):
- **Unicode**-like: A unicode type of string. Abstract symbols with no bit representation. They need encoding to get the final binary data. `unicode.encode()` -> **bytes**. This data type is `unicode` in **Python 2** and `str` in **Python 3**.
- **Byte**-like: A binary type of string. These are just ones and zeroes. They need decoding to get the symbols. `bytes.decode()` -> **unicode**. This data type is `str` in **Python 2** and `bytes` in **Python 3**.

The `encode()` and `decode()` methods accept a few parameters.
- The first and most important one is the name of the encoding to use (UTF-8, ASCII, Shift JIS, etc.).
- The same applies when decoding: Python needs to know which encoding was used to generate the bytes in order to decode them correctly.
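As a brief sketch of why the encoding parameter matters, the same text encoded with two different encodings gives different bytes, and each result only decodes back correctly with the encoding that produced it. Latin-1 is chosen here purely as a second example encoding, and the `encoding=` keyword form is equivalent to passing the name positionally.

```python
city = "São Paulo"

# Same text, two encodings: 'ã' is one byte in Latin-1, two in UTF-8.
latin1_bytes = city.encode(encoding="latin-1")
utf8_bytes = city.encode(encoding="utf-8")
print("Latin-1:", latin1_bytes, "->", len(latin1_bytes), "bytes")
print("UTF-8  :", utf8_bytes, "->", len(utf8_bytes), "bytes")

# Decoding with the matching encoding restores the original text.
print(latin1_bytes.decode(encoding="latin-1"))
print(utf8_bytes.decode(encoding="utf-8"))
```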

In [1]:
# Python 3
city = "São Paulo"
utf8_encoded = city.encode('utf-8')
print('Type    of utf8_encoded:', type(utf8_encoded))
print('Content of utf8_encoded:', utf8_encoded)

decoded_city = utf8_encoded.decode('utf-8')
print('Type    of decoded_city:', type(decoded_city))
print('Content of decoded_city:', decoded_city)

Type    of utf8_encoded: <class 'bytes'>
Content of utf8_encoded: b'S\xc3\xa3o Paulo'
Type    of decoded_city: <class 'str'>
Content of decoded_city: São Paulo


## Summary
| You have | Type in Python 2 | Type in Python 3 | Method to use | You get | Result type in Python 2 | Result type in Python 3 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 😁 | unicode | str | `encode()` | 110101 | str | bytes |
| 110101 | str | bytes | `decode()` | 😁 | unicode | str |

## Encoding and decoding errors
The process of encoding and decoding is fragile. There is a chance that it might fail.

It is important to know potential errors and how to handle them.

The most common error is trying to decode some bytes with the wrong encoding.

In [2]:
city = 'São Paulo'
utf8_encoded = city.encode('utf-8')

# Try with the wrong encoding to decode it
utf8_encoded.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

The previous example tries to decode a set of bytes using **ASCII**.

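Those bytes were produced by encoding a unicode string with UTF-8, which represents `ã` as the two bytes `0xc3 0xa3`. ASCII only defines byte values 0 to 127, so decoding fails at the first byte outside that range.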
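A wrong encoding does not always raise an exception. As a cautionary sketch, decoding the same UTF-8 bytes as Latin-1 (picked here only for illustration) succeeds silently, because every byte value is valid in Latin-1, but the result is garbled text:

```python
city = "São Paulo"
utf8_encoded = city.encode("utf-8")

# No exception is raised, yet the text is wrong: each of the two
# bytes that UTF-8 used for 'ã' is read as a separate Latin-1 character.
garbled = utf8_encoded.decode("latin-1")
print("Original:", city)
print("Garbled :", garbled)
print("Same text?", garbled == city)
```

Silent corruption like this is harder to notice than a `UnicodeDecodeError`, which is why knowing the real encoding of the data matters.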

Decoding UTF-8 bytes as ASCII raised a `UnicodeDecodeError`. The opposite failure, `UnicodeEncodeError`, occurs when a unicode string is encoded with an encoding that cannot represent some of its characters. For example, ASCII cannot encode an emoji.

In [3]:
greeting = 'Hello 😁!'
greeting.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f601' in position 6: ordinal not in range(128)
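Both exceptions can be caught like any other Python exception. The sketch below catches the `UnicodeEncodeError` and reads the attributes the exception object carries (`encoding`, `object`, `start`, `end`, `reason`) to report what went wrong. Falling back to UTF-8 is just one possible reaction, assumed here for illustration.

```python
greeting = "Hello 😁!"

try:
    safe_bytes = greeting.encode("ascii")
except UnicodeEncodeError as exc:
    # The exception records which codec failed and where.
    failed_encoding = exc.encoding
    bad_text = exc.object[exc.start:exc.end]
    print(f"Cannot encode {bad_text!r} at position {exc.start} "
          f"with {failed_encoding!r}: {exc.reason}")
    # Illustrative fallback: use an encoding that covers all of Unicode.
    safe_bytes = greeting.encode("utf-8")

print(safe_bytes)
```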

## Handling errors
The methods `encode()` and `decode()` accept a second argument that indicates how they should react upon unknown characters. The default value is `strict`, which will raise a `UnicodeError` exception when an invalid character or sequence is found. That is what happened in our previous examples. Aside from `strict`, there are:
- `ignore`: will ignore unknown characters or sequences and leave it blank.
- `replace`: will replace unknown characters or sequences with a question mark ?
- `xmlcharrefreplace`: will replace unknown characters or sequences with a proper XML character
- `backslashreplace`: will replace unknown characters or sequences with backslashed escape sequences

In [4]:
greeting = 'Hello 😁!'
print('Encode using \'ignore\'           :', greeting.encode('ascii', 'ignore'))
print('Encode using \'replace\'          :', greeting.encode('ascii', 'replace'))
print('Encode using \'xmlcharrefreplace\':', greeting.encode('ascii', 'xmlcharrefreplace'))
print('Encode using \'backslashreplace\' :', greeting.encode('ascii', 'backslashreplace'))

Encode using 'ignore'           : b'Hello !'
Encode using 'replace'          : b'Hello ?!'
Encode using 'xmlcharrefreplace': b'Hello &#128513;!'
Encode using 'backslashreplace' : b'Hello \\U0001f601!'
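The same error handlers apply to `decode()`. As a minimal sketch, the UTF-8 bytes of "São Paulo" are decoded as ASCII with three of the handlers; `xmlcharrefreplace` is left out because it only applies to encoding. When decoding, `replace` inserts the replacement character U+FFFD rather than a question mark.

```python
utf8_bytes = "São Paulo".encode("utf-8")

# The two non-ASCII bytes are dropped, replaced or escaped one by one.
ignored = utf8_bytes.decode("ascii", errors="ignore")
replaced = utf8_bytes.decode("ascii", errors="replace")
escaped = utf8_bytes.decode("ascii", errors="backslashreplace")

print("Decode using 'ignore'          :", ignored)
print("Decode using 'replace'         :", replaced)
print("Decode using 'backslashreplace':", escaped)
```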


1. UTF-8 can be thought of as an object created from the **Unicode** class. The analogy is: Unicode is the class, while UTF-8 and the other encodings are actual implementations (objects).
2. Most of the time the encoding is unknown. When using data from an external source, it is never 100% certain which encoding was used.
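When the encoding is unknown, one common workaround is to try a short list of likely encodings in order. The sketch below is only an illustration of that idea: `decode_with_fallback` is a hypothetical helper, and the candidate list is an assumption that should be adapted to the data at hand. It cannot prove which encoding was really used; it only returns the first candidate that decodes without error.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes `raw` without raising an error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Reached only if every candidate failed (latin-1 never fails,
    # so this matters only for a custom candidate list).
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


utf8_result = decode_with_fallback("São Paulo".encode("utf-8"))
latin1_result = decode_with_fallback("São Paulo".encode("latin-1"))
print(utf8_result)
print(latin1_result)
```

UTF-8 is tried first because bytes produced by another encoding rarely happen to be valid UTF-8, whereas single-byte encodings such as Latin-1 accept any input and would hide the mismatch.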

© 2020 Institute of Data