# Investigating characters and strings

In [3]:
cafe = 'café'
cb16 = bytes(cafe, 'utf_16')
cb16

b'\xff\xfec\x00a\x00f\x00e\x00\x01\x03'

In [5]:
cb32 = bytes(cafe, 'utf_32')
cb32

b'\xff\xfe\x00\x00c\x00\x00\x00a\x00\x00\x00f\x00\x00\x00e\x00\x00\x00\x01\x03\x00\x00'

In [6]:
cb8 = bytes(cafe, 'utf_8')
cb8

b'cafe\xcc\x81'

In [7]:
cb16le = bytes(cafe, 'utf_16le')
cb16le

b'c\x00a\x00f\x00e\x00\x01\x03'
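
The byte strings above are easier to read with one fact in mind: the outputs suggest that the "é" is stored as two code points, a plain `e` followed by U+0301 (COMBINING ACUTE ACCENT). The sketch below assumes that decomposed form and writes it with an explicit escape. It then compares how many bytes each codec produces. `utf_16` and `utf_32` prepend a byte-order mark, while `utf_16le` does not.

```python
# Assumption: the string is 'cafe' + U+0301 (decomposed form),
# inferred from the byte outputs shown above.
cafe = 'cafe\u0301'

# List each code point in the string.
code_points = [f'U+{ord(ch):04X}' for ch in cafe]

# Encoded length in bytes for each codec used above.
sizes = {enc: len(cafe.encode(enc))
         for enc in ('utf_8', 'utf_16', 'utf_16le', 'utf_32')}

print(len(cafe), code_points)
print(sizes)
```

The two-byte gap between `utf_16` and `utf_16le` is the byte-order mark (`\xff\xfe` on a little-endian machine).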

# Special [encoding error handling](https://docs.python.org/3/library/codecs.html#codecs.register_error)

In [1]:
city = 'São Paulo'

In [2]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\u0303' in position 2: character maps to <undefined>

In [3]:
city.encode('cp437', errors='ignore')

b'Sao Paulo'

In [4]:
city.encode('cp437', errors='replace')

b'Sa?o Paulo'

In [5]:
city.encode('cp437', errors='xmlcharrefreplace')

b'Sa&#771;o Paulo'
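
For a side-by-side view, the following sketch runs the same string through several built-in handlers. It assumes the decomposed spelling of the city name (`a` followed by U+0303), which is what the error message above implies. It also adds two handlers not shown above, `backslashreplace` and `namereplace`, both part of the standard library.

```python
# Assumption: 'São' is stored as 'a' + U+0303 (COMBINING TILDE),
# as the error message above implies.
city = 'Sa\u0303o Paulo'

handlers = ('ignore', 'replace', 'xmlcharrefreplace',
            'backslashreplace', 'namereplace')

# Encode once per handler and collect the resulting bytes.
results = {h: city.encode('cp437', errors=h) for h in handlers}

for name, encoded in results.items():
    print(f'{name:18} {encoded!r}')
```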

Codec error handling is extensible. You can register additional names for the `errors` argument by passing a name and an error-handling function to `codecs.register_error`. The function receives the `UnicodeError` instance and returns a tuple holding the replacement and the position at which processing should resume. See the [codecs.register_error documentation](https://docs.python.org/3/library/codecs.html#codecs.register_error).

In [6]:
import codecs

In [20]:
codecs.register_error('custom_error', lambda e: ('!', e.end))

In [21]:
city.encode('cp437', errors='custom_error')

b'Sa!o Paulo'
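
The lambda above ignores which character failed. A slightly fuller handler can read the offending slice from the exception's `object`, `start`, and `end` attributes. The following is only a sketch: the handler name `hex_escape_demo` and the `<U+XXXX>` output format are made up for illustration.

```python
import codecs

def hex_escape(error):
    """Replace each unencodable character with a <U+XXXX> marker."""
    # This sketch only handles encoding errors; re-raise anything else.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = ''.join(f'<U+{ord(ch):04X}>' for ch in bad)
    # Return the replacement and the index where encoding resumes.
    return replacement, error.end

# 'hex_escape_demo' is a hypothetical name chosen for this example.
codecs.register_error('hex_escape_demo', hex_escape)

# Assumption: decomposed spelling, 'a' followed by U+0303.
city = 'Sa\u0303o Paulo'
encoded = city.encode('cp437', errors='hex_escape_demo')
print(encoded)
```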

# Decoding garbled characters
Many legacy 8-bit encodings, such as 'iso8859_1' and 'koi8_r', can decode any stream of bytes, including random noise, without raising errors. Others, such as 'cp1252', leave only a handful of byte values undefined and accept nearly everything else. Therefore, if your program assumes the wrong 8-bit encoding, it will usually decode garbage silently.
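
One way to check this is to decode all 256 possible byte values with each codec. In this sketch, `latin1` and `koi8_r` accept every byte, while Python's `cp1252` codec stops at one of the few values it leaves undefined.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

outcomes = {}
for encoding in ('latin1', 'koi8_r', 'cp1252'):
    try:
        # Record how many characters were decoded.
        outcomes[encoding] = len(all_bytes.decode(encoding))
    except UnicodeDecodeError as exc:
        # Record the first byte the codec could not decode.
        outcomes[encoding] = f'fails at byte 0x{all_bytes[exc.start]:02x}'

print(outcomes)
```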

In [24]:
# These bytes are the characters for “Montréal” encoded as latin1; '\xe9' is the byte for “é”.
octets = b'Montr\xe9al'

# Decoding with 'cp1252' (Windows 1252) works because it maps '\xe9' to “é”, just as latin1 does.
octets.decode('cp1252')

'Montréal'

In [26]:
# ISO-8859-7 is intended for Greek, so the '\xe9' byte is misinterpreted, and no error is issued.
octets.decode('iso8859_7')

'Montrιal'

In [27]:
# KOI8-R is for Russian. Now '\xe9' stands for the Cyrillic letter “И”.
octets.decode('koi8_r')

'MontrИal'

In [28]:
# The 'utf_8' codec detects that octets is not valid UTF-8, and raises UnicodeDecodeError.
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [29]:
# Using 'replace' error handling, the \xe9 is replaced by “�” (code point U+FFFD),
# the official Unicode REPLACEMENT CHARACTER intended to represent unknown characters.
octets.decode('utf_8', errors='replace')

'Montr�al'
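
Because 'utf_8' rejects invalid input while most 8-bit codecs do not, a common heuristic is to try the strict codec first and fall back to a permissive one only on failure. The helper below is a hypothetical sketch of that idea, not a reliable encoding detector. `decode_with_fallback` and its default codec list are assumptions made for this example, and the fallback can still yield wrong characters if the bytes were written in a different 8-bit encoding.

```python
def decode_with_fallback(data, encodings=('utf_8', 'cp1252')):
    """Try each encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first codec, marking bad bytes as U+FFFD.
    return data.decode(encodings[0], errors='replace'), encodings[0]

octets = b'Montr\xe9al'
text, used = decode_with_fallback(octets)
print(text, used)
```

Here the UTF-8 attempt fails on `\xe9`, so the bytes are decoded as 'cp1252'.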