# Character representation schemes

# ASCII
- each character is stored in one byte, but only 7 bits (values 0-127) are used
- covers English characters and control characters (NL, CR, TAB, ...)
- to see the ASCII table, run this in a shell window on Mac or Linux:
  - man ascii
- the unused eighth bit left room for extensions
- ISO Latin 1 (ISO 8859-1) added 96 characters
- there are other, mutually incompatible extensions
- problem: one byte is not enough to represent all the characters in the world
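
The points above can be checked with a short sketch that uses only built-in functions. The sample characters (e with acute accent, euro sign) are chosen for illustration and do not come from the notes.

```python
# ASCII code points fit in 7 bits (0-127)
ascii_codes = [ord(c) for c in 'A\n\t']
print(ascii_codes)                  # [65, 10, 9]

# one byte per character
ascii_bytes = 'hello'.encode('ascii')
print(len(ascii_bytes))             # 5

# '\xe9' (e with acute accent) is not ASCII,
# but ISO Latin 1 maps it to a single byte above 127
latin1_bytes = '\xe9'.encode('latin-1')
print(latin1_bytes)                 # b'\xe9'

# '\u20ac' (euro sign) is outside Latin 1, so one byte is not enough
try:
    '\u20ac'.encode('latin-1')
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
print(euro_fits)                    # False
```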

# type 'bytes' represents a sequence of 8 bit bytes
- immutable


In [33]:
# leading b' means bytes
# last two elements written in hex

b = b'foobar\x12\xfb'

[b, len(b), b[3], b[-1], type(b)]

[b'foobar\x12\xfb', 8, 98, 251, bytes]

In [34]:
# 'bytes' is immutable, just like 'str'

b[3] = 33

TypeError: 'bytes' object does not support item assignment

# type 'bytearray'
- mutable version of 'bytes'

In [35]:
ba = bytearray(b)
[ba, len(ba), ba[-1], type(ba)]

[bytearray(b'foobar\x12\xfb'), 8, 251, bytearray]

In [36]:
ba[0] = ord('F')
ba

bytearray(b'Foobar\x12\xfb')

In [37]:
# stores ints, NOT characters

[ba[0], type(ba[0])]

[70, int]

In [38]:
ba[0] = 255

In [39]:
# in fact, only stores a subset of ints, 0-255

ba[0] = 256

ValueError: byte must be in range(0, 256)


# Unicode
- "universal character set"
- has room for over a million code points (1,114,112), of which only a fraction are assigned so far
- aims to cover the writing systems of every language on earth
    - somebody tried to add Klingon, but it was rejected
- each character is represented by a unique integer (its code point)
- 'encoding' converts a unicode string into a byte array or stream (in some encoding)
- 'decoding' converts a byte stream (in some encoding) into a unicode string
- there are several different encodings
- Java uses utf-16
- web pages often use utf-8
- utf-8 has the special property that if the unicode string contains only ASCII characters, its utf-8 encoding is identical to its ASCII encoding
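
A minimal sketch of code points, encoding, decoding, and the ASCII compatibility of utf-8. The variable names are illustrative only.

```python
# each character has a unique integer code point
umbrella = '\u2602'
code_point = ord(umbrella)
print(code_point, hex(code_point))    # 9730 0x2602
print(chr(code_point) == umbrella)    # True

# encoding: str -> bytes, decoding: bytes -> str
encoded = umbrella.encode('utf-8')
decoded = encoded.decode('utf-8')
print(encoded)                        # b'\xe2\x98\x82'
print(decoded == umbrella)            # True

# for pure ASCII text, utf-8 and ascii produce identical bytes
text = 'python'
same = text.encode('utf-8') == text.encode('ascii')
print(same)                           # True
```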


In [None]:
# 'python' spelled with look-alike characters from different unicode blocks

uni = '\u2119\u01b4\u2602\u210c\xf8\u1f24'
[type(uni), uni]

In [41]:
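import csv

# csv.writer expects text; newline='' lets the csv module control line endings
f = open('/tmp/foo', 'w', newline='')
w = csv.writer(f)
w.writerow(['asdf', 'zxcv'])
w.writerow([uni, uni])
# bytes values are not written raw: they are converted with str(), giving b'\xff' as text
n = w.writerow([b'\xff', b'\xff'])
f.close()
n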

17

In [42]:
# len says 6, which is the number of characters, not the number of bytes it takes to represent them
len(uni)

6

In [43]:
utf8, utf16, utf32 = [uni.encode(et) for et in ['utf-8', 'utf-16', 'utf-32']]

In [44]:
# length of byte encoding varies with different encodings

[[len(u), type(u)] for u in [utf8, utf16, utf32]]

[[16, bytes], [14, bytes], [28, bytes]]

In [45]:
# utf8 is type 'bytes', not a str. 
# note b' prefix

[type(uni), type(utf8), utf8, utf16, utf32]

[str,
 bytes,
 b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4',
 b'\xff\xfe\x19!\xb4\x01\x02&\x0c!\xf8\x00$\x1f',
 b'\xff\xfe\x00\x00\x19!\x00\x00\xb4\x01\x00\x00\x02&\x00\x00\x0c!\x00\x00\xf8\x00\x00\x00$\x1f\x00\x00']
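
The lengths 16, 14, and 28 can be accounted for byte by byte. The sketch below assumes the same `uni` string as above. The explicit `-le` codec names are used only to show the encodings without a byte order mark (BOM).

```python
import codecs

uni = '\u2119\u01b4\u2602\u210c\xf8\u1f24'

utf16 = uni.encode('utf-16')
utf32 = uni.encode('utf-32')

# the generic 'utf-16' and 'utf-32' codecs prepend a BOM in native byte order
has_bom16 = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
has_bom32 = utf32[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
print(has_bom16, has_bom32)           # True True

# naming the byte order explicitly leaves the BOM out
utf16_le = uni.encode('utf-16-le')
utf32_le = uni.encode('utf-32-le')
print(len(utf16), len(utf16_le))      # 14 12  (2-byte BOM + 6 * 2 bytes)
print(len(utf32), len(utf32_le))      # 28 24  (4-byte BOM + 6 * 4 bytes)

# utf-8 is variable width: each of these characters takes 2 or 3 bytes
widths = [len(ch.encode('utf-8')) for ch in uni]
print(widths, sum(widths))            # [3, 2, 3, 3, 2, 3] 16
```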

In [None]:
# decode converts bytes into unicode string

utf32.decode('utf-32')

In [None]:
utf8.decode('utf-8')

In [48]:
# to decode, you must know which encoding was used (it acts like a key)
# selecting the wrong decoder doesn't always raise an error
# sometimes you will just get a bogus string; here we do get an error

utf32.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
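
The cell above shows the case where the wrong decoder raises an error. The sketch below shows the silent case, using `latin-1` as the wrong decoder (chosen because it accepts every byte value), and the `errors='replace'` option of `decode`, which swaps undecodable bytes for U+FFFD instead of raising.

```python
uni = '\u2119\u01b4\u2602\u210c\xf8\u1f24'
utf8 = uni.encode('utf-8')

# latin-1 maps every byte to some character, so this never fails,
# but the result is 16 unrelated characters instead of the original 6
bogus = utf8.decode('latin-1')
print(len(bogus))            # 16
print(bogus == uni)          # False

# errors='replace' avoids the exception but still loses the text
partial = uni.encode('utf-32').decode('utf-8', errors='replace')
print('\ufffd' in partial)   # True
```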

# Different I/O types
- 'bytes' (binary mode): add the 'b' flag to 'open'
- 'str' (unicode, text mode): the default
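
A minimal sketch of the two modes, reading the same file both ways. It writes to a temporary directory rather than `/tmp`, and the file name `demo.txt` is made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# text mode: write a str, which 'open' encodes for us
with open(path, 'w', encoding='utf-8') as f:
    f.write('\xf8')

# binary mode: read returns the raw bytes stored in the file
with open(path, 'rb') as f:
    raw = f.read()

# text mode: read decodes the bytes back into a str
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(type(raw), raw)        # <class 'bytes'> b'\xc3\xb8'
print(type(text), len(text)) # <class 'str'> 1
```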

In [49]:
# won't work - file stream expects a 'str' by default, but utf8 is type 'bytes'
unipath = "/tmp/python.uni"

with open(unipath, "w") as f:
    f.write(utf8)


TypeError: write() argument must be str, not bytes

In [50]:
# make a binary stream by adding 'b' flag to 'open'

with open(unipath, 'bw') as f:
    f.write(utf32)

In [51]:
# text mode uses the platform default encoding (utf-8 here), but the file we wrote is utf-32
# so the read fails

with open(unipath, "r") as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

In [None]:
# tell 'open' the right unicode encoding

with open(unipath, "r", encoding='utf-32') as f:
    print(f.read())

In [53]:
# can read file bytes

with open("/tmp/python.uni", "rb") as f:
    b = f.read()
b

b'\xff\xfe\x00\x00\x19!\x00\x00\xb4\x01\x00\x00\x02&\x00\x00\x0c!\x00\x00\xf8\x00\x00\x00$\x1f\x00\x00'

In [54]:
utf32

b'\xff\xfe\x00\x00\x19!\x00\x00\xb4\x01\x00\x00\x02&\x00\x00\x0c!\x00\x00\xf8\x00\x00\x00$\x1f\x00\x00'

# Python 3 source code
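- defaults to utf-8
- can change with a comment on the first or second line of the file:
    
    # -*- coding: latin-1 -*-

- the declared encoding must be ASCII-compatible, so utf-16 and utf-32 cannot be used for source files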

# ASCII vs Unicode
- ASCII is easy, because storage media and networks handle bytes, and ASCII is just bytes
- no byte order issues (big/little endian)
- Unicode is harder, because
    - when writing to the network or storage, the unicode string must be ENCODED into a byte stream, in some format like utf-8, utf-16, etc
    - when reading from the network or storage, the byte stream must be DECODED into a unicode string, and the encoding that was used must somehow be provided
- since Python strings are unicode, you are always
    - encoding as strings leave your program
    - decoding as strings enter your program
- [standard text encoders](https://docs.python.org/3/library/codecs.html#standard-encodings)
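
The encode-on-exit, decode-on-entry pattern can be sketched as below. The helpers `send` and `receive` are hypothetical names invented for this example, and an `io.BytesIO` object stands in for a socket or file. The last lines illustrate the byte order issue that ASCII avoids.

```python
import io

def send(text, stream, encoding='utf-8'):
    # hypothetical helper: encode as the string leaves the program
    return stream.write(text.encode(encoding))

def receive(stream, encoding='utf-8'):
    # hypothetical helper: decode as the bytes enter the program
    return stream.read().decode(encoding)

message = '\u2119\u01b4\u2602\u210c\xf8\u1f24'
wire = io.BytesIO()          # stand-in for a network connection or file
sent = send(message, wire)
wire.seek(0)
received = receive(wire)
print(sent)                  # 16
print(received == message)   # True

# byte order matters for multi-byte encodings: same character, bytes swapped
little = '\u2602'.encode('utf-16-le')
big = '\u2602'.encode('utf-16-be')
print(little, big)           # b'\x02&' b'&\x02'
```

Both sides must agree on the encoding. Here both helpers default to utf-8.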