### Unicode notes (Python 2)

Computers store bytes.  People read text.  There needs to be a way of decoding bytes into text, and encoding text into bytes.

What we need is a mapping between numbers and characters.  This is what character encodings do.  So ASCII maps values in the range 0-127 to a lookup table of characters.  With ASCII, a text file is just a sequence of these bytes.
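As a small illustration of the lookup-table idea, the built-in `ord` and `chr` functions convert between a character and its number. This is only a sketch; the variable names are made up for the example, and it is written so that it should run unchanged on Python 2.7 and Python 3.

```python
# ord() gives the number for a character; chr() goes the other way.
for ch in "Az~":
    print("{!r} -> {}".format(ch, ord(ch)))

code_for_A = ord("A")    # 65
char_for_65 = chr(65)    # 'A'

# Every character of an ASCII-only string has a value below 128.
all_ascii = all(ord(c) < 128 for c in "hello")
print(all_ascii)
```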


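Unicode maps values in the range 0 to 1,114,111 (hex 10FFFF) to a lookup table of characters.  The first 65,536 = $2^{16}$ of these values form the Basic Multilingual Plane, which covers most characters in everyday use.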

The 'numbers' in these lookup tables are called 'codepoints'.  It would be nice and simple if the codepoints mapped directly onto bytes, so that codepoint 1 was \x01 etc. This is how it works for ASCII.

However, this would be inefficient for Unicode, where codepoint 1 would map to \u0001, which takes two bytes.  Most text is in the ASCII range and so needs only 1 byte per character, so storing every character in two or more bytes would waste space.

So with Unicode there's a level of abstraction between codepoints and how they are encoded into bytes.  Among other things, an encoding such as UTF-8 keeps the data compact by using a variable number of bytes per codepoint.  We don't really need to worry about this too much, except that we need a method of decoding the encoded bytes into codepoints, and encoding the codepoints into bytes.
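To make the variable-length point concrete, here is a minimal sketch that encodes three characters and counts the resulting bytes. The sample characters and variable names are chosen for illustration, and the literals use escapes so the snippet should run on both Python 2.7 and Python 3.

```python
# Three codepoints: 'A' (ASCII), e-acute (U+00E9) and the pinwheel asterisk (U+2743).
samples = [u"A", u"\xe9", u"\u2743"]

# UTF-8 uses more bytes as the codepoint gets larger.
utf8_lengths = [len(s.encode("utf-8")) for s in samples]

# UTF-16 (big-endian, no byte order mark) uses two bytes for each of these.
utf16_lengths = [len(s.encode("utf-16-be")) for s in samples]

print(utf8_lengths)   # [1, 2, 3]
print(utf16_lengths)  # [2, 2, 2]
```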

In all of this there is something really confusing about Python 2:  'str' as a datatype is really 'bytes'.  So if we load a .png in 'rb' mode, and print it to the console, it will print as a str, with bytes in the range 0-127 printed as ascii characters.  It's of type str, and it looks like a string, but it's really just bytes.
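One way to see this for yourself is to check whether `bytes` and `str` are the same type. The sketch below uses the 8-byte PNG signature as stand-in data rather than a real file; it is written to run on either Python version, and the result of the comparison differs between them.

```python
import sys

# The first 8 bytes of any PNG file, standing in for data read in 'rb' mode.
data = b"\x89PNG\r\n\x1a\n"

print(type(data))    # <type 'str'> on Python 2, <class 'bytes'> on Python 3
print(bytes is str)  # True on Python 2, False on Python 3

is_py2 = sys.version_info[0] == 2
```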

Let's start exploring how this works in practice.

In [99]:
import unicodedata
uni =  u'\u2743'
print uni
print unicodedata.name(uni)
print "This is codepoint {}".format(int("2743", 16))
print "This unicode character happens to need 3 bytes when encoded, many need either one or two"
uni.encode("utf-8")

❃
HEAVY TEARDROP-SPOKED PINWHEEL ASTERISK
This is codepoint 10051
This unicode character happens to need 3 bytes when encoded, many need either one or two


'\xe2\x9d\x83'
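Decoding is the inverse of encoding, so the three bytes above can be turned back into the original codepoint. A minimal round-trip sketch (variable names are illustrative; it should run on Python 2.7 and Python 3):

```python
pinwheel = u"\u2743"

encoded = pinwheel.encode("utf-8")   # text -> bytes
decoded = encoded.decode("utf-8")    # bytes -> text

# One codepoint, three bytes.
print("{} codepoint, {} bytes".format(len(pinwheel), len(encoded)))
print(decoded == pinwheel)
```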

In [70]:
# A Python 2 str is really just a sequence of bytes (1 byte = 8 bits = 256 combinations = two hex characters)
print "hello".encode("hex") # 'Encode' normally turns text into bytes.  This is a bit confusing here, because the 'hex' codec maps bytes to bytes

# A slightly nicer way of looking at it 
print ''.join( [ "%02X " % ord( x ) for x in "hello" ] ).strip()

# The escaped bytes
print ''.join(map(lambda c:'\\x%02x'% ord(c), "hello"))

68656c6c6f
68 65 6C 6C 6F
\x68\x65\x6c\x6c\x6f


In [65]:
"\x68\x65\x6c\x6c\x6f"

'hello'

In [53]:
print "68".decode('hex')
print "65".decode('hex')

h
e
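The `"hex"` codec on `str` only exists in Python 2. If you want the same conversion in a form that also runs on Python 3, the standard library's `binascii` module is one option. This is a sketch of an alternative, not what the cells above use:

```python
import binascii

raw = b"hello"

hex_text = binascii.hexlify(raw)      # bytes -> hex digits
back = binascii.unhexlify(hex_text)   # hex digits -> bytes

print(hex_text)
print(back)
```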


In [76]:
f = open("myfile.txt", 'wb')
f.write("hello")
f.close()

# myfile.txt is a 5 byte file.  If you open it as hex, you will see 6865 6c6c 6f

f = open("myfile.txt", 'rb')
f.read() # Note that this is saying the file consists of the bytes 'hello' when the bytes are rendered as ASCII, which is the default.
# It doesn't make sense to ask 'why can't I read the file as bytes?'.  You did: that's what str is.  It's just that the default
# rendering of these bytes is to use ASCII

'hello'
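If you want text rather than bytes back from a file, the decoding step has to happen somewhere. One way is `io.open` with an `encoding` argument, which is available in Python 2.6+ and is the same function as the built-in `open` in Python 3. The sketch below writes to a scratch file in a temporary directory (the file name is hypothetical) and reads it back both ways.

```python
import io
import os
import tempfile

# Hypothetical scratch file, created in a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "cafe.txt")

with io.open(path, "wb") as f:
    f.write(u"caf\xe9".encode("utf-8"))

# Binary mode: we get the raw bytes, exactly as stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding: the bytes are decoded for us.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

print("{} bytes, {} characters".format(len(raw), len(text)))

os.remove(path)
os.rmdir(tmp_dir)
```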

How about something that isn't valid ASCII or UTF-8?

In [103]:
print '\x80' #This is 128, i.e. one above the ascii range

�



In [104]:
# We cannot interpret this byte as utf-8 encoded unicode
'\x80'.decode("utf-8")

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

In [105]:
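# We cannot encode this into ascii bytes either.  A Python 2 str is already bytes, so .encode first
# implicitly decodes it as ascii, which is why the error below is a UnicodeDecodeError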
'\x80'.encode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

In [107]:
# Nor can we turn the byte into a unicode string, telling it to decode expecting ascii bytes
'\x80'.decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

In [113]:
# Note we can do this, and get a unicode string back
'\x70'.decode("ascii")

u'p'
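When strict decoding fails like this, the `decode` method also accepts an error handler as its second argument. The sketch below compares the default strict behaviour with the standard `"replace"` and `"ignore"` handlers; the sample bytes are made up for the example.

```python
bad = b"\x80abc"  # one byte outside the ASCII range, then three ASCII bytes

# "replace" substitutes U+FFFD (the replacement character) for each bad byte.
replaced = bad.decode("ascii", "replace")

# "ignore" silently drops the bad byte.
ignored = bad.decode("ascii", "ignore")

# The default ("strict") raises instead.
try:
    bad.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced))
print(repr(ignored))
print(strict_failed)
```

Both handlers lose information, so they are best kept for display or logging rather than for data you need to round-trip.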

In [114]:
# Here are some bytes that use the Windows-1252 encoding.  These bytes cannot be decoded as ascii or as utf-8
u'café'.encode("windows-1252")

'caf\xe9'

In [115]:
# This wouldn't have worked if we'd had a character which was not in windows-1252 codepoints:
u'caf❃'.encode("windows-1252")

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2743' in position 3: character maps to <undefined>

In [116]:
# This won't work because it's not a valid utf-8 encoded unicode string
'caf\xe9'.decode("utf-8")

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 3: unexpected end of data

In [117]:
# But we can get the unicode back by properly decoding
'caf\xe9'.decode("windows-1252")

u'caf\xe9'
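Putting the two steps together gives the usual recipe for converting bytes from one encoding to another: decode with the source encoding, then encode with the target. A minimal sketch, assuming the input really is Windows-1252:

```python
legacy = b"caf\xe9"                    # Windows-1252 bytes

text = legacy.decode("windows-1252")   # bytes -> unicode codepoints
utf8 = text.encode("utf-8")            # codepoints -> utf-8 bytes

print(repr(utf8))
```

The e-acute takes one byte in Windows-1252 and two bytes in UTF-8, so the result is one byte longer than the input.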

Now let's create a table of codepoints

In [120]:
import pandas as pd
table = []
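# Note: for i > 255, hex2 no longer describes a single byte (e.g. \x100 is the byte \x10 followed by the character '0')
for i in xrange(2048):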
    d = {"hex2": r"\x" + str(hex(i)[2:].zfill(2)), "hex4": r'\u'+str(hex(i)[2:].zfill(4))}
    table.append(d)
df = pd.DataFrame(table)
df.head()

Unnamed: 0,hex2,hex4
0,\x00,\u0000
1,\x01,\u0001
2,\x02,\u0002
3,\x03,\u0003
4,\x04,\u0004


In [127]:
import codecs
df["python_str_bytes"] = df["hex2"].apply(lambda x: codecs.escape_decode(x)[0])
df[65:70]

Unnamed: 0,hex2,hex4,python_str_bytes
65,\x41,\u0041,A
66,\x42,\u0042,B
67,\x43,\u0043,C
68,\x44,\u0044,D
69,\x45,\u0045,E


In [129]:
df["ascii"] = df["python_str_bytes"].apply(lambda x: x.decode("ascii", errors="ignore"))

In [134]:
# Note: latin-1 assigns a character to every byte in the range 0-255, so this decode never actually fails.
# By contrast, windows-1252 leaves some bytes unused, such as 129 (corresponding to hex 0x81 or \x81),
# and decoding one of those will throw an error like this:
# UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
def d_latin_1(x):
    try:
        return x.decode("latin-1")
    except UnicodeDecodeError:
        return "not used"

df["latin_1"] = df["python_str_bytes"].apply(d_latin_1)

In [136]:
def d_1252(x):
    try:
        return x.decode("windows-1252")
    except UnicodeDecodeError:
        # This byte has no character assigned in windows-1252
        return "not used"

df["windows-1252"] = df["python_str_bytes"].apply(d_1252)
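To see which single bytes are affected without building the whole table, you can simply try to decode all 256 of them. This sketch relies on Python's built-in `windows-1252` and `latin-1` codecs; the helper name `undecodable` is made up for the example.

```python
def undecodable(encoding):
    """Return the byte values (0-255) that the given codec refuses to decode."""
    failures = []
    for i in range(256):
        byte = bytes(bytearray([i]))  # a single byte, on Python 2 and 3 alike
        try:
            byte.decode(encoding)
        except UnicodeDecodeError:
            failures.append(i)
    return failures

undefined_1252 = undecodable("windows-1252")
undefined_latin1 = undecodable("latin-1")

print([hex(i) for i in undefined_1252])
print(undefined_latin1)  # [] because latin-1 maps every byte
```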

In [140]:
df.loc[[0,1,2,65,66,67,126,127,128,129,254,255,256,257]]

Unnamed: 0,hex2,hex4,python_str_bytes,ascii,windows-1252,latin_1
0,\x00,\u0000,�,�,�,�
1,\x01,\u0001,,,,
2,\x02,\u0002,,,,
65,\x41,\u0041,A,A,A,A
66,\x42,\u0042,B,B,B,B
67,\x43,\u0043,C,C,C,C
126,\x7e,\u007e,~,~,~,~
127,\x7f,\u007f,,,,
128,\x80,\u0080,�,,€,
129,\x81,\u0081,�,,not used,


In [150]:
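def d_uni(x):
    try:
        return x.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid utf-8 bytes"
# Treat the bytes built from hex2 as utf-8 and try to decode them
df["utf-8_from_hex2"] = df["python_str_bytes"].apply(d_uni)


# Despite the column name, this holds unicode objects (decoded codepoints), not utf-8 bytes
df["utf-8_from_hex4"] = df["hex4"].apply(lambda x: x.decode("unicode-escape"))
# Encode each codepoint to utf-8 and keep the repr so the bytes are visible in the table
df["utf-8_encoded_back_to_bytes"] = df["utf-8_from_hex4"].apply(lambda x: repr(x.encode("utf-8")))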

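import unicodedata
def get_name(x):
    try:
        return unicodedata.name(x)
    except ValueError:
        # unicodedata.name raises ValueError for characters with no name, e.g. control characters
        return "no name found"
df["utf-8_desc"] = df["utf-8_from_hex4"].apply(get_name)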

df.loc[[0,1,2,65,66,67,126,127,128,129,254,255,256,257,1000,1200,1400]]

Unnamed: 0,hex2,hex4,python_str_bytes,ascii,windows-1252,latin_1,utf-8_from_hex2,utf-8_from_hex4,utf-8_encoded_back_to_bytes,utf-8_desc
0,\x00,\u0000,�,�,�,�,�,�,'\x00',no name found
1,\x01,\u0001,,,,,,,'\x01',no name found
2,\x02,\u0002,,,,,,,'\x02',no name found
65,\x41,\u0041,A,A,A,A,A,A,'A',LATIN CAPITAL LETTER A
66,\x42,\u0042,B,B,B,B,B,B,'B',LATIN CAPITAL LETTER B
67,\x43,\u0043,C,C,C,C,C,C,'C',LATIN CAPITAL LETTER C
126,\x7e,\u007e,~,~,~,~,~,~,'~',TILDE
127,\x7f,\u007f,,,,,,,'\x7f',no name found
128,\x80,\u0080,�,,€,,invalid utf-8 bytes,,'\xc2\x80',no name found
129,\x81,\u0081,�,,not used,,invalid utf-8 bytes,,'\xc2\x81',no name found
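As an aside, `unicodedata.name` takes an optional second argument that is returned when a character has no name, which avoids the try/except. A short sketch of that variant (the sample characters are chosen to match rows in the table above):

```python
import unicodedata

chars = [u"A", u"~", u"\x00", u"\x7f"]

# The second argument is the default returned for unnamed characters,
# such as the control characters \x00 and \x7f.
names = [unicodedata.name(ch, "no name found") for ch in chars]

for ch, name in zip(chars, names):
    print("{!r}: {}".format(ch, name))
```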


Note that the first 256 codepoints of Unicode correspond to the 256 codepoints of ISO-8859-1 (also called “latin-1”).  
But that doesn't mean the encodings match.  So the single byte \x81 is NOT valid UTF-8, but it is valid latin-1.  The table shows that codepoint 129 has to be encoded as the two bytes '\xc2\x81' in UTF-8.
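The difference between matching codepoints and matching encodings can be checked directly. The sketch below (variable names are illustrative) confirms that every latin-1 byte decodes to the codepoint with the same number, and then encodes codepoint 129 both ways.

```python
# Every latin-1 byte decodes to the codepoint with the same numeric value.
codepoints_match = all(
    ord(bytes(bytearray([i])).decode("latin-1")) == i for i in range(256)
)

# The same codepoint, encoded two different ways.
as_latin1 = u"\x81".encode("latin-1")  # one byte
as_utf8 = u"\x81".encode("utf-8")      # two bytes

# The lone byte \x81 is not a complete utf-8 sequence.
try:
    b"\x81".decode("utf-8")
    lone_byte_is_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_utf8 = False

print(codepoints_match)
print("{} byte in latin-1, {} bytes in utf-8".format(len(as_latin1), len(as_utf8)))
print(lone_byte_is_utf8)
```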

In [151]:
pd.set_option('display.max_rows', 10000)
df

Unnamed: 0,hex2,hex4,python_str_bytes,ascii,windows-1252,latin_1,utf-8_from_hex2,utf-8_from_hex4,utf-8_encoded_back_to_bytes,utf-8_desc
0,\x00,\u0000,�,�,�,�,�,�,'\x00',no name found
1,\x01,\u0001,,,,,,,'\x01',no name found
2,\x02,\u0002,,,,,,,'\x02',no name found
3,\x03,\u0003,,,,,,,'\x03',no name found
4,\x04,\u0004,,,,,,,'\x04',no name found
5,\x05,\u0005,,,,,,,'\x05',no name found
6,\x06,\u0006,,,,,,,'\x06',no name found
7,\x07,\u0007,,,,,,,'\x07',no name found
8,\x08,\u0008,,,,,,,'\x08',no name found
9,\x09,\u0009,\t,\t,\t,\t,\t,\t,'\t',no name found
