# Character Encoding

Character encoding can be complicated, but it's necessary to understand hexadecimal, binary, ASCII, and Unicode.

Most computers and smartphones today use UTF-8.  That lets them display Chinese, Arabic, Greek, and every other language.  Before that we just had ASCII, which covers only 128 characters.

ASCII characters fall into two categories:  displayable characters and non-displayable (control) characters, like line feeds.
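
As a quick check of that split, the sketch below sorts all 128 ASCII codes with Python's built-in `str.isprintable()`.  The variable names `printable` and `control` are only illustrative.

```python
# Sort the 128 ASCII codes into displayable and control codes.
printable = [i for i in range(128) if chr(i).isprintable()]
control = [i for i in range(128) if not chr(i).isprintable()]

print(len(printable), 'displayable codes, from', printable[0], 'to', printable[-1])
print(len(control), 'control codes, for example', repr(chr(10)), 'and', repr(chr(9)))
```

Python counts the space (code 32) as displayable and the delete character (code 127) as a control code.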

These are the displayable ASCII characters.  They are the digits, the letters a-z and A-Z, and punctuation such as the question mark.

Here is an ASCII table:  https://theasciicode.com.ar/

In [12]:


for i in range(32, 127):  # range stops before 127, so the last code printed is 126 (~)
    print(chr(i),end=",")

 ,!,",#,$,%,&,',(,),*,+,,,-,.,/,0,1,2,3,4,5,6,7,8,9,:,;,<,=,>,?,@,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,[,\,],^,_,`,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,{,|,},~,

The letter A, for example, has ASCII code 65.  You can turn that code into its character using the chr function.

In [13]:

print('chr(65)=', chr(65))



chr(65)= A


Here we use the ord function to print the ASCII code assigned to the letter A.

In [14]:
print("ord('A')",ord('A'))

ord('A') 65
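
Because `chr` and `ord` are inverses, you can do arithmetic on the codes.  The sketch below is only an illustration, and the variable names are made up for it.

```python
code = ord('A')                          # 65
next_letter = chr(code + 1)              # the character one code after A
case_offset = ord('a') - ord('A')        # distance between lower and upper case
lowered = chr(ord('G') + case_offset)    # convert G to lower case by hand

# Build the upper-case alphabet from its codes.
alphabet = ''.join(chr(c) for c in range(ord('A'), ord('Z') + 1))

print(code, next_letter, case_offset, lowered)
print(alphabet)
```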


# Special characters

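Some ASCII codes have special functions, like tab or new line.  On old computers one of them rang a bell.  Emojis are special in a different way:  they are not part of ASCII and need Unicode, which we cover below.

The \n is a line feed, which moves the text to the next line.  The \r is a carriage return.  Windows ends each line with the pair \r\n.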
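
The two line-ending characters are ordinary ASCII codes, 10 and 13.  The sketch below prints them and shows that `str.splitlines()` accepts both the `\n` and the `\r\n` convention.  The sample strings are invented for the example.

```python
line_feed = ord('\n')         # 10
carriage_return = ord('\r')   # 13

windows_text = "first\r\nsecond"   # Windows-style line ending
unix_text = "first\nsecond"        # Unix-style line ending

# splitlines() understands both conventions.
windows_lines = windows_text.splitlines()
unix_lines = unix_text.splitlines()

print(line_feed, carriage_return)
print(windows_lines == unix_lines)
```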

You put a backslash in front of the n to indicate that this is a special character.  That's called escaping the character.  So the \ is the escape character.

In [15]:

    
print('\t indent me') 

print('\n here put a blank line before me and after me\n')

 

print('The tab \\t has ASCII code=',ord('\t'))


	 indent me

 here put a blank line before me and after me

The tab \t has ASCII code= 9
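
To see what escaping does, compare the length of an escaped sequence with the length of the same text when the backslash is itself escaped.  This is a small sketch.  The raw-string form `r'...'` is not used elsewhere in this notebook and is shown only for comparison.

```python
newline = '\n'    # one character: the line feed
escaped = '\\n'   # two characters: a backslash followed by the letter n
raw = r'\n'       # raw string: backslashes are kept as they are

print(len(newline), len(escaped), len(raw))
print(escaped == raw)
print(repr(newline))
```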


# Unicode and UTF-8

Unicode is a standard that assigns a number, called a code point, to every character.  UTF-8 is one way of encoding those code points as bytes.  Together they cover all alphabets and emojis.
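
The difference between a code point and its encoding shows up when you encode the same letter in several ways.  In the sketch below the code point stays the same while the bytes change.  The UTF-16 and UTF-32 encodings are included only for comparison.

```python
letter = 'λ'
code_point = ord(letter)             # the Unicode number, the same in every encoding

utf8 = letter.encode('utf-8')
utf16 = letter.encode('utf-16-be')   # big-endian, without a byte order mark
utf32 = letter.encode('utf-32-be')

print(hex(code_point))
print(utf8, utf16, utf32)
```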

You put a **\u** in front of the code point to write a Unicode character in a Python string.  We write this number as four hexadecimal digits.  We explain the hexadecimal numbering system [here](https://github.com/werowe/HypatiaAcademy/blob/master/bitsAndBytes.ipynb).

Here we print the Greek letter lambda using its Unicode value:
 

In [16]:

print('\u03BB')


λ
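
The \u escape is just another way to write the number you would pass to `chr`.  The sketch below checks that.  It also shows the eight-digit \U form, which Python requires for code points above FFFF such as emojis.

```python
lam = '\u03BB'
same_from_chr = chr(0x03BB)   # chr accepts any Unicode code point, not only ASCII

print(lam == same_from_chr)
print(ord(lam), hex(ord(lam)))

# Code points above FFFF need the eight-digit \U form.
emoji = '\U0001F629'
print(emoji)
```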


Here we encode the letter ü to show its UTF-8 bytes in hexadecimal format.  The b prefix means the result is a bytes object rather than a text string.  Since there are two \x escapes, there are two bytes.

In [17]:
'ü'.encode('utf-8')

b'\xc3\xbc'
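
A bytes object is a sequence of integers from 0 to 255.  The sketch below looks at the same two bytes in a few different ways.  The names `text` and `data` are illustrative.

```python
text = 'ü'
data = text.encode('utf-8')

print(type(data).__name__)    # bytes
print(len(text), len(data))   # one character, two bytes
print(list(data))             # the byte values as integers
print(data.hex())             # the same bytes as a hex string
```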

Below we convert the hex bytes of ü back to the displayed character.

\x means the next two digits are a hexadecimal number.

In [18]:


b'\xc3\xbc'.decode('utf-8')
 

'ü'

In [19]:
b'\xc3\x87'.decode('utf-8')

'Ç'

Here is an emoji.

In [20]:
'😩'.encode('utf-8')

b'\xf0\x9f\x98\xa9'

Notice that this is 4 bytes.

In [21]:
b'\xf0\x9f\x98\xa9'.decode('utf-8')

'😩'
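
UTF-8 uses between one and four bytes per character.  The sketch below counts the bytes for a few sample characters.  The list of samples is arbitrary.

```python
samples = ['A', 'ü', 'λ', '€', '😩']

# Map each character to the number of bytes UTF-8 needs for it.
byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in byte_lengths.items():
    print(ch, n)
```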

# Print the Greek alphabet

Here we write code to print the Greek alphabet.

Greek letters are two-byte characters in UTF-8.  Each one takes two bytes, while an unaccented Latin letter takes one.

You can see the UTF-8 code for each of the letters in the Greek alphabet [here](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=896&number=128).

The hexadecimal UTF-8 values for the Greek letters range from CE91 to CF8E.  However, the values from CEC0 to CF7F in the middle are not valid UTF-8, so there is a gap.  In the code below we handle that by catching the error Python raises when it tries to convert those values to displayable characters.
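
The gap comes from the way UTF-8 builds a two-byte character.  The first byte has the bit pattern 110xxxxx and the second has 10xxxxxx, so the second byte is always between 80 and BF.  After CEBF the next valid value is therefore CF80.  The sketch below encodes a code point by hand to show this.  The function `encode_two_byte` is a made-up helper, not part of Python.

```python
def encode_two_byte(code_point):
    """Hand-encode a code point in the range U+0080..U+07FF as UTF-8."""
    first = 0b11000000 | (code_point >> 6)           # 110xxxxx holds the top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx holds the low 6 bits
    return bytes([first, second])

alpha = encode_two_byte(0x0391)     # Α, the first letter
omicron = encode_two_byte(0x03BF)   # ο, the last value before the gap
pi = encode_two_byte(0x03C0)        # π, the first value after the gap

print(alpha.hex(), omicron.hex(), pi.hex())
```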
 

We print these by looping over every value in the range CE91 to CF8E.
We first convert these two numbers to integers.

This means:  convert the hex number 0xce91, which is in base 16 (hex) format, to an integer:

In [22]:
fr=int("0xce91", 16)
to=int("0xcf8e",16)
print(fr,to)

52881 53134
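
There are several equivalent ways to get the same integer from a hex value, and `hex` converts it back.  This is a short sketch with illustrative variable names.

```python
from_string = int("0xce91", 16)    # the 0x prefix is allowed when the base is 16
without_prefix = int("ce91", 16)   # the prefix is optional
literal = 0xce91                   # a hex literal written directly in the code
back_to_hex = hex(from_string)     # integer back to a hex string

print(from_string, without_prefix, literal, back_to_hex)
```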


We loop from the beginning of the range to the end.  `range` stops just before its second argument, so we pass `to + 1` to include the last value.

```python
for i in range(fr, to + 1):
```

Here we tell Python to convert the loop value i to two bytes.  Characters beyond ASCII need 2 or more bytes in UTF-8.  Greek letters require 2 bytes.

In [23]:
twoBytes=fr.to_bytes(2,'big')
print(twoBytes)

b'\xce\x91'
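
The 'big' argument is the byte order.  It puts the most significant byte first, which is the order UTF-8 needs.  The sketch below compares it with 'little' and converts the bytes back with `int.from_bytes`.

```python
value = 0xce91

big = value.to_bytes(2, 'big')         # most significant byte first
little = value.to_bytes(2, 'little')   # least significant byte first
round_trip = int.from_bytes(big, 'big')

print(big, little, round_trip)
```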


Then we turn that into Greek:

In [24]:
decodedTwoBytes=twoBytes.decode('utf-8')
print(decodedTwoBytes)

Α


As we said, the range from fr=int("0xce91", 16) to to=int("0xcf8e",16) has a gap in the middle where the values are not any character at all.  So rather than hard-code that range, we put **try** and **except** around the code.  Python raises an error when it tries to decode a value that is not valid UTF-8, and the **except** block handles it.



```Python
try:
    ...  # decode the two bytes here
except UnicodeDecodeError:
    pass  # not valid UTF-8, so skip this value
```
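
To see the error on its own, the sketch below decodes one valid and one invalid pair of bytes.  The byte values are picked for illustration.  It also shows `errors='replace'`, an alternative that this notebook does not use, which substitutes a replacement character instead of raising.

```python
valid = b'\xce\x91'
invalid = b'\xce\xc0'   # the second byte is outside 0x80..0xBF

results = []
for data in (valid, invalid):
    try:
        results.append(data.decode('utf-8'))
    except UnicodeDecodeError as err:
        results.append(None)
        print('cannot decode', data, '-', err.reason)

print(results)

# errors='replace' substitutes U+FFFD instead of raising an error.
replaced = invalid.decode('utf-8', errors='replace')
print(replaced)
```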

Here is the complete code.

In [25]:
for i in range(fr, to + 1):  # to + 1 so that the last value, 0xcf8e, is included
    try:
        twoBytes=i.to_bytes(2,'big')
        decodedTwoBytes=twoBytes.decode('utf-8')
        print(i,decodedTwoBytes)
    except UnicodeDecodeError:
        pass  # not a valid UTF-8 sequence, so skip it

52881 Α
52882 Β
52883 Γ
52884 Δ
52885 Ε
52886 Ζ
52887 Η
52888 Θ
52889 Ι
52890 Κ
52891 Λ
52892 Μ
52893 Ν
52894 Ξ
52895 Ο
52896 Π
52897 Ρ
52898 ΢
52899 Σ
52900 Τ
52901 Υ
52902 Φ
52903 Χ
52904 Ψ
52905 Ω
52906 Ϊ
52907 Ϋ
52908 ά
52909 έ
52910 ή
52911 ί
52912 ΰ
52913 α
52914 β
52915 γ
52916 δ
52917 ε
52918 ζ
52919 η
52920 θ
52921 ι
52922 κ
52923 λ
52924 μ
52925 ν
52926 ξ
52927 ο
53120 π
53121 ρ
53122 ς
53123 σ
53124 τ
53125 υ
53126 φ
53127 χ
53128 ψ
53129 ω
53130 ϊ
53131 ϋ
53132 ό
53133 ύ
53134 ώ
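
The loop above works on UTF-8 byte values.  Its output includes 52898, which decodes without error to U+03A2, a code point that has no letter assigned.  As an alternative sketch, not the approach used in this notebook, you can loop over Unicode code points instead and use the standard `unicodedata` module to keep only the plain letters.  The code point range and the filter on the word WITH are choices made for this example.

```python
import unicodedata

letters = []
for cp in range(0x0391, 0x03CA):       # U+0391 (Α) through U+03C9 (ω)
    ch = chr(cp)
    name = unicodedata.name(ch, '')    # '' for code points with no character
    if name and 'WITH' not in name:    # skip the gap and the accented forms
        letters.append(ch)

print(''.join(letters))
```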


**Exercise** Write a program to read and print out [this UTF-16 encoded file]().  (For historical reasons some Windows files are UTF-16 encoded.)  Open the file directly from the internet like this:

In [26]:
import urllib.request

f = "https://raw.githubusercontent.com/werowe/HypatiaAcademy/master/assignment/encoded.txt"

l = urllib.request.urlopen(f)  
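
As a hint, here is a sketch of how UTF-16 bytes turn back into text.  It does not use the exercise file.  The `raw` bytes are a stand-in created in memory, and the sample text is made up.

```python
# Stand-in for bytes read from a UTF-16 file; this is not the exercise file.
raw = 'Γειά σου'.encode('utf-16')

bom = raw[:2]                 # byte order mark written by the encoder
text = raw.decode('utf-16')   # the decoder reads the byte order mark and removes it

print(bom)
print(text)
```

The byte order mark tells the decoder which of the two bytes in each pair comes first.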