# UNICODE RODEO
## A Short Introductory Adventure with Unicode
Designed for those who may be working with multiple alphabets or character sets in Python.

## The Basics of Characters and Bytes

Character information is stored in bytes. Each character takes up anywhere from one to four bytes, depending on how it's stored. There are a number of different systems for storing information about text.

You may be familiar with ASCII art: https://www.asciiart.eu/

*ASCII* stands for American Standard Code for Information Interchange, but that's not super important. It's an old-school way of encoding character information that dates from the 1960s and grew out of telegraph codes. It only deals with 128 characters, which is not enough! Now, Unicode is dominant since it's a much more versatile system of encoding text.

When character information is saved, at some point it's encoded, or turned into bytes that the computer can store. Python gives us a number of useful tools for this. Let's try out the string method encode().

In [1]:
text1 = "oboe"
print(text1, len(text1)) # prints the string and its length in characters
enc_1 = text1.encode('ascii') # encodes the string in ascii
print(enc_1, len(enc_1)) # prints the encoded bytes and their length in bytes

Great. It gives us the byte information. But it's not very interesting. Let's try something better. 

In [2]:
tr_text = "türkçe"
print(tr_text, len(tr_text))
enc_tr = tr_text.encode('ascii') # tries to encode the string in ascii, which raises a UnicodeEncodeError
print(enc_tr, len(enc_tr)) 


Well, that's a problem. ASCII wasn't up to the task of encoding all these characters, so Python raised a UnicodeEncodeError. Luckily, we can fix this.
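For reference, here is a small sketch of that failure caught with `try`/`except`, together with the `errors` argument of `str.encode()`, which lets the ASCII encoding go through at the cost of losing the non-ASCII letters. The variable names are only for illustration.

```python
tr_text = "türkçe"

# Strict ASCII encoding fails on 'ü' and 'ç', which are outside ASCII's 128 characters.
try:
    tr_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    print("Could not encode:", err)

# errors="replace" substitutes '?' for anything ASCII cannot represent.
replaced = tr_text.encode("ascii", errors="replace")
print(replaced)

# errors="ignore" silently drops those characters instead.
ignored = tr_text.encode("ascii", errors="ignore")
print(ignored)
```

Neither option preserves the original text, which is why switching to a Unicode encoding is the real fix.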

## Unicode

Unicode is a widely adopted standard for representing text from nearly every writing system, on the web and elsewhere.
Unicode's website is here: https://home.unicode.org/
Unicode assigns each character a 'code point', a unique identifying number that can be used to reference it and look up more information about the character. For example, "U+0041" is the code point for an uppercase "A". Another example is "U+1F647", which is an emoji. Characters of the same alphabet typically fall into a similar Unicode code point range. Check out the database here: https://www.unicode.org/ucd/
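To make code points a little more concrete, here is a rough sketch that uses Python's built-in `unicodedata` module to look up a character's name. The `describe` helper and the `ARABIC_BLOCK` constant are names made up for this example; the range U+0600 to U+06FF is the main Arabic block.

```python
import unicodedata

def describe(ch):
    """Return a 'U+XXXX NAME' label for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

print(describe("A"))
print(describe("\U0001F647"))  # the emoji mentioned above

# Letters from one alphabet tend to sit together in one range.
text_ar = "الصقر"
for ch in text_ar:
    print(describe(ch))

ARABIC_BLOCK = range(0x0600, 0x0700)  # U+0600 through U+06FF
all_arabic = all(ord(ch) in ARABIC_BLOCK for ch in text_ar)
print(all_arabic)
```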

### UTF-8

UTF-8 is the most common character encoding. It uses 1 byte per character for the 128 characters defined by ASCII, which cover most English-language text, and between 2 and 4 bytes for all others. That makes it the most compact choice for a lot of languages written in the Latin alphabet.
Let's try that for our Turkish text.


In [3]:
tr_text = "türkçe"
print(tr_text, len(tr_text), "characters long unencoded") 
enc_tr = tr_text.encode('utf-8') 
print(enc_tr, len(enc_tr), "bytes long in utf-8 encoding") 

Great. It's encoded our problem characters, and only added 2 bytes to our total byte usage.
Look at how the special characters are encoded: each one turns into two bytes, shown as \x escapes.
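If you'd like to see the byte counts character by character, something like the sketch below should work. The sample characters are picked just to show one example each of the 1, 2, 3 and 4 byte cases.

```python
# One character for each UTF-8 length; the last is the U+1F647 emoji.
samples = ["t", "ü", "ç", "안", "\U0001F647"]

byte_counts = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_counts[ch] = len(encoded)
    # Show the code point, the raw bytes in hex, and how many bytes were needed.
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
```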

You might notice that this doesn't look like a Unicode code point - that's because it's an encoded version. If you want to find out the code point, you can use the ord() function, and the chr() function turns a code point back into a character.
Let's look at 'A' U+0041.

In [4]:
code_point = ord('A')
print(code_point, "is the code point for 'A', but it's in Python's default base 10")
character = chr(code_point)
print(character, f"is the character for code point {code_point}")
hex_code_point = hex(code_point)
print(hex_code_point, "is the same number in hexadecimal! It's a match!")
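The same functions work for characters outside ASCII. As a small sketch, here is 'ü' written three different ways in Python source, along with a reminder that the code point and the encoded bytes are different things. The variable names are just for illustration.

```python
u_umlaut = "ü"
code_point = ord(u_umlaut)       # 252 in base 10, which is 0xFC in hex
print(f"U+{code_point:04X}")     # formatted the way Unicode charts write it

# Three ways to spell the same character in Python source.
from_chr = chr(0xFC)
from_escape = "\u00fc"
from_name = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
print(from_chr == from_escape == from_name == u_umlaut)

# The code point is one number; the UTF-8 encoding of it is two bytes.
print(hex(code_point), u_umlaut.encode("utf-8"))
```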

But what about other Unicode encodings? 

### UTF-16

UTF-16 uses 2 bytes for most characters, and 4 bytes for the rest (many emoji, for example). If you're interested in more details on the finer points of Unicode encoding, check out: https://medium.com/swlh/what-is-utf-16-63755027eb29
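Here is a quick sketch of the 2 byte and 4 byte cases. It uses the 'utf-16-le' codec, which fixes the byte order and so writes no byte order mark (BOM), and then shows that the plain 'utf-16' codec puts a 2-byte BOM at the front. The example characters are arbitrary picks.

```python
import codecs

bmp_char = "안"              # U+C548, fits in a single 2-byte unit
astral_char = "\U0001F647"   # U+1F647, needs two 2-byte units

# 'utf-16-le' has a fixed byte order, so no BOM is written.
bmp_len = len(bmp_char.encode("utf-16-le"))
astral_len = len(astral_char.encode("utf-16-le"))
print(bmp_len, astral_len)

# Plain 'utf-16' prepends a 2-byte BOM so a reader can tell the byte order.
with_bom = bmp_char.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(len(with_bom), has_bom)
```

That BOM is worth keeping in mind when comparing byte counts for short strings.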


### UTF-32 

UTF-32 uses 4 bytes for all characters, and is not widely used because of how much space it takes up. 

Let's try these out on some text.

In [5]:
text_ar = "الصقر"
print(text_ar, len(text_ar), '5 character Arabic text')
enc_ar = text_ar.encode('utf8') # note that it still works without the hyphen
print(enc_ar, len(enc_ar), 'bytes long in utf-8')

enc_ar_2 = text_ar.encode('utf-16')
# contains 10 bytes for 5 chars, plus a 2-byte byte order mark (BOM) at the start
print(enc_ar_2, len(enc_ar_2), 'bytes long in utf-16')
print("\n")

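As you can see, UTF-8 comes out slightly ahead for this Arabic text, but only just: both encodings use 2 bytes per Arabic letter, and the extra 2 bytes in the UTF-16 result are the byte order mark (BOM) that Python's 'utf-16' codec adds at the front.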
What about Korean?

In [6]:
kr_text = "안녕하세요"
print(kr_text , len(kr_text), '5 character Korean text')
enc_kr = kr_text.encode('utf-8')
print(enc_kr, len(enc_kr), 'bytes long in utf-8')
print("\n")

enc_kr_2 = kr_text.encode('utf-16')
print(enc_kr_2, len(enc_kr_2), 'bytes long in utf-16')
print("\n")

enc_kr_3 = kr_text.encode('utf-32')
print(enc_kr_3, len(enc_kr_3), 'bytes long in utf-32\n')


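As you can see, UTF-16 would be your best choice for encoding a significant amount of Korean text. It needs 12 bytes here (including its 2-byte byte order mark), against 15 for UTF-8 and 24 for UTF-32.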
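To compare everything side by side, a helper along these lines might be handy. `encoded_sizes` is a made-up name for this sketch, and it uses the '-le' codec variants so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in each encoding, without a BOM."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}

samples = {
    "English": "oboe",
    "Turkish": "türkçe",
    "Arabic": "الصقر",
    "Korean": "안녕하세요",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

Leaving the BOM out makes the per-character cost easier to see, since the BOM is a fixed overhead that matters less and less as the text gets longer.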


Thank you for coming to my Unicode Rodeo! 

If you want to see an application of Unicode code points, I made a project a while ago that looks at the first letter of a string to determine the code point. If the code point falls into the Arabic range, it applies a right-to-left filter.
You can see it here: https://github.com/AeronRoemer/Employee-Directory-Display
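The idea can be sketched in a few lines of Python. This is not the code from the linked project, just a guess at the general approach: the function names and the 'rtl'/'ltr' labels are made up for illustration, and the range used is the main Arabic block, U+0600 to U+06FF.

```python
ARABIC_START, ARABIC_END = 0x0600, 0x06FF  # main Arabic block (assumed range for this sketch)

def starts_with_arabic(text):
    """Return True if the first character of text is in the Arabic block."""
    if not text:
        return False  # an empty string has no first character to check
    return ARABIC_START <= ord(text[0]) <= ARABIC_END

def text_direction(text):
    """Pick a direction label based on the first character only."""
    return "rtl" if starts_with_arabic(text) else "ltr"

print(text_direction("الصقر"))
print(text_direction("oboe"))
```

Checking only the first character is a simplification. Text that starts with a digit or punctuation mark would be treated as left-to-right even if the rest is Arabic.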