# Introduction to Unicode in Python
> Understanding how Unicode works and its great hack, *UTF-8*

- toc: true 
- badges: true
- comments: true
- categories: [unicode, utf-8, python, nlp]
- image: images/tibetan-alphabet.png

## What is Unicode

To understand what Unicode is, we first need to understand a basic encoding scheme called [ASCII](https://en.wikipedia.org/wiki/ASCII). Essentially, *ASCII* encodes every character in 7 bits, which means it can represent only 128 possible characters. Unicode is like a massive version of the ASCII table, with room for 1,114,112 possible code points. In fact, the first 128 characters of the Unicode table correspond precisely to the ASCII characters, which makes Unicode backward compatible with ASCII. For a more detailed description of Unicode, follow this [link](http://www.unicode.org/standard/WhatIsUnicode.html).

But remember that Unicode itself is not an encoding scheme; it's a collection of code points representing characters and symbols. Encoding schemes such as `UTF-8` (the most commonly used), `UTF-16` and `UTF-32` are used to represent Unicode characters as binary data.
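As a small illustration of the difference between a code point and its encodings, the sketch below encodes one character with three encodings. The explicit big-endian codec names (`utf-16-be`, `utf-32-be`) are chosen here only to keep the byte order mark out of the output; they are an assumption of this example, not something the rest of the post relies on.

```python
ch = '\u0f40'  # the Tibetan letter ཀ: one character, one code point

utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16-be')  # explicit byte order, so no BOM is added
utf32 = ch.encode('utf-32-be')

# Same code point, three different byte sequences.
for name, data in (('UTF-8', utf8), ('UTF-16', utf16), ('UTF-32', utf32)):
    print(f"{name:6} {len(data)} bytes: {data.hex(' ')}")
```

All three byte sequences decode back to the same single character.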

Every character in the Unicode table is mapped to something called a Unicode *code point*, which is usually written in hexadecimal (e.g. the code point for `ཀ` is `\u0f40`). Here is how to get the Unicode *code point* in Python.

In [11]:
'ཀ'.encode('unicode_escape')  # get unicode code point

b'\\u0f40'

In [12]:
int(0x0f40) # 0x denotes hexadecimal

3904

In [13]:
ord('ཀ') # check for integer value.

3904
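The conversions above also work in the other direction. A minimal sketch, using only the built-ins `ord` and `chr` and the conventional `U+XXXX` notation:

```python
code_point = ord('\u0f40')        # character -> integer code point
label = f"U+{code_point:04X}"     # conventional way to write a code point
char = chr(code_point)            # integer code point -> character

print(code_point, label, char)
```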

Every script (writing system) is given a specific Unicode code point range. For example:
- `0020-007f` (32-127) for Basic Latin, the same as the ASCII code points.
- `0f00-0fff` (3840-4095) for Tibetan

In [14]:
int(0x0020), int(0x007f), int(0x0f00), int(0x0fff)

(32, 127, 3840, 4095)
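With these ranges, checking whether a character belongs to a script is a simple comparison on its code point. The helper name `is_tibetan` below is hypothetical and introduced only for illustration; `unicodedata.name` is the standard-library lookup for a character's official name.

```python
import unicodedata

# Tibetan block as quoted above: 0f00-0fff (inclusive).
TIBETAN = range(0x0F00, 0x0FFF + 1)

def is_tibetan(ch):
    """Return True if the single character ch falls in the Tibetan block."""
    return ord(ch) in TIBETAN

print(is_tibetan('\u0f40'), is_tibetan('k'))
print(unicodedata.name('\u0f40'))
```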

## UTF-8 - Great Hack of Unicode

We know that a character whose code point is larger than 255 can't be represented with one byte (8 bits); it needs more than one byte. So here comes `UTF-16`, which represents most characters in two bytes (code points above `U+FFFF` take four bytes), and `UTF-32`, which represents every character in four bytes.

But there are some big issues with `UTF-16` and `UTF-32`:
- They waste lots of memory on mostly-ASCII text, since many bytes are all 0s.
- They produce 0x00 bytes (i.e. 00000000), which many old systems interpret as the end of a string.
- They are not compatible with ASCII.
- They need a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) (or a separately declared byte order).
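These issues are easy to observe on a single ASCII letter. The sketch below uses the little-endian codecs (`utf-16-le`, `utf-32-le`) to show the padding bytes without a BOM, and the generic `utf-16` codec to show the BOM itself; which byte order the generic codec picks depends on the platform, so the check accepts either.

```python
import codecs

text = 'A'
utf8 = text.encode('utf-8')
utf16 = text.encode('utf-16-le')   # explicit byte order, no BOM
utf32 = text.encode('utf-32-le')

print(utf8, utf16, utf32)
# Only the UTF-8 form is free of 0x00 bytes and identical to ASCII.
print(b'\x00' in utf8, b'\x00' in utf16, b'\x00' in utf32)

# The generic 'utf-16' codec prepends a BOM so a reader can detect byte order.
with_bom = text.encode('utf-16')
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)
```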

Luckily, `UTF-8` solves all these problems with its wonderful hack. The hack is in how it represents all the non-ASCII characters.

### UTF-8 Character Representation

`UTF-8` was designed for backward compatibility with ASCII, so the first 128 characters are the ASCII characters in exactly the same order, each encoded in a single byte. Code points above 127 require extra bytes, which makes `UTF-8` a **variable-length** encoding scheme: it encodes a code point in one to four bytes.
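To see the variable length in practice, here is a small sketch that encodes four sample characters, one from each length class. The sample characters are picked only for illustration.

```python
# 'A', 'é', 'ཀ' and '😀', written as escapes to avoid ambiguity.
samples = ['A', '\u00e9', '\u0f40', '\U0001f600']

lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```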

But questions remain: how does it represent a Unicode character that requires more than one byte while still avoiding eight consecutive 0s? How does it mark the start of a character? Let me explain its character representation with an example.

Consider encoding the first letter of the Tibetan alphabet, `ཀ`:
1. The Unicode code point for `ཀ` is U+0F40.
1. Hexadecimal 0F40 is binary 00001111 01000000.
1. In `UTF-8`, the leading bits of the first byte tell how many bytes are required to represent the given code point. According to the `UTF-8` scheme table, `ཀ` needs three bytes, so the first byte starts with three 1s followed by a 0 (like 1110...).
1. Then the four most significant bits of the code point are stored in the remaining four low-order bits of the first byte (like **1110**0000).
1. Each continuation byte holds six bits of the code point, because **10** is stored in its two high-order bits to mark it as a continuation byte. This also avoids eight consecutive 0s. So the remaining bits of the code point are placed like this: **10**111101 **10**000000.

So `ཀ` in UTF-8 binary is **1110**0000 **10**111101 **10**000000. For more about UTF-8, follow this [link](https://en.wikipedia.org/wiki/UTF-8).
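The steps above can be sketched with a few bit operations. The function name `encode_three_bytes` is hypothetical, and the sketch only covers the three-byte case (U+0800 to U+FFFF); it does not reject surrogate code points the way a real encoder would.

```python
def encode_three_bytes(code_point):
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles three-byte code points")
    first = 0b11100000 | (code_point >> 12)            # 1110 + top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0x3F)   # 10 + middle 6 bits
    third = 0b10000000 | (code_point & 0x3F)           # 10 + low 6 bits
    return bytes([first, second, third])

encoded = encode_three_bytes(0x0F40)
print(" ".join(f"{b:08b}" for b in encoded))
```

Comparing the result with `'ཀ'.encode('utf-8')` is a quick way to check the hand-rolled version against Python's own encoder.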

### Calculating the Numeric Value of a Unicode Code Point in Python

In [18]:
'ཀ'.encode('utf-8')  # in hexadecimal

b'\xe0\xbd\x80'

In [19]:
" ".join(f"{i:08b}" for i in (0xe0, 0xbd, 0x80))  # convert hex to binary 

'11100000 10111101 10000000'

As you can see here, we got the exact same binary value of `ཀ` as above.

Extract the code point bits from the UTF-8 bits:

Steps:
1. Separate out the code point bits, 1110**0000** 10**111101** 10**000000**
1. We get the code point bits of `ཀ`, 00001111 01000000

Convert the extracted code point bits to an int:

In [20]:
int('0000111101000000', 2)

3904

Compare it with Python's `ord` function:

In [10]:
int('0000111101000000', 2) == ord('ཀ')

True
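The same extraction can be done with bit masks instead of by hand. This is a minimal sketch for three-byte sequences only; the function name `decode_three_bytes` is hypothetical, and it skips the extra validity checks a full UTF-8 decoder performs.

```python
def decode_three_bytes(data):
    """Recover the code point from a three-byte UTF-8 sequence."""
    first, second, third = data  # iterating over bytes yields integers
    if first >> 4 != 0b1110 or second >> 6 != 0b10 or third >> 6 != 0b10:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Drop the 1110 / 10 / 10 markers and glue the payload bits together.
    return ((first & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)

code_point = decode_three_bytes('\u0f40'.encode('utf-8'))
print(code_point, code_point == ord('\u0f40'))
```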

### Fun facts
- 95% of web pages on the internet use `UTF-8` encoding at the time of writing this blog.