# Introduction to Unicode in Python
> Understanding how Unicode works and its great hack, *UTF-8*

- toc: true 
- badges: true
- comments: true
- categories: [unicode, utf-8, python, nlp]
- image: images/tibetan-alphabet.png

## What is Unicode

To understand what Unicode is, we first need to understand a basic encoding scheme called [ASCII](https://en.wikipedia.org/wiki/ASCII). Essentially, *ASCII* encodes every character in 7 bits, which means it can represent only 128 characters. Unicode is like a massive version of the ASCII table, with room for 1,114,112 code points. In fact, the first 128 characters of the Unicode table correspond precisely to the ASCII characters, which makes Unicode backward compatible with ASCII. For a more detailed description of Unicode, follow this [link](http://www.unicode.org/standard/WhatIsUnicode.html).

But remember that Unicode itself is not an encoding scheme; it's just a collection of code points representing characters and symbols. Encoding schemes such as `UTF-8` (the most commonly used), `UTF-16` and `UTF-32` are what represent Unicode characters as binary data.
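
To make the difference between a code point and its encodings concrete, here is a small sketch that encodes the same character with three schemes. The variable names are illustrative only, and the `-be` codec variants are chosen here so that no byte order mark is added.

```python
# One code point, three different byte sequences.
ch = 'ཀ'  # code point U+0F40

utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16-be')  # big-endian variant, no BOM prepended
utf32 = ch.encode('utf-32-be')  # big-endian variant, no BOM prepended

for name, data in [('UTF-8', utf8), ('UTF-16', utf16), ('UTF-32', utf32)]:
    # bytes.hex() shows the raw bytes; len() shows how many bytes were used
    print(name, data.hex(), len(data))

# Decoding with the matching scheme always gives back the same character.
assert utf8.decode('utf-8') == utf16.decode('utf-16-be') == utf32.decode('utf-32-be') == ch
```

The code point stays `0f40` in every case; only the bytes used to store it change.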

Every character in the Unicode table is mapped to a number called a Unicode *code point*, which is usually written in hexadecimal (e.g. the code point for `ཀ` is `\u0f40`). Here is how to get the Unicode *code point* in Python.

In [1]:
'ཀ'.encode('unicode_escape')  # get unicode code point

b'\\u0f40'

In [2]:
int(0x0f40)  # the 0x prefix denotes a hexadecimal literal

3904

In [3]:
ord('ཀ')  # integer value of the code point

3904

Every script is assigned a specific range of Unicode code points. For example:
- `0020-007f` (32-127) for Basic Latin, the same as the corresponding ASCII code points.
- `0f00-0fff` (3840-4095) for Tibetan

In [4]:
int(0x0020), int(0x007f), int(0x0f00), int(0x0fff)

(32, 127, 3840, 4095)
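
As a rough sketch of how these ranges can be used, the snippet below checks whether a character falls inside the Tibetan range listed above. `TIBETAN_BLOCK` and `in_tibetan_block` are hypothetical names made up for this example; `unicodedata.name` and `chr` are standard Python.

```python
import unicodedata

# Range boundaries taken from the list above (0f00-0fff); the name is illustrative.
TIBETAN_BLOCK = range(0x0F00, 0x0FFF + 1)

def in_tibetan_block(ch):
    """Return True if the single character `ch` lies in the Tibetan range."""
    return ord(ch) in TIBETAN_BLOCK

print(in_tibetan_block('ཀ'))   # True
print(in_tibetan_block('a'))   # False
print(unicodedata.name('ཀ'))   # TIBETAN LETTER KA
print(chr(0x0F40))             # chr() is the inverse of ord()
```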

## UTF-8 - The Great Hack of Unicode

We know that a code point larger than decimal 255 can't be represented in 1 byte (8 bits); it needs more than 1 byte. So here comes `UTF-16`, which uses 2 bytes for every code point up to `0xffff` (and a pair of 2-byte units for anything larger), and `UTF-32`, which represents every character in 4 bytes.

But there are some big issues with `UTF-16` and `UTF-32`:
- They waste a lot of memory: for mostly-ASCII text, many of the bytes have the decimal value 0.
- They produce `0x00` bytes (i.e. `00000000`), which many older systems interpret as the end of a string.
- They are not compatible with ASCII.
- They require a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) to signal the byte order, unless it is agreed on in advance.
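
The points in this list can be checked with a short experiment on a plain ASCII string. This is only a sketch: the variable names are illustrative, and the exact BOM bytes produced by the plain `'utf-16'` codec depend on the machine's byte order, so they are printed rather than assumed.

```python
text = 'abc'

ascii_bytes = text.encode('ascii')
utf16_le = text.encode('utf-16-le')  # explicit little-endian, no BOM
utf16_be = text.encode('utf-16-be')  # explicit big-endian, no BOM
utf32_le = text.encode('utf-32-le')

# Wasted memory and zero bytes: count how many bytes are 0x00.
print(len(ascii_bytes), len(utf16_le), len(utf32_le))  # 3 6 12
print(utf16_le.count(0), utf32_le.count(0))            # 3 9

# Not ASCII compatible: the bytes differ from the ASCII encoding.
print(utf16_le == ascii_bytes)  # False

# Byte order matters: the same text gives different bytes per order,
# which is why a BOM (or a prior agreement) is needed.
print(utf16_le == utf16_be)     # False
print(text.encode('utf-16')[:2].hex())  # BOM added by the generic codec
```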

Luckily, `UTF-8` solves all of the above problems with its wonderful hack. It starts by just taking ASCII.
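
A quick way to see what "taking ASCII" means in practice is to compare the two encodings on ASCII text and then look at a non-ASCII character. The snippet below is a minimal sketch with made-up variable names.

```python
ascii_text = 'Hello'

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print(ascii_text.encode('utf-8') == ascii_text.encode('ascii'))  # True

mixed = 'ཀ a'
encoded = mixed.encode('utf-8')
print(encoded)       # b'\xe0\xbd\x80 a'
print(0 in encoded)  # False: no 0x00 byte appears in this encoding

# Characters take a variable number of bytes: 3 for ཀ, 1 each for ' ' and 'a'.
bytes_per_char = [len(c.encode('utf-8')) for c in mixed]
print(bytes_per_char)  # [3, 1, 1]
```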

### UTF-8 Representation

### Calculating the Unicode Index of a Character from Its Bits
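
One possible way to do this calculation is sketched below, assuming the standard UTF-8 bit layout: a lead byte of the form `0xxxxxxx`, `110xxxxx`, `1110xxxx` or `11110xxx`, followed by continuation bytes of the form `10xxxxxx`. The helper `code_point_from_utf8` is a hypothetical name written for this example, and it does not validate malformed input.

```python
def code_point_from_utf8(data):
    """Rebuild a code point from the UTF-8 bytes of one character.

    Hypothetical helper for illustration; assumes `data` is well-formed.
    """
    first = data[0]
    if first < 0x80:               # 0xxxxxxx: single byte, 7 payload bits
        return first
    if first >> 5 == 0b110:        # 110xxxxx: 2-byte sequence
        code_point = first & 0b00011111
    elif first >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
        code_point = first & 0b00001111
    elif first >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
        code_point = first & 0b00000111
    else:
        raise ValueError('not a valid UTF-8 lead byte')
    for byte in data[1:]:
        # Each continuation byte (10xxxxxx) contributes its low 6 bits.
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point

data = 'ཀ'.encode('utf-8')
print([format(b, '08b') for b in data])  # ['11100000', '10111101', '10000000']
print(hex(code_point_from_utf8(data)))   # 0xf40
```

Dropping the prefix bits leaves `0000`, `111101` and `000000`; joined together they form `0000111101000000`, which is `0x0f40`, the same value `ord('ཀ')` returns.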

In [45]:
#hide
import string
string.punctuation, string.printable

('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~',
 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c')