The Latin alphabet is stored as small numbers, using ASCII.

In [12]:
a = "A"

In [13]:
ord(a)

65

Hex will be important to us because larger numbers are usually written in hexadecimal digits.

In [14]:
hex(65)

'0x41'

Basic ASCII takes 7 bits, so each byte has the form 0xxxxxxx.

In [15]:
bin(0x41)

'0b1000001'

In [16]:
chr(0b1000001)

'A'
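
As a quick check of the 7-bit claim, the sketch below encodes every code point below 128 and confirms that each one becomes a single byte whose top bit is 0. The name `ascii_ok` is only illustrative and is not part of the notebook.

```python
# Every ASCII code point (0..127) should encode to exactly one byte
# whose value equals the code point and whose top bit is 0.
ascii_ok = all(
    chr(cp).encode("utf-8") == bytes([cp]) and (cp & 0b10000000) == 0
    for cp in range(128)
)
print(ascii_ok)

# "A" padded to a full byte: the leading bit is the 0 prefix.
print(f"{ord('A'):08b}")
```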

In [17]:
start = ord(a)
alphabet = ""
for i in range(start, start + 26):
    alphabet += chr(i)
print(alphabet)

ABCDEFGHIJKLMNOPQRSTUVWXYZ


This hex value is the character's Unicode code point. We can turn the corresponding byte back into a character by decoding it as UTF-8.

In [25]:
b'\x41'.decode("utf-8")

'A'

There are several ways to write the same character using escape sequences.

In [39]:
"A" == "\x41" == "\N{LATIN CAPITAL LETTER A}" == "\u0041" == "\U00000041"

True

Now for Japanese Hiragana characters. They start with the small "a" (ぁ) at code point U+3041.

https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)

In [6]:
ah = "\u3041"

In [49]:
print(ah)

ぁ


In [7]:
ord(ah)

12353

In [8]:
hex(12353)

'0x3041'

Notice that this is the same hex value, 3041, as in the Unicode code point U+3041.

In [9]:
bin(0x3041)

'0b11000001000001'

In [10]:
chr(0b11000001000001)

'ぁ'

In [11]:
start = ord(ah)
hiragana = ""
for i in range(start, start + 86):
    hiragana += chr(i)
print(hiragana)

ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ


But we don't encode them as the raw binary of the code point.

In [22]:
m = b'\x30\x41'

In [23]:
m.decode("utf-8")

'0A'

That's not right. UTF-8 must do something different.
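
To see where `'0A'` comes from, it may help to look at the two bytes one at a time. Both are below 0x80, so a UTF-8 decoder reads each as its own one-byte (ASCII) character. This is a small sketch that reuses the same byte string `m` from the cell above.

```python
m = b'\x30\x41'

# Each byte below 0x80 is a complete one-byte character in UTF-8,
# so the two bytes decode to two separate ASCII characters.
print([(f"{b:#04x}", chr(b)) for b in m])
```

The pairs printed are 0x30 with `'0'` and 0x41 with `'A'`, which is why the decoded string is `'0A'` rather than a single Hiragana character.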

In [26]:
ah.encode("utf-8")

b'\xe3\x81\x81'

In [30]:
" ".join(f"{i:08b}" for i in ah.encode("utf-8"))

'11100011 10000001 10000001'

The binary is there, but each byte carries a prefix. Otherwise we wouldn't know how many bytes belong to each character. UTF-8 standardizes these prefixes so that any number will be recognizable.

    '11100011 10000001 10000001'
           11   000001   000001
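
The diagram can be reproduced in code. The sketch below strips the prefix from each byte and glues the remaining payload bits back together. `utf8_payload_bits` is a hypothetical helper written for this illustration and assumes its argument is a single character.

```python
def utf8_payload_bits(ch):
    """Return the payload bits of each UTF-8 byte of ch (illustrative helper)."""
    parts = []
    for i, byte in enumerate(ch.encode("utf-8")):
        bits = f"{byte:08b}"
        if i > 0:
            # Continuation byte: drop the "10" prefix.
            parts.append(bits[2:])
        else:
            # Lead byte: the prefix is the run of 1s plus the 0 that ends it.
            prefix_len = bits.index("0") + 1
            parts.append(bits[prefix_len:])
    return parts

parts = utf8_payload_bits("\u3041")
print(parts)

# Joining the payload bits recovers the code point.
print(hex(int("".join(parts), 2)))
```

The first payload comes out as `0011`, not `11`, because the lead byte of a three-byte sequence has four payload bits. The diagram simply omits the leading zeros.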

The number of bytes depends on the character you want to write. U+00E9 is e with an acute accent.

In [32]:
ec = "\u00e9"

In [33]:
ec

'é'

In [37]:
bin(ord(ec))

'0b11101001'

In [34]:
ec.encode("utf-8")

b'\xc3\xa9'

In [36]:
" ".join(f"{i:08b}" for i in ec.encode("utf-8"))

'11000011 10101001'

Small numbers fit in a single byte that always starts with 0, as shown above.

Medium numbers need only two bytes. The first byte starts with 110 and the second with 10.

    '11000011 10101001'
           11   101001
           
Bigger ones need three bytes. The first byte starts with 1110 to signal a three-byte sequence, and the second and third bytes each start with 10.


    '11100011 10000001 10000001'
           11   000001   000001
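
The two- and three-byte patterns can be built by hand with shifts and masks. This is a minimal sketch, not how Python implements its codec. `encode_2` and `encode_3` are hypothetical helpers that assume the code point already fits the pattern, and they do no validation.

```python
def encode_2(cp):
    # 110xxxxx 10xxxxxx: 5 payload bits, then 6.
    return bytes([
        0b11000000 | (cp >> 6),
        0b10000000 | (cp & 0b111111),
    ])

def encode_3(cp):
    # 1110xxxx 10xxxxxx 10xxxxxx: 4 payload bits, then 6, then 6.
    return bytes([
        0b11100000 | (cp >> 12),
        0b10000000 | ((cp >> 6) & 0b111111),
        0b10000000 | (cp & 0b111111),
    ])

print(encode_2(0xE9))     # compare with "\u00e9".encode("utf-8")
print(encode_3(0x3041))   # compare with "\u3041".encode("utf-8")
```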
           
Most emoji have even larger code points. Four bytes!

In [40]:
blue_heart = "\U0001F499"

In [41]:
print(blue_heart)

💙


In [50]:
start = ord(blue_heart)
emoji = ""
for i in range(start, start + 26):
    emoji += chr(i)
print(emoji)

💙💚💛💜💝💞💟💠💡💢💣💤💥💦💧💨💩💪💫💬💭💮💯💰💱💲


In [45]:
bin(ord(blue_heart))

'0b11111010010011001'

In [44]:
blue_heart.encode("utf-8")

b'\xf0\x9f\x92\x99'

In [43]:
" ".join(f"{i:08b}" for i in blue_heart.encode("utf-8"))

'11110000 10011111 10010010 10011001'

This takes four bytes! Why couldn't it be done in three?

    '11110000 10011111 10010010 10011001'
                 11111   010010   011001
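
One way to answer the question is to count payload bits. A three-byte sequence has room for 4 + 6 + 6 = 16 bits, but U+1F499 needs 17. The sketch below compares the bit length of a few code points with the space each sequence length offers. `PAYLOAD_BITS` and `bytes_needed` are names made up for this illustration.

```python
# Payload bits available in a 1-, 2-, 3- and 4-byte UTF-8 sequence.
PAYLOAD_BITS = {1: 7, 2: 5 + 6, 3: 4 + 6 + 6, 4: 3 + 6 + 6 + 6}

def bytes_needed(ch):
    """Smallest UTF-8 length whose payload can hold the code point."""
    need = ord(ch).bit_length()
    return next(n for n, bits in PAYLOAD_BITS.items() if need <= bits)

# Columns: character, bits in the code point, predicted length, actual length.
for ch in "A\u00e9\u3041\U0001F499":
    print(ch, ord(ch).bit_length(), bytes_needed(ch), len(ch.encode("utf-8")))
```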

In [51]:
import unicodedata

for c in emoji:
    print(c, f"{ord(c):04x}", unicodedata.category(c), end=" ")
    print(unicodedata.name(c))


💙 1f499 So BLUE HEART
💚 1f49a So GREEN HEART
💛 1f49b So YELLOW HEART
💜 1f49c So PURPLE HEART
💝 1f49d So HEART WITH RIBBON
💞 1f49e So REVOLVING HEARTS
💟 1f49f So HEART DECORATION
💠 1f4a0 So DIAMOND SHAPE WITH A DOT INSIDE
💡 1f4a1 So ELECTRIC LIGHT BULB
💢 1f4a2 So ANGER SYMBOL
💣 1f4a3 So BOMB
💤 1f4a4 So SLEEPING SYMBOL
💥 1f4a5 So COLLISION SYMBOL
💦 1f4a6 So SPLASHING SWEAT SYMBOL
💧 1f4a7 So DROPLET
💨 1f4a8 So DASH SYMBOL
💩 1f4a9 So PILE OF POO
💪 1f4aa So FLEXED BICEPS
💫 1f4ab So DIZZY SYMBOL
💬 1f4ac So SPEECH BALLOON
💭 1f4ad So THOUGHT BALLOON
💮 1f4ae So WHITE FLOWER
💯 1f4af So HUNDRED POINTS SYMBOL
💰 1f4b0 So MONEY BAG
💱 1f4b1 So CURRENCY EXCHANGE
💲 1f4b2 So HEAVY DOLLAR SIGN
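
The names printed above also work in the other direction. As a small sketch, `unicodedata.lookup` returns the character for a given name, which should match the `\N{...}` and `\U` escapes used earlier. The variable `heart` is only illustrative.

```python
import unicodedata

# Look a character up by its Unicode name.
heart = unicodedata.lookup("BLUE HEART")
print(heart, f"U+{ord(heart):04X}")

# The name escape and the code point escape give the same character.
print(heart == "\N{BLUE HEART}" == "\U0001F499")
```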


So now we can understand how Egyptian hieroglyphs can be written with Unicode.

You might need to download a font from https://www.google.com/get/noto/ to see them, namely **Noto Sans Egyptian Hieroglyphs**.

In [52]:
egypt = "\U00013000"

In [53]:
start = ord(egypt)
hieroglyphs = ""
for i in range(start, start + 126):
    hieroglyphs += chr(i)
print(hieroglyphs)

𓀀𓀁𓀂𓀃𓀄𓀅𓀆𓀇𓀈𓀉𓀊𓀋𓀌𓀍𓀎𓀏𓀐𓀑𓀒𓀓𓀔𓀕𓀖𓀗𓀘𓀙𓀚𓀛𓀜𓀝𓀞𓀟𓀠𓀡𓀢𓀣𓀤𓀥𓀦𓀧𓀨𓀩𓀪𓀫𓀬𓀭𓀮𓀯𓀰𓀱𓀲𓀳𓀴𓀵𓀶𓀷𓀸𓀹𓀺𓀻𓀼𓀽𓀾𓀿𓁀𓁁𓁂𓁃𓁄𓁅𓁆𓁇𓁈𓁉𓁊𓁋𓁌𓁍𓁎𓁏𓁐𓁑𓁒𓁓𓁔𓁕𓁖𓁗𓁘𓁙𓁚𓁛𓁜𓁝𓁞𓁟𓁠𓁡𓁢𓁣𓁤𓁥𓁦𓁧𓁨𓁩𓁪𓁫𓁬𓁭𓁮𓁯𓁰𓁱𓁲𓁳𓁴𓁵𓁶𓁷𓁸𓁹𓁺𓁻𓁼𓁽
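
Like the emoji, these code points are above U+FFFF, so each hieroglyph should take four bytes in UTF-8. A short sketch for the first one, using the same `egypt` value as above (`encoded` is an illustrative name):

```python
egypt = "\U00013000"
encoded = egypt.encode("utf-8")

print(encoded)        # b'\xf0\x93\x80\x80'
print(len(encoded))   # 4

# First byte starts with 11110, the other three with 10.
print(" ".join(f"{b:08b}" for b in encoded))
```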
