## Supplementary notebook for Boston Python's Study Group meeting on `unicodedata`

This notebook collects cells from the discussion that followed
the 25 May 2023 meeting of the [Boston Python](https://about.bostonpython.com)
Study Group, which focused on the [`unicodedata` module](https://docs.python.org/3/library/unicodedata.html).

I've reorganized and annotated the cells to make their
context clearer.

In [1]:
import unicodedata

## What's the difference between a codepoint and encoded bytes?

We discussed how Unicode assigns each "character" an integer,
called a codepoint. An encoding then maps that integer to a
sequence of bytes, and because several encodings exist, the same
codepoint or sequence of codepoints can be represented by different byte sequences.

In [2]:
print(ord("a"))
print("a".encode("utf-8"))
print("a".encode("utf-16"))
print("a".encode("utf-32"))

print('---')

print(ord("二"))
print("二".encode("utf-8"))
print("二".encode("utf-16"))
print("二".encode("utf-32"))

97
b'a'
b'\xff\xfea\x00'
b'\xff\xfe\x00\x00a\x00\x00\x00'
---
20108
b'\xe4\xba\x8c'
b'\xff\xfe\x8cN'
b'\xff\xfe\x00\x00\x8cN\x00\x00'
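The leading `\xff\xfe` in the UTF-16 and UTF-32 outputs above is a byte order mark (BOM), which Python's generic `utf-16` and `utf-32` codecs prepend. As a minimal sketch (the variable names are only illustrative), the endianness-specific codecs leave the BOM out, and every encoding decodes back to the same codepoint:

```python
text = "二"  # U+4E8C

# The explicit little-endian and big-endian codecs do not add a BOM.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
print(utf16_le)
print(utf16_be)

# Each encoding produces different bytes, but all of them
# decode back to the same single codepoint.
for encoding in ("utf-8", "utf-16", "utf-32"):
    encoded = text.encode(encoding)
    assert encoded.decode(encoding) == text
    print(encoding, len(encoded), "bytes")
```

The byte counts differ per encoding even though `len(text)` is 1 in every case.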


## Looking up codepoints by name

We talked about how to look up codepoints by name, and noted that Python's `\N{NAME OF CODEPOINT}` string escape relies on the same Unicode character names.

In [3]:
print(unicodedata.lookup("SNAKE"))
print("I love Python, it has such good Unicode support! \N{SNAKE}")
print(hex(ord("\N{SNAKE}")))

🐍
I love Python, it has such good Unicode support! 🐍
0x1f40d
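As a small sketch of the edge cases around name lookup: `unicodedata.name` is the inverse of `unicodedata.lookup` for named codepoints, `lookup` raises `KeyError` for an unknown name, and `name` raises `ValueError` for an unnamed codepoint unless a default is passed. The bogus name and the `"<unnamed>"` placeholder below are made up for illustration.

```python
import unicodedata

snake = unicodedata.lookup("SNAKE")

# name() is the inverse of lookup() for named codepoints.
assert unicodedata.name(snake) == "SNAKE"

# lookup() raises KeyError when no codepoint has the given name.
try:
    unicodedata.lookup("NOT A REAL CODEPOINT NAME")
except KeyError as exc:
    print("lookup failed:", exc)

# Control characters such as U+0000 have no name, so name() would raise
# ValueError here if we did not supply a default value.
unnamed = unicodedata.name("\x00", "<unnamed>")
print(unnamed)
```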


## Is there a limit to combining characters?

We discussed whether there is a limit to how many combining characters can be stacked on a base character, with [Zalgo text](https://en.wikipedia.org/wiki/Zalgo_text) as an extreme example of their use.

In [4]:
len("h̶͉̹̻̿̓e̷̥͘l̸̫̯̎l̷̢̗̔̓ò̵̆͜ ̴̥̬͂̔͋ẃ̷͕͛͑ô̸̝̻̋̽ŗ̴̘̇͌̉l̶̛̬̣̅̃d̶̢̹͙͆̏͗")  # that's a lot!

69
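To see where a length like that comes from, one option is to separate base characters from combining marks. The sketch below uses a shorter, made-up Zalgo-style string written with escapes so the codepoints are unambiguous, and it treats any codepoint whose general category starts with `M` as a mark.

```python
import unicodedata

# "hi" with three combining marks stacked on each letter.
zalgo = "h\u0336\u033f\u0343i\u0337\u0325\u0358"

# General categories Mn, Mc, and Me all denote marks.
marks = [c for c in zalgo if unicodedata.category(c).startswith("M")]
base = [c for c in zalgo if not unicodedata.category(c).startswith("M")]

print(len(zalgo), len(base), len(marks))  # 8 2 6
```

Only two of the eight codepoints are letters; `len()` counts codepoints, not what a reader perceives as characters.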

## Combining characters and bidirectional text in Hebrew

We explored how to inspect the combining characters present in Hebrew text, and also talked a bit about the directionality of text.

We consulted a [list of Unicode combining categories](https://www.compart.com/en/unicode/combining) while looking at the output below. The columns are the codepoint's name, its canonical combining class (0 for non-combining characters), and its bidirectional class.

In [5]:
s = "hello world שָׁלוֹם"
for codept in s:
    print(
        unicodedata.name(codept),
        unicodedata.combining(codept),
        unicodedata.bidirectional(codept),
        sep="\t",
    )

LATIN SMALL LETTER H	0	L
LATIN SMALL LETTER E	0	L
LATIN SMALL LETTER L	0	L
LATIN SMALL LETTER L	0	L
LATIN SMALL LETTER O	0	L
SPACE	0	WS
LATIN SMALL LETTER W	0	L
LATIN SMALL LETTER O	0	L
LATIN SMALL LETTER R	0	L
LATIN SMALL LETTER L	0	L
LATIN SMALL LETTER D	0	L
SPACE	0	WS
HEBREW LETTER SHIN	0	R
HEBREW POINT QAMATS	18	NSM
HEBREW POINT SHIN DOT	24	NSM
HEBREW LETTER LAMED	0	R
HEBREW LETTER VAV	0	R
HEBREW POINT HOLAM	19	NSM
HEBREW LETTER FINAL MEM	0	R
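One practical use of the combining class, sketched here under the assumption that we simply want the consonantal text, is to drop every codepoint with a non-zero combining class. The Hebrew word is written with escapes so that the vowel points are explicit.

```python
import unicodedata

# SHIN, QAMATS, SHIN DOT, LAMED, VAV, HOLAM, FINAL MEM
shalom = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"

# Keep only codepoints whose canonical combining class is 0.
unpointed = "".join(c for c in shalom if unicodedata.combining(c) == 0)

print(len(shalom), len(unpointed))  # 7 4
for codept in unpointed:
    print(unicodedata.name(codept), unicodedata.bidirectional(codept), sep="\t")
```

Note that the string is stored in logical (reading) order; the right-to-left display is decided later by the renderer using the bidirectional classes.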


In [6]:
# a human being considers these equivalent…
s1 = 'ça va?'
s2 = 'ça va?'

# …but they're not!
print(s1 == s2)

# we can get a hint about why by printing the lengths
print(len(s1), len(s2))

False
7 6


In [7]:
# Let's look at the name of each codepoint to find out what's going on here
for codept in s1:
    print(unicodedata.name(codept))

print('---')

for codept in s2:
    print(unicodedata.name(codept))

LATIN SMALL LETTER C
COMBINING CEDILLA
LATIN SMALL LETTER A
SPACE
LATIN SMALL LETTER V
LATIN SMALL LETTER A
QUESTION MARK
---
LATIN SMALL LETTER C WITH CEDILLA
LATIN SMALL LETTER A
SPACE
LATIN SMALL LETTER V
LATIN SMALL LETTER A
QUESTION MARK


## Let's use [normalization forms](https://unicode.org/reports/tr15/) to transform these strings before we compare them

Here we use Normalization Form C (NFC) to transform both strings
into a "canonical composition" form, which in this case means that
the cedilla is not a separate codepoint in either sequence.

There is also Normalization Form D (NFD), which transforms strings
to "canonical decomposition" form, and the NFKC and NFKD forms, which target
"compatibility composition" and "compatibility decomposition", respectively
(the K stands for "compatibility" and avoids confusion with the C for "composition").

In [8]:
s1_norm = unicodedata.normalize("NFC", s1)
s2_norm = unicodedata.normalize("NFC", s2)

for codept in s1_norm:
    print(unicodedata.name(codept))
    
print('---')

for codept in s2_norm:
    print(unicodedata.name(codept))
    
print('---')
print(s1_norm == s2_norm)

LATIN SMALL LETTER C WITH CEDILLA
LATIN SMALL LETTER A
SPACE
LATIN SMALL LETTER V
LATIN SMALL LETTER A
QUESTION MARK
---
LATIN SMALL LETTER C WITH CEDILLA
LATIN SMALL LETTER A
SPACE
LATIN SMALL LETTER V
LATIN SMALL LETTER A
QUESTION MARK
---
True
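The comparison above can be wrapped in a small helper. This is only a sketch: `nfc_equal` is a hypothetical name, not part of `unicodedata`, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


decomposed = "c\u0327a va?"   # c + COMBINING CEDILLA
precomposed = "\u00e7a va?"   # LATIN SMALL LETTER C WITH CEDILLA

print(decomposed == precomposed)           # False
print(nfc_equal(decomposed, precomposed))  # True

# is_normalized() checks a string's form without building a new string.
print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFC", precomposed))  # True
```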


## Another normalization example

Here we briefly discussed the importance of the order of the combining characters in the NFD form.

In [9]:
s = "\u1e69"
s_norm_d = unicodedata.normalize("NFD", s)
s_norm_c = unicodedata.normalize("NFC", s_norm_d)

print(s, len(s))

for codept in s_norm_d:
    print(unicodedata.name(codept))
    
for codept in s_norm_c:
    print(unicodedata.name(codept))

ṩ 1
LATIN SMALL LETTER S
COMBINING DOT BELOW
COMBINING DOT ABOVE
LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
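To make the ordering point concrete, here is a sketch that starts from the two marks typed in either order. NFD sorts adjacent combining marks by their combining class (220 for the dot below, 230 for the dot above), so both inputs end up as the same sequence.

```python
import unicodedata

above_then_below = "s\u0307\u0323"  # s + DOT ABOVE + DOT BELOW
below_then_above = "s\u0323\u0307"  # s + DOT BELOW + DOT ABOVE

# As raw codepoint sequences, the two strings differ.
print(above_then_below == below_then_above)  # False

nfd_a = unicodedata.normalize("NFD", above_then_below)
nfd_b = unicodedata.normalize("NFD", below_then_above)

# After NFD, the marks are in canonical order in both strings.
print([unicodedata.combining(c) for c in nfd_a])  # [0, 220, 230]
print(nfd_a == nfd_b)  # True

# NFC then composes either one into the single codepoint U+1E69.
nfc_a = unicodedata.normalize("NFC", above_then_below)
print(nfc_a == "\u1e69")  # True
```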


## Example task: retrieve every 'numeric' codepoint and print the associated numeric properties

In [Unicode's categories](https://www.compart.com/en/unicode/category),
there are three numeric categories:

* `Nl` for "letter number"
* `Nd` for "digit number"
* `No` for "other number"

We can find all the numeric codepoints by checking every codepoint's
category for a leading `'N'`.

Fun fact: the `sys` module provides a constant, `sys.maxunicode`, that
holds the largest Unicode codepoint (`0x10FFFF`). It's a historical artifact
from a time when Python's Unicode support could vary between builds;
see [PEP 393](https://peps.python.org/pep-0393/) for details.

In this case, it's just a convenient way to loop over every codepoint.

In [10]:
import sys


# sys.maxunicode is the largest codepoint, so add 1 to include it in the range
for n in range(sys.maxunicode + 1):
    char = chr(n)
    category = unicodedata.category(char)
    if category.startswith('N'):
        # note: we pad the name on the right to align the output
        name = unicodedata.name(char).ljust(64)

        # note: these are three distinct concepts of
        # representing numbers with text; not all
        # numeric codepoints define them, so we include
        # a default value of None here
        decimal = unicodedata.decimal(char, None)
        digit = unicodedata.digit(char, None)
        numeric = unicodedata.numeric(char, None)

        print(char, name, category, decimal, digit, numeric, sep="\t")


0	DIGIT ZERO                                                      	Nd	0	0	0.0
1	DIGIT ONE                                                       	Nd	1	1	1.0
2	DIGIT TWO                                                       	Nd	2	2	2.0
3	DIGIT THREE                                                     	Nd	3	3	3.0
4	DIGIT FOUR                                                      	Nd	4	4	4.0
5	DIGIT FIVE                                                      	Nd	5	5	5.0
6	DIGIT SIX                                                       	Nd	6	6	6.0
7	DIGIT SEVEN                                                     	Nd	7	7	7.0
8	DIGIT EIGHT                                                     	Nd	8	8	8.0
9	DIGIT NINE                                                      	Nd	9	9	9.0
²	SUPERSCRIPT TWO                                                 	No	None	2	2.0
³	SUPERSCRIPT THREE                                               	No	None	3	3.0
¹	SUPERSCRIPT ONE                                         
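Since the full listing is long, a handful of hand-picked codepoints (chosen here for illustration, one or two per numeric category) shows how the three properties diverge:

```python
import unicodedata

samples = [
    "7",       # DIGIT SEVEN
    "\u0663",  # ARABIC-INDIC DIGIT THREE
    "\u00b2",  # SUPERSCRIPT TWO
    "\u00bd",  # VULGAR FRACTION ONE HALF
    "\u2167",  # ROMAN NUMERAL EIGHT
]

rows = []
for char in samples:
    rows.append((
        unicodedata.name(char),
        unicodedata.category(char),
        unicodedata.decimal(char, None),   # defined for decimal digits only
        unicodedata.digit(char, None),     # also covers e.g. superscripts
        unicodedata.numeric(char, None),   # any numeric value, as a float
    ))

for row in rows:
    print(*row, sep="\t")
```

Only the `Nd` codepoints define all three values; the fraction and the Roman numeral define just the `numeric` value.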

## Equivalent identifiers

Python supports Unicode identifiers (names for variables) by closely following
the associated [Unicode guidelines](https://unicode.org/reports/tr31/).

Here, we assign a value to a name spelled with full-width kana and note that
Python treats the half-width kana form as the _same_ identifier.

In [11]:
パイソン = 42
print(ﾊﾟｲｿﾝ)

42


## Identifiers are considered equivalent under NFKC

As described [in the Python documentation](https://docs.python.org/3/reference/lexical_analysis.html#identifiers),
Python source code is transformed at parse time:

> All identifiers are converted into the normal form NFKC while parsing; **comparison of identifiers is based on NFKC.**

In [12]:
# we can confirm that these are the same after applying NFKC
unicodedata.normalize("NFKC", "ﾊﾟｲｿﾝ") == unicodedata.normalize("NFKC", "パイソン")

True
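It can help to see that this equivalence is specific to the compatibility forms. In the sketch below, the same two spellings are written with escapes, and NFC alone leaves the half-width form untouched while NFKC maps it to the full-width one.

```python
import unicodedata

half_width = "\uff8a\uff9f\uff72\uff7f\uff9d"  # half-width kana, 5 codepoints
full_width = "\u30d1\u30a4\u30bd\u30f3"        # full-width kana, 4 codepoints

nfc = unicodedata.normalize("NFC", half_width)
nfkc = unicodedata.normalize("NFKC", half_width)

print(nfc == half_width)    # True: no canonical mapping applies
print(nfkc == full_width)   # True: compatibility mapping, then composition

# Another compatibility mapping: the "fi" ligature becomes two letters.
ligature = unicodedata.normalize("NFKC", "\ufb01")
print(ligature)  # fi
```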

## We also explored this behavior with an identifier inspired by an earlier example

In [13]:
çava = 42   # 4 codepoints (precomposed ç)
print(çava) # 5 codepoints (c + COMBINING CEDILLA)

42


## We aren't allowed to use just _anything_ as an identifier

Only some characters are allowed in identifiers, so we aren't allowed to use
emoji as names, for instance.

In [14]:
🐍 = 42

SyntaxError: invalid character '🐍' (U+1F40D) (2429941938.py, line 1)
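Rather than waiting for a `SyntaxError`, a candidate name can be checked up front with `str.isidentifier()`. The sketch below also prints the general category of the first codepoint, which hints at why each candidate passes or fails; the list of candidates is made up for illustration.

```python
import unicodedata

candidates = [
    "\u30d1\u30a4\u30bd\u30f3",  # full-width kana
    "\u00e7ava",                 # starts with LATIN SMALL LETTER C WITH CEDILLA
    "\U0001F40D",                # SNAKE emoji
    "2fast",                     # starts with a digit
]

results = {}
for name in candidates:
    results[name] = name.isidentifier()
    print(repr(name), results[name], unicodedata.category(name[0]), sep="\t")
```

The emoji's category is `So` (other symbol), which is not a letter category, so it cannot start an identifier.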