Python sorts sequences of any type by comparing the items in each
sequence one by one. For strings, this means comparing the code
points. Unfortunately, this produces unacceptable results for anyone
who uses non-ASCII characters.

In [2]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

In [3]:
sorted(fruits)

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']
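
To see why 'açaí' ends up after 'atemoia', it can help to look at the code points directly. The following is a minimal sketch; `code_points` is a hypothetical helper introduced only for illustration, and it assumes the accented letters are stored as single precomposed characters.

```python
def code_points(text):
    """Return the list of Unicode code points in text."""
    return [ord(char) for char in text]

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

# 'ç' is U+00E7 (231) and 't' is U+0074 (116), so 'aç...' sorts after 'at...'.
print(code_points('at'), code_points('aç'))

# Sorting by the explicit code point lists reproduces the default string order.
print(sorted(fruits, key=code_points) == sorted(fruits))
```

Any accented letter has a higher code point than every unaccented ASCII letter, which is why plain `sorted` pushes such words toward the end.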

The standard way to sort non-ASCII text in Python is to use the
locale.strxfrm function which, according to the locale module
docs, “transforms a string to one that can be used in locale-aware
comparisons.”

Example 4-19. locale_sort.py: using the locale.strxfrm function as
sort key

In [4]:
! export LC_ALL=C

In [5]:
import locale
my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
print(my_locale)

pt_BR.UTF-8


In [6]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)
print(sorted_fruits)

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
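
Two caveats apply to this approach: `locale.setlocale` changes a process-wide setting, and it raises `locale.Error` when the requested locale is not installed. The sketch below shows one possible way to handle both; `locale_sorted` is a hypothetical helper, not part of the `locale` module, and the fallback to plain code point order is an assumption about what a caller might want.

```python
import locale

def locale_sorted(words, locale_name='pt_BR.UTF-8'):
    """Return (sorted_words, used_locale).

    used_locale is False when locale_name is unavailable and the
    result falls back to plain code point order.
    """
    # Calling setlocale with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words), False
    try:
        return sorted(words, key=locale.strxfrm), True
    finally:
        # Restore the previous collation setting for the rest of the process.
        locale.setlocale(locale.LC_COLLATE, previous)

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
result, used_locale = locale_sorted(fruits)
print(result, used_locale)
```

The output depends on the locales installed on the machine, so no expected result is shown here.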


Sorting with the Unicode Collation Algorithm

Example 4-20. Using the pyuca.Collator.sort_key method

In [11]:
import pyuca
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits


['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

Example 4-21. cf.py: the character finder utility

In [13]:
import sys
import unicodedata
START, END = ord(' '), sys.maxunicode + 1

def find(*query_words, start=START, end=END):
    query = {w.upper() for w in query_words}
    for code in range(start, end):
        char = chr(code)
        name = unicodedata.name(char, None)
        if name and query.issubset(name.split()):
            print(f'U+{code:04X}\t{char}\t{name}')

def main(words):
    if words:
        find(*words)
    else:
        print('Please provide words to find.')

# main expects a sequence of words, so split the query string first.
main('I am Jalil'.split())


In [14]:
if __name__ == '__main__':
    main('I am Jalil'.split())
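
The matching rule in `find` is that every query word, uppercased, must appear as a whole word in the character's official Unicode name. The sketch below restates that rule as a generator so the matches can be inspected instead of printed; `iter_matches` is a hypothetical name, and the Latin-1 range limit is chosen only to keep the example fast.

```python
import sys
import unicodedata

def iter_matches(*query_words, start=ord(' '), end=sys.maxunicode + 1):
    """Yield (code, char, name) for characters whose name has all query words."""
    query = {word.upper() for word in query_words}
    for code in range(start, end):
        char = chr(code)
        name = unicodedata.name(char, None)  # None for unnamed code points
        if name and query.issubset(name.split()):
            yield code, char, name

# Search only up to U+00FF; the words must match whole words in the name.
for code, char, name in iter_matches('registered', 'sign', end=0x100):
    print(f'U+{code:04X}\t{char}\t{name}')
```

A query matches nothing when any one of its words is absent from every character name, in which case the loop simply produces no output.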
