# Text processing in a global world

Let's write a function that counts distinct words in a given sentence:
1. Case-insensitive comparison
2. Ignore punctuation
3. All numbers count as "__number__"
4. All "non-words" count as "__other__"

_All of the code below is for Python 3.6_

In [1]:
from collections import Counter
import re
import string

class WordCounter:
    # Translation table to remove punctuation
    punctuationTranslation = ''.maketrans('', '', string.punctuation)
    
    def CountWords( self, text ):
        counter = Counter()
        # Split text to words by spaces
        for word in text.split(' '):
            # Remove all punctuation
            word = word.translate(WordCounter.punctuationTranslation)
            # Words that consist only of letters count as words
            if re.match('^[a-zA-Z]+$', word):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif re.match('^[0-9]+$', word):
                counter['__number__'] += 1
            # Everything else goes to __other__ bucket
            elif len(word) > 0 :
                counter['__other__'] += 1
        return counter
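
The comment about `casefold` deserves a quick illustration. The following is a minimal sketch with sample words of my own choosing (they are not from the notebook): `str.casefold()` is more aggressive than `str.lower()`, so spellings that differ only by case-like variations end up under the same key.

```python
# casefold() applies Unicode case folding, which is intended for caseless matching
words = ['Straße', 'STRASSE', 'strasse']

lowered = {w.lower() for w in words}
folded = {w.casefold() for w in words}

print(lowered)  # two distinct keys: 'straße' and 'strasse'
print(folded)   # a single key: 'strasse'
```

With `lower()` the German word would be counted under two different keys; with `casefold()` all three spellings collapse into one.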

Let's test it

In [2]:
wordCounter = WordCounter()
wordCounter.CountWords( 'Hello, World!' )

Counter({'hello': 1, 'world': 1})

Let's test it some more

In [3]:
wordCounter.CountWords( 'I think version 1 of our function will work fine for i18n, no need to create version 2.' )

Counter({'__number__': 2,
         '__other__': 1,
         'create': 1,
         'fine': 1,
         'for': 1,
         'function': 1,
         'i': 1,
         'need': 1,
         'no': 1,
         'of': 1,
         'our': 1,
         'think': 1,
         'to': 1,
         'version': 2,
         'will': 1,
         'work': 1})

And one more test

In [4]:
wordCounter.CountWords( 'But my fiancée thinks that this function will not work even in English' )

Counter({'__other__': 1,
         'but': 1,
         'english': 1,
         'even': 1,
         'function': 1,
         'in': 1,
         'my': 1,
         'not': 1,
         'that': 1,
         'thinks': 1,
         'this': 1,
         'will': 1,
         'work': 1})

OK, we need to extend our set of letters.<br>
But we want to be truly international, so we need all possible diacritics.<br>
And the Cyrillic alphabet as well.<br>
And all kinds of Indian scripts.<br>
And the Thai language.<br>
And maybe a few more things…?
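
To see why `fiancée` ended up in `__other__`, here is a small check added for illustration (a sketch, not part of the original notebook). The ASCII-only character class rejects `é`, while Python's Unicode-aware `str.isalpha()` accepts it.

```python
import re

# 'fiancée' written with a precomposed é (U+00E9)
word = 'fianc\u00e9e'

# The pattern from the first WordCounter only knows the 26 ASCII letters
ascii_only = re.match('^[a-zA-Z]+$', word)

print(ascii_only)       # None: 'é' is outside a-z
print(word.isalpha())   # True: str.isalpha() is Unicode-aware
```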

## Unicode categories

[Wikipedia on Unicode categories](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category):<br>
The Unicode Standard assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes, like in line-breaking, script direction right-to-left or applying controls.<br>
Each code point is assigned a value for General Category. This is one of the character properties that are also defined for unassigned code points, and code points that are defined "not a character".

Letters: Lu, Ll, Lt, Lm, Lo<br>
Marks: Mn, Mc, Me<br>
Numbers: Nd, Nl, No<br>
Punctuation: Pc, Pd, Ps, Pe, Pi, Pf, Po<br>
Symbols: Sm, Sc, Sk, So<br>
Separators: Zs, Zl, Zp<br>
<br>
Other, control: Cc<br>
Other, format: Cf<br>
Other, surrogate: Cs<br>
Other, private use: Co<br>
Other, not assigned: Cn

In [7]:
import unicodedata
unicodedata.category('a')

'Ll'

In [8]:
unicodedata.category('1')

'Nd'

In [9]:
unicodedata.category(' ')

'Zs'
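
A few more lookups may help connect the category list to real characters. This is a small sketch with sample characters picked for illustration; it only uses `unicodedata.category` and `unicodedata.name` from the standard library.

```python
import unicodedata

# Sample characters chosen for illustration (escapes are used where the glyph is easy to confuse)
samples = ['A', '\u00e9', '\u042f', '\u0301', '\u0663', '\u2163', '\u00bd', '!', '-', '$', '\t']

# Map each character to its two-letter General Category
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, category in categories.items():
    # unicodedata.name() raises ValueError for unnamed characters such as TAB,
    # so a default value is passed as the second argument
    print(repr(ch), category, unicodedata.name(ch, '<no name>'))
```

Note that the combining acute accent (`'\u0301'`) is a mark (`Mn`), not a letter, and that `$` is a symbol (`Sc`), not punctuation.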

Let's build the Unicode sets we will work with

In [11]:
import unicodedata

def calculateUnicodeSets():
    punctuation = set()
    letters = set()
    numbers = set()
    spaces = set()
    control = set()
    # We go through the whole range of possible Unicode code points
    for i in range(0,0x110000):
        char = chr(i)
        category = unicodedata.category( char )
        # Punctuation is everything in P* category
        if( category.startswith('P') ):
            punctuation.add( char )
        # For our goal both letters (L*) and mark signs (M*) will be considered as letters
        elif( category.startswith('L') or category.startswith('M') ):
            letters.add( char )
        # N* goes to numbers
        elif( category.startswith('N') ):
            numbers.add( char )
        # Z* goes to spaces
        elif( category.startswith('Z') ):
            spaces.add( char )
        # We will need control (Cc) and format (Cf) characters a little bit later
        elif( category == 'Cc' or category == 'Cf' ):
            control.add( char )
    
    # TAB, CR and LF are in Cc category, but we will treat them as spaces
    spaces.add( '\t' )
    spaces.add( '\r' )
    spaces.add( '\n' )
    control.remove( '\t' )
    control.remove( '\r' )
    control.remove( '\n' )
    
    return (punctuation, letters, numbers, spaces, control)

In [12]:
(punctuation, letters, numbers, spaces, control) = calculateUnicodeSets()
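
Why treat marks (M*) as letters? A minimal sketch, assuming the text may arrive in decomposed form (NFD), where an accented letter is stored as a base letter followed by a combining mark:

```python
import unicodedata

composed = 'fianc\u00e9e'                            # é as a single code point (U+00E9)
decomposed = unicodedata.normalize('NFD', composed)  # 'e' followed by U+0301

print(len(composed), len(decomposed))                # 7 8
print([unicodedata.category(ch) for ch in decomposed])

# Without M*, the decomposed spelling would fail an "only letters" check
only_letters = all(unicodedata.category(ch)[0] == 'L' for ch in decomposed)
letters_or_marks = all(unicodedata.category(ch)[0] in 'LM' for ch in decomposed)
print(only_letters, letters_or_marks)                # False True

# The two spellings are still different strings unless they are normalized
print(composed == decomposed)                                 # False
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
```

If both spellings should be counted as the same word, normalizing the input (for example to NFC) before counting is one option; the notebook itself does not do this.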

OK, we need a universal set of letters, but this function looks like overkill.<br>
Why do we need a punctuation set when we already have string.punctuation?

In [15]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
''.join(sorted(punctuation))

'!"#%&\'()*,-./:;?@[\\]_{}¡§«¶·»¿;·՚՛՜՝՞՟։֊־׀׃׆׳״؉؊،؍؛؞؟٪٫٬٭۔܀܁܂܃܄܅܆܇܈܉܊܋܌܍߷߸߹࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾࡞।॥॰૰෴๏๚๛༄༅༆༇༈༉༊་༌།༎༏༐༑༒༔༺༻༼༽྅࿐࿑࿒࿓࿔࿙࿚၊။၌၍၎၏჻፠፡።፣፤፥፦፧፨᐀᙭᙮᚛᚜᛫᛬᛭᜵᜶។៕៖៘៙៚᠀᠁᠂᠃᠄᠅᠆᠇᠈᠉᠊᥄᥅᨞᨟᪠᪡᪢᪣᪤᪥᪦᪨᪩᪪᪫᪬᪭᭚᭛᭜᭝᭞᭟᭠᯼᯽᯾᯿᰻᰼᰽᰾᰿᱾᱿᳀᳁᳂᳃᳄᳅᳆᳇᳓‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‰‱′″‴‵‶‷‸‹›※‼‽‾‿⁀⁁⁂⁃⁅⁆⁇⁈⁉⁊⁋⁌⁍⁎⁏⁐⁑⁓⁔⁕⁖⁗⁘⁙⁚⁛⁜⁝⁞⁽⁾₍₎⌈⌉⌊⌋〈〉❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫⟬⟭⟮⟯⦃⦄⦅⦆⦇⦈⦉⦊⦋⦌⦍⦎⦏⦐⦑⦒⦓⦔⦕⦖⦗⦘⧘⧙⧚⧛⧼⧽⳹⳺⳻⳼⳾⳿⵰⸀⸁⸂⸃⸄⸅⸆⸇⸈⸉⸊⸋⸌⸍⸎⸏⸐⸑⸒⸓⸔⸕⸖⸗⸘⸙⸚⸛⸜⸝⸞⸟⸠⸡⸢⸣⸤⸥⸦⸧⸨⸩⸪⸫⸬⸭⸮⸰⸱⸲⸳⸴⸵⸶⸷⸸⸹⸺⸻⸼⸽⸾⸿⹀⹁⹂⹃⹄、。〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〽゠・꓾꓿꘍꘎꘏꙳꙾꛲꛳꛴꛵꛶꛷꡴꡵꡶꡷꣎꣏꣸꣹꣺꣼꤮꤯꥟꧁꧂꧃꧄꧅꧆꧇꧈꧉꧊꧋꧌꧍꧞꧟꩜꩝꩞꩟꫞꫟꫰꫱꯫﴾﴿︐︑︒︓︔︕︖︗︘︙︰︱︲︳︴︵︶︷︸︹︺︻︼︽︾︿﹀﹁﹂﹃﹄﹅﹆﹇﹈﹉﹊﹋﹌﹍﹎﹏﹐﹑﹒﹔﹕﹖﹗﹘﹙﹚﹛﹜﹝﹞﹟﹠﹡﹣﹨﹪﹫！＂＃％＆＇（）＊，－．／：；？＠［＼］＿｛｝｟｠｡｢｣､･𐄀𐄁𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐𐩑𐩒𐩓𐩔𐩕𐩖𐩗𐩘𐩿𐫰𐫱𐫲𐫳𐫴𐫵𐫶𐬹𐬺𐬻𐬼𐬽𐬾𐬿𐮙𐮚𐮛𐮜𑁇𑁈𑁉𑁊𑁋𑁌𑁍𑂻𑂼𑂾𑂿𑃀𑃁𑅀𑅁𑅂𑅃𑅴𑅵𑇅𑇆𑇇𑇈𑇉𑇍𑇛𑇝𑇞𑇟𑈸𑈹𑈺𑈻𑈼𑈽𑊩𑑋𑑌𑑍𑑎𑑏𑑛𑑝𑓆𑗁𑗂𑗃𑗄𑗅𑗆𑗇𑗈𑗉𑗊𑗋𑗌𑗍𑗎𑗏𑗐𑗑𑗒𑗓𑗔𑗕𑗖𑗗𑙁𑙂𑙃𑙠𑙡𑙢𑙣𑙤𑙥𑙦𑙧𑙨𑙩𑙪𑙫𑙬𑜼𑜽𑜾𑱁𑱂𑱃𑱄𑱅𑱰𑱱𒑰𒑱𒑲𒑳𒑴𖩮𖩯𖫵𖬷𖬸𖬹𖬺𖬻𖭄𛲟𝪇𝪈𝪉𝪊𝪋𞥞𞥟'
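
The two sets differ in both directions. Besides the many non-ASCII characters above, some members of `string.punctuation` are missing from the Unicode-based set because Unicode classifies them as symbols (S*). A short sketch to list them:

```python
import string
import unicodedata

# Characters from string.punctuation whose General Category is not P*
ascii_symbols = ''.join(
    ch for ch in string.punctuation
    if not unicodedata.category(ch).startswith('P')
)

print(ascii_symbols)  # $+<=>^`|~
for ch in ascii_symbols:
    print(ch, unicodedata.category(ch))
```

As a consequence, a counter that strips only P* characters will not strip `$` or `+`, so a token such as `$100` would no longer reduce to a plain number. Whether that is desirable depends on the application.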

But what about numbers? Surely there are still only 10 of them?<br>
Depending on your application, you may want to treat the different Number categories (Nd, Nl, No) differently

In [18]:
''.join(sorted(numbers))

'0123456789²³¹¼½¾٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯৴৵৶৷৸৹੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯୲୳୴୵୶୷௦௧௨௩௪௫௬௭௮௯௰௱௲౦౧౨౩౪౫౬౭౮౯౸౹౺౻౼౽౾೦೧೨೩೪೫೬೭೮೯൘൙൚൛൜൝൞൦൧൨൩൪൫൬൭൮൯൰൱൲൳൴൵൶൷൸෦෧෨෩෪෫෬෭෮෯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩༪༫༬༭༮༯༰༱༲༳၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙፩፪፫፬፭፮፯፰፱፲፳፴፵፶፷፸፹፺፻፼ᛮᛯᛰ០១២៣៤៥៦៧៨៩៰៱៲៳៴៵៶៷៸៹᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᧚᪀᪁᪂᪃᪄᪅᪆᪇᪈᪉᪐᪑᪒᪓᪔᪕᪖᪗᪘᪙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿↀↁↂↅↆↇↈ↉①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳⑴⑵⑶⑷⑸⑹⑺⑻⑼⑽⑾⑿⒀⒁⒂⒃⒄⒅⒆⒇⒈⒉⒊⒋⒌⒍⒎⒏⒐⒑⒒⒓⒔⒕⒖⒗⒘⒙⒚⒛⓪⓫⓬⓭⓮⓯⓰⓱⓲⓳⓴⓵⓶⓷⓸⓹⓺⓻⓼⓽⓾⓿❶❷❸❹❺❻❼❽❾❿➀➁➂➃➄➅➆➇➈➉➊➋➌➍➎➏➐➑➒➓⳽〇〡〢〣〤〥〦〧〨〩〸〹〺㆒㆓㆔㆕㈠㈡㈢㈣㈤㈥㈦㈧㈨㈩㉈㉉㉊㉋㉌㉍㉎㉏㉑㉒㉓㉔㉕㉖㉗㉘㉙㉚㉛㉜㉝㉞㉟㊀㊁㊂㊃㊄㊅㊆㊇㊈㊉㊱㊲㊳㊴㊵㊶㊷㊸㊹㊺㊻㊼㊽㊾㊿꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩ꛦꛧꛨꛩꛪꛫꛬꛭꛮꛯ꠰꠱꠲꠳꠴꠵꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꧐꧑꧒꧓꧔꧕꧖꧗꧘꧙꧰꧱꧲꧳꧴꧵꧶꧷꧸꧹꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹０１２３４５６７８９𐄇𐄈𐄉𐄊𐄋𐄌𐄍𐄎𐄏𐄐𐄑𐄒𐄓𐄔𐄕𐄖𐄗𐄘𐄙𐄚𐄛𐄜𐄝𐄞𐄟𐄠𐄡𐄢𐄣𐄤𐄥𐄦𐄧𐄨𐄩𐄪𐄫𐄬𐄭𐄮𐄯𐄰𐄱𐄲𐄳𐅀𐅁𐅂𐅃𐅄𐅅𐅆𐅇𐅈𐅉𐅊𐅋𐅌𐅍𐅎𐅏𐅐𐅑𐅒𐅓𐅔𐅕𐅖𐅗𐅘𐅙𐅚𐅛𐅜𐅝𐅞𐅟𐅠𐅡𐅢𐅣𐅤𐅥𐅦𐅧𐅨𐅩𐅪𐅫𐅬𐅭𐅮𐅯𐅰𐅱𐅲𐅳𐅴𐅵𐅶𐅷𐅸𐆊𐆋𐋡𐋢𐋣𐋤𐋥𐋦𐋧𐋨𐋩𐋪𐋫𐋬𐋭𐋮𐋯𐋰𐋱𐋲𐋳𐋴𐋵𐋶𐋷𐋸𐋹𐋺𐋻𐌠𐌡𐌢𐌣𐍁𐍊𐏑𐏒𐏓𐏔𐏕𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩𐡘𐡙𐡚𐡛𐡜𐡝𐡞𐡟𐡹𐡺𐡻𐡼𐡽𐡾𐡿𐢧𐢨𐢩𐢪𐢫𐢬𐢭𐢮𐢯𐣻𐣼𐣽𐣾𐣿𐤖𐤗𐤘𐤙𐤚𐤛𐦼𐦽𐧀𐧁𐧂𐧃𐧄𐧅𐧆𐧇𐧈𐧉𐧊𐧋𐧌𐧍𐧎𐧏𐧒𐧓𐧔𐧕𐧖𐧗𐧘𐧙𐧚𐧛𐧜𐧝𐧞𐧟𐧠𐧡𐧢𐧣𐧤𐧥𐧦𐧧𐧨𐧩𐧪𐧫𐧬𐧭𐧮𐧯𐧰𐧱𐧲𐧳𐧴𐧵𐧶𐧷𐧸𐧹𐧺𐧻𐧼𐧽𐧾𐧿𐩀𐩁𐩂𐩃𐩄𐩅𐩆𐩇𐩽𐩾𐪝𐪞𐪟
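
One way to treat the Number categories differently is to ask `unicodedata` for the value of a character. The sketch below uses sample characters chosen for illustration: only decimal digits (Nd) have a `decimal` value, while letter-like and other numbers (Nl, No) only have a `numeric` value.

```python
import unicodedata

# '7', ARABIC-INDIC DIGIT THREE, VULGAR FRACTION ONE HALF, ROMAN NUMERAL FOUR
samples = ['7', '\u0663', '\u00bd', '\u2163']

# For each character: (category, decimal value or None, numeric value or None)
info = {
    ch: (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.numeric(ch, None),
    )
    for ch in samples
}

for ch, (category, decimal, numeric) in info.items():
    print(repr(ch), category, decimal, numeric)

# int() accepts any Nd digits, not only ASCII ones
arabic_indic = int('\u0663\u0664')
print(arabic_indic)  # 34
```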

What about spaces?

In [19]:
len(spaces)

22

In [20]:
''.join(sorted(spaces))

'\t\n\r \xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
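
This is also why `text.split(' ')` from the first version is too narrow. A quick comparison (a sketch with a made-up input string) of splitting on the ASCII space versus Python's built-in whitespace splitting:

```python
# TAB, LF, two ASCII spaces and a NO-BREAK SPACE (U+00A0) separate the tokens
text = 'Hello,\t\n  World!\u00a0123'

by_ascii_space = text.split(' ')   # splits on the ASCII space only
by_whitespace = text.split()       # splits on runs of Unicode whitespace

print(by_ascii_space)  # ['Hello,\t\n', '', 'World!\xa0123']
print(by_whitespace)   # ['Hello,', 'World!', '123']
```

Note that `str.split()` relies on Python's own notion of whitespace (`str.isspace`), which overlaps with the set built above but is not identical to it; for instance, it also treats a few control characters such as `'\x0b'` and `'\x0c'` as whitespace.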

Just for reference, how many letters do we have?

In [21]:
len(letters)

118863

Let's get back to our word counter.

In [69]:
from collections import Counter
import re

class WordCounter:
    def __init__(self):
        # Get all our unicode sets
        (self.punctuation, self.letters, self.numbers, self.spaces, self.control) = calculateUnicodeSets()
        # Translation table to remove punctuation
        self.punctuationTranslation = ''.maketrans('', '', ''.join(self.punctuation))
        # Translation table to remove control characters
        self.controlTranslation = ''.maketrans('', '', ''.join(self.control))
        # Regex to find all whitespaces
        self.whitespacesRegex = '|'.join(map(re.escape, self.spaces))
    
    def CountWords( self, text ):
        counter = Counter()
        # Remove control characters from string
        text = text.translate(self.controlTranslation)
        # Split text into words on whitespace (maxsplit defaults to 0, i.e. no limit)
        for word in re.split(self.whitespacesRegex, text):
            # Remove all punctuation
            word = word.translate(self.punctuationTranslation)
            if len(word) == 0:
                # Skip empty words
                continue
            # Build the set of characters in the word so it is easy to compare against the letter and number sets
            wordSet = set(word)
            # Words that consist only of letters count as words
            if wordSet.issubset( self.letters ):
                # Using casefold to get a more stable case-insensitive variant
                counter[word.casefold()] += 1
            # Words that consist only of digits count as __number__
            elif wordSet.issubset( self.numbers ):
                counter['__number__'] += 1
            # Everything else goes to __other__ bucket
            else:
                counter['__other__'] += 1
        return counter

In [70]:
wordCounter = WordCounter()
wordCounter.CountWords( 'Hello, World!' )

Counter({'hello': 1, 'world': 1})

In [71]:
wordCounter.CountWords( 'Hello,\t\n  World! 123' )

Counter({'__number__': 1, 'hello': 1, 'world': 1})
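
For comparison, the same rules can be sketched without precomputing the sets, by looking up the category of each character on the fly. The function `count_words` below is a hypothetical, self-contained variant written for illustration, not the notebook's class. It also differs in one detail: it splits with `str.split()` before dropping control characters, so a few control characters that Python treats as whitespace act as separators instead of being removed.

```python
from collections import Counter
import unicodedata

def count_words(text):
    """Hypothetical compact variant of WordCounter.CountWords (illustration only)."""
    counter = Counter()
    # str.split() with no arguments splits on runs of Unicode whitespace
    for raw in text.split():
        # Drop punctuation (P*) and control/format characters (Cc, Cf)
        word = ''.join(
            ch for ch in raw
            if unicodedata.category(ch)[0] != 'P'
            and unicodedata.category(ch) not in ('Cc', 'Cf')
        )
        if not word:
            # Skip tokens that consisted only of removed characters
            continue
        # Major categories (first letter of the category) of the remaining characters
        majors = {unicodedata.category(ch)[0] for ch in word}
        if majors <= {'L', 'M'}:
            # Only letters and marks: count as a word, case-insensitively
            counter[word.casefold()] += 1
        elif majors == {'N'}:
            # Only numeric characters
            counter['__number__'] += 1
        else:
            counter['__other__'] += 1
    return counter

# The sentence that defeated the first version ('fiancée' with a precomposed é)
english = count_words('But my fianc\u00e9e thinks that this function will not work')
print(english)

# A Cyrillic example
russian = count_words('Привет, мир! 123 i18n')
print(russian)
```

Here `fiancée` is counted as a regular word, and the Cyrillic words are case-folded like any others. Per-character lookups trade the upfront cost of scanning all code points for a bit more work per word.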