# Tokenizing and Counting Words
Author: Pierre Nugues

# Imports

In [1]:
import math
import regex as re
import sys


## Tokenization

Tokenization has no unique solution. Let us explore a few possible strategies.

First, let us take a sample text:

In [2]:
text = """Tell me, O muse, of that ingenious hero who
travelled far and wide after he had sacked the famous
town of Troy."""


### Using content

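A first tokenizer: a token is a maximal sequence of letters, matched by `\p{L}+`. Everything else (spaces, punctuation, digits) is discarded.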

In [3]:
pattern1 = r'\p{L}+'


In [4]:
re.findall(pattern1, text)


['Tell',
 'me',
 'O',
 'muse',
 'of',
 'that',
 'ingenious',
 'hero',
 'who',
 'travelled',
 'far',
 'and',
 'wide',
 'after',
 'he',
 'had',
 'sacked',
 'the',
 'famous',
 'town',
 'of',
 'Troy']

Let us also keep the other characters: any run of characters that are neither whitespace nor letters becomes a token.

In [5]:
pattern2 = r'\p{L}+|[^\s\p{L}]+'


In [6]:
re.findall(pattern2, text)


['Tell',
 'me',
 ',',
 'O',
 'muse',
 ',',
 'of',
 'that',
 'ingenious',
 'hero',
 'who',
 'travelled',
 'far',
 'and',
 'wide',
 'after',
 'he',
 'had',
 'sacked',
 'the',
 'famous',
 'town',
 'of',
 'Troy',
 '.']

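We now treat sequences of digits as tokens of their own. The sample text contains no digits, so the output is unchanged.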

In [7]:
pattern3 = r'\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+'


In [8]:
re.findall(pattern3, text)


['Tell',
 'me',
 ',',
 'O',
 'muse',
 ',',
 'of',
 'that',
 'ingenious',
 'hero',
 'who',
 'travelled',
 'far',
 'and',
 'wide',
 'after',
 'he',
 'had',
 'sacked',
 'the',
 'famous',
 'town',
 'of',
 'Troy',
 '.']

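Finally, we make each punctuation mark a separate token. The output is the same as before, since the sample text contains no consecutive punctuation marks.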

In [9]:
pattern4 = r'\p{L}+|\p{N}+|\p{P}|[^\s\p{L}\p{N}\p{P}]+'


In [10]:
re.findall(pattern4, text)


['Tell',
 'me',
 ',',
 'O',
 'muse',
 ',',
 'of',
 'that',
 'ingenious',
 'hero',
 'who',
 'travelled',
 'far',
 'and',
 'wide',
 'after',
 'he',
 'had',
 'sacked',
 'the',
 'famous',
 'town',
 'of',
 'Troy',
 '.']
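
The sample sentence contains neither digits nor symbols, so `pattern2`, `pattern3`, and `pattern4` all return the same list. The sketch below applies the three patterns to a made-up sentence (the string `sample` is an assumption for illustration, not part of the notebook's data) to make the differences visible.

```python
import regex as re  # third-party module, as in the imports at the top

# Hypothetical sentence with digits, a currency symbol, and repeated punctuation
sample = 'It cost $35.50, really?!'

pattern2 = r'\p{L}+|[^\s\p{L}]+'
pattern3 = r'\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+'
pattern4 = r'\p{L}+|\p{N}+|\p{P}|[^\s\p{L}\p{N}\p{P}]+'

tokens2 = re.findall(pattern2, sample)  # non-letters stay glued together
tokens3 = re.findall(pattern3, sample)  # digits are split from symbols
tokens4 = re.findall(pattern4, sample)  # each punctuation mark stands alone

print(tokens2)  # ['It', 'cost', '$35.50,', 'really', '?!']
print(tokens3)  # ['It', 'cost', '$', '35', '.', '50', ',', 'really', '?!']
print(tokens4)  # ['It', 'cost', '$', '35', '.', '50', ',', 'really', '?', '!']
```

Note that `$` is a symbol (`\p{S}`), not a punctuation mark, so in `pattern4` it is caught by the last alternative.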

### Using boundaries: A first tokenizer

In [11]:
pattern5 = r'\s+'


In [12]:
re.split(pattern5, text)


['Tell',
 'me,',
 'O',
 'muse,',
 'of',
 'that',
 'ingenious',
 'hero',
 'who',
 'travelled',
 'far',
 'and',
 'wide',
 'after',
 'he',
 'had',
 'sacked',
 'the',
 'famous',
 'town',
 'of',
 'Troy.']

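To keep the punctuation, we first surround each run of symbols and punctuation marks with spaces, then split on whitespace. The final period is followed by an inserted space, hence the empty string at the end of the list.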

In [13]:
pattern6 = r'([\p{S}\p{P}]+)'


In [14]:
re.split(
    pattern5,
    re.sub(pattern6, r' \1 ', text))


['Tell',
 'me',
 ',',
 'O',
 'muse',
 ',',
 'of',
 'that',
 'ingenious',
 'hero',
 'who',
 'travelled',
 'far',
 'and',
 'wide',
 'after',
 'he',
 'had',
 'sacked',
 'the',
 'famous',
 'town',
 'of',
 'Troy',
 '.',
 '']

In [15]:
list(filter(None, re.split(
    pattern5,
    re.sub(pattern6, r' \1 ', text))))


['Tell',
 'me',
 ',',
 'O',
 'muse',
 ',',
 'of',
 'that',
 'ingenious',
 'hero',
 'who',
 'travelled',
 'far',
 'and',
 'wide',
 'after',
 'he',
 'had',
 'sacked',
 'the',
 'famous',
 'town',
 'of',
 'Troy',
 '.']
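
The empty string comes from `re.split()`: when the string ends with a separator, the split yields an empty final field, which `filter(None, ...)` then discards. As a possible alternative, `str.split()` called without arguments splits on runs of whitespace and does not return empty strings. The sketch below uses a short made-up string, `padded`, and the standard `re` module, which is sufficient for `\s+`.

```python
import re  # the standard module is enough for the \s+ pattern

# Hypothetical string ending with a space, as the substitution would produce
padded = 'town of Troy . '

with_empty = re.split(r'\s+', padded)      # trailing separator -> final ''
filtered = list(filter(None, with_empty))  # drop the empty strings
no_empty = padded.split()                  # whitespace split, no empty strings

print(with_empty)  # ['town', 'of', 'Troy', '.', '']
print(filtered)    # ['town', 'of', 'Troy', '.']
print(no_empty)    # ['town', 'of', 'Troy', '.']
```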

## Reading a corpus

In [16]:
import requests
text_copyright = requests.get(
    'http://classics.mit.edu/Homer/iliad.mb.txt').text


This text includes a copyright notice that we want to exclude from the counts.

In [17]:
text_copyright[:70]


'Provided by The Internet Classics Archive.\nSee bottom for copyright. A'

We keep only the text between the first and the last dashed lines, which removes the copyright header and footer.

In [18]:
text = re.search(r'^-+$(.+)^-+$',
                 text_copyright, re.M | re.S).group(1).strip()
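
To see how this pattern behaves, here is a minimal sketch on a made-up miniature of the file (the string `sample` is an assumption, not the real download). The `re.M` flag lets `^` and `$` match at line boundaries, `re.S` lets `.` match newlines, and the greedy `.+` extends up to the last dashed line. The sketch also checks for a failed match, since `re.search()` returns `None` in that case.

```python
import re  # the standard module handles this pattern

# Made-up miniature: header, dashed line, body, dashed line, footer
sample = ('Provided by an archive.\n'
          '----------\n'
          '\n'
          'BOOK I\n'
          '\n'
          'Sing, O goddess\n'
          '\n'
          '----------\n'
          'Copyright notice\n')

match = re.search(r'^-+$(.+)^-+$', sample, re.M | re.S)
if match is None:
    raise ValueError('no pair of dashed lines found')
body = match.group(1).strip()  # text between the two dashed lines

print(repr(body))  # 'BOOK I\n\nSing, O goddess'
```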


In [19]:
text


'BOOK I\n\nSing, O goddess, the anger of Achilles son of Peleus, that brought\ncountless ills upon the Achaeans. Many a brave soul did it send hurrying\ndown to Hades, and many a hero did it yield a prey to dogs and vultures,\nfor so were the counsels of Jove fulfilled from the day on which the\nson of Atreus, king of men, and great Achilles, first fell out with\none another. \n\nAnd which of the gods was it that set them on to quarrel? It was the\nson of Jove and Leto; for he was angry with the king and sent a pestilence\nupon the host to plague the people, because the son of Atreus had\ndishonoured Chryses his priest. Now Chryses had come to the ships\nof the Achaeans to free his daughter, and had brought with him a great\nransom: moreover he bore in his hand the sceptre of Apollo wreathed\nwith a suppliant\'s wreath and he besought the Achaeans, but most of\nall the two sons of Atreus, who were their chiefs. \n\n"Sons of Atreus," he cried, "and all other Achaeans, may the gods\nwho 

Remove the triple quotes to use a corpus of novels by Selma Lagerlöf instead.

In [20]:
"""
file_name = '../../corpus/Selma.txt'
text = open(file_name).read().strip()
text[:100]
"""


"\nfile_name = '../../corpus/Selma.txt'\ntext = open(file_name).read().strip()\ntext[:100]\n"

## Counting and sorting

We redefine the tokenizer as a function. By default, it extracts the sequences of letters.

In [21]:
def tokenize(text, pattern=r'\p{L}+'):
    words = re.findall(pattern, text)
    return words


A function to count the words

In [22]:
def count_unigrams(words):
    frequency = {}
    for word in words:
        if word in frequency:
            frequency[word] += 1
        else:
            frequency[word] = 1
    return frequency
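
As a side note, the `if`/`else` test can be folded into a single line with `dict.get()`, which returns a default value when the key is absent. The function name `count_unigrams_get` and the word list are hypothetical, chosen here only for illustration.

```python
def count_unigrams_get(words):
    """Count the words with dict.get(), a variant of count_unigrams."""
    frequency = {}
    for word in words:
        # get() returns 0 when the word has not been seen yet
        frequency[word] = frequency.get(word, 0) + 1
    return frequency


# Hypothetical word list
sample_words = ['the', 'anger', 'of', 'achilles', 'son', 'of', 'peleus']
sample_freq = count_unigrams_get(sample_words)
print(sample_freq['of'])  # 2
```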


We analyze the text: we lowercase it, tokenize it, count the words, and print the ten most frequent ones.

In [23]:
words = tokenize(text.lower())
frequency = count_unigrams(words)
for word in sorted(frequency.keys(), key=frequency.get, reverse=True)[:10]:
    print(word, '\t', frequency[word])


the 	 9948
and 	 6624
of 	 5606
to 	 3329
he 	 2905
his 	 2537
in 	 2242
him 	 1868
you 	 1810
a 	 1807
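
Python's sort is stable, so with `key=frequency.get` words that have equal counts stay in the order in which they were inserted into the dictionary. If a deterministic order is preferred, one option is to sort on a pair: decreasing count first, then the word itself. The counts below are made up for the example.

```python
# Hypothetical counts with a tie between 'him' and 'you'
sample_freq = {'you': 3, 'the': 7, 'him': 3, 'and': 5}

# Sort by decreasing count, then alphabetically to break ties
ranked = sorted(sample_freq.items(), key=lambda item: (-item[1], item[0]))

for word, count in ranked:
    print(word, '\t', count)
```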


## Using `Counter`
`Counter` is a class from the `collections` module that counts the items of a list or of any other iterable.

In [24]:
from collections import Counter

counter = Counter(words)


A `Counter` object is a dictionary (`Counter` is a subclass of `dict`):

In [25]:
counter['hector']


480

Unlike an ordinary dictionary, it does not raise an exception for a missing key; it returns 0:

In [26]:
counter['computer']


0

In [27]:
counter.most_common(10)


[('the', 9948),
 ('and', 6624),
 ('of', 5606),
 ('to', 3329),
 ('he', 2905),
 ('his', 2537),
 ('in', 2242),
 ('him', 1868),
 ('you', 1810),
 ('a', 1807)]

In [28]:
'electronics' in counter


False

In [29]:
'computer' in counter


False
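
The two membership tests show that looking up a missing word in a `Counter` returns 0 without storing the word. A small sketch on a made-up word list (the names below are hypothetical) confirms that the number of keys does not change:

```python
from collections import Counter

# Hypothetical miniature word list
sample_counter = Counter(['hector', 'achilles', 'hector'])
size_before = len(sample_counter)

missing_count = sample_counter['computer']  # returns 0, no KeyError
size_after = len(sample_counter)            # the lookup stored nothing

print(missing_count, size_before, size_after)  # 0 2 2
```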

## `defaultdict`
If you use `defaultdict`, be aware that the dictionary creates a key as soon as you access it with the `[]` notation, even if you only read the value. We show the difference with an ordinary dictionary.

We take the word _computer_ to see what the consequences could be on Homer's corpus. _Computer_ is a word he probably never heard of, but accessing a `defaultdict` will let him know.

An ordinary dictionary

In [None]:
cnt_ordinary_dict = dict()

A dictionary with `defaultdict()`

In [36]:
from collections import defaultdict

cnt_def_dict = defaultdict(int)



Both dictionaries are empty:

In [37]:
'computer' in cnt_def_dict


False

In [38]:
'computer' in cnt_ordinary_dict


False

Now we access the key with `get()`. For a missing key, it returns `None` with both dictionaries, hence the absence of output:

In [39]:
cnt_def_dict.get('computer')


In [40]:
cnt_ordinary_dict.get('computer')


In [35]:
'computer' in cnt_def_dict


False

So far so good, but...

In [41]:
cnt_def_dict['computer']


0

In [42]:
cnt_ordinary_dict['computer']


KeyError: 'computer'

And...

In [43]:
'computer' in cnt_def_dict


True

In [44]:
'computer' in cnt_ordinary_dict


False

Accessing the `defaultdict` with the `[]` notation has created a key. This is not the case with a normal dictionary.
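
In a word-counting setting, such a spurious key inflates the vocabulary size. The sketch below, on a made-up word list with hypothetical variable names, contrasts a read with `get()`, which leaves the dictionary untouched, and a read with `[]`, which adds the key.

```python
from collections import defaultdict

# Hypothetical miniature vocabulary
counts = defaultdict(int)
for word in ['hector', 'achilles', 'hector']:
    counts[word] += 1

size_start = len(counts)             # 2 distinct words
via_get = counts.get('computer', 0)  # read without creating the key
size_after_get = len(counts)         # still 2
via_brackets = counts['computer']    # this read creates the key
size_after_brackets = len(counts)    # now 3

print(size_start, size_after_get, size_after_brackets)  # 2 2 3
```

When only reading counts, `get()` or a membership test with `in` avoids this side effect.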