# Simple tokenization with regular expressions
We'll tokenize part of the Penn Treebank Wall Street Journal section, a standard corpus in NLP, starting with Python's `split` method. First, we need to get the file `wsj-short.txt`.

In [1]:
!wget "http://verbs.colorado.edu/~mahu0110/teaching/ling5832-2018/data/wsj-short.txt"

--2018-01-18 10:31:45--  http://verbs.colorado.edu/~mahu0110/teaching/ling5832-2018/data/wsj-short.txt
Resolving verbs.colorado.edu... 128.138.73.54
Connecting to verbs.colorado.edu|128.138.73.54|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 503759 (492K) [text/plain]
Saving to: ‘wsj-short.txt.1’

2018-01-18 10:31:46 (439 KB/s) - ‘wsj-short.txt.1’ saved [503759/503759]

Now we can read all the lines of the file into a list. Since the file has one sentence per line, we get a list with one sentence per entry.

In [2]:
with open("wsj-short.txt") as f:  # the context manager closes the file when done
    lines = [line.strip() for line in f]  # one sentence per list entry

In [3]:
lines[0] # sanity check, first line looks like this

'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'

In [5]:
l = lines[0]
l

'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'

In [6]:
l.split()

['Pierre',
 'Vinken,',
 '61',
 'years',
 'old,',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29.']

Since Python's `split` method only accepts a single fixed separator string, not a set of alternative delimiters, we need to resort to regular expressions.
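
To see the limitation concretely, here is a minimal sketch using the first sentence of the corpus (the variable names are illustrative only). `str.split` does take an optional separator, but it is matched as one literal string, so it cannot express "comma, period, or space":

```python
sentence = "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."

# A literal separator is matched exactly: this only breaks at comma-plus-space.
by_comma = sentence.split(", ")
print(by_comma)

# Plain whitespace splitting leaves punctuation glued to the preceding word.
by_space = sentence.split()
print(by_space[1])
```

Neither call separates words from all the punctuation marks in one pass, which is what a character class in a regular expression can do.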

In [7]:
import re

In [8]:
re.split('[,. ]+', l)

['Pierre',
 'Vinken',
 '61',
 'years',
 'old',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov',
 '29',
 '']
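
Note the empty string at the end of the result: the sentence ends in a period, which is itself a delimiter, so `re.split` reports the (empty) text after it. A small sketch of one way to drop such empty entries, assuming we simply want to discard them:

```python
import re

sentence = "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."

parts = re.split(r"[,. ]+", sentence)

# Empty strings are falsy, so this keeps only the non-empty pieces.
tokens = [p for p in parts if p]
print(tokens)
```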

Splitting, however, has the unfortunate side effect of discarding the delimiters. We'd like to keep delimiters such as punctuation marks and have them be separate tokens. To do that, we use `re.findall`. With `findall`, the logic is reversed: instead of describing what to split on, we describe what to keep. So we try to write a regular expression that captures what a *token* should look like.

In [11]:
re.findall(r"[\w']+|[.,!?;]+", l)  # raw string, so \w reaches the regex engine unchanged

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov',
 '.',
 '29',
 '.']
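
In the output above, `Nov.` comes out as two tokens, `'Nov'` and `'.'`, although the period belongs to the abbreviation. One way to keep it together is to list the abbreviation as its own alternative. The order matters, because alternation tries the alternatives from left to right at each position. A minimal sketch on an excerpt of the sentence:

```python
import re

text = "director Nov. 29."

# Generic word pattern first: it consumes "Nov" before the abbreviation is tried.
word_first = re.findall(r"[\w']+|Nov\.|[.,!?;]+", text)

# Abbreviation first: "Nov." is matched as a whole, period included.
abbrev_first = re.findall(r"Nov\.|[\w']+|[.,!?;]+", text)

print(word_first)
print(abbrev_first)
```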

In [31]:
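all_tokens = []      # every token in the corpus, in order
token_counter = 0    # running total of tokens
# Month abbreviations come first so that their trailing period stays attached.
# Note that "May\." also matches the full month name at the end of a sentence.
pattern = r"Jan\.|Feb\.|Mar\.|May\.|Jun\.|Jul\.|Aug\.|Sep\.|Oct\.|Nov\.|Dec\.|[\w']+|[.,!?;]+"
for line in lines:
    tokens_in_line = re.findall(pattern, line)
    all_tokens.extend(tokens_in_line)
    token_counter += len(tokens_in_line)

print(token_counter)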

94632
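
The long alternation can also be assembled from a list, which makes it easier to extend. The following is only a sketch: the names `ABBREVIATIONS`, `TOKEN_PATTERN`, and `tokenize` are hypothetical, and the list simply mirrors the month abbreviations used above. `re.escape` takes care of turning each `.` into a literal dot.

```python
import re

# Abbreviations whose trailing period should stay attached (illustrative list).
ABBREVIATIONS = ["Jan.", "Feb.", "Mar.", "May.", "Jun.", "Jul.",
                 "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]

# Escaped abbreviations go first so they win over the generic word pattern.
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS) + r"|[\w']+|[.,!?;]+"
)

def tokenize(sentence):
    """Return the tokens of one sentence as a list of strings."""
    return TOKEN_PATTERN.findall(sentence)

sentence = "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."
print(tokenize(sentence))
```

Compiling the pattern once avoids rebuilding the pattern string for every line of the corpus.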


In [30]:
tokens_in_line = re.findall(r"Nov\.|[\w']+|[.,!?;]+", lines[0])  # \. matches a literal period only
print(tokens_in_line)

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
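
One detail worth checking in patterns like this: an unescaped `.` matches any character, so `Nov.` would also match the beginning of an ordinary word. A small sketch with a made-up example string shows the difference:

```python
import re

text = "Novel ideas"

# Unescaped dot: "Nov." matches "Nove", cutting the word in two.
unescaped = re.findall(r"Nov.|[\w']+", text)

# Escaped dot: only a literal "Nov." counts as the abbreviation.
escaped = re.findall(r"Nov\.|[\w']+", text)

print(unescaped)
print(escaped)
```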
