Tokenisation
============
Tokenisation in Caterpillar is currently quite slow. Most of the cost comes from how the tokenisers are invoked rather than from the tokenisers themselves, although the tokenisers are not blameless either.

Currently, text is tokenised into paragraphs, then each paragraph is tokenised into frames using the NLTK sentence tokeniser, then each frame is tokenised into words. This means the sentence tokeniser is invoked once per paragraph and the word tokeniser once per frame, and every invocation carries significant overhead.

An alternative approach would be to invoke each tokeniser once for each body of text. This would require recording the token boundaries produced by each tokeniser, then making a single pass over the text with the boundary lists from all tokenisers to produce a token stream.
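The boundary-recording idea can be sketched on a toy scale as follows. This is only an illustration under stated assumptions: the three regular expressions are hypothetical stand-ins, not Caterpillar's tokenisers or NLTK's Punkt model, and `boundaries` and `token_stream` are names invented for this sketch. Each pattern is run once over the whole text to collect end offsets, and a single pass over the words then emits any boundary that has been crossed.

```python
import re

# Hypothetical stand-ins for the real tokenisers (assumptions for this sketch).
PARAGRAPH_BREAK = re.compile(r'\r?\n(?:\r?\n)+')
SENTENCE_END = re.compile(r'[.!?]+\s+')  # crude; nothing like Punkt
WORD = re.compile(r'\w+', re.UNICODE)


def boundaries(pattern, text):
    """Run the pattern once over the text and return the end offset of each match."""
    return [m.end() for m in pattern.finditer(text)]


def token_stream(text):
    """Yield (kind, value) events in a single pass over the words."""
    sent_ends = boundaries(SENTENCE_END, text)
    para_ends = boundaries(PARAGRAPH_BREAK, text)
    s = p = 0
    for m in WORD.finditer(text):
        # Emit every boundary that lies at or before the start of this word.
        while s < len(sent_ends) and sent_ends[s] <= m.start():
            yield ('sentence_end', sent_ends[s])
            s += 1
        while p < len(para_ends) and para_ends[p] <= m.start():
            yield ('paragraph_end', para_ends[p])
            p += 1
        yield ('word', m.group())


events = list(token_stream("One two. Three four!\n\nFive six."))
for event in events:
    print(event)
```

Each tokeniser touches the text exactly once, so the per-call overhead is paid three times in total rather than once per paragraph or per frame.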

In [1]:
from caterpillar.processing.analysis.analyse import *
from caterpillar.processing.analysis.tokenize import *
import os
import nltk
import re
import regex

First, let's marshal the data.

In [2]:
# Read the whole test corpus, closing the file handle afterwards
with open('/Users/rstuart/Workspace/python/caterpillar/caterpillar/test_resources/moby.txt', 'r') as f:
    data = f.read()

Old way first.

In [3]:
def tokenize_old(text):
    paragraph_tokenizer = ParagraphTokenizer()
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    analyser = DefaultAnalyser()
    count = 0
    frames = []
    
    for p in paragraph_tokenizer.tokenize(text):
        sentences = sentence_tokenizer.tokenize(p.value, realign_boundaries=True)
        for i in xrange(0, len(sentences), 2):
            frames.append(" ".join(sentences[i:i+2]))
        for s in sentences:
            count += len(list(analyser.analyse(s)))
    print("Tokenised %d tokens, %d frames" % (count, len(frames)))
        
%time tokenize_old(data)

Tokenised 210736 tokens, 6150 frames
CPU times: user 2.45 s, sys: 8.57 ms, total: 2.46 s
Wall time: 2.46 s


A first attempt at a different approach. The idea is to insert special markers at sentence boundaries (`\x02`) and paragraph boundaries (`\x03`), use those markers to mark frame boundaries (`\x04`), and then tokenise words without crossing any of those boundaries.

In [4]:
def tokenize_new(text):
    paragraph_tokenizer = re.compile(
        r'\x02\\S{0,4}\\s*(?:\r?\n)+|\r?\n(?:\r?\n)+', flags=re.DOTALL | re.UNICODE | re.MULTILINE
    )
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    frame_tokenizer = re.compile(
        u'(?:[^\x03\x02]*(?:\x02|(?=\x03)|$)){1,2}[\\s\x03]*', flags=re.DOTALL | re.UNICODE | re.MULTILINE
    )
    
    # Mark sentences with \x02
    new_text = u"\x02".join([text[start:end] for start, end in sentence_tokenizer.span_tokenize(text)])
    # Mark paragraphs with \x03
    new_text = paragraph_tokenizer.sub(u'\\g<0>\x03', new_text)
    # Mark frames with \x04 (at most two sentences each, never crossing a paragraph)
    new_text = frame_tokenizer.sub(u'\\g<0>\x04', new_text)
    # Remove the paragraph and sentence markers, leaving only frame markers
    new_text = re.sub(u'\x02|\x03', '', new_text, flags=re.DOTALL | re.UNICODE | re.MULTILINE)
    assert '\x02' not in new_text and '\x03' not in new_text
    frames = new_text.split('\x04')
    print("Tokenised %d terms, %d frames" % (len(list(DefaultAnalyser().analyse(text))), len(frames)))

%time tokenize_new(data)

Tokenised 210702 terms, 5079 frames
CPU times: user 2.12 s, sys: 144 ms, total: 2.26 s
Wall time: 2.16 s


That is not much of an improvement (2.16 s wall time against 2.46 s). Why?

In [7]:
paragraph_tokenizer = re.compile(
    r'\x02\\S{0,4}\\s*(?:\r?\n)+|\r?\n(?:\r?\n)+', flags=re.DOTALL | re.UNICODE | re.MULTILINE
)
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
frame_tokenizer = re.compile(
    u'(?:[^\x03\x02]*(?:\x02|(?=\x03)|$)){1,2}[\\s\x03]*', flags=re.DOTALL | re.UNICODE | re.MULTILINE
)
word_tokenizer = SimpleWordTokenizer()
analyser = DefaultAnalyser()

%time sents = u"\x02".join([data[start:end] for start, end in sentence_tokenizer.span_tokenize(data)])
%time paras = paragraph_tokenizer.sub(u'\\g<0>\x03', sents)
%time frames = frame_tokenizer.sub(u'\\g<0>\x04', paras)
%time tokens = list(analyser.analyse(paras))
%time tokens = list(word_tokenizer.tokenize(paras))

CPU times: user 646 ms, sys: 136 ms, total: 783 ms
Wall time: 688 ms
CPU times: user 41 ms, sys: 2.05 ms, total: 43.1 ms
Wall time: 43.1 ms
CPU times: user 25.1 ms, sys: 2.01 ms, total: 27.2 ms
Wall time: 27.2 ms
CPU times: user 1.4 s, sys: 4.14 ms, total: 1.41 s
Wall time: 1.41 s
CPU times: user 562 ms, sys: 2.44 ms, total: 564 ms
Wall time: 565 ms


The bottlenecks are clearly sentence tokenisation (688 ms), our analyser (1.41 s) and our word tokeniser (565 ms). There is no low-hanging fruit with sentences, although running the sentence tokeniser just once rather than once per paragraph is a vast improvement. It really needs a rewrite, probably in Cython. But maybe we can do something about the term tokenisation?
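As a rough sanity check on that reading, the wall times from the In [7] output can be added up and expressed as shares of the total. The figures below are copied from that output. The separately timed word tokeniser is left out on the assumption that the analyser already includes a word tokenisation step, and the variable names are made up for this sketch.

```python
# Wall times in milliseconds, copied from the In [7] output.
stage_ms = [
    ("sentence tokenisation", 688.0),
    ("paragraph marking", 43.1),
    ("frame marking", 27.2),
    ("analyser", 1410.0),
]

total_ms = sum(ms for _, ms in stage_ms)

# Print each stage with its share of the summed time.
for name, ms in stage_ms:
    print("%-22s %7.1f ms  %5.1f%%" % (name, ms, 100.0 * ms / total_ms))
print("%-22s %7.1f ms" % ("total", total_ms))
```

The stages sum to about 2168 ms, in line with the 2.16 s wall time measured for `tokenize_new`. The two regex substitutions together account for only about 70 ms.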