matchall is very slow #3719

Closed
johnmyleswhite opened this Issue Jul 15, 2013 · 13 comments

6 participants

@johnmyleswhite
The Julia Language member

I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.

To get started, download the file http://norvig.com/ipython/big.txt for tokenization.

We'll tokenize it in Julia first:

function tokenize()
    BIG = readall("big.txt");
    tokens(text::String) =
      [m.match for m in matchall(r"[a-z]+", lowercase(text))]
    @elapsed t = tokens(BIG)
end
tokenize()

This takes 10 seconds on my machine.

In contrast, the following Python code is simpler (because there's no notion that matchall won't return strings directly) and 20x faster.

import re
import time
BIG = file('big.txt').read()
def tokens(text):
    return re.findall('[a-z]+', text.lower())

s = time.time()
t = tokens(BIG)
e = time.time()
e - s

This takes 0.4 seconds on my machine.

@dcjones
The Julia Language member

I investigated this a little. From the profiler output it looks like a lot of time is spent creating and destroying the million or so match objects. Pythons findall avoids that be returning an array of strings.

Here's a julia equivalent.

function matchingstrings(re::Regex, str::ByteString)
    matches = String[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, str[ovec[1]+1:ovec[2]])
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end

Using matchingstrings(r"[a-z]+", lowercase(text)) in place of [m.match for m in matchall(r"[a-z]+", lowercase(text))] gives the same results but 0.36 seconds instead of 6.1 seconds for me.

The python is 0.33 for me, so it's a pretty close match for speed.

Given the performance difference, maybe we should include something like this?

p.s.
pypy is 0.49 so it's faster than some pythons at least!

@dcjones
The Julia Language member

Here's 0.22 seconds by returning SubStrings rather than Strings.

function matchingstrings(re::Regex, str::ByteString)
    matches = SubString[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, SubString(str, ovec[1]+1, ovec[2]))
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end
@StefanKarpinski
The Julia Language member

Nice work, @dcjones. Great that we're within striking distance for speed here. Maybe we should just change it so that matchall returns an array of strings and if you want to actually get match objects you have to use eachmatch – getting an array of match objects is just a matter of doing collect(eachmatch(...)) anyway. Thanks to @nutsiepully's recent work, those are equivalent. We should be very careful to make sure that a function that returns all matching substrings is always equivalent to doing [m.match for m=matchall(r,s)].

@StefanKarpinski
The Julia Language member

Ah, we really need to make more of our string stuff non-copying.

@johnmyleswhite
The Julia Language member

Awesome job, @dcjones. I personally think matchall returning strings is more useful, but I'd be open to giving this another name. I also agree that most of the string functions should avoid making copies.

@StefanKarpinski
The Julia Language member

The idea that we've kicked around is making all UTF8Strings (and other string types) have an offset as well as a length so that you can take a substring and get something of the normal UTF8String type.

@johnmyleswhite
The Julia Language member

Making UTF8Strings more like C pointers sounds like a great idea to me.

@StefanKarpinski
The Julia Language member

Well, they'd still have a bit more baggage since they'd have to contain a reference to an array object, an offset into that array to start at, and a length. That's significantly more stuff than just a pointer.

@quinnj
The Julia Language member
@StefanKarpinski
The Julia Language member

The search function returns the index or range of indices that the match occurs at, however, not the matching substring itself.

@nutsiepully

The matchall function internally uses the match function which returns a RegexMatch object after a call to the PCRE library. So in case we want a version of matchall that returns strings, we would have to bypass that and call the PCRE library directly.
However, I'm not sure if the slowness is only due to the creation of RegexMatch as compared to a String. Both do string extraction, except that RegexMatch maintians the regex groups as well. The matchall function changes some regex behavior occasionally to address the edge cases #3525. Comparing the old iterator without these fixes (which follows the same matching strategy above) would suggest whether it's only the object creation or also the matching change that's causing the performance to drop.
I don't have access to a dev machine for a week and would not be able to check it out.

@johnmyleswhite
The Julia Language member

I'd love to see @dcjones change get into 0.2.

@dcjones
The Julia Language member

I nearly forgot about this. I'll make a PR later tonight.

@swadey swadey referenced this issue Jun 1, 2014
Closed

match() is slow #7066

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment