Skip to content
This repository

matchall is very slow #3719

Closed
johnmyleswhite opened this Issue · 13 comments

6 participants

John Myles White Daniel Jones Stefan Karpinski Jacob Quinn Pulkit Bhuwalka Jeff Bezanson
John Myles White
Collaborator

I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.

To get started, download the file http://norvig.com/ipython/big.txt for tokenization.

We'll tokenize it in Julia first:

function tokenize()
    BIG = readall("big.txt");
    tokens(text::String) =
      [m.match for m in matchall(r"[a-z]+", lowercase(text))]
    @elapsed t = tokens(BIG)
end
tokenize()

This takes 10 seconds on my machine.

In contrast, the following Python code is simpler (because there's no notion that matchall won't return strings directly) and 20x faster.

import re
import time
BIG = file('big.txt').read()
def tokens(text):
    return re.findall('[a-z]+', text.lower())

s = time.time()
t = tokens(BIG)
e = time.time()
e - s

This takes 0.4 seconds on my machine.

Daniel Jones
Collaborator

I investigated this a little. From the profiler output it looks like a lot of time is spent creating and destroying the million or so match objects. Pythons findall avoids that be returning an array of strings.

Here's a julia equivalent.

function matchingstrings(re::Regex, str::ByteString)
    matches = String[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, str[ovec[1]+1:ovec[2]])
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end

Using matchingstrings(r"[a-z]+", lowercase(text)) in place of [m.match for m in matchall(r"[a-z]+", lowercase(text))] gives the same results but 0.36 seconds instead of 6.1 seconds for me.

The python is 0.33 for me, so it's a pretty close match for speed.

Given the performance difference, maybe we should include something like this?

p.s.
pypy is 0.49 so it's faster than some pythons at least!

Daniel Jones
Collaborator

Here's 0.22 seconds by returning SubStrings rather than Strings.

function matchingstrings(re::Regex, str::ByteString)
    matches = SubString[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, SubString(str, ovec[1]+1, ovec[2]))
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end
Stefan Karpinski
Owner

Nice work, @dcjones. Great that we're within striking distance for speed here. Maybe we should just change it so that matchall returns an array of strings and if you want to actually get match objects you have to use eachmatch – getting an array of match objects is just a matter of doing collect(eachmatch(...)) anyway. Thanks to @nutsiepully's recent work, those are equivalent. We should be very careful to make sure that a function that returns all matching substrings is always equivalent to doing [m.match for m=matchall(r,s)].

Stefan Karpinski
Owner

Ah, we really need to make more of our string stuff non-copying.

John Myles White
Collaborator

Awesome job, @dcjones. I personally think matchall returning strings is more useful, but I'd be open to giving this another name. I also agree that most of the string functions should avoid making copies.

Stefan Karpinski
Owner

The idea that we've kicked around is making all UTF8Strings (and other string types) have an offset as well as a length so that you can take a substring and get something of the normal UTF8String type.

John Myles White
Collaborator

Making UTF8Strings more like C pointers sounds like a great idea to me.

Stefan Karpinski
Owner

Well, they'd still have a bit more baggage since they'd have to contain a reference to an array object, an offset into that array to start at, and a length. That's significantly more stuff than just a pointer.

Jacob Quinn
Collaborator
quinnj commented
Stefan Karpinski
Owner

The search function returns the index or range of indices that the match occurs at, however, not the matching substring itself.

Pulkit Bhuwalka

The matchall function internally uses the match function which returns a RegexMatch object after a call to the PCRE library. So in case we want a version of matchall that returns strings, we would have to bypass that and call the PCRE library directly.
However, I'm not sure if the slowness is only due to the creation of RegexMatch as compared to a String. Both do string extraction, except that RegexMatch maintians the regex groups as well. The matchall function changes some regex behavior occasionally to address the edge cases #3525. Comparing the old iterator without these fixes (which follows the same matching strategy above) would suggest whether it's only the object creation or also the matching change that's causing the performance to drop.
I don't have access to a dev machine for a week and would not be able to check it out.

John Myles White
Collaborator

I'd love to see @dcjones change get into 0.2.

Daniel Jones
Collaborator

I nearly forgot about this. I'll make a PR later tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.