Use memmem in string index. Uses Knuth-Morris-Pratt on glibc 2.2x+ faster #574

samcv · 2017-04-07T01:56:21Z

Update:
All checks pass on Windows Appveyor and Travis CI on MacOS and Linux. The PR now uses FreeBSD's memmem on both Windows and MacOS to take advantage of the Crochemore-Perrin two-way string matching algorithm it uses. Link to PDF of paper which describes it and interestingly compares it to Knuth-Morris-Pratt and Boyer-Moore for any interested. We will still use glibc's Knuth-Morris-Pratt based memmem on Linux, since it is 13% faster in this test:

say 'a' x 1000000000 ~ 'b' ~~ /abcde/; say now - INIT now; # FreeBSD libc 4.29s, glibc 3.78s

FreeBSD's memmem is included in 3rdparty/freebsd/memmem.c and the #ifdef's and new MVM_memmem function are in src/platform/memmem.h.

Comments within the code explain this as well. :-)

Opening this Pull Request to allow for comments. In perl 6 matching in regex becomes
2.2x faster (in case needle is at very end of string). It is likely that the
index operation itself is slightly faster than that.

On MacOS memmem does not use Knuth-Morris-Pratt algorithm, and just searches in a standard
way. We should get some benchmarks on MacOS and see how they compare.

It may be beneficial to (eventually) put parts of glibc in 3rdparty or somehow incorporate the code
for memmem() so we can use it on MacOS and Windows as well.

Even better would be to adapt the code to make a memmem that can skip a specified number of bytes,
so we can index 32bit strings faster. Although unlikely, at the moment it just panics if it gets back
and pointer that isn't along the boundaries of a 32bit integer. Preliminary plan is to run memmem from
from the returned pointer to continue searching inside the string in the rare (but possible) chance this
were to occur.

Windows now uses FreeBSD's memmem implementation, and has been tested with nqp test suite on Windows

Feedback requested :-)

P.S. glibc-2.8 was released in 2008-04-12, previous to this it did not use a special algorithm but should still work fine with 2.1 or newer (1999-02-03)

samcv · 2017-04-07T02:01:00Z

src/strings/ops.c

+                    return -1;
+                /* If we aren't on a 32 bit boundary then freak out!!! */
+                else if (
+                    ( (intptr_t)mm_return_32 - (intptr_t)haystack->body.storage.blob_32) % sizeof(MVMGrapheme32)


If somebody could check the sanity of this line for determining if the void * pointer returned by memmem() is not on the 32bit integer boundary. Want this double checked.

bdw · 2017-04-07T07:37:31Z

Very good find, nice work, but cannot be merged as is.

memmem is not available on windows, and so we need to either wrap this code path in an ifdef, or implement a wrapper
it throws on finding a match in the non-32 bit index, and obviously that isn't really an error; it should continue at the next boundary (memmem finds the first match, so there is no earlier match at a boundary; this can be skipped if the remainder is less than the length of the short string). So this needs to be converted to a loop.
personally, i'd write the boundary check as (((char*) foo) - ((char*) bar) & 3) == 0; the compiler will probably do the same as an optimization, but still.

bdw · 2017-04-07T07:42:50Z

Note that for a wrapper, you could fallback to memchr + memcmp and a boundary check, which are C standard library functions.

samcv · 2017-04-07T07:44:42Z

Totally agree with you. Wanted to do a PR to judge how to move this forward. The throwing was mainly for me to check if it ever got to that point while MoarVM and roast/testset for a while, which it did not trigger, but it still needs to be planned for because it is possible for that to happen.

I would think that using % sizeof(MVMGrapheme32) would be easier to port to other sizes, and I am considering having a MVM_memmem which will do this check across amounts that are different from 32bit integers as well.

void *MVM_memmem(const void *haystack, size_t haystacklen,
                    const void *needle, size_t needlelen,
                    size_t sizeof);

samcv · 2017-04-07T22:54:41Z

Updated the PR. I have added memmem.c from libc (FreeBSD) and a file platform/memmem.h which handles the distribution specific stuff. For now i'm not going to add a specialized memmem wrapper, but that will probably be something I will do in the future.

coke · 2017-04-07T23:02:34Z

Any licensing issues?

…

On Fri, Apr 7, 2017 at 6:54 PM, Samantha McVey ***@***.***> wrote: Updated the PR. I have added memmem.c from libc (FreeBSD) and a file platform/memmem.h which handles the distribution specific stuff. For now i'm not going to add a specialized memmem wrapper, but that will probably be something I will do in the future. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#574 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AADEUZTfijcnNjh-wL4rW_7oAb6G4Xe0ks5rtr6xgaJpZM4M2XrS> .

-- Will "Coke" Coleda

samcv · 2017-04-08T00:54:52Z

@coke it is under MIT licensing.

Our utf8 decoding code is MIT license, so it should probably be fine. UTF8 code for reference: https://github.com/MoarVM/MoarVM/blob/master/src/strings/utf8.c#L3

Side note: ~~~the BSD version which this PR uses for Windows is the same code that is used for MacOS since they have lots of BSD code.~~~ Looks like memmem in MacOS is from FreeBSD but from 2005, and is quite different https://opensource.apple.com/source/Libc/Libc-1158.50.2/string/FreeBSD/memmem.c.auto.html

Preliminary. We may want to integrate the memmem function from glibc into /3rdparty, or make it an #ifdef for computers which don't have it. Makes string search 2x faster for many strings. Temporarily throws an exception if memmem returned a value that is not a multiple of 32 bits.

samcv · 2017-04-08T07:24:26Z

All checks pass on Windows Appveyor and Travis CI on MacOS and Linux. The PR now uses FreeBSD's memmem on both Windows and MacOS to take advantage of the Crochemore-Perrin two-way string matching algorithm it uses. Link to PDF of paper which describes it and interestingly compares it to Knuth-Morris-Pratt and Boyer-Moore for any interested. We will still use glibc's Knuth-Morris-Pratt based memmem on Linux, since it is 13% faster in this test:

say 'a' x 1000000000 ~ 'b' ~~ /abcde/; say now - INIT now; # FreeBSD libc 4.29s, glibc 3.78s

This PR should be ready for merge unless anybody has any other concerns.

Use FreeBSD memmem on MacOS, because although MacOS comes with a memmem, it is much slower than the FBSD one which implements Crochemore+Perrin two-way string matching. Also improve the comments to be more descriptive.

timotimo++ for catching this.

samcv commented Apr 7, 2017

View reviewed changes

samcv force-pushed the memmemindex branch 2 times, most recently from 00f04ca to cbd8479 Compare April 7, 2017 22:33

samcv force-pushed the memmemindex branch 2 times, most recently from fb21f33 to e9a0411 Compare April 8, 2017 01:20

samcv added 5 commits April 7, 2017 18:29

Add FreeBSD memmem.c for use on Windows

1eef9b7

Copyright (c) 2005-2014 Rich Felker, et al.

attempt to add memmem for Windows

d218cda

Comment out two lines from memmem.c, Windows doesn't have sys/cdefs.h

4b00531

Cleanly handle false positive needles @ unaligned positions

9b61962

samcv force-pushed the memmemindex branch 3 times, most recently from e09ee93 to f1eb9cf Compare April 8, 2017 05:53

samcv force-pushed the memmemindex branch 2 times, most recently from fe2f151 to c44c259 Compare April 8, 2017 21:48

samcv added 5 commits April 8, 2017 15:33

Add platform/memmem.h to handle platform specific memmem support

1724bb7

Add freebsd to 3rdparty list in the LICENSE file

4f233ae

Use char* casting instead of intptr_t as recommended by @brrt

fe5df6d

Use FreeBSD memmem function on MacOS and improve comments

20b6041

Use FreeBSD memmem on MacOS, because although MacOS comes with a memmem, it is much slower than the FBSD one which implements Crochemore+Perrin two-way string matching. Also improve the comments to be more descriptive.

Make sure we adjust the haystack length if we have to continue searching

9b9b1b5

timotimo++ for catching this.

samcv force-pushed the memmemindex branch from c44c259 to 9b9b1b5 Compare April 8, 2017 22:33

samcv merged commit e8231a3 into MoarVM:master Apr 9, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use memmem in string index. Uses Knuth-Morris-Pratt on glibc 2.2x+ faster #574

Use memmem in string index. Uses Knuth-Morris-Pratt on glibc 2.2x+ faster #574

samcv commented Apr 7, 2017 •

edited

samcv Apr 7, 2017

bdw commented Apr 7, 2017

bdw commented Apr 7, 2017

samcv commented Apr 7, 2017

samcv commented Apr 7, 2017

coke commented Apr 7, 2017 via email

samcv commented Apr 8, 2017 •

edited

samcv commented Apr 8, 2017

Use memmem in string index. Uses Knuth-Morris-Pratt on glibc 2.2x+ faster #574

Use memmem in string index. Uses Knuth-Morris-Pratt on glibc 2.2x+ faster #574

Conversation

samcv commented Apr 7, 2017 • edited

samcv Apr 7, 2017

Choose a reason for hiding this comment

bdw commented Apr 7, 2017

bdw commented Apr 7, 2017

samcv commented Apr 7, 2017

samcv commented Apr 7, 2017

coke commented Apr 7, 2017 via email

samcv commented Apr 8, 2017 • edited

samcv commented Apr 8, 2017

samcv commented Apr 7, 2017 •

edited

samcv commented Apr 8, 2017 •

edited