Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use memmem in string index. Uses Knuth-Morris-Pratt on glibc 2.2x+ faster #574

Merged
merged 10 commits into from Apr 9, 2017

Conversation

samcv
Copy link
Member

@samcv samcv commented Apr 7, 2017

Update:
All checks pass on Windows Appveyor and Travis CI on MacOS and Linux. The PR now uses FreeBSD's memmem on both Windows and MacOS to take advantage of the Crochemore-Perrin two-way string matching algorithm it uses. Link to PDF of paper which describes it and interestingly compares it to Knuth-Morris-Pratt and Boyer-Moore for any interested. We will still use glibc's Knuth-Morris-Pratt based memmem on Linux, since it is 13% faster in this test:

say 'a' x 1000000000 ~ 'b' ~~ /abcde/; say now - INIT now; # FreeBSD libc 4.29s, glibc 3.78s

FreeBSD's memmem is included in 3rdparty/freebsd/memmem.c and the #ifdef's and new MVM_memmem function are in src/platform/memmem.h.

Comments within the code explain this as well. :-)


Opening this Pull Request to allow for comments. In perl 6 matching in regex becomes
2.2x faster (in case needle is at very end of string). It is likely that the
index operation itself is slightly faster than that.

On MacOS memmem does not use Knuth-Morris-Pratt algorithm, and just searches in a standard
way. We should get some benchmarks on MacOS and see how they compare.

It may be beneficial to (eventually) put parts of glibc in 3rdparty or somehow incorporate the code
for memmem() so we can use it on MacOS and Windows as well.

Even better would be to adapt the code to make a memmem that can skip a specified number of bytes,
so we can index 32bit strings faster. Although unlikely, at the moment it just panics if it gets back
and pointer that isn't along the boundaries of a 32bit integer. Preliminary plan is to run memmem from
from the returned pointer to continue searching inside the string in the rare (but possible) chance this
were to occur.

  • Windows now uses FreeBSD's memmem implementation, and has been tested with nqp test suite on Windows

Feedback requested :-)

P.S. glibc-2.8 was released in 2008-04-12, previous to this it did not use a special algorithm but should still work fine with 2.1 or newer (1999-02-03)

return -1;
/* If we aren't on a 32 bit boundary then freak out!!! */
else if (
( (intptr_t)mm_return_32 - (intptr_t)haystack->body.storage.blob_32) % sizeof(MVMGrapheme32)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If somebody could check the sanity of this line for determining if the void * pointer returned by memmem() is not on the 32bit integer boundary. Want this double checked.

@bdw
Copy link
Contributor

bdw commented Apr 7, 2017

Very good find, nice work, but cannot be merged as is.

  • memmem is not available on windows, and so we need to either wrap this code path in an ifdef, or implement a wrapper
  • it throws on finding a match in the non-32 bit index, and obviously that isn't really an error; it should continue at the next boundary (memmem finds the first match, so there is no earlier match at a boundary; this can be skipped if the remainder is less than the length of the short string). So this needs to be converted to a loop.
  • personally, i'd write the boundary check as (((char*) foo) - ((char*) bar) & 3) == 0; the compiler will probably do the same as an optimization, but still.

@bdw
Copy link
Contributor

bdw commented Apr 7, 2017

Note that for a wrapper, you could fallback to memchr + memcmp and a boundary check, which are C standard library functions.

@samcv
Copy link
Member Author

samcv commented Apr 7, 2017

Totally agree with you. Wanted to do a PR to judge how to move this forward. The throwing was mainly for me to check if it ever got to that point while MoarVM and roast/testset for a while, which it did not trigger, but it still needs to be planned for because it is possible for that to happen.

I would think that using % sizeof(MVMGrapheme32) would be easier to port to other sizes, and I am considering having a MVM_memmem which will do this check across amounts that are different from 32bit integers as well.

void *MVM_memmem(const void *haystack, size_t haystacklen,
                    const void *needle, size_t needlelen,
                    size_t sizeof);

@samcv samcv force-pushed the memmemindex branch 2 times, most recently from 00f04ca to cbd8479 Compare April 7, 2017 22:33
@samcv
Copy link
Member Author

samcv commented Apr 7, 2017

Updated the PR. I have added memmem.c from libc (FreeBSD) and a file platform/memmem.h which handles the distribution specific stuff. For now i'm not going to add a specialized memmem wrapper, but that will probably be something I will do in the future.

@coke
Copy link
Collaborator

coke commented Apr 7, 2017 via email

@samcv
Copy link
Member Author

samcv commented Apr 8, 2017

@coke it is under MIT licensing.

Our utf8 decoding code is MIT license, so it should probably be fine. UTF8 code for reference: https://github.com/MoarVM/MoarVM/blob/master/src/strings/utf8.c#L3

Side note: ~~~the BSD version which this PR uses for Windows is the same code that is used for MacOS since they have lots of BSD code.~~~ Looks like memmem in MacOS is from FreeBSD but from 2005, and is quite different https://opensource.apple.com/source/Libc/Libc-1158.50.2/string/FreeBSD/memmem.c.auto.html

@samcv samcv force-pushed the memmemindex branch 2 times, most recently from fb21f33 to e9a0411 Compare April 8, 2017 01:20
Preliminary. We may want to integrate the memmem function from glibc into /3rdparty,
or make it an #ifdef for computers which don't have it. Makes string search 2x faster
for many strings.

Temporarily throws an exception if memmem returned a value that is not a multiple
of 32 bits.
Copyright (c) 2005-2014 Rich Felker, et al.
@samcv samcv force-pushed the memmemindex branch 3 times, most recently from e09ee93 to f1eb9cf Compare April 8, 2017 05:53
@samcv
Copy link
Member Author

samcv commented Apr 8, 2017

All checks pass on Windows Appveyor and Travis CI on MacOS and Linux. The PR now uses FreeBSD's memmem on both Windows and MacOS to take advantage of the Crochemore-Perrin two-way string matching algorithm it uses. Link to PDF of paper which describes it and interestingly compares it to Knuth-Morris-Pratt and Boyer-Moore for any interested. We will still use glibc's Knuth-Morris-Pratt based memmem on Linux, since it is 13% faster in this test:

say 'a' x 1000000000 ~ 'b' ~~ /abcde/; say now - INIT now; # FreeBSD libc 4.29s, glibc 3.78s

This PR should be ready for merge unless anybody has any other concerns.

@samcv samcv force-pushed the memmemindex branch 2 times, most recently from fe2f151 to c44c259 Compare April 8, 2017 21:48
Use FreeBSD memmem on MacOS, because although MacOS comes with a memmem,
it is much slower than the FBSD one which implements
Crochemore+Perrin two-way string matching.

Also improve the comments to be more descriptive.
@samcv samcv merged commit e8231a3 into MoarVM:master Apr 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants