New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use memmem in string index. Uses Knuth-Morris-Pratt on glibc 2.2x+ faster #574
Conversation
src/strings/ops.c
Outdated
return -1; | ||
/* If we aren't on a 32 bit boundary then freak out!!! */ | ||
else if ( | ||
( (intptr_t)mm_return_32 - (intptr_t)haystack->body.storage.blob_32) % sizeof(MVMGrapheme32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If somebody could check the sanity of this line for determining if the void * pointer returned by memmem() is not on the 32bit integer boundary. Want this double checked.
Very good find, nice work, but cannot be merged as is.
|
Note that for a wrapper, you could fallback to memchr + memcmp and a boundary check, which are C standard library functions. |
Totally agree with you. Wanted to do a PR to judge how to move this forward. The throwing was mainly for me to check if it ever got to that point while MoarVM and roast/testset for a while, which it did not trigger, but it still needs to be planned for because it is possible for that to happen. I would think that using void *MVM_memmem(const void *haystack, size_t haystacklen,
const void *needle, size_t needlelen,
size_t sizeof); |
00f04ca
to
cbd8479
Compare
Updated the PR. I have added memmem.c from libc (FreeBSD) and a file |
Any licensing issues?
…On Fri, Apr 7, 2017 at 6:54 PM, Samantha McVey ***@***.***> wrote:
Updated the PR. I have added memmem.c from libc (FreeBSD) and a file
platform/memmem.h which handles the distribution specific stuff. For now
i'm not going to add a specialized memmem wrapper, but that will probably
be something I will do in the future.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#574 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AADEUZTfijcnNjh-wL4rW_7oAb6G4Xe0ks5rtr6xgaJpZM4M2XrS>
.
--
Will "Coke" Coleda
|
@coke it is under MIT licensing. Our utf8 decoding code is MIT license, so it should probably be fine. UTF8 code for reference: https://github.com/MoarVM/MoarVM/blob/master/src/strings/utf8.c#L3 Side note: ~~~the BSD version which this PR uses for Windows is the same code that is used for MacOS since they have lots of BSD code.~~~ Looks like memmem in MacOS is from FreeBSD but from 2005, and is quite different https://opensource.apple.com/source/Libc/Libc-1158.50.2/string/FreeBSD/memmem.c.auto.html |
fb21f33
to
e9a0411
Compare
Preliminary. We may want to integrate the memmem function from glibc into /3rdparty, or make it an #ifdef for computers which don't have it. Makes string search 2x faster for many strings. Temporarily throws an exception if memmem returned a value that is not a multiple of 32 bits.
Copyright (c) 2005-2014 Rich Felker, et al.
e09ee93
to
f1eb9cf
Compare
All checks pass on Windows Appveyor and Travis CI on MacOS and Linux. The PR now uses FreeBSD's memmem on both Windows and MacOS to take advantage of the Crochemore-Perrin two-way string matching algorithm it uses. Link to PDF of paper which describes it and interestingly compares it to Knuth-Morris-Pratt and Boyer-Moore for any interested. We will still use glibc's Knuth-Morris-Pratt based memmem on Linux, since it is 13% faster in this test: say 'a' x 1000000000 ~ 'b' ~~ /abcde/; say now - INIT now; # FreeBSD libc 4.29s, glibc 3.78s This PR should be ready for merge unless anybody has any other concerns. |
fe2f151
to
c44c259
Compare
Use FreeBSD memmem on MacOS, because although MacOS comes with a memmem, it is much slower than the FBSD one which implements Crochemore+Perrin two-way string matching. Also improve the comments to be more descriptive.
timotimo++ for catching this.
Update:
All checks pass on Windows Appveyor and Travis CI on MacOS and Linux. The PR now uses FreeBSD's memmem on both Windows and MacOS to take advantage of the Crochemore-Perrin two-way string matching algorithm it uses. Link to PDF of paper which describes it and interestingly compares it to Knuth-Morris-Pratt and Boyer-Moore for any interested. We will still use glibc's Knuth-Morris-Pratt based memmem on Linux, since it is 13% faster in this test:
FreeBSD's memmem is included in
3rdparty/freebsd/memmem.c
and the#ifdef
's and newMVM_memmem
function are insrc/platform/memmem.h
.Comments within the code explain this as well. :-)
Opening this Pull Request to allow for comments. In perl 6 matching in regex becomes
2.2x faster (in case needle is at very end of string). It is likely that the
index operation itself is slightly faster than that.
On MacOS memmem does not use Knuth-Morris-Pratt algorithm, and just searches in a standard
way. We should get some benchmarks on MacOS and see how they compare.
It may be beneficial to (eventually) put parts of glibc in 3rdparty or somehow incorporate the code
for memmem() so we can use it on MacOS and Windows as well.
Even better would be to adapt the code to make a memmem that can skip a specified number of bytes,so we can index 32bit strings faster. Although unlikely, at the moment it just panics if it gets back
and pointer that isn't along the boundaries of a 32bit integer. Preliminary plan is to run memmem from
from the returned pointer to continue searching inside the string in the rare (but possible) chance this
were to occur.
Feedback requested :-)
P.S. glibc-2.8 was released in 2008-04-12, previous to this it did not use a special algorithm but should still work fine with 2.1 or newer (1999-02-03)