-
-
Notifications
You must be signed in to change notification settings - Fork 701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(last)indexOfAny(index) #1952
(last)indexOfAny(index) #1952
Conversation
else | ||
{ | ||
immutable innerLoop = | ||
"foreach (dchar o; needles) { dchar oLow = std.uni.toLower(o); " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't quite correct in Unicode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, the unittests with unicode work and the same proved correct for indexOf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He's talking about dealing with letters such as ß
, where it won't match SS
. The issue is that currently, nothing in string
or algorithm
knows how to deal with them. I think your current code is fine: It would be better to upgrade all of string to know about graphemes, then just a single new function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
Well yes and no, the ß example is bad because in german ß != ss, grüßen != grüssen. So finding ss for every ß is just wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it was just an example. The root issue is a codepoint vs grapheme thing, which I think is fine to ignore for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fine with me, but I was a little bit scared that my browser thinks that ß == ss ;-)
In case of indexOf you still have to test >= 0. In case of find you first have to test if the range is empty, then use the index. I don't see Lastly - decoding is not required at least for case sensitive version. In case of needles being a string you re-decode all of them on every character in haystack, there got to be a better way. (For instance searching for any codeunit, then doing more precise match) |
I don't really know what you want to say in the first paragraph, could you elaborate please. It is there was a bug in indexOf, because of unicode slicing I fixed 1/2 a year ago. The way to do it faster is to do the needles toLower once and save the result, but that requires heap. It is a classic n against m problem and as there is nothing sorted or otherwise this is the way to do it. |
Only for when the cutrent haystack letter is ascii. That said
I think it's safe to say that the I'm unsure if it would be better to store them as chars or dchars though: chars can do a memchr among the needles if the haystack.front.toLower is in ascii range, but if not, then dchar doesn't require decoding. Walter is working on a simple ...or you could just leave it as is. Premature optimization is evil :) |
Permature optimization is evil and simple is better, IMHO. |
@burner I've meant what is a bug anyway pointed out by @monarchdodra string s = "hällo";
auto r = s.findAmong("ztl");
//<---- can test for r.empty here
auto idx = s.length - r.length; // actually findAmong returns leftover portion
//or here if idx == s.length there was no match Is not any worse or more branchy then checking against -1 returned by indexOf. |
There is no such thing as premature optimization of a public function in the standard library. The multitude of potential uses and programs is so large that it is basically guaranteed that some code somewhere will use it in a tight loop. |
well, if you have optimize the wrong special case, it is premature. Additionally, selecting which special implementation to use will cost some cycles as well. |
@burner To be precise about the kind of optimization I have in mind. The 2 loops of decode over a string are replaced with something like this: for(size_t i=0; i<haystack.length; i += stride(haystack, i)) //no decode but keep proper pace
{
//i is always at start of a codepoint
immutable c = haystack[i];
auto mayMatch = needles.representation.indexOf(c);
if(mayMatch < 0)
continue;
//or anything better that decodes 2 fronts
//may aswell check here for ASCII though decoding does it anyway
if(needles[mayMatch..$].front == haystack[i..$].front)
{
return i;
}
}
return -1; Trying to prove if this is the wrong special case needs some numbers, so we have to test it anyway. I trust my experience that tells basically decoding is a big time consumer of simple matching in strings. |
you forgot a loop and a branch bool allAscii = true;
foreach(c; haystack) {
if(c >= 0x7F) {
allAscii = false;
break;
}
}
if(allAscii) { yourImpl(); } and than the call to indexOf will decode anyway |
@burner There is no need for it to be ascii only. This is the important point here. |
Peruse UTF-8 and UTF-16 encodings - a starting code unit can't be found among non-starting ones. |
Speaking of toLower - it's wrong simply because not ever capital letter has lower case mapping and vise-versa. By Unicode standard there is case folding that is applicable for case-insensitive matching which may be simple (1:1 mapping) or full (1:m). |
immutable c = haystack[i]; slices utf(8,16) and you put that into indexOf. So string a ="ö";
string b ="ä";
immutable a1 = a[0];
assert(a1 == b[0]); // is true
assert(a[0] == b[0]); // is true |
so what do you propose? uni.sicmp && case_comparision? |
My bad github let's me edit your posts ...
There is a fault in my code as presented (it's a sketch after all) but it's not with decoding. UPDATE: |
Maybe uni.sicmp, it's a bit too heavy for single chars. I'd just leave TODO note as it's not that big a problem. |
I mostly mean that in the sense that it is (currently) more important to have things that work, rather than improve their performance. In particular, the optimizations you propose only work for when the sizes match. Furthermore, it's a non-optimization the input is dchars. All this creates more branches to test and cover, introducing potentially hard to find bugs.
Interesting use of stride. That said, it would be even simpler to invert and search, and look for a haystack slice inside the needles? {
size_t i = 0, next = 0;
for ( ; i < haystack.length ; i = next)
{
next += stride(haystack, i);
//Search for the haystack slice inside the needles
//Note: find does no decoding
if (!find(needles, haystack[i .. next]).empty)
return haystack[i .. $];
}
return -1;
} Also, I think your solution might be wrong: if (needles[mayMatch..$].front == haystack[i..$].front) You only end up testing the "first" maymatch, which may be a negative match that will short-circuit a later positive match in needles. Also, were we to write such code though, we might as well make it a branch in |
So you decode? I thought that was the problem in the first place. I see decodes in the use of indexOf and the .front .front comparison. I'm fine with a TODO and the currently simple code. Maybe std.uni needs a compareCase and compare for (w,d)char. |
@burner There is |
Yes, exactly the problem I thought of when I said it has faults. In fact, I think your solution is fine and simple. The only question remains is some timings on unicode/non-unicode. |
Is called against representation that is ubyte[] or ushort[]. |
All true. It still very much worth going for it with some special casing. string to string is the majority, with wstring to wstring probably being popular on Windows. By and large D turns out to be an UTF-8 language, see that e.g. Object.toString returns an UTF-8 string and there are many things like that. |
} | ||
else | ||
{ | ||
foreach_reverse (i, dchar c; haystack) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retro + findAmong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work if you use "source" afterwards to get the length back. Also, the "length"/"index" mapping is a bit less trivial:
static if (direction)
{
size_t n = haystack.findAmong(needles).length;
return n ? haystack.length - n : -1;
}
else
{
size_t n = haystack.retro.findAmong(needles).source.length;
return n ? n - strideBack(haystack, n) : -1;
}
This should work I think. This will actually have a "little" overhead, as the elements of haystack
will be decoded, then re-encoded in the find among, but, IMO, that fine: As long is it only happens once. The problem is if you decode needles
repeatedly.
You could squeeze extra performance if you used my above algorithm with strideBack
, but I'm not sure it warrants the extra complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: Actually, the return n ? : ;
thing could get you into trouble, as you'll run into a "statement not reachable" error for the trailing return -1;
on release builds. Instead, keep it to:
if (n) return ...;
I'm tempted to move this to std.algorithm and generalize it for all ranges. It's not string-specific and can be used for general range search. Thoughts? |
You'd hit the string specific hurdle of For string manipulation (arguably), it makes since to do such a search. For ranges, we'd first want "findLastOf", and from there we could build countFromEnd, splitEnd, or whatnot. In any case, regardless of what we do in algorithm, it makes sense to have a string specific index version in |
I'll be happy to create these for algorithm, but as monarchdodra said the index != count problem makes this worth having. |
thanks |
I'm realizing that my appender solution my not work, because building it allocates, so the solution isn't GC-less for little inputs either. It also explains why the Appender solution was slow for small inputs. Well, I'll give you a fuller output tomorow. Maybe keeping it simple is best for now. Sight. |
Alright, here is the code and the variants I tested. Please feel free to play around it, and to tell me if I missed anything. I added a final "special" case, that forks behavior depending on input's walklength. It has excelent performance for small input, without GC, but still scales pretty damn well for larger input. That's good, because it means it has good behavior regardless of input size. Appart from that, it also doesn't allocate (at all) for small needles (16 points or less). As much as possible, we want to avoid allocating when possible. Tweaking it could extract still more speed, but the extra code/complexity doesn't justify it. In any case. Chose what you think is best. Here is my benchwork: import std.range, std.traits, std.range, std.stdio, std.exception, std.conv, std.string, std.algorithm, std.utf, std.uni, std.exception, std.typetuple;
import std.internal.scopebuffer;
import std.datetime;
enum N = 1_000_000;
void main()
{
string haystack = "hello!world";
string needles = `!.`;
alias Funs = TypeTuple!(
indexOfAny_raw,
indexOfAny_can,
indexOfAny_AppC,
//indexOfAny_ScopC,
indexOfAny_WaltC,
indexOfAny_AppD,
//indexOfAny_ScopD,
indexOfAny_WaltD,
indexOfAny_Sepcial,
);
foreach ( fun ; Funs )
writeln(fun(haystack, needles, CaseSensitive.no));
foreach ( fun ; Funs )
{
StopWatch w;
w.start();
long i;
foreach ( _ ; 0 .. N )
i += fun(haystack, needles, CaseSensitive.no);
writefln("%s: %s", fun.stringof, w.peek.msecs*0.001);
}
}
private ptrdiff_t indexOfAny_raw(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
static bool toLowerFind(const dchar hay, const(Char2)[] needles)
{
dchar hLow = std.uni.toLower(hay);
foreach (dchar o; needles)
{
dchar oLow = std.uni.toLower(o);
if (hLow == oLow)
{
return true;
}
}
return false;
}
foreach (i, dchar c; haystack)
if (toLowerFind(c, needles))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_can(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
static bool f(dchar a, dchar b)
{
return std.uni.toLower(a) == b;
}
foreach (i, dchar c; haystack)
if (canFind!f(needles, std.uni.toLower(c)))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_AppC(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
char[64] scratch = void;
auto app = appender(scratch[]);
app.clear();
foreach ( dchar c; needles)
put(app, std.uni.toLower(c));
foreach (i, dchar c; haystack)
if (canFind(app.data, std.uni.toLower(c)))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_ScopC(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
char[64] scratch = void;
auto app = ScopeAppender!char(scratch[]);
foreach ( dchar c; needles)
put(app, std.uni.toLower(c));
foreach (i, dchar c; haystack)
if (canFind(app.peek, std.uni.toLower(c)))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_WaltC(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
char[64] scratch = void;
auto app = ScopeBuffer!char(scratch[]);
scope(exit) app.free();
foreach ( dchar c; needles)
put(app, std.uni.toLower(c));
foreach (i, dchar c; haystack)
if (canFind(app[], std.uni.toLower(c)))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_AppD(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
dchar[16] scratch = void;
auto app = appender(scratch[]);
app.clear();
foreach ( dchar c; needles)
put(app, std.uni.toLower(c));
foreach (i, dchar c; haystack)
if (canFind(app.data, std.uni.toLower(c)))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_ScopD(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
dchar[16] scratch = void;
auto app = ScopeAppender!dchar(scratch[]);
foreach ( dchar c; needles)
put(app, std.uni.toLower(c));
foreach (i, dchar c; haystack)
if (canFind(app.peek, std.uni.toLower(c)))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_WaltD(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
dchar[16] scratch = void;
auto app = ScopeBuffer!dchar(scratch[]);
scope(exit) app.free();
foreach ( dchar c; needles)
put(app, std.uni.toLower(c));
foreach (i, dchar c; haystack)
if (canFind(app[], std.uni.toLower(c)))
return i;
}
return -1;
}
private ptrdiff_t indexOfAny_Sepcial(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
else
{
immutable len = needles.walkLength;
if (len <= 16)
{
dchar[16] scratch = void;
dchar[] app = scratch[0 .. len];
size_t i = 0;
foreach ( dchar c; needles)
app[i++] = c;
foreach (i, dchar c; haystack)
if (canFind(app, std.uni.toLower(c)))
return i;
}
else
{
char[64] scratch = void;
auto app = appender(scratch[]);
foreach ( dchar c; needles)
put(app, std.uni.toLower(c));
foreach (i, dchar c; haystack)
if (canFind(app.data, std.uni.toLower(c)))
return i;
}
}
return -1;
} |
Thanks for the benchmark code. It seams like the simple version and canFind are the contender. Your special version had a bug, you missed the toLower at line app[i++] = c Anyway here is my benchmark rework http://dpaste.dzfl.pl/97b1fa04f6a0 |
Good catch.
Well written. I like it. However, I think it might be flawed in terms of how the data is generated. You are basically generating characters in the Which explains (I found this odd), why the larger the needle, the faster the trivial case ran, when I'd have expected the opposite: Bigger input, longer run. So in this scenario, yes, the But the test almost completely misses out on the "no match found" test, or more generally "rare match". For example: From a design point of view, I'm wondering if it would make more sense to require that the needles are all already lowercase for case insensitive searches. Not only would this solve the performance/allocation problem, but it also makes sense from an interface point of view, when you know that these functions are often called in loop with the same needle. Like: "To perform a case insensitive search, Or add a third parameter, "assumeNeedlesAreLowerCased"? @andralex : Would you have any design idea for such issues? Is my interface idea acceptable? |
Do you have some source of useful test data? |
I worked some more on the benchmark. I'm using some text from wikipedia now. I found some more bugs in char appender wont work, needs dchar appender. Anyway you can find the new benchmark under http://dpaste.dzfl.pl/0444e719233d The values are proportional, not in msecs of nsecs. I'm to tired to calculate them right. I added a modified version of your special function. |
What kind of bug? It should work with simple char. If not, it's a bug that needs to be fixed. Please report what the use case is.
I get what you are saying. Testing code can get long and very tiring. I'm not using a working environment, so I can't try your tests right now. I'll try them later. |
It should work for single chars, if there is enough space left to put two/four of them into it in case of utf8. But that should not be the case here, as no needles in utf16/32 are passed in. Anyway changing it to dchar fixed whatever was wrong. |
@monarchdodra Have you look at the benchmark? |
The "less than 16" case looks fine. For more than 16, could you try just using Let's just do that, and wrap this up, I think it's a good addition. |
} | ||
|
||
/** | ||
Returns the index of the last occurence of any of the elements in $(D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last occurence? Wrong copy paste?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
I test the to!dstring version. It is about 1% slower in the avg case. So I think no gc and more speed makes the winner. |
any more input for this? |
Sorry, I haven't forgotten about this.
Yeah... I'm still not entirely satisfied with this: Not doing a eager I'd like to try a few more things on my end (I won't ask you to do my testing). If I come up with something good, I'll tell you. If not, I'll merge away anyways. I think this is a good addition. Give me about a week. |
Ok |
@monarchdodra Any news? |
anyone? |
Ok, sorry for the wait. I was worried about an implementation detail, when what counts (the interface) is fine. In particular, the newly introduced So I'm ready to merge this. Just fix the last typo and we are good to go. |
one sec |
good to go, from my part |
} | ||
else | ||
{ | ||
if (needles.length <= 16) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually needles.walkLength(17) <= 16
for it to be code points.
17 is to prevent it from walking > 17 characters in the worst case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate? Where do you see the walkLength call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had forgotten you could set an upper bound on walkLength! I had thought that walkLength
would be better, but was afraid of the extra decoding cost, so didn't mention it.
Yes, using needles.walkLength(17) <= 16
is superior.
Or if (needles.length <= 16 || needles.walkLength(17) <= 16)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
I will fix it right away
whitespace some code reductions faster some last fixes, I hope typo walkLength
anything else I missed? |
so, what needs to happen to get this merged? |
Thank you for all the help with this one. @monarchdodra what about lastIndexOfNone? |
While doing string processing I found that std.string is missing indexOfAny(haystack,needles). As in indexOfAny returns the first match of any needle of needles in haystack. You can do.
findAmong returns a range and thats very bad in this case as:
is 0 which is wrong as s[0] != 'x'. So you would have to write another branch stmt.
The indexOfAny and lastIndexOfAny follow the indexOf functions already in std.string and return -1 if nothing is found.
comments please