Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(last)indexOfAny(index) #1952

Merged
merged 1 commit into from
May 16, 2014
Merged

Conversation

burner
Copy link
Member

@burner burner commented Feb 18, 2014

While doing string processing I found that std.string is missing indexOfAny(haystack,needles). As in indexOfAny returns the first match of any needle of needles in haystack. You can do.

string s = "hällo";
auto r = s.findAmong("ztl");
auto idx = r.length;

findAmong returns a range and thats very bad in this case as:

r = s.findAmong("xyz");
idx = r.length;

is 0 which is wrong as s[0] != 'x'. So you would have to write another branch stmt.

The indexOfAny and lastIndexOfAny follow the indexOf functions already in std.string and return -1 if nothing is found.

comments please

else
{
immutable innerLoop =
"foreach (dchar o; needles) { dchar oLow = std.uni.toLower(o); "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't quite correct in Unicode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, the unittests with unicode work and the same proved correct for indexOf.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He's talking about dealing with letters such as ß, where it won't match SS. The issue is that currently, nothing in string or algorithm knows how to deal with them. I think your current code is fine: It would be better to upgrade all of string to know about graphemes, then just a single new function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

Well yes and no, the ß example is bad because in german ß != ss, grüßen != grüssen. So finding ss for every ß is just wrong.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it was just an example. The root issue is a codepoint vs grapheme thing, which I think is fine to ignore for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine with me, but I was a little bit scared that my browser thinks that ß == ss ;-)

@DmitryOlshansky
Copy link
Member

is 0 which is wrong as s[0] != 'x'. So you would have to write another branch stmt.

In case of indexOf you still have to test >= 0. In case of find you first have to test if the range is empty, then use the index. I don't see extra branch.

Lastly - decoding is not required at least for case sensitive version.

In case of needles being a string you re-decode all of them on every character in haystack, there got to be a better way. (For instance searching for any codeunit, then doing more precise match)

@burner
Copy link
Member Author

burner commented Feb 27, 2014

I don't really know what you want to say in the first paragraph, could you elaborate please.

It is there was a bug in indexOf, because of unicode slicing I fixed 1/2 a year ago.

The way to do it faster is to do the needles toLower once and save the result, but that requires heap. It is a classic n against m problem and as there is nothing sorted or otherwise this is the way to do it.

@monarchdodra
Copy link
Collaborator

Lastly - decoding is not required at least for case sensitive version.

Only for when the cutrent haystack letter is ascii. That said findAmong, which calls find, does it "nativelly", so just calling either of these should be good enough.

The way to do it faster is to do the needles toLower once and save the result, but that requires heap.

I think it's safe to say that the needles member is probably rarelly more than a handful of characters. Having a simple local stack bufer of, say 16 - 32 (for example) is probably good enough for 99% of the cases? If the lower case fits, then you can convert the lowercase once and for all there (if not, then default to your N*M algorithm).

I'm unsure if it would be better to store them as chars or dchars though: chars can do a memchr among the needles if the haystack.front.toLower is in ascii range, but if not, then dchar doesn't require decoding.

Walter is working on a simple scopedBuffer you would have been able to use here, but it's not ready.

...or you could just leave it as is. Premature optimization is evil :)

@burner
Copy link
Member Author

burner commented Feb 27, 2014

Permature optimization is evil and simple is better, IMHO.

@DmitryOlshansky
Copy link
Member

@burner I've meant what is a bug anyway pointed out by @monarchdodra

string s = "hällo";
auto r = s.findAmong("ztl");
//<---- can test for r.empty here
auto idx = s.length - r.length; // actually findAmong returns leftover portion
//or here  if idx == s.length there was no match

Is not any worse or more branchy then checking against -1 returned by indexOf.

@DmitryOlshansky
Copy link
Member

Permature optimization is evil and simple is better, IMHO.

There is no such thing as premature optimization of a public function in the standard library. The multitude of potential uses and programs is so large that it is basically guaranteed that some code somewhere will use it in a tight loop.

@burner
Copy link
Member Author

burner commented Feb 27, 2014

well, if you have optimize the wrong special case, it is premature. Additionally, selecting which special implementation to use will cost some cycles as well.

@DmitryOlshansky
Copy link
Member

@burner To be precise about the kind of optimization I have in mind.

The 2 loops of decode over a string are replaced with something like this:

for(size_t i=0; i<haystack.length; i += stride(haystack, i)) //no decode but keep proper pace
{
      //i is always at start of a codepoint 
      immutable c = haystack[i];
      auto mayMatch = needles.representation.indexOf(c); 
      if(mayMatch < 0)
         continue;
      //or anything better that decodes 2 fronts
     //may aswell check here for ASCII though decoding does it anyway
      if(needles[mayMatch..$].front == haystack[i..$].front)
     {
         return i;
     }
}
return -1;

Trying to prove if this is the wrong special case needs some numbers, so we have to test it anyway. I trust my experience that tells basically decoding is a big time consumer of simple matching in strings.

@burner
Copy link
Member Author

burner commented Feb 27, 2014

you forgot a loop and a branch

bool allAscii = true;
foreach(c; haystack) {
    if(c >= 0x7F) {
        allAscii = false;
        break;
    }
}

if(allAscii) { yourImpl(); }

and than the call to indexOf will decode anyway

@DmitryOlshansky
Copy link
Member

@burner There is no need for it to be ascii only. This is the important point here.

@DmitryOlshansky
Copy link
Member

Peruse UTF-8 and UTF-16 encodings - a starting code unit can't be found among non-starting ones.

@DmitryOlshansky
Copy link
Member

Speaking of toLower - it's wrong simply because not ever capital letter has lower case mapping and vise-versa. By Unicode standard there is case folding that is applicable for case-insensitive matching which may be simple (1:1 mapping) or full (1:m).

@burner
Copy link
Member Author

burner commented Feb 27, 2014

immutable c = haystack[i]; slices utf(8,16) and you put that into indexOf. So

string a ="ö";
string b ="ä";
immutable a1 = a[0];
assert(a1 == b[0]); // is true
assert(a[0] == b[0]); // is true

@burner
Copy link
Member Author

burner commented Feb 27, 2014

so what do you propose? uni.sicmp && case_comparision?

@DmitryOlshansky
Copy link
Member

My bad github let's me edit your posts ...

immutable c = haystack[i];
You don't read the code apparently. Yes, I do search the first code unit. Then I do decode with .front.

There is a fault in my code as presented (it's a sketch after all) but it's not with decoding.

UPDATE:
Also note how iteration is done: i += stride(haystack, i)

@DmitryOlshansky
Copy link
Member

Maybe uni.sicmp, it's a bit too heavy for single chars. I'd just leave TODO note as it's not that big a problem.

@monarchdodra
Copy link
Collaborator

There is no such thing as premature optimization of a public function in the standard library. The multitude of potential uses and programs is so large that it is basically guaranteed that some code somewhere will use it in a tight loop.

I mostly mean that in the sense that it is (currently) more important to have things that work, rather than improve their performance. In particular, the optimizations you propose only work for when the sizes match. Furthermore, it's a non-optimization the input is dchars. All this creates more branches to test and cover, introducing potentially hard to find bugs.

To be precise about the kind of optimization I have in mind.

Interesting use of stride. That said, it would be even simpler to invert and search, and look for a haystack slice inside the needles?

{
    size_t i = 0, next = 0;
    for ( ; i < haystack.length ; i = next)
    {
        next += stride(haystack, i);
        //Search for the haystack slice inside the needles
        //Note: find does no decoding
        if (!find(needles, haystack[i .. next]).empty)
            return haystack[i .. $];
    }
    return -1;
}

Also, I think your solution might be wrong:

if (needles[mayMatch..$].front == haystack[i..$].front)

You only end up testing the "first" maymatch, which may be a negative match that will short-circuit a later positive match in needles.

Also, were we to write such code though, we might as well make it a branch in findAmong itself.

@burner
Copy link
Member Author

burner commented Feb 27, 2014

So you decode? I thought that was the problem in the first place. I see decodes in the use of indexOf and the .front .front comparison.

I'm fine with a TODO and the currently simple code.

Maybe std.uni needs a compareCase and compare for (w,d)char.

@DmitryOlshansky
Copy link
Member

@burner There is always decode and decode when makes sense. In particular it will decode only on potential matches (by 1 codeunit at least). Search is typically full of mis-matches with only rare hits.

@DmitryOlshansky
Copy link
Member

You only end up testing the "first" maymatch, which may be a negative match that will short-circuit a later positive match in needles.

Yes, exactly the problem I thought of when I said it has faults. In fact, I think your solution is fine and simple. The only question remains is some timings on unicode/non-unicode.

@DmitryOlshansky
Copy link
Member

I see decodes in the use of indexOf

Is called against representation that is ubyte[] or ushort[].

@DmitryOlshansky
Copy link
Member

In particular, the optimizations you propose only work for when the sizes match. Furthermore, it's a non-optimization the input is dchars.

All true. It still very much worth going for it with some special casing. string to string is the majority, with wstring to wstring probably being popular on Windows.

By and large D turns out to be an UTF-8 language, see that e.g. Object.toString returns an UTF-8 string and there are many things like that.

}
else
{
foreach_reverse (i, dchar c; haystack)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retro + findAmong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look into it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work if you use "source" afterwards to get the length back. Also, the "length"/"index" mapping is a bit less trivial:

static if (direction)
{
    size_t n = haystack.findAmong(needles).length;
    return n ? haystack.length - n : -1;
}
else
{
    size_t n = haystack.retro.findAmong(needles).source.length;
    return n ? n - strideBack(haystack, n) : -1;
}

This should work I think. This will actually have a "little" overhead, as the elements of haystack will be decoded, then re-encoded in the find among, but, IMO, that fine: As long is it only happens once. The problem is if you decode needles repeatedly.

You could squeeze extra performance if you used my above algorithm with strideBack, but I'm not sure it warrants the extra complexity.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: Actually, the return n ? : ; thing could get you into trouble, as you'll run into a "statement not reachable" error for the trailing return -1; on release builds. Instead, keep it to:

if (n) return ...;

@burner burner mentioned this pull request Mar 2, 2014
@andralex
Copy link
Member

I'm tempted to move this to std.algorithm and generalize it for all ranges. It's not string-specific and can be used for general range search. Thoughts?

@monarchdodra
Copy link
Collaborator

I'm tempted to move this to std.algorithm and generalize it for all ranges. It's not string-specific and can be used for general range search. Thoughts?

You'd hit the string specific hurdle of index != count. This is why std.string.indexOf exists, even though there's a generic std.algorithm.countUntil.

For string manipulation (arguably), it makes since to do such a search.

For ranges, we'd first want "findLastOf", and from there we could build countFromEnd, splitEnd, or whatnot.

In any case, regardless of what we do in algorithm, it makes sense to have a string specific index version in string.

@burner
Copy link
Member Author

burner commented Mar 16, 2014

I'll be happy to create these for algorithm, but as monarchdodra said the index != count problem makes this worth having.

@burner
Copy link
Member Author

burner commented Mar 19, 2014

thanks

@monarchdodra
Copy link
Collaborator

I'm realizing that my appender solution my not work, because building it allocates, so the solution isn't GC-less for little inputs either. It also explains why the Appender solution was slow for small inputs. Well, I'll give you a fuller output tomorow.

Maybe keeping it simple is best for now. Sight.

@monarchdodra
Copy link
Collaborator

Alright, here is the code and the variants I tested. Please feel free to play around it, and to tell me if I missed anything.

I added a final "special" case, that forks behavior depending on input's walklength.

It has excelent performance for small input, without GC, but still scales pretty damn well for larger input. That's good, because it means it has good behavior regardless of input size.

Appart from that, it also doesn't allocate (at all) for small needles (16 points or less). As much as possible, we want to avoid allocating when possible.

Tweaking it could extract still more speed, but the extra code/complexity doesn't justify it.

In any case. Chose what you think is best. Here is my benchwork:

import std.range, std.traits, std.range, std.stdio, std.exception, std.conv, std.string, std.algorithm, std.utf, std.uni, std.exception, std.typetuple;
import std.internal.scopebuffer;
import std.datetime;

enum N = 1_000_000;

void main()
{
    string haystack = "hello!world";
    string needles = `!.`;

    alias Funs = TypeTuple!(
        indexOfAny_raw,
        indexOfAny_can,
        indexOfAny_AppC,
        //indexOfAny_ScopC,
        indexOfAny_WaltC,
        indexOfAny_AppD,
        //indexOfAny_ScopD,
        indexOfAny_WaltD,
        indexOfAny_Sepcial,
    );

    foreach ( fun ; Funs )
        writeln(fun(haystack, needles, CaseSensitive.no));

    foreach ( fun ; Funs )
    {
        StopWatch w;
        w.start();
        long i;
        foreach ( _ ; 0 .. N )
            i += fun(haystack, needles, CaseSensitive.no);
        writefln("%s: %s", fun.stringof, w.peek.msecs*0.001);
    }
}

private ptrdiff_t indexOfAny_raw(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        static bool toLowerFind(const dchar hay, const(Char2)[] needles)
        {
            dchar hLow = std.uni.toLower(hay);
            foreach (dchar o; needles)
            {
                dchar oLow = std.uni.toLower(o);
                if (hLow == oLow)
                {
                    return true;
                }
            }
            return false;
        }

        foreach (i, dchar c; haystack)
            if (toLowerFind(c, needles))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_can(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        static bool f(dchar a, dchar b)
        {
            return std.uni.toLower(a) == b;
        }

        foreach (i, dchar c; haystack)
            if (canFind!f(needles, std.uni.toLower(c)))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_AppC(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        char[64] scratch = void;
        auto app = appender(scratch[]);
        app.clear();
        foreach ( dchar c; needles)
            put(app, std.uni.toLower(c));

        foreach (i, dchar c; haystack)
            if (canFind(app.data, std.uni.toLower(c)))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_ScopC(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        char[64] scratch = void;
        auto app = ScopeAppender!char(scratch[]);
        foreach ( dchar c; needles)
            put(app, std.uni.toLower(c));

        foreach (i, dchar c; haystack)
            if (canFind(app.peek, std.uni.toLower(c)))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_WaltC(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        char[64] scratch = void;
        auto app = ScopeBuffer!char(scratch[]);
        scope(exit) app.free();
        foreach ( dchar c; needles)
            put(app, std.uni.toLower(c));

        foreach (i, dchar c; haystack)
            if (canFind(app[], std.uni.toLower(c)))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_AppD(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        dchar[16] scratch = void;
        auto app = appender(scratch[]);
        app.clear();
        foreach ( dchar c; needles)
            put(app, std.uni.toLower(c));

        foreach (i, dchar c; haystack)
            if (canFind(app.data, std.uni.toLower(c)))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_ScopD(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        dchar[16] scratch = void;
        auto app = ScopeAppender!dchar(scratch[]);
        foreach ( dchar c; needles)
            put(app, std.uni.toLower(c));

        foreach (i, dchar c; haystack)
            if (canFind(app.peek, std.uni.toLower(c)))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_WaltD(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        dchar[16] scratch = void;
        auto app = ScopeBuffer!dchar(scratch[]);
        scope(exit) app.free();
        foreach ( dchar c; needles)
            put(app, std.uni.toLower(c));

        foreach (i, dchar c; haystack)
            if (canFind(app[], std.uni.toLower(c)))
                return i;
    }
    return -1;
}

private ptrdiff_t indexOfAny_Sepcial(Char, Char2) (const(Char)[] haystack, const(Char2)[] needles, CaseSensitive cs) if (isSomeChar!Char && isSomeChar!Char2)
{
    if (cs == CaseSensitive.yes) { size_t n = haystack.findAmong(needles).length; return n ? haystack.length - n : -1; }
    else
    {
        immutable len = needles.walkLength;
        if (len <= 16)
        {
            dchar[16] scratch = void;
            dchar[] app = scratch[0 .. len];
            size_t i = 0;
            foreach ( dchar c; needles)
                app[i++] = c;

            foreach (i, dchar c; haystack)
                if (canFind(app, std.uni.toLower(c)))
                    return i;
        }
        else
        {
            char[64] scratch = void;
            auto app = appender(scratch[]);
            foreach ( dchar c; needles)
                put(app, std.uni.toLower(c));

            foreach (i, dchar c; haystack)
                if (canFind(app.data, std.uni.toLower(c)))
                    return i;
        }
    }
    return -1;
}

@burner
Copy link
Member Author

burner commented Mar 21, 2014

Thanks for the benchmark code. It seams like the simple version and canFind are the contender. Your special version had a bug, you missed the toLower at line app[i++] = c

Anyway here is my benchmark rework http://dpaste.dzfl.pl/97b1fa04f6a0
Maybe you can have a look

@monarchdodra
Copy link
Collaborator

you missed the toLower at line app[i++] = c

Good catch.

Maybe you can have a look

Well written. I like it.

However, I think it might be flawed in terms of how the data is generated. You are basically generating characters in the letters~"äöüÄÖÜ range. This gives 58 different possible characters, only 29 of which are "case unique". Yet your needles range is sometimes as big 80 chars. You've basically almost got the entire universe in there. The end result is that you'll almost always find a match on one of the first very characters in your haystack.

Which explains (I found this odd), why the larger the needle, the faster the trivial case ran, when I'd have expected the opposite: Bigger input, longer run.

So in this scenario, yes, the simple and canFind versions (they are the same really), perform well.

But the test almost completely misses out on the "no match found" test, or more generally "rare match". For example: indexOfAny(letters, "äöüÄÖÜ") runs 10 times slower with the trivial implementation, but roughly 10 times faster on all others.


From a design point of view, I'm wondering if it would make more sense to require that the needles are all already lowercase for case insensitive searches.

Not only would this solve the performance/allocation problem, but it also makes sense from an interface point of view, when you know that these functions are often called in loop with the same needle.

Like: "To perform a case insensitive search, indexOfAny must be given a needle in an already lowercase case." This would guarantee high performance, and 0 allocations to boot. I'm not sure the design is acceptable though.

Or add a third parameter, "assumeNeedlesAreLowerCased"?


@andralex : Would you have any design idea for such issues? Is my interface idea acceptable?

@burner
Copy link
Member Author

burner commented Mar 21, 2014

Do you have some source of useful test data?

@burner
Copy link
Member Author

burner commented Mar 21, 2014

I worked some more on the benchmark. I'm using some text from wikipedia now. I found some more bugs in char appender wont work, needs dchar appender. Anyway you can find the new benchmark under http://dpaste.dzfl.pl/0444e719233d

The values are proportional, not in msecs of nsecs. I'm to tired to calculate them right. I added a modified version of your special function.

@monarchdodra
Copy link
Collaborator

I found some more bugs in char appender wont work, needs dchar appender.

What kind of bug? It should work with simple char. If not, it's a bug that needs to be fixed. Please report what the use case is.

I'm to tired to calculate them right. I added a modified version of your special function.

I get what you are saying. Testing code can get long and very tiring. I'm not using a working environment, so I can't try your tests right now. I'll try them later.

@burner
Copy link
Member Author

burner commented Mar 23, 2014

It should work for single chars, if there is enough space left to put two/four of them into it in case of utf8. But that should not be the case here, as no needles in utf16/32 are passed in. Anyway changing it to dchar fixed whatever was wrong.

@burner
Copy link
Member Author

burner commented Mar 31, 2014

@monarchdodra Have you look at the benchmark?

@monarchdodra
Copy link
Collaborator

@monarchdodra Have you look at the benchmark?

The "less than 16" case looks fine. For more than 16, could you try just using to!dstring? It allocates, but given the size of the input, I think it's really justified.

Let's just do that, and wrap this up, I think it's a good addition.

}

/**
Returns the index of the last occurence of any of the elements in $(D
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last occurence? Wrong copy paste?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@burner
Copy link
Member Author

burner commented Apr 2, 2014

I test the to!dstring version. It is about 1% slower in the avg case. So I think no gc and more speed makes the winner.

@burner
Copy link
Member Author

burner commented Apr 7, 2014

any more input for this?

@monarchdodra
Copy link
Collaborator

any more input for this?

Sorry, I haven't forgotten about this.

I test the to!dstring version. It is about 1% slower in the avg case. So I think no gc and more speed makes the winner.

Yeah... to!dstring was a retarded idea. I forgot about to do the whole toLower with it too...

I'm still not entirely satisfied with this: Not doing a eager toLower really gives a catastrophic worst case scenario in terms of performance.

I'd like to try a few more things on my end (I won't ask you to do my testing). If I come up with something good, I'll tell you. If not, I'll merge away anyways. I think this is a good addition. Give me about a week.

@burner
Copy link
Member Author

burner commented Apr 13, 2014

Ok

@burner
Copy link
Member Author

burner commented Apr 24, 2014

@monarchdodra Any news?

@burner
Copy link
Member Author

burner commented May 5, 2014

anyone?

@monarchdodra
Copy link
Collaborator

Ok, sorry for the wait. I was worried about an implementation detail, when what counts (the interface) is fine. In particular, the newly introduced @nogc pretty much means potentially using the GC just got real costly.

So I'm ready to merge this. Just fix the last typo and we are good to go.

@burner
Copy link
Member Author

burner commented May 6, 2014

one sec

@burner
Copy link
Member Author

burner commented May 7, 2014

good to go, from my part

}
else
{
if (needles.length <= 16)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually needles.walkLength(17) <= 16 for it to be code points.
17 is to prevent it from walking > 17 characters in the worst case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate? Where do you see the walkLength call?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had forgotten you could set an upper bound on walkLength! I had thought that walkLength would be better, but was afraid of the extra decoding cost, so didn't mention it.

Yes, using needles.walkLength(17) <= 16 is superior.

Or if (needles.length <= 16 || needles.walkLength(17) <= 16)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
I will fix it right away

whitespace

some code reductions

faster

some last fixes, I hope

typo

walkLength
@burner
Copy link
Member Author

burner commented May 8, 2014

anything else I missed?

@burner
Copy link
Member Author

burner commented May 12, 2014

so, what needs to happen to get this merged?

monarchdodra added a commit that referenced this pull request May 16, 2014
@monarchdodra monarchdodra merged commit da082c7 into dlang:master May 16, 2014
@burner burner deleted the last_indexOfAny_index branch May 16, 2014 08:19
@burner
Copy link
Member Author

burner commented May 16, 2014

Thank you for all the help with this one.

@monarchdodra what about lastIndexOfNone?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants