Skip to content

New std.uni meets std.regex #2001

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Apr 23, 2014
Merged

Conversation

DmitryOlshansky
Copy link
Member

I've been preparing this for over half a year now.
Puts our std.regex on the new rails of recently revamped std.uni that now has enough of mojo to power a Unicode-aware regular expression engine. All of it with a minor exception via normal "user-land" public API.

  1. 8KLOCs of red speak for themselves. This is a massive code reuse going on.
  2. Must yield a faster regex (something to rigorously test, I expect around ~10-20% overall)
  3. May fix some Unicode set operation bugs.
    UPDATE: indeed fixes issue 11784.
  4. Saves ~290Kb of object code on 32bits, ~350Kb on 64bits.

TODOs:

1. Regressed ctRegex - it outs of memory on some patterns that previously were passing just fine. In particular blocks 2 & 3 of CT-tests on win32.

Memory usage is something out of my control. I improved it a bit simplifying std.uni abstractions. Tests are now split in 7 groups instead of 4 but cover more cases.
UPDATE: Only 5, with recent CTFE fix.

2. May have regressed regex compilation speed due to less straight-forward construction.

Seems just fine and even a bit better.

import std.regex, std.stdio, std.datetime, std.string;
void main()
{
    StopWatch sw;
    sw.start();
    size_t cnt;
    char[] buf = `[\p{L}xxxx]`.dup;
    foreach(a; 'a'..'z')
    foreach(b; 'a'..'z')
    foreach(c; 'a'..'z')
    foreach(d; 'a'..'z')
    {
        buf[$-5] = d;
        buf[$-4] = c;
        buf[$-3] = b;
        buf[$-2] = a;
        cnt += regex(buf).charsets.length; //touch Regex object
    }
    sw.stop();
    writefln("%s, in %s ms; %g patterns/s", 
        cnt, sw.peek().msecs, cnt/(sw.peek().msecs*1e-3));
}

@DmitryOlshansky
Copy link
Member Author

Woot, this fixes issue 11784

@andralex
Copy link
Member

so when do we pull?

@DmitryOlshansky
Copy link
Member Author

@andralex
I need to figure out why CTFE of new glorius tries eats so much RAM during compilation.
And I found a couple of performance regressions.

@andralex
Copy link
Member

And I found a couple of performance regressions.

You will need to accompany your fixes with proofs.

(Just kidding!!)

@DmitryOlshansky
Copy link
Member Author

@andralex
Yeah, kinda expected ;)
But in fact it's a reasonable request. And I totally expect everybody to pull this ONLY after a list of results before/after is presented, that looks good and can be reproduced. The good thing is that I always have snippets for things I measure, well because it's how I detect that new one is faster in the first place :)

Simple benchmark for std.regex is no big secret and have been lying for ages here:
https://github.com/blackwhale/FReD/blob/master/bench/search/fred_r.d

And some patterns to use with it:
https://github.com/blackwhale/FReD/blob/master/bench/search/patterns.txt

  1. In particular the one on line 10 became faster, and I have a reason why.
  2. The one on line 5 became slower, and again I've already pinpointed the change affecting it.
    And so on.

@andralex
Copy link
Member

may I close while u work on it?

@DmitryOlshansky
Copy link
Member Author

Okay, I had my share of auto-tester)

@Orvid
Copy link
Contributor

Orvid commented Apr 3, 2014

So, does you re-opening the PR mean it will be ready to be pulled soon? (The description still says it's not)

@DmitryOlshansky
Copy link
Member Author

@Orvid Yes, it's getting ready for prime time.

I'm investigating last bits for signs of potential regressions, hence the "DO NOT PULL LABEL" still here. In a day or so I will either close it for more work, or remove the note and do the last adjustments.

@DmitryOlshansky
Copy link
Member Author

Okay 2 CTFE bugs were worked around, this should be finally ready to go.

@dnadlinger
Copy link
Contributor

Seems to be passing. Okay to merge?

@DmitryOlshansky
Copy link
Member Author

Would be awesome.. Actually wait a sec, the last commit has a few stupid typos in text messages.

@DmitryOlshansky
Copy link
Member Author

Would be awesome.. Actually wait a sec, the last commit has a few stupid typos in text messages.

Fixed. Kenji as usual saves my day, a recent fix nearly halved the memory usage.
Now it takes only 5 compiler runs to cover full regex test suite with ctRegex, except for a couple of tests.

@dnadlinger
Copy link
Contributor

Auto-merge toggled on

Saving amount to ~290Kb on 32bit.
Actually add a test case, the issue was fixed as part of the set
of commits that precede this one.
Simplify set construction.
…ode.

Also use Gallop search policy for CodepointSets, as it's closer to
to the common cases for merging charsets.
Cover even more, but in 5 separate compiler runs.
Few cases still hit CTFE bugs
@braddr
Copy link
Member

braddr commented Apr 23, 2014

Pull updated, auto_merge toggled off

@dnadlinger
Copy link
Contributor

Now I wish GitHub had a pull request update diff view.

@DmitryOlshansky
Copy link
Member Author

@klickverbot It's rebase, since recent fix foobared auto-merge for std.regex

@dnadlinger
Copy link
Contributor

Auto-merge toggled on

dnadlinger added a commit that referenced this pull request Apr 23, 2014
@dnadlinger dnadlinger merged commit 5cd162b into dlang:master Apr 23, 2014
@DmitryOlshansky
Copy link
Member Author

Fantastic!

@CyberShadow
Copy link
Member

This pull request introduced a regression:
https://issues.dlang.org/show_bug.cgi?id=12661

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants