Skip to content

Conversation

DmitryOlshansky
Copy link
Member

New version of std.regex, with all changes that followed from review.

jmdavis added a commit that referenced this pull request Nov 6, 2011
FReD,  replacement engine for std.regex
@jmdavis jmdavis merged commit a627abd into dlang:master Nov 6, 2011
@jmdavis
Copy link
Member

jmdavis commented Nov 6, 2011

Merged.

@braddr
Copy link
Member

braddr commented Nov 8, 2011

The OSX/32 and FreeBSD/32 phobos tests have been failing since this was merged in. Two different asserts:

http://d.puremagic.com/test-results/test_data.ghtml?dataid=113354
core.exception.AssertError@std.regex(2900): Assertion failure

http://d.puremagic.com/test-results/test_data.ghtml?dataid=113355
core.exception.AssertError@std/regex.d(7134): bmatch(R,RegEx) if (is(RegEx == Regex!(BasicElementOf!(R)))): failed to match pattern #205: \b\w+\b

@DmitryOlshansky
Copy link
Member Author

Now that's awesome, since I have no access to Mac OS X nor FreeBSD. Yet MacOS definetely worked in the past.

The first one is not a regex engine but a in "kickstarter" part, it's rather isolated. The only reason I can see for it to fail is that somehow bit-table construction got screwed up. And that smacks of compiler bug.

The second one beats me, since getting to 205 means \w+, \b were working in various combinations already.

At any rate, I can enable debug output and see what's going on down here. Can I get access to these boxes ?

@dnadlinger
Copy link
Contributor

@blackwhale: I don't have an OS X VM or something I could give you access to, just my main box, but if you could tell me what exactly you need, I'd gladly send you any debug output (just no time to fully track down the issue atm).

@DmitryOlshansky
Copy link
Member Author

Ok, first of try to unittest regex.d separately with full debug, empty.d assuming is your dummy main:
dmd -unittest -debug -debug=fred_test -debug=fred_matching std/regex.d std/internal/uni_tab.d std/internal/uni.d empty.d

if it compiles all right ./regex should now blurt a ton of debug info.

Then just send me whatever it produces, I'm going to diff it against my win32 test run.

@dnadlinger
Copy link
Contributor

@blackwhale: Using Phobos 1d8fe48, dlang/dmd@b3bbae5 on OS X Lion:

$ dmd -unittest -debug -debug=fred_test -debug=fred_matching std/regex.d std/internal/uni_tab.d std/internal/uni.d empty.d
$ ./regex
core.exception.AssertError@std.regex(2900): Assertion failure
----------------
5   regex                               0x0017ad06 _d_assertm + 30
6   regex                               0x00057fa3 void std.regex.__assert(int) + 27
7   regex                               0x00086f00 @safe void std.regex.__unittest5().@trusted void __T9test_flexS183std5regex7ShiftOrZ.test_flex() + 2164
8   regex                               0x00059958 @safe void std.regex.__unittest5() + 20
9   regex                               0x00057f3b void std.regex.__modtest() + 11
10  regex                               0x00171f25 extern (C) bool core.runtime.runModuleUnitTests().int __foreachbody265(ref object.ModuleInfo*) + 45
11  regex                               0x0016cc8b int object.ModuleInfo.opApply(scope int delegate(ref object.ModuleInfo*)) + 79
12  regex                               0x00171e16 runModuleUnitTests + 134
13  regex                               0x0017b4a3 extern (C) int rt.dmain2.main(int, char**).void runAll() + 43
14  regex                               0x0017b015 extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate()) + 29
15  regex                               0x0017afaf main + 179
16  regex                               0x00057f25 start + 53
17  ???                                 0x00000001 0x0 + 1

@DmitryOlshansky
Copy link
Member Author

strange it means that OSX also don't get to run the actual regex test now...
Ok try with -debug=fred_search, that should uncover some details of this test

@dnadlinger
Copy link
Contributor

Here you go:

$ dmd -unittest -debug -debug=fred_test -debug=fred_search -debug=fred_matching std/regex.d std/internal/uni_tab.d std/internal/uni.d empty.d
$ ./regex 
ShiftOr stumbled on Eol
Min length: 3
ShiftOr stumbled on Eol
Min length: 3
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 7
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 7
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 10
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 10
ShiftOr stumbled on End
Min length: 6
ShiftOr stumbled on End
Min length: 6
ShiftOr stumbled on End
Min length: 3
ShiftOr stumbled on End
Min length: 3
Limit:                              100
State: 11111111111111111111111111111100
State: 11111111111111111111111111111110
Limit:                              100
State: 11111111111111111111111111111111
State: 11111111111111111111111111111100
ShiftOr stumbled on Eol
Min length: 6
ShiftOr stumbled on Eol
Min length: 6
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 14
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 14
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 20
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 20
ShiftOr stumbled on End
Min length: 12
ShiftOr stumbled on End
Min length: 12
ShiftOr stumbled on End
Min length: 6
ShiftOr stumbled on End
Min length: 6
Limit:                           100000
State: 11111111111111111111111111111101
State: 11111111111111111111111111110101
State: 11111111111111111111111111111111
State: 11111111111111111111111111111101
State: 11111111111111111111111111110101
Limit:                           100000
State: 11111111111111111111111111111101
State: 11111111111111111111111111111111
State: 11111111111111111111111111111101
State: 11111111111111111111111111110101
ShiftOr stumbled on Eol
Min length: 9
ShiftOr stumbled on Eol
Min length: 9
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 21
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 21
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 30
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 30
ShiftOr stumbled on End
Min length: 18
ShiftOr stumbled on End
Min length: 18
ShiftOr stumbled on End
Min length: 9
ShiftOr stumbled on End
Min length: 9
Limit:                        100000000
Limit:                        100000000
ShiftOr stumbled on End
Min length: 4
ShiftOr stumbled on End
Min length: 4
Limit:                             1000
State: 11111111111111111111111111111101
State: 11111111111111111111111111111111
State: 11111111111111111111111111111101
State: 11111111111111111111111111111011
State: 11111111111111111111111111110110
ShiftOr stumbled on End
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 2
ShiftOr stumbled on End
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 2
Limit:                               10
State: 11111111111111111111111111111110
Limit:                               10
State: 11111111111111111111111111111110
Limit:                               10
State: 11111111111111111111111111111111
State: 11111111111111111111111111111110
ShiftOr stumbled on Any
Min length: 0
ShiftOr stumbled on Any
Min length: 0
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
Min length: 3
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
Min length: 3
Limit:                              100
State: 11111111111111111111111111111101
State: 11111111111111111111111111111110
State: 11111111111111111111111111111101
State: 11111111111111111111111111111011
Limit:                              100
State: 11111111111111111111111111111101
State: 11111111111111111111111111111110
State: 11111111111111111111111111111110
State: 11111111111111111111111111111101
State: 11111111111111111111111111111011
ShiftOr stumbled on End
Min length: 8
ShiftOr stumbled on End
Min length: 8
Limit:                         10000000
State: 11111111111111111111111111111101
State: 11111111111111111111111111111011
State: 11111111111111111111111111110111
State: 11111111111111111111111111111111
State: 11111111111111111111111111111101
State: 11111111111111111111111111111011
State: 11111111111111111111111111110111
State: 11111111111111111111111111101111
State: 11111111111111111111111111011111
State: 11111111111111111111111110111110
State: 11111111111111111111111101111101
ShiftOr stumbled on End
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 4
ShiftOr stumbled on End
ShiftOr stumbled on End
ShiftOr stumbled on Char
Min length: 4
Limit:                             1000
State: 11111111111111111111111111111101
State: 11111111111111111111111111111101
Limit:                             1000
State: 11111111111111111111111111111111
State: 11111111111111111111111111111101
Limit:                             1000
State: 11111111111111111111111111111111
State: 11111111111111111111111111111111
State: 11111111111111111111111111111101
State: 11111111111111111111111111111101
State: 11111111111111111111111111111101
ShiftOr stumbled on Any
Min length: 0
ShiftOr stumbled on Any
Min length: 0
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
Min length: 6
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
ShiftOr stumbled on End
ShiftOr stumbled on Char
ShiftOr stumbled on Char
Min length: 6
Limit:                           100000
core.exception.AssertError@std.regex(2900): Assertion failure
----------------
5   regex                               0x0020ec36 _d_assertm + 30
6   regex                               0x000ebad3 void std.regex.__assert(int) + 27
7   regex                               0x0011ac64 @safe void std.regex.__unittest5().@trusted void __T9test_flexS183std5regex7ShiftOrZ.test_flex() + 2164
8   regex                               0x000ed488 @safe void std.regex.__unittest5() + 20
9   regex                               0x000eba6b void std.regex.__modtest() + 11
10  regex                               0x00205e55 extern (C) bool core.runtime.runModuleUnitTests().int __foreachbody265(ref object.ModuleInfo*) + 45
11  regex                               0x00200bbb int object.ModuleInfo.opApply(scope int delegate(ref object.ModuleInfo*)) + 79
12  regex                               0x00205d46 runModuleUnitTests + 134
13  regex                               0x0020f3d3 extern (C) int rt.dmain2.main(int, char**).void runAll() + 43
14  regex                               0x0020ef45 extern (C) int rt.dmain2.main(int, char**).void tryExec(scope void delegate()) + 29
15  regex                               0x0020eedf main + 179
16  regex                               0x000eba55 start + 53
17  ???                                 0x00000001 0x0 + 1

If anybody is annoyed by the GitHub notification spam, I can keep this private between Dmitry and me as well, since he's probably the only one to whom the debug stuff makes much sense anyway.

@DmitryOlshansky
Copy link
Member Author

I'd gather it's pretty useless for others, hopefully there would be no need for dumps this big.
From that it seems like memchr branch it uses bumps out on the first iteration then.

Let's first do sanity check this, change the line 2900, to see if this will print the length of "ababx" (5):
assert(kick.search("ababx",0) == 2, text("oops result is: ",kick.search("ababx",0)));

Then comment out lines 2747-2792, that should look like this:
/* if(fChar != uint.max)
{
...
}
else */

If this works afterwards, then I'd have to dig somewhere around memchr stuff on OSX. Plus it seems like the only platform specific thing there.

@dnadlinger
Copy link
Contributor

Eww, this smells like a codegen bug or some serious kind of corruption: If I add the message to the assert on line 2900, it isn't triggered anymore, instead the line 7136 assert mentioned above fires later on. Trying to get an idea what is going on, I added writeln("before " ~ Char.stringof) and writeln("after " ~ Char.stringof) around the line 2900 assert. Without the additional message, the assertion is triggered for the wchar case, but with the addition in place, the wchar one passes and the dchar one is triggered…

@DmitryOlshansky
Copy link
Member Author

Ouch, something funky is going on. Could be something to do with memory layout then. I think codegen here should be more or less the same for all platforms though.

Another thing to try to isolate bug is commenting out lines that invoke kickstarter at all.
That's inside struct Input, lines 2940-41:

bool search(Kickstart)(ref Kickstart kick, ref dchar res, ref size_t pos)
{
//size_t idx = kick.search(_origin, _index);
//_index = idx;
return nextChar(res, pos);
}

And disable the test block around line 2900 for now, then see if other tests pass.

@dnadlinger
Copy link
Contributor

@blackwhale: I commented out the two lines you mentioned, and disabled the whole unittest block around line 2900. The 7134 assertion still fails, see the gist for the output when compiled with -debug -debug=fred_debug -debug=fred_matching: https://gist.github.com/1348948

@DmitryOlshansky
Copy link
Member Author

That's ridiculous, wordboundary seems to utterly fail (and almost works the opposite way).
The likely problem is that wordTrie is messed up.

First stuff this debug writeln (line 3437):
else if(s.loopBack.nextChar(back, index))
{
bool af = wordTrie[front];
bool ab = wordTrie[back];

debug writeln("-- wordboundary ch", front, " " back, ": ", af," ^ " ,ab, "=", af^ab);
Feel free to stuff simillar lines to other cases, I need to see the chars and result of wordTrie[x] on them.
....
Then I'd rather see dump of wordTrie, the problem is be default it will dump all and every created one, with the whole process.
So to get a clean picture: kill debug block on line 572 of internal/uni.d and add -debug=fred_trie

Then add line
wordTrie.desc();
somewhere in the first test block.

@braddr
Copy link
Member

braddr commented Nov 27, 2011

I spent some time on this one today.. here's what I've found so far.

braddr@freebsd$ cat junk.d
module bug;

extern(C) int printf(const char*, ...);

void main()
{
auto foo = "abc";
printf("foo.ptr = %p, foo.length = %zd\n", foo.ptr, foo.length);
printf("foo = %.*s\n", foo.length, foo.ptr);
}

$ dmd junk.d && ./junk
foo.ptr = 0x8063411, foo.length = 3
foo = abc

so, string literals are not word aligned. This causes code inside struct ShiftOr's search() code to misbehave:
if(!(cast(size_t)p & (Char.sizeof-1)))
break;

This seems to be true on both osx/32 and freebsd/32. I haven't tried freebsd/64.

I'm pretty sure that it'd be easy enough to cause with wchar's on any platform.

@braddr
Copy link
Member

braddr commented Nov 27, 2011

Sorry, ignore the wchar part and here's a version with dstring's explicitly:

braddr@freebsd$ cat junk.d
module bug;
extern(C) int printf(const char*, ...);
void main()
{
dstring foo = "abc";
printf("foo.ptr = %p, foo.length = %zd\n", foo.ptr, foo.length);
}

braddr@freebsd$ ../trunk/dmd/src/dmd junk.d && ./junk
foo.ptr = 0x8063411, foo.length = 3

I'm still trying to coax the simple test into showing similar alignment issues for os/x, but it's clear from debug code that the parameter making its way into search is misaligned: haystack.ptr = 0x8e15e, idx = 0

@braddr
Copy link
Member

braddr commented Nov 27, 2011

Continuing the investigation of the osx bug, I see the same basic symptoms as discussed in the rest of the thread. There needs to be a LOT more unit tests in std.internal.uni. There's a lot of non-trivial code in there and very few tests. The one test that claims to be slow isn't.. it doesn't even compile.

@blackwhale any way you can get access to an osx box and dig into this? It's all your code, right?
@complexmath can you give blackwhale an account on your osx box?

The bottom line is that this needs to be fixed or reverted. Regressing the regex engine between releases is a bad idea. This has lingered for far too long.

@DmitryOlshansky
Copy link
Member Author

  1. I have no access to OSX and was pretty much stuck with it.
  2. About the slow test - it got broken during merging, that I'm fixing.
  3. Thanks for heads up on strings being misaligned, now I can roll out a possible fix.

@DmitryOlshansky
Copy link
Member Author

About small number of tests inside std.internal.uni ---- these functions are actually used & tested inside std.regex.
This is especially true for getCommonCasing and caseEnclose. Though I agree some extra tests should be present here as well.
The prime reason for separation (which wasn't a clean-cut, I admit ) is to put pure codepoint manipulation that can be used elsewhere outside of the regex engine.

@braddr
Copy link
Member

braddr commented Nov 27, 2011

used elsewhere isn't a valid excuse. Each module should be independently tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants