Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failures in Perl 5.30.3 #229

Open
bwarden opened this issue Jun 3, 2020 · 40 comments
Open

Test failures in Perl 5.30.3 #229

bwarden opened this issue Jun 3, 2020 · 40 comments

Comments

@bwarden
Copy link

bwarden commented Jun 3, 2020

When building Sereal::Decoder 4.0.11 with Perl 5.30.3 on Clear Linux, I encountered new test failures that did not occur with Perl 5.30.2.

Excerpt (similar patterns in other tests, too):

#   Failed test 'long ascii string 'abc' x 9999 (snappy_v4, object-oriented, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 9999 (snappy_v4, object-oriented, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3\0\215bcabcabc\376\3\0\215ababcabc\376\3\0\215ab"... (octets: 29997, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...                         (octets: 29997, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 9999 (snappy_v4, functional simple, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 9999 (snappy_v4, functional simple, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3\20\202bcabcabc\376\3\20\202bcabcabc\376\3\20\202bc"... (octets: 29997, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...                            (octets: 29997, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 9999 (snappy_v4, functional with object, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 9999 (snappy_v4, functional with object, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3\220>bcabcabc\376\3\220>ababcabc\376\3\220>ab"... (octets: 29997, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...                      (octets: 29997, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 9999 (snappy_v4, header-body, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 9999 (snappy_v4, header-body, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3\0\0bcabcabc\376\3\0\0bcabcabc\376\3\0\0bc"... (octets: 29997, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...                   (octets: 29997, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 10000 (snappy_v4, object-oriented, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 10000 (snappy_v4, object-oriented, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3P\31bcabcabc\376\3P\31abcacabc\376\3P\31ab"... (octets: 30000, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...                   (octets: 30000, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 10000 (snappy_v4, functional simple, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 10000 (snappy_v4, functional simple, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3\0\0bcabcabc\376\3\0\0bcabcabc\376\3\0\0bc"... (octets: 30000, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...                   (octets: 30000, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 10000 (snappy_v4, functional with object, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 10000 (snappy_v4, functional with object, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3\20\202bcabcabc\376\3\20\202P\31abcabc\376\3\20\202P\31"... (octets: 30000, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...                                (octets: 30000, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 10000 (snappy_v4, header-body, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
# long ascii string 'abc' x 10000 (snappy_v4, header-body, decoded vs data) - first string difference at octet offset 24
#  got-octets = ..."cabcabcabc" . "\376\3cabcabcabc\376\3cabcabcabc\376\3cabc"... (octets: 30000, utf8-flag: 0)
# want-octets = ..."cabcabcabc" . "abcabcabcabcabcabcabcabcabcabc"...             (octets: 30000, utf8-flag: 0)
#   Failed test 'long ascii string 'abc' x 10001 (snappy_v4, object-oriented, decoded vs data) - strings different'
#   at t/lib/Sereal/TestSet.pm line 1220.
@bwarden
Copy link
Author

bwarden commented Jun 10, 2020

Full log from cpanp:
Sereal-Decoder-4.011-1591810060.log

Comparing my results with CPANTS, it looks like all those testers ran only 703 or 704 tests due to skipping: "Did not find right version of encoder", while we run 212,671 tests.

@bwarden
Copy link
Author

bwarden commented Jun 10, 2020

For one more point of data, we have 295,002 passing tests with no failures for Sereal and 292,447 passing tests with no failures for Sereal::Encoder.

@demerphq
Copy link
Member

demerphq commented Jun 10, 2020 via email

@bwarden
Copy link
Author

bwarden commented Jun 11, 2020

Probably obvious, but if I hide the previous version of Sereal::Encoder while building Sereal::Decoder, I can make it skip all those failing tests.

@demerphq
Copy link
Member

demerphq commented Jun 11, 2020 via email

@demerphq
Copy link
Member

I've looked into this more, and i dont get it. The test says you have Sereal::Encoder and Sereal::Decoder 4.011. Also all the fails relate to snappy which is suggestive that something broke there.

It looks to me like you installed Sereal::Encoder 4.011 first and then tried to install Sereal::Decoder 4.011, which is strange, usually its the other way around. Anyway, at this point im just plain confused. Can you help dig into this a bit more? Is it possible you are using the build mode where modules for 5.30.3 are shared with 5.30.2 and etc? I need to see your perl -V and some other stuff.

@bwarden
Copy link
Author

bwarden commented Jun 11, 2020

perlv.txt

We had Sereal, Sereal::Encoder, and Sereal::Decoder 4.011 installed under 5.30.2. I upgraded Perl to 5.30.3, and configured it to support modules from 5.30.2 to ease the transition. I rebuilt all of our modules for 5.30.3. It wasn't strictly alphabetical, but the process did complete Sereal and Sereal::Encoder successfully first, then Sereal::Decoder failed.

@demerphq
Copy link
Member

demerphq commented Jun 11, 2020 via email

@bwarden
Copy link
Author

bwarden commented Jun 11, 2020

Encoder 4.014 fails on I think the same snappy tests. What I don't understand is that the chroot where I'm building it doesn't even have snappy installed.

edit: Oh, i see you're using a local copy, not the system's snappy

@demerphq
Copy link
Member

demerphq commented Jun 11, 2020 via email

@demerphq
Copy link
Member

demerphq commented Jun 11, 2020 via email

@Tux
Copy link

Tux commented Jun 12, 2020

Linux 5.6.14-1-default [openSUSE Tumbleweed 20200610] HP ZBook 15G3 Core(TM) i7-6820HQ CPU @ 2.70GHz/2781(8 cores) x86_64 15925 Mb
This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-thread-multi-ld
Same failues on 4.014

@bwarden
Copy link
Author

bwarden commented Jun 12, 2020

Another difference that might contribute -- between passing and failing builds we moved from gcc9 to gcc10.

@bwarden
Copy link
Author

bwarden commented Jun 12, 2020

I just confirmed. Building 4.014 on Perl 5.30.3 with gcc9:

All tests successful.
Files=73, Tests=296736, 69 wallclock secs ( 9.90 usr  0.84 sys + 66.62 cusr  2.09 csys = 79.45 CPU)
Result: PASS

With gcc10:

Files=73, Tests=212672, 51 wallclock secs ( 7.26 usr  0.60 sys + 49.00 cusr  1.54 csys = 58.40 CPU)
Result: FAIL
Failed 14/73 test programs. 344/212672 subtests failed.
make: *** [Makefile:1102: test_dynamic] Error 255

@demerphq
Copy link
Member

demerphq commented Jun 14, 2020 via email

@demerphq
Copy link
Member

demerphq commented Jun 14, 2020 via email

@Tux
Copy link

Tux commented Jun 14, 2020 via email

@demerphq
Copy link
Member

demerphq commented Jun 14, 2020 via email

@Tux
Copy link

Tux commented Jun 14, 2020

$ gcc-9 -dumpversion
9

$ gcc-9 --version
gcc-9 (SUSE Linux) 9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ path -p gcc-9
/usr/bin/gcc-9
gcc9-9.3.1+git1296-1.9.x86_64

All tests successful.
Files=73, Tests=296736, 81 wallclock secs (21.15 usr  1.32 sys + 77.86 cusr  2.38 csys = 102.71 CPU)
Result: PASS

But that requires manual intervention, as I need to alter the generated Makefile

@demerphq
Copy link
Member

demerphq commented Jun 15, 2020 via email

@dgryski
Copy link
Member

dgryski commented Jun 15, 2020

Does gcc-10 miscompile the snappy code in isolation?

@demerphq
Copy link
Member

demerphq commented Jun 15, 2020 via email

@soig
Copy link

soig commented Jun 19, 2020

We are seeing the same issue on Mageia Cauldron (using gcc-10 too)

@pterjan
Copy link
Contributor

pterjan commented Jul 24, 2020

I played a bit with the test cases and failures are with patterns of length < 16 and not power of 2, when the total length is >= 1022...

Pattern (length) Last success count (length) First failing count (length)
"ab" (2) Always
"abc" (3) 340 (1020) 341 (1023)
"abcd" (4) Always
"abcde" (5) 204 (1020) 205 (1025)
"abcdef" (6) 170 (1020) 171 (1026)
"abcdefg" (7) 145 (1015) 146 (1022)
"abcdefgh" (8) Always
"abcdefghi" (9) 113 (1017) 114 (1026)
"abcdefghijklmno" (15) 68 (1020) 69 (1035)
"abcdefghijklmnop" (16) Always
"abcdefghijklmnopq" (17) Always
"abcdefghijklmnopqr" (18) Always
"abcdefghijklmnopqrs" (19) Always
Anything larger I tried Always

Are the strings directly passed to snappy or encoded / processed before?
I tried csnappy directly on those strings and compress + decompress returns original string.

I also tried building with -O0 instead of -O2 and this doesn't help.

@dgryski
Copy link
Member

dgryski commented Jul 24, 2020

Compression happens on the Sereal document body after the entire structure has been serialized, then a header is added. If gcc-10 is miscompiling snappy, we should be able to write test cases that target the snappy library and not worry about sereal.

@pterjan
Copy link
Contributor

pterjan commented Jul 24, 2020

Reading the source, the reason why tests pass when the size is < 1024 is because compression is disabled at that threshold by default. If changing the threshold to 1, it fails for 9 x "abc", at offset 24 like for longer strings.

@pterjan
Copy link
Contributor

pterjan commented Jul 24, 2020

The corruption is FE 03 00 C0 getting written at offset 24-28 then 8 bytes are ok, then FE 03 00 C0 then 8 bytes are ok, then FE 03 00 C0 then everything is fine.

FE 03 00 is what makes most of the compressed body.

@pterjan
Copy link
Contributor

pterjan commented Jul 25, 2020

I changed the test code to dump data, encoded and decoded into files, for "abc" x 32.

$ od -x t/700_roundtrip/v1/data/snappy_v1/long_ascii_string_abc_x_32.data 
0000000 6261 6163 6362 6261 6163 6362 6261 6163
0000020 6362 6261 6163 6362 6261 6163 6362 6261
0000040 6163 6362 6261 6163 6362 6261 6163 6362
0000060 6261 6163 6362 6261 6163 6362 6261 6163
0000100 6362 6261 6163 6362 6261 6163 6362 6261
0000120 6163 6362 6261 6163 6362 6261 6163 6362
0000140
$ od -x t/700_roundtrip/v1/data/snappy_v1/long_ascii_string_abc_x_32.data_with_header
0000000 6026 6261 6163 6362 6261 6163 6362 6261
0000020 6163 6362 6261 6163 6362 6261 6163 6362
0000040 6261 6163 6362 6261 6163 6362 6261 6163
0000060 6362 6261 6163 6362 6261 6163 6362 6261
0000100 6163 6362 6261 6163 6362 6261 6163 6362
0000120 6261 6163 6362 6261 6163 6362 6261 6163
0000140 6362 000a
0000143
$ od -x t/700_roundtrip/v1/data/snappy_v1/long_ascii_string_abc_x_32.enc
0000000 733d 6c72 0011 1062 6026 6261 fe63 0003
0000020 0372 0000
0000023
$ od -x t/700_roundtrip/v1/data/snappy_v1/long_ascii_string_abc_x_32.dec 
0000000 6261 6163 6362 6261 6163 6362 6261 6163
0000020 6362 6261 6163 6362 0000 0000 6362 6261
0000040 6163 6362 0000 0000 0000 0000 6163 6362
0000060 0000 0000 0000 0000 0000 0000 0000 0000
*
0000140

So 4 \0 get added at position 24, then 8 \0 at position 44, then only \0 starting at position 60...

I tried a pure C encoding / decoding of the data without header, which works fine, and the compressed part looks consistent:

$ od -x t/700_roundtrip/v1/data/snappy_v1/long_ascii_string_abc_x_32.enc-c 
0000000 1062 6026 6261 fe63 0003 0372 0000
0000015
$ od -x t/700_roundtrip/v1/data/snappy_v1/long_ascii_string_abc_x_32.dec-c 
0000000 6026 6261 6163 6362 6261 6163 6362 6261
0000020 6163 6362 6261 6163 6362 6261 6163 6362
0000040 6261 6163 6362 6261 6163 6362 6261 6163
0000060 6362 6261 6163 6362 6261 6163 6362 6261
0000100 6163 6362 6261 6163 6362 6261 6163 6362
0000120 6261 6163 6362 6261 6163 6362 6261 6163
0000140 6362
0000142

As the encoded data looks similar (apart from an added header) with the one I get in C that gets decompressed fine, I would blame decoding rather than encoding, and probably not in the snappy code.

@dgryski
Copy link
Member

dgryski commented Jul 25, 2020

I wonder if gcc-10 changed the default size of an int (or something related) and we're using an unsigned integer type somewhere?

@pterjan
Copy link
Contributor

pterjan commented Jul 26, 2020

I have spent a few more hours trying to debug this in the Decoder, adding a simple test. After adding a lot of logging the test passed, gradually removing it I ended up with this patch making it work:

--- snappy.sav/csnappy_decompress.c	2020-02-02 02:55:14.000000000 +0000
+++ snappy/csnappy_decompress.c	2020-07-26 21:19:15.458305340 +0000
@@ -41,6 +41,7 @@
 #include <linux/module.h>
 #endif
 #include "csnappy.h"
+#include <stdio.h>
 
 int
 csnappy_get_uncompressed_length(
@@ -246,6 +247,7 @@
 		op += op - src;
 	}
 	while (len > 0) {
+		printf("Foo\n");
 		UnalignedCopy64(src, op);
 		src += 8;
 		op += 8;

So it seems gcc is doing something strange int that loop, and the problem is really in csnappy not Sereal.

@dgryski
Copy link
Member

dgryski commented Jul 26, 2020

Wow, thanks for your work hunting this down.

@pterjan
Copy link
Contributor

pterjan commented Jul 26, 2020

So my previous attempt at changing the optimisation flag didn't help as I was doing it while building Encode while the problem was in Decode.

Changing OPTIMIZE = -O3 to be -O2 also fixes the problem.

And the following reproduces the problem:

#include <string.h>
#include <stdio.h>

static inline void IncrementalCopyFastPath(const char *src, char *op, int len)
{
        while (op - src < 8) {
                *(unsigned long*)op = *(const unsigned long*)src;
                len -= op - src;
                op += op - src;
        }
        while (len > 0) {
                *(unsigned long*)op = *(const unsigned long*)src;
                src += 8;
                op += 8;
                len -= 8;
        }
}

int main() {
	char buf[1024];
	strcpy(buf, "abc");
	IncrementalCopyFastPath(buf, buf+3, 64);
	printf("%s\n", buf);
}
[pterjan@mageia Sereal-Decoder-4.017]$ gcc -O2 -Wall t.c; ./a.out 
abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcab
[pterjan@mageia Sereal-Decoder-4.017]$ gcc -O3 -Wall t.c; ./a.out 
abcabcabcabcabcabcabcabc

I'll report it to gcc, and will build with -O2 meanwhile.

@pterjan
Copy link
Contributor

pterjan commented Jul 27, 2020

@demerphq
Copy link
Member

demerphq commented Jul 27, 2020 via email

@dgryski
Copy link
Member

dgryski commented Jul 27, 2020

From the gcc bug report:

The following works - your unsigned long objects are not aligned according
to the type which makes the accesses undefined (actually C even specifies
forming the pointer itself is undefined).

So, seems like we should 1) inform csnappy of this and 2) run the perl test suite with ubsan/: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

@demerphq
Copy link
Member

demerphq commented Jul 29, 2020 via email

@demerphq
Copy link
Member

demerphq commented Aug 3, 2020 via email

@demerphq
Copy link
Member

demerphq commented Aug 3, 2020 via email

@Tux
Copy link

Tux commented Aug 3, 2020

Linux 5.7.11-1-default [openSUSE Tumbleweed 20200731]  HP ZBook 15G3
Core(TM) i7-6820HQ CPU @ 2.70GHz/2118(8 cores) x86_64  15922 Mb
gcc (SUSE Linux) 10.1.1 20200625 [revision c91e43e9363bd119a695d64505f96539fa451bf2]
Sereal-Encoder-4.017_001
All tests successful.
Files=70, Tests=294317, 82 wallclock secs (21.37 usr  1.40 sys + 79.00 cusr  2.28 csys = 104.05 CPU)
Result: PASS
Sereal-Decoder-4.017_001
All tests successful.
Files=74, Tests=296792, 81 wallclock secs (21.01 usr  1.39 sys + 78.95 cusr  2.37 csys = 103.72 CPU)
Result: PASS
Sereal-4.017_001
All tests successful.
Files=87, Tests=296905, 86 wallclock secs (22.31 usr  1.40 sys + 83.61 cusr  2.45 csys = 109.77 CPU)
Result: PASS

@demerphq
Copy link
Member

demerphq commented Aug 3, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants