Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix differing behavior in C implementation on big-endian platforms #118

Conversation

pascal-cuoq
Copy link

@pascal-cuoq pascal-cuoq commented Sep 28, 2020

Hello,

J-P did not express much interest for “exotic” platforms, so @jakub-zwolakowski only configured the C implementation to be interpreted on a 32-bit big-endian and on a 64-bit big-endian platform, in addition to x86 and x86-64. The chosen platforms, ppc32 and ppc64, have the advantage of defining char as an unsigned integer type, so they provide one more portability check for the same price (BLAKE3 is not written to rely on the signedness of char, but since the type char appears in the prototypes of standard functions, it is difficult to be absolutely sure).

Long story short, there was no Undefined Behavior for these big-endian platforms but they were computing different digests than the little-endian platforms. One possibility is that a bug in TrustInSoft CI's emulation of these big-endian architectures causes the execution to be imperfectly emulated, but that does not seem very likely, because any ordinary bug would already be visible for shorter plaintexts. Another possibility is that although the C implementation of BLAKE3 is written to behave the same on little- and big-endian platforms with functions named load32 and store32, one spot where these functions were necessary was missed.

The tests that show the wrong digest being computed are: https://dev.ci.trust-in-soft.com/projects/jakub-zwolakowski/BLAKE3/37?page=1

The same tests, with the change in this PR applied, compute the correct digest in: https://ci.trust-in-soft.com/projects/pascal-cuoq/BLAKE3/2?page=4

I expect the change to make more sense to you when you see it than it did to me when writing it. If the change does not seem to make sense, then we can investigate further.

@pascal-cuoq
Copy link
Author

This change only appeared to fix the problem seen with big-endian platforms because of a misconfiguration on my part, it seems.

@sneves
Copy link
Collaborator

sneves commented Sep 29, 2020

Is the bug still there and this fix was bad, or was there no bug to begin with?

@pascal-cuoq
Copy link
Author

The fix was bad.
The likelihood is still that there is a bug for plaintexts of 1025 characters or more on big-endian platforms. Since I have access to the software simulator of big-endian C platforms which diagnosed the problem (and in which there is a small possibility that the bug actually is), I will continue to try to isolate the place where the executions diverge between little- and big-endian.

@pascal-cuoq
Copy link
Author

@sneves What about the for loop below? Does that look like a change that would fix a bug in the handling of 1025-byte plaintexts in the portable C version of BLAKE3 on big-endian architectures?

diff --git a/c/blake3.c b/c/blake3.c
index c99d059..1d139b2 100644
--- a/c/blake3.c
+++ b/c/blake3.c
@@ -592,6 +592,16 @@ void blake3_hasher_finalize_seek(const blake3_hasher *self, uint64_t seek,
     uint8_t parent_block[BLAKE3_BLOCK_LEN];
     memcpy(parent_block, &self->cv_stack[cvs_remaining * 32], 32);
     output_chaining_value(&output, &parent_block[32]);
+    tis_show_each_parent_block(
+                               parent_block[0], parent_block[1], parent_block[2], parent_block[3],
+                               parent_block[32], parent_block[33], parent_block[34], parent_block[35]);
+    for (int i = 0; i < 16; i++) {
+      uint32_t word = ((uint32_t*)parent_block)[i];
+      parent_block[4*i] = (uint8_t)word;
+      parent_block[4*i+1] = (uint8_t)(word>>8);
+      parent_block[4*i+2] = (uint8_t)(word>>16);
+      parent_block[4*i+3] = (uint8_t)(word>>24);
+    }
     output = parent_output(parent_block, self->key, self->chunk.flags);
   }
   output_root_bytes(&output, seek, out, out_len);

If it looks like the problem exists and this change is the right way to fix it, I will clean up the change so that it does not violate strict aliasing rules, ask @jakub-zwolakowski to check my work, and re-open this PR.

@pascal-cuoq
Copy link
Author

The above appears to fix a problem for plaintexts of length 1025 but something continues to be wrong for plaintexts of length 2048.

@oconnor663
Copy link
Member

oconnor663 commented Sep 29, 2020

EDIT: Oh ignore this comment, I'm confused and talking about the Rust implementation. I tried to get the C implementation running under cross at some point, but I think I ran into trouble with the compiler toolchain? Can't remember.


I haven't looked closely at the proposed fix yet, but here's an important data point: We do big endian emulation as part of our CI testing, using cross to target mips-unknown-linux-gnu. Here's an example run, the most recent on master: https://github.com/BLAKE3-team/BLAKE3/runs/1162056819. The tests exercise cargo test and also our fixed test vectors (where we have this cross_test.sh script to work around some unfortunate details in how files get shared with the emulated machine). Both of those runs cover the same set of cases (CHUNK_LEN is 1024):

pub const TEST_CASES: &[usize] = &[
    0,
    1,
    CHUNK_LEN - 1,
    CHUNK_LEN,
    CHUNK_LEN + 1,
    2 * CHUNK_LEN,
    2 * CHUNK_LEN + 1,
    3 * CHUNK_LEN,
    3 * CHUNK_LEN + 1,
    4 * CHUNK_LEN,
    4 * CHUNK_LEN + 1,
    5 * CHUNK_LEN,
    5 * CHUNK_LEN + 1,
    6 * CHUNK_LEN,
    6 * CHUNK_LEN + 1,
    7 * CHUNK_LEN,
    7 * CHUNK_LEN + 1,
    8 * CHUNK_LEN,
    8 * CHUNK_LEN + 1,
    16 * CHUNK_LEN,  // AVX512's bandwidth
    31 * CHUNK_LEN,  // 16 + 8 + 4 + 2 + 1
    100 * CHUNK_LEN, // subtrees larger than MAX_SIMD_DEGREE chunks
];

As a sanity check, I've pushed a branch that intentionally breaks big endian. Here's the CI run: https://github.com/BLAKE3-team/BLAKE3/runs/1182629230. As expected, only the emulated mips-unknown-linux-gnu tests show the failure.

I guess the first thing I should ask is whether we're testing the same cases. I think CHUNK_LEN + 1 (1025) and 2 * CHUNK_LEN (2048) should be the same cases you mentioned, but could you help me double check? If we are testing the same cases, maybe it could be an emulation issue?

@oconnor663
Copy link
Member

oconnor663 commented Sep 29, 2020

Ok, now that I understand what's going on, I was able to hack together cross test for the C code, and I can reproduce the failure:

---- test::test_compare_reference_impl stdout ----
[src/test.rs:361] case = 0
[src/test.rs:361] case = 1
[src/test.rs:361] case = 2
[src/test.rs:361] case = 3
[src/test.rs:361] case = 4
[src/test.rs:361] case = 5
[src/test.rs:361] case = 6
[src/test.rs:361] case = 7
[src/test.rs:361] case = 8
[src/test.rs:361] case = 63
[src/test.rs:361] case = 64
[src/test.rs:361] case = 65
[src/test.rs:361] case = 127
[src/test.rs:361] case = 128
[src/test.rs:361] case = 129
[src/test.rs:361] case = 1023
[src/test.rs:361] case = 1024
[src/test.rs:361] case = 1025
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[72, 8, 152, 47, 210, 139, 206, 90, 252, 54, 46, 64, 124, 145, 171, 18, 106, 103, 215, 52, 9, 203, 58, 129, 45, 101, 76, 78, 167, 226, 99, 44, 169, 73, 22, 7, 82, 15, 18, 33, 56, 215
, 120, 237, 89, 207, 188, 237, 77, 171, 42, 58, 46, 233, 44, 117, 181, 187, 40, 80, 192, 131, 151, 171, 236, 53, 60, 47, 98, 208, 228, 209, 94, 155, 136, 141, 152, 95, 38, 137, 146, 190, 253,
 85, 234, 112, 31, 73, 0, 72, 55, 119, 19, 9, 93, 172, 108, 131, 167, 220, 94, 7, 106, 184, 10, 46, 49, 78, 150, 159, 11, 45, 69, 226, 2, 87, 147, 199, 50, 166, 138, 170, 252, 44, 91, 66, 242
, 163, 232, 17, 30, 87, 49, 85, 209, 139, 29, 154, 94, 174, 98, 100, 144, 220, 113, 171, 68, 96, 14, 84, 73, 188, 178, 148, 56, 21, 71, 174, 142, 203, 58, 151, 226, 204, 26, 165, 235, 114, 14
5, 239, 254, 58, 91, 235, 154, 215, 34, 207, 82, 43, 18, 214, 16, 2, 5, 149, 135, 182, 6, 80, 115, 20, 167, 196, 233, 49, 238, 184, 209, 63, 123, 163, 31, 152, 5, 70, 155, 79, 160, 9, 158, 15
9, 31, 82, 115, 30, 10, 144, 136, 237, 88, 114, 136, 175, 162, 77, 97, 37, 187, 132, 221, 124, 212, 205, 58, 126, 91, 152, 13, 82, 115, 22, 248, 249, 12, 241, 31, 74, 153, 174, 251, 26, 68, 1
79, 185, 125, 109, 175, 141, 80, 121, 230, 227, 211, 118, 165, 142, 131, 250, 102, 240, 11, 92, 67, 74, 50, 131, 39, 73, 167, 154, 119, 134, 90, 3, 178, 248, 139, 126, 0, 133, 63, 24, 122, 21
7, 140, 90, 84, 82, 227, 14, 80, 244]`,
 right: `[208, 2, 120, 174, 71, 235, 39, 179, 79, 174, 207, 103, 180, 254, 38, 63, 130, 213, 65, 41, 22, 193, 255, 217, 124, 140, 183, 251, 129, 75, 132, 68, 244, 196, 162, 43, 75, 57, 145, 8
5, 53, 138, 153, 78, 82, 191, 37, 93, 230, 0, 53, 116, 46, 199, 27, 208, 138, 194, 117, 161, 181, 28, 198, 191, 227, 50, 176, 239, 132, 180, 9, 16, 140, 218, 8, 14, 98, 105, 237, 75, 62, 44,
63, 125, 114, 42, 164, 205, 201, 141, 22, 222, 181, 84, 229, 98, 123, 232, 249, 85, 201, 142, 29, 95, 149, 101, 169, 25, 76, 173, 12, 66, 133, 249, 55, 0, 6, 45, 149, 149, 173, 185, 146, 174,
 104, 255, 18, 128, 10, 182, 122, 254, 165, 22, 242, 33, 207, 125, 31, 132, 52, 252, 54, 216, 246, 251, 223, 56, 212, 69, 196, 77, 150, 186, 59, 177, 212, 162, 226, 174, 154, 83, 253, 70, 211
, 147, 7, 98, 138, 71, 248, 144, 202, 182, 172, 51, 58, 173, 72, 192, 193, 30, 223, 105, 226, 166, 67, 126, 138, 191, 66, 211, 53, 50, 126, 238, 179, 41, 187, 201, 195, 161, 107, 6, 24, 166,
247, 14, 212, 235, 110, 138, 61, 0, 248, 0, 24, 73, 44, 251, 147, 204, 202, 10, 50, 169, 180, 99, 18, 94, 233, 10, 210, 226, 198, 207, 165, 179, 253, 156, 16, 150, 162, 182, 54, 66, 154, 131,
 184, 93, 73, 255, 5, 70, 77, 226, 219, 107, 135, 120, 95, 57, 121, 141, 234, 236, 172, 95, 213, 238, 134, 184, 88, 16, 196, 57, 80, 198, 208, 127, 128, 102, 100, 15, 193, 248, 115, 219, 239,
 35, 251, 43, 155, 98, 74, 54, 15, 70, 4, 2, 232, 103, 41]`', src/test.rs:374:13

@oconnor663 oconnor663 reopened this Sep 29, 2020
@pascal-cuoq
Copy link
Author

pascal-cuoq commented Sep 29, 2020

@oconnor663 I am impressed that cross allows you to reproduce this problem that lies in the C code (when executed on a big-endian machine). One possible next step is to check if the for loop that I proposed (in comment #118 (comment) ) fixes it for you for 1025, but it may not be the right fix since it does not fix it for other plaintext sizes.

@oconnor663
Copy link
Member

@pascal-cuoq I had the exact same experience you did. Fixing the bug at 1025 exposed another bug at 2048. I've pushed a branch that fixes both: https://github.com/BLAKE3-team/BLAKE3/tree/fix_big_endian. If you get a chance, could you see if that commit works for you?

@oconnor663
Copy link
Member

And thanks a million for catching this issue! I will definitely get our C code under cross in CI as part of fixing this.

@oconnor663
Copy link
Member

oconnor663 commented Sep 29, 2020

For some context as to why these failures appeared at the lengths where they did:

  • output_chaining_value computes an internal chaining value from a chunk. This isn't called until the input is more than one chunk long, which is why we see a failure at 1025 bytes. Prior to that, only the output_root_bytes function is called, which wasn't buggy.
  • hash_one_portable is what underlies the portable branch of blake3_hash_many. That's the abstraction that allows really fast SIMD implementations to hash multiple entire chunks at once. The first time it's called is when you have two full chunks, which is 2048 bytes.

@pascal-cuoq
Copy link
Author

I will accept the thanks on behalf of @jakub-zwolakowski who did all the work. TrustInSoft CI has now checked that your branch https://github.com/BLAKE3-team/BLAKE3/tree/fix_big_endian makes the 32-bit and 64-bit big-endian results of the C implementation consistent with the little-endian results for the following input sizes:

-rw-r--r--  1 pascal  staff       0 Sep 29 11:47 tis/test_vectors/01_input.bin
-rw-r--r--  1 pascal  staff       1 Sep 29 11:47 tis/test_vectors/02_input.bin
-rw-r--r--  1 pascal  staff    1023 Sep 29 11:47 tis/test_vectors/03_input.bin
-rw-r--r--  1 pascal  staff    1024 Sep 29 11:47 tis/test_vectors/04_input.bin
-rw-r--r--  1 pascal  staff    1025 Sep 29 11:47 tis/test_vectors/05_input.bin
-rw-r--r--  1 pascal  staff    2048 Sep 29 11:47 tis/test_vectors/06_input.bin
-rw-r--r--  1 pascal  staff    2049 Sep 29 11:47 tis/test_vectors/07_input.bin
-rw-r--r--  1 pascal  staff    3072 Sep 29 11:47 tis/test_vectors/08_input.bin
-rw-r--r--  1 pascal  staff    3073 Sep 29 11:47 tis/test_vectors/09_input.bin
-rw-r--r--  1 pascal  staff    4096 Sep 29 11:47 tis/test_vectors/10_input.bin
-rw-r--r--  1 pascal  staff    4097 Sep 29 11:47 tis/test_vectors/11_input.bin
-rw-r--r--  1 pascal  staff    5120 Sep 29 11:47 tis/test_vectors/12_input.bin
-rw-r--r--  1 pascal  staff    5121 Sep 29 11:47 tis/test_vectors/13_input.bin
-rw-r--r--  1 pascal  staff    6144 Sep 29 11:47 tis/test_vectors/14_input.bin
-rw-r--r--  1 pascal  staff    6145 Sep 29 11:47 tis/test_vectors/15_input.bin
-rw-r--r--  1 pascal  staff    7168 Sep 29 11:47 tis/test_vectors/16_input.bin
-rw-r--r--  1 pascal  staff    7169 Sep 29 11:47 tis/test_vectors/17_input.bin
-rw-r--r--  1 pascal  staff    8192 Sep 29 11:47 tis/test_vectors/18_input.bin
-rw-r--r--  1 pascal  staff    8193 Sep 29 11:47 tis/test_vectors/19_input.bin
-rw-r--r--  1 pascal  staff   16384 Sep 29 11:47 tis/test_vectors/20_input.bin

After that branch of yours is merged, @jakub-zwolakowski will submit a PR to show what the configuration files for TrustInSoft CI looks like for these four architectures. It is complementary with cross, and we hope that you will find it useful: it can only check the pure C implementation (with the implementation-defined parameters of GCC targeting x86/x86-64/ppc32/ppc64) but it detects all the C Undefined Behaviors that ASan/UBSan/MSan detect and more.

oconnor663 pushed a commit that referenced this pull request Sep 29, 2020
Kudos to @pascal-cuoq and @jakub-zwolakowski from TrustInSoft for
catching these bugs.

Original report: #118
@oconnor663
Copy link
Member

I just put up #119 with those fixes and expanded CI coverage for this case. I'll merge it as soon as CI passes.

@oconnor663
Copy link
Member

#119 is now merged.

oconnor663 pushed a commit that referenced this pull request Oct 1, 2020
Changes since 0.3.6:
- BUGFIX: The C implementation was incorrect on big endian systems for
  inputs longer than 1024 bytes. This bug affected all previous versions
  of the C implementation. Little endian platforms like x86 were
  unaffected. The Rust implementation was also unaffected.
  @jakub-zwolakowski and @pascal-cuoq from TrustInSoft reported this
  bug: #118
- BUGFIX: The C build on x86-64 was producing binaries with an
  executable stack. @tristanheaven reported this bug:
  #109
- @mkrupcale added optimized implementations for SSE2. This improves
  performance on older x86 processors that don't support SSE4.1.
- The C implementation now exposes the
  `blake3_hasher_init_derive_key_raw` function, to make it easier to
  implement language bindings. Added by @k0001.
kevingoh pushed a commit to ITS-AT-dev/BLAKE3 that referenced this pull request Oct 23, 2023
Kudos to @pascal-cuoq and @jakub-zwolakowski from TrustInSoft for
catching these bugs.

Original report: BLAKE3-team#118
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants