divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829

warner · 2023-05-24T00:15:55Z

Today's Emerynet upgrade exposed a behavior divergence: some validators (call them Team Number) produced a different AppHash for the bootstrap block than the rest (call those ones Team Integer). All of the validators that we (@arirubinstein) ran were on Team Integer, however Team Number had a 2/3rds majority. The first block stalled a lot, as the two teams refused to vote for each other's blocks, and several rounds of voting (and timeouts) had to occur before a Team Number validator became the proposer. (Our debugging was hindered by the fact that we were using a dashboard hosted by a Team Integer node, which did not accept the ultimately-winning Team Number blocks).

We extracted the genesis export from members of both teams, and compared their contents, revealing differences in the transcript spans and XS heap snapshots for a particular vat (v42, the vault manager I think). We then extracted the two heap snapshots (made as delivery 4, one from each team) through packages/SwingSet/misc-tools/ingest-xs-snapshot.js to parse them into a little DB, and compared the slots table of each. We found exactly one divergence:

┌────────┬───────────┬──────┬──────┬────────┬───────────────────────┬────────────────────────────┬─────────────────────────┐
│ slotID │   kind    │ flag │  id  │  next  │      fieldsJSON       │       slotFieldsJSON       │     chunkFieldsJSON     │
├────────┼───────────┼──────┼──────┼────────┼───────────────────────┼────────────────────────────┼─────────────────────────┤
│ 107112 │ INTEGER   │ 8    │ 0    │ 0      │ {"integer":100000000} │ {}                         │ {}                      │

on a Team Integer validator, vs:

│ 107112 │ NUMBER    │ 8    │ 0    │ 0      │ {"number":100000000} │ {}                         │ {}                      │

on a Team Number validator.

That means in one cohort, one particular slot was of type XS_NUMBER_KIND (a JS Number, aka a float), but in the other cohort it was of type XS_INTEGER_ID. This is an optimization of Number that only holds 32-bit integers (note: not a BigInt, just something in the integer subset of Number).

One place that performs such an optimization (although probably not the one causing the problem) is in xsJSON.c / fxParseJSONToken where it's parsing numbers:

			c_memcpy(the->nameBuffer, s, size);
			the->nameBuffer[size] = 0;
			theParser->number = fxStringToNumber(the->dtoa, the->nameBuffer, 0);
			theParser->integer = (txInteger)theParser->number;
			number = theParser->integer;
			if ((theParser->number == number) && (theParser->number != -0))
				theParser->token = XS_JSON_TOKEN_INTEGER;
			else
				theParser->token = XS_JSON_TOKEN_NUMBER;

With Moddable's help, we eventually concluded that it was most likely to be in xsRun.c, where basic XS bytecodes like "multiply two things" do:

		mxCase(XS_CODE_MULTIPLY)
			slot = mxStack + 1;
			if (slot->kind == XS_INTEGER_KIND) {
				if (mxStack->kind == XS_INTEGER_KIND) {
 ..
					#if __has_builtin(__builtin_mul_overflow)
						if (__builtin_mul_overflow(slot->value.integer, mxStack->value.integer, &scratch.value.integer)) {
							slot->kind = XS_NUMBER_KIND;
							slot->value.number = (txNumber)(slot->value.integer) * (txNumber)(mxStack->value.integer);
						}
						else
							slot->value.integer = scratch.value.integer;
					#else
						slot->kind = XS_NUMBER_KIND;
						slot->value.number = (txNumber)(slot->value.integer) * (txNumber)(mxStack->value.integer);
					#endif
 ..

If that code is multiplying two INTEGERs, the result may or may not fit into a 32-bit integer. If it does, this will produce a slot with XS_INTEGER_KIND (which is faster), but only if it can use __builtin_add_overflow to efficiently tell whether it overflows or not. If it does not have __builtin_add_overflow, then it unconditionally produces XS_NUMBER_KIND. And it needs __has_builtin() to sense whether __builtin_add_overflow is available or not.

We eventually concluded that Team Integer validators compiled their xsnap with a compiler that offers __has_builtin() (specifically gcc-10 or newer), whereas Team Number used an older compiler that did not (gcc-9). Ubuntu 20.04 LTS offers both, but gcc-9 is the default, and it is not trivial to use gcc-10 for the xsnap compile. Ubuntu 22.04 LTS uses gcc-11 as the default.

Note that gcc-9 provides __builtin_mul_overflow and the rest, it's only missing the nice portable way to inquire about their presence: __has_builtin().

The difference between XS_NUMBER_KIND and XS_INTEGER_KIND is not visible to JavaScript: both have typeof x === 'number' and they behave identically. However the heap snapshot records the slot Kind, so the heap snapshots were different. And because we now include heap snapshot hashes in consensus, Team Number and Team Integer did not approve of each other's blocks.

We got lucky that the divergence occurred in the first four deliveries of some vat, because that allowed us to see the difference in the heap snapshots right away. If the divergence happened in delivery 5, we wouldn't have seen the next heap snapshot for another 2000 deliveries.

The Fix

We're going to require that the code be compiled with a compiler that offers all the builtins needed to use these optimizations: Team Integer is our preferred winner.

To avoid requiring validators to upgrade their Ubuntu 20.04 LTS platforms to one with gcc-10 (after all 20.04 LTS is still supported, and is only three years old), we're doing to do this by #defineing the __has_builtin() into something that always returns true. @michaelfig came up with a patch that only affects agoric-sdk, and avoids the need to modify both our XS fork and patch agoric-sdk to use it:

diff --git a/packages/xsnap/src/build.js b/packages/xsnap/src/build.js
index fbb4bbb67..ed053405f 100644
--- a/packages/xsnap/src/build.js
+++ b/packages/xsnap/src/build.js
@@ -235,6 +235,7 @@ async function main(args, { env, stdout, spawn, fs, os }) {
         `MODDABLE=${ModdableSDK.MODDABLE}`,
         `GOAL=${goal}`,
         `XSNAP_VERSION=${pkg.version}`,
+        `CC=cc "-D__has_builtin(x)=1" -Wno-builtin-macro-redefined`,
         '-f',
         'xsnap-worker.mk',
       ],

With this in place, XS will always attempt to use __builtin_mul_overflow and it's add/sub friends, which will work on gcc-9 (and maybe even older versions). If someone attempts to compile it on a really old platform that lacks them, they'll get a linker error. But all validators should be doing the same thing.

We also intend to add a unit test (later) that performs a multiply of two small numbers and creates a heap snapshot, comparing it against a "golden" value created on a new platform (and manually verified to be Team Integer). This test would then fail if, for whatever reason, XS chose to use the XS_NUMBER_KIND variant.

The text was updated successfully, but these errors were encountered:

warner · 2023-05-24T00:39:53Z

Correction: we've already started the communcation about requiring a newer GCC. So instead, we'll be changing xsnap-worker.c to insist that _has_builtin() is available, with something like:

#ifndef __has_builtin
#error "xsnap requires __has_builtin, so GCC>=10, e.g. Ubuntu-22.04 or newer"
#endif

This will require two PRs: the first to update https://github.com/agoric-labs/xsnap-pub to include the #error, and the second to update agoric-sdk packages/xsnap/xsnap-native to use it. Plus changes to the docs to describe the new requirement.

XS emits different heap snapshots when compiled under gcc-9 (missing __has_builtin) and gcc-10 or later (has __has_builtin). To avoid the consensus problems that result, require __has_builtin, and thus a newer compiler. refs Agoric/agoric-sdk#7829

This switches to a version of `xsnap-pub` that adds a compile-time assertion of the availability of `__has_builtin()`. This function was added to gcc-10, and was missing in gcc-9. XS can produce different heap snapshots depending upon whether this function is available or not (breaking consensus, which includes the heap snapshot hashes). By mandating its presence, we ensure that all instances will behave the same way, maintaining consensus. closes #7829

warner · 2023-05-24T15:48:41Z

Both PRs have landed, so I'm closing this one out.

We chose to have xsnap to require that __has_builtin() is available, which was the more conservative approach in the moment, but it still leaves us open to valid compiler behavior creating consensus problems. It only helped us in this case because all the compilers we know about provide all the actual builtins, like __builtin_mul_overflow, and the specific problem was that XS couldn't (portably) determine whether or not the builtin was available.

With the approach we used, a compiler which has a__has_builtin() that returns false for __builtin_mul_overflow will cause XS to use the non-optimized path, and create an XS_NUMBER_KIND, where a more capable environment would create XS_INTEGER_KIND, causing divergence. If XS acquires more compile-time optimization options, these will be other opportunities for platform-triggered divergence.

@michaelfig and I developed a different approach, in which we edit the build process to use -D to hardwire __has_builtin(x) with a version that always returns true regardless of x. This will cause a linker error on a platform that lacks all of the builtins XS wants to use, which might be harder to diagnose, but would remove this degree of freedom and reduce the chances for divergence correspondingly. We didn't have the time to think through (or audit for) the consequences of this change, so we went with the other approach.

I think, in the vaults+1 timeframe, we should switch to the -D/always-true approach, as I think it'll be safer in the future.

warner added bug Something isn't working xsnap the XS execution tool labels May 24, 2023

warner self-assigned this May 24, 2023

ivanlei added this to the Vaults Go Live milestone May 24, 2023

warner mentioned this issue May 24, 2023

require __has_builtin, from gcc-10 or newer agoric-labs/xsnap-pub#40

Merged

warner mentioned this issue May 24, 2023

fix(xsnap): use newer xsnap-pub, with requirement for __has_builtin #7830

Merged

warner mentioned this issue May 24, 2023

change xsnap to require all compiler builtins #7834

Open

warner closed this as completed May 24, 2023

michaelfig mentioned this issue May 25, 2023

Prevent __has_builtin compiler differences from causing divergence #7836

Merged

mhofman mentioned this issue May 26, 2023

Acceptance test for new Moddable XS versions #6929

Open

mhofman mentioned this issue Oct 24, 2023

Prevent validator from using mismatched agoric-sdk version #8471

Closed

gibson042 mentioned this issue Jun 28, 2024

Reported app hash on devnet for agoric-upgrade-16 #9602

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829

divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829

warner commented May 24, 2023

warner commented May 24, 2023

warner commented May 24, 2023

divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829

divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829

Comments

warner commented May 24, 2023

The Fix

warner commented May 24, 2023

warner commented May 24, 2023