Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829

Closed
warner opened this issue May 24, 2023 · 2 comments
Closed

divergent XS heap snapshots on gcc-9 (i.e. Ubuntu-20.04-LTS) #7829

warner opened this issue May 24, 2023 · 2 comments
Assignees
Labels
bug Something isn't working xsnap the XS execution tool

Comments

@warner
Copy link
Member

warner commented May 24, 2023

Today's Emerynet upgrade exposed a behavior divergence: some validators (call them Team Number) produced a different AppHash for the bootstrap block than the rest (call those ones Team Integer). All of the validators that we (@arirubinstein) ran were on Team Integer, however Team Number had a 2/3rds majority. The first block stalled a lot, as the two teams refused to vote for each other's blocks, and several rounds of voting (and timeouts) had to occur before a Team Number validator became the proposer. (Our debugging was hindered by the fact that we were using a dashboard hosted by a Team Integer node, which did not accept the ultimately-winning Team Number blocks).

We extracted the genesis export from members of both teams, and compared their contents, revealing differences in the transcript spans and XS heap snapshots for a particular vat (v42, the vault manager I think). We then extracted the two heap snapshots (made as delivery 4, one from each team) through packages/SwingSet/misc-tools/ingest-xs-snapshot.js to parse them into a little DB, and compared the slots table of each. We found exactly one divergence:

┌────────┬───────────┬──────┬──────┬────────┬───────────────────────┬────────────────────────────┬─────────────────────────┐
│ slotID │   kind    │ flag │  id  │  next  │      fieldsJSON       │       slotFieldsJSON       │     chunkFieldsJSON     │
├────────┼───────────┼──────┼──────┼────────┼───────────────────────┼────────────────────────────┼─────────────────────────┤
│ 107112 │ INTEGER   │ 8    │ 0    │ 0      │ {"integer":100000000} │ {}                         │ {}                      │

on a Team Integer validator, vs:

│ 107112 │ NUMBER    │ 8    │ 0    │ 0      │ {"number":100000000} │ {}                         │ {}                      │

on a Team Number validator.

That means in one cohort, one particular slot was of type XS_NUMBER_KIND (a JS Number, aka a float), but in the other cohort it was of type XS_INTEGER_ID. This is an optimization of Number that only holds 32-bit integers (note: not a BigInt, just something in the integer subset of Number).

One place that performs such an optimization (although probably not the one causing the problem) is in xsJSON.c / fxParseJSONToken where it's parsing numbers:

			c_memcpy(the->nameBuffer, s, size);
			the->nameBuffer[size] = 0;
			theParser->number = fxStringToNumber(the->dtoa, the->nameBuffer, 0);
			theParser->integer = (txInteger)theParser->number;
			number = theParser->integer;
			if ((theParser->number == number) && (theParser->number != -0))
				theParser->token = XS_JSON_TOKEN_INTEGER;
			else
				theParser->token = XS_JSON_TOKEN_NUMBER;

With Moddable's help, we eventually concluded that it was most likely to be in xsRun.c, where basic XS bytecodes like "multiply two things" do:

		mxCase(XS_CODE_MULTIPLY)
			slot = mxStack + 1;
			if (slot->kind == XS_INTEGER_KIND) {
				if (mxStack->kind == XS_INTEGER_KIND) {
 ..
					#if __has_builtin(__builtin_mul_overflow)
						if (__builtin_mul_overflow(slot->value.integer, mxStack->value.integer, &scratch.value.integer)) {
							slot->kind = XS_NUMBER_KIND;
							slot->value.number = (txNumber)(slot->value.integer) * (txNumber)(mxStack->value.integer);
						}
						else
							slot->value.integer = scratch.value.integer;
					#else
						slot->kind = XS_NUMBER_KIND;
						slot->value.number = (txNumber)(slot->value.integer) * (txNumber)(mxStack->value.integer);
					#endif
 ..

If that code is multiplying two INTEGERs, the result may or may not fit into a 32-bit integer. If it does, this will produce a slot with XS_INTEGER_KIND (which is faster), but only if it can use __builtin_add_overflow to efficiently tell whether it overflows or not. If it does not have __builtin_add_overflow, then it unconditionally produces XS_NUMBER_KIND. And it needs __has_builtin() to sense whether __builtin_add_overflow is available or not.

We eventually concluded that Team Integer validators compiled their xsnap with a compiler that offers __has_builtin() (specifically gcc-10 or newer), whereas Team Number used an older compiler that did not (gcc-9). Ubuntu 20.04 LTS offers both, but gcc-9 is the default, and it is not trivial to use gcc-10 for the xsnap compile. Ubuntu 22.04 LTS uses gcc-11 as the default.

Note that gcc-9 provides __builtin_mul_overflow and the rest, it's only missing the nice portable way to inquire about their presence: __has_builtin().

The difference between XS_NUMBER_KIND and XS_INTEGER_KIND is not visible to JavaScript: both have typeof x === 'number' and they behave identically. However the heap snapshot records the slot Kind, so the heap snapshots were different. And because we now include heap snapshot hashes in consensus, Team Number and Team Integer did not approve of each other's blocks.

We got lucky that the divergence occurred in the first four deliveries of some vat, because that allowed us to see the difference in the heap snapshots right away. If the divergence happened in delivery 5, we wouldn't have seen the next heap snapshot for another 2000 deliveries.

The Fix

We're going to require that the code be compiled with a compiler that offers all the builtins needed to use these optimizations: Team Integer is our preferred winner.

To avoid requiring validators to upgrade their Ubuntu 20.04 LTS platforms to one with gcc-10 (after all 20.04 LTS is still supported, and is only three years old), we're doing to do this by #defineing the __has_builtin() into something that always returns true. @michaelfig came up with a patch that only affects agoric-sdk, and avoids the need to modify both our XS fork and patch agoric-sdk to use it:

diff --git a/packages/xsnap/src/build.js b/packages/xsnap/src/build.js
index fbb4bbb67..ed053405f 100644
--- a/packages/xsnap/src/build.js
+++ b/packages/xsnap/src/build.js
@@ -235,6 +235,7 @@ async function main(args, { env, stdout, spawn, fs, os }) {
         `MODDABLE=${ModdableSDK.MODDABLE}`,
         `GOAL=${goal}`,
         `XSNAP_VERSION=${pkg.version}`,
+        `CC=cc "-D__has_builtin(x)=1" -Wno-builtin-macro-redefined`,
         '-f',
         'xsnap-worker.mk',
       ],

With this in place, XS will always attempt to use __builtin_mul_overflow and it's add/sub friends, which will work on gcc-9 (and maybe even older versions). If someone attempts to compile it on a really old platform that lacks them, they'll get a linker error. But all validators should be doing the same thing.

We also intend to add a unit test (later) that performs a multiply of two small numbers and creates a heap snapshot, comparing it against a "golden" value created on a new platform (and manually verified to be Team Integer). This test would then fail if, for whatever reason, XS chose to use the XS_NUMBER_KIND variant.

@warner warner added bug Something isn't working xsnap the XS execution tool labels May 24, 2023
@warner warner self-assigned this May 24, 2023
@ivanlei ivanlei added this to the Vaults Go Live milestone May 24, 2023
@warner
Copy link
Member Author

warner commented May 24, 2023

Correction: we've already started the communcation about requiring a newer GCC. So instead, we'll be changing xsnap-worker.c to insist that _has_builtin() is available, with something like:

#ifndef __has_builtin
#error "xsnap requires __has_builtin, so GCC>=10, e.g. Ubuntu-22.04 or newer"
#endif

This will require two PRs: the first to update https://github.com/agoric-labs/xsnap-pub to include the #error, and the second to update agoric-sdk packages/xsnap/xsnap-native to use it. Plus changes to the docs to describe the new requirement.

warner added a commit to agoric-labs/xsnap-pub that referenced this issue May 24, 2023
XS emits different heap snapshots when compiled under gcc-9 (missing
__has_builtin) and gcc-10 or later (has __has_builtin). To avoid the
consensus problems that result, require __has_builtin, and thus a
newer compiler.

refs Agoric/agoric-sdk#7829
warner added a commit that referenced this issue May 24, 2023
This switches to a version of `xsnap-pub` that adds a compile-time
assertion of the availability of `__has_builtin()`. This function was
added to gcc-10, and was missing in gcc-9. XS can produce different
heap snapshots depending upon whether this function is available or
not (breaking consensus, which includes the heap snapshot hashes). By
mandating its presence, we ensure that all instances will behave the
same way, maintaining consensus.

closes #7829
warner added a commit that referenced this issue May 24, 2023
This switches to a version of `xsnap-pub` that adds a compile-time
assertion of the availability of `__has_builtin()`. This function was
added to gcc-10, and was missing in gcc-9. XS can produce different
heap snapshots depending upon whether this function is available or
not (breaking consensus, which includes the heap snapshot hashes). By
mandating its presence, we ensure that all instances will behave the
same way, maintaining consensus.

closes #7829
warner added a commit that referenced this issue May 24, 2023
This switches to a version of `xsnap-pub` that adds a compile-time
assertion of the availability of `__has_builtin()`. This function was
added to gcc-10, and was missing in gcc-9. XS can produce different
heap snapshots depending upon whether this function is available or
not (breaking consensus, which includes the heap snapshot hashes). By
mandating its presence, we ensure that all instances will behave the
same way, maintaining consensus.

closes #7829
@warner
Copy link
Member Author

warner commented May 24, 2023

Both PRs have landed, so I'm closing this one out.

We chose to have xsnap to require that __has_builtin() is available, which was the more conservative approach in the moment, but it still leaves us open to valid compiler behavior creating consensus problems. It only helped us in this case because all the compilers we know about provide all the actual builtins, like __builtin_mul_overflow, and the specific problem was that XS couldn't (portably) determine whether or not the builtin was available.

With the approach we used, a compiler which has a__has_builtin() that returns false for __builtin_mul_overflow will cause XS to use the non-optimized path, and create an XS_NUMBER_KIND, where a more capable environment would create XS_INTEGER_KIND, causing divergence. If XS acquires more compile-time optimization options, these will be other opportunities for platform-triggered divergence.

@michaelfig and I developed a different approach, in which we edit the build process to use -D to hardwire __has_builtin(x) with a version that always returns true regardless of x. This will cause a linker error on a platform that lacks all of the builtins XS wants to use, which might be harder to diagnose, but would remove this degree of freedom and reduce the chances for divergence correspondingly. We didn't have the time to think through (or audit for) the consequences of this change, so we went with the other approach.

I think, in the vaults+1 timeframe, we should switch to the -D/always-true approach, as I think it'll be safer in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working xsnap the XS execution tool
Projects
None yet
Development

No branches or pull requests

2 participants