Skip to content

Deplatforming #7049

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 36 commits into
base: main
Choose a base branch
from
Draft

Deplatforming #7049

wants to merge 36 commits into from

Conversation

eddyashton
Copy link
Member

@eddyashton eddyashton commented Jun 11, 2025

The build-time distinction between SNP and Virtual is unhelpful, results in a huge amount of duplication in our CMake. This was necessary on SGX, where the targets had significantly different build options, but is no longer doing anything.

This PR removes the COMPILE_TARGET option from CMake. This has a bunch of knock-on effects, that I think fall into a few distinct camps:

  1. De-duplicated targets. We no longer have ccf_kv.host and ccf_kv.snp, we just have ccf_kv. I've manually tried to verify in each instance that the calls to each target were always identical, but I may have missed some.
  2. Removal of PLATFORM_ preprocessor define. Most places were #if PLATFORM_VIRTUAL || PLATFORM_SNP, so the #if/#endif pair can be trivially removed. Deciding what kind of quote we produce is now a run-time decision (each binary is capable of trying to produce either format), based on the existing enclave.platform config argument. I think that ought to be promoted to a CLI argument (security critical), but that's deferred.
  3. Test clarity. Functions that check whether an ioctl device is available (in C++ and Python) are now named to show that they're testing for SNP_SUPPORT. The decision of whether this run (of an e2e test or a node) should try to get a virtual or SNP attesation is driven by configuration, not by this capability detection.
  4. Simpler library names. We still (for now) produce a single cchost and then a .so for each "enclave" application. But those applications are now libjs_generic.so, rather than libjs_generic.virtual.so/libjs_generic.snp.so.
  5. Gluing those together it turns out it is still convenient to have a global "default test platform" in CMake, so that we can indicate whether we think we're testing virtual or SNP. That looks a bit like COMPILE_TARGET, but it's now called CCF_TEST_PLATFORM, and hopefully used in few enough places to clarify intent. Perhaps the default value should be derived from capability detection? TBD.

I've only run a handful of builds and tests locally, this is likely missing some obvious breakages, so opening as a Draft initially and awaiting CI coverage.

Also, I think this is a big enough breaking change that it shouldn't affect 6.0, so I propose a release/6.x branch.

Follow-ups:

  • Remove enclave section in configuration #6623
  • De-duplicate builds in CI: build once and copy to test runners
  • Tear out more enclave/platform distinctions in the test infra. I'm always wary of how hard we can go on this while maintaining some lts_ story, but I think right now we can be pretty aggressive? But let's do it separately to derisk.

endif()

add_library(qcbor.host STATIC ${QCBOR_SRCS})
add_library(qcbor STATIC ${QCBOR_SRCS})
Copy link
Member Author

@eddyashton eddyashton Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub's diff view doesn't make this very obvious, but really what we had for almost all of our libraries was this:

if(SNP)
  add_library(XXX.snp)
  options(XXX.snp OPTIONS MORE_OPTIONS)
  also_options(XXX.snp MOOOOORE)
  if(CCF_DEVEL)
    install(XXX.snp)
  endif()
endif()

add_library(XXX.host)
options(XXX.host OPTIONS MORE_OPTIONS)
also_options(XXX.host MOOOOORE)
if(INSTALL_VIRTUAL_LIBRARIES)
  install(XXX.host)
endif()

And now we just have one library, called XXX, built unconditionally.

add_library(XXX)
options(XXX OPTIONS MORE_OPTIONS)
also_options(XXX MOOOOORE)
if(CCF_DEVEL)
  install(XXX)
endif()

I think that in all cases the options and more_options and etc etc calls were providing all of the same arguments, but I may have missed some. So a risk of this big change is that we miss a subtle option from one of these. Feel free to scan a few, and let's see what the tests say.

I think the difference between INSTALL_VIRTUAL_LIBRARIES and CCF_DEVEL was basically a bug, that we never noticed because of the options we passed in practice? Regardless, the consolidation is to a single library that checks CCF_DEVEL to decide if it should be installed - there are no virtual-specific libraries, and no more INSTALL_VIRTUAL_LIBRARIES.

@@ -46,6 +46,7 @@ extern "C"
size_t* enclave_version_len,
StartType start_type,
ccf::LoggerLevel enclave_log_level,
ccf::pal::Platform platform,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the .so no longer knows what platform it is at build-time, we need to pass this through from the host. Ugly to add more args here, but we'll nuke the whole thing shortly.

Comment on lines +85 to +90
if (!ccf::pal::snp::supports_sev_snp())
{
std::cout << "Skipping all tests as this is not running in SEV-SNP"
<< std::endl;
return 0;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously disabled at build-time somewhere in CMake. Now built and run all the time, and a run-time capability probe determines whether it's actually testing anything.

Comment on lines -16 to -23
# Install tree
PLATFORM_FILE="${PATH_HERE}"/../share/PLATFORM
# If not there, try locally
if [ ! -f "${VERSION_FILE}" ]; then
PLATFORM_FILE=PLATFORM
fi
platform=$(<"${PLATFORM_FILE}")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When platform was a build-time decision, we created a file named PLATFORM and installed it, so that that sandbox.sh could automatically infer what type of binaries it was surrounded by. This doesn't exist anymore - it's surrounded by platform-agnostic binaries. --enclave-platform can still be passed explicitly to sandbox.sh/start_network.py/similar wrappers, and is inserted for all of our e2e tests, but doesn't come from file checks.

CMakeLists.txt Outdated
Comment on lines 7 to 8
set(CCF_TEST_PLATFORM
"virtual"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I argue that although this diff is small, this is fundamentally a new arg, so I'm not changing a default. We're not making things less secure (the default build still produces SNP-capable binaries, that default to using SNP logic), but our tests default to using the "virtual" attestations unless told otherwise. As a result, we don't have to pass -DCOMPILE_TARGET=virtual (or -DCCF_TEST_PLATFORM=virtual) in all of our CI yamls and whenever we're telling people how to build, the default is the majority-option and the few things that want to explicitly try SNP-networks can do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests should detect the platform the same way the binaries do, and enforce their opinion of the platform when they act as the consortium that agrees to open the network. Telling them to do otherwise should require an override. Let's discuss.

@@ -191,7 +190,7 @@ def create_connections_until_exhaustion(
client.post(
"/log/private",
{"id": 42, "msg": "foo"},
timeout=3 if IS_SNP else 1,
timeout=3 if args.enclave_platform == "snp" else 1,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if we still need SNP-specific timeouts? But plausible, so leaving them for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, let's check

@@ -540,7 +539,7 @@ def _setup_common_folder(self, constitution):
), f"Could not copy governance {fragment} to {self.common_dir}"
# It is more convenient to create a symlink in the common directory than generate
# certs and keys in the top directory and move them across
cmd = ["cp"] if IS_SNP else ["ln", "-s"]
cmd = ["ln", "-s"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a punt here, assuming it's now safe to symlink everywhere? If I'm missing something, we can restore this branch based on SNP-support, or something more precise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@achamayou
Copy link
Member

achamayou commented Jun 11, 2025

Deciding what kind of quote we produce is now a run-time decision (each binary is capable of trying to produce either format), based on the existing enclave.platform config argument. I think that ought to be promoted to a CLI argument (security critical), but that's deferred.

It seems to me that in 99% of cases, this can be automatically decided without security implications, because either the join policy, or the consortium-issued transition service to open decide whether the attestation is acceptable.

Having a way to force virtual is likely useful, either through an env var or through the configuration, but is not security critical for the reason above. Whether that's good usability is arguable.

Gluing those together it turns out it is still convenient to have a global "default test platform" in CMake, so that we can indicate whether we think we're testing virtual or SNP. That looks a bit like COMPILE_TARGET, but it's now called CCF_TEST_PLATFORM, and hopefully used in few enough places to clarify intent. Perhaps the default value should be derived from capability detection? TBD.

Because we (still) use ctest as a launcher, I agree that this is good, but that's something we could decide to revisit, as the decrease in compile-time parameters also diminishes the value of that layer.

@achamayou
Copy link
Member

Also, I think this is a big enough breaking change that it shouldn't affect 6.0, so I propose a release/6.x branch.

100%

@eddyashton
Copy link
Member Author

It seems to me that in 99% of cases, this can be automatically decided without security implications, because either the join policy, or the consortium-issued transition service to open decide whether the attestation is acceptable.

Having a way to force virtual is likely useful, either through an env var or through the configuration, but is not security critical for the reason above. Whether that's good usability is arguable.

I'm coming round to the idea of automatic platform detection, let's discuss tomorrow. Just listing a few discussion points:

  • Can we still force the platform to emit a certain kind of attestation when we want to test that?
  • Do we silently lose test coverage when the automated capability detection fails/is flaky?
  • Do we promote visibility of the chosen platform somewhere? It's in the attestation if we get that far, and I guess we argue that behaviour up-to that point is identical? So we log as soon as we branch, and store it in the KV ASAP.

@achamayou
Copy link
Member

Can we still force the platform to emit a certain kind of attestation when we want to test that?

I assume you mean a virtual attestation? I would say yes, not because I can think of a reason why it might be useful, but because it seems really easy to do, and could come in handy when isolating failures.

Do we silently lose test coverage when the automated capability detection fails/is flaky?

So long as the tests/test infra enforce a specific platform is being tested, I don't think so? If we are worried that the test auto-detection can fail silently too, we could push the number of tests run to bencher, and set up a threshold alert?
Having said that ./tests --platform=SNP is pretty simple, and not subject to this failure mode at all. It may well be the right thing to do.

Do we promote visibility of the chosen platform somewhere? It's in the attestation if we get that far, and I guess we argue that behaviour up-to that point is identical? So we log as soon as we branch, and store it in the KV ASAP.

The attestation gets stored in Tx0. I think we want to tighten the initial consortium check to only open if we're happy with the platform, and we should be good from there.

@eddyashton eddyashton added the run-long-test Run Long Test job label Jun 13, 2025
@eddyashton
Copy link
Member Author

I'm currently convinced that auto-detection is fine, and we don't need any explicit/tested overrides. In my head:

This means a CCF node will look for SNP IOCTL devices, and produce an SNP attestation if it finds them (at this point requiring some additional SNP endorsement server config), else it'll fall back and produce a virtual attestation. The Python infra will do the same check independently, to decide whether it is expecting SNP attestations (with associated tests), and to confirm the attestation format before submitting service open. CMake needs no knowledge of what platforms the tests are running on/targeting, so this can go entirely (I think - maybe something about test labels that need to contain platform info, but I think that can be inserted later). Then we convince ourselves that some of the tests are actually running on SNP, by writing a standalone "confirm_this_is_snp" script/util, and calling that early in the appropriate CI jobs.

Possible future things we can support in future:

  • Running SNP nodes from non-SNP host-infra. We're moving away from this in the current infra, since it's expensive to maintain and unused. The story here would be simple - the detection needs to run on the target, not the host.
  • A manual override for getting a virtual attestation despite SNP being available. I've wanted to do this once before, but very manually. Tempted to bury a magic env var that provides this, and exclude it from any automated testing? Plumbing the same override through the Python infra would be expensive and unnecessary, but is also easier to patch in manually later.
  • Running cross-platform upgrade tests. This would need either of the above options - infra launches remote nodes, or is able to override what some local nodes do. Not currently tested, can add when it's needed.

@eddyashton
Copy link
Member Author

I think I've decided this is good-to-go, but the lts compat test is going to fail until we have a 7.x tag (because of reasons). So parking this, briefly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-long-test Run Long Test job
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants