Skip to content

guard(abi): compile-time tripwire on the plug-in DP-vtable ABI + ADR-020#348

Merged
dfattal merged 1 commit into
mainfrom
guard/plugin-abi-tripwire
May 27, 2026
Merged

guard(abi): compile-time tripwire on the plug-in DP-vtable ABI + ADR-020#348
dfattal merged 1 commit into
mainfrom
guard/plugin-abi-tripwire

Conversation

@dfattal
Copy link
Copy Markdown
Collaborator

@dfattal dfattal commented May 27, 2026

What

Adds the immediate, behavior-neutral guard against the class of bug that caused the standalone-VK no-weave / xrCreateSession crash, plus the policy that explains the durable follow-up. No runtime logic changes.

Why

That bug was an unguarded ABI drift: runtime d665014ab inserted is_self_submitting/on_pause/on_resume into struct xrt_display_processor (after v1.4.1) without bumping XRT_PLUGIN_API_VERSION_CURRENT, and the leia plug-in shipped built against the older layout. The DP vtable is a fixed-offset ABI with no struct_size, so the runtime's set_chroma_key call (slot 0x50) landed on the plug-in's destroy (0x50 in the old layout) and self-destructed the DP right after creation. Nothing flagged it — the version contract only ever mentioned the structs in xrt_plugin.h, the DP vtable was excluded, and the loader only logs the negotiated version.

Changes

  • xrt_display_processor.h — a block of _Static_asserts pinning every DP-vtable slot index + struct size. Any layout change fails the build locally and in CI, right next to a loud comment instructing the author to bump XRT_PLUGIN_API_VERSION_CURRENT, update the asserts, re-pin every plug-in, and follow ADR-020. Offsets are asserted as index * sizeof(void *) so the check holds on 32-bit (Android) too.
    • Verified: clean header compiles (cl /std:c11, exit 0); corrupting a slot index → error C2338, build fails.
  • xrt_plugin.h — widen the API-version contract comment to state the DP vtables are part of this ABI (the documentation gap that let d665014ab slip).
  • docs/adr/ADR-020 — the full policy: struct_size + append-only on the DP vtable, major-version = compatibility boundary, loader rejects major mismatch, this tripwire, and plug-in-CI pin self-consistency; and why a bundle-level lock-step pairing assert is explicitly not the mechanism (it fights independent plug-in/runtime updates).

Scope

This is the safe-to-land-now slice. The coordinated struct_size + append-only + loader-enforcement work (ADR-020 rules 1–3 — a major bump that rebuilds all first-party plug-ins) is a separate, hardware-validated follow-up and does not gate the immediate leia v1.0.5 fix.

No auto-merge — for @dfattal review.

🤖 Generated with Claude Code

The standalone-VK no-weave / xrCreateSession-crash bug was an unguarded ABI
drift: runtime d665014 inserted is_self_submitting/on_pause/on_resume into
struct xrt_display_processor (after v1.4.1) without bumping
XRT_PLUGIN_API_VERSION_CURRENT, and the leia plug-in was built against the older
layout — so the runtime's set_chroma_key call (slot 0x50) hit the plug-in's
destroy (0x50 in the old layout) and self-destructed the DP. Nothing flagged it:
the DP vtable is a fixed-offset ABI with no struct_size negotiation and was
never covered by the version contract (which only mentioned the structs in
xrt_plugin.h), and the loader only logs the negotiated version.

This commit adds the immediate, behavior-neutral guard (ADR-020 rule 4) and the
policy doc:

- xrt_display_processor.h: a block of _Static_asserts pinning every DP-vtable
  slot index + struct size. Any layout change fails the build locally AND in CI
  (verified: corrupting a slot index -> error C2338), adjacent to a loud comment
  telling the author to bump XRT_PLUGIN_API_VERSION_CURRENT, update the asserts,
  re-pin every plug-in, and follow ADR-020. Offsets asserted as index *
  sizeof(void*) so the check holds on 32-bit (Android) too.
- xrt_plugin.h: widen the API-version contract comment to state the DP vtables
  are part of this ABI (the documentation gap that let d665014 slip).
- docs/adr/ADR-020: the full policy — struct_size + append-only on the DP
  vtable, major-version = compatibility boundary, loader rejects major
  mismatch, this tripwire, and plug-in CI pin self-consistency; and why a
  bundle-level lock-step pairing assert is explicitly NOT the mechanism (it
  fights independent plug-in/runtime updates).

Behavior-neutral: no runtime logic changes. The coordinated struct_size +
append-only + loader-enforcement work (ADR-020 rules 1-3, a major bump that
rebuilds all first-party plug-ins) is a separate, validated follow-up and does
NOT gate the immediate leia v1.0.5 fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dfattal dfattal merged commit 5691ecc into main May 27, 2026
21 checks passed
@dfattal dfattal deleted the guard/plugin-abi-tripwire branch May 27, 2026 20:56
dfattal added a commit that referenced this pull request May 28, 2026
…reject + #342 re-scan (ADR-020)

Ships ADR-020 rules 1–3 as the coordinated runtime v1.6.0 major-version bump,
preventing the class of bug that broke standalone-VK weaving (leia plugin
pinned runtime headers v1.4.1 while runtime was v1.5.2 → DP-vtable offset
skew → runtime's set_chroma_key call hit the plug-in's destroy).

ABI v2 — struct_size + append-only + loader enforcement:

* xrt_display_processor.h: add `uint32_t struct_size; uint32_t reserved_0;`
  as the 8-byte first-field header. Rewrites the #348 tripwire asserts to
  anchor at XRT_DP_BASE_OFF = offsetof(.., process_atlas) (asserted == 8 on
  both 64-bit and 32-bit/Android) plus i*sizeof(void*). Adds
  XRT_DP_HAS_SLOT(xdp, field): bounds the field against the plug-in's
  reported struct_size, so appending a new method at the END of the vtable is
  now backward- AND forward-compatible within a major. Gates all 12 optional
  inline wrappers on HAS_SLOT in addition to the per-pointer NULL check;
  mandatory process_atlas (slot 0) + destroy (last) stay un-gated.

* xrt_display_processor_{d3d11,d3d12,gl,metal}.h: same 8-byte header + a
  per-API tripwire block (XRT_DP_{D3D11,D3D12,GL,METAL}_BASE_OFF) + HAS_SLOT
  gating on every optional wrapper. d3d12 has the extra set_output_format
  slot (index 1) the other APIs omit. The XRT_DP_ABI_ASSERT / XRT_DP_ABI_MSG /
  XRT_DP_HAS_SLOT macros are #ifndef-guarded in both the base and per-API
  headers so any include order is safe (no MSVC C4005 redefinitions).

* xrt_plugin.h: XRT_PLUGIN_API_VERSION_2 = 2;
  XRT_PLUGIN_API_VERSION_CURRENT points at it. v1 → v2 is the one-time
  break introducing the struct_size header on the DP vtables and turning
  the loader's version log into an enforced reject. ABI-v1 plug-ins
  (≤ leia v1.0.5) are rejected and must rebuild against v2 headers.

* target_plugin_loader.c: rule 3 enforcement — after a successful
  xrtPluginNegotiate but BEFORE iface->probe, reject any plug-in whose
  reported plugin_api_version != XRT_PLUGIN_API_VERSION_CURRENT in all
  three try_load_one variants (Windows registry, Android, POSIX/JSON).
  The DLL is unloaded and the loader falls back to the next plug-in
  (sim_display).

* sim_display_processor{,_d3d11,_d3d12,_gl,_metal}.{c,cpp,m}: every
  factory sets base.struct_size = sizeof(struct xrt_display_processor[_<api>])
  before assigning the vtable. calloc already zeroes reserved_0.

* oxr_plugin_stub.c: static_assert pin moved from
  XRT_PLUGIN_API_VERSION_CURRENT == _1 to == _2.

Folded #342 — durable DP re-scan at per-client compositor create:

The bundle finalize service-restart (shipped in Track A / v0.5.0) covers
the fresh-bundle-install ordering, but a service started mid-install
outside the bundle path (or a standalone service) still bakes the
already-discovered factories into xrt_system_compositor_info forever.
This change picks up a vendor plug-in registered AFTER the service
started on the next app launch, without requiring a service restart.

Mechanism (callback-on-syscomp-info, layering-clean — confirmed with user):

* xrt_compositor.h: add `void (*refresh_display_processors)(...)`
  function-pointer field to xrt_system_compositor_info.

* target_plugin_loader.{c,h}: new public target_plugin_refresh_active().
  Tracks the winning ProbeOrder of the active plug-in in a new static
  g_active_probe_order. discover_active_plugin now takes a
  uint32_t max_probe_order filter — the first-call path passes
  UINT32_MAX, the refresh path passes the active ProbeOrder so it only
  attempts STRICTLY-better candidates (never re-probes the active
  plug-in). Mutex-guarded via os_mutex (not C11 atomics — MinGW caveat
  per CLAUDE.md). The previous DLL is intentionally leaked on swap,
  consistent with the existing load-path leak.

* target_instance.c: factor xsysc->info.dp_factory_* assignment block
  into fill_dp_factories_from_plugin() and install
  refresh_display_processors_cb (calls target_plugin_refresh_active +
  re-derives factories) on xsysc->info.refresh_display_processors. The
  in-process / handle path leaves the field NULL — fresh-instance-per-
  launch already covers it.

* comp_multi_compositor.c (VK service path) and comp_d3d11_service.cpp
  (D3D11 service path) invoke the callback at the top of
  multi_compositor_create / system_create_native_compositor — i.e. once
  per IPC client at compositor create, before any DP-factory read.

Cleanups bundled with the v1.6.0 bump:

* comp_vk_native_compositor.c: the dp_vtable_looks_sane degrade-log now
  names "a plug-in built against a different runtime ABI major (ADR-020)"
  alongside the heap-collision possibility (the misleading old wording
  was flagged in project_plugin_abi_policy_and_release).

* comp_d3d11_window.cpp (#345, dev-only): the dev-manifest
  auto-resolution at WM_WORKSPACE_LAUNCH_APP is now gated on
  `getenv("DISPLAYXR_DEV") != NULL && getenv("XR_RUNTIME_JSON") == NULL`
  — installs (with no build/Release/ sibling) no longer force a stale
  XR_RUNTIME_JSON onto child apps.

Docs:

* docs/adr/ADR-020: status Proposed → Accepted; Status/rollout section
  reflects rules 1–3 done in v1.6.0, the per-API DP structs that gained
  struct_size, and leia v1.0.6's rule-5 pin self-check.

* docs/specs/runtime/plugin-discovery.md: §6 rewritten for v2 — describes
  XRT_DP_HAS_SLOT, the strict major-match reject (rule 3), and the
  XRT_PLUGIN_API_VERSION_3 reservation for the next break.

* CMakeLists.txt: VERSION 1.5.2 → 1.6.0.

Validation: scripts\build_windows.bat build → exit 0; all _Static_asserts
(base + 4 per-API blocks) hold; DisplayXRClient.dll + displayxr-service.exe
+ DisplayXR-SimDisplay.dll all relink. Pre-existing C4090 + cmake_install
\d warning unchanged.

Hardware testing requires the matching leia v1.0.6 (Track B2). With v1.6.0
+ the currently-shipped leia v1.0.5 (ABI v1), the loader will log
"ABI major mismatch — plugin_api=1, runtime expects 2 ... skipping
(ADR-020 rule 3)" and fall back to sim_display SBS — intentional.

Memory: project_plugin_abi_policy_and_release.
Plan: ~/.claude/plans/task-ship-the-displayxr-ticklish-quokka.md (Track B).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant