Android: per-tile-blit fallback runtime (audit fixes + HW debug logs)#270
leaiss wants to merge 15 commits into
Three changes to make the existing Android target buildable against the publicly-distributed CNSDK 0.7.28 zip (github.com/LeiaInc/leiainc.github.io/CNSDK/cnsdk-android-0.7.28.zip):
1. leia_cnsdk.cpp: leia_interlacer_release() was renamed in CNSDK to leia_interlacer_shutdown(core, interlacer), the only ABI drift between the wrapper and the current SDK headers. Verified by symbol-diff against all 22 calls the wrapper makes; the other 21 are identical.
2. build.gradle: the JNI exclude list pointed at libleiaSDK.so, which doesn't ship in 0.7.28. The actual shared libs are libleiaSDK-faceTrackingInApp.so (the one we want for the in-process variant) and libleiaSDK-faceTrackingInService.so (the one we exclude so it doesn't collide at load time).
3. .gitignore: ignore src/xrt/targets/openxr_android/cnsdk/. The SDK is 84 MB, fetched per-developer from the public LeiaInc asset repo, and has no business in git.
No build target changes, no new code paths. Just enough to get the already-plumbed Android target's CMake configure step past find_package(CNSDK CONFIG).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds a 24 MB inProcess-debug APK from a clean tree in ~40 s, bundling libopenxr_displayxr.so plus the three CNSDK 0.7.28 native libs (libleiaSDK-faceTrackingInApp.so, libblink.so, liblicense_utils.so). The runtime .so still has no Android compositor wired up (vk_native is gated to WIN32+APPLE), so xrCreateSession will fail at runtime. The next chunk of work (M7 #127 / #130 / #125) un-gates vk_native and plumbs ANativeWindow + CNSDK weaver.
Five fixes to unblock the build:
* openxr_android/build.gradle::unpackEigen: emit a header-only Eigen3Config.cmake in a doLast block. Upstream tarball ships only the .in template, so find_package(Eigen3 REQUIRED NO_MODULE) at CMakeLists.txt:103 fails. The stub creates an Eigen3::Eigen IMPORTED INTERFACE target pointing at the unpacked source dir.
* CMakeLists.txt: guard simulatedreality / srDirectX find_package with if(NOT ANDROID). They're Windows-only LeiaSR SDK packages, and the GLOBAL keyword on find_package isn't supported by the NDK's bundled CMake 3.22.1 (needs 3.24+).
* openxr_android/build.gradle: rename CMake target from openxr_monado to openxr_displayxr in both inProcess and outOfProcess flavor blocks. Stale Monado-era reference.
* oxr_session.c: wrap the qwerty_set_process_keys() call at line 2759 in #ifdef XRT_BUILD_DRIVER_QWERTY. The header include was already guarded; the call site wasn't.
* openxr_android/build.gradle: switch the CNSDK AAR file dependency from sdk-faceTrackingService to sdk-faceTrackingInApp. Single-app in-process is the whole POC plan; the inApp AAR is what bundles the three CNSDK .so files into the APK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Build now links libcomp_vk_native.a into libopenxr_displayxr.so on Android (verified via llvm-nm: comp_vk_native_compositor_create and the rest of the surface). The Android branch in comp_vk_native_target_create wraps the ANativeWindow* passed as hwnd into a VkAndroidSurfaceCreateInfoKHR and feeds it to vk->vkCreateAndroidSurfaceKHR (already loaded by vk_function_loaders.c when VK_USE_PLATFORM_ANDROID_KHR is defined; the CMake already sets that macro on Android).
Caller is still responsible for plumbing a real ANativeWindow* down to the compositor; that's #130 and the next commit. Today the runtime will fail with "ANativeWindow* is NULL on Android" inside xrCreateSession, which is the right next failure to chase.
* compositor/vk_native/CMakeLists.txt: gate (WIN32 OR APPLE OR ANDROID)
* compositor/CMakeLists.txt: same gate on the add_subdirectory line (without this the subdir wasn't entered and the .a never got built)
* compositor/vk_native/comp_vk_native_target.cpp: add #elif Android branch in the #ifdef chain inside comp_vk_native_target_create, plus the vulkan_android.h / native_window.h includes at the top
* state_trackers/oxr/CMakeLists.txt: add an if(ANDROID) block that pulls in oxr_session_gfx_vk_native.c, links comp_vk_native, and defines XRT_HAVE_VK_NATIVE_COMPOSITOR (mirrors the WIN32 block)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… POC day-3 part 2) When an OpenXR app on Android calls xrCreateSession with a Vulkan graphics binding but no window_handle (which is everyone today — XR_EXT_android_surface_binding doesn't exist yet), the runtime now spawns a SurfaceView on the Activity via the existing Monado android_custom_surface_async_start helper, blocks up to 5 s for the surfaceCreated callback, and feeds the resulting ANativeWindow* to comp_vk_native_compositor_create. The compositor's Android branch (added in c9f59c3) already knows what to do from there: wrap it in a VkAndroidSurfaceCreateInfoKHR, call vk->vkCreateAndroidSurfaceKHR, then run the standard swapchain create path. This is the POC bypass for #130 — single-app, in-process, leaks the custom_surface handle for process scope. The proper fix is the EXT spec + header + OXR glue mirroring XR_EXT_win32_window_binding / XR_EXT_cocoa_window_binding; that's still TODO. VM + activity are already stored by oxr_xrCreateInstance from XrInstanceCreateInfoAndroidKHR (required on Android per spec), so the android_globals lookups inside the new block resolve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
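The "block up to 5 s for the surfaceCreated callback" handoff described above can be pictured as a bounded wait on a condition variable. This is only a minimal sketch of that pattern: `SurfaceWaiter` and its methods are hypothetical names, and the sketch does not reproduce how android_custom_surface_async_start actually signals completion.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the surface handoff; not the Monado helper's API.
class SurfaceWaiter {
public:
    // Called from the UI-side callback once the surface exists.
    void on_surface_created(void *native_window) {
        assert(native_window != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            window_ = native_window;
        }
        cv_.notify_all();
    }

    // Blocks until a window arrives or the timeout elapses.
    // Returns nullptr on timeout so the caller can fail session creation.
    void *wait_for_surface(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait_for(lock, timeout, [this] { return window_ != nullptr; });
        return window_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    void *window_ = nullptr;  // would be an ANativeWindow* on Android
};
```

A caller would pass `std::chrono::milliseconds(5000)` to match the 5 s budget and treat a null result as a session-creation failure.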
The OpenXR Loader on Android discovers the active runtime via the RuntimeService's android.value metadata. Pointing at libopenxr_monado.so means no real OpenXR app would ever find this runtime — that .so doesn't exist in our APK; ours is libopenxr_displayxr.so. Two refs, both replaced (one in MonadoVrListener service, one in RuntimeService). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime now registers dp_factory_vk = leia_dp_factory_cnsdk on Android. The new DP owns a leia_cnsdk handle for the session lifetime, which kicks off CNSDK's async core init the moment xrCreateSession runs. Verified via llvm-nm: leia_dp_factory_cnsdk + leia_cnsdk_create are both in libopenxr_displayxr.so. POC SCOPE: process_atlas() is intentionally a no-op. CNSDK's leia_interlacer_vulkan_do_post_process records and submits its own command buffer, which doesn't fit the compositor's "record into my cmd_buffer" contract. Wiring the actual weave needs both #126 (a self_submitting DP flag the compositor honors by skipping its own submit) and a per-tile VkImageView pass to split the SBS atlas image into the separate left/right views CNSDK expects via leia_interlacer_vulkan_set_view_for_texture_array. What this DP does provide for POC milestone 1: * hardcoded IPD-only eye positions (-0.0325, 0, 0.5) / (+0.0325, 0, 0.5) — bypasses face tracking entirely per the android-poc-state memory * hardcoded Lume Pad-class display dimensions (0.1934 × 0.1209 m, 2560 × 1600 px) so XR_EXT_display_info reports something sensible * full lifecycle so the compositor no longer logs "No VK display processor factory provided" on Android Also caught + fixed two more CNSDK 0.7.28 ABI drifts in leia_cnsdk.cpp that the day-1 commit missed (these only surfaced now because the new DP file is the first thing to pull the static lib's CNSDK TU into the final link on Android): * leia_core_release → leia_core_shutdown (matches the rename pattern already applied to leia_interlacer_release → leia_interlacer_shutdown) * leia_interlacer_vulkan_set_view_for_texture_array dropped its trailing array-layer argument (now implicit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds is_self_submitting() to the xrt_display_processor vtable for DPs whose backing SDK records and submits its own VkCommandBuffer (CNSDK's leia_interlacer_vulkan_do_post_process). The vk_native compositor honors the flag at both process_atlas call sites: end+submit+wait the pre-DP cmd buffer for coherent atlas/target state, pass VK_NULL_HANDLE for the cmd_buffer arg, and skip the post-process_atlas submit. The window-target path allocates a fresh post-DP cmd buffer for the HUD overlay.
The CNSDK DP now performs the real weave: lazily creates one VkImage + VkImageView per view in VK_FORMAT_B8G8R8A8_SRGB (matches the format leia_cnsdk.cpp passes to leia_interlacer_vulkan_initialize), blits the two SBS atlas halves into them via vkCmdBlitImage, transitions to SHADER_READ_ONLY_OPTIMAL, submits+waits, then calls leia_cnsdk_weave with the per-view image views. CNSDK takes over from there. CNSDK-only drv_leia now links aux_vk for struct vk_bundle access.
assembleInProcessDebug builds clean (1m46s), APK 54 MB; llvm-nm confirms process_atlas_weave, is_self_submitting_true, leia_cnsdk_weave all linked. Not yet verified on a Lume Pad; perf will need a semaphore chain to CNSDK's imageAvailableSemaphore instead of host stalls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
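To make the ordering difference concrete, the two compositor paths can be modeled as a trace of steps. This is a toy model under stated assumptions, not the vk_native code: `DisplayProcessorModel`, `process_atlas_model`, and `composite_frame` are hypothetical names, and the strings stand in for real Vulkan work.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Trace = std::vector<std::string>;

// Hypothetical model of a display processor; only the flag matters here.
struct DisplayProcessorModel {
    bool self_submitting;
};

// A self-submitting DP must not be handed a command buffer, and a regular
// DP must be handed one (has_cmd_buffer == false models VK_NULL_HANDLE).
void process_atlas_model(const DisplayProcessorModel &dp, bool has_cmd_buffer, Trace &t) {
    assert(dp.self_submitting != has_cmd_buffer);
    t.push_back(dp.self_submitting ? "dp:own_submit" : "dp:record_into_compositor_cmd");
}

// Models one frame at a process_atlas call site.
Trace composite_frame(const DisplayProcessorModel &dp) {
    Trace t;
    t.push_back("record_pre_dp");
    if (dp.self_submitting) {
        // Flush our own work first so the atlas/target state is coherent,
        // then let the DP submit; no compositor submit follows.
        t.push_back("submit_and_wait_pre_dp");
        process_atlas_model(dp, /*has_cmd_buffer=*/false, t);
    } else {
        // Classic contract: the DP records into our buffer, we submit once.
        process_atlas_model(dp, /*has_cmd_buffer=*/true, t);
        t.push_back("submit_after_dp");
    }
    return t;
}
```

The point of the model is only the ordering: the self-submitting path pays an extra submit-and-wait before the DP runs and drops the submit after it.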
…s (POC day-5b)
Day-5 left a vkQueueWaitIdle in process_atlas between the per-tile blit submit and CNSDK's weave submit: one host stall per frame plus per-frame VkCommandBuffer alloc/free churn. Address both in three steps:
1. Extend leia_cnsdk_weave with a VkSemaphore waitSemaphore arg that passes through to leia_interlacer_vulkan_do_post_process's imageAvailableSemaphore param. CNSDK now waits for the upstream blit on the GPU instead of the host blocking between submits.
2. Cache the blit VkCommandBuffer, VkSemaphore (signaled by blit submit, waited by CNSDK weave), and VkFence (gates per-frame cmd buffer reset) on the leia_dp_cnsdk struct. Allocated once in the factory, freed in destroy_impl after waiting on any in-flight fence.
3. Per-frame blit path now: wait blit_fence if in flight → reset fence → vkResetCommandBuffer → re-record blit → vkQueueSubmit signaling blit_done + blit_fence (no host wait). leia_cnsdk_weave receives blit_done as its wait semaphore.
Build clean in 29s incremental; llvm-nm shows the new create_blit_resources / destroy_blit_resources symbols and the updated leia_cnsdk_weave. Behavior still unverified on a Lume Pad.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hardcoded Lume Pad 2 display metrics and IPD-only eye position stubs in the CNSDK display processor with values pulled from CNSDK once its async core init completes.
Three new wrappers in leia_cnsdk so the DP doesn't include CNSDK headers directly:
- leia_cnsdk_get_display_metrics: leia_core_get_device_config + read displaySizeInMm (converted to meters) and panelResolution. Returns false while the core is still initializing so callers can poll.
- leia_cnsdk_ensure_face_tracking_started: idempotent enable + start. CNSDK docs warn enable_face_tracking is heavy and shouldn't run on the main thread; POC accepts the one-time stall.
- leia_cnsdk_get_primary_face: wraps leia_core_get_primary_face, packaging the float[3] into a leia_float_slice the CNSDK API expects.
leia_display_processor_cnsdk now uses these lazily:
- get_display_dimensions / get_display_pixel_info call the metrics wrapper and fall back to hardcoded Lume Pad numbers until CNSDK is ready.
- get_predicted_eye_positions lazily starts face tracking, polls the primary face, and derives L/R eyes by face_pos ± IPD/2 along X. is_tracking now reflects reality instead of always returning false.
Build clean in 27s; llvm-nm confirms the three new wrappers landed. Behavior still unverified on a Lume Pad: face position coordinate system + units assumed to match xrt_eye_position (x=right, y=up, z=toward viewer, meters); will likely need a sign flip or scale calibration on hardware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
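The "face_pos ± IPD/2 along X" derivation is small enough to sketch directly. The names `Vec3`, `EyePair`, and `eyes_from_face` are hypothetical, and the 0.065 m default is an assumption taken from the ±0.0325 m stub values in the earlier commit; the axis convention (x=right) is the one the message itself assumes.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<float, 3>;  // x=right, y=up, z=toward viewer, meters

struct EyePair {
    Vec3 left;
    Vec3 right;
};

// Splits a single tracked face position into two eye positions by
// offsetting half the interpupillary distance along X only.
// Head roll and yaw are ignored, which is a simplification.
EyePair eyes_from_face(const Vec3 &face_m, float ipd_m = 0.065f) {
    assert(ipd_m > 0.0f);
    const float half = ipd_m * 0.5f;
    EyePair eyes{face_m, face_m};
    eyes.left[0] -= half;
    eyes.right[0] += half;
    return eyes;
}
```

With a face centered 0.5 m in front of the display, this reproduces the earlier hardcoded stub positions (∓0.0325, 0, 0.5).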
Day-6 left a "POC accepts the one-time stall" caveat on the first call to leia_cnsdk_ensure_face_tracking_started: CNSDK's enable_face_tracking is documented as too heavy for the main thread. Lift that work onto a worker thread spawned in leia_cnsdk_create:
- struct leia_cnsdk now owns std::atomic<bool> face_tracking_started, std::atomic<bool> shutting_down, and std::thread worker. Switched from calloc/free to new/delete so the C++ members value-initialize.
- The worker polls leia_core_is_initialized every 50 ms (honoring shutting_down for prompt teardown), then calls enable_face_tracking + start_face_tracking, sets the atomic flag, and exits.
- leia_cnsdk_ensure_face_tracking_started is now a non-blocking atomic load; the render thread can poll it every frame at zero cost.
- leia_cnsdk_destroy signals shutting_down and joins the worker before tearing down the core. If the worker is mid-enable_face_tracking we wait it out: CNSDK provides no interruption hook, so destroy may block briefly during initial init races.
Build clean in 30s; llvm-nm confirms the new face_tracking_worker symbol and the updated wrappers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
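The worker lifecycle described above (poll until initialized, do the heavy call once, publish through an atomic, join on destroy) might look roughly like the following. It is a sketch under assumptions: `FaceTrackingBootstrap` and the `bootstrap_*` functions are hypothetical, and the two callbacks stand in for leia_core_is_initialized and the enable/start pair rather than calling CNSDK.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the relevant part of the wrapper struct.
struct FaceTrackingBootstrap {
    std::atomic<bool> face_tracking_started{false};
    std::atomic<bool> shutting_down{false};
    std::thread worker;
};

// Spawns the worker. is_initialized / enable_and_start are placeholders
// for the SDK calls; poll defaults to the 50 ms interval from the text.
void bootstrap_start(FaceTrackingBootstrap &b,
                     std::function<bool()> is_initialized,
                     std::function<void()> enable_and_start,
                     std::chrono::milliseconds poll = std::chrono::milliseconds(50)) {
    assert(!b.worker.joinable());
    b.worker = std::thread([&b, is_initialized, enable_and_start, poll] {
        while (!is_initialized()) {
            if (b.shutting_down.load(std::memory_order_acquire)) return;
            std::this_thread::sleep_for(poll);
        }
        if (b.shutting_down.load(std::memory_order_acquire)) return;
        enable_and_start();  // the heavy call, now off the render thread
        // Release store: everything written above is visible to a reader
        // that observes true with an acquire load.
        b.face_tracking_started.store(true, std::memory_order_release);
    });
}

// Non-blocking check the render thread can call every frame.
bool bootstrap_is_started(const FaceTrackingBootstrap &b) {
    return b.face_tracking_started.load(std::memory_order_acquire);
}

// Signals shutdown and joins; blocks if the heavy call is still running.
void bootstrap_stop(FaceTrackingBootstrap &b) {
    b.shutting_down.store(true, std::memory_order_release);
    if (b.worker.joinable()) b.worker.join();
}
```

As in the commit, the join in `bootstrap_stop` has no timeout, so a stuck heavy call would stall teardown.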
Identified in the 2026-05-22 audit. Without these fixes, every branch
day-5 through day-8b is unfit for hardware testing: validation errors
on frame 1, undefined behavior on frame 2, wrong eye positions even if
the GPU survives, and UAF on session teardown.
B1 — atlas layout barrier (leia_display_processor_cnsdk.cpp).
vkCmdBlitImage hardcodes srcLayout=TRANSFER_SRC_OPTIMAL, but the
compositor leaves the atlas in SHADER_READ_ONLY_OPTIMAL on entry to
process_atlas (renderer terminating barrier; comment at
comp_vk_native_compositor.c:2057). Insert a SHADER_READ→TRANSFER_SRC
barrier at the top of blit_atlas_to_per_view and the inverse at the
bottom so the compositor's invariant ("atlas is shader-read-only
when DP returns") still holds.
B3 — binary semaphore double-signal.
leia_cnsdk_weave returned early without calling do_post_process when
CNSDK's async core init wasn't finished, leaving blit_done signaled
but never waited. Next frame would signal an already-signaled binary
semaphore — UB per Vulkan spec.
New leia_cnsdk_ensure_interlacer(cnsdk, device, physDev, targetFmt):
idempotent lazy interlacer creation, returns false until the core is
ready and the interlacer exists. process_atlas_weave now calls this
before any blit submit; if it returns false, the entire blit-and-
weave is skipped — no semaphore signal, no double-signal hazard.
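The B3 hazard and its fix can be illustrated with a toy state model of a binary semaphore. Nothing here touches Vulkan: `BinarySemaphoreModel`, `FrameModel`, and the two frame functions are hypothetical, and `interlacer_ready` stands in for what an ensure_interlacer-style check would report.

```cpp
#include <cassert>

// Toy model of a binary semaphore: tracks only whether a signal is pending
// and records misuse (signal while signaled, or wait while unsignaled).
struct BinarySemaphoreModel {
    bool signaled = false;
    bool misuse = false;
    void signal() {
        if (signaled) misuse = true;
        signaled = true;
    }
    void wait() {
        if (!signaled) misuse = true;
        signaled = false;
    }
};

struct FrameModel {
    BinarySemaphoreModel blit_done;
    bool interlacer_ready = false;
};

// Pre-fix ordering: the blit submit signals, then the weave bails out early
// when the SDK is not ready, leaving the signal pending for the next frame.
void frame_buggy(FrameModel &f) {
    f.blit_done.signal();
    if (!f.interlacer_ready) return;
    f.blit_done.wait();
}

// Post-fix ordering: check readiness first and skip the whole
// blit-and-weave pair, so signal and wait always come together.
bool frame_fixed(FrameModel &f) {
    if (!f.interlacer_ready) return false;
    f.blit_done.signal();
    f.blit_done.wait();
    return true;
}
```

Running `frame_buggy` twice while the interlacer is not ready trips the misuse flag on the second frame, whereas `frame_fixed` never signals without a matching wait.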
B4 + B5 — face position units + frame.
CNSDK header types.h:72 is explicit: face position is
"Head location in mm. The origin point is the location of the
camera." The wrapper was forwarding the raw values as if they were
meters relative to the display center — off by a factor of 1000
AND offset by ~50–100 mm (camera distance from display center).
Worker thread now snapshots leia_device_config.cameraCenter{X,Y,Z}
into the cnsdk struct (mm→m at storage time) right before
enable_face_tracking. leia_cnsdk_get_primary_face divides the CNSDK
position by 1000 then subtracts the cached camera center, returning
display-relative meters that match xrt_eye_position's convention.
The cached fields are read on the render thread after
face_tracking_started.load(acquire) returns true; the atomic gives
the read happens-before visibility on the worker's writes — no
separate mutex needed.
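The B4 + B5 conversion reduces to one scale and one offset per axis. A minimal sketch follows, with hypothetical names (`Vec3`, `face_mm_camera_to_display_m`); the subtraction follows the description above, and the sign convention of the cached camera center is an assumption that would need checking on hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec3 = std::array<float, 3>;

// face_mm:          face position as reported, in millimeters, camera origin.
// camera_center_m:  cached camera center, already converted to meters.
// Returns the face position in meters relative to the display center.
Vec3 face_mm_camera_to_display_m(const Vec3 &face_mm, const Vec3 &camera_center_m) {
    Vec3 out{};
    for (std::size_t i = 0; i < 3; ++i) {
        // Scale first (mm -> m), then shift the origin.
        out[i] = face_mm[i] / 1000.0f - camera_center_m[i];
    }
    return out;
}
```

For example, a reported face at (0, 100, 500) mm with a cached camera center of (0, 0.075, 0) m comes out at roughly (0, 0.025, 0.5) m, whereas forwarding the raw values would have placed the viewer 500 m away.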
B6 — vkDeviceWaitIdle on destroy.
destroy_impl was waiting on blit_fence, which only tracks the blit
submit — NOT CNSDK's own do_post_process queue submit. After our
blit completed, CNSDK could still be reading per-view VkImages /
VkImageViews / blit_done semaphore via its in-flight submit;
freeing them = UAF.
Add vkDeviceWaitIdle at the top of destroy_impl to drain all GPU
work (ours + CNSDK's) before any resource teardown. Single stall
during session destroy, acceptable cost.
Build clean in 29s. Verified leia_cnsdk_ensure_interlacer + the
updated wrappers landed in libopenxr_displayxr.so.
Audit reference: [[project-android-poc-audit-2026-05-22]] memory
entry. Remaining medium/low issues (B7–B18) deliberately deferred to
follow-up branches per the branch-per-chunk pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…B12)
Two robustness fixes from the audit follow-up list:
B11: stop infinite retry on interlacer init failure.
leia_cnsdk_ensure_interlacer was calling leia_interlacer_vulkan_initialize every frame for the lifetime of the session if it returned NULL once (out of memory, wrong VkDevice format, CNSDK lib mismatch, etc.). Add an interlacer_init_failed bool on the wrapper struct; set it once on failure, gate the retry on it. Logs once. Render-thread-only, no atomic needed.
B12: cache device config in the worker; eliminate per-frame churn.
leia_cnsdk_get_display_metrics was hitting leia_core_get_device_config + leia_core_release_device_config per call. CNSDK allocates a copy per get/release pair (it's not a pointer-grab). At 60 fps that's 60 allocs/sec for what is genuinely immutable data after init.
The day-7 worker thread already snapshots the camera center from the device config; extend it to also snapshot displaySizeInMm, panelResolution into new cached fields, then atomically set display_metrics_cached. leia_cnsdk_get_display_metrics now early-returns from the atomic flag check and reads the cached values directly. No more per-frame allocation; no more concurrent device-config access from the render thread (which implicitly resolves audit B9).
Build clean in 33s. No behavior change beyond the two fixes; the critical-bugs branch's behavior in steady-state is unchanged (faster path, but same values flow out to the consumer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
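The B11 latch is a general "try once, remember failure" pattern. The sketch below shows it in isolation; `LazyHandle` and `ensure_handle` are hypothetical names, and `try_init` stands in for the SDK's initialize call.

```cpp
#include <cassert>
#include <functional>

// Hypothetical holder for a lazily created handle plus a failure latch.
struct LazyHandle {
    void *handle = nullptr;
    bool init_failed = false;  // single-thread use assumed, so no atomic
};

// Returns true once the handle exists. After the first failed attempt the
// latch is set and try_init is never called again.
bool ensure_handle(LazyHandle &s, const std::function<void *()> &try_init) {
    assert(static_cast<bool>(try_init));
    if (s.handle != nullptr) return true;
    if (s.init_failed) return false;
    s.handle = try_init();
    if (s.handle == nullptr) {
        s.init_failed = true;  // real code would log exactly once here
        return false;
    }
    return true;
}
```

The trade-off is that a transient failure is never retried for the life of the object, which matches the "logs once" behavior described above.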
B2: per-view format gamma double-correction.
Atlas is rendered to VK_FORMAT_B8G8R8A8_UNORM (linear) by comp_vk_native_renderer.c. Per-view images and CNSDK interlacer init were both _SRGB. vkCmdBlitImage between same-texel formats reinterprets bytes, so CNSDK's SRGB sampler was applying SRGB→linear conversion to atlas data that was already linear, darkening mid-tones and leaving extremes (0.0/1.0) untouched.
Switch both per-view image format and CNSDK interlacer view format to VK_FORMAT_B8G8R8A8_UNORM. No more gamma round-trip.
B10: worker watchdog timeout.
leia_cnsdk_destroy was calling std::thread::join unconditionally on the face-tracking worker. If the worker was mid-leia_core_enable_face_tracking and CNSDK deadlocks (camera permission denied, driver bug, etc.), destroy would hang indefinitely, and CNSDK exposes no cancel API.
Add a 2-second watchdog using a side joiner thread + atomic + sleep loop (std::thread::join has no timeout variant). On happy path (worker already finished) join returns instantly, no extra cost. On hang path we detach (leaking the std::thread) so destroy can return.
Build clean in 41s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
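Since std::thread::join has no timeout, the B10 watchdog needs a helper thread. One possible shape is sketched below; `join_with_timeout` is a hypothetical name, and moving the worker into the joiner is a design choice of this sketch, not necessarily what the branch does.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>
#include <utility>

// Tries to join `worker` within `timeout`. Returns true if it finished.
// On timeout the helper thread is detached (and the stuck worker with it),
// so both are leaked in exchange for the caller being able to return.
bool join_with_timeout(std::thread &worker, std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    if (!worker.joinable()) return true;

    // Shared flag outlives this call if we have to detach.
    auto done = std::make_shared<std::atomic<bool>>(false);
    std::thread joiner([done, w = std::move(worker)]() mutable {
        w.join();
        done->store(true, std::memory_order_release);
    });

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done->load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() >= deadline) {
            joiner.detach();
            return false;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    joiner.join();
    return true;
}
```

A destroy path would call this with `std::chrono::milliseconds(2000)` and, on a false return, must avoid freeing anything the leaked worker might still touch.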
…t B7) The window-target self-submitting branch was vulnerable to a target_image race: CNSDK's queue submit (via process_atlas) writes to the swapchain image and may still be in flight when the compositor starts recording the HUD overlay into a fresh cmd buffer that targets the same image. Insert a vkDeviceWaitIdle right after process_atlas returns in the self-submitting branch. The cost is one host stall per frame in the unhappy path, but Android's c->hud is currently NULL — render_hud early-returns — so this is a no-op in practice. Worth landing now so the bug doesn't materialize the first time HUD enables on Android. Symmetric to the destroy-side vkDeviceWaitIdle in leia_display_processor_cnsdk.cpp::destroy_impl (B6). Build clean in 40s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same XRT_DEBUG_ANDROID_VERBOSE-gated DXR_HW_DBG / DXR_HW_DBG_ONCE macros as feat/android-hw-debug-logs, applied to the per-tile-blit fallback runtime stack (fix/compositor-b7). Both Hardware Test A (atlas runtime) and Hardware Test C (this branch's fallback runtime) now emit the same set of log markers for diagnosing CNSDK init / DP behavior on a Lume Pad.
leia_cnsdk.cpp instrumentation matches the atlas variant 1:1 (create / destroy / worker / ensure_interlacer / get_primary_face).
leia_display_processor_cnsdk.cpp adds per-tile-blit-specific markers:
- blit_atlas_to_per_view first-submit (sem + fence handles)
- process_atlas_weave per-second frame log with view/target dims
- create_blit_resources cmd/sem/fence handle dump
- factory log noting "self-submitting, per-tile blit + CNSDK weave"
build.gradle's debug variant gets the same cppFlags/cFlags '-DXRT_DEBUG_ANDROID_VERBOSE' addition so the in-process Debug APK gets the macros expanded.
Built clean in 47s. With this branch + feat/android-hw-debug-logs, both Test A and Test C runtime APKs emit consistent HW_DBG_CNSDK / HW_DBG_DP logcat tags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing per your earlier suggestion — atlas mode in PR #268 supersedes the per-tile-blit path entirely (-340 LoC; audits B1 and B3 architecturally erased). No reason to carry both forward. If atlas mode hits an unrecoverable issue on Lume Pad hardware, the per-tile-blit code is still recoverable via |
@dfattal — done as you suggested. Closed in favor of #268 (atlas mode supersedes per-tile-blit; -340 LoC; audits B1 and B3 architecturally erased). Per-tile-blit code stays recoverable via the safety branch |
Summary
Alternative Android runtime that uses our original per-tile-blit + semaphore-chain approach instead of CNSDK atlas mode (PR #268). Same audit fixes, same HW debug log layer. Provided as a fallback in case atlas mode misbehaves on hardware — bisect target if PR #268's first-light reveals an atlas-specific bug.
Functionally equivalent to PR #268's first 10 commits (everything through audit B7) plus the parallel debug-log layer, but without atlas mode, mono passthrough, pause/resume, or the build-guide / bringup-checklist docs.
When to use
`docs/getting-started/android-bringup-checklist.md` § Test C: only install this runtime if Test A (PR #268's atlas-mode runtime) fails in a way that points at atlas mode. Pairs with the same test app APK from PR #269.
If PR #268 works on hardware, this PR is informational only and can be closed without merging.
Build status
`./gradlew :src:xrt:targets:openxr_android:assembleInProcessDebug` — clean in ~47s.
🤖 Generated with Claude Code