[pull] master from libretro:master #978
Merged: pull[bot] merged 9 commits into Alexandre1er:master from libretro:master (Apr 29, 2026)
The file had its own local __atomic_* / _Interlocked / volatile shim
guarding the cross-thread .status field, written before retro_atomic.h
existed. Replace it with the portable API now that the header is
upstream (75ea1e4).

The shim was 18 lines, well-commented, correct, and operated on plain
`int*` via GCC's __atomic_* extension form. The C11 stdatomic and
C++11 std::atomic backends in retro_atomic.h cannot operate on plain
pointers -- their typedefs are _Atomic int and std::atomic<int>, which
forbid plain dereference -- so this port changes the .status field
type from `enum gfx_thumbnail_status` to `retro_atomic_int_t` and
routes every read and write through retro_atomic_*_int.

Twenty-five access sites are converted; all but three are
single-threaded main-thread accesses that were already non-atomic in
shape, but now go through the API because the field's atomic type
requires it. On x86 TSO the acquire/release barriers compile out
entirely; on weak-memory ARM/PowerPC the cost is one extra ldar/stlr
per cold-path access (menu list updates, frame draws -- never
per-sample).

The two cross-thread sites that drove the original shim are preserved
verbatim:
- gfx_thumbnail_handle_upload publishes texture/width/height via a
  release-store of AVAILABLE.
- gfx_thumbnail_draw acquire-loads the status before reading the
  texture/width/height.
Both already had explanatory comments; those are kept.

Field-type ABI: retro_atomic_int_t is int-sized and int-aligned on
every supported backend, so gfx_thumbnail_t's layout is unchanged.
Verified across 7 backends and 2 architectures (x86_64, AArch64);
also asserted at compile time in the new test (see below).

Intra-file initialisation of a stack-local gfx_thumbnail_t in
gfx_thumbnail_draw uses retro_atomic_int_init() for the .status
field; plain assignment to an _Atomic int is illegal under C11
stdatomic.

CXX_BUILD compatibility
-----------------------
RetroArch's CXX_BUILD mode (Apple platforms, etc.) compiles every
.c file as C++.
Once .status is std::atomic<int>, two C-valid patterns become
C++-invalid:
- memset(&t, 0, sizeof(t)) of a struct containing std::atomic<int>
  warns -Wclass-memaccess (the struct is no longer trivially
  copyable).
- `*dst = *src` for a struct embedding std::atomic<int> fails to
  compile (the deleted copy-assignment of std::atomic<int> propagates
  upward through aggregates).

The first case fires once: xmb.c:667's xmb_alloc_node() did
memset(&node->thumbnail_icon.icon, 0, ...) instead of calloc'ing the
whole node, intentionally, to avoid zeroing a multi-KB
thumbnail_path_data substruct on the hot path. Replaced with
gfx_thumbnail_init_blank() -- a new field-by-field zero-init helper
added to gfx_thumbnail.h.

The second case does not fire: the only struct copy in the menu
drivers (ozone.c:4652 ozone_copy_node) operates on ozone_node_t,
which does not embed gfx_thumbnail_t (the gfx_thumbnail_t fields are
in ozone_handle_t, which is calloc'd, never struct-copied).

calloc-init of structs containing std::atomic<int> (xmb_handle_t,
ozone_handle_t, materialui_handle_t) is technically UB per C++11
[atomics.types.generic] (the default constructor leaves the atomic
uninitialised; reads should be preceded by atomic_init). In practice
every libstdc++ / libc++ / MSVC STL treats a zero-bit pattern of
std::atomic<int> as a zero-valued atomic, and reads via
atomic_load_explicit return 0. Verified with all three menu drivers
under CXX_BUILD.

Wrappers (GFX_THUMB_STATUS_STORE / GFX_THUMB_STATUS_LOAD) are
preserved as static-inline functions rather than function-like
macros. GCC 13+ at -O3 emits a -Wstringop-overflow false positive on
the inlined __atomic_* primitive when called from
gfx_thumbnail_request -- the optimiser cannot prove the @thumbnail
argument non-NULL across every `goto end` flow-graph path.
The pre-port code did not trip this because the shim's explicit
(int*) cast on the enum* defeated GCC's analysis; retro_atomic_int_t
is int directly on the GCC backend, so no cast remains to do that.
The wrappers carry a targeted
_Pragma("GCC diagnostic ignored \"-Wstringop-overflow\"") that
suppresses the warning at the function boundary; codegen is unchanged
on every backend (verified via aarch64 objdump: 14 ldar / 14 stlr
emitted, identical to pre-port). Other toolchains (clang, MSVC,
older GCC) skip the suppression.

Regression test
---------------
samples/gfx/gfx_thumbnail_status_atomic/ exercises the
synchronisation pattern in isolation. The test does not link
gfx_thumbnail.c; it mirrors the relevant types and macros and runs an
SPSC producer/consumer stress.

Verified failure mode: deliberately moving the producer's
generation-counter bump before the data writes (simulating a removed
release-store fence) makes the stress fail with hundreds of
mismatched reads on x86 and a "ThreadSanitizer: data race in
consumer_thread" report under TSan (halt_on_error=1 -> exit 66).
Restoring the fence gives 1M iterations with 0 mismatches and a clean
TSan run.

Static checks asserted at compile time:
- sizeof(gfx_thumbnail_t.status) == sizeof(int), so the field-type
  change preserves struct layout.
- HAVE_RETRO_ATOMIC is set after the include.

Runtime checks:
- All four GFX_THUMBNAIL_STATUS_* values round-trip through
  GFX_THUMB_STATUS_STORE / GFX_THUMB_STATUS_LOAD.
- 1M-iteration SPSC handshake stress (HAVE_THREADS only): the
  producer stages (texture, width, height) = (i, i, i), stores
  AVAILABLE via the macro under test, and bumps a `produced` counter
  via release-store; the consumer waits for the new generation, reads
  the four fields via the macro under test, verifies all four equal
  i, and acks via a `consumed` counter.
The handshake prevents the producer from rewriting the data while the
consumer is reading it; without it, torn reads would be expected even
with correct barriers (a property of the test design, not the
primitives). The test mirrors gfx_thumbnail_init_blank() locally
rather than including gfx_thumbnail.h, since the test deliberately
shadows the header's type definitions to verify layout invariants.

CI
--
.github/workflows/Linux-samples-gfx.yml gets one new step that builds
the sample with SANITIZER=thread and runs it with
TSAN_OPTIONS=halt_on_error=1. TSan is the right validator for this
test specifically: missing-barrier regressions show up as data races
even on x86 TSO, where the hardware would otherwise hide them at
runtime. ASan/UBSan would catch nothing here (no out-of-bounds
access, no UB), so they are not added.

Verified locally
----------------
Compile-clean across:
- gcc -O2 / -O3 -Wall -Werror, x86_64, all 5 affected files
  (gfx_thumbnail.c, xmb.c, ozone.c, materialui.c, runloop.c)
- g++ -xc++ -std=c++11 -DCXX_BUILD, same 5 files
- clang -O2, same 5 files
- aarch64-linux-gnu-gcc + qemu-user, gfx_thumbnail.c
- arm-linux-gnueabihf-gcc, gfx_thumbnail.c
- clang -x objective-c / -x objective-c++ smoke (header-only
  inclusion check; mirrors cocoa_common.m's transitive path via
  menu_driver.h)
- C89 -ansi -pedantic, gfx_thumbnail.c
- Forced backends: C11 stdatomic, GCC __sync_*, volatile fallback
- Clang static analyser
- HAVE_THREADS=0 (test sample, static checks only)

Functional:
- 1M-iter SPSC stress passes on x86_64 native
- 1M-iter SPSC stress passes under qemu-aarch64
- TSan halt_on_error=1 catches the deliberately-injected
  missing-barrier bug -> exit 66
- aarch64 objdump shows 14 ldar / 14 stlr in gfx_thumbnail.o,
  identical between pre-port and post-port
- x86 codegen: plain mov for cross-thread sites (TSO-correct)
- ARM32 codegen: 28 dmb instructions emitted

Behavior when no real atomic backend is available
-------------------------------------------------
retro_atomic.h's cascade always selects something; the last tier is a
volatile fallback (`#warning` unless suppressed,
`RETRO_ATOMIC_LOCK_FREE` undefined). On that tier, expansions are
plain volatile loads and stores -- no memory barriers -- which is
correct only on single-core targets or x86 TSO. The header's
docstring documents this, and the intended caller-side gate is
`#if defined(RETRO_ATOMIC_LOCK_FREE)`.

gfx_thumbnail does NOT add such a gate, matching pre-port behavior:
the original shim's `#else` branch was also a plain volatile
read/write with no barriers, and gfx_thumbnail never checked whether
barriers were available. That tradeoff was explicit in 91166d0
("Note on correctness"), where the volatile fallback's SMP PowerPC G5
weakness was acknowledged and accepted: the failure mode is at worst
a fade-in animation starting one frame late, not corruption (status
is a single aligned int -- no torn reads possible -- and a cosmetic
mistime is recoverable on the next frame).

What changes with the port: the cascade gets richer. Pre-port,
anything that wasn't the GCC __atomic_* or MSVC Interlocked* branch
landed on the volatile fallback. Post-port, four additional tiers
sit between modern GCC and volatile -- C11 stdatomic, C++11
std::atomic, Apple OSAtomic*, GCC __sync_* -- which means every real
RetroArch target with threaded video falls into a tier that gives
proper release/acquire ordering. The G5 case the historical note
acknowledged is now covered by GCC __sync_* (which existed in GCC
4.1+ and was named as the proper fix if anyone wanted to harden the
G5 path later); a G5 with modern devkitPPC would land on __atomic_*
and get real lwsync.
Net: on every platform where atomics are "unavailable" in the sense
of "no real backend matches", pre-port and post-port emit identical
plain-volatile codegen (verified by aarch64-gcc
-DRETRO_ATOMIC_FORCE_VOLATILE: gfx_thumbnail_handle_upload emits a
plain `str` rather than `stlr`, same as pre-port's volatile branch on
the same target). The port introduces no new "atomics unavailable"
footgun.

Not verified on real hardware:
- MSVC ARM64 (relies on retro_atomic.h's MSVC backend, which landed
  in 75ea1e4 and is itself not yet hardware-validated)
- Real PowerPC SMP (Wii U, Xbox 360)
- End-to-end Apple build (macOS / iOS / tvOS)
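For reference, the release/acquire publish/consume discipline this
commit preserves can be sketched in isolation. This is a hedged
stand-in, not the real gfx_thumbnail_t: C11 stdatomic is used
directly in place of the retro_atomic_*_int wrappers, and the struct,
function, and status names are simplified illustrations.

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified stand-in for the thumbnail struct; only the cross-thread
 * field is atomic, the payload is published under the barrier. */
enum { STATUS_UNKNOWN = 0, STATUS_PENDING, STATUS_AVAILABLE };

typedef struct
{
   _Atomic int status;   /* cross-thread; release/acquire only */
   unsigned texture;     /* payload fields                     */
   unsigned width;
   unsigned height;
} thumb_t;

/* Producer side (upload handler): write the payload first, THEN
 * release-store the status, so a consumer that observes AVAILABLE
 * is guaranteed to also observe the payload writes. */
static void publish(thumb_t *t, unsigned tex, unsigned w, unsigned h)
{
   t->texture = tex;
   t->width   = w;
   t->height  = h;
   atomic_store_explicit(&t->status, STATUS_AVAILABLE,
         memory_order_release);
}

/* Consumer side (draw path): acquire-load the status BEFORE touching
 * the payload; returns 0 when nothing has been published yet. */
static int consume(thumb_t *t, unsigned *tex, unsigned *w, unsigned *h)
{
   if (atomic_load_explicit(&t->status, memory_order_acquire)
         != STATUS_AVAILABLE)
      return 0;
   *tex = t->texture;
   *w   = t->width;
   *h   = t->height;
   return 1;
}
```

On x86 TSO both operations compile to plain mov; on AArch64 they
become stlr/ldar, matching the codegen counts cited above.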
The macOS branch of cpu_features_get() checked only the return value of sysctlbyname(), which is 0 (success) whenever the key exists -- regardless of whether the feature is actually supported. Apple's hw.optional.* keys are present on every machine and report 0 when the feature is unavailable, so the existence check incorrectly flagged unsupported features as present. Most visibly, AVX-512 was reported on Intel Macs without AVX-512 support (e.g. i9-9880H / Coffee Lake) because hw.optional.avx512f exists but returns 0 there. SSE/AVX/AVX2/NEON happened to produce correct results only because their keys always read 1 on hardware that supports them. Pass a buffer to sysctlbyname() and check the actual value.
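The pre-fix vs post-fix check can be sketched as follows. This is a
hedged illustration: `fake_sysctlbyname` is a hypothetical stub
modelling Apple's documented behaviour (hw.optional.* keys exist on
every machine and report 0 when unsupported), so the logic difference
is demonstrable off-macOS; the real fix calls the actual
sysctlbyname().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stub modelling a machine where hw.optional.avx512f EXISTS (so the
 * call succeeds) but reports 0 (feature unsupported). */
static int fake_sysctlbyname(const char *name, void *oldp,
      size_t *oldlenp, void *newp, size_t newlen)
{
   (void)newp; (void)newlen;
   if (oldp && oldlenp && *oldlenp >= sizeof(int))
   {
      int supported = (strcmp(name, "hw.optional.avx512f") != 0);
      memcpy(oldp, &supported, sizeof(int));
      *oldlenp = sizeof(int);
   }
   return 0; /* key exists -> success, regardless of the value */
}

/* Buggy pre-fix shape: treats "key exists" as "feature present". */
static int has_feature_pre(const char *key)
{
   return fake_sysctlbyname(key, NULL, NULL, NULL, 0) == 0;
}

/* Fixed shape: pass a buffer and test the actual reported value. */
static int has_feature_post(const char *key)
{
   int    val = 0;
   size_t len = sizeof(val);
   if (fake_sysctlbyname(key, &val, &len, NULL, 0) != 0)
      return 0;
   return val != 0;
}
```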
A lock-free single-producer / single-consumer byte queue built on
retro_atomic.h. Wired into the standard libretro-common build
(Makefile.common, griffin) so it ships with every RetroArch
configuration, but no production callers yet -- this commit lands
the primitive only. Subsequent commits can convert hand-rolled
SPSC patterns (audio/drivers/coreaudio.c's atomic_size_t ring,
audio/drivers/opensl.c's __sync_*-guarded buffered_blocks counter)
to use it.
Design
------
Two-cursor (head + tail) Lamport / Disruptor style, NOT the single
shared count. Each cursor is written by exactly one thread and
acquire-loaded by the other. The producer publishes new data with
a release-store on head; the consumer publishes consumed bytes
with a release-store on tail. Neither thread issues atomic RMW
operations on the hot path -- only retro_atomic_load_acquire_size
and retro_atomic_store_release_size, which retro_atomic.h provides
across all 7 backends.
Capacity is power-of-2 (rounded up at init), masked with
capacity - 1. Wraparound uses the standard two-memcpy pattern.
size_t modular arithmetic on (head - tail) is well-defined as long
as the difference stays within capacity, which the producer
enforces by checking write_avail before each push.
head and tail are placed on separate cache lines via explicit
char-array padding (RETRO_SPSC_CACHE_LINE, default 64). Without
this, the producer's release-store on head would invalidate the
consumer's tail cache line and vice versa, halving throughput on
contended SMP. Padding is a performance hint; correctness does
not depend on it. The padding macros are guarded against
underflow if RETRO_SPSC_CACHE_LINE is misconfigured to be smaller
than the prefix fields.
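The design above (one writer per cursor, release-store publish,
power-of-2 mask, modular size_t arithmetic) can be sketched in a few
lines. This is a hedged stand-in, not retro_spsc's implementation:
C11 stdatomic is used in place of the retro_atomic_* primitives, the
transfer is one byte per call rather than the two-memcpy range form,
and the cache-line padding is omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CAP 8u                     /* power of 2 */

typedef struct
{
   _Atomic size_t head;            /* written by producer only */
   _Atomic size_t tail;            /* written by consumer only */
   unsigned char  buf[CAP];
} spsc_t;

static size_t spsc_write_avail(spsc_t *q)
{
   size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
   size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
   return CAP - (h - t);           /* modular size_t arithmetic */
}

static int spsc_push(spsc_t *q, unsigned char b)
{
   size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
   size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
   if (h - t == CAP)
      return 0;                    /* full */
   q->buf[h & (CAP - 1)] = b;      /* write the data first ...  */
   atomic_store_explicit(&q->head, h + 1,
         memory_order_release);    /* ... then publish it       */
   return 1;
}

static int spsc_pop(spsc_t *q, unsigned char *b)
{
   size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
   size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
   if (h == t)
      return 0;                    /* empty */
   *b = q->buf[t & (CAP - 1)];     /* read the data first ...   */
   atomic_store_explicit(&q->tail, t + 1,
         memory_order_release);    /* ... then free the slot    */
   return 1;
}
```

Note that neither push nor pop issues an atomic RMW; each hot-path
call is one acquire-load of the other thread's cursor plus one
release-store of its own.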
Comparison with the existing fifo_queue_t:
- fifo_queue takes an slock_t internally; it's MPMC-safe but
every push/pop costs a mutex.
- retro_spsc is lock-free but limited to one producer / one
consumer. Use it in code paths where (a) producer/consumer
counts are fixed at one each (audio render thread <-> audio
submission, video thread <-> task queue, etc.) and (b) the
fifo_queue lock contention measurably matters.
Comparison with audio/drivers/coreaudio.c's hand-rolled ring:
- coreaudio uses a single shared atomic_size_t `filled` count
rather than two cursors; producer fetch_add's it, consumer
fetch_sub's it. Correct, but optimises for "give me
write_avail" being a single atomic load (audio drivers do
that query every fill) at the cost of RMW on every push and
pop.
- retro_spsc inverts that trade-off: write_avail / read_avail
each cost two atomic loads, but push and pop only do
load_acquire + store_release. Better for the general case
where push/pop count vastly exceeds availability checks.
- retro_spsc also pads head and tail; coreaudio doesn't (its
single shared atomic doesn't suffer false sharing).
Build wiring
------------
Build requirements:
- retro_atomic.h (header-only, always available)
- <stddef.h>, <stdint.h>, <stdbool.h>, <stdlib.h>, <string.h>
- That's it. No HAVE_THREADS gate is needed for compilation:
retro_spsc.c builds on every target retro_atomic.h supports,
including pre-thread console targets (PSP, original Wii).
On a single-threaded build it just sits there as dead code.
Wired in through:
- Makefile.common adds retro_spsc.o next to fifo_queue.o under
the unconditional libretro-common object list.
- griffin.c #includes retro_spsc.c next to fifo_queue.c (Apple
builds and console unity builds pick it up via griffin).
Test
----
libretro-common/samples/queues/retro_spsc_test/ exercises the
queue under producer/consumer concurrency:
- Property checks (single-threaded): power-of-2 round-up at init,
fresh-queue avails, push/pop content round-trip, peek does not
advance, wraparound across the buffer end.
- SPSC stress (HAVE_THREADS): producer pushes 10M sequential
32-bit tokens through a 4 KB buffer; consumer reads and
verifies each token matches the expected sequence. The
small buffer relative to the message volume forces heavy
interleaving between producer and consumer, exercising the
wraparound path on every iteration.
The stress harness has no per-iteration handshake -- both threads
spin on avail() in tight loops -- so the test is sensitive to
real synchronisation bugs. Verified: deliberately moving the
producer's release-store on head BEFORE the buffer memcpys (so a
consumer observing the new head can read uninitialised bytes)
makes the stress fail with hundreds of mismatched tokens out of
10M, even on x86 TSO without TSan. The same bug under TSan with
halt_on_error=1 exits 66 with "data race in retro_spsc_read_avail".
CI
--
.github/workflows/Linux-libretro-common-samples.yml:
- retro_spsc_test added to RUN_TARGETS so the auto-discovery
job builds and runs it under SANITIZER=address,undefined
(the workflow default).
- A new step builds and runs the test under Clang +
SANITIZER=thread with TSAN_OPTIONS=halt_on_error=1, mirroring
the retro_atomic_test TSan lane. TSan is the strict
validator for this primitive specifically: missing-barrier
regressions show up as data races even on x86 TSO, where the
hardware would otherwise hide them at runtime.
Verified locally:
- gcc -O0/-O2/-O3 -Wall -Werror, x86_64
- clang -O2, x86_64
- g++ -xc++ -std=c++11 (CXX_BUILD-style)
- aarch64-linux-gnu-gcc + qemu-user, 10M tokens clean,
objdump shows real ldar/stlr at the cursor accesses
- arm-linux-gnueabihf-gcc compile-clean
- Forced backends: C11 stdatomic, GCC __sync_*, volatile
fallback (volatile fallback correct only on x86 TSO, same
contract as every other retro_atomic.h caller)
- TSan halt_on_error=1: clean run, exit 0
- TSan halt_on_error=1 on bug-injected build: exit 66
- ASan/UBSan: clean
- Bug-injected without sanitizer: 815 mismatches / 10M tokens
Not verified on real hardware:
- MSVC ARM64 (inherits retro_atomic.h's untested-on-hardware
state for that backend)
- Real PowerPC SMP (Wii U, Xbox 360)
Apple deprecated <libkern/OSAtomic.h> in macOS 10.12 (2016) in
favour of C11 <stdatomic.h>. platform_darwin.m had two callers
of OSAtomicCompareAndSwap32 left for the GCD-based filesystem
watcher's "did anything change since last check?" flag. Replace
both with retro_atomic.h primitives.
Counter pattern, not CAS-flag
-----------------------------
The previous design used a 32-bit flag set/cleared via CAS:
- setter: OSAtomicCompareAndSwap32(0, 1, &has_changes)
from the GCD vnode event handler
- reader: OSAtomicCompareAndSwap32(1, 0, &has_changes)
from main thread; return value answers
"did at least one event fire since the last call?"
The reader's CAS is load-bearing -- it has to read-and-clear
atomically or risk losing a notification fired between the read
and the clear. retro_atomic.h does not expose a CAS primitive;
its docstring explicitly says "no compare-exchange, add only when
a real caller needs it."
Rather than add CAS to retro_atomic.h for this single caller,
restructure the field as a monotonic counter:
- setter: retro_atomic_fetch_add_int(&event_count, 1)
- reader: acquire-load event_count; compare to last_seen
cursor; update last_seen; return whether they
differ. last_seen is main-thread-only state, no
atomic primitive needed.
This is strictly more flexible than the flag. The "did anything
change?" semantic is preserved exactly (now != last_seen <=> at
least one event), and the count is available if a future caller
wants to batch events or report event volume. Multi-reader
fan-out becomes possible (each reader owns its own last_seen
cursor) without further changes.
Wraparound: 32-bit signed int at GCD vnode-event rates wraps in
centuries. The (now != last_seen) comparison remains correct
across wraparound under modular int arithmetic.
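The counter pattern can be sketched as below. This is a hedged
stand-in: C11 stdatomic replaces retro_atomic_fetch_add_int and the
acquire-load primitive, and `watcher_t` is a simplified illustration,
not the real darwin_watch_data_t.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct
{
   _Atomic int event_count;  /* bumped by the event-handler thread */
   int         last_seen;    /* main-thread-only cursor            */
} watcher_t;

/* Setter (GCD vnode event handler): monotonic bump, no CAS. */
static void watcher_event(watcher_t *w)
{
   atomic_fetch_add_explicit(&w->event_count, 1, memory_order_release);
}

/* Reader (main thread): "did at least one event fire since the last
 * call?" Only equality is tested, so the comparison stays correct
 * across int wraparound. */
static int watcher_check(watcher_t *w)
{
   int now = atomic_load_explicit(&w->event_count,
         memory_order_acquire);
   if (now == w->last_seen)
      return 0;
   w->last_seen = now;
   return 1;
}
```

Multi-reader fan-out falls out naturally: each reader keeps its own
last_seen cursor against the same shared event_count.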
Backwards / forwards compatibility
----------------------------------
darwin_watch_data_t is file-private to platform_darwin.m -- not
declared in any header, no external code touches it. The public
path_change_data_t wrapper (frontend/frontend_driver.h) is
unchanged. The frontend_ctx_driver vtable entry
(check_for_path_changes) keeps its signature and bool return.
Caller-observable semantic is identical: returns true exactly
once per "at least one event since last call", false otherwise,
no notifications lost. The internal state difference (counter
vs flag) is invisible across the function boundary.
Performance
-----------
Reader path (called ~60 Hz from runloop.c when
video_shader_watch_files is enabled, Apple-only):
- Pre: one atomic CAS (1, 0).
- Post: one acquire-load + two non-atomic ops.
Net: strictly cheaper. ldar vs cas/ldaxr-stlxr loop on AArch64.
Setter path (GCD vnode event handler, sub-millisecond/day total):
- Pre: one CAS (may no-op if flag is already set).
- Post: one fetch_add (always stores).
Net: one extra store on the cold path. Below measurement
noise at the actual event rate.
Cache footprint: +4 bytes per watcher (last_seen). event_count
and last_seen sit on the same cache line; setter invalidates it
on every event, reader reloads next call. At 60 Hz reader vs.
sparse setter, max ~3 microseconds/sec of cache-miss cost --
negligible, and the function is not on a hot path regardless.
Field type and initialisation
-----------------------------
Field changes from `volatile int32_t has_changes` to
`retro_atomic_int_t event_count`. The retro_atomic_int_t is
int-sized and int-aligned across all 7 retro_atomic.h backends
(verified via the gfx_thumbnail port and retro_atomic_test),
so no struct-layout impact -- but the type itself is
backend-dependent: _Atomic int under C11, std::atomic<int> under
C++11, plain int under MSVC, volatile int on the fallback.
Plain assignment to _Atomic int is illegal under C11 stdatomic,
so the calloc-then-assign-zero idiom is replaced with
retro_atomic_int_init. The calloc remains; the explicit init is
belt-and-braces (zero-bit-pattern is a valid initial state for
every backend, but the API requires the init for type-system
reasons under C11).
Verified
--------
- retro_atomic.h still has no CAS / exchange primitive after
this commit. Audit confirms platform_darwin.m was the last
OSAtomic* caller in the tree (gfx/gfx_thumbnail.h's mention
is documentation referencing the backend-cascade list, not
a call site).
- Smoke test mirroring the changed code blocks compiles and
runs clean under: gcc -O2/-Wall/-Werror, clang -x objective-c,
clang -x objective-c++ -std=c++11 (CXX_BUILD-style),
forced C11 stdatomic backend. Runtime: events fire ->
reader returns true once, then false; subsequent events ->
true once again. Idempotent on no-events.
- Full platform_darwin.m compilation requires Apple SDK
(Mach + Foundation + GCD + IOKit) and is not exercised on
Linux CI; verification on macOS hardware needed before merge.
The ASIO callback thread (which is the audio driver's real-time
thread) called fifo_read(ad->ring, ...) at five sites in
asio_deinterleave_to_buffers without holding any lock. The main
thread called fifo_write(ad->ring, src, to_write) in asio_write, also
without holding any lock. The only slock_* calls in the file were on
cond_lock for the scond_wait/scond_signal pair around backpressure,
which protects the condvar's internal state but does not serialise
fifo_buffer_t access -- fifo_buffer_t is itself a plain ring buffer
with no internal synchronisation, and the audit's earlier
characterisation of fifo_buffer_t as "MPMC-safe with internal slock"
was wrong.

Both threads concurrently mutated ad->ring->first (callback via
fifo_read) and ad->ring->end (main via fifo_write). fifo_* internally
does an "if-then-set" on first/end inside a wraparound branch, so a
racing read can observe a torn value -- not just loose ordering, an
actually corrupt half-updated cursor. On x86 TSO this is masked at
the hardware level most of the time, which is presumably why the bug
has been latent without major reports; on weak-memory targets
(Win-on-ARM ASIO, becoming a real thing with Snapdragon X) the
symptoms would be audio glitches at minimum and corrupted samples on
wraparound.

retro_spsc_t is the right tool for this exact pattern:
single-producer (main thread, asio_write) / single-consumer (callback
thread, asio_deinterleave_to_buffers) byte queue with acquire/release
ordering on the head and tail cursors. Lock-free, so no
priority-inversion concern from the callback path (taking a mutex
from a real-time audio callback is a textbook anti-pattern in
pro-audio code, which rules out the alternative "add a slock around
the FIFO" fix). This also makes asio.c the first production caller
of retro_spsc_t since b894479 introduced the primitive, validating
the API against a real workload.

Mechanical changes
------------------
- #include <queues/fifo_queue.h> -> <retro_spsc.h>.
- fifo_buffer_t *ring -> retro_spsc_t ring (embedded by value), plus
  a sibling bool ring_initialized so cleanup paths can tell "init
  succeeded" from "never initialised". retro_spsc_t doesn't carry
  that bit internally because most callers know their own lifecycle;
  asio.c's free path runs from multiple error sites, so the flag is
  the simplest way.
- fifo_new/fifo_free -> retro_spsc_init/retro_spsc_free, with !ring
  NULL-checks replaced by !ring_initialized.
- FIFO_READ_AVAIL / FIFO_WRITE_AVAIL -> retro_spsc_read_avail /
  retro_spsc_write_avail.
- fifo_read / fifo_write -> retro_spsc_read / retro_spsc_write. The
  writer at line 1306 now uses retro_spsc_write's actual return value
  rather than assuming it equals to_write; the cap via
  retro_spsc_write_avail makes them equal in practice, but using the
  return value is defensive against contract changes.
- fifo_clear -> retro_spsc_clear (new API, see below) at the
  persistent-instance reuse path.

ra_asio_t is calloc'd, so the embedded retro_spsc_t starts as a
zero-bit pattern; retro_spsc_init then does the proper init via
retro_atomic_size_init before any cross-thread access begins. This
avoids the C++11 [atomics.types.generic] technical-UB concern about
uninitialised _Atomic / std::atomic members (carried over from the
gfx_thumbnail port's analysis).

retro_spsc_clear: new API
-------------------------
retro_spsc_clear(retro_spsc_t *q) resets head and tail to zero
without reallocating the buffer. It is documented as safe only when
both producer and consumer are quiesced -- the same lifetime
constraint as retro_spsc_init / retro_spsc_free. The implementation
is two retro_atomic_size_init calls; the init form is necessary
because plain assignment to a retro_atomic_size_t is illegal under
the C11 stdatomic backend.
Without this addition, the alternatives for "drop stale data on
session restart" were (a) repeated retro_spsc_read into a scratch
buffer until empty (wasteful memcpys, ugly) or (b) retro_spsc_free +
retro_spsc_init (reallocates the heap buffer, churn). By contrast,
retro_spsc_clear is one line at the call site and atomic with respect
to the calling thread. The quiescence requirement is documented in
the header alongside the function.

asio.c had two pre-port fifo_clear sites:
1. ra_asio_init_via_persistent (~line 1024): runs before
   g_asio = ad, so the ASIO callback isn't yet seeing this instance.
   Single-threaded; retro_spsc_clear is safe here. This site is
   converted directly.
2. ra_asio_free / parking path (~line 1380 pre-port): ran AFTER
   ASIO_CALL_STOP and AFTER g_asio = NULL. The intent was "drop
   stale audio before parking", but ASIO_CALL_STOP is not a
   synchronous join -- ASIO4ALL specifically does not terminate its
   audio thread before returning -- so the old fifo_clear at this
   site could race with a stray callback's fifo_read. This was a
   pre-existing latent bug. Resolution: drop the clear at the
   parking site entirely. The next ra_asio_init_via_persistent will
   re-clear after the callback is provably idle (g_asio == NULL means
   the callback short-circuits at line 852's null check, so it won't
   touch the ring). Documented with a comment at the parking site.

Test coverage
-------------
retro_spsc_clear gets unit-test coverage in the existing sample at
libretro-common/samples/queues/retro_spsc_test/: write 50 bytes,
verify read_avail == 50, retro_spsc_clear, verify read_avail == 0 and
write_avail == capacity, then verify the queue is reusable after
clear (write 16, read 16, content matches). Runs clean under
SANITIZER=thread with halt_on_error=1. The SPSC stress test
(1M-token producer/consumer race through a 4 KB buffer) continues to
pass under TSan with no regression from the new function.
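The described shape of the clear operation can be sketched as below.
This is a hedged stand-in: C11 stdatomic replaces
retro_atomic_size_init, and `spsc_sketch_t` is a simplified
illustration, not the real retro_spsc_t layout. As documented above,
both producer and consumer must be quiesced before calling it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct
{
   _Atomic size_t head;   /* producer cursor */
   _Atomic size_t tail;   /* consumer cursor */
   /* ... buffer pointer, capacity, padding elided ... */
} spsc_sketch_t;

/* Reset both cursors without touching the data bytes: head == tail
 * means "empty" regardless of buffer contents, so no memcpy or
 * reallocation is needed. Safe only when both sides are quiesced. */
static void spsc_clear(spsc_sketch_t *q)
{
   atomic_init(&q->head, 0);
   atomic_init(&q->tail, 0);
}
```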
Verified locally
----------------
- retro_spsc.c compiles clean as C (gcc, clang) and C++11 (-xc++),
  under -Wall/-Wextra/-Wpedantic/-Werror.
- Forced backends: C11 stdatomic, GCC __sync_*, volatile fallback
  all build clean.
- The ring-using portions of asio.c, extracted into a smoke harness
  mirroring the real calloc allocation pattern, compile and run
  clean under gcc -O2, g++ -std=c++11 -xc++, and clang.
- Sample stress test passes 10M tokens, 0 mismatches.
- Sample TSan run (halt_on_error=1): exit 0.

Verified on Windows
-------------------
- End-to-end Windows ASIO build with this patch applied. Audio
  plays correctly through the ASIO driver; no regressions observed
  in normal operation.

Performance
-----------
The motivating reason for the port is correctness (the cross-thread
race) and bounded worst-case latency (the audio-code property), not
throughput. But for the record, a head-to-head microbenchmark of the
same access pattern under both designs ("fifo_buffer_t guarded by
slock_t" vs "retro_spsc_t lock-free") shows:

Single-threaded steady-state, 8 KB chunks (asio.c realistic
payload), x86_64:
  fifo+slock: 338 ns/iter (write+read pair)
  retro_spsc: 283 ns/iter (write+read pair)
Both dominated by memcpy cost (~94 ns/copy for 8 KB). Per-op
overhead net of memcpy: ~150 ns vs ~96 ns.

Single-threaded steady-state, 16 B chunks (atomic-cost-dominated),
x86_64:
  fifo+slock: 44 ns/iter
  retro_spsc: 16 ns/iter
Ratio: 2.7x faster.

AArch64 (qemu-user, ratios only -- absolute numbers distorted by
emulation overhead):
  fifo+slock: 1151 ns/iter (8 KB), 637 ns/iter (16 B)
  retro_spsc: 891 ns/iter (8 KB), 410 ns/iter (16 B)

At realistic ASIO operation rates (a few hundred queue ops/sec
across both producer and consumer), the per-op savings amount to
~20 microseconds per second of CPU on x86 -- ~0.002% of one core.
The throughput win is real but practically negligible at audio
rates.
What is meaningful is the bounded-latency property: under lock contention, the locked design's worst case is unbounded (the non-holding thread blocks in the kernel mutex; if the holder is preempted, the wait is unbounded). Under the same contention, the lock-free SPSC's worst case is one acquire-load. This is the textbook reason pro-audio guidance (Apple CoreAudio docs, Steinberg ASIO docs) discourages mutexes in real-time audio threads, and it's the qualitatively important difference here.
dispgfx_widget_t::msg_queue is a fifo_buffer_t storing
disp_widget_msg_t* pointers. Pre-fix it had no dedicated lock,
yet the producer (gfx_widgets_msg_queue_push) is reachable from
threads other than the main thread, and the lock that *did*
exist (current_msgs_lock) protects current_msgs[] -- the
displayed-message deque -- not the FIFO.
Producer reachability:
1. runloop.c:runloop_msg_queue_push. Holds
   RUNLOOP_MSG_QUEUE_LOCK across the call. Reachable from 40+
   files; the threaded task system at
   libretro-common/queues/task_queue.c
runs a worker thread that pushes via this entry point.
2. runloop.c:runloop_task_msg_queue_push. Same lock.
3. gfx/video_driver.c:video_driver_frame. Acquires
RUNLOOP_MSG_QUEUE_LOCK to drain runloop_st->msg_queue,
RELEASES it, then calls gfx_widgets_msg_queue_push without
the lock.
Path 3 runs on the main thread; path 1 can run on any thread.
RUNLOOP_MSG_QUEUE_LOCK guards runloop_st->msg_queue, NOT
p_dispwidget->msg_queue -- the lock name reflects the runloop
struct it was named after. So path 3's fifo_write (no lock)
can race with path 1's fifo_write (lock held but irrelevant
to path 3).
The consumer (gfx_widgets_iterate's fifo_read) is called from
runloop.c:6048 inside RUNLOOP_MSG_QUEUE_LOCK and from
gfx_widgets.c:973 inside current_msgs_lock -- but neither lock
is held by path 3, so consumer fifo_reads can observe a
half-updated `end` cursor from path 3's fifo_writes.
Outcome of a race: torn `end` cursor publishes a bad pointer
in the disp_widget_msg_t* slots. The consumer dereferences
that as disp_widget_msg_t* in gfx_widgets_iterate, producing
use-after-free / segfault. On x86 TSO the race is masked most
of the time by hardware ordering; on Win-on-ARM / Apple
Silicon (AArch64) it surfaces more often.
Collision probability is low (~1 per 14 hours of normal use,
estimated from frame rate * task message rate * fifo_write
window) but the worst case is crash-class.
Fix
---
Add slock_t* msg_queue_lock to dispgfx_widget_t. Acquire it
around every fifo_write (in gfx_widgets_msg_queue_push), every
fifo_read (in gfx_widgets_iterate), and the avail-check that
gates each one. msg_queue_lock is the inner lock when both
it and current_msgs_lock are held; lock order is consistent
across all call sites.
The outer FIFO_*_AVAIL fast-path checks that used to wrap the
producer and consumer function bodies have been dropped.
Reading the FIFO cursors outside msg_queue_lock is itself a
data race against the locked writes -- TSan-detectable; benign
on x86 TSO but real on weak-memory hardware. The locked
re-check at the actual fifo_write / fifo_read site is the
correctness gate.
Removing the producer's outer gate also fixes a latent
behaviour bug: the update-existing branch (else clause in
gfx_widgets_msg_queue_push) does not write to the FIFO -- it
only mutates an already-tracked widget -- so suppressing it on
FIFO-full was wrong. Existing widgets now get updated
regardless of FIFO state.
The producer's spawn-new branch may now race-lose the avail
re-check after allocating msg_widget (concurrent producer
filled the last slot between unlocked allocation and locked
write). In that case the function frees the just-allocated
widget via gfx_widgets_msg_queue_free + free and returns. A
forward declaration of gfx_widgets_msg_queue_free is added near
the top of the file because that function is defined further
down.
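The resulting locking discipline can be sketched as below. This is a
hedged stand-in, not the gfx_widgets code itself: pthread_mutex_t
replaces slock_t, a bounded pointer array replaces fifo_buffer_t, and
the function names are illustrative. The key property mirrored is
that the avail-check and the queue mutation happen under one lock
hold, so no unlocked fast-path read of the cursors remains.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SLOTS 4

typedef struct
{
   pthread_mutex_t msg_queue_lock;  /* stand-in for the new lock */
   void           *slot[SLOTS];     /* stand-in for the FIFO     */
   size_t          count;
} widget_state_t;

/* Producer: the avail re-check and the write share one lock hold.
 * Returns 0 on FIFO-full, in which case the caller frees the
 * just-allocated message (mirroring the race-lose path above). */
static int msg_queue_push(widget_state_t *s, void *msg)
{
   int ok = 0;
   pthread_mutex_lock(&s->msg_queue_lock);
   if (s->count < SLOTS)            /* locked re-check */
   {
      s->slot[s->count++] = msg;
      ok = 1;
   }
   pthread_mutex_unlock(&s->msg_queue_lock);
   return ok;
}

/* Consumer: same lock around the avail-check and the read. */
static void *msg_queue_pop(widget_state_t *s)
{
   void  *msg = NULL;
   size_t i;
   pthread_mutex_lock(&s->msg_queue_lock);
   if (s->count > 0)
   {
      msg = s->slot[0];
      for (i = 1; i < s->count; i++) /* shift remaining entries */
         s->slot[i - 1] = s->slot[i];
      s->count--;
   }
   pthread_mutex_unlock(&s->msg_queue_lock);
   return msg;
}
```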
Regression test
---------------
samples/gfx/gfx_widgets_msg_queue_race/ mirrors the post-fix
locking discipline of gfx_widgets_msg_queue_push and
gfx_widgets_iterate's FIFO interaction in isolation. Two
producer threads + one consumer thread + 500K iterations per
producer; smoke check verifies single-threaded FIFO order.
Wired into .github/workflows/Linux-samples-gfx.yml as a TSan
lane (CC=clang, SANITIZER=thread, TSAN_OPTIONS=halt_on_error=1)
following the gfx_thumbnail_status_atomic_test convention.
TSan instruments every memory access on fifo_buffer_t::end and
::first, so a future regression that drops the lock around any
of the producer / consumer / avail-check sites will be flagged
as a data race and fail the workflow.
Verified locally
----------------
- samples/gfx/gfx_widgets_msg_queue_race builds and runs
clean as C (gcc -O0), C++11 (g++ -std=c++11 -xc++), and
Clang under -fsanitize=thread.
- TSan with halt_on_error=1: clean exit (post-fix lock
discipline causes no observed races across 500K * 2
iterations).
- Mutation test: removing the slock_lock/unlock around the
producer's fifo_write makes TSan fire on the first
collision, validating that the test is sensitive to the
exact regression we're guarding against.
- gfx/gfx_widgets.c syntax-checks clean with HAVE_THREADS +
HAVE_GFX_WIDGETS via gcc -fsyntax-only against the
libretro-common headers.