[pull] master from libretro:master#978

Merged
pull[bot] merged 9 commits into Alexandre1er:master from libretro:master
Apr 29, 2026
The file had its own local __atomic_* / _Interlocked / volatile
shim guarding the cross-thread .status field, written before
retro_atomic.h existed.  Replace it with the portable API now
that the header is upstream (75ea1e4).

The shim was 18 lines, well-commented, correct, and operated on
plain `int*` via GCC's __atomic_* extension form.  The C11
stdatomic and C++11 std::atomic backends in retro_atomic.h cannot
operate on plain pointers -- their typedefs are _Atomic int and
std::atomic<int>, which forbid plain dereference -- so this port
changes the .status field type from `enum gfx_thumbnail_status`
to `retro_atomic_int_t` and routes every read and write through
retro_atomic_*_int.  Twenty-five access sites are converted; all
but three are single-threaded main-thread accesses that were
already non-atomic in shape, but now go through the API because
the field's atomic type requires it.  On x86 TSO the
acquire/release barriers compile out entirely; on weak-memory
ARM/PowerPC the cost is one extra ldar/stlr per cold-path
access (menu list updates, frame draws -- never per-sample).

The two cross-thread sites that drove the original shim are
preserved verbatim:
  - gfx_thumbnail_handle_upload publishes texture/width/height
    via release-store of AVAILABLE.
  - gfx_thumbnail_draw acquire-loads the status before reading
    the texture/width/height.
Both already had explanatory comments; those are kept.

Field-type ABI: retro_atomic_int_t is int-sized and int-aligned
on every supported backend, so gfx_thumbnail_t's layout is
unchanged.  Verified across 7 backends and 2 architectures
(x86_64, AArch64); also asserted at compile-time in the new
test (see below).

Intra-file initialisation of stack-local gfx_thumbnail_t in
gfx_thumbnail_draw uses retro_atomic_int_init() for the .status
field; C11 stdatomic prescribes atomic_init() for initialising
atomics, and C++11's std::atomic<int> cannot be copy-initialised,
so plain initialisation is not portable across backends.

CXX_BUILD compatibility
-----------------------

RetroArch's CXX_BUILD mode (Apple platforms, etc.) compiles
every .c file as C++.  Once .status is std::atomic<int>, two
C-valid patterns become C++-invalid:

  - memset(&t, 0, sizeof(t)) of a struct containing
    std::atomic<int> warns -Wclass-memaccess (the struct is no
    longer trivially-copyable).
  - `*dst = *src` for a struct embedding std::atomic<int>
    fails to compile (the deleted copy-assignment of
    std::atomic<int> propagates upward through aggregates).

The first case fires once: xmb.c:667's xmb_alloc_node() did
memset(&node->thumbnail_icon.icon, 0, ...) instead of calloc'ing
the whole node, intentionally to avoid zeroing a multi-KB
thumbnail_path_data substruct on the hot path.  Replaced with
gfx_thumbnail_init_blank() -- a new field-by-field zero-init
helper added to gfx_thumbnail.h.  The second case does not
fire: the only struct copy in the menu drivers
(ozone.c:4652 ozone_copy_node) operates on ozone_node_t, which
does not embed gfx_thumbnail_t (the gfx_thumbnail_t fields are
in ozone_handle_t, which is calloc'd, never struct-copied).

calloc-init of structs containing std::atomic<int>
(xmb_handle_t, ozone_handle_t, materialui_handle_t) is
technically UB per C++11 [atomics.types.generic] (the default
constructor leaves the atomic uninitialised; reads should be
preceded by atomic_init).  In practice every libstdc++/libc++/
MSVC STL treats a zero-bit pattern of std::atomic<int> as a
zero-valued atomic, and reads via atomic_load_explicit return 0.
Verified with all three menu drivers under CXX_BUILD.

Wrappers (GFX_THUMB_STATUS_STORE / GFX_THUMB_STATUS_LOAD) are
preserved as static-inline functions rather than function-like
macros.  GCC 13+ at -O3 emits a -Wstringop-overflow false-
positive on the inlined __atomic_* primitive when called from
gfx_thumbnail_request -- the optimiser cannot prove the
@thumbnail argument non-NULL across every `goto end` flow-graph
path.  The pre-port code did not trip this because the shim's
explicit (int*) cast on enum* defeated GCC's analysis;
retro_atomic_int_t is int directly on the GCC backend, so no
cast remains to do that.  The wrappers carry a targeted
_Pragma("GCC diagnostic ignored \"-Wstringop-overflow\"") that
suppresses the warning at the function boundary; codegen is
unchanged on every backend (verified via aarch64 objdump:
14 ldar / 14 stlr emitted, identical to pre-port).  Other
toolchains (clang, MSVC, older GCC) skip the suppression.

Regression test
---------------

samples/gfx/gfx_thumbnail_status_atomic/ exercises the
synchronisation pattern in isolation.  The test does not link
gfx_thumbnail.c; it mirrors the relevant types and macros and
runs an SPSC producer/consumer stress.

Verified failure mode: deliberately moving the producer's
generation-counter bump before the data writes (simulating a
removed release-store fence) makes the stress fail with
hundreds of mismatched reads on x86 and a "ThreadSanitizer:
data race in consumer_thread" report under TSan
(halt_on_error=1 -> exit 66).  Restoring the fence gives 1M
iterations with 0 mismatches and a clean TSan run.

Static checks asserted at compile time:
  - sizeof(gfx_thumbnail_t.status) == sizeof(int), so the
    field-type change preserves struct layout.
  - HAVE_RETRO_ATOMIC is set after the include.

Runtime checks:
  - All four GFX_THUMBNAIL_STATUS_* values round-trip through
    GFX_THUMB_STATUS_STORE / GFX_THUMB_STATUS_LOAD.
  - 1M-iteration SPSC handshake stress (HAVE_THREADS only):
    producer stages (texture, width, height) = (i, i, i),
    stores AVAILABLE via the macro under test, bumps a
    `produced` counter via release-store; consumer waits for
    the new generation, reads the four fields via the macro
    under test, verifies all four equal i, acks via a
    `consumed` counter.  The handshake prevents the producer
    from rewriting the data while the consumer is reading it;
    without it, torn reads would be expected even with correct
    barriers (a property of the test design, not the
    primitives).

The test mirrors gfx_thumbnail_init_blank() locally rather than
including gfx_thumbnail.h, since the test deliberately shadows
the header's type definitions to verify layout invariants.

CI
--

.github/workflows/Linux-samples-gfx.yml gets one new step that
builds the sample with SANITIZER=thread and runs it with
TSAN_OPTIONS=halt_on_error=1.  TSan is the right validator for
this test specifically: missing-barrier regressions show up as
data races even on x86 TSO, where the hardware would otherwise
hide them at runtime.  ASan/UBSan would catch nothing here
(no out-of-bounds access, no UB), so they are not added.

Verified locally
----------------

Compile-clean across:
  - gcc -O2 / -O3 -Wall -Werror, x86_64, all 5 affected files
    (gfx_thumbnail.c, xmb.c, ozone.c, materialui.c, runloop.c)
  - g++ -xc++ -std=c++11 -DCXX_BUILD, same 5 files
  - clang -O2, same 5 files
  - aarch64-linux-gnu-gcc + qemu-user, gfx_thumbnail.c
  - arm-linux-gnueabihf-gcc, gfx_thumbnail.c
  - clang -x objective-c / -x objective-c++ smoke (header-only
    inclusion check; mirrors cocoa_common.m's transitive path
    via menu_driver.h)
  - C89 -ansi -pedantic, gfx_thumbnail.c
  - Forced backends: C11 stdatomic, GCC __sync_*, volatile
    fallback
  - Clang static analyser
  - HAVE_THREADS=0 (test sample, static checks only)

Functional:
  - 1M-iter SPSC stress passes on x86_64 native
  - 1M-iter SPSC stress passes under qemu-aarch64
  - TSan halt_on_error=1 catches deliberately-injected
    missing-barrier bug -> exit 66
  - aarch64 objdump shows 14 ldar / 14 stlr in
    gfx_thumbnail.o, identical between pre-port and post-port
  - x86 codegen: plain mov for cross-thread sites (TSO-correct)
  - ARM32 codegen: 28 dmb instructions emitted

Behavior when no real atomic backend is available
-------------------------------------------------

retro_atomic.h's cascade always selects something; the last
tier is a volatile fallback (`#warning` unless suppressed,
`RETRO_ATOMIC_LOCK_FREE` undefined).  On that tier, expansions
are plain volatile loads and stores -- no memory barriers --
which is correct only on single-core targets or x86 TSO.  The
header's docstring documents this and the intended caller-side
gate is `#if defined(RETRO_ATOMIC_LOCK_FREE)`.

gfx_thumbnail does NOT add such a gate, matching pre-port
behavior: the original shim's `#else` branch was also a plain
volatile read/write with no barriers, and gfx_thumbnail never
checked whether barriers were available.  That tradeoff was
explicit in 91166d0 ("Note on correctness"), where the
volatile fallback's SMP PowerPC G5 weakness was acknowledged
and accepted: the failure mode is at worst a fade-in
animation starting one frame late, not corruption (status is
a single aligned int -- no torn reads possible -- and a
cosmetic mistime is recoverable on the next frame).

What changes with the port: the cascade gets richer.  Pre-
port, anything that wasn't the GCC __atomic_* or MSVC
Interlocked* branch landed on the volatile fallback.  Post-
port, four additional tiers sit between modern GCC and
volatile -- C11 stdatomic, C++11 std::atomic, Apple OSAtomic*,
GCC __sync_* -- which means every real RetroArch target with
threaded video falls into a tier that gives proper release/
acquire ordering.  The G5 case the historical note
acknowledged is now covered by GCC __sync_* (which existed in
GCC 4.1+ and was named as the proper fix if anyone wanted to
harden the G5 path later); G5 with modern devkitPPC would land
on __atomic_* and get real lwsync.

Net: on every platform where atomics are "unavailable" in the
sense of "no real backend matches", pre-port and post-port
emit identical plain-volatile codegen (verified by aarch64-
gcc -DRETRO_ATOMIC_FORCE_VOLATILE: gfx_thumbnail_handle_upload
emits plain `str` rather than `stlr`, same as pre-port's
volatile branch on the same target).  The port introduces no
new "atomics unavailable" footgun.

Not verified on real hardware:
  - MSVC ARM64 (relies on retro_atomic.h's MSVC backend, which
    landed in 75ea1e4 and is itself not yet hardware-validated)
  - Real PowerPC SMP (Wii U, Xbox 360)
  - End-to-end Apple build (macOS / iOS / tvOS)

The macOS branch of cpu_features_get() checked only the return value of
sysctlbyname(), which is 0 (success) whenever the key exists -- regardless
of whether the feature is actually supported. Apple's hw.optional.* keys
are present on every machine and report 0 when the feature is unavailable,
so the existence check incorrectly flagged unsupported features as present.

Most visibly, AVX-512 was reported on Intel Macs without AVX-512 support
(e.g. i9-9880H / Coffee Lake) because hw.optional.avx512f exists but
returns 0 there. SSE/AVX/AVX2/NEON happened to produce correct results
only because their keys always read 1 on hardware that supports them.

Pass a buffer to sysctlbyname() and check the actual value.

A lock-free single-producer / single-consumer byte queue built on
retro_atomic.h.  Wired into the standard libretro-common build
(Makefile.common, griffin) so it ships with every RetroArch
configuration, but no production callers yet -- this commit lands
the primitive only.  Subsequent commits can convert hand-rolled
SPSC patterns (audio/drivers/coreaudio.c's atomic_size_t ring,
audio/drivers/opensl.c's __sync_*-guarded buffered_blocks counter)
to use it.

Design
------

Two-cursor (head + tail) Lamport / Disruptor style, NOT the single
shared count.  Each cursor is written by exactly one thread and
acquire-loaded by the other.  The producer publishes new data with
a release-store on head; the consumer publishes consumed bytes
with a release-store on tail.  Neither thread issues atomic RMW
operations on the hot path -- only retro_atomic_load_acquire_size
and retro_atomic_store_release_size, which retro_atomic.h provides
across all 7 backends.

Capacity is power-of-2 (rounded up at init), masked with
capacity - 1.  Wraparound uses the standard two-memcpy pattern.
size_t modular arithmetic on (head - tail) is well-defined as long
as the difference stays within capacity, which the producer
enforces by checking write_avail before each push.

head and tail are placed on separate cache lines via explicit
char-array padding (RETRO_SPSC_CACHE_LINE, default 64).  Without
this, the producer's release-store on head would invalidate the
consumer's tail cache line and vice versa, halving throughput on
contended SMP.  Padding is a performance hint; correctness does
not depend on it.  The padding macros are guarded against
underflow if RETRO_SPSC_CACHE_LINE is misconfigured to be smaller
than the prefix fields.

Comparison with the existing fifo_queue_t:
  - fifo_queue holds an slock_t internally; it's MPMC-safe, but
    every push/pop costs a mutex.
  - retro_spsc is lock-free but limited to one producer / one
    consumer.  Use it in code paths where (a) producer/consumer
    counts are fixed at one each (audio render thread <-> audio
    submission, video thread <-> task queue, etc.) and (b) the
    fifo_queue lock contention measurably matters.

Comparison with audio/drivers/coreaudio.c's hand-rolled ring:
  - coreaudio uses a single shared atomic_size_t `filled` count
    rather than two cursors; producer fetch_add's it, consumer
    fetch_sub's it.  Correct, but optimises for "give me
    write_avail" being a single atomic load (audio drivers do
    that query every fill) at the cost of RMW on every push and
    pop.
  - retro_spsc inverts that trade-off: write_avail / read_avail
    each cost two atomic loads, but push and pop only do
    load_acquire + store_release.  Better for the general case
    where push/pop count vastly exceeds availability checks.
  - retro_spsc also pads head and tail; coreaudio doesn't (its
    single shared atomic doesn't suffer false sharing).

Build wiring
------------

Build requirements:
  - retro_atomic.h (header-only, always available)
  - <stddef.h>, <stdint.h>, <stdbool.h>, <stdlib.h>, <string.h>
  - That's it.  No HAVE_THREADS gate is needed for compilation:
    retro_spsc.c builds on every target retro_atomic.h supports,
    including pre-thread console targets (PSP, original Wii).
    On a single-threaded build it just sits there as dead code.

Wired in through:
  - Makefile.common adds retro_spsc.o next to fifo_queue.o under
    the unconditional libretro-common object list.
  - griffin.c #includes retro_spsc.c next to fifo_queue.c (Apple
    builds and console unity builds pick it up via griffin).

Test
----

libretro-common/samples/queues/retro_spsc_test/ exercises the
queue under producer/consumer concurrency:

  - Property checks (single-threaded): power-of-2 round-up at init,
    fresh-queue avails, push/pop content round-trip, peek does not
    advance, wraparound across the buffer end.
  - SPSC stress (HAVE_THREADS): producer pushes 10M sequential
    32-bit tokens through a 4 KB buffer; consumer reads and
    verifies each token matches the expected sequence.  The
    small buffer relative to the message volume forces heavy
    interleaving between producer and consumer, exercising the
    wraparound path on every iteration.

The stress harness has no per-iteration handshake -- both threads
spin on avail() in tight loops -- so the test is sensitive to
real synchronisation bugs.  Verified: deliberately moving the
producer's release-store on head BEFORE the buffer memcpys (so a
consumer observing the new head can read uninitialised bytes)
makes the stress fail with hundreds of mismatched tokens out of
10M, even on x86 TSO without TSan.  The same bug under TSan with
halt_on_error=1 exits 66 with "data race in retro_spsc_read_avail".

CI
--

.github/workflows/Linux-libretro-common-samples.yml:
  - retro_spsc_test added to RUN_TARGETS so the auto-discovery
    job builds and runs it under SANITIZER=address,undefined
    (the workflow default).
  - A new step builds and runs the test under Clang +
    SANITIZER=thread with TSAN_OPTIONS=halt_on_error=1, mirroring
    the retro_atomic_test TSan lane.  TSan is the strict
    validator for this primitive specifically: missing-barrier
    regressions show up as data races even on x86 TSO, where the
    hardware would otherwise hide them at runtime.

Verified locally:
  - gcc -O0/-O2/-O3 -Wall -Werror, x86_64
  - clang -O2, x86_64
  - g++ -xc++ -std=c++11 (CXX_BUILD-style)
  - aarch64-linux-gnu-gcc + qemu-user, 10M tokens clean,
    objdump shows real ldar/stlr at the cursor accesses
  - arm-linux-gnueabihf-gcc compile-clean
  - Forced backends: C11 stdatomic, GCC __sync_*, volatile
    fallback (volatile fallback correct only on x86 TSO, same
    contract as every other retro_atomic.h caller)
  - TSan halt_on_error=1: clean run, exit 0
  - TSan halt_on_error=1 on bug-injected build: exit 66
  - ASan/UBSan: clean
  - Bug-injected without sanitizer: 815 mismatches / 10M tokens

Not verified on real hardware:
  - MSVC ARM64 (inherits retro_atomic.h's untested-on-hardware
    state for that backend)
  - Real PowerPC SMP (Wii U, Xbox 360)

Apple deprecated <libkern/OSAtomic.h> in macOS 10.12 (2016) in
favour of C11 <stdatomic.h>.  platform_darwin.m had two callers
of OSAtomicCompareAndSwap32 left for the GCD-based filesystem
watcher's "did anything change since last check?" flag.  Replace
both with retro_atomic.h primitives.

Counter pattern, not CAS-flag
-----------------------------

The previous design used a 32-bit flag set/cleared via CAS:
  - setter: OSAtomicCompareAndSwap32(0, 1, &has_changes)
            from the GCD vnode event handler
  - reader: OSAtomicCompareAndSwap32(1, 0, &has_changes)
            from main thread; return value answers
            "did at least one event fire since the last call?"

The reader's CAS is load-bearing -- it has to read-and-clear
atomically or risk losing a notification fired between the read
and the clear.  retro_atomic.h does not expose a CAS primitive;
its docstring explicitly says "no compare-exchange, add only when
a real caller needs it."

Rather than add CAS to retro_atomic.h for this single caller,
restructure the field as a monotonic counter:

  - setter: retro_atomic_fetch_add_int(&event_count, 1)
  - reader: acquire-load event_count; compare to last_seen
            cursor; update last_seen; return whether they
            differ.  last_seen is main-thread-only state, no
            atomic primitive needed.

This is strictly more flexible than the flag.  The "did anything
change?" semantic is preserved exactly (now != last_seen <=> at
least one event), and the count is available if a future caller
wants to batch events or report event volume.  Multi-reader
fan-out becomes possible (each reader owns its own last_seen
cursor) without further changes.

Wraparound: 32-bit signed int at GCD vnode-event rates wraps in
centuries.  The (now != last_seen) comparison remains correct
across wraparound under modular int arithmetic.

Backwards / forwards compatibility
----------------------------------

darwin_watch_data_t is file-private to platform_darwin.m -- not
declared in any header, no external code touches it.  The public
path_change_data_t wrapper (frontend/frontend_driver.h) is
unchanged.  The frontend_ctx_driver vtable entry
(check_for_path_changes) keeps its signature and bool return.

Caller-observable semantic is identical: returns true exactly
once per "at least one event since last call", false otherwise,
no notifications lost.  The internal state difference (counter
vs flag) is invisible across the function boundary.

Performance
-----------

Reader path (called ~60 Hz from runloop.c when
video_shader_watch_files is enabled, Apple-only):
  - Pre:  one atomic CAS (1, 0).
  - Post: one acquire-load + two non-atomic ops.
  Net: strictly cheaper.  ldar vs cas/ldaxr-stlxr loop on AArch64.

Setter path (GCD vnode event handler, sub-millisecond/day total):
  - Pre:  one CAS (may no-op if flag is already set).
  - Post: one fetch_add (always stores).
  Net: one extra store on the cold path.  Below measurement
       noise at the actual event rate.

Cache footprint: +4 bytes per watcher (last_seen).  event_count
and last_seen sit on the same cache line; setter invalidates it
on every event, reader reloads next call.  At 60 Hz reader vs.
sparse setter, max ~3 microseconds/sec of cache-miss cost --
negligible, and the function is not on a hot path regardless.

Field type and initialisation
-----------------------------

Field changes from `volatile int32_t has_changes` to
`retro_atomic_int_t event_count`.  The retro_atomic_int_t is
int-sized and int-aligned across all 7 retro_atomic.h backends
(verified via the gfx_thumbnail port and retro_atomic_test),
so no struct-layout impact -- but the type itself is
backend-dependent: _Atomic int under C11, std::atomic<int> under
C++11, plain int under MSVC, volatile int on the fallback.
C11 stdatomic prescribes atomic_init for initialising atomics
(plain assignment is a seq_cst store on a possibly-uninitialised
object), so the calloc-then-assign-zero idiom is replaced with
retro_atomic_int_init.  The calloc remains; the explicit init is
belt-and-braces (a zero-bit pattern is a valid initial state for
every backend, but the API requires the init for type-system
reasons under C11).

Verified
--------

  - retro_atomic.h still has no CAS / exchange primitive after
    this commit.  Audit confirms platform_darwin.m was the last
    OSAtomic* caller in the tree (gfx/gfx_thumbnail.h's mention
    is documentation referencing the backend-cascade list, not
    a call site).
  - Smoke test mirroring the changed code blocks compiles and
    runs clean under: gcc -O2/-Wall/-Werror, clang -x objective-c,
    clang -x objective-c++ -std=c++11 (CXX_BUILD-style),
    forced C11 stdatomic backend.  Runtime: events fire ->
    reader returns true once, then false; subsequent events ->
    true once again.  Idempotent on no-events.
  - Full platform_darwin.m compilation requires Apple SDK
    (Mach + Foundation + GCD + IOKit) and is not exercised on
    Linux CI; verification on macOS hardware needed before merge.

The ASIO callback thread (which is the audio driver's real-time
thread) called fifo_read(ad->ring, ...) at five sites in
asio_deinterleave_to_buffers without holding any lock.  The main
thread called fifo_write(ad->ring, src, to_write) in asio_write
also without holding any lock.  The only slock_* calls in the
file were on cond_lock for the scond_wait/scond_signal pair
around backpressure, which protects the condvar's internal state
but does not serialise fifo_buffer_t access -- fifo_buffer_t is
itself a plain ring buffer with no internal synchronisation, and
the audit's earlier characterisation of fifo_buffer_t as "MPMC-
safe with internal slock" was wrong.

Both threads concurrently mutated ad->ring->first (callback via
fifo_read) and ad->ring->end (main via fifo_write).  fifo_*
internally does an "if-then-set" on first/end inside a wraparound
branch, so a racing read can observe a torn value -- not just
loose ordering, an actually-corrupt half-updated cursor.  On x86
TSO this is masked at the hardware level most of the time, which
is presumably why the bug has been latent without major reports;
on weak-memory targets (Win-on-ARM ASIO, becoming a real thing
with Snapdragon X) the symptoms would be audio glitches at
minimum and corrupted samples on wraparound.

retro_spsc_t is the right tool for this exact pattern:
single-producer (main thread, asio_write) / single-consumer
(callback thread, asio_deinterleave_to_buffers) byte queue with
acquire/release ordering on the head and tail cursors.  Lock-
free, so no priority-inversion concern from the callback path
(taking a mutex from a real-time audio callback is a textbook
anti-pattern in pro-audio code, which rules out the alternative
"add a slock around the FIFO" fix).  This also makes asio.c the
first production caller of retro_spsc_t since b894479 introduced
the primitive, validating the API against a real workload.

Mechanical changes
------------------

  - #include <queues/fifo_queue.h> -> <retro_spsc.h>.
  - fifo_buffer_t *ring -> retro_spsc_t ring (embedded by value)
    plus a sibling bool ring_initialized so cleanup paths can tell
    "init succeeded" from "never initialised".  retro_spsc_t
    doesn't carry that bit internally because most callers know
    their own lifecycle; asio.c's free path runs from multiple
    error sites so the flag is the simplest way.
  - fifo_new/fifo_free -> retro_spsc_init/retro_spsc_free, with
    !ring NULL-checks replaced by !ring_initialized.
  - FIFO_READ_AVAIL / FIFO_WRITE_AVAIL -> retro_spsc_read_avail /
    retro_spsc_write_avail.
  - fifo_read / fifo_write -> retro_spsc_read / retro_spsc_write.
    The writer at line 1306 now uses retro_spsc_write's actual
    return value rather than assuming it equals to_write; the cap
    via retro_spsc_write_avail makes them equal in practice but
    using the return value is defensive against contract changes.
  - fifo_clear -> retro_spsc_clear (new API, see below) at the
    persistent-instance reuse path.

ra_asio_t is calloc'd, so the embedded retro_spsc_t starts as a
zero-bit pattern; retro_spsc_init then does the proper init via
retro_atomic_size_init before any cross-thread access begins.
This avoids the C++11 [atomics.types.generic] technical UB
concern about uninitialised _Atomic / std::atomic members
(carried over from the gfx_thumbnail port's analysis).

retro_spsc_clear: new API
-------------------------

retro_spsc_clear(retro_spsc_t *q) resets head and tail to zero
without reallocating the buffer.  It's documented as safe only
when both producer and consumer are quiesced -- the same lifetime
constraint as retro_spsc_init / retro_spsc_free.  Implementation
is two retro_atomic_size_init calls; the init form is necessary
because plain assignment to a retro_atomic_size_t is illegal
under the C11 stdatomic backend.

Without this addition, the alternatives for "drop stale data on
session restart" were:
  (a) repeated retro_spsc_read into a scratch buffer until empty
      (wasteful memcpys, ugly);
  (b) retro_spsc_free + retro_spsc_init (reallocates the heap
      buffer, churn).

The new clear is one line at the call site and atomic with
respect to the calling thread.  The quiescence requirement is
documented in the header alongside the function.

asio.c had two pre-port fifo_clear sites:

  1. ra_asio_init_via_persistent (~line 1024): runs before
     g_asio = ad, so the ASIO callback isn't yet seeing this
     instance.  Single-threaded; retro_spsc_clear is safe here.
     This site is converted directly.

  2. ra_asio_free / parking path (~line 1380 pre-port): ran
     AFTER ASIO_CALL_STOP and AFTER g_asio = NULL.  The intent
     was "drop stale audio before parking", but ASIO_CALL_STOP is
     not a synchronous join -- ASIO4ALL specifically does not
     terminate its audio thread before returning -- so the old
     fifo_clear at this site could race with a stray callback's
     fifo_read.  This was a pre-existing latent bug.

     Resolution: drop the clear at the parking site entirely.
     The next ra_asio_init_via_persistent will re-clear after the
     callback is provably idle (g_asio == NULL means the callback
     short-circuits at line 852's null check, so it won't touch
     the ring).  Documented with a comment at the parking site.

Test coverage
-------------

retro_spsc_clear gets unit-test coverage in the existing sample at
libretro-common/samples/queues/retro_spsc_test/: write 50 bytes,
verify read_avail == 50, retro_spsc_clear, verify read_avail == 0
and write_avail == capacity, then verify the queue is reusable
after clear (write 16, read 16, content matches).  Runs clean
under SANITIZER=thread with halt_on_error=1.

The SPSC stress test (10M-token producer/consumer race through a
4 KB buffer) continues to pass under TSan with no regression
from the new function.

Verified locally
----------------

  - retro_spsc.c compiles clean as C (gcc, clang) and C++11
    (-xc++), under Wall/Wextra/Wpedantic/Werror.
  - Forced backends: C11 stdatomic, GCC __sync_*, volatile
    fallback all build clean.
  - The ring-using portions of asio.c, extracted into a smoke
    harness mirroring the real calloc allocation pattern,
    compile and run clean under gcc -O2, g++ -std=c++11 -xc++,
    and clang.
  - Sample stress test passes 10M tokens, 0 mismatches.
  - Sample TSan run (halt_on_error=1): exit 0.

Verified on Windows
-------------------

  - End-to-end Windows ASIO build with this patch applied.
    Audio plays correctly through the ASIO driver, no
    regressions observed in normal operation.

Performance
-----------

The motivating reason for the port is correctness (the cross-
thread race) and bounded worst-case latency (the audio-code
property), not throughput.  But for the record, a head-to-head
microbenchmark of the same access pattern under both designs
("fifo_buffer_t guarded by slock_t" vs "retro_spsc_t lock-free")
shows:

  Single-threaded steady-state, 8 KB chunks (asio.c realistic
  payload), x86_64:
    fifo+slock:  338 ns/iter (write+read pair)
    retro_spsc:  283 ns/iter (write+read pair)
    Both dominated by memcpy cost (~94 ns/copy for 8 KB).
    Per-op overhead net of memcpy:  ~150 ns vs ~96 ns.

  Single-threaded steady-state, 16 B chunks (atomic-cost-
  dominated), x86_64:
    fifo+slock:  44 ns/iter
    retro_spsc:  16 ns/iter
    Ratio:  2.7x faster.

  AArch64 (qemu-user, ratios only -- absolute numbers distorted
  by emulation overhead):
    fifo+slock:  1151 ns/iter (8 KB), 637 ns/iter (16 B)
    retro_spsc:   891 ns/iter (8 KB), 410 ns/iter (16 B)

At realistic ASIO operation rates (a few hundred queue ops/sec
across both producer and consumer), the per-op savings amount
to ~20 microseconds per second of CPU on x86 -- ~0.002% of one
core.  The throughput win is real but practically negligible at
audio rates.

What is meaningful is the bounded-latency property: under lock
contention, the locked design's worst case is unbounded (the
non-holding thread blocks in the kernel mutex; if the holder is
preempted, the wait is unbounded).  Under the same contention,
the lock-free SPSC's worst case is one acquire-load.  This is
the textbook reason pro-audio guidance (Apple CoreAudio docs,
Steinberg ASIO docs) discourages mutexes in real-time audio
threads, and it's the qualitatively important difference here.

dispgfx_widget_t::msg_queue is a fifo_buffer_t storing
disp_widget_msg_t* pointers.  Pre-fix it had no dedicated lock,
yet the producer (gfx_widgets_msg_queue_push) is reachable from
threads other than the main thread, and the lock that *did*
exist (current_msgs_lock) protects current_msgs[] -- the
displayed-message deque -- not the FIFO.

Producer reachability:

  1. runloop.c:runloop_msg_queue_push.  Holds
     RUNLOOP_MSG_QUEUE_LOCK across the call.  Reachable from 40+ files; the
     threaded task system at libretro-common/queues/task_queue.c
     runs a worker thread that pushes via this entry point.
  2. runloop.c:runloop_task_msg_queue_push.  Same lock.
  3. gfx/video_driver.c:video_driver_frame.  Acquires
     RUNLOOP_MSG_QUEUE_LOCK to drain runloop_st->msg_queue,
     RELEASES it, then calls gfx_widgets_msg_queue_push without
     the lock.

Path 3 runs on the main thread; path 1 can run on any thread.
RUNLOOP_MSG_QUEUE_LOCK guards runloop_st->msg_queue, NOT
p_dispwidget->msg_queue -- the lock name reflects the runloop
struct it was named after.  So path 3's fifo_write (no lock)
can race with path 1's fifo_write (lock held but irrelevant
to path 3).

The consumer (gfx_widgets_iterate's fifo_read) is called from
runloop.c:6048 inside RUNLOOP_MSG_QUEUE_LOCK and from
gfx_widgets.c:973 inside current_msgs_lock -- but neither lock
is held by path 3, so consumer fifo_reads can observe a half-
updated `end` cursor from path 3's fifo_writes.

Outcome of a race: torn `end` cursor publishes a bad pointer
in the disp_widget_msg_t* slots.  The consumer dereferences
that as disp_widget_msg_t* in gfx_widgets_iterate, producing
use-after-free / segfault.  On x86 TSO the race is masked most
of the time by hardware ordering; on Win-on-ARM / Apple
Silicon (AArch64) it surfaces more often.

Collision probability is low (~1 per 14 hours of normal use,
estimated from frame rate * task message rate * fifo_write
window) but the worst case is crash-class.

Fix
---

Add slock_t* msg_queue_lock to dispgfx_widget_t.  Acquire it
around every fifo_write (in gfx_widgets_msg_queue_push), every
fifo_read (in gfx_widgets_iterate), and the avail-check that
gates each one.  msg_queue_lock is the inner lock when both
it and current_msgs_lock are held; lock order is consistent
across all call sites.

The outer FIFO_*_AVAIL fast-path checks that used to wrap the
producer and consumer function bodies have been dropped.
Reading the FIFO cursors outside msg_queue_lock is itself a
data race against the locked writes -- TSan-detectable; benign
on x86 TSO but real on weak-memory hardware.  The locked
re-check at the actual fifo_write / fifo_read site is the
correctness gate.

Removing the producer's outer gate also fixes a latent
behaviour bug: the update-existing branch (else clause in
gfx_widgets_msg_queue_push) does not write to the FIFO -- it
only mutates an already-tracked widget -- so suppressing it on
FIFO-full was wrong.  Existing widgets now get updated
regardless of FIFO state.

The producer's spawn-new branch may now lose the race at the
locked avail re-check after allocating msg_widget (a concurrent
producer filled the last slot between the unlocked allocation
and the locked write).  In that case the function frees the
just-allocated widget via gfx_widgets_msg_queue_free + free
and returns.  A
forward declaration of gfx_widgets_msg_queue_free is added near
the top of the file because that function is defined further
down.
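The resulting producer discipline, sketched with pthread_mutex
and a toy FIFO standing in for slock_t and fifo_buffer_t (all
names in this sketch are illustrative, not the real
gfx_widgets.c symbols):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins: the point is the lock discipline, not the
 * data structures. */
#define SLOTS 4
typedef struct { int dummy; } toy_msg_t;
typedef struct
{
    toy_msg_t *slot[SLOTS];
    size_t first, end;
} toy_fifo_t;

static toy_fifo_t      msg_queue;
static pthread_mutex_t msg_queue_lock = PTHREAD_MUTEX_INITIALIZER;

static size_t write_avail(const toy_fifo_t *f)
{
    return SLOTS - (f->end - f->first);
}

static bool msg_queue_push_new(void)
{
    /* Allocation stays outside the lock... */
    toy_msg_t *msg_widget = calloc(1, sizeof(*msg_widget));
    bool ok = false;
    if (!msg_widget)
        return false;
    pthread_mutex_lock(&msg_queue_lock);
    if (write_avail(&msg_queue) > 0)   /* ...locked re-check */
    {
        msg_queue.slot[msg_queue.end++ % SLOTS] = msg_widget;
        ok = true;
    }
    pthread_mutex_unlock(&msg_queue_lock);
    if (!ok)
        free(msg_widget); /* race lost: last slot was taken */
    return ok;
}
```

The consumer mirrors this: its avail check and fifo_read sit
under the same lock, so no half-updated cursor is ever visible.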

Regression test
---------------

samples/gfx/gfx_widgets_msg_queue_race/ mirrors the post-fix
locking discipline of gfx_widgets_msg_queue_push and
gfx_widgets_iterate's FIFO interaction in isolation.  Two
producer threads + one consumer thread, 500K iterations per
producer; a separate single-threaded smoke check verifies FIFO
ordering.

Wired into .github/workflows/Linux-samples-gfx.yml as a TSan
lane (CC=clang, SANITIZER=thread, TSAN_OPTIONS=halt_on_error=1)
following the gfx_thumbnail_status_atomic_test convention.
TSan instruments every memory access on fifo_buffer_t::end and
::first, so a future regression that drops the lock around any
of the producer / consumer / avail-check sites will be flagged
as a data race and fail the workflow.
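Reduced to its essentials, the sample's shape is roughly the
following (iteration count lowered from 500K, integer counters
standing in for disp_widget_msg_t pointers; all names here are
illustrative):

```c
#include <pthread.h>
#include <stddef.h>

#define ITERS 5000 /* 500K per producer in the real sample */
#define SLOTS 64

static int             fifo[SLOTS];
static size_t          first, end;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; )
    {
        pthread_mutex_lock(&lock);
        if (end - first < SLOTS) /* locked avail re-check */
        {
            fifo[end++ % SLOTS] = 1;
            i++;
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static long run_consumer_until(long want)
{
    long seen = 0;
    while (seen < want)
    {
        pthread_mutex_lock(&lock);
        if (end != first)
            seen += fifo[first++ % SLOTS];
        pthread_mutex_unlock(&lock);
    }
    return seen;
}

/* Two producers race against one consumer; every pushed item
 * must be seen exactly once. */
static long run_race_test(void)
{
    pthread_t p1, p2;
    pthread_create(&p1, NULL, producer, NULL);
    pthread_create(&p2, NULL, producer, NULL);
    long seen = run_consumer_until(2L * ITERS);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return seen;
}
```

Under TSan, deleting either lock/unlock pair makes the race on
`end`/`first` visible again, which is the mutation sensitivity
described above.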

Verified locally
----------------

  - samples/gfx/gfx_widgets_msg_queue_race builds and runs
    clean as C (gcc -O0), C++11 (g++ -std=c++11 -xc++), and
    Clang under -fsanitize=thread.
  - TSan with halt_on_error=1: clean exit (post-fix lock
    discipline causes no observed races across 500K * 2
    iterations).
  - Mutation test: removing the slock_lock/unlock around the
    producer's fifo_write makes TSan fire on the first
    collision, validating that the test is sensitive to the
    exact regression we're guarding against.
  - gfx/gfx_widgets.c syntax-checks clean with HAVE_THREADS +
    HAVE_GFX_WIDGETS via gcc -fsyntax-only against the
    libretro-common headers.
@pull pull Bot merged commit 4e26c88 into Alexandre1er:master Apr 29, 2026