Stale-data crash family: per-slot token table + DL-range registry #172
Merged
Closes the Linux/glibc cross-scene stale-data crash family documented in
docs/bugs/linux_stale_scene_data_family_2026-05-11.md. The previous
session shipped a defensive NULL-file_head guard in efManagerMakeEffect
and documented four more variants for follow-up; this commit lands the
structural fix that eliminates the class at its root, validated by a
clean 15-min autonomous attract-loop run to SSB64_MAX_FRAMES=54000 with
zero crashes (previously crashed at ~90s / ~5s / ~9 min depending on
demo permutation).
The structural fix — per-slot RelocPointerTable (port/resource/
RelocPointerTable.cpp + .h):
Previously the token table maintained a single global generation that
incremented on every lbRelocInitSetup() call (every scene boundary).
All previously-minted tokens failed to decode after that bump, even tokens
for intern-buffer files (mainmotion, submotion, model, special1-4,
shieldpose) whose backing memory persists across scenes. Downstream
PORT_RESOLVE returned NULL; downstream consumers (gcSetupCustomDObjs,
ftMainSetStatus joint-init, gcAddMObjForDObj) didn't always NULL-check
the result and SIGSEGV'd reading parent->child or dobjdesc->id.
New model: each table slot owns its own 12-bit generation. Tokens
carry the slot's gen at registration. Resolve checks per-slot:
slots[idx].gen == token_gen. Invalidation is range-based via the new
portRelocInvalidateRange(base, size) — only slots whose pointer falls
in the recycled range are NULL'd and their generation bumped (with
free-list reuse). Tokens for intern-buffer files don't intersect the
scene-arena range, so they stay valid forever (until the file
unloads). Tokens for arena-allocated data go stale exactly when their
backing memory recycles. lbRelocInitSetup no longer calls the
wholesale portRelocResetPointerTable; the range-based path in
port_taskman_evict_arena_caches handles it surgically.
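The per-slot model described above can be sketched roughly as follows.
This is a minimal illustration, not the actual RelocPointerTable code:
the class name, method names, and the token bit layout (slot index in
the high bits, 12-bit generation in the low bits) are assumptions made
for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a per-slot generation token table.
// Names and bit layout are illustrative assumptions only.
class SlotTokenTable {
public:
    static constexpr uint32_t kGenBits = 12;
    static constexpr uint32_t kGenMask = (1u << kGenBits) - 1;

    // Store ptr in a free (or new) slot and mint a token that carries
    // the slot index and the slot's current generation.
    uint32_t Register(void* ptr) {
        uint32_t idx;
        if (!free_.empty()) {
            idx = free_.back();
            free_.pop_back();
        } else {
            idx = static_cast<uint32_t>(slots_.size());
            slots_.push_back(Slot{});
        }
        slots_[idx].ptr = ptr;
        return (idx << kGenBits) | slots_[idx].gen;
    }

    // A token resolves only while its generation matches the slot's.
    void* Resolve(uint32_t token) const {
        uint32_t idx = token >> kGenBits;
        uint32_t gen = token & kGenMask;
        if (idx >= slots_.size()) return nullptr;
        const Slot& s = slots_[idx];
        return (s.gen == gen) ? s.ptr : nullptr;
    }

    // NULL and gen-bump only the slots whose pointer lies in
    // [base, base + size); every other token stays valid.
    void InvalidateRange(const void* base, size_t size) {
        auto lo = reinterpret_cast<uintptr_t>(base);
        for (uint32_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[i];
            if (s.ptr == nullptr) continue;
            auto p = reinterpret_cast<uintptr_t>(s.ptr);
            if (p >= lo && p - lo < size) {
                s.ptr = nullptr;
                s.gen = (s.gen + 1) & kGenMask;  // wraps after 4096 bumps
                free_.push_back(i);              // slot becomes reusable
            }
        }
    }

private:
    struct Slot {
        void* ptr = nullptr;
        uint32_t gen = 0;
    };
    std::vector<Slot> slots_;
    std::vector<uint32_t> free_;
};
```

In this sketch a token minted for an arena pointer stops resolving once
InvalidateRange covers the arena, while a token for a pointer outside
that range keeps resolving. A reused slot hands out a token with a
different generation, so the old token still fails.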
DL-range registry (port/port_dl_ranges.{cpp,h} — new):
Defensive infrastructure tracking valid display-list memory ranges
(scene arena + reloc files). Hooked into libultraship's GFX walker
via the new game-agnostic callback API in fast/interpreter.h
(RegisterDLBoundsCheck, RegisterAddressClassifier). gfx_step and the
G_DL handler bounds-check `cmd` before deref/push; if a walker has
stepped past the end of a registered range (variant 5 — runaway DL
without a gsSPEndDisplayList terminator), the entire walk is torn
down via g_exec_stack.stop() instead of dereferencing into an
unmapped page. The classifier also feeds the SIGSEGV diag dump
(RelocPointerTable diagnostics + recent DL pushes + segment writes)
so any future stale-pointer crash that escapes the structural fix
prints actionable triage info.
Port wires this in via port_dl_ranges_init() called from PortInit
before any GFX activity. libultraship has zero compile-time symbol
dependencies on port_dl_* — all integration is via the callback API.
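As a rough illustration of the callback split described above, the
sketch below keeps the range registry on the port side and gives the
walker nothing but a function object to call before each command
fetch. All names here (DLRangeRegistry, BoundsCheckFn, WalkDisplayList,
the Cmd layout, and the kOpEnd value) are hypothetical; the real
signatures of RegisterDLBoundsCheck and RegisterAddressClassifier are
not shown in this PR text and are not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical 8-byte display-list command; opcode in the top byte of w0.
struct Cmd {
    uint32_t w0;
    uint32_t w1;
};
constexpr uint32_t kOpEnd = 0xDF;  // assumed end-of-list opcode for the sketch

// Port side: tracks memory ranges that may legally hold display lists.
class DLRangeRegistry {
public:
    void Add(const void* base, size_t size) {
        ranges_.push_back({reinterpret_cast<uintptr_t>(base), size});
    }
    // True if [p, p + len) lies entirely inside one registered range.
    bool Contains(const void* p, size_t len) const {
        auto a = reinterpret_cast<uintptr_t>(p);
        for (const Range& r : ranges_) {
            if (a >= r.base && a - r.base <= r.size &&
                len <= r.size - (a - r.base)) {
                return true;
            }
        }
        return false;
    }

private:
    struct Range {
        uintptr_t base;
        size_t size;
    };
    std::vector<Range> ranges_;
};

// Interpreter side: knows only the callback type, not the registry.
using BoundsCheckFn = std::function<bool(const void* cmd, size_t len)>;

enum class WalkResult { Finished, Aborted };

// Check each command before dereferencing it; abort the whole walk if
// the cursor has left every registered range (a runaway list).
WalkResult WalkDisplayList(const Cmd* cmd, const BoundsCheckFn& check,
                           size_t* executed) {
    *executed = 0;
    for (;;) {
        if (check && !check(cmd, sizeof(Cmd))) return WalkResult::Aborted;
        if ((cmd->w0 >> 24) == kOpEnd) return WalkResult::Finished;
        ++*executed;  // a real walker would dispatch the command here
        ++cmd;
    }
}
```

With a list that lacks a terminator, the walk in this sketch stops at
the first command that would start past the end of the registered
range, without reading it.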
SIGSEGV diag dump (port_watchdog.cpp + libultraship/CrashHandler.cpp):
Both crash handlers now call Fast::DumpDLDiag() with siginfo->si_addr
before showing the dialog, so the recent-DL-pushes + segment-writes
ring buffers land in the spdlog log file even if the dialog hangs.
The diag was already in interpreter.cpp but gated on ASan; this
commit makes it always-on so it survives in shipping builds where
the heap-layout-dependent stale-pointer crashes actually manifest.
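For readers unfamiliar with the "recent pushes" style of diagnostic, a
fixed-size ring buffer along these lines is one common way to keep the
last N events available for a crash dump. This is a generic sketch
under assumed names (RecentRing, Push, At), not the code behind
Fast::DumpDLDiag. A real crash handler would also need to restrict
itself to async-signal-safe operations when reading the buffer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity ring that retains the N most recent
// entries. No allocation happens after construction.
template <typename T, size_t N>
class RecentRing {
public:
    // Overwrite the oldest entry once the ring is full.
    void Push(const T& v) {
        buf_[head_ % N] = v;
        ++head_;
    }

    // Number of entries currently retained (at most N).
    size_t Size() const { return head_ < N ? head_ : N; }

    // At(0) is the oldest retained entry, At(Size() - 1) the newest.
    const T& At(size_t i) const {
        size_t start = head_ < N ? 0 : head_ % N;
        return buf_[(start + i) % N];
    }

private:
    std::array<T, N> buf_{};
    size_t head_ = 0;  // total number of pushes so far
};
```

A dump routine could walk At(0) through At(Size() - 1) and print each
recorded address next to the faulting si_addr.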
Submodule bumps:
- decomp: real game bug fix (gcDrawMObjForDObj G_ENDDL terminator),
sound consumer NULL-checks (gcSetupCustomDObjs, ftMainSetStatus),
file-scope gPortSceneHeap + arena range registration with the
DL-range registry. (See decomp commit message for details.)
- libultraship: callback API (Fast::RegisterDLBoundsCheck /
RegisterAddressClassifier / DumpDLDiag), universal correctness
fixes (gfx_pop_shader empty-check, mShaderStack per-frame reset),
always-on diag ring buffers. (See libultraship commit message.)
Validation:
- Clean SSB64_MAX_FRAMES=54000 exit (15 min game-time / 24 min real-
time autonomous attract loop), zero crashes through full
ControlDeck::ShutdownRaphnet → destruct fast3dwindow → destruct
ResourceManager shutdown.
- Previously documented variants:
* Variant 1 (Kirby Cutter / Pikachu Thunder Jolt / Shield effect):
addressed by the per-slot token table (effect_desc->file_head
tokens now resolve correctly across scene cycles for intern-
buffer files) + the consumer NULL-check at gcSetupCustomDObjs.
* Variant 2 (mnCharacters joint init, ftMainSetStatus): addressed
by the per-slot token table + the consumer NULL-check at
ftMainSetStatus.
* Variant 3 (stale MObjSub token warning at objanim.c:2869):
addressed by the per-slot token table.
* Variant 4 (segment-E DL pointer at gfx_step opcode fetch): the
new gfx_step bounds-check rejects unresolved N64-segment pointers
before deref.
* Variant 5 (DL walker running past arena end): addressed by the
gcDrawMObjForDObj G_ENDDL terminator + the gfx_step bounds-check
tearing down the walk on walked-past detection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes the Linux/glibc cross-scene stale-data crash family documented in `docs/bugs/linux_stale_scene_data_family_2026-05-11.md`. The previous session shipped a defensive NULL-`file_head` guard in `efManagerMakeEffect` and documented four more variants for follow-up; this PR lands the structural fix that eliminates the class at its root.
Validation: clean `SSB64_MAX_FRAMES=54000` exit (15 min game-time / 24 min real-time autonomous attract loop) through `ControlDeck::ShutdownRaphnet` → `destruct fast3dwindow` → `destruct ResourceManager`, zero crashes. Pre-fix, the attract loop crashed at ~90s / ~5s / ~9 min depending on demo permutation.
The structural fix: per-slot RelocPointerTable
Previously the token table maintained a single global generation that incremented on every `lbRelocInitSetup()` call (every scene boundary). All previously-minted tokens failed to decode after that bump, including tokens for intern-buffer files (mainmotion, submotion, model, special1-4, shieldpose) whose backing memory persists across scenes. Downstream `PORT_RESOLVE` returned NULL; consumers (`gcSetupCustomDObjs`, `ftMainSetStatus` joint-init, `gcAddMObjForDObj`) didn't always NULL-check and SIGSEGV'd reading `parent->child` or `dobjdesc->id`.
New model: each slot owns its own 12-bit generation. Resolve checks per-slot: `slots[idx].gen == token_gen`. Invalidation is range-based via the new `portRelocInvalidateRange(base, size)`: only slots whose pointer falls in the recycled range get NULL'd and gen-bumped (with free-list reuse). Tokens for intern-buffer files don't intersect the scene-arena range, so they stay valid until the file unloads. Tokens for arena-allocated data go stale exactly when the arena recycles. The wholesale `portRelocResetPointerTable` call is gone from `lbRelocInitSetup`.
Other pieces (in dependency order)
Variant-by-variant coverage
Pruned (no longer needed)
After the structural fix was validated, these were removed because they were workarounds for the (now-fixed) token-invalidation bug:
Test plan
🤖 Generated with Claude Code