Stale-data crash family: per-slot token table + DL-range registry #172

Merged: JRickey merged 1 commit into main from fix/stale-data-family on May 13, 2026

JRickey commented on May 13, 2026

Summary

Closes the Linux/glibc cross-scene stale-data crash family documented in docs/bugs/linux_stale_scene_data_family_2026-05-11.md. The previous session shipped a defensive NULL-file_head guard in efManagerMakeEffect and documented four more variants for follow-up; this PR lands the structural fix that eliminates the class at its root.

Validation: clean SSB64_MAX_FRAMES=54000 exit (15 min game-time / 24 min real-time autonomous attract loop) through ControlDeck::ShutdownRaphnet → destruct fast3dwindow → destruct ResourceManager, zero crashes. Pre-fix, the attract loop crashed at ~90s / ~5s / ~9 min depending on demo permutation.

The structural fix — per-slot RelocPointerTable

Previously the token table maintained a single global generation that incremented on every lbRelocInitSetup() (every scene boundary). All previously-minted tokens failed decode after that bump — including tokens for intern-buffer files (mainmotion, submotion, model, special1-4, shieldpose) whose backing memory persists across scenes. Downstream PORT_RESOLVE returned NULL; consumers (gcSetupCustomDObjs, ftMainSetStatus joint-init, gcAddMObjForDObj) didn't always NULL-check and SIGSEGV'd reading parent->child or dobjdesc->id.

New model: each slot owns its own 12-bit generation. Resolve checks per-slot: `slots[idx].gen == token_gen`. Invalidation is range-based via the new `portRelocInvalidateRange(base, size)` — only slots whose pointer falls in the recycled range get NULL'd + gen-bumped (with free-list reuse). Tokens for intern-buffer files don't intersect the scene-arena range so they stay valid until the file unloads. Tokens for arena-allocated data go stale exactly when the arena recycles. The wholesale `portRelocResetPointerTable` call is gone from `lbRelocInitSetup`.
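To make the per-slot model concrete, here is a minimal sketch of a token table with per-slot generations and range-based invalidation. It is not the code in `RelocPointerTable.cpp`: the class name `SlotTokenTable`, the 20-bit index width, the 32-bit token layout, and the choice to start generations at 1 are all assumptions made for illustration. Only the 12-bit generation, the `slots[idx].gen == token_gen` check, and the range-based invalidation with free-list reuse come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed token layout: [ 12-bit generation | 20-bit slot index ].
constexpr uint32_t kIndexBits = 20;
constexpr uint32_t kIndexMask = (1u << kIndexBits) - 1;
constexpr uint32_t kGenMask   = 0xFFFu;  // 12-bit generation

struct Slot {
    void*    ptr = nullptr;
    uint32_t gen = 1;  // assumption: start at 1 so a zero token never resolves
};

class SlotTokenTable {
public:
    // Mint a token that carries the slot's current generation.
    uint32_t Register(void* p) {
        assert(p != nullptr);
        uint32_t idx;
        if (!free_.empty()) {
            idx = free_.back();  // reuse a slot freed by invalidation
            free_.pop_back();
        } else {
            idx = static_cast<uint32_t>(slots_.size());
            slots_.emplace_back();
        }
        assert(idx <= kIndexMask);
        slots_[idx].ptr = p;
        return (slots_[idx].gen << kIndexBits) | idx;
    }

    // Per-slot check: the token is live only if its generation matches.
    void* Resolve(uint32_t token) const {
        uint32_t idx = token & kIndexMask;
        uint32_t gen = (token >> kIndexBits) & kGenMask;
        if (idx >= slots_.size()) return nullptr;
        const Slot& s = slots_[idx];
        return (s.gen == gen) ? s.ptr : nullptr;
    }

    // NULL and gen-bump only the slots whose pointer lies in [base, base+size).
    void InvalidateRange(const void* base, size_t size) {
        auto lo = reinterpret_cast<uintptr_t>(base);
        for (uint32_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[i];
            if (s.ptr == nullptr) continue;  // already free
            auto p = reinterpret_cast<uintptr_t>(s.ptr);
            if (p >= lo && p - lo < size) {
                s.ptr = nullptr;
                s.gen = (s.gen + 1) & kGenMask;
                if (s.gen == 0) s.gen = 1;   // skip 0 on wraparound
                free_.push_back(i);
            }
        }
    }

private:
    std::vector<Slot>     slots_;
    std::vector<uint32_t> free_;
};
```

In this sketch, a token minted for a pointer outside the invalidated range keeps resolving after the range is recycled, while a token for a pointer inside it returns NULL even after its slot has been handed to a new registration.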

Other pieces (in dependency order)

  • libultraship (`JRickey/libultraship#fix/stale-data-family`) — game-agnostic callback API (`RegisterDLBoundsCheck`, `RegisterAddressClassifier`, `DumpDLDiag`), universal correctness fixes (`gfx_pop_shader` empty-check, `mShaderStack` per-frame reset), always-on diag ring buffers.
  • decomp (`JRickey/ssb-decomp-re#fix/stale-data-family`) — real game bug fix (`gcDrawMObjForDObj` G_ENDDL terminator on `branch_dl`), sound consumer NULL-checks at `gcSetupCustomDObjs` and `ftMainSetStatus`, file-scope `gPortSceneHeap` + arena-range registration. All under `#ifdef PORT`.
  • port/port_dl_ranges.{cpp,h} (new) — defensive DL-range registry; libultraship's GFX walker consults it via the callback API. The bounds-check tears down a walk that's stepped past the end of any registered range (variant 5 — runaway DL without `gsSPEndDisplayList`), preventing reads into unmapped pages.
  • port/port_watchdog.cpp + libultraship/CrashHandler.cpp — call `Fast::DumpDLDiag(siginfo->si_addr)` on SIGSEGV so recent-DL-pushes + segment-writes hit the spdlog log before the crash dialog blocks. Survives shipping builds.
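The registry-plus-callback arrangement in the list above can be sketched roughly as follows. Everything here is hypothetical: the `sketch` namespace, `DLRangeRegistry`, `SetBoundsCheck`, the callback signature, and the toy `Walk` loop are illustrative stand-ins, not the signatures of `RegisterDLBoundsCheck` or the real `gfx_step`, which this PR description does not show. The sketch also assumes a walk counts as "walked past" when it started inside a registered range and its current command pointer has left that range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

struct Range { uintptr_t base; size_t size; };

// Port side: tracks memory ranges that may legitimately hold display lists.
class DLRangeRegistry {
public:
    void Add(const void* base, size_t size) {
        ranges_.push_back({reinterpret_cast<uintptr_t>(base), size});
    }
    // Returns the registered range containing p, or nullptr if none does.
    const Range* Find(const void* p) const {
        auto a = reinterpret_cast<uintptr_t>(p);
        for (const Range& r : ranges_)
            if (a >= r.base && a - r.base < r.size) return &r;
        return nullptr;
    }
    // True when a walk that began inside a registered range has left it.
    bool WalkedPast(const void* dl_start, const void* cmd) const {
        const Range* r = Find(dl_start);
        if (r == nullptr) return false;  // unknown origin: no opinion
        auto a = reinterpret_cast<uintptr_t>(cmd);
        return !(a >= r->base && a - r->base < r->size);
    }
private:
    std::vector<Range> ranges_;
};

// Walker side: sees only a function pointer, never the registry type.
using BoundsCheckFn = bool (*)(const void* dl_start, const void* cmd);
static BoundsCheckFn g_bounds_check = nullptr;

static void SetBoundsCheck(BoundsCheckFn fn) { g_bounds_check = fn; }

// Toy walker over 64-bit commands. Returns how many commands it consumed
// before hitting the end opcode, the bounds check, or the hard cap.
static size_t Walk(const uint64_t* dl, uint64_t end_opcode, size_t hard_cap) {
    size_t n = 0;
    for (const uint64_t* cmd = dl; n < hard_cap; ++cmd, ++n) {
        // Check before dereferencing, so a runaway list is never read.
        if (g_bounds_check && g_bounds_check(dl, cmd)) break;
        if (*cmd == end_opcode) break;
    }
    return n;
}

}  // namespace sketch
```

The point of the function-pointer indirection is the one stated above: the walker can be built without any compile-time knowledge of the port's registry, and a display list that lacks its terminator stops at the range boundary instead of reading past it.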

Variant-by-variant coverage

| Variant | Mechanism | Resolution |
|---|---|---|
| 1 (Kirby Cutter / Pikachu Thunder Jolt / Shield effect) | `*effect_desc->file_head` stale; `addr + offset` resolves to garbage DObjDesc with stale tokens | Per-slot token table keeps intern-buffer file tokens valid + `gcSetupCustomDObjs` NULL-check on resolved DL |
| 2 (mnCharacters fighter joint init, `ftMainSetStatus+0xbdb`) | Same stale-token chain through `attr->commonparts_container` | Per-slot token table + `ftMainSetStatus` NULL-check on `dobjdesc` |
| 3 (`RelocPointerTable: invalid/stale token` warning at objanim.c:2869) | Stale token in MObjSub array | Per-slot token table eliminates the artificial gen bump that created the staleness |
| 4 (segment-E DL pointer at gfx_step opcode fetch) | Walker dereferences a `gsSPDisplayList(0xE000000+offset)` whose segment was unbound | gfx_step bounds-check rejects pointers ≤ 0x0FFFFFFF |
| 5 (DL walker running past arena end into unmapped page) | `gcDrawMObjForDObj` builds a per-MObj setup buffer in the graphics heap without a final `gsSPEndDisplayList` after the loop closes | Added the missing terminator + gfx_step bounds-check tears down the walk via `g_exec_stack.stop()` on walked-past detection |
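The variant 4 check works because an N64 segmented address keeps its segment id in bits 24-27 and a 24-bit offset below that, so any such value is at most 0x0FFFFFFF. A small sketch of that test follows; the helper names are hypothetical, and it assumes (as the check above implies) that no valid host pointer ever falls below 256 MiB.

```cpp
#include <cassert>
#include <cstdint>

// Largest value an unresolved N64 segmented address can take.
constexpr uintptr_t kMaxSegmentedAddr = 0x0FFFFFFFu;

// Hypothetical helper: true if the value still looks like a raw
// segment:offset pair rather than a resolved host pointer.
constexpr bool LooksLikeUnresolvedSegment(uintptr_t addr) {
    return addr <= kMaxSegmentedAddr;
}

// Hypothetical helper: extract the segment id (0x0 to 0xF) for diagnostics.
constexpr unsigned SegmentOf(uintptr_t addr) {
    return static_cast<unsigned>((addr >> 24) & 0xF);
}

static_assert(LooksLikeUnresolvedSegment(0x0E000120), "segment E, offset 0x120");
static_assert(!LooksLikeUnresolvedSegment(0x10000000), "above the segment window");
```

A walker would run such a test on the command pointer before the opcode fetch and skip or abort the list when it returns true.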

Pruned (no longer needed)

After the structural fix was validated, these were removed because they were workarounds for the (now-fixed) token-invalidation bug:

  • `port_ftmanager_clear_file_globals` (redundant — per-slot fix keeps file globals' tokens valid)
  • GObj generation-stamping (instrumented to confirm zero stale-eject events; decomp's existing eject paths are already comprehensive for GObjs)
  • Over-engineered `id`-bounds and `i`-bound checks (only mattered when stale tokens corrupted iteration data)

Test plan

  • Worktree build with all three submodule branches: clean, no warnings beyond pre-existing
  • Autonomous attract loop with `SSB64_MAX_FRAMES=54000`: clean exit, zero crashes
  • Variant 1-5 behavior change verified by guard-fire telemetry (RelocPointerTable invalidations log per scene transition; gfx_step walked-past guard caught and recovered without escalation)
  • Manual smoke: launch on D3D11/Metal/Windows to confirm no regression — the per-slot token semantics are platform-agnostic, but worth checking
  • CI build matrix once the three submodule PRs are reviewable

🤖 Generated with Claude Code

Closes the Linux/glibc cross-scene stale-data crash family documented in
docs/bugs/linux_stale_scene_data_family_2026-05-11.md. The previous
session shipped a defensive NULL-file_head guard in efManagerMakeEffect
and documented four more variants for follow-up; this commit lands the
structural fix that eliminates the class at its root, validated by a
clean 15-min autonomous attract-loop run to SSB64_MAX_FRAMES=54000 with
zero crashes (previously crashed at ~90s / ~5s / ~9 min depending on
demo permutation).

The structural fix — per-slot RelocPointerTable (port/resource/
RelocPointerTable.cpp + .h):

  Previously the token table maintained a single global generation that
  incremented on every lbRelocInitSetup() call (every scene boundary).
  All previously-minted tokens failed decode after that bump, even tokens
  for intern-buffer files (mainmotion, submotion, model, special1-4,
  shieldpose) whose backing memory persists across scenes. Downstream
  PORT_RESOLVE returned NULL; downstream consumers (gcSetupCustomDObjs,
  ftMainSetStatus joint-init, gcAddMObjForDObj) didn't always NULL-check
  the result and SIGSEGV'd reading parent->child or dobjdesc->id.

  New model: each table slot owns its own 12-bit generation. Tokens
  carry the slot's gen at registration. Resolve checks per-slot:
  slots[idx].gen == token_gen. Invalidation is range-based via the new
  portRelocInvalidateRange(base, size) — only slots whose pointer falls
  in the recycled range are NULL'd and their generation bumped (with
  free-list reuse). Tokens for intern-buffer files don't intersect the
  scene-arena range, so they stay valid until the file unloads. Tokens
  for arena-allocated data go stale exactly when their backing memory
  recycles. lbRelocInitSetup no longer calls the
  wholesale portRelocResetPointerTable; the range-based path in
  port_taskman_evict_arena_caches handles it surgically.

DL-range registry (port/port_dl_ranges.{cpp,h} — new):

  Defensive infrastructure tracking valid display-list memory ranges
  (scene arena + reloc files). Hooked into libultraship's GFX walker
  via the new game-agnostic callback API in fast/interpreter.h
  (RegisterDLBoundsCheck, RegisterAddressClassifier). gfx_step and the
  G_DL handler bounds-check `cmd` before deref/push; if a walker has
  stepped past the end of a registered range (variant 5 — runaway DL
  without a gsSPEndDisplayList terminator), the entire walk is torn
  down via g_exec_stack.stop() instead of dereferencing into an
  unmapped page. The classifier also feeds the SIGSEGV diag dump
  (RelocPointerTable diagnostics + recent DL pushes + segment writes)
  so any future stale-pointer crash that escapes the structural fix
  prints actionable triage info.

Port wires this in via port_dl_ranges_init() called from PortInit
before any GFX activity. libultraship has zero compile-time symbol
dependencies on port_dl_* — all integration is via the callback API.

SIGSEGV diag dump (port_watchdog.cpp + libultraship/CrashHandler.cpp):

  Both crash handlers now call Fast::DumpDLDiag() with siginfo->si_addr
  before showing the dialog, so the recent-DL-pushes + segment-writes
  ring buffers land in the spdlog log file even if the dialog hangs.
  The diag was already in interpreter.cpp but gated on ASan; this
  commit makes it always-on so it survives in shipping builds where
  the heap-layout-dependent stale-pointer crashes actually manifest.
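For readers unfamiliar with the pattern, an always-on diagnostic ring buffer of this kind can be sketched as below. This is an illustration only: `DiagRing`, `DLPush`, and `DumpRecentPushes` are hypothetical names, the record fields are assumed, and `std::fprintf` to a `FILE*` stands in for the spdlog sink the real handlers write to. Note that `fprintf` is not async-signal-safe, so a production crash handler needs more care than this sketch takes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Fixed-capacity ring: pushing never allocates, so it is cheap enough
// to leave enabled in shipping builds. Old entries are overwritten.
template <typename T, size_t N>
class DiagRing {
public:
    void Push(const T& v) { buf_[head_ % N] = v; ++head_; }
    size_t Size() const { return head_ < N ? head_ : N; }
    // Index 0 is the oldest entry still retained.
    const T& At(size_t i) const {
        assert(i < Size());
        size_t first = head_ < N ? 0 : head_ - N;
        return buf_[(first + i) % N];
    }
private:
    std::array<T, N> buf_{};
    size_t head_ = 0;  // total number of pushes so far
};

// Assumed record shape for one display-list push.
struct DLPush { uintptr_t addr; uint32_t depth; };

// Write the faulting address and the retained pushes, oldest first.
template <size_t N>
void DumpRecentPushes(const DiagRing<DLPush, N>& ring, uintptr_t fault_addr,
                      std::FILE* out) {
    std::fprintf(out, "fault addr %#llx, last %zu DL pushes (oldest first):\n",
                 static_cast<unsigned long long>(fault_addr), ring.Size());
    for (size_t i = 0; i < ring.Size(); ++i) {
        const DLPush& p = ring.At(i);
        std::fprintf(out, "  [%zu] dl=%#llx depth=%u\n", i,
                     static_cast<unsigned long long>(p.addr), p.depth);
    }
    std::fflush(out);  // make sure the lines land before any dialog blocks
}
```

Flushing before returning matches the stated goal of getting the diagnostics into the log even if the crash dialog hangs afterwards.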

Submodule bumps:

  - decomp: real game bug fix (gcDrawMObjForDObj G_ENDDL terminator),
    sound consumer NULL-checks (gcSetupCustomDObjs, ftMainSetStatus),
    file-scope gPortSceneHeap + arena range registration with the
    DL-range registry. (See decomp commit message for details.)
  - libultraship: callback API (Fast::RegisterDLBoundsCheck /
    RegisterAddressClassifier / DumpDLDiag), universal correctness
    fixes (gfx_pop_shader empty-check, mShaderStack per-frame reset),
    always-on diag ring buffers. (See libultraship commit message.)

Validation:

  - Clean SSB64_MAX_FRAMES=54000 exit (15 min game-time / 24 min real-
    time autonomous attract loop), zero crashes through full
    ControlDeck::ShutdownRaphnet → destruct fast3dwindow → destruct
    ResourceManager shutdown.
  - Previously documented variants:
    * Variant 1 (Kirby Cutter / Pikachu Thunder Jolt / Shield effect):
      addressed by the per-slot token table (effect_desc->file_head
      tokens now resolve correctly across scene cycles for intern-
      buffer files) + the consumer NULL-check at gcSetupCustomDObjs.
    * Variant 2 (mnCharacters joint init, ftMainSetStatus): addressed
      by the per-slot token table + the consumer NULL-check at
      ftMainSetStatus.
    * Variant 3 (stale MObjSub token warning at objanim.c:2869):
      addressed by the per-slot token table.
    * Variant 4 (segment-E DL pointer at gfx_step opcode fetch): the
      new gfx_step bounds-check rejects unresolved N64-segment pointers
      before deref.
    * Variant 5 (DL walker running past arena end): addressed by the
      gcDrawMObjForDObj G_ENDDL terminator + the gfx_step bounds-check
      tearing down the walk on walked-past detection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JRickey merged commit 77ffc3d into main on May 13, 2026.
JRickey added a commit that referenced this pull request on May 13, 2026:
This reverts commit 77ffc3d, reversing changes made to e0f4d0d.