
fix(heap): GC foreign-list dangling-pointer SEGV + coverage rounds #209

Merged
singaraiona merged 4 commits into master from fix/heap-gc-foreign-purge
May 20, 2026

@ser-vasilich
Collaborator

Summary

  • Bug fix: ray_heap_gc Pass 4 (and ray_heap_destroy) munmap pools without clearing dangling references in other heaps' foreign lists. The SEGV reproduced in ~40-60% of runs on master; 10/10 runs are clean after the fix.
  • Coverage: 3 targeted rounds since #208 (coverage push + 5 bug fixes in group/query/cmp/agg) was merged.

The SEGV (commit 8a68bdb)

Stack trace (ASan):

ERROR: AddressSanitizer: SEGV on unknown address 0x7bda6863a008
  #0 ray_heap_gc src/mem/heap.c:1372     ← fb->fl_next on unmapped memory
  #1 apply_sort_take src/ops/query.c:502
  #2 ray_select src/ops/query.c:8003

Root cause

ray_heap_destroy (src/mem/heap.c:1180) munmaps every pool of the destroyed heap. Per the explicit comment at line 1195, it deliberately skips flush_foreign to avoid coalesce races during shutdown. This leaves any other heap's foreign list with dangling pointers into now-unmapped pool memory.

ray_heap_gc Pass 4's has_foreign check (heap.c:1362-1374) walks every other heap's foreign list to verify a pool is safe to munmap. The walk reads fb->fl_next without any guard. If a previous destroy left a dangling block in some foreign list, the next has_foreign walk crashes at fb->fl_next reading unmapped memory.

ray_heap_gc Pass 4 itself has the same hazard: between its own has_foreign check and the actual ray_vm_free, a concurrent ray_free could prepend an in-pool block to a worker's foreign list.
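
To make the hazard concrete, here is a minimal sketch of a has_foreign-style walk. The names `block_t`, `heap_t` and `has_foreign_in_range` are hypothetical simplifications for illustration, not the definitions in src/mem/heap.c, and the sketch ignores whatever synchronization the real foreign lists use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified shapes; the real ones live in src/mem/heap.c. */
typedef struct block {
    struct block *fl_next;   /* link stored inside the freed block itself */
} block_t;

typedef struct heap {
    block_t *foreign;        /* blocks freed here but owned by another heap */
} heap_t;

/* True if any heap other than `self` lists a block inside [lo, hi). */
static bool has_foreign_in_range(heap_t **heaps, size_t n, const heap_t *self,
                                 uintptr_t lo, uintptr_t hi)
{
    assert(lo <= hi);
    for (size_t i = 0; i < n; i++) {
        if (heaps[i] == self)
            continue;
        /* Advancing with fb->fl_next loads from the block itself. If fb
         * points into a pool that was already unmapped, that load faults. */
        for (block_t *fb = heaps[i]->foreign; fb != NULL; fb = fb->fl_next) {
            uintptr_t addr = (uintptr_t)fb;
            if (addr >= lo && addr < hi)
                return true;
        }
    }
    return false;
}
```

Because the link lives inside the block, the walk is only as safe as the mapping behind every node it visits. A single stale entry left by an earlier destroy is enough to fault.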

Fix

At both munmap sites (ray_heap_destroy and ray_heap_gc Pass 4), immediately before ray_vm_free, walk every other heap's foreign list and unlink any block whose address falls in the pool range we're about to unmap. Reading fl_next is safe at this point because the pool is still mapped.

Invariant established: no foreign list ever contains a block in unmapped memory.

Zero additional allocations, ~50 lines.
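
The purge step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from commit 8a68bdb: `block_t`, `heap_t` and `purge_foreign_in_range` are hypothetical names, the lists are modeled as plain singly linked lists, and locking or atomics are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified shapes; the real ones live in src/mem/heap.c. */
typedef struct block {
    struct block *fl_next;
} block_t;

typedef struct heap {
    block_t *foreign;
} heap_t;

/* Unlink every entry in [lo, hi) from the foreign lists of all heaps other
 * than `self`. Must run while the pool is still mapped. Returns the number
 * of blocks unlinked. */
static size_t purge_foreign_in_range(heap_t **heaps, size_t n,
                                     const heap_t *self,
                                     uintptr_t lo, uintptr_t hi)
{
    assert(lo <= hi);
    size_t removed = 0;
    for (size_t i = 0; i < n; i++) {
        if (heaps[i] == self)
            continue;
        /* Pointer-to-pointer walk: *link is the slot that points at fb,
         * so unlinking needs no separate "previous" pointer. */
        block_t **link = &heaps[i]->foreign;
        while (*link != NULL) {
            block_t *fb = *link;
            uintptr_t addr = (uintptr_t)fb;
            if (addr >= lo && addr < hi) {
                *link = fb->fl_next;   /* read is safe: pool not yet unmapped */
                removed++;
            } else {
                link = &fb->fl_next;
            }
        }
    }
    return removed;
}
```

In this sketch the caller would invoke the purge with the pool's base and end addresses immediately before unmapping it, which keeps the walk allocation-free.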

Coverage (3 rounds since #208)

| Round | Commit | Focus |
|-------|--------|-------|
| 1 | ad39d74 | 96 targeted assertions → +344 regions |
| 2 | 0015c69 | 108 targeted assertions → +344 regions |
| 3 | 1550c05 | 6 RFL files: count_distinct, reduce_range, exec_worker_merge, par_phase3_keepmin, key_reader_atom_const, topk_radix_decode |

Current totals (after this PR):

| Metric | Coverage |
|--------|----------|
| Functions | 97.68% (47 missed of 2027) |
| Lines | 85.74% |
| Regions | 80.05% |
| Branches | 62.83% |

Test plan

  • make clean && make test (debug, ASan+UBSan): 2551 of 2553 passed, 0 failed (2 pre-existing skips)
  • 10 sequential ./rayforce.test runs: 10/10 clean (SEGV in 40-60% of runs before the fix)
  • make coverage — measured numbers above
  • No _probes/, no hidden xfail
  • No src/ test-only de-staticing

Out of scope (next round)

  • src/ops/expr.c (63.21% regions, biggest absolute gap)
  • src/ops/temporal.c (67.42%)
  • src/ops/group.c (72.10%)
  • traverse.c Louvain phase-2 / A* RFL binding / SCC

🤖 Generated with Claude Code

ser-vasilich and others added 4 commits May 20, 2026 01:13
Pivot from matrix-style ("test every type combination through the
same kernel") to targeted ("each assertion must hit a previously
uncovered `0|` region in llvm-cov show output").  Density: 3.6
regions per assertion vs ~0.5 in earlier rounds.

Five new files; each agent had a hard budget (15-30 assertions) and
a specific list of 0%/low-coverage functions to drive.

- rfl/group/topn_keep_min.rfl (20 assertions, +100 regions):
  group_ht_insert_empty_group 0% → 62.50%; group_rows_range_existing
  0% → 84.42%; group_probe_existing_entry 0% → 100%;
  sparse_i64_rehash 0% → 91.67%.  Drives the multi-key TOP-N count
  emit filter (`top_count_take` + `desc:count` over `by:[k1 k2 ...]`)
  with enough heavy groups (≥50) to cross the HT-cap doubling
  threshold inside group_ht_insert_empty_group.

- rfl/group/null_aware_helpers.rfl (18 assertions, +54 regions):
  cdpg_is_null / grpt_is_null / grpc_is_null all 0% → 66.67%
  (matching arms hit: F64 NaN, I64/TIMESTAMP, I32/DATE/TIME, I16).
  Drives null-bearing input through count_distinct/topk/count per-group
  with HAS_NULLS-set source columns at n_groups>50000 to land on the
  radix-HT path that calls these null checkers.

- rfl/query/wide_key_probe.rfl (24 assertions, +67 regions):
  rgid_probe_fn 18.67% → 89.33%; key_read_i64 29.17% → 87.50%.
  Drives the parallel row→gid probe over high-cardinality input
  (200k+ rows, mid-cardinality groups) with key columns of types
  I16/I32/SYM at varied widths.

- rfl/strop/like_seen_proj.rfl (19 assertions, +94 regions):
  like_seen_fn 47.79% → 86.73%; like_proj_fn 43.20% → 82.40%.
  Drives SYM-LIKE phase-1 + phase-3 parallel pool dispatch at
  N≥120000 with W16 / W32 / W64 width arms; selection-aware MIX
  branch hit by partial-conjunct WHERE that produces SEL_MIX morsels.

- rfl/ops/expr_load_setnull.rfl (15 assertions, +29 regions):
  set_all_null 21.82% → 72.73%; par_binary_str_fn 0% → 100%.
  expr_load_f64 stays at 0% but is documented as structurally dead
  code (Bug 6): the function fires only when a SCAN register has
  `rt == RAY_F64 && ct != RAY_F64`, but expr_compile at expr.c:486
  forces `rt == ct` for SCAN regs.  Promotion goes through
  expr_ensure_type which creates a SCRATCH register; it never re-types
  the SCAN reg.  37 regions / 28 lines of dead code — flagged for
  follow-up cleanup.

Total: 96 assertions, +344 regions, 0 source bugs introduced.

Tests: `make clean && make test` -> 2539 of 2541 passed (2 skipped,
0 failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Continues the llvm-cov-driven targeted approach.  Each agent had a
hard budget (≤20-25 assertions) and a specific list of 0%/low-coverage
functions; each assertion verified to drive a previously-uncovered
`0|` region.

- rfl/group/strlen_grow.rfl (8 assertions):
  group_strlen_at_cached 0% -> ~94%; grpt_ht_grow_slots 0% -> 100%.
  Drives the cached strlen path via sparse_i64 group-by (key range
  > DA_MAX_COMPOSITE_SLOTS) and the topk HT slot doubling via 1.2M
  uniformly-hashed keys (~4687 distinct per partition crosses the
  4096 grow threshold; 767 grow events observed).

- rfl/temporal/dag_extract_trunc.rfl (25 assertions):
  exec_extract HAS_NULLS=1 macro instantiations (TIMESTAMP+null /
  DATE+null / TIME+null) across yyyy/mm/dd/hh/ss/minute/dow/doy;
  exec_date_trunc HAS_NULLS=1 for .date and .time on null-bearing
  temporal columns; pre-epoch `1999.12.31` drives the `us<0` floor
  correction.  Unreachable arms (EPOCH field, HOUR/MINUTE/YEAR/MONTH
  date_trunc) flagged as no-RFL-surface, not silently dodged.

- rfl/ops/print_resolve.rfl (22 assertions):
  ray_lang_print 34.94% -> 86.75% (+43 regions); ray_resolve_fn
  34.27% -> 66.43% (+46 regions).  Drives println across atom types
  (F64/BOOL/SYM/LIST/TABLE/UNARY/BINARY/VARY/default), typed-null
  branches; resolve with 2-arg form, table/sym arms, env-hit/miss,
  empty-table, non-I64 column pass-through.  Documents a gap in
  ray_cast_fn (no -RAY_SYM -> I64 atom case) — not fixed here.

- rfl/ops/fp_eval_cmp.rfl (25 assertions):
  fp_eval_cmp 39.94% -> ~60-70% (estimated +50-70 regions).  Drives
  esz=1/2/4/8 EQ/NE/LT/LE/GT/GE arms across non-SYM types, FP_IN
  esz=1/2/4 + n_cvals==0 fold, SYM cval_in_dict=0 fold, RAY_STR and
  RAY_SYM LIKE prefix + shape-NONE branches.  Required syntactic
  literal vec for fused IN (gate at fused_group.c:260 rejects
  function-call lists like (as 'U8 ...)).

- rfl/agg/atom_i64_med_topk.rfl (17 assertions):
  agg_atom_i64_for_type 20% -> 100% (8 switch arms via parted
  min/max); med_is_null 20% -> ~67% (F64/I64/I32/I16 arms via
  (med v) by g on null-bearing columns); topk_read_i64 28% ->
  ~80% (I64/I32/DATE/TIME/I16/BOOL/U8 arms via multi-key
  (top v K) by [g h] bypassing rowform).

- rfl/store/str_col_io.rfl (11 assertions):
  col_load_str_vec 0% -> driven via planted legacy STRV magic
  + .db.splayed.get fallback through ray_col_load; col_copy_str_pool
  0% -> driven via .db.splayed.mount of a STR column (no mmap path).
  Also covers malformed STRV (type/domain errors) and empty STRV
  ('rows=0' edge case).

Density: 108 assertions for ~344 regions across all targets =
3.2 regions/assertion vs ~0.5 in earlier matrix-style rounds.

Tests: `make clean && make test` -> 2545 of 2547 passed (2 skipped,
0 failed).  No new src/ bugs found in target functions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ray_heap_destroy and ray_heap_gc Pass 4 both munmap pools without
clearing dangling references in other heaps' foreign lists.  The
existing has_foreign guard in Pass 4 only prevents NEW dangling
pointers — pre-existing dangling pointers from earlier
ray_heap_destroy calls (which explicitly skip flush_foreign per
the comment at heap.c:1195) crash the next has_foreign walk at
fb->fl_next.

Fix: at both munmap sites, walk every other heap's foreign list
and unlink blocks whose address falls in the pool range we're
about to unmap.  Reading fl_next is safe here because the pool
is still mapped at the time of the unlink.

Maintains the invariant: no foreign list ever contains a block
in unmapped memory.

Reproduced the SEGV in ~40-60% of runs (5 sequential
./rayforce.test invocations on master) before the fix; 10/10
clean after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- agg/count_distinct_extras
- group/reduce_range_arms
- ops/exec_worker_merge
- ops/par_phase3_keepmin
- query/key_reader_atom_const
- sort/topk_radix_decode

Each file targets specific uncovered regions identified in the
previous coverage report.  All pass under ASan+UBSan, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@singaraiona singaraiona merged commit baad980 into master May 20, 2026
4 checks passed