fix(heap): GC foreign-list dangling-pointer SEGV + coverage rounds by ser-vasilich · Pull Request #209 · RayforceDB/rayforce

ser-vasilich · 2026-05-20T14:51:41Z

Summary

Bug fix: ray_heap_gc Pass 4 (and ray_heap_destroy) munmap pools without clearing dangling references in other heaps' foreign lists. Reproduced ~40-60% on master; 10/10 clean after fix.
Coverage: 3 targeted rounds since Coverage push + 5 bug fixes — group/query/cmp/agg #208 merged.

The SEGV (commit `8a68bdb`)

Stack trace (ASan):

ERROR: AddressSanitizer: SEGV on unknown address 0x7bda6863a008
  #0 ray_heap_gc src/mem/heap.c:1372     ← fb->fl_next on unmapped memory
  #1 apply_sort_take src/ops/query.c:502
  #2 ray_select src/ops/query.c:8003

Root cause

ray_heap_destroy (src/mem/heap.c:1180) munmaps every pool of the destroyed heap. Per the explicit comment at line 1195, it deliberately skips flush_foreign to avoid coalesce races during shutdown. This leaves any other heap's foreign list with dangling pointers into now-unmapped pool memory.

ray_heap_gc Pass 4's has_foreign check (heap.c:1362-1374) walks every other heap's foreign list to verify a pool is safe to munmap. The walk reads fb->fl_next without any guard. If a previous destroy left a dangling block in some foreign list, the next has_foreign walk crashes at fb->fl_next reading unmapped memory.

ray_heap_gc Pass 4 itself has the same hazard: between its own has_foreign check and the actual ray_vm_free, a concurrent ray_free could prepend an in-pool block to a worker's foreign list.

Fix

At both munmap sites (ray_heap_destroy and ray_heap_gc Pass 4), immediately before ray_vm_free, walk every other heap's foreign list and unlink any block whose address falls in the pool range we're about to unmap. Reading fl_next is safe at this point because the pool is still mapped.

Invariant established: no foreign list ever contains a block in unmapped memory.

Zero additional allocations, ~50 lines.

Coverage (3 rounds since #208)

Round	Commit	Focus
1	`ad39d74`	96 targeted assertions → +344 regions
2	`0015c69`	108 targeted assertions → +344 regions
3	`1550c05`	6 RFL files: count_distinct, reduce_range, exec_worker_merge, par_phase3_keepmin, key_reader_atom_const, topk_radix_decode

Current totals (after this PR):

Metric	Coverage
Functions	97.68% (47 missed of 2027)
Lines	85.74%
Regions	80.05%
Branches	62.83%

Test plan

make clean && make test (debug, ASan+UBSan): 2551 of 2553 PASS, 0 failed (2 skipped pre-existing)
10 sequential ./rayforce.test runs — 10/10 clean (was 40-60% SEGV before fix)
make coverage — measured numbers above
No _probes/, no hidden xfail
No src/ test-only de-staticing

Out of scope (next round)

src/ops/expr.c (63.21% regions, biggest absolute gap)
src/ops/temporal.c (67.42%)
src/ops/group.c (72.10%)
traverse.c Louvain phase-2 / A* RFL binding / SCC

🤖 Generated with Claude Code

Pivot from matrix-style ("test every type combination through the same kernel") to targeted ("each assertion must hit a previously uncovered `0|` region in llvm-cov show output"). Density: 3.6 regions per assertion vs ~0.5 in earlier rounds. Five new files; each agent had a hard budget (15-30 assertions) and a specific list of 0%/low-coverage functions to drive. - rfl/group/topn_keep_min.rfl (20 assertions, +100 regions): group_ht_insert_empty_group 0% → 62.50%; group_rows_range_existing 0% → 84.42%; group_probe_existing_entry 0% → 100%; sparse_i64_rehash 0% → 91.67%. Drives the multi-key TOP-N count emit filter (`top_count_take` + `desc:count` over `by:[k1 k2 ...]`) with enough heavy groups (≥50) to cross the HT-cap doubling threshold inside group_ht_insert_empty_group. - rfl/group/null_aware_helpers.rfl (18 assertions, +54 regions): cdpg_is_null / grpt_is_null / grpc_is_null all 0% → 66.67% (matching arms hit: F64 NaN, I64/TIMESTAMP, I32/DATE/TIME, I16). Drives null-bearing input through count_distinct/topk/count per-group with HAS_NULLS-set source columns at n_groups>50000 to land on the radix-HT path that calls these null checkers. - rfl/query/wide_key_probe.rfl (24 assertions, +67 regions): rgid_probe_fn 18.67% → 89.33%; key_read_i64 29.17% → 87.50%. Drives the parallel row→gid probe over high-cardinality (200k+ rows, mid-cardinality groups) with key columns of types I16/I32/SYM varied widths. - rfl/strop/like_seen_proj.rfl (19 assertions, +94 regions): like_seen_fn 47.79% → 86.73%; like_proj_fn 43.20% → 82.40%. Drives SYM-LIKE phase-1 + phase-3 parallel pool dispatch at N≥120000 with W16 / W32 / W64 width arms; selection-aware MIX branch hit by partial-conjunct WHERE that produces SEL_MIX morsels. - rfl/ops/expr_load_setnull.rfl (15 assertions, +29 regions): set_all_null 21.82% → 72.73%; par_binary_str_fn 0% → 100%. expr_load_f64 stays at 0% but is documented as structurally dead code (Bug 6): the function fires only when a SCAN register has `rt == RAY_F64 && ct != RAY_F64`, but expr_compile at expr.c:486 forces `rt == ct` for SCAN regs. Promotion goes through expr_ensure_type which creates a SCRATCH register; it never re-types the SCAN reg. 37 regions / 28 lines of dead code — flagged for follow-up cleanup. Total: 96 assertions, +344 regions, 0 source bugs introduced. Tests: `make clean && make test` -> 2539 of 2541 passed (2 skipped, 0 failed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Continues the llvm-cov-driven targeted approach. Each agent had a hard budget (≤20-25 assertions) and a specific list of 0%/low-coverage functions; each assertion verified to drive a previously-uncovered `0|` region. - rfl/group/strlen_grow.rfl (8 assertions): group_strlen_at_cached 0% -> ~94%; grpt_ht_grow_slots 0% -> 100%. Drives the cached strlen path via sparse_i64 group-by (key range > DA_MAX_COMPOSITE_SLOTS) and the topk HT slot doubling via 1.2M uniformly-hashed keys (~4687 distinct per partition crosses the 4096 grow threshold; 767 grow events observed). - rfl/temporal/dag_extract_trunc.rfl (25 assertions): exec_extract HAS_NULLS=1 macro instantiations (TIMESTAMP+null / DATE+null / TIME+null) across yyyy/mm/dd/hh/ss/minute/dow/doy; exec_date_trunc HAS_NULLS=1 for .date and .time on null-bearing temporal columns; pre-epoch `1999.12.31` drives the `us<0` floor correction. Unreachable arms (EPOCH field, HOUR/MINUTE/YEAR/MONTH date_trunc) flagged as no-RFL-surface, not silently dodged. - rfl/ops/print_resolve.rfl (22 assertions): ray_lang_print 34.94% -> 86.75% (+43 regions); ray_resolve_fn 34.27% -> 66.43% (+46 regions). Drives println across atom types (F64/BOOL/SYM/LIST/TABLE/UNARY/BINARY/VARY/default), typed-null branches; resolve with 2-arg form, table/sym arms, env-hit/miss, empty-table, non-I64 column pass-through. Documents a gap in ray_cast_fn (no -RAY_SYM -> I64 atom case) — not fixed here. - rfl/ops/fp_eval_cmp.rfl (25 assertions): fp_eval_cmp 39.94% -> ~60-70% (estimated +50-70 regions). Drives esz=1/2/4/8 EQ/NE/LT/LE/GT/GE arms across non-SYM types, FP_IN esz=1/2/4 + n_cvals==0 fold, SYM cval_in_dict=0 fold, RAY_STR and RAY_SYM LIKE prefix + shape-NONE branches. Required syntactic literal vec for fused IN (gate at fused_group.c:260 rejects function-call lists like (as 'U8 ...)). - rfl/agg/atom_i64_med_topk.rfl (17 assertions): agg_atom_i64_for_type 20% -> 100% (8 switch arms via parted min/max); med_is_null 20% -> ~67% (F64/I64/I32/I16 arms via (med v) by g on null-bearing columns); topk_read_i64 28% -> ~80% (I64/I32/DATE/TIME/I16/BOOL/U8 arms via multi-key (top v K) by [g h] bypassing rowform). - rfl/store/str_col_io.rfl (11 assertions): col_load_str_vec 0% -> driven via planted legacy STRV magic + .db.splayed.get fallback through ray_col_load; col_copy_str_pool 0% -> driven via .db.splayed.mount of a STR column (no mmap path). Also covers malformed STRV (type/domain errors) and empty STRV ('rows=0' edge case). Density: 108 assertions for ~344 regions across all targets = 3.2 regions/assertion vs ~0.5 in earlier matrix-style rounds. Tests: `make clean && make test` -> 2545 of 2547 passed (2 skipped, 0 failed). No new src/ bugs found in target functions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ray_heap_destroy and ray_heap_gc Pass 4 both munmap pools without clearing dangling references in other heaps' foreign lists. The existing has_foreign guard in Pass 4 only prevents NEW dangling pointers — pre-existing dangling pointers from earlier ray_heap_destroy calls (which explicitly skip flush_foreign per the comment at heap.c:1195) crash the next has_foreign walk at fb->fl_next. Fix: at both munmap sites, walk every other heap's foreign list and unlink blocks whose address falls in the pool range we're about to unmap. Reading fl_next is safe here because the pool is still mapped at the time of the unlink. Maintains the invariant: no foreign list ever contains a block in unmapped memory. Reproduced ~40-60% via 5 sequential ./rayforce.test invocations on master before fix; 10/10 clean after. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- agg/count_distinct_extras - group/reduce_range_arms - ops/exec_worker_merge - ops/par_phase3_keepmin - query/key_reader_atom_const - sort/topk_radix_decode Each file targets specific uncovered regions identified in the previous coverage report. All pass under ASan+UBSan, 0 failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ser-vasilich and others added 4 commits May 20, 2026 01:13

singaraiona merged commit baad980 into master May 20, 2026
4 checks passed

ser-vasilich mentioned this pull request May 21, 2026

Coverage round 4+5 + 3 src/ bug fixes (eval/expr/pivot) #210

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(heap): GC foreign-list dangling-pointer SEGV + coverage rounds#209

fix(heap): GC foreign-list dangling-pointer SEGV + coverage rounds#209
singaraiona merged 4 commits into
masterfrom
fix/heap-gc-foreign-purge

ser-vasilich commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ser-vasilich commented May 20, 2026

Summary

The SEGV (commit 8a68bdb)

Root cause

Fix

Coverage (3 rounds since #208)

Test plan

Out of scope (next round)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

The SEGV (commit `8a68bdb`)