fix(heap): GC foreign-list dangling-pointer SEGV + coverage rounds#209
Merged
Conversation
Pivot from matrix-style ("test every type combination through the
same kernel") to targeted ("each assertion must hit a previously
uncovered `0|` region in llvm-cov show output"). Density: 3.6
regions per assertion vs ~0.5 in earlier rounds.
Five new files; each agent had a hard budget (15-30 assertions) and
a specific list of 0%/low-coverage functions to drive.
- rfl/group/topn_keep_min.rfl (20 assertions, +100 regions):
group_ht_insert_empty_group 0% → 62.50%; group_rows_range_existing
0% → 84.42%; group_probe_existing_entry 0% → 100%;
sparse_i64_rehash 0% → 91.67%. Drives the multi-key TOP-N count
emit filter (`top_count_take` + `desc:count` over `by:[k1 k2 ...]`)
with enough heavy groups (≥50) to cross the HT-cap doubling
threshold inside group_ht_insert_empty_group.
- rfl/group/null_aware_helpers.rfl (18 assertions, +54 regions):
cdpg_is_null / grpt_is_null / grpc_is_null all 0% → 66.67%
(matching arms hit: F64 NaN, I64/TIMESTAMP, I32/DATE/TIME, I16).
Drives null-bearing input through count_distinct/topk/count per-group
with HAS_NULLS-set source columns at n_groups>50000 to land on the
radix-HT path that calls these null checkers.
- rfl/query/wide_key_probe.rfl (24 assertions, +67 regions):
rgid_probe_fn 18.67% → 89.33%; key_read_i64 29.17% → 87.50%.
Drives the parallel row→gid probe over high-cardinality (200k+ rows,
mid-cardinality groups) with key columns of types I16/I32/SYM
varied widths.
- rfl/strop/like_seen_proj.rfl (19 assertions, +94 regions):
like_seen_fn 47.79% → 86.73%; like_proj_fn 43.20% → 82.40%.
Drives SYM-LIKE phase-1 + phase-3 parallel pool dispatch at
N≥120000 with W16 / W32 / W64 width arms; selection-aware MIX
branch hit by partial-conjunct WHERE that produces SEL_MIX morsels.
- rfl/ops/expr_load_setnull.rfl (15 assertions, +29 regions):
set_all_null 21.82% → 72.73%; par_binary_str_fn 0% → 100%.
expr_load_f64 stays at 0% but is documented as structurally dead
code (Bug 6): the function fires only when a SCAN register has
`rt == RAY_F64 && ct != RAY_F64`, but expr_compile at expr.c:486
forces `rt == ct` for SCAN regs. Promotion goes through
expr_ensure_type which creates a SCRATCH register; it never re-types
the SCAN reg. 37 regions / 28 lines of dead code — flagged for
follow-up cleanup.
Total: 96 assertions, +344 regions, 0 source bugs introduced.
Tests: `make clean && make test` -> 2539 of 2541 passed (2 skipped,
0 failed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continues the llvm-cov-driven targeted approach. Each agent had a
hard budget (≤20-25 assertions) and a specific list of 0%/low-coverage
functions; each assertion verified to drive a previously-uncovered
`0|` region.
- rfl/group/strlen_grow.rfl (8 assertions):
group_strlen_at_cached 0% -> ~94%; grpt_ht_grow_slots 0% -> 100%.
Drives the cached strlen path via sparse_i64 group-by (key range
> DA_MAX_COMPOSITE_SLOTS) and the topk HT slot doubling via 1.2M
uniformly-hashed keys (~4687 distinct per partition crosses the
4096 grow threshold; 767 grow events observed).
- rfl/temporal/dag_extract_trunc.rfl (25 assertions):
exec_extract HAS_NULLS=1 macro instantiations (TIMESTAMP+null /
DATE+null / TIME+null) across yyyy/mm/dd/hh/ss/minute/dow/doy;
exec_date_trunc HAS_NULLS=1 for .date and .time on null-bearing
temporal columns; pre-epoch `1999.12.31` drives the `us<0` floor
correction. Unreachable arms (EPOCH field, HOUR/MINUTE/YEAR/MONTH
date_trunc) flagged as no-RFL-surface, not silently dodged.
- rfl/ops/print_resolve.rfl (22 assertions):
ray_lang_print 34.94% -> 86.75% (+43 regions); ray_resolve_fn
34.27% -> 66.43% (+46 regions). Drives println across atom types
(F64/BOOL/SYM/LIST/TABLE/UNARY/BINARY/VARY/default), typed-null
branches; resolve with 2-arg form, table/sym arms, env-hit/miss,
empty-table, non-I64 column pass-through. Documents a gap in
ray_cast_fn (no -RAY_SYM -> I64 atom case) — not fixed here.
- rfl/ops/fp_eval_cmp.rfl (25 assertions):
fp_eval_cmp 39.94% -> ~60-70% (estimated +50-70 regions). Drives
esz=1/2/4/8 EQ/NE/LT/LE/GT/GE arms across non-SYM types, FP_IN
esz=1/2/4 + n_cvals==0 fold, SYM cval_in_dict=0 fold, RAY_STR and
RAY_SYM LIKE prefix + shape-NONE branches. Required syntactic
literal vec for fused IN (gate at fused_group.c:260 rejects
function-call lists like (as 'U8 ...)).
- rfl/agg/atom_i64_med_topk.rfl (17 assertions):
agg_atom_i64_for_type 20% -> 100% (8 switch arms via parted
min/max); med_is_null 20% -> ~67% (F64/I64/I32/I16 arms via
(med v) by g on null-bearing columns); topk_read_i64 28% ->
~80% (I64/I32/DATE/TIME/I16/BOOL/U8 arms via multi-key
(top v K) by [g h] bypassing rowform).
- rfl/store/str_col_io.rfl (11 assertions):
col_load_str_vec 0% -> driven via planted legacy STRV magic
+ .db.splayed.get fallback through ray_col_load; col_copy_str_pool
0% -> driven via .db.splayed.mount of a STR column (no mmap path).
Also covers malformed STRV (type/domain errors) and empty STRV
('rows=0' edge case).
Density: 108 assertions for ~344 regions across all targets =
3.2 regions/assertion vs ~0.5 in earlier matrix-style rounds.
Tests: `make clean && make test` -> 2545 of 2547 passed (2 skipped,
0 failed). No new src/ bugs found in target functions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ray_heap_destroy and ray_heap_gc Pass 4 both munmap pools without clearing dangling references in other heaps' foreign lists. The existing has_foreign guard in Pass 4 only prevents NEW dangling pointers — pre-existing dangling pointers from earlier ray_heap_destroy calls (which explicitly skip flush_foreign per the comment at heap.c:1195) crash the next has_foreign walk at fb->fl_next. Fix: at both munmap sites, walk every other heap's foreign list and unlink blocks whose address falls in the pool range we're about to unmap. Reading fl_next is safe here because the pool is still mapped at the time of the unlink. Maintains the invariant: no foreign list ever contains a block in unmapped memory. Reproduced ~40-60% via 5 sequential ./rayforce.test invocations on master before fix; 10/10 clean after. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- agg/count_distinct_extras - group/reduce_range_arms - ops/exec_worker_merge - ops/par_phase3_keepmin - query/key_reader_atom_const - sort/topk_radix_decode Each file targets specific uncovered regions identified in the previous coverage report. All pass under ASan+UBSan, 0 failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ray_heap_gcPass 4 (andray_heap_destroy) munmap pools without clearing dangling references in other heaps' foreign lists. Reproduced ~40-60% on master; 10/10 clean after fix.The SEGV (commit 8a68bdb)
Stack trace (ASan):
Root cause
ray_heap_destroy(src/mem/heap.c:1180) munmaps every pool of the destroyed heap. Per the explicit comment at line 1195, it deliberately skipsflush_foreignto avoid coalesce races during shutdown. This leaves any other heap'sforeignlist with dangling pointers into now-unmapped pool memory.ray_heap_gcPass 4'shas_foreigncheck (heap.c:1362-1374) walks every other heap's foreign list to verify a pool is safe to munmap. The walk readsfb->fl_nextwithout any guard. If a previous destroy left a dangling block in some foreign list, the next has_foreign walk crashes atfb->fl_nextreading unmapped memory.ray_heap_gcPass 4 itself has the same hazard: between its own has_foreign check and the actualray_vm_free, a concurrentray_freecould prepend an in-pool block to a worker's foreign list.Fix
At both munmap sites (
ray_heap_destroyandray_heap_gcPass 4), immediately beforeray_vm_free, walk every other heap's foreign list and unlink any block whose address falls in the pool range we're about to unmap. Readingfl_nextis safe at this point because the pool is still mapped.Invariant established: no foreign list ever contains a block in unmapped memory.
Zero additional allocations, ~50 lines.
Coverage (3 rounds since #208)
Current totals (after this PR):
Test plan
make clean && make test(debug, ASan+UBSan): 2551 of 2553 PASS, 0 failed (2 skipped pre-existing)./rayforce.testruns — 10/10 clean (was 40-60% SEGV before fix)make coverage— measured numbers above_probes/, no hidden xfailsrc/test-only de-staticingOut of scope (next round)
src/ops/expr.c(63.21% regions, biggest absolute gap)src/ops/temporal.c(67.42%)src/ops/group.c(72.10%)traverse.cLouvain phase-2 / A* RFL binding / SCC🤖 Generated with Claude Code