Prefetch hints for arrangement reads (joins, reduces) #716

@antiguru

Description

Motivation

Arrangements backed by large batches can incur cache misses or page faults when cursors seek through cold data. With a custom allocator that supports prefetching (e.g., madvise(MADV_WILLNEED) or explicit page-in), we could hide this latency — if operators can express what data they'll need next.

This issue explores where prefetch hints could be injected and what API shapes might work.

Current access pattern

The join loop (join.rs:367–396) is a synchronized merge of two cursors:

while let (Some(batch_key), Some(trace_key), ..) = (..) {
    match trace_key.cmp(&batch_key) {
        Less    => trace.seek_key(trace_storage, batch_key),
        Greater => batch.seek_key(batch_storage, trace_key),
        Equal   => { /* load vals/times, compute, step both */ }
    }
}

Key observations:

  • The batch cursor drives the access pattern. Its keys are known upfront (small, new batch). The trace cursor seeks to match.
  • CursorList::seek_key fans out to every inner cursor (cursor_list.rs:151–155), each doing exponential+binary search in its keys container.
  • After a key match, all vals and (time, diff) pairs are loaded: ValueHistory::load walks the full value/time structure for that key.
  • No lookahead exists today. The join processes one key at a time with zero knowledge of the next.

Reduce is better positioned: it maintains a pending_keys list and could prefetch the full batch of upcoming keys.

Where prefetch helps

  1. Key lookup across batches — when seeking k in the trace, prefetch the approximate landing position for the next batch key in each spine batch.
  2. Value/update loading — once the next key's key_cursor is known, prefetch its vals and upds ranges while the Cartesian product for the previous key is still being computed.
  3. Parallel prefetch across CursorList — issue prefetches to all batches before performing the actual seeks.

Possible API shapes

A) Cursor-level hint (operator-facing, minimal)

trait Cursor {
    /// Hint that the cursor will soon seek to `key`. Default: no-op.
    fn prefetch_key(&self, _storage: &Self::Storage, _key: Self::Key<'_>) {}
}

The join peeks at the next batch key and calls trace.prefetch_key(storage, next_key) one iteration ahead. CursorList fans out to all inner cursors.

Pros: Simple, opt-in, no breaking changes.
Cons: Only one key ahead; the operator must drive it.
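To make shape A concrete, here is a toy sketch. All types are simplified: the real Cursor trait has Storage and lifetime-parameterized Key associated types, and VecCursor / join_keys are hypothetical stand-ins for the trace cursor and the join loop.

```rust
use std::cmp::Ordering;

/// Simplified stand-in for the Cursor trait.
trait Cursor {
    fn key(&self) -> Option<u64>;
    fn step(&mut self);
    fn seek(&mut self, key: u64);
    /// Hint that the cursor will soon seek to `key`. Default: no-op.
    fn prefetch_key(&self, _key: u64) {}
}

/// A cursor over a sorted key list.
struct VecCursor { keys: Vec<u64>, pos: usize }

impl Cursor for VecCursor {
    fn key(&self) -> Option<u64> { self.keys.get(self.pos).copied() }
    fn step(&mut self) { self.pos += 1; }
    fn seek(&mut self, key: u64) {
        while self.pos < self.keys.len() && self.keys[self.pos] < key {
            self.pos += 1;
        }
    }
}

/// Synchronized merge that hints the trace about the *next* batch key
/// one iteration before it is needed.
fn join_keys(batch: &mut VecCursor, trace: &mut VecCursor) -> Vec<u64> {
    let mut matched = Vec::new();
    while let (Some(bk), Some(tk)) = (batch.key(), trace.key()) {
        // Lookahead: the batch cursor owns its keys, so peeking is cheap.
        if let Some(next) = batch.keys.get(batch.pos + 1) {
            trace.prefetch_key(*next);
        }
        match tk.cmp(&bk) {
            Ordering::Less => trace.seek(bk),
            Ordering::Greater => batch.seek(tk),
            Ordering::Equal => { matched.push(bk); batch.step(); trace.step(); }
        }
    }
    matched
}
```

Since VecCursor does not override prefetch_key, the hint is a no-op here; a prefetch-aware cursor would override it without any change to the join loop.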

B) Container-level hint (storage-facing, transparent)

trait BatchContainer {
    /// Hint that indices in [start, end) will be accessed soon. Default: no-op.
    fn prefetch(&self, _start: usize, _end: usize) {}
}

OrdValCursor::seek_key could, after computing the destination index, prefetch the associated value/update offset and data ranges. Invisible to operators.

Pros: Transparent; also benefits merge cursors.
Cons: Container must know its memory layout to compute addresses.
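A sketch of shape B, with a Vec-backed container that records the ranges it was asked to prefetch (standing in for one that would issue real prefetch instructions). The trait is simplified and seek_with_prefetch is a hypothetical free function playing the role of OrdValCursor::seek_key:

```rust
use std::cell::RefCell;

/// Simplified stand-in for BatchContainer.
trait BatchContainer {
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> u64;
    /// Hint that indices in [start, end) will be accessed soon. Default: no-op.
    fn prefetch(&self, _start: usize, _end: usize) {}
}

/// Records prefetch hints instead of issuing madvise/prefetch calls.
struct RecordingContainer { data: Vec<u64>, hints: RefCell<Vec<(usize, usize)>> }

impl BatchContainer for RecordingContainer {
    fn len(&self) -> usize { self.data.len() }
    fn get(&self, index: usize) -> u64 { self.data[index] }
    fn prefetch(&self, start: usize, end: usize) {
        self.hints.borrow_mut().push((start, end));
    }
}

/// After binary-searching the destination index, hint the container about
/// the range the cursor is about to read. Invisible to the operator.
fn seek_with_prefetch<C: BatchContainer>(container: &C, key: u64) -> usize {
    let (mut lo, mut hi) = (0, container.len());
    while lo < hi {
        let mid = (lo + hi) / 2;
        if container.get(mid) < key { lo = mid + 1; } else { hi = mid; }
    }
    // Destination computed: prefetch the landing index and a few successors.
    container.prefetch(lo, (lo + 4).min(container.len()));
    lo
}
```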

C) Trace-level batch hint (maximum information)

trait TraceReader {
    /// Hint that these keys will be looked up soon. Default: no-op.
    fn prefetch_keys<I: IntoIterator<Item = ...>>(&self, keys: I) {}
}

The operator extracts the next N keys from the batch cursor and passes them all at once. The spine fans out to each batch; the allocator gets a batch of prefetch requests.

Pros: Maximum information for the allocator; amortizable.
Cons: More invasive; trace must expose inner structure.
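The fan-out in shape C might look like the following. Spine and Batch are heavily simplified stand-ins; prefetched records hints where a real implementation would translate each key into an approximate byte range and hand the whole request list to the allocator at once:

```rust
use std::cell::RefCell;

/// A trace is a list of immutable batches; `prefetched` records hints
/// in place of a real allocator call.
struct Batch { prefetched: RefCell<Vec<u64>> }
struct Spine { batches: Vec<Batch> }

impl Spine {
    /// Fan a batch of key hints out to every inner batch, so the allocator
    /// can amortize its work over the whole request list.
    fn prefetch_keys<I>(&self, keys: I)
    where
        I: IntoIterator<Item = u64>,
        I::IntoIter: Clone,
    {
        let keys = keys.into_iter();
        for batch in &self.batches {
            for key in keys.clone() {
                batch.prefetched.borrow_mut().push(key);
            }
        }
    }
}
```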

Pieces not yet in place

  • Allocator ↔ container connection. BatchContainer / Layout have no reference to an allocator. Containers would need either an allocator handle or a global/thread-local prefetch mechanism.
  • Address computation. Translating "key index N" to a byte range is trivial for Vec<T> (ptr + N * size_of::<T>()) but encoding-dependent for columnar containers.
  • Peek-ahead in cursors. The Cursor trait has no peek_next_key(). The join could read ahead in the batch directly (it owns the storage), or a small API addition could formalize this.
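For the Vec<T> case, the index-to-byte-range translation is a few lines; the resulting range would be handed to madvise(MADV_WILLNEED) or similar. The helper name is invented:

```rust
/// Translate "indices [start, end) of a slice" into the byte range a
/// prefetch mechanism would operate on. Columnar containers would need
/// encoding-aware logic here instead of simple pointer arithmetic.
fn byte_range<T>(data: &[T], start: usize, end: usize) -> (usize, usize) {
    let base = data.as_ptr() as usize;
    let size = std::mem::size_of::<T>();
    (base + start * size, base + end * size)
}
```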

Suggested starting point

Combine A + B: add prefetch_key on Cursor (default no-op) and prefetch on BatchContainer (default no-op). In OrdValCursor::prefetch_key, estimate the key's landing position and prefetch the key/val/update byte ranges. In the join loop, peek at the next batch key and call the hint one iteration ahead. All zero-cost until a container opts in.

For reduce, prefetch the full pending_keys list upfront since it's known ahead of time.
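The reduce side reduces to a simple pattern: because pending_keys is fully known before processing starts, every hint can be issued before any work begins. process_pending and its closure parameters are hypothetical:

```rust
/// Issue all prefetch hints for a known worklist upfront, then process it.
/// By the time the later (cold) keys are reached, their pages should
/// already be resident.
fn process_pending(
    pending_keys: &[u64],
    prefetch: impl Fn(u64),    // e.g. trace.prefetch_key under shape A
    mut work: impl FnMut(u64), // the per-key reduction
) {
    for &key in pending_keys {
        prefetch(key);
    }
    for &key in pending_keys {
        work(key);
    }
}
```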
