spec-dec: support device-aware recurrent GPU tape placement on ROCm multi-GPU#32
spec-dec: support device-aware recurrent GPU tape placement on ROCm multi-GPU#32nycdubliner wants to merge 4 commits into
Conversation
…ulti-GPU For multi-GPU ROCm/HIP setups, allocating all spec-dec tape buffers on a single device (e.g. ROCm0) causes execution failures or severe bottlenecks when recurrent layers are split across devices (e.g. ROCm1 cannot access ROCm0 source views). This commit enables device-aware recurrent GPU tape placement: - Tape buffers are allocated per recurrent layer on the exact physical GPU device assigned to compile/run that layer. - During DFlash tape capture and direct GPU replay, state memory is verified to reside on the local physical GPU before executing. - Non-local device memory accesses and associated ROCm IPC/peer faults are avoided. - A fallback path to CPU tape capture and replay is preserved when `GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0` is set or on single-GPU setups. - Retained optimized scheduler callback-mode synchronization checks to minimize overhead when Hidden/GDN callback evaluation is enabled.
nycdubliner
left a comment
There was a problem hiding this comment.
Expert PR Review — Claude Opus 4.6 (Thinking)
Model under test: Qwen 3.6 27B (Q5_K_S target / Q4_K_M DFlash drafter)
Hardware: 2× RX 7800 XT, ROCm/HIP gfx1101
Overall Assessment
This is a well-structured PR that addresses a real performance bottleneck. The core idea — per-layer device-aware tape buffer allocation using model.dev_layer(il) — is the correct architectural approach. The diff is 222+/126- across 5 files, touching the right subsystems. The ROCm backend-name awareness ("CUDA" || "ROCm") is applied consistently. Good first contribution.
Verdict: Approve with requested changes (mostly documentation and one correctness concern).
What works well
-
allocate_tape_gpurefactor — Per-layer ggml context + buffer allocation keyed tomodel.dev_layer(il)viabackend_for_dev(). This is the right fix. Each tape layer'sbuf,ctx, anddevare now owned at layer granularity with correct cleanup in the destructor. -
dflash_gpu_backend_reg()helper — Centralizes the CUDA-then-ROCm registry lookup, eliminating 5 separateggml_backend_reg_by_name("CUDA")callsites. Clean dedup. -
dflash_is_cuda_compatible_tensor()helper — Replaces 3 localis_cuda_tensorlambdas with a shared function that checks both"CUDA"and"ROCm"buffer names. Good. -
build_recurrent_copy_plan/copy_cellinllama-memory-recurrent.cpp— Removing the singlecopy_plan_deviceand tracking per-entry device with per-deviceset_device/sync_devicecalls. This is the correct multi-GPU D2D pattern. -
tape_replay_gdn_direct_gpudevice validation — Now checks that ALL inputs (state, k, v, gate, beta) are on the same device before launching, and correctly setsreplay_device = -2for heterogeneous launches with per-ptr sync. Solid. -
dflash_memory_seq_cp_recurrent_ordered— Removed themodel.n_devices() > 1early-return gate and replaced it with per-backend sync across all GPU devices. Correct. -
ggml scheduler callback gate — Moving
ggml_backend_synchronize(split_backend)inside theif (need)block is a targeted optimization that avoids unnecessary CPU-GPU sync when no callback is registered for a tensor. This is safe because theneedflag already tracks callback presence.
Requested Changes
1. allocate_hidden_gpu still has an ungated n_devices() > 1 early return
// llama-context.cpp L1969-1973 (unmodified in this diff)
if (model.n_devices() > 1) {
dflash_capture->hidden_gpu.clear();
...
return;
}This means multi-GPU tape is enabled but hidden GPU capture always falls back to eval callbacks. The PR body says "hidden capture stays on eval callback" — so this is intentional. But it's not documented in the code. Please add a comment at L1969 explaining why hidden GPU capture is not yet enabled for multi-GPU (e.g., "// Hidden GPU capture requires same-device graph output tensors; not yet supported for multi-GPU layer splits").
2. allocate_prefill_gpu still has an ungated n_devices() > 1 early return
// llama-context.cpp L2061-2063 (unmodified in this diff)
if (model.n_devices() > 1) {
return false;
}Same situation — this is probably intentional but should have a comment noting it's not yet multi-GPU aware.
3. Env var GGML_DFLASH_ALLOW_MULTI_GPU_TAPE behavior
The implementation at L1138:
return env && env[0] != '\0' && std::strcmp(env, "0") != 0;This means =1, =yes, =banana all enable it. This matches the pattern used by GGML_DFLASH_GPU_RING, so it's consistent — good. But the env var is undocumented. Please add at minimum:
- A one-line comment at the
dflash_allow_multi_gpu_tape()function explaining when/why to use it - A note in the PR body about the interaction: GPU Ring (
GGML_DFLASH_GPU_RING) controls the cross-attention ring path; this env var controls the recurrent tape path. They are independent.
4. set_dflash_gpu_capture forced-on behavior
dflash_capture->gpu_capture_enabled =
enabled || (model.n_devices() > 1 && dflash_allow_multi_gpu_tape());This means if the server calls set_dflash_gpu_capture(false) but the env var is set, GPU capture stays enabled on multi-GPU. This is probably correct for the tape path, but it means an explicit disable request is silently overridden. Consider at least a LLAMA_LOG_INFO when this override fires, so users don't get confused if they try to force CPU-only.
5. std::map include
Adding #include <map> for the layers_by_dev diagnostic log is fine, but the project style prefers avoiding "fancy-looking modern STL constructs" (CONTRIBUTING.md). A simple loop counting devices would avoid the include. Not blocking — just flagging for awareness.
6. touched_devices vector in copy_cell hot path
std::vector<int> touched_devices; // allocated every copy_cell callcopy_cell is called per-cell during recurrent state management. Allocating a std::vector on every call adds heap pressure. Consider a small fixed-size array (max 8 devices) or moving touched_devices into the class as a reusable scratch buffer, similar to how copy_plan_entries is already a member.
7. AI disclosure
CONTRIBUTING.md requires: "Disclose that AI was used in your PR description." The PR body doesn't currently mention AI assistance. If AI tools were used during development, please add a disclosure line.
Minor / Nit
- L1408 indentation change: The
dflash_eval_callbackblock has a pure re-indent (4 spaces less). This makes the diff noisy but doesn't change logic. Consider splitting whitespace-only changes into a separate commit for cleaner review history. - Unused variable removal:
n_embd_rwas removed fromtape_replay— correct, it was unused after the refactor. Good cleanup. - Double
fn_preparecall intape_replay_gdn_direct_from_cpu_tapeat L2930-2931 still exists (called once in the validation loop, then again in the launch loop). This was pre-existing, not introduced by this PR, but worth noting.
Questions for the author
- Have you tested with
GGML_DFLASH_ALLOW_MULTI_GPU_TAPEunset (not=0, literally absent from environment) to confirm the existing behavior is byte-identical? - Is there a reason
tape_replay_conv_gpuat L2959 still has the hardmodel.n_devices() > 1 → return falsegate? This means conv state replay always falls back to CPU on multi-GPU even with the env var set. Is conv replay not needed for the measured speedup, or is this a follow-up? - The validation matrix in the PR body only shows
kv_report_module. The continuation-state doc shows 3 prompts with varying acceptance (27%–41%). Consider including all 3 in the PR body for reviewer confidence.
Summary
| Area | Status |
|---|---|
| Core tape placement correctness | ✅ Sound — per-layer device-aware allocation |
| ROCm compatibility | ✅ Consistent CUDA/ROCm buffer name + registry handling |
| Multi-GPU D2D copy plan | ✅ Per-device set/sync instead of single-device assumption |
| Replay device validation | ✅ All inputs verified same-device before launch |
| Fallback preservation | ✅ CPU path intact when env var disabled/unset |
| Documentation | |
| Performance hot path | touched_devices vector allocation in copy_cell |
| AI disclosure | |
| Scheduler callback gate | ✅ Safe optimization |
- Documented n_devices() > 1 multi-GPU limitations for hidden and prefill GPU allocations. - Documented role of GGML_DFLASH_ALLOW_MULTI_GPU_TAPE environment variable. - Added warning logging when explicitly disabling GPU capture is overridden by the env var. - Avoid std::map dependency in allocate_tape_gpu by using a vector of structures to count device occurrences. - Move touched_devices vector out of copy_cell hot-path stack to class member to avoid heap pressure.
✅ Review Complete — Ready for Maintainer ReviewAll review items from the initial review have been addressed in commit 4b208f7:
Additional validation noted: Determinism test, VRAM leak profiling (15 requests), multi-turn session stability. Recommendation: This PR is ready to come out of draft and be sent to the maintainer for final review. Review model: Claude Opus 4.6 (Thinking) |
|
Thank you. I'm currently investigating rebasing to latest llama.cpp, so it might take some time before I review it. |
- Auto-enable peer access dynamically in dflash_cross_ring_gpu_write_d2d to avoid manual setup. - Clean up unused struct dflash_hidden_gpu_layer and dead buf/ctx fields in dflash_hidden_gpu. - Resolve speculative.cpp duplication by introducing a public llama_dflash_allow_multi_gpu_tape() API. - Replace magic array sizes of 32 with GGML_CUDA_MAX_DEVICES for touched devices tracking. - Document unconditional call to tape_replay_conv_gpu as internally gated. - Update test-dflash-plumbing.cpp regex assertions for renamed gating function.
Phase 2 Follow-up Review —
|
|
@Anbeeld Thanks for your work. :) Just in case it's useful, I've a follow up PR clearing out the rest of the Multi-GPU ROCm stuff I found here: |
Goal / Overview
This PR implements Phase 2 of ROCm Multi-GPU DFlash support: enabling GPU conv state replay (
tape_replay_conv_gpu), GPU hidden-state capture, and direct GPU-to-GPU peer copying into the speculative verification ring buffer. It also addresses all reviewer feedback from the Phase 2 evaluation.Key Implementations
GPU Conv State Replay & Hidden-State Capture:
tape_replay_conv_gpuon multi-GPU setups.cudaMemcpyPeerAsync) to write hidden states directly into the speculative verification ring buffer.cpu_copy=0.000 msoverhead in GPU ring writes.PR Review Feedback Addressal:
cudaDeviceCanAccessPeerandcudaDeviceEnablePeerAccess) insidedflash_cross_ring_gpu_write_d2don the first cross-device copy. This guarantees out-of-the-box D2D transfer support without requiring users to manually configure environment variables (likeGGML_CUDA_P2P=1).struct dflash_hidden_gpu_layerdefinition and removed the legacy singlebuf/ctxfields fromdflash_hidden_gpu.llama_dflash_allow_multi_gpu_tape()inllama.hand resolving duplications inllama-context.cppandcommon/speculative.cpp.32withGGML_CUDA_MAX_DEVICES(incorporating safe fallback macros).tape_replay_conv_gpu.Verification & Performance Matrix (Dual AMD Radeon RX 7800 XT)
All benchmarks were executed at
temperature=0.0,ctx=8192,ubatch=128,batch=512, using cache typesq5_0(K) andq4_1(V) under alayersplit strategy.task_store_module: 18.23 tok/skv_report_module: 18.35 tok/sdoubly_linked_list: 19.06 tok/s-fit off):task_store_module: 20.75 tok/s (1.14x speedup, 25.45% acceptance)kv_report_module: 31.27 tok/s (1.70x speedup, 40.73% acceptance)doubly_linked_list: 23.16 tok/s (1.22x speedup, 24.61% acceptance)task_store_module: 22.61 tok/s (1.24x speedup, 29.77% acceptance withcross_ctx=1024)kv_report_module: 31.80 tok/s (1.73x speedup, 34.11% acceptance withn_max=24)doubly_linked_list: 24.21 tok/s (1.27x speedup, 20.90% acceptance withn_max=32)-fit onStability Test):-fit onis 100% stable and yields:kv_report_module: 31.86 tok/s (1.74x speedup, 40.73% acceptance)-otd '.*=ROCm0'while leaving the verifier split-ts 1,1.kv_report_module: 31.66 tok/s (1.73x speedup). This demonstrates that split-drafter placement peer copy overhead is negligible, but pinning still saves minor latency.How to Reproduce / Benchmark Commands
Run the commands below from the parent directory:
# Run server with dynamic P2P and -fit on enabled: GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=1 GGML_DFLASH_GPU_RING=1 ./build/bin/llama-server \ -hf unsloth/Qwen3.6-27B-GGUF:Q5_K_S --no-mmproj \ --spec-draft-hf Anbeeld/Qwen3.6-27B-DFlash-GGUF:Q4_K_M \ --spec-type dflash --spec-branch-budget 0 --spec-dflash-cross-ctx 512 \ --spec-draft-ctx-size 2048 --spec-draft-n-max 12 --no-spec-dm-adaptive \ -ngl all --spec-draft-ngl all -sm layer -ts 1,1 \ -c 8192 -np 1 --kv-unified -b 512 -ub 128 \ --cache-type-k q5_0 --cache-type-v q4_1 --flash-attn on -fit on \ --cache-ram 0 -rea off --host 0.0.0.0 --port 8082Disclosure: This Pull Request was developed with the assistance of AI coding assistants (Antigravity).