Host IR Intermediate Allocation #5665

base: main

Conversation
Greptile Summary

This PR adds support for intermediate buffer allocation in Host IR by:

Key Issues:

Confidence Score: 2/5

Important Files Changed / File Analysis
Sequence Diagram

```mermaid
sequenceDiagram
    participant F as Fusion
    participant L as Lowering
    participant HIC as HostIrContainer
    participant HIE as HostIrEvaluator
    participant KE as KernelExecutor
    F->>L: lowerSegmentedFusionToHostIr()
    L->>L: getIntermediateBuffersFromKernelIR()
    Note over L: Extract intermediate TVs from kernel summary
    L->>HIC: Create kir::Allocate for intermediates
    L->>HIC: Create LaunchKernel(ins, outs, intermediates)
    HIC->>HIE: runWithInputs()
    HIE->>HIE: handle(kir::Allocate)
    Note over HIE: Allocate intermediate tensors
    HIE->>HIE: handle(LaunchKernel)
    HIE->>HIE: Collect inputs, outputs, intermediates
    HIE->>KE: run(args, outputs, params, intermediates)
    KE->>KE: args.push(intermediate_args)
    Note over KE: Skip internal allocation if pre-allocated
    KE->>KE: Launch CUDA kernel
```
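The handoff in the diagram, where the Host IR evaluator pre-allocates intermediates and the executor skips its own internal allocation, can be sketched with stand-in types. All names here (`Buffer`, `Executor`, `run`) are hypothetical illustrations, not nvFuser's actual API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a device buffer.
struct Buffer {
  long numel = 0;
};

// Stand-in executor: uses caller-provided intermediates when present,
// otherwise falls back to allocating them internally.
struct Executor {
  int internal_allocations = 0;

  std::vector<Buffer> run(
      const std::vector<Buffer>& inputs,
      const std::vector<Buffer>& intermediates) {
    std::vector<Buffer> args = inputs;
    const bool preallocated = !intermediates.empty();
    if (preallocated) {
      // Host IR path: push the pre-allocated intermediates as kernel args.
      for (const Buffer& b : intermediates) {
        args.push_back(b);
      }
    } else {
      // Fallback path: allocate internally (the behavior the PR bypasses).
      args.push_back(Buffer{128});
      ++internal_allocations;
    }
    return args;
  }
};
```

When intermediates are passed in, the executor's internal allocation counter stays at zero, which is the invariant the `Skip internal allocation if pre-allocated` note describes.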
10 files reviewed, 1 comment
```cpp
// Allocate intermediate buffers
for (auto* intermediate : intermediates) {
  auto* allocate =
      IrBuilder::create<kir::Allocate>(intermediate, MemoryType::Global);
  innermost_scope.push_back(allocate);
}
```
logic: The kir::Allocate is created without propagating zero_init and resets_to_zero from the original allocation. This could cause correctness issues for kernels that require zero-initialized intermediate buffers (e.g., grid synchronization semaphores).
The original allocation info is available in summary.global_allocations - consider passing alloc->zeroInit() and alloc->resetsToZero() to the new kir::Allocate.
Note: This would require refactoring getIntermediateBuffersForKernel to return the original allocation info alongside the cloned TensorView.
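The refactor suggested above, returning the original allocation's flags alongside each cloned intermediate, could look like the following sketch. `IntermediateInfo`, its field names, and `getIntermediateBuffers` are hypothetical stand-ins, not nvFuser's actual types:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: pairs a cloned tensor (represented here by its name)
// with the zero-init flags recorded on the original kir::Allocate.
struct IntermediateInfo {
  std::string tv_name;          // stand-in for the cloned TensorView*
  bool zero_init = false;       // would come from alloc->zeroInit()
  bool resets_to_zero = false;  // would come from alloc->resetsToZero()
};

// Stand-in for a getIntermediateBuffersForKernel that preserves the
// allocation flags instead of returning bare TensorViews.
std::vector<IntermediateInfo> getIntermediateBuffers() {
  // Example: a grid-sync semaphore buffer must be zero-initialized.
  return {
      {"T_work", false, false},
      {"T_semaphore", true, true},
  };
}
```

The allocation loop in the diff could then forward each `info.zero_init` and `info.resets_to_zero` when creating the new `kir::Allocate`, assuming its constructor accepts those flags.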
Force-pushed from ca98146 to 8f0ab51.
Description
Relevant files:
- Enhancement: 7 files
- Tests
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests

⚡ Recommended focus areas for review
Commented Code
```cpp
if (!intermediates_preallocated) {
  //KernelArgumentHolder local_intermediate_args;
  //FUSER_PERF_SCOPE("KernelExecutor::runFusion::intermediates");
  //// Intermediates just use logical sizes and strides even though they're
  //// really allocation sizes and strides.
  ////
  //// This is simply because the convention used is that allocation
  //// sizes/strides are optional, logical are not.
  //for (const auto intermediate_i :
  //     arange(executor_entry->intermediates.size())) {
  //  const auto& buf_info = executor_entry->intermediates.at(intermediate_i);
  //  bool has_expansion = false;
  //  std::vector<int64_t> unexpanded_sizes;
  //  unexpanded_sizes.reserve(buf_info.shape_info.logical_sizes.size());
  //  NVF_ERROR(
  //      buf_info.shape_info.logical_sizes.size() ==
  //      buf_info.shape_info.logical_strides.size())
  //  for (const auto j : arange(buf_info.shape_info.logical_sizes.size())) {
  //    if (buf_info.shape_info.logical_strides[j] == 0) {
  //      has_expansion = true;
  //      unexpanded_sizes.push_back(1L);
  //    } else {
  //      unexpanded_sizes.push_back(buf_info.shape_info.logical_sizes[j]);
  //    }
  //  }
  //  at::Tensor intermediate_buffer;
  //  if (buf_info.zero_init) {
  //    if (isOptionEnabled(EnableOption::ReuseZeroedMemory) ||
  //        buf_info.resets_to_zero) {
  //      // Allow access to reusable zeroed memory if buffer is guaranteed
  //      // to reset to zero upon completion of the kernel, or if we have
  //      // enabled the option (unsafe)
  //      intermediate_buffer = contigZeroedTensor(
  //          unexpanded_sizes, buf_info.type, compiled_kernel_->device());
  //    } else {
  //      intermediate_buffer = at::zeros(
  //          unexpanded_sizes,
  //          at::TensorOptions()
  //              .dtype(buf_info.type)
  //              .device(compiled_kernel_->device()));
  //    }
  //  } else {
  //    intermediate_buffer = at::native::empty_cuda(
  //        unexpanded_sizes,
  //        buf_info.type,
  //        c10::nullopt,
  //        compiled_kernel_->device(),
  //        c10::nullopt);
  //    if (shouldFillAllocationWithNan()) {
  //      fillTensorWithNan(intermediate_buffer);
  //    }
  //  }
  //  if (has_expansion) {
  //    intermediate_buffer = at::native::expand(
  //        intermediate_buffer, buf_info.shape_info.logical_sizes);
  //  }
  //  args.push(intermediate_buffer);
  //  local_intermediate_args.push(intermediate_buffer);
  //  if (buf_info.is_profile_buffer) {
  //    profile_buffer = intermediate_buffer;
  //  }
  //}
```
logic: The fallback intermediate allocation path is entirely commented out. This breaks all non-Host-IR code paths that require intermediate buffers (e.g., grid reductions, sync buffers). Unless this PR is intended to be WIP, this code should remain functional for backward compatibility.
Consider either:
- Re-enabling the fallback path
- Adding NVF_ERROR(!intermediates_preallocated, "...") to explicitly fail if this path is entered
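The second suggestion, failing loudly when the disabled fallback path is entered, amounts to a small guard. The sketch below uses a plain stand-in macro (`CHECK_OR_DIE`) rather than nvFuser's `NVF_ERROR`, and `runFusionStub` is a hypothetical illustration of where such a guard would sit:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Minimal stand-in for an NVF_ERROR-style check: abort with a message
// when the condition is false.
#define CHECK_OR_DIE(cond, msg)                   \
  do {                                            \
    if (!(cond)) {                                \
      std::fprintf(stderr, "error: %s\n", (msg)); \
      std::abort();                               \
    }                                             \
  } while (0)

bool runFusionStub(bool intermediates_preallocated) {
  // Fail explicitly instead of silently launching a kernel whose
  // intermediate buffers were never allocated.
  CHECK_OR_DIE(
      intermediates_preallocated,
      "fallback intermediate allocation is disabled; "
      "intermediates must be pre-allocated by the Host IR evaluator");
  return true;  // kernel launch would follow here
}
```

The guard makes the contract explicit: any caller reaching this point without pre-allocated intermediates gets an immediate, descriptive failure instead of undefined behavior at kernel launch.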
No description provided.