
Conversation

@mdavis36
Collaborator

No description provided.

@greptile-apps
Contributor

greptile-apps bot commented Dec 11, 2025

Greptile Overview

Greptile Summary

This PR adds support for intermediate buffer allocation in Host IR by:

  • Extracting intermediate buffers from kernel IR during lowering via getIntermediateBuffersFromKernelIR()
  • Storing intermediates as attributes in LaunchKernel expression
  • Passing pre-allocated intermediates through the evaluator and JIT paths to KernelExecutor::run()

Key Issues:

  • The fallback intermediate allocation path in executor.cpp is entirely commented out, which will break all code paths that don't use Host IR pre-allocation
  • The zero_init and resets_to_zero flags from the original kernel allocations are not propagated when creating new kir::Allocate nodes (noted in previous thread)
  • Test ExprEvalAndKernel is re-enabled but only works with HostIrJit disabled

Confidence Score: 2/5

  • This PR has significant issues: the fallback allocation path is disabled, breaking non-Host-IR code paths.
  • Score of 2 reflects that the core feature (Host IR intermediate allocation) appears functional, but the commented-out fallback code in executor.cpp breaks backward compatibility. The PR also has the known issue of missing zero_init propagation. Commit messages like "save." and "Temp!" suggest this may be a work-in-progress.
  • csrc/runtime/executor.cpp - fallback allocation path is disabled; csrc/host_ir/lowering.cpp - missing zero_init/resets_to_zero propagation

Important Files Changed

File Analysis

Filename (Score): Overview

  • csrc/host_ir/lowering.cpp (3/5): Adds getIntermediateBuffersFromKernelIR to extract intermediate buffers from kernel IR, creates kir::Allocate nodes for them during lowering, and passes them to LaunchKernel. Missing zero_init/resets_to_zero propagation from the original allocations.
  • csrc/runtime/executor.cpp (2/5): Adds an intermediate_args parameter to KernelExecutor::run(). The fallback allocation path is entirely commented out and will break non-Host-IR code paths that use intermediates.
  • csrc/host_ir/ir.h (5/5): Adds an intermediates parameter to the LaunchKernel constructor and an intermediateBuffers() accessor method.
  • csrc/host_ir/evaluator.cpp (5/5): Collects intermediate buffers from LaunchKernel and passes them to KernelExecutor::run() as a separate argument.
  • tests/cpp/test_host_ir_integration.cpp (4/5): Re-enables the ExprEvalAndKernel test but disables the HostIrJit option to use the evaluator path instead.

Sequence Diagram

sequenceDiagram
    participant F as Fusion
    participant L as Lowering
    participant HIC as HostIrContainer
    participant HIE as HostIrEvaluator
    participant KE as KernelExecutor

    F->>L: lowerSegmentedFusionToHostIr()
    L->>L: getIntermediateBuffersFromKernelIR()
    Note over L: Extract intermediate TVs from kernel summary
    L->>HIC: Create kir::Allocate for intermediates
    L->>HIC: Create LaunchKernel(ins, outs, intermediates)
    
    HIC->>HIE: runWithInputs()
    HIE->>HIE: handle(kir::Allocate)
    Note over HIE: Allocate intermediate tensors
    HIE->>HIE: handle(LaunchKernel)
    HIE->>HIE: Collect inputs, outputs, intermediates
    HIE->>KE: run(args, outputs, params, intermediates)
    KE->>KE: args.push(intermediate_args)
    Note over KE: Skip internal allocation if pre-allocated
    KE->>KE: Launch CUDA kernel


@greptile-apps greptile-apps bot left a comment


10 files reviewed, 1 comment


Comment on lines +391 to +429
// Allocate intermediate buffers
for (auto* intermediate : intermediates) {
  auto* allocate =
      IrBuilder::create<kir::Allocate>(intermediate, MemoryType::Global);
  innermost_scope.push_back(allocate);
}

logic: The kir::Allocate is created without propagating zero_init and resets_to_zero from the original allocation. This could cause correctness issues for kernels that require zero-initialized intermediate buffers (e.g., grid synchronization semaphores).

The original allocation info is available in summary.global_allocations - consider passing alloc->zeroInit() and alloc->resetsToZero() to the new kir::Allocate.

Suggested change
// Allocate intermediate buffers
for (auto* intermediate : intermediates) {
  auto* allocate =
      IrBuilder::create<kir::Allocate>(intermediate, MemoryType::Global);
  innermost_scope.push_back(allocate);
}
// Allocate intermediate buffers, forwarding the original allocation flags
// (assumes the refactor noted below, where each entry carries its original
// kir::Allocate; the exact kir::Allocate overload may differ)
for (auto& [intermediate, orig_alloc] : intermediates) {
  auto* allocate = IrBuilder::create<kir::Allocate>(
      intermediate,
      MemoryType::Global,
      /*shape=*/std::vector<Val*>{},
      orig_alloc->zeroInit(),
      orig_alloc->resetsToZero());
  innermost_scope.push_back(allocate);
}

Note: This would require refactoring getIntermediateBuffersForKernel to return the original allocation info alongside the cloned TensorView.

@mdavis36 mdavis36 linked an issue Dec 11, 2025 that may be closed by this pull request
@mdavis36 mdavis36 force-pushed the md/hir-intermediates branch from ca98146 to 8f0ab51 on December 16, 2025 at 16:14
@github-actions

Description

  • Add intermediate buffer allocation support to Host IR

  • Extract and reconstruct intermediate buffers from kernel IR during lowering

  • Pass intermediates separately to KernelExecutor to avoid allocation conflicts

  • Update LaunchKernel interface to store and access intermediate buffers

Changes walkthrough

Relevant files

Enhancement (7 files)

  • evaluator.cpp (+11/-1): Collect and pass intermediate buffers to KernelExecutor
  • ir.cpp (+12/-0): Update LaunchKernel constructor and toString for intermediates
  • jit.cpp (+5/-0): Pack intermediate buffers for JIT compilation
  • lowering.cpp (+76/-2): Extract intermediate buffers from kernel IR and allocate in Host IR
  • executor.cpp (+82/-62): Handle pre-allocated intermediates in KernelExecutor::run
  • ir.h (+11/-0): Add intermediateBuffers() method and constructor parameter
  • executor.h (+2/-1): Update run() method signature to accept intermediate_args

Tests (3 files)

  • test_host_ir_evaluator.cpp (+3/-0): Update LaunchKernel constructor calls in tests
  • test_host_ir_integration.cpp (+2/-1): Enable ExprEvalAndKernel test with intermediate support
  • test_host_ir_jit.cpp (+1/-0): Update LaunchKernel constructor calls in JIT tests

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Commented Code

A large block of commented-out code (lines 1143-1203) appears to be the original intermediate allocation logic. While it might be kept intentionally for reference, it should either be removed or documented with a clear explanation of why it remains.

//KernelArgumentHolder local_intermediate_args;
//FUSER_PERF_SCOPE("KernelExecutor::runFusion::intermediates");
//// Intermediates just use logical sizes and strides even though they're
//// really allocation sizes and strides.
////
//// This is simply because the convention used is that allocation
//// sizes/strides are optional, logical are not.
//for (const auto intermediate_i :
//     arange(executor_entry->intermediates.size())) {
//  const auto& buf_info = executor_entry->intermediates.at(intermediate_i);
//  bool has_expansion = false;
//  std::vector<int64_t> unexpanded_sizes;
//  unexpanded_sizes.reserve(buf_info.shape_info.logical_sizes.size());
//  NVF_ERROR(
//      buf_info.shape_info.logical_sizes.size() ==
//      buf_info.shape_info.logical_strides.size())
//  for (const auto j : arange(buf_info.shape_info.logical_sizes.size())) {
//    if (buf_info.shape_info.logical_strides[j] == 0) {
//      has_expansion = true;
//      unexpanded_sizes.push_back(1L);
//    } else {
//      unexpanded_sizes.push_back(buf_info.shape_info.logical_sizes[j]);
//    }
//  }
//  at::Tensor intermediate_buffer;
//  if (buf_info.zero_init) {
//    if (isOptionEnabled(EnableOption::ReuseZeroedMemory) ||
//        buf_info.resets_to_zero) {
//      // Allow access to reusable zeroed memory if buffer is guaranteed
//      // to reset to zero upon completion of the kernel, or if we have
//      // enabled the option (unsafe)
//      intermediate_buffer = contigZeroedTensor(
//          unexpanded_sizes, buf_info.type, compiled_kernel_->device());
//    } else {
//      intermediate_buffer = at::zeros(
//          unexpanded_sizes,
//          at::TensorOptions()
//              .dtype(buf_info.type)
//              .device(compiled_kernel_->device()));
//    }
//  } else {
//    intermediate_buffer = at::native::empty_cuda(
//        unexpanded_sizes,
//        buf_info.type,
//        c10::nullopt,
//        compiled_kernel_->device(),
//        c10::nullopt);
//    if (shouldFillAllocationWithNan()) {
//      fillTensorWithNan(intermediate_buffer);
//    }
//  }
//  if (has_expansion) {
//    intermediate_buffer = at::native::expand(
//        intermediate_buffer, buf_info.shape_info.logical_sizes);
//  }
//  args.push(intermediate_buffer);
//  local_intermediate_args.push(intermediate_buffer);
//  if (buf_info.is_profile_buffer) {
//    profile_buffer = intermediate_buffer;
//  }
//}
Error Handling

The getIntermediateBuffersFromKernelIR function doesn't have explicit error handling for cases where kernel IR cloning fails or intermediate buffer reconstruction encounters issues. Consider adding validation.

std::vector<Val*> getIntermediateBuffersFromKernelIR(
    const kir::Kernel* kernel,
    hir::HostIrContainer* host_container) {
  std::vector<Val*> intermediates;

  // Set Host IR as active container for building new IR nodes
  FusionGuard fg(host_container);

  // Create an IrCloner to clone symbolic expressions from kernel IR to Host IR
  IrCloner kernel_to_host_cloner(host_container);

  const kir::KernelSummary& summary = kernel->summary();

  for (const auto* alloc : summary.global_allocations) {
    TensorView* kernel_tv = alloc->buffer()->as<TensorView>();

    // Skip if it's a fusion output
    if (kernel_tv->isFusionOutput()) {
      continue;
    }

    // Skip if it's an alias
    if (alloc->alias() != nullptr) {
      continue;
    }

    // Reconstruct the TensorView in Host IR
    // We need to manually rebuild it to ensure all symbolic extents are
    // properly cloned with their definitions
    std::vector<IterDomain*> new_ids;
    for (IterDomain* kernel_id : kernel_tv->getLoopDomain()) {
      // Clone the extent and its definition to Host IR
      Val* extent = kernel_to_host_cloner.clone(kernel_id->extent());

      // Build new IterDomain in Host IR
      auto* new_id = IterDomainBuilder(kernel_id->start(), extent)
                         .iter_type(kernel_id->getIterType())
                         .build();
      new_ids.push_back(new_id);
    }

    // Create TensorDomain
    auto* td = IrBuilder::create<TensorDomain>(new_ids);

    // Create TensorView in Host IR
    auto* host_tv = IrBuilder::create<TensorView>(
        td, kernel_tv->dtype(), kernel_tv->getMemoryType());

    intermediates.push_back(host_tv);
  }

  return intermediates;
}
Performance Impact

The new intermediate buffer handling adds additional parameter passing and conditional logic. While the goal is to improve performance by pre-allocating buffers, the actual performance impact should be measured and documented to ensure the changes provide the intended benefits.

// Collect intermediate buffers (allocated in Host IR)
// Pass them separately to avoid confusion with inputs/outputs
KernelArgumentHolder intermediates;
for (Val* intermediate : launch_kernel->intermediateBuffers()) {
  intermediates.push(getKnownConcreteValue(intermediate));
}

args.setDeviceIndex();

// run the compiled kernel
container_->getKernelExecutor(launch_kernel->groupId())
    .run(
        args,
        outputs,
        launch_kernel->launchParams(),
        launch_kernel->compileParams(),
        intermediates);


@greptile-apps greptile-apps bot left a comment


10 files reviewed, 1 comment


Comment on lines +1142 to +1203
if (!intermediates_preallocated) {
//KernelArgumentHolder local_intermediate_args;
//FUSER_PERF_SCOPE("KernelExecutor::runFusion::intermediates");
//// Intermediates just use logical sizes and strides even though they're
//// really allocation sizes and strides.
////
//// This is simply because the convention used is that allocation
//// sizes/strides are optional, logical are not.
//for (const auto intermediate_i :
//     arange(executor_entry->intermediates.size())) {
//  const auto& buf_info = executor_entry->intermediates.at(intermediate_i);
//  bool has_expansion = false;
//  std::vector<int64_t> unexpanded_sizes;
//  unexpanded_sizes.reserve(buf_info.shape_info.logical_sizes.size());
//  NVF_ERROR(
//      buf_info.shape_info.logical_sizes.size() ==
//      buf_info.shape_info.logical_strides.size())
//  for (const auto j : arange(buf_info.shape_info.logical_sizes.size())) {
//    if (buf_info.shape_info.logical_strides[j] == 0) {
//      has_expansion = true;
//      unexpanded_sizes.push_back(1L);
//    } else {
//      unexpanded_sizes.push_back(buf_info.shape_info.logical_sizes[j]);
//    }
//  }
//  at::Tensor intermediate_buffer;
//  if (buf_info.zero_init) {
//    if (isOptionEnabled(EnableOption::ReuseZeroedMemory) ||
//        buf_info.resets_to_zero) {
//      // Allow access to reusable zeroed memory if buffer is guaranteed
//      // to reset to zero upon completion of the kernel, or if we have
//      // enabled the option (unsafe)
//      intermediate_buffer = contigZeroedTensor(
//          unexpanded_sizes, buf_info.type, compiled_kernel_->device());
//    } else {
//      intermediate_buffer = at::zeros(
//          unexpanded_sizes,
//          at::TensorOptions()
//              .dtype(buf_info.type)
//              .device(compiled_kernel_->device()));
//    }
//  } else {
//    intermediate_buffer = at::native::empty_cuda(
//        unexpanded_sizes,
//        buf_info.type,
//        c10::nullopt,
//        compiled_kernel_->device(),
//        c10::nullopt);
//    if (shouldFillAllocationWithNan()) {
//      fillTensorWithNan(intermediate_buffer);
//    }
//  }
//  if (has_expansion) {
//    intermediate_buffer = at::native::expand(
//        intermediate_buffer, buf_info.shape_info.logical_sizes);
//  }
//  args.push(intermediate_buffer);
//  local_intermediate_args.push(intermediate_buffer);
//  if (buf_info.is_profile_buffer) {
//    profile_buffer = intermediate_buffer;
//  }
//}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The fallback intermediate allocation path is entirely commented out. This breaks all non-Host-IR code paths that require intermediate buffers (e.g., grid reductions, sync buffers). Unless this PR is intended to be WIP, this code should remain functional for backward compatibility.

Consider either:

  1. Re-enabling the fallback path
  2. Adding NVF_ERROR(intermediates_preallocated, "...") before this branch so callers that reach the disabled path fail explicitly instead of launching with missing buffers



Development

Successfully merging this pull request may close these issues.

hir::LaunchKernel Intermediate Buffer Support
