
Conversation

@mdavis36
Collaborator

No description provided.

@greptile-apps
Contributor

greptile-apps bot commented Dec 11, 2025

Greptile Overview

Greptile Summary

This PR adds support for intermediate buffer allocation in Host IR by:

  • Extracting intermediate buffers from kernel IR during lowering via getIntermediateBuffersFromKernelIR()
  • Storing intermediates as attributes in LaunchKernel expression
  • Passing pre-allocated intermediates through the evaluator and JIT paths to KernelExecutor::run()

Key Issues:

  • The fallback intermediate allocation path in executor.cpp is entirely commented out, which will break all code paths that don't use Host IR pre-allocation
  • The zero_init and resets_to_zero flags from the original kernel allocations are not propagated when creating new kir::Allocate nodes (noted in previous thread)
  • Test ExprEvalAndKernel is re-enabled but only works with HostIrJit disabled

Confidence Score: 2/5

  • This PR has significant issues: the fallback allocation path is disabled, breaking non-Host-IR code paths.
  • Score of 2 reflects that the core feature (Host IR intermediate allocation) appears functional, but the commented-out fallback code in executor.cpp breaks backward compatibility. The PR also has the known issue of missing zero_init propagation. Commit messages like "save." and "Temp!" suggest this may be a work-in-progress.
  • csrc/runtime/executor.cpp - fallback allocation path is disabled; csrc/host_ir/lowering.cpp - missing zero_init/resets_to_zero propagation

Important Files Changed

File Analysis

Filename (Score): Overview

  • csrc/host_ir/lowering.cpp (3/5): Adds getIntermediateBuffersFromKernelIR to extract intermediate buffers from kernel IR, creates kir::Allocate nodes for them during lowering, and passes them to LaunchKernel. Missing zero_init/resets_to_zero propagation from the original allocations.
  • csrc/runtime/executor.cpp (2/5): Adds an intermediate_args parameter to KernelExecutor::run(). The fallback allocation path is entirely commented out and will break non-Host-IR code paths that use intermediates.
  • csrc/host_ir/ir.h (5/5): Adds an intermediates parameter to the LaunchKernel constructor and an intermediateBuffers() accessor method.
  • csrc/host_ir/evaluator.cpp (5/5): Collects intermediate buffers from LaunchKernel and passes them to KernelExecutor::run() as a separate argument.
  • tests/cpp/test_host_ir_integration.cpp (4/5): Re-enables the ExprEvalAndKernel test but disables the HostIrJit option to use the evaluator path instead.

Sequence Diagram

sequenceDiagram
    participant F as Fusion
    participant L as Lowering
    participant HIC as HostIrContainer
    participant HIE as HostIrEvaluator
    participant KE as KernelExecutor

    F->>L: lowerSegmentedFusionToHostIr()
    L->>L: getIntermediateBuffersFromKernelIR()
    Note over L: Extract intermediate TVs from kernel summary
    L->>HIC: Create kir::Allocate for intermediates
    L->>HIC: Create LaunchKernel(ins, outs, intermediates)
    
    HIC->>HIE: runWithInputs()
    HIE->>HIE: handle(kir::Allocate)
    Note over HIE: Allocate intermediate tensors
    HIE->>HIE: handle(LaunchKernel)
    HIE->>HIE: Collect inputs, outputs, intermediates
    HIE->>KE: run(args, outputs, params, intermediates)
    KE->>KE: args.push(intermediate_args)
    Note over KE: Skip internal allocation if pre-allocated
    KE->>KE: Launch CUDA kernel


@greptile-apps greptile-apps bot left a comment


10 files reviewed, 1 comment


Comment on lines +391 to +429
// Allocate intermediate buffers
for (auto* intermediate : intermediates) {
  auto* allocate =
      IrBuilder::create<kir::Allocate>(intermediate, MemoryType::Global);
  innermost_scope.push_back(allocate);
}

logic: The kir::Allocate is created without propagating zero_init and resets_to_zero from the original allocation. This could cause correctness issues for kernels that require zero-initialized intermediate buffers (e.g., grid synchronization semaphores).

The original allocation info is available in summary.global_allocations - consider passing alloc->zeroInit() and alloc->resetsToZero() to the new kir::Allocate.

Suggested change
// Allocate intermediate buffers
for (auto* intermediate : intermediates) {
  auto* allocate =
      IrBuilder::create<kir::Allocate>(intermediate, MemoryType::Global);
  innermost_scope.push_back(allocate);
}
// Allocate intermediate buffers, forwarding the original allocation flags
// (assumes the refactor noted below, where each entry carries its original
// kir::Allocate; the exact kir::Allocate overload may differ)
for (auto& [intermediate, orig_alloc] : intermediates) {
  auto* allocate = IrBuilder::create<kir::Allocate>(
      intermediate,
      MemoryType::Global,
      /*shape=*/std::vector<Val*>{},
      orig_alloc->zeroInit(),
      orig_alloc->resetsToZero());
  innermost_scope.push_back(allocate);
}

Note: This would require refactoring getIntermediateBuffersForKernel to return the original allocation info alongside the cloned TensorView.

@mdavis36 mdavis36 linked an issue Dec 11, 2025 that may be closed by this pull request
@mdavis36 mdavis36 force-pushed the md/hir-intermediates branch from ca98146 to 8f0ab51 on December 16, 2025 at 16:14
@github-actions

Description

  • Add intermediate buffer allocation support to Host IR

  • Extract and reconstruct intermediate buffers from kernel IR during lowering

  • Pass intermediates separately to KernelExecutor to avoid allocation conflicts

  • Update LaunchKernel interface to store and access intermediate buffers

Changes walkthrough

Relevant files

Enhancement (7 files)

  • evaluator.cpp (+11/-1): Collect and pass intermediate buffers to KernelExecutor
  • ir.cpp (+12/-0): Update LaunchKernel constructor and toString for intermediates
  • jit.cpp (+5/-0): Pack intermediate buffers for JIT compilation
  • lowering.cpp (+76/-2): Extract intermediate buffers from kernel IR and allocate in Host IR
  • executor.cpp (+82/-62): Handle pre-allocated intermediates in KernelExecutor::run
  • ir.h (+11/-0): Add intermediateBuffers() method and constructor parameter
  • executor.h (+2/-1): Update run() method signature to accept intermediate_args

Tests (3 files)

  • test_host_ir_evaluator.cpp (+3/-0): Update LaunchKernel constructor calls in tests
  • test_host_ir_integration.cpp (+2/-1): Enable ExprEvalAndKernel test with intermediate support
  • test_host_ir_jit.cpp (+1/-0): Update LaunchKernel constructor calls in JIT tests

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Commented Code

A large block of commented-out code (lines 1143-1203) appears to be the original intermediate allocation logic. While it might be kept intentionally for reference, it should either be removed or documented with a clear explanation of why it remains.

//KernelArgumentHolder local_intermediate_args;
//FUSER_PERF_SCOPE("KernelExecutor::runFusion::intermediates");
//// Intermediates just use logical sizes and strides even though they're
//// really allocation sizes and strides.
////
//// This is simply because the convention used is that allocation
//// sizes/strides are optional, logical are not.
//for (const auto intermediate_i :
//     arange(executor_entry->intermediates.size())) {
//  const auto& buf_info = executor_entry->intermediates.at(intermediate_i);
//  bool has_expansion = false;
//  std::vector<int64_t> unexpanded_sizes;
//  unexpanded_sizes.reserve(buf_info.shape_info.logical_sizes.size());
//  NVF_ERROR(
//      buf_info.shape_info.logical_sizes.size() ==
//      buf_info.shape_info.logical_strides.size())
//  for (const auto j : arange(buf_info.shape_info.logical_sizes.size())) {
//    if (buf_info.shape_info.logical_strides[j] == 0) {
//      has_expansion = true;
//      unexpanded_sizes.push_back(1L);
//    } else {
//      unexpanded_sizes.push_back(buf_info.shape_info.logical_sizes[j]);
//    }
//  }
//  at::Tensor intermediate_buffer;
//  if (buf_info.zero_init) {
//    if (isOptionEnabled(EnableOption::ReuseZeroedMemory) ||
//        buf_info.resets_to_zero) {
//      // Allow access to reusable zeroed memory if buffer is guaranteed
//      // to reset to zero upon completion of the kernel, or if we have
//      // enabled the option (unsafe)
//      intermediate_buffer = contigZeroedTensor(
//          unexpanded_sizes, buf_info.type, compiled_kernel_->device());
//    } else {
//      intermediate_buffer = at::zeros(
//          unexpanded_sizes,
//          at::TensorOptions()
//              .dtype(buf_info.type)
//              .device(compiled_kernel_->device()));
//    }
//  } else {
//    intermediate_buffer = at::native::empty_cuda(
//        unexpanded_sizes,
//        buf_info.type,
//        c10::nullopt,
//        compiled_kernel_->device(),
//        c10::nullopt);
//    if (shouldFillAllocationWithNan()) {
//      fillTensorWithNan(intermediate_buffer);
//    }
//  }
//  if (has_expansion) {
//    intermediate_buffer = at::native::expand(
//        intermediate_buffer, buf_info.shape_info.logical_sizes);
//  }
//  args.push(intermediate_buffer);
//  local_intermediate_args.push(intermediate_buffer);
//  if (buf_info.is_profile_buffer) {
//    profile_buffer = intermediate_buffer;
//  }
//}
Error Handling

The getIntermediateBuffersFromKernelIR function doesn't have explicit error handling for cases where kernel IR cloning fails or intermediate buffer reconstruction encounters issues. Consider adding validation.

std::vector<Val*> getIntermediateBuffersFromKernelIR(
    const kir::Kernel* kernel,
    hir::HostIrContainer* host_container) {
  std::vector<Val*> intermediates;

  // Set Host IR as active container for building new IR nodes
  FusionGuard fg(host_container);

  // Create an IrCloner to clone symbolic expressions from kernel IR to Host IR
  IrCloner kernel_to_host_cloner(host_container);

  const kir::KernelSummary& summary = kernel->summary();

  for (const auto* alloc : summary.global_allocations) {
    TensorView* kernel_tv = alloc->buffer()->as<TensorView>();

    // Skip if it's a fusion output
    if (kernel_tv->isFusionOutput()) {
      continue;
    }

    // Skip if it's an alias
    if (alloc->alias() != nullptr) {
      continue;
    }

    // Reconstruct the TensorView in Host IR
    // We need to manually rebuild it to ensure all symbolic extents are
    // properly cloned with their definitions
    std::vector<IterDomain*> new_ids;
    for (IterDomain* kernel_id : kernel_tv->getLoopDomain()) {
      // Clone the extent and its definition to Host IR
      Val* extent = kernel_to_host_cloner.clone(kernel_id->extent());

      // Build new IterDomain in Host IR
      auto* new_id = IterDomainBuilder(kernel_id->start(), extent)
                         .iter_type(kernel_id->getIterType())
                         .build();
      new_ids.push_back(new_id);
    }

    // Create TensorDomain
    auto* td = IrBuilder::create<TensorDomain>(new_ids);

    // Create TensorView in Host IR
    auto* host_tv = IrBuilder::create<TensorView>(
        td, kernel_tv->dtype(), kernel_tv->getMemoryType());

    intermediates.push_back(host_tv);
  }

  return intermediates;
}
Performance Impact

The new intermediate buffer handling adds additional parameter passing and conditional logic. While the goal is to improve performance by pre-allocating buffers, the actual performance impact should be measured and documented to ensure the changes provide the intended benefits.

// Collect intermediate buffers (allocated in Host IR)
// Pass them separately to avoid confusion with inputs/outputs
KernelArgumentHolder intermediates;
for (Val* intermediate : launch_kernel->intermediateBuffers()) {
  intermediates.push(getKnownConcreteValue(intermediate));
}

args.setDeviceIndex();

// run the compiled kernel
container_->getKernelExecutor(launch_kernel->groupId())
    .run(
        args,
        outputs,
        launch_kernel->launchParams(),
        launch_kernel->compileParams(),
        intermediates);


@greptile-apps greptile-apps bot left a comment


10 files reviewed, 1 comment


Comment on lines +1142 to +1203
if (!intermediates_preallocated) {
//KernelArgumentHolder local_intermediate_args;
//FUSER_PERF_SCOPE("KernelExecutor::runFusion::intermediates");
//// Intermediates just use logical sizes and strides even though they're
//// really allocation sizes and strides.
////
//// This is simply because the convention used is that allocation
//// sizes/strides are optional, logical are not.
//for (const auto intermediate_i :
//     arange(executor_entry->intermediates.size())) {
//  const auto& buf_info = executor_entry->intermediates.at(intermediate_i);
//  bool has_expansion = false;
//  std::vector<int64_t> unexpanded_sizes;
//  unexpanded_sizes.reserve(buf_info.shape_info.logical_sizes.size());
//  NVF_ERROR(
//      buf_info.shape_info.logical_sizes.size() ==
//      buf_info.shape_info.logical_strides.size())
//  for (const auto j : arange(buf_info.shape_info.logical_sizes.size())) {
//    if (buf_info.shape_info.logical_strides[j] == 0) {
//      has_expansion = true;
//      unexpanded_sizes.push_back(1L);
//    } else {
//      unexpanded_sizes.push_back(buf_info.shape_info.logical_sizes[j]);
//    }
//  }
//  at::Tensor intermediate_buffer;
//  if (buf_info.zero_init) {
//    if (isOptionEnabled(EnableOption::ReuseZeroedMemory) ||
//        buf_info.resets_to_zero) {
//      // Allow access to reusable zeroed memory if buffer is guaranteed
//      // to reset to zero upon completion of the kernel, or if we have
//      // enabled the option (unsafe)
//      intermediate_buffer = contigZeroedTensor(
//          unexpanded_sizes, buf_info.type, compiled_kernel_->device());
//    } else {
//      intermediate_buffer = at::zeros(
//          unexpanded_sizes,
//          at::TensorOptions()
//              .dtype(buf_info.type)
//              .device(compiled_kernel_->device()));
//    }
//  } else {
//    intermediate_buffer = at::native::empty_cuda(
//        unexpanded_sizes,
//        buf_info.type,
//        c10::nullopt,
//        compiled_kernel_->device(),
//        c10::nullopt);
//    if (shouldFillAllocationWithNan()) {
//      fillTensorWithNan(intermediate_buffer);
//    }
//  }
//  if (has_expansion) {
//    intermediate_buffer = at::native::expand(
//        intermediate_buffer, buf_info.shape_info.logical_sizes);
//  }
//  args.push(intermediate_buffer);
//  local_intermediate_args.push(intermediate_buffer);
//  if (buf_info.is_profile_buffer) {
//    profile_buffer = intermediate_buffer;
//  }
//}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The fallback intermediate allocation path is entirely commented out. This breaks all non-Host-IR code paths that require intermediate buffers (e.g., grid reductions, sync buffers). Unless this PR is intended to be WIP, this code should remain functional for backward compatibility.

Consider either:

  1. Re-enabling the fallback path
  2. Adding NVF_ERROR(intermediates_preallocated, "...") before this branch so callers that reach the disabled path fail explicitly instead of launching with missing buffers



Development

Successfully merging this pull request may close these issues.

hir::LaunchKernel Intermediate Buffer Support
