You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The commented-out code in MatmulOp::evaluate might be useful for future development or debugging. Consider keeping it or documenting why it was removed.
// if (const auto rfactor_did_idx = getRFactorDeviceDimensionIndex(out());
//     rfactor_did_idx != -1) {
//   matmul_out = matmul_out.unsqueeze(rfactor_did_idx);
// }
// const auto& [sizes, strides] = inferShapeOfOutput(out(), ee);
// auto meta_out = at::detail::empty_strided_meta(sizes, strides, a.dtype());
// if (meta_out.is_contiguous()) {
//   return {matmul_out};
// }
// auto strided_matmul_out = at::empty_strided(sizes, strides, a.options());
// strided_matmul_out = strided_matmul_out.copy_(matmul_out);
// return {strided_matmul_out};
The performance test in tests/cpp/test_matmul_perf.cpp should be evaluated against a baseline to ensure that the changes do not introduce performance regressions.
os << "\n";
}
}
os << std::endl;
}
namespace {
// Returns the output shardings of the given fusion. As a short cut, if none of
// the outputs have a device mesh, returns an empty vector indicating
// single-GPU execution.
std::vector<Sharding> getOutputShardings(Fusion* fusion) {
  std::vector<Sharding> output_shardings;
  // Short cut: if no output carries a device mesh, this is single-GPU
  // execution and an empty vector is returned.
  if (std::none_of(
          fusion->outputs().begin(), fusion->outputs().end(), [](Val* v) {
            if (auto* tv = dynamic_cast<TensorView*>(v)) {
              return tv->hasDeviceMesh();
            }
            // Non-tensor outputs never have a mesh.
            return false;
          })) {
    return output_shardings;
  }

  output_shardings.reserve(fusion->outputs().size());
  for (Val* out_val : fusion->outputs()) {
    if (auto* out_tv = dynamic_cast<TensorView*>(out_val)) {
      // Hidden outputs are not returned to the user, so they get no
      // sharding entry.
      if (fusion->getOutputAlias(out_tv).hide_output) {
        continue;
      }
      const DeviceMesh& mesh = out_tv->getDeviceMesh();
      Sharding& output_sharding = output_shardings.emplace_back(mesh);
      if (mesh.size() > 0) {
        // Record, for each DID parallel type, which logical axis (if any)
        // of the output is sharded on it. -1 means "not sharded".
        for (const ParallelType parallel_type : kParallelTypeDIDs) {
          if (const auto axis = getShardedLogicalAxis(out_tv, parallel_type);
              axis != -1) {
            output_sharding.setAxisIsShardedOn(axis, parallel_type);
          }
        }
      }
    } else {
      // Non-tensor outputs are represented with an empty mesh.
      output_shardings.emplace_back(DeviceMesh());
    }
  }
  return output_shardings;
}
} // namespace
std::pair<KernelArgumentHolder, std::vector<Sharding>> FusionDefinition::
execute(
KernelArgumentHolder args,
std::optional<int8_t> selected_device,
bool override_user_schedule,
bool capture_debug_output,
bool profile,
std::vector<std::string> _enable_options,
std::vector<std::string> _disable_options) const {
debug_output_ = std::nullopt;
std::stringstream debug_ss;
DebugStreamGuard dsg(capture_debug_output ? debug_ss : std::cout);
args.setDeviceIndex(selected_device);
NVF_CHECK(id().has_value(), "Valid fusion schedule is not available!");
auto scheds = fusionCache()->queryFusionSchedules(id().value());
if (profile) {
ProfilerOptionsGuard::getCurOptions().set(ProfilerOption::Enable);
}
EnableOptionsGuard enable_opt_guard;
for (constauto& _enable_option : _enable_options) {
std::optional<EnableOption> opt = stringToEnableOption(_enable_option);
NVF_CHECK(opt.has_value(), "Unrecognized enable_option: ", _enable_option);
EnableOptionsGuard::getCurOptions().set(opt.value());
}
DisableOptionsGuard disable_opt_guard;
for (constauto& _disable_option : _disable_options) {
std::optional<DisableOption> opt = stringToDisableOption(_disable_option);
NVF_CHECK(
opt.has_value(), "Unrecognized disable_option: ", _disable_option);
DisableOptionsGuard::getCurOptions().set(opt.value());
}
auto find_user_schedule = [&]() -> const UserSchedule* {
if (override_user_schedule) {
returnnullptr;
}
auto user_sched_id = fusionCache()->queryUserScheduleId(scheds, args);
if (!user_sched_id.has_value()) {
returnnullptr;
}
NVF_CHECK(
args.empty() || args.getDeviceIndex() > -1,
"Inputs are not all on the same device or don't match selection!");
const UserSchedule& user_sched = fusionCache()->queryUserSchedule(
scheds, user_sched_id.value(), args.getDeviceIndex());
return &user_sched;
};
constauto* user_sched = find_user_schedule();
KernelArgumentHolder outputs;
if (user_sched == nullptr) {
scheds->createExecutorIfNotExists();
outputs = scheds->auto_gen_schedules->runFusionWithInputs(
args, std::nullopt, args.getDeviceIndex());
} else {
if (isProfilerEnabledWithCupti()) {
FusionProfiler::start();
FusionProfiler::createSegments(1);
}
scheds->last_user_def_scheduled_ir = user_sched->scheduled_fusion.get();
scheds->last_user_def_executor = user_sched->executor.get();
if (user_sched->heuristic_params == nullptr) {
// Manual scheduleif (!user_sched->executor->isCompiled()) {
user_sched->executor->compile(user_sched->scheduled_fusion.get(), args);
}
outputs = user_sched->executor->run(args);
} else {
// Automatic scheduler was used for UserSchedule.// Pass launch and compile params to compileFusion and runFusion.if (!user_sched->executor->isCompiled()) {
user_sched->executor->compile(
user_sched->scheduled_fusion.get(),
args,
user_sched->heuristic_params->lparams,
user_sched->heuristic_params->cparams,
user_sched->heuristic_params->scheduler_type);
}
outputs = user_sched->executor->run(
args,
{},
user_sched->heuristic_params->lparams,
user_sched->heuristic_params->cparams);
}
if (isProfilerEnabledWithCupti()) {
FusionProfiler::segment(0).scheduler("user");
FusionProfiler::stop();
if (isProfilerPrintingEnabled()) {
debug() << FusionProfiler::profile();
}
}
}
if (profile) {
ProfilerOptionsGuard::getCurOptions().unset(ProfilerOption::Enable);
}
The beginEvent and endEvent methods in Trace class are now empty, which might remove important tracing functionality. Ensure that this is intentional and that tracing is handled elsewhere if necessary.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.