89 commits
8e294ba
First draft of task sorter
jacobhinkle Aug 13, 2025
7b279c2
Completed first draft of sort() algorithm
jacobhinkle Aug 13, 2025
6012846
Start building into FusionSegmenter
jacobhinkle Aug 13, 2025
da53605
Use optimalTopoSort
jacobhinkle Aug 13, 2025
daeb547
Fixes
jacobhinkle Aug 13, 2025
8076f0a
Add missing exhaustive check
jacobhinkle Aug 13, 2025
b29ad5e
Fix definition of graph. Now have error in validation
jacobhinkle Aug 13, 2025
6488c24
Fix error in validation. Working
jacobhinkle Aug 13, 2025
e7cf966
Add repro to test_repro.py
jacobhinkle Aug 13, 2025
589653f
Check that ordering is topological
jacobhinkle Aug 13, 2025
9e7c2ba
Erase uses from ready_tasks_ when backtracking
jacobhinkle Aug 13, 2025
647fdb5
Refactor conversion into SegmentedGroupTaskGraphConverter
jacobhinkle Aug 13, 2025
1d9b750
Avoid errors with aliasing
jacobhinkle Aug 13, 2025
29b684c
Add printout with timing of topo sorting
jacobhinkle Aug 13, 2025
05b42c5
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Aug 14, 2025
082d379
Add PTX ranges, remove debug prints
jacobhinkle Aug 14, 2025
5de4683
Add comments. Check aliasing condition
jacobhinkle Aug 14, 2025
1ec54f0
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Aug 15, 2025
3998406
Remove debug prints in test
jacobhinkle Aug 15, 2025
be8332a
Respect aliased input constraint
jacobhinkle Aug 15, 2025
91e942f
Fix lintrunner
jacobhinkle Aug 15, 2025
253a750
Use runtime_info to get correct sizes in FusionKernelRuntime
jacobhinkle Aug 15, 2025
a53650b
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Aug 15, 2025
cb77db5
Add test file
jacobhinkle Aug 16, 2025
ce7335d
Fix bug in initializing task_has_aliased_input_
jacobhinkle Aug 20, 2025
f515cd4
Pipe runtime_info to lowerSegmentedFusionToHostIr
jacobhinkle Aug 20, 2025
5892206
Place TODO in csrc/host_ir/lower.cpp
jacobhinkle Aug 20, 2025
4e9ba4d
Fix bugs in initialization of graph. validate inputs
jacobhinkle Aug 20, 2025
cc753c9
Add ImpossibleAlias test
jacobhinkle Aug 20, 2025
760b73b
Add cycle tests
jacobhinkle Aug 20, 2025
bcad2a3
Start improving tests
jacobhinkle Aug 20, 2025
51c8c6c
More improvements to tests
jacobhinkle Aug 20, 2025
2e85c6f
Minor
jacobhinkle Aug 20, 2025
ef65076
Fix backtracking bug
jacobhinkle Aug 20, 2025
0f4126c
Fix up DifferentSizes test
jacobhinkle Aug 20, 2025
620cf3a
Drafted new failing test
jacobhinkle Aug 20, 2025
290b652
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Aug 20, 2025
ed4585f
Add test from Kayaaslan 2018
jacobhinkle Aug 21, 2025
cdf8598
Introduce time limit for sorting
jacobhinkle Aug 21, 2025
d5951f3
lintrunner tests
jacobhinkle Aug 21, 2025
a67c6e2
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Aug 21, 2025
6480530
Change diagrams to proper multiline comments
jacobhinkle Aug 21, 2025
42fcc64
Add NVFUSER_DUMP=task_graph
jacobhinkle Aug 21, 2025
41aa154
Add mermaid printing when NVFUSER_DUMP=task_graph is given
jacobhinkle Aug 21, 2025
9493cf0
Add more tests. SharedIntermediateWithAlias is failing to sort
jacobhinkle Aug 21, 2025
020c333
Merge branch 'main' into jh/optimal_segment_order
jacobhinkle Aug 21, 2025
5aafe11
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Aug 22, 2025
b23274d
Convert graphs with aliases
jacobhinkle Aug 22, 2025
95dcf42
Remove code related to aliasing from TaskSorter
jacobhinkle Aug 22, 2025
3c5f091
Remove mistakenly added empty file
jacobhinkle Aug 22, 2025
a52875f
Validate during backtracking only in testing.
jacobhinkle Aug 22, 2025
589b997
Update stroke colors on mermaid plots
jacobhinkle Aug 22, 2025
5e9e899
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Sep 2, 2025
55bbd7a
Only update best_steps if hwm is improved
jacobhinkle Sep 3, 2025
532ddcc
Merge branch 'main' into jh/optimal_segment_order
jacobhinkle Sep 3, 2025
549124e
Fix typo
jacobhinkle Sep 3, 2025
595ce79
Skip looking up sizes for sharded inputs
jacobhinkle Sep 3, 2025
d95e0bb
Handle CPU scalars properly
jacobhinkle Sep 5, 2025
7ad6e97
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Sep 5, 2025
9dc649f
Finish DifferentSizes test
jacobhinkle Sep 5, 2025
957dc67
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Sep 15, 2025
ed2395e
Address some reviewer comments
jacobhinkle Sep 15, 2025
965436b
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Sep 17, 2025
3546fa3
Fix typos, use ExpressionEvaluator to compute numel
jacobhinkle Sep 17, 2025
239a80d
Add comment about assumptions in inferData in tests
jacobhinkle Sep 17, 2025
51ee594
Fill uses and definitions, and validate
jacobhinkle Sep 17, 2025
2de536d
Skip manually setting definition and uses in fusion_segmenter.cpp
jacobhinkle Sep 17, 2025
3a0f9e2
Remove early exit for unsegmented fusions
jacobhinkle Sep 17, 2025
c270fec
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Sep 17, 2025
f313e1e
Simplify alias check
jacobhinkle Sep 17, 2025
fb36a15
aliases_input optional<DataId> -> DataId
jacobhinkle Sep 17, 2025
5cd1e66
Remove unused include
jacobhinkle Sep 17, 2025
f1260a1
Print result of findOptimalOrder in debug dump
jacobhinkle Sep 17, 2025
337921a
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Sep 19, 2025
60aa579
Fix AliasTest.TrivialInputForwarding
jacobhinkle Sep 19, 2025
d08bd53
Handle cases where tv->dtype() is Index
jacobhinkle Sep 19, 2025
ab38575
Update csrc/graph/task_graph.cpp
jacobhinkle Sep 23, 2025
a444097
Update csrc/fusion_segmenter.cpp
jacobhinkle Sep 24, 2025
fa63660
Update csrc/graph/task_graph.cpp
jacobhinkle Sep 24, 2025
efbe8ba
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Sep 24, 2025
8e4f508
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Oct 1, 2025
c7cec9b
Use std::lower_bound
jacobhinkle Oct 1, 2025
2ae3a31
Update csrc/graph/task_graph.cpp
jacobhinkle Oct 1, 2025
d80f63b
Remove iteration limit
jacobhinkle Oct 1, 2025
64d1fa9
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Oct 21, 2025
064d939
Remove debug prints
jacobhinkle Oct 21, 2025
6c1a6f7
Merge remote-tracking branch 'origin/main' into jh/optimal_segment_order
jacobhinkle Oct 28, 2025
6011cac
Update csrc/graph/task_graph.cpp
jacobhinkle Oct 28, 2025
e92731e
Update csrc/graph/task_graph.cpp
jacobhinkle Oct 28, 2025
2 changes: 2 additions & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -261,6 +261,7 @@ list(APPEND NVFUSER_SRCS
${NVFUSER_SRCS_DIR}/fusion_guard.cpp
${NVFUSER_SRCS_DIR}/fusion_segmenter.cpp
${NVFUSER_SRCS_DIR}/global_allocator.cpp
${NVFUSER_SRCS_DIR}/graph/task_graph.cpp
${NVFUSER_SRCS_DIR}/grouped_reduction.cpp
${NVFUSER_SRCS_DIR}/host_ir/container.cpp
${NVFUSER_SRCS_DIR}/host_ir/evaluator.cpp
@@ -992,6 +993,7 @@ list(APPEND JIT_TEST_SRCS
${NVFUSER_ROOT}/tests/cpp/test_statement_guard.cpp
${NVFUSER_ROOT}/tests/cpp/test_stream.cpp
${NVFUSER_ROOT}/tests/cpp/test_swizzle.cpp
${NVFUSER_ROOT}/tests/cpp/test_task_graph.cpp
${NVFUSER_ROOT}/tests/cpp/test_tensor_factories.cpp
${NVFUSER_ROOT}/tests/cpp/test_tmem.cpp
${NVFUSER_ROOT}/tests/cpp/test_transpose.cpp
162 changes: 160 additions & 2 deletions csrc/fusion_segmenter.cpp
@@ -14,7 +14,9 @@
#include <debug.h>
#include <device_lower/utils.h>
#include <disjoint_set.h>
#include <exceptions.h>
#include <fusion.h>
#include <graph/task_graph.h>
#include <instrumentation.h>
#include <ir/all_nodes.h>
#include <ir/cloner.h>
@@ -28,6 +30,7 @@
#include <options.h>
#include <scheduler/debug_utils.h>
#include <scheduler/normalization_utils.h>
#include <scheduler/runtime_info.h>
#include <transform_iter.h>
#include <transform_replay.h>

@@ -2001,8 +2004,159 @@ bool SegmentCandidateFinder::hasSegmentHints(Fusion* fusion) {
}

namespace {

class SegmentedGroupTaskGraphConverter {
public:
static TaskGraph convert(
const std::vector<SegmentedGroup*>& groups,
SchedulerRuntimeInfo* runtime_info) {
SegmentedGroupTaskGraphConverter conv(runtime_info);
for (SegmentedGroup* group : groups) {
conv.processGroup(group);
}
return TaskGraph(conv.all_tasks_, conv.all_data_);
}

private:
SegmentedGroupTaskGraphConverter(SchedulerRuntimeInfo* runtime_info)
: runtime_info_(runtime_info) {}

void processGroup(SegmentedGroup* group) {
// When there are aliased inputs, they will appear as _outputs_ of the
// SegmentedGroup. To avoid actually adding those as outputs, we record them
// here first
std::unordered_set<TensorView*> aliased_input_tvs;
for (Val* v : group->outputs()) {
if (auto* aliased_input_tv = dynamic_cast<TensorView*>(
v->fusion()->getOutputAlias(v).aliased_io)) {
aliased_input_tvs.insert(aliased_input_tv);
}
}

std::vector<TaskGraph::DataId> inputs;
// These are fusion inputs, so they are not edges between segments
for (Val* v : group->inputs()) {
if (auto* tv = dynamic_cast<TensorView*>(v)) {
// Ignore scalar inputs
TaskGraph::DataId data_id = maybeRegisterTv(tv);
TaskGraph::Data& data = all_data_.at(data_id);
data.can_free = !tv->isFusionInput();
inputs.push_back(data_id);
}
}
std::vector<TaskGraph::DataId> outputs;
for (Val* v : group->outputs()) {
if (auto* tv = dynamic_cast<TensorView*>(v)) {
if (aliased_input_tvs.count(tv) || tv->isFusionInput()) {
// These are counted as outputs but are actually _inputs_ to this
// group
// Note that we skip setting alias links in the graph when the input
// is simply forwarded to the outputs unchanged.
// See AliasTest.TrivialInputForwarding for an example of this
continue;
}
TaskGraph::DataId data_id = maybeRegisterTv(tv);
TaskGraph::Data& data = all_data_.at((size_t)data_id);
if (auto* aliased_input_tv = dynamic_cast<TensorView*>(
tv->fusion()->getOutputAlias(tv).aliased_io)) {
data.aliases_input = maybeRegisterTv(aliased_input_tv);
}
data.can_free = !tv->isFusionOutput();
outputs.push_back(data_id);
}
}

// TODO: inspect compiled segment executors to determine temp gmem needed
TaskGraph::Size temp_space = 0;
Comment on lines +2069 to +2070 (Collaborator Author):

We currently prepare the runtime order before compilation of segments. However, if we did it afterward we would have access to the executor telling us how much temp space to use.


all_tasks_.emplace_back(inputs, outputs, temp_space);
}

int64_t getNumAllocatedElements(TensorView* tv) {
if (tv->isCpuScalar()) {
// Since CPU scalars do not result in any GPU allocation we count them as
// empty.
return 0;
}
int64_t numel = 1;
// Use ExpressionEvaluator for computed tensors assuming they are
// contiguous
for (IterDomain* id : tv->getMaybeAllocationDomain()) {
if (id->isBroadcast() || id->isReduction() || id->isDeviceDim()) {
continue;
}
PolymorphicValue pv = std::monostate{};
if (runtime_info_ != nullptr) {
pv = runtime_info_->expressionEvaluator().evaluate(id->extent());
}
// If we can't determine the size of this dimension, just assume
// it's 2. This way we will give precedence to tensors with
// allocation domains that have more concrete IDs.
int64_t dim_size = pv.is<int64_t>() ? pv.as<int64_t>() : 2;
numel *= dim_size;
}
return numel;
}

TaskGraph::DataId maybeRegisterTv(TensorView* tv) {
auto it = tv2dataid_.find(tv);
if (it != tv2dataid_.end()) {
// tv is already registered
return it->second;
}

// Register this TV
auto new_id = static_cast<TaskGraph::DataId>(std::ssize(all_data_));
tv2dataid_[tv] = new_id;

// If the TV is of type Index, we don't know if it will be 8 bytes or 4
// bytes until we are given input
DataType dtype = tv->dtype();
if (dtype == DataType::Index) {
// If we don't have runtime info, assume it is 64-bit
dtype = runtime_info_ != nullptr ? runtime_info_->getIndexType()
: DataType::Int;
}
TaskGraph::Size size =
getNumAllocatedElements(tv) * dataTypeSizeByte(dtype);

all_data_.emplace_back(
/*definition=*/std::nullopt,
/*uses=*/std::vector<TaskGraph::TaskId>{},
/*aliases_input=*/-1,
size,
/*can_free=*/true);
return new_id;
}

private:
SchedulerRuntimeInfo* runtime_info_;
std::vector<TaskGraph::Data> all_data_;
std::unordered_map<TensorView*, TaskGraph::DataId> tv2dataid_;
Comment (Collaborator):

Did you find this indirection useful at all? tv2dataid_ could map TensorView* to Data directly.

I once created such an indirection thinking it would be faster. It turned out to speed things up very little and made the implementation unnecessarily complicated.

Reply (Collaborator Author):

I am not sure whether it's faster in practice, but I can try it and check. I originally tried to keep things small and simple for perf reasons because I knew we'd need to do some expensive iterative optimizations. There are cases in our tests where exhaustive search runs out of time even if I increase the time limit to 100 seconds. Of course, it's unclear whether we get any benefit from more iterations even in those cases, and an algorithmic improvement would probably be preferable.

I agree that the indirection is usually slightly annoying: I need to use getData and getTask to get the actual task. However, I can copy a TaskGraph without worrying about anything breaking, and we can build graphs easily by directly creating Data and Tasks and copying them in as arguments. Moving to pointers means the graph must own the objects and hand out new ones via newData and newTask, and then we either make TaskGraph NonCopyable or implement a clone method. These are all doable, of course. It felt like a toss-up when I started implementing, but let me have a try at the pointer method.

std::vector<TaskGraph::Task> all_tasks_;
};

std::vector<SegmentedGroup*> optimalTopoSort(
const std::vector<SegmentedGroup*>& groups,
SchedulerRuntimeInfo* runtime_info) {
FUSER_PERF_SCOPE("optimalTopoSort");

TaskGraph graph =
SegmentedGroupTaskGraphConverter::convert(groups, runtime_info);

TaskGraph::SortResult result = graph.findOptimalOrder(/*validate=*/false);

std::vector<SegmentedGroup*> order;
order.reserve(groups.size());
for (const TaskGraph::Step& step : result.steps) {
order.push_back(groups.at((size_t)step.task));
}
return order;
}

std::vector<SegmentedGroup*> toposort(
const std::vector<SegmentedGroup*>& groups) {
FUSER_PERF_SCOPE("toposort");
std::deque<SegmentedGroup*> to_visit;
std::unordered_map<SegmentedGroup*, int64_t> num_producer_edges;
for (SegmentedGroup* group : groups) {
@@ -5383,7 +5537,10 @@ void SegmentedFusion::annotateFP16IntermediateTensors() {
}
}

RuntimeWorkSpace prepareRuntimeOrder(const SegmentedFusion& segmented_fusion) {
RuntimeWorkSpace prepareRuntimeOrder(
const SegmentedFusion& segmented_fusion,
SchedulerRuntimeInfo* runtime_info) {
FUSER_PERF_SCOPE("prepareRuntimeOrder");
RuntimeWorkSpace runtime_workspace;

// setup the order tensor dimensions are bound
@@ -5398,7 +5555,8 @@ RuntimeWorkSpace prepareRuntimeOrder(const SegmentedFusion& segmented_fusion) {
}
}

runtime_workspace.group_run_order = toposort(segmented_fusion.groups());
runtime_workspace.group_run_order =
optimalTopoSort(segmented_fusion.groups(), runtime_info);

return runtime_workspace;
}
4 changes: 3 additions & 1 deletion csrc/fusion_segmenter.h
@@ -477,7 +477,9 @@ struct RuntimeWorkSpace {

Perform a topological sort of the different groups composing the Segmented
// Fusion
RuntimeWorkSpace prepareRuntimeOrder(const SegmentedFusion& segmented_fusion);
RuntimeWorkSpace prepareRuntimeOrder(
const SegmentedFusion& segmented_fusion,
SchedulerRuntimeInfo* runtime_info = nullptr);

//! This is a base class for segmenter analysis
//! provides the minimal implementation on header so that