I think I mentioned this somewhere, but it looks like I never created an actual issue.
Since registers are such a scarce resource on GPUs and have a huge impact on occupancy, I want to propose a feature that helps with register usage.
I call it "register fusion", but there is probably a more established name for it.
Here is an example extracted from the actual code base:
```rust
fn find_proofs_impl(
    // ...
) {
    let table_6_proof_targets = /* ... */;
    // `0` for left, `1` for right
    let left_right = (subgroup_local_invocation_id % 2) as usize;
    let mut group_left_x_index = subgroup_local_invocation_id * 2;
    // Reading positions from table 6
    for table_6_chunk in 0..2 {
        let table_6_proof_targets = subgroup_shuffle(
            table_6_proof_targets,
            SUBGROUP_SIZE / 2 * table_6_chunk + subgroup_local_invocation_id / 2,
        );
        let table_6_proof_target = table_6_proof_targets.to_array()[left_right];
        let table_5_proof_targets = if table_6_proof_target == Position::SENTINEL {
            [Position::SENTINEL; 2]
        } else {
            table_6_positions[table_6_proof_target as usize]
        };
        // Reading positions from table 5
        for table_5_chunk in 0..2 {
            let table_5_proof_targets = subgroup_shuffle(
                table_5_proof_targets,
                SUBGROUP_SIZE / 2 * table_5_chunk + subgroup_local_invocation_id / 2,
            );
            let table_5_proof_target = table_5_proof_targets.to_array()[left_right];
            let table_4_proof_targets = if table_5_proof_target == Position::SENTINEL {
                [Position::SENTINEL; 2]
            } else {
                table_5_positions[table_5_proof_target as usize]
            };
            // Reading positions from table 4
            for table_4_chunk in 0..2 {
                let table_4_proof_targets = subgroup_shuffle(
                    table_4_proof_targets,
                    SUBGROUP_SIZE / 2 * table_4_chunk + subgroup_local_invocation_id / 2,
                );
                let table_4_proof_target = table_4_proof_targets.to_array()[left_right];
                let table_3_proof_targets = if table_4_proof_target == Position::SENTINEL {
                    [Position::SENTINEL; 2]
                } else {
                    table_4_positions[table_4_proof_target as usize]
                };
                // Reading positions from table 3
                for table_3_chunk in 0..2 {
                    let table_3_proof_targets = subgroup_shuffle(
                        table_3_proof_targets,
                        SUBGROUP_SIZE / 2 * table_3_chunk + subgroup_local_invocation_id / 2,
                    );
                    let table_3_proof_target = table_3_proof_targets.to_array()[left_right];
                    let table_2_proof_targets = if table_3_proof_target == Position::SENTINEL {
                        [Position::SENTINEL; 2]
                    } else {
                        table_3_positions[table_3_proof_target as usize]
                    };
                    // Reading positions from table 2
                    for table_2_chunk in 0..2 {
                        let table_2_proof_targets = subgroup_shuffle(
                            table_2_proof_targets,
                            SUBGROUP_SIZE / 2 * table_2_chunk + subgroup_local_invocation_id / 2,
                        );
                        let table_2_proof_target = table_2_proof_targets.to_array()[left_right];
                        let [x_left, x_right] = if table_2_proof_target == Position::SENTINEL {
                            [Position::SENTINEL; 2]
                        } else {
                            table_2_positions[table_2_proof_target as usize]
                        };
                        // Some logic
                    }
                }
            }
        }
    }
}
```
There are 5 nested loops here, each with just 2 iterations controlled by a `table_*_chunk` variable, and the `table_*_positions` arrays are in shared memory.
The high-level design is that the whole data set is loaded into subgroup registers exactly once at the start to avoid touching memory afterwards, and invocations then share local data with each other entirely in registers for processing. It adapts to the subgroup size to always load unique data at the maximum possible width in parallel, while doing zero redundant/duplicated memory accesses.
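The "every value is read exactly once" property of that shuffle index pattern can be checked on the CPU. This is a minimal simulation sketch (the `SUBGROUP_SIZE` value and names are illustrative, not taken from the shader):

```rust
// Simulate the index pattern `SUBGROUP_SIZE / 2 * chunk + lane / 2` combined
// with the even/odd `left_right` selector and count how often each
// register-resident value is read across the whole subgroup.
const SUBGROUP_SIZE: u32 = 8; // illustrative subgroup size

fn read_counts() -> Vec<u32> {
    // Each lane holds 2 values in registers; index = source_lane * 2 + element.
    let mut reads = vec![0u32; (SUBGROUP_SIZE * 2) as usize];
    for lane in 0..SUBGROUP_SIZE {
        // Even lanes take the left element, odd lanes the right one.
        let left_right = lane % 2;
        for chunk in 0..2 {
            // Same index expression as in the shader above.
            let source_lane = SUBGROUP_SIZE / 2 * chunk + lane / 2;
            reads[(source_lane * 2 + left_right) as usize] += 1;
        }
    }
    reads
}
```

Every entry of `read_counts()` comes out as exactly 1, i.e. all `SUBGROUP_SIZE * 2` stored values are consumed once with no duplicated accesses.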
The observation here is that with careful loop refactoring, each of the 5 nested loops needs exactly 1 bit to track its progress (do-while). But what the compiler faithfully generates right now is 5 separate registers, one used to track each loop separately.
This was a substantial occupancy constraint in my code base, and refactoring it to use a single register and extract bits from it on demand resulted in a massive performance improvement. Here is the PR and commit doing it; there are other commits of a similar nature in that PR which improve register usage too: nazar-pc/abundance@eb1a8a3
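The shape of that manual rewrite can be sketched as follows (illustrative code, not the actual PR contents): the five nested 2-iteration loops become five bits of a single `u32` counter, incremented like an odometer, instead of five separate loop variables.

```rust
// Five 1-bit loop counters packed into one register, replacing five nested
// `for _ in 0..2` loops.
fn packed_loop_iterations() -> u32 {
    let mut body_runs = 0u32;
    // All five counters live in the low bits of this single variable.
    let mut chunks = 0u32;
    loop {
        // Any level's counter is extracted on demand with a shift and a mask.
        let _table_6_chunk = (chunks >> 4) & 1;
        let _table_2_chunk = chunks & 1;
        body_runs += 1; // the innermost loop body would run here
        // A single increment advances all levels at once: the carry ripples
        // through the packed counters exactly like odometer digits.
        chunks += 1;
        if chunks == 1 << 5 {
            break;
        }
    }
    body_runs
}
```

The body runs `2^5 = 32` times, the same as the original five nested 2-iteration loops, but with one live counter instead of five.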
My feature request is to make it possible to write idiomatic Rust code and have the compiler fuse multiple registers that each have only a few occupied bits into a single register, plus a bit of code to extract the necessary values on demand.
I did this with multiple data structures to reduce memory usage and fit into shared memory, which is much harder to do automatically without range types and such. But for registers specifically, the compiler should have full visibility and a lot more opportunities for automatic transparent rewriting. The compute cost of packing and unpacking small values in registers is almost always worth it compared to reduced occupancy or spilling into any kind of memory.
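For concreteness, the pack/unpack code the compiler would have to emit is just shifts and masks. A sketch with two hypothetical small fields (a 1-bit left/right selector and a 3-bit chunk index, names invented for illustration):

```rust
// Two small values share one u32 instead of occupying two registers.
// Bit layout: bit 0 = left_right, bits 1..=3 = chunk.
fn pack(left_right: u32, chunk: u32) -> u32 {
    debug_assert!(left_right < (1 << 1) && chunk < (1 << 3));
    left_right | (chunk << 1)
}

fn unpack_left_right(packed: u32) -> u32 {
    packed & 0b1
}

fn unpack_chunk(packed: u32) -> u32 {
    (packed >> 1) & 0b111
}
```

Each extraction costs one shift and one mask, which is the cheap side of the trade-off against spilling or lower occupancy.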
As for ways to implement this, I think the fully transparent way, where the developer doesn't have to do anything at all, is ideal, but it may be too complex, at least at first.
Maybe some compiler pseudo-macro could be used to tag the variables that must be packed together, something like `pack_value!(var_name, num_bits, "tag_name")`. In the example above it would look something like this:
```rust
for pack_value!(table_6_chunk, 2, "table_chunk") in 0..2 {
```
I used `2` for the width since the loop uses a "while do" pattern rather than "do while". But it would already be a huge improvement that way.
Then all variables tagged with the same `tag_name` would be (if possible) packed into the same register.
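To illustrate what such a desugaring might produce (entirely hypothetical, shown with just two nesting levels and 2-bit-wide counters), both tagged counters occupy disjoint bit ranges of one shared variable:

```rust
// Hypothetical expansion of two nested loops tagged "table_chunk":
// outer counter in bits 0..=1, inner counter in bits 2..=3 of one register.
fn desugared_loops() -> u32 {
    let mut body_runs = 0u32;
    let mut table_chunk = 0u32; // shared register for the "table_chunk" tag
    // `for pack_value!(table_6_chunk, 2, "table_chunk") in 0..2` becomes:
    while (table_chunk & 0b11) < 2 {
        // `for pack_value!(table_5_chunk, 2, "table_chunk") in 0..2` becomes:
        table_chunk &= !(0b11 << 2); // reset the inner counter's bits
        while ((table_chunk >> 2) & 0b11) < 2 {
            body_runs += 1; // loop body would run here
            table_chunk += 1 << 2; // increment the inner counter in place
        }
        table_chunk += 1; // increment the outer counter in place
    }
    body_runs
}
```

The body still runs `2 * 2 = 4` times, but both loop counters live in a single register the whole time.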