Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585
base: develop
a477221 to 57c8cb1
Do you want to add unit tests for this, or just rely on the tests of all the code that depends on this? If it's easy to add unit tests, that's generally better, but I'm also fine with moving fast to cut down compilation times.
// generate sequence
template <index_t NSize, typename F>
struct sequence_gen
// Four sequences: direct concatenation
I like these specializations. It will be interesting to get a survey of the code to see how often the specializations are used and if these four smallest cases are the most impactful ones.
I'm using the build traces to drive the optimizations. Removing unused code may be another way to cut parsing times.
Let's move fast. If something important breaks, CI will catch it. Tests also need maintenance, so let's see how much of the metaprogramming is left after the initial sprint.
Replace recursive template instantiation with compiler intrinsic __make_integer_seq and pack expansion for O(1) instantiation depth.

Before: Maximum nesting depth of 90 levels with recursive divide-and-conquer
After: Maximum nesting depth of 26 levels using flat pack expansion

Performance improvements measured on example_grouped_conv_fwd_xdl_fp16:
- Template instantiation wall-clock time: 36.8s -> 18.7s (49% faster)
- Template instantiation cumulative time: 56.6s -> 25.8s (54% faster)
- Maximum nesting depth: 90 -> 26 (71% reduction)

The key changes:
- sequence_gen: Uses __make_integer_seq to generate indices 0..N-1, then applies functor F via pack expansion in a single step
- uniform_sequence_gen: Uses __make_integer_seq with pack expansion to generate N copies of a constant value

Co-Authored-By: Claude <noreply@anthropic.com>
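The flat generation described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: it uses std::make_integer_sequence as a portable stand-in for __make_integer_seq (standard libraries typically implement it with that builtin on Clang), and the names Sequence, detail::sequence_gen_flat, and Square, the int32 index_t alias, and the convention of calling the functor with a plain index are all hypothetical.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

using index_t = std::int32_t; // assumption: stand-in for the library's index type

// Assumption: a minimal compile-time integer list standing in for the library's sequence type.
template <index_t... Is>
struct Sequence
{
};

namespace detail {

// Hypothetical helper: receives the indices 0..N-1 as a pack.
template <typename F, typename Indices>
struct sequence_gen_flat;

template <typename F, index_t... Is>
struct sequence_gen_flat<F, std::integer_sequence<index_t, Is...>>
{
    // One pack expansion applies F to every index; no per-element recursion.
    using type = Sequence<F{}(Is)...>;
};

// Hypothetical helper: repeats a constant once per index.
template <index_t Value, typename Indices>
struct uniform_sequence_gen_flat;

template <index_t Value, index_t... Is>
struct uniform_sequence_gen_flat<Value, std::integer_sequence<index_t, Is...>>
{
    // Each index is discarded; only the number of indices matters.
    using type = Sequence<(static_cast<void>(Is), Value)...>;
};

} // namespace detail

template <index_t NSize, typename F>
struct sequence_gen
{
    using type =
        typename detail::sequence_gen_flat<F, std::make_integer_sequence<index_t, NSize>>::type;
};

template <index_t NSize, index_t I>
struct uniform_sequence_gen
{
    using type = typename detail::
        uniform_sequence_gen_flat<I, std::make_integer_sequence<index_t, NSize>>::type;
};

// Illustrative functor (hypothetical): maps index i to i * i at compile time.
struct Square
{
    constexpr index_t operator()(index_t i) const { return i * i; }
};
```

With these definitions, sequence_gen<4, Square>::type is Sequence<0, 1, 4, 9> and uniform_sequence_gen<3, 7>::type is Sequence<7, 7, 7>. The instantiation depth of the sketch itself does not grow with NSize, because the only per-element work is the pack expansion.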
Replace linear recursive instantiation with direct pack expansion for 1-4 sequences, and binary tree reduction for larger cases.

Before: O(N) depth for merging N sequences
After: O(log N) depth with O(1) for up to 4 sequences

This further reduces maximum nesting depth from 26 to 22 levels when combined with the previous sequence_gen optimization.

Co-Authored-By: Claude <noreply@anthropic.com>
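One way the merge strategy in this commit message could look is sketched below. It is a guess at the shape, not the PR's implementation: the names Sequence, sequence_merge, and detail::merge_slice are assumptions, and splitting the argument list in half through std::tuple_element is a technique chosen here for brevity. The O(log N) figure refers to the nesting of sequence_merge itself; how cheaply std::tuple_element is instantiated depends on the standard library.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <type_traits>
#include <utility>

using index_t = std::int32_t; // assumption: stand-in for the library's index type

// Assumption: a minimal compile-time integer list standing in for the library's sequence type.
template <index_t... Is>
struct Sequence
{
};

template <typename... Seqs>
struct sequence_merge;

// One to four sequences: direct concatenation in a single pack expansion.
template <index_t... As>
struct sequence_merge<Sequence<As...>>
{
    using type = Sequence<As...>;
};

template <index_t... As, index_t... Bs>
struct sequence_merge<Sequence<As...>, Sequence<Bs...>>
{
    using type = Sequence<As..., Bs...>;
};

template <index_t... As, index_t... Bs, index_t... Cs>
struct sequence_merge<Sequence<As...>, Sequence<Bs...>, Sequence<Cs...>>
{
    using type = Sequence<As..., Bs..., Cs...>;
};

template <index_t... As, index_t... Bs, index_t... Cs, index_t... Ds>
struct sequence_merge<Sequence<As...>, Sequence<Bs...>, Sequence<Cs...>, Sequence<Ds...>>
{
    using type = Sequence<As..., Bs..., Cs..., Ds...>;
};

namespace detail {

// Hypothetical helper: merges the elements of Tuple at positions Offset + Is.
template <typename Tuple, std::size_t Offset, typename Indices>
struct merge_slice;

template <typename Tuple, std::size_t Offset, std::size_t... Is>
struct merge_slice<Tuple, Offset, std::index_sequence<Is...>>
{
    using type =
        typename sequence_merge<typename std::tuple_element<Offset + Is, Tuple>::type...>::type;
};

} // namespace detail

// Five or more sequences: merge each half independently, then join the two results.
template <typename S0, typename S1, typename S2, typename S3, typename S4, typename... Rest>
struct sequence_merge<S0, S1, S2, S3, S4, Rest...>
{
    private:
    using all                          = std::tuple<S0, S1, S2, S3, S4, Rest...>;
    static constexpr std::size_t count = 5 + sizeof...(Rest);
    static constexpr std::size_t half  = count / 2;

    using left = typename detail::merge_slice<all, 0, std::make_index_sequence<half>>::type;
    using right =
        typename detail::merge_slice<all, half, std::make_index_sequence<count - half>>::type;

    public:
    using type = typename sequence_merge<left, right>::type;
};
```

For six inputs, the sketch merges the first three and the last three with the direct three-way specialization, then joins the two halves with the two-way one, so the number of nested sequence_merge levels grows with the logarithm of the input count rather than linearly.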
57c8cb1 to 3d46680
Summary
__make_integer_seq for sequence_gen and uniform_sequence_gen
Motivation
The recursive sequence_gen_impl creates deep template chains for large sequences, increasing compile time and memory usage.
Test Plan
PR Stack
__make_integer_seq