tenpercent (Contributor) commented Jan 16, 2026

Summary

  • Use compiler builtin __make_integer_seq for sequence_gen and uniform_sequence_gen
  • Reduces template instantiation depth from O(N) to O(1)

Motivation

The recursive sequence_gen_impl creates deep template chains for large sequences, increasing compile time and memory usage.
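
To illustrate where the depth comes from, here is a minimal sketch of a naive, linearly recursive generator. It is not the library's actual `sequence_gen_impl` (the commit message describes that one as divide-and-conquer); `recursive_gen`, `times_two`, and the `index_t` alias are hypothetical stand-ins, and the functor is assumed to be callable with a plain `index_t` in a constant expression.

```cpp
#include <cstdint>
#include <type_traits>

using index_t = std::int32_t; // assumption: stand-in for the library's index type

template <index_t... Is>
struct sequence
{
};

// Hypothetical functor for the example: maps index i to 2 * i.
struct times_two
{
    constexpr index_t operator()(index_t i) const { return 2 * i; }
};

// Naive linear recursion: each step appends F(I) to the accumulated values
// and instantiates the next step, so the chain is N + 1 instantiations deep.
template <index_t I, index_t N, typename F, index_t... Acc>
struct recursive_gen_impl
{
    using type = typename recursive_gen_impl<I + 1, N, F, Acc..., F{}(I)>::type;
};

// Base case (I == N): all values have been accumulated.
template <index_t N, typename F, index_t... Acc>
struct recursive_gen_impl<N, N, F, Acc...>
{
    using type = sequence<Acc...>;
};

template <index_t N, typename F>
using recursive_gen = typename recursive_gen_impl<0, N, F>::type;

static_assert(std::is_same<recursive_gen<4, times_two>, sequence<0, 2, 4, 6>>::value,
              "recursive_gen<4, times_two> should be sequence<0, 2, 4, 6>");
```

With this shape, a sequence of length N needs one nested instantiation per element, which is the kind of chain the PR aims to avoid.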

Test Plan

  • Waiting for full CI

PR Stack

#  PR     Description
1  #3585  sequence_gen with __make_integer_seq
2  #3588  generate_identity_sequences helper
3  #3589  Named functors in transform_tensor_descriptor
4  #3590  container_concat optimization
5  #3596  O(1) pack expansion rewrites
6  #3600  TensorDescriptor/TensorAdaptor lambda elimination

shumway (Collaborator) commented Jan 16, 2026

Do you want to add unit tests for this, or just rely on the tests of all the code that depends on this? If it's easy to add unit tests, that's generally better, but I'm also fine with moving fast to cut down compilation times.

// generate sequence
template <index_t NSize, typename F>
struct sequence_gen
// Four sequences: direct concatenation
Collaborator

I like these specializations. It will be interesting to get a survey of the code to see how often the specializations are used and if these four smallest cases are the most impactful ones.

Contributor Author

I'm using the build traces to drive the optimizations. Removing unused code is another thing that might help with parsing times.

@tenpercent
Contributor Author

> Do you want to add unit tests for this, or just rely on the tests of all the code that depends on this? If it's easy to add unit tests, that's generally better, but I'm also fine with moving fast to cut down compilation times.

Let's move fast: if something important breaks, CI will catch it. Tests also need maintenance, so let's see how much of the metaprogramming is left after the initial sprint.

tenpercent and others added 2 commits January 16, 2026 21:45
Replace recursive template instantiation with compiler intrinsic
__make_integer_seq and pack expansion for O(1) instantiation depth.

Before: Maximum nesting depth of 90 levels with recursive divide-and-conquer
After: Maximum nesting depth of 26 levels using flat pack expansion

Performance improvements measured on example_grouped_conv_fwd_xdl_fp16:
- Template instantiation wall-clock time: 36.8s -> 18.7s (49% less time)
- Template instantiation cumulative time: 56.6s -> 25.8s (54% less time)
- Maximum nesting depth: 90 -> 26 (71% reduction)

The key changes:
- sequence_gen: Uses __make_integer_seq to generate indices 0..N-1,
  then applies functor F via pack expansion in a single step
- uniform_sequence_gen: Uses __make_integer_seq with pack expansion
  to generate N copies of a constant value

Co-Authored-By: Claude <noreply@anthropic.com>
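
The flat approach described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `index_pack`, `make_index_pack`, `second_of`, `times_two`, and the `index_t` alias are hypothetical names, the functor is assumed to be callable with a plain `index_t` in a constant expression, and a `std::make_integer_sequence` fallback is included only so the sketch also compiles where the `__make_integer_seq` builtin is unavailable.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

using index_t = std::int32_t; // assumption: stand-in for the library's index type

template <index_t... Is>
struct sequence
{
};

// Carrier with the <typename T, T...> shape that __make_integer_seq fills in.
template <typename T, T... Is>
struct index_pack
{
};

#ifdef __has_builtin
#if __has_builtin(__make_integer_seq)
#define SKETCH_HAS_MAKE_INTEGER_SEQ 1
#endif
#endif

#ifdef SKETCH_HAS_MAKE_INTEGER_SEQ
// Builtin path: yields index_pack<index_t, 0, ..., N-1> without recursion.
template <index_t N>
using make_index_pack = __make_integer_seq<index_pack, index_t, N>;
#else
// Fallback for compilers without the builtin (added for this sketch only).
template <typename S>
struct to_index_pack;
template <index_t... Is>
struct to_index_pack<std::integer_sequence<index_t, Is...>>
{
    using type = index_pack<index_t, Is...>;
};
template <index_t N>
using make_index_pack =
    typename to_index_pack<std::make_integer_sequence<index_t, N>>::type;
#endif

template <typename Pack, typename F>
struct sequence_gen_impl;

// Apply F to every index in a single pack expansion.
template <index_t... Is, typename F>
struct sequence_gen_impl<index_pack<index_t, Is...>, F>
{
    using type = sequence<F{}(Is)...>;
};

template <index_t NSize, typename F>
struct sequence_gen
{
    using type = typename sequence_gen_impl<make_index_pack<NSize>, F>::type;
};

// Ignores the index and returns the constant, so the pack expands to N copies.
constexpr index_t second_of(index_t, index_t value) { return value; }

template <typename Pack, index_t Value>
struct uniform_sequence_gen_impl;

template <index_t... Is, index_t Value>
struct uniform_sequence_gen_impl<index_pack<index_t, Is...>, Value>
{
    using type = sequence<second_of(Is, Value)...>;
};

template <index_t NSize, index_t Value>
struct uniform_sequence_gen
{
    using type = typename uniform_sequence_gen_impl<make_index_pack<NSize>, Value>::type;
};

// Hypothetical functor for the example: maps index i to 2 * i.
struct times_two
{
    constexpr index_t operator()(index_t i) const { return 2 * i; }
};

static_assert(std::is_same<sequence_gen<4, times_two>::type, sequence<0, 2, 4, 6>>::value,
              "sequence_gen applies the functor to 0..N-1");
static_assert(std::is_same<uniform_sequence_gen<3, 7>::type, sequence<7, 7, 7>>::value,
              "uniform_sequence_gen repeats the constant N times");
```

In this sketch the number of nested instantiations does not depend on `NSize`: the index pack is produced in one step and the functor is applied in one expansion.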
Replace linear recursive instantiation with direct pack expansion
for 1-4 sequences, and binary tree reduction for larger cases.

Before: O(N) depth for merging N sequences
After: O(log N) depth with O(1) for up to 4 sequences

This further reduces maximum nesting depth from 26 to 22 levels
when combined with the previous sequence_gen optimization.

Co-Authored-By: Claude <noreply@anthropic.com>
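
A rough sketch of the merge strategy described above: direct concatenation for one to four sequences, and a split into halves for anything larger. The names `merge_sequences` and `merge_range` are hypothetical, not the PR's identifiers, and `std::tuple_element` is used for slicing purely for brevity; its own instantiation cost depends on the standard library, so this only demonstrates the shape of the recursion.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <type_traits>
#include <utility>

using index_t = std::int32_t; // assumption: stand-in for the library's index type

template <index_t... Is>
struct sequence
{
};

template <typename... Seqs>
struct merge_sequences;

// One to four sequences: a single pack expansion, no recursion.
template <index_t... As>
struct merge_sequences<sequence<As...>>
{
    using type = sequence<As...>;
};
template <index_t... As, index_t... Bs>
struct merge_sequences<sequence<As...>, sequence<Bs...>>
{
    using type = sequence<As..., Bs...>;
};
template <index_t... As, index_t... Bs, index_t... Cs>
struct merge_sequences<sequence<As...>, sequence<Bs...>, sequence<Cs...>>
{
    using type = sequence<As..., Bs..., Cs...>;
};
template <index_t... As, index_t... Bs, index_t... Cs, index_t... Ds>
struct merge_sequences<sequence<As...>, sequence<Bs...>, sequence<Cs...>, sequence<Ds...>>
{
    using type = sequence<As..., Bs..., Cs..., Ds...>;
};

// Merge the elements at positions Offset + Is... of a tuple of sequences.
template <typename Tuple, std::size_t Offset, typename Indices>
struct merge_range;
template <typename Tuple, std::size_t Offset, std::size_t... Is>
struct merge_range<Tuple, Offset, std::index_sequence<Is...>>
{
    using type = typename merge_sequences<
        typename std::tuple_element<Offset + Is, Tuple>::type...>::type;
};

// Five or more: merge each half independently, then join the two results.
// Every level halves the problem, so the recursion is about log2(N) deep.
template <typename A, typename B, typename C, typename D, typename E, typename... Rest>
struct merge_sequences<A, B, C, D, E, Rest...>
{
    using all = std::tuple<A, B, C, D, E, Rest...>;
    static constexpr std::size_t total = 5 + sizeof...(Rest);
    static constexpr std::size_t half  = total / 2;

    using left  = typename merge_range<all, 0, std::make_index_sequence<half>>::type;
    using right = typename merge_range<all, half, std::make_index_sequence<total - half>>::type;
    using type  = typename merge_sequences<left, right>::type;
};

static_assert(std::is_same<merge_sequences<sequence<0>, sequence<1, 2>, sequence<>,
                                           sequence<3>, sequence<4, 5>>::type,
                           sequence<0, 1, 2, 3, 4, 5>>::value,
              "five sequences are merged in order");
```

The four fixed-arity specializations are what keep the common small cases at constant depth; the general case only kicks in from five sequences upward.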
@tenpercent tenpercent force-pushed the tenpercent/old-ck-pack-rewrites branch from 57c8cb1 to 3d46680 Compare January 17, 2026 03:51

3 participants