You can still partition the layout into four '4x4' blocks (128 bits per element), but that seems unnecessary when crosswise is 64, and it only adds compute. Couldn't the layout be treated as a single '8x8' block instead?
For example, take these lines:
if (Policy::LdsmShape::kContiguous == 4) {
  // Matrix multiply 1688 A/B
  // Q0 Q1 Q2 Q3 (Q stands for 1 8x128bit block).
  // Four blocks are next to each other in the contiguous dimension.
  partition_contiguous_idx = ((lane_in_quad_pair >> 2) ^ i);
  access_contiguous_idx = (quad_pair ^ lane_in_quad);
  access_strided_idx = lane_in_quad_pair;
}
Why isn't it just written as the following?
if (Policy::LdsmShape::kContiguous == 4) {
  // Matrix multiply 1688 A/B
  // Q0 Q1 Q2 Q3 (Q stands for 1 8x128bit block).
  // Four blocks are next to each other in the contiguous dimension.
  partition_contiguous_idx = 0;  // not needed, 8x8 block, no partition
  access_contiguous_idx = (quad_pair + (i << 2)) ^ lane_in_quad_pair;
  access_strided_idx = lane_in_quad_pair;
}
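To sanity-check that the two forms address the same location, here is a minimal host-side sketch. It rests on several assumptions that are not shown in the snippets above: that `quad_pair = lane_id >> 3`, `lane_in_quad = lane_id & 3`, and `lane_in_quad_pair = lane_id & 7`; that `i` only takes the values 0 and 1; and that the final contiguous offset is `partition_contiguous_idx * 4 + access_contiguous_idx`. The names `kPartitionContiguous`, `original_contiguous`, `flattened_contiguous`, `count_mismatches`, and `check_equivalence` are hypothetical helpers for illustration only.

```cpp
#include <cassert>

// Assumed partition width in 128-bit vectors (hypothetical constant).
constexpr int kPartitionContiguous = 4;

// Original form: partition index and in-partition index, combined linearly.
// The lane decomposition below is an assumption about the surrounding code.
int original_contiguous(int lane_id, int i) {
  int quad_pair = lane_id >> 3;         // 0..3
  int lane_in_quad = lane_id & 3;       // 0..3
  int lane_in_quad_pair = lane_id & 7;  // 0..7
  int partition_contiguous_idx = (lane_in_quad_pair >> 2) ^ i;
  int access_contiguous_idx = quad_pair ^ lane_in_quad;
  return partition_contiguous_idx * kPartitionContiguous + access_contiguous_idx;
}

// Proposed form: a single XOR over the whole 8-vector-wide row.
int flattened_contiguous(int lane_id, int i) {
  int quad_pair = lane_id >> 3;
  int lane_in_quad_pair = lane_id & 7;
  return (quad_pair + (i << 2)) ^ lane_in_quad_pair;
}

// Counts (lane_id, i) pairs where the two forms disagree.
int count_mismatches() {
  int mismatches = 0;
  for (int i = 0; i < 2; ++i) {
    for (int lane_id = 0; lane_id < 32; ++lane_id) {
      if (original_contiguous(lane_id, i) != flattened_contiguous(lane_id, i)) {
        ++mismatches;
      }
    }
  }
  return mismatches;
}

// Aborts if any lane maps to a different contiguous index.
void check_equivalence() { assert(count_mismatches() == 0); }
```

Calling `check_equivalence()` from a small host program passes under these assumptions: with `quad_pair` in 0..3 and `i` in 0..1, `quad_pair + (i << 2)` never carries, so the XOR splits cleanly into the high bit (the partition index) and the low two bits (the index within the partition). That would suggest the partitioned form is a different way of writing the same mapping rather than extra addressing work, at least for this case.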
Is there any case where this partitioning has to happen?