Hi
In sm90_mma_tma_gmma_ss.hpp, with a cluster shape of <2, 2, 1> for example, it seems that every block in the cluster issues two tma.multicast copies per stage, one for input A and one for input B, and that its full barrier's tx_count is the sum of the data it needs in that stage.
My question is this: since each tma.multicast reduces the tx_count of every barrier in its mask, a block's full barrier would be reduced twice and would arrive when only half of the data is ready. It would make sense if only the blocks with cluster.x=0 or cluster.y=0 issued tma.multicast. Is there a condition like that somewhere that I missed? Thanks.
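To make the accounting concrete, here is a small host-side sketch of the byte counts I have in mind. It is only a model of my reasoning, not CUTLASS code: `MulticastPlan`, `expected_tx_bytes`, `delivered_bytes`, and the 16384-byte tile size are all hypothetical names and numbers chosen for illustration. It assumes each multicast group in the <2, 2, 1> cluster has 2 blocks, and that every multicast copy credits its bytes to the full barrier of every block in its mask.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the tx_count accounting.
// None of these names are CUTLASS or CUDA APIs.
struct MulticastPlan {
    int issuers;                   // blocks in the multicast group that issue a copy
    std::int64_t bytes_per_issue;  // bytes each copy delivers to every block in its mask
};

// Bytes one block's full barrier expects per stage: one A tile plus one B tile.
constexpr std::int64_t expected_tx_bytes(std::int64_t a_tile_bytes,
                                         std::int64_t b_tile_bytes) {
    return a_tile_bytes + b_tile_bytes;
}

// Bytes credited to one block's full barrier per stage, summed over
// every copy whose mask includes that block.
constexpr std::int64_t delivered_bytes(MulticastPlan a, MulticastPlan b) {
    return a.issuers * a.bytes_per_issue + b.issuers * b.bytes_per_issue;
}

inline void run_examples() {
    constexpr std::int64_t tile = 16384;  // illustrative tile size in bytes
    constexpr std::int64_t expected = expected_tx_bytes(tile, tile);

    // My reading: both blocks in each group multicast a full tile,
    // so the barrier is credited twice what it expects.
    assert(delivered_bytes({2, tile}, {2, tile}) == 2 * expected);

    // Only one block per group issues a full-tile multicast: counts balance.
    assert(delivered_bytes({1, tile}, {1, tile}) == expected);

    // Both blocks issue, but each copy carries half a tile: counts also balance.
    assert(delivered_bytes({2, tile / 2}, {2, tile / 2}) == expected);
}
```

The first case is the one that worries me. The other two are arrangements under which the counts would balance; I don't know which of them, if either, the kernel actually uses.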