Hi! I'm trying to change the tile size in examples/36_gather_scatter_fusion to find the best performance. A, B, C and D in my code are all fp32. Following #612, I read generator.py and found GenerateSM80_TensorOp_1688(manifest, cuda_version). I used a tile size from its TileDescription list, but I get a compilation error like this:
/home/me/cutlass/include/cutlass/transform/pitch_linear_thread_map.h(295): error: static assertion failed with "Number of iterations must be non-zero"
The tile settings in my source code:
// This code section describes the tile size a thread block will compute
using ShapeMMAThreadBlock =
cutlass::gemm::GemmShape<256, 128, 16>;
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<4, 2, 1>;
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<16, 8, 8>;
...
// Number of pipelines you want to use
constexpr int NumStages = 3;
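One possible explanation, offered as an assumption rather than a confirmed diagnosis: the third argument of TileDescription in generator.py appears to be a warp count per threadblock dimension (for example [4, 2, 1]), not a warp tile shape. If that is right, the warp tile is the threadblock tile divided elementwise by that count. The sketch below only shows the arithmetic; Shape3, warp_tile and divisible are hypothetical helpers, not CUTLASS types or functions.

```cpp
// Hypothetical helper, not part of CUTLASS: a plain M/N/K triple.
struct Shape3 {
  int m, n, k;
};

// Assumption: warp tile = threadblock tile / warp count, elementwise.
constexpr Shape3 warp_tile(Shape3 threadblock, Shape3 warp_count) {
  return Shape3{threadblock.m / warp_count.m,
                threadblock.n / warp_count.n,
                threadblock.k / warp_count.k};
}

// True when every warp tile dimension is a whole multiple of the
// corresponding MMA instruction dimension.
constexpr bool divisible(Shape3 warp, Shape3 mma) {
  return warp.m % mma.m == 0 && warp.n % mma.n == 0 && warp.k % mma.k == 0;
}

// Threadblock tile 256x128x16 with a warp count of 4x2x1.
constexpr Shape3 kWarp = warp_tile({256, 128, 16}, {4, 2, 1});

static_assert(kWarp.m == 64 && kWarp.n == 64 && kWarp.k == 16,
              "expected a 64x64x16 warp tile");
static_assert(divisible(kWarp, {16, 8, 8}),
              "warp tile should be a multiple of the 16x8x8 MMA shape");
static_assert(!divisible({4, 2, 1}, {16, 8, 8}),
              "4x2x1 used directly as a warp tile is smaller than the MMA shape");
```

Under that assumption, ShapeMMAWarp for this tile would be cutlass::gemm::GemmShape<64, 64, 16> rather than GemmShape<4, 2, 1>. I have not verified that this alone removes the static assertion.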