Hello! I'm trying to make examples/36_gather_scatter_fusion work for fp64. A, B, C and D are all fp64 and row major, with a problem size of (m,n,k)=(4,4,4) and index_size=2.
I found examples/18_ampere_fp64_tensorop_affine2_gemm and used the same tile sizes as that example:
// This code section describes the tile size a thread block will compute
using ShapeMMAThreadBlock = cutlass::gemm::GemmShape<128, 128, 16>;
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<32, 64, 16>;
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<8, 8, 4>;
But a compilation error occurs:
/home/me/cutlass/include/cutlass/transform/threadblock/predicated_tile_access_iterator.h(353): error: static assertion failed with "Vectors implied by the thread map must be divisible by the access type."
detected during:
instantiation of class "cutlass::transform::threadblock::PredicatedTileAccessIterator<Shape_, Element_, cutlass::layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_, Gather> [with Shape_=cutlass::layout::PitchLinearShape<16, 128>, Element_=double, AdvanceRank=0, ThreadMap_=cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::layout::PitchLinearShape<16, 128>, 256, cutlass::layout::PitchLinearShape<16, 2>, 1>, AccessType_=cutlass::Array<double, 2, true>, Gather=true]"
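My reading of the template arguments in this error, as a rough sketch rather than a confirmed diagnosis: the thread map's last argument is 1 and AccessType_ is cutlass::Array<double, 2, true>, so the assertion appears to compare 1 element per thread-map vector against a 2-element access. The helper names below are hypothetical and not part of CUTLASS; I am assuming the check is a simple divisibility test, as the message suggests, and that a warp has 32 threads.

```cpp
// Hypothetical helper (not a CUTLASS API): mirrors the divisibility condition
// named in the static assertion message, as I understand it.
constexpr bool access_is_compatible(int thread_map_elements_per_access,
                                    int access_type_elements) {
  return thread_map_elements_per_access % access_type_elements == 0;
}

// Hypothetical helper: threads per block implied by a threadblock tile and a
// warp tile, assuming 32 threads per warp.
constexpr int threads_per_block(int tb_m, int tb_n, int tb_k,
                                int warp_m, int warp_n, int warp_k) {
  return (tb_m / warp_m) * (tb_n / warp_n) * (tb_k / warp_k) * 32;
}

// Values read off the error: thread map vector of 1, access type of 2 doubles.
static_assert(!access_is_compatible(1, 2),
              "1 is not divisible by 2, which matches the failing assertion");

// An access type of a single double would satisfy the same condition.
static_assert(access_is_compatible(1, 1),
              "a 1-element access passes the divisibility check");

// <128,128,16> threadblock with <32,64,16> warps gives 8 warps, i.e. the
// 256 threads that appear in the thread map in the error message.
static_assert(threads_per_block(128, 128, 16, 32, 64, 16) == 256,
              "thread count matches the error message");
```

If this reading is right, the mismatch comes from the access width (alignment) used for the operand rather than from the tile shape itself, which would explain why changing only the tile sizes gives a similar error.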
According to #566, I guessed that I should set the tile size to half of the tile size used for fp32, so I set the tile sizes as:
// This code section describes the tile size a thread block will compute
using ShapeMMAThreadBlock = cutlass::gemm::GemmShape<128, 128, 8>;
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<64, 64, 8>;
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<16, 8, 4>;
But a similar compilation error occurs. :(
What tile sizes work for fp64? Are there any rules I can follow to choose a correct tile size? By the way, is it possible to change the tile size to make the kernel faster?