@mandroid6 mandroid6 commented Feb 4, 2025

Currently, the CUTLASS profiler report lists every argument passed to the benchmark, but it leaves the per-kernel values for cluster_m, cluster_n, and cluster_k empty.

This change updates the profiler report generation to include these arguments.

Before:

As shown below, the cluster_m, cluster_n, and cluster_k values are missing from the kernel result.

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,,,,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633

After:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,1,2,1,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633
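To make the difference between the two reports easy to check mechanically, here is a minimal Python sketch that reads a profiler CSV report and flags kernels whose cluster columns are empty. The helper names (`cluster_shapes`, `rows_missing_cluster`) are hypothetical and not part of CUTLASS, and the sample strings are trimmed excerpts of the rows above that keep only a few columns.

```python
import csv
import io

CLUSTER_COLUMNS = ("cluster_m", "cluster_n", "cluster_k")


def cluster_shapes(report_text):
    """Map each Operation name to its (cluster_m, cluster_n, cluster_k) strings."""
    reader = csv.DictReader(io.StringIO(report_text))
    return {
        # `or ""` guards against short rows, where DictReader yields None.
        row["Operation"]: tuple((row.get(col) or "").strip() for col in CLUSTER_COLUMNS)
        for row in reader
    }


def rows_missing_cluster(report_text):
    """Return the Operation names whose cluster columns are not all filled in."""
    return [
        operation
        for operation, shape in cluster_shapes(report_text).items()
        if any(value == "" for value in shape)
    ]


# Trimmed excerpts of the Before/After reports (assumption: only these columns kept).
HEADER = "Operation,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages\n"
KERNEL = (
    "cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_"
    "128x128x64_1x2x1_0_tnn_align8"
)
BEFORE = HEADER + KERNEL + ",128,128,64,,,,7\n"
AFTER = HEADER + KERNEL + ",128,128,64,1,2,1,7\n"

if __name__ == "__main__":
    print(rows_missing_cluster(BEFORE))   # the kernel is flagged
    print(rows_missing_cluster(AFTER))    # nothing is flagged
    print(cluster_shapes(AFTER)[KERNEL])  # ('1', '2', '1')
```

The After values 1,2,1 also appear to agree with the `1x2x1` segment of the kernel name, which is a quick sanity check that the reported shape is the compile-time one.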

Repro commands:

Build cutlass

git clone https://github.com/NVIDIA/cutlass
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=90a "-DCUTLASS_LIBRARY_KERNELS=cutlass3x_sm90_tensorop_s*16gemm_bf16_bf16_f32_bf16_bf16_*tnn*" -DCUTLASS_ENABLE_TESTS=OFF -GNinja -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL=9992 -DCUTLASS_LIBRARY_OPERATIONS=Gemm
cmake --build . --target cutlass_profiler

Run profiler

 ./tools/profiler/cutlass_profiler --operation=Gemm --output=data --dist=gaussian,mean:0.0,stddev:1.0,scale:-1 --m=4352 --n=4096 --k=4096 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column

@mandroid6
Author

@hwu36 @kerrmudgeon

@hwu36
Collaborator

hwu36 commented Feb 5, 2025

@itramble, could you please review first?

@mandroid6
Author

@itramble could you help take a look? (cc @hwu36)

@itramble

Hi @mandroid6, thanks for raising this. I think this was changed recently. As of today, I see:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,runtime_input_datatype_a,runtime_input_datatype_b,use_pdl,enable_sm90_mixed_dtype_shuffle_test,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,cluster_m_fallback,cluster_n_fallback,cluster_k_fallback,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,invalid,invalid,false,false,1,tensorop,f32,128,128,64,1,1,1,0,0,0,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.359317,271.783,406506

Unfortunately, this is not entirely correct either. We currently report the "cluster*" arguments that were passed to the profiler (or their defaults, see here). We do this because Blackwell adds a new feature, runtime cluster shapes (described here), alongside the static compile-time cluster shapes that were already supported for Hopper. A runtime cluster shape is indicated when one of operation_desc.tile_description.cluster_shape.m/n/k() is 0. When none of the cluster_shape values is 0 (true for Hopper CUTLASS kernels), your change is correct.
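The rule described above can be sketched in a few lines of Python. This is only an interpretation of the comment, not CUTLASS code: the function names are hypothetical, "one of ... is 0" is read as "at least one extent is 0", and the choice to fall back to the profiler arguments for runtime cluster shapes is an assumption drawn from the current behavior described above.

```python
def uses_runtime_cluster_shape(static_shape):
    """True when at least one compile-time cluster extent is 0 (assumed reading)."""
    return any(extent == 0 for extent in static_shape)


def reported_cluster_shape(static_shape, profiler_args):
    """Pick the (m, n, k) cluster shape to print in the report.

    static_shape:  compile-time cluster shape from the kernel description.
    profiler_args: cluster_m/n/k passed to the profiler (or their defaults).
    """
    if uses_runtime_cluster_shape(static_shape):
        # Runtime cluster shape: the kernel description carries no fixed value,
        # so report what the profiler was asked to launch with.
        return tuple(profiler_args)
    # Static cluster shape (the Hopper case): report the kernel's own value.
    return tuple(static_shape)


if __name__ == "__main__":
    # Kernel compiled with a 1x2x1 cluster, profiler arguments left at 1,1,1.
    print(reported_cluster_shape((1, 2, 1), (1, 1, 1)))  # (1, 2, 1)
```

With this rule, the Hopper kernel in the report would show 1,2,1 instead of the 1,1,1 profiler arguments, while a kernel whose description contains a 0 extent would keep showing the requested values.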

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.
