Hello! I'm using splitK in `kGemmSplitKParallel` mode with examples/36. Some shapes have a very large k (e.g. 5x1e4). I set `split_k_slices` to `max(min(256, k/128), 1)`. I find that the kernels with splitK produce incorrect output. I have some questions:
- Should I use splitK on shapes with very large k?
- Do I need to call `cudaDeviceSynchronize()` after the splitK kernel?
- How can I use streamK in examples/36?
Hardware: A100
CUDA version: 11.7
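
For reference, here is a minimal sketch of how the slice count in the formula above works out for a few values of k. The helper name `split_k_slices_for` is hypothetical (it is not part of examples/36), and the division is assumed to be integer division, as it would be with `int` operands in C++.

```cpp
#include <algorithm>

// Hypothetical helper (not from examples/36): number of split-K slices
// computed as max(min(256, k / 128), 1), using integer division.
constexpr int split_k_slices_for(int k) {
  return std::max(std::min(256, k / 128), 1);
}

// k = 5x1e4: 50000 / 128 = 390, which is capped at 256 slices.
static_assert(split_k_slices_for(50000) == 256, "large k is capped at 256");
// k = 4096: 4096 / 128 = 32 slices.
static_assert(split_k_slices_for(4096) == 32, "mid-size k");
// k = 64: 64 / 128 = 0, which is raised to the minimum of 1 slice.
static_assert(split_k_slices_for(64) == 1, "small k falls back to 1");
```

So for the k = 5x1e4 shapes the formula always hits the upper bound of 256 slices.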